That Time I Dropped The Production Database

I was reminded recently about how a GitLab engineer managed to delete the prod database, and that got me thinking about one of my biggest (production) mistakes.

It's been at least 5 years since this happened so I think it's safe to tell the story of when I dropped a production database of one of my clients.

At the time I was working on a (pretty large) Drupal website for an international organization. The site consisted of an extensive content area and allowed users to make anonymous purchases through the website, which were sent to a CRM every night. It was hosted on the Acquia platform and used BLT to perform the day-to-day development operations on the site.

BLT (if you aren't aware) is essentially a wrapper around a development environment, Drush (the Drupal command line tool) and a few other things. The idea is that you can type in a single command and perform a bunch of operations. This allows you to run unit tests, create deployment artifacts, and even grab the latest copy of the production environment.

The synchronization command (called blt sync) saved a lot of time in getting the latest copy of the production environment. It essentially dropped your local database, grabbed a copy of the production database and imported this into your local. There were also sanitisation steps that would remove all personally identifiable date from the database. 

It took time to copy the database down, but it meant you could go from nothing to a full install of Drupal using a single command. 

Note: I've not used BLT or the Acquia platform in a few years so this is a picture of the platform from some time ago. I won't mention the client by name but I need to mention Acquia and BLT because they are part of the backstory of the incident. BLT has had many updates in 5 years and I'm sure Acquia is a completely different platform these days.

The Problem

BLT has some configuration settings, and along with those it comes with a file called example.local.blt.yml. This file was renamed to local.blt.yml and tweaked to contain your site details. It allowed you to set up the Drush aliases for your project so that you could perform sync operations between your local and the remote platforms.

The local.blt.yml file allows for various overrides of the core BLT configurations, but this is what the Drush part of the file looked like:

#drush:
#  aliases:
#    local: local.mysite.dev

This file is not only created when you first setup the project but it was also deleted when running certain operations. For example, there was a "blt nuke" command that reset your workspace and this deleted the file; so there was more than one instance of this file needing editing.

By default, the file wasn't kept in git, and the .gitignore that BLT had deliberately ignored this file. This meant that the file occasionally had to re-created to connect your environments up.

What I had to do for this site was add in the following details to the local.blt.yml file. This pointed the "local" Drush alias at my local environment and the "remote" alias at the remote.

drush:
  aliases:
    local: local.mysite.dev
    remote: remote

As I could never remember these attributes I often had to copy and paste them out of our project documentation. Even though the file was transient, we actually only needed to edit it every couple of months (if that) so the chances of remembering what these setting are were remote.

The Incident

One morning I was working on some setup documentation and processes for the site. I reset this file as part of that process so that I could make sure the setup process worked for new starters. Whist setting things back up again I copied the values for this file incorrectly and ended up with this in my local.blt.yml file.

drush:
  aliases:
    local: mysite.prod
    remote: self

If you're familiar with BLT then you'll see the issue here. If you aren't, then let me explain a little.

What I was telling BLT here was that the "local" alias was the production site and that my "remote" alias was "self", which is another way of saying "local". I had essentially swapped the local and production alias so any action done to my local would contact the production site to perform the action (and visa-versa).

Then, as I had done a thousand times before, I ran the sync command to get the latest copy of the prod database.

blt sync

This command then proceeded to drop the "local" database and attempt to copy the database from the "remote" into that instance. Due to my misconfiguration it dropped the production database and started to copy my (empty) local database upstream to the production site.

The output of the sync command informed me that it was "dropping the database from the remote mysite.prod instance". When I saw this I immediately thought it looked strange so I smashed Ctrl+C to stop the process.

An icy chill rose up and my spine and checking the live site revealed the Drupal install screen, meaning that there was no production database at all.

I was too late in stopping it. I had just dropped the production database.

oops....

Fix It! Fix It! Fix It! Fix It! Fix It!

Once I realized what had happened I immediately notified the project manager, who got in touch with the client to let them know what was going on.

Actually, to tell the truth, I had one of those moments where you hear screaming and wonder where it's coming from only to realize you are the one doing the screaming. That got people's attention and started the ball rolling.

The good news is that we had backups. So I started the process of restoring the database and seeing how much data I had just wiped.

Acquia make a full database backup every night, which is then made available on the platform. We use these backups all the time to copy the database into the various non-prod environments so we knew that they worked.

Unfortunately, due to a small bug there was a slight discrepancy between the latest backup shown in the website interface and the actual latest backup present on the file system. This means that the website interface was actually 24 hours behind the latest backup of the site. I knew this fact when I went looking for the restore.

In my haste (and panic) to get the site back online, however, I went to the website interface and restored the latest available backup from the interface. This ended up restoring the backup that was a full day out of date.

The site was back online and working again after around 20-30 minutes of downtime.

I then started trying to recover the lost data from the transactions that morning from the site logs. This was when I realized that the data I had restored was a day old (not a few hours old) and that I had restored the wrong backup.

As the site was back online and accepting traffic I couldn't do another restore; it was too late. Instead I had to take the actual latest backup and restore that locally to extract the last day worth of content edits and user transactions.

With the data up to date on the site it was then a case of filling in the gaps using the log data from that morning (including email logs). This data was then handed over to the client so that they could add it to their CRM system, a process that would normally be run by the site automatically overnight.

After all the dust settled we had only lost a couple of transactions from the data, but the client was able to get in contact with those people and apologize as we had the email logs available.

The Aftermath

Once the site has been restored and the data recovered it was a case of explaining to the client what had happened and how we were going to prevent this from happening again. Unfortunately, I had to admit that it was a simple misconfiguration of values in a single file that caused the issue and that we weren't quite sure how to prevent it quite yet.

Thankfully, the client was really nice and understood that this was a mistake, but were still a little nervous about my response. I always take the approach of being as honest as possible with clients so I just had to admit it was my mistake and I would do some research into preventative measures.

Restoring the wrong database backup was another issue that I had to resolve. But even if I had restored the correct database I would still have had to go into the logs to find the missing data. In this case there was just a little more data to send to the client. I was able to use Drupal's built in revision tools to see where the content updates had been done in the last 24 hours and apply them myself.

The Acquia technical support team were very helpful (and sympathetic). I had been talking with them for several months about more than this website so they knew this wasn't a skill issue. In fact, they understood what had happened and offered to ask their BLT team internally to see if anything could be done.

We also posted a full breakdown of the problem in the BLT issue queue. I thought this was a pretty serious issue since anyone could accidentally configure their local environment and knock out potentially multiple production websites in the process.

Unfortunately, this resulted in a response that was basically this:

err, don't do that...

Thanks chaps, that's really useful feedback.

Preventing Future Catastrophes

Thankfully, I did find a solution.

Drush has the ability to create hooks that you can run before certain actions are triggered. This is a Drupal tool after all so having hooks is commonplace. As BLT is a wrapper around Drush I was able to create a hook that would intercept the Drush "sql:sync" command (the first step in the "blt sync" command) and look for anything that might indicate that "local" was pointing at something else.

Here is the Drush hook I came up with.

<?php

namespace Drush\Commands;

use Consolidation\AnnotatedCommand\CommandData;

/**
 * Drush policy class that hooks into Drush commands.
 *
 * @package Drush\Commands
 */
class PolicyCommands extends DrushCommands {

  /**
   * Prevent catastrophic sync command.
   *
   * @hook validate sql:sync
   *
   * @throws \Exception
   */
  public function sqlSyncValidate(CommandData $commandData) {
    $target = $commandData->input()->getArgument('target');

    if ($this->localTargetIsLocal($target) == FALSE) {
      throw new \Exception(dt('Your target is not set to "@self" or "@site.local". This may cause catastrophic problems involving overwriting the production database. Please be sure what you are doing. (!file)', ['!file' => __FILE__]));
    }
  }

  /**
   * Validate that this target is correct.
   *
   * @param string $target
   *   The target string.
   *
   * @return bool
   *   True if the target is local.
   */
  protected function localTargetIsLocal($target) {
    if ($target == '@self') {
      // Target is self, which we are assuming is correct.
      return TRUE;
    }

    if (preg_match('/\.local$/', $target) == 1) {
      // Target is a 'local' alias, which is correct.
      return TRUE;
    }

    return FALSE;
  }

}

This was tested this by disconnecting my computer from the internet and using XDebug to step debug through the Drush execution to made triple sure this would flag in the event of incorrect configuration. Deliberately putting my local into an incorrect state made me very nervous!

With that (very carefully) tested, I committed the class to the repo and made it part of the project.

As it happens, adding this small class was a really good idea as no less than a month later (a month!) one of the developers on the team approached me with a problem. They were trying to run a "blt sync" operation and they were getting a strange "catastrophic problems" error.

On closer inspection they had made the same mistake I had and wired up the connection details backwards, which had triggered protection class.

As you can imagine, this simple class very quickly became the standard across all of our projects.

After the second incident almost happened also added a different example.local.blt.yml file with some more sensible defaults and built in documentation (and warnings) so that it was not a trivial matter to wire up the systems incorrectly.

Conclusion

I fully admit that this was my fault, but the fact that I was able to put the site back online within half an hour helped the situation.

There was a little bit of revenue loss from the downtime and although there was a little bit of data loss it could have been much worse. The incident happened on a quiet mid-week morning so I was lucky that the site wasn't in the middle of an event or the disruption would have been more significant.

What scared me about the incident was how simple it was to perform. I've never before (or since) accidentally dropped the production database and it shocked me at how easy it was to get two values in the wrong place and take down a site. The fact that the BLT team weren't really interested in closing this loophole annoyed me a little at the time. I've not looked to see if the current version of BLT has the same issue or not but I would advise you to check.

The protection mechanisms I put in place turned out to be absolutely essential to our use of BLT and become a standard part of all of our BLT projects. The class I posted here had several more additions after it was first created to cover different commands and types of misconfiguration.

The fact that it took only a month for another developer in my team to make the same mistake was vindication that we had the right solution. Thankfully on this occasion, the Drush hook class saved the day.

During the course of your development career you will make mistakes, it's what you do next that matters. Try to stay calm. With web development your first priority is normally to get the site back online and working fully again, after that you can take your time to decompress and fill in any gaps in the data.

Make sure you communicate with people to let them know what is going on and what you are doing to fix it. This is a key step in your process as it reassures clients that a fix is coming, but don't hound developers for answers; let them do their thing.

After the dust settles it is essential to look at everything that happened to learn lessons from it. On this occasion it was a configuration option that caused the mistake, but it could be a process mistake, or a coding error that causes the problem.

The time you spend learning lessons here will mean that the same mistakes won't be repeated in the future. Even learning from small mistakes can help you stop making bigger ones.

More importantly, be sure to own the mistake. I'm sure the fact that I was open an honest with the client about the problem helped the situation. Going back to them a couple of days later with a solution to the problem (i.e. the Drush hook class) also showed that we understood what had happened and put safeguards in place to prevent it.

Finally, I think it's important to understand (I mean really understand) the tools you use on a daily basis; especially if they are open source. Be sure to dig into them and see what it's doing under the hood. I had taken a look at BLT internals before this, but this incident allowed me to understand it more fundamentally. I then properly understood what the configuration files were doing and how they interacted with Drush. I could also troubleshoot issues and see exactly where problems had come from.

Do you have any stories of disasters you are willing to share? Obviously, there would be no names or any other identifying evidence of what site this was connected to. I'd love to create more articles like this so please let me know.

Add new comment

The content of this field is kept private and will not be shown publicly.