Cloning RDS Instances for Testing
I have one or more staging/test environments at any given point in time. I want each of them to have its own database: a nightly clone of the production database.
Previously, my MySQL database was hosted at Rackspace and was backed up using Holland. Holland uses regular mysqldump under the hood. To sync my staging/test databases nightly, I would just run a cron job on each that would download the latest dump, drop its own database and restore from the dump.
In AWS, database backups and cloning are very well automated and tooled. But they are done at a binary/disk level. Generally, this is a good thing, but the cloning process is a tiny bit more involved. To clone your database in AWS, you’ll want to follow these steps:
- Install the AWS command line client
- Create a bash script that does the following…
  - Fetch the snapshot id to use for the clone
  - Delete the existing instance
  - Create the new instance
  - Modify the new instance
  - Wait for the instance to reach various states
Install the AWS command line client
The AWS CLI tools are well documented.
After you install, configure it with `aws configure`. Make sure you set your output format to `text`, because the example scripts I will provide assume that format and use tools like `awk` to parse the output.
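If you prefer to script the configuration, the output format can also be set non-interactively with the standard `aws configure set` subcommand:

```bash
# Set the default output format to text so the scripts below can
# parse the CLI's output with awk and friends.
aws configure set default.output text
```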
Fetch the Snapshot Id
We are going to provide the CLI with the instance name and grab the latest snapshot for it. My production instance is called `production`. Replace with the name of your instance.
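A sketch of that lookup, assuming text output format: it lists the snapshots as (create-time, id) pairs, sorts them chronologically, and keeps the newest id (the `SNAPSHOT_ID` variable name is my own):

```bash
# List snapshots of the production instance as (create-time, id) pairs,
# sort chronologically, and keep the id of the newest one.
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
    --db-instance-identifier production \
    --query 'DBSnapshots[*].[SnapshotCreateTime,DBSnapshotIdentifier]' \
    --output text | sort | tail -n 1 | awk '{print $2}')
echo "Cloning from snapshot: $SNAPSHOT_ID"
```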
Delete the existing instance
Your database endpoint is based on the instance name. By giving the new instance the same name as the previous one, your endpoint URI will not change, so you won’t have to reconfigure your consuming app(s). But RDS won’t allow you to have two instances with the same instance identifier, so we must first delete the existing instance.
Alternatively, you could rename the existing instance, but it won't help much: there will be extra steps and you'll still have downtime. The only way to avoid downtime is to use a new instance name, leave your old instance alone, and reconfigure your app to use the new instance once it is ready.
In my case, I don’t care about downtime. It is just a testing environment.
Again, my staging/test instance is called `staging`. Replace with a name of your choosing.
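The delete call itself is short; `--skip-final-snapshot` tells RDS not to take a parting snapshot of the instance we're about to clobber:

```bash
# Delete the existing staging instance so its identifier is freed up.
aws rds delete-db-instance \
    --db-instance-identifier staging \
    --skip-final-snapshot
```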
Create the new instance
Creating the new instance is straightforward at this point. Some notable settings (a sketch of the full call follows this list):
- No Multi-AZ, so we're not paying for availability we don't need
- No minor version upgrades. We're going to clobber it every night anyway
- Mine is not in the default VPC, so I must tell it what subnet group it should be in
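A sketch of the restore call with those settings. The `db.t2.micro` class matches the example in the storage table below, the subnet group name `my-staging-subnet-group` is a placeholder for yours, and the storage type is discussed in the next section:

```bash
# Restore the staging instance from the production snapshot fetched earlier.
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier staging \
    --db-snapshot-identifier "$SNAPSHOT_ID" \
    --db-instance-class db.t2.micro \
    --no-multi-az \
    --no-auto-minor-version-upgrade \
    --db-subnet-group-name my-staging-subnet-group \
    --storage-type gp2
```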
Choosing your storage settings
Production databases often use Provisioned IOPS. This is a bit expensive for a test environment, but converting between that and other types of storage is slow. This table should help you weigh your options. It assumes your production database is a 100 GB volume with 1000 provisioned IOPS and you’re restoring into a db.t2.micro instance.
Storage Type | Code | Monthly Cost | Migration Time |
---|---|---|---|
Provisioned IOPS | io1 | $77.50 | N/A |
General Purpose (SSD) | gp2 | $10 | 67 mins |
Magnetic | standard | $5 | 87 mins |
Modify the new instance
Some settings are inherited from the snapshot and can't be changed when you call the `restore-db-instance-from-db-snapshot` command. You must wait until the instance is created and available and then perform a modification operation. In my case, I needed to disable backups.
Always remember the `--apply-immediately` flag or else you'll be waiting until the next maintenance window for your changes to take effect.
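Disabling backups amounts to setting the retention period to zero; a sketch:

```bash
# Turn off automated backups on the clone and apply the change now
# rather than waiting for the next maintenance window.
aws rds modify-db-instance \
    --db-instance-identifier staging \
    --backup-retention-period 0 \
    --apply-immediately
```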
Wait for the instance to reach various states
When you delete the instance (above), you can’t create the new instance until the delete operation is complete or else the instance identifier will still clash. Here’s a function that will do that for you:
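A minimal sketch: `describe-db-instances` fails once the instance no longer exists, so we poll until it does (the 30-second interval is an arbitrary choice of mine):

```bash
# Block until the given RDS instance has been fully deleted.
wait-until-deleted() {
    local instance_id="$1"
    while aws rds describe-db-instances \
            --db-instance-identifier "$instance_id" > /dev/null 2>&1; do
        sleep 30
    done
}
```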
It accepts a single parameter, the instance identifier, and will block until that instance is fully deleted.
Example: `wait-until-deleted staging`
You also need to wait for the newly created instance to be in the available state before you can perform the modification operation on it. In my case, I also wanted to wait until the modification was complete before allowing my script to exit. Here’s a function that will do that for you:
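A sketch along the same lines, polling the DBInstanceStatus field on each pass:

```bash
# Block until the given RDS instance reports the desired status.
wait-for-status() {
    local instance_id="$1"
    local desired="$2"
    local status=""
    while [ "$status" != "$desired" ]; do
        sleep 30
        status=$(aws rds describe-db-instances \
            --db-instance-identifier "$instance_id" \
            --query 'DBInstances[0].DBInstanceStatus' \
            --output text)
    done
}
```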
It accepts the instance identifier and the desired status as parameters and blocks until that status is achieved.
Example: `wait-for-status staging available`
All Together
Here’s my final script that gets run nightly.
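A sketch of how the pieces above fit together (instance names, class, and subnet group are the example values used throughout this post):

```bash
#!/usr/bin/env bash
# Nightly job: clone the production RDS instance into staging.
set -e

# Block until the given RDS instance has been fully deleted.
wait-until-deleted() {
    local instance_id="$1"
    while aws rds describe-db-instances \
            --db-instance-identifier "$instance_id" > /dev/null 2>&1; do
        sleep 30
    done
}

# Block until the given RDS instance reports the desired status.
wait-for-status() {
    local instance_id="$1"
    local desired="$2"
    local status=""
    while [ "$status" != "$desired" ]; do
        sleep 30
        status=$(aws rds describe-db-instances \
            --db-instance-identifier "$instance_id" \
            --query 'DBInstances[0].DBInstanceStatus' \
            --output text)
    done
}

# 1. Fetch the id of the latest production snapshot.
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
    --db-instance-identifier production \
    --query 'DBSnapshots[*].[SnapshotCreateTime,DBSnapshotIdentifier]' \
    --output text | sort | tail -n 1 | awk '{print $2}')

# 2. Delete the existing staging instance and wait until it is gone.
aws rds delete-db-instance \
    --db-instance-identifier staging \
    --skip-final-snapshot
wait-until-deleted staging

# 3. Restore the new staging instance from the snapshot and wait
#    for it to become available.
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier staging \
    --db-snapshot-identifier "$SNAPSHOT_ID" \
    --db-instance-class db.t2.micro \
    --no-multi-az \
    --no-auto-minor-version-upgrade \
    --db-subnet-group-name my-staging-subnet-group \
    --storage-type gp2
wait-for-status staging available

# 4. Disable backups, then wait for the modification to finish.
aws rds modify-db-instance \
    --db-instance-identifier staging \
    --backup-retention-period 0 \
    --apply-immediately
wait-for-status staging available
```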
Pain Points
Because RDS backs up at the snapshot level, its backups are more consistent and reliable. The interface, both console and API, is very robust and easy to use. And, if you use a Multi-AZ production database, those snapshots are performed on your fallback instance in a different AZ, so the snapshots don’t cause any performance, latency, or locking issues on your DB.
But in the case of creating a clone from those snapshots, there are some downsides:
- Scrubbing your data is a pain. If you want to modify the dump, you must first restore a snapshot as described above, and then you can use `mysqldump` as you normally would, excluding what you need and scrubbing the plain-text output as needed.
- Most production RDS instances should use Provisioned IOPS. Converting from that to SSD or magnetic is slow, but if you don't convert, you have to pay a hefty fee. See the price table above.
Be thoughtful about going around snapshots by using `mysqldump` directly against your production instance. This will cause downtime. Snapshots of production Multi-AZ instances are taken from the hot standby copy, not the live instance, which means no downtime. If you want to use `mysqldump` without downtime, you'll need to use a read replica for this purpose, or clone using the process above and then run `mysqldump` against that clone.