April 3, 2018

Elegant Database Migrations on ECS

published by Adam Stepinski
in blog Instawork Engineering


Over the last few months, the Instawork engineering team has been moving our AWS infrastructure from manually deployed EC2 instances to an Elastic Container Service (ECS) cluster managed with CI/CD. The benefit is clear: as our engineering team and traffic grow, automated deployments and auto-scaling free us from worrying about DevOps and let us focus on product improvements.

The core functionality of ECS works really well:

  • We push a new Docker image to ECR.
  • We register a new task definition (which defines how to run the Docker container and how many resources it should consume), and update a service (which defines how many tasks to run).
  • When we update the service, ECS starts a rolling deployment, replacing the old Docker containers with ones running the new code.

Integration with our CI service (CircleCI) is simple through the ECS API.
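The first two steps map directly onto two ECS API calls. A minimal boto3 sketch (the family, cluster, and service names here are placeholders, not our actual configuration):

```python
def make_task_definition(family, image, cpu=256, memory=512):
    """Build kwargs for register_task_definition (values are illustrative)."""
    return {
        "family": family,
        "containerDefinitions": [
            {
                "name": family,
                "image": image,
                "cpu": cpu,
                "memory": memory,
                "essential": True,
            }
        ],
    }


def deploy(cluster, service, family, image):
    import boto3  # imported here so the helper above stays testable without AWS

    ecs = boto3.client("ecs")
    # Register a new revision pointing at the freshly pushed image...
    td = ecs.register_task_definition(**make_task_definition(family, image))
    arn = td["taskDefinition"]["taskDefinitionArn"]
    # ...then point the service at it, which kicks off a rolling deployment.
    ecs.update_service(cluster=cluster, service=service, taskDefinition=arn)
```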

ECS did require us to rethink some aspects of deployment. Specifically, how do we handle database migrations? Our Django backend includes migration scripts that need to run before the new code can correctly handle requests. On our old infrastructure, we simply ran the migration script before updating the code on the server, so our initial designs for ECS also involved sequencing the migration before deployment. We quickly realized that all of our designs had significant drawbacks:

Design 1: Run migration and update from CircleCI

We considered using CircleCI to run the migration script before updating the ECS service. This approach lets us be sure that the migration completes before the new code starts running. However, we felt uneasy about making CircleCI a requirement for deployment. We try to minimize the external dependencies for deployment to remove points of failure (CircleCI recently had downtime). Another downside was that CircleCI would need to have production credentials and access to our database, which isn’t ideal from a security perspective.

Design 2: Run an ECS task to migrate, then update from CircleCI

In this design, CircleCI would run a one-time ECS task to execute the migration script. CircleCI would then poll the ECS API to see if the task finished running. Once it stopped running, CircleCI could update the ECS service to finish deploying. This approach removed the need for CircleCI to talk to our production environment, since the migration would be running on ECS. However, we would still have the problem of tight coupling to CircleCI. Polling the ECS API to check whether a task is still running is pretty brittle, and we wouldn't have a good way of knowing whether the migration succeeded or failed.
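A sketch of the polling loop this design would need (here `get_status` is a stand-in for calling `describe_tasks` and reading the task's `lastStatus`, not a real API wrapper) makes the brittleness concrete: even a clean exit tells us nothing about whether the migration succeeded.

```python
import time


def wait_for_stopped(get_status, timeout=600, interval=5):
    """Poll until get_status() returns 'STOPPED', or give up.

    get_status stands in for an ecs.describe_tasks call reading
    tasks[0]['lastStatus'] -- a placeholder, not the real API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "STOPPED":
            return True  # stopped -- but we still don't know if it *succeeded*
        time.sleep(interval)
    return False  # timed out; the deployment is now in an unknown state
```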

Design 3: Run an ECS task to both migrate and update

The last design we considered was having CircleCI run a one-time ECS task to both execute the migration and then update the service. By having the task do the migration and update, we remove the need for polling the API and we only have a minimal dependency on CircleCI. If the migration fails, we don’t update the service so the procedure is safe. However, we don’t have a good way of finding out when the migration fails. This approach also requires our Docker image to have AWS credentials (to handle the deployment), which brings back the issue of breaking security boundaries.

Embracing the design of ECS

All of these approaches had shortcomings for a simple reason. Sequencing migration and deployment was fighting against the design of ECS, which doesn’t have primitives for pipelining tasks. So we switched to thinking of ways to handle migrations without relying on sequencing. Our breakthrough came from studying how ECS handles deployments:

  • A task running the new Docker image starts running.
  • The task gets added to the ELB target group and starts the health check process.
  • If the task registers as healthy, it starts receiving traffic. ECS stops an old task and continues the process until all tasks have been replaced.
  • If the task doesn't register as healthy within some time period, ECS stops it. ECS still tries to continue the deployment, so it goes back to the first step with a new task running the latest code.
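This behavior can be modeled with a toy simulation (purely illustrative; real ECS tracks far more state than a list of version strings):

```python
def rolling_deploy(tasks, new_version, is_healthy, max_attempts=10):
    """Toy model of an ECS rolling deployment.

    tasks: list of version strings currently serving traffic.
    is_healthy(version): stands in for the ELB health check.
    Returns the final task list; unchanged if new tasks never pass the check.
    """
    for i in range(len(tasks)):
        for _ in range(max_attempts):
            # Start a task on the new version and run its health check.
            if is_healthy(new_version):
                tasks[i] = new_version  # healthy: stop an old task, keep the new one
                break
            # Unhealthy: ECS stops the new task and retries with a fresh one.
        else:
            return tasks  # deployment stays suspended; old tasks keep serving
    return tasks
```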

The key insight was that we can halt the ECS rolling deployment by controlling when the server passes the health check. All we needed was to make the server respond as unhealthy if the code expects a newer version of the DB schema. Luckily, we can plug into Django’s migration code to make this happen via a custom health check view:

from django.db import DEFAULT_DB_ALIAS, connections
from django.db.migrations.executor import MigrationExecutor
from django.http import HttpResponse


def health_check(request):
    executor = MigrationExecutor(connections[DEFAULT_DB_ALIAS])
    # A non-empty plan means there are unapplied migrations for this code.
    plan = executor.migration_plan(executor.loader.graph.leaf_nodes())
    status = 503 if plan else 200
    return HttpResponse(status=status)

If the executor has a migration plan, that means the database is not up-to-date with the code on the server. As soon as the database is migrated, no plan gets returned, and we know the server can correctly handle requests.

With this new health check in place at the load balancer, our deployment looks like this:

  • CircleCI runs a one-time ECS task to execute the migration script.
  • CircleCI updates the ECS service.
  • That’s it!
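With the health check gating the rollout, the deploy script needs no polling at all. A hedged boto3 sketch (the container name, migrate command, and identifiers are placeholders for however your image is set up):

```python
def migrate_overrides(container_name):
    """Build the containerOverrides payload that swaps in the migrate command."""
    return {
        "containerOverrides": [
            {"name": container_name, "command": ["python", "manage.py", "migrate"]}
        ]
    }


def deploy(cluster, service, task_definition):
    import boto3  # local import keeps the payload helper testable without AWS

    ecs = boto3.client("ecs")
    # Fire off the one-time migration task; we don't wait for it.
    ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        overrides=migrate_overrides("web"),
    )
    # Update the service immediately. The health check suspends the rollout
    # until the migration task has brought the schema up to date.
    ecs.update_service(
        cluster=cluster, service=service, taskDefinition=task_definition
    )
```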

If the deployment doesn’t require a migration, the migration task will exit immediately. The service tasks with the new code will be healthy from the start, so they’ll start accepting traffic right away. The old tasks will be stopped and the rolling deployment will finish quickly.

If the deployment does require a migration, the migration task will take a few minutes to run. In the meantime, the deployment will be suspended since new service tasks will be coming up unhealthy. The old tasks will continue serving requests. As soon as the migration finishes, the tasks will become healthy and the rolling deployment will continue until all tasks have been replaced.

If the migration fails for some reason, the deployment will remain suspended. The old tasks will keep running, and ECS will repeatedly start and stop new tasks since they can never become healthy. We set up a CloudWatch alarm to detect when tasks are rapidly cycling and send an appropriate alert.
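There are several ways to wire up that alert; one plausible sketch (our illustration of a reasonable setup, not necessarily the exact alarm described above) is a CloudWatch Events rule that matches ECS task-stopped events for the cluster and fans them out to an SNS topic:

```python
import json


def task_stopped_pattern(cluster_arn):
    """Event pattern matching ECS 'task stopped' state-change events."""
    return {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {"clusterArn": [cluster_arn], "lastStatus": ["STOPPED"]},
    }


def create_rule(cluster_arn, sns_topic_arn):
    import boto3  # kept inside the function so the pattern builder has no AWS deps

    events = boto3.client("events")
    events.put_rule(
        Name="ecs-task-stopped",  # placeholder rule name
        EventPattern=json.dumps(task_stopped_pattern(cluster_arn)),
    )
    # Fan the matched events out to an SNS topic that pages us.
    events.put_targets(
        Rule="ecs-task-stopped",
        Targets=[{"Id": "sns", "Arn": sns_topic_arn}],
    )
```

Detecting that tasks are cycling *rapidly*, as opposed to a single stop, would additionally need some aggregation (for example, an alarm on the number of matched events per period).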

With this approach, we have a deployment process that isn’t tightly coupled to CircleCI, doesn’t require spreading credentials across services, and is robust against edge cases while giving us visibility when things go wrong. The lesson here is that the world of ECS (and Docker orchestration in general) requires a shift in mindset from the old world of manually managing instances. DevOps approaches that worked well for the latter will be awkward in a Docker world. By fully embracing the primitives and mechanics of ECS, we came up with a simpler, more robust solution that will serve us well.

Let us know if you’ve had similar challenges with DB migrations using ECS and how you solved them. Also, if you try out this approach let us know how it worked for you!

