Auto-Scale Everything

FullContact, March 23, 2016

If you’ve never received a message like this from your cloud provider, consider yourself lucky, because you will eventually:

Dear Amazon EC2 Customer,
One or more of your Amazon EC2 instances is scheduled for maintenance on 2016-01-01 for 2 hours starting at 05:00 UTC. During this time, the following instances in the us-east-1 region will be unavailable and then rebooted:

i-abcd1234

Your instance will return to normal operations after maintenance is complete.

This is essentially saying the node in question will be completely useless for two hours. Worse, AWS will sometimes notify us that a machine will be retired completely. Or, worst of all, a node might encounter an unscheduled issue. Our team maintains an infrastructure of over 500 AWS EC2 nodes, so events like this are routine.

For many teams, this means manual work to replace the node. Hopefully the node is provisioned with configuration management, so bringing up a replacement isn’t too difficult. But ideally, the node can be replaced automatically. This is the true beauty of auto-scaling groups (ASGs).

Our team aspires to have every node in our infrastructure in an ASG, from data stores like Elasticsearch and Cassandra to application nodes and internal services.

Self-provisioning nodes

To facilitate our micro-service infrastructure, we use a shared image for our entire application tier. Java applications are deployed into an ASG with this image. We use Ubuntu Upstart along with AWS’s UserData to fetch dependencies and start the application. The image has everything it needs to operate in production from boot without any manual interaction. This means that if a node is having issues, we can simply terminate it and allow the ASG to bring up its replacement. Similarly, if a node goes offline on its own, the ASG will take care of it. A node going offline at 3:00 AM does not page an engineer, because no human intervention is needed.

This extends to our Elasticsearch and Cassandra clusters as well. Because of our replication settings, losing one node does not cause any loss of data. Instead, a new node quickly comes up to replace it, and the cluster redistributes data back to the new node. As an aside, the use of ASGs here has proven doubly useful. With minor configuration changes, we can increase and decrease the sizes of our clusters and have data automatically re-balanced for the new cluster size. In fact, in the more traditional auto-scaling sense, policies can be created to change cluster size automatically as the dataset changes.

An even better strategy we’ve been exploring is to create a fully functioning image per ASG. This means a new node in an ASG does not rely on any external services to come online. Deployment tools like HashiCorp’s Packer and Netflix’s Spinnaker are making this much more streamlined. While this makes deploys take a bit longer, the benefit is completely deterministic scaling within the ASG.

Everything?

It makes intuitive sense to auto-scale multi-node services. We know we need somewhere between 5–10 nodes to serve up a given application, and so we allow AWS to handle that fluctuation for us. But what about single-node services that are used internally? Our team has several cases where a service lives as a single node because we’ve decided that making it highly available is not worth the cost. Instead, we put them in an ASG of size one. With the same ability to bootstrap themselves from their image, these nodes can also go down, and be automatically replaced. Obviously the service is down for a few minutes while a new node comes up, but sometimes that is acceptable.
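To make the "ASG of size one" idea concrete, here is a minimal sketch assuming the boto3 library and an existing launch configuration. The group name, launch configuration name, subnet IDs, and the 300-second grace period are hypothetical placeholders, not values from our setup.

```python
def size_one_asg_params(name, launch_configuration, subnet_ids):
    """Build arguments for an ASG that always holds exactly one node."""
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_configuration,
        # Min, max, and desired are all 1: the group never grows,
        # but a lost node is always replaced.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # The API expects a comma-separated string of subnet IDs.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,  # assumed boot time allowance, in seconds
    }


def create_size_one_asg(name, launch_configuration, subnet_ids):
    """Create the group. boto3 is imported lazily so the helper above stays dependency-free."""
    import boto3

    client = boto3.client("autoscaling")
    client.create_auto_scaling_group(
        **size_one_asg_params(name, launch_configuration, subnet_ids)
    )
```

Pinning the minimum and maximum to the same value is what turns the ASG into a self-healing mechanism rather than a scaling one.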

Filling the ASG Feature Gaps

The ASG offering from Amazon lacks some features we need for some of our services. Our API edge tier, for example, uses Elastic IP addresses (EIPs) to serve our API from the same set of IP addresses, even as we regularly change out the nodes. Unfortunately, the ASG API does not currently allow you to associate a set of EIPs with a group. To solve this, we leverage the EC2 API and simple Upstart scripts to automatically associate an address from our pool of EIPs on boot. A simple Python script is enough for this.

The nodes are provisioned with the necessary UserData (a list of EIPs), and Upstart runs the script on boot to grab an EIP from the pool.
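As a rough illustration of that boot-time step, here is a minimal sketch, not the original script. It assumes boto3, VPC-style Elastic IPs (which carry an AllocationId), and that the instance ID and the EIP pool have already been read from instance metadata and UserData. The function name claim_eip is hypothetical.

```python
def claim_eip(ec2, instance_id, eip_pool):
    """Associate the first unassociated Elastic IP in eip_pool with instance_id.

    ec2 is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the public IP that was claimed, or None if every address is taken.
    """
    response = ec2.describe_addresses(PublicIps=list(eip_pool))
    for address in response["Addresses"]:
        if "AssociationId" in address:
            continue  # already attached to another node
        ec2.associate_address(
            InstanceId=instance_id,
            AllocationId=address["AllocationId"],
            # Fail rather than steal the address if another node
            # claimed it between the describe and associate calls.
            AllowReassociation=False,
        )
        return address["PublicIp"]
    return None
```

Two nodes booting at the same moment can still race for the same address. In this sketch the losing call simply raises, so a real script would probably catch the error and try the next address.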

A very similar process can be used for other things, like automatically reattaching a block storage volume from a previous node, so that large disks of data can be transferred between deploys. Or perhaps the nodes in your ASG require DNS entries. A simple script that calls the Route53 API will work fine for that as well.
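For the DNS case, a sketch along these lines might work, again assuming boto3. The hosted zone ID, hostname, and 60-second TTL are hypothetical, and the function names are illustrative only.

```python
def upsert_a_record(hostname, ip_address, ttl=60):
    """Build a Route53 change batch that creates or overwrites an A record."""
    return {
        "Comment": "registered on boot",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise replaces it,
                # so a replacement node can take over the same name.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": hostname,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": ip_address}],
                },
            }
        ],
    }


def register_dns(route53, hosted_zone_id, hostname, ip_address):
    """Apply the change. route53 is expected to be boto3.client("route53")."""
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=upsert_a_record(hostname, ip_address),
    )
```

A short TTL keeps the window small during which clients still resolve the name to the node that was replaced.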

That said, there are some systems that don’t lend themselves to auto-scaling very easily. We have not, for example, successfully moved our Apache Storm cluster into an ASG. Re-balancing is something we do manually, and there is an ongoing discussion around making this more automated in the future.

Getting There

It is not trivial to move your infrastructure to a pattern like this. But the great thing about it is that you can adapt one service at a time. If your team runs configuration management on all new nodes, a step in the right direction might be to have Upstart scripts that invoke that process automatically. Or perhaps even have them kick off a Jenkins job to provision themselves.

Caveats

This technique can introduce new problems. Here are a few things to monitor and possibly alert on:

Excessive auto-scaling: If nodes are coming up in a bad state and being terminated as they fail a health check, the process can continue indefinitely. A constantly scaling group might never be doing work.
Auto-scaling ceilings: Make sure your ASGs have reasonable maximum sizes. A data store in an ASG that scales on disk usage can quickly grow out of control if excess data is being written rapidly.
Tracking costs: Knowing your true cloud costs can be difficult if your infrastructure is constantly changing in size. We use tools like Cloudability and Netflix’s Ice to monitor this.
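
The first caveat, excessive auto-scaling, could be watched for with something like the sketch below. It assumes the activity records come from the Auto Scaling describe_scaling_activities call, that each has a StartTime and a Description, and that launch descriptions begin with "Launching". The one-hour window and the limit of five launches are arbitrary illustrative thresholds.

```python
from datetime import datetime, timedelta


def is_thrashing(activities, now, window=timedelta(hours=1), max_launches=5):
    """Return True if more than max_launches launches started within window of now.

    activities is a list of dicts with "StartTime" (datetime) and
    "Description" (str), in the shape of scaling activity records.
    """
    cutoff = now - window
    recent_launches = [
        activity
        for activity in activities
        if activity["StartTime"] >= cutoff
        and activity["Description"].startswith("Launching")
    ]
    return len(recent_launches) > max_launches
```

A check like this could run on a schedule and raise an alert when it returns True, which is one of the few cases where waking an engineer may still be warranted.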