This weekend was the weekend from hell for all engineers involved in ops, devops, infrastructure engineering, or whatever your company may call it. Perhaps you were among the lucky few who didn’t have any issues this weekend. We certainly had a few, for which we profusely apologize. Most of our countermeasures worked as intended, but due to unforeseen issues with ELBs and the addition of the leap second, we had more than our share of sleepless nights.
Electrical Storms Cause AWS Outage
Amazon Web Services is known worldwide for their simple, elastic provisioning of compute power. This weekend their US-East-1 region in Virginia experienced an outage: one of their Availability Zones (AZs), a partition of the facility, lost power during this weekend’s severe electrical storms. Networking was also impaired during the outage, and we experienced widespread disruption of our own services. We make extensive use of AWS, so naturally this impacted us.
This story always begins with the long *buzz-buzz* of my custom Android pagerduty notification. Oh no. A downtime notice is always bad, but it’s compounded by getting a message with not only one error, but a long chain of numbered errors:
ALRT: #378,#379,#380,#381,#382 on Pingdom; Mongo MMS Reply 4: Ack all 6: Resolv all
I’d call this a nightmare scenario: multiple endpoints down, and MongoDB (monitored via 10gen’s MMS) complaining about downed nodes. I’d been out enjoying a Friday night movie (Ted, for those interested); I rushed home, hopped on the computer, sent a tweet from our @FullContactAPI account, and began diagnosing the issues.
The first thing to try is to make sure Pingdom isn’t experiencing routing issues. A quick curl check made it obvious the API was truly down. The only way it could be worse would be for HBase to also spontaneously crash. The first place to go during an outage is Twitter. Believe it or not, you can sometimes diagnose your own issues with a quick cruise down Twitter (if you follow the right people). In Twitterrific, I also keep a list of saved hashtags I follow during emergencies, usually #ops, #devops, #aws, and #ec2.
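That first curl check can be as simple as the sketch below — hit the endpoint directly and see what comes back, ruling out the monitoring service itself. (The function name and URL are illustrative, not our real health-check endpoint.)

```shell
#!/bin/sh
# Rule out a Pingdom routing problem by curling the endpoint
# directly. Prints UP or DOWN plus the HTTP status code
# (curl reports 000 when it can't connect at all).
check() {
  code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1")
  if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
    echo "UP ($code)"
  else
    echo "DOWN ($code)"
  fi
}
# e.g. check "https://api.example.com/v1/health"
```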
This time, #aws immediately showed people complaining about widespread power outages in an AWS East AZ. A single AZ shouldn’t be able to take us off the internet, but clearly it had. I noticed that two of our servers, a MongoDB replica set member and a MySQL slave, had been knocked offline by the AZ outage.
Until Amazon’s scripts had found (and released) the instances, they were completely inoperable. Neither of these outages should have impacted operations, so the next step was to check the application layer. To do that, I usually load up our Airbrake account to see what kinds of exceptions the API bubbles up; a JDBC exception or ZooKeeper malfunction can usually be found prominently. But Airbrake also runs on EC2 (via Heroku), specifically on RDS (Relational Database Service) instances that were taken offline by the outage, so that was a bust. When Airbrake fails, my next step is to SSH into our “webhead” nodes and tail logfiles. This highlighted the issue immediately: the momentary network outage (from the AZ crashing) had rendered the JDBC (Java Database Connectivity) driver inoperative despite retries (we filed an internal bug for this).
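When the exception tracker is down, log triage on the webheads amounts to grepping for connection-related errors. A rough sketch of that triage, assuming hypothetical log paths and error patterns:

```shell
#!/bin/sh
# Count connection-related errors in an application log -- the
# sort of grep used when Airbrake itself is unreachable.
# (Path and patterns are illustrative.)
scan_log() {
  grep -icE 'sqlexception|connection refused|timed out' "$1"
}
# live version: tail -f /var/log/api/api.log | grep -iE 'sqlexception|refused'
```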
A reboot of the API web servers brought the majority of our internal API services back online, but NGINX was still returning gateway errors and claiming it could not connect to upstream servers. This puzzled me, given that our applications had booted successfully and the log output looked OK. This is when I discovered the Elastic Load Balancers (ELBs) were broken. We use ELBs to front each of our autoscaling groups, each of which represents a different role or service in our architecture. Some of our ELBs worked; others had deregistered instances and refused to register new ones in any AZ. Amazon refused to acknowledge the issue for several hours, even though ELBs pointing to instances in other AZs were also impacted.
What followed was a fast-paced session of choosing instances from each group, creating a corresponding NGINX upstream block, and overriding the proxy_pass for each location with the proper hard-coded upstream block. Given that we already use NGINX to do contextual routing to different services, this was easier than it might have been if we’d relied on ELBs alone.
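A hypothetical sketch of one of those hand-wired blocks — an upstream of instances picked straight out of the autoscaling group, standing in for the broken ELB (names, paths, and addresses are illustrative, not our real config):

```nginx
# Hard-coded upstream replacing a broken ELB (addresses illustrative).
upstream enrichment_backend {
    server 10.0.12.34:8080;  # instance picked from the autoscaling group
    server 10.0.12.35:8080;
}

server {
    listen 80;

    location /v2/person {
        # previously proxied to the ELB's DNS name
        proxy_pass http://enrichment_backend;
    }
}
```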
The first step was to disable all activity on our autoscaling groups; this included health checks, adding instances to ELBs, and removing failed servers.
Once this was done, I added the servers to their upstream blocks. After a config reload, I could finally tell @FullContactAPI followers we were back up, albeit in a degraded state. This was approximately 40 minutes after the first PagerDuty notifications. Given our multi-AZ deployments, we weren’t impacted nearly as much as we might otherwise have been; if the ELBs had been working correctly, it would have been an easy fix. Our faith in the ELB system has been shaken significantly. After restoring service to the API, I disabled our deployment jobs to ensure a Jenkins build wouldn’t disrupt the very fragile NGINX config. What followed was time spent on the AWS forums, booting micro instances to repair EBS volumes that had been improperly shut down, and restoring them. By 3am, we’d mostly restored our Mongo nodes and rebuilt our MySQL slave’s EBS RAID.
The Great Leapocalypse
This wasn’t to be the end of the weekend’s ops disasters. Many of the same companies impacted by the AWS East outage were similarly affected by what many ops engineers are calling “The Great Leapocalypse”. Saturday, June 30th, 2012 was scheduled as the day to add a leap second to our clocks, keeping UTC within a second of UT1 and essentially making our atomic clocks match solar/astronomical time.
The last scheduled leap second occurred in December 2008 with very little fuss, and nobody expected this one to be any different. Then the first reports started popping up on ServerFault (the ops sibling of the well-loved, programmer-focused StackOverflow) that many Linux servers were crashing in the day leading up to the leap second. Due to an unfortunate kernel bug, well-loaded servers had a high chance of kernel livelock every 17 minutes. Much of our data storage resides on Apache HBase, which relies on ZooKeeper, a quorum/metadata service. Our HBase nodes weren’t loaded very heavily given the downtime on Friday night/Saturday morning, but our ZooKeeper nodes finally livelocked around 11pm UTC (5pm MDT). None of the services on our HBase cluster would start after the leap second; JVMs started and immediately stopped. We were puzzled. Then we came across the Mozilla IT blog post about MySQL’s high CPU usage after the leap second, which provided the seemingly trivial command:
```shell
date -s "`date -u`"
```
Set the date from the date. Everything booted, and from there it was a simple matter of ensuring HBase was serving our regions again. It’s almost embarrassing how easy the solution was once it had been found. All we had to do was go around to all of our boxes (including the MySQL ones) and set the date; CPU loads dropped and services recovered.
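The rollout itself can be sketched as a loop over the fleet — the host inventory below is hypothetical, and the remote-shell command is a parameter so the loop can be stubbed for a dry run:

```shell
#!/bin/sh
# Run the one-line clock fix on every box. Host names are
# illustrative; pass ssh (or a stub) as the first argument.
fix_clocks() {
  rsh=$1; shift
  for host in "$@"; do
    echo "fixing clock on $host"
    $rsh "$host" 'date -s "$(date -u)"' || echo "failed: $host"
  done
}
# e.g. fix_clocks ssh hbase01 hbase02 zk01 mysql01
```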
In retrospect, repeating a second in today’s massively concurrent, high-volume, coordinated web seems like an awful idea. Google thinks so too: their Site Reliability Engineer Christopher Pascoe blogged about their process of doing “leap smears”, adding a few milliseconds here and there throughout the day instead of a whole second at once. This outage should never really have happened; John Stultz patched the issue back in March, but the patch wasn’t picked up by most mainstream vendors.
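The smear idea reduces to simple arithmetic: spread the extra second over a long window so the clock is never stepped. A minimal sketch of a linear smear over a 24-hour window (an illustration of the concept, not Google’s exact implementation):

```shell
#!/bin/sh
# Linear leap smear: given seconds elapsed in an 86400 s window,
# return how many of the leap second's 1000 ms the clock has
# absorbed so far. At the end of the window the full second has
# been added and no step is needed.
smear_offset_ms() {
  elapsed=$1   # seconds elapsed in the 86400 s smear window
  echo $(( elapsed * 1000 / 86400 ))
}
# halfway through the window the clock has absorbed 500 ms
```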
AWS outages shouldn’t impact our services, and we’re working on expanding our operations to multiple redundant datacenters so our customers are never without a fast, reliable API. Our downstream customers rely on us, and our outages impact them and their customers as well. We apologize for the downtime; hopefully this post conveys that we take all outages seriously. We’re on call to ensure our service is of the highest quality (and availability).
The real takeaway from the incidents of the last few weeks is that the cloud is not a magical high-availability machine. We never treated it as such, but we were still blown away by how interconnected and fault-sensitive Amazon really is. Diversity among cloud providers and the locations of their data centers is where the greatest benefit will be found.