I want to apologize to our customers and users. Our API was down last night for 53 minutes and, in our business, this is simply inexcusable. We went down when Amazon Web Services rebooted a key piece of our infrastructure, and we had not modified our systems to handle the outage gracefully. Not only did our monitoring services fail to send notifications to key employees, but, to make matters worse, we knew about the possibility of this downtime well in advance. We failed our customers yesterday, and for that, I apologize.
Afterwards, while our team was working through ways to ensure that we don’t expose our customers to this kind of downtime in the future, it dawned on me that our redundancy and monitoring needs apply not just to our startup, but to any startup whose primary revenue comes from supporting other production systems. Great companies like Contactually, Mingly, or Intercom depend upon our service to provide accurate and up-to-date contact information to their customers, and they simply can’t tell their users, “We can’t help you, FullContact is down.” This means that for companies like us, downtime flows downstream, affecting many systems or companies at once. We must be extra-vigilant.
Given our failure, I’d like to share with you some of the changes we’ve made. We’re switching our notification system to PagerDuty. Previously, we used Pingdom to monitor our uptime and to notify us of any downtime. This had worked great in the past, but this time none of our engineers received text messages from Pingdom, even though we had updated the notification rules less than 24 hours earlier. The lesson we learned: test your notifications frequently. We didn’t know they weren’t functioning until we got burned.
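To make "test your notifications frequently" concrete, here is a minimal sketch of a notification drill. It is not our production code and it does not use the Pingdom or PagerDuty APIs; the `Recipient` type and its `send` function are hypothetical stand-ins for whatever delivery channel you use, and we assume `send` returns `True` only when delivery is confirmed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Recipient:
    name: str
    # Hypothetical delivery function: returns True only if delivery is confirmed.
    send: Callable[[str], bool]


def run_notification_drill(
    recipients: List[Recipient],
    message: str = "TEST ALERT: please ignore",
) -> Dict[str, bool]:
    """Send a test alert to every recipient and record whether it was confirmed."""
    results: Dict[str, bool] = {}
    for recipient in recipients:
        try:
            results[recipient.name] = bool(recipient.send(message))
        except Exception:
            # A channel that raises counts as a failed delivery, not a crashed drill.
            results[recipient.name] = False
    return results


def failed_recipients(results: Dict[str, bool]) -> List[str]:
    """Return the names of recipients whose test alert was not confirmed."""
    return sorted(name for name, ok in results.items() if not ok)
```

Run on a schedule, a drill like this surfaces a silent notification failure before a real outage does: any name returned by `failed_recipients` is someone who would not have been paged.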
We’re also upgrading our backend architecture to support automated failover of MySQL instances. This should come as no surprise to any seasoned MySQL administrator, but a solid database failover configuration is an absolute must if your company is in the business of supporting other businesses’ real-time data needs. We’ve since made the necessary upgrades, using great technologies like HAProxy and the MySQL JDBC ReplicationDriver. We’ll be implementing a Netflix-style Chaos Monkey soon too.
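The idea behind automated failover can be sketched in a few lines. This is only an illustration of the principle, not our actual setup: HAProxy and the ReplicationDriver handle this at the proxy and driver level, while the sketch below does it in application code. The `connect` callable and the host names are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllBackendsDown(Exception):
    """Raised when no database host accepts a connection."""


def connect_with_failover(hosts: Sequence[str], connect: Callable[[str], T]) -> T:
    """Try each host in priority order and return the first successful connection."""
    errors = []
    for host in hosts:
        try:
            return connect(host)
        except ConnectionError as exc:
            # Remember why this host failed, then move on to the next one.
            errors.append(f"{host}: {exc}")
    raise AllBackendsDown("; ".join(errors))
```

With a host list such as `["db-primary", "db-replica"]`, a reboot of the first host results in a connection to the second instead of an outage, provided the second host is actually ready to serve traffic.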
At FullContact, we’re trying to solve the world’s contact information problem. We can’t do this if our production systems aren’t rock solid, and I’m making the promise to you, our customers and users, that we can and will do better.