Chaos Testing our Mobile Apps at FullContact

FullContact, December 30, 2016

Let’s have a bit of straight talk: chances are your app isn’t tested nearly as well as you’d want it to be. And neither is ours. There are many ways to end up here and plenty of articles covering them, so let’s skip that part and focus on making the situation better instead.

So how can we fix that? Well, the obvious answer is: slowly. When you change some part of the codebase, write a test for it. Maybe introduce a new architecture pattern to make the code more testable (we are fans of Clean Architecture ourselves).

But in the meantime, your app is in production. It is complex. It doesn’t wait. It sometimes fails for weird reasons. Networks are flaky. You have a bunch of interdependent network calls. HELP!

Chaos Testing

Chaos testing basically means causing systems to fail in a controlled manner to see how the failures affect your solution as a whole. The term was coined by the good folks at Netflix, who have had great success testing the stability of their services using Chaos Monkey. While this has been applied very successfully to backend services, there hasn’t been much talk about chaos testing mobile apps, apart from UI automation and testing tools like Application Exerciser Monkey for Android or UI AutoMonkey for iOS.

Since our mobile apps here at FullContact make a lot of network calls while synchronising your contacts, we rely heavily on chains of network requests to make sure that everything syncs nicely and without race conditions (RxJava and RxSwift are VERY handy in that respect). So the question becomes: what happens when a network call in the middle of a request chain fails? Maybe it doesn’t fail completely, just partially. Maybe it causes unforeseen issues in a completely unrelated part of your app!
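To make that failure mode concrete, here is a minimal Python sketch of a chain of dependent calls where a step in the middle fails. It is only an illustration: our apps build these chains with RxJava and RxSwift, and `NetworkError`, `run_chain`, and the three step functions are hypothetical names invented for this example.

```python
class NetworkError(Exception):
    """Stands in for any failed network call (hypothetical)."""


def run_chain(steps, state):
    """Run dependent steps in order and stop at the first failure.

    Returns the state reached so far and the name of the failed step
    (or None if every step succeeded).
    """
    for name, step in steps:
        try:
            state = step(state)
        except NetworkError:
            return state, name
    return state, None


# Hypothetical steps of a sync; none of these names come from a real API.
def fetch_contacts(state):
    return {**state, "contacts": ["Ada", "Grace"]}


def upload_changes(state):
    # Simulates the chaos proxy breaking the middle call.
    raise NetworkError("chaos proxy returned a 500")


def save_sync_token(state):
    return {**state, "token": "abc"}


state, failed = run_chain(
    [
        ("fetch_contacts", fetch_contacts),
        ("upload_changes", upload_changes),
        ("save_sync_token", save_sync_token),
    ],
    {},
)
# failed is "upload_changes"; state holds the contacts but no sync token.
```

The later step never runs, so the app is left with half-finished state. What it does with that state is exactly what a chaos testing session is meant to reveal.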

The way we approached this was to gather our Android and iOS teams in the same room, point the apps’ API URLs at another developer’s computer running our chaos testing tool Proxinius, and then observe what happened when we purposely messed up an endpoint or two.

Proxinius is a Clojure REPL-based proxy (it’s open-source, so go try it out!) that mutates the responses returned from our backend.

There are two mutations available right now:

Return a 500 HTTP status
Return an empty response

The default mode of operation for Proxinius is to mutate the responses to a random 10% (easily configurable) of the requests passing through it. Random mutation is invaluable when you want to test how well your retry policies and recoveries work.
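The idea behind random mutation can be sketched in a few lines of Python. This is not Proxinius code (Proxinius is written in Clojure); `Response`, `status_500`, `empty_body`, and `maybe_mutate` are hypothetical names, and we assume here that the 500 mutation also drops the body.

```python
import random
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    body: str


def status_500(response: Response) -> Response:
    """Mutation: replace the upstream response with a server error.

    Assumption: the body is discarded along with the original status.
    """
    return Response(status=500, body="")


def empty_body(response: Response) -> Response:
    """Mutation: keep the status but return an empty body."""
    return Response(status=response.status, body="")


MUTATIONS = [status_500, empty_body]


def maybe_mutate(response: Response, rate: float = 0.10, rng=random) -> Response:
    """Mutate roughly `rate` of responses; pass the rest through untouched."""
    if rng.random() < rate:
        return rng.choice(MUTATIONS)(response)
    return response
```

Passing a seeded `random.Random` instance as `rng` makes a session repeatable, which helps when you want to replay the exact sequence of failures that exposed a bug.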

The beauty of Proxinius partly stems from its REPL nature. Since Clojure allows us to create new functions at runtime, we can come up with crazy (maybe even nonsense) mutations such as:

Change our camelCase-based JSON responses into kebab-case responses
Return a random response from another, previous request
Make requests wait for 3.14 seconds before returning

The ability to iterate on available mutations quickly allows us to come up with tests that would be very hard to create without Proxinius.
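As an illustration of how small such a mutation can be, here is a Python sketch of the camelCase to kebab-case idea. The real thing would be a Clojure function defined at the REPL; `kebab_key`, `kebab_keys`, and `mutate_body` are hypothetical names, and the sketch assumes the response body is valid JSON.

```python
import json
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def kebab_key(key: str) -> str:
    """Turn a camelCase key into kebab-case, e.g. firstName -> first-name."""
    return _CAMEL_BOUNDARY.sub("-", key).lower()


def kebab_keys(value):
    """Recursively rename the keys of nested dicts and lists; values are untouched."""
    if isinstance(value, dict):
        return {kebab_key(k): kebab_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [kebab_keys(v) for v in value]
    return value


def mutate_body(body: str) -> str:
    """Parse a JSON body, rename its keys, and serialise it again."""
    return json.dumps(kebab_keys(json.loads(body)))
```

A client that silently ignores unknown keys will treat the mutated response as if every field were missing, which is a good way to find out how the app copes with missing data.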

Chaos Testing with Proxinius has thus far helped us find a bunch of really nasty edge cases, as well as a number of minor errors and deficiencies in both our Android and iOS apps.

A word of caution — using this in a CI environment is probably not a good idea. While the UI Monkeys work well there, they do so because they aren’t really checking for anything. The only thing they are doing is testing if your app crashes or not. Checking for states with Proxinius would require having various assertions, which would inevitably lead to flaky tests. And nobody likes flaky tests.

In conclusion: Chaos Testing and tools like Proxinius are not a panacea or a silver bullet. They are not going to transform your app overnight, but they can be a very valuable tool in your arsenal when trying to make your app more robust and maintainable.
