Dealing with Failure in a Distributed System

Jason LingleOctober 31, 2019

FullContact uses a microservice architecture both internally and for our customer-facing APIs. While microservices have a lot of benefits, one major drawback compared to a monolithic architecture is that there are a lot of additional potential points of failure. Given the nature of this architecture, our team tries to anticipate issues to ensure success for everyone, especially our customers.

A particularly bad type of failure is the hung task, wherein a piece of code simply ceases to make progress (as opposed to succeeding or failing quickly). For network clients, the obvious solution here is to set proper timeouts, something that should be done regardless.

In a perfect world, timeouts would be sufficient. Unfortunately, in the real world, one needs to worry about certain deficiencies and unknowns:

Not all client libraries actually allow setting timeouts.
Certain networking clients may have timeouts, but can only detect network failures and cannot time out waiting for a response on a healthy connection (due to a keep-alive mechanism, for example).
Mistakes happen. Setting a timeout on a client may be forgotten. In some cases, a problem can be entirely in-process, such as a poor algorithm choice.
Unknowns. A client library that appears to be non-blocking may actually block when there are networking problems, or another client might not actually respect the time out configuration when trying to resolve a DNS name.

Hung tasks are a particularly insidious problem since they accumulate. In a synchronous service, the request handler thread pool can quickly become full.

Asynchronous services aren’t safe either: While there may not be a thread pool to saturate, they will still tie up more and more resources in those tasks, which in many cases are heavy-weight resources like network connections. In either case, the service must either start rejecting requests to shed load or run out of resources entirely.

What we needed was a solution to prevent this accumulation from happening.

Throwing thread pools at the problem

In the past, FullContact used Hystrix as a solution to this problem.

By default, Hystrix runs each piece of user code in its own thread pool. Running tasks on separate threads are necessary for two features on the JVM: being able to abandon waiting for a task to finish, and being able to interrupt a specific task to try to cause it to fail faster.

Running each task on a separate thread incurs substantial overhead. Locking is required to enqueue the task onto the thread pool; a thread in the pool must wake up to take the task; the task may well end up on a different CPU core with a cold cache; finally, there is more locking to ferry the response back to the requester thread, and the requester thread must wake back up when the result arrives. The simple existence of the threads also adds overhead, as every thread needs its own stack — even on modern hardware, those megabytes add up.

There are also drawbacks to imposing timeouts by asynchronous interruption. Interrupting a thread puts it into a state that is not often considered during development; while most code will still fail fast, it is also possible to expose other bugs which are highly timing sensitive.

Further, in general, interrupting a thread working with a network connection effectively destroys that connection. For example, in HTTP 1.1, the only way to abandon a request is to gracelessly close the whole TCP connection. A similar problem typically affects database connections since, after an interruption, little to nothing is known about the state of the connection that was in use.

This issue with interruption can actually prevent a service from being able to recover from a transient failure. Most commonly, this happens with database connections, which are often expected to be fast (we see sub-millisecond response times usually) once established but can take seconds to set up.

While the connections are usually set up when a service starts, if the connections are lost while the service is running, they need to be re-established as needed, typically during the execution of a request. An aggressive timeout around database access will end up interrupting this process and killing every nascent connection. In its attempt to defend itself from minor database issues, the service has become unable to recover from major database issues!

Hystrix also offers a mode where it uses a semaphore to do concurrency limiting. This largely addresses the above issues, since cross-thread action ceases to happen and the timeout feature is lost. However, it still adds quite a bit of computing overhead due to many layers of abstraction, and still brings along some other general issues we had with Hystrix.

One of Hystrix’s features — present for both thread-pool-based and semaphore-based configurations — is “circuit breaking“. Hystrix monitors the ratio of successful executions to failed executions within some time period, and if too few executions succeed, the task starts “short-circuiting” — almost all executions are made to fail immediately. The intent of this behavior is to reduce resources spent on the failing task and to reduce load on a possible downstream service experiencing issues.

Whether circuit breaking is at all useful is a matter of some debate, though there is general agreement that it makes it more difficult to identify the source of a problem since one must first filter out all failures that occur due to short-circuiting.

In my personal experience, circuit breaking has never had any benefit; it has, however, caused problems of its own, as it acts as an outage amplifier: A partial outage that trips the circuit breaker results in a total outage. For example, if the downstream service starts returning 500 Internal Server Error 30% of the time, the circuit breaker will trip and make essentially all requests fail.

A final, much less significant, issue with Hystrix is that it is extremely configurable. While this is nice on the surface, having a dozen knobs that interact with the code and with each other in subtle ways tends to produce a lack of confidence that the failure protection will actually work as desired.

A new solution

While Hystrix does do its job, we eventually felt that it causes more grief than it is worth, and we simply stripped Hystrix out of most of our services entirely. This improved some performance somewhat, but we quickly learned that Hystrix’s functions to deal with hung tasks were, in fact, something we depended on.

Some criteria we wanted for a replacement:

Minimal configuration. Fewer knobs mean less time is needed to come up with a correct configuration, and less time is needed to understand a configuration someone else put together.
Minimal overhead on the “happy path”.
Integration with our existing metrics system. This gives visibility into execution time/results within each service.
Simplicity. Not just simplicity of usage, but also simplicity of implementation: the less the system does, the easier it is to understand what protection it does or does not provide.

Ultimately, we ended up making an in-house solution called a spillway.

That’s it — it’s just a wrapper around a semaphore. The real code is larger due to extra metric reporting and a mirror API for ListenableFuture-based asynchronous code but is otherwise essentially the same.

The way this works is pretty simple:
concurrencyLimit is set to a value above the expected “normal” amount of concurrent requests. If the code run inside the spillway starts getting much slower than usual, concurrent requests will accumulate until the semaphore has no leases left, then further executions will be rejected with SpillwayOverflowException immediately, preventing more resources being tied up in whatever is failing.

There is no special way of handling fast failures — we simply let them keep failing naturally.

Similarly, there are no true timeouts — if execution takes much longer than usual, it is still allowed to complete naturally, as this is often more robust than trying to enforce timeouts. Forced timeouts also don’t make a huge difference anyway.

For example, if concurrencyLimit is set to 10 and the task inside the spillway suddenly starts taking 10 seconds per execution, on average only one execution per second actually sees extra latency; all others are rejected quickly. In a service handling hundreds of requests per second, this is an insignificant effect compared to the presumably larger issue of the non-availability of whatever functionality is inside the spillway.

One disadvantage of this system is that it is not capable of fully clearing resources used by the failing task — resources proportional to concurrencyLimit will remain tied up during a failure scenario — but so far, we have not found that to be much of an issue.

As can be seen in the earlier code, there are only two configuration knobs: concurrencyLimit and semaphoreTimeoutMs.

concurrencyLimit has been described already. semaphoreTimeoutMs indicates how long to wait for the semaphore to be available if it is already full when an execution would start; it is used to deal with sudden bursts.

Both configurations are fairly easy to set:

Decide on a “maximum normal” latency of the task, essentially a boundary between “normal” and “something’s wrong”. For example, 25ms.
Set concurrencyLimit to the expected maximum request rate times the latency. For example, if 200 request/second is expected, concurrencyLimit = 200 * 0.025 = 5
Decide on a maximum burst size. i.e., a maximum number of requests that could by chance come in within a very short time (such as the typical time needed for a single execution), for example 10 requests. Multiply by the latency used above, then divide by concurrencyLimit and use the result for semaphoreTimeoutMs: semaphoreTimeoutMs = 10 requests * 0.025 sec/request / 5 = 50ms

The above steps are just guidelines, of course. For tasks that have viable fall back paths, using tighter parameters is often a good idea. Conversely, tasks that need to be reliable but are not in a place to directly affect customer-facing latency, more permissive values may be preferable.

Performance

To get a rough idea of how a system performs with various approaches, we set up a simple test program:

100 threads simulate clients performing repeated requests by going through a fixed code path.
On each request, each thread randomly performs either “task A” or “task B”.
Both tasks normally just sleep for 10ms to simulate a network call and are surrounded by whatever protection method is being used for the test.
The test proceeds in three one-minute phases: Normal, in which both tasks work as described above, to measure the overhead incurred by the protection method; failure, in which task A will start sleeping for 10 seconds 90% of the time and then throw an exception, to see how well or poorly the system copes with that; and finally recovery, where task A returns to normal behavior to see how long the after-effects of the failure linger.

With no protection at all, we get a progression like this:

Normal:

A: 4806 exec / sec, 100% success, 10 avg latency

B: 4810 exec / sec, 100% success, 10 avg latency

Failure:
A: 10 exec / sec, 17% success, 8294 avg latency
B: 13 exec / sec, 100% success, 10 avg latencyRecovery:
A: 4778 exec / sec, 99% success, 13 avg latency
B: 4767 exec / sec, 100% success, 10 avg latency

This test took 39 seconds of CPU time on the test system, equivalent to one CPU core at 17% usage.

The results here are pretty typical: As soon as task A starts becoming latent, the whole system slows to a crawl as almost all resources get tied up in task A, even though half of the requests don’t even run task A. On the other hand, recovery is near-instantaneous, with the only latency and failures being those whose execution had started in the failure phase.

To test Hystrix in thread-pool mode, we’ll use a concurrency limit of 60 (for a total of 220 threads — each task needs its own pool) and a 500ms timeout. Other settings are default.

Normal:

A: 4910 exec / sec, 99% success, 10 avg latency

B: 4893 exec / sec, 99% success, 10 avg latency

Failure:
A: 2551 exec / sec, 0% success, 29 avg latency
B: 2564 exec / sec, 100% success, 10 avg latencyRecovery:
A: 4830 exec / sec, 97% success, 10 avg latency
B: 4823 exec / sec, 99% success, 10 avg latency

This run required 3 minutes 51 seconds of CPU time — equivalent to more than one CPU core at 100%, dramatically more overhead than the first test.

There’s several things going on in the final results. First, notice that we did not get 100% success even in the normal phase. The failures were a result of bursts exceeding the thread pool size. Larger thread pools and configuring a request queue help here a bit, but there were still failures even with 100 threads per pool and a 100-deep request queue — I wasn’t able to figure out why.

In the failure phase, latency of the failing task only increased to 29ms, but notice the 0% success rate: This is Hystrix’s circuit breaker feature in action. Shortly after the failure started, most requests for task A were immediately rejected, saving the latency but increasing the 90% failure to a 100% failure. Task B continued running normally, albeit at a lower rate since fewer threads were available.

In the recovery phase, task A success was only 97%. This is because it takes a while for the circuit breaker to start allowing normal execution again.

Switching Hystrix to its semaphore-based concurrency limit, still using a limit of 60, we get this:

Normal:

A: 4937 exec / sec, 99% success, 10 avg latency

B: 4945 exec / sec, 99% success, 10 avg latency

Failure:
A: 2524 exec / sec, 0% success, 25 avg latency
B: 2531 exec / sec, 99% success, 10 avg latencyRecovery:
A: 4888 exec / sec, 98% success, 12 avg latency
B: 4886 exec / sec, 99% success, 10 avg latency

This run required 2 minutes 33 seconds of CPU time — less overhead than before, but still a lot compared to no protection.

The final results are mostly the same.

Finally, we’ll do a test with a spillway. For parity with the Hystrix tests, we’ll use a concurrencyLimit of 60. The maximum possible burst size above the limit is 40 and expected latency is 10ms, so we get a semaphoreTimeoutMs of just under 7ms.

Results:

Normal:

A: 4808 exec / sec, 99% success, 10 avg latency

B: 4823 exec / sec, 99% success, 10 avg latency

Failure:
A: 2307 exec / sec, 0% success, 28 avg latency
B: 2304 exec / sec, 100% success, 10 avg latencyRecovery:
A: 4814 exec / sec, 99% success, 12 avg latency
B: 4826 exec / sec, 99% success, 10 avg latency

This run required just over one minute of CPU time, a substantial improvement over Hystrix.

The results are generally comparable to what we saw with Hystrix. The execution rate in the failure phase was slightly lower since more threads were kept tied up in task A’s failure case. One significant change is that recovery was instant like we saw in the unprotected case. Most importantly though, we retained the property that task B was able to continue operating.

There are still a handful of failures in the non-failure states due to the aggressive configuration — the test system has 8 CPU cores and preempts at 1000Hz, so if all threads are ready at the same time, they only get a time slice once per 13ms. Consistent with this, a 13ms value for semaphoreTimeoutMs is the point at which the test experiences no spurious failures. Since 7ms is a very short time to wait, in a production application, we would probably use a larger semaphoreTimeoutMs value anyway.

Summary

	Nothing	Hystrix Thread Pool	Hystrix Semaphore	Spillway
CPU seconds	39	231	153	66
Throughput during failure	13	2564	2531	2304
Failures after recovery	1%	3%	2%	1%

Conclusion

Wrapping code that could be prone to hanging in a semaphore is a light-weight and surprisingly effective way to insulate the rest of the service from failure. Compared to full-fledged frameworks, it incurs lower overhead and is easier to develop against while still providing the most important benefits.