Data is always old - it's just a question of how old

Data Decay and the Illusion of the Present

Back in October, I spoke about data decay at API Strategy + Practice in San Francisco and, since then, there has been quite a bit of interest in a written form of my talk. Here goes:

Look at the nearest clock. It will give you a number that you will then associate with the word “now”. You might say to yourself, “it’s now 1:23 PM”. You might think about how lunch has left you in a mild food coma and perhaps you should go get that cup of coffee and perhaps one of those tasty cookie things that Sean brought in from home…

Ok, now look again. Has your definition of “now” changed? It probably has. Is your old definition of “now” any less valid? What does it even mean to have a “valid definition of now”?

On Background

Let’s dip our toes into some light metaphysics for a second. As humans, we tend to think of time as some kind of abstract arrow, divided into sections we label “the past”, “the present”, and “the future”.

While convenient for day-to-day life, we must admit that this idealized concept of time is fallacious. Unless you’ve been practicing advanced meditation for 20 years, you’re probably like me in that you can’t even perceive the present. “Wait…what!? I’m reading this right now!”, you say. The yogi strokes his wise, long, white beard and then counters, “Don’t you mean that you were reading it right then?”

Our perception of the present is actually just our most recent memories, even if they are only milliseconds old. This means we can only have knowledge of the past and we can only guess (however well-educated) at the future. We use data from the past to model the present, so that we may hopefully predict the future–and make decisions that take advantage of our predictions. (If you really want to bend your mind around this, imagine that each choice you make, each choice that others make, and every bit that’s flipped by a stray cosmic ray causes a brand new universe to come into being. The “future” is really just a probability density function with a near infinite number of dimensions.)

What’s the Point?

Now is about the time for you to ask, “what the hell does this have to do with APIs?” The short answer is, “everything”.

Data is always old. It’s just a question of how old it is. We as system designers often make a terrible mistake by ignoring this fact. Nowhere is data decay more important than in the securities markets. For example, on Wednesday, September 18th, at 2:00 and 2 milliseconds PM, approximately $600 billion worth of gold futures orders hit Chicago’s futures markets in response to the Fed’s announcement in New York that it would hold off on its “tapering” program. Someone made a killing as gold prices shot up. The problem is that it takes 7 milliseconds for light to travel from New York to Chicago. Foul play? That’s what the SEC thinks.

Your system might not be quite as sensitive to change, but it isn’t invulnerable to change. Bank accounts change value. Inventory levels fluctuate. EC2 spot prices might be astronomical at the moment. People in your address book change jobs, phone numbers, email addresses, and even their names. You get the picture. So, it’s important for us as system designers (and, more publicly, as API designers) to accept that our data will change and to make it easy for our customers to manage that change.

Trend Spotting

There are three basic patterns that emerge in API design:

Polling is the simplest way to expose a resource to the rest of the world. It’s great. Just hit me up with GET /notifications?since=1382630675166 HTTP/1.1 and I’ll give you all the notifications I have since the last time you checked.

But we can get into trouble with polling when we want really fresh data. How frequently are you going to check for updates? Once per second? Ten times per second? On an API with lots of resources and thousands of clients, this can quickly become a scaling problem for back-end servers, not to mention the wasted data transfer. In fact, our friends at Zapier found that 98.5% of all polling seen through their system amounts to wasted traffic. So we as designers attempt to “upgrade” (read: band-aid) the system by adding complexity such as adaptive scheduling systems and the like.

You can make polling work, and it’s a great way to get your API out there for the first time, but you’ll likely want to grow out of it.
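To make the polling loop concrete, here is a minimal sketch in Python. The in-memory store, the `fake_fetch` function, and the record shape are all stand-ins I made up for illustration; a real client would issue the `GET /notifications?since=...` request instead. The `next_interval` helper shows one simple flavor of adaptive scheduling (back off after an empty poll, reset after a hit), which is only one of many ways to do it.

```python
from typing import Callable, Dict, List, Tuple

Notification = Dict[str, int]  # hypothetical shape: {"id": ..., "ts": ...}

# Hypothetical in-memory stand-in for the server's notification store.
_STORE: List[Notification] = [
    {"id": 1, "ts": 100},
    {"id": 2, "ts": 200},
    {"id": 3, "ts": 300},
]


def fake_fetch(since: int) -> List[Notification]:
    """Plays the role of GET /notifications?since=<since>."""
    return [n for n in _STORE if n["ts"] > since]


def poll_once(
    fetch: Callable[[int], List[Notification]], since: int
) -> Tuple[List[Notification], int]:
    """Fetch everything newer than `since`; return it with the advanced cursor."""
    items = fetch(since)
    # If nothing came back, keep the old cursor so we never skip data.
    new_since = max((n["ts"] for n in items), default=since)
    return items, new_since


def next_interval(
    current: float, got_data: bool, floor: float = 1.0, ceiling: float = 60.0
) -> float:
    """Reset to the floor after a hit; double (up to the ceiling) after a miss."""
    return floor if got_data else min(current * 2, ceiling)
```

Note that the second poll with an unchanged cursor returns nothing at all, which is exactly the kind of wasted round trip the Zapier number describes.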

If you work with Facebook’s API at volume, you’ll be familiar with this pattern. You tell Facebook you’re interested in updates by logging into the application admin area and telling them which kinds of updates you’d like to receive. You also provide them with the URL of an API endpoint you’ve created to receive their updates. Assuming you get everything to work, Facebook calls your endpoint any time information changes on their end that you might be interested in.

The catch is that they don’t give you the information; they just tell you where to get it. This is good for them in that they don’t have to send data out that you might not care about, but bad for you in that you have to make another API call to get the data in response to their notification. It’s arguably one of the more complicated setups and is sometimes hard to debug, which leads us to pure Push models.
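Here is a rough sketch of what the receiving side of that pattern looks like. The `changed_ids` field and the `fake_fetch` function are hypothetical names chosen for the example; they are not Facebook’s actual payload or API. The point is only that the notification carries references, so every changed item costs you one more call.

```python
from typing import Callable, Dict, List


def handle_notification(
    notification: Dict[str, List[str]], fetch: Callable[[str], dict]
) -> Dict[str, dict]:
    """The ping only names what changed; each item costs one more API call."""
    return {
        object_id: fetch(object_id)
        for object_id in notification["changed_ids"]
    }


# Hypothetical producer-side data and a fetch function that records its calls.
REMOTE = {"u1": {"name": "Ada"}, "u2": {"name": "Grace"}}
calls: List[str] = []


def fake_fetch(object_id: str) -> dict:
    calls.append(object_id)  # stands in for a follow-up HTTP request
    return REMOTE[object_id]


fresh = handle_notification({"changed_ids": ["u1", "u2"]}, fake_fetch)
```

After this runs, `calls` holds one entry per changed object, which is the extra traffic (and the extra place for things to go wrong) that this pattern trades for a smaller notification.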

Push-based models are very similar to the Facebook events pattern. The difference is that they simply send the whole payload (or some kind of diff) when something changes. A classic example of this is GitHub’s service hooks implementation. You head over to your repository’s settings page and register a service hook. The endpoint you register can be one of the dozens of integrations that GitHub has ready for you, or it could be your own in-house API. This works for simple cases, and is often all you need. When things get more complicated, like when subscriptions are short-lived or completely transient, you need to bring out the big guns.
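For contrast, a pure push receiver never has to call back. The sketch below assumes a made-up event shape with an `id` plus either a full `payload` or a partial `diff`; it is not GitHub’s hook format, just the smallest thing that shows both variants.

```python
from typing import Any, Dict

Cache = Dict[str, Dict[str, Any]]


def apply_push(cache: Cache, event: Dict[str, Any]) -> None:
    """Apply a pushed event to the local copy without any follow-up request."""
    record_id = event["id"]
    if "payload" in event:
        # Full payload: replace our copy of the record wholesale.
        cache[record_id] = dict(event["payload"])
    else:
        # Diff: merge only the changed fields into what we already have.
        cache.setdefault(record_id, {}).update(event["diff"])


cache: Cache = {}
apply_push(cache, {"id": "u1", "payload": {"name": "Ada", "phone": "555-0100"}})
apply_push(cache, {"id": "u1", "diff": {"phone": "555-0199"}})
```

Sending diffs keeps messages small, but it assumes the receiver already holds a correct base copy, so a dropped event can leave the two sides quietly out of sync.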

Dynamic subscriptions will be familiar to people who have worked with the PubSubHubbub protocol, but there are much hipper flavors out now like my favorite, The protocol might seem complicated, but it’s based around a relatively simple idea. When you’re interested in a resource, you subscribe to updates. When you’re no longer interested, you cancel your subscription.

This minimizes the amount of data sent between systems and optimizes the signal-to-noise ratio. “The catch” is that each system must maintain state about each subscription. A producer must be able to efficiently multicast updates to multiple subscribers. This is not always trivial to design, but can become a necessity in very high volume interactions between systems. This makes a lot of sense for contact systems like FullContact, which needs to keep your contacts in sync with multiple 3rd party systems (and keep you in sync with FullContact), even if you have tens of thousands of contacts in your address book.
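The subscribe, update, cancel cycle is easier to see in code. This is a toy in-process sketch, not PubSubHubbub or any real protocol: the `Publisher` class, its method names, and the `contact/42` resource key are all invented for illustration, and plain callbacks stand in for what would be HTTP deliveries in practice.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict

Callback = Callable[[str, Any], None]


class Publisher:
    """Keeps per-resource subscription state and multicasts updates."""

    def __init__(self) -> None:
        # resource -> {subscriber_id -> callback}; this is the state "catch".
        self._subs: DefaultDict[str, Dict[str, Callback]] = defaultdict(dict)

    def subscribe(self, resource: str, subscriber_id: str, callback: Callback) -> None:
        self._subs[resource][subscriber_id] = callback

    def unsubscribe(self, resource: str, subscriber_id: str) -> None:
        subscribers = self._subs.get(resource)
        if subscribers is None:
            return
        subscribers.pop(subscriber_id, None)
        if not subscribers:
            # Drop empty entries so transient subscriptions don't pile up.
            del self._subs[resource]

    def publish(self, resource: str, update: Any) -> int:
        """Send `update` to every current subscriber; return how many got it."""
        subscribers = list(self._subs.get(resource, {}).values())
        for callback in subscribers:
            callback(resource, update)
        return len(subscribers)


pub = Publisher()
inbox_a: list = []
inbox_b: list = []
pub.subscribe("contact/42", "a", lambda resource, update: inbox_a.append(update))
pub.subscribe("contact/42", "b", lambda resource, update: inbox_b.append(update))
first = pub.publish("contact/42", {"phone": "555-0199"})
pub.unsubscribe("contact/42", "b")
second = pub.publish("contact/42", {"phone": "555-0111"})
```

Subscriber `b` only sees the first update because it cancelled before the second one, and a resource nobody is watching costs the producer nothing to publish.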

How About You?

So you’re setting out to build an API and you’re trying to decide how to deal with aging data. Which model should you choose? The simple way is to ask yourself how sensitive your data is to change, or better, what the cost of acting on out-of-date data is. If the data doesn’t change very frequently and the cost of making decisions based upon obsolete information isn’t very high, keep it simple! Go with a push-based model.

When I first started FullContact with Bart and Travis, we found that it can cost businesses up to $100 per contact record over time if they never clean up data entry errors or allow their contacts to age into obsolescence (source: SiriusDecisions). This is by no means stock-market-style sensitivity, but those costs can really add up. For that reason, we like push-based models.


Image credit: Michael Himbeault via Flickr
