Bart, Travis, and I started this company with a simple goal. David Cohen put it best:
Just fix people’s contacts. Oh, and don’t fuck ’em up.
Step 3: profit, right? As entrepreneurs, we prefer diving into a problem first and sorting out the details later, so we got right to work. As every other contact information management company knows, there are a lot of details.
What does it actually mean to fix people’s contacts? This requires some definition of the engineering problem. Specifically, our goal was to “find and merge all the duplicate contact records across every end-user contact system, clean them up, and update them with information available on the public web”. Let me break that down for you a bit. In doing so, I’ll be offering a preview of an upcoming series of posts on just what it takes to create a unified contact management system.
Rule #1: Don’t fuck ’em up
Rule #1 at FullContact is simple: do not fuck up people’s contacts.
Once, back in our pre-TechStars days, a bug in our Rainmaker application merged together every contact that shared the same website domain. Luckily this only affected one user, but he was furious! We had just merged every one of his contacts together. Now he had 1800 records with (more or less) the same data in them. The three of us spent 18 man-hours, late into the wee hours of the morning, hand-cleaning all of the user’s contacts in a Google spreadsheet. The user went from incredibly pissed-off to grateful, but we got the message loud and clear.
So how do you follow this rule? You have to be able to restore a user’s account to any point in time, and you have to allow the user to restore their own contacts too. Not only is this a nice feature, it also tells users that their contacts are safe in FullContact.
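To make "restore to any point in time" a little more concrete, here is a minimal sketch of an append-only version history. This is not how FullContact implemented it; the `ContactHistory` class, its method names, and the numeric timestamps are all assumptions made up for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ContactHistory:
    """Append-only history of one contact; nothing is ever overwritten."""
    timestamps: list = field(default_factory=list)  # sorted write times
    snapshots: list = field(default_factory=list)   # contact state at each time

    def save(self, timestamp: float, contact: Optional[dict]) -> None:
        # Assumes writes arrive in time order; None records a deletion.
        if self.timestamps and timestamp < self.timestamps[-1]:
            raise ValueError("versions must be appended in time order")
        self.timestamps.append(timestamp)
        self.snapshots.append(None if contact is None else dict(contact))

    def as_of(self, timestamp: float) -> Optional[dict]:
        # Latest snapshot written at or before the requested time.
        index = bisect_right(self.timestamps, timestamp)
        return self.snapshots[index - 1] if index else None


history = ContactHistory()
history.save(100.0, {"name": "Dan Lynn", "email": "dan@example.com"})
history.save(200.0, {"name": "Merged By Mistake", "email": "other@example.com"})

restored = history.as_of(150.0)  # state just before the bad merge
history.save(300.0, restored)    # a restore is simply a new version
```

Treating a restore as just another version means the bad merge stays in the history too, so even an accidental restore can be undone.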
Check back here in the coming weeks for a deep-dive into how we implemented contact and address book versioning.
How can you find duplicates in 3rd party systems? First, you have to sync with them so you have a copy of the user’s contacts. For those of you who have worked with the Google Contacts API, you know this isn’t simple. Given the heterogeneous landscape of contact data APIs (or lack thereof), there are several ways to skin this cat. Some require time or eTag-based polling. Some require a dedicated HTTP endpoint to send updates. I’ll be diving into much more detail on this in the coming weeks.
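As a rough illustration of the eTag-based polling style, here is a small sketch. The remote API is faked in-process, and the `Fetch` signature, `EtagPoller`, and `make_fake_remote` are hypothetical names invented for this example; a real integration would send the eTag over HTTP and would look different for every provider.

```python
from typing import Callable, Optional, Tuple

# Hypothetical fetch signature: takes the last eTag we saw (or None) and returns
# (current_etag, contacts), where contacts is None if nothing changed.
Fetch = Callable[[Optional[str]], Tuple[str, Optional[list]]]


class EtagPoller:
    def __init__(self, fetch: Fetch):
        self.fetch = fetch
        self.etag: Optional[str] = None
        self.contacts: list = []

    def poll(self) -> bool:
        """Poll once; return True if the local copy was refreshed."""
        etag, contacts = self.fetch(self.etag)
        if contacts is None:
            return False  # equivalent of an HTTP 304 Not Modified
        self.etag, self.contacts = etag, contacts
        return True


def make_fake_remote(state: dict) -> Fetch:
    # Stand-in for a remote API; a real one would compare eTags server-side.
    def fetch(known_etag):
        if known_etag == state["etag"]:
            return state["etag"], None
        return state["etag"], list(state["contacts"])
    return fetch


remote = {"etag": "v1", "contacts": ["Dan"]}
poller = EtagPoller(make_fake_remote(remote))
first = poller.poll()    # first poll always downloads
second = poller.poll()   # unchanged eTag, nothing downloaded
remote.update(etag="v2", contacts=["Dan", "Bart"])
third = poller.poll()    # eTag changed, local copy refreshed
```

The point of the eTag is that the common case (nothing changed) costs almost nothing, which matters once you are polling on behalf of a lot of users.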
Find the duplicates
If you have two contacts with the name “Dan Lynn”, do they represent the same person? Maybe. But maybe not.
According to some cocktail napkin math I did using our Name Stats API, there are about 632 individuals in the United States with the name “Dan Lynn” or “Daniel Lynn” (and 3 of them are likely female!). Inside of one end-user’s address book, the full name “Dan Lynn” alone is likely to be enough information to consider two contacts duplicates. What about a whole organization’s collective address book (potentially millions of contacts)? You have to accept that your system will get it wrong sometimes, and plan around this reality.
Any system you build to find duplicate contacts is attempting to solve an Information Retrieval problem and, as such, it’s best to use some accepted terminology to measure how well your system works. We use the measures of Precision and Recall to track our progress. Simply put, when you’re looking for something (like a duplicate of a given contact), Precision measures the percentage of what you found that is actually correct. Recall measures the percentage of what actually exists that you were able to find. We pay very close attention to both of these measures, but we prefer to optimize Precision first.
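Here is how those two measures work out on a toy example. The contact IDs and the pairs below are made up, and `precision_recall` is just an illustrative helper, not part of any FullContact API.

```python
def precision_recall(found: set, actual: set) -> tuple:
    """Return (precision, recall) for a set of predicted duplicate pairs."""
    correct = len(found & actual)
    precision = correct / len(found) if found else 1.0   # nothing found, nothing wrong
    recall = correct / len(actual) if actual else 1.0    # nothing to find
    return precision, recall


# Hypothetical contact IDs; each frozenset is one "these two are duplicates" claim.
actual = {frozenset(p) for p in [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]}
found = {frozenset(p) for p in [("a", "b"), ("c", "d"), ("a", "z")]}

precision, recall = precision_recall(found, actual)
```

In this example the system claimed three duplicate pairs and two were right (Precision of 2/3), while four real pairs existed and it found two of them (Recall of 1/2). A wrong merge hurts Precision, which is why Rule #1 pushes us to protect that number first.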
Stay tuned for a detailed post that gets into the messy details of this problem.
Merge the duplicates automatically
Ok, your system found a bunch of duplicate “Dan Lynn” contacts across 2 Google accounts, Twitter, Foursquare, and LinkedIn. Now you have to merge them together into one Unified Contact. How should you do it? Your system needs to simply do what the user intuitively expects but, of course, not all users have the same expectations.
You could just union the sets of phone numbers, emails, URLs, etc…? That gets close, but there is a laundry list of edge cases. You have to choose a name if you’re merging “Daniel Lynn” and “Dan Lynn”. You have to handle a huge variety of phone number and email formats, eliminating duplicates. You could simply concatenate plain text fields like “notes” but, when it comes to change detection and synchronization, you’ll pay for it later if you don’t attempt to resolve duplicate notes.
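A minimal sketch of the "union with normalization" idea might look like the following. Everything here is an assumption for illustration: the field names, the longest-name heuristic, and especially the phone handling, which naively assumes US numbers and is nowhere near robust enough for real address books.

```python
import re


def normalize_phone(raw: str) -> str:
    # Simplifying assumption: digits only, and an 11-digit number starting
    # with 1 is a US number with its country code. Real data needs more care.
    digits = re.sub(r"\D", "", raw)
    return digits[1:] if len(digits) == 11 and digits.startswith("1") else digits


def normalize_email(raw: str) -> str:
    return raw.strip().lower()


def merge_values(values, normalize):
    """Union values, keeping the first spelling seen for each normalized key."""
    merged = {}
    for value in values:
        merged.setdefault(normalize(value), value)
    return list(merged.values())


def merge_contacts(contacts):
    names = [c["name"] for c in contacts if c.get("name")]
    phones = (p for c in contacts for p in c.get("phones", []))
    emails = (e for c in contacts for e in c.get("emails", []))
    return {
        # Assumed heuristic: prefer the longest name ("Daniel" over "Dan").
        "name": max(names, key=len) if names else None,
        "phones": merge_values(phones, normalize_phone),
        "emails": merge_values(emails, normalize_email),
    }


unified = merge_contacts([
    {"name": "Dan Lynn", "phones": ["(303) 555-0100"], "emails": ["Dan@Example.com"]},
    {"name": "Daniel Lynn", "phones": ["+1 303-555-0100", "303.555.0199"],
     "emails": ["dan@example.com"]},
])
```

Comparing on a normalized key while keeping the user's original spelling is one way to avoid surprising people: the duplicates disappear, but nothing they typed gets silently reformatted.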
Merging multiple contacts and cleansing the result deserves a post of its own. Check back in a few weeks and we’ll have one for you!
Keeping a copy of remote address books up to date is a pretty simple problem. Copying changes back to those remote sources is more complicated, but not by much. Merging a bunch of contacts together and then updating the original sources is another matter entirely. First, not all remote address books allow 3rd party writes. Some systems only allow conditional writes or updates to limited portions of a contact. You have to keep track of the fact that your system’s local copy of the contact may have drifted from the remote system’s copy.
If you merged two duplicates that came from the same remote system, you should probably apply the merge on the remote system too. Some systems support a “merge” operation, which takes care of also merging related records (e.g. labels in Google Contacts or message history in a CRM), but most do not, leaving your system with the task of keeping the remote system consistent in the event you have to update one contact and delete the duplicate(s). On top of all of that, you have to contend with remote API rate limits and spotty service availability.
3rd party systems might synchronize with other systems. This can create loops in which contact changes propagate endlessly around multiple systems, wasting CPU and disk space all the way. Detecting and terminating these loops is essential to creating a reliable system.
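One simple way to break such a loop is to fingerprint each contact's content and refuse to propagate a change that carries content you already hold. The sketch below shows only that idea; `EchoFilter` and `fingerprint` are hypothetical names, and this approach would miss loops where a remote system rewrites or reformats fields along the way.

```python
import hashlib
import json


def fingerprint(contact: dict) -> str:
    # Stable hash of the contact's content, independent of key order.
    payload = json.dumps(contact, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class EchoFilter:
    """Drops incoming changes whose content we already hold."""

    def __init__(self):
        self.last_seen = {}  # contact_id -> fingerprint of the current content

    def should_propagate(self, contact_id: str, contact: dict) -> bool:
        digest = fingerprint(contact)
        if self.last_seen.get(contact_id) == digest:
            return False  # an echo of a change we already applied; stop here
        self.last_seen[contact_id] = digest
        return True


echo_filter = EchoFilter()
change = {"name": "Dan Lynn", "email": "dan@example.com"}

first = echo_filter.should_propagate("c1", change)         # genuine change
echoed = echo_filter.should_propagate("c1", dict(change))  # same content coming back
edited = echo_filter.should_propagate("c1", {**change, "email": "dan@example.org"})
```

Here the first change goes out, the identical copy that bounces back is dropped, and a real edit is allowed through again.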
We’ll have a post dedicated to the bidirectional challenges of contact synchronization coming soon.
Rule #2: Keep Address Books Secure
Contacts are personal. Contacts represent the relationships in our lives and, as such, must be treated with great care. A user’s address book must be protected against unauthorized access, and the related systems must be hardened against intrusion. As several high-profile hacks over the past few years have shown, if you build it, they (i.e. intruders) will come.
Now do it all at scale
Lots of these problems become exponentially more complicated once you realize that there are tens of billions of contacts out there in the wild. The average FullContact user has 1300 contacts. For every million users that sign up, we add 1.3 billion contact records! We also keep a version history for every one of these contacts. That’s a lot of data, but it’s a LOT of data when you consider that each contact can also have several photos.
In a synchronized world, contacts constantly change. Users add contacts on their phones and email accounts without even realizing they are creating contacts. These changes add up to quite a bit of real-time synchronization work. Keeping up with this work requires a lot of computers. Keeping all those computers (AWS instances, in our case) running smoothly is a significant operations challenge in and of itself.
We likely won’t have a dedicated scaling post, but scaling and reliability concerns will appear in every one of the other posts in this series.
We’ve come very far down this path, and I’m very proud of the system we’ve built. I can’t wait to dive into more detail on these different topics. Check back here in a couple of weeks for the next post on how we built version control for your contacts!