Providing Multi-Field Enrichment and real-time identity resolution via API has been a long-time goal for FullContact. Since launching the Enrich API in the fall of 2017, we have been quietly evolving our person-centric Identity Graph to be more dynamic, consistent, and accurate. Below, I am excited to share some of the technological advances we made along the way to provide you, our customers and partners, with the long-awaited (and much-requested) experience.
In all cases, the below has been achieved for API and Batch workflows alike and will be released on July 1st, 2019:
- Multi-Field Enrichment capabilities for all Person Enrichment
- Security improvements with per-account encryption on usage and storage
- New lookups by social usernames, URLs, and IDs, as well as by name and postal address
- New Enrichment Data Packs, many focused on consumer-level data
- Absolute parity between Batch and API workflows
Out With the New and Back With the Old
One of the key strengths and advantages of FullContact is that we grew up in a world of APIs. Since 2010, we predominantly operated in the API ecosystem building out numerous internal and external APIs; most recently our Person Enrich API. As expected, we have always been capable and comfortable in the near real-time world exposing data and integrating with data partners. However, as we continued to grow as a company and diversified our customer base, we soon discovered that many of our customers and partners were more fluent with batch data processing systems.
Over the last few years, we flexed a new muscle in Spark-land, taking on all the new challenges of a young technology head-on. In the end, we wanted a cleaner algorithm, faster results, and a cheaper process to rebuild our entire identity graph from the ground up in less than a day. This is no small feat, as the graph entails trillions of edges and observations drawn from a diverse set of sources.
The ID to Rule Them All
The audacious goal of FullContact has always been to have a stable, unique identifier for every person in the world. For a long time, we struggled to master the engineering challenge of creating such an identifier, which is simple in concept yet complex in practice. Most simple solutions involve a random identifier and a rolling registry of all historical matches. But to be best-in-class, we knew that such a naive solution would only take us so far.
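To make the naive approach concrete, here is a minimal Python sketch of a random identifier backed by a rolling registry of historical matches. It is an illustration only and not our Identity Resolution algorithm; the class and method names are hypothetical.

```python
import uuid


class NaiveIdRegistry:
    """Toy registry: random person IDs plus a history of merged IDs."""

    def __init__(self):
        self.id_by_key = {}   # contact key (e.g. an email) -> person ID
        self.redirects = {}   # retired person ID -> surviving person ID

    def current(self, person_id):
        # Follow the merge history until we reach an ID that is still live.
        while person_id in self.redirects:
            person_id = self.redirects[person_id]
        return person_id

    def resolve(self, keys):
        """Return one ID for a record that carries several contact keys."""
        known = {
            self.current(self.id_by_key[k]) for k in keys if k in self.id_by_key
        }
        # Reuse an existing ID when possible, otherwise mint a random one.
        survivor = min(known) if known else uuid.uuid4().hex
        # Any other ID we just matched is retired and redirected.
        for retired in known - {survivor}:
            self.redirects[retired] = survivor
        for k in keys:
            self.id_by_key[k] = survivor
        return survivor
```

The weakness shows up as soon as two previously separate IDs merge: one of them is retired, so anything stored under it only stays reachable by walking the registry, which grows with every merge.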
Over the last couple of quarters, our Identity Resolution team maintained a steadfast pursuit of this powerful algorithm to create a more fluid, yet unique, identifier, which we call our FullContact ID (FCID). As part of our GA release, we are bringing two key new feats of FullContact engineering to bear:
- A significantly more advanced Identity Resolution algorithm
- New matching capability in Name and Postal Address
With these new features and the new support for multi-field input, we are seeing significantly higher match rates (up to 80%) while maintaining parity between the API and Batch matching methods. This means that the same combination of name, email, address, and phone will yield the exact same match rate, up to 80%, whichever method is used. The subtle detail here is that the systems providing these match rates use completely different technologies: one is an API with supporting databases, and the other is Spark.
What we are most excited about is that we have set the table to enable new functionality by:
- Exposing these unique IDs to our customers and partners in an obfuscated form
- Supporting an even more exotic and diverse set of linkages to match on
Both of which we hope to put on display soon!
Purpose Built Data Ingest
Quite frequently we found ourselves having to ingest data for third-party data installs or to load customer data to perform various match tests or data appends. Without the right tooling, it could take days! As Spark came into focus, we quickly realized its potential beyond our patented Identity Resolution algorithms. We discovered we could free ourselves from our old Java application ways of ingesting flat files and replace them with fancier technology.
We started using Airflow to help orchestrate the compilation of our identity graph and to better sequence our workflows in AWS Elastic MapReduce (EMR). Leveraging these technologies, we created a new way to steward large flat files through our systems, which we fittingly named “Data Pipeline”. The pipeline is more advanced than the building blocks you get for “free” with AWS, but is in many ways not so different.
We have multiple stages to map inputs, apply our identity resolution algorithm, compute stats, and so on, and we can run a multi-million-row file through the system in less than an hour. The end result is a data set that is keyed by our FCID and hence joinable against any other similarly processed data set. All results are in Parquet format, which is great for further tooling in Spark or in AWS Athena.
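The shape of those stages can be sketched in a few lines of plain Python. This is only an illustration: the real pipeline runs on Spark and writes Parquet, and the hash below is a stand-in for identity resolution, not the FCID algorithm. All function and column names are hypothetical.

```python
import hashlib
from collections import Counter


def map_inputs(row):
    # Stage 1: map a raw input column onto a canonical, normalized field.
    return {"email": row["Email"].strip().lower(), "attrs": row}


def resolve(record):
    # Stage 2: placeholder for identity resolution. A hash of the email
    # stands in for a person ID purely so the example is runnable.
    digest = hashlib.sha256(record["email"].encode("utf-8")).hexdigest()
    return {**record, "person_id": digest[:16]}


def run_pipeline(rows):
    # Run every row through the stages and collect simple stats (stage 3).
    stats = Counter()
    keyed = {}
    for row in rows:
        stats["rows_in"] += 1
        record = resolve(map_inputs(row))
        keyed[record["person_id"]] = record["attrs"]
    stats["ids_out"] = len(keyed)
    return keyed, stats


def join_on_id(left, right):
    # Two outputs keyed by the same ID join with a plain key intersection.
    return {pid: (left[pid], right[pid]) for pid in left.keys() & right.keys()}
```

Because both outputs share the same key, a customer file and a third-party file that went through the same stages can be joined without comparing any raw contact fields.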
Today, we are successfully leveraging our “Data Pipeline” to assist in batch customer deliverables as well as match tests. Ultimately this has led to engineering efficiencies and has increased our accuracy, match rates, and yield on customer match tests.
Adapting for Real-Time
The other set of advancements over the last few quarters is around our Person Enrich endpoint. We spent quality time improving stability, security, and latency, and of course added new capabilities along the way. We focused on speed of access, achieving 50ms inside of AWS and around 150ms outside of AWS. Feature-wise, we enabled new lookup capabilities, such as name and postal address, and opened the floodgates by making all social handles/IDs queryable. Name and postal address is a not-so-new capability from the offline world, but with its addition to our API, we are now enabling a truly omnichannel experience. This enables marketers to better target their customers and offer them a more uniform experience across multiple channels, like email, postal, or digital. When paired with our new consumer Data Packs, we can help our customers and partners better understand their customer bases on a request-by-request basis.
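As a rough illustration of what a multi-field lookup looks like from the caller's side, the sketch below assembles a request body from whichever identifiers are on hand. The function name and the field names (`email`, `phone`, `name`, `location`, `profiles`) are assumptions made for this example and should not be read as the documented API schema.

```python
def build_multi_field_query(email=None, phone=None, full_name=None,
                            address=None, profiles=None):
    """Assemble a lookup body from whichever identifiers the caller has."""
    candidates = {
        "email": email,
        "phone": phone,
        "name": {"full": full_name} if full_name else None,
        "location": address,    # e.g. a dict of postal address parts
        "profiles": profiles,   # e.g. a list of social handles or IDs
    }
    # Drop anything the caller did not supply so the body stays minimal.
    query = {field: value for field, value in candidates.items() if value}
    if not query:
        raise ValueError("at least one identifier is required")
    return query
```

The point of the sketch is that every identifier is optional on its own, and the more of them a request carries, the more the resolver has to match on.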
As part of our security improvements and operational stability measures, we have gone to great lengths to protect our users’ data and are seeking to gain SOC 2 Type 2 compliance. We have built in many security measures and activated per-account encryption to keep account-specific information locked up, along with an audit trail for decryption requests. We built out a dedicated system to handle all encryption and decryption to minimize the potential for any keys to be leaked. With our usage logs encrypted, we can support cryptographic wiping at a customer’s request simply by deleting the decryption key. We originally built out this capability as part of our Private Plan query option, which allows our customers and partners to ensure their queries remain obfuscated and truly “private”.
For this release, we are pleased to offer a more secure, performant, and dynamic API to our partners and customers that offers true real-time identity resolution and Data Packs. We are very excited about this real-time capability that is not only something new for us at FullContact but is new to the entire industry.
The Data Pipeline served us well for the Batch case, but most recently we took it a step further and engineered a process to take data sets out of the Data Pipeline and make them available for APIs to access. More specifically, we needed a way to take flat files from customers or third parties, key them by our FCID, and then make them accessible for random access lookups. This way we could expose an updated data set in both Batch and API on the same day. We coined this process “Data Onlining”.
We leveraged Spark and Airflow to orchestrate the process of taking offline Parquet files from the Data Pipeline, keyed by our FCID, and transforming them into basic HFiles. Once we have the HFiles, we can boot an HBase cluster around them. HBase is built on HDFS, which is native to the Hadoop and MapReduce ecosystem – we tried a few other database options but didn’t find something that felt mature enough to advance with. Our Data Onlining process leverages this HDFS convenience and allows us to create massive indices straight out of MapReduce.
Airflow helps us orchestrate the “how” and “when” of both remapping our data and booting up a new cluster when new data arrives. The process is fully capable of handling hundreds of millions of records, can be completed in a matter of hours, and the resulting HBase database has single-digit millisecond latency.
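The core idea behind Data Onlining, building a sorted, immutable index offline and then serving point lookups from it, can be sketched in plain Python. Nothing here uses HBase or HFile APIs; the sorted lists merely stand in for a key-sorted file, and all names are hypothetical.

```python
import bisect


def build_index(records):
    """Offline step: sort (key, value) pairs once, as a key-sorted file would be."""
    rows = sorted(records.items())
    keys = [key for key, _ in rows]
    values = [value for _, value in rows]
    return keys, values


def lookup(index, key):
    """Online step: binary-search the sorted keys for a single ID."""
    keys, values = index
    position = bisect.bisect_left(keys, key)
    if position < len(keys) and keys[position] == key:
        return values[position]
    return None


class OnlineStore:
    """Serve reads from the current index; a rebuild replaces it wholesale."""

    def __init__(self):
        self._index = ([], [])

    def publish(self, records):
        # The new index is built completely before the reference is swapped,
        # so readers see either the old data set or the new one.
        self._index = build_index(records)

    def get(self, key):
        return lookup(self._index, key)
```

Rebuilding and swapping the whole index, rather than updating records in place, mirrors the way a fresh data set gets a freshly booted cluster in the process described above.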
This behind the scenes capability is part of our launch as it will allow us to move more swiftly in rolling out algorithmic changes as well as enabling us to efficiently provide the freshest data possible in both Batch and API.
Unified Usage Analytics
The last part of our recently updated tech stack was to integrate one more shiny new gem of technology – Druid. If you are not familiar with Druid, it is a “high-performance, real-time analytics database” and is fantastic at aggregating data and providing a SQL-like query interface on top of it. Druid is also designed from the ground up to be scalable and redundant. We have followed the common pattern of deploying Druid using three different server types (master, query, and data) with the ZooKeeper and metadata databases residing on different hardware. All of this means that the entire set of Druid servers can be lost and restored with little to no loss of data. We plan on sharing more details around the setup and management of the underlying infrastructure in a separate blog post.
As part of our relationship with various customers and data partners, we needed to build a more advanced usage tracking mechanism for all the data we return to our partners and customers on a request-by-request basis. Using a combination of Kafka, Avro, and Schema Registry, Druid reads messages off a topic, then ingests and indexes them in a way that enables rapid aggregation and insight. The best part about Druid is that we can define pre-rolled-up aggregations that are applied to the data points as they are ingested. This reduces the data footprint and allows our backend to query Druid using SQL-like syntax to return usage by account, time period, and Data Pack in “UI time”. This rapid aggregation of usage enables our customers and partner managers to get near real-time feedback on how our Identity Resolution systems (API and Batch) are being used to solve real-world problems.
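Roll-up at ingestion time is easy to picture with a small Python sketch. This is not Druid code or configuration; it only mimics the idea of collapsing raw events into one row per time bucket and dimension combination. The event fields, the hourly granularity, and the Data Pack names are assumptions for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(ts):
    # Truncate a timestamp to the start of its hour (the roll-up granularity).
    return ts.replace(minute=0, second=0, microsecond=0)


class RollupTable:
    """Aggregate usage events as they arrive instead of storing each one."""

    def __init__(self):
        self.rows = defaultdict(lambda: {"requests": 0, "matches": 0})

    def ingest(self, event):
        key = (hour_bucket(event["ts"]), event["account"], event["data_pack"])
        row = self.rows[key]
        row["requests"] += 1
        row["matches"] += 1 if event["matched"] else 0

    def usage(self, account, start, end):
        """Requests per Data Pack for one account in the window [start, end)."""
        totals = defaultdict(int)
        for (hour, row_account, pack), row in self.rows.items():
            if row_account == account and start <= hour < end:
                totals[pack] += row["requests"]
        return dict(totals)


table = RollupTable()
noon = datetime(2019, 7, 1, 12, tzinfo=timezone.utc)
for minute in (5, 20, 45):
    table.ingest({"ts": noon.replace(minute=minute), "account": "acme",
                  "data_pack": "demographics", "matched": True})
# Three raw events in the same hour collapse into a single stored row.
```

Queries then scan a handful of pre-aggregated rows rather than every raw event, which is what keeps per-account usage lookups fast.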
Sticking with the theme above of ‘parity’, we also wanted to be able to generate exactly the same usage reports when common inputs are processed by both Batch and API. When ingesting batch files through the Data Pipeline, we use the same extraction libraries we have for the API. Usage reports on the batch file are calculated on a row-by-row basis and persisted in Amazon S3. When the batch file is delivered to our partners, we finalize and commit the usage report by streaming it to the same Avro-formatted Kafka topic the API writes to. Once the data is on the Kafka topic, it is both ingested by Druid for the rapid aggregations described above and persisted to Amazon S3 in a columnar Parquet format so that it is available for other types of queries not suited to Druid (joins, etc.).
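The parity argument boils down to one shared extraction function with two callers. The sketch below shows that shape in Python, with a plain list standing in for the Kafka topic and an in-memory list standing in for the staged report in S3; the function names and the `data_packs` field are hypothetical.

```python
def extract_usage(account, response):
    """Shared extraction: one usage event per Data Pack in a response."""
    return [
        {"account": account, "data_pack": pack}
        for pack in sorted(response["data_packs"])
    ]


def handle_api_request(account, response, topic):
    # API path: usage events are published as soon as the response is served.
    topic.extend(extract_usage(account, response))


class BatchUsageReport:
    """Batch path: stage usage row by row, publish only on delivery."""

    def __init__(self, account):
        self.account = account
        self.staged = []

    def add_row(self, response):
        self.staged.extend(extract_usage(self.account, response))

    def commit(self, topic):
        # Called once the batch file has been delivered to the partner.
        topic.extend(self.staged)
        self.staged = []
```

Since both paths call `extract_usage`, the same inputs produce the same events; the only difference is when those events reach the topic.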
What does this mean for our awesome customers and partners? We can now offer you a much-improved experience when understanding your usage. Our Developer Dashboard will have new charts and graphs and the capability to roll up custom periods quite quickly. Furthermore, our Stats API will serve data from a new source and be both snappier and more accurate.
On the Horizon
As we glance out onto the horizon, we look to expand our breadth of matching capabilities and our selection of Data Packs, creating fine-tuned solutions that support our customer and partner needs. We want to further assist our customers and partners by providing FCIDs for de-duplication and a localized identity resolution experience.
Beyond the near term, we realize and acknowledge the challenges of privacy, consent and a shifting landscape on the Digital Identifier front. We are positioning ourselves to embrace these challenges rather than shying away. We believe in an open identity graph where each individual has the power to truly own their data and hope to continue to build trust with our users. Privacy and consent are tough nuts to crack, but we see them as the way into the future.
FullContact has a lot of great tech and talent and we are always looking to add great people to our team. If you are curious, hardworking, and passionate about helping us solve the future problems around consented identity, please apply here!