Longtime users of FullContact’s Person API know that our data is improving every day, but isn’t always perfect. This is especially true when it comes to photos from users’ social profiles. The problem only becomes worse when looking at the profile of someone with a broad social media presence.
The Problem – Duplicate Images
All the big social media sites take the (reasonable) step of processing images before showing them. This makes FullContact’s job difficult, though; now one user maps to many image files! This happens even when the user uploads the same exact file each time. Each social media site’s “special sauce” is going to be a different combination of compression, reformatting, cropping, and other adjustments. While the image looks the same from each source, each version will have minute differences. This makes it difficult for a computer to determine whether two images are really “the same.”
|Service||Stored At||Displayed At|
|200 x 200||200 x 200|
|400 x 400||200 x 200|
|Google+||Original Size||120 x 120|
|165 x 165||165 x 165|
The result of this in our API is the appearance of unwanted images. A human could easily recognize which of these FullContact should remove, but we can’t have a human look at each of our tens of thousands API queries every minute. And the system as-is sees the image bytes as different even when the images depict the same scene. To address this, we set out to develop an automated system which could identify any duplicate images that show up as we assemble API responses.
Of course, this is not wholly unexplored territory. Academic papers abound that work on this problem, as well as much more difficult variations; “Decide if two images depict the same scene” is just the starting point for this entire class of problems. The requirements of our particular variation are what make this a meaty and intriguing problem.
Each deduplication task is small
Some of the worst profiles we’ve found in our system, with regard to images, still only have a couple dozen images. We can afford to be thorough by examining each pair of images to see if they are duplicates. O(n2) complexity is not as big a deal in practice when n is small.
We need to determine duplication in milliseconds to avoid blocking requests
If we have to choose between perfect accuracy and fast response times, we want the latter every time to satisfy customers. Calculating identifying information about an image when a request comes in will not be fast enough; the network latency imposed by having to fetch the image bytes on every request would be crippling. This leads us to use some form of identifier caching and lookup.
Most of the images we want to suppress are resized versions of other images
Here we find some maneuvering room with regard to our choice of algorithm. There is no free lunch, but we have options. We can trade something we don’t care about right now (performance on cropped and rotated images) for something we do (performance on scaled images). A more general algorithm which is sensitive to crops and rotations would be fantastic, but need more effort than is currently practical to accomplish at our scale.
Once we find duplicates, we need to also pick the best one.
The complex, expensive, accurate approach here is to score images based on a variety of visual defects. The simple, cheap, usually-accurate approach is to pick the biggest image. Spoiler Alert: We went with the simple approach for this one.
In a perfect world, we could have the perfect algorithm. It would be fast. It would be accurate. It would work on all the images we ever see. Alas, we do not live in a perfect world. We have to compromise.
Our algorithm must be fast out the door, but can be slow behind the scenes. We don’t need to find all duplicates, only the most obvious ones. Our solution doesn’t even need to work on all the images we’ve ever seen at once, just the ones in any single profile. All of these constraints help to drive us towards the algorithm that we want.
We want something called a perceptual hash.
The Solution – PHASH
After some cursory internet searches for existing techniques that might apply to our problem, we found this detailed blog post by Dr. Neal Krawetz and immediately knew we had a winner. Dr. Krawetz describes a technique for identifying near-duplicate images with minor changes using discrete cosine transforms, originally inspired by the work of the pHash (“perceptual hash”) project.
A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are “close” to one another if the features are similar.
The algorithm converts a color image of any resolution to a 64-bit hash value, and the way this hash is constructed leads to the useful property that two images with small visual changes will have only small hash changes. The bitwise distance (Hamming distance) between two hashes is then a similarity score for the images they represent. Counting bits — that sounds like a problem a computer will be really good at!
Dr. Krawetz breaks the process down into just over a half-dozen steps, which are:
- Downscale the image to 32×32 pixels
- Convert the image to grayscale
- Compute the 2D Discrete Cosine Transform (More detail below)
- Select the coefficients for the 8×8 lowest frequencies from the transform
- Compute the average of 63 of the 64 coefficients, dropping the coefficient corresponding to flat color (the DC term)
- Convert each coefficient of the DCT to a single bit: 1 if greater than or equal to the average, 0 if less
- Apply a consistent ordering to the 64 generated bits to construct a hash value
Each image is given a unique identifier, which we can use to look up its hash before it goes out the door to a caller. We then store these identifier-hash pairs in a table on Amazon’s DynamoDB, which gives us the flexibility to scale up database throughput as needed without worrying about hardware limits. We’re currently hashing up to 200 images per second, and deduplicating 4 times that, with plenty of room to grow. DynamoDB’s NoSQL style also means we can easily add new data (gender, faces, age, etc.) as we find ways to extract it from our images.
With the hashes in hand it is very easy (and very fast!) to calculate the Hamming distance between each image in the response and merge any images that fall within a similarity threshold. We also check against a small list of blacklisted images so that we don’t display Twitter eggs, Gravatar G’s, Facebook silhouettes, and so on. In all, it takes us less than 100ms to fully deduplicate a set of a dozen previously-seen images taken from a contact.
Why It Works
The reason this pHash implementation works so well (despite being a surprisingly lightweight algorithm) is that it distills out only the coarsest visual information — the bits of color and shape that would be visually distinctive or similar to the human eye.
To see why this works, take a look at the 1-dimensional Discrete Cosine Transform. Being a member of the family of Fourier-Related Transforms, the DCT decomposes a signal (specifically a discretely sampled signal, hence discrete cosine transform):
f(x) = cos(x) + 0.5 * cos(x * 2) + 0.25 * cos(x * 4) [100 samples over the range (0, 8)]
Into a finite set of scalar coefficients which correspond to cosine functions of varying frequencies:
The height of each bar corresponds to the strength of each cosine frequency in the original signal. The distance from zero doubles for each of the three strongest frequencies, while their strength halves each time, which matches the original function.
The weighted sum of those frequencies can reconstruct the original samples with a minimal amount of error:
This shows the weighted sum of the three most prominent frequencies, and the resultant signal. Note how the strongly-weighted low frequency (red) usually overpowers the mid and high frequency signals (green, blue). (The change in scale from the original is due to a lack of normalization)
If only a subset of the frequencies from the DCT are used to “reconstruct” the signal, there is a clear difference in similarity to the original:
Utilized frequencies for the Inverse DCT are highlighted in green beside the corresponding reconstruction.
Notice how the low-frequency reconstruction is much more recognizable as the original signal! The high-frequency reconstruction looks more like noise than a recognizable corruption of the original. It’s easy to see that the low frequencies carry most of the information for the signal, and the high frequencies can be effectively ignored for comparisons.
It’s easy to picture the one-dimensional case as a function of time — what’s the value of the signal at time x? But if we consider it as a function of space — what’s the value of the signal at position w — we can more easily extend the concept to 2 dimensions. Imagine such a 2-dimensional signal as a mountain range, full of peaks and valleys. If you hold one of the coordinate dimensions constant, you can extract a 1-dimensional signal by “slicing” the terrain.
If the frequencies in each dimension of the 2-dimensional signal are varied regularly and independently of each other, a grid of component 2D signals can be constructed.
Credit to wikimedia foundation. White = high, black = low.
The two-dimensional DCT decomposes a two-dimensional signal into the coefficients for each of the n*n component signals such that the weighted sum of the components reconstructs the original with minimal error — just like the 1D case! Of special note in the component signals is the (0,0) signal, corresponding to flat signal (color) adjustment. We exclude it from our average coefficient calculation for pHash because a slightly brighter version of an image should be “the same” image, whereas an image that is bright on just one side is not.
How well does pHash fit our original constraints?
1. Each deduplication task is small
This works in our favor, because even looking at every pair of images (O(n2)) is feasible for these values of n. There’s no reason to let profiles with excessively large sets of photos give us a hard time, however, so we actually use a O(n log n) DBSCAN algorithm and treat each hash as a 64-dimensional point.
2. We need to determine duplication in milliseconds to avoid blocking requests
Once the hash values are in hand, this is easily met. Using the DBSCAN algorithm we see clustering times in the single-digits of milliseconds. Caching and retrieving the hash values is noticeably more expensive than bit-counting their XOR, and a decent database makes retrieval plenty fast enough to do in real time.
3. Most of the images we want to suppress are resized versions of other images
This is an image attack which pHash is very good at identifying. Normalizing the size of each image to 32×32 ensures that the same image at multiple resolutions consistently maps to the same DCT coefficients. Cropped and rotated images are less common, and more difficult to detect with pHash because they disturb the position of features in image space. Missing most of those images is the biggest compromise we have to make with this system.
4. Once we find duplicates, we need to also pick the best one.
Selecting the best image is not actually addressed by using pHash. But it’s exceedingly rare that we see a poorly upscaled image alongside a crisp, low-resolution one, so choosing the largest image among duplicates is perfectly fine.
And that’s how FullContact is now deduplicating photos. Just 8 bytes can represent the underlying visual structure of an image well enough to match up most images in a sea of potential duplicates.
Those same images from before, after deduplication.