At FullContact, we’re always experimenting with new technologies and techniques. Machine learning has come into vogue as of late, and has shown some impressive results within our company and without. Recently, we had an opportunity to apply some machine learning to improve our handling of job title data we find throughout the web. The choice, however, was not an easy one. To someone unfamiliar with the technology, machine learning can seem complex, foreign, expensive, and hard. Training data, precision and recall, neural networks, scary-sounding academic papers — it’s overwhelming. Given such a large investment, is it really the right choice? In this post, I’ll try to explain our thought process at FullContact on solving a tough problem, and how we decided to use machine learning.
FullContact handles job title data in a variety of forms. Understanding job titles at a deep level is critical for contact search, analytics, and formatting for use in our APIs and consumer products. One problem that we had historically not solved was department classification. Given a job title, we needed to classify it into a corporate department.
For instance, a job title input of “software engineer” should return a department classification of “engineering.” “Controller” should return the “finance” department. “Customer Success Manager” is “Customer Support.”
This problem may seem simple, but it quickly becomes complicated after further thought. For instance, what rules could we use to classify a job title into the “engineering” department? Maybe if the job title contains the word “engineer”? This works for “software engineer” but fails for “programmer,” which should also be an engineering job. It also classifies “support engineer” as “engineering,” when at many companies that position is a customer support job. The story is similar for many other professions, and simple rules like “contains the word engineer” quickly become a quagmire of logic and exceptions.
This was the state of our department classification methodology. It worked for common job titles, but it was painful to maintain and had undefined behavior for job titles outside the norm. Addressing bugs in this system usually meant “add another special case,” because fixing the underlying logic was either not useful or too complex.
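To make the “quagmire of logic and exceptions” concrete, here is a minimal sketch of what such a rule set tends to look like. The function name, the rules, and the department labels are illustrative assumptions, not FullContact's actual code.

```python
def classify_by_rules(title):
    """Naive keyword rules. The rule set here is hypothetical and deliberately small."""
    words = title.lower().split()
    # Special case bolted on later: support engineers usually sit in customer support.
    if "support" in words and "engineer" in words:
        return "customer support"
    if "engineer" in words:
        return "engineering"
    if "controller" in words:
        return "finance"
    # Titles the rules never anticipated fall through with no answer.
    return None


for title in ["Software Engineer", "Support Engineer", "Programmer"]:
    print(title, "->", classify_by_rules(title))
```

“Programmer” falls through to `None` even though it is an engineering title, and every fix of that kind means another branch to write and maintain.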
After talking about the problem, we decided that it was worth trying a prototype using machine learning. This was not a decision made lightly. We were well aware of the drawbacks of machine learning:
- It’s complex. Machine learning takes a lot of specialized knowledge. When something goes wrong, it can be hard to figure out why and how to improve the situation.
- It’s data-hungry and expensive. Most machine learning paradigms require a large amount of data to serve as training data for the algorithm. Even with good data, it can take a lot of time to get good results as time is spent training, tweaking parameters, and then training again.
- It’s easy to accrue technical debt. What happens if the developer on the project leaves? What if you want the output to change slightly for certain inputs? The rules-based approach, though complex, is “just code”: if a developer with no additional context spends enough time with it, they’ll come to a better understanding of the program. The machine learning approach has no introspection tool as simple as “read this source code,” and it requires more domain-specific knowledge about machine learning.
These factors made machine learning a difficult task to take on. Yet, we also understood that machine learning could potentially solve our problem more elegantly than any “bag of rules” approach. There’s a certain class of problems where machine learning does very well:
- When you have a lot of data at your disposal. With minimal data, any machine learning approach is going to have trouble (especially with no evaluation test set). In our case, we had hundreds of gigabytes of job titles and job title description data.
- When you have the relevant expertise. It’s easy to get lost in something more complex than you can handle (or more complex than the problem requires). We weren’t going to try speech recognition or a Chess AI, but we had some relevant in-house Natural Language Processing expertise that made machine learning more realistic.
- When the logic is hard to describe. In traditional AI approaches, the human has to write out every piece of logic that the computer uses. In some cases, this logic is not easily expressed or easily discoverable. To understand this, sit down with a couple of example problems from your data, and try to figure out a solution using only the human brain. Try to explain the solution clearly and concisely. If it’s not easy to explain, or there are multiple right answers, or it’s not clear what the 100 percent correct answer is, machine learning algorithms may be able to navigate the ambiguity better than a rigid set of rules.
- When you can’t easily write code for 80 percent of inputs. If it’s possible to write fairly simple, clean code that will handle 80 percent of cases correctly, consider just hard-coding or edge-casing the other 20 percent. For job department classification, we knew this wasn’t the case. There’s a huge diversity of titles.
- When you have no other choice. Machine learning is the “High Interest Credit Card” (https://research.google.com/pubs/pub43146.html) of technical debt. It’s very powerful, but if there are traditional approaches that work just as well, use those instead.
After reviewing this list, we decided that our problem fit the machine learning use case, and we went to work on a prototype. Going from nothing to a machine learning pipeline taught us many lessons through a process of trial and error. For instance:
- Iterate as quickly as possible. You need to see results and be able to track their progress long before shipping to production. The entire process of machine learning is a constant cycle of iteration.
- Using a test set is an absolute must. You can’t tell if the changes you’re making are an improvement if you can’t measure actual performance.
- Explore related work before trying to solve the problem. We used a lot of papers on job title normalization to help refine our process and ensure we were headed in the right direction.
- Never underestimate the power of incremental improvements. Our first pass at job title department classification had 50% precision and 50% recall. We spent the entire rest of the project getting that number to 95%.
Following these general guidelines, our application of machine learning definitely paid off. We ended up with an algorithm that correctly classified the department of a job title in 88 percent of our test set (and likely much more than that in production contexts). The algorithm correctly classified job titles that we had never considered.
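Measuring against a test set, as recommended above, can be as simple as comparing predicted departments with hand-labeled ones. The sketch below is one way to compute overall accuracy plus per-department precision and recall. The function name and the four example labels are made up for illustration and are not FullContact data.

```python
from collections import Counter


def evaluate(predicted, expected):
    """Return overall accuracy and a {department: (precision, recall)} mapping."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and the same length")
    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for p, g in zip(predicted, expected):
        pred_count[p] += 1
        gold_count[g] += 1
        if p == g:
            true_pos[g] += 1
    accuracy = sum(true_pos.values()) / len(expected)
    per_department = {}
    for dept in gold_count:
        # Precision: of the titles we assigned to dept, how many were right.
        precision = true_pos[dept] / pred_count[dept] if pred_count[dept] else 0.0
        # Recall: of the titles that belong to dept, how many we found.
        recall = true_pos[dept] / gold_count[dept]
        per_department[dept] = (precision, recall)
    return accuracy, per_department


# Hypothetical hand-labeled test set and model output.
expected = ["engineering", "engineering", "finance", "sales"]
predicted = ["engineering", "finance", "finance", "sales"]
accuracy, per_department = evaluate(predicted, expected)
print(accuracy)
print(per_department)
```

Re-running a script like this after every change is what makes fast iteration possible: each tweak either moves the numbers or it doesn't.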
How it Works
To understand the power of our machine learning classifier, we need to explain how the job title department classification process works. We use a program called word2vec to consume a large quantity of job title and job description data, observing how and where words appear near each other. The data is then used to train a neural network to assign a vector to every word in the dataset. The vectors have a variety of special characteristics, but the most obvious one is that words that are similar to each other have similar vectors (“running” and “sprint” are more similar to each other than “sprint” and “cake”).
With our newly created data set, we have a metric for similarity between words. When we want to classify a job title, we combine the vectors tied to each word in a given job title to create an associated vector. We then compare the job title vector with programmer-defined vectors that represent each department (the vector for the department “engineering” might be close to the vectors for the words “engineer,” “programmer,” and “development”). The department vector that is the closest to the job title vector is considered the correct classification.
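The comparison step can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the three-dimensional vectors are hand-written stand-ins for learned word2vec vectors, word vectors are combined by simple averaging, each department vector is the average of a few seed words, and “closest” means highest cosine similarity. The post does not say which of these choices the production system makes.

```python
import math

# Hypothetical toy vectors; real word2vec vectors are learned and much longer.
WORD_VECTORS = {
    "software": [0.9, 0.1, 0.0],
    "engineer": [0.8, 0.2, 0.1],
    "programmer": [0.9, 0.0, 0.1],
    "controller": [0.1, 0.9, 0.1],
    "accountant": [0.0, 0.9, 0.2],
    "recruiter": [0.1, 0.1, 0.9],
}

# Hypothetical seed words used to define each department vector.
DEPARTMENT_SEEDS = {
    "engineering": ["engineer", "programmer"],
    "finance": ["controller", "accountant"],
    "HR": ["recruiter"],
}


def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(title):
    """Return the department whose vector is most similar to the title vector."""
    known = [WORD_VECTORS[w] for w in title.lower().split() if w in WORD_VECTORS]
    if not known:
        return None  # no word in the title has a vector
    title_vec = average(known)
    dept_vecs = {
        dept: average([WORD_VECTORS[w] for w in seeds])
        for dept, seeds in DEPARTMENT_SEEDS.items()
    }
    return max(dept_vecs, key=lambda dept: cosine(title_vec, dept_vecs[dept]))


for title in ["Software Engineer", "Programmer", "Controller", "Recruiter"]:
    print(title, "->", classify(title))
```

Nothing in `classify` mentions a particular title. All of the knowledge lives in the vectors, which is why new titles can be handled without new rules.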
The algorithm that results from this machine learning approach leads to behavior that is much more elegant and nuanced than previous methodologies. There are some straightforward examples:
- “controller” is in the “finance” department
- “software engineer” is in the “engineering” department
- “recruiter” is in the “HR” department
None of the answers came from having explicit rules. The associations were learned automatically by looking at how often words occur near each other in our training data. This saves programmers huge amounts of work considering every possible title, domain, and department, as well as the many ways people commonly express these job titles. It is resilient to minor changes in expression:
- “senior software engineer in test” is still “engineering”
- “interim paralegal in residence” is still “legal”
- “programmer,” “developer,” and “computer engineer” are all still “engineering”
The resilience is due to the fact that low-signal words such as “interim” are weighted less in determining the results. What is really impressive is the ability of the model to learn associations for rare words:
- “account manager” is “sales”, but “creative manager” is “marketing”
- “people operations manager” is “HR”, even though the job title contains “operations”, a completely different department at most companies
- “seo strategist” is “marketing”
- “head of ux” is “product”, even though the term “UX” is only a few years old
- “southwest regional manager” is correctly classified as “sales”, even though the title itself contains no words that explicitly relate to sales.
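The down-weighting of low-signal words mentioned earlier (such as “interim” in “interim paralegal in residence”) can be pictured as a weighted average of word vectors. The sketch below is only an illustration: the function name, the two-dimensional vectors, and the weights are invented, and the post does not describe how the real weights are derived.

```python
def weighted_title_vector(words, word_vectors, word_weights, default_weight=1.0):
    """Weighted average of the vectors for the words that have one."""
    combined = None
    total = 0.0
    for word in words:
        if word not in word_vectors:
            continue  # skip words with no vector
        weight = word_weights.get(word, default_weight)
        vector = word_vectors[word]
        if combined is None:
            combined = [0.0] * len(vector)
        for i, value in enumerate(vector):
            combined[i] += weight * value
        total += weight
    if combined is None or total == 0:
        return None
    return [c / total for c in combined]


# Hypothetical vectors and weights: "interim" carries little department signal.
word_vectors = {"interim": [0.5, 0.5], "paralegal": [0.0, 1.0]}
word_weights = {"interim": 0.1, "paralegal": 1.0}

weighted = weighted_title_vector(["interim", "paralegal"], word_vectors, word_weights)
unweighted = weighted_title_vector(["interim", "paralegal"], word_vectors, {})
print(weighted, unweighted)
```

With the low weight applied, the combined vector stays close to the one for “paralegal” alone, so the extra word barely shifts the classification.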
That said, the results are not perfect.
- “property manager” is incorrectly classified as “finance”
- “product outreach coordinator” is incorrectly classified as “product” instead of “marketing”
- “remodeling consultant” is classified as “executive”
Many of these examples are understandable — “product outreach coordinator” contains “product,” which in this case actually isn’t the product department. Others, like “remodeling consultant,” are a bit stranger, and unfortunately one of the drawbacks of our approach is that debugging bad word associations is difficult.
Overall, machine learning can be an incredibly powerful tool, but it has to be used in the right context. Its cost, complexity, and propensity for generating tech debt make it no small undertaking. However, some problems, especially ones with complex and ambiguous solutions, can be solved more elegantly by machine learning than by any other approach. Armed with the relevant expertise, a test set, and the willingness to grind for incremental improvements, the results can be quite impressive.
We’ve been excited by our work so far, as well as the massive potential for improvement. If you want to see the power of machine learning in the data in our APIs, contact the FullContact support team (email@example.com) to see a first-hand demonstration. And if you’d like to help improve it, consider joining our team.