Data from thousands of sources can pose many challenges, such as how to standardize, de-duplicate and timestamp the data. In this article, we explain how we deal with these core product issues.
We standardize our data in two ways: by applying general formatting rules and by canonicalizing data.
We make all data lowercase and remove any whitespace at the beginning or end. This helps us normalize our data sources, which may have different capitalization styles. We also remove any punctuation that we consider unnecessary.
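As a rough illustration of these formatting rules, here is a minimal Python sketch. The function name `normalize_field` and the set of punctuation kept are assumptions made for this example; the article does not say which characters count as unnecessary. Collapsing repeated inner spaces is an extra step added here so that removed punctuation does not leave gaps.

```python
import string

# Assumption: keep characters that usually carry meaning in names, emails and
# URLs, and treat every other ASCII punctuation mark as "unnecessary".
KEEP = set("@.-_'&/+")
DROP = "".join(ch for ch in string.punctuation if ch not in KEEP)
_DROP_TABLE = str.maketrans("", "", DROP)


def normalize_field(value: str) -> str:
    """Trim, lowercase, drop unneeded punctuation, collapse inner whitespace."""
    cleaned = value.strip().lower().translate(_DROP_TABLE)
    # split()/join() also removes any whitespace left behind by dropped characters.
    return " ".join(cleaned.split())


print(normalize_field("  Acme, Inc.  "))
print(normalize_field(" Jane.Doe@Example.COM "))
```

Applying the same function to every source before comparison means that two records differing only in case or stray punctuation end up with identical values.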
Moreover, we canonicalize some of our fields into standardized values. For example, we use standardized values for majors and minors, which makes our data more consistent and searchable. For other fields, such as schools, companies and locations, our canonicalization techniques add more information to our data.
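The two kinds of canonicalization described above could be sketched as simple lookups. Everything in this example is hypothetical: the alias table, the school record and its fields, and the function names are illustrative only and do not reflect our actual canonical lists.

```python
from typing import Optional

# Hypothetical alias table: many raw spellings map to one standardized major.
MAJOR_ALIASES = {
    "cs": "computer science",
    "comp sci": "computer science",
    "computer science": "computer science",
    "ee": "electrical engineering",
    "electrical engineering": "electrical engineering",
}

# Hypothetical canonical school records: matching a raw value to one of these
# adds information that the raw value did not contain.
SCHOOLS = {
    "mit": {
        "name": "massachusetts institute of technology",
        "country": "united states",
    },
}


def _key(raw: str) -> str:
    """Lowercase, trim and collapse whitespace before looking a value up."""
    return " ".join(raw.strip().lower().split())


def canonicalize_major(raw: str) -> Optional[str]:
    """Return the standardized major, or None when the value is unknown."""
    return MAJOR_ALIASES.get(_key(raw))


def canonicalize_school(raw: str) -> Optional[dict]:
    """Return the enriched school record, or None when the value is unknown."""
    return SCHOOLS.get(_key(raw))


print(canonicalize_major("  Comp  Sci "))
print(canonicalize_school("MIT"))
```

Returning `None` for unknown values, rather than guessing, keeps uncanonicalized input visible so it can be reviewed instead of being silently mapped to the wrong entry.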
We use both deterministic and probabilistic methods to de-duplicate our data, based on a blocking/matching logic. First, we group records that have similar values, such as the same email or name, into blocks. Then we compare all records within a block to decide whether they are a match. We are very strict in this process, because we want to avoid merging records that belong to different people. We believe that a false-positive merge is more harmful than a missed match. This process is designed to ensure that our data sets do not contain duplicate people and that our APIs return as much information as we can confidently provide for any given input.
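To make the blocking/matching idea concrete, here is a small Python sketch. It is not our production logic: the record fields, the blocking keys (email and last name), the 0.9 similarity threshold and the use of `difflib.SequenceMatcher` as a stand-in for a probabilistic score are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_keys(rec):
    """Yield blocking keys; records sharing any key are compared to each other."""
    if rec.get("email"):
        yield ("email", rec["email"])
    tokens = rec.get("name", "").split()
    if tokens:
        yield ("last_name", tokens[-1])


def is_match(a, b, threshold=0.9):
    """Strict matching: prefer a missed match over a false-positive merge."""
    # Deterministic rule: an identical, non-empty email.
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    # Similarity rule: names must be close AND locations must agree exactly.
    if not a.get("location") or a.get("location") != b.get("location"):
        return False
    score = SequenceMatcher(None, a.get("name", ""), b.get("name", "")).ratio()
    return score >= threshold


def dedupe(records):
    """Return clusters of record ids that are judged to be the same person."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    blocks = defaultdict(list)
    for rec in records:
        for key in block_keys(rec):
            blocks[key].append(rec)

    # Only records inside the same block are ever compared.
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if is_match(a, b):
                parent[find(a["id"])] = find(b["id"])

    clusters = defaultdict(list)
    for rec in records:
        clusters[find(rec["id"])].append(rec["id"])
    return sorted(sorted(ids) for ids in clusters.values())


records = [
    {"id": 1, "email": "a@x.com", "name": "john smith", "location": "boston"},
    {"id": 2, "email": "a@x.com", "name": "j smith", "location": ""},
    {"id": 3, "email": "", "name": "jon smith", "location": "boston"},
    {"id": 4, "email": "b@y.com", "name": "john smith", "location": "denver"},
]
print(dedupe(records))  # [[1, 2, 3], [4]]
```

Record 4 shares a full name with record 1 but is kept separate because neither the email nor the location agrees, which reflects the preference for strictness. Note that this sketch merges transitively (records 2 and 3 are joined through record 1); whether that is desirable is a design choice that a stricter pipeline might revisit.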
After each quarterly data build, we spend about 2-3 weeks doing quality assurance on our data. Our QA process involves manual checking, running aggregations and extensive unit testing to make sure that we have not reduced the quality of our data in any way. Sometimes this allows us to fix issues before we release our data to production, but often the main goal is to show our customers what we set out to improve that quarter and how we achieved it.
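One kind of aggregation check that fits this description is comparing field fill rates between the previous build and the new one. The sketch below is an assumed example, not a description of our actual QA suite; the function names, the sample fields and the 1% tolerance are hypothetical.

```python
def fill_rates(records, fields):
    """Fraction of records with a non-empty value, per field."""
    total = len(records)
    return {
        f: (sum(1 for r in records if r.get(f)) / total if total else 0.0)
        for f in fields
    }


def regressions(previous, current, fields, tolerance=0.01):
    """Fields whose fill rate dropped by more than `tolerance` between builds."""
    prev = fill_rates(previous, fields)
    curr = fill_rates(current, fields)
    return {f: (prev[f], curr[f]) for f in fields if curr[f] < prev[f] - tolerance}


previous_build = [
    {"email": "a@x.com", "location": "boston"},
    {"email": "b@y.com", "location": "denver"},
]
current_build = [
    {"email": "a@x.com", "location": "boston"},
    {"email": "", "location": "denver"},
]
print(regressions(previous_build, current_build, ["email", "location"]))
# {'email': (1.0, 0.5)}
```

A check like this can run as a unit test against every build, so that a drop in coverage for any field is flagged before release.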
We aim to have all the data related to an individual that we can possibly have. This includes historical work experience, locations, emails, social media profiles and so on. Some of these fields, such as email or location, are useful for custom audience targeting. Others, such as historical work locations, are useful for modeling. All these fields are valuable for matching out-of-date data and providing newer, more useful information through our APIs.
A common question that we get is whether we validate emails and/or profile URLs. Because we have so many profile URLs and emails, it would be impractical to validate all of them regularly to ensure accuracy. Many of our customers use our data as a starting point and run a third-party email validator on top of it. Validating profile URLs would go against the Terms of Service of most social networks and we do not recommend it.