What is record matching?
Record matching is the process of comparing data records to determine whether they refer to the same entity. It's the core technical operation behind deduplication, entity resolution, and identity resolution — the step where the system decides "these two records are the same person" or "these are different people."
Record matching techniques range from exact matching (simple but brittle) to fuzzy matching (handles variations) to ML-based scoring (handles ambiguity at scale). The choice depends on data quality, volume, and tolerance for false positives vs false negatives.
Record matching techniques
- Exact matching — Compares fields for identical values. Fast and precise, but brittle — any variation in spelling, formatting, or completeness causes a miss. Best for unique identifiers like email addresses.
- Fuzzy / approximate matching — Uses algorithms like Levenshtein distance (edit distance), Jaro-Winkler similarity (weighted for prefix matches), and phonetic matching (Soundex, Metaphone) to find records that are similar but not identical. Handles typos, spelling variants, and similar-sounding names. Abbreviations and nicknames ("Bob" vs "Robert") are only partly covered by string similarity and usually need a lookup table as well.
- Rule-based matching — Applies hand-coded rules: "if first name and last name match and company is the same, it's a match." Simple to understand but hard to maintain as edge cases accumulate.
- ML-based probabilistic matching — Trains models on labeled record pairs to learn which field combinations predict true matches. Considers all fields simultaneously, assigns a match score to ambiguous cases instead of a hard yes or no, and can improve over time when reviewer corrections are fed back as training data. Generally the strongest option for complex, large-scale matching, provided enough labeled pairs are available.
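To make the first three techniques concrete, here is a minimal Python sketch using only the standard library. The function names, the record fields ("name", "company"), and the 0.85 name threshold are illustrative assumptions, not part of any particular product or library.

```python
def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(value.lower().split())


def exact_match(a: str, b: str) -> bool:
    """Exact matching: the normalized values must be identical."""
    return normalize(a) == normalize(b)


def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Fuzzy matching: edit distance scaled to a 0..1 score (1.0 means identical)."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rule_match(r1: dict, r2: dict, name_threshold: float = 0.85) -> bool:
    """Rule-based matching: same company (exact) and a similar full name (fuzzy).

    The threshold is an assumed value that would need tuning on real data.
    """
    same_company = exact_match(r1["company"], r2["company"])
    names_close = similarity(r1["name"], r2["name"]) >= name_threshold
    return same_company and names_close


a = {"name": "Jon Smith", "company": "Acme Corp"}
b = {"name": "John Smith", "company": "acme  corp"}
print(exact_match(a["name"], b["name"]))  # the names differ by one character
print(similarity(a["name"], b["name"]))
print(rule_match(a, b))
```

The exact comparison on names fails because of a single missing letter, while the similarity score stays high enough for the rule to accept the pair. An ML-based matcher would replace the hand-written rule with a model that learns how to weight such field scores from labeled pairs.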
Challenges at scale
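Comparing every record to every other record quickly becomes computationally infeasible. With n records there are n(n-1)/2 pairs, so a million records would require nearly 500 billion pairwise comparisons. Blocking reduces this by grouping records into buckets (by first letter, domain, zip code, company) and only comparing records within the same block. Windowing (sorted-neighborhood) methods instead sort records by a key and compare only those that fall within a sliding window of each other.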
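The arithmetic and the blocking idea can be sketched in a few lines of Python. The helper names and the choice of email domain as the blocking key are assumptions made for illustration; real systems often run several blocking passes with different keys.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def email_domain(record: dict):
    """Blocking key (an assumed choice): the lowercased domain of the email, if any."""
    email = record.get("email") or ""
    return email.rsplit("@", 1)[-1].lower() if "@" in email else None


def candidate_pairs(records, block_key):
    """Yield only the pairs of records that share a blocking key.

    Records with no key are skipped in this sketch, so they are never compared.
    A second pass with a different key would be needed to catch them.
    """
    blocks = defaultdict(list)
    for record in records:
        key = block_key(record)
        if key:
            blocks[key].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


records = [
    {"name": "Jon Smith", "email": "jon@acme.com"},
    {"name": "John Smith", "email": "john.smith@ACME.com"},
    {"name": "Ana Lopez", "email": "ana@globex.com"},
    {"name": "J. Smith", "email": None},
]
pairs = list(candidate_pairs(records, email_domain))

print(pair_count(1_000_000))  # 499999500000
print(pair_count(len(records)), "pairs without blocking,", len(pairs), "with blocking")
```

The trade-off is visible in the last record: blocking cuts the number of comparisons, but any true match that lands in a different block, or in no block at all, is never examined.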
The common names problem is particularly acute at scale. "John Smith" appears thousands of times — without additional signals like company, title, or email domain, matching is unreliable.
Data sparsity compounds the challenge. When records have different fields populated (one has email but no phone; another has phone but no email), there may be no overlapping fields to compare directly.
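One way to make sparsity explicit is to score a pair only on the fields populated in both records, and to report "no evidence" separately from "no match". The sketch below is a simplified illustration; the field names and the equal weighting of fields are assumptions.

```python
def shared_fields(r1: dict, r2: dict, fields) -> list:
    """Fields that have a non-empty value in both records."""
    return [f for f in fields if r1.get(f) and r2.get(f)]


def overlap_score(r1: dict, r2: dict, fields=("email", "phone", "name")):
    """Fraction of shared fields that agree, or None when nothing can be compared."""
    common = shared_fields(r1, r2, fields)
    if not common:
        return None  # no overlapping fields: the pair is undecidable, not a mismatch
    agree = sum(1 for f in common
                if r1[f].strip().lower() == r2[f].strip().lower())
    return agree / len(common)


has_email = {"email": "jon@acme.com", "phone": ""}
has_phone = {"email": "", "phone": "+1 555 0100"}
print(overlap_score(has_email, has_phone))  # None
```

Returning None instead of 0.0 keeps sparse pairs from being counted as confirmed non-matches. They can then be routed to another source of evidence or to manual review.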
Cross-language matching introduces additional complexity — transliteration differences, character set variations, and cultural naming conventions all affect match quality.
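Character set variations are the most tractable part of this problem. A minimal sketch using Python's standard unicodedata module is shown below. The function name fold_name is an assumption, and the approach covers only accents, case, and compatibility forms, not transliteration between scripts.

```python
import unicodedata


def fold_name(name: str) -> str:
    """Reduce a name to a comparison key: decompose, drop accents, fold case."""
    # NFKD splits accented letters into base letter + combining mark and
    # maps compatibility forms (such as full-width letters) to plain ones.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed
                            if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless comparison.
    return " ".join(without_marks.casefold().split())


print(fold_name("José  GARCÍA") == fold_name("Jose Garcia"))  # accents and case
print(fold_name("Иванов") == fold_name("Ivanov"))            # different scripts
```

Stripping accents is lossy: "Müller" folds to "muller", while the conventional German transliteration is "Mueller", so a key like this works best as one signal among several. Matching across scripts (Cyrillic vs Latin, for example) needs a transliteration step that this sketch does not attempt.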
How Salmon performs record matching
Salmon combines multiple matching techniques — exact, fuzzy, and ML-based — to achieve high-precision record matching at scale. Cross-source verification adds a second layer: even when names are ambiguous, confirming identity across LinkedIn, company registries, and professional data resolves the match with confidence.
Record matching in practice
Record matching is the technical foundation of several critical business operations:
- CRM deduplication — Finding and merging duplicate contact and account records within Salesforce, HubSpot, or other CRM systems.
- Data migration — When moving from one system to another, record matching ensures entities aren't duplicated in the transition.
- System integration — When CRM, marketing automation, and support systems sync, record matching determines which records to link vs create new.
- Compliance screening — Matching customer records against sanctions lists, adverse media, and watch lists for KYC/KYB compliance.
The critical tuning decision in any matching system is threshold selection. Set the threshold too strict and you'll miss true matches (false negatives). Set it too loose and you'll merge records that shouldn't be merged (false positives). False merges are typically harder to undo than missed matches.
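The trade-off can be measured by scoring a set of labeled pairs and computing precision and recall at candidate thresholds. The scores and labels below are made-up illustrative values, not results from any real system.

```python
def precision_recall(scored_pairs, threshold: float):
    """Precision and recall when every pair scoring >= threshold is called a match.

    scored_pairs is a list of (score, is_true_match) tuples.
    """
    tp = sum(1 for score, is_match in scored_pairs if score >= threshold and is_match)
    fp = sum(1 for score, is_match in scored_pairs if score >= threshold and not is_match)
    fn = sum(1 for score, is_match in scored_pairs if score < threshold and is_match)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical labeled sample: (match score, whether the pair is truly the same entity)
scored = [(0.98, True), (0.91, True), (0.84, False),
          (0.80, True), (0.62, False), (0.55, True)]

for threshold in (0.9, 0.6):
    precision, recall = precision_recall(scored, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this sample the strict threshold produces no false merges but misses half of the true matches, while the loose threshold recovers more matches at the cost of two false merges. Because false merges are harder to undo, a common design choice is to auto-merge only above a strict threshold and send the band below it to manual review.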