What is duplicate detection?

Duplicate detection (also called deduplication or dedup) is the process of identifying and resolving redundant records in CRM and database systems. Duplicates arise from multiple data sources, inconsistent data entry, and system migrations — and they silently inflate pipeline metrics, waste sales outreach, and break reporting.

Duplicate detection is related to but distinct from entity resolution, which determines whether records across different systems refer to the same real-world entity. Dedup typically operates within a single system; entity resolution operates across systems.

Common causes of CRM duplicates

  • Multiple import sources — Marketing lists, sales research tools, inbound form fills, and third-party vendors all create records independently, often for the same person.
  • Inconsistent data entry — Nicknames (Bob vs Robert), abbreviations (Corp vs Corporation), and formatting differences (phone numbers, addresses) create records that look different but represent the same entity.
  • System migrations — Moving from one CRM to another (or merging two CRMs after an acquisition) introduces duplicates when records don't map cleanly across systems.
  • Marketing and sales creating records independently — Marketing captures a form fill while sales manually creates a record from a prospecting call — same person, two records.
  • No dedup at point of entry — Without real-time dedup checks, every new import or integration sync can create net-new records for existing contacts.

How duplicate detection works

Exact matching is the simplest approach — comparing fields like email address or phone number for identical values. It's fast and precise but misses duplicates with any variation.
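As a minimal sketch of exact matching, the Python below groups records whose email or phone number is identical after light normalization (trimming, lowercasing, stripping punctuation). The record layout, the field names, and the helper functions are assumptions made for illustration, not part of any particular CRM.

```python
import re
from collections import defaultdict

def normalize_email(email):
    """Trim and lowercase so ' Bob@Example.com' equals 'bob@example.com'."""
    return email.strip().lower()

def normalize_phone(phone):
    """Keep digits only so formatting differences do not block a match."""
    return re.sub(r"\D", "", phone)

def exact_duplicates(records, key_fn):
    """Group record ids by a normalized key; return groups with 2+ members."""
    groups = defaultdict(list)
    for rec in records:
        key = key_fn(rec)
        if key:  # records with an empty key are never matched to each other
            groups[key].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical sample data.
records = [
    {"id": 1, "email": "bob@example.com", "phone": "(555) 010-0199"},
    {"id": 2, "email": " Bob@Example.com", "phone": "555-010-0199"},
    {"id": 3, "email": "robert.smith@example.com", "phone": "555 010 0142"},
]

by_email = exact_duplicates(records, lambda r: normalize_email(r["email"]))
by_phone = exact_duplicates(records, lambda r: normalize_phone(r["phone"]))
```

Records 1 and 2 are grouped on both keys. Record 3 is not, even if it belongs to the same person, which is the gap that fuzzy matching is meant to close.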

Fuzzy matching uses algorithms like Levenshtein distance and Jaro-Winkler similarity to find records that are similar but not identical. This catches typos, abbreviations, and formatting differences — but can produce false positives without careful threshold tuning.
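To make the threshold trade-off concrete, here is a small sketch of Levenshtein-based fuzzy matching in plain Python. The similarity formula (one minus distance divided by the longer string's length) is one common normalization, and the 0.8 threshold is an arbitrary assumption; real systems tune it against labeled data.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical after cleanup."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

THRESHOLD = 0.8  # assumption: chosen only for this example

def is_fuzzy_match(a, b, threshold=THRESHOLD):
    return similarity(a, b) >= threshold

typo_match = is_fuzzy_match("Jon Smith", "John Smith")
abbreviation_match = is_fuzzy_match("Acme Corp", "Acme Corporation")
```

With this threshold the typo pair matches (similarity 0.9), but the abbreviation pair does not (similarity about 0.56). Lowering the threshold would catch it at the cost of more false positives, which is why abbreviations are often expanded before comparison instead.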

ML-based scoring trains models on labeled match/non-match pairs to learn which field combinations best predict true duplicates. ML models often handle ambiguous cases better than rule-based approaches, and they can improve over time when reviewer corrections are fed back as new training labels.
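The idea can be sketched with a tiny logistic regression written in plain Python. Everything here is an illustrative assumption: the three features (email match, name similarity, company match), the six labeled pairs, and the training settings. A production system would more likely use an established library and far more training data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, lr=0.5, epochs=20000):
    """Fit logistic regression with batch gradient descent."""
    n_features = len(pairs[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(pairs)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(pairs, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for k in range(n_features):
                grad_w[k] += error * x[k]
            grad_b += error
        # Step against the gradient averaged over all labeled pairs.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def match_probability(model, features):
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, features)) + bias)

# Hypothetical features per record pair: [email_match, name_similarity, company_match]
labeled_pairs = [
    [1, 0.90, 1], [1, 0.60, 1], [0, 0.95, 1],  # reviewed as true duplicates
    [0, 0.90, 0], [0, 0.30, 1], [0, 0.20, 0],  # reviewed as distinct records
]
labels = [1, 1, 1, 0, 0, 0]

model = train(labeled_pairs, labels)
likely_duplicate = match_probability(model, [1, 0.95, 1])
likely_distinct = match_probability(model, [0, 0.10, 0])
```

The model returns a probability rather than a yes/no answer, so pairs near 0.5 can be routed to human review, and the reviewed outcomes can be appended to the labeled set before retraining.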

Merge rules determine what happens after a duplicate is detected: which record survives, which fields take precedence, and how conflicting values are resolved. Poor merge logic can destroy data even when detection is accurate.
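One possible set of merge rules is sketched below: the most recently updated record survives, empty fields on the survivor are filled from the other record, and genuinely conflicting values are set aside for review rather than overwritten. The rule choices, field names, and sample records are assumptions for illustration only.

```python
from datetime import date

def merge_records(a, b):
    """Merge two duplicate records; return (merged_record, conflicts)."""
    # Survivorship rule (assumed): the most recently updated record wins.
    survivor, other = (a, b) if a["updated"] >= b["updated"] else (b, a)
    merged = dict(survivor)
    conflicts = {}
    for field, other_value in other.items():
        if field in ("id", "updated"):
            continue
        survivor_value = survivor.get(field)
        if not survivor_value:
            # Fill gaps instead of losing data held only by the other record.
            merged[field] = other_value
        elif other_value and other_value != survivor_value:
            # Keep the survivor's value but record the losing value for review.
            conflicts[field] = other_value
    return merged, conflicts

# Hypothetical duplicate pair.
old = {"id": 1, "updated": date(2023, 1, 5), "name": "Bob Smith",
       "phone": "555-010-0199", "title": "VP Sales"}
new = {"id": 2, "updated": date(2024, 3, 9), "name": "Robert Smith",
       "phone": "", "title": "VP Sales"}

merged, conflicts = merge_records(old, new)
```

Here the newer record survives, the phone number is carried over from the older record, and the differing name is reported in `conflicts` instead of being silently discarded.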

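At scale, comparing every record against every other is computationally infeasible, because the number of pairs grows quadratically with the number of records. Blocking groups records into buckets (by company, domain, or first letter) so the system only compares records within the same block; windowing sorts records by a key and compares only those that fall within a sliding window of neighbors.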
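A rough sketch of blocking, assuming email domain as the block key: only records that share a domain become candidate pairs for the more expensive fuzzy or ML comparison. The records and the `email_domain` helper are hypothetical, and windowing is not shown.

```python
from collections import defaultdict
from itertools import combinations

def email_domain(record):
    """Block key (assumed): the lowercased domain part of the email address."""
    return record["email"].rsplit("@", 1)[-1].strip().lower()

def candidate_pairs(records, block_key):
    """Return id pairs to compare, restricted to records in the same block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs

# Hypothetical sample data.
records = [
    {"id": 1, "email": "bob@acme.example"},
    {"id": 2, "email": "robert@acme.example"},
    {"id": 3, "email": "ann@globex.example"},
    {"id": 4, "email": "a.lee@globex.example"},
    {"id": 5, "email": "sam@initech.example"},
]

pairs = candidate_pairs(records, email_domain)
all_pairs = len(records) * (len(records) - 1) // 2
```

Five records yield 10 possible pairs, but blocking leaves only 2 to score. The trade-off is that duplicates landing in different blocks (for example, a personal and a work email for the same person) are never compared, so systems often run several passes with different block keys.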

How Salmon handles deduplication

Salmon performs deduplication in real time at point of record creation. AI-powered fuzzy matching catches duplicates that rule-based systems miss — like "Bob Smith at Salesforce" and "Robert Smith at Salesforce.com". Confidence scoring ensures correct merges without data loss.

The cost of duplicates

Enterprise CRMs average 20–30% duplicate rates. The downstream impacts are significant and compounding:

  • Inflated pipeline — Duplicate contacts and accounts artificially inflate pipeline metrics, making forecasts unreliable and board-level reporting inaccurate.
  • Double-touched prospects — Multiple reps contact the same person from different records, eroding trust and making your team look disorganized.
  • Broken routing rules — Lead routing, territory assignment, and account ownership break when the same entity exists as multiple records with different assignments.
  • Inaccurate reporting — Every report that counts contacts, accounts, or opportunities is inflated. Conversion rates, win rates, and engagement metrics are all distorted.
  • Compliance risk — Inconsistent records for the same entity create gaps in audit trails and make regulatory reporting unreliable.
