Identity resolution, merge duplicate customers

Cluster duplicate customer records that share emails, phones, or addresses across systems

All recipes· graph· 7 minutesintermediatecypher

Identity resolution, merge duplicate customers

Objective

Marketing has Sarah in Salesforce, support has her in Zendesk, billing has her in Stripe — all with slightly different details. Identity resolution stitches these together. Modeling shared identifiers (email, phone, address) as graph nodes turns deduplication into connected-component discovery: customers linked through any common identifier belong to the same person.

Step 1: Create the graph

// Source records — same person, different systems, slightly different fields
MERGE (c1:Customer {id: "SF-101",  source: "Salesforce", name: "Sarah Chen"})
MERGE (c2:Customer {id: "ZD-77",   source: "Zendesk",   name: "S. Chen"})
MERGE (c3:Customer {id: "ST-9001", source: "Stripe",    name: "Sarah K. Chen"})

MERGE (c4:Customer {id: "SF-202",  source: "Salesforce", name: "Raj Patel"})
MERGE (c5:Customer {id: "ZD-88",   source: "Zendesk",   name: "Rajesh Patel"})

MERGE (c6:Customer {id: "SF-303",  source: "Salesforce", name: "Mia Rossi"})
MERGE (c7:Customer {id: "ST-9100", source: "Stripe",    name: "Mia R."})
MERGE (c8:Customer {id: "ZD-99",   source: "Zendesk",   name: "Mia R Rossi"})

MERGE (c9:Customer  {id: "SF-404", source: "Salesforce", name: "Leo Park"})
MERGE (c10:Customer {id: "SF-505", source: "Salesforce", name: "Eli Tanaka"})

// Identifier nodes
MERGE (e1:Email {value: "sarah.chen@acme.com"})
MERGE (e2:Email {value: "schen@gmail.com"})
MERGE (e3:Email {value: "raj.patel@acme.com"})
MERGE (e4:Email {value: "mia.rossi@studio.io"})
MERGE (e5:Email {value: "leo@parks.dev"})
MERGE (e6:Email {value: "eli.t@example.com"})

MERGE (ph1:Phone {value: "+1-415-555-0142"})
MERGE (ph2:Phone {value: "+1-415-555-0199"})
MERGE (ph3:Phone {value: "+39-06-555-0011"})

MERGE (ad1:Address {value: "100 Mission St, SF, CA"})
MERGE (ad2:Address {value: "20 Via Roma, Rome, IT"})

// Customer -> identifier edges
MERGE (c1)-[:HAS_EMAIL]->(e1)
MERGE (c2)-[:HAS_EMAIL]->(e2)
MERGE (c3)-[:HAS_EMAIL]->(e1)
MERGE (c1)-[:HAS_PHONE]->(ph1)
MERGE (c2)-[:HAS_PHONE]->(ph1)
MERGE (c3)-[:HAS_ADDRESS]->(ad1)
MERGE (c1)-[:HAS_ADDRESS]->(ad1)

MERGE (c4)-[:HAS_EMAIL]->(e3)
MERGE (c5)-[:HAS_EMAIL]->(e3)
MERGE (c5)-[:HAS_PHONE]->(ph2)

MERGE (c6)-[:HAS_EMAIL]->(e4)
MERGE (c7)-[:HAS_EMAIL]->(e4)
MERGE (c8)-[:HAS_PHONE]->(ph3)
MERGE (c7)-[:HAS_PHONE]->(ph3)
MERGE (c6)-[:HAS_ADDRESS]->(ad2)

MERGE (c9)-[:HAS_EMAIL]->(e5)
MERGE (c10)-[:HAS_EMAIL]->(e6);

Step 2: Cluster duplicates by shared identifiers

// Two customers belong to the same identity cluster if you can walk
// Customer -> identifier <- Customer in 1..3 hops.
MATCH (a:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]->(ident)
        <-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]-(b:Customer)
WHERE id(a) < id(b)
RETURN a.id   AS record_a,
       b.id   AS record_b,
       a.name AS name_a,
       b.name AS name_b,
       labels(ident)[0] AS shared_via,
       ident.value      AS shared_value
ORDER BY shared_via, record_a;

What's happening

  • Each shared identifier (Email, Phone, Address) becomes its own node, so two customers pointing at the same email are one hop apart through that node.
  • The relationship-type union HAS_EMAIL|HAS_PHONE|HAS_ADDRESS lets a single pattern walk over any identifier type — adding :HAS_DEVICE_ID later requires no query rewrite.
  • Variable-length traversal (*1..4) finds transitive matches: c1↔c2 share a phone, c1↔c3 share an email, so c2↔c3 are linked through c1 even without a direct identifier overlap.
  • id(a) < id(b) deduplicates symmetric pairs.
  • This is the same algorithm used by Customer 360 and CDP vendors — fast, explainable, no ML.

Try this next

MATCH (c:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS*1..4]-(other:Customer)
WHERE id(c) <= id(other)
WITH c, collect(DISTINCT other.id) AS cluster
RETURN c.id AS seed, cluster
ORDER BY size(cluster) DESC;
MATCH (e:Email)<-[:HAS_EMAIL]-(c:Customer)
WITH e, collect(DISTINCT c.source) AS systems, count(c) AS records
WHERE size(systems) > 1
RETURN e.value AS email, systems, records
ORDER BY records DESC;
MATCH (c:Customer)-[r:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]->(id)
RETURN c.id, c.source, type(r) AS id_type, id.value;

Tags

graphcyphermdmintermediate

Run this on your own machine

Install SynapCores Community Edition free, paste the SQL or Cypher above into the bundled web UI, and watch it run.

Download Free CE