Identity resolution, merge duplicate customers

Objective

Marketing has Sarah in Salesforce, support has her in Zendesk, billing has her in Stripe — all with slightly different details. Identity resolution stitches these together. Modeling shared identifiers (email, phone, address) as graph nodes turns deduplication into connected-component discovery: customers linked through any common identifier belong to the same person.

Step 1: Create the graph

// Source records — same person, different systems, slightly different fields
MERGE (c1:Customer {id: "SF-101",  source: "Salesforce", name: "Sarah Chen"})
MERGE (c2:Customer {id: "ZD-77",   source: "Zendesk",   name: "S. Chen"})
MERGE (c3:Customer {id: "ST-9001", source: "Stripe",    name: "Sarah K. Chen"})

MERGE (c4:Customer {id: "SF-202",  source: "Salesforce", name: "Raj Patel"})
MERGE (c5:Customer {id: "ZD-88",   source: "Zendesk",   name: "Rajesh Patel"})

MERGE (c6:Customer {id: "SF-303",  source: "Salesforce", name: "Mia Rossi"})
MERGE (c7:Customer {id: "ST-9100", source: "Stripe",    name: "Mia R."})
MERGE (c8:Customer {id: "ZD-99",   source: "Zendesk",   name: "Mia R Rossi"})

MERGE (c9:Customer  {id: "SF-404", source: "Salesforce", name: "Leo Park"})
MERGE (c10:Customer {id: "SF-505", source: "Salesforce", name: "Eli Tanaka"})

// Identifier nodes
MERGE (e1:Email {value: "sarah.chen@acme.com"})
MERGE (e2:Email {value: "schen@gmail.com"})
MERGE (e3:Email {value: "raj.patel@acme.com"})
MERGE (e4:Email {value: "mia.rossi@studio.io"})
MERGE (e5:Email {value: "leo@parks.dev"})
MERGE (e6:Email {value: "eli.t@example.com"})

MERGE (ph1:Phone {value: "+1-415-555-0142"})
MERGE (ph2:Phone {value: "+1-415-555-0199"})
MERGE (ph3:Phone {value: "+39-06-555-0011"})

MERGE (ad1:Address {value: "100 Mission St, SF, CA"})
MERGE (ad2:Address {value: "20 Via Roma, Rome, IT"})

// Customer -> identifier edges
MERGE (c1)-[:HAS_EMAIL]->(e1)
MERGE (c2)-[:HAS_EMAIL]->(e2)
MERGE (c3)-[:HAS_EMAIL]->(e1)
MERGE (c1)-[:HAS_PHONE]->(ph1)
MERGE (c2)-[:HAS_PHONE]->(ph1)
MERGE (c3)-[:HAS_ADDRESS]->(ad1)
MERGE (c1)-[:HAS_ADDRESS]->(ad1)

MERGE (c4)-[:HAS_EMAIL]->(e3)
MERGE (c5)-[:HAS_EMAIL]->(e3)
MERGE (c5)-[:HAS_PHONE]->(ph2)

MERGE (c6)-[:HAS_EMAIL]->(e4)
MERGE (c7)-[:HAS_EMAIL]->(e4)
MERGE (c8)-[:HAS_PHONE]->(ph3)
MERGE (c7)-[:HAS_PHONE]->(ph3)
MERGE (c6)-[:HAS_ADDRESS]->(ad2)

MERGE (c9)-[:HAS_EMAIL]->(e5)
MERGE (c10)-[:HAS_EMAIL]->(e6);

Step 2: Cluster duplicates by shared identifiers

// Two customers belong to the same identity cluster if you can walk
// Customer -> identifier <- Customer in 1..3 hops.
MATCH (a:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]->(ident)
        <-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]-(b:Customer)
WHERE id(a) < id(b)
RETURN a.id   AS record_a,
       b.id   AS record_b,
       a.name AS name_a,
       b.name AS name_b,
       labels(ident)[0] AS shared_via,
       ident.value      AS shared_value
ORDER BY shared_via, record_a;

What's happening

Each shared identifier (Email, Phone, Address) becomes its own node, so two customers pointing at the same email are one hop apart through that node.
The relationship-type union HAS_EMAIL|HAS_PHONE|HAS_ADDRESS lets a single pattern walk over any identifier type — adding :HAS_DEVICE_ID later requires no query rewrite.
Variable-length traversal (*1..4) finds transitive matches: c1↔c2 share a phone, c1↔c3 share an email, so c2↔c3 are linked through c1 even without a direct identifier overlap.
id(a) < id(b) deduplicates symmetric pairs.
This is the same algorithm used by Customer 360 and CDP vendors — fast, explainable, no ML.

Try this next

MATCH (c:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS*1..4]-(other:Customer)
WHERE id(c) <= id(other)
WITH c, collect(DISTINCT other.id) AS cluster
RETURN c.id AS seed, cluster
ORDER BY size(cluster) DESC;

MATCH (e:Email)<-[:HAS_EMAIL]-(c:Customer)
WITH e, collect(DISTINCT c.source) AS systems, count(c) AS records
WHERE size(systems) > 1
RETURN e.value AS email, systems, records
ORDER BY records DESC;

MATCH (c:Customer)-[r:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]->(id)
RETURN c.id, c.source, type(r) AS id_type, id.value;

Identity resolution, merge duplicate customers

Identity resolution, merge duplicate customers

Objective

Step 1: Create the graph

Step 2: Cluster duplicates by shared identifiers

What's happening

Try this next

Run this on your own machine