Identity resolution, merge duplicate customers
Objective
Marketing has Sarah in Salesforce, support has her in Zendesk, billing has her in Stripe — all with slightly different details. Identity resolution stitches these together. Modeling shared identifiers (email, phone, address) as graph nodes turns deduplication into connected-component discovery: customers linked through any common identifier belong to the same person.
Step 1: Create the graph
// Source records — same person, different systems, slightly different fields
MERGE (c1:Customer {id: "SF-101", source: "Salesforce", name: "Sarah Chen"})
MERGE (c2:Customer {id: "ZD-77", source: "Zendesk", name: "S. Chen"})
MERGE (c3:Customer {id: "ST-9001", source: "Stripe", name: "Sarah K. Chen"})
MERGE (c4:Customer {id: "SF-202", source: "Salesforce", name: "Raj Patel"})
MERGE (c5:Customer {id: "ZD-88", source: "Zendesk", name: "Rajesh Patel"})
MERGE (c6:Customer {id: "SF-303", source: "Salesforce", name: "Mia Rossi"})
MERGE (c7:Customer {id: "ST-9100", source: "Stripe", name: "Mia R."})
MERGE (c8:Customer {id: "ZD-99", source: "Zendesk", name: "Mia R Rossi"})
MERGE (c9:Customer {id: "SF-404", source: "Salesforce", name: "Leo Park"})
MERGE (c10:Customer {id: "SF-505", source: "Salesforce", name: "Eli Tanaka"})
// Identifier nodes
MERGE (e1:Email {value: "sarah.chen@acme.com"})
MERGE (e2:Email {value: "schen@gmail.com"})
MERGE (e3:Email {value: "raj.patel@acme.com"})
MERGE (e4:Email {value: "mia.rossi@studio.io"})
MERGE (e5:Email {value: "leo@parks.dev"})
MERGE (e6:Email {value: "eli.t@example.com"})
MERGE (ph1:Phone {value: "+1-415-555-0142"})
MERGE (ph2:Phone {value: "+1-415-555-0199"})
MERGE (ph3:Phone {value: "+39-06-555-0011"})
MERGE (ad1:Address {value: "100 Mission St, SF, CA"})
MERGE (ad2:Address {value: "20 Via Roma, Rome, IT"})
// Customer -> identifier edges
MERGE (c1)-[:HAS_EMAIL]->(e1)
MERGE (c2)-[:HAS_EMAIL]->(e2)
MERGE (c3)-[:HAS_EMAIL]->(e1)
MERGE (c1)-[:HAS_PHONE]->(ph1)
MERGE (c2)-[:HAS_PHONE]->(ph1)
MERGE (c3)-[:HAS_ADDRESS]->(ad1)
MERGE (c1)-[:HAS_ADDRESS]->(ad1)
MERGE (c4)-[:HAS_EMAIL]->(e3)
MERGE (c5)-[:HAS_EMAIL]->(e3)
MERGE (c5)-[:HAS_PHONE]->(ph2)
MERGE (c6)-[:HAS_EMAIL]->(e4)
MERGE (c7)-[:HAS_EMAIL]->(e4)
MERGE (c8)-[:HAS_PHONE]->(ph3)
MERGE (c7)-[:HAS_PHONE]->(ph3)
MERGE (c6)-[:HAS_ADDRESS]->(ad2)
MERGE (c9)-[:HAS_EMAIL]->(e5)
MERGE (c10)-[:HAS_EMAIL]->(e6);
Step 2: Cluster duplicates by shared identifiers
// Two customers belong to the same identity cluster if you can walk
// Customer -> identifier <- Customer in 1..3 hops.
MATCH (a:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]->(ident)
<-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]-(b:Customer)
WHERE id(a) < id(b)
RETURN a.id AS record_a,
b.id AS record_b,
a.name AS name_a,
b.name AS name_b,
labels(ident)[0] AS shared_via,
ident.value AS shared_value
ORDER BY shared_via, record_a;
What's happening
- Each shared identifier (Email, Phone, Address) becomes its own node, so two customers pointing at the same email are one hop apart through that node.
- The relationship-type union
HAS_EMAIL|HAS_PHONE|HAS_ADDRESSlets a single pattern walk over any identifier type — adding:HAS_DEVICE_IDlater requires no query rewrite. - Variable-length traversal (
*1..4) finds transitive matches: c1↔c2 share a phone, c1↔c3 share an email, so c2↔c3 are linked through c1 even without a direct identifier overlap. id(a) < id(b)deduplicates symmetric pairs.- This is the same algorithm used by Customer 360 and CDP vendors — fast, explainable, no ML.
Try this next
MATCH (c:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS*1..4]-(other:Customer)
WHERE id(c) <= id(other)
WITH c, collect(DISTINCT other.id) AS cluster
RETURN c.id AS seed, cluster
ORDER BY size(cluster) DESC;
MATCH (e:Email)<-[:HAS_EMAIL]-(c:Customer)
WITH e, collect(DISTINCT c.source) AS systems, count(c) AS records
WHERE size(systems) > 1
RETURN e.value AS email, systems, records
ORDER BY records DESC;
MATCH (c:Customer)-[r:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]->(id)
RETURN c.id, c.source, type(r) AS id_type, id.value;