Semantic identity resolution beyond exact match

Objective

Recipe 012 dedupes customers who share a literal email or phone. But "Jonathan Smith at 100 Mission St" and "John D. Smith at 100 Mission Street, San Francisco" never share an exact identifier, so they slip through. Embedding the (name + address) string lets SIMILAR_TO link them semantically. The payoff: a single MATCH finds near-duplicates that exact-match tooling misses, and LLM_SCORE then scores whether each merge is safe, which keeps the process explainable and auditable in one engine.

Step 1: Set up customers with identity-string embeddings

// `identity_text` is name + address concatenated; `embedding` is its vector.
MERGE (c1:Customer {id: "CRM-401",
       name: "Jonathan Smith",   address: "100 Mission St, San Francisco, CA",
       email: "j.smith@acme.com",
       identity_text: "Jonathan Smith 100 Mission St San Francisco CA",
       embedding: [0.61, 0.18, -0.05, 0.42, 0.13]})

MERGE (c2:Customer {id: "ERP-9988",
       name: "John D. Smith",    address: "100 Mission Street, SF, CA 94105",
       email: "jonathan.smith@acmeholdings.com",
       identity_text: "John D Smith 100 Mission Street SF CA",
       embedding: [0.62, 0.17, -0.04, 0.43, 0.14]})

MERGE (c3:Customer {id: "BILL-707",
       name: "Jon Smith",        address: "100 Mission St #201, San Francisco",
       email: "jsmith.work@gmail.com",
       identity_text: "Jon Smith 100 Mission St 201 San Francisco",
       embedding: [0.60, 0.19, -0.06, 0.41, 0.12]})

MERGE (c4:Customer {id: "CRM-555",
       name: "Mia Rossi",        address: "20 Via Roma, Rome, IT",
       email: "mia.rossi@studio.io",
       identity_text: "Mia Rossi 20 Via Roma Rome",
       embedding: [-0.40, 0.55, 0.21, -0.10, 0.06]})

MERGE (c5:Customer {id: "ERP-9999",
       name: "Maria Rossi-Bianchi", address: "20 Via Roma, Roma, Italia",
       email: "m.rossibianchi@studio.io",
       identity_text: "Maria Rossi Bianchi 20 Via Roma Roma",
       embedding: [-0.39, 0.56, 0.22, -0.09, 0.07]})

MERGE (c6:Customer {id: "CRM-888",
       name: "Leo Park",          address: "44 King St West, Toronto, ON",
       email: "leo@parks.dev",
       identity_text: "Leo Park 44 King St West Toronto",
       embedding: [0.10, -0.55, 0.31, 0.22, -0.18]})

MERGE (c7:Customer {id: "ERP-1010",
       name: "Eli Tanaka",         address: "9-1 Roppongi, Minato-ku, Tokyo",
       email: "eli.t@example.com",
       identity_text: "Eli Tanaka 9 1 Roppongi Minato Tokyo",
       embedding: [-0.18, 0.04, -0.61, 0.07, 0.34]});
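
The identity_text values above look like name + address with punctuation stripped. A minimal Python sketch of that normalization follows; the function name and the regex are assumptions for illustration, not part of the recipe. Some sample rows also drop tokens by hand (the postal code in ERP-9988, the country code in CRM-555), which this sketch does not attempt.

```python
import re


def identity_text(name: str, address: str) -> str:
    """Concatenate name and address, replacing punctuation with spaces.

    Hypothetical helper: the recipe does not say how identity_text is built.
    """
    raw = f"{name} {address}"
    # Replace every character that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", raw)
    # Collapse the runs of whitespace left behind by the substitution.
    return " ".join(no_punct.split())


print(identity_text("Jonathan Smith", "100 Mission St, San Francisco, CA"))
# Jonathan Smith 100 Mission St San Francisco CA
```

Whatever normalization you choose, apply the same one to every source system before embedding; otherwise formatting differences leak into the vectors.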

Step 2: Find candidate duplicates via SIMILAR_TO

// Pairs of customer records whose identity-text embeddings are very close
// but whose emails are NOT identical (otherwise exact-match would have caught them).
MATCH (a:Customer)-[:SIMILAR_TO > 0.95]->(b:Customer)
WHERE id(a) < id(b)
  AND a.email <> b.email
RETURN a.id   AS record_a,
       a.name AS name_a,
       a.email AS email_a,
       b.id   AS record_b,
       b.name AS name_b,
       b.email AS email_b;
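
To see which pairs this query should surface on the sample data, the same filter can be sketched outside the engine in Python. This assumes SIMILAR_TO compares embeddings by cosine similarity, which the recipe does not state; the names RECORDS, cosine, and candidate_pairs are illustrative only.

```python
from itertools import combinations
from math import sqrt

# id -> (email, embedding), copied from Step 1.
RECORDS = {
    "CRM-401": ("j.smith@acme.com", [0.61, 0.18, -0.05, 0.42, 0.13]),
    "ERP-9988": ("jonathan.smith@acmeholdings.com", [0.62, 0.17, -0.04, 0.43, 0.14]),
    "BILL-707": ("jsmith.work@gmail.com", [0.60, 0.19, -0.06, 0.41, 0.12]),
    "CRM-555": ("mia.rossi@studio.io", [-0.40, 0.55, 0.21, -0.10, 0.06]),
    "ERP-9999": ("m.rossibianchi@studio.io", [-0.39, 0.56, 0.22, -0.09, 0.07]),
    "CRM-888": ("leo@parks.dev", [0.10, -0.55, 0.31, 0.22, -0.18]),
    "ERP-1010": ("eli.t@example.com", [-0.18, 0.04, -0.61, 0.07, 0.34]),
}


def cosine(u, v):
    """Cosine similarity (assumed to be the metric behind SIMILAR_TO)."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def candidate_pairs(records, threshold=0.95):
    """Return each unordered pair once when similarity exceeds the threshold
    and the emails differ, mirroring the WHERE clause above."""
    pairs = []
    for (id_a, (email_a, vec_a)), (id_b, (email_b, vec_b)) in combinations(records.items(), 2):
        if email_a != email_b and cosine(vec_a, vec_b) > threshold:
            pairs.append((id_a, id_b))
    return pairs


print(candidate_pairs(RECORDS))
# [('CRM-401', 'ERP-9988'), ('CRM-401', 'BILL-707'), ('ERP-9988', 'BILL-707'), ('CRM-555', 'ERP-9999')]
```

combinations() plays the role of id(a) < id(b): it yields every unordered pair exactly once, so no mirror-image rows appear.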

Step 3: Have the LLM judge each merge candidate

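// Use LLM_SCORE to judge whether each candidate pair really is the same person.
// Rows scoring above 0.8 are confident enough to hand to an automated merge job;
// this query returns every candidate with its score so the cutoff can be applied downstream.
// Note: only `a` is passed as context below although the prompt refers to both records;
// check that your llm_score call also exposes `b` to the model.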
MATCH (a:Customer)-[:SIMILAR_TO > 0.95]->(b:Customer)
WHERE id(a) < id(b)
  AND a.email <> b.email
WITH a, b,
     llm_score(
       "Decide if these two customer records describe the same real-world person. " +
       "Score 0 if clearly different, 1 if clearly the same. " +
       "Consider name variants, address abbreviation differences, and email reuse.",
       a
     ) AS merge_confidence
RETURN a.id, b.id, a.name, b.name, merge_confidence
ORDER BY merge_confidence DESC;
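
Once each pair has a merge_confidence, a downstream job has to decide what to do with it. The sketch below is one possible routing policy in Python: the 0.8 auto-merge cutoff comes from the comment above, while the 0.5 review cutoff and the bucket names are assumptions chosen for illustration.

```python
def route(merge_confidence: float,
          auto_merge_above: float = 0.8,
          review_above: float = 0.5) -> str:
    """Map an LLM merge score to an action bucket.

    auto_merge_above follows the recipe's 0.8 cutoff; review_above is a
    hypothetical second cutoff for an analyst queue.
    """
    if merge_confidence > auto_merge_above:
        return "auto_merge"
    if merge_confidence > review_above:
        return "review"
    return "keep_separate"


# Illustrative scores only; real values come from the query above.
for score in (0.93, 0.8, 0.6, 0.2):
    print(score, route(score))
```

Using a strict > means a score of exactly 0.8 falls into the review bucket, which matches the ">0.8" wording of the recipe.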

What's happening

SIMILAR_TO > 0.95 is intentionally tight — semantic dedup wants very close near-neighbors. Below 0.9 you would start matching unrelated names with similar formatting.
id(a) < id(b) removes mirror-image pairs so each candidate appears once.
The exact-match identifier graph (recipe 012) and this semantic graph are complementary: union the two for the most thorough dedupe pipeline.
LLM_SCORE carries the prompt that defines what "same person" means in your business: for GDPR you might tighten to 0.95, for marketing 0.7. The decision is auditable: store the prompt + score on the merge edge.
This is exactly the "semantic entity resolution" pattern McKinsey calls out as a 2025 trend — but here it runs as one Cypher query, no separate ML service, no offline batch job.
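
Two of the points above, unioning exact-match and semantic candidates and storing the prompt + score for audit, can be sketched in Python as follows. The function names and the property keys are assumptions for illustration; the recipe does not define a schema for the merge edge.

```python
def union_with_provenance(exact_pairs, semantic_pairs):
    """Union two candidate lists, remembering which method found each pair."""
    merged = {}
    for source, pairs in (("exact", exact_pairs), ("semantic", semantic_pairs)):
        for a, b in pairs:
            key = tuple(sorted((a, b)))  # (A, B) and (B, A) are the same pair
            merged.setdefault(key, set()).add(source)
    return merged


def merge_edge_properties(prompt: str, score: float, threshold: float) -> dict:
    """Build the audit payload to store on a merge edge (hypothetical keys)."""
    return {
        "prompt": prompt,
        "score": score,
        "threshold": threshold,
        "approved": score > threshold,
    }


candidates = union_with_provenance(
    exact_pairs=[("ERP-9988", "CRM-401")],
    semantic_pairs=[("CRM-401", "ERP-9988"), ("CRM-555", "ERP-9999")],
)
print(candidates)
print(merge_edge_properties("Same real-world person?", 0.9, threshold=0.95))
```

A pair found by both methods carries both labels, which is the strong merge signal; storing the threshold alongside the score lets a later audit see why the same 0.9 was rejected under a 0.95 policy and accepted under a 0.7 one.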

Try this next

// Connected-component-style clustering on the SIMILAR_TO graph.
// Each row is one seed plus its direct neighbors; transitive links (A~B, B~C) are not followed.
MATCH (c:Customer)-[:SIMILAR_TO > 0.92]-(other:Customer)
WITH c, collect(DISTINCT other.id) AS cluster
WHERE size(cluster) > 0
RETURN c.id AS seed, cluster
ORDER BY size(cluster) DESC;
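
The query above lists each seed with its direct neighbors. If you want true connected components, so that A~B and B~C place A, B, and C in one cluster even when A and C are not directly similar, one option is a union-find pass over the candidate pairs. The following Python sketch assumes the pairs have already been exported from the graph; the function name is illustrative.

```python
def clusters(pairs):
    """Group record ids into connected components with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Largest clusters first, ties broken alphabetically for stable output.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# A is similar to B and B to C, but A and C were never matched directly.
print(clusters([("A", "B"), ("B", "C"), ("D", "E")]))
# [['A', 'B', 'C'], ['D', 'E']]
```

Transitive clustering can chain loosely related records together at low thresholds, so it is worth reviewing large clusters before merging them.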

// "Records that share an exact email AND are semantically similar" — strong merge signal.
MATCH (a:Customer)-[:SIMILAR_TO > 0.9]->(b:Customer)
WHERE id(a) < id(b) AND a.email = b.email
RETURN a.id, b.id, a.name, b.name;

// Loose net for analyst review queue.
MATCH (a:Customer)-[:SIMILAR_TO > 0.85]->(b:Customer)
WHERE id(a) < id(b)
RETURN a.name, b.name, a.address, b.address;
