Literature review graph from paper abstracts
Objective
A new researcher coming into a field has two questions: who studies what (entities and links) and which papers cluster on the same idea (semantic groups). The first is structural — extract authors, datasets, and methods from abstracts. The second is semantic — cluster papers by abstract embeddings. The wow moment: paste five abstracts, get back a graph that supports "all papers that use dataset X" and "papers semantically near the seed paper" with the same Cypher syntax.
Step 1: Extract authors, methods, datasets from abstracts
curl -X POST https://localhost:8443/v2/graph/extract \
-H "Authorization: Bearer $AIDB_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"text": "We introduce GraphRAG-Lite, a retrieval-augmented generation system that fuses vector search with graph traversal. Authors Sarah Chen and Raj Patel evaluate on the HotpotQA benchmark and beat the LangChain baseline by 12%. The implementation uses LlamaIndex and a Neo4j-compatible backend.",
"default_node_label": "ResearchEntity",
"node_provenance": {"paper": "PAP-001"},
"edge_provenance": {"paper": "PAP-001"},
"min_confidence": 0.55
}'
Step 2: Pre-seed papers + abstract embeddings
MERGE (p1:Paper {id: "PAP-001",
title: "GraphRAG-Lite: hybrid vector + graph retrieval",
abstract: "Hybrid retrieval combining vector search and graph traversal, evaluated on HotpotQA",
embedding: [0.61, 0.18, -0.04, 0.42, 0.21]})
MERGE (p2:Paper {id: "PAP-002",
title: "Knowledge-aware RAG for enterprise search",
abstract: "RAG system that walks knowledge graph relations to improve multi-hop QA accuracy",
embedding: [0.62, 0.17, -0.03, 0.43, 0.22]})
MERGE (p3:Paper {id: "PAP-003",
title: "Memory-efficient HNSW for billion-scale vectors",
abstract: "We propose a quantization scheme that shrinks HNSW indexes by 6x with no recall loss",
embedding: [-0.31, 0.55, 0.10, -0.04, 0.16]})
MERGE (p4:Paper {id: "PAP-004",
title: "Agentic memory for long-horizon LLM tasks",
abstract: "An episodic memory layer for agents that recalls past sessions semantically",
embedding: [0.40, -0.15, 0.21, 0.32, -0.08]})
MERGE (p5:Paper {id: "PAP-005",
title: "Hybrid graph + vector retrieval, a survey",
abstract: "Survey of techniques combining graph traversal with vector retrieval, covers HotpotQA",
embedding: [0.60, 0.19, -0.05, 0.41, 0.20]})
MERGE (auth1:Author {name: "Sarah Chen"})
MERGE (auth2:Author {name: "Raj Patel"})
MERGE (auth3:Author {name: "Mia Rossi"})
MERGE (auth4:Author {name: "Leo Park"})
MERGE (ds1:Dataset {name: "HotpotQA"})
MERGE (ds2:Dataset {name: "MS MARCO"})
MERGE (m1:Method {name: "Hybrid Retrieval"})
MERGE (m2:Method {name: "HNSW Quantization"})
MERGE (m3:Method {name: "Episodic Memory"})
MERGE (auth1)-[:WROTE]->(p1)
MERGE (auth2)-[:WROTE]->(p1)
MERGE (auth3)-[:WROTE]->(p2)
MERGE (auth4)-[:WROTE]->(p3)
MERGE (auth1)-[:WROTE]->(p5)
MERGE (p1)-[:USES]->(ds1)
MERGE (p2)-[:USES]->(ds1)
MERGE (p5)-[:USES]->(ds1)
MERGE (p3)-[:USES]->(ds2)
MERGE (p1)-[:APPLIES]->(m1)
MERGE (p2)-[:APPLIES]->(m1)
MERGE (p3)-[:APPLIES]->(m2)
MERGE (p4)-[:APPLIES]->(m3)
MERGE (p5)-[:APPLIES]->(m1);
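You can sanity-check the clustering claim by computing cosine similarity over the toy embeddings yourself. A minimal sketch, assuming SIMILAR_TO is cosine-based (the engine's actual similarity metric may differ):

```python
# Cosine similarity over the 5-dim toy embeddings from Step 2.
# Vectors are copied verbatim from the MERGE statements above.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

p1 = [0.61, 0.18, -0.04, 0.42, 0.21]   # PAP-001 GraphRAG-Lite
p2 = [0.62, 0.17, -0.03, 0.43, 0.22]   # PAP-002 Knowledge-aware RAG
p3 = [-0.31, 0.55, 0.10, -0.04, 0.16]  # PAP-003 HNSW quantization
p5 = [0.60, 0.19, -0.05, 0.41, 0.20]   # PAP-005 Survey

print(round(cosine(p1, p2), 3))  # well above the 0.85 threshold
print(round(cosine(p1, p5), 3))  # also clusters with the seed
print(round(cosine(p1, p3), 3))  # negative: a different topic entirely
```

PAP-002 and PAP-005 score far above 0.85 against the seed, while PAP-003 scores below zero, which is exactly the split the Step 3 query relies on.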
Step 3: Find related papers, semantic AND structural
// "Papers semantically near my seed paper that use the same dataset."
MATCH (seed:Paper {id: "PAP-001"})-[:SIMILAR_TO > 0.85]->(related:Paper)
MATCH (related)-[:USES]->(ds:Dataset)<-[:USES]-(seed)
RETURN related.id AS paper_id,
related.title AS title,
ds.name AS shared_dataset;
Step 4: Author collaboration suggestions
// "Authors who haven't co-authored with me but work on semantically similar problems."
MATCH (me:Author {name: "Sarah Chen"})-[:WROTE]->(my_paper:Paper)
MATCH (my_paper)-[:SIMILAR_TO > 0.85]->(near:Paper)<-[:WROTE]-(other:Author)
WHERE NOT (me)-[:WROTE]->()<-[:WROTE]-(other) AND me <> other
RETURN DISTINCT other.name AS potential_collaborator,
near.title AS overlapping_paper;
What's happening
- /v2/graph/extract collapses the prose-to-entity step that historically takes weeks of NER training and labeling. Provenance (paper: "PAP-001") traces every fact back to its source abstract.
- Abstract embeddings cluster naturally: PAP-001, PAP-002, and PAP-005 all sit on the hybrid-retrieval theme, while PAP-003 (HNSW) and PAP-004 (memory) drift apart. The similarity threshold is the knob that tunes cluster tightness.
- The collaborator query mixes a negated structural pattern (NOT (me)-[:WROTE]->()<-[:WROTE]-(other)) with a semantic hop. SQL would need an EXCEPT plus a vector subquery to say the same thing.
- The same shape supports literature-review agents, paper recommendation, and conflict-of-interest detection (papers that use a dataset the reviewer co-authored).
- Adding a new abstract: extract → MERGE the paper node with its embedding → every query above just works for it. No re-indexing, no schema migration.
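To make that last point concrete, here is what onboarding a sixth paper would look like. PAP-006, its title, and its embedding are invented for illustration:

// 1) Run the new abstract through /v2/graph/extract (as in Step 1),
//    with node_provenance {"paper": "PAP-006"}.
// 2) MERGE the paper node with its embedding:
MERGE (p6:Paper {id: "PAP-006",
  title: "Traversal-aware reranking for hybrid retrieval",
  abstract: "Reranks vector hits by graph distance to the query entity",
  embedding: [0.58, 0.20, -0.02, 0.40, 0.19]})
// 3) Existing queries pick it up immediately, e.g. Step 3's seed query:
MATCH (seed:Paper {id: "PAP-001"})-[:SIMILAR_TO > 0.85]->(related:Paper)
RETURN related.id, related.title;

An embedding this close to the seed's lands PAP-006 in the hybrid-retrieval cluster with no other changes.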
Try this next
MATCH (a:Author)-[:WROTE]->(p:Paper)-[:USES]->(ds:Dataset)
WITH a, ds, count(p) AS papers_on_dataset
WHERE papers_on_dataset > 1
RETURN a.name AS author, ds.name AS dataset, papers_on_dataset
ORDER BY papers_on_dataset DESC;
MATCH (m:Method)<-[:APPLIES]-(p:Paper)
WITH m, collect(p.title) AS papers, count(p) AS n
RETURN m.name AS method, n AS paper_count, papers
ORDER BY n DESC;
MATCH (seed:Paper {id: "PAP-001"})-[:SIMILAR_TO > 0.7]->(p:Paper)
RETURN p.title AS related_paper, p.abstract;