Literature review graph from paper abstracts
Objective
A new researcher coming into a field has two questions: who studies what (entities and links) and which papers cluster on the same idea (semantic groups). The first is structural — extract authors, datasets, and methods from abstracts. The second is semantic — cluster papers by abstract embeddings. The wow moment: paste five abstracts, get back a graph that supports "all papers that use dataset X" and "papers semantically near the seed paper" with the same Cypher syntax.
Step 1: Extract authors, methods, datasets from abstracts
curl -X POST https://localhost:8443/v2/graph/extract \
-H "Authorization: Bearer $AIDB_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"text": "We introduce GraphRAG-Lite, a retrieval-augmented generation system that fuses vector search with graph traversal. Authors Sarah Chen and Raj Patel evaluate on the HotpotQA benchmark and beat the LangChain baseline by 12%. The implementation uses LlamaIndex and a Neo4j-compatible backend.",
"default_node_label": "ResearchEntity",
"node_provenance": {"paper": "PAP-001"},
"edge_provenance": {"paper": "PAP-001"},
"min_confidence": 0.55
}'
Step 2: Pre-seed papers + abstract embeddings
MERGE (p1:Paper {id: "PAP-001",
title: "GraphRAG-Lite: hybrid vector + graph retrieval",
abstract: "Hybrid retrieval combining vector search and graph traversal, evaluated on HotpotQA",
embedding: [0.61, 0.18, -0.04, 0.42, 0.21]})
MERGE (p2:Paper {id: "PAP-002",
title: "Knowledge-aware RAG for enterprise search",
abstract: "RAG system that walks knowledge graph relations to improve multi-hop QA accuracy",
embedding: [0.62, 0.17, -0.03, 0.43, 0.22]})
MERGE (p3:Paper {id: "PAP-003",
title: "Memory-efficient HNSW for billion-scale vectors",
abstract: "We propose a quantization scheme that shrinks HNSW indexes by 6x with no recall loss",
embedding: [-0.31, 0.55, 0.10, -0.04, 0.16]})
MERGE (p4:Paper {id: "PAP-004",
title: "Agentic memory for long-horizon LLM tasks",
abstract: "An episodic memory layer for agents that recalls past sessions semantically",
embedding: [0.40, -0.15, 0.21, 0.32, -0.08]})
MERGE (p5:Paper {id: "PAP-005",
title: "Hybrid graph + vector retrieval, a survey",
abstract: "Survey of techniques combining graph traversal with vector retrieval, covers HotpotQA",
embedding: [0.60, 0.19, -0.05, 0.41, 0.20]})
MERGE (auth1:Author {name: "Sarah Chen"})
MERGE (auth2:Author {name: "Raj Patel"})
MERGE (auth3:Author {name: "Mia Rossi"})
MERGE (auth4:Author {name: "Leo Park"})
MERGE (ds1:Dataset {name: "HotpotQA"})
MERGE (ds2:Dataset {name: "MS MARCO"})
MERGE (m1:Method {name: "Hybrid Retrieval"})
MERGE (m2:Method {name: "HNSW Quantization"})
MERGE (m3:Method {name: "Episodic Memory"})
MERGE (auth1)-[:WROTE]->(p1)
MERGE (auth2)-[:WROTE]->(p1)
MERGE (auth3)-[:WROTE]->(p2)
MERGE (auth4)-[:WROTE]->(p3)
MERGE (auth1)-[:WROTE]->(p5)
MERGE (p1)-[:USES]->(ds1)
MERGE (p2)-[:USES]->(ds1)
MERGE (p5)-[:USES]->(ds1)
MERGE (p3)-[:USES]->(ds2)
MERGE (p1)-[:APPLIES]->(m1)
MERGE (p2)-[:APPLIES]->(m1)
MERGE (p3)-[:APPLIES]->(m2)
MERGE (p4)-[:APPLIES]->(m3)
MERGE (p5)-[:APPLIES]->(m1);
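You can sanity-check the clustering claim by computing cosine similarity over the toy embeddings yourself. A minimal sketch, assuming SIMILAR_TO is cosine-based (the engine's actual similarity metric may differ):

```python
# Cosine similarity over the 5-dim toy embeddings from Step 2.
# Vectors are copied verbatim from the MERGE statements above.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

p1 = [0.61, 0.18, -0.04, 0.42, 0.21]   # PAP-001 GraphRAG-Lite
p2 = [0.62, 0.17, -0.03, 0.43, 0.22]   # PAP-002 Knowledge-aware RAG
p3 = [-0.31, 0.55, 0.10, -0.04, 0.16]  # PAP-003 HNSW quantization
p5 = [0.60, 0.19, -0.05, 0.41, 0.20]   # PAP-005 Survey

print(round(cosine(p1, p2), 3))  # well above the 0.85 threshold
print(round(cosine(p1, p5), 3))  # also clusters with the seed
print(round(cosine(p1, p3), 3))  # negative: a different topic entirely
```

PAP-002 and PAP-005 score far above 0.85 against the seed, while PAP-003 scores below zero, which is exactly the split the Step 3 query relies on.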
Step 3: Find related papers, semantic AND structural
// "Papers semantically near my seed paper that use the same dataset."
MATCH (seed:Paper {id: "PAP-001"})-[:SIMILAR_TO > 0.85]->(related:Paper)
MATCH (related)-[:USES]->(ds:Dataset)<-[:USES]-(seed)
RETURN related.id AS paper_id,
related.title AS title,
ds.name AS shared_dataset;
Step 4: Author collaboration suggestions
// "Authors who haven't co-authored with me but work on semantically similar problems."
MATCH (me:Author {name: "Sarah Chen"})-[:WROTE]->(my_paper:Paper)
MATCH (my_paper)-[:SIMILAR_TO > 0.85]->(near:Paper)<-[:WROTE]-(other:Author)
WHERE NOT (me)-[:WROTE]->()<-[:WROTE]-(other) AND me <> other
RETURN DISTINCT other.name AS potential_collaborator,
near.title AS overlapping_paper;
What's happening
- /v2/graph/extract collapses the prose-to-entity step that historically takes weeks of NER training and labeling. Provenance (paper: "PAP-001") traces every fact back to its source abstract.
- Abstract embeddings cluster naturally: PAP-001, PAP-002, and PAP-005 all sit on the hybrid-retrieval theme, while PAP-003 (HNSW) and PAP-004 (memory) drift apart. The similarity threshold is the knob that tunes cluster tightness.
- The collaborator query mixes a negated structural pattern (NOT (me)-[:WROTE]->()<-[:WROTE]-(other)) with a semantic hop. SQL would need an EXCEPT plus a vector subquery to say the same thing.
- The same shape supports literature-review agents, paper recommendation, and conflict-of-interest detection (papers that use a dataset the reviewer co-authored).
- Adding a new abstract: extract → MERGE the paper node with its embedding → every query above just works for it. No re-indexing, no schema migration.
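To make that last point concrete, here is what onboarding a sixth paper would look like. PAP-006, its title, and its embedding are invented for illustration:

// 1) Run the new abstract through /v2/graph/extract (as in Step 1),
//    with node_provenance {"paper": "PAP-006"}.
// 2) MERGE the paper node with its embedding:
MERGE (p6:Paper {id: "PAP-006",
  title: "Traversal-aware reranking for hybrid retrieval",
  abstract: "Reranks vector hits by graph distance to the query entity",
  embedding: [0.58, 0.20, -0.02, 0.40, 0.19]})
// 3) Existing queries pick it up immediately, e.g. Step 3's seed query:
MATCH (seed:Paper {id: "PAP-001"})-[:SIMILAR_TO > 0.85]->(related:Paper)
RETURN related.id, related.title;

An embedding this close to the seed's lands PAP-006 in the hybrid-retrieval cluster with no other changes.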
Try this next
MATCH (a:Author)-[:WROTE]->(p:Paper)-[:USES]->(ds:Dataset)
WITH a, ds, count(p) AS papers_on_dataset
WHERE papers_on_dataset > 1
RETURN a.name AS author, ds.name AS dataset, papers_on_dataset
ORDER BY papers_on_dataset DESC;
MATCH (m:Method)<-[:APPLIES]-(p:Paper)
WITH m, collect(p.title) AS papers, count(p) AS n
RETURN m.name AS method, n AS paper_count, papers
ORDER BY n DESC;
MATCH (seed:Paper {id: "PAP-001"})-[:SIMILAR_TO > 0.7]->(p:Paper)
RETURN p.title AS related_paper, p.abstract;