Semantic Document Search

Build intelligent document search using text embeddings for meaning-based retrieval

All recipes· advanced-patterns· 15 minutesadvanced

Semantic Document Search

Objective

Create a semantic document search system that understands document meaning and context, enabling users to find relevant documents based on concepts rather than exact keyword matches.

Step 1: Create Document Repository Table

Store documents with embeddings.

CREATE TABLE document_repository (
    id INTEGER PRIMARY KEY,
    doc_code VARCHAR(50) NOT NULL UNIQUE,
    title VARCHAR(300) NOT NULL,
    summary TEXT,
    full_content TEXT,
    document_file PDF,
    category VARCHAR(100),
    subcategory VARCHAR(100),
    document_type VARCHAR(50),
    author VARCHAR(200),
    department VARCHAR(100),
    page_count INTEGER,
    word_count INTEGER,
    language VARCHAR(20) DEFAULT 'en',
    title_embedding VECTOR(384),
    summary_embedding VECTOR(384),
    content_embedding VECTOR(384),
    indexed_at TIMESTAMP,
    published_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 2: Create Document Chunks Table

Store chunked content for better retrieval.

CREATE TABLE document_chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    chunk_number INTEGER,
    chunk_text TEXT,
    chunk_embedding VECTOR(384),
    start_position INTEGER,
    end_position INTEGER,
    page_number INTEGER,
    section_title VARCHAR(200),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (document_id) REFERENCES document_repository(id)
);

Step 3: Create Search Queries Table

Log search requests.

CREATE TABLE document_searches (
    id INTEGER PRIMARY KEY,
    user_id VARCHAR(50),
    query_text TEXT,
    query_embedding VECTOR(384),
    search_type VARCHAR(50),
    category_filter VARCHAR(100),
    results_returned INTEGER,
    search_time_ms INTEGER,
    clicked_doc_id INTEGER,
    click_position INTEGER,
    searched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (clicked_doc_id) REFERENCES document_repository(id)
);

Step 4: Create Related Documents Table

Track document relationships.

CREATE TABLE related_documents (
    id INTEGER PRIMARY KEY,
    source_doc_id INTEGER NOT NULL,
    related_doc_id INTEGER NOT NULL,
    relationship_type VARCHAR(50),
    similarity_score DECIMAL(5, 4),
    is_auto_generated BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (source_doc_id) REFERENCES document_repository(id),
    FOREIGN KEY (related_doc_id) REFERENCES document_repository(id)
);

Step 5: Insert Sample Documents

Add document catalog.

INSERT INTO document_repository (id, doc_code, title, summary, category, subcategory, document_type, author, department, page_count, word_count, published_at) VALUES
    (1, 'DOC-HR-001', 'Employee Handbook 2024', 'Comprehensive guide to company policies, benefits, and workplace guidelines for all employees', 'HR', 'Policies', 'handbook', 'HR Team', 'Human Resources', 85, 25000, '2024-01-01 00:00:00'),
    (2, 'DOC-HR-002', 'Remote Work Policy', 'Guidelines for remote and hybrid work arrangements including eligibility, expectations, and equipment policies', 'HR', 'Policies', 'policy', 'HR Team', 'Human Resources', 12, 3500, '2024-01-15 00:00:00'),
    (3, 'DOC-IT-001', 'Information Security Policy', 'Security requirements for protecting company data, systems, and user accounts', 'IT', 'Security', 'policy', 'IT Security', 'Information Technology', 28, 8500, '2024-01-10 00:00:00'),
    (4, 'DOC-IT-002', 'Software Development Standards', 'Coding standards, best practices, and development workflow guidelines for engineering teams', 'IT', 'Development', 'standard', 'Engineering', 'Information Technology', 45, 15000, '2024-01-05 00:00:00'),
    (5, 'DOC-FIN-001', 'Expense Reimbursement Guide', 'Procedures for submitting and approving business expense reimbursements', 'Finance', 'Procedures', 'guide', 'Finance Team', 'Finance', 15, 4200, '2024-01-08 00:00:00'),
    (6, 'DOC-FIN-002', 'Travel Policy', 'Guidelines for business travel including booking procedures, per diems, and approvals', 'Finance', 'Policies', 'policy', 'Finance Team', 'Finance', 18, 5500, '2024-01-12 00:00:00'),
    (7, 'DOC-PROD-001', 'Product Requirements Document', 'Detailed requirements specification for Product X version 3.0 release', 'Product', 'Requirements', 'prd', 'Product Team', 'Product', 35, 12000, '2024-01-20 00:00:00'),
    (8, 'DOC-PROD-002', 'User Research Findings', 'Summary of user research conducted for mobile app redesign project', 'Product', 'Research', 'report', 'UX Research', 'Product', 42, 14000, '2024-01-18 00:00:00'),
    (9, 'DOC-SALES-001', 'Sales Playbook', 'Comprehensive sales methodology and tactics guide for the sales team', 'Sales', 'Training', 'playbook', 'Sales Enablement', 'Sales', 65, 22000, '2024-01-22 00:00:00'),
    (10, 'DOC-LEGAL-001', 'Data Privacy Compliance Guide', 'Guide to GDPR, CCPA, and other data privacy regulations compliance', 'Legal', 'Compliance', 'guide', 'Legal Team', 'Legal', 55, 18000, '2024-01-25 00:00:00');

Step 6: Generate Document Embeddings

Create vector representations.

-- Generate embeddings for document titles
UPDATE document_repository
SET title_embedding = EMBED(title)
WHERE title_embedding IS NULL;

-- Generate embeddings for summaries
UPDATE document_repository
SET summary_embedding = EMBED(summary)
WHERE summary_embedding IS NULL;

-- Verify embedding generation
SELECT
    doc_code,
    title,
    CASE WHEN title_embedding IS NOT NULL THEN 'Yes' ELSE 'No' END as title_embedded,
    CASE WHEN summary_embedding IS NOT NULL THEN 'Yes' ELSE 'No' END as summary_embedded
FROM document_repository
ORDER BY id;

Step 7: Insert Document Chunks

Break documents into searchable pieces.

INSERT INTO document_chunks (id, document_id, chunk_number, chunk_text, page_number, section_title) VALUES
    (1, 1, 1, 'Welcome to our company. This handbook provides essential information about our policies, culture, and expectations for all employees.', 1, 'Introduction'),
    (2, 1, 2, 'Our benefits package includes comprehensive health insurance, 401k matching, paid time off, and professional development allowances.', 5, 'Benefits Overview'),
    (3, 1, 3, 'We are committed to providing a safe and inclusive workplace free from discrimination and harassment of any kind.', 12, 'Workplace Safety'),
    (4, 2, 1, 'Remote work is available to eligible employees based on role requirements and manager approval. Hybrid arrangements typically include 2-3 days in office.', 1, 'Remote Work Eligibility'),
    (5, 2, 2, 'Remote employees must maintain a dedicated workspace with reliable internet connectivity and appropriate security measures.', 4, 'Workspace Requirements'),
    (6, 3, 1, 'All employees must use strong passwords and multi-factor authentication for accessing company systems and data.', 2, 'Access Security'),
    (7, 3, 2, 'Confidential information must be encrypted when stored or transmitted. Do not share credentials or sensitive data via unsecured channels.', 8, 'Data Protection'),
    (8, 4, 1, 'All code must follow our established coding standards including consistent naming conventions, documentation, and testing requirements.', 3, 'Coding Standards'),
    (9, 4, 2, 'Pull requests require at least two approving reviews before merging. All tests must pass in the CI pipeline.', 15, 'Code Review Process'),
    (10, 5, 1, 'Business expenses must be submitted within 30 days of incurrence with appropriate receipts and approval documentation.', 2, 'Submission Requirements');

-- Generate embeddings for chunks
UPDATE document_chunks
SET chunk_embedding = EMBED(chunk_text)
WHERE chunk_embedding IS NULL;

Step 8: Semantic Document Search

Find documents by meaning.

SELECT
    dr.doc_code,
    dr.title,
    dr.summary,
    dr.category,
    dr.department,
    COSINE_SIMILARITY(dr.summary_embedding, EMBED('working from home guidelines and requirements')) as relevance
FROM document_repository dr
WHERE dr.summary_embedding IS NOT NULL
ORDER BY relevance DESC
LIMIT 5;

Step 9: Search Within Document Chunks

Find specific sections.

SELECT
    dr.title as document,
    dc.section_title,
    dc.chunk_text,
    dc.page_number,
    COSINE_SIMILARITY(dc.chunk_embedding, EMBED('password security and authentication')) as relevance
FROM document_chunks dc
INNER JOIN document_repository dr ON dc.document_id = dr.id
WHERE dc.chunk_embedding IS NOT NULL
ORDER BY relevance DESC
LIMIT 5;

Step 10: Category-Scoped Search

Search within specific categories.

SELECT
    dr.doc_code,
    dr.title,
    dr.subcategory,
    COSINE_SIMILARITY(dr.summary_embedding, EMBED('employee benefits and compensation')) as relevance
FROM document_repository dr
WHERE dr.category = 'HR'
  AND dr.summary_embedding IS NOT NULL
ORDER BY relevance DESC
LIMIT 5;

Step 11: Find Related Documents

Discover similar documents.

-- Find documents similar to Employee Handbook
SELECT
    target.doc_code,
    target.title,
    target.category,
    COSINE_SIMILARITY(source.summary_embedding, target.summary_embedding) as similarity
FROM document_repository source
CROSS JOIN document_repository target
WHERE source.doc_code = 'DOC-HR-001'
  AND target.doc_code != 'DOC-HR-001'
  AND source.summary_embedding IS NOT NULL
  AND target.summary_embedding IS NOT NULL
ORDER BY similarity DESC
LIMIT 5;

Step 12: Log Searches

Track search activity.

INSERT INTO document_searches (user_id, query_text, search_type, category_filter, results_returned, search_time_ms)
VALUES ('USR-001', 'working from home guidelines and requirements', 'semantic', NULL, 5, 145);

INSERT INTO document_searches (user_id, query_text, search_type, category_filter, results_returned, search_time_ms, clicked_doc_id, click_position)
VALUES ('USR-002', 'password security best practices', 'semantic', 'IT', 3, 98, 3, 1);

Step 13: Build Document Relationships

Create automatic links.

-- Auto-generate related documents based on similarity
INSERT INTO related_documents (source_doc_id, related_doc_id, relationship_type, similarity_score)
SELECT
    source.id,
    target.id,
    'similar_content',
    COSINE_SIMILARITY(source.summary_embedding, target.summary_embedding)
FROM document_repository source
CROSS JOIN document_repository target
WHERE source.id < target.id
  AND source.summary_embedding IS NOT NULL
  AND target.summary_embedding IS NOT NULL
  AND COSINE_SIMILARITY(source.summary_embedding, target.summary_embedding) > 0.6;

Step 14: Search Analytics

Analyze search patterns.

SELECT
    category_filter,
    COUNT(*) as search_count,
    AVG(results_returned) as avg_results,
    AVG(search_time_ms) as avg_time_ms,
    COUNT(clicked_doc_id) as clicks,
    AVG(click_position) as avg_click_position
FROM document_searches
WHERE searched_at >= CURRENT_DATE - 30
GROUP BY category_filter
ORDER BY search_count DESC;

Step 15: Document Recommendations

Suggest related reading.

SELECT
    dr.doc_code,
    dr.title,
    dr.category,
    rd.similarity_score
FROM related_documents rd
INNER JOIN document_repository dr ON rd.related_doc_id = dr.id
WHERE rd.source_doc_id = 1
ORDER BY rd.similarity_score DESC
LIMIT 5;

Cleanup (Optional)

DROP TABLE IF EXISTS related_documents;
DROP TABLE IF EXISTS document_searches;
DROP TABLE IF EXISTS document_chunks;
DROP TABLE IF EXISTS document_repository;

Expected Outcomes

  • Documents indexed with embeddings
  • Semantic search enabled
  • Chunk-level retrieval works
  • Relationships discovered
  • Search analytics tracked

Search Quality Tips

Technique Use Case
Summary search Quick document discovery
Chunk search Finding specific sections
Title search Known document lookup
Combined search Comprehensive results

Key Concepts Learned

  • Document embedding
  • Chunk-based retrieval
  • Semantic similarity
  • Search logging
  • Document relationships

Tags

sqladvancedpdfvectorsimilarityainlpsearch

Run this on your own machine

Install SynapCores Community Edition free, paste the SQL or Cypher above into the bundled web UI, and watch it run.

Download Free CE