Document Embedding Search

Use vector embeddings for semantic document search and similarity matching

All recipes· document-processing· 15 minutesadvanced

Document Embedding Search

Objective

Implement semantic search for documents using vector embeddings. This enables finding documents by meaning rather than keywords, improving search relevance and discovery.

Step 1: Create Embedded Documents Table

Create a table for documents with embeddings.

CREATE TABLE embedded_documents (
    id INTEGER PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    document PDF,
    summary TEXT,
    full_text TEXT,
    category VARCHAR(50),
    author VARCHAR(100),
    page_count INTEGER,
    title_embedding VECTOR(384),
    summary_embedding VECTOR(384),
    content_embedding VECTOR(384),
    is_indexed BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 2: Insert Sample Documents

Add documents with content for embedding.

INSERT INTO embedded_documents (id, title, summary, full_text, category, author, page_count) VALUES
    (1, 'Machine Learning in Healthcare',
        'This paper explores applications of machine learning algorithms in medical diagnosis and patient care optimization.',
        'Machine learning is revolutionizing healthcare by enabling predictive diagnostics, personalized treatment plans, and automated medical image analysis. This comprehensive review examines current applications and future directions.',
        'research', 'Dr. Smith', 35),
    (2, 'Financial Risk Management Framework',
        'A comprehensive guide to implementing risk management strategies in financial institutions.',
        'Risk management is essential for financial stability. This document outlines frameworks for credit risk, market risk, operational risk assessment and mitigation strategies for banking institutions.',
        'finance', 'Risk Team', 48),
    (3, 'Software Architecture Patterns',
        'Overview of modern software architecture patterns including microservices and event-driven design.',
        'Modern software systems require scalable architectures. This guide covers microservices, event-driven architecture, CQRS, and domain-driven design patterns with practical implementation examples.',
        'technical', 'Tech Architects', 62),
    (4, 'Employee Performance Management',
        'Best practices for evaluating and developing employee performance in organizations.',
        'Effective performance management drives organizational success. This handbook covers goal setting, feedback mechanisms, performance reviews, and professional development planning.',
        'hr', 'HR Department', 28),
    (5, 'Data Privacy Compliance Guide',
        'Guidelines for ensuring compliance with data privacy regulations including GDPR and CCPA.',
        'Data privacy is a critical concern for modern organizations. This guide provides comprehensive coverage of GDPR, CCPA, and other privacy regulations with compliance checklists.',
        'legal', 'Compliance Team', 42),
    (6, 'Cloud Migration Strategy',
        'Strategic planning guide for migrating enterprise applications to cloud infrastructure.',
        'Cloud migration requires careful planning. This document covers assessment, planning, migration strategies, and optimization for AWS, Azure, and GCP deployments.',
        'technical', 'Cloud Team', 55),
    (7, 'Customer Experience Optimization',
        'Strategies for improving customer experience through data-driven insights.',
        'Customer experience is key to business success. This report examines customer journey mapping, feedback analysis, and experience optimization techniques using analytics.',
        'marketing', 'CX Team', 38),
    (8, 'Cybersecurity Best Practices',
        'Essential cybersecurity practices for protecting organizational assets and data.',
        'Cybersecurity threats continue to evolve. This manual covers threat assessment, security controls, incident response, and security awareness training programs.',
        'security', 'Security Team', 45);

Step 3: Generate Embeddings

Create embeddings for document fields.

UPDATE embedded_documents
SET title_embedding = EMBED(title),
    summary_embedding = EMBED(summary),
    content_embedding = EMBED(full_text),
    is_indexed = TRUE
WHERE is_indexed = FALSE;

Step 4: Semantic Search by Query

Search documents using natural language.

SELECT
    title,
    category,
    author,
    COSINE_SIMILARITY(summary_embedding, EMBED('healthcare and medical applications')) as relevance
FROM embedded_documents
WHERE is_indexed = TRUE
ORDER BY relevance DESC
LIMIT 5;

Step 5: Multi-Field Semantic Search

Search across title, summary, and content.

SELECT
    title,
    category,
    COSINE_SIMILARITY(title_embedding, EMBED('data security and privacy')) * 0.3 +
    COSINE_SIMILARITY(summary_embedding, EMBED('data security and privacy')) * 0.4 +
    COSINE_SIMILARITY(content_embedding, EMBED('data security and privacy')) * 0.3 as combined_score
FROM embedded_documents
WHERE is_indexed = TRUE
ORDER BY combined_score DESC
LIMIT 5;

Step 6: Find Similar Documents

Get documents similar to a specific one.

SELECT
    b.title,
    b.category,
    COSINE_SIMILARITY(a.content_embedding, b.content_embedding) as similarity
FROM embedded_documents a
CROSS JOIN embedded_documents b
WHERE a.id = 3
  AND b.id != 3
  AND a.is_indexed = TRUE
  AND b.is_indexed = TRUE
ORDER BY similarity DESC
LIMIT 5;

Step 7: Category-Filtered Search

Search within a specific category.

SELECT
    title,
    summary,
    COSINE_SIMILARITY(summary_embedding, EMBED('software design and development')) as relevance
FROM embedded_documents
WHERE category = 'technical'
  AND is_indexed = TRUE
ORDER BY relevance DESC;

Step 8: Document Clusters

Find document clusters by similarity.

SELECT
    a.title as document_1,
    b.title as document_2,
    a.category as category_1,
    b.category as category_2,
    COSINE_SIMILARITY(a.content_embedding, b.content_embedding) as similarity
FROM embedded_documents a
CROSS JOIN embedded_documents b
WHERE a.id < b.id
  AND a.is_indexed = TRUE
  AND b.is_indexed = TRUE
  AND COSINE_SIMILARITY(a.content_embedding, b.content_embedding) > 0.5
ORDER BY similarity DESC;

Step 9: Topic Discovery

Find documents matching specific topics.

SELECT
    title,
    category,
    COSINE_SIMILARITY(content_embedding, EMBED('risk management and compliance')) as risk_relevance,
    COSINE_SIMILARITY(content_embedding, EMBED('technology and software')) as tech_relevance
FROM embedded_documents
WHERE is_indexed = TRUE
ORDER BY risk_relevance DESC;

Step 10: Search with Threshold

Get only highly relevant documents.

SELECT
    title,
    category,
    author,
    page_count,
    COSINE_SIMILARITY(summary_embedding, EMBED('cloud computing infrastructure')) as relevance
FROM embedded_documents
WHERE is_indexed = TRUE
  AND COSINE_SIMILARITY(summary_embedding, EMBED('cloud computing infrastructure')) > 0.4
ORDER BY relevance DESC;

Step 11: Cross-Category Recommendations

Recommend documents from other categories.

SELECT
    source.title as if_you_liked,
    recommended.title as you_might_like,
    recommended.category,
    COSINE_SIMILARITY(source.content_embedding, recommended.content_embedding) as match_score
FROM embedded_documents source
CROSS JOIN embedded_documents recommended
WHERE source.id = 1
  AND recommended.id != 1
  AND source.category != recommended.category
  AND source.is_indexed = TRUE
  AND recommended.is_indexed = TRUE
ORDER BY match_score DESC
LIMIT 3;

Step 12: Search Quality Analysis

Analyze search relevance distribution.

SELECT
    category,
    AVG(COSINE_SIMILARITY(summary_embedding, EMBED('organizational management'))) as avg_relevance,
    COUNT(*) as doc_count
FROM embedded_documents
WHERE is_indexed = TRUE
GROUP BY category
ORDER BY avg_relevance DESC;

Cleanup (Optional)

DROP TABLE IF EXISTS embedded_documents;

Expected Outcomes

  • Semantic search returns relevant documents
  • Multi-field search improves accuracy
  • Similar document discovery works
  • Category filtering combines with embeddings
  • Recommendations generated

Search Weights

Field Weight Rationale
Title 0.30 Concise topic
Summary 0.40 Key concepts
Content 0.30 Full detail

Key Concepts Learned

  • Vector embeddings for documents
  • COSINE_SIMILARITY search
  • Multi-field weighted search
  • Document similarity matching
  • Topic-based discovery

Tags

sqladvancedpdfembeddingssemantic-searchaivectors

Run this on your own machine

Install SynapCores Community Edition free, paste the SQL or Cypher above into the bundled web UI, and watch it run.

Download Free CE