Document Embedding Search

Objective

Implement semantic search for documents using vector embeddings. This enables finding documents by meaning rather than keywords, improving search relevance and discovery.

Step 1: Create Embedded Documents Table

Create a table for documents with embeddings.

CREATE TABLE embedded_documents (
    id INTEGER PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    document PDF,
    summary TEXT,
    full_text TEXT,
    category VARCHAR(50),
    author VARCHAR(100),
    page_count INTEGER,
    title_embedding VECTOR(384),
    summary_embedding VECTOR(384),
    content_embedding VECTOR(384),
    is_indexed BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 2: Insert Sample Documents

Add documents with content for embedding.

INSERT INTO embedded_documents (id, title, summary, full_text, category, author, page_count) VALUES
    (1, 'Machine Learning in Healthcare',
        'This paper explores applications of machine learning algorithms in medical diagnosis and patient care optimization.',
        'Machine learning is revolutionizing healthcare by enabling predictive diagnostics, personalized treatment plans, and automated medical image analysis. This comprehensive review examines current applications and future directions.',
        'research', 'Dr. Smith', 35),
    (2, 'Financial Risk Management Framework',
        'A comprehensive guide to implementing risk management strategies in financial institutions.',
        'Risk management is essential for financial stability. This document outlines frameworks for credit risk, market risk, operational risk assessment and mitigation strategies for banking institutions.',
        'finance', 'Risk Team', 48),
    (3, 'Software Architecture Patterns',
        'Overview of modern software architecture patterns including microservices and event-driven design.',
        'Modern software systems require scalable architectures. This guide covers microservices, event-driven architecture, CQRS, and domain-driven design patterns with practical implementation examples.',
        'technical', 'Tech Architects', 62),
    (4, 'Employee Performance Management',
        'Best practices for evaluating and developing employee performance in organizations.',
        'Effective performance management drives organizational success. This handbook covers goal setting, feedback mechanisms, performance reviews, and professional development planning.',
        'hr', 'HR Department', 28),
    (5, 'Data Privacy Compliance Guide',
        'Guidelines for ensuring compliance with data privacy regulations including GDPR and CCPA.',
        'Data privacy is a critical concern for modern organizations. This guide provides comprehensive coverage of GDPR, CCPA, and other privacy regulations with compliance checklists.',
        'legal', 'Compliance Team', 42),
    (6, 'Cloud Migration Strategy',
        'Strategic planning guide for migrating enterprise applications to cloud infrastructure.',
        'Cloud migration requires careful planning. This document covers assessment, planning, migration strategies, and optimization for AWS, Azure, and GCP deployments.',
        'technical', 'Cloud Team', 55),
    (7, 'Customer Experience Optimization',
        'Strategies for improving customer experience through data-driven insights.',
        'Customer experience is key to business success. This report examines customer journey mapping, feedback analysis, and experience optimization techniques using analytics.',
        'marketing', 'CX Team', 38),
    (8, 'Cybersecurity Best Practices',
        'Essential cybersecurity practices for protecting organizational assets and data.',
        'Cybersecurity threats continue to evolve. This manual covers threat assessment, security controls, incident response, and security awareness training programs.',
        'security', 'Security Team', 45);

Step 3: Generate Embeddings

Create embeddings for document fields.

UPDATE embedded_documents
SET title_embedding = EMBED(title),
    summary_embedding = EMBED(summary),
    content_embedding = EMBED(full_text),
    is_indexed = TRUE
WHERE is_indexed = FALSE;

Step 4: Semantic Search by Query

Search documents using natural language.

SELECT
    title,
    category,
    author,
    COSINE_SIMILARITY(summary_embedding, EMBED('healthcare and medical applications')) as relevance
FROM embedded_documents
WHERE is_indexed = TRUE
ORDER BY relevance DESC
LIMIT 5;

Step 5: Multi-Field Semantic Search

Search across title, summary, and content.

SELECT
    title,
    category,
    COSINE_SIMILARITY(title_embedding, EMBED('data security and privacy')) * 0.3 +
    COSINE_SIMILARITY(summary_embedding, EMBED('data security and privacy')) * 0.4 +
    COSINE_SIMILARITY(content_embedding, EMBED('data security and privacy')) * 0.3 as combined_score
FROM embedded_documents
WHERE is_indexed = TRUE
ORDER BY combined_score DESC
LIMIT 5;

Step 6: Find Similar Documents

Get documents similar to a specific one.

SELECT
    b.title,
    b.category,
    COSINE_SIMILARITY(a.content_embedding, b.content_embedding) as similarity
FROM embedded_documents a
CROSS JOIN embedded_documents b
WHERE a.id = 3
  AND b.id != 3
  AND a.is_indexed = TRUE
  AND b.is_indexed = TRUE
ORDER BY similarity DESC
LIMIT 5;

Step 7: Category-Filtered Search

Search within a specific category.

SELECT
    title,
    summary,
    COSINE_SIMILARITY(summary_embedding, EMBED('software design and development')) as relevance
FROM embedded_documents
WHERE category = 'technical'
  AND is_indexed = TRUE
ORDER BY relevance DESC;

Step 8: Document Clusters

Find document clusters by similarity.

SELECT
    a.title as document_1,
    b.title as document_2,
    a.category as category_1,
    b.category as category_2,
    COSINE_SIMILARITY(a.content_embedding, b.content_embedding) as similarity
FROM embedded_documents a
CROSS JOIN embedded_documents b
WHERE a.id < b.id
  AND a.is_indexed = TRUE
  AND b.is_indexed = TRUE
  AND COSINE_SIMILARITY(a.content_embedding, b.content_embedding) > 0.5
ORDER BY similarity DESC;

Step 9: Topic Discovery

Find documents matching specific topics.

SELECT
    title,
    category,
    COSINE_SIMILARITY(content_embedding, EMBED('risk management and compliance')) as risk_relevance,
    COSINE_SIMILARITY(content_embedding, EMBED('technology and software')) as tech_relevance
FROM embedded_documents
WHERE is_indexed = TRUE
ORDER BY risk_relevance DESC;

Step 10: Search with Threshold

Get only highly relevant documents.

SELECT
    title,
    category,
    author,
    page_count,
    COSINE_SIMILARITY(summary_embedding, EMBED('cloud computing infrastructure')) as relevance
FROM embedded_documents
WHERE is_indexed = TRUE
  AND COSINE_SIMILARITY(summary_embedding, EMBED('cloud computing infrastructure')) > 0.4
ORDER BY relevance DESC;

Step 11: Cross-Category Recommendations

Recommend documents from other categories.

SELECT
    source.title as if_you_liked,
    recommended.title as you_might_like,
    recommended.category,
    COSINE_SIMILARITY(source.content_embedding, recommended.content_embedding) as match_score
FROM embedded_documents source
CROSS JOIN embedded_documents recommended
WHERE source.id = 1
  AND recommended.id != 1
  AND source.category != recommended.category
  AND source.is_indexed = TRUE
  AND recommended.is_indexed = TRUE
ORDER BY match_score DESC
LIMIT 3;

Step 12: Search Quality Analysis

Analyze search relevance distribution.

SELECT
    category,
    AVG(COSINE_SIMILARITY(summary_embedding, EMBED('organizational management'))) as avg_relevance,
    COUNT(*) as doc_count
FROM embedded_documents
WHERE is_indexed = TRUE
GROUP BY category
ORDER BY avg_relevance DESC;

Cleanup (Optional)

DROP TABLE IF EXISTS embedded_documents;

Expected Outcomes

Semantic search returns relevant documents
Multi-field search improves accuracy
Similar document discovery works
Category filtering combines with embeddings
Recommendations generated

Search Weights

Field	Weight	Rationale
Title	0.30	Concise topic
Summary	0.40	Key concepts
Content	0.30	Full detail

Key Concepts Learned

Vector embeddings for documents
COSINE_SIMILARITY search
Multi-field weighted search
Document similarity matching
Topic-based discovery

Document Embedding Search

Document Embedding Search

Objective

Step 1: Create Embedded Documents Table

Step 2: Insert Sample Documents

Step 3: Generate Embeddings

Step 4: Semantic Search by Query

Step 5: Multi-Field Semantic Search

Step 6: Find Similar Documents

Step 7: Category-Filtered Search

Step 8: Document Clusters

Step 9: Topic Discovery

Step 10: Search with Threshold

Step 11: Cross-Category Recommendations

Step 12: Search Quality Analysis

Cleanup (Optional)

Expected Outcomes

Search Weights

Key Concepts Learned

Run this on your own machine