Document Embedding Search
Objective
Implement semantic search for documents using vector embeddings. This enables finding documents by meaning rather than keywords, improving search relevance and discovery.
Step 1: Create Embedded Documents Table
Create a table for documents with embeddings.
CREATE TABLE embedded_documents (
id INTEGER PRIMARY KEY,
title VARCHAR(255) NOT NULL,
document PDF,
summary TEXT,
full_text TEXT,
category VARCHAR(50),
author VARCHAR(100),
page_count INTEGER,
title_embedding VECTOR(384),
summary_embedding VECTOR(384),
content_embedding VECTOR(384),
is_indexed BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Step 2: Insert Sample Documents
Add documents with content for embedding.
INSERT INTO embedded_documents (id, title, summary, full_text, category, author, page_count) VALUES
(1, 'Machine Learning in Healthcare',
'This paper explores applications of machine learning algorithms in medical diagnosis and patient care optimization.',
'Machine learning is revolutionizing healthcare by enabling predictive diagnostics, personalized treatment plans, and automated medical image analysis. This comprehensive review examines current applications and future directions.',
'research', 'Dr. Smith', 35),
(2, 'Financial Risk Management Framework',
'A comprehensive guide to implementing risk management strategies in financial institutions.',
'Risk management is essential for financial stability. This document outlines frameworks for credit risk, market risk, operational risk assessment and mitigation strategies for banking institutions.',
'finance', 'Risk Team', 48),
(3, 'Software Architecture Patterns',
'Overview of modern software architecture patterns including microservices and event-driven design.',
'Modern software systems require scalable architectures. This guide covers microservices, event-driven architecture, CQRS, and domain-driven design patterns with practical implementation examples.',
'technical', 'Tech Architects', 62),
(4, 'Employee Performance Management',
'Best practices for evaluating and developing employee performance in organizations.',
'Effective performance management drives organizational success. This handbook covers goal setting, feedback mechanisms, performance reviews, and professional development planning.',
'hr', 'HR Department', 28),
(5, 'Data Privacy Compliance Guide',
'Guidelines for ensuring compliance with data privacy regulations including GDPR and CCPA.',
'Data privacy is a critical concern for modern organizations. This guide provides comprehensive coverage of GDPR, CCPA, and other privacy regulations with compliance checklists.',
'legal', 'Compliance Team', 42),
(6, 'Cloud Migration Strategy',
'Strategic planning guide for migrating enterprise applications to cloud infrastructure.',
'Cloud migration requires careful planning. This document covers assessment, planning, migration strategies, and optimization for AWS, Azure, and GCP deployments.',
'technical', 'Cloud Team', 55),
(7, 'Customer Experience Optimization',
'Strategies for improving customer experience through data-driven insights.',
'Customer experience is key to business success. This report examines customer journey mapping, feedback analysis, and experience optimization techniques using analytics.',
'marketing', 'CX Team', 38),
(8, 'Cybersecurity Best Practices',
'Essential cybersecurity practices for protecting organizational assets and data.',
'Cybersecurity threats continue to evolve. This manual covers threat assessment, security controls, incident response, and security awareness training programs.',
'security', 'Security Team', 45);
Step 3: Generate Embeddings
Create embeddings for document fields.
UPDATE embedded_documents
SET title_embedding = EMBED(title),
summary_embedding = EMBED(summary),
content_embedding = EMBED(full_text),
is_indexed = TRUE
WHERE is_indexed = FALSE;
Step 4: Semantic Search by Query
Search documents using natural language.
SELECT
title,
category,
author,
COSINE_SIMILARITY(summary_embedding, EMBED('healthcare and medical applications')) as relevance
FROM embedded_documents
WHERE is_indexed = TRUE
ORDER BY relevance DESC
LIMIT 5;
Step 5: Multi-Field Semantic Search
Search across title, summary, and content.
SELECT
title,
category,
COSINE_SIMILARITY(title_embedding, EMBED('data security and privacy')) * 0.3 +
COSINE_SIMILARITY(summary_embedding, EMBED('data security and privacy')) * 0.4 +
COSINE_SIMILARITY(content_embedding, EMBED('data security and privacy')) * 0.3 as combined_score
FROM embedded_documents
WHERE is_indexed = TRUE
ORDER BY combined_score DESC
LIMIT 5;
Step 6: Find Similar Documents
Get documents similar to a specific one.
SELECT
b.title,
b.category,
COSINE_SIMILARITY(a.content_embedding, b.content_embedding) as similarity
FROM embedded_documents a
CROSS JOIN embedded_documents b
WHERE a.id = 3
AND b.id != 3
AND a.is_indexed = TRUE
AND b.is_indexed = TRUE
ORDER BY similarity DESC
LIMIT 5;
Step 7: Category-Filtered Search
Search within a specific category.
SELECT
title,
summary,
COSINE_SIMILARITY(summary_embedding, EMBED('software design and development')) as relevance
FROM embedded_documents
WHERE category = 'technical'
AND is_indexed = TRUE
ORDER BY relevance DESC;
Step 8: Document Clusters
Find document clusters by similarity.
SELECT
a.title as document_1,
b.title as document_2,
a.category as category_1,
b.category as category_2,
COSINE_SIMILARITY(a.content_embedding, b.content_embedding) as similarity
FROM embedded_documents a
CROSS JOIN embedded_documents b
WHERE a.id < b.id
AND a.is_indexed = TRUE
AND b.is_indexed = TRUE
AND COSINE_SIMILARITY(a.content_embedding, b.content_embedding) > 0.5
ORDER BY similarity DESC;
Step 9: Topic Discovery
Find documents matching specific topics.
SELECT
title,
category,
COSINE_SIMILARITY(content_embedding, EMBED('risk management and compliance')) as risk_relevance,
COSINE_SIMILARITY(content_embedding, EMBED('technology and software')) as tech_relevance
FROM embedded_documents
WHERE is_indexed = TRUE
ORDER BY risk_relevance DESC;
Step 10: Search with Threshold
Get only highly relevant documents.
SELECT
title,
category,
author,
page_count,
COSINE_SIMILARITY(summary_embedding, EMBED('cloud computing infrastructure')) as relevance
FROM embedded_documents
WHERE is_indexed = TRUE
AND COSINE_SIMILARITY(summary_embedding, EMBED('cloud computing infrastructure')) > 0.4
ORDER BY relevance DESC;
Step 11: Cross-Category Recommendations
Recommend documents from other categories.
SELECT
source.title as if_you_liked,
recommended.title as you_might_like,
recommended.category,
COSINE_SIMILARITY(source.content_embedding, recommended.content_embedding) as match_score
FROM embedded_documents source
CROSS JOIN embedded_documents recommended
WHERE source.id = 1
AND recommended.id != 1
AND source.category != recommended.category
AND source.is_indexed = TRUE
AND recommended.is_indexed = TRUE
ORDER BY match_score DESC
LIMIT 3;
Step 12: Search Quality Analysis
Analyze search relevance distribution.
SELECT
category,
AVG(COSINE_SIMILARITY(summary_embedding, EMBED('organizational management'))) as avg_relevance,
COUNT(*) as doc_count
FROM embedded_documents
WHERE is_indexed = TRUE
GROUP BY category
ORDER BY avg_relevance DESC;
Cleanup (Optional)
DROP TABLE IF EXISTS embedded_documents;
Expected Outcomes
- Semantic search returns relevant documents
- Multi-field search improves accuracy
- Similar document discovery works
- Category filtering combines with embeddings
- Recommendations generated
Search Weights
| Field | Weight | Rationale |
|---|---|---|
| Title | 0.30 | Concise topic |
| Summary | 0.40 | Key concepts |
| Content | 0.30 | Full detail |
Key Concepts Learned
- Vector embeddings for documents
- COSINE_SIMILARITY search
- Multi-field weighted search
- Document similarity matching
- Topic-based discovery