Knowledge Base with PDFs
Objective
Build an AI-powered knowledge base from PDF documents with semantic search capabilities. This enables building intelligent documentation systems, FAQ bots, and information retrieval applications.
Step 1: Create Knowledge Articles Table
Create a table for knowledge articles.
CREATE TABLE knowledge_articles (
id INTEGER PRIMARY KEY,
source_pdf PDF,
title VARCHAR(255) NOT NULL,
category VARCHAR(50),
subcategory VARCHAR(50),
summary TEXT,
full_content TEXT,
content_embedding VECTOR(384),
keywords VARCHAR(500),
author VARCHAR(100),
version VARCHAR(20) DEFAULT '1.0',
is_published BOOLEAN DEFAULT FALSE,
view_count INTEGER DEFAULT 0,
helpful_count INTEGER DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Step 2: Create Content Chunks Table
Store document chunks for retrieval.
CREATE TABLE knowledge_chunks (
id INTEGER PRIMARY KEY,
article_id INTEGER NOT NULL,
chunk_index INTEGER NOT NULL,
chunk_text TEXT NOT NULL,
chunk_embedding VECTOR(384),
section_title VARCHAR(255),
start_page INTEGER,
token_count INTEGER,
FOREIGN KEY (article_id) REFERENCES knowledge_articles(id)
);
Step 3: Create FAQ Table
Store frequently asked questions.
CREATE TABLE knowledge_faq (
id INTEGER PRIMARY KEY,
article_id INTEGER,
question TEXT NOT NULL,
answer TEXT NOT NULL,
question_embedding VECTOR(384),
category VARCHAR(50),
view_count INTEGER DEFAULT 0,
is_featured BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (article_id) REFERENCES knowledge_articles(id)
);
Step 4: Insert Knowledge Articles
Add sample articles.
INSERT INTO knowledge_articles (id, title, category, subcategory, summary, full_content, keywords, author, is_published) VALUES
(1, 'Getting Started Guide', 'onboarding', 'basics',
'Complete guide for new users to set up their account and learn the basics of the platform.',
'Welcome to our platform. This guide will walk you through the initial setup process. First, create your account by clicking Sign Up. Enter your email and create a strong password. Once verified, you can customize your profile settings. The dashboard provides quick access to all features. Navigation is organized into main sections: Home, Projects, Reports, and Settings.',
'getting started,setup,account,basics,onboarding', 'Documentation Team', TRUE),
(2, 'User Authentication', 'security', 'authentication',
'Learn about authentication methods including password policies, two-factor authentication, and single sign-on options.',
'Security is our top priority. We support multiple authentication methods. Password requirements include minimum 8 characters with uppercase, lowercase, numbers, and special characters. Two-factor authentication adds an extra layer of security using authenticator apps or SMS. Enterprise users can configure SSO with SAML or OAuth providers.',
'authentication,security,password,2FA,SSO,login', 'Security Team', TRUE),
(3, 'API Integration Guide', 'technical', 'api',
'Technical documentation for integrating with our REST API including authentication, endpoints, and examples.',
'Our REST API enables programmatic access to all platform features. Authentication uses API keys or OAuth tokens. Base URL is api.example.com/v1. Common endpoints include /users, /projects, /reports. All responses are JSON formatted. Rate limits apply based on your subscription tier. Webhooks provide real-time event notifications.',
'API,REST,integration,endpoints,authentication,webhooks', 'Engineering Team', TRUE),
(4, 'Billing and Subscriptions', 'account', 'billing',
'Information about subscription plans, payment methods, invoices, and billing management.',
'We offer flexible subscription plans to meet your needs. Plans include Free, Professional, and Enterprise tiers. Payment methods accepted include credit cards, wire transfers, and purchase orders for enterprise. Invoices are generated monthly and available in your account. Upgrades take effect immediately while downgrades apply at next billing cycle.',
'billing,subscription,payment,invoice,pricing,plans', 'Support Team', TRUE),
(5, 'Troubleshooting Common Issues', 'support', 'troubleshooting',
'Solutions to frequently encountered problems and error messages.',
'This guide addresses common issues users encounter. Login problems: Clear browser cache, verify email, reset password if needed. Performance issues: Check internet connection, disable browser extensions, try incognito mode. Error messages: Document specific error codes and contact support with details. Data sync issues: Refresh the page, check for conflicts, force sync from settings.',
'troubleshooting,errors,problems,solutions,support,help', 'Support Team', TRUE);
Step 5: Generate Article Embeddings
Create embeddings for articles.
UPDATE knowledge_articles
SET content_embedding = EMBED(title || ' ' || summary || ' ' || full_content)
WHERE content_embedding IS NULL;
Step 6: Insert Knowledge Chunks
Break articles into searchable chunks.
INSERT INTO knowledge_chunks (id, article_id, chunk_index, chunk_text, section_title, token_count) VALUES
-- Getting Started chunks
(1, 1, 1, 'Welcome to our platform. This guide will walk you through the initial setup process.', 'Introduction', 15),
(2, 1, 2, 'First, create your account by clicking Sign Up. Enter your email and create a strong password.', 'Account Creation', 18),
(3, 1, 3, 'Once verified, you can customize your profile settings. The dashboard provides quick access to all features.', 'Profile Setup', 17),
(4, 1, 4, 'Navigation is organized into main sections: Home, Projects, Reports, and Settings.', 'Navigation', 13),
-- Authentication chunks
(5, 2, 1, 'Security is our top priority. We support multiple authentication methods.', 'Security Overview', 12),
(6, 2, 2, 'Password requirements include minimum 8 characters with uppercase, lowercase, numbers, and special characters.', 'Password Policy', 15),
(7, 2, 3, 'Two-factor authentication adds an extra layer of security using authenticator apps or SMS.', '2FA Setup', 15),
(8, 2, 4, 'Enterprise users can configure SSO with SAML or OAuth providers.', 'SSO Integration', 11),
-- API chunks
(9, 3, 1, 'Our REST API enables programmatic access to all platform features.', 'API Introduction', 11),
(10, 3, 2, 'Authentication uses API keys or OAuth tokens. Base URL is api.example.com/v1.', 'API Authentication', 14),
(11, 3, 3, 'Common endpoints include /users, /projects, /reports. All responses are JSON formatted.', 'API Endpoints', 13),
(12, 3, 4, 'Rate limits apply based on your subscription tier. Webhooks provide real-time event notifications.', 'Rate Limits', 15);
Step 7: Generate Chunk Embeddings
Create embeddings for chunks.
UPDATE knowledge_chunks
SET chunk_embedding = EMBED(chunk_text)
WHERE chunk_embedding IS NULL;
Step 8: Insert FAQ Items
Add frequently asked questions.
INSERT INTO knowledge_faq (id, article_id, question, answer, category, is_featured) VALUES
(1, 1, 'How do I create an account?', 'Click Sign Up on the homepage, enter your email address, and create a password. You will receive a verification email to confirm your account.', 'onboarding', TRUE),
(2, 2, 'How do I enable two-factor authentication?', 'Go to Settings > Security > Two-Factor Authentication. Click Enable and follow the prompts to set up your authenticator app.', 'security', TRUE),
(3, 3, 'Where do I find my API key?', 'Navigate to Settings > API > API Keys. Click Generate New Key. Make sure to copy and securely store your key.', 'technical', TRUE),
(4, 4, 'How do I upgrade my subscription?', 'Go to Settings > Billing > Subscription. Click Upgrade and select your desired plan. Changes take effect immediately.', 'account', FALSE),
(5, 5, 'Why am I getting a login error?', 'First, verify your email address is correct. Try clearing your browser cache. If issues persist, use the Forgot Password link to reset your password.', 'support', TRUE),
(6, 3, 'What are the API rate limits?', 'Free tier: 100 requests/hour. Professional: 1000 requests/hour. Enterprise: Custom limits. Check our documentation for details.', 'technical', FALSE);
Step 9: Generate FAQ Embeddings
Create embeddings for FAQs.
UPDATE knowledge_faq
SET question_embedding = EMBED(question || ' ' || answer)
WHERE question_embedding IS NULL;
Step 10: Semantic Article Search
Search articles by natural language query.
SELECT
title,
category,
summary,
COSINE_SIMILARITY(content_embedding, EMBED('how to set up authentication')) as relevance
FROM knowledge_articles
WHERE is_published = TRUE
AND content_embedding IS NOT NULL
ORDER BY relevance DESC
LIMIT 5;
Step 11: Chunk-Level Search (RAG)
Search for specific content chunks.
SELECT
ka.title as article,
kc.section_title,
kc.chunk_text,
COSINE_SIMILARITY(kc.chunk_embedding, EMBED('API authentication methods')) as relevance
FROM knowledge_chunks kc
INNER JOIN knowledge_articles ka ON kc.article_id = ka.id
WHERE kc.chunk_embedding IS NOT NULL
ORDER BY relevance DESC
LIMIT 5;
Step 12: FAQ Search
Search FAQs by question.
SELECT
question,
answer,
category,
COSINE_SIMILARITY(question_embedding, EMBED('reset my password')) as relevance
FROM knowledge_faq
WHERE question_embedding IS NOT NULL
ORDER BY relevance DESC
LIMIT 3;
Cleanup (Optional)
DROP TABLE IF EXISTS knowledge_faq;
DROP TABLE IF EXISTS knowledge_chunks;
DROP TABLE IF EXISTS knowledge_articles;
Expected Outcomes
- Knowledge base searchable
- Chunk-level retrieval works
- FAQ matching enabled
- Semantic search accurate
- RAG-ready architecture
Search Strategies
| Strategy | Use Case |
|---|---|
| Article | High-level topics |
| Chunk | Specific answers |
| FAQ | Direct questions |
Key Concepts Learned
- Knowledge base architecture
- Document chunking for RAG
- Multi-level semantic search
- FAQ system design
- Content organization