PDF Text Extraction
Objective
Extract text content from PDF documents for full-text search and analysis. This enables building searchable document archives and content analysis systems.
Step 1: Create Extractable Documents Table
Create a table for PDFs with extracted text.
CREATE TABLE extractable_documents (
id INTEGER PRIMARY KEY,
title VARCHAR(255) NOT NULL,
document PDF,
category VARCHAR(50),
file_size BIGINT,
page_count INTEGER,
extracted_text TEXT,
word_count INTEGER,
extraction_status VARCHAR(20) DEFAULT 'pending',
extraction_date TIMESTAMP,
language VARCHAR(10) DEFAULT 'en',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Step 2: Insert Sample Documents
Add documents for text extraction.
INSERT INTO extractable_documents (id, title, category, file_size, page_count, extraction_status) VALUES
(1, 'Company Policy Manual', 'policy', 2457600, 45, 'completed'),
(2, 'Technical Specification v2', 'technical', 1536000, 28, 'completed'),
(3, 'Research Report 2024', 'research', 1843200, 35, 'completed'),
(4, 'Meeting Minutes Jan 2024', 'meetings', 512000, 8, 'completed'),
(5, 'Product Requirements Doc', 'product', 921600, 18, 'completed'),
(6, 'New Proposal Draft', 'proposals', 768000, 12, 'pending'),
(7, 'Industry Analysis', 'research', 2048000, 42, 'processing');
Step 3: Simulate Extracted Text
Add sample extracted text content.
UPDATE extractable_documents SET
extracted_text = 'This comprehensive policy manual outlines all company policies including workplace conduct, attendance requirements, benefits information, safety protocols, and compliance guidelines. All employees are expected to read and understand these policies. The document covers hiring procedures, performance evaluations, disciplinary actions, and termination processes.',
word_count = 52,
extraction_date = '2024-01-15 10:30:00'
WHERE id = 1;
UPDATE extractable_documents SET
extracted_text = 'Technical specification document for the new software platform. This document details system architecture, API endpoints, database schemas, security requirements, and performance benchmarks. The system supports cloud deployment with horizontal scaling capabilities. Integration points include REST APIs and webhook notifications.',
word_count = 47,
extraction_date = '2024-01-16 14:20:00'
WHERE id = 2;
UPDATE extractable_documents SET
extracted_text = 'Research report analyzing market trends and competitive landscape. Key findings include market growth projections, customer behavior analysis, and technology adoption patterns. The report recommends strategic initiatives for market expansion and product development priorities based on data-driven insights.',
word_count = 45,
extraction_date = '2024-02-01 09:15:00'
WHERE id = 3;
UPDATE extractable_documents SET
extracted_text = 'Meeting minutes from the January 2024 quarterly planning session. Attendees discussed Q1 objectives, budget allocations, and project timelines. Action items were assigned to department leads with specific deadlines. Next meeting scheduled for April.',
word_count = 38,
extraction_date = '2024-01-20 16:45:00'
WHERE id = 4;
UPDATE extractable_documents SET
extracted_text = 'Product requirements document defining features for the next release. Priority features include user authentication improvements, dashboard redesign, and mobile responsiveness. The document includes user stories, acceptance criteria, and technical constraints for implementation.',
word_count = 40,
extraction_date = '2024-02-05 11:00:00'
WHERE id = 5;
Step 4: Full-Text Search
Search documents by text content.
SELECT
title,
category,
page_count,
word_count
FROM extractable_documents
WHERE extracted_text LIKE '%policy%'
OR extracted_text LIKE '%compliance%'
ORDER BY word_count DESC;
Step 5: Search with Context
Get matching content with context.
SELECT
title,
category,
extracted_text
FROM extractable_documents
WHERE extracted_text LIKE '%market%'
AND extraction_status = 'completed';
Step 6: Extraction Status Report
Check extraction progress.
SELECT
extraction_status,
COUNT(*) as document_count,
SUM(page_count) as total_pages
FROM extractable_documents
GROUP BY extraction_status
ORDER BY document_count DESC;
Step 7: Document Content Statistics
Analyze extracted content.
SELECT
title,
page_count,
word_count,
word_count / NULLIF(page_count, 0) as words_per_page
FROM extractable_documents
WHERE extraction_status = 'completed'
ORDER BY word_count DESC;
Step 8: Category Content Analysis
Analyze content by category.
SELECT
category,
COUNT(*) as doc_count,
SUM(word_count) as total_words,
AVG(word_count) as avg_words
FROM extractable_documents
WHERE extraction_status = 'completed'
GROUP BY category
ORDER BY total_words DESC;
Step 9: Recent Extractions
Get recently extracted documents.
SELECT
title,
category,
word_count,
extraction_date
FROM extractable_documents
WHERE extraction_status = 'completed'
ORDER BY extraction_date DESC
LIMIT 5;
Step 10: Pending Extractions
Find documents needing extraction.
SELECT
id,
title,
category,
page_count,
'Needs extraction' as action
FROM extractable_documents
WHERE extraction_status IN ('pending', 'processing')
ORDER BY page_count;
Step 11: Multi-Keyword Search
Search for multiple keywords.
SELECT
title,
category,
CASE
WHEN extracted_text LIKE '%API%' THEN 'Technical'
WHEN extracted_text LIKE '%policy%' THEN 'Policy'
WHEN extracted_text LIKE '%market%' THEN 'Business'
ELSE 'Other'
END as content_type
FROM extractable_documents
WHERE extraction_status = 'completed'
AND (
extracted_text LIKE '%API%'
OR extracted_text LIKE '%policy%'
OR extracted_text LIKE '%market%'
);
Step 12: Word Density Report
Analyze document verbosity.
SELECT
title,
page_count,
word_count,
word_count / NULLIF(page_count, 0) as density,
CASE
WHEN word_count / NULLIF(page_count, 0) > 300 THEN 'Dense'
WHEN word_count / NULLIF(page_count, 0) > 150 THEN 'Normal'
ELSE 'Sparse'
END as density_rating
FROM extractable_documents
WHERE extraction_status = 'completed'
ORDER BY density DESC;
Cleanup (Optional)
DROP TABLE IF EXISTS extractable_documents;
Expected Outcomes
- Text extracted from PDFs
- Full-text search works
- Extraction status tracked
- Word statistics calculated
- Content analysis enabled
Extraction Status
| Status | Description |
|---|---|
| pending | Awaiting extraction |
| processing | Currently extracting |
| completed | Text extracted |
| failed | Extraction error |
Key Concepts Learned
- PDF text extraction
- Full-text search patterns
- Extraction workflow
- Content statistics
- Multi-keyword search