Scanned Document Images
Objective
Create a system for storing scanned documents as images with OCR text extraction. This enables building searchable document archives from paper documents.
Step 1: Create Scanned Documents Table
Create a table for storing scanned document images.
CREATE TABLE scanned_documents (
id INTEGER PRIMARY KEY,
document_name VARCHAR(255) NOT NULL,
scan_image IMAGE(PNG),
document_type VARCHAR(50),
page_number INTEGER DEFAULT 1,
total_pages INTEGER DEFAULT 1,
scan_resolution INTEGER DEFAULT 300,
scan_color_mode VARCHAR(20) DEFAULT 'color',
scanned_by VARCHAR(100),
scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Step 2: Create OCR Results Table
Create a table for storing extracted text.
CREATE TABLE document_ocr (
id INTEGER PRIMARY KEY,
document_id INTEGER NOT NULL,
extracted_text TEXT,
confidence_score DECIMAL(5, 4),
word_count INTEGER,
language_detected VARCHAR(10),
processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (document_id) REFERENCES scanned_documents(id)
);
Step 3: Insert Sample Scanned Documents
Add sample scanned document records.
INSERT INTO scanned_documents (id, document_name, document_type, page_number, total_pages, scanned_by) VALUES
(1, 'Contract_2024_001', 'contract', 1, 5, 'legal_dept'),
(2, 'Contract_2024_001', 'contract', 2, 5, 'legal_dept'),
(3, 'Contract_2024_001', 'contract', 3, 5, 'legal_dept'),
(4, 'Invoice_March_001', 'invoice', 1, 1, 'accounting'),
(5, 'Receipt_Office_Supplies', 'receipt', 1, 1, 'accounting'),
(6, 'Employee_ID_JohnDoe', 'identification', 1, 1, 'hr_dept'),
(7, 'Meeting_Notes_Q1', 'notes', 1, 2, 'admin'),
(8, 'Meeting_Notes_Q1', 'notes', 2, 2, 'admin');
Step 4: Process OCR Extraction
Extract text from scanned images.
INSERT INTO document_ocr (id, document_id, extracted_text, confidence_score, word_count, language_detected)
VALUES
(1, 1, 'SERVICE AGREEMENT\n\nThis Agreement is entered into as of January 1, 2024...', 0.9523, 450, 'en'),
(2, 2, 'TERMS AND CONDITIONS\n\nSection 2.1 Payment Terms...', 0.9412, 380, 'en'),
(3, 4, 'INVOICE #INV-2024-0315\nDate: March 15, 2024\nTotal Due: $1,250.00', 0.9856, 85, 'en'),
(4, 5, 'OFFICE DEPOT\nReceipt #12345\nPaper supplies $45.99\nToner $89.99', 0.9234, 42, 'en'),
(5, 7, 'Q1 Planning Meeting\nAttendees: John, Sarah, Mike\nAgenda items discussed...', 0.8876, 320, 'en');
Step 5: Search Document Content
Search across extracted text.
SELECT
sd.document_name,
sd.document_type,
sd.page_number,
doc.extracted_text
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id
WHERE doc.extracted_text LIKE '%Agreement%'
OR doc.extracted_text LIKE '%Contract%';
Step 6: Create Document Index
Build a searchable document index.
CREATE TABLE document_search_index (
id INTEGER PRIMARY KEY,
document_id INTEGER NOT NULL,
document_name VARCHAR(255),
document_type VARCHAR(50),
full_text TEXT,
keywords TEXT,
indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO document_search_index (id, document_id, document_name, document_type, full_text, keywords)
SELECT
sd.id,
sd.id,
sd.document_name,
sd.document_type,
doc.extracted_text,
CASE sd.document_type
WHEN 'contract' THEN 'agreement, terms, conditions, legal'
WHEN 'invoice' THEN 'payment, due, amount, billing'
WHEN 'receipt' THEN 'purchase, payment, expense'
ELSE 'document'
END
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id;
Step 7: Multi-Page Document View
Get all pages of a document.
SELECT
sd.document_name,
sd.page_number,
sd.total_pages,
doc.word_count,
doc.confidence_score
FROM scanned_documents sd
LEFT JOIN document_ocr doc ON sd.id = doc.document_id
WHERE sd.document_name = 'Contract_2024_001'
ORDER BY sd.page_number;
Step 8: OCR Quality Report
Analyze OCR extraction quality.
SELECT
sd.document_type,
COUNT(*) as document_count,
AVG(doc.confidence_score) as avg_confidence,
MIN(doc.confidence_score) as min_confidence,
SUM(doc.word_count) as total_words
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id
GROUP BY sd.document_type
ORDER BY avg_confidence DESC;
Step 9: Documents Needing Review
Find documents with low OCR confidence.
SELECT
sd.document_name,
sd.document_type,
sd.page_number,
doc.confidence_score,
'Needs manual review' as status
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id
WHERE doc.confidence_score < 0.90
ORDER BY doc.confidence_score;
Step 10: Unprocessed Documents
Find documents pending OCR.
SELECT
sd.id,
sd.document_name,
sd.document_type,
sd.scanned_at
FROM scanned_documents sd
LEFT JOIN document_ocr doc ON sd.id = doc.document_id
WHERE doc.id IS NULL
ORDER BY sd.scanned_at;
Step 11: Document Statistics
Generate scanning statistics.
SELECT
document_type,
COUNT(DISTINCT document_name) as unique_documents,
COUNT(*) as total_pages,
COUNT(DISTINCT scanned_by) as scanners
FROM scanned_documents
GROUP BY document_type
ORDER BY total_pages DESC;
Step 12: Recent Scans Timeline
View recent scanning activity.
SELECT
DATE(scanned_at) as scan_date,
COUNT(*) as pages_scanned,
COUNT(DISTINCT document_name) as documents,
COUNT(DISTINCT scanned_by) as operators
FROM scanned_documents
GROUP BY DATE(scanned_at)
ORDER BY scan_date DESC
LIMIT 7;
Cleanup (Optional)
DROP TABLE IF EXISTS document_search_index;
DROP TABLE IF EXISTS document_ocr;
DROP TABLE IF EXISTS scanned_documents;
Expected Outcomes
- Scanned documents stored as images
- OCR extracts searchable text
- Multi-page documents linked
- Quality metrics tracked
- Full-text search enabled
OCR Best Practices
| Factor | Recommendation |
|---|---|
| Resolution | 300 DPI minimum |
| Color Mode | Grayscale for text |
| Format | PNG for quality |
| Alignment | Straight scans |
Key Concepts Learned
- Scanned document storage
- OCR text extraction
- Multi-page document handling
- Quality confidence tracking
- Full-text search indexing