Scanned Document Images

Objective

Create a system for storing scanned documents as images with OCR text extraction. This enables building searchable document archives from paper documents.

Step 1: Create Scanned Documents Table

Create a table for storing scanned document images.

CREATE TABLE scanned_documents (
    id INTEGER PRIMARY KEY,
    document_name VARCHAR(255) NOT NULL,
    scan_image IMAGE(PNG),
    document_type VARCHAR(50),
    page_number INTEGER DEFAULT 1,
    total_pages INTEGER DEFAULT 1,
    scan_resolution INTEGER DEFAULT 300,
    scan_color_mode VARCHAR(20) DEFAULT 'color',
    scanned_by VARCHAR(100),
    scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 2: Create OCR Results Table

Create a table for storing extracted text.

CREATE TABLE document_ocr (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    extracted_text TEXT,
    confidence_score DECIMAL(5, 4),
    word_count INTEGER,
    language_detected VARCHAR(10),
    processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (document_id) REFERENCES scanned_documents(id)
);

Step 3: Insert Sample Scanned Documents

Add sample scanned document records.

INSERT INTO scanned_documents (id, document_name, document_type, page_number, total_pages, scanned_by) VALUES
    (1, 'Contract_2024_001', 'contract', 1, 5, 'legal_dept'),
    (2, 'Contract_2024_001', 'contract', 2, 5, 'legal_dept'),
    (3, 'Contract_2024_001', 'contract', 3, 5, 'legal_dept'),
    (4, 'Invoice_March_001', 'invoice', 1, 1, 'accounting'),
    (5, 'Receipt_Office_Supplies', 'receipt', 1, 1, 'accounting'),
    (6, 'Employee_ID_JohnDoe', 'identification', 1, 1, 'hr_dept'),
    (7, 'Meeting_Notes_Q1', 'notes', 1, 2, 'admin'),
    (8, 'Meeting_Notes_Q1', 'notes', 2, 2, 'admin');

Step 4: Process OCR Extraction

Extract text from scanned images.

INSERT INTO document_ocr (id, document_id, extracted_text, confidence_score, word_count, language_detected)
VALUES
    (1, 1, 'SERVICE AGREEMENT\n\nThis Agreement is entered into as of January 1, 2024...', 0.9523, 450, 'en'),
    (2, 2, 'TERMS AND CONDITIONS\n\nSection 2.1 Payment Terms...', 0.9412, 380, 'en'),
    (3, 4, 'INVOICE #INV-2024-0315\nDate: March 15, 2024\nTotal Due: $1,250.00', 0.9856, 85, 'en'),
    (4, 5, 'OFFICE DEPOT\nReceipt #12345\nPaper supplies $45.99\nToner $89.99', 0.9234, 42, 'en'),
    (5, 7, 'Q1 Planning Meeting\nAttendees: John, Sarah, Mike\nAgenda items discussed...', 0.8876, 320, 'en');

Step 5: Search Document Content

Search across extracted text.

SELECT
    sd.document_name,
    sd.document_type,
    sd.page_number,
    doc.extracted_text
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id
WHERE doc.extracted_text LIKE '%Agreement%'
   OR doc.extracted_text LIKE '%Contract%';

Step 6: Create Document Index

Build a searchable document index.

CREATE TABLE document_search_index (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    document_name VARCHAR(255),
    document_type VARCHAR(50),
    full_text TEXT,
    keywords TEXT,
    indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO document_search_index (id, document_id, document_name, document_type, full_text, keywords)
SELECT
    sd.id,
    sd.id,
    sd.document_name,
    sd.document_type,
    doc.extracted_text,
    CASE sd.document_type
        WHEN 'contract' THEN 'agreement, terms, conditions, legal'
        WHEN 'invoice' THEN 'payment, due, amount, billing'
        WHEN 'receipt' THEN 'purchase, payment, expense'
        ELSE 'document'
    END
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id;

Step 7: Multi-Page Document View

Get all pages of a document.

SELECT
    sd.document_name,
    sd.page_number,
    sd.total_pages,
    doc.word_count,
    doc.confidence_score
FROM scanned_documents sd
LEFT JOIN document_ocr doc ON sd.id = doc.document_id
WHERE sd.document_name = 'Contract_2024_001'
ORDER BY sd.page_number;

Step 8: OCR Quality Report

Analyze OCR extraction quality.

SELECT
    sd.document_type,
    COUNT(*) as document_count,
    AVG(doc.confidence_score) as avg_confidence,
    MIN(doc.confidence_score) as min_confidence,
    SUM(doc.word_count) as total_words
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id
GROUP BY sd.document_type
ORDER BY avg_confidence DESC;

Step 9: Documents Needing Review

Find documents with low OCR confidence.

SELECT
    sd.document_name,
    sd.document_type,
    sd.page_number,
    doc.confidence_score,
    'Needs manual review' as status
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id
WHERE doc.confidence_score < 0.90
ORDER BY doc.confidence_score;

Step 10: Unprocessed Documents

Find documents pending OCR.

SELECT
    sd.id,
    sd.document_name,
    sd.document_type,
    sd.scanned_at
FROM scanned_documents sd
LEFT JOIN document_ocr doc ON sd.id = doc.document_id
WHERE doc.id IS NULL
ORDER BY sd.scanned_at;

Step 11: Document Statistics

Generate scanning statistics.

SELECT
    document_type,
    COUNT(DISTINCT document_name) as unique_documents,
    COUNT(*) as total_pages,
    COUNT(DISTINCT scanned_by) as scanners
FROM scanned_documents
GROUP BY document_type
ORDER BY total_pages DESC;

Step 12: Recent Scans Timeline

View recent scanning activity.

SELECT
    DATE(scanned_at) as scan_date,
    COUNT(*) as pages_scanned,
    COUNT(DISTINCT document_name) as documents,
    COUNT(DISTINCT scanned_by) as operators
FROM scanned_documents
GROUP BY DATE(scanned_at)
ORDER BY scan_date DESC
LIMIT 7;

Cleanup (Optional)

DROP TABLE IF EXISTS document_search_index;
DROP TABLE IF EXISTS document_ocr;
DROP TABLE IF EXISTS scanned_documents;

Expected Outcomes

Scanned documents stored as images
OCR extracts searchable text
Multi-page documents linked
Quality metrics tracked
Full-text search enabled

OCR Best Practices

Factor	Recommendation
Resolution	300 DPI minimum
Color Mode	Grayscale for text
Format	PNG for quality
Alignment	Straight scans

Key Concepts Learned

Scanned document storage
OCR text extraction
Multi-page document handling
Quality confidence tracking
Full-text search indexing

Scanned Document Images

Scanned Document Images

Objective

Step 1: Create Scanned Documents Table

Step 2: Create OCR Results Table

Step 3: Insert Sample Scanned Documents

Step 4: Process OCR Extraction

Step 5: Search Document Content

Step 6: Create Document Index

Step 7: Multi-Page Document View

Step 8: OCR Quality Report

Step 9: Documents Needing Review

Step 10: Unprocessed Documents

Step 11: Document Statistics

Step 12: Recent Scans Timeline

Cleanup (Optional)

Expected Outcomes

OCR Best Practices

Key Concepts Learned

Run this on your own machine