Scanned Document Images

Store scanned documents as images with OCR text extraction for searchable document archives

All recipes· image-management· 12 minutesintermediate

Scanned Document Images

Objective

Create a system for storing scanned documents as images with OCR text extraction. This enables building searchable document archives from paper documents.

Step 1: Create Scanned Documents Table

Create a table for storing scanned document images.

CREATE TABLE scanned_documents (
    id INTEGER PRIMARY KEY,
    document_name VARCHAR(255) NOT NULL,
    scan_image IMAGE(PNG),
    document_type VARCHAR(50),
    page_number INTEGER DEFAULT 1,
    total_pages INTEGER DEFAULT 1,
    scan_resolution INTEGER DEFAULT 300,
    scan_color_mode VARCHAR(20) DEFAULT 'color',
    scanned_by VARCHAR(100),
    scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 2: Create OCR Results Table

Create a table for storing extracted text.

CREATE TABLE document_ocr (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    extracted_text TEXT,
    confidence_score DECIMAL(5, 4),
    word_count INTEGER,
    language_detected VARCHAR(10),
    processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (document_id) REFERENCES scanned_documents(id)
);

Step 3: Insert Sample Scanned Documents

Add sample scanned document records.

INSERT INTO scanned_documents (id, document_name, document_type, page_number, total_pages, scanned_by) VALUES
    (1, 'Contract_2024_001', 'contract', 1, 5, 'legal_dept'),
    (2, 'Contract_2024_001', 'contract', 2, 5, 'legal_dept'),
    (3, 'Contract_2024_001', 'contract', 3, 5, 'legal_dept'),
    (4, 'Invoice_March_001', 'invoice', 1, 1, 'accounting'),
    (5, 'Receipt_Office_Supplies', 'receipt', 1, 1, 'accounting'),
    (6, 'Employee_ID_JohnDoe', 'identification', 1, 1, 'hr_dept'),
    (7, 'Meeting_Notes_Q1', 'notes', 1, 2, 'admin'),
    (8, 'Meeting_Notes_Q1', 'notes', 2, 2, 'admin');

Step 4: Process OCR Extraction

Extract text from scanned images.

INSERT INTO document_ocr (id, document_id, extracted_text, confidence_score, word_count, language_detected)
VALUES
    (1, 1, 'SERVICE AGREEMENT\n\nThis Agreement is entered into as of January 1, 2024...', 0.9523, 450, 'en'),
    (2, 2, 'TERMS AND CONDITIONS\n\nSection 2.1 Payment Terms...', 0.9412, 380, 'en'),
    (3, 4, 'INVOICE #INV-2024-0315\nDate: March 15, 2024\nTotal Due: $1,250.00', 0.9856, 85, 'en'),
    (4, 5, 'OFFICE DEPOT\nReceipt #12345\nPaper supplies $45.99\nToner $89.99', 0.9234, 42, 'en'),
    (5, 7, 'Q1 Planning Meeting\nAttendees: John, Sarah, Mike\nAgenda items discussed...', 0.8876, 320, 'en');

Step 5: Search Document Content

Search across extracted text.

SELECT
    sd.document_name,
    sd.document_type,
    sd.page_number,
    doc.extracted_text
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id
WHERE doc.extracted_text LIKE '%Agreement%'
   OR doc.extracted_text LIKE '%Contract%';

Step 6: Create Document Index

Build a searchable document index.

CREATE TABLE document_search_index (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    document_name VARCHAR(255),
    document_type VARCHAR(50),
    full_text TEXT,
    keywords TEXT,
    indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO document_search_index (id, document_id, document_name, document_type, full_text, keywords)
SELECT
    sd.id,
    sd.id,
    sd.document_name,
    sd.document_type,
    doc.extracted_text,
    CASE sd.document_type
        WHEN 'contract' THEN 'agreement, terms, conditions, legal'
        WHEN 'invoice' THEN 'payment, due, amount, billing'
        WHEN 'receipt' THEN 'purchase, payment, expense'
        ELSE 'document'
    END
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id;

Step 7: Multi-Page Document View

Get all pages of a document.

SELECT
    sd.document_name,
    sd.page_number,
    sd.total_pages,
    doc.word_count,
    doc.confidence_score
FROM scanned_documents sd
LEFT JOIN document_ocr doc ON sd.id = doc.document_id
WHERE sd.document_name = 'Contract_2024_001'
ORDER BY sd.page_number;

Step 8: OCR Quality Report

Analyze OCR extraction quality.

SELECT
    sd.document_type,
    COUNT(*) as document_count,
    AVG(doc.confidence_score) as avg_confidence,
    MIN(doc.confidence_score) as min_confidence,
    SUM(doc.word_count) as total_words
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id
GROUP BY sd.document_type
ORDER BY avg_confidence DESC;

Step 9: Documents Needing Review

Find documents with low OCR confidence.

SELECT
    sd.document_name,
    sd.document_type,
    sd.page_number,
    doc.confidence_score,
    'Needs manual review' as status
FROM scanned_documents sd
INNER JOIN document_ocr doc ON sd.id = doc.document_id
WHERE doc.confidence_score < 0.90
ORDER BY doc.confidence_score;

Step 10: Unprocessed Documents

Find documents pending OCR.

SELECT
    sd.id,
    sd.document_name,
    sd.document_type,
    sd.scanned_at
FROM scanned_documents sd
LEFT JOIN document_ocr doc ON sd.id = doc.document_id
WHERE doc.id IS NULL
ORDER BY sd.scanned_at;

Step 11: Document Statistics

Generate scanning statistics.

SELECT
    document_type,
    COUNT(DISTINCT document_name) as unique_documents,
    COUNT(*) as total_pages,
    COUNT(DISTINCT scanned_by) as scanners
FROM scanned_documents
GROUP BY document_type
ORDER BY total_pages DESC;

Step 12: Recent Scans Timeline

View recent scanning activity.

SELECT
    DATE(scanned_at) as scan_date,
    COUNT(*) as pages_scanned,
    COUNT(DISTINCT document_name) as documents,
    COUNT(DISTINCT scanned_by) as operators
FROM scanned_documents
GROUP BY DATE(scanned_at)
ORDER BY scan_date DESC
LIMIT 7;

Cleanup (Optional)

DROP TABLE IF EXISTS document_search_index;
DROP TABLE IF EXISTS document_ocr;
DROP TABLE IF EXISTS scanned_documents;

Expected Outcomes

  • Scanned documents stored as images
  • OCR extracts searchable text
  • Multi-page documents linked
  • Quality metrics tracked
  • Full-text search enabled

OCR Best Practices

Factor Recommendation
Resolution 300 DPI minimum
Color Mode Grayscale for text
Format PNG for quality
Alignment Straight scans

Key Concepts Learned

  • Scanned document storage
  • OCR text extraction
  • Multi-page document handling
  • Quality confidence tracking
  • Full-text search indexing

Tags

sqlintermediateimagesdocumentsocrscanningai

Run this on your own machine

Install SynapCores Community Edition free, paste the SQL or Cypher above into the bundled web UI, and watch it run.

Download Free CE