Image OCR Extraction

Objective

Extract text from images using AIDB's EXTRACT_TEXT function. This enables searching within images containing text such as documents, signs, and screenshots.

Step 1: Create Table for Images with Text

Create a table to store images and their extracted text.

CREATE TABLE images_with_text (
    id INTEGER PRIMARY KEY,
    filename VARCHAR(255) NOT NULL,
    image IMAGE(JPEG),
    image_type VARCHAR(50),
    extracted_text TEXT,
    extraction_confidence DECIMAL(5, 4),
    processed_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 2: Insert Sample Image Records

Add sample images that contain text.

INSERT INTO images_with_text (id, filename, image_type) VALUES
    (1, 'business_card.jpg', 'business_card'),
    (2, 'receipt_001.jpg', 'receipt'),
    (3, 'street_sign.jpg', 'sign'),
    (4, 'whiteboard_notes.jpg', 'whiteboard'),
    (5, 'book_page.jpg', 'document'),
    (6, 'handwritten_note.jpg', 'handwritten'),
    (7, 'product_label.jpg', 'label'),
    (8, 'screenshot_error.png', 'screenshot');

Step 3: Extract Text from Images

Use EXTRACT_TEXT to perform OCR on images.

UPDATE images_with_text
SET extracted_text = EXTRACT_TEXT(image),
    processed_at = CURRENT_TIMESTAMP
WHERE image IS NOT NULL;

Step 4: Create Scanned Documents Table

Create a specialized table for scanned documents.

CREATE TABLE scanned_documents (
    id INTEGER PRIMARY KEY,
    document_name VARCHAR(255) NOT NULL,
    scan_image IMAGE(PNG),
    full_text TEXT,
    page_number INTEGER DEFAULT 1,
    total_pages INTEGER DEFAULT 1,
    document_type VARCHAR(50),
    scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO scanned_documents (id, document_name, document_type, page_number, total_pages) VALUES
    (1, 'Contract_2024_001', 'contract', 1, 5),
    (2, 'Contract_2024_001', 'contract', 2, 5),
    (3, 'Invoice_March', 'invoice', 1, 1),
    (4, 'Meeting_Notes', 'notes', 1, 2),
    (5, 'ID_Document', 'identification', 1, 1);

Step 5: Process Scanned Documents

Extract text from scanned document images.

UPDATE scanned_documents
SET full_text = EXTRACT_TEXT(scan_image)
WHERE scan_image IS NOT NULL AND full_text IS NULL;

Step 6: Search Extracted Text

Search within extracted text content.

-- Search for specific terms in extracted text
SELECT
    filename,
    image_type,
    extracted_text
FROM images_with_text
WHERE extracted_text LIKE '%email%'
   OR extracted_text LIKE '%phone%';

Step 7: Create Receipt Scanner Table

Build a receipt scanning system.

CREATE TABLE receipt_scans (
    id INTEGER PRIMARY KEY,
    receipt_image IMAGE(JPEG),
    store_name VARCHAR(255),
    receipt_text TEXT,
    total_amount DECIMAL(10, 2),
    receipt_date DATE,
    expense_category VARCHAR(50),
    processed BOOLEAN DEFAULT FALSE,
    scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO receipt_scans (id, store_name, total_amount, receipt_date, expense_category) VALUES
    (1, 'Office Supply Co', 125.50, '2024-03-15', 'office'),
    (2, 'Tech Electronics', 899.99, '2024-03-14', 'equipment'),
    (3, 'Coffee Shop', 15.75, '2024-03-15', 'meals'),
    (4, 'Gas Station', 65.00, '2024-03-13', 'travel');

Step 8: Process Receipt Text

Extract and analyze receipt content.

UPDATE receipt_scans
SET receipt_text = EXTRACT_TEXT(receipt_image),
    processed = TRUE
WHERE receipt_image IS NOT NULL AND processed = FALSE;

SELECT
    store_name,
    total_amount,
    receipt_date,
    expense_category
FROM receipt_scans
WHERE processed = TRUE
ORDER BY receipt_date DESC;

Step 9: Document Search System

Build a searchable document index.

CREATE TABLE document_index (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    source_table VARCHAR(50) NOT NULL,
    searchable_text TEXT,
    word_count INTEGER,
    indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Index scanned documents
INSERT INTO document_index (id, document_id, source_table, searchable_text)
SELECT
    id,
    id,
    'scanned_documents',
    full_text
FROM scanned_documents
WHERE full_text IS NOT NULL;

Step 10: Full Text Search Query

Search across all indexed documents.

SELECT
    di.document_id,
    di.source_table,
    di.searchable_text,
    di.indexed_at
FROM document_index di
WHERE di.searchable_text LIKE '%contract%'
ORDER BY di.indexed_at DESC;

Step 11: OCR Processing Statistics

Track OCR processing status and results.

SELECT
    image_type,
    COUNT(*) as total_images,
    COUNT(extracted_text) as processed_count,
    COUNT(*) - COUNT(extracted_text) as pending_count
FROM images_with_text
GROUP BY image_type
ORDER BY total_images DESC;

Step 12: Unprocessed Images Report

Find images that haven't been processed.

SELECT
    id,
    filename,
    image_type,
    created_at
FROM images_with_text
WHERE extracted_text IS NULL
   OR extracted_text = ''
ORDER BY created_at;

Cleanup (Optional)

DROP TABLE IF EXISTS document_index;
DROP TABLE IF EXISTS receipt_scans;
DROP TABLE IF EXISTS scanned_documents;
DROP TABLE IF EXISTS images_with_text;

Expected Outcomes

EXTRACT_TEXT performs OCR on images
Extracted text is searchable
Different image types processed
Receipt scanning system functional
Document index enables search

OCR Use Cases

Image Type	Use Case
Business Cards	Contact extraction
Receipts	Expense tracking
Documents	Document search
Screenshots	Error log capture
Signs	Location identification

Key Concepts Learned

EXTRACT_TEXT function for OCR
Storing extracted text for search
Processing status tracking
Document indexing pattern
Full-text search on images

Image OCR Extraction

Image OCR Extraction

Objective

Step 1: Create Table for Images with Text

Step 2: Insert Sample Image Records

Step 3: Extract Text from Images

Step 4: Create Scanned Documents Table

Step 5: Process Scanned Documents

Step 6: Search Extracted Text

Step 7: Create Receipt Scanner Table

Step 8: Process Receipt Text

Step 9: Document Search System

Step 10: Full Text Search Query

Step 11: OCR Processing Statistics

Step 12: Unprocessed Images Report

Cleanup (Optional)

Expected Outcomes

OCR Use Cases

Key Concepts Learned

Run this on your own machine