Image OCR Extraction

Extract text from images using the EXTRACT_TEXT function for searchable image content

All recipes· image-management· 12 minutesintermediate

Image OCR Extraction

Objective

Extract text from images using AIDB's EXTRACT_TEXT function. This enables searching within images containing text such as documents, signs, and screenshots.

Step 1: Create Table for Images with Text

Create a table to store images and their extracted text.

CREATE TABLE images_with_text (
    id INTEGER PRIMARY KEY,
    filename VARCHAR(255) NOT NULL,
    image IMAGE(JPEG),
    image_type VARCHAR(50),
    extracted_text TEXT,
    extraction_confidence DECIMAL(5, 4),
    processed_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 2: Insert Sample Image Records

Add sample images that contain text.

INSERT INTO images_with_text (id, filename, image_type) VALUES
    (1, 'business_card.jpg', 'business_card'),
    (2, 'receipt_001.jpg', 'receipt'),
    (3, 'street_sign.jpg', 'sign'),
    (4, 'whiteboard_notes.jpg', 'whiteboard'),
    (5, 'book_page.jpg', 'document'),
    (6, 'handwritten_note.jpg', 'handwritten'),
    (7, 'product_label.jpg', 'label'),
    (8, 'screenshot_error.png', 'screenshot');

Step 3: Extract Text from Images

Use EXTRACT_TEXT to perform OCR on images.

UPDATE images_with_text
SET extracted_text = EXTRACT_TEXT(image),
    processed_at = CURRENT_TIMESTAMP
WHERE image IS NOT NULL;

Step 4: Create Scanned Documents Table

Create a specialized table for scanned documents.

CREATE TABLE scanned_documents (
    id INTEGER PRIMARY KEY,
    document_name VARCHAR(255) NOT NULL,
    scan_image IMAGE(PNG),
    full_text TEXT,
    page_number INTEGER DEFAULT 1,
    total_pages INTEGER DEFAULT 1,
    document_type VARCHAR(50),
    scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO scanned_documents (id, document_name, document_type, page_number, total_pages) VALUES
    (1, 'Contract_2024_001', 'contract', 1, 5),
    (2, 'Contract_2024_001', 'contract', 2, 5),
    (3, 'Invoice_March', 'invoice', 1, 1),
    (4, 'Meeting_Notes', 'notes', 1, 2),
    (5, 'ID_Document', 'identification', 1, 1);

Step 5: Process Scanned Documents

Extract text from scanned document images.

UPDATE scanned_documents
SET full_text = EXTRACT_TEXT(scan_image)
WHERE scan_image IS NOT NULL AND full_text IS NULL;

Step 6: Search Extracted Text

Search within extracted text content.

-- Search for specific terms in extracted text
SELECT
    filename,
    image_type,
    extracted_text
FROM images_with_text
WHERE extracted_text LIKE '%email%'
   OR extracted_text LIKE '%phone%';

Step 7: Create Receipt Scanner Table

Build a receipt scanning system.

CREATE TABLE receipt_scans (
    id INTEGER PRIMARY KEY,
    receipt_image IMAGE(JPEG),
    store_name VARCHAR(255),
    receipt_text TEXT,
    total_amount DECIMAL(10, 2),
    receipt_date DATE,
    expense_category VARCHAR(50),
    processed BOOLEAN DEFAULT FALSE,
    scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO receipt_scans (id, store_name, total_amount, receipt_date, expense_category) VALUES
    (1, 'Office Supply Co', 125.50, '2024-03-15', 'office'),
    (2, 'Tech Electronics', 899.99, '2024-03-14', 'equipment'),
    (3, 'Coffee Shop', 15.75, '2024-03-15', 'meals'),
    (4, 'Gas Station', 65.00, '2024-03-13', 'travel');

Step 8: Process Receipt Text

Extract and analyze receipt content.

UPDATE receipt_scans
SET receipt_text = EXTRACT_TEXT(receipt_image),
    processed = TRUE
WHERE receipt_image IS NOT NULL AND processed = FALSE;

SELECT
    store_name,
    total_amount,
    receipt_date,
    expense_category
FROM receipt_scans
WHERE processed = TRUE
ORDER BY receipt_date DESC;

Step 9: Document Search System

Build a searchable document index.

CREATE TABLE document_index (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    source_table VARCHAR(50) NOT NULL,
    searchable_text TEXT,
    word_count INTEGER,
    indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Index scanned documents
INSERT INTO document_index (id, document_id, source_table, searchable_text)
SELECT
    id,
    id,
    'scanned_documents',
    full_text
FROM scanned_documents
WHERE full_text IS NOT NULL;

Step 10: Full Text Search Query

Search across all indexed documents.

SELECT
    di.document_id,
    di.source_table,
    di.searchable_text,
    di.indexed_at
FROM document_index di
WHERE di.searchable_text LIKE '%contract%'
ORDER BY di.indexed_at DESC;

Step 11: OCR Processing Statistics

Track OCR processing status and results.

SELECT
    image_type,
    COUNT(*) as total_images,
    COUNT(extracted_text) as processed_count,
    COUNT(*) - COUNT(extracted_text) as pending_count
FROM images_with_text
GROUP BY image_type
ORDER BY total_images DESC;

Step 12: Unprocessed Images Report

Find images that haven't been processed.

SELECT
    id,
    filename,
    image_type,
    created_at
FROM images_with_text
WHERE extracted_text IS NULL
   OR extracted_text = ''
ORDER BY created_at;

Cleanup (Optional)

DROP TABLE IF EXISTS document_index;
DROP TABLE IF EXISTS receipt_scans;
DROP TABLE IF EXISTS scanned_documents;
DROP TABLE IF EXISTS images_with_text;

Expected Outcomes

  • EXTRACT_TEXT performs OCR on images
  • Extracted text is searchable
  • Different image types processed
  • Receipt scanning system functional
  • Document index enables search

OCR Use Cases

Image Type Use Case
Business Cards Contact extraction
Receipts Expense tracking
Documents Document search
Screenshots Error log capture
Signs Location identification

Key Concepts Learned

  • EXTRACT_TEXT function for OCR
  • Storing extracted text for search
  • Processing status tracking
  • Document indexing pattern
  • Full-text search on images

Tags

sqlintermediateimagesocrtext-extractionai

Run this on your own machine

Install SynapCores Community Edition free, paste the SQL or Cypher above into the bundled web UI, and watch it run.

Download Free CE