Image OCR Extraction
Objective
Extract text from images using AIDB's EXTRACT_TEXT function. This enables searching within images containing text such as documents, signs, and screenshots.
Step 1: Create Table for Images with Text
Create a table to store images and their extracted text.
CREATE TABLE images_with_text (
id INTEGER PRIMARY KEY,
filename VARCHAR(255) NOT NULL,
image IMAGE(JPEG),
image_type VARCHAR(50),
extracted_text TEXT,
extraction_confidence DECIMAL(5, 4),
processed_at TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Step 2: Insert Sample Image Records
Add sample images that contain text.
INSERT INTO images_with_text (id, filename, image_type) VALUES
(1, 'business_card.jpg', 'business_card'),
(2, 'receipt_001.jpg', 'receipt'),
(3, 'street_sign.jpg', 'sign'),
(4, 'whiteboard_notes.jpg', 'whiteboard'),
(5, 'book_page.jpg', 'document'),
(6, 'handwritten_note.jpg', 'handwritten'),
(7, 'product_label.jpg', 'label'),
(8, 'screenshot_error.png', 'screenshot');
Step 3: Extract Text from Images
Use EXTRACT_TEXT to perform OCR on images.
UPDATE images_with_text
SET extracted_text = EXTRACT_TEXT(image),
processed_at = CURRENT_TIMESTAMP
WHERE image IS NOT NULL;
Step 4: Create Scanned Documents Table
Create a specialized table for scanned documents.
CREATE TABLE scanned_documents (
id INTEGER PRIMARY KEY,
document_name VARCHAR(255) NOT NULL,
scan_image IMAGE(PNG),
full_text TEXT,
page_number INTEGER DEFAULT 1,
total_pages INTEGER DEFAULT 1,
document_type VARCHAR(50),
scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO scanned_documents (id, document_name, document_type, page_number, total_pages) VALUES
(1, 'Contract_2024_001', 'contract', 1, 5),
(2, 'Contract_2024_001', 'contract', 2, 5),
(3, 'Invoice_March', 'invoice', 1, 1),
(4, 'Meeting_Notes', 'notes', 1, 2),
(5, 'ID_Document', 'identification', 1, 1);
Step 5: Process Scanned Documents
Extract text from scanned document images.
UPDATE scanned_documents
SET full_text = EXTRACT_TEXT(scan_image)
WHERE scan_image IS NOT NULL AND full_text IS NULL;
Step 6: Search Extracted Text
Search within extracted text content.
-- Search for specific terms in extracted text
SELECT
filename,
image_type,
extracted_text
FROM images_with_text
WHERE extracted_text LIKE '%email%'
OR extracted_text LIKE '%phone%';
Step 7: Create Receipt Scanner Table
Build a receipt scanning system.
CREATE TABLE receipt_scans (
id INTEGER PRIMARY KEY,
receipt_image IMAGE(JPEG),
store_name VARCHAR(255),
receipt_text TEXT,
total_amount DECIMAL(10, 2),
receipt_date DATE,
expense_category VARCHAR(50),
processed BOOLEAN DEFAULT FALSE,
scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO receipt_scans (id, store_name, total_amount, receipt_date, expense_category) VALUES
(1, 'Office Supply Co', 125.50, '2024-03-15', 'office'),
(2, 'Tech Electronics', 899.99, '2024-03-14', 'equipment'),
(3, 'Coffee Shop', 15.75, '2024-03-15', 'meals'),
(4, 'Gas Station', 65.00, '2024-03-13', 'travel');
Step 8: Process Receipt Text
Extract and analyze receipt content.
UPDATE receipt_scans
SET receipt_text = EXTRACT_TEXT(receipt_image),
processed = TRUE
WHERE receipt_image IS NOT NULL AND processed = FALSE;
SELECT
store_name,
total_amount,
receipt_date,
expense_category
FROM receipt_scans
WHERE processed = TRUE
ORDER BY receipt_date DESC;
Step 9: Document Search System
Build a searchable document index.
CREATE TABLE document_index (
id INTEGER PRIMARY KEY,
document_id INTEGER NOT NULL,
source_table VARCHAR(50) NOT NULL,
searchable_text TEXT,
word_count INTEGER,
indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Index scanned documents
INSERT INTO document_index (id, document_id, source_table, searchable_text)
SELECT
id,
id,
'scanned_documents',
full_text
FROM scanned_documents
WHERE full_text IS NOT NULL;
Step 10: Full Text Search Query
Search across all indexed documents.
SELECT
di.document_id,
di.source_table,
di.searchable_text,
di.indexed_at
FROM document_index di
WHERE di.searchable_text LIKE '%contract%'
ORDER BY di.indexed_at DESC;
Step 11: OCR Processing Statistics
Track OCR processing status and results.
SELECT
image_type,
COUNT(*) as total_images,
COUNT(extracted_text) as processed_count,
COUNT(*) - COUNT(extracted_text) as pending_count
FROM images_with_text
GROUP BY image_type
ORDER BY total_images DESC;
Step 12: Unprocessed Images Report
Find images that haven't been processed.
SELECT
id,
filename,
image_type,
created_at
FROM images_with_text
WHERE extracted_text IS NULL
OR extracted_text = ''
ORDER BY created_at;
Cleanup (Optional)
DROP TABLE IF EXISTS document_index;
DROP TABLE IF EXISTS receipt_scans;
DROP TABLE IF EXISTS scanned_documents;
DROP TABLE IF EXISTS images_with_text;
Expected Outcomes
- EXTRACT_TEXT performs OCR on images
- Extracted text is searchable
- Different image types processed
- Receipt scanning system functional
- Document index enables search
OCR Use Cases
| Image Type | Use Case |
|---|---|
| Business Cards | Contact extraction |
| Receipts | Expense tracking |
| Documents | Document search |
| Screenshots | Error log capture |
| Signs | Location identification |
Key Concepts Learned
- EXTRACT_TEXT function for OCR
- Storing extracted text for search
- Processing status tracking
- Document indexing pattern
- Full-text search on images