Image Similarity Matching
Objective
Use vector embeddings and cosine similarity to find visually similar images. This enables content discovery, duplicate detection, and recommendation systems.
Step 1: Create Image Catalog Table
Create a table for storing images with embeddings.
CREATE TABLE image_catalog (
id INTEGER PRIMARY KEY,
image_id VARCHAR(50) UNIQUE NOT NULL,
filename VARCHAR(255) NOT NULL,
image IMAGE(JPEG),
description TEXT,
category VARCHAR(50),
embedding VECTOR(384),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Step 2: Insert Sample Images
Add sample images with descriptions.
INSERT INTO image_catalog (id, image_id, filename, description, category) VALUES
(1, 'IMG001', 'beach_sunset_1.jpg', 'Beautiful sunset over ocean beach with orange and pink sky', 'nature'),
(2, 'IMG002', 'beach_sunset_2.jpg', 'Tropical beach at sunset with palm trees silhouette', 'nature'),
(3, 'IMG003', 'mountain_snow.jpg', 'Snow covered mountain peaks under blue sky', 'nature'),
(4, 'IMG004', 'city_night_1.jpg', 'City skyline at night with lights reflecting on water', 'urban'),
(5, 'IMG005', 'city_night_2.jpg', 'Downtown buildings illuminated at night', 'urban'),
(6, 'IMG006', 'forest_path.jpg', 'Winding path through autumn forest with fallen leaves', 'nature'),
(7, 'IMG007', 'dog_portrait.jpg', 'Golden retriever dog portrait in natural light', 'animals'),
(8, 'IMG008', 'cat_sleeping.jpg', 'Orange tabby cat sleeping on soft blanket', 'animals'),
(9, 'IMG009', 'food_pasta.jpg', 'Italian pasta dish with tomato sauce and basil', 'food'),
(10, 'IMG010', 'food_pizza.jpg', 'Wood-fired pizza with fresh mozzarella', 'food');
Step 3: Generate Image Embeddings
Create embeddings from image descriptions.
UPDATE image_catalog
SET embedding = EMBED(description)
WHERE embedding IS NULL;
Step 4: Find Similar Images by Query
Search for images similar to a text query.
SELECT
image_id,
filename,
description,
category,
COSINE_SIMILARITY(embedding, EMBED('sunset at the beach with waves')) as similarity
FROM image_catalog
WHERE embedding IS NOT NULL
ORDER BY similarity DESC
LIMIT 5;
Step 5: Find Similar Images to an Existing Image
Find images similar to a specific image in the database.
SELECT
b.image_id,
b.filename,
b.description,
COSINE_SIMILARITY(a.embedding, b.embedding) as similarity
FROM image_catalog a
CROSS JOIN image_catalog b
WHERE a.image_id = 'IMG001'
AND b.image_id != 'IMG001'
AND a.embedding IS NOT NULL
AND b.embedding IS NOT NULL
ORDER BY similarity DESC
LIMIT 5;
Step 6: Create Similarity Pairs Table
Store pre-computed similarity scores.
CREATE TABLE image_similarity_pairs (
id INTEGER PRIMARY KEY,
image_id_1 VARCHAR(50) NOT NULL,
image_id_2 VARCHAR(50) NOT NULL,
similarity_score DECIMAL(6, 5),
computed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Compute similarities for all pairs
INSERT INTO image_similarity_pairs (id, image_id_1, image_id_2, similarity_score)
SELECT
ROW_NUMBER() OVER (),
a.image_id,
b.image_id,
COSINE_SIMILARITY(a.embedding, b.embedding)
FROM image_catalog a
CROSS JOIN image_catalog b
WHERE a.id < b.id
AND a.embedding IS NOT NULL
AND b.embedding IS NOT NULL;
Step 7: Find Potential Duplicates
Identify images that may be duplicates.
SELECT
isp.image_id_1,
isp.image_id_2,
isp.similarity_score,
ic1.filename as file_1,
ic2.filename as file_2
FROM image_similarity_pairs isp
INNER JOIN image_catalog ic1 ON isp.image_id_1 = ic1.image_id
INNER JOIN image_catalog ic2 ON isp.image_id_2 = ic2.image_id
WHERE isp.similarity_score > 0.85
ORDER BY isp.similarity_score DESC;
Step 8: Category-Based Similarity
Find similar images within the same category.
SELECT
a.image_id as source_image,
b.image_id as similar_image,
a.category,
COSINE_SIMILARITY(a.embedding, b.embedding) as similarity
FROM image_catalog a
INNER JOIN image_catalog b ON a.category = b.category AND a.id < b.id
WHERE a.embedding IS NOT NULL
AND b.embedding IS NOT NULL
ORDER BY a.category, similarity DESC;
Step 9: Similarity Threshold Filtering
Find images above a similarity threshold.
SELECT
image_id,
filename,
description,
COSINE_SIMILARITY(embedding, EMBED('animals pets dogs cats')) as similarity
FROM image_catalog
WHERE embedding IS NOT NULL
AND COSINE_SIMILARITY(embedding, EMBED('animals pets dogs cats')) > 0.5
ORDER BY similarity DESC;
Step 10: Image Clusters by Similarity
Identify natural groupings based on similarity.
SELECT
ic.category,
COUNT(*) as image_count,
AVG(isp.similarity_score) as avg_internal_similarity
FROM image_catalog ic
INNER JOIN image_similarity_pairs isp ON
(ic.image_id = isp.image_id_1 OR ic.image_id = isp.image_id_2)
INNER JOIN image_catalog ic2 ON
(isp.image_id_1 = ic2.image_id OR isp.image_id_2 = ic2.image_id)
AND ic.category = ic2.category
AND ic.id != ic2.id
GROUP BY ic.category
ORDER BY avg_internal_similarity DESC;
Step 11: Similarity Distribution
Analyze the distribution of similarity scores.
SELECT
CASE
WHEN similarity_score >= 0.9 THEN '0.9-1.0 (Very Similar)'
WHEN similarity_score >= 0.7 THEN '0.7-0.9 (Similar)'
WHEN similarity_score >= 0.5 THEN '0.5-0.7 (Related)'
WHEN similarity_score >= 0.3 THEN '0.3-0.5 (Somewhat Related)'
ELSE '0.0-0.3 (Different)'
END as similarity_range,
COUNT(*) as pair_count
FROM image_similarity_pairs
GROUP BY
CASE
WHEN similarity_score >= 0.9 THEN '0.9-1.0 (Very Similar)'
WHEN similarity_score >= 0.7 THEN '0.7-0.9 (Similar)'
WHEN similarity_score >= 0.5 THEN '0.5-0.7 (Related)'
WHEN similarity_score >= 0.3 THEN '0.3-0.5 (Somewhat Related)'
ELSE '0.0-0.3 (Different)'
END
ORDER BY similarity_range DESC;
Step 12: Recommend Similar Images
Build a simple recommendation query.
SELECT
'Recommendations for ' || a.filename as recommendation_for,
b.filename as recommended_image,
b.category,
COSINE_SIMILARITY(a.embedding, b.embedding) as relevance
FROM image_catalog a
CROSS JOIN image_catalog b
WHERE a.image_id = 'IMG001'
AND b.image_id != a.image_id
AND a.embedding IS NOT NULL
AND b.embedding IS NOT NULL
ORDER BY relevance DESC
LIMIT 3;
Cleanup (Optional)
DROP TABLE IF EXISTS image_similarity_pairs;
DROP TABLE IF EXISTS image_catalog;
Expected Outcomes
- Embeddings generated from descriptions
- Cosine similarity finds related images
- Duplicate detection identifies copies
- Category-aware similarity works
- Recommendation queries return relevant results
Similarity Score Guide
| Score | Interpretation | Action |
|---|---|---|
| > 0.95 | Near duplicate | Review for removal |
| 0.8-0.95 | Very similar | Strong recommendation |
| 0.6-0.8 | Related content | Good recommendation |
| 0.4-0.6 | Some relation | Weak recommendation |
| < 0.4 | Different content | No recommendation |
Key Concepts Learned
- COSINE_SIMILARITY function usage
- Embedding-based image search
- Duplicate detection patterns
- Pre-computed similarity pairs
- Threshold-based filtering