Live Demo

SpotSeg Object Segmentation

Type what you're looking for — "dog", "car", "person" — and SpotSeg will highlight it in the image. Switch to Auto-Detect to find every object at once.

CLIPSeg · CLIP ViT · YOLOv8 · Zero-Shot · Resource-Constrained Inference
Upload an image (JPG, PNG, or WEBP, up to 10 MB), type what to find (comma-separate multiple objects, or leave empty for Auto-Detect), and click "Find Objects." The Space may need ~1-2 minutes to wake up on the first request.

How it works

01

Text & Image Encoding

Your text query is encoded by CLIP's text encoder into a 512-d semantic vector. The image is simultaneously processed by CLIP's vision transformer to produce dense spatial features.
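The live demo presumably runs a pretrained CLIPSeg model (e.g. via Hugging Face Transformers), so the snippet below is only a shape-level numpy sketch of what this step produces: the encoder functions, the 22x22 patch grid, and the random features are all hypothetical stand-ins, chosen to match a 352x352 input with 16x16 patches.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 512   # CLIP's shared text/vision embedding width
GRID = 22         # 352x352 input with 16x16 patches -> 22x22 spatial tokens

def encode_text(query: str) -> np.ndarray:
    """Hypothetical text encoder: one L2-normalized 512-d vector per query."""
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def encode_image(image: np.ndarray) -> np.ndarray:
    """Hypothetical vision transformer: a 512-d feature per spatial patch."""
    feats = rng.standard_normal((GRID, GRID, EMBED_DIM))
    return feats / np.linalg.norm(feats, axis=-1, keepdims=True)

text_vec = encode_text("dog")
vision_feats = encode_image(np.zeros((352, 352, 3)))
print(text_vec.shape, vision_feats.shape)   # (512,) (22, 22, 512)
```

The key point is the shapes: one global vector for the text, but a dense grid of features for the image, so the next stage can reason about *where* in the image the query matches.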

02

Cross-Modal Segmentation

CLIPSeg's decoder fuses the text and vision embeddings at every spatial location, producing a per-pixel probability heatmap of where the described object appears.
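CLIPSeg's actual decoder is a small transformer conditioned on the text embedding; as a simplified sketch of the fusion idea, the snippet below scores the text vector against the feature at every spatial location and squashes the result into a probability. All tensors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the step-01 outputs: a unit text vector and a grid of
# unit vision features (shapes match a 22x22 patch grid, 512-d embeddings).
text_vec = rng.standard_normal(512)
text_vec /= np.linalg.norm(text_vec)
vision_feats = rng.standard_normal((22, 22, 512))
vision_feats /= np.linalg.norm(vision_feats, axis=-1, keepdims=True)

# Simplified cross-modal fusion: cosine similarity at every location,
# then a sigmoid to turn similarities into per-pixel probabilities.
logits = vision_feats @ text_vec          # (22, 22) similarity map
heatmap = 1.0 / (1.0 + np.exp(-logits))  # probabilities in (0, 1)

print(heatmap.shape)   # (22, 22)
```

Note the output is low-resolution (one value per patch, not per pixel); the refinement stage below is what brings it back to image resolution.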

03

Mask Refinement

Raw probabilities are thresholded into a clean binary mask, upsampled to the original image resolution using bilinear interpolation, and edge-smoothed for natural boundaries.
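The three refinement operations can be sketched in plain numpy. The real demo likely uses a library resize (e.g. `torch.nn.functional.interpolate` or OpenCV), and the 0.25 threshold, 352x352 target size, and 3x3 box blur here are illustrative choices, not the demo's actual parameters.

```python
import numpy as np

def bilinear_upsample(a: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinear interpolation of a 2-D array to (out_h, out_w)."""
    in_h, in_w = a.shape
    ys = np.linspace(0.0, in_h - 1, out_h)
    xs = np.linspace(0.0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = a[np.ix_(y0, x0)] * (1 - wx) + a[np.ix_(y0, x1)] * wx
    bot = a[np.ix_(y1, x0)] * (1 - wx) + a[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

rng = np.random.default_rng(0)
heatmap = rng.random((22, 22))                # stand-in low-res probability map

binary = (heatmap > 0.25).astype(float)       # 1. threshold to a binary mask
mask = bilinear_upsample(binary, 352, 352)    # 2. upsample to image resolution
# 3. edge smoothing: a 3x3 box blur softens the jagged mask boundary
pad = np.pad(mask, 1, mode="edge")
smooth = sum(pad[dy:dy+352, dx:dx+352] for dy in range(3) for dx in range(3)) / 9.0

print(smooth.shape)   # (352, 352)
```

Thresholding before upsampling keeps the mask decisive at object interiors, while the interpolation and blur turn its hard staircase edges into soft transitions for compositing.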

04

Visualization

The mask is composited with the original image — as a colored highlight, background blur, or glowing contour — to show exactly where the object is.
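The colored-highlight mode reduces to a per-pixel alpha blend. A minimal sketch, assuming a soft mask in [0, 1] from the refinement step; the `highlight` helper, its color, and its alpha are hypothetical, not SpotSeg's actual values.

```python
import numpy as np

def highlight(image: np.ndarray, mask: np.ndarray,
              color=(255, 80, 80), alpha=0.5) -> np.ndarray:
    """Alpha-blend `color` over `image` wherever `mask` is on.

    image: (H, W, 3) uint8; mask: (H, W) floats in [0, 1].
    """
    m = mask[..., None] * alpha                              # blend weight
    out = image.astype(float) * (1 - m) + np.array(color, float) * m
    return out.round().astype(np.uint8)

img = np.full((4, 4, 3), 200, dtype=np.uint8)   # flat gray test image
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1.0   # toy 2x2 "object"
out = highlight(img, mask)
print(out[0, 0], out[1, 1])   # background pixel unchanged, object pixel tinted
```

Background blur and glowing contours follow the same pattern: the mask (or its edge band) acts as the per-pixel weight between the original image and the effect layer.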
