Type what you're looking for — "dog", "car", "person" — and SpotSeg will highlight it in the image. Switch to Auto-Detect to find every object at once.
Your text query is encoded by CLIP's text encoder into a 512-d semantic vector. The image is simultaneously processed by CLIP's vision transformer to produce dense spatial features.
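The shapes involved can be sketched with stand-in tensors. In a real pipeline these embeddings would come from a CLIP checkpoint (for example via Hugging Face's `CLIPModel` / `CLIPSegForImageSegmentation`); here random vectors just illustrate the dimensions and the L2 normalization CLIP applies. The 22x22 patch grid is an assumption based on a 352-pixel input to a ViT with 16-pixel patches:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for CLIP's text encoder output: one 512-d semantic vector.
text_embed = rng.standard_normal(512)
text_embed /= np.linalg.norm(text_embed)  # CLIP embeddings are L2-normalized

# Stand-in for the vision transformer's dense spatial features:
# one 512-d embedding per patch on an assumed 22x22 grid.
vision_feats = rng.standard_normal((22, 22, 512))
vision_feats /= np.linalg.norm(vision_feats, axis=-1, keepdims=True)

print(text_embed.shape, vision_feats.shape)  # (512,) (22, 22, 512)
```

Everything downstream operates on these two tensors: one vector describing "what", one grid describing "where".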
CLIPSeg's decoder fuses the text and vision embeddings at every spatial location, producing a per-pixel probability heatmap of where the described object appears.
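CLIPSeg's actual decoder is a small transformer conditioned on the text embedding, not a plain similarity score, but the core idea of scoring every spatial location against the query can be sketched as a per-pixel dot product pushed through a sigmoid (the `temperature` value here is an illustrative assumption):

```python
import numpy as np

def fuse(vision_feats, text_embed, temperature=0.07):
    """Score every spatial location against the text query.

    vision_feats: (H, W, D) L2-normalized patch embeddings
    text_embed:   (D,) L2-normalized text embedding
    Returns a (H, W) heatmap of values in (0, 1).
    """
    logits = vision_feats @ text_embed / temperature  # cosine similarity, scaled
    return 1.0 / (1.0 + np.exp(-logits))              # sigmoid -> probabilities
```

High values mean "this patch looks like the query"; the result is the low-resolution heatmap the next stage refines.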
Raw probabilities are thresholded into a clean binary mask, upsampled to the original image resolution using bilinear interpolation, and edge-smoothed for natural boundaries.
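A minimal numpy version of this refinement step, following the order described above (threshold, then bilinear upsample, then a small box blur standing in for edge smoothing; the 0.5 threshold and 3x3 blur kernel are assumptions):

```python
import numpy as np

def bilinear_upsample(a, out_h, out_w):
    """Resize a 2-D array to (out_h, out_w) with bilinear interpolation."""
    in_h, in_w = a.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = a[y0][:, x0] * (1 - wx) + a[y0][:, x1] * wx
    bot = a[y1][:, x0] * (1 - wx) + a[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def box_blur(m, k=3):
    """Average each pixel with its kxk neighborhood to soften mask edges."""
    p = k // 2
    padded = np.pad(m, p, mode="edge")
    out = np.zeros_like(m, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + m.shape[0], dx:dx + m.shape[1]]
    return out / (k * k)

def refine_mask(probs, out_h, out_w, thresh=0.5):
    """Heatmap -> binary mask -> full-resolution, edge-smoothed soft mask."""
    binary = (probs > thresh).astype(float)
    return box_blur(bilinear_upsample(binary, out_h, out_w))
```

Note that bilinear upsampling and blurring reintroduce fractional values at object boundaries, which is exactly what gives the final mask its natural-looking edges.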
The mask is composited with the original image — as a colored highlight, background blur, or glowing contour — to clearly show exactly where the object is.
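The colored-highlight variant of this compositing is straightforward alpha blending; a sketch, with the color and opacity chosen arbitrarily for illustration:

```python
import numpy as np

def highlight(image, mask, color=(255, 60, 60), alpha=0.5):
    """Blend a colored overlay into the image where the mask is high.

    image: (H, W, 3) uint8 original image
    mask:  (H, W) float soft mask in [0, 1]
    """
    overlay = np.asarray(color, dtype=float)  # broadcasts to (H, W, 3)
    a = (alpha * mask)[..., None]             # per-pixel blend weight
    out = image * (1 - a) + overlay * a
    return out.astype(np.uint8)
```

Because the mask is soft at object boundaries, the blend weight fades out gradually there, so the highlight tapers off instead of ending in a hard jagged edge. Background blur and glowing contours follow the same pattern, just blending against a blurred copy of the image or a dilated mask outline instead of a flat color.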