Type what you're looking for — "dog", "car", "person" — and SpotSeg will highlight it in the image. Switch to Auto-Detect to find every object at once.
Your text query is encoded by CLIP's text encoder into a 512-d semantic vector. The image is simultaneously processed by CLIP's vision transformer to produce dense spatial features.
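The shapes involved can be sketched with stand-in tensors. In a real pipeline these embeddings would come from a CLIP checkpoint (for example via Hugging Face's `CLIPModel` / `CLIPSegForImageSegmentation`); here random vectors just illustrate the dimensions and the L2 normalization CLIP applies. The 22x22 patch grid is an assumption based on a 352-pixel input to a ViT with 16-pixel patches:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for CLIP's text encoder output: one 512-d semantic vector.
text_embed = rng.standard_normal(512)
text_embed /= np.linalg.norm(text_embed)  # CLIP embeddings are L2-normalized

# Stand-in for the vision transformer's dense spatial features:
# one 512-d embedding per patch on an assumed 22x22 grid.
vision_feats = rng.standard_normal((22, 22, 512))
vision_feats /= np.linalg.norm(vision_feats, axis=-1, keepdims=True)

print(text_embed.shape, vision_feats.shape)  # (512,) (22, 22, 512)
```

Everything downstream operates on these two tensors: one vector describing "what", one grid describing "where".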
CLIPSeg's decoder fuses the text and vision embeddings at every spatial location, producing a per-pixel probability heatmap of where the described object appears.
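CLIPSeg's actual decoder is a small transformer conditioned on the text embedding, not a plain similarity score, but the core idea of scoring every spatial location against the query can be sketched as a per-pixel dot product pushed through a sigmoid (the `temperature` value here is an illustrative assumption):

```python
import numpy as np

def fuse(vision_feats, text_embed, temperature=0.07):
    """Score every spatial location against the text query.

    vision_feats: (H, W, D) L2-normalized patch embeddings
    text_embed:   (D,) L2-normalized text embedding
    Returns a (H, W) heatmap of values in (0, 1).
    """
    logits = vision_feats @ text_embed / temperature  # cosine similarity, scaled
    return 1.0 / (1.0 + np.exp(-logits))              # sigmoid -> probabilities
```

High values mean "this patch looks like the query"; the result is the low-resolution heatmap the next stage refines.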
Raw probabilities are thresholded into a clean binary mask, upsampled to the original image resolution using bilinear interpolation, and edge-smoothed for natural boundaries.
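A minimal numpy version of this refinement step, following the order described above (threshold, then bilinear upsample, then a small box blur standing in for edge smoothing; the 0.5 threshold and 3x3 blur kernel are assumptions):

```python
import numpy as np

def bilinear_upsample(a, out_h, out_w):
    """Resize a 2-D array to (out_h, out_w) with bilinear interpolation."""
    in_h, in_w = a.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = a[y0][:, x0] * (1 - wx) + a[y0][:, x1] * wx
    bot = a[y1][:, x0] * (1 - wx) + a[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def box_blur(m, k=3):
    """Average each pixel with its kxk neighborhood to soften mask edges."""
    p = k // 2
    padded = np.pad(m, p, mode="edge")
    out = np.zeros_like(m, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + m.shape[0], dx:dx + m.shape[1]]
    return out / (k * k)

def refine_mask(probs, out_h, out_w, thresh=0.5):
    """Heatmap -> binary mask -> full-resolution, edge-smoothed soft mask."""
    binary = (probs > thresh).astype(float)
    return box_blur(bilinear_upsample(binary, out_h, out_w))
```

Note that bilinear upsampling and blurring reintroduce fractional values at object boundaries, which is exactly what gives the final mask its natural-looking edges.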
The mask is composited with the original image — as a colored highlight, background blur, or glowing contour — to clearly show exactly where the object is.
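The colored-highlight variant of this compositing is straightforward alpha blending; a sketch, with the color and opacity chosen arbitrarily for illustration:

```python
import numpy as np

def highlight(image, mask, color=(255, 60, 60), alpha=0.5):
    """Blend a colored overlay into the image where the mask is high.

    image: (H, W, 3) uint8 original image
    mask:  (H, W) float soft mask in [0, 1]
    """
    overlay = np.asarray(color, dtype=float)  # broadcasts to (H, W, 3)
    a = (alpha * mask)[..., None]             # per-pixel blend weight
    out = image * (1 - a) + overlay * a
    return out.astype(np.uint8)
```

Because the mask is soft at object boundaries, the blend weight fades out gradually there, so the highlight tapers off instead of ending in a hard jagged edge. Background blur and glowing contours follow the same pattern, just blending against a blurred copy of the image or a dilated mask outline instead of a flat color.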