Prompt Engineering in Practice · Multimodality
Grounding and Bounding Boxes
Introduction
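How to make a vision-language model point at a specific spot in an image using bounding boxes, points, or segmentation masks. Topics covered: coordinate systems (Gemini's [y, x] order normalised to 0-1000 versus Claude's pixel coordinates), Set-of-Mark prompting, IoU (intersection over union) for scoring predicted boxes, coordinate hallucinations, and cost compared with specialised detectors (Grounding DINO, SAM).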
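To make the coordinate-system and IoU topics concrete, here is a minimal Python sketch. It assumes a box arrives in the normalised form the [y,x,0-1000] notation suggests, that is [y_min, x_min, y_max, x_max] on a 0-1000 scale, and converts it to pixel coordinates before scoring it against a reference box. The function names `normalized_yx_to_pixels` and `iou` are illustrative and not part of any SDK.

```python
from typing import Tuple

# Pixel-space box: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def normalized_yx_to_pixels(
    box: Tuple[float, float, float, float],
    width: int,
    height: int,
    scale: int = 1000,
) -> Box:
    """Convert [y_min, x_min, y_max, x_max] on a 0..scale grid to pixel (x, y) order.

    Assumption: the input uses y-first ordering and is normalised to `scale`.
    """
    y_min, x_min, y_max, x_max = box
    return (
        x_min * width / scale,
        y_min * height / scale,
        x_max * width / scale,
        y_max * height / scale,
    )


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two pixel-space boxes; 0.0 if they do not overlap."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical model output for a 1000x500 image, compared with a hand-labelled box.
    predicted = normalized_yx_to_pixels((100, 200, 500, 600), width=1000, height=500)
    reference = (210.0, 60.0, 590.0, 240.0)
    print(predicted, round(iou(predicted, reference), 3))
```

Swapping x and y or forgetting to rescale are easy mistakes when switching between models, and either one usually shows up as an IoU close to zero against a known reference box.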