Prompt Engineering in Practice · Multimodality

Grounding and Bounding Boxes

Introduction

How to make a vision-language model point at a specific spot in an image using bounding boxes, points, or segmentation masks. Topics: coordinate systems (Gemini's [y, x] order normalised to 0-1000 versus Claude's pixel coordinates), Set-of-Mark prompting, IoU (intersection over union) as an accuracy measure, coordinate hallucinations, and cost compared with specialised detectors (Grounding DINO, SAM).
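To make the coordinate-system and IoU points concrete, here is a minimal Python sketch. It assumes the normalised box arrives as `[y_min, x_min, y_max, x_max]` on a 0-1000 scale, which is one reading of the `[y,x,0-1000]` shorthand above. The function names `normalized_to_pixels` and `iou` are illustrative and do not come from any vendor SDK.

```python
from typing import Sequence, Tuple

# Pixel-space box as (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def normalized_to_pixels(box: Sequence[float], width: int, height: int) -> Box:
    """Convert an assumed [y_min, x_min, y_max, x_max] box on a 0-1000 scale
    into pixel coordinates ordered (x_min, y_min, x_max, y_max)."""
    y_min, x_min, y_max, x_max = box
    return (
        x_min * width / 1000,
        y_min * height / 1000,
        x_max * width / 1000,
        y_max * height / 1000,
    )


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two pixel-space boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical values: a model's normalised box on a 1920x1080 image,
# compared against a hand-labelled pixel box.
predicted = normalized_to_pixels([100, 200, 500, 600], width=1920, height=1080)
ground_truth = (400.0, 100.0, 1150.0, 550.0)
score = iou(predicted, ground_truth)
```

Converting every prediction into one pixel-space convention before scoring keeps the IoU comparison independent of which model produced the box. Swapping the x and y order is an easy mistake to make here and yields a box that looks plausible but sits in the wrong place.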