Grounded SAM 2 Stream was an early (2024) investigation into inferring user intent from natural-language queries. The system combines Grounding DINO for open-vocabulary object detection with SAM 2 for segmentation, using local LLMs served via Ollama to interpret what the user wants to identify in a video stream.
Instead of requiring users to specify exact object classes, this system lets them describe what they’re looking for in natural language:
“I am looking for something blue that squirts water”
The LLM interprets this query and guides the vision models to find and track the relevant object in real-time.
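The query-interpretation step can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual code: the function names, the prompt wording, and the model name are all assumptions. The idea is that the LLM turns a free-form request into a short noun phrase that Grounding DINO can accept as a text prompt.

```python
def build_llm_prompt(user_query: str) -> str:
    """Wrap the user's free-form request in an instruction asking the
    LLM to answer with a short detection phrase for Grounding DINO.
    (Hypothetical prompt wording, not the project's actual prompt.)"""
    return (
        "A user describes an object they want to find in a video stream.\n"
        f'User request: "{user_query}"\n'
        "Reply with only a short noun phrase naming the object, "
        "e.g. 'blue water gun'."
    )


def extract_detection_phrase(llm_reply: str) -> str:
    """Normalize the LLM's reply into a detection prompt: take the
    first line, strip quotes and trailing punctuation, lowercase it."""
    phrase = llm_reply.strip().splitlines()[0].strip()
    phrase = phrase.strip('"\'').rstrip('.')
    return phrase.lower()


# With the Ollama Python client, the round trip might look like this
# (requires a running Ollama server, so it is left commented out):
#
#   import ollama
#   reply = ollama.chat(
#       model="llama3",  # assumed model; the project's choice is not stated
#       messages=[{"role": "user",
#                  "content": build_llm_prompt("something blue that squirts water")}],
#   )
#   phrase = extract_detection_phrase(reply["message"]["content"])
#   # `phrase` would then be passed to Grounding DINO as its text prompt,
#   # and the resulting boxes handed to SAM 2 for mask tracking.
```

The detection phrase produced here would feed Grounding DINO, whose boxes in turn prompt SAM 2 for per-frame masks.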
This project was an early prototype exploring how users might interact with vision systems more naturally. While the field has evolved significantly since 2024, it represented an interesting approach to bridging language and vision for assistive applications.