Grounded SAM 2 Stream was an early (2024) investigation into inferring user intent from natural-language queries. The system combines Grounding DINO for open-vocabulary object detection with SAM 2 for segmentation, using local LLMs served via Ollama to interpret what the user wants to identify in a video stream.
Instead of requiring users to specify exact object classes, this system lets them describe what they’re looking for in natural language:
“I am looking for something blue that squirts water”
The LLM interprets this query and guides the vision models to find and track the relevant object in real-time.
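The query-interpretation step can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual code: the function names, the prompt wording, and the model name are all assumptions. The idea is that the LLM turns a free-form request into a short noun phrase that Grounding DINO can accept as a text prompt.

```python
def build_llm_prompt(user_query: str) -> str:
    """Wrap the user's free-form request in an instruction asking the
    LLM to answer with a short detection phrase for Grounding DINO.
    (Hypothetical prompt wording, not the project's actual prompt.)"""
    return (
        "A user describes an object they want to find in a video stream.\n"
        f'User request: "{user_query}"\n'
        "Reply with only a short noun phrase naming the object, "
        "e.g. 'blue water gun'."
    )


def extract_detection_phrase(llm_reply: str) -> str:
    """Normalize the LLM's reply into a detection prompt: take the
    first line, strip quotes and trailing punctuation, lowercase it."""
    phrase = llm_reply.strip().splitlines()[0].strip()
    phrase = phrase.strip('"\'').rstrip('.')
    return phrase.lower()


# With the Ollama Python client, the round trip might look like this
# (requires a running Ollama server, so it is left commented out):
#
#   import ollama
#   reply = ollama.chat(
#       model="llama3",  # assumed model; the project's choice is not stated
#       messages=[{"role": "user",
#                  "content": build_llm_prompt("something blue that squirts water")}],
#   )
#   phrase = extract_detection_phrase(reply["message"]["content"])
#   # `phrase` would then be passed to Grounding DINO as its text prompt,
#   # and the resulting boxes handed to SAM 2 for mask tracking.
```

The detection phrase produced here would feed Grounding DINO, whose boxes in turn prompt SAM 2 for per-frame masks.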
This project was an early prototype exploring how users might interact with vision systems more naturally. While the field has evolved significantly since 2024, it represented an interesting approach to bridging language and vision for assistive applications.