Molmo 2 is a suite of state-of-the-art vision-language models that can analyze videos and multiple images at once, released with open weights, training data, and training code. It delivers strong video understanding, pointing, and tracking capabilities for researchers, developers, and practitioners working on multimodal intelligence applications. The models are designed to bring the same impact to video that the original Molmo brought to image understanding, expanding grounded vision capabilities to temporal and multi-image contexts.
Video is quickly becoming the dominant language of data across applications from smartphones to autonomous vehicles and industrial sensors. Understanding how the world changes over time underpins research into next-generation multimodal intelligence for robotics, assistive technology, traffic monitoring and safety, scientific measurement, and more. Molmo 2 addresses the gap in open models for video grounding, enabling models to move beyond simple descriptive answers to pinpointing events in space and time.
Molmo 2 offers three variants serving different needs: Molmo 2 (8B) is Qwen 3-based and the best overall model for video grounding and QA; Molmo 2 (4B) is also Qwen 3-based and optimized for efficiency; and Molmo 2-O (7B) is built on Olmo, offering a fully open end-to-end model flow including the underlying LLM for researchers who want full control over every part of the stack. These smaller models punch above their weight, with the 8B variant outperforming the original Molmo (72B) on key image pointing and grounding benchmarks while delivering stronger localization and reasoning in a far more efficient package.
The models achieve state-of-the-art performance across multiple dimensions of multimodal evaluation, leading or tying for best results among open peers on image QA, short-video QA, video counting, video tracking, and human preference while remaining competitive with substantially larger proprietary systems. On video tracking specifically, Molmo 2 is the strongest tracker in the reported evaluations, outperforming open-weight VLM baselines and specialized open trackers, including Sa2VA variants and a Molmo + SAM 2 baseline, and beating proprietary systems such as Gemini 3 Pro by a wide margin.
Molmo 2 natively supports single images, multi-image inputs, and video clips of varying length, extending the concept of pointing from static images to space and time. Rather than giving only a descriptive answer, Molmo 2 can answer a question by returning concrete visual evidence through spatial and temporal grounding. Ask 'How many times does the robot grasp the red block?' and the model returns points and timestamps for each grasp event; ask 'When did the cup fall?' and it returns the timestamp and location of the fall; ask 'Which player scored the goal?' and it identifies and locates that player in the relevant frames.
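To make the idea of a grounded answer concrete, the following minimal sketch shows one way a caller might represent points and timestamps once they have been parsed from a model response. The class names, field names, and normalized coordinate convention are assumptions made for illustration; they are not Molmo 2's actual output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GroundedPoint:
    """One piece of visual evidence (hypothetical schema, not Molmo 2's format)."""
    time_s: float  # seconds from the start of the clip
    x: float       # horizontal position, assumed normalized to [0, 1]
    y: float       # vertical position, assumed normalized to [0, 1]


@dataclass
class GroundedAnswer:
    """A text answer together with the evidence that supports it."""
    text: str
    evidence: List[GroundedPoint]

    def count(self) -> int:
        # One evidence point per event, so the count is the number of points.
        return len(self.evidence)

    def timestamps(self) -> List[float]:
        # Event times in chronological order.
        return sorted(p.time_s for p in self.evidence)


# Illustrative values only; these are not real model outputs.
answer = GroundedAnswer(
    text="The robot grasps the red block twice.",
    evidence=[GroundedPoint(12.0, 0.50, 0.60), GroundedPoint(3.5, 0.40, 0.55)],
)
print(answer.count(), answer.timestamps())
```

Keeping the evidence next to the text lets downstream code check a claimed count against the number of returned points instead of trusting the sentence alone.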
Molmo 2's architecture consists of a vision encoder that processes images or video frames into visual tokens and a language model backbone (Qwen 3 or Olmo) that consumes those tokens alongside text. A lightweight connector interleaves visual tokens with timestamps, image indices, and text so the model can reason jointly over space, time, and language. All variants are trained in two stages: first, pretraining for alignment and grounding through joint image captioning and image pointing; second, supervised fine-tuning on a multimodal mixture that integrates images, multi-image sets, videos, and pure text across categories including captions, image QA, video QA, pointing, tracking, and NLP.
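The interleaving step can be pictured with a toy example. The sketch below builds a flat sequence in which each frame contributes a timestamp marker followed by placeholder visual tokens, with the text prompt appended at the end. The marker strings, the `interleave` function, and the whitespace tokenization are all hypothetical; the real connector operates on embeddings and its exact layout is not described here.

```python
from typing import List, Sequence


def interleave(frame_times: Sequence[float], tokens_per_frame: int, prompt: str) -> List[str]:
    """Build a toy sequence: [timestamp, visual tokens...] per frame, then the prompt.

    Hypothetical illustration only; marker names are invented for this sketch.
    """
    sequence: List[str] = []
    for t in frame_times:
        sequence.append(f"<t={t:.1f}s>")                 # timestamp marker for this frame
        sequence.extend(["<vis>"] * tokens_per_frame)    # placeholder visual tokens
    sequence.extend(prompt.split())                      # naive whitespace tokenization
    return sequence


seq = interleave([0.0, 0.5], tokens_per_frame=3, prompt="When did the cup fall?")
print(seq)
```

Because every block of visual tokens is preceded by its timestamp, a language model reading the sequence left to right can, in principle, associate what it sees with when it happened.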
These capabilities open up a range of applications: counting-by-pointing, multi-object tracking with persistent IDs that follow objects across occlusions and re-entries, dense video captioning with highly descriptive and searchable narratives of long clips, anomaly detection that flags rare or surprising events, artifact detection that points to flaws in generative video such as inconsistent lighting or broken object geometry, and subtitle-aware QA that combines visual evidence with in-video subtitles. For prompts like 'Point out every instance where the person in the striped shirt flexes their muscles,' Molmo 2 analyzes the entire clip, emits coordinates and timestamps for each event, and maintains stable IDs across sequences to avoid double-counting.
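Stable IDs are what make counting-by-pointing robust: an object that leaves and re-enters the frame keeps its ID, so it is counted once. The sketch below groups points by a track ID and counts distinct IDs. The `TrackPoint` tuple and its fields are assumptions for illustration, not the model's actual output schema.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class TrackPoint(NamedTuple):
    """One tracked point (hypothetical schema)."""
    track_id: int  # persistent ID assumed to be assigned by the tracker
    time_s: float  # seconds from the start of the clip
    x: float       # assumed normalized to [0, 1]
    y: float       # assumed normalized to [0, 1]


def group_by_track(points: List[TrackPoint]) -> Dict[int, List[TrackPoint]]:
    """Collect points per track ID, each track sorted by time."""
    tracks: Dict[int, List[TrackPoint]] = defaultdict(list)
    for p in points:
        tracks[p.track_id].append(p)
    for pts in tracks.values():
        pts.sort(key=lambda p: p.time_s)
    return dict(tracks)


def count_objects(points: List[TrackPoint]) -> int:
    """Count distinct objects, not distinct points."""
    return len(group_by_track(points))


# Illustrative values only; these are not real model outputs.
points = [
    TrackPoint(1, 0.0, 0.20, 0.50),
    TrackPoint(2, 0.0, 0.70, 0.40),
    TrackPoint(1, 4.0, 0.25, 0.50),  # object 1 reappears after an occlusion
]
tracks = group_by_track(points)
print(count_objects(points))
```

Counting raw points here would give three; counting distinct IDs gives two, which is the double-counting that persistent IDs are meant to prevent.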
Concrete use cases include video summarization where users upload clips for automatic narration; industrial monitoring where systems track objects and events in manufacturing or safety footage; educational applications where students analyze scientific phenomena across multiple images or video sequences; content moderation that identifies specific events or objects in video streams; and research applications where scientists need to ground answers in visual evidence across temporal sequences. The models handle referring expressions like 'Find the window above the kitchen sink,' 'Identify the voice-capturing device held by the woman in yellow,' or 'Locate the animal that makes the plank tip downward' by resolving these queries and returning approximate locations and times.
Molmo 2 is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines and serves a wide audience from practitioners to researchers. The models are available through the Ai2 Playground with video and multi-image workflows, Hugging Face for download, and soon via API. The training code is released under an open-source license, and all newly introduced Molmo 2 open datasets are available with detailed data recipes, plus benchmarks and tools for grounded video evaluation. The architecture scales gracefully at inference time, allowing users to process more frames directly for maximum fidelity or adopt a SlowFast-style strategy that uses high resolution on key frames and lower resolution on others while maintaining similar accuracy on long-video tasks with significantly fewer vision tokens.
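The token savings of a SlowFast-style strategy follow from simple arithmetic. The sketch below assigns a high token budget to every k-th frame and a low budget to the rest, then compares the total with processing every frame at high resolution. Choosing key frames at a fixed stride and the specific token counts are assumptions for illustration; the source does not specify how key frames are selected or how many tokens each resolution uses.

```python
from typing import List


def slowfast_plan(num_frames: int, key_every: int, hi_tokens: int, lo_tokens: int) -> List[int]:
    """Per-frame vision-token budget: key frames get hi_tokens, others lo_tokens.

    Key frames are taken at a fixed stride, which is an assumption of this sketch.
    """
    return [hi_tokens if i % key_every == 0 else lo_tokens for i in range(num_frames)]


# Hypothetical numbers chosen only to show the arithmetic.
plan = slowfast_plan(num_frames=8, key_every=4, hi_tokens=100, lo_tokens=25)
slowfast_total = sum(plan)      # 2 key frames * 100 + 6 other frames * 25
uniform_total = 8 * 100         # every frame at high resolution
print(plan, slowfast_total, uniform_total)
```

With these made-up numbers the mixed plan uses 350 tokens instead of 800, which illustrates how a long clip can be covered with far fewer vision tokens.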
Molmo 2 represents a significant advancement in open video understanding, bringing state-of-the-art video pointing, tracking, and dense captioning to the same level of accessibility that Molmo achieved for image understanding. By providing open weights, training data, and training code alongside strong performance across multiple benchmarks, Molmo 2 enables researchers and developers to build systems that anyone can reuse, customize, and improve for multimodal intelligence applications.
Molmo 2 is designed for researchers, developers, and practitioners working on multimodal intelligence applications. Primary users include AI researchers who need open models for video understanding experiments, developers building applications in robotics, assistive technology, traffic monitoring, safety systems, and scientific measurement, educators and students studying computer vision and multimodal AI, and industry practitioners implementing video analysis solutions. The models serve those who want full control over their AI stack with open weights and training code, particularly researchers using the Olmo-backed variant for complete transparency. Users range from those prioritizing efficiency with the 4B model to those needing maximum performance with the 8B variant.