Molmo 2 is a suite of state-of-the-art vision-language models that can analyze videos and multiple images at once, released with open weights, training data, and training code. It delivers strong video understanding, pointing, and tracking capabilities for researchers, developers, and practitioners working on multimodal intelligence applications. The models are designed to bring the same impact to video that the original Molmo brought to image understanding, expanding grounded vision capabilities to temporal and multi-image contexts.

Video is quickly becoming the dominant language of data across applications from smartphones to autonomous vehicles and industrial sensors. Understanding how the world changes over time underpins research into next-generation multimodal intelligence for robotics, assistive technology, traffic monitoring and safety, scientific measurement, and more. Molmo 2 addresses the gap in open models for video grounding, enabling models to move beyond simple descriptive answers and pinpoint events in space and time.

Molmo 2 offers three variants serving different needs: Molmo 2 (8B) is Qwen 3-based and the best overall model for video grounding and QA; Molmo 2 (4B) is also Qwen 3-based and optimized for efficiency; and Molmo 2-O (7B) is built on Olmo, offering a fully open end-to-end model flow including the underlying LLM for researchers who want full control over every part of the stack. These smaller models punch above their weight, with the 8B variant outperforming the original Molmo (72B) on key image pointing and grounding benchmarks while delivering stronger localization and reasoning in a far more efficient package.

The models achieve state-of-the-art performance across multiple dimensions of multimodal evaluation, leading or tying for best results among open peers on image QA, short-video QA, video counting, video tracking, and human preference, while remaining competitive with substantially larger proprietary systems. On video tracking specifically, Molmo 2 is the strongest tracker in the reported evaluations: it outperforms open-weight VLM baselines and specialized open trackers, including Sa2VA variants and a Molmo + SAM 2 baseline, and also beats proprietary systems like Gemini 3 Pro by a wide margin.

Molmo 2 natively supports single images, multi-image inputs, and video clips of varying length, extending the concept of pointing from static images to space and time. Rather than giving only descriptive answers, Molmo 2 can answer questions by returning concrete visual evidence through spatial and temporal grounding. Ask 'How many times does the robot grasp the red block?' and the model returns points and timestamps for each grasp event; ask 'When did the cup fall?' and it returns the timestamp and location of the fall; ask 'Which player scored the goal?' and it identifies and locates that player in the relevant frames.
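
To make the idea of a grounded answer concrete, here is a minimal Python sketch of how an application might represent points with timestamps and turn them into an event count, as in the robot-grasp question. Everything in it is an assumption for illustration: the `GroundedPoint` structure, the normalized coordinates, the `max_gap` threshold, and the sample values are hypothetical and do not describe Molmo 2's actual output format or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GroundedPoint:
    """One piece of visual evidence: a location in a frame at a given time.

    Hypothetical structure for illustration only; not Molmo 2's real output.
    """
    t_seconds: float  # timestamp within the video
    x: float          # horizontal position, assumed normalized to [0, 1]
    y: float          # vertical position, assumed normalized to [0, 1]


def group_into_events(points: List[GroundedPoint],
                      max_gap: float = 1.0) -> List[List[GroundedPoint]]:
    """Group points into events by time.

    Points are sorted by timestamp; a point joins the current event when it
    is at most `max_gap` seconds after the previous point, otherwise it
    starts a new event.
    """
    events: List[List[GroundedPoint]] = []
    for point in sorted(points, key=lambda p: p.t_seconds):
        if events and point.t_seconds - events[-1][-1].t_seconds <= max_gap:
            events[-1].append(point)
        else:
            events.append([point])
    return events


def summarize(points: List[GroundedPoint], max_gap: float = 1.0) -> Dict[str, object]:
    """Return the event count and the start time of each event."""
    events = group_into_events(points, max_gap)
    return {
        "count": len(events),
        "timestamps": [event[0].t_seconds for event in events],
    }


# Made-up points standing in for a model's answer to
# 'How many times does the robot grasp the red block?'
sample_points = [
    GroundedPoint(2.0, 0.41, 0.55),
    GroundedPoint(2.4, 0.42, 0.56),
    GroundedPoint(7.5, 0.60, 0.48),
    GroundedPoint(13.1, 0.35, 0.62),
    GroundedPoint(13.3, 0.36, 0.61),
]
summary = summarize(sample_points)
print(summary)
```

The time-gap grouping is a deliberately simple stand-in: it only shows why returning points and timestamps makes an answer checkable, since each counted event can be traced back to specific moments and locations in the clip.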
