Molmo 2 is a new suite of state-of-the-art vision-language models designed for advanced video and image analysis. It expands on Molmo's strengths in grounded vision to handle video and multi-image understanding, providing powerful capabilities for analyzing visual content across space and time.
Molmo 2's video tracking outperforms both open-weight VLM baselines and specialized open trackers, and surpasses proprietary systems such as Gemini 3 Pro. On image and multi-image reasoning benchmarks, Molmo 2 (8B) leads all open-weight models. The system backs its answers with concrete visual evidence: video grounding (pointing and counting) localizes objects in both space and time, long-video QA achieves strong results, and Molmo 2 leads open-weight models in human preference evaluations. It also produces dense video captions with highly descriptive narratives and performs multi-object tracking with persistent IDs.
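Counting-by-pointing means the count falls out of the grounding itself: each counted object comes with a point you can verify. The sketch below parses point annotations from a model answer, assuming an XML-like point markup modeled on Molmo's grounded pointing output; the exact Molmo 2 serialization may differ.

```python
import re

def parse_points(answer: str) -> list[tuple[float, float]]:
    """Extract (x, y) coordinates from a model answer.

    Assumes an XML-like tag such as <point x="12.3" y="45.6">label</point>,
    modeled on Molmo's pointing format (hypothetical for Molmo 2).
    """
    points = []
    for m in re.finditer(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"', answer):
        points.append((float(m.group(1)), float(m.group(2))))
    return points

# Counting-by-pointing: the count is the number of grounded points.
answer = '<point x="10.5" y="20.0">car</point> <point x="55.1" y="60.2">car</point>'
print(len(parse_points(answer)))  # prints 2
```

Because every count is tied to coordinates, a downstream system can overlay the points on the frame to audit the answer rather than trusting a bare number.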
Molmo 2 natively supports single images, multi-image inputs, and video clips of varying length. The architecture consists of a vision encoder that processes images or video frames into visual tokens and a language model backbone (Qwen 3 or Olmo) with a lightweight connector that interleaves visual tokens with timestamps, image indices, and text. All variants are trained in two stages: pretraining for alignment and grounding through joint image captioning and pointing, followed by supervised fine-tuning on a multimodal mixture integrating images, multi-image sets, videos, and pure text.
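The connector's interleaving step can be sketched as follows. This is a minimal illustration, not the actual Molmo 2 API: the function name, marker format, and token representation are all assumptions; the real connector operates on embedding tensors rather than strings.

```python
def build_multimodal_sequence(frames, timestamps, question_tokens):
    """Sketch of interleaving visual tokens with timestamps, image
    indices, and text (hypothetical API; real connectors work on
    embeddings, not strings)."""
    sequence = []
    for idx, (frame_tokens, t) in enumerate(zip(frames, timestamps)):
        # Prefix each frame's tokens with its index and timestamp so the
        # language model can reason about when each frame occurs.
        sequence.append(f"<frame {idx} t={t:.2f}s>")
        sequence.extend(frame_tokens)
    # Text tokens follow the interleaved visual context.
    sequence.extend(question_tokens)
    return sequence

seq = build_multimodal_sequence(
    frames=[["v0", "v1"], ["v2", "v3"]],   # visual tokens per frame
    timestamps=[0.0, 0.5],                 # seconds into the clip
    question_tokens=["What", "happens", "?"],
)
```

The same layout covers all three input types: a single image is one frame, a multi-image set is several frames distinguished by index, and a video clip is frames distinguished by timestamp.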
The system enables counting-by-pointing, multi-object tracking with persistent IDs that follow objects across occlusions, dense video captioning with searchable narratives, anomaly detection that flags rare events, artifact detection pointing to flaws in generative video, and subtitle-aware QA combining visual evidence with in-video subtitles. At inference time, Molmo 2 scales gracefully using strategies like processing more frames directly or adopting SlowFast-style approaches with high resolution on key frames.
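A SlowFast-style sampling strategy can be sketched in a few lines. The strides and the two-stream split below are illustrative, not Molmo 2's actual inference configuration: a dense "fast" stream of frames captures motion cheaply, while a sparse "slow" stream of key frames is kept at high resolution for fine detail.

```python
def slowfast_sample(num_frames: int, fast_stride: int = 2, slow_stride: int = 16):
    """Select frame indices for a SlowFast-style two-stream pass.

    Strides are illustrative assumptions, not Molmo 2's actual settings.
    """
    fast = list(range(0, num_frames, fast_stride))   # many frames, low resolution
    slow = list(range(0, num_frames, slow_stride))   # few key frames, high resolution
    return fast, slow

fast, slow = slowfast_sample(64)
# 32 dense motion frames; 4 high-resolution key frames at indices 0, 16, 32, 48.
```

Scaling up is then a matter of loosening either stride: more fast frames for finer temporal coverage, or more slow frames when detail matters across the whole clip.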
Molmo 2 is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines, and is designed for a wide audience, from practitioners to researchers, who want to build systems that understand visual content. It serves applications in robotics and assistive technology, traffic monitoring and safety systems, scientific measurement, and other domains requiring video understanding. The Olmo-backed variant is particularly useful for researchers who want full control over every part of the stack: vision encoder, connector, and language model. The models are available via the Ai2 Playground and will soon be available via API.