Mercury 2 is the world's fastest reasoning language model built to make production AI feel instant. It uses parallel refinement diffusion technology to generate tokens simultaneously at speeds exceeding 1,000 tokens per second.

Mercury 2 is a reasoning language model designed specifically for production AI applications where speed and responsiveness are critical. It represents a fundamental shift from traditional autoregressive models by employing diffusion-based parallel refinement technology.

Key features include tunable reasoning capabilities, 128K context length, native tool use functionality, and schema-aligned JSON output. The model achieves speeds of 1,009 tokens per second on NVIDIA Blackwell GPUs while maintaining competitive quality with leading speed-optimized models.

Unlike traditional sequential decoding models that generate one token at a time, Mercury 2 generates responses through parallel refinement, producing multiple tokens simultaneously and converging over a small number of steps. This diffusion-based approach provides a fundamentally different speed curve that's more than 5x faster than conventional methods.

The model excels in latency-sensitive applications where user experience is non-negotiable, including coding and editing workflows, agentic loops, real-time voice interactions, and search/RAG pipelines. It enables reasoning-grade quality within real-time latency budgets that traditional models cannot achieve.

Mercury 2 targets developers and enterprises building production AI systems that require high-speed inference, particularly those working on agentic workflows, coding assistants, voice interfaces, and search applications. The model is OpenAI API compatible and can be integrated into existing stacks without requiring rewrites.

Mercury 2

Mercury 2

Key Features

Use Cases

Who is this for?

Comments