

Mercury 2 is the world's fastest reasoning language model, built to make production AI feel instant. It is designed for production deployments where latency-sensitive applications and real-time user experiences are critical, such as coding agents, voice interfaces, and search pipelines. Its main purpose is to deliver reasoning-grade quality within real-time latency budgets, shifting the quality-speed curve for AI in production.
Production AI today involves loops: agents, retrieval pipelines, and extraction jobs running in the background at volume. In these loops, latency compounds across every step, user, and retry, creating a significant bottleneck. Current LLMs typically use autoregressive, sequential decoding, generating one token at a time, which limits speed and responsiveness in real-time applications.
Mercury 2 generates responses through parallel refinement diffusion technology, producing multiple tokens simultaneously and converging over a small number of steps. This approach is less like a typewriter and more like an editor revising a full draft at once, resulting in over 5x faster generation with a fundamentally different speed curve. It achieves speeds of 1,009 tokens per second on NVIDIA Blackwell GPUs, optimizing for responsiveness that users actually feel, including p95 latency under high concurrency and stable throughput.
The model offers tunable reasoning, allowing customization for specific tasks and complexity levels. It supports a 128K context window, enabling handling of extensive documents or long conversations. Native tool use is integrated, facilitating agentic workflows where the model can interact with external tools and APIs seamlessly.
Mercury 2 provides schema-aligned JSON output, ensuring structured and reliable data generation for applications requiring precise formatting. It is OpenAI API compatible, allowing easy integration into existing stacks without rewrites, and is available at a price of $0.25 per 1M input tokens and $0.75 per 1M output tokens.
Overall, Mercury 2 works by leveraging diffusion-based reasoning to bypass the sequential decoding bottleneck. Its unique methodology involves parallel token generation, which reduces latency and enables real-time performance. This approach changes the reasoning trade-off, allowing higher intelligence without sacrificing speed, making it suitable for applications where latency budgets are tight.
admin
Benefits for users include significantly reduced latency in AI interactions, making experiences feel instant and natural. It enables more complex agentic workflows by cutting latency per call, which improves final output quality and allows more steps to be run affordably. Users gain competitive quality with leading speed-optimized models, ensuring high performance without compromises.
For coding and editing, Mercury 2 supports autocomplete, next-edit suggestions, refactors, and interactive code agents, where fast suggestions maintain developer flow without pauses. In agentic loops, it chains dozens of inference calls per task, enhancing efficiency and output quality in applications like advertising optimization and transcript cleanup.
In real-time voice and interaction, it makes reasoning-level quality viable within natural speech cadences, enabling lifelike AI video avatars and responsive voice agents. For search and RAG pipelines, it adds reasoning to multi-hop retrieval, reranking, and summarization without blowing latency budgets, benefiting customer support, compliance, and e-commerce.
Target users include developers, companies deploying production AI, and enterprises needing real-time applications. Integrations are seamless via OpenAI API compatibility, and the tech stack leverages NVIDIA Blackwell GPUs for optimal performance. Pricing is transparent with per-token costs, and enterprise evaluations are supported for workload fit and performance validation.
In summary, Mercury 2 represents a breakthrough in AI speed and responsiveness, enabling real-time reasoning for production deployments where latency is non-negotiable, and user experience is paramount.
Mercury 2 targets developers and companies deploying production AI in latency-sensitive applications. It is ideal for enterprises needing real-time performance in coding agents, voice interfaces, agentic workflows, and search pipelines. Users include AI engineers, product teams, and organizations focused on enhancing user experience with instant, reasoning-grade AI interactions, leveraging integrations like OpenAI API compatibility and NVIDIA GPUs for scalable deployments.