MiniCPM-o 4.5 is the latest and most capable model in the MiniCPM-o series. It is an end-to-end multimodal model built on SigLIP2, Whisper-medium, CosyVoice2, and Qwen3-8B, totaling 9B parameters. It delivers significant performance improvements and introduces full-duplex multimodal live streaming: the model processes real-time video and audio inputs while concurrently generating text and speech outputs.
The model offers leading visual capability, averaging 77.6 on the OpenCompass benchmark suite and surpassing proprietary models such as GPT-4o and Gemini 2.0 Pro. Its speech capabilities include bilingual real-time conversation in English and Chinese, configurable voices, voice cloning, and role play. Full-duplex streaming lets the model process video and audio inputs simultaneously without mutual blocking, enabling proactive interaction based on continuous scene understanding. Additional strengths include strong OCR performance, high-resolution image processing up to 1.8 million pixels, high-frame-rate video support up to 10 fps, multilingual coverage of 30+ languages, and trustworthy behavior on par with Gemini 2.5 Flash.
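As a concrete illustration of the image and OCR capabilities, the snippet below shows a minimal single-image chat through Hugging Face transformers. It is a sketch, not a confirmed recipe: the repository id `openbmb/MiniCPM-o-4_5` and the `model.chat` interface are assumptions carried over from earlier MiniCPM-o releases and may differ for this version.

```python
# Minimal sketch of single-image chat. The repo id and the `chat` method
# follow earlier MiniCPM-o releases and are assumptions for 4.5.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-4_5"  # assumed repository id
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Images are passed inline in the message content, alongside the text prompt.
image = Image.open("receipt.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Transcribe all text in this image."]}]

answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```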
Architecturally, the model processes multimodal inputs and generates outputs end to end in real time. It supports both simplex (turn-based) and duplex modes; duplex mode enables full-duplex streaming inference over live or recorded video conversations, in which the model sees, listens, and speaks concurrently without turn-taking lag, as sketched below.
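To make the duplex loop concrete, the following sketch shows how chunked audio/video prefill could interleave with generation. The `streaming_prefill` / `streaming_generate` names mirror the MiniCPM-o 2.6 omni-streaming API and are assumed, not confirmed, for 4.5; `get_next_av_chunk`, `user_finished_turn`, and `play` are hypothetical placeholders for an application's capture and playback pipeline.

```python
# Hedged sketch of a duplex streaming loop. streaming_prefill /
# streaming_generate follow the MiniCPM-o 2.6 omni API and are assumptions
# for 4.5; get_next_av_chunk(), user_finished_turn(), and play() are
# hypothetical placeholders.
session_id = "demo-session"

while True:
    chunk = get_next_av_chunk()  # ~1 s of synchronized audio + video frames
    if chunk is None:
        break

    # Feed each chunk as it arrives; prefill does not block further input,
    # which is what allows seeing and listening to continue while speaking.
    msgs = [{"role": "user", "content": [chunk.frame, chunk.audio]}]
    model.streaming_prefill(session_id=session_id, msgs=msgs, tokenizer=tokenizer)

    if user_finished_turn(chunk):
        # Stream text and speech out while new chunks keep arriving.
        for response in model.streaming_generate(
            session_id=session_id, tokenizer=tokenizer, generate_audio=True
        ):
            play(response)
```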
Benefits include efficient local deployment on consumer devices, state-of-the-art performance across multiple benchmarks, natural and expressive speech generation, and versatile multimodal capabilities. Use cases include real-time AI assistants, voice cloning applications, document parsing, multilingual communication, and interactive AI systems requiring simultaneous visual and auditory processing.
The model targets developers and researchers building multimodal AI applications and supports a range of deployment options: llama.cpp and Ollama for CPU inference, vLLM and SGLang for high-throughput serving, and web demos hosted on platforms such as Hugging Face. It integrates with popular AI frameworks and supports fine-tuning through LLaMA-Factory.
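For high-throughput serving, a typical path is vLLM's OpenAI-compatible server. The sketch below assumes the same repository id as above and the standard OpenAI multimodal chat format; check the model card for the flags vLLM actually requires for this release.

```python
# Query a vLLM OpenAI-compatible endpoint. Assumes the server was started
# with something like:
#   vllm serve openbmb/MiniCPM-o-4_5 --trust-remote-code
# The repo id is an assumption; consult the model card for exact flags.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-o-4_5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/doc.png"}},
            {"type": "text", "text": "Summarize this document."},
        ],
    }],
)
print(response.choices[0].message.content)
```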
MiniCPM-o 4.5 is aimed at developers and researchers who need advanced visual, speech, and streaming capabilities in one model: those who require local deployment through frameworks like llama.cpp and Ollama, as well as builders of AI assistants, document parsing systems, and real-time communication tools. It serves users who want state-of-the-art performance across vision, speech, and multimodal benchmarks while running efficiently on consumer hardware.