MiniCPM-o 4.5 is the latest and most capable model in the MiniCPM-o series. It is an end-to-end multimodal model built on SigLIP2, Whisper-medium, CosyVoice2, and Qwen3-8B, totaling 9B parameters. It delivers significant performance improvements and introduces full-duplex multimodal live streaming: the model processes real-time video and audio inputs while concurrently generating text and speech outputs.
The model offers leading visual capability, averaging 77.6 on the OpenCompass benchmark suite and surpassing proprietary models such as GPT-4o and Gemini 2.0 Pro. Its speech capabilities include bilingual real-time conversation in English and Chinese, configurable voices, voice cloning, and role play. Full-duplex streaming lets the model process video and audio inputs simultaneously without mutual blocking, enabling proactive interaction based on continuous scene understanding. Additional strengths include strong OCR performance, high-resolution image processing up to 1.8 million pixels, high-frame-rate video support up to 10 fps, multilingual coverage of 30+ languages, and trustworthy behavior on par with Gemini 2.5 Flash.
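As a concrete illustration of the image and OCR capabilities, the snippet below shows a minimal single-image chat through Hugging Face transformers. It is a sketch, not a confirmed recipe: the repository id `openbmb/MiniCPM-o-4_5` and the `model.chat` interface are assumptions carried over from earlier MiniCPM-o releases and may differ for this version.

```python
# Minimal sketch of single-image chat. The repo id and the `chat` method
# follow earlier MiniCPM-o releases and are assumptions for 4.5.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-4_5"  # assumed repository id
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Images are passed inline in the message content, alongside the text prompt.
image = Image.open("receipt.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Transcribe all text in this image."]}]

answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```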
Architecturally, the model processes multimodal inputs and generates outputs end to end in real time. It supports both simplex (turn-based) and duplex modes; duplex mode enables full-duplex streaming inference over live or recorded video conversations, in which the model sees, listens, and speaks concurrently without turn-taking lag, as sketched below.
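To make the duplex loop concrete, the following sketch shows how chunked audio/video prefill could interleave with generation. The `streaming_prefill` / `streaming_generate` names mirror the MiniCPM-o 2.6 omni-streaming API and are assumed, not confirmed, for 4.5; `get_next_av_chunk`, `user_finished_turn`, and `play` are hypothetical placeholders for an application's capture and playback pipeline.

```python
# Hedged sketch of a duplex streaming loop. streaming_prefill /
# streaming_generate follow the MiniCPM-o 2.6 omni API and are assumptions
# for 4.5; get_next_av_chunk(), user_finished_turn(), and play() are
# hypothetical placeholders.
session_id = "demo-session"

while True:
    chunk = get_next_av_chunk()  # ~1 s of synchronized audio + video frames
    if chunk is None:
        break

    # Feed each chunk as it arrives; prefill does not block further input,
    # which is what allows seeing and listening to continue while speaking.
    msgs = [{"role": "user", "content": [chunk.frame, chunk.audio]}]
    model.streaming_prefill(session_id=session_id, msgs=msgs, tokenizer=tokenizer)

    if user_finished_turn(chunk):
        # Stream text and speech out while new chunks keep arriving.
        for response in model.streaming_generate(
            session_id=session_id, tokenizer=tokenizer, generate_audio=True
        ):
            play(response)
```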
Benefits include efficient local deployment on consumer devices, state-of-the-art performance across multiple benchmarks, natural and expressive speech generation, and versatile multimodal capabilities. Use cases include real-time AI assistants, voice cloning applications, document parsing, multilingual communication, and interactive AI systems requiring simultaneous visual and auditory processing.
The model targets developers and researchers building multimodal AI applications and supports a range of deployment options: llama.cpp and Ollama for CPU inference, vLLM and SGLang for high-throughput serving, and web demos hosted on platforms such as Hugging Face. It integrates with popular AI frameworks and supports fine-tuning through LLaMA-Factory.
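For high-throughput serving, a typical path is vLLM's OpenAI-compatible server. The sketch below assumes the same repository id as above and the standard OpenAI multimodal chat format; check the model card for the flags vLLM actually requires for this release.

```python
# Query a vLLM OpenAI-compatible endpoint. Assumes the server was started
# with something like:
#   vllm serve openbmb/MiniCPM-o-4_5 --trust-remote-code
# The repo id is an assumption; consult the model card for exact flags.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-o-4_5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/doc.png"}},
            {"type": "text", "text": "Summarize this document."},
        ],
    }],
)
print(response.choices[0].message.content)
```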
MiniCPM-o 4.5 is aimed at developers and researchers who need advanced visual, speech, and streaming capabilities in one model: those who require local deployment through frameworks like llama.cpp and Ollama, as well as builders of AI assistants, document parsing systems, and real-time communication tools. It serves users who want state-of-the-art performance across vision, speech, and multimodal benchmarks while running efficiently on consumer hardware.