

Forge is a swarm-based kernel optimizer that accelerates GPU inference for any model. Users enter a HuggingFace model ID and Forge automatically generates optimized CUDA/Triton kernels for every layer, achieving up to a 5× speedup over torch.compile(mode='max-autotune') at a 97.6% correctness rate.
The system runs 32 parallel Coder+Judge agent pairs that compete to find the fastest kernel implementation. Each pair explores optimization strategies including tensor core utilization, memory coalescing, and kernel fusion. Forge accepts input from PyTorch (any nn.Module or function), KernelBench (250+ benchmark tasks), or HuggingFace (any model ID). Its Pattern RAG draws on CUTLASS (NVIDIA's world-class GEMM templates) and Triton (OpenAI's high-level GPU DSL), with kernels tuned for H100, B200, and A100 GPUs.
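To make the input paths concrete, here is a hypothetical usage sketch; the `forge` package name, the `optimize` entry point, and its parameters are illustrative assumptions, not Forge's documented API:

```python
# Hypothetical sketch -- the `forge` module, `optimize` function, and
# parameter names are assumptions for illustration, not Forge's real API.
import torch
import torch.nn as nn

import forge  # assumed package name

# 1) Optimize an arbitrary PyTorch module.
mlp = nn.Sequential(nn.Linear(4096, 4096), nn.GELU())
fast_mlp = forge.optimize(mlp, example_inputs=(torch.randn(8, 4096),))

# 2) Optimize a full HuggingFace model by ID.
fast_llama = forge.optimize("meta-llama/Llama-3.1-8B", backend="triton")
```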
Forge employs evolutionary optimization with a MAP-Elites archive spanning 36 behavior cells and an island model with four specialized populations linked by migration. The system combines LLM-guided code transformations with inference-time scaling, powered by a fine-tuned and optimized NVIDIA Nemotron 3 Nano 30B generating 250k tokens/second. This enables deep exploration of the optimization space in minutes instead of hours.
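A minimal sketch of the MAP-Elites mechanism described above, assuming a 6×6 behavior grid (36 cells); the two behavior descriptors used here (register and shared-memory usage) are illustrative guesses, since the document does not specify Forge's feature dimensions:

```python
import random
from dataclasses import dataclass

@dataclass
class Kernel:
    code: str
    latency_ms: float      # fitness: lower is better
    regs_per_thread: int   # behavior dim 1 (assumed descriptor)
    smem_kb: int           # behavior dim 2 (assumed descriptor)

def cell(k: Kernel) -> tuple:
    # Discretize behavior into a 6x6 = 36-cell grid, matching the
    # 36 behavior cells mentioned above.
    return (min(k.regs_per_thread // 32, 5), min(k.smem_kb // 16, 5))

archive: dict[tuple, Kernel] = {}

def insert(k: Kernel) -> None:
    # MAP-Elites rule: each cell keeps only its fastest kernel,
    # preserving behavioral diversity while optimizing speed.
    c = cell(k)
    if c not in archive or k.latency_ms < archive[c].latency_ms:
        archive[c] = k

def sample_parent() -> Kernel:
    # Parents for the next mutation round are drawn from occupied cells.
    return random.choice(list(archive.values()))
```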
In the Coder+Judge pattern, coders generate kernels with RAG context and judges validate correctness before compilation. The evaluation pipeline deduplicates candidates (skipping kernels that are ≥95% similar to ones already evaluated), compiles them with nvcc or Triton, runs a quick correctness test, and finishes with a full performance evaluation. Forge emits two output formats: native CUDA kernels compiled with nvcc, and Triton JIT-compiled Python DSL kernels, both usable as drop-in PyTorch replacements (see the Triton sketch below).
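As a concrete instance of the Triton output format, the sketch below shows a minimal JIT-compiled kernel wrapped as a drop-in replacement for a PyTorch elementwise op; it is a generic Triton example, not a kernel Forge produced:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                      # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def fused_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Drop-in replacement for torch.add on contiguous CUDA tensors.
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```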
Forge provides optimized kernels for a range of model architectures:

- Llama-3.1-8B: fused GQA + RoPE + SwiGLU kernels (SwiGLU reference computation shown after this list)
- Qwen2.5-7B: sliding-window attention + fused MLP
- Mistral-7B: grouped-query attention optimization
- Phi-3-mini: long-context RoPE + flash attention
- SDXL UNet: fused cross-attention + conv kernels
- Whisper-large: encoder-decoder attention optimization
- BERT-large: bidirectional attention + FFN fusion
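For reference, a fused SwiGLU kernel like the Llama-3.1-8B one replaces the unfused PyTorch computation below; the weight naming follows the common Llama convention, and the shapes are Llama-3.1-8B's published sizes:

```python
import torch
import torch.nn.functional as F

def swiglu_reference(x, w_gate, w_up, w_down):
    # Unfused SwiGLU MLP: three GEMMs plus an elementwise SiLU gate.
    # A fused kernel collapses these steps, avoiding round-trips of the
    # large intermediate activations through global memory.
    return (F.silu(x @ w_gate.T) * (x @ w_up.T)) @ w_down.T

hidden, inter = 4096, 14336  # Llama-3.1-8B hidden/intermediate sizes
x = torch.randn(1, 16, hidden)
w_gate, w_up = torch.randn(inter, hidden), torch.randn(inter, hidden)
w_down = torch.randn(hidden, inter)
y = swiglu_reference(x, w_gate, w_up, w_down)  # shape (1, 16, 4096)
```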
Forge targets developers working on GPU inference optimization, particularly those using PyTorch and HuggingFace models who need performance beyond what torch.compile provides. It is designed for users with datacenter GPU access, with optimizations for B200, H100, and H200.