Training / Standard term
Mixture of Experts (MoE)
A model architecture that contains many specialized subnetworks ("experts") and activates only a small subset of them for each input, making very large models faster and cheaper to run.
A standard ("dense") model uses all of its parameters for every input. A Mixture of Experts (MoE) model instead splits most of its parameters into specialist subnetworks and uses a small routing network to send each token to only the few most relevant experts, often just one or two. A model with 200 billion total parameters might activate only 40 billion for any given token. Speed and cost scale with the active parameters, while the model still draws on the full breadth of knowledge stored across all experts. Mistral's Mixtral popularized the approach in open models, and major frontier models are widely believed to use similar designs.
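A minimal sketch of the routing idea in NumPy. Everything here is illustrative: the names (`moe_layer`, `gate_w`) are invented, each "expert" is just a matrix multiply, and real implementations route whole batches of tokens and add load-balancing terms, but the top-k selection below is the core mechanism.

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Route one token through the top-k experts of a toy MoE layer."""
    logits = x @ gate_w                       # one routing score per expert
    top_k = np.argsort(logits)[-k:]           # indices of the k best-scoring experts
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                  # softmax over the chosen experts only
    # Only the selected experts run; the rest stay idle for this token.
    return sum(w * experts[i](x) for i, w in zip(top_k, weights))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W
           for _ in range(n_experts)]         # each expert: a tiny feed-forward map
gate_w = rng.standard_normal((d, n_experts))  # the router's learned projection
y = moe_layer(rng.standard_normal(d), gate_w, experts)
print(y.shape)                                # (8,) - same shape as the input token
```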
Builder example
MoE explains why some models with enormous parameter counts run faster than you would expect, and why headline parameter numbers can be misleading. When comparing models for deployment, active parameter count, memory requirements, and actual latency matter more than total size. A 200-billion-parameter MoE model might be faster and cheaper than a dense 70-billion-parameter model in practice.
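To make that comparison concrete, here is the back-of-the-envelope arithmetic, using the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token. The numbers mirror the examples above and are illustrative, not benchmarks.

```python
# Per-token compute, using the rough "2 FLOPs per active parameter" estimate.
moe_total, moe_active = 200e9, 40e9   # 200B-parameter MoE, 40B active per token
dense_params = 70e9                   # a 70B dense model uses every parameter

moe_flops = 2 * moe_active
dense_flops = 2 * dense_params
print(f"MoE:   {moe_flops:.1e} FLOPs/token")   # 8.0e+10
print(f"Dense: {dense_flops:.1e} FLOPs/token") # 1.4e+11
# The much larger MoE model needs roughly 40% less compute per token.
```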
Common confusion: Even though only a fraction of experts activate per input, all experts typically need to be loaded into memory. A trillion-parameter MoE model still requires the hardware to store a trillion parameters, even if each request only uses a small slice of them.
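A quick illustration of that footprint, assuming 16-bit weights; quantization changes the constant but not the fact that memory scales with total parameters.

```python
# Serving memory scales with TOTAL parameters, because every expert
# must be resident so the router can pick any of them.
total_params = 1e12          # a trillion-parameter MoE model
bytes_per_param = 2          # fp16/bf16 weights; quantization lowers this
weights_gb = total_params * bytes_per_param / 1e9
print(f"~{weights_gb:,.0f} GB of weights")   # ~2,000 GB, before KV cache etc.
```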