Google has released Multi-Token Prediction (MTP) drafter models for Gemma 4, its family of open AI models. The drafters can speed up inference by up to three times without degrading output quality. They target a key bottleneck in deploying large language models: generating text one token at a time is usually limited by memory bandwidth rather than raw computing power, because every generated token requires reading the model's weights from memory.
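A rough back-of-envelope bound shows why bandwidth dominates. Assume a dense model whose $W$ bytes of weights must be streamed from memory once per decoding step over a link with bandwidth $B$, and ignore compute time and KV-cache traffic. Each step then takes at least $W/B$. The figures below (16 GB of weights, 1 TB/s of bandwidth) are hypothetical and chosen only to illustrate the arithmetic:

```latex
t_{\text{step}} \gtrsim \frac{W}{B},
\qquad
\text{e.g.}\quad \frac{16\ \text{GB}}{1\ \text{TB/s}} = 16\ \text{ms per step}
\;\Rightarrow\; \lesssim 62.5\ \text{tokens/s at batch size 1}
```

Under these assumptions, any technique that yields several accepted tokens per weight read raises this ceiling, which is the opening that speculative decoding exploits.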
The new drafters use a technique called speculative decoding. The main Gemma 4 model is paired with a smaller, faster "drafter" model that proposes several future tokens ahead of it. The larger model then verifies the whole batch of proposed tokens in a single parallel pass, keeping the tokens it agrees with and replacing the first one it rejects. Because several tokens can be accepted per pass, latency drops significantly compared with the standard one-token-at-a-time generation process. The MTP drafters are available under an open-source license and are integrated into popular AI development platforms.
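The draft-then-verify loop can be sketched in Python. This is a minimal illustration of generic greedy speculative decoding, not Google's MTP implementation: `toy_target` and `toy_drafter` are hypothetical stand-ins for a Gemma 4 model and its drafter, and the verification "pass" is simulated with one call per position, whereas a real system scores all drafted positions in one batched forward pass. With greedy decoding the output is identical to what the target model alone would produce; sampling-based variants use a rejection-sampling rule to preserve the target's output distribution instead.

```python
from typing import Callable, List, Tuple

# Hypothetical toy "model": maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target step per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    target: Model, drafter: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (sequence, verification rounds)."""
    seq = list(prompt)
    rounds = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The cheap drafter proposes k tokens, one after another.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(seq + draft))
        # 2. The target checks each drafted position (simulated here as a loop;
        #    a real system scores all k+1 positions in one parallel pass).
        rounds += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(seq + draft[:i])
            # Always keep the target's token: it equals the draft on a match,
            # corrects the first mismatch, or is a bonus token after k matches.
            accepted.append(expected)
            if i == k or draft[i] != expected:
                break
        seq.extend(accepted)
    # Trim any tokens generated past the requested length.
    return seq[: len(prompt) + n_new], rounds


# Hypothetical toy models over small integer tokens.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + len(prefix)) % 11


def toy_drafter(prefix: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 5.
    guess = toy_target(prefix)
    return guess if len(prefix) % 5 else (guess + 1) % 11


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, rounds = speculative_decode(toy_target, toy_drafter, prompt, n_new=20, k=4)
    assert spec == greedy_decode(toy_target, prompt, 20)  # same output as plain decoding
    print(f"{rounds} verification rounds for 20 new tokens")
```

The number of rounds, rather than the number of tokens, is what drives cost in a memory-bound setting: each round reads the target's weights once, so a drafter that is usually right lets each read produce several tokens.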