Google Research introduced TurboQuant, a vector quantization algorithm that compresses the high-dimensional vectors used in large language models and vector search engines, easing the memory bottlenecks that limit AI scaling.
TurboQuant reduces key-value (KV) cache memory by a factor of six or more while maintaining model accuracy.
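To make the memory arithmetic concrete, here is a minimal sketch of the general rotate-then-quantize idea behind many vector quantizers: apply a random orthogonal rotation to spread each vector's energy evenly across coordinates, then store each coordinate with only a few bits. This is a generic illustration, not TurboQuant's actual quantizer; the dimension, bit width, and helper names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 128
# Random orthogonal rotation (QR of a Gaussian matrix) spreads each
# vector's energy evenly across coordinates before quantizing.
# NOTE: illustrative sketch only, not TurboQuant's published algorithm.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(x, bits=4):
    """Rotate x, then uniformly quantize each coordinate to `bits` bits.

    Returns integer codes plus the per-vector (offset, scale) needed to
    dequantize.
    """
    r = Q @ x
    levels = 2 ** bits - 1
    lo = r.min()
    scale = (r.max() - lo) / levels
    codes = np.round((r - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    r = codes * scale + lo
    return Q.T @ r  # orthogonal, so the transpose undoes the rotation

x = rng.standard_normal(d)
codes, lo, scale = quantize(x)
x_hat = dequantize(codes, lo, scale)

rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error: {rel_err:.3f}")
# Storage falls from 32 bits/coordinate (float32) to 4 bits/coordinate,
# an ~8x reduction, plus a small per-vector overhead for (lo, scale).
```

Under this scheme the compression ratio is set directly by the bit width: 4 bits per coordinate versus 32-bit floats gives roughly 8x, and intermediate widths yield ratios in the 6x range cited above, at the cost of a small reconstruction error.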
Benchmarks on NVIDIA H100 GPUs show speedups of up to 8x. Because the method requires no model retraining, it lowers operational costs and broadens the range of hardware on which large models can run.