Infrastructure

Quantization

A technique that reduces model size and speeds up inference by using lower-precision numbers for weights and activations.


Technical explanation

Quantization is about making models smaller and faster without destroying their accuracy. Neural networks typically use 32-bit floating point numbers for their weights. Quantization converts these to lower-precision formats: 16-bit floats, or 8-bit and even 4-bit integers. The result is a model that uses less memory, runs faster, and requires less compute.
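The float-to-integer conversion described above can be sketched with a few lines of numpy. This is a minimal illustration of affine (scale plus zero-point) int8 quantization, not any particular library's implementation; the function names are made up for the example.

```python
import numpy as np

def quantize_int8(w):
    """Affine quantization of a float32 array to int8.

    Returns the int8 values plus the scale and zero-point
    needed to map them back to float.
    """
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-lo / scale) - 128  # maps lo to roughly -128
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate float32."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)  # stand-in for a weight matrix
q, s, zp = quantize_int8(w)
w_hat = dequantize(q, s, zp)
max_err = float(np.abs(w - w_hat).max())  # bounded by about half the scale
```

Each weight is stored in one byte instead of four, at the cost of a small per-element rounding error proportional to the scale.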

There are different approaches. Post-training quantization applies the conversion after training is complete; it's quick and easy but can hurt accuracy. Quantization-aware training simulates low-precision arithmetic during training itself, producing models whose weights tolerate the precision loss better. For large language models, techniques like GPTQ and AWQ have become popular for aggressive 4-bit quantization with minimal quality loss.
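The core trick in quantization-aware training is "fake quantization": round weights to the low-precision grid and immediately convert back, so the forward pass sees the quantization error while everything stays in float. A minimal sketch, with illustrative names and a symmetric quantizer assumed:

```python
import numpy as np

def fake_quant(w, bits=8):
    """Simulated quantization as used in quantization-aware training:
    snap values to a low-precision grid, then dequantize, so training
    sees the rounding error while computing in float32.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(np.float32)         # back on the float side

w = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
w8 = fake_quant(w, bits=8)  # nearly identical to w
w2 = fake_quant(w, bits=2)  # coarse 2-bit grid, large error
```

During backpropagation the rounding step is typically treated as the identity (the straight-through estimator), so gradients flow through as if no quantization happened.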

The trade-offs are straightforward. Lower precision means smaller models and faster inference, but some accuracy is usually lost, especially at very low bit widths. The sweet spot depends on your use case: a chatbot might tolerate 4-bit quantization with no visible degradation, while a medical diagnosis model probably shouldn't compromise on precision. Testing against your own workload is essential.
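One concrete way to see the bit-width trade-off is to quantize a toy layer's weights at different precisions and measure how far its outputs drift. The shapes and thresholds below are illustrative, not a benchmark:

```python
import numpy as np

rng = np.random.default_rng(42)
w = rng.normal(scale=0.1, size=(64, 64)).astype(np.float32)  # toy weights
x = rng.normal(size=(8, 64)).astype(np.float32)              # toy inputs

def quantize_dequantize(w, bits):
    """Round-trip weights through a symmetric low-precision grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

y_ref = x @ w  # full-precision reference output
for bits in (8, 4):
    y_q = x @ quantize_dequantize(w, bits)
    rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
    print(f"{bits}-bit relative output error: {rel_err:.4f}")
```

The 8-bit output stays close to the reference while the 4-bit error is markedly larger, which is exactly why low-bit schemes like GPTQ and AWQ need careful calibration.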

Quantization matters most for deployment. A 70B-parameter model needs roughly 280 GB of weight memory in 32-bit precision and 140 GB in 16-bit. Quantized to 4-bit, that drops to about 35 GB, which can fit on a single high-end GPU. That's the difference between needing a cluster and running locally. Tools like llama.cpp, GGML, and bitsandbytes have made quantization accessible to developers who aren't optimization experts.
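The memory figures above are simple arithmetic: parameter count times bytes per parameter. This back-of-the-envelope sketch counts weights only, ignoring activations, the KV cache, and runtime overhead:

```python
# Weight memory for a 70B-parameter model at different precisions.
params = 70e9
for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: {gb:.0f} GB")  # FP32: 280 GB ... 4-bit: 35 GB
```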
