
Pruning

An optimization technique that removes unnecessary weights or neurons from a neural network to reduce size and computation.


Technical explanation

Pruning is like trimming a hedge. Neural networks often have more parameters than they actually need. Many weights end up close to zero and don't contribute much to predictions. Pruning identifies and removes these redundant connections, leaving a smaller, faster model that performs almost as well as the original.
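As a minimal sketch of the idea (not code from this article), magnitude-based pruning can be expressed in a few lines of NumPy. The function name, the 50% sparsity level, and the random example weights are all illustrative:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights.

    weights:  array of connection weights
    sparsity: fraction of weights to remove (0.0 to 1.0)
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # Threshold = magnitude of the k-th smallest weight
    threshold = np.partition(flat, k - 1)[k - 1]
    # Keep only weights strictly above the threshold
    return np.where(np.abs(weights) > threshold, weights, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w, 0.5)   # roughly half the entries become zero
```

The pruned matrix has the same shape as the original; the "removed" connections are simply set to zero, which is why this is called a sparse network rather than a smaller one.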

The concept comes from biological neural networks, where synaptic pruning helps the brain become more efficient. In ML, there are two main approaches. Unstructured pruning removes individual weights wherever they're smallest, creating sparse networks. Structured pruning removes entire neurons, channels, or layers, which is easier to accelerate on regular hardware.
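The structured variant can be sketched the same way. Here, assuming a simple fully connected layer, whole output neurons (rows of the weight matrix) are dropped by L2 norm; the function name and the example values are illustrative:

```python
import numpy as np

def prune_neurons(weight, n_remove):
    """Structured pruning: drop the output neurons (rows) whose
    incoming-weight vectors have the smallest L2 norms."""
    norms = np.linalg.norm(weight, axis=1)   # one norm per neuron
    keep = np.argsort(norms)[n_remove:]      # indices of survivors
    keep.sort()                              # preserve original ordering
    return weight[keep]

w = np.array([[0.10, 0.00],    # weak neuron
              [2.00, 1.50],
              [0.05, 0.02],    # weakest neuron
              [1.00, 1.00]])
smaller = prune_neurons(w, 2)  # the matrix physically shrinks to 2 rows
```

Unlike the unstructured case, the result is a genuinely smaller dense matrix, which is why ordinary hardware can run it faster without any sparse-kernel support.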

Getting pruning right requires iteration. You typically prune gradually, removing a small percentage of weights, then fine-tuning the model to recover accuracy, then pruning again. Pruning too aggressively, too fast, destroys performance. The lottery ticket hypothesis suggests that within a large network, there's a smaller subnetwork that could have trained to the same accuracy from the start. Pruning tries to find it after the fact.
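The gradual prune-then-fine-tune loop can be demonstrated on a toy linear model. This is a sketch under simplifying assumptions: NumPy gradient descent stands in for real training, and the schedule (three rounds, 30% of surviving weights per round) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.0, 0.5]       # only 3 of 10 inputs matter
y = X @ true_w

w = rng.normal(size=10) * 0.1
mask = np.ones(10)                   # 1 = connection alive, 0 = pruned

def fine_tune(w, mask, steps=200, lr=0.05):
    """Recover accuracy after pruning; pruned weights stay zero."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w = w - lr * grad
    return w * mask

# Gradual schedule: fine-tune, prune 30% of survivors, repeat
for _ in range(3):
    w = fine_tune(w, mask)
    alive = np.flatnonzero(mask)
    n_prune = max(1, int(0.3 * alive.size))
    drop = alive[np.argsort(np.abs(w[alive]))[:n_prune]]
    mask[drop] = 0.0
w = fine_tune(w, mask)               # final recovery pass
```

After three rounds, only four of the ten weights survive, and the fine-tuning passes keep the model's error close to where it started, illustrating why gradual pruning with recovery outperforms one aggressive cut.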

In practice, pruning is often combined with quantization and distillation for maximum compression. The challenge is that unstructured sparsity doesn't always translate to real-world speedups. Hardware and libraries need to support sparse operations efficiently. That's improving, but structured pruning often delivers more practical benefits today.
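A combined pipeline can be sketched as prune-then-quantize: zero out small weights, then store the survivors as int8 with a single scale factor. The function name, sparsity level, and quantization scheme here are illustrative, not a production recipe:

```python
import numpy as np

def prune_then_quantize(w, sparsity=0.5):
    """Illustrative compression pipeline: magnitude-prune, then
    quantize the surviving weights to int8 with one scale factor."""
    flat = np.abs(w).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k - 1)[k - 1] if k else -np.inf
    w = np.where(np.abs(w) > threshold, w, 0.0)
    # Symmetric quantization: map [-max, max] onto [-127, 127]
    scale = max(np.abs(w).max() / 127, 1e-8)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
q, scale = prune_then_quantize(rng.normal(size=(8, 8)))
approx = q.astype(np.float32) * scale   # dequantized weights
```

Each technique removes a different kind of redundancy: pruning drops connections, quantization shrinks the bits spent on each surviving one, which is why they compose well.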
