
Inference

The process of using a trained model to make predictions or generate outputs on new, previously unseen data.


From Learning to Doing

Training is learning. Inference is applying what was learned. When you send a prompt to ChatGPT and get a response, that's inference. The model isn't learning anything new; it's using patterns it learned during training to generate an output for your specific input.

This distinction matters because training and inference have very different requirements. Training might take weeks on thousands of GPUs. Inference needs to happen in milliseconds on whatever hardware serves your users.

Why Inference Optimization Matters

In production, inference is where the rubber meets the road. A model that takes 10 seconds to respond isn't useful for real-time applications. This is why there's massive investment in making inference faster and cheaper.

Techniques include quantization (using lower-precision numbers), pruning (removing unnecessary weights), distillation (training smaller models to mimic larger ones), and specialized hardware designed specifically for inference workloads.
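As one of those techniques, here is a rough sketch of symmetric int8 weight quantization in NumPy. Real inference stacks quantize per-channel with calibration data; this simplified per-tensor version just shows the core idea of trading precision for memory.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric quantization: map float32 weights onto [-127, 127]."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 values."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
w_restored = dequantize(q, scale)

print(w.nbytes, q.nbytes)            # 4000 vs 1000 bytes: 4x smaller
print(np.abs(w - w_restored).max())  # small rounding error, at most scale/2
```

Each weight shrinks from 4 bytes to 1, at the cost of a bounded rounding error; in practice that error barely moves model accuracy while cutting memory bandwidth, which is often the real inference bottleneck.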

The economics are significant too. Training a large model is a one-time cost, but inference happens every time someone uses the model. If millions of people use it daily, inference costs can dwarf training costs. This is why companies obsess over inference efficiency: shaving milliseconds off response time or reducing compute per request directly impacts the bottom line.
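A back-of-envelope calculation makes the point. All figures below are invented, illustrative numbers, not real pricing for any model or provider.

```python
# Assumed, illustrative numbers -- not real costs.
training_cost = 50_000_000        # one-time training run, USD
cost_per_request = 0.002          # compute cost per inference, USD
daily_requests = 100_000_000      # requests served per day

daily_inference = cost_per_request * daily_requests
days_to_match_training = training_cost / daily_inference

print(daily_inference)            # ~$200,000 per day
print(days_to_match_training)     # ~250 days to equal the training bill
```

Under these assumptions, less than a year of serving already matches the entire training budget, and every day after that inference dominates. Halving compute per request halves that daily bill directly.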
