
Training Data

The dataset used to teach a machine learning model patterns and relationships, forming the foundation of what the model learns.


The Foundation of Learning

Training data is the raw material of machine learning. Whatever patterns, biases, and knowledge exist in your training data will be reflected in your model. Feed it high-quality, diverse examples, and you get a capable model. Feed it garbage, and you get garbage predictions.

This is why data collection and curation matter so much. Companies spend substantial time and money gathering, cleaning, and labeling training data, and the quality of that data often matters more than the cleverness of the algorithm.

Quality Over Quantity (Usually)

More data generally helps, but only if it's relevant and clean. A million mislabeled images will train a confused model. A thousand perfectly labeled, diverse examples might work better. The key is having data that accurately represents the real-world situations where you'll deploy the model.
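One rough way to check whether training data represents deployment conditions is to compare how often each label appears in the two. The sketch below is only an illustration: the function names, the "day"/"night" labels, and the toy counts are all made up for this example, and a real check would usually look at more than label frequencies.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def representation_gaps(train_labels, deploy_labels):
    """Compare label shares in training data against a deployment sample.

    Returns {label: train_share - deploy_share}. A negative value means the
    label is under-represented in training relative to deployment.
    """
    train = label_shares(train_labels)
    deploy = label_shares(deploy_labels)
    all_labels = set(train) | set(deploy)
    return {
        label: train.get(label, 0.0) - deploy.get(label, 0.0)
        for label in all_labels
    }


# Hypothetical toy data: the training set is mostly daytime photos,
# while the deployment sample is half night-time.
train_labels = ["day"] * 90 + ["night"] * 10
deploy_labels = ["day"] * 50 + ["night"] * 50

gaps = representation_gaps(train_labels, deploy_labels)
for label, gap in sorted(gaps.items()):
    print(f"{label}: {gap:+.2f}")
```

In this toy case "night" makes up 10% of the training labels but 50% of the deployment sample, so its gap is negative. That is the kind of mismatch that suggests collecting more of the under-represented examples before adding more data overall.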

Training data also raises ethical questions. What gets included and excluded shapes what the model considers normal. Historical biases in data lead to biased models: if your resume-screening data comes from a biased hiring process, your model learns to reproduce that bias. Copyright is another concern, especially for generative AI trained on creative works.
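As a first look at whether historical labels carry this kind of bias, you can compare how often the positive label appears for each group. The sketch below assumes hypothetical records with a "group" field and a "hired" label; the field names and counts are invented for illustration. A gap between groups does not prove bias on its own, but it flags a pattern the model is likely to learn.

```python
from collections import defaultdict


def positive_rate_by_group(records, group_key, label_key):
    """Return the fraction of positive labels for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[label_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


# Hypothetical historical hiring records. "hired" is the label a
# resume-screening model would be trained to predict.
records = (
    [{"group": "A", "hired": True}] * 30
    + [{"group": "A", "hired": False}] * 70
    + [{"group": "B", "hired": True}] * 10
    + [{"group": "B", "hired": False}] * 90
)

rates = positive_rate_by_group(records, "group", "hired")
for group, rate in sorted(rates.items()):
    print(f"group {group}: hired rate {rate:.2f}")

# Ratio of the lowest to the highest rate; values far below 1.0 mean the
# labels treat the groups very differently. Assumes at least one group
# has a non-zero rate.
ratio = min(rates.values()) / max(rates.values())
print(f"lowest/highest rate ratio: {ratio:.2f}")
```

Here group A is labeled "hired" 30% of the time and group B 10% of the time, so a model trained to imitate these labels would have an incentive to score group B lower regardless of qualifications.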

Understanding your training data (where it came from, what it represents, and what it's missing) is crucial for building AI systems that work well and work fairly.
