Every AI program eventually arrives at the same uncomfortable question: where is the training data going to come from? The model everyone wants to build depends on an answer that almost no one wants to think about upfront. That answer is rarely “just label what we have.” In 2026 it is usually a deliberate mix of three sources: build (synthetic and observational), buy (licensed datasets), and label (human-annotated). Picking the wrong mix is the second-most common reason AI programs slip past their initial budget.
Build — synthesise or observe
Building data covers two related categories: synthetic generation and observational capture.
Synthetic generation uses a generative model — often an LLM, sometimes a diffusion model — to produce training examples. Useful for rare classes, privacy-sensitive domains, multilingual coverage, and adversarial scenarios. Cost is low per sample ($0.02–$0.20). Risk is high if validation is sloppy — synthetic data quietly teaches the model the shape of the generator, not the world.
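One way to keep validation from getting sloppy is to make sure the evaluation set never contains synthetic examples. The sketch below is a minimal illustration under stated assumptions: the function name `split_with_real_anchor`, the `anchor_fraction` parameter and the tuple-shaped examples are all hypothetical, not part of any particular library. It holds out a real-only "anchor" set and mixes synthetic data into training only.

```python
import random


def split_with_real_anchor(real, synthetic, anchor_fraction=0.2, seed=0):
    """Split data so that evaluation happens on real examples only.

    real, synthetic: sequences of examples (any type).
    anchor_fraction: share of the real data held out as the evaluation anchor.
    Returns (train, anchor). Synthetic examples only ever land in train.
    """
    if not real:
        raise ValueError("need at least one real example to anchor validation")

    rng = random.Random(seed)  # local RNG so the split is reproducible
    shuffled = list(real)
    rng.shuffle(shuffled)

    # Hold out at least one real example, never any synthetic ones.
    n_anchor = max(1, int(len(shuffled) * anchor_fraction))
    anchor = shuffled[:n_anchor]

    # Training mixes the remaining real data with all synthetic data.
    train = shuffled[n_anchor:] + list(synthetic)
    rng.shuffle(train)
    return train, anchor
```

If a model trained on the mixed set scores well on synthetic checks but poorly on the real anchor, that gap is a sign it has learned the generator rather than the world.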
Observational capture is collecting real-world events into your data pipeline. The most undervalued source. If your product already runs in production, instrument it to record the data your model wants. Sensible logging of user actions, decisions, and outcomes builds a labelled dataset over months — effectively for free, with the meaningful property that the distribution exactly matches production.
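As a rough illustration of what such instrumentation might look like, the sketch below appends each production event to a JSON Lines file. The record shape (`features`, `decision`, `outcome`) and the function names `log_event` and `load_labelled` are assumptions made for this example; a real pipeline would more likely write to an event stream or warehouse and would need consent and retention handling.

```python
import json
import time
from pathlib import Path
from typing import Any, Optional


def log_event(path: Path, features: dict[str, Any], decision: str,
              outcome: Optional[str] = None) -> dict[str, Any]:
    """Append one production event to a JSON Lines file.

    The outcome is often unknown at decision time. Once observed,
    it acts as the label for this example.
    """
    record = {
        "ts": time.time(),       # when the event was recorded
        "features": features,    # what the system saw
        "decision": decision,    # what the system or user did
        "outcome": outcome,      # what happened afterwards, if known
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_labelled(path: Path) -> list[dict[str, Any]]:
    """Return only the events whose outcome has been observed."""
    with path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r for r in records if r["outcome"] is not None]
```

Events logged without an outcome are kept in the file but excluded from the labelled set until the outcome is known.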
Build when: rare classes need amplification, privacy constraints prevent sharing real data, or you can instrument upstream to capture events. Avoid when: the domain is high-stakes and synthetic patterns might diverge undetected from reality.
Buy — license existing datasets
Licensed datasets remain underused. Buyers default to labelling because it feels safer, then spend months recreating data that was available for purchase at a fraction of the cost.
Major sources by category:
- Web-scale corpora: Common Crawl derivatives, RedPajama, FineWeb, RefinedWeb — for general pretraining and broad understanding.
- Specialist domain corpora: medical literature (PubMed), legal (case law databases), financial (filings, news), scientific (arXiv, PMC).
- Commercial labelled datasets: Scale Data Engine, Surge AI datasets, Sama, Snorkel — pre-labelled commercial assets.
- Image and video libraries: Shutterstock for legally-clear training imagery, broadcast archive licences, vertical-specific image banks.
Buy when: the data is non-differentiating to your competitive position, the dataset is well-known and stable, and licensing is cheaper than labelling at scale. Avoid when: the dataset is your competitive moat (your customer data, your operational data), or licensing terms constrain downstream use.
Label — human annotation
Labelling remains the largest category by spend. Two clear archetypes:
Generalist labelling via scaled platforms (Scale, Surge, Sama, Labelbox) for mainstream tasks where competent annotators without domain credentials produce good results: image classification, bounding boxes, basic NER, general preference data. Unit prices are predictable and have been falling year-on-year.
Expert labelling via specialist firms and expert networks for domains where generalist labels are not defensible: medical imaging, legal contracts, code, scientific text. 5–20× the unit cost of generalist, but the only way to produce training data that passes audit or regulatory review.
Label when: you need ground truth on your proprietary data, no commercial dataset matches your task, or domain expertise is required. Avoid as default — many labelling projects could have been buy decisions.
The combination that almost always wins
After ten or more training-data engagements, the pattern that consistently beats single-source strategies is layered:
- Buy a broad pretraining or domain-foundation corpus to establish baseline capability.
- Label a focused, expert-reviewed dataset of 5,000–50,000 examples that represent your actual production distribution and edge cases.
- Build through observational capture from your production system — the dataset that grows continuously after deployment and that your competitors literally cannot replicate.
Synthetic generation supplements all three for specific gaps: rare classes, adversarial robustness, privacy-protected variants.
The decision filter
For every dataset need, work through these five questions in order:
- Does a commercial dataset exist that fits? If yes, buy is usually the right answer.
- Can we capture this from production events with reasonable instrumentation? If yes, build (observational) is the right answer.
- Is the domain sufficiently specialised that only experts can label it credibly? If yes, expert labelling — budget accordingly.
- Is the dataset mainstream and the unit price acceptable? If yes, generalist labelling at scale.
- Are we missing rare classes or adversarial examples? If yes, synthetic generation — with rigorous validation against real-data anchors.
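The five questions can be read as an ordered checklist. The sketch below encodes them that way, assuming the first "yes" ends the walk; the class `DatasetNeed`, its field names and the returned strings are hypothetical labels chosen for this example. In practice most programs will layer several sources rather than stop at one.

```python
from dataclasses import dataclass


@dataclass
class DatasetNeed:
    """Answers to the five questions for one dataset need."""
    commercial_dataset_fits: bool
    capturable_from_production: bool
    requires_domain_experts: bool
    mainstream_at_acceptable_price: bool
    missing_rare_or_adversarial: bool


def recommend_source(need: DatasetNeed) -> str:
    """Walk the questions in order and return the first matching source."""
    if need.commercial_dataset_fits:
        return "buy"
    if need.capturable_from_production:
        return "build (observational)"
    if need.requires_domain_experts:
        return "expert labelling"
    if need.mainstream_at_acceptable_price:
        return "generalist labelling"
    if need.missing_rare_or_adversarial:
        return "synthetic generation (validate against real data)"
    return "unresolved"  # none of the five questions applied


# Example: no commercial dataset, no production signal, experts required.
need = DatasetNeed(False, False, True, True, False)
print(recommend_source(need))
```

Because the checks run in order, the example prints "expert labelling" even though generalist labelling would also be affordable: the expert question is asked first.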
Cost mental model
For a 100,000-example training set in 2026, rough totals:
| Approach | Total cost band | Timeline |
|---|---|---|
| Buy a commercial dataset | $5,000–$80,000 | 1–4 weeks |
| Generalist labelling | $10,000–$60,000 | 4–8 weeks |
| Expert labelling (regulated domain) | $80,000–$400,000 | 8–16 weeks |
| Observational capture (existing production) | Engineering time only | 3–9 months to accumulate |
| Synthetic generation + validation | $5,000–$30,000 | 2–6 weeks |
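To compare these bands with per-sample quotes, it helps to divide each total by the 100,000 examples. The short sketch below does only that arithmetic, using the figures from the table; the names `COST_BANDS_USD` and `per_example` are illustrative. Observational capture is left out because its cost is engineering time, not a dollar band.

```python
EXAMPLES = 100_000

# Total cost bands in USD, copied from the table above.
COST_BANDS_USD = {
    "Buy a commercial dataset": (5_000, 80_000),
    "Generalist labelling": (10_000, 60_000),
    "Expert labelling (regulated domain)": (80_000, 400_000),
    "Synthetic generation + validation": (5_000, 30_000),
}


def per_example(band, n=EXAMPLES):
    """Convert a (low, high) total cost band into a per-example band."""
    low, high = band
    return low / n, high / n


for approach, band in COST_BANDS_USD.items():
    low, high = per_example(band)
    print(f"{approach}: ${low:.2f} to ${high:.2f} per example")
```

This works out to roughly $0.05 to $0.80 per example for buying, $0.10 to $0.60 for generalist labelling, $0.80 to $4.00 for expert labelling, and $0.05 to $0.30 for synthetic generation with validation.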
The mistake most teams make
Defaulting to “we will label what we have.” The reasons it feels safe (control, IP retention, perceived quality) are sometimes real and sometimes excuses. Paying 10× the licence price to label data that is already available commercially is a budget waste your finance team will eventually catch. Skipping observational capture because “we will set it up next quarter” means twelve months later you still do not have it.
Pick deliberately, layer the sources, validate ruthlessly. Training data is not a one-time spend — it is the recurring foundation your model improves on for years. Treat it like the strategic asset it is.