What does AI training data actually cost in 2026?

Five years ago, “training data” meant a stack of CSVs and a Mechanical Turk budget. Today it is a multi-vendor market with specialist firms for image, video, RLHF, expert review, synthetic generation, and red-team adversarial sets. Buyers routinely under-budget — not because the unit prices are surprising, but because they miss the second-order costs that dominate the total.

Below are the rates we see in 2026, the hidden costs that always show up, and the cost-quality trade-offs that decide whether the data is actually usable.

Unit prices by data type

These are typical ranges quoted by labelling vendors in 2026, in USD per unit. Discounts of 20–40% apply at scale (100k+ items).

Data type	Task	Typical price (USD per unit)
Image	Bounding box (single class)	$0.04–$0.12
Image	Polygon segmentation	$0.30–$0.80
Image	Instance segmentation, complex scene	$1.50–$4.00
Video	Frame-by-frame bounding box (per sec)	$0.20–$0.60
3D / LiDAR	Cuboid annotation, point cloud	$1.20–$3.50
Text	Single-label classification	$0.05–$0.15
Text	Named entity recognition (NER)	$0.15–$0.40
Text	Multi-step reasoning judgment	$0.80–$3.00
Audio	Transcription, per minute	$0.80–$2.50
Audio	Speaker diarisation, per minute	$2.00–$5.00
RLHF	Preference pair (response A vs B)	$0.40–$2.00
RLHF	Expert preference (medical, legal, code)	$5.00–$20.00
Synthetic data	Per high-quality sample (generated + filtered)	$0.02–$0.20
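
To turn the table into a first-pass labelling line, multiply the item count by the price range and apply the volume discount. The sketch below is illustrative only: the function name and parameters are hypothetical, and it assumes the 20–40% discount applies to the quoted range once a project reaches 100k items.

```python
def labelling_cost_range(items, price_low, price_high,
                         discount_low=0.20, discount_high=0.40,
                         scale_threshold=100_000):
    """Return (best_case, worst_case) labelling spend in USD.

    Assumption: the volume discount applies to the whole order once
    the item count reaches scale_threshold.
    """
    base_low = items * price_low
    base_high = items * price_high
    if items >= scale_threshold:
        # Best case: cheapest unit price with the deepest discount.
        # Worst case: highest unit price with the shallowest discount.
        return base_low * (1 - discount_high), base_high * (1 - discount_low)
    return base_low, base_high


# 250,000 polygon segmentations at $0.30-$0.80 per item
low, high = labelling_cost_range(250_000, 0.30, 0.80)
print(f"${low:,.0f} to ${high:,.0f}")  # $45,000 to $160,000
```

This covers the unit-price line only, not the second-order costs discussed in the rest of the article.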

What the unit prices do not show you

The unit price is a fraction of the total cost. Here is where the rest goes:

Schema design and pilot iteration (10–20% of budget)

Every labelling project that succeeds starts with a small pilot — a few hundred items labelled by 2–3 annotators, reviewed, refined, re-labelled. The schema usually changes twice before it stabilises. This work happens before the “real” budget kicks in and is the most common reason projects run over.

QA and adjudication (15–25% of budget)

Single-pass labels are not usable for training. Best-practice flows use 2–3 independent labellers per item, automated agreement scoring, adjudication by a senior reviewer for disagreements, and a recurring sample review by your own SME. Budget at least 1.5× the headline labelling cost for a usable dataset.
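
The agreement-and-adjudication step can be sketched in a few lines. This is a minimal illustration, not a vendor's actual pipeline: `route_items` and the `min_agreement` threshold are hypothetical names, and the default assumes an item is accepted only when every labeller agrees.

```python
from collections import Counter


def route_items(labels_by_item, min_agreement=1.0):
    """Split items into accepted labels and items needing adjudication.

    labels_by_item maps an item id to the labels from its independent
    labellers. Agreement is the share of labellers who chose the most
    common label.
    """
    accepted = {}
    needs_adjudication = []
    for item_id, labels in labels_by_item.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        if agreement >= min_agreement:
            accepted[item_id] = top_label
        else:
            # Disagreement: send to a senior reviewer.
            needs_adjudication.append(item_id)
    return accepted, needs_adjudication


batch = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "dog"],
}
accepted, flagged = route_items(batch)
print(accepted)  # {'img_001': 'cat', 'img_003': 'dog'}
print(flagged)   # ['img_002']
```

Lowering `min_agreement` trades adjudication cost against label noise, so the threshold is a budgeting decision as much as a quality one.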

Subject-matter expert review (often the largest single line)

For regulated or specialist domains — medical, legal, finance, code, scientific — generalist annotators cannot produce defensible labels. Expert hourly rates in 2026 sit at $80–$300, depending on jurisdiction and specialty. A 50,000-item medical annotation project with appropriate expert oversight will spend more on SME time than on the labels themselves.
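
A rough worked example shows how SME time can overtake the labels. Every input below is an assumption chosen for illustration (a $1.00 label price, experts reviewing 20% of items at 3 minutes each, a $150 hourly rate inside the $80–$300 band); only the 50,000-item project size comes from the paragraph above.

```python
def sme_review_cost(items, review_fraction, minutes_per_item, hourly_rate):
    """Cost in USD of expert review over a fraction of the dataset."""
    hours = items * review_fraction * minutes_per_item / 60
    return hours * hourly_rate


ITEMS = 50_000
LABEL_PRICE = 1.00  # hypothetical per-item label price

label_cost = ITEMS * LABEL_PRICE
sme_cost = sme_review_cost(ITEMS, review_fraction=0.20,
                           minutes_per_item=3, hourly_rate=150)

print(f"Labels: ${label_cost:,.0f}")  # Labels: $50,000
print(f"SME:    ${sme_cost:,.0f}")    # SME:    $75,000
```

Under these assumptions, 500 expert hours cost more than all 50,000 labels. Changing the review fraction or minutes per item moves the result a lot, so those two figures are worth estimating in the pilot.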

Data security, sovereignty, and on-shore handling

If your data contains PII, PHI, or other regulated content, vendor selection narrows sharply. SOC 2, ISO 27001, on-shore data residency, and background-checked workforces add 30–80% to base rates. Tools that allow annotation without raw data leaving your environment (federated annotation, screen-share-only flows) carry similar premiums.

Edge-case enrichment

The first 80% of any dataset is cheap. The long tail of edge cases — rare classes, ambiguous boundaries, adversarial examples — is where models actually fail in production. Targeted edge-case labelling typically costs 5–10× the headline unit price because it requires active learning loops, synthetic generation, or human-curated query construction.
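
The effect of the 5–10× multiplier on a whole budget is easy to underestimate. The sketch below assumes, for illustration only, that "the first 80%" means 80% of items are labelled at the headline price and the remaining 20% are edge cases; the function name and the $0.10 price are hypothetical.

```python
def blended_cost(items, headline_price, tail_share, tail_multiplier):
    """Return (bulk_cost, tail_cost) in USD for a dataset with a long tail."""
    tail_items = items * tail_share
    bulk_items = items - tail_items
    bulk_cost = bulk_items * headline_price
    tail_cost = tail_items * headline_price * tail_multiplier
    return bulk_cost, tail_cost


for multiplier in (5, 10):
    bulk, tail = blended_cost(100_000, 0.10, tail_share=0.20,
                              tail_multiplier=multiplier)
    print(f"{multiplier}x: bulk ${bulk:,.0f}, tail ${tail:,.0f}")
# 5x: bulk $8,000, tail $10,000
# 10x: bulk $8,000, tail $20,000
```

Even at the low end of the multiplier, the 20% tail costs more than the 80% bulk in this example.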

RLHF and preference data: a separate market

Preference data for RLHF, DPO, or instruction tuning is its own economy. Pricing is dominated by who is doing the labelling.

Generalist preference labelling (“which of these two responses is better?”) at $0.40–$2.00 per pair works for general assistant tuning. The moment the domain narrows — medical reasoning, legal accuracy, code correctness, safety-critical outputs — the cost climbs fast because the labellers need domain credentials and meaningful capacity to evaluate the responses.

Three takeaways for budgeting RLHF:

Plan for at least 5,000–20,000 preference pairs for a meaningful tune. Below 1,000, the signal is too noisy.
For domain-specific preference data, expert costs dominate; expect $40,000–$200,000 for a serious tune.
Reuse the same labellers across rounds — annotator consistency matters more than the absolute label.
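
A quick sanity check is to ask how many expert pairs a given budget buys and whether that clears the 5,000-pair floor. The sketch below uses the figures quoted in this article ($5–$20 per expert pair, $40,000–$200,000 budgets); the function name is hypothetical and the calculation ignores QA and management overhead.

```python
MIN_PAIRS = 5_000  # lower bound of the 5,000-20,000 range above


def affordable_pairs(budget_usd, price_per_pair):
    """Number of whole preference pairs a budget buys at a given price."""
    return int(budget_usd // price_per_pair)


for budget in (40_000, 200_000):
    for price in (5, 20):
        pairs = affordable_pairs(budget, price)
        status = "meets" if pairs >= MIN_PAIRS else "falls short of"
        print(f"${budget:,} at ${price}/pair: {pairs:,} pairs, "
              f"{status} the {MIN_PAIRS:,}-pair floor")
```

On these numbers, a $40,000 budget reaches the floor only at the cheaper end of expert pricing, which is worth knowing before committing to a specialty.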

Synthetic data: cheap to generate, expensive to validate

Synthetic data generation has become a mainstream tactic — particularly for rare classes, privacy-sensitive domains, and adversarial scenarios. Per sample, synthetic data is cheap: $0.02–$0.20 for high-quality generated instances after filtering.

The cost shifts to validation. A synthetic dataset that has not been carefully validated against a real-world holdout will silently teach your model the shape of the generator, not the shape of the real distribution. Plan to reinvest roughly half of what you save on generation in validation infrastructure: real-data anchors, distribution-shift tests, and downstream performance comparisons.
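
One simple form of distribution-shift test compares a single numeric feature between the real holdout and the synthetic set. The sketch below computes the two-sample Kolmogorov-Smirnov statistic in plain Python; it is one possible check among many, and the feature (response length in tokens) and sample values are made up for illustration.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the two empirical CDFs:
    0.0 means identical samples, 1.0 means no overlap at all.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


# Hypothetical feature: response length in tokens
real_lengths = [12, 15, 18, 22, 25, 31, 40, 58]
synthetic_lengths = [20, 21, 22, 22, 23, 24, 24, 25]
print(f"KS statistic: {ks_statistic(real_lengths, synthetic_lengths):.3f}")
```

A large value on a feature that matters is a prompt to inspect the generator, not a verdict on its own; downstream performance on real data remains the deciding test.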

Vendor selection in 2026

The market has consolidated around three vendor archetypes:

Scaled platforms (Scale, Surge, Sama, Labelbox). Best for high-volume, mainstream tasks (image, text, RLHF). Mature tools, predictable quality, predictable price. The wrong choice for highly regulated or boutique-domain work.
Expert-network firms (newer entrants, including Mercor-style marketplaces). Direct access to credentialed SMEs for medical, legal, code, and scientific labelling. More expensive per hour but radically better quality. The right choice when you need defensibility.
In-house labelling teams. Increasingly common for long-running programs with stable schemas and IP-sensitive data. Higher upfront cost (tooling, hiring, management) but the unit cost falls below vendor rates within 12–18 months at sufficient volume.
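
The in-house break-even claim can be checked with a simple cumulative-cost model. All figures below are hypothetical placeholders (upfront tooling and hiring, monthly team cost, volume, marginal cost per item, vendor price), chosen only to show the calculation; substitute your own.

```python
def breakeven_month(upfront, monthly_fixed, monthly_volume,
                    inhouse_unit_cost, vendor_unit_price, horizon=60):
    """First month in which cumulative in-house cost per item drops
    below the vendor price, or None if it never does within horizon."""
    monthly_cost = monthly_fixed + inhouse_unit_cost * monthly_volume
    for month in range(1, horizon + 1):
        total_cost = upfront + monthly_cost * month
        total_items = monthly_volume * month
        if total_cost / total_items < vendor_unit_price:
            return month
    return None


month = breakeven_month(
    upfront=290_000,          # tooling, hiring (hypothetical)
    monthly_fixed=40_000,     # team and management (hypothetical)
    monthly_volume=200_000,   # items labelled per month (hypothetical)
    inhouse_unit_cost=0.05,   # marginal cost per item (hypothetical)
    vendor_unit_price=0.35,   # comparable vendor rate (hypothetical)
)
print(month)  # 15
```

With these inputs the break-even falls at month 15. Halving the volume pushes the fixed cost per item above the vendor price permanently, which is why the claim holds only at sufficient volume.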

The honest total-cost framework

For any serious training-data engagement in 2026, budget against this framework:

Labelling unit cost: 35–50% of total
QA, adjudication, and SME review: 25–35%
Tooling, integration, security: 10–15%
Schema design, pilot, and rework: 10–15%
Active-learning loops and edge-case targeting: 5–10%
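
The framework can be applied in either direction: from a labelling quote to an implied total, or from a total to per-line ranges. The sketch below encodes the shares listed above; the function names and the $100,000 quote are illustrative. Note that the low ends of the ranges sum to 85% and the high ends to 125%, so a real plan has to pick one point in each range such that the lines total 100%.

```python
FRAMEWORK = {
    "Labelling unit cost": (0.35, 0.50),
    "QA, adjudication, and SME review": (0.25, 0.35),
    "Tooling, integration, security": (0.10, 0.15),
    "Schema design, pilot, and rework": (0.10, 0.15),
    "Active-learning loops and edge-case targeting": (0.05, 0.10),
}


def implied_total(labelling_cost):
    """Total budget range implied by a labelling quote."""
    low_share, high_share = FRAMEWORK["Labelling unit cost"]
    # A larger labelling share implies a smaller total, and vice versa.
    return labelling_cost / high_share, labelling_cost / low_share


def line_item_ranges(total_budget):
    """USD range for each line item at a given total budget."""
    return {name: (total_budget * low, total_budget * high)
            for name, (low, high) in FRAMEWORK.items()}


total_low, total_high = implied_total(100_000)
print(f"Implied total: ${total_low:,.0f} to ${total_high:,.0f}")

for name, (low, high) in line_item_ranges(250_000).items():
    print(f"{name}: ${low:,.0f} to ${high:,.0f}")
```

A $100,000 labelling quote therefore implies a total of roughly $200,000 to $286,000 under this framework.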

A budget that allocates only the first line item — “labelling cost” — and treats the rest as overhead is the budget most likely to blow up. Plan properly, and labelling becomes the most predictable line item in an AI program. Plan poorly, and it becomes the most expensive.
