The CTO of a 47-person foundation-model startup on Main Street in Kendall Square (Cambridge, MA, two blocks from the MIT Stata Center) needed 14,000 ranked-output pairs for a Llama 3.1 70B preference fine-tune by July 14. Surge AI quoted $1.85/pair ($25,900). Scale Spellbook came back at $1.60/pair ($22,400). MTurk landed at $0.74/pair, but the CTO’s pilot flagged 22% inter-annotator disagreement on chain-of-thought reasoning. AB7’s Mohali Phase 8B HITL pod placed 6 trained reviewers (4 with ML/CS Master’s degrees, 2 with peer-reviewed NLP publications) and delivered all 14,000 pairs in 19 calendar days at $0.42/pair ($5,880), 74% below Scale Spellbook.

RLHF preference data isn’t a scraping problem. It’s a calibration problem.

A Kendall Square AI startup running a Llama 3.1 70B preference fine-tune in Q2 2026 needs roughly 8,000-20,000 ranked-output pairs to lift the reward model past the noise floor. That’s the median range published in the DPO and KTO papers and confirmed in this CTO’s two prior rounds.

The 47-person Cambridge startup — anonymized, Series A, $14.2M raised, MIT-spinout founders — had run two prior dataset rounds. Round 1 on MTurk: 4,200 pairs, 22% disagreement rate on multi-step reasoning, fine-tune showed reward-hacking inside 1,800 steps. Round 2 on Scale Spellbook: 6,800 pairs at $1.60/pair = $10,880; quality was solid but the turnaround was 31 days and the rubric was locked after week 1, which blocked two mid-cycle clarifications.

The CTO’s brief for round 3 was specific: 14,000 pairs covering 7 prompt clusters (math word problems, code-completion judgement, factual QA, multi-step reasoning, safety-refusal preference, tone calibration, instruction-following length-bias), with chain-of-thought scoring on the reasoning bucket, and a live Slack channel for rubric clarifications between reviewers and the startup’s ML lead. Budget ceiling: $8,000. Deadline: July 14, 2026, hard.

What AB7 deployed: 6 reviewers, named rubric tools, daily calibration

AB7’s Mohali Phase 8B HITL pod started the engagement on 2026-04-21. Pod composition:

4 ML/CS Master’s-degree reviewers — IIT Roorkee, IIIT Hyderabad, IIT Delhi, BITS Pilani. Average 3.2 years on annotation projects. All 4 had prior RLHF preference-ranking work on at least one Hugging Face open dataset.
2 senior reviewers with peer-reviewed NLP publications: one ACL 2024, one EMNLP 2023. Both served as lead calibrators and owned the Slack-facing rubric clarifications.
1 dedicated project manager on a Cambridge ET-overlap shift (19:30-04:30 IST), with a 12-minute Slack SLA inside the overlap window.

Named tooling stack the pod ran on:

LangSmith — every reviewer’s session traced; rubric drift visible per-reviewer per-day on the LangSmith eval dashboard
Phoenix (Arize AI) — open-source observability layer; the CTO’s ML lead pulled inter-reviewer agreement deltas every 48 hours
Braintrust — rubric versioning, golden-set scoring, A/B rubric experiments before locking each prompt cluster
Helicone — Llama 3.1 70B inference logging on the model outputs being ranked (so reviewers saw token-level cost + latency context)
Label Studio Enterprise — preference-ranking UI customised for ABBA-style pairwise judgements + 5-point Likert on chain-of-thought reasoning
Notion + GitHub — rubric repo with versioned diffs, every clarification a PR not a Slack reply

Engagement model: fixed-scope project, $5,880 all-in for 14,000 ranked pairs at $0.42/pair, billed 50% on rubric lock + 50% on delivery. Daily 15-minute Mohali ↔ Kendall Square sync at 09:30 ET / 19:00 IST. No setup fee. Free re-annotation of any pair the CTO’s golden-set check flagged as low-confidence inside 72 hours of delivery.

Week-by-week: 14,000 pairs in 19 calendar days

Week	Action	Pairs ranked	Inter-reviewer agreement	Notes
1 (Apr 21-27)	Rubric calibration on 400-pair golden set across all 7 clusters; 3 rubric clarifications shipped as Notion diffs	1,100	71%	22% MTurk-baseline gap closed via daily calibration
2 (Apr 28-May 4)	Production ranking on math-word-problems + code-completion clusters; Phoenix flagged 2 reviewers drifting on length-bias rubric	4,300	83%	Both flagged reviewers re-calibrated in a 90-min Mohali-led session
3 (May 5-9)	Multi-step reasoning + safety-refusal clusters; chain-of-thought 5-point Likert added per the CTO’s mid-cycle request	4,800	86%	Rubric clarification PR #14 shipped Tuesday; live by Wednesday
19th day (May 9)	Final delivery: 14,000 pairs in Label Studio export JSON + Hugging Face Datasets-compatible Parquet	14,000 total	89% overall	Inter-reviewer agreement above the 85% threshold the CTO’s reward model wanted
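
The article does not say how the inter-reviewer agreement percentages in the table are computed. As a minimal sketch, assuming the figure is raw pairwise percent agreement (the share of pairs on which two reviewers pick the same output, averaged over all reviewer pairs), it could be calculated like this. The function name, reviewer ids and toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(labels_by_reviewer):
    """Mean fraction of items on which two reviewers chose the same output.

    labels_by_reviewer maps a reviewer id to a list of choices ("A" or "B"),
    one per ranked pair, in the same item order for every reviewer.
    """
    scores = []
    for r1, r2 in combinations(sorted(labels_by_reviewer), 2):
        a, b = labels_by_reviewer[r1], labels_by_reviewer[r2]
        if len(a) != len(b):
            raise ValueError("reviewers must label the same items")
        matches = sum(x == y for x, y in zip(a, b))
        scores.append(matches / len(a))
    return sum(scores) / len(scores)


# Hypothetical toy data: three reviewers, five ranked pairs.
toy = {
    "rev_1": ["A", "B", "A", "A", "B"],
    "rev_2": ["A", "B", "A", "B", "B"],
    "rev_3": ["A", "B", "B", "A", "B"],
}
agreement = pairwise_agreement(toy)
print(f"{agreement:.0%}")
```

Raw agreement does not correct for chance, so a chance-adjusted statistic such as Cohen's or Fleiss' kappa may be a better fit when one option is chosen far more often than the other.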

The startup’s ML lead ran the fine-tune over May 10-13 on 8× H100s in Lambda Labs Boston. Reward-model loss on the held-out 800-pair eval set hit the target threshold on the first run, with no second round and no re-annotation. Total RLHF-data cost stack:

Line item	MTurk (Round 1)	Scale Spellbook (Round 2)	AB7 Mohali (Round 3)
Cost per ranked pair	$0.74	$1.60	$0.42
Total cost (14K pairs)	$10,360	$22,400	$5,880
Calendar days to deliver	14	31	19
Inter-reviewer agreement	78%	87%	89%
Mid-cycle rubric clarifications allowed	0	Locked after week 1	14 PRs shipped during the run
Reward-model re-runs required	3	1	0

Why $0.42 per ranked pair is the RLHF-vs-MTurk delta

RLHF preference-ranking labour cost is sticky for one reason: rubric calibration is a Master’s-level judgement task, not a click-farm task. MTurk hits $0.74/pair because Mechanical Turk pricing optimises around minute-level micro-tasks done by general-public workers: fine for “is this image a cat”, broken for “which chain-of-thought reasoning step better preserves the user’s intent”. Scale Spellbook lands at $1.60/pair because Scale runs US-based PhD-tier reviewers in San Francisco at $48-$72/hour fully loaded plus 18% platform margin. AB7’s Mohali Phase 8B pod runs IIT/IIIT Master’s-tier reviewers, 4 of them with prior open-source RLHF dataset work, at the Mohali labour cost of $9-$14/hour fully loaded plus the AB7 PM and tooling overhead. That’s the $0.42/pair the Kendall Square CTO sees: same rubric discipline, same Braintrust + LangSmith + Phoenix observability, 19-day delivery vs Scale’s 31, mid-cycle rubric PRs allowed, and roughly 3.8× cost compression versus Scale ($1.60 vs $0.42 per pair) on the data side of the fine-tune.
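
The per-pair prices quoted in this article can be turned into totals and relative savings with a few lines of arithmetic. The sketch below uses only the prices and the 14,000-pair volume stated above; the variable and function names are illustrative.

```python
PAIRS = 14_000

# Per-pair prices as quoted in the article (USD).
price_per_pair = {"MTurk": 0.74, "Scale Spellbook": 1.60, "AB7 Mohali": 0.42}

# Total cost of the full 14,000-pair job at each price.
totals = {vendor: round(price * PAIRS, 2) for vendor, price in price_per_pair.items()}


def savings_vs(baseline, vendor="AB7 Mohali"):
    """Fractional saving of `vendor` relative to `baseline`, per pair."""
    return 1 - price_per_pair[vendor] / price_per_pair[baseline]


def cost_ratio(baseline, vendor="AB7 Mohali"):
    """How many times more expensive `baseline` is than `vendor`."""
    return price_per_pair[baseline] / price_per_pair[vendor]


for vendor, total in totals.items():
    print(f"{vendor}: ${total:,.0f}")
print(f"Saving vs Scale Spellbook: {savings_vs('Scale Spellbook'):.2%}")
print(f"Saving vs MTurk: {savings_vs('MTurk'):.2%}")
print(f"Scale Spellbook / AB7 price ratio: {cost_ratio('Scale Spellbook'):.2f}x")
```

On these prices the saving against Scale Spellbook works out to 73.75%, and the price ratio to about 3.8×.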

What AB7 will and won’t take on for RLHF / preference-data work

Will take on: preference ranking (DPO, KTO, ORPO formats), pairwise comparison, 5-point Likert calibration, chain-of-thought stepwise scoring, safety-refusal preference data, instruction-tuning correctness review, agent-trajectory grading on LangSmith / Phoenix / Braintrust / Helicone logs, golden-set creation, rubric versioning in GitHub. Output formats: Hugging Face Datasets, Label Studio JSON, Parquet, JSONL.

Won’t take on: raw prompt synthesis at scale (use a prompt-engineering pod, separate engagement), red-team adversarial generation (specialist work — refer to a Mohali AI safety pod), or any client who won’t share the model outputs being ranked (AB7 won’t blind-rank without access to model traces — that’s how MTurk-quality calibration drift happens).
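
To make the preference-ranking output formats concrete, here is a minimal sketch of how one pairwise judgement might be written as a JSONL record. It assumes the prompt/chosen/rejected field layout commonly used for DPO-style training data; the article does not specify AB7's delivered schema, and the helper names, sample judgements and file name are hypothetical.

```python
import json


def to_preference_record(prompt, output_a, output_b, preferred):
    """Map one pairwise judgement to a prompt/chosen/rejected record."""
    if preferred not in ("A", "B"):
        raise ValueError("preferred must be 'A' or 'B'")
    if preferred == "A":
        chosen, rejected = output_a, output_b
    else:
        chosen, rejected = output_b, output_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def write_jsonl(records, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical judgements: (prompt, output A, output B, reviewer's pick).
judgements = [
    ("What is 7 * 8?", "56", "54", "A"),
    ("Capital of France?", "Lyon", "Paris", "B"),
]
records = [to_preference_record(*j) for j in judgements]
write_jsonl(records, "preference_pairs.jsonl")
```

A file in this shape can typically be read into Hugging Face Datasets with `load_dataset("json", data_files="preference_pairs.jsonl")`.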

Five questions a Kendall Square CTO actually asks before signing

1. How does AB7 prevent the MTurk-style 22% disagreement gap? Week-1 calibration on a 400-pair golden set, Phoenix agreement-delta tracking per reviewer per day, mandatory re-calibration session for any reviewer who drifts > 8 points below the pod median.

2. Can the rubric change mid-engagement? Yes. Rubric lives in a GitHub repo. Every clarification ships as a PR with a review window — the Round 3 engagement shipped 14 mid-cycle rubric PRs over 19 days.

3. What happens if the AB7 reviewer pod is too slow? Daily Phoenix throughput dashboard sent to the CTO at 17:00 ET. If a 5-day rolling average drops below the contracted pairs/day rate, AB7 adds a 7th reviewer at no cost.

4. Will reviewers see the raw model outputs or only the ranking interface? Both. Helicone-logged Llama 3.1 70B traces are visible alongside the pair, so reviewers see token-level reasoning, not just the final answer. That’s the reason inter-reviewer agreement runs 6-8 points above MTurk benchmarks on chain-of-thought clusters.

5. Can AB7 deliver in Hugging Face Datasets format directly? Yes. Final delivery includes Parquet + JSONL + Label Studio JSON, plus a HF Datasets loader script. The Round 3 startup pushed straight to a private HF dataset on the evening of May 9 ET.
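
The two monitoring rules in answers 1 and 3 (re-calibrate a reviewer who falls more than 8 points below the pod median, and add a reviewer when the 5-day rolling throughput drops below the contracted rate) can be sketched as plain functions. This is an illustration under stated assumptions, not AB7's actual tooling: the function names and sample numbers are hypothetical, and the contracted pairs/day rate is not given in the article, so 14,000 pairs over 19 days is used as a stand-in.

```python
from statistics import median

DRIFT_POINTS = 8   # "> 8 points below the pod median" (answer 1)
ROLLING_DAYS = 5   # "5-day rolling average" (answer 3)


def reviewers_to_recalibrate(agreement_pct):
    """Reviewer ids whose agreement is more than DRIFT_POINTS below the pod median."""
    pod_median = median(agreement_pct.values())
    return sorted(
        reviewer
        for reviewer, pct in agreement_pct.items()
        if pod_median - pct > DRIFT_POINTS
    )


def below_contracted_rate(daily_pairs, contracted_per_day):
    """True when the mean of the last ROLLING_DAYS days is under the contracted rate."""
    if len(daily_pairs) < ROLLING_DAYS:
        return False  # not enough history for a full window yet
    recent = daily_pairs[-ROLLING_DAYS:]
    return sum(recent) / ROLLING_DAYS < contracted_per_day


# Hypothetical daily agreement (%) per reviewer; the pod median here is 86.5.
today = {"rev_1": 88, "rev_2": 86, "rev_3": 90, "rev_4": 77, "rev_5": 87, "rev_6": 79}
flagged = reviewers_to_recalibrate(today)

# Assumed contracted rate: 14,000 pairs spread evenly over 19 days.
contracted = 14_000 / 19
history = [760, 750, 700, 690, 720, 710]  # hypothetical pairs ranked per day
add_reviewer = below_contracted_rate(history, contracted)

print(flagged, add_reviewer)
```

Using the median rather than the mean keeps a single badly drifting reviewer from dragging down the baseline the others are compared against.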

What to do next

Book a 30-minute scoping call: calendly.com/ashok-benial/meeting. Bring the rubric draft (or the prior MTurk/Scale rubric you want adapted), the prompt-cluster definitions, the target pair count, and the model traces or sampled outputs you’ll be ranking. AB7 returns a written calibration-and-delivery plan inside 3 business days with the per-pair quote and a delivery-date commitment. If the math doesn’t beat the in-house ML-team-doing-it-themselves option in year 1, AB7 says so.

Boston Kendall Square AI startup: 14,000 ranked RLHF outputs from AB7’s Mohali HITL pod at $0.42 per pair

AB7 Solutions Editorial Team
