Boston Kendall Square AI startup: 14,000 ranked RLHF outputs from AB7’s Mohali HITL pod at $0.42 per pair
The CTO of a 47-person foundation-model startup on Main Street in Kendall Square (Cambridge, MA, two blocks from the MIT Stata Center) needed 14,000 ranked-output pairs for a Llama 3.1 70B preference fine-tune by July 14. Surge AI quoted $1.85/pair ($25,900). Scale Spellbook came back at $1.60/pair ($22,400). MTurk landed at $0.74/pair ($10,360), but the CTO’s pilot flagged 22% inter-annotator disagreement on chain-of-thought reasoning. AB7’s Mohali Phase 8B HITL pod assigned 6 trained reviewers (4 with ML/CS Master’s degrees, 2 with peer-reviewed NLP publications) and delivered all 14,000 pairs in 19 calendar days at $0.42/pair ($5,880), 74% below Scale Spellbook.
RLHF preference data isn’t a scraping problem. It’s a calibration problem.
A Kendall Square AI startup running a Llama 3.1 70B preference fine-tune in Q2 2026 needs roughly 8,000-20,000 ranked-output pairs to lift the reward model past the noise floor. That range is in line with the DPO and KTO papers and was confirmed in this CTO’s two prior rounds.
The 47-person Cambridge startup — anonymized, Series A, $14.2M raised, MIT-spinout founders — had run two prior dataset rounds. Round 1 on MTurk: 4,200 pairs, 22% disagreement rate on multi-step reasoning, fine-tune showed reward-hacking inside 1,800 steps. Round 2 on Scale Spellbook: 6,800 pairs at $1.60/pair = $10,880; quality was solid but the turnaround was 31 days and the rubric was locked after week 1, which blocked two mid-cycle clarifications.
The CTO’s brief for round 3 was specific: 14,000 pairs covering 7 prompt clusters (math word problems, code-completion judgement, factual QA, multi-step reasoning, safety-refusal preference, tone calibration, instruction-following length-bias), with chain-of-thought scoring on the reasoning bucket, and a live Slack channel for rubric clarifications between reviewers and the startup’s ML lead. Budget ceiling: $8,000. Deadline: July 14, 2026, hard.
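The quotes and the budget ceiling above reduce to simple arithmetic. The following is a minimal sketch of that check, using only the per-pair prices, pair count, and ceiling stated in this article; the function names are illustrative, not part of any vendor tooling.

```python
# Per-pair quotes, pair count, and budget ceiling as stated in the article.
QUOTES_USD_PER_PAIR = {
    "Surge AI": 1.85,
    "Scale Spellbook": 1.60,
    "MTurk": 0.74,
    "AB7 Mohali": 0.42,
}
PAIRS = 14_000
BUDGET_CEILING_USD = 8_000


def quote_totals(quotes, pairs):
    """Return {vendor: total cost in USD} for a fixed pair count."""
    return {vendor: round(rate * pairs, 2) for vendor, rate in quotes.items()}


def within_budget(totals, ceiling):
    """Return the vendors whose total is at or under the ceiling."""
    return [vendor for vendor, total in totals.items() if total <= ceiling]


def savings_pct(totals, vendor, baseline):
    """Whole-percent amount by which `vendor` undercuts `baseline`."""
    return round(100 * (1 - totals[vendor] / totals[baseline]))


totals = quote_totals(QUOTES_USD_PER_PAIR, PAIRS)
for vendor, total in totals.items():
    print(f"{vendor:16s} ${total:>10,.2f}")
print("Under budget:", within_budget(totals, BUDGET_CEILING_USD))
print("AB7 vs Scale Spellbook (%):", savings_pct(totals, "AB7 Mohali", "Scale Spellbook"))
```

Only one of the four quotes clears the $8,000 ceiling at 14,000 pairs, which is why per-pair price, and not just quality, decided the round.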
What AB7 deployed: 6 reviewers, named rubric tools, daily calibration
AB7’s Mohali Phase 8B HITL pod started the engagement on 2026-04-21. Pod composition:
- 4 ML/CS Master’s-degree reviewers — IIT Roorkee, IIIT Hyderabad, IIT Delhi, BITS Pilani. Average 3.2 years on annotation projects. All 4 had prior RLHF preference-ranking work on at least one Hugging Face open dataset.
- 2 senior reviewers with peer-reviewed NLP publications: one ACL 2024, one EMNLP 2023. Both served as lead calibrators and owned the Slack-facing rubric clarifications.
- 1 dedicated project manager: Cambridge ET-overlap shift (19:30-04:30 IST), 12-minute Slack SLA inside the overlap window.
Named tooling stack the pod ran on:
- LangSmith — every reviewer’s session traced; rubric drift visible per-reviewer per-day on the LangSmith eval dashboard
- Phoenix (Arize AI) — open-source observability layer; the CTO’s ML lead pulled inter-reviewer agreement deltas every 48 hours
- Braintrust — rubric versioning, golden-set scoring, A/B rubric experiments before locking each prompt cluster
- Helicone — Llama 3.1 70B inference logging on the model outputs being ranked (so reviewers saw token-level cost + latency context)
- Label Studio Enterprise — preference-ranking UI customised for ABBA-style pairwise judgements + 5-point Likert on chain-of-thought reasoning
- Notion + GitHub — rubric repo with versioned diffs, every clarification a PR not a Slack reply
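The "ABBA-style pairwise judgements" in the tooling list are not defined further in this article. One common reading is counterbalanced presentation: the same pair is shown once as A-then-B and once as B-then-A, and a preference only counts if it survives the swap. The sketch below assumes that reading; the `Judgement` record and its field names are hypothetical and are not Label Studio's export schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Judgement:
    pair_id: str
    first_shown: str  # "A" or "B": which output was displayed first
    picked: str       # "first" or "second": the reviewer's choice by position


def picked_output(j: Judgement) -> str:
    """Map a positional choice back to the underlying output label."""
    second_shown = "B" if j.first_shown == "A" else "A"
    return j.first_shown if j.picked == "first" else second_shown


def resolve(ab: Judgement, ba: Judgement) -> Optional[str]:
    """Return the preferred output if both orderings agree, else None."""
    if ab.pair_id != ba.pair_id or ab.first_shown == ba.first_shown:
        raise ValueError("need the same pair shown in both orders")
    a, b = picked_output(ab), picked_output(ba)
    return a if a == b else None


# Consistent: output A wins regardless of which position it was shown in.
stable = resolve(Judgement("p1", "A", "first"), Judgement("p1", "B", "second"))
# Order-sensitive: the reviewer picked whatever was shown first both times.
unstable = resolve(Judgement("p2", "A", "first"), Judgement("p2", "B", "first"))
```

Pairs that resolve to `None` are candidates for a second reviewer or a rubric clarification, since the choice tracked screen position and not output quality.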
Engagement model: fixed-scope project, $5,880 all-in for 14,000 ranked pairs at $0.42/pair, billed 50% on rubric lock + 50% on delivery. Daily 15-minute Mohali ↔ Kendall Square sync at 09:30 ET / 19:00 IST. No setup fee. Free re-annotation of any pair the CTO’s golden-set check flagged as low-confidence inside 72 hours of delivery.
Week-by-week: 14,000 pairs in 19 calendar days
| Week | Action | Pairs ranked | Inter-reviewer agreement | Notes |
|---|---|---|---|---|
| 1 (Apr 21-27) | Rubric calibration on 400-pair golden set across all 7 clusters; 3 rubric clarifications shipped as Notion diffs | 1,100 | 71% | Daily calibration begun against the 22% MTurk-baseline disagreement |
| 2 (Apr 28-May 4) | Production ranking on math-word-problems + code-completion clusters; Phoenix flagged 2 reviewers drifting on length-bias rubric | 4,300 | 83% | Both flagged reviewers re-calibrated in a 90-min Mohali-led session |
| 3 (May 5-11) | Multi-step reasoning + safety-refusal clusters; chain-of-thought 5-point Likert added per the CTO’s mid-cycle request | 4,800 | 86% | Rubric clarification PR #14 shipped Tuesday; live by Wednesday |
| 19th day (May 9) | Final delivery: 14,000 pairs in Label Studio export JSON + Hugging Face Datasets-compatible Parquet | 14,000 total | 89% overall | Inter-reviewer agreement above the 85% threshold the CTO’s reward model wanted |
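The article does not say how the inter-reviewer agreement percentages in the table are computed. A minimal sketch of one plausible definition, mean pairwise percent agreement over items that two reviewers both labelled, is shown here; the reviewer IDs and labels are made-up illustration data, not figures from the engagement.

```python
from itertools import combinations


def pairwise_agreement(labels_by_reviewer):
    """Fraction of (reviewer pair, item) comparisons where both chose the same output.

    labels_by_reviewer: {reviewer: {item_id: "A" or "B"}}
    Only items labelled by both reviewers in a pair are compared.
    """
    agree = total = 0
    for r1, r2 in combinations(sorted(labels_by_reviewer), 2):
        shared = labels_by_reviewer[r1].keys() & labels_by_reviewer[r2].keys()
        for item in shared:
            total += 1
            agree += labels_by_reviewer[r1][item] == labels_by_reviewer[r2][item]
    if total == 0:
        raise ValueError("no overlapping items between reviewers")
    return agree / total


# Hypothetical labels for three reviewers on three shared pairs.
example = {
    "r1": {"p1": "A", "p2": "B", "p3": "A"},
    "r2": {"p1": "A", "p2": "B", "p3": "B"},
    "r3": {"p1": "A", "p2": "A", "p3": "A"},
}
score = pairwise_agreement(example)
print(f"{score:.1%}")
```

Raw percent agreement does not correct for agreement by chance; a chance-corrected statistic such as Cohen's or Fleiss' kappa would read lower on the same labels, so the two should not be compared directly.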
The startup’s ML lead ran the fine-tune over May 10-13 on 8× H100s in Lambda Labs Boston. Reward-model loss on the held-out 800-pair eval set hit the target threshold on the first run, with no second round and no re-annotation. Total RLHF-data cost stack:
| Line item | MTurk (Round 1) | Scale Spellbook (Round 2) | AB7 Mohali (Round 3) |
|---|---|---|---|
| Cost per ranked pair | $0.74 | $1.60 | $0.42 |
| Total cost at 14,000 pairs | $10,360 | $22,400 | $5,880 |
| Calendar days to deliver | 14 | 31 | 19 |
| Inter-reviewer agreement | 78% | 87% | 89% |
| Mid-cycle rubric clarifications allowed | 0 | Locked after week 1 | 14 PRs shipped during the run |
| Reward-model re-runs required | 3 | 1 | 0 |
Why AB7 lands at $0.42 per ranked pair
RLHF preference-ranking labour cost is sticky for one reason: rubric calibration is a Master’s-level judgement task, not a click-farm task. MTurk hits $0.74/pair because Mechanical Turk pricing optimises around minute-level micro-tasks done by general-public workers, which is fine for “is this image a cat” and broken for “which chain-of-thought reasoning step better preserves the user’s intent”. Scale Spellbook lands at $1.60/pair because Scale runs US-based PhD-tier reviewers in San Francisco at $48-$72/hour fully loaded plus 18% platform margin. AB7’s Mohali Phase 8B pod runs IIT/IIIT Master’s-tier reviewers (4 with prior open-source RLHF dataset work) at the Mohali labour cost of $9-$14/hour fully loaded plus the AB7 PM and tooling overhead. That’s the $0.42/pair the Kendall Square CTO sees: the same rubric discipline, the same Braintrust + LangSmith + Phoenix observability, 19-day delivery vs Scale’s 31, mid-cycle rubric PRs allowed, and roughly 3.8× cost compression against Scale Spellbook ($1.60 vs $0.42 per pair) on the data side of the fine-tune.
What AB7 will and won’t take on for RLHF / preference-data work
Will take on: preference ranking (DPO, KTO, ORPO formats), pairwise comparison, 5-point Likert calibration, chain-of-thought stepwise scoring, safety-refusal preference data, instruction-tuning correctness review, agent-trajectory grading on LangSmith / Phoenix / Braintrust / Helicone logs, golden-set creation, rubric versioning in GitHub. Output formats: Hugging Face Datasets, Label Studio JSON, Parquet, JSONL.
Won’t take on: raw prompt synthesis at scale (use a prompt-engineering pod, separate engagement), red-team adversarial generation (specialist work — refer to a Mohali AI safety pod), or any client who won’t share the model outputs being ranked (AB7 won’t blind-rank without access to model traces — that’s how MTurk-quality calibration drift happens).
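For the JSONL output format listed above, a preference pair is commonly stored as one object per line with `prompt`, `chosen`, and `rejected` fields, the layout most DPO-style trainers expect. The sketch below assumes that field convention; the article does not publish AB7's actual export schema, and the helper names are illustrative.

```python
import json


def to_preference_record(prompt, output_a, output_b, preferred):
    """Build one prompt/chosen/rejected record from a pairwise judgement."""
    if preferred not in ("A", "B"):
        raise ValueError("preferred must be 'A' or 'B'")
    if preferred == "A":
        chosen, rejected = output_a, output_b
    else:
        chosen, rejected = output_b, output_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def write_jsonl(records, path):
    """Write one JSON object per line, UTF-8, preserving non-ASCII text."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Hypothetical example pair; the reviewer preferred output B.
record = to_preference_record(
    prompt="What is 17 * 6?",
    output_a="17 * 6 = 112",
    output_b="17 * 6 = 102",
    preferred="B",
)
```

A file written this way can typically be read back with the Hugging Face `datasets` library via `load_dataset("json", data_files=path)`, which is one route to the Hugging Face Datasets delivery format mentioned above.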
Five questions a Kendall Square CTO actually asks before signing
1. How does AB7 prevent the MTurk-style 22% disagreement gap? Week-1 calibration on a 400-pair golden set, Phoenix agreement-delta tracking per reviewer per day, mandatory re-calibration session for any reviewer who drifts > 8 points below the pod median.
2. Can the rubric change mid-engagement? Yes. Rubric lives in a GitHub repo. Every clarification ships as a PR with a review window — the Round 3 engagement shipped 14 mid-cycle rubric PRs over 19 days.
3. What happens if the AB7 reviewer pod is too slow? Daily Phoenix throughput dashboard sent to the CTO at 17:00 ET. If a 5-day rolling average drops below the contracted pairs/day rate, AB7 adds a 7th reviewer at no cost.
4. Will reviewers see the raw model outputs or only the ranking interface? Both. Helicone-logged Llama 3.1 70B traces are visible alongside the pair, so reviewers see token-level reasoning, not just the final answer. That’s the reason inter-reviewer agreement runs 6-8 points above MTurk benchmarks on chain-of-thought clusters.
5. Can AB7 deliver in Hugging Face Datasets format directly? Yes. Final delivery includes Parquet + JSONL + Label Studio JSON, plus a HF Datasets loader script. The Round 3 startup pushed straight to a private HF dataset on May 9 evening ET.
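Two of the answers above are threshold rules that can be written down directly: re-calibrate any reviewer more than 8 points below the pod median (question 1), and add a reviewer when the 5-day rolling average falls under the contracted daily rate (question 3). The sketch below encodes both. The reviewer scores, daily counts, and the 737 pairs/day rate are hypothetical; the article does not state the contracted rate, and 737 is simply 14,000 pairs over 19 days rounded up.

```python
from statistics import median


def flag_drifting(agreement_pct, max_gap_points=8.0):
    """Reviewers whose agreement is more than max_gap_points below the pod median."""
    pod_median = median(agreement_pct.values())
    return sorted(
        reviewer
        for reviewer, pct in agreement_pct.items()
        if pod_median - pct > max_gap_points
    )


def below_contracted_rate(daily_pairs, contracted_per_day, window=5):
    """True if the mean of the most recent `window` days is under the contracted rate."""
    if len(daily_pairs) < window:
        return False  # not enough history to evaluate the rolling window
    recent = daily_pairs[-window:]
    return sum(recent) / window < contracted_per_day


# Hypothetical per-reviewer agreement with the golden set, in percent.
pod = {"r1": 88, "r2": 86, "r3": 87, "r4": 77, "r5": 90, "r6": 85}
needs_recalibration = flag_drifting(pod)

# Hypothetical daily throughput, oldest day first.
throughput = [800, 760, 700, 690, 720, 650]
add_reviewer = below_contracted_rate(throughput, contracted_per_day=737)
```

Using the pod median as the reference keeps one weak reviewer from dragging the baseline down, which a mean would allow.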
What to do next
Book a 30-minute scoping call: calendly.com/ashok-benial/meeting. Bring the rubric draft (or the prior MTurk/Scale rubric you want adapted), the prompt-cluster definitions, the target pair count, and the model traces or sampled outputs you’ll be ranking. AB7 returns a written calibration-and-delivery plan inside 3 business days with the per-pair quote and a delivery-date commitment. If the math doesn’t beat the in-house ML-team-doing-it-themselves option in year 1, AB7 says so.
Written by
AB7 Solutions Editorial Team
Content & Research Division
The AB7 Solutions editorial team combines expertise across healthcare operations, IT staffing, cybersecurity, and workforce management to deliver actionable insights for business leaders.
