The Missing Benchmark:
Why AI’s Next Crisis Will Be About Data, Not Models - And the People Who Create It
By Vijay Saini, Founder, Zstate.ai
We have benchmarks for everything in AI. We benchmark model accuracy, reasoning ability, code generation, hallucination rates, multilingual performance, and toxicity. We have HELM, BIG-bench, MMLU, GLUE, and dozens of derivatives. We obsess over leaderboards. We argue about whether benchmarks are saturating. We build new benchmarks to test the benchmarks.
And yet, we have no serious, shared standard for the one thing that makes all of these models possible in the first place: the quality of the annotated data they were trained and evaluated on.
This is the gap I want to name - and, more importantly, invite the global research and practitioner community to help close.
The Human Cost We Cannot Ignore: Annotator Wellbeing
Before we talk about data quality frameworks, we must talk about the people at the centre of annotation work. Because the industry already has one cautionary tale it should not need to repeat.
In January 2023, TIME magazine reported on the experiences of Kenyan workers employed in Nairobi by Sama, a San Francisco-based outsourcing firm, to label data for OpenAI. These workers were tasked with reviewing some of the most disturbing content imaginable - graphic descriptions of violence, child abuse, and other deeply traumatic material - in order to train safety classifiers. TIME reported take-home pay of between roughly $1.32 and $2 an hour. Workers interviewed for the report described lasting psychological harm; some described being haunted for months by what they had reviewed. Sama ended its contract with OpenAI early.
This was not an isolated failure of one vendor. It was a symptom of an industry that had not built the infrastructure to protect the people doing its most cognitively and emotionally demanding work.
Annotation, at its most demanding, involves extended exposure to content that would trigger professional welfare protocols in any other field. Radiologists work limited hours reading scans. Content moderators at major platforms are required to have access to psychosocial support. Yet annotation workers - many of them in the Global South, hired through multi-layer contracting arrangements that dilute accountability - are routinely left without equivalent protections.
This is a moral failure. It is also a data quality failure, because an annotator working under psychological distress, without adequate support, without fair compensation, and without professional dignity is not producing their best judgement. The trauma leaks into the label.
Any serious benchmark for annotation quality must therefore include a welfare dimension. The AQTS framework I am proposing incorporates the following as non-negotiable components:
- Content exposure thresholds - documented limits on the volume and severity of harmful content any individual annotator handles within a defined period, with mandatory rotation and rest protocols
- Psychosocial support requirements - mandatory access to professional mental health support for annotators working on sensitive or potentially traumatic content categories
- Fair compensation benchmarking - annotation rates indexed to task complexity and regional living standards, not arbitraged to the lowest available labour market
- Welfare audit trail - as part of dataset transparency documentation, disclosure of whether annotators on the project were subject to welfare protocols and what those protocols were
- Direct contracting accountability - prohibition on obscuring employer identity through multi-layer subcontracting arrangements that make it impossible to assign welfare responsibility
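To make the first of these components concrete, here is a minimal sketch of how a content exposure threshold might be checked. It is an illustration only, not part of any published AQTS specification: the severity tiers, the numbers in `WEEKLY_LIMITS`, and the function name `exceeds_exposure_limit` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical limits: maximum items per severity tier that one annotator
# may review in a week. Tiers and numbers are illustrative assumptions.
WEEKLY_LIMITS = {"low": 2000, "medium": 500, "high": 100}


def exceeds_exposure_limit(reviewed_severities, limits=WEEKLY_LIMITS):
    """Return the severity tiers where this week's count is over the limit.

    reviewed_severities: iterable of severity labels, one per item reviewed.
    """
    counts = Counter(reviewed_severities)
    return sorted(tier for tier, limit in limits.items() if counts[tier] > limit)


week = ["high"] * 120 + ["medium"] * 40
print(exceeds_exposure_limit(week))  # ['high']
```

A real protocol would also need rest and rotation rules; the point here is only that a documented limit can be checked mechanically and logged for the welfare audit trail.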
The UAI system described below also plays a role here: when annotators have portable professional credentials, they have leverage. They can demonstrate their value, choose their engagements, and refuse conditions that compromise their wellbeing without losing their professional identity. Credentialing is not just a quality mechanism. It is a labour dignity mechanism.
The Invisible Foundation
Over the past year, in conversations with teams across LLM developers, AI safety organisations, and enterprise AI buyers, I kept running into the same uncomfortable admission. When pushed on annotation quality - the human feedback that shapes model behaviour through RLHF, the labelled datasets that ground evaluation - most teams acknowledged they were operating on trust and convention rather than verifiable standards.
They knew the name of their annotation vendor. They had spot-check processes. They had inter-annotator agreement numbers, sometimes. But they could not answer with confidence: who were these annotators, how experienced were they, how were conflicts between them resolved, and how could anyone know the dataset was trustworthy enough to train a model that will be deployed in a hospital, a courtroom, or a credit system?
This is not a criticism of any individual team. It is a structural gap in how the industry has evolved. We built the rocket and forgot to inspect the fuel.
Why This Matters More Now
The stakes have risen sharply. As models move from general-purpose assistants to high-stakes domain specialists - healthcare diagnostics, legal reasoning, financial advice, clinical trial analysis - the quality and traceability of the training signal become a patient safety question, a regulatory question, and a liability question.
The EU AI Act, India’s emerging AI governance framework, and executive orders in the US are all converging on one demand: explainability and auditability. Model cards exist. Dataset cards are becoming more common. But annotation process cards - documenting how the data was created, by whom, and with what level of demonstrated expertise - are almost entirely absent from the conversation.
That absence is a vulnerability. And it will be exploited - by regulators, by adversaries, and eventually by model failures in high-stakes environments.
AQTS is not a new layer of bureaucracy. It is a practical implementation guide for obligations that are already becoming law. Three regulatory frameworks directly create demand for exactly what AQTS provides:
- EU AI Act - Article 10: Mandates data governance requirements for high-risk AI systems, including training data quality, relevance, and completeness. AQTS dataset-level transparency metrics map directly onto Article 10 compliance, giving organisations a defensible audit trail for regulators.
- India’s Digital Personal Data Protection Act (DPDPA) 2023: Creates data fiduciary obligations that intersect with annotator data rights and dataset provenance. AQTS’s UAI system, with its privacy-preserving credential model, provides a compliant architecture for managing annotator identity data under DPDPA.
- US AI Executive Orders: Require federal agencies procuring AI systems to assess training data provenance and bias. AQTS’s annotator diversity index and IAA documentation directly address these requirements.
Positioning AQTS as a compliance tool - rather than a voluntary standard - fundamentally changes its adoption dynamics. Organisations will adopt it because they have to, not only because they want to.
What a Benchmark for Annotated Data Should Look Like
I am proposing a framework - a starting point, not a decree - that I am calling the Annotation Quality and Transparency Standard (AQTS). It operates at two levels: the annotator level and the dataset level.
1. The Unique Annotator Identifier (UAI)
The annotator is the atom of this system, and currently the atom is invisible.
A UAI would be a portable, privacy-preserving credential that travels with an annotator across platforms and projects. Think of it as analogous to a CIBIL score in India's credit system - a structured, evolving record of credibility built through demonstrated performance, not self-reported claims. Crucially, it is anonymised: the score and profile are auditable without exposing personal identity.
A UAI profile would include:
- Domain expertise signal - verified area of knowledge (clinical, legal, linguistic, technical), with evidence tier (self-declared, platform-tested, externally credentialed)
- Language and cultural context - mother tongue, working languages, regional context
- Cumulative annotation score - a rolling performance metric aggregated across platforms, weighted by task complexity and expert adjudication outcomes
- Consistency index - performance on calibration tasks (known-answer tasks seeded into annotation batches)
- Conflict rate and resolution record - how often this annotator’s labels diverge from peers, and whether divergence was later validated as correct or overruled
- Task volume and recency - not raw count, but a signal of active, recent practice
- Welfare and working conditions record - an auditable flag confirming that welfare protocols were in place during engagements, visible to prospective employers
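The profile fields above could be represented roughly as follows. This is a sketch under stated assumptions, not a proposed schema: the class and field names, the 90-day recency window, and the `upheld_rate` helper are all hypothetical and would be for the working group to define.

```python
from dataclasses import dataclass, field
from enum import Enum


class EvidenceTier(Enum):
    SELF_DECLARED = "self-declared"
    PLATFORM_TESTED = "platform-tested"
    EXTERNALLY_CREDENTIALED = "externally credentialed"


@dataclass
class UAIProfile:
    # Pseudonymous identifier; no name or contact details are stored here.
    uai: str
    domain: str                          # e.g. clinical, legal, linguistic
    evidence_tier: EvidenceTier
    mother_tongue: str = ""
    working_languages: list[str] = field(default_factory=list)
    regional_context: str = ""
    cumulative_score: float = 0.0        # rolling, complexity-weighted
    consistency_index: float = 0.0       # share of calibration tasks correct
    divergences: int = 0                 # labels that diverged from peers
    divergences_upheld: int = 0          # divergences later validated
    tasks_last_90_days: int = 0          # recency signal (window is assumed)
    welfare_protocols_confirmed: bool = False

    def upheld_rate(self) -> float:
        """Share of this annotator's divergent labels later judged correct."""
        if self.divergences == 0:
            return 0.0
        return self.divergences_upheld / self.divergences


profile = UAIProfile(
    uai="uai-0001",
    domain="clinical",
    evidence_tier=EvidenceTier.PLATFORM_TESTED,
    divergences=10,
    divergences_upheld=4,
)
print(profile.upheld_rate())  # 0.4
```

Keeping the divergence count and the upheld count as separate fields matters: an annotator who often disagrees with peers but is usually vindicated should look different from one who is usually overruled.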
UAI Technical Architecture and Governance
The UAI requires a defined technical architecture and a governance model that no single platform controls. The following is a proposed starting point:
- Technical standard: W3C Verifiable Credentials (VC) provides a production-ready, privacy-preserving standard for exactly this use case. UAIs should be issued as VCs, owned by the annotator, stored in a digital identity wallet, and selectively disclosed to prospective clients. The issuing platform attests to the credential’s validity without controlling its use.
- Scoring algorithm: The cumulative annotation score formula must be open-source and independently auditable. A closed scoring system controlled by a dominant platform creates a capture risk - platforms could systematically downgrade annotators who advocate for better working conditions.
- Governance body: UAI governance should follow a consortium model (analogous to SWIFT in banking or the W3C itself) - a neutral body with seats for annotation platforms, enterprise buyers, academic researchers, and - critically - annotator worker representatives. No single commercial entity should control scoring methodology.
- Anti-gaming mechanisms: Calibration tasks must be rotated and never reused; platform audit logs must be independently reviewable; annotators must have the right to contest scores through a defined appeals process.
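An open scoring algorithm could start from something as simple as the sketch below, which reads "weighted by task complexity and expert adjudication outcomes" as a complexity-weighted share of labels upheld at adjudication. That reading, the function name `cumulative_score`, and the shape of the input records are assumptions for illustration; the actual formula is exactly what the governance body would need to agree on.

```python
def cumulative_score(records):
    """Complexity-weighted share of labels upheld at adjudication.

    records: iterable of (complexity_weight, upheld) pairs, where upheld is
    True if the label was confirmed by expert adjudication or a known answer.
    """
    total_weight = 0.0
    upheld_weight = 0.0
    for weight, upheld in records:
        if weight <= 0:
            raise ValueError("complexity weight must be positive")
        total_weight += weight
        if upheld:
            upheld_weight += weight
    if total_weight == 0:
        return None  # no evidence yet; distinct from a score of zero
    return upheld_weight / total_weight


# Two easy tasks upheld, one hard task overruled, one hard task upheld.
history = [(1.0, True), (1.0, True), (3.0, False), (3.0, True)]
print(cumulative_score(history))  # 0.625
```

Because the whole calculation fits in a few lines, any annotator contesting a score could recompute it from their own records, which is the practical meaning of "independently auditable".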
2. Dataset-Level Transparency Metrics
At the dataset level, AQTS proposes a standardised release checklist that any organisation publishing or procuring annotated data should be able to produce:
- Inter-Annotator Agreement (IAA) - reported using appropriate statistics (Cohen’s Kappa, Krippendorff’s Alpha, or task-appropriate alternatives), not hidden behind aggregate accuracy numbers
- Annotator diversity index - distribution across domain expertise, geography, language, and seniority tier, because a dataset annotated by a homogeneous group carries systematic blind spots
- Conflict rate and resolution mechanism - what percentage of labels were contested, and how were they resolved? Majority vote, expert adjudication, hierarchical review? Each mechanism has different reliability properties and should be named
- Annotation guideline versioning - were guidelines updated mid-project? What changed, and were prior annotations revisited?
- Calibration and quality control protocol - what blind-check or seeded-task mechanism was used to monitor annotator drift?
- Task complexity classification - a structured rating of how cognitively demanding and domain-specific the annotation task was, so that IAA scores can be interpreted in context
- Annotator welfare disclosure - confirmation that annotators working on sensitive content categories were provided with welfare protocols, compensation disclosures, and content exposure limits
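To show why IAA should not be hidden behind aggregate accuracy numbers, here is a small worked example of Cohen's Kappa for two annotators, computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The function name and the toy labels are made up for the illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


a = ["tox", "tox", "ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok"]
b = ["tox", "ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok", "tox"]
print(round(cohens_kappa(a, b), 3))  # 0.375
```

The two annotators agree on 8 of 10 items, yet Kappa is only 0.375, because most of that agreement is what chance alone would produce when one label dominates. This is the gap that a raw agreement percentage conceals.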
Disagreement as Signal, Not Noise
One of the most important recent insights in annotation research - central to the work of Lora Aroyo, on which this framework explicitly builds - is that annotator disagreement is often meaningful signal, not a quality failure to be eliminated.
The current framing of IAA as a simple quality gate (higher agreement = better dataset) is wrong for subjective tasks. In toxicity classification, sentiment analysis in culturally ambiguous contexts, or clinical severity judgement, low IAA may accurately reflect genuine label uncertainty - the kind of uncertainty that a well-calibrated model should also express.
AQTS therefore distinguishes between two IAA regimes:
- Objective tasks (factual extraction, named entity recognition, structured clinical coding): High IAA is a genuine quality signal. Low agreement indicates annotator error or guideline ambiguity that should be resolved.
- Subjective tasks (toxicity, sentiment, cultural relevance, clinical severity): IAA should be reported but not used as a pass/fail gate. Datasets covering subjective tasks should release distributional labels - the full range of annotator judgements - rather than majority-vote collapsed single labels. This preserves the signal that the model needs to handle edge cases well.
This distinction matters enormously for safety-critical AI. The edge cases where annotators disagree are often precisely the cases where model errors have the highest real-world consequence.
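The difference between a majority-vote label and a distributional label can be seen in a few lines. This is a toy sketch; the function names and the example judgements are invented for illustration.

```python
from collections import Counter


def majority_label(judgements):
    """Collapse all judgements to the single most common label."""
    return Counter(judgements).most_common(1)[0][0]


def label_distribution(judgements):
    """Keep every judgement, as a probability distribution over labels."""
    counts = Counter(judgements)
    total = len(judgements)
    return {label: count / total for label, count in counts.items()}


judgements = ["toxic", "toxic", "toxic", "not_toxic", "not_toxic"]
print(majority_label(judgements))      # toxic
print(label_distribution(judgements))  # {'toxic': 0.6, 'not_toxic': 0.4}
```

The majority vote records this item as simply "toxic"; the distribution records that two of five annotators disagreed, which is the information a model needs in order to express uncertainty on this kind of case.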
None of these are novel metrics in isolation. What is novel is the proposal to standardise, require, and publish them together - making annotated datasets as legible to scrutiny as model evaluation results.
Making the Standard Stick
Standards without enforcement are wishlists. Model cards and dataset cards already exist - and most practitioners ignore them. AQTS will face the same fate unless it is attached to mechanisms that create real consequences for non-compliance. Three adoption pathways must be built in parallel:
- Regulatory pull: Position AQTS as the practical compliance guide for EU AI Act Article 10, India’s DPDPA, and US federal AI procurement requirements. Regulatory necessity is the strongest adoption driver - it removes optionality.
- Procurement leverage: Enterprise AI buyers - hospital networks, financial institutions, court systems - are the demand side with real contractual leverage over annotation vendors. A procurement requirement that training datasets must be AQTS-compliant forces the supply side to follow. The immediate priority is securing commitments from two or three flagship enterprise buyers to include AQTS compliance as a vendor contract condition.
- Public AQTS registry: Modelled on ClinicalTrials.gov for clinical research - a publicly searchable registry of AQTS-compliant datasets, with compliance status, audit dates, and welfare disclosures visible to anyone. Compliance becomes a reputational asset; non-compliance, or de-listing after audit failure, becomes publicly visible. This creates incentives without requiring legal enforcement.
What Could Go Wrong: A Threat Model
Any credentialing system creates attack surfaces. Naming these vulnerabilities is not a reason to abandon the framework - it is a prerequisite for designing it responsibly. The following are the primary threat vectors AQTS must address:
- Score farming: Annotators could collude on calibration tasks if those tasks are leaked or reused across batches. Mitigation: calibration tasks must be rotated, never reused, and seeded unpredictably within annotation streams. Platforms must not have advance knowledge of which tasks are calibration tasks.
- Platform capture: If annotation platforms control the UAI scoring mechanism, they can systematically downgrade annotators who advocate for better pay or working conditions - rewarding compliance over quality. Mitigation: open-source scoring algorithm, independent audit rights, annotator appeals process, and worker representation in governance body.
- Client-side stratification: Dataset buyers could selectively assign high-UAI annotators to easy tasks and low-UAI annotators to traumatic content - creating a two-tier welfare system within the AQTS framework itself. Mitigation: welfare disclosure must include annotator tier distribution across content harm tiers, making this stratification visible in audit documentation.
- Credential inflation: Platforms competing for annotator talent could issue inflated UAI scores to attract workers. Mitigation: cross-platform score normalisation, independent audit of scoring distributions, and public disclosure of each platform’s score distribution.
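The score farming mitigation (calibration tasks seeded unpredictably and never reused) might look like the sketch below. The function name `seed_calibration` and its parameters are hypothetical, task identifiers are assumed to be unique, and a production system would likely want a cryptographically strong source of randomness rather than the default generator.

```python
import random


def seed_calibration(tasks, calibration_pool, k, rng=None):
    """Insert k unused calibration tasks at random positions in a batch.

    Returns (batch, positions, remaining_pool). Tasks that were used are
    removed from the pool so that none is served twice.
    Assumes every task identifier is unique.
    """
    if rng is None:
        rng = random.Random()
    if k > len(calibration_pool):
        raise ValueError("not enough unused calibration tasks")
    chosen = rng.sample(calibration_pool, k)
    remaining = [t for t in calibration_pool if t not in chosen]
    batch = list(tasks)
    for task in chosen:
        # Any slot is possible, including the very start or the very end.
        batch.insert(rng.randrange(len(batch) + 1), task)
    positions = sorted(batch.index(t) for t in chosen)
    return batch, positions, remaining


batch, positions, pool = seed_calibration(
    ["t1", "t2", "t3", "t4"], ["c1", "c2", "c3"], k=2
)
print(len(batch), len(pool))  # 6 1
```

Returning the remaining pool makes the "never reused" rule explicit: the caller has to carry the shrinking pool forward, and an auditor can check that no calibration task appears in two batches.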
These are not solved problems. They require input from privacy researchers, labour economists, and platform designers. That is precisely why this should be a community effort, not a product launch.
An Open Invitation
I am not proposing that Zstate.ai owns this standard. That would defeat the purpose.
What I am proposing is a working group - drawn from academia, civil society, and industry - to stress-test, refine, and ultimately formalise the AQTS framework into something the community can adopt, adapt, and hold each other to.
I am specifically inviting researchers and practitioners from institutions and organisations that have already done serious foundational work in adjacent areas:
Global Voices
- Percy Liang and the Stanford Center for Research on Foundation Models (CRFM), whose HELM benchmark has already demonstrated what rigorous, multi-dimensional AI evaluation looks like
- Lora Aroyo (Google DeepMind), whose work on annotation disagreement as signal - not noise - is conceptually foundational to this effort
- Dirk Hovy (Bocconi University), who has pushed the field to reckon with annotator identity and positionality
- Emily Bender (University of Washington), whose work on data documentation and the limits of scale is essential context
- The Alan Turing Institute’s fairness and accountability teams
- The data-centric AI community convened around Andrew Ng’s initiative
- Researchers at the Distributed AI Research Institute (DAIR), whose work on the social and labour dimensions of AI data is directly relevant to the welfare components of this framework
The Indian Context
India is not merely a participant in global annotation work - it is one of the most important contexts for AQTS for three specific, structural reasons:
- Scale: India has one of the world’s largest annotation workforces. Standards that do not work at Indian scale do not work at global scale. Any framework designed primarily for US or European conditions and grafted onto Indian operations will fail in practice.
- Regulatory relevance: India’s Digital Personal Data Protection Act (DPDPA) 2023 creates data fiduciary obligations that directly intersect with annotator data rights and dataset provenance - making AQTS a natural compliance architecture for Indian AI companies.
- Geopolitical opportunity: The global AI standards conversation is currently dominated by US and European frameworks. AQTS, co-developed with Indian institutions, could become India’s contribution to global AI governance - not a standard India adopts, but one it helped build. MeitY and NASSCOM are actively shaping AI policy; this is the moment to engage them.
I am specifically inviting the following Indian institutions to the working group:
- AI4Bharat at IIT Madras, who are building large-scale annotated datasets for Indian languages and understand the practical constraints of annotation at scale in multilingual, low-resource settings
- Kalika Bali and Monojit Choudhury at Microsoft Research India, whose work on language diversity in NLP data is directly relevant
- The NASSCOM AI Centre of Excellence, for grounding the framework in enterprise and regulatory realities
- Research groups at IIT Bombay and IIT Delhi working on responsible AI evaluation
To practitioners building annotation infrastructure at Hugging Face, Scale AI, Appen, Toloka, iMerit, and the growing ecosystem of Indian annotation firms: your operational knowledge is exactly what an academic framework risks getting wrong without you in the room.
To labour researchers and worker rights organisations who have documented the conditions of annotation work in the Global South: the framework needs your scrutiny and your input more than it needs any technical metric.
What Transparency Actually Requires
One objection I anticipate: won’t releasing annotator profiles, even anonymised ones, create gaming and manipulation?
It is a fair concern, and it points to the governance question that any credentialing system must answer. A UAI should be owned and controlled by the annotator, with selective disclosure - much like a verifiable credential in a digital identity wallet. The score is portable, but the annotator decides what they share and with whom. The platform that issued the credential attests to its validity.
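The data flow of selective disclosure can be sketched very simply. This shows only the consent logic: it is not the W3C Verifiable Credentials API, it omits the cryptographic proof that a real credential presentation would carry, and the function name `disclose` and the claim names are invented for the example.

```python
def disclose(credential_claims, requested, consented):
    """Return only the claims the client requested AND the annotator approved."""
    allowed = set(requested) & set(consented)
    return {k: v for k, v in credential_claims.items() if k in allowed}


# Claims held in the annotator's wallet (illustrative values).
claims = {
    "domain": "clinical",
    "evidence_tier": "platform-tested",
    "cumulative_score": 0.91,
    "regional_context": "South Asia",
}

shared = disclose(
    claims,
    requested=["domain", "cumulative_score", "regional_context"],
    consented=["domain", "cumulative_score"],
)
print(shared)  # {'domain': 'clinical', 'cumulative_score': 0.91}
```

The client asked for regional context, the annotator declined, and the presentation simply omits it. Nothing is shared by default.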
A second concern is power asymmetry. If annotation platforms control the UAI scoring mechanism, they can manipulate credentialing to favour compliant annotators over quality ones. The governance model must therefore include independent audit rights, worker representation, and open scoring methodology - none of which are in place today.
Neither concern is resolved today, which is why the governance design belongs with a community working group rather than with any single company.
What Comes Next from Zstate
We are building toward something concrete.
In the coming weeks, Zstate.ai will publish its first transparency report: an in-depth look at a public healthcare dataset annotated by our expert annotator team. The report will apply the AQTS framework as we currently understand it - documenting annotator profiles, IAA statistics, conflict resolution mechanisms, calibration protocols, and our annotator welfare practices in full. It will not be a showcase. It will be a worked example, complete with the tensions and tradeoffs we encountered, offered as a contribution to the conversation this article is trying to start.
Healthcare is where we began because the stakes are highest and the standards are most legible. In clinical annotation, domain expertise is verifiable, disagreement is meaningful, and errors have real consequences. If AQTS cannot hold up in healthcare, it needs to be redesigned. If it can, the path to finance, legal, and general-purpose AI data becomes clearer.
We also chose healthcare because it is the domain where annotator wellbeing is hardest to overlook. The people reviewing clinical data need to be expert, supported, and fairly compensated. We intend to show that it is possible to build annotation infrastructure that meets all three criteria simultaneously.
The Ask
If annotation quality is AI’s missing benchmark, then what we need now is the equivalent of the research community that built HELM, or the practitioners who pushed for model cards, or the legal scholars who shaped early thinking on algorithmic accountability - people willing to do the unglamorous, necessary work of building shared standards before a crisis forces the issue.
We already have one crisis behind us, in the form of the trauma inflicted on workers in Nairobi. We should not need another to act.
I am writing this article because I believe that the crisis in data quality is closer than most people think, and because I believe the community capable of preventing it is already out there.
If you are working on annotation quality, data documentation, annotator welfare, or AI auditability - I want to hear from you. Reach out directly, or respond publicly. The framework above is a starting point, not a conclusion. It should be argued with.
The models will keep improving. Let’s make sure the data they learn from - and the people who create it - are worthy of that scrutiny.
Vijay Saini, Founder, Zstate.ai
Zstate.ai is an AI data services company specialising in expert annotation for high-stakes domains. We are currently focused on healthcare data, with finance and coding annotation capabilities in development.