Domain intelligence for better models. Agentic AI for real workflows.
Zstate builds the training infrastructure that makes models smarter, and the AI-native systems that put them to work across domains.
Generic annotation fails where domain judgment matters
- Labels created by people who have never done the work
- Tasks reduced to shallow fragments with little workflow context
- Benchmarks that reward looking right, not being right
- Little connection between training data and production behaviour
Zstate's approach:
- Tasks, rubrics, and reward signals designed with domain experts
- RL environments and evaluation loops grounded in real workflows
- Training infrastructure built with downstream system behaviour in mind
- Signals that help models learn judgment, not just pattern matching
Domain Intelligence and Agentic AI
Domain Intelligence
Expert-led data, evaluation, and feedback systems that keep specialist context inside the model-development loop.
- RLHF preference data & reward model training
- SFT instruction datasets from domain experts
- Red-teaming & adversarial evaluation
- Clinical NLP, diagnostic Q&A, EHR abstraction
- Earnings analysis, risk data, compliance evaluation
- Medical coding & ICD abstraction
Agentic AI
Production-grade agentic systems across software engineering, healthcare, and finance, built with domain context carried through design, evaluation, and deployment.
- End-to-end agentic system design & build
- Multi-agent pipelines & workflow automation
- AI-native architecture, not retrofitted legacy code
- From prototype to scalable production deployment
- Compliance-aware engineering for regulated industries
Four areas. Genuine depth in each.
Software engineering data
Healthcare records & reasoning
Agentic systems
Complete agent trajectories across 258k real-world software engineering problems with reasoning traces, tool calls, code edits, and explicit user acceptance signals. Three core layers: Task, Trajectory, and Reward datasets.
- 258k real engineering tasks. Each trajectory records the reasoning traces, tool calls, code edits, and explicit user acceptance signals for a real-world problem. Nothing synthetic.
- Three derived datasets. Task dataset (258k cleaned prompts), Trajectory dataset (3.7M full agent traces), and Reward dataset (130k acceptance signals).
- Beyond SWE-Bench. Full lifecycle: reasoning → tool calls (6-7 per task across 22 tools) → code edits → human acceptance. Real production tasks, not curated benchmarks.
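As a rough illustration of how the three layers described above could relate, here is a minimal Python sketch. Every class, field, and tool name in it is hypothetical and chosen only for this example; it is not Zstate's actual schema. It assumes that a full trajectory holds the reasoning, tool calls, and code edits for one task, and that the Task and Reward records are derived views of it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToolCall:
    tool: str          # hypothetical tool name, e.g. "search_code"
    arguments: dict    # arguments passed to the tool


@dataclass
class Step:
    reasoning: str                                   # reasoning trace for this step
    tool_calls: List[ToolCall] = field(default_factory=list)
    code_edit: Optional[str] = None                  # diff text, if the step edits code


@dataclass
class Trajectory:
    task_id: str
    prompt: str
    steps: List[Step]
    accepted: Optional[bool] = None                  # explicit user acceptance, if recorded


def to_task_record(t: Trajectory) -> dict:
    """Derive a task-level record: just the identifier and the prompt."""
    return {"task_id": t.task_id, "prompt": t.prompt}


def to_reward_record(t: Trajectory) -> Optional[dict]:
    """Derive a reward record; returns None when no acceptance signal exists."""
    if t.accepted is None:
        return None
    return {
        "task_id": t.task_id,
        "reward": 1.0 if t.accepted else 0.0,
        "num_tool_calls": sum(len(s.tool_calls) for s in t.steps),
    }


# Illustrative data only; not drawn from any real dataset.
example = Trajectory(
    task_id="task-0001",
    prompt="Fix the failing date parser test",
    steps=[
        Step(
            reasoning="Locate the parser",
            tool_calls=[ToolCall("search_code", {"query": "parse_date"})],
        ),
        Step(
            reasoning="Patch the off-by-one error",
            tool_calls=[ToolCall("edit_file", {"path": "dates.py"})],
            code_edit="- day + 1\n+ day",
        ),
    ],
    accepted=True,
)

print(to_task_record(example))
print(to_reward_record(example))
```

In this sketch a trajectory without an acceptance signal yields no reward record, which is one plausible reason a reward dataset would be smaller than the trajectory dataset it is derived from.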
5M+ connected healthcare records covering prescription digitisation, diagnostic reasoning, radiology, pathology, and drug grounding mapped to symptoms, diseases, and side effects.
- 5M+ connected healthcare records. Prescription digitisation, diagnostic reasoning, radiology, and pathology report interpretation in one corpus.
- Prescription, diagnostic, and report workflows. Extraction and interpretation tasks spanning prescriptions, clinical reasoning, radiology, and pathology.
- Drug data that completes the corpus. A drug layer tied to symptoms, diseases, and side effects provides the grounding context for medical AI training and evaluation.
Preference data and SFT datasets for earnings reports, 10-K filings, risk model assessment, regulatory compliance, fraud detection, and trade rationale evaluation.
- Earnings & analyst evaluation. Preference data and SFT datasets over earnings reports, 10-K filings, and sell-side research. Evaluated by credentialed analysts.
- Risk & compliance data. Training and evaluation data for risk modelling, regulatory tasks, and stress testing. Reviewed by risk professionals.
- Fraud detection & trade rationale. Expert-annotated datasets for fraud detection, trade rationale evaluation, and financial reasoning benchmarks.
Multi-agent system design, custom RL environments, and production deployment with guardrails, observability, human-in-the-loop checkpoints, and scalable infrastructure.
- Agentic system design. Multi-agent architectures with tool use, memory, orchestration, and handoff logic for long-horizon workflows.
- RL environment engineering. Custom RL environments that simulate real expert decision workflows, producing high-signal training data and meaningful evaluations.
- Production deployment & ops. Prototype to production with guardrails, observability, human-in-the-loop checkpoints, and reliable infrastructure at scale.
How we work
Some teams start with training infrastructure. Others start with the workflow. We support both.