About Turing:
Based in San Francisco, California, Turing is the world's leading research accelerator for frontier AI labs and a trusted partner for global enterprises deploying advanced AI systems. Turing supports customers in two ways: first, by accelerating frontier research with high-quality data, advanced training pipelines, and top AI researchers who specialize in coding, reasoning, STEM, multilinguality, multimodality, and agents; and second, by applying that expertise to help enterprises transform AI from proof of concept into proprietary intelligence with systems that perform reliably, deliver measurable impact, and drive lasting results on the P&L.
Role Overview:
We are seeking a highly analytical and computationally proficient individual with a strong research background to join our team. You will be instrumental in this role, either crafting challenging and insightful problems in your research domain or devising elegant computational solutions to them.
Responsibilities:
Required Qualifications:
5+ years of research experience — academic or industry research in any scientific domain
Strong reading comprehension and ability to extract structured information from unstructured text
Experience with JSON/data structures — designing schemas, validating output formats
Python scripting ability (for judge scripts and data processing)
Experience with AI coding benchmarks (SWE-bench, Terminal-bench)
Comfortable with Docker — writing Dockerfiles, building images, debugging container issues
Attention to detail — building oracles requires exact values, not approximations
Strong plus:
Experience with systematic reviews, meta-analyses, or large-scale literature surveys
Familiarity with medical/legal/scientific document analysis
Experience with NLP or information extraction tasks
Knowledge of LLM evaluation and benchmarking (MMLU, GPQA, SimpleQA)
Experience curating datasets for AI evaluation
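The Python and JSON skills above come together in judge scripts. A minimal sketch of what such a script might look like — the schema and field names here are illustrative assumptions, not Turing's actual task format:

```python
import json

# Hypothetical output schema a judge script might enforce:
# {"case_id": str, "diagnosis": str, "evidence": [str, ...]}
REQUIRED_FIELDS = {"case_id": str, "diagnosis": str, "evidence": list}

def judge(raw_output: str) -> bool:
    """Return True if the agent's output parses as JSON and matches the schema."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    # Every required field must be present with the expected type.
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return False
    return True
```

For example, `judge('{"case_id": "C001", "diagnosis": "aortic stenosis", "evidence": []}')` passes, while malformed JSON or a missing field fails.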
Example of what you'll produce: A task with 1500 medical case records (500 cardiac, 500 vascular, 500 systemic). The agent must read all cases, identify relevant ones, extract evidence, and produce a cross-domain diagnosis. The oracle requires exact first/last case IDs per file (proves the agent read start to end), verbatim excerpts from specific cases (proves it read individual records), and a cross-domain evidence matrix. The decomposition uses 15 chunk-reader sub-agents, 3 domain synthesizers, and 1 final synthesizer. Oracle scores 1.0, single-agent scores 0.15, multi-agent scores 0.80.
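An exact-value oracle like the one described above could be sketched roughly as follows — the file names, case-ID format, field names, and equal-weight scoring are illustrative assumptions, not the actual task spec:

```python
def score_submission(answer: dict, oracle: dict) -> float:
    """Score an agent's answer against an exact-value oracle.

    Checks (hypothetical field names):
      - first/last case IDs per file, proving the agent read start to end
      - verbatim excerpts from specific cases, proving it read individual records
    Score is the fraction of checks that match exactly.
    """
    checks = []
    # Exact first/last case IDs for each of the three record files.
    for domain in ("cardiac", "vascular", "systemic"):
        key = f"{domain}_first_last_ids"
        checks.append(answer.get(key) == oracle[key])
    # Verbatim excerpts must match character for character.
    for case_id, excerpt in oracle["excerpts"].items():
        checks.append(answer.get("excerpts", {}).get(case_id) == excerpt)
    return sum(checks) / len(checks)
```

A perfect answer scores 1.0; an answer with correct IDs but missing excerpts gets only partial credit, which is what makes approximations fail.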
Perks of Freelancing With Turing:
Offer Details:
Don't miss out on this job opportunity!
Title: SwarmBench Task Engineer — Knowledge / Research