About Turing
Turing is one of the world’s fastest-growing AI companies, accelerating the advancement and deployment of powerful AI systems. Turing helps customers in two ways: working with the world’s leading AI labs to advance frontier model capabilities in thinking, reasoning, coding, agentic behavior, multimodality, multilinguality, STEM, and frontier knowledge; and leveraging that work to build real-world AI systems that solve mission-critical priorities for companies.
Role Overview
We are seeking experienced SwarmBench Task Engineers — Data Analysis to design and develop high-quality multi-agent benchmark tasks that evaluate the analytical reasoning, coordination, and execution capabilities of advanced AI systems.
In this role, you will build realistic benchmark tasks that require AI agents to analyze large, messy, multi-source datasets, decompose work across specialist sub-agents, and arrive at specific, verifiable conclusions. These tasks may involve structured and semi-structured data such as CSVs, JSON files, logs, reports, survey results, vendor assessments, or financial and operational documents.
Your work will help measure how effectively AI systems perform complex analytical workflows involving cross-referencing, contradiction detection, anomaly identification, and statistical reasoning across multiple data sources.
What does the day-to-day look like?
- Design and author multi-agent benchmark tasks centered on complex data analysis workflows
- Create realistic synthetic datasets or curate real-world style datasets across domains such as finance, operations, security, or market analysis
- Build tasks that require agents to perform cross-referencing, anomaly detection, contradiction identification, and statistical computation across multiple sources
- Develop decomposition guides that split analytical work across specialist sub-agents such as financial, technical, security, or operations analysts
- Write precise oracle logic or verification scripts that validate specific analytical conclusions rather than generic summaries
- Create reproducible evaluation environments using Python and Docker
- Review task performance signals to ensure strong separation between weaker and stronger agentic systems
- Refine tasks to improve determinism, clarity, difficulty, and scoring quality
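To give a flavor of the "oracle logic" mentioned above, here is a minimal sketch of a verification script that checks a specific analytical conclusion rather than a generic summary. The dataset, column names, and answer key (`duplicate_invoice_id`) are all hypothetical, not part of any actual SwarmBench task:

```python
import io
import pandas as pd

# Hypothetical vendor-invoice data; in a real task this would live in the
# container's filesystem alongside the agent's workspace.
CSV = """vendor,invoice_id,amount
Acme,1001,250.00
Acme,1002,250.00
Globex,2001,980.50
Globex,2001,980.50
"""

def oracle(agent_answer: dict) -> bool:
    """Recompute the ground truth (the duplicated invoice) from the data
    and compare it against the agent's reported conclusion."""
    df = pd.read_csv(io.StringIO(CSV))
    dupes = df[df.duplicated(subset=["vendor", "invoice_id"], keep=False)]
    expected_id = int(dupes["invoice_id"].iloc[0])
    return agent_answer.get("duplicate_invoice_id") == expected_id

print(oracle({"duplicate_invoice_id": 2001}))  # True for this dataset
```

Deriving the expected value from the dataset inside the script, rather than hard-coding it, keeps the oracle deterministic and reusable when the synthetic data is regenerated.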
Requirements
- 5+ years of experience in data analysis
- Strong proficiency in SQL and Python for data analysis and scripting (pandas, NumPy, or similar)
- Experience working with real-world, messy datasets (CSV, JSON, logs, reports)
- Ability to design non-trivial analytical questions with clear, specific, and verifiable answers
- Solid understanding of statistical concepts (averages, distributions, outliers, correlations)
- Familiarity with AI coding benchmark environments (e.g., SWE-bench, Terminal-Bench)
- Comfortable working with Docker (writing Dockerfiles, building images, debugging containers)
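As an illustration of the statistical concepts listed above, here is a minimal outlier check using the standard 1.5×IQR rule, written with only the Python standard library. The values are hypothetical:

```python
import statistics

def iqr_outliers(values: list[float]) -> list[float]:
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q = statistics.quantiles(values, n=4, method="inclusive")
    q1, q3 = q[0], q[2]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# Hypothetical daily transaction totals with one anomaly
print(iqr_outliers([100, 102, 98, 101, 99, 500]))  # [500]
```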
Perks of Freelancing With Turing
- Work on cutting-edge AI projects with leading foundation model companies
- Collaborate on high-impact work at the frontier of LLM evaluation and reasoning
- Remote, flexible opportunities with global teams
- Competitive compensation based on experience and project scope
Offer Details
- Commitment Required: 8 hours per day, with a 4-hour overlap with PST.
- Employment Type: Contractor position (Note: this role does not include medical/paid leave).
- Duration of Contract: 4 weeks; expected start date is next week.