About Turing
Turing is one of the world’s fastest-growing AI companies, accelerating the advancement and deployment of powerful AI systems. Turing helps customers in two ways: working with the world’s leading AI labs to advance frontier model capabilities in thinking, reasoning, coding, agentic behavior, multimodality, multilinguality, STEM, and frontier knowledge; and leveraging that work to build real-world AI systems that solve mission-critical priorities for companies.
Role Overview
We are looking for strong, detail-oriented software practitioners to help evaluate and improve datasets for agentic coding models.
This role involves working with realistic coding tasks in an agentic coding harness, reviewing model trajectories, verifying solutions, and producing high-quality annotations.
Depending on the assignment, the work may include:
- Online evaluations: Manually interacting with blinded models on predefined tasks, then ranking and grading resulting trajectories
- Offline evaluations: Designing realistic coding tasks, calibrating them through user simulation, writing task-specific rubrics, and grading generated trajectories
This is not a basic annotation role. Candidates are expected to read and debug code, validate behavior, follow detailed process rules, and make consistent judgment calls across model runs.
We are specifically looking for candidates with enough engineering maturity to independently work on realistic software tasks, not just toy problems or shallow code-review exercises.
What the day-to-day looks like
- Execute realistic coding tasks within the assigned agentic coding harness while maintaining model blindness and session independence
- Follow task instructions, milestones, planned interactions, and evaluation guardrails consistently across runs
- Verify model outputs by reading code, running commands, checking logs, and inspecting generated artifacts
- Perform targeted validation of outputs using tests, scripts, and manual checks
- Write clear, specific, evidence-based rationales for trajectory rankings and assessments
- Design multi-step, realistic coding tasks (offline work), including user intent and milestone structure
- Create and refine task-specific rubrics and binary evaluation criteria
- Review completed work for quality, completeness, consistency, and schema compliance
- Identify and escalate broken environments, unclear instructions, or process gaps with clear supporting evidence
Requirements
Software Engineering Fluency (Mandatory)
- 5+ years of experience in software engineering, QA, developer tooling, data/ML engineering, or similar code-heavy roles
- Strong hands-on experience in at least 1–2 programming languages or ecosystems
- Representative languages include: Python, JavaScript/TypeScript, Rust, Java, C/C++, Bash/CLI environments, Haskell, Swift, SQL, or other production-relevant ecosystems
- Ability to:
  - Read and understand unfamiliar codebases
  - Run and interpret tests, scripts, and CLI tools
  - Debug issues and reason about edge cases or partial fixes
  - Evaluate whether an implementation is functionally correct
Terminal & Tooling Skills (Mandatory)
- Comfortable working in Linux/Ubuntu-like environments
- Proficient with:
  - Terminal workflows
  - Git basics
  - Code editors or IDEs
  - Package managers and test runners
  - JSON, YAML, and Markdown
- Familiarity with Docker and reproducible environments (strong plus, especially for offline work)
Coding-Agent Workflow Familiarity (Mandatory)
- Comfortable working with, or quickly adapting to, agentic coding environments such as OpenCode, Claude Code, Cursor, or similar coding-agent tools
Quality Judgment & Annotation Accuracy (Mandatory)
Ability to:
- Compare multiple model trajectories and identify meaningful differences
- Distinguish correctness from style, communication quality, and agent behavior
- Evaluate solutions consistently using defined rubrics
- Follow detailed process instructions without deviation
- Maintain consistency across repeated or similar evaluations
- Write concise, evidence-based rationales (not generic summaries)
Work Style
- Highly detail-oriented and process-driven
- Comfortable with repetitive, high-precision evaluation work
- Able to maintain consistency across long tasks and multiple model runs
- Proactively flags ambiguity instead of making assumptions
- Balances realism with strict evaluation consistency
Additional Preferred Qualifications (Offline / Senior Candidates)
- Strong Docker skills and experience building/debugging reproducible environments
- Experience working in large, complex repositories (not just small or greenfield projects)
- Demonstrated originality and sound engineering judgment in defining technical problems
- Ability to design realistic, non-trivial tasks that go beyond tutorials, README flows, or simple bug fixes
Perks of Freelancing With Turing
- Work on cutting-edge AI projects with leading foundation model companies
- Collaborate on high-impact work at the frontier of LLM evaluation and reasoning
- Remote, flexible opportunities with global teams
- Competitive compensation based on experience and project scope
Offer Details
- Commitment Required: 8 hours per day with a 4-hour overlap with PST.
- Employment Type: Contractor position (Note: this role does not include medical/paid leave).
- Duration of Contract: 5 weeks; expected start date is next week.