About Turing
Turing is one of the world's fastest-growing AI companies, accelerating the advancement and deployment of powerful AI systems.
Turing helps customers in two ways: working with the world's leading AI labs to advance frontier model capabilities and leveraging that work to build real-world AI systems that help businesses solve complex problems and unlock new opportunities.
Role Overview
We are seeking experienced software engineers to join Turing's AI Evaluation team. As an AI Evaluation Engineer, you will design, author, and validate software engineering benchmark tasks that are used to evaluate the capabilities of advanced AI systems across Python, Java/JVM, and Web development environments.
What does day-to-day look like?
- Design realistic software engineering evaluation tasks for AI agents
- Write clear, unambiguous instructions that define expected outputs, constraints, and success criteria
- Create reference solutions that successfully solve the authored tasks
- Develop verification criteria and automated test descriptions for task validation
- Author domain-specific skill files that teach workflows, conventions, and best practices without revealing answers
- Ensure consistency between benchmark variants while maintaining rigorous evaluation standards
- Review task quality, edge cases, and failure modes to improve benchmark reliability
- Collaborate with AI researchers, evaluators, and engineering teams to refine benchmark quality
- Contribute domain expertise in Software Development, Python, Java/JVM, or Web/UI technologies
Requirements
- Bachelor's degree or higher in Computer Science, Software Engineering, or a related technical field
- 5+ years of hands-on software development experience
- Strong expertise in at least one of the following domains:
  - Python Development
  - Java/JVM Ecosystem
  - Web Application Development (Frontend, Backend, or Full Stack)
- Excellent written English and ability to write precise technical instructions
- Strong understanding of software engineering workflows, debugging, testing, and code quality practices
- Ability to think critically about how AI systems interpret instructions and solve technical problems
- Experience working with structured file formats such as JSON, Markdown, YAML, DOCX, or XLSX
Nice to have:
- Experience with LLM evaluation, prompt engineering, or AI benchmarking
- Experience creating technical assessments, coding challenges, or educational content
- Experience with Docker, containers, or cloud-based development environments
Perks of Freelancing With Turing
- Work on cutting-edge AI projects with leading AI research organizations
- Flexible remote work opportunities
- Opportunity to influence the evaluation of next-generation AI systems
- Collaborate with a global network of highly skilled professionals
Offer details:
- Commitment required: 40 hours per week, with 4 hours of overlap with PST
- Engagement type: Contractor assignment/freelancer (no medical/paid leave)
- Duration of contract: 2 months (expected start date is next week)
Evaluation Process
- One round of technical interview or an automated live coding challenge