About the Role
Exciting opportunity to work directly with researchers at a top-8 frontier lab. The core objective of this role is to improve the reasoning and problem-solving capabilities of a target frontier model on STEM topics by designing, validating, and analyzing challenging benchmark tasks.
Key Responsibilities
- Task Design and Development: Design challenging, real-world data science problems that serve as the foundation for Colab Bench tasks.
- Content Generation: Integrate the problems into an agentic development environment, preparing all necessary components in Python, including:
  - Detailed instructions and an overview of the required task.
  - A golden solution that follows the instructions.
  - The necessary environment, including datasets, Python libraries, and metadata.
  - A test notebook containing unit tests that solutions must pass.
- Evaluation and Analysis: Evaluate cross-model performance on the tasks.
- Headroom Identification: Identify tasks where the target model fails to pass all tests, specifically those where the failure can be classified as a logical reasoning failure.
- Loss Extraction: Analyze the agent's steps (the agent trajectory) to identify and extract the model's core capability loss patterns.
Qualifications and Recruitment
- Expertise Focus: Applicants must have strong expertise in data science, ML, finance, and coding, with a deep background in frontier STEM.
- Target Candidates: We are actively recruiting PhD students from top schools in the US and highly skilled GitHub contributors. We will also consider a small cohort in India.
Offer Details
- Rate: ~$30/hour.
- Commitment: Minimum 30 hours per week (on weekdays).
- Employment Type: Contractor (no medical/paid leave).
- Duration: 3 months (expected start date: next week).
- Locations: India.