Evals Software Engineer
Apollo Research
London
Engineering
Application deadline: Our hiring cycle for 2025 has concluded for now. New applications will be considered from 2026 onwards.
ABOUT APOLLO RESEARCH
The capabilities of current AI systems are evolving at a rapid pace. While these advancements offer tremendous opportunities, they also present significant risks, such as the potential for deliberate misuse or the deployment of sophisticated yet misaligned models. At Apollo Research, our primary concern lies with deceptive alignment, a phenomenon where a model appears to be aligned but is, in fact, misaligned and capable of evading human oversight.
Our approach focuses on behavioral model evaluations, which we then use to audit real-world models. We combine these black-box approaches with applied interpretability. In our evaluations, we focus on LM agents, i.e. LLMs with agentic scaffolding similar to AIDE or SWE-agent. We also study model organisms in controlled environments (see our security policies), e.g. to better understand capabilities related to scheming.
At Apollo, we aim for a culture that emphasizes truth-seeking, being goal-oriented, giving and receiving constructive feedback, and being friendly and helpful. If you’re interested in more details about what it’s like working at Apollo, you can find more information here.
THE OPPORTUNITY
We're seeking a Software Engineer who will enhance our capability to evaluate Large Language Models (LLMs) through building critical tools and libraries for our Evals team. Your work will directly impact our mission to make AI systems safer and more aligned.
What You'll Accomplish in Your First Year
1. Accelerate our frontier LLM evaluations research by leading the design and implementation of software libraries and tools that underpin our end-to-end research workflows
2. Ensure the reliability of our experimental results by building tools that identify subtle changes in LLM behavior and maintain integrity across our research
3. Shape the vision for our internal software platform, leading key decisions about how researchers will run workloads, interact with data, analyze results, and share insights
4. Increase team productivity by providing design guidance, debugging, and technical support to unblock researchers and enable them to focus on their core research
5. Build expertise working with state-of-the-art (SOTA) AI systems and tackling the unique challenges of building software around them
Key Responsibilities
- Rapidly prototype and iterate on internal tools and libraries for building and running frontier language model evaluations
- Lead the development of major features from ideation to implementation
- Collaboratively define and shape the software roadmap and priorities
- Establish and advocate for good software design practices and codebase health
- Establish design patterns for new types of evaluations
- Build LLM agents that automate our internal software development and research
- Work closely with researchers to understand what challenges they face
- Assist researchers with implementation and debugging of research code
- Communicate clearly about technical decisions and tradeoffs
Job Requirements
You must have experience writing production-quality Python code. We are looking for strong generalist software engineers with a track record of taking ownership. Candidates may demonstrate these skills in different ways. For example, you might have one or more of the following:
- Led the development of a successful software tool or product over an extended period (e.g. 1 year or more)
- Started and built the tech stack for a company
- Worked your way up in a large organisation, repeatedly gaining more responsibility and influencing a large part of the codebase
- Authored and/or maintained a popular open-source tool or library
- 5+ years of professional software engineering experience
The following experience would be a bonus:
- Experience working with LLM agents or LLM evaluations
- Infosecurity / cybersecurity experience
- Experience working with AWS
- Interest in AI Safety
We want to emphasize that people who feel they don’t fulfill all of these characteristics but think they would be a good fit for the position nonetheless are strongly encouraged to apply. We believe that excellent candidates can come from a variety of backgrounds and are excited to give you opportunities to shine.
Representative Projects
- Implement an internal job orchestration tool which allows researchers to run evals on remote machines.
- Build out an eval runs database which stores all historical results in a queryable format.
- Implement LLM agents to automate internal software engineering and research tasks.
- Design and implement research tools for loading, viewing and interacting with transcripts from eval runs.
- Establish internal patterns and conventions for building new types of evaluations within the Inspect framework (see the sketch after this list).
- Optimize the CI pipeline to reduce execution time and eliminate flaky tests.
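For a flavor of the kind of code involved: below is a minimal, illustrative evaluation task written against the open-source Inspect framework mentioned above. The task name, dataset, solver, and scorer are placeholders chosen for the sketch, not an actual Apollo evaluation.

```python
# A minimal, hypothetical Inspect eval task (illustrative only).
# Requires the open-source inspect-ai package.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def toy_eval() -> Task:
    # One hand-written sample standing in for a real eval dataset.
    return Task(
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        solver=generate(),   # query the model once, no agent scaffolding
        scorer=includes(),   # pass if the target string appears in the output
    )
```

A task like this could then be run against a model of choice with Inspect's CLI, e.g. `inspect eval toy_eval.py --model <provider/model>`; the internal tooling described above would sit around workflows like this one.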
ABOUT THE TEAM
The current evals team consists of Mikita Balesni, Jérémy Scheurer, Alex Meinke, Rusheb Shah, Bronson Schoen, Andrei Matveiakin, Felix Hofstätter, and Axel Højmark. Marius Hobbhahn manages and advises the team, though team members lead individual projects. You would work closely with Rusheb and Andrei, who are the full-time software engineers on the evals team, but you would also interact a lot with everyone else. You can find our full team here.