agi-inc · 1mo ago

Research Engineer - Evals

San Francisco Office · full-time · mid

Other · Research Engineer

Quick Summary

Overview

Think Different. Build the Future. 🚀

Our Mission

Build everyday AGI. Trustworthy, consumer-grade agents that redefine human–AI collaboration for millions.

Responsibilities

Models, agents, and product features all ship behind one question: did this actually get better? Without a strong evals function, the lab ships vibes. With one, every training run, every prompt change, every agent capability moves a number we trust — and the team makes decisions on real signal, not the loudest opinion in the room.

You'll build the eval harness for AGI — across model capability, agentic behavior, on-device performance, and end-user experience. You'll set the bar for what counts as "shipped" and protect it from the gravity of product deadlines.

The eval suites that gate every model and agent release — capability, behavior, regressions, and human-rated rubrics that catch what automated evals miss
The dashboards and tooling that make researcher experiment loops fast and leadership decisions easy
The bar — what counts as ready to ship, and how we know

Research, by making sure what we measure is what we want
Product engineers, by instrumenting real-user behavior on real devices
Partnerships, by translating "did it get better" into language an OEM partner can hold us to

How to measure non-deterministic systems — agent eval, tool use, long-horizon tasks, multilingual behavior
How to push back on a metric that's being gamed without breaking the team

On-device perf trade-offs and how they show up in real-user evals
What QA-ing AI at OEM scale actually looks like
The realities of shipping consumer agents to production partners

After 30 days — You've audited every eval we run today and produced a sharp doc on what's good, what's noise, and what's missing. You've fixed the most embarrassing gap.

After 60 days — You've stood up a new eval surface — agentic, on-device, or behavioral — and the team is making real decisions on its output. Researchers come to you before launching a run, not after.

After 90 days — Releases now ship against your eval bar, not a vibe-check. You've caught a regression that would have shipped, and cleared a launch the team was nervous about. You're shaping the research roadmap by surfacing where we're flat, where we're climbing, and where we're lying to ourselves.

What We Offer

Competitive cash and meaningful equity. Top-tier relocation and immigration support. SF, in person.

Send a link to an eval, benchmark, or measurement system you built — and one paragraph on what decision it changed. Plus your resume or LinkedIn. Every exceptional candidate hears back within 48 hours.