agi-inc

Research Engineer - Evals

San Francisco Office, full-time, mid
Other, Research Engineer

Quick Summary

Overview

Think Different. Build the Future. 🚀

Our Mission

Build everyday AGI. Trustworthy, consumer-grade agents that redefine human–AI collaboration for millions.

Responsibilities

Models, agents, and product features all ship behind one question: did this actually get better? Without a strong evals function, the lab ships vibes. With one, every training run, every prompt change, every agent capability moves a number we trust, and the team makes decisions on real signal, not the loudest opinion in the room.

You'll build the eval harness for AGI across model capability, agentic behavior, on-device performance, and end-user experience. You'll set the bar for what counts as "shipped" and protect it from the gravity of product deadlines.

  • The eval suites that gate every model and agent release: capability, behavior, regressions, and human-rated rubrics that catch what automated evals miss

  • The dashboards and tooling that make researcher experiment loops fast and leadership decisions easy

  • The bar: what counts as ready to ship, and how we know

  • Research, by making sure what we measure is what we want

  • Product engineers, by instrumenting real-user behavior on real devices

  • Partnerships, by translating "did it get better" into language an OEM partner can hold us to

  • How to measure non-deterministic systems: agent eval, tool use, long-horizon tasks, multilingual behavior

  • How to push back on a metric that's being gamed without breaking the team

  • On-device perf trade-offs and how they show up in real-user evals

  • What QA-ing AI at OEM scale actually looks like

  • The realities of shipping consumer agents to production partners

After 30 days: You've audited every eval we run today and produced a sharp doc on what's good, what's noise, and what's missing. You've fixed the most embarrassing gap.

After 60 days: You've stood up a new eval surface (agentic, on-device, or behavioral), and the team is making real decisions on its output. Researchers come to you before launching a run, not after.

After 90 days: Releases now ship against your eval bar, not a vibe-check. You've caught a regression that would have shipped, and cleared a launch the team was nervous about. You're shaping the research roadmap by surfacing where we're flat, where we're climbing, and where we're lying to ourselves.

What We Offer

Competitive cash and meaningful equity. Top-tier relocation and immigration support. SF, in person.

Send a link to an eval, benchmark, or measurement system you built, and one paragraph on what decision it changed. Plus your resume or LinkedIn. Every exceptional candidate hears back within 48 hours.

Location & Eligibility

Where is the job
San Francisco Office
On-site at the office
Who can apply
Same as job location

Listing Details

Posted
May 27, 2026
First seen
May 27, 2026
Last seen
May 27, 2026

Posting Health

Days active
0
Repost count
0
Trust Level
52%
Scored at
May 27, 2026
