Inference.net

Senior Software Engineer - Model Performance

San Francisco · Full-time · Senior
Software Engineer · Software Engineering

Technical Tools
C++ · Docker · Kubernetes · Python · PyTorch

Help us make inference blazingly fast. If you love squeezing every last drop of performance out of GPUs, diving deep into CUDA kernels, and turning optimization techniques into production systems, we'd love to meet you.

About the Role


You will be responsible for making our inference stack as fast and efficient as possible. Your work spans from implementing known optimization techniques to experimenting with novel approaches, always with the goal of serving models faster and cheaper at scale.

Your north star is inference performance: latency, throughput, cost efficiency, and how quickly we can bring new model architectures into production. You'll work across the full inference stack—from CUDA kernels to serving frameworks—to find and eliminate bottlenecks. This role reports directly to the founding team. You'll have autonomy, a large compute budget, and technical support to push the limits of what's possible in model serving.

Responsibilities

  • Implement and productionize optimization techniques including quantization, speculative decoding, KV cache optimization, continuous batching, and LoRA serving

  • Deep dive into inference frameworks (vLLM, SGLang, TensorRT-LLM) and underlying libraries to debug and improve performance

  • Profile and optimize CUDA kernels and GPU utilization across our serving infrastructure

  • Add support for new model architectures, ensuring they meet our performance standards before going to production

  • Experiment with novel inference techniques and bring successful approaches into production

  • Build tooling and benchmarks to measure and track inference performance across our fleet (a minimal benchmark sketch follows this list)

  • Collaborate with applied ML engineers to ensure trained models can be served efficiently
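
For a concrete flavor of the benchmarking work above, here is a minimal sketch of a latency/throughput check against a vLLM server's OpenAI-compatible /v1/completions endpoint. The URL, model name, prompt, and request count are illustrative assumptions, not details from this listing.

```python
# Minimal benchmark sketch (illustrative): measures end-to-end latency and
# decode throughput against a vLLM server's OpenAI-compatible completions API.
# The URL, model name, and prompt below are assumptions, not from the listing.
import time
import requests

BASE_URL = "http://localhost:8000/v1/completions"  # hypothetical local vLLM server
MODEL = "meta-llama/Llama-3.1-8B-Instruct"         # placeholder model name


def run_once(prompt: str, max_tokens: int = 128) -> dict:
    """Send one completion request and record wall-clock latency."""
    start = time.perf_counter()
    resp = requests.post(
        BASE_URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=120,
    )
    resp.raise_for_status()
    latency = time.perf_counter() - start
    tokens = resp.json()["usage"]["completion_tokens"]
    return {"latency_s": latency, "tokens_per_s": tokens / latency}


if __name__ == "__main__":
    runs = [run_once("Explain KV cache reuse in one paragraph.") for _ in range(5)]
    print(f"avg latency: {sum(r['latency_s'] for r in runs) / len(runs):.2f} s")
    print(f"avg throughput: {sum(r['tokens_per_s'] for r in runs) / len(runs):.1f} tok/s")
```

A production version would add concurrent requests to exercise continuous batching and separate time-to-first-token from decode throughput; this sketch only captures per-request averages.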

Requirements

  • 2+ years of experience in ML systems, inference optimization, or GPU programming

  • Strong proficiency in Python and familiarity with C++

  • Hands-on experience with LLM inference frameworks (vLLM, SGLang, TensorRT-LLM, or similar)

  • Deep understanding of GPU architecture and experience profiling GPU workloads (see the profiling sketch after this list)

  • Familiarity with LLM optimization techniques (quantization, speculative decoding, continuous batching, KV cache management)

  • Experience with PyTorch and understanding of how models execute on hardware

  • Track record of measurably improving system performance

  • Experience with CUDA programming

  • Familiarity with serving non-LLM models (TTS, vision, embeddings)

  • Experience with distributed inference and multi-GPU serving

  • Contributions to open-source inference frameworks

  • Experience with Docker and Kubernetes
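
As a concrete example of the GPU profiling called out above, here is a minimal torch.profiler sketch; sorting kernels by accumulated GPU time is a quick way to find hot spots. The repeated matmul is a stand-in workload chosen for illustration, not anything from this listing.

```python
# Minimal GPU profiling sketch using torch.profiler. The matmul loop is a
# stand-in workload for illustration; requires a CUDA-capable GPU.
import torch
from torch.profiler import ProfilerActivity, profile


def workload() -> None:
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    for _ in range(10):
        a = a @ b  # repeated matmul keeps the GPU busy
    torch.cuda.synchronize()  # wait so all kernels land inside the profile


with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    workload()

# Sort kernels by total GPU time to spot the hottest ops first.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```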

You don't need to tick every box. Curiosity and the ability to learn quickly matter more.

What We Offer


We offer competitive compensation, equity in a high-growth startup, and comprehensive benefits. The base salary range for this role is $220,000 to $320,000, depending on experience.

Inference.net is an equal opportunity employer. We welcome applicants from all backgrounds and don't discriminate based on race, color, religion, gender, sexual orientation, national origin, genetics, disability, age, or veteran status.

If you're excited about making AI inference faster for everyone, we'd love to hear from you. Please send your resume and GitHub to amar@inference.net and/or apply here on Ashby.

Location & Eligibility

Where is the job: San Francisco, on-site at the office
Who can apply: Same as job location

Listing Details

Posted: January 21, 2026
First seen: May 6, 2026
Last seen: May 8, 2026

Posting Health

Days active: 0
Repost count: 0
Trust level: 14%
Scored at: May 6, 2026

Signal breakdown: freshness, source trust, content trust, employer trust
