ai& (aiand)
Posted 1 month ago

Member of Technical Staff - Inference Serving

Yokohama · Full-time · Lead
Member of Technical Staff · Other

Quick Summary

Overview

About ai&

ai& is a new global AI technology company dedicated to meeting the world's growing demand for AI. Our vision is twofold: to serve as a premier AI lab specializing in localization, and to act as a global infrastructure and compute provider.

Key Responsibilities

Runtime Selection & Deep Optimization: Lead the evaluation, integration, and continuous tuning of diverse inference frameworks to ensure best-in-class performance across LLM, Video, and Multimodal workloads.

Technical Tools
concurrency · distributed-systems · i18n

ai& is a new global AI technology company dedicated to meeting the world's growing demand for AI. Our vision is twofold: to serve as a premier AI lab specializing in localization, and to act as a global infrastructure and compute provider. We are building a unified, optimized global platform that integrates next-generation data centers and infrastructure, heterogeneous compute serving, and advanced model services. We believe that the most effective way to build and scale AI is to own the stack from top to bottom.

At ai&, we empower small teams with the autonomy needed to tackle significant challenges. Our approach is to deconstruct large problems into manageable components and solve complex issues collaboratively. We seek highly motivated, mission-driven individuals who demonstrate strong personal agency. We value curiosity as the foundation of talent, and we are looking for people eager to develop alongside our evolving technology and expanding business.

We are actively hiring worldwide, with a presence in Tokyo, San Francisco, Austin, and Toronto. We are more than happy to meet exceptional talent where they are.

As an inference & serving engineer, your objective is to build a high-performance, multi-tenant serving stack that squeezes maximum utilization out of heterogeneous hardware. This involves navigating the trade-offs between various state-of-the-art inference frameworks and engines, selecting and optimizing the right runtime for the right workload. The scope of work is not limited to Large Language Models; it extends to the frontier of Generative AI, including high-throughput Video generation and complex Multimodal systems where memory pressure and compute requirements are significantly more demanding.
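The runtime-selection trade-off described above can be sketched as a toy dispatcher that routes each workload class to an engine. This is purely illustrative: the engine names reflect common open-source options, and the thresholds and routing policy are my own assumptions, not the team's actual stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeChoice:
    """An engine pick plus the rationale behind it."""
    engine: str
    reason: str


def select_runtime(workload: str, batch_size: int) -> RuntimeChoice:
    """Toy dispatcher: pick an inference engine per workload class.

    Illustrates the kind of decision matrix the role owns; real
    selection would weigh hardware, model size, and SLOs.
    """
    if workload == "llm":
        # Continuous-batching engines shine at high concurrency;
        # a compiled runtime tends to win on single-stream latency.
        if batch_size > 1:
            return RuntimeChoice("vllm", "continuous batching at high concurrency")
        return RuntimeChoice("tensorrt-llm", "lowest single-stream latency")
    if workload == "video":
        # Video generation is memory-bound; favor a runtime with offloading.
        return RuntimeChoice("custom-diffusion-runtime",
                             "activation offloading under memory pressure")
    if workload == "multimodal":
        return RuntimeChoice("vllm", "native multimodal input support")
    raise ValueError(f"unknown workload: {workload}")
```

In practice this table would be driven by continuous benchmarking rather than hard-coded rules, but it captures the core idea: the right runtime for the right workload.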

Beyond just deploying models at scale, this role is responsible for building a robust system that bridges the gap between boutique, high-performance clusters and massive, multi-node deployments as the company grows. This requires a deep understanding of the "Inference Triangle"—constantly tuning the stack to find the optimal equilibrium between low-latency (TTFT/ITL), high-throughput, and inference quality (Precision/Quantization). The ideal candidate is a hands-on engineer who views the entire GPU fleet as a single, programmable compute fabric and is eager to get their hands dirty at every level of the stack.
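The latency corner of the "Inference Triangle" is typically quantified by the two metrics named above: time-to-first-token (TTFT) and inter-token latency (ITL). A minimal sketch of how both fall out of per-token arrival timestamps follows; the function name and signature are my own for illustration, not from the posting.

```python
def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Compute TTFT and mean ITL for one streamed request.

    request_start: wall-clock time (seconds) the request was issued.
    token_times:   arrival timestamp (seconds) of each generated token.
    Returns (TTFT, mean ITL), both in seconds.
    """
    if not token_times:
        raise ValueError("no tokens produced")
    # TTFT: delay until the first token arrives (prefill-dominated).
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # ITL: mean gap between consecutive tokens (decode-dominated).
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, sum(gaps) / len(gaps)
```

Tuning the stack moves these numbers against each other: larger batches raise throughput but stretch ITL, while aggressive quantization lowers both at a potential cost in quality.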

Qualifications

  • Inference Engine: Deep experience with the internals of modern runtimes, demonstrated by prominent contributions to inference engine ecosystems, whether OSS projects or proprietary engines at top-tier AI labs.

  • Multimodal Domain Knowledge: Understanding of the specific challenges involved in serving Large Language Models alongside Video and Vision-based generative models.

  • Scale-First Engineering: A track record of building and managing distributed systems that have evolved from small-scale proofs-of-concept to large-scale production deployments.

  • Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.

Location & Eligibility

Where is the job?
Yokohama
Hybrid; some on-site time required
Who can apply?
Open to applicants worldwide

Listing Details

Posted
March 20, 2026
First seen
May 6, 2026
Last seen
May 9, 2026

