Coupang

Tech Infra Engineer

Technical Tools
AWS, Azure, C++, GCP, Java, Kafka, Kubernetes, Prometheus, Python, Redis, distributed systems, forecasting, performance optimization, statistical modeling

As a Staff Systems Engineer in Developer Platform, you will partner with leaders of multiple platform teams. You will work closely with product to define and implement simple solutions to complex orchestration problems, building a highly scalable, reliable, and efficient platform for our customers. You will engineer and develop Kubernetes controllers, operators, and node-level daemons for the application runtime; drive performance tuning and scaling; and design multi-cluster control-plane capabilities that scale to millions of pods across thousands of clusters.

Responsibilities

  • Engineer and develop a unified application platform for hybrid (multi-cluster, multi-region, multi-cloud) application management using Kubernetes controllers and feedback-driven control systems to meet SLOs.
  • Deliver end-to-end automation for application lifecycle (deployments, rollouts, failovers, policy enforcement) to minimize manual work for users.
  • Drive fleet-wide optimization for cost, performance, and latency through data-informed controls and capacity management, improving $/RPS and tail latency.
  • Build resilient, multi-tenant control planes and workflows that safely scale to millions of pods across thousands of clusters.
  • Ensure reliability, security, and governance with clear guardrails, safe defaults, and automated remediation.
  • Partner with product and customers to turn complex orchestration problems into simple, reusable platform primitives and great developer experiences.
  • Champion observability and continuous improvement with measurable, outcome-focused metrics.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, Math, or a closely related field (or equivalent experience)
  • 10+ years in backend software development and operations
  • Recent experience designing and operating large-scale distributed systems (last 3 years)
  • Fluency in one or more of Go, C/C++, Python, or Java
  • Proven track record of delivering mission-critical systems
  • Experience with cloud computing on AWS, Azure, or GCP

Requirements

  • Kubernetes API machinery and semantics: SSA, SMP, server-side dry-run, watches/informers/listers, rate-limited workqueues, finalizers, owner references, leader election, API Priority and Fairness
  • Controllers/operators and node daemons in Go: client-go/controller-runtime, reconciliation patterns, backoff and retry, idempotency, partitioned/sharded controllers, HA and failover
  • CRDs and webhooks: versioning, conversion functions/webhooks, validating/mutating admission webhooks, policy frameworks and best practices
  • Pod/runtime semantics: sidecars, init/ephemeral containers, probes (readiness/liveness/startup), lifecycle hooks, termination behavior, PDBs, QoS classes, ResourceQuota/LimitRange, topology spread, affinity/anti-affinity
  • Scaling systems: HPA (resource/custom/external metrics), VPA, cluster autoscaler; multi-dimensional scaling, health-aware/autopilot-style policies; external metrics adapters and SLO-driven scaling
  • Federated and multi-cluster: placement/propagation, failover, drift detection, reconciliation strategies; consistent hashing and partitioning for scale
  • Distributed systems: CRDTs and eventual consistency paradigms; Raft/memberlist/gossip; deep familiarity with etcd, Kafka, Redis and their operational characteristics (compaction, backpressure, retention, failover)
  • Observability and data: Prometheus (cardinality control, recording rules), tracing; experience with vector databases for search and diagnostics; strong time-series forecasting (classical + ML) and statistical modeling for proactive optimization
  • Languages and interfaces: Go (primary), Java/Python as needed; gRPC/protobuf; JSON/YAML/Jsonnet
  • Leadership: ability to handle multiple competing priorities in a fast-paced environment and lead the delivery of large-scale services for complex business offerings

Location & Eligibility

Where is the job
Z-Test & Templates Only
On-site at the office
Who can apply
Same as job location
Listed under
Worldwide

Listing Details

Posted
March 13, 2026
First seen
April 3, 2026
Last seen
May 8, 2026

Posting Health

Days active
34
Repost count
0
Trust Level
31%
Scored at
May 8, 2026

Coupang
greenhouse

Coupang is a U.S. retail company known for its fast delivery services and commitment to customer satisfaction.

Employees
5k+
Founded
2010