Senior Software Engineer, ML Platform (Stability & Infrastructure)
Isomorphic Labs is applying frontier AI to help unlock deeper scientific insights, faster breakthroughs, and life-changing medicines with an ambition to solve all disease.
The future is coming. A future enabled and enriched by the incredible power of machine learning. A future in which diseases are curtailed or cured starting with better and faster drug discovery.
Come and be part of an interdisciplinary team driving groundbreaking innovation, and play a meaningful role in achieving our ambitious goals as part of an inspiring and collaborative culture.
The world we want tomorrow is the one we’re building today. It starts with the culture at this company. It starts with you.
Isomorphic Labs (IsoLabs) was launched in 2021 to advance human health by building on and beyond the Nobel-winning AlphaFold system. Since then, our interdisciplinary team of drug discovery experts and machine learning specialists has built powerful new predictive and generative AI models that accelerate scientific discovery at digital speed.
Our name comes from the belief that there is an underlying symmetry between biology and information science. By harnessing AI’s powerful capabilities, we can use it to model complex biological phenomena to help design novel molecules, anticipate how drugs will perform and develop innovative medicines to treat and cure some of the world’s most devastating diseases.
We have built a world-leading drug design engine comprising AI models that are capable of working across multiple therapeutic areas and drug modalities. We are continually innovating on model architecture and developing cutting-edge capabilities to advance rational drug design.
Every day, and with each new breakthrough, we’re getting closer to the promise of digital biology, and achieving our ambitious mission to one day solve all disease with the help of AI.
We are building the largest foundation models in biotech and applying them immediately to cure disease. You will play a pivotal role in ensuring the reliability and scalability of the foundations that make this possible.
As a Principal Engineer, you will lead the effort to harden our systems so that our groundbreaking AI is built on an unshakeable base. You will work closely with the research and Applied ML teams to keep the infrastructure stable and reliable, and able to handle more data and larger models as we grow.
Responsibilities
- You will own the end-to-end strategy for platform reliability, with a specific focus on our accelerator (GPU/TPU) infrastructure and workload orchestration. You will move between high-level architectural design and hands-on systems engineering to eliminate friction in the researcher experience.
- Lead the reliability work for our global job scheduler. You will design and implement a robust "test harness" to safely validate infrastructure upgrades without impacting live research.
- Architect and optimize our next-generation inference services. You will solve core scaling limits, ensuring high-throughput performance and feature parity across our model serving stack.
- Overhaul our logging and monitoring systems to provide radical visibility. You will build proactive alerting and telemetry that identifies systemic failures before they impact research workflows.
- Improve our internal CI/CD stability, targeting a significant reduction in failure rates and much faster feedback loops for the engineering organization.
- Contribute to core technical decisions on tooling and architectural design while partnering with science, product, and operations teams to align infrastructure with biotech R&D cycles.
Requirements
- Proven experience in architecting and managing large-scale AI/ML workloads in a production environment.
- Expertise in cloud compute design, specifically within Google Cloud Platform (GCP).
- Orchestration: Significant experience deploying and managing complex workloads within Kubernetes (GKE).
- Professional familiarity with NVIDIA GPU generations and the intricacies of high-performance compute.
- Strong programming skills and a "reliability-first" approach to software development.
Nice to Have
- A career history that spans both ML Software Engineering and Infrastructure SRE roles.
- Experience leading multi-disciplinary projects and navigating complex stakeholder requirements in a fast-paced environment.
- Familiarity with workload scheduling, ML efficiency research, and hardware benchmarking.
- Experience with Google TPU generations and specialized ML-driven R&D cycles.
We are guided by our shared values. It's not about finding people who think and act in the same way. These values help to guide our work and will continue to strengthen it.
Listing Details
- Posted: April 8, 2026
- First seen: March 26, 2026
- Last seen: April 15, 2026
Posting Health
- Days active: 20
- Repost count: 0
- Trust Level: 50%
- Scored at: April 15, 2026