AI Cloud Infrastructure Engineer - Fury Team

Sunnyvale | mid

Engineering | Data Science | DevOps & Infrastructure

Overview

The future of defense will be decided by those who field intelligent machines at scale. At Scout AI, we’re developing Fury, the first robotic foundation model for defense, to give U.S. forces overwhelming, adaptable, and autonomous power across every domain. Fury enables human operators to command fleets of robots through natural language, and empowers those machines to sense, decide, and act together as one. This mission will ask everything of us: urgency, precision, and relentless work.

Responsibilities

→Design and implement data pipelines for ingesting, transforming, and storing petabytes of multimodal data from Fury’s robotic and operator systems
→Develop internal tooling for dataset exploration, curation, versioning, and quality monitoring over time
→Build and maintain distributed training infrastructure (cloud and on-prem) for large-scale multimodal and foundation model training
→Implement job orchestration workflows for launching, tracking, and debugging large-scale model runs
→Identify and remediate bottlenecks in compute, memory, storage, and network performance to optimize throughput and cost efficiency
→Collaborate with AI, autonomy, and systems teams to ensure data and training infrastructure supports real-time and mission-critical use cases
→Maintain observability and reliability tooling for training and inference pipelines
→Stay current on best practices in MLOps, distributed training frameworks, and AI infrastructure at scale

Requirements

3+ years of experience in ML infrastructure, MLOps, or large-scale data systems
Proven experience with distributed training (PyTorch DDP, DeepSpeed, Ray, or similar) and workflow orchestration (Kubernetes, Airflow, or equivalent)
Strong proficiency in Python and cloud-native infrastructure (AWS, GCP, or Azure)
Deep understanding of data engineering (ETL pipelines, object storage, data versioning, metadata management)
Familiarity with containerization and deployment (Docker, Kubernetes) and monitoring systems (Prometheus, Grafana)
Experience optimizing GPU cluster utilization, scaling training jobs, and profiling model performance
Bachelor’s degree or higher in Computer Science, Electrical Engineering, or related technical field
Bonus: Experience with edge-deployed ML systems, federated training, or robotic data collection pipelines
Must be a U.S. Person due to required access to U.S. export-controlled information or facilities