LLM Pre-training & Distributed Engineer (AI Infrastructure)


Quick Summary

Key Responsibilities

Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM. Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.


We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive experience in systems engineering to ensure efficient and reliable training processes.

Responsibilities

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Requirements

  • Deep expertise in 3D parallelism (data, tensor, pipeline).
  • Experience managing SLURM- or Kubernetes-based GPU clusters.
  • Strong systems engineering background (C++, CUDA, Python).
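The checkpointing responsibility above can be sketched in a framework-agnostic way. The snippet below is a minimal illustration, not this team's actual stack: the file path, `state` dict, and step counter are all hypothetical stand-ins, and in a real PyTorch run the state would hold model and optimizer `state_dict()`s saved via `torch.save`. The core idea it demonstrates is atomic checkpointing (write to a temp file, then rename) so that a crash mid-save never corrupts the last good checkpoint, and a restarted job resumes from it automatically.

```python
import os
import pickle
import tempfile

CKPT = "checkpoint.pkl"  # hypothetical path; real runs use shared storage

def save_checkpoint(state, path=CKPT):
    # Write to a temp file in the same directory, then atomically rename.
    # A crash during the write leaves the previous checkpoint intact.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT):
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

def train(total_steps, ckpt_every=3):
    # Toy training loop: recovery is automatic because the loop always
    # starts from whatever step the last checkpoint recorded.
    state = load_checkpoint()
    for step in range(state["step"], total_steps):
        state["step"] = step + 1  # stand-in for a real optimizer step
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state)
    return state["step"]
```

If the job dies between checkpoints, simply relaunching `train()` replays only the steps since the last save, which is the property that makes month-long runs survivable.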

 

Location & Eligibility

Where is the job
Boston, United States
On-site at the office
Who can apply
US
Listed under
United States

Listing Details

Posted
April 24, 2026
First seen
April 24, 2026
Last seen
May 4, 2026

Posting Health

Days active
10
Repost count
0
Trust Level
35%
Scored at
May 4, 2026

Signal breakdown

freshness · source trust · content trust · employer trust
Hyphenconnect
greenhouse

Web3 and AI talent recruitment agency based in Hong Kong with 700+ placements globally

