LLM Pre-training & Distributed Engineer (AI Infrastructure)


Quick Summary

Key Responsibilities

Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM. Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.


We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive experience in systems engineering to ensure efficient and reliable training processes.

Responsibilities

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Requirements

  • Deep expertise in 3D parallelism (data, tensor, pipeline).
  • Experience managing SLURM- or Kubernetes-based GPU clusters.
  • Strong systems engineering background (C++, CUDA, Python).
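The checkpointing responsibility above can be sketched in a framework-agnostic way. The snippet below is a minimal illustration, not this team's actual stack: the file path, `state` dict, and step counter are all hypothetical stand-ins, and in a real PyTorch run the state would hold model and optimizer `state_dict()`s saved via `torch.save`. The core idea it demonstrates is atomic checkpointing (write to a temp file, then rename) so that a crash mid-save never corrupts the last good checkpoint, and a restarted job resumes from it automatically.

```python
import os
import pickle
import tempfile

CKPT = "checkpoint.pkl"  # hypothetical path; real runs use shared storage

def save_checkpoint(state, path=CKPT):
    # Write to a temp file in the same directory, then atomically rename.
    # A crash during the write leaves the previous checkpoint intact.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT):
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

def train(total_steps, ckpt_every=3):
    # Toy training loop: recovery is automatic because the loop always
    # starts from whatever step the last checkpoint recorded.
    state = load_checkpoint()
    for step in range(state["step"], total_steps):
        state["step"] = step + 1  # stand-in for a real optimizer step
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state)
    return state["step"]
```

If the job dies between checkpoints, simply relaunching `train()` replays only the steps since the last save, which is the property that makes month-long runs survivable.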

 

Location & Eligibility

Where is the job
Boston, United States
On-site at the office
Who can apply
US
Listed under
United States

Listing Details

Posted
April 24, 2026
First seen
April 24, 2026
Last seen
May 4, 2026

Posting Health

Days active
10
Repost count
0
Trust Level
35%
Scored at
May 4, 2026

Signal breakdown

freshness · source trust · content trust · employer trust
Hyphenconnect
greenhouse

Web3 and AI talent recruitment agency based in Hong Kong with 700+ placements globally

