L3 Hardware Support Lead
We are looking for a Lead Hardware Support Engineer to build and lead a production-grade L3 hardware support and escalation function for large-scale, GPU-dense datacenter infrastructure. This role owns high-severity incident response, complex hardware and firmware investigations, and enterprise customer escalations under contractual SLAs.
You will establish processes from scratch, lead incident command during critical outages, and build a team capable of operating across distributed, multi-region environments. The role combines hands-on technical depth with operational leadership and people management. You will be accountable for fleet stability, escalation efficiency, root cause clarity, and continuous improvement across server hardware platforms.
Responsibilities
- Building and leading the L3 and escalation support function for datacenter server infrastructure across multiple regions
- Acting as Incident Commander for high-severity production incidents, driving structured mitigation and communication
- Owning incident response, problem management, and cross-team escalation workflows end-to-end
- Supporting enterprise bare metal customers under contractual SLAs, including executive-level stakeholder communication
- Driving root cause analysis for hardware, firmware, and platform-level failures with clear corrective actions
- Managing vendor escalations with ODMs and OEMs through formal support channels and direct engagement
- Partnering with datacenter operations, hardware engineering, and infrastructure teams to improve reliability at fleet scale
- Establishing KPIs, escalation standards, and operational playbooks for production hardware support
- Hiring, coaching, and scaling a high-performing support engineering team
- Ensuring continuous improvement of response times, incident quality, and customer experience
Requirements
- Experience building or leading an L3 and escalation support function for datacenter server infrastructure in distributed, multi-region environments
- Experience supporting enterprise bare metal customers under contractual SLAs
- Strong incident management leadership experience, including serving as Incident Commander
- Proven ability to build and formalize incident response, problem management, and cross-team escalation processes from scratch
- People management experience, including hiring, coaching, and performance management
- Strong English communication skills, written and verbal
Nice to Have
- Deep troubleshooting capability across Linux, server hardware, and firmware (BIOS/BMC), with the ability to guide investigations at a systems engineer level
- Strong familiarity with GPU server platforms and common diagnostics (for example: nvidia-smi, dcgmi, Linux log correlation)
- Experience driving ODM and OEM vendor escalations through support portals and direct channels
- Scripting skills (bash and basic Python) for troubleshooting and lightweight analytics
- Exposure to OCP-based hardware platforms
What We Offer
- Remote work within the United States
- Full-time position
Listing Details
- First seen: April 3, 2026
- Last seen: April 26, 2026
Posting Health
- Days active: 22
- Repost count: 0
- Trust Level: 39%
- Scored at: April 26, 2026
Nebius is a cutting-edge AI cloud platform that offers scalable infrastructure for developing and deploying AI solutions.