Ardent MC

Team Lead/Reliability Engineer

United States · Washington · Lead
Engineering · Reliability Engineer


At Ardent, we hire people who want more than a job — they want to serve a mission that matters. Our teams support the federal government’s most critical national security and defense priorities, helping protect the nation, strengthen resilience, and advance the technologies and capabilities that keep America secure. For veterans, cleared professionals, and purpose-driven innovators, Ardent is a place to continue serving alongside a team that understands the importance of the mission and the people behind it.

We also know top talent has choices, which is why we back our mission with benefits and flexibility that stand out: competitive pay, comprehensive health coverage, flexible PTO, federal holidays off, tuition reimbursement, professional development support, wellness stipends, and a culture that values and rewards hard work, dedication, and adaptability. If you want to build something meaningful, while enjoying the kind of flexibility and support that you need to do your best work — Ardent is where your next mission begins.


We are seeking a skilled Team Lead/Reliability Engineer to support our client's mission by enhancing Production Monitoring and ensuring optimal service delivery for their applications. This role involves proactive issue identification, incident resolution, and system health optimization within a 24x7x365 operational environment. The ideal candidate will lead monitoring solutions, manage other reliability engineers, and collaborate across IT and business teams to improve service reliability. Expertise in AWS environments, root cause analysis, and technical troubleshooting is essential, along with strong communication and leadership skills to drive continuous improvement.

Requirements

  • Experience in Production Monitoring & Support within a 24x7x365 operational environment.
  • Strong expertise in incident management, root cause analysis, and problem resolution for cloud-based applications.
  • Hands-on experience with Amazon Web Services (AWS) and cloud-based monitoring tools.
  • Ability to build and implement monitoring solutions, automate manual processes, and create alerts to ensure system stability.
  • Experience with system health monitoring and troubleshooting production issues.
  • Strong leadership skills to collaborate with IT, business, and infrastructure teams to improve production support processes.
  • Effective communication skills to provide status updates and incident reports to leadership and stakeholders.
  • Ability to develop and maintain technical documentation and knowledge base resources for production support.
  • Experience in triaging and resolving production incidents, assessing severity, and properly escalating issues.

Responsibilities

  • Manage team schedules, ensuring every shift is staffed every day of the year.
  • Be on-call 24/7 to assist in emergency situations.
  • Provide proactive, early notification of potential and actual issues affecting service delivery.
  • Communicate frequently and succinctly with PSPD leadership during and after incidents.
  • Identify trends and corrective measures.
  • Provide needed metrics to the PSPD leadership team.
  • The enhanced Production Monitoring Services Branch will provide resources to staff the operation 24x7x365. The resources should provide additional technical support and diagnosis.

Customer Facing:

  • Build monitoring and production support solutions that give customers visibility into our services.
  • Triage and resolve production incidents related to the cloud platform and participate in root cause analysis and postmortem discussions.
  • Assess initial severity, gather impacts, create tickets, engage support teams, and escalate issues properly as they arrive.

Optimizes Work Processes:

  • Participate in the creation and maintenance of technical and knowledge base documentation.
  • Troubleshoot production issues and collaborate in developing simple technical solutions.
  • Use diagnostic tools to maintain, troubleshoot, and restore standard service or data to systems.
  • Lead implementation of production support activities in an Amazon Web Services environment.
  • Lead technical and design discussions with IT to help enterprises speed their adoption of new technologies and practices.
  • Perform system health monitoring and optimize performance.
  • Define and establish processes and tooling for monitoring and routine system health checks to keep the application optimized and stable.
  • Provide training to new staff and refresher training to team members.

Collaborates:

  • Work as a technical leader alongside business, development, and infrastructure teams.
  • Work effectively with IT and business teams, as well as external customers, to lead the resolution of production incidents and provide communication during outages.
  • Collaborate with other members of IT and business in streamlining production support processes.
  • Work closely with other teams and recommend improvements to current production support processes that reflect business needs, security, and the SLAs of our production services.
  • Work closely with the Infrastructure team and other support staff to identify and resolve incidents and to create and implement long-term remediation techniques and fixes.
  • Provide support and coach other members of the Production Support team.

Communicates Effectively:

  • Communicate clearly and effectively across IT, business process owners, and customers at all levels of the organization.
  • Communicate progress and any challenges to management.
  • Communicate overall status and health of the application to business and application support teams.

Due to the nature of the work we support, all candidates in consideration for this role must be willing to undergo the government-issued background investigation process. We highly encourage all Veterans and those with disabilities to apply.


Ardent is an equal opportunity employer. We will not discriminate in employment, recruitment, advertisements for employment, compensation, termination, upgrading, promotions, and other conditions of employment against any employee or job applicant on the basis of race, color, gender, national origin, age, religion, creed, disability, veteran's status, sexual orientation, gender identity, gender expression, or any other basis protected by state, local, or federal law.

Location & Eligibility

Where is the job
Washington, United States
On-site at the office
Who can apply
US

Listing Details

Posted
June 5, 2026
First seen
June 5, 2026
Last seen
June 5, 2026

Ardent MC
Employees
125
Founded
2006