Datadog Administration and Operations (ServiceNow)

Description

We’re seeking a Datadog administration and operations expert who will be responsible for managing our observability platform to ensure comprehensive monitoring, alerting, and performance analytics across infrastructure and applications. This role is critical for maintaining system reliability, improving incident response, and supporting DevOps and engineering teams with actionable insights. This is an exciting opportunity to get in on the ground floor implementing and scaling the tools, processes, and governance at HP.

Responsibilities

* Observability Architecture & Ownership
  * Design and implement an enterprise‑grade observability strategy spanning Datadog (metrics, logs, traces/APM, synthetics, RUM, network performance, cloud cost) and integrations with ServiceNow.
  * Define monitoring standards, tagging conventions, dashboards, SLOs/SLIs, and alerting policies for infra and apps (on‑prem, cloud, containers).
* Datadog Implementation & Scale
  * Deploy and manage Datadog agents, integrations (AWS/Azure/GCP, Kubernetes, NGINX, DBs, messaging), and service catalog coverage.
  * Build golden dashboards, standardized monitors, and runbooks for infra components (compute, storage, network), platforms (Kubernetes), and critical apps.
* ServiceNow Integration & Event Management
  * Implement and optimize Datadog → ServiceNow event routing, correlation rules, deduplication, and Incident/Problem auto‑creation with enriched context.
  * Maintain CI relationships in ServiceNow CMDB, drive discovery mapping, and align alerts with CI ownership and support groups.
  * Enable closed‑loop remediation using IntegrationHub, workflows, and change controls; contribute to Change Advisory Board (CAB) standards.
* Reliability Engineering & Operational Excellence
  * Maintain SLOs, error budgets, and escalation policies. Reduce alert noise; drive actionable, tiered alerts.
  * Partner with App, Infra, SecOps, and NOC teams to improve MTTR and post‑incident reviews with telemetry‑backed corrective actions.
* Automation & IaC
  * Automate provisioning of monitors, dashboards, synthetics, tags, and service owner mapping.
  * Build runbooks, remediation scripts, and service workflows; integrate with CI/CD to promote consistent monitoring across environments.
* Governance, Compliance & Cost Optimization
  * Implement data retention policies, access controls, RBAC, and tagging for chargeback/showback.
  * Optimize Datadog usage (APM sampling, log pipelines/archives, metric volumes) while protecting critical visibility.

Preferred Education & Experience

* Bachelor’s degree in Computer Science, Engineering, Information Systems, or equivalent experience.
* 5–8+ years in Infrastructure/Platform/SRE/Observability roles for enterprise environments.
* Expert hands‑on Datadog: agents, integrations, logs pipelines, APM/tracing (including OpenTelemetry), RUM, synthetics, dashboards, monitors, service catalogs, tagging strategies.
* ServiceNow: Event Management, Incident/Problem/Change, CMDB design, Discovery, integration patterns (webhooks, APIs, IntegrationHub), event correlation and enrichment.
* Strong experience across Linux/Windows/Unix (cluster and workload monitoring).
* Proficiency with scripting (Python/PowerShell/Bash), Datadog/ServiceNow APIs, and Git‑based workflows.
* Demonstrated capability to design SLOs/SLIs, reduce false positives, and measurably improve MTTR and service reliability.
* Excellent communication; able to drive standards across multiple engineering teams.

Additional Qualifications

* Experience across AWS/Azure/GCP, Kubernetes, Terraform
* Prior ownership of enterprise observability programs (>500 nodes/services; multi‑account/multi‑subscription cloud).
* Network (e.g., NPM/NTA) and database monitoring expertise (e.g. Postgres/SQL Server/Oracle/MySQL).
* Experience with message brokers (Tibco), API gateways, and distributed tracing for microservices.
* Basic experience in administering and maintaining relational and/or non-relational databases.
* Security/Compliance awareness (SOX, HIPAA, PCI), log retention/archival strategies.
* Experience with cost governance in Datadog (metrics vs. logs vs. traces), custom metrics, and sampling strategies.
* ITIL v4 Foundation, Datadog Certifications, and ServiceNow Admin/Developer certifications.

Knowledge & Skills

* Systems thinking, reliability engineering mindset, data‑driven decision making.
* Strong stakeholder collaboration (Infra, AppDev, SecOps, NOC).
* Documentation and enablement: clear runbooks, patterns, standards.
* Bias for automation, consistency, and measurable outcomes.

Job - Software
Schedule - Full time
Shift - No shift premium (India)
Travel -
Relocation -

Equal Opportunity Employer (EEO) - HP, Inc. provides equal employment opportunity to all employees and prospective employees, without regard to race, color, religion, sex, national origin, ancestry, citizenship, sexual orientation, age, disability, or status as a protected veteran, marital status, familial status, physical or mental disability, medical condition, pregnancy, genetic predisposition or carrier status, uniformed service status, political affiliation or any other characteristic protected by applicable national, federal, state, and local law(s). Please be assured that you will not be subject to any adverse treatment if you choose to disclose the information requested. This information is provided voluntarily. The information obtained will be kept in strict confidence. For more information, review HP’s EEO Policy or read about your rights as an applicant under the law here: “Know Your Rights: Workplace Discrimination is Illegal”