At Shakudo, we're building the world's first operating system for data and AI. We use the term "operating system" in the truest sense: just like iOS, Windows, or Linux, Shakudo's end-to-end OS provides ever-evolving, fully automated, best-in-class open-source components tailored to each business's unique needs.
We are seeking a Senior DevOps Engineer to join our Engineering team and take ownership of deploying, configuring, and operating Shakudo in customer environments. This is a hands-on infrastructure role for someone who can work across Kubernetes, Helm charts, and cloud and on-premises environments, and who can act as a trusted technical advisor to customers: diagnosing problems, designing deployment architectures, and ensuring Shakudo runs reliably in production.
In this role, you will own the deployment lifecycle from architecture to operations: assessing customer infrastructure, deploying Shakudo into complex environments, resolving production issues, and turning recurring problems into product improvements. This is not a traditional internal DevOps role — it is a mix of DevOps engineering, Kubernetes platform engineering, and solution architecture where success is measured by deployment reliability, customer satisfaction, and operational excellence.
Own the deployment and operation of Shakudo across customer Kubernetes environments
Design, develop, customize, and troubleshoot Helm charts for complex production deployments
Work deeply with Kubernetes primitives including Deployments, StatefulSets, Services, Ingress, StorageClasses, Secrets, ConfigMaps, RBAC, NetworkPolicies, CRDs, and operators
Debug Kubernetes issues across scheduling, networking, storage, permissions, DNS, ingress, certificates, and workload reliability
Build repeatable deployment patterns that work across different customer infrastructure environments
Assess customer infrastructure and recommend the right deployment architecture for Shakudo
Work with customer platform, DevOps, security, and infrastructure teams to deploy Shakudo into their environments
Support deployments across AWS, GCP, Azure, hybrid cloud, and on-premises Kubernetes clusters
Design for enterprise constraints such as private networking, IAM/RBAC, security controls, observability, compliance requirements, and restricted environments
Help customers make the right trade-offs across reliability, scalability, performance, cost, and operational complexity
Build and maintain infrastructure-as-code using tools such as Terraform and related cloud-native tooling
Operate managed cloud services that interface with Shakudo Kubernetes clusters, including databases, storage, networking, secrets, and identity services
Support GPU infrastructure and specialized compute environments for data and AI workloads
Improve deployment automation, release processes, upgrade workflows, monitoring, and operational runbooks
Identify recurring deployment issues and turn them into product improvements, automation, or reusable patterns
Monitor, debug, and resolve production issues in customer environments
Lead root-cause analysis for infrastructure, deployment, and platform reliability issues
Execute product upgrades, maintenance windows, rollouts, and customer-specific configuration changes
Improve observability, alerting, logging, and operational visibility across deployments
Ensure customer environments are stable, secure, scalable, and maintainable
Act as a trusted technical advisor to customers during deployment and production operations
Explain infrastructure decisions clearly to both technical and non-technical stakeholders
Collaborate with Solution Engineering, Product Engineering, and Customer Engineering teams to translate customer requirements into robust deployment architectures
Document deployment designs, customer-specific configurations, best practices, and troubleshooting guides
Represent the voice of the customer internally and influence product and platform improvements
5+ years of experience in DevOps, Platform Engineering, Infrastructure Engineering, SRE, or a related role
Strong hands-on experience with Kubernetes in production environments
Strong hands-on experience developing, maintaining, and troubleshooting Helm charts
Experience deploying and operating software in customer or enterprise environments
Experience with cloud platforms such as AWS, GCP, or Azure
Experience with infrastructure-as-code tools such as Terraform
Strong understanding of Kubernetes networking, storage, ingress, RBAC, secrets management, observability, and cluster operations
Ability to troubleshoot complex infrastructure issues across application, Kubernetes, cloud, and network layers
Familiarity with Python, Go, Bash, or TypeScript for automation and tooling
Strong communication skills and comfort working directly with customer technical teams
Ability to operate independently, make sound technical decisions, and drive deployments to completion
Experience with data platforms, AI infrastructure, MLOps, or GPU workloads
Experience with Kubernetes operators, CRDs, GitOps, Argo CD, Flux, or similar deployment tooling
Experience with enterprise security requirements, private networking, identity providers, SSO, and compliance-driven environments
Experience deploying software into air-gapped, restricted, or customer-managed infrastructure
Prior experience in a customer-facing infrastructure, solution engineering, or solution architecture role
Contributions to open-source Kubernetes, DevOps, or infrastructure projects