Responsibilities
- Manage large-scale Linux environments: troubleshooting and root-cause analysis
- Write maintainable, hand-off-ready Bash / Ansible / Python automation
- On-call for infrastructure, CI/CD, and production service incidents
- Operate HPC clusters (Slurm) along with usage analytics, auditing, and monitoring tools
- Maintain and plan storage for compute environments (Lustre, NAS)
- Manage multi-cloud environments (AWS, Alibaba Cloud, GCP) with Terraform / AWS CDK
- Build and operate Docker (ECS) / Kubernetes (EKS) environments and their deployment workflows
- Operate a self-hosted GitLab server and Runner fleet
- Operate CI/CD systems and design deployment pipelines for research and other projects
- Build internal AI platforms (LangChain / LangGraph / Bedrock, Elasticsearch RAG)
- Develop MCP servers, chatbots, AI agents, and similar services
Requirements
- **5+ years** of hands-on Linux systems administration and infrastructure operations experience
- Solid Linux internals knowledge (process / memory / filesystem / networking / systemd / cgroup); able to localize issues even without complete logs
- Strong Bash / Shell scripting skills — able to write maintainable scripts that others can pick up
- Programming ability for data processing, CLI tools, and API services; Python proficiency preferred
- Solid storage fundamentals with hands-on experience: RAID levels and rebuild trade-offs, filesystem selection, snapshot and backup planning; NAS / shared storage (NFS / SMB) operations experience
- Experience with at least one major public cloud (AWS / GCP / Alibaba Cloud) and IaC tooling (Terraform / CDK / Ansible)
- Familiar with containerization and orchestration (Docker, Kubernetes)
- CI/CD pipeline design and operations experience (GitLab CI / Jenkins / Airflow)
- Able to own a cross-service subsystem end-to-end: design, implementation, documentation, handoff
- **Strong autonomy**: can drive a problem from discovery through root-cause investigation and decision-making to delivery with minimal supervision; able to make judgment calls with incomplete information and proactively communicate progress, risks, and rationale
- **Self-directed**: doesn't wait for tickets — identifies problems worth solving and prioritizes them independently
Nice to Have
- HPC scheduler experience (Slurm / PBS / LSF)
- Parallel filesystem operations experience (Lustre / GPFS / BeeGFS)
- Advanced Linux performance analysis (perf, eBPF, ftrace) and kernel parameter tuning
- DB operations experience (MySQL, ClickHouse)
- Low-latency network tuning and cross-datacenter link optimization
- LLM application development (LangChain, RAG, Agent, MCP)
- Self-managed Kubernetes experience (Kubespray, kubeadm)
- GPU server operations (single-node): NVIDIA driver / CUDA toolkit version management, `nvidia-smi` / DCGM monitoring, nvidia-container-toolkit integration, troubleshooting XID / ECC errors and thermal throttling
- Experience or familiarity with integrating GPU resources into Slurm: GRES configuration, cgroup-based GPU isolation, user/job-level resource limits
Listing Details
- Posted: June 5, 2026
Please let Kronosresearch know you found this job on Jobera.