Senior SRE Engineer

Responsibilities

- Manage large-scale Linux environments, including troubleshooting and root-cause analysis
- Write maintainable, hand-off-ready Bash / Ansible / Python automation
- On-call for infrastructure, CI/CD, and production service incidents

- Operate HPC clusters (Slurm) along with usage analytics, auditing, and monitoring tools
- Maintain and plan storage for compute environments (Lustre, NAS)

- Manage multi-cloud environments (AWS, Alibaba Cloud, GCP) with Terraform / AWS CDK
- Build and operate Docker (ECS) / Kubernetes (EKS) environments and their deployment workflows

- Operate self-hosted GitLab server and Runner fleet
- Operate CI/CD systems and design deployment pipelines for research and other projects

- Build internal AI platforms (LangChain / LangGraph / Bedrock, Elasticsearch RAG)
- Develop MCP servers, chatbots, AI agents, and similar services

Requirements

- **5+ years** of hands-on Linux systems administration and infrastructure operations experience
- Solid Linux internals knowledge (process / memory / filesystem / networking / systemd / cgroup); able to localize issues even without complete logs
- Strong Bash / Shell scripting skills; able to write maintainable scripts that others can pick up
- Programming ability for data processing, CLI tools, and API services; Python proficiency preferred
- Solid storage fundamentals with hands-on experience: RAID levels and rebuild trade-offs, filesystem selection, snapshot and backup planning; NAS / shared storage (NFS / SMB) operations experience
- Experience with at least one major public cloud (AWS / GCP / Alibaba Cloud) and IaC tooling (Terraform / CDK / Ansible)
- Familiar with containerization and orchestration (Docker, Kubernetes)
- CI/CD pipeline design and operations experience (GitLab CI / Jenkins / Airflow)
- Able to own a cross-service subsystem end-to-end: design, implementation, documentation, handoff
- **Strong autonomy**: can drive a problem from discovery through root-cause investigation and decision-making to delivery with minimal supervision; able to make judgment calls under incomplete information and proactively communicate progress, risks, and rationale
- **Self-directed**: doesn't wait for tickets — identifies problems worth solving and prioritizes them independently

Nice to Have

- HPC scheduler experience (Slurm / PBS / LSF)
- Parallel filesystem operations experience (Lustre / GPFS / BeeGFS)
- Advanced Linux performance analysis (perf, eBPF, ftrace) and kernel parameter tuning
- DB operations experience (MySQL, ClickHouse)
- Low-latency network tuning and cross-datacenter link optimization
- LLM application development (LangChain, RAG, Agent, MCP)
- Self-managed Kubernetes experience (Kubespray, kubeadm)
- GPU server operations (single-node): NVIDIA driver / CUDA toolkit version management, `nvidia-smi` / DCGM monitoring, nvidia-container-toolkit integration, troubleshooting XID / ECC errors and thermal throttling
- Experience or familiarity with integrating GPU resources into Slurm: GRES configuration, cgroup-based GPU isolation, user/job-level resource limits

Location & Eligibility

Where is the job: Taiwan (on-site within the country)
Who can apply: Taiwan (TW)

Listing Details

Posted: June 5, 2026
First seen: June 5, 2026
Last seen: June 5, 2026
