Kronosresearch is hiring a Web3 Senior SRE Engineer
Compensation: $90k - $100k estimated
Location: Remote
Responsibilities

Linux Systems & Automation (Core)
- Manage large-scale Linux environments: troubleshooting and root-cause analysis
- Write maintainable, hand-off-ready Bash / Ansible / Python automation
- On-call for infrastructure, CI/CD, and production service incidents

HPC Cluster & Storage
- Operate HPC clusters (Slurm) along with usage analytics, auditing, and monitoring tools
- Maintain and plan storage for compute environments (Lustre, NAS)

Cloud & Hybrid Infrastructure
- Manage multi-cloud environments (AWS, Alibaba Cloud, GCP) with Terraform / AWS CDK
- Build and operate Docker (ECS) / Kubernetes (EKS) environments and their deployment workflows

CI/CD & Developer Experience
- Operate self-hosted GitLab server and Runner fleet
- Operate CI/CD systems and design deployment pipelines for research and other projects

GenAI / Internal Platform
- Build internal AI platforms (LangChain / LangGraph / Bedrock, Elasticsearch RAG)
- Develop MCP servers, chatbots, AI agents, and similar services

Requirements
- 5+ years of hands-on Linux systems administration and infrastructure operations experience
- Solid Linux internals knowledge (process / memory / filesystem / networking / systemd / cgroup); able to localize issues even without complete logs
- Strong Bash / Shell scripting skills: able to write maintainable scripts that others can pick up
- Programming ability for data processing, CLI tools, and API services; Python proficiency preferred
- Solid storage fundamentals with hands-on experience: RAID levels and rebuild trade-offs, filesystem selection, snapshot and backup planning; NAS / shared storage (NFS / SMB) operations experience
- Experience with at least one major public cloud (AWS / GCP / Alibaba Cloud) and IaC tooling (Terraform / CDK / Ansible)
- Familiar with containerization and orchestration (Docker, Kubernetes)
- CI/CD pipeline design and operations experience (GitLab CI / Jenkins / Airflow)
- Able to own a cross-service subsystem end-to-end: design, implementation, documentation, handoff
- Strong autonomy: can drive a problem from discovery, root-cause investigation, decision-making, to delivery with minimal supervision; able to make judgment calls under incomplete information and proactively communicate progress, risks, and rationale
- Self-directed: doesn't wait for tickets; identifies problems worth solving and prioritizes them independently

Nice to Have
- HPC scheduler experience (Slurm / PBS / LSF)
- Parallel filesystem operations experience (Lustre / GPFS / BeeGFS)
- Advanced Linux performance analysis (perf, eBPF, ftrace) and kernel parameter tuning
- DB operations experience (MySQL, ClickHouse)
- Low-latency network tuning and cross-datacenter link optimization
- LLM application development (LangChain, RAG, Agent, MCP)
- Self-managed Kubernetes experience (Kubespray, kubeadm)
- GPU server operations (single-node): NVIDIA driver / CUDA toolkit version management, nvidia-smi / DCGM monitoring, nvidia-container-toolkit integration, troubleshooting XID / ECC errors and thermal throttling
- Experience or familiarity with integrating GPU resources into Slurm: GRES configuration, cgroup-based GPU isolation, user/job-level resource limits