This role is for one of Weekday's clients
Min Experience: 5 years
Location: Hyderabad
Job Type: Full-time
We are looking for a highly skilled and motivated Senior Engineer – Site Reliability Engineering (SRE) to join our growing engineering team. In this role, you will be responsible for ensuring the reliability, scalability, performance, and availability of mission-critical systems across multi-cloud environments. You will work closely with platform, infrastructure, and application teams to build resilient systems using automation-first and cloud-native best practices.
This role is ideal for someone who is passionate about operational excellence, enjoys solving complex infrastructure challenges, and thrives in fast-paced, high-availability environments.
Key Responsibilities
- Design, build, and operate highly available, scalable, and fault-tolerant systems using SRE principles and best practices
- Manage and operate containerized workloads using Kubernetes, including cluster setup, upgrades, monitoring, and troubleshooting
- Implement and maintain Infrastructure as Code (IaC) using Terraform and configuration management using Ansible
- Support and optimize cloud infrastructure across AWS, GCP, and Azure, ensuring cost efficiency, security, and performance
- Build, maintain, and enhance CI/CD pipelines to enable reliable and automated application deployments
- Develop automation scripts and tools using Python and Bash to reduce manual operations and improve system reliability
- Define and track SLIs, SLOs, and SLAs, and participate in error budget planning and incident response
- Lead incident management, root cause analysis (RCA), and post-mortem reviews to drive continuous improvement
- Implement monitoring, alerting, and observability solutions to proactively detect and resolve issues
- Collaborate with development teams to improve system design, deployment processes, and operational readiness
- Mentor junior engineers and contribute to SRE standards, documentation, and best practices
Required Skills & Qualifications
- 5–10 years of hands-on experience in Site Reliability Engineering, DevOps, or Platform Engineering roles
- Strong expertise in Kubernetes and container orchestration in production environments
- Proven experience with Terraform and Ansible for infrastructure provisioning and configuration management
- Extensive experience working with at least one major cloud provider (AWS, GCP, or Azure); multi-cloud experience is a strong plus
- Deep understanding of CI/CD systems, deployment strategies, and release automation
- Strong scripting and automation skills using Python and Bash
- Solid understanding of Linux systems, networking, and distributed systems concepts
- Experience with monitoring, logging, and alerting tools (Prometheus, Grafana, ELK, or similar)
- Strong troubleshooting skills and experience handling production incidents
Nice to Have
- Experience with security, compliance, and cloud cost optimization
- Knowledge of service meshes, load balancing, and auto-scaling strategies
- Prior experience in high-scale or high-availability production systems
About Weekday AI
At Weekday (backed by YC; also Product Hunt #1 product of the day), we are building the next frontier in hiring. We have built the largest database of white-collar talent in India, along with outreach tools on top of it to generate highest response ...