Weekday AI

Senior Engineer - SRE

Hyderabad, Telangana, India
engineering

This role is for one of Weekday's clients.

Min Experience: 5 years

Location: Hyderabad

Job Type: Full-time

We are looking for a highly skilled and motivated Senior Engineer – Site Reliability Engineering (SRE) to join our growing engineering team. In this role, you will be responsible for ensuring the reliability, scalability, performance, and availability of mission-critical systems across multi-cloud environments. You will work closely with platform, infrastructure, and application teams to build resilient systems using automation-first and cloud-native best practices.

This role is ideal for someone who is passionate about operational excellence, enjoys solving complex infrastructure challenges, and thrives in fast-paced, high-availability environments.

Key Responsibilities

  • Design, build, and operate highly available, scalable, and fault-tolerant systems using SRE principles and best practices
  • Manage and operate containerized workloads using Kubernetes, including cluster setup, upgrades, monitoring, and troubleshooting
  • Implement and maintain Infrastructure as Code (IaC) using Terraform and configuration management using Ansible
  • Support and optimize cloud infrastructure across AWS, GCP, and Azure, ensuring cost efficiency, security, and performance
  • Build, maintain, and enhance CI/CD pipelines to enable reliable and automated application deployments
  • Develop automation scripts and tools using Python and Bash to reduce manual operations and improve system reliability
  • Define and track SLIs, SLOs, and SLAs, and participate in error budget planning and incident response
  • Lead incident management, root cause analysis (RCA), and post-mortem reviews to drive continuous improvement
  • Implement monitoring, alerting, and observability solutions to proactively detect and resolve issues
  • Collaborate with development teams to improve system design, deployment processes, and operational readiness
  • Mentor junior engineers and contribute to SRE standards, documentation, and best practices

Required Skills & Qualifications

  • 5–10 years of hands-on experience in Site Reliability Engineering, DevOps, or Platform Engineering roles
  • Strong expertise in Kubernetes and container orchestration in production environments
  • Proven experience with Terraform and Ansible for infrastructure provisioning and configuration management
  • Extensive experience working with at least one major cloud provider (AWS, GCP, or Azure); multi-cloud experience is a strong plus
  • Deep understanding of CI/CD systems, deployment strategies, and release automation
  • Strong scripting and automation skills using Python and Bash
  • Solid understanding of Linux systems, networking, and distributed systems concepts
  • Experience with monitoring, logging, and alerting tools (Prometheus, Grafana, ELK, or similar)
  • Strong troubleshooting skills and experience handling production incidents

Nice to Have

  • Experience with security, compliance, and cloud cost optimization
  • Knowledge of service meshes, load balancing, and auto-scaling strategies
  • Prior experience in high-scale or high-availability production systems
