Job Specification: Site Reliability Engineer (Mid-Level)
Role Overview
We are seeking a Site Reliability Engineer (Mid-Level) with strong expertise in AWS cloud infrastructure, containerised platforms, and Azure DevOps CI/CD pipelines. The successful candidate will focus on improving system reliability, availability, performance, and scalability while enabling engineering teams to deliver high-quality services efficiently.
This role blends software engineering with operational excellence, emphasizing automation, observability, incident response, and continuous improvement across cloud-native environments.
Note: This is a reliability-focused engineering role with on-call responsibilities and involvement in platform modernization initiatives.
Qualifications
Key Responsibilities
- Design, build, and operate highly available AWS infrastructure using Infrastructure as Code (Terraform / CloudFormation).
- Develop and maintain CI/CD pipelines to support automated deployments and testing.
- Implement and manage EC2-based and containerised workloads using Docker with Kubernetes (EKS) or ECS.
- Improve system reliability through automation, monitoring, alerting, and self-healing mechanisms.
- Define and track SLIs/SLOs and error budgets for critical services.
- Participate in incident response, lead root cause analysis, and drive post-incident improvements.
- Build observability platforms using CloudWatch, Prometheus, Grafana, ELK, or similar tooling.
- Automate operational tasks to reduce toil and improve deployment consistency.
- Optimise AWS environments for performance, scalability, and cost efficiency.
- Implement security best practices, including IAM, secrets management, and network segmentation.
- Collaborate with development teams to improve application reliability and deployment strategies.
- Maintain runbooks, architectural documentation, and operational playbooks.
Key Characteristics
- Reliability-driven: Focused on uptime, performance, and resilience.
- Automation-first mindset: Actively reduces manual effort and operational toil.
- Ownership mentality: Takes responsibility for services from design through production.
- Strong communicator: Clearly articulates incidents, improvements, and technical concepts.
- Collaborative: Works closely with platform, security, and application teams.
- Continuous learner: Keeps pace with SRE practices and cloud-native technologies.
Core Experience & Technical Skills
- 5–7 years of IT experience, including at least 3 years in SRE, DevOps, or Cloud Engineering roles.
- Strong hands-on experience with AWS services including EC2, VPC, IAM, S3, RDS, CloudWatch, ALB/ELB, and Route 53.
- Proven experience creating, managing, and optimising CI/CD pipelines using Azure DevOps.
- Solid Linux/Windows system administration and troubleshooting skills across production environments.
- Hands-on experience with Docker for containerisation and working knowledge of container orchestration on Kubernetes (EKS) or ECS, including container networking, scaling, rolling deployments, and service mesh concepts.
- Strong experience implementing Infrastructure as Code using Terraform and/or CloudFormation.
- Scripting proficiency in Bash and Python for automation and operational tooling.
- Experience automating infrastructure provisioning, deployments, and operational workflows.
- Practical experience implementing observability platforms, including monitoring, logging, and alerting solutions.
- Strong understanding of SRE principles, including SLIs, SLOs, error budgets, incident management, postmortems, and capacity planning.
- Familiarity with performance tuning, load testing, and reliability optimisation techniques.
Additional Information
D&I statement