Our client is a global technology company developing a large-scale cloud-based platform used by millions of people and businesses worldwide. The platform is built to be secure, reliable, and highly available at global scale.
REQUIREMENTS
4+ years of experience in IT, system administration, operations, or SRE roles
Strong working knowledge of Linux and/or Windows systems administration
Proven experience troubleshooting production infrastructure and application issues
Hands-on experience with scripting and automation (Bash, Python, PowerShell)
Solid understanding of networking fundamentals (DNS, TCP/IP, routing, basic diagnostics)
Experience with monitoring, alerting, and logging tools (Prometheus, Grafana, ELK, or similar)
Experience working with cloud platforms such as AWS or Azure
Familiarity with CI/CD pipelines and DevOps practices
Experience with version control systems (Git)
RESPONSIBILITIES
Ensure reliability, availability, and performance of production systems
Monitor system health, respond to incidents, and perform root cause analysis
Troubleshoot complex infrastructure and application issues
Design and improve operational processes and best practices
Develop and maintain operational documentation and runbooks
Identify and implement automation opportunities to reduce manual work
Collaborate with engineering teams to improve system resilience and scalability
Participate in on-call rotations and incident response activities