Two95 International Inc.

DevOps / Site Reliability Engineer

Two95 International Inc., United States, posted 1 day ago

Job Title: Lead SRE (Site Reliability Engineer)

Location: Remote Work

Type: 6+ month contract-to-hire

Rate: Open ($/hr)

Please forward your updated resume to deivy.malli@two95intl.com and include your rate requirement, your contact details, and a suitable time when we can reach you.

 

Responsibilities

·         Own uptime, SLAs, and overall reliability of the cloud infrastructure and kiosk platform.

·         Lead incident response, root-cause analysis, and drive actionable postmortems.

·         Automate infrastructure, deployments, and operational tasks using modern IaC and scripting in collaboration with the Platform Engineering team.

·         Maintain and improve monitoring, alerting, and observability (Grafana, Prometheus, New Relic, etc.).

·         Manage, operate and recommend improvement of mo

·         Execute and continuously improve disaster recovery and business continuity plans.

·         Partner with platform engineering, QA, and development teams to ensure operational readiness.

·         Establish and maintain runbooks, operational standards, and reliability best practices.

·         Provide leadership, mentorship, and clear communication during both normal operations and incidents.

·         Optimize cloud and Kubernetes environments for reliability, performance, and scalability.

 

Requirements

Qualifications

·         8+ years in SRE, DevOps, or Platform Engineering roles; 2+ years in a senior or lead capacity.

·         Strong experience supporting production environments with strict SLAs and high uptime requirements.

·         Deep knowledge of Kubernetes, containers, and cloud-native infrastructure.

·         Proficiency in automation and scripting using Bash, Python, or Go.

·         Hands-on experience with CI/CD pipelines and release engineering in modern environments.

·         Expert-level familiarity with IaC tools (Terraform preferred).

·         Strong understanding of monitoring, alerting, logging, and observability tooling.

·         Experience implementing and managing GitOps workflows (ArgoCD or similar).

·         Demonstrated ability to lead incidents and communicate effectively with technical and non-technical stakeholders.

·         Solid understanding of disaster recovery planning, resilience practices, and system hardening.

 
