We are seeking a Lead Site Reliability Engineer to spearhead the reliability, scalability, and performance of our AI-powered property intelligence platform. Operating at the intersection of Geospatial AI and Insurance Technology, you will be responsible for a mission-critical Azure ecosystem supporting high-throughput Java microservices.
As a Lead, you will bridge the gap between complex AI model inference and enterprise-grade stability. You will own the "Production Excellence" mandate, mentoring a team of engineers and collaborating with Senior Delivery Directors to ensure our global infrastructure stays ahead of our rapid growth.
Key Responsibilities
Strategic Infrastructure & Azure Leadership
Cloud Architecture: Lead the design of highly available, multi-region architectures on Azure, utilizing AKS (Azure Kubernetes Service), Azure Functions, and Service Bus.
IaC Governance: Establish and enforce standards for Infrastructure as Code using Terraform or Bicep, ensuring 100% automated provisioning across all environments.
Java Performance Engineering: Partner with Backend squads to optimize JVM performance, including garbage collection tuning and memory management, for high-concurrency insurance processing.
Reliability & AI Operations (AIOps)
Error Budgeting: Define, negotiate, and manage SLIs, SLOs, and SLAs with Product Stakeholders, balancing the velocity of AI feature releases with system stability.
Advanced Observability: Architect end-to-end monitoring and distributed tracing using Azure Monitor, Application Insights, and ELK/Grafana.
Incident Command: Act as the final escalation point for high-priority incidents, leading complex Root Cause Analysis (RCA) and driving long-term remediation.
Security & Industry Compliance
Data Sovereignty: Ensure the platform adheres to insurance-specific data residency requirements and security frameworks (SOC 2, HIPAA, or ISO 27001).
Automated Governance: Implement Azure Policy and automated security scanning within CI/CD pipelines to ensure a "Secure by Design" infrastructure.
Technical Leadership
7+ years in SRE, DevOps, or Cloud Engineering, with at least 2 years in a Lead or Principal capacity.
Azure Mastery: Expert-level knowledge of the Azure Well-Architected Framework, particularly networking (VNet/ExpressRoute) and compute.
Java Ecosystem: Deep proficiency in the Java/Spring Boot stack from an operational perspective (JVM profiling, thread dump analysis).
Container Orchestration: Mastery of Kubernetes (AKS), including ingress controllers, service mesh (Istio), and cluster security.
Professional Competencies
Strategic Mindset: Ability to translate technical debt and reliability risks into a data-driven business case for leadership.
Automation Advocate: Proven track record of eliminating toil through automation tooling built in Python, Go, or Java.
Mentorship: Passion for leveling up the engineering organization through workshops, documentation, and pair programming.
AI-First Integration: Experience leveraging AI for predictive scaling and automated log summarization to reduce Mean Time to Recovery (MTTR).
Perks you enjoy at KMS Mexico