The customer currently uses the ELK stack. The goal is to standardize and modernize logs, metrics, and traces using OpenTelemetry while improving visibility, reliability, and operational intelligence.
Observability Architecture & Modernization
· Assess the existing ELK-based observability setup and define a modern observability architecture
· Design and implement standardized logging, metrics, and distributed tracing using OpenTelemetry
· Define observability best practices for cloud-native and Azure-based applications
· Ensure consistent telemetry collection across microservices, APIs, and infrastructure
Logging, Metrics & Tracing
· Instrument applications using OpenTelemetry SDKs (Spring Boot, .NET, Python, JavaScript, as applicable)
· Support Kubernetes and container-based workloads (if applicable)
· Configure and optimize log pipelines, trace exporters, and metric collectors
· Integrate OpenTelemetry with ELK / OpenSearch / Azure Monitor / other backends
· Define SLIs, SLOs, and alerting strategies
· Integrate GitHub and Jira metrics into the observability platform as DORA metrics
Operational Excellence
· Improve observability performance, cost efficiency, and data retention strategies
· Create dashboards, runbooks, and documentation
AI-based Anomaly Detection & Triage (Good to Have)
· Design or integrate AI/ML-based anomaly detection for logs, metrics, and traces
· Apply AIOps capabilities for automated incident triage and insights
Required Technical Skills
Core Observability
· Strong hands-on experience with ELK Stack (Elasticsearch, Logstash, Kibana)
· Deep understanding of logs, metrics, traces, and distributed systems
· Practical experience with OpenTelemetry (Collectors, SDKs, exporters, receivers)
Cloud & Platforms
· Strong experience integrating Microsoft Azure services with the observability platform
· Experience integrating Kubernetes / AKS workloads with the observability platform
· Knowledge of Azure monitoring tools (Azure Monitor, Log Analytics, Application Insights)
Soft Skills
· Strong architecture and problem-solving skills
· Clear communication and documentation skills
· Hands-on mindset with an architect-level view
Good to Have / Preferred Skills
· Experience with AIOps / anomaly detection platforms
· Exposure to any of the following tools: Prometheus, Grafana, Jaeger, OpenSearch, Datadog, Dynatrace, New Relic
· Experience with incident management, SRE practices, and reliability engineering
Experience Level: 5-8 Years
Location: Chennai