Bellota Labs is a fast-paced, hypergrowth startup poised to revolutionize the gaming world with ClubWPT Gold, a groundbreaking product from the World Poker Tour. Driven by innovation, game integrity, and exceptional customer experiences, we are on a mission to set new standards in online gaming.
We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, build, and maintain highly reliable, scalable, and secure systems. You will play a critical role in ensuring system availability, performance, and operational excellence across our infrastructure and applications.
As a senior member of the team, you will also mentor engineers, influence architecture decisions, and drive best practices in reliability engineering, automation, and incident management.
Key Responsibilities:
Reliability & Availability
Design and implement highly available, scalable, and fault-tolerant systems.
Define and maintain SLIs, SLOs, and SLAs.
Lead incident response, root cause analysis (RCA), and postmortems.
Improve system resiliency and reduce operational toil through automation.
Observability & Monitoring
Design monitoring, alerting, and logging strategies.
Implement tools such as Prometheus, Grafana, Datadog, ELK, or similar.
Establish proactive alerting and capacity planning processes.
Performance & Scalability
Conduct performance testing and optimization.
Identify bottlenecks and implement improvements.
Support system scaling initiatives and architecture reviews.
Collaboration & Leadership
Partner with engineering teams to embed reliability into development processes.
Lead reliability initiatives and cross-functional projects.
Mentor junior engineers and promote SRE best practices.
Experience:
5+ years of experience in SRE, DevOps, or Infrastructure Engineering.
Strong experience with cloud platforms (AWS).
Deep understanding of Linux systems and networking fundamentals.
Experience with containerization and orchestration (Docker, Kubernetes).
Proficiency in scripting/programming (Python, Go, Bash, or similar).
Experience with monitoring and observability platforms (Datadog/Prometheus).
Preferred Technologies (Nice to Have):
Experience operating high-scale production systems.
Experience with microservices architecture.
Background in database reliability (Postgres, MySQL, Redis, etc.).
Experience implementing SRE practices (error budgets, blameless postmortems).
Experience with AI-driven SRE.
Lead High-Impact Projects – Play a key role in delivering innovative gaming experiences to a global audience
Collaborate Across Borders – Work with talented teams across Asia and the US
Fast-Paced Growth – Be part of a hypergrowth startup with ambitious goals
Competitive Benefits – Enjoy a top-tier compensation package in a dynamic company