At favorited, we believe that digital communities should be more than just spaces to watch content. Our platform is a place to connect, engage, and play, and it empowers creators by enhancing audience participation and fostering deeper connections.
Our work culture is intense and isn’t for everyone. But if you’re a self-starter eager to shape the future of social interaction with a team that holds itself to the highest standards, this is the place for you. We value open yet respectful communication and real-time feedback to help each other grow quickly. If you’re passionate about gaming and have a knack for gamifying everyday life, you’ll thrive in our fast-moving, collaborative environment.
We are looking for a Senior Site Reliability Engineer to help ensure the reliability, scalability, and performance of the infrastructure that powers favorited’s real-time platform. You will play a key role in building and maintaining systems that support high-traffic applications used by a rapidly growing global audience.
This role is ideal for someone who enjoys solving complex infrastructure challenges, improving system reliability, and building automation that allows engineering teams to move quickly and confidently.
Responsibilities:
Design, implement, and maintain highly reliable and scalable infrastructure supporting real-time applications.
Build automation and tooling to improve system reliability, deployment processes, and operational efficiency.
Develop and maintain monitoring, logging, and alerting systems to ensure high availability and rapid incident response.
Partner closely with engineering teams to improve service reliability, performance, and observability.
Support incident response, root cause analysis, and postmortems, ensuring learnings are incorporated into system improvements.
Optimize infrastructure for performance, cost efficiency, and scalability.
Manage and scale containerized environments using Docker, Kubernetes, and related orchestration technologies.
Help define and enforce reliability standards, SLOs, and operational best practices across engineering teams.
Continuously evaluate new infrastructure tools and practices to improve system resilience and developer productivity.
Requirements:
6+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
Experience managing infrastructure for large-scale systems supporting millions of users.
Strong expertise with cloud infrastructure, ideally Google Cloud Platform (GCP).
Hands-on experience with Kubernetes, container orchestration, and distributed systems.
Experience implementing monitoring and observability systems (Prometheus, Grafana, Datadog, or similar).
Strong scripting or programming experience in languages such as Python, Go, or TypeScript.
Deep understanding of reliability engineering practices including SLOs, SLIs, and incident management.
Strong collaboration skills and ability to work cross-functionally with engineering teams.
Nice to Have:
Experience supporting real-time streaming, gaming, or large-scale consumer applications.
Familiarity with event-driven architectures and large-scale data processing systems.
Experience optimizing infrastructure costs in high-growth environments.
Compensation: $150k - $200k base salary + options.
Benefits Include:
Unlimited PTO to prioritize work-life balance.
401(k) plan to invest in your future.
Comprehensive health insurance to support your well-being.
Paid company holidays for time to recharge.
Competitive salary that values your expertise and contributions.
Where You’ll Work: This is a full-time, on-site position in Santa Monica.