Site Reliability Engineer
Electrum is a next-generation payment software technology company.
Since 2012, we've delivered trusted, enterprise-grade, cloud-native software to optimise financial transaction processing. Our deep expertise has established us as a respected partner in high-volume, low-value payment schemes, enabling clients to deliver services to millions of South Africans daily.
At Electrum, we are grounded in impact – designing solutions that matter, acting with urgency, and continuously learning as we scale. We believe in creating together – working side by side with our clients and teams to build meaningful, lasting solutions. We prioritise making it safe – encouraging open communication, smart risk-taking, and trust so that creativity and alignment thrive. And we back empowered strong teams – hiring brilliant people, collaborating hard, and holding each other to high standards while leading with empathy and kindness.
The Role
Site Reliability Engineers (SREs) are responsible for monitoring, automating, and improving the reliability, scalability, performance and availability of our services. SREs work on tasks such as preventing incidents, managing infrastructure reliability, building effective monitoring systems and ensuring the smooth operation of cloud production systems.
Responsibilities
Service Reliability and Availability
- Collaborate with teams to develop reliable, available, and scalable applications.
- Work closely with the development team to understand, address, and prevent technical issues.
- Participate in on-call rotations and manage critical incidents.
- Develop and maintain incident response processes and alerting mechanisms.
- Develop and maintain tools to monitor application and service SLIs and SLOs.
System Troubleshooting and Problem Resolution
- Diagnose and resolve infrastructure and system-level issues, ensuring minimal downtime and swift problem resolution.
- Respond to and investigate incidents related to infrastructure and applications, utilising diagnostic tools to track down and remediate issues.
- Participate in on-call rotations to provide 24/7 operational support as necessary.
Observability and Automation
- Utilise technologies to develop and maintain effective log management and monitoring solutions for internal and external customers.
- Evaluate system health, identify performance bottlenecks and proactively optimise performance and cost-effectiveness.
- Implement automation tools and frameworks for deployment, configuration, and monitoring processes.
- Manage and plan system capacity to ensure continued reliability.
Process Improvements
- Offer recommendations and improvements to enhance performance, security, and scalability.
- Evaluate and integrate emerging technologies, cloud services and automation tools to improve operational efficiency.
- Drive cost-optimisation initiatives by identifying opportunities for resource right-sizing, efficiency and other cost-saving measures.
Disaster Recovery
- Design and implement disaster recovery strategies, including backup and restoration processes, to ensure business continuity.
- Develop and update incident management procedures, ensuring effective incident response by providing technical solutions and implementing preventative measures.
- Regularly assess system performance, identify irregularities, troubleshoot issues, and ensure high system availability. This includes performing or facilitating Disaster Recovery tests.
Requirements
- Bachelor's degree in Computer Science, Information Technology, or related field preferred.
- 3+ years' experience in an SRE or similar role.
- Familiarity with AWS services like EC2, S3, RDS, Lambda, EKS and CloudWatch.
- Demonstrable experience with observability tools like Elastic and Grafana.
- Development skills advantageous.
- Strong troubleshooting and problem-solving skills.
- Excellent collaboration, communication, and time management skills.
- Attention to detail and ability to work effectively in a team environment.
Benefits
Why Join Electrum?
- We believe in a People First approach, ensuring a culture where you can thrive and make a real difference.
Your Career & Culture
- Career Growth: Delivering world-class financial software is challenging, but your effort will earn you hands-on experience with products used by millions, accelerating your career.
- Strong Teams: We keep teams small, focused, and collaborative to maximise impact.
- Transparency: We openly discuss strategy, finances, and salaries. Mistakes are viewed as learning opportunities that we actively discuss.
- Autonomy: We trust you. You're expected to seek out the data needed for informed decisions and manage your own time—knowing when to focus and when to recharge.
- Shared Vision: You'll have the power to shape the vision of how we build the future of financial services.
Practical Perks
Here's how we support our culture:
- Flexible Work: Office-first environment with flexible hours.
- Generous Leave: Starting at 20 days per year.
- Office Perks (Cape Town): Fully-stocked kitchen and daily catered lunch.
- Social Life: Regular team activities like hikes, getaways, and dinners.
About Electrum Software
Electrum provides enterprise software that represents the next generation of payments technology. Founded in Cape Town by some of South Africa’s most experienced and innovative payments experts, we have built the top payments technology team in the reg...