We are Kaizen Gaming
Kaizen Gaming, the team powering Betano, is one of the biggest GameTech companies in the world, operating in 19 markets. We always aim to leverage cutting-edge technology, providing the best experience to our millions of customers who trust us for their entertainment.
We are a diverse team of more than 2,700 Kaizeners, representing 40+ nationalities and spread across 3 continents.
Our #oneteam is proud to be among the Best Workplaces in Europe and certified as a Great Place to Work across our offices. Here, there’ll be no average day for you. Ready to Press Play on Potential?
Let’s start with the role
As a Site Reliability Operations Manager, you will lead the operational reliability layer of our production environment, ensuring 24/7 service stability across networks, applications, and infrastructure.
You will own the performance and evolution of our Site Reliability Operations function: managing shift-based teams, strengthening incident response practices, and driving measurable improvements in uptime, response time, and operational maturity. You will also directly handle and oversee the end-to-end incident flow.
You will be responsible for ensuring that incidents are properly triaged, escalated, coordinated, and resolved, while continuously improving our incident management processes.
This role sits at the intersection of Infrastructure, Platform, Security, and Product, ensuring that reliability is not reactive, but engineered and continuously improved.
Reliability at scale in a high-traffic, real-time gaming environment demands precision, discipline, and strong leadership. This role is critical to that mission.
As a Site Reliability Operations Manager, you will:
- Lead and develop the Site Reliability Operations team, ensuring high performance across 24/7 shift coverage.
- Own incident management processes, including severity classification, escalation paths, communication standards, and post-incident reviews.
- Ensure proactive monitoring of production systems with meaningful alerting that minimizes noise and maximizes actionability.
- Track and improve key operational metrics such as MTTA, MTTR, uptime, and SLA adherence.
- Establish and refine standard operating procedures for monitoring, escalation, and vendor coordination.
- Drive structured communication during incidents, ensuring clear updates to technical and business stakeholders.
- Collaborate closely with SRE, Infrastructure, Security, and Engineering teams to eliminate recurring incidents through root cause analysis and systemic improvements.
- Oversee relationships with external vendors and providers during both routine operations and major outages.
- Promote a culture of operational excellence, accountability, and continuous improvement.
- Participate in capacity planning and operational readiness reviews for new launches and major changes.
What you will bring
- Proven experience leading technical operations or NOC/SRE Operations teams in high-availability environments.
- Strong understanding of production monitoring, alerting systems, and incident management frameworks.
- Solid knowledge of networking fundamentals (TCP/IP), infrastructure components, and cloud or hybrid environments.
- Experience working in 24/7 operational models with shift-based teams.
- Hands-on familiarity with ticketing systems and operational reporting.
- Ability to analyze operational data and translate it into improvement initiatives.
- Strong stakeholder communication skills, especially under pressure.
- Structured thinker with close attention to detail and strong execution discipline.
- Experience in gaming, fintech, e-commerce, or other real-time, high-scale digital environments is considered a strong plus.
Recruitment Privacy Notice
Regarding the data you share with us, you may find and read our recruitment privacy notice here.
We are an equal opportunity employer committed to fostering a diverse and inclusive workplace. We welcome applications from individuals of all backgrounds, regardless of race, gender, religion, sexual orientation, or age.