About Betfair Romania Development:
Betfair Romania Development is the largest technology hub of Flutter Entertainment, with over 2,000 people powering the world’s leading sports betting and iGaming brands. Exciting, immersive and safe experiences are delivered to over 18 million customers worldwide from our office in Cluj-Napoca. Driven by relentless innovation and commitment to excellence, we operate our own unbeatable portfolio of diverse proprietary brands such as FanDuel, PokerStars, SportsBet, Betfair, Paddy Power, and Sky Betting & Gaming.
Our Values:
The values we share at Betfair Romania Development define what makes us unique as a team. They empower us by giving meaning to our contributions, and they ensure that we consistently strive for excellence in everything we do. We are looking for passionate individuals who align with our values and are committed to making a difference.
Win together | Raise the bar | Got your back | Own it | Positive impact
About Flutter Functions:
The Flutter Functions division is a key component of Flutter Entertainment, responsible for providing essential support and services across the organization. The division encompasses various corporate functions, including finance, legal, human resources, technology, and more, ensuring seamless operations and strategic alignment throughout the company.
Role Overview:
Flutter Technology is seeking a Site Reliability Engineer to ensure the reliability, availability, and performance of our critical gaming and betting platforms across our global operations. This role combines engineering expertise with operational excellence to maintain 24/7/365 service availability for millions of customers worldwide, including through on-call support.
As part of Flutter Functions, you will collaborate closely with development teams, infrastructure specialists, and business stakeholders to maintain the high-performance, scalable systems that power our iGaming & Sports platforms across multiple markets. You will be the subject matter expert responsible for implementing and maintaining enterprise-grade observability, disaster recovery, and business continuity capabilities across our AWS Cloud tenancy.
The ideal candidate will combine deep knowledge of SRE best practices with public cloud expertise (AWS, Azure, GCP) to ensure our systems maintain high availability, enable rapid recovery from incidents, and deliver comprehensive observability across our platform. You will be responsible for ensuring our systems are resilient, recoverable, and thoroughly tested through regular fire drills and enterprise-level testing scenarios.
You will engage with senior stakeholders during incident escalations and post-incident reviews, requiring exceptional communication skills to articulate technical issues and operational status. You should be comfortable working with cross-functional teams in dynamic environments, collaborating with development teams, infrastructure specialists, and business stakeholders across various functions and brands.
This role is critical to maintaining Flutter Entertainment's operational excellence and meeting stringent regulatory compliance requirements in our highly regulated gaming industry. The role requires a passion for system reliability, a proactive approach to identifying and resolving potential issues before they impact customers, and the ability to implement solutions that support Flutter's multi-regional, multi-market infrastructure.
Key Accountabilities & Responsibilities:
Maintain 99.9%+ uptime for the Observability platform that monitors and provides insights for systems serving millions of concurrent users.
Implement and maintain comprehensive monitoring, alerting, and observability solutions, including ownership of the actual tooling infrastructure that integrates with multiple cloud services and platforms (e.g., Grafana, Splunk, CloudWatch, among others).
Conduct capacity planning and performance optimization to ensure systems can handle peak loads during major sporting events.
Establish and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all critical services with support from Service Management.
Partner with Service Management (who lead incident coordination) to drive continuous improvement through blameless post-mortems, identify recurring failure patterns across the platform, and collaborate with product teams on implementing resilience improvements that enhance system reliability.
Collaborate with Service Management on post-incident reviews, contributing technical insights and supporting the implementation of preventative measures to reduce repeat occurrences.
Assist in developing and maintaining comprehensive runbooks and incident response procedures in partnership with Service Management teams.
Deploy and maintain comprehensive monitoring dashboards and visualization solutions for real-time system visibility across all Flutter platforms.
Create custom dashboards and visual analytics for business metrics, technical KPIs, and operational insights tailored to different stakeholder needs.
Configure and optimize data ingestion from diverse sources including time-series databases, log aggregation systems, cloud monitoring services, and custom APIs.
Implement and refine alerting rules and notification workflows that reduce alert fatigue while ensuring critical issues are promptly escalated.
Establish and maintain APM capabilities, integrating instrumentation and telemetry collection within the existing observability ecosystem.
Collaborate with development teams to define, implement, and instrument custom business and technical metrics that provide actionable insights.
Own and maintain the chaos testing framework and tooling, define standard failure scenarios, support product teams in executing tests safely and consistently, and proactively identify platform weaknesses to drive resilience improvements.
Conduct disaster recovery fire drills in collaboration with product teams, coordinating complex testing scenarios in isolated environments to validate system resilience and recovery procedures. Document outcomes and findings to inform future improvements and ensure organizational readiness.
Apply chaos engineering principles to proactively identify system weaknesses and vulnerabilities, using controlled experiments to test platform behavior under failure conditions and surface areas for resilience enhancement.
Partner with development teams to improve application reliability and deployment practices.
Mentor junior team members and contribute to the development of SRE practices across Flutter.
Participate in architecture reviews and provide reliability expertise for new system designs.
Document procedures, troubleshooting guides, and system architecture for knowledge sharing.
Building Support: We establish close relationships with our stakeholders, underpinned by trust, integrity, and respect. We build awareness, understanding, and positive momentum behind the group technology strategy, often without being able to assert authority.
Objective: We are impartial and unbiased, ensuring equal treatment for all and that decisions are based on objective criteria.
Collaborative: We work effectively and in partnership with our stakeholders on shared goals that align towards the achievement of the group strategy. We foster a collaborative environment and assume the role of leader when required.
Adaptable: We understand and appreciate different and opposing perspectives on an issue and can adapt our approach to achieve a successful outcome.
Strategic Thinking: We think about the big picture and use that perspective to support our divisions to achieve competitive advantage through greater agility, faster time to market and a better customer experience.
Strategic Communication: We are proactive and considered in our approach to stakeholder communications. We actively listen, provide constructive feedback and help others to consider new perspectives.
Skills, Capabilities & Experience Required:
Extensive experience with monitoring and observability tools including Prometheus, Grafana, ELK stack, or similar enterprise-scale solutions for maintaining high availability across production environments.
Demonstrated ability to work with cloud platforms including AWS, Azure, or Google Cloud Platform, with deep understanding of cloud services and architecture patterns.
Extensive experience implementing and maintaining reliability engineering practices in production environments supporting 24/7/365 operations.
Experience delivering and operating systems in stringent security-compliant and highly regulated environments.
Strong scripting and programming abilities in Python, Go, Bash, TypeScript, or Terraform for automation and infrastructure as code implementations.
Proven experience with CI/CD pipelines and tools including Jenkins, GitLab CI, Azure DevOps, GitHub Actions, or similar continuous integration and deployment platforms.
Working knowledge of database technologies including SQL databases (PostgreSQL, MySQL) and NoSQL solutions for data persistence and management.
Ability to produce comprehensive, clear, and actionable technical documentation for operational procedures and runbooks.
Experience working in an agile environment with cross-functional teams.
Proficiency with containerization technologies including Docker and Kubernetes for container orchestration and management at scale.
Bonus points for previous software engineering experience, AWS certifications, or experience in highly regulated industries such as gaming, financial services, or healthcare.
Benefits:
Hybrid & remote working options
€1,000 per year for self-development
Company share scheme
25 days of annual leave per year
20 days per year to work abroad
5 personal days/year
Flexible benefits: travel, sports, hobbies
Extended health, dental and travel insurance
Customized well-being programmes
Career growth sessions
Thousands of online courses through Udemy
A variety of engaging office events
Disclaimer:
We are an inclusive employer. By embracing diverse experiences and perspectives, we create a lasting, positive impact for our employees, customers, and the communities we’re part of. You don't have to meet all the requirements listed to apply for this role. If you need any adjustments to make this role work for you, let us know, and we’ll see how we can accommodate them.
We thank all applicants for their interest; however, only the candidates who best meet the job requirements will be contacted for an interview.
By submitting your application online, you agree that your details will be used to progress your application for employment. If your application is successful, your details will be used to administer your personnel record. If your application is unsuccessful, we will retain your details for a period no longer than three years, to consider you for prospective roles within the company.