Job title: Site Reliability Engineer
Location: Sofia, Bulgaria
Overview of the role:
Flutter Technology is looking for a Site Reliability Engineer to ensure the stability, uptime, and efficiency of our essential gaming and betting platforms throughout our worldwide operations. This position blends engineering skill with operational expertise, including on-call support, to keep services continuously available for millions of users globally.
As a member of Flutter Functions, you will work closely with development groups, infrastructure experts, and business partners. Together, you will maintain high-performance, scalable systems supporting our iGaming & Sports platforms in several markets. You will be the expert responsible for building and managing enterprise-level observability, disaster recovery, and business continuity features across our AWS Cloud environment.
The ideal candidate combines a strong understanding of SRE practices with public cloud experience (AWS, Azure, GCP) to ensure our systems maintain high availability, recover rapidly from incidents, and offer comprehensive observability across our platform. You will be responsible for making sure our systems are resilient, recoverable, and subjected to regular fire drills and extensive testing.
You will interact with senior stakeholders during customer escalations and post-incident discussions, which calls for excellent communication skills to convey technical challenges and operational updates. You should feel confident collaborating with cross-departmental teams in changing environments, working alongside development staff, infrastructure professionals, and business stakeholders across various functions and brands.
This role is critical to maintaining Flutter Entertainment's operational excellence and to meeting strict regulatory compliance requirements in the highly regulated gaming industry. It requires a passion for system reliability and a proactive approach to spotting and fixing potential issues before they affect customers. It also involves implementing solutions that support Flutter's multi-regional, multi-market infrastructure.
This role follows a hybrid approach to working, allowing you to combine working from home with working in our modern offices. The exact arrangement is agreed between you and your manager to find the best pattern for you both, while recognising that quality time together is essential for keeping us mission-aligned.
Our teams work from a lively location nestled within this historic city. Enjoy the best of both worlds with winter and summer offices, tantalizing free snacks, and a gaming paradise for endless entertainment.
What you’ll do:
Responsibilities:
System Reliability & Performance
Maintain 99.9%+ uptime for the Observability platform that monitors and provides insights for systems serving millions of concurrent users
Design and support complete monitoring, alerting, and observability systems. Take responsibility for the tooling infrastructure that connects with various cloud services and platforms such as Grafana, Splunk, and CloudWatch.
Conduct capacity planning and performance optimization to ensure systems can handle peak loads during major sporting events
Establish and uphold Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all essential services with assistance from Service Management
Incident Management & Response
Collaborate with Service Management to foster continuous improvement via blameless post-mortems. Detect repeated failure trends across the platform. Work with product teams on resilience upgrades that improve system dependability.
Work together with Service Management on post-incident reviews, offering technical insights and assisting in the adoption of preventative measures to minimize repeat incidents
Support the development and upkeep of detailed runbooks and incident response methods alongside Service Management teams
Observability & Monitoring Excellence
Deploy and maintain comprehensive monitoring dashboards and visualization tools for real-time system visibility across all Flutter platforms
Create custom dashboards and visual analytics for business metrics, technical indicators, and operational insights tailored to different partner needs
Configure and optimize data ingestion from diverse sources including time-series databases, log aggregation systems, cloud monitoring services, and custom APIs
Implement and refine alerting rules and notification workflows that reduce alert fatigue while ensuring critical issues are promptly escalated
Develop and sustain APM capabilities, incorporating instrumentation and telemetry collection into the current observability ecosystem
Work together with development teams to define, implement, and instrument custom business and technical metrics that provide actionable insights
Testing & Chaos Engineering
Own and maintain the chaos testing framework and tools. Define standard failure scenarios. Support product teams in performing tests safely and consistently. Proactively identify platform weaknesses to drive resilience improvements.
Carry out disaster recovery fire drills together with product teams, managing complex testing scenarios in isolated environments to verify system resilience and recovery procedures. Record outcomes and findings to guide future enhancements and maintain organizational readiness.
Apply chaos engineering principles to proactively identify system weaknesses and vulnerabilities, using controlled experiments to test platform behavior under failure conditions and surface areas for resilience enhancement
Collaboration & Knowledge Sharing
Work alongside development teams to boost application reliability and deployment procedures
Mentor junior team members and contribute to the development of SRE practices across Flutter
Participate in architecture reviews and provide reliability expertise for new system designs
Document procedures, troubleshooting guides, and system architecture for knowledge sharing
What you’ll bring:
Competencies:
Building Support: We develop strong connections with our partners, founded on trust, integrity, and respect. We promote awareness, encourage understanding, and build positive momentum around the group technology strategy, frequently without direct control.
Objective: We are impartial and unbiased, ensuring equal treatment for all and that decisions are based on objective criteria.
Collaborative: We work effectively and in partnership with others on shared goals aligned with the group strategy. We foster a cooperative environment and take on leadership roles when necessary.
Adaptable: We recognize and value diverse and conflicting viewpoints on a matter and can adjust our approach to reach a successful result.
Critical Thinking: We consider the big picture and use this perspective to help our divisions gain a competitive advantage. This advantage comes from greater agility, faster time to market, and an improved customer experience.
Critical Communication: We are proactive and thoughtful in our approach to communication with partners. We actively listen, provide constructive feedback, and help others to consider new perspectives.
Experience:
Extensive experience with monitoring and observability tools including Prometheus, Grafana, ELK stack, or similar enterprise-scale solutions for maintaining high availability across production environments
Proven ability to work with cloud platforms such as AWS, Azure, or Google Cloud Platform, with a strong understanding of cloud services and system architecture
Extensive experience applying and sustaining reliability engineering methods in production settings supporting 24/7/365 operations
Experience delivering and operating systems in security-compliant and highly regulated environments
Strong scripting and programming abilities in Python, Go, Bash, TypeScript, or Terraform for automation and infrastructure as code implementations
Proven experience with CI/CD pipelines and tools including Jenkins, GitLab CI, Azure DevOps, GitHub Actions, or similar continuous integration and deployment platforms
Working knowledge of database technologies including SQL databases (PostgreSQL, MySQL) and NoSQL solutions for data persistence and management
Ability to produce comprehensive, clear, and actionable technical documentation for operational procedures and runbooks
Experience working in an agile environment alongside cross-functional teams
Proficiency with containerization technologies including Docker and Kubernetes for container orchestration and management at scale
Bonus points for previous software engineering experience, AWS certifications, or experience in highly regulated industries such as gaming, financial services, or healthcare.
It’s okay if you don’t think you tick every box on this list. We love people who want to challenge themselves and are passionate about what they do. If you believe you can contribute in some areas and are eager to learn, we encourage you to apply.
Why choose us:
Aside from a generous base salary, we have a fantastic benefits & rewards program that is designed to encourage personal and career development.
Discretionary annual bonus
30 days paid leave
Health and Dental Insurance for you, your partner, and your children (if you all live at the same address)
Personal life insurance and disability coverage
Wellbeing fund
Continuous learning support for certifications and career growth
550 EUR gift for a newborn family member
26 weeks of maternity leave at 100% pay and 4 weeks of secondary (paternity) leave, also at 100% pay; no eligibility period applies
A sports card membership valid across the country
Complimentary discounts across a range of services
Monthly food vouchers
Equal opportunities:
At Flutter International, we are committed to creating an inclusive environment where our people can be their authentic selves and thrive. We embrace and celebrate diversity, respecting everyone's uniqueness and differences.
We invite you to inform us if you have any accessibility requirements. Simply send an email to talent@flutterint.com. Our goal is to make sure you have what’s necessary to thrive with us.
Learn more about the work we are doing on Inclusion and Belonging here: https://careers.flutterinternational.com/working-at-flutter-international/diversity-equity-inclusion/
The group:
Flutter Functions is a proud member of the Flutter Entertainment family, a global leader in sports betting, iGaming, and entertainment. We are listed on the FTSE 100 index and the New York Stock Exchange (NYSE). Our world-class brands and innovative products set us apart. Our International division operates in over 100 global markets. It offers sports betting, casino, poker, rummy, and lottery, mainly online. We are committed to delivering gaming and entertainment responsibly and sustainably. Our team of over 8,000 colleagues supports this vision across 28 offices worldwide.