Lead DevOps & Site Reliability Engineer (Platform)
Cape Town
Hybrid working (2 days in the office)
At LiveScore Group, we’re the proud home of three of the most exciting brands in the sports and gaming world: LiveScore, LiveScore Bet and Virgin Bet. Together they form a fully owned and operated ecosystem that converges the two worlds of sports media and sports betting. We’re proud of our high ratings, and our commitment to excellence and to fuelling fans’ passion for sport is driving us to the top.
We don’t just lead; we innovate. Our cutting-edge products and immersive experiences set the standard, but it’s our people who truly make the difference. Every day, our team embody our values: adaptability, teamwork, a fan-driven approach, and an ever-curious mindset that fuels our ambition.
As we scale and continue to create a culture that allows all employees to thrive, we know we need the most talented people with diverse backgrounds, perspectives and skills. If you’re good at what you do, come and join us. The more inclusive we are, the more amazing experiences we can create for our users.
We know that job descriptions can sometimes seem daunting, and you might not feel you tick every box. But if you’re passionate about the role and have relevant experience, we want to hear from you!
The Role
At LiveScore Group, we are the driving force behind thrilling experiences for millions worldwide. We seek a visionary Lead DevOps & Site Reliability Engineer to ensure the stability, scalability, and security of our dynamic production environments. This pivotal leadership role isn't about traditional IT infrastructure; it's about building the core foundations for our streaming, media, and betting services. If you excel at constructing and optimising the intricate infrastructure supporting a web of interconnected services, particularly with critical databases like MySQL, this is your challenge.
You will define, implement, and oversee robust, scalable, secure, and highly available infrastructure across production in GCP and dev/test on-prem VMs. Championing DevSecOps principles, you will embed security and operational excellence throughout the software development lifecycle. Leading a specialised team, you'll ensure continuous performance, reliability, and security of all environments, directly enabling LiveScore to deliver innovative products rapidly and securely while maintaining an exceptional user experience, especially concerning production stability.
Key Responsibilities
- Develop and execute a forward-thinking infrastructure, reliability, and DevSecOps roadmap, aligning with LiveScore Group's technology roadmap and business objectives.
- Lead the architecture, design, implementation, and maintenance of highly available, resilient, and scalable infrastructure across cloud and on-premises environments, with a critical focus on production stability. (Production runs in GCP; dev/test environments run on on-prem hardware.)
- Drive the widespread adoption and continuous refinement of DevSecOps practices, integrating security controls and automated governance throughout our software development and deployment pipelines.
- Partner with Security/Compliance teams as required.
- Lead, mentor, and develop a high-performing team of infrastructure engineers and DevOps specialists, fostering growth, ownership, and continuous learning through hands-on technical leadership.
- Apply and continuously improve established practices for monitoring, alerting, incident management, RCA/postmortems, disaster recovery, and capacity planning. Lead major incident resolution and ensure corrective actions are delivered to minimise downtime.
- Strategically manage infrastructure resources, driving cost optimisation and resource efficiency without compromising performance, security, or reliability.
- Collaborate closely with the Head of Technical Architecture, Senior Manager of Quality Assurance & Release, and other engineering and product leaders for seamless integration and delivery.
- Select, implement, and optimise infrastructure, deployment, monitoring, and security tools to enhance automation, efficiency, and developer experience.
- Continuously evaluate emerging technologies to ensure our infrastructure remains cutting-edge and resilient, particularly in scaling our 200-300 services and managing their data flows.
Skills, Knowledge and Experience
- Extensive hands-on experience and strategic oversight of major cloud platforms (e.g., Azure, GCP), including compute, networking, storage, and managed services.
- Deep knowledge of container technologies (Docker) and container orchestration platforms (Kubernetes) for building and managing scalable microservices architectures.
- Proven proficiency with Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible, CloudFormation) for automating infrastructure provisioning and configuration management.
- Expert understanding and hands-on experience in designing, building, and maintaining robust Continuous Integration/Continuous Delivery (CI/CD) pipelines.
- Strong familiarity with security tools integrated into the CI/CD pipeline, including SAST/DAST, vulnerability management, and secrets management.
- Comprehensive understanding of network architecture, protocols, firewalls, intrusion detection/prevention systems, identity and access management, and data encryption, including but not limited to cloud security patterns and least-privilege access controls.
- Expertise in implementing and managing comprehensive monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK stack, Splunk) for proactive issue detection.
- Strong scripting skills (e.g., Python, Go, Bash) for automating operational tasks and integrating various systems.
- Proven experience in designing and implementing disaster recovery and business continuity plans for critical systems.
- MySQL Database Management and Optimisation: This is an absolute must-have! Proven expertise in managing, optimising, and ensuring the high availability of MySQL databases, especially within production environments.
- Experience supporting hybrid environments, including on-prem dev/test infrastructure, and standardising environments through automation.
What can we offer?
- Discovery Medical aid
- 21 days annual leave
- Discretionary Company Performance bonus
- Thursday drinks in the office and socials