Senior Site Reliability Engineer

Hard Rock Digital
Full-time
Remote friendly (Florida, United States)
Worldwide

What are we building? 

Hard Rock Digital is on a mission to become the best online sportsbook, casino, and social gaming company in the world. We’re building a team that shares a passion for learning, operating, and building new products and technologies for millions of consumers. We care about each customer's interaction, experience, behavior, and insight and strive to ensure we’re always acting authentically.

 

Rooted in the kindred spirits of Hard Rock and the Seminole Tribe of Florida, the new Hard Rock Digital taps a brand known the world over as the leader in gaming, entertainment, and hospitality. We’re taking that foundation of success and bringing it to the digital space — ready to join us?

 

What’s the position?

We are looking for a skilled Sr. Site Reliability Engineer (SRE) to maintain and improve the reliability, scalability, and performance of our Java-based application. You will be responsible for managing and monitoring the application’s infrastructure, using the Grafana stack (Grafana, Loki, Prometheus) to ensure a high level of observability, and implementing robust monitoring, alerting, and logging solutions.

 

Key Responsibilities:

Application Reliability & Performance:

  • Ensure the availability, reliability, and performance of a high-traffic Java-based application in a distributed environment.

  • Troubleshoot and resolve complex issues in production and non-production environments.

  • Participate in both pre- and post-deployment performance testing and monitoring efforts to improve application performance.

  • Optimize Java application performance, ensuring efficient resource utilization and scaling.

 

Monitoring & Observability:

  • Deploy and manage the Grafana stack (Grafana, Prometheus, Loki) to provide real-time monitoring, logging, and alerting.

  • Implement and refine observability strategies to enhance application and infrastructure visibility.

  • Create and maintain dashboards, alerts, and logs for comprehensive monitoring of system health and performance.

 

Incident Management & Root Cause Analysis:

  • Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes of issues to prevent recurrence.

  • Document and share lessons learned from incidents, contributing to a culture of continuous improvement.

 

Collaboration & Cross-functional Support:

  • Work closely with developers, architects, and other engineers to design and implement solutions that improve application reliability.

  • Collaborate closely with DevOps and NOC teams to support the application platform.

  • Communicate SRE practices and principles to technical and non-technical stakeholders.

  • Provide feedback and insights on application performance, potential improvements, and observability metrics.



What are we looking for?

  • Degree in computer science or a related field, or equivalent work experience

  • 5+ years in SRE, DevOps, or similar infrastructure roles:

    Experience managing large-scale, high-availability production systems

    Track record of incident response and post-mortem processes

    Experience with capacity planning and performance optimization

  • 3+ years hands-on experience managing production Kubernetes clusters:

    Deep understanding of k8s architecture, networking, storage, and security

    Experience with cluster scaling (Karpenter), upgrades, and multi-cluster management

    Proficiency with kubectl, Helm, and Kubernetes operators

    Container orchestration and troubleshooting expertise

  • Advanced expertise with the Grafana stack for dashboards, alerting, and visualization:

    Hands-on experience with Grafana Alloy for telemetry data collection

    Proficiency in PromQL

    Experience with Loki for log aggregation and analysis

    Experience building comprehensive monitoring and alerting strategies

  • Hands-on experience managing Java-based applications in large-scale, distributed environments, with a focus on JVM tuning and application optimization.

  • Cloud platform expertise (AWS, GCP, or Azure)

  • Familiarity with infrastructure as code (IaC) tools such as Terraform/Terragrunt or Ansible.

  • ArgoCD proficiency for GitOps workflows and continuous deployment

  • Strong scripting abilities in Bash, Python, or Go:

    Experience with CI/CD pipelines and automation tools

    Configuration management and deployment automation

  • Strong troubleshooting skills, with a proactive approach to diagnosing and resolving performance bottlenecks.

  • Proven experience managing on-call rotations, incident response, and root cause analysis.

  • Ability to mentor junior team members

  • Strong communication skills (both written and verbal), positive attitude, and ability to receive constructive feedback.

 

What’s in it for you?

We offer our employees more than just competitive compensation. Our team benefits include:

  • Competitive compensation and comprehensive benefits

  • Hybrid and Remote work

  • Flexible vacation allowance

  • Startup culture backed by a secure, global brand

 

Roster of Uniques

We care deeply about every interaction our customers have with us, and we trust and empower our staff to own and drive their experience. Our vision for our business and customers is built on fostering a diverse and inclusive work environment where, regardless of background or beliefs, you feel able to be authentic and bring all your talent into play. We want to celebrate you being you (we are an equal opportunity employer).