7100 Corporate Drive
Plano, TX 75023
Sr. Site Reliability Engineer (Remote, Plano, Louisville or Chicago)
Description: Are you an experienced Site Reliability Engineer? Are you equally passionate about the joys of engineering software features and working through real-world operational challenges and problems? Do you find yourself filling the gaps between "dev" and "ops" more often than not? Do people from feature engineering to ops to the management team come to you when they have a problem no one else seems to be able to solve?
If so, you might be just the person we are looking for to fill our Senior Site Reliability Engineering role at Pizza Hut. Site Reliability Engineers are as adept at software engineering as they are at performing at their best in the crunch of a production outage. They get the true meaning of the phrase "work smarter, not harder" and bring their enormous technical and troubleshooting skill set to bear to improve speed and efficiency for the organization they are a part of. Pizza Hut is seeking just this kind of person to join our eCommerce engineering team as we continue to scale our cutting-edge, container-based, cloud-hosted solutions. In this role, you will work directly with the feature engineering teams to help bring those solutions to production in a safe and supportable way. You will bring your Site Reliability experience to help implement best practices and contribute to designing and implementing improvements to our build pipelines and our monitoring and support strategy.
Your time will be split between several SRE disciplines, including:
- Assist in leading the implementation of SRE best practices
- Automate resolution of P1 alerts (i.e. write code to facilitate self-healing of issues)
- Automate processing of the shared "issues" mailbox
- Lead the proper configuration and handling of alerts and notifications (i.e. suppressing irrelevant alerts and automating resolution of relevant ones)
- Build infrastructure in both GCP and AWS for enterprise applications
- Build proactive monitoring and alerting solutions on the SRE golden signals using PH tools
- Progress code from development to production through automation, leveraging blue-green and canary deployment techniques
- Real-time service management, including building monitoring for the golden-signal SLIs, negotiating SLOs with the business, building alerting, and creating playbooks and runbooks for services in conjunction with development teams, product owners and support
Minimum Requirements (any 5 of the first 7 bullet points below):
- 2+ years as an SRE
- 5+ years of systems programming experience, i.e. writing backend systems and tools using languages and technologies such as Kubernetes, Python, Go, Terraform, etc.
- 5+ years managing production Unix-based infrastructure, particularly services and applications running within servers and virtual machines
- 2+ years of experience with DevOps pipelines and CI/CD methodology
- 2+ years of experience with building and running Docker containers in production environments
- 2+ years of experience with SRE proactive monitoring tools like Splunk, New Relic and OpsGenie
- 2+ years working with REST APIs as both an API provider and a consumer
- GCP Experience preferred
- AWS, GCP or MCSE certifications
- Experience running all tiers of application infrastructure including web servers, app servers, databases, etc.
- Experience with build systems and pipelines using tools such as CircleCI, Google Cloud Build, Jenkins, GitLab, and Spinnaker
- Atlassian product (Jira, OpsGenie and Confluence) experience is preferred