Site Reliability Engineering (SRE) is an engineering discipline that draws from software and systems engineering to define, measure and achieve reliability objectives. SRE embraces DevOps philosophies and leverages custom code, automation, tooling, support processes and service management frameworks to meet those objectives. The SRE mindset considers reliability a first-class feature of any service and prioritizes engineering and automation over manual intervention.
S&P Global's Site Reliability Engineering teams are responsible for keeping our products and services available to customers and employees located around the world. We achieve this through software, system and process engineering to maintain service level objectives, limit human intervention and minimize the level of effort associated with support (a.k.a. "toil"). SRE teams at S&P Global are generally responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning of our products and services.
Our SRE teams value:
End-user focus: As engineers and consumers of services, we deeply value the quality of our users' experience. We recognize that a solution is only as good as the quality of service it provides.
Passion for coding and automation: We leverage technology to improve reliability and make our lives easier. We are experienced problem-solvers and are proficient in scripting and programming languages. We look for people who enjoy problem-solving, writing code and exploring automation.
Curiosity: We compulsively search for the underlying cause of issues and ways to improve reliability.
Honesty: We value honesty and transparency over placing blame. We promote a blameless culture throughout the organization.
You have 5+ years of experience in software or systems engineering.
You have experience monitoring, supporting and tuning a production application stack.
You value your time and have experience with scripting and automation frameworks.
You want to support full-stack solutions, including applications, servers, networks, data pipelines and data platforms.
You have excellent troubleshooting skills.
You demonstrate an objective, data-driven approach to problem-solving.
You demonstrate excellent collaboration and communication skills.
You take a practical and iterative approach to improvement, making small changes and testing for effect.
You have experience working across silos in a change-controlled environment.
You have experience working with a globally distributed workforce.
You have experience with cloud hosting technologies (e.g., AWS, Azure, Google).
You may have some experience with containerization platforms (e.g., Docker, Kubernetes).
Develop, maintain and report on Service Level Objectives (SLOs).
Develop and support monitoring and automation to defend SLOs.
Resolve incidents (outages and service disruptions), including participation in on-call rotations.
Perform root cause analysis and formal postmortem write-ups for service disruptions.
Perform capacity planning to ensure future reliability and efficiency as utilization grows.
Develop and test disaster recovery plans.
Implement changes and support releases in a controlled environment.
Develop and maintain runbooks, share knowledge and cross-train members of SRE and Development teams.
Consult with Development teams during service design and in advance of releases.
Conduct production readiness reviews to ensure services meet SRE onboarding requirements.
Bachelor's degree or higher in computer science, math, engineering or related disciplines.
AWS technical certifications are helpful.