Site Reliability Engineering
Site Reliability Engineering (SRE) is a discipline that combines operations and software engineering, applying software engineering practices to infrastructure and operations problems. Instead of building product features, Site Reliability Engineers build systems to run applications. SRE is similar to DevOps, but while DevOps focuses on getting code to production, SRE ensures that code running in production works properly.
Problem it addresses
Ensuring applications run reliably requires multiple capabilities, from performance monitoring and alerting to debugging and troubleshooting. Without these, system operators can only react to problems rather than proactively work to avoid them, and downtime becomes only a matter of time.
How it helps
An SRE approach minimizes the cost, time, and effort of the software development process by continuously improving the underlying system. The system continuously measures and monitors the infrastructure and application components. When something goes wrong, it shows Site Reliability Engineers when and where the problem occurred and how to fix it. By automating operational tasks, this approach helps create highly scalable and reliable software systems.
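As a rough illustration of the measure-and-alert loop described above, the following minimal Python sketch computes an error rate per component and reports the components that exceed a threshold. The `ComponentStats` shape, the component names, the 1% threshold, and the request counts are all assumptions made up for this example; real SRE tooling typically relies on a dedicated monitoring system rather than hand-written checks like this.

```python
from dataclasses import dataclass


@dataclass
class ComponentStats:
    """Request counts for one component over a measurement window (hypothetical shape)."""
    name: str
    total_requests: int
    failed_requests: int


def error_rate(stats: ComponentStats) -> float:
    """Fraction of failed requests; 0.0 when the component saw no traffic."""
    if stats.total_requests == 0:
        return 0.0
    return stats.failed_requests / stats.total_requests


def find_unhealthy(window: list[ComponentStats], threshold: float = 0.01) -> list[str]:
    """Return one alert message per component whose error rate exceeds the threshold."""
    alerts = []
    for stats in window:
        rate = error_rate(stats)
        if rate > threshold:
            alerts.append(f"{stats.name}: error rate {rate:.1%} exceeds {threshold:.1%}")
    return alerts


if __name__ == "__main__":
    # Illustrative, made-up measurements for two hypothetical services.
    window = [
        ComponentStats("checkout-api", total_requests=10_000, failed_requests=350),
        ComponentStats("search-api", total_requests=8_000, failed_requests=12),
    ]
    for alert in find_unhealthy(window):
        print(alert)
```

With the made-up numbers above, only the hypothetical checkout-api component is reported, which tells an engineer where to look first. The threshold is a parameter so that each team can choose the level of reliability its application needs.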