Site Reliability Engineering

Site Reliability Engineering (SRE) is a discipline that combines operations and software engineering, applying the latter to infrastructure and operations problems. Instead of building product features, Site Reliability Engineers build systems to run applications. SRE has similarities with DevOps, but while DevOps focuses on getting code to production, SRE ensures that code running in production works properly.

Problem it addresses

Ensuring that applications run reliably requires multiple capabilities, from performance monitoring and alerting to debugging and troubleshooting. Without these, system operators can only react to problems rather than proactively working to avoid them, and downtime becomes only a matter of time.

How it helps

An SRE approach minimizes the cost, time, and effort of the software development process by continuously improving the underlying system. The system continuously measures and monitors the infrastructure and application components. When something goes wrong, it shows Site Reliability Engineers when and where the problem occurred and how to fix it. By automating operational tasks, this approach helps create highly scalable and reliable software systems.
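As a rough illustration of the "measure, monitor, and point to the problem" idea, the sketch below aggregates health-check results per component and reports which components exceed an error-rate threshold and when they first failed. It is a minimal sketch under stated assumptions, not how any particular monitoring tool works: the `Probe` and `Alert` types, the `find_alerts` function, and the 10% default threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Probe:
    """One health-check result (hypothetical shape, for illustration only)."""
    component: str   # where: which component was probed
    timestamp: int   # when: seconds since an arbitrary start
    ok: bool         # whether the probe succeeded


@dataclass
class Alert:
    """What an engineer would be pointed to (hypothetical shape)."""
    component: str
    first_failure: int
    error_rate: float


def find_alerts(probes: List[Probe], max_error_rate: float = 0.1) -> List[Alert]:
    """Return one alert per component whose error rate exceeds the threshold.

    The threshold is an assumed value; real systems tune it per service.
    """
    totals: Dict[str, int] = {}
    failures: Dict[str, int] = {}
    first_failure: Dict[str, int] = {}

    for p in probes:
        totals[p.component] = totals.get(p.component, 0) + 1
        if not p.ok:
            failures[p.component] = failures.get(p.component, 0) + 1
            # Remember the earliest failing timestamp for this component.
            prev = first_failure.get(p.component)
            if prev is None or p.timestamp < prev:
                first_failure[p.component] = p.timestamp

    alerts: List[Alert] = []
    for component, total in sorted(totals.items()):
        failed = failures.get(component, 0)
        rate = failed / total
        # Only alert when there was at least one failure above the threshold.
        if failed > 0 and rate > max_error_rate:
            alerts.append(Alert(component, first_failure[component], rate))
    return alerts


if __name__ == "__main__":
    sample = [
        Probe("api", 100, True),
        Probe("api", 160, False),
        Probe("api", 220, False),
        Probe("db", 100, True),
        Probe("db", 160, True),
    ]
    for alert in find_alerts(sample):
        print(alert)
```

In this sample only the `api` component would be flagged, since `db` has no failed probes. A production setup would normally rely on a dedicated monitoring and alerting stack rather than hand-written aggregation like this.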


Last modified November 30, 2023: chore: remove duplicated lines (e57ed31)