All about Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE) is common in large companies, but smaller companies need it too. Although the site reliability engineer role has become prevalent in recent years, many people, even in the software industry, still do not know what it is or what it does.

This article clarifies these points by explaining what site reliability engineering (SRE) is, how it relates to DevOps, and how an SRE works when your entire engineering organization can fit into one coffee shop.

What is Site Reliability Engineering?

The book Site Reliability Engineering: How Google Runs Production Systems, written by a group of Google engineers, is considered the definitive book on site reliability engineering (SRE).

Google vice president of engineering Ben Treynor Sloss coined the term in the early 2000s and defined it as: “This is what happens when you ask a software engineer to design an operations function.”

System administrators have been writing code for a long time. For many of those years, however, a team of system administrators could still manage its machines largely by hand.

Back then, “many” machines might have meant tens or hundreds. When you scale to thousands or hundreds of thousands of hosts, you can’t keep throwing people at the problem.

So, when the number of machines gets this big, the obvious solution is to use code to manage the hosts (and the software that runs on them).

Also, until recently, the operations team was separate from the developers, and the skill sets for each job were completely different. The SRE function tries to bring the two jobs together.

However, before we delve into what an SRE is and how SREs work with the development team, we need to understand how site reliability engineering works within the DevOps paradigm.

Site Reliability Engineering (SRE) and DevOps

At its core, site reliability engineering (SRE) is an implementation of the DevOps paradigm. However, there is a wide variety of ways to define DevOps.

In the traditional model, development (“dev”) and operations (“ops”) were separate teams, which meant that the team writing the code was not responsible for how it behaved once customers started using it.

The development team would “throw the code over the wall” for the operations team to install and support.

This arrangement can lead to significant dysfunction, because the goals of the development and operations teams are constantly in conflict.

For example, developers want customers to use the “latest and greatest” code, whereas the operations team wants a stable system with as few changes as possible.

The operations team’s premise is that any change can introduce instability, whereas a system that does not change should continue to behave as it did before. Note, however, that minimizing software changes is only one of many factors in preventing instability.

For example, if your web application stays the same, but the number of customers grows 10 times, your application can fail in many ways.

Where does DevOps come in?

The premise of DevOps is that by merging these two distinct functions into one, you eliminate the conflict.

After all, if the “dev” wants to deploy new code all the time, the “dev” will also have to deal with whatever failures the new code creates.

As Amazon’s Werner Vogels said, “You build it, you run it.” However, developers already have a lot to worry about.

They are continually pressured to develop new features for their employers’ products, so asking them to also understand the infrastructure, including how to deploy, configure, and monitor the service, can be a bit much. This is where an SRE comes in.

The role of SRE in practice

When a web application is developed, many people usually contribute, including:

User interface designers.
Graphic designers.
Front-end engineers.
Back-end engineers and various other specialists (depending on the technologies used).

In addition, there are requirements for how the code is operated (e.g., deployed, configured, monitored). These are precisely the areas relevant to site reliability engineering (SRE).

It is important to remember that an engineer who develops a good look and feel for an application benefits from understanding the back-end engineer’s work, for example, how data is obtained from a database. Likewise, the SRE understands how the deployment system works and how to adapt it to the specific needs of that codebase or project.

So, an SRE is not just “an operations person who codes”. Instead, the SRE is another development team member with a different set of skills.

These skills revolve around deployment, configuration management, monitoring, metrics, and so on. However, just as a front-end engineer still needs to know what the data coming from a data store looks like, these areas are not the SRE’s responsibility alone. The entire team works together to deliver a product that can be easily updated, managed, and monitored.

So, the need for an SRE arises naturally when a team is implementing DevOps but realizes that it demands too much from developers and needs an expert in the work the operations team used to do.

Conclusion

A site reliability engineering (SRE) team is one of the most efficient ways to implement the DevOps paradigm in a startup.

After all, hiring a dedicated SRE early on in your startup will free up time for developers to focus on their specific challenges.

Meanwhile, the SRE can focus on improving the tools and processes that make developers more productive. In addition, an SRE will focus on ensuring that your customers have a reliable and secure product.

Discover 7 key terms in site reliability engineering (SRE)

Want to understand site reliability engineering (SRE) further? Then check out this SRE terminology primer, which explains some of its fundamentals, from job roles to SLAs and problem-solving.

It is worth noting that site reliability engineering (SRE) can be:

A way to close the gap between software developers and IT operations teams.
A way to improve workflows and resilience for teams already practicing DevOps.

Or, as Google, which established the term and the concept of SRE, puts it: SRE is what happens when an organization treats IT operations as a software problem.

By doing this, a company will see many benefits in its development pipeline. For example, SRE reduces manual effort through automation and improves software quality.

This, in turn, increases the system’s reliability, repeatability, and flexibility. A site reliability engineering (SRE) team also addresses and improves other aspects of the IT ecosystem, such as overall performance, availability, troubleshooting, and monitoring.

However, before adopting an SRE approach, it is important to understand some key terms.

Site reliability engineer

A site reliability engineer bridges the gap between developers and the IT operations team. The role’s purpose is to build and maintain the scalability, stability, and predictability of an organization’s systems.

To that end, SREs automate routine tasks, manage production changes, and determine how to respond to emergencies, among other tasks.

Toil

The tasks that keep the IT platform running are essential, but performing them manually is not. Reducing this manual, repetitive work, known as toil, is one of the main goals of SRE. Applying patches and updates by hand is a typical example of toil that can be automated.
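As a rough illustration of replacing a manual routine with code, the sketch below patches only the hosts that are behind a required version. The inventory layout, the host names, and the `apply_patch` callback are all hypothetical; in practice this job would usually be handled by a configuration management or deployment tool.

```python
from typing import Callable


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as "1.4.0" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def patch_fleet(inventory: dict[str, str], required: str,
                apply_patch: Callable[[str, str], None]) -> list[str]:
    """Patch every host running a version older than `required`.

    `inventory` maps a host name to its installed version (hypothetical
    layout). Returns the names of the hosts that were patched.
    """
    patched = []
    for host in sorted(inventory):
        if parse_version(inventory[host]) < parse_version(required):
            apply_patch(host, required)   # stand-in for the real upgrade step
            inventory[host] = required    # record the new state
            patched.append(host)
    return patched


# Hypothetical fleet: only the hosts below 1.4.0 are touched.
fleet = {"web-1": "1.2.0", "web-2": "1.4.1", "db-1": "1.3.9"}
done = patch_fleet(fleet, "1.4.0", lambda host, v: print(f"patching {host} to {v}"))
print("patched:", done)
```

The point is less the patching logic than the shape of the solution: the same function handles three hosts or three thousand without additional manual effort.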

Error budget

One hundred percent availability is an unrealistic standard. Since no service is perfect, organizations must set a performance standard, whether internal, external, or both.

The gap between that standard and perfection, that is, the acceptable amount of downtime, is called the error budget. The site reliability engineer’s responsibility is to keep the service within this budget by tracking key metrics and carrying out the necessary operational tasks.
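The arithmetic behind an error budget is simple enough to sketch. The example below assumes a time-based availability target measured over a 30-day window; both the window length and the sample targets are illustrative choices, not values from any particular contract.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability target.

    `slo_target` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


for target in (0.99, 0.999, 0.9999):
    budget = error_budget_minutes(target)
    print(f"{target:.2%} over 30 days allows about {budget:.1f} minutes of downtime")
```

For instance, a 99.9% target over 30 days (43,200 minutes) leaves roughly 43 minutes of acceptable downtime, which shows how quickly the budget shrinks with each additional nine.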

Service Level Agreement (SLA)

This is the contract between a provider and a customer that defines the expectations of both parties.

For example, SLAs set standards for services, such as level of availability, so that customers understand the provider’s responsibility for outages or performance issues.

The provider is exempt if an issue is outside the severity levels or circumstances defined in the SLA.

Within the contract, there is usually a financial penalty for failing to meet these standards. This establishes accountability between the provider and the customer.

Service Level Objective (SLO)

Rather than being a separate contract, SLOs are typically part of the SLA. They specify the target values for the key performance indicators that the customer should expect from the provider; the SLA then sets out the consequences, such as penalties, if those expectations are not met.

In other words, SLOs define an acceptable performance range for key metrics such as disaster recovery time and application availability.

It is up to an SRE to align these SLO-defined performance targets with the organization’s error budget. This involves monitoring the metrics behind each SLO and setting up alerts on them.
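One way to picture the link between an SLO, the error budget, and alerting is the minimal sketch below. It assumes a request-based SLO (a target share of successful requests) and a hypothetical rule that raises an alert once 80% of the budget is used; real monitoring systems usually apply more elaborate rules.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float   # failures the SLO tolerates for this traffic
    consumed_fraction: float  # share of the error budget already used
    alert: bool               # True once consumption reaches the threshold


def check_error_budget(total_requests: int, failed_requests: int,
                       slo_target: float = 0.999,
                       alert_threshold: float = 0.8) -> BudgetStatus:
    """Compare observed failures with what the SLO allows."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = (1.0 - slo_target) * total_requests
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        # A 100% target leaves no budget: any failure exhausts it.
        consumed = 0.0 if failed_requests == 0 else float("inf")
    return BudgetStatus(allowed, consumed, consumed >= alert_threshold)


status = check_error_budget(total_requests=1_000_000, failed_requests=850)
print(f"budget used: {status.consumed_fraction:.0%}, alert: {status.alert}")
```

With a 99.9% target, one million requests allow about 1,000 failures, so 850 failures already cross the assumed 80% threshold.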

Service Level Indicator (SLI)

Another typical component of an SLA, an SLI is a direct measurement of a service’s behavior that indicates the level of performance the customer receives.

The provider and the customer define them together. SLIs are the foundation of SLOs. Examples of SLIs include latency, error rate, and availability.

However, the metrics must be chosen with care so that the ones critical to the specific environment or user base are not lost among the rest.
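To make the three example SLIs concrete, here is a minimal sketch that derives them from a list of request records. The `Request` record, the choice to count HTTP 5xx responses as errors, and the use of a nearest-rank 95th percentile for latency are assumptions made for illustration; each organization defines its own SLIs.

```python
from typing import NamedTuple


class Request(NamedTuple):
    latency_ms: float
    status: int  # HTTP status code


def compute_slis(requests: list[Request]) -> dict[str, float]:
    """Compute availability, error rate, and p95 latency for a batch."""
    if not requests:
        raise ValueError("need at least one request")
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumed error rule
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile: ceil(0.95 * total), done in integer math.
    rank = (95 * total + 99) // 100
    return {
        "availability": (total - errors) / total,
        "error_rate": errors / total,
        "latency_p95_ms": latencies[rank - 1],
    }


# Twenty synthetic requests, one of which fails with a 503.
sample = [Request(10.0 * i, 200) for i in range(1, 21)]
sample[4] = Request(50.0, 503)
print(compute_slis(sample))
```

An SLO is then a target placed on one of these values, for example an error rate below a chosen threshold over a given window.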

IT incident post-mortem

The upside of IT incidents is the opportunity to learn and improve.

An incident post-mortem assesses the root cause of a problem and its effects, reveals important lessons, and helps the IT team establish a plan to prevent a recurrence.

SREs are responsible for conducting these post-mortems and sharing the results with relevant stakeholders, such as executive leaders, engineers, and architects.

In addition, successful post-mortems are blameless: they focus the discussion on why the problem occurred rather than on who caused it, which creates an environment where the team feels comfortable talking honestly about incidents.

Wildcard is a no-code platform that helps organizations and developers, even those without DevOps or SRE experience or coding knowledge, implement reliability and stability practices and build, deploy, and manage applications without writing a single line of code.

Start for free by signing up with GitHub or GitLab.
