Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to operations tasks. It borrows a lot of Agile concepts, so it helps if your organization already works with these principles day-to-day.

This long-form post puts together my experiences so far and the things I’ve learned the hard way. It is by no means a complete guide. It is meant to be a high-level overview that helps teams understand more about SRE, to act as a quick how-to, and to be read by both Product and Engineering people.

An engineering-only, mostly technical post will soon be published!

1 General
2 Toil
- 2.1 What is Toil
- 2.2 Eliminating Toil
3 Incidents
4 Metrics and KPIs
5 Onboarding and educating SREs
- 5.1 Educating SREs
- 5.2 Onboarding SREs
Next post!
- Credits
  - Links

1 General

1.1 DevOps vs SRE

Teams that want to offer a reliable service first need to define what “reliability” means for them. In most cases it translates to “availability”. However, availability is only a subset of reliability.

In fact reliability covers more aspects, such as: availability (time), durability (persistence), disaster recovery, fault tolerance.

DevOps is a collection of practices and philosophies. The aim is to make siloed teams collaborate better with each other. It is rather a mindset that teams can adopt.

The essence of SRE, by contrast, is a narrowly focused collection of metrics and practices which allow for better service delivery and team collaboration.

I understand this may be a bit confusing, as anything theoretical is, so I’ll mention a couple of practical differences too. Both try to bridge the gap between development and operations. But DevOps works as a guideline for how teams collaborate to deliver a better product. It focuses more on increasing the speed of development and delivery, and it involves more than just developers and operations people (eg. what about QA?).

SRE provides prescriptive ways to achieve reliability. This is why SRE teams include engineers with a background in both operations and development roles, who can enhance systems reliability as a whole. More precisely, Google’s definition of SRE also includes strategically placing code in the product/service to serve the reliability objective if need be.

1.2 Main principles

Reliability is the most important feature of the service
Metrics and monitoring are used as tools to serve the user’s definition of, and feedback on, what constitutes a “reliable service”
A well-engineered system has a small number of isolated failures
Standardized processes are important when talking about reliability

1.3 Team topologies

The various ways SRE teams are organized have been largely undocumented. Luckily, Google has published an excellent article that gives a 30k-foot overview. Inspired by this article, I will just sum up the basic details of each one for the sake of completeness:

The Kitchen Sink, aka. Everything SRE team

This is the first and only SRE team in existence throughout the entire engineering organization. The scope of services or workflows is mostly unbounded. It usually grows organically along with the org and can split into more narrow-scoped teams or transform to a hybrid model of some kind.

In this topology, it’s easy to spot similarities and draw patterns across services or even products. This is possible because the reliability of the services depends on the same group of people. However, as complexity grows in the company or the product, the team tends to make more shallow contributions instead of fundamental and innovative changes.

It will definitely be the easiest to start with, though. In many cases, since it will be the only SRE team among many development teams, it may act as the glue between development teams and operations, and solutions will be created because of that. Be aware, though, that this easily becomes a blocking point if your goal is to enhance throughput within a reliability-bounded context.

The Infrastructure team

This is a platform team, maintaining central infrastructure (eg. Kubernetes clusters) and responsible for shared components (eg. VPCs and networking, CI/CD templates, etc). Sometimes they will also define standards for the rest of the development teams to follow. This team may also be referred to as the Cloud Operations team.

Having a staffed infrastructure team that does not interfere at all with the customer-facing product of the organization means that there are (several) other development teams surrounding this SRE team. If there is a distinction between a DevOps team and an SRE team, then the former focuses on building internal infrastructure-related software or customizing/deploying OSS, while the latter focuses on the reliability aspect.

The downside of it is that as the org scales, it becomes an anti-pattern.

The internal tools team

This is a team fairly similar to the infrastructure one (or the DevOps). The main idea is that they develop tools used mostly by their development counterparts to measure and support the reliability of systems they build. In practice, tools teams tend to focus more on support and planning systems that have a reliability-oriented feature set.

The challenge with this team type is that the dividing line between an infra team and a tools team is not always clear and there is a risk one becomes the other and vice versa.

This topology scales well, but needs headcount and is considered a luxury for most mid-sized orgs.

The application team

As the name implies, this is the SRE team responsible for the reliability of the organization’s critical products/applications. The team focuses only on that and has a very clear scope. The downside is that as an organization gets larger, more of these teams will be needed, since the number of critical applications will increase.

This model scales well, but can end up in silos.

The embedded team

In this scenario an SRE joins the application team. They share their work environment with the developers and provide hands-on help, eg. changing code and contributing that way to the reliability of the services.

This is a very beneficial setting for the product, since SRE best practices sometimes even drive features, and it encourages knowledge-sharing among the team members. However, the lack of mentorship for the SRE person, as well as the potential divergence of practices among the various development teams, can be tough challenges.

One can overcome silo formation by encouraging cross-tribe collaboration on tasks. Usually, cross-tribe collab attempts are met with managerial resistance and treated as luxury items.

The consulting team

This is an SRE team which provides ad-hoc consulting to development teams. Most of the time they provide hands-on help, but they avoid making code changes in the product. It’s quite common that they maintain their own toolset to use while in communication with their development counterparts. Since they are unattached to any product team, it’s expected that this setting increases the overall SRE posture of the organization.

It’s not all bright, though. The consulting team members may not have the necessary time, and their lack of context for the product poses a significant hurdle. This is why this team topology is mostly encountered in large organizations where demands for SRE skills have outgrown the existing setting.

At an initial stage during the onboarding of SRE practices in an org, or in an early startup phase, this model could also be a very viable solution.

2 Toil

As Bobby put it during Staff+ NY'22:

Toil is taking me away from having FUN. Toil is operational burden!

2.1 What is Toil

AKA request/ticket-based ops

It is all the repeatable, mundane work an SRE has to go through at regular intervals. Imagine having to set up an RDS DB instance each and every time a developer team filed a Jira ticket for this purpose, only changing some minor details every time.

It is the automatable work that is still done manually. Why spin this RDS DB instance up by hand when there are eg. Kubernetes operators and/or CI/CD workflows?

It is work that is only an overhead to the daily schedule, with no enduring value, as it is almost fire-and-forget.

And when the service grows, this work grows linearly with it! Sadly, if it were an algorithm, its complexity would be O(n).

Some real-world examples include

Applying database migrations on a new release (or an existing one)
Helping a user gain access to X instance or resetting their credentials
Increasing memory on X service and handling quota requests
Reviewing non-critical monitoring alerts

If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow. (Carla Geisser, Google SRE)

The effects of toil are almost always negative. Off the top of my head, hiring becomes more difficult. Let me explain.

Heavy operational work means less time for engineering work, for interesting, meaningful work. If this continues for a prolonged period of time, it sets a precedent. It also slows the progress of the team, since everyone is less productive. Engineers are notorious for having low morale when they don’t feel productive. In this chain of events, talent attrition is inevitable, while new hires will feel betrayed, cheated.

AKA in constant fire-fighting mode.

2.2 Eliminating Toil

As you can see, toil is always around and, whatever you do, it will keep coming back. It is absolutely necessary to have an active and ongoing concern to eliminate toil and keep it under the desired threshold.

A good starting point is to define this desired threshold within your organization. Google has an advertised goal of keeping toil below 50%, meaning less than 50% of time spent on operational burden. Therefore more time should be spent on engineering work, either to reduce toil or to invest in service features.

To address toil (or tech debt), the important step is to document it and break it down into tickets in the backlog.

Once your organization has defined the threshold, things can be quantified. There is only a certain number of items in your org backlog. Cherry-pick the ones that can be automated or eliminated altogether. Invest in automating whatever can be automated. Start with the tasks you do frequently. Avoid touching the ones you rarely work on, as the assumptions or the procedure behind them might change from time to time.
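A minimal sketch of how this quantification might look, assuming the team logs how often each recurring task comes up and how long it takes. The task names, the numbers and the `ToilTask` structure are hypothetical; only the 50% threshold comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    name: str
    runs_per_month: int     # how often the task comes up
    minutes_per_run: float  # hands-on time each time it is done

    @property
    def minutes_per_month(self) -> float:
        return self.runs_per_month * self.minutes_per_run


def toil_fraction(tasks, team_minutes_per_month):
    """Share of the team's time that goes into toil (0.0 to 1.0)."""
    return sum(t.minutes_per_month for t in tasks) / team_minutes_per_month


def automation_candidates(tasks):
    """Most time-consuming recurring tasks first."""
    return sorted(tasks, key=lambda t: t.minutes_per_month, reverse=True)


# Hypothetical backlog, for illustration only.
backlog = [
    ToilTask("reset user credentials", runs_per_month=30, minutes_per_run=10),
    ToilTask("provision RDS instance", runs_per_month=12, minutes_per_run=45),
    ToilTask("rotate archive bucket keys", runs_per_month=1, minutes_per_run=20),
]

TOIL_THRESHOLD = 0.50        # the advertised goal mentioned above
team_minutes = 4 * 160 * 60  # assumed: 4 engineers, 160 working hours each

fraction = toil_fraction(backlog, team_minutes)
ranked = automation_candidates(backlog)
print(f"toil: {fraction:.1%}, over threshold: {fraction > TOIL_THRESHOLD}")
print("automate first:", ranked[0].name)
```

Ranking by total minutes per month, rather than by frequency alone, keeps a rare but very long task from hiding behind many short ones.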

Treat your automation like any other production system. Can this procedure get repeated multiple times with no side effects? Is there a way to test it? Is there a way to verify the results of the procedure?
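The three questions above can be illustrated with a small sketch. It assumes a plain dict (`inventory`) standing in for a real provisioning API, and the function names are hypothetical. The point is only the shape: the procedure converges to a desired state, so repeating it changes nothing, and verification is a separate step.

```python
def ensure_database(inventory: dict, name: str, size_gb: int) -> bool:
    """Make sure `name` exists with `size_gb`; return True if anything changed.

    `inventory` is a plain dict standing in for a real provisioning API.
    """
    desired = {"size_gb": size_gb}
    if inventory.get(name) == desired:
        return False            # already in the desired state: do nothing
    inventory[name] = desired   # create or update to the desired state
    return True


def verify_database(inventory: dict, name: str, size_gb: int) -> bool:
    """Check the result independently of the code path that made the change."""
    return inventory.get(name, {}).get("size_gb") == size_gb


inventory = {}
first = ensure_database(inventory, "orders-db", 50)
second = ensure_database(inventory, "orders-db", 50)  # repeat: no side effects
verified = verify_database(inventory, "orders-db", 50)
print(first, second, verified)
```

Because the function reports whether it changed anything, it is also easy to test: the second call must return False.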

Split the team between firefighting and toil work. Google SRE advice: switch roles in a frequent rotation to avoid low morale.

However, toil is as much a cultural thing as it is technical. Make sure you limit the number of people involved in toil work, eg. to your on-callers. This will help at least a portion of the team continue doing engineering work, and in the long run the org avoids sacrificing dev velocity.

3 Incidents

While I know that especially Product people would love their services to run smoothly all the time, this is practically unrealistic. Every modern sophisticated system is guaranteed to have issues which will eventually lead to incidents.

It depends on how your org views incidents. I personally prefer to consider them an opportunity to see things. In the long run, people get more experienced (and promoted) when they have seen enough things to gain confidence and real-world exposure in what they do.

But let’s start with this: what is an incident?

ITIL 2011 defines an incident as:

“an unplanned interruption to an IT service or reduction in the quality of an IT service or a failure of a Configuration Item that has not yet impacted an IT service [but has potential to do so]”

It becomes clear that, in order to maintain acceptable service levels, the organisation needs an incident response plan to resolve incidents and restore normal operations.

3.1 Incident Response plan

Incident response plans vary, since each org is not only different in terms of hierarchy but also in size.

However, setting up a plan does not need to be a daunting task. The basic principles, as also seen in Google’s SRE book, are:

Maintain a clear line of command.
Designate clearly defined roles.
Keep a working record of debugging and mitigation as you go.
Declare incidents early and often.

3.2 On-call

Who doesn’t love to get paged? Said no one, ever.

Yet being on-call is a mission-critical duty that many organisations undertake in order to maintain service quality and availability. Having on-call as part of an org’s incident response plan is kind of assumed in every production-offering organisation.

Being on-call as an engineer means taking care of the team’s assigned operations by managing outages and performing and/or vetting production changes.

Response times differ based on the criticality of the product, SLAs with customers and the product needs. For example, processing the analytics of a post in a social media organisation can survive a 1h response time during an incident, while users being unable to post new content or load existing content is something that should be addressed within 15 minutes of occurrence.

There are several platforms, eg. PagerDuty, that help with managing shifts and rotations for on-callees. One of the perks of those platforms is that you don’t have to remember on-call shifts or keep track of them in spreadsheets, and overrides can easily be arranged in case something unexpected happens in someone’s personal life.

Usually there are two rotations: primary and secondary. The primary rotation is the one paged first during an incident, with the secondary alerted later on. I have found it a good idea to have a process in place to reach out to all other on-callees within the organisation. This can help quickly spread news that is very important for the stability of the service, or give a high-severity heads-up to every other engineering team. Or, you know, simply ask for help while panicking in the middle of the night (which can happen).
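As a rough sketch, the primary/secondary escalation described above could be modelled like this. The delays and the tier names are assumptions made up for the example; paging platforms let you configure their own equivalents.

```python
from datetime import timedelta

# Assumed policy: each tier is paged once the incident has been
# unacknowledged for at least this long. The delays are hypothetical.
ESCALATION_POLICY = [
    (timedelta(minutes=0), "primary"),
    (timedelta(minutes=15), "secondary"),
    (timedelta(minutes=30), "all-on-callees"),
]


def tiers_to_page(unacknowledged_for: timedelta) -> list:
    """Return every tier that should have been paged by now."""
    return [
        tier
        for delay, tier in ESCALATION_POLICY
        if unacknowledged_for >= delay
    ]


print(tiers_to_page(timedelta(minutes=5)))
print(tiers_to_page(timedelta(minutes=20)))
```

Keeping the policy as data rather than as nested conditions makes it easy to review and change alongside the on-call schedule.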

An on-call schedule requires adequate compensation for out-of-hours support. Different organisations handle this differently, as it also varies by local legislation, unions or industry.

A tip from our beloved Google SRE: include developer (tier-3) on-call rotations, as it is critical to have a war room with depth of application and business knowledge.

3.3 Post-mortems

It is crucial to run a post-mortem process after every non-trivial incident occurrence. There are many flavors of how to run post-mortems in an org. However, here are the 3 most important elements that every post-mortem should include:

Timeline

It is important to write down the timeline of events that led to the incident: how, by whom and when it was identified, and which decisions led to which results. These notes might appear useless at the time of writing, but they prove valuable for retrospectives and give a lot of perspective to those reviewing similar cases in the future.

Lessons learned

Since a series of events led to an incident, there are clearly opportunities for improvement, or lessons learned. A technical learning can often be applied to an entirely different part of the system, which makes it all the more important to write down lessons learned that could benefit other engineering teams in the organization.

Course of action

It does not matter whether actions were corrective or preventive; everything is important in this phase. Many incidents are resolved by temporary patches which then need to be reworked, or they require an engineering team to invest time in research to figure out a proper, long-term solution.

When running post-mortems, it is really important to highlight that they need to be entirely blameless. Otherwise, cultural and technical problems will quickly arise, which does not benefit the longevity of the organisation. Blamelessness can be achieved through transparency and knowledge-sharing processes. As mentioned before, it is quite possible that the technical solution to one incident can be applied to another part of the system.

4 Metrics and KPIs

There are multiple tools that help us measure availability and, essentially, the level of service provided. However, the metrics outlined below are based on the assumption that a service cannot realistically reach 100% availability. Instead, the service provider guarantees that the system will be available more than a certain percentage of the time, eg. 99%, and uses specific tools to measure how many faults they can tolerate while still reaching these goals.

These are not just useful abstractions. These provide information regarding how reliable, available or even useful a system is.

4.1 SLO

SLO stands for Service Level Objective.

It is targeting a specific level of reliability. In other words, it is a goal the service provider aspires to reach.

An SLO is a narrowed-down metric that is defined as part of an SLA (explained further below). As an example, targets on response time or uptime are potential SLOs.

In order to avoid ambiguity, they need to be clear and specific. Only the most important metrics should qualify for SLO status, and the objectives should be able to be communicated and explained quickly between the members of an engineering team.

4.2 SLI

SLI stands for Service Level Indicator. It is the measurement the service provider uses to track a goal. If you want to know how reliable a service is, you must be able to accurately measure the rates of successful vs unsuccessful requests on that service.

It is usually described by the following formula:

SLI = (successful events) / (all valid events)

What counts as a successful event and what counts as a valid event should be defined by the engineering team. The technology stack your team uses to deliver its service also factors into this equation.
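Here is a minimal sketch of the formula for an HTTP service. The definitions of "valid" and "successful" below are assumptions chosen for the example (4xx responses are treated as client mistakes and excluded, 5xx responses are failures); your team should pick its own.

```python
def sli(status_codes):
    """SLI = successful events / all valid events, for HTTP status codes.

    Assumptions for this sketch: 4xx responses are client mistakes and
    do not count as valid events; among the valid ones, anything below
    500 is a success.
    """
    valid = [code for code in status_codes if not 400 <= code < 500]
    if not valid:
        return None  # nothing to measure in this window
    successful = [code for code in valid if code < 500]
    return len(successful) / len(valid)


# 97 good responses, 2 client errors (ignored), 3 server errors.
codes = [200] * 97 + [404] * 2 + [500, 503, 502]
print(f"{sli(codes):.2%}")
```

Returning `None` for an empty window avoids reporting a misleading 0% or 100% when there was no traffic at all.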

The main challenge with the SLI metrics is how to keep them as simple as possible, while also keeping the system free of unnecessary complexity or metrics.

4.3 SLA

Most of you are probably more familiar with the term SLA. SLA stands for Service Level Agreement. It is essentially an agreement with a customer about the reliability of your service. It has consequences, monetary or otherwise, in case of violation.

An SLO and an SLA are quite similar concepts; however, an SLO is tighter than the SLA, since the service provider needs to ensure they can actually reach the level of service outlined in the SLAs with customers. An SLO therefore makes sure the service provider always stays in compliance with those agreements.

A real-world example: a service provider is offering their song search product as a paid API. As money is involved, they need to outline to their customers what kind of availability they can expect from this API, so that customers can manage their expectations accordingly. Hence, a contractual obligation describes what happens if this agreement is breached. The service provider may say that they provide the service at a minimum of 99.9% availability. If the service drops below this point during a month, a refund/credit is issued to compensate for it.
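To make the relation between the two concrete, here is a small sketch using the 99.9% SLA from the example. The internal SLO value of 99.95% and the wording of the statuses are assumptions for illustration.

```python
SLA_TARGET = 0.999   # contractual minimum from the example above
SLO_TARGET = 0.9995  # assumed internal objective, stricter than the SLA


def monthly_status(availability: float) -> str:
    """Classify a month's measured availability against SLO and SLA."""
    if availability < SLA_TARGET:
        return "SLA breached: refund/credit is due"
    if availability < SLO_TARGET:
        return "SLO missed: SLA still met, act before it is not"
    return "within SLO"


for measured in (0.9999, 0.9992, 0.998):
    print(f"{measured:.2%} -> {monthly_status(measured)}")
```

The gap between the two targets is what gives the team an early warning: a missed SLO costs nothing contractually, but it signals that the SLA is at risk.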

It is important to highlight that while SLAs are relevant for paying customers, SLOs are highly relevant in non-paid systems, eg. an internal CRM within an organization.

4.4 Error Budget

Incidents are in fact unavoidable. Every engineering team knows that. This is a strong motivation to never promise 100% uptime in an SLA, since this would be a promise the team would not be able to keep.

For example, an SLA may promise 99% uptime, and contractual obligations derive from that promise. The remaining 1% is called the error budget: the maximum amount of time a system can be down without triggering monetary or other consequences. It is the time the team can take to run experiments, run maintenance or just fix their systems before they violate a term of their agreements. Not every month is a bad month, either. The engineering team may kill it with 99.99% uptime during a certain month, in which case downtime was only .01% and the other .99% of the allowance went unspent.

Error budgets are not just a convenient way to make sure you’re meeting contractual promises. They’re also an opportunity for development teams to innovate and take risks. [2]

In order to understand the situation well enough and accurately define error budget, you need to look further into your SLAs and SLOs. This is because you need to be aware of promises made earlier to customers, as well as objectives your engineering team has set.

Example: the SLA states 99.5% uptime. That translates to roughly 3 hours, 39 minutes of allowed downtime every month (taking an average month of about 30.44 days).
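The arithmetic behind this example can be sketched as follows. It assumes an average month of 365.25 / 12 days; with a flat 30-day month the result would be 3 hours, 36 minutes instead. The function names are made up for the example.

```python
from datetime import timedelta

AVERAGE_MONTH = timedelta(days=365.25 / 12)  # assumption: about 30.44 days


def error_budget(target: float, period: timedelta = AVERAGE_MONTH) -> timedelta:
    """Maximum downtime allowed in `period` for a given uptime target."""
    return period * (1 - target)


def remaining_budget(target, observed_uptime, period=AVERAGE_MONTH):
    """Budget left after the downtime already observed in the period."""
    return error_budget(target, period) - period * (1 - observed_uptime)


budget = error_budget(0.995)
hours, remainder = divmod(int(budget.total_seconds()), 3600)
minutes = remainder // 60
print(f"99.5% uptime allows {hours} h {minutes} min of downtime per month")

left = remaining_budget(target=0.99, observed_uptime=0.9999)
print(f"99% target, 99.99% observed: {left} of budget left")
```

Tracking the remaining budget, rather than only the total, is what lets a team decide during the month whether a risky change or a maintenance window is still affordable.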

To wrap this up, error budgets are important because they encourage teams to minimize incidents while also taking risks within acceptable limits. They kind of bridge the gap between development and operations: as long as the uptime is high, the development teams can remain agile and keep pushing changes.

Bonus: you can extend the error budget idea to the 4 KPIs that measure DR plans, RTO/RTA and RPO/RPA: Recovery Time Objective/Actual and Recovery Point Objective/Actual.

4.5 Fault Tolerance

As we have mentioned multiple times by now, faults are unavoidable in any software system. Hence we need to build systems that help us tolerate faults or recover from them.

Fault tolerance represents the capability of a system to sustain its operation even in the presence of faults. Systems with high fault tolerance are able to continue operating fully or partially upon the occurrence of a fault. For this to work in practice, a system must be designed with no Single Point of Failure (SPOF).

Offering a fault tolerant system does not mean that it can tolerate all kinds of faults. Instead, a good system design should match the level of tolerance with the criticality of the fault, so that it achieves an overall optimization of cost and resources.
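As a rough illustration of "continue operating fully or partially", here is a sketch that tries redundant replicas in turn and degrades to a fallback value when all of them fail. The replica functions and the cached fallback are hypothetical stand-ins; a real system would catch narrower errors and add timeouts.

```python
def fetch_with_fallback(replicas, fallback=None):
    """Try each replica in turn; degrade to `fallback` if all of them fail.

    Returns the value together with the errors seen along the way, so
    the faults can still be reported even though they were tolerated.
    """
    errors = []
    for replica in replicas:
        try:
            return replica(), errors
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    return fallback, errors


def broken():
    raise ConnectionError("replica down")


def healthy():
    return "fresh result"


value, errors = fetch_with_fallback([broken, healthy], fallback="cached result")
print(value, len(errors))

degraded, all_errors = fetch_with_fallback([broken, broken], fallback="cached result")
print(degraded, len(all_errors))
```

The first call tolerates one fault and operates fully; the second operates partially by serving the fallback.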

Fault detection refers to the capability of a system to sense a fault in a timely manner within its operation. There are several metrics that get measured and tracked to provide a better understanding of how faults are handled. Below are the ones I consider the most useful [3]:

Mean time to Repair (MTTR): the average time it takes to restore a broken system
Mean time between failures (MTBF): the average time between one failure of a system and the next
Mean time to failure (MTTF): how long we expect a system to remain functional until it faults
Mean time to detect (MTTD): the average time it takes to detect a failure in a system
Mean time to Investigate (MTTI): the average time it takes to investigate a fault after it has become known to the organization
Failure Rate: the frequency with which a system fails
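A few of these metrics can be computed directly from incident records, as in the sketch below. The `Incident` structure and the sample data are hypothetical, and exact definitions vary between organizations: here repair time is counted from the start of the fault, and MTBF is measured from one fault start to the next.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when someone (or something) noticed
    resolved: datetime  # when service was restored


def _mean_delta(deltas):
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttd(incidents):
    return _mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents):
    # Assumption: repair time is counted from the start of the fault.
    return _mean_delta([i.resolved - i.started for i in incidents])


def mtbf(incidents):
    # Assumption: measured from one fault start to the next.
    # Needs at least two incidents to have a gap to average.
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [b.started - a.started for a, b in zip(ordered, ordered[1:])]
    return _mean_delta(gaps)


# Hypothetical incident history, for illustration only.
incidents = [
    Incident(datetime(2023, 1, 1, 10, 0),
             datetime(2023, 1, 1, 10, 10),
             datetime(2023, 1, 1, 11, 0)),
    Incident(datetime(2023, 1, 11, 10, 0),
             datetime(2023, 1, 11, 10, 20),
             datetime(2023, 1, 11, 10, 30)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

Whatever definitions you choose, keep them fixed over time, since the trend of these numbers is usually more informative than any single value.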

4.6 Best practices and how-to

Observability plays a key role in measuring and tracking a system’s reliability and its KPIs. Although it heavily depends on an engineering team’s tech stack, some principles remain the same across all options.

I have gathered in the following list the #1 priority work a team needs to do in order to follow best practices:

Clearly define SLIs and SLOs and make sure they are part of building your system requirements.
Invest in building a metrics system. Prometheus has gained a lot of popularity in the Kubernetes world (and beyond), but other systems exist out there too.
Match metrics with SLOs. Make sure you gather and track these metrics.
Trigger alerts based on your SLOs. Make sure alerts reach the correct channels, eg. Slack during work-hours and your favorite paging platform during on-call shifts.
Measure the time it takes to resolve incidents.
Keep track of how frequent incidents are and how long it takes between two consecutive incidents.
Invest in tools to bring down MTTD, in favor of your error budget
Make sure to define your error budget by provisioning time to run maintenance tasks as well as to keep innovating at a level acceptable for your organization.
Minimize alert fatigue by iteratively filtering out unnecessary messages on each sprint. If an alert does not lead to an actionable item, it is not alert-worthy, or it is a candidate for the low-criticality list that can be muted on demand. This is crucial because when alerts > actions, it’s usually when incidents can go unnoticed.
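The alert review in the last item can be sketched as a small script. It assumes you can export a log of which alerts fired and whether each firing led to an action; the log format, the alert names and the 50% cut-off are all hypothetical.

```python
from collections import Counter


def noisy_alerts(alert_log, min_actionable_ratio=0.5):
    """Return alert names whose share of actionable firings is below the ratio.

    `alert_log` is an iterable of (alert_name, led_to_action) pairs.
    """
    fired = Counter()
    acted = Counter()
    for name, led_to_action in alert_log:
        fired[name] += 1
        if led_to_action:
            acted[name] += 1
    return sorted(
        name
        for name in fired
        if acted[name] / fired[name] < min_actionable_ratio
    )


# Hypothetical export from one sprint.
log = [
    ("disk-90-percent", False),
    ("disk-90-percent", False),
    ("disk-90-percent", True),
    ("api-error-rate", True),
    ("api-error-rate", True),
]

print(noisy_alerts(log))
```

The output is a list of candidates for the low-criticality list, not an automatic mute: a human should still decide what happens to each alert.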

5 Onboarding and educating SREs

As an organization continues growing, there is an ever-increasing demand for offering reliable products and services. However, the earlier stages of an org’s lifecycle do not involve a lot of good decisions regarding reliability, since the main business priority is to onboard more customers and offer shiny new features which will funnel in much needed capital. Until that moment.

That moment when everyone realizes there has to be a better way to do things. That looking at Excel sheets should be long gone, replaced by measuring stuff effectively and taking actionable points out of everything. Or a huge customer demands an SLA to sign, and you have to figure out as soon as possible what SLOs to set for your product.

It is when the term “reliability” starts coming up increasingly in team discussions.

Most of the companies I have worked with so far start with some variation of an infrastructure team or the everything-SRE team: the team responsible for provisioning, maintaining and supporting all-things-infrastructure, which may or may not be practically engaged with code and configuration changes of the product.

So how does an organization make the leap towards incorporating the SRE best practices? How do SRE teams form and operate?

I have expanded on a few ideas on how to get things started in the Educating SREs part, and there is a brief analysis of the qualities I’d be looking for when hiring and onboarding SREs.

5.1 Educating SREs

At some point there is a group of people - be it a formal team or a cross-team task force - who deal with the infrastructure. If you feel like there is enough operations skill hidden in the development teams as well, make it an open invite and gather these people up.

Then comes the implementation part. I am a huge fan of informal leadership compared to flat-out corporate training programs, and - in my opinion - it has much better end results.

These are the steps I would take in such a scenario:

Convey the message. Make it clear to everyone that the scope of the (cross-)team is reliability, and define reliability in both business and technical terms. It’s quite common for people, even members of the same team, to have slightly different definitions, which will become a problem if it is not sorted out as early as possible.
Involve business/commercial people. They should know the customers’ perspective regarding reliability in the organization’s industry and clientele. There is no one-size-fits-all. Healthcare is expected to have much higher reliability requirements for its products than an advertising product. Decide on the reliability metrics together and set up an SLA basis, by clearly defining the SLOs underneath. More on that in the Metrics and KPIs section.
Involve outside experts. Arrange for someone with evident expertise on the topic to join the organization for an “inspirational talk” that lays out some initial details and ignites useful discussions. A hands-on workshop would be an even better approach, since anyone can ask direct questions and get answers on org-specific challenges. Outside consulting on how to apply SRE, with some initial hands-on help, is usually a fast way to hit the ground running.
Decide on a training path. Create your own internal documentation for SREs and roll out a mentorship program for newcomers in the role. The idea is to raise awareness about reliability throughout the whole tech organization and encourage people to either take on an SRE hat or at least have reliability in mind when designing and architecting a new service. This will be applicable on a recurring basis, too.
Brainstorm on SRE team topologies. Which is the right one for your organization? Nobody can give a straight answer because there is no right or wrong. My personal recommendation is to try out the topologies that seem to be working for your team structure and measure the effectiveness of each one. Topologies can suit different stages of organizational growth. A topology that was a good fit last year might become a burden this year.
Discuss practical matters. Which on-call scheme works best for your team? Does everyone participate? How should post-mortems be advertised? What about incident response? Forming policies should come last, as leadership will need some time to measure results and take action. This includes topics that are mundane technology-wise, like bonus payments for on-callees.

5.2 Onboarding SREs

There are certain qualities you should be looking for when hiring an SRE.

They poke around in everything, top to bottom, in any given system. They are genuinely curious about figuring out how things actually work.
They read - and eventually write - a lot of documentation. Understanding the hows and the whys is important.
They work well under pressure. After all, they are the people handling a large chunk of the incidents that are inevitable.
They are passionate about automating most tasks, which increases efficiency and helps with eliminating toil.
They adapt to change. After all, everything is a living product and change is bound to happen anytime.

Bonus point: being a great informal leader. This is a soft skill that cannot easily be taught, and hiring an SRE who is happy to share knowledge, mentor and help out colleagues is definitely going to be an asset to the team.

When there is a great new addition in the team, follow your normal onboarding routines and also make sure the following are not forgotten:

Encourage newcomers to share feedback and comment on or edit existing docs. If docs don’t exist, allow time to write them. A fresh look from the outside always helps to spot ways to improve.
Allow time to attend meetings, learn the terminology of the niche the organization is operating in. For you as an SRE, don’t get stuck in lingo. Take notes and try to learn the concepts by observing how people talk about them.
No question is a dumb one. Encourage questions on every level. A mentorship program would be a fantastic fit here.
Gradually introduce them to on-call. Use shadow on-callees if possible. Explain your escalation policies.
Embrace a blameless culture. When shit hits the fan, just accept it and fix it. No single person is to blame.

The indicator of a healthy SRE team is how much time they are allowed each sprint to write documentation.

Next post!

The next post on SRE will cover technical (engineering-only) topics! As a teaser, I’ll go through some hot technical topics like:

chaos engineering
the operator pattern
observability
capacity planning

Stay tuned for more!

Credits

This post was reviewed by and co-authored with my good friend and partner Giannis!

I am excited to say the next (technical-only) SRE post will be co-authored by both of us from the ground up, combining the knowledge we gained over the years!

Navigating SRE - for Product and Engineering