How Site Reliability Engineering Drives Customer Success

Posted by Dominik Rose on February 26, 2021


The debut of salesforce.com in 2000 arguably marks the beginning of modern Software-as-a-Service (SaaS). The software company we all know and use launched with the tagline “the end of software” — a bold way of saying that vendor-run software would be crowned king over customer-specific, self-hosted alternatives.

Not only was Salesforce’s company vision spot-on, their choice to move from substantial upfront payments to annual renewal fees resulted in a new business model with an ecosystem of supporting roles. Customer Success, a discipline used by LeanIX and other SaaS-based vendors to help customers adopt products, was born out of this decision. Instead of closing contracts and delivering software via a cost center, Customer Success agents work to re-win customers every day by tracking product usage and improving the business value of services.

But the global shift to SaaS-based products has fundamentally changed the world of software development itself. Nowhere is this more apparent than in the tasks of Site Reliability Engineers (SREs), a role situated between engineering and operations that addresses two primary business drivers:

  • The availability and reliability of all software systems (i.e., the two most critical features of any solution). After all, what’s the value of User Interfaces powered by Artificial Intelligence or Robotic Process Automation if a service isn’t available?

  • The growth and speed of modern software systems. To ensure long-term advantages with portfolios of SaaS-based software, it’s imperative to balance stability with speed and innovation — not to mention critical levels of technical debt.

SRE at Google

Speaking at GOTO Amsterdam in 2018, Christof Leng, SRE Manager for Uber-TL at Google, defined SREs as essential for managing the “enormous scale, rapid growth, and daunting complexity” of Google’s system landscape. But while SREs have been Google’s experts for operating technology and product infrastructure since 2003, Leng pointed out that not every company needs to be a global tech titan to leverage the discipline.

Here are some of Leng’s insights on SRE from his 2018 talk:

  • There is no golden rule for investments into SRE. In the end, an SRE program is determined by how a business needs to balance speed and stability. This can range from keeping software running for life-critical situations (e.g., running software for pacemakers or airplanes) to managing single features for an early-stage startup.

  • SRE is coding! Generally speaking, it’s good for SREs to be annoyed by manual, tedious work. Automation needs to be thought about constantly, and Leng even goes as far as to say that an SRE team should aim to make tasks redundant every 18 months via automation.

  • SRE is about communication, trade-offs, and incentives. The better SREs understand how and why to make systems, the better a company will succeed with its business goals.

Redefining Service Level Agreements

Service Level Agreements (SLAs) in the pre-SaaS world were mainly the domain of legal teams. SLAs define when software bugs must be fixed, and how fast — tasks which prompt formal discussions between IT and business teams for client organizations and vendors alike.

As you can imagine, Customer Success programs in today’s best SaaS companies make sure that SLAs are understood and resolved more rapidly. Aiming to support customers during “moments of truth” (e.g., a C-level presentation) rather than with paperwork, Customer Success-led SaaS vendors provide last-minute and ad hoc fixes to their offerings — and by doing so, cultivate client relationships and economic benefits in the process.

SLAs mean different things to SREs, but here’s just the tip of the iceberg:

  • An SLA in the world of SRE is the promise made to the end user on topics like system availability and application response times. To say the least, a broken SLA is bad: SLAs are often documented in contracts, and failing to honor them can be expensive.

  • A service-level objective (SLO) establishes which benchmarks a service must meet in order to achieve the SLA. SLOs are more granular and stricter than SLAs, so when something goes wrong, the SLO should break before the SLA does. Consider it an advance warning system.

  • A service-level indicator (SLI) measures the actual performance of a service against an SLO and establishes whether the SLO and SLA were met.
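
The relationship between the three terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, and the 99.5% SLA and 99.9% SLO are made-up numbers rather than targets from any real contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityTargets:
    """Hypothetical targets; the values used below are illustrative only."""
    sla: float  # availability promised to the customer, e.g. 0.995
    slo: float  # stricter internal objective, e.g. 0.999


def availability_sli(successful: int, total: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


def assess(sli: float, targets: AvailabilityTargets) -> str:
    """Compare the measured SLI with the SLO first, then with the SLA."""
    if sli >= targets.slo:
        return "ok"
    if sli >= targets.sla:
        return "slo_breached"  # early warning: the SLA still holds
    return "sla_breached"


targets = AvailabilityTargets(sla=0.995, slo=0.999)
print(assess(availability_sli(9_970, 10_000), targets))  # slo_breached
```

Because the SLO is stricter than the SLA, a measurement that falls between the two triggers the warning while the contractual promise is still intact.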

While legal teams regularly bear responsibility for SLAs, SLOs and SLIs are co-owned by SREs and development teams. Both should be monitored and measured to provide KPIs on whether the entire software development organization is improving. In many cases, SLOs and SLIs are put into operation as an “error budget”.

An error budget is a joint agreement between development teams and SREs on how often systems are allowed to “break”. An example would be a system with a 99.9% availability SLO. Out of one billion API calls per month, it would allow for an error budget of one million calls (meaning, it is acceptable if one million API calls fail monthly).
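
The arithmetic behind this example is simple enough to check in code. The sketch below uses a hypothetical helper, `error_budget`, and passes the SLO as a decimal string so that Python's `Fraction` can represent it exactly instead of relying on floating-point rounding.

```python
from fractions import Fraction


def error_budget(slo: str, total_requests: int) -> int:
    """Number of requests that may fail without violating the SLO.

    `slo` is a decimal string such as "0.999" so that the subtraction
    below is exact; the result is rounded down to whole requests.
    """
    target = Fraction(slo)
    if not 0 < target <= 1:
        raise ValueError("slo must be greater than 0 and at most 1")
    return int((1 - target) * total_requests)


# 99.9% availability over one billion monthly API calls
print(error_budget("0.999", 1_000_000_000))  # 1000000
```

Tightening the SLO by one more nine, to 99.99%, would shrink the same budget to 100,000 failed calls.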

If you think a million failed API calls per month is a lot, remember that trade-offs are at the core of modern software development. In the spirit of Facebook’s “move fast and break things” mantra, innovation speed is often prioritized over 100% stability. Error budgets show very clearly where and when perfection can be (temporarily) overlooked — or, to put it even more simply, when it’s OK to break things (i.e., when the budget isn’t exceeded).
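
One way to turn this idea into a working agreement is a small policy function that compares failures so far with the budget. The following is a minimal sketch, not a description of how any particular company does it: the policy names and the 75% "caution" threshold are assumptions chosen for illustration.

```python
from enum import Enum


class ReleasePolicy(Enum):
    SHIP_FREELY = "ship freely"
    SHIP_WITH_CAUTION = "ship with caution"
    FREEZE = "freeze risky releases"


def release_policy(budget: int, failed_so_far: int,
                   caution_at: float = 0.75) -> ReleasePolicy:
    """Map error-budget consumption to a release stance.

    `caution_at` is a hypothetical threshold: the share of the budget
    after which teams agree to slow down before the budget is exceeded.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if failed_so_far > budget:
        return ReleasePolicy.FREEZE  # budget exceeded: stability first
    if failed_so_far >= caution_at * budget:
        return ReleasePolicy.SHIP_WITH_CAUTION
    return ReleasePolicy.SHIP_FREELY


# One million failed calls allowed this month, 200,000 used so far
print(release_policy(1_000_000, 200_000).value)  # ship freely
```

The point of such a function is less the code than the conversation around it: development teams and SREs agree on the thresholds up front, so the decision to slow down is not renegotiated during an incident.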

How SRE relates to DevOps

SRE and DevOps are fundamentally intertwined. Both accept that silos can’t exist between teams responsible for delivering a modern software system. SRE and DevOps are also heavily driven by automation, utilizing CI/CD pipelines and Infrastructure-as-Code to simplify complex, repetitive, and error-prone operations. The focus on reflection and the proximity to agile, iterative development is also a common denominator.

SRE is a concrete incarnation of the DevOps culture. Where DevOps focuses on shifting the mindsets of classical IT organizations, SREs constantly think about reliability and how to improve product usage for development teams. Leng mentions that Google goes so far as to even separate SRE reporting lines and development teams, and uses the rule of “SRE mobility”, allowing SREs to leave development teams where the focus on production does not exist as required.

What does it mean for you?

Not every company is Google or a SaaS vendor. But as businesses evolve into technology companies, SRE is a practice that can drive software availability while setting systemic guidelines on balancing speed with stability.

An SRE practice is a business decision and a deliberate investment into one’s own software development capabilities. As such, it must continually be justified with other business drivers and changing market needs. If COVID-19 has impacted a company, for example, it is likely that an SRE investment was lowered.

To help with the implementation of an SRE program, trade-offs need to be transparent and prioritized for all stakeholders. This clarity helps development and operations teams collaboratively resolve conflicting incentives. It can also help business teams clarify differing interpretations of SLAs and SLOs.

Conflicting incentives among development and operations teams make it difficult for both to proceed in unison, as do differing interpretations of SLAs and SLOs between development and business stakeholders. Making these trade-offs transparent and quantifiable is a subject that I spoke about alongside Per Bernhardt at the 2020 EA Connect Days.

LeanIX Microservice Intelligence, a new module in the LeanIX Cloud Native Suite, offers modern development organizations one consistent microservice catalog. This catalog presents a clear line of sight into the ownership and performance of self-developed software, and gives business stakeholders a way to share and discuss SLAs and SLOs. Of note, the LeanIX Integration API makes it easy to automate the intake of SLIs (e.g., from systems like Pingdom) and show their development over time.
