2024 Site Reliability Engineering: Key Trends and Focus Areas for SREs

Cindy Mullins

April 12, 2024

In modern tech organizations, SREs can wear many hats. Historically, SREs have often 'come to the rescue' for deployment and operational issues, taking the lead in deciding how applications are deployed, determining when something needs to be rolled back or modified, and adjusting health checks and monitoring. But as cloud-native application development has continued to progress, the processes of deploying, releasing, and operating applications have shifted, becoming more and more the realm of the DevOps team directly. Accordingly, the role of Site Reliability Engineers (SREs) has evolved to focus on implementing the right tools and processes to support deployment and to provide the first line of defense against downtime and system failure.

Explosive Growth of Cloud-Native Technologies

Cloud-native technologies have revolutionized the way applications are developed, deployed, and maintained. Relying on container orchestration and microservices, the expansion of cloud-native tech seems here to stay:

The CNCF (Cloud Native Computing Foundation) reports that the adoption of cloud-native technologies has grown significantly, with 78% of surveyed companies using containers in production environments.
The global cloud-native application market size is expected to reach $21.1 billion by 2024, growing at a CAGR of 22.7% during the forecast period. (MarketsandMarkets)

Within this shifting landscape, new job roles and responsibilities have emerged, not only to keep pace with the changing architecture but also to ensure that the underlying systems support the organization’s goals and are implemented in a way that is maintainable and sustainable for development and operations teams.

There is certainly some overlap in the roles of SREs, DevOps, and Platform Engineers: all three are concerned with issues like automation, infrastructure-as-code, systems engineering, and software development. Most recently, SREs have been filling roles beyond development and operations, and some SREs are focusing entirely on process, strategy, and culture. Let’s examine three areas SREs commonly focus on and refer to some leading concerns and tools in each space.

SREs’ responsibilities revolve around three themes:

Automation: SRE teams are increasingly using automation to reduce toil and free up engineers to focus on more strategic work.
Observability: SRE teams are using observability tools to gain deep insights into the behavior of their systems. This helps them to identify and fix problems more quickly.
Security: SRE teams are taking a more proactive approach to security. They are working to embed security into the development lifecycle and to ensure that their systems are resilient to attack.

Focus on Automation

"Besides black art, there is only automation and mechanization." - Federico García Lorca (1898–1936), Spanish poet and playwright.

A certain amount of repetitive maintenance work is required to keep a system up and running. This includes things like provisioning infrastructure, systems monitoring, incident response, and running integration and other tests. It can also include things like updating documentation, which is often managed by another system or set of procedures. Automation is becoming ever more critical to SRE operations, and the trend seems set to continue. According to a survey by Atlassian, 61% of IT professionals say automation will be a high or extremely high priority for their organization in the next 12 months.

What kinds of tasks do SREs want to automate? One common example is creating user accounts. Others include operational duties like taking backups on a schedule, managing server failover, automating deployments, and small data manipulations such as changing the upstream DNS servers in resolv.conf or updating DNS server zone data. The greater the volume of manual tasks, the more likely the system is to fall short: manual actions performed over and over by humans cannot be fully consistent, or even executed under exactly the same circumstances each time. These are the kinds of tasks better managed by machines.
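As a rough illustration of the resolv.conf example, the sketch below rewrites the nameserver entries of a resolv.conf-style text in a repeatable way. The function name, the sample file content, and the IP addresses are assumptions made up for this example. It works on a string rather than on /etc/resolv.conf itself, so it is safe to run anywhere.

```python
from __future__ import annotations


def set_nameservers(resolv_conf_text: str, nameservers: list[str]) -> str:
    """Return resolv.conf-style text with its nameserver lines replaced.

    Hypothetical helper: all non-nameserver lines (search, options, ...)
    are kept in order, and the desired nameservers are appended.
    """
    kept = [
        line
        for line in resolv_conf_text.splitlines()
        if line.split()[:1] != ["nameserver"]
    ]
    new_lines = kept + [f"nameserver {ip}" for ip in nameservers]
    return "\n".join(new_lines) + "\n"


# Made-up sample content and addresses, for illustration only.
current = "search example.internal\nnameserver 10.0.0.1\n"
desired = set_nameservers(current, ["10.0.0.53", "10.0.0.54"])

# Applying the same change to its own output is a no-op (idempotent),
# which is the consistency that repeated manual edits cannot guarantee.
rerun = set_nameservers(desired, ["10.0.0.53", "10.0.0.54"])
```

Because the function is idempotent, it can run on every host on a schedule without drifting, which is the property configuration tools such as Ansible aim for.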

What are some of the automation tools and principles SREs are looking at in 2024?

Argo, Flux, Chef, and Ansible, among others, are popular automation platforms that can be used with container orchestration tools like Kubernetes. Of course, there are many considerations when choosing the one that’s right for your team. As an example, g2.com hosts a helpful Argo CD vs Ansible comparison supported by user comments and data. Scroll down that page for more automation options, like GitLab and Harness, which you can explore in further detail.

Focus on Observability

"Observability is the degree to which the results of an innovation are visible to others." - Everett M. Rogers, Diffusion of Innovations.

Observability is about providing visibility into all aspects of your system to identify and fix issues before they cause customer-facing problems. This includes monitoring system health, tracking changes made to the system, and understanding how new implementations are performing. Monitoring is the process of collecting data about your systems at the application level and using it to generate reports. By contrast, observability uses data from all levels of the system and therefore helps you detect and diagnose issues in real time.

Monitoring, for example, may show you how much disk space the database is using and how many requests the web server is handling per second. Monitoring checks are commonly built around a defined set of known failure scenarios. Running out of disk space is a very common failure, so monitoring can give you a heads-up if known parameters are being exceeded or things are headed in the wrong direction.
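A check for a known failure scenario like this can be very small. The sketch below, using only the Python standard library, measures disk usage and classifies it against two thresholds. The function names and the 80% and 90% thresholds are assumptions chosen for illustration, not recommended values.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(value: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Map a measurement to a status using fixed, known thresholds.

    The default thresholds are illustrative assumptions.
    """
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    print(f"disk usage: {percent:.1f}% ({classify(percent)})")
```

Note that the check only answers the question it was written for. It says nothing about failures nobody anticipated.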

But what if something goes wrong in an unexpected way? Monitoring may tell you, for example, that requests are failing, but in order to diagnose the problem, you’d need a much more integrated view of your systems. Observability is meant to provide this holistic view, integrating data from several sources, including logs, metrics, and traces, along with the ability to home in on irregularities and anomalies. If monitoring provides data, observability aims to provide the information needed to make good remediating decisions.

Observability relies firstly on data collection, which is generally done through logging, tracing, and metrics. It’s considered a best practice to standardize on formats as this helps minimize the conversion of data as it's shared between different tools and systems. The next step is to analyze the data using tools like dashboards, graphs, and alerts. An alert system ensures the right people are notified when an issue arises, and will show resolution once the underlying problem has been identified and resolved.
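The alert lifecycle described above (notify when an issue arises, then show resolution) can be sketched as a small state machine. The class below is a hypothetical illustration, not the API of any particular alerting tool. It emits a message only when the state changes, so a sustained problem does not produce a notification on every evaluation.

```python
from typing import Optional


class ThresholdAlert:
    """Hypothetical alert that reports only state transitions."""

    def __init__(self, name: str, threshold: float) -> None:
        self.name = name
        self.threshold = threshold
        self.firing = False

    def evaluate(self, value: float) -> Optional[str]:
        """Return a notification on a state change, otherwise None."""
        breached = value > self.threshold
        if breached and not self.firing:
            self.firing = True
            return f"FIRING: {self.name} (value={value})"
        if not breached and self.firing:
            self.firing = False
            return f"RESOLVED: {self.name} (value={value})"
        return None


# Made-up series of measurements for an alert named "disk_usage".
alert = ThresholdAlert("disk_usage", threshold=90.0)
notifications = [alert.evaluate(v) for v in (50.0, 95.0, 97.0, 60.0)]
```

In this run the second sample triggers a FIRING message, the third stays silent because the alert is already firing, and the fourth produces a RESOLVED message.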

What are some of the leading Observability tools currently?

Prometheus gathers metrics about your applications and infrastructure, monitors them, and produces data through dashboards and visualizations. It’s a popular application that site reliability engineers rely on for performance and KPI monitoring, load testing, and anomaly detection, largely because Kubernetes outputs its own metrics in a format easily consumed by Prometheus.

Another advantage is Prometheus’ pull-driven approach: the system being monitored only has to serve its metrics as responses to requests on a specific port. Applications can update metrics as frequently as needed with no additional load on Prometheus, and if a Prometheus instance goes away, the application won’t be impacted.
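To make the pull model concrete, here is a minimal sketch of an application exposing one counter over HTTP in the Prometheus text exposition format, using only the Python standard library. The metric name app_requests_total, the /metrics path, and port 8000 are assumptions for this example. Real services would more typically use an official Prometheus client library than hand-render the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter that the application would increment.
REQUESTS_TOTAL = 0


def render_metrics(requests_total: int) -> str:
    """Render one counter in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total requests handled by the app.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {requests_total}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # The app only answers scrapes; it never pushes data anywhere.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve metrics until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


if __name__ == "__main__":
    serve()
```

The application updates its counter in memory whenever it likes. Nothing is sent until a scraper asks, and if no scraper ever connects the application keeps running unaffected.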

Together with Prometheus, SREs often utilize Grafana, an analytics and monitoring application, to quickly display metrics and data. Key metrics may be set into dashboard panels. Grafana supports many data sources, including Prometheus, MySQL, Elasticsearch, SQL, AWS, and others. Grafana can also be set up with alerts to notify the right teams or people when problems arise.

Splunk is primarily used to discover, monitor, and investigate machine-generated Big Data through a web-style interface. A main advantage of using Splunk is that it does not require a database to store its data, as it makes extensive use of indexes. It correlates real-time data into a searchable container from which it can generate graphs, reports, alerts, dashboards, and visualizations that provide business intelligence.

Dynatrace allows SREs to monitor the infrastructure behind an application. AI-powered Dynatrace can track network traffic, host CPU usage, response times, and other metrics. By providing automatic and intelligent observability for even complex distributed cloud environments, Dynatrace helps SREs and DevOps teams to identify problems before they occur.

Focus on Security

"Security is a process, not a product." - Bruce Schneier, Information Security author and technologist.

SREs are primarily concerned with reducing the risk of security incidents. Countering today’s security threats requires things like implementing strong access control policies, conducting regular security assessments, monitoring, logging, and backing up critical data. By establishing a culture of proactivity, observability, and software automation, SREs aim to achieve maximum uptime while mitigating any threats that could cause downtime.

Some of the more common threats are DDoS attacks, which can prevent access to web resources and cause outages; software vulnerabilities, which hackers could exploit to gain unauthorized access to resources; and ransomware, which maliciously locks access to systems or data. In addition, there are new concerns, such as more sophisticated AI-powered phishing attacks that aim to trick users or employees into divulging sensitive information.

Similarly, with more employees working from home, the risks posed by workers connecting or sharing data over improperly secured devices will continue to be a threat. Home consumer IoT devices are often designed for ease of use and convenience rather than security and may be at risk due to weak security protocols and passwords.

SRE teams must be well-informed about these and other common security threats in order to make their security procedures resilient and robust.

How are SREs tackling security concerns in 2024?

Delving into the details and available options for mitigating security risks is outside the scope of this article, but, in principle, establishing authentication and authorization protocols and encryption tools is a good place to start. A very useful overview on the Fundamentals of Security for SREs, which outlines many key concerns, can be found here, licensed under the Creative Commons Attribution 4.0 International Public License. Another helpful resource is the OWASP Top 10, an awareness document for developers and web application security. It represents a broad consensus about the most critical security risks to web applications.

In Conclusion

As the role of the SRE continues to evolve, SRE teams will likely gain even more influence over how companies manage their development and operations. Automation will become an even bigger focus in all areas, including maintenance, deployment, and monitoring, in order to free developer teams to focus on critical tasks that require human judgment.