Maintaining Service Level Objectives (SLOs) for a production data application
The SRE book and Service Level terms
“Service Level Objectives” (or SLO) is one of the foundational sections of the Google SRE book, in which the differences among SLI, SLO, and SLA are carefully explained. Compared to SLA, a well-known term across the software industry, SLI and SLO are relatively newer concepts that attract more attention, especially in the context of Site Reliability Engineering.
The three Service Level terms by definition (from the Google SRE book):
SLI (Service Level Indicator): a carefully defined quantitative measure of some aspect of the level of service that is provided.
SLO (Service Level Objective): a target value or range of values for a service level that is measured by an SLI.
SLA (Service Level Agreement): an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
Each term has its own special focus: SLI is mostly about measurement, SLO about target values, and SLA about customer contracts.
Data application running on production systems
At one of the companies I previously worked for, my responsibility, as part of a small systems engineering team, was operating a tick data application running on production systems. The application offered a mix of services:
- Continuous ingestion of raw tick data from two geographically redundant data sources (London and NYC), at a rate of ~500Gb per day
- Processing of the raw data as it arrives, saving the processed data in MySQL and Cassandra databases
- A UI for users to download processed data, through either a short-range or a long-range query
- An API for users to retrieve processed data, through either a short-range or a long-range query
At that time the systems engineering team relied on Nagios as the main monitoring and alerting system, plus an internally designed fancy graphing tool to visualize the data processing progress for each stock market during the day.
Service Level terms applied to the application
Though it did not use the term “SLI”, the team developed a series of key metrics to measure the underlying systems that provided the aforementioned services. For instance, a production node going down would trigger a page to the on-call engineer, who was expected to check and fix the problem. In this case, node availability is one simple example of an SLI.
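As a minimal sketch (in Python, purely for illustration; the actual checks lived in Nagios rather than custom code), an availability SLI can be expressed as the fraction of periodic health checks that passed over a window:

```python
# Minimal sketch of an availability SLI: the fraction of periodic health
# checks that succeeded over a window. The function name and the sample
# numbers are illustrative, not the team's actual monitoring code.
def availability(check_results: list[bool]) -> float:
    """check_results holds one boolean per monitoring interval (True = node up)."""
    if not check_results:
        return 0.0
    return sum(check_results) / len(check_results)

# e.g. 287 successful checks out of 288 five-minute intervals in a day
print(f"{availability([True] * 287 + [False]):.4%}")  # -> 99.6528%
```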
To be able to “perceive” the latency of short-range queries, the team set up a test host in the NYC region and periodically initiated a test query from that host against our production system, better simulating user behavior with an “external” trigger. The query latency was recorded in our monitoring system, and another metric was set up to measure the average latency of these external test queries over time. That is yet another SLI.
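A hedged sketch of such a probe is shown below; the endpoint, query parameters, and the record_metric() sink are hypothetical stand-ins, not the team's actual setup:

```python
# Illustrative external latency probe: time a short-range query against the
# production data API from the test host and record the latency as an SLI
# sample. Endpoint, parameters, and record_metric() are hypothetical.
import time
import requests

def record_metric(name: str, value: float) -> None:
    # Stand-in for pushing the sample into the monitoring system.
    print(f"{name}={value:.3f}")

def probe_short_range_query(endpoint: str) -> float:
    start = time.monotonic()
    resp = requests.get(endpoint, params={"symbol": "IBM", "range": "1h"}, timeout=10)
    resp.raise_for_status()
    latency = time.monotonic() - start
    record_metric("short_range_query_latency_seconds", latency)
    return latency

# e.g. probe_short_range_query("https://data.example.com/api/query")
```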
The monitoring system periodically checks the average latency of these queries and raises different levels of alarm (minor, major, critical) when one of the configured thresholds is reached. If the minor threshold (e.g. 200ms) is crossed, an email is sent to the whole systems engineering team. If the issue gets worse and the major threshold (e.g. 2s) is crossed, no immediate action is taken; instead, the system waits, and if the situation does not recover within two consecutive checking intervals (e.g. 2 x 5 mins), a page is sent to the on-call engineer so that some human intervention can alleviate the situation. The critical threshold (e.g. 5s) means the issue has become perceivable to users and should be handled with priority. The major or critical latency threshold discussed here is an instance of an SLO, an internal target (or value) our systems engineering team aimed to maintain. Action needs to be taken to keep the situation from deteriorating to a level that users can perceive, or, worse, to the point of violating the service contract with the customers, i.e. the SLA.
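The escalation logic can be summarized in a short sketch like the one below; the thresholds follow the example values above, and the returned action strings stand in for the real email/paging integration:

```python
# Sketch of the threshold/escalation logic: minor -> email the team,
# major -> page only after two consecutive bad intervals, critical -> page
# immediately. Threshold values follow the examples in the text.
MINOR_S, MAJOR_S, CRITICAL_S = 0.2, 2.0, 5.0

def evaluate(avg_latency: float, consecutive_major: int) -> tuple[str, int]:
    """Return (action, updated consecutive-major counter) for one check interval."""
    if avg_latency >= CRITICAL_S:
        return "page_oncall_critical", 0
    if avg_latency >= MAJOR_S:
        consecutive_major += 1
        if consecutive_major >= 2:  # e.g. 2 x 5-minute checking intervals
            return "page_oncall_major", consecutive_major
        return "wait", consecutive_major
    if avg_latency >= MINOR_S:
        return "email_team", 0
    return "ok", 0
```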
Maintaining SLOs
To maintain a Service Level Objective, one critical thing is to understand which systems, and which of their measurements (or SLIs), the objective really depends on. Take the query latency SLO as an example: multiple factors can affect the latency of a data query, e.g.
- data availability
- data system stability
- data system performance
As mentioned in an earlier section, the data users query is “processed” data, which requires the “raw” data to be successfully ingested, processed, and stored in the data system, making it available for queries. Any failure during data ingestion, processing, or storage can lead to data unavailability and to query failures or delays. The data processing and storage systems are key in this aspect.
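As a rough illustration of a data-availability SLI, one could track how far the latest processed tick lags behind the current time for each market; the market names, timestamps, and the 15-minute staleness threshold below are all made up:

```python
# Hedged sketch of a data-availability (freshness) SLI: per-market lag between
# "now" and the newest processed tick. All names and numbers are illustrative.
import datetime as dt

def processing_lag(latest_processed: dict[str, dt.datetime],
                   now: dt.datetime) -> dict[str, float]:
    """Seconds of lag per market between now and the newest processed tick."""
    return {market: (now - ts).total_seconds()
            for market, ts in latest_processed.items()}

lag = processing_lag(
    {"LSE": dt.datetime(2024, 1, 5, 14, 58), "NYSE": dt.datetime(2024, 1, 5, 14, 30)},
    now=dt.datetime(2024, 1, 5, 15, 0),
)
stale = [market for market, seconds in lag.items() if seconds > 15 * 60]
print(lag, stale)  # {'LSE': 120.0, 'NYSE': 1800.0} ['NYSE']
```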
Systems become unstable from time to time. There can be multiple contributing factors, e.g. hardware problems, network instability, or OS or application software inefficiency. A series of measurements is needed to help understand which specific system component is malfunctioning, so that more targeted troubleshooting and recovery can be carried out.
Systems also degrade sometimes. It could be an unusually high load caused by a couple of super-long-range queries, or a cluster member going down, leaving too little capacity to handle even a normal workload. A series of performance-related metrics was added to measure and detect those issues. By referring to the alerts raised for specific system components, there is a good chance one can find the likely cause of a performance degradation for a given service.
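To make that concrete, a performance SLI could be something like the 95th-percentile query latency over a checking window, compared against an SLO target; the sketch below uses made-up samples and a hypothetical 2-second target:

```python
# Illustrative performance SLI: the p95 of recorded query latencies over a
# window, checked against an SLO target such as "95% of short-range queries
# complete within 2 seconds". The samples and target are made up.
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for a monitoring check."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[min(max(k, 0), len(ordered) - 1)]

window = [0.12, 0.18, 0.22, 0.15, 0.95, 0.14, 0.16, 2.40, 0.13, 0.17]
p95 = percentile(window, 95)
print(p95, p95 <= 2.0)  # 2.4 False -> this window misses the p95 target
```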
To maintain SLOs for the production data application, which runs 24x7 serving data to customers around the world, the systems engineering team never stopped improving the monitoring system while keeping the underlying systems running. If a critical incident happens during working hours, a team effort is usually made to get it resolved, including issue recovery, root cause analysis (RCA), and system improvement. In the unlucky case that an incident happens during the night, however, the on-call engineer is the primary force dealing with the issue, and the focus is on recovering the affected system so that the SLO can be maintained, leaving the RCA and improvement work for the next day.
At the time I was part of a 4-member systems engineering team, and we shared the on-call load, with each person taking the pager for one week per month. We used PagerDuty to manage our on-call roster and escalation process. Nagios, together with an internally designed fancy data availability progress graph, helped us with systems monitoring and alerting when things went south. Post-mortems were arranged after serious incidents so that we could find improvement opportunities on both the systems and the process side.
Final words
Maintaining a production data application requires lots of work, from keeping the underlying systems running, to monitoring them, to continuously improving the process and the systems. Making the system stable and performant is not an easy task that can be achieved overnight. It takes lots of attention and initiative to turn ideas into actions. A better understanding of the services we provide to our customers will definitely help with defining and maintaining those higher-standard internal targets: SLOs.