Service resilience has become one of the most important topics in observability today, and for good reason: businesses need their services to be up, performant and, when there is an issue, fixed quickly, ideally through automatic remediation. Add to this the advances in artificial intelligence and machine learning, which are driving some remarkable capabilities in this space. But peel back the covers and there are a number of key challenges in achieving service resilience. This five-minute read covers why resilience matters for observability today, the definitions and misconceptions surrounding it, the challenges involved, and a key methodology for delivering service resilience in your organisation.
Before we dig into the details, let’s understand what service resilience is, and for that let’s define the word resilience. The dictionary defines resilience as “the capacity to withstand or to recover quickly from difficulties; toughness”, and this is exactly what is needed from the services within your business. They need to work, perform, meet users’ expectations and be available, and they need your help to withstand and recover from difficulties such as slowness, poor performance and bad customer experiences.

Now let’s define a service, and here is where confusion arises, as the definition of a service varies widely across the industry and between vendors in this space. In fact, the definition differs from business to business and can be interpreted differently by teams, depending on where they sit within the organisation and their sphere of control and influence. It is safe to say that it is a term with many definitions! From our experience here at Splunk, working with our customers, a service is a functional pillar of the business, typically owned by a stakeholder and a specialist team, that delivers value to the organisation. A service can be low-level, such as compute or connectivity, delivering foundational business capability, or it can fulfil strategic business outcomes such as revenue generation, customer satisfaction and/or a business process.
It is here, within the definition of the service, that we see some typical challenges. It is easy to define a service by starting with the silo monitoring already in place and promoting its existing metrics to be the KPIs of that service. A typical view from an APM tool, for example, is to look only at the application, not the broader service or process the application is part of, or to label the API calls and logical processing elements within it as services. Adding to this is the difficulty of correlating data across the silo tooling, which leads to relying on only a small subset of technical KPIs. The result? A failure to deliver a service resilience view: it lacks the necessary business focus and has multiple data gaps, which makes troubleshooting and impact analysis very hard to achieve.
So, how do we combat this? Let’s walk through an example of service resilience for a large insurance company. From a technology perspective, they have a multitude of apps, some in the traditional three-tier space and others migrated to cloud-native tech. They offer their customers a range of services, from a quote-and-buy engine, cross-selling and tailored products, services and discounts, through to claims processing and call centre services. The customers they serve care only about the service they are receiving; they have no interest in, or knowledge of, the sheer complexity of the technology, apps, processes and people that sit behind and power it. The business, likewise, cares about the output, in this case the number of quotes fulfilled, claims processed, products cross-sold and so on, not the underlying technology. With this in mind, and as an example, let’s look at a customised service resilience view for the insurer’s claims process service, provided by Splunk’s ITSI engine.
1. The customised view above maps out the complete claims process and highlights the key business KPIs that need to be measured, so that both the business and the technical teams can see how the service is performing. The technical KPIs can be integrated alongside the business ones, allowing tech teams to understand the impact of the technology they support on the service being delivered. As this is customisable, you can display any KPI here.
2. KPIs that demonstrate the quality of the claims process service. In this example, these focus on the key business process, its steps and the outcomes of the service, so that its quality can be quickly determined.
3. RAG status on the KPIs. We can also put a red/amber/green (RAG) status on the KPI numbers, which provides quick visibility into whether the service or process is in a degraded state. Artificial intelligence and machine learning (AI/ML) are used to predict where each number will be in the near future, allowing teams to take preemptive action now.
4. Business-specific data. As Splunk is a data platform, any data can be ingested, whether structured or unstructured, in different formats or stored in multiple locations, and calculations can then be performed on it to provide bespoke, correlated business KPIs. The example below shows key business data for the claims process engine, each item with a RAG status, so you can quickly see the state of this part of the process, the business impact of issues and any emerging trends.
5. Integrating and mapping the technology into the business process and KPIs. In a similar way, we can define the technical KPIs and use that visibility to understand their impact on service performance, quickly identifying the root cause of issues and routing the problem to the right team. This is also where you can reuse existing siloed monitoring that may already be deployed, as well as highlight monitoring gaps that need to be filled. The key here is the ability to ingest this data and provide unique correlations and insights across multiple existing tools.
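To make the RAG and health ideas above concrete, here is a minimal Python sketch. It is illustrative only, not ITSI’s actual implementation: the KPI names, thresholds and weights are invented for the example. It shows how a KPI value can be mapped to a red/amber/green status against two thresholds, and how weighted KPI statuses can roll up into a single 0–100 service health indicator, similar in spirit to how a service view aggregates business and technical KPIs:

```python
# Illustrative sketch only: hypothetical KPI names, thresholds and weights,
# not ITSI's actual health-score algorithm.

SEVERITY = {"green": 0, "amber": 1, "red": 2}

def rag_status(value, amber_threshold, red_threshold, higher_is_worse=True):
    """Map a KPI value to a red/amber/green status against two thresholds."""
    if not higher_is_worse:
        # Flip the comparison for KPIs where lower values are worse,
        # e.g. availability percentages.
        value, amber_threshold, red_threshold = -value, -amber_threshold, -red_threshold
    if value >= red_threshold:
        return "red"
    if value >= amber_threshold:
        return "amber"
    return "green"

def health_score(kpis):
    """Roll weighted KPI severities up into a 0-100 service health figure.

    `kpis` is a list of (status, weight) tuples; 100 means fully healthy.
    """
    total_weight = sum(w for _, w in kpis)
    # A red KPI costs its full weight, an amber KPI half of it.
    penalty = sum(SEVERITY[status] / 2 * weight for status, weight in kpis)
    return round(100 * (1 - penalty / total_weight))

# Example: two business KPIs and one technical KPI for a claims service.
claims_backlog = rag_status(1200, amber_threshold=1000, red_threshold=2000)  # "amber"
error_rate     = rag_status(0.2, amber_threshold=1.0, red_threshold=5.0)     # "green"
latency_p95    = rag_status(3.4, amber_threshold=2.0, red_threshold=3.0)     # "red"

score = health_score([(claims_backlog, 3), (error_rate, 1), (latency_p95, 2)])
print(score)  # → 42
```

The point of the weighting is that a degraded business KPI (the claims backlog here) can pull the service health down further than a purely technical one, which mirrors the business-first framing above.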
Service resilience doesn’t have to be hard, and here at Splunk we follow a simple four-step methodology. The key is that we start at the top, with the business services, processes and internal and external customers in mind, rather than starting at the level of the tech and existing monitoring tools.
We can strengthen service resilience further by utilising the AI and ML capabilities within Splunk ITSI.
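As a toy illustration of the predictive idea, the sketch below fits a straight line to recent KPI samples and extrapolates it forward. This is deliberately naive, a least-squares trend line rather than the algorithms ITSI actually uses, and the figures are invented, but it shows how a forecast lets teams act before a threshold is breached:

```python
# Illustrative sketch only: a naive least-squares trend extrapolation,
# not the ML used by ITSI, with made-up claims-backlog figures.

def predict_kpi(history, steps_ahead):
    """Fit a straight line to equally spaced KPI samples and extrapolate.

    `history` needs at least two samples; returns the predicted value
    `steps_ahead` intervals after the last sample.
    """
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope * (n - 1 + steps_ahead) + intercept

# Claims backlog sampled hourly; it is trending upward.
backlog = [900, 950, 1010, 1060, 1100]
forecast = predict_kpi(backlog, steps_ahead=4)
print(forecast)  # → 1310.0

# If the forecast crosses a degradation threshold (say 1300),
# raise the alert now rather than waiting four hours.
print(forecast > 1300)  # → True
```

The preemptive part is the last comparison: the alert fires on where the KPI is heading, not where it is, which is the behaviour described in point 3 above.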
Taking this approach and using Splunk’s ITSI platform, you can build customised service resilience views for your business. Check out the links below for some great further reading.
My thanks to our local Splunk subject matter experts John Murdoch, Marc Serieys, Rachel Bourne and Jaana Nyfjord for their input to this blog.
Ian has worked in the Observability space for the best part of 20 years and has helped many organisations on their O11y journeys to provide better visibility into their critical apps and services. Ian has been at Splunk for over four years and is part of the O11y strategist team which focuses on helping Splunk's customers achieve their O11y goals.