Application Instrumentation for Observability
A production workload must emit the information needed to support operational excellence. This information is used to quantify Service Level Indicators (SLIs) for reliability, security, performance, and similar concerns. Unpredictable behavior in user-facing or storage systems must be corrected in time to meet the agreed Service Level Objectives (SLOs). A system that cannot expose enough information to determine its health and behavior is considered unobservable and is therefore difficult for an operator to support.
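As a rough illustration of how emitted data turns into an SLI that can be compared with an SLO, the sketch below computes an availability ratio and the share of error budget left. The function names, the request counts, and the 99.9% target are assumptions made for this example, not values from any particular system.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good; treated as 1.0 when there was no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent; a negative value means the SLO is breached."""
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical numbers: 999,500 good requests out of 1,000,000, with a 99.9% SLO.
sli = availability_sli(999_500, 1_000_000)
budget = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI={sli:.4%}, error budget remaining={budget:.0%}")
```

With these numbers the service has failed half as often as the SLO allows, so about half of the error budget remains.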
Application instrumentation as a solution
Observability is the practice of instrumenting systems with tools to gather actionable data. This helps teams observe and detect symptoms and understand the root causes of issues. Instrumentation lets a system report on its overall health through telemetry.
Telemetry falls into three major categories: traces, metrics, and logs, which are collected at runtime around the different cross-cutting concerns of a system. These cross-cutting concerns are aspects of the system that are identified during system design and managed at runtime with the help of a container or application server such as Tomcat or GlassFish. To enable or improve the observability of a system, all application components – not just critical services – must be instrumented with observability in mind, so that the telemetry tells the entire story.
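To make the three categories concrete, here is a minimal sketch in which a single operation emits one log line, updates one metric, and records one trace span. It uses only the Python standard library; the component name, the metric name, and the in-memory `metrics` and `spans` stores are placeholders for whatever telemetry backend a real system would use.

```python
import logging
import time
import uuid
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")  # hypothetical component name

metrics = Counter()  # metric: numeric values aggregated over many requests
spans = []           # trace: timed records of individual operations


def handle_request(order_id: str) -> None:
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    # Log: a discrete, human-readable event.
    log.info("processing order %s trace_id=%s", order_id, trace_id)
    # Metric: a counter that only ever increases.
    metrics["requests_total"] += 1
    # ... business logic would run here ...
    # Trace span: what ran, under which trace, and for how long.
    spans.append({
        "trace_id": trace_id,
        "name": "handle_request",
        "duration_s": time.perf_counter() - start,
    })


handle_request("order-42")
```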
Aspect Oriented Programming (AOP) for Instrumentation
Frameworks based on Aspect Oriented Programming (AOP) generate the instrumentation and weave it into the application code. Application developers concentrate on application code and leave the responsibility for instrumentation to the aspect code. An aspect packages advice, the code to run, together with pointcuts, which select the join points in the application (such as method executions) where that advice applies. The aspect weaver connects the aspect code with the application code; depending on the framework, weaving happens at compile time, at load time, or at runtime through proxies. This modular approach cleanly isolates application code from aspect code. It also makes the aspect code reusable, simplifies maintenance, and provides much-needed insight into the executing code.
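The separation between application code and aspect code can be sketched without an AOP framework. The example below is an analogy in plain Python, not Spring AOP or AspectJ: a decorator plays the role of around advice, a name-matching predicate stands in for a pointcut, and a class decorator does the weaving when the class is defined. All names (`timing_advice`, `weave`, `OrderService`) are hypothetical.

```python
import functools
import time

call_log = []  # collected telemetry; a real system would export this


def timing_advice(func):
    """Around advice: runs code before and after the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            call_log.append((func.__qualname__, time.perf_counter() - start))
    return wrapper


def weave(pointcut, advice):
    """Class decorator that applies advice to every method the pointcut selects."""
    def apply(cls):
        for name, attr in list(vars(cls).items()):
            if callable(attr) and pointcut(name):
                setattr(cls, name, advice(attr))
        return cls
    return apply


# Pointcut (hypothetical rule): methods whose name starts with "place_".
@weave(pointcut=lambda name: name.startswith("place_"), advice=timing_advice)
class OrderService:
    def place_order(self, item: str) -> str:
        return f"ordered {item}"  # pure application code, no telemetry here

    def describe(self) -> str:
        return "order service"


result = OrderService().place_order("book")
OrderService().describe()  # not selected by the pointcut, so not recorded
```

The application class contains no timing logic, and the same advice can be reused on any other class by changing only the pointcut.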
When a system can externalize its state, monitoring can help operators understand and predict when the system is likely to break and why. A reasonable alerting mechanism based on the system's state enables timely human intervention to determine the real problem and mitigate it. With instrumentation woven into the application code, white-box monitoring becomes possible, allowing operators to inspect the innards of the system and focus on causes instead of just symptoms. The collected telemetry also supports effective debugging so that problems can be fixed before they escalate. With full-stack observability and white-box monitoring, a software team can deliver high-quality software at speed, see real-time performance, and build a culture of innovation.
Optimizing the instrumentation
Application instrumentation is a non-functional requirement, and it incurs additional implementation cost. That cost depends on the level of instrumentation needed and on how much of the desired instrumentation can be automated.
Most programming languages and their logging libraries support writing logs at different log levels, and the overhead of writing these logs is typically low. Brendan Gregg's USE method suggests instrumenting every resource so that it reports meaningful data on three key factors: utilization, saturation, and errors. Utilization is "the average time that the resource was busy servicing work", and saturation can be described as "the degree to which the resource has extra work which it can't service, often queued".
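As a small worked example of the three USE factors, the sketch below derives them from one observation window of a single resource. The `ResourceSample` fields and the sample values are assumptions made for illustration; how busy time and queue length are actually measured depends on the resource.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """Observations for one resource over one interval (hypothetical fields)."""
    busy_seconds: float      # time the resource spent servicing work
    interval_seconds: float  # length of the observation window
    queued_requests: int     # work waiting because the resource was busy
    error_events: int        # errors recorded during the window


def use_summary(sample: ResourceSample) -> dict:
    """Summarize a sample as utilization (ratio), saturation (queue length), and errors (count)."""
    return {
        "utilization": sample.busy_seconds / sample.interval_seconds,
        "saturation": sample.queued_requests,
        "errors": sample.error_events,
    }


# Hypothetical window: busy for 45 of 60 seconds, 3 requests queued, 1 error.
summary = use_summary(ResourceSample(busy_seconds=45.0, interval_seconds=60.0,
                                     queued_requests=3, error_events=1))
```

Here the resource is 75% utilized, yet the non-zero queue shows that it was already saturated at times during the window, which a utilization average alone would hide.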
In a complex distributed environment, dozens of services may call one another, each generating metrics, traces, and logs. If these calls are not correlated, the collected telemetry ends up in data silos and is of little use for root-cause analysis. Distributed tracing addresses this by showing how services connect and how requests flow through them. A globally unique ID is assigned to each request and then propagated along the entire request path. Each point of instrumentation along the path can attach its own metadata to the telemetry it records before passing the ID on to the next service.
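The propagation of a request ID can be sketched with three in-process functions standing in for separate services. This is a simplified illustration: real services would carry the ID in network request headers, and the header name `X-Request-ID`, the service names, and the `collected` list are assumptions for this example.

```python
import uuid

TRACE_HEADER = "X-Request-ID"  # hypothetical header name, chosen for illustration
collected = []                 # stands in for a telemetry backend


def record(service: str, headers: dict, **metadata) -> None:
    """Store one telemetry record tagged with the request's ID."""
    collected.append({"service": service, "trace_id": headers[TRACE_HEADER], **metadata})


def inventory_service(headers: dict) -> None:
    record("inventory", headers, items_checked=2)


def order_service(headers: dict) -> None:
    record("orders", headers, order_total=31.5)
    inventory_service(headers)  # pass the same ID downstream


def gateway(incoming_headers: dict) -> str:
    headers = dict(incoming_headers)
    # Assign the ID once, at the edge, unless the caller already supplied one.
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    record("gateway", headers)
    order_service(headers)
    return headers[TRACE_HEADER]


trace_id = gateway({})
request_path = [entry["service"] for entry in collected if entry["trace_id"] == trace_id]
```

Filtering the collected records by `trace_id` reconstructs the path of one request across all three services, which is the correlation that isolated telemetry lacks.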
Instrumentation using AOP adds execution overhead that is directly proportional to the number of measurement points defined by the aspect advice. To keep this overhead as low as possible, apply instrumentation only where it makes sense and where visibility is needed most.
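One way to limit the number of measurement points is to decide at weaving time which functions are wrapped at all. The sketch below assumes a hypothetical allow-list, `ENABLED_POINTS`; functions outside the list are returned unwrapped, so they carry no instrumentation overhead per call.

```python
import functools
import time

ENABLED_POINTS = {"charge_card"}  # hypothetical allow-list of measurement points
timings = []


def measured(func):
    """Wrap func only if it is an enabled measurement point; otherwise return it untouched."""
    if func.__name__ not in ENABLED_POINTS:
        return func  # no wrapper is created, so calls run at full speed

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings.append((func.__name__, time.perf_counter() - start))
    return wrapper


@measured
def charge_card(amount: float) -> bool:
    return amount > 0


@measured
def format_receipt(amount: float) -> str:
    return f"paid {amount:.2f}"


charge_card(10.0)
format_receipt(10.0)  # not in the allow-list, so nothing is recorded
```

Because the decision is made once, when the decorator is applied, changing the allow-list changes the overhead without touching the application functions themselves.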
References
Spring docs – Aspect Oriented Programming with Spring
Wikipedia – Distributed AOP
Brendan Gregg – The USE Method