A Vaccine for Performance and Reliability Engineers

June 14, 2022

By: Ramya Ramalinga Moorthy, Industrialization Head - Performance and Resilience Engineering Practice

With the significant shift toward cloud adoption and architecture modernization, the technical complexity of delivering resilient and reliable IT systems has increased. In-depth, end-to-end visibility of distributed architectural components is essential to quickly pinpoint issues in production, which in turn reduces Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR).
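As a rough illustration of how these two metrics can be derived, the sketch below computes MTTD and MTTR from a couple of incident records. The `Incident` fields and the sample timestamps are hypothetical, and the sketch assumes both metrics are measured from the moment the fault began; some teams measure MTTR from the point of detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the fault began (hypothetical field)
    detected: datetime   # when monitoring or a user flagged it
    resolved: datetime   # when service was restored


def mean_delta(deltas):
    """Average a list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time from fault start to detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time from fault start to resolution (an assumption, see lead-in)."""
    return mean_delta([i.resolved - i.started for i in incidents])


# Made-up incidents for illustration only
incidents = [
    Incident(datetime(2022, 6, 1, 10, 0), datetime(2022, 6, 1, 10, 12), datetime(2022, 6, 1, 11, 0)),
    Incident(datetime(2022, 6, 8, 22, 30), datetime(2022, 6, 8, 22, 34), datetime(2022, 6, 8, 23, 0)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

With these sample records, detection took 12 and 4 minutes and recovery took 60 and 30 minutes, so the averages work out to 8 and 45 minutes respectively.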

Be it problem diagnosis and troubleshooting activities carried out by Performance Engineers before production release, or the continuous production monitoring and performance efficiency improvement activities carried out by Site Reliability Engineers, a robust observability solution forms the backbone to successfully deliver and operate a highly scalable and available IT system.

Observability is the ability to understand a system's internal state from its external outputs (logs, metrics, and traces), known as the three pillars of observability. An observable system makes it possible to draw quick inferences about its health from those outputs. Hence, an IT system needs to be designed for observability to facilitate effective monitoring.
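To make the three pillars concrete, here is a minimal sketch using only the Python standard library. The function name, metric name and in-memory stores are hypothetical stand-ins; a real system would typically use an instrumentation library and ship the data to a backend. The point it tries to show is that one shared trace ID lets a log line, a span and a metric be correlated.

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()   # metrics: aggregated numeric counts
spans = []            # traces: timed units of work
logs = []             # logs: discrete, structured events


def handle_request(path):
    trace_id = uuid.uuid4().hex  # shared ID that ties the three signals together
    start = time.perf_counter()

    status = 200 if path != "/missing" else 404  # stand-in for real work

    duration_ms = (time.perf_counter() - start) * 1000
    metrics[f"http_requests_total{{status={status}}}"] += 1
    spans.append({"trace_id": trace_id, "name": f"GET {path}", "duration_ms": duration_ms})
    logs.append(json.dumps({
        "trace_id": trace_id,
        "level": "INFO" if status == 200 else "WARN",
        "msg": f"GET {path} -> {status}",
    }))
    return trace_id


handle_request("/orders")
bad = handle_request("/missing")

# Starting from one trace ID, pull the matching log line and span.
print([line for line in logs if bad in line])
print([span for span in spans if span["trace_id"] == bad])
print(dict(metrics))
```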

Advantages of the Observability Solution

The power of an observability solution is best realized in complex containerized environments running thousands of microservices. Data collected from various layers or server components across a hybrid cloud environment can be brought under one roof and linked in a meaningful way for easy correlation, enabling problem diagnosis in a few clicks. Using AI/ML, the observability solution has built-in intelligence to ignore outliers that could lead to false-positive alarms, and it helps detect problems proactively through intelligent anomaly detection techniques.
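The models inside commercial tools are not public, so the following is only a simple statistical stand-in for the idea, not the AI/ML any product actually uses. It scores each point against a baseline built from the median and the median absolute deviation (MAD), so a single extreme value does not distort the baseline the way it would with a mean and standard deviation. The function name, the 3.5 cut-off and the sample latencies are assumptions for illustration.

```python
from statistics import median


def robust_anomalies(values, threshold=3.5):
    """Return indexes whose modified z-score exceeds the threshold.

    Uses median and median absolute deviation (MAD), which are far less
    distorted by a few extreme points than mean and standard deviation.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be scored
    # 0.6745 scales MAD so the score is comparable to a standard z-score
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical response times in milliseconds, with one spike
latencies_ms = [120, 118, 125, 122, 119, 121, 480, 123, 120, 117]
print(robust_anomalies(latencies_ms))
```

Here only the 480 ms sample (index 6) is flagged; the ordinary variation between 117 and 125 ms stays well below the cut-off.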

Today, there is a growing shift from application performance monitoring to enterprise business process monitoring. This is made possible by mapping business processes to the underlying application services and infrastructure, which enables proactive impact analysis and intelligent problem-solving.

The observability solution generally offers telemetry capability through agent-based monitoring. The agent needs to be installed on the host that needs to be monitored. OpenTelemetry (OTel), an open-source vendor-agnostic observability framework (that includes libraries and APIs), has become the industry standard for data collection and export of telemetry data for cloud-native applications.

Key Capabilities of an Observability Tool

A full-stack observability solution is expected to have the following capabilities:

Digital experience monitoring or real user monitoring comprises end-user traffic monitoring, user actions, browser performance metrics, and mobile device performance metrics.
Availability monitoring includes synthetic monitoring metrics on the performance and availability of key business-critical transactions or services across global locations by simulating user actions through automated scripts run at the scheduled frequency.
Infrastructure health monitoring includes resource utilization and health metrics covering CPU, Memory, Disk and Network for on-premises and cloud infrastructure (Azure, AWS and GCP) inclusive of physical servers, VMs and containers. It should also offer alerting capabilities based on static and dynamic threshold configurations.
Application health monitoring includes metrics like application throughput, error rate, response time breakdown across various layers, heap usage and GC statistics, queuing statistics, web service calls, third-party calls and connection pool usage. This helps to drill down from a slow request to the service methods and database calls behind it. It offers deep-dive code analytics and distributed tracing capabilities to gain visibility into interactions between various services and their dependencies, so that issues in the code and SQL queries can be pinpointed quickly.
Log analytics includes capabilities for log ingestion and retention of data from various infrastructure sources to monitor and visualize trends and spot anomalies quickly. It should also provide a correlated view of various logs and event traces with resource utilization, user monitoring and application health metrics. In addition, it should enable auto-alerting to report sudden changes or historical anomalies.
SLO/SLI management includes SRE dashboards to monitor uptime, Service Level Objectives (SLOs) related to availability, latency, errors, traffic rate, tickets, etc., and error budget usage levels.
Out-of-the-box integrations include capabilities to integrate with widely used incident management tools, service management tools and communication channels.
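Of the capabilities listed above, error budget tracking is the easiest to show as arithmetic. The sketch below assumes a request-based availability SLO; the function name, the 99.9% target and the traffic figures are made up for illustration, and real SRE dashboards may define the SLI differently (for example, by time windows rather than request counts).

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize error budget usage for a request-based availability SLO."""
    # The budget is the number of failures the SLO tolerates in the period
    allowed_failures = total_requests * (1 - slo_target)
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "sli": 1 - failed_requests / total_requests,   # measured availability
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,                   # fraction of budget used
        "budget_remaining": 1 - consumed,
    }


# Hypothetical month: 99.9% availability SLO, 2,000,000 requests, 1,200 failures
report = error_budget(0.999, 2_000_000, 1_200)
print(report)
```

With these numbers the SLO tolerates about 2,000 failed requests, so 1,200 failures consume roughly 60% of the error budget and leave about 40% for the rest of the period.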

How Do We Choose the Right Observability Solution?

An observability solution is critical, particularly for applications built on a distributed architecture, to enable quick problem diagnosis. In recent years, there has been a rapid increase in the number of commercial and open-source observability solutions in the market. The following are the key considerations in choosing an observability solution:

Licensing / Pricing

Does the pricing model fit your organization’s budget?
Are there premium options offered for a pilot in a test environment? Are there multiple pricing packages available to suit small, mid-sized and large enterprises, and are customizations possible according to the organizational needs?

Installation and Tech Stack Support

Does it require on-premises installation or is it available as a SaaS platform?
Does it support the target technology stack and container-based architectures, and what does it take to do the instrumentation, installation and setup?

Security Compliance / Organization Policies

Does the observability solution hold the required security compliance certifications/accreditations?
Do your organization’s security policies permit agent installation on the hosts in production environments?
Are there any privacy laws or organization policies that restrict sharing of user data outside the organization?

Data Collection Standard

Is the observability solution compliant with the OpenTelemetry standard?
How much telemetry data needs to be retained by the observability solution and what are the expectations on data storage duration?

Technical Capabilities

Does the observability solution offer the required technical capabilities through unified dashboards with seamless integration of metrics, logs and event traces?
Are the anomaly detection and analytics dashboards powered by Artificial Intelligence / Machine Learning models?
Does it offer capabilities to dynamically create custom dashboards for correlated trend analysis?
Are out-of-the-box integrations with the existing tools available (cloud-native monitoring tools, incident management tools, etc.)?
How good is the user experience when troubleshooting and narrowing down problems?

One needs to be cognizant of the fact that no one size fits all. Some of the commercial tools are offered as a SaaS platform, which spares organizations the installation, maintenance and scaling of a resource-intensive solution, but this comes with a high price tag. AppDynamics, Dynatrace, New Relic, Splunk, and Datadog are the top commercial players in the market. Pinpoint, Apache SkyWalking, Jaeger, and SigNoz are the popular open-source alternatives.

Most organizations use multiple tools in combination in the production environment, so that the tools complement each other in data collection and speed up diagnosis, to achieve the benefits of full-stack observability. Some organizations have built custom solutions that fit well within their existing tools ecosystem and complement it by providing additional capabilities like predictive analytics, SLO monitoring dashboards, etc.

Though many tools claim to be observability solutions, some of them support only a partial set of capabilities. Hence, a careful evaluation of each solution's capabilities is required to choose the right tool that meets your organization’s needs.
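One common way to structure such an evaluation is a weighted scoring matrix built around the considerations discussed above. The sketch below is only an illustration of that approach: the weights, the 1-to-5 scores and the tool names are placeholders, not an assessment of any real product.

```python
# Weights reflect how much each consideration matters to a hypothetical
# organization; they are assumptions and sum to 1.0.
weights = {
    "pricing": 0.25,
    "tech_stack_support": 0.20,
    "security_compliance": 0.20,
    "opentelemetry_support": 0.15,
    "technical_capabilities": 0.20,
}

# Scores from 1 (poor) to 5 (excellent); tool names and values are placeholders
scores = {
    "Tool A": {"pricing": 2, "tech_stack_support": 5, "security_compliance": 5,
               "opentelemetry_support": 4, "technical_capabilities": 5},
    "Tool B": {"pricing": 5, "tech_stack_support": 3, "security_compliance": 3,
               "opentelemetry_support": 5, "technical_capabilities": 3},
}


def weighted_score(tool_scores, weights):
    """Sum of each criterion score multiplied by its weight."""
    return sum(weights[c] * tool_scores[c] for c in weights)


# Rank tools from highest to lowest weighted score
ranking = sorted(scores, key=lambda t: weighted_score(scores[t], weights), reverse=True)
for tool in ranking:
    print(f"{tool}: {weighted_score(scores[tool], weights):.2f}")
```

Changing the weights to match your own priorities (for example, giving pricing more weight in a cost-constrained organization) can change the ranking, which is the point of making the trade-offs explicit.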

Ramya Ramalinga Moorthy

Industrialization Head - Performance and Resilience Engineering Practice

Ramya Ramalinga Moorthy is associated with LTIMindtree as Industrialization Head - Performance and Resilience Engineering practice. She carries 19+ years of experience in the Non-Functional Requirements compliance space, helping several Fortune 500 clients with engineering strategies to validate their systems against Performance, Scalability, Availability, Capacity, Security, Resilience, and Reliability. She is a certified SRE consultant and ethical hacker. She is a renowned author and conference speaker.
