MELT: Metrics, Events, Logs & Traces (to improve performance)

Abílio Azevedo

February 25, 2024

Metrics for Performance Monitoring

Metrics are numerical indicators that provide an overview of the performance of a system. Because they are numeric, they lend themselves to mathematical modeling and forecasting, and they can be stored compactly in specialized data structures such as time series. Examples of useful metrics for understanding system behavior include:

CPU utilization
Error rate

Using metrics has several advantages, such as facilitating long-term data retention and simplifying queries. This makes them excellent for building dashboards that display past trends across multiple services.
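As a minimal sketch of how a metric such as error rate can be derived, the snippet below keeps two counters and divides one by the other. The class and counter names (`RequestMetrics`, `requests_total`, `errors_total`) are illustrative assumptions, as is the choice to count only 5xx responses as errors.

```python
from collections import Counter


class RequestMetrics:
    """Keeps two counters from which an error rate can be derived."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, status_code: int) -> None:
        # Every request increments the total; 5xx responses also count as errors
        # (an assumption made for this sketch).
        self.counts["requests_total"] += 1
        if status_code >= 500:
            self.counts["errors_total"] += 1

    def error_rate(self) -> float:
        total = self.counts["requests_total"]
        return self.counts["errors_total"] / total if total else 0.0


metrics = RequestMetrics()
for status in (200, 200, 500, 404, 503):
    metrics.record(status)

print(f"error rate: {metrics.error_rate():.0%}")  # error rate: 40%
```

Because only two integers are stored regardless of traffic volume, this kind of data is cheap to retain for long periods, which is what makes it suitable for dashboards.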

Events for Tracking and Problem Detection

Monitoring events are discrete occurrences with precise temporal and numerical values, which makes it possible to track critical events and detect potential issues related to user requests. In other words, they are actions that happened in a system at a given time.

Since events are extremely time-sensitive, they are usually accompanied by timestamps. Events also provide context for metric data. We can use events to identify the most critical points in our application, giving greater visibility into user behaviors that can affect performance or security. Examples of events include:

Login attempts
Alert notifications
HTTP requests and responses
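The idea of an event as a timestamped action with context can be sketched as a small record type. The `Event` class, its field names, and the `failed_logins` helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class Event:
    """A discrete occurrence: what happened, when, and with which context."""

    name: str
    attributes: Dict[str, Any]
    # The timestamp is set at creation time, in UTC.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def failed_logins(events: List[Event], user: str) -> List[Event]:
    """Return the failed login attempts recorded for one user."""
    return [
        e for e in events
        if e.name == "login_attempt"
        and e.attributes.get("user") == user
        and not e.attributes.get("success", False)
    ]


events = [
    Event("login_attempt", {"user": "alice", "success": False}),
    Event("login_attempt", {"user": "alice", "success": False}),
    Event("login_attempt", {"user": "bob", "success": True}),
    Event("http_request", {"method": "GET", "path": "/test", "status": 200}),
]

suspicious = failed_logins(events, "alice")
print(len(suspicious))  # 2
```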

Logs for Debugging and Monitoring

Logs provide a descriptive record of system behavior at a particular moment and are an essential tool for debugging. By analyzing log data, we can gain insights into application performance that are not accessible through APIs or databases.

A simple explanation is that logs are a record of all the activities that occur within the system.

Logs can take various forms, such as plain text or JSON objects, enabling diverse querying techniques. This makes logs one of the most useful data sources for investigating security threats and performance issues.
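To illustrate why JSON logs are easy to query, here is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class, the `checkout` logger name, and the log messages are assumptions made for the example; the logs are written to an in-memory buffer so the snippet is self-contained.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


# Write to an in-memory stream instead of a file or stdout.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output out of the root logger

logger.info("payment accepted")
logger.error("payment gateway timeout")

# Because every line is JSON, filtering by field is a plain comparison.
entries = [json.loads(line) for line in stream.getvalue().splitlines()]
errors = [entry for entry in entries if entry["level"] == "ERROR"]
print(errors[0]["message"])  # payment gateway timeout
```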

To get the most out of logs, it is essential to aggregate them in a centralized platform. This helps quickly find and fix errors, as well as monitor application performance.

Traces for Visibility in Distributed Systems

A trace refers to the complete path of a request or workflow as it progresses through the components of a distributed system, capturing the end-to-end flow.

A trace is therefore a collection of operations that together represent a single transaction handled by an application and its constituent services. A span represents a single operation within a trace and is the basic building block of distributed tracing.

Traces offer insights into directionality and relationships between data, providing information about service interactions and the effects of asynchronicity. By analyzing trace data, we can better understand the performance and behavior of a distributed system.
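The relationship between a trace and its spans can be sketched as follows. This is a simplified model, not the data model of any particular tracing library: the `Span` class, its fields, and the operation names are assumptions for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One operation inside a trace, linked to its parent by ID."""

    trace_id: str
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None

    def child(self, name: str) -> "Span":
        # A child shares the trace ID and points back at this span.
        return Span(trace_id=self.trace_id, name=name, parent_id=self.span_id)

    def finish(self) -> None:
        self.end = time.perf_counter()


root = Span(trace_id=uuid.uuid4().hex, name="GET /test")
auth = root.child("authenticate")
query = auth.child("SELECT user")

# Spans finish in reverse order: the innermost operation ends first.
for span in (query, auth, root):
    span.finish()

for span in (root, auth, query):
    print(span.name, "parent:", span.parent_id)
```

The shared `trace_id` groups the spans into one transaction, while `parent_id` records the direction of the calls.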

Examples of operations that a trace can capture include:

Executing SQL queries
Function calls during authentication requests

Instrumentation for tracing can be challenging, as each component of a request must be modified to transmit tracing data. Additionally, many applications are based on open-source frameworks or libraries that may require additional instrumentation.
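Transmitting tracing data between components usually means attaching the trace context to each outgoing request. The sketch below uses the W3C Trace Context `traceparent` header, which carries a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and flags. The helper names `inject` and `extract` are assumptions, and the validation is deliberately simplified.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# version "00", 32 hex digits of trace ID, 16 hex digits of span ID, 2 hex digits of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: Dict[str, str], trace_id: str, span_id: str,
           sampled: bool = True) -> Dict[str, str]:
    """Caller side: add the trace context to the outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Callee side: read (trace_id, parent_span_id), or None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None


trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex digits
span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex digits

outgoing = inject({"accept": "application/json"}, trace_id, span_id)
print(outgoing["traceparent"])
print(extract(outgoing) == (trace_id, span_id))  # True
```

Every service in the request path has to perform both steps, which is why instrumentation touches each component.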

Types of Logs

There are several common types of logs that are important for software engineers and system administrators:

Audit Logs: Record information about user activities, configuration changes, authentication, and authorization. They usually include details like user identification, IP address, date/time, action performed, status, etc.
System Logs: Record messages and events from the operating system, services, or applications. They include information about errors, failures, process execution, resource utilization, etc.
Debug Logs: Provide detailed, technical-level information for debugging errors and issues. They often include variables, function/method execution states, queries, etc.

OpenTelemetry

OpenTelemetry is a set of APIs, libraries, agents, and collectors that help capture and record information about the performance and behavior of distributed applications. With OpenTelemetry, you can collect, analyze, and visualize telemetry data, such as traces, metrics, and logs, in a standardized and interoperable way.

Read about OpenTelemetry, an open-source observability framework that helps collect telemetry data from a variety of cloud-based sources.


Log Emission and Capture

The main tools and technologies for log emission and capture include:

Logging Libraries: Provide APIs and features for applications to generate and write logs in a standardized way (e.g., log4j, log4net).
Log Collection Agents: Installed alongside applications to capture logs and send them to a central server (e.g., Splunk Universal Forwarder, Logstash).
Syslog: Standard protocol for log messages, enabling network-based transmission to syslog servers (e.g., rsyslog, syslog-ng).
Beats: Lightweight agents for collecting logs and metrics for Elasticsearch (e.g., Filebeat, Metricbeat).
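For the syslog protocol mentioned in the list above, each message starts with a priority value computed as facility × 8 + severity. The sketch below shows that calculation with a subset of the standard facility codes; the function names are assumptions, and the formatted line is simplified (real syslog messages also carry a timestamp and hostname).

```python
# Subset of the standard syslog facility and severity codes.
FACILITIES = {"kern": 0, "user": 1, "auth": 4, "local0": 16}
SEVERITIES = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}


def syslog_pri(facility: str, severity: str) -> int:
    """Priority value placed between angle brackets at the start of a message."""
    return FACILITIES[facility] * 8 + SEVERITIES[severity]


def format_line(facility: str, severity: str, tag: str, text: str) -> str:
    """Simplified line: priority prefix, tag, and text only."""
    return f"<{syslog_pri(facility, severity)}>{tag}: {text}"


print(syslog_pri("auth", "crit"))   # 34
print(syslog_pri("local0", "err"))  # 131
print(format_line("auth", "crit", "su", "authentication failed"))
```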

Log Processing and Analysis

Once collected, logs need to be processed and analyzed. Some solutions include:

Splunk: Leading platform for indexing, analyzing, and visualizing logs with advanced search, alerting, dashboards, and integration capabilities.
Elastic Stack: Open-source suite of tools from Elastic for ingestion (Logstash), storage (Elasticsearch), visualization (Kibana), and log analysis.
Datadog: Monitoring service with large-scale log processing and analysis capabilities in the cloud.
Graylog: Open-source solution for centralized log management with full-text search, analytics, alerting, and dashboards.
Sentry.io: Application monitoring platform for errors and performance. Captures exceptions, errors, and performance issues in real-time. Features advanced error grouping, issue tracking, alerts, and integrations with development tools. Excels at rapid detection and triage of software bugs, especially in production environments. Widely used for web and mobile applications.
Grafana Logs: Grafana Cloud Logs is a fully managed log aggregation system, powered by Grafana Loki, for storing and querying logs from applications and infrastructure.

What is APM?

APM (Application Performance Management/Monitoring) is a set of tools for monitoring and managing the performance and availability of applications. It provides visibility into how applications behave in production and allows teams to identify and resolve performance issues.

Key components:

Monitoring: collects metrics such as response times and error rates.
Analysis: identifies patterns and trends in the collected data.
Diagnostics: determines the root causes of performance problems.
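The monitoring component can be sketched as a decorator that records how long each call takes. Real APM agents do this automatically and in far more detail; the `timed` decorator, the `latencies_ms` store, and the `handle_request` function are hypothetical names used only for this illustration.

```python
import statistics
import time
from functools import wraps
from typing import Callable, Dict, List

# Response times in milliseconds, keyed by function name.
latencies_ms: Dict[str, List[float]] = {}


def timed(func: Callable) -> Callable:
    """Record the duration of every call, including calls that raise."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = (time.perf_counter() - start) * 1000
            latencies_ms.setdefault(func.__name__, []).append(elapsed)

    return wrapper


@timed
def handle_request(delay_s: float) -> str:
    time.sleep(delay_s)  # stands in for real work
    return "ok"


for delay in (0.001, 0.002, 0.005):
    handle_request(delay)

samples = latencies_ms["handle_request"]
print(f"calls={len(samples)} mean={statistics.mean(samples):.1f}ms max={max(samples):.1f}ms")
```

The analysis and diagnostics components would then work on the collected samples, for example by comparing the mean and maximum over time.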

Differences from traditional tools:

Focus on end-user experience and overall application performance.
Function-level analysis and a breakdown of resource usage across the code.
Specific insights for improving code and performance.

Benefits of APM:

Better real-time visibility into performance.
Faster detection and resolution of problems.
Optimized application performance and efficiency.

APM for Node.js:

Helps identify asynchronous I/O bottlenecks and memory leaks.
Provides insights for better concurrency management and scalability.
N|Solid from NodeSource is specialized for Node.js: more efficient, security-focused, with specific metrics.

Use Case - Improving Latency Between Services

Problem

Through Datadog monitoring, we identified that the /test endpoint makes two requests:

One for app1
Another for app2

Both show considerable latency.

[Image: Datadog, without Service Discovery]

What is AWS Service Discovery?

AWS Service Discovery is an AWS service that allows applications to automatically locate and connect to other services at runtime, without needing to know the physical location of those services. This is particularly useful in microservices environments, where new services are constantly being added, removed, or moved.

How does Service Discovery improve latency?

Without Service Discovery, when an application needs to communicate with another, it usually has to go through the API Gateway, which acts as a centralized entry point. This can add latency to the communication, especially if the services are in different regions or networks.

With Service Discovery, applications can connect directly to each other, without needing to go through the API Gateway. This reduces the number of "hops" in the network, decreasing the latency between the services.

Additionally, Service Discovery allows applications to dynamically find the addresses and ports of the services they need to communicate with, eliminating the need for manual configuration or using an external DNS service.

How does Service Discovery work in AWS?

Service Discovery in AWS is provided by the AWS Cloud Map service. Cloud Map allows you to register your services and make them discoverable by other applications.

When an application needs to communicate with another service, it queries Cloud Map to obtain the necessary information, such as the IP address and port of the target service. This is transparent to the application, which simply uses the service name in its URL.

Cloud Map can also provide additional features, such as health checks and automatic failover, to ensure the availability and resilience of the services.
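When the services are registered under a DNS name, the lookup described above looks like an ordinary DNS resolution from the application's point of view. The sketch below assumes that setup and uses `localhost` as a stand-in so it runs anywhere; in the scenario described here, the name would be the one registered in Cloud Map. The `resolve_service` function name is an assumption.

```python
import socket
from typing import List


def resolve_service(service_name: str, port: int) -> List[str]:
    """Return the distinct IP addresses that the service name resolves to."""
    infos = socket.getaddrinfo(service_name, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# "localhost" stands in for a registered service name.
addresses = resolve_service("localhost", 80)
print(addresses)
```

Because resolution happens at request time, instances that are added or removed are picked up without changing the application's configuration.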

Solution

We used AWS Cloud Map to connect the two services directly over the same network, which reduced the latency between them.

[Image: Datadog, with Service Discovery]

Configure the Service Discovery following this article.

In Cloud Map, get the name of your service:

[Image: Cloud Map]

In Route 53, get the name of the A record:

[Image: Route 53 A record]

We also need the container port, which you can obtain from the container task definition:

[Image: container task definition]

The final step is to replace the base URL in the application with:

http://{register-name}:{containerPort}
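One way to apply this in application code is to build the base URL from configuration instead of hard-coding it. This is a sketch under assumptions: the environment variable names `APP2_REGISTER_NAME` and `APP2_CONTAINER_PORT`, their default values, and the `service_base_url` helper are all hypothetical.

```python
import os


def service_base_url(register_name: str, container_port: int) -> str:
    """Build http://{register-name}:{containerPort} for direct service-to-service calls."""
    if not register_name:
        raise ValueError("register_name must not be empty")
    if not 0 < container_port < 65536:
        raise ValueError(f"invalid port: {container_port}")
    return f"http://{register_name}:{container_port}"


# Hypothetical environment variables with placeholder defaults.
register_name = os.environ.get("APP2_REGISTER_NAME", "app2.example.local")
container_port = int(os.environ.get("APP2_CONTAINER_PORT", "3000"))

base_url = service_base_url(register_name, container_port)
print(base_url)
```

Keeping the two values in configuration means the same build can point at a different service name or port per environment.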

