The Three Pillars of Observability: Logs, Metrics & Traces Explained

Imagine you are a detective investigating a mysterious system slowdown. You have three critical tools at your disposal: logs, metrics, and traces. These three pillars of observability work together to give you a clear picture of what went wrong and where. Without them, debugging a distributed system is like searching for a missing sock in a laundromat—frustrating and time-consuming!

In this article, we’ll break down these three pillars, explain their advantages, and demonstrate how to implement them using Fluentd, Prometheus, and Grafana Tempo. We’ll even sprinkle in a bit of AI magic for anomaly detection to supercharge your observability stack.


1. Logs: The Who, What, and When of Your System

Logs are structured or unstructured records of events happening in a system. They provide the context needed to understand what happened and when. A well-structured logging system can help answer questions like:

  • Who triggered this event?

  • What action was taken?

  • When did it happen?

Implementing Logs with Fluentd

Fluentd is an open-source data collector that unifies logging across different environments.

Steps to Set Up Fluentd:

  1. Install Fluentd:

    curl -fsSL https://toolbelt.treasuredata.com/sh/install-ubuntu-fluentd.sh | sh
  2. Configure Fluentd to collect logs: Add the following configuration to /etc/fluent/fluent.conf:

    <source>
      @type tail
      path /var/log/app.log
      pos_file /var/log/fluentd.pos
      tag app.logs
      <parse>
        @type none
      </parse>
    </source>
  3. Send logs to Elasticsearch or another backend for querying:

    <match app.logs>
      @type elasticsearch
      host localhost
      port 9200
    </match>
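
Note that the Elasticsearch output comes from a separate plugin, typically installed with fluent-gem install fluent-plugin-elasticsearch.

Logs are far easier to query downstream if the application writes structured (JSON) lines in the first place. Below is a minimal sketch in Python using only the standard library; the service writing to /var/log/app.log and the field names are illustrative, not part of the Fluentd setup above.

    # Write one JSON object per line to the file Fluentd tails above.
    # Field names ("user", "level", ...) are illustrative.
    import json
    import logging
    from datetime import datetime, timezone

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            return json.dumps({
                "timestamp": datetime.now(timezone.utc).isoformat(),  # when it happened
                "level": record.levelname,
                "user": getattr(record, "user", "anonymous"),         # who triggered it
                "message": record.getMessage(),                       # what action was taken
            })

    handler = logging.FileHandler("/var/log/app.log")
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("order placed", extra={"user": "alice"})

Each line then lands in Elasticsearch as a parsed document rather than an opaque string, which makes the who/what/when questions above answerable with a simple query.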

2. Metrics: The Health Check of Your System

Metrics provide quantitative insights into system performance over time. Unlike logs, which capture discrete events, metrics track trends and patterns.

Using Prometheus for Metrics

Prometheus is a powerful monitoring system that collects and stores time-series data.

Steps to Set Up Prometheus:

  1. Install Prometheus:

    wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
    tar xvfz prometheus-*.tar.gz
  2. Configure Prometheus to collect metrics: Add the following scrape job to prometheus.yml (it targets the Node Exporter, which exposes host metrics on port 9100):

    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets: ['localhost:9100']
  3. Start Prometheus from the extracted directory:

    cd prometheus-2.35.0.linux-amd64
    ./prometheus --config.file=prometheus.yml
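
The node job above covers host-level metrics; application metrics need the app to expose its own /metrics endpoint. Here is a minimal sketch assuming the prometheus_client Python library; the metric names and port 8000 are illustrative choices, not part of the setup above.

    # Expose custom application metrics for Prometheus to scrape.
    # Requires: pip install prometheus-client
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled")
    LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

    def handle_request():
        with LATENCY.time():               # observe how long the work took
            time.sleep(random.random() / 10)
        REQUESTS.inc()                     # count the request

    if __name__ == "__main__":
        start_http_server(8000)            # serves /metrics on port 8000
        while True:
            handle_request()

Add localhost:8000 as another target under scrape_configs and Prometheus will begin collecting these series alongside the node metrics.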

3. Traces: The Breadcrumb Trail of Requests

Traces track the journey of requests across microservices, helping diagnose latency issues and bottlenecks.

Using Grafana Tempo for Distributed Tracing

Grafana Tempo is an open-source tracing backend that works with Jaeger and OpenTelemetry.

Steps to Set Up Tracing:

  1. Install Grafana and Tempo:

    docker run -d --name=grafana -p 3000:3000 grafana/grafana
    docker run -d --name=tempo -p 4317:4317 grafana/tempo
  2. Configure Tempo to receive Jaeger-format traces: Add the following to the distributor block of Tempo’s configuration (tempo.yaml):

    distributor:
      receivers:
        jaeger:
          protocols:
            grpc:
            thrift_http:
  3. View traces in Grafana:

    • Add Tempo as a data source in Grafana.

    • Use TraceQL to query traces and analyze performance bottlenecks.
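
To actually produce traces, instrument the application with OpenTelemetry and point the exporter at Tempo’s OTLP port (4317, published above). Below is a minimal sketch assuming the opentelemetry-sdk and opentelemetry-exporter-otlp Python packages; the service and span names are illustrative.

    # Send spans to the Tempo container via OTLP/gRPC on localhost:4317.
    # Requires: pip install opentelemetry-sdk opentelemetry-exporter-otlp
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer(__name__)

    # Nested spans model one request passing through two internal steps.
    with tracer.start_as_current_span("checkout"):
        with tracer.start_as_current_span("charge-card"):
            pass  # call the payment service here

Once spans arrive, the parent/child relationship between "checkout" and "charge-card" shows up as a waterfall in Grafana, making it obvious where the time went.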

Tip: AI-Powered Trace Analysis

AI models can analyze traces to detect patterns of slow requests and predict potential failures. Using AI anomaly detection, we can identify outlier traces and alert engineers before issues escalate. I will write a separate blog on my practical exposure to it.
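
As a taste of the idea, here is a hedged sketch using scikit-learn’s IsolationForest to flag traces with outlier durations. The data is made up, and a real pipeline would use richer features (error flags, span counts, service hops) pulled from Tempo.

    # Flag anomalous trace durations with an Isolation Forest.
    # Requires: pip install scikit-learn numpy
    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Trace durations in milliseconds (illustrative data, not real traces).
    durations = np.array([[120], [135], [110], [128], [4300], [118], [9800]])

    model = IsolationForest(contamination=0.2, random_state=42)
    labels = model.fit_predict(durations)   # -1 marks an outlier

    for duration, label in zip(durations.ravel(), labels):
        if label == -1:
            print(f"anomalous trace: {duration} ms -> alert the on-call engineer")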


Advantages of a Unified Observability Stack

| Feature          | Logs      | Metrics            | Traces              |
|------------------|-----------|--------------------|---------------------|
| What it tracks   | Events    | Performance trends | Request journey     |
| Storage duration | Long      | Medium             | Short               |
| Data format      | Text      | Time-series        | Distributed spans   |
| Best for         | Debugging | Monitoring         | Performance tuning  |

By combining Fluentd, Prometheus, and Grafana Tempo (visualized in Grafana), we create a powerful observability stack that provides deep insights into system performance and stability.

Benefits:

  • Faster Debugging: Quickly pinpoint issues before they escalate.

  • Proactive Monitoring: Detect anomalies using AI and alert teams.

  • Better Performance Optimization: Improve latency and efficiency.


Conclusion

Observability is not just about collecting data—it’s about understanding it. By leveraging logs, metrics, and traces, along with AI-driven anomaly detection, you can maintain a resilient and high-performing system.

So, next time your system is acting up, don’t panic! Just check the logs, monitor the metrics, and trace the requests. Your system’s health is in your hands.

🚀 Happy debugging!
