The Three Pillars of Observability: Logs, Metrics & Traces Explained

Imagine you are a detective investigating a mysterious system slowdown. You have three critical tools at your disposal: logs, metrics, and traces. These three pillars of observability work together to give you a clear picture of what went wrong and where. Without them, debugging a distributed system is like searching for a missing sock in a laundromat—frustrating and time-consuming!

In this article, we’ll break down these three pillars, explain their advantages, and demonstrate how to implement them using Fluentd, Prometheus, and Grafana Tempo. We’ll even sprinkle in a bit of AI magic for anomaly detection to supercharge your observability stack.


1. Logs: The Who, What, and When of Your System

Logs are structured or unstructured records of events happening in a system. They provide the context needed to understand what happened and when. A well-structured logging system can help answer questions like:

  • Who triggered this event?

  • What action was taken?

  • When did it happen?

Implementing Logs with Fluentd

Fluentd is an open-source data collector that unifies logging across different environments.

Steps to Set Up Fluentd:

  1. Install Fluentd:

    curl -fsSL https://toolbelt.treasuredata.com/sh/install-ubuntu-fluentd.sh | sh
  2. Configure Fluentd to collect logs: Add the following configuration to /etc/fluent/fluent.conf:

    <source>
      @type tail
      path /var/log/app.log
      pos_file /var/log/fluentd.pos
      tag app.logs
      <parse>
        @type none
      </parse>
    </source>
  3. Send logs to Elasticsearch or another backend for querying:

    <match app.logs>
      @type elasticsearch
      host localhost
      port 9200
    </match>
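
Note that the Elasticsearch output comes from a separate plugin, typically installed with fluent-gem install fluent-plugin-elasticsearch.

Logs are far easier to query downstream if the application writes structured (JSON) lines in the first place. Below is a minimal sketch in Python using only the standard library; the service writing to /var/log/app.log and the field names are illustrative, not part of the Fluentd setup above.

    # Write one JSON object per line to the file Fluentd tails above.
    # Field names ("user", "level", ...) are illustrative.
    import json
    import logging
    from datetime import datetime, timezone

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            return json.dumps({
                "timestamp": datetime.now(timezone.utc).isoformat(),  # when it happened
                "level": record.levelname,
                "user": getattr(record, "user", "anonymous"),         # who triggered it
                "message": record.getMessage(),                       # what action was taken
            })

    handler = logging.FileHandler("/var/log/app.log")
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("order placed", extra={"user": "alice"})

Each line then lands in Elasticsearch as a parsed document rather than an opaque string, which makes the who/what/when questions above answerable with a simple query.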

2. Metrics: The Health Check of Your System

Metrics provide quantitative insights into system performance over time. Unlike logs, which capture discrete events, metrics track trends and patterns.

Using Prometheus for Metrics

Prometheus is a powerful monitoring system that collects and stores time-series data.

Steps to Set Up Prometheus:

  1. Install Prometheus:

    wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
    tar xvfz prometheus-*.tar.gz
  2. Configure Prometheus to collect metrics: Add the following scrape job to prometheus.yml (it targets the Node Exporter, which exposes host metrics on port 9100):

    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets: ['localhost:9100']
  3. Start Prometheus from the extracted directory:

    cd prometheus-2.35.0.linux-amd64
    ./prometheus --config.file=prometheus.yml
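
The node job above covers host-level metrics; application metrics need the app to expose its own /metrics endpoint. Here is a minimal sketch assuming the prometheus_client Python library; the metric names and port 8000 are illustrative choices, not part of the setup above.

    # Expose custom application metrics for Prometheus to scrape.
    # Requires: pip install prometheus-client
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled")
    LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

    def handle_request():
        with LATENCY.time():               # observe how long the work took
            time.sleep(random.random() / 10)
        REQUESTS.inc()                     # count the request

    if __name__ == "__main__":
        start_http_server(8000)            # serves /metrics on port 8000
        while True:
            handle_request()

Add localhost:8000 as another target under scrape_configs and Prometheus will begin collecting these series alongside the node metrics.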

3. Traces: The Breadcrumb Trail of Requests

Traces track the journey of requests across microservices, helping diagnose latency issues and bottlenecks.

Using Grafana Tempo for Distributed Tracing

Grafana Tempo is an open-source tracing backend that works with Jaeger and OpenTelemetry.

Steps to Set Up Tracing:

  1. Install Grafana and Tempo:

    docker run -d --name=grafana -p 3000:3000 grafana/grafana
    docker run -d --name=tempo -p 4317:4317 grafana/tempo
  2. Configure Tempo to receive Jaeger-format traces: Add the following to the distributor block of Tempo’s configuration (tempo.yaml):

    distributor:
      receivers:
        jaeger:
          protocols:
            grpc:
            thrift_http:
  3. View traces in Grafana:

    • Add Tempo as a data source in Grafana.

    • Use TraceQL to query traces and analyze performance bottlenecks.
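
To actually produce traces, instrument the application with OpenTelemetry and point the exporter at Tempo’s OTLP port (4317, published above). Below is a minimal sketch assuming the opentelemetry-sdk and opentelemetry-exporter-otlp Python packages; the service and span names are illustrative.

    # Send spans to the Tempo container via OTLP/gRPC on localhost:4317.
    # Requires: pip install opentelemetry-sdk opentelemetry-exporter-otlp
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer(__name__)

    # Nested spans model one request passing through two internal steps.
    with tracer.start_as_current_span("checkout"):
        with tracer.start_as_current_span("charge-card"):
            pass  # call the payment service here

Once spans arrive, the parent/child relationship between "checkout" and "charge-card" shows up as a waterfall in Grafana, making it obvious where the time went.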

Tip: AI-Powered Trace Analysis

AI models can analyze traces to detect patterns of slow requests and predict potential failures. Using AI anomaly detection, we can identify outlier traces and alert engineers before issues escalate. I will write a separate blog on my practical exposure to it.
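
As a taste of the idea, here is a hedged sketch using scikit-learn’s IsolationForest to flag traces with outlier durations. The data is made up, and a real pipeline would use richer features (error flags, span counts, service hops) pulled from Tempo.

    # Flag anomalous trace durations with an Isolation Forest.
    # Requires: pip install scikit-learn numpy
    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Trace durations in milliseconds (illustrative data, not real traces).
    durations = np.array([[120], [135], [110], [128], [4300], [118], [9800]])

    model = IsolationForest(contamination=0.2, random_state=42)
    labels = model.fit_predict(durations)   # -1 marks an outlier

    for duration, label in zip(durations.ravel(), labels):
        if label == -1:
            print(f"anomalous trace: {duration} ms -> alert the on-call engineer")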


Advantages of a Unified Observability Stack

| Feature          | Logs      | Metrics            | Traces              |
|------------------|-----------|--------------------|---------------------|
| What it tracks   | Events    | Performance trends | Request journey     |
| Storage duration | Long      | Medium             | Short               |
| Data format      | Text      | Time-series        | Distributed spans   |
| Best for         | Debugging | Monitoring         | Performance tuning  |

By combining Fluentd, Prometheus, and Grafana Tempo (visualized in Grafana), we create a powerful observability stack that provides deep insights into system performance and stability.

Benefits:

  • Faster Debugging: Quickly pinpoint issues before they escalate.

  • Proactive Monitoring: Detect anomalies using AI and alert teams.

  • Better Performance Optimization: Improve latency and efficiency.


Conclusion

Observability is not just about collecting data—it’s about understanding it. By leveraging logs, metrics, and traces, along with AI-driven anomaly detection, you can maintain a resilient and high-performing system.

So, next time your system is acting up, don’t panic! Just check the logs, monitor the metrics, and trace the requests. Your system’s health is in your hands.

🚀 Happy debugging!
