
The Complete Guide to Observability for LiveKit Agents

Written by Saisaran D
Feb 6, 2026
9 Min Read

Why do LiveKit agents sometimes fail without warning, leaving you unsure of what actually went wrong? I’ve run into sudden disconnections, poor audio, and unresponsive agents in production, and the most frustrating part is when logs show nothing more than “Agent disconnected” with no real context.

Real-time communication apps like LiveKit are much harder to monitor than standard web apps. I’ve seen cases where a half-second delay that’s acceptable for a webpage completely breaks the experience in a live call. With constant state changes, multiple failure points, and complex debugging across servers, networks, and devices, the need for observability becomes critical.

Yet most teams still struggle to get it right. Even after working with multiple production systems, I’ve seen how common this gap is. According to the 2024 Logz.io Observability Pulse Report, only 10% of organisations report having full observability across their systems. Even in the wider observability market, some areas show slower adoption, like unified analysis of Kubernetes infrastructure (27%), combining security data with telemetry (23%), and pipeline analytics (18%).

In this article, I’ll show how to close that gap by building a complete observability stack for LiveKit agents using Prometheus, Loki, Tempo, Grafana, and OpenTelemetry, based on what actually helps when things break in production. By the end, you’ll know how to monitor metrics, trace failures, analyze logs, and build dashboards that give you clear insights.

LiveKit Observability Stack

Let’s break down the observability stack you can use for LiveKit. This is the setup I rely on when I need fast answers during real incidents, with each tool handling a specific part of the problem.

1. Prometheus (The Metrics Detective)

Prometheus continuously collects numerical data from your LiveKit agents, such as CPU usage, memory consumption, active participant counts, and connection success rates. I use it as the first signal that something has changed, even before I know exactly what went wrong. It's like having a health monitor that takes your application's vital signs every few seconds. When something goes wrong, Prometheus tells you what changed and when it started happening.

2. Loki (The Log Librarian)

Loki gathers and organizes all the text logs from your services. When Prometheus shows me that something changed, Loki is usually where I go next to understand what actually caused it. It’s designed to search quickly through massive amounts of logs, which makes it a great fit for chatty LiveKit applications that generate thousands of entries every minute.

3. Tempo (The Story Teller)

Tempo tracks distributed traces, which are like detailed stories showing how a request moves through your system. For example, when a participant joins a room, Tempo can map every step: authentication, room setup, media negotiation, and connection establishment. It doesn’t just tell you what failed; it shows you where in the process things went wrong.

4. Grafana (The Visual Narrator)

Grafana is the dashboard that brings all your observability data together in one place. It takes raw inputs from Prometheus, Loki, and Tempo and turns them into clear charts, graphs, and alerts that are easy to understand. Think of it as your mission control center, where you can see everything happening across your LiveKit agents in real time.

5. OpenTelemetry (OTEL) (The Data Collector)

OpenTelemetry is the layer that adds tracing to your LiveKit agents and sends that data to Tempo. Think of it as placing sensors across your code that record what’s happening and how long each step takes. The best part is that it’s a standard; once you set it up, it works with Tempo, Grafana, or any other observability backend.

Together, these tools form more than just a stack: they work as a unified system. Prometheus gives you metrics, Loki provides logs, Tempo traces the journey, Grafana pulls it all into one view, and OpenTelemetry ties everything together. On their own, each tool is powerful. But combined, they create a complete observability layer that’s especially effective for real-time systems like LiveKit.

Why Is This Combination Perfect for LiveKit?

Traditional monitoring setups often fall short with real-time applications. I’ve seen basic monitoring miss issues that only became obvious once metrics, logs, and traces were viewed together. Here's why this stack is particularly well-suited for LiveKit agents:

- Low overhead: These tools are designed to monitor high-throughput systems without impacting performance

- Real-time capabilities: Dashboards update within seconds, critical for debugging live issues  

- Distributed tracing: Essential for understanding complex participant flows and media negotiations

- Cost-effective: All open-source tools that scale well without licensing costs

- Industry standard: Skills and knowledge transfer to other projects and teams


Prerequisites

Now that you know why this stack works so well for LiveKit, let’s look at what you’ll need before setting it up.

Before starting, I usually make sure Docker is installed, a LiveKit agent is running, and the basics of containers and networking are covered.

Getting Started: Setting Up Your LiveKit Observability Stack

The beauty of modern observability stacks is that you can get everything running with Docker Compose in just a few minutes. Let's create a simple setup that gets all our tools talking to each other.

Step 1: Basic Docker Compose Setup

First, create a `docker-compose.yml` file. This single file will bring up our entire monitoring stack:

version: '3.8'
services:
  # Prometheus - Collects metrics from your LiveKit agents
  prometheus:
    image: prom/prometheus:v2.40.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  # Loki - Stores and searches your application logs  
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
  # Tempo - Stores distributed traces from OTEL
  tempo:
    image: grafana/tempo:2.2.0
    ports:
      - "3200:3200"   # Main API
      - "4317:4317"   # Where your agent sends traces
  # Grafana - Your visual dashboard for everything
  grafana:
    image: grafana/grafana:10.1.0
    ports:
      - "3000:3000"   # Web interface
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

This Docker Compose file creates four containers that can all talk to each other. Each service exposes specific ports: Grafana runs on port 3000 (your main dashboard), Prometheus on 9090 (metrics), Loki on 3100 (logs), and Tempo on 3200 and 4317 (traces).
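One caveat worth knowing: the grafana/tempo image typically won’t accept OTLP traffic unless its config file enables the OTLP receiver. If traces never show up, you may need to mount a minimal `tempo.yaml` and point the container at it with `command: ["-config.file=/etc/tempo.yaml"]`. The sketch below uses field names that match Tempo 2.x; treat it as a starting point, not a complete production config:

```yaml
# tempo.yaml - minimal sketch for Tempo 2.x (verify against your version's docs)
server:
  http_listen_port: 3200          # main API port exposed in the Compose file
distributor:
  receivers:
    otlp:
      protocols:
        grpc:                     # listens on 4317, where your agent sends traces
storage:
  trace:
    backend: local                # store trace blocks on the container filesystem
    local:
      path: /tmp/tempo/blocks
```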


Step 2: Configure Prometheus to Find Your Agent

Prometheus needs to know where to find your LiveKit agent's metrics. Create a simple `prometheus.yml` file:

global:
  scrape_interval: 15s  # Check for new metrics every 15 seconds
scrape_configs:
  # Tell Prometheus to collect metrics from your LiveKit agent
  - job_name: 'livekit-agents'
    static_configs:
      - targets: ['host.docker.internal:8080']  # Your agent's metrics port
    scrape_interval: 5s  # Check more frequently for real-time apps

This configuration tells Prometheus to check your LiveKit agent every 5 seconds for new metrics. The `host.docker.internal:8080` address means "connect to port 8080 on the host machine"; you'll need to make sure your agent exposes a `/metrics` endpoint on this port. Note that `host.docker.internal` resolves automatically on Docker Desktop (macOS/Windows), but on Linux you'll need to add `extra_hosts: ["host.docker.internal:host-gateway"]` to the Prometheus service in your Compose file.
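If your agent doesn't already expose metrics, the prometheus_client library is one straightforward way to do it from Python. The metric names below (`livekit_participant_total`, `livekit_connection_errors_total`) are illustrative, chosen to match the queries used later in this article; LiveKit does not emit them automatically:

```python
from prometheus_client import Counter, Gauge, generate_latest, start_http_server

# Illustrative metric names; nothing in LiveKit exports these automatically
participants = Gauge("livekit_participant_total", "Currently connected participants")
errors = Counter("livekit_connection_errors", "Failed connection attempts")

# Serve /metrics on port 8080 in a background thread for Prometheus to scrape
start_http_server(8080)

# Call these from your agent's event handlers
participants.inc()   # a participant joined
errors.inc()         # a connection attempt failed

# generate_latest() renders the same text Prometheus sees at /metrics
print(generate_latest().decode())
```

The Counter automatically gains a `_total` suffix in the exposition format, which is why the PromQL examples later query `livekit_connection_errors_total`.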


Step 3: Start Everything Up and Connect the Pieces

Now let's get our monitoring stack running:

# Start all services
docker-compose up -d
# Check that everything is running
docker-compose ps

The first command starts all four services in the background; the second confirms they're running. Give it a minute or two for everything to fully start up.

Once everything is running, you can access:

- Grafana: http://localhost:3000 (login: admin/admin)

- Prometheus: http://localhost:9090 (to see raw metrics)

The first thing you'll want to do is connect Grafana to your data sources. In Grafana:

1. Go to Configuration → Data Sources

2. Add Prometheus with URL: `http://prometheus:9090`

3. Add Loki with URL: `http://loki:3100`  

4. Add Tempo with URL: `http://tempo:3200`

These URLs use the service names from our Docker Compose file. Docker automatically creates a network where services can reach each other by name.
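If you'd rather not click through the UI every time the stack is rebuilt, Grafana also supports datasource provisioning: drop a YAML file into the container and the datasources exist on startup. A sketch (mount it at `/etc/grafana/provisioning/datasources/datasources.yaml` via a volume in the Compose file):

```yaml
# datasources.yaml - provisioned Grafana datasources
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
  - name: Loki
    type: loki
    url: http://loki:3100
    access: proxy
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    access: proxy
```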

Adding Distributed Tracing to Your LiveKit Agent

Now comes the exciting part, actually getting trace data from your LiveKit agent into Tempo. Traces show you the journey of each request through your system, which is incredibly valuable for debugging real-time communication issues.

For Node.js/TypeScript Agents:

First, install the OpenTelemetry packages:

npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc

Create a simple tracing setup file called `tracing.js`:

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
// This tells OpenTelemetry to send traces to Tempo
const traceExporter = new OTLPTraceExporter({
  url: 'http://localhost:4317',
});
const sdk = new NodeSDK({
  traceExporter,
  serviceName: 'livekit-agent',
});
sdk.start();

This code sets up OpenTelemetry to automatically capture traces from your application and send them to Tempo on port 4317. The `serviceName` helps you identify traces from this specific agent in your dashboards.

Then, in your main agent file, import tracing FIRST (this is crucial):

// Import tracing BEFORE anything else
require('./tracing');
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('livekit-agent');
// Your existing LiveKit agent code...
class MyLiveKitAgent {
  async handleParticipantConnected(participant) {
    // Start a trace for this operation
    const span = tracer.startSpan('participant_connected');
    
    try {
      console.log(`Participant ${participant.name} connected`);
      // Your actual logic here...
      
      span.setAttributes({
        'participant.id': participant.id,
        'participant.name': participant.name,
      });
      
    } catch (error) {
      span.recordException(error);
      throw error;
    } finally {
      span.end(); // Always end the span
    }
  }
}

This code creates a trace span for each participant connection. The span captures timing information, custom attributes (like participant ID), and any errors that occur. When you look at this trace in Grafana, you'll see exactly how long each participant connection took and what information was associated with it.

For Python Agents:

If you're using Python, the setup is similar. Install the packages:

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc

Create `tracing.py`:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Set up tracing to send to Tempo
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("livekit-agent")
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

This Python setup does the same thing as the Node.js version—it configures OpenTelemetry to send traces to your Tempo instance.

Then use it in your agent:

from tracing import tracer
class MyLiveKitAgent:
    async def on_participant_connected(self, participant):
        with tracer.start_as_current_span("participant_connected") as span:
            try:
                print(f"Participant {participant.name} connected")
                
                span.set_attributes({
                    "participant.id": participant.id,
                    "participant.name": participant.name,
                })
                
            except Exception as e:
                span.record_exception(e)
                raise

The Python version uses a `with` statement that automatically handles starting and ending the span. It's a bit cleaner than the JavaScript version.

Creating Your First Grafana Dashboard

Now for the fun part, actually seeing your data! Let's create a simple dashboard that shows key metrics for your LiveKit agent.


Open Grafana at http://localhost:3000 and follow these steps:

1. Create a new dashboard by clicking the "+" icon and selecting "Dashboard".

2. Add a panel to show active participants.

3. Configure the query to show your metrics.

Here's a simple panel configuration for showing participant count:

# Query for Prometheus
sum(livekit_participant_total)

This query asks Prometheus to sum up all the participant count metrics from your LiveKit agents. If you have multiple agents running, this gives you the total across all of them.

You can create additional panels for:

- Memory usage: process_resident_memory_bytes

- CPU usage: process_cpu_seconds_total

- Connection errors: rate(livekit_connection_errors_total[5m])

Each of these queries gives you insight into different aspects of your agent's health. Memory and CPU help you understand resource usage, while connection errors help you spot networking issues.

Setting Up Alerts

One of the best parts about having proper observability is getting notified when things go wrong, but only when they actually matter. Here are some alerts that will save you from middle-of-the-night debugging sessions:

Essential LiveKit Agent Alerts

1. Agent Down Alert

 up{job="livekit-agents"} == 0

Fires when Prometheus can no longer scrape your agent. The job label must match the job_name from prometheus.yml.

2. High Memory Usage

 (process_resident_memory_bytes / 1024^3) > 2

Alerts when your agent uses more than 2GB of memory; adjust the threshold based on your server.

3. Connection Error Rate

 rate(livekit_connection_errors_total[5m]) > 0.1

Fires when you're seeing more than 0.1 connection errors per second.

These alerts focus on the most critical issues that actually require immediate attention. You can add them in Grafana under Alerting → Alert Rules.
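Grafana's Alert Rules UI is the quickest path, but if you prefer alerts versioned alongside your configuration, the same expressions can live in a Prometheus rules file. A sketch, assuming the job name from the earlier prometheus.yml (reference the file via `rule_files` in prometheus.yml):

```yaml
# alerts.yml - referenced from prometheus.yml via:
#   rule_files:
#     - alerts.yml
groups:
  - name: livekit-agent-alerts
    rules:
      - alert: AgentDown
        expr: up{job="livekit-agents"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LiveKit agent has stopped reporting metrics"
      - alert: HighConnectionErrorRate
        expr: rate(livekit_connection_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 0.1 connection errors/sec over the last 5 minutes"
```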

Conclusion

Setting up observability for LiveKit agents can feel overwhelming at first. I felt the same way. But the first time you quickly identify and fix an issue because you have proper metrics, logs, and traces together, the value becomes very clear, and you'll wonder how you ever lived without it.

As you move toward production, a few practices are worth building in early:

  • Use managed services (Grafana Cloud, AWS CloudWatch, etc.) to reduce operational overhead
  • Implement proper security (authentication, TLS, network restrictions)
  • Set up data retention policies appropriate for your compliance needs
  • Consider costs: observability data can grow quickly with high-traffic applications

Remember, observability isn't a "set it and forget it" thing. It's an ongoing practice. Start with the basics we covered today, then gradually add more sophisticated monitoring as your needs grow.

Author: Saisaran D

I'm an AI/ML engineer specializing in generative AI and machine learning, developing innovative solutions with diffusion models and creating cutting-edge AI tools that drive technological advancement.
