
The Complete Guide to Observability for LiveKit Agents

Written by Saisaran D
Sep 3, 2025
8 Min Read

Why do LiveKit agents sometimes fail without warning, leaving you unsure of what went wrong? If you’ve dealt with sudden disconnections, poor audio, or unresponsive agents in production, you know how frustrating it is when logs only show “Agent disconnected” without context.

Real-time communication apps like LiveKit are much harder to monitor than standard web apps. A half-second delay that’s fine for a webpage can ruin a video call. With constant state changes, multiple failure points, and complex debugging across servers, networks, and devices, the need for observability becomes critical.

Yet most teams still struggle to get it right. According to the 2024 Logz.io Observability Pulse Report, only 10% of organizations say they have full observability across their systems. Even in the wider observability market, some areas show slower adoption, like unified analysis of Kubernetes infrastructure (27%), combining security data with telemetry (23%), and pipeline analytics (18%).

This article will show you how to close that gap by building a complete observability stack for LiveKit agents with Prometheus, Loki, Tempo, Grafana, and OpenTelemetry. By the end, you’ll know how to monitor metrics, trace failures, analyze logs, and build dashboards that give you clear insights.

LiveKit Observability Stack

Let’s break down the observability stack you’ll use for LiveKit. Think of these tools as your response team, with each one handling a different part of the job:

1. Prometheus (The Metrics Detective)

Prometheus continuously collects numerical data from your LiveKit agents, things like CPU usage, memory consumption, active participant counts, and connection success rates. It's like having a health monitor that takes your application's vital signs every few seconds. When something goes wrong, Prometheus tells you what changed and when it started happening.

2. Loki (The Log Librarian)

Loki gathers and organizes all the text logs from your services. While Prometheus might tell you that “CPU usage spiked at 2:15 AM,” Loki shows the exact error messages that caused it. It’s designed to search quickly through massive amounts of logs, which makes it a great fit for chatty LiveKit applications that generate thousands of entries every minute.

3. Tempo (The Story Teller)

Tempo tracks distributed traces, which are like detailed stories showing how a request moves through your system. For example, when a participant joins a room, Tempo can map every step: authentication, room setup, media negotiation, and connection establishment. It doesn’t just tell you what failed; it shows you where in the process things went wrong.

4. Grafana (The Visual Narrator)

Grafana is the dashboard that brings all your observability data together in one place. It takes raw inputs from Prometheus, Loki, and Tempo and turns them into clear charts, graphs, and alerts that are easy to understand. Think of it as your mission control center, where you can see everything happening across your LiveKit agents in real time.

5. OpenTelemetry (OTEL) (The Data Collector)

OpenTelemetry is the layer that adds tracing to your LiveKit agents and sends that data to Tempo. Think of it as placing sensors across your code that record what’s happening and how long each step takes. The best part is that it’s a standard; once you set it up, it works with Tempo, Grafana, or any other observability backend.

Together, these tools form more than just a stack; they work as a unified system. Prometheus gives you metrics, Loki provides logs, Tempo traces the journey, Grafana pulls it all into one view, and OpenTelemetry ties everything together. On their own, each tool is powerful. But combined, they create a complete observability layer that’s especially effective for real-time systems like LiveKit.

Why Is This Combination Perfect for LiveKit?

Traditional monitoring setups often fall short with real-time applications. Here's why this stack is particularly well-suited for LiveKit agents:

- Low overhead: These tools are designed to monitor high-throughput systems without impacting performance

- Real-time capabilities: Dashboards update within seconds, critical for debugging live issues  

- Distributed tracing: Essential for understanding complex participant flows and media negotiations

- Cost-effective: All open-source tools that scale well without licensing costs

- Industry standard: Skills and knowledge transfer to other projects and teams


Prerequisites

Now that you know why this stack works so well for LiveKit, let’s look at what you’ll need before setting it up.

Before we start, make sure you have:

- Docker and Docker Compose installed

- A LiveKit agent application (Node.js, Python, or Go)

- Basic understanding of containers and networking

Getting Started: Setting Up Your LiveKit Observability Stack

The beauty of modern observability stacks is that you can get everything running with Docker Compose in just a few minutes. Let's create a simple setup that gets all our tools talking to each other.


Step 1: Basic Docker Compose Setup

First, create a `docker-compose.yml` file. This single file will bring up our entire monitoring stack:

version: '3.8'
services:
  # Prometheus - Collects metrics from your LiveKit agents
  prometheus:
    image: prom/prometheus:v2.40.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  # Loki - Stores and searches your application logs  
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
  # Tempo - Stores distributed traces from OTEL
  tempo:
    image: grafana/tempo:2.2.0
    ports:
      - "3200:3200"   # Main API
      - "4317:4317"   # Where your agent sends traces
  # Grafana - Your visual dashboard for everything
  grafana:
    image: grafana/grafana:10.1.0
    ports:
      - "3000:3000"   # Web interface
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

This Docker Compose file creates four containers that can all talk to each other. Each service exposes specific ports: Grafana runs on port 3000 (your main dashboard), Prometheus on 9090 (metrics), Loki on 3100 (logs), and Tempo on 3200 and 4317 (traces).

Step 2: Configure Prometheus to Find Your Agent

Prometheus needs to know where to find your LiveKit agent's metrics. Create a simple `prometheus.yml` file:

global:
  scrape_interval: 15s  # Check for new metrics every 15 seconds
scrape_configs:
  # Tell Prometheus to collect metrics from your LiveKit agent
  - job_name: 'livekit-agents'
    static_configs:
      - targets: ['host.docker.internal:8080']  # Your agent's metrics port
    scrape_interval: 5s  # Check more frequently for real-time apps

This configuration tells Prometheus to check your LiveKit agent every 5 seconds for new metrics. The `host.docker.internal:8080` address means "connect to port 8080 on the host machine"; on Linux you may need to add `extra_hosts: ["host.docker.internal:host-gateway"]` to the Prometheus service in your Compose file for that name to resolve. You'll also need to make sure your agent exposes a `/metrics` endpoint on this port.
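If your agent doesn't expose metrics yet, a minimal sketch using the prometheus_client library could look like the following. The metric names (`livekit_participant_total`, `livekit_connection_errors_total`) are illustrative choices made to match the dashboard queries later in this article, not something your agent emits automatically.

# metrics.py - a minimal sketch of a /metrics endpoint for a Python agent.
# Assumes the prometheus_client package; metric names are illustrative.
from prometheus_client import start_http_server, Gauge, Counter

# A gauge moves up and down with the number of connected participants
participant_count = Gauge(
    "livekit_participant_total",
    "Participants currently connected to this agent",
)
# A counter only ever increases; use rate() in PromQL to get errors per second
connection_errors = Counter(
    "livekit_connection_errors_total",
    "Total connection errors seen by this agent",
)

def start_metrics_server(port=8080):
    # Serves the /metrics endpoint that Prometheus scrapes
    start_http_server(port)

Call `start_metrics_server()` once when your agent boots, then call `participant_count.inc()` and `participant_count.dec()` as participants join and leave.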


Step 3: Start Everything Up and Connect the Pieces

Now let's get our monitoring stack running:

# Start all services
docker-compose up -d
# Check that everything is running
docker-compose ps

This command starts all four services in the background. Give it a minute or two for everything to fully start up.
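If you want to verify the stack programmatically rather than eyeballing `docker-compose ps`, a small Python sketch like this one polls each service's readiness endpoint (the paths shown are the standard ones these images expose; it assumes the `requests` package and the default ports from the compose file):

# check_stack.py - quick readiness check for the monitoring stack.
import requests

SERVICES = {
    "Prometheus": "http://localhost:9090/-/ready",
    "Loki": "http://localhost:3100/ready",
    "Tempo": "http://localhost:3200/ready",
    "Grafana": "http://localhost:3000/api/health",
}

for name, url in SERVICES.items():
    try:
        resp = requests.get(url, timeout=3)
        state = "OK" if resp.status_code == 200 else f"HTTP {resp.status_code}"
        print(f"{name}: {state}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")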

Once everything is running, you can access:

- Grafana: http://localhost:3000 (login: admin/admin)

- Prometheus: http://localhost:9090 (to see raw metrics)

The first thing you'll want to do is connect Grafana to your data sources. In Grafana:

1. Go to Configuration → Data Sources

2. Add Prometheus with URL: `http://prometheus:9090`

3. Add Loki with URL: `http://loki:3100`  

4. Add Tempo with URL: `http://tempo:3200`

These URLs use the service names from our Docker Compose file. Docker automatically creates a network where services can reach each other by name.
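If you prefer to script this step, Grafana's HTTP API can create the same data sources. Here's a rough sketch assuming the `requests` package and the default admin/admin login from the compose file; the data source URLs still use the Docker service names because Grafana resolves them from inside the Docker network:

# add_datasources.py - registers the three data sources via Grafana's HTTP API.
import requests

GRAFANA = "http://localhost:3000"
AUTH = ("admin", "admin")  # default credentials from docker-compose.yml

datasources = [
    {"name": "Prometheus", "type": "prometheus", "url": "http://prometheus:9090", "access": "proxy"},
    {"name": "Loki", "type": "loki", "url": "http://loki:3100", "access": "proxy"},
    {"name": "Tempo", "type": "tempo", "url": "http://tempo:3200", "access": "proxy"},
]

for ds in datasources:
    # POST /api/datasources creates a data source; 409 means it already exists
    resp = requests.post(f"{GRAFANA}/api/datasources", json=ds, auth=AUTH, timeout=5)
    print(ds["name"], resp.status_code)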

Adding Distributed Tracing to Your LiveKit Agent

Now comes the exciting part: actually getting trace data from your LiveKit agent into Tempo. Traces show you the journey of each request through your system, which is incredibly valuable for debugging real-time communication issues.

For Node.js/TypeScript Agents:

First, install the OpenTelemetry packages:

npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc

Create a simple tracing setup file called `tracing.js`:

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
// This tells OpenTelemetry to send traces to Tempo
const traceExporter = new OTLPTraceExporter({
  url: 'http://localhost:4317',
});
const sdk = new NodeSDK({
  traceExporter,
  serviceName: 'livekit-agent',
});
sdk.start();

This code sets up OpenTelemetry to automatically capture traces from your application and send them to Tempo on port 4317. The `serviceName` helps you identify traces from this specific agent in your dashboards.

Then, in your main agent file, import tracing FIRST (this is crucial):

// Import tracing BEFORE anything else
require('./tracing');
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('livekit-agent');
// Your existing LiveKit agent code...
class MyLiveKitAgent {
  async handleParticipantConnected(participant) {
    // Start a trace for this operation
    const span = tracer.startSpan('participant_connected');
    
    try {
      console.log(`Participant ${participant.name} connected`);
      // Your actual logic here...
      
      span.setAttributes({
        'participant.id': participant.id,
        'participant.name': participant.name,
      });
      
    } catch (error) {
      span.recordException(error);
      throw error;
    } finally {
      span.end(); // Always end the span
    }
  }
}

This code creates a trace span for each participant connection. The span captures timing information, custom attributes (like participant ID), and any errors that occur. When you look at this trace in Grafana, you'll see exactly how long each participant connection took and what information was associated with it.

For Python Agents:

If you're using Python, the setup is similar. Install the packages:

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc

Create `tracing.py`:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Set up tracing to send to Tempo; the service.name resource plays the same
# role as serviceName in the Node.js setup, so traces are labeled in Grafana
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "livekit-agent"}))
)
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
tracer = trace.get_tracer("livekit-agent")

This Python setup does the same thing as the Node.js version: it configures OpenTelemetry to send traces to your Tempo instance, with the `service.name` resource playing the same role as `serviceName` in the Node SDK.

Then use it in your agent:

from tracing import tracer
class MyLiveKitAgent:
    async def on_participant_connected(self, participant):
        with tracer.start_as_current_span("participant_connected") as span:
            try:
                print(f"Participant {participant.name} connected")
                
                span.set_attributes({
                    "participant.id": participant.id,
                    "participant.name": participant.name,
                })
                
            except Exception as e:
                span.record_exception(e)
                raise

The Python version uses a `with` statement that automatically handles starting and ending the span. It's a bit cleaner than the JavaScript version.
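Once the basic span is in place, you can break the flow into child spans so Tempo shows the individual steps (authentication, room setup, and so on) described earlier. Here is a rough sketch of the same handler extended that way; the `authenticate` and `prepare_room` helpers are hypothetical stand-ins for your own logic:

from tracing import tracer
class MyLiveKitAgent:
    async def on_participant_connected(self, participant):
        # Parent span covering the whole connection flow
        with tracer.start_as_current_span("participant_connected") as span:
            span.set_attribute("participant.id", participant.id)
            # Child spans appear as nested steps under the parent in Tempo
            with tracer.start_as_current_span("authenticate"):
                await self.authenticate(participant)  # hypothetical helper
            with tracer.start_as_current_span("room_setup"):
                await self.prepare_room(participant)  # hypothetical helper

When you open this trace in Grafana, the parent span expands into the two child steps, each with its own duration, which makes it easy to see which stage of the flow is slow.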

Creating Your First Grafana Dashboard

Now for the fun part: actually seeing your data! Let's create a simple dashboard that shows key metrics for your LiveKit agent.


Open Grafana at http://localhost:3000 and follow these steps:

1. Create a new dashboard by clicking the "+" icon and selecting "Dashboard".

2. Add a panel to show active participants.

3. Configure the query to show your metrics.

Here's a simple panel configuration for showing participant count:

# Query for Prometheus
sum(livekit_participant_total)

This query asks Prometheus to sum up all the participant count metrics from your LiveKit agents. If you have multiple agents running, this gives you the total across all of them.

You can create additional panels for:

- Memory usage: `process_resident_memory_bytes`

- CPU usage: `process_cpu_seconds_total`

- Connection errors: `rate(livekit_connection_errors_total[5m])`

Each of these queries gives you insight into different aspects of your agent's health. Memory and CPU help you understand resource usage, while connection errors help you spot networking issues.
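The connection-error query only works if something actually increments that counter. Building on the prometheus_client sketch from earlier, the error path of your agent might look roughly like this (the handler and connect call are illustrative placeholders for your own code):

from metrics import connection_errors  # the Counter from the earlier sketch

class MyLiveKitAgent:
    async def connect_to_room(self, room_name):
        try:
            await self.livekit_connect(room_name)  # hypothetical connect call
        except Exception:
            # Each increment feeds rate(livekit_connection_errors_total[5m])
            connection_errors.inc()
            raise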

Setting Up Alerts

One of the best parts about having proper observability is getting notified when things go wrong, but only when they actually matter. Here are some alerts that will save you from middle-of-the-night debugging sessions:

Essential LiveKit Agent Alerts

1. Agent Down Alert

up{job="livekit-agents"} == 0

Fires when Prometheus can no longer scrape your agent's metrics endpoint. The job label must match the `job_name` you set in `prometheus.yml`.

2. High Memory Usage

(process_resident_memory_bytes / 1024^3) > 2

  Alerts when your agent uses more than 2GB of memory—adjust based on your server.

3. Connection Error Rate

 rate(livekit_connection_errors_total[5m]) > 0.1

Fires when you're seeing more than 0.1 connection errors per second.

These alerts focus on the most critical issues that actually require immediate attention. You can add them in Grafana under Alerting → Alert Rules.
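Before wiring these expressions into Grafana, it can help to confirm they return data at all. Prometheus's HTTP query API makes that easy to script; here's a small sketch assuming the `requests` package and Prometheus on its default port:

# test_alert_queries.py - sanity-check alert expressions against Prometheus.
import requests

EXPRESSIONS = [
    'up{job="livekit-agents"} == 0',
    '(process_resident_memory_bytes / 1024^3) > 2',
    'rate(livekit_connection_errors_total[5m]) > 0.1',
]

for expr in EXPRESSIONS:
    resp = requests.get(
        "http://localhost:9090/api/v1/query", params={"query": expr}, timeout=5
    )
    results = resp.json()["data"]["result"]
    # An empty result means the condition is not currently firing (or the metric is missing)
    print(f"{expr!r}: {len(results)} series matched")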

Conclusion

This setup works great for development and small production deployments. For larger scale:

- Use managed services (Grafana Cloud, AWS CloudWatch, etc.) to reduce operational overhead
- Implement proper security (authentication, TLS, network restrictions)
- Set up data retention policies appropriate for your compliance needs
- Consider costs: observability data can grow quickly with high-traffic applications

Setting up observability for LiveKit agents might seem daunting at first, but I promise it's worth every minute you invest. The first time you're able to quickly identify and fix an issue because you have proper traces and metrics, you'll wonder how you ever lived without it.

Remember, observability isn't a "set it and forget it" thing. It's an ongoing practice. Start with the basics we covered today, then gradually add more sophisticated monitoring as your needs grow.

Saisaran D

I'm an AI/ML engineer specializing in generative AI and machine learning, developing innovative solutions with diffusion models and creating cutting-edge AI tools that drive technological advancement.
