Building Observable AI Agents: Langfuse + Braintrust in Practice

You've shipped your first AI agent. It works—mostly. Then production happens. A customer reports the agent "went off track." Another says it took 40 seconds to respond. Your CFO asks why the OpenAI bill jumped 300% last month.

Without observability, you're flying blind. You can't debug failures, optimize costs, or improve quality systematically. This is where Langfuse and Braintrust come in—two tools that give you visibility into what your AI agents are actually doing.

We run 49 production AI agent specialties at TechNova. Here's what we've learned about making them observable.

Building Observable AI Agents: Langfuse + Braintrust in Practice — illustration 1

Why Generic Monitoring Doesn't Cut It

You can't treat LLM calls like HTTP requests. Traditional APM tools (Datadog, New Relic) will tell you an API call took 3.2 seconds and cost $0.04. They won't tell you:

Which prompt template generated a hallucination
Why the agent called the same tool 6 times in a loop
Whether GPT-4 is actually better than GPT-3.5-turbo for this specific task
Which 5% of requests consume 80% of your token budget

LLM observability needs to capture:

Prompt templates and variables: What did you actually send?
Model parameters: Temperature, max_tokens, top_p settings
Chain traces: Step-by-step agent reasoning paths
Token usage: Input, output, and total counts per call
Latency breakdowns: Time spent on LLM calls vs. tool execution
Quality scores: Did the output match expectations?

Langfuse handles the first five. Braintrust specializes in the last one—plus experiment tracking.

Langfuse: The Trace-First Approach

Langfuse is open-source (MIT license) with a generous free tier. You can self-host or use their cloud. We run the cloud version for most clients—less ops overhead.

The core concept: traces. Every agent execution becomes a hierarchical trace showing:

# Agent trace structure
Agent Execution (root)
├─ Prompt Template Render
├─ LLM Call #1 (plan)
├─ Tool: query_database
│  ├─ SQL generation
│  └─ DB execution
├─ LLM Call #2 (synthesize)
└─ Response formatting

Integration is straightforward. For LangChain agents:

from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler(
    public_key="pk_...",
    secret_key="sk_..."
)

agent.run(
    "Find all pending invoices",
    callbacks=[langfuse_handler]
)

Now every LangChain call gets traced automatically. You see the full execution path in the Langfuse dashboard.

For custom agents (our preferred approach—more control than LangChain), you instrument manually:

from langfuse import Langfuse

lf = Langfuse()
trace = lf.trace(name="invoice_agent")

# Wrap each LLM call
span = trace.span(name="planning_llm_call")
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[...]
)
span.end(
    input=messages,
    output=response.choices[0].message.content,
    metadata={"model": "gpt-4", "temperature": 0.7},
    usage={
        "input": response.usage.prompt_tokens,
        "output": response.usage.completion_tokens
    }
)

What this gives you:

Cost tracking per trace, per user, per agent type
Latency P50/P95/P99 percentiles
Search and filter by user ID, session ID, or custom tags
Prompt version comparison (did v2 reduce hallucinations?)
Token usage trends over time

We caught a runaway agent last month—an email summarization agent that was calling GPT-4 with 15K token contexts when 2K would suffice. Langfuse showed us the outliers. We added context truncation and cut costs 70% for that agent.

Building Observable AI Agents: Langfuse + Braintrust in Practice — illustration 2

Braintrust: Evals and Experiments

Langfuse tells you what happened. Braintrust tells you how good it was.

Braintrust (YC-backed, also has a generous free tier) focuses on:

Dataset management: Golden test cases for your agent
Evaluation functions: Scoring outputs (factuality, relevance, tone)
Experiment tracking: Compare prompt versions, models, parameters

Here's the workflow:

Step 1: Build a dataset

Start with 20-50 real examples from production (Langfuse traces are perfect for this). Each example has:

Input (user query)
Expected output or scoring criteria
Optional metadata (user role, complexity level)

import braintrust

project = braintrust.init(project="customer_support_agent")

dataset = [
    {
        "input": "I want to cancel my subscription",
        "expected": {"intent": "cancel", "tone": "empathetic"},
        "tags": ["retention_critical"]
    },
    # ... more examples
]

Step 2: Define evaluators

Braintrust supports LLM-as-judge (GPT-4 scoring GPT-3.5 outputs) and custom Python functions:

from braintrust import Eval

def check_empathy(output, expected):
    # Simple keyword check (production uses LLM judge)
    empathy_words = ["understand", "sorry", "help"]
    score = sum(1 for word in empathy_words if word in output.lower())
    return score / len(empathy_words)

def check_intent(output, expected):
    # Call GPT-4 to verify intent extraction
    # Returns 0.0 to 1.0
    pass

Step 3: Run experiments

Test variations:

Prompt template A vs. B
GPT-4 vs. GPT-3.5-turbo
Temperature 0.3 vs. 0.7
Different system prompts

Eval(
    "customer_support_agent",
    data=dataset,
    task=lambda input: run_agent(input, model="gpt-4", temp=0.7),
    scores=[check_empathy, check_intent],
    metadata={"model": "gpt-4", "temp": 0.7}
)

Braintrust runs all examples, scores them, and gives you a leaderboard. You see which configuration wins on empathy, intent accuracy, and cost.

Real example: We built a legal document analyzer for one of our custom CRM clients. Initial version used GPT-4 with 2K token context. Evaluation showed:

GPT-4 (2K context): 87% accuracy, $0.12/doc
GPT-4 (4K context): 91% accuracy, $0.19/doc
GPT-3.5-turbo (4K context): 79% accuracy, $0.03/doc

Client cared more about accuracy than cost. We went with GPT-4 at 4K. Without evals, we'd have guessed.

The Combined Workflow

Here's how Langfuse + Braintrust work together:

Development: Build agent, instrument with Langfuse
Initial testing: Run 20-50 examples, review traces in Langfuse
Dataset creation: Export good/bad examples to Braintrust
Evaluation setup: Define scoring functions
Experimentation: Test variations in Braintrust
Production deployment: Keep Langfuse running
Continuous improvement: Weekly dataset updates from new Langfuse traces

For high-stakes agents (financial, medical, legal), we run nightly Braintrust evals against a frozen dataset. If scores drop below threshold, we get alerted before customers notice.

Practical Tradeoffs

Langfuse pros:

Open-source, self-hostable
Great for cost tracking
Prompt versioning built-in

Langfuse cons:

Evaluation features are basic
No built-in A/B testing

Braintrust pros:

Best-in-class experiment tracking
LLM-as-judge templates out of the box
Dataset versioning and collaboration

Braintrust cons:

Not open-source (though free tier is generous)
Lighter on production monitoring vs. evals

Do you need both? For anything beyond a prototype, yes. Langfuse for production visibility, Braintrust for quality assurance.

Getting Started

If you're building AI agents—whether as part of a custom software project or standalone—here's the minimal setup:

Add Langfuse to your first agent (30 minutes)
Let it run for a week
Export 20 interesting traces
Set up Braintrust with those examples
Write one evaluation function (even a simple keyword check)
Run your first experiment

You'll immediately see where your agent fails, what it costs, and how to make it better. The alternative is waiting for angry customer emails.

We've integrated both tools across our AI agent specialties—from appointment schedulers to contract analyzers. Happy to share more specific patterns if you're building something similar. Observability isn't optional anymore—it's table stakes for production AI.

Building Observable AI Agents: Langfuse + Braintrust in Practice

We run 49 production AI agent specialties at TechNova. Here's what we've learned about making them observable.

Building Observable AI Agents: Langfuse + Braintrust in Practice — illustration 1

Why Generic Monitoring Doesn't Cut It

You can't treat LLM calls like HTTP requests. Traditional APM tools (Datadog, New Relic) will tell you an API call took 3.2 seconds and cost $0.04. They won't tell you:

Which prompt template generated a hallucination
Why the agent called the same tool 6 times in a loop
Whether GPT-4 is actually better than GPT-3.5-turbo for this specific task
Which 5% of requests consume 80% of your token budget

LLM observability needs to capture:

Prompt templates and variables: What did you actually send?
Model parameters: Temperature, max_tokens, top_p settings
Chain traces: Step-by-step agent reasoning paths
Token usage: Input, output, and total counts per call
Latency breakdowns: Time spent on LLM calls vs. tool execution
Quality scores: Did the output match expectations?

Langfuse handles the first five. Braintrust specializes in the last one—plus experiment tracking.

Langfuse: The Trace-First Approach

Langfuse is open-source (MIT license) with a generous free tier. You can self-host or use their cloud. We run the cloud version for most clients—less ops overhead.

The core concept: traces. Every agent execution becomes a hierarchical trace showing:

# Agent trace structure
Agent Execution (root)
├─ Prompt Template Render
├─ LLM Call #1 (plan)
├─ Tool: query_database
│  ├─ SQL generation
│  └─ DB execution
├─ LLM Call #2 (synthesize)
└─ Response formatting

Integration is straightforward. For LangChain agents:

from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler(
    public_key="pk_...",
    secret_key="sk_..."
)

agent.run(
    "Find all pending invoices",
    callbacks=[langfuse_handler]
)

Now every LangChain call gets traced automatically. You see the full execution path in the Langfuse dashboard.

For custom agents (our preferred approach—more control than LangChain), you instrument manually:

from langfuse import Langfuse

lf = Langfuse()
trace = lf.trace(name="invoice_agent")

# Wrap each LLM call
span = trace.span(name="planning_llm_call")
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[...]
)
span.end(
    input=messages,
    output=response.choices[0].message.content,
    metadata={"model": "gpt-4", "temperature": 0.7},
    usage={
        "input": response.usage.prompt_tokens,
        "output": response.usage.completion_tokens
    }
)

What this gives you:

Cost tracking per trace, per user, per agent type
Latency P50/P95/P99 percentiles
Search and filter by user ID, session ID, or custom tags
Prompt version comparison (did v2 reduce hallucinations?)
Token usage trends over time

Building Observable AI Agents: Langfuse + Braintrust in Practice — illustration 2

Braintrust: Evals and Experiments

Langfuse tells you what happened. Braintrust tells you how good it was.

Braintrust (YC-backed, also has a generous free tier) focuses on:

Dataset management: Golden test cases for your agent
Evaluation functions: Scoring outputs (factuality, relevance, tone)
Experiment tracking: Compare prompt versions, models, parameters

Here's the workflow:

Step 1: Build a dataset

Start with 20-50 real examples from production (Langfuse traces are perfect for this). Each example has:

Input (user query)
Expected output or scoring criteria
Optional metadata (user role, complexity level)

import braintrust

project = braintrust.init(project="customer_support_agent")

dataset = [
    {
        "input": "I want to cancel my subscription",
        "expected": {"intent": "cancel", "tone": "empathetic"},
        "tags": ["retention_critical"]
    },
    # ... more examples
]

Step 2: Define evaluators

Braintrust supports LLM-as-judge (GPT-4 scoring GPT-3.5 outputs) and custom Python functions:

from braintrust import Eval

def check_empathy(output, expected):
    # Simple keyword check (production uses LLM judge)
    empathy_words = ["understand", "sorry", "help"]
    score = sum(1 for word in empathy_words if word in output.lower())
    return score / len(empathy_words)

def check_intent(output, expected):
    # Call GPT-4 to verify intent extraction
    # Returns 0.0 to 1.0
    pass

Step 3: Run experiments

Test variations:

Prompt template A vs. B
GPT-4 vs. GPT-3.5-turbo
Temperature 0.3 vs. 0.7
Different system prompts

Eval(
    "customer_support_agent",
    data=dataset,
    task=lambda input: run_agent(input, model="gpt-4", temp=0.7),
    scores=[check_empathy, check_intent],
    metadata={"model": "gpt-4", "temp": 0.7}
)

Braintrust runs all examples, scores them, and gives you a leaderboard. You see which configuration wins on empathy, intent accuracy, and cost.

Real example: We built a legal document analyzer for one of our custom CRM clients. Initial version used GPT-4 with 2K token context. Evaluation showed:

GPT-4 (2K context): 87% accuracy, $0.12/doc
GPT-4 (4K context): 91% accuracy, $0.19/doc
GPT-3.5-turbo (4K context): 79% accuracy, $0.03/doc

Client cared more about accuracy than cost. We went with GPT-4 at 4K. Without evals, we'd have guessed.

The Combined Workflow

Here's how Langfuse + Braintrust work together:

Development: Build agent, instrument with Langfuse
Initial testing: Run 20-50 examples, review traces in Langfuse
Dataset creation: Export good/bad examples to Braintrust
Evaluation setup: Define scoring functions
Experimentation: Test variations in Braintrust
Production deployment: Keep Langfuse running
Continuous improvement: Weekly dataset updates from new Langfuse traces

For high-stakes agents (financial, medical, legal), we run nightly Braintrust evals against a frozen dataset. If scores drop below threshold, we get alerted before customers notice.

Practical Tradeoffs

Langfuse pros:

Open-source, self-hostable
Great for cost tracking
Prompt versioning built-in

Langfuse cons:

Evaluation features are basic
No built-in A/B testing

Braintrust pros:

Best-in-class experiment tracking
LLM-as-judge templates out of the box
Dataset versioning and collaboration

Braintrust cons:

Not open-source (though free tier is generous)
Lighter on production monitoring vs. evals

Do you need both? For anything beyond a prototype, yes. Langfuse for production visibility, Braintrust for quality assurance.

Getting Started

If you're building AI agents—whether as part of a custom software project or standalone—here's the minimal setup:

Add Langfuse to your first agent (30 minutes)
Let it run for a week
Export 20 interesting traces
Set up Braintrust with those examples
Write one evaluation function (even a simple keyword check)
Run your first experiment

You'll immediately see where your agent fails, what it costs, and how to make it better. The alternative is waiting for angry customer emails.

Building Observable AI Agents: Langfuse + Braintrust in Practice

Building Observable AI Agents: Langfuse + Braintrust in Practice

Why Generic Monitoring Doesn't Cut It

Langfuse: The Trace-First Approach

Braintrust: Evals and Experiments

The Combined Workflow

Practical Tradeoffs

Getting Started

TechNova Team

More AI

The Eval Harness Every AI Feature Needs: Promptfoo + Langfuse

Customer Support Agents That Actually Deflect: 50+ Deployments Later

Stopping AI Hallucinations in Customer-Facing Bots: 7 Techniques That Work

Ready to ship the software your business actually runs on?

Building Observable AI Agents: Langfuse + Braintrust in Practice

Building Observable AI Agents: Langfuse + Braintrust in Practice

Why Generic Monitoring Doesn't Cut It

Langfuse: The Trace-First Approach

Braintrust: Evals and Experiments

The Combined Workflow

Practical Tradeoffs

Getting Started

TechNova Team

More AI

The Eval Harness Every AI Feature Needs: Promptfoo + Langfuse

Customer Support Agents That Actually Deflect: 50+ Deployments Later

Stopping AI Hallucinations in Customer-Facing Bots: 7 Techniques That Work