Building Observable AI Agents: Langfuse + Braintrust in Practice
You've shipped your first AI agent. It works—mostly. Then production happens. A customer reports the agent "went off track." Another says it took 40 seconds to respond. Your CFO asks why the OpenAI bill jumped 300% last month.
Without observability, you're flying blind. You can't debug failures, optimize costs, or improve quality systematically. This is where Langfuse and Braintrust come in—two tools that give you visibility into what your AI agents are actually doing.
We run 49 production AI agent specialties at TechNova. Here's what we've learned about making them observable.
Why Generic Monitoring Doesn't Cut It
You can't treat LLM calls like HTTP requests. Traditional APM tools (Datadog, New Relic) will tell you an API call took 3.2 seconds and cost $0.04. They won't tell you:
- Which prompt template generated a hallucination
- Why the agent called the same tool 6 times in a loop
- Whether GPT-4 is actually better than GPT-3.5-turbo for this specific task
- Which 5% of requests consume 80% of your token budget
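The last question in that list becomes simple arithmetic once per-request token counts are logged. A minimal sketch, assuming you already have a plain list of token totals per request; the function name and the sample numbers are illustrative and do not come from either tool:

```python
def top_share(token_counts, top_fraction=0.05):
    """Return the fraction of all tokens consumed by the heaviest `top_fraction` of requests."""
    if not token_counts:
        return 0.0
    ranked = sorted(token_counts, reverse=True)
    # Always include at least one request so tiny samples still work.
    top_n = max(1, int(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


# Hypothetical sample: 19 small requests and one very large one.
sample = [1_000] * 19 + [81_000]
share = top_share(sample, top_fraction=0.05)
print(f"Top 5% of requests used {share:.0%} of tokens")
```

If a small slice of requests dominates the total, those are the traces worth opening first.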
LLM observability needs to capture:
- Prompt templates and variables: What did you actually send?
- Model parameters: Temperature, max_tokens, top_p settings
- Chain traces: Step-by-step agent reasoning paths
- Token usage: Input, output, and total counts per call
- Latency breakdowns: Time spent on LLM calls vs. tool execution
- Quality scores: Did the output match expectations?
Langfuse handles the first five. Braintrust specializes in the last one—plus experiment tracking.
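One way to picture those six items is as a single record per LLM call. The sketch below is only a mental model: every field name is hypothetical, and it is not the schema of Langfuse or Braintrust.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LLMCallRecord:
    # All field names are illustrative, not a Langfuse or Braintrust schema.
    prompt_template: str
    prompt_variables: dict
    model: str
    model_params: dict = field(default_factory=dict)
    parent_step: Optional[str] = None      # position in the chain trace
    input_tokens: int = 0
    output_tokens: int = 0
    llm_latency_s: float = 0.0
    tool_latency_s: float = 0.0
    quality_score: Optional[float] = None  # filled in later by an evaluator

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    def rendered_prompt(self) -> str:
        """What was actually sent: the template with its variables filled in."""
        return self.prompt_template.format(**self.prompt_variables)


record = LLMCallRecord(
    prompt_template="Summarize the invoice for {customer}.",
    prompt_variables={"customer": "Acme"},
    model="gpt-4",
    model_params={"temperature": 0.7, "max_tokens": 256, "top_p": 1.0},
    input_tokens=120,
    output_tokens=40,
)
```

Keeping the template and its variables separate from the rendered prompt is what makes it possible to compare prompt versions later.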
Langfuse: The Trace-First Approach
Langfuse is open-source (MIT license) with a generous free tier. You can self-host or use their cloud. We run the cloud version for most clients—less ops overhead.
The core concept: traces. Every agent execution becomes a hierarchical trace showing:
# Agent trace structure
Agent Execution (root)
├─ Prompt Template Render
├─ LLM Call #1 (plan)
├─ Tool: query_database
│  ├─ SQL generation
│  └─ DB execution
├─ LLM Call #2 (synthesize)
└─ Response formatting
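To make the hierarchy concrete, here is a tiny in-memory tracer that builds the same kind of tree with nested context managers. It is a stand-in for illustration only and does not use the Langfuse SDK; the `Span`, `Tracer`, and `render` names are made up for this sketch.

```python
import time
from contextlib import contextmanager


class Span:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.duration_s = None


class Tracer:
    """Tiny in-memory tracer; a stand-in for illustration, not the Langfuse SDK."""

    def __init__(self, root_name):
        self.root = Span(root_name)
        self._stack = [self.root]

    @contextmanager
    def span(self, name):
        node = Span(name)
        self._stack[-1].children.append(node)  # attach to the currently open span
        self._stack.append(node)
        start = time.perf_counter()
        try:
            yield node
        finally:
            node.duration_s = time.perf_counter() - start
            self._stack.pop()


def render(span, depth=0):
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer("Agent Execution")
with tracer.span("LLM Call #1 (plan)"):
    pass
with tracer.span("Tool: query_database"):
    with tracer.span("SQL generation"):
        pass
    with tracer.span("DB execution"):
        pass
with tracer.span("LLM Call #2 (synthesize)"):
    pass

print("\n".join(render(tracer.root)))
```

Because every span records its own duration, the same tree also yields the latency breakdown between LLM calls and tool execution.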
Integration is straightforward. For LangChain agents:
from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler(
    public_key="pk_...",
    secret_key="sk_..."
)

agent.run(
    "Find all pending invoices",
    callbacks=[langfuse_handler]
)
Now every LangChain call gets traced automatically. You see the full execution path in the Langfuse dashboard.
For custom agents (our preferred approach—more control than LangChain), you instrument manually:
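import openai
from langfuse import Langfuse

# Uses the v2-style low-level SDK calls (lf.trace / trace.generation).
lf = Langfuse()  # reads LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY from the environment
trace = lf.trace(name="invoice_agent")

messages = [...]  # the chat messages for this step

# Wrap each LLM call in a generation, the observation type that carries
# model, parameters, and token usage (plain spans do not track usage).
generation = trace.generation(
    name="planning_llm_call",
    model="gpt-4",
    model_parameters={"temperature": 0.7},
    input=messages,
)
response = openai.chat.completions.create(
    model="gpt-4",
    messages=messages,
    temperature=0.7,
)
generation.end(
    output=response.choices[0].message.content,
    usage={
        "input": response.usage.prompt_tokens,
        "output": response.usage.completion_tokens,
    },
)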
What this gives you:
- Cost tracking per trace, per user, per agent type
- Latency percentiles (P50/P95/P99)
- Search and filter by user ID, session ID, or custom tags
- Prompt version comparison (did v2 reduce hallucinations?)
- Token usage trends over time
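The first two items in that list are straightforward rollups over logged calls. A rough sketch of the arithmetic, assuming a list of call records with token counts and latencies; the prices are placeholders rather than current rates, and the record layout is invented for this example:

```python
import math
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's current rates.
PRICE_PER_1K = {"gpt-4": {"input": 0.03, "output": 0.06}}


def call_cost(model, input_tokens, output_tokens):
    price = PRICE_PER_1K[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative call log; in practice these rows would come from your traces.
calls = [
    {"agent": "invoice", "model": "gpt-4", "input": 1000, "output": 500, "latency_s": 1.2},
    {"agent": "invoice", "model": "gpt-4", "input": 2000, "output": 500, "latency_s": 3.4},
    {"agent": "email", "model": "gpt-4", "input": 15000, "output": 1000, "latency_s": 9.8},
]

cost_by_agent = defaultdict(float)
for c in calls:
    cost_by_agent[c["agent"]] += call_cost(c["model"], c["input"], c["output"])

latencies = [c["latency_s"] for c in calls]
p50, p95 = percentile(latencies, 50), percentile(latencies, 95)
```

Grouping by user ID or session ID instead of agent type is the same loop with a different key.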
We caught a runaway agent last month—an email summarization agent that was calling GPT-4 with 15K token contexts when 2K would suffice. Langfuse showed us the outliers. We added context truncation and cut costs 70% for that agent.
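Context truncation can be done in several ways, and the details depend on the agent. One possible strategy, sketched under the assumption that the newest content matters most and that a rough character-based token estimate is good enough for budgeting (both helper names are hypothetical):

```python
def approx_tokens(text):
    # Rough heuristic (about 4 characters per token); use a real tokenizer for billing-grade counts.
    return max(1, len(text) // 4)


def truncate_context(chunks, budget_tokens):
    """Keep the most recent chunks that fit within budget_tokens, preserving order."""
    kept = []
    used = 0
    for chunk in reversed(chunks):  # walk newest to oldest
        cost = approx_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return list(reversed(kept))


emails = ["a" * 4000, "b" * 4000, "c" * 2000]  # roughly 1000, 1000, and 500 tokens
trimmed = truncate_context(emails, budget_tokens=2000)
```

Here the oldest chunk is dropped because adding it would exceed the 2,000-token budget. Summarizing older chunks instead of dropping them is a common alternative when early context still matters.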
Braintrust: Evals and Experiments
Langfuse tells you what happened. Braintrust tells you how good it was.
Braintrust (YC-backed, also has a generous free tier) focuses on:
- Dataset management: Golden test cases for your agent
- Evaluation functions: Scoring outputs (factuality, relevance, tone)
- Experiment tracking: Compare prompt versions, models, parameters
Here's the workflow:
Step 1: Build a dataset
Start with 20-50 real examples from production (Langfuse traces are perfect for this). Each example has:
- Input (user query)
- Expected output or scoring criteria
- Optional metadata (user role, complexity level)
import braintrust

project = braintrust.init(project="customer_support_agent")

# Golden test cases: plain dicts with input, expected, and tags
dataset = [
    {
        "input": "I want to cancel my subscription",
        "expected": {"intent": "cancel", "tone": "empathetic"},
        "tags": ["retention_critical"]
    },
    # ... more examples
]
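Turning production traces into dataset rows is mostly a reshaping step. The sketch below assumes traces have already been exported as dicts; the keys `id`, `user_query`, `reviewed_output`, and `user_role` are hypothetical and will differ from whatever your export actually contains.

```python
def traces_to_dataset(traces):
    """Convert exported trace dicts into eval rows, keeping only human-reviewed traces."""
    rows = []
    for t in traces:
        if "reviewed_output" not in t:
            continue  # no agreed-upon expected value yet, so it cannot be scored
        rows.append({
            "input": t["user_query"],
            "expected": t["reviewed_output"],
            "metadata": {"trace_id": t["id"], "user_role": t.get("user_role")},
        })
    return rows


# Hypothetical export: one reviewed trace and one that nobody has looked at yet.
exported = [
    {
        "id": "t1",
        "user_query": "I want to cancel my subscription",
        "reviewed_output": {"intent": "cancel", "tone": "empathetic"},
        "user_role": "customer",
    },
    {"id": "t2", "user_query": "Where is my invoice?"},
]
rows = traces_to_dataset(exported)
```

Storing the trace ID in the metadata keeps a path back from a failing eval row to the original production trace.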
Step 2: Define evaluators
Braintrust supports LLM-as-judge (GPT-4 scoring GPT-3.5 outputs) and custom Python functions:
from braintrust import Eval

def check_empathy(output, expected):
    # Simple keyword check (production uses LLM judge)
    empathy_words = ["understand", "sorry", "help"]
    score = sum(1 for word in empathy_words if word in output.lower())
    return score / len(empathy_words)

def check_intent(output, expected):
    # Call GPT-4 to verify intent extraction
    # Returns 0.0 to 1.0
    pass
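A plain substring check like the one above also counts "unhelpful" as a hit for "help". A slightly stricter variant, sketched here with a hypothetical function name, matches whole words only:

```python
import re

EMPATHY_WORDS = ("understand", "sorry", "help")


def check_empathy_words(output, expected=None):
    """Fraction of empathy keywords that appear as whole words (case-insensitive)."""
    tokens = set(re.findall(r"[a-z']+", output.lower()))
    hits = sum(1 for word in EMPATHY_WORDS if word in tokens)
    return hits / len(EMPATHY_WORDS)


good = check_empathy_words("I'm sorry, I understand. Let me help.")
bad = check_empathy_words("That is unhelpful.")
```

The tradeoff is that inflected forms such as "helping" or "understanding" no longer match, which is one reason an LLM judge tends to be a better fit for tone than any keyword list.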
Step 3: Run experiments
Test variations:
- Prompt template A vs. B
- GPT-4 vs. GPT-3.5-turbo
- Temperature 0.3 vs. 0.7
- Different system prompts
# run_agent is your own agent entry point (not defined here)
Eval(
    "customer_support_agent",
    data=dataset,
    task=lambda input: run_agent(input, model="gpt-4", temp=0.7),
    scores=[check_empathy, check_intent],
    metadata={"model": "gpt-4", "temp": 0.7}
)
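Braintrust runs every example through the task, scores the outputs, and logs the run as an experiment. Run Eval once per configuration, then compare the experiments side by side to see which configuration wins on empathy, intent accuracy, and cost.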
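Conceptually, comparing configurations is a loop over configs, examples, and scorers followed by a sort. The pure-Python sketch below shows only that logic; it is not Braintrust's API, and the fake agent and its behavior are invented so the example runs without any model calls.

```python
def run_experiment(task, dataset, scorers):
    """Run task on every example and return the mean score per scorer."""
    totals = {s.__name__: 0.0 for s in scorers}
    for example in dataset:
        output = task(example["input"])
        for s in scorers:
            totals[s.__name__] += s(output, example["expected"])
    return {name: total / len(dataset) for name, total in totals.items()}


def leaderboard(configs, make_task, dataset, scorers, rank_by):
    """Score every configuration and return results sorted best-first on one metric."""
    results = [
        {"config": cfg, "scores": run_experiment(make_task(cfg), dataset, scorers)}
        for cfg in configs
    ]
    return sorted(results, key=lambda r: r["scores"][rank_by], reverse=True)


# Stubs for illustration: a fake agent whose tone depends on temperature, and a trivial scorer.
def fake_agent(cfg):
    polite = cfg["temp"] <= 0.3
    return lambda text: ("Sorry to hear that. " if polite else "") + "Request noted."


def mentions_sorry(output, expected):
    return 1.0 if "sorry" in output.lower() else 0.0


data = [{"input": "I want to cancel my subscription", "expected": None}]
board = leaderboard(
    configs=[{"model": "gpt-4", "temp": 0.7}, {"model": "gpt-4", "temp": 0.3}],
    make_task=fake_agent,
    dataset=data,
    scorers=[mentions_sorry],
    rank_by="mentions_sorry",
)
```

In a real setup, each configuration's task would call the actual agent, and the hosted tool would handle storage, diffing, and per-example drill-down.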
Real example: We built a legal document analyzer for one of our custom CRM clients. Initial version used GPT-4 with 2K token context. Evaluation showed:
- GPT-4 (2K context): 87% accuracy, $0.12/doc
- GPT-4 (4K context): 91% accuracy, $0.19/doc
- GPT-3.5-turbo (4K context): 79% accuracy, $0.03/doc
Client cared more about accuracy than cost. We went with GPT-4 at 4K. Without evals, we'd have guessed.
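The tradeoff in those three results can be expressed as the extra cost per accuracy point gained. A small worked calculation using the figures above; the dictionary keys are just labels chosen for this sketch:

```python
# Figures from the evaluation above: (accuracy in %, cost in USD per document).
results = {
    "gpt-4-2k": (87, 0.12),
    "gpt-4-4k": (91, 0.19),
    "gpt-3.5-turbo-4k": (79, 0.03),
}


def cost_per_point(baseline, candidate):
    """Extra dollars per document paid for each accuracy point gained over baseline."""
    base_acc, base_cost = results[baseline]
    cand_acc, cand_cost = results[candidate]
    gained = cand_acc - base_acc
    if gained <= 0:
        raise ValueError("candidate is not more accurate than baseline")
    return (cand_cost - base_cost) / gained


step_up_context = cost_per_point("gpt-4-2k", "gpt-4-4k")        # (0.19 - 0.12) / 4
step_up_model = cost_per_point("gpt-3.5-turbo-4k", "gpt-4-4k")  # (0.19 - 0.03) / 12
```

Doubling the context costs about $0.0175 per document for each point gained, and moving from GPT-3.5-turbo to GPT-4 at 4K costs about $0.013 per point. Whether either is worth paying depends on what an error costs the client.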
The Combined Workflow
Here's how Langfuse + Braintrust work together:
- Development: Build agent, instrument with Langfuse
- Initial testing: Run 20-50 examples, review traces in Langfuse
- Dataset creation: Export good/bad examples to Braintrust
- Evaluation setup: Define scoring functions
- Experimentation: Test variations in Braintrust
- Production deployment: Keep Langfuse running
- Continuous improvement: Weekly dataset updates from new Langfuse traces
For high-stakes agents (financial, medical, legal), we run nightly Braintrust evals against a frozen dataset. If scores drop below threshold, we get alerted before customers notice.
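The alerting part of a nightly run can be as small as a threshold comparison. A minimal sketch, assuming the eval produces a dict of mean scores per metric; the metric names, thresholds, and scores below are all hypothetical:

```python
def failing_metrics(scores, thresholds):
    """Return {metric: (score, threshold)} for every metric below its threshold."""
    return {
        name: (scores.get(name, 0.0), minimum)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }


# Hypothetical thresholds and a hypothetical nightly result.
thresholds = {"intent_accuracy": 0.90, "empathy": 0.70}
nightly = {"intent_accuracy": 0.86, "empathy": 0.75}

regressions = failing_metrics(nightly, thresholds)
if regressions:
    # Replace print with your alerting channel (pager, chat webhook, email).
    print(f"Eval regression: {regressions}")
```

A metric that is missing from the nightly scores is treated as 0.0 and therefore flagged, so a scorer that silently stops running also triggers an alert.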
Practical Tradeoffs
Langfuse pros:
- Open-source, self-hostable
- Great for cost tracking
- Prompt versioning built-in
Langfuse cons:
- Evaluation features are basic
- No built-in A/B testing
Braintrust pros:
- Best-in-class experiment tracking
- LLM-as-judge templates out of the box
- Dataset versioning and collaboration
Braintrust cons:
- Not open-source (though free tier is generous)
- Lighter on production monitoring vs. evals
Do you need both? For anything beyond a prototype, yes. Langfuse for production visibility, Braintrust for quality assurance.
Getting Started
If you're building AI agents—whether as part of a custom software project or standalone—here's the minimal setup:
- Add Langfuse to your first agent (30 minutes)
- Let it run for a week
- Export 20 interesting traces
- Set up Braintrust with those examples
- Write one evaluation function (even a simple keyword check)
- Run your first experiment
You'll immediately see where your agent fails, what it costs, and how to make it better. The alternative is waiting for angry customer emails.
We've integrated both tools across our AI agent specialties—from appointment schedulers to contract analyzers. Happy to share more specific patterns if you're building something similar. Observability isn't optional anymore—it's table stakes for production AI.