LLM Observability with LangSmith — Part 1: Tracing Everything & Building Audit-Grade Callbacks

Your agent demoed perfectly. Then someone asked, “What exactly did the bot tell that customer on the 14th?” — and nobody could answer. This is the story of fixing that, end to end, with code.

*Disclaimer: All creatives in this article are AI-generated by the author.*

Meera ships the demo on a Friday. She’s a GenAI engineer at AcmeAI, a company that sells AI models and the hardware to run them. Her latest build is a customer-support agent: a LangGraph workflow that reads an incoming query, classifies it as *Technical*, *Billing*, or *General*, checks the customer’s sentiment, retrieves answers from a knowledge base, and replies. If the customer sounds furious, it skips the robot answer entirely and escalates to a human.

The demo lands. Leadership loves it. And then Sanjay, the head of risk, asks three questions that stop the launch cold:

1. “Can we replay any past customer interaction?” If a customer claims the bot promised them a free GPU, can we pull up exactly what happened?
2. “Do we have a tamper-evident audit log?” Every LLM call, every retrieval, every error — somewhere *we* control, not just a vendor dashboard?
3. “If somebody tweaks a prompt next quarter, will we catch the regression *before* it ships?” Or will we find out from angry customers?

The three questions every head of risk eventually asks.

Meera realizes something every LLM engineer eventually learns: building the agent is the easy half. Operating it is the hard half.

📚 This is Part 1 of a two-part series. In this part, we cover what observability and traceability actually mean (and why LLM apps break every assumption your monitoring stack was built on), what LangSmith is and everything it can do, zero-config tracing for a real LangGraph agent, tracing *any* Python function, and building a compliance-grade audit callback — that covers Sanjay’s questions #1 and #2.
Part 2 takes on question #3: eval datasets and CI regression gates, prompt versioning with the Hub, the same playbook across six industries, and an honest LangSmith-vs-Langfuse decision matrix plus the wider 2026 tool landscape.

Grab a coffee. Let’s go.

1. Observability and Traceability: What They Are, and Why You Can’t Skip Them

Let’s define the words before we sling the tools, because they get used loosely.

Observability is a property of a system, not a product you buy: it’s the degree to which you can understand what’s happening inside the system purely from what it emits — its logs, metrics, and traces. A highly observable system lets you ask questions you didn’t think of in advance (“why are answers about refunds suddenly 30% longer for German users?”) and get answers without shipping new code.

Traceability is narrower and sharper: the ability to follow one specific request end to end — every hop, every transformation, every sub-call, in order, with timing. If observability is the security-camera system for the whole building, a trace is the complete CCTV cut of one visitor’s walk through it.

Monitoring, for completeness, is the dashboards-and-alerts layer you bolt on top: predefined checks for failures you already anticipated. Monitoring catches known unknowns. Observability is what saves you when the failure is one nobody predicted — which, with LLMs, is most of them.

Classic web apps had this figured out: APM tools, structured logs, distributed tracing. So why do LLM apps need their own version of the discipline? Because they violate the core assumption all of that tooling was built on — that failures are loud.

[Figure: Why LLM apps need their own observability]

The left half of that picture is the world your current tooling grew up in. When a traditional app breaks, it breaks *theatrically*: an exception fires, the response is a 500, the error-rate graph spikes, someone gets paged, and the stack trace points at the crime scene.
The right half is the world you live in now. An LLM failure arrives wearing a 200 OK. It’s fluent, confident, grammatically lovely — and wrong. No exception, no spike, no page. Your logs swear everything is fine, and the first detector to fire is a customer, weeks later.

The three boxes along the bottom are the answer this series builds, piece by piece: traces so you can replay any request, evaluations so every change gets scored before it ships, and versioned prompts so every change has a diff and an undo button.

If that still sounds like nice-to-have engineering hygiene, walk through five very real scenarios.

Five ways LLM apps fail quietly

Scenario 1: The chatbot that invented a policy (real, and it went to a tribunal). In 2022, a passenger asked Air Canada’s website chatbot about bereavement fares. The bot confidently told him he could book a full-price ticket and claim the discount retroactively within 90 days — a policy that did not exist; the airline’s actual policy page said the opposite. He flew, applied for the refund, and was rejected. In February 2024, a British Columbia tribunal ordered Air Canada to pay CA$812.02, explicitly rejecting the airline’s argument that the chatbot was “a separate legal entity responsible for its own actions.” The ruling set the tone for every deployment since: your bot’s words are your company’s words. Now ask yourself — if a customer made that claim against your bot, could you produce the exact conversation, the documents the bot retrieved, and the prompt version that was live that day? Without tracing, the honest answer is no. You’d be litigating against a ghost.

Scenario 2: The silent regression. A teammate “improves” a routing prompt — adds one clarifying sentence. Refund questions quietly start routing to General, where the responder can’t see billing documents. No error. No alert. Accuracy degrades for three weeks until a pattern emerges in complaints.
The fix takes five minutes; finding it takes days — unless an eval suite had flagged it before merge. (This scenario is the entire plot of Part 2.)

Scenario 3: The invisible money leak. Every LLM call has a price tag, which makes cost a first-class observability metric in a way traditional apps never needed. One team discovered that 4% of conversations were consuming 40% of their token spend — users pasting entire PDFs into the chat. That’s invisible in a monthly bill (“the OpenAI line item went up”) and obvious in five minutes of trace telemetry grouped by conversation.

Scenario 4: Drift you didn’t deploy. Model providers update models. Sometimes behavior shifts subtly — formats change, refusal rates move, a model gets terser. Your code didn’t change, your prompts didn’t change, and yet Tuesday’s system is not Monday’s system. Without baseline evals re-run on a schedule, you discover drift the way you discover everything else: from users.

Scenario 5: The 3 a.m. incident. Something went wrong with a customer interaction and it’s escalating — legal is asking questions. Your observability SaaS is having an outage, or your compliance team was never allowed to ship data there in the first place. What do you hand the auditor? If the answer isn’t “our own append-only log, on our own disk, with every event timestamped and correlated,” you have a governance gap, not just a tooling gap.

Five scenarios, one conclusion: an LLM app without observability isn’t a product — it’s a liability with a chat interface. Traces answer “what happened?”, evaluations answer “is it still good?”, and versioned prompts answer “what changed, and can we undo it?”

Now let’s meet the tool Meera reaches for.

2. What Is LangSmith, Exactly?

LangSmith is the observability, evaluation, and prompt-engineering platform built by the LangChain team.
It entered closed beta in mid-2023, reached general availability in February 2024 alongside LangChain’s $25M Series A led by Sequoia Capital, and has since grown from “a debugger for LangChain” into what the company now positions as a full agent engineering platform.

A few facts worth knowing before you commit to it:

It’s framework-agnostic, despite the family name. The deepest magic (zero-config tracing) lights up with LangChain and LangGraph, but the `langsmith` SDK traces any Python or JavaScript code via a decorator — we’ll prove that in §4 — and the platform speaks OpenTelemetry in both directions: it can ingest OTel spans from any stack and export its traces into the observability pipeline your org already runs (Grafana, Datadog, Jaeger).

It’s closed-source SaaS at heart. There’s a free Developer tier (thousands of traces per month — enough for everything in this series), a per-seat Plus tier, and an Enterprise tier that unlocks self-hosting in your own VPC plus SSO/RBAC. There’s also an EU-region cloud for data-residency requirements. If open source is a hard requirement, that’s the Langfuse conversation — Part 2 has a full decision matrix.

It’s one product with four jobs. Most teams discover LangSmith as a tracing tool and only later realize the other three quadrants exist:

This part of the series lives in the first row of that table; Part 2 lives in the second and third. Let’s build.

3. Tracing: Two Environment Variables and You Can See Everything

Setup: You need two API keys: one for your LLM provider (OpenAI here, but anything works) and one free LangSmith key from smith.langchain.com.

    pip install "langchain>=1.0" "langchain-core>=1.0" "langchain-openai>=1.0" \
        "langgraph>=1.0" langchain-chroma "langsmith>=0.4" python-dotenv

Then, in Python:

    import os
    from dotenv import load_dotenv

    load_dotenv()  # expects OPENAI_API_KEY and LANGSMITH_API_KEY in .env

    # The entire tracing setup. Yes, really.
    os.environ["LANGSMITH_TRACING"] = "true"
    os.environ["LANGSMITH_PROJECT"] = "acmeai-support-router"
    # Optional: pin the endpoint (default is US; use eu.api.smith.langchain.com for EU residency)
    os.environ.setdefault("LANGSMITH_ENDPOINT", "https://api.smith.langchain.com")

That’s the magic trick, and it’s worth pausing on: you haven’t imported LangSmith anywhere. With these environment variables set, every LangChain and LangGraph operation from this point on — every model call, every retriever hit, every graph node — reports itself to your LangSmith project automatically. The tracer rides LangChain’s internal callback system and uploads in batches, off the hot path, so your latency doesn’t pay for it.

Now let’s build the thing worth observing.

The Knowledge Base

Meera’s agent answers from a small product knowledge base. In production this would be your real docs; here, twelve documents keep the story self-contained:

    from langchain_core.documents import Document

    knowledge_base = [
        # -- technical --
        {"text": "Our pre-trained models include vision (CLIP-style), speech (Whisper-style), and text (Llama-3 fine-tunes). They ship with example notebooks.",
         "metadata": {"category": "technical"}},
        {"text": "On-prem deployment is supported via the AcmeAI Edge appliance - Kubernetes-based, runs Llama 3 70B on 2x H100.",
         "metadata": {"category": "technical"}},
        {"text": "Hardware troubleshooting: if the GPU light blinks red, run acmectl diagnose --gpu; common cause is a loose NVLink bridge.",
         "metadata": {"category": "technical"}},
        {"text": "AcmeAI SDK supports Python 3.10+, Node 20+, and Java 17. The REST API is OpenAPI 3.1 compliant.",
         "metadata": {"category": "technical"}},
        # -- billing --
        {"text": "We accept Visa, Mastercard, Amex, ACH bank transfer, and wire. Crypto is not supported.",
         "metadata": {"category": "billing"}},
        {"text": "Invoices are emailed on the 1st of each month. To download past invoices, log in and visit Account → Billing → Invoices.",
         "metadata": {"category": "billing"}},
        {"text": "You can update your billing info under Account → Billing → Payment Methods. Changes take effect immediately.",
         "metadata": {"category": "billing"}},
        {"text": "Refunds are processed within 7 business days. We refund pro-rata on cancellation within 30 days of purchase.",
         "metadata": {"category": "billing"}},
        # -- general --
        {"text": "Our refund policy: full refund within 30 days, pro-rata thereafter. Contact billing@acmeai.example.",
         "metadata": {"category": "general"}},
        {"text": "Standard shipping is 3–5 business days within the US. International shipping is 7–14 business days; duties not included.",
         "metadata": {"category": "general"}},
        {"text": "Working hours: Mon–Fri 8am–8pm Eastern. Weekend support is available for Enterprise customers only.",
         "metadata": {"category": "general"}},
        {"text": "You can reach support at support@acmeai.example or +1–555-ACME-HELP. Average first response: under 4 hours.",
         "metadata": {"category": "general"}},
    ]

    docs = [Document(page_content=d["text"], metadata=d["metadata"]) for d in knowledge_base]

Embed it into a Chroma vector store with cosine similarity:

    from langchain_openai import OpenAIEmbeddings
    from langchain_chroma import Chroma

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    kbase_db = Chroma.from_documents(
        documents=docs,
        collection_name="knowledge_base",
        embedding=embeddings,
        collection_metadata={"hnsw:space": "cosine"},  # default is euclidean - be explicit
        persist_directory="./knowledge_base",
    )

    retriever = kbase_db.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": 3, "score_threshold": 0.2},
    )

The Agent’s Brain: State, Classification, Sentiment

LangGraph agents pass a typed state dictionary between nodes:

    from typing import TypedDict, Literal
    from pydantic import BaseModel
    from langchain_openai import ChatOpenAI


    class CustomerSupportState(TypedDict):
        customer_query: str
        query_category: str
        query_sentiment: str
        final_response: str


    class QueryCategory(BaseModel):
        categorized_topic: Literal["Technical", "Billing", "General"]


    class QuerySentiment(BaseModel):
        sentiment: Literal["Positive", "Neutral", "Negative"]


    llm = ChatOpenAI(model="gpt-5-mini")  # swap for any chat model you like

The two Pydantic models matter more than they look. Paired with `with_structured_output`, the LLM *cannot* reply “I think this is probably a billing question 😊” — it must return one of the three allowed labels. Routers need guarantees, not vibes.

    def categorize_inquiry(state: CustomerSupportState) -> CustomerSupportState:
        """Classify the query into Technical / Billing / General."""
        prompt = f"""Act as a customer support agent for an AI products and hardware company.
        Read the customer query and pick the best category: 'Technical', 'Billing', or 'General'.
        - Technical: AI models, hardware, software, SDK issues
        - Billing: payments, invoices, refunds, purchases
        - General: policies, contact info, shipping, everything else

        Query: {state["customer_query"]}
        """
        result = llm.with_structured_output(QueryCategory).invoke(prompt)
        return {"query_category": result.categorized_topic}


    def analyze_inquiry_sentiment(state: CustomerSupportState) -> CustomerSupportState:
        """Classify sentiment as Positive / Neutral / Negative."""
        prompt = f"""Act as a customer support agent. Read the customer query below and
        classify its sentiment as exactly one of: 'Positive', 'Neutral', or 'Negative'.

        Query: {state["customer_query"]}
        """
        result = llm.with_structured_output(QuerySentiment).invoke(prompt)
        return {"query_sentiment": result.sentiment}

Sanity-check the sentiment node before wiring anything — same question, two emotional registers:

    analyze_inquiry_sentiment({"customer_query": "what is your refund policy?"})
    # {'query_sentiment': 'Neutral'}

    analyze_inquiry_sentiment({"customer_query": "what is your refund policy? I am fed up with this product and want my money back"})
    # {'query_sentiment': 'Negative'}

Same topic, opposite routing destinies — the first will get a polite RAG answer about refund windows; the second is heading straight to a human. That’s the whole escalation design in two lines of output.

The RAG Responders — One Factory, Three Nodes

Each responder filters the vector store to *its own category* using a metadata filter — the billing node physically cannot retrieve technical docs:

    from langchain_core.prompts import ChatPromptTemplate

    RESPONSE_TEMPLATE = ChatPromptTemplate.from_template(
        """Craft a clear and helpful {category} support response for the customer query below.
        Ground your answer in the provided knowledge base information.
        If the knowledge base does not contain the answer, say exactly:
        "Apologies, I was not able to answer your question, please reach out to +1-555-ACME-HELP"

        Customer Query:
        {customer_query}

        Relevant Knowledge Base Information:
        {retrieved_content}
        """
    )


    def make_category_responder(category: str):
        """Build a RAG responder node scoped to one KB category via metadata filter."""
        def responder(state: CustomerSupportState) -> CustomerSupportState:
            # Note: this sets the filter on the shared retriever object. That is fine for
            # sequential runs like this demo, but it is not safe under concurrent requests.
            retriever.search_kwargs["filter"] = {"category": category}
            docs = retriever.invoke(state["customer_query"])
            retrieved = "\n\n".join(d.page_content for d in docs)
            chain = RESPONSE_TEMPLATE | llm
            reply = chain.invoke({
                "category": category,
                "customer_query": state["customer_query"],
                "retrieved_content": retrieved,
            }).content
            return {"final_response": reply}
        return responder


    generate_technical_response = make_category_responder("technical")
    generate_billing_response = make_category_responder("billing")
    generate_general_response = make_category_responder("general")


    def escalate_to_human_agent(state: CustomerSupportState) -> CustomerSupportState:
        """Negative sentiment? No robot. A human will call."""
        return {"final_response": "We're really sorry! Someone from our team will reach out to you shortly."}

Wiring the Graph

    from langgraph.graph import StateGraph, END
    from langgraph.checkpoint.memory import MemorySaver


    def determine_route(state: CustomerSupportState) -> str:
        if state["query_sentiment"] == "Negative":
            return "escalate_to_human_agent"
        elif state["query_category"] == "Technical":
            return "generate_technical_response"
        elif state["query_category"] == "Billing":
            return "generate_billing_response"
        return "generate_general_response"


    graph = StateGraph(CustomerSupportState)
    graph.add_node("categorize_inquiry", categorize_inquiry)
    graph.add_node("analyze_inquiry_sentiment", analyze_inquiry_sentiment)
    graph.add_node("generate_technical_response", generate_technical_response)
    graph.add_node("generate_billing_response", generate_billing_response)
    graph.add_node("generate_general_response", generate_general_response)
    graph.add_node("escalate_to_human_agent", escalate_to_human_agent)

    graph.set_entry_point("categorize_inquiry")
    graph.add_edge("categorize_inquiry", "analyze_inquiry_sentiment")
    graph.add_conditional_edges("analyze_inquiry_sentiment", determine_route, [
        "generate_technical_response", "generate_billing_response",
        "generate_general_response", "escalate_to_human_agent",
    ])
    for terminal in ["generate_technical_response", "generate_billing_response",
                     "generate_general_response", "escalate_to_human_agent"]:
        graph.add_edge(terminal, END)

    agent = graph.compile(checkpointer=MemorySaver())

If you’re in a notebook, LangGraph will draw itself — `agent.get_graph().draw_mermaid_png()` — and what it draws is this topology:

[Figure: The support router as a LangGraph]

The two indigo nodes at the top are LLM classifiers writing into the shared state; the diamond is plain Python reading that state — auditable logic, no model involved; the three teal terminals are the RAG responders, each fenced into its own slice of the knowledge base by that metadata filter; and the red terminal is the empathy hatch, where angry customers bypass the robot entirely. Keep this picture in mind, because in about thirty seconds every shape on it is going to reappear as a span in a trace tree.

The Payoff: Run It and Watch the Traces Appear

    def ask(query: str, session_id: str = "demo") -> str:
        final = None
        for event in agent.stream(
            {"customer_query": query},
            {"configurable": {"thread_id": session_id}},
            stream_mode="values",
        ):
            final = event
        return final["final_response"]


    print(ask("Do you support pre-trained vision models?"))    # → Technical path
    print(ask("How do I download my last invoice?"))           # → Billing path
    print(ask("Can you tell me about your shipping policy?"))  # → General path

The billing answer comes back grounded in exactly the documents we seeded:

    You can download past invoices by logging in and going to Account → Billing → Invoices.
    Invoices are also emailed on the 1st of each month, so check the inbox associated with
    your account. If you don't see an invoice you expected, contact support@acmeai.example
    and we'll resend it.

Pleasant enough. But the real payoff is on the other screen: open smith.langchain.com, click into the `acmeai-support-router` project, and three traces are waiting. Click the invoice one and you get the full waterfall — which looks like this:

[Figure: Anatomy of a LangSmith trace]

Time flows left to right; each bar is a run, LangSmith’s unit of work, and the indentation *is* the call hierarchy. Three things jump out the first time you see your own agent like this. First, the final generation call eats 1.25 of the total 2.95 seconds — so when someone says “the bot feels slow,” this chart settles the optimize-retrieval-or-optimize-generation argument in five seconds flat (it’s generation, and the two classifier calls in front of it are the next suspects). Second, that little amber sliver: 140 ms for the Chroma retrieval, and clicking it shows the exact three documents it returned — which is precisely the evidence you need when the bot confidently cites the wrong spec.
Third, quietly doing the bookkeeping: every bar carries token counts in and out (LangSmith turns those into cost per trace, per user, per day), plus a `run_id` and `parent_run_id` linking each run to its parent. File those two IDs away — they become important in §5.

And that’s Sanjay’s question #1 answered, with a click instead of a forensic project. When a customer claims “your bot told me the appliance supports water cooling” — or invents a bereavement-fare policy — support pulls the trace and reads exactly what the retriever returned and what the model said.

4. “But My Code Isn’t All LangChain”: Tracing Any Python Function

One decorator, and plain Python shows up in the same trace tree

A question Meera gets asked constantly, so let’s answer it head-on: yes, LangSmith traces arbitrary Python — no LangChain required. The `@traceable` decorator turns any function into a run, and nested decorated calls assemble into the same parent-child tree automatically.

Here’s a fully standalone example — raw OpenAI SDK, a fake database call, plain Python orchestration.
Not a LangChain import in sight:

    import os
    from langsmith import traceable
    from langsmith.wrappers import wrap_openai
    from openai import OpenAI

    os.environ["LANGSMITH_TRACING"] = "true"
    os.environ["LANGSMITH_PROJECT"] = "acmeai-standalone-demo"

    # wrap_openai instruments the raw OpenAI client: every .create() becomes an LLM run
    oai = wrap_openai(OpenAI())


    @traceable(name="crm_lookup", run_type="tool")
    def fetch_customer_tier(customer_id: str) -> str:
        # pretend this hits your CRM / database
        return "enterprise" if customer_id.startswith("ENT") else "standard"


    @traceable(name="ticket_summarizer")
    def summarize_ticket(ticket_text: str) -> str:
        response = oai.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user",
                       "content": f"Summarize this support ticket in one line: {ticket_text}"}],
        )
        return response.choices[0].message.content


    @traceable(name="handle_ticket", tags=["support", "v2"], metadata={"team": "acmeai-support"})
    def handle_ticket(customer_id: str, ticket_text: str) -> dict:
        tier = fetch_customer_tier(customer_id)   # child run #1 (tool)
        summary = summarize_ticket(ticket_text)   # child run #2 -> contains the LLM run
        return {"tier": tier, "summary": summary,
                "priority": "P1" if tier == "enterprise" else "P3"}


    handle_ticket("ENT-00451", "The Edge appliance reboots whenever we run the vision pipeline at batch size 64.")

Open the project and the tree reads exactly like the code: `handle_ticket` as the parent, `crm_lookup` and `ticket_summarizer` nested inside it, and the wrapped OpenAI call inside that — with token counts captured even though LangChain was never involved.

Three details worth knowing:

`run_type` controls how a run renders and aggregates (`tool`, `retriever`, `llm`, default `chain`) — tag your custom vector-store calls as `retriever` and they get retrieval-style display.

`tags` and `metadata` become filterable dimensions in the dashboard: `metadata={"tenant_id": …}` is how multi-tenant apps slice cost and quality per customer.
The same decorator exists in the JS/TS SDK, and if you’re already standardized on OpenTelemetry, you can skip the decorator entirely and ship OTel spans straight in.

This matters strategically: your observability isn’t welded to your framework choice. If you rip out LangChain next year, the tracing survives.

5. The Auditor’s Question: Custom Callbacks for a Tamper-Evident Log

Meera shows Sanjay the dashboard. He’s impressed — for about a minute. Then he leans in:

“This is their dashboard, on their servers. If their cloud is down during an incident, what do we show the regulator? And does customer data leave our network before we’ve scrubbed it?”

Fair. LangSmith is brilliant for debugging. But audit and compliance teams want guarantees a SaaS dashboard alone can’t give:

The primitive that solves this is LangChain’s `BaseCallbackHandler` — the same machinery LangSmith itself rides on: lifecycle hooks that fire synchronously, in your process, on every LLM start/end, tool start/end, and error. Subclass it, and you decide what gets persisted, where, and in what shape.

Meera writes hers to emit JSON Lines — one JSON object per line, append-only. It’s the dullest format in computing, and that’s the point: `grep` reads it, `jq` reads it, pandas reads it, Splunk ingests it.

    import json
    import re
    import time
    from datetime import datetime, timezone
    from pathlib import Path
    from typing import Any
    from uuid import UUID

    from langchain_core.callbacks import BaseCallbackHandler

    AUDIT_LOG_PATH = Path("./audit.jsonl")
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


    def redact(text: str) -> str:
        """Minimal demo redaction. In production, use a real PII engine
        (Microsoft Presidio, Amazon Comprehend) — emails alone won't cut it."""
        return EMAIL_RE.sub("[EMAIL_REDACTED]", text)


    class JsonLinesAuditHandler(BaseCallbackHandler):
        """Append-only audit log of every LLM call, tool call, and error.

        One JSON object per line — grep-, jq-, pandas-, and Splunk-friendly.
        Designed for environments where an external SaaS alone can't be the system of record.
        """

        def __init__(self, log_path: Path = AUDIT_LOG_PATH) -> None:
            self.log_path = log_path
            self._llm_starts: dict[UUID, float] = {}
            self._tool_starts: dict[UUID, float] = {}

        def _emit(self, event: dict[str, Any]) -> None:
            event["ts"] = datetime.now(timezone.utc).isoformat()
            with self.log_path.open("a", encoding="utf-8") as f:
                f.write(json.dumps(event, default=str) + "\n")

        def on_llm_start(self, serialized, prompts, *, run_id, parent_run_id=None, **kwargs):
            self._llm_starts[run_id] = time.perf_counter()
            self._emit({
                "event": "llm_start",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "model": (serialized or {}).get("id", ["unknown"])[-1],
                "prompt_chars": sum(len(redact(p)) for p in prompts),
            })

        def on_llm_end(self, response, *, run_id, parent_run_id=None, **kwargs):
            latency_ms = (time.perf_counter() - self._llm_starts.pop(run_id, time.perf_counter())) * 1000
            usage = (response.llm_output or {}).get("token_usage", {}) if response.llm_output else {}
            self._emit({
                "event": "llm_end",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "latency_ms": round(latency_ms, 1),
                "prompt_tokens": usage.get("prompt_tokens"),
                "completion_tokens": usage.get("completion_tokens"),
                "total_tokens": usage.get("total_tokens"),
            })

        def on_llm_error(self, error, *, run_id, parent_run_id=None, **kwargs):
            self._emit({
                "event": "llm_error",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "error_type": type(error).__name__,
                # Error text can echo user input, so it is redacted before truncation.
                "error_msg": redact(str(error))[:500],
            })

        def on_tool_start(self, serialized, input_str, *, run_id, parent_run_id=None, **kwargs):
            self._tool_starts[run_id] = time.perf_counter()
            self._emit({
                "event": "tool_start",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "tool": (serialized or {}).get("name"),
                "input_chars": len(input_str),
            })

        def on_tool_end(self, output, *, run_id, parent_run_id=None, **kwargs):
            latency_ms = (time.perf_counter() - self._tool_starts.pop(run_id, time.perf_counter())) * 1000
            self._emit({
                "event": "tool_end",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "latency_ms": round(latency_ms, 1),
                "output_chars": len(str(output)),
            })

        def on_chain_error(self, error, *, run_id, parent_run_id=None, **kwargs):
            self._emit({
                "event": "chain_error",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "error_type": type(error).__name__,
                # Error text can echo user input, so it is redacted before truncation.
                "error_msg": redact(str(error))[:500],
            })


    audit_handler = JsonLinesAuditHandler()

A few deliberate choices worth noticing:

`run_id` + `parent_run_id` on every event — the same UUIDs you met in the trace waterfall. Your forensics and your debugging share a correlation key.

Character counts, not raw text, by default. The log proves that a prompt of a certain size went to a certain model at a certain time, without making the audit file itself a PII liability. Where you do log content, it goes through `redact()` first.

Latency measured in-process with `perf_counter`, so the numbers hold up even if a trace upload is delayed or dropped.

It also helps to see when each of those overridden hooks actually fires. For one billing query through the router, the sequence and what each hook leaves behind in the file looks like this:

[Figure: Callback hook firing order for one query]

Notice the greyed rows: `on_chain_start` and `on_chain_end` fire too — for the graph and for every node — but our handler deliberately lets them pass; LLM and error events are the audit-worthy moments. The amber row is an invitation: if your compliance story needs retrieval evidence (“which documents informed this answer?”), `on_retriever_start` / `on_retriever_end` are sitting there waiting for the same treatment.
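The retriever hooks mentioned above could be filled in along these lines. This is a minimal sketch under stated assumptions, not part of the handler shown earlier: the class name `RetrievalAuditHooks` is hypothetical, it deliberately imports nothing from LangChain so that it runs on its own, and it assumes each retrieved document exposes `page_content` and `metadata` the way the `Document` objects built for the knowledge base do. The hook names and keyword arguments mirror `on_retriever_start` and `on_retriever_end` on `BaseCallbackHandler`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


class RetrievalAuditHooks:
    """Hypothetical sketch: retrieval evidence as JSON Lines audit events."""

    def __init__(self, log_path: Path = Path("./audit.jsonl")) -> None:
        self.log_path = log_path

    def _emit(self, event: dict[str, Any]) -> None:
        # Same shape as the handler's _emit: UTC timestamp, one JSON object per line.
        event["ts"] = datetime.now(timezone.utc).isoformat()
        with self.log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event, default=str) + "\n")

    def on_retriever_start(self, serialized, query, *, run_id, parent_run_id=None, **kwargs):
        # Log the size of the query, not its text, to keep PII out of the file.
        self._emit({
            "event": "retriever_start",
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id) if parent_run_id else None,
            "query_chars": len(query),
        })

    def on_retriever_end(self, documents, *, run_id, parent_run_id=None, **kwargs):
        # Record what came back: how many documents, their category tags, their sizes.
        self._emit({
            "event": "retriever_end",
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id) if parent_run_id else None,
            "doc_count": len(documents),
            "doc_categories": [(d.metadata or {}).get("category") for d in documents],
            "doc_chars": [len(d.page_content) for d in documents],
        })
```

In practice the two methods would more likely be pasted into `JsonLinesAuditHandler` itself, reusing its `_emit`, so retrieval events share the same file and the same `run_id` correlation key. Logging a stable document identifier alongside the category would make the evidence stronger, assuming the knowledge base assigns one.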
And the red strip at the bottom is the part auditors care about most — failures don’t vanish, they write a row with the error type, because the absence of a record is itself a finding in most audit frameworks.

Attaching the handler costs one config key — and LangSmith keeps tracing alongside it. They’re independent layers:

    queries = [
        ("audit-001", "Do you support pre-trained vision models?"),
        ("audit-002", "How do I download my last invoice?"),
        ("audit-003", "I am furious — your hardware bricked itself overnight, refund NOW."),
    ]

    for session_id, query in queries:
        events = agent.stream(
            {"customer_query": query},
            config={
                "configurable": {"thread_id": session_id},
                "callbacks": [audit_handler],  # ← audit log + LangSmith both fire
                "metadata": {"app": "support-router", "session_id": session_id},
            },
            stream_mode="values",
        )
        final = None
        for ev in events:
            final = ev
        print(f"{session_id}: category={final['query_category']} sentiment={final['query_sentiment']}")

Output:

    audit-001: category=Technical sentiment=Neutral
    audit-002: category=Billing sentiment=Neutral
    audit-003: category=Billing sentiment=Negative

Read that third line carefully — it’s the design working as intended. The router still classified the furious message as Billing (it is about a refund), but the Negative sentiment overrode the route and the customer got a human, not a robot.

Meanwhile, `audit.jsonl` quietly collected the paper trail.
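Before reading that trail back, one note on the word tamper-evident. An append-only file is a convention, and nothing in the handler above would reveal a line that was edited or deleted after the fact. One common way to close that gap is a hash chain, where each record stores the SHA-256 hash of the record before it. The following is a minimal sketch under stated assumptions, not part of the article's handler: `append_chained`, `verify_chain`, and `GENESIS_HASH` are hypothetical names, and events are assumed not to use the keys `hash` or `prev_hash` themselves.

```python
import hashlib
import json
from pathlib import Path

GENESIS_HASH = "0" * 64  # hypothetical starting value for the first record


def append_chained(log_path: Path, event: dict, prev_hash: str = GENESIS_HASH) -> str:
    """Append one event linked to prev_hash and return the new record's hash."""
    # Round-trip through JSON first so the hashed form matches what is re-read later.
    record = json.loads(json.dumps({**event, "prev_hash": prev_hash}, default=str))
    payload = json.dumps(record, sort_keys=True)
    record_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({**record, "hash": record_hash}, sort_keys=True) + "\n")
    return record_hash


def verify_chain(log_path: Path) -> bool:
    """Return True only if every record's hash and prev_hash link check out."""
    expected_prev = GENESIS_HASH
    for line in log_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        stored_hash = record.pop("hash", None)
        payload = json.dumps(record, sort_keys=True)
        recomputed = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # A changed field breaks the hash; a removed or reordered line breaks the link.
        if record.get("prev_hash") != expected_prev or recomputed != stored_hash:
            return False
        expected_prev = stored_hash
    return True
```

Wired into the handler, `_emit` would keep the last returned hash on the instance and pass it to the next call. A chain like this makes tampering detectable rather than impossible: someone able to rewrite the whole file can recompute every hash, so the most recent hash would still need to be recorded somewhere the log's writer cannot modify.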
Eight LLM calls happened across those three queries (three each for the two answered ones, two for the escalated one — its generation step never ran), and each produced a start and end event:

    import pandas as pd

    # read_text closes the file for us; one JSON object per non-empty line
    lines = AUDIT_LOG_PATH.read_text(encoding="utf-8").splitlines()
    events = [json.loads(line) for line in lines if line.strip()]
    df = pd.DataFrame(events)

    print("Total events:", len(events))
    print("Total LLM tokens:", int(df["total_tokens"].dropna().sum()))
    print("Avg LLM latency (ms):", round(df.loc[df["event"] == "llm_end", "latency_ms"].mean(), 1))

Output:

    Total events: 16
    Total LLM tokens: 1732
    Avg LLM latency (ms): 894.6

And this is what a single line of the file looks like — what your SOC team greps at 3 a.m. when an incident lands:

    {"event": "llm_end", "run_id": "9f2c...", "parent_run_id": "b41a...", "latency_ms": 842.3, "prompt_tokens": 187, "completion_tokens": 9, "total_tokens": 196, "ts": "2026-06-12T07:14:55.103+00:00"}

Two independent observability lanes

Picture the full event stream flowing down two lanes from the same source. Down the engineer’s lane: the LangSmith tracer batching events to the cloud, feeding the dashboards, the replay UI, and (in Part 2) the experiments. Down the auditor’s lane: our handler, running synchronously in-process — and that placement is the entire compliance argument, because `redact()` runs before any byte leaves the machine — then the append-only file, then the SIEM with its retention policy.

The two lanes share nothing but the callback events and those `run_id`s. LangSmith down? The auditor’s record is intact. Disk hiccup? LangSmith still has the traces. That’s Sanjay’s question #2 answered, and it earns the rule Meera writes on the team wiki:

Run both, always. LangSmith for the humans debugging at their desks. The custom callback for the auditor’s chain-of-custody. They’re complementary layers, not alternatives.

Where We Are — and What’s Left

Take stock of what Meera has after one day of work.
Two environment variables bought her a flight recorder: every conversation with the support agent is now a replayable trace with per-step prompts, retrieved documents, token costs, and latencies. That’s the Air Canada defense, question #1. Eighty lines of `BaseCallbackHandler` bought her an institution-grade audit lane: append-only, PII-redacted before egress, vendor-independent, SIEM-ready. That’s question #2. And the `@traceable` decorator means none of this is hostage to LangChain: the day the team rewrites the agent in a different framework, the observability comes along.

But Sanjay’s third question is still open, and it’s the one that bites teams after launch: “If somebody tweaks a prompt next quarter, will we catch the regression before it ships?” Right now, the honest answer is still no. A well-meaning edit to the routing prompt tomorrow would sail straight into production, and Meera would learn about it from the complaints queue.

In Part 2 we close that gap and then zoom out: we turn LangSmith into a regression-test framework with datasets, evaluators, and experiments; put prompts under real version control with the Hub (immutable commits, movable `:production` tags, CI-gated promotion, and an instant-rollback story your release manager will love); tour how the exact same four moves play out in healthcare, e-commerce, legal, and edtech; and finish with the decision everyone eventually faces: an honest LangSmith vs Langfuse matrix and a 30-second decision tree across the wider 2026 tooling landscape.

If this saved you a future debugging weekend, follow me here on Medium so Part 2 lands in your feed. I’d genuinely love to hear your observability war stories in the comments. You can also connect with me on LinkedIn: https://www.linkedin.com/in/prashantksahu

References & Further Reading (Part 1)

LangSmith Observability docs
LangSmith GA announcement (Feb 2024)
End-to-end OpenTelemetry support in LangSmith
Moffatt v. Air Canada: tribunal holds airline liable for chatbot’s answer (ABA summary)
McCarthy Tétrault case note

This article was originally published in Towards AI on Medium.


Source

https://pub.towardsai.net/llm-observability-with-langsmith-part-1-tracing-everything-building-audit-grade-callbacks-c477719af691?source=rss----98111c9905da---4