AI News Archive: June 13, 2026 — Part 3

Sourced from 500+ daily AI sources, scored by relevance.

How One Silicon Valley Congressman is Trying Get AI Mega Money Out of Politics
How One Silicon Valley Congressman is Trying Get AI Mega Money Out of Politics The Information
Score: 42 | 🌐 Moves | Jun 13, 2026 | https://www.theinformation.com/articles/one-silicon-valley-congressman-trying-get-ai-mega-money-politics
Not All of Us Want to Talk to Our Tech. Do We Have a Choice?
Commentary: If the future of interaction is via voice, it risks excluding many of the same people that companies need to buy into AI.
Score: 42 | 🌐 Moves | Jun 13, 2026 | https://www.cnet.com/tech/services-and-software/not-all-of-us-want-to-talk-to-our-tech-do-we-have-a-choice/
Galaxy S25 users are finally getting some missing One UI 8.5 AI features
Galaxy S25 users are finally getting Prioritize Notifications, Summarize Notifications and File Summaries with the June update, after the features were skipped in the first One UI 8.5 rollout.
Score: 41 | 🌐 Moves | Jun 13, 2026 | https://www.digitaltrends.com/phones/galaxy-s25-users-are-finally-getting-some-missing-one-ui-8-5-ai-features/
Moonshot AI Releases Kimi K2.7-Code: a Coding Model Reporting +21.8% on Kimi Code Bench v2 Over K2.6
Moonshot AI Releases Kimi K2.7-Code: a Coding Model Reporting +21.8% on Kimi Code Bench v2 Over K2.6 MarkTechPost
Score: 41 | 🤖 Models | Jun 13, 2026 | https://www.marktechpost.com/2026/06/12/moonshot-ai-releases-kimi-k2-7-code-a-coding-model-reporting-21-8-on-kimi-code-bench-v2-over-k2-6/
HR leaders remain wary of AI-led hiring
Human resource leaders in the Philippines remain wary of relying on artificial intelligence for final hiring decisions, with only 28 percent expressing confidence in using the technology for the task, according to global recruitment specialist Robert Walters.
Score: 40 | 🌐 Moves | Jun 13, 2026 | https://www.philstar.com/business/2026/06/14/2534974/hr-leaders-remain-wary-ai-led-hiring
Visual Language Models Train Robots to Read Human Emotions
If robots are ever going to work alongside humans more generally, they’ll need to read our moods
Score: 40 | 🌐 Moves | Jun 13, 2026 | https://spectrum.ieee.org/robot-emotions-visual-language-models
For MetaComp’s Bo Bai, ‘Know Your Agent’ is the mantra in the age of agentic AI
For MetaComp’s Bo Bai, ‘Know Your Agent’ is the mantra in the age of agentic AI The Straits Times
Score: 40 | 🌐 Moves | Jun 13, 2026 | https://www.straitstimes.com/opinion/for-metacomps-bo-bai-know-your-agent-is-the-mantra-in-the-age-of-agentic-ai
#6: The Flywheel: What Happens When Workflows Run Themselves
AI flywheels are closed-loop workflows that generate, measure, and decide what to try next. Here’s why verification must come before autonomy
Score: 40 | 🌐 Moves | Jun 13, 2026 | https://www.turingpost.com/p/ai-flywheel-when-workflows-run-themselves
LLM Observability with LangSmith — Part 1: Tracing Everything & Building Audit-Grade Callbacks
Your agent demoed perfectly. Then someone asked, “What exactly did the bot tell that customer on the 14th?” and nobody could answer. This is the story of fixing that, end to end, with code.

Disclaimer: All creatives in this article are AI-generated by the author.

Meera ships the demo on a Friday. She’s a GenAI engineer at AcmeAI, a company that sells AI models and the hardware to run them. Her latest build is a customer-support agent: a LangGraph workflow that reads an incoming query, classifies it as *Technical*, *Billing*, or *General*, checks the customer’s sentiment, retrieves answers from a knowledge base, and replies. If the customer sounds furious, it skips the robot answer entirely and escalates to a human.

The demo lands. Leadership loves it. And then Sanjay, the head of risk, asks three questions that stop the launch cold:

1. “Can we replay any past customer interaction?” If a customer claims the bot promised them a free GPU, can we pull up exactly what happened?
2. “Do we have a tamper-evident audit log?” Every LLM call, every retrieval, every error, stored somewhere *we* control, not just a vendor dashboard?
3. “If somebody tweaks a prompt next quarter, will we catch the regression *before* it ships?” Or will we find out from angry customers?

[Figure: The three questions every head of risk eventually asks.]

Meera realizes something every LLM engineer eventually learns: building the agent is the easy half. Operating it is the hard half.

📚 This is Part 1 of a two-part series. In this part, we cover what observability and traceability actually mean (and why LLM apps break every assumption your monitoring stack was built on), what LangSmith is and everything it can do, zero-config tracing for a real LangGraph agent, tracing *any* Python function, and building a compliance-grade audit callback. That covers Sanjay’s questions #1 and #2.
Part 2 takes on question #3: eval datasets and CI regression gates, prompt versioning with the Hub, the same playbook across six industries, and an honest LangSmith-vs-Langfuse decision matrix plus the wider 2026 tool landscape. Grab a coffee. Let’s go.

1. Observability and Traceability: What They Are, and Why You Can’t Skip Them

Let’s define the words before we sling the tools, because they get used loosely.

Observability is a property of a system, not a product you buy: it’s the degree to which you can understand what’s happening inside the system purely from what it emits, meaning its logs, metrics, and traces. A highly observable system lets you ask questions you didn’t think of in advance (“why are answers about refunds suddenly 30% longer for German users?”) and get answers without shipping new code.

Traceability is narrower and sharper: the ability to follow one specific request end to end, with every hop, every transformation, and every sub-call in order, with timing. If observability is the security-camera system for the whole building, a trace is the complete CCTV cut of one visitor’s walk through it.

Monitoring, for completeness, is the dashboards-and-alerts layer you bolt on top: predefined checks for failures you already anticipated. Monitoring catches known unknowns. Observability is what saves you when the failure is one nobody predicted, which, with LLMs, is most of them.

Classic web apps had this figured out: APM tools, structured logs, distributed tracing. So why do LLM apps need their own version of the discipline? Because they violate the core assumption all of that tooling was built on: that failures are loud.

[Figure: Why LLM apps need their own observability]

The left half of that picture is the world your current tooling grew up in. When a traditional app breaks, it breaks *theatrically*: an exception fires, the response is a 500, the error-rate graph spikes, someone gets paged, and the stack trace points at the crime scene.
The right half is the world you live in now. An LLM failure arrives wearing a 200 OK. It’s fluent, confident, grammatically lovely, and wrong. No exception, no spike, no page. Your logs swear everything is fine, and the first detector to fire is a customer, weeks later.

The three boxes along the bottom are the answer this series builds, piece by piece: traces so you can replay any request, evaluations so every change gets scored before it ships, and versioned prompts so every change has a diff and an undo button. If that still sounds like nice-to-have engineering hygiene, walk through five very real scenarios.

[Figure: Five ways LLM apps fail quietly]

Scenario 1: The chatbot that invented a policy (real, and it went to a tribunal). In 2022, a passenger asked Air Canada’s website chatbot about bereavement fares. The bot confidently told him he could book a full-price ticket and claim the discount retroactively within 90 days, a policy that did not exist; the airline’s actual policy page said the opposite. He flew, applied for the refund, and was rejected. In February 2024, a British Columbia tribunal ordered Air Canada to pay CA$812.02, explicitly rejecting the airline’s argument that the chatbot was “a separate legal entity responsible for its own actions.” The ruling set the tone for every deployment since: your bot’s words are your company’s words. Now ask yourself: if a customer made that claim against your bot, could you produce the exact conversation, the documents the bot retrieved, and the prompt version that was live that day? Without tracing, the honest answer is no. You’d be litigating against a ghost.

Scenario 2: The silent regression. A teammate “improves” a routing prompt by adding one clarifying sentence. Refund questions quietly start routing to General, where the responder can’t see billing documents. No error. No alert. Accuracy degrades for three weeks until a pattern emerges in complaints.
The fix takes five minutes; finding it takes days, unless an eval suite had flagged it before merge. (This scenario is the entire plot of Part 2.)

Scenario 3: The invisible money leak. Every LLM call has a price tag, which makes cost a first-class observability metric in a way traditional apps never needed. One team discovered that 4% of conversations were consuming 40% of their token spend: users pasting entire PDFs into the chat. That’s invisible in a monthly bill (“the OpenAI line item went up”) and obvious in five minutes of trace telemetry grouped by conversation.

Scenario 4: Drift you didn’t deploy. Model providers update models. Sometimes behavior shifts subtly: formats change, refusal rates move, a model gets terser. Your code didn’t change, your prompts didn’t change, and yet Tuesday’s system is not Monday’s system. Without baseline evals re-run on a schedule, you discover drift the way you discover everything else: from users.

Scenario 5: The 3 a.m. incident. Something went wrong with a customer interaction and it’s escalating; legal is asking questions. Your observability SaaS is having an outage, or your compliance team was never allowed to ship data there in the first place. What do you hand the auditor? If the answer isn’t “our own append-only log, on our own disk, with every event timestamped and correlated,” you have a governance gap, not just a tooling gap.

Five scenarios, one conclusion: an LLM app without observability isn’t a product; it’s a liability with a chat interface. Traces answer “what happened?”, evaluations answer “is it still good?”, and versioned prompts answer “what changed, and can we undo it?” Now let’s meet the tool Meera reaches for.

2. What Is LangSmith, Exactly?

LangSmith is the observability, evaluation, and prompt-engineering platform built by the LangChain team.
It entered closed beta in mid-2023, reached general availability in February 2024 alongside LangChain’s $25M Series A led by Sequoia Capital, and has since grown from “a debugger for LangChain” into what the company now positions as a full agent engineering platform. A few facts worth knowing before you commit to it:

It’s framework-agnostic, despite the family name. The deepest magic (zero-config tracing) lights up with LangChain and LangGraph, but the `langsmith` SDK traces any Python or JavaScript code via a decorator (we’ll prove that in §4), and the platform speaks OpenTelemetry in both directions: it can ingest OTel spans from any stack and export its traces into the observability pipeline your org already runs (Grafana, Datadog, Jaeger).

It’s closed-source SaaS at heart. There’s a free Developer tier (thousands of traces per month, enough for everything in this series), a per-seat Plus tier, and an Enterprise tier that unlocks self-hosting in your own VPC plus SSO/RBAC. There’s also an EU-region cloud for data-residency requirements. If open source is a hard requirement, that’s the Langfuse conversation; Part 2 has a full decision matrix.

It’s one product with four jobs. Most teams discover LangSmith as a tracing tool and only later realize the other three quadrants exist:

This part of the series lives in the first row of that table; Part 2 lives in the second and third. Let’s build.

3. Tracing: Two Environment Variables and You Can See Everything

Setup: You need two API keys: one for your LLM provider (OpenAI here, but anything works) and one free LangSmith key from smith.langchain.com.

    pip install "langchain>=1.0" "langchain-core>=1.0" "langchain-openai>=1.0" \
        "langgraph>=1.0" langchain-chroma "langsmith>=0.4" python-dotenv

    import os
    from dotenv import load_dotenv

    load_dotenv()  # expects OPENAI_API_KEY and LANGSMITH_API_KEY in .env

    # The entire tracing setup. Yes, really.
    os.environ["LANGSMITH_TRACING"] = "true"
    os.environ["LANGSMITH_PROJECT"] = "acmeai-support-router"
    # Optional: pin the endpoint (default is US; use eu.api.smith.langchain.com for EU residency)
    os.environ.setdefault("LANGSMITH_ENDPOINT", "https://api.smith.langchain.com")

That’s the magic trick, and it’s worth pausing on: you haven’t imported LangSmith anywhere. With these environment variables set, every LangChain and LangGraph operation from this point on (every model call, every retriever hit, every graph node) reports itself to your LangSmith project automatically. The tracer rides LangChain’s internal callback system and uploads in batches, off the hot path, so your latency doesn’t pay for it.

Now let’s build the thing worth observing.

The Knowledge Base

Meera’s agent answers from a small product knowledge base. In production this would be your real docs; here, twelve documents keep the story self-contained:

    from langchain_core.documents import Document

    knowledge_base = [
        # -- technical --
        {"text": "Our pre-trained models include vision (CLIP-style), speech (Whisper-style), and text (Llama-3 fine-tunes). They ship with example notebooks.",
         "metadata": {"category": "technical"}},
        {"text": "On-prem deployment is supported via the AcmeAI Edge appliance - Kubernetes-based, runs Llama 3 70B on 2x H100.",
         "metadata": {"category": "technical"}},
        {"text": "Hardware troubleshooting: if the GPU light blinks red, run acmectl diagnose --gpu; common cause is a loose NVLink bridge.",
         "metadata": {"category": "technical"}},
        {"text": "AcmeAI SDK supports Python 3.10+, Node 20+, and Java 17. The REST API is OpenAPI 3.1 compliant.",
         "metadata": {"category": "technical"}},
        # -- billing --
        {"text": "We accept Visa, Mastercard, Amex, ACH bank transfer, and wire. Crypto is not supported.",
         "metadata": {"category": "billing"}},
        {"text": "Invoices are emailed on the 1st of each month. To download past invoices, log in and visit Account → Billing → Invoices.",
         "metadata": {"category": "billing"}},
        {"text": "You can update your billing info under Account → Billing → Payment Methods. Changes take effect immediately.",
         "metadata": {"category": "billing"}},
        {"text": "Refunds are processed within 7 business days. We refund pro-rata on cancellation within 30 days of purchase.",
         "metadata": {"category": "billing"}},
        # -- general --
        {"text": "Our refund policy: full refund within 30 days, pro-rata thereafter. Contact billing@acmeai.example.",
         "metadata": {"category": "general"}},
        {"text": "Standard shipping is 3–5 business days within the US. International shipping is 7–14 business days; duties not included.",
         "metadata": {"category": "general"}},
        {"text": "Working hours: Mon–Fri 8am–8pm Eastern. Weekend support is available for Enterprise customers only.",
         "metadata": {"category": "general"}},
        {"text": "You can reach support at support@acmeai.example or +1-555-ACME-HELP. Average first response: under 4 hours.",
         "metadata": {"category": "general"}},
    ]

    docs = [Document(page_content=d["text"], metadata=d["metadata"]) for d in knowledge_base]

Embed it into a Chroma vector store with cosine similarity:

    from langchain_openai import OpenAIEmbeddings
    from langchain_chroma import Chroma

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    kbase_db = Chroma.from_documents(
        documents=docs,
        collection_name="knowledge_base",
        embedding=embeddings,
        collection_metadata={"hnsw:space": "cosine"},  # default is euclidean - be explicit
        persist_directory="./knowledge_base",
    )

    retriever = kbase_db.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": 3, "score_threshold": 0.2},
    )

The Agent’s Brain: State, Classification, Sentiment

LangGraph agents pass a typed state dictionary between nodes:

    from typing import TypedDict, Literal
    from pydantic import BaseModel
    from langchain_openai import ChatOpenAI

    class CustomerSupportState(TypedDict):
        customer_query: str
        query_category: str
        query_sentiment: str
        final_response: str

    class QueryCategory(BaseModel):
        categorized_topic: Literal["Technical", "Billing", "General"]

    class QuerySentiment(BaseModel):
        sentiment: Literal["Positive", "Neutral", "Negative"]

    llm = ChatOpenAI(model="gpt-5-mini")  # swap for any chat model you like

The two Pydantic models matter more than they look. Paired with `with_structured_output`, the LLM *cannot* reply “I think this is probably a billing question 😊”; it must return one of the three allowed labels. Routers need guarantees, not vibes.

    def categorize_inquiry(state: CustomerSupportState) -> CustomerSupportState:
        """Classify the query into Technical / Billing / General."""
        prompt = f"""Act as a customer support agent for an AI products and hardware company.
    Read the customer query and pick the best category: 'Technical', 'Billing', or 'General'.
    - Technical: AI models, hardware, software, SDK issues
    - Billing: payments, invoices, refunds, purchases
    - General: policies, contact info, shipping, everything else

    Query: {state["customer_query"]}
    """
        result = llm.with_structured_output(QueryCategory).invoke(prompt)
        return {"query_category": result.categorized_topic}

    def analyze_inquiry_sentiment(state: CustomerSupportState) -> CustomerSupportState:
        """Classify sentiment as Positive / Neutral / Negative."""
        prompt = f"""Act as a customer support agent. Read the customer query below and classify
    its sentiment as exactly one of: 'Positive', 'Neutral', or 'Negative'.

    Query: {state["customer_query"]}
    """
        result = llm.with_structured_output(QuerySentiment).invoke(prompt)
        return {"query_sentiment": result.sentiment}

Sanity-check the sentiment node before wiring anything, using the same question in two emotional registers:

    analyze_inquiry_sentiment({"customer_query": "what is your refund policy?"})
    # {'query_sentiment': 'Neutral'}

    analyze_inquiry_sentiment({"customer_query": "what is your refund policy? I am fed up with this product and want my money back"})
    # {'query_sentiment': 'Negative'}

Same topic, opposite routing destinies: the first will get a polite RAG answer about refund windows; the second is heading straight to a human. That’s the whole escalation design in two lines of output.

The RAG Responders: One Factory, Three Nodes

Each responder filters the vector store to *its own category* using a metadata filter, so the billing node physically cannot retrieve technical docs:

    from langchain_core.prompts import ChatPromptTemplate

    RESPONSE_TEMPLATE = ChatPromptTemplate.from_template(
        """Craft a clear and helpful {category} support response for the customer query below.
    Ground your answer in the provided knowledge base information.
    If the knowledge base does not contain the answer, say exactly:
    "Apologies, I was not able to answer your question, please reach out to +1-555-ACME-HELP"

    Customer Query: {customer_query}

    Relevant Knowledge Base Information: {retrieved_content}
    """
    )

    def make_category_responder(category: str):
        """Build a RAG responder node scoped to one KB category via metadata filter."""
        def responder(state: CustomerSupportState) -> CustomerSupportState:
            # Note: this mutates the shared retriever, which is fine for a sequential demo
            # but would need a per-category retriever under concurrent requests.
            retriever.search_kwargs["filter"] = {"category": category}
            docs = retriever.invoke(state["customer_query"])
            retrieved = "\n\n".join(d.page_content for d in docs)
            chain = RESPONSE_TEMPLATE | llm
            reply = chain.invoke({
                "category": category,
                "customer_query": state["customer_query"],
                "retrieved_content": retrieved,
            }).content
            return {"final_response": reply}
        return responder

    generate_technical_response = make_category_responder("technical")
    generate_billing_response = make_category_responder("billing")
    generate_general_response = make_category_responder("general")

    def escalate_to_human_agent(state: CustomerSupportState) -> CustomerSupportState:
        """Negative sentiment? No robot. A human will call."""
        return {"final_response": "We're really sorry! Someone from our team will reach out to you shortly."}

Wiring the Graph

    from langgraph.graph import StateGraph, END
    from langgraph.checkpoint.memory import MemorySaver

    def determine_route(state: CustomerSupportState) -> str:
        if state["query_sentiment"] == "Negative":
            return "escalate_to_human_agent"
        elif state["query_category"] == "Technical":
            return "generate_technical_response"
        elif state["query_category"] == "Billing":
            return "generate_billing_response"
        return "generate_general_response"

    graph = StateGraph(CustomerSupportState)
    graph.add_node("categorize_inquiry", categorize_inquiry)
    graph.add_node("analyze_inquiry_sentiment", analyze_inquiry_sentiment)
    graph.add_node("generate_technical_response", generate_technical_response)
    graph.add_node("generate_billing_response", generate_billing_response)
    graph.add_node("generate_general_response", generate_general_response)
    graph.add_node("escalate_to_human_agent", escalate_to_human_agent)

    graph.set_entry_point("categorize_inquiry")
    graph.add_edge("categorize_inquiry", "analyze_inquiry_sentiment")
    graph.add_conditional_edges("analyze_inquiry_sentiment", determine_route, [
        "generate_technical_response",
        "generate_billing_response",
        "generate_general_response",
        "escalate_to_human_agent",
    ])
    for terminal in ["generate_technical_response", "generate_billing_response",
                     "generate_general_response", "escalate_to_human_agent"]:
        graph.add_edge(terminal, END)

    agent = graph.compile(checkpointer=MemorySaver())

If you’re in a notebook, LangGraph will draw itself (`agent.get_graph().draw_mermaid_png()`), and what it draws is this topology:

[Figure: The support router as a LangGraph]

The two indigo nodes at the top are LLM classifiers writing into the shared state; the diamond is plain Python reading that state (auditable logic, no model involved); the three teal terminals are the RAG responders, each fenced into its own slice of the knowledge base by that metadata filter; and the red terminal is the empathy hatch, where angry customers bypass
the robot entirely. Keep this picture in mind, because in about thirty seconds every shape on it is going to reappear as a span in a trace tree.

The Payoff: Run It and Watch the Traces Appear

    def ask(query: str, session_id: str = "demo") -> str:
        final = None
        for event in agent.stream(
            {"customer_query": query},
            {"configurable": {"thread_id": session_id}},
            stream_mode="values",
        ):
            final = event
        return final["final_response"]

    print(ask("Do you support pre-trained vision models?"))    # → Technical path
    print(ask("How do I download my last invoice?"))           # → Billing path
    print(ask("Can you tell me about your shipping policy?"))  # → General path

The billing answer comes back grounded in exactly the documents we seeded:

    You can download past invoices by logging in and going to Account → Billing → Invoices. Invoices are also emailed on the 1st of each month, so check the inbox associated with your account. If you don't see an invoice you expected, contact support@acmeai.example and we'll resend it.

Pleasant enough. But the real payoff is on the other screen: open smith.langchain.com, click into the `acmeai-support-router` project, and three traces are waiting. Click the invoice one and you get the full waterfall, which looks like this:

[Figure: Anatomy of a LangSmith trace]

Time flows left to right; each bar is a run, LangSmith’s unit of work, and the indentation *is* the call hierarchy. Three things jump out the first time you see your own agent like this. First, the final generation call eats 1.25 of the total 2.95 seconds, so when someone says “the bot feels slow,” this chart settles the optimize-retrieval-or-optimize-generation argument in five seconds flat (it’s generation, and the two classifier calls in front of it are the next suspects). Second, that little amber sliver: 140 ms for the Chroma retrieval, and clicking it shows the exact three documents it returned, which is precisely the evidence you need when the bot confidently cites the wrong spec.
Third, quietly doing the bookkeeping: every bar carries token counts in and out (LangSmith turns those into cost per trace, per user, per day), plus a `run_id` and `parent_run_id` linking each run to its parent. File those two IDs away; they become important in §5.

And that’s Sanjay’s question #1 answered, with a click instead of a forensic project. When a customer claims “your bot told me the appliance supports water cooling” (or invents a bereavement-fare policy), support pulls the trace and reads exactly what the retriever returned and what the model said.

4. “But My Code Isn’t All LangChain”: Tracing Any Python Function

[Figure: One decorator, and plain Python shows up in the same trace tree]

A question Meera gets asked constantly, so let’s answer it head-on: yes, LangSmith traces arbitrary Python, no LangChain required. The `@traceable` decorator turns any function into a run, and nested decorated calls assemble into the same parent-child tree automatically. Here’s a fully standalone example: raw OpenAI SDK, a fake database call, plain Python orchestration.
Not a LangChain import in sight:

    import os
    from langsmith import traceable
    from langsmith.wrappers import wrap_openai
    from openai import OpenAI

    os.environ["LANGSMITH_TRACING"] = "true"
    os.environ["LANGSMITH_PROJECT"] = "acmeai-standalone-demo"

    # wrap_openai instruments the raw OpenAI client: every .create() becomes an LLM run
    oai = wrap_openai(OpenAI())

    @traceable(name="crm_lookup", run_type="tool")
    def fetch_customer_tier(customer_id: str) -> str:
        # pretend this hits your CRM / database
        return "enterprise" if customer_id.startswith("ENT") else "standard"

    @traceable(name="ticket_summarizer")
    def summarize_ticket(ticket_text: str) -> str:
        response = oai.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": f"Summarize this support ticket in one line: {ticket_text}"}],
        )
        return response.choices[0].message.content

    @traceable(name="handle_ticket", tags=["support", "v2"], metadata={"team": "acmeai-support"})
    def handle_ticket(customer_id: str, ticket_text: str) -> dict:
        tier = fetch_customer_tier(customer_id)   # child run #1 (tool)
        summary = summarize_ticket(ticket_text)   # child run #2 -> contains the LLM run
        return {"tier": tier, "summary": summary, "priority": "P1" if tier == "enterprise" else "P3"}

    handle_ticket("ENT-00451", "The Edge appliance reboots whenever we run the vision pipeline at batch size 64.")

Open the project and the tree reads exactly like the code: `handle_ticket` as the parent, `crm_lookup` and `ticket_summarizer` nested inside it, and the wrapped OpenAI call inside that, with token counts captured even though LangChain was never involved. Three details worth knowing:

`run_type` controls how a run renders and aggregates (`tool`, `retriever`, `llm`, default `chain`); tag your custom vector-store calls as `retriever` and they get retrieval-style display.

`tags` and `metadata` become filterable dimensions in the dashboard: `metadata={"tenant_id": …}` is how multi-tenant apps slice cost and quality per customer.
The same decorator exists in the JS/TS SDK, and if you’re already standardized on OpenTelemetry, you can skip the decorator entirely and ship OTel spans straight in.

This matters strategically: your observability isn’t welded to your framework choice. If you rip out LangChain next year, the tracing survives.

5. The Auditor’s Question: Custom Callbacks for a Tamper-Evident Log

Meera shows Sanjay the dashboard. He’s impressed, for about a minute. Then he leans in: “This is their dashboard, on their servers. If their cloud is down during an incident, what do we show the regulator? And does customer data leave our network before we’ve scrubbed it?”

Fair. LangSmith is brilliant for debugging. But audit and compliance teams want guarantees a SaaS dashboard alone can’t give:

The primitive that solves this is LangChain’s `BaseCallbackHandler`, the same machinery LangSmith itself rides on: lifecycle hooks that fire synchronously, in your process, on every LLM start/end, tool start/end, and error. Subclass it, and you decide what gets persisted, where, and in what shape.

Meera writes hers to emit JSON Lines: one JSON object per line, append-only. It’s the dullest format in computing, and that’s the point: `grep` reads it, `jq` reads it, pandas reads it, Splunk ingests it.

    import json
    import re
    import time
    from datetime import datetime, timezone
    from pathlib import Path
    from typing import Any
    from uuid import UUID

    from langchain_core.callbacks import BaseCallbackHandler

    AUDIT_LOG_PATH = Path("./audit.jsonl")
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def redact(text: str) -> str:
        """Minimal demo redaction. In production, use a real PII engine
        (Microsoft Presidio, Amazon Comprehend); emails alone won't cut it."""
        return EMAIL_RE.sub("[EMAIL_REDACTED]", text)

    class JsonLinesAuditHandler(BaseCallbackHandler):
        """Append-only audit log of every LLM call, tool call, and error.

        One JSON object per line: grep-, jq-, pandas-, and Splunk-friendly.
        Designed for environments where an external SaaS alone can't be
        the system of record.
        """

        def __init__(self, log_path: Path = AUDIT_LOG_PATH) -> None:
            self.log_path = log_path
            # Start times keyed by run_id, so end hooks can compute latency.
            self._llm_starts: dict[UUID, float] = {}
            self._tool_starts: dict[UUID, float] = {}

        def _emit(self, event: dict[str, Any]) -> None:
            event["ts"] = datetime.now(timezone.utc).isoformat()
            with self.log_path.open("a", encoding="utf-8") as f:
                f.write(json.dumps(event, default=str) + "\n")

        def on_llm_start(self, serialized, prompts, *, run_id, parent_run_id=None, **kwargs):
            self._llm_starts[run_id] = time.perf_counter()
            self._emit({
                "event": "llm_start",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "model": (serialized or {}).get("id", ["unknown"])[-1],
                "prompt_chars": sum(len(redact(p)) for p in prompts),
            })

        def on_llm_end(self, response, *, run_id, parent_run_id=None, **kwargs):
            latency_ms = (time.perf_counter() - self._llm_starts.pop(run_id, time.perf_counter())) * 1000
            # llm_output or its token_usage entry may be missing or None.
            usage = (response.llm_output or {}).get("token_usage") or {}
            self._emit({
                "event": "llm_end",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "latency_ms": round(latency_ms, 1),
                "prompt_tokens": usage.get("prompt_tokens"),
                "completion_tokens": usage.get("completion_tokens"),
                "total_tokens": usage.get("total_tokens"),
            })

        def on_llm_error(self, error, *, run_id, parent_run_id=None, **kwargs):
            self._emit({
                "event": "llm_error",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "error_type": type(error).__name__,
                "error_msg": str(error)[:500],
            })

        def on_tool_start(self, serialized, input_str, *, run_id, parent_run_id=None, **kwargs):
            self._tool_starts[run_id] = time.perf_counter()
            self._emit({
                "event": "tool_start",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "tool": (serialized or {}).get("name"),
                "input_chars": len(input_str),
            })

        def on_tool_end(self, output, *, run_id, parent_run_id=None, **kwargs):
            latency_ms = (time.perf_counter() - self._tool_starts.pop(run_id, time.perf_counter())) * 1000
            self._emit({
                "event": "tool_end",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "latency_ms": round(latency_ms, 1),
                "output_chars": len(str(output)),
            })

        def on_chain_error(self, error, *, run_id, parent_run_id=None, **kwargs):
            self._emit({
                "event": "chain_error",
                "run_id": str(run_id),
                "parent_run_id": str(parent_run_id) if parent_run_id else None,
                "error_type": type(error).__name__,
                "error_msg": str(error)[:500],
            })

    audit_handler = JsonLinesAuditHandler()

A few deliberate choices worth noticing:

`run_id` + `parent_run_id` on every event: the same UUIDs you met in the trace waterfall. Your forensics and your debugging share a correlation key.

Character counts, not raw text, by default. The log proves that a prompt of a certain size went to a certain model at a certain time, without making the audit file itself a PII liability. Where you do log content, it goes through `redact()` first.

Latency measured in-process with `perf_counter`, so the numbers hold up even if a trace upload is delayed or dropped.

It also helps to see when each of those overridden hooks actually fires. For one billing query through the router, the sequence and what each hook leaves behind in the file looks like this:

[Figure: Callback hook firing order for one query]

Notice the greyed rows: `on_chain_start` and `on_chain_end` fire too (for the graph and for every node), but our handler deliberately lets them pass; LLM and error events are the audit-worthy moments. The amber row is an invitation: if your compliance story needs retrieval evidence (“which documents informed this answer?”), `on_retriever_start` / `on_retriever_end` are sitting there waiting for the same treatment.
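To make the handler's "emails alone won't cut it" caveat concrete, here is a small standalone sketch that reuses the same regex as `redact()` and runs it over two sample strings. The sample strings are made up for illustration and are not taken from the AcmeAI logs; the point is only that the pattern catches an email address and leaves a phone number untouched.

```python
import re

# Same pattern as the handler's redact(); the sample strings below are hypothetical.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL_RE.sub("[EMAIL_REDACTED]", text)

with_email = redact("Contact jane.doe@example.com about invoice 1042")
with_phone = redact("Call me on +1-555-0100 after 6pm")

print(with_email)  # Contact [EMAIL_REDACTED] about invoice 1042
print(with_phone)  # Call me on +1-555-0100 after 6pm (phone number passes through)
```

A log that ever stores raw content would therefore still leak phone numbers, names, and account IDs with this pattern alone, which is why the docstring points at a dedicated PII engine for production use.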
And the red strip at the bottom is the part auditors care about most: failures don’t vanish, they write a row with the error type, because the absence of a record is itself a finding in most audit frameworks.

Attaching the handler costs one config key, and LangSmith keeps tracing alongside it. They’re independent layers:

queries = [
    ("audit-001", "Do you support pre-trained vision models?"),
    ("audit-002", "How do I download my last invoice?"),
    ("audit-003", "I am furious — your hardware bricked itself overnight, refund NOW."),
]

for session_id, query in queries:
    events = agent.stream(
        {"customer_query": query},
        config={
            "configurable": {"thread_id": session_id},
            "callbacks": [audit_handler],  # ← audit log + LangSmith both fire
            "metadata": {"app": "support-router", "session_id": session_id},
        },
        stream_mode="values",
    )
    # Drain the stream; the last value is the final graph state.
    final = None
    for ev in events:
        final = ev
    print(f"{session_id}: category={final['query_category']} sentiment={final['query_sentiment']}")

audit-001: category=Technical sentiment=Neutral
audit-002: category=Billing sentiment=Neutral
audit-003: category=Billing sentiment=Negative

Read that third line carefully: it’s the design working as intended. The router still classified the furious message as Billing (it is about a refund), but the Negative sentiment overrode the route and the customer got a human, not a robot.

Meanwhile, `audit.jsonl` quietly collected the paper trail. Six LLM calls happened across those three queries (two for the escalated one, whose generation step never ran), and each produced a start and end event:

import json
import pandas as pd

# Read the JSONL audit file, skipping blank lines.
with AUDIT_LOG_PATH.open(encoding="utf-8") as f:
    events = [json.loads(line) for line in f if line.strip()]
df = pd.DataFrame(events)

print("Total events:", len(events))
print("Total LLM tokens:", int(df["total_tokens"].dropna().sum()))
print("Avg LLM latency (ms):", round(df.loc[df["event"] == "llm_end", "latency_ms"].mean(), 1))

Total events: 16
Total LLM tokens: 1732
Avg LLM latency (ms): 894.6

And this is what a single line of the file looks like, the thing your SOC team greps at 3 a.m. when an incident lands:

{"event": "llm_end", "run_id": "9f2c...", "parent_run_id": "b41a...", "latency_ms": 842.3, "prompt_tokens": 187, "completion_tokens": 9, "total_tokens": 196, "ts": "2026-06-12T07:14:55.103+00:00"}

[Figure: Two independent observability lanes]

Picture the full event stream flowing down two lanes from the same source. Down the engineer’s lane: the LangSmith tracer batching events to the cloud, feeding the dashboards, the replay UI, and (in Part 2) the experiments. Down the auditor’s lane: our handler, running synchronously in-process, then the append-only file, then the SIEM with its retention policy. That in-process placement is the entire compliance argument, because `redact()` runs before any byte leaves the machine.

The two lanes share nothing but the callback events and those `run_id`s. LangSmith down? The auditor’s record is intact. Disk hiccup? LangSmith still has the traces. That’s Sanjay’s question #2 answered, and it earns the rule Meera writes on the team wiki:

Run both, always. LangSmith for the humans debugging at their desks. The custom callback for the auditor’s chain-of-custody. They’re complementary layers, not alternatives.

Where We Are, and What’s Left

Take stock of what Meera has after one day of work.
Two environment variables bought her a flight recorder: every conversation with the support agent is now a replayable trace with per-step prompts, retrieved documents, token costs, and latencies. That’s the Air Canada defense, question #1. Eighty lines of `BaseCallbackHandler` bought her an institution-grade audit lane: append-only, PII-redacted before egress, vendor-independent, SIEM-ready. That’s question #2. And the `@traceable` decorator means none of this is hostage to LangChain: the day the team rewrites the agent in a different framework, the observability comes along.

But Sanjay’s third question is still open, and it’s the one that bites teams after launch: “If somebody tweaks a prompt next quarter, will we catch the regression before it ships?” Right now, the honest answer is still no. A well-meaning edit to the routing prompt tomorrow would sail straight into production, and Meera would learn about it from the complaints queue.

In Part 2 we close that gap and then zoom out: we turn LangSmith into a regression-test framework with datasets, evaluators, and experiments; put prompts under real version control with the Hub (immutable commits, movable `:production` tags, CI-gated promotion, and an instant-rollback story your release manager will love); tour how the exact same four moves play out in healthcare, e-commerce, legal, and edtech; and finish with the decision everyone eventually faces: an honest LangSmith vs Langfuse matrix and a 30-second decision tree across the wider 2026 tooling landscape.

If this saved you a future debugging weekend, follow me here on Medium so Part 2 lands in your feed. I’d genuinely love to hear your observability war stories in the comments. You can also connect with me on LinkedIn: https://www.linkedin.com/in/prashantksahu

References & Further Reading (Part 1)

- LangSmith Observability docs
- LangSmith GA announcement (Feb 2024)
- End-to-end OpenTelemetry support in LangSmith
- Moffatt v. Air Canada: tribunal holds airline liable for chatbot’s answer (ABA summary)
- McCarthy Tétrault case note

LLM Observability with LangSmith — Part 1: Tracing Everything & Building Audit-Grade Callbacks was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Score: 39🌐 MovesJun 13, 2026https://pub.towardsai.net/llm-observability-with-langsmith-part-1-tracing-everything-building-audit-grade-callbacks-c477719af691?source=rss----98111c9905da---4
LLM Observability with LangSmith — Part 2: Eval Gates, Prompt Versioning & Choosing Your Stack
In Part 1, we made an agent replayable and audit-proof. Now we make it regression-proof, and then decide, with a clear head, whether LangSmith is even the right tool for your team.

📚 This is Part 2 of a two-part series. In Part 1, Meera, a GenAI engineer at AcmeAI, shipped a LangGraph support agent and got stopped by three questions from Sanjay, the head of risk. We answered two of them: with LANGSMITH_TRACING=true and a project name, every conversation became a replayable trace (question #1, the Air Canada defense), and a custom BaseCallbackHandler gave compliance a tamper-evident, PII-redacted, vendor-independent audit log running alongside LangSmith (question #2, the two-lane pattern). We also proved the @traceable decorator traces any Python function, LangChain or not. If you haven't read it, start there; this part reuses the agent built in Part 1.

One question is still open, and it’s the sneakiest of the three: “If somebody tweaks a prompt next quarter, will we catch the regression before it ships?” Right now the answer is no. Let’s fix that, and then zoom all the way out to the industry playbook, the LangSmith-vs-Langfuse decision, and the rest of the 2026 tooling field.

1. The Regression Question: Datasets, Evaluators, Experiments

The failure Sanjay is describing is silent. Someone “improves” the routing prompt by adding one clarifying sentence, and refund questions quietly start routing to General, where the responder can’t see billing documents. No error, no alert, just gradually worse answers discovered weeks later through complaints.

Code has had the antidote for decades: regression tests. LangSmith brings the same discipline to LLM behavior with three nouns:

- A dataset: a versioned collection of (input → expected output) examples. Your test fixture.
- An evaluator: a function that scores one (input, expected, actual) triple. Your assertion.
- An experiment: one full run of the dataset through your system with evaluators attached. Your test-suite execution, with results stored, charted, and diffable in the dashboard.

[Figure: Dataset, evaluator, experiment]

Meera builds a routing benchmark, including the trap cases that bite real routers:

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()
DATASET_NAME = "acmeai-routing-eval"

examples = [
    {"inputs": {"customer_query": "Do you ship to Singapore?"},
     "outputs": {"expected_category": "General"}},
    {"inputs": {"customer_query": "My GPU appliance throws CUDA errors after the latest firmware"},
     "outputs": {"expected_category": "Technical"}},
    {"inputs": {"customer_query": "Why was my Visa charged twice this month?"},
     "outputs": {"expected_category": "Billing"}},
    {"inputs": {"customer_query": "How do I update my saved payment method?"},
     "outputs": {"expected_category": "Billing"}},
    {"inputs": {"customer_query": "I want a refund right now, your product is unusable"},
     "outputs": {"expected_category": "Billing"}},  # angry, but still Billing - sentiment handles escalation
    {"inputs": {"customer_query": "Does your SDK support Python 3.12?"},
     "outputs": {"expected_category": "Technical"}},
    {"inputs": {"customer_query": "What are your weekend support hours?"},
     "outputs": {"expected_category": "General"}},
]

# Idempotent: create the dataset only if it doesn't already exist
try:
    ds = client.read_dataset(dataset_name=DATASET_NAME)
except Exception:
    ds = client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Routing-accuracy benchmark for the support router agent.",
    )
    client.create_examples(
        dataset_id=ds.id,
        examples=[{"inputs": e["inputs"], "outputs": e["outputs"]} for e in examples],
    )

Note that fifth example: “I want a refund right now, your product is unusable.” It encodes a real design decision from Part 1: the router should still say Billing; the sentiment check is what triggers escalation. Once that’s in the dataset, nobody can accidentally “fix” it away.

Next, the target under test and the evaluator.
The target wraps just the classification node, so it unit-tests one decision, not the whole pipeline:

def routing_target(inputs: dict) -> dict:
    # Run only the categorization node from Part 1 on the example's query.
    state = categorize_inquiry({"customer_query": inputs["customer_query"]})
    return {"predicted_category": state["query_category"]}

def correctness_evaluator(run, example) -> dict:
    # Exact-match check between the predicted and the expected label.
    pred = (run.outputs or {}).get("predicted_category")
    expected = (example.outputs or {}).get("expected_category")
    return {
        "key": "routing_correctness",
        "score": 1.0 if pred == expected else 0.0,
        "comment": f"pred={pred} expected={expected}",
    }

And the experiment is one function call:

results = evaluate(
    routing_target,
    data=DATASET_NAME,
    evaluators=[correctness_evaluator],
    experiment_prefix="router-baseline",
    metadata={"agent": "router-v1"},
)

View the evaluation results for experiment: 'router-baseline-c7e21a4f' at: https://smith.langchain.com/o/.../datasets/.../compare?selectedSessions=...
7it [00:11, 1.58s/it]

Follow that link and the dashboard shows the per-example breakdown. Meera’s baseline comes back at 6/7 ≈ 0.86, and the per-example table tells her exactly which example bled.

There it is: the exact trap the dataset was built to catch. When the customer is venting about the product, the model reads “unusable product” as a general complaint and misses that the actionable intent is a refund. In production this is invisible: the customer still gets an answer, just one generated without access to the refund-policy documents. In an experiment, it’s a red cell with a comment string. That’s the whole pitch for evals in one table row.

[Figure: The regression loop]

The loop above is now Meera’s development workflow. The top row is mechanical: fixtures in, scores out, gate at the end. The two feedback arrows are where value compounds. The solid one, re-run until green, is the inner cycle she’s about to do (fix the prompt, re-run, compare experiments side by side).
The dashed one, from “production incident” back into the dataset, is the long game: every bad answer your Part 1 tracing catches in the wild gets distilled into a new example, which means the test suite grows in exactly the directions your system actually fails. Six months in, that dataset is the team’s institutional memory of every way the agent has ever embarrassed them.

One scope note: this evaluator is an exact-match check, which works because structured output constrains the labels. For free-text answers you’d add an LLM-as-judge evaluator (a strong model grading faithfulness against a rubric), and LangSmith’s online evaluators can score samples of live production traffic continuously, so drift (Scenario 4 from Part 1) shows up on a dashboard instead of in a complaint.

Now, about that failing prompt. Fixing it properly raises a bigger question.

2. Prompt Versioning: Git for Prompts

Here’s the uncomfortable question: where do your prompts actually live? If the honest answer is “in f-strings, scattered across the codebase, edited by whoever and deployed whenever”, that’s exactly how silent regressions are born, and exactly what a tribunal will subpoena, as Air Canada learned in Part 1.

LangSmith’s Hub treats prompts as versioned, deployable artifacts. Every push creates an immutable commit (old versions are never overwritten and stay pullable by hash, forever), and tags like production or staging are movable pointers to commits, exactly like git branches. Meera lifts the routing prompt out of the code and pushes it:

from langchain_core.prompts import ChatPromptTemplate
from langsmith import Client

client = Client()
PROMPT_NAME = "acmeai-router-categorization"

routing_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a customer support agent for an AI products and hardware company. "
     "Classify the customer query into exactly one of: Technical, Billing, General. "
     "Return only the category name."),
    ("human", "{customer_query}"),
])

url = client.push_prompt(PROMPT_NAME, object=routing_prompt)
print(f"Pushed → {url}")

Pushed → https://smith.langchain.com/prompts/acmeai-router-categorization/...

That’s commit #1. Now the fix for the failing eval case: a v2 with one extra routing rule, pushed as a new commit of the same prompt:

routing_prompt_v2 = ChatPromptTemplate.from_messages([
    ("system",
     "You are a customer support agent for an AI products and hardware company. "
     "Classify the customer query into exactly one of: Technical, Billing, General. "
     "Rule: complaints about charges, refunds, or payments are ALWAYS Billing, "
     "even when the customer is angry or insulting the product. "
     "Return only the category name."),
    ("human", "{customer_query}"),
])

client.push_prompt(PROMPT_NAME, object=routing_prompt_v2)  # commit #2

Does v2 actually fix the regression without breaking anything else? That’s not a matter of opinion anymore; it’s an experiment. The new target pulls the prompt from the Hub (note: the app code no longer contains prompt text at all):

def routing_target_v2(inputs: dict) -> dict:
    prompt = client.pull_prompt(PROMPT_NAME)  # latest commit
    chain = prompt | llm.with_structured_output(QueryCategory)
    result = chain.invoke({"customer_query": inputs["customer_query"]})
    return {"predicted_category": result.categorized_topic}

results_v2 = evaluate(
    routing_target_v2,
    data=DATASET_NAME,
    evaluators=[correctness_evaluator],
    experiment_prefix="router-v2-refund-rule",
)

View the evaluation results for experiment: 'router-v2-refund-rule-9b3d51e0' at: https://smith.langchain.com/o/.../datasets/.../compare?selectedSessions=...
7it [00:10, 1.49s/it]

The dashboard’s compare view puts both experiments side by side, and it’s worth looking at closely, because this view is the cultural artifact that changes how teams argue about prompts: router-baseline at 0.86, router-v2-refund-rule at 1.00. The amber refund row flips from red to green, and the six other rows hold steady. Both halves of that sentence matter: the flip proves the fix worked, and the steady rows prove the new rule didn't quietly break anything else (the failure mode of every "small prompt tweak" ever made). That, not "it looks better to me", is the evidence that earns v2 the production tag.

In the Hub UI, moving the :production tag onto commit #2 is one click, and the production app picks it up on its next pull:

prompt = client.pull_prompt(f"{PROMPT_NAME}:production")
# or pin an exact commit hash for absolute reproducibility:
# prompt = client.pull_prompt(f"{PROMPT_NAME}:abc123de")

[Figure: Git for prompts: commits, tags, CI gate]

The diagram is the whole operating model on one page. The rail across the top is what the Hub stores: immutable commits v1…v4, with :production and :staging as re-pinnable flags. Promotion is moving a flag, not shipping a build. The pipeline underneath is what Meera just did manually, automated: a prompt edit becomes a commit, the commit triggers the eval suite in CI, a regression blocks the PR with the failing examples attached, and a pass moves the flag. Two operational footnotes are baked in: pulled prompts are cached (expect a few minutes of TTL after a re-tag before every running instance converges, and pin commit hashes where you need determinism), and the rollback story is the killer feature. A bad prompt in production is fixed by moving the flag backwards: no build, no release train, no 2 a.m. deploy.

[Figure: Commits are immutable; tags are pointers]

The cultural shift lands quietly but permanently: Prompt edits no longer require a deploy. Push, test, move the tag. Rollbacks are instant.
Move the tag back one commit. Every change has an author, a diff, and a timestamp. When the auditor asks “who changed the routing instructions in March?”, the answer is a commit history, not a shrug. Prompts now ship through the same gate as code. That’s the sentence that finally makes Sanjay smile — question #3, closed. 3. Beyond the Support Desk: The Same Playbook Across Industries Different stakes, same four moves Everything so far used a support bot, but look at the four moves again — trace everything, keep your own audit log, gate changes with evals, version your prompts — and notice that nothing about them is support-specific. Here’s how the same stack earns its keep elsewhere. 🏥 Healthcare — the symptom-triage assistant. A telehealth platform runs an intake bot that asks about symptoms and suggests urgency levels. Traces let clinical reviewers replay exactly why the bot said “routine appointment” instead of “urgent care” — which retrieval surfaced, which guideline was quoted. The custom callback is non-negotiable here: PHI must be scrubbed in-process before any trace leaves the network (HIPAA), and the JSONL log feeds the clinical-governance board. The eval dataset is a library of physician-written vignettes — “crushing chest pain radiating to left arm” must score urgent=1.0 on every model version, forever. A failed experiment blocks release like a failed unit test. 🛒 E-commerce — the shopping copilot. A retailer’s product-Q&A agent answers “will these boots survive a Norwegian winter?” from spec sheets and reviews. Tracing exposes the classic silent killer: the retriever returning the men’s boot specs for a women’s boot question. Cost telemetry per trace reveals that 4% of conversations consume 40% of spend (users pasting entire return-policy PDFs) — Part 1 ’s Scenario 3, found and fixed in a week. 
Before Black Friday, the team re-runs a 500-example eval suite against holiday prompt variants, and merchandising A/Bs a “warmer” tone by moving a Hub tag — zero engineering deploys. ⚖️ Legal — the contract-review copilot. A firm’s associates use an agent that flags risky clauses in NDAs. Privilege means traces can’t leave the building — so they self-host (or run callback-only logging) with the exact same code. The eval dataset is partner-annotated contracts, and the evaluator checks clause-level recall: missing an uncapped-liability clause is a career-limiting false negative. Prompt commits matter for a subtler reason: when a client asks “under what instructions did the AI review my contract in April?” , the firm produces the exact prompt version, by hash. 🎓 EdTech — the AI tutor. A math-tutoring app serves students aged 10–16. Online evaluators continuously score live traces for age-appropriateness and “did the tutor explain rather than hand over the answer.” The audit log doubles as a safety record for school districts. The Hub holds per-grade prompt variants (tutor-prompt:grade6, tutor-prompt:grade10) — pedagogy teams iterate on scaffolding without touching the codebase. A quick map for everyone else: Different stakes, same four moves. The infrastructure doesn’t care whether the disaster is a misrouted refund, a missed liability clause, or a tutor handing a 12-year-old the answer key. 4. LangSmith vs Langfuse: The Decision Matrix The question Meera gets most often from other teams: “should we use LangSmith or Langfuse?” It deserves a real answer, not a shrug — they’re the two most common finalists, and they genuinely optimize for different things. Two valid philosophies, evenly matched Langfuse is the open-source counterweight: MIT-licensed core, self-hosting as a first-class citizen (one Docker Compose for Postgres + ClickHouse + the server), an SDK rebuilt around OpenTelemetry, and transparent unit-based pricing for its cloud. 
LangSmith is the vertically integrated, managed platform: the deepest LangChain/LangGraph integration on the market, plus the production toppings (alerts, online evaluators, automation rules, agent deployment) that Langfuse mostly leaves to you.

Here’s the matrix, distilled from Langfuse’s own comparison, LangChain’s counter-comparison, and independent 2026 write-ups (ZenML, TECHSY). Pricing figures are mid-2026 cloud list prices, so verify before you budget:

[Table: LangSmith vs Langfuse comparison]

And the decision rules, compressed:

- All-in on LangChain/LangGraph, want a managed platform, value alerts + online evals + deployment in one bill? → LangSmith.
- Hard data-sovereignty requirements, open-source mandate, cost-sensitive at scale, or a polyglot stack standardized on OpenTelemetry? → Langfuse (self-hosted or cloud).
- Genuinely torn? The architecture in this series is deliberately portable: the audit-callback lane is vendor-neutral, @traceable ↔ @observe migrate mechanically, and both platforms speak OTel. Pick one, ship, and revisit in six months with real trace volumes; the switching cost is days, not quarters.

5. The Wider Field: Other Alternatives Worth Knowing

LangSmith and Langfuse aren’t the only players. LLM observability has become a multi-billion-dollar category, and a few others deserve a look depending on your shape:

[Table: Which tool to use when]

[Figure: Decision tree for choosing a stack]

The tree compresses this whole section into three questions, asked in the order that actually matters. Sovereignty first: if traces can’t leave your network, you’re self-hosting, and the realistic shortlist is Langfuse, Phoenix, or MLflow, or writing the enterprise check for self-hosted LangSmith. Framework second: deep LangChain/LangGraph investment makes LangSmith’s zero-config integration genuinely hard to beat. Philosophy third: managed suite (LangSmith, Braintrust, Weave, Datadog) versus open-source-first (Langfuse, Phoenix, MLflow, Opik).
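Read as code, the three questions above are an ordered chain of checks where the first match wins. The sketch below is only an illustration of that ordering: `TeamProfile` and `shortlist` are hypothetical names, and the tool lists are copied from the paragraph above rather than from any vendor documentation.

```python
from dataclasses import dataclass


@dataclass
class TeamProfile:
    """Hypothetical inputs to the three-question decision tree."""
    traces_must_stay_in_network: bool
    deep_langchain_investment: bool
    prefers_managed_suite: bool


def shortlist(profile: TeamProfile) -> list[str]:
    """Return candidate tools, asking the questions in the article's order."""
    # 1. Sovereignty first: traces that cannot leave the network force self-hosting.
    if profile.traces_must_stay_in_network:
        return ["Langfuse", "Phoenix", "MLflow", "LangSmith (self-hosted, enterprise)"]
    # 2. Framework second: deep LangChain/LangGraph investment favors LangSmith.
    if profile.deep_langchain_investment:
        return ["LangSmith"]
    # 3. Philosophy third: managed suite versus open-source-first.
    if profile.prefers_managed_suite:
        return ["LangSmith", "Braintrust", "Weave", "Datadog"]
    return ["Langfuse", "Phoenix", "MLflow", "Opik"]


regulated = shortlist(TeamProfile(True, True, True))
langgraph_shop = shortlist(TeamProfile(False, True, False))
oss_first = shortlist(TeamProfile(False, False, False))
print(regulated, langgraph_shop, oss_first, sep="\n")
```

The ordering is the point of the sketch: a team with hard sovereignty requirements gets the self-hosting shortlist even if it is deeply invested in LangChain, because the first question short-circuits the other two.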
And don’t skim past the footnote at the bottom of the tree: the audit-callback lane from Part 1 belongs in every outcome box, because it’s the one component you’ll never migrate, never license, and never lose to an acquisition. (Ask a Helicone user.)

6. When Traces Shouldn’t Leave the Building

A practical privacy ladder, from lightest to heaviest control. Each rung buys more sovereignty and costs more ops effort, so climb only as high as your data requires:

[Figure: The privacy ladder]

1. Redact in the callback before anything egresses (the Part 1 pattern; use Presidio or Comprehend for real coverage).
2. Hide payloads platform-wide: set LANGSMITH_HIDE_INPUTS=true and LANGSMITH_HIDE_OUTPUTS=true, and LangSmith receives structure, timing, and token counts but never the content.
3. Pick your region: EU tenants point LANGSMITH_ENDPOINT at https://eu.api.smith.langchain.com to keep traces in-region.
4. Self-host: LangSmith in your VPC (enterprise), or pair the audit-callback pattern with self-hosted Langfuse/Phoenix/MLflow.
5. Callback-only mode: for the most regulated workloads, switch external tracing off entirely and let your JSONL→SIEM pipeline be the single system of record. Your code doesn’t change; that’s the beauty of the two-lane design.

7. The Production Checklist

Meera’s launch checklist, distilled from the whole series:

✅ LANGSMITH_TRACING=true + a named project per app and environment (-dev, -staging, -prod).
✅ Custom BaseCallbackHandler on every entry point: JSONL, append-only, shipped to the SIEM, with log rotation.
✅ PII redaction inside the callback, before any egress; LANGSMITH_HIDE_INPUTS/OUTPUTS where content isn't needed.
✅ run_id/parent_run_id + session/tenant metadata on every event, so logs and traces correlate.
✅ @traceable on the non-LangChain functions that matter (DB lookups, external APIs) so trace trees have no blind spots.
✅ A versioned eval dataset per critical decision point: start with 20–30 examples, grow it from real incidents.
✅ evaluate() wired into CI; regressions block merges.
✅ Prompts in the Hub, pulled by tag; production pinned, rollback = re-tag.
✅ Online evaluators + alerts on live traffic, so drift pages you before customers do.

Key Takeaways (Part 2)

[Figure: Key takeaways]

And if you keep exactly one artifact from this series, make it this poster, the whole playbook, both parts, on one page:

[Figure: The Four Moves of LLM Observability]

And the story ends where stories like this should: Sanjay signs off, the agent ships, and three weeks later a customer claims the bot promised them a free GPU. Meera pulls the trace, reads the actual conversation, checks which prompt commit was live that day, and replies in four minutes flat. That’s the whole point.

Let’s Stay Connected

If this two-parter saved you a debugging weekend, or surfaced a gap in your team’s LLM stack, I’d genuinely like to hear about it. I’m Prashant Sahu, and I train and consult on GenAI engineering: LLM observability and evaluation, RAG systems, and multi-agent architectures, including the 10-day (70-hour) corporate curriculum this series grew out of.

🔗 Connect with me on LinkedIn: linkedin.com/in/prashantksahu. Say hi, share your observability war stories, or just tell me which part of this series you’d like a deeper dive on.

➕ Follow me here on Medium for the next articles in this series: Langfuse hands-on, the 3-layer agent-evaluation hierarchy, and PII redaction with fairness guardrails are all in the pipeline.

Missed the beginning? Read Part 1 here: observability fundamentals, zero-config tracing, tracing any Python function, and the audit-grade callback.
References & Further Reading (Part 2) Managing prompts in LangSmith — push/pull, commits, tags · Prompt tags changelog LangSmith evaluation docs Langfuse: “LangSmith alternative?” comparison · LangChain: LangSmith vs Langfuse · ZenML independent comparison Tool-landscape roundups: MLflow — top agent observability tools · Firecrawl — best LLM observability tools 2026 LLM Observability with LangSmith — Part 2: Eval Gates, Prompt Versioning & Choosing Your Stack was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Score: 39🌐 MovesJun 13, 2026https://pub.towardsai.net/llm-observability-with-langsmith-part-2-eval-gates-prompt-versioning-choosing-your-stack-e607473320b5?source=rss----98111c9905da---4
Gemini Spark Is the Best AI Agent I've Tested...But It Has a Big Problem
Gemini Spark Is the Best AI Agent I've Tested...But It Has a Big Problem PCMag
Score: 38🌐 MovesJun 13, 2026https://www.pcmag.com/news/gemini-spark-is-the-best-ai-agent-ive-testedbut-it-has-a-big-problem
Larger Context Windows Don’t Fix RAG — So I Built a System That Does
Increasing context size in RAG systems doesn’t improve accuracy for aggregation tasks—it makes errors harder to detect. In this article, I benchmark retrieval-based pipelines against a deterministic full-scan engine across 100,000 rows and show why computation queries must be routed away from RAG entirely. The post Larger Context Windows Don’t Fix RAG — So I Built a System That Does appeared first on Towards Data Science .
Score: 38🌐 MovesJun 13, 2026https://towardsdatascience.com/larger-context-windows-dont-fix-rag-so-i-built-a-system-that-does/
iMideo Launches AI Video Platform With 50+ Tools as Demand Surges
iMideo Launches AI Video Platform With 50+ Tools as Demand Surges azcentral.com and The Arizona Republic
Score: 38🌐 MovesJun 13, 2026https://www.azcentral.com/press-release/story/82564/imideo-launches-ai-video-platform-with-50-tools-as-demand-surges/
How can self-driving cars see better? Make their sensors more human.
Human-eye inspired sensors could help autonomous cars handle changes to light. The post How can self-driving cars see better? Make their sensors more human. appeared first on Popular Science .
Score: 37🌐 MovesJun 13, 2026https://www.popsci.com/technology/self-driving-car-sensors-human-eye/
How to Transfer Chatbot Memory to and From Gemini
Want to switch AI chatbots while keeping all your old information? Here's how to do it.
Score: 36🌐 MovesJun 13, 2026https://www.cnet.com/tech/services-and-software/how-to-transfer-chatbot-memory-chats-to-and-from-gemini/
Token-maxxing: How tech firms' AI staff push backfired
Tech firms have pushed their staff to go 'all-in' on artificial intelligence in an attempt to boost productivity and cut costs - but it ended up having the opposite effect, writes Adam Maguire.
Score: 35🌐 MovesJun 13, 2026https://www.rte.ie/news/business/2026/0613/1578184-token-maxxing-ai/
The ‘AI superstar’ CEO behind a self-driving truck unicorn on why Gen Z is a better hiring bet than industry veterans
The ‘AI superstar’ CEO behind a self-driving truck unicorn on why Gen Z is a better hiring bet than industry veterans Fortune
Score: 35🌐 MovesJun 13, 2026https://fortune.com/2026/06/13/waabi-ceo-raquel-urtasun-future-of-work-embrace-ai-first-gen-z-over-industry-experts-career-advice/
Parse PDFs for RAG Locally with Docling: Rich Tables, No Cloud Upload
Enterprise Document Intelligence [Vol.1 #5ter] - Table cells, OCR, captions, headings: cloud-grade structure, running on your own machine. No key, no per-page bill, nothing leaves the building The post Parse PDFs for RAG Locally with Docling: Rich Tables, No Cloud Upload appeared first on Towards Data Science .
Score: 35🌐 MovesJun 13, 2026https://towardsdatascience.com/parse-pdfs-for-rag-locally-with-docling-rich-tables-no-cloud-upload/
Buying a laptop may soon come with an instant carbon score thanks to AI
Researchers are developing AI agents capable of calculating a product’s carbon footprint in real time, potentially helping consumers make more sustainable purchasing decisions.
Score: 34🌐 MovesJun 13, 2026https://www.digitaltrends.com/computing/buying-a-laptop-may-soon-come-with-an-instant-carbon-score-thanks-to-ai/
New AI model called "Count Anything" does exactly what it says, and that's harder than it sounds
"Count Anything" is intended to be the first AI model capable of counting objects in any type of image, from crowds to cell samples under a microscope, using nothing more than a text prompt. In a comparative test, it cuts the error rate in half compared to previous systems. However, the approach still struggles with extremely dense objects and ambiguous terms. The article New AI model called "Count Anything" does exactly what it says, and that's harder than it sounds appeared first on The Decoder .
Score: 33🤖 ModelsJun 13, 2026https://the-decoder.com/new-ai-model-called-count-anything-does-exactly-what-it-says-and-thats-harder-than-it-sounds/
Building a Gemini Live voice app with React, FastAPI and your own WebSocket protocol
The quickest way to try Gemini Live from a browser is to open a WebSocket straight to Google. For a weekend experiment, that is fine. Continue reading on Towards AI »
Score: 33🌐 MovesJun 13, 2026https://pub.towardsai.net/building-a-gemini-live-voice-app-with-react-fastapi-and-your-own-websocket-protocol-9752bed95182?source=rss----98111c9905da---4
7 AI Tools That Build a One-Person Business in a Weekend — No Staff. No Code. No Stress.
7 AI Tools That Build a One-Person Business in a Weekend — No Staff. No Code. No Stress. entrepreneur.com
Score: 32🌐 MovesJun 13, 2026https://www.entrepreneur.com/growing-a-business/7-ai-tools-that-build-a-one-person-business-in-a-weekend/504719
How AI Video Is Quietly Reshaping Startup Marketing
Stop polishing and start testing what actually works.
Score: 32🌐 MovesJun 13, 2026https://www.inc.com/benjamin-laker/how-ai-video-is-quietly-reshaping-startup-marketing/91350505
Why Human Outreach Still Wins in the Age of AI Recruiting
AI can screen resumes, but it still can’t spot curiosity.
Score: 30🌐 MovesJun 13, 2026https://www.inc.com/entrepreneurs-organization/why-human-outreach-still-wins-in-the-age-of-ai-recruiting/91352664
Sovrynn Launches First Compliance Intelligence Platform for Owner-Operators
Sovrynn Launches First Compliance Intelligence Platform for Owner-Operators azcentral.com and The Arizona Republic
Score: 30🌐 MovesJun 13, 2026https://www.azcentral.com/press-release/story/82601/sovrynn-launches-first-compliance-intelligence-platform-for-owner-operators/
Here’s How AI Agents Can Protect EV Chargers
An AI agent system proposed by researchers in Spain promises to prevent energy theft and damage to EV chargers, as well as the critical energy infrastructure that powers them.
Score: 30🌐 MovesJun 13, 2026https://www.wired.com/story/researchers-in-spain-show-how-ai-agents-can-protect-ev-chargers/
How AI Could Make Workplace Training More Human
The real skill now? Learning how to learn with AI.
Score: 30🌐 MovesJun 13, 2026https://www.inc.com/jay-sullivan/how-ai-could-make-workplace-training-more-human/91352702
🧠 Community Wisdom: How AI is changing product operating models, tracking work stress with Whoop, whether you need a portfolio of AI side projects, marketing for tiny teams, and more
Community Wisdom 189
Score: 29🌐 MovesJun 13, 2026https://www.lennysnewsletter.com/p/community-wisdom-how-ai-is-changing
I had never used a robot lawn mower. After 30 days with Mammotion’s newest model, I can’t go back.
Living (and loving) the robot mower life.
Score: 29🌐 MovesJun 13, 2026https://www.androidauthority.com/mammotion-luba-3-awd-review-3673597/
What excellence looks like in the age of AI
Yesterday, I argued in this column that AI literacy is no longer the differentiator it was 12 months ago.
Score: 28🌐 MovesJun 13, 2026https://www.philstar.com/business/2026/06/14/2534973/what-excellence-looks-age-ai
5 AI ‘influencers’ on social media, from models to a granny friendly with the Biebers
Aitana López is one of the most talked-about influencers at the moment. Since her Instagram account @fit_aitana was created in 2023, featuring lifestyle and beauty content, the pink-haired, 20-something Spanish woman has amassed over 300,000 followers at the time of writing. Aitana hits the gym often, has her own skin care line and has attended major shows at Paris Fashion Week earlier this year. Most importantly, she’s not real. Created by Rubén Cruz, the founder of influencer agency named The...
Score: 28🌐 MovesJun 13, 2026https://www.scmp.com/magazines/style/lifestyle/leisure/article/3355534/5-ai-influencers-social-media-models-granny-friendly-biebers?utm_source=rss_feed
Your LLM Needs a Map!
Before I built the chatbot, I built the map. Here is how I turn three messy databases into one graph an LLM can actually reason about. Part 2 of 4 on building a conversational analytics engine. ~9 min read.
Point a language model at a raw database schema and ask it a real business question. Watch it guess. It sees Sales.SalesOrderHeader, Person.BusinessEntity, and three hundred columns named things like rowguid and TerritoryID. It has no idea that "customer" is one table, "orders" is another, and the two connect through a key it would never find on its own. So it invents a join. The SQL runs. A number comes back. The number is wrong, and nobody notices.
That single failure mode, the confident wrong answer, is the reason this whole system exists. In Part 1, I called it the most dangerous of the seven walls that break text-to-SQL on real data. This article builds the first half of the fix. And it is why I spent most of my time not on the chatbot, but on the unglamorous thing that comes before it: a metadata map I call the domain graph.
TL;DR
- information_schema tells an LLM the shape of your data. It says nothing about the meaning. Meaning is the whole game.
- I build a domain graph offline: introspect the databases, enrich the schema with an LLM once, store the result as a graph plus a vector index.
- The graph holds metadata only. The actual rows never move. Queries run in place.
- The payoff: at runtime, the hard thinking is already done and frozen, so answers are fast, reproducible, and safe.
The setup: three databases that disagree with each other
The system runs on an AdventureWorks dataset (Microsoft’s public sample, so you can verify every name here) spread across three engines:
- PostgreSQL for production and purchasing
- SQL Server for sales and person data
- A CSV-backed store for HR
Different engines. Different SQL dialects. Different conventions.
And the numbers that should worry you:
- Tables: 71
- Columns: 472
- Foreign keys actually declared: 44
That last row is the problem in miniature. Real databases are full of relationships that live in someone’s head and in the application code but were never written down as constraints. The CSV source had zero declared keys, because CSV files have no constraints at all.
So when a user asks “show me orders for the customer Bike World,” the planner needs to know, reliably and ahead of time:
- that “customer” and “orders” map to specific tables in a specific source,
- that those tables join on a specific column,
- that “Bike World” is a value it can resolve to a key,
- and that the person asking is even allowed to see sales data.
None of that is safely derivable at query time by a model staring at a schema dump. All of it is derivable once, offline. That offline derivation is the domain graph build.
The whole pipeline in one picture
[Figure: Domain graph bootstrap]
How to read it:
- Left side, cheap and deterministic. Introspection just reads the databases.
- Middle, expensive and fuzzy. The LLM enrichment runs once, offline, never on the hot path.
- Right side, the artifacts. A graph (for structure) and a vector index (for meaning).
The takeaway: by the time a user shows up, the hard thinking is already done and frozen. The rest of this article is just those five steps, in order.
Step 1: Introspection, and the attributes that actually matter
Introspection is the deterministic read of each database. One introspector per engine, all returning the same typed structure.
One rule I enforced from day one: introspectors read databases and hand back plain data. They never write to the graph. Keeping the reader and the writer separate saved me later, when the graph build became its own concern.
The leverage is in the per-column record.
For every column, I keep:
- name and data_type: the obvious ones
- nullable: later decides whether a join is INNER or LEFT
- is_primary_key / is_foreign_key
- pk_source: either database or heuristic
- pk_confidence: a number from 0 to 1
- cardinality: distinct value count
- null_percentage
- sample_values: real values pulled from the column
Two of those are where a naive introspector quietly fails.
Primary keys, when the database refuses to tell you
Plenty of tables (especially the CSV ones) have no declared primary key. A planner that gives up there is useless. So when there is no database key, I score every column and pick the best candidate. The heuristic is deliberately boring, because boring is auditable:
- Exclusion gate. Throw out columns that look like keys but are not: rowguid, *_guid, *_uuid, modifieddate, *_hash, *_version, and friends.
- Hard requirements. The column must be unique (distinct count equals row count) and complete (zero nulls). Fail either, rejected, no matter how key-like the name.
- Score what survives:
  - Unique and complete: +0.70 (the base)
  - Identifier-style name (*id, *_id, id): +0.15
  - Softer pattern (*_key, *_pk, *_seq): +0.10
  - Integer / serial type: +0.10, UUID / string: +0.05
  - First column in the table: +0.05
- Accept anything scoring 0.70 or higher.
An accepted guess gets pk_source = "heuristic" and pk_confidence set to its score. A declared key gets pk_source = "database" and confidence exactly 1.0.
Why I like this: it never lies about its certainty. A guessed key carries its own doubt in a field you can read downstream.
The sampling trick that finds hidden relationships
Here is the non-obvious one. For most columns I grab a capped sample. But when a column’s cardinality is low, I fetch every distinct value, not a sample.
Why pull all of them? Those complete value sets are how I later discover the foreign keys nobody declared.
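The primary-key heuristic described earlier in this step can be sketched in a few lines of Python. This is a minimal sketch, not the author's code: the ColumnStats record, the function names, the type groupings, and the choice to treat the two name bonuses as mutually exclusive are assumptions made for illustration. Only the exclusion patterns, the hard requirements, and the score weights come from the text.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional

# Name patterns from the exclusion gate (the article's "and friends" is not filled in).
EXCLUDED_NAME_PATTERNS = ("rowguid", "*_guid", "*_uuid", "modifieddate", "*_hash", "*_version")

# Assumed type groupings; the article only says "integer / serial" and "UUID / string".
INTEGER_TYPES = {"integer", "int", "bigint", "smallint", "serial", "bigserial"}
STRING_TYPES = {"uuid", "text", "varchar", "nvarchar", "char"}


@dataclass
class ColumnStats:
    """Hypothetical per-column record handed back by an introspector."""
    name: str
    data_type: str
    position: int        # 0 means first column in the table
    distinct_count: int
    null_count: int


def score_pk_candidate(col: ColumnStats, row_count: int) -> Optional[float]:
    """Return a confidence score, or None if the column is rejected."""
    name = col.name.lower()
    # Exclusion gate: looks like a key, but is not one.
    if any(fnmatchcase(name, pattern) for pattern in EXCLUDED_NAME_PATTERNS):
        return None
    # Hard requirements: unique and complete, regardless of the name.
    if row_count == 0 or col.distinct_count != row_count or col.null_count != 0:
        return None
    score = 0.70  # base for unique and complete
    if name.endswith("id"):                        # covers *id, *_id, id
        score += 0.15
    elif name.endswith(("_key", "_pk", "_seq")):   # softer pattern (assumed exclusive)
        score += 0.10
    dtype = col.data_type.lower()
    if dtype in INTEGER_TYPES:
        score += 0.10
    elif dtype in STRING_TYPES:
        score += 0.05
    if col.position == 0:
        score += 0.05
    return round(min(score, 1.0), 2)


def pick_primary_key(columns, row_count):
    """Pick the best-scoring survivor and tag it as a heuristic guess."""
    scored = []
    for col in columns:
        score = score_pk_candidate(col, row_count)
        if score is not None and score >= 0.70:
            scored.append((score, col.name))
    if not scored:
        return None
    score, name = max(scored)
    return {"column": name, "pk_source": "heuristic", "pk_confidence": score}


# Made-up table statistics, for illustration only.
columns = [
    ColumnStats("rowguid", "uuid", 0, 290, 0),         # unique, but excluded by name
    ColumnStats("employee_id", "integer", 1, 290, 0),  # unique, complete, id-style, integer
    ColumnStats("job_title", "text", 2, 67, 0),        # not unique
]
best = pick_primary_key(columns, row_count=290)
print(best)
```

The point of returning the score alongside the column name is the one the article makes: a guessed key keeps its doubt attached, so downstream code can treat a 0.75 differently from a declared key at 1.0.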
With the full set of values in a CSV column, and the full set of primary-key values in a candidate target, I can compute exactly how much they overlap. High overlap is strong evidence of a real relationship. You cannot get that from a 5-row sample. The introspector spends a little more on low-cardinality columns specifically to make the next step possible.
Everything that comes out gets written to raw_schema.json, the contract between the deterministic world and the LLM world.
Step 2: Letting an LLM do the one thing only it can do
The schema is captured, but it is still just structure. It does not know that SalesOrderHeader is what a human means by "orders," that a salesperson calls revenue "rev," or that "last quarter" is a date range.
This is where an LLM comes in. And the single most important decision in the whole system is when.
I bring the LLM in offline, at build time. Never on the hot path for this work. The model enriches the schema into business knowledge once, the output gets written to versioned files, and at query time everything reads those frozen files. The intelligence is precomputed. That is the trade that makes the system both smart and fast.
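Going back to the overlap measurement from the sampling discussion above, here is a minimal sketch of how such a number might be computed. The article does not give the formula, so the definition used here (the share of a column's distinct non-null values that also appear among the candidate target's primary-key values) is an assumption, as are the function name and the sample data.

```python
from typing import Hashable, Iterable


def value_overlap(child_values: Iterable[Hashable],
                  parent_key_values: Iterable[Hashable]) -> float:
    """Fraction of the child's distinct non-null values found in the parent key set."""
    child = {value for value in child_values if value is not None}
    if not child:
        return 0.0  # an empty or all-null column is no evidence of a relationship
    parent = set(parent_key_values)
    return len(child & parent) / len(child)


# Made-up values: a low-cardinality CSV column and a candidate target's key values.
department_ids_in_csv = [1, 2, 2, 3, 16, None]
department_primary_keys = range(1, 17)

print(value_overlap(department_ids_in_csv, department_primary_keys))
print(value_overlap(["A", "B", "C"], ["A"]))
```

Because the function works on complete distinct-value sets, it only gives a trustworthy number when every distinct value was fetched, which is why the introspector pulls all of them for low-cardinality columns.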
Several generators run in order, each feeding the next:
- Relationship inference (Reads: sample-value overlaps; Produces: foreign keys nobody declared; LLM?: Yes (verified by overlap))
- Entity discovery (Reads: schema + relationships; Produces: which tables are real business entities; LLM?: No (pure heuristic))
- Entity analysis (Reads: discovered entities; Produces: descriptions, synonyms, name expressions, sample questions; LLM?: Yes)
- Business domains (Reads: entities; Produces: groupings like HR, purchasing, production; LLM?: Yes)
- Metrics catalog (Reads: numeric columns; Produces: named measures as real SQL (total revenue, AOV); LLM?: Yes)
- Glossary (Reads: everything above; Produces: “last quarter,” “top,” status filters; LLM?: Yes)
A few things worth pulling out:
- Relationship inference does not trust the model alone. A candidate is accepted only if the LLM’s confidence is at least 0.80 and the measured value overlap is at least 0.50. The model proposes; ground-truth overlap disposes. On this dataset that turned 40 candidates into 34 accepted relationships.
- Entity discovery is deliberately not an LLM step. Deciding what your core nouns are is too important to leave to a sampling temperature. It scores tables instead: a primary join key scores 0.95, a table referenced by 3+ foreign keys scores 0.90, and so on.
- Entity analysis is where the LLM earns its seat. For a person entity, it generates the name expression FirstName || ' ' || LastName. It writes the synonyms that let "client" find CUSTOMER. It drafts the sample questions a real user would ask.
- The glossary stores meaning, never SQL. “Last quarter” becomes a relative date range, not a dialect-specific date function. That keeps meaning portable across three SQL engines.
The trick that makes regeneration safe
There is an operational trap hiding in all this generated config:
- The LLM gets you ~90% of the way.
- A human expert fixes the other ~10%.
- Then you regenerate, and their work is gone.
I have watched teams “solve” this by never regenerating, and living with stale config forever.
My answer is an ownership split enforced at the file level:
- system_owned/ holds generated files. Treated as disposable. Delete them, rerun the build, they come back.
- user_owned/ holds human corrections as small override files.
At load time, a loader reads the generated base, reads the override, and deep-merges them. The override wins on any conflict.
The payoff, from a real incident: an expert once added a couple of synonyms through an override file. Months later the entire base config got regenerated from scratch, and those synonyms simply re-merged on top, untouched. Nobody had to remember anything. That is the difference between config you can maintain and config you are afraid to touch.
Step 3: Materializing the graph
Now the build turns all that validated config into an actual graph in Neo4j. This is the heart of the article.
[Figure: Domain graph node and edge]
How to read it (the four design decisions that matter):
- Every node wears two labels. A table is Domain:Table, a column is Domain:Column. The first label is the layer, the second is the type. This is what lets one Neo4j database hold three graphs (domain, subject, lexical) side by side. The layer label is the namespace.
- The edges carry the relationships an LLM should never invent. FOREIGN_KEY edges hold the real join key plus a confidence. ENTITY_LINK edges carry a cardinality_class (PRESERVING or MULTIPLYING) that records whether following the link fans out rows. The SQL pipeline traverses these for its joins.
- Foreign keys have provenance. A single FK can come from four places (declared, inferred, cross-source, override), merged with a strict precedence, each tagged with where it came from. Deleting a wrong relationship requires a written reason, because a removed join path deserves an audit trail.
- Security lives on the nodes. A table carries its access_level, its restricted_properties, whether it requires_mandatory_filter, and its row-security policies. This is the raw material the runtime uses to enforce access by rewriting SQL.
The load-bearing rule: the LLM never writes the joins. It gets them from the graph, deterministically, or the query fails loudly with “no path found” rather than hallucinating a relationship. A made-up join is a silently wrong answer, and silently wrong is the worst failure an analytics system has.
Two engineering footnotes: Neo4j properties must be flat scalars, so lists get JSON-serialized in and parsed out. And every insert is an idempotent MERGE on a stable id, so I can rerun the build over an existing graph without creating duplicates. Rebuilds are routine, not surgery.
Step 4: The semantic index, because a graph cannot read minds
The graph is perfect at one kind of question: given a concept, how does it connect? Given CUSTOMER, what table is it, how does it reach ORDERS? That is exact-key lookup and traversal.
It is hopeless at the question that actually shows up first: the user typed “orders,” so which concept did they mean? That is similarity, not traversal. A graph has no notion of “close.”
So I built a second index in Qdrant that does nothing but answer “what did this phrase most likely mean.” Six collections:
- C1 semantic entities (One point is: an entity type + its description + synonyms; Queried at runtime?: Yes)
- C2 semantic metrics (One point is: a named metric; Queried at runtime?: Yes)
- C3 glossary (One point is: a glossary term; Queried at runtime?: Yes)
- C4 columns (One point is: a single column; Queried at runtime?: Built, not yet queried)
- C5 entity values (One point is: a sample value; Queried at runtime?: Built, not yet queried)
- C6 vocabulary (One point is: a user’s personal alias; Queried at runtime?: Yes)
Each point is text embedded into a 1536-dimension vector (OpenAI text-embedding-3-small, cosine distance).
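To make "similarity, not traversal" concrete, here is a small sketch of a cosine-nearest lookup. It is only an illustration of the idea: the three-dimensional vectors are hand-made stand-ins for real 1536-dimension embeddings, the function names are hypothetical, and a real deployment would ask the vector store for nearest neighbours rather than loop in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_concept(query_vector, points):
    """Return the name of the stored point most similar to the query vector."""
    return max(points, key=lambda name: cosine_similarity(query_vector, points[name]))


# Toy vectors, NOT real embeddings: one stored point per entity type.
toy_points = {
    "SALESORDERHEADER": [0.9, 0.1, 0.0],
    "CUSTOMER": [0.1, 0.9, 0.1],
    "PRODUCT": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # stand-in for the embedding of the user's word "orders"

print(nearest_concept(toy_query, toy_points))
```

The lookup returns a concept name and nothing else. Turning that name into a table, its columns, and its join paths is an exact-key job, which is the part the graph handles.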
The clever one is C1, where the embedded text is built from a template:
"{ENTITY_TYPE}: {description}. Also known as: {synonyms}. Common questions: {sample questions}"
That template is exactly why “orders” finds SALESORDERHEADER at runtime. The user's word is cosine-closest to that blob of description and synonyms, so the match happens by meaning, not by spelling. The graph then does the exact lookup to the real table.
The two stores form a pipeline, and neither can do the other’s job:
- Qdrant maps language to concept.
- Neo4j maps concept to structure.
"orders" → (similarity) → SALESORDERHEADER → (exact lookup) → Sales.SalesOrderHeader + columns + FK to Customer
Shipped vs. next: I will be straight about this, because pretending otherwise is the wrong kind of credibility. Four of the six collections are live (entities, metrics, glossary, vocabulary). Two are built on every bootstrap but not yet read (columns, entity values). They are deferred, with a note in the code. The build is cheap and wiring them in is a small, isolated change for later. If you assumed every collection was on the hot path, it is not, and I would rather tell you.
Step 5: Making the runtime never pay for any of this
When a real query runs, it does not query Neo4j for schema metadata. It reads an in-memory snapshot loaded once at startup, then served entirely from RAM.
The cache is blunt about its contract, and I wrote it that way:
- Neo4j is the single source of truth. No YAML fallback.
- It fails fast and loud if Neo4j is missing at startup.
- After it loads, there is no per-query I/O. Every lookup is a dictionary read.
- It preserves the security fields (access_level, restricted_properties), because dropping them would quietly reopen an access-control gap.
[Figure: Domain graph runtime read]
How to read it:
- Two read surfaces, not one. The in-RAM snapshot is the hot path. A live Cypher reader handles the rare, occasional schema lookups.
- The hot path never touches the network. Schema reads happen in the tightest loops of the planner, sometimes dozens of times per question.
Why it matters: if each of those were a round-trip to the graph, every query would crawl. This one placement decision quietly determines whether the whole system feels instant or sluggish.
The rule: the graph is the source of truth and gets read in bulk at startup; the hot path reads a local snapshot. The domain graph is immutable between builds, so the snapshot never goes stale mid-session.
Key takeaways
If you build one of these, steal the sequence. The ordering is the insight.
- Introspect deterministically, and keep the reader honest about its own confidence. A guessed key should carry a number that says it was guessed.
- Spend your LLM budget offline. The language-to-meaning work belongs at build time, written to versioned files. An LLM in your hot path for things you could precompute is a bug, not a feature.
- Separate what a machine generates from what a human corrects, and merge them at load time. It is the only way to keep config both fresh and trustworthy.
- Use a graph for structure and a vector index for meaning. Let them form a pipeline, not a competition: language to concept, then concept to structure.
- Put security metadata in the graph at build time, so enforcement at query time has nothing to negotiate with a language model about.
That is the map. In Part 3, I get to the part everyone actually asks about: how a plain-English question becomes a safe query end to end, how the system resolves the specific things a user names (“Bike World,” “my team”), and the one architectural line that made me comfortable putting this in front of real users with real permissions.
Next up, Part 3: From Schema to Conversation, building the subject graph and the query pipeline.
Your LLM Needs a Map! was originally published in Towards AI on Medium.
Score: 28🌐 MovesJun 13, 2026https://pub.towardsai.net/your-llm-needs-a-map-f145583abf61?source=rss----98111c9905da---4
My John-Wick-Inspired AI Agent Team is Dangerous – Let’s Create Yours
If you’re not having fun learning AI, you’re doing it wrong.
Score: 28🌐 MovesJun 13, 2026https://www.inc.com/joe-procopio/my-john-wick-inspired-ai-agent-team-is-dangerous-lets-create-yours/91360537
WareMatch Now on ChatGPT App Store, Putting 3PL Discovery Into AI Conversations
WareMatch Now on ChatGPT App Store, Putting 3PL Discovery Into AI Conversations USA Today
Score: 27🌐 MovesJun 13, 2026https://www.usatoday.com/press-release/story/34737/warematch-now-on-chatgpt-app-store-putting-3pl-discovery-into-ai-conversations/
Funny Side Up: ChatGPT’s Bad at Shopping and Facial Recognition can go Horribly Wrong
OpenAI had taken great pride while launching its shopping assistant last September but now it has some egg on its face as reports indicate that ChatGPT is getting duped by scamsters and in turn potentially defrauding customers. The AI chatbot actually recommended cloned versions of defunct websites capitalising on known brands to steal money, says […]
Score: 27🌐 MovesJun 13, 2026https://cxotoday.com/editors-picks/funny-side-up-chatgpts-bad-at-shopping-and-facial-recognition-can-go-horribly-wrong/?utm_source=rss&utm_medium=rss&utm_campaign=funny-side-up-chatgpts-bad-at-shopping-and-facial-recognition-can-go-horribly-wrong
HTAG Analytics Makes Australian Property Intelligence Discoverable to AI Agents Worldwide via MCP Registry
HTAG Analytics Makes Australian Property Intelligence Discoverable to AI Agents Worldwide via MCP Registry azcentral.com and The Arizona Republic
Score: 26🌐 MovesJun 13, 2026https://www.azcentral.com/press-release/story/82563/htag-analytics-makes-australian-property-intelligence-discoverable-to-ai-agents-worldwide-via-mcp-registry/
'I find it sycophantic, but it gives me dopamine hits’ — the thing I dislike most about AI is exactly what some users love
I quizzed people who turn to AI for reassurance and wasn't expecting their answers.
Score: 26🌐 MovesJun 13, 2026https://www.techradar.com/ai-platforms-assistants/i-find-it-sycophantic-but-it-gives-me-dopamine-hits-the-thing-i-dislike-most-about-ai-is-exactly-what-some-users-love
BUSINESSNEXT Appoints Hitesh Sahijwaala to Lead APAC & MEA, Accelerating Autonomous Financial Institutes Expansion
BUSINESSNEXT today announced the appointment of Hitesh Sahijwaala as Head — APAC & MEA. He will own Market expansion, Strategic partnerships, Enterprise Sales and Delivery across the above regions, building on BUSINESSNEXT’s momentum as financial institutions move from digital to AI-native, autonomous operations. Sahijwaala joins at an inflection point for the industry. Banks, insurers, and […]
Score: 25🌐 MovesJun 13, 2026https://cxotoday.com/media-coverage/businessnext-appoints-hitesh-sahijwaala-to-lead-apac-mea-accelerating-autonomous-financial-institutes-expansion/?utm_source=rss&utm_medium=rss&utm_campaign=businessnext-appoints-hitesh-sahijwaala-to-lead-apac-mea-accelerating-autonomous-financial-institutes-expansion
What a longtime Google AI leader told UW computer science students at their graduation
Jeff Dean, Google's chief scientist and a UW alum, returned to campus Friday with an optimistic but clear-eyed message about AI for Allen School graduates — many of them headed into the industry to help shape the future of technology.
Score: 25🌐 MovesJun 13, 2026https://www.geekwire.com/2026/what-a-longtime-google-ai-leader-told-uw-computer-science-students-at-their-graduation/
Photiu Launches Free AI Image Upscaler to Enhance Photo Quality Without Sign-Up
Photiu Launches Free AI Image Upscaler to Enhance Photo Quality Without Sign-Up azcentral.com and The Arizona Republic
Score: 24🌐 MovesJun 13, 2026https://www.azcentral.com/press-release/story/82551/photiu-launches-free-ai-image-upscaler-to-enhance-photo-quality-without-sign-up/
I let ChatGPT analyze my personality through my favorite fictional characters — it revealed more about me than I realized
I let ChatGPT analyze my personality through my favorite fictional characters — it revealed more about me than I realized Tom's Guide
Score: 23🌐 MovesJun 13, 2026https://www.tomsguide.com/ai/i-let-chatgpt-analyze-my-personality-through-my-favorite-fictional-characters-it-revealed-more-about-me-than-i-realized
I spent 48 hours with Siri AI on macOS Golden Gate — here’s what I like (and what I don’t)
I spent 48 hours with Siri AI on macOS Golden Gate — here’s what I like (and what I don’t) Tom's Guide
Score: 22🌐 MovesJun 13, 2026https://www.tomsguide.com/ai/apple-intelligence/i-spent-48-hours-with-siri-ai-on-macos-golden-gate-heres-what-i-like-and-what-i-dont
This Tool Lets You Compare ChatGPT, Claude, and Gemini Responses in One Place, and It’s Only $60 Right Now
This Tool Lets You Compare ChatGPT, Claude, and Gemini Responses in One Place, and It’s Only $60 Right Now PCMag
Score: 22🌐 MovesJun 13, 2026https://www.pcmag.com/deals/this-tool-lets-you-compare-chatgpt-claude-and-gemini-responses-in-one-place
REXTRIX LAUNCHES THE WORLD'S FIRST FREE AI-NATIVE MINI-GAME PLATFORM AT SUPER AI 2026, NAMES JOÃO PEDRO BRAND AMBASSADOR
REXTRIX LAUNCHES THE WORLD'S FIRST FREE AI-NATIVE MINI-GAME PLATFORM AT SUPER AI 2026, NAMES JOÃO PEDRO BRAND AMBASSADOR azcentral.com and The Arizona Republic
Score: 22🌐 MovesJun 13, 2026https://www.azcentral.com/press-release/story/82517/rextrix-launches-the-worlds-first-free-ai-native-mini-game-platform-at-super-ai-2026-names-joao-pedro-brand-ambassador/
Which Coworkers Stress You Out Most? This Engineer Built an AI ‘Leaderboard’ to Rank Them Like a Top 10 List
A software developer linked his wearable tech’s heart-rate data to his job activities to document the stress fluctuations caused by coworkers and meetings. Now he’s got a running list of the worst offenders.
Score: 20🌐 MovesJun 13, 2026https://www.inc.com/kevin-haynes/which-coworkers-stress-you-out-most-this-engineer-built-an-ai-leaderboard-to-rank-them-like-a-top-10-list/91360708
Crescendo Consultants Launches Free AI Visibility Analysis to Help Small Businesses Get Recommended by AI Engines
Crescendo Consultants Launches Free AI Visibility Analysis to Help Small Businesses Get Recommended by AI Engines USA Today
Score: 20🌐 MovesJun 13, 2026https://www.usatoday.com/press-release/story/34741/crescendo-consultants-launches-free-ai-visibility-analysis-to-help-small-businesses-get-recommended-by-ai-engines/
Icepick Web Design & SEO Adds AI Search Optimization to Local SEO Packages for Service Contractors
Icepick Web Design & SEO Adds AI Search Optimization to Local SEO Packages for Service Contractors azcentral.com and The Arizona Republic
Score: 19🌐 MovesJun 13, 2026https://www.azcentral.com/press-release/story/82603/icepick-web-design-seo-adds-ai-search-optimization-to-local-seo-packages-for-service-contractors/
Fanalyze Launches AI Sports Intelligence Platform Ahead of 2026 World Cup
Fanalyze Launches AI Sports Intelligence Platform Ahead of 2026 World Cup USA Today
Score: 18🌐 MovesJun 13, 2026https://www.usatoday.com/press-release/story/34734/fanalyze-launches-ai-sports-intelligence-platform-ahead-of-2026-world-cup/
World Cup AI predictor now lets users ask daft what-ifs
Spoiler: It doesn't end well for Team Register
Score: 16🌐 MovesJun 13, 2026https://www.theregister.com/offbeat/2026/06/13/world-cup-ai-predictor-now-lets-users-ask-daft-what-ifs/5254853
6 COMPUTER SCIENCE TRIPOS Part II – 2026 – Paper 8 E-Commerce (sam56) You are considering creating an Agentic AI browser pro
6 COMPUTER SCIENCE TRIPOS Part II – 2026 – Paper 8 E-Commerce (sam56) You are considering creating an Agentic AI browser pro cl.cam.ac.uk
Score: 15🌐 MovesJun 13, 2026https://www.cl.cam.ac.uk/teaching/exams/pastpapers/y2026p8q6.pdf