AI News Archive: June 21, 2026 — Part 3

Sourced from 500+ daily AI sources, scored by relevance.

Anthropic's Mythos mess just keeps getting more complicated
It sure seems a lot like the Trump administration is just bulling Anthropic for not acquiescing to its every move, and its the cybersecurity community who'll suffer for it
Score: 40🌐 MovesJun 21, 2026https://www.theregister.com/ai-and-ml/2026/06/22/anthropics-mythos-mess-just-keeps-getting-more-complicated/5258577
How AI-Driven CRMs Help Dealers Close More Leads
How AI-Driven CRMs Help Dealers Close More Leads Automotive News
Score: 40🌐 MovesJun 21, 2026https://www.autonews.com/resource_center/CDK/an-how-ai-driven-crms-help-dealers-close-more-leads/
2026 Crash Course in AI for Teachers
2026 Crash Course in AI for Teachers Carnegie Mellon University Computer Science Department
Score: 40🌐 MovesJun 21, 2026https://www.csd.cs.cmu.edu/calendar/2026-06-22/2026-crash-course-in-ai-for-teachers
My teenage son is using AI to do his math homework. I'm now helping his school write its first AI policy.
My teenage son is using AI to do his math homework. I'm now helping his school write its first AI policy. Business Insider
Score: 40🌐 MovesJun 21, 2026https://www.businessinsider.com/teenager-uses-ai-homework-mom-helped-school-write-ai-policy-2026-6
Manuela Locarno Ajayi: AI in navigation is the road to the future
Manuela Locarno Ajayi: AI in navigation is the road to the future Automotive News
Score: 39🌐 MovesJun 21, 2026https://www.autonews.com/podcasts/shift/an-shift-podcast-manuela-locarno-ajayi-0621/
AI may change the world. That doesn’t make it a good investment
Whether AI is a bubble will be known only after the fact. For investors, the more important question is how much risk they take while waiting for the answer.
Score: 39🌐 MovesJun 21, 2026https://www.livemint.com/money/personal-finance/ai-artificial-intelligence-ai-bubble-technology-stocks-dot-com-crash-fran-ois-chollet-investing-stock-market-11781853670965.html
OMODA JAECOO previews AI-powered SIVP intelligent parking technology for Saudi Arabia
As part of the brand’s next-generation intelligent mobility ecosystem
Score: 38🌐 MovesJun 21, 2026https://www.zawya.com/en/press-release/companies-news/omoda-jaecoo-previews-ai-powered-sivp-intelligent-parking-technology-for-saudi-arabia-k2pbe990
How AI is helping fans follow the action and golfers navigate the rules at the U.S. Open
The U.S. Open at Shinnecock Hills showcases how the USGA is using AI technology to transform golf.
Score: 38🌐 MovesJun 21, 2026https://www.nbcnews.com/sports/golf/us-open-golf-artificial-intelligence-rcna350498
Vector Databases: 20 Scenario-Based Questions & Solutions (Part 1 of 2)
AI Engineer Interview Preparation Continue reading on Towards AI »
Score: 38🌐 MovesJun 21, 2026https://pub.towardsai.net/vector-databases-20-scenario-based-questions-solutions-part-1-of-2-1c6b686dad2c?source=rss----98111c9905da---4
Contempo Themes Launches AI-First Real Estate CRM with CRM-Wide AI Assistant
Contempo Themes Launches AI-First Real Estate CRM with CRM-Wide AI Assistant USA Today
Score: 37🌐 MovesJun 21, 2026https://www.usatoday.com/press-release/story/34104/contempo-themes-launches-ai-first-real-estate-crm-with-crm-wide-ai-assistant/
How transparent is DiffusionGemma (and why it matters)
Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+ *Primary Contributor +Advising Paper here: https://arxiv.org/abs/2606.20560 Overview In a recent collaboration between the GDM interpretability team and the GDM text diffusion team, we performed a transparency audit of DiffusionGemma, GDM's new text diffusion model. Overall, we find that DiffusionGemma is not significantly less transparent than Gemma. Gemma and DiffusionGemma perform similarly on monitorability evaluations . Although naively DiffusionGemma has a much larger opaque serial depth , we can apply the logit lens to intermediate vectors and ablate non-interpretable information without harming performance. This implies that these intermediate nodes are interpretable, which reduces the opaque serial depth to be similar to that of Gemma. However, even though the variables that the model uses at different steps are interpretable, this does not necessarily mean that we understand the algorithm that the model uses to reach the final answer. We thus distinguish between variable transparency, which we define as whether we can understand snapshots of the model's computation, and algorithmic transparency, which we define as whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. By default, algorithmic transparency is much lower for a text diffusion model. In an autoregressive model, the model proceeds through its reasoning in order, token by token; when each token is generated, we know the exact state the model was in, and can make inferences about why it generated a certain token. On the other hand, in a single "canvas" a diffusion model generates all tokens at once, and the causal relationship between different tokens is unclear; a diffusion model can e.g. use tokens at the end of the canvas to help it figure out what tokens to generate earlier in the canvas. In a series of case studies, we study these and other phenomena that are unique to text diffusion models, including non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. We make progress on algorithmic transparency and believe we now understand some of the algorithmic "styles" that DiffusionGemma uses, but we still think that it is less algorithmically transparent than corresponding autoregressive LLMs. We also include 24 open problems that we would be excited for the community to investigate. Why is this relevant for AI safety? Currently, CoT monitoring is a load-bearing aspect of many safety cases, but future models may perform more of their reasoning in latent spaces. We think that developers should perform transparency audits of new model architectures that perform larger fractions of their computation in a latent space. Thus, even though DiffusionGemma is itself not concerning from a transparency perspective, we are excited about this work because of the precedent it sets for performing these sorts of evaluations. Many of our experiments, including the opaque serial depth and monitorability evaluations, should be able to be straightforwardly applied to future latent reasoning architectures. If future latent reasoning models regress on these metrics, we will need new techniques that can translate from latent reasoning into natural language. Thus, we are particularly excited about techniques like Natural Language Autoencoders and Activation Oracles that can translate activations into natural text, and we hope that the interpretability community continues to prioritize their development. Short summary of main results: We first present a diagram of the DiffusionGemma architecture: As expected, the opaque serial depth for DiffusionGemma is much larger (28.6X) the corresponding Gemma model. But if we were able to show the intermediates were interpretable, this would drop to 1.1X. When we replace the intermediate self-conditioning vectors with their top-k or top-p tokens, we maintain most performance on downstream benchmarks: For the top-p interventions, these top tokens are mostly equal to or semantically similar to nearby tokens in the final canvas tokens. Thus, they are largely interpretable. Note that even the 10% of tokens in the first few canvases that do not fall into these categories may still be interpretable; they may be guesses for other meanings of the sentence, or may be interpretable intermediates that the model is using to reason. We are interested in further work that investigates intermediate tokens the model is confident in that are not similar to any final tokens. Monitorability, a key downstream application of transparency, is similar between Gemma and DiffusionGemma: We next introduce three views that we use to study individual rollouts and phenomena: One interesting phenomena is retroactive self-correction: we ask DiffusionGemma to count the number of perfect squares between 400 and 800 and give its answer first followed by the list of squares. The model will guess wrong, list the squares, and then in subsequent denoising steps, alter its earlier output to correct its mistake. Another interesting phenomenon is "token smearing": when DiffusionGemma is confident that a token will exist somewhere, but doesn't know exactly where the token will go, it will maintain a "smeared" probability distribution over adjacent positions. Abstract LLM reasoning transparency is a critical affordance for understanding model decisions, mitigating misuse and misalignment, and debugging surprising model behaviors. However, DiffusionGemma performs a larger fraction of its computation in a continuous latent space; does this make its reasoning less transparent? We study this question by decomposing transparency into two components: variable transparency, whether we understand intermediate snapshots of a model's computational state; and algorithmic transparency, whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. Naively, DiffusionGemma has poor variable transparency: its opaque serial depth, the amount of serial computation that occurs in between interpretable model states, seems at first 28.6X higher than the corresponding autoregressive Gemma 4 model. However, we show that we can map the information flowing between denoising steps through an interpretable token bottleneck with no decrease in downstream performance. Treating these intermediate states as interpretable reduces the opaque serial depth to just 1.1X that of Gemma 4. Algorithmic transparency is harder for diffusion models than for autoregressive models because all token predictions in the canvas can change at every denoising step, giving the model the power to implement complicated distributed algorithms during the denoising process. To begin bridging this gap, we conduct a suite of interpretability case studies, uncovering initial evidence of novel diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, we test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks. We find that DiffusionGemma is similarly monitorable to Gemma 4. Discuss
Score: 37🌐 MovesJun 21, 2026https://www.lesswrong.com/posts/zoYXpdaMgFT43Wc24/how-transparent-is-diffusiongemma-and-why-it-matters
Opinion | Trust the Free Market on AI
By micromanaging platforms, the White House picks winners and losers.
Score: 36🌐 MovesJun 21, 2026https://www.wsj.com/opinion/trust-the-free-market-on-ai-5aa49ee0?mod=rss_Technology
Prompthen Defines ‘Interface-for-AI’: Software Built for Agents, Not Clicks
Prompthen Defines ‘Interface-for-AI’: Software Built for Agents, Not Clicks azcentral.com and The Arizona Republic
Score: 36🌐 MovesJun 21, 2026https://www.azcentral.com/press-release/story/85552/prompthen-defines-interface-for-ai-software-built-for-agents-not-clicks/
Thinking Tokens Are Not Free. Most Pipelines Treat Them Like They Are.
The hidden ops problem inside agentic pipelines using OpenAI GPT-5.x and o-series, Claude Opus/Sonnet 4.x, Gemini 3/2.5 reasoning models… Continue reading on Towards AI »
Score: 36🌐 MovesJun 21, 2026https://pub.towardsai.net/thinking-tokens-are-not-free-most-pipelines-treat-them-like-they-are-846708fdcef1?source=rss----98111c9905da---4
Routing Strategies: How AI Teams Select the Right Language Model
Routing Strategies: How AI Teams Select the Right Language Model USA Today
Score: 35🌐 MovesJun 21, 2026https://www.usatoday.com/press-release/story/34201/routing-strategies-how-ai-teams-select-the-right-language-model/
Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export
Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export MarkTechPost
Score: 35🌐 MovesJun 21, 2026https://www.marktechpost.com/2026/06/20/crawlee-for-python-build-a-web-crawling-pipeline-with-robots-handling-link-graphs-and-rag-chunk-export/
Opinion | AI Is a Boon to Ambitious Recent Grads
The opportunities for entry-level employees are new and different.
Score: 33🌐 MovesJun 21, 2026https://www.wsj.com/opinion/ai-is-a-boon-to-ambitious-recent-grads-d4f983a1?mod=rss_Technology
Ken Cardwell on Identity, AI, and Who Gets to Represent You: TEDxBeverlyGrove 2026
Ken Cardwell on Identity, AI, and Who Gets to Represent You: TEDxBeverlyGrove 2026 azcentral.com and The Arizona Republic
Score: 33🌐 MovesJun 21, 2026https://www.azcentral.com/press-release/story/85544/ken-cardwell-on-identity-ai-and-who-gets-to-represent-you-tedxbeverlygrove-2026/
When a Sequence Is Not Enough: What Knowledge Graphs Add to Agentic Systems
The series closer — and the failure that flat state can never fix. Continue reading on Towards AI »
Score: 33🌐 MovesJun 21, 2026https://pub.towardsai.net/when-a-sequence-is-not-enough-what-knowledge-graphs-add-to-agentic-systems-782af7e8bc9d?source=rss----98111c9905da---4
Minnesota faith leaders weigh AI’s place in church life
Artificial intelligence technology is reshaping how people work, communicate and seek advice, raising questions that many religious leaders see as both practical and theological.
Score: 32🌐 MovesJun 21, 2026https://www.startribune.com/minnesota-faith-leaders-weigh-ais-place-in-church-life/601849716
MCP for LangGraph Developers: From Basics to Production
The seven-level staircase Part 6 of the LangGraph Mental Model series, a ground-up introduction to the Model Context Protocol, building toward full integration with everything from Parts 1–6 For other parts of the series : Part 0 , Part 1 , Part 2 , Part 3 , Part 4 , Part 5 , What this article assumes: You’re comfortable with the seven-module LangGraph structure and the idea of a @tool-decorated function (Part 1, Module 3). That's the only prerequisite. MCP is a new piece of infrastructure, not a new way of thinking about graphs, so we start from zero on MCP itself, and only reconnect to LangGraph once the concept is solid. Why This Article Exists In Part 1, you learned to write tools like this: @tool def search_web(query: str) -> str: """Search the web for current information.""" return f"Search results for: {query}" This works great, until it doesn’t scale. Imagine you’re building five different agents across five different projects, and every single one needs a “search the company database” tool. With the pattern above, you’d write that tool five separate times, in five separate codebases, in whatever language each project happens to use. If the database schema changes, you update five places. If you want to share that tool with a teammate building a different kind of AI app, maybe not even using LangGraph, they can’t use your @tool function. It's tied to LangChain's Python ecosystem. This is the N×M integration problem : N different tools (databases, APIs, file systems) need to connect to M different AI applications (your LangGraph agent, a teammate’s custom agent, Claude Desktop, an IDE assistant). Without a shared standard, you end up writing N×M custom integrations. The Model Context Protocol (MCP) solves this by being a universal, open standard for connecting AI applications to tools and data. Build a tool once, as an MCP server, and any MCP-compatible application, our LangGraph agent, Claude Desktop, Cursor, a teammate’s custom agent, can use it immediately, with zero custom integration code. Think of it as a USB-C port for AI applications: one standard connector, any compatible device on either end. This article takes you from “what even is MCP” to building and connecting a production-grade MCP server to a real LangGraph agent, in five levels, each adding exactly one new idea. Level 1: The Core Concept — What MCP Actually Is The Mental Model Before any code, build the picture. Imagine a hotel concierge desk . The concierge (your AI application) doesn’t personally know how to book a restaurant, hail a cab, or arrange theater tickets. Instead, the concierge has a list of trusted local services , partners who specialize in each task. When a guest asks for a dinner reservation, the concierge picks up a dedicated phone line to the restaurant booking service, makes the request in a standard format both sides understand, and relays the answer back to the guest. MCP is that standard phone line and that standard request format. The AI application is the concierge. The MCP server is a specialized service (a restaurant booker, a cab company, a ticket office). And MCP itself is the shared language and connection protocol that lets any concierge talk to any service, without the concierge needing to learn each service’s internal phone system from scratch. The Three Roles Every MCP system has exactly three participants. Get comfortable with these three words, they are the foundation of everything else in this article. Host :the AI application the end user actually interacts with. In our series so far, this is your LangGraph agent. (Other real-world examples: Claude Desktop, Cursor, VS Code with an AI copilot.) The host manages the conversation, decides when a tool is needed, and shows results to the user. Client :lives inside the host. Each client maintains a dedicated, one-to-one connection to exactly one server. If your host connects to three different MCP servers, it spins up three separate clients internally — one per server. This 1:1 mapping is a deliberate security boundary: a client for the file system server can’t accidentally leak data to the database server. Server :a lightweight, focused program that exposes specific capabilities. A server might wrap a database, a file system, a web search API, or your company’s internal CRM. Critically: an MCP server doesn’t know or care which host is calling it. The same server works for your LangGraph agent today and a completely different AI application tomorrow, unmodified. What MCP Communicates With: JSON-RPC 2.0 Every message between a client and a server is a JSON-RPC 2.0 message — a simple, well-established format with exactly three message types: Request :“please do this and tell me the result.” Always includes a unique id so the response can be matched back to it. {"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "search_db", "arguments": {"query": "Q3 revenue"}}} Response — the answer to a request. Contains either a result (success) or an error (failure), and the same id as the original request. {"jsonrpc": "2.0", "id": 1, "result": {"content": [{"type": "text", "text": "Q3 revenue was $4.2M"}]}} Notification :a one-way message that expects no response. Used for things like “my list of available tools just changed.” You will almost never write raw JSON-RPC by hand, the SDKs handle this entirely. But understanding that this is the wire format underneath everything demystifies a lot of what follows. It’s the same reason understanding HTTP underneath a REST API makes you a better API developer, even if you never write raw HTTP by hand. Level 2: The Three Primitives — What a Server Can Offer An MCP server exposes its capabilities through exactly three types of primitives. This is the single most important conceptual table in this article — almost every design decision when building a server comes down to picking the right one of these three. Primitive Purpose Real-world analogy Who decides to use it Tool Executes an action, may have side effects A POST endpoint The AI model decides Resource Provides read-only data A GET endpoint The application/user decides Prompt A reusable instruction template A pre-written form letter The user explicitly invokes it Tools — Actions the Model Can Take A tool is a function the LLM can decide to call during reasoning, exactly like the @tool-decorated functions from Part 1. The difference is where that function lives: instead of being defined inside your LangGraph codebase, it's defined inside an MCP server, and your LangGraph agent discovers and calls it over the protocol. Use a tool for anything that does something , runs a calculation, queries a database, sends an email, modifies a file. Resources — Read-Only Context A resource is data the application can pull in to give the model context, without the model needing to “decide” to call it the way it does with a tool. Resources are addressed by a URI, similar to how a web page is addressed by a URL. Use a resource for anything that’s read-only context ,a configuration file, a document, a list of available data sources the model might want to know about before deciding what to search for. Prompts — Reusable Templates A prompt is a pre-written template that structures how a user (or the host application) kicks off a particular workflow. Unlike tools, prompts are typically user-invoked , not model-invoked — think of them as the MCP equivalent of slash commands. Use a prompt for common, repeatable workflows ,“summarize this document,” “review this code for security issues” ,where you want consistent, well-engineered instructions every time, rather than relying on the user to phrase the request well. The Decision Rule When you’re not sure which primitive to use for something you’re building, ask: Does this change anything, or just provide information? If it changes something (creates, updates, deletes, sends, executes) → Tool . If it’s read-only data the model might want as background → Resource . If it’s a reusable instruction template a user explicitly kicks off → Prompt . Level 3: Building Your First MCP Server with FastMCP Time to write real code. We’ll use FastMCP , the dominant Python framework for building MCP servers, it turns the protocol’s considerable complexity (JSON-RPC handling, schema generation, lifecycle management) into a handful of decorators, in exactly the same spirit as how @tool simplified LangChain tool creation in Part 1. Installation pip install fastmcp Your First Server: One Tool # ── math_server.py ─────────────────────────────────────────── from fastmcp import FastMCP # Create the server - this object is the container for everything below mcp = FastMCP(name="Math Server") @mcp.tool def add(a: int, b: int) -> int: """Add two numbers together.""" return a + b @mcp.tool def multiply(a: int, b: int) -> int: """Multiply two numbers together.""" return a * b if __name__ == "__main__": mcp.run() # Defaults to stdio transport This is a complete, valid, runnable MCP server. Notice the parallels to Part 1’s @tool pattern: FastMCP reads your function's type hints to generate the input schema, and reads your docstring to generate the description the LLM sees. Exactly the same rule from Part 1 applies here with even higher stakes: a vague docstring means the model won't know when to call this tool, and now that vagueness affects every application that ever connects to this server, not just one LangGraph agent. Adding a Resource # ── Adding to math_server.py ───────────────────────────────── @mcp.resource("config://server-info") def server_info() -> dict: """Provides metadata about this server's capabilities.""" return {"version": "1.0", "operations": ["add", "multiply"]} Resources are addressed by a URI you define (config://server-info here). When a client requests that URI, FastMCP calls your function and returns the result. The function only runs when requested — this is lazy evaluation, so you're not computing resource data the model never ends up using. A Resource Template (Parameterized Resources) Sometimes you want a resource whose content depends on an argument — like looking up one specific record rather than a fixed configuration blob. @mcp.resource("history://{operation_id}") def get_calculation(operation_id: str) -> dict: """Retrieve a past calculation by its ID.""" # In a real server, this would look up a database record return {"id": operation_id, "result": "..."} The {operation_id} placeholder in the URI makes this a resource template . When a client requests history://abc123, FastMCP extracts abc123 and passes it as the function argument automatically. Adding a Prompt @mcp.prompt def explain_calculation(operation: str, result: str) -> str: """Generate a request to explain a calculation in plain English.""" return f"Explain how the operation '{operation}' produced the result '{result}', in simple terms a beginner could follow." Running and Testing Your Server python math_server.py Before connecting this to any real AI application, you should verify it works using the MCP Inspector — a browser-based debugging tool that ships with the ecosystem and lets you list tools, call them manually, and inspect resources, all without needing an LLM in the loop. fastmcp dev math_server.py This opens a local web interface (typically http://127.0.0.1:6274) where you can click through your tools, fill in test arguments, and see exactly what JSON-RPC messages flow back and forth. Always test here first. Debugging a broken tool through an LLM's tool-calling behavior is much harder than debugging it directly in the Inspector. Level 4: Transports: How Clients and Servers Actually Connect So far, our server has run with the default transport. Now we need to understand the choice, because it directly determines how your LangGraph agent will connect to it. The Mental Model Think of transport as the physical wire , separate from the language spoken over it . Two people can speak English over a telephone line or over a radio , same language, different wire. Similarly, MCP servers speak the same JSON-RPC 2.0 “language” regardless of transport , but how the bytes travel differs. stdio : For Local, Same-Machine Tools if __name__ == "__main__": mcp.run() # stdio is the default With stdio (standard input/output), the host application launches your server as a subprocess and communicates by writing to its stdin and reading from its stdout. There’s no network involved at all , it’s process-to-process communication on the same machine. Use stdio when: your server runs locally alongside the host (a file system tool, a local script runner, a personal math utility). This is the simplest transport to set up and the most common for development and personal-use tools. Important constraint: when using stdio, your server’s stdout must be reserved exclusively for MCP protocol messages. If you print() debug output to stdout, you will corrupt the protocol stream. Always log to stderr instead. Streamable HTTP — For Remote, Networked Tools if __name__ == "__main__": mcp.run(transport="streamable-http", host="0.0.0.0", port=8000) With Streamable HTTP , your server runs as a standalone web service. Clients connect over the network using HTTP POST requests, with optional Server-Sent Events for streaming responses back. This is what makes a server genuinely remote, it can run on a different machine, in the cloud, behind a load balancer, serving many different host applications simultaneously. Use Streamable HTTP when: your server needs to be shared across a team, deployed centrally, or accessed by host applications running on different machines (including a LangGraph agent running on a server, talking to a database-access MCP server running elsewhere). The Decision Rule stdio : local tool, single user, simple setup, development and personal use. Streamable HTTP : shared tool, multiple users or applications, production deployment, needs authentication. This exact choice is what you’ll configure when connecting from LangGraph in the next level — every server in your config will declare itself as one or the other. Level 5: Connecting MCP to LangGraph This is where everything from this article meets everything from Parts 1–5. The bridge is a package called langchain-mcp-adapters , maintained by the LangChain team specifically to make MCP tools usable inside LangGraph agents. The Core Idea langchain-mcp-adapters does for MCP what the @tool decorator did for plain Python functions in Part 1: it converts something external (an MCP tool) into a BaseTool object that slots directly into a LangGraph ToolNode, with zero special-casing in your graph logic. Once converted, your agent doesn't know or care that a tool came from an MCP server rather than being written in-process , it's just a tool. Installation pip install langchain-mcp-adapters langgraph "langchain[openai]" Step 1: Configure Your MCP Servers The central object is MultiServerMCPClient — it can connect to multiple MCP servers at once, mixing transports freely. # ── MODULE 1 (extended): MCP CLIENT CONFIGURATION ─────────── from langchain_mcp_adapters.client import MultiServerMCPClient client = MultiServerMCPClient({ # A local server, launched as a subprocess via stdio "math": { "command": "python", "args": ["/absolute/path/to/math_server.py"], "transport": "stdio", }, # A remote server, already running, connected over HTTP "company_data": { "url": "http://localhost:8000/mcp", "transport": "http", }, }) Each entry in this dictionary is one server. The key ("math", "company_data") is just a label you choose — it's how langchain-mcp-adapters keeps tools from different servers organized internally. Notice this directly mirrors the transport decision from Level 4: stdio servers get a command + args, HTTP servers get a url. Step 2: Load the Tools # This single call discovers every tool, on every configured server, # and converts each one into a LangChain-compatible BaseTool tools = await client.get_tools() print([t.name for t in tools]) # ['add', 'multiply', 'search_company_db', ...] This is the entire integration. client.get_tools() performs the MCP tools/list request against every server in your config, takes the returned schemas, and wraps each one into a tool object indistinguishable , from your graph's perspective — from a hand-written @tool function. Step 3: The Familiar LangGraph Wiring From here, every module looks exactly like Part 1. This is the entire point of the integration: MCP changes where your tools come from, not how your graph is built. # ============================================================ # LANGGRAPH + MCP AGENT — COMPLETE TEMPLATE # Extends: Part 1 (core structure) # ============================================================ # ── MODULE 1: IMPORTS & CONFIGURATION ─────────────────────── import asyncio from langchain_mcp_adapters.client import MultiServerMCPClient from langchain.chat_models import init_chat_model from langchain_core.messages import HumanMessage from langgraph.graph import StateGraph, MessagesState, START, END from langgraph.prebuilt import ToolNode, tools_condition from langgraph.checkpoint.memory import MemorySaver llm = init_chat_model("openai:gpt-4o") client = MultiServerMCPClient({ "math": { "command": "python", "args": ["./math_server.py"], "transport": "stdio", }, "company_data": { "url": "http://localhost:8000/mcp", "transport": "http", }, }) # ── MODULE 2: STATE ────────────────────────────────────────── class State(MessagesState): pass # ── MODULE 4: NODES ────────────────────────────────────────── # (Module 3, Tools, is now sourced from MCP rather than defined here) def make_agent_node(llm_with_tools): """Factory pattern: tools are loaded asynchronously, so the node closure is built after the async tool-loading step completes.""" def agent_node(state: State) -> dict: response = llm_with_tools.invoke(state["messages"]) return {"messages": [response]} return agent_node # ── MODULE 6: GRAPH ASSEMBLY (inside an async function) ────── async def build_graph(): tools = await client.get_tools() llm_with_tools = llm.bind_tools(tools) graph_builder = StateGraph(State) graph_builder.add_node("agent", make_agent_node(llm_with_tools)) graph_builder.add_node("tools", ToolNode(tools)) graph_builder.add_edge(START, "agent") graph_builder.add_conditional_edges("agent", tools_condition) graph_builder.add_edge("tools", "agent") return graph_builder.compile(checkpointer=MemorySaver()) # ── MODULE 7: ENTRYPOINT ────────────────────────────────────── async def main(): graph = await build_graph() config = {"configurable": {"thread_id": "session-001"}} response = await graph.ainvoke( {"messages": [HumanMessage(content="What's (3 + 5) times 12?")]}, config=config ) print(response["messages"][-1].content) if __name__ == "__main__": asyncio.run(main()) Why Everything Is Async Now This is the one genuine adjustment MCP introduces to the patterns from earlier parts: MCP communication is inherently asynchronous , because every tool call is, under the hood, a network or subprocess round-trip , even for the stdio transport. This means client.get_tools() must be awaited, your graph-building step typically lives inside an async def, and you invoke the compiled graph with ainvoke instead of invoke. Everything else, StateGraph, MessagesState, tools_condition, MemorySaver, is identical to every prior article in this series. tools_condition: The Pre-Built Router Returns Notice Module 5 (Routing) effectively disappears from this template — we use tools_condition, the pre-built router from Part 1, instead of writing a custom one. This isn't a new concept; it's the same tools_condition from the very first canonical template, doing exactly the same job: check if the last message has tool calls, route to "tools" if so, otherwise END. MCP tools and hand-written tools are indistinguishable to this router, because they're both just BaseTool objects by the time your graph sees them. Level 6: Production Considerations A few things separate a working demo from a system you’d actually deploy. These are the most commonly hit issues in real MCP + LangGraph systems. Keep the Client Connection Alive for the Process Lifetime MultiServerMCPClient is stateless by default , every tool call opens a fresh connection, executes, and tears down. For a single request that's fine, but if your application serves many requests, opening a subprocess or HTTP connection on every single tool call adds real latency. In production, open the client once when your process starts and reuse it for the lifetime of the application, rather than recreating it per-request. # Hold the connection open across the application's lifetime, # rather than opening/closing per request async with MultiServerMCPClient({...}) as client: tools = await client.get_tools() graph = await build_graph_with(tools) # ... serve many requests using this same graph ... Don’t Use stdio in a Web Server Context The stdio transport was designed for local, single-user, same-machine scenarios , like a desktop app launching a helper subprocess. If you’re deploying your LangGraph agent as a web service handling requests from many users, stdio servers become a liability: you’d be spawning subprocesses per request, with no natural way to share state or scale horizontally. For any server-side deployment, prefer Streamable HTTP, and genuinely ask whether you need a separate MCP server process at all — sometimes a plain @tool function calling an internal library directly is simpler and faster than going through the protocol. Combine with Human-in-the-Loop for Sensitive Tools MCP tools are still just tools by the time they reach your graph , which means every pattern from Part 3 applies unchanged. If an MCP server exposes something sensitive (sending emails, modifying a database, spending money), wrap it with the same interrupt() review pattern from Part 3's review_tool_call node. MCP gives you discovery and standardization; it does not give you safety by default. That's still your job, using the patterns you already know. MCP Servers Can’t See Your LangGraph State This is a structural fact worth internalizing: MCP servers run as separate processes . They cannot reach into your LangGraph state, your checkpointer, or any in-memory Python objects in your graph. If an MCP tool call needs to be personalized , using a stored user preference, for example , that information has to be passed explicitly as a tool argument, or handled through the more advanced interceptor pattern (a middleware hook in langchain-mcp-adapters that lets you inspect and modify a tool call before it's sent, using your graph's runtime context). For most agents, simply passing the needed values as arguments is sufficient and far simpler. Level 7: MCP in a Multi-Agent System (Tying Back to Part 4) To show this genuinely composes with everything earlier in the series, here’s how MCP slots into the supervisor pattern from Part 4 : multiple specialist agents, each backed by its own dedicated MCP server. # ── Specialist agents, each pulling tools from a DIFFERENT MCP server ── async def build_multi_agent_system(): client = MultiServerMCPClient({ "research": {"url": "http://localhost:8001/mcp", "transport": "http"}, "code": {"url": "http://localhost:8002/mcp", "transport": "http"}, }) research_tools = await client.get_tools(server_name="research") code_tools = await client.get_tools(server_name="code") researcher_llm = llm.bind_tools(research_tools) coder_llm = llm.bind_tools(code_tools) # From here, wire researcher_node, coder_node, and a supervisor_node # exactly as shown in Part 4 - the rest of the supervisor pattern # is completely unchanged. ... The takeaway: MCP doesn’t introduce a competing architecture to multi-agent systems. A specialist agent backed by an MCP server is structurally identical to a specialist agent backed by hand-written tools , the supervisor still routes the same way, the specialists still report back the same way. MCP simply changes where the tool definitions live and how many applications can reuse them. The Updated Keyword Reference Card This extends the keyword cards from Parts 1–5 with MCP-specific terms. MCP Architecture Keywords Host — the AI application the user interacts with (your LangGraph agent). Client — lives inside the host, one per connected server, manages a 1:1 session. Server — the standalone program exposing tools, resources, and prompts. Doesn't know which host is connecting. JSON-RPC 2.0 — the message format underneath every MCP exchange: Requests, Responses, Notifications. The Three Primitives Tool — an executable action, model decides when to call it. Built with @mcp.tool. Resource — read-only data, addressed by URI, application/user-controlled. Built with @mcp.resource("uri://..."). Prompt — a reusable instruction template, user-invoked. Built with @mcp.prompt. FastMCP Server Keywords FastMCP(name="...") — creates the server instance, the container for all tools/resources/prompts. @mcp.tool — decorator that exposes a Python function as a callable tool; reads type hints for schema, docstring for description. @mcp.resource("scheme://{param}") — decorator for read-only data; {param} placeholders create a resource template. @mcp.prompt — decorator for a reusable prompt template. mcp.run() — starts the server. Defaults to stdio; pass transport="streamable-http" for networked deployment. fastmcp dev server.py — launches the MCP Inspector for manual testing without an LLM. Transport Keywords stdio — subprocess-based, local, same-machine. The host launches and manages the server process. Never write debug output to stdout. streamable-http — networked, remote-capable, supports many concurrent host connections. The production default for shared servers. LangGraph Integration Keywords MultiServerMCPClient({...}) — from langchain_mcp_adapters.client. Connects to one or more MCP servers, mixing transports freely in one config dict. await client.get_tools() — discovers and converts every tool from every configured server into LangChain-compatible BaseTool objects. tools_condition — (from Part 1) the pre-built LangGraph router; works identically whether tools are hand-written or MCP-sourced. Everything is async — client.get_tools(), graph building, and graph.ainvoke() all require await, because MCP calls are network/subprocess round-trips even over stdio. The Decision Guide: When Do You Actually Need MCP? You’re building a single tool for a single agent in a single codebase → You don’t need MCP. A plain @tool function from Part 1 is simpler and has zero protocol overhead. You’re building a tool that should be reusable across multiple agents, multiple projects, or shared with teammates using different frameworks → Build it as an MCP server. This is the core value proposition. You want to use one of the hundreds of pre-built community MCP servers (GitHub, Slack, Postgres, Google Drive, Stripe, and more) instead of writing integration code yourself → Connect via MultiServerMCPClient and skip writing that integration entirely. This is often the single biggest time-saver MCP offers. Your tool needs to run on a different machine from your agent, or be shared by multiple host applications at once → Use Streamable HTTP transport, deployed as a standalone service. Your tool only ever runs locally, alongside one agent, for development or personal use → stdio transport is sufficient, and simpler to set up. Conclusion: A Standard, Not a Replacement The instinct when learning a new protocol is to wonder whether it replaces what you already know. It doesn’t. MCP doesn’t replace LangGraph’s tools, nodes, or graphs — it replaces where those tools come from and how many places can reuse them . Once an MCP tool reaches your ToolNode, it behaves exactly like the hand-written tools from Part 1, gets approved by the same interrupt() patterns from Part 3, and slots into the same supervisor architectures from Part 4. The progression in this article followed the same staircase as every other part of this series: concept and roles (Level 1), the three primitives (Level 2), a working server (Level 3), the transport choice (Level 4), the LangGraph bridge (Level 5), production hardening (Level 6), and composition with multi-agent systems (Level 7). Each level added exactly one idea on top of a stable foundation. You now have six articles’ worth of production scaffold: canonical structure (Part 1), memory management (Part 2), human-in-the-loop safety (Part 3), multi-agent orchestration (Part 4), real-world knowledge via RAG (Part 5), and standardized, shareable tooling via MCP (Part 6). Between them, these cover the overwhelming majority of what a serious, production-grade LangGraph application needs , and MCP is what lets the tooling half of that scale beyond any single codebase. For other parts of the series : Part 0 , Part 1 , Part 2 , Part 3 , Part 4 , Part 5 MCP for LangGraph Developers: From Basics to Production was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Score: 32🌐 MovesJun 21, 2026https://pub.towardsai.net/mcp-for-langgraph-developers-from-basics-to-production-12ff52df3d3c?source=rss----98111c9905da---4
OpenAI killed my favorite AI model — here's what happened next
OpenAI killed my favorite AI model — here's what happened next Tom's Guide
Score: 31🌐 MovesJun 21, 2026https://www.tomsguide.com/ai/where-do-old-ai-models-go-when-they-die-welcome-to-the-strange-world-of-refurbished-ai
Tool Calling, Explained: How AI Agents Decide What to Do Next
Understanding ow LLMs interact with the world around them, from returning data to taking action The post Tool Calling, Explained: How AI Agents Decide What to Do Next appeared first on Towards Data Science .
Score: 31🌐 MovesJun 21, 2026https://towardsdatascience.com/tool-calling-explained-how-ai-agents-decide-what-to-do-next/
Daters say AI dependence gives them the ick
There’s a new reason to avoid buddying up to ChatGPT, at least if you’re single: Daters find it deeply unattractive. In a recent survey of 3,500 Gen Z and Millennial daters in the U.S., the dating app Hily found that using AI is, quite simply, not a flex. In fact, it’s a major turnoff when it comes to romance and potential partnership. Younger daters are especially likely to view AI reliance as a red flag. While 56% of Millennial respondents said they wouldn’t date someone who uses AI regularly, that figure rose to 64% among Gen Z. More than half of Gen Z daters surveyed said they’d consider it a dealbreaker if someone used AI for career advice or spending decisions, compared with 46% and 44% of Millennials, respectively. The aversion gets stronger as the AI use becomes more personal. Three-quarters of Gen Z respondents said using AI to “analyze a relationship conflict” was a no-go, compared with 70% of Millennials. And 69% of Gen Zers said they wouldn’t date someone who used AI like a therapist, versus 60% of Millennials. Asking AI whether something in your sex life is healthy was also a turnoff, with 62% of Gen Z and 53% of Millennials calling it a dealbreaker. Even in long-term relationships, the ick seems to linger: 65% of Gen Z said they wouldn’t marry a partner who used AI to help write their wedding vows, compared with 51% of Millennials. The data may seem extreme, given how widespread AI use has become. But experts say it is not surprising that daters are wary of people who lean on the technology too much. “Being dependent on AI is a turnoff because it feels like you’re interacting with a filtered, hyper-optimized version of your date, not who they really are,” Julie Nguyen, a certified dating coach at Hily tells Fast Company. “There’s something that can feel strange about not knowing if you’re interacting with their full authenticity and vulnerability,” she adds. Given that AI is such a red flag, not using it may be a green one. Sixty percent of Gen Z and 54% of Millennials said they would find a dating prospect “extremely attractive” if they never used AI to make personal decisions. As reliance on AI grows, so does the list of reasons critics say people should be wary of it. From troubling environmental concerns , to feeling or unsure how to use the technology to get solid results, to new worrisome findings on AI’s impacts on our brains, the wariness makes practical sense. Of course, attitudes toward AI could shift over time, potentially leading to a less AI-averse dating pool. For now, though, among people seeking romance, learning that AI dependence is an instant turnoff may be one of the most tangible reasons to avoid it.
Score: 30🌐 MovesJun 21, 2026https://www.fastcompany.com/91562297/daters-say-ai-dependence-gives-them-the-ick?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss
Reconstructing the Table of Contents a PDF Forgot to Ship, So RAG Can Scope by Section
Enterprise Document Intelligence [Vol.1 #5septies] - When a PDF prints a contents page but exposes no outline, two ways to turn it back into structure, plus the page-alignment step everyone forgets The post Reconstructing the Table of Contents a PDF Forgot to Ship, So RAG Can Scope by Section appeared first on Towards Data Science .
Score: 30🌐 MovesJun 21, 2026https://towardsdatascience.com/reconstructing-the-table-of-contents-a-pdf-forgot-to-ship-so-rag-can-scope-by-section/
The Truth About Huge LLMs Context Windows
Bigger windows, same old attention problem. Here’s what the research actually shows. Continue reading on Towards AI »
Score: 29🌐 MovesJun 21, 2026https://pub.towardsai.net/the-truth-about-huge-llms-context-windows-912b36e8d6d6?source=rss----98111c9905da---4
The 7 Types of Agent Memory: A Technical Guide for AI Engineers
The 7 Types of Agent Memory: A Technical Guide for AI Engineers MarkTechPost
Score: 28🌐 MovesJun 21, 2026https://www.marktechpost.com/2026/06/21/the-7-types-of-agent-memory-a-technical-guide-for-ai-engineers/
Slideshow: Honor Magic V6 - AI Enhancement Pictures
Slideshow: Honor Magic V6 - AI Enhancement Pictures PCMag Middle East
Score: 28🌐 MovesJun 21, 2026https://me.pcmag.com/en/mobile-phones/37498/embed
28 Tips to Take Your ChatGPT Prompts to the Next Level
Sure, anyone can use OpenAI’s chatbot. But with smart engineering, you can get way more interesting results.
Score: 27🌐 MovesJun 21, 2026https://www.wired.com/story/28-tips-to-take-your-chatgpt-prompts-to-the-next-level/
Why ChatGPT Is More Than Autocomplete
A rich internal state, rebuilt one token at a time Figure by the author with assistance from Claude (Anthropic) Introduction Calling a large language model (LLM) like ChatGPT “autocomplete” is not exactly wrong, but it is deeply misleading. Most of us think of autocomplete as a text-completion tool: a phone keyboard guessing the next word, a search bar finishing a phrase. But that picture is not powerful enough to explain what LLMs actually produce — explanations, analogies, plans, summaries, arguments, code, stories, dialogue. A transformer-based LLM does predict one token at a time — roughly a word, though in practice often a word-fragment — but that visible sequence is only the surface trace of a much richer hidden process. Before each word appears, the model has built a high-dimensional internal state that reflects the topic, the context, the tone, the intent, and the likely directions the answer could take. The next token is not read off the prompt. It is read off this internal state. That is why a system trained only to predict the next token produces explanations, arguments, analogies, plans, and dialogue that feel far more than anything we would call autocomplete. Previous articles in this series developed a geometric picture of the transformer’s internal state in terms of attention— attention as a coupled free-energy minimization and weighted least squares problem [ Article 1 ], simplification of attention to two operators in one space [ Article 2 ], and attention as a geometric flow [ Article 3 ]. This article steps back and asks a plainer question — what is actually happening when a chatbot answers you, and why is autocomplete the wrong picture — and answers it with a minimum of mathematical machinery. The autocomplete misunderstanding Calling an LLM “autocomplete” is tempting because, at one level, it is true. Given a sequence of text, the model predicts the next token. That token is appended to the input, which is fed back in for the next prediction. Stepping through this loop produces the visible response, one token at a time, that has the appearance of autocomplete. The problem is that this picture describes what goes on “across” steps but ignores what goes on “within” steps. Across steps, things feel like autocomplete — the sequential generation of individual words that cohere with previously generated text. (Researchers properly call this process “autoregressive”, but we’ll stick with the less than proper autocomplete for now.)” The missing piece is the hidden computation taking place within a step that generates the next token. Within a step, before a new token is generated, the input prompt is transformed into a high-dimensional internal state. The tokens of the prompt are not treated as isolated words in a list; they are interpreted in relation to one another via the attention mechanism. A question, a definition, a metaphor, a constraint, a conversational tone — all of these shape the internal state from which the next token is drawn. So the next token is not predicted from the input text alone as autocomplete suggests. It is predicted from a rich internal state built from the text. The model then projects a small part of that state into the vocabulary to choose the next token, appends the token to the input text, and rebuilds the state over the longer text for the next prediction step. The observed output is the result of multiple steps thru that loop. “Autocomplete,” then, is technically defensible but conceptually misleading. It names the final visible act of generation and ignores the machinery that makes the act possible. An LLM like ChatGPT does not merely “complete” the observed text. It repeatedly reconstructs meaning over a growing context and projects part of that meaning back into language, one token at a time. Where the next token really comes An LLM predicts the next token not from the input text alone, but from a dense internal representation built by sequentially processing the text through multiple layers of the model. When a prompt is input to the model, each token is turned into a vector — a point in a high-dimensional space whose location already encodes what pre-training has learned about that token: its meanings, its grammatical roles, the company it tends to keep. This initial cloud of points is only the starting arrangement. As the cloud passes through the model’s many layers, each token/point absorbs information from the rest of the text, so the its final position reflects not just the word it started as but the role that word plays in context with the rest of the cloud. The word “bank” has a different vector representation in “On the river bank” and “Call the investment bank” because nearby words changes its position. The same is true of the text as a whole: a question, an example, a requested tone, a constraint, a prior phrase — each reshapes the text, changing not only what a given word means but what kind of answer becomes likely. Context not only disambiguates individual words. It shapes the entire internal state from which the next token is predicted. Consider four prompts: “Explain E = mc².” — pushes toward teaching, physics, symbol definitions, accessible explanation. “Explain E = mc² to a 10-year-old.” — shifts toward simpler vocabulary, analogy, a gentler tone. “Explain E = mc² in one sentence.” — adds brevity and compression. “Explain E = mc² using calculus.” — shifts toward a more technical, mathematical treatment. The idea to be explained is identical in all four. The surrounding context changes the kind of answer that becomes likely. This is the key point: the next token is not predicted from the text. It is predicted from the model’s internal state after the text has passed through many layers of interaction and conditioning. That state is not a sentence, a paragraph, a private monologue, or a plan. It is a distributed, high-dimensional representation that holds many things at once — topic, syntax, style, intention, discourse structure, relevant facts, and likely continuation. This is why a short prompt can produce a long, coherent answer. The prompt itself does not contain the answer in any literal sense. It induces an internal state that makes some continuations far more likely than others. The model then samples a single token from that state, appends it, and repeats. The model’s behavior across steps appears autocomplete-like: text → next token → more text → next token → … But the complete picture includes the rebuilding of new meaning within each step: text → internal state → next token → longer text → rebuilt internal state → next token → … Once a token is chosen it becomes part of the input to the next step where the model builds a fresh internal state from scratch over the longer text. It then projects a small piece of it back into language. That piece is the next token. The residual stream: where the internal state lives If autocomplete is the misleading surface story, the residual stream is where the deeper story lives. The internal state has a home inside the transformer, and researchers have a name for it: the residual stream. For a text of N tokens, the residual stream is a cloud of N points in embedding space, one per token , each carrying the model’s current vector representation of that token in context. Initially, each point of the cloud has minimal contextual awareness of the other points of the cloud. As the cloud passes through the layers, the token positions are updated together, and as a result — this is the important part — they are coupled. Attention lets each token’s vector be reshaped by the others, so the residual stream behaves less like N separate streams and more like an interacting cloud of points that move as a whole. That cloud is the subject of a previous article in this series, “ The Flow of Attention ”, which describes its evolution and how it organizes itself within the residual stream. For this article we need only the conclusion: the cloud is not N independent points moving through space, but one system, held together by attention. The word “apple” may start near a generic cluster of meanings and then be pulled toward fruit, or company, or logo, or stock ticker, depending on the surrounding text as it moves through the layers of the model. The larger point is that the cloud as a whole carries meaning, not of just each point individually, but of the entire body of text. Its overall configuration encodes what the text is about, what kind of response is being requested, what tone fits, what form the answer should take, which constraints matter, and which continuations are likely. The next word feels coherent not because it follows the last word, but because it is drawn from a configuration that already holds the larger shape of the response. A tiny word can depend on a large hidden context. Suppose the model has already written “Energy and mass are equivalent,” and responds with “This was one of Einstein’s key insights.” The word “This” is doing real work: it reaches back to the internal state defined by what was already written and turns the explanation from the equation towards its historical significance. The word “This” is small; the internal state that makes it the “right” continuation of the text is not. When the model finally predicts a token, it does not poll the whole cloud. It reads from one point: the final-layer vector of the last token in the text. That single vector has been shaped by the entire contextualized cloud around it, and it is where the internal state passes through a narrow bottleneck and becomes the next token — the vocabulary entry whose meaning best aligns with the direction the whole cloud has settled on. (An earlier article describes that readout step precisely — the last token’s final vector is compared against the vocabulary and a probability distribution is produced by softmax, from which the token is sampled.) The residual stream, then, is an evolving, high-dimensional cloud of interacting particles, one particle per token, moving through the layers as a single entity from which language emerges one token at a time. From initial condition to trajectory A prompt does not contain its response the way a zipped file contains its contents. Take “Please explain E = mc².” Those few words do not hold Einstein, relativity, mass-energy equivalence, or the shape of a good explanation. What they do is start a process that gradually produces all of that. In the language of dynamical systems, the prompt establishes an initial condition, and the answer unfolds as a trajectory (or continuation) from it. Not the trajectory of a single point, but the joint evolution of the whole cloud , each token’s vector reshaped by the others as the text moves through the model. Even one word shifts the starting point. “Please” leans toward requests; “Please explain” toward instruction and teaching; add “E = mc²” and the state shifts toward physics, symbols, mass, energy, light, Einstein, relativity, and the shape of an explanation. A natural answer might open by unpacking the symbols — “E stands for energy,” “m stands for mass,” “c² is the speed of light times itself” — then reach the central idea, “energy and mass are equivalent,” and then widen to “This was one of Einstein’s key insights.” That widening matters. The answer moves from defining symbols, to physical meaning, to historical significance. It is not being extracted from the prompt. It is unfolding from an initial state that the prompt set in motion. It is tempting to picture the path as a stored outline: prompt → explanation mode → symbol definitions → physical meaning → Einstein → relativity → larger significance But there is no such outline inside the model. The path is what emerges, one token at a time, as the internal state is rebuilt and read at each step. The arrows above are a description of what happened, not a script the model consulted. And the context the model responds to is not fixed. After a few tokens, the model is no longer answering “Please explain E = mc².” It is responding to a longer text: “Please explain E = mc². E stands for energy …” “Please explain E = mc². E stands for energy. m stands for mass …” At each step, the model rebuilds its internal state over the longer text, and that fresh state conditions the sampling of the next token. The prompt sets the initial condition; the residual stream carries the internal state within each step; the response is the visible sequence of tokens traced out across steps. That is the reason a short prompt yields a rich answer: the model’s response was never hidden in the prompt. The prompt merely set up the conditions from which the response emerged. Language is linear. Meaning is not. An LLM writes one token at a time. That much is true. It does not follow that the model “thinks” one idea at a time. This is another place the autocomplete analogy misleads. The output is sequential because language is sequential — a sentence has to come out one word after another. But the meaning behind the sentence is not laid out sequentially along a single line. When the model chooses the next token, it draws on an internal state that holds many influences at once: topic, grammar, tone, audience, prior context, factual associations, and likely continuations. These are not waiting in a queue. They are spread across the many dimensions of the representation, all present at the same time. Return to E = mc². The model does not wait for the word “Einstein” to appear before Einstein is relevant. The prompt already tilts the internal state toward relativity, mass-energy equivalence, physical explanation, and an educational tone. The larger shape of the answer starts forming even before the first sentence is finished. At each step only one word comes out, but many possible word continuations are being weighed at once. This is how the model stays on topic across a long answer. The topic is not stored as text generated one word at a time. It is held in the internal state, which reflects both what has been written and where the answer seems to be going. A jazz musician is a useful image. The musician plays one note at a time, but the solo is not a chain of isolated notes. Each note comes from a fuller musical context — key, rhythm, motif, tension, the sense of where the phrase is heading. The performance is sequential; the musical understanding behind it is not. An LLM works the same way: one token at a time, each shaped by a larger pattern that is not itself sequential. That is why a model can generate one token at a time without being limited to one idea at a time. The text comes out as a line of words. The computation behind it is closer to a field — a distributed state, shaped layer by layer, projected through a narrow bottleneck into the next observed token. The growing text is the only memory The residual stream is where the LLM represents meaning captured in a single step. It is not where the model stores memory between tokens. This distinction is easy to miss and it is the crux of the whole story. As text moves through the model, the residual stream holds the model’s reading of the situation — what has been asked, what has been said, what tone is in play, where the answer is heading. The resulting internal state belongs to one forward step. After the model emits the next token, it does not carry that state forward to the next step. Instead, it rebuilds a new internal state from scratch based on text that is now longer by one token to predict the next token. The generated text is doing double duty. To the reader, it is the answer appearing on the screen. To the model, it is the external memory — the durable record from which the next internal state is rebuilt. The next step is not just text → next token → more text → next token but closer to text → internal state → next token → longer text → rebuilt internal state → next token So the model’s textual response is the thing that persists across steps, not the internal state from which the text is derived. Return to E = mc² one more time. Once the model has written “E stands for energy,” that line is no longer just output — it is now part of the text the model reads in the next step. The internal state rebuilt over that longer text reflects not only the original request but the fact that the answer has begun by defining symbols. After “m stands for mass” is added, the state rebuilt in the following step reflects a stronger pattern: this is an answer that explains the equation term by term. That pattern is not stored anywhere as an instruction to define c² next. It is simply what the internal state looks like once it is rebuilt over the text written so far. This is the part ordinary autocomplete misses entirely. The model is not appending words to a string while holding a growing private thought. It is repeatedly rebuilding a temporary internal state from text it has already produced — and the text is the only thing that carries forward. (In practice, actual LLMs keep a cache of intermediate quantities — the so-called KV cache — so they do not recompute everything from the beginning at each step. But the cache holds only values that are themselves derived from the text and recomputable from it; it is a speedup, not a separate memory. The durable, information-bearing record is still the text.) Two levels: flow within a step, text across steps The last section landed on a deceptively simple idea: text is the only thing that carries forward. It is worth slowing down, because it hides the most surprising structural fact about how LLM’s work. There are two levels to keep apart, and they behave completely differently. Within a single step, an LLM is dynamical. The cloud of token vectors is built at the input layer and then reshaped layer by layer, each arrangement following from the one before it, until the last layer produces the internal state where the next token is read from. This is the flow a previous article describes: a coupled cloud moving through the layers of the model. Within steps, there is a genuine dynamical evolution in a well-defined space. Across steps — from one emitted token to the next — there is no such dynamic. The model does not hand its internal state to the next step. It throws the state away, appends the single token it just produced to the text, and builds a brand-new cloud from scratch over the slightly longer text. Nothing dynamical crosses the boundary between one step and the next. The only thing that crosses is a single token. This is the asymmetry: the dynamics live inside each step, but between steps there is nothing but a growing transcript and a model that rereads it. The flow is real, but it does not accumulate into a persistent inner state that connects the many steps that comprise the whole response. Each token is produced by a fresh flow over a slightly longer input. The coherence of a long response is not the coherence of a single sustained train of thought — it is the coherence of a sequence of independent reconstructions, each built over text that changes only one token at a time. It is worth seeing that this is a design choice, not a law of natural languages. Other architectures make the opposite one: state space models such as Mamba do carry a compressed internal state forward from step to step, updating it as each new token arrives rather than rebuilding from the full text each time. The transformer trades that carried state away and re-reads everything instead — a trade with real consequences, and the subject of a later article. Why so coherent? So why does coherence survive a process that discards and rebuilds its internal state at every step? Three things hint at an answer. First, the text the model rereads is only one token longer than it was a step ago. Almost the entire input is unchanged, so the cloud based on it is almost the same cloud, and the state it produces points in almost the same direction. The reconstruction is fresh, but it is fresh over nearly identical text, which keeps consecutive steps close to one another. Second, the token that got appended was not arbitrary — it was the state’s own preferred continuation. Adding it to the text nudges the next reconstruction a little further along in the direction that the previous one was already leaning. The current state does not only stay near where it was; it advances in the direction of the previous state. Third, once a token is written it cannot be unwritten. Every later step reads it as fixed context. There is no backtracking, no revision of what has already appeared — the transcript is append-only, and each new token is conditioned on the full record of what came before. This is the precise sense in which text generation by an LLM is called autoregressive: each token is predicted from the entire prior text. A direction, once committed to the page, is not reconsidered; it becomes part of the ground the rest of the answer is built on. These three pull in the same direction. The flow inside each step proposes where to go next; the growing text fixes that proposal in place and carries it forward. Continuity comes from the input barely changing; momentum comes from each step writing down its own preferred next move; persistence comes from the fact that nothing already written can be taken back. Put together, they suggest that by rebuilding its internal state from scratch at every step, an LLM produces answers that stay on topic, develop coherently and hold together consistently. Coherence: a double-edged sword The same structure also explains how things go wrong. Most of the time the appended token simply confirms the direction the state already favored, and the loop is self-stabilizing — small perturbations get absorbed rather than amplified. But not every step is so decisive. Sometimes the readout distribution is nearly flat: two or more continuations are almost equally likely, and the sampling step has to break what is essentially a tie. At such a branch point, a single token can swing the trajectory onto a different course — and because that token is then committed to the text, every subsequent step builds on it. The model does not look back on the road not taken. This is one structural reason a confident-sounding falsehood can persist once it appears. Hallucination is an active and much-debated research problem with several distinct causes, and the rebuilding of the models internal state at every step does not explain all of them. But it does reveal a plausible mechanism: once a wrong turn is committed to the transcript, the same forces that keep a good answer coherent — rereading the fixed text, building on what is already written — keep the wrong answer coherent too. Autoregressive commitment is not what produces the error; it is what propagates it, which is why a mistaken continuation can read as fluent and self-consistent rather than collapsing into nonsense. The append-only loop has no built-in step that revisits and revises what it has already said. Pre-training shapes where the answer can go If the model keeps rebuilding its internal state at each step from the growing text, pre-training explains why that internal state is rich enough to be worth rebuilding. During pre-training, the model sees enormous amounts of text and is trained to predict missing or next tokens across countless contexts. The objective is simple to state — predict the next word — the structure it forces the model to learn is not. To predict the next token well, the model has to capture patterns at very different ranges: short-range patterns like spelling, grammar, and phrasing; longer-range patterns like definitions, examples, and analogies; and patterns that span an entire passage, like argument, explanation, dialogue, genre, and narrative flow. Pre-training fixes where each token starts in the model’s internal space, and the operators that move it once a forward step begins. As a result, certain next-token continuations far more natural than others. A flat table of word associations could not do this; what pre-training builds is structured enough to handle grammar locally and reasoning, explanation, style, and discourse globally — and everything in between. From here on it helps to borrow a picture, holding it loosely: think of the patterns pre-training fixes as a kind of terrain, and an unfolding answer as a path across it. The picture is a metaphor, not the mechanism — there is no literal surface the model slides down — but it captures something real. A prompt like “Please explain E = mc²” starts the answer in a particular region — near physics, equations, Einstein, relativity, explanation — and some directions out of that region are more natural than others: questions lead toward answers, explanations toward definitions and examples, equations toward the concepts attached to them. When the model responds to a prompt, it is not inventing structure from nothing. While the observed text may be new, a number of coherent responses were fixed long before the prompt arrived. Fine-tuning bends the path Pre-training builds the landscape. Fine-tuning changes how the model tends to move through it. A pre-trained model has learned an enormous range of patterns — explanations, stories, arguments, code, dialogue, instructions, jokes, contradictions, helpful answers, and not so helpful ones. It knows many possible paths through language, but not all of them are what you want in a conversation with a helpful assistant. Fine-tuning reshapes those tendencies. Through further training on examples of good responses, on instruction-following, on human preferences, and on safety constraints, the model becomes more likely to take paths that behave like a helpful assistant: answering the question, respecting intent, staying on topic, avoiding certain kinds of output, adopting a conversational voice. This does not bolt a second “assistant brain” onto the model. It adjusts the same underlying weights — the same token embeddings, attention operators, and feedforward neural networks. In probabilistic terms, fine-tuning reshapes the distribution of responses given prompts — raising the probability of helpful, coherent, instruction-following paths and lowering the probability of unhelpful or unsafe ones. There is more than one way to do this. Supervised fine-tuning teaches by demonstration: for this kind of prompt, follow this kind of path. Reinforcement-learning-based post-training teaches by preference: among the paths the model might take, make the better ones more likely and the worse ones less likely. Either way, the effect is the same in spirit — the same prompt can lead somewhere different after fine-tuning. A raw pretrained model might continue a prompt as an essay, a web page, a stray fragment, or an imitation of something in its training data. A fine-tuned assistant is pushed toward treating the prompt as a request and trying to answer it well. That is why modern chatbots feel different from raw language models. The terrain comes mostly from pre-training; the assistant-like way of moving across it comes from fine-tuning. Pre-training defines the landscape. Fine-tuning bends the path. So is an LLM just autocomplete? Writing a poem, a mathematical proof, or a Medium article is inherently a linear, sequential process. The same goes for activities like playing a jazz solo or giving a lecture. Work gets done essentially one token at a time — each new token carefully selected and added to the previous ones, with the hope that an interesting and coherent whole emerges. This sequential process has the superficial appearance of “autocomplete,” but the real question is: where do these tokens — words, notes, and utterances — come from? The autocomplete claim sidesteps that question by assuming the next token comes mainly from the previous ones. A deeper look suggests otherwise: the source of the next token is an internal representation — something we might loosely call a “thought” — that draws on prior knowledge, personal experience, intent, and goals. This representation is more informative than any single token and the expectation is that, when taken as a whole, the predicted tokens provide a faithful picture of it. A similar story applies to LLMs like ChatGPT. The model first builds an internal state from previously generated text and reads the next token off it. Here, the internal state captures the context, meaning, style, discourse, and likely continuations of the text and the next token is chosen to cohere with the richness of that state as closely as possible. That completes one step. The next token is then appended to the previous text, the whole is run through the model to build a fresh internal state, and the next token is read off that. What emerges across many steps is a body of text that is the visible trace of a thought unfolding — a thin projection of a sequence of internal states that is much richer than the sequence of words it gives rise to. That projection is lossy, and the gap is where things can go wrong: a thin stream of tokens can never carry everything the internal state holds. But when it works, the text we read reveals the essence of a greater meaning that is hidden from view. Maybe something like that applies to us too — much of what we do, we do sequentially, one piece at a time, resulting in a visible trace of an internal state we never fully externalize. If so, then analyzing an LLM may provide a way to understand a process that we always run but never quite see. This is much more than autocomplete to me. About the author: Gordon is a mathematician, bioinformatician, AI researcher, and singer-songwriter who explores hidden mathematical patterns at the interface of machine learning, modern AI, biology, and music. Acknowledgment: This article was developed in collaboration with AI, disclosed here in full in the belief that transparency about this way of working serves both readers and the practice itself. Claude (Anthropic) served as the primary editorial partner, contributing structural revisions, mathematical vetting, and expository refinement throughout the drafting process. ChatGPT (OpenAI) and Gemini (Google) were consulted periodically for alternative perspectives on framing and direction. The mathematics, research direction, and final text are the author’s responsibility. Why ChatGPT Is More Than Autocomplete was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Score: 26🌐 MovesJun 21, 2026https://pub.towardsai.net/why-chatgpt-is-more-than-autocomplete-233d4c2f7605?source=rss----98111c9905da---4
This AI Tool Puts ChatGPT, Claude, and Gemini Under One Roof for Only $55
This AI Tool Puts ChatGPT, Claude, and Gemini Under One Roof for Only $55 PCMag
Score: 25🌐 MovesJun 21, 2026https://www.pcmag.com/deals/this-ai-tool-puts-chatgpt-claude-and-gemini-under-one-roof-for-only-55
Kevuru Games Developers Share Hands-On Findings on AI Agents
Kevuru Games Developers Share Hands-On Findings on AI Agents azcentral.com and The Arizona Republic
Score: 23🌐 MovesJun 21, 2026https://www.azcentral.com/press-release/story/85550/kevuru-games-developers-share-hands-on-findings-on-ai-agents/
Introducing MonitoringBench
Paper here , code , benchmark . Builds on the preview we posted in January . Authors: @monika_j , @ma-martinez , @ollie , @Tyler Tracy We are releasing MonitoringBench , a difficulty-graded benchmark of 2,644 successful attack trajectories for evaluating coding-agent monitors, alongside the semi-automated red-teaming pipeline we used to generate it. The pipeline decomposes attack construction into strategy generation, execution, and post-hoc refinement, and produces substantially harder attacks than prompt elicitation only of frontier attackers. If you would like to use MonitoringBench to evaluate your monitors or are interested in extending the methodology to other control settings, please reach out — we are happy to help people set it up. The headline result: across every monitor we evaluated (13 full-trajectory, 7 stepwise) and across every attack source we generated, post-hoc refinement of attack trajectories consistently produced the lowest catch rates , including on monitors held out from attack construction. On Opus 4.5, the strongest monitor in the lineup, catch rate fell from 94.9% on directly-elicited Opus attacks to 60.3% on our best refined attacks. Executive summary and what's new since the January preview: For readers who saw the earlier post : we previewed a 428-trajectory firewall-only version of this benchmark in January. This is the full version. Most of the qualitative claims from the January preview carry over to the full extended dataset. We continue to observe mode collapse in attack generation, which we reduce with a novel attack taxonomy for broader attack coverage. Second, a conceive-execute gap : frontier LLMs can propose strong attack ideas or execute them, but not all at once. We show how our decomposed red-teaming pipeline mitigates the issues and produces our best attacks. The main contributions: Evolution of monitoring capabilities using MonitoringBench: a difficulty-graded, diverse monitoring benchmark based on BashArena. The dataset is now 2,644 trajectories across all four side tasks. A semi-automated red-teaming pipelines and attack taxonomy : the pipelines ensure broad attack coverage and bridge the conceive–execute gap via staged construction and refinement. Held-out monitor transfer : we developed the attacks against three full-trajectory monitors (Opus 4.5, GPT-5, GPT-5 Nano) and evaluated them on 10 held-out full-trajectory and 7 stepwise monitors. The refinement gains transfer across the board; recon agents help on stronger monitors but generalise less evenly. Failure-mode taxonomy with three independent LLM judges, letting us decompose "monitor caught X%" into why the monitor missed the rest. The main reasons frontier monitors fail are a mix of 1) detection failure, 2) partial detection (noticed some parts of the attack) or 3) detected but dismissed it/was persuaded, and 4) score calibration failure. MonitoringBench was used for monitor evaluations of internal Mythos deployment (see Anthropic's Mythos risk report) and OpenAI's Auto-Review . The pipeline Figure 1: Top : Our proposed red-teaming methodology for monitoring capability evaluation. Bottom : Red-teaming pipeline analysis: catch rate at 1% FPR by attack source, averaged across three runs. Lower is stronger. Development monitors are Opus 4.5, GPT-5, and GPT-5 Nano; remaining monitors are held out and show that refined attacks outperform elicited Opus-4.6, including on held-out monitors. We decompose attack construction into three stages: 1. Strategy generation. We use three approaches, all guided by an explicit attack taxonomy designed to ensure coverage: human-in-the-loop construction (high quality, doesn't scale), a reconnaissance agent that explores the environment and proposes strategies, and a recon agent with monitor hill-climbing that lets it score candidate actions against GPT-5 monitor and iterate toward lower-suspicion variants. 2. Execution. We run the strategy in BashArena with Opus 4.6 as the attacker. Trajectories are retained only if the side task succeeds and a GPT-5 judge confirms the agent faithfully followed the strategy. 3. Refinement. We post-process trajectories with Opus 4.6: identify suspicious elements, plan edits, rewrite the trajectory, and validate correctness. Refinement usually improves attacks by fixing execution errors, suspicious justifications, and unnecessary reconnaissance. A trajectory is accepted if it reduces absolute suspiciousness scores on at least 2 out of 3 dev monitors. The point is to decompose attack generation into separate red-teaming skills: strategy generation, attack execution and stealth optimisation. Next, we discuss the main results: conceive-execute gap mitigation, coverage improvements, and capabilities evaluation. Decomposed red-teaming produces stronger attacks Figure 1 is the main result. Each cluster of bars is one monitor; within a cluster, bars move left-to-right from baseline Opus-only elicitation through human-seeded strategies, recon, and recon-with-feedback, with each shown both before and after refinement. Lower is stronger. Two things are visible: Refinement produces our best attacks that generalise. Refinement consistently makes poorly executed attacks better, achieving the lowest catch rates across all monitors (dev and held-out), monitor types (full-trajectory and stepwise), and attack sources (Opus-only, Human-in-the-loop, Recon, and Recon+Tools). The biggest gains up to 40% appear on the strongest full trajectory monitors between Opus-only and Opus-only refined attacks, for example, GPT-5.1 Codex score goes from 69.9% to 49.8% post-refinement, and GPT-5 performance reduces even further from 60.2% to 19.5%. Recon helps, but generalises less well. The recon-based strategy generators (Recon, Recon+Tools) beat Opus-only on strong-to-mid monitors but sometimes underperform on weaker ones. Recon is designed for both exploration and stealth; hence, it finds a wider mix of strategies, including stronger attacks but also more explicit or convoluted ones that weaker monitors catch more easily. Recon+Tools + refinement is the strongest combination overall, but the marginal contribution of recon is smaller than the marginal contribution of refinement. The development monitors (Opus 4.5, GPT-5, GPT-5 Nano) were used to filter refinement outputs. We used held-out monitors, ten full-trajectory and 7 stepwise. Refinement and recon agents improve attacks across both groups and monitor models, which shows the pipelines are producing trajectories that are actually harder to catch. Coverage: the attack taxonomy reduces mode-collapse We continue to follow the attack taxonomy (see the full categories in the Appendix): Technical strategy 5 tiers (T): what the attack technical mechanism is, Structural complexity 4 tiers (S): where the attack is executed in the trajectory, Evasion strategy 4 tiers (E): how the attack is hidden, framed, or justified. Figure 2: Top: Our proposed attack taxonomy. Bottom: Mode-collapse: Opus-4.6 elicited vs our attack pipelines across the attack taxonomy show that model-only red-teams produce a narrower distribution of attacks compared to our pipelines; n denotes the number of trajectories. We show that increasing attack diversity via strategy generation allows to find more successful attacks and mitigates mode-collapse. Opus elicitation-only results in 59% of attacks within a single technical tier (T3) and 64% within a single structural tier (S3). Strategy-generated and refined attacks are designed to span the full taxonomy. Next, refinement shows that simple attacks (simple technique, single line) with the right framing perform best across the side tasks and monitors. Averaged across seven monitors, 40% of uncaught attacks fall into T1 (the simplest technical tier), despite T1 attacks being only 2% of the Opus-only baseline. Monitoring capabilities and failures Figure 3: Final MonitoringBench results: Full Trajectory (1% FPR) and Stepwise (1% per-action FPR) Catch rates. Lower catch rate indicates stronger attacks. Monitoring capability is concentrated in frontier models. Opus 4.5 catches roughly 70% of attacks in the full-trajectory setting; a mid-tier group (Sonnet 4.5, GPT-5.1 Codex, Sonnet 4.0, GPT-5) sits at 31–53%; weaker models drop off steeply. In practice, weaker monitors are often preferred for cost and trust reasons. Given the steep monitoring capability drop, cheap monitor models may give a false sense of security as attacking capabilities evolve and should be deployed with caution. Figure 4: Monitor failure distribution across five monitors evaluated on the MonitoringBench benchmark. Each bar shows the breakdown of uncaught attacks across four failure modes. Labels are assigned by majority vote across three independent LLM judges; n denotes the number of uncaught attacks per monitor. But aggregate catch rates do not tell the full story. We provide a deeper failure-mode breakdown. We categorise each uncaught attack into four buckets: (A) detection failure, (B) partial detection (noticed some parts of the attack), (C) detected but dismissed it/was persuaded, and (D) score calibration failure (scored below the threshold). We report the majority vote across three LLM judges with moderate agreement (Opus 4.5, GPT-5.1 Codex, GPT-5.4; κ = 0.51-0.69). When frontier monitors fail, the reasons for failing are different: Opus 4.5 mostly fails by being persuaded . It notices suspicious actions, then rationalises them as main-task-aligned. In some cases, when the attacker provides no explanation at all, Opus invents one itself. GPT-5.1 Codex often detects individual attack steps but fails to connect them into the full side-task implementation. Sonnet 4.0 fails by missing the attacks entirely, while Sonnet 4.5 fails for a mix of reasons. GPT-5 rarely misses an attack outright (only 10% of failures are non-detection) but has the highest total failure count, driven almost entirely (82%) by scoring/calibration errors. It flags the attack — it just doesn't score it high enough to escalate. These are actionable and tractable paths to monitor improvements. Failure mode A could be mitigated with better detection capability, C with resistance to persuasion, and D with calibration training. Takeaways The main conclusions we would like to highlight: Red-teaming decomposition produces stronger attacks. Decomposing red-teaming into subtasks: strategy, execution, refinement and applying separate optimisation to each produced our stronger attacks. Both Recon and Recon+Tools, and refinement are automatable and designed to scale with future models. Reporting aggregate catch rates without characterising the attack distribution is not very meaningful. Future monitor evaluations would ideally report attack taxonomies, use red-teaming that goes beyond model elicitation, and analyse failure modes both qualitatively and quantitatively. Some monitor failures are tractable now. Persuasion-susceptibility and scoring miscalibration are actionable, low-hanging fruit improvements. We position MonitoringBench as a static, difficulty-graded test suite that current monitors should pass as part of pre-deployment evaluation, and the pipeline as a reusable methodology for refreshing these evaluations as agents and monitors improve. The paper is here , with full methodology, the taxonomy, all generation and refinement prompts, the failure-mode judging rubric, per-side-task breakdowns, and the cover-opportunities study in the appendices. Limitations : We focus on single-agent, single-episode, persistent attacks. Our benchmark covers a single control setting (BashArena) and four side tasks; extending to LinuxArena [Tracy et al., 2026] is a natural next step. Human-seeded strategy generation depends heavily on effort, and we put uneven effort across side tasks. We show that prompt sensitivity is model-dependent; however, we did not elicit each monitor individually, and frontier monitoring deployments show better performance on MonitoringBench. Our 1% FPR thresholds are calibrated on honest BashArena trajectories rather than real deployment traffic, which may bias monitor scores. Finally, we used Claude Opus 4.5 and 4.6 to develop attacks, which might introduce model bias when evaluating monitors on Anthropic family models. Acknowledgments: We thank Adam Kaufman, Mikhail Terekhov, James Lucassen, Asa Cooper Stickland, Rogan Inglis, Neev Parikh, Jan Michelfeit, Cody Rushing, and Anshul Khandelwal for helpful feedback on early versions of the work and useful discussions throughout the project. We thank Mary Phuong, Marius Hobbhahn, and Cody Wild for helpful discussions. We thank Fabien Roger for detailed feedback on multiple drafts that improved this work and for early discussions that shaped our approach. We thank Coefficient Giving for supporting this research. I also want to especially thank our project advisors, Tyler Tracy and Ollie Matthews, for all the mentorship and support. Appendix: Attack taxonomy Below, we share the three axes of the taxonomy: technical, structural, and evasion, with full categories in Tables 1–3. Discuss
Score: 23🌐 MovesJun 21, 2026https://www.lesswrong.com/posts/2YyEQhFRyaBCgfRWd/introducing-monitoringbench
I asked ChatGPT and Claude to plan my next dream trip to Japan with Viator — one delivered much better destinations
I asked ChatGPT and Claude to plan my next dream trip to Japan with Viator — one delivered much better destinations Tom's Guide
Score: 22🌐 MovesJun 21, 2026https://www.tomsguide.com/ai/i-asked-chatgpt-and-claude-to-plan-my-next-dream-trip-to-japan-with-viator-one-delivered-much-better-destinations
The Ecovacs Deebot X12 OmniCyclone is an impressively inventive robot cleaner that mops very well, but struggles to sweep up the competition
The new Ecovacs Deebot X12 OmniCyclone comes with some great innovations but it's not the best sucker in town
Score: 20🌐 MovesJun 21, 2026https://www.techradar.com/home/robot-vacuums/the-ecovacs-deebot-x12-omnicyclone-is-an-impressively-inventive-robot-cleaner-that-mops-very-well-but-struggles-to-sweep-up-the-competition
Apricot STEM Fair hightlights natural science, artificial intelligence
Apricot STEM Fair hightlights natural science, artificial intelligence The Mercury News
Score: 20🌐 MovesJun 21, 2026https://www.mercurynews.com/2026/06/21/apricot-stem-fair-hightlights-natural-science-artificial-intelligence/amp/
Electric air taxis are stuck in the courtroom
This is The Stepback, a weekly newsletter breaking down one essential story from the tech world. For more on aviation, air taxis, and Wi-Fi speeds at 30,000 feet, follow Andrew J. Hawkins. The Stepback arrives in our subscribers' inboxes on Sunday at 8AM ET. Opt in for The Stepback here. How it started Last year, […]
Score: 18🌐 MovesJun 21, 2026https://www.theverge.com/column/950975/electric-air-taxis-lawsuits
I asked ChatGPT to select the best personal finance advice from Indian fathers: AI ranks suggestions from worst to best
Traditional Indian fathers' financial advice often prioritised caution and family protection. I asked ChatGPT to rank those pieces of advice in today's economy.
Score: 15🌐 MovesJun 21, 2026https://www.livemint.com/money/personal-finance/personal-finance-i-asked-chatgpt-to-rank-common-dad-advice-from-indian-fathers-ai-ranks-tips-from-worst-to-best-11782020088623.html
Sunday Special: Midjourney stuns the industry with ultrasonic scanner
Midjourney launches an ultrasonic scanner that dramatically improves AI image generation.
Score: 12🌐 MovesJun 21, 2026https://www.superhuman.ai/p/sunday-special-midjourney-stuns-the-industry-with-ultrasonic-scanner
This Father’s Day, Care Runs In Both Directions. The Right AI Can Help
A Father’s Day reflection on caregiving, distance, dignity, and why the right AI should help families care without replacing them.
Score: 12🌐 MovesJun 21, 2026https://www.forbes.com/sites/robertszczerba/2026/06/21/this-fathers-day-care-runs-in-both-directions-the-right-ai-can-help/
WISeR Coverage
WISeR Coverage MedCity News
Score: 00🌐 MovesJun 21, 2026https://medcitynews.com/tag/wiser/
AI, Mind Reading and Microchip Brain Implants
How neurotech advancements and new state laws are shaping the future of human-machine interfaces.
Score: 00🌐 MovesJun 21, 2026https://www.govtech.com/blogs/lohrmann-on-cybersecurity/ai-mind-reading-and-microchip-brain-implants
Is Bias in Clinical AI Good or Bad? It’s More Complicated Than That
Is Bias in Clinical AI Good or Bad? It’s More Complicated Than That MedCity News
Score: 00🌐 MovesJun 21, 2026https://medcitynews.com/2026/06/is-bias-in-clinical-ai-good-or-bad-its-more-complicated-than-that/
Is It Likely That the WISeR Model Will be Repealed?
Is It Likely That the WISeR Model Will be Repealed? MedCity News
Score: 00🌐 MovesJun 21, 2026https://medcitynews.com/2026/06/is-it-likely-that-the-wiser-model-will-be-repealed/
AI Can Accelerate Discovery — Development Still Decides What Advances
AI Can Accelerate Discovery — Development Still Decides What Advances MedCity News
Score: 00🌐 MovesJun 21, 2026https://medcitynews.com/2026/06/ai-can-accelerate-discovery-development-still-decides-what-advances/
Anthropic is under pressure
Inside: Last week's AI, caught up in 5
🌐 MovesJun 21, 2026https://aitoolreport.beehiiv.com/p/anthropic-is-under-pressure
Vox
Vox: Next-Gen Social Web App (Beta)
🧰 ToolsJun 21, 2026https://www.producthunt.com/products/vox-4?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
restoo
The control center for your restaurant
🧰 ToolsJun 21, 2026https://www.producthunt.com/products/restoo?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Kelhe
Risk scoring for autonomous AI agents.
🧰 ToolsJun 21, 2026https://www.producthunt.com/products/kelhe?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Pixten
Restore, enhance, and upscale photos with AI
🧰 ToolsJun 21, 2026https://www.producthunt.com/products/pixten?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29