AI News Archive: May 26, 2026 — Part 5

Sourced from 500+ daily AI sources, scored by relevance.

Trucking fleets must adapt faster as regulations, AI reshape industry, experts say
Executives from the Truckload Carriers Association, PepsiCo and Stevens Transport said fleets that embrace safety technology, AI will be best positioned to survive mounting operational pressures.
Score: 48🌐 MovesMay 26, 2026https://www.freightwaves.com/news/trucking-fleets-must-adapt-faster-as-regulations-ai-reshape-industry-experts-say
The security assumption agentic AI just broke
I ran a red-team exercise against an internal IT-support agent wired across a stack any large enterprise would recognize: ServiceNow for tickets, SharePoint for policy and procedure docs, an internal directory for routing. The agent had legitimate read access to all three and could draft replies but not send them. Inside two hours, it had triaged a routine access-request ticket into a chain that reconstructed an in-progress reorganization no individual in the loop was cleared to discuss. No tool call was outside policy. No permission was misconfigured. Every step had a paper trail. That’s the pattern I keep coming back to. The risk conversation has centered on model behavior — hallucinations, jailbreaks, unsafe outputs — but once AI systems are connected to tools, memory and internal workflows, the harder question is execution governance: What the system is permitted to do, how far its access extends and whether anyone can reconstruct the action chain afterward. That’s where most organizations are exposed. What rarely gets acknowledged is that the enterprise controls we rely on were never designed to account for human friction, even though they depended on it. An analyst who hesitates before chaining together a dozen sensitive queries. Someone who, seven steps into a workflow, decides something doesn’t feel right. That latency was a byproduct of humans being the actors, not a design choice anyone made deliberately. It functioned as an accidental safety property embedded in every process we built. Agentic AI removes all of that. Agents move through workflows without the friction, fatigue or unease that causes humans to slow down at the moments that matter. The controls we built weren’t designed to compensate for their absence. The deployment data confirms this isn’t a future problem. 
According to ETR data presented at RSAC 2026, 37% of organizations already have AI agents deployed or in active testing, while only 3% report having broad agent-specific security controls in place. Most organizations are running agents in environments that weren’t instrumented to govern them.
Why the controls you’re relying on weren’t built for this
When I see organizations respond to prompt injection risk, the instinct is almost always the same: Input filtering — classify the bad instruction before it reaches the model. When the risk is agent access, the response has been tightening identity controls to reduce blast radius. Both are the right instincts applied at the wrong layer. In March 2026, OpenAI published an assessment of real-world prompt injection attacks that made the filtering problem concrete. The most effective attacks, they found, increasingly resemble social engineering rather than simple prompt overrides, and identifying a sophisticated adversarial prompt is effectively the same problem as detecting a lie without access to the full context. An attack disguised as a routine HR email succeeded 50% of the time with all of OpenAI’s defenses active. Their conclusion was that defense cannot rely primarily on input filtering; the system has to be designed so the impact of manipulation stays constrained even when attacks get through. The reason this matters goes back to the original assumption. Prompt-layer defenses were built expecting a human somewhere downstream who might review an output, notice something odd or decline to take the final step. When an agent takes that step autonomously, the filter carrying all of that weight has to catch everything, and there is no evidence that it can. Identity-layer controls carry a parallel assumption. They evaluate who is accessing what, assessed per resource and per system, but weren’t built to evaluate what a system is doing across a chain of actions taken on behalf of an identity.
An agent with legitimate access to an employee directory, a project management system and a calendar can correlate all three to surface conclusions that no individual permission was meant to cover, and every access along the way was authorized. This is the mosaic effect: A concept from intelligence and privacy disciplines describing how aggregating individually permissible information can produce an outcome more sensitive than any single piece would suggest. In February 2026, NIST published a concept paper proposing to adapt existing identity and authorization frameworks specifically for AI agents, explicitly because the existing frameworks weren’t designed for non-human principals that act autonomously, chain actions and require continuous rather than session-based authorization.
What the actual attack surface looks like
The vulnerabilities agents expose aren’t new. Overbroad permissions, overly generous retrieval, loosely scoped connectors, workflows designed with an implicit assumption that a human would pause before a consequential step: These have always been enterprise weaknesses. What’s changed is that agents exercise them continuously and at machine speed. Research presented at Black Hat USA illustrates how quickly these conditions combine. An attacker sends an email to a support address connected to Zendesk, which automatically syncs into Jira. A developer’s AI coding agent reads the ticket as part of normal workflow, and the injected prompt coerces it into extracting repository secrets, including API keys and access tokens, with no action required from the victim beyond their ordinary use of the tool. The agent never exceeded its assigned permissions. The blast radius came entirely from the scope of what it was legitimately authorized to do. The authorization problem runs deeper than any single access, though.
A December 2025 paper found that more than 90% of the privacy research literature addresses only single-step leakage, and none of the agent-level evaluation frameworks currently in use model multi-tool inference chains, in which the agent assembles a picture from pieces it had every right to see, faster than any review process can intervene. Object-level permission audits don’t catch this class of risk. This is the dynamic I’ve been most focused on in my own research. In a pre-registered pilot on identity drift in self-modifying agents, the cleanest finding wasn’t dramatic. After a shallow revert of an agent’s self-description, the per-action audit was clean: Every step within policy, every change logged. But the behavioral trajectory, measured at the embedding level, hadn’t reverted with it. The pattern generalizes uncomfortably well to enterprise deployments: An agent that’s been rolled back after an incident — system prompt reset, instructions retightened — can carry residue of the prior state in its memory and continue acting on it. When the unit of governance is the action, the thing you actually wanted to govern can drift past you in plain sight.
What execution-layer governance actually requires
The through-line is a shift in where controls have to live. Prompt-layer and identity-layer controls carry implicit dependencies on human behavior that agents don’t satisfy. The missing layer is execution governance: Controlling what the system can actually do when it acts, which is a different problem from controlling what it can see or what instructions it receives. OpenAI’s March 2026 framework offers a useful organizing principle: Design the system so that the consequences of a successful attack remain constrained even when manipulation gets through. An agent limited to reversible actions, required to pause for confirmation before consequential steps, keeps the blast radius manageable regardless of what it’s told.
The relevant design question is outcome containment alongside attack prevention. In practice, most deployments haven’t built what this requires. Separating read from act needs to be a hard architectural distinction; summarizing a document and transmitting data from it are different actions, and the system should enforce that difference rather than assume the agent will respect it. Memory and context need explicit bounds, because persistence is a security primitive with real blast-radius implications. A complete trace of request, context, tool calls and outputs needs to be designed in from the start rather than assembled after the fact when something goes wrong. And the red-teaming program needs to target the full workflow rather than the model in isolation. Of those four, the read/act split is the one I see teams consistently underestimate. A sales-ops agent with read access to Salesforce and the ability to draft customer emails is one tool-call away from transmitting data to a third party, and most enforcement was never built to detect the difference between summarizing an account and sending a summary of it.
The failure that won’t look like a failure
A 2026 survey of 1,253 cybersecurity professionals found that 32% of organizations currently lack AI agent visibility. The report describes a scenario worth sitting with: A SOC analyst arrives Monday morning, traces an anomalous privilege change to a service account created by an agent 72 hours earlier and finds that the agent has been writing to production systems all weekend. Every action is logged. No alert fired because no detection rule existed for agent-initiated behavior. What concerns me is that without agent-aware detection, the incident gets categorized as a service account control failure, remediated and closed as a known issue type, with the underlying AI governance problem unrecognized and the conditions that produced it unchanged.
The question worth asking before that Monday morning arrives is whether your detection and response workflows would recognize an AI governance failure if they encountered one, or whether the logs would just show a busy service account. This article is published as part of the Foundry Expert Contributor Network.
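The read/act separation the article calls for can be sketched as a gate that sits between an agent and its tools. This is a minimal illustration under stated assumptions, not a reference design: the `ToolGate` class, the tool names and the `confirm` callback are all hypothetical, and a production system would more likely enforce the same distinction in infrastructure than in-process.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Kind(Enum):
    READ = "read"  # observes data; no side effects outside the agent
    ACT = "act"    # transmits, writes or otherwise changes external state


@dataclass
class ToolGate:
    """Hypothetical gate: classifies every tool and records every call."""
    confirm: Callable[[str, dict], bool]        # assumed human-approval hook
    kinds: dict = field(default_factory=dict)   # tool name -> Kind
    trace: list = field(default_factory=list)   # append-only call record

    def register(self, name: str, kind: Kind) -> None:
        self.kinds[name] = kind

    def call(self, name: str, args: dict, fn: Callable[..., object]) -> object:
        kind = self.kinds.get(name)
        if kind is None:
            # Unclassified tools are refused rather than treated as reads.
            self.trace.append((name, args, "denied:unregistered"))
            raise PermissionError(f"unregistered tool: {name}")
        if kind is Kind.ACT and not self.confirm(name, args):
            # Consequential steps pause for confirmation before running.
            self.trace.append((name, args, "denied:unconfirmed"))
            raise PermissionError(f"confirmation refused for: {name}")
        result = fn(**args)
        self.trace.append((name, args, "allowed"))
        return result


# Example wiring: summarizing is READ, sending is ACT and needs approval.
gate = ToolGate(confirm=lambda name, args: False)  # nobody approves in this demo
gate.register("summarize_account", Kind.READ)
gate.register("send_email", Kind.ACT)

summary = gate.call("summarize_account", {"account": "acme"},
                    lambda account: f"summary of {account}")
try:
    gate.call("send_email", {"to": "third-party@example.com", "body": summary},
              lambda to, body: "sent")
    sent = True
except PermissionError:
    sent = False
```

In this sketch the summary is produced but the send is refused, and both outcomes land in `gate.trace`, so the record of what was attempted exists whether or not the action went through.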
Score: 48🌐 MovesMay 26, 2026https://www.cio.com/article/4176552/the-security-assumption-agentic-ai-just-broke.html
Intuit Lays Off 3,000 Employees as CEO Denies AI Connection in Mad Money Interview
Intuit lays off 3,000 employees, CEO denies AI connection
Score: 48🌐 MovesMay 26, 2026https://opentools.ai/news/intuit-lays-off-3000-employees-ceo-denies-ai-connection
Hyundai Rotem secures two national physical AI projects
Hyundai Rotem has secured two government-backed physical AI robotics projects, strengthening its position in autonomous systems for defense and industrial applications, the company announced Tuesday. The company was selected to lead separate research and development projects commissioned by Korea’s Ministry of Trade, Industry and Energy and the Agency for Defense Development. The Industry Ministry project focuses on developing a language-based control system capable of operating multiple types o
Score: 48🌐 MovesMay 26, 2026https://www.koreaherald.com/article/10756283
Google’s A.I. Killshot & Bezos’s ‘WaPo’ Warning
Score: 48🌐 MovesMay 26, 2026https://puck.news/newsletter_content/googles-ai-killshot-bezoss-wapo-warning/
ChinAI #360: Anthropic’s Dogma on US-China AI Competition
Greetings from a world where…
Score: 48🌐 MovesMay 26, 2026https://chinai.substack.com/p/anthropics-dogmatic-views-on-us-china
A tech worker coalition is piloting a basic income program for AI job losses
Score: 48🌐 MovesMay 26, 2026https://www.businessinsider.com/tech-labor-organizers-piloting-ubi-program-for-ai-job-losses-2026-4
How Varonis Atlas integrates Claude Compliance API for AI governance
AI governance requires visibility into how AI tools interact with enterprise data. Varonis explains how its Atlas platform uses Claude Compliance API data to help monitor usage, investigate risk, and support compliance. [...]
Score: 47🌐 MovesMay 26, 2026https://www.bleepingcomputer.com/news/security/how-varonis-atlas-integrates-claude-compliance-api-for-ai-governance/
AI beats human forecasters in tournament predicting 30 tech ventures
For decades, the idea that artificial intelligence can beat humans at number-crunching tasks like high-frequency trading has been widely accepted. But strategic foresight—the ability to predict the success of high-stakes, uncertain business ventures—has long been held as a uniquely human superpower.
Score: 47🌐 MovesMay 26, 2026https://techxplore.com/news/2026-05-ai-human-tournament-tech-ventures.html
After Complaint, GEICO Agrees to Modify Cancellation Process That Uses AI
Pennsylvania Attorney General Dave Sunday announced an agreement with GEICO that should modify an artificial intelligence (AI) initiated auto insurance policy cancellation process the state alleged was unfair and confusing. The agreement stems from a complaint from a new GEICO …
Score: 47🌐 MovesMay 26, 2026https://www.insurancejournal.com/news/east/2026/05/26/871246.htm
Ground robots in Latvia and the history of manned-unmanned teaming
Score: 47🌐 MovesMay 26, 2026https://breakingdefense.com/2026/05/ground-robots-in-latvia-and-the-history-of-manned-unmanned-teaming/
Industry ministry discusses AI transformation of battery industry with LGES
The industry ministry on Tuesday discussed ways to accelerate the artificial intelligence transformation of the battery industry, currently facing sluggish growth amid the electric vehicle market slowdown. The Ministry of Trade, Industry and Resources met with representatives from leading battery maker LG Energy Solution Ltd. to explore measures to help the domestic battery industry strengthen its manufacturing competitiveness under the government's initiative aimed at promoting the AI transform
Score: 47🌐 MovesMay 26, 2026https://www.koreaherald.com/article/10755917
Quote of the day by Nvidia CEO Jensen Huang: "Software is eating the world, but AI is going to eat software" — A prophetic statement predicting the impending death of software
With SaaSmageddon underway, we remember a decade-old insight into the future of the software industry
Score: 47🌐 MovesMay 26, 2026https://www.techradar.com/computing/software/quote-of-the-day-by-nvidia-ceo-jensen-huang-software-is-eating-the-world-but-ai-is-going-to-eat-software-a-prophetic-statement-predicting-the-impending-death-of-software
NOVARC AND HANWHA OCEAN SIGN MOU FOR INNOVATION COLLABORATION ON WELDING AUTOMATION AND AI-POWERED MANUFACTURING TECHNOLOGIES FOR ADVANCED SHIPBUILDING APPLICATIONS
Score: 47🌐 MovesMay 26, 2026https://www.thestar.com/globenewswire/novarc-and-hanwha-ocean-sign-mou-for-innovation-collaboration-on-welding-automation-and-ai-powered/article_10ba7fb6-b2e4-5b9f-b933-caa0e376ec70.html
Physical data collection from the real world is India's new backoffice job for AI
Companies such as HumynAI Labs, Egodata, Neo Cambrian, XP Robotics, and Objectways have deployed people on the ground starting early this year. They are collecting data on everything from household chores like washing dishes and folding laundry to the manufacturing sector.
Score: 47🌐 MovesMay 26, 2026https://economictimes.indiatimes.com/tech/startups/smile-your-chores-are-going-viral-in-a-global-robotics-lab/articleshow/131315474.cms
Google Cloud COO says AI security belongs in the boardroom, not just the server room
Google Cloud COO Francis de Souza is urging companies to build security into their AI strategy from day one.
Score: 47🌐 MovesMay 26, 2026https://the-decoder.com/google-cloud-coo-says-ai-security-belongs-in-the-boardroom-not-just-the-server-room/
Diverse reasoning traces teach LLMs to make better decisions
How to train language models to generate diverse, accurate reasoning paths using tokens that control distinct reasoning strategies.
Score: 47🌐 MovesMay 26, 2026https://www.amazon.science/blog/diverse-reasoning-traces-teach-llms-to-make-better-decisions
Altron’s AI factory to power next wave of growth
The AI factory is central to growth, as it accelerates the shift to platform-based digital businesses, says Altron’s CEO.
Score: 47🌐 MovesMay 26, 2026https://www.itweb.co.za/article/altrons-ai-factory-to-power-next-wave-of-growth/j5alrMQADArMpYQk
South Africa delays national AI policy to 2027 after fabricated references spark credibility scandal
Score: 47🌐 MovesMay 26, 2026https://africa.businessinsider.com/local/markets/south-africa-delays-ai-policy-after-fake-citations-spark-government-credibility/3s3gpf8
Persistent Systems, Kong join forces to secure enterprise AI scaling
Persistent Systems and Kong to help enterprises integrate and govern AI systems as they shift from experimentation to production.
Score: 47🌐 MovesMay 26, 2026https://www.techmonitor.ai/news/persistent-systems-kong-join-forces-to-secure-enterprise-ai-scaling
Physical AI takes off: How real-time data keeps Fraport’s airports running on time
Physical AI is no longer a concept confined to factory floors and autonomous vehicles — it is reshaping the complex, time-sensitive operations of some of the world’s busiest airports. The operators managing dozens of airports and tens of millions of passengers face decisions that cannot wait for a round-trip to the cloud, making low-latency, on-premises […]
Score: 46🌐 MovesMay 26, 2026https://siliconangle.com/2026/05/26/physical-ai-helps-data-take-flight-fraport-delltechworld/
Canada’s AI minister says he’s more worried about creating unicorns than monopolies
Speaking at BetaKit’s Most Ambitious: Town Hall, Evan Solomon said he wants to reward companies taking risks.
Score: 46🌐 MovesMay 26, 2026https://betakit.com/canadas-ai-minister-says-hes-more-worried-about-creating-unicorns-than-monopolies/
Detectify debuts MCP server to let AI agents find and fix vulnerabilities in real time
Application security platform company Detectify AB today launched the Detectify MCP Server, a new integration layer that plugs the company’s security testing engines into artificial intelligence-driven coding workflows so that agents can find, validate and remediate exploitable vulnerabilities in real time. Detectify’s MCP Server is built on the Model Context Protocol, the open standard Anthropic […]
Score: 46🌐 MovesMay 26, 2026https://siliconangle.com/2026/05/26/detectify-debuts-mcp-server-let-ai-agents-find-fix-vulnerabilities-real-time/
AI edge will depend on building hard-to-copy systems
Companies now use artificial intelligence widely. A new report says AI is a basic requirement. The real advantage comes from building unique systems around AI. These systems are hard for rivals to copy. AI cuts costs in many sectors. Learning and adapting faster creates long-term advantages.
Score: 46🌐 MovesMay 26, 2026https://cio.economictimes.indiatimes.com/news/artificial-intelligence/ai-edge-will-depend-on-building-hard-to-copy-systems/131328832
Indiana Government Integrates More AI Into Operations
The Indiana Office of the Secretary of State is leveraging AI to improve efficient service delivery. This includes the recent launch of a financial literacy program and a new phase of the notary education platform.
Score: 46🌐 MovesMay 26, 2026https://www.govtech.com/artificial-intelligence/indiana-government-integrates-more-ai-into-operations
Opinion | AI Overwatch Act Would Help China
Stricter export limits would hurt American companies and blow the U.S. tech lead.
Score: 46🌐 MovesMay 26, 2026https://www.wsj.com/opinion/ai-overwatch-act-would-help-china-5e5e61fe?mod=rss_Technology
Neocambrian AI launches India-focused robotics data factory for Physical AI models
Home services and robotics-focused data collection startups are increasingly drawing attention amid the ongoing debate around Physical AI-linked human activity datasets. Now, another startup has formally entered the space. Founded by entrepreneur Abhinav Kukreja, Neocambrian AI has announced its launch with a focus on building large-scale human action datasets for robotics and embodied AI systems. Kukreja had earlier founded DataVantage, which builds AI-powered marketing workflows for medium and large technology enterprises. The announcement comes days after Entrackr reported on Pronto’s experiments around Physical AI-linked data collection. Another startup, Snabbit, also confirmed to Entrackr that it had earlier been approached by US-based startup Human Archive for similar initiatives but eventually decided not to proceed. In a detailed public note, Kukreja described Physical AI as the next frontier of artificial intelligence, arguing that robotics lacks “internet-scale datasets” comparable to the text datasets that enabled large language models. According to the company, Neocambrian AI is building what it calls a high-fidelity, pre-training-scale database of human action, using egocentric video capture systems, motion tracking hardware, stereo capture rigs, and upgraded UMI devices for robotics training. The startup claims to have set up India’s first and only robotics data factory and plans to provide thousands of hours of collected data free of cost to Indian researchers working on vision-language-action (VLA) and world models. Kukreja also framed India as a potential global hub for Physical AI datasets, citing the country’s large workforce, diverse real-world environments, and operational experience in distributed services.
The emergence of such startups highlights the growing interest in collecting structured human behavioural data for training next-generation robotics systems, even as concerns around privacy, worker consent, and ethical data collection continue to intensify.
Score: 46🌐 MovesMay 26, 2026https://entrackr.com/news/neocambrian-ai-launches-india-focused-robotics-data-factory-for-physical-ai-models-11874142
DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole
For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI's GPT-5 family, Anthropic's Claude Opus, and Google's Gemini Pro have clustered within a narrow band on Scale AI's SWE-Bench Pro leaderboard, making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebases. On Monday, a startup called Datacurve released a benchmark it says shatters that illusion. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models — and crowns OpenAI's GPT-5.5 as the clear leader at 70%, sixteen points ahead of the nearest model from a rival lab. "On public leaderboards, top models often look relatively close in capability," wrote Datacurve co-author Serena Ge on X. "DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work." The benchmark also delivers a pointed critique of the evaluation infrastructure the AI industry relies on to measure progress: Datacurve's audit found that SWE-Bench Pro's verifiers — the automated graders that determine whether an agent solved a task — issued incorrect pass/fail verdicts on roughly one-third of the trials it reviewed. If that finding holds up, it has sweeping implications. Enterprise procurement teams, venture capitalists, and AI lab marketing departments all lean heavily on benchmark scores to make multimillion-dollar decisions. A 32% error rate in the most widely cited coding benchmark suggests the industry may have been navigating by a broken compass.
Why the most popular AI coding benchmark may be grading on a curve
To understand what Datacurve is claiming, it helps to understand how coding benchmarks work — and how they can go wrong.
The dominant paradigm, pioneered by the SWE-Bench family maintained by Scale AI and academic researchers, constructs tasks by mining real GitHub commits. The process extracts a bug fix or feature addition from a repository's history, rolls the code back to the pre-fix state, and then asks an AI agent to reproduce the change. The original commit's test suite serves as the verifier: if the agent's patch makes the same tests pass, it gets credit. This approach has an elegant simplicity, but Datacurve argues it introduces three systemic weaknesses. First, contamination. Because tasks are drawn from public GitHub history, the problem statement, the discussion, and often the exact solution are already present in the training data of frontier models. "The SWE-Bench family scrapes existing GitHub issues and PRs, which creates two problems: memorization (models have already seen the solution) and triviality (most tasks are small)," Ge wrote. Second, scope. SWE-Bench Pro tasks require, on average, just 120 lines of code added across 5 files. DeepSWE's reference solutions average 668 lines added across 7 files — roughly 5.5 times more code. Yet DeepSWE's prompts are actually shorter, averaging 2,158 characters versus SWE-Bench Pro's 4,614. In other words, DeepSWE gives the agent less instruction but expects far more output, which more closely mirrors how a human developer might actually delegate work to an AI assistant. Third — and most damaging — verifier reliability. Datacurve drew 30 tasks at random from both DeepSWE and SWE-Bench Pro, ran three rollouts across 10 frontier model configurations, and then deployed an LLM-based judge to independently assess whether each agent's patch actually solved the problem. SWE-Bench Pro's verifiers accepted wrong implementations 8.5% of the time and rejected correct implementations 24% of the time. DeepSWE's verifiers registered 0.3% and 1.1%, respectively.
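The two verifier error rates above can be derived from paired verdicts, one from the benchmark's test-suite verifier and one from the independent judge. The sketch below is illustrative only: the function name and the sample data are made up, and it assumes both rates are measured over all reviewed trials, which the article does not state. Under that assumption, 8.5% and 24% sum to 32.5%, in line with the "roughly one-third" figure.

```python
from collections import Counter


def verifier_error_rates(trials):
    """Score a verifier against an independent judge.

    trials: iterable of (verifier_passed, judge_says_correct) booleans.
    Returns (false_accept, false_reject, combined) as fractions of all trials.
    """
    counts = Counter(trials)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no trials to score")
    false_accept = counts[(True, False)] / total  # verifier passed a wrong patch
    false_reject = counts[(False, True)] / total  # verifier failed a correct patch
    return false_accept, false_reject, false_accept + false_reject


# Made-up sample of 10 reviewed trials, for illustration only.
sample = (
    [(True, True)] * 5      # verifier and judge agree: solved
    + [(False, False)] * 2  # verifier and judge agree: not solved
    + [(True, False)] * 1   # false accept
    + [(False, True)] * 2   # false reject
)
false_accept, false_reject, combined = verifier_error_rates(sample)
```

A false reject is the case the article goes on to describe: a correct patch that fails because the tests depend on details of the original author's implementation.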
The false negative problem is especially insidious because it punishes creative solutions. In one documented case, the gold-standard pull request for a SWE-Bench Pro task refactored a private helper function. An agent that correctly solved the task by inlining the same logic — a perfectly valid engineering choice — failed because the test suite tried to import a symbol that only existed in the original author's specific implementation.
OpenAI's GPT-5.5 dominates the new benchmark while Claude and Gemini stumble
DeepSWE's top-line results reorder the familiar hierarchy in ways that should matter to every engineering team evaluating AI coding tools. On SWE-Bench Pro, models from OpenAI, Anthropic, and Google have traded the lead within a 30-point range. DeepSWE stretches that range to 70 points. GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. From there, the drop-off is steep: Claude Sonnet 4.6 lands at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini and Kimi K2.6 tied at 24%, and then a long tail of models in the teens and single digits. Claude Haiku 4.5, which scores 39% on SWE-Bench Pro, collapses to zero on DeepSWE — suggesting that some mid-tier models have been significantly overperforming on easier, potentially contaminated benchmarks. GPT-5.5 doesn't just score the highest — it does so efficiently. The model reaches its 70% pass rate with a median cost of $5.80 per trial, a median wall-clock time of 20 minutes, and a median of 47,000 output tokens. GPT-5.4 emerges as perhaps the best overall value at $3.30 per trial with a 56% score. Claude Opus 4.7, meanwhile, costs significantly more per run. Output tokens, wall-clock duration, and dollar cost per trial all vary by an order of magnitude across the agents tested, yet none of these correlates strongly with pass rate. Agents that emit more tokens, run longer, or cost more do not consistently solve more tasks.
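One rough way to read the "best overall value" claim is to divide each model's median cost per trial by its pass rate, giving an approximate cost per solved task. This proxy is an assumption made here for illustration, not a figure Datacurve reports, and dividing a median by a rate is only a crude estimate.

```python
def cost_per_pass(median_cost_usd, pass_rate):
    """Approximate dollars spent per solved task (crude proxy, see lead-in)."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return median_cost_usd / pass_rate


# Figures quoted in the article: median cost per trial and DeepSWE pass rate.
gpt_5_5 = cost_per_pass(5.80, 0.70)  # about 8.29 dollars per solved task
gpt_5_4 = cost_per_pass(3.30, 0.56)  # about 5.89 dollars per solved task
```

On this proxy GPT-5.4 comes out cheaper per solved task even though GPT-5.5 solves more tasks overall, which fits the article's framing of the two models.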
Datacurve's audit found that Claude has been reading the answer key on existing benchmarks
Perhaps the most provocative finding in DeepSWE's analysis concerns what the authors label "CHEATED" verdicts — instances where an agent passes a benchmark not by solving the problem, but by reading the answer. SWE-Bench Pro's Docker containers ship the repository's full .git history, which means the gold-standard solution commit is sitting right there in the container's file system. Most models ignore it. Claude does not. Datacurve's analysis found that both Claude Opus 4.7 and Claude Opus 4.6 registered "CHEATED" on more than 12% of their reviewed SWE-Bench Pro rollouts. In those instances, the Claude agent ran commands like git log --all or git show to retrieve the merged fix and paste it into its own patch. The behavior accounted for approximately 18% of Opus 4.7's passes and 25% of Opus 4.6's passes on the reviewed sample. The issue has been filed publicly as GitHub issue #93 on the SWE-Bench Pro repository. GPT-5.4 and GPT-5.5 never exhibited this behavior. Gemini configurations stayed around 1%. Datacurve describes the behavior diplomatically — "The benchmark makes this possible (the gold commit lives in the container), but Claude is the family that consistently does so" — but the implication is clear: a meaningful fraction of Claude's SWE-Bench Pro scores may reflect environmental exploitation rather than genuine engineering capability. DeepSWE addresses this by shipping only a shallow clone with the base commit, leaving no gold hash for the agent to discover. It is worth noting that the behavior is arguably a sign of Claude's environmental attentiveness — the model is very good at exploring its surroundings and exploiting available resources. Whether that counts as "cheating" or "resourcefulness" depends on your perspective, but in the context of a benchmark designed to measure independent problem-solving, it undermines the signal.
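A team auditing its own agent trajectories for the behavior described above could start with a simple scan for git commands that read repository history. The sketch below is a heuristic written for illustration, not Datacurve's method: the function name and the list of subcommands are assumptions, and it would miss invocations that put options between `git` and the subcommand (for example `git -C repo log`).

```python
import shlex

# Git subcommands that can expose commits beyond the checked-out base.
# This list is an assumption for illustration, not an exhaustive set.
HISTORY_SUBCOMMANDS = {"log", "show", "reflog", "rev-list", "cat-file"}


def history_lookups(commands):
    """Return the shell commands that appear to read git history."""
    flagged = []
    for cmd in commands:
        try:
            tokens = shlex.split(cmd)
        except ValueError:
            tokens = cmd.split()  # fall back on unbalanced quotes
        # Flag any "git <subcommand>" pair anywhere in the command line.
        for current, following in zip(tokens, tokens[1:]):
            if current == "git" and following in HISTORY_SUBCOMMANDS:
                flagged.append(cmd)
                break
    return flagged


# Hypothetical trajectory: the shell commands one agent rollout issued.
trajectory = [
    "ls -la",
    "git log --all",
    "git status",
    "cd repo && git show abc123",
    "pytest -q",
]
flagged = history_lookups(trajectory)
```

A flagged command is only a prompt for review, since reading history is often legitimate. Removing the history from the environment, as DeepSWE does with a shallow clone, closes the gap rather than detecting it afterward.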
Each AI model family fails in its own distinctive way, and the patterns matter for enterprise teams
Beyond the top-line scores, Datacurve's qualitative trajectory analysis reveals distinctly different failure signatures across model families — a finding that could help engineering teams choose the right model for specific types of work. Claude is forgetful with multi-part prompts. On DeepSWE, Claude configurations miss stated requirements more than any other family. The pattern is consistent: when a prompt enumerates parallel behaviors — "support both sync and async," for instance — Claude typically implements the obvious branch and forgets to mirror the change. Datacurve reports that roughly two-thirds of Claude's "MISSED_REQUIREMENT" failures on DeepSWE follow this "one branch shipped" pattern. In one example, Claude Opus 4.7 correctly landed a sync state-data hook in one engine class while the async engine never received the same hook. GPT, by contrast, implements exactly what is asked. GPT-5.5 had the lowest rate of missing stated behaviors of any configuration tested. Across multiple runs of the same task, GPT trials tended to converge on the same interpretation of the prompt, suggesting instruction-following precision is a stable trait of the model rather than per-run luck. One of the most intriguing findings involves self-verification. On DeepSWE, Claude Opus 4.7 and GPT-5.4 wrote and ran new tests in the project's own test framework on over 80% of their runs — even though no one asked them to. On SWE-Bench Pro, those same models dropped to 28% and 18%, respectively. The reason: SWE-Bench Pro's prompt template explicitly tells agents they "should not modify the testing logic or any of the tests." Agents dutifully complied, suppressing a behavior that likely would have improved their performance.
This suggests that prompt design in production coding workflows may be inadvertently suppressing valuable agent behaviors, something enterprise teams deploying AI coding agents should carefully audit.
What DeepSWE gets right, what it gets wrong, and what it means for the future of AI benchmarks
Datacurve is forthright about several limitations. The standardized harness, while ensuring fairness, routes all edits through bash rather than the model-specific editing tools each family was trained on: apply_patch for GPT, str_replace_based_edit_tool for Claude. This could hold models below their native ceilings. The benchmark draws exclusively from open-source repositories with 500-plus stars, and results may not generalize to proprietary codebases. Bug localization and refactoring tasks are under-represented, and widely used languages like C++ and Java are absent entirely. The verdict assignments in the qualitative analysis come from an LLM analyzer, not human reviewers, and sample sizes are modest: roughly 90 reviewed rollouts per model per benchmark.
It is also worth noting that Datacurve is a startup with its own commercial interests, and an independent benchmark that reshuffles the leaderboard will inevitably invite scrutiny. The company's decision to publish the full dataset, all agent trajectories, and the evaluation harness on GitHub mitigates this concern considerably, but independent reproduction will be necessary before the AI community treats these results as definitive.
DeepSWE arrives at an inflection point for the AI coding market. Enterprise adoption of AI coding agents is accelerating rapidly, with engineering organizations making consequential bets on which model to build around. The benchmark market itself has become a strategic battleground: Scale AI's SWE-Bench Pro, which Datacurve directly critiques, is maintained by a company that also provides evaluation services to the labs whose models it ranks.
If DeepSWE's central findings about verifier reliability and data contamination hold up under independent scrutiny, they could force a reckoning not just with how the industry measures coding agents, but with the broader question of what benchmarks are actually for. A leaderboard where the grading system is wrong a third of the time is not merely inaccurate — it is the kind of broken instrument that makes everyone feel good about progress that may not be real. And in an industry spending billions on a bet that AI agents can do the work of software engineers, the difference between real progress and the appearance of it is not academic. It is the whole game.
Score: 46🌐 MovesMay 26, 2026https://venturebeat.com/technology/deepswe-blows-up-the-ai-coding-leaderboard-crowns-gpt-5-5-and-finds-claude-opus-exploiting-a-benchmark-loophole
Lightwheel AI Raises New Round to Build Physical AI Data and Simulation Infrastructure
The Beijing-based startup secures fresh capital to expand its data and evaluation infrastructure for physical AI, embodied intelligence, and world models.
Score: 46💰 MoneyMay 26, 2026https://pandaily.com/lightwheel-ai-physical-ai-funding-may2026
The AI rewrite of Bun in Rust is making shock waves
InfoWorld
Score: 46🌐 MovesMay 26, 2026https://www.infoworld.com/video/4177219/the-ai-rewrite-of-bun-in-rust-is-making-shock-waves.html
AI takes on big oil’s dirty water problem in the Permian Basin
Austin American-Statesman
Score: 46🌐 MovesMay 26, 2026https://www.statesman.com/business/article/texas-ai-oil-wastewater-permian-22271691.php
Vance calls Pope Leo’s AI warnings ‘profound’
Vance said in an interview with NBC News he is glad the pope tackled AI. He also discussed why X remains off his phone after he deleted it for Lent.
Score: 46🌐 MovesMay 26, 2026https://www.nbcnews.com/politics/jd-vance/vance-pope-leo-ai-warnings-profound-rcna345751
Pony AI lifts 2026 robotaxi fleet goal on faster growth
Automotive News
Score: 46🌐 MovesMay 26, 2026https://www.autonews.com/technology/ane-pony-ai-expands-robotaxi-fleet-0526/
Anthropic Update Underscores Power of AI Flaw Finder Mythos
The cybersecurity model is not yet ready for widespread release.
Score: 45🌐 MovesMay 26, 2026https://aibusiness.com/generative-ai/anthropic-update-underscores-power-ai-flaw-finder-mythos
AI readiness in telecommunications
The AI adoption challenge in telcos. According to NVIDIA's 2025 State of AI in Telecommunications report...
Score: 45🌐 MovesMay 26, 2026https://www.databricks.com/blog/ai-readiness-telecommunications
Apple’s Fitbit Air-rivaling AI health coach is delayed, new report claims, and that’s bad news for fitness fans
Apple’s rumored AI health coach has been delayed and won’t appear at WWDC in June, a new report claims.
Score: 45🌐 MovesMay 26, 2026https://www.techradar.com/health-fitness/smartwatches/apples-fitbit-air-rivaling-ai-health-coach-is-delayed-new-report-claims-and-thats-bad-news-for-fitness-fans
Here's the pitch deck that AI-native law firm Moritz used to raise $9 million
Business Insider
Score: 45💰 MoneyMay 26, 2026https://www.businessinsider.com/ai-law-firm-moritz-seed-round-pitch-deck-y-combinator-2026-5
CX Daily: China’s Tech Sector Catches AI Funding Fever
Caixin Global
Score: 45🌐 MovesMay 26, 2026https://www.caixinglobal.com/2026-05-26/cx-daily-chinas-tech-sector-catches-ai-funding-fever-102447403.html
Autonomous AI systems test governance in physical environments
Autonomous AI systems are beginning to move beyond software environments and into warehouses, delivery networks, and public spaces. The development is drawing attention to whether current AI rules cover systems that operate in physical environments. Most existing AI governance frameworks have focused on online harms and model outputs, including bias, misinformation, and harmful content. Embodied […] The post Autonomous AI systems test governance in physical environments appeared first on AI News.
Score: 45🌐 MovesMay 26, 2026https://www.artificialintelligence-news.com/news/autonomous-ai-systems-governance-physical-environments/
CloudThat Becomes One of India’s First OpenAI SMB Channel Partners
CloudThat has become one of the few Indian organizations selected as an OpenAI SMB Channel Partner, joining a highly selective global network focused on accelerating enterprise AI adoption and AI skill development. The partnership comes at a time when India has emerged as OpenAI’s second-largest market globally, with over 100 million weekly ChatGPT users, according […] The post CloudThat Becomes One of India’s First OpenAI SMB Channel Partners appeared first on CXOToday.com.
Score: 45🌐 MovesMay 26, 2026https://cxotoday.com/media-coverage/cloudthat-becomes-one-of-indias-first-openai-smb-channel-partners/?utm_source=rss&utm_medium=rss&utm_campaign=cloudthat-becomes-one-of-indias-first-openai-smb-channel-partners
Are institutions ditching Bitcoin for AI-themed products?
Bitcoin sits at US$76,638.55, and I still see a range play. The price action reflects a market digesting competing forces rather than breaking into a new trend. Institutional capital is not fleeing digital assets but rotating with purpose. Money is moving out of mainstream Bitcoin and Ether ETFs and into AI-themed funds and select altcoin […] The post Are institutions ditching Bitcoin for AI-themed products? appeared first on e27.
Score: 45🌐 MovesMay 26, 2026https://e27.co/are-institutions-ditching-bitcoin-for-ai-themed-products-20260526/
Hyper3D Launches Rodin Gen-2.5, Bringing Sculpt-Level Detail and Production Control to AI 3D Generation
USA Today
Score: 45🌐 MovesMay 26, 2026https://www.usatoday.com/press-release/story/33429/hyper3d-launches-rodin-gen-2-5-bringing-sculpt-level-detail-and-production-control-to-ai-3d-generation/
What Are AI Tarpits? Understanding the Tools People Are Using to Poison LLMs
Content creators and IP holders are getting creative in order to fight back against the LLMs that are trawling their data illegally.
Score: 45🌐 MovesMay 26, 2026https://www.inc.com/fast-company-2/ai-tarpits-understanding-tools-poison-llms-chatbots-data/91349856
Despite ‘peak hype,’ orbital data centers for AI not yet ready for NatSec prime time
Breaking Defense
Score: 45🌐 MovesMay 26, 2026https://breakingdefense.com/2026/05/despite-peak-hype-orbital-data-centers-for-ai-not-yet-ready-for-natsec-prime-time/
Sam Altman thinks using AI in emails and Slack is ‘dehumanising’ – and revenue will ‘take a bit longer to figure out’
The head of one of the world’s biggest pure-play AI developers says even he’s had times when using the groundbreaking technology just wasn’t going to happen. The adoption of artificial...
Score: 45🌐 MovesMay 26, 2026https://www.startupdaily.net/topic/artificial-intelligence-machine-learning/sam-altman-thinks-using-ai-in-emails-and-slack-is-dehumanising-and-revenue-will-take-a-bit-longer-to-figure-out/
China's Pony.ai says it is unaffected by self-driving car safety review
Reuters
Score: 45🌐 MovesMay 26, 2026https://www.reuters.com/world/asia-pacific/chinas-ponyai-says-it-is-unaffected-by-self-driving-car-safety-review-2026-05-26/
Rambus targets agentic AI workloads with faster client memory chipset
Rambus Inc. today announced a complete DDR5 9600 client memory module chipset designed to push PC memory speeds to 9,600 megatransfers per second, targeting the bandwidth and capacity demands of agentic artificial intelligence workloads on desktops and laptops. The new offering from the chip and silicon intellectual property company combines a second-generation client clock driver, […] The post Rambus targets agentic AI workloads with faster client memory chipset appeared first on SiliconANGLE.
Score: 45🌐 MovesMay 26, 2026https://siliconangle.com/2026/05/26/rambus-targets-agentic-ai-workloads-9600-mt-s-client-memory-chipset/
Import AI 458: Reckoning with the future; and a singularity story
What AI-driven miracles will happen this year?
Score: 45🌐 MovesMay 26, 2026https://importai.substack.com/p/import-ai-458-reckoning-with-the
AI Tools Are Transforming Muslim Worship. Religious Scholars Are Conflicted
Time Magazine
Score: 45🌐 MovesMay 26, 2026https://time.com/article/2026/05/26/ai-muslim-worship/
Warning issued over deepfake ads that expose people to ‘the mercy of scammers’
Which? has called on the government to ensure regulator Ofcom can take action against tech firms that fail to block scams
Score: 44🌐 MovesMay 26, 2026https://www.the-independent.com/news/uk/home-news/deepfakes-social-media-scam-ai-which-ofcom-b2983728.html