AI News Archive: May 26, 2026 — Part 9
Sourced from 500+ daily AI sources, scored by relevance.
- Reliable Machine Learning Methods for Payment Risk Prediction in Supply Chain Financing
Reliable Machine Learning Methods for Payment Risk Prediction in Supply Chain Financing repository.cam.ac.uk
Score: 34 🌐 Moves May 26, 2026 https://www.repository.cam.ac.uk/items/bbb7f1cf-f325-47ee-bc84-7fcad735d1ad
- Lowe’s courts DIY shoppers as AI tools boost online conversions
The retailer is aiming to build relationships with its DIY customers through AI-enhanced omnichannel shopping, associate-led services and loyalty program options.
Score: 34 🌐 Moves May 26, 2026 https://www.retaildive.com/news/lowes-diy-shoppers-ai-tools-online-conversions/821019/
- AI and You: AI vs UPSC—three chatbots attempt India’s toughest exam
A comparative test of leading AI models on actual UPSC Prelims papers reveals how closely modern systems can mirror human-level preparation, handling history and polity well but struggling with precise current affairs and technical distinctions that often decide exam outcomes.
- Why DORA Metrics Look Different When AI Is Part of Your Development Workflow
Why DORA Metrics Look Different When AI Is Part of Your Development Workflow DevOps.com
Score: 34 🌐 Moves May 26, 2026 https://devops.com/why-dora-metrics-look-different-when-ai-is-part-of-your-development-workflow/
- OpenAI’s Attempt at an AI-Generated Pixar-Style Movie Is in Shambles
Making movies ain't so easy after all. The post OpenAI’s Attempt at an AI-Generated Pixar-Style Movie Is in Shambles appeared first on Futurism.
Score: 33 🌐 Moves May 26, 2026 https://futurism.com/artificial-intelligence/openai-attempt-ai-pixar-movie-shambles
- AI tools without the right people may keep businesses in pilot mode
AI tools without the right people may keep businesses in pilot mode Raleigh News & Observer
- Zendure aims to lead in AI-driven home energy management systems
The company is expanding into AI-driven home energy solutions.
Score: 33 🌐 Moves May 26, 2026 https://kr-asia.com/zendure-aims-to-lead-in-ai-driven-home-energy-management-systems
- Could AI-powered dash cams save businesses millions in legal fees?
AI dash cams can do a lot more than capture accidents. They help fleets exonerate drivers, fight fraudulent claims, and lower insurance costs before a lawsuit ever begins.
Score: 33 🌐 Moves May 26, 2026 https://www.techradar.com/pro/could-ai-powered-dash-cams-save-businesses-millions-in-legal-fees
- The AI bus is headed for a cliff: Why “atoms” are the only lifeboat left
The venture capital world is currently trapped on a bus. It is a high-speed, neon-lit vehicle labelled “AI + Data Centres + Compute.” Every LP and VC is terrified of getting off a few stops too early, fearing they will miss the peak of the mountain. But here is the hard truth: if we do […] The post The AI bus is headed for a cliff: Why “atoms” are the only lifeboat left appeared first on e27.
Score: 33 🌐 Moves May 26, 2026 https://e27.co/the-ai-bus-is-headed-for-a-cliff-why-atoms-are-the-only-lifeboat-left-20260523/
- In 2026, more HR leaders are focused on training — and not just for AI skills
A shifting job market and desire for better workforce management may be why more HR pros considered learning a top priority this year, experts said.
Score: 33 🌐 Moves May 26, 2026 https://www.hrdive.com/news/2026-more-hr-leaders-are-focused-on-training/821117/
- Unitree Robotics reports plunge in first-quarter profits days before crucial IPO hearing
Unitree Robotics, a luminary in China’s humanoid robot boom, has reported a sharp plunge in first-quarter profits just days before its crucial listing hearing, casting a shadow over its Star Market initial public offering (IPO) as soaring expenses and a brutal price war catch up to the industry’s hype. The Shanghai Stock Exchange’s listing committee is scheduled to review Unitree’s IPO application on June 1, according to an exchange notice on Monday. The company, based in Hangzhou, the capital...
- Rajasthan Partners With Wadhwani AI to Deploy AI Tools for Agriculture Services
Discover how Rajasthan partners with Wadhwani AI to enhance agricultural services. Learn about AI tools that support farmers today!
- India’s Voice AI Boom is Missing One Thing—and All India Radio Already Has It
AIR represents one of the most comprehensive repositories of spoken Indian languages.
- Marketing firm moves HQ office to downtown Cincinnati, unveils AI research platform
Brandience, a growing marketing firm, has relocated its headquarters office to downtown Cincinnati while unveiling a new research platform that relies on generative AI.
- Automating Sales Territories with AI Workflows
Transform sales territories with AI-powered content automation. Streamline GTM strategies, enrich CRM data, and scale personalized outreach effortlessly.
- Vision boarding in the age of AI: Why clarity is becoming the new competitive advantage
For a long time, vision boarding was dismissed as a soft practice — something aesthetic, emotional, and often unserious. Cut out magazine clippings. Pin a dream house. Manifest abundance. Hope the universe listens. But in recent years, vision boarding has quietly re-emerged — not as a trend, but as a response. A response to speed. […] The post Vision boarding in the age of AI: Why clarity is becoming the new competitive advantage appeared first on e27 .
Score: 32 🌐 Moves May 26, 2026 https://e27.co/vision-boarding-in-the-age-of-ai-why-clarity-is-becoming-the-new-competitive-advantage-20260106/
- Happiest Minds unveils enterprise AI platform, reaffirms 12.5% FY27 growth outlook
Happiest Minds unveils enterprise AI platform, reaffirms 12.5% FY27 growth outlook Techcircle
- Enterprise AI infrastructure, MLOps & developer Tools drive the next phase of AI innovation
The ET Most Innovative AI Product Awards 2026 recognises the AI Platforms, Infrastructure & Developer Tools category. It highlights the technologies powering the next phase of enterprise AI from platforms and MLOps to observability and developer tools. These innovations are enabling scalable, reliable deployment of AI across industries.
- Dropbox’s Founder Is Stepping Down After 19 Years—and His Next Move Involves AI
After almost two decades, Drew Houston is leaving Dropbox, but he hasn’t ruled out the possibility of building something new.
- Unitree’s IPO progress spurs stock buying of firms with exposure to humanoid robot maker
Unitree Robotics’ progress on its domestic initial public offering (IPO) has sparked a buying frenzy of stocks with exposure to the humanoid robot juggernaut, as traders snapped up shares of its pre-IPO investors and business partners. Investors were navigating one of the most important thematic investments this year after Unitree took a step closer to a listing on Shanghai’s tech-heavy Star Market. Unitree said Monday that the exchange authority would vet its application next week. Shares of...
- Most Entrepreneurs Think They're Winning at AI — They're Not and Their Competitors Already Know It
Most Entrepreneurs Think They're Winning at AI — They're Not and Their Competitors Already Know It entrepreneur.com
Score: 32 🌐 Moves May 26, 2026 https://www.entrepreneur.com/growing-a-business/most-entrepreneurs-think-theyre-winning-at-ai/503426
- NotebookLM just made it easier to keep your sources up to date
NotebookLM is taking the manual effort out of re-syncing files.
Score: 32 🌐 Moves May 26, 2026 https://www.androidauthority.com/notebooklm-auto-google-drive-syncing-3671289/
- Novorèsumè Launches AI Resume Job Matcher for Resume Optimization and ATS Compatibility
Novorèsumè Launches AI Resume Job Matcher for Resume Optimization and ATS Compatibility azcentral.com and The Arizona Republic
- Why inclusion should be baked into AI adoption
An artificial intelligence inclusion framework from talent advisory firm Seramount puts the spotlight on how AI initiatives can leave certain groups behind.
Score: 31 🌐 Moves May 26, 2026 https://www.hrdive.com/news/why-inclusion-should-be-baked-into-ai-adoption/821083/
- 3D-printable humanoid legs let robotics experiments run wild
Hugging Face debuts $2,500 bipedal robot project for builders and researchers.
Score: 31 🌐 Moves May 26, 2026 https://arstechnica.com/ai/2026/05/3d-printable-humanoid-legs-let-robotics-experiments-run-wild/
- How AI is changing project risk management in Jira
How AI is changing project risk management in Jira Atlassian Community
- Gemini user hits 5-hour usage cap after a single prompt, Google responds
"Yikes!" is exactly the right reaction for this situation.
Score: 31 🌐 Moves May 26, 2026 https://www.androidauthority.com/google-gemini-usage-limit-problem-3670846/
- Nat Parker's new startup uses AI to automate Airbnb property management tasks
The Portland entrepreneur who launched TriMet into its mobile ticketing era returns to the spotlight.
- VerticalRent Launches Automated Rent Collection Suite With AI Expense Tracking and Free Schedule E Tax Guide
VerticalRent Launches Automated Rent Collection Suite With AI Expense Tracking and Free Schedule E Tax Guide azcentral.com and The Arizona Republic
- Palantir Mystery Deepens As Many Software Stocks Claw Back Amid AI Fears
Palantir stock is still down 23% in 2026 while the software ETF IGV has rebounded as the sector grapples with worries over AI disruption. The post Palantir Mystery Deepens As Many Software Stocks Claw Back Amid AI Fears appeared first on Investor's Business Daily.
Score: 30 🌐 Moves May 26, 2026 https://www.investors.com/news/technology/palantir-stock-pltr-software-stocks-igv-etf-2026/
- iOS 27’s new Siri design will look like this, per report
Apple’s major Siri overhaul will be unveiled in less than two weeks, and a new report reveals exactly what to expect from Siri’s new UI design.
Score: 30 🌐 Moves May 26, 2026 https://9to5mac.com/2026/05/26/ios-27s-new-siri-design-will-look-like-this-per-report/
- MiSpy.ai Launches World's First On-Demand Private Investigation Marketplace, Built To Create Jobs Instead Of Replace Them
MiSpy.ai Launches World's First On-Demand Private Investigation Marketplace, Built To Create Jobs Instead Of Replace Them USA Today
- Getmany Launches AI Workflow Tools for Freelancers and Agencies
Getmany Launches AI Workflow Tools for Freelancers and Agencies USA Today
- Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs
Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs MarkTechPost
Score: 30 🌐 Moves May 26, 2026 https://www.marktechpost.com/2026/05/26/meet-omnivoice-studio-a-local-open-source-alternative-to-elevenlabs/
- Neysa & Pipeshift Take On India’s Inference Problem
Together, these companies are building a system that lets enterprises run open-source models in single-tenant environments within India, without managing the operational overhead themselves.
Score: 30 🌐 Moves May 26, 2026 https://analyticsindiamag.com/ai-features/neysa-pipeshift-take-on-indias-inference-problem
- Opinion: AI transforming how tenders are written but not how they’re evaluated
AI is changing how tenders are written but not how they're evaluated in Ireland. That gap is becoming a problem, says BidReview.ai founder Tony Corrigan.
- Logistics focused AI startup Shipsy crosses $25 million ARR
Logistics focused AI startup Shipsy crosses $25 million ARR YourStory.com
Score: 30 🌐 Moves May 26, 2026 https://yourstory.com/2026/05/logistics-focused-ai-startup-shipsy-crosses-25-million-arr
- The AI travel revolution: Why hotels must be found by bots to be chosen by humans
As a marketer specialised in hospitality, I’ve had a front-row seat to several waves of change in this industry. But none of them compares to what AI is now setting in motion. The potential it carries for how hotels are found, how stays are booked, and how guests are served is unlike anything I’ve seen […] The post The AI travel revolution: Why hotels must be found by bots to be chosen by humans appeared first on e27.
Score: 30 🌐 Moves May 26, 2026 https://e27.co/the-ai-travel-revolution-why-hotels-must-be-found-by-bots-to-be-chosen-by-humans-20260526/
- SpaceX’s AI Pursuits Have Yet to Take Off
Plus, California studies AI job losses, Anthropic’s revenue soars and Nvidia plays the role of $5 trillion underdog.
Score: 30 🌐 Moves May 26, 2026 https://www.wsj.com/tech/ai/spacexs-ai-pursuits-have-yet-to-take-off-3c25e91e?mod=rss_Technology
- What everyone gets wrong about AI
What everyone gets wrong about AI The Washington Post
- The future of AI is an AI futures market
Wall Street is building a way for traders to buy and sell the processing power that underpins AI like the way they bet on the fluctuating price of a barrel of oil.
Score: 30 🌐 Moves May 26, 2026 https://www.semafor.com/article/05/26/2026/the-future-of-ai-is-an-ai-futures-market
- Why AI needs integrated Jira data to deliver better insights in 2026
Why AI needs integrated Jira data to deliver better insights in 2026 Atlassian Community
- The AI Model Confidence Trap
Why your AI model can be wrong with 99% confidence. The post The AI Model Confidence Trap appeared first on Towards Data Science.
- Salesforce Quarterly Highlights: FY27 Q1 Product Releases and Corporate Announcements
Something fundamental is shifting in how enterprises operate. “Forty percent of enterprise applications will be integrated with task-specific AI agents by the end of 2026, up from less than 5% in 2025, according to Gartner®.” As adoption accelerates, the industry is realizing the full value of agentic AI depends as much on the underlying operational […]
- AI Hallucinates Flight Refund, Sparking Memes in China
A chat log in which an AI assistant confidently presented a user with numerous inaccuracies has inspired a wave of AI hallucination spoofs, a growing trend in the country.
Score: 29 🌐 Moves May 26, 2026 https://www.sixthtone.com/news/1018572/AI Hallucinates Flight Refund, Sparking Memes in China
- AI can scale B2B sales, but only people can build trust
Automation dominates modern B2B selling, but the deals that matter most still close through human relationships. The post AI can scale B2B sales, but only people can build trust appeared first on MarTech.
- Allied Fence & Gates Launches InsureFENCE℠ — the First Ai-Powered Insurance Scope Scanner for Fence Claims
Allied Fence & Gates Launches InsureFENCE℠ — the First Ai-Powered Insurance Scope Scanner for Fence Claims azcentral.com and The Arizona Republic
- Three ways to avoid being fooled by AI slop
Global society makes billions of images and uploads hundreds of thousands of hours of video on the internet every day. The problem is, some of this content is misleading or downright wrong. And when it's in visual form, it can be particularly convincing.
- Automated Instruction Revision (AIR): A Comparison of Task Adaptation Strategies for LLM-based…
Automated Instruction Revision (AIR): A Comparison of Task Adaptation Strategies for LLM-based agents
Modern LLM-based agents are becoming central to workflows that require planning, automation, tool use, and repeated decision-making, but their performance is fragile in environments that change over time. New data arriving in production often differs from the examples seen during agent development, creating data and concept drift that can quickly make existing prompts outdated. The problem is intensified by model churn: organisations increasingly migrate between foundation models and versions, each with different strengths, weaknesses, and instruction-following sensitivity. In this setting, initial instructions become difficult to maintain because even small changes in the underlying model can alter the agent’s behaviour.
Fine-tuning is not always the most practical solution; it demands a hosting or compute budget, monitoring, and strong safeguards to avoid degrading general knowledge, safety, or alignment. It can also fragment systems into many specialised fine-tuned models across nodes or tasks, which increases operational complexity. At the same time, LLM-based agents often need continuous or periodic expert support to diagnose failures, revise instructions, and keep behaviour aligned with changing requirements and data. This dependency on human intervention slows adaptation and makes scaling brittle when many agents or workflows must be maintained simultaneously.
The case for automated instruction revision
Automated prompt and instruction revision methods offer a more structured alternative by updating agent behaviour directly from observed task performance and feedback. Such methods can reduce the need for manual re-engineering while keeping adaptation lightweight, repeatable, and easier to deploy across changing environments.
They also improve transparency by making the revision process explicit, so it becomes clearer what was changed, which observations influenced the update, and which samples were excluded because they were contradictory or unsupported. For these reasons, automated instruction revision is emerging as an important direction for keeping LLM-based agents reliable, adaptable, and operationally manageable in modern agentic systems.
Related work
Existing approaches to automated instruction revision span a spectrum from lightweight prompt selection to full parameter adaptation. DSPy BootstrapFewShot focuses on automatically selecting high-quality in-context examples from labelled data, improving performance without modifying the original instruction itself. More advanced prompt optimisation methods, such as DSPy MIPROv2, treat instruction design as a search problem, jointly exploring candidate instructions and demonstrations to identify high-performing prompt configurations. Extending this idea, DSPy GEPA introduces a reflective loop with a stronger teacher model that iteratively revises instructions based on execution feedback, enabling more informed and adaptive improvements. Finally, TextGrad frames instruction revision as a continuous optimisation process in text space, updating the prompt through gradient-like signals derived from task loss while keeping the underlying model fixed. Together, these methods illustrate different trade-offs between simplicity, interpretability, computational cost, and adaptability in automated instruction revision.
There are other methods, such as CoachLM and Prewrite, but to sum up, the literature reveals two main tendencies. The first is the movement from manual prompt writing toward automated optimisation of prompts, examples, and full LLM pipelines. The second is the growing interest in interpretable decision structures such as rules or trees.
Our work lies at the intersection of these directions: like prompt optimisation methods, it seeks to automatically improve LLM behaviour, but like tree-based approaches, it emphasises the discovery of explicit and reusable decision logic.
What AIR does differently
Our proposed Automated Instruction Revision (AIR) method is likewise a data-driven prompt adaptation pipeline that learns explicit task guidance from labelled examples instead of relying on continuous manual prompt engineering. Unlike the methods described above, its central idea is to transform supervised task data into a compact set of task-specific decision rules, convert these rules into an executable system prompt, and iteratively refine them using additional cases sampled from the training set. This approach helps reach comparable or top results with computational budgets that are several times lower. If you are interested in the achieved results, please proceed directly to the Results section.
The AIR pipeline
The AIR pipeline proceeds through clustering, rule induction, rule aggregation, and targeted refinement. Starting from labelled training data, it constructs an executable instruction set in the following stages.
The first task is to create the initial prompt, which can be quite simple, such as “Please, classify/answer according to some primary rule/instruction…”. In our benchmarking, we call its score the initial prompt performance.
Next, the dataset must be processed. As stated above, this requires feedback or supervision, so AIR is not a self-learning approach; in this respect it is similar to the improvement techniques it is compared with. AIR first maps each dataset into canonical input and output columns and computes embeddings for both sides (independent and target) of the supervision signal. This creates a uniform representation across tasks and provides the structure used later for clustering and local comparison.
Then we should understand how to sample our data and form batches.
This task is important because we cannot overload the context with the entire dataset (even if it fits into 1M tokens, or into any larger window enabled by future RoPE-like improvements). Also, feeding a sorted, single-class, or fully random batch would not help form semantic decision boundaries efficiently. Our idea is to group the inputs semantically first, while keeping different targets within each group (or mimicking the target distribution of the overall sample). This can be achieved by clustering with several objectives (input similarity, target variety). The number of clusters is treated as a hyperparameter and is set to a default value of 5 in our experiments. Finally, AIR performs a repair step, since automated blind clustering is not always perfect across task types (for example, discrete classes versus enhanced answers): it redistributes samples from clusters containing only one output class to nearby alternative clusters. This encourages groups to remain semantically coherent while preserving the output variation that is useful for distinguishing behaviour.
Then comes the rule learning part, which consists of a few steps. First, after balanced batches are sampled, they are passed to a reasoning model with an instruction to infer a small number of rules of the form “if condition on the input, then output action or pattern.” The purpose is not free-form explanation but the extraction of compact decision-boundary rules that distinguish competing output behaviours within the same semantic neighbourhood.
After this, the rules must be aggregated and compiled into an executable prompt. Because rule induction is repeated across clusters, the initial rule pool is usually large and partially redundant. AIR therefore applies a dedicated LLM-based rule-compilation stage.
In this stage, a reasoning model receives the induced rules and, following a constrained compiler prompt, groups rules with semantically similar THEN actions, identifies the shared structure of their IF conditions, merges complementary signals, removes lexical noise, and preserves only the exclusions needed to avoid cross-rule collisions. The aggregated rules produced by this step are then concatenated with the task description to form a structured final prompt that instructs the model to follow the decision process encoded in the rules. AIR also creates a traced variant of this prompt that asks the model to return both the final prediction and the identifiers of the applied rules, enabling later refinement.
Refinement also appears in several of the approaches mentioned above and is essentially a continuous improvement process. Rather than treating the aggregated rules as final, AIR re-evaluates them on newly sampled examples and records predicted outputs, applied rule identifiers, and task metric values. For each rule, the pipeline separates participating cases into mistakes, where the prediction was incorrect, and anchors, where the prediction was correct according to the provided metric. Small refinement batches built from these two sets are then sent back to the reasoning model, which is asked to make the smallest necessary local revision to the current rule. In this way, AIR updates individual rules while attempting to preserve behaviour on successful anchor cases.
After refinement, the updated rules are assembled into the final system prompt used for downstream evaluation. An important note is that this final step is intended to serve as the major part of the agent's continuous improvement loop, especially without the involvement of AI specialists.
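The separation of traced predictions into mistakes and anchors per rule can be illustrated with a minimal sketch. It is not the authors' implementation: the `TracedPrediction` record, its field names, the `split_by_rule` helper, and the default exact-match metric with a threshold of 1.0 are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TracedPrediction:
    """One evaluated example from the traced prompt (hypothetical structure)."""
    example_id: str
    predicted: str
    expected: str
    applied_rules: Tuple[str, ...]  # rule identifiers returned by the model


def exact_match(predicted: str, expected: str) -> float:
    """Assumed task metric: 1.0 for an exact match, otherwise 0.0."""
    return float(predicted == expected)


def split_by_rule(
    traces: Iterable[TracedPrediction],
    metric: Callable[[str, str], float] = exact_match,
    threshold: float = 1.0,
) -> Dict[str, Dict[str, List[str]]]:
    """Group example ids per rule into 'mistakes' and 'anchors'.

    An example counts as an anchor for every rule it used when the metric
    reaches the threshold, and as a mistake for those rules otherwise.
    """
    buckets: Dict[str, Dict[str, List[str]]] = defaultdict(
        lambda: {"mistakes": [], "anchors": []}
    )
    for trace in traces:
        correct = metric(trace.predicted, trace.expected) >= threshold
        key = "anchors" if correct else "mistakes"
        for rule_id in trace.applied_rules:
            buckets[rule_id][key].append(trace.example_id)
    return dict(buckets)
```

Under these assumptions, a refinement batch for a given rule would be built by sampling a few ids from its `mistakes` list and a few from its `anchors` list before asking the reasoning model for a minimal local revision.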
In this way, AIR derives task-specific instructions from labelled examples through rule induction, aggregation, and revision.
Performance and benchmarks
This section describes the benchmark suite, the compared adaptation methods, and the models used in our experiments. The evaluation is designed to support a controlled comparison of prompting, retrieval-based adaptation, prompt optimisation, and fine-tuning under a shared task-specific setup.
We evaluate AIR on a deliberately diverse benchmark suite chosen to reflect different sources of adaptation difficulty rather than a single task family. The suite spans remapped-label classification, closed-book factual QA, schema-constrained extraction, PII identification, and event-order reasoning. The comparison is made via classical cross-validation using train, development, and test splits, with task-specific metrics (typically, the higher the better). The test set, which was held out from training and refinement/tuning, was used for the final evaluation and is the one reported in our benchmarking tables.
Let us review the tasks and datasets, some of which appear in other benchmarks.
The first is a classification task. Such tasks can typically be solved by simpler, proven ML approaches rather than LLMs (at least to reduce complexity and compute and to increase consistency and transparency), but sometimes the decision depends on a wide variety of semantics and background knowledge, in which case an LLM-based agent with structured output may be the right choice. Here, we used real customer support requests collected from Twitter, based on the Customer Support dataset (a sample is available on Kaggle), and constructed an 8-class benchmark by selecting requests addressed to eight companies.
The main problem was that this is not a complicated task for an LLM that has background knowledge about the companies in the dataset (or that has seen the dataset itself during training, absorbed its key information through distillation, and so on). Thus, we first removed any explicit company mentions and other direct lexical cues, and then reassigned the output labels so that each company is consistently mapped to a different description rather than its original one. The final dataset is balanced across the selected companies, and performance is measured with exact match (accuracy).
Another, less discrete task is closed-book question answering, for which we built a benchmark from Ever Young by Alice Gerstenberg and created question-answer pairs that must be answered without the full source text being provided at inference time. The benchmark is intentionally designed so that the relevant source material is unlikely to be part of the model’s pretrained knowledge; any such leakage would apply equally to all the LLM-based approaches we benchmark and is not decisive, as the evaluation shows. To evaluate outputs, we use an LLM-as-a-judge rubric with four dimensions: correctness, completeness, absence of hallucination, and focus. Each is scored on a 0/0.5/1 scale, and the final score is the average of the four subscores. Each score level has described decision boundaries, as do the other tasks below that rely on such inherently stochastic measurements.
For information retrieval, we use campaign-finance filings for Philadelphia elections from the City of Philadelphia Campaign Finance Reports source, containing transaction-level records for contributions, expenditures, and debt. Each example is transformed from a structured table row into a raw CSV-style string consisting of a header and one data line, and the columns are randomly shuffled.
This means the model must first reconstruct the field mapping before it can find and extract the requested outputs (so this is not a typical ETL or data-extraction task). In addition to recovering directly stated fields, the model must infer two derived labels: candidate_relationship, which categorises how the donor relates to the candidate, and contribution_context, which combines donor type and contribution size into labels. Performance is measured as average exact match across the two target fields (approx. MAP).
As stated, we tried to include different tasks and benchmarks in order to view existing and proposed improvement approaches from as many perspectives as possible. The next task is PII extraction. Here, we used PUPA (Private User Prompt Annotations), a benchmark derived from real WildChat user-assistant conversations that contain explicit PII leakage. In our benchmark, we isolated only the annotation component and converted it into a pure extraction task: given user_query, the model must output the gold pii_units directly, without rewriting, redaction, or response generation. Evaluation is performed with an entity-level exact-match F1 score.
For event logical reasoning, we adopt the Event Logic Reasoning subset of BizFinBench.v2. Each example describes a finance-related scenario involving multiple market events, and the task is to recover their correct logical order, either temporally or through cause-and-effect structure. The benchmark is intended to test whether this ordering logic can be induced through prompting, transferred from similar retrieved examples, or absorbed through fine-tuning. The output is a compact event-index sequence, for example “2,1,4,3”, rather than a free-form explanation. Performance is measured with exact match.
We compare AIR against a set of adaptation strategies (partially listed at the beginning as related approaches) implemented within a shared task-specific workflow across our benchmarks.
All methods start from the same manually written task instruction and use the available labelled training data, but they differ in whether they adapt the prompt, retrieve examples, or update model parameters.
Initial prompt baseline. The model is evaluated with the manually written task prompt only, without retrieval, optimisation, or parameter updates. This baseline measures the performance of direct zero-shot prompting with human-authored instructions.
KNN-based prompting. For each test example, we retrieve similar training samples and append them as in-context examples to the initial prompt. This retrieval step resembles the corresponding one in RAG pipelines, which is probably more familiar to business readers and non-specialists in DS/AI/ML. This baseline tests whether instance-level adaptation through dynamic example selection is sufficient without modifying the instruction itself.
DSPy BootstrapFewShot. DSPy automatically compiles a few-shot prompt by sampling labelled training examples and retaining demonstrations that satisfy the task metric. This yields an automatically selected demonstration set while keeping the base instruction fixed.
Fine-tuning. We fine-tuned the base foundation model (from the GPT family, as used in the other improvement papers; the models are described after this list) on the training datasets using chat-formatted supervision constructed from the same initial system prompt, the task input, and the target. The resulting model is evaluated on the test dataset. This baseline represents parameter adaptation rather than prompt-only adaptation.
DSPy MIPROv2. MIPROv2 is used as a prompt optimisation baseline without a separate teacher or reflection model. In our experiments, it is run with DSPy’s auto="medium" budget and task-specific limits on bootstrapped and labelled demonstrations, which are chosen separately for each benchmark.
It jointly searches over instruction candidates and, when enabled, demonstration candidates to find a high-performing compiled prompt.
DSPy GEPA. GEPA is used as a reflective prompt optimiser with the same auto="medium" budget, but unlike MIPROv2, it relies on a separate reasoning model as a teacher for reflection (in AIR, the analogous step is carried out differently and is called refinement). The GEPA algorithm receives task-specific textual feedback through the metric and uses the teacher model to revise candidate instructions based on execution traces and validation-time search.
TextGrad. TextGrad treats the system prompt as an optimisable text variable while keeping the model parameters fixed. In our experiments, we run one epoch over the full training set; within each minibatch, per-example losses are summed into a single minibatch loss, which is then used to update the prompt. This baseline therefore represents iterative text-space optimisation of the initial instruction.
AIR. AIR induces explicit task-specific rules from labelled examples, aggregates them into a compact rule set, and refines them iteratively before assembling the final prompt.
Across this benchmarking, we used gpt-4.1-mini-2025-04-14 as the base task model for evaluation, not least because the GPT family was proposed and used in the existing methods described above. This model is also used for the initial prompt baseline, KNN-based prompting, DSPy BootstrapFewShot, MIPROv2, and the final prompt evaluation in AIR, and it also serves as the starting point for fine-tuning. For methods that require a stronger auxiliary model for reflection or instruction refinement/revision, we use gpt-5-2025-08-07 as the teacher model in GEPA, TextGrad, and AIR. In AIR, we additionally use text-embedding-3-small to compute input and output embeddings for clustering and rule induction. For the closed-book question answering benchmark, the LLM-as-a-judge metric uses gpt-5.1 as the judge model.
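As an illustration of the entity-level exact-match F1 used for the PII extraction benchmark above, here is a minimal sketch. The article does not specify how entities are normalised or how empty predictions are scored, so the `normalize` helper (lowercasing and whitespace collapsing), the set-based matching, and the convention that two empty lists score 1.0 are assumptions made for this example, not the authors' scoring code.

```python
from typing import Iterable, Set


def normalize(unit: str) -> str:
    """Assumed normalisation: lowercase and collapse runs of whitespace."""
    return " ".join(unit.lower().split())


def to_entity_set(units: Iterable[str]) -> Set[str]:
    """Drop blank strings and deduplicate after normalisation."""
    return {normalize(u) for u in units if u.strip()}


def entity_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """F1 over exactly matching entities for a single example."""
    pred = to_entity_set(predicted)
    ref = to_entity_set(gold)
    if not pred and not ref:
        return 1.0  # assumption: nothing to find and nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For instance, predicting `["John Smith", "Berlin"]` against gold units `["john smith", "Paris", "Acme Corp"]` gives one match, precision 1/2, and recall 1/3, so the F1 is 0.4. A benchmark-level score would then average this value over all examples, which is again an assumption about the aggregation.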
Results
This section reports the quantitative results for all compared methods across the five benchmark tasks. In addition to the task metric, each table includes the token usage reported in the original experiment slides, separating training-time token usage by model role and inference-time token usage for the base model.
Table 1 shows the classification results. GEPA achieves the highest score, while AIR remains very close and also surpasses fine-tuning. This pattern suggests that, once brand priors are removed, the task behaves less like ordinary text classification and more like learning a remapped latent label system, where explicit prompt-level instruction discovery can be highly effective. Note also the computational budgets (tokens used), which are quite high for the DSPy and TextGrad families.
The closed-book QA results in Table 2 reflect a different problem setting. KNN clearly performs best, while most prompt-optimisation methods fail to improve over the initial prompt. This indicates that the benchmark is dominated by source-specific knowledge injection rather than by generic reasoning or procedural instruction improvement.
Table 3 shows that fine-tuning is dominant on the information retrieval task. This is consistent with the structure of the benchmark: before performing extraction, the model must first reconstruct the field mapping from shuffled CSV-style rows. AIR performs substantially worse here, as do the other instruction revision methods, because this task is difficult to capture through compact local rules alone.
The PUPA results in Table 4 again favour fine-tuning, with MIPROv2, AIR, and GEPA forming a second tier. This pattern suggests that the benchmark rewards not only general PII detection, but also adherence to dataset-specific annotation habits. AIR remains competitive with GEPA, but does not match the best parameter-adaptation result.
Note: In the source results, AIR on PUPA was terminated after 8 steps, including 5 consecutive steps with no improvement, using a batch size of 3.
Finally, Table 5 shows that fine-tuning again performs best on event logical reasoning. The relatively strong initial-prompt score suggests that the base model already has some latent capability for financial event ordering, while fine-tuning appears to stabilise how that reasoning is mapped into the required output sequence. KNN and AIR both provide moderate gains, but neither approaches the fine-tuned model.
Conclusions
AIR is intended to provide several practical advantages. First, it is interpretable: the learned adaptation is represented as readable instruction text rather than hidden inside parameter updates. Second, it can support rule-level inspection and revision: individual rules can be examined and modified without changing model weights. Third, the pipeline preserves intermediate artefacts that can help analyse how the final prompt was constructed. Fourth, it reduces manual effort by automating a process that would otherwise require a person to inspect the data, identify recurring patterns, and summarise them as explicit rules. Fifth, the computing budget is quite moderate in comparison to other methods.
At the same time, AIR has important limitations. Its core assumption is that useful task behaviour can be guided by natural language or semantic rules. When labels are inconsistent, examples are noisy, or correct decisions depend on latent patterns that are difficult to verbalise, rule induction may become unstable. AIR can also suffer from rule interaction effects: a locally beneficial instruction may conflict with others after aggregation, and repeated revisions may reduce clarity instead of improving it.
The above benchmarking results do not support a single universal winner across all benchmarks. Instead, they show that different methods are strongest under different task conditions.
Retrieval works best when the task is dominated by source-specific knowledge, fine-tuning is strongest when the task depends on stable dataset-specific mappings or output conventions, and instruction-based adaptation is most competitive when the target behaviour can be expressed as explicit decision rules.
Within this picture, AIR occupies a clear practical niche. It does not win on every benchmark, but it remains especially competitive on tasks where the target behaviour can be captured as reusable rules within the optimised computational budget. This makes AIR attractive in settings where interpretability is important and some loss relative to the best task-specific method is acceptable.
Another important point is that the compared adaptation methods still rely on a separate search or optimisation stage to identify a strong task-specific solution. When the data changes, this process may need to be repeated rather than updated incrementally. In principle, AIR may offer an advantage here, since new data could potentially be incorporated by extending or refining the rule set instead of rebuilding the adaptation from scratch. However, this possibility was not evaluated in the present experiments and remains a direction for future work.
For these reasons, we treat AIR not as a universal solution to task adaptation, but as a structured and interpretable option for settings in which task behaviour can be captured, at least in part, by explicit instruction rules derived from supervision.
This article is an adapted version of the original research paper, edited for Medium.
The full paper is available at https://arxiv.org/abs/2604.09418
Automated Instruction Revision (AIR): A Comparison of Task Adaptation Strategies for LLM-based… was originally published in DataDrivenInvestor on Medium.
- Cyber Hankuk University of Foreign Studies partners with AI English app Speak to support students
[Photo: Cyber Hankuk University of Foreign Studies' building in Dongdaemun District, eastern Seoul. Credit: Cyber Hankuk University of Foreign Studies]
Cyber Hankuk University of Foreign Studies (CUFS) announced Tuesday that it has formed a strategic partnership with AI-powered English education application Speak to support its students. The partnership benefits undergraduates in the English department, including students on leaves of absence, alumni and those enrolled in the TESOL program. It also extends to graduate students majoring in AI & English at the CUFS Graduate School, the school said.
As part of the partnership's perks, eligible CUFS students and alumni will receive special discounted pricing on Speak's annual subscription plan. The discount period will run from Monday through May 31 of next year.
“I hope this partnership serves as an excellent opportunity to help students overcome psychological barriers and boost their confidence in speaking English,” said Prof. Lee Jong-bong, head of the CUFS English department. CUFS plans to expand its cooperation with various edutech platforms, university officials said.
Based in San Francisco, Speak entered the Korean market in 2019 and was previously recognized as Google Play’s top application for language speaking practice. The platform allows users to practice conversational English with AI tutors that provide tailored feedback. As of November 2024, the app's monthly active users in Korea stood at 240,000.
A four-year cyber university, CUFS is also set to open admissions for the second semester of the 2026 academic year.
Applications will be accepted from Monday to July 16 across 10 departments: English, Chinese, Japanese, Korean, Spanish, Vietnamese & Indonesian, business administration, occupational safety & housing management, psychological counseling and K-beauty. For foreign nationals, admissions are open to anyone with a high school diploma or equivalent. BY CHO JUNG-WOO [cho.jungwoo1@joongang.co.kr]