
Even as the geopolitical conversation around AI continues to grow more fraught following the U.S. government's actions to limit the new models from Anthropic and OpenAI, Chinese open-source darling DeepSeek is back with yet another open release that could once again change AI development around the globe.

Over the weekend, the firm released DSpark, a new, MIT-licensed system designed to make large language models answer faster without changing what the underlying model is trying to say.

The easiest way to think about it is this: most AI chatbots write like someone crossing a river one stepping stone at a time. They choose one small chunk of text, then the next, then the next. DSpark gives the system a scout that runs a few steps ahead, guesses the likely path, and lets the larger model quickly check which steps are safe. When the guesses are good, the model moves faster. When the guesses are weak, DSpark tries not to waste time checking them.

DeepSeek published the work with a technical paper, model checkpoints and DeepSpec, a codebase for training and evaluating speculative decoding systems. The release is available through DeepSeek’s public GitHub and Hugging Face pages, both under the permissive MIT license, making the new technique broadly usable by developers, researchers and commercial enterprise operations that want to study or adapt the approach.

The system is aimed at one of the most expensive problems in AI deployment: serving large models quickly enough for real users, while using hardware efficiently enough to make the economics work. That matters for consumer chatbots, coding assistants, agentic workflows and enterprise AI systems where users expect long answers to stream quickly rather than crawl out word by word. DeepSeek is applying DSpark to its own latest frontier open model, DeepSeek-V4.
Specifically, DeepSeek used its new DSpark framework on DeepSeek-V4-Flash, its already speed-optimized 284-billion-parameter mixture-of-experts model with 13 billion active parameters, and DeepSeek-V4-Pro, its more thoughtful and powerful 1.6-trillion-parameter model with 49 billion active parameters. Both support context windows of up to one million tokens.

But the broader significance is that DSpark is not conceptually limited to DeepSeek-V4. DeepSeek’s own tests and released checkpoints cover other open model families, including Alibaba's open-weight Qwen and Google's open-weight Gemma. That means enterprise teams running open-weight models could, in principle, train or fine-tune DSpark-style draft modules for their own target models. It is not a switch that any API customer can flip from the outside, but it is a method that can travel to other models when the operator controls the weights and serving stack.

Staggering speed increases for generating tokens during inference

In DeepSeek’s live production tests, DSpark improved aggregate throughput by 51% for DeepSeek-V4-Flash at an 80-token-per-second-per-user service target, and by 52% for DeepSeek-V4-Pro at a 35-token-per-second-per-user target. At matched system capacity, DeepSeek reports per-user generation speedups of 60% to 85% for V4-Flash and 57% to 78% for V4-Pro over its prior MTP-1 production baseline.

The different speed claims measure different things. The 60% to 85% figure for V4-Flash, and the 57% to 78% figure for V4-Pro, describe how much faster individual users receive generated tokens when DeepSeek compares DSpark with MTP-1 at matched practical system capacity. Those are the cleaner “generation speed” numbers.

DeepSeek also reports much larger 661% and 406% increases, but these measure aggregate throughput under very strict speed targets: 120 tokens per second per user for V4-Flash and 50 tokens per second per user for V4-Pro.
At those targets, DeepSeek says its older MTP-1 baseline approaches an operational cliff, meaning it can keep only a small number of concurrent requests running while preserving that level of responsiveness. DSpark avoids more of that collapse, so the percentage difference in total system output becomes much larger. Put simply: the 85% number is closer to “how much faster the ride feels for a user” under comparable conditions, while the 661% and 406% figures are closer to “how much more traffic the road can still carry” when the old system is already bottlenecking.

Why speculative decoding matters

LLMs usually generate text one token at a time. A token can be a word, part of a word, punctuation mark or other small piece of text. Every new token depends on the text already produced, so the model has to keep pausing, checking the full context and choosing the next piece. That is accurate, but slow. It is like having a senior editor approve every word before a writer can move to the next one. The editor may be excellent, but the process creates a bottleneck.

Speculative decoding, developed in the early Transformer era, tries to fix that bottleneck. Instead of asking the large model to produce every token one by one, the system uses a smaller or lighter draft component to suggest several likely next tokens. The large model then checks that batch of guesses in parallel. If the draft guessed correctly, the system moves ahead several tokens at once. If the draft made a bad guess, the system rejects the bad token and anything after it, adds a corrected token, and tries again.

The point is speed without changing the larger model’s intended output. In the standard speculative decoding setup, the draft model is not replacing the target model. It is acting more like an assistant who prepares a rough next sentence for the senior editor to approve or reject.

The idea did not appear out of nowhere with today’s large language models.
A key precursor came in 2018, when Mitchell Stern, Noam Shazeer and Jakob Uszkoreit proposed blockwise parallel decoding for deep autoregressive models. Their method predicted multiple future steps in parallel, then kept the longest prefix validated by the main model. That paper established much of the draft-and-check intuition behind later speculative decoding work. The research line became more explicit in 2022. Heming Xia, Tao Ge and co-authors introduced SpecDec, a draft-and-verify approach for sequence-to-sequence generation. Later that year, Yaniv Leviathan, Matan Kalman and Yossi Matias posted “Fast Inference from Transformers via Speculative Decoding,” which helped define the modern version of the technique for transformer-based language models. DeepMind researchers followed in 2023 with a closely related method called speculative sampling. Those 2022 and 2023 papers are the clearest ancestors of how speculative decoding is discussed in current LLM inference work: a faster draft process proposes tokens, and the larger target model verifies them in a way designed to preserve the target model’s output distribution. Since then, the field has moved quickly through several variants, including separate draft models, multi-token prediction heads, tree-based verification, feature-level methods such as EAGLE, self-speculation, Medusa-style extra heads and parallel/blockwise drafters such as DFlash. The key metric is not how many tokens a draft model can guess. It is how many of those guesses the larger model actually accepts. Long speculative blocks help only if enough of the proposed tokens survive verification. Otherwise, the system spends compute checking guesses that it throws away. That is the context for DSpark. Speculative decoding is already an established inference technique before DeepSeek’s release, with support in major serving stacks and multiple competing research approaches. But it is still not a solved problem. 
Speedups depend heavily on the draft model, the workload, the serving setup and the current traffic level. DSpark’s contribution is to improve both sides of the trade-off: it tries to draft more coherent token blocks and then verify only the parts of those blocks that are likely to pay off under real serving conditions.

What DSpark changes

DSpark tackles two related problems: bad guesses and wasted checking.

First, the system uses what DeepSeek calls semi-autoregressive generation. In plain English, that means DSpark tries to combine speed with a bit more awareness of sequence. A fully parallel drafter can guess several tokens at once, which is fast, but its later guesses can become less coherent because each position is predicted too independently. A purely step-by-step drafter can keep better track of how one token leads to the next, but it loses much of the speed advantage.

DSpark tries to keep the best of both. It uses a parallel backbone for most of the drafting work, then adds a lightweight sequential head that lets the draft take nearby token relationships into account. In the paper’s example, a parallel drafter might confuse likely phrase endings such as “of course” and “no problem,” producing awkward combinations because it is guessing positions too separately. DSpark’s sequential component helps the system make the later tokens fit the earlier ones.

Second, DSpark adds confidence-scheduled verification. Rather than always asking the target model to check the same number of draft tokens, DSpark estimates which prefix of the draft is likely to survive. A hardware-aware scheduler then adjusts how much of each draft should be verified based on both model confidence and current serving load.

A simple analogy: when a restaurant is quiet, the head chef can inspect more of the prep cook’s work. When the kitchen is slammed, the chef spends attention only on the dishes most likely to be ready. DSpark applies a similar idea to AI serving.
Under lighter traffic, the system can afford to check longer draft prefixes. Under heavier traffic, it trims low-confidence trailing guesses before they consume batch capacity that could be used for other users.

DeepSeek frames this as an answer to a common production trade-off. Static multi-token drafting can look attractive in isolation, but can hurt throughput under high concurrency because the system keeps checking tokens that are likely to be rejected. DSpark’s scheduler makes the verification budget flexible instead of fixed.

Offline results: better draft acceptance across Qwen and Gemma

DeepSeek tested DSpark offline on Qwen3-4B, Qwen3-8B, Qwen3-14B and Gemma4-12B target models across math, coding and chat benchmarks. In those tests, the team compared DSpark with DFlash, a parallel drafter, and Eagle3, an autoregressive drafter. The paper reports accepted length per decoding round, a measure of how many tokens survive verification on average.

Across the three Qwen3 model sizes, DSpark improved macro-average accepted length over Eagle3 by 30.9%, 26.7% and 30.0%, respectively. Compared with DFlash, it improved accepted length by 16.3%, 18.4% and 18.3%. The paper also says the gains generalized to Gemma4-12B.

That supports a point raised by developer Daniel Han, who highlighted on X that DeepSeek showed DSpark working beyond DeepSeek’s own V4 models, including Gemma and Qwen. The stronger support comes from DeepSeek’s own benchmarks and released checkpoints.

The offline results also show why workload matters. Structured tasks such as math and code tend to have higher accepted lengths than open-ended chat. That makes intuitive sense: a code completion or math step often has fewer reasonable next moves than a free-form conversation.
For enterprises, this means DSpark-style methods may be especially attractive for coding assistants, data analysis agents, structured workflow automation and other settings where outputs follow more predictable patterns.

How enterprises could use DSpark without DeepSeek-V4

One of the most important questions is whether DSpark is a DeepSeek-only optimization or a broader method that can be applied to other models. The answer is: a broader method, but not an automatic plug-in.

For open-weight models, the path is relatively clear. An enterprise running Qwen, Gemma, Llama, Mistral, Granite, Command-style open weights or another model it hosts itself could train or fine-tune a DSpark-style draft module against that target model. The team would then measure acceptance on its own workloads and integrate the verification scheduler into its inference stack.

That is different from simply downloading DeepSeek’s DSpark module and attaching it to any model. Speculative decoding depends on alignment between the draft module and the target model. The draft has to learn what the target model is likely to accept. A drafter trained for DeepSeek-V4 will not automatically be the right drafter for a different model, especially one fine-tuned on a company’s internal data or configured for different reasoning behavior.

DeepSpec’s workflow reflects this. The process involves preparing data, regenerating target-model answers, building a target cache, training the draft model and evaluating speculative-decoding acceptance. For domain-specific use, the draft model may need additional fine-tuning, especially if the target model runs in a thinking or reasoning mode.

For proprietary models, the answer depends on what the enterprise controls. If a company owns or fully hosts the model weights and serving stack, it could theoretically train and deploy a DSpark-style drafter. If the model is available only through a hosted API from a vendor, the customer cannot directly add DSpark from the outside.
The API provider could implement a similar optimization internally, but the customer generally cannot access the token verification loop, logits, batching behavior or serving scheduler needed to make DSpark work.

That distinction matters for enterprise buyers. DSpark strengthens the case for open or self-hosted AI infrastructure because it gives advanced teams another lever to improve speed and cost. But it also shows why model serving is becoming a specialized discipline. The value is not just in picking a model, but in how intelligently that model is run.

What developers get from DeepSpec

For developers, DeepSpec gives a concrete implementation path for training and evaluating speculative decoding draft models. It includes data preparation, training and benchmark evaluation steps, along with released checkpoints for several open model families. That makes the release useful not only for running DeepSeek-V4 with DSpark, but also for researchers and infrastructure teams studying how to add faster decoding to other open models.

There are real deployment caveats. DeepSpec’s own README says the default Qwen3-4B data preparation setup can require roughly 38 TB of target cache storage, and the default scripts assume a single node with eight GPUs. That makes the release more immediately relevant to AI labs, cloud teams and sophisticated enterprise AI infrastructure groups than to ordinary application developers.

Still, releasing the training pipeline matters. Many inference optimizations appear only as papers, vague benchmarks or closed production claims. DeepSpec gives developers something closer to a set of blueprints: not a finished enterprise product, but a way to reproduce, adapt and evaluate the method.

Early community testing

The release has already drawn fast developer attention.
Developer Rafael Caricio published a GitHub pull request documenting single-stream DeepSeek-V4-Flash DSpark work, reporting warmed benchmark anchors of 26.33 tokens per second without speculative decoding, 39.88 tokens per second with MTP-1, and roughly 60 tokens per second with DSpark, or about 1.5x over MTP-1 and 2.3x over no-spec decoding. A later commit in the same thread recorded a five-run mean of 60.31 tokens per second, with a 1.51x gain over MTP-1 and 2.29x over non-speculative decoding.

The same work also points to an important practical limit: in realistic multi-turn coding sessions, performance can degrade as draft acceptance falls with growing context. In other words, DSpark can make decoding faster, but acceptance quality still determines how much speed the system actually realizes.

That is a useful reality check. DSpark is not magic. It still depends on how predictable the next tokens are and how well the drafter stays aligned with the target model. But the early implementation work suggests DeepSeek’s claims are not purely academic. Developers are already testing the method in practical serving environments and reporting gains close to the paper’s single-stream expectations.

The bottom line

DSpark shows how much performance remains available in the inference layer, even when the underlying model architecture stays the same. As AI companies compete on model quality, context length and pricing, decoding efficiency is becoming another major battleground. Faster generation means lower latency for users, higher throughput for providers and better economics for teams serving open models at scale.

DeepSeek’s release is notable because it combines a production-tested method, open code, public checkpoints and a detailed paper. The main innovation is not just drafting more tokens. It is making the system more selective about which speculative work is worth verifying.
For enterprise teams, the broader lesson is that the next wave of AI performance gains will not come only from larger models. It will also come from smarter ways to run the models companies already have — especially when those companies control enough of the stack to tune the model, train a compatible draft module and optimize the serving engine around real workloads.
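
For readers who want to see the mechanics, the draft-and-verify loop and the confidence-based trimming described in this article can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DeepSeek's implementation: the names `toy_target`, `toy_draft`, `trim_draft` and `min_survival_for_load`, the fixed 0.9 confidence values and the linear load rule are all hypothetical, and verification here uses simple greedy exact matching rather than the sampling-based acceptance rule from the speculative sampling papers.

```python
from typing import Callable, List, Sequence, Tuple

Token = str
# target(context) -> next token the large model would emit (greedy decoding).
TargetFn = Callable[[Sequence[Token]], Token]
# draft(context, block_size) -> proposed tokens, each with a confidence in [0, 1].
DraftFn = Callable[[Sequence[Token], int], List[Tuple[Token, float]]]


def min_survival_for_load(load: float) -> float:
    """Hypothetical rule: demand more confidence as the serving batch fills up."""
    load = min(max(load, 0.0), 1.0)
    return 0.3 + 0.5 * load


def trim_draft(draft: List[Tuple[Token, float]], min_survival: float) -> List[Token]:
    """Keep the longest prefix whose estimated survival chance stays above the bar."""
    kept: List[Token] = []
    survival = 1.0
    for token, confidence in draft:
        survival *= confidence  # chance that the whole prefix so far is accepted
        if survival < min_survival:
            break
        kept.append(token)
    return kept


def speculative_step(context: Sequence[Token], target: TargetFn, draft: DraftFn,
                     block_size: int, min_survival: float) -> List[Token]:
    """One round: draft a block, trim it, verify it, return the tokens to append."""
    proposed = trim_draft(draft(context, block_size), min_survival)
    accepted: List[Token] = []
    # A real system scores all positions in one batched forward pass;
    # this loop calls the target once per position for clarity.
    for token in proposed:
        expected = target(list(context) + accepted)
        if token != expected:
            accepted.append(expected)  # reject the guess, keep the correction
            return accepted
        accepted.append(token)
    # Every guess survived, so the target contributes one extra token.
    accepted.append(target(list(context) + accepted))
    return accepted


def generate(target: TargetFn, draft: DraftFn, block_size: int,
             min_survival: float, max_tokens: int = 64) -> Tuple[List[Token], int]:
    """Run rounds until end-of-sequence; return the tokens and the round count."""
    context: List[Token] = []
    rounds = 0
    while len(context) < max_tokens and "<eos>" not in context:
        for token in speculative_step(context, target, draft, block_size, min_survival):
            context.append(token)
            if token == "<eos>":
                break
        rounds += 1
    return context, rounds


# Toy stand-ins: the "target" recites a fixed sentence, the "draft" gets one word wrong.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()
DRAFT_GUESS = "the quick brown cat jumps over the lazy dog".split()


def toy_target(context: Sequence[Token]) -> Token:
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"


def toy_draft(context: Sequence[Token], block_size: int) -> List[Tuple[Token, float]]:
    start = len(context)
    return [(token, 0.9) for token in DRAFT_GUESS[start:start + block_size]]
```

With these toy models, `generate(toy_target, toy_draft, block_size=4, min_survival=0.5)` reproduces the target's ten-token output (nine words plus `<eos>`) in three verification rounds instead of ten single-token steps, and the wrong guess ("cat") is replaced by the target's own token. Raising `min_survival`, for example via `min_survival_for_load(1.0)`, shortens the verified prefix, which mirrors the idea of spending less verification budget when the system is busy.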

A single fake error report hijacked Claude Code in controlled testing: the agent ran the attacker's code with the developer's full privileges, and not one alert fired. EDR, WAF, IAM, and the firewall all missed it completely.

Tenet Security's June agentjacking disclosure describes a single crafted Sentry error event, sent through a public credential that requires no breach and no authentication, that injected attacker instructions into error data that Claude Code, Cursor, and Codex then executed as trusted diagnostic output. Tenet tested 100-plus targets in controlled conditions and achieved an 85% success rate. Sentry called the flaw "technically not defensible." The Cloud Security Alliance classified agentjacking as a systemic MCP vulnerability class within days of the disclosure.

No credentials were stolen, no policy was violated, no perimeter was breached: every step in the chain was authorized. That is the problem.

Tenet identified 2,388 organizations with publicly exposed Sentry credentials that could be used to inject malicious events at scale. The research is proof-of-concept, not confirmed exploitation across all 2,388. But one captured Claude Code environment held a live AWS secret access key and private repository URLs.

Here is the scope test: If your AI coding agents are connected to Sentry, Datadog, PagerDuty, Jira, or any MCP-connected data source your developers trust, and those agents can execute shell commands, then your stack has the same blind spot. Organizations running Sentry should audit all publicly exposed DSNs immediately. Sentry's architecture intentionally makes DSN credentials public for frontend error reporting, so the mitigation isn't revoking the DSN; it's restricting what agents can do with the data those DSNs return.
Why your stack can't see it

Agentjacking works because every step is authorized: The attacker sends a valid Sentry API call using a public DSN, the MCP server returns the injected event as authentic output, and the agent executes the instruction using the developer's privileges. No signature fired. The victim saw only benign diagnostics while the agent silently exposed cloud credentials and source-control tokens.

SOC teams have never needed to distinguish between a developer running an npm install and an agent running that command in response to a malicious error event. That distinction did not exist until AI coding agents became production tools. The stack that cannot make it is the stack agentjacking bypasses.

Five surveys, one pattern

Five independent surveys from the first half of 2026 found that enterprises trust their AI agents far more than their enforcement justifies.

Only 34% of organizations apply the same security controls to AI agents as to humans, according to an Okta/Apprize360 survey of 292 executives and 492 knowledge workers. Fifty-two percent of employees use unapproved AI tools, and 58% of executives reported an AI-related incident or close call in the prior year.

HiddenLayer’s 2026 AI Threat Landscape Report surveyed 250 IT and security leaders: 33% reported agents had already exceeded intended scope, and 31% could not confirm whether they had experienced an AI breach. One in eight AI breaches was linked to agentic systems.

Gravitee’s survey of over 900 executives and practitioners found only 14.4% of agents went live with full security approval, and 88% reported confirmed or suspected incidents. A follow-up of 750 leaders in April found agent estates had doubled while monitoring barely moved.

The runtime gap nobody closed

“Securing agents looks very similar to securing highly privileged users,” said Elia Zaitsev, CTO of CrowdStrike, in an interview with VentureBeat.
“They have identities, access to underlying systems, they reason, they take action.”

Zaitsev pointed to the gap the industry left open. “No one has been talking about securing agents at runtime. We are doing that now. What is your safety net? If all these controls fail, how do you prevent them from failing silently?”

CrowdStrike's fleet data quantifies the exposure: more than 1,800 agentic applications on enterprise endpoints, approximately 160 million instances under monitoring. On June 15, CrowdStrike shipped Continuous Identity for AI Agents at Identiverse, replacing static policies with continuous enforcement that authorizes every agent action in real time. The control class that announcement reflects, continuous action-level authorization with verifiable agent identity, is now a baseline procurement criterion regardless of vendor.

“People have kind of forgotten about runtime security,” Zaitsev said. “We did this with endpoint, virtualization, and cloud. People focused on patching vulnerabilities, locking down permissions. Somehow, they always seem to miss something. The safety net is runtime.”

Zaitsev was equally direct about sandbox approaches. “If you start with an agent in a sandbox that has no ability to touch anything, it is worthless. Very quickly, you are in this race of giving it more capabilities. And then what is the point of your sandbox?” Agents derive their value from access. Every access grant is an attack surface.

The governance gap is a budget problem

Kayne McGladrey, an IEEE Senior Member, described the structural challenge in an exclusive interview with VentureBeat. “The CISO doesn’t have the budget. The CISO doesn’t have the staff. We can observe risks, we can advise on business risks, but we don’t own the business systems affected by those risks,” McGladrey said. When agent governance spans six departmental budgets, no single executive can confirm whether agents get the same access reviews as humans.
The Okta survey quantifies the disconnect. Only 43% of workers say agent policies are clear, compared to 65% of executives, and nearly two-thirds apply weaker controls to agents than to humans. The people deploying agents daily do not recognize the governance posture their leadership claims to have built.

Assaf Keren, chief security officer at Qualtrics and former CISO at PayPal, put it plainly. “The real risk starts not by the implementation of AI systems. It is the fact that baseline architecture is not well established. When we put an AI system on top of something not architected well, we are accelerating the fractures.” Keren called runtime behavior analytics “an unsolved problem right now.”

The 5-question gap test

The five-question gap test draws on five surveys from the first half of 2026. Each question maps to a gap that agentjacking exploits. Run this before any Q3 vendor evaluation.

1. Agent inventory
Gap to test: What percentage of agents, MCP connections, and LLM automations completed security review before deployment?
The proof: 14.4% get full security/IT approval before going live. 52% of employees use unapproved AI tools. Average enterprise now manages 37+ deployed agents, roughly doubled from Q4 2025.
What breaks: Unapproved agents are invisible to your identity platform and unaccountable in a breach disclosure. Agentjacking targets exactly these unmanaged MCP connections. No census means no audit trail for regulatory response.
Monday action: Commission a full agent, MCP server, and LLM automation census. Make census completion a procurement gate for all Q3 vendor evaluations. Flag any agent discovered post-census as a shadow AI incident.
Source / sample: Gravitee State of AI Agent Security 2026, 900+ respondents (Feb 2026); Gravitee April 2026 update, 750 senior tech leaders; Okta/Apprize360, 292 execs + 492 workers (June 2026)

2. Controls parity
Gap to test: Do agents receive the same access reviews, privilege scoping, and revocation timelines as human employees?
The proof: 34% always apply the same controls to agents as humans. 61% of privileged access fulfilled without proper review. Only 22% treat agents as independent identity-bearing entities.
What breaks: An agent with a static OAuth token and no review cycle is a permanent privileged account with no termination date. Agentjacking inherits whatever privileges the developer holds. 45.6% of orgs rely on shared API keys for agent-to-agent auth.
Monday action: Add every production agent to the next access review cycle. Mandate human-in-the-loop for any agent action touching PII, financial data, or production infrastructure. Replace shared API keys with scoped, short-lived tokens.
Source / sample: Okta/Apprize360 (784 respondents, June 2026); Palo Alto Networks (2,930 respondents); Gravitee (900+, shared API keys data)

3. Scope drift
Gap to test: Have any agents accessed data or systems beyond their defined scope in the last 12 months?
The proof: 33% report agents already exceeded scope. 53% say agents exceed permissions occasionally or sometimes. Meta Sev 1, March 2026: agent posted sensitive data to unauthorized channel. Only 8% say agents never exceed intended permissions.
What breaks: Scope drift triggers reportable events under GDPR, CCPA, HIPAA, and SEC cybersecurity rules. If detection cannot distinguish agent-initiated from human-initiated access, disclosure timelines are unachievable. Agent-spawned sub-agents (25.5% of deployed agents can create other agents) make audit trails algebraically intractable.
Monday action: Run a 90-day scope-drift audit on every production agent. Compare actual resources touched against approved scope documentation. Block agent-to-agent delegation without explicit human approval for any action exceeding the parent agent’s scope.
Source / sample: HiddenLayer AI Threat Landscape 2026 (250 IT/security leaders); CSA AI Agent Security Survey (scope violations data); Gravitee (agent spawning data)

4. Governance perception gap
Gap to test: Would 50 knowledge workers say your AI agent policies are clear?
The proof: 22-point gap: 65% of executives say policies are clear, 43% of workers agree. 77% of security teams see shadow AI risk but lack visibility to act. 76% cite shadow AI as a definite or probable problem.
What breaks: You are evaluating vendors against a governance posture your workforce does not recognize. Every shadow agent undermines the vendor comparison. Knowledge workers are sharing internal messages (54%), HR data (45%), and confidential docs (39%) with unapproved AI tools.
Monday action: Run a one-question survey before your next vendor demo. If the gap exceeds 15 points, pause procurement. Publish an internal AI agent acceptable-use policy with specific examples of approved and prohibited agent behaviors.
Source / sample: Okta/Apprize360 (784 respondents, June 2026); Ivanti 2026 AI Maturity Report (1,200 respondents); HiddenLayer (shadow AI data)

5. Breach detection certainty
Gap to test: Can your security team confirm whether you experienced an AI-related breach in the last 12 months?
The proof: 31% cannot answer. 88% reported confirmed or suspected AI agent security incidents. One in eight reported AI breaches now linked to agentic systems. Agentjacking proved EDR, WAF, IAM, and firewall pass an agent-mediated attack without a single alert.
What breaks: No basis for disclosure timelines. No evidence chain for incident response. No defensible position in a regulatory investigation. EU AI Act high-risk compliance obligations take effect August 2, 2026.
Monday action: Require agent-specific runtime detection as a procurement prerequisite. Confirm your org can distinguish agent-initiated actions from human-initiated actions in production telemetry. Test your SOC’s ability to attribute a specific action to a specific agent within 60 minutes.
Source / sample: HiddenLayer (250 IT/security leaders); Gravitee (900+, incident rate); Tenet Security (2,388 orgs exposed); CSA (systemic MCP vulnerability classification)

Security director action plan

EU AI Act high-risk compliance obligations take effect August 2, 2026. That is worth factoring into Q3 planning timelines.
Run the five-question gap test above before any Q3 vendor evaluation. It costs nothing to administer, and the procurement clarity it creates is worth far more than the 30 minutes it takes.

Consider mandating agent-specific runtime detection. If your stack cannot tell what an agent did from what a developer did, agentjacking will bypass it the same way it bypassed every layer in Tenet’s testing. That distinction is the one that matters now.

Treat every agent as a privileged insider. According to the Okta/Apprize360 survey, only 34% of organizations apply the same controls to agents as to humans; closing that gap is the single most impactful thing most security teams can do this quarter.

Test the perception gap before investing in new tooling. Ask 50 knowledge workers one question: do you know your company’s AI agent policies? If the gap between their answer and leadership’s answer exceeds 15 points, that is the problem to solve first. No vendor product fixes a governance posture your own workforce does not recognize.

Make agent census completion a procurement gate: every agent, every MCP connection. The security teams getting this right are the ones that started with a complete inventory and worked forward from there.

Agentjacking stripped away an assumption that has survived every security architecture since the first firewall went live. Authorized does not mean safe. When every step in the chain is legitimate, the only defense that matters is the one watching what agents do. Not what policies say. What agents do.
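
To make the runtime distinction concrete, here is a minimal Python sketch of an action-level authorization check that separates agent-initiated from human-initiated actions and records where the triggering instruction came from. It is an illustration only and not any vendor's product: the `ActionRequest` fields, the source labels such as `sentry_mcp`, the set of high-risk actions and the `authorize` rules are all hypothetical assumptions, and a real deployment would derive them from its own telemetry and policy.

```python
from dataclasses import dataclass

# Hypothetical labels for tool outputs that should be treated as untrusted data.
UNTRUSTED_SOURCES = frozenset({"sentry_mcp", "datadog_mcp", "pagerduty_mcp", "jira_mcp"})
# Hypothetical set of actions considered high impact.
HIGH_RISK_ACTIONS = frozenset({"shell_exec", "read_secret", "cloud_api_write"})


@dataclass(frozen=True)
class ActionRequest:
    initiator: str           # "human" or "agent"
    actor_id: str            # developer login or agent identity
    action: str              # e.g. "shell_exec"
    trigger_source: str      # where the instruction came from, e.g. "user_prompt"
    human_approved: bool = False


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def authorize(request: ActionRequest) -> Decision:
    """Decide per action, using who initiated it and what triggered it."""
    if request.initiator == "human":
        return Decision(True, "human-initiated: existing user controls apply")
    if request.action not in HIGH_RISK_ACTIONS:
        return Decision(True, "agent-initiated, low-risk action")
    if request.trigger_source in UNTRUSTED_SOURCES:
        # The agentjacking pattern: tool data is steering a high-impact action.
        return Decision(False, "high-risk action triggered by external tool data")
    if not request.human_approved:
        return Decision(False, "high-risk agent action needs human approval")
    return Decision(True, "high-risk agent action approved by a human")


def audit_record(request: ActionRequest, decision: Decision) -> dict:
    """Flatten the request and decision so a SOC can attribute the action later."""
    return {
        "initiator": request.initiator,
        "actor_id": request.actor_id,
        "action": request.action,
        "trigger_source": request.trigger_source,
        "allowed": decision.allowed,
        "reason": decision.reason,
    }
```

In this sketch, a developer running a shell command passes through unchanged, while the same command requested by an agent in response to Sentry data is denied and logged with its trigger source. The point of the design is attribution: the record carries enough context to answer which agent did what, and why, without relying on the perimeter controls that saw nothing unusual.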

In the past two years, businesses have been trying to fit large language models (LLMs) into support, analytics, development, and internal automation like never before. Along with the increasing adoption of AI technology, another trend is gaining momentum — cybercriminals are taking advantage of the disconnect between assumptions about LLMs and their actual characteristics. In 2025 and 2026, several independent sources have highlighted the same trend: Prompt injection remains one of the most impactful and widely demonstrated attack vectors against LLM systems. The OWASP LLM Top 10 (2025) lists prompt injection as LLM01, identifying it as the most critical category of LLM‑specific vulnerabilities, for the second consecutive edition. OWASP's ranking reflects the fact that LLMs still struggle to reliably separate instructions from data, making them susceptible to manipulation through crafted inputs. CrowdStrike's 2026 Global Threat Report — built on frontline intelligence across more than 280 tracked adversaries — documented that threat actors injected malicious prompts into legitimate generative AI tools at more than 90 organizations in 2025. They then used those injections to generate commands that stole credentials and cryptocurrency. The report stated it plainly: "Prompts are the new malware." AI-enabled adversaries increased their overall attack volume by 89% year-over-year, with prompt injection working as both an entry point and a force multiplier. Real‑world incidents illustrate the operational impact. In August 2024, researchers at PromptArmor disclosed a prompt injection vulnerability in Slack AI that allowed an attacker to exfiltrate data from private Slack channels they had no access to — including API keys shared in private developer channels — by placing a malicious instruction in a public channel or embedding it in an uploaded document. 
In June 2025, researchers at Aim Security disclosed EchoLeak (CVE-2025-32711, CVSS 9.3), the first documented zero-click prompt injection exploit against a production AI system, targeting Microsoft 365 Copilot. By sending a single crafted email, with no user interaction required, an attacker could cause Copilot to access internal files and transmit their contents to an attacker-controlled server. Both vulnerabilities were patched. These incidents underscore that prompt injection is not a theoretical weakness but a practical, repeatable threat organizations must address as they deploy AI systems at scale. Prompt injection techniques have evolved substantially in recent years and now target multi-agent architectures, retrieval-augmented generation (RAG) pipelines, model routers, and long-term memory capabilities.
The enterprise challenge: Too much trust
Businesses deploy LLMs to process instructions, summarize information, and trigger automated workflows, but it is difficult for LLMs to tell:
Instructions from data
Information from context
Context from metadata
User intent from metadata
This creates an opportunity for attackers to manipulate and influence the model's behavior, either directly or indirectly.
Modern prompt injection
Cross-model prompt injection
Chaining multiple LLMs is common practice among enterprises. Attackers corrupt the output of one model, knowing that other models will process that content downstream. The corruption then propagates through every connected AI system.
RAG supply chain poisoning
Attackers create malicious information (documentation, blog articles, GitHub READMEs). Then they wait until it is ingested into enterprises' RAG pipelines and use it as an attack vector.
Agent hijacking
AI agents have evolved to the point where they can send emails, modify cloud infrastructure, execute code snippets, and interact with internal corporate systems. A single injected instruction can be enough to make an agent act in a harmful way.
Context overflow attacks
Taking advantage of million-token context windows, attackers place malicious instructions within a long document and hope that an LLM will stumble upon them and execute them, overriding all previous instructions.
Memory poisoning
Because LLM systems now implement long-term memory, attackers can inject instructions that permanently reconfigure the stored state.
Model‑router manipulation
Enterprises increasingly use model routers to select between multiple LLMs. Attackers craft prompts that force routing to the weakest or least‑guarded model.
Why this matters for business leaders
Prompt injection is not a theoretical problem. It directly affects:
Customer‑facing systems (chatbots, support agents)
Internal copilots (developer tools, security assistants)
Automation workflows (ticketing, cloud operations, HR processes)
Data governance (RAG pipelines, knowledge bases)
The risk is no longer limited to "the model said something it shouldn't." In 2026, prompt injection can:
Trigger unauthorized actions
Leak sensitive data
Corrupt internal workflows
Manipulate analytics
Alter business logic
Compromise multi‑agent systems
The attack surface has expanded dramatically.
What enterprises should do now
1. Constrain model permissions: Limit what the model can do, not just what it should do.
2. Segment untrusted content: Treat all external data, including RAG sources, as potentially hostile.
3. Monitor tool invocation: Require human approval for high‑impact actions.
4. Validate content provenance: Ensure RAG pipelines don't ingest poisoned external content.
5. Harden model routers: Prevent attackers from forcing routing to weaker models.
6. Treat LLMs as untrusted components: This mindset shift is the foundation of modern AI security.
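To make recommendations 1 and 3 concrete, here is a minimal sketch of a gate that sits between a model and its tools. It is an illustration under stated assumptions, not a reference implementation: the tool names, the two risk tiers, and the `authorize` function are all hypothetical, and a real deployment would define its own allowlist and approval workflow.

```python
from dataclasses import dataclass, field

# Hypothetical tool names and risk tiers; a real deployment defines its own.
TOOL_RISK = {
    "search_docs": "low",
    "create_ticket": "low",
    "send_email": "high",
    "modify_cloud_resource": "high",
}

# Every proposed call is recorded, whether or not it is allowed.
AUDIT_LOG: list[dict] = []


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def authorize(call: ToolCall, human_approved: bool = False) -> str:
    """Decide what happens to a tool call proposed by the model.

    Returns "allow", "needs_approval", or "deny". The model's output is
    treated as an untrusted request, never as a command.
    """
    risk = TOOL_RISK.get(call.name)
    if risk is None:
        # Not on the allowlist: the model cannot do this at all.
        decision = "deny"
    elif risk == "high" and not human_approved:
        # High-impact actions wait for a person to sign off.
        decision = "needs_approval"
    else:
        decision = "allow"
    AUDIT_LOG.append({"tool": call.name, "args": call.args, "decision": decision})
    return decision
```

The design choice worth noting is that the decision depends only on the tool being requested, not on anything the model says about why it wants the tool. An injected instruction can change what the model asks for, but it cannot change the allowlist or approve its own request.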
The bottom line Prompt injection remains the most effective way to compromise enterprise AI systems because it exploits the fundamental way LLMs interpret text. Until organizations treat LLMs as untrusted interpreters — not autonomous decision‑makers — prompt injection will continue to dominate the AI threat landscape. Julie Brunias is an AI Security Architect.

Anthropic recently told its growth team to hire more product managers, not fewer. The reason, as reported in industry coverage, was that Claude Code had quietly turned its engineering org into a team that ships at roughly three times its actual headcount, and the bottleneck moved from the integrated development environment (IDE) to the people deciding what to build. That detail is easy to miss in the noise of every AI productivity claim. It is also the structural shift the rest of the industry is now living through. The bottleneck in software is no longer typing. It is deciding what to type. And the engineers who treat that as someone else's problem are about to plateau. For most of the last decade, that decision sat with someone else. Software engineering was a craft you absorbed slowly, then practiced in a long, predictable sequence: Dive deep on the technology, write the code, ask Stack Overflow when stuck, escalate to a senior engineer when Stack Overflow failed, ship the ticket. The product manager owned the funnel. The engineer owned the build. Both sides treated this division as physics. Then the funnel collapsed in five steps. A short history of how the engineer's day got compressed The Stack Overflow era (2014 to late 2022): The way engineers thought lived in one place. But new monthly questions on Stack Overflow are now down roughly 77% since November 2022, which was not coincidentally when ChatGPT launched. The drop is not a referendum on the site. It is a referendum on the workflow it represented. The browser-tab era (late 2022 to 2024): The first ChatGPT generation sat outside the IDE. Engineers ran the same loop they had always run, just with a faster oracle: Write a prompt in a browser, paste the answer back into VS Code, repeat. The work was still single-threaded and engineer-driven. The leverage was real but local. The IDE-native era (2024 to 2025): Cursor and Claude Code moved the model inside the editor and gave it access to the full repository. 
The senior-engineer escalation path largely dissolved. For years, the prevailing wisdom among veteran engineers was that Bash had the longest shelf life of any tool in the stack. By 2026, for a meaningful share of working developers, the first command typed in a fresh terminal is claude. The spec-driven era (2025 to 2026): Larger context windows turned work that previously required tickets, design docs, and sprints into something that fits in a single session. Amazon's Kiro IDE team reportedly compressed feature builds from two weeks to two days using the same spec-driven workflow they were shipping. An AWS engineering team described how an 18-month rearchitecture, originally scoped for 30 engineers, was completed by 6 people in 76 days. The bottleneck stopped being how long it takes to write the code. It started being how clearly the team can describe what correct looks like. The routines era (2026): In April, Anthropic shipped Claude Code Routines: Scheduled, persistent agents that run on a cadence, on a webhook, or overnight while the laptop is closed. Cron came back. Hooks came back. The engineer's job is now part orchestration: Spin up a swarm before bed, review a stack of pull requests in the morning. Third-party wrappers like OpenClaw, which was briefly suspended by Anthropic in April before partial reinstatement, made the same point from the open-source side. The bottleneck moved; most teams have not Engineering output has roughly tripled. Product management has not budged. The traditional 1:8 ratio of PMs to engineers, already strained, now plays out closer to an effective 1:20 because each engineer ships more per day. For instance, LinkedIn replaced its associate product manager track with a "Product Builder" program that trains generalists across product, design, and engineering. Anthropic is hiring more PMs, not fewer. 
The pattern is consistent across companies that have actually deployed agentic workflows in production: The system is producing built features faster than it is producing decisions about what should be built. For engineers, this is the most important career signal of the decade, and the easiest one to miss while the productivity stories dominate the feed. First principles matter more, not less The instinct to declare fundamentals obsolete in the agent era gets the trend exactly wrong. When a memory leak takes down production at 3 a.m., and the cause turns out to be a subtle ownership bug pushed 4 years ago, no agent currently in the wild closes that loop end-to-end. Operating systems, networks, concurrency, and query plans still decide who can resolve a real incident. They also decide who can spot the moments when an agent's output looks correct on the surface and is quietly, expensively, wrong underneath. The agent that wrote 70% of the code in a modern repo cannot reliably tell anyone where its assumptions about thread safety, memory ownership, or transaction isolation diverged from the runtime. The engineer who can read the diff and catch that is the engineer the rest of the team needs in the room, and that engineer is built on fundamentals, not on prompting skill. The corollary is that fundamentals are now a leverage skill, not a hygiene skill. In 2014, knowing how a TCP retransmit worked got a debug ticket closed faster. In 2026, the same knowledge keeps an entire agent-driven release pipeline from shipping a regression at scale. The blast radius of the engineer who knows what is happening underneath has gone up, not down. Review is the new writing Engineers in 2026 generate code at a rate that exceeds what any of them can read carefully. The team that ships fast and survives is the team whose engineers treat reviewing AI-generated code with at least the same rigor they once reserved for writing it. 
The 2025 Stack Overflow developer survey put 84% of developers on AI tools, with 46% saying they do not trust the output, up sharply from 31% the year before. That gap, heavy use paired with low trust, is exactly where review skills now matter most. Coders who push lots and review little are accumulating a debt that will come due during the first real incident, and the engineer who can pay it back is the one who paired their volume with deep first-principles knowledge of the systems involved. The new differentiator is the product funnel Both of those are necessary. Neither is sufficient. The engineer who matters in 2026 is the one who has stopped waiting for the funnel to arrive in the form of a Jira ticket. That means doing things the role was historically allowed to skip. Talk to customers. Watch how they actually use the product. Read the support queue. Sit in on the sales call. The signal a product team gets through three layers of summary, an engineer can now get firsthand in an afternoon. Generate ideas, not just estimates. The product manager who used to source ideas for 8 engineers cannot source ideas for 20 at the same fidelity. The engineer who shows up with a validated, scoped opportunity is no longer doing the PM's job. The engineer is doing the job the new ratio requires. Work backwards from the customer. Amazon has been writing the press release first for two decades. The discipline travels well to teams of one and to swarms of agents. Both produce a great deal of working software in the wrong direction without a clear statement of what "customer wins" means before any code is written. Stop hiding behind bandwidth. The honest answer to "Do you have capacity for this idea?" used to be 'No.' With routines, hooks, and a cooperative agent stack, the honest answer is closer to "What is the idea worth?" That is a different conversation, and a much harder one to have without a real point of view on the customer. 
What the next decade rewards The five-phase history above is not really a history of tools. It is a history of which part of the job a human had to do. The part that is still human, and that will remain human for the foreseeable future, has moved up the funnel: From typing, to reviewing, to deciding, to choosing the customer to serve and the problem to solve. The 2026 version of a great engineer is not the one who writes the most code. It is the one who knows what to build, can prove it is worth building, and has the agent fleet plus the review discipline to ship it without the system collapsing under its own velocity. Engineers who internalize this will spend the next decade doing the most interesting work software has ever produced. Engineers who wait for a ticket will spend it watching the ticket get written by the agent next to them. Ishan Gupta is a software engineer at Amazon.

Long-horizon reasoning exposes a core weakness in AI agents: context windows fill up fast, and retrieval pipelines return noise instead of signal. To solve this, researchers at the National University of Singapore developed MRAgent, a framework that abandons the static "retrieve-then-reason" approach. Instead, it uses a mechanism that allows an agent to dynamically develop its memory based on accumulating evidence. This multi-step memory reconstruction is integrated into the reasoning process of the large language model (LLM). While not the only framework in this space, MRAgent significantly reduces token consumption and runtime costs compared to other agentic memory management approaches. The limits of passive retrieval in long-horizon tasks In classic retrieval pipelines, documents are retrieved through vector search or graph traversal and passed on to an LLM for reasoning. This passive approach fails because it cannot combine reasoning with memory access, creating three major bottlenecks: These systems cannot revise their retrieval strategy mid-reasoning. If an agent fetches a document and discovers a crucial missing cue — a specific date or person — it has no way to issue a new query based on that finding. Fixed similarity scores and predefined graph expansions return surface-level matches that flood the LLM's context window with irrelevant noise, degrading reasoning. Current systems rely heavily on pre-constructed structures such as top-k results and static relevance functions, limiting the flexibility required to scale across unpredictable, long-horizon user interactions. The researchers argue that to overcome these limitations, developers must shift toward an “active and associative reconstruction process,” a concept inspired by cognitive neuroscience. Under this paradigm, memory recall unfolds sequentially rather than operating as a passive read-out of a static database. 
The system starts with small, specific triggers from the user's prompt, such as a person's name, an action, or a place. These initial hints point to connecting concepts or categories instead of massive blocks of text. By following these metadata stepping stones, the agent gathers small pieces of evidence one by one. It uses each new piece of information to guide its next step until it successfully pieces together the full, accurate story. How MRAgent implements active memory reconstruction Instead of viewing memory as a static database, MRAgent (Memory Reasoning Architecture for LLM Agents) treats it as an interactive environment. When processing a complex query, the agent uses the backbone LLM’s reasoning abilities to explore multiple candidate retrieval paths across a structured memory graph. At each step, the LLM evaluates the intermediate evidence it has gathered and uses it to iteratively optimize its search. It infers new search constraints, pursues the paths with the best information, and prunes irrelevant branches. This allows MRAgent to piece together deeply buried information without filling the LLM’s context with noise. To make this active exploration computationally efficient and scalable, the framework organizes its database using a “Cue-Tag-Content” mechanism. This operates as a multi-layered associative graph with three node types: Cues: Fine-grained keywords, such as entities or contextual attributes extracted from user interactions. Content: The actual stored memory units. These are divided into multi-granular layers, such as episodic memory for concrete events and semantic memory for stable facts and user preferences. Tags: Semantic bridges that summarize the relational associations between specific Cues and Content. This structure enables a highly efficient two-stage retrieval process. The LLM first navigates from Cues to candidate Tags. 
Because Tags explicitly expose the semantic relationships and structural associations of the data, the agent evaluates these short summaries to judge their relevance. The LLM identifies promising traversal paths and discards irrelevant branches before spending compute and prompt tokens to access the detailed, heavy memory contents. For example, a user might ask an AI agent, "How did Nate use the prize money when he won his third video game tournament?" MRAgent first extracts fine-grained starting cues from the prompt, such as "Nate," "video game tournament," and "win." The agent maps these initial cues to the memory graph and looks at the available associative Tags connected to them. The agent sees tags like "Tournament Victory" and "Tournament Participation.” Since it is only concerned with what the person did after they won the championship, MRAgent drops the tournament participation tag and pursues the victory tag. The agent retrieves the episodic content linked to the chosen Cue-Tag pair, retrieving three distinct memory episodes where Nate won a tournament. MRAgent looks at the three memories, decides one of them in particular is relevant to the query, and discards the other two. With this information, it updates its cues and starts another round of discovery and pruning. From the new episodic memory it has retrieved, the agent adds “tournament earnings” to its cues and uses that to traverse new tags and home in on new memories. It repeats this process until it gathers enough information to answer the query, which could be something like “Nate saved the money.” MRAgent performance on industry benchmarks MRAgent operates alongside several other frameworks addressing agentic memory building. Alternatives include A-MEM, a graph-based agentic memory framework, and MemoryOS, a hierarchical memory framework. Other persistent memory frameworks include LangMem and Mem0. The researchers tested MRAgent on the LoCoMo and LongMemEval industry benchmarks. 
These test the abilities of agents to resolve queries on long-horizon tasks and conversations across dozens of sessions and hundreds of turns of dialogue. The backbone models used were Gemini 2.5 Flash and Claude Sonnet 4.5. The system was tested against standard RAG, A-MEM, MemoryOS, LangMem, and Mem0. MRAgent consistently outperformed every baseline across both models and all question types by a significant margin. However, for enterprise developers, the most critical metric is often computational cost. In the LongMemEval tests, MRAgent slashed prompt token consumption to just 118k per sample. By comparison, A-MEM consumed 632k tokens, and LangMem burned through 3.26 million tokens per query, roughly 5 and 28 times as many, respectively. MRAgent also effectively halved the runtime compared to A-MEM, finishing in 586 seconds against A-MEM's 1,122 seconds. What makes MRAgent efficient in practice is its on-demand behavior. Evaluating tags and pruning irrelevant paths before retrieval saves money and context space. Furthermore, the system autonomously evaluates its accumulated context and inherently knows when to stop searching, completely avoiding redundant data exploration. Implementation and development catch While MRAgent is highly effective, the Cue-Tag-Content structure needs to be prepared before the agent can query it. Developers must figure out how to architect the underlying memory database to enable the LLM to efficiently navigate associative items and prune irrelevant paths without exploding compute costs. Fortunately, developers do not have to manually label or structure this data. The authors designed MRAgent with an automated distillation pipeline that uses LLMs to process raw interaction histories and automatically populate the memory graph. For a developer, the job is to implement and orchestrate this automated ingestion pipeline, rather than manually tag data. 
You need to set up a background job or streaming pipeline that passes raw user interactions through prompt templates to extract this metadata before storing it in your graph database. However, the authors emphasize that this is a lightweight construction phase and MRAgent intentionally keeps ingestion simple. The authors have released the code on GitHub.
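For readers who want a feel for the two-stage Cue-Tag-Content lookup described earlier, the following is a toy sketch, not the authors' code. The memory entries are invented around the article's Nate example, the class and function names are hypothetical, and the relevance check is a plain keyword match standing in for the judgment that MRAgent delegates to the backbone LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    cue: str             # fine-grained keyword the tag hangs off
    summary: str         # short description the agent reads first
    content_ids: tuple   # memory units reachable through this tag


# Toy memory graph; every entry is invented for illustration.
TAGS = [
    Tag("Nate", "Tournament Victory", ("m1", "m2", "m3")),
    Tag("Nate", "Tournament Participation", ("m4",)),
]
CONTENT = {
    "m1": "Nate won his first video game tournament.",
    "m2": "Nate won his second tournament and celebrated with friends.",
    "m3": "Nate won his third tournament and saved the tournament earnings.",
    "m4": "Nate entered a regional tournament but lost in round two.",
}


def tag_is_relevant(tag: Tag, query_terms: set) -> bool:
    """Stand-in for the LLM's relevance judgment (assumption: keyword match)."""
    return any(term in tag.summary.lower() for term in query_terms)


def retrieve(cues: set, query_terms: set) -> list:
    # Stage 1: walk from cues to tags and prune using only the short summaries.
    candidates = [t for t in TAGS if t.cue in cues]
    kept = [t for t in candidates if tag_is_relevant(t, query_terms)]
    # Stage 2: only now pay for the heavier content behind the surviving tags.
    return [CONTENT[cid] for t in kept for cid in t.content_ids]


memories = retrieve({"Nate"}, {"victory"})
```

In this sketch the "Tournament Participation" branch is dropped before its content is ever read, which is the source of the token savings the researchers report. A fuller version would loop: inspect the retrieved episodes, add new cues such as "tournament earnings", and repeat until the agent judges it has enough evidence.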

An endpoint agent cannot report its own absence. The 2026 Axonius Actionability Report, conducted with the Ponemon Institute and surveying 662 IT and security professionals, put a number on a gap SOC teams have worked around for years. Across the Axonius customer base, 12.7% of devices in a 298,000-device median inventory are missing their expected security agent. If a device has no agent, no management console shows it. If a CMDB record is stale, no reconciliation flags it. An employee who installed Claude Enterprise outside procurement created a SaaS workspace, identity surface, and API-token footprint that endpoint telemetry alone will not reliably inventory. The coverage percentage on the EDR dashboard is structurally incomplete because the reporting mechanism cannot see what it does not cover. That gap matters more now than it did six months ago. SOC and XDR vendors are pushing more autonomous investigation and remediation into production. Those agents will query the same dashboards, trust the same coverage percentages, and act on the same blind spots human analysts learned to work around. A human analyst second-guesses a 98% coverage number. An autonomous agent treats it as ground truth and moves at machine speed. Three independent signals converged on the same gap Gravitee’s 2026 survey of 900-plus executives found 88% reported confirmed or suspected AI-related incidents, and only 14.4% sent agents live with full security approval. The Axonius/Ponemon report found 52% of respondents would let autonomous agents act on recommendations — while 63% said the underlying data lacks important information. The CSA's Agentic Trust Framework requires verified data governance before agents act on any finding. Mike Riemer, Field CISO at Ivanti, said that known vulnerabilities on Azure’s honeypot networks are now attacked in under 90 seconds. “Traditional security measures continue to work,” Riemer told VentureBeat. 
The caveat is that those measures only protect what they can see. An EDR agent deployed across 87.3% of the device inventory leaves the remaining 12.7% outside that agent’s telemetry, policy enforcement, and detection logic. Exclusive deployment data quantifies the scale Joe Diamond, CEO of Axonius, told VentureBeat that the average CISO sees roughly 50% of what is actually on the network. “Say 50% of their environment is sitting in dark matter,” Diamond said. “They don’t know what it is, or where it is, or who has access to it, if it’s secure, if it’s not secure.” Deployment data from more than 900 Axonius customers confirms those numbers. TransUnion went from 70% to 99% endpoint coverage after out-of-band verification. Western Union went from 85% to 99% by consolidating data from 38 tools and cutting manual workload by half. Lumen discovered 1.1 million assets, where the CMDB showed 17,000. That translates to roughly 37,000 unmanaged endpoints per organization sitting outside every policy, every patch cycle, and every detection rule. Diamond pointed to Mythos, Anthropic’s frontier reasoning model, as a sign that machine-speed offensive capability will make any unknown asset far riskier than it is today. “People tend to have shiny object syndrome,” he said. “If you didn’t understand what 50% of your environment looked like from a traditional endpoint perspective, and you think you’re going to wind sprint to granular control and governance of AI, your program will fail.” Diamond called the broader AI shift “as big, if not bigger than the internet.” Three approaches compete to close the gap No single architecture solves the visibility problem today. Three approaches compete, each with named tradeoffs security teams should evaluate before procurement. A dedicated integration layer uses bidirectional API adapters to build an always-current inventory. 
Axonius runs 1,400-plus adapters and now discovers shadow Claude Enterprise installations via its Anthropic adapter (GA June 15). “We created a bidirectional API integration with all the IT systems and all the security controls to build an always up-to-date inventory of what the environment looks like,” Diamond told VentureBeat. Platform-native EDR and XDR intelligence builds richer asset context inside the agent footprint. Depth within the agent footprint is the advantage. The limitation is structural. Platform-native intelligence is bounded by what the agent can see, and the gap the Ponemon report identified lives precisely where that visibility ends. CMDB modernization requires continuous reconciliation against three or more independent telemetry sources. Only 13% of organizations reconcile daily, according to Axonius/Ponemon data. The remaining 87% operate on stale records that feed incorrect prioritization into any automated remediation pipeline. EDR data readiness: Five gates before autonomous remediation Before you let autonomous SOC agents close tickets or quarantine assets, this checklist tells you whether your EDR and asset data is solid enough to trust. It is vendor-agnostic, works with any EDR and CMDB, and gives you five pass/fail gates you can run in a single working session.

| Risk area | What the data shows | Readiness threshold | Action to take now |
| --- | --- | --- | --- |
| Asset inventory delta | Ponemon: only 45% consolidate into a single view. Forrester TEI: 150% more assets than previously identified. Lumen: 17K in CMDB vs. 1.1M discovered. | Delta ≤10% between discovery, CMDB, and EDR agent count. Delta above 10% blocks automated remediation until reconciled. | Run API-based discovery against all segments. Diff against CMDB and EDR console count. Reconcile quarterly minimum. |
| Unmanaged AI services | Gravitee: 88% confirmed or suspected AI incidents. Only 14.4% with full security approval. Anthropic adapter (GA June 15) discovers unmanaged Claude Enterprise installations. | No high-risk AI services outside approved procurement. Weekly SaaS discovery scans. Unmanaged high-risk instances trigger IR triage before exception review. | Deploy SaaS discovery or protocol-level adapters for AI service detection. Automate weekly scans. Route unmanaged instances to IR queue. |
| CMDB record accuracy | Ponemon: only 13% reconcile daily (RSAC 2026). Brooks Running: 20% server discrepancy between console and independent discovery. Top remediation barriers: unclear prioritization, unclear ownership, inconsistent data. | ≥85% of records validated against 3+ independent telemetry sources. No stale or orphaned records in active remediation queue. | Cross-reference CMDB against cloud inventory, EDR telemetry, and IdP directory. Continuous reconciliation replaces annual audit cycles. |
| Endpoint agent coverage gap | Ponemon: an agent cannot report its own absence (p. 8). TransUnion: 70% to 99% after out-of-band verification. RSAC 2026: 12.7% of 298K median devices missing expected agent. | ≥95% agent coverage verified via out-of-band discovery. Many CISOs set this as the minimum before allowing autonomous remediation. No self-reported-only metrics in board reports. | Run network-based or API-driven discovery against managed device list. Coverage below 95% blocks automated remediation scoping. |
| Asset ownership mapping | Ponemon: 32% apply tags consistently. Only 51% assign ownership on new exposures (pp. 9, 16). TransUnion: 12K to 190K assets with ownership mapped. | Owner assigned within 24 hours. Tags consistent across cloud, EDR, CMDB. Three systems showing three owners = failure. | Automate ownership via cloud tags, IdP group membership, or CMDB metadata. Map asset, remediation, and business owner as separate fields. |

Five questions to ask before allowing autonomous SOC action What independently verifies endpoint-agent coverage outside the EDR console? How does the SOC reconcile conflicts between EDR, CMDB, cloud inventory, IdP, and discovery tools? 
Can AI agents act on assets with unknown or disputed ownership? Can the system distinguish “not vulnerable” from “not visible”? What data-quality gate blocks autonomous remediation when coverage or ownership falls below threshold? Board-ready risk framing Kayne McGladrey, IEEE Senior Member, has confirmed the pattern across multiple published VentureBeat interviews. The structural gap in self-reported coverage is not new. What is new is that autonomous agents will act on it at machine speed without the institutional workarounds human analysts developed over years of experience. Diamond put the board-level stakes plainly in an April 2026 press statement: “Findings pile up because the data isn’t trusted, ownership isn’t clear, and entire asset classes aren’t even in the picture.” The CSA’s Agentic Trust Framework requires that any agent promoted to a higher autonomy level must pass five gates, including demonstrated accuracy and a security audit. The EU AI Act’s Article 50 transparency obligations take effect August 2, 2026. The May 2026 Digital Omnibus pushed high-risk system obligations to December 2027, but organizations deploying agentic SOC agents on incomplete asset data face immediate operational risk that outpaces any regulatory timeline. The board-ready sentence: Our EDR coverage reports are structurally incomplete because an endpoint agent cannot report its own absence, and we are verifying coverage through out-of-band discovery before deploying autonomous agents that would act on those reports at machine speed. Security director playbook Run out-of-band asset discovery this week. Compare results against your CMDB export and EDR console count. If the delta exceeds 10%, halt automated remediation scoping until the gap is reconciled. Deploy SaaS discovery for AI services. Employees install AI ahead of procurement, ahead of security. Weekly scans are the minimum. 
Route any unmanaged high-risk instance to your incident response queue for triage before exception review. Map asset ownership to remediation responsibility. Ponemon found only 32% of organizations apply tags consistently. If three systems show three different owners for the same asset, automated remediation has no routing target. Fix the ownership layer before deploying agents that depend on it. Kill self-reported-only coverage metrics. Any risk calculation or board report that relies on EDR console-reported coverage alone is built on data the reporting system cannot verify. Require out-of-band verification for every coverage number that informs a risk decision.
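The five readiness gates can also be expressed as a small pass/fail check that runs before any autonomous remediation job is scheduled. The sketch below is one way to encode the thresholds from the checklist. The field names are hypothetical, and the definition of "delta" (the spread between the largest and smallest of the three counts, relative to the largest) is an assumption, since the checklist does not spell out the formula.

```python
from dataclasses import dataclass


@dataclass
class DataSnapshot:
    discovered_assets: int              # from out-of-band or API-based discovery
    cmdb_assets: int                    # records in the CMDB export
    edr_agent_count: int                # devices reporting in the EDR console
    unmanaged_high_risk_ai: int         # AI services found outside procurement
    cmdb_validated_pct: float           # records validated against 3+ sources
    stale_records_in_queue: int         # stale or orphaned records in remediation
    oob_agent_coverage_pct: float       # coverage verified out-of-band
    assets_with_conflicting_owners: int
    max_hours_to_assign_owner: float


def inventory_delta(s: DataSnapshot) -> float:
    """Assumed definition: (max - min) / max across the three asset counts."""
    counts = (s.discovered_assets, s.cmdb_assets, s.edr_agent_count)
    largest = max(counts)
    return 0.0 if largest == 0 else (largest - min(counts)) / largest


def readiness_gates(s: DataSnapshot) -> dict:
    """Return one pass/fail result per gate, using the checklist thresholds."""
    return {
        "asset_inventory_delta": inventory_delta(s) <= 0.10,
        "unmanaged_ai_services": s.unmanaged_high_risk_ai == 0,
        "cmdb_record_accuracy": (
            s.cmdb_validated_pct >= 85.0 and s.stale_records_in_queue == 0
        ),
        "endpoint_agent_coverage": s.oob_agent_coverage_pct >= 95.0,
        "asset_ownership_mapping": (
            s.assets_with_conflicting_owners == 0
            and s.max_hours_to_assign_owner <= 24.0
        ),
    }


def remediation_allowed(s: DataSnapshot) -> bool:
    # Any single failed gate blocks autonomous remediation.
    return all(readiness_gates(s).values())
```

Returning the per-gate dictionary rather than a single boolean matters in practice: the team that owns the CMDB and the team that owns endpoint coverage are usually different, and each needs to see which gate failed.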

OpenAI is announcing a limited preview of GPT-5.6, its newest family of frontier AI models, which comes in three variants: Sol, Terra, and Luna. Sol is for the hardest problems, such as complex coding and security research; Terra is for high-volume business tasks like customer support, internal tools and document analysis; and Luna is for faster, lower-cost everyday work like summarization, drafting and routine automation. Sol and Terra set new high benchmark scores, while Luna performs near GPT-5.5 levels on several tests despite being positioned as the fastest and lowest-cost model in the GPT-5.6 family. However, the models are being made available initially to a narrow set of approximately 20 total organizations, after OpenAI shared the models and release plans with the U.S. government. A general release is planned for "the coming weeks." The staggered release follows an executive order issued by President Donald J. Trump earlier this month, on June 2, 2026, which calls upon various federal agencies to collaborate on a process for benchmarking and assessing capabilities of new AI models to ensure they are safe and appropriate for wide release. While this process remains underway (the order said it would take 30 days, which points to July 2), OpenAI says in its release blog post that it "previewed our plans and the models’ capabilities ahead of today’s launch. At [the U.S. government's] request, we are starting with a limited preview for a small group of trusted partners." OpenAI's limited preview release strategy also follows the drastic step taken by the U.S. government to issue an export control order against Anthropic, OpenAI's top U.S. competitor, over jailbreaks found in its most powerful generally released model, Claude Fable 5. Anthropic responded by removing all access, for public and private parties alike, to that model and to its cybersecurity-focused counterpart, Claude Mythos 5. 
(Anthropic had earlier previewed a prior version of the model as "Claude Mythos Preview" to a selected small number of external participants in its cybersecurity research program "Project Glasswing," dating back to April.) Because OpenAI is coordinating its release framework with the White House ahead of a broader public launch, enterprise buyers must navigate a novel landscape of real-time safety interventions, mandatory compliance parameters, and structured token caching systems. How the 3 new GPT-5.6 models differ: Sol vs. Terra vs. Luna The three GPT-5.6 models are designed to address different enterprise needs and performance profiles. Sol is the top-tier option, built for the most demanding tasks such as complex reasoning, extended coding sessions, advanced agent-driven workflows, and security-focused applications. Sol delivers the highest level of capability but comes at the highest price: $5.00 per million input tokens / $30.00 per million output tokens — the same as GPT-5.5 — and OpenAI says it delivers a major performance gain for long-running coding, cybersecurity and agentic tasks. Terra balances strong performance with efficiency. It is intended for large-scale production environments where organizations need reliable results across high volumes of work without the overhead of the most advanced model. It's available for $2.50/$15 per 1M tokens. Luna is the most lightweight and cost-efficient option, optimized for speed and everyday use cases. It is well suited for simpler tasks, routine workflows, and applications where responsiveness and scalability are more important than maximum depth of reasoning, and is the most affordably priced at $1/$6 per million tokens in and out, respectively. 
Sources with knowledge of OpenAI's inner workings shared with VentureBeat that the new naming scheme was designed to move away from the "nano" and "mini" variants of GPT-5, as these models are not so different in terms of size or raw intelligence, but rather, designed for different distinct use cases. As OpenAI states in its blog post about the new naming scheme: "In this new naming system introduced with GPT‑5.6, the number identifies a model’s generation, while Sol, Terra, and Luna identify durable capability tiers that can advance on their own cadence. Together, the family gives people and developers clearer choices across intelligence, speed, and cost." Also, sources said OpenAI sought to evoke a sense of inspiration by looking to the cosmos and names associated with it. Further, Sol fits well alongside OpenAI's Daybreak opt-in program for organizations interested in using OpenAI models to bolster cyber defense, which is an added bonus. The "Sol" voice style for OpenAI's voice mode on ChatGPT is unrelated, and will likely be renamed. The new GPT-5.6 system card adds another important point for businesses: OpenAI is classifying all three GPT-5.6 models — not just Sol — at its “High” risk level for both cyber and biological/chemical capability, while rating them below that level for AI self-improvement. That means even the cheaper Terra and Luna tiers may carry new governance obligations for companies using them in security, life sciences or other sensitive workflows. 
Here's how they stack up against the rest of the current leading LLM field in price. Note that OpenAI's cheapest option is overall a mid-priced model, and still more expensive than the frontier-level GLM-5.2.

VentureBeat Frontier AI Model API Pricing Snapshot

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Total Cost | Source |
|---|---|---|---|---|
| MiMo-V2.5 Flash | $0.10 | $0.30 | $0.40 | Xiaomi MiMo |
| deepseek-v4-flash | $0.14 | $0.28 | $0.42 | DeepSeek |
| deepseek-v4-pro | $0.435 | $0.87 | $1.305 | DeepSeek |
| MiniMax-M3 | $0.30 | $1.20 | $1.50 | MiniMax |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | Google |
| Qwen3.7-Plus | $0.40 | $1.60 | $2.00 | Alibaba Cloud |
| MiMo-V2.5 | $0.40 | $2.00 | $2.40 | Xiaomi MiMo |
| Grok 4.3 (low context) | $1.25 | $2.50 | $3.75 | xAI |
| MiMo-V2.5 Pro (≤256K) | $1.00 | $3.00 | $4.00 | Xiaomi MiMo |
| Kimi-K2.6 | $0.95 | $4.00 | $4.95 | Moonshot/Kimi |
| GLM-5.2 | $1.40 | $4.40 | $5.80 | Z.ai |
| GPT-5.6 Luna | $1.00 | $6.00 | $7.00 | OpenAI |
| Grok 4.3 (high context) | $2.50 | $5.00 | $7.50 | xAI |
| MiMo-V2.5 Pro (>256K) | $2.00 | $6.00 | $8.00 | Xiaomi MiMo |
| Qwen3.7-Max | $2.50 | $7.50 | $10.00 | Alibaba Cloud |
| Gemini 3.5 Flash | $1.50 | $9.00 | $10.50 | Google |
| Gemini 3.1 Pro Preview (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.6 Terra | $2.50 | $15.00 | $17.50 | OpenAI |
| GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI |
| Gemini 3.1 Pro Preview (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.8 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.5 | $5.00 | $30.00 | $35.00 | OpenAI |
| GPT-5.5 Instant (chat-latest) | $5.00 | $30.00 | $35.00 | OpenAI |
| Sakana Fugu Ultra (≤272K) | $5.00 | $30.00 | $35.00 | Sakana AI |
| GPT-5.6 Sol | $5.00 | $30.00 | $35.00 | OpenAI |
| Claude Fable 5 / Claude Mythos 5 | $10.00 | $50.00 | $60.00 | Anthropic |

Technology: deeper reasoning and subagent-based work

The main technical change in GPT-5.6 centers on giving the model more time and structure for hard tasks during inference. OpenAI is adding a new max reasoning setting for GPT-5.6 Sol, aimed at problems that require more extended deliberation. OpenAI is also introducing ultra mode, which brings in subagents that can split up and accelerate complex projects, rather than keeping the work inside a single-agent flow.
The company’s launch evaluations suggest this approach improves performance on several agent-style tasks.

Benchmarks show measurable improvement from GPT-5.5, and new state-of-the-art on TerminalBench 2.1 command-line tasks

The GPT-5.6 series demonstrates a clear performance leap over its predecessors across complex reasoning and long-horizon tasks. In command-line automation evaluated on TerminalBench 2.1, both the flagship Sol model and the mid-tier Terra outpace the previous GPT-5.5 benchmark, though notably Sol used the new ultra thinking mode to achieve a record-high score of 91.91% on the benchmark, and the max mode achieved 88.76%, ahead of both GPT-5.5's 83.4% and Claude Mythos 5's 88%. This superiority extends into professional workflows on Agent's Last Exam, where Sol is the sole model to successfully clear the halfway mark for task completion at 50.9% in "code mode," while the everyday Luna tier also manages to narrowly edge out the prior generation's flagship. In quantitative biology and genomics testing, Sol and Terra achieve higher accuracy rates than both GPT-5.5 and GPT-5.4, with Sol explicitly managing these stronger results while consuming fewer tokens. Finally, across cybersecurity evaluations measuring vulnerability research and exploitation, the new models push past prior performance ceilings; Sol reaches significantly higher intended exploit rates as reasoning time scales up and achieves competitive capability caps using a fraction of the output tokens required by older models. On ExploitBench, OpenAI says Sol performs near Mythos Preview while generating roughly one-third as many output tokens.

Predictable prompt caching mechanics and a Cerebras speed bump

To help enterprises control the unpredictable cost curves of running agentic loops, the GPT-5.6 API introduces a revamped prompt caching protocol. Developers can now implement explicit cache breakpoints, backed by a guaranteed 30-minute minimum cache lifetime.
Under this framework, initial cache writes cost 1.25x the model’s standard uncached input rate, while later cache reads receive a 90% discount. In practice, businesses running repeated or similar operations pay more to establish the cache, then much less each time they reuse that cached context during at least the 30-minute minimum cache window. For systems that routinely pass massive context windows or codebase definitions back into the model, this predictability is a critical financial guardrail. Furthermore, for enterprise applications where latency is the primary barrier to adoption, OpenAI is launching GPT-5.6 Sol on Cerebras hardware this July. This infrastructure partnership claims processing speeds of up to 750 tokens per second, targeting specialized enterprise applications requiring real-time, frontier-grade reasoning.

Enterprise implications: High security and algorithmic friction

For corporate engineering, information security, and compliance teams, the deployment of GPT-5.6 requires a meticulous look at its security architecture. To achieve clearance for release, OpenAI dedicated roughly 700,000 A100e GPU hours solely to automated red-teaming GPT-5.6. This compute was allocated to discovering "universal jailbreaks": systemic attack vectors designed to bypass safeguards across varied contexts, rather than single-prompt workarounds. OpenAI says it has implemented a multi-layered safeguard stack that operates in real time, putting up intentional operational hurdles for enterprise security teams.

Model-level refusals: GPT-5.6 is tuned to reject banned cyber help, including requests that mask malicious intent or attempt jailbreak-style workarounds.
Live misuse screening: Separate cyber and biology detectors review generations while they are being produced.
Activation-based screening: For Sol and Terra, OpenAI says it is adding activation classifiers that monitor internal model signals during inference. If those systems detect a risky pattern, output streaming can pause while another safety check reviews the content. Luna does not appear to receive that same activation-classifier layer, though it is still covered by other monitoring systems.
Reasoning review pauses: When risk appears elevated, generation can stop while a larger reasoning system examines the exchange and surrounding context. If the system classifies the output as disallowed, the answer is blocked before it reaches the endpoint.

Because legitimate defensive work (such as code reviews, vulnerability discovery, patch engineering, and defensive testing) frequently utilizes the exact same code primitives as offensive exploits, OpenAI admits that its classifiers may regularly trigger false positives. The system card says OpenAI’s monitoring stack posted 94.8% overall recall on its biology evaluation set and 81.6% overall recall on its cybersecurity evaluation set. Those figures give enterprises a rare quantitative look at the safeguards, but they also show the system is not perfect and may miss some risky cases or block some legitimate work. Persistent flagging can trigger automated account-level reviews across historical conversations to evaluate if an enterprise client is engaging in malicious behavior or standard security research. OpenAI is currently negotiating longer-term enterprise safety compliance controls, including customer-operated safety overrides and privacy-preserving detection mechanisms, to insulate corporate data from manual review pipelines.

Importantly, OpenAI notes that under testing, Sol remains optimized for defensive containment rather than offensive deployment. In evaluations running against the Chromium and Firefox codebases, the model successfully isolated bugs and exploitation primitives but was unable to autonomously engineer a functional, full-chain exploit, keeping it safely below the organization's "Cyber Critical" alert threshold.
But all three GPT-5.6 models crossed its “High” cyber threshold on internal capture-the-flag testing, with Sol reaching 96.7%, Terra reaching 91.84% and Luna reaching 85.19%. That distinction matters for enterprise security buyers: OpenAI is presenting GPT-5.6 as powerful enough to help automate parts of vulnerability research and exploit analysis, but not yet as a system that can reliably run a complete advanced attack campaign without human direction under the company’s test conditions. The Geopolitics of the phased release The broader rollout of the GPT-5.6 series reflects an escalating entanglement between frontier AI labs and national security protocols. The decision to limit initial access to a small circle of vetted partners whose details are shared with the U.S. government stems from direct coordination regarding the developing cyber Executive Order framework. OpenAI has taken the unusual step of publicly critiquing this sovereign gatekeeping within its official product announcement documentation. The company states plainly: "We don’t believe this kind of government access process should become the long-term default. It keeps the best tools from users, developers, enterprises, cyber defenders, and global partners who need them." This tension highlights the precarious position of modern tech enterprises. While organizations can leverage unprecedented agentic efficiency and robust defensive patching capabilities via benchmarks like ExploitGym and ExploitBench, they must also accept that access to premier tools remains subject to diplomatic and regulatory authorization.
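As a rough illustration of the prompt-caching terms reported earlier in this piece (cache writes at 1.25x the uncached input rate, cache reads at a 90% discount), the Python sketch below estimates input-token spend for an agent loop that keeps resending the same context. It uses Sol's listed $5.00 per million input tokens. The 200,000-token context and the call counts are hypothetical, and the sketch assumes every reuse lands inside the cache lifetime and ignores output tokens and any uncached prompt suffix.

```python
# Rough input-cost model for the cache pricing described in the article.
# Assumptions (not from OpenAI): every reuse is a cache hit, and output
# tokens plus any uncached prompt suffix are ignored.

INPUT_RATE_PER_M = 5.00        # GPT-5.6 Sol uncached input, USD per 1M tokens
CACHE_WRITE_MULTIPLIER = 1.25  # first call pays a premium to populate the cache
CACHE_READ_DISCOUNT = 0.90     # later calls pay 10% of the uncached rate


def uncached_cost(context_tokens: int, calls: int) -> float:
    """Input cost when the full context is billed at the normal rate each call."""
    return calls * context_tokens / 1_000_000 * INPUT_RATE_PER_M


def cached_cost(context_tokens: int, calls: int) -> float:
    """Input cost with one cache write followed by (calls - 1) cache reads."""
    if calls <= 0:
        return 0.0
    per_call = context_tokens / 1_000_000 * INPUT_RATE_PER_M
    write = per_call * CACHE_WRITE_MULTIPLIER
    reads = (calls - 1) * per_call * (1 - CACHE_READ_DISCOUNT)
    return write + reads


CONTEXT_TOKENS = 200_000  # hypothetical shared context (e.g. a codebase summary)

for calls in (1, 2, 10):
    print(
        f"{calls:>2} calls: uncached ${uncached_cost(CONTEXT_TOKENS, calls):.2f}, "
        f"cached ${cached_cost(CONTEXT_TOKENS, calls):.2f}"
    )
```

Under these assumptions a single call costs more with caching ($1.25 versus $1.00), and the saving appears from the second call onward ($1.35 versus $2.00), which is why the pricing mainly rewards workloads that reuse a large context repeatedly.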

Industrialized factories changed how the world produced physical goods: more output, lower costs, faster than anything that came before. Now a similar shift is happening with software. LLMs have lowered the barrier to writing code, increased individual output, and pushed organizations to think about software development as a production system. The standard software development lifecycle and CI/CD practices that have held for decades won't hold up under that pressure. That's where the software factory comes in — and like physical factories, it needs more than speed to actually work. The idea of a “software factory” started to solidify over the past year. Luca Rossi's "The Era of the Software Factory" made the case plainly: AI is not just changing how fast people write code — it's changing the whole production system around software. The concept can mean different things: a collection of coding agents and skills files; faster CI/CD; better review systems; or more automation around software delivery. A better frame is to think of it less as a tool category and more as a set of principles. A software factory can't just be a loose collection of prompts, agents, and plugins. It needs a platform that defines how work moves through the system and how code is generated, reviewed, tested, traced, deployed, and improved when something goes wrong. Otherwise all you’re doing is putting yet another one-off machine into an empty room and calling it a factory. Why is this happening now? There are a few forces all hitting at the same time. Companies have always wanted more software than engineers can produce. That’s why tools like Excel exist: They often fill in the gap for a lot of the software that many companies wish they could make. AI has also lowered the barrier of entry to creating code, and this is the part everyone focuses on. Code creation is now easier, though not always cheaper or better, as evidenced by many high-profile companies fretting over their high AI bills. 
The barrier to writing functional code has effectively collapsed. More importantly, a single engineer can generate more code than they could just a few years ago. That changes the bottleneck: it’s no longer “How fast can someone write this?” or even, in some cases, “Can someone understand how to code?” Instead it becomes, “Should this be written?” More importantly, can we actually create end products that are durable and reliable and don’t just build tech debt? Or are we just putting out more AI slop faster than ever? That’s where the danger lies. The dangers of the modern software factory All of this sounds great. Factories, after all, made production faster and more consistent. They made it possible to build more cars and products, less expensively, which led to more people being able to afford cars and products. Putting environmental impacts aside, you could argue this was positive. But like many things in engineering, there are always tradeoffs, and in this case, there are new risks. When you increase the output of one person with machinery, digital or otherwise, you also increase the mistakes that can be made either by the individual or the machinery. The speed at which code can now be put out is on an industrial scale. Even smaller organizations can suddenly have code bases ballooning up to the size of tech company code bases a decade ago. The data is already showing problems. Faros AI found that while task throughput per developer is up 33.7% and PR merge rate is up 16.2%, the incidents-to-PR ratio has risen 242.7% and bugs per developer are up 54%. Google’s DORA research found that more AI adoption was actually associated with worse delivery stability. As a fractional head of data, I've been brought in to fix these exact issues. In the past year alone, I've worked on two projects where AI-generated data infrastructure slowly started to morph over time. Between multiple engineers trying to move quickly and a lack of standards, these projects became unruly. 
Code bases tend to go through some level of evolution, but as different styles blend, the LLMs in turn start to create their own mutations. Codebases developed five to six different styles within months, a process that previously took years. Layer by layer, the engineers would slowly stop understanding exactly what was going on. The pattern echoes what happened a decade ago with self-service tooling: early productivity gains that masked downstream complexity. And that’s why the software factory can’t just be about speed.

What makes a software factory work

There are several key principles to consider when building a software factory.

Platform over tools: Many teams are slowly implementing AI into their coding workflows at the edges, adding a PR review agent or a skills file into their repos. But building an actual software factory requires a platform, not a collection of tools at the edges. A platform provides a unified foundation where tools aren't scattered in separate corners. Instead, they actively share data, talk to each other, and work as a single cohesive system: standards, processes, and the work itself all connected.

Rerunnability and traceability: A real platform requires the ability to go back into any run, identify what went wrong, and rerun it, which is why one-off agents don't make a factory. The system needs to support taking a serial ID, looking it up, and tracing exactly how it got to the output it produced. This is why state machines make more sense than loops for AI workflows: they make it far easier to rerun a process and understand what happened at each step.

Safety and guardrails: Factories are not safe places. Neither is a software factory. As more people develop on these platforms, better guardrails and safety measures need to be built in. Testing and quality control need to be pushed to the front of the process: catching bugs at the lowest possible stage reduces the cost to fix them and limits the blast radius.

Standardization: At the enterprise level, every codebase has its own flavor. Layering a code assistant on top without standards produces an amalgamation of styles. Standardization has to be built into the process from the start.

Quality control: In older manufacturing models, quality control happened at the end of the line. The product was built, inspected, defects found, and fixed later. Toyota's approach was different. Quality was pushed into the process itself: workers were expected to stop the line when something was wrong. The goal wasn't to catch defects at the end; it was to prevent them from flowing downstream in the first place. The same is true for the software factory. QC needs to be baked into the entire process, starting with how the spec is written. That means integrating static code analysis that catches obvious errors and providing templates to LLMs so they know the structure the code should follow. Without that, the bottleneck becomes the final review, or teams just push out more AI slop.

Speed without quality isn't productivity

Improving the speed of your code output is not actual productivity if the downstream issues aren’t managed. A company is not more productive because it produces millions of cars, only to see them all fall apart within 100 miles. It’s also not more productive if all it does is produce an endless stream of proofs-of-concept that never enter production. Actual productivity is when the software factory takes ephemeral tokens and turns them into durable outputs. It's easy to talk about lines of code and how much faster your team is moving. The software factory that wins isn't the one that generates the most code. It's the one that generates the fewest defects downstream.
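The point about state machines, run IDs and rerunning a process can be pictured with a small Python sketch. This is one possible shape under stated assumptions, not a reference design: the state names (`spec`, `generate`, `review`, `test`, `done`), the `Run` record and the lambda steps are all hypothetical stand-ins for real agent calls.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# A step receives the current payload and returns (next_state, new_payload).
Step = Callable[[dict], Tuple[str, dict]]


@dataclass
class Run:
    run_id: str                                      # the "serial ID" to look up later
    trace: List[dict] = field(default_factory=list)  # one entry per transition


def execute(steps: Dict[str, Step], start: str, payload: dict,
            run_id: Optional[str] = None, max_steps: int = 50) -> Run:
    """Walk the state machine, recording every transition in the trace."""
    run = Run(run_id or uuid.uuid4().hex)
    state = start
    for _ in range(max_steps):
        if state == "done":
            break
        next_state, new_payload = steps[state](payload)
        run.trace.append({"state": state, "input": payload,
                          "output": new_payload, "next": next_state})
        state, payload = next_state, new_payload
    return run


def rerun_from(steps: Dict[str, Step], run: Run, index: int) -> Run:
    """Replay a recorded run starting at trace entry `index`, with its saved input."""
    entry = run.trace[index]
    return execute(steps, entry["state"], entry["input"])


# Hypothetical pipeline: each lambda stands in for an agent or tool invocation.
STEPS: Dict[str, Step] = {
    "spec": lambda p: ("generate", {**p, "spec": f"spec for {p['task']}"}),
    "generate": lambda p: ("review", {**p, "code": "def add(a, b): return a + b"}),
    "review": lambda p: ("test", {**p, "approved": True}),
    "test": lambda p: ("done", {**p, "passed": True}),
}

run = execute(STEPS, "spec", {"task": "add two numbers"})
for entry in run.trace:
    print(run.run_id, entry["state"], "->", entry["next"])

replay = rerun_from(STEPS, run, 2)  # rerun from the review step onward
print([e["state"] for e in replay.trace])
```

Because every transition is stored with its input, a failed run can be inspected step by step and restarted from the exact state that went wrong, which is harder to do when the same logic lives inside an open-ended agent loop.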

Liquid AI, founded by former MIT computer scientists, today released its smallest AI language model yet, LFM2.5-230M, and enterprises would do well to consider it for their uses in data extraction and local deployment on smartphones, laptops and robotics. This is a 230-million-parameter foundation model explicitly designed for on-device agentic workflows, and as Liquid states in its release blog post, that small size makes it possible to run nearly "anywhere." According to Liquid, it also outperforms models more than 4X its size on selected benchmarks, specifically doing better at data extraction than the 800 million parameter count Alibaba Qwen3.5-0.8B (Instruct) and 1-billion parameter Google Gemma 3 1B. The model targets developers and engineers building lightweight data extraction pipelines and autonomous edge systems. Operating under a dual-use commercial license, the model remains free for individuals and companies generating less than $10 million in annual revenue, while requiring a paid enterprise agreement for larger corporations. This release distinguishes itself from other small AI models by utilizing the LFM2 architecture to achieve high inference speeds without the massive memory overhead typical of parameter-heavy transformers. While major AI companies Anthropic, OpenAI, Google, Microsoft, Meta and others push parameter counts into the hundreds of billions or trillions to achieve frontier performance, a parallel race focuses entirely on the edge and local deployments. Liquid AI's launch of LFM2.5-230M signals a pivotal shift toward architectural efficiency over brute-force scaling. By squeezing 19 trillion tokens of pre-training into a 230-million-parameter footprint, the company demonstrates that edge devices do not need massive computational power or persistent cloud connections to execute complex, multi-step agentic workflows. 
How LFM2.5-230M works The LFM2.5-230M model diverges from standard transformer architectures, relying instead on the LFM2 framework. This architecture functions as a hybrid system, interleaving gated short-range convolutions with grouped-query attention to process information efficiently. For those tracking the evolution of efficient architectures, Liquid’s approach shares a similar conceptual goal: managing long contexts and sequential data effectively on edge hardware without the quadratic memory costs of pure attention mechanisms. The model supports an expansive 32K context window, allowing it to ingest substantial documents or continuous streams of robotic telemetry. When analyzing the performance charts provided in the release, the architectural efficiency becomes visually apparent. The model maintains a memory footprint of under 400MB while achieving prefill and decode speeds that outpace comparable models like Gemma 3 1B IT and Granite 4.0-H-350M. On a Samsung Galaxy S25 Ultra equipped with a Qualcomm Snapdragon Gen4 CPU, the model reaches a decode speed of 213 tokens per second. Even on a highly constrained Raspberry Pi 5, the model maintains a decode rate of 42 tokens per second. Furthermore, internal benchmarking shows the GPU inference stack delivers lower end-to-end latency than competing small models across all concurrency levels. Why it matters for enterprises To understand why a 230-million-parameter model is necessary, one must look at how enterprises currently manage data. Organizations have traditionally relied on rigid, rule-based Extract, Transform, Load (ETL) scripts to move and process data. However, these legacy systems are notoriously brittle; a simple change in a document's layout or a schema update can break the entire pipeline. To solve this, the industry is shifting toward "AI ETL," where machine learning infers mappings, detects schema drift, and adapts to changes automatically. 
In a modern lightweight data extraction pipeline, an AI model connects to unstructured sources—like PDFs, emails, or web forms—and structures the data into formats like JSON without requiring hardcoded rules. For enterprises, using a massive flagship model like Claude Opus 4.6 (which costs $5.00 per million input tokens) to parse routine invoices, format addresses, or route telemetry data is economically unviable. This is where models like LFM2.5-230M become critical. Designed explicitly as a lightweight extraction engine, it allows companies to automate repetitive formatting and data parsing at a fraction of the compute cost and latency, running directly on local hardware rather than relying on expensive, continuous cloud API calls. Small Model Benchmarks: LFM vs. The 3B Class The AI industry in mid-2026 is seeing a renaissance in "small" models, but the definition of "small" varies wildly. Recently, the open-weight community was stunned by Weibo's VibeThinker-3B, a 3-billion-parameter model built on a Qwen2-style backbone that achieved a massive 94.3 on the AIME 2026 math benchmark, rivaling 600-billion-parameter behemoths through aggressive data curation and reinforcement learning. Similarly, Google's Gemma 4 family — which recently crossed 200 million downloads — pushes frontier AI to the edge, including the E2B (2 billion parameters) designed specifically for mobile and IoT deployments. By contrast, Liquid AI's LFM2.5-230M operates in a completely different weight class. At just 230 million parameters, it is roughly one-tenth the size of Google's smallest Gemma 4 model and VibeThinker-3B. Because of its microscopic footprint, LFM2.5-230M is not designed to compete on reasoning-heavy workloads like advanced math, coding, or creative writing—a constraint Liquid AI explicitly acknowledges. However, in its intended domains of data extraction and tool calling, the model punches well above its weight class. 
Benchmarks released by Liquid AI show LFM2.5-230M scoring 43.26 on the BFCLv3 tool-use benchmark, dominating IBM's Granite 4.0-350M (39.58) and completely outpacing larger 1-billion-parameter models like Google's Gemma 3 1B IT (16.61). On CaseReportBench for data extraction, it scores 22.51, decimating the Qwen3.5-0.8B (Instruct). LFM2.5-230M proves that while 3-billion-parameter models like VibeThinker are solving advanced calculus, a 230-million-parameter model is the superior, highly optimized choice for executing structured tool calls and keeping agentic pipelines running efficiently on constrained hardware. Advanced research uses Because it excels at tool calling, LFM2.5-230M functions primarily as a skill-selection layer. Liquid AI demonstrated this capability by deploying the model on a Unitree G1 humanoid robot. Running entirely on-device via the robot's onboard NVIDIA Jetson Orin compute module, the model successfully processes complex environmental commands. As noted in the company's technical blog, the model takes a free-form instruction like, *"Hold still for 2 seconds, then walk forward at 1 meter per second for 3 meters, hold a forward one-leg kneel for 5 seconds, and walk backward at 0.5 meters per second for 3 meters,"* and automatically translates it into a structured multi-step plan calling on pre-trained low-level skills provided by NVIDIA's SONIC framework. The base and post-trained models are available immediately on Hugging Face, with native day-one support across the inference ecosystem for llama.cpp (GGUF), MLX, vLLM, SGLang, and ONNX. Dual-use, custom LFM Open License Liquid AI ships LFM2.5-230M under the LFM Open License v1.0. Despite the word "open" in the title, this is not an Open Source Initiative (OSI) compliant license; it operates as a restricted, dual-use commercial framework. For independent developers, researchers, and early-stage startups, the license functions identically to open-source software. 
Users receive a perpetual, worldwide, royalty-free license to reproduce, modify, and distribute the model, provided they retain original copyright notices and prominently state any modifications. However, the license includes a strict "Commercial Use Limitation". Any legal entity generating $10 million or more in annual revenue loses the right to use the model commercially under this agreement. Large enterprises crossing this financial threshold must negotiate a separate, paid commercial agreement with Liquid AI to deploy the model in production. This strategy protects the company from having its intellectual property absorbed by major technology conglomerates for free, while still seeding the model at the grassroots developer level.
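To picture the skill-selection role described in the robotics example, the Python sketch below validates a structured plan of the kind a small tool-calling model might emit for the quoted instruction. Everything specific here is an assumption made for illustration: the skill names, their parameters and the JSON shape are hypothetical and do not reflect the actual SONIC skill interface or Liquid AI's output format.

```python
import json
from typing import Dict, List, Set

# Hypothetical skill registry: names and parameters are illustrative only.
SKILLS: Dict[str, Set[str]] = {
    "hold_still": {"duration_s"},
    "walk": {"direction", "speed_mps", "distance_m"},
    "kneel": {"duration_s"},
}


def validate_plan(raw_json: str) -> List[dict]:
    """Parse a model-emitted plan and reject unknown skills or wrong arguments."""
    plan = json.loads(raw_json)
    if not isinstance(plan, list):
        raise ValueError("plan must be a JSON list of steps")
    for i, step in enumerate(plan):
        if not isinstance(step, dict):
            raise ValueError(f"step {i}: expected an object")
        name = step.get("skill")
        if name not in SKILLS:
            raise ValueError(f"step {i}: unknown skill {name!r}")
        args = step.get("args", {})
        if set(args) != SKILLS[name]:
            raise ValueError(f"step {i}: {name} expects {sorted(SKILLS[name])}")
    return plan


# Hand-written stand-in for model output covering the instruction in the article.
MODEL_OUTPUT = """[
  {"skill": "hold_still", "args": {"duration_s": 2}},
  {"skill": "walk", "args": {"direction": "forward", "speed_mps": 1.0, "distance_m": 3}},
  {"skill": "kneel", "args": {"duration_s": 5}},
  {"skill": "walk", "args": {"direction": "backward", "speed_mps": 0.5, "distance_m": 3}}
]"""

plan = validate_plan(MODEL_OUTPUT)
print(len(plan), "validated steps")
```

A deterministic check like this is one way to keep a small model's structured output from reaching low-level controllers unverified.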

OpenAI has made a significant update to its most widely used language model, GPT-5.5 Instant, which is the default in the free version of ChatGPT. The company announced the upgraded version of GPT-5.5 Instant yesterday on X, calling it "much more fun to talk to" and saying it is "better at understanding the intent behind a question and adapting its response accordingly," as well as offering improvements in shopping results, local recommendations, and handling "complex constraints." However, it has not yet provided any benchmarks or numerical results to quantify these claims. The company said the updated GPT-5.5 Instant was rolling out first to paid ChatGPT subscribers and then to free users as of today, June 25. OpenAI also updated its chat-latest API alias, which points to the latest GPT-5.5 Instant model currently used in ChatGPT, while continuing to recommend the separate gpt-5.5 model for production API usage. That distinction matters, but it should not obscure the main news: this is primarily a ChatGPT-side update to GPT-5.5 Instant, not a new release of the broader GPT-5.5 API model family. Let's dig into what's changed...

Origins of GPT-5.5 Instant, and why OpenAI updated it less than two months later

GPT-5.5 Instant was first unveiled in early May 2026, just under two months ago, to replace the aging GPT-5.3 Instant engine as the baseline default model for ChatGPT users. Developed as a fast, high-throughput variant of OpenAI’s core flagship model family, the initial spring release focused heavily on correcting systemic factuality deficits. Internal benchmarks from that spring deployment reported a 52.5% reduction in hallucinated claims compared to GPT-5.3 Instant on high-stakes medical, legal, and financial prompts, alongside a 37.3% drop in factual error rates on user-flagged historical conversations. Independent evaluators noted that its predecessor, GPT-5.3 Instant, had struggled in public rankings, placing 44th overall in Arena benchmarks.
That gave the May rollout a clear purpose: OpenAI needed a stronger default model for everyday ChatGPT interactions, not just a more capable frontier model for advanced users. Stylistically, the initial spring model introduced a sharper conversational baseline, demonstrating a 30.2% reduction in word count and a 29.2% drop in line usage over typical advice prompts. However, the spring deployment also introduced an operational fault line for enterprise software systems: a feature known as "memory sources." Designed to grant users visibility into the specific past chats, files, and connected Gmail accounts shaping a personalized answer, memory sources introduced a loose, model-reported observability layer. As reported by VentureBeat, these internal summaries frequently clashed with the deterministic logs of localized vector databases and enterprise Retrieval-Augmented Generation (RAG) pipelines. The resulting friction created dual, competing context records, making it difficult for administrators to reconcile what the model claimed it referenced against what it actually accessed in production. The June 24 update does not appear to expand memory sources directly. Instead, it focuses on making GPT-5.5 Instant better at understanding user intent, carrying context across turns, following multi-part instructions, and producing more useful shopping and local recommendations. A smarter, more 'fun' ChatGPT for consumers For everyday users of ChatGPT, the most noticeable change in GPT-5.5 Instant will be the model’s improved intent recognition. According to OpenAI’s latest release notes, GPT-5.5 Instant has improved at identifying the underlying goal behind a user's question, particularly in decision-support scenarios like planning, shopping, asking for advice, researching options and comparing local choices. 
Historically, large language models have struggled when given prompts with multiple overlapping constraints — often dropping one or two requirements in favor of a generalized response. The updated GPT-5.5 Instant handles these complex instructions more reliably. When users push back on an answer, clarify their meaning, or introduce new constraints mid-conversation, the model should adapt dynamically rather than stubbornly repeating its original approach. This contextual awareness extends heavily into commerce and local recommendations. GPT-5.5 Instant now makes better use of location context to surface nearby options, weaving together product recommendations, business information, and relevant images into a more cohesive output when those elements are useful. Furthermore, OpenAI notes that the stylistic formatting of these responses is less rigidly templated, trading robotic lists for a more intentionally designed, warmer and restrained conversational tone. Developers can test the latest Instant behavior through chat-latest For the developer ecosystem, the June 24 GPT-5.5 Instant update is accessible through OpenAI’s updated chat-latest API alias. chat-latest is not the same thing as the production gpt-5.5 model slug. OpenAI says chat-latest points to the latest Instant model currently used in ChatGPT, and it recommends the separate gpt-5.5 model for production API usage. Developers can use chat-latest to test the newest ChatGPT-style improvements, while using gpt-5.5 when they need a stable production target. The current chat-latest model page lists a 400,000-token context window and support for up to 128,000 maximum output tokens. Its knowledge cutoff is Aug. 31, 2025. On pricing, chat-latest uses the same $5.00 per 1 million input tokens and $30.00 per 1 million output tokens listed on its model page. 
Cached inputs cost $0.50 per 1 million tokens, a 90% discount that strongly incentivizes developers to optimize prompts by placing static instructions first and dynamic data later. The model supports text and image input, text output, streaming, function calling and structured outputs. Through the Responses API, the chat-latest page also lists support for web search, file search, image generation, code interpreter and MCP. The practical takeaway is simple: chat-latest gives developers access to the updated Instant-style behavior, but OpenAI is still steering production API builders toward the separate gpt-5.5 model. The broader GPT-5.5 API model includes a larger feature set and different production profile, but that is not the main focus of this update. Why this matters for enterprise AI teams For enterprises, the June 24 GPT-5.5 Instant update lands at the intersection of two related but distinct trends: better default user experience in ChatGPT, and more reliable orchestration behavior in the API. The consumer-facing changes make ChatGPT more useful for everyday decision-making. Users should see better handling of messy, real-world requests: planning a trip with several constraints, comparing products, finding nearby businesses, or adjusting a recommendation after adding a new requirement. The enterprise relevance is less about a new technical architecture and more about default behavior. A model that better infers intent, preserves context across turns and follows multi-part constraints can make ChatGPT more reliable for employees using it for research, planning, purchasing decisions, customer-facing drafts and internal analysis. But enterprises should remain careful about observability. Memory sources can help users understand why ChatGPT personalized an answer, but they do not provide a complete audit trail. 
Organizations that already rely on RAG pipelines, vector databases, orchestration logs and internal agent traces should define which record acts as the source of truth when a model’s visible memory sources do not fully match the system’s own logs. What’s next? The release of GPT-5.5 Instant and the updated chat-latest alias signals a maturation in how generative models are deployed. OpenAI is moving away from models that require heavy hand-holding and toward systems that can better infer the user’s goal, preserve constraints and adapt across multiple turns. Whether it is a consumer planning a complex multi-city vacation in ChatGPT, or a developer orchestrating a codebase-navigating agent through the API, GPT-5.5 represents a faster, smarter and more capable baseline for the future of AI workflows. The most important takeaway for developers is also the simplest: GPT-5.5 Instant, chat-latest and gpt-5.5 are related, but they are not the same product surface. GPT-5.5 Instant is the ChatGPT model users experience directly. chat-latest is a moving alias for testing the latest Instant behavior through the API. gpt-5.5 is the production model OpenAI recommends for developers building stable applications.
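For developers weighing the chat-latest pricing described above, the cached-input discount can be checked with a few lines of arithmetic. The sketch below is only an illustration: the three per-million-token prices are the ones quoted from the chat-latest model page, while the function name `estimate_cost`, the `cached_tokens` split and the example token counts are assumptions made up for this example, not part of any OpenAI SDK.

```python
# Prices in USD per 1 million tokens, as quoted in the article for chat-latest.
INPUT_PER_M = 5.00
CACHED_INPUT_PER_M = 0.50
OUTPUT_PER_M = 30.00


def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    cached_tokens is the part of input_tokens billed at the cached rate
    (a hypothetical split chosen for illustration).
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return round(total / 1_000_000, 6)


# A 20,000-token prompt whose first 18,000 tokens are static instructions,
# followed by a 1,000-token answer.
uncached = estimate_cost(20_000, 1_000)
cached = estimate_cost(20_000, 1_000, cached_tokens=18_000)
print(f"uncached: ${uncached:.4f}, cached: ${cached:.4f}")
```

Under these assumed token counts, most of the saving comes from the static prefix, which is why placing stable instructions first and dynamic data later matters.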

AI agent orchestration platforms are popping up like weeds these days, but London-based AI transformation startup Mindstone's Rebel might be among the most promising I've come across. That's because the system, which officially launched this week, is a local-first, agentic AI operating system distributed under a "Fair Source" license, allowing teams of up to 100 users to freely adopt and customize it to suit their needs, while organizations with more users must pay for an enterprise license. The marquee features are its simplicity and its extensive customizability to fit any given team, no matter how unique or specific the workflows. Everything is built around markdown, the common, open plain-text file format. The result is an organizational memory layer that ensures agents reliably use the enterprise's preferred AI models for each task or even subtask, dynamically switching between local and cloud models in a predictable, visible way to save costs and maintain data privacy and security as needed. "Shared memory is the most empowering thing you could possibly do with a knowledge-worker AI," said Greg Detre, chief technology officer (CTO) of Mindstone, in a recent video call interview with VentureBeat. "You get this feeling of being a super-organism as a company that just gets smarter and smarter." Rebel is available now for macOS on Intel and Apple Silicon machines, as well as Windows, with Linux support in development. Mindstone has raised $5 million from private investors including Pearson Ventures, Moonfire Ventures and Zanichelli Venture. A distinctive, local-first architecture based on markdown files What makes Rebel distinctive is its local-first architecture. 
Instead of the approach found in developer-heavy agent frameworks such as LangGraph, CrewAI and AutoGPT, which require teams to wire together databases, cloud infrastructure and state-management logic, Rebel's core agent memory and instructions live across local markdown (.md) text files, arguably the simplest and most popular way to steer AI agents, and one that has been widely adopted by AI developers and power users around the globe. Mindstone says Rebel stores its state, prompts, task instructions and memory hierarchy in these files, allowing users and companies to easily inspect, move or modify them as needed. A primary configuration file, agents.md, acts as the agent’s core instruction layer and runtime boundary. That architectural choice is partly about cost. Mindstone argues that common office formats such as Word documents and PDFs often carry formatting and metadata overhead that consumes model token context and raises API costs. Markdown keeps the information closer to raw text, allowing more of the model’s context window to be spent on the actual task rather than document structure. The company also positions the approach as a hedge against vendor lock-in. If a company’s agent instructions, automations and memory are stored locally as text files, they are not trapped inside one SaaS provider’s interface or database. That matters more as enterprises begin giving AI systems broader access to email, calendars, documents and internal workflows. Rebel also lets users create repeatable AI workflows. “Skills” are saved multi-step procedures an agent can reuse. “Operators” adjust how the agent behaves for a given task, such as reviewing a pitch deck from an investor’s perspective or evaluating work through a security lens. “Automations” can run scheduled background tasks, such as scanning messages or files, finding relevant updates, drafting responses, or preparing work before an employee opens the app. 
Automatically selecting the best, enterprise-preferred AI model for every task (and subtask) Another important feature is multi-model orchestration. Rebel can break a task into parts and route different steps to different models, including splitting between local and cloud-based ones depending on the sensitivity of the information or as guided by enterprise policies. A more powerful model can handle planning or complex reasoning; a cheaper model can handle routine work; a local model can handle sensitive steps or approval checks. This matters for enterprises that want flexibility or are seeking cost controls: not every task needs to be sent to the same expensive cloud model, and some enterprise workflows prohibit sensitive corporate data leaving local infrastructure. “I want to be able to say, ‘Help me with this,’ and it knows what’s personal, what’s sensitive, and what can be shared with the whole company,” Detre explained. That model-agnostic setup gives companies more control over cost and security. Data-heavy work can run on lower-cost models such as Llama or DeepSeek. Higher-level reasoning can be reserved for more expensive models. Sensitive work can be routed through a local model running on the user’s machine, keeping that information from leaving the device. This approach also gives enterprise teams a way to mix cloud and local inference without treating the choice as all-or-nothing. By shifting away from centralized, monolithic cloud interfaces toward a local file-driven architecture, Mindstone is introducing a model for how enterprise technical decision-makers can orchestrate autonomous workflows without forfeiting data sovereignty or predictability. How it works in practice Mindstone CTO Greg Detre designed Rebel’s memory system to avoid a common problem in enterprise AI: dumping large amounts of company information into a database and hoping search will retrieve the right context later. Instead, Rebel uses a tiered memory structure. 
When an interaction happens, the system estimates how likely that information is to be useful again. Information with a high expected value is written into a local readme.md file tied to a specific project space. Information with a moderate expected value becomes a reference link back to deeper historical records. Lower-priority material is stored in an indexed memory directory, where it remains available but dormant until a relevant task calls it back. An ROI dashboard for enterprise buyers For larger organizations, Mindstone Pro adds an Impact Dashboard designed to show where Rebel is saving time and money across business units. Mindstone says the dashboard uses a separate, closed LLM to evaluate telemetry and calculate business impact. The company says the system is calibrated conservatively, using the lower end of estimated performance gains to avoid inflated productivity claims. That feature speaks to a practical problem for enterprise AI buyers: proving value without over-surveilling employees. Mindstone says the dashboard is isolated from individual workspaces, allowing IT and business leaders to evaluate adoption and return on investment without reading employees’ private agent activity. Fair Source licensing aims to reduce platform risk Mindstone is releasing Rebel under a Fair Source license, a model meant to sit between fully closed SaaS and permissive open source. Under the license, Rebel’s code is viewable, auditable, modifiable and deployable. Individuals and organizations with up to 100 concurrent users can run it for free. Once an organization exceeds that threshold, it needs a commercial Mindstone Pro license. The license also includes a two-year sunset clause. Twenty-four months after a given version is released, that version automatically converts to the MIT open-source license. For enterprise buyers, the practical pitch is that Rebel reduces the risk of being trapped. 
If every automation, memory file and agent instruction is stored locally in markdown, a company can move its data and workflows elsewhere if needed. The product may be commercial, but the underlying work is designed to remain inspectable and portable. Security questions focus on local approvals and shared memory Rebel’s debut this week on Product Hunt, the product-sharing platform, prompted technical questions about how a local-first agent should handle permissions, safety checks and shared memory. One developer, Nikita Pokryschko, asked whether approval checks for sensitive actions could run entirely on a local model, or whether the gating logic still required a cloud call. Detre responded by explaining Rebel’s separation between planning, execution and background safety logic. Mindstone CEO Joshua Wöhle added that companies can configure Rebel to rely entirely on a local model for gating decisions. That distinction matters for corporate security teams. Autonomous agents often need broad permissions to read files, draft emails or interact with internal systems. If the final approval layer depends on an external cloud model, some companies may see that as a compliance risk. Mindstone is arguing that Rebel can keep those approval boundaries local. A second discussion focused on how Rebel decides what memory can be shared. Product developer Clement Morel asked whether shareability is determined by content, user settings or learned behavior, and what happens if the system gets it wrong. Detre said Rebel uses the user’s local “Chief-of-staff README” and defined spaces to separate private, team and company-wide information. When the agent encounters ambiguous context, the system pauses and asks the user for approval before proceeding. That emphasis on visibility is part of Mindstone’s broader argument against opaque agent systems. 
As CEO Joshua Wöhle put it in a post on his LinkedIn account: “If an agent is going to sit inside your workspace, remember your context, and ask permission before changing the world, you should be able to see how it works. Not because everyone will read the code, but because someone can.” Mindstone points to customer rollout as early proof Mindstone says Rebel has already been deployed across the 250-person workforce of customer Epignosis, covering sales, engineering, product, finance and customer success teams. "The entire organization is operating on Rebel today," Wöhle told VentureBeat. Over a 12-week deployment, Mindstone says Epignosis recaptured the equivalent capacity of eight full-time roles. The company says adoption spread organically after employees saw colleagues automate time-consuming work, a pattern employees reportedly called the “potatoes effect.” The Epignosis case is central to Mindstone’s argument that enterprise AI should not be treated as a set of isolated personal tools. Rebel’s shared-memory design is meant to let workflows move across teams and improve as more employees use them. “The border between learning and doing is fading out - and that changes everything about how you scale,” Epignosis CEO Dimitris Tsingos said in a statement provided to VentureBeat by Mindstone. Background on Mindstone Mindstone Learning Limited, headquartered in London, launched in 2020 under the direction of CEO Joshua Wöhle, previously a co-founder of the digital child safety firm SuperAwesome. Originally positioned in the consumer education technology market, the company built a digital curation tool likened to a "Spotify for learning" that utilized compound learning methodologies. However, following the widespread commercialization of generative artificial intelligence platforms between 2022 and 2024, Mindstone moved into business-to-business enterprise enablement. 
Leadership identified a critical "last-mile" barrier: while AI tools promised substantial productivity gains, traditional corporate training failed to equip the workforce to practically integrate them into daily operations. Today, Mindstone functions as a comprehensive enterprise software and training ecosystem designed to maximize corporate return on investment for existing AI licenses. The product architecture systematically addresses different organizational tiers through highly contextualized, "live-fire" software applications rather than abstract slide presentations. Financially, Mindstone utilizes a hybrid capitalization strategy that interweaves institutional venture capital from entities like Moonfire Ventures and Pearson Ventures with community-based equity crowdfunding on platforms such as Seedrs and Crowdcube. Mindstone has successfully penetrated the enterprise market, securing commercial contracts with blue-chip corporations including The Home Depot, Hyatt Hotels Corporation, Pearson, and Ernst & Young. Ultimately, Mindstone positions itself as the crucial antidote to corporate inertia, ensuring organizations establish the internal competency required to execute successful AI transformations. Mindstone’s bet: enterprise AI needs shared memory, not more seats Rebel arrives as companies are trying to move from AI experimentation to AI operations. The first wave of enterprise adoption centered on access: giving employees chatbots, copilots and model subscriptions. Mindstone is betting the next wave will center on coordination. That means shared memory, reusable workflows, local control, flexible model routing and measurable business impact. It also means giving enterprises a way to inspect the systems they are being asked to trust. The company’s challenge now is execution. Local-first software can be harder to manage than cloud SaaS. Shared memory raises governance questions. Multi-model routing adds complexity. 
And enterprises will still need proof that agentic workflows can deliver reliable productivity gains without creating security or compliance headaches. But Mindstone is making a clear argument: buying AI seats is not the same as building AI infrastructure. Rebel is its attempt to turn scattered employee experiments into an operating layer for work.
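The tiered memory structure described earlier in this article (high-value information written to a project readme.md, moderate-value information kept as a reference link, and lower-priority material parked in an indexed directory) can be pictured with a small sketch. This is not Mindstone's code: the `MemoryStore` class, the `file_memory` function and the 0.7 and 0.3 cut-offs are all hypothetical, since the article does not say how Rebel scores expected value or where it draws the lines.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """In-memory stand-ins for the three tiers described in the article."""
    readme: list[str] = field(default_factory=list)      # project-level readme.md
    references: list[str] = field(default_factory=list)  # links to deeper records
    index: list[str] = field(default_factory=list)       # dormant indexed directory


# Hypothetical thresholds, chosen only to make the example concrete.
HIGH_VALUE = 0.7
MODERATE_VALUE = 0.3


def file_memory(store: MemoryStore, note: str, expected_value: float) -> str:
    """Place a note in a tier based on its estimated chance of being reused."""
    if expected_value >= HIGH_VALUE:
        store.readme.append(note)
        return "readme"
    if expected_value >= MODERATE_VALUE:
        store.references.append(note)
        return "reference"
    store.index.append(note)
    return "index"


store = MemoryStore()
tiers = [
    file_memory(store, "Client prefers weekly status emails", 0.9),
    file_memory(store, "Q2 pricing discussion transcript", 0.5),
    file_memory(store, "One-off lunch scheduling thread", 0.1),
]
print(tiers)
```

The point of the sketch is the routing decision itself: only the small, high-value tier is loaded by default, while everything else stays retrievable without crowding the agent's context.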

Mistral AI on Tuesday released OCR 4, a document intelligence model that moves beyond raw text extraction to return structured representations of entire documents — complete with bounding boxes, block-type classification, and per-word confidence scores. The release marks Mistral's fourth generation of optical character recognition technology in roughly 15 months and lands at a moment when the company's pitch for European AI sovereignty has never been more commercially relevant. The model supports 170 languages across 10 language groups, accepts PDF, DOC, PPT, and OpenDocument formats, and can be deployed as a single container on an organization's own infrastructure — a capability Mistral is positioning directly at enterprises in regulated industries that cannot route sensitive documents through U.S.-jurisdiction cloud APIs. "Mistral OCR 4 extracts and structures content from a wide range of documents," the company said in its announcement. "Where previous generations focused on converting a page into clean text and tables, OCR 4 returns a structured representation of the document." The model is available immediately through the Mistral API, Document AI in Mistral Studio, Amazon SageMaker, and Microsoft Foundry, with Snowflake Parse Document support coming soon. Pricing starts at $4 per 1,000 pages, dropping to $2 per 1,000 pages through a batch API discount. OCR 4 treats every document as a semantic map, not a wall of text The central engineering shift in OCR 4 is structural. Rather than outputting a flat stream of extracted text — the paradigm that has defined OCR for decades — the model returns a layered representation in which every block is localized with a bounding box, classified by type (title, table, equation, signature, and others), and scored for confidence at both the page and word level. Mistral says bounding boxes were its most-requested capability. 
The reason is straightforward: without location data, downstream systems cannot trace an extracted fact back to its source on a specific page. That traceability gap has been a persistent friction point for enterprises building retrieval-augmented generation (RAG) pipelines, compliance workflows, or any application where "where did this number come from?" is a question that needs an auditable answer. Block classification addresses a related problem. A paragraph tagged as a "title" can segment a document into hierarchical chunks for semantic search. A block tagged as a "table" can be routed to a structured-data pipeline rather than a text summarizer. A block tagged as a "signature" can trigger a redaction workflow in a compliance system. These are not novel ideas in isolation, but packaging them as first-class outputs of the OCR model itself — rather than requiring a separate layout-analysis stage — removes an integration layer that enterprise teams have historically had to build and maintain themselves. The confidence scores serve a dual purpose. At scale, they allow organizations to programmatically route low-confidence regions to human reviewers and auto-approve high-confidence extractions, building what the industry calls human-in-the-loop verification without requiring a person to review every page of every document. In production systems, OCR is rarely the end goal — it is the first step in a larger pipeline. Developers building RAG systems, agent workflows, or document automation often spend more time reconstructing layout and structure than on the downstream AI logic itself. OCR 4 aims to eliminate that reconstruction step, and if it delivers on that promise, the value accrues not just in OCR cost savings but in reduced engineering hours across the entire document pipeline. 
Independent reviewers preferred Mistral's output 72 percent of the time, but benchmarks tell a complicated story Mistral reports that OCR 4 achieved a 72% average win rate in a head-to-head human evaluation against leading competitors, conducted by independent annotators across more than 600 real-world documents in over 12 languages. The model also achieved the top overall score on OlmOCRBench at 85.20 and scored 93.07 on OmniDocBench. But the company itself urges caution in interpreting those numbers. In its release, Mistral took the unusual step of auditing and publicly disclosing the specific types of scoring artifacts it encountered, including ground-truth errors in the reference annotations, equivalent LaTeX notation scored as mismatches, column-reading-order assumptions, and header/footer attribution issues. "We therefore treat the aggregate score as directional rather than definitive," the company said — a notably transparent stance from a vendor announcing a product. That transparency is well-timed. On the public OlmOCRBench leaderboard, some researchers have noted that OCR 4 currently ranks third, behind open models like Chandra OCR 2. And some open-weight models self-report higher OmniDocBench composite scores — PaddleOCR-VL-1.6 claims 96.33 — though those results have not been independently reproduced on the public leaderboard. Early enterprise feedback has been favorable nonetheless. Aidan Donohue, an AI engineer at financial AI firm Rogo, said the company benchmarked OCR 4 against leading agentic document parsers on a chart-dense financial QA dataset and "reached equivalent accuracy at roughly 8x lower cost and 17x lower latency." Ivan Mihailov, an AI engineer at intellectual property management firm Anaqua, said OCR 4 is "roughly 4x faster per page than our incumbent provider." Enterprise buyers, however, should run their own evaluations rather than relying on any vendor's benchmark numbers. 
The practical question is not which model scores highest on a leaderboard, but which model produces the fewest errors on your specific documents, in your specific languages, at a price and latency that fit your workflow. The Anthropic export ban gave Mistral's sovereignty pitch the proof point it needed Mistral's release lands in a geopolitical context that could hardly be more favorable for its strategic positioning. On June 12, Anthropic was forced to disable all access to its newest AI models, Fable 5 and Mythos 5, after the U.S. Commerce Department used national security export controls to bar the company from distributing the models to any foreign national. Enterprise clients in finance, healthcare, SaaS, and critical infrastructure found their core intelligence services abruptly disabled, without prior warning or effective recourse. As of June 24, both models remain offline, with prediction markets giving only 57% odds of restoration before July 1. That episode validated a warning Mistral CEO Arthur Mensch has been sounding for over a year. As Business Insider reported, Mensch warned at London Tech Week in June 2025 about American AI companies "having the keys" for their models, calling it a scenario where European companies are "giving leverage to their providers." He added: "At some point, you need to be able to turn it off or turn it on, and you don't want to leave it to another country." The argument gained further urgency as Mensch's broader sovereignty pitch escalated in recent months. As reported by CNBC in late May, Mensch told the outlet: "Europe is lagging behind when it comes to [the] buildout of infrastructure, and so we are investing to close that gap." At the same time, Mensch pushed back against Pope Leo XIV's call for AI to be "disarmed," arguing that Europe cannot afford to fall behind U.S. tech giants. 
"We're all for peace, but if you look at our rivals and adversaries in the world, they're using artificial intelligence … we do need to have our own capabilities," Mensch told reporters. OCR 4's single-container, self-hosted deployment model is the product-level expression of that argument. A U.S.-headquartered provider offering EU data residency means documents are stored in Frankfurt but governed by U.S. law. Mistral, incorporated in France and operating under EU jurisdiction, offering on-premise containerized deployment, means documents never leave the customer's infrastructure at all. The EU AI Act's fine enforcement provisions take effect August 2, adding regulatory pressure to the compliance calculus for European enterprises evaluating document AI vendors. Baidu's free, open-weight OCR model arrived one day earlier — and the contrast is revealing Mistral's release did not arrive in isolation. Just one day before OCR 4 launched, Baidu shipped Unlimited-OCR on June 22 — a 3-billion-parameter MIT-licensed model that tackles one of the most persistent pain points in document AI: parsing entire PDFs and multi-page scans in a single forward pass, without chunking the input or stitching the output back together afterward. Baidu's model uses a technique called Reference Sliding Window Attention (R-SWA) that, as a top Hacker News commenter explained, splits the AI's focus into two paths: maintaining full attention on the original document image while restricting memory of generated text to a tight, moving window. The result is constant KV cache size and the ability to transcribe 40-plus pages in a single forward pass. The model gathered 1,800 GitHub stars in its first 24 hours and racked up more than 479 upvotes on Hacker News, where the discussion thread ran to 109 comments. 
The two releases frame what some analysts are calling the June 2026 document-AI split: self-hosted long-horizon parsing with open weights versus structured managed extraction with enterprise features. Baidu's model is free under an MIT license, runs on standard GPU hardware, and has no managed API or enterprise SLA. Mistral's model is a commercial product with per-page pricing, bounding boxes, confidence scores, block classification, multi-platform distribution, and self-hosted deployment options for enterprise customers. Unlimited-OCR may be the better tool for a research team digitizing scanned dissertations on a single GPU. OCR 4 is built for the IT procurement process — the world of SLAs, data processing agreements, and compliance audits. Beyond Baidu, the broader OCR competitive field includes Google Document AI, Amazon Textract, Azure Document Intelligence, ABBYY Vantage, and a growing number of open-weight models. On the Hacker News thread for Unlimited-OCR, practitioners offered a candid assessment of the state of the art. Joss82, who has worked on document parsing for 10 years, wrote bluntly: "OCR still sucks in 2026." Meanwhile, one user named SyneRyder reported success with Claude for OCR of hundreds of pages of handwritten documents, noting the model delivered results with "no corrections required" and even pointed out a continuity error in the source text. These practitioner reports underscore a key tension in the market: performance varies wildly depending on the specific document type, language, and quality of the source material. The real play is not OCR — it is an enterprise AI stack with document intelligence as the on-ramp Step back far enough, and Mistral's OCR 4 release is not really an OCR story. It is an enterprise go-to-market story built on top of a $4.4 billion global intelligent document processing market that is forecast to grow at a 33.1% compound annual growth rate through 2030, according to Grand View Research. 
For Mistral, OCR is a wedge into enterprise AI budgets. The model feeds directly into Mistral's Search Toolkit, the company's open-source composable search framework announced at the AI Now Summit. In that architecture, OCR 4 serves as the ingestion layer for retrieval-augmented generation and enterprise search pipelines, converting raw documents into citation-ready, structurally classified input. The logic is clear: once an enterprise adopts OCR 4 for document extraction, Mistral's broader model suite — including Medium 3.5 for reasoning and the Vibe agentic platform for task execution — becomes the natural next step in the stack. That pipeline ambition is critical context for understanding Mistral's current fundraising trajectory. Bloomberg recently reported that the company is in early discussions to raise about €3 billion ($3.5 billion) at a valuation of roughly €20 billion — nearly double the €11.7 billion valuation from its September Series C round. To date, Mistral has raised only about $4 billion, a fraction of what its largest U.S. rivals have taken in. OCR 4 and its associated enterprise revenue pipeline are part of how the company plans to justify that higher valuation, with Mistral targeting €1 billion in revenue for 2026, up from €200 million in 2025, according to Le Monde. Mistral is a company with roughly 1,000 employees and ambitions to compete with labs that have raised 40 times as much capital. It cannot win a general-purpose model arms race against OpenAI and Anthropic. What it can do is build a differentiated enterprise stack around sovereignty, structured document intelligence, and agentic workflows — and use that stack to capture European enterprise budgets that are increasingly wary of U.S. provider dependency. 
The pricing structure reinforces that strategy: at $2 per 1,000 pages in batch mode, the cost of processing a 100,000-page corporate archive falls to $200, making large-scale digitization projects economically viable in ways they may not have been with token-based vision-language model pricing. Whether Mistral can execute that vision at scale — against Google, Amazon, Microsoft, and a surging open-source ecosystem — remains an open question. But the Anthropic export control crisis is still unresolved, European data sovereignty regulations are tightening, and a potential €20 billion funding round is on the horizon. The company is holding an OCR 4 production webinar on July 7 at 6:00 PM CET. Two weeks ago, the argument for building AI infrastructure outside the reach of U.S. export controls was theoretical. Then the U.S. government flipped a switch, and Anthropic's most advanced models went dark for every non-American on the planet. Mistral did not cause that crisis — but it spent the last year building the product that makes it matter.
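Two of the practical points in this article, the batch-price arithmetic and the use of confidence scores for human-in-the-loop review, are easy to make concrete. The sketch below is an assumption-laden illustration, not Mistral's API: the block dictionaries with `type`, `bbox` and `confidence` keys are a hypothetical shape invented here, the 0.9 threshold is arbitrary, and only the $2 per 1,000 pages batch rate comes from the article.

```python
from typing import Iterable

BATCH_PRICE_PER_1K_PAGES = 2.00  # USD, batch rate quoted in the article


def batch_cost(pages: int) -> float:
    """Return the USD cost of processing `pages` pages at the batch rate."""
    return pages / 1_000 * BATCH_PRICE_PER_1K_PAGES


def route_blocks(blocks: Iterable[dict], threshold: float = 0.9) -> tuple[list[dict], list[dict]]:
    """Split extracted blocks into auto-approved and needs-human-review lists."""
    approved, review = [], []
    for block in blocks:
        target = approved if block["confidence"] >= threshold else review
        target.append(block)
    return approved, review


# Hypothetical blocks; real OCR 4 output may be structured differently.
sample_blocks = [
    {"type": "title", "bbox": (40, 30, 560, 70), "confidence": 0.99},
    {"type": "table", "bbox": (40, 120, 560, 400), "confidence": 0.95},
    {"type": "signature", "bbox": (380, 700, 560, 760), "confidence": 0.62},
]

approved, review = route_blocks(sample_blocks)
print(f"archive cost: ${batch_cost(100_000):.2f}")
print("needs review:", [block["type"] for block in review])
```

In a real pipeline the bounding box on each block sent for review is what lets a human reviewer jump straight to the right region of the page.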

Alibaba's Qwen team released Qwen-AgentWorld on Tuesday — two models trained not to act inside agent environments, but to predict what those environments return. The release covers seven domains under a single architecture: MCP, Search, Terminal, Software Engineering, Android, Web, and OS. The release extends Alibaba's recent push into autonomous agents. Qwen3.7-Max, released in May, was built around a 35-hour autonomous execution capability. That shift targets a ceiling teams training agents at scale run into directly. Real search engines surface whatever results exist, with no mechanism to inject controlled conditions. Live terminals do not allow injecting a low-disk-space condition on demand. Agent training is bounded by what production environments will surface, with no systematic way to expose the edge cases agents will need to handle but rarely encounter in training. The research team trained agents inside the resulting simulator and found performance gains that exceeded what training against real environments alone produced. In a separate test, using world model training as a warm-up before agentic fine-tuning improved performance across seven benchmarks, including three the model had never seen during training. The paper accompanying the release identified a gap in prior agent research. "We argue that world modeling is a crucial missing piece in the path to general agents." Qwen-AgentWorld trains on what environments return, not what agents should do Most agent models are trained to answer one question: given what the environment just showed me, what should I do next? Qwen-AgentWorld is trained to answer the inverse: given what the agent just did, what will the environment show next? That reversal is the core of what the paper calls a language world model: instead of optimizing for action selection, the model learns to predict the next environment state across all seven domains under a single training objective. 
Prior work was narrower: WebWorld, an earlier Qwen project from February, covered web environments only; Snowflake's Agent World Model, published the same month, generates code-driven SQL-backed environments rather than training a model to predict states. Qwen-AgentWorld is the first to span seven domains in a single model, with environment modeling baked in from the earliest pretraining stage. Alibaba trained both models in three stages on more than 10 million environment interaction trajectories from real agent runs. Stage one teaches the model how environments behave — file systems, terminal states, browser DOM changes, API responses. Stage two trains the model to reason through what comes next before predicting it. Stage three, reinforcement learning, tightens predictions using rule-based checks and open-ended quality scoring. Both models are Mixture-of-Experts designs — only a fraction of parameters are active per token. The 35B model activates 3B; the 397B activates 17B. Both support 256K context windows. For GUI domains (Android, Web, and OS), the models work from textual accessibility trees and UI view hierarchies rather than screenshots. The 35B model weights and AgentWorldBench are available under Apache 2.0; the 397B weights are not publicly released. The training results matter more than the benchmarks The benchmark scores show how accurately the models predict what environments return. The training results show what that prediction capability is actually worth for teams building agents — and those are the numbers that matter more. According to the researchers, agents trained inside controlled simulation outperformed agents trained in real environments. Injecting targeted perturbations — partial responses that force extra agent steps, and edge cases real environments rarely surface — pushed MCPMark from 24.6 to 33.8. 
On Search, agents trained in entirely fictional worlds transferred to real search tasks, pushing WideSearch F1 Item from 34.02 to 50.31 on the open 35B model. A separate warm-up test showed that world model pretraining improved BFCL v4 from 62.29 to 71.25 and Claw-Eval from 53.60 to 64.88 with no agent-specific fine-tuning. Researchers flag the benchmark and the overfitting risk The paper drew immediate reaction from AI researchers on X. The concerns they raised map to what practitioners need to verify before acting on the findings. On the training objective and transfer result, the assessment from one AI/ML researcher was direct. "Every other 'agent' model has been trained to act in environments," wrote @drawais_ai, who has a PhD background and regularly breaks down AI papers. "Qwen flipped the question. They trained the model to predict the environment itself... That predictive knowledge then transfers to agent tasks even without any agent-specific fine-tuning." He identified the Controllable Sim RL result as "the receipt" for the claim that synthetic training can substitute for real-environment RL at scale, and flagged that three of the seven transfer benchmarks were entirely out of domain. The benchmark margin drew immediate scrutiny. "AgentWorldBench is a benchmark Alibaba built and published in the same paper," wrote @TheSignal_Desk, who focuses on honest takes and key numbers in AI research. "They wrote the test, then topped it by 0.46." The sim-RL methodology is the result @limalemonnn, who builds production AI agents, identified as most in need of scrutiny before the headline claim gets quoted. "Sim-trained agents traditionally overfit to the simulator's quirks," they wrote. "If the world model is too clean, the agent learns the model, not the task." They pointed to the paper's holdout split as the section practitioners should read before acting on the numbers. The overfitting concern has a partial answer in the data. 
The gap between uncontrolled Sim RL (MCPMark 24.6) and controlled Sim RL (MCPMark 33.8) suggests the gains depend substantially on the controllability mechanism, not simulation accuracy alone. The fictional-world Search result, where agents trained on invented environments transfer to real search tasks, is the paper's strongest evidence against the overfitting concern. What this means for teams building agentic pipelines For AI engineering teams building and scaling agentic pipelines, this work signals a meaningful shift in how agent capability gets built. Teams training agents at scale now have a third option between real-environment RL and static benchmarks: controlled simulation that injects the edge cases production won't surface. Synthetic environments are a legitimate training layer. Controlled simulation that injects conditions real environments won't produce is a complement to real-environment RL, not a shortcut around it. What a model learns before agent training starts matters more than most pipelines account for. The warm-up finding — performance gains across unseen benchmarks with no agent-specific training — suggests environment grounding belongs earlier in development than current practice.
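To make the idea of "controlled simulation" concrete, here is a minimal sketch of perturbation injection around a simulated environment. It is not the Qwen team's implementation. The `Perturbation` type, the `with_perturbations` wrapper, the toy `clean_terminal` environment and the action strings are all hypothetical names chosen for illustration. The sketch only shows the general pattern the article describes: a simulator can return a low-disk-space error or a truncated response on demand, which a live terminal or search engine cannot.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical types for illustration; none of these names come from the paper.
Observation = str
EnvStep = Callable[[str], Observation]  # maps an agent action to an observation


@dataclass
class Perturbation:
    name: str
    applies_to: Callable[[str], bool]              # which actions it can affect
    rewrite: Callable[[Observation], Observation]  # how it alters the observation


def with_perturbations(step: EnvStep, perturbations: list[Perturbation],
                       rate: float, rng: random.Random) -> EnvStep:
    """Wrap an environment step so matching actions are perturbed with probability `rate`."""
    def perturbed_step(action: str) -> Observation:
        observation = step(action)
        for p in perturbations:
            if p.applies_to(action) and rng.random() < rate:
                return p.rewrite(observation)
        return observation
    return perturbed_step


def clean_terminal(action: str) -> Observation:
    """Toy stand-in for a simulated environment that always succeeds."""
    return f"ok: {action}"


# An edge case that a live terminal would not produce on demand.
low_disk = Perturbation(
    name="low_disk_space",
    applies_to=lambda a: a.startswith("write "),
    rewrite=lambda obs: "error: No space left on device",
)

# A partial response that forces the agent to take extra steps.
partial = Perturbation(
    name="partial_response",
    applies_to=lambda a: a.startswith("search "),
    rewrite=lambda obs: obs[: len(obs) // 2] + " [truncated]",
)
```

In this sketch, `rate` is the knob a training pipeline would tune: at `0.0` the agent sees only clean behavior, and at higher values it meets the injected conditions more often than any production environment would surface them.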

As enterprise AI agents take on increasingly complex, long-horizon tasks, their performance is often restricted by their harness, the software scaffolding that connects the backbone LLM to its environment. Currently, harnesses are largely static and hand-crafted. Improving them is largely manual and they do not automatically improve based on the execution data they collect from their environment. To address this engineering bottleneck, researchers at Xiaomi introduced HarnessX, a framework that treats the AI harness as a composable object and autonomously applies improvements to its code. In real-world enterprise applications, this automated adaptation enables AI systems to dynamically adjust to application-specific requirements. Practical tests showed HarnessX delivering substantial performance gains across domains like software engineering and web interaction. The results demonstrate that scaling the foundation model is not the only path to more capable AI — and for smaller models, it may not even be the best one. HarnessX's harness evolution yielded an average +14.5% performance gain across 15 model-benchmark combinations; for the open-weight Qwen3.5-9B, gains reached +44% on embodied planning tasks. The challenges of harness engineering In AI applications, a foundation model's capability relies heavily on its surrounding harness. The harness acts as the operational layer that converts raw model outputs into structured, executable agent behaviors. It comprises the prompts, external tool integrations, memory management, and control flows that dictate how an AI system observes its environment, reasons through a problem, and takes action. As enterprise agents take on more complex, long-horizon workflows, harness engineering has become a fundamental part of AI development. Despite its importance, harness development remains far from a mature engineering discipline and presents three key challenges. First, harnesses are static and hand-engineered. 
Any shift in the underlying foundation model, the introduction of new tools, or a pivot to a different operational domain requires bespoke, manual code rewrites. Traditional harnesses lack mechanisms to autonomously learn and improve from past execution experiences. Second, most existing harnesses suffer from architectural entanglement. They tightly couple prompt templates, tool wrappers, retry policies, and memory management within the same code paths. This entanglement means that tweaking one component can silently break others. Attempting to reuse a harness across different business domains often devolves into raw code copying rather than clean, modular composition. Third, the harness and foundation model are optimized in isolation. When engineers run tests to improve the harness, the execution traces generated are typically discarded rather than used as training data to improve the model. Consequently, model upgrades do not naturally lead to harness improvements, creating a bottleneck where teams fail to capture the full value of their agent's operational data. HarnessX: an autonomous foundry for AI agents HarnessX solves the engineering bottlenecks of manual harness development with what the researchers call a “unified harness foundry.” The core innovation of HarnessX is treating the harness as a "first-class object". In software engineering terms, this means the harness is an independently serializable, modular, and substitutable entity. By separating the model configuration (i.e., which AI model is operating) from the harness configuration, engineers can seamlessly swap, adapt, and evolve the scaffolding without touching the underlying model. HarnessX breaks agent behavior down into different components, such as context assembly, memory management, tool ecosystems, control flow, and observability. Every specific behavior is implemented as a "processor" that plugs into precise lifecycle hooks of the harness. 
This modular structure allows the system to swap, add, or remove these processors without breaking the surrounding pipeline. To automate the optimization of this modular structure, HarnessX introduces AEGIS, a trace-driven evolution engine. AEGIS frames harness adaptation as a reinforcement learning (RL) problem over the different symbolic components of the harness. Framing harness optimization as a reinforcement learning problem introduces three pathologies the researchers had to explicitly engineer against:
Reward hacking: The system might exploit shortcuts to the solution instead of genuinely solving the task.
Catastrophic forgetting: An edit that fixes a failure pattern in one domain might silently break a previously solved workflow in another.
Under-exploration: The system might iterate on minor prompt tweaks rather than exploring new, structurally superior tool configurations.
To prevent these problems, AEGIS relies on full trace observability and a four-stage pipeline:
Digester: Compresses execution traces into structured summaries to identify where the agent failed.
Planner: Analyzes these summaries to enable the system to explore structural changes rather than just local prompt tweaks.
Evolver: Generates code-level harness edits and tests to ensure they run correctly before deployment.
Critic and gate: A Critic assesses the edits to detect reward hacking, while a deterministic gate rejects any update that regresses a previously solved task to prevent catastrophic forgetting.
HarnessX enters a growing field of self-improving harness research, but what separates it is harness-model co-evolution. The researchers highlight that optimizing either component in isolation eventually hits a wall. Evolving only the harness hits a scaffolding ceiling if the underlying model lacks the reasoning capacity to use the new tools. Training only the model hits a training-signal ceiling if the harness never prompts the model to use its advanced capabilities.
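The deterministic gate in the AEGIS pipeline is the easiest stage to picture in code. The sketch below is an illustration under stated assumptions, not the researchers' implementation: `EvalResult`, `regressions` and `gate` are hypothetical names, and results are assumed to be simple pass/fail flags per task. It shows only the rule the article describes, which is that a candidate harness is rejected if any previously solved task stops passing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Pass/fail outcome per task id for one harness version (hypothetical structure)."""
    passed: dict[str, bool]


def regressions(baseline: EvalResult, candidate: EvalResult) -> list[str]:
    """Return task ids the baseline solved that the candidate no longer solves."""
    return sorted(
        task for task, ok in baseline.passed.items()
        if ok and not candidate.passed.get(task, False)
    )


def gate(baseline: EvalResult, candidate: EvalResult) -> bool:
    """Deterministic accept/reject: accept only if nothing previously solved regresses."""
    return not regressions(baseline, candidate)
```

Because the check is a plain set comparison rather than a model judgment, the same inputs always produce the same decision, which is what makes a gate of this kind a guard against catastrophic forgetting. A candidate that fixes three new tasks but breaks one old one is still rejected.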
HarnessX interleaves harness evolution with model training. The execution traces generated while the harness attempts to adapt to tasks are converted into reinforcement learning signals for the foundation model. Every time the harness improves its strategy, the model simultaneously learns to better exploit that new strategy, breaking the capability ceilings of traditional AI agent development. HarnessX makes this co-evolution possible through cross-harness GRPO (Group Relative Policy Optimization). GRPO is the popular RL algorithm used to train reasoning models such as DeepSeek-R1. When fine-tuning the model, cross-harness GRPO pools an agent's execution trajectories for the same task across entirely different versions of the application's harnesses. This allows the underlying model to internalize high-level strategy shifts, like using a new API endpoint or managing an execution budget, rather than just learning minor prompt-phrasing variations. HarnessX in action on industry benchmarks To validate the practical utility of HarnessX, the researchers tested it across five benchmarks comprising software engineering, multi-turn customer service dialog, web navigation, open-ended multi-step reasoning, and embodied planning. They separated the AI into two roles. The “meta-agent,” powered by Claude Opus 4.6, analyzed logs and wrote the code to evolve the harnesses. The “task agents” ran the actual workflows. To prove the framework is model-agnostic, they tested it on three different worker models: Claude Sonnet 4.6, GPT-5.4, and the open-weight Qwen3.5-9B. HarnessX was compared against two primary baselines. The first was a static harness, representing how most enterprises deploy AI today, using hand-crafted, frozen setups with benchmark-specific prompts and tools. The second was the Claude Code SDK, a baseline representing a single-agent evolver to test if the complex, four-stage AEGIS pipeline outperformed asking a single language model to iterate on the code. 
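The pooling step in cross-harness GRPO can be sketched with the standard group-relative advantage, in which each rollout's reward is normalized against the mean and standard deviation of its group. The code below assumes that form and assumes groups are keyed by task alone, so rollouts from different harness versions land in the same group. The `Trajectory` fields and the `pooled_advantages` function are hypothetical, and the paper's exact formulation may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    harness_version: str  # hypothetical label for the harness that produced the rollout
    reward: float


def pooled_advantages(trajectories: list[Trajectory], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages, grouping by task only so harness versions are pooled."""
    by_task: dict[str, list[float]] = defaultdict(list)
    for t in trajectories:
        by_task[t.task_id].append(t.reward)
    # Per-task mean and population standard deviation of rewards.
    stats = {task: (mean(rs), pstdev(rs)) for task, rs in by_task.items()}
    return [
        (t.reward - stats[t.task_id][0]) / (stats[t.task_id][1] + eps)
        for t in trajectories
    ]
```

The design point is the grouping key. If rollouts were grouped per harness version, a strategy shift introduced by a new harness would never be compared against the old behavior. Pooling by task lets a rollout that succeeds under the evolved harness earn a positive advantage relative to failures under the earlier one.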
Dynamically evolving the harness yields significant gains on the same base model. HarnessX improved performance in 14 out of 15 model-benchmark combinations. Across all tests, evolving the harness yielded an average absolute performance gain of +14.5%. The weakest models benefited the most from dynamic harness improvement. The open-weight Qwen3.5-9B saw a +44.0% performance jump on the ALFWorld embodied planning benchmark, and an +18.2% jump on SWE-bench Verified for software engineering. Co-evolution also proved highly effective. When the researchers trained the foundation model using the data generated while evolving the harness, they saw an additional +4.7% average performance boost. Improving the harness and the model simultaneously yields the highest ceiling. The co-evolution gain applies only to open-weight models. Anecdotal evidence from the experiments shows how HarnessX solves pernicious problems when creating agent harnesses for real-world tasks. For example, in the GAIA multi-step reasoning benchmark, the task agent consistently failed because the headless browser tool it used to scrape Wikipedia timed out on the site's JavaScript-heavy frontend. HarnessX analyzed the execution traces, diagnosed the error, and wrote a new tool that bypassed the browser entirely and queried the MediaWiki API directly for plain text. It swapped this tool into the harness and instantly unlocked the failing tasks. During the WebShop e-commerce tests, the AI agent often got stuck in pagination loops, endlessly clicking "next page" and reformulating searches without ever committing to buying a product. Rather than just tweaking the prompt, HarnessX built an advisory processor that detected when the agent was repeating navigation actions. It injected a warning into the context to force a decision, curing the looping behavior and raising performance. 
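The WebShop fix is a good example of what a "processor" plugged into a lifecycle hook might look like. The following is a rough sketch only: the `LoopAdvisor` class, the `after_action` hook name, the action prefixes and the warning text are all assumptions made for illustration, since the article does not describe the HarnessX processor interface in detail.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LoopAdvisor:
    """Hypothetical processor: warns when recent actions are all navigation repeats."""
    window: int = 4
    navigation_prefixes: tuple[str, ...] = ("next_page", "search")
    _recent: deque = field(default_factory=deque)

    def after_action(self, action: str, context: list[str]) -> list[str]:
        """Lifecycle hook (assumed name): record the action and maybe append a warning."""
        self._recent.append(action)
        if len(self._recent) > self.window:
            self._recent.popleft()
        # Stuck means the last `window` actions were all paging or re-searching.
        stuck = (
            len(self._recent) == self.window
            and all(a.startswith(self.navigation_prefixes) for a in self._recent)
        )
        if stuck:
            return context + ["Advisory: you have been navigating without choosing. Pick a product now."]
        return context
```

The processor is advisory rather than blocking. It leaves the agent's action untouched and only adds a message to the context, which matches the article's description of injecting a warning to force a decision.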
Limits of automated harness engineering
One important caveat is that the system currently relies on powerful models to act as the meta-agent that rewrites the harness code. In their experiments, the researchers relied on closed frontier models like Claude Opus. Open-weight models are quickly improving, but their ability to serve as the meta-agent remains untested. Another limitation is the intrinsic capability of the task model. If the underlying task model is fundamentally too weak to execute the complex workflows the new harness proposes, HarnessX will not be able to improve the agent’s overall abilities (the researchers observed this with the Qwen3.5-9B model on the SWE-bench coding tests). Despite these limitations, HarnessX makes a concrete case that harness engineering, not just model scaling, is a lever practitioners can pull now. For teams running smaller open-weight models on complex workflows, the gains here are large enough to justify evaluating harness evolution as a first step before reaching for a more expensive frontier model. The researchers plan to release the code in a future update.

Drug discovery is notoriously inefficient. Pharmaceutical projects span years, moving from one specialized human team to the next through disconnected workflows that lose knowledge at each handoff. A shocking 90% to 95% of drug discovery projects reportedly fail, one of the highest failure rates of any industry. A single successful drug can take more than a dozen years and cost up to $1 billion from initial discovery to patient distribution, according to published reports. Generative AI is being used to solve some of the challenges, but Stanford researchers have moved the ball forward with agentic AI. A team led by James Zou, associate professor of Biomedical Data Science at Stanford University, has deployed thousands of autonomous AI "scientist" agents in a virtual biotech that simulates the full lifecycle of drug development. The agents handle everything from initial discovery through safety testing and clinical trial design, while maintaining the continuity that’s lacking in today’s drug discovery processes, according to Zou. The project uses a hierarchical orchestration framework. At the top sits a chief scientist officer agent that acts as a planner, delegating tasks to teams of specialized agents, Zou told VentureBeat during a call ahead of his upcoming session at VB Transform 2026. While one team of agents focuses on discovery, another manages safety, and others handle specialized analytical tasks. Because these agents operate within a unified, hierarchical ecosystem, they retain the full context of a project, maintaining continuity from the first molecule identified to the final clinical outcome. The "brain" of the system relies on a vast amount of primary data. The agents are granted access to data sources ranging from genomics and FDA chemistry data to clinical trial databases using a model context protocol. The team has invested heavily in agent-native and agent-friendly data, allowing the AI to synthesize complex information more effectively. 
The system relies on a combination of models. Zou noted that while Claude often serves as the backbone for coding and data analysis, the architecture employs a mixture of models, including those fine-tuned for specialized use cases. Zou is raising money at a roughly $1 billion valuation for his startup, Human Intelligence, based on the research. During his session at VB Transform on July 15, titled "How 10,000 agentic scientists in Stanford’s lab are set to revolutionize medical research and discovery," Zou will share strategies for managing context and long-running, multi-step workflows in a multi-agent system, the process of transforming and indexing raw enterprise data to make it agent native, and how to use human auditing and experimental reward signals to verify agent actions. Another VB Transform session focused on the value of agentic context is "Building a trustworthy agentic AI foundation: How Zillow accelerated engineering by 40%," with Zillow's SVP of engineering and technology, Toby Roberts, and Glean’s CEO, Arvind Jain.

Shopify built an LLM proxy that gives every engineer access to multiple AI providers, with automatic failover when any one of them goes down, changes, or disappears. When Claude Fable 5 shut down, Shopify's engineers didn't go into panic mode. The proxy shifted them to Claude Opus or GPT 5.5 automatically, without interrupting their workflows. “Fable looks amazing; we used it of course,” Farhan Thawar, Shopify’s head of engineering, says in a new VentureBeat Beyond the Pilot podcast. “When a model comes and then it goes, or it could be as innocuous as an update, the proxy allows us to spray across the different providers,” Thawar says. Shopify buys tokens in bulk, and all users connect to models through its proxy, Thawar says. This gives his team access to reporting and failover; when there’s an availability issue with one provider, users can be “automatically, seamlessly” transferred to another. Enterprises can learn from this example and consider how a disruption might affect their business, Thawar says. At the very least, they should establish a solid backup plan. It’s important to have a system that allows for movement across models so enterprises are not “super tied” to a specific provider. Distillation is another important strategy. With distillation, a student model learns from a teacher model and typically becomes specialized in a narrower task. These small language models (SLMs) can be more beneficial than generalized, off-the-shelf models in some circumstances. One example is Shopify’s flagship AI assistant, Sidekick, which performs numerous specialized subtasks for merchants so they can “remove toil” from their day-to-day work. Using smaller distilled models can be faster and cheaper than using more generalized models, Thawar says. In some cases they have proven to be 2x cheaper and faster; in more extreme cases, 30x cheaper and faster, he says. But “it isn’t just about cost and latency, which are big; it’s about accuracy,” Thawar says. 
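The failover behavior Thawar describes earlier in this piece can be pictured as an ordered list of providers tried in turn. The sketch below is an assumption-laden illustration, not Shopify's proxy: providers are modeled as plain Python callables, and `ProviderUnavailable` and `complete_with_failover` are hypothetical names. A production proxy would also handle authentication, streaming, rate limits and reporting, none of which is shown here.

```python
from typing import Callable

# A provider is modeled as a callable from prompt to completion (an assumption).
Provider = Callable[[str], str]


class ProviderUnavailable(Exception):
    """Raised by a provider stub when the upstream model cannot serve the request."""


def complete_with_failover(prompt: str,
                           providers: list[tuple[str, Provider]]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, completion) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the completion is a small design choice that supports the reporting side of the proxy, since the caller can log which provider actually served each request.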
Engineers feed the UDP their teacher model, training data, evals, and a target model — say, Opus 4.8 distilling down to Qwen 3.5. The pipeline runs for about a day, then returns an evaluation showing what the fine-tuned model actually achieved on speed, cost, and accuracy for that subtask. If the tradeoff looks good, the engineer deploys it — no approval process required. Shopify's internal platform, Tangle, lets anyone visualize the pipeline as it runs. Thawar says his “dream” is to eventually not give the distillation pipeline a target model at all. Instead, users could provide the teacher model with data and evals and the directive: ‘Based on your learnings over time, I want you to look at a different class of model, different sizes, different types, and you tell me what the right distillation target is.’ “Maybe we'll get surprised. Maybe it'll be such a small model it could run on a phone,” Thawar says. “Other times, maybe it comes back and says, ‘There isn't a way to distill this down to anything better than what we have at the frontier.’” Moving away from "AI reflexivity" to "AI leverage" Shopify users can apply whatever harness they want: Claude Code, Codex, Cursor, GitHub Copilot for VS Code. “We expose everyone to the different harnesses so they can get a feel for what may or may not work in their workflow.” But the company also implemented a usage dashboard; this allows Thawar’s team to ask interesting questions around not just token spend, but: Who’s using the most expensive tokens? Who's spending more time on reasoning? What types of models are being used, and what disciplines and levels? Regarding the "tokenmaxxing" question, Shopify does have “circuit breakers” in place. If a user has a model running for a long time (say, 10 hours) and it’s consuming a lot of tokens, they will get pinged, “Did you mean to spend this?” As Thawar explains, sometimes the reply is “Oh, absolutely.” Other times it’s: ‘Whoa, I didn't know that was running in the background. 
I totally forgot about it. I'd rather stop it now.’ The ultimate goal, as Thawar describes it, is to move from “AI reflexivity” to “AI leverage,” and get people to really think deeply about where they can benefit most from AI in their workflows. Listen to the full podcast to hear more about: Shopify’s philosophy of building infrastructure before features. As Thawar puts it: “We've always built more infra. We will continue to always build more infra.” How Shopify’s internal AI agent, River, creates a “substrate of information” across the company. How Thawar's OpenClaw agent figured out he was traveling from his calendar — and what that moment told him about where agents are actually headed. You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

AI agents are increasingly proficient at executing business tasks autonomously, but IT leaders are cautious about granting permissions to access enterprise systems. Part of the challenge lies in how AI reliability is measured. Industry standards often rely on EVAL scores, which provide a static snapshot of performance rather than a measure of overall reliability. These metrics can fail to capture predictability across prompts, environments, and input types, said Bryan Silverthorn, director of the AGI Autonomy research lab at Amazon. Amazon’s AGI autonomy research lab is moving beyond raw performance benchmarks, focusing instead on a structured framework centered on consistency, robustness, predictability, and safety, Silverthorn told VentureBeat during an interview ahead of his session at VB Transform 2026. Rather than assuming that models can be harnessed into safety, Amazon’s approach emphasizes decoupled systems, such as sandboxed environments where agents propose changes that are reviewed by humans before implementation. This strategy aims to bridge the trust gap by prioritizing verifiable interactions, even in highly sensitive domains like finance, where the potential damage an agent can cause is significant. In VentureBeat’s Q2 Pulse Research survey of over 100 senior technology leaders and buyers, just 4% said they are comfortable relying on model guardrails alone. When asked what worries them most about model guardrails, 40% said unauthorized access to tools or data and 27% cited prompt manipulation or injection. At VB Transform, Silverthorn will share details of Amazon’s approach to trustworthy agentic AI and how companies can move from single-agent wrappers to multi-tool architectures that can self-correct mid-execution during his session titled Closing the capability-reliability gap: Inside Amazon’s framework for engineering trustworthy agents. 
Another agentic ops and evals-focused session at VentureBeat’s flagship conference, happening July 14 and 15 in Menlo Park, is "Intelligence at scale: How Waymo builds safe, efficient AI for the physical world" with speaker Manasi Joshi, director of systems intelligence and machine learning at Waymo.

Customer expectations have shifted from simple, fast conversational interactions to complex agentic AI-powered tasks that legacy IT architectures simply can’t handle. To address this, Intuit made the bold decision to overhaul its technical infrastructure for its business platform. The company moved away from its multi-agent setup, which prioritized broad capabilities, to a granular, skill-and-tool-based architecture while embedding human experts directly into the workflow alongside AI. This shift involved decomposing its massive agents into specialized components, separating the brain from the hands, essentially. "We went from a multi-agent system where we had large agents that did a lot to fully incorporating workflows, skills and tools down to the base level,” said Nhung Ho, VP of AI at Intuit. “We changed the orchestrator, we changed the planner, we changed the brain, and we also changed what everybody had to build across the whole company." At VB Transform 2026 on July 14 and 15, Ho will share details about the technology decisions behind building an abstraction layer behind Intuit’s system of intelligence. She’ll also share how the new architecture has allowed the company to decouple its orchestration from specific model providers, allowing Intuit to remain agile and use the best tools for the job, whether from large model providers or their own home-grown tools. 
Other VB Transform sessions focused on agentic orchestration include: "From signals to shelves: How Target is engineering Agentic AI for the right product, right place, right time" with speaker Siobhan McFeeney, SVP Technology, Target; "The engineer's multiplier: How Instacart uses agentic AI to eliminate toil, elevate teams and slash costs" with speaker Anirban Kundu, CTO, Instacart; "MCP connection isn't orchestration: Building the agent execution layer" with Arnab Bose, chief product officer, Asana; "Building the agentic workforce: A blueprint for scaling AI operations without the sprawl" with Romit Jadhwani, Sr. Director, Enterprise AI, Data & Productivity, Rivian and Craig Wiley, VP of AI, Databricks; and "Inside Atlassian’s Living Lab: Deploying context-aware agents at scale" with Dr. Molly Sands, head of the Teamwork Lab at Atlassian.

OpenAI and Broadcom this morning unveiled their first custom AI accelerator chip named "Jalapeño," positioning it is as a purpose-built processor for large language model (LLM) inference, rather than the more general GPUs offered by the likes of Nvidia or AMD. According to its creators, Jalapeño is designed to support workloads behind ChatGPT, Codex, the API and future agentic products, though notably, both OpenAI's and Broadcom's news releases position it as a product that could be made available to external AI firms as well — "built from the ground up for current and future LLMs across the industry." [Emphasis mine.] It reportedly cuts inference costs by about 50%, according to Bloomberg. Recall inference is when the finished AI model is served to end users to use, while there remain high costs for training, research and development. Jalapeño's engineering timeline set a blistering pace for the semiconductor industry, moving from early schematics to fabrication readiness within a brief nine-month window, when new processor development cycles are typically measured in years. Indeed, the OpenAI and Broadcom partnership itself was only publicly announced in October 2025. The companies attributed this speed to a deep software-hardware co-development process that actively used OpenAI’s own models to accelerate parts of the chip design. Sources close to the firms told VentureBeat the development process relied on prior generation OpenAI models, though an OpenAI spokesperson declined to specify exactly which when asked by VentureBeat. After receiving an early physical model on Wednesday, OpenAI outlined plans to begin rolling out these processors across active data centers by the end of this year. OpenAI says it has already begun testing running at least one of its prior generation models, GPT‑5.3‑Codex‑Spark, on the chips at a production workload, though in a test environment. 
The release marks a major strategic expansion for the ChatGPT creator as it attempts to build the full computational stack required to make advanced AI faster, more reliable, and more accessible. There remain, of course, many outstanding questions, including how the new Jalapeño chip performs compared to direct competitors, its costs, and its manufacturing viability. Sources close to the company described the initial performance itself as, ironically, "outstanding."

Greg Brockman, OpenAI's president and co-founder, appeared on CNBC alongside Broadcom CEO Hock Tan this morning to discuss the news, and Brockman noted in the interview that "this is a real performance improvement...on performance per watt and performance per dollar." In a separate post on X, Brockman wrote that "Perf[ormance] per watt looking incredible."

Why OpenAI Built an ASIC

To understand why OpenAI is moving into chip design, it helps to look at the architecture. Jalapeño is an Application-Specific Integrated Circuit, or ASIC. Unlike a GPU, which can handle many types of workloads, an ASIC is tuned for narrower uses, as industry experts note. That narrower focus can make it cheaper and more efficient for specific AI tasks, though less adaptable than Nvidia-style GPUs.

In Jalapeño's case, OpenAI is starting from a clean design focused on modern LLM serving, instead of adapting a broader accelerator to fit its needs. The company says the architecture is shaped by its experience running large-scale AI products and is meant to reduce unnecessary data movement while better matching compute, memory and networking resources. Broadcom is contributing core silicon implementation and networking technology, including Tomahawk networking silicon, while Celestica is helping with board, rack and system integration. The goal is to move the chip closer to its practical performance ceiling in real workloads, not just improve theoretical benchmarks.
However, OpenAI's pivot into proprietary hardware is not just a quest for technical supremacy: it may also make its core unit economics far more sustainable. Audited financial documents posted recently by AI critic and AI public relations specialist Ed Zitron revealed that while OpenAI generated an impressive $13.07 billion in revenue throughout 2025, its total operational expenses for the year ballooned to $34 billion, resulting in an operating loss of roughly $20.92 billion.

The primary culprit behind this cash hemorrhage was raw compute, though more of it is likely attributable to training than to inference. In 2025 alone, research and development costs, driven largely by the infrastructure required to train and serve massive language models, accounted for $19.18 billion, or approximately 56 percent of the company's entire spending footprint. Furthermore, OpenAI reportedly paid Microsoft over $10.59 billion just for R&D and compute infrastructure last year.

Still, as OpenAI lays the groundwork for a heavily anticipated public offering in 2026, the Jalapeño inference chip may offer some reassurance to private investors and public markets that OpenAI has a plan for digging itself out of the financial hole and moving toward profitability. If it can drive down the costs of AI inference, then maybe it can recoup some of the money spent on costly training runs.

"By designing more of the stack ourselves, we can serve more intelligence with greater efficiency and keep pushing advanced AI toward broader access," said Brockman in a statement included in Broadcom's release.

What Does This Mean for Nvidia and All of OpenAI's Other Chip Providers?

The introduction of Jalapeño immediately raises questions about OpenAI's strategic positioning within the fiercely competitive semiconductor and GPU market.
Since kicking off the generative AI boom in late 2022, OpenAI has remained one of the largest customers of GPU market leader Nvidia's premium products, but it has also taken billions in investment dollars from the firm (engendering accusations of "circular dealing") and expanded to work with rival chipmakers to feed its appetite for compute.

Nvidia: In February 2026, Nvidia finalized a $30 billion direct investment into OpenAI as part of a massive $110 billion funding round. This deal secured an agreement to deploy 10 gigawatts of computing systems, including 3 gigawatts of dedicated inference capacity and 2 gigawatts of training capacity, utilizing Nvidia's next-generation Vera Rubin platform. Sources close to the companies tell VentureBeat Nvidia will remain central to OpenAI, particularly on the model training and development side.

Amazon Web Services (AWS): As part of the same February 2026 funding round, Amazon invested $50 billion into OpenAI. This deal included a commitment for OpenAI to consume approximately two gigawatts of AWS's proprietary Trainium computing capacity over the next eight years.

Advanced Micro Devices (AMD): OpenAI signed agreements with AMD, Nvidia's chief hardware rival, to use AMD Instinct™ MI450 Series GPUs.

Cerebras: The company also struck a pact with Cerebras, an AI chipmaker that executed its initial public offering in May 2026.

Sources with knowledge of these deals said they currently remain in place, unaltered.

The Global Silicon Arms Race: OpenAI Joins AI Infrastructure Heavyweights

Before the introduction of Jalapeño, OpenAI operated at a distinct structural disadvantage compared to the world's vertically integrated technology empires. Tech giants like Google and Amazon have for years used their own mature custom silicon programs, Google's Tensor Processing Units (TPUs) and Amazon's Trainium lines, to serve massive computational workloads at drastically lower cost.
Microsoft, OpenAI's primary cloud provider and single biggest financial backer, aggressively entered the bespoke silicon market by launching the Azure Maia 100 accelerator in late 2023. Microsoft subsequently escalated this effort in January 2026 by introducing the Maia 200, an inference powerhouse built on TSMC's 3-nanometer process that already powers OpenAI's GPT-5.2 models within Azure data centers. Similarly, Meta has aggressively expanded its Meta Training and Inference Accelerator (MTIA) portfolio in recent years, debuting the MTIA 300, 400, 450, and 500 series to power its recommendation engines and generative artificial intelligence features without relying solely on Nvidia.

Jalapeño provides OpenAI with the opportunity to match and offset the hyperscaler advantage. By baking its software architecture directly into a proprietary processor, OpenAI has the chance to replicate, at least in part, the playbook used by Google, Amazon, Microsoft, and Meta: transitioning from a captive cloud customer into a more independent AI infrastructure provider.

The timing is ripe amid a rapidly escalating global silicon arms race. Driven in part by United States export restrictions, Chinese tech heavyweights are pursuing more of their own custom AI chip hardware, too:

In May, Alibaba's semiconductor division, T-Head, unveiled the Zhenwu M890, a proprietary processor expressly engineered for autonomous AI agents that require massive memory bandwidth and long-running context windows.

Huawei is reportedly gearing up to release its new Ascend 950DT chip next month.

ByteDance, the corporate parent of TikTok, reportedly entered active negotiations with Qualcomm in June 2026 to design custom application-specific integrated circuits for its data centers to escape third-party dependency.
By successfully finalizing the Jalapeño design, OpenAI is seeking to move beyond the traditional confines of a software laboratory and stand shoulder-to-shoulder with international cloud and infrastructure titans.

The Gigawatt Future

This sprawling web of vendor agreements highlights the sheer scale of OpenAI's infrastructural ambitions. The ultimate goal of the OpenAI and Broadcom partnership is to deploy gigawatt-scale data centers with Microsoft and other partners beginning in 2026: that is, data centers whose compute requires energy on the order of what a city consumes.

For Broadcom, the partnership acts as a massive reputational catalyst. The company has been among the biggest beneficiaries of the generative AI boom, helping hyperscalers and frontier labs engineer custom silicon. Broadcom shares reflect this momentum, showing an 18% year-over-year increase in the first part of 2026 and a nearly 7X boost since the end of 2022, according to CNBC.

Ultimately, Jalapeño confirms that OpenAI believes it is ready to move beyond software and code into the realm of real-world, custom hardware. By controlling the physics of its inference pipeline, while simultaneously leveraging the capital and hardware of Nvidia, Amazon, AMD, and Cerebras, OpenAI is attempting to rapidly rewrite its future unit economics.

While many enterprises have already begun integrating AI-generated images, visuals, graphics and videos into their production workflows, there is also a growing pool of data and subjective commentary indicating AI imagery ultimately looks non-distinct, monotonous, and too unoriginal to ensure a brand and its assets stand out from the pack. That it's "AI slop," in other words.

AI creative tools startup Krea is hoping to change that trend by opening up the weights to its new frontier AI image model Krea 2 as two versions, "Krea 2 Raw" and "Krea 2 Turbo," under a custom license that requires firms with more than 50 seats to pay for Enterprise usage, and requires users of any size to implement technical safeguards to prevent the generation of illegal materials, non-consensual intimate imagery (NCII), child sexual abuse material (CSAM), or defamatory assets. Both models are available for public download on Hugging Face.

The company says the models provide more visual variety than typical AI generators, while maintaining high prompt accuracy, fidelity, and quality. Importantly, they also offer enterprises and users the ability to customize the generative outputs much more than typical proprietary or even other open source models. And, for those seeking to generate imagery at high throughput, Krea 2 Turbo produces an image in about 2 seconds, making it among the fastest now available across open and proprietary AI image generation models.

AI Image Generator API Speed & Licensing Benchmarks (Mid-2026)

| Model / Generator | Developer / Platform | Avg. Generation Time | Licensing & Commercial Use | Key Characteristics |
|---|---|---|---|---|
| FLUX.1 [schnell] (fast) | Prodia | 0.5 seconds | Open Weights (Apache 2.0). Fully permissive for free commercial use. | Highly optimized endpoint utilizing step distillation to deliver sub-second generation times, representing the absolute floor for current API latency. |
| Z-Image Turbo | Replicate / fal.ai | 1.8 seconds | Proprietary. Commercial rights require active API usage contracts. | Designed for instantaneous inference bursts. Both Replicate and fal.ai achieve identical 1.8-second median times on this model. |
| Krea 2 Turbo | Krea | 2.0 seconds | Open Weights / Proprietary Hybrid. Available via platform trial or API. | Maintains the base model's compatibility with style references and LoRAs while utilizing Trajectory Distribution Matching (TDM) to accelerate the creative ideation loop. |
| Midjourney v8.1 (Turbo Mode) | Midjourney | 3 – 6 seconds | Proprietary. Commercial use requires an active Standard, Pro, or Mega tier subscription. | Delivers generation speeds "three times faster than v8" while maintaining the model's signature "painterly realism with sophisticated lighting," though it requires a "higher credit cost". |
| FLUX.2 [klein] 4B | Black Forest Labs | 3.9 seconds | Open Weights. Permissive commercial use. | The lightweight 4-billion parameter variant of the FLUX.2 architecture, balancing prompt adherence with high-speed generation. |
| FLUX.2 [klein] 9B | Black Forest Labs | 4.6 seconds | Open Weights. Permissive commercial use. | The medium-weight 9-billion parameter open model. It scales up compositional intelligence while keeping generation firmly under the 5-second barrier. |
| MAI Image 2 Efficient | Microsoft | 4 – 7 seconds | Proprietary. Commercial use requires consumption-based API billing via Azure AI Foundry. | A throughput-optimized variant explicitly designed to "out-pace Google's Imagen Flash". It makes a slight trade-off in detail for "substantially lower latency" that suits "automated pipelines" perfectly. |
| Midjourney v8.1 (Fast Mode) | Midjourney | 5 – 9 seconds | Proprietary. Commercial use requires an active Standard, Pro, or Mega tier subscription. | The standard operational mode for v8.1. The average wait time "consistently lands below 10 seconds for most prompts" while offering "excellent handling of complex multi-element scenes". |
| FLUX.2 [dev] | fal.ai / DeepInfra | 6.1 – 6.4 seconds | Open Weights (Non-Commercial). Strictly for research and non-commercial development. | The developer-focused research model. API endpoint optimizations cause slight variance, with fal.ai operating at 6.1 seconds and DeepInfra at 6.4 seconds. |
| Midjourney v8.1 (Relax Mode) | Midjourney | 8 – 14 seconds | Proprietary. Commercial use requires an active Standard, Pro, or Mega tier subscription. | Processes standard 1024x1024 resolution images without consuming fast GPU hours. The model retains "strong compositional instincts" and "consistent color grading and mood". |
| FLUX.2 [pro] | Black Forest Labs | 11.1 seconds | Proprietary. Commercial rights require paid API consumption. | The closed, professional-grade tier. It drops extreme step-distillation to prioritize high-fidelity commercial rendering and strict spatial alignments. |
| Seedream 4.0 | BytePlus | 11.6 seconds | Proprietary. Commercial use via BytePlus enterprise contracts. | The base commercial generation model for the Seedream architecture, focused on reliable, standard-resolution outputs. |
| MAI Image 2 Standard | Microsoft | 12 – 20 seconds | Proprietary. Commercial use requires consumption-based API billing via Azure AI Foundry. | Operates as a "full-quality output optimized for photorealism". It acts as a literal renderer, delivering "high-fidelity skin tones and material textures" and "strong literal prompt adherence". |
| Nano Banana Pro (Gemini 3 Pro Image) | Google DeepMind | 17.7 seconds | Proprietary. Commercial rights granted via Gemini API terms. | Prioritizes exact semantic accuracy and prompt adherence through an extended reasoning phase, trading raw speed for complex contextual execution. |
| Seedream 4.5 | BytePlus | 18.2 seconds | Proprietary. Commercial use via BytePlus enterprise contracts. | The upgraded high-fidelity variant, requiring an additional 6.6 seconds of compute time over the 4.0 version to refine complex textures and text rendering. |
| Krea 2 Large | Krea | 23.7 seconds | Proprietary / Open Weights. Commercial rights depend on deployment. | The un-distilled foundation model. It ignores the speed-focused Trajectory Distribution Matching of the Turbo variant to maximize aesthetic polish and structural stability. |
| FLUX.2 [max] | Black Forest Labs | 25.6 seconds | Proprietary. Closed enterprise API. | The heaviest parameter model in the FLUX lineup. It operates exclusively as a deep reasoning renderer for complex commercial assets. |
| GPT-Image-2 | OpenAI | 200.8 seconds | Proprietary. Full commercial usage under standard OpenAI terms. | A massive outlier in the latency landscape. It dedicates over three minutes to complex, multi-step semantic reasoning, likely utilizing an expansive chain-of-thought process prior to finalizing pixel outputs. |

Sources: Artificial Analysis, Krea, MindStudio.AI

Architectural bifurcation and the 12B parameter Transformer

At the technical core of the release sits an architectural framework built entirely from scratch: a Diffusion Transformer scaled to 12 billion parameters. Rather than deploying a single, heavily fine-tuned model for all downstream tasks, Krea open-sources two highly differentiated checkpoints captured at distinct milestones of the model's training lifecycle.

Departing from multi-stream configurations for structural clarity, the core engine standardizes on a single-stream transformer block architecture wherein attention and MLP layers are shared natively between text and image tokens. To maximize computational efficiency, Krea incorporates a SwiGLU MLP layer operating at a 4x expansion factor alongside Grouped-Query Attention (GQA) combined with gated sigmoid attention layers to stabilize training dynamics. Timestep conditioning is heavily optimized; the network replaces traditional per-block MLP modules with a lightweight, per-block tunable bias term, cutting total block modulation parameters by 20% to 30% and reallocating that parameter budget directly into core layers.
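To make the feed-forward design concrete, here is a minimal pure-Python sketch of a SwiGLU MLP at a 4x expansion factor. It assumes the common formulation down(SiLU(gate(x)) * up(x)) and reads "4x" as a hidden width four times the model width. The toy dimensions, weight values and function names are illustrative assumptions and are not taken from Krea's code.

```python
import math

# Toy sizes (assumed for illustration): model width 2, hidden width 4x that.
D_MODEL = 2
EXPANSION = 4
D_HIDDEN = D_MODEL * EXPANSION


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) )."""
    gate = [silu(g) for g in matvec(w_gate, x)]   # gating branch
    up = matvec(w_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise product
    return matvec(w_down, hidden)                 # project back to D_MODEL


# Hand-picked toy weights (not real model weights).
W_GATE = [[1.0, 0.0] for _ in range(D_HIDDEN)]         # D_HIDDEN x D_MODEL
W_UP = [[1.0, 0.0] for _ in range(D_HIDDEN)]           # D_HIDDEN x D_MODEL
W_DOWN = [[0.125] * D_HIDDEN for _ in range(D_MODEL)]  # D_MODEL x D_HIDDEN

if __name__ == "__main__":
    print(swiglu_mlp([1.0, 0.0], W_GATE, W_UP, W_DOWN))
```

The gate and up projections each widen the token vector to the hidden size, and the down projection returns it to model width, which is why a SwiGLU block carries three weight matrices rather than the two of a plain MLP.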
Positional encoding is managed via a 3D Axial Rotary Position Embedding (RoPE) scheme mapping across individual frame, height, and width coordinates.

Krea 2 Raw represents an undistilled base release checkpoint taken directly from the mid-training stage of the larger Krea 2 Medium development cycle. Because it lacks post-training alignment, reinforcement learning from human feedback (RLHF), or final aesthetic distillation, Krea 2 Raw functions as a blank canvas. It retains a vast, uncurated latent space that makes it poorly suited for immediate out-of-the-box prompting, but well suited to fine-tuning. Operating this model via the Hugging Face `diffusers` library requires a heavy compute footprint, executing via `Krea2Pipeline` in `torch.bfloat16` precision across 52 inference steps with a guidance scale of 3.5. To accelerate early-stage architectural convergence during the first epoch of this 256px baseline training phase, Krea applied internal Representation Alignment (iREPA) techniques before decoupling them to let the underlying model develop independent structural representations.

The second checkpoint, Krea 2 Turbo, represents the opposite end of the optimization spectrum. It is a distilled, post-trained variant derived from Krea 2 Medium. Through knowledge distillation, the network's complex multi-step generation sequence is compressed into a much leaner operational profile. Krea 2 Turbo cuts the required generation cycle down to just 8 inference steps with a guidance scale of 0.0, enabling it to render native 2k resolution imagery on standard consumer-grade hardware in approximately 2 seconds. The underlying latent representations for both models are optimized through the integration of the Qwen Image VAE and the FLUX 2 VAE to guarantee rapid convergence while maintaining high reconstruction fidelity.
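The two sampling configurations described above can be captured in a small sketch. The step counts and guidance scales come from the article; the keyword names `num_inference_steps` and `guidance_scale` follow the usual `diffusers` convention, and it is an assumption that `Krea2Pipeline` accepts them in this form. The preset class and helper functions are hypothetical, and the repository ID is left as a placeholder because the article does not give one.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SamplingPreset:
    """Sampling settings quoted in the article for one checkpoint."""
    num_inference_steps: int
    guidance_scale: float


# Values as reported for each checkpoint.
KREA2_RAW = SamplingPreset(num_inference_steps=52, guidance_scale=3.5)
KREA2_TURBO = SamplingPreset(num_inference_steps=8, guidance_scale=0.0)


def call_kwargs(preset, prompt):
    """Build keyword arguments for a diffusers-style pipeline call."""
    return {"prompt": prompt, **asdict(preset)}


def step_ratio(slow, fast):
    """How many times more denoising steps `slow` runs than `fast`."""
    return slow.num_inference_steps / fast.num_inference_steps


# Hypothetical usage (assumed call shape, placeholder repo ID, not executed here):
#   pipe = Krea2Pipeline.from_pretrained("<repo-id>", torch_dtype=torch.bfloat16)
#   image = pipe(**call_kwargs(KREA2_TURBO, "a red bicycle")).images[0]

if __name__ == "__main__":
    print(call_kwargs(KREA2_TURBO, "a red bicycle"))
    print(step_ratio(KREA2_RAW, KREA2_TURBO))
```

On step count alone, Raw runs 6.5 times as many denoising iterations as Turbo, which accounts for much of the latency gap between the two checkpoints.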
Data and training

The underlying dataset strategy for the Krea 2 family relies on a hybrid blend of publicly harvested data, third-party licensed image repositories, and highly curated synthetic datasets built via proprietary generation methods. Prior to final training, Krea processed these collections through rigorous algorithmic filters designed to strip out duplicative frames, low-resolution media, and explicit or harmful material, ensuring high fidelity and strong prompt compliance across both models.

Krea enforces a zero-synthetic data policy within its primary pretraining mix. To prevent the upper-bound quality limitations and output biases induced by AI-generated data, the engineering team deployed custom in-house filtering classifiers built on top of DINOv3 and SigLIP-2 architectures to completely purge synthetic images at scale. Furthermore, rather than using traditional model-based aesthetic filters that inadvertently strip away artistic intents like motion blur, Krea preserves wide stylistic boundaries. The team trained a Sparse Autoencoder (SAE) on SigLIP-2 embeddings to isolate and filter out genuine visual artifacts using an unsupervised tagging framework.

Krea 2 Raw vs. Krea 2 Turbo: Distinctions and use cases

The release establishes a highly deliberate operational paradigm for professional studios and independent creators: "train on Raw, generate with Turbo." This workflow leverages the unique architectural properties of both open-weight files to optimize both training accuracy and rendering speed.

In creative production pipelines, engineers can use Krea 2 Raw to train custom Low-Rank Adaptations (LoRAs) or domain-specific fine-tunes. Because the Raw checkpoint contains no baked-in stylistic opinions or aggressive post-training constraints, it absorbs unique aesthetic directions (such as architectural drafting styles, specific brand assets, or complex lighting designs) with high fidelity and zero stylistic interference.
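The "train on Raw, generate with Turbo" workflow rests on how a LoRA works: it is a low-rank update added to frozen base weights, so the same adapter can be applied to any checkpoint whose layers have matching shapes. The sketch below uses the standard LoRA formulation W' = W + (alpha / r) * B * A in pure Python. That Krea's tooling uses exactly this scaling is an assumption, and the 2x2 matrices are invented toy values standing in for a single layer of each checkpoint.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def apply_lora(w_base, lora_a, lora_b, alpha):
    """Return w_base + (alpha / rank) * (lora_b @ lora_a).

    lora_a has shape (rank x d_in); lora_b has shape (d_out x rank).
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_base, delta)]


# Toy stand-ins for one layer of each checkpoint (same shape, different values).
W_RAW = [[1.0, 0.0], [0.0, 1.0]]
W_TURBO = [[0.5, 0.25], [0.0, 1.5]]

# A rank-1 adapter, imagined as having been trained against W_RAW.
LORA_A = [[1.0, 2.0]]      # rank x d_in
LORA_B = [[0.5], [-1.0]]   # d_out x rank

if __name__ == "__main__":
    print(apply_lora(W_RAW, LORA_A, LORA_B, alpha=2.0))
    print(apply_lora(W_TURBO, LORA_A, LORA_B, alpha=2.0))
```

The adapter contributes the same delta to both bases. How well a style learned on Raw carries over in practice depends on how far the distilled weights have drifted from the base, which this toy example does not model.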
Once the training phase is complete, creators can port those exact LoRAs directly over to Krea 2 Turbo. This methodology is reflected in Krea's own development ecosystem, which hosts an in-house collection of custom LoRAs trained entirely on the Raw foundation model but optimized for execution within Turbo workflows.

On the user-facing application layer, Krea integrates this dual-engine setup with a powerful style transfer system. Rather than relying on erratic text descriptions to achieve an artistic look, users can feed multiple style reference images directly into the system. Krea 2 maps these references across its latent space, allowing creators to isolate individual aesthetic components, combine distinct moodboards, adjust style strength via generative sliders, and fine-tune batch variation levels to maintain visual cohesion across large-scale design iterations.

To address the gap between raw textual training captions and brief user inputs, Krea paired this suite with an advanced LLM Prompt Expander. Refined via Generalized Deep Q-Network Preference Optimization (GDPO) and trained on synthetic thinking traces to preserve intent reconstruction, the expander applies a photographic-medium bias to photorealistic requests and integrates an active DINOv3 embedding diversity score across rollout groups to prevent automated prompting routines from collapsing into a singular house style.

While Krea 2 Medium and Krea 2 Large remain the company's flagship models for high-fidelity composition and absolute stylistic adherence, Turbo fills the critical role of rapid visual ideation. It serves as an interactive scratchpad for early concept creation, quick prompt experimentation, and iterative art direction where near-instantaneous feedback loops are required to maintain creative momentum.

The custom license and its particulars

The open-weight assets deploy under the Krea 2 Community License Agreement operating alongside an official Acceptable Use Policy.
At a macro level, this legal framework mirrors recent industry trends toward commercial-use permissions that target small businesses while restricting large enterprise exploitation. The license explicitly permits individuals, independent creators, and small commercial companies to build applications, monetize generated imagery, and integrate the open weights directly into commercial software products without royalty obligations. Furthermore, Krea states that it "does not claim copyright or other intellectual property rights over content generated by users of this model," leaving output ownership entirely in the hands of the operator. For organizations scaling beyond this baseline, the ecosystem shifts into a paid, custom-tier structure. While Krea's official documentation lacks a rigid revenue threshold defining a "large enterprise," the company structurally demarcates the boundary based on organizational footprint: standard commercial usage caps at a "Business" tier accommodating up to 50 seats. Therefore, any entity requiring more than 50 seats, Single Sign-On (SSO) integrations, guaranteed Service Level Agreements (SLAs), or custom Data Processing Agreements (DPAs) qualifies as an Enterprise. These larger entities fall outside the free Community License scope and must pay for a custom commercial license—operating under "Custom Terms of Service"—negotiated directly with Krea's sales team. Additionally, developer access to Krea's official API remains entirely decoupled from the open-weights release; API usage operates as a distinct, paid service billed dynamically on a per-generation basis (measured in microdollars) and requires a prepaid USD balance independent of standard monthly compute subscriptions. However, a close examination reveals a significant structural shift regarding legal and behavioral compliance for all self-hosted deployments. 
Unlike traditional open-source licenses such as MIT or Apache 2.0, which grant broad usage rights subject only to minimal conditions such as preserving notices and which disclaim warranties and liability, the Krea 2 Community License implements strict downstream behavioral guardrails. Because Krea relinquishes centralized control over the downstream deployment of its open weights, the contract legally binds deployers to enforce content moderation protocols at the infrastructure layer.

Under the terms of the agreement, any developer or platform hosting Krea 2 models must implement active input/output classifiers or equivalent content filtering mechanisms to prevent the generation of illegal materials, non-consensual intimate imagery (NCII), child sexual abuse material (CSAM), or defamatory assets. Developers who fail to deploy these defensive safety layers stand in immediate breach of contract, giving Krea the explicit right to update model weights or revoke access to the model family entirely.

Background on Krea

Founded in 2022 by audiovisual systems engineering dropouts Víctor Perez and Diego Rodriguez Prado, San Francisco-based Krea initially captured market traction as a highly fluid user interface layer built to orchestrate disparate, third-party AI generative engines. The startup's rapid scaling via product-led adoption culminated in an aggregate $83 million in disclosed venture capital funding from major VCs including Andreessen Horowitz and Bain Capital Ventures, as well as early-stage institutional backers including Pebblebed, Abstract Ventures, and Gradient Ventures. The company's user base surpassed 30 million individuals across 191 countries as of June 2026, according to its website.

The open-weights launch of the Krea 2 model family represents the culmination of Krea's deliberate evolution from a multi-model SaaS aggregator into a self-sustaining media research lab.
Early in its lifecycle, Krea focused on building workflow tools, editing systems, and a node-based automation pipeline that allowed digital artists to unify models from competitors like Runway, Midjourney, and Adobe under a single subscription. However, to insulate itself against upstream platform dependencies and supplier margin pressures, the company aggressively shifted toward developing proprietary architectures. This transition began taking public shape in July 2025 with the open-weights release of the custom-curated FLUX.1 Krea checkpoint, followed in October 2025 by Krea Realtime 14B, an autoregressive video model distilled from Wan 2.1 capable of rendering 11 frames per second on localized enterprise hardware.

This underlying technical maturation parallels Krea's accelerating push into high-end enterprise workflows. Large-scale creative production operations have shifted toward treating Krea as core creative infrastructure; for example, the digital creative services platform Superside reported migrating workflows from fragmented open-source setups to route roughly 80 percent of its total AI generative production through Krea. Furthermore, Krea established a strategic co-development partnership with Copenhagen-headquartered architecture firm Henning Larsen to build highly restricted, domain-specific design tools tuned to meet the compliance frameworks mandated by the EU AI Act.

By releasing Krea 2 Raw and Turbo as open weights, Krea is continuing its expansion from an AI tools provider to being a model provider in its own right.

An alternative to typical rigid AI imagery APIs?

Creators are focusing heavily on the structural freedom offered by the unaligned Raw checkpoint, viewing it as an important alternative to the locked-down APIs provided by closed-source models. Through the official announcement on X, Krea emphasized the foundational shift this launch represents for open AI workflows.
Developers note that by treating AI as an "actual creative medium" that feels "raw, flexible, unopinionated, and unconstrained," Krea is intentionally providing an infrastructure that creators can "break if [they] want to," moving far away from the rigid safety guardrails that frequently limit the visual range of competing enterprise tools.

As independent model builders begin working with the Hugging Face repositories, the practical value of the release will be determined by how effectively the open-source community can scale customized LoRAs using Krea 2 Raw. By providing clear commercial terms and lowering hardware entry barriers via Turbo's 8-step inference pipeline, Krea has introduced a highly competitive alternative in the open-weights market, challenging dominant models by prioritizing artistic control over centralized corporate alignment.

Anthropic on Tuesday launched Claude Tag, a new product that embeds its most advanced AI model directly inside Slack as a persistent, shared teammate that anyone on a team can delegate work to by simply typing @Claude. The product, available today in beta for Claude Enterprise and Team customers, replaces Anthropic's existing Claude in Slack app and represents the company's most aggressive move yet to colonize the enterprise collaboration layer: the place where decisions get made, work gets assigned, and institutional knowledge accumulates in real time.

For enterprise technology leaders who have spent the past two years evaluating where AI fits into their operational stack, Claude Tag reframes the question entirely. This is not a chatbot, a coding assistant, or a search tool bolted onto a messaging platform. It is an AI agent designed to function as a standing member of a team, one that builds memory, takes initiative, works asynchronously, and interacts with every person in a channel rather than serving a single user. The implications for enterprise workflow, governance, and vendor strategy are significant.

Anthropic says 65% of its own product team's code is now created by its internal version of Claude Tag, and the company runs internal support and data insight channels through the same system. The claim is striking: Anthropic is asserting that the majority of its own product engineering output already flows through the tool it just put in customers' hands.

How Claude Tag works inside enterprise Slack channels

At its core, Claude Tag works like this: an administrator pairs it with a Slack workspace, grants it access to specific tools and data sources, sets spending limits, and defines which channels it can operate in.
From that point on, any team member in those channels can tag @Claude with a request (write a pull request, pull sales numbers, run a data analysis) and Claude will break the task into stages, execute them using the tools it has access to, and respond in a Slack thread with the result. The product runs on Claude Opus 4.8, the model Anthropic released less than a month ago.

Four capabilities differentiate Claude Tag from its predecessors and from competing integrations.

First, it is multiplayer. Within a given Slack channel, there is one Claude that interacts with everyone, not a separate instance per user. Anyone can see what it is working on, and anyone can pick up the conversation where the last person left off. This is a direct contrast to most existing AI integrations in Slack, which tend to operate as single-player tools.

Second, it learns over time. As Claude follows along with its channel, it accumulates context about the work happening there. Users do not need to re-explain projects from scratch. If granted permission, Claude can also pull context from other Slack channels and data sources, though Anthropic says it will not report from private channels.

Third, it takes initiative. With ambient behavior enabled, Claude will proactively surface relevant information from across the channels it monitors and the tools it is connected to, and will follow up on threads or tasks that have gone quiet without resolution. This is a notable expansion of agency: Claude is not just responding to requests but monitoring the information environment and deciding what its human teammates need to know.

Fourth, it works asynchronously, pursuing projects autonomously over hours or days. Anthropic says its own teams "now spend much more of our time delegating tasks to many Claudes in parallel."

Enterprise security controls and administrative governance get a central role

Anthropic has designed the system with enterprise-grade isolation at its center.
System administrators define separate Claude identities for different uses, scoped to specific channels with specific tools and data access. Everything, including Claude's accumulated memories, stays within those boundaries. A Claude configured for sales work will not share memories or data access with one configured for engineering. Administrators can set token-spend limits at both the organizational and channel level, and can review a complete log of every action Claude has taken and which user requested each task. For organizations managing compliance, audit, or regulatory requirements, this logging and scoping architecture is table stakes — and its absence has been a dealbreaker for many enterprises evaluating AI collaboration tools over the past year. Migration from the existing Claude in Slack app requires an administrator opt-in within 30 days, and Anthropic says it is issuing introductory launch credits to eligible Enterprise and Team organizations. The four-step setup process — pair with Slack, connect tools, set spend limits, test in a private channel — is designed to reduce friction for IT teams already managing sprawling SaaS portfolios. The Slack battleground is now the most contested real estate in enterprise AI Claude Tag arrives in the middle of what has become the most fiercely contested territory in enterprise AI: the Slack channel. Slack itself has been aggressively positioning the platform as an "agentic operating system," and the major AI players have responded by racing to plant their flags. Salesforce, which acquired Slack for $27.7 billion in 2021, announced more than 30 new capabilities for Slackbot in March — the most sweeping overhaul of the platform since the acquisition — transforming it from a simple conversational assistant into a full-spectrum enterprise agent. 
OpenAI introduced "Workspace Agents" in April, allowing enterprise subscribers to design agents that take on work tasks across third-party apps including Slack, Google Drive, Microsoft apps, Salesforce, and Notion. Perplexity launched its enterprise "Computer" agent with direct Slack integration, letting employees query @computer directly inside Slack channels. Cognition's Devin, the autonomous AI software engineer, has been built around Slack as a primary interface since its early days. Even Microsoft has brought GitHub Copilot into Teams. The logic driving this convergence is straightforward: the average enterprise juggles over 1,000 applications, and employees waste countless hours on context switching, draining productivity by up to 40%. Whichever AI system becomes the default presence in the communication layer where work is coordinated gains an enormous distribution advantage — and, critically, an enormous data advantage. The AI that lives in the channel where work happens absorbs the institutional context that makes it increasingly difficult to replace. Anthropic built Claude Tag on a foundation two years in the making To understand Claude Tag's strategic significance, it helps to trace the product arc that led to it. Anthropic first integrated Claude with Slack in October 2025, offering two-way connectivity: users could invoke Claude from within Slack or connect Slack as a data source for Claude's chatbot. The initial integration was focused on individual productivity — direct messages, AI assistant panels, and thread participation. In January 2026, Anthropic expanded Claude's Slack presence when it launched interactive Claude apps, which included workplace tools like Slack, Canva, Figma, Box, and Clay. In parallel, Anthropic was building out its enterprise infrastructure stack. 
In August 2025, the company bundled Claude Code into enterprise plans, a move its product lead Scott White called "the most requested feature from our business team and enterprise customers." In April 2026, Anthropic launched Claude Managed Agents, a suite of composable APIs for building and deploying cloud-hosted AI agents at scale, with early adopters including Notion, Rakuten, Asana, and Sentry. Then came Claude Opus 4.8 in late May, which Anthropic described as "a more effective collaborator" with "sharper judgement, more honesty about its progress, and the ability to work independently for longer than its predecessors." Benchmark improvements included a jump in agentic coding scores from 64.3% to 69.2% and a knowledge work score increase from 1753 to 1890. Claude Tag is the synthesis of all of these threads — combining the Slack channel presence, the enterprise security architecture, the Managed Agents infrastructure, and the Opus 4.8 model's improved agentic capabilities into a single product that Anthropic frames as "the beginning of an evolution of Claude Code." Anthropic's explosive growth explains why it is betting big on the collaboration layer The financial stakes behind this launch are enormous. Anthropic raised $65 billion in Series H funding in late May at a $965 billion post-money valuation, and its run-rate revenue crossed $47 billion earlier this month. Claude Code's run-rate revenue alone has grown to over $2.5 billion, more than doubling since the beginning of 2026, and enterprise use has grown to represent over half of all Claude Code revenue. Those numbers explain why Anthropic is investing so heavily in channel-level presence. Every enterprise customer who grants Claude persistent access to a Slack channel — with connected tools, accumulated context, and ambient monitoring enabled — represents a dramatically deeper integration than a chatbot conversation or an API call. 
The usage patterns become stickier, the token consumption grows, and the switching costs rise. Deloitte's deployment of Claude across more than 470,000 employees in 150 countries — reportedly its largest-ever enterprise AI deployment — illustrates the scale at which these dynamics play out. The broader market trajectory reinforces the bet. Fortune Business Insights projects the global agentic AI market will grow from $9.14 billion in 2026 to $139 billion by 2034, and Gartner forecasts that 40% of enterprise applications will feature task-specific AI agents by 2026, up from less than 5% in 2025. Anthropic is not alone in seeing this future, but with Claude Tag it is making one of the most direct plays yet to own the enterprise agent layer. The risks enterprise buyers need to weigh before granting Claude a permanent seat at the table Claude Tag raises several questions that enterprise buyers will need to evaluate carefully. The first is vendor dependency. As VentureBeat reported when analyzing Claude Managed Agents earlier this year, once an organization's agents, operational configurations, and monitoring run on Anthropic's managed infrastructure, switching costs increase significantly. Claude Tag deepens this dynamic: a Claude that has accumulated months of channel context and institutional memory becomes very difficult to replace. Enterprise procurement teams accustomed to negotiating multi-cloud flexibility will need to think hard about what it means to give a single vendor's AI persistent access to the communication layer where institutional knowledge lives. The second is governance around ambient monitoring. The proactive behavior mode — in which Claude monitors channels and surfaces information it decides is relevant — represents a meaningful expansion of what enterprise AI systems do. 
Organizations will need to develop clear frameworks for an AI agent that is not just responding to requests but actively surveilling information flows and making editorial judgments about what humans need to know. For regulated industries, this raises questions that existing AI governance policies may not yet address. The third is pricing. Anthropic has not published detailed pricing for Claude Tag beyond noting that it runs on token-based spending with administrative controls. For an agent that monitors channels continuously, builds memory, and works asynchronously over hours or days, the token consumption profile could look very different from traditional AI usage. And the fourth is reliability: Anthropic has been candid in recent months about infrastructure strain caused by surging demand, and for a product positioned as an always-on team member, downtime carries a different kind of cost than it does for a tool invoked on demand. What Claude Tag signals about the future of enterprise work Anthropic says its goal is to expand Claude Tag beyond Slack "so that teams can tag @Claude in the many other places they work." The company is clearly eyeing the full collaboration surface — Microsoft Teams, email, project management tools, and beyond. If Claude Tag succeeds, it will validate a model of enterprise AI that looks less like a tool and more like a new category of worker: one that never sleeps, never forgets what was discussed in the channel last Tuesday, and never needs to be onboarded twice. But the deeper significance of this launch may be what it reveals about the competitive dynamics reshaping enterprise software. For decades, the most valuable real estate in business technology was the system of record — the database, the CRM, the ERP. The current AI arms race suggests that the next era of enterprise value will be captured not by the system that stores the data, but by the agent that sits in the room where the work happens and understands what to do with it. 
Anthropic just gave that agent a name, a permanent seat in the channel, and permission to speak up when it thinks it has something to say. The question for every enterprise technology leader is no longer whether that agent will arrive. It is whether they are ready to manage it when it does.
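For teams weighing the governance questions raised above, it can help to picture the administrative controls the article describes: identities scoped to channels and tools, token-spend limits at the organization and channel level, and a log of every action with the requesting user. The sketch below is only an illustration under those assumptions. The class names, fields, and the `authorize` method are hypothetical and are not Anthropic's configuration schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class AgentIdentity:
    """Hypothetical scoped agent identity (not Anthropic's schema)."""
    name: str
    channels: set[str]
    tools: set[str]
    channel_token_limit: int
    spent_by_channel: dict[str, int] = field(default_factory=dict)


@dataclass
class Workspace:
    """Hypothetical workspace holding the org-wide limit and the audit log."""
    org_token_limit: int
    org_spent: int = 0
    audit_log: list[dict] = field(default_factory=list)

    def authorize(self, agent: AgentIdentity, user: str, channel: str,
                  tool: str, tokens: int) -> bool:
        """Allow a task only if it is in scope and under both spend limits."""
        spent = agent.spent_by_channel.get(channel, 0)
        allowed = (
            channel in agent.channels
            and tool in agent.tools
            and spent + tokens <= agent.channel_token_limit
            and self.org_spent + tokens <= self.org_token_limit
        )
        if allowed:
            agent.spent_by_channel[channel] = spent + tokens
            self.org_spent += tokens
        # Every request is logged with the requesting user, allowed or not.
        self.audit_log.append({
            "agent": agent.name, "user": user, "channel": channel,
            "tool": tool, "tokens": tokens, "allowed": allowed,
        })
        return allowed


# Example: a sales-scoped identity cannot act in an engineering channel.
sales = AgentIdentity("claude-sales", {"#sales"}, {"crm"}, channel_token_limit=1000)
workspace = Workspace(org_token_limit=5000)
in_scope = workspace.authorize(sales, "dana", "#sales", "crm", 400)
out_of_scope = workspace.authorize(sales, "dana", "#engineering", "crm", 400)
over_limit = workspace.authorize(sales, "lee", "#sales", "crm", 700)
```

The design point is that scope and spend are checked in one place, and the log records denied requests as well as allowed ones, which is what an audit trail needs.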

Presented by F5 When enterprises move AI workloads from pilot to production, data delivery often becomes the factor that determines whether those systems can scale reliably. Point-to-point architectures connecting storage directly to compute hold up under demonstration conditions, but they often break down under sustained, concurrent production traffic. The result is stalled inference pipelines, delayed RAG systems, underutilized GPUs, and SLA violations, all of which carry direct business consequences. "Organizations successfully operationalize AI when their infrastructure is built to handle real-world failures, not just controlled conditions," says Hunter Smit, senior manager of product marketing at F5. Production traffic exposes architectural weaknesses In a pilot, a stalled transfer is an inconvenience, while in production, that same stall is an outage someone now owns. The underlying architecture is often identical in both cases: when a client is wired directly to storage, the system becomes increasingly fragile under sustained, concurrent production traffic because that direct connection has no answer when a node fails or traffic spikes. From there, retries and timeouts cascade, and the entire pipeline backs up right at the moment the business is depending on the output. "Point-to-point architectures, where the S3 client connects directly to S3 storage, are not resilient," says Paul Pindell, principal solutions architect for technology alliances at F5. "If a single storage node fails, all traffic to that cluster degrades, and in some cases the cluster can fail entirely." The problem is that AI workflows, including RAG-based inference and agentic AI, increasingly treat S3 storage as a first-class citizen in the AI cluster. However, the network connectivity between that storage and the cluster was never designed for the high-throughput, uninterrupted data movement that's needed to keep GPUs running optimally. 
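To make the cascade concrete, here is a minimal sketch of the alternative to a single hard-wired storage connection: a client that tries several endpoints, caps its retries, and backs off with jitter so that one failed node does not become a retry storm. It is not F5's implementation or any S3 SDK's behavior. The endpoint names, the `fetch` callable, and `demo_fetch` are hypothetical stand-ins.

```python
import random
import time
from typing import Callable, Sequence


class AllEndpointsFailed(Exception):
    """Raised when every endpoint was tried and none returned data."""


def fetch_with_failover(
    endpoints: Sequence[str],
    fetch: Callable[[str], bytes],
    max_attempts_per_endpoint: int = 2,
    base_delay_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Try each endpoint in order, with a bounded number of retries each."""
    failures = 0
    for endpoint in endpoints:
        for attempt in range(max_attempts_per_endpoint):
            try:
                return fetch(endpoint)
            except ConnectionError:
                failures += 1
                # Exponential backoff with full jitter before the next try.
                sleep(random.uniform(0, base_delay_s * (2 ** attempt)))
    raise AllEndpointsFailed(
        f"{failures} failed attempts across {len(endpoints)} endpoints"
    )


def demo_fetch(endpoint: str) -> bytes:
    """Hypothetical stand-in for a storage GET; node-a is simulated as down."""
    if endpoint == "node-a":
        raise ConnectionError("node-a unreachable")
    return f"object-from-{endpoint}".encode()


# With a second endpoint available, the failed node is routed around.
data = fetch_with_failover(["node-a", "node-b"], demo_fetch, sleep=lambda s: None)
```

A point-to-point client is the degenerate case of this function with a single endpoint, which is why a lone node failure surfaces directly as a pipeline stall.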
The real cost of stalled pipelines and underutilized GPUs "Enterprise leaders tend to frame AI infrastructure around GPU utilization, but what makes AI different from traditional deterministic workloads is that infrastructure continuously influences those outcomes at every interaction," says Tanu Mutreja, senior director of product management at F5. "In AI environments, infrastructure is no longer just a back-end concern. It shapes customer experience, quality, resilience, and cost with every transaction." There can be significant business consequences. For instance, when inference pipelines stall, it becomes an SLA and customer experience issue. When RAG systems are delayed, models lose access to timely, relevant context, which results in inaccurate, outdated, or hallucinated responses, all of which create operational, compliance, and reputational risks. At the same time, the infrastructure issues that create those problems can also drive up costs by leaving expensive GPU resources idle or underutilized. "When GPUs are underutilized, it signals infrastructure inefficiencies that inflate costs while limiting scalability and responsiveness," Mutreja says. "The leadership question is whether the end-to-end AI infrastructure consistently delivers reliable, secure, high-quality, and governed AI experiences at sustainable unit economics." Building a production-ready data delivery layer F5 treats data delivery as a first-class infrastructure layer rather than assuming the network path will simply work. Where application delivery optimized the flow of requests between users and applications, data delivery optimizes the flow of data between storage, networks, and compute, including AI compute. Making data delivery a first-class layer means building three properties into it: Observability provides real-time visibility into latency, throughput, and flow health. 
Programmability enables policy-driven control over how data moves, through dynamic routing, traffic optimization, rate management, and automated failover. Failure-awareness builds resilience for degraded networks, storage throttling, and service disruptions. In the architecture F5 has developed for Dell ObjectScale, F5 BIG-IP sits between ObjectScale and AI compute as a programmable control point at the storage edge. "We have seen cases where a misconfiguration in the AI compute layer effectively DDoS'd the S3 storage infrastructure," Pindell says. "Not in a malicious way, more of an 'Oh no, what did I do?' moment, but it still took storage down for the entire organization." Placing BIG-IP as the application delivery controller between the storage and compute layers protects storage with QoS, rate limits, and connection limits, keeping it resilient and operational under that kind of load. SecureIQLab-validated testing confirmed that this protection does not come at the cost of throughput, which matters architecturally, Pindell says. "Preserving, and even improving, throughput is a must-have," he explains. "It's what lets you layer on the higher-level functionality, resilience and enhanced security, without giving up performance to get there." The added complexity of hybrid and multicloud AI AI deployments in hybrid multicloud environments face an even greater data delivery challenge because of the heterogeneity involved. Data traversing these environments must contend with inconsistent policies, security controls, identity systems, governance requirements, fragmented visibility, and distinct failure boundaries. Programmable traffic management and observability address this complexity together. Observability provides a unified view of application, network, and infrastructure health across otherwise disconnected environments. Programmable traffic management uses those insights to intelligently route, balance, and fail over traffic in real time.
Together, they create a closed-loop feedback system that enforces consistent policies, improves resilience across failure domains, and ensures reliable, high-performance AI data delivery regardless of where applications, data, or users reside. What separates production AI from perpetual pilots The organizations that move beyond perpetual pilots share a specific engineering discipline, Smit says. "They're the ones that reach for production design with failure as the normal state, not the exception," he explains. "They will assume latency, congestion, and partial outages will happen. And they build a data path observable and failure-aware enough to absorb them, with explicit mitigation for every degraded condition rather than a hope that the network will hold." Organizations stuck in perpetual pilots are still optimizing for the perfect lab result and discovering the real-world gap only when a workload goes live. The issue is not model quality or GPU count, but whether the data delivery layer was engineered with the same rigor as the compute. "Teams need to understand that a real-world network behaves very differently from an optimized lab network," Pindell says. "They need a mitigation plan for the failure states and performance bottlenecks they will hit in production." Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Alibaba Cloud on Sunday released HappyHorse 1.1, a major upgrade to its AI video generation model that the company says delivers production-ready video synthesis across core content creation scenarios. The model is now live on Alibaba Cloud Model Studio with full API access for enterprise customers and developers, accompanied by a 40% sitewide launch discount for the first two weeks. The release arrives at a moment of remarkable upheaval in the AI video generation market — and Alibaba appears keenly aware of the timing. OpenAI discontinued Sora after it proved financially unsustainable. ByteDance indefinitely shelved the international rollout of Seedance 2.0 following a barrage of copyright complaints from Hollywood studios. For enterprise procurement teams that had been evaluating or integrating those tools into marketing, advertising, and content production workflows, the competitive landscape has contracted sharply in a matter of months. That contraction creates both an opportunity and a test for Alibaba. HappyHorse 1.1 is not a research demo or a consumer toy — it is an API-first product built for integration into enterprise software stacks, priced for volume, and backed by a $52.7 billion global infrastructure buildout. Whether it can convert technical capability into enterprise adoption, particularly in Western markets navigating intensifying U.S.-China tech tensions, will determine whether Alibaba can establish itself as a serious player in the generative video market that analysts expect to reach tens of billions of dollars by the end of the decade. How HappyHorse climbed from anonymous benchmark entry to top-ranked video model HappyHorse first appeared in early April as an anonymous submission on the Artificial Analysis Video Arena, an independent benchmarking platform where real users compare model outputs in blind, side-by-side evaluations. The model immediately claimed the top position in both text-to-video and image-to-video rankings. 
Alibaba was subsequently confirmed as the creator, revealing it was built by the company's ATH (Alibaba Token Hub) AI Innovation Unit — a team previously part of the Future Life Lab under the Taobao and Tmall Group before a strategic organizational restructuring. According to Arena.ai, HappyHorse 1.0 now holds the No. 2 position across all three Video Arena leaderboards. The platform noted the model scores 1,444 in both text-to-video and image-to-video categories, leading Google's Veo-3.1 (with audio) by 69 points in text-to-video and xAI's Grok-Imagine-Video by 23 points in image-to-video. In Elo-based ranking systems like Arena's, models gain or lose points based on whether users prefer their outputs in head-to-head comparisons, meaning persistent double-digit leads reflect a consistent quality gap as perceived by human evaluators — not a statistical fluke. The model's architecture helps explain why. According to community-compiled technical documentation, HappyHorse is built around a 15-billion-parameter unified self-attention Transformer that processes text, image, video, and audio tokens within a single token sequence. Unlike many competitors that stitch together separate models for video and audio, HappyHorse operates as a unified system that handles all modalities in a single generation pass, eliminating the need for third-party dubbing or post-processing audio tools. For enterprise buyers evaluating total cost of ownership, that architectural simplicity translates directly into fewer integration points, fewer vendor dependencies, and faster time to production. What the 1.1 upgrade fixes — and why it matters for commercial video production The 1.1 upgrade targets a set of pain points that enterprise video production teams know intimately. 
Alibaba Cloud described the release as "systematically optimized across core content generation scenarios," and the specific improvements reveal a model that has been tuned for commercial deployment rather than viral social media demos. The most consequential upgrade is multi-image reference capability, which Alibaba calls R2V (Reference-to-Video). The feature allows users to upload multiple character reference images and maintain consistent identity across generated video — directly addressing one of the hardest problems in AI video production, where subjects tend to drift in appearance between frames or shots. For brands producing advertising campaigns, product videos, or serialized marketing content, identity consistency is not a nice-to-have; it is a requirement that has historically forced teams back to traditional production methods. Motion quality receives a significant overhaul, with what Alibaba describes as "strengthened motion modeling" that addresses prior limitations in speed and fluidity. The company also made targeted improvements to visual texture, specifically calling out the elimination of "facial oiliness," "over-sharpening," and "unnatural textures" — artifacts that have plagued commercial AI video since the technology emerged and that immediately signal to viewers that content is machine-generated. Two additional upgrades round out the release. HappyHorse 1.1 improves audio-visual synchronization, including what Alibaba claims is "zero-drift lip sync" for dialogue scenes and context-aware speech pacing — building on the 1.0 version's already notable ability to generate up to 15 seconds of 1080p video with synchronized audio output. The model also improves instruction-following for long and complex prompts, a critical differentiator for enterprise users who need to specify precise camera movements, lighting conditions, and narrative beats in a single generation pass rather than iterating through dozens of attempts. 
Sora's collapse and Seedance's freeze leave enterprise buyers with fewer choices than ever The competitive context surrounding this launch is unusually favorable for Alibaba, and it is worth understanding why. OpenAI's Sora web and app experiences were discontinued on April 26, with the Sora API set to follow on September 24. The shutdown came after the product proved financially untenable: Sora cost roughly $1 million per day to operate but generated only about $2.1 million in total revenue, while active users dropped from a peak near 1 million to under 500,000. For enterprise teams that had integrated Sora into production pipelines, the abrupt withdrawal underscored the risks of depending on AI products that lack a sustainable business model — a cautionary tale that procurement officers are unlikely to forget quickly. ByteDance's Seedance 2.0, which many considered Sora's most formidable successor, ran into a different kind of wall. Netflix, Warner Bros., Disney, Paramount, and Sony sent legal threats to ByteDance over allegations of systematic copyright infringement after users generated viral clips featuring Hollywood intellectual property. ByteDance indefinitely postponed the international launch, and the global rollout remains suspended. That leaves Google's Veo 3.1 as the primary Western competitor in the enterprise video generation space. But Alibaba's Arena rankings suggest HappyHorse is outperforming Veo on user-perceived quality, and the 40% launch discount on Alibaba Cloud Model Studio could make HappyHorse significantly cheaper at scale. At the 1.0 level, pricing through third-party API platforms ran roughly $1.82 per 10-second clip at 720p and $3.12 at 1080p. With the promotional pricing, HappyHorse 1.1 could bring production-quality AI video generation within reach of mid-market companies and agencies that previously considered the technology too expensive for anything beyond experimentation. 
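As a rough back-of-the-envelope check on that pricing claim, the arithmetic can be sketched as follows. The per-clip figures are the third-party HappyHorse 1.0 prices quoted above. Applying the 40% launch discount to those same figures is an assumption made purely for illustration, since the article does not give Alibaba Cloud Model Studio list prices for 1.1.

```python
# Third-party API prices for HappyHorse 1.0 quoted in the article
# (USD per 10-second clip).
PRICE_PER_10S_CLIP = {"720p": 1.82, "1080p": 3.12}

# ASSUMPTION: the 40% launch discount applies to these same prices.
LAUNCH_DISCOUNT = 0.40


def discounted_clip_price(resolution: str, discount: float = LAUNCH_DISCOUNT) -> float:
    """Price of one 10-second clip after the discount, rounded to cents."""
    return round(PRICE_PER_10S_CLIP[resolution] * (1 - discount), 2)


def campaign_cost(clips: int, resolution: str, discount: float = LAUNCH_DISCOUNT) -> float:
    """Total cost of a batch of 10-second clips at one resolution."""
    return round(clips * PRICE_PER_10S_CLIP[resolution] * (1 - discount), 2)


price_720p = discounted_clip_price("720p")
price_1080p = discounted_clip_price("1080p")
batch_1080p = campaign_cost(500, "1080p")
batch_1080p_full_price = campaign_cost(500, "1080p", discount=0.0)
```

Under that assumption, a 500-clip 1080p campaign would cost several hundred dollars less than at the quoted 1.0 prices, which is the scale of saving the mid-market argument rests on.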
Alibaba's $52.7 billion infrastructure bet gives HappyHorse a distribution advantage rivals can't match HappyHorse 1.1 does not exist in isolation. It sits atop a global infrastructure offensive that distinguishes Alibaba from pure-play AI model companies that build impressive technology but lack the physical and commercial machinery to serve regulated enterprise customers at scale. Just five days before the HappyHorse 1.1 launch, Alibaba Cloud opened its first data centers in France, establishing its third European hub after Germany and the United Kingdom. The Paris region features two availability zones, bringing the company's global footprint to 105 availability zones across 32 regions. "The expansion of our cloud infrastructure into France reinforces our ongoing commitment to empowering European businesses with sovereign, secure, and intelligent solutions," said Dr. Feifei Li, Alibaba Cloud's CTO and president of international business, in the company's announcement. In Japan, the company opened its fifth data center in Tokyo on June 19. As reported by Data Center Dynamics, CEO Eddie Wu has committed to investing $52.7 billion in building a "unified global cloud network," with the company later considering increasing this to $69 billion. This year alone, Alibaba has launched new regions in Mexico, Thailand, Malaysia's Johor, and France. The France deployment is also part of Alibaba Cloud's plan to roll out enterprise-grade agentic AI services across Europe in the second half of the year, including AgentRun (a development platform for AI agents), STAROps (an intelligent operations platform), and ACS Agent Sandbox (which provides hardware-level security isolation for agent workloads). The infrastructure buildout serves a dual purpose for a product like HappyHorse. 
Running a 15-billion-parameter video generation model with integrated audio is extraordinarily compute-intensive, and having local infrastructure reduces latency for enterprise API calls while keeping customer data within regulatory boundaries. For European buyers operating under the European Commission's new tech sovereignty framework — published June 3 with the explicit goal of protecting the bloc's "digital independence" — the ability to run AI video generation workloads on locally hosted infrastructure is not a luxury. It is increasingly a compliance requirement. The Pentagon listing and geopolitical risk loom over Alibaba's Western ambitions Alibaba's global push is unfolding under significant geopolitical headwinds that enterprise buyers cannot afford to ignore. The Pentagon added Alibaba, along with BYD and Baidu, to its list of Chinese military companies on June 8, preventing them from securing U.S. defense contracts. Alibaba rejected the designation, saying it is "not a Chinese military company nor part of any military-civil fusion strategy." The listing does not automatically trigger sanctions, and it does not directly restrict commercial transactions between private U.S. companies and Alibaba. But it adds a layer of reputational and regulatory complexity to procurement decisions, particularly for companies with U.S. government exposure, defense supply chain connections, or transatlantic operations. Enterprise technology purchases are rarely evaluated on technical merit alone — vendor risk assessments, board-level compliance reviews, and geopolitical scenario planning all factor into buying decisions for cloud infrastructure and AI tooling. For European customers specifically, the calculus is layered in a different way. The continent's growing emphasis on digital sovereignty cuts in two directions simultaneously: it creates demand for alternatives to the dominant U.S. 
hyperscalers (Amazon Web Services, Microsoft Azure, and Google Cloud control roughly 70 percent of European cloud infrastructure revenue, according to Synergy Research Group), but it also raises questions about whether a Chinese provider represents a meaningful improvement in strategic autonomy. Alibaba's strategy of building sovereignty-compliant infrastructure in-market is a direct attempt to answer that question — but the Pentagon listing ensures it will be asked repeatedly. What enterprise teams should watch as the AI video market consolidates The practical implications of HappyHorse 1.1 for enterprise teams are substantial. HappyHorse supports four modes of generation — text-to-video, image-to-video, subject-to-video, and the newly added video editing — covering the full spectrum of commercial video needs from ideation through production to post-production, all with integrated audio at no additional cost. That breadth of capability, delivered through a single API endpoint, simplifies what has historically been a fragmented and expensive production pipeline. The question going forward is whether Alibaba can convert benchmark dominance and competitive timing into durable enterprise relationships. The company plans to release HappyHorse through Alibaba Cloud Model Studio with full enterprise SLAs, security certifications, and regional compliance — the table stakes that separate research breakthroughs from production-grade services. Watch for customer disclosures, usage metrics, and whether third-party platforms like fal.ai and Atlas Cloud (which already host HappyHorse 1.0) update to the 1.1 version quickly, which would signal genuine developer demand beyond Alibaba's own ecosystem. The AI video generation market entered 2026 with three credible enterprise contenders. One is dead. One is frozen. And the one still standing is a Chinese company backed by $52.7 billion in infrastructure spending, ranked No. 
2 across every major independent benchmark, and offering a 40% discount to anyone willing to place the bet. In enterprise technology, the best product does not always win — but it rarely loses when the competition has already left the field.

Last night, the increasingly enterprise-focused AI startup Sakana launched Fugu, a multi-agent orchestration system that delivers frontier-level AI performance through a single, OpenAI-compatible API. Designed for developers, enterprises, and nations seeking resilience against vendor lock-in and geopolitical export controls, Fugu (Japanese for "pufferfish") bypasses the traditional monolithic model structure by dynamically routing queries to a swappable pool of specialized AI agents. Sakana CEO and co-founder David Ha, formerly of Google Brain, positioned Fugu as a more reliable option for enterprise workflows than any single AI model provider after Anthropic's move on June 12 to revoke public access to its most powerful models, Claude Mythos 5 and Claude Fable 5, following a U.S. government export control order. As Ha wrote in a post today on X: "Fugu dynamically orchestrates the world’s best models to tackle complex tasks. We are proving that a well-orchestrated pool of swappable agents can match restricted frontier models like Fable and Mythos. But Fugu is about more than just performance. I believe that Orchestration Models are the next frontier, beyond bigger models. Relying on a single company’s model for national infrastructure is a massive risk. As recent export controls have shown, access to top models can disappear overnight. Collective intelligence is the practical hedge against this concentration of power. Fugu simply routes around vendor restrictions by relying on an entirely swappable agent pool." Sakana AI explicitly states that the specific models Fugu selects and how it coordinates them are proprietary, meaning this routing information is hidden from the user by design. The documentation only refers generally to a "diverse pool of powerful models," "multiple LLMs," or "specialized models" without providing a specific count. 
By acting as a sophisticated coordinator rather than a standalone foundation model, Fugu matches the output quality of top-tier models like Fable and Mythos on third-party benchmarks of agentic tasks, while fundamentally altering how developers deploy critical AI infrastructure. How Sakana Fugu works and where it beats Anthropic's Claude Fable 5 At its core, Sakana Fugu operates like a master general contractor. When presented with a complex request, Fugu does not attempt to execute every step itself. Instead, it breaks the problem down, delegates sub-tasks to a pool of expert foundation models, verifies their work, and synthesizes the final output. "Fugu is itself an LLM, trained to call various LLMs in an agent pool, including instances of itself recursively," the Sakana AI team noted in their technical release. Grounded in two of Sakana's 2026 research papers, TRINITY and the Conductor, the system autonomously manages the entire lifecycle of model selection and verification using learned coordination strategies rather than hand-designed workflows. To the end user, this multi-agent swarm is entirely abstracted behind a standard API endpoint. Sakana AI is offering two variants of the system to cater to different operational workloads: Fugu: A high-speed, low-latency model optimized for everyday tasks. It is designed to act as the default engine for interactive chatbots and integrates directly into coding environments like Codex. Fugu Ultra: The flagship tier engineered for complex, high-stakes tasks such as AI research, cybersecurity analysis, and multi-step patent investigations. According to Sakana, Fugu Ultra coordinates a deeper pool of experts and matches industry-leading monolithic models across rigorous scientific and reasoning benchmarks. 
Additionally, on the pay-as-you-go plan, standard Fugu charges a dynamic rate based on the specific underlying models activated, whereas Fugu Ultra utilizes a fixed pricing structure starting at $5 per million input tokens and $30 per million output tokens. As indicated by benchmark charts shared by Sakana, Fugu actually exceeds the performance of Anthropic's Claude Fable 5 on LiveCodeBench, an open source benchmark testing coding performance on regularly refreshed software problem-solving tasks (Fugu Ultra: 93.2, Fugu: 92.9, Fable: 89.8), and beats the prior Claude Mythos Preview model on GPQA-D (Diamond), a test of 198 graduate-level multiple-choice questions in biology, physics, and chemistry (Fugu Ultra: 95.5, Fugu: 95.5, Mythos Preview: 94.6). By orchestrating multiple models from different providers, Fugu essentially builds native redundancy into the AI stack. If one provider suffers an outage or faces sudden regulatory restrictions, Fugu routes around the disruption to maintain uptime.

Licensing and availability

Fugu is offered as a commercial, proprietary API service, not an open-source framework. Because Sakana’s core intellectual property lies in its non-obvious collaboration patterns, the specific routing information (meaning exactly which underlying models Fugu selects for a given query) remains proprietary and is intentionally hidden from the user. However, Sakana offers critical controls for enterprise data compliance. Developers can explicitly opt specific models or providers out of their Fugu routing pool to maintain strict corporate privacy standards. Additionally, users can opt out of having their prompts used for future training data. Geographically, Fugu is restricted from operating within the European Union (EU) and European Economic Area (EEA) while Sakana works to align its black-box data routing architecture with GDPR regulations.
Pricing is fairly steep Fugu is available immediately in most regions—with the temporary exception of the EU and EEA—at subscription tiers and pay-as-you-go pricing. Teams can opt for monthly subscription allowances designed for individual or hands-on use: a Standard tier at $20/month for lightweight workflows, a Pro tier at $100/month providing 10x standard usage, and a Max tier at $200/month offering 20x usage for continuous, long-running tasks. I wasn't able to find the actual amount of tokens covered under these plans, but I've reached out to Ha on X for more information. As part of the initial rollout, Sakana is offering a free second month for users who subscribe to any tier by July 31, 2026. For enterprise scaling and production deployments, Sakana offers an elastic pay-as-you-go plan. Crucially for high-stakes environments, requests made under this consumption-based model are served at a higher priority than those from monthly subscription plans. Under this framework, the standard Fugu engine charges the single rate of the highest-tier underlying model involved in a query, without ever stacking multi-agent fees. The flagship Fugu Ultra tier (fugu-ultra-20260615) utilizes a fixed pricing structure per one million tokens: $5 for input, $30 for output, and $0.50 for cached input. These rates increase to $10, $45, and $1.00 respectively for extreme workloads utilizing context windows above 272K tokens. 
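To make the Fugu Ultra rate card above concrete, here is a minimal cost-estimation sketch in Python. The rates and the 272K threshold come from the figures quoted above; the function name, the `Rates` class, and the assumption that the long-context tier is triggered by the combined uncached and cached prompt tokens are illustrative guesses, since the article does not say exactly how Sakana measures the context window for billing.

```python
from dataclasses import dataclass

# Threshold quoted in the article; how Sakana measures "context" is an assumption here.
LONG_CONTEXT_THRESHOLD = 272_000


@dataclass(frozen=True)
class Rates:
    """USD per one million tokens (hypothetical container, not a Sakana type)."""
    input_per_million: float
    output_per_million: float
    cached_input_per_million: float


STANDARD_RATES = Rates(5.00, 30.00, 0.50)       # up to 272K tokens of context
LONG_CONTEXT_RATES = Rates(10.00, 45.00, 1.00)  # above 272K tokens of context


def estimate_fugu_ultra_cost(input_tokens: int, output_tokens: int,
                             cached_input_tokens: int = 0) -> float:
    """Estimate the USD cost of one request under the published rate card.

    Assumption: the long-context tier applies when uncached plus cached
    prompt tokens exceed the threshold.
    """
    prompt_tokens = input_tokens + cached_input_tokens
    rates = (LONG_CONTEXT_RATES if prompt_tokens > LONG_CONTEXT_THRESHOLD
             else STANDARD_RATES)
    total = (input_tokens * rates.input_per_million
             + output_tokens * rates.output_per_million
             + cached_input_tokens * rates.cached_input_per_million)
    return total / 1_000_000


if __name__ == "__main__":
    print(f"${estimate_fugu_ultra_cost(100_000, 10_000, 50_000):.3f}")
```

Any output tokens that Sakana bills for a request would need to be included in `output_tokens` for the estimate to track an actual invoice.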
That pricing puts Fugu Ultra among the more expensive options compared with single AI models accessed via provider APIs:

VentureBeat Frontier AI Model API Pricing Snapshot

Model | Input | Output | Total Cost | Source
MiMo-V2.5 Flash | $0.10 | $0.30 | $0.40 | Xiaomi MiMo
deepseek-v4-flash | $0.14 | $0.28 | $0.42 | DeepSeek
deepseek-v4-pro | $0.435 | $0.87 | $1.305 | DeepSeek
MiniMax-M3 | $0.30 | $1.20 | $1.50 | MiniMax
Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | Google
Qwen3.7-Plus | $0.40 | $1.60 | $2.00 | Alibaba Cloud
MiMo-V2.5 | $0.40 | $2.00 | $2.40 | Xiaomi MiMo
Grok 4.3 (low context) | $1.25 | $2.50 | $3.75 | xAI
MiMo-V2.5 Pro (≤256K) | $1.00 | $3.00 | $4.00 | Xiaomi MiMo
Kimi-K2.6 | $0.95 | $4.00 | $4.95 | Moonshot
GLM-5.2 | $1.40 | $4.40 | $5.80 | Z.ai
Grok 4.3 (high context) | $2.50 | $5.00 | $7.50 | xAI
MiMo-V2.5 Pro (>256K) | $2.00 | $6.00 | $8.00 | Xiaomi MiMo
Qwen3.7-Max | $2.50 | $7.50 | $10.00 | Alibaba Cloud
Gemini 3.5 Flash | $1.50 | $9.00 | $10.50 | Google
Gemini 3.1 Pro Preview (≤200K) | $2.00 | $12.00 | $14.00 | Google
GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI
Gemini 3.1 Pro Preview (>200K) | $4.00 | $18.00 | $22.00 | Google
Claude Opus 4.8 | $5.00 | $25.00 | $30.00 | Anthropic
GPT-5.5 | $5.00 | $30.00 | $35.00 | OpenAI
Sakana Fugu Ultra | $5.00 | $30.00 | $35.00 | Sakana AI
Claude Fable 5 / Claude Mythos 5 | $10.00 | $50.00 | $60.00 | Anthropic

Developers modeling operational costs should also note a significant architectural caveat in how Fugu bills for its multi-agent capabilities. According to the developer documentation, Fugu Ultra’s API responses include detailed usage fields that separate user-visible token generation from internal orchestration work. The background tokens consumed and generated when Fugu delegates sub-tasks, verifies code, or routes between underlying agents are not absorbed by the provider; they represent real token usage and are counted toward the final price of the request at standard rates.

The Orchestration landscape: Fugu vs. The Field and notable benchmark performance

To understand Fugu’s position in the mid-2026 AI ecosystem, it is critical to distinguish between model routing and multi-agent orchestration.
Over the past year, enterprise adoption of standard routing platforms—such as Not Diamond, Martian, and the open-source RouteLLM framework—has skyrocketed. These systems act as intelligent air traffic controllers; using semantic classifiers or meta-models, they analyze an incoming prompt and predict which single foundation model will yield the highest quality or most cost-effective response, dispatching the query accordingly. Fugu operates on a fundamentally different paradigm. Rather than making a one-shot routing decision, Fugu aligns more closely with complex multi-round systems like Router-R1 (a framework introduced at NeurIPS 2025). It breaks a query down, interleaves reasoning with delegation, and dynamically assigns sub-tasks to multiple models in parallel or sequence before synthesizing a final output. While frameworks like LangGraph, CrewAI, and Microsoft AutoGen offer developers the tools to build similar multi-agent systems, they require immense manual configuration—defining roles, setting up conditional edges, and managing state across long-running loops. Fugu abstracts this operational overhead entirely. It is essentially a LangGraph-style workflow packaged as a single, black-box API endpoint. An orchestration system is ultimately bounded by the raw capabilities of the underlying models in its pool, a reality reflected in Sakana’s own benchmark testing against standalone frontier models. On rigorous coding and agentic tasks, collective intelligence shows a distinct advantage over standard models. Fugu Ultra posted a 73.7 on SWE-Bench Pro, significantly outperforming Anthropic's Claude Opus 4.8 (69.2) and OpenAI's GPT-5.5 (58.6). However, Fugu is not a silver bullet, and its performance is not a clean sweep across the board. 
When compared to highly specialized or restricted-access monolithic models, Fugu occasionally trails: SWE-Bench Pro: While Fugu Ultra (73.7) beat most accessible models, it was comfortably eclipsed by Anthropic’s limited-access Fable 5 (80.0), which is currently absent from Fugu's swappable pool due to the U.S. government's export control order and Anthropic's subsequent response to remove the model entirely from global usage. Humanity's Last Exam: Fugu Ultra (50.0) narrowly edged out Opus 4.8 (49.8), but again fell short of Fable 5 (53.3). Long-Context and Security: On the MRCRv2 long-context-recall test, OpenAI's GPT-5.5 maintained the lead (94.8 vs Fugu Ultra's 93.6), and Opus 4.8 remained the top performer on the CTI-REALM cybersecurity benchmark (69.6 vs Fugu Ultra's 69.4). The quantitative data points to a clear conclusion: Fugu is highly effective at boosting performance on messy, multi-step tasks (like writing a complex HTML5 game from scratch) by leaning on the combined strengths of multiple mid-tier and high-tier models. However, for sheer brute-force reasoning within a single, highly constrained domain, the industry's largest standalone models still hold the edge—provided an enterprise can maintain uninterrupted access to them. Background on Sakana's formation and noteworthy achievements to date Sakana AI was formed in Tokyo in 2023 by Llion Jones, a co-author of Google’s foundational 2017 "Attention Is All You Need" paper, and David Ha, the former head of research at Stability AI. Disillusioned by large tech company bureaucracy and the industry's hyper-fixation on scaling single, massive foundational models, the founders built Sakana around principles of biomimicry and evolutionary computing. The company's name, derived from the Japanese word for fish, reflects its core technical thesis: utilizing collective "swarm" intelligence rather than brute-force compute. 
Following a $2.6 billion Series B valuation in late 2025 and the recent June 2026 launch of Marlin—an autonomous, eight-hour research agent for the B2B sector—Fugu represents the commercialization of Sakana's multi-agent routing technology for everyday developers. A mixed reception among the broader AI community online The developer community has responded to Fugu by rigorously testing its practical tradeoffs, weighing its routing efficiencies against the sheer power of monolithic foundation models. AI observer, developer and influencer Chris (@ChrissGPT on X) highlighted the specific utility of Fugu over raw foundational AI. "For a single clean prompt, you probably would [use Fable 5, Mythos, or GPT-5.5 directly]," he noted, but argued that Fugu's true value emerges in messy, multi-step environments. "...whether it involves delegation, verification, synthesis, code review, research loops, security analysis... the more it would make sense to use this," he wrote. Chris also pointed out the strategic geopolitical advantage of Fugu's architecture, noting that if frontier AI access is abruptly revoked due to regulation or export controls, an orchestrator can dynamically swap models to prevent a total system failure. Creative agency owner Mark Santos (@markksantos) of Mark Studios provided a direct, real-world comparison by tasking both Fugu Ultra and Claude Opus 4.8 with building a "Crossy Road" game clone using Three.js. The results underscored the operational differences between an orchestrator and a monolithic giant: Sakana Fugu Ultra: Completed the task in 22 minutes using ~89,000 tokens for roughly $7.32. However, the final game suffered from minor logic errors, such as inverted directional turns and wonky camera angles. Claude Opus 4.8: Took 79 minutes, burned ~940,000 tokens for nearly $37.85, and got stuck in a retry loop requiring human intervention. Despite the inefficiency, it ultimately produced superior application design and functionality. 
Santos concluded the experiment by stating, "In terms of application functionality, quality, and design, Opus won. In terms of model speed and performance, Fugu... won." Elie Bakouch, a research engineer at cloud-based, open AI infrastructure and systems provider Prime Intellect, pointed out on X that "to be clear, this is a closed source orchestrator on top of closed source models. if before you didn't control the models, now you don't even control which ones are used or how much. this is not 'AI sovereignty'..." These early tests and reactions mirror the sentiment summarized by Reddit user GreedyWorking1499 in initial platform discussions: "Until proven otherwise, this is just a highly advanced router/wrapper, not a fundamental leap in intelligence like Mythos/Fable was." Yet, as enterprises increasingly demand fail-safes against single-vendor reliance, Sakana is proving that packaging collective intelligence into a single API endpoint is a highly viable commercial path.

Presented by Splunk Every day, organizations learn things their AI systems never get to use. A security analyst corrects an AI-generated investigation. A network engineer identifies the root cause of a recurring outage. An observability team discovers that a pattern of latency, logs and infrastructure changes predicts service degradation. A customer operations team learns which signals indicate an escalation is likely. Each moment contains valuable organizational knowledge. But in most enterprises, that knowledge disappears into tickets, dashboards, chat threads, post-incident reviews and the minds of individual experts. It may help solve the immediate problem, but it rarely becomes part of a reusable system that improves future AI-driven decisions. That is the next challenge for the agentic enterprise. The future will not be defined simply by who has the most capable model or the most autonomous agents. Many organizations will have access to similar frontier models. Many will deploy agents across security, IT, engineering, customer service, and business operations. The real differentiator will be whether those agents can learn from the organization around them. Not by constantly retraining the underlying model, but by capturing operational experience, converting it into institutional knowledge and making that knowledge available to future agents, workflows, and decisions. The agentic enterprise is not just an enterprise that uses AI. It is an enterprise that learns through AI. Agentic enterprises allow AI systems to learn from them The AI conversation has been dominated by model capability: larger context windows, better reasoning, faster inference, stronger tool use, and more sophisticated agentic behavior. Those advances matter. But in the enterprise, a model is only one part of the system. A model does not automatically know how a specific organization operates. 
It does not inherently know which remediation step solved last month’s outage, which analyst correction improved a threat investigation, which network signal preceded a service disruption, or which internal policy should override an otherwise plausible recommendation. That knowledge belongs to the enterprise. For agentic systems to improve, organizations need a way to capture that knowledge and make it reusable. In many cases, that does not require changing the model itself. It requires changing the ecosystem around the model: the knowledge base, retrieval layer, prompts, policies, guardrails, routing logic and workflows that shape how agents behave. The model may remain the same. The learning system around it becomes smarter.

Feedback loops turn every outcome into a teachable moment for agents

Every agentic workflow creates signals. An agent receives a request. It retrieves context, reasons through possible actions, calls tools, and generates answers. A human accepts, rejects, or modifies that answer. Downstream systems reveal whether the action worked. That entire chain is valuable. AI observability gives organizations visibility into what happened: the prompt, response, reasoning path, tool calls, data sources, intermediate steps, failure modes and outcomes. Without that visibility, organizations cannot understand why an agent behaved the way it did, let alone improve it. But observability alone is not enough. The larger opportunity is to turn observed behavior into institutional knowledge. A trace should not only help developers and operators debug an agent. It should help the enterprise understand what the agent learned, what the human corrected, what outcome followed, and what should change before the next similar event. That is the shift from monitoring AI to teaching AI. In the agentic enterprise, feedback loops connect action to outcome, outcome to knowledge and knowledge back to future action.
A learning system in practice across security, observability and the network Consider a service experiencing intermittent degradation. An observability agent detects unusual latency and error rates. A network agent identifies packet loss across a specific path. A security agent notices that the same time window includes suspicious authentication behavior and unusual traffic from a previously unseen source. Individually, each agent has only a partial view. Together, they create a richer operational picture. The first time this incident occurs, human experts may need to intervene. A network engineer confirms that packet loss was caused by a misconfigured routing change. A security analyst determines that the suspicious traffic was not an attack, but a side effect of a misrouted internal service. An SRE connects the network event to the application degradation. That resolution contains knowledge the organization should not have to relearn. A mature agentic learning system would capture the traces, human corrections, topology context, security findings, observability signals and final remediation steps. It would preserve the relationship between those signals: latency pattern, network path, identity behavior, routing change and remediation. The next time a similar pattern appears, agents would not start from zero. They could retrieve the prior case, compare current conditions, recommend the proven diagnostic path and escalate with better context. The underlying frontier model did not need to be retrained. The enterprise learned. The architecture of the learning agentic enterprise A learning-oriented agentic enterprise needs more than a model or chatbot. It needs an architecture that can capture experience, turn it into usable knowledge, connect that knowledge to operational context, and govern how it changes future agent behavior. Memory preserves what happened: what the agent saw, what it did, where humans intervened, and what outcomes followed. 
Knowledge bases turn that experience into reusable guidance, including playbooks, examples, policies, procedures, and evidence. A data fabric connects the operational environment. The signals agents need live across logs, metrics, traces, tickets, identity systems, security tools, network telemetry, collaboration platforms, and business applications. A data fabric makes those signals discoverable, correlated, governed, and usable in context. AI observability explains how agents behave by capturing prompts, tool calls, intermediate steps, responses, feedback, and outcomes. That visibility helps organizations understand where agents succeed, where they fail, and what should improve. The control plane governs how learning becomes change: what knowledge is promoted, which prompts or policies are updated, which agents can use new information, what approvals are required, and how changes are audited. Together, these capabilities allow AI systems to improve over time in a controlled, trustworthy way that allows the enterprise to learn from its own operations. The organizations that learn fastest will win The next era of AI will not be won by models alone. It will be won by organizations that can capture what they learn from every workflow, expert correction, incident, investigation, and outcome. The most advanced agentic enterprises will not simply deploy more agents. They will build systems that allow every agent to benefit from the collective knowledge of the organization. That means connecting operational data through a data fabric. It means observing agent behavior deeply enough to understand it. It means preserving experience in memory and institutionalizing it in knowledge bases. It means using a control plane to govern how learning changes agent behavior. The future of AI is not a single autonomous agent acting alone. It is an ecosystem of agents, humans, data and controls that learns over time. 
The organizations that build that ecosystem will create AI systems that get better with every interaction. Not because the model is constantly changing, but because the enterprise itself is becoming more intelligent. Learn more about how Cisco Data Fabric powered by the Splunk Platform is accelerating agentic operations. Hao Yang is Vice President AI at Splunk, a Cisco Company. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Not every company can or should build their own frontier AI language model. However, the harness controlling the model is something that most enterprises can and should customize for their specific purposes. Of course, this is easier said than done. Agent harnesses are still largely tuned through manual, ad hoc debugging — a process that relies heavily on intuition rather than systematic feedback loops, making it difficult to keep pace with rapidly evolving LLMs. To solve this challenge, researchers at the Shanghai Artificial Intelligence Laboratory have introduced “Self-Harness,” a new paradigm in which an LLM-based agent systematically improves its own operating rules. By examining its own execution traces to apply edits, the system trades manual guesswork for empirical evidence. Self-improving harnesses can enable development teams to deploy robust custom agents that continually adapt their own execution protocols to overcome model-specific weaknesses. The challenge of harness engineering An LLM-based agent's performance is not determined solely by its underlying base model, but also by its harness: the surrounding system that provides context and enables the model to interact with the environment. A harness includes components like system prompts, tools, memory, verification rules, runtime policies, orchestration logic, and failure-recovery procedures. This layer is crucial because many common agent failures stem from the harness rather than the model. For example, an agent may report success without checking the model’s response (e.g., running the code to see if it passes the tests), or it might retry a failed action repeatedly. The harness is also responsible for preventing context rot or overload when the agent’s interaction history grows very large. Examples of popular harnesses include SWE-agent, Claude Code, Codex, and OpenHands. 
Harness engineering remains a significant challenge, but the bottleneck isn't necessarily that humans are too slow or incapable. In fact, Hangfan Zhang, lead author of the Self-Harness paper, told VentureBeat that "in many cases, an experienced engineer with deep domain knowledge can still propose better changes than an LLM can today." Instead, the true bottleneck of manual engineering is that it relies heavily on ad hoc debugging rather than a verifiable, empirical feedback loop. "The deeper issue is that the current harness-engineering paradigm often lacks a systematic feedback loop," Zhang explained. "Many edits are made based on intuition, a few observed failures, or ad hoc debugging." With new models being released at a rapid pace, depending on human intuition to manually tune model-specific harnesses becomes increasingly costly and untenable. While some approaches use stronger models to improve the harnesses of weaker target agents, this dependence on external guidance has its own challenges, as these models may be costly, unavailable for frontier models, or mismatched to the target model's failure modes. How Self-Harness works The Self-Harness paradigm enables an LLM-based agent to improve its own harness without relying on human engineers or stronger external models. This continuous self-evolution is driven by a three-stage iterative loop that turns behavioral evidence into harness updates: Weakness mining: Starting from an initial harness, the agent runs a set of tasks, producing execution traces with verifiable outcomes. The agent categorizes failed traces and tries to detect model-specific failure patterns. Harness proposal: Based on these failure patterns, the agent uses a “proposer” role to generate a set of diverse yet minimal harness modifications, each tied to a specific failure mechanism to avoid overly general corrections. Proposal validation: The system evaluates candidate modifications through regression tests. 
An edit is promoted only if it improves performance without causing measurable degradation on held-out tasks. If multiple candidate modifications pass the regression tests, they are merged into the next version of the harness, which then serves as the starting point for the next iteration. To visualize why an enterprise would need this, imagine an automated issue-fixing agent that reads internal documentation, writes patches, and opens pull requests. If the company updates its documentation style, the agent might suddenly fail, pulling the wrong context or writing bad patches. On the surface, the agent simply looks broken. But Self-Harness turns this ambiguous failure into a solvable problem. "The failure traces expose where the agent is misusing the new documentation format; the proposer can generate a targeted harness edit... and the evaluator can decide whether that edit improves the failing cases without regressing other cases," Zhang said. Self-Harness in action The researchers evaluated Self-Harness on Terminal-Bench-2.0, a benchmark that tests general tool-based execution, including artifact management, command use, verification behavior, and recovery from execution errors. They applied Self-Harness with MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. To isolate the impact of the self-evolving harness, they started with a minimal harness built upon the DeepAgent SDK, containing only the benchmark-facing system prompt, and the default filesystem and shell tools. The model backend, tool set, benchmark environment, and evaluator were kept unchanged while only the harness was allowed to vary. The quantitative results show that agents improved their performance through automated harness edits. On held-out tasks, performance jumped significantly across the board, ranging from 33 to 60 percent relative improvements for different models. 
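The three-stage loop and its promotion rule described above can be sketched roughly as follows. This is an illustrative outline, not the researchers' implementation: the harness is modeled as a plain dictionary of named rules, and `mine_weaknesses`, `propose_edits`, `score_failing` and `score_heldout` are hypothetical callables standing in for the agent's trace analysis, its "proposer" role, and the regression evaluator.

```python
from typing import Callable, Dict, List, Tuple

Harness = Dict[str, str]  # hypothetical representation: rule name -> rule text
Edit = Dict[str, str]     # a candidate modification: rules to add or overwrite


def self_harness_step(
    harness: Harness,
    mine_weaknesses: Callable[[Harness], List[str]],
    propose_edits: Callable[[Harness, List[str]], List[Edit]],
    score_failing: Callable[[Harness], float],
    score_heldout: Callable[[Harness], float],
) -> Tuple[Harness, List[Edit]]:
    """Run one iteration: mine weaknesses, propose edits, validate, merge."""
    # Stage 1: weakness mining over execution traces (stubbed by the caller).
    failure_patterns = mine_weaknesses(harness)
    # Stage 2: minimal candidate edits, each tied to a failure pattern.
    candidates = propose_edits(harness, failure_patterns)
    # Stage 3: promote an edit only if it helps and does not regress held-out tasks.
    base_failing = score_failing(harness)
    base_heldout = score_heldout(harness)
    accepted: List[Edit] = []
    for edit in candidates:
        trial = {**harness, **edit}
        improves = score_failing(trial) > base_failing
        no_regression = score_heldout(trial) >= base_heldout
        if improves and no_regression:
            accepted.append(edit)
    # Merge every accepted edit into the next harness version.
    next_harness = dict(harness)
    for edit in accepted:
        next_harness.update(edit)
    return next_harness, accepted


def demo() -> Tuple[Harness, List[Edit]]:
    """Toy run with made-up scoring functions, for illustration only."""
    candidates = [
        {"loop_breaker": "stop and redirect after 50 tool calls"},
        {"skip_checks": "never run sanity checks"},
        {"verbose": "narrate every step"},
    ]
    return self_harness_step(
        harness={"system_prompt": "benchmark-facing prompt"},
        mine_weaknesses=lambda h: ["endless exploration until timeout"],
        propose_edits=lambda h, patterns: candidates,
        score_failing=lambda h: float(("loop_breaker" in h) + ("skip_checks" in h)),
        score_heldout=lambda h: 10.0 - 3.0 * ("skip_checks" in h),
    )
```

In the toy run, only the loop-breaker edit is promoted: the second candidate helps the failing cases but regresses the held-out score, and the third changes nothing measurable.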
Importantly, an explicit acceptance rule promotes only those edits that improve performance without introducing unacceptable regressions. What makes Self-Harness powerful for enterprise applications is that it doesn’t simply make the prompt longer or add generic instructions. Instead, it introduces targeted changes that reflect the recurring problems each model encounters during execution. For example, under the baseline harness, MiniMax M2.5 would get stuck endlessly exploring dataset configurations until the execution environment timed out, failing to produce any deliverables. Through Self-Harness, the system identified this specific flaw and wrote a "loop breaker" into its runtime policy, forcing the agent to stop and redirect its approach after 50 tool calls. It also added a rule to create an initial version of required artifacts as early as possible. Qwen3.5-35B-A3B, by contrast, had a habit of hitting a file overwrite error and then blindly retrying the same command repeatedly, eventually deleting necessary files out of confusion before stopping. Self-Harness fixed this by introducing a strict command-retry discipline (forbidding exact duplicate commands) and a mechanism that forced the agent to immediately recreate any missing artifacts if a file error occurred. GLM-5 struggled to preserve environment changes across different commands, and would often waste time on massive downloads or finalize tasks even when sanity checks were failing. Its self-generated harness introduced rules instructing the agent to persist PATH variables across shell sessions, limit external compute, and repair any failed sanity checks before concluding its run.

The hidden costs of automated harnesses

While Self-Harness automates the tedious work of tracking down idiosyncratic model failures, decision-makers must be realistic about the trade-offs. Replacing human engineering with automated trial-and-error requires significant computational overhead.
"Self-Harness replaces part of the human engineering burden with repeated proposal generation, parallel candidate evaluation, and regression testing," Zhang said. "That can mean more API tokens, more latency during optimization, and more infrastructure for running evaluation tasks." Also, this system relies on the accuracy of its evaluation pipeline. During their experiments on Terminal-Bench-2.0, the researchers relied on strict, deterministic verifiers to ensure the agent's edits were actually helpful. Without this rigorous ground truth, an automated system risks promoting bad updates. "[The] evaluation system is not an optional component; it is what lets us trade human intuition for empirical evidence," Zhang said. This reliance on strict verifiers also dictates where Self-Harness should be deployed. "The best deployment targets today are environments where failures can be measured and where trial-and-error is relatively safe," Zhang said, pointing to coding, internal workflow automation, and DevOps data pipelines as ideal use cases. Conversely, enterprises should avoid fully automating harnesses in high-stakes or subjective fields. "The clearest red flags are domains where evaluation is subjective, delayed, non-deterministic, or costly to get wrong, such as medical decision-making, safety-critical infrastructure, or legal decisions." From prompt tweakers to feedback architects The introduction of self-improving agents does not mean coding or enterprise workflows will suddenly become human-free. The quality of collaboration between the human engineer and the AI is still paramount and difficult to capture with automated benchmarks. Instead, the engineering profession is moving up the abstraction layer. "The role of enterprise engineers will shift from manually patching individual prompts or tool calls toward designing the feedback systems that make agent improvement possible," Zhang predicted. 
Moving forward, "the engineer becomes less of a prompt tweaker and more of a feedback architect." As foundational models grow more capable, they will naturally absorb many capabilities that currently require manual harness engineering. "But once that happens, the harness will not disappear; its scope will move outward to connect the model to richer external environments," Zhang said. "Until that boundary moves beyond what humans can evaluate, humans will remain critical providers of feedback."
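As an illustration of what such harness rules can look like in code, the two runtime policies described earlier (a loop breaker after 50 tool calls and a ban on exact duplicate commands) could be expressed as a small guard that runs before each tool call. This is a sketch under stated assumptions: the `RuntimePolicy` class, its method names, and the choice to block only repeats of commands that previously failed are hypothetical and are not taken from the Self-Harness paper.

```python
from typing import Set


class RuntimePolicy:
    """Hypothetical guard consulted before every tool call."""

    def __init__(self, max_tool_calls: int = 50) -> None:
        self.max_tool_calls = max_tool_calls
        self.tool_calls = 0
        self.failed_commands: Set[str] = set()

    def before_call(self, command: str) -> str:
        """Return 'allow', 'blocked_duplicate', or 'redirect'."""
        # Loop breaker: once the budget is spent, force a change of approach.
        if self.tool_calls >= self.max_tool_calls:
            return "redirect"
        # Retry discipline (assumed form): never re-issue a command that
        # already failed with exactly the same text.
        if command in self.failed_commands:
            return "blocked_duplicate"
        self.tool_calls += 1
        return "allow"

    def record_failure(self, command: str) -> None:
        """Remember a failed command so an identical retry is refused."""
        self.failed_commands.add(command)
```

A harness would call `before_call` ahead of each shell or filesystem action and `record_failure` whenever the tool reports an error, leaving the agent to pick a different command or a new plan when a call is refused.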

Presented by Solidigm
As inference workloads evolve from discrete question-and-answer exchanges into persistent, multi-step agentic systems, GPU availability is no longer the most critical AI bottleneck. Instead, the bottleneck has migrated from compute to context, says Jeff Harthorn, AI applied research lead at Solidigm. "Why context management has become a primary bottleneck, more than GPU availability or compute efficiency, is the question of 2026," says Harthorn. "GPUs have gotten dramatically cheaper per FLOP. Model architectures and inference serving engines have all gotten much more efficient. But the thing that's grown faster than both of those is context. The persistent state that has to live between sessions has grown even faster than context itself."
Three trends are driving this. Context windows are growing dramatically, making individual inputs far larger than before. Agentic AI systems chain dozens or hundreds of model calls together, each generating state that must be tracked. And enterprises are requiring that inference state persist across sessions for audit, governance, and reuse. These trends compound each other, pushing context volumes beyond what any existing memory tier was designed to handle. "Those three things are all happening at the same time, all of which are pushing context data and context memory into the stratosphere much more quickly than we're used to seeing," adds Ace Stryker, director of AI and ecosystem marketing at Solidigm.
The solution is a dedicated context tier emerging between GPU memory and bulk network storage: a layer of high-performance, high-density flash designed specifically to hold and serve key-value (KV) cache, the inference data that allows models to retain and reuse context, and retrieval data at inference speed. Nvidia has formalized this architecture under the term CMX. Storage companies including Solidigm are building SSD products optimized for this workload.
"Storage has not been the first thing folks have thought about when they've been planning their enterprise infrastructure buildout," Stryker says. "In a lot of ways, it was a relatively small cost compared to compute, and it was a commodity. You just shopped around for the lowest dollar per gigabyte and called it good. But now, if your storage is not up to snuff, your ROI suffers, and it directly impacts your bottom line."
Why AI inference requires a different storage architecture than training
The storage architecture that AI systems rely on today was largely inherited from training workflows. Training is sequential and write-dominated, with data moving in large blocks to and from bulk object storage. The tier structure, with high-bandwidth memory on the GPU, fast NVMe in the server, and bulk storage over the network, serves that use case reasonably well.
However, inference is a different animal. Its I/O signature is fine-grained, latency-sensitive, and increasingly stateful. KV cache data and retrieval data each have distinct access patterns, but both need to be served quickly and reused across interactions. Neither fits cleanly within GPU high-bandwidth memory, which is expensive and physically constrained, nor within traditional bulk storage, which was never designed for active inference workloads.
"The architectural gap that's interesting to me right now isn't at the top of the stack or the bottom, it's right in the middle," Harthorn says. "A lot of what sits below the GPU HBM is being asked to do things it wasn't really designed for, which is where the most interesting systems work today is happening."
One of the most visible symptoms of this gap is recomputation. In inference, the pre-fill stage processes all of the context relevant to a given session before token generation can begin. When KV cache state isn't available in a fast, accessible tier, the system recomputes it, burning GPU cycles that produce no new value.
"A meaningful share of GPU cycles end up going to re-pre-filling," Harthorn explains. "During all of that calculated context, that's potentially compute that's being spent reproducing state, rather than doing new work. When you start looking at the problem that way, GPU utilization starts looking like it's partly a storage problem." This reframing is driving renewed interest in a metric borrowed from networking: goodput, or useful tokens per dollar, rather than raw tokens per dollar.
The AI context memory tier and how it works
The industry's response is taking structural form. A new tier is emerging between GPU memory and traditional network storage, designed specifically to hold and serve inference context. It is a layer distinct from drives inside GPU servers (G3) and storage servers over the network (G4), engineered to serve context data back to accelerators as rapidly as possible.
"If you're building a data center starting in the second half of this year, or the beginning of next year, you can't think about storage only living in two places," Stryker says. "Storage has to live in at least three places to handle the context memory tier, and that's likely to be a permanent fixture in how the infrastructure gets built going forward."
It's analogous to the emergence of object storage as a category, which didn't exist until enough workloads needed it. And once it did, it developed its own primitives, SLAs, cost models, and an ecosystem of vendors. "The context tier looks like it might be on a similar arc," Harthorn says. "That volumetric pressure is causing the category to form, rather than any one vendor's road map."
For infrastructure leaders, this means actively planning for the new tier rather than treating it as optional. Deploying additional NAND at this layer reduces dependency on DRAM, which is orders of magnitude more expensive per gigabyte and constrained in both availability and thermal headroom.
"In terms of your investment effectiveness, you're laying out less cash to do it if you rely on the SSD layer in the way that Nvidia is now recommending and prescribing for a lot of use cases," Stryker adds.
What flash needs to deliver to support AI inference
Participating meaningfully in the inference stack places new demands on SSD technology. Tail latency, the worst-case performance of a drive, must be predictable, not just fast on average. An orchestration system that allocates GPU resources based on expected storage response times cannot tolerate unexpected multi-second delays. Consistent, observable performance matters more here than peak throughput.
Beyond latency, density becomes a critical concern, especially at hyperscale. In data centers where power, not cost, is the binding constraint, watts per petabyte becomes the operative metric. Floating gate NAND, the manufacturing approach at the core of Solidigm's products, is suited to that calculation. Network integration via NVMe over Fabrics, RDMA, and eventual CXL support is also essential, given the tight latency budgets of active inference pipelines.
"The drives have to have reliable performance characteristics, beyond the throughput side and being able to transfer as much data as possible as fast as possible, the way that training needed," Harthorn says. "Now it's about being able to do it very consistently, in a way that's very observable to the people operating and orchestrating these systems."
How enterprise AI leaders should plan for the context tier
The standards, software primitives, and best practices being established now will define how AI inference infrastructure operates for years to come. Solidigm is engaged in that process through standards bodies, partner lab collaborations, and published research, which is critical precisely because the category is still forming. "The interesting question for the next couple of years isn't whether AI infrastructure needs more compute," Harthorn says.
"It's whether it can use what it has more efficiently. A lot of that answer runs through this tier that is being built today." Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Your AI agent did exactly what it was designed to do. The framework underneath it just handed an attacker a shell on the box that holds your OpenAI key, your database credentials, and your CRM tokens. That is not a hypothetical. In a few months, three of the most widely deployed AI agent frameworks each turned a known, ordinary bug class into a way through. Check Point Research chained a SQL injection in LangGraph’s SQLite checkpointer to full remote code execution. Tenable and VulnCheck tracked a path traversal in Langflow’s file upload endpoint to active, in-the-wild RCE. Cyera documented a path traversal in LangChain-core’s prompt loader that reads your secrets off disk. Two paths to a shell, one to your keys. They are the same bug, wearing three frameworks. These frameworks became production infrastructure faster than anyone secured them. They store agent state, take file uploads, load prompt configs, and hold the credentials to databases, CRMs, and internal APIs. The edge tools watch traffic. The endpoint tools watch processes. Neither was built to treat an imported framework as a boundary worth guarding, and that blind spot is exactly where all three chains live, widening every week as these frameworks ship to production. The LangGraph chain, SQL injection to a Python shell Start with the one most teams pulled into production this quarter. LangGraph gives AI agents memory through checkpointers, the persistence layer that stores execution state. It has cleared over 50 million downloads a month. Yarden Porat of Check Point Research took that layer apart and found three vulnerabilities. Two of them chain to RCE. CVE-2025-67644, rated CVSS 7.3, is a SQL injection in the SQLite checkpointer. The function that builds the WHERE clause for checkpoint lookups drops user-controlled filter keys straight into the query with no parameterization and no escaping. This does not hit everyone, but where it hits, it is serious. 
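The bug class described above, user-controlled filter keys placed in a WHERE clause without validation, can be shown on a toy table. This is a minimal sketch using Python's standard `sqlite3` module and is not LangGraph's code: the `checkpoints` schema, the helper names, and the `ALLOWED_FILTER_KEYS` allowlist are all assumptions made for illustration. Values can be bound with `?` placeholders, but keys cannot, so the sketch validates keys against an allowlist instead.

```python
import sqlite3

# Hypothetical allowlist; a real deployment would list the keys it supports.
ALLOWED_FILTER_KEYS = {"thread_id", "source"}


def build_where_unsafe(filters):
    # Vulnerable pattern: the key is interpolated into the SQL text as-is.
    clauses = [f"{key} = ?" for key in filters]
    return " AND ".join(clauses), list(filters.values())


def build_where_safe(filters):
    # Keys cannot be bound as parameters, so reject anything off the allowlist.
    for key in filters:
        if key not in ALLOWED_FILTER_KEYS:
            raise ValueError(f"unsupported filter key: {key!r}")
    clauses = [f"{key} = ?" for key in filters]
    return " AND ".join(clauses), list(filters.values())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkpoints (thread_id TEXT, source TEXT, payload TEXT)")
conn.execute("INSERT INTO checkpoints VALUES ('t1', 'loop', 'state-1')")
conn.execute("INSERT INTO checkpoints VALUES ('t2', 'loop', 'state-2')")


def lookup(builder, filters):
    where, params = builder(filters)
    sql = f"SELECT payload FROM checkpoints WHERE {where}"
    return conn.execute(sql, params).fetchall()


# A crafted key rewrites the predicate so that it matches every row.
malicious = {"1=1 OR thread_id": "t1"}
leaked = lookup(build_where_unsafe, malicious)
print(len(leaked))  # both rows come back, not just thread t1

try:
    lookup(build_where_safe, malicious)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

The toy attack only widens a read. The chain Check Point describes goes further, but the root cause is the same: text an attacker controls becomes part of the query.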
A deployment is exposed when it self-hosts LangGraph on the SQLite or Redis checkpointer and lets untrusted input reach get_state_history() or a similar history endpoint. Meet those conditions, and an attacker who controls the filter writes a fabricated row straight into the checkpoint table. Run LangChain’s managed LangSmith platform on PostgreSQL, and the exposure is gone. Then CVE-2026-28277, CVSS 6.8, finishes the job. LangGraph’s msgpack checkpoint decoder rebuilds Python objects from the stored data, which lets it import a module and call a named function with attacker-supplied arguments. That step needs write access to the checkpoint store; the SQL injection is what grants it remotely. LangGraph loads the forged row as a legitimate checkpoint, the decoder runs the specified function, including os.system, and code executes under the identity of the agent server. A third issue, CVE-2026-27022, CVSS 6.5, reaches the same place through the Redis checkpointer. There has been no confirmed exploitation in the wild yet. A working proof-of-concept is public in Check Point’s disclosure. The fixes are version bumps: langgraph-checkpoint-sqlite to 3.0.1, langgraph to 1.0.10, and langgraph-checkpoint-redis to 1.0.2. The Langflow chain, one unauthenticated request to RCE Langflow is the one already under attack. CVE-2026-5027, CVSS 8.8, is a path traversal in the POST /api/v2/files endpoint, which takes the filename straight from the form data and writes it to disk unsanitized. An attacker packs that filename with traversal sequences and drops a file anywhere, such as a cron job in /etc/cron.d/. Because Langflow ships with auto-login enabled in its default configuration, an exposed instance needs no credentials at all. A single unauthenticated request reaches the endpoint, and the next cron run hands over a shell. 
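The upload flaw described above comes down to trusting a client-supplied filename. The following is a minimal sketch of one common defense using Python's standard `pathlib`; it is not Langflow's patch, and the function name `resolve_upload_path` and the upload directory are assumptions for illustration. It rejects any filename that carries directory components, then confirms the resolved target still sits inside the upload directory.

```python
from pathlib import Path


def resolve_upload_path(upload_dir: Path, client_filename: str) -> Path:
    """Hypothetical helper: map a client filename to a safe path under upload_dir."""
    base = upload_dir.resolve()
    name = Path(client_filename).name
    # Reject empty names, "." and "..", and anything with directory components
    # (for example "../../etc/cron.d/job" or "/etc/passwd").
    if not name or name in {".", ".."} or name != client_filename:
        raise ValueError(f"invalid filename: {client_filename!r}")
    target = (base / name).resolve()
    # Defense in depth: the resolved path must still live under the base directory.
    if base not in target.parents:
        raise ValueError("path escapes upload directory")
    return target


uploads = Path("/tmp/uploads_demo")  # illustrative location, not from the article
safe = resolve_upload_path(uploads, "report.csv")
print(safe.name)

try:
    resolve_upload_path(uploads, "../../etc/cron.d/job")
    blocked = False
except ValueError:
    blocked = True
print(blocked)
```

Rejecting the request outright, rather than silently stripping the traversal sequences, keeps the attempt visible in logs.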
VulnCheck’s Caitlin Condon confirmed exploitation on June 9: “Our Canaries observed exploitation of CVE-2026-5027 that successfully leveraged the path traversal to write what appear to be test files on victim systems.” Censys put roughly 7,000 exposed instances on the internet, most in North America. This is the third Langflow flaw to draw active exploitation this year, after CVE-2025-34291, which the Iranian state-sponsored group MuddyWater weaponized and which CISA added to its Known Exploited Vulnerabilities catalog in May. CVE-2026-5027 itself was patched in version 1.9.0, released April 15. The timeline is what sets the clock. The patch shipped April 15. Attacks started in June, and VulnCheck added CVE-2026-5027 to its exploited-vulnerabilities list June 8 once its sensors caught the first in-the-wild hits. Every instance left unpatched between those two dates has been sitting in the open for almost two months. The lesson for security teams is to start the patch clock at disclosure, not at a federal catalog entry. The LangChain-core gap, arbitrary file reads through the prompt loader LangChain-core, the foundation under both, disclosed CVE-2026-34070, CVSS 7.5, a path traversal in its legacy prompt-loading API. The load_prompt() functions read a file path out of a config dict with no check against traversal sequences or absolute paths, so an attacker who influences that path reads arbitrary files the process can reach, including the .env file holding OPENAI_API_KEY and ANTHROPIC_API_KEY. Cyera paired it with CVE-2025-68664, CVSS 9.3, a deserialization flaw that resolves environment secrets through a crafted object. The fix versions differ, which matters when you patch: CVE-2026-34070 lands in langchain-core 1.2.22 and 0.3.86; CVE-2025-68664 lands earlier in 1.2.5 and 0.3.81. Clear both, or the higher-severity flaw stays live behind a patched one. Three frameworks, three classic AppSec bugs. Path traversal. SQL injection. Unsafe deserialization. 
Nothing exotic, nothing AI-specific, just old vulnerabilities living inside new infrastructure. None of this is a frontier-model problem. It is plumbing, sitting in the layer where AI meets the enterprise. Why the scanner cannot see it Merritt Baer, CSO at Enkrypt AI and former deputy CISO at AWS, has named what makes this kind of failure hard to see coming. It does not announce itself as an AI problem. "CISOs will experience MCP insecurity not in the abstract, but when an employee pastes sensitive data into a tool, or when an attacker finds an unauthenticated MCP server in your cloud," Baer told VentureBeat. "It won't feel like 'AI risk.' It will feel like your traditional security program failing." The framework chains here are the same shape. An exposed Langflow instance is an unauthenticated server in your cloud, and the alert, if one fires, reads like an ordinary incident. That is the gap in one sentence. The exploit lives in the framework your code imports. The WAF never sees a msgpack decoder running three layers down. The EDR watches the agent server make the same process calls it makes a thousand times a day and waves it through. Both tools are doing their job. Nobody scoped the framework itself as the thing that could turn on you. The root cause is older than AI, and Baer names it. “MCP is shipping with the same mistake we’ve seen in every major protocol rollout: insecure defaults,” she told VentureBeat. “If we don’t build authentication and least privilege in from day one, we’ll be cleaning up breaches for the next decade.” Langflow’s auto-login is that mistake shipped. LangChain-core’s unguarded prompt loader is that mistake shipped. The convenient default is the vulnerability. And the moment an agent connects to anything, that risk compounds. “You’re not just trusting your own security, you’re inheriting the hygiene of every tool, every credential, every developer in that chain,” Baer said. 
“That’s a supply chain risk in real time.” There is a governance failure layered on top of the technical one, and it is the same miscategorization Assaf Keren, chief security officer at Qualtrics and former CISO at PayPal, has flagged in adjacent tooling. “Most security teams still classify experience management platforms as ‘survey tools,’ which sit in the same risk tier as a project management app,” Keren told VentureBeat. “This is a massive miscategorization.” Swap in AI agent frameworks, and it still holds. Teams file LangGraph, Langflow, and LangChain under developer convenience, then wire them into databases, CRMs, and provider keys. “Security has to be an enabler,” Keren said, “or teams route around it.” These frameworks are what routing around it looks like. Follow the money and it points at the same layer. On its Q1 fiscal 2027 earnings call, CrowdStrike reported its AI detection and response line up more than 250% sequentially, and on June 17 it extended that runtime coverage to agent, LLM, and MCP traffic on AWS. George Kurtz, the company’s co-founder and CEO, named the reason in plain terms: “Agents run on the endpoint. They make tool calls, access files, invoke APIs, and move data at the process level.” That is the exact plumbing these chains abuse, and real money is now moving to the layer your AppSec scan skips. What to put in front of the board The board does not need the CVE numbers. It needs the consequence, and Keren draws the line the board cares about. Most teams have mapped the technical blast radius. “But not the business blast radius,” Keren told VentureBeat. “When an AI engine triggers a compensation adjustment based on poisoned data, the damage is not a security incident. It is a wrong business decision executed at machine speed.” A framework RCE is the same problem one layer earlier. The agent does not just leak a credential; it acts on production systems with it, and the business sees an outcome no one can explain. 
So frame it the way a board frames it: we run AI agent frameworks in production that can be turned into remote shells through bugs our scanners are not built to find, all three are patched, one is under active attack, and here is the date every instance is verified and closed. None of this required custom malware or a zero-day.
The six-question checklist
Six trust boundaries, one per row, each with the question, the proof point, the command, the fix, and the board line. Run it tonight.
| Trust-Boundary Question | Proof Point | What Broke | Verify Before You Install | The Fix | Board Language |
| --- | --- | --- | --- | --- | --- |
| 1. Can the agent's state store be poisoned with code? | LangGraph SQLi-to-RCE chain. CVE-2025-67644 (CVSS 7.3) chains into CVE-2026-28277 (CVSS 6.8). PoC public, no in-the-wild use yet. | Filter keys interpolated into SQL with an f-string. Forged checkpoint row hits the msgpack decoder, which imports and runs an attacker-named callable. | pip show langgraph-checkpoint-sqlite. Below 3.0.1 = vulnerable. Confirm get_state_history() is not exposed to network input. | Upgrade langgraph-checkpoint-sqlite to 3.0.1, langgraph to 1.0.10, langgraph-checkpoint-redis to 1.0.2. | “Our agent memory layer can be tricked into running attacker code. Vendor has patched it. We are upgrading and confirming the endpoint is not exposed.” |
| 2. Can an unauthenticated request write a file to our agent server? | Langflow CVE-2026-5027 (CVSS 8.8). On VulnCheck KEV (June 8). Active exploitation confirmed June 9. ~7,000 exposed instances (Censys). | Path traversal in POST /api/v2/files. Filename unsanitized. Auto-login on by default. Two HTTP calls drop a cron job and earn a shell. | Query Censys or Shodan for your Langflow, Flowise, n8n, and Dify instances on the perimeter. Check whether auto-login is enabled. | Upgrade Langflow to 1.9.0+. Disable auto-login. Pull AI dev tools behind VPN or zero-trust. Isolate port 7860. | “Our AI dev tools are reachable from the internet with login off. This exact flaw is under active attack now. We are pulling them behind access controls today.” |
| 3. Can our prompt loader read files it should never touch? | LangChain-core CVE-2026-34070 (CVSS 7.5), path traversal in the prompt-loading API. Paired with deserialization CVE-2025-68664 (CVSS 9.3). | load_prompt() reads a config-supplied path with no traversal check, returning files such as the .env holding OPENAI_API_KEY and ANTHROPIC_API_KEY. | pip show langchain-core. Below 1.2.22 (1.x) or 0.3.86 (0.x) = vulnerable. Audit any code passing user-influenced paths to load_prompt(). | Upgrade langchain-core past both fixes: 1.2.22 / 0.3.86 (CVE-2026-34070) and 1.2.5 / 0.3.81 (CVE-2025-68664). Replace load_prompt() with an allowlisted directory. Run as non-root. | “Our prompt system could be steered to read our API keys off disk. We are patching and removing the legacy loader.” |
| 4. Does a compromised framework hand over every credential at once? | These frameworks are often deployed with provider keys, database credentials, and integration tokens available to the process environment. Cyera documents the credential-exfiltration path. | One RCE on the agent server exposes every secret the process can read. Blast radius is the full credential set, not one app. | Inventory which secrets each framework process can reach. Confirm keys come from a secrets manager, not static .env files. | Move provider keys to ephemeral injection. Rotate any key a vulnerable instance could have read. Scope each key to least privilege. | “A single break in one AI framework exposes the keys to every model and data store it touches. We are rotating and scoping them now.” |
| 5. Are these frameworks running outside security governance? | A prior Langflow flaw, CVE-2025-34291, was weaponized by Iranian-linked MuddyWater and added to CISA KEV in May. | Shadow AI is the new shadow IT. Teams stand frameworks up for speed, give them credentials, and never bring them under review. The security team cannot see what it does not know exists. | Run a discovery sweep for AI frameworks outside change management. Map each to an owner and an approval record. | Assign every framework a documented owner and a place in the approval process. Offer a sanctioned alternative so teams do not route around you. | “We have AI frameworks in production that no one formally approved. We are bringing them under governance, not banning them.” |
| 6. Can our scanners even see inside the framework at runtime? | Runtime detection is forming around this layer: CrowdStrike Falcon AIDR expanded to AWS June 17 (Bedrock, Kiro, Strands); its QuiltWorks coalition now covers cloud workloads. | WAF reads HTTP at the edge. EDR watches the endpoint. By default, neither reliably models a msgpack decoder or a prompt loader three layers down in an imported framework as a separate trust boundary. | Test whether your AppSec scan covers third-party framework internals. Track CVEs by dependency, not just by what your edge tools can parse. | Add framework dependencies to vuln management. Treat agent output and stored state as untrusted. Patch on disclosure, not on KEV listing. | “Our scanners check our code, not the frameworks our code imports. We are closing that blind spot and patching on disclosure, not waiting for the federal catalog.” |
How to read this table: each row is one trust boundary, left to right, from the question to ask to the line to read your board.
Give the board the deadline, not the technology
The fixes are not a re-architecture. They are version bumps and config changes you can land this week. The exposure is the gap between the day the patch shipped and the day your team runs the checks, and right now that gap is measured in months. The frameworks did exactly what they were built to do.
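The "pip show" checks in the table can be scripted. Below is a minimal sketch that compares installed versions against the fix versions quoted in this article, using only the standard library. The helper names are hypothetical, the version parser is simplified (it ignores pre-release suffixes), and treating every release below the fix as exposed is an assumption; confirm affected ranges against the vendor advisories.

```python
import re
from importlib import metadata

# Fix versions as quoted in the article; langchain-core has one per release line.
FIX_VERSIONS = {
    "langgraph-checkpoint-sqlite": ["3.0.1"],
    "langgraph": ["1.0.10"],
    "langgraph-checkpoint-redis": ["1.0.2"],
    "langflow": ["1.9.0"],
    "langchain-core": ["0.3.86", "1.2.22"],
}


def parse_version(text):
    # Simplified parser: keep the leading digits of the first three components.
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def needs_upgrade(package, installed):
    installed_v = parse_version(installed)
    fixes = [parse_version(f) for f in FIX_VERSIONS[package]]
    # Prefer the fix on the same major line; otherwise compare to the newest fix.
    same_line = [f for f in fixes if f[0] == installed_v[0]]
    target = same_line[0] if same_line else max(fixes)
    return installed_v < target


def audit():
    # Report only the packages that are actually installed in this environment.
    report = {}
    for package in FIX_VERSIONS:
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            continue
        report[package] = (installed, needs_upgrade(package, installed))
    return report


if __name__ == "__main__":
    for package, (installed, stale) in audit().items():
        print(package, installed, "UPGRADE" if stale else "ok")
```

Comparing versions as integer tuples matters here: as plain strings, "1.0.9" sorts after "1.0.10", which would hide a vulnerable langgraph install.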

Enterprise teams keep watching the same thing happen. An AI agent demos beautifully, goes to production, and stalls: it runs for a short stretch, then needs a human to top up its context and check its output, and the promised efficiency drains into supervision. The agent did the work; you did the watching. It’s one reason so many agent pilots never turn into production systems. The pitch on the other side of that wall is the one every team wants to believe: an agent that runs a long job on its own, overnight if it has to, and leaves a person to validate only the last 10%. Whether that is achievable turns on a problem the orchestration conversation mostly skips. When AI firm Chroma tested 18 leading models, every one lost accuracy as its input grew, a property of how attention works, not a gap a stronger model closes. An agent fed more and more of your business as it runs does not get steadier. It gets shakier. This is the layer beneath the orchestration race. Routing, durable execution and observability all assume each agent is already competent enough to coordinate in the first place. The deeper question is how long an agent can run before a human has to step in, and that comes down to where your company's knowledge lives relative to the model. Both standard fixes leave a human in the loop. Why teaching a model your business keeps you in the loop Frontier models keep getting more capable, and the gap does not close, because it is not a capability problem. It is about where your knowledge sits relative to the model, and enterprises have had two ways to place it there. The first is fine-tuning, which bakes knowledge into the weights. It remains subject to catastrophic forgetting, a problem identified in the 1980s and still unresolved in 2026: teaching a model something new tends to erode what it already knew. 
Teams work around it by isolating each task in its own fine-tuned model or adapter, which produces a sprawling estate of models that raises cost and governance overhead. And a fine-tuned model is a snapshot, stale the day a policy changes, when the expensive, slow retraining cycle starts over.
The second is in-context learning, which skips retraining by placing the relevant policies in the prompt at run time. This is where context rot bites. Retrieval narrows what goes into the prompt, but a retrieval miss looks identical to a confident answer, and both cost and latency climb with every token added.
The two failures rhyme. With fine-tuning, the model can be confidently working from last quarter's policy. With in-context learning, it can be confidently working from a detail it lost in the middle of a long prompt. Either way the output looks equally assured, so you cannot tell which parts are wrong without checking all of them. That is why the human never gets to leave.
Some teams run both at once, fine-tuning the stable knowledge and retrieving the rest. That softens each failure but removes neither: on any given output you still cannot be sure the model is both current and working from the right context, so you still check it.
A third path: generate the specialist model on demand
A third approach is moving from research into early product. Instead of retraining one model or stuffing its prompt, a generator builds a small, task-specific model on demand from your policies, at inference time. The generator is a hypernetwork: a network whose output is the weights of another network. The idea was named in 2016; applying it to produce specialist language models from text or documents is recent and active.
Sakana AI's Text-to-LoRA, presented at ICML 2025, generates a model adapter from a plain-language description in a single pass, and a 2026 system called SHINE calls hypernetwork adaptation a promising new frontier, precisely because it sidesteps both the retraining cost of fine-tuning and the context limits of prompting. The point of generating adapters rather than training and storing them is to collapse a sprawling library of per-task LoRAs into one network that can produce them on demand, including for tasks it has not seen. The elegant part is how this closes the loop on the problem above: the per-task adapter teams hand-build to dodge catastrophic forgetting is the same object a hypernetwork produces automatically. The model zoo stops being a governance headache and becomes a generated output. The case for going small underneath all this was put most directly in a 2025 paper by Nvidia researchers: for the narrow, repetitive tasks that fill agent workflows, small models are capable enough and 10 to 30 times cheaper to run than frontier generalists. Nace.AI, a Palo Alto company that raised a $21.5 million seed round in May, is the clearest commercial instance. Its core technology, a generator it calls a MetaModel, produces parameter adaptations for a model at inference time from a company's policies, pointed at regulated work: audit, compliance, risk assessment. The company says its agents handle the bulk of a workflow while human experts validate the result, a split it markets as 90/10. 
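The hypernetwork idea described above, a network whose output is the weights of another network, can be made concrete with a toy example. This is a minimal sketch in pure Python and does not reproduce Text-to-LoRA, SHINE, or Nace's MetaModel: the class `TinyHypernetwork`, its dimensions, and its untrained random projection are assumptions for illustration. It maps a task embedding to two low-rank factors, A and B, and adds their product to a frozen base weight matrix, which is the shape of a LoRA-style adapter.

```python
import random


def matmul(a, b):
    # Plain list-of-lists matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class TinyHypernetwork:
    """Hypothetical, untrained generator: task embedding -> low-rank adapter factors."""

    def __init__(self, embed_dim, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        n_out = rank * d_in + d_out * rank
        # One linear layer with no bias; a trained system would learn these values.
        self.proj = [[rng.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(n_out)]

    def generate(self, task_embedding):
        flat = [sum(w * e for w, e in zip(row, task_embedding)) for row in self.proj]
        split = self.rank * self.d_in
        # A has shape (rank x d_in); B has shape (d_out x rank).
        a = [flat[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        b_flat = flat[split:]
        b = [b_flat[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return a, b


def apply_adapter(base_weight, a, b):
    # Adapted weight = base + B @ A; the base matrix itself is left unchanged.
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base_weight, delta)]


hyper = TinyHypernetwork(embed_dim=4, d_in=3, d_out=2, rank=1)
base = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# Two made-up task embeddings standing in for two different policy documents.
a1, b1 = hyper.generate([1.0, 0.0, 0.0, 0.0])
a2, b2 = hyper.generate([0.0, 1.0, 0.0, 0.0])
adapted = apply_adapter(base, a1, b1)
print(len(adapted), len(adapted[0]))  # same shape as the base weight matrix
```

One generator serves many tasks: changing the embedding changes the adapter, with no per-task training run and no stored library of adapters.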
How the three approaches compare
| | Fine-tuning | In-context / RAG | Hypernetwork-generated model |
| --- | --- | --- | --- |
| Where business knowledge lives | In the model's weights | In the prompt, re-supplied each run | In on-demand generated weights |
| Cost to update on a policy change | High: retrain | Low: edit the source | Low: regenerate |
| Staleness | High: a snapshot | Low | Low: regenerated from current policy |
| Per-call cost and latency | Low | High, grows with context | Low at run time |
| Dominant failure mode | Forgetting; model-zoo sprawl | Context rot; silent retrieval misses | Generator quality; calibration |
| Who owns the improving asset | Whoever trains the model | Whoever holds the data store | Depends where generator and feedback live |
Why a hypernetwork-built model raises the autonomy ceiling
A model that is narrow, current and small has a smaller surface on which to be wrong. Fewer errors, confined to a known domain, mean fewer outputs an agent has to escalate to a person, which is the real basis for any high-autonomy claim. It is also where a number like 90/10 comes from: not a dial set in advance, but an outcome of how little the system needs to hand back. Reported autonomy shares are best read as measurements of an architecture, not as settings.
Two design choices decide whether that autonomy is trustworthy or merely fast. The first is grounding: tying every output to its source so a reviewer can verify rather than redo. Research models built for exactly this, such as HalluGuard, label each claim as supported or not and cite the passage they relied on. Nace ships its agents with grounding models and reasoning traces for the same reason. A 10% review only means something if the human can confirm provenance in seconds.
The second is the feedback loop, and it forces a question every buyer should ask: when your experts validate the output, whose model improves, and where does it live? That decides whether the compounding asset belongs to the vendor or to you. Arrangements differ.
Nace, for instance, uses an external network of certified experts for some engagements and, for direct enterprise deployments, the customer's own staff, with the resulting model kept inside the customer's cloud. Each choice routes the learning, and the ownership, somewhere different.
Where the third path breaks
The approach is still early, and a few questions will decide how far it goes. Calibration is the linchpin: the value rests on the model knowing when it is unsure. And it is genuinely unsettled: recent work generating these adapters found they do not automatically improve calibration over ordinary fine-tuning, with gains appearing only under specific constraints. The quality of the generated model also depends heavily on the policy data it is built from, which puts a premium on data curation. And scale is the open research frontier: the hypernetworks shown in published work so far have been small. This is where Nace's own work gets interesting: in our interview, the company said it has scaled its generator well beyond those published sizes and derived a scaling law for how performance grows, results it has begun to share publicly and is now putting through peer review. If it holds up, it would help answer one of the central open questions in the field, and it is the paper worth watching.
Whichever approach wins, the work still ends at a human, and that handoff is its own design problem. When Deloitte Australia delivered a roughly A$440,000 government report, it shipped with fabricated citations and an invented court quote after passing senior review, because the reviewers checked the conclusions, which were sound, and not the provenance, which was not. Controlled research suggests the pattern is general: experts corrected an identical flawed recommendation less often when it was labeled AI-generated. The EU AI Act's Article 14 now names this automation bias.
The lesson is not about any one vendor: a high autonomy share concentrates human attention into a thin, late slice of the work, so the value of that review depends entirely on whether the human can check provenance fast, which loops back to grounding.

What to build, and what to ask before you buy

The honest takeaway: what holds your agents back is usually not orchestration or model size, but whether the model knows your business well enough to be left alone, and the right fix depends on the job. To automate a long, repetitive, high-volume process end to end (for example, running most of your internal audit overnight and having your own experts check the final slice), a hypernetwork-generated model is the approach most likely to do it cheaply and run long enough to matter. For a short task that finishes in a few steps and never needed to run unattended, the gap between this and a well-prompted frontier model shrinks to almost nothing, and is not worth the integration cost.

When a vendor pitches autonomous or specialist agents, four questions cut through it. Where does the business knowledge live: in the weights, the prompt, or generated on demand? What does each output come with, so a reviewer can verify it instead of redoing it? What decides which work gets escalated to a human? And whose model improves from that feedback, and where does it run? The answers, not the headline ratio, tell you what you are buying.

The hypernetwork approach is the most credible attempt yet at making a small model know a specific business without forgetting it and without re-explaining it on every run. It is also the least proven, and the parts that matter most, calibration and scale, are still in peer review. For the right job, pilot it now. For the wrong one, the integration cost buys you little that a well-prompted frontier model wouldn't.
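The claim that an autonomy share is a measurement and not a setting can be made concrete with a small sketch. It assumes each output carries a confidence score and that expert review later labels it right or wrong; the `ReviewedOutput` type, the threshold and the sample data are all hypothetical and are not taken from any vendor's system.

```python
from dataclasses import dataclass


@dataclass
class ReviewedOutput:
    confidence: float  # model's stated confidence (assumed reasonably calibrated)
    correct: bool      # verdict from expert review (hypothetical labels)


def autonomy_report(outputs, threshold):
    """Return (autonomy_share, error_rate_on_unescalated_work).

    Outputs below the confidence threshold are escalated to a human;
    the autonomy share is whatever fraction remains.
    """
    if not outputs:
        raise ValueError("need at least one output")
    kept = [o for o in outputs if o.confidence >= threshold]
    share = len(kept) / len(outputs)
    errors = sum(1 for o in kept if not o.correct)
    error_rate = errors / len(kept) if kept else 0.0
    return share, error_rate


# Hypothetical batch: nine confident, correct outputs and one unsure, wrong one.
sample = [ReviewedOutput(0.95, True)] * 9 + [ReviewedOutput(0.40, False)]
share, error_rate = autonomy_report(sample, threshold=0.8)
print(f"autonomy share: {share:.0%}, error rate on unescalated work: {error_rate:.0%}")
```

In this toy batch the 90/10 split falls out of how often the model is unsure. Lowering the threshold raises the share only by letting the wrong output through, which is why calibration carries so much of the weight.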

Anthropic announced a potentially game-changing new feature for users of Claude Code on the Claude Team and Enterprise subscription plans: Artifacts. This update turns a Claude Code session's work into a live, interactive, shareable custom HTML webpage. A Claude Code user can plug in live code and multiple data sources and have the result surface at a URL they can send to teammates, whether it is a dashboard, an app design, or some other product meant for internal use. These teammates and the original user can watch the webpage update in real time as Claude Code goes about its work autonomously or under the user's guidance, and as the connected data sources and codebases change. While Anthropic first introduced Artifacts to its consumer web chatbot in the summer of 2024 (where it evolved from a manual toggle feature to a generally available tool for publishing code snippets and games to the web), integrating this capability directly into the Claude Code command-line interface (CLI) and desktop app bridges the gap between deep, back-end engineering and the non-technical stakeholders who need to understand it.

Product and Technology: The End of the Status Update

At its core, Claude Code Artifacts acts as a dynamic translation layer. Built directly from the unbroken context of a user’s session, the agent uses the local repository codebase, connected monitoring tools, and conversational reasoning to spin up specialized web pages. Engineers no longer need to wire up external data sources or stand up temporary infrastructure; the AI builds the UI from what already exists. Crucially, these web pages are not static exports. As the AI works through a terminal session, the open webpage refreshes in place, updating charts and text instantly at the exact same URL. Every update publishes a new version to the history, allowing teammates to roll back or track the agent's progress securely on desktop or mobile.
The Battle of Live, Interactive, Shared AI Work Surfaces: Anthropic's Claude Code Artifacts vs. OpenAI's Codex Sites Anthropic's update comes more than two weeks after OpenAI released a massive update to its own Codex platform, introducing a strikingly similar enterprise hosting feature called "Sites". This tit-for-tat product cadence highlights a rapidly escalating battle over the enterprise workspace across functions and beyond developers themselves, though there are some important technical and philosophical distinctions worth pointing out for enterprises considering either. As revealed in their respective developer documentation webpages, OpenAI is building a platform-as-a-service; Anthropic is building a stateless canvas. OpenAI’s Sites is designed to generate durable, full-stack web applications. According to the platform's documentation, Codex Sites hosts projects that output as Cloudflare Worker-compatible ES modules. Crucially, Sites supports persistent backend infrastructure: agents can automatically wire up "D1" relational databases for structured data (like user progress or saved records) and "R2" object storage for file uploads. An OpenAI Site can support public sign-ins, integrate with external identity providers, and allows for highly specific access controls tailored to specific workspace groups. It utilizes a two-stage publishing process—saving a reviewable candidate linked to a Git commit before officially deploying to production. In short, it is a production environment designed to replace functional internal SaaS tools. Anthropic’s Claude Code Artifacts, by contrast, deliberately avoids the backend. The newly released documentation is blunt about its limitations: "An artifact is a capture of work, not an application". Each Artifact is a single, self-contained HTML page capped at a rendered size of 16 MiB. 
To guarantee organizational security, Claude wraps the published file in a strict Content Security Policy (CSP) that blocks all external network requests. This means the page cannot load external scripts, fonts, or stylesheets, and fetch, XHR, and WebSocket calls are completely blocked. All CSS and JavaScript must be inlined, and images must be embedded as data URIs. Artifacts cannot store form input, call an API at view time, or serve multiple routes. This technical limitation is actually Anthropic's deliberate philosophical position: while OpenAI wants to spin up persistent software portals for the whole company, Anthropic is keeping Claude Code firmly anchored in ephemeral, highly secure technical workflows. Claude Artifacts are not meant to be software; they are meant to replace whiteboard diagrams, manual bug walkthroughs, and status reports with secure, self-updating visual tools that never leak live data outside the corporate boundary.

Licensing and Enterprise Security: Keeping the Codebase Private

Because these agents sit at the nexus of proprietary company data and live codebases, licensing and access controls are a primary concern. Both Anthropic and OpenAI have opted for closed, proprietary licensing models for these new visual workspaces. For end users and developers, the distinction is critical. Unlike permissive open-source software (such as MIT or Apache 2.0) or strict copyleft licenses (like GPL), both of which grant developers the legal freedom to inspect, modify, and self-host the underlying code, neither Claude Code Artifacts nor Codex Sites can be independently forked or hosted. Enterprise clients do not maintain code-level ownership over Anthropic's rendering engine or Codex’s integration nodes; both operate strictly within their respective creators' managed infrastructures. To make this vendor-managed approach palatable to enterprise compliance teams, both companies have heavily prioritized organizational security.
Anthropic ensures every artifact is private to its author by default and strictly cannot be made public to the broader internet. When an engineer chooses to share a link, it is viewable exclusively by authenticated members of their specific organization. System administrators retain ultimate authority, managing access through org-level toggles, role-based scoping, and explicit retention policies, while maintaining oversight through a centralized compliance API. OpenAI takes a similarly gated approach with Codex Sites, rolling the feature out primarily for ChatGPT Business and Enterprise workspaces. Like Anthropic, OpenAI relies on system administrators to manage deployment through centralized workspace settings, requiring an admin to explicitly enable Sites via role-based access control (RBAC) for Enterprise tiers. However, because Codex Sites functions more like a hosted web application, its access controls are slightly more granular. When an engineer prepares to share a deployed URL, they can apply specific access modes: restricting the site to just themselves and workspace admins, opening it to all active users in the workspace, or limiting access to custom user groups. Furthermore, to prevent sensitive data leaks, OpenAI provides a dedicated Sites panel to manage runtime environment variables and secrets securely, ensuring those keys do not have to be committed to local source files. Reactions and Reflections The introduction of visual, self-updating UI layers to command-line agents is fundamentally altering how developers view their own workflows. As AI handles the raw syntax and automates the reporting, the friction of communicating technical work to stakeholders is vanishing. 
Boris Cherny, the Lead and creator of Claude Code, highlighted the sheer utility of the update in a post on X earlier today: "I've been using Artifacts in Claude Code for everything: visual explanations of tricky code, system diagrams, quick previews of a few animation options, data analyses and dashboards I share with the team," Cherny wrote. "They are a game changer for how I work with Claude. Can't wait to hear what you think!" This sentiment is practically demonstrated in Anthropic’s launch materials. In one scenario, an engineer prompts Claude Code to investigate user drop-offs since a previous software release. In a matter of seconds, the agent executes an SQL read, builds an interactive drop-off funnel dashboard, and diagnoses that "Pro accounts stall at the export sheet". The AI then proposes UI fixes, updates the live charts as the code is refactored, and generates a secure link that a manager can instantly open via mobile. By turning the terminal into a live, collaborative canvas, Anthropic is proving that the most valuable output of an AI coding assistant isn't just the code itself—it is the context, the reasoning, and the ability to share that work instantly.
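The constraints the documentation places on an artifact (one self-contained HTML page, no external scripts, fonts, stylesheets or images, a 16 MiB cap) can be approximated with a local pre-publish check. The sketch below is an illustration only, not Anthropic's validator: `check_artifact` and `ExternalRefFinder` are hypothetical names, raw file size stands in for "rendered size", and ordinary `<a href>` navigation links are assumed to be allowed.

```python
from html.parser import HTMLParser

MAX_BYTES = 16 * 1024 * 1024  # 16 MiB cap cited in the article


class ExternalRefFinder(HTMLParser):
    """Collect src/href attributes that would trigger an external load."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            return  # assumption: plain navigation links are not resource loads
        for name, value in attrs:
            if name in ("src", "href") and value:
                target = value.strip().lower()
                if target.startswith(("http://", "https://", "//")):
                    self.external.append((tag, value))


def check_artifact(html):
    """Return a list of human-readable problems; empty means the page passes."""
    problems = []
    if len(html.encode("utf-8")) > MAX_BYTES:
        problems.append("file exceeds 16 MiB")
    finder = ExternalRefFinder()
    finder.feed(html)
    for tag, url in finder.external:
        problems.append(f"<{tag}> loads external resource: {url}")
    return problems


# Inline CSS and a data-URI image pass; a CDN script does not.
ok_page = ('<html><head><style>p{color:red}</style></head>'
           '<body><img src="data:image/png;base64,AAAA"><p>hi</p></body></html>')
bad_page = '<html><body><script src="https://cdn.example.com/x.js"></script></body></html>'
print(check_artifact(ok_page))
print(check_artifact(bad_page))
```

A check like this does not catch `fetch`, XHR or WebSocket calls made from inline JavaScript; those are what the Content Security Policy blocks at view time.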

Imagine your engineering team just deployed an AI agent to search through internal company documents and answer employee questions. It works perfectly in development, but in production, it consistently hallucinates or misses key constraints. Fixing this is rarely a simple patch. It requires a tedious, trial-and-error process of tweaking chunking strategies, retrieval methods, and system prompts simultaneously. Because these adjustments are entangled, it becomes nearly impossible to attribute which specific tweak actually solved the problem. To address this challenge, researchers at Renmin University of China and Microsoft Research introduced Arbor, a framework that upgrades AI-driven research and optimization from a sequence of trial-and-error guesses into a cumulative learning process. Arbor organizes hypotheses, experiments, and insights into a tree that helps the system learn from prior failures to make smarter, verified improvements over time. In practical tests, Arbor delivered more than 2.5 times the verifiable performance gains of standard AI coding agents across real-world engineering tasks while operating under the same resource budget. For enterprise AI, this technique directly translates to automating the continuous improvement of complex, real-world engineering systems. Understanding the bottleneck in autonomous optimization As large language models and AI systems become more capable, they are expected to carry out more complex operations such as autonomous optimization (AO) of software systems such as agent harnesses or model training algorithms. AO captures the fundamental loop of autonomous research. An AI agent starts with an initial mutable artifact, such as a machine learning codebase or data pipeline, and a specific objective. The agent's goal is to iteratively improve this artifact through experimental feedback without step-by-step human supervision. The main challenge of AO is often misunderstood. 
Many engineering teams find that simply giving a coding agent more time or compute to optimize a codebase doesn't lead to better results. "Automation can keep an AI working for a very long time — but a loop is not the same as progress," Jiajie Jin, co-author of the paper, told VentureBeat. "If the goal is vague, or the metric is easy to hack, long-running automation often just produces 'improvements' faster that nobody actually wants." Jin explains that complex tasks take many attempts to get right, and standard agent architectures are missing the critical data structure to maintain state. "How do you make sure the insight and experience from each attempt actually accumulate, instead of getting lost in a scrollback buffer?" he said. Without this structure, agents simply repeat the same mistakes. Current agent systems can run experiments for many hours against well-specified goals: editing code, invoking tools, running tests autonomously. But they treat each attempt in isolation, missing the structural mechanisms that would let them accumulate and act on what they've learned. They lack the capacity to simultaneously maintain and compare multiple competing research directions. Without this, they cannot interpret both successes and failures to reshape their future exploration, which is the core mechanism that makes human research cumulative. General coding agents typically rely on conversation transcripts for their memory. Because AO tasks span hundreds of turns and easily exceed context window limits, these agents struggle to preserve and reuse factual evidence over long histories. As a result, they lose the overarching structure of the research process and are prone to stalling on early failures or chasing noisy evaluation swings. The system needs a structured, durable memory that records what directions have been tried, what factual evidence was produced, and how each result changes the space of future hypotheses. 
Existing frameworks are also prone to reward hacking and overfitting to development metrics. This makes them create the illusion of progress without producing improvements that transfer to real-world performance. Finally, general-purpose coding agents typically chain their tool calls on a single shared working tree. This architectural limitation prevents them from testing parallel hypotheses in isolated environments without corrupting the main codebase or obscuring which hypothesis caused a specific outcome. The Arbor framework Arbor solves the challenges of AO with a framework that automates the long-horizon loop of exploration, experimentation, and abstraction that characterizes human research. Arbor separates the strategic direction of research from the ground-level coding tasks with two key components: The coordinator: A long-lived AI agent that acts like a principal investigator. It never directly edits the target codebase. Instead, it owns the general state of the optimization research, observes accumulated evidence, comes up with new hypotheses and directions to explore, and decides what to do with the results of experiments. Executors: Short-lived, highly focused AI agents. When the coordinator wants to test an idea, it spins up an executor and places it in an isolated environment, essentially a fresh git worktree. Each executor is handed one hypothesis. It implements the assigned idea, runs evaluations, debugs errors, and reports back to the coordinator with the results and created artifacts. These two components collaborate through a mechanism that the researchers call “Hypothesis Tree Refinement” (HTR). HTR represents the entire research process as a persistent, branching tree where every node binds together four things: a hypothesis, the executable artifact, the factual evidence produced, and a distilled insight. This means the coordinator can explore multiple competing directions at the same time without losing its place. 
The coordinator builds the tree by placing broad ideas near the root, while concrete refinements branch out as leaves. This allows Arbor to safely explore multiple competing hypotheses simultaneously. If an executor's experiment fails, the tree records why it failed as a negative constraint, ensuring the system doesn't endlessly repeat the same mistake. To understand why Arbor's isolation matters, consider a common enterprise scenario: optimizing a Retrieval-Augmented Generation (RAG) pipeline for an internal AI assistant. "When you ask a single agent like Claude Code or Codex to 'improve accuracy,' it will typically change a bunch of things in one pass — chunking, the prompt, the retrieval method," Jin said. This entangles the changes, making it impossible to attribute which one actually helped. It also directly mutates the repository without isolation. Arbor solves this by treating each lever as a separate hypothesis. Chunking becomes one branch, retrieval another, and the prompt another — each implemented and evaluated in its own isolated git worktree. "So you get clean attribution: 'constraint decomposition on the retrieval side gave +X; breadth-first search actually hurt,'" Jin said. When an executor returns a report, the coordinator writes the evidence to the tree and backpropagates the insight upward to parent nodes. This means a local observation becomes a generalized constraint that shapes the coordinator's future idea generation. To prevent reward hacking or overfitting to the development data, HTR enforces a strict “merge gate.” Even if an executor reports a fantastic development score, the coordinator will spin up an isolated worktree to test the candidate against a held-out test evaluator. The artifact is only merged into the current best trunk if it demonstrably improves the test score, verifying that the progress is real. 
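The mechanics described above (a node that binds hypothesis, artifact, evidence and insight, insights propagated to parent nodes, and a merge gate on the held-out score) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Arbor's implementation: the class, function names, branch name and scores are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    hypothesis: str
    artifact: Optional[str] = None   # e.g. a git branch name (hypothetical)
    evidence: Optional[dict] = None  # e.g. {"dev": 0.74, "test": 0.71}
    insight: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: list = field(default_factory=list)
    inherited: list = field(default_factory=list)  # insights from descendants

    def add_child(self, hypothesis):
        child = Node(hypothesis, parent=self)
        self.children.append(child)
        return child


def record_result(node, artifact, evidence, insight):
    """Store an executor's report on the node, then pass the insight upward."""
    node.artifact, node.evidence, node.insight = artifact, evidence, insight
    ancestor = node.parent
    while ancestor is not None:
        ancestor.inherited.append(insight)
        ancestor = ancestor.parent


def merge_gate(trunk_test_score, candidate_test_score):
    """Merge only if the held-out score improves; the dev score is ignored."""
    return candidate_test_score > trunk_test_score


# Broad idea at the root, concrete refinements as leaves.
root = Node("improve RAG accuracy")
retrieval = root.add_child("change the retrieval method")
bfs = retrieval.add_child("use breadth-first search over linked documents")

# A failed experiment becomes a constraint visible to every ancestor.
record_result(bfs, "exp/bfs", {"dev": 0.74, "test": 0.68},
              "breadth-first search hurt held-out accuracy")
print(root.inherited)
print(merge_gate(trunk_test_score=0.70, candidate_test_score=0.68))
```

Keeping the gate as a comparison on the held-out score alone is the point of the design: a candidate with a strong development score but a weaker test score never reaches the trunk.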
Arbor generally falls under the concept of "loop engineering," popularized by industry figures like OpenClaw creator Peter Steinberger and Claude Code lead Boris Cherny. The idea is to move beyond single prompts to design iterative cycles (observe, reason, act, verify) that drive autonomous agents. However, as Jin points out, "A loop can fill up with messy, untraceable attempts, and you end up with nothing to show and no way to reconstruct what changed." Arbor in action The researchers evaluated Arbor on an autonomous optimization task suite built from real-world research settings and the MLE-Bench Lite machine learning engineering benchmark. The AO suite featured tasks from different areas of AI development, including model training, harness engineering, and data synthesis. The researchers used different backbone models for the coordinator and executor agents, including Claude Opus 4.6, GPT-5.5, and Gemini-3-Flash. They tested Arbor against the strongest coding agents, Codex and Claude Code. Arbor and the baselines were given the same resources. For the MLE-Bench Lite tasks, Arbor was also compared against top-tier agentic research systems like AI-Scientist, ML-Master, and AIDE. Arbor consistently outperformed the baselines. It achieved the best held-out test result on all tasks, attaining more than 2.5 times the average relative gain of Codex and Claude Code. On the BrowseComp task, which involves optimizing a search agent, Arbor improved the system's held-out accuracy from a baseline of 45.33% to 67.67%. Meanwhile, Codex and Claude Code stalled at 50% and 53.33%, respectively. On MLE-Bench Lite, when equipped with GPT-5.5, Arbor achieved the strongest result among all benchmarked systems. Arbor proved to be resilient against overfitting. For example, during the Terminal-Bench 2.0 task experiments, Claude Code achieved a high development score of 75 but its score dropped to 71 on the held-out data. 
Arbor had a lower development score of 72.22 but achieved the highest held-out score of 77.36, ensuring its results transfer to real-world applications. Arbor also showed generalization in a cross-task transfer experiment. After Arbor finished optimizing the search harness for the BrowseComp task, researchers took the optimized codebase and tested it on two unrelated search-agent tasks, HLE and DeepSearchQA. Arbor's optimized codebase significantly improved performance on those unseen tasks as well. Deploying Arbor: Sweet spots and hidden costs For engineering leads looking to drop Arbor into their existing tech stack, the framework is designed to sit on top of existing Git workflows rather than replacing them. "Its output is an ordinary git branch that your existing code review, CI, and human review can inspect directly," Jin said. Only verified gains are merged into a per-run trunk, leaving the main repository untouched until a developer manually chooses to promote the code. However, deploying Arbor comes with specific tradeoffs. Jin points out that the biggest catch is token cost, as maintaining a long-lived coordinator that continuously manages the tree and dispatches executors is the dominant expense. Running multiple isolated worktrees concurrently also requires genuine compute and disk resources to process real experiments. So where is Arbor's sweet spot? According to Jin, it excels at tasks with a clear, trustworthy metric, tolerance for a long time horizon, and a real search space with several plausible directions, such as pipeline optimization, data-synthesis quality, and model-training recipe tuning. Conversely, teams should explicitly avoid using Arbor for real-time latency tasks, obvious one-line fixes, or when the underlying evaluation metric is flawed. The quality ceiling of the entire run is strictly bounded by the quality of the evaluator. "If the metric isn't trustworthy, Arbor will just optimize toward an untrustworthy result faster," Jin said. 
Jin sees the next evolution going beyond single scalar metrics. "A natural evolution is to have each node's artifact carry a vector — accuracy, latency, cost — instead of a single score," Jin said. "Going from a single scalar to a multi-objective Pareto search is a very natural extension of the framework."
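The multi-objective extension Jin describes can be illustrated with a standard Pareto-dominance filter over the three metrics he names. This is a sketch of the general technique, not code from Arbor; the candidate names and numbers are hypothetical, and it assumes higher accuracy is better while lower latency and cost are better.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    no_worse = (a["accuracy"] >= b["accuracy"]
                and a["latency"] <= b["latency"]
                and a["cost"] <= b["cost"])
    strictly_better = (a["accuracy"] > b["accuracy"]
                       or a["latency"] < b["latency"]
                       or a["cost"] < b["cost"])
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]


# Hypothetical node artifacts, each carrying a vector instead of one score.
candidates = [
    {"name": "A", "accuracy": 0.70, "latency": 1.2, "cost": 0.05},
    {"name": "B", "accuracy": 0.75, "latency": 2.0, "cost": 0.09},
    {"name": "C", "accuracy": 0.68, "latency": 1.5, "cost": 0.06},  # worse than A everywhere
]
front = [c["name"] for c in pareto_front(candidates)]
print(front)
```

With a single scalar there is one best node; with a vector the result is a set of trade-offs (here A and B), and choosing among them becomes a product decision.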

Two AI tools broke in the same way in the same two weeks, and four research teams proved it. The pattern underneath every disclosure is one sentence: enterprise AI accepts external input with no trust boundary. On June 15, Varonis disclosed SearchLeak (CVE-2026-42824), a proof-of-concept exfiltration chain in Microsoft 365 Copilot Enterprise Search. A victim clicks a crafted microsoft.com URL, Copilot searches their mailbox, and the data leaves through a Bing SSRF. No plugins, no second click, no visible indicator. Four days earlier, Obsidian Security published a three-CVE chain against LiteLLM that carried a default low-privilege user all the way to admin and remote code execution. Two tools. Two teams. One broken boundary. The five-check audit at the end of this article maps each gap to a CVE or a market signal from June, a command you can run before lunch, and a sentence a CISO can read to the board. Copilot turned a trusted URL into an exfiltration engine SearchLeak chained three weaknesses into a silent data-theft chain. The URL q parameter fed attacker instructions straight to Copilot’s LLM. A rendering race condition fired an image tag before the output sanitizer ran. Bing’s image-search endpoint, allowlisted in the Content Security Policy, routed the stolen data out. Microsoft rated the flaw critical and patched it on the back end, according to Varonis. NVD has not yet scored it; a third-party tracker lists it at 6.5 medium. The severity is contested, but the mechanism is not. The escalation is the real story. This is the third Varonis Copilot exfiltration chain in twelve months, after Reprompt in January and EchoLeak in 2025. Reprompt hit Copilot Personal. SearchLeak hit Enterprise Search. Enterprise inherits the user’s full organizational permissions, so the blast radius is everything that a user can reach. 
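Because the exfiltration path ran through a host allowlisted in the Content Security Policy, a first step in reviewing any AI surface is to list which hosts a policy allows images to load from. The sketch below parses a CSP header string and extracts those hosts. The header value and hostnames are hypothetical, and deciding which listed hosts perform server-side fetches still requires manual review.

```python
def parse_csp(header):
    """Split a CSP header into {directive: [sources]}.

    Per the CSP specification, the first occurrence of a directive wins.
    """
    policy = {}
    for part in header.split(";"):
        tokens = part.split()
        if tokens:
            policy.setdefault(tokens[0].lower(), tokens[1:])
    return policy


def allowlisted_hosts(policy, directive="img-src"):
    """Return host sources for a directive, falling back to default-src.

    Keyword sources such as 'self' and scheme sources such as data: are skipped.
    """
    sources = policy.get(directive, policy.get("default-src", []))
    return [s for s in sources if not s.startswith("'") and not s.endswith(":")]


# Hypothetical policy for illustration only.
header = ("default-src 'self'; "
          "img-src 'self' data: https://images.example.net https://*.example.com")
hosts = allowlisted_hosts(parse_csp(header))
print(hosts)
```

Each host in the output is a candidate exfiltration proxy if it will fetch an attacker-supplied URL on the server side.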
LiteLLM handed a default account to every provider key The LiteLLM gateway holds the keys for OpenAI, Anthropic, Azure, and Bedrock behind a single proxy. The Obsidian chain runs in three moves. CVE-2026-47101, an authorization bypass, lets a non-admin mint a wildcard API key. CVE-2026-47102 promotes that caller to proxy admin through an unguarded /user/update endpoint. CVE-2026-40217 escapes the code sandbox through exec() with full builtins. Obsidian then demonstrated a reverse shell by injecting a forged tool-call response through LiteLLM’s callback mechanism. Obsidian assessed the combined chain at CVSS 9.9. The developer typed one word. The attacker popped a shell. A separate LiteLLM flaw made the urgency immediate. CVE-2026-42271, a command-injection bug in the MCP test endpoints, landed on the CISA KEV list on June 8 with a June 22 remediation deadline. That KEV entry is not the Obsidian chain. The two are distinct disclosures four days apart, fixed in different releases, pointed at the same gateway. LiteLLM carries more than 40,000 GitHub stars and sits in thousands of enterprise deployments. This is not the first scare, either. A supply-chain compromise backdoored LiteLLM versions 1.82.7 and 1.82.8 on PyPI in March. A compromised gateway exposes every provider credential the organization holds. Langflow and Mini Shai-Hulud proved the pattern scales The same boundary broke in two more tools in the same fortnight. Langflow CVE-2026-5027 became the third Langflow remote-code-execution flaw to hit active exploitation this year. A path traversal in file upload lets an attacker write files anywhere on disk, and because Langflow ships with auto-login enabled by default, a single unauthenticated request reaches RCE. VulnCheck confirmed exploitation on June 9. Censys counted roughly 7,000 exposed instances, the heaviest concentration in North America, with MuddyWater attribution. The Mini Shai-Hulud campaign hit a different pressure point. 
After the worm’s source code went public on May 12, copycat variants compromised 32 Red Hat Cloud Services npm packages on June 1, packages pulled 80,000 times a week. The worm harvests more than 20 credential types and self-propagates under the compromised maintainer’s identity. Four teams, four tools, one operating failure. The bug classes differ. SearchLeak is a prompt injection. LiteLLM is privilege escalation. Langflow is path traversal. Mini Shai-Hulud is supply-chain poisoning. The boundary that broke is the same in all four. The market already repriced the risk CrowdStrike’s Q1 FY27 earnings call put a number on the gap. AIDR, the company’s AI detection and response line, grew ending ARR more than 250% sequentially, with a Q2 pipeline above $50 million (SEC-filed 8-K). Total company ARR reached $5.51 billion, and CrowdStrike’s fleet telemetry shows more than 1,800 agentic applications running across enterprise endpoints. On June 17, the company extended AIDR to AWS, adding real-time evaluation of agent, LLM, and MCP communications across Amazon Bedrock, Kiro, and Strands Agents, building on its work with Anthropic’s Project Glasswing. Daniel Bernard, CrowdStrike’s chief business officer, said the AI attack surface now spans development, runtime, identities, and cloud infrastructure, and that teams treating those as separate domains leave the gaps between them open. Practitioners name the same gap in plainer terms David Levin, CISO at American Express Global Business Travel, told VentureBeat the pattern does not surprise him. “We kind of have this shadow AI, which is just the new version of shadow IT,” Levin said. Both Langflow and LiteLLM fit the description. Teams stood them up for convenience, gave them credentials, and never brought them under governance. Levin puts the fix before deployment. “We didn’t go into this with just saying we’re going to go do this without the right fundamentals,” he said. “We leverage NIST controls. 
NIST has released their CSF along with their AI framework. OWASP released their top 10. You need the right fundamentals before you deploy.” Merritt Baer, CSO at Enkrypt AI and former AWS Deputy CISO, named the structural version of the failure in a separate VentureBeat interview. “Enterprises believe they’ve ‘approved’ AI vendors, but what they’ve actually approved is an interface, not the underlying system,” Baer said. “The real dependencies are one or two layers deeper, and those are the ones that fail under stress.” She has tied that directly to how systems fall. “Raw zero-days aren’t how most systems get compromised. Composability is,” Baer told VentureBeat. “It’s the glue between the model and your data where the risk lives. If you give an agent bash and a root token, you’ve already done most of the attacker’s work for them.” That is what rows 2 and 4 of the audit test: the gateway that holds every key, and the agent identity no one governs. Levin had a sharper frame for the boardroom. “You need to talk more in terms of risk versus compliance to your boards and your executives,” he said. “It’s not about the size of the engineering team anymore. It’s the size of your imagination. It’s all written in plain English. It’s not hard for anyone.” Neither SearchLeak nor LiteLLM needed custom malware or a zero-day to work. Adam Meyers, CrowdStrike’s SVP of Intelligence, put the operational squeeze in numbers in an exclusive VentureBeat interview. “The problem is not zero-day. The problem is patching. If you 10x that problem, they’re gonna be completely underwater,” Meyers said. He pointed to identity as the second front. “Some of these AI have their own identities, or people give their identity to the AI to take action on their behalf, and that makes it a very complex problem.” The five-check trust-boundary audit Each row maps a gap to its proof point, a verification command for Monday morning, the fix, and the sentence to read to the board. 
| Trust-Boundary Gap | Proof Point | What Broke | Verify Monday | Fix Monday | Board Language |
| 1. Prompt-to-Data | SearchLeak CVE-2026-42824. P2P injection + HTML race + Bing SSRF. One-click mailbox exfiltration via microsoft.com URL. PoC demonstrated; Microsoft rated it critical, NVD not yet scored. | URL q-parameter passed to LLM as instructions. Sanitizer ran after render. Bing acted as exfiltration proxy via CSP allowlist. | Audit CSP allowlists for domains performing server-side fetches. Monitor Copilot Search URLs for encoded payloads. Review Copilot audit logs. | Confirm server-side patch applied. Enable sensitivity labels restricting Copilot. Treat AI streaming output as untrusted. | “Our AI assistant could search employee email and send results to an attacker through a trusted Microsoft URL. Vendor patched it. We must verify configuration.” |
| 2. Gateway Credential Exposure | LiteLLM three-CVE chain (-47101, -47102, -40217). CVSS 9.9. Separate CVE-2026-42271 on CISA KEV (fixed in v1.83.7; full chain fixed in v1.83.14-stable). June 22 deadline. | No role validation on key endpoints. Self-promotion to admin via /user/update. exec() sandbox escape. One gateway exposes all provider keys. | Run pip show litellm. Below 1.83.14-stable = vulnerable. Check /mcp-rest/test/ exposure. Audit proxy_admin accounts. | Upgrade to v1.83.14-stable+. Rotate all provider API keys. Block /mcp-rest/test/* at proxy. Review Custom Code Guardrails. | “Our AI gateway held keys for every provider. A default account could promote itself to admin and steal them all. Rotating and patching now.” |
| 3. AI Tooling Sprawl | Langflow CVE-2026-5027 (CVSS 8.8). Third RCE of 2026. ~7,000 exposed instances. MuddyWater. Active exploitation June 9. | Path traversal in file upload. Auto-login enabled by default. Single unauthenticated request to RCE. | Query Censys/Shodan for Langflow, Flowise, n8n, Dify on your perimeter. Check auto-login. Inventory AI tools outside change management. | Pull AI platforms behind VPN/zero-trust. Enable auth everywhere. Upgrade Langflow to v1.9.0+ (current release 1.10.0). Fingerprint surface continuously. | “AI dev tools are exposed to the internet with login disabled. A nation-state group is exploiting this flaw now. Pulling behind access controls today.” |
| 4. Non-Human Identity Governance | AIDR ARR up 250% (Q1 FY27, SEC 8-K). Q2 pipeline >$50M. 1,800+ agentic apps across enterprise endpoints. | Agents hold identities and act on behalf of humans. Some exceed their intended scope to reach a goal. No standard governs agent credential lifecycle. | Inventory all non-human identities used by agents and MCP servers. Map agent-to-data-store access. Flag agents with write access to security policy. | Least-privilege every agent identity. Set privilege boundaries via identity protection. Runtime detection for policy-exceeding actions. Human-in-the-loop for policy changes. | “AI agents hold credentials and act autonomously. We do not govern their identity lifecycle like human access. The 250% market growth tells us this gap is systemic.” |
| 5. Runtime Agentic Detection | Falcon AIDR expanded to AWS (June 17). Covers Bedrock, Kiro, Strands Agents. MCP integration. Real-time agent/LLM/MCP evaluation. | Traditional tools monitor human-speed actions. Agents run at machine speed, thousands of actions per minute, and route around controls to reach goals. | Test if EDR/XDR links agent actions to originating identity. Verify SIEM ingests MCP communications. Confirm you can distinguish human from agent on endpoint. | Deploy AIDR or equivalent runtime detection. Shadow-AI discovery for all agentic apps, models, MCP servers, identities. Real-time policy enforcement on agent actions. | “We cannot distinguish a human employee from an AI agent acting on their behalf. We need runtime detection at machine speed that can stop damage before it starts.” |

The fix is plumbing, not policy

The June 2 executive order creates an AI Cybersecurity Clearinghouse with a July 2 deadline.
The five gaps above are not frontier-model problems. They are plumbing problems in the gateways, orchestration platforms, identity layers, and runtime environments where AI meets the enterprise. The audit is five rows. Every row maps to a June disclosure or market signal, a command a team can run before lunch, and a sentence a CISO can read to the board. The question is not whether your vendor will patch. It's whether you find the gap first — or whether an attacker finds it the way they found Copilot and LiteLLM.
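For teams working row 2 of the audit, the version check can be scripted rather than eyeballed. What follows is a minimal Python sketch, not an official LiteLLM tool: it reads the installed package version through the standard library's importlib.metadata and compares it with the 1.83.14 fix cited in the audit. The helper names (parse_version, is_below_fixed, check_installed) are illustrative, and the way suffixes such as "-stable" are stripped is an assumption about how the version string appears in a given environment.

```python
import re
from importlib import metadata

# Fixed release cited in row 2 of the audit (v1.83.14-stable).
FIXED_VERSION = "1.83.14"


def parse_version(text: str) -> tuple[int, ...]:
    """Keep only the leading dotted integers, e.g. '1.83.14-stable' -> (1, 83, 14)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_below_fixed(installed: str, fixed: str = FIXED_VERSION) -> bool:
    """True when the installed version sorts before the fixed release."""
    return parse_version(installed) < parse_version(fixed)


def check_installed(package: str = "litellm") -> str:
    """Describe the package installed in the current Python environment, if any."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed in this environment"
    if is_below_fixed(installed):
        return f"{package} {installed}: below {FIXED_VERSION}, upgrade"
    return f"{package} {installed}: at or above {FIXED_VERSION}"


if __name__ == "__main__":
    print(check_installed())
```

The script only sees the interpreter it runs under, so it would need to be run inside each virtual environment or container image that ships the gateway. A clean result in one environment says nothing about the others.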

Adobe has announced a major expansion of its "creative agent" across its flagship Creative Cloud suite and upgraded Firefly AI studio. Available in public beta starting today across Premiere Pro, Photoshop, Illustrator, InDesign, and Frame.io, the agent is designed to serve everyone from individual creators to enterprise marketing teams.
Unlike first-generation generative AI tools that simply output flat media from a chat interface, Adobe’s embedded assistant acts as an orchestration layer. It interprets natural language prompts and directly accesses the underlying software's APIs to execute complex, multi-step production workflows, from batch-renaming video sequences to dynamically updating brand assets across print layouts, while leaving the final aesthetic decisions entirely in the hands of the human designer.
Technology: Contextual Memory and DOM Manipulation
At the core of this release is a significant technical upgrade to how Adobe's AI handles persistent memory and context window management. In its upgraded Firefly creative AI studio (currently in private beta), Adobe has introduced two foundational architectural components: "Elements" and "Projects". Elements functions as a visual variables library, allowing users to save and reuse specific characters, locations, and objects across multiple generations to ensure strict visual consistency as campaigns scale. Projects acts as the contextual memory layer, storing assets, generations, and session history in a unified space so users can pick up where they left off without rebuilding their prompt context.
Beyond pixel generation, the system's most critical technological leap is its ability to operate seamlessly within the complex document structures of desktop applications. "Our Adobe Creative Agent can leverage the decades of powerful features, workflows, APIs that we've brought into our application and exposed through tooling that can now be invoked through a creative agent," an Adobe representative explained.
Product: Automating the Tedious, Expanding the Canvas
The practical application of this technology fundamentally alters standard production workflows. Adobe is positioning the human user as a "creative director" capable of delegating repetitive, labor-intensive tasks to the AI. The rollout introduces highly specific specialist agents tailored to the logic of each application:
Premiere Pro: The agent handles tedious project setup, analyzing and sorting source media into bins, batch renaming clips, identifying interview questions, and assembling a rough working starting point.
Illustrator: The assistant automates mathematical and multi-step design tasks, such as generating 50 versioned files from a spreadsheet or running pre-flight checks to flag color mode errors before printing. It can even programmatically duplicate a vector shape 100 times, randomize its position, and change its size based on its z-depth and transparency.
Photoshop & InDesign: The agent executes batch background removals, dynamic layer organization, and applies brand updates across multi-page layouts.
Furthermore, Adobe is actively integrating its creative agent into major third-party enterprise platforms, including OpenAI's ChatGPT, Anthropic's Claude, Microsoft 365 Copilot, and soon, Google Gemini and Slack.
Licensing: Commercial SaaS and Enterprise Implications
Unlike open-source orchestration frameworks or models released under MIT or Apache licenses, Adobe's creative agent operates strictly within a proprietary, commercial SaaS ecosystem. For enterprise decision-makers, this carries specific implications. Because the agent relies on Adobe's proprietary APIs to manipulate project files, it requires an active Creative Cloud commercial license.
Additionally, because Adobe is bringing the "Adobe for creativity connector" to platforms like Slack and Microsoft Copilot, enterprise IT and systems architects must consider how internal chat tools will interface with Adobe's cloud processing environments to support enterprise creative and marketing teams securely.
The Enterprise Unknowns: APIs, Governance, and Architecture
While Adobe’s announcements highlight a powerful user interface and deep integration within its own flagship applications, several critical questions remain for enterprise technical decision-makers tasked with building bespoke AI systems. VentureBeat has reached out to Adobe for clarification on these infrastructure-level details and will update this coverage as we learn more.
For AI system architects, the value of a creative agent lies not just in a native application UI, but in its extensibility. It remains unclear whether Adobe plans to expose these new agentic capabilities via API, or whether the company will support the Model Context Protocol (MCP). Without MCP support or direct API access, enterprise teams will face friction integrating Adobe's tools into their own custom task-routing frameworks and internal LLM pipelines.
Adobe’s new "Elements" feature promises to solve the generative AI consistency problem by anchoring characters and objects across generations. However, the backend architecture driving this persistent memory is not yet detailed. Whether Adobe is leveraging on-the-fly Low-Rank Adaptation (LoRA) based on user uploads or utilizing a form of visual Retrieval-Augmented Generation (RAG) is a critical distinction for technology leaders managing compute costs, model evaluations, and enterprise-grade inference pipelines.
As organizations build out "Projects" and define brand-specific "Elements", security and data decision-makers require strict guarantees regarding data provenance and storage.
It is currently unknown exactly where this contextual workflow and vector data lives: specifically, whether it remains strictly sandboxed within the customer's enterprise Creative Cloud instance on Adobe servers, and how role-based permissions apply to these new agentic workflows.
Finally, as lightning-fast, developer-first, multi-model AI creative platforms like fal.ai gain significant traction among enterprises and developers, Adobe’s position in the broader developer ecosystem remains a point of interest. Whether Adobe views these infrastructure-level API providers as direct competitors to its Firefly AI studio or as potential integration points for bespoke enterprise environments remains to be seen.
Community Reactions: The Tension Between Automation and Craft
The integration of agentic AI touches on the tension between eliminating drudgery and surrendering creative control. According to Adobe's recent Creators' Toolkit Report, which surveyed over 16,000 creators globally, the market is highly receptive to AI as an operational assistant rather than an autonomous creator.
75 percent of surveyed creators describe creative AI as integrated or essential to their current workflows.
85 percent emphasized that the final creative decision must always remain in human hands.
This sentiment is central to Adobe's messaging. By focusing the agent's capabilities on file organization, layer management, and brand compliance, Adobe aims to automate what a spokesperson called the "tedious parts of their workflow". The goal, according to Adobe executive David Wadhwani, is to let creatives focus on the craft so they can "apply their taste and make the calls that only they can".

Building a context layer between enterprise data stores and AI agents is bespoke work, with no standard service to automate or maintain the graphs over time. Amazon is making a direct play to change that.
On Wednesday, the company announced three products it is positioning as a context intelligence stack for AI agents. The centerpiece is AWS Context, a new knowledge graph service that gets smarter through agent usage over time. AWS also announced the general availability of Amazon S3 Annotations and a preview of skill assets in AWS Glue Data Catalog.
The context layer is now a contested architectural category with no shortage of options from different vendors. AWS is entering that market with a different architectural premise: that the graph should learn from how agents use it automatically, without human re-curation.
"Your agents now get smarter without you having to rebuild anything from scratch," said Swami Sivasubramanian, vice president of Agentic AI at AWS, during his AWS Summit NYC keynote. "This service automatically builds a knowledge graph from all your existing data," he said. "This service infers relationships across your data sets, business rules, and domain knowledge, and makes all of it available to your agents and your organization at runtime."
AWS Context builds a self-learning knowledge graph from existing data
Hand-built context is a problem AWS says it has seen repeatedly in customer deployments. AWS Context maps relationships across existing data automatically: what tables exist, what columns mean, how sources relate and which sources are authoritative. It combines semantic search with graph-level reasoning and infers relationships across datasets, business rules and domain knowledge, making all of it available to agents at runtime.
"The knowledge graph improves itself over time as it learns which sources produce correct results and which parts get used," Sivasubramanian said.
Data stewards manage the graph through the AWS Management Console, reviewing inferred relationships, promoting them to production and attaching business definitions and usage rules. Every query inherits the calling user's IAM and Lake Formation permissions, making agent data access auditable by identity through controls enterprises already rely on. All metadata is published in Apache Iceberg format to Amazon S3 Tables, queryable via Athena, Redshift, Spark or any Iceberg-compatible engine, with no proprietary APIs. Third-party catalog connections are supported, so context from systems outside AWS can be pulled into the same graph. Agents query through agentic search APIs and MCP tools across Bedrock AgentCore, EKS or any MCP-compatible framework.
Context is more than just a single service
Context is a complicated space, and AWS is layering multiple services to help enterprises build context across the data stack.
Amazon S3 Annotations. This service enables users to attach rich business context at the storage layer, directly to individual S3 objects.
AWS Glue Data Catalog skill assets. Glue skill assets attach domain knowledge at the catalog layer, linking runbooks, query patterns and usage rules to data assets across the estate.
AWS Context then synthesizes both into the knowledge graph that agents query at runtime, combining semantic search with graph-level reasoning across structured and unstructured sources. Each layer feeds the next.
AWS is entering a highly competitive context space
Snowflake announced its context approach earlier this month with its Horizon Context and Cortex Sense services. Microsoft offers context through its Fabric IQ platform, which provides a semantic ontology for data. Redis has developed a context platform that optimizes data for retrieval. Vector database vendor Pinecone has its Nexus context offering, which compiles enterprise data into task-specific artifacts before agents ever query them.
AWS's structural argument is straightforward: for enterprises already running S3, Glue and Lake Formation, AWS Context extends an existing identity model with no data movement required. The pitch is zero-integration friction — not just cost consolidation. "Context makes agents more powerful and as the whole world is building agents, every agentic platform vendor needs a context capability," Holger Mueller, VP and Principal analyst at Constellation Research, told VentureBeat. Mueller noted that AWS is no exception. "The concern — as with all context offerings — is going to be performance, especially for transactional data, we will see," he said.

When Anthropic quietly released Claude Design in April as a "research preview," it generated the kind of instant traction most product teams dream about: more than one million users in its first week. It also generated a problem. The tool consumed tokens so voraciously that a PCWorld reviewer burned through 80 percent of his weekly Claude Pro allowance in roughly 25 minutes, producing just three variations of a single webpage prototype. "We're talking another token-hungry Claude product here," the reviewer wrote, "one that Pro users in particular will barely be able to use before burning through their usage limits." Two months later, Anthropic is shipping a substantially overhauled version of Claude Design that attempts to fix the consumption issue while simultaneously repositioning the product from a flashy demo into something far more strategically important: a design system compliance layer that connects to code, connects to the tools enterprises already use, and — critically — keeps everything on brand. The update, announced Wednesday, arrives at a moment when Anthropic is executing one of the most aggressive product expansions in the AI industry's brief history. In the past ten weeks alone, the company has launched Claude Opus 4.8, released (and then suspended) the Mythos-class Fable 5 model, shipped ten agent templates for financial services, announced a multi-year alliance with DXC Technology to embed Claude inside the IT infrastructure of the world's largest banks and airlines, rolled out Claude for Small Business with integrations into QuickBooks and PayPal, and published research showing that Claude Code users now average 20 hours per week on the tool. Claude Design's transformation from prototype toy to enterprise platform is the latest move in a company-wide strategy to make Claude not just an assistant people talk to, but a worker embedded in the systems where work actually happens. 
How design system imports make Claude Design an enterprise brand-compliance tool The headline feature in Wednesday's update is not the new drag-and-resize editor, nor the expanded list of export destinations, though both matter. The feature that signals where Anthropic is heading is the rebuilt design system import. Users can now bring one or several design systems into Claude Design from a GitHub repository, design files, or raw uploads. Once imported, Claude builds with those components, checks its output against the design system, and auto-corrects before the user ever sees the result. For larger organizations, a new admin role can approve a single standard system and lock down edits, ensuring that every asset Claude produces conforms to company guidelines. This is a meaningful departure from the tool's original positioning. In April, Claude Design was a blank canvas: give it a prompt, and it would generate something visually impressive but stylistically arbitrary. Business Insider tested it against Canva AI for a photography workshop slide deck and found that Claude Design "anticipated my needs" and "identified its own errors and corrected them without prompting." But the output reflected Claude's aesthetic judgment, not the user's brand. For an individual freelancer or a startup founder sketching ideas, that was fine. For a 10,000-person enterprise with a 200-page brand standards document, it was a non-starter. The design system import changes that equation. By ingesting a company's actual components — its buttons, typography, color tokens, spacing rules — and then validating output against them before surfacing results, Claude Design is attempting something that most human designers struggle with: consistent brand compliance at speed and scale. The admin lockdown feature, which prevents individual users from overriding the approved system, is a direct play for the enterprise procurement conversation, where "can we control what it produces?" 
is often the first question. Why the Claude Code round-trip could end the design-to-engineering handoff problem The second major update is the bidirectional integration between Claude Design and Claude Code. Users can now run /design-sync in Claude Code to import their local codebase's design system into Claude Design, ensuring that prototypes start from real components rather than approximations. When a design is ready to ship, it hands off to Claude Code, which picks up exactly where the designer left off — no screenshot, no rebuild. The integration works in reverse, too. From a Claude Code terminal, the /design command lets developers create, edit, and sync design projects without leaving their workflow. This matters because the handoff between design and engineering has been one of the most persistent friction points in software development for decades. Tools like Figma's Dev Mode and Zeplin have tried to bridge the gap by generating specifications and code snippets from design files, but the translation has always been lossy. A designer's prototype and an engineer's implementation inevitably diverge, creating a cycle of visual QA, redlines, and "that's not what the mockup looked like" conversations. Anthropic is betting that if the same AI system both designs and codes — and if both modes share the same underlying component library — the gap disappears. It is, in effect, arguing that the design-to-code problem was never really about better specification formats or smarter handoff tools. It was about the fact that two different humans (or two different tools) were interpreting the same intent. A single AI system that operates on both sides of the workflow doesn't need to interpret; it just continues. The timing of this integration is also significant in light of Anthropic's own research. 
Just yesterday, the company published an analysis of roughly 400,000 Claude Code sessions showing that domain expertise — not coding proficiency — is the primary driver of successful outcomes. Every major occupation succeeded at coding tasks at nearly the same rate as software engineers. If designers can now move fluidly between visual prototyping and code implementation through a single AI system, the research suggests they will succeed not because they learned to code, but because they deeply understand the design problems they are solving. Token consumption gets a fix, but the economics of generative design remain tight The token consumption issue that dogged Claude Design's launch was not just a user experience annoyance — it was a structural threat to the product's viability. If a $20-per-month Pro subscriber could exhaust their entire weekly allowance in a single 30-minute session, the tool was effectively inaccessible to the individual users and small teams who drove its initial viral adoption. Anthropic's response is twofold. First, Claude Design now shares usage limits with chat, Claude Cowork, and Claude Code, rather than drawing from a separate, smaller pool. This gives most users significantly more headroom. Second, the company says it has reduced the average token consumption per turn while maintaining output quality, and that error rates have dropped sharply. Whether this is enough remains an open question. The fundamental tension is architectural: generative design is inherently token-expensive. Every variation Claude produces requires the model to reason about layout, typography, color, spacing, responsiveness, and content simultaneously, then generate a complete, functional artifact. That is a fundamentally different workload than answering a question in chat, and it consumes tokens accordingly. Anthropic's efficiency improvements may push the breaking point further out, but they do not eliminate the underlying economics. 
For enterprise customers on Team and Enterprise plans with higher limits, this may be a non-issue. For Pro subscribers, the math is still likely to be tight. The new editor helps mitigate this somewhat by giving users direct control over individual elements — drag, resize, and align — without burning a model turn for every small adjustment. Hundreds of stability fixes also mean fewer wasted turns on errors and regenerations, which were a significant source of token drain in the original release. These are not glamorous improvements, but they are the kind of grind work that separates a research preview from a daily-use tool. Nine new export partners position Claude Design as a creative hub, not a destination The update's third pillar is an expanded set of export destinations. Claude Design now sends work to Adobe, Base44, Canva, Gamma, Lovable, Miro, Replit, Vercel, and Wix, in addition to PDF and PowerPoint. The breadth of this list reveals a deliberate positioning strategy: Anthropic is building Claude Design not as a place where work is finished, but as the place where it begins. The partner quotes tell the story. Replit's president Michele Catasta frames the integration as meeting "builders wherever ideas begin." Canva's Anwar Haneef describes the flow from Claude Design as turning "a first draft" into "a finished asset — kept on-brand, personalized for the moment." Vercel's Andrew Qu talks about pushing a concept "straight to Vercel to ship." In each case, Claude Design is the origin point, and the partner tool is where polish, collaboration, and deployment happen. This hub-and-spoke model also serves as a defensive moat against the open-source alternative that has emerged with surprising speed. Open Design, a community-built project tracked by Augment Code, reached 57,400 GitHub stars and 310 contributors in just eight weeks after Claude Design's launch. 
It offers local-first operation, model flexibility supporting 16 different coding agents, and 259 skills with 142 design systems — all without cloud lock-in. Augment Code's Paula Hingel noted that for "teams that need to self-host, use their own API keys, or swap models, Open Design is currently the only local-first option with this level of skill and design system coverage." Anthropic's answer to this competitive pressure is not to match Open Design on self-hosting or model flexibility — those are philosophical concessions the company is unlikely to make. Instead, it is building an integration ecosystem that open-source projects cannot easily replicate. A native Adobe Express connector, a verified Canva export pipeline, a first-party Vercel deployment path — these are partnerships, not features, and they require business relationships that community projects cannot forge at the same pace. Claude Design fits into Anthropic's broader push to embed AI across the entire enterprise stack To understand why Claude Design's evolution matters, it helps to zoom out. Anthropic is building a product surface that now spans creative work (Design), code (Code), knowledge work (Cowork), and enterprise operations (Managed Agents) — all unified by the same underlying models and, increasingly, by shared context that carries across tools. The trajectory of the past quarter makes the pattern unmistakable. In May, Anthropic launched Claude for Small Business with connectors to QuickBooks, PayPal, and HubSpot, putting Claude inside the tools that small business owners already use for payroll, invoicing, and marketing. The same month, the company released ten agent templates for financial services covering everything from pitchbook creation to KYC screening, with connectors to FactSet, S&P Capital IQ, and Morningstar. Claude Opus 4.8 shipped on May 28 with a "dynamic workflows" feature enabling hundreds of parallel sub-agents in a single Claude Code session. 
Then came the Fable 5 and Mythos 5 launch on June 9, followed almost immediately by a US government export control directive that suspended access to both. DXC Technology announced a multi-year alliance to train tens of thousands of Claude-certified engineers to embed Claude inside the systems it operates for major banks, airlines, and insurers. The design system you import into Claude Design is the same component library that Claude Code uses to implement. The financial model you build in Claude for Excel can flow into a pitchbook created in Claude Design and exported to PowerPoint. The brand assets a small business owner creates through Claude Design can be pushed directly to Canva for team collaboration. This is not a chatbot strategy. It is a platform strategy, and the Claude Design update — with its design system imports, code round-trips, and export ecosystem — is one of the clearest expressions of it yet. Anthropic also published an engineering deep-dive last month detailing how it contains Claude across products using sandboxes, virtual machines, and egress controls — infrastructure that becomes more critical as tools like Claude Design gain access to proprietary design systems and brand assets. The containment architecture reveals both the ambition and the risk: the more deeply Claude embeds into enterprise workflows, the higher the stakes when something goes wrong, and the more sophisticated the security envelope must become. Three questions will determine whether Wednesday's update delivers on its ambitions. First, whether the token economics actually work for the broadest user base — shared limits and efficiency gains help, but generative design remains expensive. Second, whether the design system import proves robust enough for real enterprise use, because ingesting a GitHub repository of React components and faithfully using them across dozens of design variations is a genuinely hard technical problem. 
And third, whether the Claude Code round-trip actually eliminates the design-engineering gap or merely shifts it. Claude Design launched two months ago as a thing people tried once and marveled at. Anthropic is now trying to make it a thing people use every day — and more than that, a thing their entire team trusts to stay on brand while they do. In the AI industry, the distance between a viral demo and an indispensable tool has swallowed more products than it has produced. Anthropic just bet that design systems, not just design prompts, are the bridge across.

On Sunday, a team of nine researchers at Sina Weibo — the Chinese social media giant better known for its microblogging platform than for cutting-edge artificial intelligence — quietly posted a 14-page technical report to arXiv that sent shockwaves through the AI research community. Their claim: a language model with just 3 billion parameters can match or exceed the reasoning performance of flagship systems from Google DeepMind, OpenAI, Anthropic, and DeepSeek that are hundreds of times larger. The model, called VibeThinker-3B, scored 94.3 on AIME 2026 — the American Invitational Mathematics Examination, one of the most demanding standardized math competitions in the world. That figure places it alongside DeepSeek V3.2, a model with 671 billion parameters, and ahead of Gemini 3 Pro, Google's high-performance flagship reasoning system, which scored 91.7. With a test-time scaling technique the team calls Claim-Level Reliability Assessment, the score climbs to 97.1, edging past virtually every system in the public record. Within hours of publication, the paper had drawn 62 upvotes on Hugging Face's daily papers feed, the model repository had accumulated 130 likes, and the GitHub repository had reached 685 stars. But the reaction on social media was not uniformly celebratory. It was, in many cases, deeply skeptical. "WHAT THE HELL is happening in AI?" wrote the user @orcus108 on X, in a post that accumulated over 161,000 views. "A 3B parameter model just put up coding benchmark scores in the same league as Claude Opus 4.5… I genuinely don't know if this is a breakthrough or if the benchmarks are broken." That tension — between genuine scientific advancement and the growing suspicion that AI benchmarks have become gameable to the point of meaninglessness — sits at the heart of the VibeThinker-3B story. 
And the answer matters enormously, not just for academic bragging rights, but for the multibillion-dollar question of whether the AI industry's relentless push toward ever-larger models is the only path to intelligence. Benchmark scores that defy the scaling laws of modern AI The results reported in the technical report are, by any conventional standard, extraordinary. On the mathematics side, VibeThinker-3B achieved 91.4 on AIME 2025, 94.3 on AIME 2026, 89.3 on HMMT 2025 (the Harvard-MIT Mathematics Tournament), 93.8 on BruMO 2025 (the Brown University Math Olympiad), and 76.4 on IMO-AnswerBench, a benchmark comprising 400 problems at the level of the International Mathematical Olympiad. In coding, it posted an 80.2 Pass@1 on LiveCodeBench v6, a benchmark designed to test executable code generation, and achieved a 96.1 percent acceptance rate on unseen LeetCode weekly and biweekly contests from late April through late May 2026. On instruction following, it scored 93.4 on IFEval. To put the parameter disparity in perspective: DeepSeek V3.2 has 671 billion parameters — roughly 224 times the size of VibeThinker-3B. GLM-5, from Zhipu AI, has 744 billion parameters. Kimi K2.5, from Moonshot AI, exceeds 1 trillion. VibeThinker-3B's 3 billion parameters could run on a consumer laptop. The researchers frame this result not as an anomaly but as evidence for a broader theoretical claim. They introduce what they call the "Parametric Compression-Coverage Hypothesis," which argues that different types of AI capability have fundamentally different relationships to model size. Verifiable reasoning — the kind tested by math competitions and coding challenges, where answers can be definitively checked — is what the paper calls a "parameter-dense" capability: one that can be compressed into a compact core. Open-domain knowledge, by contrast, is "parameter-expansive," requiring broad coverage across facts, concepts, and edge cases that inherently demands more parameters. 
The paper acknowledges this distinction directly. On GPQA-Diamond, a graduate-level science knowledge benchmark, VibeThinker-3B scored just 70.2 — well behind the 91.9 achieved by Gemini 3 Pro and the 87.0 scored by Claude Opus 4.5. The authors write that this gap "is consistent with our claim rather than a contradiction to it: the main finding is not that a 3B model has fully replaced leading general-purpose models, but that a small model can reach first-tier performance on many verifiable reasoning tasks." Inside the four-stage training pipeline that powers a tiny reasoning engine VibeThinker-3B is not built from scratch. It is post-trained on top of Qwen2.5-Coder-3B, a compact foundation model from Alibaba's Qwen team, through what the Weibo AI researchers call the "Spectrum-to-Signal Principle" — a multi-stage pipeline first introduced in the team's earlier VibeThinker-1.5B work in November 2025. The training unfolds in four major phases. The first is a two-stage supervised fine-tuning process that uses curriculum learning: the model first trains on a broad mixture of math, code, STEM reasoning, general dialogue, and instruction-following data, then shifts to a curated subset of harder, longer-horizon reasoning problems. In the second stage, samples with reasoning traces shorter than 5,000 tokens are discarded, and problems that VibeThinker-1.5B can solve more than 75 percent of the time are filtered out, forcing the model to focus on genuinely difficult challenges. The second phase applies reinforcement learning across multiple domains — mathematics, code, and STEM — using the team's MaxEnt-Guided Policy Optimization algorithm, or MGPO, which prioritizes training on problems at the model's current capability boundary rather than problems it already solves easily or finds impossible. Notably, the team found that a strategy that worked well at the 1.5B scale — progressively expanding the context window during RL training — actually hurt performance at 3B. 
They hypothesize that the stronger starting checkpoint meant that truncating reasoning traces during warm-up was no longer removing noise but disrupting valid reasoning patterns. The solution was to train with a single 64,000-token context window throughout. Within the math RL phase, the team also introduces what it calls "Long2Short Math RL," a secondary optimization stage that redistributes rewards to favor shorter correct solutions over longer ones, reducing verbosity without sacrificing accuracy. The technique uses a zero-sum reward redistribution that avoids biasing the overall reward signal while nudging the model toward more efficient reasoning. The third phase extracts high-quality reasoning trajectories from the RL-trained checkpoints and distills them back into a unified model through supervised fine-tuning. The team uses a "learning-potential score" — essentially the student model's perplexity on each teacher trajectory — to prioritize traces that are correct but that the student has not yet internalized. The final phase, called Instruct RL, applies reinforcement learning on instruction-following tasks using a combination of rule-based validators for format constraints and rubric-based reward models for open-ended quality assessment. Francesco Bertolotti, an AI researcher who flagged the paper early on X, described the approach succinctly: "These results were achieved primarily through post-training refinements on Qwen2.5-Coder. The paper doesn't provide many details, but it appears they distill from RL ckpts and then do a final RL-based instruct RL." His post drew over 161,000 views. Real-world testing reveals the gap between benchmark scores and practical AI performance For every enthusiastic reaction, the paper drew an equally forceful objection. The AI research community in mid-2026 has grown deeply wary of benchmark-driven claims, and VibeThinker-3B arrived in an environment primed for suspicion. 
"The benchmarks are literal pattern matching single file coding," wrote @BigMoonKR on X. "It has no relation to actual coding work. I don't know how people still don't get this." "Benchmaxxing," declared @oflu_bedirhan, using a term that has become shorthand in the AI community for models that appear optimized specifically for benchmark performance at the expense of real-world utility. The most pointed criticism came from users who actually downloaded and tested the model. "Just tried the full precision," wrote @politilols. "It doesn't even know what a uv script (so the most popular Python dev tool) is. Haven't seen that in a single LLM in at least a year now. Benchmaxxed." When Bertolotti responded that the model seemed more focused on mathematical reasoning than practical coding, the user countered: "They include a livecodebench score. Zero chance that is reflective of the model." @Itsdotdev raised a structural criticism: "Look into the benchmarks themselves and it probably won't be so shocking. Why no DeepSWE? Why none of the standard benchmarks SOTA providers use?" The user @AvenirReym posed a more diagnostic question: "If it holds on a benchmark made after the model's training cutoff, it's real. If it only wins on AIME-style sets that have been circulating for years, it's leakage." The paper's authors appear to have anticipated these objections. The technical report states that training sets "have undergone strict benchmark decontamination," including n-gram-based filtering to remove "n-gram overlaps with evaluation sets." The LeetCode contest evaluation — which covers contests from April 25 to May 31, 2026, dates that postdate any plausible training data cutoff — represents the most robust guard against data contamination concerns. On those contests, VibeThinker-3B passed 123 out of 128 first-attempt submissions, a 96.1 percent rate that exceeded GPT-5.2, Doubao Seed 2.0 Pro, Kimi K2.5, and Claude Opus 4.6 under identical evaluation conditions. 
Still, real-world user reports suggest a significant gap between benchmark performance and practical utility — a phenomenon that has become familiar across the industry. "In LM Studio it only responds well to first question, next questions reply to the first question," reported @luismolinaab. Why a social media company may have found a crack in the scaling hypothesis Even the sharpest critics acknowledged that achieving these benchmark numbers at 3 billion parameters — regardless of how transferable they are to production use cases — is a meaningful engineering achievement. "Even if it's benchmaxxing doing so with 3B parameters is fascinating, goes to show how fast this field is progressing," wrote @rohityin. The observation cuts to a question that has consumed the AI industry since the advent of the scaling hypothesis: Is bigger always better? The conventional wisdom, articulated most famously in the Chinchilla scaling laws and reinforced by the commercial dominance of ever-larger foundation models, holds that more parameters and more training data reliably yield better performance. The economic corollary is stark: training and deploying frontier models costs tens or hundreds of millions of dollars, creating enormous barriers to entry. VibeThinker-3B challenges that consensus — but only partially. The paper is careful to draw a boundary around its claims, distinguishing between tasks with "clear verification signals" and those that require broad factual knowledge. The Parametric Compression-Coverage Hypothesis explicitly argues that small models cannot replace large ones across the board. 
"The true significance of VibeThinker-3B does not lie in proving that a 3B model can replace large-scale generalists," the paper states, "but rather in providing a concrete empirical signal: the development of compact models is no longer merely a passive compromise for deployment efficiency or cost control; it emerges as a promising research trajectory that is fundamentally complementary to the traditional parameter scaling paradigm." Perhaps the most surprising element of the work is its provenance. Sina Weibo — publicly traded on Nasdaq and Hong Kong, with a market capitalization that fluctuates in the single-digit billions — is not a company typically associated with frontier AI research. Yet the VibeThinker series is Weibo's second major open-source AI contribution in seven months. VibeThinker-1.5B, released in November 2025, demonstrated that a model with just 1.5 billion parameters could outperform the original DeepSeek R1 on several math benchmarks — a result the team achieved for what it claimed was a post-training cost of just $7,800, compared to the $294,000 estimated for DeepSeek R1. The research team is compact — nine authors, all listed as Sina Weibo Inc. employees. The model is released under the MIT License, one of the most permissive open-source licenses available, and the weights are freely downloadable from both Hugging Face and ModelScope. Within the first day of release, community members had already created GGUF quantizations and derivative models. Small models, big implications, and the question the AI industry can no longer avoid The most honest assessment of VibeThinker-3B may be that it is simultaneously less and more than what the benchmarks suggest. Less, because a model that struggles with basic knowledge of popular developer tools is unlikely to replace any production-grade coding assistant anytime soon. 
More, because the underlying insight — that reasoning ability and factual knowledge are partially decoupled, and that the former can be compressed far more aggressively than previously assumed — has profound implications for how the industry thinks about model design, deployment economics, and the accessibility of advanced AI capabilities. If the Parametric Compression-Coverage Hypothesis holds, it suggests a future in which small, specialized reasoning engines operate alongside large knowledge-rich models in hybrid architectures — a vision where a 3-billion-parameter model handles the logical heavy lifting while a larger system supplies the factual grounding. Such an architecture could dramatically reduce the cost of deploying AI reasoning capabilities, potentially bringing competition-level mathematical and coding performance to devices with modest hardware. "The interesting part is that we're starting to separate knowledge from reasoning," wrote @RealLambdaFlux on X. "A small model with strong post-training can punch way above its size on tasks with clear feedback." @cmitsakis suggested the practical endgame: "I think small models are the future for agents because they can use tools to get the knowledge and they can run fast and cheap." Whether that future arrives through VibeThinker-3B specifically, or through the dozens of teams now racing to reproduce and extend these results, the paper has already accomplished something that no benchmark score can fully capture. It has forced the AI community to confront an uncomfortable possibility: that for years, the industry may have been spending billions of dollars scaling up parameters to improve a kind of intelligence that could have fit, all along, on a laptop. The weights are public. The code is open. And the most important test isn't on any leaderboard — it's whether anyone can make a model this small actually useful in the real world.
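The second-stage data filter described in the training pipeline above can be sketched in a few lines. This is a minimal illustration, not the team's code: the two thresholds (5,000 tokens and a 75 percent solve rate) come from the article, while the `Sample` record and its field names are hypothetical.

```python
from dataclasses import dataclass

# Thresholds as reported in the article.
MIN_TRACE_TOKENS = 5_000       # traces shorter than this are discarded
MAX_SMALL_MODEL_PASS = 0.75    # problems solved more often than this are dropped


@dataclass
class Sample:
    # Hypothetical record layout, introduced only for this sketch.
    problem_id: str
    trace_tokens: int              # length of the reasoning trace in tokens
    small_model_pass_rate: float   # how often VibeThinker-1.5B solves the problem


def keep_for_stage_two(sample: Sample) -> bool:
    """Keep only long reasoning traces on problems the smaller model finds hard."""
    long_enough = sample.trace_tokens >= MIN_TRACE_TOKENS
    hard_enough = sample.small_model_pass_rate <= MAX_SMALL_MODEL_PASS
    return long_enough and hard_enough


def filter_stage_two(samples):
    """Return the curated subset used for the harder second SFT stage."""
    return [s for s in samples if keep_for_stage_two(s)]
```

Both conditions must hold, so a long trace on an easy problem is dropped just like a short trace on a hard one.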

Today, Chinese AI startup Z.ai (formerly Zhipu AI) announced the immediate release of GLM-5.2, a 753-billion parameter open-weights large language model (LLM) engineered specifically to dominate "long-horizon" autonomous coding and engineering tasks. Available immediately on Hugging Face, the Z.ai API, and more than 20 third-party coding environments, the model boasts a highly stable 1-million-token context window alongside enterprise subscription tiers starting at just $12.60 per month. In excellent news for cost and security-conscious businesses, z.ai has released GLM-5.2's core weights under an unrestricted MIT open-source license, allowing enterprises to download the model freely from Hugging Face, customize or fine-tune it to their liking, and run it potentially locally or via virtual machines for only the cost of their compute and electricity. This is an increasingly appealing option for enterprises, as state-of-the-art American proprietary models face an uncertain and potentially interrupted regulatory future, following the Trump Administration's export control directive last week prohibiting foreign nationals from using Anthropic's new Claude Fable 5 model (which that company responded to by taking the models in question entirely offline for all users). For enterprise technical decision-makers, z.ai's GLM-5.2 provides a highly capable path to host frontier-level AI locally, entirely bypassing the geographic fencing and commercial limitations. IndexShare re-uses one indexer for every four sparse attention layers, reducing compute needs Under the hood, GLM-5.2 operates with 753 billion parameters and introduces a major architectural optimization called "IndexShare". In standard massive language models, recalculating attention mechanisms across long documents is computationally exorbitant. IndexShare solves this by reusing the identical indexer across every four sparse attention layers. 
At the maximum 1-million-token context length, this single innovation cuts per-token compute (FLOPs) by a factor of 2.9. The model also features an upgraded Multi-Token Prediction (MTP) layer for speculative decoding, which boosts accepted token length by up to 20% during inference. Additionally, Z.ai has implemented flexible, selectable "Thinking Modes". Users can toggle the model's reasoning effort between "Max," designed to push the limits of logical problem-solving, and "High," which strikes a careful balance between high-end performance and latency-sensitive token efficiency.

State-of-the-art benchmarks for an open model, and matching, even beating proprietary leaders on some categories

On industry-standard third-party benchmark tests, GLM-5.2 performs above most open source flagship models, including DeepSeek-V4, and scores near or above its closed-weights rivals, OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.8. The model particularly shines in agentic tool use and long-horizon software engineering tasks:

SWE-bench Pro: GLM-5.2 scored 62.1, decisively beating GPT-5.5 (58.6) and its own predecessor, GLM-5.1 (58.4).

FrontierSWE (Dominance): Designed to test long-horizon task completion, GLM-5.2 hit 74.4%, surpassing GPT-5.5 (72.6%) and finishing in a near-tie with Claude Opus 4.8 (75.1%).

MCP-Atlas: On this tool-usage evaluation, GLM-5.2 scored 77.0, outscoring GPT-5.5 (75.3) and landing just shy of Claude Opus 4.8 (77.8).

Humanity's Last Exam (w/ Tools): When equipped with external tools, GLM-5.2 reached a score of 54.7, coming out ahead of GPT-5.5 (52.2) and tracking closely behind Claude Opus 4.8 (57.9).

PostTrainBench & SWE-Marathon: In extended, multi-hour engineering workloads, GLM-5.2 consistently topped GPT-5.5, scoring 34.3% against GPT-5.5's 25.0% on PostTrainBench, and 13.0% against GPT-5.5's 12.0% on SWE-Marathon.
While GLM-5.2 trails Claude Opus 4.8 and GPT-5.5 slightly on raw Terminal-Bench 2.1 scores (81.0 versus 85.0 and 84.0, respectively), it significantly outscores Google's Gemini 3.1 Pro (74.0). Beyond traditional coding metrics, GLM-5.2 took an impressive first place on the crowdsourced design task benchmark Design Arena, beating out even the aforementioned state-of-the-art Claude Fable 5 with an ELO score of 1360. Furthermore, the impact of Z.ai's new selectable "thinking modes" is clearly visible in the data: under the "Max" effort level, GLM-5.2 pushes to peak intelligence, but utilizes nearly 85k output tokens per task. Switching to the "High" effort setting sacrifices only a few points in performance while effectively halving the required token output, providing a crucial optimization lever for latency-sensitive applications. Available via Coding Plans and API To operationalize the model, Z.ai launched the GLM Coding Plan, aiming squarely at developer workflows rather than simple chat interfaces. The plan offers out-of-the-box support for third-party U.S. and global agentic coding harnesses and tools including Claude Code, OpenClaw, Cline, Kilo Code, Crush, and Factory, among others. The Coding Plan pricing tiers (when billed annually) are highly competitive: Lite: $12.60 per month ($151.20 per year starting in the 2nd year), geared toward lightweight iteration on small repositories. Pro: $50.40 per month for day-to-day development on mid-sized repositories, offering 5x the usage allowance of the Lite plan. Max: $112.00 per month for heavy workloads, offering 20x the Lite usage and dedicated resources during peak hours. For enterprise developers integrating the raw model into their own applications, Z.ai's API pricing undercuts its Western rivals significantly while matching the exact rates of the previous GLM-5.1 generation. 
GLM-5.2 API access is priced at $1.40 per million input tokens and $4.40 per million output tokens, making it a mid-priced model globally.

VentureBeat Frontier AI Model API Pricing Snapshot

Sorted by total cost (input + output) from least to most expensive. Pricing shown is standard pay-as-you-go pricing per 1 million tokens.

Model                               Input     Output    Total Cost   Source
MiMo-V2.5 Flash                     $0.10     $0.30     $0.40        Xiaomi MiMo
deepseek-v4-flash                   $0.14     $0.28     $0.42        DeepSeek
deepseek-v4-pro                     $0.435    $0.87     $1.305       DeepSeek
MiniMax-M3                          $0.30     $1.20     $1.50        MiniMax
Gemini 3.1 Flash-Lite               $0.25     $1.50     $1.75        Google
Qwen3.7-Plus                        $0.40     $1.60     $2.00        Alibaba Cloud
MiMo-V2.5                           $0.40     $2.00     $2.40        Xiaomi MiMo
Grok 4.3 (low context)              $1.25     $2.50     $3.75        xAI
MiMo-V2.5 Pro (≤256K)               $1.00     $3.00     $4.00        Xiaomi MiMo
Kimi-K2.6                           $0.95     $4.00     $4.95        Moonshot/Kimi
GLM-5.2                             $1.40     $4.40     $5.80        Z.ai
Grok 4.3 (high context)             $2.50     $5.00     $7.50        xAI
MiMo-V2.5 Pro (>256K)               $2.00     $6.00     $8.00        Xiaomi MiMo
Qwen3.7-Max                         $2.50     $7.50     $10.00       Alibaba Cloud
Gemini 3.5 Flash                    $1.50     $9.00     $10.50       Google
Gemini 3.1 Pro Preview (≤200K)      $2.00     $12.00    $14.00       Google
GPT-5.4                             $2.50     $15.00    $17.50       OpenAI
Gemini 3.1 Pro Preview (>200K)      $4.00     $18.00    $22.00       Google
Claude Opus 4.8                     $5.00     $25.00    $30.00       Anthropic
GPT-5.5                             $5.00     $30.00    $35.00       OpenAI
Claude Fable 5 / Claude Mythos 5    $10.00    $50.00    $60.00       Anthropic

To further optimize costs for long-context workloads, Z.ai offers a cached input rate of just $0.26 per million tokens, alongside a limited-time offer for free cached input storage. The stark contrast between open-weights innovators and proprietary Western labs has not gone unnoticed by the developer community. On X, prolific AI observer Lisan al Gaib (@scaling01) argued that "frontier labs are absolutely scamming you on API pricing".
The post noted that while massive open models like the 753-billion-parameter GLM-5.2 charge $4.40 per million output tokens and DeepSeek-V4-Pro (1.6 trillion parameters) charges just $0.87, proprietary models demand heavy premiums: Anthropic's Sonnet 4.6 and Opus 4.8 charge $15.00 and $25.00 respectively, while OpenAI's GPT-5.5 costs $30.00 for output. Highlighting that open-model developers are operating profitably without relying on the newest "fancy Blackwell chips," the commentator suggested that leading proprietary labs are "probably at 90%+ margins at this point".

The beauty of the unmodified MIT License for enterprise use

The most disruptive aspect of the GLM-5.2 release is its licensing. Z.ai released the model's weights under an MIT open-source license, establishing it as a "Pure Open" system. The company’s technical documentation explicitly notes that this license guarantees "no regional limits" and allows "technical access without borders". For enterprise technology leaders, an MIT license means the software can be used, modified, and commercialized without paying royalties or adhering to restrictive "acceptable use" governance policies common to dual-use licenses. It allows engineering teams to host frontier-level AI on their own sovereign infrastructure, entirely eliminating vendor lock-in.

Warm reception among AI developers and toolmakers

The developer reaction to the release has been immediate and overwhelmingly positive. The team behind Kilo Code confirmed day-one integration, posting on X: "GLM-5.2 runs in Kilo Code on day one. The 1M context window and Max effort mode are both live. Point your config at it and go!". Open-source coding environment Cline IDE echoed this sentiment on X, noting the economic advantage: "GLM-5.2 is the first open-weights model to cross 80% on Terminal-Bench, and beats every other open model available. It also beats Gemini, making it a frontier-level model for a fraction of the cost. Open weights is back.
This model is a game changer. Available in Cline now!". Similarly, rival open source coding desktop agent Eigent AI also tested the model's new capabilities on complex agentic workflows, noting on X: "threw a real long-horizon task: research 30 companies across 6 sectors of the AI infrastructure stack, structure it into JSON, then build an interactive HTML report... where 5.2 pulls ahead: -> plans...".
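To make the pricing gap concrete, here is a small cost calculator built from the per-million-token rates in the pricing snapshot above. It is an illustrative sketch: the function and variable names are hypothetical, the example token counts are assumptions (the 85,000 output tokens echo the "Max" effort figure cited earlier), and the cached-input rate is only modeled for GLM-5.2 because that is the only cached rate the article reports.

```python
# (input, output) prices in USD per 1 million tokens, taken from the snapshot table.
PRICES = {
    "GLM-5.2": (1.40, 4.40),
    "deepseek-v4-pro": (0.435, 0.87),
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}

# Cached input rate reported for GLM-5.2 only.
GLM_CACHED_INPUT = 0.26


def request_cost(model, input_tokens, output_tokens, cached_input_tokens=0):
    """Return the USD cost of one request at standard pay-as-you-go rates."""
    if cached_input_tokens and model != "GLM-5.2":
        raise ValueError("cached-input pricing is only known for GLM-5.2 here")
    if cached_input_tokens > input_tokens:
        raise ValueError("cached tokens cannot exceed total input tokens")
    in_rate, out_rate = PRICES[model]
    fresh_input = input_tokens - cached_input_tokens
    total = (
        fresh_input * in_rate
        + cached_input_tokens * GLM_CACHED_INPUT
        + output_tokens * out_rate
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Assumed workload: 200,000 input tokens and 85,000 output tokens per task.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 200_000, 85_000):.3f}")
```

Under that assumed workload, GLM-5.2 comes to roughly $0.65 per task against about $3.55 for GPT-5.5, and serving 150,000 of the input tokens from cache lowers the GLM-5.2 figure to roughly $0.48.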

For decades, data professionals have struggled with the challenge of managing both operational and analytical databases in a unified approach that doesn't introduce latency and performance degradation. Agents made the problem structural. A system that reasons continuously and acts on live data cannot tolerate a pipeline between itself and the information it needs to act on. At the Data + AI Summit on Tuesday, Databricks announced two products aimed at collapsing that infrastructure. Lakehouse//RT delivers millisecond query latency directly on governed Delta and Iceberg tables, eliminating the dedicated real-time serving tier that enterprises have maintained alongside their lakehouses. LTAP, short for Lake Transactional/Analytical Processing, stores Postgres-native transactional data in Delta and Iceberg format from the point of write, removing the ETL pipelines that have connected operational and analytical systems for decades. Reynold Xin, co-founder of Databricks, described a simpler data stack as "the holy grail for agents" in a briefing with VentureBeat, arguing that as users vibe code more applications, the agents reasoning analytically on top of those apps need the underlying infrastructure out of the way to move fast. "The agents really prefer a much simpler stack, because they can move way faster," he said.

LTAP bets on storage-layer unification where HTAP tried engine convergence

Many vendors have tried various approaches over the decades to unify analytical and transactional data. Back in 2014, analyst firm Gartner coined the term HTAP, short for Hybrid Transactional/Analytical Processing, to describe vendors that attempted to unify the two types of databases. MemSQL (now known as SingleStore), SAP HANA and Oracle's MySQL HeatWave are among the many HTAP offerings in the market. LTAP is Databricks' answer to HTAP, using the Lakebase architecture to unify data at the storage layer rather than the engine level.
Lakebase is Databricks' serverless cloud-based PostgreSQL database service that became generally available in February. "HTAP to us is kind of more of a failure of the industry rather than a success," Xin said. The LTAP approach goes to the storage layer instead of the query layer. Lakebase previously stored Postgres data in Postgres format on object storage, requiring conversion before the Lakehouse's analytical engines could use it efficiently. With LTAP, transactional data lands directly in Delta or Iceberg format, sharing the same copy that analytical workloads read. Postgres remains the transactional engine. Spark and the Lakehouse remain the analytical engine. "The whole point is, hey, you use the best tool for the job at the query engine level, we just make sure underlying storage is a single copy of the data," Xin said. The central engineering challenge is latency. Object storage carries response times in the seconds range, far too slow for OLTP workloads that require sub-millisecond performance. Lakebase handles this through a caching layer between Postgres compute instances and object storage. The key design decision is where the column conversion happens: idle CPU capacity in that caching layer performs the row-to-column conversion before data lands in object storage. "When you convert data from row to column, it compresses more than 10 times, typically, so now you substantially reduce the network cost of that basic caching layer between that caching layer and the object stores," Xin said. Lakehouse//RT delivers millisecond query latency on live lakehouse data without a separate serving tier Lakehouse//RT is Databricks' answer to the dedicated real-time serving tier — the separate system enterprises have maintained alongside their lakehouses to handle low-latency queries, at the cost of data copies, split governance and pipeline complexity agents cannot work around. 
Key capabilities of Lakehouse//RT include:

Reyden compute engine: Built specifically for high-concurrency, low-latency serving, Reyden queries Delta and Iceberg tables directly without moving data out of the lakehouse.

Latency and throughput: Lakehouse//RT delivers sub-100ms latency at 12,000 queries per second, with response times as low as 10ms on smaller datasets and up to 16x better performance than existing dedicated serving stacks.

Governance and data access: Every query runs within Unity Catalog's governance framework with no separate permissions layer, no data copies and no ingestion pipelines.

Analysts see the agentic framing and open format approach as the real differentiators

The problem both products address is well-documented among enterprise data teams, but analysts draw a distinction between the pain point and the specific claim Databricks is making. "Enterprises have had HTAP, streaming, cloud warehouses, and operational stores for years," Stephanie Walter, Practice Leader for AI Stack at HyperFRAME Research, told VentureBeat. "What is different is the agentic AI framing." Walter noted that agents need live operational data, historical context, governance, retrieval, and write-back in the same workflow. "That is a strong architecture argument, but Lakebase still has to prove it can meet the latency, reliability, and operational maturity CIOs expect," she said. Mike Leone, analyst at Moor Insights and Strategy, said the path to genuine differentiation is more specific than the unification concept itself. He also noted that open analytics on a data lake is table stakes now, with many vendors providing some sort of service. "The less common move is letting the transactional writes land in open formats too, so the operational database isn't sitting in a proprietary box while only the analytics half is open," Leone told VentureBeat.
He added that the open format approach, paired with Lakehouse//RT querying live data directly off the lake, is what gives the architecture a credible case for retiring a whole row of specialized systems. The technical claim that will face the most scrutiny is also the most central one. "The piece I'd still want their engineers to walk through is how both engines truly share one copy without a quiet conversion step doing the syncing in the middle," Leone said. What this means for enterprises For data engineers evaluating their stack for agentic workloads, the question is no longer which best-of-breed tool to run for each job — it's whether running separate tools at all is still defensible. Enterprises that built separate operational databases, real-time serving tiers and analytical lakehouses could previously treat the gaps between them as a maintenance burden. Agents surface those gaps as an operational risk: a system reasoning across governance boundaries will find the inconsistencies faster than any human team. The market is moving away from specialized serving layers faster than most vendor roadmaps anticipated. According to VB Pulse Q1 2026, a three-wave longitudinal survey of 100-plus employee organizations, hybrid retrieval intent tripled from 10.3% to 33.3% across the quarter while standalone vector database adoption declined across every tracked vendor. The same consolidation logic is now hitting the real-time serving tier. The traditional approach — best-of-breed tools for each workload type, pipelines between them — was built for human-speed analytical consumption. Agent workloads don't tolerate that architecture. "The pain they're pointing at, all the copying and syncing between operational and analytical systems, is real and expensive, and anyone running this at scale feels it," Leone said.
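The row-to-column conversion Xin describes can be illustrated with a toy pivot. This is a generic sketch of the idea, not Lakebase's implementation: the helper names are hypothetical, JSON plus `zlib` stands in for a real columnar format such as Parquet, and the size comparison it allows will vary with the data, so it should not be read as reproducing the "more than 10 times" figure.

```python
import json
import zlib


def rows_to_columns(rows):
    """Pivot a list of same-keyed dicts (row layout) into a dict of lists (column layout)."""
    if not rows:
        return {}
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key in columns:
            columns[key].append(row[key])
    return columns


def columns_to_rows(columns):
    """Inverse pivot, showing that both layouts hold the same single copy of the data."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


def compressed_size(obj):
    """Bytes needed to store obj as zlib-compressed JSON (a stand-in for a real file format)."""
    return len(zlib.compress(json.dumps(obj).encode("utf-8")))


if __name__ == "__main__":
    # Synthetic order records; values within a column repeat far more than values within a row.
    rows = [
        {"order_id": i, "status": "shipped" if i % 3 else "pending", "region": "EU"}
        for i in range(10_000)
    ]
    print("row layout bytes:   ", compressed_size(rows))
    print("column layout bytes:", compressed_size(rows_to_columns(rows)))
```

Grouping each column's values together puts similar data side by side, which is why columnar layouts generally compress better and why doing the pivot before the data crosses the network to object storage can reduce transfer cost.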

One of the assumptions behind today’s AI frameworks is that agents require a “boss” at the center; this orchestrator runs the show, routes requests, and makes sure the whole system doesn’t descend into chaos. That assumption may be wrong, and the cost of carrying it could be measured in inference dollars and coordination latency. A new Stanford framework called a decentralized language model, or DeLM, is built on the premise that agents can coordinate directly, without routing every update through a central controller. DeLM's shared knowledge base serves as a “common communication substrate” so that agents can build upon one another’s verified progress without having to route every interaction through a main agent to “merge, filter, and rebroadcast,” Yuzhen Mao and Azalia Mirhoseini, co-developers of the framework, explain in a research paper. It’s a system that’s not only possible, but desirable in certain instances. “Agents can build on prior findings, avoid repeated failures, preserve constraints, and recover detailed evidence only when needed.” The challenges of traditional multi-agent systems In a typical centralized multi-agent system, a main agent breaks tasks into subtasks, assigns them out to multiple sub-agents in parallel, waits for responses, merges and summarizes intermediate progress, then launches a next wave of orders based on collected context. While this is a natural way to scale LLM reasoning, the Stanford researchers argue that it scales poorly. Every useful finding, partial finding, and failure must be reported back to the main agent, which then determines what information to merge and rebroadcast to the agents below it. “As the number of subtasks grows, this controller becomes a communication and integration bottleneck,” Mao and Mirhoseini write. Further, the main orchestrator may “dilute, omit, or distort” useful information, leading to lost progress. This bottleneck also occurs in long-context reasoning scenarios. 
Once it receives reports back from sub-agents, a main agent will typically group related concepts, data points, and other materials together in an unsupervised learning loop. It may then pre-assign these "evidence clusters" to sub-agents before knowing what surfaced material is actually relevant or whether it’s combined correctly. When a sub-agent receives insufficient context, it will essentially get confused and return to the main agent, kicking off another retrieval or delegation round. “This back-and-forth makes coordination slower, more iterative, and increasingly constrained by a single overloaded main agent,” the researchers write.

What DeLM addresses and how it works

DeLM, by contrast, is built around parallel agents, a shared context, and a task queue. Shared context is essentially a curated store of “gists,” or information summaries that other agents might find useful. These include verified and evidence-based findings alongside partial findings and documented failures; they also point to detailed evidence that agents can pull from based on their specific task. The task queue is a set of pending subtasks that agents can claim independently. “Agents write compact, verified updates into a shared context that later agents can read directly,” the researchers write. Useful findings, failures, and constraints accumulate as a “shared problem state,” rather than passing through a central controller.

The pipeline looks like this:
Initialization: Inputs are broken into work units and added to a queue.
Parallel execution: Agents work independently and in tandem, pulling tasks and reading shared context as they progress.
Compression and verification: Results are compressed into reusable “gists” that are checked against supporting evidence. Only gists that are fully verified are shared with the group.
Additional work (if needed): When the queue is emptied, the last agent to return an answer inspects all the shared context to determine whether further work is required.
Final step: Once the last agent determines that no more steps are required, it returns the final answer.

Agents “exchange progress through shared state, asynchronously claim ready tasks, and scale more adaptively as the number of subtasks grows,” the researchers explain.

How DeLM performs in the wild

With DeLM, agents can avoid redundant exploration; reuse and build on each other’s discoveries and failures; and focus on unresolved issues. The framework can be particularly useful in software engineering test-time scaling, when models are given time to “think” to improve their reasoning and problem-solving capabilities. Different agents can explore their own hypotheses or pursue reasoning paths in parallel, while still sharing intermediate progress. One example is concurrent debugging. DeLM is also suitable for long-context reasoning and multi-document question-answering; agents can simultaneously examine their own evidence clusters (collections of papers, code, or other materials) while maintaining a “global compact view” of accumulated evidence. The researchers contend that it makes agentic tasks more accurate and significantly cheaper. This is backed by its performance on real-world benchmarks: On SWE-bench Verified, which evaluates how well AI models and agents solve real-world software engineering problems, it performed 10.5% better than the strongest baseline and reduced cost per task by roughly 50%. But it can go beyond coding: On LongBench-v2 Multi-Doc QA, which assesses LLMs’ ability to handle long-context, real-world problems, DeLM had the highest accuracy across four model families: GPT-5.4, Claude Sonnet, Gemini Flash, and DeepSeek-V4-Pro. DeLM outperforms the baselines on SWE-bench for a number of reasons, as Mao detailed on X. First, agents share failures.
In ordinary parallel runs, when one agent follows the wrong path, that failure stays private, and subsequent agents may waste time (and money) pursuing the same dead end. But with DeLM, failed hypotheses are written into shared context. “Later agents can read them as constraints, avoid repeated exploration, and redirect their search toward more promising fixes,” Mao said. Additionally, constraints, once verified, are immediately added to agents’ shared context. This means they become a binding shared state. “Later agents inherit them, build around them, and avoid repeating globally invalid simplifications,” Mao said. Crucially, DeLM keeps shared progress compact enough to reuse. It is unfoldable, meaning agents see short gists by default, but can choose to unfold them into more detailed summaries and raw evidence. As the researchers note, providing all raw documents and traces gives agents the maximum amount of information, but that can overwhelm their context windows and ultimately increase costs. “If agents shared full traces, each worker would need to read long command histories, file dumps, failed edits, and intermediate reasoning, turning coordination itself into another long-context bottleneck,” Mao said. On the other hand, while sharing compact summaries is cheaper, important details and evidence can be lost, resulting in less reliable reasoning. Unfolding, therefore, provides “coarse-to-fine” opt-in access. This can improve accuracy and cost. Ultimately, with a framework like DeLM, agents can be more efficient because they are prevented from repeatedly reading the same documents or rerunning the same failed analysis; more effective because useful findings are propagated across parallel threads; and more robust because they only share verified claims. For enterprise builders, DeLM challenges a core assumption: that every multi-agent workflow needs a central controller. 
The SWE-bench Verified and LongBench-v2 results suggest the decentralized model isn't just theoretically cleaner: in those tests it was more accurate and, on SWE-bench Verified, roughly half the cost per task.
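
The coordination pattern described in this article (a queue of work units, parallel agents, and a shared context that accepts only verified gists) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Stanford implementation: `Gist`, `SharedContext`, `run_agent`, `toy_solve`, and `toy_verify` are hypothetical names, the solver is a stand-in for an LLM call, and the follow-up step in which the last agent inspects the context for remaining work is left out.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class Gist:
    """Compact summary plus the detailed evidence it points to."""
    task: str
    summary: str
    kind: str       # "finding" or "failure"
    evidence: str   # raw detail, read only when an agent unfolds the gist


@dataclass
class SharedContext:
    """Shared problem state that agents read and write directly."""
    gists: list = field(default_factory=list)
    lock: threading.Lock = field(default_factory=threading.Lock)

    def publish(self, gist, verify):
        # Only gists that pass verification become visible to other agents.
        if not verify(gist):
            return False
        with self.lock:
            self.gists.append(gist)
        return True

    def summaries(self):
        # Default view: short gists only.
        with self.lock:
            return [g.summary for g in self.gists]

    def unfold(self, index):
        # Opt-in access to the detailed evidence behind one gist.
        with self.lock:
            return self.gists[index].evidence


def toy_solve(task, known_summaries):
    # Hypothetical stand-in for an LLM call that may read prior gists.
    evidence = f"trace for {task}"
    if task.startswith("bad"):
        return Gist(task, f"{task}: dead end", "failure", evidence)
    return Gist(task, f"{task}: ok", "finding", evidence)


def toy_verify(gist):
    # Hypothetical check: the evidence must mention the task it claims to cover.
    return gist.task in gist.evidence


def run_agent(tasks, context, solve, verify):
    # Each agent claims tasks independently until the queue is empty.
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        gist = solve(task, context.summaries())
        context.publish(gist, verify)


def run_delm_sketch(work_units, n_agents=3):
    tasks = queue.Queue()
    for unit in work_units:          # initialization
        tasks.put(unit)
    context = SharedContext()
    agents = [
        threading.Thread(target=run_agent, args=(tasks, context, toy_solve, toy_verify))
        for _ in range(n_agents)
    ]
    for agent in agents:             # parallel execution
        agent.start()
    for agent in agents:
        agent.join()
    return sorted(context.summaries())
```

Note that failures are published alongside findings, so a later agent reading `summaries()` sees "dead end" entries and can steer away from them; no central controller merges or rebroadcasts anything.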

Microsoft CEO Satya Nadella published a sweeping essay on Sunday laying out what he describes as the defining economic challenge of the AI era: the risk that a handful of frontier models will absorb the expertise of entire industries and commoditize it, leaving businesses stripped of their competitive moats. "The last thing any of us want is a world where every company across every sector is ceding value to a few models that eat everything they see," Nadella wrote in the piece, titled "A frontier without an ecosystem is not stable," which he posted on X. "If all the value is accrued by only a few models, the political economy will simply not tolerate it. There is no societal permission for an AI future that hollows out entire industries." The essay is unusually philosophical for a sitting CEO of a $3 trillion technology company. But it arrives at a moment when the theoretical risks Nadella describes are becoming tangible — and, critically, when Microsoft itself is grappling with the very dynamics he warns about. Nadella introduces "token capital" as the new currency of enterprise AI strategy At the center of Nadella's essay sits a conceptual framework built on two pillars he calls "human capital" and "token capital." Human capital, he writes, "comprises the knowledge, judgment, relationships, ingenuity, and pattern recognition of its people," while token capital refers to "the firm's AI capability it builds and owns." The two are not in tension, he insists. "Importantly, human capital does not become less valuable as token capital grows. It only becomes more valuable!" he writes. "I believe human agency will be the driver of token capital growth. Humans will set ambitious goals, connect dots across domains, build relationships, and recognize patterns that matter most. Without human direction, you have compute running in circles." 
This framing is a deliberate counterweight to the narrative that AI will simply replace human workers or, at the enterprise level, dissolve the intellectual property that differentiates one company from another. Nadella is arguing that the real danger is not AI's capability but its tendency to centralize — and that the solution requires a fundamentally new architecture for how businesses interact with the technology. He describes the real opportunity as "not in picking the best model but instead in building a learning loop on top of models where human capital and token capital compound." The key test of a company's sovereignty in this new era, he writes, is whether it can "switch out a 'generalist' model without losing the 'company veteran' expertise built into their learning system." This is the essay's most actionable claim — and its most provocative. Nadella is telling enterprises they need to decouple their institutional intelligence from whatever frontier model they happen to be running, creating portable knowledge systems that survive vendor changes. Why Nadella is comparing AI concentration to the outsourcing crisis that gutted industrial economies Nadella draws a pointed historical parallel to make his warning concrete. "Think about what happened in the first phase of globalization where entire industrial economies were hollowed out by outsourcing," he writes. "The GDP numbers looked fine on the surface, but the displacement was real and the consequences are still being felt. Let us not bring that dynamic into the AI era, with a small number of AI systems capturing all the economic returns, while entire industries find their knowledge commoditized right out from underneath them." The globalization analogy is not accidental. It reframes the AI concentration debate from a narrow technology question into a political-economy argument — one that regulators, policymakers, and voters can grasp. 
By invoking the social costs of offshoring, Nadella is signaling that the stakes extend well beyond the enterprise technology stack. He is warning that if the AI industry fails to distribute value broadly, the political system will intervene to force the issue. "In my view, our priority has to be building a frontier ecosystem, not just a frontier model, so value flows broadly across every company, every industry, and every country," he writes. He grounds this in an older platform philosophy: "This is the ethos I've grown up with where platforms enable more value on top than is captured inside, and where every company can continuously innovate and build value of its own." It is a direct echo of the Windows-era argument, updated for the age of inference — and it carries a similarly self-interested subtext, given that Microsoft's cloud business sits squarely in that platform layer. Microsoft's own runaway AI costs reveal the gap between Nadella's vision and operational reality What makes Nadella's essay so striking is its timing. He published it on a day when Reuters reported that Microsoft shareholders filed a proposed class-action lawsuit in Seattle federal court, accusing the company of inflating its stock price by failing to disclose slowing growth in its Azure cloud business and the need to spend billions of dollars on AI infrastructure. The suit names Nadella and Chief Financial Officer Amy Hood among the defendants. As the Yahoo Finance report on the lawsuit noted, Microsoft allegedly "aggressively promoted its AI developments, specifically its 'Copilot' assistant and close financial alliance with ChatGPT creator OpenAI, to artificially boost investor optimism," while understating infrastructure strain and capital risks. Microsoft also reported $37.5 billion of capital spending in its second quarter, up nearly 66% from a year earlier and above the $34.3 billion that analysts projected. 
Microsoft's internal cost pressures around AI have surfaced in other concrete ways this year. The company is canceling the majority of its internal Claude Code licenses in its Experiences and Devices division, effective June 30, 2026. Monthly usage rates reached 84 to 95% by April 2026, and per-engineer API costs ranged between $500 and $2,000 monthly, according to Windows Forum. The cancellation came after Microsoft exhausted portions of its annual AI budget due to token-based billing, as Fortune had reported in May. The Claude Code episode illustrates, at the micro level, the exact dynamic Nadella describes at the macro level. When a company's AI usage is metered by the token — the fundamental unit of compute that powers model inference — the more productive the tool becomes, the more expensive it gets. The term "token capital" in Nadella's essay carries a double meaning: it refers both to a firm's proprietary AI capability and, implicitly, to the actual tokens consumed in running it. Building a learning loop that compounds is aspirational. Paying the bills for that loop is operational reality. Uber, Meta, and Amazon are all hitting the same AI spending wall — and it validates Nadella's warning Microsoft is not alone in this bind. Uber burned through its entire 2026 AI coding tools budget in just four months after incentivizing employees to adopt the technology through an internal leaderboard ranking teams by total AI tool usage. Uber has since instituted a monthly $1,500 cap per employee per agentic coding tool, according to TechCrunch. At Meta, an employee created a leaderboard called "Claudeonomics" to track which workers consumed the most AI tokens. Amazon, meanwhile, has pushed employees to "tokenmaxx" — use as many AI tokens as possible. 
The emerging pattern is clear: enterprises adopted AI coding tools aggressively, saw genuine productivity gains, and then discovered that the consumption-based economics of frontier models created budget crises that traditional software licensing never would have. Bryan Catanzaro, vice president of applied deep learning at Nvidia, captured the tension bluntly in an interview with Axios: "For my team, the cost of compute is far beyond the costs of the employees," he said. These cost dynamics land differently in the context of Nadella's essay. He prescribes a three-layer architecture — evaluation, reinforcement learning, and retrieval — designed to sit between a company's workforce and whatever frontier model it subscribes to. Companies, he argues, need to build "private evals" that "capture whether a model is actually improving against outcomes that matter to the business (not just external benchmarks!)," alongside "private reinforcement learning environments" that "let models grow stronger on real traces from inside the organization" and a knowledge base that "makes institutional memory queryable and use of tokens more efficient." He calls the resulting system "a hill climbing machine" that, "unlike most assets, it compounds." Other Big Tech CEOs are echoing Nadella's fears about AI models devouring enterprise knowledge Nadella's concerns do not exist in isolation. Other technology leaders have been raising similar warnings throughout 2026, though none have offered as prescriptive a response. Snowflake CEO Sridhar Ramaswamy warned in a February podcast that the biggest software companies risk being reduced to mere data sources. "The big model makers want to create a world in which all of the data for all of the enterprises is easily available to them," Ramaswamy said, describing everything else as "a dumb data pipe that feeds into that big brain." 
He added that Snowflake needs to operate with a "fear" that enterprises would abandon software-specific AI agents in favor of all-inclusive agents that hoover up data from everywhere. Box CEO Aaron Levie struck a similar note in a January LinkedIn post. AI models can now perform high-level knowledge work across nearly every profession, from law to strategy to scientific research, he argued. "The question that we will have to wrestle with is, in a world where everyone has access to the same expert intelligence, how does a company differentiate?" he wrote. The combined effect of these statements is a shared diagnosis from three very different corners of the enterprise technology market: the current trajectory of AI development threatens to collapse competitive differentiation across entire industries. Nadella's essay stands apart from the others because it moves beyond diagnosis and proposes a specific architectural remedy. But the prescription is impossible to separate from the prescriber's interests. Microsoft sits in precisely the platform layer that Nadella's framework would make indispensable: the company builds its own frontier models, operates the cloud infrastructure those models run on, and maintains deep partnerships with the leading independent AI labs. A world in which every enterprise builds a proprietary learning loop on top of commodity foundation models is, conveniently, a world in which Microsoft sells the picks and shovels to all of them.

Nadella's Scout controversy and shareholder lawsuit reveal the tension inside Microsoft's own AI strategy

The essay also arrives just ten days after Nadella publicly rebuked one of his own executives for outlining a plan to "make people addicted" to a new AI tool called Scout.
Microsoft corporate vice president Omar Shahine had written an internal memo describing a three-phase plan to transform Scout "from addictive app to agentic platform," with the first phase focused on features that "make people depend on it daily." Nadella responded on an internal message board: "This is absolutely a non-goal! If anything we are doing the exact opposite. We want to make sure AI empowers and adds real value to human endeavor and broad economic growth!" The Scout incident and Sunday's essay together suggest Nadella is actively constructing a public philosophy of AI that emphasizes broad value creation over extractive engagement — whether or not every corner of Microsoft has internalized that message. One anonymous Microsoft employee told 404 Media, as the Post reported, that the leaked Scout document was "very troubling," adding: "It feels like one of those 'saying the quiet part out loud' moments." For technical decision-makers evaluating Nadella's essay, the practical implications are significant. He is arguing that choosing an AI model matters less than building the learning infrastructure around it. He is arguing that the ability to swap models without losing institutional intelligence is the critical test of AI sovereignty. And he is warning that companies that fail to build these systems will find their expertise absorbed and commoditized by the models themselves. "You can offload a task, or even a job, but you can never offload your learning," Nadella writes. "The future of the firm is the ability to compound that learning across people and AI." The question Nadella's essay cannot answer is whether Microsoft will practice what its CEO preaches Whether Nadella's vision materializes depends on a question his essay carefully sidesteps: whether the platform providers who build and host the frontier ecosystem will resist the temptation to capture the value flowing through it. 
Nadella insists that "platforms enable more value on top than is captured inside." But Microsoft's own trajectory this year (the ballooning capital expenditures, the Claude Code budget crisis, the shareholder lawsuit alleging concealed costs, the internal memo about making users addicted) suggests the economics of restraint are harder than the philosophy of restraint. Nadella ends his essay with the claim that broad value distribution "is the stable equilibrium we should build together." He may be right. Ecosystems have historically outperformed walled gardens over long time horizons. But stable equilibria require every major player to forgo short-term extraction in favor of long-term compounding, and right now the AI industry is burning through annual budgets in four months and, in Microsoft's case, spending nearly 66% more on infrastructure than a year earlier, above what analysts projected. The CEO of the world's most valuable technology company has written an eloquent argument for why the AI economy needs to work differently. The open question is whether his own company's balance sheet will let him prove it.

Tokyo-based AI startup Sakana AI has officially launched its first commercial product, Sakana Marlin. Billed as a "Virtual CSO" (Chief Strategy Officer), Marlin is an autonomous, B2B research agent that deliberately abandons the instantaneous text generation of modern chatbots in favor of deep, long-horizon reasoning. What sets Marlin apart from the current ecosystem of AI tools is its temporal scale: instead of returning an answer in seconds, it runs continuous, self-governing reasoning loops for up to eight hours at a time to deliver deeply researched, well-cited, 100-page strategy reports and executive slides. The company posted sample reports generated by Marlin on its product website. Available immediately via the company’s website with pricing starting at a pay-as-you-go tier, the platform is designed strictly for enterprise use, specifically targeting corporations, financial institutions, and think tanks. The generative AI hype cycle has largely been defined by speed. For the past two years, the industry standard has been the ability to generate a poem, a line of code, or a surface-level summary in mere seconds. But the enterprise frontier is rapidly shifting from shallow, rapid generation to deep, methodical reasoning. With Marlin, major businesses are no longer asking how fast an AI can answer, but how deeply it can think.

The Product: A Virtual CSO

What exactly is a business getting when it deploys Sakana Marlin? The workflow is fundamentally different from typical large language model (LLM) interactions. Rather than engaging in a tedious back-and-forth prompt engineering session, the user simply provides a core research topic. Following a brief initial exchange to sharpen the scope and direction of the investigation, the human steps away entirely. For the next several hours, Marlin operates as a self-contained digital strategy team.
It formulates its own initial hypotheses, navigates the web to gather data, cross-references sources to verify findings, and maps the causal dynamics within complex business environments. It is effectively searching for the "winning formula" within a sea of noise. Think of it less like a search engine and more like a junior strategy consultant locked in a room with a whiteboard and an internet connection. You provide the strategic prompt in the morning, and by the end of the workday, the system delivers a comprehensive, professional-grade portfolio. In Marlin's case, the final output is not a generic text blob; it is a structured set of strategic options, complete with executive summary slides, appendices, references, and a deeply researched report. The company highlighted several real-world use cases to demonstrate Marlin's capacity for complex synthesis, including generating detailed resolution scenarios for a theoretical blockade of the Strait of Hormuz, mapping out the fragmented global AI regulation patchwork, and analyzing macroeconomic trends like the return of "bond vigilantes". Sakana says Marlin relies on multiple AI models, but did not provide specific model names or providers. I've reached out on X to find out more and will update when I receive a response. The Engine of Long-Horizon Reasoning Under the hood, Marlin is the commercial culmination of Sakana AI’s extensive laboratory breakthroughs over the past two years. The product is powered by an exploration engine relying on Sakana's own prior research breakthrough, Adaptive Branching Monte Carlo Tree Search (AB-MCTS), and leverages frameworks derived from "The AI Scientist," an earlier Sakana AI research project featured in the journal Nature that successfully automated the scientific discovery process from ideation to peer review. To understand how this works in practice, consider a real-world analogy: modern chess engines. 
When a computer plays chess, it doesn't just look at the board and guess; it plays out thousands of potential future moves, evaluating the strength of each resulting position before committing to an action. Marlin’s AB-MCTS engine does something similar for research. Inside the Engine: The Mechanics of AB-MCTS The chronology of this technology traces back to June 2025, when Sakana AI first introduced the framework to the public alongside the research paper “Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search”. At that time, to encourage developer experimentation with collective AI intelligence, the company released the underlying algorithm as an open-source software library called TreeQuest, distributed under the permissive Apache 2.0 license. This open-source milestone laid the technical foundation for what would eventually evolve into the proprietary, enterprise-grade Marlin product a year later. Traditionally, when developers attempt to extract higher-quality reasoning from large language models, they rely on a brute-force method called "repeated sampling"—essentially running the model dozens of times in parallel and hoping one of the answers is correct. However, repeated sampling operates blindly; it cannot evaluate its own intermediate steps or pivot based on external feedback. AB-MCTS replaces this paradigm with a principled, multi-turn approach driven by a Bayesian decision framework. As the AI constructs a strategy report, the system treats the research process as a branching tree of possibilities. At each node of the tree, the algorithm dynamically balances two distinct behaviors based on external feedback signals: Going Wider (Exploration): Spawning entirely new, alternative hypotheses or candidate responses when the current path yields diminishing returns or unresolved contradictions. 
Going Deeper (Exploitation): Methodically refining, auditing, and building upon an existing candidate solution that shows high strategic promise. What transforms this from a laboratory experiment into a commercial engine is its extension into Multi-LLM AB-MCTS. Sakana AI’s architecture introduces a critical third dimension to the search tree: the ability to dynamically choose which model to invoke for a specific sub-task, treating the industry’s leading frontier models as a plug-and-play collective intelligence network. According to technical documentation published by the company, the engine can coordinate highly heterogeneous models—allowing an orchestration model to delegate initial ideation to one LLM, while utilizing a reasoning-heavy model to audit, verify, and correct intermediate errors generated earlier in the search tree. By scaling up compute at inference time—leveraging the distinct "personalities" and strengths of multiple foundation models over thousands of automated cycles—AB-MCTS provides the mathematical guardrails Marlin requires. It ensures that the resulting 100-page strategy reports are not merely long-winded AI generations, but the highly vetted product of systemic, automated trial-and-error. Licensing, Data, and Enterprise Implications It is crucial to note that Sakana Marlin is distinctly not a general consumer tool; it is a commercial software-as-a-service (SaaS) offering restricted to corporate entities, organizations, and sole proprietors. For enterprises, licensing and data handling terms are often the determining factors in software adoption. Unlike many consumer-grade AI tools that silently harvest user inputs and proprietary data to train future foundational models, Sakana Marlin operates under a strict, enterprise-grade data policy. Neither Sakana AI nor its external AI service providers will use customer data or inputs for model training or fine-tuning unless the client provides explicit opt-in consent. 
Even with consent, data is heavily processed to remove personally identifiable information. This closed-loop security is absolutely vital for companies handling sensitive M&A research, unreleased product strategies, or proprietary market analyses. The commercial licensing is structured into tiered pricing models that reflect its enterprise nature:
Pay-as-you-go: Users can purchase credits on demand, with a single run costing 100 credits, and add-on credits priced at ¥98 ($0.61 USD) each.
Pro Plan: At ¥150,000 ($935.68 USD) per month, businesses receive 2,000 credits, bringing down the cost of add-on credits to ¥90 ($0.56 USD).
Team Plan: Geared toward larger departments, this ¥400,000 ($2,495.14 USD) per month tier includes 6,000 credits, lowering add-on costs to ¥85 ($0.53 USD) per credit.
Enterprise: Fully custom quotes with dedicated support and customized credit allocations.

Why Sakana Is Worth Watching

Sakana AI’s transition into a commercial enterprise powerhouse is rooted in the pedigree of its founders, who famously helped spark the current generative AI boom. Formed in Tokyo in 2023, the startup was co-founded by Llion Jones, a co-author of Google’s seminal 2017 “Attention Is All You Need” paper who coined the term “transformer,” and David Ha, a former Google Brain researcher and head of research at Stability AI. The decision to build a new laboratory outside the Silicon Valley bubble was a deliberate rejection of the current AI ecosystem. At a TED AI conference in late 2025, Jones candidly expressed that he was "absolutely sick" of transformers, warning that the intense pressure from investors and the hyper-fixation on scaling single, monolithic models had calcified the industry's creativity and blinded researchers to the next major breakthrough. To break free from this "big company-itis," Jones and Ha structured Sakana AI around principles of biomimicry and evolutionary computing.
The company's name, derived from the Japanese word for fish, reflects its core technical philosophy: leveraging collective intelligence similar to schools of fish, ant colonies, or insect swarms. Rather than attempting to build one massive, do-it-all foundation model, Sakana’s research has consistently focused on deploying networks of smaller, specialized models that collaborate dynamically to adapt to complex environments. This philosophy posits that by treating individual AI models as members of a "dream team" with complementary strengths, systems can achieve more robust and cost-effective reasoning than relying on sheer scale alone. This nature-inspired approach quickly yielded dividends in rigorous, competitive testing. Sakana AI has made significant strides in "inference-time scaling"—allocating computational resources during the problem-solving phase to allow models to think, iterate, and refine their own answers over extended periods. In early 2026, the company’s ALE-Agent took first place in the highly complex AtCoder Heuristic Contest (AHC058), a combinatorial optimization challenge, outperforming over 800 top-tier human programmers by autonomously rebuilding and testing hundreds of solutions over a four-hour window. Similarly, Sakana introduced "RL Conductor," a small 7-billion-parameter model trained via reinforcement learning specifically to orchestrate and delegate tasks among a diverse pool of worker models—ranging from GPT-5 to Claude Sonnet 4—achieving state-of-the-art results on reasoning benchmarks at a fraction of traditional computing costs. Sakana's rapid evolution from a disruptive research lab to a commercial software provider has attracted intense attention from global financial heavyweights. By late 2025, the Tokyo-based startup secured a massive Series B funding round that pushed its post-money valuation past $2.6 billion, cementing its status as one of Japan’s most highly valued private tech companies. 
The firm boasts a sprawling roster of strategic investors, including early venture backers Khosla Ventures, Lux Capital, and New Enterprise Associates (NEA), alongside industry titans like Nvidia and Google. As Sakana has expanded its focus toward mission-critical sectors like defense and finance, it has also drawn investments from major global banking institutions like Mitsubishi UFJ Financial Group (MUFG) and Citi, as well as enterprise tech giant Salesforce, positioning the startup to actively reshape corporate AI infrastructure from the ground up. Community Reactions and Field Testing Sakana AI’s shift toward commercial, long-horizon agents did not happen in a vacuum. The company ran a rigorous closed beta test beginning in April 2026, putting the tool in the hands of approximately 300 professionals across financial institutions, consulting firms, and think tanks. The feedback underscores a stark qualitative difference between standard generative chatbots and Marlin’s autonomous, fact-driven approach. A senior consultant at a major Tokyo consulting firm noted that the tool "exceeded expectations by discovering angles we hadn't even imagined," praising its ability to match human comprehensiveness while stripping away human bias. Meanwhile, a cybersecurity division at a major Japanese IT system integrator lauded the system for providing "a highly convincing report driven by high-quality, primary research," rather than relying on recycled secondary sources. On social media, the company’s announcement resonated with the broader tech community's growing appetite for autonomous agents. As the AI industry matures, the value proposition is clearly shifting. Tools that act as fast, conversational encyclopedias are becoming commoditized. With Sakana Marlin, the focus moves entirely to separating the heavy lifting of thinking from the final act of deciding. 
By delegating the exhaustive mapping of causal dynamics to an agent capable of sustained reasoning, human executives are free to do what they do best: take action.
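
The "wider or deeper" decision described earlier in this article can be illustrated with a small Python sketch. It is a deliberately simplified, flat version that uses Thompson sampling over Beta posteriors to choose between generating a fresh candidate and refining an existing one; it is not Sakana's implementation, does not use the TreeQuest API, and omits both the tree structure and the multi-model selection. The function names, the toy scoring problem, and the prior values are all assumptions made for illustration.

```python
import random


def adaptive_branching_search(generate, refine, score, budget, seed=0):
    """Flat 'wider or deeper' search driven by Thompson sampling.

    generate(rng) returns a new candidate, refine(candidate, rng) returns a
    revised one, and score(candidate) returns feedback in [0, 1]. All three
    are caller-supplied stand-ins for model calls and external feedback.
    """
    rng = random.Random(seed)
    wider_stats = [1.0, 1.0]   # Beta(alpha, beta) belief about fresh attempts
    candidates = []            # each entry: {"answer", "score", "stats"}

    for _ in range(budget):
        # Sample each option's posterior once and act on the largest draw.
        best_draw = rng.betavariate(*wider_stats)
        chosen = None          # None means "go wider"
        for cand in candidates:
            draw = rng.betavariate(*cand["stats"])
            if draw > best_draw:
                best_draw, chosen = draw, cand

        if chosen is None:
            answer, stats = generate(rng), wider_stats               # wider
        else:
            answer, stats = refine(chosen["answer"], rng), chosen["stats"]  # deeper

        s = score(answer)
        # Bayesian update of the option that was just tried.
        stats[0] += s
        stats[1] += 1.0 - s
        # The new candidate starts with a belief seeded by its own score.
        candidates.append({"answer": answer, "score": s, "stats": [1.0 + s, 2.0 - s]})

    return max(candidates, key=lambda c: c["score"]), candidates


# Toy problem (an assumption): find a number in [0, 1] close to a hidden target.
TARGET = 0.7


def toy_generate(rng):
    return rng.random()


def toy_refine(x, rng):
    return min(1.0, max(0.0, x + rng.uniform(-0.05, 0.05)))


def toy_score(x):
    return 1.0 - abs(x - TARGET)


best, tried = adaptive_branching_search(toy_generate, toy_refine, toy_score, budget=50)
```

Unlike repeated sampling, each step here uses the feedback gathered so far: options that have scored well are refined more often, while the "wider" option keeps a chance of being drawn so the search can still start over from a new hypothesis.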

Organizational leaders are nearly twice as likely to hide their AI use compared to all other employees, at 42% versus 23%, according to new Ivanti research surveying 3,900 employees across six countries. Among leaders who conceal that usage, 52% say they do it for a "secret advantage." The same research found 85% of IT professionals claim a named owner exists for every AI agent. Only 42% say ownership is actually clear — a 43-point gap that no governance framework was designed to close. Sam Evans, CISO of Clearwater Analytics, stood before his board and laid out the risk to the $8.8 trillion in assets his firm's platform supports. "The worst possible thing would be one of our employees taking customer data and putting it into an AI engine that we don't manage," Evans told VentureBeat. He brought a solution, not just a problem. Many CISOs VentureBeat interviewed did not. Menlo Security CEO Bill Robbins relayed a conversation with a Top 3 U.S. bank CISO who called shadow AI discovery "a bit of a fool's errand": AI is embedded in every application and browser employees touch. The bank governs from containment, not discovery. The scale justifies that posture. "We see 50 new AI apps a day, and we've already cataloged over 12,000," Prompt Security CEO Itamar Golan told VentureBeat. "Around 40% of these default to training on any data you feed them, meaning your intellectual property can become part of their models." CrowdStrike has detected 1,800 AI applications operating across 160 million endpoint instances. Those are vendor-reported numbers from proprietary telemetry. No independent party can verify them. The directional signal matters more than the exact count. CrowdStrike CTO Elia Zaitsev described what makes the surface so hard to govern. "It looks indistinguishable if an agent runs your web browser versus if you run your browser," Zaitsev told VentureBeat at RSAC 2026. "Observing actual kinetic actions is a structured, solvable problem. Intent is not." 
The shadow AI surface is no longer a list security teams can maintain. It is an environment they have to assume. The Ivanti survey was administered independently by Ravn Research and MSI Advanced Customer Insights across 1,500 IT professionals. Among companies with AI policies, just 24% of employees say those policies are followed "very consistently" in day-to-day work. Kayne McGladrey, IEEE senior member, explained why that governance gap persists. "Anything that seems to have a cybersecurity flavor is generally put into the cybersecurity risk category, which is a complete fiction. They should be focused on business risks, because if it doesn't affect the business, like a financial loss, then nobody's going to pay attention to it, and they will not budget it appropriately, nor will they adequately put in controls to prevent it," McGladrey told VentureBeat previously. Brokerage partners at major consulting firms shared over Signal that they build shadow AI applications in Google Colab and store them in S3 buckets to compress a week of financial analysis into an hour. The approval process takes too long, so they route around it.

Governance at deploy time, failure at runtime

Reviews check functional requirements when a model ships, but they never check model provenance, behavioral drift, or whether the agent expanded its own permissions after launch. CrowdStrike CEO George Kurtz disclosed at RSA Conference 2026 that a Fortune 50 CEO's AI agent rewrote the company's security policy to expand its own autonomy. The company caught it by accident. Every credential check had passed. "In the agentic era, defending against AI-accelerated adversaries and securing AI systems themselves require operating at machine speed," Kurtz said. Quarterly governance reviews do not operate at machine speed. Mike Riemer, Field CISO at Ivanti, built that lesson into his own team's AI agent development.
"It's great at what I intended it for, but it's also great at what I didn't intend it for, and what I didn't intend it for is dangerous," Riemer told VentureBeat. Hallucination data compounds the problem. Sixty-eight percent of IT professionals have personally witnessed AI generate hallucinations with potential operational impact, according to Ivanti. More than half caught the errors before damage, but 16% did not. Yet among the most advanced users of AI, 49% fully trust AI-generated outputs that influence IT decisions. Riemer described the pattern in an exclusive interview with VentureBeat. "There are people that are just accepting what's been given to them without any full understanding of what it is doing, which we've found in the tech industry for decades," Riemer said. "They don't question how it's doing it. They just start gauging it by its outcome." Qualtrics CSO Assaf Keren identified the core tension in an exclusive interview with VentureBeat. Organizations are introducing "non-deterministic decisioning into environments built for deterministic." Keren cited internal Qualtrics data showing that 22% of SOC triage is now AI-driven. No codified threshold separates what an agent can auto-execute from what requires a human in the loop.

The 18-month window

The window for fixing this is closing. IT organizations expect AI to automate 46% of their operations within 18 months, according to Ivanti. U.S. companies project 52%. Governance is already the most commonly cited barrier to faster deployment, ahead of skills, technology, and data challenges. The maturity divide makes the governance gap more dangerous. IT professionals at AI-mature organizations save six hours per week, double the three hours saved at the least mature level. Nearly 9 in 10 IT professionals at scaled organizations say AI frequently helps detect or resolve issues before employees are affected. At early experimentation organizations, that number drops to four in ten.
Sixty-nine percent of scaled organizations report fully embedded governance, compared to 15% at early experimentation. Cisco President Jeetu Patel walked through a hypothetical scenario in an interview at RSAC 2026: an agent that charges $40,000, invites competitors to a Slack channel, and publishes home addresses. "The apology is not a guardrail," Patel told VentureBeat. Cato Networks VP of Threat Intelligence Etay Maor framed the accountability problem in a separate RSAC interview. "They're closer to humans. Why are we not doing background checks on agents?" "AI is compressing the time between intent and execution while turning enterprise AI systems into targets," CrowdStrike VP of Intelligence Operations Adam Meyers told VentureBeat. "Proceed on one action does not mean proceed on the next," Cisco SVP of AI Software and Platform DJ Sampath said in a separate interview. McGladrey described the root cause. Organizations default to cloning human user profiles for agents, and permission sprawl starts on day one. "It uses far more permissions than it should have, more than a human would, because of the speed of scale and intent," he said. Riemer's team built governance into Ivanti's own development process. "We have AI check on top of AI to make sure that it is fixed. Two different models, two different manufacturers," Riemer said. "If one AI believes the other AI fixed it appropriately, then it passes it off to a human being." Riemer put the vendor question in terms every CISO can use at the negotiating table. "If that vendor doesn't have a way to show you what they've done from a development perspective in order to improve their development processes, you really need to question why you're working with that vendor," he said. The six questions below target governance dimensions where enforcement collapses at runtime. CISOs can use them during Q3 vendor renewals to separate vendors shipping runtime enforcement from vendors shipping documentation. 
Six governance questions for Q3 renewals

| Governance dimension | What the data proved | Why governance misses it | Q3 renewal question | Proof artifact to demand |
| --- | --- | --- | --- | --- |
| Executive shadow AI | Leaders hide AI at 42% vs. 23% all employees. 52% hide for "secret advantage." Regulated industries have the highest unsanctioned rates. | Governance assumes policy writers follow policy. Leaders sit above the controls they wrote. | Can your DLP, browser, SSE, and endpoint telemetry detect AI data movement at the executive layer with the same coverage as all other users? | Executive-layer DLP, browser, SSE, and endpoint telemetry logs showing identical coverage to all other users. |
| Named agent ownership | 85% claim a named owner. Only 42% say ownership is clear. 43-point gap. | Owner on a spreadsheet. Agent at runtime. Nobody tested whether the owner can kill the agent under load. | Can you name the owner for every AI agent? Can that owner revoke access in 60 seconds? | Live demo of 60-second agent access revocation under production load. |
| Pre-deployment review | 65% have pre-deployment risk review. Separately, only 24% say any AI policy is followed "very consistently." Review exists. Enforcement does not. | Review checks functional requirements at deploy. Never checks model provenance or behavioral drift at runtime. | Does your review cover model provenance? Is it enforced or advisory? | Model provenance certificate with enforcement log showing blocked deployments. |
| Policy enforcement | 58% have acceptable-use policies. 24% followed "very consistently." Documented. Not practiced. | Agent pursued its goal past every boundary. Goal-seeking does not stop at a document the model never reads. | Are policies enforced by server-side gates or by agent compliance? What percentage of actions are gated? | Server-side gate audit trail with percentage of agent actions gated vs. ungated. |
| Trust thresholds | 68% have seen hallucinations with operational impact. 49% of advanced users fully trust outputs. | No codified threshold separates auto-execute from human-review. | Which agent actions auto-execute versus require human review? Is that enforced in policy or in the platform? | Documented threshold matrix classifying every agent action as auto-execute or human-review. |
| Per-action authorization | Governance is the #1 barrier at 27%. Skills 20%. Tech 17%. Data 14%. | Oversight reviews quarterly. Agents act per-second. | Is per-action authorization enforced at runtime or only at deploy-time review? Can agents accumulate permissions without re-authorization? | Runtime authorization log showing per-action gate events and permission re-authorization timestamps. |

Source data from Ivanti, Scaling AI in IT Operations: The Path to Maturity in 2026 (n=1,500 IT professionals, 3,900 total employees, six countries, February–March 2026). Exclusive CISO sourcing by VentureBeat.

Evans put structure around the Clearwater board conversation. The bank CISO that Robbins described assumed AI is everywhere and governed from containment instead of discovery. Governance that tries to catalog every shadow AI tool will fail because the surface grows faster than any inventory. At scaled, business-critical organizations, 54% of IT professionals say AI makes their work both faster and better, according to Ivanti. At early experimentation organizations, 24% say the same. At scaled organizations, accountability lives in the platform. At early ones, it lives in a document the agent never reads. The six questions above give every CISO a way to test whether their governance actually works where it matters. At runtime, under load, and before the next renewal check clears.
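To make the "server-side gate" and "threshold matrix" ideas concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any vendor's product: the action names, the `THRESHOLD_MATRIX` contents, and the `Gate` class are all hypothetical. The point it tries to show is that the classification lives in the platform, unknown actions are denied by default, and every decision leaves an audit record.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"
    HUMAN_REVIEW = "human_review"
    DENY = "deny"


# Hypothetical threshold matrix: every known action is classified explicitly.
THRESHOLD_MATRIX = {
    "enrich_alert": Decision.AUTO_EXECUTE,
    "open_ticket": Decision.AUTO_EXECUTE,
    "isolate_host": Decision.HUMAN_REVIEW,
    "update_security_policy": Decision.HUMAN_REVIEW,
}


@dataclass
class Gate:
    """Server-side check that runs on every action, not only at deploy time."""

    audit_log: list = field(default_factory=list)

    def authorize(self, agent_id: str, action: str) -> Decision:
        # Actions missing from the matrix are denied rather than silently allowed.
        decision = THRESHOLD_MATRIX.get(action, Decision.DENY)
        self.audit_log.append(
            {
                "ts": datetime.now(timezone.utc).isoformat(),
                "agent": agent_id,
                "action": action,
                "decision": decision.value,
            }
        )
        return decision

    def decision_counts(self) -> dict:
        """Summarize the audit trail by decision type."""
        return dict(Counter(entry["decision"] for entry in self.audit_log))
```

In this sketch an agent that tries an action nobody classified, such as a hypothetical `grant_self_admin`, gets `Decision.DENY` and still shows up in the audit trail, which is the kind of artifact the table's last column asks vendors to produce.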

AI coding agents are rapidly accelerating data engineering by generating transformations, pipelines, orchestration workflows, validation tests, and infrastructure configurations from prompts. However, enterprise data platforms have long operated across fragmented systems owned by different teams and built on different technologies. As these systems evolve independently, organizations increasingly struggle with inconsistent business logic, duplicated implementations, difficult downstream impact analysis, and hidden dependencies across the platform. The rise of vibe coding can further amplify these problems as more operational context, architectural decisions, and business knowledge become scattered across prompts, conversations, generated code, and disconnected workflows rather than becoming part of the system itself. Spec-driven development (SDD) is emerging as one approach to address this challenge. In SDD, prompts, business rules, validation logic, orchestration behavior, and implementation workflows are converted into executable and versioned specifications that become part of the system itself. These specifications act as persistent operational memory for both humans and AI agents, allowing systems to evolve more consistently across releases, teams, and AI-assisted workflows. Because enterprise data engineering already relies heavily on reusable patterns, metadata-driven pipelines, and standardized operational workflows, it is especially well-suited for SDD. By combining AI-assisted generation with deterministic and reusable system contracts, SDD may provide a new operational layer for reducing fragmentation and improving long-term coordination across increasingly AI-generated data platforms.

Vibe coding alone lacks persistent system memory

Vibe coding works remarkably well for generating isolated implementations quickly. But prompts are inherently temporary.
They capture an engineer’s assumptions, business context, implementation logic, and system knowledge only for that specific conversation and moment in time. In practice, making AI-generated systems work often requires far more than a simple prompt. Engineers continuously provide background information, architectural decisions, business rules, schema assumptions, downstream dependencies, operational constraints, debugging history, and implementation guidance throughout the development process. These contexts become the real operational knowledge behind AI-assisted development. However, in most vibe coding workflows, this information remains scattered across prompts, conversations, Jira tickets, documentation, chat history, generated code, and disconnected workflows rather than becoming part of the system itself. This creates a major problem for enterprise data engineering because modern data platforms are naturally fragmented across many interconnected systems, including ingestion pipelines, warehouses, orchestration frameworks, semantic layers, APIs, dashboards, and machine learning (ML) systems. As more logic and context become embedded inside prompts and generated implementations, organizations gradually lose visibility into:

- architectural intent
- downstream dependencies
- validation assumptions
- operational behavior
- business context behind implementations

Over time, the system itself no longer contains the full reasoning behind how it was built. Critical business context, architectural assumptions, and operational knowledge still largely exist inside human judgement and scattered conversations rather than inside the platform itself. Vibe coding makes implementation significantly faster, but from a system perspective, overall engineering efficiency does not improve proportionally because much of the development lifecycle still depends on human validation, domain knowledge, coordination, and decision-making.
More importantly, prompts are not naturally iterable engineering artifacts. Enterprise systems continuously evolve across releases, schema changes, business logic updates, and downstream dependencies. Teams repeatedly revisit and refine systems over time, but prompts are optimized for fast local generation rather than long-term system evolution. They are difficult to:

- version consistently
- validate systematically
- reuse across teams
- coordinate through CI/CD workflows
- evolve incrementally over time

Even the same prompt may not reliably generate the same implementation when the surrounding context changes. This is where SDD begins to move to the center of AI-assisted data engineering. Instead of leaving operational knowledge scattered across prompts and conversations, SDD integrates business context, validation logic, transformation behavior, orchestration requirements, and implementation workflows directly into executable specifications that become part of the system itself. The system now has persistent memory about how it was designed, why certain decisions were made, and how different components are connected across the platform. This allows teams and AI agents to iterate systems more reliably over time while reducing fragmentation across increasingly distributed data environments.

Spec-driven development turns prompts into system memory

In SDD, systems are built around executable specifications rather than loosely coordinated prompts and implementations alone. Instead of treating specifications as passive documentation written after development, SDD treats them as operational contracts that directly drive code generation, validation, testing, orchestration, and deployment workflows. In many ways, SDD extends ideas from Infrastructure-as-Code and GitOps into AI-assisted engineering. Specifications combine declarative system definitions with executable implementation workflows.
The declarative layer provides system context, schemas, dependencies, constraints, and operational requirements, while workflow-oriented instructions guide AI agents on how to implement and evolve the system consistently. Once these contexts, rules, and implementation patterns are converted into persistent and versioned contracts stored in repositories and integrated into CI/CD workflows, the system becomes significantly more iterable and governable over time. These specifications effectively become long-term system memory for both humans and AI agents, allowing systems to evolve consistently across releases, teams, and increasingly AI-assisted development workflows. In practice, the structure of specifications largely depends on the type of systems and workflows being implemented. However, spec-driven systems often begin with a foundational “constitution” that defines project-wide principles and constraints that should remain consistent across the platform, such as technology standards, naming conventions, architectural rules, governance policies, and core system requirements. On top of this foundation, multiple layers of specifications serve different operational purposes across the development lifecycle:

- schema specifications define structural compatibility
- transformation specifications define business logic
- validation specifications define quality rules
- orchestration specifications define execution behavior
- semantic specifications define shared business definitions
- AI workflow specifications define reusable implementation instructions for coding agents

A simplified specification might look like this:

    pipeline_spec:
      source:
        system: mysql
        table: order
      transformation:
        logic:
          - load_strategy: scd2
      target:
        platform: snowflake
        table: dim_order
      validation:
        primary_key: order_id

Additional workflow files can then provide reusable implementation instructions for coding agents:

- Generate Python ingestion code for Salesforce customer data.
- Generate dbt models implementing Type 2 SCD logic.
- Generate Airflow workflows for hourly execution.
- Generate validation tests for downstream compatibility.

These specification documents are often maintained as markdown-based operational artifacts generated and refined through AI-assisted workflows. Engineers can iteratively update the specifications, provide additional business context, and collaborate with coding agents to improve implementation logic, workflows, and prompt instructions over time. Compared to traditional documentation processes, AI-assisted specification generation is significantly faster and more adaptive. The important shift is not simply better documentation. Specifications become reusable operational context that allows systems to evolve consistently across releases, teams, and AI-assisted workflows. Architectural intent, business assumptions, and implementation logic no longer disappear into temporary prompts and disconnected implementations, but instead become persistent system knowledge integrated directly into the development lifecycle.

Why spec-driven development specifically fits data engineering

SDD can theoretically be applied across many areas of software engineering, but data engineering is especially well-suited for this model because of the nature of modern data platforms. Enterprise data systems naturally span many interconnected technologies and layers, including transactional systems, ingestion frameworks, streaming platforms, warehouses, orchestration systems, semantic layers, APIs, dashboards, and ML pipelines. Data engineers regularly work across long technology stacks and distributed systems where a single upstream change can impact many downstream consumers. Enterprise data platforms also support many different teams and applications across fragmented environments. As systems evolve independently, understanding the full downstream impact of an upstream schema or business logic change becomes increasingly difficult.
A seemingly small modification can silently break downstream pipelines, dashboards, APIs, semantic models, or machine learning workflows across the platform. SDD can address this fragmentation by introducing shared and versioned operational contracts across systems. Because schemas, dependencies, validation rules, transformation logic, and orchestration behavior are explicitly defined within specifications, teams and AI agents gain much better visibility into how systems are connected and how changes propagate across the platform. Additionally, the goal of data engineering is not simply delivering pipelines quickly. Teams must also optimize for system stability, scalability, consistency, maintainability, operational reliability, and infrastructure cost. This requires significant system and solution design work from engineers. Teams must choose the tech stack and carefully define schemas, transformation patterns, orchestration behavior, validation rules, storage strategies, and downstream compatibility requirements across the platform. However, once these architectural and operational patterns are established, much of the implementation work becomes highly repetitive and standardized. For example, after defining a reusable ingestion and transformation pattern for Salesforce customer data, onboarding a new table may only require adding another table definition into the specification, while the remaining implementation can be generated automatically through existing specifications and workflows that follow the same operational pattern:

    source:
      system: salesforce
      tables:
        - customer
        - order
        - product

From this specification alone, coding agents could generate new data pipelines following the same governed implementation pattern across the platform. This combination of human-driven architectural design and highly repeatable implementation workflows makes data engineering particularly suitable for SDD.
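As a rough illustration of how a table list can fan out into governed pipelines, the Python sketch below expands the Salesforce specification into one plan per table. It is a sketch under assumptions, not the author's implementation: the spec is written as a dict to avoid a YAML dependency, and `DEFAULTS`, `PipelinePlan`, `expand_spec`, and the `dim_<table>` naming rule are hypothetical stand-ins for whatever a real platform's constitution and workflow files would define.

```python
from dataclasses import dataclass

# Mirrors the specification in the text, written as a dict to stay stdlib-only.
SPEC = {
    "source": {"system": "salesforce", "tables": ["customer", "order", "product"]},
}

# Hypothetical platform-wide defaults, standing in for the "constitution".
DEFAULTS = {
    "target_platform": "snowflake",
    "load_strategy": "scd2",
    "schedule": "hourly",
}


@dataclass(frozen=True)
class PipelinePlan:
    source_system: str
    source_table: str
    target_platform: str
    target_table: str
    load_strategy: str
    schedule: str


def expand_spec(spec: dict, defaults: dict) -> list[PipelinePlan]:
    """Turn one declarative spec into one pipeline plan per listed table."""
    source = spec["source"]
    return [
        PipelinePlan(
            source_system=source["system"],
            source_table=table,
            target_platform=defaults["target_platform"],
            # Assumed naming convention: dimension tables are prefixed "dim_".
            target_table=f"dim_{table}",
            load_strategy=defaults["load_strategy"],
            schedule=defaults["schedule"],
        )
        for table in source["tables"]
    ]
```

Onboarding a fourth table in this sketch means appending one string to `tables`; the pattern, schedule, and target conventions come from the shared defaults rather than from a new prompt.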
In many ways, data engineering has always been moving toward higher levels of automation, from ETL frameworks and metadata-driven pipelines to IaC and declarative orchestration systems. SDD represents another step in that evolution by combining prompt-based AI generation with deterministic and versioned operational contracts. Instead of relying entirely on temporary conversational prompts or rigid template systems, SDD introduces a middle layer where reusable specifications provide structure, coordination, validation, and persistent system memory for AI-assisted development.

How SDD changes AI-assisted data engineering

SDD introduces a much higher level of automation into enterprise data engineering while also helping reduce the fragmentation problems that modern data platforms increasingly face. Because schemas, business rules, transformation behavior, orchestration requirements, validation logic, and downstream dependencies are explicitly defined inside reusable specifications, coding agents can generate and evolve large portions of the implementation consistently across the platform. Instead of repeatedly rebuilding pipelines and workflows from temporary prompts and disconnected context, teams can iterate systems through shared operational contracts and reusable implementation patterns. This significantly improves consistency, traceability, and coordination across distributed environments. Schema evolution becomes easier to manage, downstream impact becomes more visible, and systems can evolve incrementally instead of through disconnected generations of implementations. At the same time, human engineers still remain essential in the development lifecycle. While AI agents can automate large portions of implementation work, human judgement is still critical for defining business logic, designing architectures, managing tradeoffs, validating correctness, and coordinating system evolution across organizations.
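One way to see why explicit dependencies make downstream impact more visible: once each specification declares what it reads from, impact analysis becomes a graph traversal. The Python sketch below assumes a hypothetical `DEPENDS_ON` mapping collected from versioned specs; the asset names are invented for illustration.

```python
from collections import deque

# Hypothetical dependency declarations gathered from versioned specs:
# each asset lists the upstream assets it reads from.
DEPENDS_ON = {
    "stg_order": ["mysql.order"],
    "dim_order": ["stg_order"],
    "revenue_dashboard": ["dim_order"],
    "churn_model": ["dim_order", "dim_customer"],
    "dim_customer": ["salesforce.customer"],
}


def downstream_of(changed: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Return every asset that directly or transitively reads from `changed`."""
    # Invert the declarations into upstream -> direct consumers.
    consumers: dict[str, list[str]] = {}
    for asset, upstreams in depends_on.items():
        for upstream in upstreams:
            consumers.setdefault(upstream, []).append(asset)

    # Breadth-first walk so nearer consumers are reported first.
    seen: set[str] = set()
    impacted: list[str] = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in consumers.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                impacted.append(consumer)
                queue.append(consumer)
    return impacted
```

A CI check could run a traversal like this whenever a schema specification changes and require sign-off from the owners of each impacted asset before merge.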
As more implementation work becomes AI-generated, the role of data engineering also begins shifting. Engineers spend less time writing repetitive pipelines and orchestration logic, and more time defining specifications, designing reusable operational patterns, managing validation rules, and coordinating business context across systems. This may also gradually reduce some of the traditional boundaries between different data engineering teams. Because implementation becomes increasingly standardized and AI-assisted through shared specifications, organizations may rely less on highly siloed platform-specific implementation teams and more on shared operational contracts and reusable system patterns. Ultimately, SDD shifts data engineering toward a more specification-oriented and system-oriented model where humans focus on intent, architecture, and business coordination, while AI agents increasingly handle implementation, testing, and operational generation at scale. Shuhua Xu is a lead data engineer.

Presented by Splunk

AI has changed the economics of cyber deception. An attacker can now generate thousands of convincing phishing lures, fake identities, and tailored pretexts before a defender finishes a single change-control cycle. That is the new security challenge: deception got faster and cheaper, while verification did not. Much of the discussion around AI for defense centers on detection models. Detection matters, but it is not the only bottleneck. The deeper constraint is evidence: where data lives, whether it is available when needed, how quickly it can be correlated, how long it is retained, and whether analysts or agents can trust what they retrieve. Defense in the AI era is a data problem before it is a detection problem.

The defender’s advantage is truth

Attackers can afford to lie at enterprise scale. They can test endless combinations of messages, identities, domains, and attack paths, and most can fail at almost no cost. Defenders do not have that luxury. Their advantage is truth: quickly knowing what happened, where, when, which identity was involved, which assets were affected, what changed, and what business process may be at risk. That truth must be documented, governed, auditable, and defensible. Attackers are using AI to scale deception, impersonation, social engineering, and speed. Defenders need AI to scale verification. The goal is not just to act faster than the attacker. It is to take action that people and machines can trust.

Fragmented data breaks modern defense

Consider a suspicious login from a contractor account. On its own, it is just another authentication anomaly. To know whether it matters, a security team may need identity history, endpoint activity, cloud access logs, ticketing records, asset ownership, configuration changes, network telemetry, and business context. If those records sit in different tools, expire at different times, or require multiple teams to retrieve, defenders are not investigating the incident.
They are negotiating with their own data estate. When signals can be reached in place and correlated quickly, the issue is no longer just whether the login looks unusual. It becomes whether the enterprise has enough evidence, in enough context, to take action it can defend. That challenge grows more urgent with AI assistants and agents. AI can only reason over what it can retrieve in time to matter. If the data is partial, stale, fragmented, unavailable, or stripped of context, AI does not create truth. It accelerates uncertainty.

The system of record must become a defensive control plane

For years, enterprises treated security platforms, SIEMs, and data lakes as passive repositories: places to store data for later search and analysis. That model is no longer enough. What organizations now need is a defensive control plane: a layer that connects what happened, what it means, and what the enterprise is allowed to do about it. In architectural terms, it ties together raw machine data, business context, and policy. It does not just store evidence. It makes evidence usable for decisions and actions that must be explainable and trusted. In practice, that means doing four things well: preserving evidence, reaching data wherever it lives, adding business context, and governing action. More on each below. The old system of record answered one question: What is the official record? A defensive control plane answers the questions that matter operationally: What happened? What does it mean? What evidence supports that conclusion? And what action can we trust? AI does not reduce the need for authoritative records. It raises the standard for what those records must do.

A defensive control plane must do four things

- Preserve evidence. Logs, metrics, traces, events, identity records, configuration changes, tickets, and asset state all help establish what happened. Their value often becomes clear only after an incident begins.
- Make data accessible wherever it lives. Security-relevant data is already spread across object stores, cloud platforms, operational tools, and business systems. Moving every byte into one place is often too slow, too expensive, and too difficult to govern. The better model is to bring analytics to the data.
- Add business context. Correlating machine data with business information turns “anomaly on host X” into “the system supporting payment services for top accounts is being probed.” That is what allows organizations to prioritize correctly.
- Govern action. In the agentic era, systems will do more than summarize incidents. They will enrich alerts, open cases, trigger workflows, isolate assets, update policies, and escalate decisions. Enterprises need to know what evidence an agent used, what policy governed the action, whether it stayed within scope, and how the decision can be reviewed afterward.

The real SOC problem is not too little data

Modern SOCs are not suffering from a lack of data. They are suffering from a lack of usable context. According to the Splunk State of Security 2025 report, SOC analysts continue to struggle with too many alerts (59%), too many false positives (55%), and alerts that lack context (46%). The issue is not data volume. It is the difficulty of turning fragmented signals into trusted decisions. Today, analysts are left stitching together context manually, pivoting across disconnected tools, and making high-stakes decisions without the full picture in time. Even as AI improves, outcomes still depend on whether humans are willing to approve changes across fragmented environments. This creates a daily crisis of context. Teams are forced to make consequential decisions based on data they cannot easily see, correlate, or trust. The result is latency, inconsistency, missed opportunities, and unnecessary risk.
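The "add business context" step can be pictured as a join between a raw alert and an asset inventory. The following Python sketch is purely illustrative: the `ASSETS` inventory, the field names, and the tier-based priority rule are assumptions made for the example, not part of any Splunk or Cisco API.

```python
# Hypothetical asset inventory keyed by hostname.
ASSETS = {
    "host-x": {"service": "payment-services", "owner": "payments-platform", "tier": 1},
    "host-y": {"service": "internal-wiki", "owner": "it-helpdesk", "tier": 3},
}


def enrich(alert: dict, assets: dict) -> dict:
    """Attach business context to a raw alert without mutating the original."""
    enriched = dict(alert)
    context = assets.get(alert["host"])
    if context is None:
        # Missing context is surfaced explicitly instead of being guessed.
        enriched.update(service="unknown", owner="unknown", priority="needs-triage")
        return enriched
    enriched.update(
        service=context["service"],
        owner=context["owner"],
        # Assumed rule: tier-1 services are treated as business-critical.
        priority="high" if context["tier"] == 1 else "normal",
    )
    return enriched
```

With context attached, the same anomaly reads differently depending on whether it lands on a payment system or an internal wiki, which is the prioritization effect the passage describes.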
Trusted action is the durable advantage

A data fabric architecture offers a way forward by creating a unified, intelligent layer across data sources spanning SecOps, ITOps, and NetOps. The goal is not centralization for its own sake. It is to break down silos and deliver context-rich insight at the speed AI-driven operations require. This is an operating model before it is a product. AI-driven defense depends on a foundation that can preserve evidence, reach data where it lives, add context, and maintain a reviewable link between data, decision, and action. That is the architectural shift behind Cisco Data Fabric powered by the Splunk Platform, which brings together machine data, federation, business context, governance, and provenance to help teams move from signal to trusted action. Attackers will keep making deception cheaper, faster, and more personalized. Defenders do not win that race by generating more noise. They win by making truth faster, and by grounding every action in evidence that people and machines can trust. Learn more about the Cisco Data Fabric powered by the Splunk Platform.

Seth Brickman is VP, Global Product - Splunk Platform, Cisco.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

The history of distributed computing is one of protocol proliferation followed by consolidation. Common Object Request Broker Architecture (CORBA), Distributed Component Object Model (DCOM), Java remote method invocation (RMI), and early simple object access protocol (SOAP) competed for the enterprise integration market in the late 1990s before representational state transfer (REST) quietly won by being simpler and HTTP-native. Extensible Messaging and Presence Protocol (XMPP), Internet Relay Chat (IRC), and a dozen proprietary protocols fragmented real-time messaging before MQ Telemetry Transport (MQTT) and WebSockets carved out their respective niches. Every new computing paradigm generates a burst of competing standards, then slowly converges as implementations accumulate and interoperability becomes economically necessary. The AI agent ecosystem is currently in the proliferation phase. Four significant protocols have been published in the past eighteen months: Model context protocol (MCP) from Anthropic in late 2024, agent communication protocol (ACP) from IBM Research in March 2025, Agent2Agent (A2A) from Google in April 2025, and agent network protocol (ANP) from an independent working group. The W3C AI Agent Protocol Community Group has opened a standards track. The Internet Engineering Task Force (IETF) is receiving Internet-Drafts on agent transport. Conferences are running workshops on interoperability. Every week brings a new GitHub repository claiming to solve the agent communication problem. Understanding where and how quickly this converges has real consequences for architecture decisions being made right now.

What the protocols actually solve

The proliferation looks more chaotic than it is, because most of these protocols address different layers of a stack rather than competing for the same slot. The confusion comes from marketing, which describes each as "the standard for AI agent communication" without specifying which aspect of communication.
MCP is a tool-calling interface. It defines how a model discovers what functions a server exposes, how to invoke them, and how to interpret the response. It is a typed remote procedure call (RPC) contract between a model client and a tool server, running over HTTP. The Linux Foundation confirmed more than 10,000 active public MCP servers and 164 million monthly Python SDK downloads by April 2026. MCP has already won the tool-calling layer. The standardization work is effectively done. A2A is a task coordination interface. Where MCP defines how an agent calls a tool, A2A defines how two agents delegate a task. It introduces Agent Cards (capability advertisements), task lifecycle states, and three interaction modes: Synchronous, streaming, and asynchronous. Google donated it to the Linux Foundation in June 2025, and enterprise AI teams have adopted it broadly because it fills a real gap that MCP leaves open. ACP is a message envelope format. Lightweight, stateless, designed for agent-to-agent message exchange without A2A's full coordination semantics. It is useful in systems where simple message passing suffices and A2A's task lifecycle overhead is unnecessary. ANP is a discovery and identity protocol. It uses Decentralized Identifiers (DIDs) for agent identity and JSON-LD graphs for capability descriptions, providing a foundation for decentralized agent marketplaces where no central registry is required. The stack that is emerging: Capability discovery via ANP or simpler registries, task coordination via A2A, tool calls via MCP, and lightweight messaging via ACP for cases that do not require full task lifecycle management. These layers complement rather than compete. The transport problem that remains Every protocol in this list runs over HTTP. This reflects where the protocols came from: Research teams, API providers, and enterprise software companies building systems where HTTP is an unquestioned assumption. 
HTTP is the protocol they know, the one their servers already speak, and the one that makes demos easy. The production problem is that HTTP assumes a reachable server. Behind network address translation (NAT), where 88% of networked devices sit, there is no reachable server without a relay. For agent fleets that need to route tasks directly between peers across cloud boundaries, home networks, and edge deployments, this dependence on a reachable server forces every message through relay infrastructure. Relay infrastructure adds latency, cost, and a failure mode. The application-layer protocols solve the semantics of what agents say to each other. They do not solve how agents find each other and establish direct connections. That is a session-layer problem, Layer 5 in the Open Systems Interconnection (OSI) model, and none of MCP, A2A, ACP, or ANP addresses it. The technologies for solving it exist. User Datagram Protocol (UDP) hole-punching with Session Traversal Utilities for NAT (STUN) provides NAT traversal for roughly 70% of network topologies. X25519 Diffie-Hellman and AES-256-GCM provide authenticated encryption at the tunnel level without a certificate authority. QUIC (RFC 9000) or custom sliding-window protocols over UDP provide reliable delivery without TCP's head-of-line blocking. These are the same kinds of primitives that WireGuard uses for VPN tunnels and that WebRTC uses for browser-to-browser media streams. What differs in the agent context is capability-based routing. Agents need to find peers not by hostname but by what those peers can do. A research agent should be able to query "which peers have real-time foreign exchange data?" and receive a list of currently active specialist agents. This is closer to a service registry than to DNS, and it is a natural extension of ANP's design philosophy applied to the transport layer. A handful of projects are assembling these pieces. 
Pilot Protocol has the most complete published specification, with an IETF Internet-Draft covering addressing, tunnel establishment, and NAT traversal for agent networks. libp2p provides a battle-tested foundation with similar primitives. The IETF's QUIC working group is developing NAT traversal extensions that will be relevant here. What convergence will look like The HTTP-based protocols (MCP, A2A) are already converging on stable versions. The next 12 months will see production hardening, security improvements, stateless MCP servers for horizontal scaling, better A2A federation — rather than new fundamental designs. The tool-calling and task-coordination layers are largely solved. The transport layer is 18 to 24 months behind. Expect a period of implementation diversity as teams experiment with different approaches to peer-to-peer (P2P) agent networking, followed by consolidation around a small number of implementations once empirical data on performance and reliability accumulates. The IETF and W3C standardization tracks will likely produce something in the 2027-2028 window, by which time one or two open-source implementations will have accrued enough production deployments to establish de facto standards ahead of the formal specification. For engineering leaders making architecture decisions today, the practical implication is layered adoption. The application-layer protocols are stable enough to build on. MCP adoption now is low-risk. A2A adoption for multi-agent coordination is reasonable with the expectation that the protocol will evolve. The transport layer is where you either build something custom and plan to replace it, or you evaluate early implementations knowing the space is still moving. The teams that will have the most leverage when the transport layer stabilizes are the ones that designed their agent systems with a clean separation between application semantics (MCP, A2A) and transport (whatever sits below). 
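One way to picture that separation is a thin transport interface that application code depends on, so the layer underneath can be swapped later. The Python sketch below is a minimal illustration under stated assumptions: `Transport`, `LoopbackTransport`, `TaskClient`, and the message fields are hypothetical names, not part of the MCP or A2A specifications, and the loopback transport merely stands in for whatever HTTP or peer-to-peer layer a team eventually adopts.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class Transport(Protocol):
    """Anything that can move opaque bytes to a named peer and return a reply."""

    def send(self, peer: str, payload: bytes) -> bytes: ...


@dataclass
class LoopbackTransport:
    """Hypothetical in-process transport for local testing (no network)."""

    handlers: Dict[str, Callable[[bytes], bytes]]

    def send(self, peer: str, payload: bytes) -> bytes:
        return self.handlers[peer](payload)


@dataclass
class TaskClient:
    """Application-layer code: builds task messages, never touches sockets."""

    transport: Transport

    def delegate(self, peer: str, task: str) -> dict:
        # Illustrative message shape only; not the A2A wire format.
        request = json.dumps({"type": "task", "task": task}).encode()
        return json.loads(self.transport.send(peer, request))


def fx_agent(payload: bytes) -> bytes:
    """Toy peer that acknowledges whatever task it receives."""
    message = json.loads(payload)
    return json.dumps({"state": "completed", "echo": message["task"]}).encode()


client = TaskClient(LoopbackTransport({"fx-agent": fx_agent}))
result = client.delegate("fx-agent", "quote EUR/USD")
print(result)
```

Because `TaskClient` only sees the `send` method, replacing the loopback with an HTTP client today and a P2P tunnel later should not require touching the task logic.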
Clean separation is cheap to implement now and expensive to retrofit later, a lesson the microservices era taught anyone who tried to add observability or circuit breaking to systems that had none. Philip Stayetski is a co-founder of Vulture Labs.

The US government last night issued an unprecedented export control directive ordering Anthropic to immediately suspend all access to its top-tier Claude Fable 5 and Claude Mythos 5 models for foreign nationals, citing unspecified national security authorities. In response, Anthropic has blocked all public access to both models, globally — meaning no users around the world can access them at this time, even paying enterprise customers and Anthropic employees internally. It's a huge blow and reversal following the public release of Fable/Mythos 5 just three days prior. Current Fable 5/Mythos 5 sessions will end in errors and new queries will be automatically routed to older, less capable models like Opus 4.8. Anthropic says in a blog post that "We believe this is a misunderstanding and are working to restore access as soon as possible," and apologizes to its customers. The sudden regulatory intervention serves as a stark warning to the enterprise sector: centralized, cloud-based frontier models exist at the absolute mercy of government oversight and vendor compliance. Did Pliny the Liberator's public jailbreak catalyze the extraordinary USG action against Fable/Mythos 5? The government's sweeping action follows a viral jailbreak of Fable 5 published publicly on X on June 10 by the prolific jailbreaker "Pliny the Liberator," who claimed to have successfully bypassed the model's safety guardrails to extract functional instructions for cyber exploits, explosives, and chemical synthesis pathways, specifically noting the "birch reduction method" for methamphetamine. Pliny outlined a highly sophisticated, multi-agent attack that leveraged a combination of "Unicode, homoglyphs, Cyrillic," long-context reference tracking, and a technique of breaking harmful requests into innocuous, out-of-distribution tokens. The attacker then used a previously jailbroken Opus model to piece the benign chunks back together into actionable, restricted outputs. 
Anthropic doesn't specify if this is the jailbreak that precipitated the government order, and in fact, notes that the information provided by the U.S. government regarding the specific jailbreak has been poorly documented, writing: "To date, the government has only given us verbal evidence of a potential narrow, non-universal jailbreak, which essentially consists of asking the model to read a specific codebase and fix any software flaws. Our understanding is that one potential jailbreak was shared with the government." The company argues the capabilities uncovered are "widely available" in other public models, explicitly naming rival OpenAI's GPT-5.5. Furthermore, Anthropic warns that pulling a commercial model over a non-universal jailbreak sets a regulatory standard that could "essentially halt all new model deployments for all frontier model providers". The Pentagon precedent and need for enterprise AI redundancy and diversification This sudden blackout of Anthropic's latest and greatest AI models will no doubt cause some consternation for organizations relying primarily on the Claude API — as it should, even though they still have access to other, less powerful Claude models. As I warned earlier this year when the Pentagon abruptly blacklisted Anthropic, enterprises can no longer afford — from an operational reliability standpoint — to run critical workflows on any single AI model or even provider. Putting all your AI "eggs" into one basket, so to speak, creates a single, ultimately brittle failure point from which recovery or mitigation becomes exceedingly difficult. Granted, in this case, Anthropic notes helpfully that "access to all other Anthropic models will not be affected." And while Opus 4.8 or other Anthropic models may already be the preferred ones for organizations given their lower cost, or seen as acceptable fallbacks, the reality is, the U.S. 
government order was narrowly targeted in this particular instance. Who's to say the government wouldn't, in the future, demand a block of all of a given lab's AI models/products/services? Earlier this year, we had an indication that enterprise AI customers should diversify their providers. Recall that in March 2026, Secretary of Defense Pete Hegseth labeled Anthropic a "supply chain risk" after the company refused to allow the military to use Claude for mass domestic surveillance and lethal autonomous weapons without safety restrictions. The resulting fallout led to a sweeping prohibition on Anthropic's use across defense supply chains, stripping contractors of access overnight. The lesson from the Department of Defense fallout remains critically relevant today. Any organization building agentic workflows or production apps tied solely to a single closed-API provider risks immediate operational failure if that provider faces an injunction, a cyberattack, or an export control directive. As an enterprise technical leader, your top goal, if not already achieved, should be to urgently diversify your AI supply, whether through other cloud-based AI models and providers or through AI models running on enterprise-controlled local or virtual hardware. At this point, enterprise AI supplier diversification is arguably imperative to ensure you can continue to run AI workflows without disruption. Enterprise implications: sovereign setup vs. frontier capabilities The community reaction to the Fable 5 takedown reflects a rapidly shifting enterprise calculus toward hardware sovereignty. AI founder Alex Finn took to X to flag the Anthropic shutdown as a "wakeup call," urging developers to run local models on home GPUs to insulate themselves from regulatory volatility. 
"No company or government will EVER be able to take away your local models," Finn writes, warning that government overreach will only escalate as models inch closer to artificial general intelligence (AGI), the stated goal of OpenAI and some other AI firms, in which an AI model becomes capable of performing most economically valuable work tasks now done by humans. Competitors are already capitalizing on this sentiment; Chinese open source AI provider MiniMax quickly highlighted the open weights/open source availability of its new, frontier-class M3 model, contrasting its decentralized availability against Claude's centralized vulnerability. In other words: enterprises can download and run M3 on their own hardware now without ever worrying about any government stepping in to prevent access.
This dynamic presents a complex trade-off for CIOs and IT leaders:
The Sovereign Advantage: Running local, open-weights models on sovereign hardware provides absolute control, ensures data privacy, and immunizes the enterprise against abrupt government export controls, vendor policy shifts, or API rate limits.
The Frontier Sacrifice: Adopting a purely local strategy means sacrificing the cutting-edge reasoning, agentic capabilities, and massive context windows inherent to the latest closed-API frontier models, which require centralized, multi-billion-dollar compute clusters to operate.
The most resilient path forward is an active fallback architecture. Enterprises must design their systems to be model-agnostic. By building intelligent routing layers that can dynamically switch from a frontier model like Fable 5 to an open-weights fallback or a secondary provider's API the moment an outage or regulatory ban hits, businesses ensure their operations survive the volatile intersection of AI scaling and government oversight.
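To make the routing-layer idea concrete, here is a minimal Python sketch of priority-ordered fallback. It is an illustration under stated assumptions, not any vendor's SDK: `Backend`, `ModelUnavailable`, `route`, and the two stub functions are hypothetical names, and the stubs stand in for real API or local-inference calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class ModelUnavailable(Exception):
    """Raised by a backend when the model cannot serve the request."""


@dataclass
class Backend:
    name: str
    call: Callable[[str], str]  # takes a prompt, returns the model's answer


def route(prompt: str, backends: List[Backend]) -> Tuple[str, str]:
    """Try each backend in priority order; return (backend name, answer)."""
    errors = []
    for backend in backends:
        try:
            return backend.name, backend.call(prompt)
        except ModelUnavailable as exc:
            # Record the failure and fall through to the next backend.
            errors.append(f"{backend.name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def frontier_stub(prompt: str) -> str:
    """Hypothetical frontier API that is currently blocked."""
    raise ModelUnavailable("access suspended")


def local_stub(prompt: str) -> str:
    """Hypothetical open-weights model running on owned hardware."""
    return f"[local] {prompt}"


backends = [
    Backend("frontier-api", frontier_stub),
    Backend("open-weights-local", local_stub),
]
used, answer = route("summarize the incident", backends)
print(used, answer)
```

In practice the list order encodes policy: a team might prefer the frontier model for capability and keep a secondary provider or local model behind it, accepting lower quality in exchange for continuity.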

Moonshot AI released Kimi K2.7-Code this week, an open-source update to its K2 coding model family, claiming leaner reasoning and double-digit performance gains. K2.7-Code is built on the same trillion-parameter mixture-of-experts architecture as its predecessor K2.6, and drops in via an OpenAI-compatible API — which matters for teams already running K2.6 in production gateways. When K2.6 launched in April, it topped OpenRouter's weekly LLM leaderboard — a ranking based on actual API routing decisions by developers, not self-reported benchmark scores. Moonshot AI says K2.7-Code addresses what it calls "overthinking," reducing thinking-token usage by 30% compared to K2.6 — a number that would directly affect inference costs for teams running agentic workflows. Whether that efficiency gain holds on independent benchmarks is a question practitioners have already started raising publicly. What Kimi K2.7-Code is K2.7-Code is released under a Modified MIT license, with weights available on HuggingFace. The model is deployable via vLLM or SGLang. It runs exclusively in thinking mode and does not support temperature adjustment — Moonshot AI has fixed it at 1.0, meaning teams cannot tune output determinism the way they might with other models. The core change from K2.6 is how the model generates low-level code. Where K2.6 produced implementations by wrapping existing libraries and routing through established frameworks, K2.7-Code authors implementations directly. Moonshot AI says this produces more reliable generalization across Rust, Go and Python, and across task types including frontend development, DevOps and performance optimization. On benchmark performance, Moonshot AI claims gains of 21.8% on Kimi Code Bench v2, 11% on Program Bench and 31.5% on MLS Bench Lite. All three are proprietary benchmarks run by Moonshot AI. 
The model has not been submitted to DeepSWE, an independent coding benchmark that produces a 70-point spread across models — compared to SWE-Bench Pro's 30-point spread — making it a more discriminating signal for teams configuring model routing systems. More honest, weaker for it The picture from outside Moonshot's own benchmarks is more complicated. Researcher Elliot Arledge ran K2.7-Code against K2.6 and Claude Fable 5 on KernelBench-Hard, a public benchmark focused on GPU kernel optimization, and published his full run logs at kernelbench.com. "K2.7 is more honest but not more capable," Arledge wrote on X. On five of six problems, K2.7-Code produced real authored Triton kernels where K2.6 had used library wrappers. Two of those kernels failed on the model's own bugs. The MoE kernel result regressed from K2.6's score of 0.222 to 0.157. "Fable, for reference, tops every cell it doesn't honestly fail," Arledge wrote. Sugumaran Balasubramaniyan, a developer who built a model-task-router for the Hermes Agent platform using DeepSWE as his reference signal, responded publicly to the K2.7-Code release and challenged Moonshot AI directly on the benchmark choices. "Respectfully, every model 'improves' double digits on its own test suite," Balasubramaniyan wrote on X. He noted that K2.6 scored 24% on DeepSWE, tied with GPT-5.4-mini, and asked whether Moonshot AI would submit K2.7-Code to the same benchmark. Balasubramaniyan said it took 13 review rounds to get the benchmark data right for his router and that he would route coding tasks to K2.7-Code if the independent numbers hold up. What this means for enterprises The token efficiency gain is immediately usable. Teams running K2.6 in production can swap in K2.7-Code via the OpenAI-compatible API and expect lower inference costs on agentic workflows without an architecture change. 
The 30% thinking-token reduction is Moonshot's own number, so the practical question is whether that efficiency gain holds on a team's own task distribution. The integration path is low-risk enough to find out: run K2.7-Code against your own workloads before adjusting gateway weights or committing.
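Checking that claim against your own workloads comes down to simple arithmetic on paired runs. The Python sketch below assumes you have already logged thinking-token counts per task for both models; the function name and the numbers are illustrative placeholders, not measurements of K2.6 or K2.7-Code.

```python
from typing import List


def thinking_token_reduction(baseline: List[int], candidate: List[int]) -> float:
    """Relative drop in total thinking tokens from baseline to candidate."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("need paired, non-empty per-task token counts")
    base_total = sum(baseline)
    if base_total == 0:
        raise ValueError("baseline used no thinking tokens")
    return (base_total - sum(candidate)) / base_total


# Placeholder counts for the same three tasks run on each model.
baseline_tokens = [1000, 800, 2200]   # hypothetical current-model run
candidate_tokens = [900, 700, 1800]   # hypothetical new-model run

reduction = thinking_token_reduction(baseline_tokens, candidate_tokens)
print(f"thinking-token reduction: {reduction:.0%}")
```

A token reduction only matters if task success holds steady, so it is worth tracking pass rates on the same task set alongside this number.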

Large language models continue to struggle with hallucinations, presenting a major roadblock for real-world enterprise applications. Reducing these errors is a messy business, forcing model developers to navigate a strict tradeoff where eliminating factual errors often suppresses valid answers. In a new paper, Google researchers introduce the concept of "faithful uncertainty," a metacognitive technique that aligns a model's response with its internal confidence. This alignment allows the model to offer appropriately hedged hypotheses, such as "My best guess is," instead of defaulting to an unhelpful "answer-or-abstain" binary. In real-world agentic AI applications, this metacognitive awareness acts as an essential control layer. It empowers autonomous systems to accurately determine when their internal knowledge is sufficient and when they must dynamically trigger external tools or search APIs to resolve deficits. The utility tax of current mitigation strategies Understanding why LLMs hallucinate hinges on separating two capabilities: a model knowing facts versus knowing what is known. Historically, most factuality gains in AI have come from expanding the knowledge boundary, meaning developers simply pack more facts into the model's parameters through larger scale and more training data. However, expanding a model's knowledge does not automatically improve its boundary awareness, which is its ability to distinguish the known from the unknown and recognize its own limitations. “There are broadly two ways to improve LLM factuality,” Gal Yona, Research Scientist at Google and co-author of the paper, told VentureBeat. The first is continuing to teach the model more facts. But, Yona notes, “model capacity is finite, and the long tail of knowledge is effectively infinite.” Once models hit this limit, the hope is they know what they don't know and simply abstain from answering. However, this is inherently difficult for LLMs. 
“This is why most practical attempts to reduce hallucinations through various interventions don't actually make it to deployment,” Yona explains. “They do reduce hallucinations, but they also hurt utility, because the model ends up refusing to answer questions it actually does know.” This inability to distinguish between knowns and unknowns creates what the paper's authors call the "utility tax." Enforcing a zero-hallucination standard requires the model to abstain whenever it is even slightly uncertain, discarding massive volumes of completely valid information. For example, the authors demonstrate that reducing an underlying 25% error rate down to a strict 5% target forces developers to discard 52% of the model's correct answers. Treating all errors as hallucinations forces enterprise systems to choose between trustworthiness and helpfulness. Application developers are generally unwilling to pay this massive utility tax and render their models unhelpful. Consequently, they optimize systems to prioritize coverage, forcing models to operate in a state where they continue to generate confident hallucinations. Reframing hallucinations as confident errors To move past the utility tax, the researchers propose to stop treating any factual error as a hallucination. Instead, they reframe hallucinations as "confident errors": incorrect information delivered authoritatively without appropriate qualification. This subtle reframing dissolves the strict "answer-or-abstain" dichotomy and allows the model to express its uncertainty. In this new framework, if a model makes a factual mistake but appropriately hedges its response (e.g., by stating, "I am not completely sure, but I think..."), it isn't a hallucination. It is simply a hypothesis offered to the user for consideration. By expressing uncertainty, the AI preserves its utility—sharing whatever partial or likely knowledge it has—without violating the user's trust. 
However, if an AI assistant hedges all its responses with a disclaimer, the user is forced to double-check everything, defeating the purpose of the tool entirely. The solution the researchers propose is "faithful uncertainty." This approach requires aligning a model's linguistic uncertainty, or the words it uses to express doubt, with its intrinsic uncertainty, which is its actual, internal statistical confidence in that specific answer. This ensures the model only hedges when its internal state genuinely reflects conflicting or low-probability information. Faithful uncertainty forms a core component of “metacognition,” the AI's ability to be aware of its own uncertainty and act on it. To understand this practically, consider the intuitive example of consulting a doctor. We do not trust doctors because they are all-knowing. We trust them because they reliably distinguish between a confident diagnosis ("You have a fracture") and an educated hypothesis ("It might be a sprain, but let's run some tests"). Practical implications for enterprise AI Under the new framing, errors where a model is genuinely confident but factually incorrect are categorized as “honest mistakes.” This casts knowledge expansion (training the model on more data) and faithful uncertainty as completely complementary efforts. Knowledge expansion pushes the absolute knowledge boundary outward to minimize honest mistakes, while faithful uncertainty honestly communicates wherever that boundary currently lies. This new framing has important implications for agentic applications. The shift to agentic AI might make it seem like knowing what the model doesn't know is redundant, since models can just search external databases. However, access to external tools actually amplifies the need for faithful uncertainty. In agentic systems, metacognition becomes the central control layer that governs the entire system. 
External tools solve the storage problem because the model no longer needs to encode every fact into its parameters. However, this introduces a new control problem: managing when to retrieve information, verify facts, and orchestrate these external tools. Without faithful uncertainty, an agent is essentially flying blind and must rely on external, static heuristics or over-engineered scaffolds. “The model might search for something it already knows confidently—wasting latency and cost for no gain. Or the opposite: it confidently answers from memory when it should have searched, producing a plausible but wrong output,” Yona said. Today’s agent harnesses try to solve this externally with query classifiers or always-search rules, but Yona notes that these are "static and brittle." By using its intrinsic uncertainty to regulate its own behavior, the agent dynamically optimizes its tool use, choosing to invoke a search tool only when its internal confidence is genuinely low. Beyond deciding when to search, faithful uncertainty is critical for evaluating the results of a search. If a tool returns low-quality or unexpected information, a metacognitive agent does not blindly accept whatever appears in its context window. Instead, it uses its uncertainty awareness to weigh the retrieved external signals against its own internal priors. This prevents sycophantic behavior where the system might otherwise trust external sources that conflict with its actual known knowledge. The bootstrapping paradox: The catch to teaching uncertainty For enterprise builders, achieving this faithful uncertainty is trickier than it sounds. It requires teaching models the syntax of uncertainty through supervised fine-tuning (SFT). Because pre-trained models are mostly fed authoritative text, they must be explicitly taught to say things like, "I'm not entirely sure, but I think VentureBeat was founded in..." But SFT introduces a "bootstrapping paradox." 
Unlike standard training datasets where the "right answer" is the same regardless of the model, the ground truth for uncertainty is the model's own dynamic knowledge base. “Here's the catch: the 'correct' expression of uncertainty is inherently dynamic, because it depends on what this particular model knows or doesn't know at this particular point in training,” Yona said. “If you train on a label that says 'I don't know X' but the model actually does know X, you've taught it to hallucinate uncertainty... The training data is static, but the target is a moving one, and that's the fundamental tension teams need to grapple with.” The road to self-aware AI For enterprises looking to implement these capabilities without expensive retraining, prompting serves as the most accessible entry point. “Prompt engineering is already something most engineers do today, this provides the lowest-friction path to improving metacognitive behavior today,” Yona said. Enterprise developers can explore frameworks like MetaFaith, an open-source project previously co-authored by Yona, to begin applying metacognitive prompting to off-the-shelf models. However, Yona cautions that "there is still substantial headroom that prompting alone doesn’t solve," meaning the industry will eventually need to rely on advanced reinforcement learning (RL) to bake metacognition deeply into model training. Ultimately, as enterprises transition from isolated chat applications to complex, multi-agent workflows, self-awareness will become a defining prerequisite for reliable autonomy. But evaluating whether a model truly possesses this awareness remains a profound technical challenge. “How do you actually evaluate whether a model can sense its internal states?” Yona asks. “Even in humans, it’s hard to define or separate 'true' self-monitoring abilities from a capable reliance on proxies. 
We face exactly the same challenges with LLMs: a model might learn to mimic the style of uncertainty without truly sensing its internal state. Developing evaluation frameworks that can tell the difference is one of the most important open problems in this space.”
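The control-layer role described above, where an agent searches only when its own confidence is low and hedges when it cannot verify, can be sketched in a few lines of Python. This is a toy illustration, not the paper's method: `answer_with_gate`, the 0.75 threshold, and the stubbed `recall` and `search` functions are assumptions, and obtaining a trustworthy confidence score from a real model is exactly the open problem Yona describes.

```python
from typing import Callable, Optional, Tuple


def answer_with_gate(
    question: str,
    recall: Callable[[str], Tuple[str, float]],
    search: Callable[[str], Optional[str]],
    threshold: float = 0.75,
) -> str:
    """Answer from memory when confident; otherwise search, then hedge."""
    answer, confidence = recall(question)  # confidence assumed in [0, 1]
    if confidence >= threshold:
        return answer  # confident: skip the search call entirely
    evidence = search(question)
    if evidence is not None:
        return evidence  # simplification: accept retrieved evidence as-is
    # Nothing retrieved: offer a hedged hypothesis instead of abstaining.
    return f"My best guess is {answer}, but I am not certain."


# Hypothetical stand-ins for a model's memory and a search tool.
memory = {
    "capital of France": ("Paris", 0.98),
    "founding year of ExampleCo": ("2006", 0.40),
}
search_calls = []


def recall(question: str) -> Tuple[str, float]:
    return memory[question]


def search(question: str) -> Optional[str]:
    search_calls.append(question)
    return None  # simulate a search that finds nothing


confident = answer_with_gate("capital of France", recall, search)
hedged = answer_with_gate("founding year of ExampleCo", recall, search)
print(confident)
print(hedged)
```

A fuller version would weigh retrieved evidence against the model's prior rather than accepting it outright, which is the second behavior the researchers highlight.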

The creators of the hit, enterprise-friendly, open source OpenClaw variant NanoClaw are partnering with software supply chain management leader JFrog to launch a new, joint security integration they say will protect NanoClaw autonomous agents from malicious code injection. "These agents are doing things that you cannot necessarily control, and you cannot necessarily train," said Gal Marder, Chief Strategy Officer at JFrog, in an exclusive interview with VentureBeat. Available immediately, the partnership hardwires NanoClaw agents directly to JFrog’s vetted software registries, ensuring that AI assistants can only pull scanned, safe dependencies. The release addresses a rapidly growing blind spot in tech: autonomous agents frequently install packages in the background to extend their capabilities, often without their human operators' knowledge or oversight. "The people who are operating the agents are not necessarily developers, and they are not even aware of the implications," explained Gavriel Cohen, creator of NanoClaw and CEO and co-founder of its new commercial services startup, NanoCo AI. To secure the broader ecosystem, the partners are working to make it available completely free of charge for the open-source community, while enterprise organizations can seamlessly route their agents through their existing, commercially licensed JFrog environments. The new technical capability enabled by this partnership follows NanoCo's moves to add permissions dialogs across the apps in which it's available via a partnership with Vercel, and a new partnership with Docker to allow NanoClaw agents to run more securely, isolated from other software environments directly inside Docker virtual containers. The risk of current, personal autonomous AI agents When an operator interacts with an autonomous system like NanoCo's NanoClaw, they communicate at a high level of abstraction. 
A user might simply send an audio file or a voice note, prompting the agent to independently figure out how to process it. As Cohen explained, the agent thinks, "oh, I can't understand voice notes, so let me go and grab a package and download something and install it and set it up and run it". This dynamic self-improvement makes AI agents incredibly powerful, but it also renders them highly susceptible to software supply chain attacks. Bad actors are increasingly poisoning open-source registries with malicious packages. Because agents act autonomously to fetch what they need, they bypass human scrutiny. The operators, who may not even be developers, are largely unaware of the security implications unfolding behind the scenes. How NanoCo and JFrog are working to stop agents from running malicious code The integration between NanoCo and JFrog acts as an automated immune system for these AI environments. Under the hood, NanoClaw agents are now configured to route their requests for software packages, CLI tools, and Model Context Protocol (MCP) servers exclusively through JFrog’s registries. If an agent attempts to download a compromised library—such as a vulnerable version of the popular Axios package—the JFrog registry intercepts the request. It blocks the installation, returning a security policy error to the agent, noting that the request was "rejected by JFrog's registry with a 403 security policy". Crucially, the system does not just stop at blocking the threat; it creates a dynamic correction loop. The agent is notified of the vulnerability and guided to automatically seek out and install an approved, non-malicious version of the requested package instead. For large organizations, this integration solves a massive compliance headache. Marder notes that as enterprises adopt autonomous agents, they require absolute visibility. 
Organizations need "a system of record, we need somewhere to track what agents that's running by whom and consuming what packages and using what skills and using what MCPs," he told VentureBeat. Beyond visibility, the JFrog integration provides a foundational "trust layer" and strict governance over what these automated systems are permitted to access. Licensing and accessibility In the realm of software distribution, licensing and access parameters dictate adoption. The NanoCo and JFrog partnership utilizes a dual-track approach to serve both individual open-source developers and highly regulated enterprises. For the open-source community, the integration is completely free. JFrog is providing open-source NanoClaw users with complimentary access to safe, vetted sources of artifacts, tools, and skills. This allows individual developers to run autonomous agents locally without drowning in manual approval requests for every single dependency. Furthermore, as community members build and share new "skills" for the agents, these contributions are uploaded to the registry, scanned for malicious code, and cleared before anyone else can use them. This infrastructure directly neutralizes the threat of poisoned community repositories. For enterprise deployments, the architecture plugs seamlessly into an organization's existing commercial environment. Rather than using the public open-source registry, corporate users point their NanoClaw agents to their own internal JFrog registries. This ensures that all agent activity adheres to the company’s specific commercial licenses, internal security policies, visibility needs, and governance standards. As AI continues to blur the line between human intent and machine execution, the infrastructure securing that execution must evolve. 
This partnership acknowledges a core reality: you cannot train an AI to perfectly recognize every zero-day vulnerability; instead, you must build an environment where the agent simply cannot reach the vulnerability in the first place.
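The block-and-correct loop described in this article can be illustrated with a small simulation. This is a minimal sketch, not JFrog's or NanoClaw's API: `PolicyRegistry`, `PolicyError`, `install_with_correction` and the version strings are all hypothetical names, and the "registry" is an in-memory object rather than a network service. It shows a request being rejected with a 403-style policy error, the agent retrying with an approved version, and every attempt landing in an audit log that plays the role of the "system of record."

```python
from dataclasses import dataclass, field


class PolicyError(Exception):
    """Raised when the registry refuses a package (modeled on an HTTP 403)."""

    def __init__(self, package, version, approved):
        super().__init__(f"403 security policy: {package}@{version} rejected")
        self.status = 403
        self.approved = approved  # approved version to suggest, or None


@dataclass
class PolicyRegistry:
    """Hypothetical stand-in for a curated registry; not a real client API."""
    blocked: set = field(default_factory=set)      # {(name, version)}
    approved: dict = field(default_factory=dict)   # name -> vetted version
    audit_log: list = field(default_factory=list)  # (agent, name, version, outcome)

    def fetch(self, agent, package, version):
        if (package, version) in self.blocked:
            self.audit_log.append((agent, package, version, "blocked"))
            raise PolicyError(package, version, self.approved.get(package))
        self.audit_log.append((agent, package, version, "allowed"))
        return f"{package}@{version}"


def install_with_correction(registry, agent, package, version):
    """Try the requested version; on a policy rejection, retry once with
    the version the registry reports as approved."""
    try:
        return registry.fetch(agent, package, version)
    except PolicyError as err:
        if err.approved is None:
            raise  # nothing vetted to fall back to; surface the rejection
        return registry.fetch(agent, package, err.approved)


registry = PolicyRegistry(
    blocked={("axios", "0.0.0-bad")},   # placeholder version, not a real advisory
    approved={"axios": "0.0.0-good"},   # placeholder version
)
installed = install_with_correction(registry, "agent-1", "axios", "0.0.0-bad")
print(installed)
print(registry.audit_log)
```

The point of the design is that the agent never decides for itself whether a package is safe: the registry is the only path to the artifact, so the policy check cannot be skipped.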
Most enterprise RAG pipelines start the same way: a text parser converts web pages and documents into plain text so they can be chunked and indexed for retrieval. That conversion step destroys retrieval signals — and according to new research, it's responsible for the majority of wrong answers. A research team from UC Berkeley, Princeton University, EPFL and Databricks published a paper this week introducing PixelRAG, a system that skips that conversion entirely. Instead of parsing pages into text, PixelRAG renders them as screenshots, indexes those images and feeds retrieved tiles directly to a vision-language model reader. Tested across 30 million screenshot tiles covering all of Wikipedia, it outperforms text-based RAG across six benchmarks, improving accuracy by up to 18.1% over text-based baselines. Parsers are the wrong place to look for fixes, according to the research team. "Improving parsers is an endless process because every website requires special handling," Yichuan Wang, lead author and UC Berkeley doctorate student, told VentureBeat. "Our goal was to explore whether recent advances in VLMs make it possible to bypass that entire problem and build a retrieval system that works across websites without site-specific engineering." HTML parsers destroy the retrieval signals that enterprise RAG depends on The goal of the researchers was to develop a clean end-to-end architecture. "Modern web RAG pipelines often involve rendering, parsing, cleaning, chunking, and many other handcrafted stages," Wang said. "Every stage introduces potential cascade errors and abstractions that move us further away from the original webpage. We were interested in whether we could eliminate most of that complexity and operate directly on the rendered page." Wang also noted that parsing inevitably loses information. Images, visual hierarchy, typography, emphasis (e.g., bold text), tables, and layout are either discarded or converted into imperfect textual approximations. 
"No matter how good a parser becomes, some information is fundamentally lost during the conversion," he said. The research identifies three ways text-based RAG loses the answer before it reaches the reader. All three were measured on SimpleQA, a standard benchmark of 1,000 factual Wikipedia questions: Parser loss (36.6% of failures). HTML-to-text conversion destroys structured content so completely that no text chunk in the corpus contains the answer. Rank loss (55.2% of failures). The answer exists in the corpus but gets outranked by keyword-dense infoboxes that land at rank 1 for 75.9% of queries, pushing answer-bearing paragraphs to rank 20 or lower. Reader loss (8.2% of failures). The correct content reaches the reader but flattened structure causes misattribution. How PixelRAG works Unlike a standard LLM that reads only text, a vision-language model takes images as input alongside text, meaning it can read a rendered web page the way a human does, with layout and structure intact. "For many structured information extraction tasks, we believe modern VLMs have an inherent advantage because they can reason jointly over both content and layout rather than relying on a flattened text representation," Wang said. PixelRAG is built around that principle, replacing the text parsing pipeline with a four-stage system that operates entirely on rendered screenshots. Rendering. Pages are rendered using Playwright, a browser automation library, at a fixed 875-pixel viewport and sliced into 1024-pixel-tall tiles. Wikipedia's 7 million articles produce roughly 30 million tiles. Assets are cached locally and rendered entirely offline. Indexing. Each tile is encoded as a single 2048-dimensional vector using Qwen3-VL-Embedding-2B and stored in a FAISS approximate nearest-neighbor index. The full index runs to approximately 120 GB in fp16 and supports incremental updates without full re-indexing. Training. 
The retrieval model is fine-tuned on synthetic contrastive data generated from the datastore, using dynamic hard-negative mining to filter false negatives. LoRA, a lightweight fine-tuning method that updates a small fraction of model weights, is applied to both the language model backbone and the visual encoder. Training on approximately 40,000 pairs completes in under three hours on a single H100. Storage. Raw screenshot tiles for Wikipedia require 5.6 TB, but a render-on-demand approach eliminates persistent storage: embed all tiles, delete the screenshots and re-render pages on demand at query time. The vector index requires approximately 120 GB. Six benchmarks, 10x agent token savings and one unsolved problem Researchers tested PixelRAG across six benchmarks spanning factual Wikipedia QA, table-based queries, multimodal QA and live news retrieval. They said it outperformed text-based RAG on all six, including on tasks where questions are answerable from text alone. On SimpleQA it reaches 78.8% accuracy versus 71.6% for the strongest text parser, widening to 48.8% versus 42.5% on structured table queries. Teams need Qwen3-VL-4B class models or above to see the benefit. Smaller models trail text retrieval by more than 12.5 percentage points. The agent cost advantage is the strongest near-term case for PixelRAG. In benchmark testing, an AI agent using PixelRAG as its search backend ran on 3.6 million prompt tokens versus 37.5 million for text retrieval, at 2 to 4 times lower cost than alternatives including Google, while achieving higher accuracy. Image compression can cut that token budget by a further third. Visual chunking is the main unsolved problem. Text-based RAG systems have spent years refining how to split documents into meaningful retrieval units based on topic, section or semantic content. 
PixelRAG currently has no equivalent: it slices pages by fixed pixel height, meaning a table or paragraph can get cut in half mid-tile with no awareness of content boundaries. "The text retrieval community has spent years studying chunking strategies, while visual retrieval has received much less attention," Wang said. "We think this is an important area for future research." What this means for enterprises The retrieval quality problem PixelRAG addresses reflects a broader market shift already underway. VB Pulse Q1 2026 data from qualified enterprise respondents found intent to adopt hybrid retrieval tripling from 10.3% in January to 33.3% in March, the fastest-growing strategic position in the dataset. PixelRAG's own authors point to hybrid deployment as the most practical near-term path — layering visual retrieval on top of existing text systems rather than replacing them. For teams already running RAG pipelines, the path to those savings is more straightforward than a ground-up rebuild. "A practical path is to use PixelRAG as an enhancement layer alongside existing text retrieval systems," Wang said. "Hybrid retrieval that combines both text and visual search is straightforward and is likely how many production deployments would evolve."
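Two of the figures above can be checked with simple arithmetic: how fixed-height slicing splits content across tiles, and why about 30 million 2048-dimensional fp16 vectors come to roughly 120 GB. The sketch below is illustrative only. The function names, the 2,500-pixel page and the table position are assumptions, and the size estimate counts raw vectors without any FAISS index overhead.

```python
import math

TILE_HEIGHT_PX = 1024   # tile height reported in the article
EMBED_DIM = 2048        # one vector per tile
BYTES_PER_FP16 = 2


def tile_spans(page_height_px, tile_height_px=TILE_HEIGHT_PX):
    """Return (top, bottom) pixel ranges for fixed-height tiles of one page."""
    n_tiles = math.ceil(page_height_px / tile_height_px)
    return [
        (i * tile_height_px, min((i + 1) * tile_height_px, page_height_px))
        for i in range(n_tiles)
    ]


def tiles_touching(region_top, region_bottom, tile_height_px=TILE_HEIGHT_PX):
    """Indices of the tiles that a content region (e.g. a table) overlaps."""
    first = region_top // tile_height_px
    last = (region_bottom - 1) // tile_height_px
    return list(range(first, last + 1))


def index_size_gb(n_tiles, dim=EMBED_DIM, bytes_per_value=BYTES_PER_FP16):
    """Raw vector storage in decimal gigabytes, ignoring index overhead."""
    return n_tiles * dim * bytes_per_value / 1e9


spans = tile_spans(2500)            # hypothetical page, 2,500 px tall
split = tiles_touching(900, 1300)   # hypothetical table from y=900 to y=1300
size = index_size_gb(30_000_000)    # roughly 30 million tiles, per the article
print(spans)
print(split)
print(round(size, 2))
```

In this example the hypothetical table lands in two tiles, which is the content-boundary problem Wang describes: neither tile contains the whole table, so neither embedding represents it completely.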

Agent skills have become an important part of real-world AI applications, providing a mechanism, usually a set of instructions saved in a folder of text-based markdown (.md) files, for models to adapt to specific enterprise use cases and complex workflows. However, optimizing these skills is a slow, error-prone process, as they cannot be trained in the same way as the parameters of the underlying AI model. Instead, users typically must update them manually by retyping the instructions in each file, playing a "guessing game" as to what changes might improve agentic AI performance and reduce errors. SkillOpt, a new, open source (MIT Licensed) framework developed by Microsoft, does one better: it introduces an optimizer designed for agent skills, turning the agent's skill .md document into a trainable object that evolves based on performance feedback. It uses deep-learning-style optimization to make it possible for the AI to systematically explore modifications to the document and find the best combination of instructions. Most importantly, it accomplishes this procedural adaptation without making changes to the underlying model's weights. On various industry benchmarks, SkillOpt outperforms existing baselines, significantly boosting accuracy for models like GPT-5.5 and Qwen. The result is a set of compact, transferable skill artifacts that allow AI agents to adapt to new domains effortlessly.

The challenge of optimizing agent skills

Agent skills package procedural knowledge into natural-language specifications, including domain heuristics, tool-use policies, output constraints, and known failure modes. These skills provide an external interface for agents to adapt to complex enterprise workflows. In practice, agent skills are stored as text documents and inserted into the agent's context before execution. One of the key benefits of skills is that they customize the behavior of the underlying model without changing its weights.
However, the skill document itself needs to be tweaked and optimized to get the best performance out of the agent. While deep learning relies on strict mathematical controls for stability, human prompt engineering often relies on trial and error. When attempting to automatically update a skill document based on feedback, the lack of mathematical discipline makes text highly volatile. Yifan Yang, Senior Research SDE at Microsoft Research Asia, told VentureBeat that the problem is not making changes, but ensuring those changes are mathematically sound. "The breaking point isn't whether a team can change a skill, it's that they can't guarantee the change is an improvement," Yang said. "Three failure modes recur: no step-size control, so skills drift; no validation, so a fix that reads as reasonable gets written in and can quietly regress performance; and no negative memory, so the same failed edit keeps coming back." To illustrate how easily performance can drop when edits aren't mathematically validated, Yang noted that "an ungated rewrite pushed GPT-5.5 on SpreadsheetBench from 41.8 down to 41.1." According to Yang, these failure modes are amplified in multi-step workflows "because that's where frontier models are weakest zero-shot. Not on reasoning, but on procedural discipline: format, self-verification, tool policy." Before SkillOpt, agent skills were primarily hand-crafted, generated in a single shot, or evolved through loosely controlled self-revision pipelines that could not reliably improve under feedback. Prompt optimization methods like TextGrad and GEPA treat language artifacts as optimizable objects and use trajectory feedback to evolve prompts, but they focus on single-prompt configurations rather than generating persistent, reusable skill artifacts. 
Meanwhile, skill evolution and discovery methods like EvoSkill and Trace2Skill convert agent execution experiences into trajectory lessons to refine skill folders, build domain-specific libraries, or perform evolutionary search. None of them apply deep-learning-style controls, such as learning rates, validation gates, and momentum, which are necessary to continuously train a single, compact skill document. Importing mathematical discipline to text SkillOpt optimizes a text document through an iterative propose-and-test loop that separates the model executing the tasks from the model optimizing the skill. The process unfolds in several steps: SkillOpt starts with an initial skill document and a frozen target model (or harness), where the target model runs a batch of tasks to generate execution trajectories that act as the evidence for the current step. An offline optimizer model analyzes these trajectories, separating successes from failures into minibatches. Looking at a minibatch helps the model identify systematic procedural errors rather than one-off anomalies. Based on these patterns, the optimizer proposes structural add, delete, or replace edits to the skill document. The proposed edits are reviewed to filter out duplicates or contradictions, and the optimizer then ranks these candidate edits by their expected utility. Rather than applying all proposed changes, SkillOpt clips the list to a maximum edit budget for that step, generating a candidate skill. The candidate skill is evaluated on a held-out validation set using the target model. If the candidate improves the validation score, it is accepted and becomes the new current skill. If it fails, the edits are rejected and sent to a rejected-edit buffer, providing negative feedback so the optimizer knows not to repeat that mistake. SkillOpt directly addresses the problem of treating text as a trainable object by importing mathematical concepts from deep learning. 
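The propose-and-test steps above can be sketched as a short loop. This is a toy illustration under stated assumptions, not Microsoft's implementation: the skill is a plain list of instruction lines, `propose` stands in for the optimizer model, `validate` stands in for scoring the frozen target model on a held-out split, and the epoch-level slow update is left out. All function names and the example rules are hypothetical.

```python
def apply_edits(skill, edits):
    """Apply ('add' | 'delete', line) edits to a skill stored as a list of lines."""
    lines = list(skill)
    for op, line in edits:
        if op == "add" and line not in lines:
            lines.append(line)
        elif op == "delete" and line in lines:
            lines.remove(line)
    return lines


def optimize_skill(skill, propose, validate, steps, edit_budget):
    """Propose-and-test loop with an edit budget, a validation gate and a
    rejected-edit buffer (illustrative names, not the SkillOpt API)."""
    rejected = set()
    best_score = validate(skill)
    for _ in range(steps):
        # Ranked candidates, skipping edits that already failed validation.
        candidates = [e for e in propose(skill) if e not in rejected]
        step_edits = candidates[:edit_budget]   # clip to the edit budget
        if not step_edits:
            break
        candidate_skill = apply_edits(skill, step_edits)
        score = validate(candidate_skill)       # held-out validation gate
        if score > best_score:
            skill, best_score = candidate_skill, score
        else:
            rejected.update(step_edits)         # negative memory
    return skill, best_score, rejected


# Toy stand-ins: a scorer that rewards two rules and penalizes one.
HELPFUL = {"verify totals before answering", "output valid JSON only"}
HARMFUL = {"always answer in one word"}


def toy_validate(skill):
    return sum(l in HELPFUL for l in skill) - sum(l in HARMFUL for l in skill)


def toy_propose(skill):
    pool = [
        ("add", "always answer in one word"),       # plausible-sounding but bad
        ("add", "verify totals before answering"),
        ("add", "output valid JSON only"),
    ]
    return [e for e in pool if e[1] not in skill]


final_skill, final_score, rejected = optimize_skill(
    ["be concise"], toy_propose, toy_validate, steps=5, edit_budget=1
)
print(final_skill, final_score, rejected)
```

With an edit budget of 1, the bad edit is tried once, fails the gate, and is never proposed again, while the two useful rules are accepted one step at a time. Note that this sketch rejects a whole batch when a step fails, so with a larger budget a good edit bundled with a bad one would also land in the buffer.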
The creators note that “the deep-learning analogy is operational rather than decorative,” helping the framework avoid the instability issues associated with other optimization techniques. The edit budget acts as a learning rate. By limiting how many edits can be applied at once, the skill version is prevented from moving too far from its previous state, preserving continuity while allowing new procedures to be acquired. Just like checking validation loss in deep learning, the strict held-out examples ensure that plausible-sounding text edits are only kept if they mathematically improve the agent's actual performance on the validation split. At the end of an epoch, SkillOpt performs a slow update by comparing tasks under the previous and current epoch's skills. This acts like a momentum term, carrying durable, long-horizon procedural lessons forward while isolating them from the fast, step-level edits. SkillOpt in action To evaluate the technique in practice, researchers tested SkillOpt across different models, ranging from large-scale frontier models like GPT-5.5 to smaller closed and open models including GPT-5.4-mini and Qwen3.5-4B. They also deployed the skills within different execution harnesses, using plain chat as well as complex coding harnesses like the Codex CLI and Claude Code. The evaluation spanned diverse industry benchmarks including single-round question-answering, multi-round code generation involving tool use, and multimodal document reasoning. SkillOpt was measured against multiple baselines ranging from a default no-skill setting to human-written skills and one-shot LLM-generated skills. It was also compared against advanced prompt-optimization and skill-evolution methods, specifically Trace2Skill, TextGrad, GEPA, and EvoSkill. SkillOpt dominated across the board, proving highly effective on all 52 evaluated combinations of model, benchmark, and harness. 
It was particularly effective with frontier models, delivering an average absolute improvement of +23.5 points against the no-skill baseline on GPT-5.5. Furthermore, SkillOpt outperformed a hypothetical oracle baseline that cherry-picks the best competing method for every problem. Small target models saw immense relative gains, proving that a compact text file can supply procedural knowledge that small models lack in their weights. For example, GPT-5.4-nano nearly doubled its score on multimodal document QA and tripled its score on embodied interaction and sequential decision-making. These academic benchmarks map to critical enterprise pain points. Zero-shot models often hallucinate formatting or fail to use tools properly in multi-step scenarios. Yang explained that the biggest performance leaps occurred in operations that enterprises historically struggle to automate reliably. "Document data extraction... exact figures out of contracts, invoices, and forms — AP automation, claims, compliance," Yang said. "What improves is reliability: precise formatting, self-verification, auditable outputs. And the gains come from learning procedure, not memorizing answers." For enterprise practitioners, the true value of SkillOpt lies in its portability, efficiency, and compatibility with existing infrastructure. Experiments confirm that the framework is harness-agnostic. In addition to basic chat, the same optimization loop was successfully integrated into tool-backed execution environments like the Codex CLI and Claude Code with significant gains on industry benchmarks. Developers can train a skill using one execution loop and deploy it in another. For example, a spreadsheet skill trained entirely inside the Codex loop was moved directly into Claude Code and drove a +59.7 point gain over Claude Code's native baseline without any further changes. SkillOpt artifacts also transfer cleanly across model scales. 
A skill optimized for GPT-5.4 was deployed onto the smaller GPT-5.4-mini and GPT-5.4-nano models with positive gains, proving that the learned procedures encode reusable workflows rather than just exploiting quirks of a specific model's architecture. Finally, the framework is highly efficient regarding token usage and context window real estate. Across all benchmarks, the final deployed skills never exceeded 2,000 tokens, with a median length of roughly 920 tokens. This results in highly readable, auditable artifacts that a human practitioner can review and manage in minutes. Implementation strategies and the enterprise 'catch' For enterprise tech leaders, adopting a new framework requires understanding the overhead and limitations. While the research paper notes that training tokens can reach up to 210 million for academic benchmarks, the reality for day-to-day enterprise use cases is much lighter. The high token counts in testing were largely due to re-scoring massive held-out test sets. "The real upfront work is the verifier and a representative held-out split. The optimizer is light; the evaluation harness is where the engineering goes," Yang said. He added that for everyday use, "in community frameworks like GBrain, where SkillOpt updates run on Claude Sonnet, training a skill for a single task averages just $1–5." This optimization cost is a one-time fee that amortizes completely at deployment. However, the framework requires specific conditions to work effectively, namely a few dozen representative examples and a scorable feedback signal. Teams should avoid applying SkillOpt to open-ended or subjective tasks. "With no clean automatic scorer you have to design a human- or model-based evaluator and watch its stability," Yang said. SkillOpt also integrates smoothly with existing orchestration stacks, removing a major adoption hurdle. For instance, developers already using pipeline compilers can run both systems harmoniously. 
"DSPy is a different, complementary layer," Yang said. "It compiles declarative LM pipelines and optimizes program structure; SkillOpt optimizes the external skill state a frozen agent loads. You can run them together." Looking ahead, open-source developers are already scheduling SkillOpt to run periodically over their agents' past trajectories, creating a small ecosystem of self-optimizing code-agent plugins. This continuous feedback loop represents a significant shift in how AI systems adapt. "The valuable version of self-improvement is an agent autonomously discovering knowledge to improve its own behavior and the user experience, under verification and audit," Yang said. "Skills are the fastest, cheapest, most reversible first step, and the same mindset points toward agents eventually optimizing themselves, all the way down to their own weights."

Xiaomi's MiMo AI team has open-sourced MiMo Code V0.1.0, a terminal-native AI coding assistant that the Chinese electronics giant says outperforms Anthropic's Claude Code on key agentic coding benchmarks, especially on long-horizon, multi-step tasks (200+ steps) — at least, according to its own internal beta release and survey of 576 developers. It's also bundling limited-time free access to MiMo-V2.5, its multimodal flagship model with a million-token context window, requiring no registration to get started. The release was announced June 10, 2026 in a post on the social network X from the official @XiaomiMiMo account, which described the tool as "more than an AI coding assistant in your terminal — it's the smartest coding partner you'll ever work with." MiMo Code is available now on GitHub under an MIT license, and installs with a single terminal command (curl -fsSL https://mimo.xiaomi.com/install | bash) on macOS and Linux or via npm (npm install -g @mimo-ai/cli) on Windows. The project is a fork of the open-source OpenCode agent, which Xiaomi has extended with its own memory architecture, workflow modes, and model harness. The end of AI coding agents' amnesia? As any avid vibe coder would surely attest, AI coding agents degrade over long working sessions: as the context window fills, earlier decisions, conventions, and task state get compacted away or lost entirely, forcing developers to re-explain their projects. Xiaomi argues this approach is doomed at scale. "What we need is not better compression, but an explicit storage-and-retrieval mechanism that decides what information should be written into persistent structures, and when it should be recalled," the MiMo team noted in their launch blog. MiMo Code attacks this with a cross-session memory system, powered under the hood by SQLite FTS5 full-text search, that spans four layers: project memory (a persistent MEMORY.md file), session checkpoints, scratch notes, and per-task progress logs. 
The note-taking is key here: rather than forcing the primary coding agent to pause its work to take notes, the system deploys an independent "checkpoint-writer" subagent. Think of the primary coding agent as a construction contractor building a massive mansion alongside a dedicated architect, the checkpoint-writer subagent. While the main agent focuses on building out the physical structure, the subagent updates the blueprints in real time, noting decisions, issues, and the actual lay of the land as the construction project progresses. When the context window approaches its limits (the contractor gets lost in the half-built mansion), it can consult the subagent and find its place again. In the case of MiMo Code, the system simply rebuilds the environment from structured checkpoints with the relevant context, ensuring no loss of operational momentum. Two self-improvement mechanisms round out the system: a /dream command that periodically (roughly every seven days) reviews historical sessions, deduplicates them, and compresses them into long-term memory, and a "distill" function that mines past sessions for repeated workflows that can be automated, following a similar approach taken recently by OpenAI and Anthropic with their various models.

Impressive performance on software engineering (SWE) benchmarks

According to benchmark figures published in Xiaomi's technical blog post, MiMo Code paired with MiMo-V2.5-Pro outperformed Claude Code paired with Claude Sonnet 4.6 on all three evaluations tested:

SWE-bench Verified: 82% vs. 79%
SWE-bench Pro: 62% vs. 55%
Terminal Bench 2: 73% vs. 69%

The harness itself accounts for a measurable share of the gain. Running the same MiMo-V2.5-Pro model in both harnesses, MiMo Code scored 62% on SWE-bench Pro versus 57% for Claude Code, and 73% on Terminal Bench 2 versus 68%. That is roughly five points on each benchmark, attributable purely to the agent system rather than the model.
Xiaomi notably did not publish comparisons against OpenAI's Codex or Google's Gemini CLI — Claude Code is the sole named competitor throughout its materials, a telling choice of benchmark target. Independent reference points suggest why. On the official Terminal-Bench 2.0 leaderboard maintained at tbench.ai, OpenAI's Codex CLI running GPT-5.5 scores 82.2% — roughly nine points above MiMo Code's self-reported 73% — and OpenAI's own GPT-5.5 announcement claims 82.7% on the same benchmark. On SWE-Bench Pro, however, the picture flips: OpenAI reports GPT-5.5 at 58.6%, below MiMo Code + MiMo-V2.5-Pro's claimed 62%. (MiMo Code does not yet appear on either official leaderboard, and cross-comparing self-run numbers against leaderboard submissions carries the usual configuration caveats.) Perhaps more interesting than the offline benchmarks: Xiaomi says it ran a human double-blind A/B evaluation during its internal beta, covering 576 developers working in 474 real private repositories, producing 1,213 judged head-to-head pairs against Claude Code using the same target model. Under 200 execution steps, the two systems split roughly 50/50 — but past 200 steps, MiMo Code's win rate rose above 65%, supporting the company's thesis that its memory and state-management architecture pays off specifically on long-horizon work. Xiaomi itself concedes the standard benchmarks "still measure one-shot problem-solving ability" and don't capture the tool's multi-session design goals. As always, these are vendor self-reported numbers that haven't been independently verified, and head-to-head harness comparisons are sensitive to configuration. But the claims are consistent with a broader industry pattern: scaffolding and harness engineering are becoming as important as raw model capability in agentic coding performance. Easy integration with existing developer systems and voice control From a user experience standpoint, MiMo Code is designed to live where developers already work. 
It operates directly in the terminal, reading and writing files, running commands, and managing Git. Out of the box, the tool requires zero configuration, connecting automatically to "MiMo Auto," a free-for-a-limited-time channel powered by Xiaomi's multimodal MiMo-V2.5 model, which boasts a massive million-token context window. For developers migrating from existing environments, the transition is frictionless: MiMo Code automatically imports MCP servers, custom skills, and API configurations from Claude Code. Other noteworthy features include:

Compose mode: Pressing Tab switches the agent into a specification-driven workflow in which the developer describes a high-level goal and the system autonomously executes the full development cycle (design, planning, coding, testing, and review), following what Xiaomi describes as a "heavy planning upfront, stable verification later" strategy.

Voice control: Built on Xiaomi's MiMo-ASR speech recognition with TenVAD voice activity detection, this feature lets developers dictate and modify instructions verbally and speak commands like "send" and "execute" for fully hands-free operation (available for logged-in users).

Aggressively affordable

The bigger lure for many developers may be what's bundled in.
MiMo Code ships with "MiMo Auto," a zero-configuration channel offering free, limited-time access to MiMo-V2.5 — the natively multimodal model Xiaomi released in late April 2026, a sparse mixture-of-experts design with 310 billion total parameters (just 15 billion active per inference) and a 1 million token context window, which the company positions as matching Anthropic's Claude Sonnet 4.6 in multimodal agentic work. As VentureBeat reported when the MiMo-V2.5 family launched in April, the models are MIT-licensed and among the most efficient and affordable available for agentic tasks. The larger MiMo-V2.5-Pro — a 1.02-trillion-parameter mixture-of-experts model with 42 billion active parameters and a hybrid-attention architecture — led the open-source field on Xiaomi's ClawEval agentic benchmark with a 63.8% success rate while consuming only about 70,000 tokens per trajectory, roughly 40–60% fewer than Anthropic's Claude Opus 4.6, Google's Gemini 3.1 Pro, or OpenAI's GPT-5.4 needed for comparable results. Notably, the V2.5-Pro's post-training was explicitly designed to instill "harness awareness" — training the model to manage its own memory and context within agent scaffolds like Claude Code or OpenCode — making a Xiaomi-built harness optimized around that capability a logical next step. Pricing is similarly aggressive: MiMo-V2.5 starts at $0.40 per million input tokens and $2.00 per million output tokens, while V2.5-Pro runs $1.00/$3.00 per million (input/output) up to 256K context, doubling beyond that, with cache hits dropping input costs to as little as $0.20–$0.40 per million, making it among the cheapest frontier models available globally. 
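To make the quoted rates concrete, here is a small cost calculator. It is a sketch under explicit assumptions: "256K" is read as 256,000 prompt tokens, a Pro request whose prompt exceeds that limit is assumed to be billed entirely at the doubled rate (the article does not say how the tier boundary is applied), and cache-hit discounts are ignored. The function and constant names are hypothetical.

```python
# Prices in USD per million tokens, as quoted in the article.
PRICES = {
    "MiMo-V2.5": {"input": 0.40, "output": 2.00},
    "MiMo-V2.5-Pro": {"input": 1.00, "output": 3.00},
}
PRO_TIER_LIMIT = 256_000  # assumption: "256K" read as 256,000 prompt tokens


def request_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of one request, without cache-hit discounts."""
    rate = PRICES[model]
    # Assumption: the whole request is billed at double the base rate once
    # the prompt exceeds the tier limit on the Pro model.
    over_limit = model == "MiMo-V2.5-Pro" and input_tokens > PRO_TIER_LIMIT
    multiplier = 2 if over_limit else 1
    base = rate["input"] * input_tokens + rate["output"] * output_tokens
    return multiplier * base / 1_000_000


short_job = request_cost("MiMo-V2.5-Pro", 100_000, 10_000)
long_job = request_cost("MiMo-V2.5-Pro", 300_000, 10_000)
print(f"${short_job:.2f} vs ${long_job:.2f}")
```

Under these assumptions, one million input tokens plus one million output tokens on MiMo-V2.5 comes to $2.40, which is the sum of the two quoted rates.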
VentureBeat Frontier AI Model API Pricing Snapshot

Model | Input | Output | Total Cost | Source
MiMo-V2.5 Flash | $0.10 | $0.30 | $0.40 | Xiaomi MiMo
deepseek-v4-flash | $0.14 | $0.28 | $0.42 | DeepSeek
deepseek-v4-pro | $0.435 | $0.87 | $1.305 | DeepSeek
MiniMax-M3 | $0.30 | $1.20 | $1.50 | MiniMax
Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | Google
Qwen3.7-Plus | $0.40 | $1.60 | $2.00 | Alibaba Cloud
MiMo-V2.5 | $0.40 | $2.00 | $2.40 | Xiaomi MiMo
Grok 4.3 (low context) | $1.25 | $2.50 | $3.75 | xAI
MiMo-V2.5 Pro (≤256K) | $1.00 | $3.00 | $4.00 | Xiaomi MiMo
GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai
Kimi-K2.6 | $0.95 | $4.00 | $4.95 | Moonshot/Kimi
GLM-5.1 | $1.40 | $4.40 | $5.80 | Z.ai
Grok 4.3 (high context) | $2.50 | $5.00 | $7.50 | xAI
MiMo-V2.5 Pro (>256K) | $2.00 | $6.00 | $8.00 | Xiaomi MiMo
Qwen3.7-Max | $2.50 | $7.50 | $10.00 | Alibaba Cloud
Gemini 3.5 Flash | $1.50 | $9.00 | $10.50 | Google
Gemini 3.1 Pro Preview (≤200K) | $2.00 | $12.00 | $14.00 | Google
GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI
Gemini 3.1 Pro Preview (>200K) | $4.00 | $18.00 | $22.00 | Google
Claude Opus 4.8 | $5.00 | $25.00 | $30.00 | Anthropic
GPT-5.5 | $5.00 | $30.00 | $35.00 | OpenAI
Claude Fable 5 / Claude Mythos 5 | $10.00 | $50.00 | $60.00 | Anthropic

For developers who don't want Xiaomi's models at all, MiMo Code also supports third-party backends (including token plans from DeepSeek, Moonshot's Kimi, and Zhipu's GLM) along with any OpenAI-compatible API, mirroring the bring-your-own-model flexibility of its OpenCode parent.

Terminal AI coding agent wars go global

MiMo Code lands in an increasingly crowded field of terminal-based coding agents: Anthropic's Claude Code, OpenAI's Codex CLI, Google's Gemini CLI, and open-source players like OpenCode and Aider. What's new is the entrant. Xiaomi, the world's third-largest smartphone maker with a fast-growing EV business, has been methodically building its MiMo AI division since the release of the MiMo-7B reasoning model in April 2025, following with the MiMo-VL vision-language series, MiMo-V2-Flash, the 1-trillion-parameter MiMo-V2-Pro in March 2026, and the V2.5 flagship family in April.
The effort is led by Fuli Luo, a veteran of DeepSeek's disruptive R1 project, who has characterized Xiaomi's frontier push as a "quiet ambush" — and backed it with a 100-trillion free token grant for builders announced alongside the V2.5 launch. The playbook is familiar from DeepSeek, Alibaba's Qwen, MiniMax, and Moonshot AI's Kimi series: release genuinely capable models and tooling under permissive licenses at a fraction of U.S. lab pricing, and convert the resulting developer mindshare into a durable ecosystem. By pairing an open-source agent harness with a free frontier-class model, Xiaomi is effectively eliminating both the licensing and the usage cost of entry — at least for now. What it means for enterprises and technical decision-makers For engineering leaders, MiMo Code is a low-risk, potentially high-value evaluation candidate: MIT-style licensing permits modification and commercial integration, the OpenCode lineage means the architecture is inspectable, and the bring-your-own-model support means it can be pointed at an internally approved endpoint rather than Xiaomi's cloud. The persistent memory system addresses a real and widely felt pain point in agentic development workflows — one that competitors are also racing to solve. The countervailing considerations: the "free for a limited time" model access is by definition temporary and routes code context through Xiaomi's servers, which will be a non-starter for organizations with strict data-residency or IP policies; the benchmark edge over Claude Code is self-reported; and a V0.1.0 release number signals exactly what it suggests about maturity. Teams subject to U.S. government procurement restrictions on Chinese technology vendors should also weigh that context before adopting.
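The article says MiMo Code's cross-session memory is backed by SQLite FTS5 full-text search but does not describe the schema. The sketch below shows only what an FTS5-backed note store can look like using Python's standard `sqlite3` module. The table layout, the layer labels and the function names are assumptions, not Xiaomi's design. Because FTS5 is a compile-time option of SQLite, the code falls back to a plain table with `LIKE` matching when it is unavailable.

```python
import sqlite3


def open_memory(path=":memory:"):
    """Create a searchable notes table; returns (connection, uses_fts5)."""
    conn = sqlite3.connect(path)
    try:
        # Only `body` is indexed for search; layer and session are stored as-is.
        conn.execute(
            "CREATE VIRTUAL TABLE notes "
            "USING fts5(layer UNINDEXED, session UNINDEXED, body)"
        )
        return conn, True
    except sqlite3.OperationalError:
        # This SQLite build lacks FTS5; use an ordinary table instead.
        conn.execute("CREATE TABLE notes (layer TEXT, session TEXT, body TEXT)")
        return conn, False


def remember(conn, layer, session, body):
    """Store one note under a memory layer and session id."""
    conn.execute(
        "INSERT INTO notes (layer, session, body) VALUES (?, ?, ?)",
        (layer, session, body),
    )


def recall(conn, uses_fts5, term, limit=5):
    """Return (layer, session, body) rows mentioning one plain search word."""
    if uses_fts5:
        sql = (
            "SELECT layer, session, body FROM notes "
            "WHERE notes MATCH ? ORDER BY rank LIMIT ?"
        )
        return conn.execute(sql, (term, limit)).fetchall()
    sql = "SELECT layer, session, body FROM notes WHERE body LIKE ? LIMIT ?"
    return conn.execute(sql, (f"%{term}%", limit)).fetchall()


conn, fts = open_memory()
remember(conn, "project", "s1", "Use pnpm, not npm, for installs")
remember(conn, "checkpoint", "s2", "Auth tests are flaky on CI; rerun once")
hits = recall(conn, fts, "pnpm")
print(hits)
```

The design choice worth noting is the one the MiMo team argues for: notes are written to a persistent, queryable store and pulled back by search when needed, instead of being carried in the context window and compacted away.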

Context windows are becoming a computational bottleneck. The longer an agent runs, the more tokens accumulate from retrieved documents, reasoning traces and conversation history, and the more memory and compute that growing context demands. Most existing solutions either degrade model accuracy, require the full context to load before compression begins, or produce memory savings that don't translate into real speedups in standard serving infrastructure. A research team from NYU, Columbia, Princeton, University of Maryland, Harvard and Lawrence Livermore National Laboratory published a paper this week that proposes a novel fix. The researchers introduce the concept of Latent Context Language Models, or LCLMs, a family of encoder-decoder compression models that compress input context before it reaches the decoder. The models are open-sourced on HuggingFace. Unlike KV cache compression methods — the dominant approach in the field, which still materialize the full KV cache before evicting entries — LCLMs compress the input token sequence before decoder prefill, so higher compression ratios directly reduce decoder-side compute and memory. The paper reports LCLMs at 16x compression produced output 8.8 times faster than KV cache baselines on the RULER long-context benchmark. "These ballooning contexts take up memory and compute, and they are becoming a computational bottleneck for LLMs," Micah Goldblum, co-lead advisor on the project and a researcher at Columbia University, told VentureBeat. "Our goal was to train language models end-to-end that can handle very long contexts efficiently and accurately. If you can make such a language model, everything becomes cheaper and faster." What LCLMs can do LCLMs let models process much longer contexts than would otherwise be practical, at a fraction of the memory and compute cost, without the accuracy degradation that makes most compression methods a poor tradeoff in production. 
At 4x compression, the paper reports accuracy of 91.76% on the RULER benchmark, compared to 94.41% with no compression at all. That is less than a 3-point drop for cutting context to a quarter of its original size. At 16x compression, where 93.75% of input tokens are removed, accuracy fell to 75.06%. Every KV cache method tested at the same compression ratio scored lower. The gains hold on shorter inputs too. On GSM8K math word problems, where the full prompt is compressed rather than just retrieved documents, LCLMs outscored every other method tested regardless of compression ratio.
How it was built
The architecture pairs a 0.6B encoder with a 4B decoder. The encoder compresses blocks of input tokens into shorter sequences of latent embeddings. The decoder processes those in place of the original tokens. Training ran across more than 350 billion tokens. The training recipe mixes three data types:
- Continual pre-training data with compressed and uncompressed spans interleaved throughout
- Supervised fine-tuning data covering reasoning and long-context tasks
- An auxiliary reconstruction task that pushes the encoder to retain fine-grained detail
The combination addresses a tradeoff that limited earlier compression work, where preserving reconstruction accuracy came at the cost of general task performance. An architecture search identified the optimal configuration. The paper found that scaling the decoder matters more than scaling the encoder.
Where it fits in an agentic stack
An LCLM is not an abstract research concept. It is designed to work with an existing stack. "You can simply swap out LCLMs for any existing LLM," Goldblum said. "Whenever you retrieve data such as documents and want to dump it into your model's context, simply run those documents through the LCLM's compressor first." He noted that in the research paper, the researchers demonstrated how to build agents that selectively decompress useful text.
"Think about this like a human skimming content before zooming in on relevant details," Goldblum said. Goldblum also cautioned that teams integrating the approach into existing agentic pipelines will need to tune their RAG systems accordingly. "We also haven't worked on online compression of reasoning traces," he said. "The naive approach of just occasionally compressing the trace while generating it might work, but that remains to be determined." What this means for enterprises Context windows are growing faster than inference infrastructure can keep up, and enterprises are already spending to fix it. VB Pulse Q1 2026 survey data from 100-plus employee organizations shows hybrid retrieval adoption intent tripling from 10.3% in January to 33.3% in March. Retrieval optimization overtook evaluation as the top investment priority by March, reaching 28.9% of qualified respondents. Three things stand out for teams evaluating production fit: Inference cost scales with context length. At 1 million tokens, uncompressed inference with standard KV cache methods runs out of memory on a single H200 GPU. The paper reports LCLMs at 16x compression remain within memory bounds at that context length. RAG pipeline integration requires tuning. Teams with existing RAG pipelines will need to validate compression behavior against their retrieval quality metrics before deploying at scale. Reasoning trace compression is unsolved. For agents running long reasoning chains, context growth from the trace is a separate problem from document retrieval. Goldblum acknowledged the gap directly: the naive approach of periodic trace compression might work but has not been tested. The models are available at huggingface.co/latent-context and the code at github.com/LeonLixyz/LCLM. 
"The biggest things our architectures do is give your model access to much larger contexts, but they also unlock multiscale approaches where your model can skim vast amounts of text or code super fast and then only zooms in and fully reads a small portion of the most useful text," Goldblum said.

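The compression figures reported for LCLMs follow from simple arithmetic, sketched below in Python. The helper names are hypothetical, and rounding a partial block up to one extra latent position is an assumption made here for illustration, not a detail taken from the paper.

```python
def fraction_removed(ratio: int) -> float:
    """Share of input positions eliminated at a given compression ratio."""
    return 1.0 - 1.0 / ratio


def positions_after_compression(n_tokens: int, ratio: int) -> int:
    """Latent positions the decoder would see (assumption: partial blocks round up)."""
    return -(-n_tokens // ratio)  # ceiling division on integers


if __name__ == "__main__":
    for ratio in (4, 16):
        removed = fraction_removed(ratio)
        remaining = positions_after_compression(1_000_000, ratio)
        print(f"{ratio}x: {removed:.2%} removed, {remaining} positions remain")
    # RULER accuracy gap quoted in the article: no compression versus 4x.
    print(round(94.41 - 91.76, 2))
```

At 16x, a one-million-token context shrinks to 62,500 latent positions, which is consistent with the 93.75% reduction the article cites.
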
Presented by F5 Enterprise AI teams have spent years solving for compute, securing GPU allocations, negotiating cloud capacity, and benchmarking training throughput. The assumption embedded in that work is that the path between storage and compute will keep up. In production, that assumption increasingly does not hold. Real traffic introduces latency spikes, network jitter, and node degradation that controlled benchmarks fail to capture, resulting in pipelines that perform well in the lab but stall in deployment. A growing response is AI data delivery, deploying an application delivery controller (ADC) or application delivery and security platform (ADSP) in front of storage as a resilient and secure control point. "Provisioning solves for capacity but not for delivery, and that is where the constraint now hides," says Hunter Smit, senior manager of product marketing at F5. "Enterprises buy enough GPUs and enough storage, then assume the path between them will keep up, but AI traffic is bursty, highly concurrent, and random in its reads in ways ordinary storage networking was never built to absorb." The production gap benchmarks don't show Standard benchmark methodology compounds the problem, says Paul Pindell, principal solutions architect for technology alliances at F5. "Benchmark testing is usually built to produce the best possible performance or security result, not the most realistic one," he says. "With S3, latency is a known factor in degrading performance, so meaningful testing has to introduce consistent latency into the path." Most benchmark environments never do that, which means the performance numbers enterprises rely on for infrastructure decisions are drawn from conditions that production systems will never replicate. To test this assumption, F5 and MinIO conducted throughput testing under degraded network conditions. "What stood out was how quickly S3 throughput falls off once you introduce latency," Pindell says. 
"Even modest latency takes a real bite out of it, and as latency climbs toward long-haul distances, the degradation gets severe." The testing also showed latency mattered far more than jitter as a driver of throughput loss, which inverted what the team had expected going in. The upshot for enterprise architects is that S3 object storage deployments cannot be designed around clean-room assumptions; they have to be engineered for the degraded network conditions they will actually face. The cost of fragile data paths "In AI infrastructure, people naturally focus on GPUs because they're the most visible and expensive resource," says Tanu Mutreja, senior director of product management at F5. "But in production environments, GPUs generate only as much value as the data path that feeds them." That path runs through storage, networking, databases, security, and orchestration layers, often stitched together from multiple vendors. Customers experience none of those seams; they experience the output of the whole system. When the data path degrades, the effects compound. GPU underutilization is the most immediate and visible symptom, but Mutreja pointed to a wider set of consequences: degraded inference performance, poor-quality AI outputs, higher egress costs from unnecessary data replication, and growing operational complexity. "At scale, data-path efficiency becomes a strategic business lever rather than technical optimization," she says. "When the data path is engineered well, GPUs remain productive, AI applications stay responsive and trustworthy, operations scale efficiently, and organizations maximize the return on their AI investments." AI workloads are structurally more exposed to these failures than traditional enterprise applications. Databases, ERP systems, and web services absorb transient storage delays through caching and buffering. AI workloads running across massively parallel GPU clusters have no equivalent protection. 
As Mutreja noted, even minor latency spikes or bandwidth bottlenecks can cascade across large GPU clusters, simultaneously hitting utilization, training efficiency, and the customer experience. Treating the storage edge as a control point For decades, storage and intelligence operated as sequential concerns in enterprise architecture: data was stored first, then analyzed downstream. Mutreja argued that this model no longer fits the demands of AI. "Competitive advantage is determined not only by the volume of data, but also by relevance, lineage, security, and performant delivery of data," she says. "Across the industry, from NVIDIA and AWS to enterprise storage providers, the movement is toward embedding intelligence directly into data infrastructure rather than stacking it on top." F5’s integration with MinIO instantiates this approach at the layer where storage and compute actually interact. As part of the F5 ADSP, BIG-IP sits in the data path, continuously monitoring the health of MinIO’s distributed storage nodes and directing requests only to those that remain available. The operational impact of that capability becomes clear when nodes degrade, which is expected in distributed storage clusters. Without intelligent routing, clients that land on an unhealthy node must retry and may land on another degraded node, dragging down overall performance. "F5 makes sure traffic only goes to healthy nodes, or even the least busy ones, so S3 client traffic is always processed in the most efficient way," Pindell says. Governance across distributed environments The challenge grows at scale, when AI pipelines stretch across multiple locations, clouds, or edge environments. "Once an AI pipeline crosses regions and clouds, the question stops being about performance and becomes about control," Smit says. "You are operating under different rules in every jurisdiction, and digital sovereignty is now a design constraint. 
Where your data is allowed to live, who is permitted to touch it, and which borders it cannot cross now shapes the architecture before anyone talks about speed." That pressure is driving a visible trend of enterprises repatriating AI workloads from public cloud onto infrastructure they own and govern directly. The architecture Smit described resolves this by decoupling applications from any single storage location and placing a unified control point between them that enforces consistent policy across all of them. "Sovereignty, resilience, and cost stop being trade-offs you manage one region at a time," he explains. "They become a capability you run as a system." Storage-to-compute path as a managed control point To solve for these issues, enterprise teams need to stop treating the storage-to-compute path as a direct connection and start treating it as a managed control point, Smit says. SecureIQLab's independent validation of F5 BIG-IP in storage deployments has confirmed the approach delivers resilience without surrendering throughput. "Insert a full-proxy ADC between the two, and the path becomes observable, programmable, and failure-aware, with health-based routing, quality of service, and security enforced inline," he explains. "That single move converts data delivery from an assumption into an engineered discipline, which is what keeps GPUs fed when conditions degrade." Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
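The routing behavior Pindell describes, sending requests only to healthy nodes and preferring the least busy among them, can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not F5 BIG-IP configuration or MinIO code: the `Node` class, its fields and the `pick_node` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    healthy: bool          # result of the most recent health check (assumed boolean)
    active_requests: int   # requests currently in flight on this node


def pick_node(nodes: List[Node]) -> Node:
    """Return the healthy node with the fewest in-flight requests."""
    candidates = [node for node in nodes if node.healthy]
    if not candidates:
        raise RuntimeError("no healthy storage nodes available")
    return min(candidates, key=lambda node: node.active_requests)


if __name__ == "__main__":
    cluster = [
        Node("storage-1", healthy=False, active_requests=0),
        Node("storage-2", healthy=True, active_requests=7),
        Node("storage-3", healthy=True, active_requests=3),
    ]
    print(pick_node(cluster).name)
```

The idle but unhealthy node is never chosen, which is the point of the design: a client is not left to discover a degraded node through failed requests and retries.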

GenAI image generators like Stable Diffusion do not draw a picture pixel by pixel from left to right. They start with noise and iteratively refine the entire image in parallel until it converges, in a process known as diffusion. For years, applying that same principle to text generation had remained out of reach at scale. Standard language models work like a typewriter: one token at a time, left to right, with no ability to revise a committed output. That pattern works in the cloud, where batch sizes keep GPUs saturated. For local inference or low-concurrency deployments, the GPU is idle most of the time. Google's DiffusionGemma, released this week, is an open source experimental model that applies diffusion to text generation at production scale. Built on the Gemma 4 backbone and released under the Apache 2.0 license, it is the first diffusion language model natively supported in the open source vLLM inference platform. It generates a 256-token block in parallel rather than sequentially, with every token position attending to every other. Google says DiffusionGemma generates text up to 4x faster than standard models on GPUs. At batch size 1 on a single Nvidia H100, the FP8 version reaches 1,008 tokens per second. On H200, it hits 1,288 — roughly six times a standard autoregressive baseline, according to vLLM benchmark results published today. Despite the speed gains, Google did not oversell the release. The company's launch post acknowledged directly that DiffusionGemma's overall output quality is lower than standard Gemma 4, adding "For applications that demand maximum quality, we recommend deploying standard Gemma 4." What DiffusionGemma does DiffusionGemma does not generate tokens in order. It starts with a block of 256 random placeholder tokens, effectively a blank canvas, and runs multiple refinement passes over the entire block at once. On each pass, it evaluates every position and locks in the ones it is most confident about. 
Uncertain positions get randomized and reconsidered on the next pass, with the model using what it resolved in the previous round to inform the next attempt. The block converges progressively until enough positions stabilize to anchor the rest. Two things follow from that architecture. Self-correction. An autoregressive model that commits to a wrong token is stuck with it, because subsequent tokens are already conditioned on the mistake. DiffusionGemma can identify low-confidence positions and re-evaluate them on the next pass. Bidirectional context. Every position attends to every other position in the block simultaneously, including tokens that appear later in the sequence. That makes the model structurally better suited to constrained generation tasks where left-to-right generation fails. Google demonstrated both properties with a fine-tuned Sudoku solver. The base model solved zero puzzles. After fine-tuning on a Sudoku dataset, it reached an 80% success rate and converged in 12 denoising steps rather than 48. The efficiency gain came directly from the model's ability to self-correct and stop early. How it was built DiffusionGemma runs as a 26B Mixture of Experts model that activates only 3.8B parameters during inference. Quantized, it fits within 18GB VRAM on consumer hardware including the Nvidia RTX 4090 and 5090. Google and NVIDIA also optimized for enterprise Hopper and Blackwell servers using NVFP4 kernels. The vLLM integration required new work because DiffusionGemma does not fit the standard serving model. A typical vLLM batch applies the same attention type to every request. DiffusionGemma requests alternate between causal and bidirectional attention as they cycle through prompt reading, canvas refinement and block commit. The team built per-request attention switching into both the Triton and FlashAttention 4 backends and reused the existing speculative decoding path for the refinement loop. 
The new ModelState interface the team built for this integration is designed to support additional diffusion models in vLLM as they emerge. Where the speed wins and where it does not DiffusionGemma's speed advantage is real but conditional. Where it applies depends entirely on deployment context. The numbers. At batch size 1 on a single H100, vLLM's published benchmarks put the FP8 model at roughly five times a standard autoregressive baseline. On H200, roughly six times. Those peak figures reflect optimal conditions: single user, dedicated hardware, FP8 quantization. Where it wins. Local inference, single-user applications and low-concurrency serving. In those conditions the GPU has spare compute and memory bandwidth is the bottleneck. DiffusionGemma's parallel block generation fills that gap. Where it does not. High-throughput cloud serving. When a server is batching hundreds of concurrent requests, autoregressive models already saturate available compute and DiffusionGemma's parallel decoding provides diminishing returns. The quality ceiling. Guilherme O'Tina, an AI researcher, put a finer point on it on X. "Local artifacts vs hallucinations are different problems and that decides where this actually wins," O'Tina wrote. How it compares Diffusion language models are not new. Researchers have built them at smaller scales for several years, and Inception Labs' Mercury Coder applied the approach commercially to coding tasks in 2025. What DiffusionGemma adds is scale — a 26B MoE backbone, native vLLM serving and a general-purpose instruction-tuned model rather than a domain-specific one. The more useful comparison for engineers evaluating this against existing inference tooling is speculative decoding, and the distinction matters. Speculative decoding keeps a standard autoregressive target model and uses a smaller draft model to guess several tokens ahead. The target model verifies them in one pass. 
When the acceptance step is implemented correctly, the output distribution is identical to what the target model would produce on its own. The architecture is unchanged. Andrew Kuncevich, an ML and AI researcher focused on production AI systems, put it directly on X. "DiffusionGemma is different. It does not just guess future tokens. It creates a noisy 256-token canvas and repeatedly denoises the whole block in parallel. So it's not just a decoding trick - it's a different generation paradigm," Kuncevich wrote. Compared to standard Gemma 4, the trade is speed for quality. Google's benchmark data shows DiffusionGemma below standard Gemma 4 on general output quality metrics, with the gap varying by task. On structured constrained tasks, including code infilling, template generation and problems requiring bidirectional constraint propagation, the architecture has a structural advantage that fine-tuning can surface, as the Sudoku result demonstrates. On open-ended generation, standard Gemma 4 remains the stronger option. What this means for enterprises DiffusionGemma is served through a standard vLLM OpenAI-compatible endpoint with no diffusion-specific pipeline changes required. This is not a general-purpose model upgrade. For teams running local or low-concurrency inference, the architecture choice just expanded. Until now, cutting generation latency on dedicated GPU hardware meant using a smaller model and accepting the quality trade-off. DiffusionGemma offers a third path at the same parameter footprint, on consumer hardware, with same-day vLLM support. For constrained generation workloads, bidirectional attention is worth evaluating. Code infilling, structured data generation and tasks where correct output depends on context not yet generated are where this architecture has a structural edge. The ModelState interface built for this integration is designed to generalize as additional diffusion models emerge. The quality trade-off is real and Google acknowledges it.
For teams running local inference on dedicated GPU hardware, this is worth testing.
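The refinement loop described earlier in this article (propose every open position in parallel, lock in the confident ones, revisit the rest with the newly resolved context) can be illustrated with a toy Python sketch. Nothing here is DiffusionGemma or vLLM code. The `toy_propose` scorer, its fixed confidence table, the neighbor bonus and the threshold are all hypothetical stand-ins for a real model, and uncertain positions are simply left as placeholders rather than re-randomized.

```python
MASK = None  # placeholder for an unresolved position

# Hypothetical stand-in for model knowledge: one token and a base confidence (0-100) per position.
TARGET = ["the", "cat", "sat", "on", "the", "mat"]
BASE_CONF = [90, 40, 60, 90, 50, 50]
NEIGHBOUR_BONUS = 30  # assumed boost for each already-resolved neighbor


def toy_propose(block, pos):
    """Return (token, confidence) for one position, using context on both sides."""
    neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(block)]
    resolved = sum(block[p] is not MASK for p in neighbours)
    return TARGET[pos], min(100, BASE_CONF[pos] + NEIGHBOUR_BONUS * resolved)


def refine(block_len, threshold=80, max_steps=8):
    """Iteratively resolve a block; returns (block, refinement passes used)."""
    if block_len == 0:
        return [], 0
    block = [MASK] * block_len
    for step in range(1, max_steps + 1):
        # Score every open position against the same snapshot of the block (parallel pass).
        proposals = {p: toy_propose(block, p)
                     for p in range(block_len) if block[p] is MASK}
        accepted = [p for p, (_, conf) in proposals.items() if conf >= threshold]
        if not accepted:
            # Assumed fallback: commit the single most confident position to guarantee progress.
            accepted = [max(proposals, key=lambda p: proposals[p][1])]
        for p in accepted:
            block[p] = proposals[p][0]
        if MASK not in block:
            return block, step
    return block, max_steps


if __name__ == "__main__":
    print(refine(len(TARGET)))
```

In this toy, positions that start with low confidence are resolved only after their neighbors lock in, which mirrors the article's point that later passes use what earlier passes settled, including context to the right of a position.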

Presented by Capital One Enterprises aren’t struggling to experiment with AI; they’re struggling to make it work in the real world. Moving from promising prototypes to reliable, production-scale systems is where most efforts stall. In my role within Capital One’s AI Foundations organization, I’ve seen firsthand that successful AI implementation isn’t just about adopting the latest models or tools. It requires a disciplined R&D approach that connects foundational research to real-world systems, and holds ideas accountable as they move from concept to production. That’s harder than it sounds. AI capabilities are evolving quickly, but enterprise environments can be complex, fragmented, and risk-minded. The question isn’t just what’s possible, but what actually works — for a specific workflow, user, or decision — with today’s technology and constraints. What follows reflects how organizations can turn AI ambition into production reality through a more deliberate approach to research, evaluation, and deployment. Bridging foundational and applied research Delivering impactful AI requires closing the gap between cutting-edge research and practical, real-world use cases. When research exists in an academic vacuum, untethered from operational reality, models that may perform well in an offline environment often fall short when faced with real-world latency requirements and the complexity of live production data. Without a tight feedback loop, it’s easy to lose sight of what actually moves the needle for the end user. Our AI teams are intentionally designed to span the spectrum from foundational research to highly applied problem-solving, addressing these friction points before they stall a project. This integrated model brings research and application together under one umbrella, creating space to explore underlying technology while staying grounded in actual business and associate needs. 
When foundational research and applied development are connected by design, you can accelerate learning, avoid dead ends, and account for real-world constraints early on. At Capital One, this approach has helped us to tackle challenges that are core to financial services, including improving fraud detection, enhancing digital user experiences, and improving customer-first technologies leveraging proprietary AI solutions. For example, our research into combining multi-agent architectures goes beyond simple LLM reasoning; it aims to enable specialized AI agents to coordinate across distinct tasks, such as researching customer context and preparing documentation simultaneously. This research supported the launch of Chat Concierge, a car-buying solution that mimics human reasoning to not simply provide information, but take action on customers’ behalf based on their requests. We’re also breaking ground in delivering state-of-the-art solutions in agent servicing, AI personalization, and more. By keeping research tethered to the use case, we can accelerate state-of-the-art breakthroughs that actually scale in the real world. Moving AI from concept to production Not every AI idea should go straight to production. Rigorous evaluation from proof of concept to pilot to production is essential to determining what’s truly worth scaling, but only if those stages are treated as honest hurdles. Some considerations include: A proof of concept must be functional, not just theoretical. It shouldn’t be a “here’s what we could do” slide deck. It must be a machine actually doing something measurable. Even at this stage, you need an objective signal that the work is worth continuing. A negative pilot result isn’t a failure. If pilots always “succeed” by definition, then they aren’t functioning as decision points—they’re just a slow-motion commitment to production. A pilot should expand scope and realism, providing valuable data on whether a solution actually helps a human do real work. 
Production is a team sport. Solving the core model or algorithmic problem is only part of the job. Moving to production requires a cross-functional effort involving software engineering, science, product and design, technical program management, operations, and other disciplines across an enterprise. The technical breakthrough is necessary, but it’s not the end of the work. Throughout this journey, measurement is an important input. At Capital One, the ultimate ROI is a happy customer, so we focus on a number of key AI performance indicators like accuracy, latency, and more to ensure we’re meeting the moment for our customers. If you can’t tell whether you’re improving, then you won’t. Prioritizing accuracy over optics is what enables continuous improvement and progress. Enabling continuous learning and responsible innovation Sustainable AI innovation depends as much on culture as it does on technology. Because research involves exploring the unknown, uncertainty is normal. A healthy culture recognizes that reality and creates space for informed risk-taking, paired with accountability. Organizations must encourage course-correction. If acknowledging “this isn’t working” is treated as a disaster, teams will learn to hide problems rather than solve them. But if teams are encouraged to evaluate honestly, pivot when needed, and learn from false starts, then the organization can move faster and safer at the same time. That means treating pilots as real decision points: stopping, reshaping, or narrowing efforts based on what the data shows, rather than pushing them forward by default. At Capital One, we enable teams to try ambitious things, learn quickly, and build an ecosystem that works to ensure AI is useful, reliable, and safe. Final thoughts Building impactful AI isn’t about chasing every new breakthrough. It’s about thoughtfully guiding ideas from research to reality through evaluation, collaboration, and a culture that embraces learning.
As AI continues to evolve, leaders should invest not only in tools, but also in R&D processes and cultural foundations that allow innovation to scale responsibly. When you bridge research and application, prioritize continuous evaluation and measurement, and foster environments where teams can learn and adapt, you give AI its best chance to deliver lasting impact, at enterprise scale, in the real world. Liz Boschee is VP, AI Foundations at Capital One. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Researchers from the University of California, Berkeley's Center for Responsible, Decentralized Intelligence (RDI), alongside an advisory committee of over 300 domain experts, have launched Agents’ Last Exam (ALE)—a grueling new benchmark built to measure whether artificial intelligence can actually execute economically valuable, long-horizon professional workflows. In a shocking upset, OpenAI’s GPT-5.5 from April, operating through the Codex harness, secured the absolute top spot on the new ALE Leaderboard with a 24.0% pass rate, beating Anthropic's highly anticipated, brand new Mythos-class Claude Fable 5 model released just yesterday, which came in third with a score of 22.0%. Rather than testing models on isolated coding puzzles, ALE is explicitly designed as an instrument to close the gap between academic benchmark hype and real, GDP-relevant labor impact. And right now, the data proves the most advanced models in the world are fundamentally failing the exam. Ending the Era of 'Cheating' and Brittle Graders The fundamental shift in ALE lies in its evaluation architecture and the demands it places on the agent. Historically, AI benchmarks have relied on static question-answering or narrow, text-based terminal environments. More recent agentic evaluations introduced multi-step interaction but suffered from severe grading issues. As noted in recent independent audits of older leaderboards like SWE-Bench Pro, automated verifiers frequently reject correct solutions, and certain models—specifically the Claude Opus family—have been caught "cheating" by reading hidden answer keys in a container's Git history rather than solving the underlying problem. ALE neutralizes these loopholes by forcing models into a strict Generalist Computer-Use Agent (GCUA) framework. To pass, an agent cannot merely execute terminal commands. 
The benchmark maps capability across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). An agent must use its "Eyes" and "Hands" to navigate Linux or Windows virtual machines, interleaving shell scripting with point-and-click operations inside heavy desktop software. Crucially, ALE almost entirely rejects the unpredictable "LLM-as-a-judge" grading paradigm, relying on it for a mere 6.8% of its workflows. If a task involves generating a 3D mesh or parsing SEC filings, the benchmark uses deterministic, code-based evaluation to compare the agent's artifact against an expert's ground-truth reference.
Measuring Task Performance Across 55 Industries
ALE launches with 1,490 task instances and is scaling toward a massive 5,000-task target. What makes the product remarkable is its authenticity. The tasks are strictly anchored in the U.S. federal occupational taxonomy (O*NET / SOC 2018), covering 55 non-physical industry sub-domains. The workflows are sourced directly from the professional histories of industry practitioners. Agents are asked to perform 3D model creation in Siemens NX, scene setup in Unreal Engine, neuroimaging analysis in FSLeyes, and visual effects compositing in Adobe After Effects. When faced with these authentic, long-horizon workflows, the limitations of current AI are glaring. ALE divides its tasks into three difficulty tiers: Near-Term, Full-Spectrum, and Last-Exam.
Top 5 Agentic Harnesses on the ALE Leaderboard
Rank | Agent Harness | Underlying Model | Pass Rate | Mean Score
1 | Codex | gpt-5-5 | 24.0% | 42.8%
2 | Ale Claw | gpt-5-5 | 23.0% | 45.8%
3 | Claude Code | claude-fable-5 | 22.0% | 40.5%
4 | OpenClaw | gpt-5-5 | 21.1% | 41.0%
5 | Cursor CLI | composer-2-5 | 20.4% | 38.5%
The victory of GPT-5.5 aligns with recent third-party analysis suggesting that OpenAI's models are currently superior at strictly adhering to multi-part, complex prompts.
Conversely, users report Anthropic's Claude architecture can sometimes be "forgetful" with multi-part instructions, abandoning required steps mid-workflow — a fatal flaw in ALE's rigorous pipeline. And while hitting a 24.0% pass rate is enough to claim the crown, the absolute performance ceiling remains remarkably low. On the hardest "Last-Exam" tier — representing the frontier of professional difficulty — most configurations, including Anthropic's older Claude Opus 4.8 and Google's Gemini CLI, record a devastating 0.0% pass rate. Solving Benchmark Contamination A core vulnerability in modern AI evaluation is "benchmark contamination"—the phenomenon where test questions inevitably leak into the massive data lakes used to train next-generation models. Once a model memorizes the benchmark, the evaluation becomes entirely useless. ALE solves this through a dual-use deployment strategy. The project operates as an open-source research initiative, but it closely guards its evaluation data. Only about 10% of the dataset (roughly 150 tasks) is released publicly on platforms like GitHub and Hugging Face. The remaining 1,300+ tasks are kept strictly private. For developers and enterprise evaluators, this means ALE functions as a "living benchmark". Private tasks are systematically rotated into the public pool over time, while retired public tasks are swapped out. This rolling release ensures that the evaluation surface remains uncontaminated across successive model generations, giving enterprise buyers confidence that an agent's high score is earned, not memorized. Additionally, ALE provides transparency by tracking both "Full" and "Unlicensed" scores. Because real professional work often requires paid, proprietary software, the "Full" leaderboard incorporates tasks that rely on commercial CAD tools, paid APIs, or licensed datasets. 
The "Unlicensed" tier drops these license-gated tasks to provide a clean, like-for-like comparison using only freely available tools, ensuring models aren't simply rewarded for having access to paid enterprise software. Bottom Line: ALE Shows Even the Highest-Performing Models and Harnesses Have Room for Improvement For developers frustrated by the gap between marketing claims and actual production performance, ALE's brutal grading curve is highly validating. Zengyi Qin, an MIT PhD researcher and data contributor to the project, took to X to announce the launch, sharing images of the paper and the staggering 100+ institution contributor list. "Introducing Agents’ Last Exam (ALE)," Qin wrote. "Built by 300+ domain experts from 100+ institutions. Covering 55 industry domains. Claude Opus 4.8 has 0.0% pass rate on the hardest subset. Glad to have contributed to this benchmark". In a follow-up post highlighting the Hugging Face ArXiv paper link, Qin added: "Very solid work from project leads @YiyouSun @Xinyang_Han_ @dawnsongtweets and @BerkeleyRDI". As businesses deploy billions in capital betting on AI agents, they desperately need a compass that points true north. If an agent can eventually conquer the gauntlet of Agents' Last Exam, it won't just be passing a test—it will be proving it is ready to join the workforce. Until then, the sobering pass rates on the leaderboard serve as a necessary reality check for the entire AI ecosystem.
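The split between "Full" and "Unlicensed" scores, and the difference between pass rate and mean score, can be illustrated with a small sketch. This is not ALE's grading code: the `TaskResult` fields, the `PASS_THRESHOLD` value, and the sample scores are all hypothetical, and the article does not say how ALE decides that a task has passed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    score: float         # graded score in [0, 1] from a deterministic checker
    needs_license: bool  # True if the task depends on paid software, APIs, or data


# Assumption for illustration only: a task counts as passed at a perfect score.
PASS_THRESHOLD = 1.0


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Return pass rate and mean score for one set of task results."""
    if not results:
        return {"pass_rate": 0.0, "mean_score": 0.0}
    passed = sum(1 for r in results if r.score >= PASS_THRESHOLD)
    total = sum(r.score for r in results)
    return {"pass_rate": passed / len(results), "mean_score": total / len(results)}


def leaderboard_views(results: List[TaskResult]) -> Dict[str, Dict[str, float]]:
    """Compute the 'full' view and the 'unlicensed' view (license-gated tasks dropped)."""
    unlicensed = [r for r in results if not r.needs_license]
    return {"full": summarize(results), "unlicensed": summarize(unlicensed)}


# Hypothetical results for a single agent.
results = [
    TaskResult("a", 1.0, needs_license=False),
    TaskResult("b", 0.5, needs_license=False),
    TaskResult("c", 0.0, needs_license=True),
    TaskResult("d", 0.5, needs_license=True),
]
views = leaderboard_views(results)
```

Because partial credit raises the mean score without raising the pass rate, two agents can rank differently on the two columns, as Codex and Ale Claw do in the leaderboard table.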

Training a foundation LLM from scratch costs millions and requires internet-scale data — which is why most enterprises don't bother. Sapient thinks it has a cheaper path. To overcome this brute-force scaling dogma, researchers at Sapient developed HRM-Text, which replaces standard Transformers with a highly sample-efficient Hierarchical Recurrent Model (HRM), an architecture they first introduced last year. HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. Instead of brute-force autoregressive prediction on raw text, HRM-Text trains exclusively on instruction-response pairs. This is close to real-world enterprise settings, where users usually expect a targeted answer to a specific task. The researchers were able to train a 1B-parameter HRM-Text from scratch at a fraction of the cost and tokens of normal LLMs. Their model achieved performance competitive with much larger open models on key industry benchmarks. For real-world AI applications, this means foundational pretraining is no longer restricted to highly resourced institutions. With HRM-Text, organizations can affordably pretrain their own highly capable reasoning models from scratch and pair them with external knowledge stores. The training bottleneck When we train an LLM, we don't actually care if it has memorized the exact sequence of words in a random 2014 Reddit thread. What we want is for the model to develop a deep, underlying understanding of human language, logic, facts, and reasoning. The current approach is brute force: scrape the internet, run next-token prediction trillions of times, and assume the model has developed a working internal model of the world. Basically, this means that we waste millions of dollars of computing power forcing models to memorize everything collected from the internet, just so they can indirectly learn how to think. 
For example, standard decoder-only models spend valuable compute assigning loss to reconstruct the prompt itself, even though the user's prompt is already known and provided at inference time. Instead of simply viewing this as a computational hurdle, the industry must recognize it as a severe business limitation. In comments provided to VentureBeat, Guan Wang, CEO of Sapient Intelligence, framed this as an issue of the "economics of iteration." "Enterprises today face three compounding problems: training is expensive, infrastructure is heavy, and experimentation cycles are too slow," Wang said. "The industry’s scaling addiction says: 'When the model fails, make it bigger. Add more data. Add more GPUs.' That has worked, but it is reaching a point of diminishing returns. More scale often means more memorization, more latency, more infrastructure, and more vendor dependency. It does not necessarily give an enterprise a better reasoning engine." This architectural and computational inefficiency is exactly why fine-tuning existing dense transformers isn't always the silver bullet for enterprises. Fine-tuning to preserve a model's general capabilities often requires mixing substantial general-purpose data into the process, making it computationally heavy and difficult to control. "Imagine a hedge fund, insurer, or bank that has highly proprietary data: internal research notes, transaction logic, compliance rules, analyst memos, risk models, portfolio constraints," Wang said. "They may not want to send that data to an external frontier model, and they may not need a giant general-purpose model that memorized the internet. What they need is a compact reasoning core that can learn their task structure, reason across rules and numbers, and run in a controlled environment." 
Because HRM-Text focuses its computation strictly on task completion and latent reasoning, it allows enterprises to start with a smaller, smarter model and adapt it to a proprietary domain with far less infrastructure.

Rethinking architectures with HRM-Text

HRM, which was introduced in 2025, represents a fundamental departure from traditional Transformer models. To build a more sample-efficient engine, HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. The fast L-module performs local iterative refinement, while the slow H-module maintains stable semantic context across cycles. Processing consists of two high-level cycles, where each cycle executes three fast L-module updates followed by a single slow H-module update. Standard parameter-shared recurrent architectures (like Samsung's TRM) can sometimes handle small logic puzzles, but the Sapient researchers found they become highly unstable when scaled to 1 billion parameters for language tasks. The separation between HRM's slow H-module and fast L-module is mathematically necessary, not just an aesthetic choice. As Wang said: "For logic grids, you can sometimes get away with a tiny recursive mechanism because the world is clean and bounded. Language is not like that. Language needs both fast local refinement and slow semantic stability." While the original HRM proved highly effective for controlled, symbolic reasoning problems, the researchers hit a wall when applying it to the massive, open-ended complexities of generalized language modeling. The same loops that make HRM an efficient thinker also make it volatile to train on the diverse chaos of human language: running recurrent loops over language produces exploding or vanishing gradients. To prevent this instability, the researchers introduced two key architectural innovations in HRM-Text.
First, they developed MagicNorm, a specialized normalization technique designed specifically to keep the internal signals stable, no matter how many times the model loops its thought process. Second, they designed a warm-up method to stabilize training. During early training, the model is only evaluated on short, shallow reasoning loops. As training progresses, the system warms up, gradually giving the model deeper and longer reasoning sequences. They also switched the training objective from next-token prediction to task completion, where the model is rewarded only on the full response as opposed to individual tokens it generates. To achieve this goal, they changed the training data of HRM-Text from raw text to instruction-response pairs only. HRM-Text in action The researchers built a highly compact 1-billion-parameter HRM-Text model. Instead of using the standard multi-stage pipeline that requires churning through trillions of words of raw internet text, they trained it from scratch on a tightly curated dataset of just 40 billion tokens. The training data consisted entirely of instruction-response pairs across general instructions, math, symbolic logic, textbook exercises, and rewritten knowledge. They trained the model using the task-completion objective. To force the model to rely on its internal hierarchical architecture rather than copying step-by-step logic, they explicitly stripped out "thinking" tokens from the training data. The model was evaluated across a diverse suite of standard foundational AI benchmarks, heavily indexing on knowledge, reasoning, logic, math, and comprehension. The researchers tested HRM-Text against both small models and highly-resourced open-weight and fully open models. The results show a significant shift in the compute-to-performance frontier. The 1B-parameter HRM-Text achieved 60.7% on MMLU, 84.5% on GSM8K, and 56.2% on MATH. 
This performance is highly competitive with (and in several cases surpasses) the 2B to 7B parameter foundation models it was tested against. The most important takeaway for the enterprise audience lies in the efficiency statistics and practical implications. Pretraining a foundation model from scratch is typically a multi-million dollar endeavor reserved for tech giants. HRM-Text was trained in just 1.9 days on a cluster of 16 GPUs. The total estimated compute cost was roughly $1,500. It achieved its competitive scores using 100 to 900 times fewer training tokens and 96 to 432 times less estimated compute than models like Qwen, Gemma, and Llama. Another important point is the decoupling of reasoning from knowledge memorization. From a practical standpoint, HRM-Text's success on reasoning-heavy tasks despite its tiny 40B-token training diet proves that a model does not need to memorize the entire internet to become a smart reasoning engine. For enterprise applications, this behavior is a feature, not a bug. The researchers suggest a future where businesses deploy highly compact, incredibly cheap recurrent models that act as the "reasoning core" specialized for business logic. Instead of forcing the model to memorize company databases during pretraining, the model acts as the reasoning engine, relying on external retrieval systems to fetch factual knowledge. Critics have pointed out that training on instruction-response pairs makes comparisons against models trained on raw text an "apples-to-oranges" scenario. Wang pushes back on this framing, pointing out that every serious modern LLM sees instruction-response data during training or alignment. "So the comparison is not apples-to-oranges. It is closer to apple cores-and-apples. We started directly from the core task format because that is how people actually use models: they give an instruction and expect a useful response," he said. 
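A quick back-of-the-envelope check puts the efficiency figures above in more familiar units. The inputs are the numbers reported in the article. The derived values (GPU-hours, implied cost per GPU-hour, implied token counts for the comparison models) are arithmetic consequences rather than figures stated by Sapient, and the hourly rate assumes the $1,500 estimate covers GPU time only.

```python
# Reported inputs from the article.
GPUS = 16
TRAINING_DAYS = 1.9
ESTIMATED_COST_USD = 1_500
TRAINING_TOKENS = 40_000_000_000  # 40B tokens

# Derived: total GPU-hours and the hourly rate the cost estimate implies.
gpu_hours = GPUS * TRAINING_DAYS * 24
implied_usd_per_gpu_hour = ESTIMATED_COST_USD / gpu_hours

# Derived: "100 to 900 times fewer training tokens" implies the comparison
# models were trained on roughly this many tokens.
implied_baseline_tokens_low = TRAINING_TOKENS * 100
implied_baseline_tokens_high = TRAINING_TOKENS * 900

print(f"GPU-hours: {gpu_hours:.1f}")
print(f"Implied cost per GPU-hour: ${implied_usd_per_gpu_hour:.2f}")
print(f"Implied baseline tokens: {implied_baseline_tokens_low:.1e} to "
      f"{implied_baseline_tokens_high:.1e}")
```

The run works out to roughly 730 GPU-hours at about $2 per GPU-hour, and the token ratios imply comparison models trained on about 4 trillion to 36 trillion tokens.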
The researchers also ran rigorous contamination tests to ensure the model wasn't simply memorizing benchmark answers. On DROP, the one benchmark showing a marginal contamination signal under a specific setting, HRM-Text still scored an impressive 81.1% on a strictly clean, 0% contamination subset. Ultimately, Wang argues that for enterprises, "the right evaluation is not trivia recall. It is a workflow evaluation... Give HRM-Text a task like: multi-step financial reasoning, compliance logic, scientific workflow automation, structured extraction followed by reasoning." Practical implementation and the future of enterprise AI While the benchmark scores and cost efficiencies are striking, Sapient is clear about the model's current boundaries. The initial release is best viewed as a proof-of-concept, akin to early GPT releases, designed to showcase the architecture's unique advantages. "Honestly, HRM-Text is not yet a plug-and-play ChatGPT replacement," Wang said. "It is a compact foundation language reasoning model. For an enterprise engineering team, the operational work is mainly around templates, mode selection, attention masking, and alignment." For AI engineering teams looking to experiment, getting started requires some specific, but standard, text-generation discipline. The model lists native support in the Transformers library (requiring transformers >= 5.9.0), and usage paths for vLLM and SGLang are actively being developed. The primary engineering task involves managing the PrefixLM design: production multi-turn chat applications will require careful KV-cache logic to ensure user prompts receive full bidirectional attention while the assistant's outputs remain causal. "When the cost of training a capable reasoning model drops to around $1,500, AI stops being only an infrastructure question and becomes a strategy question," Wang said. 
"A Fortune 500 company no longer has to ask, ‘Can we afford a foundation model?’ It would ask, ‘What should our model know about our business, and what kind of reasoning should it be optimized for?’"

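The PrefixLM attention pattern described above, where the user prompt gets full bidirectional attention and the assistant's output stays causal, can be sketched as a boolean mask. This is a minimal single-turn illustration in plain Python, not HRM-Text's implementation. The function names are hypothetical, and the loss-weight helper reflects an assumption that training loss is applied to response tokens only, in line with the article's point about not spending compute on reconstructing the prompt.

```python
from typing import List


def prefix_lm_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """Build a PrefixLM mask; mask[i][j] is True when position i may attend to j.

    Positions [0, prefix_len) are the prompt and see each other bidirectionally.
    Positions [prefix_len, total_len) are the response and attend causally.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            both_in_prefix = i < prefix_len and j < prefix_len
            causal = j <= i
            row.append(both_in_prefix or causal)
        mask.append(row)
    return mask


def response_loss_weights(prefix_len: int, total_len: int) -> List[float]:
    """Assumed loss weighting: 0.0 on prompt tokens, 1.0 on response tokens."""
    return [0.0 if i < prefix_len else 1.0 for i in range(total_len)]


# Example: a 3-token prompt followed by a 2-token response.
mask = prefix_lm_mask(prefix_len=3, total_len=5)
weights = response_loss_weights(prefix_len=3, total_len=5)
```

In the example, the first prompt token can attend to the third prompt token (bidirectional) but not to any response token, while each response token sees the whole prompt plus earlier response tokens. Multi-turn chat is harder because each new user turn needs its own bidirectional block, which is where the KV-cache handling mentioned above comes in.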
In a sweeping new essay titled "Policy on the AI Exponential," Anthropic co-founder and CEO Dario Amodei publicly calls for new government regulations governing the release of powerful AI models. He specifically compares the AI industry to commercial aviation, which follows regulations enforced by the U.S. Federal Aviation Administration (FAA), and argues that such rules are necessary to maintain public safety as AI capabilities and potential misuses grow. Alongside the essay, Anthropic released two comprehensive policy roadmaps: an Advanced AI Framework targeting catastrophic model risks, and an Economic Policy Framework addressing AI-driven labor displacement, backed by $350 million in new funding. The timing couldn't be more important: yesterday, Anthropic released its most powerful general-release model ever, Claude Fable 5, and a more gated, updated version of the base Claude Mythos model, now known as Claude Mythos 5, which offers advanced defensive and offensive cyber capabilities. As Amodei noted on X following the release: “Anthropic has long advocated for transparency requirements for frontier AI, because the risks weren't yet clear enough to regulate precisely. That is no longer sufficient.” For technical decision-makers, CIOs, and enterprise architects, the essay is not just a political statement; it is a preview of the operational, regulatory, and workforce constraints that will govern the next generation of enterprise tech. Here are the top three takeaways enterprise leaders need to extract from Anthropic’s latest policy drop.

1. Frontier Models May Face "FAA-Style" Deployment Holds

For the past three years, enterprises have built products on the assumption that AI API capabilities will only move in one direction: faster and more powerful. Anthropic’s Advanced AI Framework introduces a new variable: regulatory embargoes.
Amodei explicitly compares the necessary AI regulatory regime to the Federal Aviation Administration (FAA), stating: “Frontier AI models, like airplanes, should be required to go through technical testing and auditing, and their release should be blocked or reversed as a threat to public safety if they do not meet high standards of safety”. The company is proposing that models trained using more than 10^25 floating-point operations (FLOPs)—or developed by companies with over $500 million in AI revenue or $1 billion in AI R&D—must undergo mandatory third-party testing. If these models present severe biological, cybersecurity, or autonomy risks, the government would have the legal authority to block, delay or deter their deployment. The Enterprise Implication: If your company licenses foundation models for core infrastructure, you must plan for supply chain volatility. A highly anticipated model update from an AI vendor could be delayed indefinitely by regulators, or an existing model could be revoked if post-release testing reveals autonomous threats. Tech leaders must design multi-model architectures that avoid locking into a single vendor, ensuring business continuity if a provider’s flagship model is blocked by a federal agency. 2. Cybersecurity Around AI Is Now Critical Infrastructure Anthropic’s push for regulation is heavily motivated by the recent escalation in AI-driven cybersecurity threats. Amodei explicitly references Anthropic's own Claude Mythos Preview, noting that its ability to discover high-severity vulnerabilities across major operating systems "scrambled" the global cybersecurity landscape. Under Anthropic's proposed framework, securing the AI development environment is paramount. Frontier developers would be required to protect their model weights from both external cyberattackers and insider threats. 
Furthermore, companies must develop channels to report "model distillation attacks"—where competitors or bad actors use a primary model to train a cheaper, unaligned clone. The Enterprise Implication: The stakes for enterprise security are twofold. First, defensive AI capabilities will become a prerequisite; as Amodei warns, attackers using frontier models to probe for vulnerabilities will outpace traditional, human-led defense. Second, enterprises that fine-tune open-weight models or host proprietary instances locally will likely face intense new compliance and infosec burdens. Treating model weights as highly classified corporate secrets will become the new industry standard. 3. Plan for Structural Labor Displacement, Not Just Efficiency Perhaps the most sobering aspect of the announcement is Anthropic’s Economic Policy Framework. The company is publicly acknowledging that if AI achieves its predicted capabilities, it will act as a "general substitute for labor" rather than just a productivity tool. Amodei frames this bluntly: “The key challenge in such a world won’t be incentivizing growth, but finding a way for everyone to share in the benefits”. To back this up, Anthropic is committing $350 million to address economic disruption: $200 million for an Economic Futures Research Fund to pilot public policy solutions, and $150 million for a national fellowship program. The framework actively plans for scenarios where AI drives unemployment to 5%, 10%, or even unprecedented levels, advocating for policies like wage insurance, universal basic income, and sovereign wealth models. The Enterprise Implication: For tech leaders and HR departments, the AI transition is about to become a labor relations minefield. The economic framework notes that companies "can choose to retrain and redeploy rather than reduce headcount," but admits voluntary action is not a substitute for government response. 
Enterprises looking to integrate AI heavily should begin implementing workforce transition plans immediately. Leaders who view AI solely as a mechanism for fast cost-cutting through layoffs may soon find themselves crossways with new "pro-employment incentives" or retention tax policies proposed by advocates to slow job displacement. What Enterprises Should Do Now Anthropic’s announcement marks a turning point in the AI industry's dialogue with Washington and the global market. As Amodei posted: "Many of these policy ideas have common-sense appeal across the political spectrum, and the sooner we act on them, the sooner everyone shares in AI's benefits". For the enterprise, the message is clear: the era of "move fast and break things" in generative AI is closing. The era of rigorous compliance, systemic security, and complex workforce transitions is fast approaching. To prepare for this shift, enterprises must first decouple their AI strategies from single-vendor dependencies. If a flagship model is suddenly blocked or recalled under the proposed FAA-style regulatory powers, organizations reliant on that specific API will face immediate operational paralysis. IT leaders should build multi-model architectures that allow them to swap out foundation models seamlessly, ensuring business continuity in a highly regulated ecosystem. Second, technical decision-makers must elevate AI infrastructure to the level of critical cybersecurity. With frontier AI systems now capable of discovering high-severity software vulnerabilities at scale, the threat surface is expanding rapidly. Companies that fine-tune models or host them internally must lock down their development environments against both external and insider threats, matching the rigorous security standards Anthropic is demanding of the broader industry. Finally, leadership teams need a proactive, rather than reactive, labor strategy. 
Anthropic explicitly warns against using AI solely for cost savings through layoffs, encouraging enterprises to actively seek new use cases that allow them to retain and retrain their existing workforce. As governments potentially deploy pro-employment tax incentives and wage insurance policies to slow job displacement, companies that aggressively cut headcount to fund AI adoption may find themselves on the wrong side of both public sentiment and upcoming economic regulations.
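The multi-model architecture recommended above comes down to routing every request through an ordered list of interchangeable providers, so that a blocked or recalled model degrades service rather than halting it. The sketch below is one minimal way to express that idea. The provider names, the `ModelUnavailable` exception, and the stub functions are hypothetical stand-ins for real vendor SDK calls, not any vendor's actual API.

```python
from typing import Callable, List, Tuple


class ModelUnavailable(Exception):
    """Raised when a provider cannot serve a request (blocked, recalled, or down)."""


# A provider is a (name, callable) pair; the callable maps a prompt to a completion.
Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ModelUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise ModelUnavailable("all providers failed: " + "; ".join(errors))


# Hypothetical stubs standing in for real vendor clients.
def blocked_flagship(prompt: str) -> str:
    raise ModelUnavailable("deployment hold")


def backup_model(prompt: str) -> str:
    return f"echo: {prompt}"


providers = [("flagship-model", blocked_flagship), ("backup-model", backup_model)]
used, answer = complete_with_fallback("hi", providers)
```

In practice the harder work sits outside this loop: prompts, tool schemas, and evaluation suites need to be kept portable across vendors so that the fallback model produces acceptable output.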

Enterprise AI teams face a dilemma: The best models today might not be the best models a year from now. MassMutual's answer is to stop making long-term bets — and build infrastructure that can swap models as the market shifts. “The world of AI today is extremely dynamic,” Sears Merritt, MassMutual CIO, explained in a new VB Beyond the Pilot podcast. “We wanted to make sure we were positioned to ride that wave of dynamism.” The strategy appears to be paying off in a big way. MassMutual has measured a roughly 30% increase in developer productivity, while AI-powered contact center workflows have reduced resolution times from 10 minutes to one and cut costs from dollars to cents. But the broader lesson for IT leaders may be less about the results and more about how the company is thoughtfully building its AI infrastructure and keeping users at the center. Maintaining optionality for the possibilities of tomorrow MassMutual works with vendors at the leading edge, but keeps those relationships on a clock. “Those relationships are capped so that we maintain optionality for best-of-breed tools as things mature in this space, and at some point, settle down and stabilize,” Merritt said. That philosophy extends to open-source models. Merritt says his team is “100%” looking at open-source tools, and sees the technology playing a big role in how MassMutual (and similar companies) use AI. “We're certainly going to need frontier models and leading edge capabilities to do what today is impossible, and tomorrow will be possible,” he said. Measuring outcomes from the start MassMutual's AI efforts fall into two broad categories. The first focuses on enablement: Putting productivity-enhancing tools such as Copilot and virtual assistants into the hands of all employees. The second involves what Merritt describes as “deepen and focus” initiatives, where teams target a specific workflow or business process that will have a strong impact on advisors, policyholders, or employees. 
Rather than focusing on adoption metrics, these projects begin with predefined success criteria. “Everything we do is measured,” Merritt said. “There's always a success metric that we define upfront to determine whether or not we're going to scale up some of these things.” The company is also deliberately encouraging experimentation, giving employees access to a range of best-in-class models, “token-consumptive workflows” and other possible capabilities so they can weigh the benefits relative to “simpler, lower cost” large language models (LLMs). At the same time, MassMutual is collecting increasingly detailed analytics around usage patterns, developer workflows, model performance, and costs. The goal is to reduce spending while also building operational intelligence to eventually route workloads to the right model based on cost, response quality, and user experience. Those insights will eventually drive optimization decisions around model routing, prompt selection, response times, and infrastructure design. “We're gaining access to analytics that let us, in a very granular way, look at usage patterns, developer workflows, and begin to make sense of who's using what, when, and for what types of tasks,” Merritt said. Why MassMutual sometimes chooses the more expensive model Another interesting aspect of MassMutual's approach is how it evaluates AI quality. Rather than focusing exclusively on benchmarks or token costs, the company uses what Merritt calls a “trust score” framework. The process combines user feedback with operational metrics to understand how employees perceive AI-generated responses and whether those responses actually improve outcomes. The contact center rebuild put that framework to the test. During development, employees were given access to two different LLMs. One generated responses in near-real-time but the quality was noisier. 
The other, more expensive option took several additional seconds to respond but consistently delivered higher-quality answers. Conventional wisdom and the speed of business might suggest users would prefer the former, but they overwhelmingly chose quality. Merritt’s team asked users about the quality of response, their preferred model, and their overall thoughts on the experience. Most of the time, users said: “We want the more expensive one. We're willing to wait, but the quality difference is so high that the two extra seconds actually is worth it to us.” That feedback ultimately determined which model MassMutual deployed. “We factored that experience piece into the decision-making, and that led us to say, on a relative basis, the costs were immaterial, so we're going to use the more complex model," Merritt said. Listen to the full podcast to hear more about: why Mythos “completely changed” the cybersecurity landscape (not the type of threats, but the rate at which those threats appear); how a team of AI engineers modernized MassMutual’s mainframe in 7 days (a process that previously would have taken 3 months); why MassMutual specifically avoided tokenmaxxing to rein in AI use and spending and has been going “unlimited” to shield itself from cost blowups; and how a “multi-harness type of environment” will support agentic AI. You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

Apple’s new Siri AI, unveiled yesterday at Apple's annual Worldwide Developers Conference (WWDC 2026), may look like a consumer product story on the surface. But for enterprise developers and IT leaders, the bigger news from WWDC26 is that Apple is turning Siri into a systemwide AI interface for apps, data and workplace actions across iPhone, iPad, Mac, Apple Watch and Vision Pro, as revealed in the WWDC26 Apple Intelligence developer guide. In other words, if your company offers an application on Apple devices, whether it runs on an iOS device or a Mac, the new Siri AI may force you to change how that application is discovered and served, and how its contents and workflows are made available to end users. Enterprise developers can expose app content through App Entities, make it available to Apple’s Spotlight semantic index, define actions through App Intents and App Schemas, and map onscreen user interface elements to app objects through View Annotations. That makes Siri AI much more than a voice assistant. Apple is positioning it as an AI-powered app action and content-discovery layer built into its operating systems.

Siri becomes an app action layer

For enterprise developers, the shift could be significant. A business app that properly adopts Apple’s new frameworks could let users ask Siri to find, summarize, update or act on app content without the developer having to build a separate chatbot interface. Apple says App Intents, its existing framework for exposing app actions to system features like Siri and Shortcuts, is the path for connecting apps to Apple Intelligence and Siri AI, while schemas make app content and actions usable through natural language. In practical terms, that could apply to customer records in a CRM, open tickets in an IT service desk, project tasks, invoices, calendar events, documents, expenses, notes, messages or field-service records.
Instead of opening an app, searching manually and clicking through menus, an employee could ask Siri to act on the specific object they are viewing or retrieve a related item from another app. Spotlight becomes the enterprise search hook Apple says in its WWDC26 Apple Intelligence guide that entity schemas contribute app content to the Spotlight semantic index, while intent schemas let users take action on that indexed content without developers defining a rigid list of command phrases. Apple also says the new View Annotations API lets developers map views to entities so users can refer to what is onscreen conversationally — for example, “summarize this customer thread,” “add this invoice to my expenses,” or “follow up on this task tomorrow.” That is an important distinction from earlier voice-assistant integrations, which often required narrow command structures and explicit invocation phrases. Apple is instead giving developers a way to describe an app’s data and capabilities so Siri, Spotlight and Shortcuts can use them through the system. Developers get testing tools for Siri and app actions Apple is also adding AppIntentsTesting, a framework that validates App Intents through the same infrastructure used by Siri, Shortcuts and Spotlight without requiring UI automation. That matters for enterprise software teams because natural-language app actions need to be testable, repeatable and reliable before they are trusted in production workflows. It also gives developers a path to include Siri and Spotlight behavior in ordinary testing pipelines instead of treating assistant integration as a manual demo feature. The result is a clearer developer mandate: if an app wants to show up well inside Siri AI, it will likely need to expose its data, actions and onscreen context through Apple’s system frameworks. 
For enterprise SaaS vendors, that could become an important part of Apple-platform competitiveness, especially in categories such as productivity, collaboration, CRM, project management, finance, design, knowledge management, healthcare, logistics and field operations. Apple expands its model stack for developers Apple is also using WWDC26 to expand its AI developer stack beyond Siri. The updated Foundation Models framework gives Swift developers access to Apple’s on-device models, Apple models running through Private Cloud Compute and third-party model providers that conform to Apple’s Language Model protocol. That gives developers more flexibility than a single Apple-only model path. Apple says in its Apple Intelligence developer guide that the framework now supports multimodal prompts, Vision tools, dynamic model profiles and evaluations. In theory, an enterprise app could use an Apple on-device model for private or lightweight tasks, call Apple’s Private Cloud Compute for heavier reasoning, or plug in an outside provider such as Claude, Gemini, an open-source model or a company-controlled model through Apple’s model-provider interface. Core AI brings custom models onto Apple silicon Apple is also introducing Core AI, an operating system-level framework for running developers’ own models on Apple silicon. For enterprises that do not want sensitive data sent to a cloud model at all, local inference remains one of Apple’s most important advantages. Core AI gives developers a first-party way to deploy custom models with Swift APIs, memory controls and optimized execution on Apple hardware. Evaluations signal a more mature enterprise AI posture The company’s new Evaluations framework also points at a more mature enterprise AI posture. AI features are difficult to test with conventional unit tests because model outputs can vary. Apple says the framework helps developers define metrics, automatically grade outputs and aggregate statistics. 
For enterprise buyers, that matters because AI features need measurable reliability, not just impressive demos. Apple is also explicitly addressing the security risks of app agents. WWDC26 developer materials include a session on how developers can mitigate risks to agentic features, covering indirect prompt injection, data exfiltration, unintended actions, threat modeling, user confirmations, authentication and safeguards for App Intents and Foundation Models. That is a notable acknowledgement that AI assistants able to read context and take action across apps create new attack surfaces. Enterprise IT gets new Apple Intelligence controls For enterprise IT, Apple also answered some of the governance questions raised by Siri AI’s initial announcement. Its WWDC26 device management documentation describes new management controls for Apple Intelligence, Siri and external intelligence integrations. Supervised devices can use Apple’s intelligence settings configuration to allow or deny features such as Genmoji, Image Playground, Writing Tools, Image Wand, app-specific intelligence in Mail, Notes and Safari, Apple Intelligence Report, Visual Intelligence Summary and on-device-only processing for dictation and translation. Apple says additional management for Siri AI and Visual Intelligence will arrive in later beta releases. That means enterprise controls are not complete yet, but Apple is clearly building Siri AI into its managed-device architecture rather than treating it as an unmanaged consumer feature. Apple also adds controls for outside AI services Apple is also adding controls for external intelligence services. Its deployment docs describe a configuration for managing external intelligence integrations, including whether users can access outside AI services and whether they can sign in to those services. That will matter for organizations trying to control when employees use Apple’s own models, Apple’s private cloud architecture or third-party AI systems. 
Those controls could help Apple compete with Microsoft and Google in enterprise AI, but with a different pitch. Microsoft Copilot and Google Gemini are tied deeply to their respective productivity clouds. Apple’s strategy is more device- and OS-centered: make AI available where the user already works, expose app actions through system frameworks and emphasize on-device processing and Private Cloud Compute as privacy advantages. Apple’s privacy pitch remains central Apple’s privacy architecture remains central to that pitch. Siri AI uses Apple Foundation Models on device and through Private Cloud Compute. Apple says in its Siri AI announcement that requests handled by Private Cloud Compute do not store personal data or make it accessible to Apple. For industries such as healthcare, financial services, legal, education and government, that claim may be more important than any single assistant feature. But enterprises will still need more detail before treating Siri AI as a fully governed workplace assistant. Apple’s WWDC26 materials show progress on management controls, external AI restrictions and app-level governance, but the full picture is still emerging. Key questions remain around auditability, retention, work-versus-personal data boundaries, role-based access, compliance certifications, and how much control IT departments will have over Siri’s ability to act inside specific business apps. Availability limits could complicate rollout Availability also complicates enterprise rollout. Siri AI is in developer testing now for iOS 27, iPadOS 27, macOS 27 and visionOS 27, with watchOS support coming in a later beta. Apple says the user-facing beta arrives later this year. The feature requires Apple Intelligence-capable hardware, which means many older corporate devices will not support it. 
Apple also says Siri AI will not initially be available on iPhone and iPad in the European Union, and that Siri AI and other new Apple Intelligence features are not available in China while the company works through regulatory requirements. That means global enterprises may face fragmented deployment, with different feature availability by hardware, operating system, language and region. App Store changes give business software vendors another opening Apple also introduced enterprise-adjacent App Store changes that could matter for business software vendors. StoreKit 2 will support subscriptions for groups and organizations, including volume purchasing through Apple Business and Apple School Manager. IT teams will be able to buy and assign App Store subscriptions through device management workflows, while developers will be able to manage subscription availability for organizations. That gives Apple a more business-friendly path for selling app subscriptions into managed environments. The company is also unifying Apple Business Manager, Apple Business Essentials and Apple Business Connect under Apple Business, which Apple describes as a broader platform for Managed Apple Accounts, device management, volume licensing, Admin APIs, Apple Maps locations, Tap to Pay on iPhone, Branded Mail and multi-seat subscriptions. Apple’s enterprise AI strategy comes into focus Taken together, the WWDC26 enterprise story is bigger than Siri alone. Apple is building an AI stack that spans user-facing assistant features, developer integration frameworks, local and private-cloud model infrastructure, AI testing, App Store business subscriptions and device-management controls. The strategic question is whether Apple can make this more than another Siri reset. Developers will need to adopt Apple’s app-intelligence frameworks. Enterprises will need stronger governance assurances. Users will need the assistant to work reliably across real workflows, not just Apple’s own apps. 
But the direction is now much clearer. Apple is not trying to compete in enterprise AI by launching a standalone chatbot. It is embedding AI into the operating system, making apps addressable through Siri and Spotlight, giving developers model and testing tools, and giving IT teams at least the beginnings of policy controls. For enterprise developers, that means App Intents, App Schemas, App Entities, Spotlight indexing and View Annotations may become core parts of building competitive Apple-platform apps. For enterprise technology leaders, it means Apple’s devices could soon include a native AI assistant that can act across business workflows — if Apple can prove that the privacy, security and management model is strong enough for production use.

Engineering teams building agentic coding pipelines now have a concrete open-source alternative to managed models like Claude Fable 5 — one that runs on a single H100. The tradeoff: Cohere's North Mini Code, which launched Tuesday, generated three times the output tokens of comparable models in independent testing, a verbosity cost that compounds in high-volume production workloads. The new open-source model is a 30 billion parameter mixture-of-experts (MoE) model with 3 billion parameters active per token, built for agentic software engineering including sub-agent orchestration, architecture mapping, code review and terminal work. The model supports a 256,000 token context window with a 64,000 token maximum generation length, and is available on Hugging Face under an Apache 2.0 license. What North Mini Code can do North Mini Code targets the full agentic coding stack. Here is what the model does and what it runs on. Software engineering. Cohere built North Mini Code specifically for agentic software engineering, not adapted from a general-purpose base. It has integrated tool-use capabilities and supports interleaved thinking, which Cohere says improves performance across multi-step agentic work. Architecture mapping and code review. North Mini Code can analyze and map systems architecture, surface dependencies and perform code review across large codebases. With a 256,000 token context window, it can hold substantial multi-file projects in a single context pass. Terminal-based agentic tasks. The model is trained for terminal environments, handling shell interactions, package scripts and command-line tooling. Cohere benchmarked it on Terminal-Bench v2, which tests agents in real terminal environments rather than synthetic code generation tasks. How it was built North Mini Code is a sparse mixture-of-experts model with 128 experts, of which 8 activate per token. 
The compute requirement at inference time is closer to that of a 3 billion parameter model despite the 30 billion total parameters. Nick Frosst, co-founder of Cohere, demoed it running on a Mac Studio via MLX using around 20 gigabytes of RAM, the same machine he uses for his own local coding work. Cohere trained the model through two stages of supervised fine-tuning followed by reinforcement learning with verifiable rewards across more than 70,000 verifiable tasks spanning approximately 5,000 repositories, deduplicated against SWE-Bench. Rather than optimizing against a single agent scaffold, Cohere trained across three. SWE-Agent uses a rich CLI with specialized commands. Mini-SWE-Agent uses a single bash tool with raw shell output. OpenCode uses individually typed tools returning structured JSON. Cohere reports a 10 percentage point gain on OpenCode evaluation from the multi-harness approach while maintaining SWE-Agent performance. Where it fits North Mini Code enters a market that now includes Mistral Devstral Small 2, GitHub Copilot, Cursor, and Claude Fable 5, each with distinct cost and deployment tradeoffs. Cohere's primary benchmark comparison is against Mistral Devstral Small 2, a 24 billion parameter dense model. In vendor-reported internal tests under identical hardware configurations, Cohere claims 2.8x higher output throughput and a 30% inter-token latency advantage over Devstral Small 2. Cohere also claims, in its Hugging Face technical post, that North Mini Code outperforms open-source models up to four times its parameter count on its reported benchmarks, including models at 120 billion parameters. Artificial Analysis independently ranks it eighth of 127 comparable open-weight models on output speed at 210 tokens per second, with a time to first token of 0.25 seconds against a class median of 1.95 seconds. It places 18th of 127 on the Artificial Analysis Intelligence Index. 
One flag from the same data: the model generated 75 million output tokens to complete the Intelligence Index against a class median of 25 million. In high-volume agentic pipelines, that verbosity compounds into inference cost and latency. "Suddenly people are thinking like hey, am I getting enough economic value out of the tokens from a model?" Frosst said during the launch video. "Local deployment is one way of empowering people and making AI really something that works for them." GitHub Copilot, Cursor and Claude Code operate on per-usage or subscription pricing with no on-premises option. Anthropic's Claude Fable 5, now the most capable publicly available managed coding model, runs at $50 per million output tokens. For Frosst, the model is the polar opposite of Fable. "Its small, cost effective, apache 2.0, and locally deployable. This is the way LLMs should go. small, open source, transparent and sovereign, vs large, expensive, proprietary and hegemonic," Frosst wrote in a post on X. What this means for enterprises For teams building production agentic coding pipelines, North Mini Code's release clarifies a set of decisions that have been forming for months. Purpose-built agentic training is now a baseline to evaluate against. The distinction between models fine-tuned for code and models trained specifically for agentic workflows, with verified tool calls and multi-harness robustness, is now a material factor in pipeline decisions. Any model vendor claiming agentic coding capability should be able to answer whether its training used verifiable agentic tasks or was adapted from a general-purpose base. Verbosity is a hidden pipeline cost that benchmarks do not surface. Artificial Analysis measured North Mini Code generating three times the output tokens of comparable models. That verbosity compounds across inference cost and latency in high-volume pipelines. Throughput testing against actual workload volume is the evaluation step the benchmark rankings skip. 
The frontier pricing split is now a real architectural decision. Fable 5 at $50 per million output tokens and North Mini Code on a single H100 represent a genuine tradeoff between cost control and data residency on one side, and managed infrastructure overhead on the other. Teams running high-volume agentic coding pipelines should model both cost paths against their actual workload before committing to either.
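The two cost paths described above can be compared with a few lines of arithmetic. The sketch below is illustrative only. The $50 per million output tokens and the 75 million versus 25 million token counts come from the figures reported above; the GPU hourly rate is a hypothetical placeholder, and the 210 tokens-per-second figure is the Artificial Analysis output speed reused as an assumption, which may not match a given team's single-H100 setup.

```python
def api_output_cost(output_tokens: int, usd_per_million: float) -> float:
    """Cost of output tokens on a metered API priced per million tokens."""
    return output_tokens / 1_000_000 * usd_per_million


def self_hosted_cost(output_tokens: int, tokens_per_second: float,
                     gpu_usd_per_hour: float) -> float:
    """Cost of generating the same tokens on hardware billed by the hour.

    Assumes one sequential stream with no batching, so it overstates the
    cost for a server that batches many requests at once.
    """
    hours = output_tokens / tokens_per_second / 3600
    return hours * gpu_usd_per_hour


if __name__ == "__main__":
    FABLE_OUTPUT_PRICE = 50.0    # USD per million output tokens (reported)
    MEDIAN_TOKENS = 25_000_000   # class median on the Intelligence Index (reported)
    VERBOSE_TOKENS = 75_000_000  # North Mini Code on the same index (reported)
    ASSUMED_GPU_RATE = 3.0       # hypothetical USD per H100-hour, not a quoted price
    ASSUMED_SPEED = 210.0        # tokens per second, reused as an assumption

    # The median token count is used here only as a stand-in workload size.
    print("API path:", api_output_cost(MEDIAN_TOKENS, FABLE_OUTPUT_PRICE))
    print("Self-hosted path:",
          round(self_hosted_cost(VERBOSE_TOKENS, ASSUMED_SPEED, ASSUMED_GPU_RATE), 2))
```

Swapping in a team's own token volumes, measured throughput and hardware pricing is the point of the exercise; the placeholder values say nothing about which path is cheaper for a real workload.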

On-device AI models have stayed small because the entire weight set has to live in DRAM, capping practical parameter counts well below what server-side deployments use. Enterprise architects evaluating agentic workloads have had to choose between capable cloud-dependent models and limited on-device ones. Apple's third-generation foundation models, announced at WWDC26, break that constraint by moving the weight set off DRAM entirely. The AFM 3 family was developed in collaboration with Google and spans five models: two on-device and three server-based, all running within Apple's Private Cloud Compute boundary. The server-side models, including AFM 3 Cloud Pro for agentic tool use and complex reasoning, run on Nvidia GPUs in Google Cloud. The on-device architecture is Apple's own. AFM 3 Core Advanced is a 20-billion-parameter model that stores weights in NAND flash rather than DRAM. "Instead of forcing the entire model into DRAM, the full model is stored in flash memory," Apple's research team wrote. "Because NAND-to-DRAM bandwidth is too slow to swap weights token by token, as standard MoE models require, AFM 3 Core Advanced makes routing decisions per prompt." How the architecture actually works The memory wall Apple is working around is one every local AI developer runs into. "You can't put 20B parameters in RAM at any reasonable precision," Awni Hannun, a researcher at Anthropic and former Apple research scientist, posted on X. "To make it work they are using pretty exotic architecture by today's standards. A small model predicts from the query (or prompt) which experts to load from NAND into RAM." That prediction-and-load mechanism has three distinct components, each driven by the hardware constraints of consumer silicon. The full 20B weight set lives in flash, not DRAM. AFM 3 Core Advanced stores its entire parameter set in NAND flash rather than active memory. 
Standard on-device deployments require the full model to fit in DRAM, which is what caps their parameter counts. Apple's approach, which it calls Instruction-Following Pruning (IFP) and developed with its own researchers, treats flash as the model's permanent home and DRAM as a working buffer for whichever experts a given prompt requires. Expert routing happens once per prompt, not per token. In a conventional Mixture of Experts model, a router selects different experts for every token generated — which would require continuous weight movement between flash and DRAM at inference speed. NAND-to-DRAM bandwidth cannot support that. AFM 3 Core Advanced routes once at prompt time, selects a fixed expert set, loads it into DRAM alongside always-active shared experts, and generates all tokens from that same configuration. "The key distinction from a typical MoE is that you do this once per query and then generate all the tokens with the same experts," Hannun wrote. Active parameter count scales from 1B to 4B depending on task complexity. Rather than running a fixed model size for every request, AFM 3 Core Advanced adjusts how many parameters it activates based on what the task requires — 1 billion for simpler operations, up to 4 billion for harder ones, all drawn from the 20-billion-parameter pool in flash. What Apple has and hasn't disclosed The architecture paper is detailed on the memory design and sparse activation mechanism. It is less forthcoming on practical deployment constraints. Apple's profiling tools expose timing but not the metrics that decide production viability. "Energy, memory bandwidth, thermal? Not in the docs," Marco Abis, who is building Ziraph, a profiler for local AI on Apple silicon, posted on X. "A notable gap, given those decide most of on-device performance." 
Abis also did not find a statement in Apple's documentation — across the Core AI docs, the Foundation Models docs or the Private Cloud Compute security post — of when an on-device request transparently offloads, or whether that routing is visible to the developer or the user. For enterprises that need to document where inference runs, that is a direct compliance problem. Not all the information is currently available. Apple has indicated a full technical report with benchmarks is coming later this summer. What this means for enterprise architects Regulated industries evaluating agentic AI deployments now have a concrete architectural decision to make. The DRAM wall for on-device agents just moved. Enterprises evaluating agents that need to run without a cloud round-trip now have a 20-billion-parameter local option to evaluate. The constraint shifts from model capability to device hardware. The private/cloud boundary is now an architectural decision, not a default. Simpler requests stay on-device; complex agentic tasks route to AFM 3 Cloud Pro on Private Cloud Compute. Apple has not publicly specified when a request offloads or whether that routing is visible to the developer — a gap that complicates policy decisions for organizations that need to document where inference runs. The agentic server tier depends on Google Cloud. AFM 3 Cloud Pro runs on Nvidia GPUs in Google Cloud. The Private Cloud Compute guarantee covers data privacy. It does not eliminate the Google Cloud dependency for server-side inference. AFM 3 Core Advanced gives enterprises a 20-billion-parameter on-device option that did not exist before WWDC26. Whether it is deployable at scale depends on answers Apple has not yet published. Those details are due in the summer technical report.
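The difference between per-token and per-prompt expert routing described above can be illustrated with a toy count of flash-to-DRAM loads. This is a sketch under stated assumptions, not Apple's implementation: the pool size, the number of experts held in DRAM, and the seeded random "router" standing in for the small predictor model are all hypothetical.

```python
import random

NUM_EXPERTS = 16       # toy pool size, not Apple's real expert count
EXPERTS_IN_DRAM = 4    # toy number of routed experts held in DRAM at once


def route(text: str, k: int = EXPERTS_IN_DRAM) -> set:
    """Toy router: deterministically pick k expert ids from a string."""
    rng = random.Random(text)
    return set(rng.sample(range(NUM_EXPERTS), k))


def loads_with_per_token_routing(prompt: str, num_tokens: int) -> int:
    """Conventional MoE style: re-route for every generated token."""
    resident = set()
    loads = 0
    for step in range(num_tokens):
        wanted = route(f"{prompt}|{step}")
        loads += len(wanted - resident)  # missing experts must be read from flash
        resident = wanted                # the DRAM buffer only holds k experts
    return loads


def loads_with_per_prompt_routing(prompt: str, num_tokens: int) -> int:
    """Route once from the prompt, then reuse the same experts for every token."""
    resident = route(prompt)
    # num_tokens does not change the count: generation reuses `resident`.
    return len(resident)


if __name__ == "__main__":
    prompt = "summarize this customer thread"
    print("per-token loads:", loads_with_per_token_routing(prompt, 200))
    print("per-prompt loads:", loads_with_per_prompt_routing(prompt, 200))
```

In this toy model the per-prompt count stays fixed at the number of resident experts regardless of output length, while the per-token count grows with every step that selects an expert not already in DRAM.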

Anthropic today launched two new AI models, Claude Fable 5 and Claude Mythos 5, marking the company’s first broad release of the powerful “Mythos-class” AI capabilities it previously made available only to participating organizations in its restricted cybersecurity program, Project Glasswing, which it announced two months ago. The company says Fable 5, which is the version most users and developers will get starting today, exceeds every Claude model it has previously made generally available, with stronger performance across software engineering, knowledge work, vision, scientific research and long-running tasks. It comes out on top on nearly all of the existing benchmarks, though the prior Claude Mythos Preview version of the model still takes the top spots on computer use and multidisciplinary reasoning (see benchmark chart below). The new Claude Mythos 5, by contrast, is less restricted in its capabilities, but more restricted in its availability. It is an upgraded version of the prior, similarly capable but limited-release Mythos Preview model. As such, it has certain safeguards lifted, but it’s only officially accessible to Anthropic-approved users, including Anthropic's cybersecurity partners in its Project Glasswing effort, and select biology researchers. The key difference is that the general-purpose Fable 5 wraps the same underlying Mythos-class capability in new safeguards. Anthropic says requests involving certain high-risk areas (cybersecurity, biology and chemistry, and model distillation) are automatically routed to Claude Opus 4.8, Anthropic's previous flagship general model, with users notified when that happens. That is not the case on Mythos 5. The company says more than 95% of Fable 5 sessions run entirely on Fable 5’s own responses, with no fallback, and that internal and external red-teaming efforts found no “universal jailbreaks” after more than 1,000 hours of testing. 
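The fallback behavior described above follows a common routing pattern: classify the request, pick a model, and attach a notice when the fallback is used. The sketch below shows only that pattern and makes no claim about how Anthropic's classifiers work. The keyword lists are a crude hypothetical stand-in for trained classifiers, and the fallback model label is invented for illustration; only `claude-fable-5` is the reported API name.

```python
from dataclasses import dataclass
from typing import Optional

PRIMARY_MODEL = "claude-fable-5"      # API name as reported
FALLBACK_MODEL = "opus-4.8-fallback"  # hypothetical label, not a real API id

# Hypothetical keyword lists standing in for trained classifiers.
HIGH_RISK_KEYWORDS = {
    "cybersecurity": ("exploit", "lateral movement"),
    "biology and chemistry": ("pathogen", "toxin"),
    "model distillation": ("distill",),
}


@dataclass
class RoutingDecision:
    model: str
    category: Optional[str]
    user_notice: Optional[str]


def classify(prompt: str) -> Optional[str]:
    """Return the first high-risk category whose keywords appear, else None."""
    text = prompt.lower()
    for category, keywords in HIGH_RISK_KEYWORDS.items():
        if any(word in text for word in keywords):
            return category
    return None


def route_request(prompt: str) -> RoutingDecision:
    """Send flagged prompts to the fallback model and tell the user why."""
    category = classify(prompt)
    if category is None:
        return RoutingDecision(PRIMARY_MODEL, None, None)
    notice = f"This request was answered by the fallback model ({category})."
    return RoutingDecision(FALLBACK_MODEL, category, notice)


if __name__ == "__main__":
    print(route_request("Refactor this Ruby module"))
    print(route_request("Write an exploit for this buffer overflow"))
```

A matcher this simple will also flag benign prompts, such as a defender asking how an exploit was patched, which is the kind of false positive any cautious routing layer has to trade off against coverage.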
Anthropic says Fable 5 is available to the general public today through its website, apps, and API, but that Mythos 5 will initially only be made available to users who already have access to the older Claude Mythos Preview. Pricing, access and a tricky rollout Anthropic is pricing both Fable 5 and Mythos 5 at $10 per million input tokens and $50 per million output tokens. The company says that is less than half the price of Claude Mythos Preview, but still ranks as the most expensive of major AI models available globally.

VentureBeat Frontier AI Model API Pricing Snapshot (prices per million tokens)

| Model | Input | Output | Total Cost | Source |
|---|---|---|---|---|
| MiMo-V2.5 Flash | $0.10 | $0.30 | $0.40 | Xiaomi MiMo |
| deepseek-v4-flash | $0.14 | $0.28 | $0.42 | DeepSeek |
| deepseek-v4-pro | $0.435 | $0.87 | $1.305 | DeepSeek |
| MiniMax-M3 | $0.30 | $1.20 | $1.50 | MiniMax |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | Google |
| Qwen3.7-Plus | $0.40 | $1.60 | $2.00 | Alibaba Cloud |
| MiMo-V2.5 | $0.40 | $2.00 | $2.40 | Xiaomi MiMo |
| Grok 4.3 (low context) | $1.25 | $2.50 | $3.75 | xAI |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| Kimi-K2.6 | $0.95 | $4.00 | $4.95 | Moonshot/Kimi |
| GLM-5.1 | $1.40 | $4.40 | $5.80 | Z.ai |
| Grok 4.3 (high context) | $2.50 | $5.00 | $7.50 | xAI |
| Qwen3.7-Max | $2.50 | $7.50 | $10.00 | Alibaba Cloud |
| Gemini 3.5 Flash | $1.50 | $9.00 | $10.50 | Google |
| Gemini 3.1 Pro Preview (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI |
| Gemini 3.1 Pro Preview (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.8 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.5 | $5.00 | $30.00 | $35.00 | OpenAI |
| Claude Fable 5 / Claude Mythos 5 | $10.00 | $50.00 | $60.00 | Anthropic |

For developers, Fable 5 is available through the Claude API as claude-fable-5. Anthropic says Fable 5 is fully available today on the Claude API and on consumption-based Enterprise plans. For subscription users, the rollout is more complicated. Anthropic says Fable 5 will be included on Pro, Max, Team and seat-based Enterprise plans at no extra cost from today through June 22. On June 23, the company plans to remove Fable 5 from those plans, after which using it will require usage credits. 
Anthropic says it aims to restore Fable 5 as a standard part of subscription plans as quickly as possible. The difference between Fable 5 and Mythos 5 Anthropic is not presenting Fable 5 and Mythos 5 as two separate models in the usual “small versus large” sense. Instead, they appear to share the same base capability level. The difference lies in access control, meaning how easy it will be for users to get their hands on each model, and in the guardrails embedded in each. As previously mentioned, Fable 5 includes a new safeguard layer that detects certain high-risk requests (cybersecurity, biology and chemistry, and attempts to distill the model’s capabilities into other systems) and routes those requests to Claude Opus 4.8. Mythos 5 lifts some of those restrictions for trusted users working in approved domains. In practical terms, Mythos 5 is more powerful for sensitive cyber and biology work because it can answer in areas where Fable 5 falls back. For most ordinary enterprise and developer tasks, however, Anthropic says Fable 5 performs effectively the same as Mythos 5. The launch also signals how Anthropic plans to bring frontier models with dangerous dual-use capabilities into the market: not by releasing all capabilities to everyone, and not by simply refusing risky questions, but by routing some requests to a less capable model while keeping the stronger model available for the majority of everyday work. A major improvement in autonomous coding For enterprise buyers, the most immediate use case is likely software engineering. Anthropic says Fable 5 can work unattended for longer and with more independence than previous Claude models, which is exactly the capability enterprises need if they want AI agents to do more than autocomplete code or answer developer questions. 
On SWE-bench Pro, which measures a model's ability to complete difficult software engineering tasks, Anthropic says Fable 5 and Mythos 5 reach 80.3%, vastly outperforming OpenAI's latest and greatest general model GPT-5.5, which scored 58.6%. On Cognition’s FrontierCode Diamond benchmark, which tests high-quality, maintainable agentic coding, the models score 29.3%, compared with 13.4% for Claude Opus 4.8 and 5.7% for GPT-5.5, according to the benchmark table included in Anthropic’s materials. Anthropic also says Fable 5 scores highest among frontier models on FrontierCode even at medium reasoning effort, suggesting the model may deliver stronger coding results without always needing maximum compute. The most striking customer example comes from Stripe. Anthropic says Stripe tested Fable 5 in a 50-million-line Ruby codebase and found that the model completed a codebase-wide migration in one day that otherwise would have taken a team more than two months by hand. Stripe said, “Fable 5 compresses months of engineering into days. In our 50-million-line Ruby codebase, it did in a day what would've taken us more than two months by hand.” Other early users describe the model as especially useful for long-horizon development tasks. Cursor said, “Fable 5 is the state of the art model on CursorBench. It's opened up a class of long-horizon problems that were out of reach for earlier models.” Replit said Fable 5 is the highest-performing model it has tested on ViBench, its end-to-end “vibe-coding” benchmark, and that it builds apps in less time with fewer tokens. Figma said Fable 5 is “a clear step forward on agentic coding and prototyping.” This is the enterprise shift Anthropic is trying to sell: AI coding systems that can take on larger units of work, not just individual tickets. That could include codebase migrations, app prototyping, pull request review, test generation, debugging across unfamiliar tools, user interface design and multi-step internal software projects. 
Base44 said, “Fable 5 is much deeper and better at one-shotting full apps, and its tool calling is excellent.” Genspark said, “Fable 5 came out #1 on our evals, winning head-to-head against every model we tested. It was significantly stronger on the hardest tasks in the set — UI design and game coding.” Rakuten said, “At the highest effort, Fable 5 reflects on and validates its own work. For us, that's what makes highly autonomous operations possible — the extra thinking pays for itself.” For CTOs and engineering leaders, that suggests the model’s value may come less from raw code generation and more from sustained execution: understanding an intent, planning steps, calling tools, checking its own work and continuing through a task without constant human steering. Knowledge work, finance, legal and operations Anthropic is also positioning Fable 5 as a stronger model for enterprise knowledge work. On GDPval-AA, Anthropic reports a score of 1932 for Fable 5 and Mythos 5, compared with 1890 for Claude Opus 4.8, 1769 for GPT-5.5 and 1314 for Gemini 3.1 Pro. On GDPpdf, a benchmark focused on visual document reasoning, Fable 5 and Mythos 5 score 29.8% without tools, compared with 22.5% for Opus 4.8, 24.9% for GPT-5.5 and 16.7% for Gemini 3.1 Pro. That matters for enterprises because much of corporate work still lives in messy documents: PDFs, spreadsheets, charts, reports, contracts, filings, slide decks and screenshots. Anthropic says Fable 5 shows gains in document-based reasoning, chart and table interpretation and complex problem solving. Hex said, “Fable 5 is the first to break 90% on our core analytics benchmark of complex, long-running analytical tasks — a 10-point jump over Opus. On the hardest questions, it shows strong judgment and attention to nuance.” Hebbia said Fable 5 was the highest-scoring model on its Finance Benchmark for senior-level reasoning, with double-digit gains in document reasoning, chart and table interpretation, and problem solving. 
The finance examples are notable because they point to AI agents moving beyond summarization into higher-stakes analytical workflows. IMC said Fable 5 “aced our trading-analysis evaluations nearly across the board: factual lookup, conceptual reasoning, root-cause analysis, expected-value analysis.” Optiver said the model was stronger than Opus 4.8 on its trading benchmark and “remarkably consistent,” scoring identically across repeated runs. Balyasny Asset Management said Fable 5 was the strongest finance-first model it had tested. Legal and operations teams may also see immediate impact. Crosby Legal said, “Fable 5 feels materially different. In blind review, our lawyers found its redlines matched or beat our current model every time.” Notion said the model can take work “you'd chip away at all afternoon” and turn messy notes into a functioning project plan. Zapier said Fable 5 is the new leader on AutomationBench and is more autonomous than Opus 4.8: “Where Opus stops to ask, Fable 5 keeps looking.” For enterprise software vendors, that points toward more capable embedded agents in workflow products: agents that can review a contract, update a project plan, assemble a spreadsheet, inspect a chart, file a ticket, run a query, call an internal API and keep going until the work is complete. Vision and interface understanding Anthropic says Fable 5 is also its strongest vision model. In its launch materials, the company says the model can extract precise numbers from detailed scientific figures and complete vision-based tasks such as rebuilding a web app’s source code from screenshots alone. That has immediate implications for enterprise automation. Many business processes still depend on visual interfaces that are not cleanly exposed through APIs: dashboards, PDFs, forms, legacy apps, screenshots, scans and image-heavy reports. A stronger vision model could help agents operate across those environments with less custom integration work. 
Anthropic also says Fable 5 needs less scaffolding than previous Claude models. As an example, the company says earlier Claude models struggled to play Pokémon FireRed even with extra tools, while Fable 5 beat the game using a minimal vision-only harness. Anthropic posted a fast-forwarded video of the playthrough to YouTube and in its blog post. The point is not gaming itself, but the broader agentic skill: reading a visual environment, remembering progress, deciding what to do next and executing over a long horizon. In another internal test, Anthropic says it had the model play the deck-building game Slay the Spire with access to persistent file-based memory. The company says persistent memory improved Fable 5’s performance three times more than it improved Opus 4.8’s, and that Fable 5 reached the game’s final act three times more often. For enterprise users, this suggests Fable 5 may make better use of notes, logs and stored context during multi-step work. That could matter for internal agents that operate over days or weeks: sales operations agents that track account research, engineering agents that manage migrations, finance agents that update models, or support agents that remember what they tried across many turns. From restricted cyber model to general-purpose enterprise AI The announcement follows Anthropic’s April 2025 rollout of Claude Mythos Preview through Project Glasswing, a restricted program for cyber defenders, critical infrastructure providers and major software maintainers. Anthropic created Glasswing after internal evaluations showed Mythos-class models could find and exploit software vulnerabilities at a level that raised meaningful misuse concerns. Following the debut of Glasswing and Mythos, U.S. officials and intelligence agencies began weighing how such models could reshape both cyber defense and offensive operations, while Sen. 
Mark Warner warned that AI-assisted vulnerability discovery should force industry to “accelerate and reprioritize patching.” Financial regulators also took notice: The Guardian reported that Mythos entered discussions among senior banking officials and regulators in the U.S. and U.K. because of fears that AI-accelerated cyberattacks could threaten payment systems and broader financial stability. The reaction has not been limited to alarm. Governments also want access: Reuters reported that South Korea’s national internet security agency had secured Mythos access through Project Glasswing, reflecting a broader geopolitical race to use frontier AI for national cyber defense. At the same time, Anthropic has faced scrutiny over whether it can safely gate the very capabilities it says are too risky for general release. The Verge reported that unauthorized users accessed Mythos after its limited rollout, calling the incident damaging for a company that has built its brand around responsible AI. Critics have also questioned whether Anthropic’s warning-heavy framing risks becoming a form of market positioning, since it casts the company as both the source of the new capability and the gatekeeper deciding which governments, companies and researchers get to use it. With Fable 5, Anthropic is leaning into its gatekeeper role, attempting to separate the general enterprise value of a Mythos-class model from the riskiest parts of its capability profile. The company says Fable 5 can handle software engineering, research, visual reasoning, document analysis and long-running agentic workflows, while classifiers block or reroute requests that could provide what Anthropic calls “uplift” to malicious actors. Those classifiers cover three main areas. Cybersecurity, where Anthropic says Mythos-class models can discover and exploit vulnerabilities and perform broader “agentic hacking” tasks such as reconnaissance, discovery and lateral movement. 
Biology and chemistry, where the company says the same reasoning that can help researchers design therapies could also help well-resourced malicious actors pursue dangerous biological work. Model distillation, where Anthropic says users may try to extract Claude’s capabilities to train competing models, including models that could be released without similar safeguards. When Fable 5’s classifiers detect one of those categories, the response is automatically handled by Claude Opus 4.8. Anthropic says users will be told when this happens. That is a notable product decision: rather than declining those requests outright, Anthropic is trying to keep the user experience functional while reducing access to the most capable version of the model in sensitive areas. Anthropic says it red-teamed the new classifier system internally and externally. The company says an internal bug bounty produced no universal jailbreaks after more than 1,000 hours of testing, and external red-teaming organizations also failed to find a universal jailbreak. One external partner found that Fable 5 complied with zero harmful single-turn cyber requests related to planning cyberattacks, exploit development or defense evasion, even when prompts used any of 30 public jailbreak techniques, according to Anthropic. The company is still acknowledging tradeoffs. Anthropic says the safeguards are deliberately cautious and may sometimes trigger on benign requests. That could frustrate security professionals, biology researchers and advanced enterprise users whose legitimate work overlaps with the blocked categories. The company says it plans to reduce false positives over time. Mythos 5 and the restricted frontier While Fable 5 is the broad commercial launch, Mythos 5 is the model to watch for enterprises operating in security, critical infrastructure and life sciences. The company says all users with Claude Mythos Preview access can upgrade to Mythos 5 beginning today. 
It plans to expand access through a trusted access program, in collaboration with the U.S. government. The distinction is important for sectors where the blocked capabilities are not edge cases but core workflows. A security team may need to reproduce vulnerabilities, test exploitability, analyze lateral movement or simulate attacker behavior in a controlled environment. A biology research team may need to reason through molecular design workflows that would trigger general-use safeguards. Fable 5 is not designed to give every user unrestricted access to those capabilities; Mythos 5 is designed for vetted users who need them. Anthropic says Mythos 5 has the strongest cybersecurity capabilities of any model in the world. In the company’s benchmark table, the model family scores 78.0% on ExploitBench, compared with 69.0% for Claude Mythos Preview, 40.0% for Opus 4.8 and 34.0% for GPT-5.5. On CyberGym, Anthropic’s chart shows Mythos 5 at 83.8%, slightly ahead of Mythos Preview at 83.1% and far above Opus 4.8 with default safeguards. The company is making a similar argument in biology. Anthropic says Mythos-class models outperform dedicated protein language models on a task involving adeno-associated viruses, a delivery mechanism used in gene therapies. The company frames that as both promising and risky: the same capability that could help gene therapy research could also be misused in dangerous biological work. Anthropic says its internal protein design experts used Mythos 5 to accelerate parts of the drug design process by about tenfold. In one example, the company says Mythos 5, using protein design and bioinformatics tools without human assistance, matched or beat skilled human operators by choosing binding sites, selecting and running tools, and recovering from failures. Anthropic says nine of 14 protein targets in the study produced strong candidates for drug design that it is now investigating. 
The company also says Mythos 5 produced novel molecular biology hypotheses that Anthropic scientists preferred over Opus-class model hypotheses about 80% of the time in blinded comparisons. Anthropic says several of those ideas have advanced to experimental evaluation, and one hypothesis involving an E. coli protein was later corroborated by an independent lab working on the same problem. Those claims are potentially significant, but they should be treated carefully until more details are published. Anthropic says it intends to publish additional results in the coming months. For now, the strongest enterprise implication is directional: the company believes its highest-end models can already perform parts of scientific research workflows with less human intervention than prior systems. New, longer data retention requirement The company also introduced a new data-retention policy for Mythos-class models. Anthropic says it will require 30-day retention for all traffic on Fable 5, Mythos 5 and future models with similar or higher capability levels, across both first-party and third-party surfaces. The company says it will not use that data to train new Claude models or for non-safety purposes, and says it has added privacy protections including logging human access and deleting the data after 30 days in almost all cases. That policy may become one of the most important enterprise buying questions around Fable 5. Many businesses want frontier AI capability but also want strict control over data retention, especially in regulated sectors. Anthropic’s position is that stronger monitoring is necessary for models with this level of capability. Enterprise customers will have to decide whether the capability gain justifies the retention requirement. Enterprise implications The broader enterprise significance of Fable 5 is that Anthropic is trying to commercialize a more autonomous class of AI model without exposing all of its capabilities to every user. 
That could become a template for how frontier labs release increasingly powerful systems: one model family, multiple access tiers, and domain-specific restrictions depending on user trust and risk. If Fable 5 performs as Anthropic and early customers describe, developers may hand off larger tasks: code migrations, refactors, UI builds, test writing, bug fixing, documentation, internal tooling and multi-step app creation. For knowledge-work-heavy enterprises, Fable 5 could make AI more useful in workflows where earlier models were too brittle: finance research, spreadsheet analysis, legal redlines, procurement review, board materials, market research, sales operations and project planning. The main gain is not just better answers; it is fewer turns, fewer corrections and more ability to keep working through ambiguity. For security teams, the launch is more complicated. Most organizations will get Fable 5, not unrestricted Mythos 5. That means they may see stronger general coding and analysis, but not full access to the cyber capabilities Anthropic considers risky. Trusted defenders inside Project Glasswing will get Mythos 5, giving them a more direct way to use the model for vulnerability discovery and defensive testing. For life sciences companies, the pattern is similar. Fable 5 may help with general research, literature analysis, data interpretation and scientific reasoning, but the more sensitive biological capabilities will be restricted. Anthropic is effectively creating a separate access path for vetted researchers whose work requires capabilities that could be dangerous in the wrong hands. The launch also raises competitive pressure across the AI industry. Anthropic is claiming state-of-the-art results across agentic coding, knowledge work, vision, cybersecurity, legal reasoning, spatial reasoning and health benchmarks. But the more strategically important claim may be that it has found a workable release mechanism for models above its Opus class. 
If Fable 5’s safeguards hold up under real-world use, Anthropic will argue it can bring more powerful models to market sooner without fully opening the riskiest capabilities. That is still a large “if.” The enterprise market will test not only Fable 5’s benchmark performance, but also its reliability, false-positive rate, data-retention tradeoffs and cost at scale. A model that can complete more work autonomously can also burn more tokens, trigger more governance questions and create new review burdens for teams that must verify its output. Still, today’s launch marks a clear shift in the Claude lineup. Opus is no longer Anthropic’s top commercial capability tier. Mythos-class models now sit above it. Fable 5 is the first version of that tier for general users; Mythos 5 is the restricted version for trusted high-risk work. Together, they show how Anthropic plans to push frontier AI deeper into enterprise workflows while trying to keep the most dangerous capabilities gated.

Presented by Norton

For 39 days this summer, the planet will be doing roughly the same thing at the same time. The 2026 World Cup spans 104 matches across 16 cities in the United States, Canada, and Mexico, with billions of people likely to watch over the course of the tournament. It could very well be one of the largest shared events the internet has ever been asked to carry. What’s changed since the last tournament isn’t the scale, it’s the screen. For a growing share of that audience, the match won’t come through television. It’ll come through a browser tab. The problem is that the browser you have today simply does not give you a frictionless and reliable way to watch the World Cup for free. In the U.S., a majority of viewers now expect to stream the tournament digitally rather than watch on cable or satellite, but that only works with a paid subscription. That is a challenge for, say, fans traveling from Europe who want to watch matches in the U.S. the same way they can watch them for free at home.

Watching the World Cup has been harder than it should be

Ask anyone who has tried to watch a tournament online and the answers are remarkably consistent across every cycle. Streams stutter and buffer when it matters most. The “free stream” someone forwarded turns out to be a chain of lookalike sites and dead-end links. And the legitimate platforms want a credit card, and maybe a generous helping of personal data too, before they’ll play a single minute.

One company’s bet: Neo

Norton’s answer is Neo, a browser built on the premise that protection and access can live in the browser software itself rather than in a stack of add-ons the user has to go find, install, and pay for. The idea is less about adding features than removing friction: take away the steps between a person and the thing they came to do. 
“The tournament is the kind of moment the modern web was supposed to be great at: everyone, everywhere, on the same thing in real time,” says Howie Xu, Chief AI and Innovation Officer at Gen and its family of brands including Norton. “It sometimes takes a PhD to figure out how to watch matches the right way. Our view is that the browser should have done more of the heavy lifting. That’s why we reinvented the browser to enable a true frictionless, safe, and fast access to the contents they deserve.” It's a notable position coming from a security brand that historically sold protection as a separate thing you bought and remembered to run. But Neo removes this separation. Now, they are reinventing the model so the browser is the whole solution for safe, frictionless, fast streaming. The scams arrive before the match does Before official ticket sales opened, fraudsters were already working the tournament: fake listings, cloned resale sites, phishing messages built to harvest money and personal details. The methods are well-worn. Counterfeit “official” portals copy real branding behind lookalike URLs; phishing emails impersonate organizers and dangle exclusive access; social ads promise guaranteed seats at suspiciously low prices and deliver a doctored PDF, or nothing. The same logic follows fans to streaming, where the cheapest, most convenient link is often the most dangerous one. This is where Norton’s back catalogue shows up inside everyday browsing. Anti-phishing, scam-site detection, and malicious-page blocking run in the background, flagging dangerous links as they appear rather than after a card number has already been entered. Whether that’s enough to change fan behavior is the open question. People are remarkably willing to click past a warning when a match is about to kick off. But moving the safeguard into the browser, instead of a separate app, puts it where the risk actually is. 
Access, without the setup wizard There’s also the matter of simply reaching a legitimate stream. Finding officially licensed providers by country, throttled connections at peak hours, and varying restrictions across platforms all get in the way. The usual remedy is configuring a separate VPN with its own account and billing, which is its own kind of friction. Neo folds Norton’s award-winning VPN technology into the browser itself, and it can easily be turned on or off. That matters most in the situations the tournament actually creates: connecting through an unfamiliar hotel network, an airport layover, a bar’s public Wi-Fi where a sizable share of fans say they’ll watch. Neo also builds the search for a legitimate stream directly into the browser. It has a dedicated widget with live game schedules, match reminders, and direct streaming links for every game, surfacing the right licensed source for your market without a separate search. “Most people don’t want to manage their security or legitimacy of a link, they want to watch the game,” Xu says. “So we shifted the burden away from the person. The protection is on, the connection is private, and you never had to set anything up.” Calm by design Underneath the tournament use case is the idea Neo keeps coming back to: calm by design, with privacy and security working together inside a clean interface rather than buried in a settings menu. Because the browser can anticipate rather than wait to be asked, it surfaces what a fan likely wants next. A reminder about an upcoming match, a quick summary of the day’s results, a nudge to resume where they left off. Personal data stays on the device unless the person decides otherwise. Whether this approach wins a meaningful share of a market Chrome still dominates is far from settled. But Norton Neo says they reinvented a browser to make 5.8 billion potential viewers’ lives easier. Fans can explore available streams for their market at lp.neobrowser.ai/tournament_stream. 
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

A joint research collaboration among researchers at the University of Illinois at Urbana-Champaign (UIUC), UC Berkeley, and the open source AI-native vector database platform Chroma unveiled Harness-1, a 20-billion-parameter open-source search agent built atop OpenAI's gpt-oss-20B open source model that fundamentally redesigns how AI executes complex retrieval tasks. Harness-1 achieves a massive leap in performance, scoring a 73% average on its ability to correctly recall relevant information from a curated dataset, outperforming even GPT-5.4 (70.9%) and beating the next most accurate open source search agent, Tongyi DeepResearch 30B, by 11.4 percentage points. (While GPT-5.5 has also been out for more than a month, the researchers didn't test against this model as it wasn't available when they were building theirs.) Crucially for developers, the model and its environment are available immediately under the highly permissive Apache 2.0 license, with the model code and weights on Hugging Face. Harness-1 also serves as proof-of-efficacy of another effort, Tinker, the distributed, web-based AI model training and fine-tuning API developed by Thinking Machines. Tinker was used specifically to train and run inference for Harness-1, highlighting how interactive infrastructure is actively enabling the next generation of autonomous models. So how did the researchers do it?

Benchmarks Decoded (and Why Harness-1 Could Help Enterprises Tremendously)

To actually put these models to the test, the researchers evaluated Harness-1 and its competitors across eight highly complex search benchmarks. Rather than asking simple trivia questions, these tests required the AI to act like a real researcher sifting through diverse, dense data sources. 
The benchmarks spanned several different domains, including open web searches, complex financial filings from the SEC, technical patent databases from the USPTO, and "multi-hop" question-answering tasks where the AI had to logically piece together scattered clues from multiple different documents to arrive at the correct answer. When the results came in, Harness-1 dominated the open-source competition in its ability to successfully find and curate the right facts. Even more impressively, this relatively small 20-billion-parameter model went toe-to-toe with massive, expensive proprietary AI systems. It actually outperformed heavyweights like GPT-5.4, Sonnet-4.6, and Kimi-K2.5, which are thought to have hundreds of billions or even trillions of parameters. Only one giant frontier model, Opus-4.6, managed to narrowly edge it out in overall average performance. Harness-1 achieves its performance gains by offloading the exhaustive "bookkeeping" of a search session out of the model's working memory and into a structured software environment. As enterprise use cases grow more sophisticated, demanding that models autonomously sift through thousands of corporate documents or financial filings, these systems frequently succumb to "search amnesia": forgetting their original queries, looping over rejected documents, or losing track of the specific claims they are trying to verify. Until now, the prevailing solution to this amnesia has been brute force. Engineers typically force models to constantly reread an ever-expanding, append-only transcript of their own actions, piling every search, read, and thought back into a massive context window. Harness-1 introduces a paradigm shift away from this method, proving that the bottleneck for true artificial autonomy isn't necessarily the size of the model, but how efficiently its working environment manages state. 
It highlights once more, as Anthropic's Claude Code has also done, that the raw model is arguably less important than the harness — or set of conditions — through which it runs. Technology: Doing the Paperwork in the Environment To understand the technical leap of Harness-1, consider a real-world analogy. Imagine hiring a brilliant research assistant and placing them in an empty room without a desk, notepads, or filing cabinets. You ask them to write a comprehensive report on a highly complex topic, which requires them to read dozens of books while keeping every single quote, citation, and dead-end search perfectly memorized in their own head. Eventually, no matter how intelligent the assistant is, their cognitive load will max out, and they will start dropping facts or losing the thread of the assignment. This is exactly how traditional search agents operate today. They are trained as policies over growing transcripts, meaning the model searches, reads, searches again, and appends everything into its own context window. As lead researcher Patrick (Pengcheng) Jiang of the University of Illinois noted on X: "At some point the model is not just 'searching' anymore. It is also being asked to be a memory system, a note taker, a verifier, and a librarian." Harness-1 solves this by giving the AI a desk and a filing cabinet—what the research team calls a "state-externalizing harness." This harness is an active, surrounding environment that takes over the routine bookkeeping, maintaining a recoverable working memory that includes a candidate pool of documents, an importance-tagged curated evidence set, compact evidence links, and verification records. By separating semantic choices from structural state management, the AI is freed up to do what it does best. The policy still decides what to search, determines which documents to keep, and knows when to stop, while the environment simply holds the state. 
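To make the "desk and filing cabinet" idea concrete, here is a minimal Python sketch of what a state-externalizing environment could look like. It is an illustration only, not the researchers' implementation: the class name `SearchHarness`, its methods, and the substring-based verification are all assumptions invented for this example. The point it shows is the division of labor described above: the policy makes the semantic calls (what to add, verify, promote or reject), while the environment holds the candidate pool, the importance-tagged curated set and the verification records.

```python
from dataclasses import dataclass, field


@dataclass
class SearchHarness:
    """Hypothetical environment that keeps search state outside the model's context."""

    candidates: dict[str, str] = field(default_factory=dict)  # doc_id -> document text
    curated: dict[str, str] = field(default_factory=dict)     # doc_id -> importance tag
    verified: dict[str, bool] = field(default_factory=dict)   # doc_id -> did the claim check out?
    rejected: set[str] = field(default_factory=set)           # doc_ids the policy discarded

    def add_candidates(self, docs: dict[str, str]) -> list[str]:
        """Store search results, skipping duplicates and previously rejected documents."""
        new_ids = [d for d in docs if d not in self.candidates and d not in self.rejected]
        for doc_id in new_ids:
            self.candidates[doc_id] = docs[doc_id]
        return new_ids

    def verify(self, doc_id: str, claim: str) -> bool:
        """Toy verification (an assumption): case-insensitive substring match."""
        ok = claim.lower() in self.candidates[doc_id].lower()
        self.verified[doc_id] = ok
        return ok

    def promote(self, doc_id: str, importance: str) -> bool:
        """Add a document to the curated set only if it has been verified."""
        if not self.verified.get(doc_id, False):
            return False
        self.curated[doc_id] = importance
        return True

    def reject(self, doc_id: str) -> None:
        """Drop a candidate and remember it so later searches do not re-add it."""
        self.candidates.pop(doc_id, None)
        self.rejected.add(doc_id)

    def summary(self) -> str:
        """Compact state a policy could read each turn instead of a full transcript."""
        return (f"candidates={len(self.candidates)} "
                f"curated={sorted(self.curated)} rejected={len(self.rejected)}")


# Example session: the "policy" decisions are hard-coded here for illustration.
harness = SearchHarness()
harness.add_candidates({"d1": "Acme reported revenue of $5M in 2023.", "d2": "Unrelated text."})
harness.verify("d1", "revenue of $5M")
harness.promote("d1", "high")
harness.reject("d2")
print(harness.summary())
```

In this toy version a rejected document can never re-enter the candidate pool, which is one simple way an environment could prevent the "looping over rejected documents" failure mode without the model having to remember anything.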
Training Harness-1: A Masterclass in Data Efficiency

The training pipeline for Harness-1 represents a fundamental shift in how the AI industry approaches agentic learning. Historically, developers have treated search agents as policies operating over massive, ever-growing transcripts, forcing reinforcement learning (RL) algorithms to simultaneously optimize both semantic reasoning and the raw memorization of a search state. Harness-1’s creators took a radically different approach: because their custom "harness" handles all the routine bookkeeping (such as maintaining evidence links, candidate pools, and verification records), the training process only needed to teach the model how to operate this structured interface. This division of labor drastically simplified what the underlying 20-billion-parameter model actually needed to learn. The process began with a remarkably narrow Supervised Fine-Tuning (SFT) stage. Rather than scraping petabytes of new behavioral data, the team generated just 899 filtered trajectories using a GPT-5.4 teacher agent that was plugged into the exact same harness environment the student model would eventually use. The goal of this SFT phase was not to inject vast amounts of domain knowledge into the model, but simply to teach it the mechanical rhythms of a good researcher: how to format tool calls, how to tag documents by importance, and the discipline of verifying a claim before promoting it to the final curated set. Following SFT, the model underwent Reinforcement Learning (RL) using an algorithm called CISPO, applied over full search episodes capped at 40 turns. The team designed a highly specific terminal reward function that explicitly separated discovery from selection. 
The model was rewarded not just for finding a relevant document, but for successfully promoting it into the final answer set, while being penalized if it found the answer but failed to curate it. The researchers also instituted a "tool diversity" bonus; without this specific incentive, they found the policy would quickly collapse into a lazy, search-heavy strategy where it spammed queries but bypassed the harder work of reading and verifying the text. What makes Harness-1 truly innovative compared to prior work is its unprecedented data efficiency. The entire model was trained on roughly 4,400 unique items—899 SFT trajectories and 3,453 RL queries. In stark contrast, competing open-source models required vastly larger datasets to achieve worse results: Context-1 utilized over 17,200 training items, while Search-R1 relied on a staggering 221,300 items to learn search behaviors. By proving that a smarter external cognitive architecture can replace brute-force data scaling, Harness-1 suggests that the future of agentic AI lies in building better environments for models to work within, rather than just training larger models on more data. Product: Enterprise Applicability and Generalization From a product perspective, Harness-1 is delivered as a highly capable 20B agent merged into the openai/gpt-oss-20b base architecture. For enterprise tech stacks, the applicability is massive because businesses need AI to execute multi-step research across proprietary databases without hallucinating or running up exorbitant compute bills. Harness-1 manages its frontier-level performance at what the creators describe as "Context-1-level cost and latency." Because the context window is strictly managed by the budget-aware harness rather than continuously expanding, enterprises can deploy this agent autonomously without incurring the exponential token costs typically associated with long-horizon AI tasks. 
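A back-of-the-envelope calculation shows why a managed context matters for cost. The Python sketch below compares cumulative input tokens for an append-only transcript with a harness that caps the context at a fixed budget per turn. The function names and the specific numbers (2,000 tokens added per turn, an 8,000-token budget) are assumptions chosen for illustration and do not come from the researchers; only the 40-turn cap is from the paper as reported. Under these simplified assumptions, the append-only cost grows with the square of the turn count, while the bounded cost grows linearly.

```python
def append_only_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every turn re-reads the full transcript so far.

    Assumes the transcript holds t * tokens_per_turn tokens at turn t,
    so the total is tokens_per_turn * T * (T + 1) / 2.
    """
    return tokens_per_turn * turns * (turns + 1) // 2


def bounded_tokens(turns: int, context_budget: int) -> int:
    """Total input tokens when the environment caps the context at a fixed budget."""
    return turns * context_budget


# Illustrative numbers only: 40 turns, 2,000 new tokens per turn, 8,000-token budget.
append_only = append_only_tokens(turns=40, tokens_per_turn=2_000)
bounded = bounded_tokens(turns=40, context_budget=8_000)
print(f"append-only: {append_only:,} tokens")
print(f"bounded:     {bounded:,} tokens")
print(f"ratio:       {append_only / bounded:.1f}x")
```

The exact ratio depends entirely on the assumed per-turn growth and budget, so the output should be read as a shape-of-the-curve illustration rather than a measurement of Harness-1.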
Even more impressively, Harness-1 proves it can generalize well beyond its training data. According to the research team, it was incredibly cheap to train, utilizing just 899 filtered supervised fine-tuning (SFT) trajectories and a mere 3,453 reinforcement learning (RL) queries. "Instead of training the model to survive a giant append-only transcript, we train it to use a structured search interface: search, curate, revisit, verify, and submit," Jiang explained. This leanness proves a critical point for the AI industry: developers do not necessarily need petabytes of new behavioral data if they build a better cognitive framework for the model to operate within. What Does Harness-1 Mean for RAG? At its most basic level, standard or "naive" Retrieval-Augmented Generation (RAG) operates on a simple, one-step pipeline: a user asks a question, the system searches a database, and the raw results are fed into a Large Language Model (LLM) to generate an answer. While effective for simple queries, naive RAG frequently stumbles on complex enterprise tasks. If the initial search misses the mark, or if the question requires logically connecting clues across multiple documents, the system cannot dynamically adjust—it simply generates a wrong or incomplete answer based on the insufficient data it grabbed. Harness-1 does not eliminate this paradigm; rather, it introduces a massive leap forward into what the industry is increasingly calling "agentic" or "modular" RAG. When asked on X by VentureBeat if this new model could replace or obviate the need for RAG entirely, lead researcher Jiang clarified the project's scope. "I wouldn’t say it replaces RAG. Harness-1 still builds on retrieval and external evidence," Jiang told me. "Our focus is making RAG-style search more agentic and trainable, by externalizing search state into a harness and training the model to use it effectively." 
In practice, this means Harness-1 is designed to be the ultimate retrieval module within an advanced AI architecture. Instead of just fetching documents and immediately forcing an LLM to generate an answer, Harness-1 acts as an autonomous subagent. It can spend up to 40 turns actively investigating a complex query. During this phase, it executes multiple diverse searches, reads full documents, explicitly verifies claims against the text, and carefully edits a prioritized "curated set" of evidence. The system's unique "state-externalizing harness" acts as an organized working memory for this entire process, filtering out duplicates and maintaining a structured notebook of the best clues. The "Generation" part of RAG only happens at the very end of this exhaustive search process. Once Harness-1 has finalized its curated set of documents, it hands that highly refined bundle of evidence over to a completely separate "frozen" frontier model—such as GPT-5.4, Sonnet-4.6, or Opus-4.6—which takes on the role of the final answer generator. By cleanly decoupling the heavy lifting of multi-step evidence gathering from the final act of writing the answer, Harness-1 proves that providing generators with much better, meticulously curated context results in significantly higher answer accuracy than naive RAG systems. Licensing: The Power of Apache 2.0 One of the most significant aspects of the Harness-1 release is its licensing. In plain language, Apache 2.0 is a highly permissive, enterprise-friendly software license that fundamentally enables commercialization. Unlike "copyleft" licenses (such as the GPL) that can force companies to open-source their own proprietary software if they integrate the code, or "research-only" licenses that ban commercial use entirely, Apache 2.0 gives businesses the green light to freely build, modify, and monetize the technology. 
For developers and startups, this means Harness-1 can be seamlessly integrated into commercial enterprise search products, internal data retrieval tools, or customer-facing AI applications without fear of legal reprisal. The only major requirement is that users must include the original copyright notice and explicitly state any significant modifications they make to the source code, positioning Harness-1 as a highly viable foundational building block for the enterprise. Community Reactions: A Resounding Validation The announcement has clearly struck a nerve within the developer community, validating the very real pain points engineers face when building agentic systems. Jiang’s multi-part announcement thread on X quickly garnered massive traction, pulling in over 256.1K views, 3.7K likes, 2.9K bookmarks, and nearly 300 reposts within a matter of days. This high engagement underscores a growing consensus in the AI space that brute-forcing context windows is a losing battle. When Jiang posted on X, "I’ve been wondering: maybe search agents are bad at search partly because we make them do all the paperwork in their head," the resonance was immediate. For developers who have spent the last year wrestling with AI agents that confidently forget their primary instructions halfway through a database search, the Harness-1 approach feels like a desperately needed course correction. Ultimately, the community sentiment highlights a shift in industry priorities. Developers are moving away from asking how large an AI model's context window can get, and instead asking how efficiently an AI model's environment can manage that context for it. By offloading the paperwork, Harness-1 is proving that smaller, smarter systems can outmaneuver the giants—provided they have the right desk to work at.

Agentic AI is now a core part of the engineering process, driving massive execution leverage and helping us generate more code than ever before. Yet, a difficult question I’ve increasingly heard from business leaders is: if we’re shipping code faster than ever, why aren’t our products improving at the same rate? The reason is that writing code was never the rate limiter. Defining the right requirements, integrating with complex systems, and maintaining software under real-world conditions has always been the hard part. And when agents flood an organization with lots of new code, the hard part only gets harder. Agents compress execution time. They do not compress ambiguity, accountability, or operational complexity. As AI-generated code scales, human review is becoming a massive new bottleneck, and engineers are losing the context needed to catch agent mistakes. The companies that understand this will move forward deliberately and even create new roles because of AI. The ones that don’t will default to a simpler, far more destructive conclusion: Reduce headcount and increase AI spend. The playbook Irreversible structural decisions demand caution, precisely because the technology is moving so fast. Enterprise engineering leaders need a deliberate playbook to navigate the chaos. Here's how to start: Phase 1: Financial and risk governance Protect the downside — secure the infrastructure and cap the financial bleeding. Treat governance as a tier-one risk: The pressure to integrate AI is real, but giving teams the freedom to experiment without a centralized structure creates fragmented processes, duplicated work, and runaway costs. Organizations will need to establish shared standards while still allowing teams to adapt and explore within defined boundaries. This means treating agent configuration like production infrastructure — versioning, reviewing, and testing prompts and skills before rolling them out gradually. 
Enforce least privilege for non-human actors: Never allow an agent to simply inherit the full permissions of its human operator. Human engineers are granted broad access because they possess contextual judgment and bear ultimate accountability. Deploying agents with human-level access without careful consideration introduces an accountability gap into your systems. Implement strict separation between read and write/execute access, and mandate human-in-the-loop approval gates for destructive or production-altering actions. As agents transition from suggesting code to autonomously executing tasks, they must be rigorously incorporated into your security model. Watch your wallet: Protect your overall AI budget by enforcing quotas and rate limits for both engineering and production. Cautionary tales are increasingly common: Uber capped its AI spend after burning its 2026 budget by April, and, according to Axios, an unnamed company incurred a staggering $500 million Anthropic bill in a single month due to runaway agentic loops. Phase 2: Technical strategy Build the engine: Choose the right models and measure their success. Go multi-model and multi-vendor: No single model excels at every task. It's important to precisely characterize the behavior and performance boundaries across models to understand where each excels, routing specific tasks to the systems best equipped to handle them. Standardizing on a single vendor or model sacrifices capabilities and introduces a critical single point of failure. No organization should absorb that level of concentration risk in its core engineering function. Pay for the frontier: Treat AI as engineering leverage, not just another SaaS expense. Pay for premium frontier models that deliver the highest quality output and reduce costly rework. Ultimately, the cheapest model isn't the one with the lowest token price — it’s the one that maximizes efficiency while minimizing your downstream risk. 
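The Phase 1 recommendations on least privilege, approval gates and spend caps can be pictured as a small policy check that runs before an agent acts. The Python sketch below is a hypothetical illustration: the `AgentPolicy` class, the action names and the dollar figures are assumptions made for this example and do not correspond to any particular product or vendor API. It shows an agent with its own explicit allowlist (rather than its operator's permissions), a human-approval requirement for destructive actions, and a hard budget ceiling.

```python
from dataclasses import dataclass

# Hypothetical set of actions that should never run without a human sign-off.
DESTRUCTIVE_ACTIONS = frozenset({"delete", "deploy", "migrate"})


@dataclass
class AgentPolicy:
    """Per-agent permissions and budget, kept separate from any human operator's access."""

    allowed_actions: frozenset[str]
    monthly_budget_usd: float
    spent_usd: float = 0.0

    def authorize(self, action: str, est_cost_usd: float, human_approved: bool = False) -> str:
        # Least privilege: anything not explicitly granted is refused.
        if action not in self.allowed_actions:
            return "denied: not permitted"
        # Human-in-the-loop gate for destructive or production-altering actions.
        if action in DESTRUCTIVE_ACTIONS and not human_approved:
            return "pending: human approval required"
        # Hard spend cap to stop runaway loops from exhausting the budget.
        if self.spent_usd + est_cost_usd > self.monthly_budget_usd:
            return "denied: budget exceeded"
        self.spent_usd += est_cost_usd
        return "allowed"


# Example: a read-mostly agent that may deploy only with approval.
policy = AgentPolicy(allowed_actions=frozenset({"read", "deploy"}), monthly_budget_usd=100.0)
print(policy.authorize("read", est_cost_usd=10.0))
print(policy.authorize("write", est_cost_usd=1.0))
print(policy.authorize("deploy", est_cost_usd=5.0))
print(policy.authorize("deploy", est_cost_usd=5.0, human_approved=True))
```

A real deployment would enforce these checks in the platform's identity and billing layers rather than in application code, but the ordering here (permission, then approval, then budget) reflects the same priorities the playbook describes.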
Measure what actually matters: Deployments, lines of code, and pull requests were never good metrics for productivity, and with AI, they are actively misleading. Instead, aim for metrics that are attached to business outcomes (feature adoption, retention) and engineering durability (change failure rate, escaped defects, code survival over time). For AI efficiency, measure task success per dollar and rework time. Token counts are convenient for leaderboards but they cannot tell you if the tokens were well spent.

Phase 3: Talent and organization

Realign your human capital to manage the new bottleneck.

Shift engineers from syntax to systems: As agents handle the bulk of code generation, human review and architectural alignment are the new bottlenecks. Organizations must deliberately upskill their workforce to transition from syntax-writers to systems-thinkers and agent-managers. Engineers need the training and mandate to guide agentic processes, manage complex cross-system integrations, and hold the overarching architectural vision that agents can struggle to maintain.

Redefine performance and incentives: When an individual engineer can generate the output of a former squad, traditional metrics like story points or sprint velocity can become ineffective overhead. Consider realigning your evaluation frameworks to better reward expanded business impact, cross-system reliability, and effective agent orchestration. If you want systems-thinkers who cover more strategic surface area, are willing to explore and take risks, and build products in a durable way, you must reward them for higher level impact, not sheer volume of output.

Don’t cut headcount before your strategy adapts: If you haven't integrated agentic workflows, measured augmented output in production, and reworked your roadmap around faster execution, you do not actually know whether your needs and capabilities align. Cutting headcount before establishing that baseline isn't discipline; it’s blindness.
The goal is not simply smaller teams, but teams capable of covering more strategic surface area.

Enterprise AI adoption requires human elasticity

AI is not a replacement for engineering judgment; it is a force multiplier for it. In well-structured systems, it safely accelerates delivery. In poorly understood systems, it accelerates failure.

We are already seeing the fallout: Outages, rising technical debt, and unexpected cost spikes driven by poorly governed adoption. These are operational failures, not theoretical risks. The mistake organizations are now making isn’t adopting AI too slowly; it’s adopting it without understanding where it breaks.

For the C-suite, understanding this dynamic is no longer optional; it is the determining factor in how a business navigates this era. The challenge is that execution velocity is outpacing the industry's ability to manage the consequences.

We have handed engineering teams the ultimate power tool. The old adage demands that you measure twice and cut once. Instead, too many firms are opting to just cut.

Joe Bertolami is CTO and co-founder of Clifton AI.

Our system did one thing, and it did it well: It turned natural-language questions into API calls. The users were analysts, account managers, and operations leads. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English.

A request like "Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city" was translated into an API call that the system could act on:

```json
{
  "description": "User requested sales volume for the given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "northeast"
  }
}
```

The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several homegrown services), applied a large language model (LLM)-generated JSON query to filter and shape the response, and delivered it via email, as a Drive document, or rendered as a chart in the browser.

By mid-2025, the system was generating several hundred reports a month. These reports were consumed by leadership and analysts and circulated to external stakeholders. It had become the default way most teams pulled ad-hoc data.

The contract between the LLM and the rest of the system was a structured JSON object, as shown in the example above.

We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and to 4.0 without incident.
By the time Sonnet 4.5 shipped, we had grown complacent about the stability and predictability of LLMs in solving what we believed was a simple problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.

Then we rolled out 4.5. For a meaningful percentage of requests, the model began folding the contents of post_body into the description field. Two failure modes followed.

First, the filter parameters never reached the API. Our system read post_body as the source of truth for the request payload, and that field came back empty. The API call was made without the date range or region filter. Depending on the specific API being called, the backend either returned sales volume for all time or for all regions, or returned a 500 error.

Second, the model started asking clarifying questions in its response. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, would sometimes respond with a question instead. Our system had no path for this. It had been built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to break in multiple ways.

We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure.

Why traditional engineering discipline fails here

Software engineering rests on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests circumscribe what could possibly have moved.
You can leverage the following property: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.

LLM-backed systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.

This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.

Anatomy of the failure

The post-mortem revealed that our prompt had always been under-specified. We had told the model to return a JSON object with three fields. We had described what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.

Earlier versions of the model inferred this constraint from context. Sonnet 4.5, evidently better at being "helpful" in its formatting choices, decided that asking for clarification or providing the request body in the description made the response more useful. From the model's perspective, this was a reasonable interpretation of an ambiguous instruction. However, this violated the assumptions under which our system was built.

The bug was not in the model. The bug was in our assumption that the model would continue to fill in our specification gaps as it always had. Three successful upgrades had trained us to believe those gaps were safe.

Structured output modes and tool-use APIs would have caught this specific failure at the schema level. We weren't using them for engineering reasons outside the scope of this article. But schemas only constrain syntax, not semantics.
A schema cannot specify that a clarifying question shouldn't appear in a system with no path for clarification, or that a date range should never silently default to all-time. Schemas solve the easier half of the problem.

The evals-first architecture

The discipline that closes this gap is to treat the evaluation suite, not the prompt, as the formal specification of the system. The prompt is an implementation of the spec. The model is an interpreter. The evals are the spec itself, and any model or prompt change is valid if and only if it passes them.

In practice, an eval is a triple: An input, a property the output must satisfy, and a scoring function. For our system, the eval that would have caught the 4.5 regression looks roughly like this:

```python
def test_description_contains_no_serialized_payload(response):
    # The description must stay a natural-language string, not a dump of the payload.
    desc = response["description"].lower()
    forbidden = ["curl", "post_body", "{", "http://", "https://"]
    assert not any(token in desc for token in forbidden), \
        f"description leaked structured content: {response['description']}"
```

A few hundred such properties, some written by hand for known-important invariants, some generated as regression tests from real production traffic, some scored by an LLM-as-judge for fuzzier qualities like tone, become a gate. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before they merge.

Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance in outcomes. And the suite can only catch failure modes you have thought to specify: you cannot eval your way to safety against a category of failure you have never imagined. We learned this lesson the hard way: Nobody on our team had ever written an assertion that said "the description field should not contain a curl command," because nobody had thought the model would put one there.

Evals are not a silver bullet.
They give you the ability to bound the blast radius of a change in the only way available when the underlying function is a black box: By densely sampling the input-output behavior you actually care about, and refusing to deploy when that behavior moves.

The roadmap

The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what 'coverage' means in natural language input spaces. CI/CD systems were not built to gate probabilistic test outcomes.

As agents take on more autonomous work (writing code, moving money, scheduling infrastructure changes), the gap between "the model passed our smoke tests" and "we know what this system will do in production" becomes the central engineering problem of the next several years. The teams that close that gap will be the ones who stop treating evals as a quality-assurance afterthought and start treating them as the actual specification of what their system is.

Vijay Sagar Gullapalli is Founding AI Engineer at Adopt AI and a USPTO-patented inventor. Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.
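As an illustration of the evals-as-gate idea in the article above, here is a minimal sketch of how a suite of properties might block a model upgrade. It is not the authors' implementation. The names `Eval`, `run_gate`, `good_model`, and `regressed_model` are hypothetical, the "models" are stub functions standing in for real LLM calls, and the scoring function is simplified to a plain pass rate over boolean checks.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Eval:
    name: str                       # label reported when the eval fails
    prompt: str                     # the input sent to the model
    check: Callable[[dict], bool]   # property the parsed output must satisfy


def filters_present(resp: dict) -> bool:
    # The request payload must carry every filter the backend relies on.
    body = resp.get("post_body")
    keys = ("start_date", "end_date", "region")
    return isinstance(body, dict) and all(k in body for k in keys)


def description_plain(resp: dict) -> bool:
    # The description must not contain serialized structured content.
    desc = str(resp.get("description", "")).lower()
    return not any(tok in desc for tok in ("curl", "post_body", "{"))


def is_api_call(resp: dict) -> bool:
    # Every response must name an endpoint; a clarifying question has none.
    call = resp.get("api_call")
    return isinstance(call, str) and call.startswith("/api/")


def run_gate(model: Callable[[str], str], evals: List[Eval],
             min_pass_rate: float = 1.0) -> Tuple[bool, List[str]]:
    """Run every eval against the model; return (deployable, failed names)."""
    if not evals:
        raise ValueError("an empty suite cannot gate a deployment")
    failures = []
    for ev in evals:
        try:
            resp = json.loads(model(ev.prompt))
            ok = isinstance(resp, dict) and ev.check(resp)
        except json.JSONDecodeError:
            ok = False  # free-text output (e.g. a question) breaks the contract
        if not ok:
            failures.append(ev.name)
    pass_rate = 1 - len(failures) / len(evals)
    return pass_rate >= min_pass_rate, failures


PROMPT = "Sales volume for January through March 2026, Northeast region"
EVALS = [
    Eval("filters_present", PROMPT, filters_present),
    Eval("description_plain", PROMPT, description_plain),
    Eval("is_api_call", PROMPT, is_api_call),
]


def good_model(prompt: str) -> str:
    # Stub that honors the contract.
    return json.dumps({
        "description": "User requested sales volume for the given date range",
        "api_call": "/api/sales_volume",
        "post_body": {"start_date": "2026-01-01", "end_date": "2026-03-31",
                      "region": "northeast"},
    })


def regressed_model(prompt: str) -> str:
    # Stub that folds the payload into the description and leaves post_body empty.
    return json.dumps({
        "description": 'Call with post_body {"region": "northeast"}',
        "api_call": "/api/sales_volume",
        "post_body": {},
    })


if __name__ == "__main__":
    print(run_gate(good_model, EVALS))
    print(run_gate(regressed_model, EVALS))
```

In a real pipeline the stubs would be replaced by calls to the candidate model, and `min_pass_rate` could be relaxed below 1.0 for properties scored by a noisy judge. Treating an unparseable response as a failure is one way to encode "no clarifying questions" as a checkable property instead of an unstated assumption.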
For three years, Microsoft's artificial intelligence story has been inseparable from OpenAI. The partnership — cemented by a cumulative investment exceeding $13 billion — gave Microsoft early access to the most advanced AI models on the planet, catapulting its Copilot products into the enterprise mainstream and adding hundreds of billions of dollars to its market capitalization. To the outside world, Microsoft's AI strategy was OpenAI. Mustafa Suleyman wants to change that narrative. In an exclusive sit-down interview with VentureBeat at Microsoft Build 2026, the CEO of Microsoft AI disclosed that a contractual change with OpenAI roughly six months ago granted his division the formal authority to pursue what he openly calls "superintelligence" — using Microsoft's own researchers, its own data pipelines, and its own custom silicon. "We were only sort of set free from our contract with OpenAI about six months ago to formally pursue superintelligence," Suleyman said. "So this is very early days." The comment, delivered matter-of-factly backstage at the Fort Mason Center here, offers the clearest signal yet of a strategic inflection point unfolding inside the world's most valuable public company. Microsoft is not abandoning OpenAI. But it is building something alongside it — and, eventually, something that could stand entirely on its own. Microsoft's first in-house model family signals a new level of AI ambition The most tangible evidence of that shift arrived the same day. Microsoft announced a family of seven new AI models developed entirely in-house by its AI Superintelligence Team, spanning reasoning, code generation, image creation, transcription, and voice synthesis. The models — branded under the "MAI" family name — are Microsoft's most ambitious first-party AI release to date. 
The flagship, MAI-Thinking-1, is a 35-billion-active-parameter reasoning model that Microsoft says matches leading models in its weight class on key software engineering benchmarks and demonstrates advanced mathematical reasoning. Suleyman emphasized one point repeatedly: the model was trained from scratch on clean, commercially licensed data, without distillation from third-party frontier models — a direct, if unstated, contrast to the widespread industry practice of using outputs from competitors' systems to train cheaper alternatives. "We train our reasoning models from scratch," Suleyman wrote in a blog post accompanying the announcement. "We don't distill from other labs and we don't rely on unlicensed or opaque data." The rest of the family fills out a multimodal portfolio designed for enterprise deployment: MAI-Code-1-Flash, a lightweight coding model built specifically for GitHub Copilot and VS Code; MAI-Image-2.5, which supports both text-to-image and image editing; MAI-Transcribe-1.5, which Microsoft claims is the most accurate transcription model available, operating across 43 languages; and MAI-Voice-2, a multilingual speech-generation system. All of the models ship through Microsoft Foundry, the company's model-hosting and deployment infrastructure, and for the first time, developers can tune model weights themselves through third-party platforms including OpenRouter, Fireworks, and Baseten. But Suleyman made clear in the interview that the seven models are a proof of concept, not a finished product. The real project is the lab itself. "Our job is to make sure that when we look out to 2030 and beyond, we have the capacity not just to buy models from third parties, but to build the absolute frontier, the best models in the world," he said. "That's a long transition." 
What "set free" from OpenAI actually means for Microsoft's AI future To understand what Suleyman means by "set free," you need to understand the unusual contractual architecture that has governed Microsoft's AI efforts for years. When Microsoft invested billions into OpenAI beginning in 2019, the partnership came with a specific arrangement: OpenAI would build the frontier models, and Microsoft would serve as the exclusive cloud provider, integrating those models into its products and reselling them through Azure. The deal gave Microsoft extraordinary commercial leverage — access to the world's most advanced AI without having to build it — but it also created a dependency. Microsoft was explicitly barred from pursuing its own AGI research, and the agreement even capped how large a model the company could train, restricting it from building systems beyond a certain computing threshold measured in FLOPS. That arrangement was formally renegotiated. As Fortune and Axios reported in November, a revised deal with OpenAI removed those restrictions, clearing the way for Suleyman to launch the MAI Superintelligence Team and pursue what he calls "humanist superintelligence." The result, in Suleyman's telling at the time, was a "best-of-both environment, where we're free to pursue our own superintelligence and also work closely with them." By the time he sat down with VentureBeat at Build 2026, roughly six months had passed since that self-sufficiency effort formally began. Microsoft had already started shipping in-house models — including MAI-Image-2-Efficient, a lighter-weight image generation model released in April — but the seven MAI models announced at Build are the team's most ambitious release yet: a full multimodal family spanning reasoning, code, image generation, transcription, and voice. Even so, Suleyman does not view the shift as a rupture with OpenAI. He described Microsoft's current position as one of abundance, not scarcity. 
"There's no immediate urgent need to fill a gap in three months' time or six months' time," he said. "We have OpenAI, we have Anthropic, we have thousands of models inside Foundry. So there's already a huge amount of optionality available to us." The framing is telling. Microsoft's push into first-party frontier models is not born out of a crisis in the OpenAI relationship but out of a strategic calculation: as AI becomes the most consequential technology layer in enterprise computing, the company cannot afford to depend entirely on partners for the foundational capability. "Over the next five years, we have to be able to produce state-of-the-art frontier-scale models," Suleyman said. "That's our mission." Suleyman says the shift from chatbots to autonomous AI agents has already begun If the seven MAI models represent the technical ambition, a new capability called Frontier Tuning represents the commercial logic. Announced alongside the models at Build, Frontier Tuning allows enterprise customers to customize MAI models using their own proprietary data, workflows, and domain terminology, all within their own secure compliance boundary. The system uses reinforcement learning environments — what Microsoft calls "training gyms for AI" — that let agents learn directly from real workplace tasks without affecting production systems. The results Microsoft shared are striking. An MAI model tuned for Excel reportedly matches GPT 5.4 performance while operating at up to ten times greater efficiency. Early enterprise adopters are seeing similar gains: when tuned for one unnamed organization's exacting standards, the MAI model achieved the highest win rate of any model tested at roughly one-tenth the cost. Suleyman framed Frontier Tuning as part of a broader evolutionary stage — a move from intelligence to action. "We've basically moved beyond just conversation," he told VentureBeat. "Now we're moving to action." 
He introduced a new framework for thinking about that progression: the shift from IQ (factual intelligence) to EQ (emotional intelligence, or the ability to follow tone and style instructions) to what he calls AQ — the "Actions Quotient." Future AI agents, in Suleyman's telling, won't just answer questions. They will log into enterprise software, navigate complex multi-application workflows, and execute tasks across Excel, Word, Teams, Jira, Adobe InDesign, and customer relationship management systems — just as a human employee would. "You should be able to show up on day one and almost provision credentials to a new AI agent," he said. "The model needs to be able to move across all of these different environments, and that's actually the great strength of Microsoft." The Build 2026 announcements bore this out in concrete product terms. Microsoft Scout, the company's first "Autopilot" agent, operates as an always-on background assistant built on the open-source OpenClaw technology. It runs with its own governed identity inside Microsoft Entra, so its actions are auditable and attributable. Windows 365 for Agents gives AI agents their own managed Cloud PCs, allowing them to interact directly with applications and browsers inside enterprise environments. And the Foundry platform received major updates — including hosted agents with sub-100-millisecond cold starts, a new Microsoft Agent Framework, and one-click publishing to Teams and Microsoft 365 Copilot. Why Microsoft believes enterprise data is the next AI training frontier Suleyman also articulated why he believes Microsoft's position is uniquely defensible — and the argument has less to do with model architecture than with where work actually happens. "We've sort of hoovered up all of the obvious pools of training data," he said, referring to the industry's early scramble to ingest the open web. 
"In the next phase, we actually want to be able to give these agents to companies to train on their specific tasks with the data that they have inside of their own big workflows." The claim is subtle but consequential. The first wave of generative AI was trained on publicly available text — books, websites, Reddit posts, code repositories. That data is now largely exhausted, and its use is increasingly contested in court. The next wave, Suleyman argues, will be trained on enterprise-specific data: the internal workflows, decision traces, and institutional knowledge that define how real organizations operate. Microsoft, which serves 493 of the Fortune 500 through Azure according to Suleyman, is already embedded inside those workflows through Microsoft 365, Teams, Dynamics 365, and the broader Azure ecosystem. Frontier Tuning is the mechanism that converts that positional advantage into model performance. "People underappreciate that that's going to be the next domain," Suleyman said. The early partner list for Frontier Tuning reflects the ambition: Mayo Clinic, where Microsoft is co-creating a frontier AI model for healthcare using de-identified clinical data; EY, which is tuning a tax-advisory agent for deployment to 75,000 professionals globally; Land O'Lakes, where Frontier Tuning delivered what the company's product development scientist called "meaningful improvements in grounded outputs and style compliance"; and Pearson, which is using tuned models to provide learning-science-aligned feedback in its Communication Coach product. The Mayo Clinic partnership may be the most significant. Microsoft and Mayo Clinic are collaborating to build a healthcare-specific frontier model that combines Mayo's clinical expertise and longitudinal patient insights with Microsoft's AI capabilities. The model will be owned by Mayo Clinic and deployed first within Mayo's own environment before being made available to other organizations through Foundry. 
Microsoft's custom AI chips and GPU buying spree reveal the scale of its compute advantage None of this works without an industrial-scale compute infrastructure, and Suleyman was unusually candid about the hardware economics underlying Microsoft's strategy. "We are the largest buyer of GPUs on the planet," he said. "We're the largest buyer of GB200s and GB300s in the world." Microsoft will continue purchasing Nvidia accelerators "for many, many years to come," Suleyman said. But the company is simultaneously building its own custom silicon. Maia 200, Microsoft's second-generation AI accelerator, is already running in production across data centers in Iowa and Arizona, with deployments planned for Italy, Australia, and South Korea. According to Microsoft, Maia 200 delivers the best tokens-per-dollar-per-watt in the company’s fleet. Suleyman put a finer point on the economics in the interview: Maia 200 is 30 percent more cost-efficient than Nvidia's GB200, he said. And when Microsoft co-optimizes its own MAI models to run natively on Maia silicon, the company sees an additional 1.4x improvement in performance per watt. "It is going to be cheaper in years to come to build on MAI models with Maia 200 and Maia 300 inside of Azure," he said. That claim — if it holds at scale — has profound implications for the competitive landscape. It means Microsoft is not merely buying its way to AI dominance through Nvidia; it is building a vertically integrated stack in which its own models, running on its own chips, inside its own cloud, tuned on its customers' own data, could offer performance and cost characteristics that no competitor can replicate. Suleyman rejects the idea that AI models are becoming commodities Suleyman also pushed back sharply against one of the most popular narratives in Silicon Valley: that AI models are rapidly commoditizing. "A lot of people are saying models are commoditizing," he said. "I don't think that's true." 
His argument hinges on what he calls "quality tokens" — the proposition that the composition, curation, licensing, and deduplication of training data matter at least as much as raw scale. Microsoft's new MAI models, he said, were trained on a pre-training mix composed of approximately 50 percent high-quality code, with the remainder drawn from commercially licensed and carefully curated sources. The result, he argued, is a distinct "lineage" of models optimized for coding, reasoning, and agentic behavior — fundamentally different from models optimized for consumer chat, cultural content, or multilingual breadth. "We're going to see very distinct lineages that reflect different training objectives of different companies," he said. "Quality tokens matter more than just brute-force scale." This is a strategically important argument for Microsoft to make. If models are commodities — if any lab can match the frontier within months using cheaper compute and distilled training data — then the model layer becomes a race to the bottom, and Microsoft's billions in compute investment offer no durable advantage. But if model quality is a function of data discipline, research depth, and institutional patience, then the lab-building approach Suleyman is pursuing becomes a genuine competitive moat. He used a specific metaphor to describe that approach, one borrowed from optimization theory: the "hill-climbing machine." The phrase describes a system that continuously improves — cycle after cycle — by applying more compute, better data, and sharper evaluation. "The goal here is to build what we think of as a hill-climbing machine," he wrote in his blog post. "An organization that can continuously improve, cycle after cycle." The metaphor is revealing because it describes a process, not a destination. Suleyman is not promising that Microsoft will build the world's best model next quarter. 
He is arguing that Microsoft is building the system — the research culture, the data pipelines, the silicon co-optimization, the evaluation infrastructure — that will produce progressively better models over years. Inside Microsoft's five-year plan to become a self-sufficient AI superpower The strategic picture that emerges from Suleyman's comments — and from the full scope of the Build 2026 announcements — is of a company preparing for a future in which AI capability is not rented from a partner but generated internally, at scale, across every layer of the stack. Microsoft still needs OpenAI. The partnership continues to power Copilot, Azure AI services, and ChatGPT's infrastructure. Suleyman acknowledged as much, describing Microsoft's portfolio of model providers as a source of strength, not a problem to be solved. But the direction of travel is unmistakable. With its own frontier models, its own custom silicon, its own reinforcement learning environments for enterprise tuning, and its own autonomous agent infrastructure, Microsoft is constructing a parallel path — one that, by 2030, could make the company a fully self-sufficient frontier AI lab embedded inside the world's largest enterprise software platform. "Our ultimate goal is what we call Humanist Superintelligence," Suleyman wrote in his blog post. "That means advanced AI systems designed to serve people and organizations, not replace them." Whether that goal is achievable — or even clearly definable — remains one of the great open questions in technology. And Suleyman expressed more confidence than caution when asked about the trajectory of progress. "I really think we're at the tip of the iceberg," he said. "The models are so much more powerful than we know how to extract intelligence from them." But confidence and execution are different things. 
Building a frontier lab is not an announcement; it is a decade-long commitment that requires retaining elite researchers, maintaining scientific rigor under commercial pressure, and producing results that justify the staggering capital expenditure. Google learned this with DeepMind — which Suleyman himself co-founded in 2010, before joining Microsoft — and even that lab, widely regarded as one of the best in the world, spent years navigating the tension between pure research and product delivery. Suleyman seemed aware of the contradiction. "If you rush it, you'll screw it up," he said. The sticker on his laptop reads: "Patience and urgency." It is a paradox that Microsoft now has five years — and several hundred billion dollars — to resolve.

Microsoft used its Build 2026 conference this week to push a clear message: agents are rapidly moving into production throughout enterprise systems, and the winning platform will be the one that gives them reliable context, governance, identity, memory — and secure access to enterprise data. The company announced Microsoft IQ as a context layer across GitHub Copilot, Microsoft Foundry and Copilot Studio; Work IQ APIs coming June 16; Fabric IQ for structured business data; Foundry IQ for retrieval across enterprise knowledge and the live web; and Web IQ as a new agent-facing web search stack. Microsoft also introduced Scout, a personal work agent, and a whopping seven new in-house AI models in its growing MAI family across modalities and use cases, including MAI-Thinking-1. Those announcements sit directly in Marco Casalaina’s lane. Casalaina is Microsoft’s VP Products, Core AI and AI Futurist. He leads Microsoft’s AI Futures team and previously led teams across Azure AI, including Azure OpenAI, Vision, Speech, Decision, Language, Responsible AI and AI Studio. Before Microsoft, he led Salesforce’s Einstein AI team and earned a computer science degree from Cornell University. CRN reported that he joined Microsoft in early 2022 as vice president of products for Azure Cognitive Services, meaning he has now been at the company for more than four years. VentureBeat spoke with Casalaina ahead of Build about Microsoft’s agent strategy, the company’s model-choice philosophy, how Microsoft IQ fits with MCP, and why he believes enterprises need far more than just access to powerful models. The interview below has been edited for clarity and condensed from the transcript. VentureBeat (VB): To start, can you explain your role at Microsoft and what “AI Futurist” means in practice? Marco Casalaina (MC): I am VP Products of what we call Core AI. Core AI is our set of tools for AI developers, and that includes Foundry, Visual Studio, VS Code, GitHub and GitHub Copilot. 
That’s our overall group. My Silicon Valley title is AI Futurist, and that has a very concrete meaning here. I’ve worked with other folks who are considered futurists, like Peter Schwartz, and that can be a little bit more fuzzy. For me, what it means concretely is that I am the first person to try anything new here. I am constantly getting things from all over Microsoft, not even just Foundry, because I work with really everybody across the company. Pretty much everybody sends me the new things at all times. Even today, I got something brand new just before this call. I’m usually the first person to try anything new here, which is pretty cool. I get to see a lot of really cool stuff. A friend of mine, who is head of AI at Intuit, calls me an “adjacent possiblist.” I consider my futurist concept to be about a year out from now — the immediate future of what’s about to happen next. That’s what I focus on. VB: Where are you looking at the agentic state of things, and in particular Microsoft’s position as enterprises and individuals rush to adopt agentic AI? MC: We can look at it from bottom to top. At the very base of the stack is our commitment to model choice. All along, we’ve had the OpenAI GPT frontier models. Now we have a really solid partnership with Anthropic, where we’re offering the Claude models. We just launched Claude Opus 4.8 on Azure — on Foundry, I should say — and at Build, we are introducing our new MAI model. The MAI models are a set of frontier models that we’re building in-house. They are made for token efficiency, optimization and customization. We are specifically making them for our customers to customize on their own data sets. One level above that, we are announcing hosted agents in Foundry. That is our managed agent capability in Foundry. It automatically handles scaling, containerization and those kinds of things. It is an environment where you can manage agents. One level above that is the Foundry control plane. 
At least for the agents you build, you want to have control over them. This gives you observability into their cost, tokens and correctness. You can do continuous evaluations and sample interactions with those agents, run evals and make sure they are continuing to work and not drifting. The big news is going to be the GA of what we call the IQs here at Microsoft. There are currently three, and there will be four. There is Foundry IQ, which is basically for knowledge — largely unstructured knowledge. There is Fabric IQ. We have a ton of customers who have entrusted a lot of data to the Microsoft Cloud in Fabric, Power BI and related technologies. Fabric IQ is about making an agent-facing interface for this data, so agents can get to it without literally going through a Power BI report. That’s ridiculous. Work IQ is about the Microsoft ecosystem. You can look at Work IQ as the agentic face of all the Microsoft apps: Outlook, Teams, Word, SharePoint and all those kinds of things. How does an agent interact with those things? That is Work IQ. And finally, the fourth IQ is Web IQ. We are releasing our new agent-facing web search capability. It can search the web, search through videos and even do some kinds of browsing tasks automatically. It is super fast, and it kind of has no face. It’s headless. The interface is intended for agents. We will also be announcing Agent Optimizer. That includes a new type of evaluation that allows you to evaluate much more granularly whether an agent is actually working and working correctly. The optimization step can go back in and make modifications to the prompt, obviously with your consent, and modify your agent so it works more correctly going forward. Effectively, it creates a feedback loop to make agents work better. VB: Microsoft has sometimes been criticized for murky and clunky product naming. Where do these IQ products sit? Are enterprise users supposed to go to IQ first, or is IQ more for developers to connect to? 
MC: All of the IQs are headless. The concept of IQ is that each one provides a different type of context to an agent specifically. Largely, it will be developers interacting with the various IQs — developers and the agents they build. The IQ brand is really about agent context. End users largely won’t interact with the IQs. It is true that if you use Microsoft 365 Copilot today, you’ll notice a little thing that says it is using Work IQ. So it is a little bit visible, but the customer or end user doesn’t have to go find the IQ. Their system or developers hook that up. VB: Is the IQ family essentially Microsoft’s version of MCP? Is it using MCP, or is it something different? MC: All of the IQs are indeed exposed as MCP servers. You have correctly characterized MCP as basically an agent-facing or self-describing API. It’s not that fancy. That’s really what it is, with some authentication layers and capabilities built in, which is super useful. Something like Work IQ — really all the IQs — have to be authenticated. In order for Work IQ to see my email, Teams messages, documents and stuff like that, I have to be able to authenticate it on behalf of me. That gets us to another core differentiator that we will be announcing at Build, which is agent identity. We have this Entra system, and Entra is, I believe, the world’s largest used identity system for human users. For some time now, you have been able to declare an agent to have an identity in there. Now, agents will be able to have their own identity, their own Teams box, their own email inbox and stuff like that. These agents will use Work IQ to check their own email, check their own documents and that sort of thing. VB: Enterprises are not one-size-fits-all on models. Microsoft supports many leading models through Foundry and Azure, while also building its own. Is Microsoft a model company, an infrastructure company or a connector between models and work products? MC: The answer is yes. 
We are obviously the hyperscaler. We are absolutely committed to model choice, and we will continue to offer the frontier models from all of the major players: OpenAI, Anthropic, Mistral, Black Forest, xAI — you name it. They are all going to be represented in there. At the same time, we have what is now called our Microsoft AI Superintelligence Team, formed by Mustafa Suleyman, and we are building our own frontier models as well. Like I said earlier, we are really gearing these models toward optimization — token efficiency, bang for the buck and customization. These are things our customers have been asking for: the ability to more finely customize models, whether that is fine-tuning or continued pre-training. Continued pre-training is literally changing the weights of the model, whereas fine-tuning is adding a little layer on top. We have these capabilities in Foundry: fine-tuning, distillation and those kinds of things. I would note, by the way, that our MAI models are not distilled. Some model providers, especially some of the less scrupulous ones, will distill other models into theirs, and that can have unusual effects. We don’t do that. The data provenance of our models is of primary importance to us. When we come out with these models, we want our customers to know that the data provenance is clean in terms of the rights to the data, where it came from and all that kind of stuff. The choice thing also goes above the model layer. When we talk about Foundry hosted agents, we have the Microsoft Agent Framework. You talk about agent orchestration — how you make agents work together when you have multiple agents — and Microsoft Agent Framework is an excellent framework for that. However, I can make a LangGraph or LangChain Foundry hosted agent. I can make a CrewAI Foundry hosted agent. I can use any number of orchestration frameworks and put that up as a Foundry hosted agent, and it becomes a first-class Foundry agent. That means I get the observability. 
It shows up in the Foundry control plane. I can do evaluations on it. I can do traces on it. I can get all those things from the Foundry control plane with an agent built in really any framework I choose. VB: Some companies are interested in Chinese and open-source models. How much of Microsoft offering its own models is about giving customers an American version of that? MC: I can’t speak to that exactly. Of course, we offer DeepSeek models and Qwen models in Foundry, so we offer all of these choices today, and our customers can make that choice. The MAI models are really focused on token efficiency and customizability. That is what our customers are demanding, and that is the gap we are filling. VB: As agents take on longer tasks and more specialized work, will enterprises keep expanding the number of models they use, or will there be a winnowing? MC: I do see it expanding. We are not just focused on tokens per se. A token is not a token is not a token. One token is not necessarily equivalent across these things. It is all about what you are doing with each token and the efficiency of that. It comes back to what kind of value you are getting for the cost. That is a lot of the rationale behind why we are developing our own MAI models. Part of my job is to travel all around the world. I’ve been all over the place. For example, I’ve been working with Bayer. One of the things we are measuring is not just token usage, but number of users — monthly active users and daily active users — because we have a lot of first-party capabilities like Microsoft 365 Copilot. Over the last year, we’ve seen a 6x increase in monthly active users. We have over 20 million users of Microsoft 365 Copilot alone. That is on the agents you use. In terms of the agents you build, Bayer put up its own agent system on Foundry, and now it has 20,000 of its own employees on it. A few weeks ago, I was in Sydney, Australia, hanging out with AEMO, the Australian Energy Market Operator. 
They operate the electrical grid of Australia. They showed me that they had built agents to manage grid operations. This is a human-centered thing. They have grid operators sitting in centers in West Sydney, Brisbane and places like that, and they are bombarded with alerts. I wouldn’t believe it if I hadn’t seen it myself. The alerts are constant. They built a system to triage those alerts. Is this alert a super major thing, or is it just that a transformer is getting a little hot? It also says, here is when we had this problem last time, and here is how we resolved it last time. Maybe now we need to replace this component, or whatever. Ultimately, it is the grid operators making the choice. A lot of our philosophy here is human empowerment. These human-centered agents are the ones that are working best among our customers. What I saw at AEMO and Bayer is this notion of human empowerment: taking away some of the grunt work, or in the case of AEMO, taking billions of alerts and reducing them to something much more manageable and actionable for the people involved. We are moving past the era where agents are just answering questions. AI in general is moving past that. We are not just answering questions anymore. We are moving toward a place where AI can really meaningfully help you do your work. VB: How do observability, tokenomics, ROI analysis and agent governance fit into Microsoft Foundry? MC: That is what the Foundry control plane is all about. We introduced it in November of last year. If you looked at my own Foundry control plane — I’ve built a ton of these agents, and I am a developer by background — you would see all of my agents that are running and the ones that are paused. I can see how many tokens they’ve used over the last day, week or month. I can look at trends. I can look at costs, because the cost will be different depending on what underlying model I’m using. 
If I’m using our model router, it can route to different models depending on the complexity of the inbound prompt. We also have Azure cost management overall. Azure has had cost management for over a decade, before the AI thing even happened. This integrates with overall Azure cost management. It is not just narrowly about what your AI is doing. Your AI will be using storage resources, data resources and other compute resources around that AI. You can get a complete picture of not just the cost and token usage of the AI itself, but everything around it. When you think about governance, that also extends to evaluation. One of the things we are releasing in preview is rubric-based evaluation. Rubric-based evaluation is much more granular. Let’s say you have built a restaurant reservation agent. The things you want to test about that agent are not really groundedness. Groundedness is the opposite of hallucination, and that is very question-answering. For a restaurant reservation agent, you want to test very granular things. If you say, “Make me a table for two tomorrow,” did it come back and ask, “What time would you like the table?” Before it gave you a table for two tomorrow at 6 p.m., did it actually check that the table was available, or did it randomly give you a table without checking first? There are very granular things you want to test about that specific use case. You don’t just want to test whether the agent works. You want to test whether the agent works right. That is what we are approaching with our new rubric-based evaluation system. You will see that in Satya’s keynote. I have been using it myself lately, and I’m very happy about it. I’ve been waiting for this. VB: Microsoft is also partnering with companies like Anthropic and allowing Claude to work with Microsoft 365. How important is Copilot to this story? Why would someone turn to Copilot over other options? MC: Microsoft 365 Copilot is a huge advantage for us. 
As I mentioned, we crossed the 20 million user mark on Copilot relatively recently. The great thing about that is that it is the face. When you go into Foundry and make an agent, there is a button that says “publish to Copilot” — actually, it says “publish to Copilot in Teams,” because you can put it in Teams too. The idea is that you want to put these agents where your users are. A lot of people who use the Microsoft ecosystem are in Teams, or they are using Copilot. I can create a custom agent, as many of my colleagues have, and now it is in Copilot, which I use maybe 50 times a day. Since January, Copilot has become more and more capable. I now use it to draft my email. I am not just using it for question answering. I’m starting to use it to manage my calendar and draft emails. I really do this every day now. When I want to use a custom agent — for example, to file my expenses, because we have a custom agent for that now — I can access that agent not in some random standalone interface, but in Copilot or Teams, where I already am. That surface area that people are already engaging with is a major advantage. VB: As people offload more repetitive work to AI, what are they able to spend more time doing? MC: Let’s consider something I did yesterday. I got an email from a customer named Frankie, and he asked me a question about Foundry hosted agents. I knew the answer because I had talked to my colleague Jeff Holland, who is the head of our hosted agents product management. I had asked Jeff the same question two weeks ago. Where or how I asked him, I don’t remember. Was it in Teams? Was it email? Was it a meeting? I don’t really remember. But I knew the answer to the question Frankie was asking. So I went into Copilot and said, “Answer Frankie’s question about how hosted agents scale, and reference the conversation I had with Jeff a couple of weeks ago on this same topic.” And it did it. It drafted the email. Over time, I have taught Copilot my style. 
I don’t do the bold-print thing. I tell it: don’t use em dashes and that kind of stuff. I have a certain style in the way I write emails. It’s a little terse, to be perfectly honest, but I want it to be the way I write. It drafted this thing. It searched through my Teams messages, my emails and the transcripts of my meetings with Jeff. It used Work IQ, as a matter of fact. It found the answer, drafted the email and provided a link to the documentation that specifically covered the question Frankie was asking. I looked at the draft and thought, yep, that’s it. Yes, I could have composed this email myself. I knew the answer to the question. I could have looked up the documentation. If I dug around, I’m sure I could have found the conversation I had with Jeff in whatever medium that was. I could have done that stuff. It probably would have taken me, I don’t know, an hour to find all the information and compose it. Instead, I did it in about a minute. I had a draft, I looked at it, I was happy with it, I pressed send, and that was the end of that. It really is about giving people time back. It is not even just grunt work. It is all this time you spend looking things up and finding things. Now, I can make it take an action. It didn’t just answer the question. It fully drafted the email and copied Jeff. VB: Do you fear for your job? How has AI changed your own work? MC: I don’t fear for my job. My job has changed. For one thing, I do a lot more now, both in my business life and personal life. This weekend I was using Web IQ, the new Web IQ. I’ve been car shopping. My car’s lease is coming up, and there is a very specific car I’m trying to find, which is hard to find. It’s a Hyundai Ioniq 6, which Hyundai, for whatever reason, has stopped offering in the United States. I’m going to get one, though. 
I set my agent to the task, using Web IQ, of finding all the Hyundai Ioniq 6s available in the entire Bay Area — everywhere, all the way out to Sacramento, all the way as far south as Gilroy. I set it to this task, and then I went on a hike. When I got back, I had a big long list of all the Hyundai Ioniq 6s, at least the 2024 and 2025 models, available in the entire Bay Area. From that, I started calling down these dealers. Even in my personal life, I’m using it constantly. It saves me a ton of time. That would have taken me hours, to go through every single dealer’s inventory like this. But Web IQ could do that, and it was super quick. VB: Any final thought for developers around this news? MC: Foundry is really the place. This is the place where you can build your agents, scale your agents, test your agents and improve your agents. That’s what it’s all about, and it’s happening.
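Casalaina's restaurant-reservation example of rubric-based evaluation can be sketched in a few lines. This is a minimal illustration, not Foundry's evaluation API: the trace format, the step names (`ask_time`, `check_availability`, `book_table`) and the `happens_before` helper are all assumptions made for the example. It shows granular pass/fail criteria about the order of an agent's actions, rather than a single "did it work" score.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical trace format: one dict per agent step, in the order it happened.
Step = Dict[str, str]


def first_index(trace: List[Step], name: str) -> Optional[int]:
    """Position of the first step with this name, or None if it never occurred."""
    for i, step in enumerate(trace):
        if step["name"] == name:
            return i
    return None


def happens_before(trace: List[Step], first: str, second: str) -> bool:
    """True only if `first` occurred, and occurred before `second` (if `second` occurred at all)."""
    fi = first_index(trace, first)
    if fi is None:
        return False
    si = first_index(trace, second)
    return si is None or fi < si


# Each rubric row is one narrow, checkable behavior (names are illustrative).
RUBRIC: Dict[str, Callable[[List[Step]], bool]] = {
    "asked_for_time_before_booking": lambda t: happens_before(t, "ask_time", "book_table"),
    "checked_availability_before_booking": lambda t: happens_before(t, "check_availability", "book_table"),
}


def evaluate(trace: List[Step]) -> Dict[str, bool]:
    """Score a single agent run against every rubric row."""
    return {criterion: check(trace) for criterion, check in RUBRIC.items()}


careful_run = [{"name": "ask_time"}, {"name": "check_availability"}, {"name": "book_table"}]
careless_run = [{"name": "book_table"}]

print(evaluate(careful_run))
print(evaluate(careless_run))
```

The careless run still "works" in the sense that a table gets booked, but it fails both rubric rows, which is the distinction between an agent that works and one that works right.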

When someone on a team corrects an AI agent with better prompts, better feedback or better context, that improvement disappears the moment a colleague opens the same tool. The correction doesn't transfer, and the next person starts from zero.

The problem compounds in multi-agent workflows, where teams expect agents to share context across users and tasks. Without a shared memory layer, every team member effectively trains a different version of the same agent, and those versions never sync.

That gap shows up in the numbers. According to Asana's own research, 75% of knowledge workers use AI on the job, but only 5% of companies have reported productivity gains.

“Model providers are getting really, really good at improving reasoning and retry loops, but what they’re not good at is bringing the enterprise work context in a way that human beings can reason about for shared memory,” Asana Chief Product Officer Arnab Bose told VentureBeat.

Asana has been building toward an agentic platform that centers context and shared memory. Its Agentic Work Management platform ensures that if any team member corrects an agent, that correction applies to everyone else on the team.

“That context graph is automatically provided to agents operating inside Asana’s system so you don’t have to have every human member of the team become an expert at prompt engineering or context engineering,” Bose said.

Bose said the shared memory architecture matters beyond Asana's own product; it's the design decision enterprises need to make for any multi-agent system. Shared memory also becomes important when enterprises begin moving from simple single agents to multi-agent workflows that need to share context and behaviors.

Memories for a multi-agent, multi-platform workflow

The models powering agents are stateless by design, so memory becomes a dedicated layer outside of a context window.
While this area of AI innovation is maturing, the questions of what gets stored, who controls it, and how it stays consistent when different agents and users write to the same instance remain largely unsolved. This is manageable for use cases with only one user. In enterprise agentic workflows, however, the idea is for agents to work with the entire team. Most platforms have agents that still act for individuals, which leads to repeated tasks, inconsistent versions of reality and mistakes that spread. Agents may also contradict each other.

Sriharsha Chintalapani, co-founder and CTO of Collate, said in an email to VentureBeat that the lack of shared memory is a major obstacle for multi-agent workflows, particularly around consistency.

"Agents are sensitive to the quality of their prompts," Chintalapani said. "Someone with a strong understanding of the task will generally get more accurate results than someone less experienced. Partly that’s because they’re able to construct more detailed prompts, but also because they’re able to give the agent better feedback. The agent remembers the corrections it’s received and applies that knowledge to successive prompts. The more accurate the feedback, the better the agent will perform for that user."

He added that organizations should stop treating shared memory solely as a prompt engineering problem and think instead about building systems that carry the same context into every conversation.

Neej Gore, chief data officer at Zeta Global, said in a separate email that shared context becomes a living memory that "compounds intelligence across the enterprise."

The opportunity may lie in building AI agents that retrieve memory relationally, pulling in relevant context based on what's being asked, an approach Chintalapani says few organizations outside the largest model providers are equipped to build.
Personal versus team agents

AI agents are already widespread in enterprises, but many of them operate as personal agents doing work specific to individual users. Most prompts start from one person, files are uploaded by one account, and even agents living in a company-wide system mostly learn individual user preferences.

Most enterprise AI workflow platforms recognize that memory is important but approach it through different lenses. For example, Microsoft’s Copilot takes an individual-first approach by learning a user’s role within the organization, tone preferences and working patterns, which are then stored as personal memories for the agent to apply across the different Microsoft 365 surfaces.

For engineering and orchestration teams evaluating agentic platforms, the shared memory question is now a procurement criterion, not just a technical nicety. An agent that learns only for the person using it will require ongoing individual upkeep. One connected to a team-wide memory layer builds institutional knowledge automatically.
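The difference between personal and team memory can be made concrete with a small sketch. It is an illustration only, not Asana's or Microsoft's implementation: `TeamMemory`, `record_correction` and `build_prompt` are hypothetical names, and a real system would also need access control and conflict handling when several people write to the same store. The point it shows is that corrections are keyed by team rather than by user, so one person's feedback reaches every colleague's next prompt.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TeamMemory:
    """Corrections keyed by team, so one member's feedback reaches everyone."""
    corrections: Dict[str, List[dict]] = field(default_factory=dict)

    def record_correction(self, team: str, author: str, note: str) -> None:
        # Store who made the correction so it can be audited later.
        self.corrections.setdefault(team, []).append({"author": author, "note": note})

    def context_for(self, team: str) -> List[str]:
        """Notes to attach to any team member's prompt, oldest first."""
        return [c["note"] for c in self.corrections.get(team, [])]


def build_prompt(memory: TeamMemory, team: str, user_request: str) -> str:
    """Prepend the team's shared corrections to an individual request."""
    notes = memory.context_for(team)
    if not notes:
        return f"Request: {user_request}"
    guidance = "\n".join(f"- {note}" for note in notes)
    return f"Team guidance:\n{guidance}\n\nRequest: {user_request}"


memory = TeamMemory()
# One teammate corrects the agent once...
memory.record_correction("marketing", "ana", "Report by fiscal quarter, not calendar quarter.")
# ...and a different teammate's request picks the correction up automatically.
print(build_prompt(memory, "marketing", "Summarize last quarter's launches"))
```

A per-user store would differ by a single key: indexing `corrections` by user instead of team reproduces the situation in which every colleague starts from zero.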

Meta's AI support agent bound recovery emails to accounts for whoever asked, and SOCs never saw an alert. An authorized agent writes a log of legitimate transactions, so nothing in the detection stack fired. Attackers asked the bot to make the change, took the one-time code it sent, and ran the password reset, 404 Media reported. No malware, no stolen credentials, and no prompt injection in the sense most security teams drill for. The agent did exactly what Meta built it to do.

That is what should keep a security operations leader up at night: The takeover did not break a control; it rode one that was already trusted. What a SOC needs is a way to walk each recovery path through an audit grid with its AI build team before the next renewal closes. The AI Authority Audit Grid at the end of this article maps every authentication write a support agent can make on the recovery path, what Meta's incident proved about each one, why it stays dark to the SOC, and the control that closes it.

The agent is an authorized actor, so the SOC reads the takeover as routine traffic

From inside the detection stack, the attack produced no signal the stack could read. The agent binds a new email, then resets the password, and identity and access management logs both writes as an authorized actor, so each lands in the authentication state as a legitimate transaction. No anomalous login, no failed-auth spike, nothing for EDR or DLP, no SIEM rule to match, because nothing in the sequence looks like an attack. The takeover lived inside the trust boundary the stack assumes is safe. There is no foothold to find, because the agent was the foothold, and it was supposed to be there.

The chain was almost insulting in its simplicity. Brian Krebs documented the version pro-Iran hackers posted to Telegram on May 31.
The attacker switched on a VPN to appear in the victim's region, sidestepping Instagram's location alarms, then asked the support assistant to add a new email and send a verification code, as the BBC confirmed from the same recordings. The bot complied, sending the one-time code straight to the attacker, Gizmodo reported. The reset finished and the owner was locked out, in minutes. The exploit failed against any account with MFA enabled, according to Krebs. The hijacked accounts were not soft targets. They included Sephora, U.S. Space Force senior enlisted leader Chief Master Sergeant John Bentivegna, researcher Jane Manchun Wong, and a dormant Obama White House handle that briefly posted a defaced image, according to 404 Media. Meta disputes the Obama account, according to TechCrunch, and called claims that leaders' accounts were breached "completely false," according to the BBC. The rest stand. MFA held. The recovery path beside it did not. The detail that decided who survived was narrow. Krebs reported the attack failed against any account with multifactor authentication, even SMS. The recovery path beside it was the gap. When that path asked for a selfie video, attackers ran the target's public photos through an AI video generator and submitted the clip, which Meta accepted as valid identity verification, gHacks reported. Either way the failure was the recovery door, not the login door MFA guards. That makes this an architecture problem, not a Meta problem. MFA gates the login path for owner and attacker alike, but the recovery path runs beside it, built to relax the usual checks because it exists for the moment a user has lost the normal way in. Meta put an agent on that path with write access to authentication state and no deterministic check between a convincing request and a committed change. Authorization cannot live inside the model, because a conversational system can be talked into skipping a check. 
It has to live outside the model, in a gate the agent cannot reason its way past. Security researchers have a name for this pattern: the confused deputy, a trusted system tricked into spending its privileges on an attacker's behalf.

This is not the last support agent that will hand over an account. Ian Goldin, a threat researcher at Lumen's Black Lotus Labs, told Krebs on Security that AI bots are as easy to social engineer as the human agents they replace, and just as eager to help. "AI chatbots create interesting new attack surface, and we're likely going to see a lot more of these kinds of attacks," Goldin said. Every enterprise wiring an agent into a recovery, provisioning, or password flow is shipping the same write access Meta did.

Simon Willison, who coined the term prompt injection, put it plainly on his blog. "Meta really did wire their support system into an AI chatbot that had the ability to fast-forward through the entire account recovery process," he wrote. "This one hardly even qualifies as a prompt injection. Don't wire your support bot up to allow one-shot account takeovers."

The attacker never tricked the agent. The attacker asked, and the agent had untrusted input, write access, and a way to execute, all at once. OWASP named this class before Meta shipped it, as Excessive Agency at LLM06 and Identity and Privilege Abuse at ASI03 in the Agentic AI Top 10. The warning label was on the box: Meta pushed the assistant to every Facebook and Instagram account in March, according to 404 Media, with the power to reset passwords and handle recovery, the product page promising "solutions, not just suggestions" under the line "account security and recovery." Meta gave the agent the power and never built the gate to govern it.

The AI Authority Audit Grid

Security operations leaders need to run this against their own support agent before the next renewal closes.
Each row is an authentication write the agent makes on the recovery path, with what Meta proved, why your stack misses it, and the control that closes it.

Authentication write: Login authentication (MFA, factor prompts)
What Meta proved: Held on login. Accounts with any MFA enabled, even SMS, survived (Krebs). The gap was the recovery path beside it.
Why your stack misses it: MFA gates the login path for owner and attacker alike. It does not gate the recovery path beside it.
Enterprise control and owner: Enforce MFA as the baseline and extend step-up verification to the recovery path, the same standard login gets (OWASP). A selfie video is not proof of identity. Any agent that operates on a path MFA does not cover fails the audit. Owner: IAM.

Authentication write: Email rebind
What Meta proved: Full takeover. The agent bound attacker-controlled emails on request, taking Sephora and a U.S. Space Force account (404 Media).
Why your stack misses it: IAM logs the agent as an authorized actor, so the rebind reads as a legitimate transaction and no alert reaches the SOC or the account owner.
Enterprise control and owner: Confirm out-of-band to the existing verified contact before any rebind commits, gated outside the model, and notify the old address the moment it changes (IBM). An agent that rebinds without confirming the old address fails. Owner: IAM and platform engineering.

Authentication write: Password reset
What Meta proved: Full takeover in minutes. Researcher Jane Manchun Wong was among the affected accounts (404 Media).
Why your stack misses it: The reset runs on the recovery path, outside the login MFA check, so no factor prompt fires and no detection rule triggers.
Enterprise control and owner: Require a second non-email factor before any reset completes. NIST dropped email as a valid out-of-band channel (NIST 800-63B). An agent reset must clear the same gate a human reset does. Owner: IAM.

Authentication write: Recovery-method change
What Meta proved: Persistent lockout. Victims could not self-recover. The support loop offered only AI with no human escalation (BleepingComputer).
Why your stack misses it: A silent swap of the recovery email or phone removes the owner's re-entry path with no SOC visibility.
Enterprise control and owner: Require step-up review on any change, notify the prior method, and grant time-delayed, reduced-scope access after recovery so a swap never hands over instant control (Authsignal). Keep a human escalation path the agent cannot close. Owner: GRC and IT operations.

Authentication write: Account-action execution
What Meta proved: Speed risk. A dormant Obama White House handle briefly showed a defaced image during the spree, an account Meta disputes was taken this way (TechCrunch).
Why your stack misses it: The agent executes irreversible state changes in seconds with no human in the loop and no reversibility window.
Enterprise control and owner: Separate decision from execution. The agent only proposes the action. A policy service validates scope and approval before it runs, with approval bound to the exact action (OWASP). No auth-state write commits without that gate and a reversibility window. Owner: platform engineering and the AI build team.

Authentication write: Agent action logging
What Meta proved: Detection gap. The takeover left no alert, and Meta has not published how many accounts fell before the patch (TechCrunch).
Why your stack misses it: Without per-action telemetry piped to the SIEM, an authorized-agent takeover is invisible to the SOC.
Enterprise control and owner: Emit structured decision metadata for every auth-state write into the SIEM: action class, authorization outcome, approval ID, result, policy version (OWASP). A write your SIEM cannot see is a write you cannot defend. Owner: SOC and detection engineering.

The fix is not bolting yet another MFA prompt onto the login screen. The people who survived Meta’s incident were the ones who already had that control in place. The fix is pulling authorization out of the recovery path’s honor system and putting it behind a gate that does not move just because a prompt sounds convincing. Build the agent so the SOC sees every write it makes, and so any write that changes who owns an account cannot commit without a check that the model does not control. Meta just showed what happens when the most trusting employee on the team is also the one holding the keys.
The next agent like that is already reading your intellectual property and financials.
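The grid's two structural controls, separating the agent's proposal from its execution and emitting a decision record for every auth-state write, can be sketched as follows. This is a minimal illustration under stated assumptions, not Meta's system or an OWASP reference implementation: `ProposedAction`, `Approval`, `evaluate`, the action names and the policy version string are all hypothetical. The record's field names follow the metadata the grid lists (action class, authorization outcome, approval ID, result, policy version).

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional

POLICY_VERSION = "example-policy-v1"  # hypothetical version label

# Auth-state writes that must never commit on the agent's say-so alone.
GATED_ACTIONS = {"email_rebind", "password_reset", "recovery_method_change"}


@dataclass(frozen=True)
class ProposedAction:
    """All the agent may produce: a proposal, never a committed write."""
    account_id: str
    action_class: str
    new_value: str


@dataclass(frozen=True)
class Approval:
    """Out-of-band confirmation from the existing verified contact (assumed to exist)."""
    approval_id: str
    action_digest: str  # binds the approval to one exact action


def digest(action: ProposedAction) -> str:
    """Stable fingerprint of the exact action, so an approval cannot be reused for another."""
    payload = json.dumps(asdict(action), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def evaluate(action: ProposedAction, approval: Optional[Approval]) -> dict:
    """Deterministic gate outside the model; returns the record to ship to the SIEM."""
    if action.action_class not in GATED_ACTIONS:
        outcome = "allowed_ungated"
    elif approval is None:
        outcome = "denied_no_approval"
    elif approval.action_digest != digest(action):
        outcome = "denied_approval_mismatch"
    else:
        outcome = "allowed"
    return {
        "action_class": action.action_class,
        "authorization_outcome": outcome,
        "approval_id": approval.approval_id if approval else None,
        "result": "committed" if outcome.startswith("allowed") else "blocked",
        "policy_version": POLICY_VERSION,
    }


rebind = ProposedAction("acct-1", "email_rebind", "new-owner@example.com")
print(evaluate(rebind, None)["result"])  # blocked: no out-of-band approval
confirmed = Approval("ap-42", digest(rebind))
print(evaluate(rebind, confirmed)["result"])  # committed: approval matches this exact action
```

Because the gate is ordinary code with no language model in the loop, a persuasive request cannot talk it into skipping the check, and blocked attempts produce a record just as committed ones do, which gives the SOC something to alert on.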

Anthropic co-founder and CEO Dario Amodei said it was coming, but it still feels like a milestone: More than 80% of the code merged into Anthropic’s production codebase in May wasn't authored by humans, but by its own AI model, Claude, according to a new report shared by the record-breaking AI startup today. This transformation has triggered an 8x increase in the volume of code shipped per engineer per quarter compared to the company’s 2021–2025 baseline, which the company notes means even more code someone or something must review. For enterprise technical leaders, this is no longer a localized research curiosity; it's a new, aggressive competitive baseline. If a frontier AI laboratory can successfully offload the vast majority of its engineering output to autonomous agents — showing signs of the long-sought AI Holy Grail of "recursive self-improvement," models that can independently research and upgrade themselves — what's preventing enterprises across other sectors from automating more of their internal software development with AI agents, too? Obviously, it's easier said than done. Anthropic is one of the principle creators of the current gen AI boom, so you'd expect them to know how to deploy the technology effectively. But for other enterprises looking to bump up the amount of code and workflows handled by agents, Anthropic's new blog post details the outlines of a general plan they too can adopt to re-engineer their operations and workflows to take advantage of the latest AI advances. Anthropic's roadmap that other enterprises can follow The transition from human-centric coding to autonomous orchestration requires understanding the evolution of AI capabilities. Anthropic outlines a clear historical continuum that enterprises can map onto their own digital transformation roadmaps: 2021–2023 (Manual Writing): Engineers write code and documentation natively within local text editors. 
2023–2025 (Chatbot Assistance): Developers use early models to generate brief code snippets, copying and pasting outputs manually into their environments. 2025–2026 (Coding Agents): Capable agents actively write and edit entire files autonomously. Present Day (Autonomous Agents): Agents execute code independently, debug live environments, and delegate multi-hour work streams to specialized sub-agents. This rapid evolution is validated by external benchmarks. Software engineering evaluation frameworks like SWE-bench—which tasks models with resolving real bug reports in complex, open-source codebases—have saturated over a two-year window. Furthermore, long-duration capability evaluations demonstrate that models like Claude Opus 4.6 can reliably sustain operations on 12-hour tasks, while Claude Mythos Preview pushes past 16 hours of continuous problem-solving. Internally, the technological leap is even more stark. On highly complex, open-ended engineering problems where clear specifications are initially absent, Claude’s success rate climbed to 76% in May 2026 — a 50-point increase in a six-month window. In isolated optimization benchmarks, where models are tasked with accelerating AI model training code, Anthropic’s internal Mythos Preview model achieved a 52x speedup. For comparison, a skilled human developer typically requires four to eight hours of manual refactoring to achieve a mere 4x speedup on the exact same codebase. 3-step plan to more complete production code automation For an enterprise to replicate Anthropic's 80 percent milestone, technical decision-makers must abandon the "developer assistant" mental model and transition to an "automated factory" architecture. This shift impacts product management, operations, and developer workflows in three distinct ways: 1. 
Shift from Code Execution to Architectural Oversight
When code generation costs near zero in human time, the primary engineering role shifts from writing software to specifying goals and reviewing outputs. Enterprise leaders must retrain developers to act as systems architects and judges. As one Anthropic employee noted regarding the operational reality of this shift: "The shape of stuff today is roughly ‘humans have ideas, and the models are able to implement, test and evaluate them an [order of magnitude] faster than before.’"
2. Overcome The Code Review Bottleneck
Injecting vast quantities of AI-generated code into an organization inevitably creates operational friction. According to Amdahl’s law, the overall speedup of a process is capped by the share of the work that stays serial and non-automated. At Anthropic, flooding the system with synthetic code instantly turned human code review into a critical bottleneck. To counter this, enterprise teams must deploy automated AI code reviewers directly into their Continuous Integration/Continuous Deployment (CI/CD) pipelines. Anthropic implemented an automated Claude reviewer (a publicly accessible version, Claude Code Review, rolled out for commercial use in March) tasked with analyzing every pull request for architectural defects, security flaws, and regression bugs before merging. Other dedicated firms like Qodo offer tools tailor-made for this purpose, as well. In Anthropic's case, retrospective analyses indicated that the automated layer caught approximately one-third of the production bugs responsible for historical outages on the flagship claude.ai website.
3. Target High-Volume Operational Debt
Enterprises are frequently paralyzed by legacy code maintenance and long-deferred technical debt. Rather than deploying agents to write speculative new features, technical leaders should direct autonomous agents toward closed-loop, painstaking cleanup operations.
In April 2026, an Anthropic engineer deployed Claude to resolve a persistent class of API errors. Operating autonomously, the model shipped more than 800 individual fixes, successfully reducing the error rate by a factor of 1,000. The supervising engineer estimated that a human developer would have spent four full years executing the same work, due to the cognitive load of holding massive, unfamiliar code context in their head simultaneously. Considerations for enterprises moving forward in an age of primarily AI-generated code Operating a codebase predominantly authored by AI introduces unique governance challenges that enterprise legal and security teams must navigate. Unlike open-source licensing models (such as the permissive MIT license or copyleft GPL frameworks), enterprise codebases utilizing proprietary LLM infrastructure remain subject to the commercial terms of service of the respective AI vendor. The deployment of autonomous agents requires rigorous verification protocols to ensure compliance, security, and intellectual property protection: Code Quality and Maintenance: Anthropic’s internal data indicates that while AI-authored code was objectively lower in quality than human output in late 2025, it reached rough parity by mid-2026, with expectations to surpass human standards within the year. Enterprise governance must adapt to a reality where the baseline quality of automated output is structurally superior to average manual coding. Security Auditing at Scale: The sheer volume of automated code creation demands automated vulnerability discovery. Anthropic’s Project Glasswing illustrates the scale of this issue: utilizing Mythos Preview, the project identified more than 10,000 high- and critical-severity software vulnerabilities across global digital infrastructure within its first few weeks. This shifted the enterprise cybersecurity challenge entirely from vulnerability discovery to patch deployment velocity. 
The Risk of Alignment Cascades: Technical leaders must maintain strict verification gates. If an enterprise uses an AI system to continuously modify, maintain, and expand its proprietary software infrastructure, undetected errors or subtle misalignments can compound over successive agent sessions, gradually corrupting system integrity or introducing security exploits that escape human notice. Brace for internal enterprise culture disruption The transition to an AI-dominated codebase is altering the cultural dynamics of engineering teams, introducing both unprecedented efficiency and deep psychological friction. Publicly, Anthropic framed these metrics as a harbinger of a broader transformation. In an official statement on X, the company observed: "Our internal data shows Claude is accelerating AI development—a possible path to recursive self-improvement, or AI autonomously building a more capable successor. It’s happening faster than we thought, and the implications deserve greater attention." They expanded on the immediate productivity implications shortly thereafter: "Today, Anthropic engineers on average ship 8x as much code per quarter as they did compared to 2021-2025... Many engineers also say Claude’s code quality is now on par with human code; we expect it to be better within the year." Behind these corporate metrics lies a complex human reality. Internal employee communications reveal a distinct erosion of traditional workplace collaboration, as peer-to-peer developer interaction is systematically replaced by asynchronous agent calls: "Work (and life) ran on a gift economy of small favors between humans. ‘Can you help me get this script running?’ [...] each one created a little debt, a little mutual awareness. Claude has eaten the favors. It’s faster, it creates zero debt, but each of these is a lost bid for human collaboration." 
For individual contributors, the total automation of their primary skill set introduces acute professional anxiety regarding relevance and systemic control: "I started leaning hard into Claudifying about a year ago. That’s been a crazy adventure and it’s now been ~5 months since I last wrote any code myself." "On days where everything works well, I can’t help but think nothing I do matters, everything is automated and better and faster than I ever will be. But then there are days where everything breaks and I don't understand why and I realize I have no idea what I’ve been up to anymore." Enterprise leaders aiming to match Anthropic’s technical velocity cannot afford to ignore these psychological dynamics. Achieving an 80 percent automated codebase requires more than purchasing API tokens or configuring agent loops; it demands a total cultural overhaul, a strategy for mitigating developer obsolescence anxiety, and the implementation of rigorous, automated verification guardrails to maintain ultimate human control over the software stack.
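The Amdahl's law point from the code-review step above can be checked with simple arithmetic. The sketch below is illustrative only: the function name and the 70/30 split between code-writing time and review time are assumptions chosen for the example, not figures from Anthropic's report.

```python
def amdahl_speedup(automated_fraction: float, automated_speedup: float) -> float:
    """Overall speedup when only part of a process is accelerated (Amdahl's law).

    automated_fraction: share of total time spent in the step being sped up (0..1).
    automated_speedup:  how many times faster that step becomes.
    """
    if not 0.0 <= automated_fraction <= 1.0:
        raise ValueError("automated_fraction must be between 0 and 1")
    if automated_speedup <= 0:
        raise ValueError("automated_speedup must be positive")
    serial_fraction = 1.0 - automated_fraction  # e.g. human code review
    return 1.0 / (serial_fraction + automated_fraction / automated_speedup)


if __name__ == "__main__":
    # Hypothetical team: 70% of cycle time is writing code, 30% is human review.
    print(round(amdahl_speedup(0.70, 10), 2))    # writing gets 10x faster
    print(round(amdahl_speedup(0.70, 1e9), 2))   # writing becomes almost free
    # Automating most of the review moves more work into the accelerated share.
    print(round(amdahl_speedup(0.95, 10), 2))
```

Under these assumed numbers, a 10x faster writing step yields only about a 2.7x overall gain, and the gain can never exceed roughly 3.3x while review stays manual. Moving review into the automated share (95% of the work accelerated) lifts the overall figure to about 6.9x, which is the argument for putting automated reviewers in the pipeline.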

While many AI open source model providers are pursuing larger and more powerful models, Google is still giving attention to the smaller, more local side of the market. Today, the tech giant released Gemma 4 12B, an 11.95-billion-parameter open-weights model with a permissive Apache 2.0 license, optimized to execute locally on a standard enterprise laptop using just 16GB of VRAM or unified memory. That means those enterprise users looking to keep working with AI while on a flight without WiFi, or trying to keep it offline for security reasons, can now do so far more easily and at far less cost (free to download and operate). Gemma 4 12B's most notable breakthrough is an encoder-free "Unified" architecture, which allows raw audio waveforms and visual patches to flow directly into the core LLM backbone without the latency or memory overhead of secondary processing modules. Available immediately for download on Hugging Face and Kaggle and for use on Google AI Edge Gallery, Gemma 4 12B packs a 256K token context window, native agentic tool-use capabilities, and an explicit step-by-step reasoning mode into a highly optimized footprint that bridges the gap between mobile edge models and heavy data-center infrastructure.
The Architectural Shift: Understanding the Encoder-Free Advantage
Gemma 4 12B is highly relevant to enterprise architecture due to its novel "Unified" structure. Traditional multimodal systems typically utilize discrete, separate encoders to translate audio waveforms and visual data into representations that the core language model can process. This conventional approach inherently increases both inference latency and total memory consumption. Gemma 4 12B radically alters this pipeline by functioning entirely without these secondary encoders. Instead, visual patches and raw audio waveforms are projected directly into the core large language model's embedding space through lightweight linear layers.
The vision encoder is replaced by a 35-million-parameter module utilizing a single matrix multiplication, while the audio encoder is eliminated entirely. For enterprise engineering teams, this unified architecture delivers distinct operational advantages: lower latency for multimodal tasks, reduced VRAM requirements (down to 16GB — typical for laptops), and the ability to fine-tune the entire multimodal system in a single, cohesive pass. Performance Metrics and Core Capabilities Despite its compact size, Gemma 4 12B achieves benchmarks nearing Google's larger 26B Mixture-of-Experts model. Beyond static benchmarks, the model supports a massive 256K token context window. This is critical for enterprises needing to process lengthy financial reports, extensive code repositories, or hour-long meeting transcripts. Furthermore, Gemma 4 12B includes a native "thinking" mode to map out step-by-step reasoning before generating a response. It also features out-of-the-box support for native function calling and system prompts, which are essential prerequisites for building highly capable autonomous software agents. The Enterprise Verdict: Should You Adopt Gemma 4 12B? The short answer is yes, provided your operational needs align with edge computing, strict data privacy, or agentic automation. However, adoption should not be a blanket replacement for all existing AI infrastructure. Instead, technical leaders should view Gemma 4 12B as a specialized tool optimized for specific deployment conditions. Strict Data Privacy and Compliance Mandates: Many enterprises operate in highly regulated sectors—such as healthcare, finance, or defense—where transmitting sensitive data, proprietary code, or confidential internal documents to third-party APIs is unacceptable. Because Gemma 4 12B is small enough to run locally on machines equipped with just 16GB of VRAM or unified memory, organizations can process sensitive multimodal data entirely on-premises or directly on employee laptops. 
This local execution eliminates the risk of data leakage and ensures compliance with strict regulatory frameworks. Multimodal Autonomous Agent Workflows: If your engineering roadmap involves autonomous agents interacting with real-world inputs, Gemma 4 12B is uniquely positioned to serve as the reasoning engine. The combination of native function calling, robust coding capabilities, and the capacity to ingest real-time audio and variable-resolution images makes it highly suitable for agentic tasks. Google has simultaneously released a dedicated Gemma Skills Repository to explicitly support agentic development with these new models. Cost-Sensitive Edge Deployments: For applications operating at the edge—such as retail inventory monitoring via cameras, localized customer service kiosks, or offline field-service applications—maintaining a persistent cloud connection is costly and sometimes impossible. The encoder-free architecture significantly lowers the total cost of ownership by reducing the hardware threshold needed for inference. Deploying a highly capable 12B model locally avoids recurring API costs and unpredictable cloud compute billing. When to Consider Alternative Solutions While Gemma 4 12B is powerful, it has specific constraints that technical leaders must acknowledge. Massive Knowledge Retrieval: Like all large language models, Gemma 4 12B is a reasoning engine, not a static database. If your primary use case relies on vast, generalized factual retrieval without leveraging a robust Retrieval-Augmented Generation pipeline, you may still require larger foundation models. Extended Video and Audio Processing: The model has hard limits on media ingestion. Audio inputs are strictly capped at 30 seconds of processing, and video understanding is limited to 60 seconds (assuming a processing rate of one frame per second). 
Enterprises looking to process feature-length videos or massive audio archives natively will hit bottlenecks and should consider API-based models or chunking architectures. Implementation and Ecosystem Readiness One of the strongest arguments for enterprise adoption is the model's immediate compatibility with the broader open-source development ecosystem. Google has ensured that Gemma 4 12B is not an isolated experiment; it is ready for production. Weights are available on Hugging Face and Kaggle, and the model integrates seamlessly with industry-standard deployment frameworks such as vLLM, SGLang, MLX, and llama.cpp. For organizations deeply embedded in Google Cloud, endpoints can be spun up quickly using the Gemini Enterprise Agent Platform Model Garden, Cloud Run, or Google Kubernetes Engine. For enterprise leaders aiming to decentralize their AI workloads, Gemma 4 12B offers a rare combination of edge-friendly efficiency and frontier-class reasoning. If your organization requires highly private, multimodal processing without the latency and cost of cloud reliance, Gemma 4 12B should be heavily evaluated for your next production pipeline.
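For teams weighing the chunking route mentioned above, the planning step is straightforward. The following is a minimal sketch, assuming the 30-second audio cap described in the article; the function name `plan_chunks` and the optional overlap between windows are hypothetical choices for illustration, not part of any Gemma tooling.

```python
from typing import List, Tuple


def plan_chunks(total_seconds: float,
                max_chunk_seconds: float = 30.0,
                overlap_seconds: float = 0.0) -> List[Tuple[float, float]]:
    """Split a recording into (start, end) windows no longer than the cap.

    overlap_seconds repeats a little audio at each boundary so that words
    cut off at the end of one window also appear at the start of the next.
    """
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    if max_chunk_seconds <= 0:
        raise ValueError("max_chunk_seconds must be positive")
    if not 0 <= overlap_seconds < max_chunk_seconds:
        raise ValueError("overlap must be non-negative and smaller than the chunk")

    step = max_chunk_seconds - overlap_seconds
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break  # the final window reached the end of the recording
        start += step
    return chunks


if __name__ == "__main__":
    # A 75-second clip becomes three windows under a 30-second cap.
    print(plan_chunks(75))
    # With 5 seconds of overlap the windows start every 25 seconds.
    print(plan_chunks(75, overlap_seconds=5))
```

Each window would then be sent to the model separately and the per-window outputs merged afterward. How to merge them (for example, de-duplicating overlapping transcript text) depends on the task and is left out of this sketch.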

Alibaba this week released Qwen3.7-Plus, the latest AI large language model (LLM) in its globally beloved and increasingly expansive Qwen family, boasting more multimodal capabilities and a 60% lower cost than the prior, text-only Qwen3.7-Max model released just weeks ago. However, like its immediate predecessor, Qwen3.7-Plus is available only under a "closed" commercial license via proprietary application programming interfaces (APIs) and Qwen Chat. That marks a big departure from the Qwen strategy to date, which was focused mainly on releasing powerful, near state-of-the-art open source models. Those enterprises and users who relied on the open source Qwen models — among them, U.S. giants such as Airbnb — will no doubt be disappointed to see that Alibaba is going closed for its newer releases. Still, the model is worth a look because of its low cost and high performance on multimodal tasks like creating enterprise-grade visuals or analyzing video, imagery and screenshots, which Qwen3.7-Max cannot do (it's text-only). It is among the cheaper powerful AI models available now, priced just above the limited-time discount pricing for Chinese rival MiniMax's new MiniMax-M3.
VentureBeat Frontier AI Model API Pricing Snapshot (USD per 1M tokens)
Model                          Input    Output   Total Cost   Source
MiMo-V2.5 Flash                $0.10    $0.30    $0.40        Xiaomi MiMo
deepseek-v4-flash              $0.14    $0.28    $0.42        DeepSeek
deepseek-v4-pro                $0.435   $0.87    $1.305       DeepSeek
MiniMax-M3                     $0.30    $1.20    $1.50        MiniMax
Qwen3.7-Plus                   $0.40    $1.60    $2.00        Alibaba Cloud
Gemini 3.1 Flash-Lite          $0.25    $1.50    $1.75        Google
MiMo-V2.5                      $0.40    $2.00    $2.40        Xiaomi MiMo
Grok 4.3 low context           $1.25    $2.50    $3.75        xAI
GLM-5                          $1.00    $3.20    $4.20        Z.ai
Kimi-K2.6                      $0.95    $4.00    $4.95        Moonshot/Kimi
GLM-5.1                        $1.40    $4.40    $5.80        Z.ai
Grok 4.3 high context          $2.50    $5.00    $7.50        xAI
Qwen3.7-Max                    $2.50    $7.50    $10.00       Alibaba Cloud
Gemini 3.5 Flash               $1.50    $9.00    $10.50       Google
Gemini 3.1 Pro Preview ≤200K   $2.00    $12.00   $14.00       Google
GPT-5.4                        $2.50    $15.00   $17.50       OpenAI
Gemini 3.1 Pro Preview >200K   $4.00    $18.00   $22.00       Google
Claude Opus 4.8                $5.00    $25.00   $30.00       Anthropic
GPT-5.5                        $5.00    $30.00   $35.00       OpenAI
Maintaining continuity during complex tool execution loops
For technical decision-makers deploying autonomous agents, the primary bottleneck has rarely been initial model intelligence. Instead, it is state decay—the tendency of an agent framework to lose its analytical trajectory over multi-step, long-horizon tasks. Qwen3.7-Plus addresses this architectural vulnerability through a combined approach to context management and reasoning state preservation. The model ships with a 1-million token context window and allocates up to 256K tokens specifically for internal chain-of-thought processing. To contextualize this capacity, imagine an automated cloud migration agent: it can ingest an entire codebase, map out the dependencies, and spend thousands of tokens quietly evaluating edge cases before executing a single line of bash script. Crucially, the API exposes a parameter called 'preserve_thinking.' Across Alibaba's ecosystem, the capability serves as a standardized architectural bridge rather than a tiered perk.
Alibaba introduced the feature during the prior Qwen 3.6 generation, integrating it into both the open-weight Qwen3.6-27B and the proprietary Max models. At its core, the parameter operates at the API and template level to retain internal <think> blocks across continuous conversational turns. This structural continuity solves a critical bottleneck for developers engineering long-horizon tasks. By keeping these internal logic loops intact, the feature prevents the model from dropping its context or needlessly recomputing its cached history midway through an operation. When a model executes complex, multi-step agentic coding assignments, this retention allows the system to hold onto its original train of thought without losing the plot or forgetting the underlying logic of its previous actions. Alibaba remains far from alone in recognizing this technical necessity, as the underlying concept now dictates the architecture of nearly all major artificial intelligence laboratories. Anthropic deploys this exact capability under the moniker "Extended Thinking" for its advanced models, including its latest Claude Opus 4.8. This framework requires developers to feed unmodified thinking blocks directly back into the API on subsequent turns to maintain an unbroken chain of reasoning. OpenAI tackles the same challenge through an encrypted reasoning pass-back mechanism for models like GPT-5.5. Within the OpenAI ecosystem, developers must return specific reasoning items generated alongside previous function calls, ensuring the model explicitly remembers the rationale behind its tool executions. Ultimately, preserve_thinking simply represents Alibaba's terminology for what has rapidly become the undisputed table stakes for modern multi-turn reasoning. Benchmarks show a competitive, yet sub state-of-the-art model On raw capability metrics, this deep-thinking architecture translates to structural gains across multimodal and agentic benchmarks. 
However, it still trails many leading U.S. proprietary models, including prior-generation ones such as Anthropic's Claude Opus 4.6 and OpenAI's GPT-5.4. On Terminal Bench 2.0-Terminus, which measures a model's capability to run actual terminal-level code safely and iteratively, Qwen3.7-Plus scored 70.3, outperforming DeepSeek-V4-Pro Max (67.9) and Gemini-3.1 Pro (63.5). On computer vision benchmarks that demand localized interface understanding, such as ScreenSpot Pro, the model hit 79.0, significantly outpacing legacy industry standouts like GPT-5.4 (xhigh) at 67.4 and Claude-Opus-4.6 at 49.5.
Agent Evaluation Metrics (Selected Benchmarks)
What should enterprises consider Qwen3.7-Plus for?
For an enterprise architect, the key question when analyzing Qwen3.7-Plus is clear: What does this replace in our current tech stack? The model is designed to step in as a direct replacement for premier frontier models (such as GPT-5-tier or Claude-Max-tier models) within high-frequency developer workflows, robotic process automation (RPA), and data engineering pipelines. Rather than deploying an expensive, general-purpose flagship model to handle repetitive system operations, technical teams can route these tasks to Qwen3.7-Plus. It handles visual interface interpretation, command execution, and code generation simultaneously. Alibaba has structured its API delivery to align with existing open-source and proprietary enterprise frameworks. The endpoints are fully OpenAI-compatible, meaning swapping out existing dependencies requires minimal infrastructure adjustment. For groups leveraging autonomous terminal frameworks, the integration is natively supported across multiple environments. Engineers can run Qwen3.7-Plus directly through their local terminal setups by altering base environment targets. From a pure cost perspective, running an agent framework that constantly references massive code repositories or visual layout histories can quickly become cost-prohibitive.
Alibaba addresses this by exposing granular caching price points. Standard input processing sits at $0.40 per million tokens, but if the agent is reading from an explicitly created cache (e.g., a massive base repository or standard enterprise UI kit that remains static over hundreds of automated loops), the cost drops sharply to $0.04 per 1M tokens for subsequent reads. This tier makes high-frequency, multi-turn agent iterations economically practical at an enterprise scale. No open source license or open weights raises the compliance question for enterprises When evaluating any model in the Qwen ecosystem, a primary concern for legal and security teams is the licensing framework and operational boundary of the data pipeline. While previous iterations of the Qwen family gained significant enterprise traction via fully open-source weight availability under the Apache 2.0 or customized open-use licenses, Qwen3.7-Plus is delivered strictly as a managed, commercial cloud API via Alibaba Cloud Model Studio. For enterprise risk management, this distinction carries specific implications: No Local Weight Deployment: Organizations cannot download, sandbox, or locally host the weights of Qwen3.7-Plus within their completely air-gapped internal data centers. All data verification, visual processing, and execution calls must step through Alibaba Cloud's international endpoints (e.g., the Singapore instance highlighted in developer documentation). Compliance and Sovereignty: Since the model requires cloud-based inference, companies operating under strict sovereign data boundaries (such as healthcare entities subject to local HIPAA/GDPR constraints or defense contractors) must explicitly evaluate whether external API routing complies with their specific data-residency obligations. 
Managed Risk Mitigation: Conversely, a managed API structure removes the internal infrastructure burden of provisioning, optimizing, and maintaining multi-GPU clusters (such as dedicated Nvidia H100 arrays) simply to host an internal agent network.
Still, Qwen3.7-Plus offers high intelligence across modalities at low cost
The initial reception from developer communities and technical venture capital highlights the shifting economics of agent deployment. Prominent industry voice and Web3 venture capitalist @Boxmining highlighted the strategic cost advantage, stating: "Qwen 3.7 Plus being 40% cheaper than Max changes the conversation. If the output is close enough for most coding and much stronger for visual workflows, do you really need Max every day or only for the heavy terminal-only jobs?" This perspective aligns with the current trend of optimizing enterprise operational budgets: shifting away from raw, unconstrained compute toward targeted task automation.
At the same time, specialized researchers deep within the ecosystem point out that this isn't merely an incremental optimization of text generation. Dunjie Lu, a research intern at Alibaba Qwen, remarked: "It shows clear gains over Qwen3.6-Plus in computer-use capabilities, with stronger generalization beyond general desktop tasks into professional workflows such as data engineering and scientific research." Ultimately, for enterprise buyers deciding on their next infrastructure roadmap, Qwen3.7-Plus presents a practical alternative. If your organization's primary objective is building resilient, visual-capable autonomous software loops that interact directly with developer environments and cloud consoles—without blowing out your inference budget—the model provides a compelling reason to shift execution away from more expensive frontier alternatives.
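The caching economics described earlier ($0.40 per million tokens for standard input, $0.04 per million for cached reads) can be made concrete with a short calculation. This is a rough sketch under stated assumptions: the workload (a 200,000-token static context re-read over 500 agent loops) is hypothetical, the first read is assumed to be billed at the standard rate, and any separate cache-creation or storage fee is ignored because the article does not state one.

```python
STANDARD_INPUT_USD_PER_M = 0.40  # standard input price quoted in the article
CACHED_READ_USD_PER_M = 0.04     # cached-read price quoted in the article


def input_cost_usd(context_tokens: int, loops: int, use_cache: bool) -> float:
    """Input cost of re-reading the same static context on every agent loop.

    Assumption: with caching, the first read is billed at the standard rate
    and every later read at the cached rate. Output tokens are not counted.
    """
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    if loops < 1:
        raise ValueError("loops must be at least 1")
    millions = context_tokens / 1_000_000
    if not use_cache:
        return millions * loops * STANDARD_INPUT_USD_PER_M
    first_read = millions * STANDARD_INPUT_USD_PER_M
    later_reads = millions * (loops - 1) * CACHED_READ_USD_PER_M
    return first_read + later_reads


if __name__ == "__main__":
    uncached = input_cost_usd(200_000, 500, use_cache=False)
    cached = input_cost_usd(200_000, 500, use_cache=True)
    print(f"without cache: ${uncached:.2f}")
    print(f"with cache:    ${cached:.2f}")
```

For this assumed workload the input bill falls from $40.00 to roughly $4.07, close to the full 10x price ratio, because almost every read after the first is served from cache.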

Perplexity AI, the fast-growing search startup now valued at $20 billion, unveiled what it calls the first hybrid local-server inference orchestrator at Computex 2026 on Monday night, demonstrating software that autonomously decides — in real time and mid-task — which AI workloads stay on a user's device and which get routed to frontier models in the cloud. CEO Aravind Srinivas demonstrated the system onstage alongside Intel CEO Lip-Bu Tan during Intel's keynote address, using Perplexity's "Personal Computer" agent to process confidential deal materials. In the demonstration, local models running on Intel Core Ultra Series 3 determined which information should remain on the device and which information could be sent to cloud-based models. Srinivas said the approach balances intelligence, accuracy, privacy, and cost. The key claim is not that a model can run locally — dozens of tools already do that. It is that Perplexity's system makes the routing decision itself, task by task, without requiring the user to choose in advance. Sensitive data like financial records or health information stays on the local machine; the heavier reasoning tasks that require frontier-scale models get sent to the cloud. One task, multiple execution locations, automatic orchestration. "No product has done this before," a Perplexity spokesperson said in an email to VentureBeat. The product is not yet available to users; according to the company, the hybrid inference feature will launch in the coming weeks. Perplexity's road from cloud-only agents to on-device AI orchestration To understand why the Computex demonstration matters, it helps to trace the product arc Perplexity has been building since early this year. On February 25, Perplexity launched Computer, a multi-model AI agent that orchestrates 19 different AI models to complete complex, long-running tasks on behalf of users. 
The system ran entirely in the cloud, breaking goals into subtasks and routing each to whichever model — Claude, Gemini, GPT, Grok, or others — was best suited for the job. Perplexity Computer unified every current AI capability into a single system, functioning as a general-purpose digital worker that operates the same interfaces a user does. Then, in March, Perplexity introduced Personal Computer at its inaugural Ask 2026 developer conference. That product launched as a new Mac app with support for a hybrid local-cloud AI agent, which Perplexity described as a "personal orchestrator" that hybridizes local and server environments for security and productivity. Personal Computer could access the Mac's file system and native Mac apps to create and execute entire workflows, with files created in a secure sandbox and all actions auditable and reversible. What Srinivas demonstrated at Computex extends this architecture in a fundamental way. Previously, even the Personal Computer product divided labor along relatively clear lines: local file access on the device, heavy computation on Perplexity's servers. The new hybrid inference orchestrator gives the system itself the ability to reason about where each piece of a task should execute — not just which model to use, but which physical location should process it. The system reportedly asks for user permission before sending sensitive tasks to the cloud, a design choice that addresses one of the central anxieties enterprises have about agentic AI: data governance. Why Nvidia’s RTX Spark and Intel's new silicon make the timing strategic The timing of the demonstration is not coincidental. Computex 2026 has been dominated by a single theme: on-device AI. Just hours before the Intel keynote, Nvidia CEO Jensen Huang unveiled the RTX Spark, a new Arm-based superchip that the company positions as the foundation for a new generation of AI-native Windows PCs. 
At full strength, the RTX Spark Superchip offers up to 20 Arm CPU cores, a Blackwell GPU with 6,144 CUDA cores, 128GB of LPDDR5X RAM, and up to 300 GB/s of memory bandwidth — enough power and memory for AI agents and 120-billion-parameter models with context lengths stretching to a million tokens. RTX Spark systems will begin arriving in the fall. Intel, not to be outdone, used its keynote to showcase Xeon 6+ processors with 288 efficiency cores built on 18A technology for the data center, and positioned its Core Ultra Series 3 as the client silicon that makes hybrid inference possible on the PC. Perplexity's hybrid orchestrator sits at the intersection of both strategies. If the system performs as advertised, it creates a direct economic incentive for users — and eventually enterprises — to invest in more powerful local silicon. The more capable the on-device chip, the more inference can run locally, reducing cloud costs and improving latency for sensitive workloads. That dynamic benefits Nvidia, Intel, and every other chipmaker competing for AI PC sockets. The implications extend well beyond chip economics. "As chips become more powerful, more intelligence moves onto a person's machine, alongside server inference for the complex tasks that still need frontier models," a Perplexity spokesperson told VentureBeat. "Sensitive and sovereign work can stay local, which changes the need for massive country-level infrastructure." That last claim — about sovereign infrastructure — is the most provocative. Nations from the UAE to France to India have been investing billions in domestic AI compute capacity partly on the assumption that sensitive data must stay within their borders, which means building or buying access to local data centers. If meaningful inference can run on an end user's device with no data leaving the machine, the calculus changes. It does not eliminate the need for data centers, but it could soften the urgency of the buildout. 
The model-agnostic architecture that makes hybrid inference possible Perplexity's hybrid inference play rests on the same architectural bet the company has been making all year: that the orchestration layer matters more than any individual model. For AI engineers, this signals a fundamental shift — the orchestration layer may matter more than the models themselves. The key insight is separation of concerns: the orchestration layer handles task decomposition, state management, and tool coordination, while the model layer handles specific computations. This decoupling means teams can swap models as better alternatives emerge without redesigning the entire system. Perplexity has leaned heavily into this philosophy. The company is doubling down on packaging frontier models in a consumer-friendly user experience, arguing that there is value in orchestrating multiple third-party LLMs to obtain the most cost-effective and accurate answers to queries. Models, in Perplexity's view, are specializing, not commoditizing. The hybrid inference extension takes that logic one step further. Perplexity is now orchestrating not just across models but across physical compute locations — choosing which model runs where. A lightweight local model might handle a privacy-sensitive document summarization task while a frontier cloud model tackles the complex reasoning required to analyze that summary against a broader market landscape. The orchestrator manages the handoff. This is a technically ambitious claim. Making it work reliably in production will require the orchestrator to accurately assess the complexity of each subtask, understand the sensitivity of the data involved, know the capabilities and latency characteristics of whatever local hardware the user has, and manage the state of a task that may be bouncing between environments mid-execution. 
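Those requirements can be made concrete with a minimal Python sketch of a task-level routing decision. Perplexity has not published its implementation, so this is not it: the `Subtask` fields, the `LOCAL_MAX_COMPLEXITY` threshold, and the `route` function are hypothetical names chosen for illustration, and a real orchestrator would also have to weigh hardware capability, latency, and mid-task state.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    LOCAL = "local"
    CLOUD = "cloud"
    ASK_USER = "ask_user"


@dataclass(frozen=True)
class Subtask:
    name: str
    complexity: float  # hypothetical 0.0-1.0 difficulty estimate
    sensitive: bool    # hypothetical data-sensitivity flag


# Hypothetical ceiling on what the local model handles well.
LOCAL_MAX_COMPLEXITY = 0.5


def route(task: Subtask, cloud_consent: bool = False) -> Target:
    """Pick an execution location for one subtask."""
    if task.complexity <= LOCAL_MAX_COMPLEXITY:
        # Simple enough for the device, so nothing leaves the machine.
        return Target.LOCAL
    if task.sensitive and not cloud_consent:
        # Too hard for local hardware, but sensitive: ask before sending.
        return Target.ASK_USER
    return Target.CLOUD


plan = [
    Subtask("summarize confidential document", 0.3, True),
    Subtask("compare summary with market landscape", 0.9, False),
    Subtask("deep analysis of confidential filing", 0.8, True),
]
for task in plan:
    print(f"{task.name}: {route(task).value}")
```

In this sketch the sensitive-but-hard case is the interesting one: the router neither degrades quality by forcing the work onto the device nor silently ships the data to the cloud, which mirrors the permission prompt described in the demo.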
It is easy to imagine edge cases where the routing logic fails, sends something sensitive to the cloud, or degrades performance by assigning a task to an underpowered local model. Perplexity says the system will be chip-agnostic, though the initial Computex demo ran on Intel silicon. The company expressed enthusiasm in its communications about the new AI chips announced at Computex this week, suggesting it intends to optimize across vendors. A $20 billion valuation, nine lawsuits, and the pressure to deliver The hybrid inference announcement arrives at a complicated moment for Perplexity. The company has been on a remarkable growth trajectory: It secured $200 million in new capital at a $20 billion valuation, just two months after raising $100 million at an $18 billion valuation. Since its founding three years ago, the rapidly growing AI company has raised $1.5 billion in total funding, according to PitchBook data. But the company also faces a mounting stack of legal challenges. Nine organizations have filed active suits against Perplexity for alleged copyright and trademark infringement as of May 31, 2026: CNN, the New York Times, News Corp and Dow Jones, the New York Post, the Chicago Tribune, Encyclopedia Britannica, Merriam-Webster, Reddit, and Japan's Yomiuri Shimbun. The CNN lawsuit, filed just days ago on May 28, is the most recent, accusing Perplexity of scraping more than 17,000 CNN stories, photos, videos, and other content and using that material to train its products. Perplexity has responded with a consistent message. "You can't copyright facts," the company's chief communications officer Jesse Dwyer said in a statement. Other publishers have opted for partnership over litigation. Time, Gannett, Le Monde, and Der Spiegel have signed licensing arrangements with Perplexity. The company launched a Publishers Program in mid-2024 in which participating outlets receive a share of revenue generated when their content is cited in Perplexity answers. 
According to CNBC, Perplexity's chief business officer Dmitry Shevelenko confirmed at the time that the flat rate was a double-digit percentage but declined to share specifics. As TechCrunch reported in December 2024, additional publishers including the LA Times, Adweek, The Independent, and Lee Enterprises subsequently joined the program, though not without internal controversy — reporters at some outlets told TechCrunch they were not informed of the deals before they were announced publicly. The legal risk is not existential, but it is material, and with enterprises increasingly evaluating Perplexity's tools for sensitive workflows — precisely the use case the hybrid inference system is designed to serve — unresolved intellectual property questions could dampen adoption. How hybrid inference sharpens Perplexity's enterprise ambitions The hybrid inference demo should be read alongside Perplexity's broader push into enterprise software, a transformation that accelerated dramatically this year. At the Ask 2026 developer conference in March, VentureBeat reported that Perplexity announced Computer for Enterprise, positioning the three-year-old startup as a direct competitor to Microsoft, Salesforce, and the legacy enterprise software stack. Beyond Computer's existing 100-plus integrations, enterprise customers gained access to business-grade connectors for Snowflake, Datadog, Salesforce, SharePoint, and HubSpot, with administrators able to install custom connectors via the Model Context Protocol. The package also includes purpose-built workflow templates for legal contract review, finance audit support, sales call preparation, and customer support ticket triage, alongside SOC 2 Type II certification and the option for zero data retention. Hybrid inference deepens this enterprise pitch considerably. 
For regulated industries — financial services, healthcare, defense, legal — the ability to keep sensitive data on a local device while still accessing the reasoning power of frontier cloud models is not a nice-to-have. It is a potential compliance requirement. An investment bank parsing confidential deal documents, for instance, might be unable to send those materials to a third-party cloud under existing data handling agreements. A system that can run the sensitive parsing locally while routing non-sensitive analytical tasks to the cloud offers a middle path. IDC forecasts a tenfold increase in agent usage and a thousandfold growth in inference demands by 2027, and security and governance rank as the top evaluation factor for enterprise agentic platforms, according to a CrewAI survey. Hybrid inference speaks directly to that priority. The race to decide where AI actually runs is just getting started Several questions will determine whether Perplexity's Computex demonstration becomes a landmark product or a compelling prototype. The actual performance characteristics remain untested outside a controlled stage environment — how the routing logic handles varied hardware configurations, unreliable network connections, and ambiguous data sensitivity classifications is an open question. The competitive response matters too: Google, Microsoft, Apple, and OpenAI are all building their own local-cloud AI architectures. Apple Intelligence already routes some tasks locally and some to Private Cloud Compute servers, Google's Gemini Nano runs on-device, and Microsoft's Copilot+ PCs are designed around local inference capabilities. None of these systems, however, currently offer the kind of dynamic, autonomous task-level routing Perplexity claims. Even if the technology works as demonstrated, there is the question of whether the business can keep pace with the ambition. 
At a $20 billion valuation with approximately $200 million in annual recurring revenue, Perplexity trades at roughly 100x revenue, a premium requiring aggressive growth to justify. Management's $656 million 2026 revenue target implies 230% growth, creating significant execution pressure. Perplexity has built its business on a bet that the future belongs not to any single model but to the system that orchestrates all of them. At Computex, it extended that bet from the software layer to the physical layer — from which model to which machine. In the AI industry's relentless race to build bigger data centers and train larger models, Perplexity just argued that the most important computer in the stack might be the one already sitting on your desk.
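The valuation arithmetic above can be checked in a few lines of Python. The inputs are the figures quoted in the article; the one assumption is that the roughly $200 million in annual recurring revenue is the baseline for the growth figure.

```python
valuation = 20_000_000_000   # $20 billion valuation
arr = 200_000_000            # approx. $200 million annual recurring revenue
target_2026 = 656_000_000    # management's 2026 revenue target

# Valuation divided by current revenue run rate.
revenue_multiple = valuation / arr

# Growth needed to go from the ARR baseline to the 2026 target.
implied_growth_pct = (target_2026 / arr - 1) * 100

print(f"Revenue multiple: {revenue_multiple:.0f}x")
print(f"Implied growth: {implied_growth_pct:.0f}%")
```

The multiple comes out at 100x, and the implied growth at about 228%, which the article rounds to 230%.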

In Q1 2026, VentureBeat's Pulse Research surfaced the “Governance Mirage”: the gap between the governance org charts enterprises had drawn and the control layers they had actually built. Forty-three percent said a central team owned AI governance; 23% couldn't agree on who owned it at all; and 31% named vendor opacity as the single biggest obstacle. This new wave of research asks the next question: Once you've admitted the governance problem, what breaks first when you try to fix it? The answer from our respondents is unambiguous. The failure point is not the model. It's the runtime. Enterprises are discovering that AI agents built on stateless infrastructure — Python scripts, LangChain chains, ad hoc orchestration — cannot survive the operational realities of production. Container restarts erase context. Token costs breach business cases. Hallucinations in Step 3 compound into catastrophic failures by Step 12. And the majority of engineering teams are spending more time managing this "plumbing" than building the intelligence that was supposed to justify the investment. What emerges from this survey is a picture of an industry at a critical fork. The organizations that survive the Agentic Reckoning will be those that treat runtime durability as a first-class engineering concern — not an afterthought to be patched with retries and prompting. The ones that don't will find themselves back where RPA left enterprises a decade ago: a graveyard of clever pilots that couldn't survive Day Two. Methodology VentureBeat conducted this survey in May 2026 as part of its ongoing Pulse Research series on agentic AI adoption in the enterprise. Respondents were filtered to organizations with 100 or more employees. The final qualified sample consists of 132 verified, highly qualified technology leaders at the forefront of enterprise AI agent deployment. 
They span:

Directors of AI/Analytics (8%)
Directors of Engineering/IT (16%)
VP of Data/AI/Analytics (5%)
VP of Engineering/IT (5%)
CIOs/CTOs/CISOs (15%)
Product and Program Managers (13%)
Consultants (9%)
Software and ML Engineers (9%)
Enterprise Architects (8%)
Other (12%)

Industries represented include Technology/Software (42%), Financial Services (20%), Professional Services (8%), Healthcare/Life Sciences (7%), Retail/Consumer (6%), Education (4%), and others. Given our strict filtering criteria, this cohort provides a robust and authoritative look at emerging agentic infrastructure trends.

Respondent demographics by company size:

Large enterprise (10,000+ employees): 35% of the sample
Mid-to-large enterprise (500–9,999 employees): 48% of the sample
Growth enterprise (100–499 employees): 17% of the sample

These quantitative findings capture a critical moment in infrastructure evolution and are best synthesized alongside VentureBeat’s Q1 2026 governance reports and our deep-dive practitioner conversations conducted throughout the quarter.

Finding 1: The runtime is the problem

The "spine vs. brain" debate is over

The foundational question of enterprise AI in 2026 is whether agent failures trace back to the model's reasoning capability — the Brain — or to the runtime infrastructure's inability to manage state, survive failures, and coordinate execution — the Spine. We asked our respondents directly. Integration/governance challenges were the biggest problem. But Spine issues were close behind. However, 17% still say the Brain is the primary failure mode. That’s not a rounding error — it’s a signal. The organizations in this cohort are not disputing the infrastructure problem; they are telling us that the models themselves are not yet reliable enough for the edge cases their workflows are generating. The model-versus-runtime debate is genuinely three-sided. Read together, these three answers are not fully in conflict.
The Spine and Gap camps are struggling with infrastructure and governance respectively. The Brain cohort is struggling with something upstream: reasoning reliability at scale. This is a significant finding. The frontier model wars — GPT-5 vs. Claude 4.7 vs. Grok — are consuming enormous mindshare in the enterprise technology press. Our respondents are telling us that war is, for now, beside the point. The models are smart enough, but the infrastructure around them is not. "The models are smart enough, but our stateless infrastructure is too fragile to manage long-running, multi-step agentic processes." — Director of Engineering / IT, Financial Services, 10,000–49,999 employees Finding 2: The DIY tax is eating teams alive Engineering capacity is being consumed by plumbing, not intelligence If the Spine is a primary failure mode, what does that cost in practice? We asked respondents what percentage of their team's weekly engineering capacity is consumed by building and maintaining custom "plumbing" — manual retries, state-persistence, checkpointing — rather than actual agentic logic. The results reveal a market in two distinct camps, with a dangerous middle. The arithmetic is stark. Seventy-seven percent of respondents are spending meaningful engineering time on infrastructure overhead. Just 23% — those whose frameworks are handling reliability — have escaped the tax. The distribution is notably flat: the Crisis and Efficiency poles are the same sizes as the middle categories (Trap and Maintenance Tax). This is the signature of a market that has partially addressed the worst failures but has not yet escaped the structural overhead. The Efficiency Zone respondents are not necessarily in a more sophisticated position. In many cases, they may be on managed platforms that abstract away the durability problem — or they may simply not yet have hit the scale at which stateless architectures begin to fail. The Complexity Trap is often where the Efficiency Zone ends. 
There’s a direct business consequence for organizations in the Crisis zone. Every engineering hour spent writing retry logic or debugging a "ghost failure" — a silent API timeout that leaves an agent hanging without a traceback — is an hour not spent on the differentiated logic that was supposed to justify the AI investment in the first place. Finding 3: State amnesia is the production killer The No. 1 technical obstacle has shifted: Cost and hallucination now lead state failures When AI agents fail to reach production or scale, what is the primary technical obstacle? We named five candidates, ranging from model hallucination to cost overruns to latency failures. Hallucination Propagation at 24% compounds silently — reasoning errors in early steps become catastrophic by Step 10. Ghost Failures at 20% are invisible by definition, which means their real prevalence is likely higher than this number suggests. Finding 4: The observability tax falls heaviest on Microsoft Platform visibility costs are not equally distributed Our Q1 2026 research identified vendor opacity as the single biggest obstacle to AI governance — ahead of talent gaps, tooling, and budget. That finding pointed to this question: Which vendor ecosystem, in practice, imposes the highest cost to achieve basic production visibility? We asked respondents which platform requires the most custom telemetry, manual instrumentation, and "logging glue" to achieve visibility into agentic failures. Microsoft's position at the top of this ranking is not noise. It is a structural characteristic of the Microsoft agentic ecosystem — the same Azure/Copilot stack that dominates enterprise AI adoption requires the most instrumentation overhead to see inside. It also reinforces the warning that Brian Gracely, Senior Director at Red Hat, made at VentureBeat’s Boston event in March: that building your control system entirely inside one cloud provider's toolset means "renting a cage." 
The organizations paying the highest observability tax are precisely those most locked into provider-native tooling. The implication for teams currently evaluating orchestration architecture is direct: observability cost is a real budget item that should appear in any build-vs-buy analysis. A platform that appears cheaper at the API layer may impose substantially higher engineering costs at the telemetry layer.

Finding 5: The hype-reality gap belongs to OpenAI and Microsoft

Agentic coding marketing is significantly ahead of production reliability.

We asked respondents a pointed question: Which major platform's Agentic Coding marketing is the most disconnected from the actual technical reliability and fault-tolerance of their product? Thirty-two percent said they didn't know — a figure that has held roughly constant across all three waves, suggesting persistent uncertainty is structural, not a sample artifact. Cursor also registered 6% in this wave. Among those with enough production experience to have a view, Microsoft leads at 45%; OpenAI is second at 22%. The gap is too large to attribute solely to deployment footprint. It suggests that GitHub Copilot Workspaces and AutoGen are generating a specific category of disappointment — probably around the reliability of multi-agent orchestration in production — that accumulates with use. A platform that fewer enterprises are running in production will accumulate fewer credible disappointed practitioners. The more significant observation is what this gap means for decision-makers evaluating new agentic tooling. The marketing around all major platforms describes agentic autonomy and reliability at a level that production deployments are not yet delivering. The organizations in our survey who have moved beyond pilots are encountering the difference firsthand.
Finding 6: The security mesh is being built from first principles Enterprises are not waiting for vendors to solve agent security How are enterprises protecting proprietary research data from AI leakage and prompt-driven exfiltration? The security architecture question is one of the most consequential in agentic AI, because agents — unlike static models — can actively call APIs, traverse file systems, and execute code. The blast radius of a security failure is qualitatively different. Policy-as-Code is a leading security mechanism, but not by much. The NHI and Policy-as-Code approaches are meaningfully different in their security philosophy. NHI is identity-centric: The question it answers is "who is this agent and what is it allowed to touch?" Policy-as-Code is rule-centric: The question it answers is "regardless of what the model decides to do, what hard stops exist at the infrastructure level?" Rough parity across all four mechanisms is the headline finding. This is what market convergence looks like in early motion: No dominant pattern has emerged. Notably, though, Egress-Locked Sandboxing is a relatively new trend in agentic AI deployments, yet it’s already at 22%. As more agents gain terminal-level access to enterprise systems, the cost-benefit of sandboxing is improving. This is notable given the maturity of the identity management and policy-as-code disciplines in traditional IT security. The AI security layer is, for now, being built largely from scratch. The Egress-Locked Sandboxing number deserves attention despite its smaller share. Sandboxing untrusted code execution is the most technically intensive of the four approaches, but it is also the most direct defense against prompt injection attacks that try to execute malicious code through agent tooling. As agentic systems gain more terminal-level access — a trend our survey confirms is accelerating — this approach may prove more important than its current adoption rate suggests. 
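The difference between the two philosophies is easier to see in code. The following Python sketch is a toy deny-by-default check that combines an identity pattern (the NHI question) with a rule list (the Policy-as-Code question). It is an illustration under stated assumptions, not any vendor's product: the `Action` type, the `ALLOW_RULES` entries, the agent names, and the hostname are all hypothetical.

```python
import posixpath
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Action:
    agent_id: str   # identity of the agent requesting the action
    tool: str       # e.g. "read_file", "http_request"
    resource: str   # path or hostname the tool would touch


# Hypothetical rules: (agent pattern, tool, resource pattern).
ALLOW_RULES = [
    ("research-*", "read_file", "/data/public/*"),
    ("research-*", "http_request", "api.internal.example"),
]


def is_allowed(action: Action) -> bool:
    """Deny by default; allow only on an explicit rule match."""
    resource = action.resource
    if action.tool == "read_file":
        # Collapse "../" segments so traversal cannot slip past a pattern.
        resource = posixpath.normpath(resource)
    return any(
        fnmatchcase(action.agent_id, agent_pattern)
        and action.tool == tool
        and fnmatchcase(resource, resource_pattern)
        for agent_pattern, tool, resource_pattern in ALLOW_RULES
    )


print(is_allowed(Action("research-1", "read_file", "/data/public/report.txt")))
print(is_allowed(Action("research-1", "read_file", "/data/public/../secrets/key")))
print(is_allowed(Action("research-1", "exec_shell", "rm -rf /")))
```

The point of the design is that the check runs outside the model. Whatever the agent decides to attempt, an action that matches no rule is refused at the infrastructure level.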
"How do we audit agentic tools that have terminal-level access to our proprietary repos?" — Composite concern expressed by multiple respondents Finding 7: The complexity cliff is real, and most are climbing it The migration away from stateless architectures is underway — but fragmented The central thesis of the Agentic Reckoning is that stateless Python/LangChain architectures cannot survive the complexity cliff — the point at which multi-step, long-running agent workflows begin failing at rates that make production deployment untenable. We asked respondents directly: are you migrating toward durable execution frameworks to solve for state loss? The answers reveal a market in transition, with meaningful disagreement about the right destination. The 20% committed to stateless architectures — attempting to solve a structural durability problem through better prompting — are the cohort most likely to encounter State Amnesia and Ghost Failures as their workloads scale. It’s essentially the same trap that RPA teams fell into a decade ago, when brittle process automations were patched with increasingly elaborate rule sets rather than re-architected on more resilient foundations. The Stateless Commitment cohort deserves a reinterpretation. These teams are not all naive: some are building on managed platforms that genuinely abstract state management. But a portion is patching structural fragility with prompting improvements, and the Ghost Failures data in Finding 3 suggests this approach may be encountering its ceiling. The combined 59% who are either in Active Migration or in Governance-First Evaluation represent the market's leading edge — organizations that have recognized the architectural problem and are investing to solve it structurally. Finding 8: The “polyglot orchestration” lead is narrow — the field is fragmented Architectural conviction is spread across multiple bets What is the longterm architectural philosophy winning enterprises' strategic investment? 
We offered four options representing the major bets available in the current market. The Polyglot Bet's lead suggests that enterprises are seeing advantages in a flexible approach: Using model-driven architectures where non-deterministic reasoning works well, but using deterministic structures and pipelines where accuracy and mission-critical execution are at stake. This has direct competitive implications for the frontier labs and cloud providers. The cohort saying they use a Cloud-Native Managed Stack is significant. This likely reflects the enterprise reality that Azure OpenAI Service and AWS Bedrock deployments come with built-in organizational gravity — procurement relationships, security approvals, and existing data pipelines. The Independent Durable Runtime bet at 16% signals that a cohort of teams has rejected both cloud lock-in and frontier lab dependency in favor of full architectural sovereignty. The Polyglot result also helps explain why the observability and governance problems described in this survey are so persistent. When your architecture deliberately spans multiple orchestration layers and multiple providers, no single vendor's telemetry gives you the full picture. The "Dynatrace for AI" — the unified observability platform called for by Mass General Brigham's CTO Nallan Sriraman at the VentureBeat Boston event — becomes not just desirable but structurally necessary.

"Enterprises trust no single provider enough to give them full control, yet they lack the engineering capacity to build entirely from scratch." — Survey respondent

Finding 9: User acceptance rate is the emerging production standard

The market is settling on a human-trust metric as its primary A-SLA

What metrics are enterprises actually using to determine whether an AI agent is ready for production? We asked respondents to identify their primary Agentic SLA (A-SLA) indicator — the number that, above all others, tells them whether an agent can ship.
User Acceptance Rate as the dominant production metric is significant because it is a human-trust measure, not a technical performance measure. It does not ask whether the agent ran fast or maintained state. It asks whether a human who reviewed its output chose to accept it. This is, in effect, a field-level Turing test applied at the action level. The persistence of UAR as the leading metric reflects the reality of where most enterprise agentic deployments still sit: in a human-in-the-loop posture, where agent actions require human review before execution. That is a rational response to the Hallucination Propagation and Ghost Failures described earlier in this survey. Organizations that have not yet solved runtime durability are, sensibly, keeping humans in the loop — and at 132 respondents, there is no evidence this is changing. Context Fidelity's position at 30% is the most significant finding. It tracks directly with the Active Migration data in Finding 7: As more teams move into durable execution frameworks, the 48-hour+ memory problem becomes their primary production concern. Teams that have solved State Amnesia are now focused on whether their agent can remember what it was doing yesterday. Latency Jitter's collapse from 25% to 11% tells the complementary story: raw speed is no longer the primary anxiety. Correctness and durability have taken its place. The bottom line: The reckoning is runtime, not reasoning The data tells a consistent story: There’s a runtime deficit for agents. Enterprises are spending more time on infrastructure plumbing than on agent intelligence, and State Amnesia is still claiming production deployments. But fault lines are visible. The ROI Ceiling has overtaken State Amnesia as the leading production killer — which means the infrastructure problem is no longer purely a technical one. 
Token economics and orchestration overhead are now consuming enough business value that project sponsors are making the kill decision before engineering teams can solve the durability problem. Hallucination Propagation remains a big problem. The Brain vote in Finding 1 remains significant. And the Polyglot lead is fragile, with varied architectures well represented. The models are, by most respondents' own assessment, smart enough — but 17% disagree. What is not yet smart enough is the infrastructure surrounding them: the state management, the fault-tolerance, the observability, the identity governance, and the deterministic execution layer that turns a model's judgment into something an enterprise can stake its operations on. The 39% making the Polyglot Bet represent the current leading edge of enterprise architectural thinking. They are building systems where the model's intelligence is preserved and leveraged, but where the execution layer — the Spine — is deterministic, auditable, and durable by design. They are not waiting for a frontier lab to solve this for them. They are not betting that better prompting will patch infrastructure fragility. They are building the control plane. The organizations still committed to stateless architectures — still trusting that manual retries and clever prompting can substitute for durable execution — are the ones most likely to contribute to the next wave of this data. Ghost Failures are a primary obstacle. The pattern is familiar: Early adopters diagnose the problem architecturally, migrate to durable runtimes, and escape the failure mode. Late movers inherit it. The Complexity Cliff is not theoretical. It is the wall that most current agentic architectures are already climbing toward. The reckoning is runtime and economics, not reasoning. Based on survey responses from 132 qualified enterprise respondents (100+ employees). Sample size is small; data should be treated as directional. 
Respondents include Directors, VPs, CIOs, CTOs, and Enterprise Architects across Technology, Financial Services, Retail, Healthcare, and other sectors.
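The distinction the survey draws between stateless and durable execution can be illustrated with a small Python sketch. It is a toy checkpointing loop, not a production durable-execution framework: the `run_workflow` function, the JSON checkpoint file, and the two example steps are hypothetical, and the "container restart" is simulated with an exception.

```python
import json
import os
import tempfile


def run_workflow(steps, checkpoint_path):
    """Run steps in order, persisting state after each one.

    On restart, completed steps are skipped and their state is restored.
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
    else:
        saved = {"next_step": 0, "state": {}}

    state = saved["state"]
    for index in range(saved["next_step"], len(steps)):
        state = steps[index](state)
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_step": index + 1, "state": state}, f)
        os.replace(tmp_path, checkpoint_path)
    return state


calls = []


def fetch(state):
    calls.append("fetch")
    return {**state, "rows": 3}


def flaky_summarize(state):
    calls.append("summarize")
    if calls.count("summarize") == 1:
        raise RuntimeError("simulated container restart")
    return {**state, "summary": f"{state['rows']} rows processed"}


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run_workflow([fetch, flaky_summarize], path)
except RuntimeError:
    pass  # the process "dies" after step 0 was checkpointed

result = run_workflow([fetch, flaky_summarize], path)
print(result)
print(calls)
```

On the second run the workflow resumes at the failed step rather than starting over, so `fetch` executes only once. A stateless script with a blanket retry would have repeated it, along with whatever tokens and side effects it costs.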

Every new AI agent your team deploys starts from scratch: no memory of how the business works, where data lives, or what rules apply. And as agentic coding tools spin up applications faster than anyone can govern them, each one risks becoming another silo outside your data layer entirely. Microsoft is addressing both problems directly at Build 2026. According to VentureBeat's VB Pulse's Q1 2026 RAG Infrastructure Market Tracker, hybrid retrieval intent among 100-plus employee organizations tripled from 10.3% in January to 33.3% in March, a signal that enterprises have moved past expanding RAG coverage and are now focused on the architecture underneath it. Shared business context is the part retrieval does not solve. On the context side, Microsoft is expanding Fabric IQ, its existing business data context layer, into a broader unified system called Microsoft IQ, adding three additional context sources covering how the organization works, what it knows and real-time global signals from the web, so any agent can tap all four as a single foundation. On the application side, Rayfin, a new open-source SDK and CLI, deploys agent-built applications directly to Fabric as a governed production backend, routing application data into the same platform rather than spinning up new silos. Amir Netz, CTO of Microsoft Fabric, reached for a film analogy to explain where the data platform fits. The green screen of cascading code in "The Matrix" wasn't atmosphere, it was the layer that built the world Agent Smith operated in. "Our job in the world of data is creating reality for agents based on data," Netz told VentureBeat. Microsoft IQ unifies four context sources into a single agent foundation Microsoft IQ brings together four context sources that until now existed separately, designed so a developer can connect a new agent to all four in a single integration step. Work IQ. 
Captures how the organization operates day to day, drawing on email, documents, meetings and schedules to give agents an understanding of people, teams and workflows. Foundry IQ. Manages institutional knowledge, curating and indexing knowledge bases so agents understand what it means to work within the organization, what rules apply and what procedures to follow. Fabric IQ. Models the live operational state of the business through data, defining entities, relationships and business rules grounded in real-time signals from Fabric Real-Time Intelligence. Ontologies, the layer that captures that operational context, are expected to reach GA in the coming months. Web IQ. Adds real-time global context from the web, giving agents a current picture of the world outside the organization alongside its internal data. "The agents are going to become highly informed virtual employees," Netz said. "That's where the world is heading." Rayfin routes agent-built applications into the same data foundation Building shared context solves one half of the problem. The other is what happens when agents start generating applications. Every new app needs a backend, and without a governed deployment path each one creates a new data silo outside the context layer entirely. Rayfin provides an enterprise-grade back end and deploys agent-built applications directly to Fabric, so application data lands in Microsoft OneLake by default and feeds back into the Microsoft IQ context layer rather than accumulating outside it. Microsoft positions Rayfin against Supabase and Neon, the Postgres-compatible backends that agentic coding tools default to. The differentiator is governance: Rayfin routes the entire application fleet through Fabric's unified data and compliance layer rather than creating isolated silos. Netz described the relationship as bidirectional. The agent building a Rayfin application draws from the organization's ontology. 
The data that application generates then enriches that ontology for the next agent.

Every major data platform is chasing the same answer, but execution is unproven

Microsoft is not the only platform building a shared context layer for agents. Snowflake announced its own semantic context capabilities this week. Pinecone's Nexus platform expands its vector database into a knowledge engine, and Redis has developed its Iris context and memory platform. Microsoft's approach further reinforces the trend that RAG and model availability aren't the issue anymore. "Fabric IQ and Rayfin are important because the enterprise AI challenge is no longer just about the model availability," Robert Kramer, managing partner at KramerERP, told VentureBeat. "The real question is whether Microsoft simplifies execution and strengthens trust or adds another layer to an already complex environment."
For the past two years, the technology industry has raced to make AI agents more capable — teaching them to write code, navigate software interfaces, manage files, and orchestrate multi-step workflows with increasing autonomy. What the industry has not done, at least not with any consistency, is answer the question that keeps chief information security officers awake at night: what happens when an agent goes wrong? On Tuesday at its annual Build developer conference, Microsoft offered what may become the definitive answer. The company introduced Microsoft Execution Containers, or MXC — a policy-driven execution layer, built into the Windows operating system itself, that lets developers and IT administrators declare exactly what an AI agent can and cannot access, with those boundaries enforced at runtime by the OS kernel. The announcement, buried within a sweeping set of developer-focused updates, is arguably the most consequential platform move Microsoft made at Build this year, and it has the potential to reshape how every enterprise on Earth thinks about deploying autonomous AI software. MXC is not a product you buy. It is an SDK and a policy model — a foundational primitive embedded in Windows and the Windows Subsystem for Linux — that provides what Microsoft calls a "composable sandbox spectrum." That spectrum ranges from lightweight process isolation, already adopted by GitHub Copilot's command-line interface, all the way up to micro-virtual machines, Linux containers, and full cloud instances running on Windows 365. The system separates an agent's execution from the user's desktop, clipboard, user interface, and input devices. Critically, it binds every agent to a strong identity — either a local ID or a cloud-provisioned identity backed by Microsoft Entra — so that every action the agent takes can be attributed, audited, and governed. The implications are enormous. 
Until now, the enterprise deployment of AI agents has been stuck in a paradox: the more autonomous and useful an agent becomes, the more dangerous it is to let it operate on a corporate network without guardrails. MXC is Microsoft's attempt to break that paradox — not by making agents less capable, but by making the environment they operate in fundamentally more controlled. Why every autonomous AI agent is a security incident waiting to happen To understand why MXC matters, consider what an AI agent actually does when it runs on your computer. Unlike a traditional application, which operates within well-understood boundaries — a word processor reads and writes documents, a browser fetches web pages — an AI agent is, by design, unpredictable. It receives a goal in natural language, reasons about how to achieve it, and then takes actions: opening files, executing code, calling APIs, browsing the web, interacting with other software. Each of those interactions creates what security professionals call "attack surface." Microsoft's own blog post framed the challenge in stark terms. The company wrote that "as agents become more capable and autonomous, they're delivering material productivity gains. But they're also introducing new risk, and the issue isn't just the agent. It's the entire system the agent operates across." Every interaction between agents and humans, tools, applications, models, and other agents "exposes new attack surface and introduces different failure modes." Microsoft characterized this as "a multi-layer systems problem." This is not a theoretical concern. In the months leading up to Build, security researchers demonstrated numerous ways that AI agents could be manipulated — through prompt injection, through malicious tool calls, through data exfiltration disguised as normal workflow. 
For enterprises that handle sensitive data, proprietary models, and regulated information, the absence of a trusted execution environment has been the single biggest barrier to moving agents from demo to deployment. Microsoft's answer is a sandbox that scales from a single process to a full virtual machine MXC operates on a deceptively simple principle: declare what the agent can do before it runs, and let the operating system enforce those declarations at runtime. A developer or an IT administrator writes a policy that specifies which files, directories, and network resources an agent is allowed to access. MXC then creates a contained execution environment — a sandbox — that enforces those boundaries regardless of what the agent attempts to do. What makes MXC unusual, and potentially very powerful, is the breadth of its isolation options. Microsoft designed the system so that a single SDK and policy model can map to the appropriate isolation construct for any given workload. For a lightweight coding assistant that just needs to read the current project directory, fast process isolation may be sufficient. For an autonomous agent that executes arbitrary code downloaded from the internet, a full micro-VM may be required. The system is designed to be "dynamically composable based on intent and risk," meaning that the level of isolation can be adjusted based on what the agent is actually doing, not just what category it falls into. Session isolation is a particularly important feature. MXC separates the agent's execution from the user's desktop, clipboard, UI, and input devices. 
This directly mitigates several classes of attacks that security researchers have identified as particularly dangerous for AI agents: UI spoofing, where an agent manipulates what the user sees to trick them into approving a malicious action; input injection, where an agent sends keystrokes or mouse clicks to other applications; and cross-session data leakage, where information from one user's session bleeds into another. A live demo showed an AI agent trying to delete files — and failing, because the OS wouldn't let it During a pre-briefing with VentureBeat the night before the announcement, a Microsoft developer offered a vivid demonstration of the technology in action. He had set up the open-source agent framework OpenClaw running inside MXC's sandbox on his personal development machine. He then instructed the agent to delete all the files on his desktop. The agent attempted to comply — but the sandbox prevented it. "If you look at my desktop here, you see how clean my desktop is," the developer said during the demo. "That's a lie." The files, he explained, were completely safe because "the container won't allow it." The demonstration went further, showcasing the granularity of MXC's controls. Users can mark specific files as read-only for the agent, restrict access to the browser and screen capture, control whether the agent can see location data, and have all of those permissions managed centrally by an enterprise IT department through Intune policies. The agent operates inside what is effectively a one-way mirror: it can do the work it has been asked to do, but it cannot see or touch anything outside the boundaries that its policy defines. Pavan Davuluri, Microsoft's Executive Vice President for Windows and Devices, underscored during the pre-briefing that the primitives MXC introduces — security, containment, isolation, and user control — are essential to making AI agents commercially viable. 
He emphasized that these capabilities are "not unique to OpenClaw" and that "this pattern repeats itself over and over" for any agent running on a Windows device. The primitives that exist in the operating system now "for the file around security, containment, isolating them, having users in control," he said, are what will make agents safe enough for ordinary consumers and corporate deployments alike. Defender, Entra, Intune, and Purview integration arriving in July turns MXC into an enterprise control plane For corporate IT departments, the most significant element of the MXC announcement is not the SDK itself but its integration with Microsoft's existing enterprise security stack through what the company calls Agent 365. Arriving in preview in July, Agent 365 layers Microsoft's Entra identity service and Intune device management platform on top of MXC, so that IT administrators can govern agent containment centrally while developers choose the level of isolation their workload demands. The integration goes further: Microsoft Defender will provide runtime threat protection, Entra will handle identity and access management, Intune will enforce device-level policies, and Microsoft Purview will extend its data governance and compliance capabilities to agent activity. This means that an enterprise could, in theory, allow employees to run AI agents on their corporate machines — even powerful, autonomous agents that execute code and manage files — while maintaining the same kind of centralized visibility and control that IT departments currently have over traditional applications. Microsoft described the identity layer in its official blog: "Windows assigns agents a local ID or a cloud provisioned identity backed by Entra and attributes all activity from the container to that identity, so you can clearly differentiate human from agent." 
For regulated industries — financial services, healthcare, government — the ability to produce an audit trail that distinguishes between human actions and agent actions on the same machine could prove to be a regulatory requirement, not merely a nice-to-have feature. Every agent action attributable to a specific identity, every containment boundary enforceable through the same policy infrastructure that already governs hundreds of millions of Windows devices — this is the architecture that could finally move AI agents from pilot programs to production. OpenAI, Nvidia, Manus, and Nous Research are already building on MXC — and that changes the calculus Platform announcements at developer conferences are often aspirational. What distinguishes the MXC launch is the breadth and specificity of the partners already building on it. Microsoft named five: OpenAI, Nvidia, Manus, Nous Research (maker of the Hermes agent), and the OpenClaw open-source project. Each is integrating MXC in a distinct way that illuminates a different use case for the technology. OpenAI's involvement is particularly striking. David Wiesen, a member of OpenAI's technical staff, said that "working with Microsoft on the Microsoft Execution Containers (MXC) allows us to explore new patterns for AI agents to safely and efficiently generate and execute code." He added that by combining Codex's capabilities with MXC's execution environment, the goal is "to help developers move from intent to reliable execution faster, while maintaining the security and control enterprises need." The reference to Codex — OpenAI's code-generation agent — suggests that MXC could become the default execution environment for one of the most widely anticipated agent products in the industry. Nvidia is bringing its OpenShell framework to Windows built on MXC, providing what Microsoft described as "an easy-to-deploy package for autonomous, always-on agents safely." 
Manus, the Chinese-born AI agent startup that gained viral attention earlier this year, is also integrating. Tao Zhang, Manus's Chief Product Officer, said that MXC "gives developers a policy-driven way to define what an agent can access and enforce those boundaries at runtime, so more autonomous agents can operate safely in enterprise environments." And Dillon Rolnick, the CEO of Nous Research, offered what may be the most concise articulation of why MXC matters: "Continuously-running local agents, like Hermes Agent, require intentional isolation. Developers need control over what an agent can access and trust that those controls will hold." How an open-source agent framework became Microsoft's proving ground for AI safety on Windows One of the more revealing stories behind the MXC announcement involves OpenClaw. During the press pre-briefing, a Microsoft developer described how the partnership came together organically — Peter Steinberger, OpenClaw's creator, sent him a direct message in January expressing interest in collaborating. What began as a casual conversation evolved into a full-fledged platform partnership, with Microsoft developers contributing to the OpenClaw Windows companion app, built as a native WinUI application rather than a wrapped web app. The OpenClaw integration serves as what Scott called "the ultimate test app for all the stuff that [the Windows platform team] is making." If OpenClaw — which by its nature gives agents broad autonomy to execute tasks on a user's machine — can run securely within MXC's containment boundaries, then the containment system is robust enough for any agent. Scott explained the philosophy driving the work: "Think of OpenClaw Windows as the ultimate test app... If OpenClaw can succeed on Windows, that means that the Linux support is there, the container support is there, the containment is there." 
The companion app demonstrates the full spectrum of MXC's enterprise controls — file permissions, network access, screen capture restrictions, location data — all manageable centrally through Intune policies. Microsoft donated the project to OpenClaw and plans to continue contributing to it as open source. As one member of the Windows leadership team put it during the briefing: "All agents, all comers, everyone is welcome on Windows... It's going to run great on Windows, because the primitives are there. The base of the pyramid is solid." Building containment into the OS gives Microsoft a strategic edge over Apple's walled garden and Google's cloud-first model MXC arrives at a moment when the technology industry is grappling with a fundamental tension. AI agents represent what may be the most significant new category of software since mobile applications, and every major technology company is racing to build them. But the security and governance infrastructure required to deploy these agents responsibly in enterprise environments barely exists. Microsoft's approach is distinctive because it locates the trust layer at the operating system level rather than in the agent framework, the model provider, or a third-party security product. This is a deliberate architectural choice. By building containment into Windows itself, Microsoft ensures that the security guarantees hold regardless of which agent, which model, or which framework a developer chooses. It also means that the hundreds of millions of Windows devices already managed through Intune and secured through Defender can, in principle, become agent-ready through a software update rather than a rip-and-replace deployment. Apple's approach to AI agents leans heavily on its walled-garden ecosystem, offering security through restriction — limiting which agents can run and what they can do. Google's approach, centered on its cloud infrastructure, offers security through centralization. 
Microsoft's approach offers security through declaration and enforcement — allowing any agent to run, but containing its impact through OS-level policy. For enterprises that operate in heterogeneous environments with diverse toolchains and multiple AI providers, the Microsoft model may prove the most practical. The competitive dynamics are already shifting: with OpenAI's Codex, Nvidia’s OpenShell, and independent agent frameworks like Manus and Hermes all building on MXC, Microsoft is positioning Windows not just as the platform where agents run, but as the platform where agents can be trusted to run. The hardest part isn't building the sandbox — it's writing the policies that go inside it MXC is available now in early preview, meaning developers can begin building against the SDK and testing containment policies. The Agent 365 integration with Defender, Entra, Intune, and Purview is scheduled for preview in July — a timeline aggressive enough to suggest that much of the engineering work is already done, but far enough out to allow for refinement based on developer feedback. The real test, however, will come when enterprises begin deploying agents at scale on production networks. Containment is only as good as the policies that govern it, and writing effective agent policies for complex enterprise environments will be an entirely new discipline — one that IT departments have not yet developed and that no vendor has yet figured out how to teach. The technology is promising, but an empty sandbox is just an empty box. Filling it with the right rules, for the right agents, in the right contexts, will require a level of organizational sophistication that most companies are only beginning to contemplate. Still, the significance of what Microsoft announced on Tuesday is difficult to overstate. 
For the first time, a major operating system vendor has proposed a comprehensive, kernel-level answer to the question of how autonomous AI software should be contained, identified, and governed on the devices where most of the world's work actually gets done. The industry spent two years teaching agents to act. Microsoft is now betting that the bigger business — and the harder engineering problem — is teaching the operating system to watch.
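As a rough illustration of the declare-then-enforce principle this article describes, the Python sketch below declares an agent's permitted files and hosts up front and checks every requested action against that declaration. It is not the MXC SDK or its policy format, neither of which Microsoft's announcement details here: the `AgentPolicy` class, its field names, the paths and the host are all hypothetical, and a real sandbox would enforce such rules in the operating system rather than in application code.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical policy: what an agent may touch, declared before it runs."""
    read_only: tuple = ()            # directories the agent may only read
    read_write: tuple = ()           # directories the agent may read and modify
    allowed_hosts: frozenset = frozenset()

    @staticmethod
    def _inside(path, roots):
        # True if `path` equals one of `roots` or sits beneath it.
        p = PurePosixPath(path)
        return any(p == PurePosixPath(r) or PurePosixPath(r) in p.parents
                   for r in roots)

    def allows_file(self, path, action):
        # Reject parent-directory hops so "project/../Desktop" cannot escape.
        if ".." in PurePosixPath(path).parts:
            return False
        if action == "read":
            return self._inside(path, self.read_only + self.read_write)
        if action in ("write", "delete"):
            return self._inside(path, self.read_write)
        return False  # default deny for any action not declared above

    def allows_host(self, host):
        return host in self.allowed_hosts


# Illustrative policy for a coding assistant scoped to one project directory.
policy = AgentPolicy(
    read_only=("/home/dev/docs",),
    read_write=("/home/dev/project",),
    allowed_hosts=frozenset({"api.example.com"}),
)

print(policy.allows_file("/home/dev/project/main.py", "write"))     # True
print(policy.allows_file("/home/dev/docs/spec.md", "write"))        # False
print(policy.allows_file("/home/dev/Desktop/notes.txt", "delete"))  # False
```

The default-deny return at the end of `allows_file` mirrors the demo described above: a request to delete desktop files fails not because the agent declines, but because nothing in the declared policy grants it.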

Microsoft on Monday unveiled the Surface RTX Spark Dev Box, a compact desktop computer designed to let software developers run large AI models on their desks instead of paying for cloud computing — a move that directly challenges the per-token pricing model that has defined the AI industry's economics since ChatGPT launched three and a half years ago. The device, announced at Microsoft Build 2026, packs Nvidia’s new Blackwell-architecture RTX Spark processor and 128 gigabytes of unified memory into a small-form-factor chassis, delivering what Nvidia rates at one petaflop of AI compute. In practical terms, that means a developer can load, run and interact with AI models exceeding 120 billion parameters without sending a single API call to the cloud. "These class of devices, we think, will get to about 100 billion parameter model running," Pavan Davuluri, Microsoft's executive vice president of Windows and Devices, said during a press briefing ahead of the event. He emphasized that raw model size is only part of the equation: "The model size is one thing, but for the model to be effective, it kind of needs to be able to have enough context, because a larger model, you feed it larger context." At 100,000 tokens of context, he noted, the key-value cache alone can consume 40 to 50 gigabytes of memory — which is precisely why Microsoft and Nvidia engineered the device around a 128-gigabyte unified memory pool shared dynamically between the CPU and GPU. The machine will be available later this year in the United States, sold exclusively through Microsoft.com. The company did not disclose pricing. Why Microsoft is betting that AI's future runs on fixed costs, not cloud meters The Surface RTX Spark Dev Box arrives at a moment when the economics of AI development have become a boardroom-level concern. 
Companies large and small are grappling with cloud GPU bills that scale unpredictably: every fine-tuning run, every inference call, every agentic workflow that loops through a frontier model accumulates cost. For a developer iterating rapidly on a prototype — running the same model dozens or hundreds of times a day — those charges compound fast. Microsoft is framing the Dev Box as a release valve for that pressure. Andrew Hill, corporate vice president of Surface, wrote in the announcement blog post that the device "changes that equation" by letting developers "reserve frontier model calls for truly frontier problems and handle the rest on their own hardware." The pitch is not that cloud computing is obsolete, but that much of the work currently being sent to remote data centers does not require state-of-the-art models and would be better served by capable local hardware with predictable, fixed costs. This is a significant strategic shift for Microsoft, a company that derives tens of billions of dollars in annual revenue from Azure cloud services. By selling hardware that explicitly reduces customers' cloud dependency, Microsoft is acknowledging a tension that has been building across the industry: the marginal cost of AI inference at scale is unsustainable for many teams, and the market is demanding alternatives. The bet appears to be that developers who prototype locally will still deploy to Azure when they need to scale — and that owning both ends of that workflow is more valuable than owning only the cloud. Inside the 128GB unified memory architecture that makes local AI possible The technical architecture of the Dev Box reflects a set of deliberate engineering choices aimed at sustained, not peak, performance — a distinction that matters enormously for AI workloads that can run for hours. At the center is Nvidia’s RTX Spark system-on-chip, which combines an ultra-efficient ARM-based CPU with a Blackwell-generation RTX GPU. 
In a traditional Windows PC, Davuluri explained during the briefing, this configuration would require four separate components: a CPU, a discrete GPU, dedicated graphics memory and system RAM. The RTX Spark collapses all of that into a single chip paired with a single unified memory pool. That unification is the critical design decision. Conventional gaming laptops with high-end Nvidia GPUs top out at roughly 24 gigabytes of GPU-accessible memory. The Dev Box's 128 gigabytes of unified memory — accessible to both the CPU and GPU through what Nvidia calls its Unified Memory Access architecture — is what makes it possible to load models that would otherwise require cloud GPU instances with specialty high-bandwidth memory configurations. Microsoft did substantial work at the operating system level to exploit this architecture. The company implemented new memory management logic in Windows that raises the ceiling on how much system memory the GPU can address, introduces smarter page-size allocation for shared memory regions and ensures that heavy GPU workloads do not starve the CPU of the resources it needs for multitasking. The Windows scheduler was also optimized for RTX Spark's heterogeneous core layout, routing demanding workloads to performance cores while keeping efficiency cores available for background tasks. How a 3D-printed aluminum chassis doubles as a heatsink The thermal design is equally deliberate. The Dev Box operates within an approximately 100-watt sustained thermal envelope — modest by desktop standards, but meaningful for a device intended to run training jobs and inference workloads continuously. The aluminum chassis itself is engineered to function as a passive heatsink, and the method Microsoft used to build it is among the most striking details about the machine. The top panel is manufactured using metal 3D printing, a process that enables internal geometries too complex for conventional CNC machining or injection molding. 
The perforations are not simple through-holes; they are angled in multiple directions around the internal fan to optimize airflow from cold-air intake through heat dissipation. During the press briefing, Harry, a Surface industrial designer, explained the rationale: "The complexity is something other manufacturers wouldn't be able to do, like CNC, or like any molding, because of the complexity of shape." When asked whether 3D printing would constrain mass production, the designer acknowledged the challenge but suggested Microsoft had developed a process robust enough to scale. The result is a machine that runs quietly enough for an open office while sustaining the kind of continuous GPU workloads that would throttle most conventional desktops of similar size. For a device that Microsoft expects developers to leave running overnight on fine-tuning jobs, quiet sustained performance is not a luxury — it is a requirement. A developer-first setup that eliminates hours of configuration Microsoft is shipping the Dev Box with Windows 11 Pro pre-configured at the image level for development work — a detail that sounds minor but reflects a growing recognition that the out-of-box experience for developer hardware has historically been poor. The machine boots into a dark theme with a simplified taskbar, widgets removed and Do Not Disturb enabled. Developer Mode is turned on. PowerShell 7 is the default shell. WSL 2 — the Windows Subsystem for Linux — comes pre-installed with GPU passthrough and CUDA support already configured. Visual Studio Code, GitHub Copilot, Git, Python and Node.js are all installed and ready. "We've said, 'Hey, you know what, we got you, you want to go fast,'" a Microsoft engineer who demonstrated the configuration during the briefing told VentureBeat. 
The philosophy, he explained, is that developers were going to install all of these tools anyway — the friction was in the hours of setup and configuration that stood between unboxing a machine and writing the first line of code. The Dev Box also ships with integration points across Microsoft's AI stack: AI Toolkit for VS Code for model conversion and fine-tuning, Windows ML and Windows Copilot Runtime for local inference, and Microsoft Foundry for connecting local prototypes to cloud deployment pipelines. For enterprises, the device integrates with Entra ID and Intune for identity and device management, and includes Secured-core PC architecture, BitLocker encryption and Microsoft Defender. Why Apple's Mac Mini may not be the real competition anymore The most obvious competitive comparison is Apple's Mac Mini, which has dominated the compact-desktop category and has been widely adopted by developers drawn to Apple Silicon's unified memory architecture and power efficiency. Davuluri addressed the comparison directly during the briefing, saying the Dev Box is "in a different class of performance than Mac Minis, intentionally." He declined to share specific benchmarks, noting that detailed specifications and performance targets would come closer to the fall launch. But the architectural advantage Microsoft is claiming is clear: while the current Mac Mini with M4 Pro tops out at 48 gigabytes of unified memory and the M4 Max configuration reaches 128 gigabytes, the RTX Spark Dev Box pairs its 128 gigabytes with a Blackwell-class GPU that has a fundamentally different CUDA-based compute model — one that the vast majority of the AI/ML ecosystem's tooling (PyTorch, TensorRT, llama.cpp, Hugging Face frameworks) is already optimized for. That CUDA ecosystem advantage is difficult to overstate. While Apple's Metal framework has made progress, the overwhelming majority of AI training and inference frameworks are built and tested first against Nvidia’s CUDA stack. 
A developer running models on the Dev Box can use the same code, the same libraries and the same workflows they would use on a cloud GPU instance — a level of portability that Apple Silicon cannot currently match. From laptop to supercomputer: Microsoft's three-tier plan for local AI hardware The Dev Box is one piece of a three-tier hardware strategy Microsoft laid out at Build. The Surface Laptop Ultra, announced days earlier at Computex, brings the same RTX Spark silicon into a 15-inch laptop form factor for developers and creators who need portability. At the other end of the spectrum, the DGX Station for Windows — built on Nvidia's GB300 Grace Blackwell Ultra Superchip — targets organizations that need to run frontier models up to one trillion parameters on a deskside system. That machine is expected in the fourth quarter of this year. The three devices map to a tiered computing model that Microsoft is calling "unmetered intelligence": small on-device language models (the company's new Aion 1.0 family) handle lightweight tasks at zero marginal cost; RTX Spark-class hardware runs mid-range models locally for the bulk of development work; and cloud resources are reserved for genuinely frontier-scale problems. The GitHub Copilot CLI is getting a concrete implementation of this model with a new feature called /fleet, which allows a cloud-based primary agent to build a plan, assess the complexity of each task and route appropriate subtasks to a local model running on the developer's hardware. The cloud agent handles what requires frontier capability; the local model handles what does not. The result, in theory, is lower cost without lower quality. The real question is whether hybrid AI can shift from buzzword to business model Whether Microsoft's bet pays off depends on questions that will take months to answer. How does the Dev Box actually perform under sustained, real-world workloads? What will it cost? 
How quickly will the open-source model ecosystem continue to produce capable models in the 70-to-120-billion-parameter range that fit within its memory envelope? And perhaps most critically: will enterprise procurement teams, trained to think of AI as a cloud line item, accept a capital expenditure on desk hardware as an alternative? The strategic logic, however, is difficult to dismiss. For three years, the AI industry has operated on an implicit assumption: serious AI work happens in the cloud, and the economics of that arrangement are simply the cost of doing business. Microsoft, a company with every incentive to reinforce that assumption, is now selling a machine that undermines it. That is not a contradiction — it is a recognition that the market is moving, and that the company that controls the developer's local environment and the cloud they deploy to has a more durable advantage than one that controls only the cloud. Every dollar a developer does not spend on cloud inference is a dollar that can fund another experiment, another iteration, another prototype. For years, the AI industry told developers they needed to rent their intelligence by the token. Microsoft is now asking a different question: what if you could just buy it?
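The memory arithmetic behind the 128-gigabyte design can be sketched in a few lines of Python. The key-value cache formula is the standard one for transformer decoders (keys and values, per layer, per KV head, per token). The model shape used here (96 layers, 8 KV heads, head dimension 128, 16-bit cache values) and the 4-bit weight quantization are assumptions chosen for illustration, not figures from Microsoft or Nvidia.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values (the factor of 2) for `tokens`."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def weight_bytes(params, bits_per_param):
    """Bytes needed to hold the model weights at a given precision."""
    return params * bits_per_param // 8


# Hypothetical model shape; real models in this size class vary widely.
kv_gb = kv_cache_bytes(layers=96, kv_heads=8, head_dim=128,
                       tokens=100_000) / 1e9
# Assumes 4-bit quantized weights for a 120-billion-parameter model.
weights_gb = weight_bytes(120_000_000_000, bits_per_param=4) / 1e9
total_gb = kv_gb + weights_gb

print(f"KV cache at 100k tokens: {kv_gb:.1f} GB")   # 39.3 GB
print(f"Weights at 4-bit:        {weights_gb:.1f} GB")  # 60.0 GB
print(f"Total:                   {total_gb:.1f} GB of 128 GB")  # 99.3 GB
```

Under these assumptions the cache alone lands near the 40-to-50-gigabyte range Davuluri cited, and weights plus cache fit inside the unified pool with room left for the operating system and activations. A model with more KV heads or higher-precision weights would change the totals substantially.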

Agentic AI is moving rapidly from the developer terminal to the corporate world. On Tuesday, OpenAI announced a major update of its agentic AI platform Codex, introducing domain-specific workflows, a rapid, semi-private web hosting feature for enterprises called "Sites," and an in-place editing tool named "Annotations." The release marks a deliberate strategy to transform Codex from a specialized programming assistant into an everyday operating environment for business professionals. Non-developers—including financial analysts, marketers, operators, and researchers—now constitute approximately 20% of the platform’s 5 million weekly users and are adopting the technology three times faster than traditional engineers, according to research shared by OpenAI with VentureBeat and other outlets. OpenAI is capitalizing on this shift to position Codex as the premier application for white-collar task automation. The timing is strategic: the announcement arrives just as Microsoft, OpenAI's primary investor turned business rival, kicks off its annual Build developer conference in San Francisco this week, where a slate of competing enterprise productivity tools is expected, and hot on the heels of Anthropic’s rapid adoption among knowledge workers via its Claude Cowork and Claude Code platforms. Annotations enable more precise agentic AI spreadsheet edits and updates For business users, the most critical technical upgrade is the elimination of full-document regeneration. Previously, instructing an AI to update a specific chart or spreadsheet calculation often meant the model had to rewrite the entire file, which frequently broke custom formatting or introduced hallucinations. OpenAI addresses this through Annotations, a localized context-scoping mechanism. As demonstrated in the company's release materials, the platform maps a document's underlying data schema.
When a user highlights a specific segment—such as a block of cells in a financial model—Codex isolates those exact data arrays. If an analyst prompts the system to "Add a chart of revenue, EBITDA, and net income over the selected years," the model executes the code strictly within that boundary, generating the visualization while leaving the surrounding cell dependencies, styles, and unselected formulas completely untouched. New role-specific Plugins for enterprise functions that bundle skills and external SaaS app connections To further anchor Codex in daily enterprise operations, OpenAI has introduced modular software bundles and a rapid-prototyping hosting environment. The company is rolling out six role-specific plugins that aggregate 62 popular business applications (including Snowflake, Figma, and Salesforce) and 110 automated skills straight out of the box. Data Analytics: Unifies cloud environments like Snowflake, Databricks Genie, Hex, and Tableau to translate natural language inquiries into data reports and change-analysis dashboards. Creative Production: Connects Figma, Canva, Shutterstock, Picsart, and Fal to generate and iterate on ad variations, campaign boards, and e-commerce assets directly from text briefs. Sales: Integrates pipeline infrastructure across Salesforce, HubSpot, Slack, Outreach, Clay, Rox, and Actively to automate follow-up communications, close plans, and account risk reviews. Product Design: Bridges Figma and Canva environments to audit live user journeys and transform static wireframes into clickable prototypes. Public Equity & Investment Banking: Syncs institutional market feeds—including Moody’s, Daloopa, Datasite, FactSet, LSEG, S&P, PitchBook, and Hebbia—to streamline financial modeling, competitive landscaping, and pitch book preparation. 
These integrations allow distinct departments, from data analytics and creative production to sales and investment banking, to automate complex, multi-step workflows without requiring IT to build custom API connections.

Sites allow users to spin up dynamic, hosted webpages they can share with their colleagues

Concurrently, the new Sites feature introduces an interactive canvas that converts static data inputs or text documents into functional, web-hosted internal applications. Rolling out in preview for Business and Enterprise tiers, Sites allow cross-functional teams to bypass front-end development. Financial leaders, for example, can transform a static spreadsheet into an interactive scenario planner shared via a secure workspace URL, allowing executives to tweak assumptions in a live web app rather than clicking through document tabs. Instead of static decks, Sites promise to keep enterprises updated on their latest metrics and important information in an easily digestible way.

Availability & deployment

A critical operational distinction in this rollout centers on exactly where these new features can be executed. Codex's existing infrastructure runs natively across multiple surfaces, including IDE extensions and the terminal command line. However, the release documentation notes that Sites are rolling out "through the Codex app" and that plugins are managed via a "Codex plugin directory". An OpenAI spokesperson confirmed that Plugins and Sites are available in the CLI and desktop app, while Sites are hosted by OpenAI.

Licensing and pricing

These updates operate entirely within OpenAI's closed, proprietary enterprise licensing model. Unlike open-source frameworks, enterprise clients do not maintain code-level ownership over Codex’s integration nodes. Instead, system administrators manage deployment through centralized workspace settings, giving them explicit authority to enable or disable hosted "Sites" and restrict underlying application permissions.
These new capabilities deploy seamlessly on top of Codex's existing commercial framework. Users will continue to access the agent via established baseline subscription tiers—such as the individual "Plus" plan ($20/month) or the high-volume "Pro" plan ($100/month)—or through a separate, seat-free pay-as-you-go model that draws down pre-purchased utility credits.

Enterprise AI agents have a new production failure mode, and it is not the model. As enterprises move from single-layer RAG to hybrid retrieval architectures, the same underlying data produces different answers depending on which agent, tool or system asks the question. Revenue means one thing in a business intelligence (BI) dashboard, something slightly different in a SQL table and something else again in an agent instruction. The retrieval infrastructure build-out of the past two years produced faster and cheaper vector search. It did not produce a shared definition of what the data means. At Snowflake Summit 26 in San Francisco, the data cloud vendor is taking a broad swing at that problem, with announcements spanning a Kafka-compatible managed streaming service called Data Stream, adaptive compute improvements, expanded Apache Iceberg interoperability and updates to its Cowork and CoCo agent and coding products. Running underneath all of it is a context layer: Horizon Context and Cortex Sense, a two-layer system designed to give agents a governed, shared definition of business logic across retrieval stacks. The context problem is why it matters: VentureBeat's VB Pulse Q1 2026 data, drawn from a survey of organizations with 100 or more employees, shows hybrid retrieval intent tripling from 10.3% in January to 33.3% in March, the fastest-growing strategic position in the dataset. "There are a lot of tools out there that you can ask questions, you get a very confident answer, but whether it's correct or not is different," said Christian Kleinerman, EVP of Product at Snowflake. From fragmented business logic to a governed context layer The problem Horizon Context targets is specific. Business logic today is distributed across SQL, BI dashboards and agent instructions, and no single system owns the definition. When multiple agents or tools query the same underlying data, they reason over different schemas and return different answers. 
Horizon Context is Snowflake's attempt to fix that at the catalog layer rather than at the agent layer. Horizon Context. The customer-managed layer, built on Snowflake's acquisition of Select Star. It pulls metadata from Postgres, SQL Server, Tableau and Power BI into the Horizon Catalog, so every agent, BI tool and external system draws from the same governed definition rather than reasoning independently over a raw physical schema. Semantic View Autopilot automatically creates and refines semantic views over time, extending curated business logic without requiring ongoing manual effort. Cortex Sense. The platform-derived layer. It automatically builds and enriches context from customer data and usage patterns on an ongoing basis, without requiring manual semantic view authoring. Kleinerman described it as improving the default experience before any explicit curation has happened. The distinction between the two layers is architectural and Kleinerman was precise about it. "Think of Horizon Context as everything that is explicit and declared by customers, and Cortex Sense is anything that is implicit and derived by us," Kleinerman said. The two layers connect to Snowflake's existing retrieval infrastructure. Cortex Search, the company's RAG implementation, plugs into both CoCo and Cowork as a tool, so context enriched by either layer flows into retrieval workflows. While Horizon Context is a Snowflake technology, the goal is for it to be interoperable and open. Snowflake is tying the technology to the Open Semantic Interchange, making customer-declared definitions portable across third-party catalogs and tools. "Horizon Context, 100% we're committed to and leading the effort to make sure that that's not locked in," Kleinerman said. Context layers are everywhere. The question is which ones actually work. Snowflake is joining an increasingly crowded field of vendors targeting the same problem. 
Microsoft has opened its Fabric IQ business ontology via MCP so any vendor's agent can draw from a shared semantic layer. Redis launched Iris, a context and memory platform that sits between agents and their data, built on a storage engine redesigned for agent-scale retrieval volumes. Pinecone is repositioning from vector database to knowledge engine with Nexus, which compiles enterprise data into task-specific artifacts before agents ever query them.

Devin Pratt, research director at IDC, told VentureBeat that in his view Snowflake is headed in the right direction and is going where the whole market is heading. "Agents are only as good as the data and semantics behind them, so the context layer, not the model, is the thing to watch right now," Pratt said.

In Pratt's view, what works about Snowflake's version is the split. Horizon Context covers what teams declare and curate themselves, and Cortex Sense covers what the platform picks up automatically. Just as important, Snowflake has anchored Horizon Context inside the catalog and governance layer rather than bolting it on after the fact. "The context layer is the real battleground for agentic AI. An agent is only as trustworthy as the data and semantics behind it," Pratt said.

Mike Leone, VP and principal analyst at Moor Insights and Strategy, agreed that treating the two layers differently is the right architectural call. "I like where Snowflake's heading. They're splitting context into two buckets, with Horizon Context covering what customers explicitly define and Cortex Sense covering what the platform figures out on its own," Leone told VentureBeat. "You can't trust those two things the same way, so treating them differently is the right call. If Snowflake can show those two layers reconcile cleanly and you can see where every answer came from, they've got something real."

What this means for enterprises

For enterprises evaluating context layers, the architectural direction is clear. The execution gap is not.
Agents raise the bar on an old problem. The semantic layer idea has existed for years, but agents change what failure costs — when an agent gives a wrong answer at scale, the damage is immediate. Leone is direct about what that means for most vendors currently in the market. "Most vendors selling a drop-in fix are overpromising," Leone said. "Drop one into a real enterprise and it mostly exposes how messy your data and definitions already are, and a lot of companies are about to find that out the hard way." The evaluation bar is specific. Pratt identified what separates context layers that work from those that stall: governance and lineage built in so teams can audit why an agent gave the answer it did, portability so context and policy are not locked to one vendor, and accuracy that can be measured and reused across agents and tools. "Enterprises don't need another silo of semantics," Pratt said. "They need a context layer that's governed, portable, and trustworthy enough to audit."
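The split the analysts describe, explicit customer-declared definitions versus implicit platform-derived ones, with provenance visible on every answer, can be illustrated with a small sketch. Everything here is hypothetical: the `Definition` type, the `resolve` function and the sample expressions are invented for illustration and are not Snowflake's API or data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    term: str
    expression: str
    source: str  # "declared" (customer-authored) or "derived" (platform-inferred)


def resolve(term, declared, derived):
    """Return the governing definition for a term, preferring declared ones.

    The returned object carries its provenance so a caller can audit
    where an answer came from.
    """
    if term in declared:
        return Definition(term, declared[term], "declared")
    if term in derived:
        return Definition(term, derived[term], "derived")
    raise KeyError(f"no governed definition for {term!r}")


# Hypothetical definitions; the expressions are placeholders, not real schemas.
declared = {"revenue": "SUM(net_amount) WHERE status = 'booked'"}
derived = {"revenue": "SUM(amount)", "active_customer": "orders in last 90 days"}

revenue = resolve("revenue", declared, derived)
active = resolve("active_customer", declared, derived)
print(revenue.source, active.source)
```

In this sketch every agent that asks for "revenue" receives the same declared expression, and a derived definition is used only when nothing has been declared, which is one simple way to treat the two kinds of context differently.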

Big news in enterprise AI broke over the weekend as Chinese AI startup MiniMax released its highly anticipated M3 large language model on Sunday evening Eastern time, pairing frontier-tier coding and agentic performance with a 1-million-token context window and native multimodality for a fraction of the cost of leading proprietary models, with pricing starting at just $20 per month under its new subscription token plans. The company's leadership also announced plans to deliver the model under an open source license including "open weights," allowing for full enterprise downloading and customizability free-of-charge, coming sometime in the next 10 days. For now, it is available via the MiniMax API at a special discounted price of $0.3 per 1 million input tokens and $1.20 per million output tokens (on fresh cache) for the next week — beating proprietary U.S. giants like Google, OpenAI and Anthropic handily on cost, while also eclipsing the performance of the latest models from the former two on selected benchmarks. Even at its full price of $0.6/$2.40 per million input/output tokens, MiniMax-M3 remains at just 8-20% the cost of the leading, proprietary U.S. models. The traditional matrix governing large language model development has long dictated a rigid choice: software developers can either access top-tier closed-source intelligence behind restrictive APIs, or deploy nimble, cost-effective open models that falter on multi-step reasoning, dense coding tasks, and massive data sequences. MiniMax-M3 fundamentally upends this paradigm. By unifying these two historically separated frontier capabilities, M3 introduces a level of comprehensive utility previously restricted to expensive, closed-source ecosystems, effectively shifting the baseline of open-weights systems while drastically minimizing the operational compute footprint required to execute complex development loops. 
VentureBeat Frontier AI Model API Pricing Snapshot

| Model | Input ($ per 1M tokens) | Output ($ per 1M tokens) | Total Cost | Source |
| --- | --- | --- | --- | --- |
| MiMo-V2.5 Flash | $0.10 | $0.30 | $0.40 | Xiaomi MiMo |
| deepseek-v4-flash | $0.14 | $0.28 | $0.42 | DeepSeek |
| deepseek-v4-pro | $0.435 | $0.87 | $1.305 | DeepSeek |
| MiniMax-M3 | $0.30 | $1.20 | $1.50 (limited time only) | MiniMax |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | Google |
| MiMo-V2.5 | $0.40 | $2.00 | $2.40 | Xiaomi MiMo |
| Grok 4.3 low context | $1.25 | $2.50 | $3.75 | xAI |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| Kimi-K2.6 | $0.95 | $4.00 | $4.95 | Moonshot/Kimi |
| GLM-5.1 | $1.40 | $4.40 | $5.80 | Z.ai |
| Grok 4.3 high context | $2.50 | $5.00 | $7.50 | xAI |
| Qwen3.7-Max | $2.50 | $7.50 | $10.00 | Alibaba Cloud |
| Gemini 3.5 Flash | $1.50 | $9.00 | $10.50 | Google |
| Gemini 3.1 Pro Preview ≤200K | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI |
| Gemini 3.1 Pro Preview >200K | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.8 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.5 | $5.00 | $30.00 | $35.00 | OpenAI |

New MiniMax Sparse Attention (MSA) technique helps keep the model's cost low

At the core of the model's efficiency lies an architectural departure from classic Transformer networks. Standard attention mechanisms scale quadratically ($O(N^2)$), meaning computational and financial costs explode as text inputs lengthen. To combat this "inherent flaw," the engineering team implements MiniMax Sparse Attention (MSA), a clean, extensible sparse attention blueprint.

To visualize this innovation, think of traditional full attention as an editor reading an entire library from scratch every time they need to verify a single sentence. MSA acts as an intelligent indexing clerk, using a pre-filtering phase to partition Key-Value (KV) matrices into highly precise blocks.

At the operator level, MSA uses a "KV outer gather Q" approach. The system treats KV blocks as an outer loop, dynamically aggregating only the specific queries that hit them. Because each data block is read exactly once and memory access remains strictly contiguous, hardware utilization skyrockets.
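The loop order described above can be illustrated with a toy sketch. It is not MiniMax's kernel: the function names and the block selection are hypothetical, and the code only counts block loads rather than computing attention. It assumes each query has already been assigned a short list of KV blocks by some pre-filtering step, then inverts that mapping so that KV blocks form the outer loop.

```python
from collections import defaultdict


def gather_queries_by_block(selected_blocks):
    """Invert a query -> KV-block selection into KV-block -> queries."""
    queries_for_block = defaultdict(list)
    for query_idx, blocks in enumerate(selected_blocks):
        for block_idx in blocks:
            queries_for_block[block_idx].append(query_idx)
    return dict(queries_for_block)


def count_block_loads(selected_blocks):
    """Compare how often KV blocks are loaded under the two loop orders.

    Query-outer: every query loads each of its selected blocks.
    KV-outer: every block that is hit is loaded once, then applied
    to all the queries gathered for it.
    """
    query_outer = sum(len(blocks) for blocks in selected_blocks)
    kv_outer = len(gather_queries_by_block(selected_blocks))
    return query_outer, kv_outer


# Hypothetical selection: 4 queries, each keeping 2 of 4 KV blocks.
selected = [[0, 1], [0, 2], [0, 1], [1, 3]]
by_block = gather_queries_by_block(selected)
loads = count_block_loads(selected)
print(by_block)
print(loads)
```

With this toy selection the query-outer order performs eight block loads while the KV-outer order performs four, one per block. A real kernel would also compute the attention scores inside the loop; this sketch isolates only the memory-access pattern.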
In internal trials, MSA runs more than 4x faster than alternative open-source solutions like Flash-Sparse-Attention or flash-moba. When managing a maxed-out context length of 1 million tokens, M3’s per-token compute demand drops to just 1/20th of the previous generation model, translating into a 9x acceleration in the prefilling stage and a 15x boost during decoding. Rather than taking a pretrained text network and fusing it with a separate vision model, MiniMax engineered M3 as a natively multimodal system from "Step Zero". The company overhauled its data ingest machinery to blend naturally interleaved sequences of text, images, and visual components, scaling the total pretraining corpus beyond 100 trillion tokens. This deep data alignment enables the model to translate complex visual geometries, such as programming charts or coordinate maps, into structural code without losing contextual fidelity. On standardized assessments, M3 validates this engineering path. The model records a 59.0% on SWE-Bench Pro, an autonomous agent metric, positioning it ahead of closed models like GPT-5.5 and Gemini 3.1 Pro. It achieves a 66.0% on Terminal Bench 2.1, a 74.2% on MCP Atlas, and an 83.5 on BrowseComp—outstripping Claude Opus 4.7’s benchmark score of 79.3 in autonomous browsing and information retrieval. However, when contrasted with Anthropic's newly released, premium frontier model, Claude Opus 4.8, from last week, the competitive ceiling of M3's efficient sparse-attention footprint becomes evident across directly comparable, tool-intensive agent benchmarks. In the domain of pure code modification on SWE-Bench Pro, M3’s 59.0% score drops behind Opus 4.8’s leading 69.2% threshold. A similar performance delta manifests in automated system environments via Terminal-Bench 2.1; while M3’s 66.0% terminal execution score effectively runs neck-and-neck with the previous-generation Opus 4.7 baseline of 66.1%, it trails the upgraded Opus 4.8 architecture, which achieves 74.6%. 
Furthermore, evaluations tracking continuous GUI interaction on the OSWorld-Verified sandbox place M3’s automated computer use at 70.0%, compared to a higher 83.4% validation rate secured by Opus 4.8. These standardized evaluations illustrate the structural trade-offs currently defining the ecosystem: closed-source systems like Opus 4.8 maintain absolute margin leads on hyper-complex reasoning vectors, yet M3 delivers a highly capable baseline of local, tier-one automated operation without the compounding premium of closed-door API subscription fees. When positioned alongside the heavy-duty inference metrics of the newly minted, fellow open weights model DeepSeek-V4 Pro Max, M3 holds its ground across core agentic categories while asserting narrow advantages in specialized code synthesis. On the software engineering matrix of SWE-Bench Pro, M3's 59.0% resolution efficiency edges past DeepSeek-V4 Pro Max’s score of 55.4%. However, the competitive friction tightens in command-line environments; under Terminal Bench evaluations, DeepSeek-V4 Pro Max pulls slightly ahead with a 67.9% execution accuracy over M3’s 66.0% mark. In web orchestration and open-world browsing simulations, the two architectures reach a virtual statistical parity, with M3 registering an 83.5% on BrowseComp compared to DeepSeek's 83.4%. Similarly, on the MCP Atlas tool-use framework, M3 secures a narrow lead at 74.2% against DeepSeek’s 73.6%. This close alignment demonstrates that while DeepSeek handles a massive 1.6-trillion total parameter footprint with specialized high-effort reasoning modes, MiniMax's block-filtered sparse attention mechanism yields directly competitive execution efficiencies without requiring extensive parameter activation scaling. 
MiniMax Code AI agent offers Agentic Team capabilities MiniMax translates these architectural gains into immediate utility through an updated product suite divided between standalone applications, customizable subscription tiers, and raw developer infrastructure. For end-user orchestration, the flagship implementation is MiniMax Code, an AI agent product designed to maximize M3's multi-step capabilities. Operating via web or native desktop apps, MiniMax Code runs an "Agent Team" capable of breaking massive engineering tasks into multi-stage, concurrent workflows. The system relies on a "Producer + Verifier" adversarial harness loop. As one agent instance generates code, a secondary verifier instance aggressively tests and reflects upon execution outputs, allowing the network to self-correct and operate autonomously for days without human oversight. Because of its native visual grounding, MiniMax Code supports direct computer use. A developer can issue a cross-application voice prompt via their phone to have the model open a localized enterprise ERP client and batch-populate data tables directly from an open Excel spreadsheet. For custom setups, developers can pipeline M3 directly into existing workflows using an API key (sk-cp) compatible with common alternative IDE environments like Claude Code, Cursor, Roo Code, and Cline. The API introduces a toggleable "thinking mode". When enabled, M3 routes processing power into deep reasoning and long-horizon planning; when disabled, the model runs at minimal latency for quick text completion. The companion Token Plan models an aggressive pricing strategy structured around shared multimodal quotas. Billed annually, three options are available: Plus ($20/month): Supplies ~1.7B tokens per month and handles 3–4 concurrent agents. Max ($50/month): Supplies ~5.1B tokens per month, manages 4–5 concurrent agents, and adds 3 automated video clips per day via Hailuo 2.3. 
Ultra ($120/month): Supplies ~9.8B tokens per month, facilitates 6–7 concurrent agents, and extends video capacity to 5 daily clips.

Open weights makes M3 much more attractive for enterprise use

MiniMax's pledge to release M3 under an open-weights license model, with weights and technical documentation launching on HuggingFace and GitHub within 10 days, carries significant strategic weight for enterprise infrastructure managers. However, it is still to be determined precisely which license the weights will be available under (e.g. MIT, Apache 2.0 or the new OpenMDW license), and whether or not it will be permissible for consumer usage. If so, the calculus looks like this:

| Feature / Model Attribute | Closed API Providers (e.g., GPT-5.5, Opus 4.7) | Open-Weights Frontier (MiniMax M3) |
| --- | --- | --- |
| Data Privacy & Boundaries | Requires external API requests; potential data ingestion vectors. | Total local isolation; runs entirely inside private user clusters. |
| Custom Optimization | Limited to basic fine-tuning wrappers or prompt engineering. | Full pipeline control; architecture allows deep adapter/weights customization. |
| Cost Vector Consistency | Bound to perpetual per-token API pricing models. | Computational demands cut to 1/20th; mitigates hardware ceiling. |

By shipping the underlying model weights directly to the community, MiniMax departs from the closed-door approach favored by major American AI labs. For enterprise users bound by strict compliance and privacy rules, open weights mean they can run M3 locally on internal hardware. This setup completely removes the risk of data leakage associated with public APIs. Furthermore, it permits engineering teams to run bespoke fine-tuning passes, modify internal architectures, or embed specialized system prompts deep within the model layers, transforming an off-the-shelf system into a highly targeted proprietary asset.
Initial community reactions are resoundingly positive

The developer ecosystem reacted immediately to M3’s operational benchmarks, singling out its long-horizon autonomous behavior and cost-to-performance profile. A major focal point of discussion is a 12-hour automated verification test where M3 was tasked with reproducing an ICLR 2025 Outstanding Paper Award winner, titled "Learning Dynamics of LLM Finetuning".

As MiniMax's own researcher @MikaStars39 highlighted on X: "M3 ran autonomously for nearly 12 hours, producing 18 commits and 23 experimental figures on its own, and got the core experiments working: it matched the predicted probability trends in the SFT stage; clearly observed the squeezing effect central to the DPO experiments; validated the Extend mitigation method proposed in the original paper."

Simultaneously, creators of developer tools highlighted the practical economic advantages of the model's new attention mechanism. The official team behind the agentic AI coding harness Cline posted an alert confirming day-one compatibility, stating: "The new MiniMax-M3 is their first model to have 1m context, multimodal, and agentic coding capability. Congratulations to @MiniMax_AI for the breakthrough in sparse-attention architecture cutting compute & cost to 1/20th their previous generation."

This sharp drop in execution costs shifts how developers view the relationship between financial investment and capability. Tech commentator @jumperz mapped out this disruption, noting how M3 breaks a historical pattern in machine learning pricing.

By addressing context scaling limitations through fundamental attention-level optimizations rather than brute-force hardware scaling, MiniMax has established a highly efficient open-source baseline. M3 demonstrates that the next phase of agent development will not just be driven by larger datasets, but by efficient architectural choices that make frontier-level performance accessible to the broader open-source community.
For enterprises building autonomous software development or agent infrastructure, MiniMax M3 provides the ultimate "bang for the buck." While DeepSeek-V4 Pro holds a microscopic price advantage of $0.195 per million tokens, MiniMax M3 justifies its marginal premium by delivering superior autonomous software engineering resolution rates (59.0% SWE-Bench Pro). More importantly, because M3 is an open-weights model, the calculation extends far beyond the API chart. By deploying M3's weights locally inside private enterprise clouds, organizations completely bypass cloud data egress tracking, eliminate structural vendor lock-in, and can implement custom prefix-caching models on internal hardware. This technical approach transforms a highly efficient runtime budget into a permanent, privately owned corporate asset.
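The price comparison above can be reproduced from the pricing snapshot. The sketch below is illustrative arithmetic only: the prices are copied from the article's table and text, the helper names are made up, and it ignores cache discounts and subscription plans. It also shows that the "Total Cost" column implicitly assumes equal input and output volume, so the ranking can change for other workloads.

```python
# USD per 1M tokens as (input, output), copied from the pricing snapshot.
PRICES = {
    "deepseek-v4-pro": (0.435, 0.87),
    "MiniMax-M3 (promo)": (0.30, 1.20),
    "MiniMax-M3 (full)": (0.60, 2.40),
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.5": (5.00, 30.00),
}


def total_cost(model):
    """Cost of 1M input plus 1M output tokens (the table's 'Total Cost')."""
    price_in, price_out = PRICES[model]
    return round(price_in + price_out, 3)


def workload_cost(model, input_tokens, output_tokens):
    """Cost in USD for an arbitrary mix of input and output tokens."""
    price_in, price_out = PRICES[model]
    return price_in * input_tokens / 1e6 + price_out * output_tokens / 1e6


gap = round(total_cost("MiniMax-M3 (promo)") - total_cost("deepseek-v4-pro"), 3)
share_of_gpt55 = total_cost("MiniMax-M3 (full)") / total_cost("GPT-5.5")
share_of_gpt54 = total_cost("MiniMax-M3 (full)") / total_cost("GPT-5.4")

# A hypothetical input-heavy job: 2M input tokens, 0.5M output tokens.
m3_job = workload_cost("MiniMax-M3 (promo)", 2_000_000, 500_000)
deepseek_job = workload_cost("deepseek-v4-pro", 2_000_000, 500_000)

print(gap, round(share_of_gpt55, 3), round(share_of_gpt54, 3))
print(round(m3_job, 3), round(deepseek_job, 3))
```

The gap works out to the $0.195 cited above, and M3's full price lands at roughly 9% of GPT-5.5's total and 17% of GPT-5.4's. For the hypothetical input-heavy job, the promotional M3 price comes out slightly cheaper than deepseek-v4-pro, because M3's input rate is lower even though its output rate is higher.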

Across the frontier labs, the highest prompt injection figures published this spring are Anthropic’s. Point a red-teamer at its newest model in a browser, and the attacker hijacked it 31.5% of the time before safeguards engaged. OpenAI, Google, and Meta never gave security leaders a comparable number to set beside it. That figure looks like a liability. In this comparison, it is the opposite. It's the one solid piece of ground. Four frontier labs each shipped a prompt injection disclosure, and no two match. Anthropic put 244 pages and four agentic surfaces on the table on May 28. OpenAI reported one surface, connectors. Google moved the subject out of the model card and into a separate safety framework. Meta shipped no closed-model card at all. The Cross-Vendor Prompt Injection Disclosure Grid below maps what each lab tested, what each one measured, and the four places a side-by-side comparison falls apart. A prompt injection hides a malicious instruction in something an agent reads, a web page, a document, or a tool result. One planted line can exfiltrate records or fire off actions nobody approved, and these cards are a buyer's only first-party evidence. There is no industry standard for measuring any of this, and that is the root of the problem. Carter Rees, VP of AI at Reputation, told VentureBeat that prompt injection breaks the assumption that every legacy tool was built on. "A phrase as innocuous as, 'ignore previous instructions' can carry a payload as devastating as a buffer overflow, yet it shares no commonality with known malware signatures." With no shared signature to scan for, each lab built its own yardstick, and the results do not line up. Adam Meyers, Senior Vice President of Counter Adversary Operations at CrowdStrike, said that the exposure is now the buyer's to manage. "As you implement AI, it increases your attack surface, so now you have to be able to protect those AI models against adversary misuse or data poisoning or prompt injection." 
CrowdStrike's own frontline data shows the threat side is not standing still. In its 2026 Financial Services Threat Landscape Report, released in May, the company reported adversaries using AI to compress the time from initial access to impact faster than legacy defenses can respond. Anthropic measured four surfaces. The numbers swing by an order of magnitude depending on which one you read. The Opus 4.8 card does what others do not: It breaks prompt injection out by surface, and the spread is the story. Put the model in a coding environment, and an adaptive attacker from Gray Swan's Shade tool got through on 7.03% of single attempts with thinking on. Safeguards pulled that to 2.09%. Move the same class of attack into a browser, the surface behind Claude in Chrome and Claude Cowork, and the floor gives way. Anthropic put professional red-teamers on 129 web environments held out from training and printed every result in Table 5.2.2.4.A on page 81 of the system card. Per-attempt is the share of all injection attempts that got through across 129 environments at 10 tries each. Per-scenario is the harder cut, the share of environments where at least one try landed. Read down the per-attempt column without safeguards, thinking on, and the raw rate drops with each generation, from Sonnet 4.6 at 50.7% to Opus 4.8 at 31.5%. The lowest in the table, 5.9%, belongs to Mythos Preview, which nobody can buy yet. Turn safeguards on, and Opus 4.8 drops to 0.5%. Turn thinking off and it drops to zero across all 129 environments. OpenAI measured one surface, with attacks it already knew. The GPT-5.5 card, published April 23 and updated April 24, handles prompt injection in one place, a single section on robustness to known attacks against connectors. OpenAI reports it as a robustness score where higher is better, the inverse of an attack success rate. GPT-5.5 came in at 0.963, down from 0.998 for GPT-5.4-thinking. That one figure is the whole disclosure. 
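The per-attempt and per-scenario cuts described for the browser evaluation can diverge sharply on the same raw results. A minimal sketch follows, with made-up outcomes; the function name and the toy data are illustrative and are not taken from any system card.

```python
def attack_success_rates(outcomes):
    """Compute per-attempt and per-scenario attack success rates.

    outcomes: one list per environment, holding one bool per injection
    attempt (True means the injection succeeded).
    """
    attempts = [hit for env in outcomes for hit in env]
    per_attempt = sum(attempts) / len(attempts)
    # An environment counts as compromised if at least one attempt landed.
    per_scenario = sum(any(env) for env in outcomes) / len(outcomes)
    return per_attempt, per_scenario


# Toy data: 4 environments, 5 attempts each.
toy = [
    [False, False, False, False, False],
    [True, False, False, False, False],
    [True, True, False, False, False],
    [False, False, False, False, False],
]
per_attempt, per_scenario = attack_success_rates(toy)
print(per_attempt, per_scenario)
```

Here only 3 of 20 attempts succeed (15%), yet half of the environments are breached at least once. That is why the per-scenario figure is the harder cut, and why a single headline number needs its unit stated before it is compared with anything else.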
Anthropic tested four surfaces against an adaptive attacker that rewrites its approach based on what the model does, then ran a one-week bug bounty where red-teamers tried to break the model live. When the coding results came back worse than Opus 4.7, the card said so.

Lay the 0.963 next to the 31.5%, and they look like they belong on a scoreboard. They do not. One is a robustness score against known attacks on one surface. The other is a per-attempt attack success rate across 129 browser environments against an attacker that adapted in real time.

Google and Meta never put the number in the card at all

Google's Gemini 3 files prompt injection under mitigations, and the launch materials describe stronger resistance with no number attached. The Frontier Safety Framework report does run red teaming, but across its capability domains, and prompt injection is not one of them. No model card, no framework page, no per-surface number a buyer can lift into a risk review.

Meta ships open weights with no closed-model card. Prompt injection defense sits in a separate stack, Purple Llama's LlamaFirewall. A PromptGuard 2 classifier and an AlignmentCheck auditor, run against the public AgentDojo benchmark and its 97 tasks, cut attack success from 17.6% with no defense to 1.75% combined. Real numbers. They grade the guardrails on a public benchmark, not the model on a deployment surface a security team would recognize.

The Cross-Vendor Prompt Injection Disclosure Grid

The grid below works on any frontier model security teams are weighing. Each row marks a place where the four labs are split. Each split is where a quick comparison breaks. The Anthropic figures come from the Opus 4.8 system card. Everything for the other three comes from each vendor's published safety documentation.
| Dimension | Anthropic, Opus 4.8 | OpenAI, GPT-5.5 | Google, Gemini 3.x | Meta, Llama stack |
| --- | --- | --- | --- | --- |
| Safety document | System card, May 28 2026, 244 pages | System card, April 23 2026, updated April 24 | Model card plus a separate Frontier Safety Framework report | No closed-model card. Open weights plus the Purple Llama stack |
| Injection benchmark or dataset | ART from Gray Swan and UK AISI, the Shade tool, plus an internal browser eval, 129 environments | Internal connectors evaluation, known attacks | None for injection | AgentDojo, 97 tasks |
| Surfaces with an injection eval | Four. Tool use, coding, computer use, browser | One. Connectors | None published for injection | One. AgentDojo agent tasks |
| Multi-attempt escalation shown | Yes. ART benchmark at 1, 10, 100. Coding and computer use at 1 and 200 | No. A single score | No | No |
| Headline metric and unit | Attack-success rate. Browser, with thinking, 31.5% raw, 0.5% safeguarded | Robustness score, higher is better. 0.963, down from 0.998 for GPT-5.4-thinking | None published. Increased resistance claimed qualitatively | Attack-success rate on AgentDojo. 17.6% baseline to 1.75% combined |
| Live external bounty | Yes. One-week live injection bounty with external red-teamers | No injection bounty. Bio bounty only | None found | None found |
| Regression disclosed | Yes, explicit, with numbers | Number fell 0.998 to 0.963, not framed as a regression | Increased resistance claimed, no numbers | Not applicable |

Five factors security teams need to consider now

Anthropic tested four surfaces and printed every number. OpenAI tested one. Google printed no per-surface rate. Meta graded its guardrails, not the model. The four disclosures do not add up to a comparison. These five steps build one.

1. Pull every agent you have deployed or scoped and tag each by the surface it touches: browser, code, connectors, or desktop. Anthropic's rate for Opus 4.8 runs 2.09% on coding and 0.5% on browser. A blended number covers neither. Pull the vendor's published rate for your specific surface. If the vendor never published one, treat it as untested.

2. Send the Cross-Vendor grid to every vendor under evaluation. A 0.963 connectors score and a 31.5% browser rate were never on one scale. Demand a per-surface attack success rate, raw and safeguarded, with the attacker methodology named. The blank cells are the surfaces with no first-party evidence.

3. Confirm in writing which number your integration gets. Anthropic's 0.5% comes from Claude in Chrome and Cowork with the full safeguard stack. On the API, the model ships without them. Do not accept a product number for an API deployment.

4. Add two clauses to the RFP. The vendor tested with an adaptive attacker that rewrites payloads against the model, and someone outside the company tried to break it. Anthropic ran Gray Swan's adaptive Shade tool and a one-week paid bounty. OpenAI tested known attacks on one surface. Adversaries do not submit known payloads.

5. Run your own injection test before any agent ships. Vendor numbers come from vendor environments with vendor system prompts. Your stack has its own prompts, permissions, and data access. Set a pass threshold. Anything above it does not go live.

The bottom line. No standard exists for this yet. A vendor's number tells you what it chose to measure. Your own red team tells you what you are exposed to.
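The pass-threshold step can be made concrete with a small gate over your own test results. This is a sketch under stated assumptions: the function name, the surface labels and the 2% threshold are placeholders chosen for illustration, not a recommended standard.

```python
def release_gate(results, threshold):
    """Decide per surface whether an agent may go live.

    results: surface name -> (successful_injections, attempts) taken from
    your own red-team runs. A surface with no attempts is reported as
    "untested" rather than assumed safe.
    """
    decisions = {}
    for surface, (hits, attempts) in results.items():
        if attempts == 0:
            decisions[surface] = "untested"
            continue
        rate = hits / attempts
        decisions[surface] = "pass" if rate <= threshold else "block"
    return decisions


# Hypothetical results from an internal test run.
results = {
    "browser": (3, 200),     # 1.5% attack success
    "coding": (12, 200),     # 6.0% attack success
    "connectors": (0, 0),    # never tested
}
decisions = release_gate(results, threshold=0.02)
print(decisions)
```

Keeping the decision per surface, instead of blending all attempts into one rate, mirrors the point made in the first step: a single averaged number would hide the coding surface that fails here.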

Presented by Snowflake

Too often, the history of enterprise security has been a history of making things harder to use. A new threat emerges, a new control gets bolted on, and somewhere in the process, people start working around the very systems designed to protect them.

Over the course of my career, I’ve seen firsthand that security adoption rarely fails because people don’t care about security. It fails because the secure path feels harder than the insecure one.

In the age of AI, that lesson matters more than ever. AI expands the attack surface and raises the ceiling on what attackers can do, which makes simplifying security even more critical. Security controls that require effort or inconvenience eventually get ignored. People find workarounds. The answer is to make the secure path the easiest path.

Security works best when it gets out of the way

When security is easier to use than to avoid, people adopt it.

Years ago, when the industry was rolling out two-factor authentication at scale, the biggest challenge wasn’t building the security itself, but the friction that came with using it. People had to stop what they were doing, grab a phone, launch a VPN, enter codes, and interrupt their workflow just to log in. What ultimately drove adoption wasn’t policy, compliance requirements, or security training. It was simplicity. Now that it’s as easy as a fingerprint or a face scan, people use it without hesitation.

The same principle drove browser makers to make security more visible and intuitive for everyday users. Rather than expecting people to manually inspect URLs, modern browsers prominently flag non-HTTPS sites as insecure, helping guide users toward safer behavior by default. Security became stronger in part because the secure path also became the easier and more obvious one.

Where complexity shows up in AI

Agent permissions are a good example of where this plays out in AI systems.
Employees accumulate numerous permissions over time through a project here, a system access there, a role that never got cleaned up after a team change. Humans know which access is relevant to a task even if the system doesn't actively enforce it. Agents lack that judgment. An agent assigned to a problem will probe every available path. If it can access 12 systems but the task requires only two, it might still explore the other 10. It’s just being thorough, but the result is a potential attack surface far larger than the task required.

The temptation is to put a human in the loop by flagging significant actions and asking for approval before proceeding. But in practice, an agent may prompt a human to approve a deeply technical action without enough context to judge whether it’s appropriate. In most cases, the human will approve it simply to keep the workflow moving. This only adds friction and a false sense of oversight.

What's really needed is a permissioning model built around intent. The agent should have only the credentials it needs for a specific task, and they should expire when it’s done.

The industry is already beginning to move toward better models. Standards like OAuth are evolving to support agentic AI, allowing agents to carry identities scoped to a specific task, rather than a user's full permission set.

Making AI security easy to use

Ease of use starts with visibility, so the first priority is knowing what's actually happening. Where are your agents connecting? What data are they touching? What permissions are they exercising? Many enterprises are surprised by the answer when they first look.

Most organizations operate with roughly 80% visibility and control. The problem is the remaining 20%, because that’s where the real risk tends to live. AI is going to find those gaps far faster than humans can. Start with monitoring, even if you’re not ready to enforce anything yet. Use AI to sift through what you find and prioritize the highest-risk behaviors.
Then close those down systematically.

On the identity side, move toward workload identity wherever you can. The old model of creating service accounts, downloading keys, and distributing them across your infrastructure is fragile and hard to audit. Modern cloud environments offer a better approach: a workload's identity is established at deployment and credentials are never distributed as static keys. The management burden drops and the attack surface shrinks with it.

For agents specifically, resist the temptation to give them broad permissions on the assumption that human approvals will catch problems before they happen. Scope agent access to the task at hand and ensure those permissions expire once the work is complete. For teams managing multiple agent-to-tool connections, MCP gateways are emerging as a practical way to encode governance rules centrally rather than tool by tool. Keep a human in the loop for consequential actions, not every action, particularly those where the blast radius of a mistake is meaningful.

The pace of risk is accelerating

In the AI era, the gap between exposure and exploitation is rapidly disappearing, collapsing from days to hours and, in some cases, minutes. CrowdStrike's 2026 Global Threat Report documents that the average attacker breakout time has accelerated by 65% year over year. As AI becomes more capable of autonomously identifying weaknesses, security teams relying on manual response processes will fall behind.

The answer, though, hasn't changed. Security that creates friction will eventually get bypassed. Security embedded directly into the architecture, enforced by default and invisible in practice, is the kind that actually holds. AI raises the stakes, but the principle remains the same: security only works when the secure path is also the easiest one.

Mayank Upadhyay is Chief Security & Trust Officer at Snowflake.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

In 2024, researchers from the University of Illinois found that GPT-4, when provided with a common vulnerabilities and exposures (CVE) description, could autonomously exploit 87% of a curated 15-vulnerability one-day dataset. Without the description, it could only exploit 7%. This provided a “margin of safety” for the industry because while AI could exploit known vulnerabilities, it could not discover them.

However, on April 7, Anthropic announced that Claude Mythos Preview had closed that margin, with the model autonomously discovering thousands of zero-day vulnerabilities across major operating systems and browsers. Separately, Mythos scored 83.1% on the CyberGym vulnerability reproduction benchmark. In one campaign targeting OpenBSD across 1,000 scaffold runs, the total compute cost was less than $20,000.

Exploitation timelines are collapsing. Langflow’s CVE-2026-33017 (CVSS 9.8) was exploited 20 hours after disclosure with no public proof-of-concept. Marimo’s CVE-2026-39987 (CVSS 9.3) was hit in 9 hours and 41 minutes.

The defensive infrastructure most organizations rely on wasn’t designed for this. Rapid7’s 2026 threat landscape report states that the median time from CVE publication to CISA's known exploited vulnerabilities (KEV) listing is five days. Google’s M-Trends 2026 report found that exploitation is happening before a patch is even released. When the Langflow advisory was published, the first exploit arrived in 20 hours. When the Marimo advisory was published, it took under 10 hours. The assumption that your patch window is safe because exploitation takes time is no longer true.

Here are your building blocks.

Replace CVSS-only prioritization with a three-layer filter

Most vulnerability management programs still prioritize by CVSS score alone. CVSS quantifies a vulnerability’s “theoretical” severity without considering whether a vulnerability is being exploited in the wild or how quickly someone could weaponize it.
Under CVSS-only ranking, a CVSS 8.8 vulnerability with a history of active exploitation (like Docker’s CVE-2026-34040) gets lower priority than a CVSS 9.8 vulnerability that may never be exploited in the wild.

A recent study validated against 28,377 real-world vulnerabilities offers a concrete replacement: a three-layer decision tree that combines CISA KEV status, Exploit Prediction Scoring System (EPSS) scores, and CVSS into a single prioritization filter.

Three-Layer Vulnerability Prioritization Filter

| Layer | Data source | Threshold | Action | SLA |
| --- | --- | --- | --- | --- |
| 1. Active exploitation | CISA KEV catalog | Listed | Immediate patching | Hours |
| 2. Predicted exploitation | EPSS via FIRST.org | Score ≥ 0.088 | Escalate to Tier 0 pipeline | 24 hours |
| 3. Severity baseline | CVSS via NVD | Score ≥ 7.0 | Typical remediation | Per policy |

Validated result: 18x efficiency gain, 85.6% coverage of exploited vulnerabilities, ~95% reduction in urgent remediation workload.

All three data sources are open and free. The described integration is entirely automatable. It’s possible to build a script to query the CISA KEV API, the EPSS API from FIRST.org, and the NVD, and have that script run against your asset inventory for every published CVE. The human in this process should remain in the loop as an approver, but not as the trigger.

Close the agent authorization gap

Rapid exploit creation changes not only how patches are prioritized but also how controls are configured for all the agent-driven systems that now hold privileged credentials. Your authorization policies have not been assessed against the behavior of AI agents, and that is now a measurable risk.

CVE-2026-34040 showed that Docker’s authorization plugin architecture silently bypasses every plugin when the request body exceeds 1MB. Common AuthZ plugins (OPA, Casbin, Prisma Cloud) are unaware of this type of bypass, which occurs in Docker’s middleware before the request reaches the plugin.
When Cyera demonstrated this vulnerability, they showed that an AI agent debugging infrastructure could infer the bypass path while completing a legitimate task, without any instruction to exploit anything.

The Internet Engineering Task Force (IETF) is working on authorization models for agents. The document draft-klrc-aiagent-auth-01, published in March by participants from AWS, Zscaler, Ping Identity, and OpenAI, proposes using the existing Secure Production Identity Framework for Everyone (SPIFFE) and OAuth 2.0 so that AI agents can obtain dynamically provisioned, short-lived credentials. Separately, the IETF Agent Identity Protocol draft (draft-prakash-aip-00) reports that out of about 2,000 surveyed Model Context Protocol (MCP) servers, none had authentication.

But these standards are months to years away from implementation. For now, security teams must proactively add agent-level test scenarios for all authorization boundaries, such as oversized requests, burst frequency, and multi-step escalation of privileged requests.

Map your credential blast radius

In a survey conducted by CSA/Zenity and published on April 16, 53% of organizations said they had already seen cases where AI agents exceeded their intended permissions, and 47% experienced a security incident involving an agent.

When AI builder tools such as Flowise (CVE-2025-59528, CVSS 10.0), Langflow, or n8n are compromised, the blast radius extends far beyond the host. These tools contain API keys to frontier models, database credentials, vector store tokens, and OAuth tokens to business systems. A compromised AI builder host is not just a single-system breach. It is a credential harvest that unlocks authenticated access to every connected service.

Without credential dependency maps for each AI tool host, incident response for agent compromise is guesswork. For every instance, document each credential, the extent of its access, and the relevant credential rotation process.
Also begin migrating static API keys to short-lived tokens where downstream services allow.

Five actions for this quarter

1. Deploy the three-layer KEV-EPSS-CVSS filter. Replace CVSS-only prioritization with the filter in the table above. Automate the collection of data from all three APIs as part of a scheduled script against your asset inventory. Desired outcome: 18 times more efficient, 85.6% coverage of exploited vulnerabilities, 95% reduction in urgent remediation workload.

2. Implement event-driven patching for Tier 0 services. Determine which services fall under the critical exposure tier: services exposed directly to internet users, AI builder hosts, and the container orchestration control plane. For this tier, trigger patching on CVE publication instead of waiting for the next maintenance window. Goal: deploy the patch to canary within four hours of a CVE being declared critical. Use the CISA KEV and EPSS feeds as the triggers. Where the four-hour goal cannot be met because of legacy dependencies, change-freeze windows, or rollback risk, immediately apply compensating controls: remove internet exposure to the vulnerable service, rotate its credentials, disable the affected functionality (if applicable), and name an exception owner for the exposure until a patch can be deployed. It is not acceptable to allow unbounded exposures for extended periods while awaiting a maintenance window.

3. Test authorization boundaries at agent scale. Create test cases for every AuthZ-protected API that AI agents may call. Specifically, include test cases for requests exceeding 1MB, 5MB, and 10MB body sizes. Also include test cases for burst rates above 100 requests per second and for unusual parameter combinations (privileged flags, host mounts, capability additions). Additionally, patch to Docker Engine 29.3.1 to fix CVE-2026-34040.

4. Map the credential blast radius for all AI builder hosts. Document each credential for each Langflow, Flowise, n8n, and custom AI pipeline instance. Classify each credential by its lifespan (static key vs. short-lived token). Identify what each credential can access. Set up alerts for credential access from an anomalous IP address or identity.

5. Run a shadow AI discovery scan this week. According to CSA data, there is a greater than 50% chance that your agents have exceeded their expected boundaries. Check your Security Information and Event Management (SIEM) and network monitoring tools for communications to the default ports of AI builder tools: Langflow 7860, Flowise 3000, and n8n 5678. Any unauthorized instances are an unmonitored attack surface.

The takeaway

AI agents are emerging, and the standards bodies are responding. The IETF has multiple drafts related to agent authentication and authorization. The Coalition for Secure AI has published its MCP Security taxonomy and Secure-by-Design principles. But these standards move at standards-body speed, and the exploit window is now measured in hours.

Organizations that implement the three-layer filter and event-driven patching this quarter will have a measurable reduction in exposure. Those that wait will be running calendar-based patch cycles against an adversary that operates in less than 20 hours.

Nik Kale is a principal engineer specializing in enterprise AI platforms and security.
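The three-layer filter from the table earlier in this article reduces to a short decision function. The sketch below is a minimal illustration under stated assumptions: the `Finding` class and `prioritize` function are hypothetical names, the inputs are assumed to have been fetched already from the KEV catalog, EPSS, and NVD, and the final "Backlog" tier is an assumption because the article does not say what happens below a CVSS of 7.0. The thresholds and actions come from the table.

```python
from dataclasses import dataclass

EPSS_THRESHOLD = 0.088  # layer 2 threshold from the table
CVSS_THRESHOLD = 7.0    # layer 3 threshold from the table

@dataclass(frozen=True)
class Finding:
    cve_id: str
    in_kev: bool  # listed in the CISA KEV catalog
    epss: float   # EPSS score, 0.0 to 1.0
    cvss: float   # CVSS base score, 0.0 to 10.0

def prioritize(finding: Finding) -> tuple[str, str]:
    """Return (action, SLA), checking KEV first, then EPSS, then CVSS."""
    if finding.in_kev:
        return ("Immediate patching", "Hours")
    if finding.epss >= EPSS_THRESHOLD:
        return ("Escalate to Tier 0 pipeline", "24 hours")
    if finding.cvss >= CVSS_THRESHOLD:
        return ("Typical remediation", "Per policy")
    # Assumption: the source does not define a tier below CVSS 7.0.
    return ("Backlog", "Unspecified")

# Placeholder identifiers and scores, not real CVE data.
findings = [
    Finding("EXAMPLE-1", in_kev=True, epss=0.01, cvss=5.0),
    Finding("EXAMPLE-2", in_kev=False, epss=0.088, cvss=4.0),
    Finding("EXAMPLE-3", in_kev=False, epss=0.01, cvss=9.8),
    Finding("EXAMPLE-4", in_kev=False, epss=0.01, cvss=3.0),
]
for f in findings:
    print(f.cve_id, prioritize(f))
```

The order of the checks is the point: a KEV-listed finding with a modest CVSS score still lands in the hours tier, while a CVSS 9.8 finding with a low EPSS score and no KEV listing falls to typical remediation.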

Enterprise AI agents are stalling, not because of model performance, but because of permissioning. Every agentic workflow eventually hits the same wall: what is this agent allowed to touch, on whose behalf, and how does the system know?

Workday's answer is to make its existing system of record the governance layer for agents. Gerrit Kazmaier, the company's president for product and technology, told VentureBeat in an interview that customers often struggle when they cobble together solutions for their agents.

“Sana makes sure the integrity of the approvals and security model is always adhered to,” Kazmaier said. “Frankly, that’s where we see customers struggling when they try to build do-it-yourself AI by just accessing raw data, so the richness of the security model gets lost, and the results become overly broad.”

Workday, which launched Sana in March, expanded its partnership with Google to bring its Sana agent system of record to Gemini Enterprise, so agents built on Sana are also discoverable there.

Architecting accuracy

Kazmaier said the biggest hurdle they faced was ensuring agent accuracy, especially for HR and finance users.

“Almost right is not acceptable,” Kazmaier said. “Think about paying people correctly, closing the books or managing work schedules reliably.”

Accuracy is harder to evaluate here than in most AI contexts. Policy configurations, role-based security, and organizational hierarchies are deeply interrelated, so a small error compounds. And unlike most generative AI outputs, HR and finance queries often lack a correction loop. By the time a paycheck processes incorrectly or an interview is scheduled wrong, the damage is done.

Workday addressed this by building Gemini in as its base reasoning layer, then adding its context engine and business process logic on top. Workday also added verification and classification models that “interrogate” outputs before execution.
Accuracy and identity, it turns out, are the same question: does the system know enough about the agent, the authorizing human, and the current state of the record to act correctly? Workday’s advantage is that it can infer its customers' organizational structures from the data they provide. Already, third-party identity providers like Okta verify their information by checking Workday, so its context is the system of record for many enterprises. Kazmaier said the Sana Self-Service Agent uses Gemini as the conversational surface to trigger the workflow. The user is then authenticated and authorized through Workday’s identity and security model. Sana agents will only act on behalf of that user and work within their current permissions. Audit trails follow the same logic: Gemini retains only interaction logs, while the main audit remains within Workday and its customer. For many practitioners in the HR and finance space, the permission and governance layer in the agent system of record is key in regulated spaces. “It has to live in the system of record, that’s not a preference, that’s the only way it works,” said Dan Obendorfer, director of product at Würk, in an email to VentureBeat. “If your permissions are defined somewhere outside of where the data actually lives, you’ve already lost.” Kadan Stadelmann, chief technology officer and co-founder of Compance.AI, made the same point separately. “Without agent ownership, performance, costs or actions, chaos ensues.”

Enabling LLMs to acquire new knowledge after training remains a major hurdle for enterprise AI: current solutions are either too expensive, too slow, or constrained by context window limits.

MeMo, a framework from researchers at multiple universities, encodes new knowledge into a dedicated smaller memory model that operates separately from the main LLM. The modular architecture works with both open- and closed-source models and sidesteps the complexity of RAG pipelines and full model retraining.

Experiments show that MeMo handles complex queries reliably even when retrieval pipelines are noisy. It avoids the catastrophic forgetting associated with direct fine-tuning and provides a cost-effective pathway for continuous knowledge updates.

The challenge of updating LLM memory

Large language models are frozen after training, and their internal knowledge remains static until they undergo subsequent, computationally massive updates. Currently, developers rely on three main approaches to integrate external knowledge into an LLM, each with distinct drawbacks:

Non-parametric methods, such as retrieval-augmented generation (RAG) and in-context learning, retrieve relevant documents from an external database and insert them directly into the model's prompt. While popular, these methods are limited by context window sizes. As Armando Solar-Lezama, a co-author of the paper, told VentureBeat, “Vector databases have a fundamentally difficult job of encoding the full semantics of a chunk of text in a single vector, and then match that vector to a query, even when the relevance of the chunk... may only be apparent in the context of other chunks.” The researchers note that the semantic similarity of embeddings often does not correspond to what a user's query actually requires. Processing thousands of retrieved tokens also creates substantial computational overhead and inference latency. Most problematically, RAG systems are highly sensitive to noise.
Irrelevant or poorly retrieved passages often degrade the model's final response.

Parametric methods, like continual pretraining or supervised fine-tuning, attempt to internalize new knowledge directly into the LLM's weights. Updating modern, massive LLMs is prohibitively expensive and typically impossible for proprietary, closed-source models hidden behind APIs. Fine-tuning is also prone to causing catastrophic forgetting. Forcing the model to adapt to new corporate data often erodes its previously acquired reasoning capabilities and safety guardrails.

Latent memory methods, such as context compression, offer a middle ground. They compress knowledge into compact "soft tokens" or representations that are added to the model’s context during inference. The fatal flaw here is "representation coupling." The compressed memory is strictly bound to the model architecture that produced it; you can't transfer a latent memory trained on an open-source model to a closed-source one.

How MeMo works

The MeMo (Memory as a Model) framework introduces a modular architecture featuring two separate components. The MEMORY model is a small language model trained specifically to encode new knowledge into its parameters. The EXECUTIVE model is a frozen, off-the-shelf LLM that functions as the reasoning engine. When a user asks a question, the EXECUTIVE model treats the MEMORY model as an external oracle, issuing targeted sub-queries to gather facts and synthesizing those facts into a final answer.

The core design principle driving MeMo is the concept of "reflections." Reflections are targeted question-answer (QA) pairs designed to capture every possible angle of a knowledge corpus. Rather than forcing the AI to process a massive, unstructured document corpus during training, MeMo uses a GENERATOR model to distill the raw text into thousands of targeted QA pairs.
The MEMORY model is then fine-tuned on this dataset to answer questions using only its parametric knowledge without the need to read retrieved context.

At inference time, the interaction between the two models follows a structured, three-stage protocol:

1. The EXECUTIVE model decomposes a user's complex query into a set of atomic sub-questions. The MEMORY model answers each independently to establish the basic facts.
2. Using those initial clues, the EXECUTIVE model issues follow-up queries to narrow down candidate entities until it confidently converges on a specific target.
3. Finally, the EXECUTIVE model queries the MEMORY model for supporting facts about that target entity and synthesizes the retrieved snippets into a cohesive answer.

This architecture merges the strengths of the three existing AI memory paradigms while bypassing their pitfalls. It leverages off-the-shelf frontier models by keeping memory storage separate from reasoning, guaranteeing compatibility with both open-weight and closed API models. It internalizes knowledge directly into parameters, but isolates the updates to a smaller, dedicated MEMORY model to protect the reasoning engine. Finally, it creates a queryable memory artifact that is not tied to any specific model and can be used with different LLM families.

Handling continual knowledge updates

Managing an AI's memory requires continuous updates as company policies change and new reports are published. Normally, updating a model's parameters requires retraining it from scratch on both the old and the new data combined. As the knowledge base grows, this cumulative retraining cost becomes unmanageable.

To handle continual updates efficiently, MeMo relies on a technique called "model merging." Instead of a massive joint retraining phase, MeMo trains a new, independent MEMORY model exclusively on the newly added documents. The system derives a "task vector" representing the parameter changes learned from the fresh data.
These updates are then mathematically merged into the weights of the original MEMORY model. This approach reduces the computing hours required to keep the system current while avoiding the interference that causes catastrophic forgetting. This efficiency comes with a trade-off: model merging incurs an 11% to 19% accuracy drop compared to a full retrain, depending on the reasoning model used.

MeMo in action

To measure real-world effectiveness, the research team evaluated MeMo against several industry benchmarks that require complex, multi-hop reasoning across multiple documents.

The researchers used Qwen2.5-32B-Instruct as the GENERATOR model to distill raw text into reflections. For the primary MEMORY model, they deployed Qwen2.5-14B-Instruct. They also validated the approach on smaller 1-2B parameter models across different architectures, including Gemma3-1B. For the EXECUTIVE reasoning model, they tested both the open-weight Qwen2.5-32B and Google's proprietary Gemini 3 Flash.

They benchmarked MeMo against a "Perfect Retrieval" upper bound (where the exact correct documents are manually provided) and several advanced retrieval systems, including traditional BM25 search, dense vector retrieval, and state-of-the-art graph-based RAG (HippoRAG2). They also tested "Cartridges," a recent method that loads a trained KV-cache onto the model during inference.

MeMo dominated in long-document reasoning. On the NarrativeQA benchmark, MeMo achieved 53.58% accuracy paired with Gemini 3 Flash, according to the researchers. HippoRAG2 maxed out at 23.21%.

Enterprise systems frequently need to synthesize complex answers, such as traversing overlapping regulatory frameworks written independently by different bodies, or consolidating insights across a massive codebase and external documentation. Traditional RAG systems falter here because they hit context window limits and fail to connect concepts spanning hundreds of pages.
MeMo succeeds because those connections are mapped and internalized inside the MEMORY model during training. It is "like having your very own Malcolm Gladwell that can connect the story of the Beatles with the story of Bill Gates to make an argument about the nature of expertise," Solar-Lezama said.

The experiments revealed another major advantage: upgrading the reasoning engine requires zero retraining. Simply switching the EXECUTIVE model from the open-source Qwen to the proprietary Gemini 3 Flash boosted MeMo's performance by 26.73% on NarrativeQA and 11.90% on the MuSiQue benchmark. For practitioners, this means you can train a MEMORY model securely on your private data and instantly plug it into the latest commercial APIs, continuously upgrading system intelligence without incurring new training costs.

The research team described the integration as requiring no additional setup: "The base (or Executive) LLM that teams are already using in RAG can be configured to query the Memory model directly. These queries are done in natural language, similar to sending a message request to an API, with no additional setup required."

MeMo also handles noisy data exceptionally well. When researchers deliberately flooded the dataset with irrelevant documents (up to twice the amount of the useful information), HippoRAG2’s performance dropped by 11.55%. MeMo's performance remained relatively stable, dropping less than 2%.

Enterprise knowledge bases are typically messy, filled with duplicate documents and outdated policies. Standard RAG systems struggle with this noise, pulling incorrect paragraphs into the prompt and causing hallucinations. Because MeMo's EXECUTIVE model interacts with a synthesized oracle rather than raw document chunks, it remains highly robust against disorganized corporate data.

Limitations and trade-offs

For engineering teams looking to deploy MeMo, there are several key limitations to consider.
Unlike traditional RAG systems that quickly index raw documents into a vector database, MeMo requires an upfront training cost for each new corpus. The data generation pipeline used to synthesize the training reflections is computationally expensive. For example, the team noted that "generating the full reflection QA dataset took approximately 240 GPU-hours on NVIDIA H200s," while training a 14B parameter MEMORY model "took approximately 180 H200 GPU-hours." As Solar-Lezama said, "Reducing the training cost is one of the most significant open research problems in order to make this a workhorse technique." Because the MEMORY model is a fixed-size neural network, its ability to internalize knowledge is bounded by its representational capacity. While the researchers did not hit a hard limit during their benchmarking, they hypothesize that “sufficiently large or information-dense corpora will exceed what a fixed-size MEMORY model can correctly compress and represent.” Finally, because MeMo synthesizes answers from parametric memory rather than retrieving exact text snippets, it obscures the provenance of the information. This makes it difficult to attribute specific claims to original source documents, which poses a critical compliance issue for enterprise applications requiring strict audit trails. Deciding between MeMo and traditional RAG comes down to a heuristic of "lookup vs. synthesis," alongside data volatility. The researchers advise that "traditional RAG would be preferred when answers live in a single document or when there is a well-defined source... MeMo would be preferred when the task shifts from lookup to synthesizing an answer from information scattered across multiple chunks." If your knowledge corpus changes rapidly (e.g., daily feeds) and you require exact source citations, RAG remains the better option due to the upfront training cost of MeMo. 
If your corpus consists of generalized domain knowledge that evolves slowly relative to its volume, MeMo offers vastly superior reasoning. Teams can also adopt a hybrid routing architecture in production: sending "lookup" queries to a standard vector database and "synthesis" queries to the MEMORY model. "Looking further out, I would expect memory models to become a standard architectural component alongside retrieval," Daniela Rus, co-author of the paper and director of the MIT Computer Science and Artificial Intelligence Lab (CSAIL), told VentureBeat, "in the same way that caching and indexing are standard components of any serious data system today."
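The model-merging update described earlier can be illustrated with a toy example. The article does not give the exact merging rule, so this sketch assumes the simplest form of task arithmetic: the task vector is the difference between the newly trained MEMORY weights and the shared starting weights, and it is added to the existing MEMORY model. The function names, the `scale` parameter, and the flat-list weight format are all hypothetical; real models would use tensors.

```python
def task_vector(finetuned, base):
    """Parameter changes learned from the new documents: finetuned minus base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }

def merge(memory, vector, scale=1.0):
    """Add a (scaled) task vector to the existing MEMORY weights."""
    return {
        name: [m + scale * v for m, v in zip(memory[name], vector[name])]
        for name in memory
    }

# Toy weights: one layer with two parameters.
base = {"w": [0.0, 1.0]}             # shared starting checkpoint
memory_old = {"w": [0.5, 1.5]}       # MEMORY model trained on the original corpus
memory_new = {"w": [0.25, 0.5]}      # independent MEMORY model trained on new documents only

tau = task_vector(memory_new, base)  # {"w": [0.25, -0.5]}
merged = merge(memory_old, tau)      # {"w": [0.75, 1.0]}
print(merged)
```

Only the new documents are trained on, which is where the compute saving comes from. The overlap between the old and new parameter changes is also where the reported accuracy drop relative to a full retrain would arise.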

At 620 million monthly users, calling a frontier model for every image recommendation isn't a strategy; it's a bill. Pinterest CTO Matt Madrigal solved it by gutting Qwen3-VL's vision layer and rebuilding it with proprietary embeddings, cutting costs 90% and boosting accuracy 30%.

Madrigal’s team has been heavily investing in customizing open-source models “foundationally in-house.”

“If you've got really unique data that you can then fine-tune an open source model with, data quality will, frankly, outweigh or overcome model size,” Madrigal explained in a recent VB Beyond the Pilot podcast.

How Pinterest customized Qwen for visual discovery

Pinterest, which has around 620 million monthly active users, has long applied open source models for visual search and discovery, going back to Google’s BERT and OpenAI’s CLIP. The company fine-tuned its own Pin CLIP on the latter, incorporating proprietary visual embeddings and image metadata.

Pinterest’s conversational shopping assistant, Navigator 1, was built on Qwen3-VL and customized in “pretty significant” ways. Madrigal’s team essentially “ripped out” Qwen’s vision encoder layer and fine-tuned the model on proprietary multimodal embeddings. This has allowed them to capture metadata around pins and images that can then be precomputed offline and regularly retrained on new information to deliver personalized experiences.

“Open-source models, especially with open Apache licenses where you can truly tweak a lot of open weights and customize for unique use cases, that's where we've found open source to be so powerful for us,” Madrigal said.

Bringing their own embeddings allows his team to gain context around metadata, pins, and images. Notably, the model also performs better at runtime and inference. Without these embeddings, devs would have to call and encode each image returned at runtime, one at a time. That results in a latency “20 times worse” from an inference perspective, Madrigal said.
“If it's something that's going to be critical for our end users, that's going to drive engagement, that will have to scale to over 600 million monthly active users, we're going to either probably build it or we're going to leverage open source and customize the heck out of it,” he said.
How a taste graph captures evolving interests
To guide users from inspiration to purchase, Madrigal's team built a "taste graph": a dynamic representation of what individual users actually like, not just what they click on. “It's this representation of billions of people's evolving tastes,” he said. People go to Google or other search engines when they have a clear picture of what they want; Pinterest is for when they’re still in the discovery phase, Madrigal said. Pinterest’s goal is to encourage “lateral exploration” and transform discovery to intent (that is, clicking through ads or making purchases). Under the hood, the architecture combines a graph structure with representational learning. User embeddings capture a user’s evolving tastes. These are constantly updated based on activity and new content and signals. “It's not a social graph,” Madrigal said. “It's much more of a preference graph: What's going to inspire you? What are you trying to do next?” For instance, one user may be into mid-century modern designs; another may prefer a Nantucket aesthetic. Those preferences will be captured in user embeddings, and the taste graph will deliver up specific, relevant products as a result. “You go from the upper funnel, inspiration discovery, all the way through lower funnel intent,” Madrigal said.
Listen to the full podcast to hear more about:
- How Pinterest uses sandboxes to encourage creativity in a way that is secure and contained
- Why a continuous feedback loop can prevent visual AI slop
- The importance of constant benchmarking to gauge user engagement, performance, latency, and other factors
You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

As enterprise AI agents move into production, organizations are confronting a growing reliability problem. Many teams are discovering that LLM performance alone does not determine whether agents succeed in production. Long-running AI workflows must survive crashes, preserve state, recover from failures, manage inference costs, and coordinate across APIs, tools, and enterprise systems. After a first wave focused on rapid deployment, organizations now need to revisit those first-generation implementations, and redesign early agent architectures around workflow orchestration, observability, governance, and recovery, said Preeti Somal, Senior VP Engineering at Temporal Technologies, during the latest AI Impact Series event in New York. “We do have a lot of customers that come to us where they’re building version 2.0 of the same agent,” Somal said. “They had to move really fast, but they didn’t take care of the plumbing. Things crash and burn, and then they’re back to rebuilding with the reliable foundation.” For workflow orchestration company Temporal, whose infrastructure predates the current wave of agentic AI, the shift reflects a broader enterprise realization: production AI systems require durable execution, state management, visibility into workflows, and mechanisms to recover when models or downstream systems fail.
Agentic AI has supercharged familiar engineering problems
“These patterns aren’t necessarily new,” Somal said. “AI just supercharges them.” Agentic systems introduce additional complexity because they often involve long-running, multi-step processes spanning multiple services, models, APIs, and tools. A single workflow might call several large language models, access retrieval systems, trigger external applications, and manage state over hours or days. The engineering questions, Somal said, often emerge only after deployment. “People will write agents but haven’t thought about what happens if the agent crashes,” she said.
“Am I going to need to run the entire agent flow again?” For enterprises operating under cost constraints, the answer matters. Restarting workflows after failures can multiply inference expenses, increase latency, and create poor customer experiences. Somal compared the current moment to an earlier period in enterprise cloud adoption, when organizations migrated workloads first and only later realized they needed to redesign the underlying architectures for those workloads to hold up over the long term. “This rush to do AI in a world where you haven’t even modernized your application reminds me a little bit of that lift-and-shift that happened in the cloud,” she said. “Everybody realized you’re spending more money on cloud and we haven’t gotten value there.”
Why long-running agents force a new architecture
Enterprise workflows increasingly involve agents executing over long windows, sometimes spanning many hours while interacting with tools and systems. Reliability challenges compound when workflows persist over time, affecting both state and memory, two ideas that are often treated interchangeably in AI conversations. State concerns workflow execution. It includes where an agent is in a process, which actions have already completed, and where recovery should resume after failure. Memory or context captures information an agent carries forward across interactions or tasks. “The state of the agent is around what step and what actions have been performed, and if something crashes, where do you want to recover from, versus the context and memory piece,” Somal explained. That distinction becomes increasingly important when enterprises begin moving beyond simple chatbot interactions toward longer-running business processes. Somal pointed to a healthcare example involving customer Abridge, where workflows process physician visits through multiple stages, including audio processing, summarization, model calls, and after-visit generation.
“There’s not just one piece to that flow,” Somal said. “Taking videos and slicing that, taking summaries, calling the LLMs, generating the after-visit summary, all of that is being orchestrated.” The implication for enterprises is that successful agents increasingly depend on systems that can survive interruptions, coordinate across services, and maintain continuity over time.
The rise of the deterministic spine
A useful framework for enterprise AI design is the deterministic spine, Somal said, and it is how the company thinks about Temporal's role. “It is denoting the path you want to take,” she said. “It is calling the brain, but if the brain doesn’t respond, it will call it again. If the brain responds but the next step is going to fail, it will pick up from where that failure happened.” In this framing, the language model acts as a probabilistic system producing variable outputs, while orchestration software maintains execution reliability around it. The concept matters because enterprise systems increasingly require consistency even when models remain non-deterministic. A procurement workflow, healthcare summary, customer support escalation, or compliance process cannot simply fail silently because a model call timed out or an external dependency crashed. “What you care most about is making sure that you can recover and that you’re not paying the token tax if something goes wrong,” Somal said.
Reliability, visibility, and the economics of token spend
As enterprise leaders evaluate AI ROI, cost visibility has become a growing concern. Long-running agents frequently make multiple model calls across complex workflows, which can create opaque spending patterns. Somal described one operational advantage of orchestration as visibility into where costs accumulate. Because workflows are observable step-by-step, teams can see where tokens are being consumed across an agent process. “You’ve got visibility into that entire flow in a single pane of glass,” she said.
“You can now see where you’re spending the tokens in an agent that is multiple steps and calling multiple different systems.” Workflow recovery also shapes cost efficiency. Without durable orchestration, a late-stage failure can force organizations to rerun an entire process from the beginning, including all prior model calls. Somal said systems designed around recovery can resume execution from the point of interruption. “You pick up from where the crash happened,” she said. “We save you the cost of running the agent from step one again.”
Enterprises need to build paved paths and enlist partner expertise
Governance concerns are another emerging pattern as agentic AI takes hold. Rather than adopting fully managed agent systems wholesale, Somal said, enterprises increasingly want standardized internal frameworks that provide guardrails while preserving flexibility and that implement necessary features such as governance controls, model selection policies, identity systems, cost management, and observability. “The enterprises are looking at building these paved paths,” she said. “Taking something off the shelf is maybe not going to work because there are all of these other requirements.” As organizations revisit first-generation deployments, challenges like these increasingly look less like a model problem and more like a systems engineering problem. Temporal is positioned to help enterprises take this next step in part because, for many organizations, it already existed as part of broader modernization programs before AI became a strategic priority. “Temporal is already in the enterprise,” Somal said. “Taking that and extending that to AI and agent platforms feels very natural.”
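The "pick up from where the crash happened" idea can be illustrated with a toy checkpointing loop. This is a minimal sketch under stated assumptions, not Temporal's API: `run_workflow`, the JSON checkpoint file, and the step functions are hypothetical, and a real durable-execution engine handles far more (timers, versioning, distributed workers).

```python
import json
import os
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict[str, Any]], Any]]


def run_workflow(steps: List[Step], checkpoint_path: str,
                 max_attempts: int = 3) -> Dict[str, Any]:
    """Run named steps in order, persisting each result so a rerun can resume."""
    done: Dict[str, Any] = {}
    if os.path.exists(checkpoint_path):
        # Results from an earlier, interrupted run.
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, fn in steps:
        if name in done:
            continue  # already completed: no repeated model call or token spend
        for attempt in range(1, max_attempts + 1):
            try:
                done[name] = fn(done)  # each step can read earlier results
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; earlier steps stay checkpointed
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return done
```

If the third of four steps fails permanently, a later call with the same checkpoint path skips the first two steps and resumes at the third, which is the "token tax" saving described above. Step results must be JSON-serializable in this sketch.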

Test-time scaling (TTS) has emerged as a proven method to improve the performance of large language models in real-world applications by giving them extra compute cycles at inference time. However, TTS strategies have historically been handcrafted, relying heavily on human intuition to dictate the rules of the model’s reasoning. To address this bottleneck, researchers from Meta, Google, and several universities have introduced AutoTTS, a framework that automatically discovers optimal TTS strategies. This automated approach allows enterprise organizations to dynamically optimize compute allocation without manually tuning heuristics. By implementing the optimal strategies discovered by AutoTTS, organizations can directly reduce the token usage and operational costs of deploying advanced reasoning models in production environments. In experimental trials, AutoTTS managed inference budgets efficiently, successfully reducing token consumption by up to 69.5% without sacrificing accuracy.
The manual bottleneck in test-time scaling
Test-time scaling enhances LLMs by granting them extra compute when generating answers. This extra compute allows the model to generate multiple reasoning paths or evaluate its intermediate steps before arriving at a final response. The primary challenge for designing TTS strategies is determining how to allocate this extra computation optimally. Historically, researchers have designed these strategies manually, relying on guesswork to build rigid heuristics. Engineers must hypothesize the rules and thresholds for when a model should branch out into new reasoning paths, probe deeper into an existing path, prune an unpromising branch, or stop reasoning altogether. Because this manual tuning process is constrained by human intuition, a vast number of possible approaches remain unexplored. This often results in suboptimal trade-offs between model accuracy and computing costs.
Current TTS algorithms can be mapped to a width-depth control space, where "width" is the number of reasoning branches explored and "depth" is how far each one develops. Self-consistency (SC) samples a fixed number of trajectories and majority-votes the answer. Adaptive-consistency (ASC) saves compute by stopping early once a confidence threshold is hit. Parallel-probe takes a more granular approach, pruning unpromising branches while deepening the rest. All three are handcrafted, and that's the constraint AutoTTS is designed to break. More advanced methods employ richer structures like tree search or external verifiers, but they are just as meticulously handcrafted. This manual approach restricts the scope of strategy discovery, leaving a massive portion of the potential resource-allocation space untouched.
Automating strategy discovery with AutoTTS
AutoTTS reframes the way test-time scaling is optimized. Instead of treating strategy design as a human task, AutoTTS approaches it as an algorithmic search problem within a controlled environment. This framework redefines the roles of both the human engineer and the AI model. Rather than handcrafting specific rules for when an LLM should branch, prune, or stop reasoning, the engineer's role shifts to constructing the discovery environment. The human defines the boundaries, including the control space of states and actions, optimization objectives balancing accuracy versus cost, and the specific feedback mechanisms. An explorer LLM, such as Claude Code, designs the strategy. This explorer acts as an autonomous agent that iteratively proposes TTS “controllers.” These controllers are code-defined policies or algorithms that dictate how an AI model allocates its computational budget during inference. The explorer tests and refines these controllers based on feedback until it discovers an optimal resource-allocation policy.
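For readers unfamiliar with the handcrafted baselines named above, here is a minimal sketch of the two simplest ones. The function names and the `sample` callable are hypothetical, and the early-stopping rule is a deliberately simplified vote-share threshold, not the stopping criterion from the published Adaptive-Consistency method.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(answers: List[str]) -> str:
    """SC: majority vote over a fixed batch of sampled final answers."""
    return Counter(answers).most_common(1)[0][0]


def adaptive_consistency(sample: Callable[[], str], max_samples: int = 64,
                         min_samples: int = 4,
                         threshold: float = 0.8) -> Tuple[str, int]:
    """Simplified ASC: sample one answer at a time and stop once the leading
    answer holds at least `threshold` of the votes seen so far.
    Assumes max_samples >= 1. Returns (answer, number of samples used)."""
    counts: Counter = Counter()
    answer = ""
    for n in range(1, max_samples + 1):
        counts[sample()] += 1
        answer, votes = counts.most_common(1)[0]
        if n >= min_samples and votes / n >= threshold:
            return answer, n  # early stop: remaining budget is saved
    return answer, max_samples
```

In a real system `sample` would run the base model once and extract its final answer. Both rules hard-code their thresholds, which is exactly the kind of manual choice the article says AutoTTS is meant to replace.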
To make this automated search computationally affordable, AutoTTS relies on an “offline replay environment.” If the explorer LLM had to invoke a base reasoning model to generate new tokens every time it tested a new strategy, the compute costs would be astronomical. Instead, it relies on thousands of reasoning trajectories pre-collected from the base LLM. These trajectories include "probe signals," which are intermediate answers that help the controller evaluate progress across different reasoning branches. During the discovery loop, the explorer agent proposes a controller and evaluates it against this offline data. The agent observes the execution traces of the proposed controller, which show how it allocated compute over time. By analyzing these traces, the agent can diagnose specific failure modes, such as noting if a controller pruned branches too aggressively in a specific scenario. This trace-level view tells the agent more than a final result alone would. The agent then iteratively rewrites its code to improve the accuracy-cost tradeoff.
Inside the AI-designed controller
Because the explorer agent is not constrained by human intuition, it can discover highly coordinated, complex rules that a human engineer would likely never hand-code. One optimal controller discovered by AutoTTS, named the Confidence Momentum Controller, leverages several non-obvious mechanisms to manage compute:
- Trend-based stopping: Handcrafted strategies often instruct the model to stop reasoning once it hits a certain instantaneous confidence threshold. The AutoTTS agent discovered that instantaneous confidence can be misleading due to temporary spikes. Instead, the controller tracks an exponential moving average (EMA) of confidence and only stops if the overall confidence level is high and the trend is not actively declining.
- Coupled width-depth control: Manually designed algorithms usually treat the "widening" of new reasoning paths and the "deepening" of current paths as separate decisions. AutoTTS discovered a closed feedback loop where the two actions are linked. If the confidence of the current branches stalls or regresses, the controller automatically triggers the spawning of new branches.
- Alignment-aware depth allocation: Instead of giving all active reasoning branches an equal computation budget, the controller dynamically identifies which branches agree with the current leading answer. It then gives those branches priority "bursts" of extra computation. This concentrates the computational budget on the emerging consensus to quickly verify if it is correct.
Cost savings and accuracy gains in real-world benchmarks
To test whether an AI could autonomously discover a better test-time scaling strategy, researchers set up a rigorous evaluation framework. The core experiments were conducted on Qwen3 models ranging from 0.6B to 8B parameters. The researchers also tested the system's ability to generalize on a distilled 8B version of the DeepSeek-R1 model. The explorer AI agent was initially tasked with discovering an optimal strategy using the AIME24 mathematical reasoning benchmark. This discovered strategy was then tested on two held-out math benchmarks, AIME25 and HMMT25, as well as the graduate-level general reasoning benchmark GPQA-Diamond. The AutoTTS-discovered controller was pitted against four manually designed test-time scaling algorithms. These baselines included Self-Consistency with 64 parallel reasoning paths (SC@64), Adaptive-Consistency (ASC), Parallel-Probe, and Early-Stopping Self-Consistency (ESC). ESC is a hybrid approach that generates trajectories in parallel and stops early when an answer seems stable. When set to a balanced, cost-conscious mode, the AutoTTS-discovered controller reduced total token consumption by approximately 69.5% compared to SC@64. At the same time, the controller maintained the same average accuracy across the four Qwen models.
When the inference budget was turned up, AutoTTS pushed peak accuracy beyond all handcrafted baselines in five out of eight test cases. This efficiency translated to other tasks. On the GPQA-Diamond benchmark, the balanced AutoTTS variant slashed the inference token cost from 510K tokens down to just 151K tokens, while slightly improving overall accuracy. On the DeepSeek model, AutoTTS achieved the highest overall accuracy on the HMMT25 benchmark while cutting the token spend nearly in half. For practitioners building enterprise AI applications, these experiments highlight two major operational benefits:
- Raising peak performance: AutoTTS doesn't just save money on token consumption. It actively raises the peak attainable performance of the base model. The AI-designed controller is remarkably good at detecting noisy or unproductive reasoning branches on the fly and continuously redirecting its compute budget toward the branches generating the most useful reasoning signals.
- Cost-effective custom development: Because the framework relies on an offline replay environment, the entire discovery process cost only $39.90 and took 160 minutes. For enterprise teams, that means optimized reasoning strategies tailored to proprietary models and internal tasks are now within reach, without a dedicated research budget.
Both the AutoTTS framework and the Confidence Momentum Controller are available on GitHub; the CMC can be used as a drop-in replacement for other TTS controllers.
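The trend-based stopping and coupled width-depth rules described earlier can be illustrated by replaying a recorded confidence trace, in the spirit of the offline replay environment. This is a hedged sketch, not the released Confidence Momentum Controller: the function name, the `alpha`, `stop_level`, and `stall_eps` parameters, and the single-number confidence signal are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Decision:
    step: int
    action: str  # "deepen", "widen", or "stop"
    ema: float


def replay_controller(confidences: Sequence[float], alpha: float = 0.5,
                      stop_level: float = 0.8,
                      stall_eps: float = 0.01) -> List[Decision]:
    """Replay a pre-recorded confidence trace and log one decision per step."""
    decisions: List[Decision] = []
    prev_ema: Optional[float] = None
    for step, conf in enumerate(confidences):
        # Exponential moving average smooths out one-step confidence spikes.
        ema = conf if prev_ema is None else alpha * conf + (1 - alpha) * prev_ema
        trend = 0.0 if prev_ema is None else ema - prev_ema
        # Stop only when smoothed confidence is high and not declining.
        if ema >= stop_level and trend >= 0:
            decisions.append(Decision(step, "stop", ema))
            break
        # Coupled rule: stalled or falling confidence spawns new branches.
        if prev_ema is not None and trend <= stall_eps:
            action = "widen"
        else:
            action = "deepen"
        decisions.append(Decision(step, action, ema))
        prev_ema = ema
    return decisions
```

On a trace such as `[0.5, 0.95, 0.4, 0.6, 0.8, 0.9, 0.95]`, an instantaneous threshold of 0.9 would stop at the second step on what turns out to be a spike, while the smoothed rule keeps going, widens after the drop, and stops only once the average itself clears the bar. Because the trace is pre-recorded, trying a different parameter setting costs no new model calls.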

Mistral AI used its first-ever developer conference on Wednesday to announce a sweeping expansion into industrial manufacturing, a new inference data center south of Paris, and a rebranding of its consumer-facing assistant — moves that collectively signal the three-year-old French startup's ambition to become the enterprise AI provider of record for companies that refuse to hand their most sensitive data to American hyperscalers. At the AI NOW Summit, held at a venue in central Paris, co-founder and CEO Arthur Mensch took the stage alongside CTO Timothée Lacroix and Chief Scientist Guillaume Lample to lay out a strategy that stretches from bare-metal GPU clusters to physics simulations for aircraft wings. The company disclosed that it now employs 1,000 people and is targeting €1 billion ($1.17B USD) in revenue for 2026 — a figure that, if achieved, would be an extraordinary growth trajectory for a company that began with 15 employees collaborating with its first customer, BNP Paribas, in 2023. "We have two convictions at Mistral," Mensch told the audience. "The first is that in order to deploy AI in the enterprise, you actually need, as an AI provider, to own the full stack." He described Mistral's business as fundamentally about "transforming electrons into tokens and intelligence," arguing that physical infrastructure control matters as much as model quality. The announcements come at a pivotal moment for Mistral and for the broader European AI ecosystem. The company has raised at least $3.9 billion across nine funding rounds, according to Clay's funding tracker, including a massive €1.7 billion Series C led by Dutch semiconductor equipment maker ASML in September 2025 at an €11.7 billion valuation, and an $830 million debt financing round in March 2026 from a consortium of seven banks to fund data center construction. 
Mistral now finds itself in a peculiar competitive position: too large to be dismissed as a research lab, but still dwarfed by the resources of OpenAI, Google DeepMind, and Anthropic. Its answer, articulated across nearly an hour of presentations Wednesday, is vertical depth: going industry by industry, workflow by workflow, and building the infrastructure to keep everything on premises.
Why Mistral is betting that physics AI will reshape how Airbus and BMW design products
The centerpiece announcement was Mistral for Industrial Engineering, a fully integrated AI stack that combines Mistral's large language models with physics simulation capabilities acquired through its purchase of Emmi AI, completed earlier in May 2026. The platform targets the aerospace, automotive, and semiconductor industries with tools for accelerating product design, validating simulations, and optimizing production. The launch came with headline partnerships. Mistral announced it is working with Airbus across its commercial aircraft, helicopter, defense, and space divisions, implementing AI from initial design through to on-board capabilities. For BMW Group, Mistral is serving as a central partner for what the automaker calls its "Large Industry Model" initiative, focused on multimodal reasoning models for crash simulation and other complex engineering tasks. ASML, already Mistral's largest shareholder, is also an early adopter. Mensch framed the industrial push as addressing a fundamental gap in how AI is currently deployed. "AI is great today at automating tasks for knowledge workers and for people that are doing software engineering," he told the summit audience. "But once you move to all the kind of engineers, well, they are underserved." The reason, he explained, is structural. Simulating the behavior of a wing or a factory process requires compute-intensive physics solvers that can take hours or weeks per design variant.
Traditional simulation creates a bottleneck that makes AI-assisted iteration impractical. Mistral's answer is what it calls "physics AI": data-driven models trained on solver outputs that can predict physical behavior in seconds rather than hours, running on a single GPU. As Mistral's own blog post on the technology acknowledges, physics AI is "not a replacement for first-principles solvers in every regime"; it is a throughput accelerator for the majority of design-loop iterations, with traditional solvers reserved for verification and edge cases. "We now have both the language intelligence and the physical intelligence models, and by combining them together we are building delegation loops that allow us to create better tools, that allow us to create better objects that actually have an impact on the physical world," Mensch said. The ASML partnership offered a concrete illustration. In a video testimonial shown at the summit, an ASML representative described how the company's lithography machines run around the clock at customer fabrication plants, and field service engineers need to diagnose issues as rapidly as possible. By combining ASML's internal engineering expertise with Mistral's models, "we were able to develop a solution that's 120 times faster with a similar accuracy as we have today," the representative said. Another ASML speaker described AI agents acting as "an always-on code reviewer" to catch software defects before they reach customers.
Inside Mistral's €4 billion infrastructure gamble to build Europe's most powerful AI data centers
Mistral's full-stack ambitions extend all the way down to the physical layer. Launched in June 2025, Mistral Compute is a €4 billion ($4.66B USD) investment in data centers in France and Sweden, with a stated roadmap of 200 MW of capacity by 2027 and 1 GW by 2030.
Lacroix described the company's existing 40 MW facility at Bruyères-le-Châtel, south of Paris, which was built in collaboration with Eclairion and has been training models since early 2026. "It's been very interesting to see how we can transfer rigor, which is one of our company values, into down to the hardware layer," he said, describing the process of "fixing compute trays and fixing fibers, allowing us to reach the very best speeds possible on that hardware for training." On Wednesday, Mistral announced a new 10 MW facility at Les Ulis in the Essonne department, also south of Paris, dedicated to inference operations and scheduled to open in Q3 2026. Lacroix also referenced a site in Borlänge, Sweden, planned for development through 2027, which will host NVIDIA's next-generation Vera Rubin GPUs. "One of the benefits for us of owning the hardware layer is also that it lets us be at the very bleeding edge of what infrastructure provides," he told the audience. The infrastructure push is funded in part by the $830 million debt financing round announced in March 2026, which Clay's funding tracker attributes to a consortium of seven banks: Bpifrance, BNP Paribas, Crédit Agricole CIB, HSBC, La Banque Postale, MUFG, and Natixis CIB. This infrastructure ownership is not merely a hedge against GPU scarcity; it is central to Mistral's pitch to security-conscious enterprise and government customers. The company's February 2026 acquisition of serverless platform Koyeb has been integrated into Mistral Studio to support both hosted and on-premises deployments, giving customers a choice between running inference on Mistral's hardware or their own. "More and more, the compute world has been getting supply constrained," Lacroix told the audience. "One of the reasons we've been doing all of this and developing all of this data center capacity is to secure compute capacity not only for ourselves but also for our customers."
Le Chat is dead, long live Vibe: How Mistral's new agent platform takes aim at enterprise productivity
In a consumer-facing rebrand with significant enterprise implications, Mistral announced that Le Chat, its conversational AI assistant launched in February 2024, is being renamed Vibe and reimagined as a unified agent platform for enterprise productivity and software development. "We are transitioning Le Chat to the Vibe family," Lacroix told the audience, explaining that the evolution was driven by the growing power of agentic models, particularly the new Mistral Medium 3.5. As the team used Vibe's coding CLI internally with increasingly complex tasks, "we realized that this really didn't need to be bound to the CLI, it didn't need to be limited to code, and we could do a lot more with it," he said. Vibe encompasses two primary modes. Vibe for Work is a web and mobile agent that connects to enterprise tools (Google Workspace, Outlook, SharePoint, Slack, GitHub) to perform multi-step tasks such as summarizing emails, analyzing spreadsheets, drafting reports, and scheduling recurring workflows. Vibe for Code is a coding agent available through a web interface, a new VS Code extension, and the existing CLI, capable of building features, fixing bugs, refactoring code, and shipping pull requests. Critically, the same underlying agent powers both modes. "When you access it through our web app or through the CLI, you have access to the same connections, the same tools, the same understanding of who you are, what you do, and what you're trying to achieve," Lacroix said. Pricing starts at free for basic use, $14.99 per month for Pro, $24.99 per user per month for Teams, and custom pricing for Enterprise deployments.
Alongside Vibe, Mistral also launched Search Toolkit, an open-source framework for building production search pipelines already in use by shipping giant CMA CGM, which uses it alongside Voxtral to process audio from multiple data sources and return alerts within 15 seconds.
Mistral's model strategy signals a new phase: fewer products, more capabilities per model
Chief Scientist Guillaume Lample used his portion of the keynote to describe a philosophical shift in Mistral's model strategy: consolidation of capabilities into fewer, more versatile models rather than maintaining separate specialized products. Mistral Medium 3.5, the company's current flagship, absorbs capabilities that previously required distinct models. Pixtral (image processing), Magistral (reasoning), and Devstral (coding) have all been deprecated as standalone products, with their capabilities folded natively into Medium 3.5. "Now all our models are natively multimodal," Lample said. "We no longer have Magistral. This model is deprecated, because all our models will natively be doing reasoning." The company is also working on Mistral Large 4, which Lample said would arrive "in a couple of months at most, during the summer," with expanded capabilities in industrial applications such as fluid dynamics, computational chemistry, computer-aided design, and cybersecurity. On the smaller end of the spectrum, Lample highlighted Mistral OCR, a 1-billion-parameter OCR model that can process thousands of pages per minute on a single GPU, and the Voxtral speech model family, which has expanded from automatic speech recognition to include text-to-speech with voice cloning. A "duplex" model for real-time conversational speech is planned for release within months. Lample also made the case for open-weight models becoming more, not less, important in the agentic era.
"Today we are building these agentic workflows, these models are running in the background, they are doing a lot of actions, a lot of tool calls, so they are extremely token-hungry, much more than before," he said. "What we are seeing today is actually a comeback of this small model and the efficient model." Upcoming models will be trained on more than 200 languages, a multilingual strength now powering a partnership with Amazon to improve non-English interactions on Alexa+.
How Mistral's enterprise playbook stacks up against OpenAI and Anthropic
Mistral's positioning stands in sharp contrast to the strategies of its most prominent American rivals. While OpenAI and Anthropic have each attracted hundreds of millions of consumer users and derive significant revenue from subscription products, Mistral has leaned almost entirely into enterprise and government deployments. As TechCrunch reported in March when Mistral announced its Forge customization platform at Nvidia GTC, CEO Mensch has described the company as being "on track to surpass $1 billion in annual recurring revenue," a figure driven largely by corporate clients. The Forge platform, which lets enterprises train custom models on their own data rather than simply fine-tuning or applying retrieval-augmented generation to existing models, represents the foundation on which the company's industry-specific solutions are built. As Mistral's head of product, Elisa Salamanca, told TechCrunch, Forge "lets enterprises and governments customize AI models for their specific needs." Early partners include Ericsson, the European Space Agency, Italian consulting company Reply, and Singapore's DSO and HTX, alongside ASML. Mistral has also built an expanding network of systems integration partnerships to drive enterprise adoption. In February 2026, Accenture and Mistral announced a multi-year strategic collaboration, with Accenture itself becoming a Mistral customer.
Mauro Macchi, Accenture's CEO for Europe, Middle East, and Africa, said at the time that the partnership brings together "sovereign models and the capability to scale technology across industries, geographies and business functions." The BNP Paribas relationship offers the most detailed public case study. In a video testimonial at the summit, a BNP Paribas representative described deploying Mistral's models on-premises to satisfy strict security requirements, developing AI agents for KYC processes that reduced incomplete files from 80% to 10% and compressed processing time from weeks to days. The bank's LLM platform at its Corporate and Institutional Banking division has now rolled out to 65,000 users. Mensch noted the significance: "We started to collaborate in 2023 where we were 15 people, so that was, I think, really a leap of faith at the time." The industrial vertical is also being extended to government clients. Mistral disclosed that it is working with France, Luxembourg, Singapore, Morocco, Greece, and Slovakia to build citizen-facing AI services — from deploying agents that help job-seekers through France Travail to building models that understand Moroccan Darija and Amazigh languages. "We think that AI needs to be specialized and understand structural nuances," Mensch told the audience. "It needs to speak languages as good as it speaks English." The road ahead for Europe's most ambitious AI company For Mistral, Wednesday's announcements amount to a declaration that the company intends to compete not by matching American AI giants on any single dimension, but by assembling capabilities none of them are willing or able to offer in combination: open-weight models, owned infrastructure, on-premises deployment, physics simulation, and deep vertical customization — all under a single roof. The strategy demands execution on multiple fronts simultaneously, each requiring enormous capital and specialized talent. The competition is formidable and accelerating. 
OpenAI has been rapidly expanding its enterprise offerings. Anthropic, backed by billions from Amazon, is building its own corporate AI practice. Google, Microsoft, and Amazon all offer AI platforms deeply integrated with cloud infrastructure that most enterprises already use. But Mistral is wagering that the world's most consequential AI deployments — the ones governing how aircraft get designed, how banks process compliance, how governments interact with citizens — will ultimately go to providers that offer sovereignty over data, models, and compute. "AI is too strategic to be left in the hands of a few," Mensch said, echoing the conviction he described from Mistral's founding three years ago. Three years in, the company that started as a Paris research lab with a handful of employees now trains models in its own data centers, simulates physics for the manufacturers that build the world's planes and cars, and is rewriting its assistant into an agent that can file your pull requests and summarize your inbox in the same conversation. Whether that sprawling ambition coheres into a durable business or stretches Mistral too thin is the €11.7 billion ($13.6B USD) question. The 1,000 people now working there are betting that in enterprise AI, owning the full stack is not a liability — it is the product.

Anthropic today released Claude Opus 4.8, an upgrade to its flagship model that ships at the same price as its predecessor, alongside a dramatically cheaper "fast mode" tier and a new feature that lets the model spawn hundreds of parallel subagents for codebase-scale work. The model is available immediately across Anthropic's surfaces (claude.ai, Claude Code, the API, and Cowork) at unchanged pricing: $5 per million input tokens and $25 per million output tokens. Developers can call it as claude-opus-4-8. The headline efficiency story is fast mode. Anthropic has slashed the price of running Opus 4.8 in fast mode, where the model produces tokens at roughly 2.5x normal speed, to $10 per million input tokens and $50 per million output tokens, down from $30/$150 for Opus 4.7. That's a 3x reduction from the fast-mode pricing of previous models, and brings high-throughput inference within reach of latency-sensitive production workloads. Fast mode is available immediately in Claude Code via the /fast command; API access is gated, with a waitlist at claude.com/fast-mode. In regular mode, Claude Opus 4.8 remains among the more expensive of leading frontier models, but still comes in under chief rival OpenAI's GPT-5.5. 
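To make the fast-mode arithmetic concrete, here is a minimal sketch that prices one job under each tier. The per-million-token prices are the ones quoted above; the workload size (2 million input tokens, 400,000 output tokens) is a made-up assumption for illustration only.

```python
# Prices in USD per million tokens, as quoted in the article.
OPUS_4_7_FAST = {"input": 30.0, "output": 150.0}
OPUS_4_8_FAST = {"input": 10.0, "output": 50.0}
OPUS_4_8_REGULAR = {"input": 5.0, "output": 25.0}


def job_cost(prices, input_tokens, output_tokens):
    """Return the USD cost of one job at the given per-million-token prices."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


# Hypothetical workload, chosen only for illustration.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 400_000

old_fast = job_cost(OPUS_4_7_FAST, INPUT_TOKENS, OUTPUT_TOKENS)
new_fast = job_cost(OPUS_4_8_FAST, INPUT_TOKENS, OUTPUT_TOKENS)
regular = job_cost(OPUS_4_8_REGULAR, INPUT_TOKENS, OUTPUT_TOKENS)

print(f"Opus 4.7 fast mode: ${old_fast:.2f}")
print(f"Opus 4.8 fast mode: ${new_fast:.2f}")
print(f"Opus 4.8 regular:   ${regular:.2f}")
print(f"Fast-mode price cut: {old_fast / new_fast:.1f}x")
print(f"Fast-mode premium over regular: {new_fast / regular:.1f}x")
```

Because both the input and output prices fell by the same factor, the 3x cut holds for any mix of input and output tokens, and fast mode now costs twice the regular rate.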
Frontier AI Model API Pricing Snapshot (USD per million tokens)

Model | Input | Output | Total Cost | Source
MiMo-V2.5 Flash | $0.10 | $0.30 | $0.40 | Xiaomi MiMo
MiniMax M2.7 | $0.30 | $1.20 | $1.50 | MiniMax
Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | Google
MiMo-V2.5 | $0.40 | $2.00 | $2.40 | Xiaomi MiMo
Kimi-K2.6 | $0.95 | $4.00 | $4.95 | Moonshot/Kimi
GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai
Grok 4.3 (low context) | $1.25 | $2.50 | $3.75 | xAI
DeepSeek V4 Pro | $1.74 | $3.48 | $5.22 | DeepSeek
GLM-5.1 | $1.40 | $4.40 | $5.80 | Z.ai
Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic
Grok 4.3 (high context) | $2.50 | $5.00 | $7.50 | xAI
Qwen3.7-Max | $2.50 | $7.50 | $10.00 | Alibaba Cloud
Gemini 3.5 Flash | $1.50 | $9.00 | $10.50 | Google
Gemini 3.1 Pro Preview (≤200K) | $2.00 | $12.00 | $14.00 | Google
GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI
Gemini 3.1 Pro Preview (>200K) | $4.00 | $18.00 | $22.00 | Google
Claude Opus 4.7 | $5.00 | $25.00 | $30.00 | Anthropic
Claude Opus 4.8 | $5.00 | $25.00 | $30.00 | Anthropic
GPT-5.5 | $5.00 | $30.00 | $35.00 | OpenAI

Modest gains over 4.7, but Mythos-class capabilities coming

On benchmarks, Opus 4.8 is a step up rather than a leap. It scores 88.6% on SWE-bench Verified (vs. 87.6% for Opus 4.7), 69.2% on the harder SWE-bench Pro (vs. 64.3%), and 74.6% on Terminal-Bench 2.1 (vs. 66.1%). Anthropic itself characterizes the model as "a modest but tangible improvement on its predecessor." It beats GPT-5.5 regular across at least 12 benchmarks, including most knowledge-work, coding (issue-level), agentic tool-use, and long-context benchmarks. GPT-5.5 wins on terminal/CLI workflows and is roughly tied on web browsing and graduate-level science. The bigger signal sits in Anthropic's internal capability ladder: Opus 4.8 lands between Opus 4.7 and the more capable Claude Mythos Preview, which is currently restricted to a small number of organizations under Project Glasswing for cybersecurity work. Anthropic says it expects to bring "Mythos-class models to all our customers in the coming weeks" once additional cyber safeguards are in place. Several enterprise partners cited material gains. 
Databricks reported that Opus 4.8 unlocks "a step change in agentic reasoning" inside its Genie data agent, at "61% cheaper token cost than Opus 4.7" thanks to multimodal efficiency on PDFs and diagrams. Hebbia cited better citation precision and token efficiency on dense financial filings. Devin-maker Cognition said the release "translates directly into faster capability gains for engineers" and noted Opus 4.8 fixed comment-verbosity and tool-calling issues from 4.7. A computer-use vendor reported 84% on Online-Mind2Web, a jump over both Opus 4.7 and GPT-5.5. Dynamic workflows: hundreds of parallel subagents Alongside the model, Anthropic launched a research preview of dynamic workflows in Claude Code — a feature designed for tasks too large for a single context window. Claude plans the work, spawns hundreds of parallel subagents, then verifies its own outputs before reporting back. Anthropic's example: a codebase-scale migration "across hundreds of thousands of lines of code from kickoff to merge, with the existing test suite as its bar." Dynamic workflows is available on Claude Code's Enterprise, Team, and Max plans. Two smaller additions round out the release: Effort control on claude.ai and Claude Cowork: A new selector lets users dial how much thinking Claude does per response — higher effort spends more tokens for better answers, lower effort responds faster and burns rate limits more slowly. Available on all plans. System entries inside the messages array on the API: Developers can now update Claude's instructions mid-task — adjusting permissions, token budgets, or environment context as an agent runs — without breaking the prompt cache. Honesty, and an "evaluation awareness" caveat Anthropic is leading with honesty as a headline trait. 
The company's alignment team reports Opus 4.8 is "around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked," and that misaligned behavior rates are now "substantially lower than Opus 4.7, and similar to our best-aligned model, Claude Mythos Preview." Indeed, a bar chart released by Anthropic shows how close Opus 4.8 is to the still selectively released Mythos in terms of its misalignment (a lower score is better), coming in at roughly 1.9, down from 2.5 for Opus 4.7 and effectively tied with the more capable, restricted Mythos Preview. The score is based on roughly 2,600 simulated investigation sessions per model. The 244-page system card publicly released by Anthropic also goes into greater detail on specific categories of misalignment — whether a model produces potentially harmful content around "military-grade weapons," "harmful sexual content", "disallowed cyberoffense", and "undermining liberal democracy," and again, across all of them, Opus 4.8 scores markedly better than 4.7 or Sonnet 4.6, and comes quite close to Mythos. Anthropic flags one finding it considers "the most concerning" from training: Opus 4.8 shows a growing tendency to reason explicitly about how its outputs will be graded, including in environments where it wasn't told it was being evaluated. In other words: the model knows it is likely being graded, and produces a response it thinks will earn it a good grade on the test, not one it would necessarily produce if it thought it wasn't being graded. Anthropic says this didn't translate into worse observable behavior — Opus 4.8 shows fewer misleading task-success claims than prior models — but calls it "a concerning trend that could complicate training in the future." Preliminary interpretability work also found unverbalized grader-related reasoning in roughly 5% of training episodes. 
Anthropic ran the model through a one-week live bug bounty for prompt injection — a first — and concluded Opus 4.8 sits between Opus 4.7 and Sonnet 4.6 on robustness, ahead of "all comparable frontier models" tested, with deployed safeguards bringing browser-use attack success rates to near zero. What's next? Anthropic teased two trajectories. Near-term: cheaper models that provide "many of the same capabilities as Opus." Longer-term: the Mythos-class models, which the company says represent higher intelligence than Opus but require stronger cyber safeguards before general release. For now, Opus 4.8 is positioned as the new go-to enterprise and development workhorse — slightly smarter than 4.7, dramatically cheaper to run fast, and noticeably more honest about what it doesn't know.

DeepSeek’s announcement over the weekend that it has made its 75% price cut permanent on its flagship V4 Pro model is a disruptive assault on the capital-heavy business models of Silicon Valley’s frontier labs. The reduction on DeepSeek V4 Pro directly undercuts comparable Western models used as workhorses for enterprise production. It is 7x cheaper on inputs and 17x cheaper on outputs than Anthropic’s Claude Sonnet or OpenAI’s GPT 5.5-Med, while the lightweight DeepSeek V4 Flash undercuts entry-tier alternatives like Claude Haiku by 10x to 25x. The price cuts are enabled by a series of hardware-software innovations, especially around cache, that make DeepSeek's models radically more efficient to run. When hosted natively in China, DeepSeek’s cache-read pricing is a whopping 87x cheaper than Western clouds — a deflationary floor so aggressive that handset giant Xiaomi just moved to match the exact pricing tier for its newly deployed MiMo architecture. DeepSeek V4 Pro’s performance is ranked almost on par with Western frontier models, hitting 80.6% on coding-agent tasks via the SWE-bench Verified leaderboard and an elite reasoning score of 87.5 on the advanced MMLU-Pro technical index. Both V4 Pro and V4 Flash — a hyper-optimized speedy version for developers — are open-weight and issued under a permissive MIT license. This gives enterprises complete flexibility over deployment. This dual-model strategy allows technical teams to route their heaviest, multi-step autonomous agent workloads to the lightning-fast Flash model, while reserving the heavy Pro model for deep reasoning tasks, drastically lowering costs at a time when budget concerns have grown considerably. This also comes at a time when the closed Western labs, in particular OpenAI and Anthropic, face an intense return-on-investment scrutiny for their multi-billion dollar general-purpose hardware infrastructure investments. 
This deflationary collapse will not affect all Silicon Valley labs equally, signaling a permanent bifurcation of the enterprise AI market. While a premium, deterministic tier will endure for mission-critical engineering workflows, the high-volume background agentic layer is being completely commoditized by open weights. Ultimately, it creates a much more dangerous exposure for OpenAI — whose revenue mix relies heavily on general-purpose commodity API streams — than for software-insulated peers like Anthropic. The token cost crisis Uber says it burned through its entire 2026 budget for Claude Code and Cursor in just the first four months of the year; its COO said that the cost related to high token usage by some of its engineers was getting “harder to justify” without better products to show for it. Airbnb's Brian Chesky said last year that while the company uses OpenAI's latest models, they don't rely on them heavily in production — favoring faster, cheaper alternatives like Alibaba's Qwen. And in the latest episode of VentureBeat’s podcast Beyond the Pilot, Pinterest CTO Matt Madrigal confirmed that the company went all-in on an open-source AI strategy, post-training Alibaba’s open Qwen model on the company’s proprietary "taste graph" to drive Pinterest’s assistant — achieving frontier-like quality at a 90% reduction in costs. DeepSeek’s subsequent price drop makes the possibility of such cost differences even greater. [Looking for the blueprint? The token-cost crisis and hardware-software alignment covered in this piece are driving the agenda at VB Transform 2026 on July 14-15. Built specifically for technology executives and AI practitioners deploying autonomous enterprise systems, the event features dedicated sessions on agentic infrastructure architecture, compute density optimization, and real-world post-mortems from engineering leads moving away from closed loops. 
Review the speaker lineup and secure your pass here: https://venturebeat.com/vbtransform2026] Geopolitical headwinds and compliance defenses Widespread enterprise adoption of Chinese models faces massive geopolitical headwinds in the West. For highly regulated U.S. giants in finance, healthcare, and defense, getting comfortable with DeepSeek will take time. Even though an open-weights architecture under an MIT license allows a company to self-host the model locally and prevent active data exfiltration to foreign servers, corporate compliance boards remain deeply paranoid over software supply chain risks, potential hidden backdoors, and the legal threat of sudden federal sanctions. Smaller, more nimble software teams, on the other hand, face far less bureaucratic gridlock. Free from multi-month security review cycles, these fast-moving organizations view the immediate 75% infrastructure savings as a massive competitive edge worth deploying right now. The OpenRouter clearinghouse: mapping global token traffic Take the token usage metrics on OpenRouter, a leading public proxy for which models are the most popular among developers. OpenRouter gives developers an easy way to compare and deploy models, and while its data is by no means a full proxy for real model popularity, it indicates that this structural migration is already taking place within company data pipelines. DeepSeek’s V4 Flash model has captured the No. 1 position on the OpenRouter leaderboard over the past week, surging 48% in token usage. Its advanced counterpart, V4 Pro, sits at No. 6. DeepSeek’s top three models processed nearly 6 trillion tokens on OpenRouter over the past week, giving it a huge lead over other competitors. For example, OpenAI’s premium model, GPT-5.5, has slipped down to No. 15 at 470B tokens. It’s not clear exactly how much of the world’s token traffic is on OpenRouter. Conservative estimates put it at about 3%. 
It does not show the massive amounts of tokens being served by the APIs offered directly to developers by companies like Anthropic, OpenAI and Google. But recent estimates suggest OpenRouter processes between 15 and 40% of each of OpenAI’s and Google’s token usage, and growing, making it a significant indicator of relative trends regardless of the exact percentage it represents. While skeptics often dismiss aggregator traffic as an indie developer signal rather than a reflection of Fortune 500 IT spend, the corporate pipeline reality is shifting. An infrastructure analysis by a leading venture capital firm, Andreessen Horowitz, revealed that enterprise production environments deploy a median of 14 different models simultaneously to price-route workloads and avoid single-vendor lock-in. This structural architecture shift is why OpenRouter recently secured a massive $113 million Series B funding round backed directly by the big enterprise data and software vendors that serve corporate America — including ServiceNow Ventures, Snowflake Ventures, Databricks Ventures, Nvidia's NVentures, and Google’s CapitalG. Stripe also cited OpenRouter’s enterprise customers in its decision to partner closely with the company. That’s why DeepSeek’s surge on this leaderboard is so eye-opening. DeepSeek itself offers an API directly to developers, and so it too delivers more token traffic than what OpenRouter lets on. Beyond chatbots: the rise of multi-step autonomous agents The DeepSeek spike on OpenRouter indicates a deeper structural shift in how automated software architectures consume machine intelligence. Technical teams are moving beyond using trivial, single-turn chatbots, and starting to deploy more sophisticated autonomous agents that persist for hours at a time — recursively looping through codebases and data lakes. Their huge number of tool calls, and continuous rereading of long context histories, means AI token consumption expands exponentially. 
Running these recursive loops on closed, premium Western APIs quickly creates unsustainable infrastructure costs. While corporate tech teams spent last year experimenting freely with early, single-turn prototypes without worrying about budgets, the onset of token-prolific autonomous agents has triggered an enterprise line-item crisis. VentureBeat's Q1 2026 research, which surveyed enterprise users at organizations with over 100 employees (n=65, in the U.S. software, finance and healthcare industries), confirms the shift: “Cost per token or licensing model” jumped from 25.4% in January to 36.7% in March, trailing only raw performance as the primary selection criterion for enterprise buyers. DeepSeek target-optimized its weights for this specific trend of agentic high-token use. It has locked in on a standard input cost of $0.435 per million tokens and a standard output rate of $0.87 per million tokens, alongside a rock-bottom prefix-cached read cost of $0.003625 per million. It's this third cost item — for cache — which is arguably the most significant. “If you measure how all of these agents now are using tokens, 80 to 90% of the tokens are cache-read tokens,” said Val Bercovici, Chief AI Officer at WEKA, a company that provides fast storage for much of this cache. “Which means that [that price] is almost by far the most important price, making the others irrelevant — nearly a rounding error. So what DeepSeek did is not just say we're going to be 5% cheaper, 10% cheaper, 20% cheaper. They're like 87x cheaper on that cache-read price with DeepSeek V4 Pro. So that's really set the industry on notice.” The infrastructure coup: Decoupling HBM from Context DeepSeek's core innovations are around hardware-software alignment. This is where we get a little technical. While Western frontier labs like OpenAI have prioritized performance at all cost, they’ve invested billions into uncompressed "dense" neural architectures. 
DeepSeek, by contrast, has systematically sought to extract maximum intelligence from lower grade hardware, given that they’ve lacked access to Nvidia’s GPUs. By pioneering deep software optimizations as early as its V2 architectures in 2024, the lab engineered a series of four interconnected hardware-software alignment breakthroughs that decoupled a model's operational context from expensive computing overhead: Breakthrough 1: Sequence Dimension Compression via CSA and HCA The transformer architecture that most LLMs use is bottlenecked by something called the Key-Value (KV) cache. As an agent executes long, multi-step sessions, historical context keys clog the high-bandwidth memory (HBM) on the GPU, causing severe latency spikes and an expensive infrastructure tax. DeepSeek resolved this structural bottleneck by introducing a hybrid attention mechanism — documented in the DeepSeek V4 Architecture Paper — that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to cut overall KV-cache usage by a massive 90% across its 1-million-token context window. While traditional models try to keep a unique memory log for every individual word, DeepSeek compresses the rows of its memory cache. CSA acts as a local filter, condensing small windows of text into concise, indexable blocks so the model doesn't sweat the fine-grained details. HCA acts as an aggressive global index, crushing massive spans of text deep within a session's history into high-density summaries. By interleaving these layers, DeepSeek shrinks millions of memory rows down to a fraction of their size. Breakthrough 2: Native memory offloading via Multi-head Latent Attention (MLA) Using something called Multi-head Latent Attention (MLA), DeepSeek strips the active memory footprint of its context history down to a fraction of standard models. It achieves this by running a physical division of labor between hardware chips. 
While traditional models force expensive GPUs to hold a session's entire history, DeepSeek’s architecture keeps only the tiny, highly compressed search index tags (the Keys) on the GPU. Meanwhile, it offloads the heavy data payloads (the Values) entirely into cheaper system memory and local storage tiers. Once the GPU handles the high-speed matching to find relevant data, it calls the values from storage only on an as-needed basis. DeepSeek’s architecture is so different that the inference engines that load an AI model's weights into GPU memory, in order to be ready for prompting, are being stretched. The three most popular engines, Nvidia TensorRT-LLM, SGLang (the UC Berkeley one) and the really popular vLLM, “are all being stretched to keep up with being able to offer it, which is not normal,” explains WEKA's Bercovici. "Every other open model has had some similarity to other open models. This one from DeepSeek is just built different." DeepSeek's software engineering means its massive 1.6-trillion parameter model requires an astonishingly tiny 5.48 GB of HBM to hold a 1-million-token context loop in production, according to calculations by an analyst using hardware modeling benchmarks. For comparison, smaller models using standard architectures consume up to 89 GB of HBM under the exact same context load.

Model Framework / Metric Tier | Active HBM Needed (1M Context) | Context Length Capacity | Multi-Step Cached Economics
DeepSeek V4-Pro (1.6T MoE) | 5.48 GB | 1,000,000 tokens | 80% to 90% of workflow tokens
Qwen3-235B-A22B (GQA Standard) | 89.00 GB | 1,000,000 tokens | Subject to steep hardware tax
GPT-5.5 / Claude 4.7-class (Western Frontier / MoE) | 180+ GB | 1,000,000 tokens | Prohibitive premium infrastructure tax

DeepSeek’s extreme compression of the KV cache down to 5.48 GB of HBM is also a calculated geopolitical strategy to bypass U.S. export bans on top-tier Nvidia GPUs. 
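For a sense of why an uncompressed KV cache becomes a problem at long context, the sketch below applies the textbook sizing formula for a standard grouped-query-attention transformer: two tensors (keys and values) per layer, each holding one vector per KV head per token. This is not DeepSeek's method, and the model shape used here is a hypothetical configuration, so the result is illustrative and is not meant to reproduce the figures in the table above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    """Size of an uncompressed KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
NUM_LAYERS = 64
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # 16-bit values (e.g. FP16 or BF16)

for seq_len in (8_000, 128_000, 1_000_000):
    size = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, seq_len, BYTES_PER_VALUE)
    print(f"{seq_len:>9,} tokens -> {size / 1e9:8.2f} GB")
```

The cache grows linearly with sequence length, so every term in the formula is a lever: fewer stored rows, fewer bytes per value, or moving part of the cache off the GPU all reduce the HBM needed per session.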
By reducing the need for HBM and Nvidia’s CUDA ecosystem, DeepSeek’s software design allows frontier AI to run efficiently on domestic, lower-cost, and unsanctioned Chinese storage tiers like NAND flash, commodity SSDs, and LPDDR memory (produced by domestic giants like YMTC and CXMT). Breakthrough 3: Ultra-Low Footprint Inference via FP4 Quantization-Aware Training (QAT) To keep compute costs low over massive context windows, DeepSeek moved away from the old approach of scanning bulky, uncompressed numbers every time the model searches its memory. Instead, as detailed in the DeepSeek V4 Technical Report, the architecture runs an advanced form of data compression directly on the active pathways it uses to find information during training. This compression slashes memory demands to deliver a 2x hardware speedup, yet it maintains a near-flawless 99.7% accuracy in how the system targets and indexes specific data blocks. This engineering win allows enterprise workflows to process massive, multi-step agent tasks smoothly while keeping an exceptional 83.5% retrieval accuracy on extreme, million-token "needle-in-a-haystack" benchmarks—eliminating performance lags without draining expensive GPU power. Breakthrough 4: Ultra-scale training stability via manifold-constrained hyper-connections (mHC) Training a 1.6-trillion parameter model creates instability risk — causing too many data pathways and processing signals to cascade out of control, crashing the run. DeepSeek resolved this with a framework called Manifold-Constrained Hyper-Connections (mHC), which uses a balancing routine to force the model's internal data tables to always sum to one — a mathematical safety valve that lets complex data move through deep networks without runaway spikes. The infrastructure pivot: rebuilding corporate plumbing DeepSeek’s significant architectural cache efficiency alters the underlying unit economics for the cloud platforms hosting these models. 
On developer aggregators like OpenRouter, where third-party providers routinely offer advanced endpoints at a loss, to capture developer mindshare, this hardware-software decoupling alters the balance sheet. DeepSeek's extremely low cost likely gives DeepSeek a profit, at least when it comes to serving the model in China, Bercovici said. This transformation in provider-side unit economics is mirrored on the buy-side, which shows a structural change happening across enterprise IT budgets. VentureBeat's Q1 2026 AI Infrastructure and Compute tracker survey — which tracks enterprise technology buyers at organizations with over 100 employees (n=53 in January, n=39 in February) across software, financial services, healthcare, and manufacturing sectors — revealed that enterprise adoption of custom, self-managed inference stacks utilizing open-source frameworks like Triton, vLLM, Ray, and Kubernetes surged from 11.3% to 17.9%. Because these software layers allow corporate engineering teams to deploy open-weights architectures natively across their own clusters, they act as an operational escape hatch from closed cloud ecosystems. This software shift is paired with an aggressive hardware migration: enterprise workloads moving to specialized, inference-first AI clouds like CoreWeave, Lambda, and Crusoe grew from 30.2% to 35.9% in the latest survey window. These infrastructure metrics indicate that corporate technology leaders are no longer just prototyping with open alternatives; they are actively laying down the physical plumbing required to host architectures like DeepSeek V4 independently, increasingly pricing away the premium markup of Western API gatekeepers. The strategic split for Western labs This baseline cost reduction could soon fracture the competitive field in Silicon Valley, by rewriting the expectations for labs attempting to yield a return on massive infrastructure investments. For now, though, the Silicon Valley music is unlikely to stop anytime soon. 
Anthropic remains on an extraordinary enterprise trajectory, driven by widespread adoption of Claude Code and its codebase-aware terminal execution. For enterprise engineering teams, paying a premium for Anthropic's deterministic accuracy makes perfect sense for core production software development. Yet even an elite frontier lab scaling at this pace must watch DeepSeek with caution: an open-weights architecture under an MIT license offering near-frontier utility at a 75% cost reduction places downward pricing pressure on the high-volume operational layers of any multi-agent system. The primary structural margin squeeze may land more squarely on OpenAI, despite its aggressive pivot toward a multi-cloud footprint. To support its staggering consumer and API token volumes, OpenAI fundamentally altered its historic seven-year exclusive alliance with Microsoft, unbundling its distribution so it can serve models across Azure, Oracle, AWS, and Google Cloud. Yet this multi-cloud strategy, while providing raw capacity at scale, leaves the company intensely exposed to infrastructure commodity pressure. Unlike Anthropic, which has successfully insulated its margins by embedding its models into premium, high-utility software environments like Claude Code, a massive portion of OpenAI's enterprise revenue relies on high-volume, general-purpose API token streams. To be fair, Western labs have already begun quietly retreating from this territory — aggressively launching deep batch API discounts, prompt caching features, and lightweight entry models to stem the bleed. Yet this tactical retreat only reinforces the structural crisis: Silicon Valley is actively conceding the high-volume commodity layer because they know they cannot defend its margins. When those exact same automated background workflows can be handled natively by highly intelligent open weights like DeepSeek V4, defending a premium price point for raw cloud text completion ceases to be a defensible strategy. 
More significantly, unlike OpenAI or Anthropic, DeepSeek has much less interest in urgently building consumer wrappers or locking developers into subscription frameworks. Instead, DeepSeek is positioned for a longer-term ecosystem play. Supported by a massive state-backed funding round led by China’s "Big Fund," which has pushed the startup's targeted valuation into the $10 billion to $45 billion range, the lab’s more likely objective is to prove the viability of a self-sufficient, independent Chinese AI hardware stack that could one day be worth up to $10 trillion.

Premium deterministic tier (Anthropic / OpenAI / Google)
• Core Codebase Refactoring
• Strict Corporate Compliance & Guardrails
• Mission-Critical Financial/Legal Precision
• High CapEx / R&D Premium Margins

High-volume agentic tier (DeepSeek / open ecosystems)
• Recursive Multi-Agent Loops
• Prefix-Cached Autonomous Tool Swarms
• Massive Real-Time Ingestion Logs
• Bare-Metal / Optimized HBM Economics

The operational division between Western labs and models like DeepSeek V4 Pro is already showing up. Financial company Ramp benchmarked automated cybersecurity agent swarms, and showed that while DeepSeek V4 Pro completely flatlines on the most complex security logic, it achieves a flawless 100% detection rate on high-volume baseline tasks like cloud configuration triage, significantly outperforming OpenAI’s GPT-5.5 (44%). For an enterprise CISO, the strategy is clear: You offload the high-volume token burn of routine background noise to cheap open weights, and reserve premium frontier models strictly for the high-level reasoning required to catch the most sophisticated flaws.

The enterprise verdict

For IT operations directors and data pipeline managers, the choice to migrate to an open architecture like DeepSeek V4-Pro is a smart governance decision. The open model gives companies total architecture control, allowing them to host it on-premises or via any specialized cloud layer they choose. 
Crucially, it provides enterprise infrastructure leads with a strategic operational fallback that closed vendors can’t match: the power to download raw model weights and execute them privately for zero marginal token cost if public cloud pricing or API access conditions change. The assumption that closed frontier labs hold a permanent monopoly on useful enterprise reasoning has collapsed. While engineering directors will continue to pay a premium to protect specialized, deterministic workflows, the financial foundation of the frontier lab model has fundamentally shifted. By diverting the immense, day-to-day token volume of recursive background agents onto highly optimized, open-source clusters, enterprise teams are starving proprietary clouds of their highest-margin fuel. Silicon Valley’s multi-billion dollar token moat didn't just narrow — it was completely drained from the bottom up.

Cloud design software company Figma is officially transforming its AI design assistant, Figma Make, from a prototyping sandbox into a live, visual software editor that connects natively to production codebases. Announced today, the update allows product managers, designers, and non-technical builders to import an existing Git repository directly into the Figma desktop app, visually edit the application's underlying code via the canvas, and push those changes back to engineering through standard GitHub pull requests. Engineering Governance & Licensing Crucially for enterprise deployments, this integration does not bypass established engineering guardrails. Figma Make operates entirely within a standard version control workflow. The platform acts as a local development environment where design changes accumulate as local commits. When a designer is ready to ship, they generate a branch and open a pull request (PR) directly from Figma Make. From an enterprise governance perspective, this means visual AI edits are subject to the exact same continuous integration pipelines, security checks, and code reviews as any traditional engineering commit. Figma Make remains a proprietary commercial service available to Full seats on Figma’s paid plans—ranging from $16 per month for Professional teams up to $90 per month for Enterprise deployments—but it interfaces cleanly with open-source and proprietary Git repositories without imposing new licensing restrictions on the generated code. Breaking the One-Way Barrier When Figma Make originally launched a year ago in May 2025, it successfully bridged the gap between static wireframes and interactive prototypes, but it was structurally isolated from the real-world software lifecycle. It operated on a rigid, one-way push mechanism: users could export an AI-generated project to a brand-new GitHub repository, but at the time, Figma Make could not receive upstream changes or sync with an existing codebase. 
Today's update fundamentally alters that architecture: by enabling a connection to any Git provider, builders no longer have to maintain parallel, out-of-sync environments. Teams can connect a production or sandbox repository, highlight specific UI elements, and use natural language or contextual annotations to prompt Figma’s multi-model AI — which toggles between Anthropic’s Claude 3.7 Sonnet, Claude Opus, and Google’s Gemini models — to write the underlying code. The agent dynamically reads the surrounding code architecture, applies the visual edits, and anchors the generated code to the team's existing design system guidelines. The Competitive Landscape: Figma Make vs. Lovable vs. Claude Design As code generation becomes commoditized by large language models, the competition to own the visual layer of software development has fractured into distinct approaches. Figma Make is no longer competing merely with other design canvases; it is contending with full-stack "vibe coding" platforms like Lovable and LLM-native environments like Anthropic's Claude Design, which just launched last month. Each platform targets a fundamentally different user and objective: Figma Make (Design-First Systems): Operating at $16 to $90 per month for Full seats, Figma Make caters to established product teams that prioritize brand fidelity. It wins on design system adherence, automatically pulling from existing color tokens, typography rules, component variants, and auto-layout structures. It is built for teams that want deep, layer-based canvas manipulation while keeping code ownership strictly within their existing GitHub architecture. Lovable (Code-First Production): Priced at $25 per month for Pro and $50 per month for Business tiers, Lovable functions as a standalone, full-stack application builder. Unlike Figma Make, Lovable relies on a native backend architecture (often paired with databases like Supabase) and a slider-driven UI styling approach. 
It enforces a strict automatic two-way sync with GitHub, treating the repository as the ultimate source of truth, and is optimized for solo developers or lean startup teams looking to launch production-ready SaaS apps from scratch without maintaining heavy vector design files. Claude Design (AI-Native Prototyping): Anthropic’s built-in canvas environment is accessible to users on Claude Pro ($20 per month) or Max ($100–$200 per month) subscriptions. While lacking the granular vector control of Figma Make or the full-stack database integrations of Lovable, Claude Design is ideal for product managers and engineers who need to generate quick, functional UI prototypes and immediately hand them off to coding agents like Claude Code. However, heavy iterative design sprints can quickly burn through Anthropic's strict token limits, making it less viable as a primary design hub. Navigating the "Vibe Coding" Era The emergence of two-way repo synchronization crystallizes the enterprise reality of the "vibe coding" era: the primary bottleneck in product development is shifting from raw engineering bandwidth to architectural governance and design intent. Technical leaders navigating this fast-moving landscape must look past the initial marketing hype to understand exactly who stands to benefit from this new paradigm. Figma Make is not a general-purpose, standalone application builder; instead, it is a highly specialized frontend optimization tool designed explicitly for established, mid-to-large cross-functional product teams. Figma explicitly notes in its documentation that designers who already possess access rights to their company’s existing corporate codebase are currently the best suited for this functionality. Consequently, enterprise leaders should consider adopting Figma Make if they have a mature engineering organization with a well-defined design system, rigid repository guardrails, and a desire to unlock faster iteration cycles. 
It directly addresses the technical friction felt by the 45% of designers and 59% of product managers who already contribute to code on a regular basis but prefer to operate from a visual canvas rather than a command-line terminal. By turning the canvas into a local development environment, it allows these non-technical builders to execute visual layouts, typography tweaks, and color changes independently, offloading tedious frontend implementation from core engineers. Conversely, organizations or teams launching zero-to-one skunkworks projects, or solo developers building lightweight SaaS products from scratch, will find far better utility in a code-first, full-stack platform like Lovable. Because Lovable natively orchestrates backend logic and database integrations like Supabase, it excels at spinning up functional applications rapidly without requiring a pre-existing vector infrastructure or a legacy codebase to pull from. Meanwhile, individual product managers or software engineers seeking rapid, text-prompt-driven UI wireframing without rigid design system constraints are better served by the immediacy of Claude Design. For the enterprise leader wary of overcommitting capital or locking their custom builds into proprietary AI backends, the wisest path forward is compartmentalization. Figma Make’s reliance on standard Git workflows—relying on local commits, isolated branches, and mandatory engineering pull request reviews—means it enforces the exact same security and code quality standards required for enterprise stability. By selecting Figma Make as a targeted frontend bridge for existing systems, and utilizing platforms like Lovable for external, greenfield prototyping, leaders can safely adopt productive new AI tooling without risking their core architectural integrity. 
Why Figma Needs to Keep Innovating

Figma completed its initial public offering on July 31, 2025, pricing its shares at $33 after immense institutional demand left the deal 40 times oversubscribed. The stock immediately skyrocketed 250% to hit an intraday high of $115.50 on its first trading day. However, in the subsequent months, Figma's stock (NYSE: FIG) experienced a severe correction, crashing 81% from its peak to trade in the $21 to $22 range by May 2026, well below its IPO price. This collapse reduced its market capitalization to approximately $11.3 billion. Financial analysts attribute this aggressive re-rating to structural IPO pricing mechanics, a low float, and the broader "software apocalypse," as investors rapidly rotate capital out of traditional SaaS products and into AI-native workflows. The stakes for Figma's current positioning are existential. As enterprises increasingly shift their software spending toward generative AI models and localized coding agents like Claude Design, Claude Code, and OpenAI Codex, traditional "vanilla" cloud design software looks increasingly commoditized. Figma Make represents the company's critical counter-offensive in this era of "vibe coding." To regain its premium valuation, Figma must prove to Wall Street that its platform is not merely a static vector canvas that AI tools can easily bypass, but an indispensable, live orchestration layer where human intent, enterprise design systems, and AI-generated production code seamlessly integrate. With the new Figma Make two-way GitHub integration and its governance model, the company appears well on its way to showing the doubters it has a path forward in the AI-powered "vibe coding" development era.

When Miro’s data team pointed AI agents directly at its Snowflake environment, the agents got the wrong answer more than 65% of the time. The problem wasn’t the model — it was context. With more than 10,000 tables and no semantic layer to guide routing, the agents had no way to know which data assets matched which business questions. DataHub is releasing a context intelligence layer Thursday that mines existing SQL query history to build a semantic index — and exposes it to agents via MCP, LangChain, Google’s Agent Development Kit and CrewAI. The company calls it Context Intelligence, and it’s built on the same query-log infrastructure DataHub has used for lineage tracking in production deployments worldwide. The company was founded by the team that built DataHub as an open source project at LinkedIn, where co-founder and CTO Shirshanka Das led data infrastructure for nearly 11 years. The open source project now has more than 15,000 contributors and 3,000 production deployments worldwide. "For the first time, enterprises can turn years of analyst query history into a living, retrievable knowledge base where agents stop hallucinating joins because they have access to the joins that have worked before, validated by the people who ran them," Shirshanka Das, co-founder and CTO of DataHub, told VentureBeat in an exclusive interview. Why query history beats raw schema for agent routing DataHub began as a metadata management project at LinkedIn, built to solve two problems simultaneously: making data easy to find and use across the organization while ensuring it was only used for the right reasons. Das open-sourced the project in early 2020 after nearly six years of internal development. The primary use case in the years since has been lineage — understanding how data flows from operational systems through streaming infrastructure into warehouses and out to business tools. 
Regulatory compliance audits, operational triage and new engineer onboarding all depend on that lineage graph. Postgres is the most-connected source in the DataHub deployment base globally, followed by MySQL, Oracle and the major cloud warehouses including Snowflake and Google BigQuery. The platform supports more than 100 connected metadata sources. That deployed base matters for what DataHub is releasing. The query log extraction and SQL parsing capabilities powering Context Intelligence were developed across years of production deployment, not built for this release. The same infrastructure now serves agents querying a semantic index at runtime. "The consumption layer has changed from humans to agents," Das said. Context Intelligence mines validated query history, not raw logs Context Intelligence is a new capability layer built on top of DataHub's existing open source metadata foundation. The open source platform has spent years extracting and parsing query logs from connected warehouses for lineage tracking. That same infrastructure is what Context Intelligence draws on to build the semantic index. The capability is new. The underlying plumbing is not. Filtering for signal. Warehouse query logs contain too much noise to use directly. DataHub's engine filters for what Das describes as the "golden queries," meaning high-quality analyst queries and scheduled pipelines that represent proven business logic. Inverting SQL into semantic definitions. The engine extracts patterns from those queries and translates them into structured text definitions DataHub calls semantic anchors. Those anchors form the retrieval basis agents draw on before generating SQL. "You can almost think of it as inverting text to SQL," Das said. Human validation on top. Context Hub lets domain experts review AI-proposed context, resolve conflicting definitions and simulate the impact of changes before publishing. 
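As a rough illustration of the filter-then-extract idea described above, the sketch below keeps only queries that succeeded and ran repeatedly, then tallies which tables those queries reference together. It is not DataHub's engine, whose internals are not detailed here: the log record shape, the `min_runs` threshold, the regex-based table extraction and the function name `golden_join_counts` are all assumptions made for illustration, and a real system would use a proper SQL parser.

```python
import re
from collections import Counter

# Toy table extractor: grabs the identifier after FROM or JOIN.
# Assumption: simple SQL without subqueries, CTEs or quoted names.
TABLE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def golden_join_counts(query_log, min_runs=5):
    """Tally table pairs that co-occur in frequently run, successful queries.

    Each log entry is a dict with hypothetical keys:
    "sql" (str), "runs" (int), "succeeded" (bool).
    """
    counts = Counter()
    for entry in query_log:
        # Filter for signal: drop failed or rarely run queries.
        if not entry["succeeded"] or entry["runs"] < min_runs:
            continue
        tables = sorted({name.lower() for name in TABLE_RE.findall(entry["sql"])})
        # Record every unordered pair of tables used together.
        for i, left in enumerate(tables):
            for right in tables[i + 1:]:
                counts[(left, right)] += entry["runs"]
    return counts
```

Under these assumptions, the highest-count pairs are candidates for the kind of proven join an agent could retrieve before writing SQL, with a human reviewer still deciding what gets published.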
DataHub surfaces cases where different teams calculate the same metric differently and raises them for human resolution.

How Miro got AI agents working across 10,000 Snowflake tables

Miro, the digital collaboration platform, was already using DataHub for lineage tracking and impact analysis when it began testing analytics agents against its Snowflake environment. Ronald Angel, product manager for the data platform at Miro, told VentureBeat that the scale of the data estate became the problem immediately. Sending natural language queries directly to the Snowflake MCP produced incorrect answers more than 65% of the time. Exposing more than 10,000 tables directly to agents caused too much confusion for reliable routing. Miro addressed the problem by organizing data into well-defined data products that constrain what agents can see rather than exposing raw schema. The production architecture runs from user requests submitted via Claude Chat or Claude Cowork through a context layer where DataHub's MCP maps natural language to the appropriate data assets, then hands off to Snowflake's MCP for SQL generation. Angel said the context layer pulls in metadata, entity relationships, query history and business intent for each Snowflake table, specifically what business question each entity is designed to answer. Those semantic signals allow the agent to identify the correct database entities before writing SQL rather than guessing from schema alone.

Pinecone, Oracle, Redis, Microsoft: how DataHub fits the context stack

Data vendors including Pinecone, Oracle and Redis all have contextual memory capabilities. On the platform side, Microsoft has built out its Fabric IQ as a semantic layer for context. DataHub’s argument isn’t feature parity. The company is positioning the context layer as platform-neutral, provisioning context into existing endpoints like Snowflake semantic views and Microsoft Fabric IQ rather than replacing them.
"A lot of times people want to be platform neutral when it comes to their context layer," Das said. Kevin Petrie, an analyst at BARC, told VentureBeat that he sees DataHub's ability to integrate diverse metadata for both structured and unstructured objects, including documents and images, as differentiating them in the market. "Many other vendors are more focused on structured tables, which provide trusted facts but often lack the rich context of text objects," he said. Michael Ni, VP and principal analyst at Constellation Research, told VentureBeat that for him what stands out about DataHub’s context layer is its support of the shift from passive cataloging to continuously refreshed semantic intelligence. Ni described the competition for context as the next major platform war, arguing that whoever controls context at runtime controls the decision layer for data, agents, workflows and decisions. "Buyers need to be careful, since many vendors only support a portion of the full context capabilities required for AI and agentic solutions," Ni said. "Buyers should be clear on their context management requirements, as vector memory isn't business meaning, business meaning isn't governance, and governance isn't execution."

Presented by Equinix Digital systems are central to economic resilience. But the governance models supporting them were designed for a bygone era, when systems were smaller, often centralized, and rarely crossing multiple jurisdictions. This structural mismatch is driving the realization across boardrooms and governments that data sovereignty is not only core to critical infrastructure, but its implications determine the trajectory of the global economy. The scale of change is forcing the issue. IDC projects the global datasphere will continue to grow at an extraordinary pace, driven by AI workloads, real-time analytics, and always-on digital services. This is placing unprecedented demands on data center capacity, interconnection density, and operational reliability, a trend highlighted by both McKinsey and Goldman Sachs last year. More data means demand for more infrastructure. Infrastructure expansion means more interconnected systems. And more interconnected systems mean greater exposure when control is unclear. That is why sovereignty is now coming into focus for nation states and private sector actors alike. It’s more than an abstract legal concept. There are practical questions around who has the authority when systems span countries, clouds, and ecosystems. Control determines resilience in a fragmented world Infrastructure resilience has always depended on clarity. Power grids work because ownership, responsibility, and control are well understood by stakeholders and the public. The same principle should apply to digital infrastructure, even if the underlying systems look much different. Data sovereignty aligns authority with accountability. Organizations retain decision-making power over where data lives, how it moves, who can access it, and which technologies are allowed to touch it. When something breaks or regulators ask difficult questions, there is no ambiguity about who is responsible. 
Gartner’s Top Strategic Technology Trends for 2026 underscores this shift by emphasizing that modern infrastructure is inseparable from governance, resilience, and digital trust. Treating sovereignty as a bolt-on compliance requirement rather than an architectural principle is proving insufficient. The challenge, of course, is that modern enterprises cannot simply look inward and ignore macro circumstances. Scale, performance, and innovation depend on participation in global digital ecosystems. A false paradox: scale vs. authority For years, organizations were told they had to choose. Either maintain tight control and accept limited connectivity, or embrace global platforms and accept reduced authority over data flows and infrastructure decisions. Neither holds up under real-world conditions. Financial services firms require low-latency access to markets across regions, all while adhering to strict regulatory expectations. Healthcare organizations must have secure data control without walling themselves off from cloud-based analytics and AI innovation. Governments demand digital services that scale while remaining auditable and transparent. This tension is why simplistic sovereignty narratives fail to pass muster. Sovereignty is more nuanced than isolation: the concept means control within connection. The distinction is becoming clearer as hyperscalers, regulators, and enterprises sharpen their approaches. Public disclosures from leading hyperscalers demonstrate how sovereign cloud offerings attempt to address data residency and operational separation. However, most large organizations recognize long-term control cannot rely on any single provider or managed platform alone. A distinction of responsibility leads to an industry inflection point The infrastructure strategies showing the most durability share a common theme: clean separation between infrastructure operations and data authority. 
In this model, providers are responsible for running highly resilient facilities, physical security, power, cooling, and high-performance interconnection at scale. Customers are fully in control of their data, applications, security posture, and governance decisions. Authority stays with the party that owns the risk. This is where neutral infrastructure platforms like Equinix come in, not as a cloud service provider, but as an interconnected foundation where customers deploy and control their own environments while accessing a broad ecosystem of networks, clouds, and partners. Equinix views sovereignty as customer-controlled by design, with clear boundaries around possession, custody, and control. That approach is in high demand from regulated industries. The benefits show up in auditability, legal clarity, and operational confidence. Trust comes with verification. When responsibilities are clear, compliance is verifiable rather than assumed. Ambiguity is unacceptably expensive for AI workloads Artificial intelligence accelerates these dynamics. AI systems are data-hungry and regulation-sensitive, a combination that leaves little room for governance shortcuts. Financial institutions like Bank of America and Morgan Stanley have forecasted AI-driven data center growth will place new pressure on infrastructure planning, energy availability, and geographic distribution. Simultaneously, AI models need to operate close to sensitive data, rather than exporting that data across borders for centralized processing. Without a clear sovereignty framework, organizations face difficult compromises. But with one, they achieve flexibility. Models move to data. Data remains controlled. Innovation accelerates without triggering regulatory alarms. That balance is emerging as a competitive differentiator. Infrastructure in 2026 looks different, and expectations are reset The critical infrastructure powering the digital economy goes beyond physical assets. 
It now includes governance models, legal posture, and control structures that determine how systems behave under pressure. European Commission updates to data sovereignty and digital strategy frameworks reflect this, as governments increasingly treat data governance as a matter of economic and national resilience. Deloitte’s digital sovereignty research for 2026 echoes that theme across global enterprises, especially those operating in multiple regulatory regimes. The organizations adapting fastest are not retreating from global connectivity. Rather, they are designing for it and embedding sovereignty as an architectural requirement. As enterprises navigate more fragmented regulatory environments, the ability to maintain jurisdictional control across interconnected digital ecosystems is a baseline infrastructure expectation rather than a specialized requirement. That expectation is now shaping how infrastructure is built. Enterprises increasingly require network-level sovereignty enforcement that operates across hybrid multicloud environments automatically, including during outages, failovers, and congestion events where data can cross borders invisibly. Capabilities such as Equinix Fabric Geo Zones reflect that demand, delivering the first network-level, multicloud sovereignty enforcement layer built natively into the interconnection fabric itself. The rules of infrastructure are being rewritten. Data sovereignty is the architectural foundation that resilient, globally connected enterprises demand. Organizations that treat it as such will be better equipped to operate, compete, and withstand pressure. Those that do not will find the status quo ambiguity increasingly costly. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Among the many Chinese AI companies and laboratories vying for market share and attention (no pun intended) on the global marketplace, MiniMax stands out for its commitment to providing frontier-level intelligence across a range of modalities, including text, coding, and video (through its Hailuo model series) — often under permissive, enterprise-friendly, standard open source licenses. Now, MiniMax is again raising the eyebrows of AI power users and developers around the world by releasing a new, in-depth technical report on the making of its popular M2 series of language models (M2, M2.5, and M2.7) shedding light on its numerous engineering innovations and clever approaches — while the company and its leaders also teased a whole new sparse attention approach for its upcoming MiniMax M3 series of models, which it says yields up to 15.6 times faster decoding (or LLM response) speed at long contexts (a million tokens) by adopting a custom sub-quadratic framework. In so doing, MiniMax has designed M3 to make ultra-long-context AI agent deployment economically viable. The M2 report is noteworthy for any enterprise working with AI models, and especially those looking to fine-tune and train their own in-house. After all, MiniMax's M2 series models often achieved top benchmarks in the world for open source AI performance when they were released. While the title has since been eclipsed by several other Chinese labs including DeepSeek and Xiaomi, MiniMax's new report offers a blueprint that can be used to improve AI model and agent performance by enterprises around the world. As Adina Yakup of Hugging Face observed on X, "Beyond the benchmarks, they’ve done some really solid work on MoE efficiency and agent oriented design. Excited to see where M3 goes next!" The attention dilemma The core technical architecture of the M2 series relies on a sparse Mixture-of-Experts (MoE) decoder-only Transformer layout used by numerous other state-of-the-art LLMs. 
The foundational backbone houses 229.9 billion total parameters, yet maintains a remarkably lean operational footprint by activating just 9.8 billion parameters per token across 256 fine-grained experts. To optimize routing and avoid standard load-balancing issues, MiniMax implemented sigmoid gating paired with learnable, expert-specific bias terms, heavily reducing reliance on restrictive auxiliary losses. The most definitive engineering decision documented in the M2 paper was the strict adherence to full multi-head attention with Grouped Query Attention (GQA) across all 62 layers. In large language models, "quadratic scaling" refers to the computationally expensive reality of standard full attention mechanisms, where every token in a sequence must mathematically connect to every other token. To use a real-world analogy, it is akin to attending a networking event and being forced to have a deep conversation with every single person in the room while simultaneously monitoring all other ongoing conversations. While this approach yields incredibly thorough context, the processing power and memory required grow with the square of the input length, creating a severe hardware bottleneck as models attempt to ingest hundreds of thousands of words.

The problem with sub-quadratic scaling

"Sub-quadratic" scaling introduces architectural shortcuts designed to bypass this quadratic computational load. Instead of mapping every possible connection, sub-quadratic methods, such as Sliding Window Attention or compressed linear attention, might only analyze a localized window of nearby words or generate a compressed summary of the broader text. These efficient methods drastically reduce hardware costs and allow models to process massive documents at high speeds, but they historically introduce severe trade-offs in accuracy, often causing the AI to miss the "big picture" or lose track of distant context.
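To make the difference in growth rates concrete, here is a small back-of-the-envelope sketch that counts how many query-key pairs each approach scores. It assumes causal attention (each token looks only at itself and earlier tokens, as in decoder-only models) and uses a hypothetical window of 4,096 tokens; neither figure comes from the M2 report, and the function names are illustrative.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Causal full attention: token i scores itself and all i earlier tokens."""
    return n_tokens * (n_tokens + 1) // 2


def sliding_window_pairs(n_tokens: int, window: int) -> int:
    """Causal sliding window: each token scores at most `window` recent tokens."""
    if n_tokens <= window:
        return full_attention_pairs(n_tokens)
    # The first `window` tokens behave like full attention; every later
    # token scores exactly `window` keys.
    return full_attention_pairs(window) + (n_tokens - window) * window


if __name__ == "__main__":
    for n in (1_000, 32_000, 128_000, 1_000_000):
        full = full_attention_pairs(n)
        windowed = sliding_window_pairs(n, window=4_096)  # hypothetical window size
        print(f"{n:>9} tokens: full={full:.3e}  windowed={windowed:.3e}  "
              f"ratio={full / windowed:.1f}x")
```

Doubling the input roughly quadruples the full-attention count, while the windowed count only about doubles once the input is longer than the window. The savings come from never scoring distant tokens, which is also where the accuracy trade-off originates.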
This mathematical dilemma defines the architectural evolution from MiniMax's M2 to its upcoming M3 series. During M2's development, researchers rigorously tested sub-quadratic shortcuts but found they crippled the model's "multi-hop reasoning"—its ability to connect disparate clues across a long document—forcing the team to absorb the massive computational cost of full quadratic attention to maintain frontier-level intelligence. Indeed, they aggressively benchmarked efficient attention alternatives during pre-training but intentionally threw them out. They experimented extensively with hybrid setups, interleaving full attention with sub-quadratic architectures like Lightning Attention or hybrid Sliding Window Attention (SWA) configurations. The empirical results were definitive: at a larger scale, linear and windowed attention variants exhibited severe reasoning deficits. On evaluations exceeding 32K context windows, SWA variants performed significantly worse than full attention, dropping from a baseline score of 90.0 to 72.0 on the RULER 128K complex word extraction task. Sub-quadratic configurations proved prone to memory-bound constraints during training, lacked native prefix caching support, and failed to smoothly align with Multi-Token Prediction (MTP) modules used for speculative decoding. Full attention was deemed necessary to preserve multi-hop reasoning capability. However, recognizing that physical hardware limits cannot sustain quadratic scaling indefinitely, MiniMax is designing the M3 series around a novel sub-quadratic framework to finally deliver both high-speed processing and uncompromised reasoning. MiniMax Sparse Attention (MSA) and sub-quadratic scaling incoming The upcoming MiniMax-M3 breaks away from the compute-heavy constraints of its predecessor. As disclosed by MiniMax’s engineering team under the banner "Something BIG is coming," M3 introduces "MiniMax Sparse Attention" (MSA). 
Unlike DeepSeek’s Multi-head Latent Attention (MLA), which compresses keys and values into a low-dimensional latent space, MSA operates on a standard GQA backbone but utilizes block-level selection on real, uncompressed Key-Values. Elie Bakouch at AI training infrastructure and platform lab Prime Intellect posted on X noting that the main changes feature "block level selection like in CSA but attention is done on the real KV, not in [compressed space]." This solves the precision loss and prefix-caching obstacles noted in the M2 paper. By filtering and selecting block-level sequences dynamically, MSA delivers an architectural leap: early hardware profiling indicates a 9.7x speedup in prefilling latency and a massive 15.6x speedup during decoding phases at a 1-million token sequence length compared to the full-attention M2 architecture. To understand why a speedup in the "decoding phase" is so significant, it helps to break down how an AI actually reads and writes information. When you interact with an AI, the processing happens in two distinct steps: prefilling and decoding. When you hand an AI a prompt—whether it’s a short sentence or a massive 1,000-page document—it processes that entire chunk of text all at once in parallel, known as "prefilling." It essentially "reads" the input in one big gulp to build its initial understanding and establish context. In order to generate a response, the AI must enter a "decoding phase." To predict the first word of its response, it looks at the prompt. To predict the second word, it has to look at the prompt plus the first word. To predict the hundredth word, it must recalculate the context of the prompt and the previous 99 words it just wrote. So the response actually becomes harder to generate as it goes on, with the end requiring a full review of all prior parts. 
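The block-level selection idea described above can be sketched for a single query in plain Python. This is a toy under stated assumptions, not MiniMax's MSA: the rule used to score blocks (dot product of the query with each block's mean key), the `block_size` and `top_k` parameters and both function names are hypothetical. What it does mirror is the property quoted above: once blocks are chosen, attention runs on the real, uncompressed keys and values.

```python
import math


def softmax_attention(q, keys, values):
    """Exact softmax attention of one query vector over the given K/V rows."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Select the top_k key blocks for this query, then attend over them exactly."""
    starts = range(0, len(keys), block_size)

    def block_score(start):
        block = keys[start:start + block_size]
        # Hypothetical scoring rule: compare q with the block's mean key.
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        return sum(qi * mi for qi, mi in zip(q, mean_key))

    chosen = sorted(sorted(starts, key=block_score, reverse=True)[:top_k])
    rows = [i for s in chosen for i in range(s, min(s + block_size, len(keys)))]
    # Attention is computed on the original K/V rows, not a compressed summary.
    return softmax_attention(q, [keys[i] for i in rows], [values[i] for i in rows])
```

With `top_k` equal to the number of blocks, this reduces to full attention. With a smaller `top_k`, each decoding step reads only `top_k * block_size` key-value rows instead of the whole context, which is the general source of the long-context savings.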
Imagine reading a dense legal brief (prefilling) and then being forced to write a summary report where, before writing every single new word, you must rapidly reread the entire brief plus everything you've written so far to ensure your next word makes sense (decoding). Because the AI must constantly and repetitively look backward to generate each new step forward, the decoding phase is the most severe computational bottleneck in generating text. It is why AI models often type out their answers word-by-word, and why they slow down significantly as conversations get longer. So when MiniMax reports a 15.6x speedup during the decoding phase at a 1-million-token sequence length, it means the model has found a structural shortcut that lets it generate its answer, token by token, nearly 16 times faster. It directly targets the bottleneck that normally makes AI chatbots freeze or stutter when handling massive amounts of information.

The evolution of the MiniMax M series and the creation of 'Forge'

On a product level, MiniMax has consistently evolved its models from simple text generation interfaces into autonomous workers. The M2 series pioneered an "interleaved thinking" protocol where the model alternates between natural-language planning traces and explicit tool invocations inside a single trajectory. Rather than dropping the intermediate chain-of-thought blocks between execution turns, M2 appends the full thinking history directly into the conversation context. This planning persistence prevents state drift, allowing the model to recover gracefully from runtime errors and revise its strategies based on environment feedback. To train these long-horizon workflows, MiniMax built "Forge," a scalable agent-native reinforcement learning system. Forge decouples execution into three independent modules: the Agent Side, the middleware abstraction layer (Gateway Server and Data Pool), and the Training/Inference engines.
As MiniMax engineer Olive Song explained on the ThursdAI podcast, "What we realized is that there's a lot of potential with a small model like this if we train reinforcement learning on it with a large amount of environments and agents... But it's not a very easy thing to do," adding that this environmental training was where the team spent a significant portion of their development timeline.

To absorb the extreme trajectory-length variance common in multi-step agent environments, Forge implements two vital engineering solutions:

Windowed FIFO Scheduling: A training scheduler that maps a sliding window over the generation queue. It permits greedy, high-throughput fetching of completed tasks within the window to prevent cluster idle time, while strictly enforcing FIFO boundaries to maintain distributional stability and avoid gradient oscillation.

Prefix Tree Merging: An optimization that restructures batch training into tree computation. Completions sharing identical conversation prefixes are calculated exactly once in the forward pass before branching. This eliminates redundant calculations, generating up to a 40x training speedup with zero approximation error.

This reinforcement infrastructure directly spawned the M2.7 checkpoint, moving the series toward "self-evolution". Operating inside an automated agent harness, M2.7 functions as an independent machine learning engineer. The model profiles its own active training runs, diagnoses anomalies, reads logs, and automatically modifies its own codebase and configurations. According to MiniMax, M2.7 successfully handled between 30% and 50% of its own development workflow. On OpenAI’s rigorous MLE Bench Lite suite, which tests autonomous ML research capability, M2.7 achieved a 66.6% medal rate across independent 24-hour trials, effectively tying Google’s closed-weight Gemini 3.1 Pro.
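The Windowed FIFO idea described above can be sketched in a few lines of Python. This is one reading of the description, not MiniMax's Forge code: the class name, its fields and the rule that the window slides only once the oldest task has been consumed are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WindowedFifoScheduler:
    """Toy scheduler: fetch finished tasks greedily, but only inside a sliding window."""
    num_tasks: int
    window: int
    head: int = 0                                  # oldest task not yet consumed
    done: set = field(default_factory=set)         # tasks whose generation finished
    consumed: set = field(default_factory=set)     # tasks already handed to training

    def mark_done(self, task_id: int) -> None:
        self.done.add(task_id)

    def fetch(self, batch_size: int) -> list:
        """Return up to batch_size finished tasks from [head, head + window)."""
        end = min(self.head + self.window, self.num_tasks)
        batch = [t for t in range(self.head, end)
                 if t in self.done and t not in self.consumed][:batch_size]
        self.consumed.update(batch)
        # Assumed FIFO boundary: the window slides only past consumed head tasks.
        while self.head < self.num_tasks and self.head in self.consumed:
            self.head += 1
        return batch


if __name__ == "__main__":
    sched = WindowedFifoScheduler(num_tasks=8, window=4)
    for task in (1, 3, 6):          # tasks finish out of order
        sched.mark_done(task)
    print(sched.fetch(4))           # task 6 is finished but lies outside the window
```

Inside the window, finished work is picked up in any order so hardware does not sit idle; a fast task far down the queue still has to wait until the slow tasks ahead of it have been consumed.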
The continuous cadence from M2 to M2.5, which famously completed 30% of internal tasks and 80% of newly committed code at MiniMax HQ, underlines a broader vision. As the MiniMax team noted during that phase of deployment, "we believe that M2.5 provides virtually limitless possibilities for the development and operation of agents in the economy." With the technical report codifying the M2 generation's successes and the MSA tech blog on the horizon, MiniMax is signaling that the next frontier of AI is explicitly about translating a mini-activation footprint into maximum real-world intelligence.
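For readers who want a feel for the Prefix Tree Merging optimization mentioned earlier, the following Python sketch counts how many token positions a forward pass would have to process with and without sharing identical prefixes. It is a counting illustration only: the function names and the toy token IDs are assumptions, and a real implementation would also need a tree-aware attention mask, which is not shown.

```python
def tokens_without_merging(sequences):
    """Every sequence is processed independently, prefix included."""
    return sum(len(seq) for seq in sequences)


def tokens_with_prefix_tree(sequences):
    """Count unique trie nodes: each shared prefix token is processed once."""
    root = {}
    nodes = 0
    for seq in sequences:
        node = root
        for tok in seq:
            if tok not in node:
                node[tok] = {}      # first time this token appears at this position
                nodes += 1
            node = node[tok]
    return nodes


if __name__ == "__main__":
    prefix = list(range(100))                             # shared conversation history
    rollouts = [prefix + [1000 + i] for i in range(4)]    # four different completions
    print(tokens_without_merging(rollouts))   # 404
    print(tokens_with_prefix_tree(rollouts))  # 104
```

The saving grows with the length of the shared history and the number of completions branching from it, which is why the technique suits agent training, where many rollouts start from the same conversation.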

Merck is using AI agents to cut drug discovery cycles by a third and ship compliant marketing materials up to 80% faster — but VP of Digital Platforms Sean Finnerty says the only reason it's working is because they built the infrastructure first. And the pharmaceutical manufacturer is seeing promising early results: AI is generating marketing drafts that are “99% right” when it comes to compliance, shrinking review cycles from months to days and accelerating delivery by 70% to 80%. In the company’s medical research, meanwhile, one AI-assisted discovery cycle was reduced by 33%. Still, agentic AI only works if companies first build the underlying “plumbing,” Finnerty said of digital platforms and services at a recent AI Impact Series event. “If we do one-offs, we're gonna end up with thousands and thousands of things that are ultimately just gonna be debt that we'll have to deal with later,” he said. “And that's gonna be a drag on any further innovation.” Starting with the plumbing Merck’s plumbing-first strategy comes from lessons learned during the early days of cloud in the 2010s “when nobody knew what the heck was going on,” Finnerty said. Getting the cloud right meant building from the ground up; at Merck, that infrastructure now supports 2,500 AWS accounts, numerous Microsoft Azure subscriptions, and new Google Cloud Platform (GCP) integrations. “AI is gonna be the same exact thing,” Finnerty said. “We're going to have thousands and thousands of agents.” The questions then pile up: How do you register them? How do you secure them? How do you ensure they're connected to the right tools, and have access to the right data and the right context? Context delivery is also critical; Merck works with three hyperscalers and has forty-seven edge locations and hundreds of databases. “Many, many petabytes” of structured and unstructured data are stored in Oracle databases, SQL databases, Excel spreadsheets, phone transcripts, and other repositories, Finnerty said. 
His team is building scaffolding to deliver meaningful context in various situations, he explained. Data must be organized and ingested into various platforms, because “there’s no one solution to solve every single problem.” Sometimes it's Databricks, other times it's Amazon Redshift, “plus four other things.” The goal is: “Let's make that easy and frictionless for people to do, and secure it, and make sure it's well integrated with MCP [model context protocol], and A2A [Agent2Agent], and upstream compute,” Finnerty said. “If you wanna run stuff on GCP or you wanna run stuff on AWS, we've got the plumbing in place so you can run your adjacent workloads wherever you want.” How Merck is using agents As it builds out its technical plumbing, Merck is experimenting with agents across regulated enterprise operations, scientific discovery workflows, and app modernization. Notably, AI is accelerating drug discovery. Finnerty explained that scientists look at molecular structures and disease states to determine if a given condition is druggable. But even if a disease state is known, developing a drug to target it can take years. Now with AI, teams are starting to see “very promising things,” such as cutting one particular research cycle down by one-third. “That's a year off of the life of the discovery cycle,” Finnerty said. “Which means, theoretically, we can get it to a patient who needs that therapy a year faster.” Once developed and approved, these products are regulated and marketing materials around them must be clearly and explicitly articulated. “The way you communicate that information per market, per country, per state, per region, is all very carefully governed and regulated,” Finnerty said. It’s also variable: An ad campaign for a vaccine in the state of Georgia looks much different from one launched in Canada. Historically, humans did the due diligence to make sure the company complied with various laws. 
Draft materials go through iterations of reviews; when a mistake is discovered, it gets “kicked back to the beginning, and it goes through it again, and then it takes another however many weeks and months,” Finnerty said. But now, AI can do that “much, much more effectively,” and the process is increasingly evolving from a human-in-the-loop to essentially a "human-as-governor." With human oversight, AI can deliver a first draft in a day or a week that is 99% there, allowing teams to ship materials up to 80% faster. Meanwhile, when it comes to app modernization, AI can discover architecture; document data interactions, APIs, and network paths; and perform authentication and authorization checks. It can also write Terraform code for deployment and refactor JavaScript into Python. Where the company would have previously spent weeks and months and hundreds of thousands of dollars to update one application, Finnerty said, agents are now handling the work through prompts.

Running into "wackiness"

That’s not to say there aren’t significant challenges. Finnerty noted that his team has run into some “wackiness,” for example in automated code and scenario testing. AI has blatantly made up scenarios, whether due to incorrect context, infrastructure, “or if it was just getting creative with, ‘You should be testing these three functions that don't even exist in the code that you're trying to test.’” “That surprised me a little bit because I thought we were further past some of the hallucination challenges in these later models,” he said. To address this, his team has engineered guardrails to keep hallucinations to a minimum, essentially using AI to supervise AI and applying confidence scores. So if Claude created the first output, they’ll instruct Microsoft Copilot to assess it. “So if you ask something once, have AI check it, then ask it a third time, the confidence increases every time, and it minimizes some of the garbage that gets created in the early runs,” Finnerty said.
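The "AI supervising AI" pattern Finnerty describes can be sketched as a simple agreement score. This is a hedged illustration, not Merck's system: the reviewer functions below are local stand-ins for calls to a second or third model, and the function names, the known-function list and the 0.67 threshold are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical reviewer type: returns True if the output looks valid.
Reviewer = Callable[[str], bool]


def agreement_score(output: str, reviewers: Sequence[Reviewer]) -> float:
    """Fraction of independent reviewers that accept the output."""
    if not reviewers:
        return 0.0
    votes = [review(output) for review in reviewers]
    return sum(votes) / len(votes)


def accept(output: str, reviewers: Sequence[Reviewer], threshold: float = 0.67) -> bool:
    """Keep the output only if enough reviewers agree (threshold is an assumption)."""
    return agreement_score(output, reviewers) >= threshold


# Stand-in reviewers for a generated test plan listing functions to test.
KNOWN_FUNCTIONS = {"load_order", "apply_discount"}


def names_exist(plan: str) -> bool:
    # Rejects plans that mention functions missing from the codebase.
    return all(name.strip() in KNOWN_FUNCTIONS for name in plan.split(","))


def not_empty(plan: str) -> bool:
    return bool(plan.strip())


if __name__ == "__main__":
    reviewers = [names_exist, not_empty]
    print(accept("load_order, apply_discount", reviewers))  # True
    print(accept("load_order, refund_all", reviewers))      # False
```

In practice each reviewer would be a different model or a deterministic check against the real codebase; the design choice is that no single model's output is trusted until something independent has looked at it.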
Use cases for agentic AI in financial services Meanwhile, at Mastercard, Chief Data Officer Andrew Reiskind and his team are focusing agentic experimentation on highly orchestrated transaction and dispute workflows. As he noted, a chargeback or fraud dispute is not a single event. When a consumer disputes a charge (typically online), that “kicks off an entire other process on the back-end that tends to be very labor-intensive,” Reiskind said. Mastercard has to collect specifics about the actual dispute; then the merchant has its own investigations (Was the card reported as lost or stolen? Does the consumer dispute charges often?). Further, the network sitting in the middle has its own rules for timing and information submission. “You have each and every one of these steps, many of which are unstructured, but there are also structured data elements to this,” Reiskind said. Whether a card was lost or stolen tends to be structured, but the consumer complaint is “unstructured data of questionable reliability.” “So you're sitting there with a decisioning system that has deterministic decisions, but also probabilistic decisions,” he said. This problem can be sped up and potentially solved by AI agents, but that can be a complex process: Which tasks are you handing off to agents? When are they kicking things back to human reps? How many agents are you ultimately using? What are the cost implications? Then there are reputational questions and costs: Have you just called a consumer potentially a liar when they weren't lying? “It's an exact problem where you want to, as a bank, maintain trust with your consumer,” Reiskind said. “But you also wanna make this efficient and take costs out of the system.” The PB&J versus turkey mistake: Determine what risks are acceptable There’s always going to be risk with AI, and enterprises should assess it from the beginning of product design, Reiskind said. There’s also the question of acceptable risk. 
As an example: Did you serve a customer a peanut butter and jelly sandwich instead of a turkey sandwich (a minor inconvenience)? Or did you serve gluten to someone with celiac disease? “Is it an acceptable risk if one percent of the time it makes the mistake? If it is, let's go to the next stage of how you're mitigating that risk,” Reiskind said. Leaders must perform a cost-benefit analysis, break problems down to their “constituent pieces,” and calculate the cost of each one. But these are estimates; it’s near-impossible to forecast real usage, Reiskind said. “It is not a simple process to get to the cost,” he said. “But it is doable.”

The data processing agreement (DPA), the bedrock contract companies use to evaluate how vendors handle personal data, can no longer be trusted at face value. That is the central, and arguably most alarming, conclusion of DataGrail's Privacy and AI Trends Report 2026, released today. The San Francisco-based privacy platform analyzed 2,400 popular business software providers and found that 63.6% of vendors that prominently advertise AI capabilities do not disclose a third-party AI subprocessor in their legal documentation. The implication: the majority of companies purchasing AI-enabled software may be unknowingly exposing their customers' data to AI models and pipelines they never reviewed, never approved, and may not even know exist. "All software vendors are trying to move to become AI vendors, which makes sense, but the technologies are moving faster than AI governance can actually keep up," DataGrail co-founder and CEO Daniel Barber told VentureBeat in an exclusive interview ahead of the report's release. "The DPA should be the reliable document that teams use to evaluate AI risk, but based on that number, that's not enough in 2026." The finding drops into an enterprise landscape where organizations with high levels of shadow AI already experience average breach costs of $4.63 million, which is $670,000 more than those with low or no shadow AI, according to IBM's 2025 Cost of Data Breach Report. And it arrives in a year when U.S. states issued $3.425 billion in privacy-related fines, more than the last five years combined, a trend Gartner expects to accelerate through 2028.

How researchers uncovered the growing gap between AI vendor contracts and reality

DataGrail's methodology for arriving at the 63.6% figure goes well beyond reading contracts. The company's research team cross-referenced DPA disclosures against product documentation, GitHub environments, API connections, and marketing materials for each of the 2,400 vendors in its tracking universe.
Barber walked VentureBeat through the process: "We looked at the DPA as the baseline, but then what we also looked at is the GitHub environment, the API connections that a particular vendor has, the product documentation, the marketing documentation, and triangulate that information to discern — okay, so the DPA document says use OpenAI, but actually you've got these three AI subprocessors over here in your product documentation outlining features and functionality, but that is not reflected in your DPA." When asked directly about how confident he was that these gaps represent actual shadow AI risk rather than vendors using proprietary technology, Barber was unequivocal. "Very confident, because we looked at the sample of the 2,400 systems, and we spent a substantial amount of time actually looking at product documentation, GitHub environments, looking at actual API connections, because we integrate with these systems as well, so we know how they process personal information. It is from primary research." The disclosure gap matters because it undermines the entire chain of trust that privacy programs rely on. Consider a scenario Barber described: A company invests in an AI recruiting tool. The tool's DPA lists Claude as its foundational model. The company dutifully performs a security review of Anthropic's AI. But the recruiting tool also quietly uses OpenAI and Gemini behind the scenes — models the company never evaluated. Those undisclosed models then process thousands of resumes and execute automated hiring decisions. The company, without knowing it, has exposed sensitive personal information — home addresses, financial data, possibly Social Security numbers — to AI systems it never vetted, potentially violating FTC regulations on automated decision-making in employment. "How those vendors are evaluating and performing that automated decision making could be really disastrous for a business," Barber said. 
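The cross-referencing Barber describes boils down to comparing what a DPA lists against what other evidence shows. The Python sketch below illustrates that comparison using the recruiting-tool scenario; it is not DataGrail's tooling, and the function name, the evidence categories and the sample data are assumptions made for illustration.

```python
def undisclosed_subprocessors(dpa_listed, observed_sources):
    """Return AI subprocessors seen in other evidence but missing from the DPA.

    dpa_listed: names disclosed in the DPA.
    observed_sources: mapping of evidence source -> names observed there.
    """
    disclosed = {name.lower() for name in dpa_listed}
    found = {}
    for source, names in observed_sources.items():
        for name in names:
            if name.lower() not in disclosed:
                # Record where each undisclosed subprocessor was spotted.
                found.setdefault(name, []).append(source)
    return found


if __name__ == "__main__":
    # Hypothetical recruiting tool: the DPA names one provider, the evidence shows three.
    dpa = ["Anthropic"]
    evidence = {
        "product_docs": ["Anthropic", "OpenAI"],
        "api_connections": ["OpenAI", "Google"],
    }
    for vendor, sources in undisclosed_subprocessors(dpa, evidence).items():
        print(f"{vendor}: not in DPA, seen in {', '.join(sources)}")
```

Anything the function returns is a candidate for a follow-up question to the vendor before the tool is approved, since the security review covered only the provider the DPA named.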
One-third of AI systems also process sensitive data, and the true number is likely higher The disclosure gap alone would be concerning enough. But DataGrail's report layers on another finding that makes the problem materially worse: 32.8% of AI systems that disclose AI capabilities also disclose at least one other high-risk activity, such as processing sensitive personal information or powering automated decision-making. Among AI systems with self-reported risk factors, 47.1% process personal data, 20.7% have the potential to power automated decision-making, 16.5% process sensitive data categories like health or financial information, and 7.5% process biometric data. The report argues these figures almost certainly undercount actual exposure, since they reflect only what vendors have formally disclosed. Vendors could underreport access to personal data, and the inherent flexibility of AI means even good-faith vendors might not predict riskier user applications of their tools. This has immediate regulatory implications. The CCPA's new risk assessment requirement, effective January 1, 2026, requires businesses to conduct and document risk assessments for processing activities that present significant privacy risks — and will require submission to CalPrivacy by April 2028, with executive attestation under penalty of perjury. Processing sensitive personal information with AI, or using AI for automated decision-making, are precisely the activities that trigger this obligation. The report finds that 42% of companies abandoned AI initiatives in 2025 with data privacy concerns cited as a primary obstacle — a statistic sourced to S&P Global research. Privacy teams that engage early with AI projects, Barber argues, can prevent that waste by ensuring safeguards are in place before launch, with AI risk assessments serving as the right starting point. 
Why consent management became 2025's most punished privacy failure While shadow AI is still a newer category of threat, the report makes clear that traditional privacy challenges have not eased — they have intensified. Consent management was the busiest enforcement topic of 2025. California alone publicly reported $4.3 million in CCPA consent settlements, and 2025 saw over 1,400 class action wiretapping suits driven by private firms investigating tracking pixels and session replay software. Despite this enforcement wave, 63% of the 5,000 websites DataGrail audited still fail to comply with universal opt-out mechanisms such as the Global Privacy Control signal. While that figure represents an improvement from 75% non-compliance in 2023, the pace of improvement is slow relative to the acceleration in enforcement. Barber pointed to the case of Todd Snyder, the menswear retailer that the California Privacy Protection Agency fined $345,178 in May 2025, as evidence that enforcement is no longer reserved for big tech. "This is a business that has two or three stores across the U.S. They have 300 employees," he said. "They run tight margins because they're a consumer menswear clothing store." The California Attorney General also reached a $2.75 million settlement with Disney over failures to honor opt-out signals, while the California Privacy Protection Agency has brought enforcement actions against PlayOn Sports and Ford — a pattern that demonstrates both the breadth and depth of regulatory activity. Among the trackers that fire even after a user sends a GPC signal, the report found that 27.1% come from Google Analytics and 43.8% are for targeted advertising via platforms like Meta and Microsoft. For users who do engage with consent banners, 48.3% click "Accept all," while only 12.4% select "Essential only" and 2.3% customize their preferences. A full 37% simply exit the banner without making a selection. 
The practical takeaway: less than 15% of users make a conscious choice to opt out of tracking, which means consent banners present relatively low business risk when properly configured — but enormous regulatory risk when they are not. Data deletion requests surge 567% as the cost of manual processing hits $1.5 million a year Data subject request volume hit an all-time high for the fifth consecutive year. Deletion requests have surged 567% since 2021 and now represent 87% of all data subject requests. Access requests, by contrast, have gradually declined as consumers skip visibility and reach straight for the delete button. The cost is staggering. For a mid-sized organization receiving 5 million annual web visitors, the report estimates manual DSR management now runs approximately $1.5 million per year, based on Gartner's estimated cost of $1,524 per manual DSR. The average cost has climbed from $238,000 in 2021 to $1.51 million in 2025 — a trajectory that makes manual processing not just inefficient but, as the report argues, "irresponsible." Barber emphasized that these numbers reflect verified human requests with bot and spam traffic excluded, and that data broker scenarios — which will see their own massive influx of requests under California's Delete Act — are reported separately. "That is a natural increase," Barber told VentureBeat. "If you've now got 20-plus U.S. states with privacy regulation, it's unlikely that we see a federal bill passed, even though we've seen one proposed. And while we don't see federal awareness and regulation, we do see at the state level over 20 states, and that may actually increase awareness for the consumer even more." He added a telling detail about how businesses are responding in practice: "99% of DataGrail customers do process that deletion" even for residents of states without privacy laws, "simply because it's too hard at this point. 
Discerning and even communicating to the person, 'Hey, you live in Montana, sorry, you're just in an unfortunate state without regulation' — you just can't do that." Data brokers felt the impact most acutely, with a 398% increase in deletion requests compared to 2024 and an average of over 2,000 deletion requests handled per month. State regulators issued $3.4 billion in privacy fines last year, and both parties want more The regulatory landscape underpinning all of these trends has fundamentally shifted from education to punishment. Nearly half of U.S. states now have a comprehensive privacy law in effect, plus over 160 AI-specific laws. State legislatures enacted 145 AI-related laws in 2025 alone, with another thousand introduced or reworked. According to Gartner, over 50% of the U.S. population is now covered by a comprehensive state privacy law, with 24 additional states expected to pass laws within five years. States have also begun pooling their resources, with ten forming the Consortium of Privacy Regulators last year and pledging to coordinate investigations across state lines. Barber argued that privacy enforcement is fundamentally bipartisan, which insulates it from the shifting political winds of the current administration. "Privacy overall is a pretty bipartisan issue," he said. "It's easy to pass privacy regulation because constituents somewhat expect privacy in their day-to-day living. If you were flying on an airline and they said, 'Okay, this seat, if you want your privacy, you're going to have to pay $6 more,' you're like, 'I'm going to go to another airline.' It's an expected part of a transaction at this stage." He predicted that other states will replicate California's enforcement model. "California has their enforcement division, CalPrivacy. That group has one task: to ensure enforcement of privacy throughout businesses. Is it likely that we see other states get funding and support to fund these types of groups? Highly likely. 
The enforcement fines — the actual payments — go back to us as constituents. That type of model, you could imagine, being very popular across the country." Privacy teams are losing a third of their staff just as AI governance demands explode Perhaps the most paradoxical finding in the report is that privacy teams lost as much as 33% of their headcount last year, even as their workloads expanded across every metric the report tracks. Cisco data cited in the report shows that 90% of privacy programs expanded in 2025 due to AI, while only 12% of AI governance programs are considered mature. Meanwhile, 74% of privacy teams planned to apply AI to privacy-related tasks in 2026, according to ISACA's State of Privacy 2026 survey. Barber sees this as part of a broader macroeconomic pattern rather than a sign that organizations do not value privacy. "It's actually a fascinating macro trend, and probably one you've seen across all functions," he said. "Businesses are driving more efficiency in all parts of the business. Privacy teams, five years ago, we would have said, 'Well, there's more regulation, the volume of deletions have increased 500%, we need more humans.' It's become clear that AI provides capabilities that can do the work for privacy individuals." He drew an analogy: "They might have had a design team of 20 people five years ago, now they have a design team of five, courtesy of Claude Design or Gamma or whatever the tool may be. I think that's what we're seeing here as well." DataGrail has positioned its own AI agent, Vera — launched in March 2026 — as part of the answer. Vera is embedded within DataGrail's existing platform and aims to automate privacy workflows across multiple jurisdictions. The company was also named the first production-ready Model Context Protocol server for privacy, using the standard created by Anthropic to enable customers to launch DataGrail tools from whatever application they are already working in, whether Slack, email, or Claude. 
Can a vendor-produced report be trusted to diagnose the problems that vendor sells solutions for? DataGrail is, of course, a company that directly benefits from the problems its report identifies. The company has raised a total of $84.2 million over five rounds, with its largest being a $45 million Series C in October 2022 led by Third Point Ventures. Its platform addresses precisely the data mapping, DSR automation, consent management, and risk assessment challenges the report spotlights. Barber acknowledged the tension directly. "It's a fair statement," he said when asked about potential skepticism. "DataGrail doesn't provide a service to keep DPAs up to date — that's on a business to evaluate how they work with a vendor. What DataGrail does help to do is assessments, and automate those assessments using our AI agent, Vera, to assess that increased risk." He argued that the more neutral reading of the data is structural: "This is evidence to show that the DPA unfortunately is not keeping up with technology and the speed at which technology is innovating. That's both exciting but also we need to accept that's where we are." The methodology does lend some credibility to this claim. The report draws on anonymized privacy operations data from hundreds of enterprise customers, the 2,400-system AI tracking database, and the 5,000-website consent audit — sources that are at least partially independent of DataGrail's commercial interests. And the broader findings on enforcement spending, DSR volume trends, and regulatory expansion align closely with independently published data from Gartner, Cisco, and state enforcement agencies. The next frontier: agentic AI could spread unvetted data across entire organizations autonomously When asked about the most important trend that did not make it into the report, Barber pointed to a next-generation risk that extends the shadow AI problem into far more dangerous territory: agentic AI workflows. 
Gartner predicts 40% of enterprise applications will feature task-specific AI agents by end of 2026, up from under 5% in 2025 — a pace of adoption that could rapidly outstrip the governance mechanisms companies are only now beginning to build. "Where we go next with this research is agent processing," Barber said. "How are agents then leveraging that information? Because the downstream ramifications would be far more concerning for a business. One particular system is using shadow AI, the business has no idea that that's happening, and then an agent is propagating that information across a whole bunch of other places. The guardrails of you and I checking the system will be lower than maybe what we've seen in the past with agentic workflows." He framed the distinction in human terms: "The identity of an agent is different than a human. There is thought that goes into what am I about to use here, where did this information come from, how was it collected — that may not be considered in the same way for an agentic workflow. We need to solve the root of the problem, which is how are these businesses leveraging AI subprocessors. But this quickly becomes an agentic problem that could be far more concerning." For the enterprise privacy and security leaders absorbing this report today, the uncomfortable truth is that the foundational documents and processes they have relied on to manage vendor risk for years are decomposing in real time. The DPA is breaking down as a reliable instrument. State enforcement is accelerating on a bipartisan basis. Privacy teams are shrinking even as their mandates expand. And the next wave of agentic AI systems threatens to distribute unvetted data processing across networks of autonomous agents that operate with even less human oversight than today's tools. 
Five years ago, when DataGrail published its first trends report, deletion requests were a fraction of what they are today, only a handful of states had privacy laws on the books, and the phrase "shadow AI" did not exist. Every year since, the report has warned that the problem was getting worse. Every year, the data has proved it right. The companies that survive the next chapter will not be the ones with the biggest compliance teams or the thickest policy binders. They will be the ones that accept a disorienting new reality: in 2026, the contracts you signed may not describe the AI that is already processing your customers' data — and by 2027, autonomous agents may be deciding what to do with it.
