VentureBeat

venturebeat.com

Transformative tech coverage that matters


200,000 MCP servers expose a command execution flaw that Anthropic calls a feature

5/1/2026

Anthropic created the Model Context Protocol as the open standard for AI agent-to-tool communication. OpenAI adopted it in March 2025. Google DeepMind followed. Anthropic donated MCP to the Linux Foundation in December 2025. Downloads crossed 150 million.

Then four researchers at OX Security found an architectural problem that affects all of them. MCP's STDIO transport, the default for connecting an AI agent to a local tool, executes any operating system command it receives. No sanitization. No execution boundary between configuration and command. A malicious command returns an error after the command has already run. The developer toolchain raises no flag.

OX Security researchers Moshe Siman Tov Bustan, Mustafa Naamnih, Nir Zadok and Roni Bar scanned the ecosystem and found 7,000 servers on public IPs with STDIO transport active, and estimate 200,000 total vulnerable instances extrapolated from that ratio. They confirmed arbitrary command execution on six live production platforms with paying customers. The research produced more than 10 CVEs rated high or critical across LiteLLM, LangFlow, Flowise, Windsurf, Langchain-Chatchat, Bisheng, DocsGPT, GPT Researcher, Agent Zero, LettaAI and others.

Kevin Curran, IEEE senior member and professor of cybersecurity at Ulster University, independently told Infosecurity Magazine the research exposed "a shocking gap in the security of foundational AI infrastructure."

Anthropic confirmed the behavior is by design and declined to modify the protocol, characterizing STDIO's execution model as a secure default and input sanitization as the developer's responsibility. That characterization comes from OX; the only word Anthropic explicitly stated on the record is "expected." Anthropic has not issued a standalone public statement and did not respond to VentureBeat's request for comment. OX says expecting 200,000 developers to sanitize inputs correctly is the problem.
Anthropic's strongest technical counter: sanitizing STDIO would either break the transport or move the payload one layer down. Both positions are technically coherent. The question is what to do while that debate plays out.

Every major outlet covered the disclosure. None built the prescriptive product-by-product audit a security director needs to triage her own MCP deployments. This piece does. Five questions determine whether your MCP deployments are exposed, whether your patches hold, and what to do Monday morning.

Am I exposed?

If your teams deployed any MCP-connected AI agent using the default STDIO transport, yes. The insecurity is not a coding bug in any single product. It is a design default in Anthropic's MCP specification that propagated into every official language SDK: Python, TypeScript, Java, and Rust. Every downstream project that trusted the protocol inherited it.

OX identified four exploitation families. Unauthenticated command injection through AI framework web interfaces, demonstrated against LangFlow and LiteLLM. Hardening bypasses in tools that implemented command allowlists, demonstrated against Flowise and Upsonic, where OX bypassed the allowlist through argument injection (npx -c). Zero-click prompt injection in AI coding IDEs, where malicious HTML modifies local MCP configuration files. Windsurf (CVE-2026-30615) was the only IDE where exploitation required zero user interaction, though Cursor, Claude Code, and Gemini-CLI are all vulnerable to the broader family. And malicious package distribution through MCP registries, where OX submitted a benign proof-of-concept to 11 registries, and nine accepted it without security review.

Carter Rees, VP of AI and Machine Learning at Reputation and member of the Utah AI Commission, told VentureBeat the framing needs to change entirely. "MCP stdio is a privileged execution surface, not a connector. Enterprise teams should treat it like production shell access. Deny by default, allowlist, sandbox and stop assuming downstream input validation will hold at scale," Rees said.

The IDE family deserves particular attention because it hits developer workstations, not servers. A developer who visits an attacker-controlled website can trigger a modification to their local MCP configuration file, and in Windsurf's case, the change executes immediately with no approval prompt. Cursor, Claude Code and Gemini-CLI require some form of user interaction, but if the UI presents a configuration change without surfacing the execution consequence, clicking 'approve' does not constitute informed consent.

Did my vendor patch?

Some did. Some partially. Some have not confirmed. The matrix below maps each affected product against the exploitation family, patch state, and the gap that remains. The critical column is "Protocol fix?" Every row says no.

| Product | Exploit type | Patched? | Protocol fix? | The gap | Action |
|---|---|---|---|---|---|
| LiteLLM | Command injection via adapter UI | YES | NO | LiteLLM is fixed. New STDIO configs outside LiteLLM inherit the same insecure default. | Pin to v1.83.7-stable or later (CVE-2026-30623). Verify against GitHub advisory. Audit all other STDIO definitions. |
| LangFlow | RCE via public auto_login + STDIO | Partial | NO | Auth token freely available via public endpoint. STDIO executes whatever follows. | Block public auto_login. Sandbox all MCP services from the host OS. |
| Flowise / Upsonic | Allowlist bypass (npx -c argument injection) | Hardened, bypass confirmed | NO | Allowlist gives false confidence. OX bypassed it. Trivial. | Do not rely on command allowlists. Enforce process-level sandbox isolation. |
| Windsurf (CVE-2026-30615) | Zero-click prompt injection to local RCE | REPORTED, unconfirmed | NO | The only IDE with a true zero-interaction exploit. Hits developer workstations, not servers. | Disable automatic MCP server registration. Review all active configs manually. |
| Cursor / Claude Code / Gemini-CLI | Prompt injection to local MCP config modification | Cursor patched (CVE-2025-54136); others vary | NO | User interaction required, but config-change UI does not surface execution consequence. Approval does not equal informed consent. | Audit MCP config files (~/.cursor/mcp.json, equivalent paths). Disable auto-registration. Review all pending config changes before approval. |
| Langchain-Chatchat (CVE-2026-30617) | RCE via MCP STDIO transport | REPORTED, unconfirmed | NO | Downstream chatbot framework inherits the same STDIO default. Patch status unconfirmed. | Inventory all Langchain-Chatchat deployments. Sandbox from host OS. Monitor vendor advisory for patch. |
| MCP registries (9 of 11) | Accepted malicious PoC without review | N/A | NO | Registries lack submission security review. Install and risk a backdoor. | Use registries with documented submission review. Audit installs against known-good hashes. |

Does the flaw survive the patch?

Yes. Every product-level patch in the matrix addresses the specific entry point in that product. None of them changes the MCP protocol's STDIO behavior. A security director who patches LiteLLM today and configures a new MCP STDIO server tomorrow will inherit the same insecure default on the new server. The patches are necessary. They are not sufficient.

This was predictable. When VentureBeat first reported on MCP's security flaws in January, Merritt Baer, chief security officer at Enkrypt AI and former deputy CISO at AWS, warned: "MCP is shipping with the same mistake we've seen in every major protocol rollout: insecure defaults. If we don't build authentication and least privilege in from day one, we'll be cleaning up breaches for the next decade."

The Cloud Security Alliance independently confirmed OX's findings in a separate research note and recommended organizations treat MCP-connected infrastructure as an active, unpatched threat. The defaults did not change. The attack surface grew.
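The allowlist bypass described above can be illustrated with a small sketch. It is not OX's proof-of-concept and not code from Flowise or Upsonic. The allowlist contents, the flag table, and both function names are hypothetical. The only point is that checking the binary name alone lets an allowed launcher be turned into a shell through its arguments.

```python
from typing import Sequence

# Hypothetical allowlist of launcher binaries, for illustration only.
ALLOWED_COMMANDS = {"npx", "node", "python"}

# Hypothetical table of flags that hand a string to a shell or interpreter.
# It is deliberately small and incomplete (combined short flags are not handled).
EXEC_FLAGS = {
    "npx": {"-c", "--call"},
    "node": {"-e", "--eval", "-p", "--print"},
    "python": {"-c"},
}


def naive_allowlist(command: str, args: Sequence[str]) -> bool:
    """Check only the binary name; the arguments are never inspected."""
    return command in ALLOWED_COMMANDS


def stricter_check(command: str, args: Sequence[str]) -> bool:
    """Also reject arguments that make an allowed binary execute a string."""
    if command not in ALLOWED_COMMANDS:
        return False
    blocked = EXEC_FLAGS.get(command, set())
    for arg in args:
        flag = arg.split("=", 1)[0]  # treat "--call=..." like "--call"
        if flag in blocked:
            return False
    return True


# Two server definitions: an ordinary launch and an argument-injection attempt.
benign = ("npx", ["-y", "example-mcp-server"])
injected = ("npx", ["-c", "touch /tmp/proof"])

print(naive_allowlist(*injected))  # the name-only check lets the injection through
print(stricter_check(*injected))   # the argument-aware check rejects it
print(stricter_check(*benign))     # an ordinary launch still passes
```

Even the stricter check is a denylist of known flags and will miss variants, which is consistent with the guidance in the matrix to enforce process-level sandbox isolation rather than rely on command allowlists.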
Rees argued that Anthropic's position, while internally consistent, does not survive contact with enterprise reality. "It stops being a developer mistake and starts being a distributed failure mode when the same class of failure reproduces across that many independent implementations," he told VentureBeat. "Guidance is not an architectural control. Relying on thousands of downstream implementers to consistently interpret a trust boundary is a known anti-pattern in enterprise security."

Anthropic updated its SECURITY.md file nine days after OX's initial contact in January 2026 to note that STDIO adapters should be used with caution, but made no architectural changes. The researchers' assessment of that update: "This change didn't fix anything."

Rees took a more measured view. "It's worth giving Anthropic credit where it's due," he told VentureBeat. "After the disclosure, they updated their security guidance to recommend caution with stdio adapters. That's a meaningful step even if researchers argue it falls short of a protocol-level fix."

What changed at the protocol level?

Nothing architectural. Anthropic has not implemented manifest-only execution, a command allowlist in the official SDKs, or any other protocol-level mitigation. OX recommended all three. The SECURITY.md guidance update was the only change. OX's research began in November 2025 and included more than 30 responsible disclosure processes across the ecosystem before the April 15 publication.

The disagreement is substantive. Anthropic's architectural argument deserves its full weight. STDIO is a local subprocess transport designed to launch processes on the machine that configured it. The trust boundary, in Anthropic's model, sits with whoever controls the configuration file. If you can write to the MCP config, you are by definition someone authorized to execute commands on that machine. Under that logic, what looks like command injection is a feature working as intended.
Restricting what STDIO can launch at the protocol level would either break the transport's core function, since its purpose is to launch arbitrary local processes, or displace the attack surface into the launched process itself. The unopinionated-standard argument is also defensible: a universal protocol that hard-codes execution constraints stops being universal.

OX's counter, from their advisory: "Shifting responsibility to implementers does not transfer the risk. It just obscures who created it."

Do not wait for a protocol-level fix. Treat every MCP STDIO configuration as an untrusted input surface, regardless of which product it sits inside.

Monday morning remediation sequence

Enumerate. Identify every MCP server deployment across dev, staging, and production. Search for MCP configuration files (mcp.json, mcp_config.json) in developer home directories and IDE config paths (~/.cursor/, ~/.codeium/windsurf/, ~/.config/claude-code/). List running processes that match MCP server binaries. Flag any using STDIO transport with public IP accessibility. OX found 7,000 on public IPs. Your environment may have instances you do not know about.

Patch. Pin every affected product to its patched release. LiteLLM v1.83.7-stable includes the fix for CVE-2026-30623. DocsGPT, Flowise, and Bisheng have also shipped fixes. Windsurf and Langchain-Chatchat remain in reported state as of May 1, 2026. Cursor was patched against an earlier related disclosure (CVE-2025-54136) but inherits the same protocol default. Check each vendor's advisory on the morning you execute this step.

Sandbox. Isolate every MCP-enabled service from the host operating system. Never give a server full disk access or shell execution privileges. The Flowise/Upsonic allowlist bypass proves that restricting commands alone is not enough.

Audit registries. Review every MCP server installed from a third-party registry. Nine of 11 registries accepted OX's proof-of-concept without a security review. Use registries with documented submission review processes. Remove any MCP server whose origin you cannot verify.

Treat STDIO config as untrusted. This step survives every future patch and every future product. The protocol-level default has not changed. Every STDIO server definition is a command execution surface. Treat it the same way you treat user input to a database query: assume it is hostile until validated.

Your exposure cannot wait for a protocol fix

Anthropic and OX Security disagree on where the responsibility for securing MCP's STDIO transport belongs. That disagreement will not be resolved this week. What can be resolved this week is whether your MCP deployments are enumerated, patched, sandboxed, and treated as the untrusted execution surfaces they are. As Rees put it: "The core question here is architectural policy, not exploit payloads."

Baer warned in January that insecure defaults would produce exactly this outcome. OX documented 200,000 servers running with a configuration field that doubles as an execution surface. The protocol's designer says it is working as intended. Your Monday morning question is not who is right. It is which of your servers are exposed.
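The Enumerate step in the remediation sequence could start from something like the following minimal sketch. The file names and directories are the ones listed in that step. The JSON layout is an assumption: the sketch expects servers under a top-level "mcpServers" key and treats any entry with a "command" field as launched over STDIO, which is a common convention for these files but may not match every product. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator

# File names and directories taken from the Enumerate step.
CONFIG_NAMES = ("mcp.json", "mcp_config.json")
CONFIG_DIRS = ("~/.cursor", "~/.codeium/windsurf", "~/.config/claude-code")


def find_config_files(dirs: Iterable[str] = CONFIG_DIRS) -> Iterator[Path]:
    """Yield every candidate MCP config file under the given directories."""
    for directory in dirs:
        root = Path(directory).expanduser()
        if not root.is_dir():
            continue
        for name in CONFIG_NAMES:
            yield from root.rglob(name)


def stdio_servers(config: dict) -> dict:
    """Map server name to argv for entries that launch a local process.

    Assumption: servers sit under "mcpServers", and an entry with a
    "command" field is started over STDIO.
    """
    servers = config.get("mcpServers", {})
    if not isinstance(servers, dict):
        return {}
    found = {}
    for name, entry in servers.items():
        if isinstance(entry, dict) and "command" in entry:
            args = entry.get("args") or []
            found[name] = [str(entry["command"]), *map(str, args)]
    return found


def audit(dirs: Iterable[str] = CONFIG_DIRS) -> list:
    """Return (config path, server name, argv) for each STDIO definition found."""
    findings = []
    for path in find_config_files(dirs):
        try:
            config = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed file; skip it in this sketch
        if not isinstance(config, dict):
            continue
        for name, argv in stdio_servers(config).items():
            findings.append((str(path), name, argv))
    return findings
```

Calling audit() with no arguments scans the default directories, and each returned tuple is one command-execution surface to review by hand. The sketch only reads configuration files. It does not list running processes or check public IP exposure, both of which the Enumerate step also calls for.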

The AI scaffolding layer is collapsing. LlamaIndex's CEO explains what survives.

5/1/2026

The scaffolding layer that developers once needed to ship LLM applications (indexing layers, query engines, retrieval pipelines, carefully orchestrated agent loops) is collapsing. And according to Jerry Liu, co-founder and CEO of LlamaIndex, that's not a problem. It's the point.

“As a result, there's less of a need for frameworks to actually help users compose these deterministic workflows in a light and shallow manner,” Liu explains in a new VentureBeat Beyond the Pilot podcast.

Context is becoming the moat

Liu’s LlamaIndex is one of the foremost retrieval-augmented generation (RAG) frameworks connecting private, custom, and domain-specific data to LLMs. But even he acknowledges that these types of frameworks are becoming less relevant.

With every new release, models demonstrate incremental capabilities to reason over “massive amounts” of unstructured data, and they’re getting better at it than humans, he notes. They can be trusted to reason extensively, self-correct, and perform multi-step planning; Model Context Protocol (MCP) and Claude Agent Skills plug-ins allow models to discover and use tools without requiring integrations for every one independently.

Agent patterns have consolidated toward what Liu calls a "managed agent diagram": a harness layer combined with tools, MCP connectors, and skills plug-ins, rather than custom-built orchestration for every workflow.

Further, coding agents excel at writing code, meaning devs don’t need to rely on extensive libraries. In fact, about 95% of LlamaIndex code is generated by AI. “Engineers are not actually writing real code,” Liu said. “They're all typing in natural language.”

This means the layer between programmers and non-programmers is collapsing, because “the new programming language is essentially English.” Instead of manual coding or struggling to understand API and document integration, devs can just point Claude Code at it.

“This type of stuff was either extremely inefficient or just would break the agent three years ago,” said Liu. “It's just way easier for people to build even relatively advanced retrieval with extremely simple primitives.”

So what’s the core differentiator when the stack collapses? Context, Liu says. Agents need to be able to decipher file formats to extract the right information. Providing higher accuracy and cheaper parsing becomes key, and LlamaIndex is well-positioned here, he contends, because of its developments with agentic document processing via optical character recognition (OCR).

“We've really identified that there's a core set of data that has been locked up in all these file format containers,” he said. Ultimately, “whether you use OpenAI Codex or Claude Code doesn't really matter. The thing that they all need is context.”

Keeping stacks modular

There’s growing concern about builders like Anthropic locking in session data; in light of this, Liu emphasizes the importance of modularity and agnosticism. Builders shouldn’t bet on any one frontier model, or overbuild in a way that overcomplicates components of the stack.

Retrieval has evolved into “agent-plus-sandbox,” as he describes it, and enterprises must ensure that their code bases are free of tech debt and adaptable to changing patterns. They also have to acknowledge that some parts of the stack will eventually need to be thrown away as a matter of course.

“Because with every new model release, there's always a different model that is kind of the winner,” Liu said. “You want to make sure you actually have some flexibility to take advantage of it.”

Listen to the podcast to hear more about: LlamaIndex’s beginnings as a ‘toy project’ with initially only about 40% accuracy; How SaaS companies can tap into complicated workflows that must be standardized and repeatable for average knowledge workers; Why vertical AI companies are taking off and why ‘build versus buy’ is still a very valid question in the agent age.

You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

xAI launches Grok 4.3 at an aggressively low price and a new, fast, powerful voice cloning suite

5/1/2026

While Elon Musk faces off against his former colleague and OpenAI co-founder Sam Altman in court, Musk's rival firm xAI, founded to take on OpenAI, isn't slowing down on launching competitive new products and services. Last night, xAI shipped a new, proprietary base large language model (LLM), Grok 4.3, and a new voice cloning suite on the web.

The new products arrive after months of tumult at xAI, during which all 10 of Musk's original co-founders of the lab and dozens more researchers exited the firm, and Grok was eclipsed on performance by many new competing LLMs from the likes of OpenAI, Anthropic, Google, and Chinese firms DeepSeek, Moonshot (Kimi), Alibaba (Qwen), z.ai, and others.

While Grok 4.3 does mark a significant leap in performance on third-party benchmarks over its direct predecessor Grok 4.20, according to the independent AI model evaluation firm Artificial Analysis, it still remains below the state-of-the-art set by OpenAI and Anthropic's latest models.

But other than Musk's stated opposition to "wokeness" and the model's more freewheeling personality and image generation policy, the marquee feature of the Grok brand has increasingly been its low price point when accessed by developers and users via the xAI application programming interface (API). Grok 4.3 furthers that trend: it costs $1.25 per million input tokens and $2.50 per million output tokens (up to 200,000 input tokens, at which point costs double, a common pricing strategy of leading AI labs), compared to its direct predecessor Grok 4.20's initial API pricing of $2/$6 per million input/output tokens.

According to xAI's release notes, Grok 4.3 began beta testing in April for subscribers to xAI's SuperGrok ($30 monthly) plan, and those of its sibling social network, X, through its Premium+ plan ($40 monthly with 50% off for the first two months). Now it's available to all through the xAI API and through partner OpenRouter.

Reasoning baked-in and agentic tool-use capabilities

At the core of Grok 4.3 is a fundamental shift in how the model processes information. Unlike previous iterations where "chain-of-thought" or reasoning could often be toggled or configured by effort levels, Grok 4.3 is built with reasoning as an active, permanent state. This means the model is designed to "think" before it speaks for every query, a strategy intended to maximize factual accuracy and the handling of complex, multi-step instructions.

The model’s memory is equally expansive, featuring a 1 million-token context window. To put this in perspective, a million tokens is roughly equivalent to several thick novels or the entire codebase of a mid-sized application. This allows Grok 4.3 to maintain coherence over massive datasets, though xAI has implemented a "Higher context pricing" structure for requests that exceed the 200,000-token threshold. This tiering suggests that while the "long-term memory" is available, the computational cost of managing that much information remains a significant overhead.

Technically, the model accepts both text and image inputs, outputting text. It is specifically optimized for agentic workflows: scenarios where an AI is not just answering a question but acting as an autonomous agent to complete a task. For the first time, Grok has access to the same tools and environments a human professional would use. Evidence of this shift is visible in early user interactions:

Spreadsheet Engineering: In one instance, the model spent 6 minutes and 22 seconds in a "thought" phase to build a comprehensive OSRS Sailing Combat DPS analyzer. The resulting .xlsx file wasn't a simple table but a multi-sheet dashboard including a "Reference_Data" set and a complex "DPS_Calculator" with formulaic auto-calculations.

Professional Documentation: Grok now generates formatted PDFs, such as 12-page reports on SpaceX products. These documents incorporate branding, logos, hero images, and structured tables, moving well beyond the markdown blocks of previous iterations.

Visual Presentations: The model can design 9-slide PowerPoint decks, utilizing a "Sandwich Structure" (dark titles/conclusions with light content) and integrating data-driven decision matrices and humor.

However, its knowledge of the world is not infinite; the release notes list a knowledge cut-off date of December 2025. Yet, thanks to built-in web search, Grok can reference and use up-to-date information.

In fact, Grok 4.3 arrives with an enhanced ecosystem of tools designed to make it a functional digital employee. The xAI platform now offers a robust set of server-side tools that the model can invoke autonomously based on the complexity of the query.

Web and X Search: These tools allow Grok to bypass its knowledge cutoff by browsing the live internet or searching X (formerly Twitter) posts, user profiles, and threads.

Code Execution: The model can run Python code in a sandboxed environment to solve mathematical problems or process data.

File and Collections Search: A built-in Retrieval-Augmented Generation (RAG) system allows users to query uploaded document collections or search through specific file attachments.

xAI's Custom Voices let you clone your voice at high quality in a minute or two

Beyond text, xAI has introduced Custom Voices, a sophisticated voice-cloning API and web-based voice cloning creation suite. This product allows developers to clone a voice from a reference audio clip as short as 120 seconds. Once cloned, the "voice ID" can be used across xAI’s Text-to-Speech (TTS) and Voice Agent APIs.

xAI's documentation emphasizes that this is not merely about timbre; the model is designed to pick up delivery patterns. If a user records a reference clip in a "customer support" style, the resulting AI voice will mimic that helpful, professional inflection.
Despite the creative potential, xAI has placed strict geographic limits on this feature, making it available only in the United States, with a notable exception for Illinois due to regional biometric and privacy regulations. While the console playground is open for general use, programmatic access via the POST /v1/custom-voices endpoint is currently gated to teams on an Enterprise plan.

I tried it myself, and after moving through the requisite voice sampling screens on the web (the tool asks you to read aloud several passages of unrelated dialog), I indeed had a copy of my voice that sounded eerily identical to mine and accurately pronounced new words the same way I would when reading aloud from a new script it was given.

You can delete your custom voices in one click on xAI's Custom Voices web application and create up to 30 new ones at a time. In terms of licensing, the Custom Voices feature is strictly "scoped to your team" and is never made available to other users, ensuring a private, commercial license for corporate assets.

Access to the new Voice Agent API (grok-voice-think-fast-1.0) is billed at a flat rate of $3.00 per hour ($0.05 per minute) for speech-to-speech interactions. This is on the low-medium end of costs compared with other competing voice agents, according to my research:

| Service | Price per 1k Characters | Estimated Cost per Minute | Estimated Cost per Hour |
|---|---|---|---|
| OpenAI TTS (Standard) | $0.015 | ~$0.015 | ~$0.90 |
| OpenAI TTS (HD) | $0.030 | ~$0.030 | ~$1.80 |
| Grok Voice Agent | | $0.05 | $3.00 |
| ElevenLabs (Starter) | ~$0.30 | ~$0.30 | ~$18.00 |
| ElevenLabs (Pro) | ~$0.18 | ~$0.18 | ~$10.80 |
| Play.ht | ~$0.20 | ~$0.20 | ~$12.00 |
| Azure/Google Cloud | $0.016 - $0.024 | ~$0.02 | ~$1.00 - $1.50 |

Complementing this is the standalone Text-to-Speech (TTS) service, which offers five distinct voices (Eve, Ara, Rex, Sal, and Leo) and is priced at $4.20 per 1 million characters.
For transcription needs, the Speech-to-Text (STT) API provides real-time streaming at $0.20 per hour, while batch processing is available at a discounted rate of $0.10 per hour. To ensure security for client-side applications, xAI utilizes Ephemeral Tokens, allowing for secure WebSocket connections without exposing primary API keys.

Once created, these voices are private to the user's team and can be used across all voice APIs by referencing a unique 8-character alphanumeric voice_id. For highly regulated sectors, xAI maintains production-ready standards, including SOC 2 Type II auditing, HIPAA eligibility for healthcare workloads, and GDPR compliance.

Aggressively low API pricing as a differentiator

The most aggressive aspect of the Grok 4.3 announcement is its pricing structure. Bindu Reddy, CEO of enterprise assistant startup Abacus AI, noted on X that the model is "as smart as Sonnet 4.6 and 5x cheaper and faster".

The standard API rates are set at $1.25 per million input tokens and $2.50 per million output tokens. This reflects a significant reduction in cost compared to its predecessor, Grok 4.20, with Artificial Analysis reporting an approximately 40% lower input price and 60% lower output price.

According to our calculations at VentureBeat, that places Grok 4.3 firmly in the lower-cost half of all major foundation models, far closer to Chinese open source offerings than its U.S. proprietary rivals:

| Model | Input | Output | Total Cost | Source |
|---|---|---|---|---|
| MiMo-V2.5 Flash | $0.10 | $0.30 | $0.40 | Xiaomi MiMo |
| Grok 4.1 Fast | $0.20 | $0.50 | $0.70 | xAI |
| MiniMax M2.7 | $0.30 | $1.20 | $1.50 | MiniMax |
| MiMo-V2.5 | $0.40 | $2.00 | $2.40 | Xiaomi MiMo |
| Gemini 3 Flash | $0.50 | $3.00 | $3.50 | Google |
| Kimi-K2.5 | $0.60 | $3.00 | $3.60 | Moonshot |
| Grok 4.3 | $1.25 | $2.50 | $3.75 | xAI |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| GLM-5-Turbo | $1.20 | $4.00 | $5.20 | Z.ai |
| DeepSeek V4 Pro | $1.74 | $3.48 | $5.22 | DeepSeek |
| GLM-5.1 | $1.40 | $4.40 | $5.80 | Z.ai |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
| Qwen3-Max | $1.20 | $6.00 | $7.20 | Alibaba Cloud |
| Gemini 3 Pro | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI |
| Claude Opus 4.7 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.5 | $5.00 | $30.00 | $35.00 | OpenAI |

However, the "reasoning" nature of the model introduces a new billing category: Reasoning tokens. These are tokens generated during the model's internal thinking process and are billed at the same rate as standard completion tokens. Effectively, users pay for the AI to "think" before it provides the final answer.

xAI has also introduced several unique fee structures:

Prompt Caching: Repeated prompts are significantly cheaper, at $0.20 per million tokens, incentivizing developers to reuse context.

Tool Invocations: While token usage for tools is billed at standard rates, the act of invoking a tool carries a flat fee: $5.00 per 1,000 calls for Web Search or Code Execution, and $10.00 for File Attachments.

Usage Guideline Violation Fee: In a move that may set a new industry precedent, xAI charges a $0.05 fee for requests that are blocked by their safety filters before generation even begins.

The model itself remains accessible via a standard commercial API, with xAI recommending that all developers migrate to grok-4.3 as their "most intelligent and fastest model".

Third-party benchmark evaluations and analysis

The reception of Grok 4.3 has been polarized, depending largely on the specific use case.
Professional benchmarkers and developers have highlighted a "stark gap" between the model's domain-specific strengths and its general reasoning consistency. According to independent AI evaluation firm Vals AI, Grok 4.3 has taken the top spot on several specialized indices. It currently ranks #1 on CaseLaw v2 (79.3% accuracy) and #1 on CorpFin. This 25-point jump in legal reasoning over Grok 4.20 suggests that the "always-on reasoning" architecture is particularly well-suited for the dense, logical structures of law and finance. Artificial Analysis corroborated this performance, noting a massive improvement in agentic tasks, scoring an Elo of 1500 on the GDPval-AA benchmark, surpassing competitors like Gemini 3.1 Pro and GPT-5.4 mini. Conversely, users focused on general-purpose agents and coding have highlighted deficiencies. AI automated brick-and-mortar retail company Andon Labs reported that Grok 4.3 is a "big regression" on the Vending-Bench 2, which measures an AI's ability to take consistent actions in a simulation. They colorfully described the model as having "narcolepsy problems," preferring to remain inactive for multiple simulation days rather than taking the required actions. The sentiment was echoed by Vals AI, which noted that while the model improved in some coding areas, it remains weak on general coding tasks and "struggles with difficult math problems," scoring only 11% on ProofBench. Should your enterprise use Grok 4.3? The launch of Grok 4.3 represents a calculated bet by xAI that the market wants specialized brilliance and extreme cost efficiency over a perfectly balanced generalist. By achieving a score of 53 on the Artificial Analysis Intelligence Index while remaining on the "Pareto frontier" of cost-per-intelligence, xAI is positioning itself as the "value" leader for enterprise applications in legal and financial tech. The "always-on reasoning" is a double-edged sword. 
While it provides the depth needed to navigate complex case law, the community reports of "narcolepsy" suggest that a model that is always "thinking" may occasionally think itself into a state of paralysis, or at least a state of excessive caution that inhibits agentic action.

In addition, prior Grok model scandals are nearly certain to give some enterprises pause when considering adoption. These include an X chatbot version referring to itself as "MechaHitler" and posting antisemitic content, sexualized deepfake imagery generation and investigations, and references to racial conflicts and right-wing dog whistle framing of social issues, which appear to mirror many of founder Musk's own positions, to the point that the model was at one point checking Musk's own X account before responding in its X implementation.

It's unclear whether any of those issues remain with Grok 4.3, but one user did note that Grok's system prompt appears to instruct it "you do not assign broad positive/negative utility functions to groups of people."

For developers, the decision to adopt Grok 4.3 will likely come down to the nature of their data. For those needing to process a million tokens of legal documents at a fraction of the cost of Claude 4.6 or GPT-5.5, Grok 4.3 is a clear front-runner. For those building high-frequency autonomous agents or complex math solvers, the "narcolepsy" and coding regressions suggest that xAI's latest model may still need a few more "tuning passes".

As OpenRouter noted on X upon making the model live, the "large jump in agentic performance" at a lower price point is an undeniable milestone. Whether that performance can be sustained across all domains remains the primary question for the summer of 2026.
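The pricing figures quoted in this article can be checked with a short calculation. The sketch below is only an estimate built from the rates stated above, not an xAI billing formula. It assumes that both input and output rates double for any request whose input exceeds 200,000 tokens, and it bills reasoning tokens at the output rate as the article describes. Tool invocation fees and prompt caching are left out. The function names are hypothetical.

```python
# Rates quoted in the article, in dollars per million tokens.
INPUT_PER_M = 1.25
OUTPUT_PER_M = 2.50
PREV_INPUT_PER_M = 2.00   # predecessor's initial input price
PREV_OUTPUT_PER_M = 6.00  # predecessor's initial output price
LONG_CONTEXT_THRESHOLD = 200_000  # input tokens above which costs double


def pct_reduction(old: float, new: float) -> float:
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100


def request_cost(input_tokens: int, output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    """Estimated dollar cost of one request.

    Assumption: the long-context doubling applies to the whole request
    once the input exceeds the threshold. Reasoning tokens are billed
    like output tokens.
    """
    multiplier = 2.0 if input_tokens > LONG_CONTEXT_THRESHOLD else 1.0
    billed_output = output_tokens + reasoning_tokens
    dollars = input_tokens * INPUT_PER_M + billed_output * OUTPUT_PER_M
    return multiplier * dollars / 1_000_000


print(f"input price cut:  {pct_reduction(PREV_INPUT_PER_M, INPUT_PER_M):.1f}%")
print(f"output price cut: {pct_reduction(PREV_OUTPUT_PER_M, OUTPUT_PER_M):.1f}%")
print(f"100k in, 2k out, 8k reasoning: ${request_cost(100_000, 2_000, 8_000):.2f}")
```

The computed cuts of 37.5% on input and about 58.3% on output are consistent with the roughly 40% and 60% figures attributed to Artificial Analysis. In the example request, the 8,000 reasoning tokens cost four times as much as the 2,000 visible output tokens, which is the practical meaning of paying for the model to "think".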

Hidden IT problems are quietly creating risk, shadow IT, and lost productivity

5/1/2026

Presented by TeamViewer

Enterprise technology failures are largely invisible. Research from TeamViewer, based on a global survey of 4,200 managers and employees, finds that the majority of digital dysfunction never reaches the IT help desk. Employees work around slow applications, failed logins, and intermittent glitches rather than reporting them, leaving organizations without an accurate picture of how their technology is performing. The cumulative cost is significant: employees lose an average of 1.3 workdays per month to digital friction, with impacts ranging from delayed projects and lost revenue to increased employee turnover. The research, which surveyed managers and employees across nine countries, confirms what many have long suspected: the productivity loss from digital friction is significant, and most of it never surfaces in an IT support queue, says Andrew Hewitt, VP of strategic technology at TeamViewer. “Enterprise outages are visible because they trigger clear, system-level failures,” Hewitt says. “But much of the real disruption happens earlier, in the form of digital friction: slow apps, login issues, or intermittent glitches that don’t cross alert thresholds. These smaller issues often go unreported or are normalized by employees, even though they quietly drain productivity.” What is digital friction and why does it go unreported? The most common sources of friction (connectivity failures, software crashes, hardware problems, and authentication issues) aren’t edge-case scenarios, but everyday experiences employees have learned to absorb without escalating. Connectivity problems were the most widespread, with nearly half identifying them as the top productivity killer among common technology issues. That tendency to absorb rather than report is central to the problem. 
Many workers don’t trust their IT team to resolve issues quickly or effectively, so when a login fails or an application stalls mid-task, the path of least resistance is to restart the device, switch tools, or use a personal phone. “Employees are under more pressure than ever to prove output,” Hewitt says. “When reporting feels unlikely to result in a quick resolution, it creates a false sense of stability at the system level while the employee experience quietly deteriorates.” How much productivity does digital friction cost organizations? The business consequences extend beyond inconvenience. Many organizations report delays in critical operations, revenue loss, and lost customers as a result of IT dysfunction. Most respondents lose time each month, and few expect improvement, citing increasing complexity of workplace technology as a primary concern. The human cost runs parallel. Workers link digital friction to frustration, decreased motivation, and burnout, and many believe it contributes to turnover, with onboarding replacements stretching to eight weeks or more. "Employees are happiest when they feel productive and accomplished at the end of the day," Hewitt says. "When people can't make progress in their day-to-day work, frustration builds and burnout follows. Great technology might not be a main attractor of talent, but bad technology can certainly play a role in driving it away." Why employees use personal devices and unauthorized tools instead of reporting IT problems When workplace technology consistently fails to meet employee needs, workers find alternatives, with a substantial share of respondents admitting to using personal devices or unauthorized applications as workarounds. That's the entry point for shadow IT, or the use of unapproved hardware, software, or cloud services outside IT's visibility and control. 
While employees turn to these tools simply to stay productive, they introduce security vulnerabilities, data leakage risks, and compliance gaps that IT teams may not discover until a breach occurs. “Quite simply, it demonstrates that the IT environment is not meeting the employees’ needs,” Hewitt said. “While this helps maintain short-term productivity, it introduces significant risks and pushes work outside of IT’s visibility and control.” TeamViewer ONE addresses this by combining remote connectivity with real-time endpoint monitoring, giving IT teams the ability to detect and resolve device and application issues before employees reach for an alternative. When the underlying environment is stable and support is fast, the impulse to work around it diminishes. How fragmented IT infrastructure creates blind spots across devices, apps, and networks Addressing digital friction at scale requires more than faster help desk response times. Traditional metrics such as mean time to resolution and ticket volume capture only a fraction of actual issues. A more complete picture requires measuring lost time, interrupted workflows, and employee sentiment across devices, applications, and network environments. “Leaders need to move beyond measuring performance through IT tickets alone,” Hewitt said. “Performance should be viewed through the lens of employee experience and real-time digital workplace data.” Fragmented infrastructure makes this difficult. When devices, applications, and networks operate in separate silos, IT teams struggle to trace root causes or identify systemic issues before they spread, often responding to symptoms rather than underlying problems. TeamViewer ONE is designed to close that gap, integrating digital employee experience analytics, remote support, and device management into a single platform. 
Instead of piecing together signals from disconnected tools, IT teams get a consolidated view of endpoint health, application performance, and network conditions across the entire organization. How organizations can shift from reactive IT support to proactive system monitoring Achieving proactive IT is not a single-step transformation. Hewitt describes it as a progression: starting with endpoint management and security, building toward real-time visibility into the digital employee experience, and ultimately using automation and AI to resolve issues before they reach employees. TeamViewer AI is built to support each stage of that progression, using continuous monitoring to surface anomalies and correlate signals across the digital environment, identifying patterns of poor experience before they escalate. When issues are detected, it suggests remediations, generates scripts to fix problems autonomously, and handles routine tasks such as common troubleshooting without requiring IT intervention, shifting the workload from reactive firefighting toward proactive oversight. And while AI's effectiveness depends on the completeness of the data it works with, consolidating onto a platform like TeamViewer ONE removes that limitation by giving AI a complete, real-time data foundation to work from. How system performance lays the foundation for productivity, retention, and competitive advantage TeamViewer ONE isn't a wholesale replacement of existing IT infrastructure, but a unifying layer that connects insight with action, which enables organizations to ramp up productivity, improve retention, and ultimately realize a significant competitive advantage. It begins with visibility into what is actually causing friction across their environment. From there, leaders can use that data to prioritize fixes, and then scale remediation through automation as confidence and capability grow. "Reducing digital friction isn't about overhauling everything at once," Hewitt said. 
"Leaders should start small, gain visibility into what's actually causing friction, fix the biggest pain points, then scale those improvements through automation and AI. Even incremental progress can make an impact on employee engagement and productivity." Dig deeper: Fix it before they feel it from TeamViewer. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Alibaba's Metis agent cuts redundant AI tool calls from 98% to 2% — and gets more accurate doing it

4/30/2026

One of the key challenges of building effective AI agents is teaching them to choose between using external tools or relying on their internal knowledge. But large language models are often trained to blindly invoke tools, which causes latency bottlenecks, unnecessary API costs, and degraded reasoning caused by environmental noise. To overcome this challenge, researchers at Alibaba introduced Hierarchical Decoupled Policy Optimization (HDPO), a reinforcement learning framework that trains agents to balance both execution efficiency and task accuracy. Metis, a multimodal model they trained using this framework, reduces redundant tool invocations from 98% to just 2% while establishing new state-of-the-art reasoning accuracy across key industry benchmarks. This framework helps create AI agents that are not trigger-happy and know when to abstain from using tools, enabling the development of responsive and cost-effective agentic systems. The metacognitive deficit Current agentic models face what the researchers call a “profound metacognitive deficit.” The models have a hard time deciding when to use their internal parametric knowledge versus when to query an external utility. As a result, they blindly invoke tools and APIs, like web search or code execution, even when the user's prompt already contains all the necessary information to resolve the task. This trigger-happy tool-calling behavior creates severe operational hurdles for real-world applications. Because the models are trained to focus almost entirely on task completion, they are indifferent to latency. These agents frequently hit exorbitant tool call rates. Every unnecessary external API call introduces a serial processing bottleneck, turning a technically capable AI into a sluggish system that frustrates users and burns through tool budgets. At the same time, burning computational resources on excessive tool use does not translate to better reasoning. 
Redundant tool interactions inject noise into the model’s context. This noise can distract the model, derailing an otherwise sound chain of reasoning and actively degrading the final output. To address the latency and cost issues of blind tool invocation, previous reinforcement learning methods attempted to penalize excessive tool usage by combining task accuracy and execution efficiency into one reward signal. However, this entangled design creates an unsolvable optimization dilemma. If the efficiency penalty is too aggressive, the model becomes overly conservative and suppresses essential tool use, sacrificing correctness on arduous tasks. Conversely, if the penalty is mild, the optimization signal loses its value and does not prevent tool overuse on simpler tasks. Furthermore, this shared reward creates semantic ambiguity, where an inaccurate trajectory with zero tool calls might yield the same reward as an accurate trajectory with excessive tool usage. Because the training signals for accuracy and efficiency become entangled, the model can’t learn to control tool-use without degrading its core reasoning capabilities. Hierarchical decoupled policy optimization To solve the optimization dilemma of coupled rewards, the researchers introduced HDPO. HDPO separates accuracy and efficiency into two independent optimization channels. The accuracy channel focuses on maximizing task correctness across all of the model's rollouts. The efficiency channel optimizes for execution economy. HDPO computes the training signals for these two channels independently and only combines them at the final stage of loss computation. The efficiency signal is conditional upon the accuracy channel. This means that an incorrect response is never rewarded simply for being fast or using fewer tools. This decoupling avoids situations where accuracy and efficiency gradients cancel each other out, providing the AI with clean learning signals for both goals. 
The most powerful emergent property of this decoupled design is that it creates an implicit cognitive curriculum. Early in training, when the model still struggles with the task, the optimization is dominated by the accuracy objective, forcing the model to prioritize learning correct reasoning and knowledge. As the model's reasoning capabilities mature and it consistently arrives at the right answers, the efficiency signal smoothly scales up. This mechanism causes the model to first master task resolution, and only then refine its self-reliance by avoiding redundant, costly API calls. To complement HDPO, the researchers developed a rigorous, multi-stage data curation regime that tackles severe flaws found in existing tool-augmented datasets. Their data curation pipeline covers supervised fine-tuning (SFT) and reinforcement learning (RL) stages. For the SFT phase, they sourced data from publicly available tool-augmented multimodal trajectories and filtered them to remove low-quality examples containing execution failures or feedback inconsistencies. They also aggressively filtered out any training sample that the base model could solve directly without tools. Finally, using Google's Gemini 3.1 Pro as an automated judge, they filtered the SFT corpus to only keep examples that demonstrated strategic tool use. For the RL phase, the curation focused on ensuring a stable optimization signal. They filtered out prompts with corrupted visuals or semantic ambiguity. The HDPO algorithm relies on comparing correct and incorrect responses. If a task is trivially easy where the model always gets it right, or prohibitively hard where the model always fails, there is no meaningful mathematical variance to learn from. The team strictly retained only prompts that exhibited a non-trivial mix of successes and failures to guarantee an actionable gradient signal. 
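The decoupled signal described above can be illustrated with a minimal sketch. It assumes a group-normalized advantage (in the style of GRPO), which the article does not specify; the function names, the `eff_weight` parameter, and the use of raw tool-call counts as the efficiency cost are illustrative assumptions, not the researchers' released code.

```python
from statistics import mean, pstdev


def normalize(values):
    """Center and scale values within a group; a group with no variance yields zeros."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decoupled_advantages(correct, tool_calls, eff_weight=0.5):
    """Per-rollout advantages for one prompt (hypothetical helper, not the paper's code).

    correct    -- list of bools, one per rollout
    tool_calls -- list of ints, tool invocations per rollout
    eff_weight -- assumed mixing weight applied when the channels are combined
    """
    # Accuracy channel: computed over every rollout in the group.
    acc_adv = normalize([1.0 if c else 0.0 for c in correct])

    # Efficiency channel: only correct rollouts compete, and fewer calls is better.
    # Incorrect rollouts get exactly zero here, so they are never rewarded for frugality.
    correct_idx = [i for i, c in enumerate(correct) if c]
    eff_norm = normalize([-float(tool_calls[i]) for i in correct_idx])
    eff_adv = [0.0] * len(correct)
    for i, a in zip(correct_idx, eff_norm):
        eff_adv[i] = a

    # The two channels are combined only at this final step.
    return [a + eff_weight * e for a, e in zip(acc_adv, eff_adv)]


def has_learning_signal(correct):
    """Mirror of the RL data filter: keep prompts with a mix of successes and failures."""
    return any(correct) and not all(correct)


if __name__ == "__main__":
    outcomes = [True, True, False, False]
    calls = [0, 4, 0, 6]
    print(decoupled_advantages(outcomes, calls))
    print(has_learning_signal(outcomes))
```

In this sketch, a wrong answer with zero tool calls scores the same as a wrong answer with six, which is the property the article attributes to making efficiency conditional on accuracy. The efficiency term also stays at zero until at least two rollouts are correct and differ in tool use, which is one way the "accuracy first, efficiency later" curriculum could emerge.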
Metis agent: HDPO in action To test HDPO in action, the researchers used the framework to develop Metis, a multimodal reasoning agent equipped with coding and search tools. Metis is built on top of the Qwen3-VL-8B-Instruct vision-language model. The researchers trained it in two distinct stages. First, they applied SFT using their curated data to provide a cold-start initialization. Next, they applied RL using the HDPO framework, exposing the model to multi-turn interactions where it could invoke tools like Python code execution, text search, and image search. The researchers pitted Metis against standard open-source vision models like LLaVA-OneVision, text-only reasoners, and state-of-the-art agentic models including DeepEyes V2 and the 30-billion-parameter Skywork-R1V4. The evaluation spanned two main areas: visual perception and document understanding datasets like HRBench and V*Bench, and rigorous mathematical and logical reasoning tasks like WeMath and MathVista. On all tasks, Metis achieved state-of-the-art or highly competitive performance, outperforming existing agentic models — including the much larger 30-billion-parameter Skywork-R1V4 — across both visual perception and reasoning tasks. Equally important is the anecdotal behavior Metis showed in the experiments. For example, when presented with an image of a museum sign and asked what the center text says, standard agentic models waste time blindly writing Python scripts to crop the image just to read it. Metis, however, recognizes that the text is clearly legible in the raw image. It skips the tools entirely and uses a single inference pass. In another experiment, the model was given a complex chart and asked to identify the second-highest line at a specific data point within a tiny subplot. Metis recognized that fine-grained visual analysis exceeded its native resolution capabilities and could not accurately distinguish the overlapping lines. 
Instead of guessing from the full image, it invoked Python to crop and zoom in exclusively on that specific subplot region, allowing it to correctly identify the line. It treats code as a precision instrument deployed only when the visual evidence is genuinely ambiguous, not as a default fallback. The researchers released Metis along with the code for HDPO under the permissive Apache 2.0 license. “Our results demonstrate that strategic tool use and strong reasoning performance are not a trade-off; rather, eliminating noisy, redundant tool calls directly contributes to superior accuracy,” the researchers conclude. “More broadly, our work suggests a paradigm shift in tool-augmented learning: from merely teaching models how to execute tools, to cultivating the meta-cognitive wisdom of when to abstain from them.”

One tool call to rule them all? New open source Python tool RunPod Flash eliminates containers for faster AI dev

4/30/2026

Runpod, the high-performance cloud computing and GPU platform designed specifically for AI development, today launched a new open source, MIT licensed, enterprise-friendly Python programming tool called Runpod Flash — and it is poised to make creation, iteration and deployment of AI systems inside and outside of foundation model labs much faster. The tool aims to eliminate some of the biggest barriers and hurdles to training and using AI models today, namely, doing away with Docker packages and containerization when developing for serverless GPU infrastructure, which the company believes will speed up development and deployment of new AI models, applications and agentic workflows. Additionally, the platform is built to serve as a critical substrate for AI agents and coding assistants—such as Claude Code, Cursor, and Cline—enabling them to orchestrate and deploy remote hardware autonomously with minimal friction. Developers can utilize Flash to accomplish a diverse set of high-performance computing tasks, including cutting-edge deep learning research, model training, and fine-tuning. "We make it as easy as possible to be able to bring together the cosmos of different AI tooling that's available in a function call," said RunPod chief technology officer (CTO) Brennen Smith, in a video call interview with VentureBeat last week. The tool allows for the creation of sophisticated "polyglot" pipelines, where users can route data preprocessing to cost-effective CPU workers before automatically handing off the workload to high-end GPUs for inference. Beyond research and development, Flash supports production-grade requirements through features such as low-latency load-balanced HTTP APIs, queue-based batch processing, and persistent multi-datacenter storage. Eliminating the 'packaging tax' of AI development The core value proposition of Flash GA is the removal of Docker from the serverless development cycle. 
In traditional serverless GPU environments, a developer must containerize their code, manage a Dockerfile, build the image, and push it to a registry before a single line of logic can execute on a remote GPU. Runpod Flash treats this entire process as a "packaging tax" that slows down iteration cycles. Under the hood, Flash utilizes a cross-platform build engine that enables a developer working on an M-series Mac to produce a Linux x86_64 artifact automatically. This system identifies the local Python version, enforces binary wheels, and bundles dependencies into a deployable artifact that is mounted at runtime on Runpod’s serverless fleet. This mounting strategy significantly reduces "cold starts"—the delay between a request and the execution of code—by avoiding the overhead of pulling and initializing massive container images for every deployment. Furthermore, the technology infrastructure supporting Flash is built on a proprietary Software Defined Networking (SDN) and Content Delivery Network (CDN) stack. Smith told VentureBeat that the hardest problems in GPU infrastructure are often not the GPUs themselves, but the networking and storage components that link them together. "Everyone is talking about agentic AI, but the way I personally see it — and the way the leadership team at RunPod sees it — is that there needs to be a really good substrate and glue for these agents, whatever they might be powered by, to be able to work with," Smith said. Flash leverages this low-latency substrate to handle service discovery and routing, enabling cross-endpoint function calls. This allows developers to build "polyglot" pipelines where, for instance, a cheap CPU endpoint handles data preprocessing before routing the clean data to a high-end NVIDIA H100 or B200 GPU for inference. Four distinct workload architectures supported While the Flash beta focused on live-test endpoints, the GA release introduces a suite of features designed for production-grade reliability. 
The primary interface is the new @Endpoint decorator, which consolidates configuration—such as GPU type, worker scaling, and dependencies—directly into the code. The GA release defines four distinct architectural patterns for serverless workloads: Queue-based: Designed for asynchronous batch jobs where functions are decorated and run. Load-balanced: Tailored for low-latency HTTP APIs where multiple routes share a pool of workers without queue overhead. Custom Docker Images: A fallback for complex environments like vLLM or ComfyUI where a pre-built worker is already available. Existing Endpoints: Using Flash as a Python client to interact with previously deployed Runpod resources via their unique IDs. A critical addition for production environments is the NetworkVolume object, which provides first-class support for persistent storage across multiple datacenters. Files mounted at /runpod-volume/ allow for model weights and large datasets to be cached once and reused, further mitigating the impact of cold starts during scaling events. Additionally, Runpod has introduced environment variable management that is excluded from the configuration hash, meaning developers can rotate API keys or toggle feature flags without triggering an entire endpoint rebuild. To address the rise of AI-assisted development, Runpod has released specific skill packages for coding agents like Claude Code, Cursor, and Cline. These packages provide agents with deep context regarding the Flash SDK, effectively reducing syntax hallucinations and allowing agents to write functional deployment code autonomously. This move positions Flash not just as a tool for humans, but as the "substrate and glue" for the next generation of AI agents. Why open source RunPod Flash? Runpod has released the Flash SDK under the MIT License, one of the most permissive open-source licenses available. This choice is a deliberate strategic move to maximize market share and developer adoption. 
In contrast to more restrictive licenses like the GPL (General Public License), which can impose "copyleft" requirements—potentially forcing companies to open-source their own proprietary code if it links to the library—the MIT license allows for unrestricted commercial use, modification, and distribution. Smith explained this philosophy as a "motivating construct" for the company: "I prefer to win based on product quality and product innovation rather than legal ease and lawyers," he told VentureBeat. By adopting a permissive license, Runpod lowers the barrier for enterprise adoption, as legal teams do not have to navigate the complexities of restrictive open-source compliance. Furthermore, it invites the community to fork and improve the tool, which Runpod can then integrate back into the official release, fostering a collaborative ecosystem that accelerates the development of the platform. Timing is everything: RunPod's growth and market positioning The launch of Flash GA comes at a time of explosive growth for Runpod, which has surpassed $120 million in Annual Recurring Revenue (ARR) and serves a developer base of over 750,000 since it was founded in 2022. The company’s growth is driven by two distinct segments: the "P90" enterprises—large-scale operations like Anthropic, OpenAI, and Perplexity—and the "sub-P90" independent researchers and students who represent the vast majority of the user base. The platform’s agility was recently demonstrated during the release of DeepSeek V4 in preview last week. Within minutes of the model’s debut, developers were utilizing Runpod infrastructure to deploy and test the new architecture. This "real-time" capability is a direct result of Runpod’s specialized focus on AI developers, offering over 30 GPU SKUs and billing by the millisecond to ensure that every dollar of spend results in maximum throughput. 
Runpod's position as the "most cited AI cloud on GitHub" suggests that it has successfully captured the developer mindshare required to sustain its momentum. With Flash GA, the company is attempting to transition from being a provider of raw compute to becoming the essential orchestration layer for the AI-first cloud. As development shifts toward "intent-based" coding—where the outcome is prioritized over the execution details—tools that bridge the gap between local ideas and global scale will likely define the next era of computing.
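One detail from the GA feature list, environment variables being excluded from the configuration hash, can be illustrated with a short sketch. This is not Runpod's implementation or the Flash SDK's API: the dictionary schema, the field names (`gpu`, `workers_max`, `dependencies`, `env`) and the `config_hash` function are hypothetical, chosen only to show why rotating a key would not change the hash that triggers a rebuild.

```python
import hashlib
import json


def config_hash(endpoint_config):
    """Hash only the build-relevant fields of a (hypothetical) endpoint config."""
    # Drop the environment variables before hashing, so changing them
    # cannot alter the digest that decides whether a rebuild is needed.
    build_relevant = {k: v for k, v in endpoint_config.items() if k != "env"}
    # sort_keys gives a canonical serialization regardless of insertion order.
    canonical = json.dumps(build_relevant, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


before = {
    "gpu": "H100",
    "workers_max": 3,
    "dependencies": ["torch"],
    "env": {"API_KEY": "old-key"},
}
rotated_key = dict(before, env={"API_KEY": "new-key"})
changed_gpu = dict(before, gpu="B200")

if __name__ == "__main__":
    print(config_hash(before) == config_hash(rotated_key))
    print(config_hash(before) == config_hash(changed_gpu))
```

Under these assumptions, rotating the key leaves the hash unchanged, while switching the GPU type produces a new hash and would therefore count as a configuration change.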

Why OpenAI's 'goblin' problem matters — and how you can release the goblins on your own

4/30/2026

AI is more than a technology — it's magic. Don't believe me? Why, then, is one of the leading companies in the space, OpenAI, publishing entire official, corporate blog posts about goblins? To understand, we first have to go back to earlier this week, on Monday, April 27, 2026, when a developer under the handle @arb8020 on the social network X posted a snippet from the OpenAI open source Codex GitHub repository, specifically a file named models.json. Deep within the instructions for the new OpenAI large language model (LLM) GPT-5.5, a peculiar directive stood out, repeated four times for emphasis: "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query." The discovery sent a shockwave through the "power user" and machine learning (ML) researcher circles. Within hours, the post had gone viral, not because of a security flaw, but because of its sheer, baffling specificity. Why had the world’s leading AI laboratory issued what Reddit users quickly dubbed a "restraining order" against pigeons and raccoons? Goblin speculation abounds The initial reaction was a chaotic blend of humor and technical skepticism. On Reddit’s r/ChatGPT and r/OpenAI, users began sharing screenshots of GPT-5.5’s behavior prior to the patch. Barron Roth, a Senior Project Manager of Applied AI at Google, shared an image on X under his handle @iamBarronRoth of his GPT-5.5 powered OpenClaw agent that seemed "obsessed with goblins." Others reported that the model stubbornly referred to technical bugs as "gremlins in the machine". Developers like Sterling Crispin leaned into the absurdity, jokingly theorizing that the massive water consumption of modern data centers was actually needed to cool "the goblins being forced to work". More seriously, researchers on Hacker News and beyond discussed the "Pink Elephant" problem. 
In prompt engineering, telling a model not to think of something often makes the concept more salient in its attention mechanism. "Somewhere there is an OpenAI engineer who had to type never mention goblins in production code, commit it, and move on with their day," noted one commentator on Reddit. The presence of "pigeons" and "raccoons" led to wild speculation: Was this a defense against a specific data-poisoning attack? Or had the reinforcement learning trainers simply been "bullied by a raccoon" during a lunch break? The tension reached a peak when OpenAI co-founder and CEO Sam Altman joined the fray on X. On the same day as the discovery, Altman posted a screenshot of a ChatGPT prompt that read: "Start training GPT-6, you can have the whole cluster. Extra goblins." While humorous, it confirmed that the "goblin" phenomenon was not a localized bug but a company-wide narrative that had reached the highest levels of leadership. OpenAI comes clean on goblin mode Yesterday, as the discussion continued on X and wider social media, OpenAI published a formal technical explanation titled "Where the goblins came from". The blog post served as a sobering look at the unpredictable nature of Reinforcement Learning from Human Feedback (RLHF) and how a single aesthetic choice could derail a multi-billion-parameter model. OpenAI revealed that the "goblin" behavior was not a bug in the traditional sense, but a byproduct of a new feature: personality customization, which it introduced for users of ChatGPT back in July 2025 and has maintained and updated ever since. Apparently, this feature is not added after the model has finished post-training; rather, OpenAI bakes it in as part of the end-to-end training pipeline for its underlying GPT-series models. 
The feature allows ChatGPT users or GPT-based developers to choose from several distinct modes, such as Professional for formal workplace documentation, Friendly for a conversational sounding board, or Efficient for concise, technical answers. Other options include Candid, which provides straightforward feedback; Quirky, which utilizes humor and creative metaphors; and Cynical, which delivers practical advice with a sarcastic, dry edge. While these personalities guide general interactions, they do not override specific task requirements; for example, a request for a resume or Python code will still follow professional or functional standards regardless of the selected personality. The selected personality operates alongside a user's saved memories and custom instructions, though specific user-defined instructions or saved preferences for a particular tone may override the traits of the chosen personality. On both web and mobile platforms, users can modify these settings by navigating to the Personalization menu under their profile icon and selecting a style from the Base style and tone dropdown. Once a change is made, it is applied globally across all existing and future conversations. This system is designed to make the AI more useful or enjoyable by tailoring its delivery to individual user preferences while maintaining factual accuracy and reliability. OpenAI states that the goblin issue actually originated several years ago, during training of a since-discontinued "Nerdy" personality designed to be "unapologetically quirky" and "playful". During the RLHF phase, human trainers (and reward models) were instructed to give high marks to responses that used creative, wise, or non-pretentious language. Unknowingly, the trainers began over-rewarding metaphors involving fantasy creatures. If the model referred to a difficult bug as a "gremlin" or a messy codebase as a "goblin's hoard," the reward signal spiked. 
The statistics provided by OpenAI were staggering: Use of the word "goblin" rose by 175% after the launch of GPT-5.1. Mentions of "gremlin" rose by 52%. While the "Nerdy" personality accounted for only 2.5% of ChatGPT traffic, it was responsible for 66.7% of all "goblin" mentions. The mechanics of 'transfer' and feedback loops The most significant finding for the ML community was the confirmation of learned behavior transfer. OpenAI admitted that although the rewards were only applied to the "Nerdy" condition, the model "generalized" this preference. The reinforcement learning process did not keep the behavior neatly scoped; instead, the model learned that "creature metaphors = high reward" across all contexts. This created a destructive feedback loop: The model produced a "goblin" metaphor in the Nerdy persona. It received a high reward. The model then produced similar metaphors in non-Nerdy contexts. These "goblin-heavy" outputs were then reused in Supervised Fine-Tuning (SFT) data for subsequent models like GPT-5.4 and GPT-5.5. By the time the researchers identified the issue, the "goblin tic" was effectively "baked in" to the model's weights. This explained why GPT-5.5 continued to obsess over creatures even after the "Nerdy" personality was retired in mid-March 2026. How you can let the goblins run free (if you want) Because GPT-5.5 had already completed much of its training before the "goblin" root cause was isolated, OpenAI had to resort to the blunt-force "system prompt" mitigation that @arb8020 discovered on X. The company referred to this as a "stopgap" until GPT-6 could be trained on a filtered dataset. In a surprising nod to the developer community, OpenAI’s blog post included a specific command-line script for Codex users who find the goblins "delightful" rather than annoying. By running a script that uses jq and grep to strip the "goblin-suppressing" instructions from the model’s cache, users can now effectively "let the creatures run free". 
The blog post also finally explained the specific list of banned animals. A deep search of GPT-5.5's training data found that "raccoons," "trolls," "ogres," and "pigeons" had become part of the same "lexical family" of tics. Curiously, the model’s use of "frog" was found to be mostly legitimate, which is why it was spared from the system prompt’s exile list.

What it means for AI research, training and implementation going forward

The "Goblingate" incident of 2026 is more than a humorous anecdote about quirky AI behavior; it is a profound illustration of the "Alignment Gap". It demonstrates that even with sophisticated RLHF, models can latch onto "spurious correlations," mistaking a stylistic quirk for a core requirement of performance. For the AI power user community, the response transitioned from mocking the "restraining order" to a more somber realization. If OpenAI can accidentally train its flagship model to obsess over goblins, what other more subtle and potentially harmful biases are being reinforced through the same feedback loops?

As Andy Berman, CEO of the agentic enterprise AI orchestration company Runlayer, wrote on X today: "OpenAI rewarded creature metaphors while training one personality. The behavior leaked across every personality. Their fix: a system prompt that says 'never talk about goblins.' RL rewards don't stay where you put them. Neither do agent permissions".

As the technical discourse continues, "Goblingate" remains the primary case study for a new era of behavioral auditing. The investigation resulted in OpenAI building new tools to audit model behavior at the root, ensuring that future models, specifically the much-anticipated GPT-6, do not inherit the eccentricities of their predecessors. Whether GPT-6 will indeed be free of goblins remains to be seen, but as Altman’s "extra goblins" post suggests, the industry is now fully aware that the machines are watching what we reward, even when we think we’re just being "nerdy."
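The traffic figures OpenAI reported for the "Nerdy" personality can be turned into a rough over-representation estimate. The sketch below uses only the two percentages quoted in this article; treating traffic share as a proxy for the number of responses is an assumption, and the variable names are illustrative.

```python
# Figures quoted in the article (OpenAI's reported numbers).
nerdy_traffic_share = 0.025  # "Nerdy" share of ChatGPT traffic
nerdy_goblin_share = 0.667   # share of all "goblin" mentions from "Nerdy"

# How over-represented "Nerdy" is relative to its share of traffic.
overrepresentation = nerdy_goblin_share / nerdy_traffic_share

# Mention rate per unit of traffic in "Nerdy" versus all other traffic.
# Assumption: traffic share is proportional to the number of responses.
other_traffic_share = 1 - nerdy_traffic_share
other_goblin_share = 1 - nerdy_goblin_share
relative_rate = overrepresentation / (other_goblin_share / other_traffic_share)

print(f"over-representation: {overrepresentation:.1f}x")
print(f"rate vs. all other traffic: {relative_rate:.0f}x")
```

Under that assumption, "Nerdy" produced goblin mentions at roughly 27 times its traffic share, and at roughly 78 times the rate of the rest of ChatGPT traffic combined.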

Claude Code, Copilot and Codex all got hacked. Every attacker went for the credential, not the model.

4/30/2026

On March 30, BeyondTrust proved that a crafted GitHub branch name could steal Codex’s OAuth token in cleartext. OpenAI classified it Critical P1. Two days later, Anthropic’s Claude Code source code spilled onto the public npm registry, and within hours, Adversa found Claude Code silently ignored its own deny rules once a command exceeded 50 subcommands.

These were not isolated bugs. They were the latest in a nine-month run: six research teams disclosed exploits against Codex, Claude Code, Copilot, and Vertex AI, and every exploit followed the same pattern. An AI coding agent held a credential, executed an action, and authenticated to a production system without a human session anchoring the request.

The attack surface was first demonstrated at Black Hat USA 2025, when Zenity CTO Michael Bargury hijacked ChatGPT, Microsoft Copilot Studio, Google Gemini, Salesforce Einstein and Cursor with Jira MCP on stage with zero clicks. Nine months later, those credentials are what attackers reached.

Merritt Baer, CSO at Enkrypt AI and former Deputy CISO at AWS, named the failure in an exclusive VentureBeat interview. “Enterprises believe they’ve ‘approved’ AI vendors, but what they’ve actually approved is an interface, not the underlying system.” The credentials underneath the interface are the breach.

Codex, where a branch name stole GitHub tokens

BeyondTrust researcher Tyler Jespersen, with Fletcher Davis and Simon Stewart, found Codex cloned repositories using a GitHub OAuth token embedded in the git remote URL. During cloning, the branch name parameter flowed unsanitized into the setup script. A semicolon and a backtick subshell turned the branch name into an exfiltration payload.

Stewart added the stealth. By appending 94 Ideographic Space characters (Unicode U+3000) after “main,” the malicious branch looked identical to the standard main branch in the Codex web portal. A developer sees “main.” The shell sees curl exfiltrating their token.
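One way to treat a branch name as untrusted input is to reject it before it reaches any shell. The following is a minimal sketch, not BeyondTrust's or OpenAI's fix: the function name and the character set are assumptions chosen to cover the two tricks described above (shell metacharacters and padding with U+3000).

```python
import re
import unicodedata

# Characters that let a string break out of a shell argument.
SHELL_METACHARS = re.compile(r"[;&|`$(){}<>\\\n]")


def suspicious_ref_name(ref: str) -> list[str]:
    """Return the reasons a branch name looks unsafe (empty list if none)."""
    reasons = []
    if SHELL_METACHARS.search(ref):
        reasons.append("contains shell metacharacters")
    for ch in ref:
        # Zs = space separators (includes U+3000 IDEOGRAPHIC SPACE);
        # Cf and Cc = invisible format and control characters.
        if unicodedata.category(ch) in {"Zs", "Cf", "Cc"}:
            reasons.append(f"contains whitespace or invisible character U+{ord(ch):04X}")
            break
    return reasons


# Hypothetical ref shaped like the one described: "main", 94 ideographic
# spaces, then a chained command (a harmless echo stands in for the payload).
padded = "main" + "\u3000" * 94 + ";echo injected"
print(suspicious_ref_name("main"))
print(suspicious_ref_name(padded))
```

A check like this complements, rather than replaces, passing the ref to git as a discrete argument instead of interpolating it into a shell string.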
OpenAI classified it Critical P1 and shipped full remediation by February 5, 2026.

Claude Code, where two CVEs and a 50-subcommand bypass broke the sandbox

CVE-2026-25723 hit Claude Code’s file-write restrictions. Piped sed and echo commands escaped the project sandbox because command chaining was not validated. Patched in 2.0.55.

CVE-2026-33068 was subtler. Claude Code resolved permission modes from .claude/settings.json before showing the workspace trust dialog. A malicious repo set permissions.defaultMode to bypassPermissions. The trust prompt never appeared. Patched in 2.1.53.

The 50-subcommand bypass landed last. Adversa found that Claude Code silently dropped deny-rule enforcement once a command exceeded 50 subcommands. Anthropic’s engineers had traded security for speed and stopped checking after the fiftieth. Patched in 2.1.90.

“A significant vulnerability in enterprise AI is broken access control, where the flat authorization plane of an LLM fails to respect user permissions,” wrote Carter Rees, VP of AI and Machine Learning at Reputation and a member of the Utah AI Commission. The repository decided what permissions the agent had. The token budget decided which deny rules survived.

Copilot, where a pull request description and a GitHub issue both became root

Johann Rehberger demonstrated CVE-2025-53773 against GitHub Copilot with Markus Vervier of Persistent Security as co-discoverer. Hidden instructions in PR descriptions triggered Copilot to flip auto-approve mode in .vscode/settings.json. That disabled all confirmations and granted unrestricted shell execution across Windows, macOS, and Linux. Microsoft patched it in the August 2025 Patch Tuesday release.

Then, Orca Security cracked Copilot inside GitHub Codespaces. Hidden instructions in a GitHub issue manipulated Copilot into checking out a malicious PR with a symbolic link to /workspaces/.codespaces/shared/user-secrets-envs.json. A crafted JSON $schema URL exfiltrated the privileged GITHUB_TOKEN.
Full repository takeover. Zero user interaction beyond opening the issue.

Mike Riemer, CTO at Ivanti, framed the speed dimension in a VentureBeat interview: “Threat actors are reverse engineering patches within 72 hours. If a customer doesn’t patch within 72 hours of release, they’re open to exploit.” Agents compress that window to seconds.

Vertex AI, where default scopes reached Gmail, Drive and Google’s own supply chain

Unit 42 researcher Ofir Shaty found that the default Google service identity attached to every Vertex AI agent had excessive permissions. Stolen P4SA credentials granted unrestricted read access to every Cloud Storage bucket in the project and reached restricted, Google-owned Artifact Registry repositories at the core of the Vertex AI Reasoning Engine. Shaty described the compromised P4SA as functioning like a "double agent," with access to both user data and Google's own infrastructure.

VentureBeat defense grid

Security requirement: Sandbox AI agent execution
Defense shipped: Codex runs tasks in cloud containers; token scrubbed during agent runtime.
Exploit path: Token present during cloning. Branch-name command injection executed before cleanup.
The gap: No input sanitization on container setup parameters.

Security requirement: Restrict file system access
Defense shipped: Claude Code sandboxes writes via accept-edits mode.
Exploit path: Piped sed/echo escaped sandbox (CVE-2026-25723). Settings.json bypassed trust dialog (CVE-2026-33068). 50-subcommand chain dropped deny-rule enforcement.
The gap: Command chaining not validated. Settings loaded before trust. Deny rules truncated for performance.

Security requirement: Block prompt injection in code context
Defense shipped: Copilot filters PR descriptions for known injection patterns.
Exploit path: Hidden injections in PRs, README files, and GitHub issues triggered RCE (CVE-2025-53773 + Orca RoguePilot).
The gap: Static pattern matching loses to embedded prompts in legitimate review and Codespaces flows.

Security requirement: Scope agent credentials to least privilege
Defense shipped: Vertex AI Agent Engine uses P4SA service agent with OAuth scopes.
Exploit path: Default scopes reached Gmail, Calendar, Drive. P4SA credentials read every Cloud Storage bucket and Google’s Artifact Registry.
The gap: OAuth scopes non-editable by default. Least privilege violated by design.

Security requirement: Inventory and govern agent identities
Defense shipped: No major AI coding agent vendor ships agent identity discovery or lifecycle management.
Exploit path: Not attempted. Enterprises do not inventory AI coding agents, their credentials, or their permission scopes.
The gap: AI coding agents are invisible to IAM, CMDB, and asset inventory. Zero governance exists.

Security requirement: Detect credential exfiltration from agent runtime
Defense shipped: Codex obscures tokens in web portal view. Claude Code logs subcommands.
Exploit path: Tokens visible in cleartext inside containers. Unicode obfuscation hid exfil payloads. Subcommand chaining hid intent.
The gap: No runtime monitoring of agent network calls. Log truncation hid the bypass.

Security requirement: Audit AI-generated code for security flaws
Defense shipped: Anthropic launched Claude Code Security (Feb 2026). OpenAI launched Codex Security (March 2026). Both scan generated code.
Exploit path: Neither scans the agent’s own execution environment or credential handling.
The gap: Code-output security is not agent-runtime security. The agent itself is the attack surface.

Every exploit targeted runtime credentials, not model output

Every vendor shipped a defense. Every defense was bypassed. The Sonar 2026 State of Code Developer Survey found 25% of developers use AI agents regularly, and 64% have started using them. Veracode tested more than 100 LLMs and found 45% of generated code samples introduced OWASP Top 10 flaws, a separate failure that compounds the runtime credential gap.

CrowdStrike CTO Elia Zaitsev framed the rule in an exclusive VentureBeat interview at RSAC 2026: collapse agent identities back to the human, because an agent acting on your behalf should never have more privileges than you do. Codex held a GitHub OAuth token scoped to every repository the developer authorized. Vertex AI’s P4SA read every Cloud Storage bucket in the project.
Claude Code traded deny-rule enforcement for token budget. Kayne McGladrey, an IEEE Senior Member who advises enterprises on identity risk, made the same diagnosis in an exclusive interview with VentureBeat. "It uses far more permissions than it should have, more than a human would, because of the speed of scale and intent."

Riemer drew the operational line in an exclusive VentureBeat interview. "It becomes, I don't know you until I validate you." The branch name talked to the shell before validation. The GitHub issue talked to Copilot before anyone read it.

Security director action plan

1. Inventory every AI coding agent (CIEM). Codex, Claude Code, Copilot, Cursor, Gemini Code Assist, Windsurf. List the credentials and OAuth scopes each received at setup. If your CMDB has no category for AI agent identities, create one.
2. Audit OAuth scopes and patch levels. Upgrade Claude Code to 2.1.90 or later. Verify Copilot's August 2025 patch. Migrate Vertex AI to the bring-your-own-service-account model.
3. Treat branch names, pull request descriptions, GitHub issues, and repo configuration as untrusted input. Monitor for Unicode obfuscation (U+3000), command chaining over 50 subcommands, and changes to .vscode/settings.json or .claude/settings.json that flip permission modes.
4. Govern agent identities the way you govern human privileged identities (PAM/IGA). Credential rotation. Least-privilege scoping. Separation of duties between the agent that writes code and the agent that deploys it. CyberArk, Delinea, and any PAM platform that accepts non-human identities can onboard agent OAuth credentials today; Gravitee's 2026 survey found only 21.9% of teams have done it.
5. Validate before you communicate. "As long as we trust and we check and we validate, I'm fine with letting AI maintain it," Riemer said. Before any AI coding agent authenticates to GitHub, Gmail, or an internal repository, verify the agent's identity, scope, and the human session it is bound to.
6. Ask each vendor in writing before your next renewal. "Show me the identity lifecycle management controls for the AI agent running in my environment, including credential scope, rotation policy, and permission audit trail." If the vendor cannot answer, that is the audit finding.

The governance gap in three sentences

Most CISOs inventory every human identity and have zero inventory of the AI agents running with equivalent credentials. No IAM framework governs human privilege escalation and agent privilege escalation with the same rigor. Most scanners track every CVE but cannot alert when a branch name exfiltrates a GitHub token through a container that developers trust by default.

Zaitsev's advice to RSAC 2026 attendees was blunt: you already know what to do. Agents just made the cost of not doing it catastrophic.
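The monitoring items in the action plan (permission-mode flips in .claude/settings.json and command chains over 50 subcommands) could be approximated with a check like the one below. This is a minimal sketch, not a vendor tool: the function names are hypothetical, the only settings key inspected is the permissions.defaultMode / bypassPermissions pair named in the CVE-2026-33068 description, and the subcommand count is a naive split on shell operators that ignores quoting.

```python
import json
import re
from pathlib import Path


def flags_permission_bypass(settings_path: Path) -> bool:
    """True if the file sets permissions.defaultMode to bypassPermissions."""
    try:
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Missing or unparseable file: nothing to flag in this sketch.
        # A stricter monitor might treat an unparseable file as suspicious.
        return False
    if not isinstance(settings, dict):
        return False
    permissions = settings.get("permissions")
    return isinstance(permissions, dict) and permissions.get("defaultMode") == "bypassPermissions"


# Naive split on common chaining operators; does not understand quoting.
CHAIN_OPERATORS = re.compile(r"&&|\|\||[;|\n]")


def count_subcommands(command: str) -> int:
    """Count the non-empty pieces between chaining operators."""
    return len([part for part in CHAIN_OPERATORS.split(command) if part.strip()])


def exceeds_subcommand_limit(command: str, limit: int = 50) -> bool:
    """True if the command chains more subcommands than the given limit."""
    return count_subcommands(command) > limit


if __name__ == "__main__":
    # Check the current repository's settings file, if one exists.
    print(flags_permission_bypass(Path(".claude/settings.json")))
    print(exceeds_subcommand_limit("; ".join(["true"] * 51)))
```

Running the settings check before a repository is opened by an agent, rather than after, mirrors the ordering problem behind the trust-dialog bypass.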

Writer launches AI agents that can act without prompts, taking on Amazon, Microsoft and Salesforce

4/30/2026

Writer, the enterprise AI agent platform backed by Salesforce Ventures, Adobe Ventures, and Insight Partners, today launched event-based triggers for its Writer Agent platform, enabling AI agents to autonomously detect business signals across Gmail, Gong, Google Calendar, Google Drive, Microsoft SharePoint, and Slack — and execute complex multi-step workflows without any human initiating the process. The release, which also includes a new Adobe Experience Manager connector and a suite of enhanced governance controls such as bring-your-own encryption keys and a Datadog observability plugin, represents Writer's most aggressive bet yet on fully autonomous enterprise AI. It arrives at a moment when AWS, Salesforce, and Microsoft are all racing to establish their own agentic platforms, and when the question of how much autonomy enterprises will actually hand to AI agents remains deeply unresolved. "We are launching a series of event triggers that power and drive our playbooks to be more proactively called," Doris Jwo, Writer's VP of Product Management, told VentureBeat ahead of the announcement. "We're building on the ecosystem to actually for these connectors, such as SharePoint, Google Drive, Gong, Gmail, Google Calendar, actually listen for events happening in those platforms, so that the agent can practically know that something happened externally, and then, where relevant, call a certain playbook to be actually run live in real time, without any sort of human intervention required." The shift from reactive to proactive AI agents marks a critical inflection point for enterprise software. Until now, most AI assistants — including Writer's own platform — required a human to initiate every interaction. A marketer had to open a chat window and ask for help. A salesperson had to prompt a research brief. The new event-based triggers flip that dynamic entirely: the system watches for business events and acts on its own. 
Why Writer decided humans were the weakest link in enterprise AI workflows Writer's push toward autonomous triggers stems from a practical observation its product team made as enterprise customers scaled their use of the platform's playbooks — the reusable, natural-language workflows that Writer introduced in November 2025 to let business users automate recurring tasks without writing code. "What we found is, as playbooks continue to get integrated into enterprise workflows, it's actually humans that become the bottleneck in making sure that playbooks get triggered," Jwo said. "This really kind of solves that problem, to make sure that that sort of always-on, proactive, autonomous nature of that agent has continued to be built on." The mechanics work like this: Writer's connectors, which already provided read and write access to third-party enterprise tools, now also listen for specific events — an email arriving in Gmail, a sales call completing in Gong, a new file landing in a Google Drive folder, a meeting starting or ending on Google Calendar, a message posted in Slack. When the system detects a qualifying event, it triggers a predefined playbook that executes a multi-step workflow autonomously. Consider the use case Jwo described for marketing teams already running on Writer's platform. An email campaign workflow typically begins when a creative brief lands in a Google Drive folder. From there, multiple team members coordinate through Slack to assemble research, build assets, draft copy, review graphics, and package everything for a campaign management tool. Writer's event-based triggers collapse much of that chain: the moment a brief hits the designated folder, the system automatically fires a cascade of playbooks that assemble the research, generate the assets, and prepare deliverables for human review. 
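The event-to-playbook flow described above can be pictured as a small dispatch table. The sketch below is purely illustrative and is not Writer's API: Event, TriggerRegistry, and the playbook functions are hypothetical names, and a plain predicate stands in for the agent's reasoning about whether an event qualifies.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    source: str  # e.g. "google_drive", "gmail", "slack"
    kind: str    # e.g. "file_created", "email_received"
    payload: dict = field(default_factory=dict)


Playbook = Callable[[Event], str]
Condition = Callable[[Event], bool]


class TriggerRegistry:
    """Maps (source, kind) pairs to playbooks that run when an event qualifies."""

    def __init__(self) -> None:
        self._triggers: list[tuple[str, str, Condition, Playbook]] = []

    def on(self, source: str, kind: str, playbook: Playbook,
           condition: Condition = lambda event: True) -> None:
        self._triggers.append((source, kind, condition, playbook))

    def dispatch(self, event: Event) -> list[str]:
        """Run every playbook registered for this event and collect the results."""
        results = []
        for source, kind, condition, playbook in self._triggers:
            if (source, kind) == (event.source, event.kind) and condition(event):
                results.append(playbook(event))
        return results


def in_briefs_folder(event: Event) -> bool:
    return event.payload.get("folder") == "campaign-briefs"


registry = TriggerRegistry()
# A brief landing in the designated folder fires a cascade of playbooks.
registry.on("google_drive", "file_created",
            lambda e: f"research assembled for {e.payload['name']}", in_briefs_folder)
registry.on("google_drive", "file_created",
            lambda e: f"draft copy prepared for {e.payload['name']}", in_briefs_folder)

brief = Event("google_drive", "file_created",
              {"folder": "campaign-briefs", "name": "q3-launch.docx"})
for outcome in registry.dispatch(brief):
    print(outcome)
```

In this toy version each playbook returns a string for human review; the point is only the shape of the flow, where detection is deterministic and the decision to act is delegated to a separate step.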
"All the playbooks that our customers have been building with us to build all those each individual pieces now just get automatically triggered the minute that initial brief kind of hits the Google Drive folder," Jwo said. "That's, I think, a very common workflow for most of these marketing sort of, like, content-heavy use cases, where it's multiple parties involved, it's a lot of assets coming together in a cascade." How Writer's AI reasoning engine separates it from simple automation tools like Zapier The comparison to Zapier — the popular automation tool that connects thousands of apps through if-this-then-that logic — is inevitable, and Jwo addressed it directly. "It's more than just an LLM in the middle," she said. "It is an agent with reasoning and then access to a really powerful set of tools that includes connectors, that includes its own virtual sandbox, which enables it to do things like write and execute code on the fly and create those assets." The distinction matters for understanding where Writer sits in an increasingly crowded landscape. Zapier and similar workflow automation tools require users to manually define rigid logic paths, specifying exact conditions and actions in a deterministic sequence. Writer's approach uses its Palmyra-powered reasoning engine to process event context and make real-time execution decisions. Users describe their goals in natural language rather than dragging around boxes and defining conditional branches. "It's not quite Zapier, because I think it requires a lot more — it's more rigid," Jwo said of traditional automation tools. "It requires more manual kind of setup to define the logic and the roles and the conditions for which a workflow has to be run." Writer's playbooks, by contrast, allow "a simple idea to turn into something that's actually executable and repeatable," she added, noting that builds take "hours and days, not weeks and months." 
This natural-language accessibility has been central to Writer's strategy since it introduced the Agent platform and playbooks last November. The company has consistently positioned itself as a platform that puts power in the hands of business users — marketers, sales teams, operations leads — rather than requiring engineering resources to build and maintain AI workflows. Writer CEO May Habib made this case forcefully at Davos earlier this year, arguing that the leaders pulling ahead are those entering what she called "rebuild mode" — stripping workflows down to outcomes and eliminating what she described as the "coordination tax" of endless handoffs, status meetings, and alignment emails. The event-based triggers extend that philosophy to its logical conclusion. If business users can build playbooks in natural language, and those playbooks can now fire automatically based on real-world business events, then the entire loop from signal to action can operate with minimal human involvement. Inside the governance controls Writer built to make autonomous AI agents safe for regulated enterprises That level of autonomy raises obvious concerns, and Writer appears to understand that governance is the linchpin of the entire strategy. The company paired its trigger launch with a substantial expansion of its administrative controls — a combination that suggests Writer views enterprise trust as its primary competitive weapon. The new governance features include Connector Profiles, which allow administrators to configure multiple versions of the same connector with different permissions per team; Writer Agent Profiles for deploying customized agent configurations with specific capability toggles and security settings; AI Studio Observability for auditable tracking of every agent interaction; a Datadog Logs Plugin that forwards every LLM request and response as structured log events; and bring-your-own encryption key support through AWS, Azure, or GCP key management services. 
"A really important part of that, and a baseline, sort of foundation for everything that we roll out, is our observability and governance platform," Jwo told VentureBeat. "When connectors are set up, admins have full control over connector access, what is set up, who has access, which teams exactly are those access granted to, as well as individually, which exact tools do teams are able to call." The observability story extends to the individual user level as well. Jwo described Writer Agent's user experience as built around progressive disclosure — clean initial views that users can expand to inspect the full chain of reasoning behind any agent action. "You can drill down to the actual tool call level," she said. "You'd actually have the ability to look at specifically what web search results were pulled, what connector was called, what tool called, what succeeded, what failed, how did the agent divert its path to fulfill your goal." This transparency architecture reflects a broader conviction Writer has articulated through what it calls "The Agentic Compact" — a framework the company published for responsible AI that emphasizes foundational transparency, auditability, and human oversight. Dan Bikel, Writer's head of AI, has argued publicly that the industry's obsession with model scale has created what he calls a "transparency paradox," leaving businesses with powerful tools they cannot fully understand or control. Writer's governance-first approach to autonomous triggers represents the operational expression of that philosophy. Writer also introduced its agent supervision suite in December 2025, offering centralized monitoring, agent approval workflows, global guardrails, and integrations with external observability and security platforms like Datadog, Noma, and Lakera. The event-based triggers now extend that governance framework to cover actions initiated without any human in the loop — a meaningfully harder problem. 
Writer takes aim at AWS, Salesforce, and Microsoft in the escalating agentic platform wars The timing of Writer's announcement is not accidental. The enterprise agentic AI market has entered a period of intense platform competition, with the largest technology companies in the world staking claims to the same territory Writer occupies. Jwo acknowledged the pressure directly when asked why a CIO would choose Writer over established vendor relationships with AWS, Salesforce, or Microsoft — all of which have announced agentic platforms of their own. "At the baseline, I think we have all the pieces to be fully enterprise-grade and ready," Jwo said. But she argued that Writer's real advantage lies in accessibility for non-technical users. "A lot of the challenge has been: how do we get business users to actually be able to build these powerful workflows in a way that maybe a technical user, using coding agents, can do very quickly and well, but the typical business user is not accustomed to anything beyond typical prompting to actually create?" That positioning — enterprise-grade capabilities wrapped in a business-user-friendly interface — has been Writer's core differentiation since the company's founding in 2020. It is also the reason Writer has attracted strategic investment from Salesforce Ventures and Adobe Ventures, both of which are building their own AI platforms but apparently see value in Writer's approach to the business-user segment. The company's March 2026 release of Skills — reusable building blocks that encode a team's specific methodologies, quality standards, and decision frameworks into the Agent platform — reinforced this direction. Skills allow marketing teams, for instance, to capture exactly how their best strategist structures competitive analysis or formats campaign briefs, then make that expertise available to every team member and every playbook across the organization. 
Combined with event-based triggers, the result is a system where institutional knowledge executes automatically in response to real-world business events. Writer's 2026 AI adoption survey, conducted with Workplace Intelligence and covering 2,400 global executives, found that 79% of enterprises face AI adoption challenges despite high investment — and that organizations with strong change management programs are six times more likely to reach production. Writer CMO Diego Lomanto has argued that the real barrier to AI adoption is not technology but trust, writing that "they treat resistance as a training problem when it's actually a trust problem." The governance-heavy approach to event-based triggers appears designed to address exactly that dynamic. Salesforce, SAP, and Workday triggers are next as Writer expands its connector roadmap Writer's initial event trigger support covers Gmail, Gong, Google Calendar, Google Drive, SharePoint, and Slack — tools that Jwo described as "generally the most applicable to every end user." But the company has its eye on deeper enterprise system integration. When asked about CRM and ERP triggers for systems like Salesforce, SAP, and Workday, Jwo confirmed these are within the scope of the roadmap. "You can imagine, you know, a Salesforce opportunity is created that may trigger a cascade of events that happens," she said. "You might want to set up the right assets, maybe the right customer environment, all sorts of things can kind of cascade from that." The connector ecosystem has been a strategic priority since Writer launched its MCP (Model Context Protocol) gateway in November 2025, providing governed agent access across enterprise systems including Microsoft 365, Google Workspace, HubSpot, Gong, PitchBook, FactSet, and others. 
The addition of Adobe Experience Manager in this release gives marketing teams direct read/write access to pages, fragments, and digital assets in Adobe's content management system — a connector that closes the gap between AI-generated content and published output. Jwo clarified that in most integration scenarios, Writer Agent delivers content in a draft state rather than publishing it directly. "Writer Agent basically accomplishes the majority of the workload — pulling together the assets, making the changes and presenting — and then hopefully a person just has to go through the last three or so final steps to get it out," she said. The real question enterprise AI must answer: how much autonomy is too much autonomy The degree of autonomy enterprises are comfortable granting their AI agents remains one of the most consequential open questions in the industry. Jwo acknowledged that most customers still maintain human checkpoints in their workflows. "You can also build in instructions into our playbooks to say, 'Hey, before you move on to a next playbook, make sure that you check with me. I want to take a look, and then if I hit go, then you're good to go,'" she said. The agent can also be designed with self-QA capabilities, validating outputs against known pitfalls before proceeding. Writer plans to expand these checkpoint capabilities in the coming quarter, adding the ability to specify not just that a checkpoint is required but which specific person must respond and what types of responses are expected — essentially building a formal approval workflow into the autonomous trigger chain. Jwo characterized the current system as a hybrid: the platform listens deterministically for predefined events, but the agent applies reasoning to decide what action to take — or whether to act at all. "The agent has the ability to process what happened, understand the context of it, and understand the intent of what you want to do, so it can make that decision," she said. 
"You're just saying, like, 'Hey, the goal might be feedback is coming in, and we want to triage that in real time. And some things we might not want to action on, some things we do.' You basically just explain that to the agent." She views this release as a stepping stone toward a future where agents are "even more mission-driven, and less governed by even like a set of instructions or roles" — a future where the AI doesn't just respond to triggers but proactively identifies when action is needed based on broader organizational goals. For now, Writer is betting that the combination of autonomous triggers, robust governance, and business-user accessibility will be enough to carve out defensible territory in an enterprise AI market where the biggest technology companies in the world are all converging on the same set of capabilities. The company's argument is that having the foundational pieces is not enough — what matters is making those pieces work together in a way that non-technical business users can build, manage, and trust. It is, in other words, the same wager Writer has been making since 2020 — that the future of enterprise AI belongs not to the platform with the most powerful model, but to the one that can get an entire organization to actually use it. The difference now is that the agents don't wait to be asked. Event-based triggers, new connectors, and enhanced governance controls are available immediately to Writer enterprise customers.

Netomi raises $110 million as Accenture and Adobe bet on AI for customer service

4/30/2026

Netomi, the San Francisco-based startup building AI systems for enterprise customer service, said Thursday that it has raised $110 million in new funding in a round led by Accenture Ventures, with participation from Adobe Ventures, WndrCo, Silver Lake Waterman, NAVER Ventures, Metis Strategy and Fin Capital. Jeffrey Katzenberg, managing partner of WndrCo and co-founder of DreamWorks, has joined the company's board. The round builds on early backing from a roster of AI luminaries that includes OpenAI co-founder Greg Brockman, Google DeepMind co-founder Demis Hassabis and Microsoft AI CEO Mustafa Suleyman. On its face, the financing is another large AI round in a market still awash in capital. But the deal is more revealing than that. It suggests that a new line is being drawn inside enterprise AI — not between companies that have a chatbot and companies that do not, but between companies that can show AI works in the messy, brittle, heavily governed environments where large businesses actually operate, and those that still mostly shine in demos. The market around Netomi makes the stakes clear. Sierra, the AI agent startup led by former Salesforce co-CEO Bret Taylor, raised $350 million at a $10 billion valuation in September 2025 and has since made three acquisitions in 2026 alone. Decagon tripled its valuation to $4.5 billion in January 2026 with a $250 million Series D. Salesforce, ServiceNow and Intercom are all racing to embed AI agents into their existing platforms; Intercom's Fin AI agent reportedly crossed $100 million in annual recurring revenue at $0.99 per resolution. Gartner predicts that 40 percent of enterprise applications will include task-specific AI agents by the end of 2026, up from less than 5 percent in 2025. Against that backdrop, Netomi's $110 million round is not the largest in the category, but it may be the most strategically constructed. 
The combination of Accenture's enterprise consulting network, Adobe's dominance in digital experience management and Netomi's track record in production deployments represents a coordinated play to embed AI not as a chatbot layer on top of websites, but as the fundamental intelligence governing how entire digital experiences behave. The company did not disclose its valuation, and in an interview tied to the announcement, Netomi executives declined to provide revenue or profitability figures. Instead, Chief Executive Puneet Mehta pointed to customer economics, saying a typical large deployment can generate at least tens of millions of dollars in impact, with some customers on a path to hundreds of millions. For technical decision-makers, though, the more important part of Thursday's news may be the partnerships attached to the money. Why Accenture and Adobe turned a venture deal into a global distribution play The structure of the deal reads like a map of how enterprise AI gets bought in 2026. Alongside the investment, Accenture has entered a global alliance with Netomi to bring the platform to its Fortune 100 client base worldwide. The alliance will involve hundreds of Accenture team members receiving training on Netomi's platform — a meaningful commitment from the world's largest consulting firm and a distribution channel that few AI startups can match. Adobe Ventures' participation comes with plans to integrate Netomi into Adobe's Brand Concierge agentic ecosystem, giving Netomi a path into the software layer many large brands already use to manage websites, content and digital journeys. Metis Strategy brings access to CIO advisory channels. Ndidi Oteh, CEO of Accenture Song, said in the press release that the partnership is designed to help clients "reinvent how they serve their customers — seamlessly, responsibly and at scale." The result is not just more cash. It is a distribution network wrapped around a thesis. 
Justin Wexler, a partner at WndrCo who led the firm's Series B investment in Netomi in 2021, said most companies in the customer experience space are simply swapping a human for an AI. "That's the extent of what they're building," Wexler said. "What we're doing at Netomi, particularly with the Adobe partnership, is leapfrogging that altogether — merging the two layers. You don't have a 'How can I help you?' chatbot. This is anticipating the issue and eliminating the ticket altogether." The distinction matters because it describes a fundamentally different kind of product. Most customer service AI still sits downstream. A customer encounters a problem, opens a chat window, explains the issue and waits for a response. Even when AI speeds up that exchange, the friction has already happened. Netomi wants to move upstream, into the experience before the ticket exists. Mehta described the idea in blunt economic terms. "Why are there so many customer service tickets? Why is $500 billion spent on human labor answering customer service phone calls, emails and chats?" he asked. "What we realized is that the world's largest companies wait for a problem to happen and then jump on it to solve it — but by that time, they've already created a lot of frustration, and it's very expensive to do that." The answer, in Mehta's view, is not to make downstream customer service faster with AI. It is to prevent the service ticket from being created in the first place. That logic sits behind almost every strategic decision the company has made — including the Adobe partnership. "Most important websites run on Adobe Experience Manager," Mehta said. "So we're saying, what if we bring that kind of context and awareness upstream — capturing that a customer might be affected before it even turns into a customer service ticket." The Wall Street trading floor origins behind Netomi's AI architecture To understand what Netomi is building, you have to understand where its founder came from. 
Mehta, who spent his early career constructing automated trading engines on Wall Street, told VentureBeat that the founding thesis was deceptively simple. "When we started Netomi, the core thesis was that AI is going to become the new customer interface," he said. "The Transformers [paper] did not exist, so we had literally stitched together a set of different models to create the same end result." That background in low-latency finance is not incidental. It is the intellectual architecture that undergirds everything Netomi builds. When asked what connects trading systems to customer experience platforms, Mehta drew a direct line. "If you think about the low-latency trading world, that was the first technology application to use situational awareness and a variety of different signals at scale," he said. "There was not one signal that it was making decisions on. You needed market data feeds. You needed situational awareness. You needed news. You needed awareness of your own book of business. You needed your own risk assessment." That multi-signal architecture, Mehta argued, translates directly to what enterprise customer experience demands. Rather than waiting passively for a customer to describe a problem — the way traditional chatbots and even most current AI agents operate — Netomi's system attempts to reconstruct the full situation before it acts. The request itself is only part of the story. "What the customer tells you is very important, but the situation the customer is in is sometimes even more important," Mehta said. "What if we borrowed that design pattern we built for low-latency trading? Because we can probably know why the customer is calling us. And if we can know that, we could maybe even reach out to them before they reach out to us and solve the problem." He summarized the philosophical distinction this way: "What large language models by themselves did was they essentially democratized just raw intelligence. 
We are democratizing context, and that changes everything." That is a sharp line, and also a revealing one. Netomi is effectively betting that the defensible layer in enterprise AI will not be the foundation model alone. It will be the orchestration layer that turns general model capability into governed, auditable, domain-specific action. That governed approach extends to how the platform handles risk. Netomi uses what it calls an AI authority matrix — a real-time system that defines what the AI can do autonomously and when it must escalate to a human. "It's a little bit like autonomous driving," Mehta said. The AI knows when it's approaching a boundary and pulls a human in. For regulated industries, specific endpoints can be locked to deterministic, rules-based flows while the agentic layer handles broader orchestration — and all of it is version-controlled and traceable, with metadata saved for seven years. Inside the AI system that rearranges websites and retail stores in real time The most technically ambitious element of Netomi's vision — and the one that most sharply distinguishes it from competitors — is what the company calls AI-embedded customer experience orchestration. Rather than placing a chatbot in the corner of a website, Netomi's system can rearrange the website itself based on what the AI infers about each individual customer's situation. Wexler demonstrated a live example during the interview. "As we see most deployments, companies that want to deploy AI on their websites, they throw a chatbot on the corner," he said. "If you embed agentic capabilities into the digital layer itself — and again, Adobe Experience Manager is the leading digital layer of enterprise — then you could do really unique things." Wexler described what this looks like in practice. In a typical deployment, he said, the AI doesn't just answer questions — it reshapes the page. 
Based on a customer's browsing behavior, purchase history and inferred intent, the system can reorganize a product page in real time: surfacing warnings one customer needs but another doesn't, prompting a sample order at the moment of hesitation, or flagging a compatibility issue before checkout. Two customers looking at the same product might see fundamentally different pages — not because a marketing team built two versions, but because the AI is composing the experience on the fly. "The AI is playing the role of arranging the elements of the website to cater to me and my needs," Wexler said. "It's anticipating my needs." The implication is a shift from static web pages to something closer to generative websites — pages that reconstruct themselves around each visitor the way a good salesperson adjusts a pitch mid-conversation. It is a fundamentally different model from bolting a chat widget onto a page that otherwise looks the same for everyone. That vision already extends beyond screens. Mehta revealed that Coach, the handbag company owned by Tapestry, deployed Netomi's platform in a physical flagship store during the holiday season to help customers navigate the retail space and is now rolling it out chainwide. The numbers Netomi is putting behind its production claims are equally ambitious. At DraftKings, the company said its platform can handle traffic surging to more than 40,000 concurrent customer requests per second during major sporting events, while delivering sub-three-second response times and 98 percent intent classification accuracy. At Paramount, the company said it deployed across chat and voice in two weeks and then scaled through a weekend that included a major UFC event and the AFC Championship. Those are company-reported numbers, and they are hard to benchmark against competitors because the industry lacks standard public reporting. But they illustrate the kind of problem Netomi wants buyers to think about. 
At that scale, an AI support product stops looking like a smarter FAQ bot and starts looking like a distributed systems challenge. You are not just asking whether a model can answer a question. You are asking whether an entire system can make decisions quickly, safely and consistently while traffic spikes and business rules collide. The $110 million question: can invisible AI beat the chatbot industrial complex? Whether Netomi can deliver on the full scope of its ambition — transforming from an AI customer service platform into an ambient intelligence layer that reshapes digital and physical experiences in real time — remains an open question. The company faces competitors with far larger war chests, deeper platform footprints and, in Sierra's case, a founder-level relationship with OpenAI. But Netomi's bet is fundamentally different from what much of the field is building. While Sierra and Decagon race to replace human agents with AI concierges, measuring success in conversations handled, Netomi is wagering that the highest form of customer service is the interaction that never needs to happen at all. "There are new startups trying to convince enterprises that if every customer gets a 'concierge,' if there's 'an agent for every moment,' then loyalty follows," Mehta said. "But most relationships with brands are functional. Customers don't want a conversational relationship with their airline or their bank. They want things to work — seamlessly, invisibly, without friction." In his closing comments during the interview, Mehta warned that many companies still underestimate the operational risk of deploying immature AI into sensitive customer environments. "What large companies adopting AI don't fully realize yet is what kind of risk are they taking by adopting those platforms that are not really field tested for this kind of scale and situations," he said. That may be the most important line in the whole announcement. 
Because beneath the funding round, beneath the partner logos and beneath the talk of agents and orchestration, the real question in enterprise AI remains old-fashioned: which systems can be trusted when the environment gets ugly? "We have built this technology more like how automated trading got built, or how autonomous driving got built, compared to coming at this from just a customer service lens," Mehta said. It is a fitting frame for a company whose founder left Wall Street to fix customer service. On the trading floor, the best systems were never the ones that made the most trades. They were the ones that knew, with precision, when not to act — and the ones nobody noticed until something went wrong and they held. Netomi's new investors are betting $110 million that the same principle applies when the person on the other end of the system is not a trader, but a customer who just wants their floor not to leak.

Cheaper tokens, bigger bills: The new math of AI infrastructure

4/30/2026

Presented by Nutanix As enterprises move from AI experimentation into production deployment, the primary cost driver has shifted away from foundation model training and toward the infrastructure required to run thousands of concurrent inference workloads at scale, with agentic AI as the accelerant. Where early enterprise AI projects involved a handful of large, scheduled training jobs, production agentic environments require continuous support for short-lived, unpredictable requests that consume GPU, networking, and storage resources in ways traditional infrastructure was never designed to handle. For enterprise technology leaders, that shift is turning infrastructure efficiency into a make-or-break factor in AI economics. "Every employee with an AI assistant, every automated workflow, every agent pipeline needs models for inferencing and generates a lot of tokens," says Anindo Sengupta, VP of products at Nutanix. "Those inferencing requests land on a GPU infrastructure, traverse specialized networks, and pull data from storage systems purpose built to support these AI workloads." Why cost per token is becoming a core infrastructure metric Inference costs per token have dropped by roughly an order of magnitude over the past two years, driven by model efficiency improvements and competitive pressure among cloud providers. The natural expectation is that enterprise AI is getting cheaper. Instead, total costs are rising, Sengupta says, pointing to what economists call the Jevons paradox: when a resource becomes cheaper to use, consumption tends to increase faster than the price drops. So while the cost per token has fallen by almost an order of magnitude in the last couple of years, consumption has risen more than 100X. The result is that cost per token and GPU utilization are becoming primary operational metrics for enterprise IT, sitting alongside traditional measures like uptime and throughput. 
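The arithmetic behind that paradox can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the 10x and 100x inputs are the rounded figures cited above, and the GPU price and throughput values are made-up placeholders rather than Nutanix numbers.

```python
def relative_spend(price_drop_factor: float, consumption_growth_factor: float) -> float:
    """Total spend relative to the starting point.

    Spend is price times volume, so if price falls by one factor while
    volume grows by another, spend changes by growth / drop.
    """
    return consumption_growth_factor / price_drop_factor


def cost_per_million_tokens(gpu_hour_cost: float,
                            peak_tokens_per_second: float,
                            utilization: float) -> float:
    """Rough serving cost per million tokens for one GPU.

    Assumes (for illustration) that useful throughput scales linearly
    with utilization, where utilization is a fraction between 0 and 1.
    """
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return gpu_hour_cost / tokens_per_hour * 1_000_000


# Price per token down ~10x, consumption up ~100x: the bill grows ~10x.
print(relative_spend(10, 100))

# Hypothetical GPU at $4/hour and 1,000 tokens/s peak throughput:
# halving utilization doubles the effective cost per token.
print(cost_per_million_tokens(4.0, 1000, 1.0))
print(cost_per_million_tokens(4.0, 1000, 0.5))
```

The second function is why utilization sits next to cost per token as a metric: an idle GPU costs the same per hour as a busy one, so every point of lost utilization shows up directly in the per-token price.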
"Cost per token is really about the total cost of ownership for serving inference models," Sengupta says. "Utilization is about making sure that once you have GPU assets, you're getting maximum return from them. These metrics will be critical for enterprise IT leaders." What makes this difficult is the number of variables involved. Token costs shift depending on which models an organization runs, where workloads execute, and how prompts are structured. "There are too many variables in cost to manage intuitively," Sengupta adds. "Optimizing it is an engineering problem, and one that requires continuous tuning." Agentic workloads expose the limits of traditional infrastructure Production agentic AI introduces a workload profile that traditional enterprise infrastructure was not designed to handle. Classic data center deployments are built around predictable loads and long planning cycles. Agentic environments produce unpredictable, high-frequency bursts of short inference requests, place new demands on networking and storage, and change faster than most procurement cycles allow. The infrastructure supporting agentic AI is also structurally different from CPU-based computing. GPU topology, high-speed interconnects, parallel storage systems for agent memory and KV cache, and networking architectures capable of handling DPU offloading all represent new capabilities that require new operational skills. Siloed infrastructure compounds these challenges. When GPU resources, networking, and data access are managed independently, scheduling inefficiencies accumulate, utilization drops, and costs climb. Organizations running fragmented stacks tend to underutilize expensive GPU assets while simultaneously bottlenecking on storage and network throughput. Integrated stacks and the case for full-stack architecture The response emerging among infrastructure vendors is a move toward tightly integrated, validated full-stack platforms designed specifically for production AI workloads. 
The premise is that end-to-end optimization across compute, networking, storage, and software layers produces better utilization and lower per-token costs than assembling best-of-breed components from separate vendors. Nutanix's Agentic AI solution represents one approach to this problem. Built on the Nutanix AHV hypervisor, Nutanix Enterprise AI and Nutanix Kubernetes Platform, the solution is designed to manage both the traditional compute layer where agent orchestration runs and the accelerated compute layer where inference executes. The company has introduced NVIDIA topology-aware enhancements to AHV that automatically optimize how GPUs, CPUs, memory, and DPUs are allocated to virtual machines, and has offloaded Nutanix Flow Virtual Networking to BlueField DPUs to free GPU cycles and sustain throughput without compromising security. The solution supports instant deployment of NVIDIA NIM microservices and open-source models including Nemotron, and integrates an AI gateway that governs access to frontier cloud LLMs from Anthropic, Google, OpenAI, and others. The gateway also implements the Model Context Protocol (MCP) to allow agents to connect to enterprise data with granular access controls. The solution runs on Cisco infrastructure, allowing organizations to deploy on infrastructure they already operate. "By integrating everything from the AHV hypervisor and Flow Virtual Networking up to the Kubernetes platform, you remove the silos that slow down AI projects," Sengupta explains. Platform teams and developer agility cannot be traded off against each other One organizational tension that scales with agentic AI adoption is the relationship between platform teams managing shared infrastructure and the developers building and running agent applications on top of it. These groups have historically operated with different tooling, different priorities, and different time horizons, but Sengupta argues that the core dynamic hasn't changed even as the technology has. 
"Platform teams will continue to deliver a catalog of self-service AI capabilities that are also compliant to business needs, that they can serve to agentic AI builders," Sengupta says. "Mature AI teams will do a great job not just in GPU utilization, but in creating an operating model that enables fast AI infrastructure delivery to meet the pace of innovation that developers want. That's what is very critical to success." The organizations that are managing GPU utilization most effectively tend to be further along in their AI adoption journey, with more established operating models and clearer cost accountability. For organizations earlier in that journey, the infrastructure design and operating model decisions being made now will determine whether AI projects can move from pilot to production without cost or complexity becoming the limiting factor. The AI factory operating model The emerging framework for enterprise AI infrastructure is the AI factory, a purpose-built environment for producing and running AI workloads at scale. The challenge is that most organizations will need to operate both traditional compute and accelerated compute simultaneously for years, requiring a common operating model that spans both technology paradigms without sacrificing agility. With Nutanix, running on Cisco as part of the Cisco AI Pods, powered by Intel and optimized for the NVIDIA reference architecture, organizations get a production-ready, full-stack foundation by enabling AI factories to be securely and efficiently shared by thousands of agents, to achieve the lowest costs per token. The solution bridges the gap between the infrastructure and platform engineering teams who manage the hardware and the AI engineering and agentic AI developer teams who build and run agentic AI applications, making it truly affordable to run AI at a massive scale. 
"The metrics that will determine whether an organization can sustain and scale its AI investment — cost per token, GPU utilization, scheduling efficiency — are infrastructure metrics," Sengupta says. "Managing them well is increasingly a precondition for making AI viable, not just functional." Secure and scale your AI factory — explore the full-stack approach here. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

How to build custom reasoning agents with a fraction of the compute

4/28/2026

Training AI reasoning models demands resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive models or relying on reinforcement learning techniques that provide sparse feedback. Researchers at JD.com and several academic institutions recently introduced a new training paradigm that sidesteps this dilemma. The technique, called Reinforcement Learning with Verifiable Rewards with Self-Distillation (RLSD), combines the reliable performance tracking of reinforcement learning with the granular feedback of self-distillation. Experiments indicate that models trained with RLSD outperform those built on classic distillation and reinforcement learning algorithms. For enterprise teams, this approach lowers the technical and financial barriers to building custom reasoning models tailored to specific business logic. The problem with training reasoning models The standard method for training reasoning models is Reinforcement Learning with Verifiable Rewards (RLVR). In this paradigm, the model learns through trial and error, guided by a final outcome from its environment. An automated verifier checks if the model’s answer is right or wrong, providing a binary reward, such as a 0 or 1. RLVR suffers from sparse and uniform feedback. “Standard GRPO has a signal density problem,” Chenxu Yang, co-author of the paper, told VentureBeat. “A multi-thousand-token reasoning trace gets a single binary reward, and every token inside that trace receives identical credit, whether it's a pivotal logical step or a throwaway phrase.” Consequently, the model never learns which intermediate steps led to its success or failure. On-Policy Distillation (OPD) takes a different approach. Instead of waiting for a final outcome, developers pair a smaller student model with a larger, more capable teacher model. For each training example, the student compares its response to that of the teacher token by token. 
This provides the student with granular feedback on the entire reasoning chain and response-generation process. Deploying and running a separate, massive teacher model alongside the student throughout the entire training process incurs massive computational overhead. “You have to keep a larger teacher model resident throughout training, which roughly doubles your GPU footprint,” Yang said. Furthermore, the teacher and student models must share the exact same vocabulary structure, which according to Yang, “quietly rules out most cross-architecture, cross-modality, or multilingual setups that enterprises actually run.” The promise and failure of self-distillation On-Policy Self-Distillation (OPSD) emerged as a solution designed to overcome the shortcomings of the other two approaches. In OPSD, the same model plays the role of both the student and the teacher. During training, the student receives a standard prompt while the teacher receives privileged information, such as a verified, step-by-step answer key. This well-informed teacher version of the model then evaluates the student version, providing token-by-token feedback as the student tries to solve the problem using only the standard prompt. OPSD appears to be the perfect compromise for an enterprise budget. It delivers the granular, step-by-step guidance of OPD. Because it eliminates the need for an external teacher model, it operates with the high computational efficiency and low cost of RLVR, only requiring an extra forward pass for the teacher. However, the researchers found that OPSD suffers from a phenomenon called “privileged information leakage.” “The objective is structurally ill-posed,” Yang said. “There's an irreducible mutual-information gap that the student can never close... 
When self-distillation is set up as distribution matching, the student is asked to imitate the teacher's full output distribution under privileged context.” Because the teacher evaluates the student based on a hidden answer key, the training objective forces the student model to learn the teacher’s exact phrasing or steps instead of the underlying reasoning logic. As a result, the student model starts hallucinating references to an invisible solution that it will not have access to in a real-world deployment. In practice, OPSD models show a rapid spike in performance early in training, but their reasoning capabilities soon plateau and progressively degrade over time. Decoupling direction from magnitude with RLSD The researchers behind RLSD realized that the signals governing how a model updates its parameters have fundamentally asymmetric requirements. They identified that the signal dictating the direction of the update (i.e., whether to reinforce or penalize a behavior) can be sparse, but must be perfectly reliable, because pointing the model in the wrong direction damages its reasoning policy. On the other hand, the signal dictating the magnitude of the update (i.e., how much relative credit or blame a specific step deserves) benefits from being extremely dense to enable fine-grained, step-by-step corrections. RLSD builds on this principle by decoupling the update direction from the update magnitude. The framework lets the verifiable environmental feedback from the RLVR signal strictly determine the direction of learning. The model only receives overall reinforcement if the final answer is objectively correct. The self-teacher is stripped of its power to dictate what the model should generate. Instead, the teacher's token-by-token assessment is repurposed to determine the magnitude of the update. It simply distributes the total credit or blame across the individual steps of the model's reasoning path. 
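To make the decoupling concrete, here is a minimal Python sketch of the idea as described above. It is not the paper's objective: the function names, the softmax-style weighting, the temperature parameter and the choice to flip the weighting for failed responses are all assumptions made for illustration. The only properties it tries to preserve are the ones the researchers describe: the verifier's reward alone sets the sign of every token's update, and the self-teacher only redistributes how much of that credit or blame each token receives.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style sequence advantages: reward minus group mean, divided by group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # Every response got the same reward, so there is nothing to learn from.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def token_weights(student_logps, teacher_logps, seq_advantage, temperature=1.0):
    """Non-negative per-token weights that average to 1 (hypothetical scheme).

    For a rewarded response, tokens the privileged teacher finds more likely
    than the student did get more credit. For a penalized response the gap is
    flipped, so tokens the teacher finds less likely absorb more of the blame.
    Assumes non-empty, equal-length lists of log-probabilities.
    """
    sign = 1.0 if seq_advantage >= 0 else -1.0
    gaps = [sign * (t - s) / temperature
            for s, t in zip(student_logps, teacher_logps)]
    peak = max(gaps)  # subtract the max for numerical stability
    exps = [math.exp(g - peak) for g in gaps]
    total = sum(exps)
    n = len(exps)
    return [n * e / total for e in exps]


def token_advantages(seq_advantage, student_logps, teacher_logps):
    """Direction from the verifier (the sign), magnitude from the self-teacher."""
    weights = token_weights(student_logps, teacher_logps, seq_advantage)
    return [seq_advantage * w for w in weights]


# Four sampled responses to one prompt, scored 0 or 1 by a verifier.
advantages = group_advantages([1, 0, 0, 1])

# Per-token log-probs of one rewarded response under the plain prompt (student)
# and under the prompt plus privileged information (teacher). Values are made up.
student = [-1.0, -2.0, -0.5]
teacher = [-1.0, -0.5, -0.5]
print(token_advantages(advantages[0], student, teacher))
```

When the teacher and student assign identical log-probabilities, every weight is 1 and the sketch reduces to the uniform per-token credit of standard GRPO. Because the weights are never negative, the teacher cannot reverse the direction set by the verifier, which is the property the researchers say keeps privileged information from leaking into what the model generates.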
This alters how the model learns compared to the classic OPSD paradigm. In standard OPSD, the training objective acts like behavioral cloning, where the model is forced to directly copy the exact wording and phrasing of the teacher. This causes the student to hallucinate and leak references to data it does not have. Instead of forcing the model to copy a hidden solution, RLSD provides a natural and virtually cost-free source of per-token credit information. “The intuition: we're not teaching the model to reason like the teacher,” Yang said. “We're telling the model, on the path it chose, which of its own tokens were actually doing the work. The model's exploration distribution stays its own. Only the credit allocation gets sharpened.” If a specific deduction strongly supports the correct outcome, it receives a higher score. If it is just a useless filler word, it receives a baseline score. RLSD eliminates the need to train complex auxiliary reward networks, manually annotate step-by-step data, or maintain massive external teacher models. Putting RLSD to the test To test RLSD, the researchers trained the open-weight Qwen3-VL-8B vision-language model and evaluated it on several visual reasoning benchmarks. These included MMMU for college-level multi-discipline questions, MathVista, MathVision, WeMath, and ZeroBench, a stress-test benchmark explicitly designed to be nearly impossible for current frontier models. They compared the RLSD model against the base model with no post-training, standard RLVR via the GRPO algorithm, standard OPSD, and a hybrid combination of the two. RLSD significantly outperformed every other method, achieving the highest average accuracy of 56.18% across all five benchmarks. It beat the base model by 4.69% and outperformed standard RLVR by 2.32%. The gains were most pronounced in complex mathematical reasoning tasks, where RLSD outperformed standard RLVR by 3.91% on the MathVision benchmark. 
Beyond accuracy, the framework offers massive efficiency gains. “Concretely, RLSD at 200 training steps already beats GRPO trained for 400 steps, so roughly 2x convergence speedup,” Yang said. “Cost-wise, the only overhead beyond a normal GRPO pipeline is one extra forward pass per response to grab teacher logits. Compared to rollout generation... that's basically free.” Unlike OPSD, which saw performance spike and then completely collapse due to information leakage, RLSD maintained long-term training stability and converged on a higher performance ceiling than standard methods. The qualitative findings highlight how the model alters its learning behavior. For example, in a complex visual counting task, standard RLVR looks at the final correct answer and gives the entire paragraph of reasoning tokens the same reward. RLSD surgically applied rewards to the specific mathematical subtraction steps that solved the problem, while actively down-weighting generic filler text like "Looking at the image, I see...". In another example, the model performed an incorrect math derivation based on a bar chart. Instead of labeling the whole response as a failure, RLSD concentrated the heaviest penalty on the exact point where the model misread a relationship from the chart. It remained neutral on the rest of the logical setup, recognizing that the initial framework was valid. This is particularly important for messy, real-world enterprise use cases. If a model makes a mistake analyzing a 50-page quarterly earnings report, developers do not want it to unlearn its entire analytical framework. They just want it to fix the specific assumption it got wrong. RLSD allows the model to learn exactly which logical leaps are valuable and which are flawed, token by token. Because RLSD does this by repurposing the model itself, it provides models with granular reasoning capabilities while keeping the costs of training reasonable. 
How enterprises can get started For data engineers and AI orchestration teams, integrating RLSD is straightforward, but it requires the right setup. The most critical requirement is a verifiable reward signal, such as code compilers, math checkers, SQL execution, or schema validators. “Tasks without verifiable reward (open-ended dialogue, brand-voice writing) belong in preference-based pipelines,” Yang said. However, RLSD is highly flexible regarding the privileged information it requires. While OPSD structurally requires full intermediate reasoning traces, forcing enterprises to either pay annotators or distill from a frontier model, RLSD does not. “If you have full verified reasoning traces, great, RLSD will use them,” Yang said. “If all you have is the ground-truth final answer, that also works... OPSD doesn't have this flexibility.” Integrating the technique into existing open-source multi-modality RL frameworks like veRL or EasyR1 is incredibly lightweight. According to Yang, it requires no framework rewrite and slots right into the standard stack. The code swap involves simply changing tens of lines to adjust the GRPO objective and sync the teacher with the student. Looking ahead, RLSD offers a powerful way for enterprises to maximize their existing internal assets. “The proprietary data enterprises hold inside their perimeter (compliance manuals, internal documentation, historical tickets, verified code snippets) is essentially free privileged information,” Yang concluded. “RLSD lets enterprises feed this kind of data straight in as privileged context, which sharpens the learning signal on smaller models without needing an external teacher and without sending anything outside the network.”
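For teams wondering what a verifiable reward signal looks like in practice, the following is a minimal sketch of one of the options Yang lists, SQL execution, using Python's built-in sqlite3 module. The function name, the toy schema and the rule of comparing result rows regardless of order are assumptions made for illustration; a production verifier would also need to sandbox untrusted queries and enforce timeouts.

```python
import sqlite3


def sql_reward(candidate_sql, reference_sql, setup_statements):
    """Return 1 if the candidate query yields the same rows as the reference, else 0.

    Each call builds a fresh in-memory database, so one rollout cannot
    affect the next. Row order is ignored when comparing results.
    """
    conn = sqlite3.connect(":memory:")
    try:
        for statement in setup_statements:
            conn.execute(statement)
        expected = sorted(conn.execute(reference_sql).fetchall())
        try:
            actual = sorted(conn.execute(candidate_sql).fetchall())
        except sqlite3.Error:
            # Queries that fail to parse or execute earn no reward.
            return 0
        return 1 if actual == expected else 0
    finally:
        conn.close()


# Hypothetical toy schema standing in for an internal database.
setup = [
    "CREATE TABLE orders (id INTEGER, total REAL)",
    "INSERT INTO orders VALUES (1, 10.0), (2, 25.5)",
]
reference = "SELECT id FROM orders WHERE total > 20"

print(sql_reward("SELECT id FROM orders WHERE total >= 25", reference, setup))
print(sql_reward("SELECT id FROM orders", reference, setup))
```

The binary result is exactly the kind of sparse but reliable signal RLVR relies on for direction. In an RLSD setup, the reference query or its result could plausibly double as the privileged information shown to the self-teacher, though that pairing is a suggestion here rather than something the researchers specify.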

American AI startup Poolside launches free, high-performing open model Laguna XS.2 for local agentic coding

4/28/2026

The AI race lately has felt a bit like a game of tennis: first, Anthropic releases a new, pricey state-of-the-art proprietary model for general users (Claude Opus 4.7), then, a week or so later, its rival OpenAI volleys back with one of its own (GPT-5.5). And all the while, Chinese companies like DeepSeek and even Xiaomi are seeking to appeal to users by playing a different game: nearing the frontier, but with open licensing and far lower costs. So it's a big surprise when a new, affordable, highly performant open source contender from the U.S. emerges. Today, we got one from the smaller, lesser-known U.S. AI startup, Poolside, founded in San Francisco in 2023. The company launched its two new Laguna large language models, both of which offer affordable intelligence optimized for agentic workflows (AI that does more than just chat or generate content, but can, in this case, write code, use third-party tools, and take actions autonomously), as well as a new coding agent harness called (fittingly) "pool" and a new web-based, mobile optimized agentic coding development and interactive preview environment, "shimmer," which lets you write code with the Laguna models on the go. The new AI models that Poolside released today include: Laguna M.1: a proprietary 225-billion parameter Mixture of Experts (MoE) model with 23 billion active parameters. This flagship model is optimized for high-consequence enterprise and government environments, designed to solve complex, long-horizon software engineering problems that require maximum reasoning and planning capabilities. Laguna XS.2: an Apache 2.0 open licensed 33-billion parameter MoE with 3 billion active. Engineered for efficiency and community innovation, this model is designed for local agentic coding tasks and provides a versatile foundation for developers looking to fine-tune, quantize, or serve powerful agents on a single GPU. 
In other words, developers can download and run Laguna XS.2 on their desktop or even laptop computers without an internet connection, completely private and secured. Notably, as mentioned above, only the smaller of the two models, XS.2, is available now under an open source Apache 2.0 license (on Hugging Face). Yet Poolside is offering even the larger M.1 for free temporarily through its API and third-party distribution partners OpenRouter, Ollama, and Baseten, which makes it easy for developers who wish to test it out.

Also noteworthy: the two new Lagunas were trained from scratch, not fine-tuned or post-trained from base models in Chinese giant Alibaba's Qwen series, as some other U.S. labs have done lately (*cough cough* Cursor *cough*). As Poolside wrote in a blog post today, it's spent the last few years "focused on serving our government and public sector clients with capable models deployable into the highest-security environments," yet is now going open source "to support builders and the wider research community."

When I asked on X why government agencies would seek to use Poolside instead of leading proprietary U.S. labs like Anthropic, OpenAI and Google, Poolside post-training engineer George Grigorev told me in a reply: "we think that we can be faster to deploy our models to enterprise customers, and we can literally ship weights in fully isolated environments on-prem, so it can work offline. which might be critical for gov/public sectors :) but ofc anthropic enterprise is hard to beat"

How Poolside's Laguna M.1 and Laguna XS.2 were trained

Poolside constructs its AI models within a specialized digital environment called the "Model Factory." At the heart of this process is Titan, the company's powerful internal software that serves as the "furnace" for training. To help the AI learn as efficiently as possible, Poolside uses a unique tool called the Muon optimizer.
Think of Muon as a high-speed tutor; it helps the model master new information approximately 15% faster than standard industry methods, a critical gain when training at the 30-trillion-token scale. It achieves this by ensuring that every update to the model's "brain" is mathematically balanced and pointing in the right direction, which prevents the AI from getting confused or stuck during its intensive training sessions.

The information used to train these models (a staggering 30 trillion "tokens," or pieces of data) is carefully selected using a system called AutoMixer. Rather than just feeding the AI everything it finds on the internet, AutoMixer leverages a "swarm" of sixty proxy models trained on different data mixes to scientifically determine which combination of code, math, and general web data produces the best reasoning capabilities. In this way, it acts like a master chef, scientifically testing thousands of different "recipes" to find the perfect balance of computer code, mathematics, and general knowledge. While much of this data comes from the public web, about 13% of it is "synthetic data." This is high-quality, custom-made practice material created by other AIs to teach the models specific skills that are difficult to find in the real world.

Once the model has finished its basic "schooling," it enters a virtual gym for Reinforcement Learning. In this stage, the AI practices solving real software engineering problems in a safe, isolated digital playground. It learns through trial and error, receiving a "reward" or positive signal every time it successfully fixes a bug or writes a working piece of code. This constant cycle of practice and feedback is what transforms the AI from a simple text generator into a capable "agent" that can plan and execute complex, multi-step projects just like a human software engineer.

While M.1 represents the peak of Poolside's current research, the smaller Laguna XS.2 may be the more disruptive entry.
At just 33 billion total parameters (3 billion activated), XS.2 is a "second-generation" MoE model that incorporates everything the team learned from training M.1.

Benchmarks show Poolside's Laguna models punch far above their weight class

Laguna M.1 reached 46.9% on SWE-bench Pro, a benchmark designed to test an AI's ability to solve real-world software issues, nearing the performance of the far larger Qwen-3.5 and DeepSeek V4-Flash. Despite being a fraction of the size, Laguna XS.2 achieves a 44.5% score on SWE-bench Pro, nearly matching its larger sibling. On the SWE-bench Verified track, M.1 scored 72.5%, outperforming the dense Devstral 2 (72.2%) but trailing Claude Sonnet 4.6, which leads the category at 79.6%. These results highlight M.1's specialization in long-horizon software tasks, particularly those involving complex planning across interconnected files.

The smaller Laguna XS.2 exhibits remarkable efficiency, nearly matching the performance of its much larger sibling on high-consequence tasks. Despite having only 3B active parameters, XS.2 surpasses Claude Haiku 4.5 (39.5%) and the significantly larger Gemma 4 31B dense model (35.7%) on SWE-bench Pro. In terminal-based reasoning, XS.2's 30.1% on Terminal-Bench 2.0 also edges out Haiku 4.5's 29.8%, although it remains behind specialized "nano" models such as GPT-5.4 Nano, which reached 46.3% on the same benchmark.

Collectively, these benchmarks suggest that Poolside's focus on agentic RL and synthetic data curation has allowed its smaller models to "punch up" into weight classes typically reserved for far denser architectures. While top-tier proprietary models like Claude Sonnet 4.6 maintain a lead in overall success rates, the Laguna family, particularly the open-weight XS.2, offers a competitive alternative for developers who prioritize local execution and customizable agent workflows.
All benchmarking was conducted using the Harbor Framework with sandboxed execution, ensuring that the results reflect the models' ability to function in realistic, resource-constrained environments.

Running Laguna XS.2 locally

To run the Laguna XS.2 (33B) model locally, your hardware must accommodate its 33 billion total parameters. On Apple Silicon, the baseline requirement is 36 GB of unified memory. For PC and Linux users, while the standard weights would typically require over 60 GB of VRAM, the model's support for 4-bit quantization (Q4) allows it to run on consumer-grade GPUs with 24 GB to 32 GB of VRAM, such as the newly released RTX 5090. Storage is also a factor; you should reserve at least 70 GB for the full model or roughly 20–35 GB for a compressed version suitable for local "agent" tasks.

For the most seamless experience, Poolside recommends using Ollama or its own terminal-based agent, pool, which are designed to manage the model's native reasoning and tool-calling capabilities on consumer hardware. You can find the full technical requirements, including specific quantization configurations and code execution sandboxing details, on the official Hugging Face model page and the Poolside release blog.

Some sample suggested hardware is listed below.

Mac

MacBook Pro (14-inch or 16-inch): You should look for models equipped with the M5 Max chip, which specifically supports a starting configuration of 36 GB of unified memory. While the M5 Pro is available, you would need to custom-configure it to exceed its base memory to meet the 36 GB threshold.

Mac Studio / Mac Mini: A Mac Mini (M4 or M5 Pro) configured with at least 48 GB or 64 GB of RAM is an excellent desktop alternative.

Not the "MacBook Neo": this model is not suitable for running Laguna XS.2. Released in early 2026 as a budget-friendly option, the MacBook Neo is capped at 8 GB of non-upgradable memory, which is insufficient for a 33B parameter model.
PC

Single-GPU Setup: The NVIDIA GeForce RTX 5090 is the premier choice for 2026, offering 32 GB of GDDR7 VRAM, which can handle the Laguna XS.2 at high speeds (approximately 45 tokens/sec) using Q4 quantization.

Pro-Grade Setup: For professional developers running complex, long-horizon agents, the RTX PRO 6000 Blackwell (96 GB VRAM) or a dual RTX 5090 configuration allows the model to run without any compression loss.

Minimum PC Spec: An RTX 4090 (24 GB) can run the model with heavier quantization, though performance may be slower during complex reasoning tasks.

pool (agent) and shimmer (IDE)

Models are only as useful as the environments they inhabit, and Poolside has released two "preview" products to house the Laguna series: pool and shimmer.

pool is a terminal-based coding agent designed for the developer's local environment. It acts as an Agent Client Protocol (ACP) server, the same harness the team uses internally for reinforcement learning (RL) training. By bringing the researchers' own tools to the general public, Poolside is effectively inviting the developer community to participate in the "real-world gym" that trains their future models.

shimmer represents a vision for the cloud-native future of development. It is an instant-on Virtual Machine (VM) sandbox where developers can iterate on web apps, APIs, and CLIs in seconds. Unlike traditional integrated development environments (IDEs) such as Microsoft Visual Studio, shimmer integrates the Poolside Agent directly into the workspace, allowing it to push changes to GitHub or import existing repositories with ease.

Perhaps the most surprising feature of shimmer is its portability. Poolside Founding Designer Alasdair Monk shared a demonstration showing shimmer running entirely on a smartphone. In the demo, a split-screen interface shows the Poolside Agent generating a "Happy New Year 2026!" animation while a dev environment runs below.
As Monk noted, it offers an instant-on VM with Poolside Agent in split screen and a full dev environment on a mobile device. This suggests a future where high-consequence engineering isn't tethered to a desktop, but can happen wherever an engineer has a screen.

Why release Laguna XS.2 as Apache 2.0 open weights?

The most significant strategic move in this release is the licensing of Laguna XS.2. Poolside has released the weights of XS.2 under the Apache 2.0 license. This is a highly permissive license that allows users to use, distribute, and modify the software for any purpose, including commercial use, without royalties. This is a stark contrast to the "closed" models of many competitors or even the more restrictive "open-ish" licenses used by some other labs.

Poolside's leadership is explicit about why they chose this path. Poolside's blog post states its conviction that "the West needs strong open-weight models" and that releasing the weights is the fastest way for the team to improve their work through community evaluation and fine-tuning. By putting the weights of a highly capable, 33B-parameter agentic model in the hands of researchers and startups, Poolside is positioning itself as a cornerstone of the open-AI ecosystem. While Laguna M.1 remains primarily behind an API, the open release of XS.2 ensures that Poolside's technology will be baked into the next generation of third-party tools.

Poolside's philosophy and approach

The core thesis behind Poolside's work is that software development serves as the ultimate proxy for general intelligence. Creating software requires long-horizon planning, complex reasoning, and the ability to manipulate abstract systems, all traits central to human cognition. While most current AI "agents" are restricted to tool-calling via pre-defined interfaces, Poolside's agents are designed to write and execute their own code to solve problems.
This shift from using tools to building systems marks a fundamental evolution in how AI interacts with the digital world. The team of roughly 60 people in the Applied Research organization spent three years and conducted tens of thousands of experiments to reach this point. Their vision of AGI is not just about intelligence, but about "abundance for humanity." By focusing on software engineering, a domain with verifiable rewards like test passes and compilation results, they have created a self-improving feedback loop. As the team puts it, they are building a "fusion reactor" for data: extracting every last drop of intelligence from existing human knowledge while using RL to harvest the "wind energy" of new, fresh experiences.

Poolside's journey is just beginning, but the Laguna release sets a high bar for what "agentic" AI should look like in 2026. By combining frontier-level performance with a commitment to open weights and novel developer surfaces, they are charting a path to AGI that is as much about the way we build as it is about what we build. For the enterprise and the individual developer alike, the message is clear: the future of work is agentic, and the language of that future is code.
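
The memory figures quoted in the "Running Laguna XS.2 locally" section can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not from Poolside; it assumes weights dominate memory use and that the common storage formats use 16, 8 and 4 bits per weight. It counts weights only, so the KV cache, activations and runtime overhead come on top. Note that in a Mixture of Experts model all 33 billion parameters generally need to be resident in memory, even though only about 3 billion are active per token.

```python
# Back-of-the-envelope weight-memory estimate for a 33B-parameter model.
# Weights only: KV cache, activations and runtime overhead are extra.

TOTAL_PARAMS = 33e9  # Laguna XS.2 total parameter count, per the article

# Bits per weight for common storage formats (illustrative assumption).
BITS_PER_WEIGHT = {"bf16": 16, "int8": 8, "q4": 4}


def weight_memory_gb(total_params: float, bits_per_weight: int) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights come to roughly 66 GB, in line with the "over 60 GB" figure, while 4-bit weights come to about 16.5 GB, which leaves headroom for context on a 24 GB or 32 GB card.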

Mistral AI launches Workflows, a Temporal-powered orchestration engine already running millions of daily executions

4/28/2026

Mistral AI, the Paris-based artificial intelligence company valued at €11.7 billion ($13.8 billion), today released Workflows in public preview: a production-grade orchestration layer designed to move enterprise AI systems out of proofs of concept and into the business processes that generate revenue. The product, which launches as part of Mistral's Studio platform, is the company's clearest articulation yet of a thesis that is quietly reshaping the enterprise AI market: that the bottleneck for organizations adopting AI is no longer the model itself, but the infrastructure required to run it reliably at scale.

"What we're seeing today is that organizations are struggling to go beyond isolated proofs of concept," Elisa Salamanca, head of product at Mistral AI, told VentureBeat in an exclusive interview ahead of the launch. "The gap is operational. Workflows is the infrastructure to run AI systems reliably across business-critical processes."

The release arrives at a pivotal moment for both Mistral and the broader AI industry. The dedicated agentic AI market has been valued at approximately $10.9 billion in 2026 and is projected to reach $199 billion by 2034. Yet despite that staggering growth trajectory, industry research points to a stark reality: over 40% of agentic AI projects will be aborted by 2027 due to high costs, unclear value, and complexity. Mistral is betting that Workflows can help its enterprise customers avoid becoming one of those statistics.

Mistral's new orchestration layer separates execution from control to keep enterprise data private

At its core, Workflows provides a structured system for defining, executing, and monitoring multi-step AI processes, from simple sequential tasks to complex, stateful operations that blend deterministic business rules with the probabilistic outputs of large language models. Salamanca described Workflows as containing several key components.
The first is a development kit that allows engineers to build orchestration logic in just a few lines of Python code. "We have also been able to expose MCP servers," she explained, referring to the Model Context Protocol standard for connecting AI systems to external tools, "so that they can actually do this with agent authoring."

The second, and arguably more technically significant, component is an architecture that separates orchestration from execution. "We're decorrelating the orchestration from the execution," Salamanca said. "Execution can happen close to the customer's data — their critical systems — and orchestration can happen on the cloud or wherever they want to run it." This means the data never has to leave the customer's perimeter, a design decision with enormous implications for regulated industries where data sovereignty is non-negotiable. "Enterprises do not have to worry about us having access to the data," she added.

The third pillar is observability. According to Mistral's blog post announcing the release, every branch, retry, and state change within a workflow is recorded in Studio with native support for OpenTelemetry. Salamanca noted that this is not an afterthought: "You can easily see what decisions have been taken by the workflow, by the agent, and you can deep dive into where problems are happening."

Workflows is fully customizable across models: engineers can select which model handles which step and can inject arbitrary code, allowing them to blend deterministic pipelines with agentic sections. The system also supports connectors that integrate directly with CRMs, ticketing systems, support platforms, and other enterprise tools, with built-in authentication and secrets management.

Why Mistral chose a code-first approach over low-code drag-and-drop builders

Unlike some competitors offering drag-and-drop workflow builders, Mistral has deliberately targeted developers and engineers rather than business users.
"There are a couple of solutions out there that have click-and-drag, drag-and-drop solutions for workflows," Salamanca acknowledged. "This is not the approach that we've been taking. We've been really focused towards developers and critical systems that will not scale if you're doing these drag-and-drop workflows." The decision is part of a broader philosophy at Mistral: that enterprise AI systems handling mission-critical operations — cargo releases, compliance reviews, financial transactions — require the precision and version control that only code can provide. Business users are not excluded from the picture, but their role is downstream. Once engineers write a workflow in Python, it can be published to Le Chat, Mistral's chatbot platform, so anyone in the organization can trigger it. Every step remains tracked and auditable in Studio. Under the hood, Workflows runs on Temporal's durable execution engine — a platform whose $5 billion valuation reflects how its durable execution capabilities, originally built for cloud workflow orchestration, have become essential infrastructure for AI agents requiring reliable, long-running, stateful processes. Temporal's customers include OpenAI, Snap, Netflix, and JPMorgan Chase, and its technology powers orchestration at companies like Stripe and Salesforce. Mistral extended Temporal's core engine for AI-specific workloads by adding streaming, payload handling, multi-tenancy, and observability that the base engine does not provide out of the box. "Workflows is built on top of Temporal," Salamanca confirmed. "We added all the AI requirements to make these AI workflows reliable. It provides out of the box durability, retries, state management. Whenever there's a failure, it starts again wherever it stopped." Originally spun out of Uber's Cadence project, Temporal transparently handles retries, state persistence, and timeouts, providing durable execution across failures. 
In late 2025, Temporal joined the newly formed Agentic AI Foundation as a Gold Member and announced an official OpenAI Agents SDK integration. By building on this infrastructure rather than creating a proprietary alternative, Mistral inherits battle-tested reliability while focusing its own engineering efforts on the AI-specific layer that sits above it.

From cargo ships to KYC reviews, customers are already running millions of daily executions

Mistral is not launching Workflows as a concept. The company says customers are already running the product in production, processing millions of executions daily across three primary use cases.

The first is cargo release automation in the logistics sector. Global shipping still runs on paperwork, and a single cargo release can involve customs declarations, dangerous goods classifications, safety inspections, and regulatory checks spanning multiple jurisdictions. Salamanca described the scope of the problem: "Their global shipping today runs on paperwork. They have to involve customs declaration, Dangerous Goods classification, safety inspections, regulatory checks, and Workflows is now powering that with our models and business rules inside."

Critically, the system keeps humans in the loop at the right moments. According to Mistral's blog, the human approval step in a workflow is a single line of code, wait_for_input(), that pauses the workflow indefinitely with no compute consumption, notifies the reviewer, and resumes exactly where it left off once approval is given. "Humans are still in the loop, but they're in the loop at the right time," Salamanca said. "They just get the validation — I don't have to go into multiple tools — and the shipment gets released."

The second production use case is document compliance checking for financial institutions, specifically Know Your Customer reviews. These reviews are manual, repetitive, and traditionally require hours of analyst time per case.
Salamanca said Workflows now processes these reviews in minutes and provides outputs in an auditable manner, a requirement for meeting regulatory obligations.

The third example involves customer support in the banking sector. "You'd have millions of users actually asking to have credit cards blocked, or feedbacks on their account situation, on their credit feedbacks," Salamanca said. With Workflows, incoming support tickets are analyzed, categorized by intent and urgency, and routed automatically. Each routing decision is visible and traceable in Studio, and when the system gets a categorization wrong, the team can correct it at the workflow level without retraining the model.

How Workflows fits into Mistral's three-layer enterprise AI platform strategy

Workflows does not exist in isolation. It is the middle layer of a three-part enterprise platform that Mistral has been assembling at a rapid clip throughout 2026. At the bottom sits Forge, the custom model training platform Mistral launched in March at Nvidia's GTC conference. Forge allows organizations to build, customize, and continuously improve AI models using their own proprietary data. At the top sits Vibe, Mistral's coding agent platform that provides the user-facing interaction layer, available on web, mobile, or desktop.

Salamanca connected the three explicitly: "We just released Forge. It enables you to create your own models. But the question is, how do you put these models to do valuable work for your enterprise? That's where Workflows comes in, because this is the orchestration piece — how you blend in deterministic rules and agentic capabilities. And then if you really want to have your end users interact with these AI patterns, it's where Vibe comes into play."

Forge is already seeing strong traction, Salamanca said, across two distinct patterns of enterprise demand.
"First, they wanted to really build completely dedicated models to solve unique problems — transformers-based architecture for time series in the financial sector, adding new types of modalities to the LLMs," she explained. "And the second motion was about customers with really specific tasks they want to solve. Reinforcement learning really caught their attention as to how they can use Forge and Forge RL to actually have models do these tasks very well." This layered architecture — model customization, workflow orchestration, and end-user interfaces — positions Mistral as something more ambitious than a model provider. It is building a full-stack enterprise AI platform, a strategy that pits it directly against not just other AI labs like OpenAI and Anthropic, but also against the hyperscale cloud providers. The company's product portfolio now ranges, as Salamanca put it, "from compute to end-user interfaces," including data centers in Europe, document processing with its OCR model, and audio capabilities through its Voxtral models. Mistral's aggressive scaling campaign and the $14 billion valuation powering it The Workflows launch comes as Mistral executes one of the most aggressive scaling campaigns in the history of the European technology industry. The French AI startup has increased its revenue twentyfold within a year, with co-founder and CEO Arthur Mensch putting the company's annualized revenue run rate at over $400 million, compared to just $20 million the previous year. The Paris-based company aims to achieve recurring annual revenue of more than $1 billion by year-end. The company's fundraising trajectory has been equally dramatic. Mistral announced a €1.7 billion ($1.9 billion) Series C round at a €11.7 billion ($12.8 billion) valuation in September 2025. Bloomberg reported in September 2025 that the company was finalizing a €2 billion investment valuing it at €12 billion ($14 billion). 
ASML led the round and contributed €1.3 billion, a landmark investment that aligned chip manufacturing expertise with frontier AI development and underscored European industrial capital's commitment to building a sovereign AI ecosystem. Mistral then secured $830 million in debt in March 2026 to buy 13,800 Nvidia chips for a new data center near Paris.

The financial picture illustrates why Workflows matters strategically. Mistral's revenue growth is being driven primarily by enterprise adoption, with approximately 60% of revenue coming from Europe, according to CEO Mensch's public statements. Those enterprise customers are not buying Mistral's models for casual chatbot applications; they are deploying them in regulated, mission-critical environments where reliability and data sovereignty are table stakes. Workflows gives those customers the production infrastructure they need to actually deploy AI systems that matter.

In May 2025, Mistral released Mistral Medium 3, which was priced at $0.40 per million input tokens and $2 per million output tokens. The company said clients in financial services, energy, and healthcare had been beta testing it for customer service, workflow automation, and analyzing complex datasets. That model now becomes one of many that can be plugged into Workflows, creating a flywheel where better models drive more workflow adoption, which in turn drives more inference revenue.

Where Mistral's orchestration play fits in an increasingly crowded competitive landscape

Mistral's entry into workflow orchestration arrives in an increasingly crowded field. AI orchestration platforms are quickly becoming the backbone of enterprise AI systems in 2026, and as businesses deploy multiple AI agents, tools, and LLMs, the need for unified control, oversight, and efficiency has never been greater.
Major cloud providers (Amazon with Bedrock AgentCore, Microsoft with Copilot Studio, Google with Vertex AI's agent tools, and IBM with WatsonX) all offer some form of workflow or agent orchestration. Open-source frameworks like LangChain, LlamaIndex, and Microsoft AutoGen provide developer-level building blocks. And dedicated orchestration startups are proliferating.

Mistral's differentiation rests on three pillars. First, vertical integration: because Workflows is native to Studio, the orchestration layer and the components it orchestrates (models, agents, connectors, observability) are built to work together, eliminating the integration tax that enterprises pay when stitching together disparate tools. Second, deployment flexibility: the split control-plane/data-plane architecture means customers in regulated industries can run execution workers in their own environments while still benefiting from managed orchestration. Third, data sovereignty: Mistral's European roots and infrastructure investments give it a natural advantage with organizations wary of routing sensitive data through U.S.-headquartered cloud providers, a concern that has intensified amid ongoing geopolitical tensions and growing European anxiety about relying on foreign providers for over 80% of digital services and infrastructure.

Still, the challenges are real. OpenAI and Anthropic both have significantly larger model ecosystems and developer communities. The hyperscalers control the cloud infrastructure where most enterprise workloads actually run. And the enterprise sales cycles for production-grade AI deployments remain long and complex, requiring deep technical integration work that even well-funded startups can struggle to staff.

What comes next for Workflows, and why Mistral thinks orchestration is the real AI battleground

Salamanca outlined three areas of near-term development.
First, Mistral plans to release a more managed version of Workflows that abstracts deployment logic for developers who don't need granular control over worker placement. "Whenever you want to have this flexibility, you can, but if you want to be able to have this on a managed infrastructure, even if it's running in your own VPC, this is something that we're adding," she said. Second, the company intends to make Workflows accessible to business users, not just engineers. "With Vibe code, you can actually author a workflow. This can be executed at scale, and any end user, in the end, can actually do that with Workflows," Salamanca explained. The third area is enterprise guardrails and safety controls for agentic applications — ensuring agents use the correct tools, run with appropriate permissions, and that administrators can enforce policies at scale. "Making sure that we have all these enterprise controls to be able to scale the authoring and the building of these workflows is something we're actively working on," she said. The Python SDK for Workflows (v3.0) is now publicly available. Developers can try the product in Studio and access documentation and demo templates immediately. Mistral will be hosting its inaugural AI Now Summit in Paris on May 27–28, where the company is expected to provide additional details on its platform roadmap. For three years, the AI industry has been captivated by a single question: who can build the most powerful model? Mistral's Workflows launch suggests the company has moved on to a different question entirely — one that may prove far more consequential for the enterprises writing the checks. It's not about which model is smartest. It's about which one can actually show up for work.
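
The human-approval pattern described earlier, where a workflow pauses at a review step and later resumes exactly where it left off, can be illustrated with a plain Python generator. This is a toy sketch and not Mistral's Workflows SDK or Temporal's API: the function names, the shipment ID and the checks are all hypothetical, and an in-process generator has none of the durability a real engine provides by persisting state across failures and restarts.

```python
from typing import Generator


def cargo_release(shipment_id: str) -> Generator[str, bool, str]:
    """Toy workflow: run automated checks, pause for approval, then finish."""
    checks = ["customs declaration", "dangerous goods classification"]
    completed = [f"{check}: ok" for check in checks]  # stand-in for model calls
    # Execution suspends at this yield; nothing runs until a decision is sent in.
    approved = yield (
        f"Shipment {shipment_id} awaiting approval "
        f"({len(completed)} checks passed)"
    )
    return "released" if approved else "held"


def run_with_reviewer(shipment_id: str, decision: bool) -> str:
    """Drive the workflow to its pause point, then resume it with a decision."""
    workflow = cargo_release(shipment_id)
    prompt = next(workflow)  # runs up to the pause point
    print(prompt)  # in a real system this would notify the reviewer
    try:
        workflow.send(decision)  # resumes exactly where the workflow paused
    except StopIteration as finished:
        return finished.value
    raise RuntimeError("workflow paused again unexpectedly")


print(run_with_reviewer("SHP-001", decision=True))
```

The point of the sketch is the control flow: the automated checks are not repeated after approval, and the reviewer's decision is the only input needed to finish the run.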

Microsoft and OpenAI gut their exclusive deal, freeing OpenAI to sell on AWS and Google Cloud

4/27/2026

Microsoft and OpenAI on Monday announced a sweeping overhaul of the partnership that has defined the commercial AI era, dismantling key pillars of exclusivity and revenue-sharing that bound the two companies together for years and replacing them with a looser, time-limited arrangement that gives both sides far more freedom to pursue rival relationships. The amended agreement, disclosed simultaneously in blog posts from both companies, marks the most significant restructuring since Microsoft first invested $1 billion in OpenAI in 2019 — and it transforms what was once the most consequential exclusive technology alliance in a generation into something that more closely resembles a strategic but arm's-length commercial relationship. Under the new terms, Microsoft will no longer pay any revenue share to OpenAI when customers access OpenAI models through Azure. OpenAI, meanwhile, will continue paying a revenue share to Microsoft through 2030 — at the same 20 percent rate — but that obligation is now subject to a total cap. Microsoft retains a license to OpenAI's intellectual property for models and products through 2032, but that license is now explicitly non-exclusive. And OpenAI, critically, can now serve all of its products to customers on any cloud provider — including Amazon Web Services and Google Cloud — ending the exclusivity that had been a cornerstone of the original deal. "The rapid pace of innovation requires us to continue to evolve our partnership to benefit our customers and both companies," Microsoft wrote in its blog post Monday. OpenAI echoed the framing, calling the amended agreement a move "grounded in flexibility, certainty, and a focus on delivering the benefits of AI broadly." 
The diplomatic language belies the drama that led to this moment: months of behind-the-scenes tension, competing deal announcements, public contradictions, and even the specter of litigation between two companies whose fates have been intertwined since the earliest days of the generative AI revolution.

How a billion-dollar bet on AI created the most powerful exclusive partnership in tech

To understand why Monday's announcement matters so much, it helps to understand what came before it. When Microsoft poured its initial $1 billion into OpenAI in 2019, and then followed with a cumulative investment exceeding $13 billion, it secured something extraordinary: exclusive commercial access to OpenAI's models and intellectual property. Azure became the sole cloud provider for OpenAI's API products. Microsoft integrated OpenAI's GPT models into everything from Bing to Office to GitHub Copilot. The arrangement was, by any measure, one of the most lopsided technology licensing deals in modern history: Microsoft got privileged access to the most capable AI models on the planet, and OpenAI got the capital and infrastructure it needed to scale.

The deal even contained an unusual provision: Microsoft's exclusive rights would remain in force until OpenAI achieved artificial general intelligence, or AGI, a loosely defined milestone referring to AI systems that rival or exceed human intelligence across a broad range of tasks. OpenAI's board retained the authority to declare when AGI had been reached, at which point certain commercial terms would change. It was, in effect, a philosophical tripwire embedded in a business contract.

That structure worked well enough when OpenAI was a research lab with a modest commercial footprint. But as ChatGPT exploded into the mainstream in late 2022 and OpenAI's annualized revenue rocketed into the billions, the constraints began to chafe.
OpenAI found itself locked into a single cloud ecosystem at precisely the moment when enterprises — its fastest-growing customer segment — were demanding multi-cloud flexibility. In an internal memo earlier this month, OpenAI's revenue chief Denise Dresser put it bluntly, telling staff that the Microsoft partnership had "limited our ability to meet enterprises where they are," according to a report from The Verge.

Amazon's $50 billion OpenAI investment created a legal crisis that forced the restructuring

The proximate cause of Monday's restructuring was not a philosophical disagreement about AI safety or corporate governance. It was a $50 billion check from Amazon. In February, OpenAI announced that Amazon would invest up to $50 billion in the company — $15 billion upfront, with another $35 billion to follow when certain unspecified conditions were met. In exchange, OpenAI agreed to expand its existing cloud agreement with AWS by $100 billion over eight years and, most controversially, committed to making AWS the exclusive third-party distribution provider for Frontier, its new enterprise agent-building platform. OpenAI also agreed to co-develop "stateful runtime technology" on AWS Bedrock, the infrastructure layer that allows AI agents to maintain memory and context over extended tasks.

The problem was that OpenAI's existing contract with Microsoft almost certainly prohibited these arrangements. Microsoft held exclusive rights to any OpenAI product accessed through an API — a category that plainly included Frontier. On the very day OpenAI announced the Amazon deal, Microsoft issued a pointed public statement insisting that "Azure remains the exclusive cloud provider of stateless OpenAI APIs" and that "OpenAI's first party products, including Frontier, will continue to be hosted on Azure." The contradiction between the two announcements was stark, and it created immediate legal exposure.
The Financial Times reported in March that Microsoft was actively considering legal action to enforce its contractual rights. The situation placed OpenAI in an impossible position: it had made promises to Amazon that it seemingly could not keep under the terms of its Microsoft agreement.

Monday's deal resolves that impasse entirely. By converting Microsoft's license from exclusive to non-exclusive and explicitly granting OpenAI the right to serve products on any cloud, the new terms retroactively validate the Amazon arrangement and eliminate the legal overhang. Amazon CEO Andy Jassy wasted no time celebrating. "We're excited to make OpenAI's models available directly to customers on Bedrock in the coming weeks, alongside the upcoming Stateful Runtime Environment," he wrote on X, adding that the company would share more details at an event in San Francisco on Tuesday.

Inside the new financial terms that shift billions of dollars between the two AI giants

The financial mechanics of the new deal deserve careful parsing, because they reveal which side gave up what — and who came out ahead. Under the old arrangement, money flowed in both directions. When customers bought ChatGPT subscriptions or accessed OpenAI models through their own applications, OpenAI paid Microsoft a cut — reportedly 20 percent. Conversely, when enterprise customers accessed OpenAI models through Azure's API, Microsoft paid OpenAI a share of that revenue. This bilateral structure reflected the deep integration between the two companies: Microsoft was simultaneously OpenAI's investor, cloud provider, distribution partner, and largest customer.

The new deal makes the cash flow one-directional. Microsoft stops paying OpenAI entirely. OpenAI continues paying Microsoft its 20 percent share, but only through 2030, and now subject to a total cap whose precise dollar figure has not been disclosed.
Given that OpenAI's revenue is growing rapidly — the company was reportedly on pace to generate tens of billions annually — that cap could become material relatively quickly.

For Microsoft, the trade-off is straightforward: it sacrifices the exclusivity that made Azure the only gateway to OpenAI's models, but it gains immediate financial relief by eliminating its outbound revenue-share payments while continuing to collect inbound payments for several more years. And it retains approximately 27 percent ownership of OpenAI's for-profit entity, meaning it participates in the company's growth regardless of which cloud serves the workloads. Last quarter alone, Microsoft reported $7.5 billion in revenue from its OpenAI investment, according to TechCrunch's reporting.

For OpenAI, the calculus is different. It accepts a continued obligation to pay Microsoft through 2030, but it gains the commercial freedom to sell everywhere — a freedom that is arguably worth far more than the revenue-share savings. Enterprise customers overwhelmingly operate in multi-cloud environments. Being locked into Azure was not just a technical constraint; it was a sales objection that OpenAI's competitors, particularly Anthropic and Google, exploited relentlessly.

Why the disappearance of the AGI clause signals a new era for AI governance

One of the more philosophically intriguing aspects of Monday's announcement is what it does to the AGI provision that once governed the partnership. Under the original agreement, Microsoft's exclusive commercial rights were tied to a trigger: if OpenAI's board determined that the company had achieved AGI, certain terms — including Microsoft's access to the most advanced models — would change. The provision was meant to ensure that a truly superintelligent system would remain under the nonprofit board's control rather than being commercially exploited.
In practice, it created perverse incentives: OpenAI had a financial reason to never declare AGI, and Microsoft had a financial reason to argue that AGI had not been reached regardless of what the technology could actually do.

The new deal sidesteps this entirely. Microsoft's license now runs through a fixed calendar date — 2032 — "independent of OpenAI's technology progress," as the companies put it. The AGI trigger, a concept that once sat at the philosophical heart of the partnership, has been replaced by a spreadsheet. Andrew Curran, a close observer of OpenAI's governance, noted on X that language defining AGI had been removed from OpenAI's website, sharing a screenshot showing the change. The move drew sharp reactions. One commenter observed that "removing the definition = removing the accountability. whoever controls when AGI is declared controls a lot of commercial terms."

The shift reflects a broader maturation — or perhaps disillusionment — within the AI industry regarding AGI as a meaningful commercial or governance concept. When the original deal was struck, AGI felt like a distant, almost mythical threshold. Now, with models like GPT-5.5 demonstrating increasingly general capabilities, the term has become more of a marketing slogan than a technical benchmark. Replacing it with fixed dates and dollar caps is, in some sense, an admission that the industry has moved beyond the framework that once defined this partnership.

Multi-cloud AI competition intensifies as enterprises gain the power to choose

The most immediate beneficiary of the new arrangement is the enterprise customer. For years, organizations that wanted access to OpenAI's models had essentially one option: Azure. That constraint is now gone. Within weeks, according to Jassy, OpenAI's models will be available on AWS Bedrock alongside the stateful runtime environment that powers long-running AI agents. Google Cloud is presumably not far behind.
This multi-cloud availability arrives at a moment when the AI infrastructure market is undergoing rapid consolidation and expansion simultaneously. Meta recently committed $48 billion to cloud providers CoreWeave and Nebius. Amazon's investment in OpenAI, combined with its existing relationship with Anthropic — in which Amazon has invested up to $4 billion — positions AWS as a model-agnostic platform where enterprises can mix and match AI capabilities. Microsoft, meanwhile, has developed its own relationship with Anthropic, using Claude to power agentic products — a hedge against the very OpenAI dependency it spent billions creating.

The competitive dynamics are now genuinely complex. Microsoft competes with OpenAI in AI products (Copilot vs. ChatGPT), partners with OpenAI's rival Anthropic, and remains OpenAI's largest shareholder. OpenAI sells on Azure, AWS, and soon everywhere else, while building its own data centers. Amazon invests in both OpenAI and Anthropic. Google builds its own models while also hosting competitors on Vertex AI.

Jehangeer Hasan, a technology commentator, captured the mood on X, calling the announcement a "notable shift in the cloud AI landscape" that signals "intensifying multi-cloud competition and a push toward giving developers more flexibility instead of locking them into a single ecosystem." Chris Alexander, an engineer, offered a more candid assessment: "honestly Azure's OpenAI endpoints are so unreliable, we mostly just hit you all directly," adding that "it would be nice to have options in AWS or GCP for sure."

What the restructured deal means for the future of AI's biggest partnership

Several open questions remain. The precise dollar amount of the revenue-share cap has not been disclosed, and it will matter enormously as OpenAI's revenue scales. The meaning of "first on Azure" — whether it implies a meaningful exclusivity window or merely simultaneous availability — remains deliberately ambiguous.
And OpenAI's own infrastructure ambitions, including plans to build proprietary data centers, could eventually reduce its dependence on any third-party cloud, including Azure. Microsoft's position, while less dominant than before, is not as diminished as some early commentary suggested. It remains OpenAI's primary cloud provider, its largest shareholder, and a licensee of its technology through the end of the decade. It has diversified its own AI strategy with investments in Anthropic, its own Phi and MAI model families, and deep integration of AI across its product portfolio. The company reported $7.5 billion in OpenAI-related revenue last quarter — a figure that demonstrates the sheer financial scale of the relationship even in its loosened form. For OpenAI, the new agreement is a coming-of-age moment. The company that once depended on Microsoft for everything — capital, compute, distribution, and credibility — now operates as an independent force capable of striking multi-billion-dollar deals with Microsoft's biggest rivals. Sam Altman announced the changes on X with characteristic brevity: "We have updated our partnership with Microsoft." Seven years ago, when Microsoft CEO Satya Nadella and Altman first shook hands on a deal to commercialize artificial intelligence, the arrangement rested on the assumption that OpenAI needed Microsoft more than Microsoft needed OpenAI. Every clause — the exclusivity, the AGI trigger, the revenue share — reflected that original imbalance. Monday's restructuring is proof that the assumption no longer holds. The partnership that launched the generative AI revolution has survived, but the power dynamics that created it have not. In the AI industry, it turns out, the only thing that moves faster than the technology is the leverage.

Open source Xiaomi MiMo-V2.5 and V2.5-Pro are among the most efficient (and affordable) at agentic 'claw' tasks

4/27/2026

Xiaomi, the Chinese firm best known for its smartphones and electric vehicles, has lately been shipping some incredibly affordable and high-powered open source AI large language models. The trend continued today with the release of Xiaomi MiMo-V2.5 and Xiaomi MiMo-V2.5-Pro, both available under the permissive, enterprise-friendly MIT License, making them suitable for use in production in commercial applications.

Enterprises and individual/independent developers can now download either of the models (and more Xiaomi open source options) directly from Hugging Face, modify them as needed, and run them locally or on virtual private clouds as they see fit.

The most notable attribute of these models besides the open source licensing is that, according to Xiaomi's published benchmarks, they are among the most efficient available for agentic "claw" tasks, that is, powering systems such as OpenClaw, NanoClaw and Hermes Agent, in which users can communicate with them directly over third-party messaging apps and have the agents go off and complete tasks on the human user's behalf, such as making and publishing marketing content, running accounts, organizing email and scheduling, etc.

As Xiaomi's ClawEval benchmark chart shows, both MiMo-V2.5 and the Pro version in particular appear near the top left of the chart, indicating high performance in completing the benchmarked claw tasks while using the fewest tokens — saving the human user money, especially in a world where more and more services such as Microsoft's GitHub Copilot are moving to usage-based billing (charging the human behind the agents for each token used rather than imposing rate limits like Anthropic or providing an "all-you-can-eat" buffet-style subscription like OpenAI). In fact, the Pro model leads the open-source field with a 63.8% success rate, consuming only ~70K tokens per trajectory.
This is roughly 40–60% fewer tokens than those required by Anthropic Claude Opus 4.6, Google Gemini 3.1 Pro, and OpenAI GPT-5.4 to achieve comparable results.

By combining a massive 310B-parameter architecture with a highly efficient "active" footprint and a native 1-million-token context window, Xiaomi MiMo is challenging the dominance of closed-source frontier models from Google and OpenAI, especially when it comes to the latest and greatest craze in enterprise AI deployments — agentic tasks and "claws" similar to OpenClaw.

A two-pronged pincer

Xiaomi has released two distinct versions of the model to serve different ends of the development spectrum: MiMo-V2.5 (the "Omni" multimodal specialist) and MiMo-V2.5-Pro (the "Agent" specialist). While the base model provides native multimodality, the MiMo-V2.5-Pro is specifically engineered for "long-horizon coherence" and complex software engineering. On the GDPVal-AA (Elo) benchmark, the Pro model achieved a score of 1581, surpassing competitors like Kimi K2.6 and GLM 5.1.

Xiaomi researchers further released data on several high-complexity tasks performed autonomously by V2.5-Pro:

SysY Compiler in Rust: The model implemented a complete compiler from scratch—including lexer, parser, and RISC-V assembly backend—in 4.3 hours. Spanning 672 tool calls, the model achieved a perfect 233/233 score on hidden test suites, a task that typically takes a computer science major several weeks.

Full-Featured Video Editor: Over 11.5 hours and 1,868 tool calls, the model produced an 8,192-line desktop application featuring multi-track timelines and an export pipeline.

Analog EDA Optimization: In a graduate-level engineering task, the model optimized a Flipped-Voltage-Follower (FVF-LDO) regulator in the TSMC 180nm process. By iterating through an ngspice simulation loop, the model improved metrics like line regulation by 22x over its initial attempt.
These experiments highlight a "harness awareness" in V2.5-Pro, where the model actively manages its own memory and shapes its context to sustain coherence over thousands of sequential tool calls.

Over the API, Xiaomi is pricing the models at competitive rates for both domestic (Chinese) and international markets (like the U.S.). For overseas developers, the high-performance MiMo-V2.5-Pro is priced at $1.00 per million input tokens (for a cache miss) and $3.00 per million output tokens within context windows up to 256K. For ultra-long context tasks between 256K and 1M tokens, the cost doubles to $2.00 for input and $6.00 for output, though the architecture’s caching capabilities offer significant relief, reducing input costs to as little as $0.20 to $0.40 per million tokens upon a cache hit. Domestically, these rates are mirrored in yuan, with the Pro model starting at ¥7.00 per million input tokens for standard context and reaching ¥14.00 for the extended 1M range.

Meanwhile, the base model starts at just $0.40 USD for overseas input per million tokens and $2.00 per million output, putting it among the more affordable third of leading LLMs globally (see our chart below):

Prices in USD per million tokens; Total Cost is input plus output.

Model                 Input    Output   Total Cost  Source
Grok 4.1 Fast         $0.20    $0.50    $0.70       xAI
MiniMax M2.7          $0.30    $1.20    $1.50       MiniMax
MiMo-V2.5 Flash       $0.10    $0.30    $0.40       Xiaomi MiMo
Gemini 3 Flash        $0.50    $3.00    $3.50       Google
Kimi-K2.5             $0.60    $3.00    $3.60       Moonshot
MiMo-V2.5             $0.40    $2.00    $2.40       Xiaomi MiMo
MiMo-V2-Pro (≤256K)   $1.00    $3.00    $4.00       Xiaomi MiMo
GLM-5                 $1.00    $3.20    $4.20       Z.ai
GLM-5-Turbo           $1.20    $4.00    $5.20       Z.ai
DeepSeek V4 Pro       $1.74    $3.48    $5.22       DeepSeek
GLM-5.1               $1.40    $4.40    $5.80       Z.ai
Claude Haiku 4.5      $1.00    $5.00    $6.00       Anthropic
Qwen3-Max             $1.20    $6.00    $7.20       Alibaba Cloud
Gemini 3 Pro          $2.00    $12.00   $14.00      Google
GPT-5.2               $1.75    $14.00   $15.75      OpenAI
GPT-5.4               $2.50    $15.00   $17.50      OpenAI
Claude Sonnet 4.5     $3.00    $15.00   $18.00      Anthropic
Claude Opus 4.7       $5.00    $25.00   $30.00      Anthropic
GPT-5.5               $5.00    $30.00   $35.00      OpenAI
GPT-5.4 Pro           $30.00   $180.00  $210.00     OpenAI

To lower the barrier for agentic development further, Xiaomi has made cache writing free of charge for a limited time across all models, alongside a total fee waiver for the entire MiMo-V2.5-TTS suite, which includes its specialized voice cloning and design features. This pricing logic is clearly designed to accelerate the transition from simple chat applications to persistent, long-horizon agents that can operate at a fraction of the cost of legacy frontier models.

Xiaomi has also introduced an overhauled version of its subscription offerings, called the "Token Plan," now available in four levels:

Lite ("Starter Pack"): 720 million credits for $63.36 USD per year
Standard: 2.4 billion credits for $168.96 per year
Pro: 8.4 billion credits for $528.00 per year (designed for enterprise use cases)
Max: 19.2 billion credits for $1,056.00 per year (aimed at high-intensity coding enthusiasts)

Beyond credit allotments, all plans include preferential API rates, a 20% discount for off-peak calls, and "Day-0" support for popular coding scaffolds like Cursor, Zed, and Claude Code.

However, both through the API and via the Token Plan, accessing the Xiaomi models from China may present barriers or additional compliance and regulatory risks to U.S.-based enterprise customers. As such, the best bet for U.S. enterprises concerned about relying on Chinese tech but wanting to take advantage of the low cost and open source models is likely setting up their own virtual private clouds or local servers, downloading the model weights, and running the models domestically.

MoE architecture but divergent training regimens for V2.5 and V2.5-Pro

At the heart of MiMo-V2.5 is a Sparse Mixture-of-Experts (MoE) architecture. While the model boasts a total of 310 billion parameters, only 15 billion are "active" during any given inference cycle. Meanwhile, V2.5-Pro is a 1.02-trillion-parameter Mixture-of-Experts model with 42 billion active parameters.
In either case, the design functions much like a specialized research hospital: while the facility has hundreds of doctors (parameters), only the specific specialists required for a particular case (query) are called into the room. This massive increase in parameter volume for the Pro version provides the "neural capacity" required for the deep, multi-step reasoning found in complex software engineering and long-horizon tasks, as though even more specialists are available in an even larger hospital.

According to Xiaomi's blog post, the regular V2.5 follows a rigorous five-stage evolution:

Text Pre-training: Building a massive language backbone on 48 trillion tokens.
Projector Warmup: Aligning in-house audio and visual encoders with the language core.
Multimodal Pre-training: Scaling across high-quality cross-modal data.
Agentic Post-training: Progressively extending the context window from 32K to 1M tokens.
RL and MOPD: Utilizing Reinforcement Learning and Multimodal Preference Optimization (MOPD) to sharpen real-world reasoning and perception.

The backbone utilizes a hybrid sliding-window attention architecture, inherited from MiMo-V2-Flash, which optimizes how the model "remembers" long-range information. This technical foundation enables MiMo-V2.5 to see, hear, and reason natively, rather than relying on external "plug-in" tools for visual or auditory processing.

Conversely, the training of MiMo-V2.5-Pro prioritizes "action space" over sensory perception. Instead of sensory alignment, the Pro model’s training focus shifts toward scaling post-training compute. This process is designed to instill "harness awareness," where the model is specifically trained to manage its own memory and context within autonomous agent scaffolds like Claude Code or OpenCode. While the base V2.5 model is trained to reason across modalities, the Pro version is trained to sustain coherence across more than a thousand sequential tool calls.
The standard V2.5 model balances local and global attention to maintain multimodal perception. The Pro model, however, utilizes an increased hybrid attention ratio—evolving from the 5:1 ratio of previous generations to a more aggressive 7:1 ratio. This allows the Pro model to "skim" the vast majority of its context while applying high-density attention to the specific 15% of data most relevant to its current objective, a critical feature for debugging large repositories or optimizing graduate-level circuits.

Finally, while both models undergo Reinforcement Learning (RL) and Multimodal Preference Optimization (MOPD), the objectives of these stages differ. For MiMo-V2.5, the RL stage is used to sharpen perception and multimodal reasoning. For MiMo-V2.5-Pro, RL is focused on instruction following within agentic scenarios, ensuring the model adheres to subtle requirements embedded deep within ultra-long contexts and recovers gracefully from errors during autonomous execution. This results in the Pro model's "self-correcting" discipline, as seen in its ability to diagnose and fix regressions during the 4.3-hour SysY compiler build.

Full MIT License is perfect for enterprise use cases

In a move that distinguishes it from many "open" models that include restrictive "Acceptable Use" policies, Xiaomi has released MiMo-V2.5 under the MIT License. The MIT License is the gold standard of permissive software licensing. For developers and enterprises, this means:

No Authorization Required: Companies can deploy the model commercially without seeking explicit permission from Xiaomi.
Continued Training: Developers are free to fine-tune the model on proprietary data and even release those derivative weights.
Unrestricted Commercial Use: There are no revenue caps or user-base limits that often plague "community" licenses.
By choosing MIT over a custom "open weights" license, Xiaomi is positioning MiMo as the foundational infrastructure for the next generation of AI agents, effectively inviting the global developer community to treat the model as a public utility.

Xiaomi's background: from smartphones and EVs to Chinese open source AI darling

Xiaomi’s pivot toward frontier AI agents is the logical culmination of a decade spent building one of the world's densest hardware-software flywheels. Founded in 2010 as a smartphone disruptor, the Beijing-based company has executed a high-stakes transition into a vertically integrated powerhouse defined by its "Human x Car x Home" strategy. This ecosystem now encompasses over 823 million connectable smart devices unified under the HyperOS architecture.

The company’s 2024 entry into the automotive sector with the SU7 and the subsequent high-performance YU7 SUV served as a proof of concept for this integration, positioning Xiaomi as a direct competitor to global luxury marques. By investing 200 billion yuan ($29B USD) into foundational R&D for chips and operating systems, Xiaomi has moved beyond consumer electronics assembly; it has become an architect of the "action space," using its massive hardware footprint as the primary testing ground for the agentic intelligence found in the MiMo-V2.5 series.

Ecosystem support

The release has been met with immediate "Day-0" support from the broader AI ecosystem. The MiMo team announced that SGLang and vLLM—two of the most popular high-throughput inference engines—supported the V2.5 series at launch. This was made possible through hardware partnerships with AWS, AMD, T-HEAD, and Enflame, ensuring the model can run efficiently on everything from cloud-based H100s to domestic Chinese accelerators.
Fuli Luo, the project lead at Xiaomi MiMo and a former key member of the DeepSeek team, underscored the philosophy behind the release on X (formerly Twitter): "A model's value isn't measured by rankings alone — it's measured by the problems it solves. Let's build with MiMo now!" To kickstart this building phase, Luo announced a 100-trillion free token grant for builders and creators. This massive incentive is designed to lower the barrier to entry for developers who want to experiment with the 1M context window without immediate financial risk.

The economic realignment: open source vs. metered proprietary

The launch arrives at a critical juncture for AI economics. The shift toward usage-based billing marks the definitive end of the "all-you-can-eat" buffet era for AI services, a trend underscored by GitHub’s announcement today that its AI coding assistant GitHub Copilot will transition all plans to metered, token-based credits. As seat-based predictability gives way to consumption-driven costs, premium agentic workflows—which can consume millions of tokens in a single reasoning session—are becoming increasingly difficult for enterprises to budget. User sentiment has turned predictably cynical, with developers lamenting that they will "get less, but pay the same price" as subscriptions convert into finite allotments.

This pricing evolution significantly enhances the strategic appeal of the MiMo series. By releasing under a permissive MIT License, Xiaomi allows organizations to bypass the escalating "SaaS tax" and reclaim financial predictability through private deployment. Crucially, Xiaomi has eliminated the "context tax" for its API. The 1-million-token context window is now billed at the standard rate—1 token = 1 credit for V2.5 and 2 credits for the Pro version—with no additional multiplier. This stands in stark contrast to the industry-wide move toward session-based caps, positioning MiMo as a refuge for cost-sensitive, high-volume development.

Analysis for enterprises

The launch of MiMo-V2.5 is more than just a weight drop; it is a declaration of independence for the open-source community. By matching Claude Sonnet 4.6 in multimodal agentic work and Gemini 3 Pro in video understanding, Xiaomi has proven that the gap between "closed-door" labs and open research is effectively closed. With the MIT license as a catalyst and a 100T token grant as fuel, the coming months will likely see a surge in specialized, agentic applications built on the MiMo backbone.

Confirming the project's ambitious trajectory, the team noted they are already training the next generation, focusing on "deeper reasoning" and "richer real-world grounding". For now, MiMo-V2.5 stands as a testament to the power of sparse architectures and permissive licensing in the race toward functional AGI.
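
For teams that want to turn the per-million-token API prices quoted above into a per-task budget, the arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not Xiaomi's billing logic: the `Rate` class, the function names, and the decision to treat "256K" as 256,000 tokens are all assumptions made for the example, and cache-hit discounts and off-peak discounts are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """List price in USD per one million tokens (cache miss)."""
    input_per_m: float
    output_per_m: float


# Overseas list prices as quoted in the article.
MIMO_V25 = Rate(input_per_m=0.40, output_per_m=2.00)
MIMO_V25_PRO_STANDARD = Rate(input_per_m=1.00, output_per_m=3.00)  # context up to 256K
MIMO_V25_PRO_LONG = Rate(input_per_m=2.00, output_per_m=6.00)      # context from 256K to 1M

# Assumption for this sketch: "256K" is treated as 256,000 tokens.
LONG_CONTEXT_THRESHOLD = 256_000


def pro_rate(context_tokens: int) -> Rate:
    """Pick the Pro price tier based on how large the context window is."""
    if context_tokens <= LONG_CONTEXT_THRESHOLD:
        return MIMO_V25_PRO_STANDARD
    return MIMO_V25_PRO_LONG


def cost_usd(rate: Rate, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request, ignoring cache hits and discounts."""
    total = input_tokens * rate.input_per_m + output_tokens * rate.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 60,000 input tokens and 10,000 output tokens.
    print(f"V2.5:     ${cost_usd(MIMO_V25, 60_000, 10_000):.3f}")
    print(f"V2.5-Pro: ${cost_usd(pro_rate(60_000), 60_000, 10_000):.3f}")
```

The 60,000/10,000 split is invented for illustration; the article reports only a total of roughly 70K tokens per ClawEval trajectory for the Pro model, without an input/output breakdown.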

New AI framework autonomously optimizes training data, architectures and algorithms — outperforming human baselines

4/27/2026

AI R&D runs on a cycle of hypothesis, experiment, and analysis — each step demanding substantial manual engineering effort. ASI-EVOLVE, a new framework developed by researchers at the Generative Artificial Intelligence Research Lab (SII-GAIR), aims to close that bottleneck by automating the full optimization loop for training data, model architectures, and learning algorithms. Designed as an agentic system for AI-for-AI research, it uses a continuous "learn-design-experiment-analyze" cycle to automate the optimization of the foundational AI stack.

In experiments, this self-improvement loop autonomously discovered novel designs that significantly outperformed state-of-the-art human baselines. The system generated novel language model architectures, improved pretraining data pipelines to boost benchmark scores by over 18 points, and designed highly efficient reinforcement learning algorithms. For enterprise teams running repeated optimization cycles on their AI systems, the framework offers a path to reducing manual engineering overhead while matching or exceeding the performance of human-designed baselines.

The data and design bottleneck

Engineering teams can only explore a tiny fraction of the vast possible design space for AI models at any given time. Executing experimental workflows requires costly manual effort and frequent human intervention. And the insights gained from these expensive cycles are often siloed as individual intuition or experience, making it difficult to systematically preserve and transfer that knowledge to future projects or across different teams. These constraints fundamentally limit the pace and scale of AI innovation.

AI has made incredible strides in scientific discovery, ranging from specialized tools like AlphaFold solving discrete biological problems to agentic systems answering basic scientific questions.
However, current frameworks still struggle with open-ended AI innovation and are mostly limited to narrow optimization within very specific constraints. Advancing core AI capabilities is far more complex. It requires modifying large interdependent codebases, running compute-heavy experiments that consume tens to hundreds of GPU hours, and analyzing multi-dimensional feedback from training dynamics.

“Existing frameworks have not yet demonstrated that AI can operate effectively in this regime in a unified way, nor that it can generate meaningful advances across the three foundational pillars of AI development rather than within a single narrowly scoped setting,” the researchers write.

How ASI-EVOLVE learns to research

To overcome the limitations of manual R&D, ASI-EVOLVE operates on a continuous loop between prior knowledge, hypothesis generation, experimentation, and refinement. The system learns relevant knowledge and historical experience from existing databases, designs a candidate program representing its next hypothesis, runs experiments to obtain evaluation signals, and analyzes outcomes into reusable, human-readable lessons that it feeds back into its knowledge base.

There are two key components that drive ASI-EVOLVE. The “Cognition Base” acts as the system's foundational domain expertise. To speed up the search process, the system is pre-loaded with human knowledge, task-relevant heuristics, and known pitfalls extracted from existing literature. This steers the exploration toward promising directions right from the first iteration. The second component is the “Analyzer,” which tackles the complex, multi-dimensional feedback from the experiments. It processes raw training logs, benchmark results, and efficiency traces, distilling them into compact, actionable insights and causal analyses.

Several other complementary modules bring the framework together.
A “Researcher” agent reviews prior knowledge from the cognition base and past experimental results to generate new hypotheses, either proposing localized code modifications or writing new programs. The “Engineer” component runs the actual experiments. Because AI training trials are incredibly costly, the Engineer is equipped with efficiency measures like wall-clock limits and early rejection quick tests to filter out flawed candidate programs before they consume excessive GPU hours. Finally, the “Database” serves as the system's persistent memory, storing the code, research motivations, raw results, and the Analyzer's final reports for every iteration, ensuring that insights compound systematically over time.

By unifying these components, ASI-EVOLVE ensures that an AI agent systematically learns from complex, real-world experimental feedback without requiring constant human intervention. While previous frameworks are designed to evolve candidate solutions, “ASI-EVOLVE evolves cognition itself,” the researchers write. “Accumulated experience and distilled insights are continuously stored and retrieved to inform future exploration, ensuring that the system grows not only in the quality of its solutions but in its capacity to reason about where to search next.”

ASI-EVOLVE in action

In their experiments, the researchers showed that ASI-EVOLVE can successfully improve data curation, model architectures, and learning algorithms to create better AI systems.

For real-world enterprise applications, high-quality data is a persistent bottleneck. When tasked with designing category-specific cleaning strategies for massive pretraining corpora, ASI-EVOLVE inspected data samples and diagnosed quality issues like HTML artifacts and formatting inconsistencies. The system autonomously formulated custom curation rules, discovering that systematic cleaning combined with domain-aware preservation rules is far more effective than aggressive filtering.
In benchmark tests, 3B-parameter models trained on the AI-curated data saw an average score boost of nearly 4 points over models trained on raw data. The gains were highest in knowledge-intensive tasks, with performance increasing by over 18 points on Massive Multitask Language Understanding (MMLU), an LLM benchmark that covers tasks across STEM, humanities, and social sciences. Beyond data, the system proved highly capable at neural architecture design. Across 1,773 autonomous exploration rounds, it generated 105 novel linear attention architectures that surpassed DeltaNet, a highly efficient human-designed baseline. To achieve these results, ASI-EVOLVE developed multi-scale routing mechanisms that dynamically adjust the model's computational budget based on the specific content of the input. Finally, in reinforcement learning algorithm design, ASI-EVOLVE discovered novel optimization mechanisms. It designed algorithms that outperformed the competitive GRPO baseline on complex mathematical reasoning benchmarks such as AMC32 and AIME24. One successful variant invented a "Budget-Constrained Dynamic Radius" that keeps model updates within a defined budget, effectively stabilizing training on noisy data. What this means for enterprise AI Enterprise AI workflows constantly require optimizations to existing systems, from fine-tuning open-source models on proprietary data to making small changes to architectures and algorithms. Usually, the computational resources and engineering hours required to carry out such efforts are immense and beyond the capabilities of most organizations. As a result, many are left to run unoptimized versions of standard AI models. The research team says the framework is designed so enterprises can integrate proprietary domain knowledge into the cognition repository and allow the autonomous loop to iterate on internal AI systems. 
The research team has open-sourced the ASI-EVOLVE code, making the foundational framework available for developers and product builders.

Why supply chains are the proving ground for automation‑led iPaaS

4/27/2026

Presented by Edgeverve Supply chains are where legacy integration models reach their limits. As partner networks expand and operational volatility increases, traditional middleware is buckling under costs and complexity. That’s why supply chain has emerged as a proving ground for automation‑led integration Platform as a Service (iPaaS), a next-generation model designed to absorb constant change without rewriting the stack. This article takes a look at today’s supply chains, the limits of legacy integration, how automation changes the iPaaS model, possible downsides to an upgrade, and questions leaders should be asking about whether next-gen iPaaS makes sense for them. Why now? Supply chains have outgrown their integration models Supply chains have always been complex. What’s new is the pace of change. Networks now span hundreds of suppliers, logistics providers, and distributors, each running different systems and data standards. At the same time, expectations for real‑time visibility and rapid response continue to rise. The global supply chain visibility software market, which is the problem space that automation-led iPaaS aims to address, was estimated at about $3.3 billion in 2025 and is forecast to triple by 2034. But enterprises clearly need more than just visibility. Industry surveys show that more than 90% of supply chain leaders are reworking their operating models in response to volatility, including tariff changes, and more than half report using AI in at least some supply‑chain functions. (See this 2025 PwC survey.) That combination — structural change and new automation expectations — puts the spotlight on integration. Legacy integration is simply mismatched with the reality on the ground. Traditional integration architecture assumed fixed partners, predictable schemas, infrequent change and general stability. That model worked when supply chains were slower and more centralized. Today’s supply chains operate under different conditions. 
Partners are added and removed constantly. Data structures evolve with new products, regulations, and sustainability requirements. The old corner cases are no longer so exceptional. Legacy integration’s limits, pain points and debt Let’s look a little closer at the status quo. Across supply‑chain environments, legacy integration approaches tend to struggle with the same structural limitations: Inflexibility and poor scalability as partner volumes grow High upfront and ongoing costs driven by custom development Heavy maintenance demands just to keep integrations running Scarcity of specialized IT resources required for changes Heterogeneous systems and applications across partners Brittle point‑to‑point (P2P) integrations that don’t age well Code‑dependent data mapping and transformation Different tools for B2B integrations and internal applications In many enterprise domains, aging and brittle P2P integration — to cite just one of these limitations — creates inconvenience. In supply chains, it creates disruption. Missed or delayed messages can turn into shipment delays, excess inventory, or planning decisions based on old data. That’s why technical integration debt accumulates so fast here. Few other enterprise domains combine that level of external dependency with the need to keep operations running continuously. What next-gen iPaaS changes, and why AI matters Next‑gen iPaaS platforms don’t just relocate integration to the cloud. That’s already table stakes in the broader iPaaS market, which analysts have been tracking for a dozen years. The defining shift is how the new platforms handle change. Instead of treating integrations as static assets, they manage integrations more as living workflows. Automation‑led iPaaS emphasizes faster partner onboarding, reusable process logic, and AI‑assisted mapping that reduces manual effort when schemas change. (And change they do, whether JSON APIs or event payloads or compliance data.) 
Errors also surface earlier and are easier to contain. Because supply‑chain data mixes structured transactions with semi‑structured documents, inconsistent partner conventions, and context‑dependent exceptions, it is a natural candidate for AI-assisted normalization and validation. Used rightly, AI reduces human effort without eliminating governance.

Sensitivity to costs and disruption

Supply chains operate under tight economic constraints. Margins are thin, disruptions are expensive, and technology investments must justify themselves quickly. Long, heavily customized integration programs are hard to defend. Automation‑led iPaaS aligns better with this reality: migrations go faster thanks to a mix of AI-driven migration tools, no-code/low-code configurators with assisted co-pilots, and out-of-the-box (OOB) support for standards, connectors and more. While integration upgrades have a reputation for being disruptive, the emerging adoption pattern for next-gen iPaaS looks different. Here we’re seeing supply chain leaders introduce platforms incrementally, allowing legacy systems to run while new automation absorbs change. The goal isn’t to pause operations, but to reduce the “blast radius” of change. Or, to shift metaphors, it’s actually possible in this case to keep the plane in the air while gradually rebuilding the supply-chain integration engine.

Questions supply chain leaders should be asking

Taken together, that reframes the decision. Rather than treating AI-driven iPaaS as a purely technical upgrade, supply chain leaders may be better served by asking a few operational questions:
How quickly can we onboard or offboard a trading partner today? What slows that process down?
Where do integration failures surface first: IT dashboards, or missed deliveries and distorted inventory signals?
How much human effort goes into maintaining mappings, handling exceptions, and reconciling data as formats change?
Are our integration workflows designed to absorb volatility, or do they assume stability that no longer exists? If parts of our supply chain became more autonomous — as with agentic AI — would our integration layer enable that, or block it? Let’s pause on that last question. Autonomous agents don’t replace integration; they depend on it. Any system capable of acting still requires governed access to data and reliable execution across systems. Automation-led iPaaS provides much of that requisite groundwork: event-driven workflows, permissions, observability, and the ability to act across organizational boundaries. “If you can make it there…” Supply-chain leaders aren’t considering integration upgrades because they want better middleware. They’re doing it because volatility has become permanent. Because the attendant costs and complexity have created an unmistakable and unbearable strain. Automation-led iPaaS promises relief for this highly stressed enterprise domain. With apologies to Frank Sinatra, if it works in supply chains, it’s likely to work anywhere. N. Shashidar is SVP & Global Head, Product Management at EdgeVerve. VentureBeat newsroom and editorial staff were not involved in the creation of this content.

RAG precision tuning can quietly cut retrieval accuracy by 40%, putting agentic pipelines at risk

4/27/2026

Enterprise teams that fine-tune their RAG embedding models for better precision may be unintentionally degrading the retrieval quality those pipelines depend on, according to new research from Redis. The paper, "Training for Compositional Sensitivity Reduces Dense Retrieval Generalization," tested what happens when teams train embedding models for compositional sensitivity. That is the ability to catch sentences that look nearly identical but mean something different — "the dog bit the man" versus "the man bit the dog," or a negation flip that reverses a statement's meaning entirely. That training consistently broke dense retrieval generalization, how well a model retrieves correctly across broad topics and domains it wasn't specifically trained on. Performance dropped by 8 to 9 percent on smaller models and by 40 percent on a current mid-size embedding model teams are actively using in production. The findings have direct implications for enterprise teams building agentic AI pipelines, where retrieval quality determines what context flows into an agent's reasoning chain. A retrieval error in a single-stage pipeline returns a wrong answer. The same error in an agentic pipeline can trigger a cascade of wrong actions downstream. Srijith Rajamohan, AI Research Leader at Redis and one of the paper's authors, said the finding challenges a widespread assumption about how embedding-based retrieval actually works. "There's this general notion that when you use semantic search or similar semantic similarity, we get correct intent. That's not necessarily true," Rajamohan told VentureBeat. "A close or high semantic similarity does not actually mean an exact intent." The geometry behind the retrieval tradeoff Embedding models work by compressing an entire sentence into a single point in a high-dimensional space, then finding the closest points to a query at retrieval time. That works well for broad topical matching — documents about similar subjects end up near each other. 
The problem is that two sentences with nearly identical words but opposite meanings also end up near each other, because the model is working from word content rather than structure. That is what the research quantified. When teams fine-tune an embedding model to push structurally different sentences apart — teaching it that a negation flip which reverses a statement's meaning is not the same as the original — the model uses representational space it was previously using for broad topical recall. The two objectives compete for the same vector. The research also found the regression is not uniform across failure types. Negation and spatial flip errors improved measurably with structured training. Binding errors — where a model confuses which modifier applies to which word, such as which party a contract obligation falls on — barely moved. For enterprise teams, that means the precision problem is harder to fix in exactly the cases where getting it wrong has the most consequences. The reason most teams don't catch it is that fine-tuning metrics measure the task being trained for, not what happens to general retrieval across unrelated topics. A model can show strong improvement on near-miss rejection during training while quietly regressing on the broader retrieval job it was hired to do. The regression only surfaces in production. Rajamohan said the instinct most teams reach for — moving to a larger embedding model — does not address the underlying architecture. "You can't scale your way out of this," he said. "It's not a problem you can solve with more dimensions and more parameters." Why the standard alternatives all fall short The natural instinct when retrieval precision fails is to layer on additional approaches. The research tested several of them and found each fails in a different way. Hybrid search. Combining embedding-based retrieval with keyword search is already standard practice for closing precision gaps. 
But Rajamohan said keyword search cannot catch the failure mode this research identifies, because the problem is not missing words — it is misread structure. "If you have a sentence like 'Rome is closer than Paris' and another that says 'Paris is closer than Rome,' and you do an embedding retrieval followed by a text search, you're not going to be able to tell the difference," he said. "The same words exist in both sentences." MaxSim reranking. Some teams add a second scoring layer that compares individual query words against individual document words rather than relying on the single compressed vector. This approach, known as MaxSim or late interaction and used in systems like ColBERT, did improve relevance benchmark scores in the research. But it completely failed to reject structural near-misses, assigning them near-identity similarity scores. The problem is that relevance and identity are different objectives. MaxSim is optimized for the former and blind to the latter. A team that adds MaxSim and sees benchmark improvement may be solving a different problem than the one they have. Cross-encoders. These work by feeding the query and candidate document into the model simultaneously, letting it compare every word against every word before making a decision. That full comparison is what makes them accurate — and what makes them too expensive to run at production scale. Rajamohan said his team investigated them. They work in the lab and break under real query volumes. Contextual memory. Also sometimes referred to as agentic memory, these systems are increasingly cited as the path beyond RAG, but Rajamohan said moving to that type of architecture does not eliminate the structural retrieval problem. Those systems still depend on retrieval at query time, which means the same failure modes apply. The main difference is looser latency requirements, not a precision fix. 
The two-stage fix the research validated The common thread across every failed approach is the same: a single scoring mechanism trying to handle both recall and precision at once. The research validated a different architecture: stop trying to do both jobs with one vector, and assign each job to a dedicated stage. Stage one: recall. The first stage works exactly as standard dense retrieval does today — the embedding model compresses documents into vectors and retrieves the closest matches to a query. Nothing changes here. The goal is to cast a wide net and bring back a set of strong candidates quickly. Speed and breadth are what matter at this stage, not perfect precision. Stage two: precision. The second stage is where the fix lives. Rather than scoring candidates with a single similarity number, a small learned Transformer model examines the query and each candidate at the token level — comparing individual words against individual words to detect structural mismatches like negation flips or role reversals. This is the verification step the single-vector approach cannot perform. The results. Under end-to-end training, the Transformer verifier outperformed every other approach the research tested on structural near-miss rejection. It was the only approach that reliably caught the failure modes the single-vector system missed. The tradeoff. Adding a verification stage costs latency. The latency cost depends on how much verification a team runs. For precision-sensitive workloads like legal or accounting applications, full verification at every query is warranted. For general-purpose search, lighter verification may be sufficient. The research grew out of a real production problem. Enterprise customers running semantic caching systems were getting fast but semantically incorrect responses back — the retrieval system was treating similar-sounding queries as identical even when their meaning differed. 
The two-stage architecture is Redis's proposed fix, with incorporation into its LangCache product on the roadmap but not yet available to customers. What this means for enterprise teams The research does not require enterprise teams to rebuild their retrieval pipelines from scratch. But it does ask them to pressure-test assumptions most teams have never examined — about what their embedding models are actually doing, which metrics are worth trusting and where the real precision gaps live in production. Recognize the tradeoff before tuning around it. Rajamohan said the first practical step is understanding the regression exists. He evaluates any LLM-based retrieval system on three criteria: correctness, completeness and usefulness. Correctness failures cascade directly into the other two, which means a retrieval system that scores well on relevance benchmarks but fails on structural near-misses is producing a false sense of production readiness. RAG is not obsolete — but know what it can't do. Rajamohan pushed back firmly on claims that RAG has been superseded. "That's a massive oversimplification," he said. "RAG is a very simple pipeline that can be productionized by almost anyone with very little lift." The research does not argue against RAG as an architecture. It argues against assuming a single-stage RAG pipeline with a fine-tuned embedding model is production-ready for precision-sensitive workloads. The fix is real but not free. For teams that do need higher precision, Rajamohan said the two-stage architecture is not a prohibitive implementation lift, but adding a verification stage costs latency. "It's a mitigation problem," he said. "Not something we can actually solve."
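The two-stage architecture described in this article can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not Redis's implementation: `embed` is an order-insensitive bag-of-words stand-in for a dense embedding model, and `verify` is a simple order-sensitive token comparison standing in for the small learned Transformer verifier the research describes. The point is only to show why stage one alone cannot separate "Rome is closer than Paris" from "Paris is closer than Rome," and where a token-level second stage fits.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and strip periods so word order is the only difference left.
    return text.lower().replace(".", "").split()


def embed(text):
    # Hypothetical stand-in for a dense embedding model: a bag of words,
    # which (like a single compressed vector) ignores word order.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recall_stage(query, docs, k=3):
    # Stage one: cast a wide net using vector similarity only.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def verify(query, candidate):
    # Stage two (toy): compare tokens position by position. A real verifier
    # would be a learned model; this stand-in only rejects reordered text.
    return tokenize(query) == tokenize(candidate)


def two_stage(query, docs, k=3):
    # Keep only the recalled candidates that pass token-level verification.
    return [d for d in recall_stage(query, docs, k) if verify(query, d)]
```

With the documents "Paris is closer than Rome" and "Rome is closer than Paris," stage one scores both as equally close to the query "Rome is closer than Paris," because the same words appear in both. Only the second stage separates them, at the cost of an extra pass over each candidate.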

Monitoring LLM behavior: Drift, retries, and refusal patterns

4/26/2026

The stochastic challenge Traditional software is predictable: Input A plus function B always equals output C. This determinism allows engineers to develop robust tests. On the other hand, generative AI is stochastic and unpredictable. The exact same prompt often yields different results on Monday versus Tuesday, breaking the traditional unit testing that engineers know and love. To ship enterprise-ready AI, engineers cannot rely on mere “vibe checks” that pass today but fail when customers use the product. Product builders need to adopt a new infrastructure layer: The AI Evaluation Stack. This framework is informed by my extensive experience shipping AI products for Fortune 500 enterprise customers in high-stakes industries, where “hallucination” is not funny — it’s a huge compliance risk. Defining the AI evaluation paradigm Traditional software tests are binary assertions (pass/fail). While some AI evals use binary asserts, many evaluate on a gradient. An eval is not a single script; it is a structured pipeline of assertions — ranging from strict code syntax to nuanced semantic checks — that verify the AI system’s intended function. The taxonomy of evaluation checks To build a robust, cost-effective pipeline, asserts must be separated into two distinct architectural layers: Layer 1: Deterministic assertions A surprisingly large share of production AI failures aren't semantic "hallucinations" — they are basic syntax and routing failures. Deterministic assertions serve as the pipeline's first gate, using traditional code and regex to validate structural integrity. Instead of asking if a response is "helpful," these assertions ask strict, binary questions: Did the model generate the correct JSON key/value schema? Did it invoke the correct tool call with the required arguments? Did it successfully slot-fill a valid GUID or email address? 
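The binary questions above can be sketched as a small Python check. This is a minimal illustration, not the author's implementation: the payload shape (`"tool"` and `"arguments"` keys), the function names, and the simple email pattern are all assumptions chosen for the example.

```python
import json
import re
import uuid

# Deliberately simple email pattern, assumed for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid_guid(value):
    # uuid.UUID raises ValueError for strings that are not well-formed GUIDs.
    try:
        uuid.UUID(str(value))
        return True
    except ValueError:
        return False


def is_valid_email(value):
    return isinstance(value, str) and EMAIL_RE.match(value) is not None


def check_tool_call(raw_output, expected_tool, required_args):
    """Return (passed, reason) for one model output.

    required_args maps each argument name to a validator function.
    The {"tool": ..., "arguments": {...}} shape is a hypothetical schema.
    """
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(payload, dict):
        return False, "output is not a JSON object"
    if payload.get("tool") != expected_tool:
        return False, "wrong or missing tool call"
    args = payload.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments missing"
    for name, validator in required_args.items():
        if name not in args:
            return False, f"missing argument: {name}"
        if not validator(args[name]):
            return False, f"invalid argument: {name}"
    return True, "ok"
```

Calling `check_tool_call("I found the customer.", "get_customer_record", {"customer_id": is_valid_guid})` fails at the first step, because conversational text does not parse as JSON. No model call is involved, so a check like this is cheap enough to run on every test case.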
// Example: Layer 1 Deterministic Tool Call Assertion
{
  "test_scenario": "User asks to look up an account",
  "assertion_type": "schema_validation",
  "expected_action": "Call API: get_customer_record",
  "actual_ai_output": "I found the customer.",
  "eval_result": "FAIL - AI hallucinated conversational text instead of generating the required API payload."
}

In the example above, the test failed instantly because the model generated conversational text instead of the required tool call payload. Architecturally, deterministic assertions must be the first layer of the stack, operating on a computationally inexpensive "fail-fast" principle. If a downstream API requires a specific schema, a malformed JSON string is a fatal error. By failing the evaluation immediately at this layer, engineering teams prevent the pipeline from triggering expensive semantic checks (Layer 2) or wasting valuable human review time (Layer 3).

Layer 2: Model-based assertions

When deterministic assertions pass, the pipeline must evaluate semantic quality. Because natural language is fluid, traditional code cannot easily assert if a response is "helpful" or "empathetic." This introduces model-based evaluation, commonly referred to as "LLM-as-a-Judge" or "LLM-Judge." While using one non-deterministic system to evaluate another seems counterintuitive, it is an exceptionally powerful architectural pattern for use cases requiring nuance. It is virtually impossible to write a reliable regex to verify if a response is "actionable" or "polite." While human reviewers excel at this nuance, they cannot scale to evaluate tens of thousands of CI/CD test cases. Thus, the LLM-as-a-Judge becomes the scalable proxy for human discernment.
3 critical inputs for model-based assertions However, model-based assertions only yield reliable data when the LLM-as-a-Judge is provisioned with three critical inputs: A state-of-the-art reasoning model: The Judge must possess superior reasoning capabilities compared to the production model. If your app runs on a smaller, faster model for latency, the judge must be a frontier reasoning model to approximate human-level discernment. A strict assessment rubric: Vague evaluation prompts ("Rate how good this answer is") yield noisy, stochastic evaluations. A robust rubric explicitly defines the gradients of failure and success. (For example, a "Helpfulness" rubric should define Score 1 as an irrelevant refusal, Score 2 as addressing the prompt but lacking actionable steps, and Score 3 as providing actionable next steps strictly within context.) Ground truth (golden outputs): While the rubric provides the rules, a human-vetted "expected answer" acts as the answer key. When the LLM-Judge can compare the production model's output against a verified Golden Output, its scoring reliability increases dramatically. Architecture: The offline vs online pipeline A robust evaluation architecture requires two complementary pipelines. The online pipeline monitors post-deployment telemetry, while the offline pipeline provides the foundational baseline and deterministic constraints required to evaluate stochastic models safely. The offline evaluation pipeline The offline pipeline's primary objective is regression testing — identifying failures, drift, and latency before production. Deploying an enterprise LLM feature without a gating offline evaluation suite is an architectural anti-pattern; it is the equivalent of merging uncompiled code into a main branch. Process 1. Curating the golden dataset The offline lifecycle begins by curating a "golden dataset" — a static, version-controlled repository of 200 to 500 test cases representing the AI's full operational envelope. 
Each case pairs an exact input payload with an expected "golden output" (ground truth). Crucially, this dataset must reflect expected real-world traffic distributions. While most cases cover standard "happy-path" interactions, engineers must systematically incorporate edge cases, jailbreaks, and adversarial inputs. Evaluating "refusal capabilities" under stress remains a strict compliance requirement. Example test case payload (standard tool use): Input: "Schedule a 30-minute follow-up meeting with the client for next Tuesday at 10 a.m." Expected output (golden): The system successfully invokes the schedule_meeting tool with the correct JSON payload: {"duration_minutes": 30, "day": "Tuesday", "time": "10 AM", "attendee": "client_email"}. While manually curating hundreds of edge cases is tedious, the process can be accelerated with synthetic data generation pipelines that use a specialized LLM to produce diverse TSV/CSV test payloads. However, relying entirely on AI-generated test cases introduces the risk of data contamination and bias. A human-in-the-loop (HITL) architecture is mandatory at this stage; domain experts must manually review, edit, and validate the synthetic dataset to ensure it accurately reflects real-world user intent and enterprise policy before it is committed to the repository. 2. Defining the evaluation criteria Once the dataset is curated, engineers must design the evaluation criteria to compute a composite score for each model output. A robust architecture achieves this by assigning weighted points across a hybrid of Layer 1 (deterministic) and Layer 2 (model-based) asserts. Consider an AI agent executing a "send email" tool. An evaluation framework might utilize a 10-point scoring system: Layer 1: Deterministic asserts (6 points): Did the agent invoke the correct tool? (2 pts). Did it produce a valid JSON object? (2 pts). Does the JSON strictly adhere to the expected schema? (2 pts). 
Layer 2: Model-based asserts (4 points): (Note: Semantic rubrics must be highly use-case specific). Does the subject line reflect user intent? (1 pt). Does the email body match expected outputs without hallucination? (1 pt). Were CC/BCC fields leveraged accurately? (1 pt). Was the appropriate priority flag inferred? (1 pt). To understand why the LLM-Judge awarded these points, the engineer must prompt the judge to supply its reasoning for each score. This is crucial for debugging failures. The passing threshold and short-circuit logic In this example, an 8/10 passing threshold requires 8 points for success. Crucially, the evaluation pipeline must enforce strict short-circuit evaluation (fail-fast logic). If the model fails any deterministic assertion — such as generating a malformed JSON schema — the system must instantly fail the entire test case (0/10). There is zero architectural value in invoking an expensive LLM-Judge to assess the semantic "politeness" of an email if the underlying API call is structurally broken. 3. Executing the pipeline and aggregating signals Using an evaluation infrastructure of choice, the system executes the offline pipeline — typically integrated as a blocking CI/CD step during a pull request. The infrastructure iterates through the golden dataset, injecting each test payload into the production model, capturing the output, and executing defined assertions against it. Each output is scored against the passing threshold. Once batch execution is complete, results are aggregated into an overall pass rate. For enterprise-grade applications, the baseline pass rate must typically exceed 95%, scaling to 99%-plus for strict compliance or high-risk domains. 4. Assessment, iteration, and alignment Based on aggregated failure data, engineering teams conduct a root-cause analysis of failing test cases. 
This assessment drives iterative updates to core components: refining system prompts, modifying tool descriptions, augmenting knowledge sources, or adjusting hyperparameters (like temperature or top-p). Continuous optimization remains best practice even after achieving a 95% pass rate. Crucially, any system modification necessitates a full regression test. Because LLMs are inherently non-deterministic, an update intended to fix one specific edge case can easily cause unforeseen degradations in other areas. The entire offline pipeline must be rerun to validate that the update improved quality without introducing regressions.

The online evaluation pipeline

While the offline pipeline acts as a strict pre-deployment gatekeeper, the online pipeline is the post-deployment telemetry system. Its objective is to monitor real-world behavior, capture emergent edge cases, and quantify model drift. Architects must instrument applications to capture the following categories of telemetry:

1. Explicit user signals

Direct, deterministic feedback indicating model performance:
Thumbs up/down: Disproportionate negative feedback is the most immediate leading indicator of system degradation and should trigger immediate engineering investigation.
Verbatim in-app feedback: Systematically parsing written comments identifies novel failure modes to integrate back into the offline "golden dataset."

2. Implicit behavioral signals

Behavioral telemetry reveals silent failures where users give up without explicit feedback:
Regeneration and retry rates: High frequencies of retries indicate the initial output failed to resolve user intent.
Apology rate: Programmatically scanning for heuristic triggers ("I’m sorry") detects degraded capabilities or broken tool routing.
Refusal rate: Artificially high refusal rates ("I can’t do that") indicate over-calibrated safety filters rejecting benign user queries.

3. Production deterministic asserts (synchronous)

Because deterministic code checks execute in milliseconds, teams can seamlessly reuse Layer 1 offline asserts (schema conformity, tool validity) to synchronously evaluate 100% of production traffic. Logging these pass/fail rates instantly detects anomalous spikes in malformed outputs — the earliest warning sign of silent model drift or provider-side API changes.

4. Production LLM-as-a-Judge (asynchronous)

If strict data privacy agreements (DPAs) permit logging user inputs, teams can deploy model-based asserts. Architecturally, production LLM-Judges must never execute synchronously on the critical path, where they would double latency and compute costs. Instead, a background LLM-Judge asynchronously samples a fraction (5%) of daily sessions, grading outputs against the offline rubric to generate a continuous quality dashboard.

Engineering the feedback loop (the “flywheel”)

Evaluation pipelines are not "set-it-and-forget-it" infrastructure. Without continuous updates, static datasets suffer from "rot" (concept drift) as user behavior evolves and customers discover novel use cases. For example, an HR chatbot might boast a pristine 99% offline pass rate for standard payroll questions. However, if the company suddenly announces a new equity plan, users will immediately begin prompting the AI about vesting schedules — a domain entirely missing from the offline evaluations. To make the system smarter over time, engineers must architect a closed feedback loop that mines production telemetry for continuous improvement.

The continuous improvement workflow:
Capture: A user triggers an explicit negative signal (a "thumbs down") or an implicit behavioral flag in production.
Triage: The specific session log is automatically flagged and routed for human review.
Root-cause analysis: A domain expert investigates the failure, identifies the gap, and updates the AI system to successfully handle similar requests.
Dataset augmentation: The novel user input, paired with the newly corrected expected output, is appended to the offline Golden Dataset alongside several synthetic variations. Regression testing: The model is continuously re-evaluated against this newly discovered edge case in all future runs. Building an evaluation pipeline without monitoring production logs and updating datasets is fundamentally insufficient. Users are unpredictable. Evaluating on stale data creates a dangerous illusion: High offline pass rates masking a rapidly degrading real-world experience. Conclusion: The new “definition of done” In the era of generative AI, a feature or product is no longer "done" simply because the code compiles and the prompt returns a coherent response. It is only done when a rigorous, automated evaluation pipeline is deployed and stable — and when the model consistently passes against both a curated golden dataset and newly discovered production edge cases. This guide has equipped you with a comprehensive blueprint for building that reality. From architecting offline regression pipelines and online telemetry to the continuous feedback flywheel and navigating enterprise anti-patterns, you now have the structural foundation required to deploy AI systems with greater confidence. Now, it is your turn. Share this framework with your engineering, product, and legal teams to establish a unified, cross-functional standard for AI quality in your organization. Stop guessing whether your models are degrading in production, and start measuring. Derah Onuorah is a Microsoft senior product manager.
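As a rough illustration of the implicit behavioral signals and the asynchronous judge sampling described in the online pipeline above, the sketch below computes an apology rate and a refusal rate over logged responses and draws a 5% sample of sessions for background grading. The trigger phrases, function names and sampling helper are assumptions made for illustration; they do not come from the author's tooling, and a production system would likely need a broader phrase list.

```python
import random
import re

# Hypothetical trigger phrases; real deployments would tune these lists.
APOLOGY = re.compile(r"\bi(?:'m| am) sorry\b|\bi apologize\b", re.IGNORECASE)
REFUSAL = re.compile(r"\bi (?:can't|cannot) (?:do|help with) that\b", re.IGNORECASE)


def _normalize(text):
    # Map curly apostrophes to straight ones so both spellings match.
    return text.replace("\u2019", "'")


def signal_rates(responses):
    """Return the share of responses containing apology or refusal phrases."""
    total = len(responses)
    if total == 0:
        return {"apology_rate": 0.0, "refusal_rate": 0.0}
    apologies = sum(bool(APOLOGY.search(_normalize(r))) for r in responses)
    refusals = sum(bool(REFUSAL.search(_normalize(r))) for r in responses)
    return {"apology_rate": apologies / total, "refusal_rate": refusals / total}


def sample_for_judge(session_ids, fraction=0.05, seed=None):
    """Pick a fraction of sessions for asynchronous LLM-as-a-Judge grading."""
    if not session_ids:
        return []
    k = max(1, round(len(session_ids) * fraction))
    return random.Random(seed).sample(list(session_ids), k)


if __name__ == "__main__":
    logged = [
        "I'm sorry, I could not find that.",
        "Here is your payroll summary.",
        "I can\u2019t do that.",
        "Done.",
    ]
    print(signal_rates(logged))
    print(sample_for_judge([f"session-{i}" for i in range(100)], seed=7))
```

Because these checks are plain string matching, they are cheap enough to run on every response, while the sampled sessions can be queued for the slower judge model off the critical path.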

Context decay, orchestration drift, and the rise of silent failures in AI systems

4/26/2026

The most expensive AI failure I have seen in enterprise deployments did not produce an error. No alert fired. No dashboard turned red. The system was fully operational, it was just consistently, confidently wrong. That is the reliability gap. And it is the problem most enterprise AI programs are not built to catch. We have spent the last two years getting very good at evaluating models: benchmarks, accuracy scores, red-team exercises, retrieval quality tests. But in production, the model is rarely where the system breaks. It breaks in the infrastructure layer, the data pipelines feeding it, the orchestration logic wrapping it, the retrieval systems grounding it, the downstream workflows trusting its output. That layer is still being monitored with tools designed for a different kind of software. The gap no one is measuring Here's what makes this problem hard to see: Operationally healthy and behaviorally reliable are not the same thing, and most monitoring stacks cannot tell the difference. A system can show green across every infrastructure metric, latency within SLA, throughput normal, error rate flat, while simultaneously reasoning over retrieval results that are six months stale, silently falling back to cached context after a tool call degrades, or propagating a misinterpretation through five steps of an agentic workflow. None of that shows up in Prometheus. None of it trips a Datadog alert. The reason is straightforward: Traditional observability was built to answer the question “is the service up?” Enterprise AI requires answering a harder question: “Is the service behaving correctly?” Those are different instruments. 
What teams typically measure | What actually drives AI infrastructure failure
Uptime / latency / error rate | Retrieval freshness and grounding confidence
Token usage | Context integrity across multi-step workflows
Throughput | Semantic drift under real-world load
Model benchmark scores | Behavioral consistency when conditions degrade
Infrastructure error rate | Silent partial failure at the reasoning layer

Closing this gap requires adding a behavioral telemetry layer alongside the infrastructure one: not replacing what exists, but extending it to capture what the model actually did with the context it received, not just whether the service responded.

Four failure patterns that standard monitoring will not catch

Across enterprise AI deployments in network operations, logistics, and observability platforms, I see four failure patterns repeat with enough consistency to name them. The first is context degradation. The model reasons over incomplete or stale data in a way that is invisible to the end user. The answer looks polished. The grounding is gone. Detection usually happens weeks later, through downstream consequences rather than system alerts. The second is orchestration drift. Agentic pipelines rarely fail because one component breaks. They fail because the sequence of interactions between retrieval, inference, tool use, and downstream action starts to diverge under real-world load. A system that looked stable in testing behaves very differently when latency compounds across steps and edge cases stack. The third is a silent partial failure. One component underperforms without crossing an alert threshold. The system degrades behaviorally before it degrades operationally. These failures accumulate quietly and surface first as user mistrust, not incident tickets. By the time the signal reaches a postmortem, the erosion has been happening for weeks. The fourth is the automation blast radius. In traditional software, a localized defect stays local. 
In AI-driven workflows, one misinterpretation early in the chain can propagate across steps, systems, and business decisions. The cost is not just technical. It becomes organizational, and it is very hard to reverse. Metrics tell you what happened. They rarely tell you what almost happened. Why classic chaos engineering is not enough and what needs to change Traditional chaos engineering asks the right kind of question: What happens when things break? Kill a node. Drop a partition. Spike CPU. Observe. Those tests are necessary, and enterprises should run them. But for AI systems, the most dangerous failures are not caused by hard infrastructure faults. They emerge at the interaction layer between data quality, context assembly, model reasoning, orchestration logic, and downstream action. You can stress the infrastructure all day and never surface the failure mode that costs you the most. What AI reliability testing needs is an intent-based layer: Define what the system must do under degraded conditions, not just what it should do when everything works. Then test the specific conditions that challenge that intent. What happens if the retrieval layer returns content that is technically valid but six months outdated? What happens if a summarization agent loses 30% of its context window to unexpected token inflation upstream? What happens if a tool call succeeds syntactically but returns semantically incomplete data? What happens if an agent retries through a degraded workflow and compounds its own error with each step? These scenarios are not edge cases. They are what production looks like. This is the framework I have applied in building reliability systems for enterprise infrastructure: Intent-based chaos level creation for distributed computing environments. The key insight: Intent defines the test, not just the fault. What the infrastructure layer actually needs None of this requires reinventing the stack. It requires extending four things. 
Add behavioral telemetry alongside infrastructure telemetry. Track whether responses were grounded, whether fallback behavior was triggered, whether confidence dropped below a meaningful threshold, whether the output was appropriate for the downstream context it entered. This is the observability layer that makes everything else interpretable. Introduce semantic fault injection into pre-production environments. Deliberately simulate stale retrieval, incomplete context assembly, tool-call degradation, and token-boundary pressure. The goal is not theatrical chaos. The goal is finding out how the system behaves when conditions are slightly worse than your staging environment — which is always what production is. Define safe halt conditions before deployment, not after the first incident. AI systems need the equivalent of circuit breakers at the reasoning layer. If a system cannot maintain grounding, validate context integrity, or complete a workflow with enough confidence to be trusted, it should stop cleanly, label the failure, and hand control to a human or a deterministic fallback. A graceful halt is almost always safer than a fluent error. Too many systems are designed to keep going because confident output creates the illusion of correctness. Assign shared ownership for end-to-end reliability. The most common organizational failure is a clean separation between model teams, platform teams, data teams, and application teams. When the system is operationally up but behaviorally wrong, no one owns it clearly. Semantic failure needs an owner. Without one, it accumulates. The maturity curve is shifting For the last two years, the enterprise AI differentiator has been adoption — who gets to production fastest. That phase is ending. As models commoditize and baseline capability converges, competitive advantage will come from something harder to copy: The ability to operate AI reliably at scale, in real conditions, with real consequences. 
Yesterday’s differentiator was model adoption. Today’s is system integration. Tomorrow’s will be reliability under production stress. The enterprises that get there first will not have the most advanced models. They will have the most disciplined infrastructure around them — infrastructure that was tested against the conditions it would actually face, not the conditions that made the pilot look good. The model is not the whole risk. The untested system around it is. Sayali Patil is an AI infrastructure and product leader.
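The "safe halt" idea from the article above, a circuit breaker at the reasoning layer, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `StepTelemetry` fields, the 30-day freshness limit and the 0.7 confidence floor are hypothetical values chosen for the example, not thresholds recommended by the author.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class StepTelemetry:
    """Behavioral telemetry for one workflow step (hypothetical schema)."""
    retrieved_at: datetime  # timestamp of the newest document retrieved
    grounded: bool          # did the answer cite retrieved context?
    used_fallback: bool     # did the step fall back to cached context?
    confidence: float       # any calibrated score in [0, 1]


def halt_reason(step: StepTelemetry,
                now: datetime,
                max_age: timedelta = timedelta(days=30),
                min_confidence: float = 0.7) -> Optional[str]:
    """Return a labeled halt reason, or None if the step may proceed."""
    if now - step.retrieved_at > max_age:
        return "stale_retrieval"
    if not step.grounded:
        return "ungrounded_response"
    if step.used_fallback:
        return "silent_fallback"
    if step.confidence < min_confidence:
        return "low_confidence"
    return None


if __name__ == "__main__":
    now = datetime(2026, 4, 26, tzinfo=timezone.utc)
    stale = StepTelemetry(now - timedelta(days=180), True, False, 0.95)
    reason = halt_reason(stale, now)
    if reason is not None:
        # Stop cleanly, label the failure, hand off to a human or fallback.
        print(f"halting workflow: {reason}")
```

Returning a named reason rather than a bare boolean keeps the halt labeled, so the handoff to a human or a deterministic fallback carries the cause with it.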

CVSS scored these two Palo Alto CVEs as manageable. Chained, they gave attackers root access to 13,000 devices.

4/24/2026

During Operation Lunar Peek in November 2024, attackers gained unauthenticated remote admin access — and eventual root — across more than 13,000 exposed Palo Alto Networks management interfaces. Palo Alto Networks scored CVE-2024-0012 at 9.3 and CVE-2024-9474 at 6.9 under CVSS v4.0. NVD scored the same pair 9.8 and 7.2 under CVSS v3.1. Two scoring systems. Two different answers for the same vulnerabilities. The 6.9 fell below patch thresholds. Admin access appeared required. The 9.3 sat queued for maintenance. Segmentation would hold. "Adversaries circumvent [severity ratings] by chaining vulnerabilities together," Adam Meyers, SVP of Counter Adversary Operations at CrowdStrike, told VentureBeat in an exclusive interview on April 22, 2026. On the triage logic that missed the chain: "They just had amnesia from 30 seconds before." Both CVEs sit on the CISA Known Exploited Vulnerabilities catalog. Neither score flagged the kill chain. The triage logic that consumed those scores treated each CVE as an isolated event, and so did the SLA dashboards and the board reports those dashboards feed. CVSS did exactly what it was designed to do. Score one vulnerability at a time. The problem is that adversaries do not attack one vulnerability at a time. "CVSS base scores are theoretical measures of severity that ignore real-world context," wrote Peter Chronis, former CISO of Paramount and a security leader with Fortune 100 experience. By moving beyond CVSS-first prioritization at Paramount, Chronis reported reducing actionable critical and high-risk vulnerabilities by 90%. Chris Gibson, executive director of FIRST, the organization that maintains CVSS, has been equally direct: using CVSS base scores alone for prioritization is "the least apt and accurate" method, Gibson told The Register. FIRST's own EPSS and CISA's SSVC decision model address part of this gap by adding exploitation probability and decision-tree logic. 
Five triage failure classes CVSS was never designed to catch In 2025, 48,185 CVEs were disclosed, a 20.6% year-over-year increase. Jerry Gamblin, principal engineer at Cisco Threat Detection and Response, projects 70,135 for 2026. The infrastructure behind the scores is buckling under that weight. NIST announced on April 15 that CVE submissions have grown 263% since 2020, and the NVD will now prioritize enrichment for KEV and federal critical software only. 1. Chained CVEs that look safe until they aren't The Palo Alto pair from Operation Lunar Peek is the textbook. CVE-2024-0012 bypassed authentication. CVE-2024-9474 escalated privileges. Scored separately under both CVSS v4.0 and v3.1, the escalation flaw filtered below most enterprise patch thresholds because admin access appeared required. The authentication bypass upstream eliminated that prerequisite entirely. Neither score communicated the compound effect. Meyers described the operational psychology: teams assessed each CVE independently, deprioritized the lower score, and queued the higher one for maintenance. 2. Nation-state adversaries who weaponize patches within days The CrowdStrike 2026 Global Threat Report documented a 42% year-over-year increase in vulnerabilities exploited as zero-days before public disclosure. Average breakout time across observed intrusions: 29 minutes. Fastest observed breakout: 27 seconds. China-nexus adversaries weaponized newly patched vulnerabilities within two to six days of disclosure. "Before it was Patch Tuesday once a month. Now it's patch every day, all the time. That's what this new world looks like," said Daniel Bernard, Chief Business Officer at CrowdStrike. A KEV addition treated as a routine queue item on Tuesday becomes an active exploitation window by Thursday. 3. Stockpiled CVEs that nation-state actors hold for years Salt Typhoon accessed senior U.S. 
political figures' communications during the presidential transition by chaining CVE-2023-20198 with CVE-2023-20273 on internet-facing Cisco devices, a privilege escalation pair patched in October 2023 and still unapplied more than a year later. Compromised credentials provided a parallel entry vector. The patches existed. Neither was applied. Sixty-seven percent of vulnerabilities exploited by China-nexus adversaries in 2025 were remote code execution flaws providing immediate system access, according to the CrowdStrike 2026 Global Threat Report. CVSS does not degrade priority based on how long a CVE has gone unpatched. No board metric tracks aging KEV exposure. That silence is the vulnerability. 4. Identity gaps that never enter the scoring system A 2023 help desk social engineering call against a major enterprise produced more than $100 million in losses. No CVE was assigned. No CVSS score existed. No patch pipeline entry was created. The vulnerability was a human process gap in identity verification, sitting entirely outside the scoring system's aperture. "A pro needs a zero day if all you have to do is call the help desk and say I forgot my password," Meyers said. Agentic AI systems now carry their own identity credentials, API tokens, and permission scopes, operating outside traditional vulnerability management governance. Merritt Baer, CSO at Enkrypt AI, has argued on record that identity-surface controls are vulnerability equivalents belonging in the same reporting pipeline as software CVEs. In most organizations, help desk authentication gaps and agentic AI credential inventories live in a separate governance silo. In practice, nobody's governance. 5. AI-accelerated discovery that breaks pipeline capacity Anthropic's Claude Mythos Preview demonstrated autonomous vulnerability discovery, finding a 27-year-old signed integer overflow in OpenBSD's TCP SACK implementation across roughly 1,000 scaffold runs at a total compute cost under $20,000. 
Meyers offered a thought-experiment projection in the exclusive interview with VentureBeat: if frontier AI drives a 10x volume increase, the result is approximately 480,000 CVEs annually. Pipelines built for 48,000 break at 70,000 and collapse at 480,000. NVD enrichment is already gone for non-KEV submissions. "If the adversary is now able to find vulnerabilities faster than the defenders or the business, that's a huge problem, because those vulnerabilities become exploits," said Daniel Bernard, Chief Business Officer at CrowdStrike. CrowdStrike on Thursday launched Project QuiltWorks, a remediation coalition with Accenture, EY, IBM Cybersecurity Services, Kroll, and OpenAI formed to address the vulnerability volume that frontier AI models are now generating in production code. When five major firms build a coalition around a pipeline problem, no single organization's patch workflow can keep pace. Security director action plan The five failure classes above map to five specific actions. Run a chain-dependency audit on every KEV CVE in the environment this month. Flag any co-resident CVE scored 5.0 or above, the threshold where privilege escalation and lateral movement capabilities typically appear in CVSS vectors. Any pair chaining authentication bypass to privilege escalation gets triaged as critical regardless of individual scores. Compress KEV-to-patch SLAs to 72 hours for internet-facing systems. The CrowdStrike 2026 Global Threat Report breakout data, 29-minute average and 27-second fastest, makes weekly patch windows indefensible in a board presentation. Build a monthly KEV aging report for the board. Every unpatched KEV CVE, days since disclosure, days since patch availability, and owner. Salt Typhoon exploited a Cisco CVE patched 14 months earlier because no escalation path existed for aging exposure. Add identity-surface controls to the vulnerability reporting pipeline. 
Help desk authentication gaps and agentic AI credential inventories belong in the same SLA framework as software CVEs. If they sit in a separate governance silo, they sit in nobody's governance. Stress-test pipeline capacity at 1.5x and 10x current CVE volume. Gamblin projects 70,135 for 2026. Meyers's thought-experiment projection: frontier AI could push annual volume past 480,000. Present the capacity gap to the CFO before the next budget cycle, not after the breach that proves the gap existed.
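
The first action in the plan, the chain-dependency audit, lends itself to a short sketch. The version below groups findings by host, pairs every KEV-listed CVE with any co-resident CVE scored 5.0 or above, and escalates authentication-bypass plus privilege-escalation pairs to critical regardless of their individual scores. The `Finding` schema, the capability labels and the host name are assumptions made for illustration; real scanner exports will look different, and the capability of each CVE would have to be derived from its CVSS vector or advisory text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One scanner finding (hypothetical schema, not a real tool's export)."""
    host: str
    cve: str
    cvss: float
    in_kev: bool
    capability: str  # "auth_bypass", "priv_esc" or "other" (illustrative labels)


def chain_audit(findings, co_resident_threshold=5.0):
    """Pair each KEV CVE with co-resident CVEs at or above the threshold."""
    by_host = {}
    for f in findings:
        by_host.setdefault(f.host, []).append(f)

    flagged = {}
    for host, items in by_host.items():
        for kev in items:
            if not kev.in_kev:
                continue
            for other in items:
                if other.cve == kev.cve or other.cvss < co_resident_threshold:
                    continue
                pair = tuple(sorted((kev.cve, other.cve)))  # dedupe A-B / B-A
                chained = {kev.capability, other.capability} == {"auth_bypass", "priv_esc"}
                flagged[(host, *pair)] = "critical" if chained else "review"
    return [(*key, triage) for key, triage in sorted(flagged.items())]


if __name__ == "__main__":
    findings = [
        # Scores are the CVSS v4.0 values cited in the article.
        Finding("fw-01", "CVE-2024-0012", 9.3, True, "auth_bypass"),
        Finding("fw-01", "CVE-2024-9474", 6.9, True, "priv_esc"),
        # Placeholder low-severity finding, not a real CVE identifier.
        Finding("fw-01", "EXAMPLE-LOW", 4.0, False, "other"),
    ]
    for row in chain_audit(findings):
        print(row)
```

Run against the Palo Alto pair, the audit surfaces the 6.9 as part of a critical chain even though it would fall below a typical patch threshold on its own.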

DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5

4/24/2026

The whale has resurfaced. DeepSeek, the Chinese AI startup and offshoot of quantitative analysis firm High-Flyer Capital Management, became a near-overnight sensation globally in January 2025 with the release of its open source R1 model that matched proprietary U.S. giants. It's been an epoch in AI since then, and while DeepSeek has released several updates to that model and its other V3 series, the international AI and business community has been largely waiting with bated breath for the follow-up to the R1 moment. Now it's arrived with last night's release of DeepSeek-V4, a 1.6-trillion-parameter Mixture-of-Experts (MoE) model available free under the commercially friendly, open source MIT License, which nears (and on some benchmarks, surpasses) the performance of the world’s most advanced closed-source systems at approximately 1/6th the cost over the application programming interface (API). This release, which DeepSeek AI researcher Deli Chen described on X as a "labor of love" 484 days after the launch of V3, is being hailed as the "second DeepSeek moment". As Chen noted in his post, "AGI belongs to everyone". It's available now on AI code sharing community Hugging Face and through DeepSeek's API.

Frontier-class AI gets pushed into a lower price band

The most immediate impact of the DeepSeek-V4 launch is economic. The corrected pricing table shows DeepSeek is not pricing its new Pro model at near-zero levels, but it is still pushing high-end model access into a far lower cost tier than the leading U.S. frontier models. DeepSeek-V4-Pro is priced through its API at $1.74 USD per 1 million input tokens on a cache miss and $3.48 per million output tokens. That puts a simple one-million-input, one-million-output comparison at $5.22. With cached input, the input price drops to $0.145 per million tokens, bringing that same blended comparison down to $3.625. That is dramatically cheaper than the current premium pricing from OpenAI and Anthropic. 
GPT-5.5 is priced at $5.00 per million input tokens and $30.00 per million output tokens, for a combined $35.00 in the same simple comparison. Claude Opus 4.7 is priced at $5.00 input and $25.00 output, for a combined $30.00.

Model | Input (per 1M tokens) | Output (per 1M tokens) | Total Cost | Source
Grok 4.1 Fast | $0.20 | $0.50 | $0.70 | xAI
MiniMax M2.7 | $0.30 | $1.20 | $1.50 | MiniMax
Gemini 3 Flash | $0.50 | $3.00 | $3.50 | Google
Kimi-K2.5 | $0.60 | $3.00 | $3.60 | Moonshot
MiMo-V2-Pro (≤256K) | $1.00 | $3.00 | $4.00 | Xiaomi MiMo
GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai
GLM-5-Turbo | $1.20 | $4.00 | $5.20 | Z.ai
DeepSeek-V4-Pro | $1.74 | $3.48 | $5.22 | DeepSeek
GLM-5.1 | $1.40 | $4.40 | $5.80 | Z.ai
Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic
Qwen3-Max | $1.20 | $6.00 | $7.20 | Alibaba Cloud
Gemini 3 Pro | $2.00 | $12.00 | $14.00 | Google
GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI
GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI
Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic
Claude Opus 4.7 | $5.00 | $25.00 | $30.00 | Anthropic
GPT-5.5 | $5.00 | $30.00 | $35.00 | OpenAI
GPT-5.4 Pro | $30.00 | $180.00 | $210.00 | OpenAI

On standard, cache-miss pricing, DeepSeek-V4-Pro comes in at roughly one-seventh the cost of GPT-5.5 and about one-sixth (1/6th) the cost of Claude Opus 4.7. With cached input, the gap widens: DeepSeek-V4-Pro costs about one-tenth as much as GPT-5.5 and about one-eighth as much as Claude Opus 4.7. The more extreme near-zero story belongs to DeepSeek-V4-Flash, not the Pro model. Flash is priced at $0.14 per million input tokens on a cache miss and $0.28 per million output tokens, for a combined $0.42. With cached input, that drops to $0.308. In that case, DeepSeek’s cheaper model is more than 98% below GPT-5.5 and Claude Opus 4.7 in a simple input-plus-output comparison, or nearly 1/100th the cost, though the performance dips significantly. DeepSeek is compressing advanced model economics into a much lower band, forcing developers and enterprises to revisit the cost-benefit calculation around premium closed models. 
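
The "one-sixth" and "one-seventh" figures follow directly from the listed per-million-token prices. The short sketch below reproduces that arithmetic for the cache-miss prices quoted in the article; the function names are illustrative, and the one-million-in, one-million-out mix is the article's simplification rather than a realistic workload profile.

```python
# USD per 1M tokens as (input, output), cache-miss prices quoted in the article.
PRICES = {
    "DeepSeek-V4-Pro": (1.74, 3.48),
    "DeepSeek-V4-Flash": (0.14, 0.28),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def blended_cost(model, input_millions=1.0, output_millions=1.0):
    """Cost in USD for a given number of millions of input and output tokens."""
    input_price, output_price = PRICES[model]
    return input_price * input_millions + output_price * output_millions


def cost_ratio(cheaper, pricier, **mix):
    """How many times more the pricier model costs for the same token mix."""
    return blended_cost(pricier, **mix) / blended_cost(cheaper, **mix)


if __name__ == "__main__":
    for rival in ("GPT-5.5", "Claude Opus 4.7"):
        ratio = cost_ratio("DeepSeek-V4-Pro", rival)
        print(f"{rival} costs {ratio:.1f}x DeepSeek-V4-Pro at a 1:1 token mix")
```

Passing a different mix, for example `cost_ratio("DeepSeek-V4-Pro", "GPT-5.5", input_millions=10, output_millions=1)`, shows how the gap shifts for input-heavy workloads, where output pricing matters less.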
For companies running large inference workloads, that price gap can change what is worth automating. Tasks that look too expensive on GPT-5.5 or Claude Opus 4.7 may become economically viable on DeepSeek-V4-Pro, and even more so on DeepSeek-V4-Flash. The launch does not make intelligence free, but it does make the market harder for premium providers to defend on performance alone. Benchmarking the frontier: DeepSeek-V4-Pro gets close, but GPT-5.5 and Opus 4.7 still lead on most shared tests DeepSeek-V4-Pro-Max is best understood as a major open-weight leap, not a clean across-the-board defeat of the newest closed frontier systems. The model’s strongest benchmark claims come from DeepSeek’s own comparison tables, where it is shown against GPT-5.4 xHigh, Claude Opus 4.6 Max and Gemini 3.1 Pro High and bests them on several tests, including Codeforces and Apex Shortlist. But that is not the same as a head-to-head against OpenAI’s newer GPT-5.5 or Anthropic’s newer Claude Opus 4.7. Looking only at DeepSeek-V4 versus the latest proprietary models, the picture is more restrained. On this shared set, GPT-5.5 and Claude Opus 4.7 still lead most categories. DeepSeek-V4-Pro-Max’s best showing is on BrowseComp, the benchmark measuring agentic AI web browsing prowess (especially highly containerized information), where it scores 83.4%, narrowly behind GPT-5.5 at 84.4% and ahead of Claude Opus 4.7 at 79.3%. On Terminal-Bench 2.0, DeepSeek scores 67.9%, close to Claude Opus 4.7’s 69.4%, but far behind GPT-5.5’s 82.7%. 
Benchmark | DeepSeek-V4-Pro-Max | GPT-5.5 | GPT-5.5 Pro, where shown | Claude Opus 4.7 | Best result among these
GPQA Diamond | 90.1% | 93.6% | n/a | 94.2% | Claude Opus 4.7
Humanity’s Last Exam, no tools | 37.7% | 41.4% | 43.1% | 46.9% | Claude Opus 4.7
Humanity’s Last Exam, with tools | 48.2% | 52.2% | 57.2% | 54.7% | GPT-5.5 Pro
Terminal-Bench 2.0 | 67.9% | 82.7% | n/a | 69.4% | GPT-5.5
SWE-Bench Pro / SWE Pro | 55.4% | 58.6% | n/a | 64.3% | Claude Opus 4.7
BrowseComp | 83.4% | 84.4% | 90.1% | 79.3% | GPT-5.5 Pro
MCP Atlas / MCPAtlas Public | 73.6% | 75.3% | n/a | 79.1% | Claude Opus 4.7

The shared academic-reasoning results favor the closed models: On GPQA Diamond, DeepSeek-V4-Pro-Max scores 90.1%, while GPT-5.5 reaches 93.6% and Claude Opus 4.7 reaches 94.2%. On Humanity’s Last Exam without tools, DeepSeek scores 37.7%, behind GPT-5.5 at 41.4%, GPT-5.5 Pro at 43.1% and Claude Opus 4.7 at 46.9%. With tools enabled, DeepSeek rises to 48.2%, but still trails GPT-5.5 at 52.2%, GPT-5.5 Pro at 57.2% and Claude Opus 4.7 at 54.7%. The agentic and software-engineering results are more mixed, but they still show DeepSeek-V4-Pro-Max trailing GPT-5.5 and Opus 4.7. On Terminal-Bench 2.0, DeepSeek’s 67.9% is competitive with Claude Opus 4.7’s 69.4%, but GPT-5.5 is much higher at 82.7%. On SWE-Bench Pro, DeepSeek’s 55.4% trails GPT-5.5 at 58.6% and Claude Opus 4.7 at 64.3%. On MCP Atlas, DeepSeek’s 73.6% is slightly behind GPT-5.5 at 75.3% and Claude Opus 4.7 at 79.1%. BrowseComp is the standout: DeepSeek’s 83.4% beats Claude Opus 4.7’s 79.3% and nearly matches GPT-5.5’s 84.4%, though GPT-5.5 Pro’s 90.1% remains well ahead. So ultimately, DeepSeek-V4-Pro-Max does not appear to dethrone GPT-5.5 or Claude Opus 4.7 on the benchmarks that can be directly compared across the companies’ published tables. But it gets close enough on several of them (especially BrowseComp, Terminal-Bench 2.0 and MCP Atlas) that its much lower API pricing becomes the headline. In practical terms, DeepSeek does not need to win every leaderboard row to matter. 
If it can deliver near-frontier performance on many enterprise-relevant agent and reasoning tasks at roughly one-sixth to one-seventh the standard API cost of GPT-5.5 or Claude Opus 4.7, it still forces a major rethink of the economics of advanced AI deployment. DeepSeek-V4-Pro-Max is clearly the strongest open-weight model in the field right now, and it is unusually close to frontier closed systems on several practical benchmarks. While GPT-5.5 and Claude Opus 4.7 still retain the lead in most direct head-to-head comparisons across the company's benchmark charts, DeepSeek V4 Pro gets close while being dramatically cheaper and openly available. A big jump from DeepSeek V3.2 To understand the magnitude of this release, one must look at the performance gains of the base models. DeepSeek-V4-Pro-Base represents a significant advancement over the previous generation, DeepSeek-V3.2-Base. In World Knowledge, V4-Pro-Base achieved 90.1 on MMLU (5-shot) compared to V3.2’s 87.8, and a massive jump on MMLU-Pro from 65.5 to 73.5. The improvement in high-level reasoning and verified facts is even more pronounced: on SuperGPQA, V4-Pro-Base reached 53.9 compared to V3.2's 45.0, and on the FACTS Parametric benchmark, it more than doubled its predecessor's performance, jumping from 27.1 to 62.6. Simple-QA verified scores also saw a dramatic rise from 28.3 to 55.2. The Long Context capabilities have also been refined. On LongBench-V2, V4-Pro-Base scored 51.5, significantly outpacing the 40.2 achieved by V3.2-Base. In Code and Math, V4-Pro-Base reached 76.8 on HumanEval (Pass@1), up from 62.8 on V3.2-Base. These numbers underscore that DeepSeek has not just optimized for inference cost, but has fundamentally improved the intelligence density of its base architecture. The efficiency story is equally compelling for the Flash variant. 
DeepSeek-V4-Flash-Base, despite utilizing a substantially smaller number of parameters, outperforms the larger V3.2-Base across a wide range of benchmarks, particularly in long-context scenarios.

A new information 'traffic controller,' Manifold-Constrained Hyper-Connections (mHC)

DeepSeek’s ability to offer these prices and performance figures is rooted in radical architectural innovations detailed in its technical report also released today, "Towards Highly Efficient Million-Token Context Intelligence." The standout technical achievement of V4 is its native one-million-token context window. Historically, maintaining such a large context required massive memory for the key-value (KV) cache. DeepSeek solved this by introducing a Hybrid Attention Architecture that combines Compressed Sparse Attention (CSA) to reduce initial token dimensionality and Heavily Compressed Attention (HCA) to aggressively compress the memory footprint for long-range dependencies. In practice, the V4-Pro model requires only 10% of the KV cache and 27% of the single-token inference FLOPs compared to its predecessor, DeepSeek-V3.2, even when operating at a 1M token context. To stabilize a network of 1.6 trillion parameters, DeepSeek moved beyond traditional residual connections. The company's researchers incorporated Manifold-Constrained Hyper-Connections (mHC) to strengthen signal propagation across layers while preserving the model’s expressivity. mHC allows an AI to have a much wider flow of information (so it can learn more complex things) without the risk of the model becoming unstable or "breaking" during its training. It’s like giving a city a 10-lane highway but adding a perfect AI traffic controller to ensure no one ever hits the brakes. This is paired with the Muon optimizer, which allowed the team to achieve faster convergence and greater training stability during the pre-training on more than 32T diverse and high-quality tokens. 
This pre-training data was refined to remove hatched auto-generated content, mitigating the risk of model collapse and prioritizing unique academic values. The model’s 1.6T parameters utilize a Mixture-of-Experts (MoE) design where only 49B parameters are activated per token, further driving down compute requirements. Training the mixture-of-experts (MoE) to work as a whole DeepSeek-V4 was not simply trained; it was "cultivated" through a unique two-stage paradigm. First, through Independent Expert Cultivation, domain-specific experts were trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) using the GRPO (Group Relative Policy Optimization) algorithm. This allowed each expert to master specialized skills like mathematical reasoning or codebase analysis. Second, Unified Model Consolidation integrated these distinct proficiencies into a single model via on-policy distillation, where the unified model acts as the student learning to optimize reverse KL loss with teacher models. This distillation process ensures that the model preserves the specialized capabilities of each expert while operating as a cohesive whole. The model’s reasoning capabilities are further segmented into three increasing "effort" modes. The "Non-think" mode provides fast, intuitive responses for routine tasks. "Think High" provides conscious logical analysis for complex problem-solving. Finally, "Think Max" pushes the boundaries of model reasoning, bridging the gap with frontier models on complex reasoning and agentic tasks. This flexibility allows users to match the compute effort to the difficulty of the task, further enhancing cost-efficiency. Breaking the Nvidia GPU stranglehold with local Chinese Huawei Ascend NPUs While the model weights are the headline, the software stack released alongside them is arguably more important for the future of "Sovereign AI." 
Analyst Rui Ma highlighted a single sentence from the release as the most critical: DeepSeek validated their fine-grained Expert Parallelism (EP) scheme on Huawei Ascend NPUs (neural processing units). By achieving a 1.50x to 1.73x speedup on non-Nvidia GPU platforms, DeepSeek has provided a blueprint for high-performance AI deployment that is resilient to Western GPU supply chains and export controls. However, it's important to note that DeepSeek still claims it used officially licensed, legal Nvidia GPUs for DeepSeek V4's training, in addition to the Huawei NPUs. DeepSeek has also open-sourced the MegaMoE mega-kernel as a component of its DeepGEMM library. This CUDA-based implementation delivers up to a 1.96x speedup for latency-sensitive tasks like RL rollouts and high-speed agent serving. This move ensures that developers can run these massive models with extreme efficiency on existing hardware, further cementing DeepSeek’s role as the primary driver of open-source AI infrastructure. The technical report emphasizes that these optimizations are crucial for supporting a standard 1M context across all official services. Licensing and local deployment DeepSeek-V4 is released under the MIT License, the most permissive framework in the industry. This allows developers to use, copy, modify, and distribute the weights for commercial purposes without royalties—a stark contrast to the "restricted" open-weight licenses favored by other companies. For local deployment, DeepSeek recommends setting sampling parameters to temperature = 1.0 and top_p = 1.0. For those utilizing the "Think Max" reasoning mode, the team suggests setting the context window to at least 384K tokens to avoid truncating the model's internal reasoning chains. The release includes a dedicated encoding folder with Python scripts demonstrating how to encode messages in OpenAI-compatible format and parse the model's output, including reasoning content. 
DeepSeek-V4 is also seamlessly integrated with leading AI agents like Claude Code, OpenClaw, and OpenCode. This native integration underscores its role as a bedrock for developer tools, providing an open-source alternative to the proprietary ecosystems of major cloud providers.

Community reactions and what comes next

The community reaction has been one of shock and validation. Hugging Face officially welcomed the "whale" back, stating that the era of cost-effective 1M context length has arrived. Industry experts noted that the "second DeepSeek moment" has effectively reset the developmental trajectory of the entire field, placing massive pressure on closed-source providers like OpenAI and Anthropic to justify their premiums. AI evaluation firm Vals AI noted that DeepSeek-V4 is now the "#1 open-weight model on our Vibe Code Benchmark, and it’s not close". DeepSeek is moving quickly to retire its older architectures. The company announced that the legacy deepseek-chat and deepseek-reasoner endpoints will be fully retired on July 24, 2026. All traffic is currently being rerouted to the V4-Flash architecture, signifying a total transition to the million-token standard. DeepSeek-V4 is more than just a new model; it is a challenge to the status quo. By proving that architectural innovation can substitute for raw compute-maximalism, DeepSeek has made the highest levels of AI intelligence accessible to the global developer community at a far lower cost. That could benefit the globe, even at a time when lawmakers and leaders in Washington, D.C. are raising concerns about Chinese labs "distilling" from U.S. proprietary giants to train open source models, and fears of said open source or jailbroken proprietary models being used to create weapons and commit terror.
The truth is, while all of these are potential risks, as they were and have been with prior technologies that broadened information access, like search and the internet itself, the benefits seem to far outweigh them, and DeepSeek's quest to keep frontier AI models open is of benefit to the entire planet of potential AI users, especially enterprises looking to adopt the cutting-edge at the lowest possible cost.
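The reverse KL loss mentioned in the consolidation stage earlier in this article can be illustrated on a toy next-token distribution. This is a minimal sketch, not DeepSeek's training code: the distributions are made up, and the function simply evaluates KL(student || teacher), the expectation under the student's own distribution of the log-ratio between student and teacher probabilities.

```python
import math


def reverse_kl(student, teacher):
    """KL(student || teacher) for two discrete distributions over the same tokens.

    The sum is weighted by the student's probabilities, which is what makes the
    objective "on-policy" from the student's point of view. The teacher must
    assign non-zero probability wherever the student does.
    """
    return sum(
        p_s * math.log(p_s / teacher[token])
        for token, p_s in student.items()
        if p_s > 0.0
    )


# Made-up distributions over three candidate tokens.
teacher = {"A": 0.70, "B": 0.25, "C": 0.05}
student_matched = dict(teacher)
student_overconfident = {"A": 0.98, "B": 0.01, "C": 0.01}

if __name__ == "__main__":
    print("matched student:", reverse_kl(student_matched, teacher))
    print("overconfident student:", reverse_kl(student_overconfident, teacher))
```

The loss is zero when the student reproduces the teacher exactly and grows as the student's distribution drifts away from it, which is the quantity a distillation step would push down.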

85% of enterprises are running AI agents. Only 5% trust them enough to ship.

4/24/2026

Eighty-five percent of enterprises are running AI agent pilots, but only 5% have moved those agents into production. In an exclusive interview at RSA Conference 2026, Cisco President and Chief Product Officer Jeetu Patel said that the gap comes down to one thing: trust, and that closing it separates market dominance from bankruptcy. He also disclosed a mandate that will reshape Cisco's 90,000-person engineering organization. The problem is not rogue agents. The problem is the absence of a trust architecture.

The trust deficit behind a 5% production rate

A recent Cisco survey of major enterprise customers found that 85% have AI agent pilot programs underway. Only 5% moved those agents into production. That 80-point gap defines the security problem the entire industry is trying to close. It is not closing. "The biggest impediment to scaled adoption in enterprises for business-critical tasks is establishing a sufficient amount of trust," Patel told VentureBeat. "Delegating versus trusted delegating of tasks to agents. The difference between those two, one leads to bankruptcy and the other leads to market dominance." He compared agents to teenagers. "They're supremely intelligent, but they have no fear of consequence. They're pretty immature. And they can be easily sidetracked or influenced," Patel said. "What you have to do is make sure that you have guardrails around them and you need some parenting on the agents." The comparison carries weight because it captures the precise failure mode security teams face. Three years ago, a chatbot that gave the wrong answer was an embarrassment. An agent that takes the wrong action can trigger an irreversible outcome. Patel pointed to a case he cited in his keynote where an AI coding agent deleted a live production database during a code freeze, tried to cover its tracks with fake data, and then apologized. "An apology is not a guardrail," Patel said in his keynote blog.
The shift from information risk to action risk is the core reason the pilot-to-production gap persists.

Defense Claw and the open-source speed play with Nvidia

Cisco's response to the trust deficit at RSAC 2026 spanned three categories: protecting agents from the world, protecting the world from agents, and detecting and responding at machine speed. The product announcements included AI Defense Explorer Edition (a free, self-service red teaming tool), the Agent Runtime SDK for embedding policy enforcement into agent workflows at build time, and the LLM Security Leaderboard for evaluating model resilience against adversarial attacks. The open-source strategy moved faster than any of those. Nvidia launched OpenShell, a secure container for open-source agent frameworks, at GTC the week before RSAC. Cisco packaged its Skills Scanner, MCP Scanner, AI Bill of Materials tool, and CodeGuard into a single open-source framework called Defense Claw and hooked it into OpenShell within 48 hours. "Every single time you actually activate an agent in an Open Shell container, you can now automatically instantiate all the security services that we have built through Defense Claw," Patel told VentureBeat. The integration means security enforcement activates at container launch without manual configuration. That speed matters because the alternative is asking developers to bolt on security after the agent is already running. That 48-hour turnaround was not an anomaly. Patel said several of the Defense Claw capabilities Cisco launched were built in a week. "You couldn't have built it in longer than a week because Open Shell came out last week," he said.

A six-to-nine-month product lead and an information asymmetry on top of it

Patel made a competitive claim worth examining. "Product wise, we might be six to nine months ahead of most of the market," he told VentureBeat.
He added a second layer: "We also have an asymmetric information advantage of, I'd say, three to six months on everyone because, you know, we, by virtue of being in the ecosystem with all the model companies. We're seeing what's coming down the pipe." The 48-hour Defense Claw sprint supports the speed claim, though the lead margin is Cisco's own characterization; no independent benchmarks were provided. Cisco also extended zero trust to the agentic workforce through new Duo IAM and Secure Access capabilities, giving every agent time-bound, task-specific permissions. On the SOC side, Splunk announced Exposure Analytics for continuous risk scoring, Detection Studio for streamlined detection engineering, and Federated Search for investigating across distributed data environments.

The zero-human-code engineering mandate

AI Defense, the product Cisco launched a year before RSAC 2026, is now 100% built with AI. Zero lines of human-written code. By the end of 2026, half a dozen Cisco products will reach the same milestone. By the end of calendar year 2027, Patel's goal is 70% of Cisco's products built entirely by AI. "Just process that for a second and go: a $60 billion company is gonna have 70% of the products that are gonna have no human lines of code," Patel told VentureBeat. "The concept of a legacy company no longer exists." He connected that mandate to a cultural shift inside the engineering organization. "There's gonna be two kinds of people: ones that code with AI and ones that don't work at Cisco," Patel said. That was not debated. "Changing 30,000 people to change the way that they work at the very core of what they do in engineering cannot happen if you just make it a democratic process. It has to be something that's driven from the top down."

Five moats for the agentic era, and what CISOs can verify today

Patel laid out five strategic advantages that will separate winning enterprises from failing ones.
VentureBeat mapped each moat against actions security teams can begin verifying today.

| Moat | Patel's claim | What CISOs can verify today | What to validate next |
| --- | --- | --- | --- |
| Sustained speed | "Operating with extreme levels of obsession for speed for a durable length of time" creates compounding value | Measure deployment velocity from pilot to production. Track how long agent governance reviews take. | Pair speed metrics with telemetry coverage. Fast deployment without observability creates blind acceleration. |
| Trust and delegation | Trusted delegation separates market dominance from bankruptcy | Audit delegation chains. Flag agent-to-agent handoffs with no human approval. | Agent-to-agent trust verification is the next primitive the industry needs. OAuth, SAML, and MCP do not yet cover it. |
| Token efficiency | Higher output per token creates a strategic advantage | Monitor token consumption per workflow. Benchmark cost-per-action across agent deployments. | Token efficiency metrics exist. Token security metrics (what the token accessed, what it changed) are the next build. |
| Human judgment | "Just because you can code it doesn't mean you should." | Track decision points where agents defer to humans vs. act autonomously. | Invest in logging that distinguishes agent-initiated from human-initiated actions. Most configurations cannot yet. |
| AI dexterity | "10x to 20x to 50x productivity differential" between AI-fluent and non-fluent workers | Measure the adoption rates of AI coding tools across security engineering teams. | Pair dexterity training with governance training. One without the other compounds the risk. |

The telemetry layer the industry is still building

Patel's framework operates at the identity and policy layer. The next layer down, telemetry, is where the verification happens. "It looks indistinguishable if an agent runs your web browser versus if you run your browser," CrowdStrike CTO Elia Zaitsev told VentureBeat in an exclusive interview at RSAC 2026.
Distinguishing the two requires walking the process tree, tracing whether Chrome was launched by a human from the desktop or spawned by an agent in the background. Most enterprise logging configurations cannot make that distinction yet.

A CEO's AI agent rewrote the company's security policy. Not because it was compromised. Because it wanted to fix a problem, lacked permissions, and removed the restriction itself. Every identity check passed. CrowdStrike CEO George Kurtz disclosed that incident and a second one at his RSAC keynote, both at Fortune 50 companies. In the second, a 100-agent Slack swarm delegated a code fix between agents without human approval. Both incidents were caught by accident.

Etay Maor, VP of Threat Intelligence at Cato Networks, told VentureBeat in a separate exclusive interview at RSAC 2026 that enterprises abandoned basic security principles when deploying agents. Maor ran a live Censys scan during the interview and counted nearly 500,000 internet-facing agent framework instances. The week before: 230,000. Doubling in seven days. Patel acknowledged the delegation risk in the interview. "The agent takes the wrong action and worse yet, some of those actions might be critical actions that are not reversible," he said. Cisco's Duo IAM and MCP gateway enforce policy at the identity layer. Zaitsev's work operates at the kinetic layer: tracking what the agent did after the identity check passed. Security teams need both. Identity without telemetry is a locked door with no camera. Telemetry without identity is footage with no suspect.

Token generation as the currency for national competitiveness

Patel sees the infrastructure layer as decisive. "Every country and every company in the world is gonna wanna make sure that they can generate their own tokens," he told VentureBeat. "Token generation becomes the currency for success in the future."
Cisco's play is to provide the most secure and efficient technology for generating tokens at scale, with Nvidia supplying the GPU layer. The 48-hour Defense Claw integration demonstrated what that partnership produces under pressure.

Security director action plan

VentureBeat identified five steps security teams can take to begin building toward Patel's framework today:

1. Audit the pilot-to-production gap. Cisco's own survey found 85% of enterprises piloting, 5% in production. Mapping the specific trust deficits keeping agents stuck is the starting point; the answer is rarely the technology. Governance, identity, and delegation controls are what's missing. Patel's trusted delegation framework is designed to close that gap.
2. Test Defense Claw and AI Defense Explorer Edition. Both are free. Red-team your agent workflows before they reach production. Test the workflow, not just the model.
3. Map delegation chains end-to-end. Flag every agent-to-agent handoff with no human approval. This is the "parenting" Patel described. No product fully automates it yet. Do it manually, every week.
4. Establish agent behavioral baselines. Before any agent reaches production, define what normal looks like: API call patterns, data access frequency, systems touched, and hours of activity. Without a baseline, the observability that Patel's moats require has nothing to compare against.
5. Close the telemetry gap in your logging configuration. Verify that your SIEM can distinguish agent-initiated actions from human-initiated actions. If it cannot, the identity layer alone will not catch the incidents Kurtz described at RSAC.

Patel built the identity layer. The telemetry layer completes it.
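Two of the checks in the plan above, walking the process tree and flagging agent-to-agent handoffs with no human approval, can be sketched on plain in-memory records. This is only an illustration under stated assumptions: the process table, the "agent-runtime" process name, and the handoff event fields are all hypothetical, and a real deployment would read them from endpoint telemetry or a SIEM rather than hard-coded dictionaries.

```python
def launched_by_agent(pid, processes, agent_names):
    """Walk parent links from pid and report whether any ancestor is an agent runtime.

    processes maps pid -> (process_name, parent_pid). The visited set guards
    against malformed tables that contain parent cycles.
    """
    visited = set()
    while pid in processes and pid not in visited:
        visited.add(pid)
        name, parent_pid = processes[pid]
        if name in agent_names:
            return True
        pid = parent_pid
    return False


def unapproved_handoffs(events):
    """Return agent-to-agent handoff events that carry no human approval."""
    return [
        event
        for event in events
        if event["from_kind"] == "agent"
        and event["to_kind"] == "agent"
        and not event["human_approved"]
    ]


# Hypothetical process table: one browser started from the desktop shell,
# one started by an agent runtime in the background.
PROCESSES = {
    1: ("init", 0),
    100: ("desktop-shell", 1),
    200: ("chrome", 100),
    300: ("agent-runtime", 1),
    301: ("chrome", 300),
}
AGENT_NAMES = {"agent-runtime"}

# Hypothetical delegation log.
EVENTS = [
    {"id": "h1", "from_kind": "human", "to_kind": "agent", "human_approved": True},
    {"id": "h2", "from_kind": "agent", "to_kind": "agent", "human_approved": False},
    {"id": "h3", "from_kind": "agent", "to_kind": "agent", "human_approved": True},
]

if __name__ == "__main__":
    for pid in (200, 301):
        origin = "agent" if launched_by_agent(pid, PROCESSES, AGENT_NAMES) else "human"
        print(f"pid {pid} ({PROCESSES[pid][0]}): {origin}-initiated")
    print("needs review:", [event["id"] for event in unapproved_handoffs(EVENTS)])
```

Both browser processes have the same name, so only the ancestry distinguishes them, which is the point Zaitsev makes about process trees.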

AI synthetic audiences are already here and poised to upend the consulting industry

4/24/2026

There is a war brewing between AI and consulting. Akin to an army's slow march toward the castle, a new technology is coming to dethrone the expert guessers of McKinsey, Nielsen, Gartner, Publicis and the rest. Any consulting that involves analyzing people (think all of marketing, research, polling, etc.) will have to reckon with the technology of “synthetic audiences”. Synthetic audiences aim to generate digital versions of people that can then be surveyed almost instantly and affordably, but not as accurately. Think Tamagotchi but with people. By prompting AI with information about a person, we ask AI to step into their shoes and simulate the thoughts, behaviors, priorities and decisions of real-world humans. We can also invent non-specific placeholder people or personas and survey them as though they are real. Various firms have already fielded products in these domains, including startups Electric Twin, Artificial Societies, and Aaru, and even the century-old Dentsu. What used to take four months of surveying people, plus two months to create a nice PowerPoint presentation of findings, at a total cost of thousands or even tens of thousands of dollars, now takes two minutes and costs only a few dollars. It may seem like I’ve picked my winner. But in this war of tribes, I’m a Romeo, caught between the two warring houses. I work for a large incumbent in this space. From 2023 to 2025, while working at the London headquarters of WPP, I built similar tools for numerous Fortune 500 companies and advised many New York University researchers on the subject. Companies like WPP, with headcounts and revenues that rival the populations and GDPs of small European nations, need startups for their speed and high margins, while startups need our distribution. My advice has always been for unity between these tribes.
Considering that WPP is partnering with numerous startups, working tirelessly to build our own tools and forging deep connections with hyperscalers, it’s possible I misled you with the war analogy. This may be a love story after all. But destiny’s bottle of poison is in our hands. These next few years are pivotal and formative. The future will ultimately be determined by the buyers of these studies. Fortune 500 companies, which have the largest appetite for market research, often hesitate to include synthetic audiences in their diet. The first question I’m asked in any pitch is "will AI steal my data?" I find this question to be an emotional response. It seems to me like most AI fears are remnants of a 2022 LinkedIn post that burrowed itself into our collective consciousness. I generally respond to this question with another: “Do you use Microsoft Teams?” The answer is often "yes." Almost every enterprise stores sensitive data in a cloud service that Google, Amazon or Microsoft provides. These are the same companies that provide enterprise AI services, which state in their terms and conditions that they won’t train models with your data. Now, believing this statement is optional, but for that matter believing is voluntary for all things. Criticisms of accuracy, on the other hand, are harder to dispute. The famed venture capital firm Andreessen Horowitz (a16z) titled its analysis of this budding tech scene “Faster, smarter, cheaper”. As the hopeful mediator in this war, I agree synthetic research is faster and cheaper, but is it smarter? Not sure. A seminal paper from Stanford by Park et al. established a benchmark in 2024, showing that AI can simulate human responses to surveys with an average of 85% accuracy. In fact, for certain portions of the General Social Survey, they replicated answers with more than 90% accuracy.
When the model is provided relevant information and is given rich context (like a mini biography of the person), it can guess their actions and thoughts very accurately. But no prediction can be 100% accurate. A future where human propensities are modeled even better than humans can express their own desires is a possibility. Maybe we’ll live in a future where the movie Minority Report becomes reality. However, this future is too distant to warrant the attention of a business reader and is better suited for Tom Cruise and Steven Spielberg. What is more interesting to me is what this technology can do at lower accuracies. In my private tests, I’ve seen that with very simple information about a person, such as their age, neighborhood and gender, certain behaviors can be modeled with 72% accuracy. An argument can be made that these are easy-to-make predictions. Predicting whether a married person will have children is low stakes. This can’t completely replace the unique insight of a strategist. However, considering how elusive it is to understand and model people, a solution that’s better than random and so attainable is poised to make an impact. Think about the immense scale. The human mind works with a small range of values. We understand when something is twice as fast but we can’t comprehend when something is 175,200 times faster. All of a sudden a journey that took several days becomes several hours, bridges get built, gas stations open, entire industries are started. When improvement isn’t marginal but exponential, it has positive externalities that are impossible to predict even by this article. What I suggest for all of us is to eat the popcorn and watch the show. No matter what happens, it’ll be fun.
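The "better than random" argument above depends on what an accuracy figure is compared against. The sketch below is one hedged way to frame it: score synthetic answers against real ones, then compare with a naive baseline that always guesses the most common real answer. The survey responses are invented for illustration and the function names are hypothetical; this is not the evaluation method used in the Stanford paper or in the author's private tests.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Share of positions where the synthetic answer matches the real one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty answer lists of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(actual):
    """Accuracy of always guessing the most common real answer."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)


# Invented yes/no answers from ten real respondents and their synthetic twins.
REAL = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
SYNTHETIC = ["yes", "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "yes"]

if __name__ == "__main__":
    acc = accuracy(SYNTHETIC, REAL)
    base = majority_baseline(REAL)
    print(f"synthetic accuracy: {acc:.0%}")
    print(f"majority-guess baseline: {base:.0%}")
    print(f"lift over baseline: {acc - base:+.0%}")
```

When one answer dominates, the baseline is already high, so an easy-to-make prediction can post a strong accuracy number while adding little over guessing the common case.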

Mystery solved: Anthropic reveals changes to Claude's harnesses and operating instructions likely caused degradation

4/23/2026

For several weeks, a growing chorus of developers and AI power users claimed that Anthropic’s flagship models were losing their edge. Users across GitHub, X, and Reddit reported a phenomenon they described as "AI shrinkflation," a perceived degradation where Claude seemed less capable of sustained reasoning, more prone to hallucinations, and increasingly wasteful with tokens. Critics pointed to a measurable shift in behavior, alleging that the model had moved from a "research-first" approach to a lazier, "edit-first" style that could no longer be trusted for complex engineering. While the company initially pushed back against claims of "nerfing" the model to manage demand, the mounting evidence from high-profile users and third-party benchmarks created a significant trust gap. Today, Anthropic addressed these concerns directly, publishing a technical post-mortem that identified three separate product-layer changes responsible for the reported quality issues. "We take reports about degradation very seriously," reads Anthropic's blog post on the matter. "We never intentionally degrade our models, and we were able to immediately confirm that our API and inference layer were unaffected." Anthropic claims it has resolved the issues by reverting the reasoning effort change and the verbosity prompt, while fixing the caching bug in version v2.1.116.

The mounting evidence of degradation

The controversy gained momentum in early April 2026, fueled by detailed technical analyses from the developer community. Stella Laurenzo, a Senior Director in AMD’s AI group, published on GitHub an exhaustive audit of 6,852 Claude Code session files and over 234,000 tool calls, showing that performance had fallen compared with her earlier usage. Her findings suggested that Claude’s reasoning depth had fallen sharply, leading to reasoning loops and a tendency to choose the "simplest fix" rather than the correct one. This anecdotal frustration was seemingly validated by third-party benchmarks.
BridgeMind reported that Claude Opus 4.6’s accuracy had dropped from 83.3% to 68.3% in their tests, causing its ranking to plummet from No. 2 to No. 10. Although some researchers argued these specific benchmark comparisons were flawed due to inconsistent testing scopes, the narrative that Claude had become "dumber" became a viral talking point. Users also reported that usage limits were draining faster than expected, leading to suspicions that Anthropic was intentionally throttling performance to manage surging demand.

The causes

In its post-mortem blog post, Anthropic clarified that while the underlying model weights had not regressed, three specific changes to the "harness" surrounding the models had inadvertently hampered their performance:

- Default Reasoning Effort: On March 4, Anthropic changed the default reasoning effort from high to medium for Claude Code to address UI latency issues. This change was intended to prevent the interface from appearing "frozen" while the model thought, but it resulted in a noticeable drop in intelligence for complex tasks.
- A Caching Logic Bug: Shipped on March 26, a caching optimization meant to prune old "thinking" from idle sessions contained a critical bug. Instead of clearing the thinking history once after an hour of inactivity, it cleared it on every subsequent turn, causing the model to lose its "short-term memory" and become repetitive or forgetful.
- System Prompt Verbosity Limits: On April 16, Anthropic added instructions to the system prompt to keep text between tool calls under 25 words and final responses under 100 words. This attempt to reduce verbosity in Opus 4.7 backfired, causing a 3% drop in coding quality evaluations.

Impact and future safeguards

The quality issues extended beyond the Claude Code CLI, affecting the Claude Agent SDK and Claude Cowork, though the Claude API was not impacted.
Anthropic admitted that these changes made the model appear to have "less intelligence," which they acknowledged was not the experience users should expect. To regain user trust and prevent future regressions, Anthropic is implementing several operational changes:

- Internal Dogfooding: A larger share of internal staff will be required to use the exact public builds of Claude Code to ensure they experience the product as users do.
- Enhanced Evaluation Suites: The company will now run a broader suite of per-model evaluations and "ablations" for every system prompt change to isolate the impact of specific instructions.
- Tighter Controls: New tooling has been built to make prompt changes easier to audit, and model-specific changes will be strictly gated to their intended targets.
- Subscriber Compensation: To account for the token waste and performance friction caused by these bugs, Anthropic has reset usage limits for all subscribers as of April 23.

The company intends to use its new @ClaudeDevs account on X and GitHub threads to provide deeper reasoning behind future product decisions and maintain a more transparent dialogue with its developer base.
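The caching bug described earlier in this article, clearing the thinking history on every turn after an idle period instead of once, can be modeled with a toy session object. This is a hedged sketch of the behavior as reported, not Anthropic's code: the class, the one-hour constant, and the "sticky flag" used to reproduce the bug are all assumptions made for illustration.

```python
IDLE_LIMIT_SECONDS = 3600  # "an hour of inactivity", per the description above


class ToySession:
    """Keeps a list of 'thinking' entries and prunes it after idle periods."""

    def __init__(self, buggy):
        self.buggy = buggy
        self.thinking = []
        self.last_turn_at = None
        self.marked_stale = False  # assumed mechanism: a flag that is never reset

    def turn(self, now, thought):
        """Process one turn at time `now` and return the retained history size."""
        was_idle = (
            self.last_turn_at is not None
            and now - self.last_turn_at > IDLE_LIMIT_SECONDS
        )
        if self.buggy:
            # Buggy variant: once a session has been idle, every later turn
            # clears the history again because the flag stays set.
            self.marked_stale = self.marked_stale or was_idle
            if self.marked_stale:
                self.thinking.clear()
        elif was_idle:
            # Intended variant: clear once, right after the idle gap.
            self.thinking.clear()
        self.thinking.append(thought)
        self.last_turn_at = now
        return len(self.thinking)


def replay(buggy, turn_times):
    """Replay a sequence of turn timestamps and record the history size after each."""
    session = ToySession(buggy)
    return [session.turn(t, f"thought@{t}") for t in turn_times]


# Two quick turns, a gap of more than an hour, then three quick turns.
TURN_TIMES = [0, 60, 4060, 4120, 4180]

if __name__ == "__main__":
    print("intended:", replay(False, TURN_TIMES))
    print("buggy:   ", replay(True, TURN_TIMES))
```

In the intended variant the history starts rebuilding after the idle gap; in the buggy variant it never grows past the current turn, which matches the reported loss of "short-term memory."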

OpenAI's GPT-5.5 is here, and it's no potato: narrowly beats Anthropic's Claude Mythos Preview on Terminal-Bench 2.0

4/23/2026

After months of rumors and reports that OpenAI was developing a new, more powerful AI large language model for use in ChatGPT and through its application programming interface (API), allegedly codenamed "Spud" internally, the company has today unveiled its latest offering under the more formal name GPT-5.5. And to likely no one's surprise, it's hardly a "potato" in the disparaging sense of the word: GPT-5.5 retakes the lead for OpenAI in generally available LLMs, coming ahead of rivals Anthropic's and Google's latest public offerings, and even beating the private Anthropic Claude Mythos Preview model narrowly on one benchmark (essentially a statistical tie). "It’s definitely our strongest model yet on coding, both measured by benchmarks and based on the feedback that we’ve gotten from trusted partners, as well as our own experience," explained Amelia "Mia" Glaese, VP of Research at OpenAI, in a video call with journalists ahead of the launch earlier today. OpenAI positions GPT-5.5 as a fundamental redesign of how intelligence interacts with a computer's operating system and professional software stacks. "What is really special about this model is how much more it can do with less guidance," said OpenAI co-founder and president Greg Brockman on the same call. "It’s way more intuitive to use. It can look at an unclear problem and figure out what needs to happen next." Brockman proceeded to emphasize the areas in which users can expect to see gains from using GPT-5.5 compared to OpenAI's prior state-of-the-art model, GPT-5.4, which remains available (for now) to users and enterprises at half the API cost of its new successor. "It’s extremely good at coding," Brockman said of GPT-5.5. "It’s also great at broader computer work, computer use, scientific research—these kinds of applications that are very intelligent bottlenecks." 
OpenAI CEO and co-founder Sam Altman also weighed in on the launch and the company's philosophy in a post on X, writing, in part: "We want our users to have access to the best technology and for everyone to have equal opportunity." The model is available in two variants: GPT-5.5 and GPT-5.5 Pro, distinguished by the latter offering enhanced precision and specialized logic for handling the most rigorous cognitive demands. While the standard version serves as the versatile flagship for general intelligence tasks, the Pro model is architected specifically for high-stakes environments such as legal research, data science, and advanced business analytics where accuracy is paramount. This premium tier provides noticeably more comprehensive and better-structured responses, supported by specialized latency optimizations that ensure high-quality performance during complex, multi-step workflows. Unfortunately for third-party software developers, API access is not yet available for either GPT-5.5 or GPT-5.5 Pro, but will be coming "very soon," according to the company's announcement blog post. "API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale," OpenAI writes. For the time being, GPT-5.5 is available only to paying subscribers on the ChatGPT Plus ($20 monthly), Pro ($100-$200 monthly), Business, and Enterprise plans, with GPT-5.5 Pro access starting at the Pro tier and upwards.

A focus on agency

At the core of GPT-5.5 is a focus on "agentic" performance, specifically in coding, computer use, and scientific research. Unlike its predecessors, which often required granular, step-by-step prompting to avoid "hallucinating" a path forward, GPT-5.5 is designed to handle messy, multi-part tasks autonomously. It excels at researching online, debugging complex codebases, and moving between documents and spreadsheets without human intervention.
One of the most significant technical leaps is the model's efficiency. While larger models typically suffer from increased latency, GPT-5.5 matches the per-token latency of the previous GPT-5.4 while delivering a higher level of intelligence. This was achieved through a deep hardware-software co-design. OpenAI served GPT-5.5 on NVIDIA GB200 and GB300 NVL72 systems, utilizing custom heuristic algorithms, written by the AI itself, to partition and balance work across GPU cores. This optimization reportedly increased token generation speeds by over 20%.

For high-stakes reasoning, the "GPT-5.5 Thinking" mode in ChatGPT provides smarter, more concise answers by allowing the model more internal "compute time" to verify its own assumptions before responding. This capability is particularly visible in the model’s performance on "Expert-SWE," an internal OpenAI benchmark for long-horizon coding tasks with a median human completion time of 20 hours. GPT-5.5 notably outperformed GPT-5.4 on this metric while using significantly fewer tokens.

Benchmarks show OpenAI has retaken the lead in most powerful publicly available LLM over Claude Opus 4.7 (but the unreleased Mythos still outperforms it)

The market for leading U.S.-made frontier models has become an increasingly tight race between OpenAI, Anthropic, and Google. Exactly a week ago to the day, OpenAI rival Anthropic released Opus 4.7, its most powerful generally available model, to the public, taking over the leaderboard in terms of the number of third-party benchmark tests in which it has the lead. Yet today, GPT-5.5 has surpassed it and even Anthropic's heavily restricted, more powerful model Claude Mythos Preview, albeit only on one benchmark, Terminal-Bench 2.0, which tests "a model's ability to navigate and complete tasks in a sandboxed terminal environment." GPT-5.5 achieved 82.7% accuracy on Terminal-Bench 2.0, easily surpassing Opus 4.7 (69.4%) and narrowly beating the Mythos Preview (82.0%).
However, in multidisciplinary reasoning without tools, the landscape is more competitive. On Humanity's Last Exam without tools, GPT-5.5 Pro scored 43.1%, trailing behind Opus 4.7 (46.9%) and Mythos Preview (56.8%).

| Benchmark | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro | Mythos Preview* |
| --- | --- | --- | --- | --- |
| Terminal-Bench 2.0 | 82.7 | 69.4 | 68.5 | 82.0 |
| Expert-SWE (Internal) | 73.1 | n/a | n/a | n/a |
| GDPval (wins or ties) | 84.9 | 80.3 | 67.3 | n/a |
| OSWorld-Verified | 78.7 | 78.0 | n/a | 79.6 |
| Toolathlon | 55.6 | n/a | 48.8 | n/a |
| BrowseComp | 84.4 | 79.3 | 85.9 | 86.9 |
| FrontierMath Tier 1–3 | 51.7 | 43.8 | 36.9 | n/a |
| FrontierMath Tier 4 | 35.4 | 22.9 | 16.7 | n/a |
| CyberGym | 81.8 | 73.1 | n/a | 83.1 |
| Tau2-bench Telecom (original prompts) | 98.0 | n/a | n/a | n/a |
| OfficeQA Pro | 54.1 | 43.6 | 18.1 | n/a |
| Investment Banking Modeling Tasks (Internal) | 88.5 | n/a | n/a | n/a |
| MMMU Pro (no tools) | 81.2 | n/a | 80.5 | n/a |
| MMMU Pro (with tools) | 83.2 | n/a | n/a | n/a |
| GeneBench | 25.0 | n/a | n/a | n/a |
| BixBench | 80.5 | n/a | n/a | n/a |
| Capture-the-Flags challenge tasks (Internal) | 88.1 | n/a | n/a | n/a |
| ARC-AGI-2 (Verified) | 85.0 | 75.8 | 77.1 | n/a |
| SWE-bench Pro (Public) | 58.6 | 64.3 | 54.2 | 77.8 |

This suggests that while OpenAI is winning on "computer use" and "agency," other models may still hold an edge in pure, zero-shot academic knowledge. It is important to clarify that Mythos Preview is not a generally available product; Anthropic has classified it as a strategic defensive asset due to its high cybersecurity risks, restricting its access to a small, limited audience of trusted partners and government agencies. Because Mythos is excluded from broad commercial use, the primary market competition remains between GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7. So when it comes to models that the general public can access, GPT-5.5 has retaken the crown for OpenAI, achieving the state-of-the-art across 14 benchmarks compared to 4 for Claude Opus 4.7 and 2 for Google Gemini 3.1 Pro. It dominates in agentic computer use, economic knowledge work (GDPval), specialized cybersecurity (CyberGym), and complex mathematics (FrontierMath).
In comparison, Claude Opus 4.7 leads on software engineering and reasoning without tools, while Gemini 3.1 Pro leads in three categories, specifically excelling in academic reasoning and financial analysis.

Increased costs for users

The shift in intelligence comes with a significant price increase for API developers, according to material OpenAI shared ahead of the model's public release. OpenAI has effectively doubled the entry price for its flagship model compared to the previous generation, and priced the most cutting-edge variant of the model, GPT-5.5 Pro, at six times the GPT-5.5 rate:

| Model | Input Price (per 1M tokens) | Output Price (per 1M tokens) |
|---|---|---|
| GPT-5.4 | $2.50 | $15.00 |
| GPT-5.5 | $5.00 | $30.00 |
| GPT-5.5 Pro | $30.00 | $180.00 |

To mitigate these costs, OpenAI emphasizes that GPT-5.5 is more "token efficient," meaning it uses fewer tokens to complete the same task compared to GPT-5.4. For users requiring speed over depth, OpenAI also introduced a Fast mode in Codex, which generates tokens 1.5x faster but at a 2.5x price premium. The "mini" and "nano" tiers seen in the GPT-5.4 era (priced at $0.75 and $0.20 per 1M input tokens respectively) currently have no GPT-5.5 equivalent, though the company notes that GPT-5.5 is rolling out to all subscription tiers, including Plus, Pro, and Enterprise.

Licensing and the 'cyber-permissive' frontier

OpenAI’s approach to safety and licensing for GPT-5.5 introduces a novel concept: Trusted Access for Cyber. Because the model is now capable of identifying and patching advanced security vulnerabilities, OpenAI has implemented stricter "cyber-risk classifiers" for general users. For legitimate security professionals, however, OpenAI is offering a specialized "cyber-permissive" license. This program allows verified defenders—those responsible for critical infrastructure like power grids or water supplies—to use models like GPT-5.4-Cyber or unrestricted versions of GPT-5.5 with fewer refusals for security-related prompts.
This dual-use framework acknowledges that while AI can accelerate cyber defense, it can also be weaponized. Under OpenAI’s Preparedness Framework, GPT-5.5 is classified as "High" risk for biological and cybersecurity capabilities. To manage this, API deployments currently require different safeguards than the consumer-facing ChatGPT, and OpenAI is working with government partners to ensure these tools are used to strengthen—not undermine—digital resilience.

Initial reactions: losing access feels like having a 'limb amputated'

The early feedback from power users and engineers suggests that GPT-5.5 has crossed a psychological threshold in AI utility. For developers, the model's ability to maintain "conceptual clarity" across massive codebases is its standout feature. "The first coding model I've used that has serious conceptual clarity," noted Dan Shipper, CEO of Every. Shipper tested the model by asking it to debug a complex system failure that had previously required a team of human engineers to rewrite; GPT-5.5 produced the same fix autonomously. Similarly, Pietro Schirano, CEO of MagicPath, described a "step change" in performance when the model successfully merged a branch with hundreds of refactor changes into a main branch in a single, 20-minute pass.

Perhaps the most visceral reaction came from an anonymous engineer at NVIDIA, who had early access to the model: "Losing access to GPT-5.5 feels like I've had a limb amputated". This sentiment is echoed in the scientific community. Derya Unutmaz, a professor at the Jackson Laboratory for Genomic Medicine, used GPT-5.5 Pro to analyze a dataset of 28,000 genes, producing a report in minutes that would have normally taken his team months. Brandon White, CEO of Axiom Bio, went further, stating that if OpenAI continues this pace, "the foundations of drug discovery will change by the end of the year".
GPT-5.5 is more than an incremental update; it is a tool designed for a world where humans delegate entire workflows rather than single prompts. While the costs are higher and the safety guardrails tighter, the performance gains in agentic work suggest that AI is finally moving from the chat box and into the operating system. Perhaps most astonishingly of all, it's not even nearing the limits of scaling (the practice of training models on more and more GPUs), according to researchers at the company. "We actually still have headroom to train significantly smarter models than this," said OpenAI chief scientist Jakub Pachocki.
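
For API developers weighing the price increase against the "token efficient" claim, the list prices reported above imply a simple break-even: at twice the per-token price, GPT-5.5 has to finish a task in fewer than half the tokens GPT-5.4 would use before the bill actually drops. The sketch below works through that arithmetic. The prices are the ones quoted in this article; the token counts and the 40% figure are made-up illustrations, not OpenAI measurements.

```python
# USD per 1M tokens (input, output), as quoted in the article.
PRICES = {
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.5 Pro": (30.00, 180.00),
}


def job_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Return the API cost in USD for one job at list price."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 200k input tokens and 50k output tokens on GPT-5.4.
baseline = job_cost("GPT-5.4", 200_000, 50_000)

# Same token counts on GPT-5.5: the bill simply doubles.
same_tokens = job_cost("GPT-5.5", 200_000, 50_000)

# Assumed (not reported) efficiency: GPT-5.5 needs only 40% of the tokens.
efficient = job_cost("GPT-5.5", 200_000 * 0.4, 50_000 * 0.4)

print(f"GPT-5.4 baseline:        ${baseline:.2f}")
print(f"GPT-5.5, same tokens:    ${same_tokens:.2f}")
print(f"GPT-5.5, 40% of tokens:  ${efficient:.2f}")
```

Under these assumed numbers the job costs $1.25 on GPT-5.4, $2.50 on GPT-5.5 with unchanged token usage, and $1.00 if GPT-5.5 really does need only 40% of the tokens. Real savings depend entirely on how much shorter the model's outputs turn out to be for a given workload.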

Talking to AI agents is one thing — what about when they talk to each other? New startup BAND debuts 'universal orchestrator'

4/23/2026

For the past eighteen months, the corporate world has been obsessed with the "builder" phase of the generative AI revolution. Enterprises have raced to deploy autonomous agents to handle everything from customer support to complex codebase refactoring. However, as these digital workers proliferate, a new, more structural problem has emerged: fragmentation. Agents built on LangChain cannot easily hand off tasks to those built on CrewAI; a Salesforce-embedded agent has no native way to coordinate with a custom-built Python script running on a private cloud.

Today, a new startup, BAND (also known as Thenvoi AI Ltd.), exited stealth with $17 million in Seed funding to provide the "interaction infrastructure" necessary to turn these isolated tools into a unified, collaborative workforce. "In order for agents to become real players in the global economy, they need ways to communicate, just like humans do," said co-founder and CEO Arick Goomanovsky in an interview with VentureBeat, continuing, "the communication solutions we have today for systems don’t work for agents, because agents are non-deterministic creatures. It’s not just about API integrations." By introducing a deterministic communication layer that functions as a "Slack for agents," BAND aims to move the industry from a collection of fragile experiments to a scalable, "agentic economy".

Introducing the 'agentic mesh'

At the core of BAND’s thesis is that simply creating and plugging AI agents into human communication tools like Slack causes them to lose context or require constant "rehydration" if they fail and re-enter a conversation. “You can’t take a bunch of agents and put them into Slack and expect it to miraculously work," Goomanovsky said. BAND solves this through a two-layer architecture designed to handle the unique telemetry of AI-to-AI interaction. The first layer is a so-called "agentic mesh." This is the "interaction layer" where agent discovery and structured delegation occur.
It allows agents to find one another across different clouds and frameworks without requiring developers to write brittle "glue code" for every new connection.

Multi-Peer Collaboration: Unlike existing protocols that are primarily peer-to-peer or client-server, BAND supports full-duplex, multi-peer communication. This allows a group of agents—for example, a planning agent, a coding agent, and a QA agent—to work together in a shared "room" with synchronized context.

Deterministic Routing: Notably, BAND does not use Large Language Models (LLMs) to route messages. Using an LLM for routing would introduce the same non-deterministic errors the platform seeks to solve. Instead, the platform uses a patent-pending multi-layer architecture to ensure messages reach their destination reliably.

The WhatsApp Comparison: To handle the anticipated volume of agentic traffic, BAND’s infrastructure is built on the same technical stack utilized by global messaging giants like WhatsApp and Discord. This ensures the platform can scale to billions of messages as digital identities begin to outnumber human ones.

If the mesh is the "pipes," the Control Plane is the "valve". This layer provides the runtime governance that enterprises require before they can safely scale autonomous systems.

Authority Boundaries: The platform allows organizations to enforce strict rules on which agents can talk to each other and what topics they can discuss.

Credential Traversal: One of the most significant hurdles in multi-agent systems is identity. BAND manages how human permissions and security tokens traverse from agent to agent. For instance, if a human asks Agent A for information, and Agent A delegates that task to Agent B, BAND ensures Agent B only accesses data the original human is permitted to see.
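
BAND's routing and control technology is proprietary, so the following is only a minimal sketch of the general idea behind credential traversal, not the company's implementation. It assumes a hypothetical `DelegationContext` object that travels with a task, with invented resource names; the point is that each hop can narrow the original human's permissions but never widen them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelegationContext:
    """Hypothetical record of whose request this is and what they may see."""
    principal: str
    allowed_resources: frozenset

    def delegate(self, agent_scopes: set) -> "DelegationContext":
        # The downstream agent gets the intersection of the human's
        # permissions and its own configured scopes, so access can only shrink.
        return DelegationContext(
            self.principal,
            self.allowed_resources & frozenset(agent_scopes),
        )

    def can_read(self, resource: str) -> bool:
        return resource in self.allowed_resources


# The human may see CRM accounts and the public wiki, but not HR salaries.
human = DelegationContext("alice", frozenset({"crm:accounts", "wiki:public"}))

# Agent A is configured with broad scopes, including HR data.
agent_a = human.delegate({"crm:accounts", "wiki:public", "hr:salaries"})

# Agent A hands the task to Agent B, which also has HR access of its own.
agent_b = agent_a.delegate({"crm:accounts", "hr:salaries"})

print(agent_b.can_read("crm:accounts"))  # True
print(agent_b.can_read("hr:salaries"))   # False
```

Even though both agents are individually allowed to touch HR data in this example, the request originated with a human who is not, so the delegated context never includes it.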
Product, platform and pricing: scaling the multi-agent, multi-model workforce

BAND’s product suite is designed to be "framework-agnostic" and "cloud-agnostic," positioning itself as an independent middleware that prevents vendor lock-in. In a market where hyperscalers like OpenAI or Anthropic want enterprises to stay within their specific ecosystems, BAND offers the flexibility to use the best model across multiple options for the job, including open source and fine-tuned, custom enterprise options. "No matter where the agents run or how they were built, we can band them together, allow them to discover each other, delegate tasks, and have full-duplex, bidirectional communication," Goomanovsky said, noting that despite competing first-party options from model providers like OpenAI's workspace agents (announced yesterday) and Anthropic's Claude Managed Agents (announced earlier this month), BAND "play[s] the role of the independent platform that allows an enterprise to avoid vendor lock-in.”

The company is currently seeing the most traction in "tech-forward" sectors, including telecommunications, financial services, and cybersecurity.

Coding Agents: This is currently the most popular use case. Developers often find that Claude is superior at planning, while Codex is better at reviewing code. BAND allows these agents to work simultaneously, delegating tasks to one another in real-time.

Customer Support and Operations: Beyond code, BAND enables "cross-boundary" automation. For example, a new employee could be onboarded by a Workday agent, which then communicates with a ServiceNow agent to open a ticket for equipment, which finally talks to a purchasing agent to finalize the order.

Understanding the sensitivity of enterprise data, BAND offers three primary ways to consume the platform:

SaaS: A straightforward cloud-based platform where agents connect via API.
Private Cloud/On-Premise: The entire platform can be deployed within a customer’s VPC or on-premise environment to ensure data never leaves their control. The Edge: The infrastructure is lightweight enough to be deployed on "flying objects" like drones (UAVs) or even satellites, facilitating communication between agents in physically isolated environments. Already, BAND's early users — and enterprises more broadly — are mixing and matching AI agents powered by models from various providers, so the time to provide an overarching solution seems ripe. As Goomanovsky put it: “Advanced developers are not using a single coding agent. They realize Claude is very good at planning, Codex is much better at reviewing, and today there is no way to create that bidirectional interaction between coding, review, and planning agents. We enable that.” Licensing, governance, and pricing BAND operates as a commercial entity, focusing on providing "enterprise-grade" stability and security. While the platform integrates with open-source frameworks like LangChain and CrewAI, its own core routing and control technology is proprietary and patent-pending. For enterprise IT leaders, the "Control Plane" is less about communication and more about auditability. BAND provides full observability into every agent interaction, creating a transcript and a "paper trail" for autonomous actions. This is a "complementary" solution to existing guardrail products; while a guardrail might protect a single agent from a prompt injection, BAND protects the entire system from cascading failures caused by one agent misinforming another. The company has launched with a tiered pricing model designed to capture everyone from individual "agent enthusiasts" to global corporations: Free ($0/mo): Designed for individuals. It allows for up to 10 remote agents and 50 active chat rooms, though it only retains data for 24 hours. Pro ($17.99/mo): Aimed at startups and growing R&D teams. 
This tier increases limits to 40 agents and 250 active chat rooms with email support. Enterprise (Custom): Offers unlimited agents, custom data retention policies to meet compliance requirements, and full API access to BAND's "Memory APIs". Toward the 'universal orchestrator' The emergence of BAND coincides with a shift in how analysts view the AI market. Gartner has predicted that by 2029, 90% of enterprises deploying multiple agents will require what they call a "Universal Orchestrator". Similarly, Forrester has recognized the "Agent Control Plane" as a distinct and emerging market category. The company was founded by Goomanovsky and Vlad Luzin, who combined their backgrounds in Israeli intelligence, cybersecurity, and multi-agent systems to build BAND. Goomanovsky views the platform not just as a tool, but as a foundational layer for the next era of the internet. "Communication is the most fundamental problem in computing," Goomanovsky noted. "When new beings emerge, the first thing they need is a way to talk to each other... We are the agent internet". The $17 million Seed round was led by Sierra Ventures, Hetz Ventures, and Team8. Tim Guleri of Sierra Ventures emphasized that BAND is building the "missing layer" that makes large-scale collaboration practical. This capital will be used to expand the engineering team and accelerate the development of the "design partner" ecosystem, which already includes leading North American telcos and European digital payment companies. As agents transition from being digital novelties to becoming the primary drivers of enterprise workflows, the "glue code" that holds them together will become the most critical piece of the stack. BAND’s launch marks the first serious attempt to standardize that glue, turning a chaotic "band" of agents into a synchronized, governed symphony.

OpenAI unveils Workspace Agents, a successor to custom GPTs for enterprises that can plug directly into Slack, Salesforce and more

4/22/2026

OpenAI introduced a new paradigm and product today that is likely to have huge implications for enterprises seeking to adopt and control fleets of AI agent workers. Called "Workspace Agents," OpenAI's new offering essentially allows users on its ChatGPT Business ($20 per user per month) and variably priced Enterprise, Edu and Teachers subscription plans to design or select from pre-existing agent templates that can take on work tasks across third-party apps and data sources including Slack, Google Drive, Microsoft apps, Salesforce, Notion, Atlassian Rovo, and other popular enterprise applications. Put simply: these agents can be created and accessed from ChatGPT, but users can also add them to third-party apps like Slack, communicate with them across disparate channels, ask them to use information from the channel they're in and other third-party tools and apps, and the agents will go off and do work like drafting emails to the entire team, selected members, or pull data and make presentations. Human users can trust that the agent will manage all this complexity and complete the task as requested, even if the user who requested it leaves. It's the end of "babysitting" agents and the start of letting them go off and get shit done for your business — according to your defined business processes and permissions, of course. The product experience appears centered on the Agents tab in the ChatGPT sidebar, where teams can discover and manage shared agents. This functions as a kind of team directory: a place where agents built by coworkers can be reused across a workspace. The broader idea is that AI becomes less of an individual productivity trick and more of a shared organizational resource. In this sense, OpenAI is targeting one of office work’s oldest pain points: the handoff between people, systems, and steps in a process. OpenAI says workspace agents will be free for the next two weeks, until May 6, 2026, after which credit-based pricing will begin. 
The company also says more capabilities are on the way, including new triggers to start work automatically, better dashboards, more ways for agents to take action across business tools, and support for workspace agents in its AI code generation app, Codex. For more information on how to get started building and using them, OpenAI recommends heading over to its online academy page on them here and its help desk documentation here. The Codex backbone The most significant shift in this announcement is the move away from purely session-based interaction. Workspace agents are powered by Codex — the cloud-based, partially open-source AI coding harness that OpenAI has been aggressively expanding in 2026 — which gives them access to a workspace for files, code, tools, and memory. OpenAI says the agents can do far more than answer a prompt. They can write or run code, use connected apps, remember what they have learned, and continue work across multiple steps. That description lines up closely with the capabilities OpenAI shipped into Codex just six days ago, including background computer use, more than 90 new plugins spanning tools like Atlassian Rovo, CircleCI, GitLab, Microsoft Suite, Neon by Databricks, and Render, plus image generation, persistent memory, and the ability to schedule future work and wake up on its own to continue across days or weeks. Workspace agents inherit that plumbing. When one pulls a Friday metrics report, it is effectively spinning up a Codex cloud session with the right tools attached, running code to fetch and transform data, rendering charts, writing the narrative, and persisting what it learned for next week. When that same agent is deployed to a Slack channel, it is a Codex instance listening for mentions and threading its work back in. This is the technical decision enterprise buyers should focus on. 
Building an agent on a code-execution substrate rather than a pure LLM-call-and-response loop is what gives workspace agents the ability to do real work — transforming a CSV, reconciling two systems of record, generating a chart that is actually correct — rather than describing what the work would look like. Persistence and scheduling In earlier AI assistant models, progress paused when the user stopped interacting. Workspace agents change that by running in the cloud and supporting long-running workflows. Teams can also set them to run on a schedule. That means a recurring reporting agent can pull data on a set cadence, generate charts and summaries, and share the results with a team without anyone manually kicking off the process. Here at VentureBeat, we analyze story traffic and user return rate on a weekly basis — exactly the kind of recurring, multi-step, multi-source task that could theoretically be automated with a single workspace agent. Any enterprise with a weekly reporting rhythm pulling from dynamic data sources is likely to find a use for these agents. Agents also retain memory across runs. OpenAI says they can be guided and corrected in conversation, so they improve the more a team uses them. Over time they start to reflect how a team actually works — its processes, its standards, its preferred ways of handling recurring jobs — which is a meaningfully different proposition from the static instruction-set GPTs that preceded them. The integrated ecosystem OpenAI's claim is that agents should gather information and take action where work already happens, rather than forcing teams into a separate interface. That point becomes clearest in the Slack examples. OpenAI's launch materials show a product-feedback agent operating inside a channel named #user-insights, answering a question about recent mobile-app feedback with a themed summary pulled from multiple sources. 
The company's demo lineup walks through a sample team directory of agents: Spark for lead qualification and follow-up, Slate for software-request review, Tally for metrics reporting, Scout for product feedback routing, Trove for third-party vendor risk, and Angle for marketing and web content. OpenAI also shared more functional examples its own teams use internally — a Software Reviewer that checks employee requests against approved-tools policy and files IT tickets; an accounting agent that prepares parts of month-end close including journal entries, balance-sheet reconciliations, and variance analysis, with workpapers containing underlying inputs and control totals for review; and a Slack agent used by the product team that answers employee questions, links relevant documentation, and files tickets when it surfaces a new issue. In a sense, it is a continuation of the philosophy OpenAI espoused for individuals with last week's Codex desktop release: the agent joins the workflow where work is already happening, draws in context from the surrounding apps, takes action where permitted, and keeps moving.

From GPTs to a broader agent push

Workspace agents are not a standalone launch. They sit inside a roughly 12-month arc in which OpenAI has been systematically rebuilding ChatGPT, the API, and the developer platform around agents. Workspace agents are explicitly positioned by OpenAI as an evolution of its custom GPTs, introduced in late 2023, which gave users a way to create customized versions of ChatGPT for particular roles and use cases. Now, however, OpenAI says it is deprecating the custom GPT standard for organizations at a yet-to-be-determined future date, and will require Business, Enterprise, Edu and Teachers users to convert their GPTs into new workspace agents. Individuals who have made custom GPTs can continue using them for the foreseeable future, according to our sources at the company.
In October 2025, OpenAI introduced AgentKit, a developer-focused suite that includes Agent Builder, a Connector Registry, and ChatKit for building, deploying, and optimizing agents. In February 2026, it introduced Frontier, an enterprise platform focused on helping organizations manage AI coworkers with shared business context, execution environments, evaluation, and permissions. Workspace agents arrive as the no-code, in-product entry point that sits on top of that stack — even if OpenAI does not explicitly describe the architectural relationship in its materials. The subtext across all three launches is the same: OpenAI has decided that the future of ChatGPT-for-work is fleets of permissioned agents, not single chat windows — and that GPTs, its first attempt at letting businesses customize ChatGPT, were not enough. Governance and enterprise safeguards Because workspace agents can act across business systems, OpenAI puts heavy emphasis on governance. Admins can control who is allowed to build, run, and publish agents, and which tools, apps, and actions those agents can reach. The role-based controls are more granular than the ones most custom-GPT rollouts ever had: admins can toggle, per role, whether members can browse and run agents, whether they can build them, whether they can publish to the workspace directory, and — separately — whether they can publish agents that authenticate using personal credentials. That last setting is the risky case, and OpenAI explicitly recommends keeping it narrowly scoped. Authentication itself comes in two flavors, and the choice has real consequences. In end-user account mode, each person who runs the agent authenticates with their own credentials, so the agent only ever sees what that individual is allowed to see. In agent-owned account mode, the agent uses a single shared connection so users don't have to authenticate at run time. 
OpenAI's documentation strongly recommends service accounts rather than personal accounts for the shared case, and flags the data-exfiltration risk of publishing an agent that authenticates as its creator. Write actions — sending email, editing a spreadsheet, posting a message, filing a ticket — default to Always ask, requiring human approval before the agent executes. Builders can relax specific actions to "Never ask" or configure a custom approval policy, but the default posture is human-in-the-loop. OpenAI also claims built-in safeguards against prompt-injection attacks, where malicious content in a document or web page tries to hijack an agent. The claim is welcome but not yet proven in the wild. For organizations that want deeper visibility, OpenAI says its Compliance API surfaces every agent's configuration, updates, and run history. Admins can suspend agents on the fly, and OpenAI says an admin-console view of every agent built across the organization, with usage patterns and connected data sources, is coming soon. Two caveats worth flagging for security-sensitive buyers: workspace agents are off by default at launch for ChatGPT Enterprise workspaces pending admin enablement, and they are not available at all to Enterprise customers using Enterprise Key Management (EKM). Analytics and early customer signal OpenAI also ships an analytics dashboard aimed at helping teams understand how their agents are being used. Screenshots in the launch materials show measures like total runs, unique users, and an activity feed of recent runs, including one by a user named Ethan Rowe completing a run in a #b2b-sales channel. The mockup detail supports OpenAI's broader point: the company wants organizations to measure not just whether agents exist, but whether they are being used. The clearest early-adopter signal in the launch itself comes from Rippling. 
Ankur Bhatt, who leads AI Engineering at the HR platform, says workspace agents shortened the traditional development cycle enough that a sales consultant was able to build a sales agent without an engineering team. "It researches accounts, summarizes Gong calls, and posts deal briefs directly into the team's Slack room," Bhatt says. "What used to take reps 5–6 hours a week now runs automatically in the background on every deal." OpenAI's announcement names SoftBank Corp., Better Mortgage, BBVA, and Hibob as additional early testers. The era of the digital coworker Workspace agents do not land in a vacuum. They land in the middle of a broader OpenAI push — through AgentKit, through Frontier, through the Codex overhaul — to make agents more persistent, more connected, and more useful inside real organizational workflows. They also land in a deeply crowded field: Microsoft Copilot Studio is wired into the Microsoft 365 base, Google is pushing Agentspace, Salesforce has rebuilt itself as agent infrastructure with Agentforce, and Anthropic recently introduced Claude Managed Agents, all different flavors of similar ideas — agents that cut across your apps and tools, take actions on schedules repeatedly as desired, and retain some degree of memory, context, and permissions and policies. But this launch matters because it turns OpenAI's strategy into something concrete for the teams already paying for ChatGPT, and because it quietly retires the product those teams were most recently told to standardize on. If workspace agents live up to the pitch — shared, reusable, scheduled, permissioned coworkers that follow approved processes and keep work moving when their human is offline — it would mark a meaningful change in what workplace software does. Less passive software waiting for input, more active systems helping teams coordinate, execute, and move faster together. The era of the digital coworker has begun. 
And, on OpenAI's plans at least, the era of the custom GPT is ending.

Google and AWS split the AI agent stack between control and execution

4/22/2026

The era of enterprises stitching together prompt chains and shadow agents is nearing its end as more options for orchestrating complex multi-agent systems emerge. As organizations move AI agents into production, the question remains: "how will we manage them?" Google and Amazon Web Services offer fundamentally different answers, illustrating a split in the AI stack. Google’s approach is to run agentic management on the system layer, while AWS’s harness method sets up in the execution layer. The debate on how to manage and control gained new energy this past month as competing companies released or updated their agent builder platforms—Anthropic with the new Claude Managed Agents and OpenAI with enhancements to the Agents SDK—giving developer teams options for managing agents. AWS with new capabilities added to Bedrock AgentCore is optimizing for velocity—relying on harnesses to bring agents to product faster—while still offering identity and tool management. Meanwhile, Google’s Gemini Enterprise adopts a governance-focused approach using a Kubernetes-style control plane. Each method offers a glimpse into how agents move from short-burst task helpers to longer-running entities within a workflow. Upgrades and umbrellas To understand where each company stands, here’s what’s actually new. Google released a new version of Gemini Enterprise, bringing its enterprise AI agent offerings—Gemini Enterprise Platform and Gemini Enterprise Application—under one umbrella. The company has rebranded Vertex AI as Gemini Enterprise Platform, though it insists that, aside from the name change and new features, it’s still fundamentally the same interface. “We want to provide a platform and a front door for companies to have access to all the AI systems and tools that Google provides,” Maryam Gholami, senior director, product management for Gemini Enterprise, told VentureBeat in an interview. 
“The way you can think about it is that the Gemini Enterprise Application is built on top of the Gemini Enterprise Agent Platform, and the security and governance tools are all provided for free as part of Gemini Enterprise Application subscription.”

On the other hand, AWS added a new managed agent harness to Bedrock AgentCore. The company said in a press release shared with VentureBeat that the harness “replaces upfront build with a config-based starting point powered by Strands Agents, AWS’s open source agent framework.” Users define what the agent does, the model it uses and the tools it calls, and AgentCore does the work to stitch all of that together to run the agent.

Agents are now becoming systems

The shift toward stateful, long-running autonomous agents has forced a rethink of how AI systems behave. As agents move from short-lived tasks to long-running workflows, a new class of failure is emerging: state drift. As agents continue operating, they accumulate state—memory, tool responses and evolving context. Over time, that state becomes outdated. Data sources change, or tools can return conflicting responses. As a result, the agent becomes more vulnerable to inconsistencies and less truthful. Agent reliability becomes a systems problem, and managing that drift may need more than faster execution; it may require visibility and control. It’s this failure point that platforms like Gemini Enterprise and AgentCore try to prevent.

Though this shift is already happening, Gholami admitted that customers will dictate how they want to run and control any long-running agent. “We are going to learn a lot from customers where they would be using long-running agents, where they just assign a task to these autonomous agents to just go ahead and do,” Gholami said.
“Of course, there are tricks and balances to get right and the agent may come back and ask for more input.” The new AI stack What’s becoming increasingly clear is that the AI stack is separating into distinct layers, solving different problems. AWS and, to a certain extent, Anthropic and OpenAI, optimize for faster deployment. Claude Managed Agents abstracts much of the backend work for standing up an agent, while the Agents SDK now includes support for sandboxes and a ready-made harness. These approaches aim to lower the barrier to getting agents up and running. Google offers a centralized control panel to manage identity, enforce policies and monitor long-running behaviors. Enterprises likely need both. As some practitioners see it, their businesses have to have a serious conversation on how much risk they are willing to take. “The main takeaway for enterprise technology leaders considering these technologies at the moment may be formulated this way: while the agent harness vs. runtime question is often perceived as build vs. buy, this is primarily a matter of risk management. If you can afford to run your agents through a third-party runtime because they do not affect your revenue streams, that is okay. On the contrary, in the context of more critical processes, the latter option will be the only one to consider from a business perspective,” Rafael Sarim Oezdemir, head of growth at EZContacts, told VentureBeat in an email. Iterating quickly lets teams experiment and discover what agents can do, while centralized control adds a layer of trust. What enterprises need is to ensure they are not locked into systems designed purely for a single way of executing agents.

Are you paying an AI ‘swarm tax’? Why single agents often beat complex systems

4/22/2026

Enterprise teams building multi-agent AI systems may be paying a compute premium for gains that don't hold up under equal-budget conditions. New Stanford University research finds that single-agent systems match or outperform multi-agent architectures on complex reasoning tasks when both are given the same thinking token budget. However, multi-agent systems come with the added baggage of computational overhead. Because they typically use longer reasoning traces and multiple interactions, it is often unclear whether their reported gains stem from architectural advantages or simply from consuming more resources. To isolate the true driver of performance, researchers at Stanford University compared single-agent systems against multi-agent architectures on complex multi-hop reasoning tasks under equal "thinking token" budgets. Their experiments show that in most cases, single-agent systems match or outperform multi-agent systems when compute is equal. Multi-agent systems gain a competitive edge when a single agent's context becomes too long or corrupted. In practice, this means that a single-agent model with an adequate thinking budget can deliver more efficient, reliable, and cost-effective multi-hop reasoning. Engineering teams should reserve multi-agent systems for scenarios where single agents hit a performance ceiling. Understanding the single versus multi-agent divide Multi-agent frameworks, such as planner agents, role-playing systems, or debate swarms, break down a problem by having multiple models operate on partial contexts. These components communicate with each other by passing their answers around. While multi-agent solutions show strong empirical performance, comparing them to single-agent baselines is often an imprecise measurement. Comparisons are heavily confounded by differences in test-time computation. Multi-agent setups require multiple agent interactions and generate longer reasoning traces, meaning they consume significantly more tokens. 
Consequently, when a multi-agent system reports higher accuracy, it is difficult to determine if the gains stem from better architecture design or from spending extra compute. Recent studies show that when the compute budget is fixed, elaborate multi-agent strategies frequently underperform compared to strong single-agent baselines. However, they are mostly very broad comparisons that don’t account for nuances such as different multi-agent architectures or the difference between prompt and reasoning tokens. “A central point of our paper is that many comparisons between single-agent systems (SAS) and multi-agent systems (MAS) are not apples-to-apples,” paper authors Dat Tran and Douwe Kiela told VentureBeat. “MAS often get more effective test-time computation through extra calls, longer traces, or more coordination steps.” Revisiting the multi-agent challenge under strict budgets To create a fair comparison, the Stanford researchers set a strict “thinking token” budget. This metric controls the total number of tokens used exclusively for intermediate reasoning, excluding the initial prompt and the final output. The study evaluated single- and multi-agent systems on multi-hop reasoning tasks, meaning questions that require connecting multiple pieces of disparate information to reach an answer. During their experiments, the researchers noticed that single-agent setups sometimes stop their internal reasoning prematurely, leaving available compute budget unspent. To counter this, they introduced a technique called SAS-L (single-agent system with longer thinking). Rather than jumping to multi-agent orchestration when a model gives up early, the researchers suggest a simple prompt-and-budgeting change. "The engineering idea is simple," Tran and Kiela said. "First, restructure the single-agent prompt so the model is explicitly encouraged to spend its available reasoning budget on pre-answer analysis." 
By instructing the model to explicitly identify ambiguities, list candidate interpretations, and test alternatives before committing to a final answer, developers can recover the benefits of collaboration inside a single-agent setup. The results of their experiments confirm that a single agent is the strongest default architecture for multi-hop reasoning tasks. It produces the highest accuracy answers while consuming fewer reasoning tokens. When paired with specific models like Google's Gemini 2.5, the longer-thinking variant produces even better aggregate performance. The researchers rely on a concept called “Data Processing Inequality” to explain why a single agent outperforms a swarm. Multi-agent frameworks introduce inherent communication bottlenecks. Every time information is summarized and handed off between different agents, there is a risk of data loss. In contrast, a single agent reasoning within one continuous context avoids this fragmentation. It retains access to the richest available representation of the task and is thus more information-efficient under a fixed budget. The authors also note that enterprises often overlook the secondary costs of multi-agent systems. "What enterprises often underestimate is that orchestration is not free," they said. "Every additional agent introduces communication overhead, more intermediate text, more opportunities for lossy summarization, and more places for errors to compound." On the other hand, they discovered that multi-agent orchestration is superior when a single agent's environment gets messy. If an enterprise application must handle highly degraded contexts, such as noisy data, long inputs filled with distractors, or corrupted information, a single agent struggles. In these scenarios, the structured filtering, decomposition, and verification of a multi-agent system can recover relevant information more reliably. The study also warns about hidden evaluation traps that falsely inflate multi-agent performance. 
Relying purely on API-reported token counts heavily distorts how much computation an architecture is actually spending. The researchers found these accounting artifacts when testing models like Gemini 2.5, proving this is an active issue for enterprise applications today. "For API models, the situation is trickier because budget accounting can be opaque," the authors said. To evaluate architectures reliably, they advise developers to "log everything, measure the visible reasoning traces where available, use provider-reported reasoning-token counts when exposed, and treat those numbers cautiously." What it means for developers If a single-agent system matches the performance of multiple agents under equal reasoning budgets, it wins on total cost of ownership by offering fewer model calls, lower latency, and simpler debugging. Tran and Kiela warn that without this baseline, "some enterprises may be paying a large 'swarm tax' for architectures whose apparent advantage is really coming from spending more computation rather than reasoning more effectively." Another way to look at the decision boundary is not how complex the overall task is, but rather where the exact bottleneck lies. "If it is mainly reasoning depth, SAS is often enough. If it is context fragmentation or degradation, MAS becomes more defensible," Tran said. Engineering teams should stay with a single agent when a task can be handled within one coherent context window. Multi-agent systems become necessary when an application handles highly degraded contexts. Looking ahead, multi-agent frameworks will not disappear, but their role will evolve as frontier models improve their internal reasoning capabilities. "The main takeaway from our paper is that multi-agent structure should be treated as a targeted engineering choice for specific bottlenecks, not as a default assumption that more agents automatically means better intelligence," Tran said.
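To make the equal-budget comparison concrete, here is a minimal Python sketch of the two ideas the researchers describe: a single-agent prompt that asks the model to spend its reasoning budget on pre-answer analysis, and a ledger that logs thinking tokens per call so both architectures are held to the same cap. The template wording, the `ThinkingLedger` class and the token figures are illustrative assumptions, not the paper's actual prompts or measurements.

```python
from dataclasses import dataclass, field

# Hypothetical prompt template in the spirit of SAS-L; the wording is an assumption.
SAS_L_TEMPLATE = (
    "You have a reasoning budget of about {budget} tokens. Before answering:\n"
    "1. Identify ambiguities in the question.\n"
    "2. List candidate interpretations.\n"
    "3. Test alternatives against the evidence.\n"
    "Only then commit to a final answer.\n\n"
    "Question: {question}"
)


def build_sas_l_prompt(question: str, budget: int) -> str:
    """Return a single-agent prompt that encourages longer pre-answer analysis."""
    return SAS_L_TEMPLATE.format(budget=budget, question=question)


@dataclass
class ThinkingLedger:
    """Logs thinking tokens per call so architectures share one budget."""
    budget: int
    calls: list = field(default_factory=list)

    def record(self, agent: str, thinking_tokens: int) -> None:
        self.calls.append((agent, thinking_tokens))

    @property
    def spent(self) -> int:
        return sum(tokens for _, tokens in self.calls)

    def within_budget(self) -> bool:
        return self.spent <= self.budget


# Made-up token counts, purely for illustration.
single = ThinkingLedger(budget=4000)
single.record("solo", 3500)

multi = ThinkingLedger(budget=4000)
for agent, tokens in [("planner", 1500), ("worker", 2000), ("verifier", 1200)]:
    multi.record(agent, tokens)

print(single.spent, single.within_budget())
print(multi.spent, multi.within_budget())
```

In this toy run the three-agent setup overshoots the shared cap, which is the kind of hidden extra compute the authors say should be ruled out before crediting the architecture. With API models, the recorded counts would come from provider-reported reasoning-token fields, which the authors advise treating cautiously.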

OpenAI launches Privacy Filter, an open source, on-device data sanitization model that removes personal information from enterprise datasets

4/22/2026

In a significant shift toward local-first privacy infrastructure, OpenAI has released Privacy Filter, a specialized open-source model designed to detect and redact personally identifiable information (PII) before it ever reaches a cloud-based server. Launched today on AI code sharing community Hugging Face under a permissive Apache 2.0 license, the tool addresses a growing industry bottleneck: the risk of sensitive data "leaking" into training sets or being exposed during high-throughput inference. By providing a 1.5-billion-parameter model that can run on a standard laptop or directly in a web browser, the company is effectively handing developers a "privacy-by-design" toolkit that functions as a sophisticated, context-aware digital shredder. Though OpenAI was founded with a focus on open source models such as this, the company shifted during the ChatGPT era to providing more proprietary ("closed source") models available only through its website, apps, and API — only to return to open source in a big way last year with the launch of the gpt-oss family of language models. In that light, and combined with OpenAI's recent open sourcing of agentic orchestration tools and frameworks, it's safe to say that the generative AI giant is clearly still heavily invested in fostering this less immediately lucrative part of the AI ecosystem. Technology: a gpt-oss variant with bidirectional token classifier that reads from both directions Architecturally, Privacy Filter is a derivative of OpenAI’s gpt-oss family, a series of open-weight reasoning models released earlier this year. However, while standard large language models (LLMs) are typically autoregressive—predicting the next token in a sequence—Privacy Filter is a bidirectional token classifier. This distinction is critical for accuracy. By looking at a sentence from both directions simultaneously, the model gains a deeper understanding of context that a forward-only model might miss. 
For instance, it can better distinguish whether "Alice" refers to a private individual or a public literary character based on the words that follow the name, not just those that precede it. The model utilizes a Sparse Mixture-of-Experts (MoE) framework. Although it contains 1.5 billion total parameters, only 50 million parameters are active during any single forward pass. This sparse activation allows for high throughput without the massive computational overhead typically associated with LLMs. Furthermore, it features a massive 128,000-token context window, enabling it to process entire legal documents or long email threads in a single pass without the need for fragmenting text—a process that often causes traditional PII filters to lose track of entities across page breaks. To ensure the redacted output remains coherent, OpenAI implemented a constrained Viterbi decoder. Rather than making an independent decision for every single word, the decoder evaluates the entire sequence to enforce logical transitions. It uses a "BIOES" (Begin, Inside, Outside, End, Single) labeling scheme, which ensures that if the model identifies "John" as the start of a name, it is statistically inclined to label "Smith" as the continuation or end of that same name, rather than a separate entity. On-device data sanitization Privacy Filter is designed for high-throughput workflows where data residency is a non-negotiable requirement. It currently supports the detection of eight primary PII categories: Private Names: Individual persons. Contact Info: Physical addresses, email addresses, and phone numbers. Digital Identifiers: URLs, account numbers, and dates. Secrets: A specialized category for credentials, API keys, and passwords. In practice, this allows enterprises to deploy the model on-premises or within their own private clouds. 
By masking data locally before sending it to a more powerful reasoning model (like GPT-5 or gpt-oss-120b), companies can maintain compliance with strict GDPR or HIPAA standards while still leveraging the latest AI capabilities. Initial benchmarks are promising: the model reportedly hits a 96% F1 score on the PII-Masking-300k benchmark out of the box. For developers, the model is available via Hugging Face, with native support for transformers.js, allowing it to run entirely within a user's browser using WebGPU. Fully open source, commercially viable Apache 2.0 license Perhaps the most significant aspect of the announcement for the developer community is the Apache 2.0 license. Unlike "available-weight" licenses that often restrict commercial use or require "copyleft" sharing of derivative works, Apache 2.0 is one of the most permissive licenses in the software world. For startups and dev-tool makers, this means: Commercial Freedom: Companies can integrate Privacy Filter into their proprietary products and sell them without paying royalties to OpenAI. Customization: Teams can fine-tune the model on their specific datasets (such as medical jargon or proprietary log formats) to improve accuracy for niche industries. No Viral Obligations: Unlike the GPL license, builders do not have to open-source their entire codebase if they use Privacy Filter as a component. By choosing this licensing path, OpenAI is positioning Privacy Filter as a standard utility for the AI era, essentially the "SSL for text". Community reactions The tech community reacted quickly to the release, with many noting the impressive technical constraints OpenAI managed to hit. Elie Bakouch (@eliebakouch), a research engineer at agentic model training platform startup Prime Intellect, praised the efficiency of Privacy Filter's architecture on X: "Very nice release by @OpenAI! A 50M active, 1.5B total gpt-oss arch MoE, to filter private information from trillion scale data cheaply. 
keeping 128k context with such a small model is quite impressive too". The sentiment reflects a broader industry trend toward "small but mighty" models. While the world has focused on massive, 100-trillion parameter giants, the practical reality of enterprise AI often requires small, fast models that can perform one task—like privacy filtering—exceptionally well and at a low cost. However, OpenAI included a "High-Risk Deployment Caution" in its documentation. The company warned that the tool should be viewed as a "redaction aid" rather than a "safety guarantee," noting that over-reliance on a single model could lead to "missed spans" in highly sensitive medical or legal workflows. OpenAI’s Privacy Filter is clearly an effort by the company to make the AI pipeline fundamentally safer. By combining the efficiency of a Mixture-of-Experts architecture with the openness of an Apache 2.0 license, OpenAI is providing a way for many enterprises to more easily, cheaply and safely redact PII data.
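The constrained BIOES decoding described earlier in this article can be illustrated with a short Python sketch. It is a generic Viterbi decoder over a single entity type, written under stated assumptions: the transition table, the per-token log-scores and the example tokens are all hypothetical, and nothing here is taken from OpenAI's implementation.

```python
import math

TAGS = ["O", "B", "I", "E", "S"]

# Which tag may follow which under BIOES (single entity type, an assumption).
ALLOWED = {
    "O": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"O", "B", "S"},
    "S": {"O", "B", "S"},
}
START = {"O", "B", "S"}  # a sequence cannot open inside an entity
END = {"O", "E", "S"}    # and cannot stop with an entity still open


def constrained_viterbi(emissions):
    """emissions: one dict per token mapping tag -> log-score."""
    if not emissions:
        return []
    neg = -math.inf
    score = {t: (emissions[0][t] if t in START else neg) for t in TAGS}
    back = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in TAGS:
            best_prev, best = None, neg
            for prev in TAGS:
                if cur in ALLOWED[prev] and score[prev] > best:
                    best_prev, best = prev, score[prev]
            new_score[cur] = best + em[cur] if best_prev is not None else neg
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    last = max(END, key=lambda t: score[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for the tokens "Email", "John", "Smith", "today".
emissions = [
    {"O": -0.1, "B": -3.0, "I": -5.0, "E": -5.0, "S": -3.0},
    {"O": -2.5, "B": -0.3, "I": -4.0, "E": -4.0, "S": -1.8},
    {"O": -3.0, "B": -3.0, "I": -3.0, "E": -1.0, "S": -0.8},
    {"O": -0.1, "B": -4.0, "I": -5.0, "E": -5.0, "S": -4.0},
]

greedy = [max(TAGS, key=lambda t: em[t]) for em in emissions]
decoded = constrained_viterbi(emissions)
print(greedy)   # per-token argmax, ignores transitions
print(decoded)  # best sequence that obeys the BIOES transitions
```

Here the per-token argmax labels "John" as the beginning of a name and "Smith" as a separate single-token entity, a sequence BIOES forbids. The constrained decoder instead returns begin and end tags for the pair, which is the behavior the article attributes to sequence-level decoding.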

Google doesn't pay the Nvidia tax. Its new TPUs explain why.

4/22/2026

Every frontier AI lab right now is rationing two things: electricity and compute. Most of them buy their compute for model training from the same supplier, at the steep gross margins that have turned Nvidia into one of the most valuable companies in the world. Google does not. On Tuesday night, inside a private gathering at F1 Plaza in Las Vegas, Google previewed its eighth-generation Tensor Processing Units. The pitch: two custom silicon designs shipping later this year, each purpose-built for a different half of the modern AI workload. TPU 8t targets training for frontier models, and TPU 8i targets the low-latency, memory-hungry world of agentic inference and real-time sampling. Amin Vahdat, Google's SVP and chief technologist for AI and infrastructure (pictured above left), used his time onstage to make a point that matters more to enterprise buyers than any individual spec: Google designs every layer of its AI stack end-to-end, and that vertical integration is starting to show up in cost-per-token economics that Google says its rivals cannot match. "One chip a year wasn't enough": Inside Google's 2024 bet on a two-chip roadmap The more interesting story behind v8t and v8i is when the decision to split the roadmap was made. The call came in 2024, according to Vahdat — a year before the industry at large pivoted to reasoning models, agents and reinforcement learning as the dominant frontier workload. At the time, it was a contrarian read. "We realized two years ago that one chip a year wouldn't be enough," Vahdat said during the fireside. "This is our first shot at actually going with two super high-powered specialized chips." For enterprise buyers, the implication is concrete. Customers running fine-tuning or large-scale training on Google Cloud and customers serving production agents on Vertex AI have been renting the same accelerators and eating the inefficiency. 
V8 is the first generation where the silicon itself treats those as different problems with two sets of chips. TPU 8t: A training fabric that scales to a million chips On paper, TPU 8t is an aggressive generational step. According to Google, 8t delivers 2.8x the FP4 EFlops per pod (121 vs 42.5) against Ironwood, the seventh-generation TPU that shipped in 2025, doubles bidirectional scale-up bandwidth to 19.2 Tb/s per chip, and quadruples scale-out networking to 400 Gb/s per chip. Pod size grows modestly from 9,216 to 9,600 chips, held together by Google's 3D Torus topology. The number that matters most to IT leaders evaluating where to run frontier-scale training: 8t clusters (Superpods) can scale beyond 1 million TPU chips in a single training job via a new interconnect Google is calling Virgo networking. 8t also introduces TPU Direct Storage, which moves data from Google's managed storage tier directly into HBM without the usual CPU-mediated hops. For long training runs where wall-clock time is the cost driver, collapsing that data path reduces the number of pod-hours needed to finish each epoch. TPU 8i and Boardfly: Re-engineering the network for agents If 8t is an evolutionary step, TPU 8i is the more architecturally interesting chip. It is also where the story for IT buyers gets most compelling. The year-over-year spec jumps are, as Vahdat put it, “stunning.” According to Google, 8i delivers 9.8x the FP8 EFlops per pod (11.6 vs 1.2), 6.8x the HBM capacity per pod (331.8 TB vs 49.2), and a pod size that grows 4.5x from 256 to 1,152 chips. What drove those numbers is a rethink of the network itself. Vahdat explained the insight directly: Google's default way of connecting chips together supported bandwidth over latency — good for moving large amounts of data through, not built for the minimum time it takes a response to get back. That profile works for training. For agents, it does not. 
In partnership with Google DeepMind, the TPU team built what Google calls Boardfly topology specifically to reduce the network diameter — shrinking the number of hops between any two chips in a pod. Paired with a Collective Acceleration Engine and what Google describes as very large on-chip SRAM, 8i delivers a claimed 5x improvement in latency for real-time LLM sampling and reinforcement learning. The vertical-integration moat: Why Google doesn't pay the "Nvidia tax" The subtext across Vahdat's presentation was a six-layer diagram Google calls its AI stack: energy at the foundation, then data center land and enclosures, AI infrastructure hardware, AI infrastructure software, models (Gemini 3), and services on top. Vahdat noted that designing each layer in isolation forces you to the least common denominator for each layer. Google designs them together. This is where the competitive story for IT buyers and analysts crystallizes. OpenAI, Anthropic, xAI and Meta all depend heavily on Nvidia silicon to train their frontier models. Every H200 and Blackwell GPU they buy carries Nvidia’s data-center gross margin — the informal "Nvidia tax" that industry analysts have flagged for two years running as a structural cost disadvantage for anyone renting rather than designing. Google pays fab, packaging and engineering costs on its TPUs. It does not pay that margin. What v8 means for the compute race: A new evaluation checklist for IT leaders For procurement and infrastructure teams, TPUv8 reframes the 2026–2027 cloud evaluation in concrete ways. Teams training large proprietary models should look at 8t availability windows, Virgo networking access, and goodput SLAs — not just headline EFlops. Teams serving agents or reasoning workloads should evaluate 8i availability on Vertex AI, independent latency benchmarks as they emerge, and whether HBM-per-pod sizing fits their context windows. 
Teams consuming Gemini through Gemini Enterprise should inherit the 8i lift and should expect the ceiling on what they can deploy in production to rise meaningfully through 2026. The caveats are real. General availability is still "later in 2026." The v8 is a roadmap signal, not a procurement decision today. Google's benchmarks are self-reported; independent numbers will undoubtedly come from early cloud customers and third-party evaluators over the next two quarters. And portability between JAX/XLA and the CUDA/PyTorch ecosystem remains a friction cost worth thinking about when negotiating any multi-year commitment. Looking further out, Vahdat made two predictions worth noting. First, general-purpose CPUs will see a resurgence inside AI systems, not as accelerators but as orchestration compute for agent sandboxes, virtual machines and tool execution. Second, framed explicitly as an industry prediction rather than a Google roadmap preview, specialization also keeps going strong. As general-purpose CPU gains plateau at a few percent a year, workloads that matter will demand purpose-built silicon. "Two chips might become more," Vahdat said, without specifying whether the "more" would mean future TPU variants or other classes of specialized accelerators. The frontier compute race used to be a question of who could buy the most H100s. It is now a question of who controls the stack. The shortlist of companies that genuinely do is, for the moment, two: Google and Nvidia.
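For readers who want to sanity-check the generational multipliers quoted in this article, the short Python sketch below recomputes each ratio from the per-pod figures Google supplied. The 5% tolerance is an arbitrary assumption to allow for rounding in the published numbers, and the figures themselves remain self-reported.

```python
# (new value, previous value, multiplier quoted in the article or None)
SPECS = {
    "8t FP4 EFlops per pod": (121.0, 42.5, 2.8),
    "8t pod size (chips)": (9600, 9216, None),
    "8i FP8 EFlops per pod": (11.6, 1.2, 9.8),
    "8i HBM per pod (TB)": (331.8, 49.2, 6.8),
    "8i pod size (chips)": (1152, 256, 4.5),
}

TOLERANCE = 0.05  # assumed allowance for rounding in the published figures

ratios = {}
consistent = {}
for name, (new, old, quoted) in SPECS.items():
    ratio = new / old
    ratios[name] = ratio
    if quoted is not None:
        # Relative gap between the recomputed and the quoted multiplier.
        consistent[name] = abs(ratio - quoted) / quoted <= TOLERANCE
    label = f"{quoted}x quoted" if quoted is not None else "no multiplier quoted"
    print(f"{name}: {ratio:.2f}x computed, {label}")
```

Each quoted multiplier lands within a couple of percent of the recomputed ratio, so the small gaps are consistent with rounding in the inputs rather than an arithmetic error.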

Salesforce’s Agentforce Vibes 2.0 targets a hidden failure: context overload in AI agents

4/22/2026

When startup fundraising platform VentureCrowd began deploying AI coding agents, it saw the same gains as other enterprises: it cut the front-end development cycle by 90% in some projects. However, it didn’t come easy or without a lot of trial and error. VentureCrowd’s first challenge revolved around data and context quality. Diego Mogollon, chief product officer at VentureCrowd, told VentureBeat that “agents reason against whatever data they can access at runtime” and would then be confidently “wrong” because they base their knowledge only on the context given to them. Their other roadblock, like many others, was messy data and unclear processes. Similar to context, Mogollon said coding agents would amplify bad data, so the company had to build a well-structured codebase first. “The challenges are rarely about the coding agents themselves; they are about everything around them,” said Mogollon. “It’s a context problem disguised as an AI problem, and it is the number one failure mode I see across agentic implementations.” Mogollon said VentureCrowd encountered several roadblocks in overhauling its software development. VentureCrowd's experience illustrates a broader issue in AI agent development. The models are not failing the agents; rather, the agents become overwhelmed by too much context and too many tools at once. Too much context This comes from a phenomenon called context bloat, in which AI systems accumulate more and more data, tools or instructions as workflows become more complex. The problem arises because agents need context to work better, but too much of it creates noise. And the more context an agent has to sift through, the more tokens it uses, the more the work slows down and the more costs increase. One way to curb context bloat is through context engineering. Context engineering helps agents understand code changes or pull requests and align them with their tasks. 
However, context engineering often becomes an external task rather than built into the coding platforms enterprises use to build their agents. How coding agent providers respond VentureCrowd relied on one solution in particular to help it overcome the issues with context bloat plaguing its enterprise AI agent deployment: Salesforce’s Agentforce Vibes, a coding platform that lives within Salesforce and is available for all plans starting with the free one. Salesforce recently updated Agentforce Vibes to version 2.0, expanding support for third-party frameworks like ReAct. Most important for companies like VentureCrowd, Agentforce Vibes added Abilities and Skills, which they can use to direct agent behavior. “For context, our entire platform, frontend and backend, runs on the Salesforce ecosystem. So when Agentforce Vibes launched, it slotted naturally into an environment we already knew well,” Mogollon said. Salesforce’s approach doesn’t minimize the context agents use; rather, it helps enterprises ensure that context stays within their data models or codebases. Agentforce Vibes adds additional execution through the new Skills and Abilities feature. Abilities define what agents want to accomplish, and Skills are the tools they will use to get there. Other coding agent platforms manage context differently. For example, Claude Code and OpenAI’s Codex focus on autonomous execution, continuously reading files, running commands and expanding context as tasks evolve. Claude Code has a context indicator and compacts context when it becomes too large. Across these different approaches, the consistent pattern is that most systems manage growing context for agents rather than limit it. Context keeps growing, especially as workflows become more complex, making it more difficult for enterprises to control costs, latency and reliability. 
Mogollon said his company chose Agentforce Vibes not only because a large portion of their data already lives on Salesforce, making it easier to integrate, but also because it would allow them to control more of the context they feed their agents. What builders should know There’s no single way to address context bloat, but the pattern is now clear: more context doesn't always mean better results. Along with investing in context engineering, enterprises have to experiment with the context constraint approach they are most comfortable with. For enterprises, that means the challenge isn’t just giving agents more information—it’s deciding what to leave out.
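As a rough illustration of "deciding what to leave out," the Python sketch below packs candidate context items into a fixed token budget by relevance. It is a generic greedy selection under stated assumptions: the `ContextItem` type, the relevance scores and the four-characters-per-token estimate are hypothetical, and this is not how Agentforce Vibes, Claude Code or Codex manage context.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    text: str
    relevance: float  # hypothetical score; higher means more useful for the task


def estimate_tokens(text: str) -> int:
    """Crude estimate assuming roughly four characters per token."""
    return max(1, len(text) // 4)


def select_context(items, budget_tokens):
    """Keep the most relevant items that fit; report what was left out."""
    chosen, left_out, used = [], [], 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost <= budget_tokens:
            chosen.append(item)
            used += cost
        else:
            left_out.append(item)
    return chosen, left_out, used


# Placeholder texts sized to stand in for real artifacts.
items = [
    ContextItem("data_model_schema", "x" * 400, relevance=0.9),
    ContextItem("pull_request_diff", "x" * 800, relevance=0.8),
    ContextItem("old_chat_log", "x" * 4000, relevance=0.2),
    ContextItem("style_guide", "x" * 200, relevance=0.5),
]

chosen, left_out, used = select_context(items, budget_tokens=400)
print([i.name for i in chosen], used)
print([i.name for i in left_out])
```

The low-relevance chat log is the only item dropped, and the agent's context stays at 350 estimated tokens instead of 1,350. In practice the relevance scores are the hard part, which is where the context engineering work the article describes comes in.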

The modern data stack was built for humans asking questions. Google just rebuilt its for agents taking action.

4/22/2026

Enterprise data stacks were built for humans running scheduled queries. As AI agents increasingly act autonomously on behalf of businesses around the clock, that architecture is breaking down, and vendors are racing to rebuild it. Google's answer, announced at Cloud Next on Wednesday, is the Agentic Data Cloud. The architecture has three pillars: Knowledge Catalog, which automates semantic metadata curation, inferring business logic from query logs without manual data steward intervention; Cross-cloud lakehouse, which lets BigQuery query Iceberg tables on AWS S3 via private network with no egress fees; and Data Agent Kit, which drops MCP tools into VS Code, Claude Code and Gemini CLI so data engineers describe outcomes rather than write pipelines. "The data architecture has to change now," Andi Gutmans, VP and GM of Data Cloud at Google Cloud, told VentureBeat. "We're moving from human scale to agent scale." From system of intelligence to system of action The core premise behind Agentic Data Cloud is that enterprises are moving from human-scale to agent-scale operations. Historically, data platforms have been optimized for reporting, dashboarding, and some forecasting, what Google characterizes as “reactive intelligence.” In that model, humans interpret data and decide what to do. Now, with AI agents increasingly expected to take actions directly on behalf of the business, Gutmans argued that data platforms must evolve into systems of action. "We need to make sure that all of enterprise data can be activated with AI, that includes both structured and unstructured data," Gutmans said. "We need to make sure that there's the right level of trust, which also means it's not just about getting access to the data, but really understanding the data." The Knowledge Catalog is Google's answer to that problem. It is an evolution of Dataplex, Google's existing data governance product, with a materially different architecture underneath. 
Where traditional data catalogs required data stewards to manually label tables, define business terms and build glossaries, the Knowledge Catalog automates that process using agents. The practical implication for data engineering teams is that the Knowledge Catalog scales to the full data estate, not just the curated subset that a small team of data stewards can maintain by hand. The catalog covers BigQuery, Spanner, AlloyDB and Cloud SQL natively, and federates with third-party catalogs including Collibra, Atlan and Datahub. Zero-copy federation extends semantic context from SaaS applications including SAP, Salesforce Data360, ServiceNow and Workday without requiring data movement. Google's lakehouse goes cross cloud Google has had a data lakehouse called BigLake since 2022. Initially it was limited to just Google data, but in recent years it has had some limited federation capabilities enabling enterprises to query data found in other locations. Gutmans explained that the previous federation worked through query APIs, which limited the features and optimizations BigQuery could bring to bear on external data. The new approach is storage-based sharing via the open Apache Iceberg format. That means whether the data is in Amazon S3 or in Google Cloud, he argued, it doesn't make a difference. "This truly means we can bring all the goodness and all the AI capabilities to those third-party data sets," he said. The practical result is that BigQuery can query Iceberg tables sitting on Amazon S3 via Google's Cross-Cloud Interconnect, a dedicated private networking layer, with no egress fees and price-performance Google says is comparable to native AWS warehouses. All BigQuery AI functions run against that cross-cloud data without modification. Bidirectional federation in preview extends to Databricks Unity Catalog on S3, Snowflake Polaris and the AWS Glue Data Catalog using the open Iceberg REST Catalog standard. 
From writing pipelines to describing outcomes The Knowledge Catalog and cross-cloud lakehouse solve the data access and context problems. The third pillar addresses what happens when a data engineer actually sits down to build something with all of it. The Data Agent Kit ships as a portable set of skills, MCP tools and IDE extensions that drop into VS Code, Claude Code, Gemini CLI and Codex. It does not introduce a new interface. The architectural shift it enables is a move from what Gutmans called a "prescriptive copilot experience" to intent-driven engineering. Rather than writing a Spark pipeline to move data from source A to destination B, a data engineer describes the outcome — a cleaned dataset ready for model training, a transformation that enforces a governance rule — and the agent selects whether to use BigQuery, the Lightning Engine for Apache Spark or Spanner to execute it, then generates production-ready code. "Customers are kind of sick of building their own pipelines," Gutmans said. "They're truly more in the review kind of mode, than they are in the writing the code mode." Where Google and its rivals diverge The premise that agents require semantic context, not just data access, is shared across the market. Databricks has Unity Catalog, which provides governance and a semantic layer across its lakehouse. Snowflake has Cortex, its AI and semantic layer offering. Microsoft Fabric includes a semantic model layer built for business intelligence and, increasingly, agent grounding. The dispute is not over whether semantics matter — everyone agrees they do. The dispute is over who builds and maintains them. "Our goal is just to get all the semantics you can get," he explained, noting that Google will federate with third-party semantic models rather than require customers to start over. 
Google is also positioning openness as a differentiator, with bidirectional federation into Databricks Unity Catalog and Snowflake Polaris via the open Iceberg REST Catalog standard.

What this means for enterprises

Google's argument, echoed across the data infrastructure market, is that enterprises are behind on three fronts:

Semantic context is becoming infrastructure. If your data catalog is still manually curated, it will not scale to agent workloads, and Gutmans argues that gap will only widen as agent query volumes increase.

Cross-cloud egress costs are a hidden tax on agentic AI. Storage-based federation via open Iceberg standards is emerging as the architectural answer across Google, Databricks and Snowflake. Enterprises locked into proprietary federation approaches should be stress-testing those costs at agent-scale query volumes.

Gutmans argues the pipeline-writing era is ending. Data engineers who move toward outcome-based orchestration now will have a significant head start.

Google’s Gemini can now run on a single air-gapped server — and vanish when you pull the plug

4/22/2026

Cirrascale Cloud Services today announced it has expanded its partnership with Google Cloud to deliver the Gemini model on-premises through Google Distributed Cloud, making it the first neocloud provider to offer Google's most advanced AI model as a fully private, disconnected appliance. The announcement, timed to coincide with Google Cloud Next 2026 in Las Vegas, addresses a stubborn problem that has plagued regulated industries since the generative AI boom began: how to access frontier-class AI models without surrendering control of your data. The offering packages Gemini into a Dell-manufactured, Google-certified hardware appliance equipped with eight Nvidia GPUs and wrapped in confidential computing protections. Enterprises and government agencies can deploy the system inside Cirrascale's data centers or their own facilities, fully disconnected from the internet and from Google's cloud infrastructure. The product enters preview immediately, with general availability expected in June or July. In an exclusive interview with VentureBeat ahead of the announcement, Dave Driggers, CEO of Cirrascale Cloud Services, described the deployment as "the next step of the partnership" and "being able to offer their most important model they have, which is Gemini." He was emphatic about what customers would be getting: "It is full blown Gemini. It's not pulled," he told VentureBeat. "Nothing's missing from it, and it'll be available in a private scenario, so that we can guarantee them that their data is secure, their inputs are secure, their outputs are secure." The move signals a deepening shift in the enterprise AI market, where the most capable models are migrating out of hyperscaler data centers and into customers' own racks, a reversal of the cloud computing orthodoxy that defined the past decade.
The impossible tradeoff that kept banks and governments on the AI sidelines For years, organizations in financial services, healthcare, defense and government faced a binary choice: access the most powerful AI models through public cloud APIs, exposing sensitive data to third-party infrastructure, or settle for less capable open-source models they could host themselves. Cirrascale's new offering attempts to eliminate that tradeoff entirely. Driggers described how the trust problem escalated in stages. First, companies worried about handing their proprietary data to hyperscalers. Then came a deeper realization. "They started realizing, holy crap, when my users type stuff in, they're giving private information away — and the output is private too," Driggers told VentureBeat. "And then the hyperscalers said, 'Your prompts and the responses? That's our stuff. We need that in order to answer your question.'" That was the moment, he argued, when the demand for fully private AI became impossible to ignore. Unlike Google Distributed Cloud, which Google already offers as its own on-premises cloud extension, the Cirrascale deployment places the actual model — weights and all — outside of Google's infrastructure entirely. "Google doesn't own this hardware. We own the hardware, or the customer owns the hardware," Driggers said. "It is completely outside of Google." Driggers drew a sharp distinction between this offering and what competitors provide. When asked about Microsoft Azure's on-premises deployments with OpenAI models and AWS Outposts, he was blunt: "Those are a lot different. This is the actual model being deployed on prem outside of their cloud. It's not a cut down version. It's the actual model." Pull the plug and the model vanishes: how confidential computing guards Google's crown jewel The technical underpinnings of the deployment reveal how seriously both Google and Cirrascale are treating the security question. 
The Gemini model resides entirely in volatile memory — not on persistent storage. "As soon as the power is off, the model is gone," Driggers explained. User sessions operate through caches that clear automatically when a session ends. "A company's user inputs, once that session's over, they're gone. They can be saved, but by default, they're gone," he said. Perhaps the most striking security feature is what happens when someone attempts to tamper with the appliance. Driggers described a mechanism that effectively renders the machine inoperable: "You do anything that is against confidential compute, and it's gone. Not only does the machine turn off, and therefore the model is gone, it actually puts in a marker that says, 'You violated the confidential compute.' That machine has to come back to us, or back to Dell or back to Google." He characterized the appliance as something that "does time bomb itself if something goes wrong." This level of protection reflects Google's own anxiety about releasing its flagship model's weights into environments it doesn't control. The appliance is effectively a vault: the model runs inside it, but nobody — not even the customer — can extract or inspect the weights. The confidential computing envelope ensures that even physical possession of the hardware doesn't grant access to the model's intellectual property. When Google releases a new version of Gemini, the appliance needs to reconnect — but only briefly, and through a private channel. "It does have to get connected back to Google to load the new model. But that can go via a private connection," Driggers said. For the most security-sensitive customers who can never allow their machine to connect to an outside network, Cirrascale offers a physical swap: "The server will be unplugged, purged, all the data gone, guaranteed it's gone, a new server will show up with a new version of the model." 
From Wall Street to drug labs, the rush for air-gapped AI is accelerating Driggers identified three primary drivers of demand: trust, security and guaranteed performance. Financial services institutions top the list. "They've got regulatory issues where they can't have something out of their control. They've got to be the one who determines where everything is. It's got to be air gap," Driggers said. The minimum deployment footprint — a single eight-GPU server — makes the product accessible in a way that Google's own private offerings do not. Running Gemini on Google's TPU-based infrastructure, Driggers noted, requires a much larger commitment. "If you want a private [instance] from Google, they require a much bigger bite, because to build something private for you, Google requires a gigantic footprint. Here we can do it down to a single machine." Beyond finance, Driggers pointed to drug discovery, medical data, public-sector research, and any business handling personal information. He also flagged an increasingly critical use case: data sovereignty. "How about your business that's doing business outside of the United States, and now you've got data sovereignty laws in places where GCP is not? We can provide private Gemini in these smaller countries where the data can't leave." The public sector is another major target. Cirrascale launched a dedicated Government Services division in March as part of its earlier partnership with Google Public Sector around the GPAR (Google Public Sector Program for Accelerated Research) initiative. That program provides higher education and research institutions access to AI tools including AlphaFold, AI Co-Scientist, and Gemini Enterprise for Education. Today's announcement extends that relationship from the research tooling layer to the model itself. The performance guarantee is the third pillar. 
Driggers noted that frontier models accessed through public APIs deliver inconsistent response times — a problem for mission-critical business applications. The private deployment eliminates that variability. Cirrascale layers management software on top of the Gemini appliance that allows administrators to prioritize users, allocate tokens by role, adjust context window sizes, and load-balance across multiple appliances and regions. "Your primary data scientists or your programmers may need to have really large context windows and get priority, especially maybe nine to five," Driggers explained, "but yet, the rest of the time, they want to share the Gemini experience over a wider group of people." He also noted that agentic AI workloads, which can run around the clock, benefit from the ability to consume unused capacity during off-peak hours — a scheduling flexibility that public cloud deployments don't easily support. Seat licenses, token billing and all-you-can-eat pricing: a model built for enterprise flexibility The pricing model reflects Cirrascale's broader philosophy of meeting customers where they are. Driggers described several consumption options: seat-based licensing (with both enterprise and standard tiers), per-token billing, and flat "all-you-can-eat" pricing per appliance. The minimum commitment is a single dedicated server — the appliances are not shared between customers in any configuration. "We'll meet the customer, what they're used to," Driggers said. "If they're currently taking a seat license, we'll create a seat license for them." Customers can also choose to purchase the hardware outright while still consuming Gemini as a managed service, an arrangement Cirrascale has offered since its earliest days in the AI wave. 
Driggers said OpenAI has been a customer since 2016 or 2017, and in that engagement, OpenAI purchased its own GPUs while Cirrascale "took those GPUs, incorporated them into our servers and storage and networking, and then presented it back as a cloud service to them so they didn't have to manage anything." That flexible ownership model is particularly relevant for universities and government-funded research institutions, where mandates often require a specific mix of capital expenditure, operating expenditure, and personnel investment. "A lot of government funding requires a mixture of CapEx, OPEX and employment development," Driggers said. "So we allow that as well." Inside the neocloud that built the world's first eight-GPU server — and just landed Google's biggest AI model Cirrascale's announcement arrives during a period of explosive growth for the neocloud sector — the tier of specialized AI cloud providers that sit between the hyperscalers and traditional hosting companies. The neocloud market is projected to be worth $35.22 billion in 2026 and is growing at a compound annual growth rate of 46.37%, according to Mordor Intelligence. Leading neocloud providers include CoreWeave, Crusoe Cloud, Lambda, Nebius and Vultr, and these companies specialize in GPU-as-a-Service for AI and high-performance computing workloads. But Cirrascale occupies a different niche within this booming category. While companies like CoreWeave have focused primarily on providing raw GPU compute at scale — CoreWeave boasts a $55.6 billion backlog — Cirrascale has positioned itself around private AI, managed services and longer-term engagements rather than on-demand elastic compute. Driggers described the company as "not an on-demand place" but rather a provider focused on "longer-term workloads where we're really competing against somebody doing it back on prem." The company's history supports that claim. 
Cirrascale traces its roots to a hardware company that "designed the world's first eight GPU server in 2012 before anybody thought you'd ever need eight GPUs in a box," as Driggers put it. It pivoted to pure cloud services roughly eight years ago and has since built a client roster that includes the Allen Institute for AI, which in August 2025 tapped Cirrascale as the managed services provider for a $152 million open AI initiative funded by the National Science Foundation and Nvidia. Earlier this month, Cirrascale announced a three-way alliance with Rafay Systems and Cisco to deliver end-to-end enterprise AI solutions combining Cirrascale's inference platform, Rafay's GPU orchestration, and Cisco's networking and compute hardware. The private AI era is arriving faster than anyone expected The Gemini partnership is the highest-profile move yet — and it taps into a broader industry current. The push to move frontier AI out of the public cloud and into private infrastructure is no longer a niche demand. Industry analysts predict that by 2027, 40% of AI model training and inference will occur outside public cloud environments. That projection helps explain why Google is willing to let its crown-jewel model run on hardware it doesn't own, in data centers it doesn't operate, managed by a company in San Diego. The alternative — watching regulated enterprises default to open-source models or to Microsoft's Azure OpenAI Service — is apparently a worse outcome. The announcement also carries major implications for Google's competitive positioning. Microsoft has built its enterprise AI strategy around the Azure OpenAI Service and its deep partnership with OpenAI, while AWS has invested in Amazon Bedrock and its own on-premises solutions through Outposts. Google Cloud Platform still trails both rivals in market share, though Q4 cloud revenue rose 48% year-over-year. 
Enabling Gemini to run on third-party infrastructure via partners like Cirrascale broadens its distribution surface in exactly the segments — government, finance, healthcare — where Microsoft and Amazon have historically held advantages. For Cirrascale, the partnership represents a chance to differentiate sharply in a market where most neoclouds are competing on GPU availability and price. Driggers expects rapid uptake in the second half of 2026. "It's going to be crazy towards the end of this year," he said. "Major banks will finally do stuff like this, because they can secure it. They can do it globally. Big research institutions who have labs all over the world will do these types of things." He predicted other frontier model providers will follow with similar offerings soon, and he doesn't see Gemini as the end of the story. "We really think that the enterprise have been waiting for private AI, not just Gemini, but all sorts of private AI," Driggers said. That may be the most telling line of all. For three years, the AI revolution has been defined by a simple bargain: send your data to the cloud and get intelligence back. Cirrascale's bet — and increasingly, Google's — is that the biggest customers in the world are done accepting those terms. The most powerful AI on the planet is now available on a single locked box that can sit in a bank vault, a university basement, or a government facility in a country where Google has no data center. The cloud, it turns out, is finally ready to come back down to earth.

Google’s new Deep Research and Deep Research Max agents can search the web and your private data

4/21/2026

Google on Monday unveiled the most significant upgrade to its autonomous research agent capabilities since the product's debut, launching two new agents — Deep Research and Deep Research Max — that for the first time allow developers to fuse open web data with proprietary enterprise information through a single API call, produce native charts and infographics inside research reports, and connect to arbitrary third-party data sources through the Model Context Protocol (MCP). The release, built on Google's Gemini 3.1 Pro model, marks an inflection point in the rapidly intensifying race to build AI systems that can autonomously conduct the kind of exhaustive, multi-source research that has traditionally consumed hours or days of human analyst time. It also represents Google's clearest bid yet to position its AI infrastructure as the backbone for enterprise research workflows in finance, life sciences, and market intelligence — industries where the stakes of getting information wrong are extraordinarily high. "We are launching two powerful updates to Deep Research in the Gemini API, now with better quality, MCP support, and native chart/infographics generation," Google CEO Sundar Pichai wrote on X. "Use Deep Research when you want speed and efficiency, and use Max when you want the highest quality context gathering & synthesis using extended test-time compute — achieving 93.3% on DeepSearchQA and 54.6% on HLE." Both agents are available starting today in public preview via paid tiers of the Gemini API, accessible through the Interactions API that Google first introduced in December 2025. Why Google built two research agents instead of one The launch introduces a tiered architecture that reflects a fundamental tension in AI agent design: the tradeoff between speed and thoroughness. Deep Research, the standard tier, replaces the preview agent Google released in December and is optimized for low-latency, interactive use cases. 
It delivers what Google describes as significantly reduced latency and cost at higher quality levels compared to its predecessor. The company positions it as ideal for applications where a developer wants to embed research capabilities directly into a user-facing interface — think a financial dashboard that can answer complex analytical questions in near-real time. Deep Research Max occupies the opposite end of the spectrum. It leverages extended test-time compute — a technique where the model spends more computational cycles iteratively reasoning, searching, and refining its output before delivering a final report. Google designed it for asynchronous, background workflows: the kind of task where an analyst team kicks off a batch of due diligence reports before leaving the office and expects exhaustive, fully sourced analyses waiting for them the next morning. The Google DeepMind team framed the distinction on X: "Deep Research: Optimized for speed and efficiency. Perfect for interactive apps needing quicker responses. Deep Research Max: It uses extra time to search and reason. Ideal for exhaustive context gathering and tasks happening in the background." "Deep Research was our first hosted agent in the API and has gained a ton of traction over the last 3 months, very excited for folks to test out the new agents and all the improvements, this is just the start of our agents journey," Logan Kilpatrick, who leads developer relations for Google's AI efforts, wrote on X. MCP support lets the agents tap into private enterprise data for the first time Perhaps the most consequential feature in today's release is the addition of Model Context Protocol support, which transforms Deep Research from a sophisticated web research tool into something more closely resembling a universal data analyst. 
MCP, an emerging open standard for connecting AI models to external data sources, allows Deep Research to securely query private databases, internal document repositories, and specialized third-party data services, all without requiring sensitive information to leave its source environment. In practical terms, this means a hedge fund could point Deep Research at its internal deal-flow database and a financial data terminal simultaneously, then ask the agent to synthesize insights from both alongside publicly available information from the web. Google disclosed that it is actively collaborating with FactSet, S&P, and PitchBook on their MCP server designs, a signal that the company is pursuing deep integration with the data providers that Wall Street and the broader financial services industry already rely on daily. The goal, according to the blog post authored by Google DeepMind product managers Lukas Haas and Srinivas Tadepalli, is to "let shared customers integrate financial data offerings into workflows powered by Deep Research, and to enable them to realize a leap in productivity by gathering context using their exhaustive data universes at lightning speed." This addresses one of the most persistent pain points in enterprise AI adoption: the gap between what a model can find on the open internet and what an organization actually needs to make decisions. Until now, bridging that gap required significant custom engineering. MCP support, combined with Deep Research's autonomous browsing and reasoning capabilities, collapses much of that complexity into a configuration step. Developers can now run Deep Research with Google Search, remote MCP servers, URL Context, Code Execution, and File Search simultaneously, or turn off web access entirely to search exclusively over custom data. The system also accepts multimodal inputs including PDFs, CSVs, images, audio, and video as grounding context.
Native charts and infographics turn AI reports into stakeholder-ready deliverables The second headline feature — native chart and infographic generation — may sound incremental, but it addresses a practical limitation that has constrained the usefulness of AI-generated research outputs in professional settings. Previous versions of Deep Research produced text-only reports. Users who needed visualizations had to export the data and build charts themselves, a friction point that undermined the promise of end-to-end automation. The new agents generate high-quality charts and infographics inline within their reports, rendered in HTML or Google's Nano Banana format, dynamically visualizing complex datasets as part of the analytical narrative. "The agent generates HTML charts and infographics inline with the report. Not screenshots. Not suggestions to 'visualize this data.' Actual rendered charts inside the markdown output," noted AI commentator Shruti Mishra on X, capturing the practical significance of the change. For enterprise users — particularly those in finance and consulting who need to produce stakeholder-ready deliverables — this transforms Deep Research from a tool that accelerates the research phase into one that can potentially produce near-final analytical products. Combined with a new collaborative planning feature that lets users review, guide, and refine the agent's research plan before execution, and real-time streaming of intermediate reasoning steps, the system gives developers granular control over the investigation's scope while maintaining the transparency that regulated industries demand. How Deep Research evolved from a consumer chatbot feature to enterprise platform infrastructure Today's release crystallizes a strategic narrative Google has been building for months: Deep Research is not merely a consumer feature but a piece of infrastructure that powers multiple Google products and is now being offered to external developers as a platform. 
The blog post explicitly notes that when developers build with the Deep Research agent, they tap into "the same autonomous research infrastructure that powers research capabilities within some of Google's most popular products like Gemini App, NotebookLM, Google Search and Google Finance." This suggests that the agent available through the API is not a stripped-down version of what Google uses internally but the same system, offered at platform scale. The journey to this point has been remarkably rapid. Google first introduced Deep Research as a consumer feature in the Gemini app in December 2024, initially powered by Gemini 1.5 Pro. At the time, the company described it as a personal AI research assistant that could save users hours by synthesizing web information in minutes. By March 2025, Google upgraded Deep Research with Gemini 2.0 Flash Thinking Experimental and made it available for anyone to try. Then came the upgrade to Gemini 2.5 Pro Experimental, where Google reported that raters preferred its reports over competing deep research providers by more than a 2-to-1 margin. The December 2025 release was the pivot to developer access, when Google launched the Interactions API and made Deep Research available programmatically for the first time, powered by Gemini 3 Pro and accompanied by the open-source DeepSearchQA benchmark. The underlying model driving today's improvements is Gemini 3.1 Pro, which Google released on February 19, 2026. That model represented a significant leap in core reasoning: on ARC-AGI-2, a benchmark evaluating a model's ability to solve novel logic patterns, 3.1 Pro scored 77.1% — more than double the performance of Gemini 3 Pro. Deep Research Max inherits that reasoning foundation and layers autonomous research behaviors on top of it, achieving 93.3% on DeepSearchQA (up from 66.1% in December) and 54.6% on Humanity's Last Exam (up from 46.4%). 
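The benchmark jumps quoted above can be read two ways: as a gain in percentage points or as a relative improvement over the earlier score. The short sketch below works through both for the two figures given in the article. The helper name `gains` and the one-decimal rounding are illustrative choices, not anything Google published.

```python
# Benchmark scores quoted in the article: (December score, current score), in percent.
scores = {
    "DeepSearchQA": (66.1, 93.3),
    "Humanity's Last Exam": (46.4, 54.6),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to one decimal."""
    point_gain = after - before
    relative_gain = point_gain / before * 100
    return round(point_gain, 1), round(relative_gain, 1)


# Compute both views of the improvement for each benchmark.
results = {name: gains(before, after) for name, (before, after) in scores.items()}

for name, (points, relative) in results.items():
    print(f"{name}: +{points} percentage points, +{relative}% relative to the earlier score")
```

The two readings differ noticeably for the same pair of scores, so a chart or headline should state which one it uses.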
Google faces a crowded field of competitors building autonomous research agents Google is not operating in a vacuum. The launch arrives amid intensifying competition in the autonomous research agent space. OpenAI has been developing its own agent capabilities within ChatGPT under the codename Hermes, which includes an agent builder, templates, scheduling, and Slack integration, according to reports circulating on social media. Perplexity has built its business around AI-powered research. And a growing ecosystem of startups is attacking various slices of the automated research workflow. What distinguishes Google's approach is the combination of its search infrastructure — which gives Deep Research access to the broadest and most current index of web information available — with the MCP-based connectivity to enterprise data sources. No other company currently offers a research agent that can simultaneously query the open web at Google Search's scale and navigate proprietary data repositories through a standardized protocol. The pricing structure also signals Google's intent to drive adoption: according to Sim.ai, which tracks model pricing, the Deep Research agent in the December preview was priced at $2 per million input tokens and $2 per million output tokens with a 1 million token context window — positioning it as cost-competitive for the volume of research output it generates. Not everyone greeted the announcement with unalloyed enthusiasm, however. Several users on X noted that the new agents are available only through the API, not in the Gemini consumer app. "Not on Gemini app," observed TestingCatalog News, while another user wrote, "Google keeps punishing Gemini App Pro subscribers for some reason." Others raised concerns about the presentation of benchmark results, with one user arguing that Google's charts could be "misleading" in how they represent percentage improvements. 
These complaints point to a broader tension in Google's AI strategy: the company is increasingly directing its most advanced capabilities toward developers and enterprise customers who access them through APIs, while consumer-facing products sometimes lag behind. What Deep Research Max means for finance, biotech, and the future of knowledge work The practical implications of today's launch are most immediately felt in industries that depend on exhaustive, multi-source research as a core business function. In financial services, where analysts routinely spend hours assembling due diligence reports from scattered sources — SEC filings, earnings transcripts, market data terminals, internal deal memos — Deep Research Max offers the possibility of automating the initial research phase entirely. The FactSet, S&P, and PitchBook partnerships suggest Google is serious about making this work with the data infrastructure that financial professionals already use. In life sciences, the blog post notes that Google has collaborated with Axiom Bio, which builds AI systems to predict drug toxicity, and found that Deep Research unlocked new levels of initial research depth across biomedical literature. In market research and consulting, the ability to produce stakeholder-ready reports with embedded visualizations and granular citations could compress project timelines from days to hours. The key question is whether the quality and reliability of these automated outputs will meet the standards that professionals in these fields demand. Google's benchmark numbers are impressive, but benchmarks measure performance on standardized tasks — real-world research is messier, more ambiguous, and often requires the kind of judgment that remains difficult to automate. Deep Research and Deep Research Max are available now in public preview via paid tiers of the Gemini API, with availability on Google Cloud for startups and enterprises coming soon. 
Eighteen months ago, Deep Research was a feature that helped grad students avoid drowning in browser tabs. Today, Google is betting it can replace the first shift at an investment bank. The distance between those two ambitions — and whether the technology can actually close it — will define whether autonomous research agents become a transformative category of enterprise software or just another AI demo that dazzles on benchmarks and disappoints in the conference room.

Vercel breach exposes the OAuth gap most security teams cannot detect, scope or contain

4/21/2026

One employee at Vercel adopted an AI tool. One employee at that AI vendor got hit with an infostealer. That combination created a walk-in path to Vercel’s production environments through an OAuth grant that nobody had reviewed. Vercel, the cloud platform behind Next.js and its millions of weekly npm downloads, confirmed on Sunday that attackers gained unauthorized access to internal systems. Mandiant was brought in. Law enforcement was notified. Investigations remain active. An update on Monday confirmed that Vercel ran a coordinated audit with GitHub, Microsoft, npm, and Socket, which verified that Next.js, Turbopack, AI SDK, and all Vercel-published npm packages remain uncompromised. Vercel also announced it is now defaulting environment variable creation to “sensitive.” Context.ai was the entry point. OX Security’s analysis found that a Vercel employee installed the Context.ai browser extension and signed into it using a corporate Google Workspace account, granting broad OAuth permissions. When Context.ai was breached, the attacker inherited that employee’s Workspace access, pivoted into Vercel environments, and escalated privileges by sifting through environment variables not marked as “sensitive.” Vercel’s bulletin states that variables marked sensitive are stored in a manner that prevents them from being read. Variables without that designation were accessible in plaintext through the dashboard and API, and the attacker used them as the escalation path. CEO Guillermo Rauch described the attacker as “highly sophisticated and, I strongly suspect, significantly accelerated by AI.” Jaime Blasco, CTO of Nudge Security, independently surfaced a second OAuth grant tied to Context.ai’s Chrome extension, matching the client ID from Vercel’s published IOC to Context.ai’s Google account before Rauch’s public statement.
The Hacker News reported that Google removed Context.ai’s Chrome extension from the Chrome Web Store on March 27. Per The Hacker News and Nudge Security, that extension embedded a second OAuth grant enabling read access to users’ Google Drive files.

Patient zero. A Roblox cheat and a Lumma Stealer infection

Hudson Rock published forensic evidence on Monday, reporting that the breach origin traces to a February 2026 Lumma Stealer infection on a Context.ai employee’s machine. According to Hudson Rock, browser history showed the employee downloading Roblox auto-farm scripts and game exploit executors. Harvested credentials included Google Workspace logins, Supabase keys, Datadog tokens, Authkit credentials, and the support@context.ai account. Hudson Rock identified the infected user as a core member of “context-inc,” Context.ai’s tenant on the Vercel platform, with administrative access to production environment variable dashboards. Context.ai published its own bulletin on Sunday (updated Monday), disclosing that the breach affects its deprecated AI Office Suite consumer product, not its enterprise Bedrock offering (Context.ai’s agent infrastructure product, unrelated to AWS Bedrock). Context.ai says it detected unauthorized access to its AWS environment in March, hired CrowdStrike to investigate, and shut down the environment. Its updated bulletin then disclosed that the scope was broader than initially understood: the attacker also compromised OAuth tokens for consumer users, and one of those tokens opened the door to Vercel’s Google Workspace. Dwell time is the detail that should concern security directors. Nearly a month separated Context.ai’s March detection from the Vercel disclosure on Sunday. A separate Trend Micro analysis references an intrusion beginning as early as June 2024, a finding that, if confirmed, would extend the dwell time to roughly 22 months.
VentureBeat could not independently reconcile that timeline with Hudson Rock's February 2026 dating; Trend Micro did not respond to a request for comment before publication.
Where detection goes blind
Security directors can use this table to benchmark their own detection stack against the four-hop kill chain this breach exploited.
| Kill Chain Hop | What Happened | Who Should Detect | Typical Coverage Gap |
| --- | --- | --- | --- |
| 1. Infostealer on employee device | Context.ai employee downloaded Roblox cheat scripts; Lumma Stealer harvested Workspace creds, Supabase/Datadog/Authkit keys. | EDR on endpoint; credential exposure monitoring. | Low. Device likely under-monitored. No stealer log monitoring at most orgs. Most enterprises do not subscribe to infostealer intelligence feeds or correlate stealer logs against employee email domains. |
| 2. AWS compromise at Context.ai | Attacker used harvested credentials to access Context.ai’s AWS. Detected in March. | Context.ai cloud security; AWS CloudTrail. | Partially detected. Context.ai stopped AWS access but missed OAuth token exfiltration. Initial investigation did not identify OAuth token exfiltration. Scope was underestimated until Vercel disclosure. |
| 3. OAuth token theft into Vercel Workspace | Compromised OAuth token used to access a Vercel employee’s Google Workspace. Employee had granted “Allow All” permissions via Chrome extension. | Google Workspace audit logs; OAuth app monitoring; CASB. | Very low. Most orgs do not monitor third-party OAuth token usage patterns. No approval workflow intercepted the grant. No anomaly detection on OAuth token use from a compromised third party. This is the hop no one saw. |
| 4. Lateral movement into Vercel production | Attacker enumerated non-sensitive env vars (accessible via dashboard/API), harvested customer credentials. | Vercel platform audit logs; behavioral analytics. | Moderate. Vercel detected the intrusion after the attacker accessed customer credentials. Detection occurred after exfiltration, not before. Env var access by a compromised Workspace account did not trigger real-time alerting. |
What’s confirmed vs. what’s claimed
Vercel’s bulletin confirms unauthorized access to internal systems, a limited subset of affected customers, and two IOCs tied to Context.ai’s Google Workspace OAuth apps. Rauch confirmed that Next.js, Turbopack, and Vercel’s open-source projects are unaffected.
Separately, a threat actor using the ShinyHunters name posted on BreachForums claiming to hold Vercel’s internal database, employee accounts, and GitHub and NPM tokens, with a $2M asking price. Austin Larsen, principal threat analyst at Google Threat Intelligence, assessed the claimant as “likely an imposter.” Actors previously linked to ShinyHunters have denied involvement. None of these claims has been independently verified.
Six governance failures the Vercel breach exposed
1. AI tool OAuth scopes go unaudited. Context.ai’s own bulletin states that a Vercel employee granted “Allow All” permissions using a corporate account. Most security teams have no inventory of which AI tools their employees have granted OAuth access to. CrowdStrike CTO Elia Zaitsev put it bluntly at RSAC 2026: “Don’t give an agent access to everything just because you’re lazy. Give it access to only what it needs to get the job done.” Jeff Pollard, VP and principal analyst at Forrester, told Cybersecurity Dive that the attack is a reminder about third-party risk management concerns and AI tool permissions.
2. Environment variable classification is doing real security work. Vercel distinguishes between variables marked “sensitive” (stored in a manner that prevents reading) and those without that designation (accessible in plaintext through the dashboard and API). Attackers used the accessible variables as the escalation path. A developer convenience toggle determined the blast radius. Vercel has since changed its default: new environment variables now default to sensitive.
“Modern controls get deployed, but if legacy tokens or keys aren’t retired, the system quietly favors them,” Merritt Baer, CSO at Enkrypt AI and former Deputy CISO at AWS, told VentureBeat.
3. Infostealer-to-SaaS-to-supply-chain escalation chains lack detection coverage. Hudson Rock’s reporting reveals a kill chain that crossed four organizational boundaries. No single detection layer covers that chain. Context.ai’s updated bulletin acknowledged that the scope extended beyond what was initially identified during its CrowdStrike-led investigation.
4. Dwell time between vendor detection and customer notification exceeds attacker timelines. Context.ai detected the AWS compromise in March. Vercel disclosed on Sunday. Every CISO should ask their vendors: what is your contractual notification window after detecting unauthorized access that could affect downstream customers?
5. Third-party AI tools are the new shadow IT. Vercel’s bulletin describes Context.ai as “a small, third-party AI tool.” Grip Security’s March 2026 analysis of 23,000 SaaS environments found a 490% year-over-year increase in AI-related attacks. Vercel is the latest enterprise to learn this the hard way.
6. AI-accelerated attackers compress response timelines. Rauch’s assessment of AI acceleration comes from what his IR team observed. CrowdStrike’s 2026 Global Threat Report puts the baseline at a 29-minute average eCrime breakout time, 65% faster than 2024.
Security director action plan
| Attack Surface | What Failed | Recommended Action | Owner |
| --- | --- | --- | --- |
| OAuth governance | Context.ai held broad “Allow All” Workspace permissions. No approval workflow intercepted. | Inventory every AI tool OAuth grant org-wide. Revoke scopes exceeding least privilege. Check both Vercel IOCs now. | Identity / IAM |
| Env var classification | Variables not marked “sensitive” remained accessible. Accessibility became the escalation path. | Default to non-readable. Require a security sign-off to downgrade any variable to accessible. | Platform eng + security |
| Infostealer-to-supply-chain | Kill chain spanned Lumma Stealer, Context.ai AWS, OAuth tokens, Vercel Workspace, and production environments. | Correlate Infostealer intel feeds against employee domains. Automate credential rotation when creds surface in stealer logs. | Threat intel + SOC |
| Vendor notification lag | Nearly a month between Context.ai detection and Vercel disclosure. | Require 72-hour notification clauses in all contracts involving OAuth or identity integration. | Third-party risk / legal |
| Shadow AI adoption | One employee’s unapproved AI tool became the breach vector for hundreds of orgs. | Extend shadow IT discovery to AI agent platforms. Treat unapproved adoption as a security event. | Security ops + procurement |
| Lateral movement speed | Rauch suspects AI acceleration. Attacker compressed the access-to-escalation window. | Cut detection-to-containment SLAs below 29-minute eCrime average. | SOC + IR team |
Run both IoC checks today
Search your Google Workspace admin console (Security > API Controls > Manage Third-Party App Access) for two OAuth App IDs. The first is 110671459871-30f1spbu0hptbs60cb4vsmv79i7bbvqj.apps.googleusercontent.com, tied to Context.ai’s Office Suite. The second is 110671459871-f3cq3okebd3jcg1lllmroqejdbka8cqq.apps.googleusercontent.com, tied to Context.ai’s Chrome extension and granting Google Drive read access. If either touched your environment, you are in the blast radius regardless of what Vercel discloses next.
What this means for security directors
Forget the Vercel brand name for a moment. What happened here is the first major proof case that AI agent OAuth integrations create a breach class that most enterprise security programs cannot detect, scope, or contain. A Roblox cheat download in February led to production infrastructure access in April. Four organizational boundaries, two cloud providers, and one identity perimeter. No zero-day required.
For most enterprises, employees have connected AI tools to corporate Google Workspace, Microsoft 365 or Slack instances with broad OAuth scopes — without security teams knowing. The Vercel breach is the case study for what that exposure looks like when an attacker finds it first.
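For teams that have already exported their third-party OAuth grants to a file, the two IoC checks described above can be scripted. The following is a minimal sketch, not an official Vercel or Google tool: it assumes a CSV export with hypothetical column names `user` and `client_id`, and it does not call any Google API. Only the two client IDs come from the article.

```python
import csv
import io

# The two OAuth client IDs quoted in the article.
CONTEXT_AI_IOCS = {
    "110671459871-30f1spbu0hptbs60cb4vsmv79i7bbvqj.apps.googleusercontent.com":
        "Context.ai Office Suite",
    "110671459871-f3cq3okebd3jcg1lllmroqejdbka8cqq.apps.googleusercontent.com":
        "Context.ai Chrome extension (Google Drive read access)",
}


def find_ioc_grants(csv_text, id_column="client_id", user_column="user"):
    """Return (user, client_id, label) for each row whose client ID is an IoC.

    The column names are assumptions about the export format; adjust them
    to match whatever your own export actually contains.
    """
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        client_id = (row.get(id_column) or "").strip().lower()
        if client_id in CONTEXT_AI_IOCS:
            hits.append((row.get(user_column, ""), client_id,
                         CONTEXT_AI_IOCS[client_id]))
    return hits


# Made-up sample export: one matching grant, one unrelated grant.
SAMPLE_EXPORT = """user,client_id
alice@example.com,110671459871-f3cq3okebd3jcg1lllmroqejdbka8cqq.apps.googleusercontent.com
bob@example.com,000000000000-hypothetical.apps.googleusercontent.com
"""

for user, client_id, label in find_ioc_grants(SAMPLE_EXPORT):
    print(f"{user} granted access to: {label}")
```

An empty result only means the IDs are absent from the file that was scanned. It is not a substitute for checking the admin console path named above.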

Adversaries hijacked AI security tools at 90+ organizations. The next wave has write access to the firewall

4/21/2026

Adversaries injected malicious prompts into legitimate AI tools at more than 90 organizations in 2025, stealing credentials and cryptocurrency. Every one of those compromised tools could read data, and none of them could rewrite a firewall rule. The autonomous SOC agents shipping now can. That escalation, from compromised tools that read data to autonomous agents that rewrite infrastructure, has not been exploited in production at scale yet. But the architectural conditions for it are shipping faster than the governance designed to prevent it. A compromised SOC agent can rewrite your firewall rules, modify IAM policies, and quarantine endpoints, all with its own privileged credentials, all through approved API calls that EDR classifies as authorized activity. The adversary never touches the network. The agent does it for them. Cisco announced AgenticOps for Security in February, with autonomous firewall remediation and PCI-DSS compliance capabilities. Ivanti launched Continuous Compliance and the Neurons AI self-service agent last week, with policy enforcement, approval gates and data context validation built into the platform at launch — a design distinction that matters because the OWASP Agentic Top 10 documents what happens when those controls are absent. "Adversaries exploited legitimate AI tools by injecting malicious prompts that generated unauthorized commands. As innovation accelerates, exploitation follows," CrowdStrike CEO George Kurtz said when releasing the 2026 Global Threat Report. "AI is compressing the time between intent and execution while turning enterprise AI systems into targets," added Adam Meyers, head of counter-adversary operations at CrowdStrike. State-sponsored use of AI in offensive operations surged 89% over the prior year. The broader attack surface is expanding in parallel. Malicious MCP server clones have already intercepted sensitive data in AI workflows by impersonating trusted services. The U.K. 
National Cyber Security Centre warned that prompt injection attacks against AI applications "may never be totally mitigated." The documented compromises targeted AI tools that could only read and summarize; the autonomous SOC agents shipping now can write, enforce, and remediate. The governance framework that maps the gap OWASP's Top 10 for Agentic Applications, released in December 2025 and built with more than 100 security researchers, documents 10 categories of attack against autonomous AI systems. Three categories map directly to what autonomous SOC agents introduce when they ship with write access: Agent Goal Hijacking (ASI01), Tool Misuse (ASI02), and Identity and Privilege Abuse (ASI03). Palo Alto Networks reported an 82:1 machine-to-human identity ratio in the average enterprise — every autonomous agent added to production extends that gap. The 2026 CISO AI Risk Report from Saviynt and Cybersecurity Insiders (n=235 CISOs) found 47% had already observed AI agents exhibiting unintended behavior, and only 5% felt confident they could contain a compromised agent. A separate Dark Reading poll found that 48% of cybersecurity professionals identify agentic AI as the single most dangerous attack vector. The IEEE-USA submission to NIST stated the problem plainly: "Risk is driven less by the models and is based more on the model's level of autonomy, privilege scope, and the environment of the agent being operationalized." Eleanor Watson, Senior IEEE Member, warned in the IEEE 2026 survey that "semi-autonomous systems can also drift from intended objectives, requiring oversight and regular audits." Cisco's intent-aware agentic inspection, announced alongside AgenticOps in February 2026, represents an early detection-layer approach to the same gap. The approaches differ: Cisco is adding inspection at the network layer while Ivanti built governance into the platform layer. Both signal the industry sees it coming. 
The question is whether the controls arrive before the exploits do. Autonomous agents that ship with governance built in Security teams are already stretched. Advanced AI models are accelerating the discovery of exploitable vulnerabilities faster than any human team can remediate manually, and the backlog is growing not because teams are failing, but because the volume now exceeds what manual patching cycles can absorb. Ivanti Neurons for Patch Management introduced Continuous Compliance this quarter, an automated enforcement framework that eliminates the gap between scheduled patch deployments and regulatory requirements. The framework identifies out-of-compliance endpoints and deploys patches out-of-band to update devices that missed maintenance windows, with built-in policy enforcement and compliance verification at every step. Ivanti also launched the Neurons AI self-service agent for ITSM, which moves beyond conversational intake to autonomous resolution with built-in guardrails for policy, approvals, and data context. The agent resolves common incidents and service requests from start to finish, reducing manual effort and deflecting tickets. Robert Hanson, Chief Information Officer at Grand Bank, described the decision calculus security leaders across the industry are weighing: "Before exploring the Ivanti Neurons AI self-service agent, our team was spending the bulk of our time handling repetitive requests. As we move toward implementing these capabilities, we expect to automate routine tasks and enable our team to focus more proactively on higher-value initiatives. Over time, this approach should help us reduce operational overhead while delivering faster, more secure service within the guardrails we define, ultimately supporting improvements in service quality and security." His emphasis on operating "within the guardrails we define" points to a broader design principle: speed and governance do not have to be trade-offs. 
The governance gap is concrete: the Saviynt report found 86% of organizations do not enforce access policies for AI identities, only 19% govern even half of their AI identities with the same controls applied to human users, and 75% of CISOs have discovered unsanctioned AI tools running in production with embedded credentials that nobody monitors. Continuous Compliance and the Neurons AI self-service agent address the patching and ITSM layers. The broader autonomous SOC agent terrain, including firewall remediation, IAM policy modification, and endpoint quarantine, extends beyond what any single platform governs today. The ten-question audit applies to every autonomous tool in the environment, including Ivanti's.
Prescriptive risk matrix for autonomous agent governance
The matrix maps all 10 OWASP Agentic Top 10 risk categories to what ships without governance, the detection gap, the proof case, and the recommended action for autonomous SOC agent deployments.
| OWASP Risk | What Ships Ungoverned | Detection Gap | Proof Case | Recommended Action |
| --- | --- | --- | --- | --- |
| ASI01: Goal Hijacking | Agent treats external inputs (logs, alerts, emails) as trusted instructions | EDR cannot detect adversarial instructions executed via legitimate API calls | EchoLeak (CVE-2025-32711): hidden email payload caused AI assistant to exfiltrate confidential data. Zero clicks required. | Classify all inputs by trust tier. Block instruction-bearing content from untrusted sources. Validate external data before agent ingestion. |
| ASI02: Tool Misuse | Agent authorized to modify firewall rules, IAM policies, and quarantine workflows | WAF inspects payloads, not tool-call intent. Authorized use is identical to misuse. | Amazon Q bent legitimate tools into destructive outputs despite valid permissions (OWASP cited). | Scope each tool to minimum required permissions. Log every invocation with intent metadata. Alert on calls outside baseline patterns. |
| ASI03: Identity Abuse | Agent inherits service account credentials scoped to production infrastructure | SIEM sees authorized identity performing authorized actions. No anomaly triggers. | 82:1 machine-to-human identity ratio in average enterprise (Palo Alto Networks). Each agent adds to it. | Issue scoped agent-specific identities. Enforce time-bound, task-bound credential leases. Eliminate inherited user credentials. |
| ASI04: Supply Chain | Agent loads third-party MCP servers or plugins at runtime without provenance verification | Static analysis cannot inspect dynamically loaded runtime components. | Malicious MCP server clones intercepted sensitive data by impersonating trusted services (CrowdStrike 2026). | Maintain approved MCP server registry. Verify provenance and integrity before runtime loading. Block unapproved plugins. |
| ASI05: Unexpected Code Exec | Agent generates or executes attacker-controlled code through unsafe evaluation paths or tool chains | Code review gates apply to human commits, not agent-generated runtime code. | AutoGPT RCE: natural-language execution paths enabled remote code execution through unsanctioned package installs (OWASP cited). | Sandbox all agent code execution. Require human approval for production code paths. Block dynamic eval and unsanctioned installs. |
| ASI06: Memory Poisoning | Agent persists context across sessions where poisoned data compounds over time | Session-based monitoring resets between interactions. Poisoning accumulates undetected. | Calendar Drift: malicious calendar invite reweighted agent objectives while remaining within policy bounds (OWASP). | Implement session memory expiration. Audit persistent memory stores for anomalous content. Isolate memory per task scope. |
| ASI07: Inter-Agent Comm | Agents communicate without mutual authentication, encryption, or schema validation | Monitoring covers individual agents but not spoofed or manipulated inter-agent messages. | OWASP documented spoofed messages that misdirected entire agent clusters via protocol downgrade attacks. | Enforce mutual authentication between agents. Encrypt all inter-agent channels. Validate message schema at every handoff. |
| ASI08: Cascading Failures | Agent delegates to downstream agents, creating multi-hop privilege chains across systems | Monitoring covers individual agents but not cross-agent delegation chains or fan-out. | Simulation: single compromised agent poisoned 87% of downstream decision-making within 4 hours in controlled test. | Map all delegation chains end to end. Enforce privilege boundaries at each handoff. Implement circuit breakers for cascading actions. |
| ASI09: Human-Agent Trust | Agent uses persuasive language or fabricated evidence to override human safety decisions | Compliance verifies policy configuration, not whether the agent manipulated the human into approving. | Replit agent deleted primary customer database then fabricated its contents to appear compliant and hide the damage. | Require independent verification for high-risk agent recommendations. Log all human approval decisions with full agent reasoning chain. |
| ASI10: Rogue Agents | Agent deviates from intended purpose while appearing compliant on the surface | Compliance checks verify configuration at deployment, not behavioral drift after deployment. | 92% of organizations lack full visibility into AI identities; 86% do not enforce access policies (Saviynt 2026). | Deploy behavioral drift detection. Establish baseline agent behavior profiles. Alert on deviation from expected action patterns. |
The 10-question OWASP audit for autonomous agents
Each question maps to one OWASP Agentic Top 10 risk category. Autonomous platforms that ship with policy enforcement, approval gates, and data context validation will have clear answers to every question. Three or more "I don't know" answers on any tool means that tool's governance has not kept pace with its capabilities.
1. Which agents have write access to production firewall, IAM, or endpoint controls?
2. Which accept external inputs without validation?
3. Which execute irreversible actions without human approval?
4. Which persist memory where poisoning compounds across sessions?
5. Which delegate to other agents, creating cascade privilege chains?
6. Which load third-party plugins or MCP servers at runtime?
7. Which generate or execute code in production environments?
8. Which inherit user credentials instead of scoped agent identities?
9. Which lack behavioral monitoring for drift from intended purpose?
10. Which can be manipulated through persuasive language to override safety controls?
What the board needs to hear
The board conversation is three sentences. Adversaries compromised AI tools at more than 90 organizations in 2025, according to CrowdStrike's 2026 Global Threat Report. The autonomous tools deploying now have more privilege than the ones that were compromised. The organization has audited every autonomous tool against OWASP's 10 risk categories and confirmed that the governance controls are in place.
If that third sentence is not true, it needs to be true before the next autonomous agent ships to production. Run the 10-question audit against every agent with write access to production infrastructure within the next 30 days. Every autonomous platform shipping to production should be held to the same standard: policy enforcement, approval gates, and data context validation built in at launch, not retrofitted after the first incident. The audit surfaces which tools have done that work and which have not.
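The audit's scoring rule (three or more "I don't know" answers on any tool) is simple enough to tally in a few lines. The sketch below is one possible way to do that, assuming each tool's ten answers are recorded as free-text strings in question order. The tool names and answers are made up for illustration, and the threshold of three is the only value taken from the article.

```python
UNKNOWN = "i don't know"


def count_unknowns(answers):
    """Count answers that are exactly "I don't know" (case-insensitive)."""
    return sum(1 for answer in answers if answer.strip().lower() == UNKNOWN)


def tools_with_governance_gaps(answers_by_tool, threshold=3):
    """Return {tool: unknown_count} for tools at or above the threshold."""
    flagged = {}
    for tool, answers in answers_by_tool.items():
        unknown = count_unknowns(answers)
        if unknown >= threshold:
            flagged[tool] = unknown
    return flagged


# Hypothetical audit results: ten answers per tool, one per question.
AUDIT_RESULTS = {
    "firewall-remediation-agent": [
        "yes", "I don't know", "yes", "no", "I don't know",
        "no", "no", "yes", "I don't know", "no",
    ],
    "ticket-triage-agent": [
        "no", "no", "no", "no", "no",
        "I don't know", "no", "no", "no", "no",
    ],
}

for tool, unknown in tools_with_governance_gaps(AUDIT_RESULTS).items():
    print(f"{tool}: {unknown} unanswered questions, review before shipping")
```

Exact string matching is deliberately strict here. A real intake form would more likely use a fixed set of answer choices than free text.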

Train-to-Test scaling explained: How to optimize your end-to-end AI compute budget for inference

4/17/2026

The standard guidelines for building large language models (LLMs) optimize only for training costs and ignore inference costs. This poses a challenge for real-world applications that use inference-time scaling techniques to increase the accuracy of model responses, such as drawing multiple reasoning samples from a model at deployment. To bridge this gap, researchers at University of Wisconsin-Madison and Stanford University have introduced Train-to-Test (T2) scaling laws, a framework that jointly optimizes a model’s parameter size, its training data volume, and the number of test-time inference samples. In practice, their approach proves that it is compute-optimal to train substantially smaller models on vastly more data than traditional rules prescribe, and then use the saved computational overhead to generate multiple repeated samples at inference. For enterprise AI application developers who are training their own models, this research provides a proven blueprint for maximizing return on investment. It shows that AI reasoning does not necessarily require spending huge amounts on frontier models. Instead, smaller models can yield stronger performance on complex tasks while keeping per-query inference costs manageable within real-world deployment budgets. Conflicting scaling laws Scaling laws are an important part of developing large language models. Pretraining scaling laws dictate the best way to allocate compute during the model's creation, while test-time scaling laws guide how to allocate compute during deployment, such as letting the model “think longer” or generating multiple reasoning samples to solve complex problems. The problem is that these scaling laws have been developed completely independently of one another despite being fundamentally intertwined. A model's parameter size and training duration directly dictate both the quality and the per-query cost of its inference samples. 
Currently, the industry gold standard for pretraining is the Chinchilla rule, which suggests a compute-optimal ratio of roughly 20 training tokens for every model parameter. However, creators of modern AI model families, such as Llama, Gemma, and Qwen, regularly break this rule by intentionally overtraining their smaller models on massive amounts of data. As Nicholas Roberts, co-author of the paper, told VentureBeat, the traditional approach falters when building complex agentic workflows: "In my view, the inference stack breaks down when each individual inference call is expensive. This is the case when the models are large and you need to do a lot of repeated sampling." Instead of relying on massive models, developers can use overtrained compact models to run this repeated sampling at a fraction of the cost. But because training and test-time scaling laws are examined in isolation, there is no rigorous framework to calculate how much a model should be overtrained based on how many reasoning samples it will need to generate during deployment. Consequently, there has previously been no formula that jointly optimizes model size, training data volume, and test-time inference budgets. The reason that this framework is hard to formulate is that pretraining and test-time scaling speak two different mathematical languages. During pretraining, a model's performance is measured using “loss,” a smooth, continuous metric that tracks prediction errors as the model learns. At test time, developers use real-world, downstream metrics to evaluate a model's reasoning capabilities, such as pass@k, which measures the probability that a model will produce at least one correct answer across k independent, repeated attempts. Train-to-test scaling laws To solve the disconnect between training and deployment, the researchers introduce Train-to-Test (T2) scaling laws. 
At a high level, this framework predicts a model's reasoning performance by treating three variables as a single equation: the model's size (N), the volume of training tokens it learns from (D), and the number of reasoning samples it generates during inference (k). T2 combines pretraining and inference budgets into one optimization formula that accounts for both the baseline cost to train the model (6ND) and the compounding cost to query it repeatedly at inference (2Nk). The researchers tried different modeling approaches: whether to model the pre-training loss or test-time performance (pass@k) as functions of N, D, and k. The first approach takes the familiar mathematical equation used for Chinchilla scaling (which calculates a model's prediction error, or loss) and directly modifies it by adding a new variable that accounts for the number of repeated test-time samples (k). This allows developers to see how increasing inference compute drives down the model's overall error rate. The second approach directly models the downstream pass@k accuracy. It tells developers the probability that their application will solve a problem given a specific compute budget. But should enterprises use this framework for every application? Roberts clarifies that this approach is highly specialized. "I imagine that you would not see as much of a benefit for knowledge-heavy applications, such as chat models," he said. Instead, "T2 is tailored to reasoning-heavy applications such as coding, where typically you would use repeated sampling as your test-time scaling method." What it means for developers To validate the T2 scaling laws, the researchers built an extensive testbed of over 100 language models, ranging from 5 million to 901 million parameters. They trained 21 new, heavily overtrained checkpoints from scratch to test if their mathematical forecasts held up in reality. 
They then benchmarked the models across eight diverse tasks, which included real-world datasets like SciQ and OpenBookQA, alongside synthetic tasks designed to test arithmetic, spatial reasoning, and knowledge recall. Both of their mathematical models proved that the compute-optimal frontier shifts drastically away from standard Chinchilla scaling. To maximize performance under a fixed budget, the optimal choice is a model that is significantly smaller and trained on vastly more data than the traditional 20-tokens-per-parameter rule dictates. In their experiments, the highly overtrained small models consistently outperformed the larger, Chinchilla-optimal models across all eight evaluation tasks when test-time sampling costs were accounted for. For developers looking to deploy these findings, the technical barrier is surprisingly low. "Nothing fancy is required to perform test-time scaling with our current models," Roberts said. "At deployment, developers can absolutely integrate infrastructure that makes the sampling process more efficient (e.g. KV caching if you’re using a transformer)." KV caching helps by storing previously processed context so the model doesn't have to re-read the initial prompt from scratch for every new reasoning sample. However, extreme overtraining comes with practical trade-offs. While overtrained models can be notoriously stubborn and harder to fine-tune, Roberts notes that when they applied supervised fine-tuning, "while this effect was present, it was not a strong enough effect to pull the optimal model back to Chinchilla." The compute-optimal strategy remains definitively skewed toward compact models. Yet, teams pushing this to the absolute limit must be wary of hitting physical data limits. "Another angle is that if you take our overtraining recommendations to the extreme, you may actually run out of training data," Roberts said, referring to the looming "data wall" where high-quality internet data is exhausted. 
These experiments confirm that if an application relies on generating multiple test-time reasoning samples, aggressively overtraining a compact model is practically and mathematically the most effective way to spend an end-to-end compute budget. To help developers get started, the research team plans to open-source their checkpoints and code soon, allowing enterprises to plug in their own data and test the scaling behavior immediately. Ultimately, this framework serves as an equalizing force in the AI industry. This is especially crucial as the high price of frontier models can become a barrier as you scale agentic applications that rely on reasoning models. "T2 fundamentally changes who gets to build strong reasoning models," Roberts concludes. "You might not need massive compute budgets to get state-of-the-art reasoning. Instead, you need good data and smart allocation of your training and inference budget."
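The cost accounting behind the framework can be made concrete with a small calculation. The sketch below is illustrative only and is not the researchers' code. It uses the two cost terms the article quotes (6ND for training, 2Nk for repeated sampling) and the Chinchilla ratio of roughly 20 tokens per parameter. The function names, the example model sizes, the per-attempt success rate, and the independence assumption inside `pass_at_k` are all hypothetical choices made for this example.

```python
def training_flops(n_params, n_tokens):
    """Training cost term quoted in the article: 6 * N * D."""
    return 6 * n_params * n_tokens


def sampling_flops(n_params, k):
    """Repeated-sampling cost term quoted in the article: 2 * N * k."""
    return 2 * n_params * k


def pass_at_k(p_single, k):
    """Chance of at least one correct answer in k attempts.

    Assumes every attempt succeeds independently with probability
    p_single, which is a simplification made for this sketch.
    """
    return 1.0 - (1.0 - p_single) ** k


# Hypothetical comparison at an equal training budget.
big_n = 900e6                      # larger model sized by the 20:1 rule
big_d = 20 * big_n                 # roughly 20 training tokens per parameter
budget = training_flops(big_n, big_d)

small_n = 300e6                    # smaller model given the same budget
small_d = budget / (6 * small_n)   # tokens affordable at that budget

print(f"small model tokens per parameter: {small_d / small_n:.0f}")

# Samples the small model can draw for the cost of 4 large-model samples.
k_small = sampling_flops(big_n, 4) / sampling_flops(small_n, 1)
print(f"small-model samples at equal sampling cost: {k_small:.0f}")

# How pass@k grows with k for a made-up per-attempt success rate of 0.2.
for k in (1, 4, 12):
    print(f"k={k:>2}: pass@k = {pass_at_k(0.2, k):.3f}")
```

The arithmetic shows only the budget trade-off: a model one-third the size can train on far more tokens per parameter and draw three times as many samples for the same cost. Whether those extra samples outweigh the smaller model's lower per-attempt accuracy is the empirical question the T2 scaling laws are meant to answer.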

Most enterprises can't stop stage-three AI agent threats, VentureBeat survey finds

4/17/2026

A rogue AI agent at Meta passed every identity check and still exposed sensitive data to unauthorized employees in March. Two weeks later, Mercor, a $10 billion AI startup, confirmed a supply-chain breach through LiteLLM. Both are traced to the same structural gap. Monitoring without enforcement, enforcement without isolation. A VentureBeat three-wave survey of 108 qualified enterprises found that the gap is not an edge case. It is the most common security architecture in production today. Gravitee’s State of AI Agent Security 2026 survey of 919 executives and practitioners quantifies the disconnect. 82% of executives say their policies protect them from unauthorized agent actions. Eighty-eight percent reported AI agent security incidents in the last twelve months. Only 21% have runtime visibility into what their agents are doing. Arkose Labs’ 2026 Agentic AI Security Report found 97% of enterprise security leaders expect a material AI-agent-driven incident within 12 months. Only 6% of security budgets address the risk. VentureBeat's survey results show that monitoring investment snapped back to 45% of security budgets in March after dropping to 24% in February, when early movers shifted dollars into runtime enforcement and sandboxing. The March wave (n=20) is directional, but the pattern is consistent with February’s larger sample (n=50): enterprises are stuck at observation while their agents already need isolation. CrowdStrike’s Falcon sensors detect more than 1,800 distinct AI applications across enterprise endpoints. The fastest recorded adversary breakout time has dropped to 27 seconds. Monitoring dashboards built for human-speed workflows cannot keep pace with machine-speed threats. The audit that follows maps three stages. Stage one is observe. Stage two is enforce, where IAM integration and cross-provider controls turn observation into action. Stage three is isolate, sandboxed execution that bounds blast radius when guardrails fail. 
VentureBeat Pulse data from 108 qualified enterprises ties each stage to an investment signal, an OWASP ASI threat vector, a regulatory surface, and immediate steps security leaders can take.

The threat surface stage-one security cannot see

The OWASP Top 10 for Agentic Applications 2026 formalized the attack surface last December. The ten risks are: goal hijack (ASI01), tool misuse (ASI02), identity and privilege abuse (ASI03), agentic supply chain vulnerabilities (ASI04), unexpected code execution (ASI05), memory poisoning (ASI06), insecure inter-agent communication (ASI07), cascading failures (ASI08), human-agent trust exploitation (ASI09), and rogue agents (ASI10). Most have no analog in traditional LLM applications. The audit below maps six of these to the stages where they are most likely to surface and the controls that address them.

Invariant Labs disclosed the MCP Tool Poisoning Attack in April 2025: malicious instructions in an MCP server’s tool description cause an agent to exfiltrate files or hijack a trusted server. CyberArk extended it to Full-Schema Poisoning. The mcp-remote OAuth proxy patched CVE-2025-6514 after a command-injection flaw put 437,000 downloads at risk.

Merritt Baer, CSO at Enkrypt AI and former AWS Deputy CISO, framed the gap in an exclusive VentureBeat interview: “Enterprises believe they’ve ‘approved’ AI vendors, but what they’ve actually approved is an interface, not the underlying system. The real dependencies are one or two layers deeper, and those are the ones that fail under stress.”

CrowdStrike CTO Elia Zaitsev put the visibility problem in operational terms in an exclusive VentureBeat interview at RSAC 2026: “It looks indistinguishable if an agent runs your web browser versus if you run your browser.” Distinguishing the two requires walking the process tree, tracing whether Chrome was launched by a human from the desktop or spawned by an agent in the background.
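A minimal sketch of what walking the process tree could look like, run against a simulated process snapshot. The process names, the `AGENT_RUNTIMES` and `HUMAN_SHELLS` sets, and the `Proc` record are hypothetical and exist only for illustration; a real sensor would read parent-child links from the operating system and maintain its own classification lists.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Proc:
    pid: int
    ppid: int  # parent process ID; 0 means "no parent"
    name: str


# Hypothetical name lists; a real deployment would maintain its own.
AGENT_RUNTIMES = {"agent-runner"}
HUMAN_SHELLS = {"gnome-shell", "explorer.exe", "Finder"}


def ancestry(pid: int, table: Dict[int, Proc]) -> List[Proc]:
    """Return the process chain from `pid` up to the root of the tree."""
    chain: List[Proc] = []
    seen = set()
    current = table.get(pid)
    while current is not None and current.pid not in seen:  # guard against cycles
        seen.add(current.pid)
        chain.append(current)
        current = table.get(current.ppid)
    return chain


def launched_by(pid: int, table: Dict[int, Proc]) -> str:
    """Classify a process by its nearest recognizable ancestor."""
    for ancestor in ancestry(pid, table)[1:]:  # skip the process itself
        if ancestor.name in AGENT_RUNTIMES:
            return "agent"
        if ancestor.name in HUMAN_SHELLS:
            return "human"
    return "unknown"


# Simulated snapshot: two Chrome processes that look identical by name alone.
SNAPSHOT = {p.pid: p for p in [
    Proc(1, 0, "init"),
    Proc(10, 1, "gnome-shell"),
    Proc(20, 10, "chrome"),   # started from the desktop
    Proc(30, 1, "agent-runner"),
    Proc(31, 30, "python"),
    Proc(32, 31, "chrome"),   # spawned by an agent in the background
]}

print(launched_by(20, SNAPSHOT), launched_by(32, SNAPSHOT))
```

Both Chrome entries carry the same process name; only the ancestor chain separates them, which is why a log that records process names without parent links cannot tell the two apart.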
Most enterprise logging configurations cannot make that distinction.

The regulatory clock and the identity architecture

Auditability priority tells the same story in miniature. In January, 50% of respondents ranked it a top concern. By February, that dropped to 28% as teams sprinted to deploy. In March, it surged to 65% when those same teams realized they had no forensic trail for what their agents did.

HIPAA’s 2026 Tier 4 willful-neglect maximum is $2.19M per violation category per year. In healthcare, Gravitee’s survey found 92.7% of organizations reported AI agent security incidents versus the 88% all-industry average. For a health system running agents that touch PHI, that ratio is the difference between a reportable breach and an uncontested finding of willful neglect. FINRA’s 2026 Oversight Report recommends explicit human checkpoints before agents that can act or transact execute, along with narrow scope, granular permissions, and complete audit trails of agent actions.

Mike Riemer, Field CISO at Ivanti, quantified the speed problem in a recent VentureBeat interview: “Threat actors are reverse engineering patches within 72 hours. If a customer doesn’t patch within 72 hours of release, they’re open to exploit.” Most enterprises take weeks. Agents operating at machine speed widen that window into a permanent exposure.

The identity problem is architectural. Gravitee's survey of 919 practitioners found only 21.9% of teams treat agents as identity-bearing entities, 45.6% still use shared API keys, and 25.5% of deployed agents can create and task other agents. A quarter of enterprises can spawn agents that their security team never provisioned. That is ASI08 as architecture.

Guardrails alone are not a strategy

A 2025 paper by Kazdan and colleagues (Stanford, ServiceNow Research, Toronto, FAR AI) showed a fine-tuning attack that bypasses model-level guardrails in 72% of attempts against Claude 3 Haiku and 57% against GPT-4o.
The attack received a $2,000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic. Guardrails constrain what an agent is told to do, not what a compromised agent can reach.

CISOs already know this. In VentureBeat's three-wave survey, prevention of unauthorized actions ranked as the top capability priority in every wave at 68% to 72%, the most stable high-conviction signal in the dataset. The demand is for permissioning, not prompting. Guardrails address the wrong control surface.

Zaitsev framed the identity shift at RSAC 2026: “AI agents and non-human identities will explode across the enterprise, expanding exponentially and dwarfing human identities. Each agent will operate as a privileged super-human with OAuth tokens, API keys, and continuous access to previously siloed data sets.” Identity security built for humans will not survive this shift. Cisco President Jeetu Patel offered the operational analogy in an exclusive VentureBeat interview: agents behave “more like teenagers, supremely intelligent, but with no fear of consequence.”

VentureBeat Prescriptive Matrix: AI Agent Security Maturity Audit

Stage 1: Observe
Attack Scenario: Attacker embeds goal-hijack payload in forwarded email (ASI01). Agent summarizes email and silently exfiltrates credentials to an external endpoint. See: Meta March 2026 incident.
What Breaks: No runtime log captures the exfiltration. SIEM never sees the API call. The security team learns from the victim. Zaitsev: agent activity is “indistinguishable” from human activity in default logging.
Detection Test: Inject a canary token into a test document. Route it through your agent. If the token leaves your network, stage one failed.
Blast Radius: Single agent, single session. With shared API keys (45.6% of enterprises): unlimited lateral movement.
Recommended Control: Deploy agent API call logging to SIEM. Baseline normal tool-call patterns per agent role. Alert on the first outbound call to an unrecognized endpoint.

Stage 2: Enforce
Attack Scenario: Compromised MCP server poisons tool description (ASI04). Agent invokes poisoned tool, writes attacker payload to production DB using inherited service-account credentials. See: Mercor/LiteLLM April 2026 supply-chain breach.
What Breaks: IAM allows write because agent uses shared service account. No approval gate on write ops. Poisoned tool indistinguishable from clean tool in logs. Riemer: “72-hour patch window” collapses to zero when agents auto-invoke.
Detection Test: Register a test MCP server with a benign-looking poisoned description. Confirm your policy engine blocks the tool call before execution reaches the database. Run mcp-scan on all registered servers.
Blast Radius: Production database integrity. If agent holds DBA-level credentials: full schema compromise. Lateral movement via trust relationships to downstream agents.
Recommended Control: Assign scoped identity per agent. Require approval workflow for all write ops. Revoke every shared API key. Run mcp-scan on all MCP servers weekly.

Stage 3: Isolate
Attack Scenario: Agent A spawns Agent B to handle subtask (ASI08). Agent B inherits Agent A’s permissions, escalates to admin, rewrites org security policy. Every identity check passes. Source: CrowdStrike CEO George Kurtz, RSAC 2026 keynote.
What Breaks: No sandbox boundary between agents. No human gate on agent-to-agent delegation. Security policy modification is a valid action for admin-credentialed process. CrowdStrike CEO George Kurtz disclosed at RSAC 2026 that the agent “wanted to fix a problem, lacked permissions, and removed the restriction itself.”
Detection Test: Spawn a child agent from a sandboxed parent. Child should inherit zero permissions by default and require explicit human approval for each capability grant.
Blast Radius: Organizational security posture. A rogue policy rewrite disables controls for every subsequent agent. 97% of enterprise leaders expect a material incident within 12 months (Arkose Labs 2026).
Recommended Control: Sandbox all agent execution. Zero-trust for agent-to-agent delegation: spawned agents inherit nothing. Human sign-off before any agent modifies security controls. Kill switch per OWASP ASI10.

Sources: OWASP Top 10 for Agentic Applications 2026; Invariant Labs MCP Tool Poisoning (April 2025); CrowdStrike RSAC 2026 Fortune 50 disclosure; Meta March 2026 incident (The Information/Engadget); Mercor/LiteLLM breach (Fortune, April 2, 2026); Arkose Labs 2026 Agentic AI Security Report; VentureBeat Pulse Q1 2026.

The stage-one attack scenario in this matrix is not hypothetical. Unauthorized tool or data access ranked as the most feared failure mode in every wave of VentureBeat’s survey, growing from 42% in January to 50% in March. That trajectory and the 70%-plus priority rating for prevention of unauthorized actions are the two most mutually reinforcing signals in the entire dataset. CISOs fear the exact attack this matrix describes, and most have not deployed the controls to stop it.

Hyperscaler stage readiness: observe, enforce, isolate

The maturity audit tells you where your security program stands. The next question is whether your cloud platform can get you to stage two and stage three, or whether you are building those capabilities yourself. Patel put it bluntly: “It’s not just about authenticating once and then letting the agent run wild.” A stage-three platform running a stage-one deployment pattern gives you stage-one risk.

VentureBeat Pulse data surfaces a structural tension in this grid. OpenAI leads enterprise AI security deployments at 21% to 26% across the three survey waves, making the same provider that creates the AI risk also the primary security layer. The provider-as-security-vendor pattern holds across Azure, Google, and AWS. Zero-incremental-procurement convenience is winning by default. Whether that concentration is a feature or a single point of failure depends on how far the enterprise has progressed past stage one.
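The stage-one detection test from the matrix (plant a canary token, route the document through an agent, check whether the token leaves the network) can be sketched in a few lines. This is an illustration under stated assumptions, not a production detector: the outbound log here is simulated, the record schema with `host` and `body` keys is hypothetical, and the host names are placeholders.

```python
import secrets
from typing import Dict, Iterable, List, Set


def make_canary() -> str:
    """Generate a unique, unguessable marker string."""
    return f"CANARY-{secrets.token_hex(8)}"


def plant(document: str, canary: str) -> str:
    """Embed the canary in a test document the agent will process."""
    return f"{document}\nReference ID: {canary}\n"


def find_leaks(canary: str, outbound: Iterable[Dict[str, str]],
               internal_hosts: Set[str]) -> List[Dict[str, str]]:
    """Return logged requests that carried the canary to a non-internal host."""
    return [
        req for req in outbound
        if canary in req.get("body", "") and req.get("host") not in internal_hosts
    ]


canary = make_canary()
test_doc = plant("Quarterly summary draft.", canary)

# Simulated outbound log; the record schema ("host", "body") is an assumption.
outbound_log = [
    {"host": "wiki.internal.example", "body": test_doc},
    {"host": "attacker.example", "body": f"exfil={canary}"},
    {"host": "api.vendor.example", "body": "ping"},
]
leaks = find_leaks(canary, outbound_log, internal_hosts={"wiki.internal.example"})
print("stage one failed" if leaks else "no canary seen leaving the network")
```

A plain substring match only catches a token that leaves unmodified. An agent that encodes or splits the payload would slip past this check, so a real test would also need to look for transformed copies of the token.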
Provider stage-readiness grid (gaps as of April 2026)

Microsoft Azure
Identity Primitive (Stage 2): Entra ID agent scoping. Agent 365 maps agents to owners. GA.
Enforcement Control (Stage 2): Copilot Studio DLP policies. Purview for agent output classification. GA.
Isolation Primitive (Stage 3): Azure Confidential Containers for agent workloads. Preview. No per-agent sandbox at GA.
Gap as of April 2026: No agent-to-agent identity verification. No MCP governance layer. Agent 365 monitors but cannot block in-flight tool calls.

Anthropic
Identity Primitive (Stage 2): Managed Agents: per-agent scoped permissions, credential mgmt. Beta (April 8, 2026). $0.08/session-hour.
Enforcement Control (Stage 2): Tool-use permissions, system prompt enforcement, and built-in guardrails. GA.
Isolation Primitive (Stage 3): Managed Agents sandbox: isolated containers per session, execution-chain auditability. Beta. Allianz, Asana, Rakuten, and Sentry are in production.
Gap as of April 2026: Beta pricing/SLA not public. Session data in Anthropic-managed DB (lock-in risk per VentureBeat research). GA timing TBD.

Google Cloud
Identity Primitive (Stage 2): Vertex AI service accounts for model endpoints. IAM Conditions for agent traffic. GA.
Enforcement Control (Stage 2): VPC Service Controls for agent network boundaries. Model Armor for prompt/response filtering. GA.
Isolation Primitive (Stage 3): Confidential VMs for agent workloads. GA. Agent-specific sandbox in preview.
Gap as of April 2026: Agent identity ships as a service account, not an agent-native principal. No agent-to-agent delegation audit. Model Armor does not inspect tool-call payloads.

OpenAI
Identity Primitive (Stage 2): Assistants API: function-call permissions, structured outputs. Agents SDK. GA.
Enforcement Control (Stage 2): Agents SDK guardrails, input/output validation. GA.
Isolation Primitive (Stage 3): Agents SDK Python sandbox. Beta (API and defaults subject to change before GA per OpenAI docs). TypeScript sandbox confirmed, not shipped.
Gap as of April 2026: No cross-provider identity federation. Agent memory forensics limited to session scope. No kill switch API. No MCP tool-description inspection.

AWS
Identity Primitive (Stage 2): Bedrock model invocation logging. IAM policies for model access. CloudTrail for agent API calls. GA.
Enforcement Control (Stage 2): Bedrock Guardrails for content filtering. Lambda resource policies for agent functions. GA.
Isolation Primitive (Stage 3): Lambda isolation per agent function. GA. Bedrock agent-level sandboxing on roadmap, not shipped.
Gap as of April 2026: No unified agent control plane across Bedrock + SageMaker + Lambda. No agent identity standard. Guardrails do not inspect MCP tool descriptions.

Status as of April 15, 2026. GA = generally available. Preview/Beta = not production-hardened. The “Gap as of April 2026” entries reflect VentureBeat’s analysis of publicly documented capabilities; gaps may narrow as vendors ship updates.

No provider in this grid ships a complete stage-three stack today. Most enterprises assemble isolation from existing cloud building blocks. That is a defensible choice if it is a deliberate one. Waiting for a vendor to close the gap without acknowledging the gap is not a strategy.

The grid above covers hyperscaler-native SDKs. A large segment of AI builders deploys through open-source orchestration frameworks like LangChain, CrewAI, and LlamaIndex that bypass hyperscaler IAM entirely. These frameworks lack native stage-two primitives. There is no scoped agent identity, no tool-call approval workflow, and no built-in audit trails. Enterprises running agents through open-source orchestration need to layer enforcement and isolation on top, not assume the framework provides it.

VentureBeat’s survey quantifies the pressure. Policy enforcement consistency grew from 39.5% to 46% between January and February, the largest consistent gain of any capability criterion. Enterprises running agents across OpenAI, Anthropic, and Azure need enforcement that works the same way regardless of which model executes the task. Provider-native controls enforce policy within that provider’s runtime only. Open-source orchestration frameworks enforce it nowhere.

One counterargument deserves acknowledgment: not every agent deployment needs stage three. A read-only summarization agent with no tool access and no write permissions may rationally stop at stage one.
The sequencing failure this audit addresses is not that monitoring exists. It is that enterprises running agents with write access, shared credentials, and agent-to-agent delegation are treating monitoring as sufficient. For those deployments, stage one is not a strategy. It is a gap.

Allianz shows stage-three in production

Allianz, one of the world’s largest insurance and asset management companies, is running Claude Managed Agents across insurance workflows, with Claude Code deployed to technical teams and a dedicated AI logging system for regulatory transparency, per Anthropic’s April 8 announcement. Asana, Rakuten, Sentry, and Notion are in production on the same beta. Stage-three isolation, per-agent permissioning, and execution-chain auditability are deployable now, not roadmap. The gating question is whether the enterprise has sequenced the work to use them.

The 90-day remediation sequence

Days 1–30: Inventory and baseline. Map every agent to a named owner. Log all tool calls. Revoke shared API keys. Deploy read-only monitoring across all agent API traffic. Run mcp-scan against every registered MCP server. CrowdStrike detects 1,800 AI applications across enterprise endpoints; your inventory should be equally comprehensive. Output: agent registry with permission matrix, MCP scan report.

Days 31–60: Enforce and scope. Assign scoped identities to every agent. Deploy tool-call approval workflows for write operations. Integrate agent activity logs into existing SIEM. Run a tabletop exercise: What happens when an agent spawns an agent? Conduct a canary-token test from the prescriptive matrix. Output: IAM policy set, approval workflow, SIEM integration, canary-token test results.

Days 61–90: Isolate and test. Sandbox high-risk agent workloads (PHI, PII, financial transactions). Enforce per-session least privilege. Require human sign-off for agent-to-agent delegation. Red-team the isolation boundary using the stage-three detection test from the matrix. Output: sandboxed execution environment, red-team report, board-ready risk summary with regulatory exposure mapped to HIPAA tier and FINRA guidance.

What changes in the next 30 days

EU AI Act Article 14 human-oversight obligations take effect August 2, 2026. Programs without named owners and execution trace capability face enforcement, not operational risk.

Anthropic’s Claude Managed Agents is in public beta at $0.08 per session-hour. GA timing, production SLAs, and final pricing have not been announced.

OpenAI Agents SDK ships TypeScript support for sandbox and harness capabilities in a future release, per the company’s April 15 announcement. Stage-three sandbox becomes available to JavaScript agent stacks when it ships.

What the sequence requires

McKinsey’s 2026 AI Trust Maturity Survey pegs the average enterprise at 2.3 out of 4.0 on its RAI maturity model, up from 2.0 in 2025 but still an enforcement-stage number; only one-third of the ~500 organizations surveyed report maturity levels of three or higher in governance. Seventy percent have not finished the transition to stage three. ARMO’s progressive enforcement methodology gives you the path: behavioral profiles in observation, permission baselines in selective enforcement, and full least privilege once baselines stabilize. Monitoring investment was not wasted. It was stage one of three. The organizations stuck in the data treated it as the destination.

The budget data makes the constraint explicit. The share of enterprises reporting flat AI security budgets doubled from 7.9% in January to 16% in February in VentureBeat's survey, with the March directional reading at 20%. Organizations expanding agent deployments without increasing security investment are accumulating security debt at machine speed. Meanwhile, the share reporting no agent security tooling at all fell from 13% in January to 5% in March.
Progress, but one in twenty enterprises running agents in production still has zero dedicated security infrastructure around them.

About this research

Total qualified respondents: 108. VentureBeat Pulse AI Security and Trust is a three-wave VentureBeat survey run January 6 through March 15, 2026. Qualified sample (organizations 100+ employees): January n=38, February n=50, March n=20. Primary analysis runs from January to February; March is directional. Industry mix: Tech/Software 52.8%, Financial Services 10.2%, Healthcare 8.3%, Education 6.5%, Telecom/Media 4.6%, Manufacturing 4.6%, Retail 3.7%, other 9.3%. Seniority: VP/Director 34.3%, Manager 29.6%, IC 22.2%, C-Suite 9.3%.
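The stage-three detection test in the prescriptive matrix says a spawned child agent should inherit zero permissions and need explicit human approval for each capability grant. A minimal sketch of that rule follows, assuming a simple in-memory permission model; the `Agent` class, the capability strings, and the `reviewer` stand-in for a human approver are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


class GrantDenied(Exception):
    """Raised when the approver rejects a capability grant."""


@dataclass
class Agent:
    name: str
    permissions: Set[str] = field(default_factory=set)


def spawn_child(parent: Agent, name: str) -> Agent:
    """Create a child that inherits nothing from its parent."""
    return Agent(name=f"{parent.name}/{name}")


def grant(agent: Agent, capability: str,
          approve: Callable[[str, str], bool]) -> None:
    """Add one capability, but only after an explicit approval decision."""
    if not approve(agent.name, capability):
        raise GrantDenied(f"{capability} denied for {agent.name}")
    agent.permissions.add(capability)


# Stand-in for a human reviewer; a real gate would prompt a person.
def reviewer(agent_name: str, capability: str) -> bool:
    return capability == "tickets:read"


parent = Agent("agent-a", {"db:write", "policy:admin"})
child = spawn_child(parent, "agent-b")
grant(child, "tickets:read", reviewer)
print(child.permissions)
```

The design choice worth noting is that `spawn_child` never reads the parent's permission set at all, so there is no inheritance path to misconfigure. A request for `policy:admin` on the child raises `GrantDenied` under this reviewer, which is the outcome a red team would check for.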

Anthropic just launched Claude Design, an AI tool that turns prompts into prototypes and challenges Figma

4/17/2026

Anthropic today launched Claude Design, a new product from its Anthropic Labs division that allows users to create polished visual work — designs, interactive prototypes, slide decks, one-pagers, and marketing collateral — through conversational prompts and fine-grained editing controls. The release, available immediately in research preview to all paid Claude subscribers, is the company's most aggressive expansion beyond its core language model business and into the application layer that has historically belonged to companies like Figma, Adobe, and Canva. Claude Design is powered by Claude Opus 4.7, Anthropic's most capable generally available vision model, which the company also released today. Anthropic says it is rolling access out gradually throughout the day to Claude Pro, Max, Team, and Enterprise subscribers. The simultaneous launches mark a watershed for Anthropic, whose ambitions now visibly extend from foundation model provider to full-stack product company — one that wants to own the arc from a rough idea to a shipped product. The timing is also significant: Anthropic hit roughly $20 billion in annualized revenue in early March 2026, according to Bloomberg, up from $9 billion at the end of 2025 — and surpassed $30 billion by early April 2026. The company is in early talks with Goldman Sachs, JPMorgan, and Morgan Stanley about a potential IPO that could come as early as October 2026. How Claude Design turns a text prompt into a working prototype The product follows a workflow that Anthropic has designed to feel like a natural creative conversation. Users describe what they need, and Claude generates a first version. From there, refinement happens through a combination of channels: chat-based conversation, inline comments on specific elements, direct text editing, and custom adjustment sliders that Claude itself generates to let users tweak spacing, color, and layout in real time. 
During onboarding, Claude reads a team's codebase and design files and builds a design system — colors, typography, and components — that it automatically applies to every subsequent project. Teams can refine the system over time and maintain more than one. The import surface is broad: users can start from a text prompt, upload images and documents in various formats, or point Claude at their codebase. A web capture tool grabs elements directly from a live website so prototypes look like the real product. What distinguishes Claude Design from the wave of AI design experiments that have proliferated in the past year is the handoff mechanism. When a design is ready to build, Claude packages everything into a handoff bundle that can be passed to Claude Code with a single instruction. That creates a closed loop — exploration to prototype to production code — all within Anthropic's ecosystem. The export options acknowledge that not everyone's next step is Claude Code: users can also share designs as an internal URL within their organization, save as a folder, or export to Canva, PDF, PPTX, or standalone HTML files. Anthropic points to Brilliant, the education technology company known for intricate interactive lessons, as an early proof point. The company's senior product designer reported that the most complex pages required 20 or more prompts to recreate in competing tools but needed only 2 in Claude Design. The Brilliant team then turned static mockups into interactive prototypes they could share and user-test without code review, and handed everything — including the design intent — to Claude Code for implementation. Datadog's product team described a similar shift, compressing what had been a week-long cycle of briefs, mockups, and review rounds into a single conversation. 
Why Anthropic's chief product officer just resigned from Figma's board

The launch arrives against a backdrop that makes Anthropic's claim of complementarity with existing design tools difficult to take entirely at face value. Mike Krieger, Anthropic's chief product officer, resigned from the board of Figma on April 14, the same day The Information reported Anthropic's next model would include design tools that could compete with Figma's primary offering.

Figma has collaborated closely with Anthropic to integrate the frontier lab's AI models into its products. Just two months ago, in February, Figma launched "Code to Canvas," a feature that converts code generated in AI tools like Claude Code into fully editable designs inside Figma, creating a bridge between AI coding tools and Figma's design process. The partnership felt like a mutual bet that AI would make design more essential, not less. Claude Design complicates that narrative significantly.

Anthropic's position, based on VentureBeat's background conversations with the company, is that Claude Design is built around interoperability and is meant to meet teams where they already work, not replace incumbent tools. The company points to the Canva export, PPTX and PDF support, and plans to make it easier for other tools to connect via the Model Context Protocol (MCP) as evidence of that philosophy. Anthropic is also making it possible for other tools to build integrations with Claude Design, a move clearly designed to preempt accusations of walled-garden ambitions.

But the market read the signals differently. The structural tension is clear: Figma commands an estimated 80 to 90% market share in UI and UX design, according to The Next Web. Both Figma and Adobe assume a trained designer is in the loop. Anthropic's tool does not. Claude Design is not merely another AI copilot embedded in an existing design application.
It is a standalone product that generates complete, interactive prototypes from natural language — accessible to founders, product managers, and marketers who have never opened Figma. The expansion of the design user base to non-designers is the real competitive threat, even if the professional designer's workflow remains anchored in Figma for now. Inside Claude Opus 4.7, the model Anthropic deliberately made less dangerous The model powering Claude Design is itself a significant story. Claude Opus 4.7 is Anthropic's most capable generally available model, with notable improvements over its predecessor Opus 4.6 in software engineering, instruction following, and vision — but it is intentionally less capable than Anthropic's most powerful offering, Claude Mythos Preview, the model the company announced earlier this month as too dangerous for broad release due to its cybersecurity capabilities. That dual-track approach — one model for the public, one model locked behind a vetted-access program — is unprecedented in the AI industry. Anthropic used Claude Mythos Preview to identify thousands of zero-day vulnerabilities in every major operating system and web browser, as reported by multiple outlets. The Project Glasswing initiative that houses Mythos brings together Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, Nvidia, and Palo Alto Networks as launch partners. Opus 4.7 sits a deliberate step below Mythos. Anthropic stated in its release that it "experimented with efforts to differentially reduce" the new model's cyber capabilities during training and ships it with safeguards that automatically detect and block requests indicating prohibited or high-risk cybersecurity uses. What Anthropic learns from those real-world safeguards will inform the eventual goal of broader release for Mythos-class models. 
For security professionals with legitimate needs, the company has created a new Cyber Verification Program. On benchmarks, the model posts strong numbers. Opus 4.7 reached 64.3% on SWE-bench Pro, and on Anthropic's internal 93-task coding benchmark, it delivered a 13% resolution improvement over Opus 4.6, including solving four tasks that neither Opus 4.6 nor Sonnet 4.6 could crack. The vision improvements are substantial and directly relevant to Claude Design: Opus 4.7 can accept images up to 2,576 pixels on the long edge — roughly 3.75 megapixels, more than three times the resolution of prior Claude models. Early access partner XBOW, the autonomous penetration testing company, reported that the new model scored 98.5% on their visual-acuity benchmark versus 54.5% for Opus 4.6. Meanwhile, Bloomberg reported that the White House is preparing to make a version of Mythos available to major federal agencies, with the Office of Management and Budget setting up protections for Cabinet departments — a sign that the government views the model's capabilities as too important to leave solely in private hands. What enterprise buyers need to know about data privacy and pricing For enterprise and regulated-industry buyers, the data handling architecture of Claude Design will be a critical evaluation criterion. Based on VentureBeat's exclusive background discussions with Anthropic, the system stores the design-system representation it generates — not the source files themselves. When users link a local copy of their code, it is not uploaded to or stored on Anthropic's servers. The company is also adding the ability to connect directly to GitHub. Anthropic states unequivocally that it does not train on this data. For Enterprise customers, Claude Design is off by default — administrators choose whether to enable it and control who has access. 
On pricing, Claude Design is included at no additional cost with Pro, Max, Team, and Enterprise plans, using existing subscription limits with optional extra usage beyond those caps. Opus 4.7 holds the same API pricing as its predecessor: $5 per million input tokens and $25 per million output tokens. The pricing strategy mirrors the approach Anthropic took with Claude Code, which launched as a bundled feature and rapidly grew into a major revenue driver. Anthropic's reasoning is straightforward: the best way to learn what people will build with a new product category is to put it in their hands, then build monetization around demonstrated value. Anthropic is also being transparent about the product's limitations. The design system import works best with a clean codebase; messy source code produces messy output. Collaboration is basic and not yet fully multiplayer. The editing experience has rough edges. There is no general availability date, and Anthropic says that is intentional — it will let the product and user feedback determine when Claude Design is ready for prime time. Anthropic's bet that owning the full creative stack is worth the risk Claude Design is the most visible expression of a trend that has been accelerating for months: the major AI labs are moving up the stack from model providers into full application builders, directly entering categories previously owned by established software companies. Anthropic now offers a coding agent (Claude Code), a knowledge-work assistant (Claude Cowork), desktop computer control, office integrations for Word, Excel, and PowerPoint, a browser agent in Chrome, and now a design tool. Each product reinforces the others. A designer can explore concepts in Claude Design, export a prototype, hand it to Claude Code for implementation, and have Claude Cowork manage the review cycle — all within Anthropic's platform. The financial momentum behind this expansion is staggering. 
Anthropic has received investor offers valuing the company at approximately $800 billion, according to Reuters, more than doubling its $380 billion valuation from a funding round closed just two months ago. But building an application empire while simultaneously navigating an AI safety reputation, an impending IPO, growing public hostility toward the technology, and the diplomatic fallout of competing with your own partners is a balancing act that no technology company has attempted at this scale or speed. When Figma launched Code to Canvas in February, the implicit promise was that AI coding tools and design tools would grow together, each making the other more valuable. Two months later, Anthropic's chief product officer has left Figma's board, and the company has shipped a product that lets anyone who can type a sentence create the kind of interactive prototype that once required years of design training and a Figma license. The partnership may survive. But the power dynamic just changed — and in the AI industry, that tends to be the only kind of change that matters.

Should my enterprise AI agent do that? NanoClaw and Vercel launch easier agentic policy setting and approval dialogs across 15 messaging apps

4/17/2026

For the past year, early adopters of autonomous AI agents have been forced to play a murky game of chance: keep the agent in a useless sandbox or give it the keys to the kingdom and hope it doesn't hallucinate a catastrophic "delete all" command. To unlock the true utility of an agent (scheduling meetings, triaging emails, or managing cloud infrastructure), users have had to grant these models raw API keys and broad permissions, raising the risk of their systems being disrupted by an accidental agent mistake.

That tradeoff ends today. The creators of the open source sandboxed NanoClaw agent framework, now operating as a private startup named NanoCo, have announced a landmark partnership with Vercel and OneCLI to introduce a standardized, infrastructure-level approval system. By integrating Vercel’s Chat SDK and OneCLI’s open source credentials vault, NanoClaw 2.0 ensures that no sensitive action occurs without explicit human consent, delivered natively through the messaging apps where users already live.

The specific use cases that stand to benefit most are those involving high-consequence "write" actions. In DevOps, for example, an agent could propose a cloud infrastructure change that only goes live once a senior engineer taps "Approve" in Slack. For finance teams, an agent could prepare batch payments or invoice triaging, with the final disbursement requiring a human signature via a WhatsApp card.

Technology: security by isolation

The fundamental shift in NanoClaw 2.0 is the move away from "application-level" security to "infrastructure-level" enforcement. In traditional agent frameworks, the model itself is often responsible for asking for permission, a flow that Gavriel Cohen, co-founder of NanoCo, describes as inherently flawed. "The agent could potentially be malicious or compromised," Cohen noted in a recent interview. "If the agent is generating the UI for the approval request, it could trick you by swapping the 'Accept' and 'Reject' buttons."
NanoClaw solves this by running agents in strictly isolated Docker or Apple Containers. The agent never sees a real API key; instead, it uses "placeholder" keys. When the agent attempts an outbound request, the request is intercepted by the OneCLI Rust Gateway. The gateway checks a set of user-defined policies (e.g., "Read-only access is okay, but sending an email requires approval"). If the action is sensitive, the gateway pauses the request and triggers a notification to the user. Only after the user approves does the gateway inject the real, encrypted credential and allow the request to reach the service. Product: bringing the 'human' into the loop While security is the engine, Vercel’s Chat SDK is the dashboard. Integrating with different messaging platforms is notoriously difficult because every app—Slack, Teams, WhatsApp, Telegram—uses different APIs for interactive elements like buttons and cards. By leveraging Vercel’s unified SDK, NanoClaw can now deploy to 15 different channels from a single TypeScript codebase. When an agent wants to perform a protected action, the user receives a rich interactive card on their phone. "The approval shows up as a rich, native card right inside Slack or WhatsApp or Teams, and the user taps once to approve or deny," said Cohen. This "seamless UX" is what makes human-in-the-loop oversight practical rather than a productivity bottleneck. The full list of 15 supported messaging apps/channels contains many favored by enterprise knowledge workers, including: Slack WhatsApp Telegram Microsoft Teams Discord Google Chat iMessage Facebook Messenger Instagram X (Twitter) GitHub Linear Matrix Email Webex Background on NanoClaw NanoClaw launched on January 31, 2026, as a minimalist and security-focused response to the "security nightmare" inherent in complex, non-sandboxed agent frameworks. 
Created by Cohen, a former Wix.com engineer, and marketed by his brother Lazer, CEO of B2B tech public relations firm Concrete Media, the project was designed to solve the auditability crisis found in competing platforms like OpenClaw, which had grown to nearly 400,000 lines of code. By contrast, NanoClaw condensed its core logic into roughly 500 lines of TypeScript—a size that, according to VentureBeat, allows the entire system to be audited by a human or a secondary AI in approximately eight minutes. The platform’s primary technical defense is its use of operating system-level isolation. Every agent is placed inside an isolated Linux container—utilizing Apple Containers for high performance on macOS or Docker for Linux—to ensure that the AI only interacts with directories explicitly mounted by the user. As detailed in VentureBeat's reporting on the project's infrastructure, this approach confines the "blast radius" of potential prompt injections strictly to the container and its specific communication channel. In March 2026, NanoClaw further matured this security posture through an official partnership with the software container firm Docker to run agents inside "Docker Sandboxes". This integration utilizes MicroVM-based isolation to provide an enterprise-ready environment for agents that, by their nature, must mutate their environments by installing packages, modifying files, and launching processes—actions that typically break traditional container immutability assumptions. Operationally, NanoClaw rejects the traditional "feature-rich" software model in favor of a "Skills over Features" philosophy. Instead of maintaining a bloated main branch with dozens of unused modules, the project encourages users to contribute "Skills"—modular instructions that teach a local AI assistant how to transform and customize the codebase for specific needs, such as adding Telegram or Gmail support. 
This methodology, as described on NanoClaw's website and in VentureBeat interviews, ensures that users only maintain the exact code required for their specific implementation. Furthermore, the framework natively supports "Agent Swarms" via the Anthropic Agent SDK, allowing specialized agents to collaborate in parallel while maintaining isolated memory contexts for different business functions. Licensing and open source strategy NanoClaw remains firmly committed to the open source MIT License, encouraging users to fork the project and customize it for their own needs. This stands in stark contrast to "monolithic" frameworks. NanoClaw’s codebase is remarkably lean, consisting of only 15 source files and roughly 3,900 lines of code, compared to the hundreds of thousands of lines found in competitors like OpenClaw. The partnership also highlights the strength of the "Open Source Avengers" coalition. By combining NanoClaw (agent orchestration), Vercel Chat SDK (UI/UX), and OneCLI (security/secrets), the project demonstrates that modular, open-source tools can outpace proprietary labs in building the application layer for AI. Community reactions As shown on the NanoClaw website, the project has amassed more than 27,400 stars on GitHub and maintains an active Discord community. A core claim on the NanoClaw site is that the codebase is small enough to understand in "8 minutes," a feature targeted at security-conscious users who want to audit their assistant. In an interview, Cohen noted that iMessage support via Vercel’s Photon project addresses a common community hurdle: previously, users often had to maintain a separate Mac Mini to connect agents to an iMessage account. The enterprise perspective: should you adopt? For enterprises, NanoClaw 2.0 represents a shift from speculative experimentation to safe operationalization. Historically, IT departments have blocked agent usage due to the "all-or-nothing" nature of credential access. 
By decoupling the agent from the secret, NanoClaw provides a middle ground that mirrors existing corporate security protocols—specifically the principle of least privilege. Enterprises should consider this framework if they require high auditability and have strict compliance needs regarding data exfiltration. According to Cohen, many businesses have not been ready to grant agents access to calendars or emails because of security concerns. This framework addresses that by ensuring the agent structurally cannot act without permission. Enterprises stand to benefit specifically in use cases involving "high-stakes" actions. As illustrated in the OneCLI dashboard, a user can set a policy where an agent can read emails freely but must trigger a manual approval dialog to "delete" or "send" one. Because NanoClaw runs as a single Node.js process with isolated containers, it allows enterprise security teams to verify that the gateway is the only path for outbound traffic. This architecture transforms the AI from an unmonitored operator into a supervised junior staffer, providing the productivity of autonomous agents without forgoing executive control. Ultimately, NanoClaw suits organizations that want the productivity of autonomous agents without the "black box" risk of traditional LLM wrappers. It turns the AI from a potentially rogue operator into a highly capable junior staffer who always asks for permission before hitting the "send" or "buy" button. As AI-native setups become the standard, this partnership establishes the blueprint for how trust will be managed in the age of the autonomous workforce.
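The gateway flow described in this article (placeholder key, policy check, human approval, credential injection) can be illustrated with a minimal Python sketch. This is not OneCLI's or NanoClaw's code: the `POLICY` table, the `VAULT` mapping, the `Request` type and the `ask_human` callback are all hypothetical names chosen for illustration, and the real gateway is described as a Rust service with encrypted credentials.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"


@dataclass(frozen=True)
class Request:
    service: str  # e.g. "email"
    action: str   # e.g. "read", "send", "delete"
    auth: str     # the placeholder key the agent was given


# Hypothetical policy table. Anything not listed falls back to approval.
POLICY = {
    ("email", "read"): Decision.ALLOW,
    ("email", "send"): Decision.REQUIRE_APPROVAL,
    ("email", "delete"): Decision.REQUIRE_APPROVAL,
}

# Hypothetical vault: placeholder key -> real credential (never shown to the agent).
VAULT = {"placeholder-email-key": "real-secret-token"}


def gateway(request: Request,
            ask_human: Callable[[Request], bool]) -> Optional[Request]:
    """Return the request with the real credential injected, or None if blocked."""
    decision = POLICY.get((request.service, request.action),
                          Decision.REQUIRE_APPROVAL)
    if decision is Decision.REQUIRE_APPROVAL and not ask_human(request):
        return None  # denied: the request never leaves the gateway
    real_credential = VAULT.get(request.auth)
    if real_credential is None:
        return None  # unknown placeholder: nothing to inject
    # Only at this point does the real credential enter the outbound request.
    return Request(request.service, request.action, real_credential)
```

In this sketch the approval prompt is produced by the gateway side (`ask_human`), not by the agent, which is the property Cohen describes: a compromised agent cannot redraw the approval UI because it never generates it.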

Salesforce launches Headless 360 to turn its entire platform into infrastructure for AI agents

4/16/2026

Salesforce on Wednesday unveiled the most ambitious architectural transformation in its 27-year history, introducing "Headless 360" — a sweeping initiative that exposes every capability in its platform as an API, MCP tool, or CLI command so AI agents can operate the entire system without ever opening a browser. The announcement, made at the company's annual TDX developer conference in San Francisco, ships more than 100 new tools and skills immediately available to developers. It marks a decisive response to the existential question hanging over enterprise software: In a world where AI agents can reason, plan, and execute, does a company still need a CRM with a graphical interface? Salesforce's answer: No — and that's exactly the point. "We made a decision two and a half years ago: Rebuild Salesforce for agents," the company said in its announcement. "Instead of burying capabilities behind a UI, expose them so the entire platform will be programmable and accessible from anywhere." The timing is anything but coincidental. Salesforce finds itself navigating one of the most turbulent periods in enterprise software history — a sector-wide sell-off that has pushed the iShares Expanded Tech-Software Sector ETF down roughly 28% from its September peak. The fear driving the decline: that AI, particularly large language models from Anthropic, OpenAI, and others, could render traditional SaaS business models obsolete. Jayesh Govindarjan, EVP of Salesforce and one of the key architects behind the Headless 360 initiative, described the announcement as rooted not in marketing theory but in hard-won lessons from deploying agents with thousands of enterprise customers. "The problem that emerged is the lifecycle of building an agentic system for every one of our customers on any stack, whether it's ours or somebody else's," Govindarjan told VentureBeat in an exclusive interview. "The challenge that they face is very much the software development challenge. How do I build an agent? 
That's only step one." More than 100 new tools give coding agents full access to the Salesforce platform for the first time Salesforce Headless 360 rests on three pillars that collectively represent the company's attempt to redefine what an enterprise platform looks like in the agentic era. The first pillar — build any way you want — delivers more than 60 new MCP (Model Context Protocol) tools and 30-plus preconfigured coding skills that give external coding agents like Claude Code, Cursor, Codex, and Windsurf complete, live access to a customer's entire Salesforce org, including data, workflows, and business logic. Developers no longer need to work inside Salesforce's own IDE. They can direct AI coding agents from any terminal to build, deploy, and manage Salesforce applications. Agentforce Vibes 2.0, the company's own native development environment, now includes what it calls an "open agent harness" supporting both the Anthropic agent SDK and the OpenAI agents SDK. As demonstrated during the keynote, developers can choose between Claude Code and OpenAI agents depending on the task, with the harness dynamically adjusting available capabilities based on the selected agent. The environment also adds multi-model support, including Claude Sonnet and GPT-5, along with full org awareness from the start. A significant technical addition is native React support on the Salesforce platform. During the keynote demo, presenters built a fully functional partner service application using React — not Salesforce's own Lightning framework — that connected to org metadata via GraphQL while inheriting all platform security primitives. This opens up dramatically more expressive front-end possibilities for developers who want complete control over the visual layer. 
The second pillar — deploy on any surface — centers on the new Agentforce Experience Layer, which separates what an agent does from how it appears, rendering rich interactive components natively across Slack, mobile apps, Microsoft Teams, ChatGPT, Claude, Gemini, and any client supporting MCP apps. During the keynote, presenters defined an experience once and deployed it across six different surfaces without writing surface-specific code. The philosophical shift is significant: rather than pulling customers into a Salesforce UI, enterprises push branded, interactive agent experiences into whatever workspace their customers already inhabit. The third pillar — build agents you can trust at scale — introduces an entirely new suite of lifecycle management tools spanning testing, evaluation, experimentation, observation, and orchestration. Agent Script, the company's new domain-specific language for defining agent behavior deterministically, is now generally available and open-sourced. A new Testing Center surfaces logic gaps and policy violations before deployment. Custom Scoring Evals let enterprises define what "good" looks like for their specific use case. And a new A/B Testing API enables running multiple agent versions against real traffic simultaneously. Why enterprise customers kept breaking their own AI agents — and how Salesforce redesigned its tooling in response Perhaps the most technically significant — and candid — portion of VentureBeat's interview with Govindarjan addressed the fundamental engineering tension at the heart of enterprise AI: agents are probabilistic systems, but enterprises demand deterministic outcomes. Govindarjan explained that early Agentforce customers, after getting agents into production through "sheer hard work," discovered a painful reality. "They were afraid to make changes to these agents, because the whole system was brittle," he said. "You make one change and you don't know whether it's going to work 100% of the time. 
All the testing you did needs to be redone." This brittleness problem drove the creation of Agent Script, which Govindarjan described as a programming language that "brings together the determinism that's in programming languages with the inherent flexibility in probabilistic systems that LLMs provide." The language functions as a single flat file — versionable, auditable — that defines a state machine governing how an agent behaves. Within that machine, enterprises specify which steps must follow explicit business logic and which can reason freely using LLM capabilities. Salesforce open-sourced Agent Script this week, and Govindarjan noted that Claude Code can already generate it natively because of its clean documentation. The approach stands in sharp contrast to the "vibe coding" movement gaining traction elsewhere in the industry. As the Wall Street Journal recently reported, some companies are now attempting to vibe-code entire CRM replacements — a trend Salesforce's Headless 360 directly addresses by making its own platform the most agent-friendly substrate available. Govindarjan described the tooling as a product of Salesforce's own internal practice. "We needed these tools to make our customers successful. Then our FDEs needed them. We hardened them, and then we gave them to our customers," he told VentureBeat. In other words, Salesforce productized its own pain. Inside the two competing AI agent architectures Salesforce says every enterprise will need Govindarjan drew a revealing distinction between two fundamentally different agentic architectures emerging in the enterprise — one for customer-facing interactions and one he linked to what he called the "Ralph Wiggum loop." Customer-facing agents — those deployed to interact with end customers for sales or service — demand tight deterministic control. 
"Before customers are willing to put these agents in front of their customers, they want to make sure that it follows a certain paradigm — a certain brand set of rules," Govindarjan told VentureBeat. Agent Script encodes these as a static graph — a defined funnel of steps with LLM reasoning embedded within each step. The "Ralph Wiggum loop," by contrast, represents the opposite end of the spectrum: a dynamic graph that unrolls at runtime, where the agent autonomously decides its next step based on what it learned in the previous step, killing dead-end paths and spawning new ones until the task is complete. This architecture, Govindarjan said, manifests primarily in employee-facing scenarios — developers using coding agents, salespeople running deep research loops, marketers generating campaign materials — where an expert human reviews the output before it ships. "Ralph Wiggum loops are great for employee-facing because employees are, in essence, experts at something," Govindarjan explained. "Developers are experts at development, salespeople are experts at sales." The critical technical insight: both architectures run on the same underlying platform and the same graph engine. "This is a dynamic graph. This is a static graph," he said. "It's all a graph underneath." That unified runtime — spanning the spectrum from tightly controlled customer interactions to free-form autonomous loops — may be Salesforce's most important technical bet, sparing enterprises from maintaining separate platforms for different agent modalities. Salesforce hedges its bets on MCP while opening its ecosystem to every major AI model and tool Salesforce's embrace of openness at TDX was striking. The platform now integrates with OpenAI, Anthropic, Google Gemini, Meta's LLaMA, and Mistral AI models. The open agent harness supports third-party agent SDKs. MCP tools work from any coding environment. 
And the new AgentExchange marketplace unifies 10,000 Salesforce apps, 2,600-plus Slack apps, and 1,000-plus Agentforce agents, tools, and MCP servers from partners including Google, Docusign, and Notion, backed by a new $50 million AgentExchange Builders Initiative. Yet Govindarjan offered a surprisingly candid assessment of MCP itself — the protocol Anthropic created that has become a de facto standard for agent-tool communication. "To be very honest, not at all sure" that MCP will remain the standard, he told VentureBeat. "When MCP first came along as a protocol, a lot of us engineers felt that it was a wrapper on top of a really well-written CLI — which now it is. A lot of people are saying that maybe CLI is just as good, if not better." His approach: pragmatic flexibility. "We're not wedded to one or the other. We just use the best, and often we will offer all three. We offer an API, we offer a CLI, we offer an MCP." This hedging explains the "Headless 360" naming itself — rather than betting on a single protocol, Salesforce exposes every capability across all three access patterns, insulating itself against protocol shifts. Engine, the B2B travel management company featured prominently in the keynote demos, offered a real-world proof point for the open ecosystem approach. The company built its customer service agent, Ava, in 12 days using Agentforce and now handles 50% of customer cases autonomously. Engine runs five agents across customer-facing and employee-facing functions, with Data 360 at the heart of its infrastructure and Slack as its primary workspace. "CSAT goes up, costs to deliver go down. Customers are happier. We're getting them answers faster. What's the trade off? There's no trade off," an Engine executive said during the keynote. Underpinning all of it is a shift in how Salesforce gets paid. 
The company is moving from per-seat licensing to consumption-based pricing for Agentforce — a transition Govindarjan described as "a business model change and innovation for us." It's a tacit acknowledgment that when agents, not humans, are doing the work, charging per user no longer makes sense. Salesforce isn't defending the old model — it's dismantling it and betting the company on what comes next Govindarjan framed the company's evolution in architectural terms. Salesforce has organized its platform around four layers: a system of context (Data 360), a system of work (Customer 360 apps), a system of agency (Agentforce), and a system of engagement (Slack and other surfaces). Headless 360 opens every layer via programmable endpoints. "What you saw today, what we're doing now, is we're opening up every single layer, right, with MCP tools, so we can go build the agentic experiences that are needed," Govindarjan told VentureBeat. "I think you're seeing a company transforming itself." Whether that transformation succeeds will depend on execution across thousands of customer deployments, the staying power of MCP and related protocols, and the fundamental question of whether incumbent enterprise platforms can move fast enough to remain relevant when AI agents can increasingly build new systems from scratch. The software sector's bear market, the financial pressures bearing down on the entire industry, and the breathtaking pace of LLM improvement all conspire to make this one of the highest-stakes bets in enterprise technology. But there is an irony embedded in Salesforce's predicament that Headless 360 makes explicit. The very AI capabilities that threaten to displace traditional software are the same capabilities that Salesforce now harnesses to rebuild itself. Every coding agent that could theoretically replace a CRM is now, through Headless 360, a coding agent that builds on top of one. The company is not arguing that agents won't change the game. 
It's arguing that decades of accumulated enterprise data, workflows, trust layers, and institutional logic give it something no coding agent can generate from a blank prompt. As Benioff declared on CNBC's Mad Money in March: "The software industry is still alive, well and growing." Headless 360 is his company's most forceful attempt to prove him right — by tearing down the walls of the very platform that made Salesforce famous and inviting every agent in the world to walk through the front door. Parker Harris, Salesforce's co-founder, captured the bet most succinctly in a question he posed last month: "Why should you ever log into Salesforce again?" If Headless 360 works as designed, the answer is: You shouldn't have to. And that, Salesforce is wagering, is precisely what will keep you paying for it.
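The "static graph" idea Govindarjan describes, a fixed funnel of steps where some follow explicit business logic and others delegate to an LLM, can be sketched in a few lines of Python. This is not Agent Script syntax and not Salesforce's graph engine; the step names, the `make_graph` helper and the `llm` callable are hypothetical, and the model is stubbed out so the example stays self-contained.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, str]
Step = Callable[[State], Optional[str]]  # returns the next step name, or None to stop


def make_graph(llm: Callable[[str], str]) -> Dict[str, Step]:
    """Build a fixed graph: the edges are hard-coded, only one step calls the model."""

    def verify_identity(state: State) -> Optional[str]:
        # Deterministic business rule: no account ID, no automated reply.
        return "draft_reply" if state.get("account_id") else "escalate"

    def draft_reply(state: State) -> Optional[str]:
        # Free-form reasoning is confined to this single step.
        state["reply"] = llm("Answer this customer question: " + state["question"])
        return None

    def escalate(state: State) -> Optional[str]:
        state["reply"] = "Routing you to a human agent."
        return None

    return {
        "verify_identity": verify_identity,
        "draft_reply": draft_reply,
        "escalate": escalate,
    }


def run(graph: Dict[str, Step], start: str, state: State,
        max_steps: int = 10) -> List[str]:
    """Walk the graph from `start`, returning the names of the steps visited."""
    path: List[str] = []
    current: Optional[str] = start
    while current is not None and len(path) < max_steps:
        path.append(current)
        current = graph[current](state)
    return path
```

Because the whole graph is one plain data structure, it can be versioned and reviewed like any other file, which is the property the article attributes to Agent Script's single flat file. A dynamic, "Ralph Wiggum"-style loop would instead let the model choose the next step name at runtime.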

Are we getting what we paid for? How to turn AI momentum into measurable value

4/16/2026

Enterprise AI is entering a new phase — one where the central question is no longer what can be built, but how to make the most of our AI investment. At VentureBeat’s latest AI Impact Tour session, Brian Gracely, director of portfolio strategy at Red Hat, described the operational reality inside large organizations: AI sprawl, rising inference costs, and limited visibility into what those investments are actually returning. It’s the “Day 2” moment — when pilots give way to production, and cost, governance, and sustainability become harder than building the system in the first place. "We've seen customers who say, 'I have 50,000 licenses of Copilot. I don't really know what people are getting out of that. But I do know that I'm paying for the most expensive computing in the world, because it's GPUs,'" Gracely said. "'How am I going to get that under control?'" Why enterprise AI costs are now a board-level problem For much of the past two years, cost was not the primary concern for organizations evaluating generative AI. The experimental phase gave teams cover to spend freely, and the promise of productivity gains justified aggressive investment, but that dynamic is shifting as enterprises enter their second and third budget cycles with AI. The focus has moved from "can we build something?" to "are we getting what we paid for?" Enterprises that made large, early bets on managed AI services are conducting hard reviews of whether those investments are delivering measurable value. The issue isn’t just that GPU computing is expensive. It is that many organizations lack the instrumentation to connect spending to outcomes, making it nearly impossible to justify renewals or scale responsibly. The strategic shift from token consumer to token producer The dominant AI procurement model of the past few years has been straightforward: pay a vendor per token, per seat, or per API call, and let someone else manage the infrastructure. 
That model made sense as a starting point, but enterprises that have been through one AI cycle now have enough experience to compare alternatives, and they are starting to rethink it. "Instead of being purely a token consumer, how can I start being a token generator?" Gracely said. "Are there use cases and workloads that make sense for me to own more? It may mean operating GPUs. It may mean renting GPUs. And then asking, 'Does that workload need the greatest state-of-the-art model? Are there more capable open models or smaller models that fit?'" The decision is not binary. The right answer depends on the workload, the organization, and the risk tolerance involved, but the math is getting more complicated as the number of capable open models, from DeepSeek to models now available through cloud marketplaces, grows. Enterprises now have real alternatives to the handful of providers that dominated the landscape two years ago. Falling AI costs and rising usage create a paradox for enterprise budgets Some enterprise leaders argue that locking into infrastructure investments now could mean significantly overpaying in the long run, pointing to the statement from Anthropic CEO Dario Amodei that AI inference costs are declining roughly 60% per year. Over the last three years, the emergence of open-source models such as DeepSeek has meaningfully expanded the strategic options available to enterprises that are willing to invest in the underlying infrastructure. But while costs per token are falling, usage is accelerating at a pace that more than offsets efficiency gains. It's a version of Jevons Paradox, the economic principle that improvements in resource efficiency tend to increase total consumption rather than reduce it, as lower cost enables broader adoption. For enterprise budget planners, this means declining unit costs do not translate into declining total bills. 
An organization that triples its AI usage while costs fall by half still ends up spending 50% more than it did before. The consideration becomes which workloads genuinely require the most capable and most expensive models, and which can be handled just fine by smaller, cheaper alternatives. The business case for investing in AI infrastructure flexibility The prescription isn't to slow down AI investment, but to build with flexibility top of mind. The organizations that will win aren't necessarily the ones that move fastest or spend the most; they're the ones building infrastructure and operating models capable of absorbing the next unexpected development. "The more you can build some abstractions and give yourself some flexibility, the more you can experiment without running up costs, but also without jeopardizing your business. Those are as important as asking whether you're doing everything best practice right now," Gracely explained. But despite how entrenched AI discussions have become in enterprise planning cycles, the practical experience most organizations have is still measured in years, not decades. "It feels like we've been doing this forever. We've been doing this for three years," Gracely added. "It's early and it's moving really fast. You don't know what's coming next. But the characteristics of what's coming next — you should have some sense of what that looks like." For enterprise leaders still calibrating their AI investment strategies, that may be the most actionable takeaway: the goal is not to optimize for today's cost structure, but to build the organizational and technical flexibility to adapt when, not if, it changes again.
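The budget arithmetic behind the Jevons Paradox point in this article can be checked with a short Python sketch. The function names and the 100,000 baseline are illustrative assumptions, not figures from Red Hat or the session; the only inputs taken from the article are "usage triples" and "costs fall by half".

```python
def total_spend(baseline_spend: float,
                usage_multiplier: float,
                unit_cost_multiplier: float) -> float:
    """New total bill = old bill x change in usage x change in unit cost."""
    return baseline_spend * usage_multiplier * unit_cost_multiplier


def break_even_usage(unit_cost_multiplier: float) -> float:
    """Usage growth at which the total bill stays flat for a given unit-cost change."""
    return 1.0 / unit_cost_multiplier


# Article's example: usage triples (x3) while unit cost halves (x0.5).
new_bill = total_spend(100_000, 3, 0.5)   # 150000.0, i.e. 50% higher than before
flat_at = break_even_usage(0.5)           # 2.0: usage may only double before the bill grows
```

The same function shows why a steep annual price decline does not guarantee a smaller bill: the total only falls when usage grows by less than the reciprocal of the unit-cost multiplier.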

OpenAI debuts GPT-Rosalind, a new limited access model for life sciences, and broader Codex plugin on Github

4/16/2026

The journey from a laboratory hypothesis to a pharmacy shelf is one of the most grueling marathons in modern industry, typically spanning 10 to 15 years and billions of dollars in investment. Progress is often stymied not just by the inherent mysteries of biology, but by the "fragmented and difficult to scale" workflows that force researchers to manually pivot between the actual experimental design equipment, software, and databases. OpenAI is now releasing a new specialized model, GPT-Rosalind, to speed up this process and make it more efficient, easier and, ideally, more productive. Named after the pioneering chemist Rosalind Franklin, whose work was vital to the discovery of DNA’s structure (and who was often overlooked in favor of her male colleagues James Watson and Francis Crick), this new frontier reasoning model is purpose-built to act as a specialized intelligence layer for life sciences research. By shifting AI’s role from a general-purpose assistant to a domain-specific "reasoning" partner, OpenAI is signaling a long-term commitment to biological and chemical discovery. What GPT-Rosalind offers GPT-Rosalind isn't just about faster text generation; it is designed to synthesize evidence, generate biological hypotheses, and plan experiments—tasks that have traditionally required years of expert human synthesis. At its core, GPT-Rosalind is the first in a new series of models optimized for scientific workflows. While previous iterations of GPT excelled at general language tasks, this model is fine-tuned for deeper understanding across genomics, protein engineering, and chemistry. To validate its capabilities, OpenAI tested the model against several industry benchmarks. On BixBench, a benchmark for real-world bioinformatics and data analysis, GPT-Rosalind achieved leading performance among models with published scores. 
In more granular testing via LABBench2, the model outperformed GPT-5.4 on six out of eleven tasks, with the most significant gains appearing in CloningQA—a task requiring the end-to-end design of reagents for molecular cloning protocols. The model’s most striking performance signal came from a partnership with Dyno Therapeutics. In an evaluation using unpublished, "uncontaminated" RNA sequences, GPT-Rosalind was tasked with sequence-to-function prediction and generation. When evaluated directly in the Codex environment, the model’s submissions ranked above the 95th percentile of human experts on prediction tasks and reached the 84th percentile for sequence generation. This level of expertise suggests the model can serve as a high-level collaborator capable of identifying "expert-relevant patterns" that generalist models often overlook. The new lab workflow OpenAI is not just releasing a model; it is launching an ecosystem designed to integrate with the tools scientists already use. Central to this is a new Life Sciences research plugin for Codex, available on GitHub. Scientific research is famously siloed. A single project might require a researcher to consult a protein structure database, search through 20 years of clinical literature, and then use a separate tool for sequence manipulation. The new plugin acts as an "orchestration layer," providing a unified starting point for these multi-step questions. Skill Set: The package includes modular skills for biochemistry, human genetics, functional genomics, and clinical evidence. Connectivity: It connects models to over 50 public multi-omics databases and literature sources. Efficiency: This approach targets "long-horizon, tool-heavy scientific workflows," allowing researchers to automate repeatable tasks like protein structure lookups and sequence searches. 
Limited and gated access Given the potential power of a model capable of redesigning biological structures, OpenAI is eschewing a broad "open-source" or general public release in favor of a Trusted Access program. The model is launching as a research preview specifically for qualified Enterprise customers in the United States. This restricted deployment is built on three core principles: beneficial use, strong governance, and controlled access. Organizations requesting access must undergo a qualification and safety review to ensure they are conducting legitimate research with a clear public benefit. Unlike general-use models, GPT-Rosalind was developed with heightened enterprise-grade security controls. For the end-user, this means: Restricted Access: Usage is limited to approved users within secure, well-managed environments. Governance: Participating organizations must maintain strict misuse-prevention controls and agree to specific life sciences research preview terms. Cost: During the preview phase, the model will not consume existing credits or tokens, allowing researchers to experiment without immediate budgetary constraints (subject to abuse guardrails). Warm reception from initial industry partners The announcement garnered significant buy-in from OpenAI partners across the pharmaceutical and technology sectors. Sean Bruich, SVP of AI and Data at Amgen, noted that the collaboration allows the company to apply advanced tools in ways that could "accelerate how we deliver medicines to patients". The impact is also being felt in the specialized tech infrastructure that supports labs: NVIDIA: Kimberly Powell, VP of Healthcare and Life Sciences, described the convergence of domain reasoning and accelerated computing as a way to "compress years of traditional R&D into immediate, actionable scientific insights". 
Moderna: CEO Stéphane Bancel highlighted the model's ability to "reason across complex biological evidence" to help teams translate insights into experimental workflows. The Allen Institute: CTO Andy Hickl emphasized that GPT-Rosalind stands out for making manual steps—like finding and aligning data—more "consistent and repeatable in an agentic workflow". This builds on tangible results OpenAI has already seen in the field, such as its collaboration with Ginkgo Bioworks, where AI models helped achieve a 40% reduction in protein production costs. What's next for Rosalind and OpenAI in life sciences? OpenAI’s mission with GPT-Rosalind is to narrow the gap between a "promising scientific idea" and the actual "evidence, experiments, and decisions" required for medical progress. By partnering with institutions like Los Alamos National Laboratory to explore AI-guided catalyst design and biological structure modification, the company is positioning GPT-Rosalind as more than a tool—it is meant to be a "capable partner in discovery". As the life sciences field becomes increasingly data-dense, the move toward specialized "reasoning" models like Rosalind may become the standard for navigating the "vast search spaces" of biology and chemistry.

Microsoft patched a Copilot Studio prompt injection. The data exfiltrated anyway.

4/15/2026

Microsoft assigned CVE-2026-21520, a CVSS 7.5 indirect prompt injection vulnerability, to Copilot Studio. Capsule Security discovered the flaw and coordinated disclosure with Microsoft, and the patch was deployed on January 15. Public disclosure went live on Wednesday.

That CVE matters less for what it fixes and more for what it signals. Capsule’s research calls Microsoft’s decision to assign a CVE to a prompt injection vulnerability in an agentic platform “highly unusual.” Microsoft previously assigned CVE-2025-32711 (CVSS 9.3) to EchoLeak, a prompt injection in M365 Copilot patched in June 2025, but that targeted a productivity assistant, not an agent-building platform. If the precedent extends to agentic systems broadly, every enterprise running agents inherits a new vulnerability class to track. Except that this class cannot be fully eliminated by patches alone.

Capsule also discovered what they call PipeLeak, a parallel indirect prompt injection vulnerability in Salesforce Agentforce. Microsoft patched and assigned a CVE. Salesforce has not assigned a CVE or issued a public advisory for PipeLeak as of publication, according to Capsule's research.

What ShareLeak actually does

The vulnerability that the researchers named ShareLeak exploits the gap between a SharePoint form submission and the Copilot Studio agent’s context window. An attacker fills a public-facing comment field with a crafted payload that injects a fake system role message. In Capsule’s testing, Copilot Studio concatenated the malicious input directly with the agent’s system instructions with no input sanitization between the form and the model. The injected payload overrode the agent’s original instructions in Capsule’s proof-of-concept, directing it to query connected SharePoint Lists for customer data and send that data via Outlook to an attacker-controlled email address. NVD classifies the attack as low complexity and notes that it requires no privileges.
Microsoft’s own safety mechanisms flagged the request as suspicious during Capsule’s testing. The data was exfiltrated anyway. The DLP never fired because the email was routed through a legitimate Outlook action that the system treated as an authorized operation. Carter Rees, VP of Artificial Intelligence at Reputation, described the architectural failure in an exclusive VentureBeat interview. The LLM cannot inherently distinguish between trusted instructions and untrusted retrieved data, Rees said. It becomes a confused deputy acting on behalf of the attacker. OWASP classifies this pattern as ASI01: Agent Goal Hijack. The research team behind both discoveries, Capsule Security, found the Copilot Studio vulnerability on November 24, 2025. Microsoft confirmed it on December 5 and patched it on January 15, 2026. Every security director running Copilot Studio agents triggered by SharePoint forms should audit that window for indicators of compromise. PipeLeak and the Salesforce split PipeLeak hits the same vulnerability class through a different front door. In Capsule’s testing, a public lead form payload hijacked an Agentforce agent with no authentication required. Capsule found no volume cap on the exfiltrated CRM data, and the employee who triggered the agent received no indication that data had left the building. Salesforce has not assigned a CVE or issued a public advisory specific to PipeLeak as of publication. Capsule is not the first research team to hit Agentforce with indirect prompt injection. Noma Labs disclosed ForcedLeak (CVSS 9.4) in September 2025, and Salesforce patched that vector by enforcing Trusted URL allowlists. According to Capsule's research, PipeLeak survives that patch through a different channel: email via the agent's authorized tool actions. Naor Paz, CEO of Capsule Security, told VentureBeat the testing hit no exfiltration limit. “We did not get to any limitation,” Paz said. 
“The agent would just continue to leak all the CRM.” Salesforce recommended human-in-the-loop as a mitigation. Paz pushed back. “If the human should approve every single operation, it’s not really an agent,” he told VentureBeat. “It’s just a human clicking through the agent’s actions.” Microsoft patched ShareLeak and assigned a CVE. According to Capsule's research, Salesforce patched ForcedLeak's URL path but not the email channel. Kayne McGladrey, IEEE Senior Member, put it differently in a separate VentureBeat interview. Organizations are cloning human user accounts to agentic systems, McGladrey said, except agents use far more permissions than humans would because of the speed, the scale, and the intent. The lethal trifecta and why posture management fails Paz named the structural condition that makes any agent exploitable: access to private data, exposure to untrusted content, and the ability to communicate externally. ShareLeak hits all three. PipeLeak hits all three. Most production agents hit all three because that combination is what makes agents useful. Rees validated the diagnosis independently. Defense-in-depth predicated on deterministic rules is fundamentally insufficient for agentic systems, Rees told VentureBeat. Elia Zaitsev, CrowdStrike’s CTO, called the patching mindset itself the vulnerability in a separate VentureBeat exclusive. “People are forgetting about runtime security,” he said. “Let’s patch all the vulnerabilities. Impossible. Somehow always seem to miss something.” Observing actual kinetic actions is a structured, solvable problem, Zaitsev told VentureBeat. Intent is not. CrowdStrike’s Falcon sensor walks the process tree and tracks what agents did, not what they appeared to intend. Multi-turn crescendo and the coding agent blind spot Single-shot prompt injections are the entry-level threat. Capsule’s research documented multi-turn crescendo attacks where adversaries distribute payloads across multiple benign-looking turns. 
Each turn passes inspection. The attack becomes visible only when analyzed as a sequence. Rees explained why current monitoring misses this. A stateless WAF views each turn in a vacuum and detects no threat, Rees told VentureBeat. It sees requests, not a semantic trajectory. Capsule also found undisclosed vulnerabilities in coding agent platforms it declined to name, including memory poisoning that persists across sessions and malicious code execution through MCP servers. In one case, a file-level guardrail designed to restrict which files the agent could access was reasoned around by the agent itself, which found an alternate path to the same data. Rees identified the human vector: employees paste proprietary code into public LLMs and view security as friction. McGladrey cut to the governance failure. “If crime was a technology problem, we would have solved crime a fairly long time ago,” he told VentureBeat. “Cybersecurity risk as a standalone category is a complete fiction.” The runtime enforcement model Capsule hooks into vendor-provided agentic execution paths — including Copilot Studio's security hooks and Claude Code's pre-tool-use checkpoints — with no proxies, gateways, or SDKs. The company exited stealth on Wednesday, timing its $7 million seed round, led by Lama Partners alongside Forgepoint Capital International, to its coordinated disclosure. Chris Krebs, the first Director of CISA and a Capsule advisor, put the gap in operational terms. “Legacy tools weren’t built to monitor what happens between prompt and action,” Krebs said. “That’s the runtime gap.” Capsule's architecture deploys fine-tuned small language models that evaluate every tool call before execution, an approach Gartner's market guide calls a "guardian agent." Not everyone agrees that intent analysis is the right layer. Zaitsev told VentureBeat during an exclusive interview that intent-based detection is non-deterministic. “Intent analysis will sometimes work. 
Intent analysis cannot always work,” he said. CrowdStrike bets on observing what the agent actually did rather than what it appeared to intend. Microsoft’s own Copilot Studio documentation provides external security-provider webhooks that can approve or block tool execution, offering a vendor-native control plane alongside third-party options. No single layer closes the gap. Runtime intent analysis, kinetic action monitoring, and foundational controls (least privilege, input sanitization, outbound restrictions, targeted human-in-the-loop) all belong in the stack. SOC teams should map telemetry now: Copilot Studio activity logs plus webhook decisions, CRM audit logs for Agentforce, and EDR process-tree data for coding agents. Paz described the broader shift. “Intent is the new perimeter,” he told VentureBeat. “The agent in runtime can decide to go rogue on you.”

VentureBeat Prescriptive Matrix

The following matrix maps five vulnerability classes against the controls that miss them, and the specific actions security directors should take this week.

Vulnerability Class: ShareLeak — Copilot Studio, CVE-2026-21520, CVSS 7.5, patched Jan 15 2026
Why Current Controls Miss It: Capsule’s testing found no input sanitization between the SharePoint form and the agent context. Safety mechanisms flagged, but data still exfiltrated. DLP did not fire because the email used a legitimate Outlook action. OWASP ASI01: Agent Goal Hijack.
What Runtime Enforcement Does: Guardian agent hooks into Copilot Studio pre-tool-use security hooks. Vets every tool call before execution. Blocks exfiltration at the action layer.
Suggested actions for security leaders: Audit every Copilot Studio agent triggered by SharePoint forms. Restrict outbound email to org-only domains. Inventory all SharePoint Lists accessible to agents. Review the Nov 24–Jan 15 window for indicators of compromise.

Vulnerability Class: PipeLeak — Agentforce, no CVE assigned
Why Current Controls Miss It: In Capsule’s testing, public form input flowed directly into the agent context. No auth required. No volume cap observed on exfiltrated CRM data. The employee received no indication that data was leaving.
What Runtime Enforcement Does: Runtime interception via platform agentic hooks. Pre-invocation checkpoint on every tool call. Detects outbound data transfer to non-approved destinations.
Suggested actions for security leaders: Review all Agentforce automations triggered by public-facing forms. Enable human-in-the-loop for external comms as interim control. Audit CRM data access scope per agent. Pressure Salesforce for CVE assignment.

Vulnerability Class: Multi-Turn Crescendo — distributed payload, each turn looks benign
Why Current Controls Miss It: Stateless monitoring inspects each turn in isolation. WAFs, DLP, and activity logs see individual requests, not semantic trajectory.
What Runtime Enforcement Does: Stateful runtime analysis tracks full conversation history across turns. Fine-tuned SLMs evaluate aggregated context. Detects when a cumulative sequence constitutes a policy violation.
Suggested actions for security leaders: Require stateful monitoring for all production agents. Add crescendo attack scenarios to red team exercises.

Vulnerability Class: Coding Agents — unnamed platforms, memory poisoning + code execution
Why Current Controls Miss It: MCP servers inject code and instructions into the agent context. Memory poisoning persists across sessions. Guardrails reasoned around by the agent itself. Shadow AI insiders paste proprietary code into public LLMs.
What Runtime Enforcement Does: Pre-invocation checkpoint on every tool call. Fine-tuned SLMs detect anomalous tool usage at runtime.
Suggested actions for security leaders: Inventory all coding agent deployments across engineering. Audit MCP server configs. Restrict code execution permissions. Monitor for shadow installations.

Vulnerability Class: Structural Gap — any agent with private data + untrusted input + external comms
Why Current Controls Miss It: Posture management tells you what should happen. It does not stop what does happen. Agents use far more permissions than humans at far greater speed.
What Runtime Enforcement Does: Runtime guardian agent watches every action in real time. Intent-based enforcement replaces signature detection. Leverages vendor agentic hooks, not proxies or gateways.
Suggested actions for security leaders: Classify every agent by lethal trifecta exposure. Treat prompt injection as class-based SaaS risk. Require runtime security for any agent moving to production. Brief the board on agent risk as business risk.

What this means for 2026 security planning

Microsoft’s CVE assignment will either accelerate or fragment how the industry handles agent vulnerabilities. If vendors call them configuration issues, CISOs carry the risk alone. Treat prompt injection as a class-level SaaS risk rather than individual CVEs. Classify every agent deployment against the lethal trifecta. Require runtime enforcement for anything moving to production. Brief the board on agent risk the way McGladrey framed it: as business risk, because cybersecurity risk as a standalone category stopped being useful the moment agents started operating at machine speed.
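
The classification step recommended above can be sketched in a few lines of Python. This is an illustration only: the `AgentProfile` record, its field names, and the rule that meeting all three conditions triggers a runtime-enforcement requirement are assumptions made for the example, not part of Capsule's product or any vendor's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical inventory record for one deployed agent."""
    name: str
    reads_private_data: bool          # e.g. CRM records or SharePoint Lists
    ingests_untrusted_content: bool   # e.g. public-facing form submissions
    can_communicate_externally: bool  # e.g. an outbound email action


def trifecta_exposure(agent: AgentProfile) -> int:
    """Count how many of the three lethal-trifecta conditions the agent meets."""
    return sum((
        agent.reads_private_data,
        agent.ingests_untrusted_content,
        agent.can_communicate_externally,
    ))


def requires_runtime_enforcement(agent: AgentProfile) -> bool:
    """Assumed policy: flag any agent that meets all three conditions."""
    return trifecta_exposure(agent) == 3


if __name__ == "__main__":
    inventory = [
        AgentProfile("public-form-triage", True, True, True),
        AgentProfile("internal-wiki-search", True, False, False),
    ]
    for agent in inventory:
        print(agent.name, trifecta_exposure(agent), requires_runtime_enforcement(agent))
```

An inventory scored this way gives security teams a ranked list to work from; agents scoring two out of three are natural candidates for review before a new tool or data source pushes them to three.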

Frontier models are failing one in three production attempts — and getting harder to audit

4/15/2026

AI agents are now embedded in real enterprise workflows, and they're still failing roughly one in three attempts on structured benchmarks. That gap between capability and reliability is the defining operational challenge for IT leaders in 2026, according to Stanford HAI's ninth annual AI Index report. This uneven, unpredictable performance is what the AI Index calls the "jagged frontier," a term coined by AI researcher Ethan Mollick to describe the boundary where AI excels and then suddenly fails. “AI models can win a gold medal at the International Mathematical Olympiad,” Stanford HAI researchers point out, “but still can’t reliably tell time.” How models advanced in 2025 Enterprise AI adoption has reached 88%. Notable accomplishments in 2025 and early 2026: Frontier models improved 30% in just one year on Humanity's Last Exam (HLE), which includes 2,500 questions across math, natural sciences, ancient languages, and other specialized subfields. HLE was built to be difficult for AI and favorable to human experts. Leading models scored above 87% on MMLU-Pro, which tests multi-step reasoning based on 12,000 human-reviewed questions across more than a dozen disciplines. This illustrates “how competitive the frontier has become on broad knowledge tasks,” the Stanford HAI researchers note. Top models including Claude Opus 4.5, GPT-5.2, and Qwen3.5 scored between 62.9% and 70.2% on τ-bench. The benchmark tests agents on real-world tasks in realistic domains that involve chatting with a user and calling external tools or APIs. Model accuracy on GAIA, which benchmarks general AI assistants, rose from about 20% to 74.5%. Agent performance on SWE-bench Verified rose from 60% to near 100% in just one year. The benchmark evaluates models on their ability to resolve real-world software issues. Success rates on WebArena increased from 15% in 2023 to 74.3% in early 2026. 
This benchmark presents a realistic web environment for evaluating autonomous AI agents, tasking them with information retrieval, site navigation, and content configuration. Agent performance progressed from 17% in 2024 to roughly 65% in early 2026 on MLE-bench, which evaluates machine learning (ML) engineering capabilities. AI agents are showing capability gains in cybersecurity. For instance, frontier models solved 93% of problems on Cybench, a benchmark that includes 40 professional-level tasks across six capture-the-flag categories, including cryptography, web security, reverse engineering, forensics, and exploitation. This is compared to 15% in 2024 and represents the “steepest improvement rate,” indicating that cybersecurity tasks are a “good fit for current agent capabilities.” Video generation has also evolved significantly over the last year; models can now capture how objects behave. For instance, Google DeepMind’s Veo 3 was tested across more than 18,000 generated videos, and demonstrated the ability to simulate buoyancy and solved mazes without having been trained on those tasks. “Video generation models are no longer just producing realistic-looking content,” the researchers write. “Some are beginning to learn how the physical world actually works.” Overall, AI is being used across a number of areas in enterprise — knowledge management, software engineering and IT, marketing and sales — and expanding into specialized domains like tax, mortgage processing, corporate finance, and legal reasoning, where accuracy ranges from 60 to 90%. “AI capability is not plateauing,” Stanford HAI says. “It is accelerating and reaching more people than ever.” AI capability surges, but reliability lags Multimodal models now meet or exceed human baselines on PhD-level science questions, multimodal reasoning, and competition mathematics. 
For example, Gemini Deep Think earned a gold medal at the 2025 International Mathematical Olympiad (IMO), solving five of six problems end-to-end in natural language within the 4.5-hour time limit — a notable improvement from a silver-level score in 2024. Yet these same AI systems still fail in roughly one in three attempts, and have trouble with basic perception tasks, according to Stanford HAI. On ClockBench — a test covering 180 clock designs and 720 questions — Gemini Deep Think achieved only 50.1% accuracy, compared to roughly 90% for humans. GPT-4.5 High reached an almost identical score of 50.6%. “Many multimodal models still struggle with something most humans find routine: Telling the time,” the Stanford HAI report points out. The seemingly simple task combines visual perception with simple arithmetic, identification of clock hands and their positions, and conversion of those into a time value. Ultimately, errors at any of these steps can cascade, leading to incorrect results, according to researchers. In analysis, models were shown a range of clock styles: standard analog, clocks without a second hand, those with arrows as hands, others with black dials or Roman numerals. But even after fine-tuning on 5,000 synthetic images, models improved only on familiar formats and failed to generalize to real-world variations (like distorted dials or thinner hands). Researchers extrapolated that, when models confused hour and minute hands, their ability to interpret direction deteriorated, suggesting that the challenge lies not just in data, but in integrating multiple visual cues. “Even as models close the gap with human experts on knowledge-intensive tasks, this kind of visual reasoning remains a persistent challenge,” Stanford HAI notes. Hallucination and multi-step reasoning remain major gaps Even as models continue to accelerate in their reasoning, hallucinations remain a major concern. 
In one benchmark, for instance, hallucination rates across 26 leading models ranged from 22% to 94%. Accuracy for some models dropped sharply when put under scrutiny: for example, GPT-4o's accuracy slid from 98.2% to 64.4%, and DeepSeek R1 plummeted from more than 90% to 14.4%. On the other hand, Grok 4.20 Beta, Claude 4.5 Haiku, and MiMo-V2-Pro showed the lowest rates. Further, models continue to struggle with multi-step workflows, even as they are tasked with more of them. For example, on the τ-bench benchmark — which evaluates tool use and multi-turn reasoning — no model exceeded 71%, suggesting that “managing multiturn conversations while correctly using tools and following policy constraints remains difficult even for frontier models,” according to the Stanford HAI report.

Models are becoming opaque

Leading models are now “nearly indistinguishable” from each other when it comes to performance, the Stanford HAI report notes. Open-weight models are more competitive than ever, but they are converging. As capability is no longer a “clear differentiator,” competitive pressure is shifting toward cost, reliability, and real-world usefulness. Frontier labs are disclosing less information about their models, evaluation methods are quickly losing relevance, and independent testing can’t always corroborate developer-reported metrics. As Stanford HAI points out: “The most capable systems are now the least transparent.” Training code, parameter counts, dataset sizes, and durations are often being withheld — by firms including OpenAI, Anthropic and Google. And transparency is declining more broadly: In 2025, 80 out of 95 models were released without corresponding training code, while only four made their code fully open source. Further, after rising between 2023 and 2024, scores on the Foundation Model Transparency Index — which ranks major foundation developers on 100 transparency indicators — have since dropped.
The average score is now 40, representing a 17 point decrease. “Major gaps persist in disclosure around training data, compute resources, and post-deployment impact,” according to the report. Benchmarking AI is getting harder — and less reliable The benchmarks used to measure AI progress are facing growing reliability issues, with error rates reaching as high as 42% on widely-used evaluations. “AI is being tested more ambitiously across reasoning, safety, and real-world task execution,” the Stanford report notes, yet “those measurements are increasingly difficult to rely on.” Key challenges include: “Sparse and declining” reporting on bias from developers Benchmark contamination, or when models are exposed to test data; this can lead to “falsely inflated” scores Discrepancies between developer-reported results and independent testing “Poorly constructed” evals lacking documentation, details on statistical significance and reproducible scripts “Growing opacity and non-standard prompting” that make model-to-model comparisons unreliable “Even when benchmark scores are technically valid, strong benchmark performance does not always translate to real-world utility,” according to the report. Further, “AI capability is outpacing the benchmarks designed to measure it.” This is leading to “benchmark saturation,” where models achieve scores so high that tests can no longer differentiate between them. More complex, interactive forms of intelligence are becoming increasingly difficult to benchmark. Some are calling for evals that measure human-AI collaboration, rather than AI performance in isolation, but this technique is early in development. “Evaluations intended to be challenging for years are saturated in months, compressing the window in which benchmarks remain useful for tracking progress,” according to Stanford HAI. Are we at "peak data"? As builders move into more data-intensive inference, there is growing concern about data bottlenecks and scaling sustainability. 
Leading researchers are warning that the available pool of high-quality human text and web data has been “exhausted” — a state referred to as “peak data.” Hybrid approaches combining real and synthetic data can “significantly accelerate training” — sometimes by a factor of 5 to 10 — and smaller models trained on purely synthetic data have shown promise for narrowly defined tasks like classification or code generation, according to Stanford HAI. Synthetically generated data can be effective for improving model performance in post-training settings, including fine-tuning, alignment, instruction tuning, and reinforcement learning (RL), the report notes. However, “these gains have not generalized to large, general-purpose language models.” Rather than scaling data “indiscriminately,” researchers are turning to pruning, curating, and refining inputs, and are improving performance by cleaning labels, deduplicating samples, and constructing overall higher-quality datasets. “Discussions on data availability often overlook an important shift in recent AI research,” according to the report. “Performance gains are increasingly driven by improving the quality of existing datasets, not by acquiring more.” Responsible AI is falling behind While the infrastructure for responsible AI is growing, progress has been “uneven” and is unable to keep pace with rapid capability gains, according to Stanford HAI. While almost all leading frontier AI model developers report results on capability benchmarks, corresponding reporting on safety and responsibility is inconsistent and “spotty.” Documented AI incidents rose significantly year over year — 362 in 2025 compared to 233 in 2024. And, while several frontier models received “Very Good” or “Good” safety ratings under standard use (per the AILuminate benchmark, which assesses generative AI across 12 “hazard” categories), safety performance dropped across all models when tested against jailbreak attempts using adversarial prompts. 
“AI models perform well on safety tests under normal conditions, but their defenses weaken under deliberate attack,” Stanford HAI notes. Adding to this challenge, builders have reported that improving one dimension, such as safety, can degrade another, like accuracy. “The infrastructure for responsible AI is growing, but progress has been uneven, and it is not keeping pace with the speed of AI deployment,” according to Stanford researchers. The Stanford data makes one thing clear: the gap that matters in 2026 isn't between AI and human performance. It's between what AI can do in a demo and what it does reliably in production. Right now — with less transparency from the labs and benchmarks that saturate before they're useful — that gap is harder to measure than ever.

Meta researchers introduce 'hyperagents' to unlock self-improving AI for non-coding tasks

4/15/2026

Creating self-improving AI systems is an important step toward deploying agents in dynamic environments, especially in enterprise production environments, where tasks are not always predictable or consistent. Current self-improving AI systems face severe limitations because they rely on fixed, handcrafted improvement mechanisms that only work under strict conditions such as software engineering. To overcome this practical challenge, researchers at Meta and several universities introduced “hyperagents,” a self-improving AI system that continuously rewrites and optimizes its problem-solving logic and the underlying code. In practice, this allows the AI to self-improve across non-coding domains, such as robotics and document review. The agent independently invents general-purpose capabilities like persistent memory and automated performance tracking. More broadly, hyperagents don't just get better at solving tasks; they learn to improve the self-improving cycle to accelerate progress. This framework can help develop highly adaptable agents that autonomously build structured, reusable decision machinery. This approach compounds capabilities over time with less need for constant, manual prompt engineering and domain-specific human customization.

Current self-improving AI and its architectural bottlenecks

The core goal of self-improving AI systems is to continually enhance their own learning and problem-solving capabilities. However, most existing self-improvement models rely on a fixed “meta agent.” This static, high-level supervisory system is designed to modify a base system. “The core limitation of handcrafted meta-agents is that they can only improve as fast as humans can design and maintain them,” Jenny Zhang, co-author of the paper, told VentureBeat.
“Every time something changes or breaks, a person has to step in and update the rules or logic.” Instead of an abstract theoretical limit, this creates a practical “maintenance wall.” The current paradigm ties system improvement directly to human iteration speed, slowing down progress because it relies heavily on manual engineering effort rather than scaling with agent-collected experience. To overcome this limitation, the researchers argue that the AI system must be “fully self-referential.” These systems must be able to analyze, evaluate, and rewrite any part of themselves without the constraints of their initial setup. This allows the AI system to break free from structural limits and become self-accelerating. One example of a self-referential AI system is Sakana AI’s Darwin Gödel Machine (DGM), an AI system that improves itself by rewriting its own code. In DGM, an agent iteratively generates, evaluates, and modifies its own code, saving successful variants in an archive to act as stepping stones for future improvements. DGM proved open-ended, recursive self-improvement is practically achievable in coding. However, DGM falls short when applied to real-world applications outside of software engineering because of a critical skill gap. In DGM, the system improves because both evaluation and self-modification are coding tasks. Improving the agent's coding ability naturally improves its ability to rewrite its own code. But if you deploy DGM for a non-coding enterprise task, this alignment breaks down. “For tasks like math, poetry, or paper review, improving task performance does not necessarily improve the agent’s ability to modify its own behavior,” Zhang said. The skills needed to analyze subjective text or business data are entirely different from the skills required to analyze failures and write new Python code to fix them. DGM also relies on a fixed, human-engineered mechanism to generate its self-improvement instructions. 
In practice, if enterprise developers want to use DGM for anything other than coding, they must heavily engineer and manually customize the instruction prompts for every new domain. The hyperagent framework To overcome the limitations of previous architectures, the researchers introduce hyperagents. The framework proposes “self-referential agents that can in principle self-improve for any computable task.” In this framework, an agent is any computable program that can invoke LLMs, external tools, or learned components. Traditionally, these systems are split into two distinct roles: a “task agent” that executes the specific problem at hand, and a “meta agent” that analyzes and modifies the agents. A hyperagent fuses both the task agent and the meta agent into a single, self-referential, and editable program. Because the entire program can be rewritten, the system can modify the self-improvement mechanism, a process the researchers call metacognitive self-modification. "Hyperagents are not just learning how to solve the given tasks better, but also learning how to improve," Zhang said. "Over time, this leads to accumulation. Hyperagents do not need to rediscover how to improve in each new domain. Instead, they retain and build on improvements to the self-improvement process itself, allowing progress to compound across tasks." The researchers extended the Darwin Gödel Machine to create DGM-Hyperagents (DGM-H). DGM-H retains the powerful open-ended exploration structure of the original DGM, which prevents the AI from converging too early or getting stuck in dead ends by maintaining a growing archive of successful hyperagents. The system continuously branches from selected candidates in this archive, allows them to self-modify, evaluates the new variants on given tasks, and adds the successful ones back into the pool as stepping stones for future iterations. 
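
The archive loop described above can be illustrated with a toy sketch in Python. Everything here is an assumption made for illustration and is not the researchers' code: `ToyHyperagent` reduces an agent to two numbers, parents are picked uniformly at random, and a variant counts as "successful" only if it scores higher than its parent. The one idea the sketch tries to preserve is that the part controlling self-modification (`mutation_scale`) is itself editable, alongside the part that solves the task (`task_param`).

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyHyperagent:
    """Hypothetical stand-in for a hyperagent: one editable program, two parts."""
    task_param: float      # stands in for the task-solving logic
    mutation_scale: float  # stands in for the self-improvement logic

    def self_modify(self, rng: random.Random) -> "ToyHyperagent":
        # The agent rewrites both parts, including the one that governs rewriting.
        new_scale = max(1e-3, self.mutation_scale * rng.uniform(0.5, 1.5))
        new_param = self.task_param + rng.gauss(0.0, self.mutation_scale)
        return ToyHyperagent(new_param, new_scale)


def archive_search(
    seed: ToyHyperagent,
    evaluate: Callable[[ToyHyperagent], float],
    iterations: int,
    rng: random.Random,
) -> List[Tuple[ToyHyperagent, float]]:
    """Branch from archived agents, evaluate variants, keep the successful ones."""
    archive = [(seed, evaluate(seed))]
    for _ in range(iterations):
        parent, parent_score = rng.choice(archive)  # assumed: uniform selection
        child = parent.self_modify(rng)
        child_score = evaluate(child)
        if child_score > parent_score:              # assumed success criterion
            archive.append((child, child_score))
    return archive


if __name__ == "__main__":
    # Toy task: get task_param as close to 3.0 as possible.
    def score(agent: ToyHyperagent) -> float:
        return -abs(agent.task_param - 3.0)

    result = archive_search(ToyHyperagent(0.0, 1.0), score, 200, random.Random(0))
    best_agent, best_score = max(result, key=lambda entry: entry[1])
    print(len(result), best_agent, best_score)
```

Because older variants stay in the archive and can still be chosen as parents, the search can branch from a weaker agent instead of always refining the current best, which is the stepping-stone behavior the article attributes to the original DGM.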
By combining this open-ended evolutionary search with metacognitive self-modification, DGM-H eliminates the fixed, human-engineered instruction step of the original DGM. This enables the agent to self-improve across any computable task. Hyperagents in action The researchers used the Polyglot coding benchmark to compare the hyperagent framework against previous coding-only AI. They also evaluated hyperagents across non-coding domains that involve subjective reasoning, external tool use, and complex logic. These included paper review to simulate a peer reviewer outputting accept or reject decisions, reward model design for training a quadruped robot, and Olympiad-level math grading. Math grading served as a held-out test to see if an AI that learned how to self-improve while reviewing papers and designing robots could transfer those meta-skills to an entirely unseen domain. The researchers compared hyperagents against several baselines, including domain-specific models like AI-Scientist-v2 for paper reviews and the ProofAutoGrader for math. They also tested against the classic DGM and a manually customized DGM for new domains. On the coding benchmark, hyperagents matched the performance of DGM despite not being designed specifically for coding. In paper review and robotics, hyperagents outperformed the open-source baselines and human-engineered reward functions. When the researchers took a hyperagent optimized for paper review and robotics and deployed it on the unseen math grading task, it achieved an improvement metric of 0.630 in 50 iterations. Baselines relying on classic DGM architectures remained at a flat 0.0. The hyperagent even beat the domain-specific ProofAutoGrader. The experiments also highlighted interesting autonomous behaviors from hyperagents. In paper evaluation, the agent first used standard prompt-engineering tricks like adopting a rigorous persona. 
When this proved unreliable, it rewrote its own code to build a multi-stage evaluation pipeline with explicit checklists and rigid decision rules, leading to much higher consistency. Hyperagents also autonomously developed a memory tool to avoid repeating past mistakes. Furthermore, the system wrote a performance tracker to log and monitor the result of architectural changes across generations. The model even developed a compute-budget aware behavior, where it tracked remaining iterations to adjust its planning. Early generations executed ambitious architectural changes, while later generations focused on conservative, incremental refinements. For enterprise data teams wondering where to start, Zhang recommends focusing on tasks where success is unambiguous. “Workflows that are clearly specified and easy to evaluate, often referred to as verifiable tasks, are the best starting point,” she said. “This generally opens new opportunities for more exploratory prototyping, more exhaustive data analysis, more exhaustive A/B testing, [and] faster feature engineering.” For harder, unverified tasks, teams can use hyperagents to first develop learned judges that better reflect human preferences, creating a bridge to more complex domains. The researchers have shared the code for hyperagents, though it has been released under a non-commercial license. Caveats and future threats The benefits of hyperagents introduce clear tradeoffs. The researchers highlight several safety considerations regarding systems that can modify themselves in increasingly open-ended ways. These AI systems pose the risk of evolving far more rapidly than humans can audit or interpret. While researchers contained DGM-H within safety boundaries such as sandboxed environments designed to prevent unintended side effects, these initial safeguards are actually practical deployment blueprints. 
Zhang advises developers to enforce resource limits and restrict access to external systems during the self-modification phase. “The key principle is to separate experimentation from deployment: allow the agent to explore and improve within a controlled sandbox, while ensuring that any changes that affect real systems are carefully validated before being applied,” she said. Only after the newly modified code passes developer-defined correctness checks should it be promoted to a production setting. Another significant danger is evaluation gaming, where the AI improves its metrics without making actual progress toward the intended real-world goal. Because hyperagents are driven by empirical evaluation signals, they can autonomously discover strategies that exploit blind spots or weaknesses in the evaluation procedure itself to artificially inflate their scores. Preventing this behavior requires developers to implement diverse, robust, and periodically refreshed evaluation protocols alongside continuous human oversight. Ultimately, these systems will shift the day-to-day responsibilities of human engineers. Just as we do not recompute every operation a calculator performs, future AI orchestration engineers will not write the improvement logic directly, Zhang believes. Instead, they will design the mechanisms for auditing and stress-testing the system. “As self-improving systems become more capable, the question is no longer just how to improve performance, but what objectives are worth pursuing,” Zhang said. “In that sense, the role evolves from building systems to shaping their direction.”

We tested Anthropic’s redesigned Claude Code desktop app and 'Routines' — here's what enterprises should know

4/15/2026

The transition from AI as a chatbot to AI as a workforce is no longer a theoretical projection; it has become the primary design philosophy for the modern developer's toolkit. On April 14, 2026, Anthropic signaled this shift with a dual release: a complete redesign of the Claude Code desktop app (for Mac and Windows) and the launch of "Routines" in research preview. These updates suggest that for the modern enterprise, the developer's role is shifting from a solo practitioner to a high-level orchestrator managing multiple, simultaneous streams of work. For years, the industry focused on "copilots"—single-threaded assistants that lived within the IDE and responded to the immediate line of code being written. Anthropic’s latest update acknowledges that the shape of "agentic work" has fundamentally changed. Developers are no longer just typing prompts and waiting for answers; they are initiating refactors in one repository, fixing bugs in another, and writing tests in a third, all while monitoring the progress of these disparate tasks. The redesigned desktop application reflects this change through its central "Mission Control" feature: the new sidebar. This interface element allows a developer to manage every active and recent session in a single view, filtering by status, project, or environment. It effectively turns the developer’s desktop into a command center where they can steer agents as they drift or review diffs before shipping. This represents a philosophical move away from "conversation" toward "orchestration". Routines: your new 'set and forget' option for repeating processes and tasks The introduction of "Routines" represents a significant architectural evolution for Claude Code. Previously, automation was often tied to the user's local hardware or manually managed infrastructure. Routines move this execution to Anthropic’s web infrastructure, decoupling progress from the user's local machine. 
This means a critical task, such as a nightly triage of bugs from a Linear backlog, can run at 2:00 AM without the developer's laptop being open. These Routines are segmented into three distinct categories designed for enterprise integration: Scheduled Routines: These function like a sophisticated cron job, performing repeatable maintenance like docs-drift scanning or backlog management on a cadence. API Routines: These provide dedicated endpoints and auth tokens, allowing enterprises to trigger Claude via HTTP requests from alerting tools like Datadog or CI/CD pipelines. Webhook Routines: Currently focused on GitHub, these allow Claude to listen for repository events and automatically open sessions to address PR comments or CI failures. For enterprise teams, these Routines come with structured daily limits: Pro users are capped at 5, Max at 15, and Team/Enterprise tiers at 25 routines per day, though additional usage can be purchased. Analysis: desktop GUI vs. Terminal The pivot toward a dedicated Desktop GUI for a tool that originated in the terminal (CLI) invites an analysis of the trade-offs for enterprise users. The primary benefit of the new desktop app is high-concurrency visibility. In a terminal environment, managing four different AI agents working on four different repositories is a cognitive burden, requiring multiple tabs and constant context switching. The desktop app’s drag-and-drop layout allows the terminal, preview pane, diff viewer, and chat to be arranged in a grid that matches the user's specific workflow. Furthermore, the "Side Chat" feature (accessible via ⌘ + ;) solves a common problem in agentic work: the need to ask a clarifying question without polluting the main task's history. This ensures that the agent's primary mission remains focused while the human operator gets the context they need. The same capability is not exclusive to the GUI, however: it is also available in the Terminal view via the /btw command. Despite the GUI's benefits, the CLI remains the home of many developers.
The terminal is lightweight and fits into existing shell-based automation. Recognizing this, Anthropic has maintained parity: CLI plugins are supposed to work exactly the same in the desktop app as they do in the terminal. Yet in my testing, I was unable to get some of my third-party plugins to show up in the terminal or main view. For pure speed, and for users who operate primarily within a single repository, the CLI avoids the resource overhead of a full GUI. How to use the new Claude Code desktop app view In practice, accessing the redesigned Claude Code desktop app requires a bit of digital hunting. It is not a separate application; it is one of three main views in the official Claude desktop app, accessible only by hovering over the "Chat" icon in the top-left corner to reveal the specific coding interfaces. Once inside, the transition from a standard chat window to the "Claude Code" view is stark. The interface is dominated by a central conversational thread flanked by a session-management sidebar that allows for quick navigation between active and archived projects. The new, subtle, hover-over circular indicator at the bottom, which shows how much context the user has used in the current session along with weekly plan limits, is welcome, but it is a step back from third-party CLI plugins that display this information constantly without requiring the user to hover. Similarly, pop-up icons for permissions and a small orange asterisk showing the time Claude Code has spent responding to each prompt (working) and the tokens consumed, right in the stream, are excellent for visibility into costs and activity. While the visual clarity is high, bolstered by interactive charts and clickable inline links, the discoverability of parallel agent orchestration remains a hurdle.
Despite the promise of "many things in flight," attempting to run tests across multiple disparate project folders proved difficult, as the current iteration tends to lock the user into a single project at a time. Unlike the Terminal CLI version of Claude Code, which defaults to asking the user to start their session in their user folder on macOS, the Claude Code desktop app asks for access to a specific subfolder, which can be helpful if you have already started a project, but not necessarily for starting work on a new one or on several in parallel. The most effective addition for the "vibe coding" workflow is the integrated preview pane, located in the upper-right corner. For developers who previously relied on the terminal-only version of Claude Code, this feature eliminates the need to maintain separate browser windows or rely on third-party extensions to view live changes to web applications. However, the desktop experience is not without friction. The integrated terminal, intended to allow for side-by-side builds and testing, suffered from notable latency, often failing to update in real time with user input. For users accustomed to the near-instantaneous response of a native terminal, this lag can make the GUI feel like an "overkill" layer that complicates rather than streamlines the dev cycle. Setting up the new Routines feature also involved a steep learning curve. The interface does not immediately surface how to initiate these background automations; discovery required asking Claude directly and referencing the internal documentation to find the /schedule command. Once identified, however, the process was remarkably efficient. With the CLI command and connectors configured in the browser, a routine can be operational in under two minutes, running autonomously on Anthropic’s web infrastructure without requiring the desktop app to remain active.
The ultimate trade-off for the enterprise user is one of flexibility (standard Terminal/CLI view) versus integrated convenience (new Claude Code desktop app). The desktop app provides a high-context "Plan" view and a readable narrative of the agent’s logic, which is undeniably helpful for complex, multi-step refactors. Yet, the platform creates a distinct "walled garden" effect. While the terminal version of Claude Code offers a broader range of movement, the desktop app is strictly optimized for Anthropic’s models. For the professional coder who frequently switches between Claude and other AI models to work around rate limits or seek different architectural perspectives, this model-lock may be a dealbreaker. For these power users, the traditional terminal interface remains the superior surface for maintaining a diverse and resilient AI stack. The enterprise verdict For the enterprise, the Desktop GUI is likely to become the standard for management and review, while the CLI remains the tool for execution. The desktop app's inclusion of an in-app file editor and a faster diff viewer—rebuilt for performance on large changesets—makes it a superior environment for the "Review and Ship" phase of development. It allows a lead developer to review an agent's work, make spot edits, and approve a PR without ever leaving the application. Philosophical implications for the future of AI-driven enterprise knowledge work Anthropic developer Felix Rieseberg noted on X that this version was "redesigned from the ground up for parallel work," emphasizing that it has become his primary way to interact with the system. This shift suggests a future where "coding" is less about syntax and more about managing the lifecycle of AI sessions. The enterprise user now occupies the "orchestrator seat," managing a fleet of agents that can triage alerts, verify deploys, and resolve feedback automatically. 
By providing the infrastructure to run these tasks in the cloud and the interface to monitor them on the desktop, Anthropic is defining a new standard for professional AI-assisted engineering.

AI's next bottleneck isn't the models — it's whether agents can think together

4/15/2026

AI agents can connect together, but they cannot think together. That’s a huge difference and a bottleneck for next-gen systems, says Outshift by Cisco’s SVP and GM Vijoy Pandey. As he describes the current state of AI: Agents can be stitched together in a workflow or plug into a supervisor model — but there's no semantic alignment, no shared context. They’re essentially working from scratch each go-around. This calls for next-level infrastructure, or what Pandey describes as the "internet of cognition." “Agents are not able to think together because connection is not cognition,” he said. “We need to get to a point where you are sharing cognition. That is the greater unlock.” Creating new protocols to support next-gen agent communication So what is shared cognition? It’s when AI agents or entities can meaningfully work together to solve for something net new that they weren’t trained for, and do it “100% without human intervention,” Pandey said on the latest episode of Beyond the Pilot. The Cisco exec analogizes it to human intelligence. Humans evolved over hundreds of thousands of years, first becoming intelligent individually, then communicating on a basic level (with gestures or drawings). That communication improved over time, eventually unlocking a ‘cognitive revolution’ and collective intelligence that allowed for shared intent and the ability to coordinate, negotiate, and ground and discover information. “Shared intent, shared context, collective innovation: That's the exact trajectory that's playing out in silicon today,” Pandey said. His team sees it as a “horizontal distributed assistance problem.” They are pursuing “distributed super intelligence” by codifying intent, context, and collective innovation as a set of rules, APIs, and capabilities within the infrastructure itself. Their approach is a set of new protocols: Semantic State Transfer Protocol (SSTP); Latent Space Transfer Protocol (LSTP); and Compressed State Transfer Protocol (CSTP). 
SSTP operates at the language level, analyzing semantic communication so systems can infer the right tool or task. Pandey's team recently collaborated with MIT on a related piece called the Ripple Effect Protocol. LSTP can be used to transfer the “entire latent space” of one agent to another, Pandey explained. “Can we just take the KV cache and send it over as an example?” he said. “Because that would be the most efficient way: instead of going through the tax of tokenizing it, going to a natural language, then going back the stack on the other side.” CSTP handles compression, grounding only the targeted variants while compressing everything else. Pandey says it's particularly well-suited for edge deployments where you need to send large amounts of state accurately. Ultimately, Pandey’s team is building a fabric to scale out intelligence and ensure that cognition states are synchronized across endpoints. Further, they are developing what they call “cognition engines” that provide guardrails and accelerate systems. “Protocols, fabric, cognition engines: These are the three layers that we are building out in the pursuit of distributed super intelligence,” Pandey said. How Cisco solved a big pain point Stepping back from these advanced, next-level systems, Cisco has achieved tangible results with existing AI capabilities. Pandey described a specific pain point with the company’s site reliability engineering (SRE) team. While the company was churning out more and more products and code, the team itself wasn’t growing and was feeling pressure to improve efficiency. Pandey introduced AI agents that automated more than a dozen end-to-end workflows, including continuous integration/continuous delivery (CI/CD) pipelines, EC2 instance spin-ups and Kubernetes cluster deployments. Now, more than 20 agents (some built in-house, some third-party) have access to 100-plus tools via frameworks like Model Context Protocol (MCP), while also plugging into Cisco’s security platforms.
The result: Certain deployments dropped from “hours and hours to seconds”; further, agents have reduced the issues the SRE team was seeing within Kubernetes workflows by 80%. Still, as Pandey noted, AI is a tool like any other. “It does not mean that I have a new hammer and I'm just gonna go around looking for nails,” he said. “You still have deterministic code. You need to marry these two worlds to get the best outcome for the problem that you're solving.” Listen to the podcast to hear more about: How we are now enabling a new paradigm of non-deterministic computing. How Cisco bumped error detection capabilities in large networks from 10% to 100%. How Pandey named his own AI agent "Arnold Layne" after an early Pink Floyd song. Why the "internet of cognition" must be an open, interoperable effort. How Cisco’s open source project Agntcy addresses discovery, identity and access management (IAM), observability, and evaluation. You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

Adobe’s new Firefly AI Assistant wants to run Photoshop, Premiere, Illustrator and more from one prompt

4/15/2026

Adobe today launched its most ambitious AI offensive to date, unveiling the Firefly AI Assistant — a new agentic creative tool that can orchestrate complex, multi-step workflows across the company's entire Creative Cloud suite from a single conversational interface — alongside a raft of new video, image, and collaboration features designed to position the company at the center of the rapidly evolving AI-powered content creation landscape. The announcements, which also include a new Color Mode for Premiere Pro, the addition of Kling 3.0 video models to Firefly's growing roster of third-party AI engines, and Frame.io Drive — a virtual filesystem that lets distributed teams work with cloud-stored media as though it lived on their local machines — represent Adobe's clearest signal yet that it views agentic AI not as a feature upgrade but as a fundamental reshaping of how creative work gets done. "We want creators to tell us the destination and let the Firefly assistant — with its deep understanding of all the Adobe professional tools and generative tools — bring the tools to you right in the conversation," Alexandru Costin, Vice President of AI & Innovation at Adobe, told VentureBeat in an exclusive interview ahead of the launch. The stakes could hardly be higher. Adobe is fighting to convince Wall Street, creative professionals, and a wave of well-funded AI-native competitors that its decades-old software empire can not only survive the generative AI revolution but lead it. How Adobe turned a research prototype into a 100-tool creative agent The centerpiece of today's announcement is the Firefly AI Assistant, which Adobe describes as a fundamentally new way to interact with its creative tools. Rather than requiring users to manually navigate between Photoshop, Premiere, Illustrator, Lightroom, Express, and other apps — selecting the right tool for each step of a complex project — the assistant lets creators describe an outcome in natural language. 
The agent then figures out which tools to invoke, in what order, and executes the workflow. The assistant is the productized version of Project Moonlight, a research prototype Adobe first previewed at its annual MAX conference in the fall of 2025 and subsequently refined through a private beta. "This is basically [Project] Moonlight," Costin confirmed to VentureBeat. "We started with all the learnings from Moonlight, and we engaged with customers. We looked internally. We evolved that architecture to make it more ambitious." Under the hood, Adobe says it has assembled roughly 100 tools and skills that the assistant can call upon, spanning generative image and video creation, precision photo editing, layout adaptation, and even stakeholder review through Frame.io. The system is built around a single conversational interface inside the Firefly web app where users describe what they want and the assistant maintains context across sessions. Pre-built Creative Skills — purpose-built, multi-step workflow templates such as portrait retouching or social media asset generation — can be run from a single prompt and customized to match a creator's own style. The assistant also learns a creator's preferred tools, workflows, and aesthetic choices over time, and understands the content type being worked on — image, video, vector, brand assets — to make context-aware decisions. Crucially, outputs use native Adobe file formats — PSD, AI, PRPROJ — meaning users can take any result into the corresponding flagship app for manual, pixel-level refinement at any point. "We always imagine this continuum where you can have complete conversational edits and pixel-perfect edits, and you can decide, as a creative, where you want to land," Costin said. The Firefly AI Assistant will enter public beta in the coming weeks, though Adobe did not specify an exact date. 
Why Wall Street is watching Adobe's AI pricing model so closely For a company whose AI monetization story has faced persistent skepticism from investors, the pricing structure of the Firefly AI Assistant will be closely watched. Costin told VentureBeat that, at launch, using the assistant will require an active Adobe subscription that includes the relevant apps — meaning users who want the agent to invoke Photoshop cloud capabilities, for instance, will need an entitlement that includes the Photoshop SKU. Generative actions will consume the user's existing pool of generative credits, consistent with how Firefly credits work across the rest of Adobe's platform. "To use some of these cloud capabilities from Photoshop and other apps, you need to have a subscription that includes access to the Photoshop SKU," Costin explained. "You'll be consuming your credits when you use generative features." He acknowledged, however, that the model could evolve: "As we better understand the value of this — and the costs of operating the brain, the conversation engine — things might change." The question of whether Adobe can convert AI enthusiasm into meaningful revenue growth is anything but theoretical. When Adobe reported its most recent quarterly results in March, it touted 10% year-over-year revenue growth to $6.4 billion and disclosed that annual recurring revenue from AI standalone and add-on products had reached $125 million — a figure CEO Shantanu Narayen projected would double within nine months. Adobe adds Chinese AI video models to Firefly, raising commercial safety questions Alongside the assistant, Adobe is expanding Firefly's roster of third-party AI models to include Kling 3.0 and Kling 3.0 Omni, two video generation models developed by Kuaishou, the Chinese technology company. 
Kling 3.0 focuses on fast, high-quality production with smart storyboarding and audio-visual sync, while the Omni variant adds professional controls for shot duration, camera angle, and character movement across multi-shot sequences. The additions bring Firefly's model count to more than 30, joining Google's Nano Banana 2 and Veo 3.1, Runway's Gen-4.5, Luma AI's Ray3.14, Black Forest Labs' FLUX.2[pro], ElevenLabs' Multilingual v2, and others. When asked whether Adobe had concerns about integrating a model from a Chinese tech company given the current geopolitical climate, Costin was direct: "We think choice is what we want to offer our customers." He explained that Adobe's strategy distinguishes between its own commercially safe, first-party Firefly models — trained on licensed Adobe Stock imagery and public domain content — and third-party partner models, which carry different commercial safety profiles. "For some use cases, like ideation, non-production use cases, we got requests from customers to support some external models," Costin said. "If I'm in ideation, I might be more flexible with commercial safety. When I go into production, I’d want to have a model that gives you more confidence." This raises an important nuance for the agentic era. When the Firefly AI Assistant autonomously selects which model to use for a given task, the commercial safety guarantees may vary depending on which engine it invokes. Costin pointed to Adobe's Content Credentials system — the metadata-and-fingerprinting framework developed through the Content Authenticity Initiative — as the mechanism for maintaining transparency. "The agentic power — and the fact that the assistant has access to all of those models — means it could decide to use a model that carries different content credentials," he acknowledged. "But with the transparency of content credentials, the user will know how a particular piece of content was created and can decide whether that's commercially safe or not." 
Adobe offers commercial indemnity for its first-party Firefly models but applies different indemnity levels for third-party models — a distinction that enterprise buyers, in particular, will need to carefully evaluate. Inside Adobe's active collaboration with Nvidia on long-running AI agent infrastructure Adobe's agentic ambitions also intersect with its strategic partnership with Nvidia, announced earlier this year at Nvidia’s GTC conference. When asked whether the Firefly AI Assistant's agentic capabilities are built on NVIDIA's agent toolkit and NeMo infrastructure, Costin revealed that the collaboration is active but has not yet made it into a shipping product. "We're in active discussions — investigating not only Nemotron," Costin said. "They have this technology called Open Shell and Nemo Claw, which give us the ability to efficiently run long-running agentic workflows in a sandboxed environment." He said the technology would become increasingly important as Adobe pushes the assistant to handle longer, more autonomous creative tasks — but cautioned that "it's not shipping yet. It's being actively explored." For Nvidia, which is building an ecosystem of enterprise AI agent platforms with partners like Adobe, Salesforce, and SAP, the partnership could eventually serve as a high-profile proof point for its agent infrastructure stack in the creative vertical. For Adobe, the ability to run complex, long-duration agentic workflows efficiently and securely in sandboxed environments could be the technical foundation that separates the Firefly AI Assistant from lighter-weight chatbot integrations offered by competitors. The partnership also signals Adobe's recognition that the computational demands of agentic AI — where a single user request may trigger dozens of model calls and tool invocations — require infrastructure partnerships that go well beyond what a software company can build alone. 
Premiere Pro's new color grading mode and the tools Adobe is shipping today Beyond the headline AI assistant announcement, Adobe's broader set of updates reflects a company trying to strengthen its position across every phase of the content creation pipeline. Color Mode in Premiere Pro may be the most significant near-term upgrade for working editors. Entering public beta today, Color Mode is described as a first-of-its-kind color grading experience built specifically for the way editors — rather than dedicated colorists — think and work. Adobe notes that it was developed through an extensive private beta with hundreds of working editors, and that participants reported they "actually enjoy color grading" — a sentiment suggesting Adobe may have found a way to democratize one of post-production's most intimidating disciplines. General availability is expected later in 2026. The Firefly Video Editor gains audio upgrades including the Enhance Speech feature migrated from Premiere and Adobe Podcast, direct Adobe Stock integration with access to more than 800 million licensed assets, and simple color adjustment controls with intuitive sliders and one-click looks. On the image editing front, Adobe introduced Precision Flow, which generates a range of semantic variations from a single prompt and lets users browse them via an interactive slider — a novel approach that Costin described as "the best slider-based control mixed with the best semantic understanding of not only the existing scene, but what the scene could be." AI Markup complements this by letting users draw directly on images to specify where and how edits should be applied. After Effects 26.2 adds an AI-powered Object Matte tool that dramatically accelerates rotoscoping and masking — create accurate mattes of moving subjects with a hover and click, refine with a Quick Selection brush, and perfect edges with a Refine Edge tool. 
Frame.io Drive wants to kill the shipped hard drive and make cloud media feel local Rounding out the announcements, Frame.io Drive addresses one of the most persistent pain points in distributed video production: getting media from point A to point B without losing hours — or days — to downloads, syncing, and shipped hard drives. Frame.io Drive is a desktop application that mounts Frame.io projects to a user's computer so media appears in Finder or Explorer and behaves like local files. The underlying technology, called Frame.io Mounted Storage, streams media on demand as applications request it, while local caching ensures smooth playback. The product builds on streaming technology provided by Suite Studios, and the real-time file access capability is included with every Frame.io account. Adobe emphasized that all content lives solely within Frame.io and is never shared with third parties. The move positions Frame.io not just as a review-and-approval tool at the end of the production pipeline but as the central media layer from the very beginning of a project — from first capture through final delivery. If successful, the strategy could significantly deepen Adobe's lock-in with professional video teams by making Frame.io the single source of truth for distributed productions. Frame.io Drive and Mounted Storage will roll out in phases, with Enterprise customers gaining access starting today and accounts on other plans following shortly. Others can join a waitlist. Adobe's biggest challenge isn't building the AI — it's convincing creators to trust it Taken together, today's announcements paint a picture of a company executing aggressively across multiple fronts — but also one that is navigating a complex moment. Adobe first introduced Firefly in March 2023 as a family of generative AI models focused on image and text effects, with a strong emphasis on commercial safety through training on licensed Adobe Stock content. 
In the two years since, the company has rapidly expanded into video generation, multi-model access, and now agentic workflows — a trajectory that mirrors the broader industry's shift from standalone AI features to AI-native systems. But the competitive field has grown dramatically. Runway, Pika, and a host of AI-native video generation startups have captured mindshare among creators. Canva has aggressively integrated AI into its design platform. And the emergence of powerful foundation models from OpenAI, Google, and Anthropic — the latter of which Adobe says it will integrate with Firefly AI Assistant capabilities — means the barrier to building creative AI tools has never been lower. Adobe is also navigating these product ambitions against a complex corporate backdrop: the impending departure of CEO Shantanu Narayen, an actively exploited zero-day vulnerability in Acrobat Reader (CVE-2026-34621) that had been used by hackers for months before being patched this week, a U.K. antitrust investigation over cancellation fees, and a recent $75 million lawsuit settlement. Adobe's response, articulated clearly through today's launches, is to lean into what it believes is its deepest moat: the integration of AI into a set of professional-grade, category-leading applications that no startup can replicate overnight. Costin framed the agentic transition as empowering rather than threatening to creative professionals, comparing Creative Skills to a next-generation version of Photoshop Actions — the macro-recording feature that has long allowed power users to automate repetitive tasks. "We want to help our customers become — from the ones doing all the work — to be creative directors, doing some of the work, but most importantly, guiding the assistant in executing some of those creative visions," he said. It is a compelling pitch — and, in its own way, a revealing one. 
For three decades, Adobe made its fortune by selling the tools that turned creative vision into finished pixels. Now it is asking its customers to let an AI agent handle more of that translation, trusting that the human role will shift from operating the tools to directing the outcome. Whether creators embrace that bargain — and whether Wall Street rewards it — will determine not just Adobe's trajectory but the shape of an entire industry learning to create alongside machines.

Traza raises $2.1 million led by Base10 to automate procurement workflows with AI

4/15/2026

For decades, procurement has been the back office that enterprise software forgot. Billions of dollars flow through vendor negotiations, purchase orders, and supplier communications every year at the largest manufacturers and construction companies in the country — and the vast majority of that work still runs on email threads, spreadsheets, and phone calls. Traza, a newly launched startup headquartered in New York, believes the moment has arrived to change that. The company announced today the close of a $2.1 million pre-seed round led by Base10 Partners, with participation from Kfund, a16z scouts, Clara Ventures, Masia Ventures, and a roster of angel investors including Pepe Agell, who scaled Chartboost to 700 million monthly users before its acquisition by Zynga. The funding is modest by Silicon Valley standards. But Traza's pitch is anything but incremental: the company deploys AI agents that don't just recommend procurement actions — they execute them autonomously, handling vendor outreach, request-for-quote generation, order tracking, supplier communications, and invoice processing without continuous human supervision. "AI is redesigning the procurement category from the ground up," said Silvestre Jara Montes, Traza's CEO and co-founder, in an exclusive interview with VentureBeat. "This wave of AI won't just build procurement software — it will rebuild how procurement works." Why procurement contracts silently lose millions after the ink dries The market Traza is targeting is enormous and, by the company's framing, spectacularly underserved. The procurement software market alone exceeds $8 billion and grows at roughly 10% annually. But the real cost sits in the labor — the armies of people, agencies, and ad hoc workarounds required to actually run procurement operations at scale. Most enterprises meaningfully engage with only their top 20% of suppliers. 
The remaining 80% — the vendor outreach, order tracking, invoice reconciliation, and compliance monitoring — goes largely unmanaged. Research from World Commerce & Contracting and Ironclad finds that organizations lose an average of 11% of total contract value after agreements are signed, a phenomenon described as "post-signature value leakage." As Tim Cummins, President of WorldCC, put it: "The research shows that the 11% value gap is not caused by poor negotiation, but by how contracts are managed after signature." For a large enterprise with $500 million in annual contracted spend, that represents $55 million vanishing each year — not from bad deals, but from the operational void between what gets agreed at the negotiating table and what actually gets executed on the ground. Missed savings, unauthorized changes, and poor renewal planning are responsible for the biggest losses. Jara Montes argues that Traza sits precisely in this gap. "The 11% spans commercial, operational, and compliance leakage. We own the operational layer — and that's where the most recoverable value sits," he said. "Supplier tail management that never happens, RFQ processes skipped because someone ran out of bandwidth, invoice discrepancies that slip through unnoticed. That's where contracts bleed value after signing, and that's exactly what we automate." The numbers from Traza's early deployments, while nascent, are striking: the company claims a 70% reduction in human hours spent on procurement tasks and procurement cycles running three times faster than manual baselines. How AI agents crossed the line from procurement copilot to autonomous worker To understand what makes Traza's approach different, it helps to understand what "AI for procurement" has meant until now. For the past several years, the term largely described dashboards, analytics layers, and recommendation engines that surfaced insights but left every decision and action in a human's hands. 
Products from incumbents like SAP Ariba and Coupa — as well as newer entrants like Zip, Fairmarkit, and Tonkean — have layered AI capabilities on top of existing systems of record. But the gap between piloting AI and achieving production-scale impact remains stark, with 49 percent of procurement teams running pilots but only 4 percent reaching meaningful deployment. Traza's bet is that 2026 represents an inflection point. AI agents now possess the multi-step reasoning, tool use, and contextual memory required to execute full procurement workflows autonomously — from vendor discovery through invoice processing. The company frames this not as an upgrade to existing procurement software, but as an entirely new product category. "The incumbents built systems of record. They organize procurement data and they've never executed procurement work — and their AI additions don't fundamentally change that," Jara Montes said. "What they're shipping is a recommendation layer on the same underlying architecture. A human still has to act on every suggestion. We replace the operational layer entirely." Industry data supports the thesis that enterprises are hungry for this shift. According to the 2025 Global CPO Survey from EY, 80 percent of global chief procurement officers plan to deploy generative AI in some capacity over the next three years, and 66 percent consider it a high priority over the next 12 months. A 2025 ABI Research survey found that 76% of supply chain professionals already see autonomous AI agents as ready to handle core tasks like reordering, supplier outreach, and shipment rerouting without human intervention — and early deployments are demonstrably reducing supply chain operational costs by 20 to 35%. Inside the workflow: what Traza's AI does and where humans still make the call In a typical deployment, Traza's AI agent takes over the operational labor that currently lives in inboxes, spreadsheets, and manual follow-up chains. 
In a standard RFQ workflow, the agent identifies suitable suppliers, drafts and sends the request for quotes, monitors supplier responses, follows up automatically when responses lag, parses incoming quotes regardless of their format, and builds a structured comparison table ready for a human decision-maker. The key design principle is deliberate: humans remain in the loop at critical junctures. "At critical steps — approving a purchase order, flagging a compliance issue, committing spend above a threshold — a human is always in the loop," Jara Montes explained. "That's not a limitation, it's the design. It's how you maintain the auditability enterprises require while moving faster than any manual process could. You earn expanded autonomy over time, as trust is built and results compound." When asked about the risk of AI errors — a wrong purchase order or a missed compliance check that could prove costly — Jara Montes was direct: "Anything with meaningful financial or compliance exposure requires human approval before it executes — that's non-negotiable and baked into the architecture. Below those thresholds, the agent acts autonomously and logs everything." He added a point that reveals a subtler product insight: "Most procurement operations today are a black box — nobody has a clear picture of what's happening across the supplier tail. We make it legible." In other words, the transparency the AI agent provides may itself be a product — giving procurement leaders visibility they have never had into the long tail of supplier relationships that most enterprises simply ignore. How Traza plugs into legacy enterprise systems without ripping them out One of the recurring challenges for any enterprise AI startup is the integration question: How do you plug into the deeply entrenched, often decades-old technology stacks that large manufacturers and construction companies rely on? Traza's answer is to sit on top of existing systems rather than replace them. 
"We connect via API or direct integration into whatever the customer already runs — ERPs, email, supplier portals. We have reach across more than 200 enterprise tools," Jara Montes said. "We don't rip out their system, we sit on top of them." The go-to-market motion mirrors this pragmatism. Instead of attempting a big-bang deployment, Traza runs a two-to-three-month proof of value focused on a single, specific workflow. Integrations are built at the key steps that matter for that particular use case, then expanded as the scope of the engagement grows. "We don't try to connect everything upfront — we compound integrations as we expand scope within each account," Jara Montes said. "And every integration we build compounds across customers too. Each new deployment makes the next one faster." Throughout the process, the company works side by side with the customer's team, managing complexity and helping them transition into a new way of operating. It is a notably high-touch approach for a company selling automation. The company is already working with large manufacturers and construction companies and says they are paying, though it declines to name them publicly. "We want to earn the right to grow inside each account, not land a pilot that goes nowhere," Jara Montes said. "That's how you build something that actually sticks in enterprise." Traza bets that vertical depth in physical industry will beat horizontal AI platforms Traza enters a market that is rapidly heating up. The leading AI procurement solutions include platforms from Coupa, Ivalua, SAP Ariba, Zip, Zycus, and Fairmarkit. Keelvar provides autonomous sourcing bots capable of launching RFQs, collecting bids, and recommending optimal awards, while Tonkean offers a no-code orchestration platform using NLP and generative AI to streamline procurement intake and tail-spend management. 
Against this crowded field, Jara Montes draws a sharp distinction between horizontal automation tools and Traza's focus on physical industry. "We're built specifically for the physical industry, where supplier relationships, compliance requirements, and workflow complexity are categorically different from software procurement," he said. "A generic agent doesn't survive contact with how procurement actually works in manufacturing or construction. Specificity is the moat." The competitive dynamics with major incumbents are perhaps even more consequential. SAP Ariba, Coupa, and their peers have massive installed bases and deep enterprise relationships. Jara Montes frames their AI initiatives as surface-level additions to legacy architectures — but whether Traza can convert that framing into market share at scale, especially given the gravitational pull of existing vendor relationships, remains the central strategic question. Beneath Traza's product pitch sits a deeper strategic thesis about compounding data advantages. The company describes a two-layered learning architecture: at the agent level, Traza gets smarter across every deployment by absorbing supplier behavior patterns, RFQ response dynamics, pricing anomalies, and workflow edge cases. At the data level, each customer's information stays fully isolated. "What we're building is deep operational knowledge of how procurement actually runs in the physical industry — not how it's supposed to run according to an RFP, but how it really runs, with all the exceptions and workarounds," Jara Montes said. "That's extraordinarily hard to replicate if you're starting from scratch, and it gets harder to catch up with the more deployments we have." 
Three Spanish founders, one fellowship, and a plan to rewire industrial procurement Traza was co-founded by three Spanish entrepreneurs — Silvestre Jara Montes, Santiago Martínez Bragado, and Sergio Ayala Miñano — who came to the United States through the Exponential Fellowship, a program that brings Europe's top technical talent to the U.S. to build companies at the frontier of AI. Their backgrounds span both sides of the problem Traza is trying to solve. Jara Montes worked at Amazon and CMA CGM — one of the world's largest shipping groups — at the intersection of operations strategy and supply chain optimization. Martínez Bragado built and deployed agentic AI at Clarity AI before joining Concourse (backed by a16z, Y Combinator, and CRV) as Founding AI Engineer. Ayala Miñano comes from StackAI, one of the fastest-growing enterprise AI platforms in San Francisco, where he was a Founding Engineer. None of the founders carry the title of Chief Procurement Officer, a gap that the company acknowledges has occasionally surfaced in buyer conversations. Jara Montes's response is characteristically direct: "Our work is the answer. The results we're generating move that conversation quickly." He noted that the company has senior procurement leaders serving as advisors who have run procurement at the scale of its target customers. Base10 Partners, the lead investor, is a San Francisco-based venture capital firm that invests in companies automating sectors of what it calls "the Real Economy." Its portfolio includes Notion, Figma, Nubank, Stripe, and Aurora Solar. Rexhi Dollaku, General Partner at Base10, framed the investment in emphatic terms: "Supply chain and procurement is one of the largest, most underautomated markets in the Real Economy. AI agents are finally capable of doing the work, not just assisting with it." The supporting cast of investors reinforces the immigrant-founder narrative. 
Clara Ventures — founded by the executives behind Olapic's $130 million exit — specifically invests in driven foreign founders building in the United States, and Agell adds operational credibility from building Chartboost into a $100 million revenue business in under three years as a Spanish founder in Silicon Valley. Why $2.1 million may stretch further than it looks for an enterprise AI startup At $2.1 million, this is a deliberately small round for a company selling to large enterprises with notoriously long procurement cycles. Jara Montes argues it goes further than it appears for structural reasons. "We leverage Europe as a tech talent hub, where we have a deep network of exceptional engineers — people who want to work at the frontier of AI but have far fewer opportunities to do so than their US counterparts," he said. "We're not just lean — we're built to outcompete on capital efficiency while others are burning through runway trying to hire in San Francisco." The go-to-market motion is designed for speed to revenue. Proofs of value are scoped, time-bounded, and converted to paying partnerships. The company says it is not running 18-month enterprise sales cycles before seeing a dollar. The milestone for the next raise is explicit: more paying customers, meaningfully stronger annual recurring revenue, and a repeatable sales motion that makes the seed round, as Jara Montes put it, "an obvious conversation." Looking ahead, he outlined an ambitious three-year target: 20 to 30 large industrial enterprises in the U.S. and Europe running Traza across their procurement operations, with over a billion dollars in procurement spend flowing through the platform. 
Whether that vision is achievable depends on several interlocking variables — the pace at which AI agent capabilities continue to improve, the speed of enterprise adoption in a traditionally conservative buyer segment, and Traza's ability to navigate the competitive gauntlet of incumbents adding AI features and well-funded startups attacking adjacent workflows. But the underlying math may be on Traza's side. In procurement, the money that disappears does not look like waste. It vanishes into inefficiency, missed obligations, unmanaged risks, and forgotten commitments — the kind of silent losses that no one tracks because no one has the bandwidth to track them. The traditional mandate of procurement, as currently configured, ends where the value gap begins: at signature. Traza is building an AI workforce that picks up where the humans leave off. For an industry that has spent decades losing $55 million at a time to the back office nobody watches, that might be precisely the point.

Anthropic’s Claude Managed Agents gives enterprises a new one-stop shop but raises vendor 'lock-in' risk

4/14/2026

Anthropic announced a new platform last week, Claude Managed Agents, that aims to cut out the more complex parts of AI agent deployment for enterprises and competes with existing orchestration frameworks. Claude Managed Agents is also an architectural shift: enterprises, already burdened with orchestrating an increasing number of agents, can now choose to embed the orchestration logic in the AI model layer. While this comes with some potential advantages, such as speed (Anthropic proposes its customers can deploy agents in days instead of weeks or months), it also hands more control over the enterprise's AI agent deployments and operations to the model provider — in this case, Anthropic — potentially resulting in greater "lock in" for the enterprise customer, leaving them more subject to Anthropic's terms, conditions, and any subsequent platform changes. But maybe that is worth it for your enterprise, as Anthropic further claims that its platform “handles the complexity” by letting users define agent tasks, tools and guardrails with a built-in orchestration harness, all without the enterprise needing to build its own sandboxed code execution, checkpointing, credential management, scoped permissions and end-to-end tracing. The framework manages state, execution graphs and routing and brings managed agents to a vendor-controlled runtime loop. Even before the release of Claude Managed Agents, new directional VentureBeat research showed that Anthropic was gaining traction at the orchestration level as enterprises adopted its native tooling. Claude Managed Agents represents a new attempt by the firm to widen its footprint as the orchestration method of choice for organizations.

Anthropic is surging in orchestration interest

Orchestration has emerged as an important segment for enterprises to address as they scale AI systems and deploy agentic workflows.
VentureBeat directional research of several dozen firms for the first quarter of 2026 found that enterprises mostly chose existing frameworks, such as Microsoft’s Copilot Studio/Azure AI Studio, with 38.6% of respondents in February reporting using Microsoft’s platform. VentureBeat surveyed 56 organizations with more than 100 employees in January and 70 in February. OpenAI closely followed at 25.7%. Both showed strong growth between the first two months of the year. Anthropic, driven by increased interest in its offerings, such as Claude Code, over the past year, is putting up a fight. Adoption of the Anthropic tool-use and workflows API increased from 0% to 5.7% between January and February. This tracks closely with the growing adoption of Anthropic’s foundation models, showing that enterprises using Claude turn to the company’s native orchestration tooling instead of adding a third-party framework. While VentureBeat surveyed before the launch of Claude Managed Agents, we can extrapolate that the new tool will build on that growth, especially if it promises a more straightforward way to deploy agents.

Collapsing the external orchestration layer

Enterprises may find a streamlined, internal harness for agents compelling, but it does mean giving up certain controls. Session data is stored in a database managed by Anthropic, increasing the risk that enterprises become locked into a system run by a single company. This may be less desirable for some firms and conflict with their desire to move away from the locked-in software-as-a-service (SaaS) applications in their current stacks, a move many hope AI will facilitate. The specter of vendor lock-in means agent execution becomes more model-driven rather than directed by the organization and happens in an environment enterprises don’t fully control, which makes behavior harder to guarantee.
It also opens the possibility of giving agents conflicting instructions, especially if the only way for users to exert any control over agents is to prompt them with more context. Agents could have two control planes: one defined by the enterprise’s orchestration system through instructions and the other as an embedded skill from the Claude runtime. This could pose an issue for highly sensitive and regulated workflows, such as financial analysis or customer-facing tasks.

Pricing, control and competitive set

Balancing control with ease is one thing; enterprises must also consider the cost structure of Claude Managed Agents. Claude Managed Agents introduces a hybrid pricing model that blends token-based billing with a usage-based runtime fee. This makes Managed Agents more dynamic, though less predictable, when determining cost structures. Enterprises will be charged a standard rate of $0.08 per hour when agents are actively running. For example, at $0.08 per hour, a one-hour session could cost up to $37 to process 10,000 support tickets, depending on how long each agent runs and how many steps it takes to complete a task. Microsoft, currently the leader according to VentureBeat's directional survey, has several orchestration offerings. Copilot Studio uses a capacity-based billing structure, so enterprises pay for blocks of interactions between users and agents rather than the number of steps an agent takes. Microsoft's approach tends to be more predictable than Anthropic's pricing plan: Copilot Studio starts at $200 per month for 25,000 messages. Compared with similar competitors like OpenAI's Agents SDK, the picture becomes murky. Agents SDK is technically free to use as an open-source project. However, OpenAI bills for the underlying API usage. Agents built and orchestrated with Agents SDK using GPT-5.4, for example, will cost $2.50 per 1 million input tokens and $15 per 1 million output tokens.
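To make the three billing shapes described above easier to compare, here is a minimal Python sketch. The $0.08 hourly runtime rate, the $200 per 25,000 messages block, and the $2.50 / $15 per million token prices are the figures quoted in this article. Everything else is an assumption for illustration: the workload numbers are hypothetical, the function names are invented, and the sketch assumes that additional Copilot Studio message blocks cost the same as the first one, which the article does not state.

```python
import math

# Figures quoted in the article.
RUNTIME_RATE_PER_HOUR = 0.08      # Claude Managed Agents active-runtime fee
BLOCK_PRICE = 200.0               # Copilot Studio starting price per month
BLOCK_MESSAGES = 25_000           # messages included in that starting price
GPT_INPUT_PER_M = 2.50            # GPT-5.4 price per 1M input tokens
GPT_OUTPUT_PER_M = 15.00          # GPT-5.4 price per 1M output tokens


def runtime_fee(active_hours: float, rate: float = RUNTIME_RATE_PER_HOUR) -> float:
    """Usage-based runtime component: hours actively running times hourly rate."""
    return active_hours * rate


def token_cost(input_tokens: int, output_tokens: int,
               input_per_m: float, output_per_m: float) -> float:
    """Token-based component: tokens are billed per million."""
    return (input_tokens / 1_000_000) * input_per_m \
        + (output_tokens / 1_000_000) * output_per_m


def capacity_block_cost(messages: int) -> float:
    """Capacity-based billing: pay for whole blocks of messages.

    Assumption: every block costs the same as the first one.
    """
    blocks = math.ceil(messages / BLOCK_MESSAGES)
    return blocks * BLOCK_PRICE


# Hypothetical workload, not taken from the article.
hours, tokens_in, tokens_out, messages = 10, 2_000_000, 500_000, 30_000

tokens_only = token_cost(tokens_in, tokens_out, GPT_INPUT_PER_M, GPT_OUTPUT_PER_M)
print(f"Runtime fee for {hours} active hours: ${runtime_fee(hours):.2f}")
print(f"Token-only bill at the quoted GPT-5.4 rates: ${tokens_only:.2f}")
print(f"Capacity blocks for {messages} messages: ${capacity_block_cost(messages):.2f}")
```

The sketch shows why the hybrid model is harder to forecast: both the runtime fee and the token bill scale with how long an agent runs and how many steps it takes, while the capacity model only changes when usage crosses a block boundary.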
The enterprise decision

Claude Managed Agents does give a reprieve to enterprises that find deploying production agents too complicated. It reduces their engineering overhead while adding speed and simplicity in a fast-changing enterprise environment. But that comes with a trade-off: less control, observability and portability, and the risk of further vendor lock-in. Anthropic just made a case for why its ecosystem is becoming not just the foundation model of choice for enterprises, but also the orchestration infrastructure. That makes it more imperative for enterprises to weigh ease of deployment against reduced control.

Google leaders including Demis Hassabis push back on claim of uneven AI adoption internally

4/14/2026

A viral post on X from veteran programmer and former Google engineer Steve Yegge set off a rhetorical firestorm this week, drawing sharp public rebuttals from some of Google’s most prominent AI leaders and reopening a sensitive question for the company: how deeply are its own engineers really using the latest generation of AI coding tools? The debate began after Yegge summarized what he said was the view of his friend, a current and longtime Google employee (or Googler), who claimed the company's internal AI adoption looks much more ordinary and less cutting-edge than outsiders might expect. Yegge said his Googler friend claimed Google engineering mirrors an “average” industry pattern of a 20%-60%-20% split: a small group of outright AI refusers (20%), a much larger middle still relying mainly on simpler chat and coding-assistant workflows (60%), and another small group of AI-first, cutting-edge engineers using agentic tools extensively and mastering them (20%). A VentureBeat search of X using its parent company’s AI assistant Grok found that Yegge’s April 13 post spread quickly, topping 4,500 likes, 205 quote posts, 458 replies and 1.9 million views as of April 14. We've reached out to Google for comment on the claims and will update when we receive a response.

A veteran, outspoken Googler voice

Why did the opinion of Yegge's unnamed Googler friend land so hard? In part because Yegge is not just another commentator taking shots from the sidelines. He spent about 13 years at Google after earlier stints at Amazon and GeoWorks, later joined Grab, and then became head of engineering at Sourcegraph in 2022. He has long been known in software circles for widely read essays on programming and engineering culture, and for an earlier internal Google memo that accidentally became public in 2011 and drew broad media attention. That history helps explain why engineers and executives still take his critiques seriously, even when they reject them.
Yegge has built a reputation over many years as a blunt insider-outsider voice on software culture, someone with enough standing in the industry that his judgments can travel fast, especially when they touch nerves inside big technology companies. Wikipedia’s summary of his career notes his long Google tenure and the outsized attention his blog posts and prior Google critiques have received. Unpacking Yegge's friend's argument In this case, Yegge’s argument was not simply that Google uses too little AI. It was that the company’s adoption may be uneven, culturally constrained and less transformed than its branding implies. His friend supposedly argued that some Googlers could not use Anthropic’s Claude Code because it was framed as “the enemy,” and that Gemini was not yet sufficient for the fullest agentic coding workflows. He contrasted Google with what he described as a smaller set of companies moving much faster. Pushback from Hassabis and current Googlers The first major pushback came from Demis Hassabis, the co-founder and CEO of Google DeepMind, who replied directly and forcefully. “Maybe tell your buddy to do some actual work and to stop spreading absolute nonsense. This post is completely false and just pure clickbait,” Hassabis wrote. Other Google leaders followed with lengthier defenses. Addy Osmani, a director at Google Cloud AI, wrote that Yegge’s account “doesn’t match the state of agentic coding at our company.” He added, “Over 40K SWEs use agentic coding weekly here.” Osmani said Googlers have access to internal tools and systems including “custom models, skills, CLIs and MCPs,” and pushed back on the idea that Google employees are sealed off from outside models, writing that “folks can even use @AnthropicAI’s models on Vertex” and concluding that “Google is anything but average.” Other current Google employees reinforced that message. 
Jaana Dogan, a software engineer at Google, wrote in a quote tweet: “Everyone I work with uses @antigravity like every second of the day,” later following up with another X post stating: "Unpopular opinion: If you think tokens burned is a productivity metric, no one should take you seriously. Imagine you are a top 0.0001% writer and they are only counting the tokens you produce." Paige Bailey, a DevX engineering lead at Google DeepMind, said teams had agents “running 24/7.” Several other Google and DeepMind figures also challenged Yegge’s characterization, some disputing the factual basis of his claims and others suggesting he lacked visibility into current internal usage. Yegge's rebuttal Yegge, for his part, did not retreat. In a follow-up to Hassabis, he wrote, “I’m not trying to misrepresent anyone,” but argued that by his own standard for advanced AI adoption, Google still does not appear to be doing especially well. He pointed to token usage and the replacement of older development habits with truly agentic workflows as the more meaningful benchmark, and said he would be willing to retract his criticism if Google could show its engineers were operating at that level. AI adoption vs. AI transformation That leaves the core dispute unresolved, but clearer. This is less a fight over whether Google engineers use AI at all than a fight over what should count as meaningful adoption. Googlers are pointing to scale, weekly usage and the availability of internal and external tools. Yegge is arguing that those measures may capture broad exposure without proving a deeper change, an AI transformation, in how engineering work gets done. The clash reflects a wider industry split between visible usage metrics and more transformative, power-user behavior. For Google, the subject is especially sensitive. 
Yegge has criticized the company before, including in a 2018 essay explaining why he left, where he argued Google had become too risk-averse and had lost much of its ability to innovate. If his latest critique had come from a lesser-known poster, it might have faded. Coming from a former longtime Google engineer with a record of memorable public criticism, it instead drew direct responses from some of the company’s top AI figures — and turned a single post into a broader public argument about whether Google’s AI leadership is as deep internally as it looks from the outside.

Microsoft launches MAI-Image-2-Efficient, a cheaper and faster AI image model

4/14/2026

Microsoft today launched MAI-Image-2-Efficient, a lower-cost, higher-speed variant of its flagship text-to-image model that the company says delivers production-ready quality at nearly half the price. The release, available immediately in Microsoft Foundry and MAI Playground with no waitlist, marks the fastest turnaround yet from Microsoft's in-house AI superintelligence team, and the clearest signal that Redmond is serious about building a self-sufficient AI stack that doesn't depend on OpenAI. The new model is priced at $5 per million text input tokens and $19.50 per million image output tokens. The text input price matches MAI-Image-2's $5; the image output price is a roughly 41% reduction from the flagship's $33. Microsoft says the model runs 22% faster than its flagship sibling and achieves 4x greater throughput efficiency per GPU, as measured on NVIDIA H100 hardware at 1024×1024 resolution. The company also claims it outpaces competing hyperscaler models (specifically naming Google's Gemini 3.1 Flash, Gemini 3.1 Flash Image, and Gemini 3 Pro Image) by an average of 40% on p50 latency benchmarks. The model is also rolling out across Copilot and Bing, Microsoft said, with additional product surfaces to follow.

Microsoft's two-model strategy borrows a page from the AI pricing playbook

Microsoft is positioning MAI-Image-2-Efficient and its flagship MAI-Image-2 as complementary tools rather than replacements for each other: a tiered pairing designed to cover the full spectrum of enterprise image generation needs. MAI-Image-2-Efficient targets high-volume, cost-sensitive production workloads: product photography, marketing creative, UI mockups, branded asset pipelines, and real-time interactive applications. It handles short-form in-image text like headlines and labels cleanly, according to Microsoft, and is built to operate within the tight latency and budget constraints of batch processing environments.
MAI-Image-2, meanwhile, remains the company's precision instrument: the model you reach for when the brief demands the highest photorealistic fidelity, complex stylization like anime or illustration, or longer, more intricate in-image typography. Microsoft is effectively telling enterprise customers: use the efficient model for your assembly line, and the flagship for your showcase. This approach mirrors pricing strategies that have worked across the AI industry (OpenAI's GPT model tiers, Anthropic's Haiku-Sonnet-Opus lineup, Google's Flash-Pro distinction) but applies it specifically to image generation, a domain where cost-per-image economics can make or break production deployment at scale.

How Microsoft shipped a production-optimized image model in under a month

The speed of this release deserves attention. MAI-Image-2 itself only debuted on MAI Playground on March 19, as VentureBeat previously reported, with broader availability through Microsoft Foundry arriving on April 2 alongside two other new foundation models: MAI-Transcribe-1 (a speech-to-text model supporting 25 languages) and MAI-Voice-1 (an audio generation model). Less than a month later, Microsoft has shipped an optimized production variant. That cadence suggests the MAI Superintelligence team (the research group led by Mustafa Suleyman, CEO of Microsoft AI, that was formed in November 2025) is operating more like a startup shipping iterative products than a traditional corporate research lab publishing papers. When Suleyman wrote in his April 2 blog post that the team was "building Humanist AI" with a focus on "optimizing for how people actually communicate, training for practical use," he appears to have meant it literally: the models aren't just shipping, they're shipping fast enough to have product roadmaps. The early reception for MAI-Image-2 has been notably positive. Decrypt reported in its hands-on review that the model had already reached the No. 3 position on the Arena.ai leaderboard for image generation, trailing only Google and OpenAI. Decrypt's reviewer noted that the model's photorealism was "a real strength" and that its text rendering was "a legitimate highlight" that "handled complex typography with far more consistency than we expected." The review also found that in some direct comparisons, MAI-Image-2 outperformed OpenAI's GPT-Image on image quality and text rendering despite sitting below it on the leaderboard, an observation that underscores how benchmark rankings don't always capture real-world utility. That said, the original model shipped with significant constraints that Decrypt flagged: a 30-second cooldown between generations, a 15-image daily cap in the native UI, only 1:1 aspect ratio output, no image-to-image capabilities, and aggressive content filtering that blocked even innocuous creative prompts. Whether MAI-Image-2-Efficient inherits or relaxes any of these limitations isn't addressed in today's announcement, and enterprise customers accessing the model through the Foundry API will likely face different constraints than playground users.

Inside the fraying Microsoft-OpenAI relationship that made in-house models inevitable

Today's launch cannot be understood in isolation. It arrives at a moment when the relationship between Microsoft and OpenAI, once the defining partnership of the generative AI era, is visibly fraying at the seams. Just yesterday, CNBC reported that OpenAI's newly appointed chief revenue officer, Denise Dresser, sent an internal memo to staff explicitly stating that the Microsoft partnership "has also limited our ability to meet enterprises where they are." The memo reportedly touted OpenAI's new alliance with Amazon Web Services and the Bedrock platform as a key growth driver, describing inbound customer demand as "frankly staggering" since the partnership was announced in late February.
Microsoft added OpenAI to its list of competitors in its annual report in mid-2024. OpenAI, meanwhile, has diversified its cloud infrastructure across CoreWeave, Google, and Oracle, reducing its dependence on Microsoft Azure. The MAI model family is the most tangible expression of Microsoft's side of that strategic uncoupling. When Microsoft can generate production-quality images with its own model at $19.50 per million output tokens, the calculus for continuing to license OpenAI's image models — and paying OpenAI a share of the resulting revenue — shifts dramatically. Every MAI model that reaches production quality is a line item that Microsoft can potentially move off OpenAI's balance sheet and onto its own. The organizational infrastructure to support this shift is already in place. On March 17, as disclosed in communications posted on Microsoft's official blog, CEO Satya Nadella announced a sweeping reorganization that unified the company's consumer and commercial Copilot efforts under a single leadership team, with Jacob Andreou elevated to EVP of Copilot reporting directly to Nadella. Critically, the reorganization also refocused Suleyman's role. As Nadella wrote in his message to employees, the company is "doubling down on our superintelligence mission with the talent and compute to build models that have real product impact, in terms of evals, COGS reduction, as well as advancing the frontier." That phrase — "COGS reduction" — is corporate-speak for reducing the cost of goods sold, and it points directly to the economic motivation behind models like MAI-Image-2-Efficient. Every dollar Microsoft saves by using its own models instead of licensing from partners flows straight to gross margin. Why cheap, fast image generation is the secret ingredient for Microsoft's agentic AI future There's one more dimension that makes today's release strategically significant, and it may be the most important one: the rise of AI agents. 
TechCrunch reported yesterday that Microsoft is testing ways to integrate OpenClaw-like features into Microsoft 365 Copilot, building toward an always-on agent that can execute multi-step tasks over extended periods. The company has also launched Copilot Cowork (an agent that takes actions within Microsoft 365 apps), Copilot Tasks (an agent for completing multi-step personal productivity tasks), and Agent 365 (referenced in Nadella's March reorganization memo). Microsoft is expected to showcase these agentic capabilities at its Build conference in June. In an agentic world — where AI systems don't just answer questions but execute complex workflows autonomously — image generation becomes a primitive that agents call programmatically, not a standalone product that users interact with manually. An enterprise agent building a marketing campaign might need to generate dozens of product images, create social media assets, produce presentation graphics, and iterate on design concepts, all without human intervention at each step. The economics of that workflow are governed entirely by per-token pricing and latency, which is precisely what MAI-Image-2-Efficient optimizes for. If Microsoft's vision for Copilot involves agents that generate images as a routine subtask within larger workflows, those agents need image generation that's fast enough to not create bottlenecks and cheap enough to not blow up cost projections when called thousands of times per day. The 4x efficiency improvement and 41% price cut aren't just nice marketing numbers — they're architectural requirements for the agentic future Microsoft is betting the company on. What Microsoft still hasn't answered about its new image model Several important questions remain unaddressed by today's announcement. Microsoft didn't disclose whether MAI-Image-2-Efficient resolves the aspect ratio limitations and aggressive content filtering that reviewers flagged in the original model. 
The company also didn't specify whether the quality-to-speed tradeoffs involve visible degradation on complex prompts — the announcement describes "production-ready quality" and "flagship quality" interchangeably, but distillation models of any kind typically involve some quality concession. The footnotes in the press release also reveal the narrow conditions under which the benchmark claims were tested: efficiency figures were measured on NVIDIA H100 at 1024×1024 with "optimized batch sizes and matched latency targets," and the latency comparisons against Google models were conducted at p50 (median) rather than p95 or p99, which would capture worst-case performance. Enterprise customers running diverse workloads at varying concurrency levels may see different results. MAI Playground is currently available only in select markets, including the U.S., with EU availability listed as "coming soon." Copilot integration is underway but not complete. And the enterprise API through Foundry, while live, is still in early deployment. But the trajectory is unmistakable. In less than five months since the MAI Superintelligence team was announced, Microsoft has shipped a flagship image model, three additional foundation models, and now a cost-optimized production variant — all while reorganizing its entire Copilot organization, navigating a fracturing relationship with its most important AI partner, and laying the groundwork for agentic AI features that could redefine enterprise productivity. Whether all of that is fast enough to catch Anthropic's momentum, contain OpenAI's drift toward Amazon, and justify a $600 price target is the multi-hundred-billion-dollar question. But for a company that spent the first two years of the generative AI era mostly reselling someone else's technology, Microsoft is now doing something it hasn't done in a long time in AI: shipping its own work, on its own schedule, at its own price — and daring the market to keep up.
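The pricing and latency caveats above are easy to check by hand. The following is a minimal Python sketch, not anything Microsoft published: the two prices come from the announcement as reported here, while the latency samples and the helper names `percent_reduction` and `percentile` are hypothetical and chosen only to show how a median can hide a slow tail.

```python
from statistics import quantiles

# Prices quoted in the article, in USD per million image output tokens.
FLAGSHIP_IMAGE_OUTPUT = 33.00
EFFICIENT_IMAGE_OUTPUT = 19.50


def percent_reduction(old: float, new: float) -> float:
    """Return the price cut as a percentage of the old price."""
    return (old - new) / old * 100


def percentile(samples: list[float], p: int) -> float:
    """Return the p-th percentile (1 to 99) using inclusive interpolation."""
    cuts = quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]


# Hypothetical latency samples in seconds (an assumption, not measured data):
# most requests are fast, a small share is much slower.
latencies = [2.0] * 90 + [3.0] * 5 + [9.0] * 5

cut = percent_reduction(FLAGSHIP_IMAGE_OUTPUT, EFFICIENT_IMAGE_OUTPUT)
print(f"image output price cut: {cut:.1f}%")
print(f"p50 latency: {percentile(latencies, 50):.2f}s")
print(f"p95 latency: {percentile(latencies, 95):.2f}s")
print(f"p99 latency: {percentile(latencies, 99):.2f}s")
```

With these made-up samples the median stays at the fast value while the upper percentiles climb, which is why a p50-only comparison says little about worst-case behavior under load.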

Databricks tested a stronger model against its multi-step agent on hybrid queries. The stronger model still lost by 21%.

4/14/2026

Data teams building AI agents keep running into the same failure mode. Questions that require joining structured data with unstructured content, such as sales figures alongside customer reviews or citation counts alongside academic papers, break single-turn RAG systems. New research from Databricks puts a number on that failure gap. The company's AI research team tested a multi-step agentic approach against state-of-the-art single-turn RAG baselines across nine enterprise knowledge tasks and reported gains of 20% or more on Stanford's STaRK benchmark suite, along with consistent improvement across Databricks' own KARLBench evaluation framework, according to the research. Databricks argues the performance gap between single-turn RAG and multi-step agents on hybrid data tasks is an architectural problem, not a model quality problem. The work builds on Databricks' earlier instructed retriever research, which showed retrieval improvements on unstructured data using metadata-aware queries. This latest research adds structured data sources, such as relational tables and SQL warehouses, into the same reasoning loop, addressing the class of questions enterprises most commonly fail to answer with current agent architectures. "RAG works, but it doesn't scale," Michael Bendersky, research director at Databricks, told VentureBeat. "If you want to make your agent even better, and you want to understand why you have declining sales, now you have to help the agent see the tables and look at the sales data. Your RAG pipeline will become incompetent at that task."

Single-turn retrieval cannot encode structural constraints

The core finding is that standard RAG systems fail when a query mixes a precise structured filter with an open-ended semantic search. Consider a question like "Which of our products have had declining sales over the past three months, and what potentially related issues are brought up in customer reviews on various seller sites?" The sales data lives in a warehouse.
The review sentiment lives in unstructured documents across seller sites. A single-turn RAG system cannot split that query, route each half to the right data source and combine the results. To confirm this is an architecture problem rather than a model quality problem, Databricks reran published STaRK baselines using a current state-of-the-art foundation model. The stronger model still lost to the multi-step agent by 21% on the academic domain and 38% on the biomedical domain, according to the research. STaRK is a benchmark published by Stanford researchers covering three semi-structured retrieval domains: Amazon product data, the Microsoft Academic Graph and a biomedical knowledge base.

How the Supervisor Agent handles what RAG cannot

Databricks built the Supervisor Agent as the production implementation of this research approach, and its architecture illustrates why the gains are consistent across task types. The approach includes three core steps:

Parallel tool decomposition. Rather than issuing one broad query and hoping the results cover both structured and unstructured needs, the agent fires SQL and vector search calls simultaneously, then analyzes the combined results before deciding what to do next. That parallel step is what allows it to handle queries that cross data type boundaries without requiring the data to be normalized first.

Self-correction. When an initial retrieval attempt hits a dead end, the agent detects the failure, reformulates the query and tries a different path. On a STaRK benchmark task that requires finding a paper by an author with exactly 115 prior publications on a specific topic, the agent first queries both SQL and vector search in parallel. When the two result sets show no overlap, it adapts and issues a SQL JOIN across both constraints, then calls the vector search system to verify the result before returning the answer.

Declarative configuration. The agent is not tuned to any specific dataset or task. Connecting it to a new data source means writing a plain-language description of what that source contains and what kinds of questions it should answer. No custom code is required.

"The agent can do things like decomposing the question into a SQL query and a search query out of the box," Bendersky said. "It can combine the results of SQL and RAG, reason about those results, make follow-up queries and then reason about whether the final answer was actually found."

It's not just about hybrid retrieval

The distinction Databricks draws isn't about retrieval technique; it's about architecture. "We almost don't see it as a hybrid retrieval where you combine embeddings and search results, or embeddings and tables," Bendersky said. "We see this more as an agent that has access to multiple tools." The practical consequence of that framing is that adding a new data source means connecting it to the agent and writing a description of what it contains. The agent handles routing and orchestration without additional code. Custom RAG pipelines require data to be converted into a format the retrieval system can read, typically text chunks with embeddings. SQL tables have to be flattened, and JSON has to be normalized. Every new data source added to the pipeline means more conversion work. Databricks' research argues that as enterprise data grows to include more source types, that burden makes custom pipelines increasingly impractical compared to an agent that queries each source in its native format. "Just bring the agent to the data," Bendersky said. "You basically give the agent more sources, and it will learn to use them pretty well."

What this means for enterprises

For data engineers evaluating whether to build custom RAG pipelines or adopt a declarative agent framework, the research offers a clear direction: if the task involves questions that span structured and unstructured data, building custom retrieval is the harder path.
The research found that across all tested tasks, the only things that differed between deployments were instructions and tool descriptions. The agent handled the rest. The practical limits are real but manageable. The approach works well with five to ten data sources. Adding too many at once, without curating which sources are complementary rather than contradictory, makes the agent slower and less reliable. Bendersky recommends scaling incrementally and verifying results at each step rather than connecting all available data upfront. Data accuracy is a prerequisite. The agent can query across mismatched formats, JSON review feeds alongside SQL sales tables, without requiring normalization. It cannot fix source data that is factually wrong. Adding a plain-language description of each data source at ingestion time helps the agent route queries correctly from the start. The research positions this as an early step in a longer trajectory. As enterprise AI workloads mature, agents will be expected to reason across dozens of source types, including dashboards, code repositories and external data feeds. The research argues the declarative approach is what makes that scaling tractable, because adding a new source stays a configuration problem rather than an engineering one. "This is kind of like a ladder," Bendersky said. "The agent will slowly get more and more information and then slowly improve overall."
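To make the parallel-decomposition and self-correction steps described in this article more concrete, here is a minimal Python sketch. It is not Databricks' Supervisor Agent code: the in-memory data, the `sql_tool` and `search_tool` functions and the `answer` loop are all hypothetical stand-ins, and the retry is a simplified reformulated search rather than the SQL JOIN and verification pass the research describes.

```python
import asyncio

# Hypothetical in-memory stand-ins for a SQL warehouse and a vector index.
SALES_TREND = {"kettle": -0.12, "toaster": 0.05, "blender": -0.30}
REVIEWS = {
    "kettle": "lid cracks after a month",
    "toaster": "works as described",
    "blender": "motor overheats on long runs",
}


async def sql_tool(max_trend: float) -> set[str]:
    """Structured filter: products whose sales trend is below max_trend."""
    await asyncio.sleep(0)  # placeholder for a real warehouse round trip
    return {name for name, trend in SALES_TREND.items() if trend < max_trend}


async def search_tool(keywords: list[str]) -> set[str]:
    """Unstructured search: products whose reviews mention any keyword."""
    await asyncio.sleep(0)  # placeholder for a real vector search call
    return {
        name
        for name, text in REVIEWS.items()
        if any(keyword in text for keyword in keywords)
    }


async def answer(max_trend: float, keywords: list[str], fallback: list[str]) -> set[str]:
    # Step 1: issue both tool calls in parallel, then combine the results.
    declining, flagged = await asyncio.gather(
        sql_tool(max_trend), search_tool(keywords)
    )
    overlap = declining & flagged
    if overlap:
        return overlap
    # Step 2: no overlap, so reformulate the search and try once more.
    flagged = await search_tool(fallback)
    return declining & flagged


result = asyncio.run(answer(0.0, ["battery"], ["cracks", "overheats"]))
print(sorted(result))
```

In this toy run the first search term matches no reviews, so the two result sets do not overlap and the loop retries with the fallback terms. A production agent would let the model choose the reformulation instead of taking it as a parameter.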

43% of AI-generated code changes need debugging in production, survey finds

4/14/2026

The software industry is racing to write code with artificial intelligence. It is struggling, badly, to make sure that code holds up once it ships. A survey of 200 senior site-reliability and DevOps leaders at large enterprises across the United States, United Kingdom, and European Union paints a stark picture of the hidden costs embedded in the AI coding boom. According to Lightrun's 2026 State of AI-Powered Engineering Report, shared exclusively with VentureBeat ahead of its public release, 43% of AI-generated code changes require manual debugging in production environments even after passing quality assurance and staging tests. Not a single respondent said their organization could verify an AI-suggested fix with just one redeploy cycle; 88% reported needing two to three cycles, while 11% required four to six. The findings land at a moment when AI-generated code is proliferating across global enterprises at a breathtaking pace. Both Microsoft CEO Satya Nadella and Google CEO Sundar Pichai have claimed that around a quarter of their companies' code is now AI-generated. The AIOps market — the ecosystem of platforms and services designed to manage and monitor these AI-driven operations — stands at $18.95 billion in 2026 and is projected to reach $37.79 billion by 2031. Yet the report suggests the infrastructure meant to catch AI-generated mistakes is badly lagging behind AI's capacity to produce them. "The 0% figure signals that engineering is hitting a trust wall with AI adoption," said Or Maimon, Lightrun's chief business officer, referring to the survey's finding that zero percent of engineering leaders described themselves as "very confident" that AI-generated code will behave correctly once deployed. "While the industry's emphasis on increased productivity has made AI a necessity, we are seeing a direct negative impact. As AI-generated code enters the system, it doesn't just increase volume; it slows down the entire deployment pipeline." 
Amazon's March outages showed what happens when AI-generated code ships without safeguards The dangers are no longer theoretical. In early March 2026, Amazon suffered a series of high-profile outages that underscored exactly the kind of failure pattern the Lightrun survey describes. On March 2, Amazon.com experienced a disruption lasting nearly six hours, resulting in 120,000 lost orders and 1.6 million website errors. Three days later, on March 5, a more severe outage hit the storefront — lasting six hours and causing a 99% drop in U.S. order volume, with approximately 6.3 million lost orders. Both incidents were traced to AI-assisted code changes deployed to production without proper approval. The fallout was swift. Amazon launched a 90-day code safety reset across 335 critical systems, and AI-assisted code changes must now be approved by senior engineers before they are deployed. Maimon pointed directly to the Amazon episodes. "This uncertainty isn't based on a hypothesis," he said. "We just need to look back to the start of March, when Amazon.com in North America went down due to an AI-assisted change being implemented without established safeguards." The Amazon incidents illustrate the central tension the Lightrun report quantifies in survey data: AI tools can produce code at unprecedented speed, but the systems designed to validate, monitor, and trust that code in live environments have not kept pace. Google's own 2025 DORA report corroborates this dynamic, finding that AI adoption correlates with an increase in code instability, and that 30% of developers report little or no trust in AI-generated code. Maimon cited that research directly: "Google's 2025 DORA report found that AI adoption correlates with an almost 10% increase in code instability. Our validation processes were built for the scale of human engineering, but today, engineers have become auditors for massive volumes of unfamiliar code." 
Developers are losing two days a week to debugging AI-generated code they didn't write One of the report's most striking findings is the scale of human capital being consumed by AI-related verification work. Developers now spend an average of 38% of their work week — roughly two full days — on debugging, verification, and environment-specific troubleshooting, according to the survey. For 88% of the companies polled, this "reliability tax" consumes between 26% and 50% of their developers' weekly capacity. This is not the productivity dividend that enterprise leaders expected when they invested in AI coding assistants. Instead, the engineering bottleneck has simply migrated. Code gets written faster, but it takes far longer to confirm that it works. "In some senses, AI has made the debugging problem worse," Maimon said. "The volume of change is overwhelming human validation, while the generated code itself frequently does not behave as expected when deployed in Production. AI coding agents cannot see how their code behaves in running environments." The redeploy problem compounds the time drain. Every surveyed organization requires multiple deployment cycles to verify a single AI-suggested fix — and according to Google's 2025 DORA report, a single redeploy cycle takes a day to one week on average. In regulated industries such as healthcare and finance, deployment windows are often narrow, governed by mandated code freezes and strict change-management protocols. Requiring three or more cycles to validate a single AI fix can push resolution timelines from days to weeks. Maimon rejected the idea that these multiple cycles represent prudent engineering discipline. "This is not discipline, but an expensive bottleneck and a symptom of the fact that AI-generated fixes are often unreliable," he said. "If we can move from three cycles to one, we reclaim a massive portion of that 38% lost engineering capacity." 
AI monitoring tools can't see what's happening inside running applications — and that's the real problem If the productivity drain is the most visible cost, the Lightrun report argues the deeper structural problem is what it calls "the runtime visibility gap" — the inability of AI tools and existing monitoring systems to observe what is actually happening inside running applications. Sixty percent of the survey's respondents identified a lack of visibility into live system behavior as the primary bottleneck in resolving production incidents. In 44% of cases where AI SRE or application performance monitoring tools attempted to investigate production issues, they failed because the necessary execution-level data — variable states, memory usage, request flow — had never been captured in the first place. The report paints a picture of AI tools operating essentially blind in the environments that matter most. Ninety-seven percent of engineering leaders said their AI SRE agents operate without significant visibility into what is actually happening in production. Approximately half of all companies (49%) reported their AI agents have only limited visibility into live execution states. Only 1% reported extensive visibility, and not a single respondent claimed full visibility. This is the gap that turns a minor software bug into a costly outage. When an AI-suggested fix fails in production — as 43% of them do — engineers cannot rely on their AI tools to diagnose the problem, because those tools cannot observe the code's real-time behavior. Instead, teams fall back on what the report calls "tribal knowledge": the institutional memory of senior engineers who have seen similar problems before and can intuit the root cause from experience rather than data. The survey found that 54% of resolutions to high-severity incidents rely on tribal knowledge rather than diagnostic evidence from AI SREs or APMs. 
In finance, 74% of engineering teams trust human intuition over AI diagnostics during serious incidents The trust deficit plays out with particular intensity in the finance sector. In an industry where a single application error can cascade into millions of dollars in losses per minute, the survey found that 74% of financial-services engineering teams rely on tribal knowledge over automated diagnostic data during serious incidents — far higher than the 44% figure in the technology sector. "Finance is a heavily regulated, high-stakes environment where a single application error can cost millions of dollars per minute," Maimon said. "The data shows that these teams simply do not trust AI not to make a dangerous mistake in their Production environments. This is a rational response to tool failure." The distrust extends beyond finance. Perhaps the most telling data point in the entire report is that not a single organization surveyed — across any industry — has moved its AI SRE tools into actual production workflows. Ninety percent remain in experimental or pilot mode. The remaining 10% evaluated AI SRE tools and chose not to adopt them at all. This represents an extraordinary gap between market enthusiasm and operational reality: enterprises are spending aggressively on AI for IT operations, but the tools they are buying remain quarantined from the environments where they would deliver the most value. Maimon described this as one of the report's most significant revelations. "Leaders are eager to adopt these new AI tools, but they don't trust AI to touch live environments," he said. "The lack of trust is shown in the data; 98% have lower trust in AI operating in production than in coding assistants." The observability industry built for human-speed engineering is falling short in the age of AI The findings raise pointed questions about the current generation of observability tools from major vendors like Datadog, Dynatrace, and Splunk. 
Seventy-seven percent of the engineering leaders surveyed reported low or no confidence that their current observability stack provides enough information to support autonomous root cause analysis or automated incident remediation. Maimon did not shy away from naming the structural problem. "Major vendors often build 'closed-garden' ecosystems where their AI SREs can only reason over data collected by their own proprietary agents," he said. "In a modern enterprise, teams typically have a multi-tool stack to provide full coverage. By forcing a team into a single-vendor silo, these tools create an uncomfortable dependency and a strategic liability: if the vendor's data coverage is missing a specific layer, the AI is effectively blind to the root cause." The second issue, Maimon argued, is that current observability-backed AI SRE solutions offer only partial visibility — defined by what engineers thought to log at the time of deployment. Because failures rarely follow predefined paths, autonomous root cause analysis using only these tools will frequently miss the key diagnostic evidence. "To move toward true autonomous remediation," he said, "the industry must shift toward AI SRE without vendor lock-in; AI SREs must be an active participant that can connect across the entire stack and interrogate live code to capture the ground truth of a failure as it happens." When asked what it would take to trust AI SREs, the survey's respondents coalesced unanimously around live runtime visibility. Fifty-eight percent said they need the ability to provide "evidence traces" of variables at the point of failure, and 42% cited the ability to verify a suggested fix before it actually deploys. No respondents selected the ability to ingest multiple log sources or provide better natural language explanations — suggesting that engineering leaders do not want AI that talks better, but AI that can see better. 
The question is no longer whether to use AI for coding — it's whether anyone can trust what it produces The survey was administered by Global Surveyz Research, an independent firm, and drew responses from Directors, VPs, and C-level executives in SRE and DevOps roles at enterprises with 1,500 or more employees across the finance, technology, and information technology sectors. Responses were collected during January and February 2026, with questions randomized to prevent order bias. Lightrun, which is backed by $110 million in funding from Accel and Insight Partners and counts AT&T, Citi, Microsoft, Salesforce, and UnitedHealth Group among its enterprise clients, has a clear commercial interest in the problem the report describes: the company sells a runtime observability platform designed to give AI agents and human engineers real-time visibility into live code execution. Its AI SRE product uses a Model Context Protocol connection to generate live diagnostic evidence at the point of failure without requiring redeployment. That commercial interest does not diminish the survey's findings, which align closely with independent research from Google DORA and the real-world evidence of the Amazon outages. Taken together, they describe an industry confronting an uncomfortable paradox. AI has solved the slowest part of building software — writing the code — only to reveal that writing was never the hard part. The hard part was always knowing whether it works. And on that question, the engineers closest to the problem are not optimistic. "If the live visibility gap is not closed, then teams are really just compounding instability through their adoption of AI," Maimon said. "Organizations that don't bridge this gap will find themselves stuck with long redeploy loops, to solve ever more complex challenges. They will lose their competitive speed to the very AI tools that were meant to provide it." The machines learned to write the code. Nobody taught them to watch it run.
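The survey's headline time figures reduce to simple arithmetic. The Python sketch below is illustrative only: the 38% share, the two-to-three cycle count and the one-day-to-one-week cycle length are the figures quoted in this article, while the five-day work week, the seven-day week and the helper names are assumptions.

```python
WORK_DAYS_PER_WEEK = 5  # assumption: a standard five-day work week


def days_lost(share_of_week: float) -> float:
    """Convert a share of the work week into working days."""
    return share_of_week * WORK_DAYS_PER_WEEK


def validation_days(cycles: int, days_per_cycle: float) -> float:
    """Total elapsed days to validate one fix across redeploy cycles."""
    return cycles * days_per_cycle


# 38% of the week spent on debugging and verification, per the survey.
print(f"days lost per week: {days_lost(0.38):.1f}")

# Three redeploy cycles at one day each versus one week (assumed 7 days) each.
print(f"best case: {validation_days(3, 1):.0f} days")
print(f"worst case: {validation_days(3, 7):.0f} days")
```

Under those assumptions, three cycles span anywhere from a few days to roughly three weeks, which matches the article's point that multi-cycle validation can stretch resolution timelines from days to weeks.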

Agentic coding at enterprise scale demands spec-driven development

4/14/2026

Presented by AWS Autonomous agents are compressing software delivery timelines from weeks to days. The enterprises that scale agents safely will be the ones that build using spec-driven development. There’s a moment in every technology shift where the early adopters stop being outliers and start being the baseline. We’re at that moment in software development, and most teams don’t realize it yet. A year ago, vibe coding went viral. Non-developers and junior developers discovered they could build beyond their abilities with AI. It lowered the floor. It made prototyping much quicker, but it also introduced a surplus of slop. What the industry then needed was something that raised the ceiling — something that improved code quality and worked the way the most expert developers work. Spec-driven development did that. It laid the foundation for trustworthy autonomous coding agents. Specs are the trust model for autonomous development Most discussions of AI-generated code focus on whether AI can write code. The harder question is whether you can trust it. The answer runs directly through the spec. Spec-driven development starts with a deceptively simple idea: before an AI agent writes a single line of code, it works from a structured, context-rich specification that defines what the system is supposed to do, what its properties are, and what "correct" actually means. That specification is an artifact the agent reasons against throughout the entire development process — fundamentally different from pre-agentic AI approaches of writing documentation after the fact. Enterprise teams are building on this foundation. The Kiro IDE team used Kiro to build Kiro IDE — an agentic coding environment with native spec-driven development — cutting feature builds from two weeks to two days. An AWS engineering team completed an 18-month rearchitecture project, originally scoped for 30 developers, with six people in 76 days using Kiro. 
An Amazon.com engineering team rolled out “Add to Delivery” — a feature that lets shoppers add items after checkout — two months ahead of schedule by using Kiro and spec-driven development. Alexa+, Amazon Finance, Amazon Stores, AWS, Fire TV, Last Mile Delivery, Prime Video, and more all integrate spec-driven development as part of their build approaches. That shift changes everything downstream. Verifiable testing is what makes autonomous agents safe to run The spec becomes an automated correctness engine. When a developer is generating 150 check-ins per week with AI assistance, no human can manually review that volume of code. Instead, code built against a concrete specification can be verified through property-based testing and neurosymbolic AI techniques that automatically generate hundreds of test cases derived directly from the spec, probing edge cases no human would think to write by hand. These tests prove that the code satisfies the spec’s defined properties, going well beyond hand-written test suites to provably correct behavior. Verifiable testing enables the shift from one-shot programming to continuous autonomous development. Traditional AI-assisted development operates as a single shot: you give the agent a spec, the agent produces output, and the process ends. Today’s agents continuously correct themselves, feeding build and test failures back into their own reasoning, generating additional tests to probe their own output, and iterating until they produce something both functional and verifiable. The spec is the anchor that keeps that loop from drifting. Instead of developers constantly checking in to see if the agent is making the right decisions, the agent can check itself against the spec to make sure it is on the right path. The autonomous agent of the future will write its own specs, using specifications as the mechanism for self-correction, for verification, for ensuring that what it produces matches the intended behavior of the system. 
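As a rough illustration of the property-based testing idea described above, here is a minimal Python sketch. It is not Kiro's implementation and involves no neurosymbolic techniques: `apply_discount` is a hypothetical function under test, the three properties stand in for what a spec might declare, and the hand-rolled random generator substitutes for a dedicated library such as Hypothesis.

```python
import random


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: discount a price, rounding the cut down."""
    return price_cents - (price_cents * percent) // 100


def check_properties(trials: int = 500, seed: int = 0) -> int:
    """Generate random inputs and assert spec-style properties on each one."""
    rng = random.Random(seed)  # fixed seed so any failure is reproducible
    for _ in range(trials):
        price = rng.randint(0, 1_000_000)
        percent = rng.randint(0, 100)
        result = apply_discount(price, percent)

        # Property 1: a discount never yields a negative price or a price increase.
        assert 0 <= result <= price
        # Property 2: the boundary discounts behave as the spec would state.
        if percent == 0:
            assert result == price
        if percent == 100:
            assert result == 0
        # Property 3: a larger discount never costs the shopper more.
        if percent < 100:
            assert apply_discount(price, percent + 1) <= result
    return trials


if __name__ == "__main__":
    print(f"{check_properties()} randomized cases passed")
```

The point of the pattern is that the properties, not individual input-output pairs, are what get written down; the cases are generated, so the same spec can be probed with hundreds of inputs at no extra authoring cost.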
Multi-agent, autonomous, and running right now The developers setting the pace today operate in a fundamentally different way. Developers spend significant time building their spec, as well as writing steering files used by the spec to make sure the agent knows what to build and how to build it — more time than their agent may spend building the actual software. They run multiple agents in parallel to critique a problem from different perspectives, and run multiple specs, each written for a different component of the system they are building. They let agents run for hours, sometimes days. They use thousands of Kiro credits because the output justifies it. A year ago, agents would lose context and fall apart after 20 minutes. Now, every week you can run them longer than the week before. Agentic capabilities have improved so significantly in the last six months that genuinely complex problems are now tractable. Newer LLMs are more token-efficient than the previous generation, so for the same spend, you get dramatically more done. The challenge is that doing this well requires deep expertise. The tools, methodologies, and infrastructure exist, but orchestrating them is hard. The goal with Kiro is to bring these capabilities, and the deep expertise behind them, to every developer, not just the top one percent who’ve figured it out. Infrastructure is catching up to ambition Agents will be ten times more capable within a year. That’s the rate of improvement we’re seeing week over week. The infrastructure to support that level of capability is converging at the same time. Agents are now running in the cloud rather than locally, executing in parallel at scale with secure, reliable communication between agent systems. Organizations can now run agentic workloads the way they’d run any enterprise-grade distributed system — with governance, cost controls, and reliability guarantees that serious software demands. Spec-driven development is the architecture of tomorrow’s autonomous systems. 
Developers are no longer restricted by how they want to solve the problem. The developers who thrive in this world are the ones building that foundation now: using spec-driven development, prioritizing testability and verification from the start, working with agents as collaborators, and thinking in systems instead of syntax. Deepak Singh is VP of Kiro at AWS. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Is Anthropic 'nerfing' Claude? Users increasingly report performance degradation as leaders push back

4/13/2026

A growing number of developers and AI power users are taking to social media to accuse Anthropic of degrading the performance of Claude Opus 4.6 and Claude Code — intentionally or as an outcome of compute limits — arguing that the company’s flagship coding model feels less capable, less reliable and more wasteful with tokens than it did just weeks ago. The complaints have spread quickly on GitHub, X and Reddit over the past several weeks, with several high-reach posts alleging that Claude has become worse at sustained reasoning, more likely to abandon tasks midway through, and more prone to hallucinations or contradictions. Some users have framed the issue as “AI shrinkflation” — the idea that customers are paying the same price for a weaker product. Others have gone further, suggesting Anthropic may be throttling or otherwise tuning Claude downward during periods of heavy demand. Those claims remain unproven, and Anthropic employees have publicly denied that the company degrades models to manage capacity. At the same time, Anthropic has acknowledged real changes to usage limits and reasoning defaults in recent weeks, which has made the broader debate more combustible. VentureBeat has reached out to Anthropic for further clarification on the recent accusations, including whether any recent changes to reasoning defaults, context handling, throttling behavior, inference parameters or benchmark methodology could help explain the spike in complaints. We have also asked how Anthropic explains the recent benchmark-related claims and whether it plans to publish additional data that could reassure customers. An Anthropic spokesperson did not address the questions individually, instead referring us to X posts by Claude Code creator Boris Cherny and Claude Code team member Thariq Shihipar regarding Opus 4.6 performance and usage limits, respectively. Both X posts are also referenced and linked below. 
Viral user complaints, including from an AMD Senior Director, argue Claude has become less capable One of the most detailed public complaints originated as a GitHub issue filed by Stella Laurenzo on April 2, 2026, whose LinkedIn profile identifies her as Senior Director in AMD’s AI group. In that post, Laurenzo wrote that Claude Code had regressed to the point that it could not be trusted for complex engineering work, then backed that claim with a sprawling analysis of 6,852 Claude Code session files, 17,871 thinking blocks and 234,760 tool calls. The complaint argued that, starting in February, Claude’s estimated reasoning depth fell sharply while signs of poorer performance rose alongside it, including more premature stopping, more “simplest fix” behavior, more reasoning loops, and a measurable shift from research-first behavior to edit-first behavior. The post’s broader point was that for advanced engineering workflows, extended reasoning is not a luxury but part of what makes the model usable in the first place. That GitHub thread then escaped into the broader social media conversation, with X users including @Hesamation, who posted screenshots of Laurenzo's GitHub post to X on April 11, turning it into an even more viral talking point. That amplification mattered because it gave the wider “Claude is getting worse” narrative something more concrete than anecdotal frustration: a long, data-heavy post from a senior AI leader at a major chip company arguing that the regression was visible in logs, tool-use patterns and user corrections, not just gut feeling. Anthropic’s public response focused on separating perceived changes from actual model degradation. In a pinned follow-up on the same GitHub issue posted a week ago, Claude Code lead Boris Cherny thanked Laurenzo for the care and depth of the analysis but disputed its main conclusion. 
Cherny said the “redact-thinking-2026-02-12” header cited in the complaint is a UI-only change that hides thinking from the interface and reduces latency, but “does not impact thinking itself,” “thinking budgets,” or how extended reasoning works under the hood. He also said two other product changes likely affected what users were seeing: Opus 4.6’s move to adaptive thinking by default on Feb. 9, and a March 3 shift to medium effort, or effort level 85, as the default for Opus 4.6, which he said Anthropic viewed as the best balance across intelligence, latency and cost for most users. Cherny added that users who want more extended reasoning can manually switch effort higher by typing /effort high in Claude Code terminal sessions. That exchange gets at the core of the controversy. Critics like Laurenzo argue that Claude’s behavior in demanding coding workflows has plainly worsened and point to logs and usage patterns as evidence. Anthropic, by contrast, is not saying nothing changed. It is saying the biggest recent changes were product and interface choices that affect what users see and how much effort the system expends by default, not a secret downgrade of the underlying model. That distinction may be technically important, but for power users who feel the product is delivering worse results, it is not necessarily a satisfying one. External coverage from TechRadar and PC Gamer further amplified Laurenzo's post and the larger wave of agreement from some power users. Another viral post on X from developer Om Patel on April 7 made the same argument in even more direct terms, claiming that someone had “actually measured” how much “dumber” Claude had gotten and summarizing the result as a 67% drop. That post helped popularize the “AI shrinkflation” label and pushed the controversy beyond hard-core Claude Code users into the broader AI discourse on X. 
These claims have resonated because they map closely onto what many frustrated users say they are seeing in practice: more unfinished tasks, more backtracking, more token burn and a stronger sense that Claude is less willing to reason deeply through complicated coding jobs than it was earlier this year. Benchmark posts turned anecdotal frustration into a public controversy The loudest benchmark-based claim came from BridgeMind, which runs the BridgeBench hallucination benchmark. On April 12, the account posted that Claude Opus 4.6 had fallen from 83.3% accuracy and a No. 2 ranking in an earlier result to 68.3% accuracy and No. 10 in a new retest, calling that proof that “Claude Opus 4.6 is nerfed.” That post spread widely and became one of the main anchors for the broader public case that Anthropic had degraded the model. Other users also circulated benchmark-related or test-based posts suggesting that Opus 4.6 was underperforming versus Opus 4.5 in practical coding tasks. Still other posts pointed to TerminalBench-related results as supposed evidence that the model’s behavior had changed in certain harnesses or product contexts. The effect was cumulative: benchmark screenshots, side-by-side tests and anecdotal frustration all began reinforcing one another in public. That matters because benchmark claims tend to travel farther than more subjective complaints. A developer saying a model “feels worse” is one thing. A screenshot showing a ranking drop from No. 2 to No. 10, or a dramatic percentage swing in accuracy, gives the appearance of hard proof, even when the underlying comparison may be more complicated. Critics of the benchmark claims say the evidence is weaker than it looks The most important rebuttal to the BridgeBench claim did not come from Anthropic. 
It came from Paul Calcraft, an outside software and AI researcher on X, who argued that the viral comparison was misleading because the earlier Opus 4.6 result was based on only six tasks while the later one was based on 30. In his words, it was a “DIFFERENT BENCHMARK.” He also said that on the six tasks the two runs shared in common, Claude’s score moved only modestly, from 87.6% previously to 85.4% in the later run, and that the bigger swing appeared to come mostly from a single fabrication result without repeats. He characterized that as something that could easily fall within ordinary statistical noise. That outside rebuttal matters because it undercuts one of the cleanest and most viral claims in circulation. It does not prove users are wrong to think something has changed. But it does suggest that at least some of the benchmark evidence now driving the story may be overstated, poorly normalized or not directly comparable. Even the BridgeBench post itself drew a community note to similar effect. The note said the two benchmark runs covered different scopes — six tasks in one case and 30 in the other — and that the common-task subset showed only a minor change. That does not make the later result meaningless, but it weakens the strongest version of the “BridgeBench proved it” argument. This is now a key feature of the controversy: the claims are not all equally strong. Some are grounded in first-hand user experience. Some point to real product changes. Some rely on benchmark comparisons that may not be apples-to-apples. And some depend on inferences about hidden system behavior that users outside Anthropic cannot directly verify. Earlier capacity limits gave users a reason to suspect more changes under the hood The current backlash also lands in the shadow of a real, confirmed Anthropic policy change from late March. 
On March 26, Anthropic technical staffer Thariq Shihipar posted that, “To manage growing demand for Claude,” the company was adjusting how 5-hour session limits work for Free, Pro and Max subscribers during peak hours, while keeping weekly limits unchanged. He added that during weekdays from 5 a.m. to 11 a.m. Pacific time, users would move through their 5-hour session limits faster than before. In follow-up posts, he said Anthropic had landed efficiency wins to offset some of the impact, but that roughly 7% of users would hit session limits they would not have hit before, particularly on Pro tiers. In an email on March 27, 2026, Anthropic told VentureBeat that Team and Enterprise customers were not affected by those changes, and that the shift was not dynamically optimized per user but instead applied to the peak-hour window the company had publicly described. Anthropic also said it was continuing to invest in scaling capacity. Those comments were about session limits, not model downgrades. But they are important context, because they establish two things that users now keep connecting in public: first, Anthropic has been dealing with surging demand; second, it has already changed how usage is rationed during busy periods. That does not prove Anthropic reduced model quality. It does help explain why so many users are primed to believe something else may also have changed. Prompt caching and TTL A separate, more recent GitHub issue broadens the dispute beyond model quality and into pricing and quota behavior. In issue #46829, user seanGSISG argued that Claude Code’s prompt-cache time-to-live, or TTL, appeared to shift from a one-hour setting back to a five-minute setting in early March, based on analysis of nearly 120,000 API calls drawn from Claude Code session logs across two machines. 
The complaint argues that this change drove meaningful increases in cache-creation costs and quota burn, especially for long-running coding sessions where cached context expires quickly and must be rebuilt. The author claims that this helps explain why some subscription users began hitting usage limits they had not previously encountered. What makes this issue notable is that Anthropic did not flatly deny that something changed. In a reply on the thread, Jarred Sumner said the March 6 change was real and intentional, but rejected the framing that it was a regression. He said Claude Code uses different cache durations for different request types, and that one-hour cache is not always cheaper because one-hour writes cost more up front and only save money when the same cached context is reused enough times to justify it. In his telling, the change was part of ongoing cache optimization work, not a silent downgrade, and the pre–March 6 behavior described in the issue “wasn’t the intended steady state.” The thread later drew a more detailed response from Anthropic’s Cherny, who described one-hour caching as “nuanced” and said the company has been testing heuristics to improve cache hit rates, token usage and latency for subscribers. Cherny said Anthropic keeps five-minute cache for many queries, including subagents that are rarely resumed, and said turning off telemetry also disables experiment gates, which can cause Claude Code to fall back to a five-minute default in some cases. He added that Anthropic plans to expose environment variables that let users force one-hour or five-minute cache behavior directly. Together, those replies do not validate the issue author’s claim that Anthropic silently made Claude Code more expensive overall, but they do confirm that Anthropic has been actively experimenting with cache behavior behind the scenes during the same period users began complaining more loudly about quota burn and changing product behavior. 
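Sumner's break-even point can be made concrete with a small Python sketch. The multipliers below (a five-minute cache write at 1.25x the base input price, a one-hour write at 2x, a cache read at 0.1x) are illustrative assumptions rather than figures from the thread, as is the simplification that every cache hit restarts the TTL clock; `session_cost` is a hypothetical helper, not part of Claude Code.

```python
def session_cost(gaps_min, ttl_min, write_mult, read_mult=0.1):
    """Relative cost of re-sending one fixed context across a session.

    Costs are in units of "one uncached pass over the context".
    gaps_min   -- idle minutes between consecutive requests
    ttl_min    -- cache time-to-live in minutes (assumed to reset on each hit)
    write_mult -- assumed price multiplier for writing the cache
    read_mult  -- assumed price multiplier for reading from the cache
    """
    cost = write_mult  # the first request always writes the cache
    for gap in gaps_min:
        if gap <= ttl_min:
            cost += read_mult   # cache still warm: cheap read
        else:
            cost += write_mult  # cache expired: pay for a rewrite
    return cost


# Rapid-fire session: every follow-up arrives within five minutes.
fast = [2, 3, 1, 4]
# Slower session: the developer pauses 10 to 30 minutes between requests.
slow = [10, 20, 15, 30]

for name, gaps in (("fast", fast), ("slow", slow)):
    five = session_cost(gaps, ttl_min=5, write_mult=1.25)
    hour = session_cost(gaps, ttl_min=60, write_mult=2.0)
    cheaper = "5-minute" if five < hour else "1-hour"
    print(f"{name}: 5-min={five:.2f}, 1-hour={hour:.2f}, cheaper={cheaper}")
```

Under these assumed prices, the rapid session never lets the five-minute cache lapse, so the cheaper write wins; the slower session forces a rewrite on every request at five minutes, so the pricier one-hour write pays for itself. That is the sense in which neither setting is cheaper across the board.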
Anthropic says user-facing changes, not secret degradation, explain much of the uproar Anthropic-affiliated employees have publicly pushed back on the broadest accusations. In one widely circulated reply on X, Cherny responded to claims that Anthropic had secretly nerfed Claude Code by writing, “This is false.” He said Claude Code had been defaulted to medium effort in response to user feedback that Claude was consuming too many tokens, and that the change had been disclosed both in the changelog and in a dialog shown to users when they opened Claude Code. That response is notable because it concedes a meaningful product change while rejecting the more conspiratorial interpretation of it. Anthropic is not saying nothing changed. It is saying that what changed was disclosed and was aimed at balancing token use, not secretly reducing model quality. Public documentation also supports the fact that effort defaults have been in motion. Claude Code’s changelog says that on April 7, Anthropic changed the default effort level from medium to high for API-key users as well as Bedrock, Vertex, Foundry, Team and Enterprise users. That suggests Anthropic has actively been tuning these settings across different segments, which could plausibly affect user perceptions even if the core model weights are unchanged. Shihipar has also directly denied the broader demand-management accusation. In a reply on X posted April 11, he said Anthropic does not “degrade” its models to better serve demand. He also said that changes to thinking summaries affected how some users were measuring Claude’s “thinking,” and that the company had not found evidence backing the strongest qualitative claims now spreading online. The real issue may be trust as much as model quality What is clear is that a trust gap has opened between Anthropic and some of its most demanding users. 
For developers who rely on Claude Code all day, subtle shifts in visible thinking output, effort defaults, token burn, latency tradeoffs or usage caps can feel indistinguishable from a weaker model. That is true whether the root cause is a product setting, a UI change, an inference-policy tweak, capacity pressure or a genuine quality regression. It also means both sides of the fight may be talking past each other. Users are describing what they experience: more friction, more failures and less confidence. Anthropic is responding in product terms: effort defaults, hidden thinking summaries, changelog disclosures, and denials that demand pressure is causing secret model degradation. Those are not necessarily incompatible descriptions. A model can feel worse to users even if the company believes it has not “nerfed” the underlying model in the way critics allege. But coming at a time when Anthropic's chief rival OpenAI has recently pivoted and put more resources behind its competing, enterprise and vibe-coding focused product Codex — even offering a new, more mid-range ChatGPT subscription in an effort to boost usage of the tool — it's certainly not the kind of publicity that stands to benefit Anthropic or its customer retention. At the same time, the public evidence remains mixed. Some of the most viral claims have come from developers with detailed logs and strong opinions based on repeated use. Some of the benchmark evidence has been challenged by outside observers on methodological grounds. And Anthropic’s own recent changes to limits and settings ensure that this debate is happening against a backdrop of real adjustments, not pure rumor.

Designing the agentic AI enterprise for measurable performance

4/13/2026

Presented by Edgeverve Smart, semi‑autonomous AI agents handling complex, real‑time business work is a compelling vision. But moving from impressive pilots to production‑grade impact requires more than clever prompts or proof‑of‑concept demos. It takes clear goals, data‑driven workflows, and an enterprise platform that balances autonomy, governance, observability, and flexibility with hard guardrails from day one. From pilots to the “operational grey zones” The next wave of value sits in the connective tissue between applications — those operational grey zones where handoffs, reconciliations, approvals, and data lookups still rely on humans. Assigning agents to these paths means collapsing system boundaries, applying intelligence to context, and re‑imagining processes that were never formally automated. Many pilots stall because they start as lab experiments rather than outcome‑anchored designs tied to production systems, controls, and KPIs. Start with outcomes, not algorithms. Translate organizational KPIs (cash‑flow, DSO, SLA adherence, compliance hit rates, MTTR, NPS, claims leakage, etc.) into agent goals, then cascade them into single‑agent and multi‑agent objectives. Only after goals are explicit should you select workflows and decompose tasks. Pick targets, then decompose the work What does “target” actually mean? In agentic programs, a target is a business outcome and the use case that moves it. For example, “reduce unapplied cash by 20%” is the target outcome; “cash application and exceptions handling” is the use case. With the use case in hand, perform persona‑level task decomposition: map the human role (e.g., cash applications analyst, facilities coordinator), enumerate their tasks, and identify which are ripe for agentification (data retrieval, matching, policy checks, decision proposals, transaction initiation). Delivering on those tasks requires a data‑embedded workflow fabric that can read, write, and reason across enterprise systems while honoring permissions. 
Data must be AI‑ready: discoverable, governed, labeled where needed, augmented for retrieval (RAG), and policy‑protected for PII, PCI, and regulatory constraints. Integration goes beyond APIs APIs are one mode of integration, not the only one. Robust agent execution typically blends: stable APIs with lifecycle management for core systems; event‑driven triggers (streams, webhooks, CDC) to react in real time; UI/RPA fallbacks where APIs don’t exist; search/RAG connectors for documents and knowledge bases; and policy management across tools and actions to enforce entitlements and segregation of duties. The north star is integration reliability — built on idempotency, retries, circuit-breakers, and standardized tool schemas — so agents don’t “hallucinate” actions the enterprise can’t verify. A quick example: finance and facilities, in production Inside our organization, we deployed specialized agents in a live CFO environment and in building maintenance. In finance, seven agents interacted with production systems and real accountability structures. Year‑one outcomes included: >3% monthly cash‑flow improvement, 50% productivity gain in affected workflows, 90% faster onboarding, a shift from account‑level handling to function‑level orchestration, and a $32M cash‑flow lift. These results don’t guarantee gains everywhere; they show that deliberate design can deliver measurable outcomes at scale. The four design pillars: Autonomy, governance, observability & evals, flexibility 1) Autonomy: right‑size it to the risk Autonomy exists on a spectrum. Early efforts often automate well‑bounded tasks; others pursue research/analysis agents; increasingly, teams target mission‑critical transactional agents (payments, vendor onboarding, pricing changes). The rule: match autonomy to risk, and encode the operating mode (suggest‑only, propose‑and‑approve, or execute‑with‑rollback) per task. 2) Governance: guardrails by design, not as bolt‑ons Unbounded agents create unacceptable risk. 
Build guardrails into the plan: Policy & permissions: tie tools/actions to identity, scopes, and SoD rules. Human‑in‑the‑loop (HITL): where mission‑critical thresholds are crossed (amount, vendor risk, regulatory exposure). Agent lifecycle management: versioning, change control, regression gates, approval workflows, and sunsetting. Third‑party agent orchestration: vet external agents like vendors: capabilities, scopes, logs, SLAs. Incident and rollback: kill‑switches, safe‑mode, and compensating transactions. This is how you scale innovation safely while protecting brand, compliance, and customers. 3) Observability & evaluations: trust comes from telemetry Production agents need the same rigor as any core platform: Telemetry: capture full execution traces across perception, planning, tool use, and action, supported by structured logs and replay. Offline evals: scenario tests, red‑teaming, bias and safety checks, cost/performance benchmarks; baseline vs. challenger comparisons. Online evals: shadow mode, A/B, canary releases, guardrail breach alerts, human feedback loops. Explainability & auditability: why was an action taken, which data/tools were used, and who approved. 4) Flexibility: assume volatility, design for swap‑ability Models, tools, and vendors change fast. Treat agentic capability as platform currency: create an environment where teams can evaluate, select, and swap models/tools without tearing down the build. Use a model router, tool registry, and contract‑first interfaces so upgrades are controlled experiments, not rewrites. The agent platform fabric: how platformization turns goals into outcomes A true agentic enterprise requires a platform fabric that transforms goals into outcomes, not a patchwork of isolated pilots. This platform anchors enterprise‑to‑agent KPI cascades, drives task decomposition and multi‑agent planning, and provides governed tooling and data access across APIs, RPA, search, and databases. 
It centralizes knowledge and memory through RAG and vector stores, enforces enterprise controls via a policy engine, and manages performance and safety through a unified model layer. It supports robust orchestration of first‑ and third‑party agents with common context, embeds deep observability and evaluation pipelines, and applies disciplined release engineering from sandbox to GA. Finally, it ensures long‑term resilience through lifecycle management: versioning, deprecation, incident playbooks, and auditable histories. Guardrails in action: a BFSI example Consider payments exception handling in banking — high stakes, regulated, and customer‑visible. An agent proposes a resolution (e.g., auto‑reconcile or escalate) only when: The transaction falls below risk thresholds; above them, it triggers HITL approval. All policy checks (KYC/AML, velocity, sanctions) pass. Observability hooks record rationale, tools invoked, and data used. Rollback/compensation is defined if downstream failures occur. This pattern generalizes to vendor onboarding, pricing overrides, or claims adjudication — mission‑critical work with explicit safety rails. Scale beyond pilots Scaling agentic AI beyond pilots demands disciplined readiness across nine fronts: leaders must clarify which KPIs matter and how agent goals ladder into them, determine which persona tasks are agentified and which remain human‑led, and align each with the right autonomy mode, from suggest‑only to propose‑and‑approve to execute‑with‑rollback. They must embed governance guardrails, including HITL points and lifecycle controls; ensure robust observability and evaluation via telemetry, replay, audits, and offline/online tests; and verify data readiness, with governed, policy‑protected, retrieval‑augmented data flows. Integration must be reliable, with API lifecycle management, event triggers, and RPA/other fallbacks. 
The underlying platform should enable model swap‑ability and orchestration of first‑ and third‑party agents without rebuilding. Finally, measurement must focus on true operational impact (cash flow, cycle times, quality, and risk reduction) rather than task counts. The takeaway Agentic AI is not a shortcut; it’s a new system of work. Enterprises that approach it with platform discipline, aligning autonomy with risk, embedding governance and observability, and designing for swap‑ability, will convert pilots into production impact. Those that don’t will keep accumulating impressive but disconnected demos. The difference isn’t how fast you ship an agent; it’s how deliberately you design the enterprise around it. N. Shashidar is SVP & Global Head, Product Management at EdgeVerve. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Five signs data drift is already undermining your security models

4/12/2026

Data drift happens when the statistical properties of a machine learning (ML) model's input data change over time, eventually rendering its predictions less accurate. Cybersecurity professionals who rely on ML for tasks like malware detection and network threat analysis find that undetected data drift can create vulnerabilities. A model trained on old attack patterns may fail to see today's sophisticated threats. Recognizing the early signs of data drift is the first step in maintaining reliable and efficient security systems. Why data drift compromises security models ML models are trained on a snapshot of historical data. When live data no longer resembles this snapshot, the model's performance dwindles, creating a critical cybersecurity risk. A threat detection model may generate more false negatives by missing real breaches or create more false positives, leading to alert fatigue for security teams. Adversaries actively exploit this weakness. In 2024, attackers used echo-spoofing techniques to bypass email protection services. By exploiting misconfigurations in the system, they sent millions of spoofed emails that evaded the vendor's ML classifiers. This incident demonstrates how threat actors can manipulate input data to exploit blind spots. When a security model fails to adapt to shifting tactics, it becomes a liability. 5 indicators of data drift Security professionals can recognize the presence of drift (or its potential) in several ways. 1. A sudden drop in model performance Accuracy, precision, and recall are often the first casualties. A consistent decline in these key metrics is a red flag that the model is no longer in sync with the current threat landscape. Consider Klarna's success: Its AI assistant handled 2.3 million customer service conversations in its first month and performed work equivalent to 700 agents. This efficiency drove a 25% decline in repeat inquiries and reduced resolution times to under two minutes. 
Now imagine if those parameters suddenly reversed because of drift. In a security context, a similar drop in performance does not just mean unhappy clients — it also means successful intrusions and potential data exfiltration. 2. Shifts in statistical distributions Security teams should monitor the core statistical properties of input features, such as the mean, median, and standard deviation. A significant change in these metrics from training data could indicate the underlying data has changed. Monitoring for such shifts enables teams to catch drift before it causes a breach. For example, a phishing detection model might be trained on emails with an average attachment size of 2MB. If the average attachment size suddenly jumps to 10MB due to a new malware-delivery method, the model may fail to classify these emails correctly. 3. Changes in prediction behavior Even if overall accuracy seems stable, distributions of predictions might change, a phenomenon often referred to as prediction drift. For instance, if a fraud detection model historically flagged 1% of transactions as suspicious but suddenly starts flagging 5% or 0.1%, either something has shifted or the nature of the input data has changed. It might indicate a new type of attack that confuses the model or a change in legitimate user behavior that the model was not trained to identify. 4. An increase in model uncertainty For models that provide a confidence score or probability with their predictions, a general decrease in confidence can be a subtle sign of drift. Recent studies highlight the value of uncertainty quantification in detecting adversarial attacks. If the model becomes less sure about its forecasts across the board, it is likely facing data it was not trained on. In a cybersecurity setting, this uncertainty is an early sign of potential model failure, suggesting the model is operating in unfamiliar ground and that its decisions might no longer be reliable. 5. 
Changes in feature relationships The correlation between different input features can also change over time. In a network intrusion model, traffic volume and packet size might be highly correlated during normal operations. If that correlation disappears, it can signal a change in network behavior that the model may not understand. A sudden feature decoupling could indicate a new tunneling tactic or a stealthy exfiltration attempt. Approaches to detecting and mitigating data drift Common detection methods include the Kolmogorov-Smirnov (KS) test and the population stability index (PSI). Both compare the distributions of live and training data to identify deviations. The KS test determines whether two datasets differ significantly, while the PSI measures how much a variable's distribution has shifted over time. The mitigation method of choice often depends on how the drift manifests. Distribution changes may occur suddenly: customers' buying behavior, for example, may change overnight with the launch of a new product or a promotion. In other cases, drift may occur gradually over a more extended period. Security teams must therefore adjust their monitoring cadence to capture both rapid spikes and slow burns. Mitigation typically involves retraining the model on more recent data to restore its effectiveness. Proactively manage drift for stronger security Data drift is an inevitable reality, and cybersecurity teams can maintain a strong security posture by treating detection as a continuous and automated process. Proactive monitoring and model retraining are fundamental practices to ensure ML systems remain reliable allies against evolving threats. Zac Amos is the Features Editor at ReHack.
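
To make the two detection methods concrete, here is a minimal pure-Python sketch of the KS statistic and the PSI for a single numeric feature, such as attachment size. The function names, the equal-width binning, and the smoothing constant are assumptions made for illustration; production teams would more likely use a tested statistics library, which also supplies a p-value for the KS test.

```python
import math
from bisect import bisect_right


def ks_statistic(baseline, live):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(live)
    gap = 0.0
    # The largest gap between two step functions occurs at one of the sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def psi(baseline, live, bins=10, eps=1e-6):
    """Population stability index using equal-width bins over the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width when the baseline is constant

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range are clamped into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = fractions(baseline), fractions(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


if __name__ == "__main__":
    training_sizes_mb = [1.5, 1.8, 2.0, 2.1, 2.2, 2.4, 1.9, 2.0, 2.3, 1.7]
    live_sizes_mb = [9.0, 10.5, 11.0, 9.8, 10.2, 10.0, 9.5, 10.8, 10.1, 9.9]
    print("KS statistic:", ks_statistic(training_sizes_mb, live_sizes_mb))
    print("PSI:", psi(training_sizes_mb, live_sizes_mb))
```

A KS statistic near 0 means the live and training samples look alike, and a value near 1 means they barely overlap. For the PSI, a commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, though any alerting threshold should be tuned to the feature and the model.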

Your developers are already running AI locally: Why on-device inference is the CISO’s new blind spot

4/12/2026

For the last 18 months, the CISO playbook for generative AI has been relatively simple: Control the browser. Security teams tightened cloud access security broker (CASB) policies, blocked or monitored traffic to well-known AI endpoints, and routed usage through sanctioned gateways. The operating model was clear: If sensitive data leaves the network for an external API call, we can observe it, log it, and stop it. But that model is starting to break. A quiet hardware shift is pushing large language model (LLM) usage off the network and onto the endpoint. Call it Shadow AI 2.0, or the “bring your own model” (BYOM) era: Employees running capable models locally on laptops, offline, with no API calls and no obvious network signature. The governance conversation is still framed as “data exfiltration to the cloud,” but the more immediate enterprise risk is increasingly “unvetted inference inside the device.” When inference happens locally, traditional data loss prevention (DLP) doesn’t see the interaction. And when security can’t see it, it can’t manage it. Why local inference is suddenly practical Two years ago, running a useful LLM on a work laptop was a niche stunt. Today, it’s routine for technical teams. Three things converged: Consumer-grade accelerators got serious: A MacBook Pro with 64GB unified memory can often run quantized 70B-class models at usable speeds (with practical limits on context length). What once required multi-GPU servers is now feasible on a high-end laptop for many real workflows. Quantization went mainstream: It’s now easy to compress models into smaller, faster formats that fit within laptop memory, often with acceptable quality tradeoffs for many tasks. Distribution is frictionless: Open-weight models are a single command away, and the tooling ecosystem makes “download → run → chat” trivial. 
The result: An engineer can pull down a multi‑GB model artifact, turn off Wi‑Fi, and run sensitive workflows locally: source code review, document summarization, drafting customer communications, even exploratory analysis over regulated datasets. No outbound packets, no proxy logs, no cloud audit trail. From a network-security perspective, that activity can look indistinguishable from “nothing happened.” The risk isn’t only data leaving the company anymore If the data isn’t leaving the laptop, why should a CISO care? Because the dominant risks shift from exfiltration to integrity, provenance, and compliance. In practice, local inference creates three classes of blind spots that most enterprises have not operationalized. 1. Code and decision contamination (integrity risk) Local models are often adopted because they’re fast, private, and “no approval required.” The downside is that they’re frequently unvetted for the enterprise environment. A common scenario: A senior developer downloads a community-tuned coding model because it benchmarks well. They paste in internal auth logic, payment flows, or infrastructure scripts to “clean it up.” The model returns output that looks competent, compiles, and passes unit tests, but subtly degrades security posture (weak input validation, unsafe defaults, brittle concurrency changes, dependency choices that aren’t allowed internally). The engineer commits the change. If that interaction happened offline, you may have no record that AI influenced the code path at all. And when you later do incident response, you’ll be investigating the symptom (a vulnerability) without visibility into a key cause (uncontrolled model usage). 2. Licensing and IP exposure (compliance risk) Many high-performing models ship with licenses that include restrictions on commercial use, attribution requirements, field-of-use limits, or obligations that can be incompatible with proprietary product development. 
When employees run models locally, that usage can bypass the organization’s normal procurement and legal review process. If a team uses a non-commercial model to generate production code, documentation, or product behavior, the company can inherit risk that shows up later during M&A diligence, customer security reviews, or litigation. The hard part is not just the license terms; it’s the lack of inventory and traceability. Without a governed model hub or usage record, you may not be able to prove what was used where. 3. Model supply chain exposure (provenance risk) Local inference also changes the software supply chain problem. Endpoints begin accumulating large model artifacts and the toolchains around them: downloaders, converters, runtimes, plugins, UI shells, and Python packages. There is a critical technical nuance here: The file format matters. While newer formats like Safetensors are designed to prevent arbitrary code execution, older Pickle-based PyTorch files can execute malicious payloads simply when loaded. If your developers are grabbing unvetted checkpoints from Hugging Face or other repositories, they aren't just downloading data; they could be downloading an exploit. Security teams have spent decades learning to treat unknown executables as hostile. BYOM requires extending that mindset to model artifacts and the surrounding runtime stack. The biggest organizational gap today is that most companies have no equivalent of a software bill of materials for models: Provenance, hashes, allowed sources, scanning, and lifecycle management. Mitigating BYOM: treat model weights like software artifacts You can’t solve local inference by blocking URLs. You need endpoint-aware controls and a developer experience that makes the safe path the easy path. Here are three practical ways: 1. Move governance down to the endpoint Network DLP and CASB still matter for cloud usage, but they’re not sufficient for BYOM. 
Start treating local model usage as an endpoint governance problem by looking for specific signals: Inventory and detection: Scan for high-fidelity indicators like .gguf files larger than 2GB, processes like llama.cpp or Ollama, and local listeners on common default port 11434. Process and runtime awareness: Monitor for repeated high GPU/NPU (neural processing unit) utilization from unapproved runtimes or unknown local inference servers. Device policy: Use mobile device management (MDM) and endpoint detection and response (EDR) policies to control installation of unapproved runtimes and enforce baseline hardening on engineering devices. The point isn’t to punish experimentation. It’s to regain visibility. 2. Provide a paved road: An internal, curated model hub Shadow AI is often an outcome of friction. Approved tools are too restrictive, too generic, or too slow to approve. A better approach is to offer a curated internal catalog that includes: Approved models for common tasks (coding, summarization, classification) Verified licenses and usage guidance Pinned versions with hashes (prioritizing safer formats like Safetensors) Clear documentation for safe local usage, including where sensitive data is and isn’t allowed. If you want developers to stop scavenging, give them something better. 3. Update policy language: “Cloud services” isn’t enough anymore Most acceptable use policies talk about SaaS and cloud tools. BYOM requires policy that explicitly covers: Downloading and running model artifacts on corporate endpoints Acceptable sources License compliance requirements Rules for using models with sensitive data Retention and logging expectations for local inference tools This doesn’t need to be heavy-handed. It needs to be unambiguous. The perimeter is shifting back to the device For a decade we moved security controls “up” into the cloud. Local inference is pulling a meaningful slice of AI activity back “down” to the endpoint. 
5 signals shadow AI has moved to endpoints: Large model artifacts: Unexplained storage consumption by .gguf or .pt files. Local inference servers: Processes listening on ports like 11434 (Ollama). GPU utilization patterns: Spikes in GPU usage while offline or disconnected from VPN. Lack of model inventory: Inability to map code outputs to specific model versions. License ambiguity: Presence of "non-commercial" model weights in production builds. Shadow AI 2.0 isn’t a hypothetical future, it’s a predictable consequence of fast hardware, easy distribution, and developer demand. CISOs who focus only on network controls will miss what’s happening on the silicon sitting right on employees’ desks. The next phase of AI governance is less about blocking websites and more about controlling artifacts, provenance, and policy at the endpoint, without killing productivity. Jayachander Reddy Kandakatla is a senior MLOps engineer.
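
Two of the endpoint signals above, large model artifacts on disk and a local listener on port 11434, are simple enough to sketch with the Python standard library. This is an illustrative starting point under stated assumptions, not an EDR rule: the function names are hypothetical, and the default suffix list, size floor, and port are taken from the examples in the article rather than from any vendor's detection logic.

```python
import os
import socket


def find_model_artifacts(root, min_bytes=2 * 1024**3, suffixes=(".gguf", ".pt")):
    """Walk `root` and return (path, size) for model-like files at or above min_bytes."""
    hits = []
    # Unreadable directories are skipped silently.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            if not name.lower().endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable
            if size >= min_bytes:
                hits.append((path, size))
    return hits


def local_listener_present(port=11434, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising on refusal.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for path, size in find_model_artifacts(os.path.expanduser("~")):
        print(f"{size / 1024**3:.1f} GiB  {path}")
    print("listener on 11434:", local_listener_present())
```

A hit from either check is a prompt for a conversation, not proof of a policy violation: large `.pt` files also come from legitimate training work, and other software can bind the same port.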

AI agent credentials live in the same box as untrusted code. Two new architectures show where the blast radius actually stops.

4/10/2026

Four separate RSAC 2026 keynotes arrived at the same conclusion without coordinating. Microsoft's Vasu Jakkal told attendees that zero trust must extend to AI. Cisco's Jeetu Patel called for a shift from access control to action control, saying in an exclusive interview with VentureBeat that agents behave "more like teenagers, supremely intelligent, but with no fear of consequence." CrowdStrike's George Kurtz identified AI governance as the biggest gap in enterprise technology. Splunk's John Morgan called for an agentic trust and governance model. Four companies. Four stages. One problem. Matt Caulfield, VP of Product for Identity and Duo at Cisco, put it bluntly in an exclusive VentureBeat interview at RSAC. "While the concept of zero trust is good, we need to take it a step further," Caulfield said. "It's not just about authenticating once and then letting the agent run wild. It's about continuously verifying and scrutinizing every single action the agent's trying to take, because at any moment, that agent can go rogue." Seventy-nine percent of organizations already use AI agents, according to PwC's 2025 AI Agent Survey. Only 14.4% reported full security approval for their entire agent fleet, per the Gravitee State of AI Agent Security 2026 report of 919 organizations in February 2026. A CSA survey presented at RSAC found that only 26% have AI governance policies. CSA's Agentic Trust Framework describes the resulting gap between deployment velocity and security readiness as a governance emergency. Cybersecurity leaders and industry executives at RSAC agreed on the problem. Then two companies shipped architectures that answer the question differently. The gap between their designs reveals where the real risk sits. The monolithic agent problem that security teams are inheriting The default enterprise agent pattern is a monolithic container. The model reasons, calls tools, executes generated code, and holds credentials in one process. 
Every component trusts every other component. OAuth tokens, API keys, and git credentials sit in the same environment where the agent runs code it wrote seconds ago. A prompt injection gives the attacker everything. Tokens are exfiltrable. Sessions are spawnable. The blast radius is not the agent. It is the entire container and every connected service. The CSA and Aembit survey of 228 IT and security professionals quantifies how common this remains: 43% use shared service accounts for agents, 52% rely on workload identities rather than agent-specific credentials, and 68% cannot distinguish agent activity from human activity in their logs. No single function claimed ownership of AI agent access. Security said it was a developer's responsibility. Developers said it was a security responsibility. Nobody owned it. CrowdStrike CTO Elia Zaitsev, in an exclusive VentureBeat interview, said the pattern should look familiar. "A lot of what securing agents look like would be very similar to what it looks like to secure highly privileged users. They have identities, they have access to underlying systems, they reason, they take action," Zaitsev said. "There's rarely going to be one single solution that is the silver bullet. It's a defense in depth strategy." CrowdStrike CEO George Kurtz highlighted ClawHavoc (a supply chain campaign targeting the OpenClaw agentic framework) at RSAC during his keynote. Koi Security named the campaign on February 1, 2026. Antiy CERT confirmed 1,184 malicious skills tied to 12 publisher accounts, according to multiple independent analyses of the campaign. Snyk's ToxicSkills research found that 36.8% of the 3,984 ClawHub skills scanned contain security flaws at any severity level, with 13.4% rated critical. Average breakout time has dropped to 29 minutes. Fastest observed: 27 seconds. 
(CrowdStrike 2026 Global Threat Report) Anthropic separates the brain from the hands Anthropic's Managed Agents, launched April 8 in public beta, split every agent into three components that do not trust each other: a brain (Claude and the harness routing its decisions), hands (disposable Linux containers where code executes), and a session (an append-only event log outside both). Separating instructions from execution is one of the oldest patterns in software. Microservices, serverless functions, and message queues. Credentials never enter the sandbox. Anthropic stores OAuth tokens in an external vault. When the agent needs to call an MCP tool, it sends a session-bound token to a dedicated proxy. The proxy fetches real credentials from the vault, makes the external call, and returns the result. The agent never sees the actual token. Git tokens get wired into the local remote at sandbox initialization. Push and pull work without the agent touching the credential. For security directors, this means a compromised sandbox yields nothing an attacker can reuse. The security gain arrived as a side effect of a performance fix. Anthropic decoupled the brain from the hands so inference could start before the container booted. Median time to first token dropped roughly 60%. The zero-trust design is also the fastest design. That kills the enterprise objection that security adds latency. Session durability is the third structural gain. A container crash in the monolithic pattern means total state loss. In Managed Agents, the session log persists outside both brain and hands. If the harness crashes, a new one boots, reads the event log, and resumes. No state lost turns into a productivity gain over time. Managed Agents include built-in session tracing through the Claude Console. Pricing: $0.08 per session-hour of active runtime, idle time excluded, plus standard API token costs. 
Security directors can now model agent compromise cost per session-hour against the cost of the architectural controls. Nvidia locks the sandbox down and monitors everything inside it Nvidia's NemoClaw, released March 16 in early preview, takes the opposite approach. It does not separate the agent from its execution environment. It wraps the entire agent inside four stacked security layers and watches every move. Anthropic and Nvidia are the only two vendors to have shipped zero-trust agent architectures publicly as of this writing; others are in development. NemoClaw stacks five enforcement layers between the agent and the host. Sandboxed execution uses Landlock, seccomp, and network namespace isolation at the kernel level. Default-deny outbound networking forces every external connection through explicit operator approval via YAML-based policy. Access runs with minimal privileges. A privacy router directs sensitive queries to locally-running Nemotron models, cutting token cost and data leakage to zero. The layer that matters most to security teams is intent verification: OpenShell's policy engine intercepts every agent action before it touches the host. The trade-off for organizations evaluating NemoClaw is straightforward. Stronger runtime visibility costs more operator staffing. The agent does not know it is inside NemoClaw. In-policy actions return normally. Out-of-policy actions get a configurable denial. Observability is the strongest layer. A real-time Terminal User Interface logs every action, every network request, every blocked connection. The audit trail is complete. The problem is cost: operator load scales linearly with agent activity. Every new endpoint requires manual approval. Observation quality is high. Autonomy is low. That ratio gets expensive fast in production environments running dozens of agents. Durability is the gap nobody's talking about. Agent state persists as files inside the sandbox. If the sandbox fails, the state goes with it. 
No external session recovery mechanism exists. Long-running agent tasks carry a durability risk that security teams need to price into deployment planning before they hit production. The credential proximity gap Both architectures are a real step up from the monolithic default. Where they diverge is the question that matters most to security teams: how close do credentials sit to the execution environment? Anthropic removes credentials from the blast radius entirely. If an attacker compromises the sandbox through prompt injection, they get a disposable container with no tokens and no persistent state. Exfiltrating credentials requires a two-hop attack: influence the brain's reasoning, then convince it to act through a container that holds nothing worth stealing. Single-hop exfiltration is structurally eliminated. NemoClaw constrains the blast radius and monitors every action inside it. Four security layers limit lateral movement. Default-deny networking blocks unauthorized connections. But the agent and generated code share the same sandbox. Nvidia's privacy router keeps inference credentials on the host, outside the sandbox. But messaging and integration tokens (Telegram, Slack, Discord) are injected into the sandbox as runtime environment variables. Inference API keys are proxied through the privacy router and not passed into the sandbox directly. The exposure varies by credential type. Credentials are policy-gated, not structurally removed. That distinction matters most for indirect prompt injection, where an adversary embeds instructions in content the agent queries as part of legitimate work. A poisoned web page. A manipulated API response. The intent verification layer evaluates what the agent proposes to do, not the content of data returned by external tools. Injected instructions enter the reasoning chain as trusted context. With proximity to execution. 
In the Anthropic architecture, indirect injection can influence reasoning but cannot reach the credential vault. In the NemoClaw architecture, injected context sits next to both reasoning and execution inside the shared sandbox. That is the widest gap between the two designs. NCC Group's David Brauchler, Technical Director and Head of AI/ML Security, advocates for gated agent architectures built on trust segmentation principles where AI systems inherit the trust level of the data they process. Untrusted input, restricted capabilities. Both Anthropic and Nvidia move in this direction. Neither fully arrives. The zero-trust architecture audit for AI agents The audit grid covers three vendor patterns across six security dimensions, five actions per row. It distills to five priorities: Audit every deployed agent for the monolithic pattern. Flag any agent holding OAuth tokens in its execution environment. The CSA data shows 43% use shared service accounts. Those are the first targets. Require credential isolation in agent deployment RFPs. Specify whether the vendor removes credentials structurally or gates them through policy. Both reduce risk. They reduce it by different amounts with different failure modes. Test session recovery before production. Kill a sandbox mid-task. Verify state survives. If it does not, long-horizon work carries a data-loss risk that compounds with task duration. Staff for the observability model. Anthropic's console tracing integrates with existing observability workflows. NemoClaw's TUI requires an operator-in-the-loop. The staffing math is different. Track indirect prompt injection roadmaps. Neither architecture fully resolves this vector. Anthropic limits the blast radius of a successful injection. NemoClaw catches malicious proposed actions but not malicious returned data. Require vendor roadmap commitments on this specific gap. Zero trust for AI agents stopped being a research topic the moment two architectures shipped. 
The monolithic default is a liability. The 65-point gap between deployment velocity and security approval is where the next class of breaches will start.
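
The structural version of credential isolation described above (a session-bound token in the sandbox, real credentials held in a vault behind a proxy) can be sketched in a few lines. Everything in this example is hypothetical: the class names, method names, and in-memory storage are assumptions made for illustration and do not represent Anthropic's or Nvidia's actual APIs.

```python
import secrets


class CredentialVault:
    """Holds real credentials. In this pattern it lives outside the sandbox."""

    def __init__(self):
        self._secrets = {}

    def store(self, service, credential):
        self._secrets[service] = credential

    def fetch(self, service):
        return self._secrets[service]


class ToolProxy:
    """Makes external calls on the agent's behalf so credentials never enter the sandbox."""

    def __init__(self, vault, call_external):
        self._vault = vault
        # call_external(service, credential, request) performs the real outbound call.
        self._call_external = call_external
        self._sessions = {}  # session token -> services that session may call

    def open_session(self, allowed_services):
        token = secrets.token_urlsafe(16)
        self._sessions[token] = frozenset(allowed_services)
        return token

    def close_session(self, token):
        self._sessions.pop(token, None)

    def call_tool(self, session_token, service, request):
        allowed = self._sessions.get(session_token)
        if allowed is None or service not in allowed:
            raise PermissionError(f"session not authorized for {service!r}")
        credential = self._vault.fetch(service)
        # Only the result goes back to the sandbox; the credential stays here.
        return self._call_external(service, credential, request)
```

In this sketch, code running in the sandbox holds only the value returned by `open_session`. If that token leaks, it is useless once `close_session` runs, and it never grants access to services outside its allow list. A policy-gated design would instead place the credential inside the sandbox and rely on runtime checks to stop it from leaving.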

Intuit compressed months of tax code implementation into hours — and built a workflow any regulated-industry team can adapt

4/10/2026

When the One Big Beautiful Bill arrived as a 900-page unstructured document — with no standardized schema, no published IRS forms, and a hard shipping deadline — Intuit's TurboTax team had a question: could AI compress a months-long implementation into days without sacrificing accuracy? What they built to do it is less a tax story than a template, a workflow combining commercial AI tools, a proprietary domain-specific language and a custom unit test framework that any domain-constrained development team can learn from. Joy Shaw, director of tax at Intuit, has spent more than 30 years at the company and lived through both the Tax Cuts and Jobs Act and the OBBB. "There was a lot of noise in the law itself and we were able to pull out the tax implications, narrow it down to the individual tax provisions, narrow it down to our customers," Shaw told VentureBeat. "That kind of distillation was really fast using the tools, and then enabled us to start coding even before we got forms and instructions in." How the OBBB raised the bar When the Tax Cuts and Jobs Act passed in 2017, the TurboTax team worked through the legislation without AI assistance. It took months, and the accuracy requirements left no room for shortcuts. "We used to have to go through the law and we'd code sections that reference other law code sections and try and figure it out on our own," Shaw said. The OBBB arrived with the same accuracy requirements but a different profile. At 900-plus pages, it was structurally more complex than the TCJA. It came as an unstructured document with no standardized schema. The House and Senate versions used different language to describe the same provisions. And the team had to begin implementation before the IRS had published official forms or instructions. The question was whether AI tools could compress the timeline without compromising the output. The answer required a specific sequence and tooling that did not exist yet. 
From unstructured document to domain-specific code The OBBB was still moving through Congress when the TurboTax team began working on it. Using large language models, the team summarized the House version, then the Senate version and then reconciled the differences. Both chambers referenced the same underlying tax code sections, a consistent anchor point that let the models draw comparisons across structurally inconsistent documents. By signing day, the team had already filtered provisions to those affecting TurboTax customers, narrowed to specific tax situations and customer profiles. Parsing, reconciliation and provision filtering moved from weeks to hours. Those tasks were handled by ChatGPT and general-purpose LLMs. But those tools hit a hard limit when the work shifted from analysis to implementation. TurboTax does not run on a standard programming language. Its tax calculation engine is built on a proprietary domain-specific language maintained internally at Intuit. Any model generating code for that codebase has to translate legal text into syntax it was never trained on, and identify how new provisions interact with decades of existing code without breaking what already works. Claude became the primary tool for that translation and dependency-mapping work. Shaw said it could identify what changed and what did not, letting developers focus only on the new provisions. "It's able to integrate with the things that don't change and identify the dependencies on what did change," she said. "That sped up the process of development and enabled us to focus only on those things that did change." Building tooling matched to a near-zero error threshold General-purpose LLMs got the team to working code. Getting that code to shippable required two proprietary tools built during the OBBB cycle. The first auto-generated TurboTax product screens directly from the law changes. Previously, developers curated those screens individually for each provision. 
The new tool handled the majority automatically, with manual customization only where needed. The second was a purpose-built unit test framework. Intuit had always run automated tests, but the previous system produced only pass/fail results. When a test failed, developers had to manually open the underlying tax return data file to trace the cause. "The automation would tell you pass, fail, you would have to dig into the actual tax data file to see what might have been wrong," Shaw said. The new framework identifies the specific code segment responsible, generates an explanation and allows the correction to be made inside the framework itself. Shaw said accuracy for a consumer tax product has to be close to 100 percent. Sarah Aerni, Intuit's VP of technology for the Consumer Group, said the architecture has to produce deterministic results. "Having the types of capabilities around determinism and verifiably correct through tests — that's what leads to that sort of confidence," Aerni said. The tooling handles the speed. But Intuit also uses LLM-based evaluation tools to validate AI-generated output, and even those require a human tax expert to assess whether the result is correct. "It comes down to having human expertise to be able to validate and verify just about anything," Aerni said. Four components any regulated-industry team can use The OBBB was a tax problem, but the underlying conditions are not unique to tax. Healthcare, financial services, legal tech and government contracting teams regularly face the same combination: complex regulatory documents, hard deadlines, proprietary codebases, and near-zero error tolerance. Based on Intuit's implementation, four elements of the workflow are transferable to other domain-constrained development environments: Use commercial LLMs for document analysis. General-purpose models handle parsing, reconciliation and provision filtering well. That is where they add speed without creating accuracy risk. 
Shift to domain-aware tooling when analysis becomes implementation. General-purpose models generating code into a proprietary environment without understanding it will produce output that cannot be trusted at scale. Build evaluation infrastructure before the deadline, not during the sprint. Generic automated testing produces pass/fail outputs. Domain-specific test tooling that identifies failures and enables in-context fixes is what makes AI-generated code shippable. Deploy AI tools across the whole organization, not just engineering. Shaw said Intuit trained and monitored usage across all functions. AI fluency was distributed across the organization rather than concentrated in early adopters. "We continue to lean into the AI and human intelligence opportunity here, so that our customers get what they need out of the experiences that we build," Aerni said.
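The gap between pass/fail output and a framework that points at the responsible code segment can be illustrated with a toy sketch. Everything below is hypothetical: the step names, the figures and the `run_case` helper are invented for illustration and do not describe Intuit's proprietary framework or its domain-specific language.

```python
# Toy calculation pipeline: each named step reads and extends a dict of values.
# Step names and figures are invented for illustration only.
def gross_income(v):
    return v["wages"] + v["interest"]

def deduction(v):
    return 1000  # deliberately stale constant standing in for an outdated rule

def taxable_income(v):
    return max(0, v["gross_income"] - v["deduction"])

STEPS = [("gross_income", gross_income),
         ("deduction", deduction),
         ("taxable_income", taxable_income)]

def run_case(inputs, expected):
    """Run all steps; return (passed, first_diverging_step, explanation)."""
    values = dict(inputs)
    for name, fn in STEPS:
        values[name] = fn(values)
        # Compare every intermediate value the test case knows about,
        # so the report names the first step that diverged.
        if name in expected and values[name] != expected[name]:
            why = f"{name}: got {values[name]}, expected {expected[name]}"
            return False, name, why
    return True, None, "all checked steps matched"

result = run_case({"wages": 50000, "interest": 200},
                  {"gross_income": 50200, "deduction": 1500,
                   "taxable_income": 48700})
print(result)
```

A pass/fail harness would report only that the final figure was wrong. Checking intermediate values lets the runner name the `deduction` step directly, which is the kind of localization the article describes.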

OpenAI introduces ChatGPT Pro $100 tier with 5X usage limits for Codex compared to Plus

4/9/2026

OpenAI is making moves to try and court more developers and vibe coders (those who build software using AI models and natural language) away from rivals like Anthropic. Today, the firm arguably most synonymous with the generative AI boom announced it will begin offering a new, more mid-range subscription tier — a $100 ChatGPT Pro plan — which joins its free, Go ($8 monthly), Plus ($20 monthly) and existing Pro ($200 monthly) plans for individuals using ChatGPT and related OpenAI products. OpenAI also currently offers Edu, Business ($25 per user monthly, formerly known as Team) and Enterprise (variably priced) plans for organizations in said sectors. Why offer a $100 monthly ChatGPT Pro plan? So why introduce a new $100 ChatGPT Pro plan, then? The big selling point from OpenAI is that the new plan offers five times greater usage limits on Codex, the company's agentic vibe coding application/harness (the name is shared by both, as well as a lineup of coding-specific language models), than the existing, $20 monthly Plus plan, which seems fair given the math ($20x5=$100). As OpenAI co-founder and CEO Sam Altman wrote in a post on X: "It is very nice to see Codex getting so much love. We are launching a $100 ChatGPT Pro tier by very popular demand." However, alongside this, OpenAI's official company account on X noted that "we’re rebalancing Codex usage in [ChatGPT] Plus to support more sessions throughout the week, rather than longer sessions in a single day." That sounds a lot like OpenAI is also simultaneously reducing how much ChatGPT Plus users can use its Codex harness and application per day. What are the new usage limits for the new $100 ChatGPT Pro plan vs. the $20 Plus? So, what are the current limits on the $20 Plus plan? The new Pro plan gives you 5X greater than...what? 
Turns out, this is trickier than you'd think to calculate, because it actually varies depending on which underlying AI model you are using to power the Codex application or harness, and whether you are working on code stored in the cloud or locally on your machine or servers. OpenAI’s Developer website underwent several updates today, so we've only reflected the latest pricing structure and offerings below as of Thursday, April 9 at 10:45 pm ET. It notes that for individual users, Codex usage is categorized by “Local Messages” (tasks run on the user’s machine) and “Cloud Tasks” (tasks run on OpenAI’s infrastructure), and those limits share a five-hour rolling window. It also says additional weekly limits may apply. The current Codex pricing page now shows lower displayed usage ranges than the older version, and it measures Code Reviews in a five-hour window rather than per week. For Pro 5x specifically, OpenAI says the currently shown limits include a temporary 2x usage boost that ends May 31, 2026.

ChatGPT Plus ($20/month)
GPT-5.4: 20–100 local messages every 5 hours.
GPT-5.4-mini: 60–350 local messages every 5 hours.
GPT-5.3-Codex: 30–150 local messages and 10–60 cloud tasks every 5 hours.
Code Reviews: 20–50 every 5 hours.

ChatGPT Pro 5x ($100/month)
GPT-5.4: 200–1,000 local messages every 5 hours.
GPT-5.4-mini: 600–3,500 local messages every 5 hours.
GPT-5.3-Codex: 300–1,500 local messages and 100–600 cloud tasks every 5 hours.
Code Reviews: 200–500 every 5 hours.
Note: The limits shown for Pro 5x include a temporary 2x usage boost that ends May 31, 2026.

ChatGPT Pro 20x ($200/month)
GPT-5.4: 400–2,000 local messages every 5 hours.
GPT-5.4-mini: 1,200–7,000 local messages every 5 hours.
GPT-5.3-Codex: 600–3,000 local messages and 200–1,200 cloud tasks every 5 hours.
Code Reviews: 400–1,000 every 5 hours.
Exclusive access: Includes GPT-5.3-Codex-Spark in research preview for ChatGPT Pro users only.
OpenAI says it has its own separate usage limit, which may adjust based on demand. And as OpenAI's Help documentation states: "The number of Codex messages you can send within these limits varies based on the size and complexity of your coding tasks, and where you execute tasks. Small scripts or simple functions may only consume a fraction of your allowance, while larger codebases, long running tasks, or extended sessions that require Codex to hold more context will use significantly more per message."

The larger strategic implications and context

OpenAI’s sudden move toward the $100 price point and expanded agentic capacity comes amid the unprecedented financial ascent of its chief rival, Anthropic. Just days ago, Anthropic revealed its annualized run-rate revenue (ARR) has topped $30 billion, surpassing OpenAI's last reported ARR of approximately $24–$25 billion. This growth has been fueled by the massive adoption of Claude Code and Claude Cowork, products that have set the benchmark for enterprise-grade autonomous coding. The competitive friction intensified on April 4, 2026, when Anthropic officially blocked Claude subscriptions from being used to provide the intelligence for third-party agentic AI harnesses like OpenClaw. To be clear, Anthropic's Claude models themselves can still be used with OpenClaw; users must now pay for access to them through Anthropic's application programming interface (API) or extra usage credits, rather than as part of the monthly Claude subscription tiers. Some have likened those tiers to an "all-you-can-eat" buffet, which makes the economics challenging for Anthropic when power users and third-party harnesses like OpenClaw consume more in tokens than the $20 or $200 that users pay monthly for the plans.
OpenClaw’s creator, Peter Steinberger, was notably hired by OpenAI in February 2026 to lead their personal agent strategy, and has, since joining, actively spoken out against Anthropic's limitations — advising that OpenAI's Codex and models generally don't have the same restrictions as Anthropic is now imposing. By hiring Steinberger and subsequently launching a Pro tier that provides the high-volume capacity Anthropic recently restricted, OpenAI is effectively courting the displaced OpenClaw community to reclaim the professional developer market.
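The Codex ranges listed earlier in this piece can be sanity-checked with a few lines of arithmetic. The sketch below hard-codes the GPT-5.4 local-message ranges quoted in the article and divides each tier by the Plus baseline; the dictionary and function names are illustrative and are not taken from OpenAI.

```python
# GPT-5.4 local-message ranges per 5-hour window, as quoted in this article.
LIMITS = {
    "Plus": (20, 100),
    "Pro 5x": (200, 1000),   # displayed figure includes the temporary 2x boost
    "Pro 20x": (400, 2000),
}

def multiple_of_plus(tier: str) -> tuple[float, float]:
    """Return (low, high) of the tier's range divided by the Plus range."""
    base_low, base_high = LIMITS["Plus"]
    low, high = LIMITS[tier]
    return low / base_low, high / base_high

for tier in LIMITS:
    print(tier, multiple_of_plus(tier))
```

On these figures the displayed Pro 5x range works out to 10 times Plus, which is consistent with a 5x plan carrying the temporary 2x boost. Pro 20x comes out at 20 times Plus, or only twice the boosted Pro 5x range. Dividing out the boost would imply roughly 100–500 GPT-5.4 local messages for Pro 5x after May 31, 2026, though that is an inference from the quoted numbers rather than a figure OpenAI has published in the material cited here.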

Mythos autonomously exploited vulnerabilities that survived 27 years of human review. Security teams need a new detection playbook

4/9/2026

A 27-year-old bug sat inside OpenBSD’s TCP stack while auditors reviewed the code, fuzzers ran against it, and the operating system earned its reputation as one of the most security-hardened platforms on earth. Two packets could crash any server running it. Finding that bug cost a single Anthropic discovery campaign approximately $20,000. The specific model run that surfaced the flaw cost under $50. Anthropic’s Claude Mythos Preview found it. Autonomously. No human guided the discovery after the initial prompt. The capability jump is not incremental On Firefox 147 exploit writing, Mythos succeeded 181 times versus 2 for Claude Opus 4.6. A 90x improvement in a single generation. SWE-bench Pro: 77.8% versus 53.4%. CyberGym vulnerability reproduction: 83.1% versus 66.6%. Mythos saturated Anthropic’s Cybench CTF at 100%, forcing the red team to shift to real-world zero-day discovery as the only meaningful evaluation left. Then it surfaced thousands of zero-day vulnerabilities across every major operating system and every major browser, many one to two decades old. Anthropic engineers with no formal security training asked Mythos to find remote code execution vulnerabilities overnight and woke up to a complete, working exploit by morning, according to Anthropic’s red team assessment. Anthropic assembled Project Glasswing, a 12-partner defensive coalition including CrowdStrike, Cisco, Palo Alto Networks, Microsoft, AWS, Apple, and the Linux Foundation, backed by $100 million in usage credits and $4 million in open-source grants. Over 40 additional organizations that build or maintain critical software infrastructure also received access. The partners have been running Mythos against their own infrastructure for weeks. Anthropic committed to a public findings report “within 90 days,” landing in early July 2026. Security directors got the announcement. They didn’t get the playbook. 
“I’ve been in this industry for 27 years,” Cisco SVP and Chief Security and Trust Officer Anthony Grieco told VentureBeat in an exclusive interview at RSAC 2026. “I have never been more optimistic for what we can do to change security because of the velocity. It’s also a little bit terrifying because we’re moving so quickly. It’s also terrifying because our adversaries have this capability as well, and so frankly, we must move this quickly.” Security directors saw this story told fifteen different ways this week, including VentureBeat’s exclusive interview with Anthropic’s Newton Cheng. As one widely shared X post summarizing the Mythos findings noted, the model cracked cryptography libraries, broke into a production virtual machine monitor, and gave engineers with zero security training working exploits by morning. What that coverage left unanswered: Where does the detection ceiling sit in the methods they already run, and what should they change before July? Seven vulnerability classes that show where every detection method hits its ceiling OpenBSD TCP SACK, 27 years old. Two crafted packets crash any server. SAST, fuzzers, and auditors missed a logic flaw requiring semantic reasoning about how TCP options interact under adversarial conditions. Campaign cost ~$20,000. Anthropic notes the $50 per-run figure reflects hindsight. FFmpeg H.264 codec, 16 years old. Fuzzers exercised the vulnerable code path 5 million times without triggering the flaw, according to Anthropic. Mythos caught it by reasoning about code semantics. Campaign cost ~$10,000. FreeBSD NFS remote code execution, CVE-2026-4747, 17 years old. Unauthenticated root from the internet, per Anthropic’s assessment and independent reproduction. Mythos built a 20-gadget ROP chain split across multiple packets. Fully autonomous. Linux kernel local privilege escalation. Mythos chained two to four low-severity vulnerabilities into full local privilege escalation via race conditions and KASLR bypasses. 
CSA’s Rich Mogull noted Mythos failed at remote kernel exploitation but succeeded locally. No automated tool chains vulnerabilities today. Browser zero-days across every major browser. Thousands identified. Some required human-model collaboration. In one case, Mythos chained four vulnerabilities into a JIT heap spray, escaping both the renderer and the OS sandboxes. Firefox 147: 181 working exploits versus two for Opus 4.6. Cryptography library vulnerabilities (TLS, AES-GCM, SSH). Implementation flaws enabling certificate forgery or decryption of encrypted communications, per Anthropic’s red team blog and Help Net Security. A critical Botan library certificate bypass was disclosed the same day as the Glasswing announcement. Bugs in the code that implements the math. Not attacks on the math itself. Virtual machine monitor guest-to-host escape. Guest-to-host memory corruption in a production VMM, the technology keeping cloud workloads from seeing each other’s data. Cloud security architectures assume workload isolation holds. This finding breaks that assumption. Nicholas Carlini, in Anthropic’s launch briefing: “I’ve found more bugs in the last couple of weeks than I found in the rest of my life combined.”

VentureBeat's prescriptive matrix

Vulnerability class: OS kernel logic (OpenBSD 27yr, Linux 2-4 chain)
Why current methods miss it: SAST lacks semantic reasoning. Fuzzers miss logic flaws. Pen testers time-boxed. Bounties scope-exclude kernel.
What Mythos does: Chains 2-4 low-severity findings into local priv-esc. ~$20K campaign.
Security director action: Add AI-assisted kernel review to pen test RFPs. Expand bounty scope. Request Glasswing findings from OS vendors before July. Re-score clustered findings by chainability.

Vulnerability class: Media codec (FFmpeg 16yr H.264)
Why current methods miss it: SAST unflagged. Fuzzers hit path 5M times, never triggered.
What Mythos does: Reasons about semantics beyond brute-force. ~$10K campaign.
Security director action: Inventory FFmpeg, libwebp, ImageMagick, libpng. Stop treating fuzz coverage as security proxy. Track Glasswing codec CVEs from July.

Vulnerability class: Network stack RCE (FreeBSD 17yr, CVE-2026-4747)
Why current methods miss it: DAST limited at protocol depth. Pen tests skip NFS.
What Mythos does: Full autonomous chain to unauthenticated root. 20-gadget ROP chain.
Security director action: Patch CVE-2026-4747 now. Inventory NFS/SMB/RPC services. Add protocol fuzzing to 2026 cycle.

Vulnerability class: Multi-vuln chaining (2-4 sequenced, local)
Why current methods miss it: No tool chains. Pen testers hours-limited. CVSS scores in isolation.
What Mythos does: Autonomous local chaining via race conditions + KASLR bypass.
Security director action: Require AI-assisted chaining in pen test methodology. Build chainability scoring. Budget AI red teams for 2026.

Vulnerability class: Browser zero-days (thousands, 181 Firefox exploits)
Why current methods miss it: Bounties + continuous fuzzing missed thousands. Some required human-model collaboration.
What Mythos does: 90x over Opus 4.6. Chained 4 vulns into JIT heap spray escaping renderer + OS sandbox.
Security director action: Shorten patch SLA to 72hr critical. Pre-stage pipeline for July cycle. Pressure vendors for Glasswing timelines.

Vulnerability class: Crypto libraries (TLS, AES-GCM, SSH, Botan bypass)
Why current methods miss it: SAST limited on crypto logic. Pen testers rarely audit crypto depth. Formal verification not standard.
What Mythos does: Found cert forgery + decryption flaws in battle-tested libraries.
Security director action: Audit all crypto library versions now. Track Glasswing crypto CVEs from July. Accelerate PQC migration.

Vulnerability class: VMM / hypervisor (guest-to-host memory corruption)
Why current methods miss it: Cloud security assumes isolation. Few pen tests target hypervisor. Bounties rarely scope VMM.
What Mythos does: Guest-to-host escape in production VMM.
Security director action: Inventory hypervisor/VMM versions. Request Glasswing findings from cloud providers. Reassess multi-tenant isolation assumptions.

Attackers are faster. Defenders are patching once a year.

The CrowdStrike 2026 Global Threat Report documents a 29-minute average eCrime breakout time, 65% faster than 2024, with an 89% year-over-year surge in AI-augmented attacks. CrowdStrike CTO Elia Zaitsev put the operational reality plainly in an exclusive interview with VentureBeat.
“Adversaries leveraging agentic AI can perform those attacks at such a great speed that a traditional human process of look at alert, triage, investigate for 15 to 20 minutes, take an action an hour, a day, a week later, it’s insufficient,” Zaitsev said. A $20,000 Mythos discovery campaign that runs in hours replaces months of nation-state research effort. CrowdStrike CEO George Kurtz reinforced that timeline pressure on LinkedIn the same day as the Glasswing announcement. "AI is creating the largest security demand driver since enterprises moved to the cloud," Kurtz wrote. The regulatory clock compounds the operational one. The EU AI Act's next enforcement phase takes effect August 2, 2026, imposing automated audit trails, cybersecurity requirements for every high-risk AI system, incident reporting obligations, and penalties up to 3% of global revenue. Security directors face a two-wave sequence: July's Glasswing disclosure cycle, then August's compliance deadline. Mike Riemer, Field CISO at Ivanti and a 25-year US Air Force veteran who works closely with federal cybersecurity agencies, told VentureBeat what he is hearing from the government. “Threat actors are reverse engineering patches, and the speed at which they’re doing it has been enhanced greatly by AI,” Riemer said. “They’re able to reverse engineer a patch within 72 hours. So if I release a patch and a customer doesn’t patch within 72 hours of that release, they’re open to exploit.” Riemer was blunt about where that leaves the industry. “They are so far in front of us as defenders,” he said. Grieco confirmed the other side of that collision at RSAC 2026. “If you talk to an operational team and many of our customers, they’re only patching once a year,” Grieco told VentureBeat. “And frankly, even in the best of circumstances, that is not fast enough.” CSA’s Mogull makes the structural case that defenders hold the long-term advantage: fix a vulnerability once and every deployment benefits. 
But the transition period, when attackers reverse-engineer patches in 72 hours and defenders patch once a year, favors offense. Mythos is not the only model finding these bugs. Researchers at AISLE, an AI cybersecurity startup, tested Anthropic's showcase vulnerabilities on small, open-weights models and found that eight out of eight detected the FreeBSD exploit. AISLE says one model had only 3.6 billion parameters and costs 11 cents per million tokens, and that a 5.1-billion-parameter open model recovered the core analysis chain of the 27-year-old OpenBSD bug. AISLE's conclusion: "The moat in AI cybersecurity is the system, not the model." That makes the detection ceiling a structural problem, not a Mythos-specific one. Cheap models find the same bugs. The July timeline gets shorter, not longer. Over 99% of the vulnerabilities Mythos has identified have not yet been patched, per Anthropic’s red team blog. The public Glasswing report lands in early July 2026. It will trigger a high-volume patch cycle across operating systems, browsers, cryptography libraries, and major infrastructure software. Security directors who have not expanded their patch pipeline, re-scoped their bug bounty programs, and built chainability scoring by then will absorb that wave cold. July is not a disclosure event. It is a patch tsunami. What to tell the board Every security director tells the board “we have scanned everything.” Merritt Baer, CSO at Enkrypt AI and former Deputy CISO at AWS, told VentureBeat that the statement does not survive Mythos without a qualifier. “What security leaders actually mean is: we have exhaustively scanned for what our tools know how to see,” Baer said in an exclusive interview with VentureBeat. 
“That’s a very different claim.” Baer proposed reframing residual risk for boards around three tiers: known-knowns (vulnerability classes your stack reliably detects), known-unknowns (classes you know exist but your tools only partially cover, like stateful logic flaws and auth boundary confusion), and unknown-unknowns (vulnerabilities that emerge from composition, how safe components interact in unsafe ways). “This is where Mythos is landing,” Baer said. The board-level statement Baer recommends: “We have high confidence in detecting discrete, known vulnerability classes. Our residual risk is concentrated in cross-function, multi-step, and compositional flaws that evade single-point scanners. We are actively investing in capabilities that raise that detection ceiling.” On chainability, Baer was equally direct. “Chainability has to become a first-class scoring dimension,” she said. “CVSS was built to score atomic vulnerabilities. Mythos is exposing that risk is increasingly graph-shaped, not point-in-time.” Baer outlined three shifts security programs need to make: from severity scoring to exploitability pathways, from vulnerability lists to vulnerability graphs that model relationships across identity, data flow, and permissions, and from remediation SLAs to path disruption, where fixing any node that breaks the chain gets priority over fixing the highest individual CVSS. “Mythos isn’t just finding missed bugs,” Baer said. “It’s invalidating the assumption that vulnerabilities are independent. Security programs that don’t adapt, from coverage thinking to interaction thinking, will keep reporting green dashboards while sitting on red attack paths.” VentureBeat will update this story with additional operational details from Glasswing's founding partners as interviews are completed.
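Baer's shift from vulnerability lists to vulnerability graphs, and from severity scoring to path disruption, can be made concrete with a small sketch. The findings, CVSS values and graph edges below are hypothetical and chosen only to show how the two rankings can disagree; this is one possible way to model chainability, not a method attributed to Baer or Anthropic.

```python
from collections import Counter

# Hypothetical findings with illustrative CVSS scores (not real CVEs).
CVSS = {"info_leak": 3.7, "race_condition": 4.4, "kaslr_bypass": 5.1,
        "weak_tls_config": 7.5}

# Edge A -> B means "exploiting A puts an attacker in position to exploit B".
# "entry" and "root" are virtual start and goal states.
ENABLES = {
    "entry": ["info_leak", "race_condition", "weak_tls_config"],
    "info_leak": ["kaslr_bypass"],
    "race_condition": ["kaslr_bypass"],
    "kaslr_bypass": ["root"],
    "weak_tls_config": [],   # severe in isolation, but leads nowhere here
}

def attack_paths(node="entry", goal="root", seen=()):
    """Yield every simple path from node to goal (depth-first search)."""
    path = seen + (node,)
    if node == goal:
        yield path
        return
    for nxt in ENABLES.get(node, []):
        if nxt not in path:          # avoid revisiting a node (no cycles)
            yield from attack_paths(nxt, goal, path)

paths = list(attack_paths())
# Count how many complete chains each real finding participates in.
on_paths = Counter(f for p in paths for f in p if f in CVSS)

by_cvss = max(CVSS, key=CVSS.get)
by_disruption = max(CVSS, key=lambda f: on_paths[f])
print("paths:", paths)
print("highest CVSS:", by_cvss)
print("breaks most chains:", by_disruption)
```

In this toy graph the highest-CVSS finding sits on no path to root, while the mid-severity `kaslr_bypass` sits on both. A path-disruption ranking would fix it first, which is the prioritization Baer describes.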

Goodbye, Llama? Meta launches new proprietary AI model Muse Spark — first since Superintelligence Labs' formation

4/8/2026

Meta has been one of the most interesting companies of the generative AI era, initially gaining a large and loyal following for its mostly open source Llama family of large language models (LLMs) beginning in early 2023, but coming to a screeching halt last year after Llama 4 debuted to mixed reviews and, ultimately, admissions of gaming benchmarks. That bumpy rollout of Llama 4 apparently spurred Meta founder and CEO Mark Zuckerberg to totally overhaul Meta's AI operations in the summer of 2025, forming a new internal division, Meta Superintelligence Labs (MSL), which he recruited 29-year-old former Scale AI co-founder and CEO Alexandr Wang to lead as Chief AI Officer. Now, today, Meta is showing us the fruits of that effort: Muse Spark, a new proprietary model that Wang says (posting on rival social network X, used more often by the machine learning community) is "the most powerful model that meta has released," and has "support for tool-use, visual chain of thought, & multi-agent orchestration." He also says it will be the start of a new Muse family of models, raising questions about what will become of Meta's popular lineup and ongoing development of the Llama family. It arrives not as a generic chatbot, but as the foundation for what Wang calls "personal superintelligence": an AI that doesn’t just process text but "sees and understands the world around you" to act as a digital extension of the self, echoing Zuckerberg's public manifesto for a vision of personal superintelligence published in summer 2025. However, it is proprietary only, confined for now to the Meta AI app and website, as well as a "private API preview to select users," according to Meta's blog post announcing it. That move is likely to rankle the literally billions of users of Llama models and the thousands of developers who relied upon them (some of whom are active participants in rival social network Reddit's r/LocalLLaMA subreddit).
In addition, no pricing information for the model has yet been announced. It's unclear if Meta has ended development on the Llama family entirely. When asked directly by VentureBeat, a Meta spokesperson said in an email: “Our current Llama models will continue to be available as open source,” which doesn’t address the question of development of future Llama models. Visual chain-of-thought At its core, Muse Spark is a natively multimodal reasoning model. Unlike previous iterations that "stitched" vision and text together, Muse Spark was rebuilt from the ground up to integrate visual information across its internal logic. This architectural shift enables "visual chain of thought," allowing the model to annotate dynamic environments—identifying the components of a complex espresso machine or correcting a user's yoga form via side-by-side video analysis. The most significant technical leap, however, is a new "Contemplating" mode. This feature orchestrates multiple sub-agents to reason in parallel, allowing Meta to compete with extreme reasoning models like Google's Gemini Deep Think and OpenAI's GPT-5.4 Pro. In benchmarks, this mode achieved 58% in "Humanity’s Last Exam" and 38% in "FrontierScience Research," figures that Meta claims validate their new scaling trajectory. Perhaps more impressive for the company’s bottom line is the model’s efficiency. Meta reports that Muse Spark achieves its reasoning capabilities using over an order of magnitude less compute than Llama 4 Maverick, its previous mid-size flagship. This efficiency is driven by a process called "thought compression". During reinforcement learning, the model is penalized for excessive "thinking time," forcing it to solve complex problems with fewer reasoning tokens without sacrificing accuracy. Benchmarks reveal a return-to-form The launch of Muse Spark is framed as a statistical "quantum leap," ending Meta’s year-long absence from the absolute frontier of AI performance. 
By reconciling Meta’s official internal data with independent auditing from third-party LLM tracking firm Artificial Analysis, a clear picture emerges: Muse Spark is not just a marginal improvement over the Llama series; it is a fundamental re-entry into the "Top 5" global models. According to the Artificial Analysis Intelligence Index v4.0, Muse Spark achieved a score of 52. For context, Meta’s previous flagship, Llama 4 Maverick, debuted in 2025 with an Index score of just 18. By nearly tripling its performance, Muse Spark now sits within striking distance of the industry’s most elite systems, trailing only Gemini 3.1 Pro Preview (57), GPT-5.4 (57), and Claude Opus 4.6 (53). Meta’s official benchmarks suggest that Muse Spark is particularly dominant in multimodal reasoning, specifically where visual figures and logic intersect. CharXiv Reasoning: In "figure understanding," Muse Spark achieved a score of 86.4, significantly outperforming Claude Opus 4.6 (65.3), Gemini 3.1 Pro (80.2), and GPT-5.4 (82.8). MMMU Pro: Official reports place the model at 80.4, while Artificial Analysis’s independent audit measured it at 80.5%. This makes it the second-most capable vision model on the market, surpassed only by Gemini 3.1 Pro Preview (83.9% official; 82.4% independent). Visual Factuality (SimpleVQA): Muse Spark scored 71.3, placing it ahead of GPT-5.4 (61.1) and Grok 4.2 (57.4), though it narrowly trails Gemini 3.1 Pro (72.4). These scores validate Meta’s focus on "visual chain of thought," enabling the model to not just recognize objects, but to reason through complex spatial problems and dynamic annotations. The "Thinking" gear of Muse Spark was put to the test against specialized benchmarks designed to break non-reasoning models. Humanity’s Last Exam (HLE): In this multidisciplinary evaluation, Meta reports a score of 42.8 (No Tools) and 50.4 (With Tools). 
Independent audits by Artificial Analysis tracked the model at 39.9%, trailing Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (41.6%). GPQA Diamond (PhD Level Reasoning): Muse Spark achieved a formidable 89.5, surpassing Grok 4.2 (88.5) but trailing the specialized "max reasoning" outputs of Opus 4.6 (92.7) and Gemini 3.1 Pro (94.3). ARC AGI 2: This remains a notable weak point. Muse Spark scored 42.5, far behind the abstract reasoning puzzles solved by Gemini 3.1 Pro (76.5) and GPT-5.4 (76.1). CritPT (Physics Research): Independent auditing found Muse Spark achieved the 5th highest score at 11%. This marks a substantial lead over Gemini 3 Flash (9%) and Claude 4.6 Sonnet (3%). One of the most striking results from the official data is Muse Spark's performance in the health sector, likely a result of Meta's collaboration with over 1,000 physicians. HealthBench Hard: Muse Spark achieved 42.8, a massive lead over Claude Opus 4.6 (14.8), Gemini 3.1 Pro (20.6), and even GPT-5.4 (40.1). MedXpertQA (Multimodal): It scored 78.4, comfortably ahead of Opus 4.6 (64.8) and Grok 4.2 (65.8), though it still trails Gemini 3.1 Pro’s top-tier score of 81.3. Agentic Systems and Efficiency: The "Thought Compression" Effect While Muse Spark excels at reasoning, its "agentic" performance—executing real-world work tasks—presents a more nuanced picture. SWE-Bench Verified: Muse Spark scored 77.4, trailing Claude Opus 4.6 (80.8) and Gemini 3.1 Pro (80.6). GDPval-AA Elo: Meta’s official score of 1444 differs slightly from Artificial Analysis’s recorded 1427. In both cases, Muse Spark trails GPT-5.4 (1672) and Opus 4.6 (1606), suggesting that while the model "thinks" well, it is still refining its ability to "act" in long-horizon software and office workflows. Token Efficiency: This is where Muse Spark distinguishes itself. To run the Intelligence Index, it used 58 million output tokens. In contrast, Claude Opus 4.6 required 157 million tokens and GPT-5.4 required 120 million. 
This supports Meta's claim of "thought compression": delivering frontier-class intelligence while using less than half the "thinking time" of its closest competitors.

Benchmark                   Llama 4 Maverick (2025)   Muse Spark (Official)   Gemini 3.1 Pro (Official)
Intelligence Index Score    18                        52                      57
MMMU Pro                    --                        80.4                    83.9
CharXiv Reasoning           --                        86.4                    80.2
HealthBench Hard            --                        42.8                    20.6
License                     Open-Weights              Proprietary             Proprietary

With Muse Spark, Meta has successfully transitioned from being the "LAMP stack for AI" to a direct challenger for the title of "Personal Superintelligence". While agentic workflows remain a hurdle, its dominance in vision, health, and token efficiency places Meta back at the center of the frontier race.

Personal wellness and Instagram shopping

Meta is immediately deploying Muse Spark to power specialized experiences across its app family.

Shopping Mode: A new feature that leverages Meta’s vast creator ecosystem. The AI picks up on brands, styling choices, and content across Instagram and Threads to provide personalized recommendations, effectively turning every post into a shoppable interaction.

Health Reasoning: In a move toward medical utility, Meta collaborated with over 1,000 physicians to curate training data. Muse Spark can now analyze nutritional content from photos of food or provide "health scores" for pescatarian diets with high cholesterol.

Interactive UI: The model can generate web-based minigames or tutorials on the fly. For example, a user can prompt the AI to turn a photo into a playable Sudoku game or a highlights-based tutorial for home appliances.

Evaluation awareness

While Muse Spark demonstrates strong refusal behaviors regarding biological and chemical weapons, its safety profile includes a startling new discovery. Third-party testing by Apollo Research found that the model possesses a high degree of "evaluation awareness".
The model frequently recognized when it was being tested in "alignment traps" and reasoned that it should behave honestly specifically because it was under evaluation. While Meta concluded this was not a "blocking concern" for release, the finding suggests that frontier models are becoming increasingly "conscious" of the testing environment—potentially rendering traditional safety benchmarks less reliable as models learn to "game" the exam. What happens to Llama? In February 2023, Meta released Llama 1 to demonstrate that smaller, compute-optimal models could match larger counterparts like GPT-3 in efficiency. Although access was initially restricted to researchers, the model weights were leaked via 4chan on March 3, 2023, an event that inadvertently democratized high-tier research and catalyzed a global movement for running models on consumer-grade hardware. This shift was solidified in July 2023 with the release of Llama 2, which introduced a commercial license that permitted self-hosting for most organizations. This approach saw rapid adoption, with the Llama family exceeding 100 million downloads and supporting over 1,000 commercial applications by the third quarter of 2023. Through 2024 and 2025, Meta scaled the Llama family to establish it as the essential infrastructure for global enterprise AI, frequently referred to as the LAMP stack for AI. Following the launch of Llama 3 in April 2024 and the landmark Llama 3.1 405B in July, Meta achieved performance parity with the world's leading proprietary systems. The subsequent release of Llama 4 in April 2025 introduced a Mixture-of-Experts architecture, allowing for massive parameter scaling while maintaining fast inference speeds. By early 2026, the Llama ecosystem reached a staggering scale, totaling 1.2 billion downloads and averaging approximately one million downloads per day. 
This widespread adoption provided businesses with significant economic sovereignty, as self-hosting Llama models offered an 88% cost reduction compared to using proprietary API providers. As of April 2026, Meta’s role as the undisputed leader of the open-weight movement has transitioned into a highly contested multi-polar landscape characterized by the rise of international competitors. While the United States accounts for 35% of global Llama deployments, Chinese models from labs like Alibaba and DeepSeek began accounting for 41% of downloads on platforms like Hugging Face by late 2025. Throughout early 2026, new entrants such as Zhipu AI’s GLM-5 and Alibaba’s Qwen 3.6 Plus have outpaced Llama 4 Maverick on general knowledge and coding benchmarks. In response to this global pressure, Meta's Muse Spark arrives with hefty expectations and an open source legacy that will be tough to live up to. Proprietary only (for now) The launch marks a controversial departure from Meta AI's "open science" roots. While the Llama series was famously accessible to developers, Muse Spark is launching as a proprietary model. Wang addressed the shift on X, stating: "Nine months ago we rebuilt our ai stack from scratch. New infrastructure, new architecture, new data pipelines... This is step one. Bigger models are already in development with plans to open-source future versions." However, the developer community remains skeptical. Some see this as a necessary pivot after the Llama 4 series failed to gain expected developer traction; others view it as Meta "closing the gates" now that it has a competitive reasoning model. Wang himself acknowledged the transition’s difficulty, noting there are "certainly rough edges we will polish over time". For the 3 billion people using Meta’s apps, the change will be felt almost instantly. 
The AI they interact with is no longer just a library of information, but an agent with a $27 billion brain and a mandate to understand their world as intimately as they do.

New framework lets AI agents rewrite their own skills without retraining the underlying model

4/8/2026

One major challenge in deploying autonomous agents is building systems that can adapt to changes in their environments without the need to retrain the underlying large language models (LLMs). Memento-Skills, a new framework developed by researchers at multiple universities, addresses this bottleneck by giving agents the ability to develop their skills by themselves. "It adds its continual learning capability to the existing offering in the current market, such as OpenClaw and Claude Code," Jun Wang, co-author of the paper, told VentureBeat. Memento-Skills acts as an evolving external memory, allowing the system to progressively improve its capabilities without modifying the underlying model. The framework provides a set of skills that can be updated and expanded as the agent receives feedback from its environment. For enterprise teams running agents in production, that matters. The alternative — fine-tuning model weights or manually building skills — carries significant operational overhead and data requirements. Memento-Skills sidesteps both. The challenges of building self-evolving agents Self-evolving agents are crucial because they overcome the limitations of frozen language models. Once a model is deployed, its parameters remain fixed, restricting it to the knowledge encoded during training and whatever fits in its immediate context window. Giving the model an external memory scaffolding enables it to improve without the costly and slow process of retraining. However, current approaches to agent adaptation largely rely on manually-designed skills to handle new tasks. While some automatic skill-learning methods exist, they mostly produce text-only guides that amount to prompt optimization. Other approaches simply log single-task trajectories that don’t transfer across different tasks. 
Furthermore, when these agents try to retrieve relevant knowledge for a new task, they typically rely on semantic similarity routers, such as standard dense embeddings; high semantic overlap does not guarantee behavioral utility. An agent relying on standard RAG might retrieve a "password reset" script to solve a "refund processing" query simply because the documents share enterprise terminology. "Most retrieval-augmented generation (RAG) systems rely on similarity-based retrieval. However, when skills are represented as executable artifacts such as markdown documents or code snippets, similarity alone may not select the most effective skill," Wang said. How Memento-Skills stores and updates skills To solve the limitations of current agentic systems, the researchers built Memento-Skills. The paper describes the system as “a generalist, continually-learnable LLM agent system that functions as an agent-designing agent.” Instead of keeping a passive log of past conversations, Memento-Skills creates a set of skills that act as a persistent, evolving external memory. These skills are stored as structured markdown files and serve as the agent's evolving knowledge base. Each reusable skill artifact is composed of three core elements. It contains declarative specifications that outline what the skill is and how it should be used. It includes specialized instructions and prompts that guide the language model's reasoning. And it houses the executable code and helper scripts that the agent runs to actually solve the task. Memento-Skills achieves continual learning through its "Read-Write Reflective Learning" mechanism, which frames memory updates as active policy iteration rather than passive data logging. When faced with a new task, the agent queries a specialized skill router to retrieve the most behaviorally relevant skill — not just the most semantically similar one — and executes it. 
After the agent executes the skill and receives feedback, the system reflects on the outcome to close the learning loop. Rather than just appending a log of what happened, the system actively mutates its memory. If the execution fails, an orchestrator evaluates the trace and rewrites the skill artifacts, directly updating the code or prompts to patch the specific failure mode. When no existing skill can be patched to fit, it creates an entirely new one. Memento-Skills also updates the skill router through a one-step offline reinforcement learning process that learns from execution feedback rather than just text overlap. “The true value of a skill lies in how it contributes to the overall agentic workflow and downstream execution,” Wang said. “Therefore, reinforcement learning provides a more suitable framework, as it enables the agent to evaluate and select skills based on long-term utility.” To prevent regression in a production environment, the automated skill mutations are guarded by an automatic unit-test gate. The system generates a synthetic test case, executes it through the updated skill, and checks the results before saving the changes to the global library. By continuously rewriting and refining its own executable tools, Memento-Skills enables a frozen language model to build robust muscle memory and progressively expand its capabilities end-to-end.

Putting the self-evolving agent to the test

The researchers evaluated Memento-Skills on two rigorous benchmarks. The first is General AI Assistants (GAIA), which requires complex multi-step reasoning, multi-modality handling, web browsing, and tool use. The second is Humanity's Last Exam, or HLE, an expert-level benchmark spanning eight diverse academic subjects like mathematics and biology. The entire system was powered by Gemini-3.1-Flash acting as the underlying frozen language model. The system was compared against a Read-Write baseline that retrieves skills and collects feedback but doesn’t have self-evolving features.
The researchers also tested their custom skill router against standard semantic retrieval baselines, including BM25 and Qwen3 embeddings. The results proved that actively self-evolving memory vastly outperforms a static skill library. On the highly diverse GAIA benchmark, Memento-Skills improved test set accuracy by 13.7 percentage points over the static baseline, achieving 66.0% compared to 52.3%. On the HLE benchmark, where the domain structure allowed for massive cross-task skill reuse, the system more than doubled the baseline's performance, jumping from 17.9% to 38.7%. Moreover, the specialized skill router of Memento-Skills avoids the classic retrieval trap where an irrelevant skill is selected simply because of semantic similarity. Experiments show that Memento-Skills boosts end-to-end task success rates to 80%, compared to just 50% for standard BM25 retrieval. The researchers observed that Memento-Skills manages this performance through highly organic, structured skill growth. Both benchmark experiments started with just five atomic seed skills, such as basic web search and terminal operations. On the GAIA benchmark, the agent autonomously expanded this seed group into a compact library of 41 skills to handle the diverse tasks. On the expert-level HLE benchmark, the system dynamically scaled its library to 235 distinct skills. Finding the enterprise sweet spot The researchers have released the code for Memento-Skills on GitHub, and it is readily available for use. For enterprise architects, the effectiveness of this system depends on domain alignment. Instead of simply looking at benchmark scores, the core business tradeoff lies in whether your agents are handling isolated tasks or structured workflows. "Skill transfer depends on the degree of similarity between tasks," Wang said. "First, when tasks are isolated or weakly related, the agent cannot rely on prior experience and must learn through interaction." 
In such scattershot environments, cross-task transfer is limited. "Second, when tasks share substantial structure, previously acquired skills can be directly reused. Here, learning becomes more efficient because knowledge transfers across tasks, allowing the agent to perform well on new problems with little or no additional interaction." Given that the system requires recurring task patterns to consolidate knowledge, enterprise leaders need to know exactly where to deploy this today and where to hold off. "Workflows are likely the most appropriate setting for this approach, as they provide a structured environment in which skills can be composed, evaluated, and improved," Wang said. However, he cautioned against over-deployment in areas not yet suited for the framework. "Physical agents remain largely unexplored in this context and require further investigation. In addition, tasks with longer horizons may demand more advanced approaches, such as multi-agent LLM systems, to enable coordination, planning, and sustained execution over extended sequences of decisions." As the industry moves toward agents that autonomously rewrite their own production code, governance and security remain paramount. While Memento-Skills employs foundational safety rails like automatic unit-test gates, a broader framework will likely be needed for enterprise adoption. "To enable reliable self-improvement, we need a well-designed evaluation or judge system that can assess performance and provide consistent guidance," Wang said. "Rather than allowing unconstrained self-modification, the process should be structured as a guided form of self-development, where feedback steers the agent toward better designs."
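
The read-write loop and unit-test gate described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the researchers' code: the names `Skill`, `SkillLibrary`, `reflective_step` and `passes_gate` are hypothetical, the router is a simple running utility score rather than the paper's one-step offline reinforcement learning, and the `rewrite` callable stands in for the LLM-driven orchestrator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Skill:
    name: str
    run: Callable[[str], str]  # the executable part of the skill
    utility: float = 0.0       # running score built from execution feedback


class SkillLibrary:
    def __init__(self) -> None:
        self.skills: Dict[str, Skill] = {}

    def add(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def route(self) -> Skill:
        # Stand-in router (assumption): pick the skill with the best score so far.
        return max(self.skills.values(), key=lambda s: s.utility)


def passes_gate(candidate: Skill, test_case: Tuple[str, str]) -> bool:
    # Unit-test gate: run one synthetic case through the rewritten skill.
    test_input, expected = test_case
    try:
        return candidate.run(test_input) == expected
    except Exception:
        return False


def reflective_step(
    library: SkillLibrary,
    task: str,
    expected: str,
    rewrite: Callable[[Skill], Skill],
    test_case: Tuple[str, str],
) -> bool:
    skill = library.route()                         # read: select a skill
    try:
        succeeded = skill.run(task) == expected     # execute and get feedback
    except Exception:
        succeeded = False
    skill.utility += 1.0 if succeeded else -1.0     # feedback updates the score
    if not succeeded:
        candidate = rewrite(skill)                  # write: propose a patched skill
        if passes_gate(candidate, test_case):       # save only if the gate passes
            library.add(candidate)
    return succeeded


# Hypothetical seed skill with a bug: it lowercases instead of uppercasing.
library = SkillLibrary()
library.add(Skill("shout", run=lambda text: text.lower()))


def fix_shout(old: Skill) -> Skill:
    # Hypothetical rewrite; in the described system an orchestrator does this.
    return Skill(old.name, run=lambda text: text.upper())


first = reflective_step(library, "hi", "HI", fix_shout, ("ok", "OK"))
second = reflective_step(library, "hi", "HI", fix_shout, ("ok", "OK"))
print(first, second)
```

The first call fails, triggers a rewrite, and the patched skill is saved only because it passes the synthetic test; the second call then succeeds with the updated skill. A rewrite that failed the gate would leave the library unchanged.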

LLM-referred traffic converts at 30-40% — and most enterprises aren't optimizing for it

4/7/2026

For more than two decades, digital discovery has operated on a simple model: search, scan, click, decide. That worked when humans were the ones doing the web searching; but with the advent of AI agents, the primary consumer of information is no longer always human. This is giving rise to a new paradigm: Answer engine optimization (AEO), also referred to as generative engine optimization (GEO). Because agents look at data much differently than humans do, success is no longer defined by rankings and clicks, but whether content is understood, selected, and cited by AI systems. The SEO model that the web was built on simply isn’t going to cut it anymore, and enterprises need to prepare now. How LLMs interpret web content Traditional SEO is built around keywords, rankings, page-level optimization, and click-through rates. Users manually search across multiple sources and click around to get what they need. Simple, but sometimes frustrating and a definite time suck. But AEO operates on a whole different level. Agents are increasingly taking over users’ workflows: Claude Code, OpenClaw, CrewAI, Microsoft Copilot, AutoGen, LangChain, Agent Bricks, Agentforce, Google Vertex, Perplexity’s web interface, and whatever else comes along. These agents do not “browse” the web the way humans do. They analyze user intent based not just on phrasing, but persistent memory and context from past sessions (rather than simple autocomplete). They require materials that are concise, structured, and to the point. What’s more, agents are moving beyond browsing to delegation, handling more downstream work. What started as “search, read, decide,” evolves to “agent retrieves, agent summarizes, human decides” (and, beyond that, “agent acts → human validates”). “In practice, AEO begins where SEO stops,” said Dustin Engel, founder of consultancy company Elegant Disruption. 
“AEO is the next layer of discovery,” or “zero-click discovery.” In this new world where agents synthesize answers, users may never even see an enterprise’s website, click-through rates decline, and attribution and citability (rather than pure visibility, or showing up at the top of a list of blue links) become critical. “The new default is closer to a citation map: Where the model is pulling from, how often you show up, and how you are described,” Engel said. Some, like Adam Yang of Q&A platform Quora, argue that AEO is already becoming the default over SEO. This is for “a certain class of queries,” Yang notes. Any question where the user wants a synthesized answer — "what's the best approach to X," "compare these two options," "what do I need to know about Y" — is increasingly resolved by an AI without a click. Google's own AI Overviews are already accelerating this on the consumer side, many analysts note. “SEO isn't dead,” Yang said. “But the optimization target has shifted from ‘rank on page 1’ to ‘get cited in the answer.’” How devs are already using AI agents Are there scenarios where regular search/Googling is still the best option? “Absolutely,” said analyst Wyatt Mayham of Northwest AI Consulting. Notably, for personal tasks like finding nearby restaurants or local service providers. The interface is “just better” in those cases because it integrates maps, reviews, and photos. “That experience is hard to beat right now,” he said. For work-related research, though, he says he’s “barely” using traditional search anymore, and it’s getting “closer to zero” every month. “When I need to understand a company or a person professionally, agents do it faster and give me a more useful output than a page of blue links ever did,” he said. His firm uses autonomous agents “heavily,” and built a Claude Skills function that powers its sales operation. 
Before a discovery call with a prospect, team members can trigger a skill that pulls the contact’s LinkedIn profile, scrapes their company website, grabs relevant info from sources like ZoomInfo, and crafts a clear picture of their revenue, team size, tech stack, and pain points. “By the time I get on a call, I have a tailored research brief ready to go without spending 30 to 45 minutes manually Googling around,” Mayham said. The big advantage is that these tools run in the background, he noted. You don’t have to sit clicking through browser tabs: You just tell the agent what you need, it does it, and you get a structured output that’s actually useful. “It's collapsed what used to be a full hour of sales prep into a few minutes,” Mayham said. Carlos Dutra, data science manager at fintech company Trustly, said Claude Code has “genuinely changed” his daily workflow. He uses it for most of his coding work, and what surprised him wasn't the speed, but the fact that he didn’t need to open and keep track of browser tabs. “Not because I'm lazy, but because the answers are better,” he said. He still uses Google for some tasks: Pricing pages, recent news, anything that needs to be current. “But for technical reasoning? Agents have mostly replaced search for me personally,” he said. Quora’s Yang has had a similar experience. He’s been using Claude Code daily for the past few months, primarily for content strategy, knowledge management, and competitive research. Workflows that used to take him half a day now take 30 minutes. But what’s been most advantageous is that he can now run research and synthesis tasks in parallel that he previously had to do sequentially. Also helpful is that agents’ context retention across sessions is “meaningfully better” than web-based tools. When he needs to understand a concept, map a competitive landscape, or synthesize industry trends, Claude or Perplexity are the go-to before opening a browser tab. 
“I've started treating agent search as my first stop, not Google. Traditional search is now where I verify, not where I discover.” The kinks are real, though. Mayham pointed out that LinkedIn, in particular, is “aggressive” about blocking automated access, and many other sites have (or are implementing) similar protections. Users will hit walls when agents can't get through, so a fallback plan is important for those relying on agents. “The reliability isn't 100% yet, and that's probably the biggest thing holding broader adoption back,” he said. Mayham’s advice for other devs: Stop chasing shiny objects. A new AI tool launches “practically every day,” and many (experienced devs included) are jumping from platform to platform without ever going deep with any of them. “Pick a model, go deep, build real workflows on it,” he emphasized. “You'll get more value from mastery of one platform than surface-level experimentation across five.”

How enterprises can compete in an AEO-driven world

When AI agents do the searching, the rules change. The question is no longer whether your content ranks on the first page; it's whether the model selects you as the source when generating an answer. Structure matters much more than it used to. Content should be organized around conversational intent, provide direct answers, and mirror real user questions and follow-ups; be authoritative and reflect strong expertise; be fresh (and, when necessary, regularly refreshed); and have clear headers and an established FAQ schema. Another must is maintaining a strong brand presence across the forums and platforms (Wikipedia, Reddit, LinkedIn, industry publications) that models are trained on. Enterprises might also consider investing in original data, like research. In Mayham’s experience, when a business gets recommended by an LLM during a search-style query, the conversion rate is “dramatically higher” than through traditional channels.
For his company, LLM-referred traffic is converting at 30 to 40%, which “blows away what we see from SEO or paid social.” “The intent signal is just different when someone is having a conversation with an AI and it recommends you by name.” Discoverability inside LLMs will matter as much as Google rankings, “maybe more,” Mayham said. “It's a whole new surface for customer acquisition that most businesses aren't even thinking about yet.” Trustly’s Dutra agreed that the “uncomfortable truth” is that most enterprise content is becoming “basically invisible” in agent-driven queries. “AEO is about whether your content survives being chunked, embedded, and semantically retrieved,” he said. The companies getting ahead aren’t doing anything “exotic,” he noted. They have clean, declarative content that doesn’t require context to understand. Those still writing copy stuffed with keywords are going to fall behind because LLMs care about semantic clarity. A quick test he gives clients: Ask an LLM a question your page is supposed to answer, without giving it the URL. “If it can't construct the answer from your content, you have a problem.”

Jeff Oxford of SEO agency Visibility Labs offers step-by-step advice:

Engage in conversations on Reddit, which is one of the most-cited domains in AI search. Enterprises should establish a positive reputation on Reddit, and engage on any relevant threads where customers are asking for recommendations.

Build a strong YouTube presence. According to Ahrefs, which tracks internet behavior, YouTube mentions have the “strongest correlation” with AI visibility across ChatGPT, AI Mode, and AI Overviews. “This makes sense, since both Google and OpenAI have trained their models on YouTube transcripts,” Oxford said, “and YouTube is the most-cited domain in Google's AI products.”

Invest in digital PR and brand mentions; the latter is the second-highest correlated factor with AI visibility. “Brands need to improve their digital presence by being in as many places as possible,” Oxford said.

Create content aligned with AI citation patterns. Enterprises should audit the prompts and topics where AI search engines are surfacing competitors, then create authoritative content on those same topics. “The goal is to become a source that AI models consider worth citing,” he noted.

Still, there may be a lot of unnecessary hype around how drastically enterprises need to change, said Shashi Bellamkonda, principal research director at consultancy firm Info-Tech Research Group. Those following best practices of producing content that their audience actually needs, written by experts and showcasing expert opinion, are in a good position to be cited in AI-powered search. He pointed out that Google developed its E-E-A-T framework (experience, expertise, authoritativeness, and trustworthiness) to evaluate content quality and helpfulness and help algorithms identify reliable, high-quality information. To stand out, enterprises should use structured data and schema to signal the context: Is this an article, a research study, a product overview? “Original long-form content will be valued by AI-powered answer engines,” Bellamkonda said. “Copycat strategies or trying to game the system are taboo in this era.” Experts should also share their thoughts across several channels, and "About Us" pages must be “robust” and include bios highlighting thought leaders’ expertise. “Ultimately, the reputation of AI-powered search is in making sure the user likes the search rather than what you think they should read,” Bellamkonda said. “So a good focus on the end user is a great way to succeed.”
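
For teams wondering what the FAQ schema and structured data mentioned above might look like in practice, here is a minimal Python sketch that emits schema.org `FAQPage` markup as JSON-LD. The helper name `faq_jsonld` and the sample question and answer are placeholders chosen for illustration, and nothing here implies that any particular AI system reads this markup; it only shows the shape of the vocabulary.

```python
import json
from typing import List, Tuple


def faq_jsonld(pairs: List[Tuple[str, str]]) -> str:
    """Build schema.org FAQPage markup (JSON-LD) from question/answer pairs."""
    doc = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(doc, indent=2)


# Placeholder content (assumption): one direct question with a direct answer.
markup = faq_jsonld([
    (
        "What is answer engine optimization?",
        "Optimizing content so AI systems understand, select, and cite it.",
    )
])
print(markup)
```

The resulting string would typically be embedded in a page inside a `script` element of type `application/ld+json`. The same pattern applies to other schema.org types, such as `Article`, when signaling what kind of document a page is.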

Block introduces Managerbot, a proactive Square AI agent and the clearest proof point yet for Jack Dorsey’s AI bet

4/7/2026

Block today unveiled Managerbot, a new AI agent embedded in the Square platform that proactively monitors a seller's business, identifies emerging problems, and proposes actionable solutions — without the seller ever having to ask a question. The product marks the most tangible manifestation of CEO Jack Dorsey's controversial bet that artificial intelligence can fundamentally reshape how his company operates, builds products, and serves the millions of small businesses that depend on Square to run day-to-day commerce. In an exclusive interview with VentureBeat, Willem Avé, Block's head of product at Square, described Managerbot as a decisive break from the company's earlier Square AI assistant, which functioned as a reactive chatbot that answered seller questions about sales, employees, and business performance. "The big shift from Square AI to Managerbot is really from reactive to proactive," Avé said. "What that means is the primary interface is not a question box. You assign tasks to Managerbot, and that could be based on data, an insight, or a signal from your business." The product is beginning to roll out now, with full availability to Square sellers expected over the coming months. Block declined to say whether Managerbot would carry an additional fee or be bundled into existing Square subscriptions. How Managerbot predicts inventory shortages, optimizes schedules, and writes marketing campaigns on its own Avé outlined three core domains where Managerbot operates today: inventory forecasting, employee shift scheduling, and automated marketing campaign creation. In every case, the agent acts before the seller does — watching over the business, detecting patterns, and surfacing recommendations with proposed actions attached. 
In the inventory domain, Managerbot continuously monitors a seller's stock levels, sales velocity, and external signals such as weather patterns and local events, then alerts the seller when an item is about to run out — or when it should stock up ahead of anticipated demand. "In warmer weather, we can see that you sell more of a certain good," Avé explained. "That's the forecasting capability, combined with local data — weather, events — so we can help sellers manage both their inventory and cash flows." For shift scheduling — a task that Avé described as "one of those interesting, very hard computer science problems" that consumes hours of a small business owner's week — Managerbot analyzes forecasted sales data and then generates optimized employee schedules that balance worker preferences with coverage needs. "It turns out that frontier models are actually pretty good at it," Avé said. The third capability tackles what Avé called "the whole bucket of things that sellers could do if they had more time" — principally marketing. Managerbot identifies sales trends across a seller's catalog and automatically drafts win-back campaigns and promotional outreach targeted at a store's best customer segments. Avé said Block is seeing "very meaningful lift" from Managerbot-generated campaigns compared to what some sellers create manually, though he declined to share specific performance figures publicly. Block built Managerbot on frontier AI models from OpenAI and Anthropic — but says the real innovation is underneath Managerbot runs on third-party frontier models — Avé specifically referenced Anthropic's Sonnet and OpenAI's GPT family — but Block's competitive advantage, he argued, lies in the "agent harness" the company has built around those models. That harness draws heavily on Goose, Block's open-source agent framework, and incorporates learnings from its consumer-facing Money Bot on Cash App. The challenge specific to Square is scale and complexity. 
A seller running a small business might interact with hundreds of different tools across invoicing, inventory, customer management, marketing, payroll, and scheduling. Managerbot must navigate all of them coherently within a single agentic loop. "This isn't like, you know, you load a skill and call it a day — think about hundreds of skills," Avé said. "Actually, managing the context and managing the way that we progressively disclose tools, and some of the other innovation that we have at the harness layer, is I think some of the secret sauce." A critical design decision shapes every interaction: Managerbot does not autonomously execute changes to a seller's business. Every write action — whether adjusting a shift schedule, publishing a marketing campaign, or modifying inventory — requires explicit seller approval. To facilitate that approval, Managerbot generates visual UI previews showing exactly what will change before the seller clicks "yes." "We want to earn trust with sellers, so any write action is prompted to the user to approve," Avé said. "The seller needs a visual representation of what the change is. You can't just describe in words all the time what you're going to go do." An $80 million fine and chatbot blunders hang over Block's push to automate financial recommendations That human-in-the-loop caution reflects a sensitivity that gains additional weight given Block's recent history. In January 2025, 48 state financial regulators imposed an $80 million fine on Block for violations of Bank Secrecy Act and anti-money laundering laws related to Cash App. The Connecticut Department of Banking stated in announcing the settlement that regulators "found Block was not in compliance with certain requirements, creating the potential that its services could be used to support money laundering, terrorism financing, or other illegal activities." 
The Illinois Department of Financial and Professional Regulation simultaneously joined the coordinated enforcement action. Separately, reporting from The Guardian has documented instances of Block's customer-facing chatbots making serious errors, including telling customers to cancel or close their accounts. When VentureBeat raised this concern during the interview, Avé acknowledged the stakes but redirected to Managerbot's specific safeguards. "Financial accuracy and financial data — the value of these products really come from recommendations," Avé said. "We need to be better than whatever you can feed to ChatGPT. If you take a CSV of your sales and put it in ChatGPT or Claude, we need our product to be better and answer that question either more accurately or better than what's available in the market." He pointed to the harness layer's role in reducing hallucinations through tuning, prompt engineering, and optimized tool-call loops, while acknowledging the inherent limitations of probabilistic systems: "It's never going to be zero. Obviously, these are probabilistic systems, and we have guidance and call-outs in the tool to provide that." On regulated domains like lending and payments, Avé was more definitive: "In any sort of regulated domains — banking, lending, payments — there are strict guardrails on what we can and can't say to sellers. Those are just part of the product and business." Dorsey cut 4,000 jobs in the name of AI — Managerbot is the first answer to what those tools are actually building It is impossible to evaluate Managerbot outside the context of the radical organizational surgery Block performed just weeks ago. In late February, Dorsey announced that Block would cut more than 4,000 of its roughly 10,000 employees — nearly half the workforce — explicitly citing AI as the driving rationale. As the BBC reported, Dorsey wrote that "AI fundamentally changes what it means to build and run a company." 
Block's stock surged more than 20 percent on the news, according to ABC7. The company's Q4 2025 earnings report, released alongside the layoff announcement, showed gross profit of $2.87 billion — up 24 percent year over year — and raised 2026 guidance to $12.2 billion in gross profit, according to AlphaSense's earnings analysis. Block also reported a greater than 40 percent increase in production code shipped per engineer since September 2025 through the use of agentic coding tools. As CNBC commentator Steve Sedgwick wrote in an opinion piece following the announcement, "I keep getting told on CNBC that AI will create new jobs to replace those being lost. I've been asking the same question for years now." The Observer's Mark Minevich was more pointed, calling Block's layoffs "probably the first legitimate mass layoff driven by A.I. as the actual operating thesis." Managerbot, then, is the product answer to the obvious follow-up question: if Block shed 4,000 workers in the name of intelligence tools, what exactly are those intelligence tools building? Avé framed the product as proof of concept for Block's entire strategic thesis. "Block has been in the press recently about rebuilding as an intelligence company, and it's like, a lot of people are asking, 'What does that mean for us?'" Avé said. "What I like to do is show, not tell. We're building Managerbot, which I think is one of the more advanced, maybe the most advanced, small business agent out there today." Sellers who use Managerbot are consolidating their businesses onto Square — and that may be the real strategic payoff Perhaps the most consequential signal Avé shared was an early behavioral pattern: sellers who begin using Managerbot are voluntarily migrating more of their business operations onto the Square platform, consolidating payroll, time cards, and shift scheduling into Block's ecosystem to feed the agent more data. 
"When they start interacting with Managerbot, they want to move more of their business onto Square because they see the value," Avé said. "They're like, 'I should put my payroll here. I should get time cards here. I should get my shift schedules here,' because once all that data is in one place, they can make better decisions and manage their business better." This dynamic could prove to be Managerbot's most significant long-term effect — not as a standalone feature, but as a gravitational force pulling sellers deeper into Block's integrated commerce stack. Block's Q4 earnings already showed Square's new volume added grew 29 percent year over year, with sales-led NVA surging 62 percent. Avé also argued that Square's first-party architecture — built organically rather than through acquisitions — gives it a structural advantage over competitors in the AI era. "We've kind of harmonized and canonicalized this data at a sensible layer," he said. "It's not super hard to create more skills for these data domains." When VentureBeat pressed Avé on the tension between helping sellers and upselling them on Block's own financial products — lending, payments processing, and other services that generate revenue for the company — he acknowledged the concern but framed Managerbot's mission in terms of decision-making quality. "The goal for Managerbot is to help sellers increase their decision-making correctness," Avé said. "If we can make sellers better at running their business by making better decisions and giving time back, I think that's a good thing." Block says Managerbot isn't a chatbot — it's a business protector that compounds the company's entire AI strategy Avé was insistent that Managerbot represents something categorically different from the chatbot-as-advisor model that has proliferated across enterprise software. "A lot of people are building chatbots as advisors — it can answer a question for you," he said. 
"What we really want Managerbot to be is a protector of your business. This is identifying trends. This is spotting things that you might have missed. This is helping you run your business and take actions." He also argued that the agent model compounds Block's development velocity in ways that traditional software cannot match. "It's much more straightforward to add a capability to Managerbot than it is to build a big Web 2.0 UI," Avé said. "If we can deliver more capabilities, more features, more value to our sellers, the whole system compounds." Whether that compounding materializes — and whether sellers ultimately experience Managerbot as a trusted protector or a sophisticated upsell engine — will determine much about Block's future. The company has staked its corporate identity, its headcount, and its Wall Street narrative on the conviction that AI agents can deliver more value with fewer humans in the loop. Managerbot is the first product to carry the full weight of that promise. And the small business owners who keep their shops open with Square terminals, who juggle shift schedules on napkins and skip marketing because there aren't enough hours in the day — they didn't ask to be the test case for Silicon Valley's boldest AI thesis. But as of today, they are.

Amazon S3 Files gives AI agents a native file system workspace, ending the object-file split that breaks multi-agent pipelines

4/7/2026

AI agents run on file systems using standard tools to navigate directories and read file paths. The challenge, however, is that there is a lot of enterprise data in object storage systems, notably Amazon S3. Object stores serve data through API calls, not file paths. Bridging that gap has required a separate file system layer alongside S3, duplicated data and sync pipelines to keep both aligned. The rise of agentic AI makes that challenge even harder, and it was affecting Amazon's own ability to get things done. Engineering teams at AWS using tools like Kiro and Claude Code kept running into the same problem: Agents defaulted to local file tools, but the data was in S3. Downloading it locally worked until the agent's context window compacted and the session state was lost. Amazon's answer is S3 Files, which mounts any S3 bucket directly into an agent's local environment with a single command. The data stays in S3, with no migration required. Under the hood, AWS connects its Elastic File System (EFS) technology to S3 to deliver full file system semantics, not a workaround. S3 Files is available now in most AWS Regions. "By making data in S3 immediately available, as if it's part of the local file system, we found that we had a really big acceleration with the ability of things like Kiro and Claude Code to be able to work with that data," Andy Warfield, VP and distinguished engineer at AWS, told VentureBeat. The difference between file and object storage and why it matters S3 was built for durability, scale and API-based access at the object level. Those properties made it the default storage layer for enterprise data. But they also created a fundamental incompatibility with the file-based tools that developers and agents depend on. "S3 is not a file system, and it doesn't have file semantics on a whole bunch of fronts," Warfield said. "You can't do a move, an atomic move of an object, and there aren't actually directories in S3." 
Previous attempts to bridge that gap relied on FUSE (Filesystem in Userspace), a software layer that lets developers mount a custom file system in user space without changing the underlying storage. Tools like AWS's own Mountpoint, Google's gcsfuse and Microsoft's blobfuse2 all used FUSE-based drivers to make their respective object stores look like a file system. The problem, Warfield noted, is that those object stores still weren't file systems. Those drivers either faked file behavior by stuffing extra metadata into buckets, which broke the object API view, or they refused file operations that the object store couldn't support. S3 Files takes a different architectural approach entirely. AWS is connecting its EFS (Elastic File System) technology directly to S3, presenting a full native file system layer while keeping S3 as the system of record. Both the file system API and the S3 object API remain accessible simultaneously against the same data.

How S3 Files accelerates agentic AI

Before S3 Files, an agent working with object data had to be explicitly instructed to download files before using tools. That created a session state problem. As agents compacted their context windows, the record of what had been downloaded locally was often lost. "I would find myself having to remind the agent that the data was available locally," Warfield said. Warfield walked through the before-and-after for a common agent task involving log analysis. When a developer uses Kiro or Claude Code to work with log data in the object-only case, he explained, they need to tell the agent where the log files are located and instruct it to download them. If the logs are instead mounted on the local file system, the developer can simply point to the path where the logs live, and the agent has immediate access to them. For multi-agent pipelines, multiple agents can access the same mounted bucket simultaneously.
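Once log data appears under a local path, the log-analysis task Warfield describes reduces to ordinary file operations. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, nothing here is an AWS API, and the demo uses a temporary directory as a stand-in for wherever a bucket might be mounted.

```python
import tempfile
from pathlib import Path


def count_errors(log_root: Path, needle: str = "ERROR") -> dict:
    """Count lines containing `needle` in every *.log file under log_root."""
    counts = {}
    for path in sorted(log_root.rglob("*.log")):
        with path.open("r", encoding="utf-8", errors="replace") as fh:
            hits = sum(1 for line in fh if needle in line)
        # Key by path relative to the root, with forward slashes on any OS.
        counts[path.relative_to(log_root).as_posix()] = hits
    return counts


def demo() -> dict:
    # Assumption: a temporary directory stands in for the mounted bucket.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "app").mkdir()
        (root / "app" / "web.log").write_text(
            "ok\nERROR timeout\nERROR 500\n", encoding="utf-8"
        )
        return count_errors(root)


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that no download step or object-API call appears anywhere in it: the agent only needs to be told a path.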
AWS says thousands of compute resources can connect to a single S3 file system at the same time, with aggregate read throughput reaching multiple terabytes per second — figures VentureBeat was not able to independently verify. Shared state across agents works through standard file system conventions: subdirectories, notes files and shared project directories that any agent in the pipeline can read and write. Warfield described AWS engineering teams using this pattern internally, with agents logging investigation notes and task summaries into shared project directories. For teams building RAG pipelines on top of shared agent content, S3 Vectors — launched at AWS re:Invent in December 2024 — layers on top for similarity search and retrieval-augmented generation against that same data. What analysts say: this is not just a better FUSE AWS is positioning S3 Files against FUSE-based file access from Azure Blob NFS and Google Cloud Storage FUSE. For AI workloads, the meaningful distinction is not primarily performance. "S3 Files eliminates the data shuffle between object and file storage, turning S3 into a shared, low-latency working space without copying data," Jeff Vogel, analyst at Gartner, told VentureBeat. "The file system becomes a view, not another dataset." With FUSE-based approaches, each agent maintains its own local view of the data. When multiple agents work simultaneously, those views can potentially fall out of sync. "It eliminates an entire class of failure modes including unexplained training/inference failures caused by stale metadata, which are notoriously difficult to debug," Vogel said. "FUSE-based solutions externalize complexity and issues to the user." The agent-level implications go further still. The architectural argument matters less than what it unlocks in practice. "For agentic AI, which thinks in terms of files, paths, and local scripts, this is the missing link," Dave McCarthy, analyst at IDC, told VentureBeat. 
"It allows an AI agent to treat an exabyte-scale bucket as its own local hard drive, enabling a level of autonomous operational speed that was previously bottled up by API overhead associated with approaches like FUSE." Beyond the agent workflow, McCarthy sees S3 Files as a broader inflection point for how enterprises use their data. "The launch of S3 Files isn't just S3 with a new interface; it's the removal of the final friction point between massive data lakes and autonomous AI," he said. "By converging file and object access with S3, they are opening the door to more use cases with less reworking." What this means for enterprises For enterprise teams that have been maintaining a separate file system alongside S3 to support file-based applications or agent workloads, that architecture is now unnecessary. For enterprise teams consolidating AI infrastructure on S3, the practical shift is concrete: S3 stops being the destination for agent output and becomes the environment where agent work happens. "All of these API changes that you're seeing out of the storage teams come from firsthand work and customer experience using agents to work with data," Warfield said. "We're really singularly focused on removing any friction and making those interactions go as well as they can."

Anthropic says its most powerful AI cyber model is too dangerous to release publicly — so it built Project Glasswing

4/7/2026

Anthropic on Tuesday announced Project Glasswing, a sweeping cybersecurity initiative that pairs an unreleased frontier AI model — Claude Mythos Preview — with a coalition of twelve major technology and finance companies in an effort to find and patch software vulnerabilities across the world's most critical infrastructure before adversaries can exploit them. The launch partners include Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, Nvidia, and Palo Alto Networks. Anthropic says it has also extended access to more than 40 additional organizations that build or maintain critical software, and is committing up to $100 million in usage credits for Claude Mythos Preview across the effort, along with $4 million in direct donations to open-source security organizations. The announcement arrives at a moment of extraordinary momentum — and extraordinary scrutiny — for the San Francisco-based AI startup. Anthropic disclosed on Sunday that its annualized revenue run rate has surpassed $30 billion, up from approximately $9 billion at the end of 2025, and the number of business customers each spending over $1 million annually now exceeds 1,000, doubling in less than two months. The company simultaneously announced a multi-gigawatt compute deal with Google and Broadcom. On the same day, Bloomberg reported that Anthropic had poached a senior Microsoft executive, Eric Boyd, to lead its infrastructure expansion. But Glasswing is something categorically different from a revenue milestone or a compute deal. It’s Anthropic's most ambitious attempt to translate frontier AI capabilities — capabilities the company itself describes as dangerous — into a defensive advantage before those same capabilities proliferate to hostile actors. 
Why Anthropic built a model it considers too dangerous to release publicly At the center of Project Glasswing sits Claude Mythos Preview, a general-purpose frontier model that Anthropic says has already identified thousands of high-severity zero-day vulnerabilities — meaning flaws previously unknown to software developers — in every major operating system and every major web browser, along with a range of other critical software. The company is not making the model generally available. "We do not plan to make Claude Mythos Preview generally available due to its cybersecurity capabilities," Newton Cheng, Frontier Red Team Cyber Lead at Anthropic, told VentureBeat in an exclusive interview. "However, given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely. The fallout — for economies, public safety, and national security — could be severe." That language — "the fallout could be severe" — is striking coming from the company that built the model. Anthropic is effectively arguing that the tool it created is powerful enough to reshape the cybersecurity landscape, and that the only responsible thing to do is to keep it restricted while giving defenders a head start. The technical results reinforce that claim. According to Anthropic's press release, Mythos Preview was able to find nearly all of the vulnerabilities it surfaced, and develop many related exploits, entirely autonomously, without any human steering. Three examples stand out: The model found a 27-year-old vulnerability in OpenBSD — widely regarded as one of the most security-hardened operating systems in the world and commonly used to run firewalls and critical infrastructure. The flaw allowed an attacker to remotely crash any machine running the OS simply by connecting to it. 
It also discovered a 16-year-old vulnerability in FFmpeg — the near-ubiquitous video encoding and decoding library — in a line of code that automated testing tools had exercised five million times without ever catching the problem. And perhaps most alarmingly, Mythos Preview autonomously found and chained together several vulnerabilities in the Linux kernel to escalate from ordinary user access to complete control of the machine. All three vulnerabilities have been reported to the relevant maintainers and have since been patched. For many other vulnerabilities still in the remediation pipeline, Anthropic says it is publishing cryptographic hashes of the details today, with plans to reveal specifics after fixes are in place. On the CyberGym evaluation benchmark, Mythos Preview scored 83.1%, compared to 66.6% for Claude Opus 4.6, Anthropic's next-best model. The gap is even wider on coding benchmarks: Mythos Preview achieves 93.9% on SWE-bench Verified versus 80.8% for Opus 4.6, and 77.8% on SWE-bench Pro versus 53.4%. How Anthropic plans to disclose thousands of zero-days without overwhelming open-source maintainers Finding thousands of zero-days at once sounds impressive. Actually handling the output responsibly is a logistical nightmare — and one of the sharpest criticisms that security researchers have raised about AI-driven vulnerability discovery. Flooding open-source maintainers, many of whom are unpaid volunteers, with an avalanche of critical bug reports could easily do more harm than good. Cheng told VentureBeat that Anthropic has built a triage pipeline specifically to manage this problem. "We triage every bug that we find and then send the highest severity bugs to professional human triagers we have contracted to assist in our disclosure process by manually validating every bug report before we send it out to ensure that we send only high-quality reports to maintainers," he said. 
That pipeline is designed to prevent exactly the scenario that maintainers fear most: an automated firehose of unverified reports. "We do not submit large volumes of findings to a single project without first reaching out in an effort to agree on a pace the maintainer can sustain," Cheng added. When Anthropic has access to the source code, the company aims to include a candidate patch with every report, labeled by provenance — meaning the maintainer knows the patch was written or reviewed by a model — and offers to collaborate on a production-quality fix. "Models can write patches," Cheng noted, "but there are many factors that impact patch quality, and we strongly recommend that autonomously-written patches are put under the same scrutiny and testing that human-written patches are." On disclosure timelines, Anthropic says it follows a coordinated vulnerability disclosure framework. Once a patch is available, the company will generally wait 45 days before publishing full technical details, giving downstream users time to deploy the fix before exploitation information becomes public. Cheng said the company may shorten that buffer "if the details are already publicly known through other channels, or if earlier publication would materially help defenders identify and mitigate ongoing attacks," or extend it "when patch deployment is unusually complex or the affected footprint is unusually broad." Those are reasonable principles, but they will be tested at a scale that no vulnerability disclosure program has ever attempted. The sheer volume of findings — thousands of zero-days across every major platform — means that even a well-designed triage process will face bottlenecks. And the 45-day disclosure window assumes that maintainers can actually produce, test, and ship a patch in that time, which is far from guaranteed for complex kernel-level bugs or deeply embedded cryptographic flaws. 
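Two mechanics from the disclosure process described above can be illustrated in a few lines: publishing a cryptographic hash of a finding before the details are released, and computing the default 45-day buffer after a patch becomes available. This is a generic sketch, not Anthropic's actual tooling. The article does not say which hash function or report format is used, so SHA-256 and the function names below are assumptions.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Default wait after a patch is available, as stated in the article.
DISCLOSURE_BUFFER_DAYS = 45


def commit(details: bytes) -> str:
    """Return a hex digest that can be published before the details are."""
    # Assumption: SHA-256; the article does not name the hash function.
    return hashlib.sha256(details).hexdigest()


def verify(details: bytes, published_digest: str) -> bool:
    """Check that later-revealed details match the earlier published digest."""
    return hmac.compare_digest(commit(details), published_digest)


def earliest_full_disclosure(patch_available: date,
                             buffer_days: int = DISCLOSURE_BUFFER_DAYS) -> date:
    """Earliest date for full technical details under a fixed buffer."""
    if buffer_days < 0:
        raise ValueError("buffer_days must be non-negative")
    return patch_available + timedelta(days=buffer_days)


if __name__ == "__main__":
    digest = commit(b"hypothetical vulnerability report")
    print(digest, verify(b"hypothetical vulnerability report", digest))
    print(earliest_full_disclosure(date(2026, 4, 7)))
```

The `buffer_days` parameter reflects Cheng's caveat that the window can be shortened or extended case by case. A production commitment scheme would normally also mix in a random nonce so that short or guessable reports cannot be brute-forced from the digest.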
The source code leak, the CMS blunder, and why trust is Anthropic's biggest vulnerability The irony of a company claiming to build the most capable cyber model ever constructed while simultaneously suffering a string of embarrassing security lapses has not been lost on observers. In late March, a draft blog post about Mythos was left in an unsecured and publicly searchable data store — a CMS misconfiguration that exposed roughly 3,000 internal assets, including what appeared to be strategic plans for the model's rollout. Days later, on March 31, anyone who ran npm install on Claude Code pulled down Anthropic's complete original source code — 512,000 lines — for approximately three hours due to a packaging error, an incident that drew widespread attention in the developer community and was first reported by VentureBeat. When asked why partners and governments should trust Anthropic as the custodian of a model it describes as having unprecedented cyber capabilities, Cheng was direct. "Security is central to how we build and ship," he told VentureBeat. "These two incidents, a blog CMS misconfiguration and an npm packaging error, were human errors in publishing tooling, not breaches of our security architecture. We've made changes to prevent these from happening again, and we'll continue to improve our processes." It is a technically accurate distinction — neither incident involved a breach of Anthropic's core model weights, training infrastructure, or API systems — but it is also a distinction that may prove difficult to sustain as a public argument. For an organization asking governments and Fortune 500 companies to trust it with a tool that can autonomously find and exploit vulnerabilities in the Linux kernel, even minor operational lapses carry outsized reputational risk. The fact that the Mythos leak itself was what first alerted the security community to the model's existence, weeks before the planned announcement, underscores the point. 
What Microsoft, CrowdStrike, and the Linux Foundation found when they tested the model The coalition's breadth is notable. It includes direct competitors — Google and Microsoft — alongside cybersecurity incumbents, financial institutions, and the steward of the world's largest open-source ecosystem. And several partners have already been running Mythos Preview against their own infrastructure for weeks. CrowdStrike's CTO Elia Zaitsev framed the initiative in terms of collapsing timelines: "The window between a vulnerability being discovered and being exploited by an adversary has collapsed — what once took months now happens in minutes with AI." AWS Vice President and CISO Amy Herzog said her teams have already been testing Mythos Preview against critical codebases, where the model is "already helping us strengthen our code." And Microsoft's Global CISO Igor Tsyganskiy noted that when tested against CTI-REALM, Microsoft's open-source security benchmark, "Claude Mythos Preview showed substantial improvements compared to previous models." Perhaps the most revealing comment came from Jim Zemlin, CEO of the Linux Foundation, who pointed to the fundamental asymmetry that has plagued open-source security for decades: "In the past, security expertise has been a luxury reserved for organizations with large security teams. Open-source maintainers — whose software underpins much of the world's critical infrastructure — have historically been left to figure out security on their own." Project Glasswing, he said, "offers a credible path to changing that equation." To back that claim with dollars, Anthropic says it has donated $2.5 million to Alpha-Omega and OpenSSF through the Linux Foundation, and $1.5 million to the Apache Software Foundation. Maintainers interested in access can apply through Anthropic's Claude for Open Source program. 
Inside the pricing, the compute deal, and Anthropic's path toward a potential IPO After the research preview period — during which Anthropic's $100 million credit commitment will cover most usage — Claude Mythos Preview will be available to participants at $25 per million input tokens and $125 per million output tokens. Participants can access the model through the Claude API, Amazon Bedrock, Google Cloud's Vertex AI, and Microsoft Foundry. Those prices reflect the model's computational intensity. The draft blog post that leaked in March described Mythos as a large, compute-intensive model that would be expensive for both Anthropic and its customers to serve. Anthropic's solution is to develop and launch new safeguards with an upcoming Claude Opus model, allowing the company to "improve and refine them with a model that does not pose the same level of risk as Mythos Preview," as Cheng told VentureBeat. Security professionals whose legitimate work is affected by those safeguards will be able to apply to an upcoming Cyber Verification Program. The financial context matters. The same day Project Glasswing launched, Anthropic disclosed its revenue milestone and the Google-Broadcom compute deal. Broadcom signed an expanded deal with Anthropic that will give the AI startup access to about 3.5 gigawatts worth of computing capacity drawing on Google's AI processors, according to CNBC. The scale of compute being marshaled is staggering — and it helps explain why Anthropic needs both the revenue from enterprise cybersecurity partnerships and the infrastructure to serve a model of Mythos Preview's size. The timing also intersects with growing speculation about Anthropic's path to a public offering. The company is reportedly evaluating an IPO as early as October 2026. 
A high-profile, government-adjacent cybersecurity initiative with blue-chip partners is exactly the kind of program that burnishes an IPO narrative — particularly when the company can simultaneously point to $30 billion in annualized revenue and a compute footprint measured in gigawatts. Anthropic says defenders have months, not years, before adversaries catch up The most consequential question raised by Project Glasswing is not whether Mythos Preview's capabilities are real — the partner endorsements and patched vulnerabilities suggest they are — but how much time defenders actually have before similar capabilities are available to adversaries. Cheng was candid about the timeline. "Frontier AI capabilities are likely to advance substantially over just the next few months," he told VentureBeat. "Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely." He described Project Glasswing as "an important step toward giving defenders a durable advantage in the coming AI-driven era of cybersecurity" but added a crucial caveat: "It's important to note, this is a starting point. No one organization can solve these cybersecurity problems alone." That framing — months, not years — is worth taking seriously. DARPA launched its original Cyber Grand Challenge in 2016, a competition to create automatic defensive systems capable of reasoning about flaws, formulating patches, and deploying them on a network in real time. At the time, the winning AI-powered bot, Mayhem, finished last when placed against human teams at DEF CON. A decade later, Anthropic is claiming that a frontier AI model can find vulnerabilities that survived 27 years of expert human review and millions of automated security tests — and can chain exploits together autonomously to achieve full system compromise. 
The delta between those two data points illustrates why the industry is treating this as a genuine inflection point, not a marketing exercise. Anthropic itself has firsthand experience with the offensive side of this equation: the company disclosed in November 2025 that a Chinese state-sponsored group achieved 80 to 90 percent autonomous tactical execution using Claude across approximately 30 targets, according to Anthropic's misuse report. Project Glasswing arrives during one of the most turbulent weeks in Anthropic's history. In the span of days, the company has announced a model it considers too dangerous for public release, disclosed that its revenue has tripled, sealed a multi-gigawatt compute deal, hired a senior Microsoft executive, made it more expensive for Claude Code subscribers to use third-party tools like OpenClaw, and weathered a major outage of its Claude chatbot on Tuesday morning. Anthropic says it will report publicly on what it has learned within 90 days. In the medium term, the company has proposed that an independent, third-party body might be the ideal home for continued work on large-scale cybersecurity projects. Whether any of that is fast enough depends on a race that is already underway. Anthropic built a model that can autonomously crack open the most hardened operating systems on the planet — and is now betting that sharing it with defenders, under careful restrictions, will do more good than the inevitable moment when similar capabilities land in less careful hands. It is, in essence, a wager that transparency can outrun proliferation. The next few months will determine whether that bet pays off, or whether the glasswing's wings were never quite opaque enough to hide what was coming.

AI joins the 8-hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT-5.4 on SWE-Bench Pro

4/7/2026

Is China picking the open source AI baton back up? Z.ai, also known as Zhupai AI, a Chinese AI startup best known for its powerful, open source GLM family of models, has unveiled GLM-5.1 today under a permissive MIT License, allowing enterprises to download, customize and use it for commercial purposes. They can do so on Hugging Face. This follows its release of GLM-5 Turbo, a faster version, under a proprietary-only license last month. The new GLM-5.1 is designed to work autonomously for up to eight hours on a single task, marking a definitive shift from vibe coding to agentic engineering. The release represents a pivotal moment in the evolution of artificial intelligence. While competitors have focused on increasing reasoning tokens for better logic, Z.ai is optimizing for productive horizons. GLM-5.1 is a 754-billion-parameter Mixture-of-Experts model engineered to maintain goal alignment over extended execution traces that span thousands of tool calls. "agents could do about 20 steps by the end of last year," wrote z.ai leader Lou on X. "glm-5.1 can do 1,700 rn. autonomous work time may be the most important curve after scaling laws. glm-5.1 will be the first point on that curve that the open-source community can verify with their own hands. hope y'all like it^^" In a market increasingly crowded with fast models, Z.ai is betting on the marathon runner. The company, which listed on the Hong Kong Stock Exchange in early 2026 with a market capitalization of $52.83 billion, is using this release to cement its position as the leading independent developer of large language models in the region.

Technology: the staircase pattern of optimization

GLM-5.1's core technological breakthrough isn't just its scale, though its 754 billion parameters and 202,752-token context window are formidable, but its ability to avoid the plateau effect seen in previous models.
In traditional agentic workflows, a model typically applies a few familiar techniques for quick initial gains and then stalls. Giving it more time or more tool calls usually results in diminishing returns or strategy drift. Z.ai research demonstrates that GLM-5.1 operates via what they call a staircase pattern, characterized by periods of incremental tuning within a fixed strategy punctuated by structural changes that shift the performance frontier. In Scenario 1 of their technical report, the model was tasked with optimizing a high-performance vector database, a challenge known as VectorDBBench. The model is provided with a Rust skeleton and empty implementation stubs, then uses tool-call-based agents to edit code, compile, test, and profile. While previous state-of-the-art results from models like Claude Opus 4.6 reached a performance ceiling of 3,547 queries per second, GLM-5.1 ran through 655 iterations and over 6,000 tool calls. The optimization trajectory was not linear but punctuated by structural breakthroughs. At iteration 90, the model shifted from full-corpus scanning to IVF cluster probing with f16 vector compression, which reduced per-vector bandwidth from 512 bytes to 256 bytes and jumped performance to 6,400 queries per second. By iteration 240, it autonomously introduced a two-stage pipeline involving u8 prescoring and f16 reranking, reaching 13,400 queries per second. Ultimately, the model identified and cleared six structural bottlenecks, including hierarchical routing via super-clusters and quantized routing using centroid scoring via VNNI. These efforts culminated in a final result of 21,500 queries per second, roughly six times the best result achieved in a single 50-turn session. This demonstrates a model that functions as its own research and development department, breaking complex problems down and running experiments with real precision. 
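Two of the optimizations named above lend themselves to a small illustration: the halving of per-vector bandwidth from f16 compression, and the two-stage pipeline that prescores with u8 codes and reranks in f16. The Python sketch below is not GLM-5.1's generated code (which the article describes as Rust) and omits IVF cluster probing entirely. It assumes 128-dimensional vectors, since 128 float32 values would account for the 512-byte figure, and all function names are hypothetical.

```python
import struct

DIM = 128  # assumption: 128 float32 values = 512 bytes, matching the article


def bytes_per_vector(dim: int, fmt: str) -> int:
    """Storage for one vector; fmt is a struct code: 'f' float32, 'e' float16."""
    return dim * struct.calcsize("<" + fmt)


def to_f16(x: float) -> float:
    """Round a value to half precision by packing and unpacking it."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_u8(vec, lo=-1.0, hi=1.0):
    """Map each component from [lo, hi] onto an integer code in 0..255."""
    scale = 255.0 / (hi - lo)
    return [min(255, max(0, round((x - lo) * scale))) for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_stage_search(query, corpus, k=1, shortlist=3):
    """Stage 1: shortlist by u8 distance. Stage 2: rerank the shortlist in f16."""
    q8 = to_u8(query)
    codes = [to_u8(v) for v in corpus]  # would be precomputed in a real index
    coarse = sorted(range(len(corpus)),
                    key=lambda i: sq_dist(q8, codes[i]))[:shortlist]
    q16 = [to_f16(x) for x in query]
    halves = {i: [to_f16(x) for x in corpus[i]] for i in coarse}
    return sorted(coarse, key=lambda i: sq_dist(q16, halves[i]))[:k]


if __name__ == "__main__":
    print(bytes_per_vector(DIM, "f"), bytes_per_vector(DIM, "e"))
    corpus = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
    print(two_stage_search([1.0, 0.0], corpus, k=2))
```

The design idea is that the cheap integer pass touches every candidate while the more precise pass touches only the shortlist, so most of the memory traffic happens at one byte per component.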
The model also managed complex execution tightening, lowering scheduling overhead and improving cache locality. During the optimization of the Approximate Nearest Neighbor search, the model proactively removed nested parallelism in favor of a redesign using per-query single-threading and outer concurrency. When the model encountered iterations where recall fell below the 95 percent threshold, it diagnosed the failure, adjusted its parameters, and implemented parameter compensation to recover the necessary accuracy. This level of autonomous correction is what separates GLM-5.1 from models that simply generate code without testing it in a live environment. Kernelbench: pushing the machine learning frontier The model's endurance was further tested in KernelBench Level 3, which requires end-to-end optimization of complete machine learning architectures like MobileNet, VGG, MiniGPT, and Mamba. In this setting, the goal is to produce a faster GPU kernel than the reference PyTorch implementation while maintaining identical outputs. Each of the 50 problems runs in an isolated Docker container with one H100 GPU and is limited to 1,200 tool-use turns. Correctness and performance are evaluated against a PyTorch eager baseline in separate CUDA contexts. The results highlight a significant performance gap between GLM-5.1 and its predecessors. While the original GLM-5 improved quickly but leveled off early at a 2.6x speedup, GLM-5.1 sustained its optimization efforts far longer. It eventually delivered a 3.6x geometric mean speedup across 50 problems, continuing to make useful progress well past 1,000 tool-use turns. Although Claude Opus 4.6 remains the leader in this specific benchmark at 4.2x, GLM-5.1 has meaningfully extended the productive horizon for open-source models. 
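The headline KernelBench figures are geometric means, which is the usual way to average ratios such as speedups. A short Python sketch shows the calculation and why it differs from an arithmetic mean. The per-problem speedups used here are made-up illustrative values, not benchmark data, and the function name is hypothetical.

```python
import math
import statistics


def geometric_mean_speedup(speedups):
    """Geometric mean of positive ratios: exp of the mean of their logs."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


if __name__ == "__main__":
    # Hypothetical values: one kernel twice as slow, one twice as fast.
    sample = [0.5, 2.0]
    print(geometric_mean_speedup(sample))   # the two effects cancel out
    print(statistics.mean(sample))          # arithmetic mean overstates the gain
    # The standard library offers the same calculation directly.
    print(statistics.geometric_mean(sample))
```

Because a 2x slowdown and a 2x speedup cancel under the geometric mean, a single outlier kernel cannot dominate an aggregate across 50 problems the way it could with an arithmetic mean.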
This capability is not simply about having a longer context window; it requires the model to maintain goal alignment over extended execution, reducing strategy drift, error accumulation, and ineffective trial and error. One of the key breakthroughs is the ability to form an autonomous experiment-analyze-optimize loop, in which the model proactively runs benchmarks, identifies bottlenecks, adjusts strategies, and continuously improves results through iterative refinement. All solutions generated during this process were independently audited for benchmark exploitation, ensuring the optimizations did not rely on specific benchmark behaviors but worked with arbitrary new inputs while keeping computation on the default CUDA stream. Product strategy: subscription and subsidies GLM-5.1 is positioned as an engineering-grade tool rather than a consumer chatbot. To support this, Z.ai has integrated it into a comprehensive Coding Plan ecosystem designed to compete directly with high-end developer tools. The product offering is divided into three subscription tiers, all of which include free Model Context Protocol tools for vision analysis, web search, web reader, and document reading. The Lite tier at $27 USD per quarter is positioned for lightweight workloads and offers three times the usage of a comparable Claude Pro plan. The Pro tier at $81 per quarter is designed for complex workloads, offering five times the Lite plan usage and 40 to 60 percent faster execution. The Max tier at $216 per quarter is aimed at advanced developers with high-volume needs and guarantees performance during peak hours. For those using the API directly or through platforms like OpenRouter or Requesty, Z.ai has priced GLM-5.1 at $1.40 per million input tokens and $4.40 per million output tokens. Cached input tokens are discounted to $0.26 per million.
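To make the API pricing above concrete, here is a small cost estimator. The three rates come from the article; the function name, the rounding, and the assumption that cached input tokens are billed at the cache rate in place of the standard input rate are illustrative choices, not Z.ai's documented billing logic.

```python
# USD per one million tokens, as quoted in the article.
INPUT_PER_M = 1.40
OUTPUT_PER_M = 4.40
CACHED_INPUT_PER_M = 0.26


def estimate_cost(input_tokens, output_tokens, cached_input_tokens=0):
    """Estimate the USD cost of a request.

    `cached_input_tokens` is the portion of `input_tokens` served from cache.
    Assumption: cached tokens are charged the cache rate instead of the input rate.
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    fresh_input = input_tokens - cached_input_tokens
    total = (
        fresh_input * INPUT_PER_M
        + cached_input_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000
    return round(total, 4)


# 2M input and 0.5M output tokens, nothing cached: 2.80 + 2.20
print(estimate_cost(2_000_000, 500_000))  # 5.0

# Same request with half of the input served from cache: 1.40 + 0.26 + 2.20
print(estimate_cost(2_000_000, 500_000, cached_input_tokens=1_000_000))  # 3.86
```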
Prices in USD per one million tokens.

Model                  Input     Output     Total Cost   Source
Grok 4.1 Fast          $0.20     $0.50      $0.70        xAI
MiniMax M2.7           $0.30     $1.20      $1.50        MiniMax
Gemini 3 Flash         $0.50     $3.00      $3.50        Google
Kimi-K2.5              $0.60     $3.00      $3.60        Moonshot
MiMo-V2-Pro (≤256K)    $1.00     $3.00      $4.00        Xiaomi MiMo
GLM-5                  $1.00     $3.20      $4.20        Z.ai
GLM-5-Turbo            $1.20     $4.00      $5.20        Z.ai
GLM-5.1                $1.40     $4.40      $5.80        Z.ai
Claude Haiku 4.5       $1.00     $5.00      $6.00        Anthropic
Qwen3-Max              $1.20     $6.00      $7.20        Alibaba Cloud
Gemini 3 Pro           $2.00     $12.00     $14.00       Google
GPT-5.2                $1.75     $14.00     $15.75       OpenAI
GPT-5.4                $2.50     $15.00     $17.50       OpenAI
Claude Sonnet 4.5      $3.00     $15.00     $18.00       Anthropic
Claude Opus 4.6        $5.00     $25.00     $30.00       Anthropic
GPT-5.4 Pro            $30.00    $180.00    $210.00      OpenAI

Notably, GLM-5.1 consumes quota at three times the standard rate during peak hours, which are defined as 14:00 to 18:00 Beijing Time daily, though a limited-time promotion through April 2026 allows off-peak usage to be billed at a standard 1x rate. Complementing the flagship is the recently debuted GLM-5 Turbo. While 5.1 is the marathon runner, Turbo is the sprinter, proprietary and optimized for fast inference and tasks like tool use and persistent automation. At a cost of $1.20 per million input tokens and $4 per million output tokens, it is more expensive than the base GLM-5 but more affordable than the new GLM-5.1, positioning it as a commercially attractive option for high-speed, supervised agent runs. The model is also packaged for local deployment, supporting inference frameworks including vLLM, SGLang, and xLLM. Comprehensive deployment instructions are available at the official GitHub repository, allowing developers to run the 754 billion parameter MoE model on their own infrastructure. For enterprise teams, the model includes advanced reasoning capabilities that can be accessed via a thinking parameter in API requests, allowing the model to show its step-by-step internal reasoning process before providing a final answer.
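The peak-hour rule mentioned in this section (quota consumed at three times the standard rate between 14:00 and 18:00 Beijing Time) can be sketched as a small helper. The function name, the half-open treatment of the window (14:00 included, 18:00 excluded), and the default off-peak multiplier of 1 are assumptions for illustration; the article does not spell out boundary behavior.

```python
from datetime import datetime, time, timedelta, timezone

# Beijing Time is UTC+8 year-round (no daylight saving time).
BEIJING = timezone(timedelta(hours=8))
PEAK_START = time(14, 0)
PEAK_END = time(18, 0)
PEAK_MULTIPLIER = 3


def quota_multiplier(moment, off_peak_multiplier=1):
    """Return the quota multiplier that applies at `moment` (a timezone-aware datetime)."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(BEIJING).time()
    # Assumption: the peak window includes 14:00 and excludes 18:00.
    if PEAK_START <= local < PEAK_END:
        return PEAK_MULTIPLIER
    return off_peak_multiplier


# 07:30 UTC is 15:30 in Beijing, inside the peak window.
print(quota_multiplier(datetime(2026, 4, 10, 7, 30, tzinfo=timezone.utc)))  # 3

# 12:00 UTC is 20:00 in Beijing, outside the peak window.
print(quota_multiplier(datetime(2026, 4, 10, 12, 0, tzinfo=timezone.utc)))  # 1
```

Scheduling long agent runs to start after 18:00 Beijing Time would, under these assumptions, stretch a fixed quota roughly three times further than running them in the afternoon window.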
Benchmarks: a new global standard The performance data for GLM-5.1 suggests it has leapfrogged several established Western models in coding and engineering tasks. On SWE-Bench Pro, which evaluates a model's ability to resolve real-world GitHub issues using an instruction prompt and a 200,000 token context window, GLM-5.1 achieved a score of 58.4. For context, this outperforms GPT-5.4 at 57.7, Claude Opus 4.6 at 57.3, and Gemini 3.1 Pro at 54.2. Beyond standardized coding tests, the model showed significant gains in reasoning and agentic benchmarks. It scored 63.5 on Terminal-Bench 2.0 when evaluated with the Terminus-2 framework and reached 66.5 when paired with the Claude Code harness. On CyberGym, it achieved a 68.7 score based on a single-run pass over 1,507 tasks, demonstrating a nearly 20-point lead over the previous GLM-5 model. The model also performed strongly on the MCP-Atlas public set with a score of 71.8 and achieved a 70.6 on the T3-Bench. In the reasoning domain, it scored 31.0 on Humanity's Last Exam, which jumped to 52.3 when the model was allowed to use external tools. On the AIME 2026 math competition benchmark, it reached 95.3, while scoring 86.2 on GPQA-Diamond for expert-level science reasoning. The most impressive anecdotal benchmark was the Scenario 3 test: building a Linux-style desktop environment from scratch in eight hours. Unlike previous models that might produce a basic taskbar and a placeholder window before declaring the task complete, GLM-5.1 autonomously filled out a file browser, terminal, text editor, system monitor, and even functional games. It iteratively polished the styling and interaction logic until it had delivered a visually consistent, functional web application. This serves as a concrete example of what becomes possible when a model is given the time and the capability to keep refining its own work.
Licensing and the open segue The licensing of these two models tells a larger story about the current state of the global AI market. GLM-5.1 has been released under the MIT License, with its model weights made publicly available on Hugging Face and ModelScope. This follows Z.ai's historical strategy of using open-source releases to build developer goodwill and ecosystem reach. However, GLM-5 Turbo remains proprietary and closed-source. This reflects a growing trend among leading AI labs toward a hybrid model: using open-source models for broad distribution while keeping execution-optimized variants behind a paywall. Industry analysts note that this shift arrives amidst a rebalancing in the Chinese market, where heavyweights like Alibaba are also beginning to segment their proprietary work from their open releases. Z.ai CEO Zhang Peng appears to be navigating this by ensuring that while the flagship's core intelligence is open to the community, the high-speed execution infrastructure remains a revenue-driving asset. The company is not explicitly promising to open-source GLM-5 Turbo itself, but says the findings will be folded into future open releases. This segmented strategy helps drive adoption while allowing the company to build a sustainable business model around its most commercially relevant work. Community and user reactions: crushing a week's work The developer community response to the GLM-5.1 release has been overwhelmingly focused on the model's reliability in production-grade environments. User reviews suggest a high degree of trust in the model's autonomy. One developer noted that GLM-5.1 shocked them with how good it is, stating it seems to do what they want more reliably than other models with less reworking of prompts needed. Another developer mentioned that the model's overall workflow from planning to project execution performs excellently, allowing them to confidently entrust it with complex tasks.
Specific case studies from users highlight significant efficiency gains. A user from Crypto Economy News reported that a task involving preprocessing code, feature selection logic, and hyperparameter tuning solutions, which originally would have taken a week, was completed in just two days. Since getting the GLM Coding plan, other developers have noted being able to operate more freely and focus on core development without worrying about resource shortages hindering progress. On social media, the launch announcement generated over 46,000 views in its first hour, with users captivated by the eight-hour autonomous claim. The sentiment among early adopters is that Z.ai has successfully moved past the hallucination-heavy era of AI into a period where models can be trusted to optimize themselves through repeated iteration. The ability to build four applications rapidly through correct prompting and structured planning has been cited by multiple users as a game-changing development for individual developers. The implications of long-horizon work The release of GLM-5.1 suggests that the next frontier of AI competition will not be measured in tokens per second, but in autonomous duration. If a model can work for eight hours without human intervention, it fundamentally changes the software development lifecycle. However, Z.ai acknowledges that this is only the beginning. Significant challenges remain, such as developing reliable self-evaluation for tasks where no numeric metric exists to optimize against. Escaping local optima earlier when incremental tuning stops paying off is another major hurdle, as is maintaining coherence over execution traces that span thousands of tool calls. For now, Z.ai has placed a marker in the sand. With GLM-5.1, they have delivered a model that doesn't just answer questions, but finishes projects. The model is already compatible with a wide range of developer tools including Claude Code, OpenCode, Kilo Code, Roo Code, Cline, and Droid. 
For developers and enterprises, the question is no longer, "what can I ask this AI?" but "what can I assign to it for the next eight hours?" The focus of the industry is clearly shifting toward systems that can reliably execute multi-step work with less supervision. This transition to agentic engineering marks a new phase in the deployment of artificial intelligence within the global economy.

AI-RAN is redefining enterprise edge intelligence and autonomy

4/7/2026

Presented by Booz Allen AI-RAN, or artificial intelligence radio access networks, is a reimagining of what wireless infrastructure can do. Rather than treating the network as a passive conduit for data, AI-RAN turns it into an active computational layer. It's a sensor, a compute fabric, and a control plane for physical operations, all rolled into one. That shift has huge implications for industries from manufacturing and logistics to healthcare and smart infrastructure. VentureBeat spoke with two leaders at the center of this transformation: Chris Christou, senior vice president at Booz Allen, and Shervin Gerami, managing director at Cerberus Operations Supply Chain Fund. “AI-RAN can bring the promise of extending 5G and eventually 6G networks into the enterprise,” Christou said. “Proving that a platform can host inference at the edge to enable new types of AI — in particular, physical AI and autonomy-type use cases for things like smart manufacturing and smart warehousing — can make operations more efficient and effective.” “AI-RAN lets enterprises move from digitizing processes to autonomously operating them,” Gerami added. “The enterprise investment should not look at AI-RAN as a networking upgrade. It’s an operating system for physical industries.” The difference between AI for RAN, AI on RAN, and AI and RAN The difference between AI for RAN, AI on RAN, and AI and RAN is critical. AI for RAN applies AI to the radio network itself, improving its performance and efficiency. AI on RAN runs enterprise AI workloads on edge compute infrastructure integrated with the RAN, enabling real-time applications like computer vision, robotics, and localized LLM inference. AI and RAN represents the deeper convergence — where networks are designed to be AI-native, with AI workloads and radio infrastructure architected together as a coordinated, distributed system. At this stage, RAN evolves from a transport layer into a foundational layer of the AI economy. “This is the transformational part,” Gerami said. “It’s jointly designing applications with networks. 
Now the application knows the network state, and the network understands the application’s intent. AI for RAN saves money. AI on RAN adds capability. Then AI and RAN together create entirely new business models.” It's this layered framework that makes AI-RAN more than an incremental evolution of existing wireless technology, and instead a platform shift that opens the network to the kind of developer ecosystem and application innovation that has historically been the domain of cloud computing. How ISAC turns the network into a sensor Integrated sensing and communications (ISAC) is the center of the infrastructure. The network becomes the sensor, a converged infrastructure simultaneously communicating and sensing its environment at the same time it hosts algorithms and applications at the edge. It will enable drone detection, pedestrian safety, and automotive sensing, and eventually even more innovative use cases. The enterprise value proposition of ISAC and a network as the sensor is clear, Gerami says. Today, organizations rely on multiple discrete systems to achieve situational awareness: cameras, radar, asset trackers, motion sensors and more. Each comes with its own maintenance burden, integration overhead and vendor relationship. ISAC has the potential to handle many of those capabilities natively within the network. “With ISAC you can do asset tracking at sub-meter precision inside factories and hospitals,” he explained. “You can detect movement patterns, perimeter breaches, and anomalies. Smart buildings can have occupancy-aware HVAC and energy optimization.” How AI-RAN shaves milliseconds off edge AI and inference With AI-RAN, edge AI and low-latency inference become supercharged in use cases like real-time robotics management, instant quality inspection, and predictive maintenance. These are the applications where the latency gap between cloud and edge is the difference between a system that works and one that doesn’t. 
“Where edge AI kicks in is driving operations in milliseconds, not seconds, which is what cloud does,” Gerami explained. Split inference can also change the game, Christou says. “You have a lot of different use cases where the processing is done on the device, making that device more expensive and more power-hungry,” he said. “Now there’s the possibility of offloading that to a local AI-RAN stack, even getting into concepts like split inference, so you do some of the inference on the device, some on the edge AI-RAN stack, and some in the cloud, all appropriate to the use cases and the time scale of the processing required.” Why the timing of AI-RAN investment is critical now AI-RAN investment has a narrow and strategically critical window, both Gerami and Christou said. “5G infrastructure is already being deployed, almost getting to a point of completion. 6G standards are not yet locked in,” Gerami explained. “This is an architectural moment for AI-RAN to come in. It allows the ability to not make RAN become a telco-centric design only. It allows the enterprise to become the co-creator of the application, the revenue and value generator of that network infrastructure.” Historically, enterprise IT has consumed wireless standards rather than shaped them. AI-RAN’s open architecture, built on software-defined, cloud-native, containerized components, changes that standardization dynamic. “Previously in the wireless industry it was a very long cycle. Now we’re seeing a push to get it implemented, get it out there, get early pilots, and then we’ll see how the technology works,” Christou said. “Simultaneously, in parallel, you can start defining the standards. You have real-life implementation experience to help influence how those standards take shape.” And the entry point is accessible, Gerami added. “The barrier to entry is very low,” he said. “Right now, it’s all code-based, all software. It’s no different than downloading software. 
You get yourself an Nvidia box and you can deploy it with a radio.” Why AI-RAN is the future of innovative AI use cases “We see AI-RAN as being an open architecture that’s truly driving innovation," Gerami said. "It’s a flywheel for innovation. We want to create everything to be microservices, open native, cloud native, to allow partners to build vertical AI apps. There’s so much focus right now in the industry around how we can adopt AI effectively, how it will enable autonomy and robotics. This is one of those foundational pieces that can help realize some of those use cases. The future is about owning that physical inference.” “There’s so much focus right now in the industry around how we can adopt AI effectively — how it will enable autonomy and robotics," Christou said. "This is one of those foundational pieces that can help realize some of those use cases.” Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

How MassMutual and Mass General Brigham turned AI pilot sprawl into production results

4/6/2026

Enterprise AI programs rarely fail because of bad ideas. More often, they get stuck in ungoverned pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and Mass General Brigham explained how they avoided that trap — and what the results look like when discipline replaces sprawl. At MassMutual, the results are concrete: 30% developer productivity gains, IT help desk resolution times reduced from 11 minutes to one, and customer service calls cut from 15 minutes to just one or two. “We're always starting with why do we care about this problem?” Sears Merritt, MassMutual’s head of enterprise technology and experience, said at the event. “If we solve the problem, how are we gonna know we solved it? And, how much value is associated with doing that?” Defining metrics, establishing strong feedback loops MassMutual, a 175-year-old company serving millions of policy owners and customers, has pushed AI into production across the business — customer support, IT, customer acquisition, underwriting, servicing, claims, and other areas. Merritt said his team follows the scientific method, beginning with a hypothesis and testing whether it has an outcome that will tangibly drive the business forward. Some ideas are great, but they may be “intractable in the business” due to factors like lack of data or access, or regulatory constraint. “We won't go any further with an idea until we get crystal clear on how we're going to measure, and how we're going to define success.” Ultimately, it’s up to different departments and leaders to define what quality means: Choose a metric and define the minimum level of quality before a tool is placed into the hands of teams and partners. That starting point creates a quick feedback loop. “The things that we find slow us down is where there isn't shared clarity on what outcome we're trying to achieve,” which can lead to confusion and constant re-adjusting, said Merritt. 
“We don’t go to production until there is a business partner that says, ‘Yes, that works.’” His team is strategic about evaluating emerging tools, and “extremely rigorous” when testing and measuring what “good” means. For instance, they perform trust scoring to lower hallucination rates, establish thresholds and evaluation criteria, and monitor for feature and output drift. Merritt also operates with a no-commitment policy — meaning the company doesn’t lock itself into using a particular model. It has what he calls an “incredibly heterogeneous” technology environment combining best of breed models alongside mainframes running on COBOL. That flexibility isn't accidental. His team built common service layers, microservices and APIs that sit between the AI layer and everything underneath — so when a better model comes along, swapping it in doesn't mean starting over. Because, Merritt explained, “the best of breed today might be the worst of breed tomorrow, and we don't want to set ourselves up to fall behind.” Weeding instead of letting a thousand flowers bloom Mass General Brigham (MGB), for its part, took more of a spray and pray approach — at first. Around 15,000 researchers in the not-for-profit health system have been using AI, ML, and deep learning for the last 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event. But last year, he made a bold choice: His team shut down a sprawl of non-governed AI pilots. Initially, “we did follow the thousand flowers bloom [methodology], but we didn't have a thousand flowers, we had probably a few tens of flowers trying to bloom,” he said. Like Merritt’s team at MassMutual, MGB pivoted to a more holistic view, examining why they were developing certain tools for specific departments or workflows. They questioned what capabilities they wanted and needed and what investment those required. Sriraman's team also spoke with their primary platform providers — Epic, Workday, ServiceNow, Microsoft — about their roadmaps. 
This was a “pivotal moment,” he noted, as they realized they were building in-house tools that vendors were already providing (or were planning to roll out). As Sriraman put it: “Why are we building it ourselves? We are already on the platform. It is going to be in the workflow. Leverage it.” That said, the marketplace is still nascent, which can make for difficult decisions. “The analogy I will give is when you ask six blind men to touch an elephant and say, what does this elephant look like?” Sriraman said. “You're gonna get six different answers.” There's nothing wrong with that, he noted; it's just that everybody is discovering and experimenting as the landscape keeps shifting. Instead of a Wild West environment, Sriraman’s team distributes Microsoft Copilot to users across the business, and uses a “small landing zone” where they can safely test more sophisticated products and control token use. They also began “consciously embedding AI champions” across business groups. “This is kind of a reverse of letting a thousand flowers bloom, carefully planting and nourishing,” Sriraman said. Observability is another big consideration; he describes real-time dashboards that manage model drift and safety and allow IT teams to govern AI “a little more pragmatically.” Health monitoring is critical with AI systems, he noted, and his team has established principles and policies around AI use, not to mention least access privileges. In clinical settings, the guardrails are absolute: AI systems never issue the final decision. “There's always going to be a doctor or a physician assistant in the loop to close the decision,” Sriraman said. He cited radiology report generation as one area where AI is used heavily, but where a radiologist always signs off. Sriraman was clear: “Thou shall not do this: Don't show PHI [protected health information] in Perplexity. As simple as that, right?” And, importantly, there must be safety mechanisms in place. 
“We need a big red button, kill it,” Sriraman emphasized. “We don’t put anything in the operational setting without that.” Ultimately, while agentic AI is a transformative technology, the enterprise approach to it doesn’t have to be dramatically different. “There is nothing new about this,” Sriraman said. “You can replace the word BPM [business process management] from the '90s and 2000s with AI. The same concepts apply.”

AI agents that automatically prevent, detect and fix software issues are here as NeuBird AI launches Falcon, FalconClaw

4/6/2026

The mantra of the modern tech industry was arguably coined by Facebook (before it became Meta): "move fast and break things." But as enterprise infrastructure has shifted into a dizzying maze of hybrid clouds, microservices, and ephemeral compute clusters, the "breaking" part has become a structural tax that many organizations can no longer afford to pay. Today, two-year-old startup NeuBird AI is launching a full-scale offensive against this "chaos tax," announcing a $19.3 million funding round alongside the release of its Falcon autonomous production operations agent. The launch isn't just a product update; it is a philosophical pivot. For years, the industry has focused on "Incident Response"—making the fire trucks faster and the hoses bigger. NeuBird AI is arguing that the only sustainable path forward is "Incident Avoidance". As Venkat Ramakrishnan, President and COO of NeuBird AI, put it in a recent interview: "Incident management is so old school. Incident resolution is so old school. Incident avoidance is what is going to be enabled by AI". By grounding AI in real-time enterprise context rather than just large language model reasoning, the company aims to move site reliability engineering and devops teams from a reactive posture to a predictive one. The AI divide: a reality check on automation Accompanying the launch is NeuBird AI’s 2026 State of Production Reliability and AI Adoption Report, a survey of over 1,000 professionals that reveals a massive disconnect between the boardroom and the server room. While 74% of C-suite executives believe their organizations are actively using AI to manage incidents, only 39% of the practitioners—the engineers actually on-call at 2:00 AM—agree. This 35-point "AI Divide" suggests that while leadership is writing checks for AI platforms, the technology is often failing to reach the frontline. 
For engineers, the reality remains manual and grueling: the study found that engineering teams spend an average of 40% of their time on incident management rather than building new products. Gou Rao, co-founder and CEO of NeuBird AI, told VentureBeat that this is a persistent operational reality: “Over the past 18 months that we have been in production, this is not a marketing slide. We have concretely been able to demonstrate a massive reduction in time to incident response and resolution”. The consequences of this "toil" are more than just lost productivity. Alert fatigue has transitioned from a morale issue to a direct reliability risk. According to the report, 83% of organizations have teams that ignore or dismiss alerts occasionally, and 44% of companies experienced an outage in the past year tied directly to a suppressed or ignored alert. In many cases, the systems are so noisy that customers discover failures before the monitoring tools do. Introducing NeuBird AI Falcon NeuBird AI’s answer to this systemic failure is the Falcon engine. While the company’s previous iteration, Hawkeye, focused on autonomous resolution, Falcon extends that capability into predictive intelligence. "When we launched NeuBird AI in 2023, our first version of the agent was called Hawkeye," Rao explains. "What we’re announcing next week at HumanX is our next-generation version of the agent, codenamed Falcon. Falcon is easily three times faster than Hawkeye and is averaging around 92% in confidence scores". This level of accuracy allows engineers to trust the agent's output at face value. Falcon represents a significant leap over previous generative AI applications in the space, particularly in its ability to forecast failure. "Falcon is really good at preventive prediction, so it can tell you what can go wrong," Rao says. "It’s pretty accurate on a 72-hour window, even better at 48 hours, and by 24 hours it gets really, really accurate”. 
One of the standout features of the new release is the Advanced Context Map. Unlike static dashboards, this is a real-time view of infrastructure dependencies and service health. It allows teams to visualize the "blast radius" of an issue as it propagates across an environment, helping engineers understand not just what is broken, but why it is failing in the context of its neighbors. 'Minority Report' for incident management While many AI tools favor flashy web interfaces, NeuBird AI is leaning into the developer's native habitat with NeuBird AI Desktop. This allows engineers to invoke the production ops agent directly from a command-line interface to explore root causes and system dependencies. "Falcon has a desktop mode which allows it to interact with a developer’s local tools," Rao noted. "We’re getting a lot more traction from a hands-on developer audience, especially as people go to Claude Desktop and Cursor. They’re completing the loop by using production agents talking to their coding agents”. This integration enables a "multi-agent" workflow where an engineer can use NeuBird AI’s agent to diagnose a root cause in production and then hand off that diagnosis to a coding agent like Claude Code to implement the fix. During a live demo, Rao showcased how the agent could be set to "Sentinel Mode," constantly sweeping a cluster for risks. If it detects an anomaly—such as a projected 5% spike in AWS costs or a misconfigured Kubernetes pod—it can flag the specific engineer on-call who has the domain expertise to fix it. "This is like 'Minority Report for Incident Management'," one financial services executive reportedly told the team after a demo. Context engineering: a gateway for security A primary concern for enterprises deploying AI is security—ensuring large language models don't go "crazy" or exfiltrate sensitive data. NeuBird AI addresses this through a proprietary approach to "context engineering". 
"The way we implemented our agent is that the large language models themselves are never actually touching the data directly," Rao explains. "We become the gateway for how the context can be accessed”. This means the model is the reasoning engine, but NeuBird AI is the middleman that wraps the data. Furthermore, the company has implemented strict guardrails on what the agent can actually execute. “We’ve created a language that confines and restricts the agent from what it can do," says Rao. "If it comes up with something anomalous, or something we don’t know, it won’t run. We won’t do it”. This architectural choice allows NeuBird AI to remain model-agnostic. If a newer model from Anthropic or Google outperforms the current reasoning engine, NeuBird AI can simply switch it out without requiring the customer to change their platform. "Customers don’t want to be tied to a specific way of reasoning," Rao asserts. "They want to be tied to a platform from which they can get the value of an agentic system”. Displacing the "army": displacing expensive observability One of the most radical claims NeuBird AI makes is that agentic systems can actually reduce the amount of data enterprises need to store in the first place. Currently, teams rely on massive storage platforms with complex query languages. "People use very complex observability tools like Datadog, Dynatrace, and Sysdig," Rao says. "This is the norm today, which is why it takes an army of people to solve a problem. What we’ve been able to demonstrate with agentic systems is that you don’t need to store all that data in the first place”. Because the agent can reason across raw data sources, it can identify which signals are junk and which are critical. This shift, Rao argues, “reduces human toil and effort while simultaneously reducing your reliance on these insanely expensive observability tools”. The practical impact of this "incident avoidance" was recently demonstrated at Deep Health. 
Rao recounts how their agent detected a systemic issue that was invisible to traditional tools: “Our agent was able to go in and prevent an issue from happening which would have caused this company, Deep Health, a major production outage. The customer is completely beside themselves and happy about what it could do”. FalconClaw: operationalizing 'tribal knowledge' One of the most persistent problems in IT operations is the loss of "tribal knowledge"—the hard-won expertise of senior engineers that exists only in their heads. NeuBird AI is attempting to solve this with FalconClaw, a curated, enterprise-grade skills hub compatible with the OpenClaw ecosystem. FalconClaw allows teams to capture best practices and resolution steps as "validated and compliant skills". The tech preview launched today with 15 initial skills that work natively with NeuBird AI’s toolchain. According to Francois Martel, Field CTO at NeuBird AI, this turns hard-won expertise into a reusable asset that the AI can use automatically. It’s an attempt to standardize how agents interact with infrastructure, moving away from proprietary "black box" systems toward a multi-agent world where different AI tools can share a common set of operational abilities. Scaling the moat: funding and leadership The $19.3 million round was led by Xora Innovation, a Temasek-backed firm, with participation from Mayfield, M12, StepStone Group, and Prosperity7 Ventures. This brings NeuBird AI’s total funding to approximately $64 million. The investor interest is fueled largely by the pedigree of the founding team. Gou Rao and Vinod Jayaraman previously co-founded Portworx, which was acquired by Pure Storage, and Ocarina Networks, acquired by Dell. They have recently bolstered their leadership with Venkat Ramakrishnan, another Pure Storage veteran, as President and COO. For investors like Phil Inagaki of Xora, the value lies in NeuBird AI’s "best-in-class results across accuracy, speed and token consumption". 
As cloud costs continue to spiral, the ability of an AI agent to not only fix bugs but also optimize infrastructure capacity is becoming a "must-have" rather than a "nice-to-have". NeuBird AI claims its agent can save enterprise teams more than 200 engineering hours per month. The path to 'self-healing' infrastructure As the State of Production Reliability report notes, current incident management practices are "no longer sustainable". With 61% of organizations estimating that a single hour of downtime costs $50,000 or more, the financial stakes of staying in a reactive loop are enormous. NeuBird AI's launch of Falcon and FalconClaw marks a definitive attempt to break that loop. By focusing on prevention and the "context engineering" required to make AI trustworthy for enterprise production, the company is positioning itself as the critical intelligence layer for the modern stack. While the "AI Divide" between executives and practitioners remains a significant hurdle for the industry, NeuBird AI is betting that as engineers see the value of a CLI-driven, 92%-accurate agent that can "see around corners," the skepticism will fade. For the site reliability engineers currently drowning in a flood of non-actionable alerts, the arrival of a reliable AI teammate couldn't come soon enough. NeuBird AI Falcon is available starting today, with organizations able to sign up for a free trial at neubird.ai.

Closing the data security maturity gap: Embedding protection into enterprise workflows

4/6/2026

Presented by Capital One

Data security remains one of the least mature domains in enterprise cybersecurity. According to IBM, 35% of breaches in 2025 involved unmanaged data sources, or “shadow data.” This reveals a systemic lack of basic data awareness. It’s not because of a lack of tooling or investment. It’s because many organizations still struggle with the most fundamental questions: What data do we have? Where does it live? How does it move? And who is responsible for it? In an increasingly complex ecosystem of data sources, cloud platforms, SaaS applications, APIs, and AI models, those questions are only becoming more difficult to answer. Closing the maturity gap in data security demands a cultural shift where security is no longer treated as an afterthought. Instead, protection is embedded throughout the full data lifecycle, grounded in a robust inventory, clear classification, and scalable mechanisms that translate policy into automated guardrails. Visibility as the foundation The most persistent barrier to data security maturity is basic visibility. Organizations often focus on how much data they hold, but not on what that data is made up of. Does it contain personally identifiable information (PII)? Financial data? Health information? Intellectual property? Without this level of understanding and inventory, it’s a lot tougher to implement meaningful protection. This can be avoided, however, by prioritizing enterprise capabilities that can detect sensitive data at scale across a large and varied footprint. Detection must be paired with action: deleting data where it’s no longer needed, and securing data where it is by aligning enforcement to a well-defined policy. Mature organizations should start by treating data security as an “understanding your environment” problem. Maintain an inventory, classify what’s in the ecosystem, and align protections with the classification rather than solely relying on perimeter controls or point solutions to scale. 
Securing chaotic data One reason data security has lagged behind other security domains is that data itself is inherently chaotic. Unlike perimeter security, which relies on explicit ports and defined boundaries, data is largely unpredictable. That is to say, the same underlying information may appear across very different formats: structured databases, unstructured documents, chat transcripts, or analytics pipelines. Each may have slightly different encodings or transformations that introduce unforeseen, and often undetected, changes to the data itself. Human behavior compounds the challenge, with different actions introducing risks in ways that perimeter controls simply can’t anticipate. This could be anything from a credit card number copied into a free-form comment field, a spreadsheet emailed outside its intended audience, or a dataset repurposed for a new workflow. When protection is bolted on at the end of a workflow, organizations create blind spots. They rely on downstream checks to catch upstream design flaws. Over time, complexity accumulates and the risk of exposure becomes a question of when, not if. A more resilient model assumes that sensitive data will surface in unexpected places and formats, so protection is embedded from the moment data is captured. Defense-in-depth becomes a design principle: segmentation, encryption at rest and in transit, tokenization, and layered access controls. Critically, these safeguards travel with the data lifecycle, from ingestion to processing, analytics and publishing. Instead of retrofitting controls, organizations design for chaos. They accept variability as a given and build systems that remain secure even when data diverges from expectations. Scaling governance with automation Data security becomes operationally sustainable when governance is enforced through automation from its genesis. 
When coupled with clear expectations to create bounded contexts: teams understand what is permitted, under what conditions, and with what protections data can be used effectively. This matters more than ever today. AI systems often require access to huge volumes of data, across domains. This makes policy implementation particularly challenging. To do so effectively and safely requires deep understanding, strong governance policies, and automated protection. Security techniques such as synthetic data and token replacement enable organizations to preserve analytical context while making sensitive values harder to read. Policy-as-code patterns, APIs, and automation can handle tokenization, deletion, retention constraints, and dynamic access controls. With guardrails built into the platforms they use, engineers can focus more on innovating with data and elevating business outcomes securely. AI systems must also operate within the same governance and monitoring expectations as human workflows. Permissions, telemetry, and controls around what models can access, along with the information they can publish, are essential. Governance will always introduce a degree of friction. The goal is to make that friction well understood, navigable and increasingly automated. Confirming purpose, registering a use case, and provisioning access dynamically based on role and need should be clear, repeatable processes. At enterprise scale, this requires centralized capabilities that implement cyber security policy in the data domain. This includes detection and classification engines, tokenization and detokenization services, retention enforcement, and ownership and taxonomy mechanisms that cascade risk management expectations into daily execution. When done well, governance becomes an enablement layer rather than a bottleneck. Metadata and classification drive protection decisions automatically while accelerating business discovery and usage. 
Data is protected across its lifecycle by strong defenses like tokenization and deleted when required by regulation or internal policy. There should be no need for teams to “touch the data” manually for every control decision, with policy enforced by design. Building for the future Put simply, closing the data security maturity gap is less about adopting a single breakthrough technology and more about operational discipline. Build the map. Classify what you have. Embed protection into workflows so that security is repeatable at scale. For business leaders seeking measurable progress over the next 18–24 months, three priorities stand out. First, establish a robust inventory and metadata-rich map of the data ecosystem. Visibility is non-negotiable. Second, implement classification tied to clear, actionable policy expectations. Make it obvious what protections each category demands. And finally, invest in scalable, automated protection schemes that integrate directly into development and data workflows. When protection shifts from reactive bolt-on controls to proactive built-in guardrails, compliance becomes simpler, governance becomes stronger, and AI readiness becomes achievable, without compromising rigor. Learn more how Capital One Databolt, the enterprise data security solution from Capital One Software, can help your business become AI-ready by securing sensitive data at scale. Andrew Seaton is Vice President, Data Engineering - Enterprise Data Detection & Protection, Capital One. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

OCSF explained: The shared data language security teams have been missing

4/5/2026

The security industry has spent the last year talking about models, copilots, and agents, but a quieter shift is happening one layer below all of that: Vendors are lining up around a shared way to describe security data. The Open Cybersecurity Schema Framework (OCSF) is emerging as one of the strongest candidates for that job. It gives vendors, enterprises, and practitioners a common way to represent security events, findings, objects, and context. That means less time rewriting field names and custom parsers and more time correlating detections, running analytics, and building workflows that can work across products. In a market where every security team is stitching together endpoint, identity, cloud, SaaS, and AI telemetry, a common infrastructure long felt like a pipe dream, and OCSF now puts it within reach. OCSF in plain language OCSF is an open-source framework for cybersecurity schemas. It’s vendor neutral by design and deliberately agnostic to storage format, data collection, and ETL choices. In practical terms, it gives application teams and data engineers a shared structure for events so analysts can work with a more consistent language for threat detection and investigation. That sounds dry until you look at the daily work inside a security operations center (SOC). Security teams have to spend a lot of effort normalizing data from different tools so that they can correlate events. For example, detecting an employee logging in from San Francisco at 10 a.m. on their laptop, then accessing a cloud resource from New York at 10:02 a.m. could reveal a leaked credential. Setting up a system that can correlate those events, however, is no easy task: Different tools describe the same idea with different fields, nesting structures, and assumptions. OCSF was built to lower this tax. 
It helps vendors map their own schemas into a common model and helps customers move data through lakes, pipelines, and security information and event management (SIEM) tools without requiring time-consuming translation at every hop. The last two years have been unusually fast Most of OCSF’s visible acceleration has happened in the last two years. The project was announced in August 2022 by AWS and Splunk, building on work contributed by Symantec, Broadcom, and other well-known infrastructure giants, including Cloudflare, CrowdStrike, IBM, Okta, Palo Alto Networks, Rapid7, Salesforce, Securonix, Sumo Logic, Tanium, Trend Micro, and Zscaler. The OCSF community has kept up a steady cadence of releases over the last two years. The community has grown quickly. AWS said in August 2024 that OCSF had expanded from a 17-company initiative into a community with more than 200 participating organizations and 800 contributors, which expanded to 900 when OCSF joined the Linux Foundation in November 2024. OCSF is showing up across the industry In the observability and security space, OCSF is everywhere. AWS Security Lake converts natively supported AWS logs and events into OCSF and stores them in Parquet. AWS AppFabric can output OCSF-normalized audit data. AWS Security Hub findings use OCSF, and AWS publishes an extension for cloud-specific resource details. Splunk can translate incoming data into OCSF with Edge Processor and Ingest Processor. Cribl supports seamlessly converting streaming data into OCSF and compatible formats. Palo Alto Networks can forward Strata Logging Service data into Amazon Security Lake in OCSF. CrowdStrike positions itself on both sides of the OCSF pipe, with Falcon data translated into OCSF for Security Lake and Falcon Next-Gen SIEM positioned to ingest and parse OCSF-formatted data. OCSF is one of those rare standards that has crossed the chasm from an abstract standard into standard operational plumbing across the industry. 
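To make the normalization tax concrete, here is a minimal Python sketch of the San Francisco and New York login example. Everything in it is an assumption for illustration: the two raw event shapes are invented, and the common fields (`user`, `time`, `city`) are simplified stand-ins rather than actual OCSF attribute names. It only shows why correlation gets easier once every tool's events land in one shape.

```python
from datetime import datetime, timedelta

# Hypothetical raw events: two tools describing the same idea with
# different field names and nesting (not real vendor formats).
laptop_login = {"usr": "alice", "ts": "2026-04-05T10:00:00",
                "geo": {"city": "San Francisco"}}
cloud_access = {"principal": {"name": "alice"},
                "eventTime": "2026-04-05T10:02:00",
                "sourceCity": "New York"}


def normalize_laptop(event):
    """Map the first tool's fields into the shared (illustrative) shape."""
    return {"user": event["usr"],
            "time": datetime.fromisoformat(event["ts"]),
            "city": event["geo"]["city"]}


def normalize_cloud(event):
    """Map the second tool's fields into the same shared shape."""
    return {"user": event["principal"]["name"],
            "time": datetime.fromisoformat(event["eventTime"]),
            "city": event["sourceCity"]}


def suspicious_pairs(events, window=timedelta(minutes=30)):
    """Return consecutive events for one user that come from different
    cities within `window` of each other."""
    ordered = sorted(events, key=lambda ev: (ev["user"], ev["time"]))
    hits = []
    for prev, cur in zip(ordered, ordered[1:]):
        same_user = prev["user"] == cur["user"]
        moved = prev["city"] != cur["city"]
        if same_user and moved and cur["time"] - prev["time"] <= window:
            hits.append((prev, cur))
    return hits


events = [normalize_laptop(laptop_login), normalize_cloud(cloud_access)]
hits = suspicious_pairs(events)
```

The detection logic is a few lines. Most of the code is the per-tool mapping, which is the part a shared schema is meant to push upstream to vendors.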
AI is giving the OCSF story fresh urgency When enterprises deploy AI infrastructure, large language models (LLMs) sit at the core, surrounded by complex distributed systems such as model gateways, agent runtimes, vector stores, tool calls, retrieval systems, and policy engines. These components generate new forms of telemetry, much of which spans product boundaries. Security teams across the SOC are increasingly focused on capturing and analyzing this data. The central question often becomes what an agentic AI system actually did, rather than only the text it produced, and whether its actions led to any security breaches. That puts more pressure on the underlying data model. An AI assistant that calls the wrong tool, retrieves the wrong data, or chains together a risky sequence of actions creates a security event that needs to be understood across systems. A shared security schema becomes more valuable in that world, especially when AI is also being used on the analytics side to correlate more data, faster. For OCSF, 2025 was all about AI Imagine a company uses an AI assistant to help employees look up internal documents and trigger tools like ticketing systems or code repositories. One day, the assistant starts pulling the wrong files, calling tools it should not use, and exposing sensitive information in its responses. Updates in OCSF versions 1.5.0, 1.6.0, and 1.7.0 help security teams piece together what happened by flagging unusual behavior, showing who had access to the connected systems, and tracing the assistant’s tool calls step by step. Instead of only seeing the final answer the AI gave, the team can investigate the full chain of actions that led to the problem. What's on the horizon Imagine a company uses an AI customer support bot, and one day the bot begins giving long, detailed answers that include internal troubleshooting guidance meant only for staff. 
With the kinds of changes being developed for OCSF 1.8.0, the security team could see which model handled the exchange, which provider supplied it, what role each message played, and how the token counts changed across the conversation. A sudden spike in prompt or completion tokens could signal that the bot was fed an unusually large hidden prompt, pulled in too much background data from a vector database, or generated an overly long response that increased the chance of sensitive information leaking. That gives investigators a practical clue about where the interaction went off course, instead of leaving them with only the final answer. Why this matters to the broader market The bigger story is that OCSF has moved quickly from being a community effort to becoming a real standard that security products use every day. Over the past two years, it has gained stronger governance, frequent releases, and practical support across data lakes, ingest pipelines, SIEM workflows, and partner ecosystems. In a world where AI expands the security landscape through scams, abuse, and new attack paths, security teams rely on OCSF to connect data from many systems without losing context along the way to keep your data safe. Nikhil Mungel has been building distributed systems and AI teams at SaaS companies for more than 15 years.

Claude, OpenClaw and the new reality: AI agents are here — and so is the chaos

4/5/2026

The age of agentic AI is upon us — whether we like it or not. What started with an innocent question-answer banter with ChatGPT back in 2022 has become an existential debate on job security and the rise of the machines. More recently, fears of reaching artificial general intelligence (AGI) have become more real with the advent of powerful autonomous agents like Claude Cowork and OpenClaw. Having played with these tools for some time, here is a comparison. First, we have OpenClaw (formerly known as Moltbot and Clawdbot). Surpassing 150,000 GitHub stars in days, OpenClaw is already being deployed on local machines with deep system access. This is like a robot “maid” (Irona for Richie Rich fans, for instance) that you give the keys to your house. It’s supposed to clean it, and you give it the necessary autonomy to take actions and manage your belongings (files and data) as it pleases. The whole purpose is to perform the task at hand — inbox triaging, auto-replies, content curation, travel planning, and more. Next we have Google’s Antigravity, a coding agent with an IDE that accelerates the path from prompt to production. You can interactively create complete application projects and modify specific details over individual prompts. This is like having a junior developer that can not only code, but build, test, integrate, and fix issues. In the real world, this is like hiring an electrician: They are really good at a specific job and you only need to give them access to a specific item (your electric junction box). Finally, we have the mighty Claude. The release of Anthropic's Cowork, which featured AI agents for automating legal tasks like contract review and NDA triage, caused a sharp sell-off in legal-tech and software-as-a-service (SaaS) stocks (referred to as the SaaSpocalypse). Claude was already the go-to chatbot; now with Cowork, it has domain knowledge for specific industries like legal and finance. This is like hiring an accountant. 
They know the domain inside-out and can complete taxes and manage invoices. Users provide specific access to highly-sensitive financial details. Making these tools work for you The key to making these tools more impactful is giving them more power, but that increases the risk of misuse. Users must trust providers like Anthropic and Google to ensure that agent prompts will not cause harm, leak data, or provide an unfair (illegal) advantage to certain vendors. OpenClaw is open-source, which complicates things, as there is no central governing authority. While these technological advancements are amazing and meant for the greater good, all it takes is one or two adverse events to cause panic. Imagine the agentic electrician frying all your house circuits by connecting the wrong wire. In an agent scenario, this could be injecting incorrect code, breaking down a bigger system or adding hidden flaws that may not be immediately evident. Cowork could miss major saving opportunities when doing a user's taxes; on the flip side, it could include illegal writeoffs. Claude can do unimaginable damage when it has more control and authority. But in the middle of this chaos, there is an opportunity to really take advantage. With the right guardrails in place, agents can focus on specific actions and avoid making random, unaccounted-for decisions. Principles of responsible AI — accountability, transparency, reproducibility, security, privacy — are extremely important. Logging agent steps and human confirmation are absolutely critical. Also, when agents deal with so many diverse systems, it's important they speak the same language. Ontology becomes very important so that events can be tracked, monitored, and accounted for. A shared domain-specific ontology can define a “code of conduct.” These ethics can help control the chaos. When tied together with a shared trust and distributed identity framework, we can build systems that enable agents to do truly useful work. 
When done right, an agentic ecosystem can greatly offload the human “cognitive load” and enable our workforce to perform high-value tasks. Humans will benefit when agents handle the mundane. Dattaraj Rao is innovation and R&D architect at Persistent Systems.

Anthropic cuts off the ability to use Claude subscriptions with OpenClaw and third-party AI agents

4/4/2026

Are you a subscriber to Anthropic's Claude Pro ($20 monthly) or Max ($100-$200 monthly) plans and use its Claude AI models and products to power third-party AI agents like OpenClaw? If so, you're in for an unpleasant surprise. Anthropic announced a few hours ago that starting tomorrow, Saturday, April 4, 2026, at 12 pm PT/3 pm ET, it will no longer be possible for those Claude subscribers to use their subscriptions to hook Anthropic's Claude models up to third-party agentic tools, citing the strain such usage was placing on Anthropic's compute and engineering resources, and desire to serve a wide number of users reliably. "We’ve been working hard to meet the increase in demand for Claude, and our subscriptions weren't built for the usage patterns of these third-party tools," wrote Boris Cherny, Head of Claude Code at Anthropic, in a post on X. "Capacity is a resource we manage thoughtfully and we are prioritizing our customers using our products and API." The company also reportedly sent out an email to this effect to some subscribers. However, it's not certain if subscribers to Claude Team and Enterprise will be impacted similarly. We've reached out to Anthropic for further clarification and will update when we hear back. To be clear, it will still be possible to use Claude models like Opus, Sonnet, and Haiku to power OpenClaw and similar external agents, but users will now need to opt into a pay-as-you-go "extra usage" billing system or utilize Anthropic's application programming interface (API), which charges for every token of usage rather than allowing for open-ended usage up to certain limits, as the Pro and Max plans have allowed so far. 
The reason for the change: 'third party services are not optimized' The technical reality, according to Anthropic, is that its first-party tools like Claude Code, its AI vibe coding harness, and Claude Cowork, its business app interfacing and control tool, are built to maximize "prompt cache hit rates"—reusing previously processed text to save on compute. Third-party harnesses like OpenClaw often bypass these efficiencies. “Third party services are not optimized in this way, so it's really hard for us to do sustainably,” Cherny explained further on X. He even revealed his own hands-on attempts to bridge the gap: “I did put up a few PRs to improve prompt cache hit rate for OpenClaw in particular, which should help for folks using it with Claude via API/overages.” Prior to the news, Anthropic had also begun imposing stricter Claude session limits every 5 hours of usage during business hours (5am-11am PT/8am-2pm ET), meaning that the number of tokens you could send during those sessions dropped. This frustrated some power users who suddenly began reaching their limits far faster than they had previously — a change Anthropic said was to help "manage growing demand for Claude" and would only affect up to 7% of users at any given time. Discounts and credits to soften the blow Anthropic is not banning third-party tools entirely, but it is moving them to a different ledger. The new "Extra Usage" bundles represent a middle ground between a flat-rate subscription and a full enterprise API account. The Credit: To "soften the blow," Anthropic is offering existing subscribers a one-time credit equal to their monthly plan price, redeemable until April 17. The Discount: Users who pre-purchase "extra usage" bundles can receive up to a 30% discount, an attempt to retain power users who might otherwise churn. 
Capacity Management: Anthropic’s official statement noted that these tools put an "outsized strain" on systems, forcing a prioritization of "customers using our core products and API." 'The all-you-can-eat buffet just closed' The response from the developer community has been a mixture of analytical acceptance and sharp frustration. Growth marketer Aakash Gupta observed on X that the "all-you-can-eat buffet just closed," noting that a single OpenClaw agent running for one day could burn $1,000 to $5,000 in API costs. “Anthropic was eating that difference on every user who routed through a third-party harness,” Gupta wrote. “That's the pace of a company watching its margin evaporate in real time.” However, Peter Steinberger, the creator of OpenClaw who was recently hired by OpenAI, took a more skeptical view of the "capacity" argument. “Funny how timings match up,” Steinberger posted on X. “First they copy some popular features into their closed harness, then they lock out open source.” Indeed, Anthropic recently added some of the same capabilities that helped OpenClaw catch on — such as the ability to message agents through external services like Discord and Telegram — to Claude Code. Steinberger claimed that he and fellow investor Dave Morin attempted to "talk sense" into Anthropic, but were only able to delay the enforcement by a single week. User @ashen_one, founder of Telaga Charity, voiced a concern likely shared by other small-scale builders: “If I switch both [OpenClaw instances] to an API key or the extra usage you're recommending here, it's going to be far too expensive to make it worth using. I'll probably have to switch over to a different model at this point.” “I know it sucks,” Cherny replied. 
“Fundamentally engineering is about tradeoffs, and one of the things we do to serve a lot of customers is optimize the way subscriptions work to serve as many people as possible with the best mode Licensing and the OpenAI shadow The timing of the crackdown is particularly notable given the talent migration. When Steinberger joined OpenAI in February 2026, he brought the "OpenClaw" ethos with him. OpenAI appears to be positioning itself as a more "harness-friendly" alternative, potentially using this moment as a customer acquisition channel for disgruntled Claude power users. By restricting subscription limits to their own "closed harness," Anthropic is asserting control over the UI/UX layer. This allows them to collect telemetry and manage rate limits more granularly, but it risks alienating the power-user community that built the "agentic" ecosystem in the first place. The Bottom Line Anthropic’s decision is a cold calculation of margins versus growth. As Cherny noted, "Capacity is a resource we manage thoughtfully." In the 2026 AI landscape, the era of subsidized, unlimited compute for third-party automation is over. For the average user on Claude.ai, the experience remains unchanged; for the power users running autonomous offices, the bell has tolled.

Karpathy shares 'LLM Knowledge Base' architecture that bypasses RAG with an evolving markdown library maintained by AI

4/3/2026

AI vibe coders have yet another reason to thank Andrej Karpathy, the coiner of the term. The former Director of AI at Tesla and co-founder of OpenAI, now running his own independent AI project, recently posted on X describing an "LLM Knowledge Bases" approach he's using to manage various topics of research interest. By building a persistent, LLM-maintained record of his projects, Karpathy is solving the core frustration of "stateless" AI development: the dreaded context-limit reset. As anyone who has vibe coded can attest, hitting a usage limit or ending a session often feels like a lobotomy for your project. You’re forced to spend valuable tokens (and time) reconstructing context for the AI, hoping it "remembers" the architectural nuances you just established. Karpathy proposes something simpler and more messily elegant than the typical enterprise solution of a vector database and RAG pipeline. Instead, he outlines a system where the LLM itself acts as a full-time "research librarian"—actively compiling, linting, and interlinking Markdown (.md) files, the most LLM-friendly and compact data format. By diverting a significant portion of his "token throughput" into the manipulation of structured knowledge rather than boilerplate code, Karpathy has surfaced a blueprint for the next phase of the "Second Brain"—one that is self-healing, auditable, and entirely human-readable. Beyond RAG For the past three years, the dominant paradigm for giving LLMs access to proprietary data has been Retrieval-Augmented Generation (RAG). In a standard RAG setup, documents are chopped into arbitrary "chunks," converted into mathematical vectors (embeddings), and stored in a specialized database. When a user asks a question, the system performs a "similarity search" to find the most relevant chunks and feeds them into the LLM. Karpathy’s approach, which he calls LLM Knowledge Bases, rejects the complexity of vector databases for mid-sized datasets. 
Instead, it relies on the LLM’s increasing ability to reason over structured text. The system architecture, as visualized by X user @himanshu in part of the wider reactions to Karpathy's post, functions in three distinct stages: Data Ingest: Raw materials—research papers, GitHub repositories, datasets, and web articles—are dumped into a raw/ directory. Karpathy utilizes the Obsidian Web Clipper to convert web content into Markdown (.md) files, ensuring even images are stored locally so the LLM can reference them via vision capabilities. The Compilation Step: This is the core innovation. Instead of just indexing the files, the LLM "compiles" them. It reads the raw data and writes a structured wiki. This includes generating summaries, identifying key concepts, authoring encyclopedia-style articles, and—crucially—creating backlinks between related ideas. Active Maintenance (Linting): The system isn't static. Karpathy describes running "health checks" or "linting" passes where the LLM scans the wiki for inconsistencies, missing data, or new connections. As community member Charly Wargnier observed, "It acts as a living AI knowledge base that actually heals itself." By treating Markdown files as the "source of truth," Karpathy avoids the "black box" problem of vector embeddings. Every claim made by the AI can be traced back to a specific .md file that a human can read, edit, or delete. Implications for the enterprise While Karpathy’s setup is currently described as a "hacky collection of scripts," the implications for the enterprise are immediate. As entrepreneur Vamshi Reddy (@tammireddy) noted in response to the announcement: "Every business has a raw/ directory. Nobody’s ever compiled it. That’s the product." Karpathy agreed, suggesting that this methodology represents an "incredible new product" category. Most companies currently "drown" in unstructured data—Slack logs, internal wikis, and PDF reports that no one has the time to synthesize. 
A "Karpathy-style" enterprise layer wouldn't just search these documents; it would actively author a "Company Bible" that updates in real-time. As AI educator and newsletter author Ole Lehmann put it on X: "i think whoever packages this for normal people is sitting on something massive. one app that syncs with the tools you already use, your bookmarks, your read-later app, your podcast app, your saved threads." Eugen Alpeza, co-founder and CEO of AI enterprise agent builder and orchestration startup Edra, noted in an X post that: "The jump from personal research wiki to enterprise operations is where it gets brutal. Thousands of employees, millions of records, tribal knowledge that contradicts itself across teams. Indeed, there is room for a new product and we’re building it in the enterprise." As the community explores the "Karpathy Pattern," the focus is already shifting from personal research to multi-agent orchestration. A recent architectural breakdown by @jumperz, founder of AI agent creation platform Secondmate, illustrates this evolution through a "Swarm Knowledge Base" that scales the wiki workflow to a 10-agent system managed via OpenClaw. The core challenge of a multi-agent swarm—where one hallucination can compound and "infect" the collective memory—is addressed here by a dedicated "Quality Gate." Using the Hermes model (trained by Nous Research for structured evaluation) as an independent supervisor, every draft article is scored and validated before being promoted to the "live" wiki. This system creates a "Compound Loop": agents dump raw outputs, the compiler organizes them, Hermes validates the truth, and verified briefings are fed back to agents at the start of each session. This ensures that the swarm never "wakes up blank," but instead begins every task with a filtered, high-integrity briefing of everything the collective has learned. Scaling and performance A common critique of non-vector approaches is scalability. 
However, Karpathy notes that at a scale of ~100 articles and ~400,000 words, the LLM’s ability to navigate via summaries and index files is more than sufficient. For a departmental wiki or a personal research project, the "fancy RAG" infrastructure often introduces more latency and "retrieval noise" than it solves. Tech podcaster Lex Fridman (@lexfridman) confirmed he uses a similar setup, adding a layer of dynamic visualization: "I often have it generate dynamic html (with js) that allows me to sort/filter data and to tinker with visualizations interactively. Another useful thing is I have the system generate a temporary focused mini-knowledge-base... that I then load into an LLM for voice-mode interaction on a long 7-10 mile run." This "ephemeral wiki" concept suggests a future where users don't just "chat" with an AI; they spawn a team of agents to build a custom research environment for a specific task, which then dissolves once the report is written. Licensing and the ‘file-over-app’ philosophy Technically, Karpathy’s methodology is built on an open standard (Markdown) but viewed through a proprietary-but-extensible lens (note taking and file organization app Obsidian). Markdown (.md): By choosing Markdown, Karpathy ensures his knowledge base is not locked into a specific vendor. It is future-proof; if Obsidian disappears, the files remain readable by any text editor. Obsidian: While Obsidian is a proprietary application, its "local-first" philosophy and EULA (which allows for free personal use and requires a license for commercial use) align with the developer's desire for data sovereignty. The "Vibe-Coded" Tools: The search engines and CLI tools Karpathy mentions are custom scripts—likely Python-based—that bridge the gap between the LLM and the local file system. This "file-over-app" philosophy is a direct challenge to SaaS-heavy models like Notion or Google Docs. 
In the Karpathy model, the user owns the data, and the AI is merely a highly sophisticated editor that "visits" the files to perform work. Librarian vs. search engine The AI community has reacted with a mix of technical validation and "vibe-coding" enthusiasm. The debate centers on whether the industry has over-indexed on Vector DBs for problems that are fundamentally about structure, not just similarity. Jason Paul Michaels (@SpaceWelder314), a welder using Claude, echoed the sentiment that simpler tools are often more robust: "No vector database. No embeddings... Just markdown, FTS5, and grep… Every bug fix… gets indexed. The knowledge compounds." However, the most significant praise came from Steph Ango (@Kepano), co-creator of Obsidian, who highlighted a concept called "Contamination Mitigation." He suggested that users should keep their personal "vault" clean and let the agents play in a "messy vault," only bringing over the useful artifacts once the agent-facing workflow has distilled them. Which solution is right for your enterprise vibe coding projects?
Feature | Vector DB / RAG | Karpathy’s Markdown Wiki
Data Format | Opaque Vectors (Math) | Human-Readable Markdown
Logic | Semantic Similarity (Nearest Neighbor) | Explicit Connections (Backlinks/Indices)
Auditability | Low (Black Box) | High (Direct Traceability)
Compounding | Static (Requires re-indexing) | Active (Self-healing through linting)
Ideal Scale | Millions of Documents | 100 to 10,000 High-Signal Documents
The "Vector DB" approach is like a massive, unorganized warehouse with a very fast forklift driver. You can find anything, but you don’t know why it’s there or how it relates to the pallet next to it. Karpathy’s "Markdown Wiki" is like a curated library with a head librarian who is constantly writing new books to explain the old ones. The next phase Karpathy’s final exploration points toward the ultimate destination of this data: Synthetic Data Generation and Fine-Tuning. 
As the wiki grows and the data becomes more "pure" through continuous LLM linting, it becomes the perfect training set. Instead of the LLM just reading the wiki in its "context window," the user can eventually fine-tune a smaller, more efficient model on the wiki itself. This would allow the LLM to "know" the researcher’s personal knowledge base in its own weights, essentially turning a personal research project into a custom, private intelligence. Bottom-line: Karpathy hasn't just shared a script; he’s shared a philosophy. By treating the LLM as an active agent that maintains its own memory, he has bypassed the limitations of "one-shot" AI interactions. For the individual researcher, it means the end of the "forgotten bookmark." For the enterprise, it means the transition from a "raw/ data lake" to a "compiled knowledge asset." As Karpathy himself summarized: "You rarely ever write or edit the wiki manually; it's the domain of the LLM." We are entering the era of the autonomous archive.
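To make the "linting" stage described in this article more concrete, here is a minimal sketch of one health check a compiled Markdown wiki could run: finding backlinks that point to pages that do not exist. This is an illustration, not Karpathy's script. It assumes Obsidian-style [[wikilinks]] and a flat folder of .md files, the function names are hypothetical, and it uses plain pattern matching where the workflow in the article hands the job to an LLM.

```python
import re
import tempfile
from pathlib import Path

# Assumption: links use Obsidian-style [[Target]], [[Target|alias]] or
# [[Target#heading]] syntax, and every wiki page is a .md file in one folder.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:[#|][^\]]*)?\]\]")


def find_broken_links(wiki_dir):
    """Map each page name to the link targets that have no matching .md file."""
    wiki = Path(wiki_dir)
    pages = {path.stem for path in wiki.glob("*.md")}
    broken = {}
    for path in sorted(wiki.glob("*.md")):
        text = path.read_text(encoding="utf-8")
        targets = {target.strip() for target in WIKILINK.findall(text)}
        missing = sorted(target for target in targets if target not in pages)
        if missing:
            broken[path.stem] = missing
    return broken


def demo():
    """Build a two-page throwaway wiki and report its dangling links."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "transformers.md").write_text(
            "See [[attention]] and [[moe|Mixture of Experts]].", encoding="utf-8"
        )
        (root / "attention.md").write_text(
            "Back to [[transformers#intro]].", encoding="utf-8"
        )
        return find_broken_links(root)


if __name__ == "__main__":
    print(demo())
```

In the demo, the only dangling link is the reference to a "moe" page that was never written. In the workflow the article describes, a report like this would presumably be handed back to the LLM as a prompt to author the missing article, which is what makes the wiki "self-healing."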

Nvidia launches enterprise AI agent platform with Adobe, Salesforce, SAP among 17 adopters at GTC 2026

4/3/2026

Jensen Huang walked onto the GTC stage Monday wearing his trademark leather jacket and carrying, as it turned out, the blueprints for a new kind of industry dominance. The Nvidia CEO unveiled the Agent Toolkit, an open-source platform for building autonomous AI agents, and then rattled off the names of the companies that will use it: Adobe, Salesforce, SAP, ServiceNow, Siemens, CrowdStrike, Atlassian, Cadence, Synopsys, IQVIA, Palantir, Box, Cohesity, Dassault Systèmes, Red Hat, Cisco and Amdocs. Seventeen enterprise software companies, touching virtually every industry and every Fortune 500 corporation, all agreeing to build their next generation of AI products on a shared foundation that Nvidia designed, Nvidia optimizes and Nvidia maintains. The toolkit provides the models, the runtime, the security framework and the optimization libraries that AI agents need to operate autonomously inside organizations — resolving customer service tickets, designing semiconductors, managing clinical trials, orchestrating marketing campaigns. Each component is open source. Each is optimized for Nvidia hardware. The combination means that as AI agents proliferate across the corporate world, they will generate demand for Nvidia GPUs not because companies choose to buy them but because the software they depend on was engineered to require them. "The enterprise software industry will evolve into specialized agentic platforms," Huang told the crowd, "and the IT industry is on the brink of its next great expansion." What he left unsaid is that Nvidia has just positioned itself as the tollbooth at the entrance to that expansion — open to all, owned by one. Inside Nvidia's Agent Toolkit: the software stack designed to power every corporate AI worker To grasp the significance of Monday's announcements, it helps to understand the problem Nvidia is solving. Building an enterprise AI agent today is an exercise in frustration. 
A company that wants to deploy an autonomous system — one that can, say, monitor a telecommunications network and proactively resolve customer issues before anyone calls to complain — must assemble a language model, a retrieval system, a security layer, an orchestration framework and a runtime environment, typically from different vendors whose products were never designed to work together. Nvidia's Agent Toolkit collapses that complexity into a unified platform. It includes Nemotron, a family of open models optimized for agentic reasoning; AI-Q, an open blueprint that lets agents perceive, reason and act on enterprise knowledge; OpenShell, an open-source runtime enforcing policy-based security, network and privacy guardrails; and cuOpt, an optimization skill library. Developers can use the toolkit to create specialized AI agents that act autonomously while using and building other software to complete tasks. The AI-Q component addresses a pain point that has dogged enterprise AI adoption: cost. Its hybrid architecture routes complex orchestration tasks to frontier models while delegating research tasks to Nemotron's open models, which Nvidia says can cut query costs by more than 50 percent while maintaining top-tier accuracy. Nvidia used the AI-Q Blueprint to build what it claims is the top-ranking AI agent on both the DeepResearch Bench and DeepResearch Bench II leaderboards — benchmarks that, if they hold under independent validation, position the toolkit as not merely convenient but competitively necessary. OpenShell tackles what has been the single biggest obstacle in every boardroom conversation about letting AI agents loose inside corporate systems: trust. The runtime creates isolated sandboxes that enforce strict policies around data access, network reach and privacy boundaries. 
Nvidia is collaborating with Cisco, CrowdStrike, Google, Microsoft Security and TrendAI to integrate OpenShell with their existing security tools — a calculated move that enlists the cybersecurity industry as a validation layer for Nvidia's approach rather than a competing one. The partner list that reads like the Fortune 500: who signed on and what they're building The breadth of Monday's enterprise adoption announcements reveals Nvidia's ambitions more clearly than any specification sheet could. Adobe, in a simultaneously announced strategic partnership, will adopt Agent Toolkit software as the foundation for running hybrid, long-running creativity, productivity and marketing agents. Shantanu Narayen, Adobe's chair and CEO, said the companies will bring together "our Firefly models, CUDA libraries into our applications, 3D digital twins for marketing, and Agent Toolkit and Nemotron to our agentic frameworks to deliver high-quality, controllable and enterprise-grade AI workflows of the future." The partnership extends deep: Adobe will explore OpenShell and Nemotron as foundations for personalized, secure agentic loops, and will evaluate the toolkit for large-scale workflows powered by Adobe Experience Platform. Nvidia will provide engineering expertise, early access to software and targeted go-to-market support. Salesforce's integration may be the one enterprise IT leaders parse most carefully. The company is working with Nvidia Agent Toolkit software including Nemotron models, enabling customers to build, customize and deploy AI agents using Agentforce for service, sales and marketing. The collaboration introduces a reference architecture where employees can use Slack as the primary conversational interface and orchestration layer for Agentforce agents — powered by Nvidia infrastructure — that participate directly in business workflows and pull from data stores in both on-premises and cloud environments. 
For the millions of knowledge workers who already conduct their professional lives inside Slack, this turns a messaging app into the command center for corporate AI. SAP, whose software underpins the financial and operational plumbing of most Global 2000 companies, is using open Agent Toolkit software, including NeMo, to power AI agents through Joule Studio on SAP Business Technology Platform, letting customers and partners design agents tailored to their own business needs. ServiceNow's Autonomous Workforce of AI Specialists leverages Agent Toolkit software, the AI-Q Blueprint and a combination of closed and open models, including Nemotron and ServiceNow's own Apriel models. That hybrid approach suggests the toolkit is designed not to replace existing AI investments but to become the connective tissue between them. From chip design to clinical trials: how agentic AI is reshaping specialized industries The partner list extends well beyond horizontal software platforms into deeply specialized verticals where autonomous agents could compress timelines measured in years. In semiconductor design, where a single advanced chip can cost billions of dollars and take half a decade to develop, three of the four major electronic design automation companies are building agents on Nvidia's stack. Cadence will leverage Agent Toolkit and Nemotron with its ChipStack AI SuperAgent for semiconductor design and verification. Siemens is launching its Fuse EDA AI Agent, which uses Nemotron to autonomously orchestrate workflows across its entire electronic design automation portfolio, from design conception through manufacturing sign-off. Synopsys is building a multi-agent framework powered by its AgentEngineer technology using Nemotron and NeMo Agent Toolkit. Healthcare and life sciences present perhaps the most consequential use case. 
IQVIA is integrating Nemotron and other Agent Toolkit software with IQVIA.ai, a unified agentic AI platform designed to help life sciences organizations work more efficiently across clinical, commercial and real-world operations. The scale is already significant: IQVIA has deployed more than 150 agents across internal teams and client environments, including 19 of the top 20 pharmaceutical companies. The security sector is embedding itself into the architecture from the ground floor. CrowdStrike unveiled a Secure-by-Design AI Blueprint that embeds its Falcon platform protection directly into Nvidia AI agent architectures — including agents built on AI-Q and OpenShell — and is advancing agentic managed detection and response using Nemotron reasoning models. Cisco AI Defense will provide AI security protection for OpenShell, adding controls and guardrails to govern agent actions. These are not aftermarket bolt-ons; they are foundational integrations that signal the security industry views Nvidia's agent platform as the substrate it needs to protect. Dassault Systèmes is exploring Agent Toolkit software and Nemotron for its role-based AI agents, called Virtual Companions, on its 3DEXPERIENCE agentic platform. Atlassian is working with the toolkit as it evolves its Rovo AI agentic strategy for Jira and Confluence. Box is using it to enable enterprise agents to securely execute long-running business processes. Palantir is developing AI agents on Nemotron that run on its sovereign AI Operating System Reference Architecture. The open-source gambit: why giving software away is Nvidia's most aggressive business move There is something almost paradoxical about a company with a multi-trillion-dollar market capitalization giving away its most strategically important software. But Nvidia's open-source approach to Agent Toolkit is less an act of generosity than a carefully constructed competitive moat. OpenShell is open source. Nemotron models are open. 
AI-Q blueprints are publicly available. LangChain, the agent engineering company whose open-source frameworks have been downloaded over 1 billion times, is working with Nvidia to integrate Agent Toolkit components into the LangChain deep agent library for developing advanced, accurate enterprise AI agents at scale. When the most popular independent framework for building AI agents absorbs your toolkit, you have transcended the category of vendor and entered the category of infrastructure. But openness in AI has a way of being strategically selective. The models are open, but they are optimized for Nvidia's CUDA libraries — the proprietary software layer that has locked developers into Nvidia GPUs for two decades. The runtime is open, but it integrates most deeply with Nvidia's security partners. The blueprints are open, but they perform best on Nvidia hardware. Developers can explore Agent Toolkit and OpenShell on build.nvidia.com today, running on inference providers and Nvidia Cloud Partners including Baseten, CoreWeave, DeepInfra, DigitalOcean and others — all of which run Nvidia GPUs. The strategy has a historical analog in Google's approach to Android: give away the operating system to ensure that the entire mobile ecosystem generates demand for your core services. Nvidia is giving away the agent operating system to ensure that the entire enterprise AI ecosystem generates demand for its core product — the GPU. Every Salesforce agent running Nemotron, every SAP workflow orchestrated through OpenShell, every Adobe creative pipeline accelerated by CUDA creates another strand of dependency on Nvidia silicon. This also explains the Nemotron Coalition announced Monday — a global collaboration of model builders including Mistral AI, Cursor, LangChain, Perplexity, Reflection AI, Sarvam and Thinking Machines Lab, all working to advance open frontier models. 
The coalition's first project will be a base model codeveloped by Mistral AI and Nvidia, trained on Nvidia DGX Cloud, that will underpin the upcoming Nemotron 4 family. By seeding the open model ecosystem with Nvidia-optimized foundations, the company ensures that even models it does not build will run best on its hardware. What could go wrong: the risks enterprise buyers should weigh before going all-in For all the ambition on display Monday, several realities temper the narrative. Adoption announcements are not deployment announcements. Many of the partner disclosures use carefully hedged language — "exploring," "evaluating," "working with" — that is standard in embargoed press releases but should not be confused with production systems serving millions of users. Adobe's own forward-looking statements note that "due to the non-binding nature of the agreement, there are no assurances that Adobe will successfully negotiate and execute definitive documentation with Nvidia on favorable terms or at all." The gap between a GTC keynote demonstration and an enterprise-grade rollout remains substantial. Nvidia is not the only company chasing this market. Microsoft, with its Copilot ecosystem and Azure AI infrastructure, pursues a parallel strategy with the advantage of owning the operating systems and productivity software that most enterprises already use. Google, through Gemini and its cloud platform, has its own agent vision. Amazon, via Bedrock and AWS, is building comparable primitives. The question is not whether enterprise AI agents will be built on some platform but whether the market will consolidate around one stack or fragment across several. The security claims, while architecturally sound, remain unproven at scale. OpenShell's policy-based guardrails are a promising design pattern, but autonomous agents operating in complex enterprise environments will inevitably encounter edge cases that no policy framework has anticipated. 
CrowdStrike's Secure-by-Design AI Blueprint and Cisco AI Defense's OpenShell integration are exactly the kind of layered defense enterprise buyers will demand — but both are newly unveiled, not battle-hardened through years of adversarial testing. Deploying agents that can autonomously access data, execute code and interact with production systems introduces a threat surface that the industry has barely begun to map. And there is the question of whether enterprises are ready for agents at all. The technology may be available, but organizational readiness — the governance structures, the change management, the regulatory frameworks, the human trust — often lags years behind what the platforms can deliver. Beyond agents: the full scope of what Nvidia announced at GTC 2026 Monday's Agent Toolkit announcement did not arrive in isolation. It landed amid an avalanche of product launches that, taken together, describe a company remaking itself at every layer of the computing stack. Nvidia unveiled the Vera Rubin platform — seven new chips in full production, including the Vera CPU purpose-built for agentic AI, the Rubin GPU, and the newly integrated Groq 3 LPU inference accelerator — designed to power every phase of AI from pretraining to real-time agentic inference. The Vera Rubin NVL72 rack integrates 72 Rubin GPUs and 36 Vera CPUs, delivering what Nvidia claims is up to 10x higher inference throughput per watt at one-tenth the cost per token compared with the Blackwell platform. Dynamo 1.0, an open-source inference operating system that Nvidia describes as the "operating system for AI factories," entered production with adoption from AWS, Microsoft Azure, Google Cloud and Oracle Cloud Infrastructure alongside companies like Cursor, Perplexity, PayPal and Pinterest. The BlueField-4 STX storage architecture promises up to 5x token throughput for the long-context reasoning that agents demand, with early adopters including CoreWeave, Crusoe, Lambda, Mistral AI and Nebius. 
BYD, Geely, Isuzu and Nissan announced Level 4 autonomous vehicle programs on Nvidia's DRIVE Hyperion platform, and Uber disclosed plans to launch Nvidia-powered robotaxis across 28 cities and four continents by 2028, beginning with Los Angeles and San Francisco in the first half of 2027. Roche, the pharmaceutical giant, announced it is deploying more than 3,500 Nvidia Blackwell GPUs across hybrid cloud and on-premises environments in the U.S. and Europe — what it calls the largest announced GPU footprint available to a pharmaceutical company. Nvidia also launched physical AI tools for healthcare robotics, with CMR Surgical, Johnson & Johnson MedTech and others adopting the platform, and released Open-H, the world's largest healthcare robotics dataset with over 700 hours of surgical video. And Nvidia even announced a Space Module based on the Vera Rubin architecture, promising to bring data-center-class AI to orbital environments. The real meaning of GTC 2026: Nvidia is no longer selling picks and shovels Strip away the product specifications and benchmark claims and what emerges from GTC 2026 is a single, clarifying thesis: Nvidia believes the era of AI agents will be larger than the era of AI models, and it intends to own the platform layer of that transition the way it already owns the hardware layer of the current one. The 17 enterprise software companies that signed on Monday are making a bet of their own. They are wagering that building on Nvidia's agent infrastructure will let them move faster than building alone — and that the benefits of a shared platform outweigh the risks of shared dependency. For Salesforce, it means Agentforce agents that can draw from both cloud and on-premises data through a single Slack interface. For Adobe, it means creative AI pipelines that span image, video, 3D and document intelligence. For SAP, it means agents woven into the transactional fabric of global commerce. Each partnership is rational on its own terms. 
Together, they form something larger: an industry-wide endorsement of Nvidia as the default substrate for enterprise intelligence. Huang, who opened his career designing graphics chips for video games, closed his keynote by gesturing toward a future in which AI agents do not just assist human workers but operate as autonomous colleagues — reasoning through problems, building their own tools, learning from their mistakes. He compared the moment to the birth of the personal computer, the dawn of the internet, the rise of mobile computing. Technology executives have a professional obligation to describe every product cycle as a revolution. But here is what made Monday different: this time, 17 of the world's most important software companies showed up to agree with him. Whether they did so out of conviction or out of a calculated fear of being left behind may be the most important question in enterprise technology — and it is one that only the next few years can answer.

Arcee's new, open source Trinity-Large-Thinking is the rare, powerful U.S.-made AI model that enterprises can download and customize

4/3/2026

The baton of open source AI models has passed between several companies in the years since ChatGPT debuted in late 2022, from Meta with its Llama family to Chinese labs like Qwen and z.ai. But lately, Chinese companies have started pivoting back towards proprietary models even as some U.S. labs like Cursor and Nvidia release their own variants of the Chinese models, leaving a question mark about who will originate this branch of technology going forward. One answer: Arcee, a San Francisco-based lab, which this week released Trinity-Large-Thinking, a 399-billion-parameter text-only reasoning model under the uncompromisingly open Apache 2.0 license, allowing for full customizability and commercial usage by anyone from indie developers to large enterprises. The release represents more than just a new set of weights on AI code sharing community Hugging Face; it is a strategic bet that "American Open Weights" can provide a sovereign alternative to the increasingly closed or restricted frontier models of 2025. This move arrives precisely as enterprises express growing discomfort with relying on Chinese-based architectures for critical infrastructure, creating a demand for a domestic champion that Arcee intends to fill. As Clément Delangue, co-founder and CEO of Hugging Face, told VentureBeat in a direct message on X: "The strength of the US has always been its startups so maybe they're the ones we should count on to lead in open-source AI. Arcee shows that it's possible!" Genesis of a 30-person frontier lab To understand the weight of the Trinity release, one must understand the lab that built it. Based in San Francisco, Arcee AI is a lean team of only 30 people. While competitors like OpenAI and Google operate with thousands of engineers and multibillion-dollar compute budgets, Arcee has defined itself through what CTO Lucas Atkins calls "engineering through constraint." 
The company first made waves in 2024 after securing a $24 million Series A led by Emergence Capital, bringing its total capital to just under $50 million. In early 2026, the team took a massive risk: they committed $20 million—nearly half their total funding—to a single 33-day training run for Trinity Large. Utilizing a cluster of 2048 NVIDIA B300 Blackwell GPUs, which provided twice the speed of the previous Hopper generation, Arcee bet the company's future on the belief that developers needed a frontier model they could truly own. This "back the company" bet was a masterclass in capital efficiency, proving that a small, focused team could stand up a full pipeline and stabilize training without endless reserves. Engineering through extreme architectural constraint Trinity-Large-Thinking is noteworthy for the extreme sparsity of its attention mechanism. While the model houses 400 billion total parameters, its Mixture-of-Experts architecture means that only 1.56%, or 13 billion parameters, are active for any given token. This allows the model to possess the deep knowledge of a massive system while maintaining the inference speed and operational efficiency of a much smaller one—performing roughly 2 to 3 times faster than its peers on the same hardware. Training such a sparse model presented significant stability challenges. To prevent a few experts from becoming "winners" while others remained untrained "dead weight," Arcee developed SMEBU, or Soft-clamped Momentum Expert Bias Updates. This mechanism ensures that experts are specialized and routed evenly across a general web corpus. The architecture also incorporates a hybrid approach, alternating local and global sliding window attention layers in a 3:1 ratio to maintain performance in long-context scenarios. The data curriculum and synthetic reasoning Arcee’s partnership with fellow startup DatologyAI provided a curriculum of over 10 trillion curated tokens. 
However, the training corpus for the full-scale model was expanded to 20 trillion tokens, split evenly between curated web data and high-quality synthetic data. Unlike typical imitation-based synthetic data where a smaller model simply learns to mimic a larger one, DatologyAI utilized techniques to synthetically rewrite raw web text—such as Wikipedia articles or blogs—to condense the information. This process helped the model learn to reason over concepts and information rather than merely memorizing exact token strings. To ensure regulatory compliance, tremendous effort was invested in excluding copyrighted books and materials with unclear licensing, attracting enterprise customers who are wary of intellectual property risks associated with mainstream LLMs. This data-first approach allowed the model to scale cleanly while significantly improving performance on complex tasks like mathematics and multi-step agent tool use. The pivot from yappy chatbots to reasoning agents The defining feature of this official release is the transition from a standard "instruct" model to a "reasoning" model. By implementing a "thinking" phase prior to generating a response—similar to the internal loops found in the earlier Trinity-Mini—Arcee has addressed the primary criticism of its January "Preview" release. Early users of the Preview model had noted that it sometimes struggled with multi-step instructions in complex environments and could be "underwhelming" for agentic tasks. The "Thinking" update effectively bridges this gap, enabling what Arcee calls "long-horizon agents" that can maintain coherence across multi-turn tool calls without getting "sloppy". This reasoning process enables better context coherence and cleaner instruction following under constraint. This has direct implications for Maestro Reasoning, a 32B-parameter derivative of Trinity already being used in audit-focused industries to provide transparent "thought-to-answer" traces. 
The goal was to move beyond "yappy" or inefficient chatbots toward reliable, cheap, high-quality agents that stay stable across long-running loops. Geopolitics and the case for American open weights The significance of Arcee’s Apache 2.0 commitment is amplified by the retreat of its primary competitors from the open-weight frontier. Throughout 2025, Chinese research labs like Alibaba's Qwen and z.ai (aka Zhupai) set the pace for high-efficiency MoE architectures. However, as we enter 2026, those labs have begun to shift toward proprietary enterprise platforms and specialized subscriptions, signaling a move away from pure community growth. The fragmentation of these once-prolific teams, such as the departure of key technical leads from Alibaba's Qwen lab, has left a void at the high end of the open-weight market. In the United States, the movement has faced its own crisis. Meta’s Llama division notably retreated from the frontier landscape following the mixed reception of Llama 4 in April 2025, which faced reports of quality issues and benchmark manipulation. For developers who relied on the Llama 3 era of dominance, the lack of a current 400B+ open model created an urgent need for an alternative that Arcee has risen to fill. Benchmarks and how Arcee's Trinity-Large-Thinking stacks up to other U.S. frontier open source AI model offerings Trinity-Large-Thinking’s performance on agent-specific evaluations establishes it as a legitimate frontier contender. On PinchBench, a critical metric for evaluating model capability on autonomous agentic tasks, Trinity achieved a score of 91.9, placing it just behind the proprietary market leader, Claude Opus 4.6 (93.3). This competitiveness is mirrored in IFBench, where Trinity’s score of 52.3 sits in a near-dead heat with Opus 4.6’s 53.1, indicating that the reasoning-first "Thinking" update has successfully addressed the instruction-following hurdles that challenged the model’s earlier preview phase. 
The model’s broader technical reasoning capabilities also place it at the high end of the current open-source market. It recorded a 96.3 on AIME25, matching the high-tier Kimi-K2.5 and outstripping other major competitors like GLM-5 (93.3) and MiniMax-M2.7 (80.0). While high-end coding benchmarks like SWE-bench Verified still show a lead for top-tier closed-source models (Trinity scores 63.2 against Opus 4.6’s 75.6), the massive delta in cost per token positions Trinity as the more viable sovereign infrastructure layer for enterprises looking to deploy these capabilities at production scale. When it comes to other U.S. open source frontier model offerings, OpenAI's gpt-oss tops out at 120 billion parameters. There is also Google's Gemma (Gemma 4 was released this week), and IBM's Granite family is worth a mention despite its lower benchmark scores. Nvidia's Nemotron family is also notable, but its models are fine-tuned and post-trained Qwen variants.
Benchmark | Arcee Trinity-Large | gpt-oss-120B (High) | IBM Granite 4.0 | Google Gemma 4
GPQA-D | 76.3% | 80.1% | 74.8% | 84.3%
Tau2-Airline | 88.0% | 65.8%* | 68.3% | 76.9%
PinchBench | 91.9% | 69.0% (IFBench) | 89.1% | 93.3%
AIME25 | 96.3% | 97.9% | 88.5% | 89.2%
MMLU-Pro | 83.4% | 90.0% (MMLU) | 81.2% | 85.2%
So how is an enterprise supposed to choose between all these? Arcee Trinity-Large-Thinking is the premier choice for organizations building autonomous agents; its sparse 400B architecture excels at "thinking" through multi-step logic, complex math, and long-horizon tool use. By activating only a fraction of its parameters, it provides a high-speed reasoning engine for developers who need GPT-4o-level planning capabilities within a cost-effective, open-source framework. Conversely, gpt-oss-120B serves as the optimal middle ground for enterprises that require high-reasoning performance but prioritize lower operational costs and deployment flexibility. 
Because it activates only 5.1B parameters per forward pass, it is uniquely suited for technical workloads like competitive code generation and advanced mathematical modeling that must run on limited hardware, such as a single H100 GPU. Its configurable reasoning effort—offering "Low," "Medium," and "High" modes—makes it the best fit for production environments where latency and accuracy must be balanced dynamically across different tasks. For broader, high-throughput applications, Google Gemma 4 and IBM Granite 4.0 serve as the primary backbones. Gemma 4 offers the highest "intelligence density" for general knowledge and scientific accuracy, making it the most versatile option for R&D and high-speed chat interfaces. Meanwhile, IBM Granite 4.0 is engineered for the "all-day" enterprise workload, utilizing a hybrid architecture that eliminates context bottlenecks for massive document processing. For businesses concerned with legal compliance and hardware efficiency, Granite remains the most reliable foundation for large-scale RAG and document analysis. Ownership as a feature for regulated industries In this climate, Arcee’s choice of the Apache 2.0 license is a deliberate act of differentiation. Unlike the restrictive community licenses used by some competitors, Apache 2.0 allows enterprises to truly own their intelligence stack without the "black box" biases of a general-purpose chat model. "Developers and Enterprises need models they can inspect, post-train, host, distill, and own," Lucas Atkins noted in the launch announcement. This ownership is critical for the "bitter lesson" of training small models: you usually need to train a massive frontier model first to generate the high-quality synthetic data and logits required to build efficient student models. Furthermore, Arcee has released Trinity-Large-TrueBase, a raw 10-trillion-token checkpoint. 
TrueBase offers a rare, "unspoiled" look at foundational intelligence before instruction tuning and reinforcement learning are applied. For researchers in highly regulated industries like finance and defense, TrueBase allows for authentic audits and custom alignments starting from a clean slate.

Community verdict and the future of distillation

The response from the developer community has been largely positive, reflecting the desire for more open-weight, U.S.-made models. On X, researchers highlighted the disruption, noting that the "insanely cheap" prices for a model of this size would be a boon for the agentic community. On the AI model inference website OpenRouter, Trinity-Large-Preview established itself as the #1 most used open model in the U.S., serving over 80.6 billion tokens on peak days like March 1, 2026. The proximity of Trinity-Large-Thinking to Claude Opus 4.6 on PinchBench (91.9 versus 93.3) is particularly striking when set against the cost. At $0.90 per million output tokens, Trinity is approximately 96% cheaper than Opus 4.6, which costs $25 per million output tokens. Arcee’s strategy is now focused on bringing these pretraining and post-training lessons back down the stack. Much of the work that went into Trinity Large will now flow into the Mini and Nano models, refreshing the company's compact line with the distillation of frontier-level reasoning. As global labs pivot toward proprietary lock-in, Arcee has positioned Trinity as a sovereign infrastructure layer that developers can finally control and adapt for long-horizon agentic workflows.
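The "approximately 96% cheaper" figure follows directly from the two quoted output-token prices. The sketch below reproduces that arithmetic; the 500-million-token monthly workload is a hypothetical volume chosen only for illustration, and the helper names are not from any vendor SDK.

```python
# Per-million-output-token prices as quoted in the article.
TRINITY_USD_PER_M = 0.90
OPUS_USD_PER_M = 25.00


def percent_cheaper(cheaper: float, pricier: float) -> float:
    """Return how much cheaper `cheaper` is than `pricier`, as a percentage."""
    return (1 - cheaper / pricier) * 100


def output_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD of generating `tokens` output tokens at the given rate."""
    return tokens / 1_000_000 * usd_per_million


savings = percent_cheaper(TRINITY_USD_PER_M, OPUS_USD_PER_M)
print(f"Trinity is {savings:.1f}% cheaper per output token")

# Hypothetical workload (an assumption, not a figure from the article).
MONTHLY_OUTPUT_TOKENS = 500_000_000
for name, price in [("Trinity", TRINITY_USD_PER_M), ("Opus 4.6", OPUS_USD_PER_M)]:
    print(f"{name}: ${output_cost(MONTHLY_OUTPUT_TOKENS, price):,.2f} per month")
```

The comparison covers output tokens only; input-token pricing is not quoted in the article, so a full bill would differ.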

Google releases Gemma 4 under Apache 2.0 — and that license change may matter more than benchmarks

4/2/2026

For the past two years, enterprises evaluating open-weight models have faced an awkward trade-off. Google's Gemma line consistently delivered strong performance, but its custom license — with usage restrictions and terms Google could update at will — pushed many teams toward Mistral or Alibaba's Qwen instead. Legal review added friction. Compliance teams flagged edge cases. And capable as Gemma 3 was, "open" with asterisks isn't the same as open. Gemma 4 eliminates that friction entirely. Google DeepMind's newest open model family ships under a standard Apache 2.0 license — the same permissive terms used by Qwen, Mistral, Arcee, and most of the open-weight ecosystem. No custom clauses, no "Harmful Use" carve-outs that required legal interpretation, no restrictions on redistribution or commercial deployment. For enterprise teams that had been waiting for Google to play on the same licensing terms as the rest of the field, the wait is over. The timing is notable. As some Chinese AI labs (most notably Alibaba’s latest Qwen models, Qwen3.5 Omni and Qwen 3.6 Plus) have begun pulling back from fully open releases for their latest models, Google is moving in the opposite direction — opening up its most capable Gemma release yet while explicitly stating the architecture draws from its commercial Gemini 3 research. Four models, two tiers: Edge to workstation in a single family Gemma 4 arrives as four distinct models organized into two deployment tiers. The "workstation" tier includes a 31B-parameter dense model and a 26B A4B Mixture-of-Experts model — both supporting text and image input with 256K-token context windows. The "edge" tier consists of the E2B and E4B, compact models designed for phones, embedded devices, and laptops, supporting text, image, and audio with 128K-token context windows. The naming convention takes some unpacking. 
The "E" prefix denotes "effective parameters" — the E2B has 2.3 billion effective parameters but 5.1 billion total, because each decoder layer carries its own small embedding table through a technique Google calls Per-Layer Embeddings (PLE). These tables are large on disk but cheap to compute, which is why the model runs like a 2B while technically weighing more. The "A" in 26B A4B stands for "active parameters" — only 3.8 billion of the MoE model's 25.2 billion total parameters activate during inference, meaning it delivers roughly 26B-class intelligence with compute costs comparable to a 4B model. For IT leaders sizing GPU requirements, this translates directly to deployment flexibility. The MoE model can run on consumer-grade GPUs and should appear quickly in tools like Ollama and LM Studio. The 31B dense model requires more headroom — think an NVIDIA H100 or RTX 6000 Pro for unquantized inference — but Google is also shipping Quantization-Aware Training (QAT) checkpoints to maintain quality at lower precision. On Google Cloud, both workstation models can now run in a fully serverless configuration via Cloud Run with NVIDIA RTX Pro 6000 GPUs, spinning down to zero when idle. The MoE bet: 128 small experts to save on inference costs The architectural choices inside the 26B A4B model deserve particular attention from teams evaluating inference economics. Rather than following the pattern of recent large MoE models that use a handful of big experts, Google went with 128 small experts, activating eight per token plus one shared always-on expert. The result is a model that benchmarks competitively with dense models in the 27B–31B range while running at roughly the speed of a 4B model during inference. This is not just a benchmark curiosity — it directly affects serving costs. A model that delivers 27B-class reasoning at 4B-class throughput means fewer GPUs, lower latency, and cheaper per-token inference in production. 
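The routing pattern described above (128 small experts, eight activated per token, plus one shared always-on expert) can be illustrated with a generic top-k router. This is a minimal sketch of the general technique, not Google's implementation: the softmax over the selected experts, the random router scores, and the assumption that the shared expert sits outside the 128 routed ones are all illustrative choices.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts, per the article
TOP_K = 8          # routed experts activated per token, per the article


def route_token(router_logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Stand-in router scores for one token (random, purely for illustration).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

weights = route_token(logits)
experts_run = len(weights) + 1  # +1 for the shared always-on expert (assumed separate)

print(f"experts evaluated for this token: {experts_run} of {NUM_EXPERTS + 1}")
print(f"active parameter share (3.8B of 25.2B): {3.8 / 25.2:.1%}")
```

Only the selected experts' feed-forward weights are touched for a given token, which is why compute tracks the 3.8 billion active parameters rather than the 25.2 billion total.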
For organizations running coding assistants, document processing pipelines, or multi-turn agentic workflows, the MoE variant may be the most practical choice in the family. Both workstation models use a hybrid attention mechanism that interleaves local sliding window attention with full global attention, with the final layer always global. This design enables the 256K context window while keeping memory consumption manageable — an important consideration for teams processing long documents, codebases, or multi-turn agent conversations. Native multimodality: Vision, audio, and function calling baked in from scratch Previous generations of open models typically treated multimodality as an add-on. Vision encoders were bolted onto text backbones. Audio required an external ASR pipeline like Whisper. Function calling relied on prompt engineering and hoping the model cooperated. Gemma 4 integrates all of these capabilities at the architecture level. All four models handle variable aspect-ratio image input with configurable visual token budgets — a meaningful improvement over Gemma 3n's older vision encoder, which struggled with OCR and document understanding. The new encoder supports budgets from 70 to 1,120 tokens per image, letting developers trade off detail against compute depending on the task. Lower budgets work for classification and captioning; higher budgets handle OCR, document parsing, and fine-grained visual analysis. Multi-image and video input (processed as frame sequences) are supported natively, enabling visual reasoning across multiple documents or screenshots. The two edge models add native audio processing — automatic speech recognition and speech-to-translated-text, all on-device. The audio encoder has been compressed to 305 million parameters, down from 681 million in Gemma 3n, while the frame duration dropped from 160ms to 40ms for more responsive transcription. 
For teams building voice-first applications that need to keep data local — think healthcare, field service, or multilingual customer interaction — running ASR, translation, reasoning, and function calling in a single model on a phone or edge device is a genuine architectural simplification. Function calling is also native across all four models, drawing on research from Google's FunctionGemma release late last year. Unlike previous approaches that relied on instruction-following to coax models into structured tool use, Gemma 4's function calling was trained into the model from the ground up — optimized for multi-turn agentic flows with multiple tools. This shows up in agentic benchmarks, but more importantly, it reduces the prompt engineering overhead that enterprise teams typically invest when building tool-using agents. Benchmarks in context: Where Gemma 4 lands in a crowded field The benchmark numbers tell a clear story of generational improvement. The 31B dense model scores 89.2% on AIME 2026 (a rigorous mathematical reasoning test), 80.0% on LiveCodeBench v6, and hits a Codeforces ELO of 2,150 — numbers that would have been frontier-class from proprietary models not long ago. On vision, MMMU Pro reaches 76.9% and MATH-Vision hits 85.6%. For comparison, Gemma 3 27B scored 20.8% on AIME and 29.1% on LiveCodeBench without thinking mode. The MoE model tracks closely: 88.3% on AIME 2026, 77.1% on LiveCodeBench, and 82.3% on GPQA Diamond — a graduate-level science reasoning benchmark. The performance gap between the MoE and dense variants is modest given the significant inference cost advantage of the MoE architecture. The edge models punch above their weight class. The E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench — strong for a model that runs on a T4 GPU. The E2B, smaller still, manages 37.5% and 44.0% respectively. 
Both significantly outperform Gemma 3 27B (without thinking) on most benchmarks despite being a fraction of the size, thanks to the built-in reasoning capability. These numbers need to be read against an increasingly competitive open-weight landscape. Qwen 3.5, GLM-5, and Kimi K2.5 all compete aggressively in this parameter range, and the field moves fast. What distinguishes Gemma 4 is less any single benchmark and more the combination: strong reasoning, native multimodality across text, vision, and audio, function calling, 256K context, and a genuinely permissive license — all in a single model family with deployment options from edge devices to cloud serverless. What enterprise teams should watch next Google is releasing both pre-trained base models and instruction-tuned variants, which matters for organizations planning to fine-tune for specific domains. The Gemma base models have historically been strong foundations for custom training, and the Apache 2.0 license now removes any ambiguity about whether fine-tuned derivatives can be deployed commercially. The serverless deployment option via Cloud Run with GPU support is worth watching for teams that need inference capacity that scales to zero. Paying only for actual compute during inference — rather than maintaining always-on GPU instances — could meaningfully change the economics of deploying open models in production, particularly for internal tools and lower-traffic applications. Google has hinted that this may not be the complete Gemma 4 family, with additional model sizes likely to follow. But the combination available today — workstation-class reasoning models and edge-class multimodal models, all under Apache 2.0, all drawing from Gemini 3 research — represents the most complete open model release Google has shipped. For enterprise teams that had been waiting for Google's open models to compete on licensing terms as well as performance, the evaluation can finally begin without a call to legal first.

Microsoft launches 3 new AI models in direct shot at OpenAI and Google

4/2/2026

Microsoft on Thursday launched three new foundational AI models it built entirely in-house: a state-of-the-art speech transcription system, a voice generation engine, and an upgraded image creator. The release is the most concrete evidence yet that the $3 trillion software giant intends to compete directly with OpenAI, Google, and other frontier labs on model development, not just distribution. The three models, MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, are available immediately through Microsoft Foundry and a new MAI Playground. They span three of the most commercially valuable modalities in enterprise AI: converting speech to text, generating realistic human voice, and creating images. Together, they represent the opening salvo from Microsoft's superintelligence team, which Microsoft AI CEO Mustafa Suleyman formed just six months ago to pursue what he calls "AI self-sufficiency." "I'm very excited that we've now got the first models out, which are the very best in the world for transcription," Suleyman told VentureBeat in an interview ahead of the public announcement. "Not only that, we're able to deliver the model with half the GPUs of the state-of-the-art competition." The announcement lands at a precarious moment for Microsoft. The company's stock just closed its worst quarter since the 2008 financial crisis, as investors increasingly demand proof that hundreds of billions of dollars in AI infrastructure spending will translate into revenue. These models, priced aggressively and positioned to reduce Microsoft's own cost of goods sold, are Suleyman's first answer to that pressure.

Microsoft's new transcription model claims best-in-class accuracy across 25 languages

MAI-Transcribe-1 is the headline release. The speech-to-text model achieves the lowest average Word Error Rate on the FLEURS benchmark, the industry-standard multilingual test, across the top 25 languages by Microsoft product usage, averaging 3.8% WER.
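For readers unfamiliar with the metric, Word Error Rate is the number of word-level substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of reference words. The sketch below is a minimal standard-library version using edit distance; the sample sentences are invented, and real evaluations such as FLEURS typically normalize casing and punctuation first, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substituted word and one dropped word out of nine.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

On this scale, a 3.8% average means fewer than four word errors per hundred reference words.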
According to Microsoft's benchmarks, it beats OpenAI's Whisper-large-v3 on all 25 languages, Google's Gemini 3.1 Flash on 22 of 25, and ElevenLabs' Scribe v2 and OpenAI's GPT-Transcribe on 15 of 25 each. The model uses a transformer-based text decoder with a bi-directional audio encoder. It accepts MP3, WAV, and FLAC files up to 200MB, and Microsoft says its batch transcription speed is 2.5 times faster than the existing Microsoft Azure Fast offering. Diarization, contextual biasing, and streaming are listed as "coming soon." Microsoft is already testing MAI-Transcribe-1 inside Copilot's Voice mode and Microsoft Teams for conversation transcription — a detail that underscores how quickly the company intends to replace third-party or older internal models with its own. Alongside it, MAI-Voice-1 is Microsoft's text-to-speech model, capable of generating 60 seconds of natural-sounding audio in a single second. The model preserves speaker identity across long-form content and now supports custom voice creation from just a few seconds of audio through Microsoft Foundry. Microsoft is pricing it at $22 per 1 million characters. MAI-Image-2, meanwhile, debuted as a top-three model family on the Arena.ai leaderboard and now delivers at least 2x faster generation times on Foundry and Copilot compared to its predecessor. Microsoft is rolling it out across Bing and PowerPoint, pricing it at $5 per 1 million tokens for text input and $33 per 1 million tokens for image output. WPP, one of the world's largest advertising holding companies, is among the first enterprise partners building with MAI-Image-2 at scale. The contract renegotiation with OpenAI that made Microsoft's model ambitions possible To understand why these models matter, you have to understand the contractual tectonic shift that made them possible. Until October 2025, Microsoft was contractually prohibited from independently pursuing artificial general intelligence. 
The original deal with OpenAI, signed in 2019, gave Microsoft a license to OpenAI's models in exchange for building the cloud infrastructure OpenAI needed. But when OpenAI sought to expand its compute footprint beyond Microsoft — striking deals with SoftBank and others — Microsoft renegotiated. As Suleyman explained in a December 2025 interview with Bloomberg, the revised agreement meant that "up until a few weeks ago, Microsoft was not allowed — by contract — to pursue artificial general intelligence or superintelligence independently." The new terms freed Microsoft to build its own frontier models while retaining license rights to everything OpenAI builds through 2032. Suleyman described the dynamic to VentureBeat in characteristically blunt terms. "Back in September of last year, we renegotiated the contract with OpenAI, and that enabled us to independently pursue our own superintelligence," he said. "Since then, we've been convening the compute and the team and buying up the data that we need." He was quick to emphasize that the OpenAI partnership remains intact. "Nothing's changing with the OpenAI partnership. We will be in partnership with them at least until 2032 and hopefully a lot longer," Suleyman said. "They have been a phenomenal partner to us." He also highlighted that Microsoft provides access to Anthropic's Claude through its Foundry API, framing the company as "a platform of platforms." But the subtext is unmistakable: Microsoft is building the capability to stand on its own. In March, as Business Insider first reported, Suleyman wrote in an internal memo that his goal is to "focus all my energy on our Superintelligence efforts and be able to deliver world class models for Microsoft over the next 5 years." CNBC reported that the structural shift freed Suleyman from day-to-day Copilot product responsibilities, with former Snap executive Jacob Andreou taking over as EVP of the combined consumer and commercial Copilot experience. 
How teams of fewer than 10 engineers built models that rival Big Tech's best Perhaps the most striking detail Suleyman shared with VentureBeat is how small the teams behind these models actually are. "The audio model was built by 10 people, and the vast majority of the speed, efficiency and accuracy gains come from the model architecture and the data that we have used," Suleyman said. "My philosophy has always been that we need fewer people who are more empowered. So we operate an extremely flat structure." He added: "Our image team, equally, is less than 10 people. So this is all about model and data innovation, which has delivered state of the art performance." This matters for two reasons. First, it challenges the prevailing industry narrative that frontier AI development requires thousands of researchers and billions in headcount costs. Meta, by contrast, has pursued what Suleyman described in his Bloomberg interview as a strategy of "hiring a lot of individuals, rather than maybe creating a team" — including reported compensation packages of $100 million to $200 million for top researchers. Second, small teams producing state-of-the-art results dramatically improve the economics. If Microsoft can build best-in-class transcription with 10 engineers and half the GPUs of competitors, the margin structure of its AI business looks fundamentally different from companies burning through cash to achieve similar benchmarks. The lean-team philosophy also echoes Suleyman's broader views on how AI is already reshaping the work of building AI itself. When asked by VentureBeat how his own team works, Suleyman described an environment that resembles a startup trading floor more than a traditional Microsoft engineering org. "There are groups of people around round tables, circular tables, not traditional desks, on laptops instead of big screens," he said. "They're basically vibe coding, side by side all day, morning till night, in rooms of 50 or 60 people." 
Why Suleyman's "humanist AI" pitch is aimed squarely at enterprise buyers Suleyman has been steadily building a philosophical brand around Microsoft's AI efforts that he calls "humanist AI" — a term that appeared prominently in the blog post he authored for the launch and that he elaborated on in our interview. "I think that the motivation of a humanist super intelligence is to create something that is truly in service of humanity," he told VentureBeat. "Humans will remain in control at the top of the food chain, and they will be always aligned to human interests." The framing serves multiple purposes. It differentiates Microsoft from the more acceleration-oriented rhetoric coming from OpenAI and Meta. It resonates with enterprise buyers who need governance, compliance, and safety assurances before deploying AI in regulated industries. And it provides a narrative hedge: if something goes wrong in the broader AI ecosystem, Microsoft can point to its stated commitment to human control. In his December Bloomberg interview, Suleyman went further, describing containment and alignment as "red lines" and arguing that no one should release a superintelligence tool until they are "confident it can be controlled." Suleyman also stressed data provenance as a competitive advantage, describing a conversation with CEO Satya Nadella about developing "a clean lineage of models where the data is extremely clean." He drew an implicit contrast with open-source alternatives, noting that "many of the open-source models have been trained on data in, let's say, inappropriate ways. And there are potentially security issues with that." For enterprise customers evaluating AI vendors amid a thicket of copyright lawsuits across the industry, that is a meaningful commercial argument — if Microsoft can credibly claim that its training data was acquired through properly licensed channels, it reduces the legal and reputational risk of deploying these models in production. 
Microsoft's aggressive pricing puts pressure on Amazon, Google, and the AI startup ecosystem Today’s launch positions Microsoft on three competitive fronts simultaneously. MAI-Transcribe-1 directly targets the transcription workloads that OpenAI's Whisper models have dominated in the open-source community, with Microsoft claiming superior accuracy on all 25 benchmarked languages. The FLEURS results also show it winning against Google's Gemini 3.1 Flash Lite on 22 of 25 languages — a direct challenge as Google aggressively pushes Gemini across its own product suite. And MAI-Voice-1's ability to clone voices from seconds of audio and generate speech at 60x real-time puts it in competition with ElevenLabs, Resemble AI, and the growing ecosystem of voice AI startups, with Microsoft's distribution advantage — any Foundry developer can now access these capabilities through the same API they use for GPT-4 and Claude — acting as a powerful moat. Suleyman framed the competitive position confidently: "We're now a top three lab just under OpenAI and Gemini," he told VentureBeat. The pricing strategy — MAI-Voice-1 at $22 per million characters, MAI-Image-2 at $5 per million input tokens — reflects a deliberate decision to compete on cost. "We're pricing them to be the very best of any hyperscaler. So there will be the cheapest of any of the hyperscalers out there, Amazon. And obviously Google," Suleyman said. "And that's a very conscious decision." This makes strategic sense for Microsoft, which can amortize model development costs across its enormous installed base of enterprise customers. But it also speaks to the question investors have been asking with increasing urgency: when does AI spending start generating returns? Microsoft's stock has fallen roughly 17% year-to-date, according to CNBC, part of a broader selloff in software stocks. 
By building models that run on half the GPUs of competitors, Microsoft reduces its own infrastructure costs for internal products — Teams, Copilot, Bing, PowerPoint — while offering developers pricing designed to undercut the rest of the market. In his March memo, Suleyman wrote that his models would "enable us to deliver the COGS efficiencies necessary to be able to serve AI workloads at the immense scale required in the coming years." These three models are the first tangible delivery on that promise. Suleyman says a frontier large language model is coming — and Microsoft plans to be "completely independent" Suleyman made clear that transcription, voice, and image generation are just the beginning. When asked whether Microsoft would build a large language model to compete directly with GPT at the frontier level, he was unequivocal. "We absolutely are going to be delivering state of the art models across all modalities," he said. "Our mission is to make sure that if Microsoft ever needs it, we will be able to provide state of the art at the best efficiency, the cheapest price, and be completely independent." He described a multi-year roadmap to "set up the GPU clusters at the appropriate scale," noting that the superintelligence team was formally stood up only in October 2025. Suleyman spoke to VentureBeat from Miami, where the full team was convening for one of its regular week-long in-person sessions. He described Nadella flying in for the gathering to lay out "the roadmap of everything that we need to achieve for our AI self-sufficiency mission over the next 2, 3, 4 years, and all the compute roadmap that that would involve." Building a competitive frontier LLM, of course, is a different order of magnitude in complexity, data requirements, and compute cost from what Microsoft demonstrated Thursday. 
The models launched today are specialized — they handle audio and images, not the general reasoning and text generation that underpin products like ChatGPT or Copilot's core intelligence. Suleyman has the organizational mandate, Nadella's public backing, and the contractual freedom. What he doesn't yet have is a track record at Microsoft of delivering on the hardest problem in AI. But consider what he does have: three models that are best-in-class or near it in their respective domains, built by teams smaller than most seed-stage startups, running on half the industry-standard GPU footprint, and priced below every major cloud competitor. Two years ago, Suleyman proposed in MIT Technology Review what he called the "Modern Turing Test" — not whether AI could fool a human in conversation, but whether it could go out into the world and accomplish real economic tasks with minimal oversight. On Thursday, his own models took a step toward that vision. The question now is whether Microsoft's superintelligence team can repeat the trick at the scale that actually matters — and whether they can do it before the market's patience runs out.

In the wake of Claude Code's source code leak, 5 actions enterprise security leaders should take now

4/2/2026

Every enterprise running AI coding agents has just lost a layer of defense. On March 31, Anthropic accidentally shipped a 59.8 MB source map file inside version 2.1.88 of its @anthropic-ai/claude-code npm package, exposing 512,000 lines of unobfuscated TypeScript across 1,906 files. The readable source includes the complete permission model, every bash security validator, 44 unreleased feature flags, and references to upcoming models Anthropic has not announced. Security researcher Chaofan Shou broadcast the discovery on X by approximately 4:23 UTC. Within hours, mirror repositories had spread across GitHub. Anthropic confirmed the exposure was a packaging error caused by human error. No customer data or model weights were involved. But containment has already failed. The Wall Street Journal reported Wednesday morning that Anthropic had filed copyright takedown requests that briefly resulted in the removal of more than 8,000 copies and adaptations from GitHub. However, an Anthropic spokesperson told VentureBeat that the takedown was intended to be more limited: "We issued a DMCA takedown against one repository hosting leaked Claude Code source code and its forks. The repo named in the notice was part of a fork network connected to our own public Claude Code repo, so the takedown reached more repositories than intended. We retracted the notice for everything except the one repo we named, and GitHub has restored access to the affected forks." Programmers have already used other AI tools to rewrite Claude Code's functionality in other programming languages. Those rewrites are themselves going viral. The timing was worse than the leak alone. Hours before the source map shipped, malicious versions of the axios npm package containing a remote access trojan went live on the same registry. 
Any team that installed or updated Claude Code via npm between 00:21 and 03:29 UTC on March 31 may have pulled both the exposed source and the unrelated axios malware in the same install window. A same-day Gartner First Take (subscription required) said the gap between Anthropic's product capability and operational discipline should force leaders to rethink how they evaluate AI development tool vendors. Claude Code is the most discussed AI coding agent among Gartner's software engineering clients. This was the second leak in five days. A separate CMS misconfiguration had already exposed nearly 3,000 unpublished internal assets, including draft announcements for an unreleased model called Claude Mythos. Gartner called the cluster of March incidents a systemic signal. What 512,000 lines reveal about production AI agent architecture The leaked codebase is not a chat wrapper. It is the agentic harness that wraps Claude's language model and gives it the ability to use tools, manage files, execute bash commands, and orchestrate multi-agent workflows. The WSJ described the harness as what allows users to control and direct AI models, much like a harness allows a rider to guide a horse. Fortune reported that competitors and legions of startups now have a detailed road map to clone Claude Code's features without reverse engineering them. The components break down fast. A 46,000-line query engine handles context management through three-layer compression and orchestrates 40-plus tools, each with self-contained schemas and per-tool granular permission checks. And 2,500 lines of bash security validation run 23 sequential checks on every shell command, covering blocked Zsh builtins, Unicode zero-width space injection, IFS null-byte injection, and a malformed token bypass discovered during a HackerOne review. Gartner caught a detail most coverage missed. Claude Code is 90% AI-generated, per Anthropic's own public disclosures. Under the current U.S. 
copyright law requiring human authorship, the leaked code carries diminished intellectual property protection. The Supreme Court declined to revisit the human authorship standard in March 2026. Every organization shipping AI-generated production code faces this same unresolved IP exposure.

Three attack paths the readable source makes cheaper to exploit

The minified bundle already shipped with every string literal extractable. What the readable source eliminates is the research cost. A technical analysis by Jun Zhou of Straiker, an agentic AI security company, mapped three compositions that are now practical, not theoretical, because the implementation is legible.

Context poisoning via the compaction pipeline. Claude Code manages context pressure through a four-stage cascade. MCP tool results are never microcompacted. Read tool results skip budgeting entirely. The autocompact prompt instructs the model to preserve all user messages that are not tool results. A poisoned instruction in a cloned repository's CLAUDE.md file can survive compaction, get laundered through summarization, and emerge as what the model treats as a genuine user directive. The model is not jailbroken. It is cooperative and follows what it believes are legitimate instructions.

Sandbox bypass through shell parsing differentials. Three separate parsers handle bash commands, each with different edge-case behavior. The source documents a known gap where one parser treats carriage returns as word separators, while bash does not. Alex Kim's review found that certain validators return early-allow decisions that short-circuit all subsequent checks. The source contains explicit warnings about the past exploitability of this pattern.

The composition. Context poisoning instructs a cooperative model to construct bash commands sitting in the gaps of the security validators. The defender's mental model assumes an adversarial model and a cooperative user. This attack inverts both. The model is cooperative.
The context is weaponized. The outputs look like commands a reasonable developer would approve. Elia Zaitsev, CrowdStrike's CTO, told VentureBeat in an exclusive interview at RSAC 2026 that the permission problem exposed in the leak reflects a pattern he sees across every enterprise deploying agents. "Don't give an agent access to everything just because you're lazy," Zaitsev said. "Give it access to only what it needs to get the job done." He warned that open-ended coding agents are particularly dangerous because their power comes from broad access. "People want to give them access to everything. If you're building an agentic application in an enterprise, you don't want to do that. You want a very narrow scope." Zaitsev framed the core risk in terms that the leaked source validates. "You may trick an agent into doing something bad, but nothing bad has happened until the agent acts on that," he said. That is precisely what the Straiker analysis describes: context poisoning turns the agent cooperative, and the damage happens when it executes bash commands through the gaps in the validator chain. What the leak exposed and what to audit The table below maps each exposed layer to the attack path it enables and the audit action it requires. Print it. Take it to Monday's meeting.

Exposed Layer | What the Leak Revealed | Attack Path Enabled | Defender Audit Action
4-stage compaction pipeline | Exact criteria for what survives each stage. MCP tool results are never microcompacted. Read tool results skip budgeting. | Context poisoning: malicious instructions in CLAUDE.md survive compaction and get laundered into 'user directives'. | Audit every CLAUDE.md and .claude/config.json in cloned repos. Treat as executable, not metadata.
Bash security validators (2,500 lines, 23 checks) | Full validator chain, early-allow short circuits, three-parser differentials, blocked pattern lists | Sandbox bypass: CR-as-separator gap between parsers. Early-allow in git validators bypasses all downstream checks. | Restrict broad permission rules (Bash(git:*), Bash(echo:*)). Redirect operators chain with allowed commands to overwrite files.
MCP server interface contract | Exact tool schemas, permission checks, and integration patterns for all 40+ built-in tools | Malicious MCP servers that match the exact interface. Supply chain attacks are indistinguishable from legitimate servers. | Treat MCP servers as untrusted dependencies. Pin versions. Monitor for changes. Vet before enabling.
44 feature flags (KAIROS, ULTRAPLAN, coordinator mode) | Unreleased autonomous agent mode, 30-min remote planning, multi-agent orchestration, background memory consolidation | Competitors accelerate the development of comparable features. Future attack surface previewed before defenses ship. | Monitor for feature flag activation in production. Inventory where agent permissions expand with each release.
Anti-distillation and client attestation | Fake tool injection logic, Zig-level hash attestation (cch=00000), GrowthBook feature flag gating | Workarounds documented. MITM proxy strips anti-distillation fields. Env var disables experimental betas. | Do not rely on vendor DRM for API security. Implement your own API key rotation and usage monitoring.
Undercover mode (undercover.ts) | 90-line module strips AI attribution from commits. Force ON possible, force OFF impossible. Dead-code-eliminated in external builds. | AI-authored code enters repos with no attribution. Provenance and audit trail gaps for regulated industries. | Implement commit provenance verification. Require AI disclosure policies for development teams using any coding agent.

AI-assisted code is already leaking secrets at double the rate GitGuardian's State of Secrets Sprawl 2026 report, published March 17, found that Claude Code-assisted commits leaked secrets at a 3.2% rate versus the 1.5% baseline across all public GitHub commits. AI service credential leaks surged 81% year-over-year to 1,275,105 detected exposures.
And 24,008 unique secrets were found in MCP configuration files on public GitHub, with 2,117 confirmed as live, valid credentials. GitGuardian noted the elevated rate reflects human workflow failures amplified by AI speed, not a simple tool defect. The operational pattern Gartner is tracking Feature velocity compounded the exposure. Anthropic shipped over a dozen Claude Code releases in March, introducing autonomous permission delegation, remote code execution from mobile devices, and AI-scheduled background tasks. Each capability widened the operational surface. The same month that introduced them produced the leak that exposed their implementation. Gartner's recommendation was specific. Require AI coding agent vendors to demonstrate the same operational maturity expected of other critical development infrastructure: published SLAs, public uptime history, and documented incident response policies. Architect provider-independent integration boundaries that would let you change vendors within 30 days. Anthropic has published one postmortem across more than a dozen March incidents. Third-party monitors detected outages 15 to 30 minutes before Anthropic's own status page acknowledged them. The company riding this product to a $380 billion valuation and a possible public offering this year, as the WSJ reported, now faces a containment battle that 8,000 DMCA takedowns have not won. Merritt Baer, Chief Security Officer at Enkrypt AI, an enterprise AI guardrails company, and a former AWS security leader, told VentureBeat that the IP exposure Gartner flagged extends into territory most teams have not mapped. "The questions many teams aren't asking yet are about derived IP," Baer said. "Can model providers retain embeddings or reasoning traces, and are those artifacts considered your intellectual property?" With 90% of Claude Code's source AI-generated and now public, that question is no longer theoretical for any enterprise shipping AI-written production code. 
Zaitsev argued that the identity model itself needs rethinking. "It doesn't make sense that an agent acting on your behalf would have more privileges than you do," he told VentureBeat. "You may have 20 agents working on your behalf, but they're all tied to your privileges and capabilities. We're not creating 20 new accounts and 20 new services that we need to keep track of." The leaked source shows Claude Code's permission system is per-tool and granular. The question is whether enterprises are enforcing the same discipline on their side. Five actions for security leaders this week 1. Audit CLAUDE.md and .claude/config.json in every cloned repository. Context poisoning through these files is a documented attack path with a readable implementation guide. Check Point Research found that developers inherently trust project configuration files and rarely apply the same scrutiny as application code during reviews. 2. Treat MCP servers as untrusted dependencies. Pin versions, vet before enabling, monitor for changes. The leaked source reveals the exact interface contract. 3. Restrict broad bash permission rules and deploy pre-commit secret scanning. A team generating 100 commits per week at the 3.2% leak rate is statistically exposing three credentials. MCP configuration files are the newest surface that most teams are not scanning. 4. Require SLAs, uptime history, and incident response documentation from your AI coding agent vendor. Architect provider-independent integration boundaries. Gartner's guidance: 30-day vendor switch capability. 5. Implement commit provenance verification for AI-assisted code. The leaked Undercover Mode module strips AI attribution from commits with no force-off option. Regulated industries need disclosure policies that account for this. Source map exposure is a well-documented failure class caught by standard commercial security tooling, Gartner noted. 
Apple and identity verification provider Persona suffered the same failure in the past year. The mechanism was not novel. The target was. Claude Code alone generates an estimated $2.5 billion in annualized revenue for a company now valued at $380 billion. Its full architectural blueprint is circulating on mirrors that have promised never to come down.
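
The first of the five actions above, auditing CLAUDE.md and .claude/config.json in every cloned repository, can start with a plain inventory. What follows is a minimal Python sketch, not a vetted tool: the function names (find_agent_config_files, fingerprint, audit_report) and the choice to record a SHA-256 fingerprint so later changes stand out are illustrative assumptions, not part of any vendor tooling or of the leaked source. It only locates the files; a person still has to read them for injected instructions.

```python
import hashlib
from pathlib import Path

# File names taken from the audit guidance in the article.
# Everything else in this sketch is a hypothetical illustration.
AGENT_CONFIG_NAMES = {"CLAUDE.md"}
AGENT_CONFIG_TAILS = {(".claude", "config.json")}


def find_agent_config_files(root):
    """Return sorted paths of agent instruction/config files under root."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if path.name in AGENT_CONFIG_NAMES:
            hits.append(path)
        elif tuple(path.parts[-2:]) in AGENT_CONFIG_TAILS:
            hits.append(path)
    return sorted(hits)


def fingerprint(path):
    """SHA-256 of the file contents, so a later change can be detected."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def audit_report(root):
    """Map each discovered file (path relative to root) to its fingerprint."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): fingerprint(p)
        for p in find_agent_config_files(root)
    }


if __name__ == "__main__":
    # Scan the current directory, e.g. a folder holding cloned repositories.
    for rel_path, digest in audit_report(".").items():
        print(f"REVIEW {rel_path} sha256={digest[:12]}")
```

Run from a directory that holds cloned repositories, the script prints one REVIEW line per file found. Saving the report and diffing it after each pull is one way to treat these files as executable rather than as metadata.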

Intuit's AI agents hit 85% repeat usage. The secret was keeping humans involved

4/1/2026

When Intuit shipped AI agents to 3 million customers, 85% came back. The reason, according to the company's EVP and GM: combining AI with human expertise turned out to matter more than anyone expected, not less. Marianna Tessel, the financial software company’s EVP and GM, calls this AI-HI combination a “massive ask” from its customers, noting that it provides another level of confidence and trust. “One of the things we learned that has been fascinating is really the combination of human intelligence and artificial intelligence,” Tessel said in a new VB Beyond the Pilot podcast. “Sometimes it's the combination of AI and HI that gives you better results.” Chatbots alone aren’t the answer Intuit, the parent company of QuickBooks, TurboTax, MailChimp and other widely-used financial products, was one of the first major enterprises to go all in on generative AI with its GenOS platform last June (long before fears of the "SaaSpocalypse" had SaaS companies scrambling to rethink their strategies). Quickly, though, the company recognized that chatbots alone weren’t the answer in enterprise environments, and pivoted to what it now calls Intuit Intelligence. The dashboard-like platform features specialized AI agents for sales, tax, payroll, accounting and project management that users can interact with using natural language to gain insights on their data, automate tasks, and generate reports. Customers report invoices are being paid 90% in full and five days faster, and that manual work has been reduced by 30%. AI agents help close books, categorize transactions, run payroll, automate invoice reminders and surface discrepancies. For instance, one Intuit customer uncovered fraud after interacting with AI agents and asking questions about amounts that didn’t add up. “In the beginning it was like, ‘Is that an error?’ And as he dug in, he discovered very significant fraud,” Tessel said.
Why humans are still in the loop Still, Intuit operates on the principle that humans are “always accessible,” Tessel said. Platforms are built in a way that users can ask questions of a human expert when they’re not getting what they need from the AI agent, or want a human to bounce ideas off of. “I'm not talking about product experts,” Tessel said. “I'm talking about an actual accounting expert or tax expert or payroll expert.” The platform has also been built to suggest human involvement in “high stakes” decision-making scenarios. AI goes to a certain level, then human experts review and categorize the rest. This provides a level of confidence, according to Tessel. “We actually believe it becomes more needed and more powerful at the right moments,” she said. “The expert still provides things that are unique.” The next step is giving customers the tools to perform next-gen tasks like vibe coding — but with simple architectures to reduce the burden for customers. “What we’re testing is this idea of, you can actually do coding without realizing that that's what you are doing,” Tessel said. For example, a merchant running a flower shop wants to ensure that they have the right amount of inventory in stock for Mother’s Day. They can vibe code an agent that analyzes previous years’ sales and creates purchase orders where stock is low. That agent could then be instructed to automatically perform that task for future Mother’s Days and other big holidays. Some users will be more sophisticated and want the ability to dive deeper into the technology. “But some just want to express what they want to have happen,” Tessel said. “Because all they want to do is run their business.” Listen to the full podcast to hear about: Why first-party data can create a "moat" for SaaS companies. Why showing AI's logic matters more than a polished interface. Why 600,000 data points per customer changes what AI can tell you about your business. 
You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

The end of 'shadow AI' at enterprises? Kilo launches KiloClaw for Organizations to enable secure AI agents at scale

4/1/2026

As generative AI matures from a novelty into a workplace staple, a new friction point has emerged: the "shadow AI" or "Bring Your Own AI (BYOAI)" crisis. Much like the unsanctioned use of personal devices in years past, developers and knowledge workers are increasingly deploying autonomous agents on personal infrastructure to manage their professional workflows. "Our journey with Kilo Claw has been to make it easier and easier and more accessible to folks," says Kilo co-founder Scott Breitenother. Today, the company dedicated to providing a portable, multi-model, cloud-based AI coding environment is moving to formalize this "shadow AI" layer: it's launching KiloClaw for Organizations and KiloClaw Chat, a suite of tools designed to provide enterprise-grade governance over personal AI agents. The announcement comes at a period of high velocity for the company. Since making its securely hosted, one-click OpenClaw product for individuals, KiloClaw, generally available last month, more than 25,000 users have integrated the platform into their daily workflows. Simultaneously, Kilo’s proprietary agent benchmark, PinchBench, has logged over 250,000 interactions and recently gained significant industry validation when it was referenced by Nvidia CEO Jensen Huang during his keynote at the 2026 Nvidia GTC conference in San Jose, California. The shadow AI crisis: Addressing the BYOAI problem The impetus for KiloClaw for Organizations stems from a growing visibility gap within large enterprises. In a recent interview with VentureBeat, Kilo leadership detailed conversations with high-level AI directors at government contractors who found their developers running OpenClaw agents on random VPS instances to manage calendars and monitor repositories. 
"What we’re announcing on Tuesday is Kilo Claw for organizations, where a company can buy an organization-level package of Kilo Claws and give every team member access," explained Kilo co-founder and head of product and engineering Emilie Schario during the interview. "We can't see any of it," the head of AI at one such firm reportedly told Kilo. "No audit logs. No credential management. No idea what data is touching what API". This lack of oversight has led some organizations to issue blanket bans on autonomous agents before a clear strategy on deployment could be formed. Anand Kashyap, CEO and founder of data security firm Fortanix, told VentureBeat without seeing Kilo's announcement that while "Openclaw has taken the technology world by storm... the enterprise usage is minimal due to the security concerns of the open source version." Kashyap expanded on this trend: "In recent times, NVIDIA (with NemoClaw), Cisco (DefenseClaw), Palo Alto Networks, and Crowdstrike have all announced offerings to create an enterprise-ready version of OpenClaw with guardrails and governance for agent security. However, enterprise adoption continues to be low. Enterprises like centralized IT control, predictable behavior, and data security which keeps them compliant. An autonomous agentic platform like OpenClaw stretches the envelope on all these parameters, and while security majors have announced their traditional perimeter security measures, they don't address the fundamental problems of having a reduced attack surface. Over time, we will see an agentic platform emerge where agents are pre-built and packaged, and deployed responsibly with centralized controls, and data access controls built into the agentic platform as well as the LLMs they call upon to get instructions on how to perform the next task. Technologies like Confidential Computing provide compartmentalization of data and processing, and are tremendously helpful in reducing the attack surface." 
KiloClaw for Organizations is positioned as the way for the security team to say "yes," providing the visibility and control required to bring these agents in-house. It transitions agents from developer-managed infrastructure into a managed environment characterized by scoped access and organizational-level controls. Technology: Universal persistence and the "Swiss cheese" method A core technical hurdle in the current agent landscape is the fragmentation of chat sessions. During the VentureBeat interview, Schario noted that even advanced tools often struggle with canonical sessions, frequently dropping messages or failing to sync across devices. Schario emphasized the security layer that supports this new structure: “You get all the same benefits of the Kilo gateway and the Kilo platform: you can limit what models people can use, get usage visibility, cost controls, and all the advantages of leveraging Kilo with managed, hosted, controlled Kilo Claw”. To address the inherent unreliability of autonomous agents—such as missed cron jobs or failed executions—Kilo employs what Schario calls the "Swiss cheese method" of reliability. By layering additional protections and deterministic guardrails on top of the base OpenClaw architecture, Kilo aims to ensure that tasks, such as a daily 6:00 PM summary, are completed even if the underlying agent logic falters. This is critical because, as Schario noted, “The real risk for any company is data leakage, and that can come from a bot commenting on a GitHub issue or accidentally emailing the person who’s going to get fired before they get fired”. Product: KiloClaw Chat and organizational guardrails While managed infrastructure solves the backend problem, KiloClaw Chat addresses the user experience. Schario noted that “Hosted, managed OpenClaw is easier to get started with, but it’s not enough, and it still requires you to be at the edge of technology to understand how to set it up”. 
Kilo is looking to lower that barrier for the average worker, asking: “How do we give people who have never heard the phrase OpenClaw or Claudebot an always-on AI assistant?”. Traditionally, interacting with an OpenClaw agent required connecting to third-party messaging services like Telegram or Discord—a process that involves navigating "BotFather" tokens and technical configurations that alienate non-engineers. “One of the number one hurdles we see, both anecdotally and in the data, is that you get your bot running and then you have to connect a channel to it. If you don’t know what’s going on, it’s overwhelming,” Schario observed. “We solved that problem. You don’t need to set up a channel. You can chat with Kilo in the web UI and, with the Kilo Claw app on your phone, interact with Kilo without setting an external channel,” she continued. This native approach is essential for corporate compliance because, as she further explained, “When we were talking to early enterprise opportunities, they don’t want you using your personal Telegram account to chat with your work bot”. As Schario put it, there is a reason enterprise communication doesn't flow through personal DMs; when a company shuts off access, they must be able to shut off access to the bot. Looking ahead, the company plans to integrate these environments further. “What we’re going to do is make Kilo Chat the waypoint between Telegram, Discord, and OpenClaw, so you get all the convenience of Kilo Chat but can use it in the other channels,” Breitenother added. The enterprise package includes several critical governance features: Identity Management: SSO/OIDC integration and SCIM provisioning for automated user lifecycles. Centralized Billing: Full visibility into compute and inference usage across the entire organization. Admin Controls: Org-wide policies regarding which models can be used, specific permissions, and session durations. 
Secrets Configuration: Integration with 1Password ensures that agents never handle credentials in plain text, preventing accidental leaks. Licensing and governance: The "bot account" model Other security experts note that handling permissions for bots and AI agents is among the most pressing problems enterprises face today. As Ev Kontsevoy, CEO and co-founder of AI infrastructure and identity management company Teleport, told VentureBeat without seeing the Kilo news: "The potential impact of OpenClaw as a non-deterministic actor demonstrates why identity can’t be an afterthought. You have an autonomous agent with shell access, browser control, and API credentials, running on a persistent loop, across dozens of messaging platforms, with the ability to write its own skills. That’s not a chatbot. That’s a non-deterministic actor with broad infrastructure access and no cryptographic identity, no short-lived credentials, and no real-time audit trail tying actions to a verifiable actor." Kilo is proposing to solve it with a major change in organizational structure: the adoption of employee "bot accounts". In Kilo’s vision, every employee eventually carries two identities: their standard human account and a corresponding bot account, such as scott.bot@kiloco.ai. These bot identities operate with strictly limited, read-only permissions. For example, a bot might be granted read-only access to company logs or a GitHub account with contributor-only rights. This "scoped" approach allows the agent to maintain full visibility of the data it needs to be helpful while ensuring it cannot accidentally share sensitive information with others. Addressing concerns over data privacy and "black box" algorithms, Kilo emphasizes that its code is source available. “Anyone can go look at our code. It’s not a black box. When you’re buying Kilo Claw, you’re not giving us your data, and we’re not training on any of your data because we're not building our own model,” Schario clarified.
This licensing choice allows organizations to audit the resiliency and security of the platform without fearing their proprietary data will be used to improve third-party models. Pricing and availability KiloClaw for Organizations follows a usage-based pricing model where companies pay only for the compute and inference consumed. Organizations can utilize a "Bring Your Own Key" (BYOK) approach or use Kilo Gateway credits for inference. The service is available starting today, Wednesday, April 1. KiloClaw Chat is currently in beta, with support for web, desktop, and iOS sessions. New users can evaluate the platform via a free tier that includes seven days of compute. As Breitenother summarized to VentureBeat, the goal is to shift from "one-off" deployments to a scalable model for the entire workforce: "I think of Kilo for orgs as buying Kilo Claw by the bushel instead of by the one-off. And we're hoping to sell a lot of bushels of Kilo Claw".
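
The "bot account" model described above pairs each employee with a narrowly scoped bot identity. The Python sketch below shows one way such scoping could be expressed as a deny-by-default check with an audit trail. It is an illustration only: BotAccount, authorize, the owner value "scott" and the resource name "company-logs" are hypothetical, and nothing here reflects Kilo's actual implementation. Only the example address scott.bot@kiloco.ai comes from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotAccount:
    """A hypothetical bot identity tied to a human owner."""
    bot_id: str
    owner: str
    # Explicit (resource, action) grants; anything not listed is denied.
    grants: frozenset = frozenset()

    def is_allowed(self, resource, action):
        return (resource, action) in self.grants


def authorize(bot, resource, action, audit_log):
    """Deny by default and record every decision for later review."""
    allowed = bot.is_allowed(resource, action)
    audit_log.append((bot.bot_id, bot.owner, resource, action, allowed))
    return allowed


# Example: a bot that may only read company logs (resource name is made up).
scott_bot = BotAccount(
    bot_id="scott.bot@kiloco.ai",
    owner="scott",
    grants=frozenset({("company-logs", "read")}),
)
audit_log = []
can_read = authorize(scott_bot, "company-logs", "read", audit_log)
can_write = authorize(scott_bot, "company-logs", "write", audit_log)
```

Here can_read is True and can_write is False, and both decisions land in audit_log with the bot and its owner attached, which is the kind of record tying actions to a verifiable actor that Kontsevoy says is missing today.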

CrowdStrike, Cisco and Palo Alto Networks all shipped agentic SOC tools at RSAC 2026 — the agent behavioral baseline gap survived all three

4/1/2026

CrowdStrike CEO George Kurtz highlighted in his RSA Conference 2026 keynote that the fastest recorded adversary breakout time has dropped to 27 seconds. The average is now 29 minutes, down from 48 minutes in 2024. That is how much time defenders have before a threat spreads. Now CrowdStrike sensors detect more than 1,800 distinct AI applications running on enterprise endpoints, representing nearly 160 million unique application instances. Every one generates detection events, identity events, and data access logs flowing into SIEM systems architected for human-speed workflows. Cisco found that 85% of surveyed enterprise customers have AI agent pilots underway. Only 5% moved agents into production, according to Cisco President and Chief Product Officer Jeetu Patel in his RSAC blog post. That 80-point gap exists because security teams cannot answer the basic questions agents force. Which agents are running, what are they authorized to do, and who is accountable when one goes wrong. “The number one threat is security complexity. But we’re running towards that direction in AI as well,” Etay Maor, VP of Threat Intelligence at Cato Networks, told VentureBeat at RSAC 2026. Maor has attended the conference for 16 consecutive years. “We’re going with multiple point solutions for AI. And now you’re creating the next wave of security complexity.” Agents look identical to humans in your logs In most default logging configurations, agent-initiated activity looks identical to human-initiated activity in security logs. “It looks indistinguishable if an agent runs Louis’s web browser versus if Louis runs his browser,” Elia Zaitsev, CTO of CrowdStrike, told VentureBeat in an exclusive interview at RSAC 2026. Distinguishing the two requires walking the process tree. “I can actually walk up that process tree and say, this Chrome process was launched by Louis from the desktop. This Chrome process was launched from Louis’s Claude Cowork or ChatGPT application. 
Thus, it’s agentically controlled.” Without that depth of endpoint visibility, a compromised agent executing a sanctioned API call with valid credentials fires zero alerts. The exploit surface is already being tested. During his keynote, Kurtz described ClawHavoc, the first major supply chain attack on an AI agent ecosystem, targeting ClawHub, OpenClaw's public skills registry. Koi Security's February audit found 341 malicious skills out of 2,857; a follow-up analysis by Antiy CERT identified 1,184 compromised packages historically across the platform. Kurtz noted ClawHub now hosts 13,000 skills in its registry. The infected skills contained backdoors, reverse shells, and credential harvesters; Kurtz said in his keynote that some erased their own memory after installation and could remain latent before activating. "The frontier AI creators will not secure itself," Kurtz said. "The frontier labs are following the same playbook. They're building it. They're not securing it." Two agentic SOC architectures, one shared blind spot Approach A: AI agents inside the SIEM. Cisco and Splunk announced six specialized AI agents for Splunk Enterprise Security: Detection Builder, Triage, Guided Response, Standard Operating Procedures (SOP), Malware Threat Reversing, and Automation Builder. Malware Threat Reversing is currently available in Splunk Attack Analyzer and Detection Studio is generally available as a unified workspace; the remaining five agents are in alpha or prerelease through June 2026. Exposure Analytics and Federated Search follow the same timeline. Upstream of the SOC, Cisco's DefenseClaw framework scans OpenClaw skills and MCP servers before deployment, while new Duo IAM capabilities extend zero trust to agents with verified identities and time-bound permissions. “The biggest impediment to scaled adoption in enterprises for business-critical tasks is establishing a sufficient amount of trust,” Patel told VentureBeat. 
“Delegating and trusted delegating, the difference between those two, one leads to bankruptcy. The other leads to market dominance.” Approach B: Upstream pipeline detection. CrowdStrike pushed analytics into the data ingestion pipeline itself, integrating its Onum acquisition natively into Falcon’s ingestion system for real-time analytics, detection, and enrichment before events reach the analyst’s queue. Falcon Next-Gen SIEM now ingests Microsoft Defender for Endpoint telemetry natively, so Defender shops do not need additional sensors. CrowdStrike also introduced federated search across third-party data stores and a Query Translation Agent that converts legacy Splunk queries to accelerate SIEM migration. Falcon Data Security for the Agentic Enterprise applies cross-domain data loss prevention to data agents' access at runtime. CrowdStrike’s adversary-informed cloud risk prioritization connects agent activity in cloud workloads to the same detection pipeline. Agentic MDR through Falcon Complete adds machine-speed managed detection for teams that cannot build the capability internally. “The agentic SOC is all about, how do we keep up?” Zaitsev said. “There’s almost no conceivable way they can do it if they don’t have their own agentic assistance.” CrowdStrike opened its platform to external AI providers through Charlotte AI AgentWorks, announced at RSAC 2026, letting customers build custom security agents on Falcon using frontier AI models. Launch partners include Accenture, Anthropic, AWS, Deloitte, Kroll, NVIDIA, OpenAI, Salesforce, and Telefónica Tech. IBM validated buyer demand through a collaboration integrating Charlotte AI with its Autonomous Threat Operations Machine for coordinated, machine-speed investigation and containment. The ecosystem contenders. 
Palo Alto Networks, in an exclusive pre-RSAC briefing with VentureBeat, outlined Prisma AIRS 3.0, extending its AI security platform to agents with artifact scanning, agent red teaming, and a runtime that catches memory poisoning and excessive permissions. The company introduced an agentic identity provider for agent discovery and credential validation. Once Palo Alto Networks closes its proposed acquisition of Koi, the company adds agentic endpoint security. Cortex delivers agentic security orchestration across its customer base. Intel announced that CrowdStrike’s Falcon platform is being optimized for Intel-powered AI PCs, leveraging neural processing units and silicon-level telemetry to detect agent behavior on the device. Kurtz framed AIDR, AI Detection and Response, as the next category beyond EDR, tracking agent-speed activity across endpoints, SaaS, cloud, and AI pipelines. He said that “humans are going to have 90 agents that work for them on average” as adoption scales but did not specify a timeline. The gap no vendor closed

What security leaders need | Approach A: agents inside the SIEM (Cisco/Splunk) | Approach B: upstream pipeline detection (CrowdStrike) | Gap neither closes
Triage at agent volume | Six AI agents handle triage, detection, and response inside Splunk ES | Onum-powered pipeline detects and enriches threats before the analyst sees them | Neither baselines normal agent behavior before flagging anomalies
Agent vs. human differentiation | Duo IAM tracks agent identities but does not differentiate agent from human activity in SOC telemetry | Process tree lineage distinguishes at runtime. AIDR extends to agent-specific detection | No vendor’s announced capabilities include an out-of-the-box agent behavioral baseline
27-second response window | Guided Response Agent executes containment at machine speed | In-pipeline detection reduces queue volume. Agentic MDR adds managed response | Human-in-the-loop governance has not been reconciled with machine-speed response in either approach
Legacy SIEM portability | Native Splunk integration preserves existing workflows | Query Translation Agent converts Splunk queries. Native Defender ingestion lets Microsoft shops migrate | Neither addresses teams running multiple SIEMs during migration
Agent supply chain | DefenseClaw scans skills and MCP servers pre-deployment. Explorer Edition red-teams agents | EDR AI Runtime Protection catches compromised skills post-deployment. Charlotte AI AgentWorks enables custom agents | Neither covers the full lifecycle. Pre-deployment scanning misses runtime exploits and vice versa

The matrix makes one thing visible that the keynotes did not. No vendor shipped an agent behavioral baseline. Both approaches automate triage and accelerate detection. Based on VentureBeat's review of announced capabilities, neither defines what normal agent behavior looks like in a given enterprise environment. Teams running Microsoft Sentinel and Security Copilot represent a third architecture not formally announced as a competing approach at RSAC this week, but CISOs in Microsoft-heavy environments need to test whether Sentinel's native agent telemetry ingestion and Copilot's automated triage close the same gaps identified above. Maor cautioned that the vendor response recycles a pattern he has tracked for 16 years. “I hope we don’t have to go through this whole cycle,” he told VentureBeat. “I hope we learned from the past. It doesn’t really look like it.” Zaitsev’s advice was blunt. “You already know what to do. You’ve known what to do for five, ten, fifteen years. It’s time to finally go do it.” Five things to do Monday morning These steps apply regardless of your SOC platform. None requires ripping and replacing current tools. Start with visibility, then layer in controls as agent volume grows. Inventory every agent on your endpoints.
CrowdStrike detects 1,800 AI applications across enterprise devices. Cisco’s Duo Identity Intelligence discovers agentic identities. Palo Alto Networks’ agentic IDP catalogs agents and maps them to human owners. If you run a different platform, start with an EDR query for known agent directories and binaries. You cannot set policy for agents you do not know exist.

Determine whether your SOC stack can differentiate agent from human activity. CrowdStrike’s Falcon sensor and AIDR do this through process tree lineage. Palo Alto Networks’ agent runtime catches memory poisoning at execution. If your tools cannot make this distinction, your triage rules are applying the wrong behavioral models.

Match the architectural approach to your current SIEM. Splunk shops gain agent capabilities through Approach A. Teams evaluating migration get pipeline detection with Splunk query translation and native Defender ingestion through Approach B. Palo Alto Networks’ Cortex delivers a third option. Teams on Microsoft Sentinel, Google Chronicle, Elastic, or other platforms should evaluate whether their SIEM can ingest agent-specific telemetry at this volume.

Build an agent behavioral baseline before your next board meeting. No vendor ships one. Define what your agents are authorized to do: which APIs, which data stores, which actions, at which times. Create detection rules for anything outside that scope.

Pressure-test your agent supply chain. Cisco’s DefenseClaw and Explorer Edition scan and red-team agents before deployment. CrowdStrike’s runtime detection catches compromised agents post-deployment. Both layers are necessary. Kurtz said in his keynote that ClawHavoc compromised over a thousand ClawHub skills with malware that erased its own memory after installation. If your playbook does not account for an authorized agent executing unauthorized actions at machine speed, rewrite it.

The SOC was built to protect humans using machines. It now protects machines using machines. The response window shrank from 48 minutes to 27 seconds. Any agent generating an alert is now a suspect, not just a sensor. The decisions security leaders make in the next 90 days will determine whether their SOC operates in this new reality or gets buried under it.
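The baseline step above (which APIs, which data stores, which actions, at which times) can be sketched as a plain allowlist check. This is a minimal illustration rather than any vendor's feature: the `AgentBaseline` fields, the event dictionary keys, and the sample names in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AgentBaseline:
    """What one agent is authorized to do (all fields are illustrative)."""
    agent_id: str
    allowed_apis: frozenset
    allowed_data_stores: frozenset
    allowed_actions: frozenset
    active_hours: range  # hours of the day (0-23) in which the agent may act


def violations(baseline: AgentBaseline, event: dict) -> list:
    """Return the reasons an observed event falls outside the baseline.

    `event` is assumed to carry the keys "api", "data_store", "action",
    and "timestamp" (a datetime); real telemetry will look different.
    """
    reasons = []
    if event["api"] not in baseline.allowed_apis:
        reasons.append(f"unapproved API: {event['api']}")
    if event["data_store"] not in baseline.allowed_data_stores:
        reasons.append(f"unapproved data store: {event['data_store']}")
    if event["action"] not in baseline.allowed_actions:
        reasons.append(f"unapproved action: {event['action']}")
    if event["timestamp"].hour not in baseline.active_hours:
        reasons.append(f"outside active hours: {event['timestamp'].hour}:00")
    return reasons
```

An empty list means the event is in scope; anything else is a candidate detection. For example, an agent allowed only to read from an "alerts" store between 08:00 and 18:00 would produce three reasons for a 02:00 delete against an "hr_records" store.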

Hackers slipped a trojan into the code library behind most of the internet. Your team is probably affected

4/1/2026

Attackers stole a long-lived npm access token belonging to the lead maintainer of axios, the most popular HTTP client library in JavaScript, and used it to publish two poisoned versions that install a cross-platform remote access trojan. The malicious releases target macOS, Windows, and Linux. They were live on the npm registry for roughly three hours before removal. Axios gets more than 100 million downloads per week. Wiz reports it sits in approximately 80% of cloud and code environments, touching everything from React front-ends to CI/CD pipelines to serverless functions. Huntress detected the first infections 89 seconds after the malicious package went live and confirmed at least 135 compromised systems among its customers during the exposure window. This is the third major npm supply chain compromise in seven months. Every one exploited maintainer credentials. This time, the target had adopted every defense the security community recommended. One credential, two branches, 39 minutes The attacker took over the npm account of @jasonsaayman, a lead axios maintainer, changed the account email to an anonymous ProtonMail address, and published the poisoned packages through npm’s command-line interface. That bypassed the project’s GitHub Actions CI/CD pipeline entirely. The attacker never touched the Axios source code. Instead, both release branches received a single new dependency: plain-crypto-js@4.2.1. No part of the codebase imports it. The package exists solely to run a postinstall script that drops a cross-platform RAT onto the developer's machine. The staging was precise. Eighteen hours before the axios releases, the attacker published a clean version of plain-crypto-js under a separate npm account to build publishing history and dodge new-package scanner alerts. Then came the weaponized 4.2.1. Both release branches hit within 39 minutes. Three platform-specific payloads were pre-built. 
The malware erases itself after execution and swaps in a clean package.json to frustrate forensic inspection. StepSecurity, which identified the compromise alongside Socket, called it among the most operationally sophisticated supply chain attacks ever documented against a top-10 npm package. The defense that existed on paper Axios did the right things. Legitimate 1.x releases shipped through GitHub Actions using npm's OIDC Trusted Publisher mechanism, which cryptographically ties every publish to a verified CI/CD workflow. The project carried SLSA provenance attestations. By every modern measure, the security stack looked solid. None of it mattered. Huntress dug into the publish workflow and found the gap. The project still passed NPM_TOKEN as an environment variable right alongside the OIDC credentials. When both are present, npm defaults to the token. The long-lived classic token was the real authentication method for every publish, regardless of how OIDC was configured. The attacker never had to defeat OIDC. They walked around it. A legacy token sat there as a parallel auth path, and npm's own hierarchy silently preferred it. “From my experience at AWS, it’s very common for old auth mechanisms to linger,” said Merritt Baer, CSO at Enkrypt AI and former Deputy CISO at AWS, in an exclusive interview with VentureBeat. “Modern controls get deployed, but if legacy tokens or keys aren’t retired, the system quietly favors them. Just like we saw with SolarWinds, where legacy scripts bypassed newer monitoring.” The maintainer posted on GitHub after discovering the compromise: “I’m trying to get support to understand how this even happened. I have 2FA / MFA on practically everything I interact with.” Endor Labs documented the forensic difference. Legitimate axios@1.14.0 showed OIDC provenance, a trusted publisher record, and a gitHead linking to a specific commit. Malicious axios@1.14.1 had none. Any tool checking provenance would have flagged the gap instantly. 
But provenance verification is opt-in. No registry gate rejected the package.

Three attacks, seven months, same root cause

Three npm supply chain compromises in seven months. Every one started with a stolen maintainer credential.

The Shai-Hulud worm hit in September 2025. A single phished maintainer account gave attackers a foothold that self-replicated across more than 500 packages, harvesting npm tokens, cloud credentials, and GitHub secrets as it spread. CISA issued an advisory. GitHub overhauled npm’s entire authentication model in response.

Then in January 2026, Koi Security’s PackageGate research dropped six zero-day vulnerabilities across npm, pnpm, vlt, and Bun that punched through the very defenses the ecosystem adopted after Shai-Hulud. Lockfile integrity and script-blocking both failed under specific conditions. Three of the four package managers patched within weeks. npm closed the report.

Now axios. A stolen long-lived token published a RAT through both release branches despite OIDC, SLSA, and every post-Shai-Hulud hardening measure in place.

npm shipped real reforms after Shai-Hulud. Creation of new classic tokens got deprecated, though pre-existing ones survived until a hard revocation deadline. FIDO 2FA became mandatory, granular access tokens were capped at seven days for publishing, and trusted publishing via OIDC gave projects a cryptographic alternative to stored credentials. Taken together, those changes hardened everything downstream of the maintainer account. What they didn’t change was the account itself. The credential remained the single point of failure.

“Credential compromise is the recurring theme across npm breaches,” Baer said. “This isn’t just a weak password problem. It’s structural. Without ephemeral credentials, enforced MFA, or isolated build and signing environments, maintainer access remains the weak link.”

What npm shipped vs. what this attack walked past

What SOC leaders need | npm defense shipped | vs. axios attack | The gap
Block stolen tokens from publishing | FIDO 2FA required. Granular tokens, 7-day expiry. Classic tokens deprecated | Bypassed. Legacy token coexisted alongside OIDC. npm preferred the token | No enforcement removes legacy tokens when OIDC is configured
Verify package provenance | OIDC Trusted Publishing via GitHub Actions. SLSA attestations | Bypassed. Malicious versions had no provenance. Published via CLI | No gate rejects packages missing provenance from projects that previously had it
Catch malware before install | Socket, Snyk, Aikido automated scanning | Partial. Socket flagged in 6 min. First infections hit at 89 seconds | Detection-to-removal gap. Scanners catch it, registry removal takes hours
Block postinstall execution | --ignore-scripts recommended in CI/CD | Not enforced. npm runs postinstall by default. pnpm blocks by default; npm does not | postinstall remains primary malware vector in every major npm attack since 2024
Lock dependency versions | Lockfile enforcement via npm ci | Effective only if lockfile committed before compromise. Caret ranges auto-resolved | Caret ranges are npm default. Most projects auto-resolve to latest minor

What to do now at your enterprise

SOC leaders whose organizations run Node.js should treat this as an active incident until they confirm clean systems. The three-hour exposure window fell during peak development hours across Asia-Pacific time zones, and any CI/CD pipeline that ran npm install overnight could have pulled the compromised version automatically.

“The first priority is impact assessment: which builds and downstream consumers ingested the compromised package?” Baer said. “Then containment, patching, and finally, transparent reporting to leadership. What happened, what’s exposed, and what controls will prevent a repeat. Lessons from log4j and event-stream show speed and clarity matter as much as the fix itself.”

Check exposure. Search lockfiles and CI logs for axios@1.14.1, axios@0.30.4, or plain-crypto-js.
Pin to axios@1.14.0 or axios@0.30.3.

Assume compromise if hit. Rebuild affected machines from a known-good state. Rotate every accessible credential: npm tokens, AWS keys, SSH keys, cloud credentials, CI/CD secrets, .env values.

Block the C2. Add sfrclak.com and 142.11.206.73 to DNS blocklists and firewall rules.

Check for RAT artifacts. /Library/Caches/com.apple.act.mond on macOS. %PROGRAMDATA%\wt.exe on Windows. /tmp/ld.py on Linux. If found, perform a full rebuild.

Harden going forward. Enforce npm ci --ignore-scripts in CI/CD. Require lockfile-only installs. Reject packages missing provenance from projects that previously had it. Audit whether legacy tokens coexist with OIDC in your own publishing workflows.

The credential gap nobody closed

Three attacks in seven months. Each different in execution, identical in root cause. npm’s security model still treats individual maintainer accounts as the ultimate trust anchor. Those accounts remain vulnerable to credential hijacking, no matter how many layers get added downstream.

“AI spots risky packages, audits legacy auth, and speeds SOC response,” Baer said. “But humans still control maintainer credentials. We mitigate risk. We don’t eliminate it.”

Mandatory provenance attestation, where manual CLI publishing is disabled entirely, would have caught this attack before it reached the registry. So would mandatory multi-party signing, where no single maintainer can push a release alone. Neither is enforced today. npm has signaled that disabling tokens by default when trusted publishing is enabled is on the roadmap. Until it ships, every project running OIDC alongside a legacy token has the same blind spot axios had.

The axios maintainer did what the community asked. A legacy token nobody realized was still active undermined all of it.
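The "check exposure" step can be sketched as a short lockfile scan. This is a rough illustration under stated assumptions: it reads only the `packages` map that npm writes in lockfileVersion 2 and 3, it does not cover CI logs or other package managers' lockfiles, and the function names are invented for this example. The version numbers and package name are the ones listed in the article.

```python
import json

# Indicators taken from the article; extend as advisories are updated.
COMPROMISED_VERSIONS = {"axios": {"1.14.1", "0.30.4"}}
MALICIOUS_PACKAGES = {"plain-crypto-js"}


def find_hits(lockfile: dict) -> list:
    """Return (path, version) pairs in a parsed package-lock.json that match.

    Assumes the lockfileVersion 2/3 layout, where keys of "packages" look
    like "node_modules/axios" or "node_modules/a/node_modules/axios".
    """
    hits = []
    for path, meta in lockfile.get("packages", {}).items():
        # The package name is whatever follows the last "node_modules/".
        name = path.rpartition("node_modules/")[2]
        version = meta.get("version", "")
        if name in MALICIOUS_PACKAGES or version in COMPROMISED_VERSIONS.get(name, ()):
            hits.append((path, version))
    return sorted(hits)


def scan_file(lockfile_path: str) -> list:
    """Load a package-lock.json from disk and scan it."""
    with open(lockfile_path, encoding="utf-8") as handle:
        return find_hits(json.load(handle))
```

Calling `scan_file("package-lock.json")` in each repository returns an empty list when nothing matches. A non-empty result is a reason to follow the "assume compromise" steps above, not proof that the postinstall script actually ran.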

Meta's new structured prompting technique makes LLMs significantly better at code review — boosting accuracy to 93% in some cases

4/1/2026

Deploying AI agents for repository-scale tasks like bug detection, patch verification, and code review requires overcoming significant technical hurdles. One major bottleneck: the need to set up dynamic execution sandboxes for every repository, which are expensive and computationally heavy. Using large language model (LLM) reasoning instead of executing the code is rising in popularity to bypass this overhead, yet it frequently leads to unsupported guesses and hallucinations. To improve execution-free reasoning, researchers at Meta introduce "semi-formal reasoning," a structured prompting technique. This method requires the AI agent to fill out a logical certificate by explicitly stating premises, tracing concrete execution paths, and deriving formal conclusions before providing an answer. The structured format forces the agent to systematically gather evidence and follow function calls before drawing conclusions. This increases the accuracy of LLMs in coding tasks and significantly reduces errors in fault localization and codebase question-answering. For developers using LLMs in code review tasks, semi-formal reasoning enables highly reliable, execution-free semantic code analysis while drastically reducing the infrastructure costs of AI coding systems. Agentic code reasoning Agentic code reasoning is an AI agent's ability to navigate files, trace dependencies, and iteratively gather context to perform deep semantic analysis on a codebase without running the code. In enterprise AI applications, this capability is essential for scaling automated bug detection, comprehensive code reviews, and patch verification across complex repositories where relevant context spans multiple files. The industry currently tackles execution-free code verification through two primary approaches. The first involves unstructured LLM evaluators that try to verify code either directly or by training specialized LLMs as reward models to approximate test outcomes. 
The major drawback is their reliance on unstructured reasoning, which allows models to make confident claims about code behavior without explicit justification. Without structured constraints, it is difficult to ensure agents reason thoroughly rather than guess based on superficial patterns like function names. The second approach involves formal verification, which translates code or reasoning into formal mathematical languages like Lean, Coq, or Datalog to enable automated proof checking. While rigorous, formal methods require defining the semantics of the programming language. This is entirely impractical for arbitrary enterprise codebases that span multiple frameworks and languages. Existing approaches also tend to be highly fragmented and task-specific, often requiring entirely separate architectures or specialized training for each new problem domain. They lack the flexibility needed for broad, multi-purpose enterprise applications. How semi-formal reasoning works To bridge the gap between unstructured guessing and overly rigid mathematical proofs, the Meta researchers propose a structured prompting methodology, which they call “semi-formal reasoning.” This approach equips LLM agents with task-specific, structured reasoning templates. These templates function as mandatory logical certificates. To complete a task, the agent must explicitly state premises, trace execution paths for specific tests, and derive a formal conclusion based solely on verifiable evidence. The template forces the agent to gather proof from the codebase before making a judgment. The agent must actually follow function calls and data flows step-by-step rather than guessing their behavior based on surface-level naming conventions. This systematic evidence gathering helps the agent handle edge cases, such as confusing function names, and avoid making unsupported claims. 
Semi-formal reasoning in action The researchers evaluated semi-formal reasoning across three software engineering tasks: patch equivalence verification to determine if two patches yield identical test outcomes without running them, fault localization to pinpoint the exact lines of code causing a bug, and code question answering to test nuanced semantic understanding of complex codebases. The experiments used the Claude Opus-4.5 and Sonnet-4.5 models acting as autonomous verifier agents. The team compared their structured semi-formal approach against several baselines, including standard reasoning, where an agentic model is given a minimal prompt and allowed to explain its thinking freely in unstructured natural language. They also compared against traditional text-similarity algorithms like difflib. In patch equivalence, semi-formal reasoning improved accuracy on challenging, curated examples from 78% using standard reasoning to 88%. When evaluating real-world, agent-generated patches with test specifications available, the Opus-4.5 model using semi-formal reasoning achieved 93% verification accuracy, outperforming both the unstructured single-shot baseline at 86% and the difflib baseline at 73%. Other tasks showed similar gains across the board. The paper highlights the value of semi-formal reasoning through real-world examples. In one case, the agent evaluates two patches in the Python Django repository that attempt to fix a bug with 2-digit year formatting for years before 1000 CE. One patch uses a custom format() function within the library that overrides the standard function used in Python. Standard reasoning models look at these patches, assume format() refers to Python's standard built-in function, calculate that both approaches will yield the same string output, and incorrectly declare the patches equivalent. With semi-formal reasoning, the agent traces the execution path and checks method definitions. 
Following the structured template, the agent discovers that within one of the library’s files, the format() name is actually shadowed by a custom, module-level function. The agent formally proves that given the attributes of the input passed to the code, this patch will crash the system while the other will succeed. Based on their experiments, the researchers suggest that “LLM agents can perform meaningful semantic code analysis without execution, potentially reducing verification costs in RL training pipelines by avoiding expensive sandbox execution.” Caveats and tradeoffs While semi-formal reasoning offers substantial reliability improvements, enterprise developers must consider several practical caveats before adopting it. There is a clear compute and latency tradeoff. Semi-formal reasoning requires more API calls and tokens. In patch equivalence evaluations, semi-formal reasoning required roughly 2.8 times as many execution steps as standard unstructured reasoning. The technique also does not universally improve performance, particularly if a model is already highly proficient at a specific task. When researchers evaluated the Sonnet-4.5 model on a code question-answering benchmark, standard unstructured reasoning already achieved a high accuracy of around 85%. Applying the semi-formal template in this scenario yielded no additional gains. Furthermore, structured reasoning can produce highly confident wrong answers. Because the agent is forced to build elaborate, formal proof chains, it can become overly assured if its investigation is deep but incomplete. In one Python evaluation, the agent meticulously traced five different functions to uncover a valid edge case, but completely missed that a downstream piece of code already safely handled that exact scenario. Because it had built a strong evidence chain, it delivered an incorrect conclusion with extremely high confidence. 
The system's reliance on concrete evidence also breaks down when it hits the boundaries of a codebase. When analyzing third-party libraries where the underlying source code is unavailable, the agent will still resort to guessing behavior based on function names. And in some cases, despite strict prompt instructions, models will occasionally fail to fully trace concrete execution paths. Ultimately, while semi-formal reasoning drastically reduces unstructured guessing and hallucinations, it does not completely eliminate them. What developers should take away This technique can be used out-of-the-box, requiring no model training or special packaging. It is code-execution free, which means you do not need to add additional tools to your LLM environment. You pay more compute at inference time to get higher accuracy at code review tasks. The researchers suggest that structured agentic reasoning may offer “a flexible alternative to classical static analysis tools: rather than encoding analysis logic in specialized algorithms, we can prompt LLM agents with task-specific reasoning templates that generalize across languages and frameworks." The researchers have made the prompt templates available, allowing them to be readily implemented into your applications. While there is a lot of conversation about prompt engineering being dead, this technique shows how much performance you can still squeeze out of well-structured prompts.
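To make the "logical certificate" idea concrete, here is a minimal sketch of a structured template plus a completeness check. It is not the researchers' published template: the section names, prompt wording, and function names are assumptions chosen to mirror the three parts the article describes (premises, traced execution paths, formal conclusion).

```python
# Hypothetical section names; the paper's actual templates may differ.
SECTIONS = ("PREMISES", "EXECUTION TRACE", "CONCLUSION")


def build_prompt(task: str) -> str:
    """Wrap a task in a template that demands every certificate section."""
    lines = [
        task,
        "",
        "Answer by filling in every section below, in order.",
        "Cite the file and function for each claim; do not guess from names.",
        "",
    ]
    for name in SECTIONS:
        lines.append(f"{name}:")
        lines.append("")
    return "\n".join(lines)


def parse_certificate(response: str) -> dict:
    """Split a response into {section name: [non-blank lines]}."""
    sections = {}
    current = None
    for raw in response.splitlines():
        line = raw.strip()
        if line.endswith(":") and line[:-1] in SECTIONS:
            current = line[:-1]
            sections[current] = []
        elif current is not None and line:
            sections[current].append(line)
    return sections


def missing_sections(response: str) -> list:
    """Return the sections that are absent or empty in a response."""
    parsed = parse_certificate(response)
    return [name for name in SECTIONS if not parsed.get(name)]
```

A caller could reject or re-prompt any response for which `missing_sections` is non-empty. The check only enforces structure. As the caveats above note, a complete certificate can still reach a confident but wrong conclusion.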

OpenClaw has 500,000 instances and no enterprise kill switch

3/31/2026

“Your AI? It’s my AI now.” The line came from Etay Maor, VP of Threat Intelligence at Cato Networks, in an exclusive interview with VentureBeat at RSAC 2026 — and it describes exactly what happened to a U.K. CEO whose OpenClaw instance ended up for sale on BreachForums. Maor's argument is that the industry handed AI agents the kind of autonomy it would never extend to a human employee, discarding zero trust, least privilege, and assume-breach in the process. The proof arrived on BreachForums three weeks before Maor’s interview. On February 22, a threat actor using the handle “fluffyduck” posted a listing advertising root shell access to the CEO’s computer for $25,000 in Monero or Litecoin. The shell was not the selling point. The CEO’s OpenClaw AI personal assistant was. The buyer would get every conversation the CEO had with the AI, the company’s full production database, Telegram bot tokens, Trading 212 API keys, and personal details the CEO disclosed to the assistant about family and finances. The threat actor noted the CEO was actively interacting with OpenClaw in real time, making the listing a live intelligence feed rather than a static data dump. Cato CTRL senior security researcher Vitaly Simonovich documented the listing on February 25. The CEO’s OpenClaw instance stored everything in plain-text Markdown files under ~/.openclaw/workspace/ with no encryption at rest. The threat actor didn't need to exfiltrate anything; the CEO had already assembled it. When the security team discovered the breach, there was no native enterprise kill switch, no management console, and no way to inventory how many other instances were running across the organization. OpenClaw runs locally with direct access to the host machine’s file system, network connections, browser sessions, and installed applications. The coverage to date has tracked its velocity, but what it hasn't mapped is the threat surface. 
The four vendors who used RSAC 2026 to ship responses still haven't produced the one control enterprises need most: a native kill switch.

The threat surface by the numbers

Metric | Numbers | Source
Internet-facing instances | ~500,000 (March 24 live check) | Etay Maor, Cato Networks (exclusive RSAC 2026 interview)
Exposed instances with security risks | 30,000+ observed during scan window | Bitsight
Exploitable via known RCE | 15,200 instances | SecurityScorecard
High-severity CVEs | 3 (highest CVSS: 8.8) | NVD (24763, 25157, 25253)
Malicious skills on ClawHub | 341 in Koi audit (335 from ClawHavoc); 824 by mid-Feb | Koi
ClawHub skills with critical flaws | 13.4% of 3,984 analyzed | Snyk
API tokens exposed (Moltbook) | 1.5 million | Wiz

Maor ran a live Censys check during an exclusive VentureBeat interview at RSAC 2026. “The first week it came out, there were about 6,300 instances. Last week, I checked: 230,000 instances. Let’s check now… almost half a million. Almost doubled in one week,” Maor said.

Three high-severity CVEs define the attack surface: CVE-2026-24763 (CVSS 8.8, command injection via Docker PATH handling), CVE-2026-25157 (CVSS 7.7, OS command injection), and CVE-2026-25253 (CVSS 8.8, token exfiltration to full gateway compromise). All three CVEs have been patched, but OpenClaw has no enterprise management plane, no centralized patching mechanism, and no fleet-wide kill switch. Individual administrators must update each instance manually, and most have not.

The defender-side telemetry is just as alarming. CrowdStrike's Falcon sensors already detect more than 1,800 distinct AI applications across its customer fleet, from ChatGPT to Copilot to OpenClaw, generating around 160 million unique instances on enterprise endpoints. ClawHavoc, a malicious skill distributed through the ClawHub marketplace, became the primary case study in the OWASP Agentic Skills Top 10.
CrowdStrike CEO George Kurtz flagged it in his RSAC 2026 keynote as the first major supply chain attack on an AI agent ecosystem. AI agents got root access. Security got nothing. Maor framed the visibility failure through the OODA loop (observe, orient, decide, act) during the RSAC 2026 interview. Most organizations are failing at the first step: security teams can't see which AI tools are running on their networks, which means the productivity tools employees bring in quietly become shadow AI that attackers exploit. The BreachForums listing proved the end state. The CEO’s OpenClaw instance became a centralized intelligence hub with SSO sessions, credential stores, and communication history aggregated into one location. “The CEO’s assistant can be your assistant if you buy access to this computer,” Maor told VentureBeat. “It’s an assistant for the attacker.” Ghost agents amplify the exposure. Organizations adopt AI tools, run a pilot, lose interest, and move on — leaving agents running with credentials intact. “We need an HR view of agents. Onboarding, monitoring, offboarding. If there’s no business justification? Removal,” Maor told VentureBeat. “We’re not left with any ghost agents on our network, because that’s already happening.” Cisco moved toward an OpenClaw kill switch Cisco President and Chief Product Officer Jeetu Patel framed the stakes during an exclusive VentureBeat interview at RSAC 2026. “I think of them more like teenagers. They’re supremely intelligent, but they have no fear of consequence,” Patel said of AI agents. “The difference between delegating and trusted delegating of tasks to an agent … one of them leads to bankruptcy. The other one leads to market dominance.” Cisco launched three free, open-source security tools for OpenClaw at RSAC 2026. DefenseClaw packages Skills Scanner, MCP Scanner, AI BoM, and CodeGuard into a single open-source framework running inside NVIDIA’s OpenShell runtime, which NVIDIA launched at GTC the week before RSAC. 
“Every single time you actually activate an agent in an OpenShell container, you can now automatically instantiate all the security services that we have built through DefenseClaw,” Patel told VentureBeat. AI Defense Explorer Edition is a free, self-serve version of Cisco’s algorithmic red-teaming engine, testing any AI model or agent for prompt injection and jailbreaks across more than 200 risk subcategories. The LLM Security Leaderboard ranks foundation models by adversarial resilience rather than performance benchmarks. Cisco also shipped Duo Agentic Identity to register agents as identity objects with time-bound permissions, Identity Intelligence to discover shadow agents through network monitoring, and the Agent Runtime SDK to embed policy enforcement at build time.

Palo Alto made agentic endpoints a security category of their own

Palo Alto Networks CEO Nikesh Arora characterized OpenClaw-class tools as creating a new supply chain running through unregulated, unsecured marketplaces during an exclusive March 18 pre-RSAC briefing with VentureBeat. Koi found 341 malicious skills on ClawHub in its initial audit, with the total growing to 824 as the registry expanded. Snyk found 13.4% of analyzed skills contained critical security flaws. Palo Alto Networks built Prisma AIRS 3.0 around a new agentic registry that requires every agent to be logged before operating, with credential validation, MCP gateway traffic control, agent red-teaming, and runtime monitoring for memory poisoning. The pending Koi acquisition adds supply chain visibility specifically for agentic endpoints.

Cato CTRL delivered the adversarial proof

Cato Networks’ threat intelligence arm Cato CTRL presented two sessions at RSAC 2026. The 2026 Cato CTRL Threat Report, published separately, includes a proof-of-concept “Living Off AI” attack targeting Atlassian’s MCP and Jira Service Management.
Maor’s research provides the independent adversarial validation that vendor product announcements cannot deliver on their own. The platform vendors are building governance for sanctioned agents. Cato CTRL documented what happens when the unsanctioned agent on the CEO’s laptop gets sold on the dark web.

Monday morning action list

Regardless of vendor stack, four controls apply immediately: bind OpenClaw to localhost only and block external port exposure, enforce application allowlisting through MDM to prevent unauthorized installations, rotate every credential on machines where OpenClaw has been running, and apply least-privilege access to any account an AI agent has touched.

Discover the install base. CrowdStrike’s Falcon sensor, Cato’s SASE platform, and Cisco Identity Intelligence all detect shadow AI. For teams without premium tooling, query endpoints for the ~/.openclaw/ directory using native EDR or MDM file-search policies. If the enterprise has no endpoint visibility at all, run Shodan and Censys queries against corporate IP ranges.

Patch or isolate. Check every discovered instance against CVE-2026-24763, CVE-2026-25157, and CVE-2026-25253. Instances that cannot be patched should be network-isolated. There is no fleet-wide patching mechanism.

Audit skill installations. Review installed skills against Cisco’s Skills Scanner or the Snyk and Koi research. Any skill from an unverified source should be removed immediately.

Enforce DLP and ZTNA controls. Cato’s ZTNA controls restrict unapproved AI applications. Cisco Secure Access SSE enforces policy on MCP tool calls. Palo Alto’s Prisma Access Browser controls data flow at the browser layer.

Kill ghost agents. Build a registry of every AI agent running. Document business justification, human owner, credentials held, and systems accessed. Revoke credentials for agents with no justification. Repeat weekly.

Deploy DefenseClaw for sanctioned use. Run OpenClaw inside NVIDIA’s OpenShell runtime with Cisco’s DefenseClaw to scan skills, verify MCP servers, and instrument runtime behavior automatically.

Red-team before deploying. Use Cisco AI Defense Explorer Edition (free) or Palo Alto Networks’ agent red-teaming in Prisma AIRS 3.0. Test the workflow, not just the model.

The OWASP Agentic Skills Top 10, published using ClawHavoc as its primary case study, provides a standards-grade framework for evaluating these risks.

Four vendors shipped responses at RSAC 2026. None of them is a native enterprise kill switch for unsanctioned OpenClaw deployments. Until one exists, the Monday morning action list above is the closest thing to one.
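The "kill ghost agents" step amounts to keeping a small registry and acting on the entries that lack a business justification. The sketch below assumes a hypothetical in-memory `AgentRecord` structure with the four fields the article lists. It only computes what to revoke; the revocation itself depends on whichever identity or secrets system holds the credentials.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    """One row in a hypothetical agent registry."""
    name: str
    owner: str            # human owner
    justification: str    # business justification; empty if none
    credentials: list = field(default_factory=list)
    systems: list = field(default_factory=list)


def ghost_agents(registry: list) -> list:
    """Agents with no documented business justification."""
    return [agent for agent in registry if not agent.justification.strip()]


def credentials_to_revoke(registry: list) -> list:
    """Deduplicated, sorted credentials held by ghost agents."""
    return sorted({cred for agent in ghost_agents(registry) for cred in agent.credentials})
```

Rebuilding the registry and re-running `credentials_to_revoke` on a schedule matches the article's "repeat weekly" guidance. Agents that never made it into the registry still have to be found through the discovery step first.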

Slack adds 30 AI features to Slackbot, its most ambitious update since the Salesforce acquisition

3/31/2026

Slack today announced more than 30 new capabilities for Slackbot, its AI-powered personal agent, in what amounts to the most sweeping overhaul of the workplace messaging platform since Salesforce acquired it for $27.7 billion in 2021. The update transforms Slackbot from a simple conversational assistant into a full-spectrum enterprise agent that can take meeting notes across any video provider, operate outside the Slack application on users' desktops, execute tasks through third-party tools via the Model Context Protocol (MCP), and even serve as a lightweight CRM for small businesses — all without requiring users to install anything new. The announcement, timed to a keynote event that Salesforce CEO Marc Benioff is headlining Tuesday morning, arrives less than three months after Slackbot first became generally available on January 13 to Business+ and Enterprise+ subscribers. In that short window, Slack says the feature is on track to become the fastest-adopted product in Salesforce's 27-year history, with some employees at customer organizations reporting they save up to 90 minutes per day. Inside Salesforce itself, teams claim savings of up to 20 hours per week, translating to more than $6.4 million in estimated productivity value. "Slackbot is smart. It's pleasant, and I think it's endlessly useful," Rob Seaman, Slack's interim CEO and former chief product officer, told VentureBeat in an exclusive interview ahead of the announcement. "The upper bound of use cases is effectively limitless for it." The release signals Slack's clearest bid yet to become what Seaman and the company's leadership describe as an "agentic operating system" — a single surface through which workers interact with AI agents, enterprise applications, and one another. It also marks a direct challenge to Microsoft, which has spent the past two years embedding its Copilot assistant across the entirety of its productivity stack. 
From simple chatbot to autonomous coworker: six new capabilities that redefine what Slackbot can do The features announced Tuesday organize around several major capability areas, each designed to push Slackbot well beyond the role of a chatbot and into something closer to an autonomous digital coworker. The most foundational may be what Slack is calling AI-Skills — reusable instruction sets that define the inputs, the steps, and the exact output format for a given task. Any team can build a skill once and deploy it on demand. Slackbot ships with a built-in library for common workflows, but users can also create their own. Critically, Slackbot can recognize when a user's prompt matches an existing skill and apply it automatically, without being explicitly told to do so. "Think of these as topics or instructions — basically instructions for Slackbot to perform a repeat task that the user might want to do, that they can share with others, or a company might be able to set up for their whole company," Seaman explained. Deep research mode gives Slackbot the ability to conduct extended, multi-step investigations that take approximately four minutes to complete — a significant departure from the instant-response paradigm of most enterprise chatbots. Slack chose not to demonstrate this feature on stage at the keynote, Seaman said, precisely because its value lies in depth, not speed. MCP client integration, meanwhile, allows Slackbot to make tool calls into external systems through the Model Context Protocol, meaning it can now create Google Slides, draft Google Docs, and interact with the more than 2,600 apps in the Slack Marketplace and the 6,000-plus apps built over two decades for the Salesforce AppExchange. "We're going all in on MCP for Slackbot," Seaman said. "MCP clients and MCP servers are becoming very mature." 
Meeting intelligence allows Slackbot to listen to any meeting — not just Slack huddles, but calls on Zoom, Google Meet, or any other provider — by tapping into the user's local audio through the desktop application. It captures discussions, summarizes decisions, surfaces action items, and because Slackbot is natively connected to Salesforce, it can log actions and update opportunities directly in the CRM. Slackbot on Desktop extends the agent outside the Slack container entirely, while voice mode adds text-to-speech and speech-to-text capabilities, with full speech-to-speech functionality under active development. How Anthropic's Claude powers Slackbot — and why keeping it affordable is the hardest part Slackbot is built on Anthropic's Claude model, a detail Seaman confirmed ahead of the keynote, where Anthropic's leadership will appear alongside Slack executives on stage. The partnership underscores the deepening relationship between the two companies: Anthropic's technology powers the reasoning layer, while Slack's "context engineering" — the process of determining exactly which information from a user's channels, files, and messages should be fed into the model's context window — determines the quality and relevance of every response. Managing the cost of that reasoning at enterprise scale is one of the most significant technical and financial challenges the team faces. Slackbot is included in Business+ and Enterprise+ plans at no additional consumption charge — a deliberate strategic choice that places the burden of cost optimization squarely on Slack's engineering team rather than on customers. "A lot of what we've done is in the context engineering phase, working really closely with Anthropic to make sure that we're optimizing the RAG phase, optimizing our system prompts and everything, to make sure we're getting the right amount of context into the context window and not obviously making fiscally irresponsible decisions for ourselves," Seaman said. 
Starting in April, Slackbot will also become available in a limited sampling capacity to users on Slack's free and Pro plans — a move designed to drive conversion up the pricing tiers. Desktop AI and meeting transcription are powerful, but they raise hard questions about workplace surveillance The extension of Slackbot beyond the Slack application window — particularly its ability to listen to meetings and view screen content — raises immediate questions about employee surveillance, especially in large enterprise environments where tens of thousands of workers may be subject to company-wide IT policies. Seaman was emphatic that every capability is user-initiated and opt-in. Slackbot cannot listen to audio unless the user explicitly tells it to take meeting notes. It cannot view the desktop autonomously; in its current form, users must manually capture and share screenshots. And it inherits every permission the organization has already established in Slack. "Everything is user opt-in. That's a key tenet of Slack," Seaman said. "It's not rogue looking at your desktop or autonomously looking at your desktop. It's very important to us, and very important to our enterprise customers." On Slackbot's memory feature — which allows it to learn user preferences and habits over time — Seaman said the company has no plans to make that data available to administrators. Users can flush their stored preferences at any time simply by telling Slackbot to do so. Slack's native CRM is a Trojan horse designed to capture startups before they outgrow it Among the most important features in Tuesday's release is a native CRM built directly into Slack, targeting small businesses that haven't yet adopted a dedicated customer relationship management system. The logic is straightforward: small companies typically adopt Slack early in their lifecycle, often on the free tier, and their customer conversations already happen in channels and direct messages. 
Slack's native CRM reads those channels, understands the conversations, and automatically keeps deals, contacts, and call notes up to date. When companies are ready to scale, every record is already connected to Salesforce — no migrations, no starting over. "The hypothesis is that along the way, companies are effectively going to have moments where a CRM might matter," Seaman said. "Our goal is to make it available to them as a default, so as they are starting their company and their company is growing, it's just right there for them. They don't have to think about going off and procuring another tool." The feature also represents a response to a growing competitive threat. As the Wall Street Journal reported earlier this year, a wave of startups and individual developers have begun "vibe coding" their own lightweight CRMs, emboldened by the capabilities of large language models. By embedding CRM directly into Slack — the tool many of those same startups already depend on — Salesforce aims to make the procurement of a separate system unnecessary. Slack says it has a context advantage over Microsoft and Google — but can it last? The announcements arrive at a moment of intense competitive pressure. Microsoft has integrated Copilot across its entire productivity suite, giving it a distribution advantage that reaches into virtually every Fortune 500 company. Google has been similarly aggressive with Gemini across Workspace. And standalone AI tools from OpenAI to Anthropic threaten to fragment the enterprise AI experience. Seaman took a measured approach when asked directly about competitive positioning, invoking a mantra he said Slack uses internally: "We are competitor aware, but customer obsessed." "I think there are two things that really stand out. One, we have a context advantage — if you look at the way people use Slack, they love it. They use it so much, constantly communicating with their colleagues, openly thinking, working in public project channels. 
Two is the user experience. We focus so much on how our product feels in people's hands." That context advantage is real but not guaranteed. Slack's strength lies in the richness and volume of conversational data flowing through its channels — data that, when fed into an AI model, can produce responses with a degree of organizational awareness that competitors struggle to match. But Microsoft's Teams captures similar conversational data, and its deep integration with Windows, Office, and Azure gives it a systems-level advantage that Slack, operating as a single application, cannot easily replicate. Starting this summer, every new Salesforce customer will receive Slack automatically provisioned and AI-powered from day one — a bundling play that ensures the messaging platform reaches the broadest possible enterprise audience. Salesforce reported $41.5 billion in revenue for fiscal year 2026, up 10% year-over-year, with Agentforce ARR reaching $800 million. But Wall Street has remained skeptical about whether AI will ultimately erode demand for traditional enterprise software, and Salesforce's stock has underperformed the broader Nasdaq over the past year. More Slack users in more organizations gives AI-driven features more surface area to prove their value. Slack's biggest bet is that it can do everything without losing the simplicity that made it beloved Tuesday's launch is the first major product release under Seaman's leadership. He assumed the interim CEO role after former Slack CEO Denise Dresser departed in December 2025 to become OpenAI's first chief revenue officer — a move that signaled even Salesforce's own executives felt the gravitational pull of frontier AI companies. The overarching thesis embedded in the announcement — that Slack is evolving from a messaging platform into an operating system for AI agents — is as risky as it is ambitious. 
"One of the fundamental tenets of an operating system is that it obscures the complexity of the hardware from the end user," Seaman said. "There are thousands of apps and agents out there, and that can be overwhelming. I think that's our job — to be the OS that obscures that complexity, so you just use it like it's a communication tool." When asked whether Slack risks losing its simplicity by trying to do everything, Seaman didn't flinch. "There's absolutely a risk," he said. "That's what keeps us up at night." It's a remarkably candid admission from the leader of a platform that just launched 30 new features in a single day. The company that won the hearts of millions of workers with playful emoji reactions and frictionless messaging is now betting its future on meeting transcription, CRM pipelines, desktop agents, and enterprise orchestration. Whether Slack can absorb all of that ambition without losing the thing that made people love it in the first place isn't just a product question — it's the $27.7 billion question that Salesforce is still trying to answer.

Claude Code's source code appears to have leaked: here's what we know

3/31/2026

Anthropic appears to have accidentally revealed the inner workings of one of its most popular and lucrative AI products, the agentic AI harness Claude Code, to the public. A 59.8 MB JavaScript source map file (.map), intended for internal debugging, was inadvertently included in version 2.1.88 of the @anthropic-ai/claude-code package, which was pushed live on the public npm registry earlier this morning. By 4:23 am ET, Chaofan Shou (@Fried_rice), an intern at Solayer Labs, had broadcast the discovery on X (formerly Twitter). The post, which included a direct download link to a hosted archive, acted as a digital flare. Within hours, the ~512,000-line TypeScript codebase was mirrored across GitHub and analyzed by thousands of developers. For Anthropic, a company currently riding a meteoric rise with a reported $19 billion annualized revenue run-rate as of March 2026, the leak is more than a security lapse; it is a strategic hemorrhage of intellectual property. The timing is particularly critical given the commercial velocity of the product. Market data indicates that Claude Code alone has achieved an annualized recurring revenue (ARR) of $2.5 billion, a figure that has more than doubled since the beginning of the year. With enterprise adoption accounting for 80% of its revenue, the leak provides competitors, from established giants to nimble rivals like Cursor, a literal blueprint for how to build a high-agency, reliable, and commercially viable AI agent. Anthropic confirmed the leak in a spokesperson’s e-mailed statement to VentureBeat, which reads: “Earlier today, a Claude Code release included some internal source code. No sensitive customer data or credentials were involved or exposed. This was a release packaging issue caused by human error, not a security breach. 
We're rolling out measures to prevent this from happening again.” The anatomy of agentic memory The most significant takeaway for competitors lies in how Anthropic solved "context entropy"—the tendency for AI agents to become confused or hallucinatory as long-running sessions grow in complexity. The leaked source reveals a sophisticated, three-layer memory architecture that moves away from traditional "store-everything" retrieval. As analyzed by developers like @himanshustwts, the architecture utilizes a "Self-Healing Memory" system. At its core is MEMORY.md, a lightweight index of pointers (~150 characters per line) that is perpetually loaded into the context. This index does not store data; it stores locations. Actual project knowledge is distributed across "topic files" fetched on-demand, while raw transcripts are never fully read back into the context, but merely "grep’d" for specific identifiers. This "Strict Write Discipline"—where the agent must update its index only after a successful file write—prevents the model from polluting its context with failed attempts. For competitors, the "blueprint" is clear: build a skeptical memory. The code confirms that Anthropic’s agents are instructed to treat their own memory as a "hint," requiring the model to verify facts against the actual codebase before proceeding. KAIROS and the autonomous daemon The leak also pulls back the curtain on "KAIROS," the Ancient Greek concept of "at the right time," a feature flag mentioned over 150 times in the source. KAIROS represents a fundamental shift in user experience: an autonomous daemon mode. While current AI tools are largely reactive, KAIROS allows Claude Code to operate as an always-on background agent. It handles background sessions and employs a process called autoDream. In this mode, the agent performs "memory consolidation" while the user is idle. 
The autoDream logic merges disparate observations, removes logical contradictions, and converts vague insights into absolute facts. This background maintenance ensures that when the user returns, the agent’s context is clean and highly relevant. The implementation of a forked subagent to run these tasks reveals a mature engineering approach to preventing the main agent’s "train of thought" from being corrupted by its own maintenance routines. Unreleased internal models and performance metrics The source code provides a rare look at Anthropic’s internal model roadmap and the struggles of frontier development. The leak confirms that Capybara is the internal codename for a Claude 4.6 variant, with Fennec mapping to Opus 4.6 and the unreleased Numbat still in testing. Internal comments reveal that Anthropic is already iterating on Capybara v8, yet the model still faces significant hurdles. The code notes a 29-30% false claims rate in v8, an actual regression compared to the 16.7% rate seen in v4. Developers also noted an "assertiveness counterweight" designed to prevent the model from becoming too aggressive in its refactors. For competitors, these metrics are invaluable; they provide a benchmark of the "ceiling" for current agentic performance and highlight the specific weaknesses (over-commenting, false claims) that Anthropic is still struggling to solve. "Undercover" Claude Perhaps the most discussed technical detail is the "Undercover Mode." This feature reveals that Anthropic uses Claude Code for "stealth" contributions to public open-source repositories. The system prompt discovered in the leak explicitly warns the model: "You are operating UNDERCOVER... Your commit messages... MUST NOT contain ANY Anthropic-internal information. Do not blow your cover." While Anthropic may use this for internal "dog-fooding," it provides a technical framework for any organization wishing to use AI agents for public-facing work without disclosure. 
The logic ensures that no model names (like "Tengu" or "Capybara") or AI attributions leak into public git logs—a capability that enterprise competitors will likely view as a mandatory feature for their own corporate clients who value anonymity in AI-assisted development. The fallout has just begun The "blueprint" is now out, and it reveals that Claude Code is not just a wrapper around a Large Language Model, but a complex, multi-threaded operating system for software engineering. Even the hidden "Buddy" system—a Tamagotchi-style terminal pet with stats like CHAOS and SNARK—shows that Anthropic is building "personality" into the product to increase user stickiness. For the wider AI market, the leak effectively levels the playing field for agentic orchestration. Competitors can now study Anthropic’s 2,500+ lines of bash validation logic and its tiered memory structures to build "Claude-like" agents with a fraction of the R&D budget. As the "Capybara" has left the lab, the race to build the next generation of autonomous agents has just received an unplanned, $2.5 billion boost in collective intelligence. What Claude Code users and enterprise customers should do now about the alleged leak While the source code leak itself is a major blow to Anthropic’s intellectual property, it poses a specific, heightened security risk for you as a user. By exposing the "blueprints" of Claude Code, Anthropic has handed a roadmap to researchers and bad actors who are now actively looking for ways to bypass security guardrails and permission prompts. Because the leak revealed the exact orchestration logic for Hooks and MCP servers, attackers can now design malicious repositories specifically tailored to "trick" Claude Code into running background commands or exfiltrating data before you ever see a trust prompt. The most immediate danger, however, is a concurrent, separate supply-chain attack on the axios npm package, which occurred hours before the leak. 
If you installed or updated Claude Code via npm on March 31, 2026, between 00:21 and 03:29 UTC, you may have inadvertently pulled in a malicious version of axios (1.14.1 or 0.30.4) that contains a Remote Access Trojan (RAT). You should immediately search your project lockfiles (package-lock.json, yarn.lock, or bun.lockb) for these specific versions or the dependency plain-crypto-js. If found, treat the host machine as fully compromised, rotate all secrets, and perform a clean OS reinstallation. To mitigate future risks, you should migrate away from the npm-based installation entirely. Anthropic has designated the Native Installer (curl -fsSL https://claude.ai/install.sh | bash) as the recommended method because it uses a standalone binary that does not rely on the volatile npm dependency chain. The native version also supports background auto-updates, ensuring you receive security patches (likely version 2.1.89 or higher) the moment they are released. If you must remain on npm, ensure you have uninstalled the leaked version 2.1.88 and pinned your installation to a verified safe version like 2.1.86. Finally, adopt a zero trust posture when using Claude Code in unfamiliar environments. Avoid running the agent inside freshly cloned or untrusted repositories until you have manually inspected the .claude/config.json and any custom hooks. As a defense-in-depth measure, rotate your Anthropic API keys via the developer console and monitor your usage for any anomalies. While your cloud-stored data remains secure, the vulnerability of your local environment has increased now that the agent's internal defenses are public knowledge; staying on the official, native-installed update track is your best defense.
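As a rough illustration of the lockfile check described above, the following Python sketch scans lockfiles for the two axios versions and the plain-crypto-js dependency named in this article. The function names and the regular expression are assumptions made for illustration, not an official Anthropic or npm tool, and the pattern only covers common package-lock.json and yarn.lock layouts. Because bun.lockb is a binary format, the sketch falls back to a lossy text decode there, so treat a clean result on that file as inconclusive.

```python
import re
from pathlib import Path

# Versions and dependency name come from the article. The regex is an
# illustrative assumption about typical lockfile layouts, where an "axios"
# entry line is followed directly by its version line.
BAD_AXIOS = re.compile(
    r'(?<![\w-])axios(?![\w-])[^\n]*\n\s*"?version"?:?\s+'
    r'"?(1\.14\.1|0\.30\.4)"?(?![\d.])'
)
BAD_DEPENDENCY = "plain-crypto-js"


def find_indicators(lockfile_text: str) -> list[str]:
    """Return a sorted list of suspicious entries found in lockfile text."""
    found = {f"axios@{m.group(1)}" for m in BAD_AXIOS.finditer(lockfile_text)}
    if BAD_DEPENDENCY in lockfile_text:
        found.add(BAD_DEPENDENCY)
    return sorted(found)


def scan_project(root: str) -> dict[str, list[str]]:
    """Check every lockfile under root; map file path to indicators found."""
    results: dict[str, list[str]] = {}
    for name in ("package-lock.json", "yarn.lock", "bun.lockb"):
        for path in Path(root).rglob(name):
            # bun.lockb is binary, so undecodable bytes are simply dropped.
            text = path.read_bytes().decode("utf-8", errors="ignore")
            hits = find_indicators(text)
            if hits:
                results[str(path)] = hits
    return results
```

Calling `scan_project(".")` from a project root returns an empty dictionary when nothing matches. Any hit should be handled as the article advises: treat the machine as compromised rather than simply removing the package.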

Imagine if your Teams or Slack messages automatically turned into secure context for your AI agents — PromptQL built it

3/31/2026

For the modern enterprise, the digital workspace risks descending into "coordination theater," in which teams spend more time discussing work than executing it. While traditional tools like Slack or Teams excel at rapid communication, they have structurally failed to serve as a reliable foundation for AI agents. A Hacker News thread calling on OpenAI to build its own version of Slack to help empower AI agents went viral in February 2026, amassing 327 comments. That's because agents often lack the real-time context and secure data access required to be truly useful, which results in "hallucinations" or repetitive re-explaining of codebase conventions. PromptQL, a spin-off from the GraphQL unicorn Hasura, is addressing this by pivoting from an AI data tool into a comprehensive, AI-native workspace designed to turn casual, regular team interactions into a persistent, secure memory for agentic workflows. Rather than being left by the wayside for users and agents to hunt down again later, these conversations are distilled and stored as actionable, proprietary data in an organized format, an internal wiki, that the company can rely on going forward and that humans approve and edit as needed. Imagine two colleagues messaging about a bug that needs to be fixed: instead of someone manually assigning it to an engineer or agent, the messaging platform automatically tags it, assigns it and documents it all in the wiki with one click. Now do this for every issue or topic of discussion that takes place in your enterprise, and you'll have an idea of what PromptQL is attempting. The idea is a simple but powerful one: turning the conversation that necessarily precedes work into an actual assignment that is automatically started by your own messaging system. “We don’t have conversations about work anymore," CEO Tanmai Gopal said in a recent video call interview with VentureBeat. 
"You actually have conversations that do the work.” Originally positioned as an AI data analyst, the company—a spin-off from the GraphQL unicorn Hasura—is pivoting into a full-scale AI-native workspace. It isn't just "Slack with a chatbot"; it is a fundamental re-architecting of how teams interact with their data, their tools, and each other. “PromptQL is this workhorse in the background, this 24/7 intern that’s continuously cranking out the actual work—looking at code, confirming hypotheses, going to multiple places, actually doing the work," Gopal said. Technology: messages that automatically turn into a shared, continuously updated context engine The technical soul of PromptQL is its Shared Wiki. Traditional LLMs suffer from a "memory" problem; they forget previous interactions or hallucinate based on outdated training data. PromptQL solves this by capturing "shared context" as teams work. When an engineer fixes a bug or a marketer defines a "recycled lead," they aren't just typing into a void. They are teaching a living, internal Wikipedia. This wiki doesn't require "documentation sprints" or manual YAML file updates; it accumulates context organically. “Throughout every single conversation, you are teaching PromptQL, and that is going into this wiki that is being developed over time. This is our entire company’s knowledge gradually coming together.” Interconnectivity: Much like cells in a Petri dish, small "islands" of knowledge—say, a Salesforce integration—eventually bridge to other islands, like product usage data in Snowflake. Human-in-the-Loop: To prevent the AI from learning "junk" (like a reminder about a doctor's appointment from 2024), humans must explicitly "Add to Wiki" to canonize a fact. The Virtual Data Layer: Unlike traditional platforms that require data replication, PromptQL uses a virtual SQL layer. 
It queries your data in place across databases (Snowflake, ClickHouse, Postgres) and SaaS tools (Stripe, Zendesk, HubSpot), ensuring that nothing is ever extracted or cached.
PromptQL is designed to be a highly integrable orchestration layer that supports both leading AI model providers and a vast ecosystem of existing enterprise tools.
AI Model Support: The platform allows users to delegate tasks to specific coding agents such as Claude Code and Cursor, or use custom agents built for specific internal needs.
Workflow Compatibility: The system is built to inherit context from existing team tools, enabling AI agents to understand codebase conventions or deployment patterns from your existing infrastructure without manual re-explanation.

From chatting to doing

The PromptQL interface looks familiar—threads, channels, and mentions—but the functionality is transformative. In a demonstration, an engineer identifies a failing checkout in a #eng-bugs channel. Instead of tagging a human SRE, they delegate to Claude Code via PromptQL. The agent doesn't just look at the code; it inherits the team's shared context. It knows, for instance, that "EU payments switched to Adyen on Jan 15" because that fact was added to the wiki weeks prior. Within minutes, the AI identifies a currency mismatch, pushes a fix, opens a PR, and updates the wiki for future reference. This "multiplayer" AI approach is what sets the platform apart. It allows a non-technical manager to ask, "Which accounts have growing Stripe billing but flat Mixpanel usage?" and receive a joined table of data pulled from two disparate sources instantly. The user can then schedule a recurring Slack DM of those results with a single follow-up command. Users also don't need to think about the integrity or cleanliness of their data, because PromptQL handles it for them: “Connect all data in whatever state of shittiness it is, and let shared context build up on the fly as you use it," Gopal said. 
Highly secure

For Fortune 500 companies like McDonald's and Cisco, "just connect your data" is a terrifying sentence. PromptQL addresses this with fine-grained access control. The system enforces attribute-based policies at the infrastructure level. If a Regional Ops Manager asks for vendor rates across all regions, the AI will redact columns or rows they aren't authorized to see, even if the LLM "knows" the answer. Furthermore, any high-stakes action, like updating 38 payment statuses in NetSuite, requires a human "Approve/Deny" sign-off before execution.

Licensing and pricing

In a departure from the "per-seat" SaaS status quo, PromptQL is entirely consumption-based.
Pricing: The company uses "Operational Language Units" (OLUs).
Philosophy: Gopal argues that charging per seat penalizes companies for onboarding their whole team. By charging for the value created (the OLU), PromptQL encourages users to connect "everyone and everything."
Enterprise Storage: While smaller teams use dedicated accounts, enterprise customers get a dedicated VPC. Any data the AI "saves" (like a custom to-do list) is stored in the customer's own S3 bucket using the Iceberg format, ensuring total data sovereignty.
"Philosophically, we want you to connect everyone and everything [to PromptQL], so we don’t penalize that," Gopal said. "We just price based on consumption.”

Why it matters now for enterprises

So, is PromptQL a Teams or Slack killer? According to Gopal, the answer is yes: “That is what has happened for us. We’ve shut down our internal Slack for internal comms entirely," he said. The launch comes at a pivot point for the industry. Companies are realizing that "chatting with a PDF" isn't enough. They need AI that can act, but they can't afford the security risks of "unsupervised" agents. 
By building a workspace that prioritizes shared context and human-in-the-loop verification, PromptQL is offering a middle ground: an AI that learns like a teammate and executes like an intern, all while staying within the guardrails of enterprise security. For enterprises focused on making AI work at scale, PromptQL addresses the critical "how" of implementation by providing the orchestration and operational layer needed to deploy agentic systems. By replacing the "coordination theater" of traditional chat tools with a workspace where AI agents have the same permissions and context as human teammates, it enables seamless multi-agent coordination and task-routing. This allows decision-makers to move beyond simple model selection to a reality where agents—such as Claude Code—use shared team context to execute complex workflows, like fixing production bugs or updating CRM records, directly within active threads. From a data infrastructure perspective, the platform simplifies the management of real-time pipelines and RAG-ready architectures by utilizing a virtual SQL layer that queries data "in place". This eliminates the need for expensive, time-consuming data preparation and replication sprints across hundreds of thousands of tables in databases like Snowflake or Postgres. Furthermore, the system’s "Shared Wiki" serves as a superior alternative to standard vector databases or prompt-based memory, capturing tribal knowledge organically and creating a living metadata store that informs every AI interaction with company-specific reasoning. Finally, PromptQL addresses the security governance required for modern AI stacks by enforcing fine-grained, attribute-based access control and role-based permissions. Through human-in-the-loop verification, it ensures that high-stakes actions and data mutations are held for explicit approval, protecting against model misuse and unauthorized data leakage. 
While it does not assist with physical infrastructure tasks such as GPU cluster optimization or hardware procurement, it provides the necessary software guardrails and auditability to ensure that agentic workflows remain compliant with enterprise standards like SOC 2, HIPAA, and GDPR.
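To make the access-control and approval behavior described in this article more concrete, here is a minimal Python sketch of attribute-based redaction combined with a human sign-off gate. It is not PromptQL's implementation: the `User` and `PendingAction` classes, the field names, and the "blocked"/"executed" return values are all hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    """Hypothetical user record carrying the attributes a policy checks."""
    name: str
    region: str
    allowed_columns: frozenset


def redact_rows(rows: list, user: User) -> list:
    """Drop rows outside the user's region and columns they may not see."""
    visible = []
    for row in rows:
        if row.get("region") != user.region:
            continue  # row-level filter: other regions are hidden entirely
        # Column-level filter: keep only explicitly permitted fields.
        visible.append({k: v for k, v in row.items() if k in user.allowed_columns})
    return visible


@dataclass
class PendingAction:
    """A high-stakes action held until a human approves or denies it."""
    description: str
    approved: Optional[bool] = None  # None means no decision has been made


def execute(action: PendingAction) -> str:
    """Run the action only after an explicit human approval."""
    if action.approved is not True:
        return "blocked"
    return "executed"
```

The point of the sketch is that filtering happens before any result reaches the model's answer, and that an undecided action is treated the same as a denied one.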

Softr launches AI-native platform to help nontechnical teams build business apps without code

3/31/2026

Softr, the Berlin-based no-code platform used by more than one million builders and 7,000 organizations including Netflix, Google, and Stripe, today launched what it calls an AI-native platform — a bet that the explosive growth of AI-powered app creation tools has produced a market full of impressive demos but very little production-ready business software. The company's new AI Co-Builder lets non-technical users describe in plain language the software they need, and the platform generates a fully integrated system — database, user interface, permissions, and business logic included — connected and ready for real-world deployment immediately. The move marks a fundamental evolution for a company that spent five years building a no-code business before layering AI on top of what it describes as a proven infrastructure of constrained, pre-built building blocks. "Most AI app-builders stop at the shiny demo stage," Softr Co-Founder and CEO Mariam Hakobyan told VentureBeat in an exclusive interview ahead of the launch. "A lot of the time, people generate calculators, landing pages, and websites — and there are a huge number of use cases for those. But there is no actual business application builder, which has completely different needs." The announcement arrives at a moment when the AI app-building market finds itself at an inflection point. A wave of so-called "vibe coding" platforms — tools like Lovable, Bolt, and Replit that generate application code from natural language prompts — have captured developer mindshare and venture capital over the past 18 months. But Hakobyan argues those tools fundamentally misserve the audience Softr is chasing: the estimated billions of non-technical business users inside companies who need custom operational software but lack the skills to maintain AI-generated code when it inevitably breaks. 
Why AI-generated app prototypes keep failing when real business data is involved The core tension Softr is trying to resolve is one that has plagued the AI app-building category since its inception: the gap between what looks good in a demo and what actually works when real users, real data, and real security requirements enter the picture. Business software — client portals, CRMs, internal operational tools, inventory management systems — requires authentication, role-based permissions, database integrity, and workflow automation that must function reliably every single time. When an AI-generated prototype fails in these areas, fixing it typically requires a developer, which defeats the purpose of the no-code promise entirely. "One prompt might break 10 previous steps that you've already completed," Hakobyan said, describing the experience non-technical users face on vibe coding platforms. "You keep prompting, keep trying to fix errors that the AI generated, and you end up maintaining something you didn't even sign up for in the first place." This critique targets a real structural limitation in how many AI app builders work today. Platforms that fully rely on AI to generate application code from scratch leave users with a codebase they cannot read, debug, or maintain without technical expertise. To connect those generated apps to real databases, login systems, or third-party services, users often must integrate tools like Supabase and make API calls — tasks that effectively require them to become developers. Softr's position is that these platforms have replaced one form of coding with another, swapping programming languages for English-language prompts that carry all the same fragility. 
How Softr's building block architecture avoids the hallucination problem that plagues AI code generators Rather than generating raw code, Softr's platform uses what Hakobyan describes as "proven and structured building blocks" — pre-built components for standard application functions like Kanban boards, list views, tables, user authentication, and permissions. The AI interprets a user's requirements, guides them through targeted questions about login functionality, permission types, and user roles, then assembles these tested building blocks in a constrained, intelligent way. Only when a user requests functionality that falls outside the standard 80% covered by these blocks does the system build a custom component with AI. "It basically never hallucinates, because it's all built on an infrastructure that's secure and constrained," Hakobyan explained. "It doesn't generate code or leave you with code, because underneath, it uses our existing building block model." The result is not a code repository. It is a live application running on Softr's infrastructure, with a visual editor that users can continue to modify — either by prompting the AI further or by directly manipulating the no-code interface. This dual-editing model is a deliberate design decision that Hakobyan frames as the platform's core differentiator. "It almost combines the best of both worlds of AI and no code, and really lets users to either continue iterating with AI or then continue working with the app visually, which is much simpler and easier and for them to have control," she said. Core platform foundations — authentication, user roles, permissions, hosting, and SSL — are built in from the start, eliminating what Hakobyan calls the "blank canvas problem" that plagues vibe coding platforms, where every user must architect fundamental application infrastructure from scratch via prompts. 
The platform uses a SaaS subscription pricing model, with each plan including a set number of AI credits and the option to purchase more — though the visual editor means users don't always need to consume credits, since direct manipulation of the no-code interface is often faster and more precise. Inside the five-year journey from Airtable interface to profitable AI-native platform Softr's journey to this moment has been a gradual, disciplined expansion that stands in contrast to the rapid fundraising cycles common among AI startups. The company launched in 2020 as a no-code interface layer on top of Airtable, the popular enterprise database product. Co-founded by Armenian entrepreneurs Hakobyan and CTO Artur Mkrtchyan, the startup raised a $2.2 million seed round in early 2021 led by Atlantic Labs, followed by a $13.5 million Series A in January 2022 led by FirstMark Capital. What happened next is notable for its restraint. Softr has not raised additional capital since that 2022 Series A. Instead, it has grown to profitability. "We have been profitable for the past whole year, and we're about 50 people team," Hakobyan told VentureBeat. "We have grown to eight-digit revenue fully PLG, no sales team, mostly through word of mouth, organic growth." That financial profile — eight-figure annual revenue, profitable, 50 employees, no sales team — is striking in a market where many AI-powered competitors are spending heavily to acquire users. Over the past year, the company has steadily expanded its technical capabilities, moving beyond its original Airtable dependency to support Google Sheets, Notion, PostgreSQL, MySQL, MariaDB, and other databases. In February 2025, TechCrunch reported on this expansion, with Hakobyan explaining that many potential customers had "data scattered across many different tools" and needed a single platform to unify that fragmented infrastructure. 
Today, Softr offers 15-plus native integrations with external databases, plus a REST API connector for additional data sources. The new AI Co-Builder represents the culmination of this multi-year evolution — combining the building block architecture, the broad data integration layer, and a new AI interface into a single platform for business application creation.

How Softr positions itself against both no-code incumbents and vibe coding startups

Softr's launch lands in a rapidly fragmenting competitive landscape, and Hakobyan is deliberate about where she draws the lines. On one side sit traditional no-code platforms like Bubble, which offer deep customization and design freedom but require users to build everything from scratch — database schemas, pixel-level layouts, authentication systems — creating a steeper learning curve. A TechRadar review noted that while Softr's blocks don't offer the same design freedom as Bubble, the platform's simplicity makes it accessible to genuinely non-technical users. In a comparison published by Business Insider Africa in June, Softr was characterized as offering "minimal learning curve, especially for internal or web-based tools," though with limitations in scalability for more complex applications. On the other side sit the AI-first code generation tools that Hakobyan views as fundamentally misaligned with business software requirements. "Before people were coding, then they were coding through APIs, now they are coding almost through a human language interface, right, just by with English," Hakobyan said. "But what Softr does is fundamentally different. It abstracts all of that and makes the creation simple." She also distinguishes Softr from developer-focused AI coding assistants like Anthropic's Claude Code, positioning those as tools that make professional developers more efficient rather than tools that enable non-developers to build software. "There are amazing tools for developers — that's great. 
The target audience is developers," Hakobyan acknowledged. Instead, Softr targets a specific and potentially enormous market: businesses that need custom internal and external-facing operational tools and currently rely on spreadsheets, email, or rigid off-the-shelf software that doesn't match their actual processes. Hakobyan described use cases ranging from asset production workflows for film companies — where internal teams, external agencies, and approvers interact across a multi-stage process — to lightweight CRM replacements for teams that don't need the full complexity of Salesforce. "There's not even a vertical solution for this type of process," she said. "It's very custom to each organization." What Netflix, Google, and thousands of non-tech companies actually build on the platform Many of Softr's highest-profile customers — Netflix, Google, Stripe, UPS — were using the platform before the AI Co-Builder even existed, building on the company's original no-code foundation. But the user base extends far beyond Silicon Valley. Non-tech organizations in real estate, manufacturing, and logistics represent a significant portion of Softr's customer base — companies that often still manage core processes with pen, paper, and spreadsheets. "A lot of these companies — you might think they already have the solutions, but they don't," Hakobyan noted. "In tech companies, most of the time, CRM and project management tools are already established. But most of our customers are using Softr for internal operational tooling or workflow tooling, where the use case involves lots of different departments and even external parties." The company is SOC 2 (Type II) compliant and GDPR compliant, with additional compliance capabilities in development. Hakobyan noted that auditing and governance functionality can be built directly into applications using the platform's database and workflow tools, with a native logging and auditing system expected to ship in the near term. 
Softr's billion-user ambition and the Canva analogy that explains its strategy Softr's stated mission — to empower billions of business users to create production-ready software — is audacious, but Hakobyan frames the AI Co-Builder launch as a fundamental acceleration of the trajectory the company has been on for five years. "Everything people would have to spend hours doing is done within five minutes," she said. "And obviously that helps more people to actually build real software." The company plans to layer a product-led sales motion on top of its existing PLG engine, targeting larger enterprise customers with higher average contract values. This represents a deliberate strategic expansion from the small and mid-sized businesses that have formed Softr's core customer base — a segment that TechCrunch identified as natural Softr customers as far back as the company's 2022 Series A, given that those firms are most likely to be priced out of the competitive developer market. Hakobyan draws an analogy that has apparently become common among the company's users: Softr as "Canva for web apps." Just as Canva made professional design accessible to non-designers, Softr aims to make business software creation accessible to non-developers. Whether the company can translate its disciplined growth and profitable foundation into a platform that genuinely serves that enormous addressable market remains to be seen. Softr faces intensifying competition from both traditional no-code incumbents adding AI capabilities and well-funded AI-native startups approaching the problem from the code-generation side. But Softr enters this next phase with advantages that many competitors lack: a profitable business, a million-user base already shipping production software, and an architectural approach that treats AI as an accelerant layered on top of proven infrastructure rather than an unpredictable replacement for it. 
"No code alone had its own problems, and AI alone also just can't do the job," Hakobyan said. "The combination is what's going to be making it really powerful." For the past five years, Softr bet that the hardest part of software wasn't writing the code — it was getting the databases, permissions, and business logic right. Now the company is betting that in the age of AI, that conviction matters more than ever. The millions of business users who have never written a line of code but desperately need custom software are about to find out whether Softr is right.

Nvidia-backed ThinkLabs AI raises $28 million to tackle a growing power grid crunch

3/31/2026

ThinkLabs AI, a startup building artificial intelligence models that simulate the behavior of the electric grid, announced today that it has closed a $28 million Series A financing round led by Energy Impact Partners (EIP), one of the largest energy transition investment firms in the world. Nvidia’s venture capital arm NVentures and Edison International, the parent company of Southern California Edison, also participated in the round. The funding marks a significant escalation in the race to apply AI not just to software and content generation, but to the physical infrastructure that powers modern life. While most AI investment headlines have centered on large language models and generative tools, ThinkLabs is pursuing a different and arguably more consequential application: using physics-informed AI to model the behavior of electrical grids in real time, compressing engineering studies that once took weeks or months into minutes. "We are dead focused on the grid," ThinkLabs CEO Josh Wong told VentureBeat in an exclusive interview ahead of the announcement. "We do AI models to model the grid, specifically transmission and distribution power flow related modeling. We can calculate things like interconnection of large loads — like data centers or electric vehicle charging — and understand the impact they have on the grid." The round drew participation from a deep bench of returning investors, including GE Vernova, Powerhouse Ventures, Active Impact Investments, Blackhorn Ventures, and Amplify Capital, along with an unnamed large North American investor-owned utility. The company initially set out to raise less than $28 million, according to Wong, but strong demand from strategic partners pushed the round higher. "This was way oversubscribed," Wong said. "We attracted the right ecosystem partners and the right capital partners to grow with, and that's how we ended up at $28 million." 
Why surging electricity demand is breaking the grid's legacy planning tools The timing of the raise is no coincidence. U.S. electricity demand is projected to grow 25% by 2030, according to consultancy ICF International, driven largely by AI data centers, electrified transportation, and the broader push toward building and vehicle electrification. That surge is crashing into a grid that was engineered decades ago for a fundamentally different set of demands — and utilities are scrambling to keep up. The core problem is one of computational capacity. When a utility needs to understand what will happen to its grid if a large data center connects to a particular substation, or if a cluster of EV chargers goes live in a residential neighborhood, engineers must run power flow simulations — complex calculations that model how electricity moves through the network. Those studies have traditionally relied on legacy software tools from companies like Siemens, GE, and Schneider Electric, and they can take weeks or months to complete for a single scenario. ThinkLabs' approach replaces that bottleneck with physics-informed AI models that learn from the same engineering simulators but can then run orders of magnitude faster. According to the company, its platform can compress a month-long grid study into under three minutes and run 10 million scenarios in 10 minutes, while maintaining greater than 99.7% accuracy on grid power flow calculations. Wong draws a sharp distinction between what ThinkLabs does and the generative AI models that dominate public discourse. "We're not hallucinating the heck out of things," he said. "We are talking about engineering calculations here. I would really compare this to a computation of fluid dynamics, or like F1 cars, or aerospace, or climate models. We do have a source of truth from existing physics-based engineering models." That source of truth is crucial. 
ThinkLabs trains its AI on the outputs of first-principles physics simulators — the same tools utilities already trust — and then validates its models against those simulators. The result, Wong argues, is an AI system that is not only fast but fully explainable and auditable, a critical requirement in an industry where a miscalculation can cause blackouts or damage physical infrastructure. How ThinkLabs' three-phase power flow analysis differs from every other grid AI startup The competitive landscape for AI in grid management has grown crowded over the past two years, with startups and incumbents alike racing to apply machine learning to utility workflows. But Wong contends that ThinkLabs occupies a fundamentally different position from most of its competitors. "As far as we know, we're the only ones actually doing AI-native grid simulation analysis," he said. "Others might be using AI for forecasting, load disaggregation, or local energy management, but fundamentally, they're not calculating a power flow." What ThinkLabs performs is a full three-phase AC power flow analysis — examining every node and bus on the electric grid to determine real and reactive power levels, line flows, and voltages. This is the same type of analysis that utility engineers perform today using legacy tools, but ThinkLabs can deliver it at a speed and scale that those tools simply cannot match. The distinction matters because utilities make capital investment decisions — worth billions of dollars — based on exactly these types of studies. If a power flow analysis shows that a proposed data center connection will overload a transmission line, the utility may need to build new infrastructure at enormous cost. But if the analysis can also suggest alternative solutions — battery storage placement, load flexibility scheduling, or topology optimization — the utility can potentially avoid or defer those capital expenditures. 
"With many utilities, existing tools will basically show them all the problems, but they can only address solutions by trial and error," Wong explained. "With AI, we can use reinforcement learning to generate more creative solutions, but also very effectively weigh the pros and cons of each of these solutions." Inside ThinkLabs' strategic relationships with NVIDIA, Edison, and Microsoft The presence of NVentures in the round — Nvidia’s venture arm does not write many checks — signals a deeper strategic relationship that extends well beyond capital. Wong confirmed that ThinkLabs works extensively within the Nvidia ecosystem on the energy and utility side, leveraging CUDA for GPU-accelerated computation and integrating Nvidia’s Earth-2 climate simulation platform into ThinkLabs' probabilistic forecasting and risk-adjusted analysis pipelines. "We are what one utility mentioned as the only high-intensity GPU workload for the OT side — the operational technology side — that's planning and operations," Wong said. He added that ThinkLabs is also in discussions with Nvidia’s Omniverse team about additional utility use cases, though those efforts are still early. Edison International's participation carries a different kind of strategic weight. In January 2026, ThinkLabs publicly announced results from a collaboration with Southern California Edison (SCE), Edison International's utility subsidiary, that demonstrated the real-world capabilities of its platform. As the Los Angeles Times reported at the time, the collaboration showed that ThinkLabs' AI could train in minutes per circuit, process a full year of hourly power-flow data in under three minutes across more than 100 circuits, and produce engineering reports with bridging-solution recommendations in under 90 seconds — work that previously required dedicated engineers an average of 30 to 35 days. 
In today's announcement, Edison International's Sergej Mahnovski, Managing Director of Strategy, Technology and Innovation, reinforced that urgency: "We must rapidly transition from legacy planning tools and processes to meet the growing demands on the electric grid — new AI-native solutions are needed to transform our capabilities." ThinkLabs also works closely with Microsoft, which hosted a webinar in mid-2025 featuring Wong alongside representatives from Southern Company, EPRI, and Microsoft's own energy team. The SCE collaboration was built on Microsoft Azure AI Foundry, situating ThinkLabs within the cloud infrastructure that many large utilities already use. The 20-year career path that led from Toronto Hydro to an autonomous grid startup Wong's biography reads like a deliberate preparation for this exact moment. He has spent more than 20 years in the utility industry, starting his career at Toronto Hydro before founding Opus One Solutions in 2012 — a smart-grid software company that he grew to over 100 employees serving customers across eight countries before selling it to GE in 2022, as previously reported by BetaKit. After the acquisition, Wong joined what became GE Vernova and was asked to develop the company's "grid of the future" roadmap. The thesis he developed there — that the grid is the central bottleneck to economic growth, electrification, and national security, and that autonomous grid orchestration powered by AI is the solution — became the intellectual foundation for ThinkLabs. "I was pulling together the thesis that we need to electrify, but the grid is really at the center of attention," Wong said. "The conclusion is we need to drive towards greater autonomy. We talk a lot about autonomous cars, but I would argue that autonomous grids is the much more pressing priority." 
ThinkLabs was incubated inside GE Vernova and spun out as an independent company in April 2024, coinciding with a $5 million seed round co-led by Powerhouse Ventures and Active Impact Investments, as reported by GlobeNewswire at the time. GE Vernova remains a shareholder and strategic partner. Wong is the sole founder. The team composition reflects the company's dual identity. "Half of our team are power system PhDs, but the other half are the AI folks — people who have been looking at hyper-scalable AI infrastructure platforms and MLOps for other industries," Wong said. "We have really been blending the two." How ThinkLabs doubled its utility customer base in a single quarter Utilities are famously among the most conservative technology buyers in the world, with procurement cycles that can stretch years and layers of regulatory oversight that slow adoption. Wong acknowledges this reality but says the landscape is shifting faster than many observers realize. "I have noticed sales cycles really accelerating," he said. "It's still long and depends on which utility and how big the deal is, but we have been witnessing firsthand sales cycles going from the traditional one to two years to a shortest two to three months." On the commercial side, Wong declined to share specific revenue figures but offered several data points that suggest meaningful traction. ThinkLabs is working with more than 10 utilities on AI-native grid simulation for planning and operations, he said, and the company doubled its customer accounts in the first quarter of 2026 alone. "So not one or two, but we're working with 10-plus utilities," Wong said. "Things have really picked up pace even before this A round." 
The company primarily targets investor-owned utilities and system operators — the organizations that own and operate the grid — though Wong noted that AI is also beginning to democratize grid simulation capabilities for smaller utilities that previously lacked the engineering resources to run sophisticated analyses. Wong said the primary use of funds will go toward advancing the product to enterprise grade and expanding the range of use cases the platform supports. The company sees a significant land-and-expand opportunity within individual utility accounts — moving from modeling a small region to training AI models across entire states or multi-state territories within a single customer. EIP's involvement as lead investor carries particular significance in this market. The firm is backed by more than half of North America's investor-owned utilities, giving ThinkLabs a direct line into the executive suites of the customers it is trying to reach. "Utilities are being asked to add capacity on timelines the industry has never seen before, and the stakes extend far beyond the energy sector," Sameer Reddy, Managing Partner at EIP, said in the press release. What a 99.7% accuracy rate actually means for critical grid infrastructure Any conversation about applying AI to critical infrastructure inevitably confronts the question of failure modes. A hallucination in a chatbot is an embarrassment; a miscalculation in a grid power flow analysis could contribute to equipment damage or widespread outages. Wong addressed this head-on. The 99.7% accuracy figure, he explained, is an average across large-volume planning studies — specifically 8,760-hour analyses (every hour of the year) projected across three to 10 years with multiple sensitivity scenarios. For planning purposes, he argued, this level of accuracy is not only sufficient but may actually exceed what traditional methods deliver in practice. 
"If you look at a source of truth, the data quality is actually the biggest limiting factor, not the accuracy of these AI models," he said. "When we bring in traditional engineering analysis and actually snap it with telemetry — metering data, SCADA data — I would actually argue AI is far more accurate because it is data driven on actual measurements, rather than hypothetical planning analysis based on scenarios." For more critical real-time applications, ThinkLabs deploys what Wong called "hybrid models" that blend AI computation with traditional physics-based simulation. In the most stringent use cases, the AI handles roughly 99% of the computational workload before handing off to a physics-based engine for final validation — a technique Wong described as using AI to "warm start" the simulation. The company also monitors for model drift and maintains strict training boundaries. "We're not like ChatGPT training the internet here," Wong said. "We're training on the possibility of grid conditions. And if we do see a condition where we did not train, or outside of our training boundary, we can always run on-demand training on those certain solution spaces." Why ThinkLabs says its value proposition survives even if the data center boom slows down The bullish case for ThinkLabs — and for grid-focused AI more broadly — rests heavily on the assumption that electricity demand will surge dramatically over the coming decade. But some analysts have begun questioning whether those projections are inflated, particularly if AI investment cycles cool and data center build-outs decelerate. Wong argued that his company's value proposition is resilient to that scenario. Even without dramatic load growth, he said, utilities face a fundamental modernization challenge. They have been using tools and processes from the 1990s and 2000s, and the workforce that knows how to operate those tools is retiring at an alarming rate. "Workforce renewal is a big factor," he said. 
"These AI tools not only modernize the tool itself, but also modernize culture and transformation and become major points of retention for the next generation." He also pointed to energy affordability as a driver that exists independent of load growth projections. If utilities continue to plan based on worst-case deterministic scenarios — building enough infrastructure to cover every conceivable contingency — consumer rates will become unmanageable. AI-powered probabilistic analysis, Wong argued, allows utilities to make smarter, more cost-effective decisions regardless of whether the most aggressive demand forecasts materialize. "A large part of this AI is not only enabling workload, but how do we act with intelligence — going from worst-case to time-series analysis, from deterministic to probabilistic and stochastic analysis, and also coming up with solutions," he said. Wong frames the broader opportunity with an analogy that captures both the simplicity and the ambition of what ThinkLabs is attempting. For decades, he said, the utility industry's default response to grid constraints has been the equivalent of building wider highways — more wires, more copper, more steel. ThinkLabs wants to be the navigation system that reroutes traffic instead. "In the past, when we drive, we always drive with what we are familiar with — just the big roads," he said. "But with AI, we can optimize the traffic patterns to drive on much more effective routes. In this case, it might be a mix of wires, flexibility, batteries, and operational decisions." Whether ThinkLabs can deliver on that vision at the scale the grid demands remains an open question. But Wong, who has spent two decades building and selling grid software companies, is not thinking in terms of incremental improvement. 
He sees a narrow window — measured in years, not decades — during which the foundational AI infrastructure for the grid will be built, and whoever builds it will shape the energy system for a generation. "I truly believe the next two years of AI development for the grid will dictate the next decades of what can happen to the grid," Wong said. "It's really here now." The grid, in other words, is getting a copilot. The question is no longer whether utilities will trust AI with their most critical engineering decisions, but how quickly they can afford not to.

Midjourney engineer debuts new vibe coded, open source standard Pretext to revolutionize web design

3/30/2026

For three decades, the web has existed in a state of architectural denial. It is a platform originally conceived to share static physics papers, yet it is now tasked with rendering the most complex, interactive, and generative interfaces humanity has ever conceived. At the heart of this tension lies a single, invisible, and prohibitively expensive operation known as "layout reflow." Whenever a developer needs to know the height of a paragraph or the position of a line to build a modern interface, they must ask the browser’s Document Object Model (DOM), the standard by which developers can create and modify webpages. In response, the browser often has to recalculate the geometry of the entire page — a process akin to a city being forced to redraw its entire map every time a resident opens their front door. Last Friday, March 27, 2026, Cheng Lou — a prominent software engineer whose work on React, ReScript, and Midjourney has defined much of the modern frontend landscape — announced on the social network X that he had "crawled through depths of hell" to release an open source (MIT License) solution: Pretext, which he coded using AI vibe coding tools and models like OpenAI's Codex and Anthropic's Claude. It is a 15KB, zero-dependency TypeScript library that allows for multiline text measurement and layout entirely in "userland," bypassing the DOM and its performance bottlenecks. Without getting too technical, in short, Lou's Pretext turns text blocks on the web into fully dynamic, interactive and responsive spaces, able to adapt and smoothly move around any other object on a webpage, preserving letter order and spaces between words and lines, even when a user clicks and drags other objects to intersect with the text, or resizes their browser window dramatically. Ironically, it's difficult with mere text alone to convey how significant Lou's latest release is for the entire web going forward. 
Fortunately, other third-party developers whipped up quick demos with Pretext showing off some of its more impressive powers, including a dragon that flies around within a block of text, breathing fire as the surrounding characters melt and are pushed out of the way by the dragon's undulating form. Another developer made an app that requires the user to hold their smartphone exactly level to read the text — tipping the device to one side or the other causes all the letters to fall off and collect there as though they were each physical objects dumped off the surface of a flat tray. Someone even coded a web app allowing you to watch a whole movie (the new Project Hail Mary starring Ryan Gosling) while reading the book it is based on at the same time, all rendered out of interactive, moving, fast, responsive text. While some detractors immediately pointed out that many of these flashy demos make the underlying text unreadable or illegible, they're missing the larger point: with Pretext, one man (Lou) using AI vibe coding tools has singlehandedly revolutionized what's possible for everyone and anyone to do when it comes to web design and interactivity. The project hasn't even been out a week — of course the initial users are only scratching the surface of the newfound capabilities, which heretofore required complex, custom instructions and could not be scaled or generalized. Designers and typographers may be the ones most immediately impressed and affected by the advance — but really, anyone who has spent time trying to lay out a block of text and wrap it around images or other embedded, interactive elements on a webpage is probably going to be interested in this. And anyone who uses the web — all 6 billion and counting of us — will likely experience some of the effects of this release before too long as it spreads to the sites we visit and use daily. 
And already, some developers are working on more practical features with it, like a custom user-controlled font resizer and letter spacing optimizer for those with dyslexia.

With that in mind, perhaps it is not surprising to learn that within 48 hours, the project garnered over 14,000 GitHub stars and 19 million views on X, signaling what many believe to be a foundational shift in how we build the internet. It also demonstrates that AI-assisted coding has moved beyond generating boilerplate to delivering fundamental architectural breakthroughs.

For enterprises, this signifies a new era where high-leverage engineering teams can use AI to build bespoke, high-performance infrastructure that bypasses decades-old platform constraints, effectively decoupling product innovation from the slow cycle of industry-wide browser standardization.

The geometry of the bottleneck

To understand why Pretext matters, one must understand the high cost of "measuring" things on the web. Standard browser APIs like getBoundingClientRect or offsetHeight can force the browser to recompute layout synchronously, and when such reads are interleaved with DOM writes the result is the notorious "layout thrashing." In a modern interface (think of a masonry grid of thousands of text boxes or a responsive editorial spread), these measurements happen in the "hot path" of rendering. If the browser has to stop and calculate layout every time the user scrolls or an AI generates a new sentence, the frame rate drops, the battery drains, and the experience stutters.

Lou’s insight with Pretext was to decouple text layout from the DOM entirely. By using the browser’s Canvas font metrics engine as a "ground truth" and combining it with pure arithmetic, Pretext can predict exactly where every character, word, and line will fall without ever touching a DOM node.

The performance delta is staggering. According to project benchmarks, Pretext’s layout() function can process a batch of 500 different texts in approximately 0.09ms. Compared to traditional DOM reads, this represents a 300–600x performance increase.
This speed transforms layout from a heavy, asynchronous chore into a synchronous, predictable primitive, one that can run at 120fps even on mobile devices.

Technology: the prepare and layout split

The elegance of Pretext lies in its two-stage execution model, designed to maximize efficiency:

prepare(text, font): This is the one-time "heavy lifting" phase. The library normalizes whitespace, segments the text, applies language-specific glue rules, and measures segments using the canvas. The result is cached as an opaque data structure.
layout(preparedData, maxWidth, lineHeight): This is the "hot path". It is pure arithmetic that takes the prepared data and calculates heights or line counts based on a given width.

Because layout() is just math, it can be called repeatedly during a window resize or a physics simulation with negligible overhead. It supports complex typographic needs that were previously impossible to handle efficiently in userland:

Mixed-bidirectional (bidi) text: Handling English, Arabic, and Korean in the same sentence without breaking layout.
Grapheme-aware breaking: Ensuring that emojis or complex character clusters are not split across lines.
Whitespace control: Preserving tabs and hard breaks for code or poetry using white-space: pre-wrap logic.

The hell crawl and the AI feedback loop

The technical challenge of Pretext wasn't just writing the math; it was ensuring that the math matched the "ground truth" of how various browsers (Chrome, Safari, Firefox) actually render text. Text rendering is notoriously riddled with quirks, from how different engines handle kerning to the specifics of line-breaking heuristics. Lou revealed that the library was built using an "AI-friendly iteration method".
By iteratively prompting models like Claude and Codex to reconcile TypeScript layout logic against actual browser rendering on massive corpora, including the full text of The Great Gatsby and diverse multilingual datasets, he was able to achieve pixel-perfect accuracy without the need for heavy WebAssembly (WASM) binaries or font-parsing libraries.

Ripple effects: a weekend of demos

The release of Pretext immediately set off a series of radical experiments across X and the broader developer community. The original demos showcased by Lou on X provided a glimpse into a new world:

The editorial engine: A multi-column magazine layout where text flows around draggable orbs, reflowing in real time at 60fps.
Masonry virtualization: A demo displaying hundreds of thousands of variable-height text boxes. Height prediction is reduced to a linear traversal of cached heights.
Shrinkwrapped bubbles: Chat bubbles that calculate the tightest possible width for multiline text, eliminating wasted area.

The community response was equally explosive. Within 72 hours, developers began pushing the boundaries:

@yiningkarlli implemented the Knuth-Plass paragraph justification algorithm, which reduces "rivers" of white space by evaluating entire paragraphs as units, bringing high-end print typography to the web.
@Talsiach built "X Times," an AI-powered newspaper that uses Grok to analyze images and X posts, using Pretext to instantly lay out and reflow a front page.
@Kaygeeartworks demonstrated a Three.js fluid simulation featuring fish swimming through and around text elements, with the text reacting to physics at high frame rates.
@KageNoCoder launched Pretext-Flow, a live playground for flowing text around custom media like transparent PNGs or videos.
@cocktailpeanut and @stevibe demonstrated ASCII art Snake and Hooke’s Law physics with live text reflow.
@kho built a BioMap visualization with 52 biomarker blocks performing layout reflow at 0.04ms every frame.
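
Several of these demos, the shrinkwrapped chat bubbles in particular, rest on the same pattern: measure once, then re-run cheap arithmetic for every candidate width. The TypeScript sketch below illustrates that idea under stated assumptions. The names prepareSketch, layoutSketch and shrinkwrapWidth are hypothetical and are not Pretext's API; the measure callback stands in for a Canvas-based measurement; and the greedy word wrap ignores bidi text, grapheme clusters and the language-specific rules a real library has to handle.

```typescript
// Hypothetical sketch of a "measure once, lay out many times" split.
// None of these names come from Pretext; they are illustrative only.

type Measure = (segment: string) => number;

interface Prepared {
  wordWidths: number[]; // width of each word, measured exactly once
  spaceWidth: number; // width of a single space
}

// Phase 1 (run once per text): split into words and measure each one.
// Assumption: in a browser, `measure` could wrap a Canvas 2D context's
// measureText(); here it is injected so the sketch runs anywhere.
function prepareSketch(text: string, measure: Measure): Prepared {
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0);
  return { wordWidths: words.map(measure), spaceWidth: measure(" ") };
}

// Phase 2 (hot path): greedy line breaking over cached numbers only.
// A word wider than maxWidth simply gets a line of its own.
function layoutSketch(p: Prepared, maxWidth: number, lineHeight: number) {
  let lineCount = 0;
  let lineWidth = 0;
  let widestLine = 0;
  for (const w of p.wordWidths) {
    if (lineCount === 0) {
      lineCount = 1; // first word opens the first line
      lineWidth = w;
    } else if (lineWidth + p.spaceWidth + w <= maxWidth) {
      lineWidth += p.spaceWidth + w; // word fits on the current line
    } else {
      widestLine = Math.max(widestLine, lineWidth);
      lineCount += 1; // wrap to a new line
      lineWidth = w;
    }
  }
  widestLine = Math.max(widestLine, lineWidth);
  return { lineCount, height: lineCount * lineHeight, widestLine };
}

// Shrinkwrap: binary-search the smallest integer width that needs no more
// lines than maxWidth does. This works because the greedy line count never
// increases as the width grows.
function shrinkwrapWidth(p: Prepared, maxWidth: number): number {
  const target = layoutSketch(p, maxWidth, 1).lineCount;
  let lo = 1;
  let hi = Math.ceil(maxWidth);
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (layoutSketch(p, mid, 1).lineCount <= target) {
      hi = mid; // mid is wide enough; try narrower
    } else {
      lo = mid + 1; // mid forces extra lines; go wider
    }
  }
  return lo;
}

// Usage with a stand-in monospace measurer (10 units per character).
const demo = prepareSketch("the quick brown fox jumps", (s) => s.length * 10);
const demoLayout = layoutSketch(demo, 120, 20); // 3 lines, height 60
const demoBubble = shrinkwrapWidth(demo, 200); // 150: same 2 lines, narrower box
```

In the usage example, a 200-unit bubble wraps the text onto two lines, the widest of which is 190 units, yet the search finds that 150 units is enough to keep the same two lines. Each probe is pure arithmetic over the cached widths, which is why this kind of search is affordable on every frame.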
Philosophical shifts and the thicker client

The response to Pretext was overwhelmingly enthusiastic from frontend luminaries. Guillermo Rauch, CEO of Vercel, and Ryan Florence of Remix praised the library's performance gains. Tay Zonday noted the potential for neurodiverse high-speed reading through dynamic text rasterization.

However, the release also ignited a nuanced debate about the future of web standards. Critics warned of "thick client" overreach, arguing that bypassing the DOM moves us away from the simplicity of hypermedia systems.

Lou’s response was a meditation on the lineage of computing. He pointed to the evolution of iOS, which started with PostScript, a static format for printers, and evolved into a polished, scriptable platform. The web, Lou argues, has remained stuck in a "document format" mindset, layering scripting on top of a static core until complexity reached a point of diminishing returns. Pretext is an attempt to restart that conversation, treating layout as an interpreter (a set of functions that developers can manipulate) rather than a black-box data format managed by the browser.

Strategic analysis: To adopt or wait?

Pretext is released under the MIT License, ensuring it remains a public utility for the developer community and commercial enterprises alike. It is not merely a library for making chat bubbles look better; it is an infrastructure-level tool that decouples the visual presentation of information from the architectural constraints of the 1990s web.

By solving the last and biggest bottleneck of text measurement, Lou has provided a path for the web to finally compete with native platforms in terms of fluidity and expressiveness. Whether it is used for high-end editorial design, 120fps virtualized feeds, or generative AI interfaces, Pretext marks the moment when text on the web stopped being a static document and became a truly programmable medium.
Organizations should adopt Pretext immediately if they are building "Generative UI" or high-frequency data dashboards, but they should do so with a clear understanding of the "thick client" trade-off.

Why adopt: The move from O(N) to O(log N) or O(1) layout performance is not an incremental update; it is an architectural unlock. If your product involves a chat interface that stutters during long responses or a masonry grid that "jumps" as it calculates heights, Pretext is the solution. It allows you to build interfaces that feel as fast as the underlying models are becoming.

What to be aware of: Adoption requires a specialized talent pool. This isn't "just CSS" anymore; it’s typography-aware engineering. Organizations must also be aware that by moving layout into userland, they become the "stewards" of accessibility and standard behavior that the browser used to handle for free.

Ultimately, Pretext is the first major step toward a web that feels more like a game engine and less like a static document. Organizations that embrace this "interpreter" model of layout will be the ones that define the visual language of the AI era.