Author(s): Anup Karanjkar Originally published on Towards AI. Workload Identity Federation just hit GA — the per-provider setup, and the precedence trap that cost me two quiet days Last Tuesday I went looking for every static Claude API key I owned, and stopped counting at eleven. The author recounts migrating from long-lived static Claude API keys to keyless authentication using Workload Identity Federation (WIF), emphasizing that federation doesn’t truly “delete” the secret—it moves trust and credentials upstream to the identity provider. They explain how the system works (issuer, service account, federation rule; runtime JWT exchange to short-lived access tokens), then share the critical migration gotcha: the SDK’s credential precedence chain means that if an environment variable like ANTHROPIC_API_KEY is still present anywhere, it will silently override WIF and make the migration appear successful while doing nothing. The post provides a reliable no-downtime cutover sequence (configure federation in parallel, verify with ant auth status, remove the key everywhere, confirm federation wins, then revoke), and gives guidance for setting tight match conditions per provider (GitHub Actions, Kubernetes, AWS, GCP, Entra/Okta) to avoid wildcard rules. Finally, it stresses what WIF doesn’t solve—upstream IdP misconfiguration, lack of attestation for runtime workload identity, and limited auditability across governance frameworks—so “keyless” must be paired with proper IdP security and auditing of the trust hop you can’t see. Read the full blog for free on Medium. Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor. Published via Towards AI
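The precedence trap described above can be caught with a preflight check before cutover. What follows is a minimal sketch, assuming only what the summary states: a lingering `ANTHROPIC_API_KEY` environment variable wins over federation. The helper names `find_static_key_overrides` and `assert_keyless` are hypothetical, and the sketch neither calls the SDK nor replaces `ant auth status`.

```python
import os
from typing import Mapping, Optional

# Variables that, per the post, silently win over federation when present.
# Only ANTHROPIC_API_KEY is named in the source; extend this tuple for your setup.
STATIC_KEY_VARS = ("ANTHROPIC_API_KEY",)


def find_static_key_overrides(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the names of static-key variables set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in STATIC_KEY_VARS if env.get(name, "").strip()]


def assert_keyless(env: Optional[Mapping[str, str]] = None) -> None:
    """Raise if a static key would shadow federated credentials."""
    leftovers = find_static_key_overrides(env)
    if leftovers:
        raise RuntimeError(
            "Static credentials still present, federation would be bypassed: "
            + ", ".join(leftovers)
        )


if __name__ == "__main__":
    # Hypothetical usage: run in CI right after removing the key everywhere.
    assert_keyless()
    print("no static key found in this environment")
```

Running a check like this in every environment that used to hold the key (CI, containers, developer shells) is one way to confirm the variable is gone before revoking it.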
Author(s): MayhemCode Originally published on Towards AI.

Why Local AI Is Not a Fringe Thing Anymore

My ChatGPT Plus subscription was costing me $20 a month. That’s $240 a year. For someone who uses AI every single day for drafting, coding help, and summarizing long PDFs, that number started to bother me. Not because it’s too expensive in absolute terms, but because I kept hearing people say local models had gotten good enough to replace it. I wanted to find out if that was actually true.

After setting up local AI for 30 days with Ollama and Open WebUI on a desktop and a MacBook, the author found that today’s models are genuinely capable for everyday work (especially writing, summarizing, brainstorming, and many “80% of the time” knowledge tasks), often producing results close enough to ChatGPT to be hard to tell apart. Qwen3 32B became the main choice for quality, while smaller or different models (like DeepSeek for reasoning-style tasks and Gemma for lightweight summarization and quick Q&A) served specific use cases. Local AI’s biggest wins were privacy (prompts never leave the machine) and cost for high-volume batch text processing, where local inference can be far cheaper and faster for repetitive jobs. The main frustrations were long-context multi-step reasoning failures, limited or absent image understanding for most local setups, slower response speeds on CPU for big models, and the real time and effort required to troubleshoot local configuration and model selection. Overall, the author concludes that local AI isn’t a full replacement for the best cloud models, but it can replace most cloud usage, making a hybrid workflow (local for the bulk, cloud for the hardest 10–15%) the most practical approach; they end by recommending starter models based on hardware and emphasizing that even when switching back to cloud, the privacy instinct learned during the experiment made the process feel different.
Author(s): Datafortune Inc Originally published on Towards AI. Should we move to AWS, Azure, or GCP? Do we need a hybrid architecture? Is multicloud the right long-term strategy? How quickly can we modernize legacy workloads? These are important questions. Yet they often overshadow a decision that can have just as much impact on the outcome of a migration: choosing the partner who will help execute it. When organizations look back on migrations that exceeded budgets or missed deadlines, the story is rarely about a lack of cloud capability. More often, it’s about whether the people leading the migration understood the environment they were moving into, the business they were supporting, and the operational realities waiting on the other side of go-live. That’s a risky imbalance. A cloud migration partner does more than move workloads from one environment to another. Their decisions influence migration timelines, governance models, cost visibility, operational readiness, and the experience of the teams that inherit the environment after launch. If you’re evaluating partners for an upcoming migration, there are a few signals worth paying attention to long before a contract is signed. Why Cloud Migrations Go Wrong Before Migration Begins The cloud platforms themselves are mature, proven, and used at massive scale. But if you’ll ask a room full of IT leaders about failed cloud migrations, you’ll hear familiar explanations. The timeline was too aggressive. Dependencies surfaced late. The application architecture was more complex than expected. Compliance requirements appeared halfway through the project. Teams discovered that critical applications are more interconnected than anyone realized. They are often symptoms rather than root causes. Problems usually emerge in the gaps between planning and execution. Many migration challenges can be traced back to decisions made. 
Specifically, decisions about how the migration is planned, who is responsible for it, and how success is defined. The first and most common mistake is evaluating a partner primarily through certifications. Cloud certifications matter, as they demonstrate expertise with a platform’s services, tools, and best practices. What they don’t reveal is whether a team has experience migrating an environment that resembles yours. For example, a manufacturing company moving an ERP platform faces a very different set of challenges than a software company migrating customer-facing applications. Another mistake emerges when migration planning focuses almost exclusively on infrastructure. The conversation becomes centered on servers, storage, networking, and timelines, while business processes receive less attention. Unfortunately, business processes are often where the most expensive surprises are hiding. An application exchanges data with other systems, supports multiple departments, and often serves workflows that have evolved over many years. When those relationships aren’t fully understood, migration teams discover them in the middle of execution, usually when changes become significantly more expensive. Three Signals You’re Evaluating the Wrong Things Over the years, a few patterns tend to show up when organizations focus on the wrong evaluation criteria. Signal #1: Every conversation revolves around tools and technologies They should absolutely be part of the discussion. The problem arises when it’s the only discussion. If every meeting centers on cloud services, migration tools, and platform capabilities, you’re only seeing part of the picture. A migration is ultimately a business initiative supported by technology, not the other way around. A partner should be asking questions about operational dependencies, critical business processes, reporting requirements, regulatory obligations, and acceptable downtime windows. 
Those conversations often reveal more about migration complexity than the technical architecture diagram. Signal #2: Nobody discusses operational ownership Many migration projects are planned around a finish line. The workloads are migrated, and the project is officially complete; nobody talks about what happens after go-live. The first few months after a migration are often when organizations discover optimization opportunities, integration issues, user adoption challenges, and operational adjustments that weren’t visible during planning. A partner’s role during that period can be just as important as their role during the migration itself. If post-migration ownership remains vague throughout the evaluation process, it’s worth digging deeper before moving forward. Signal #3: Compliance appears late in the discussion Not all cloud environments are built for the same purpose. A company adopting a hybrid architecture faces different operational considerations than one pursuing a multicloud strategy. Governance models, networking requirements, security controls, and workload placement decisions can vary significantly depending on the environment being built. Yet many evaluation discussions treat cloud migration as though every destination follows the same blueprint. Understanding the target environment should shape the migration strategy from the beginning. Questions to Ask Every Cloud Migration Partner Once the conversation moves beyond certifications, case studies, and platform expertise, the quality of the evaluation often depends on the questions being asked. The goal isn’t to put a potential partner under pressure. It’s to understand how they think when complexity appears, priorities conflict, and decisions have to be made with incomplete information. Here are a few questions worth bringing into the discussion. Q1. Have You Migrated Workloads Similar to Ours? Experience is most valuable when it is relevant. 
A partner may have completed dozens of migrations and still have limited experience with the specific challenges your organization faces. Ask for examples that resemble your environment, not just your industry. Pay attention to how they describe the challenges they encountered and how those challenges were resolved. Specific answers tend to reveal genuine experience. Q2. How Do You Identify and Manage Dependencies? Dependencies are responsible for a surprising number of migration delays. Applications exchange data with other systems, rely on shared services, support business processes, and interact with users across multiple departments. The more interconnected the environment, the more important dependency mapping becomes. A strong partner should be able to explain how they discover, document, validate, and monitor dependencies before migration work begins. The methodology matters as much as the final architecture. Q3. What Happens if Something Doesn’t Go According to Plan? Every migration plan includes assumptions. Some of those assumptions will prove accurate. Others won’t. What creates risk is the absence of a structured response when unexpected issues emerge. Ask how […]
Author(s): Rizwanhoda Originally published on Towards AI.

First: What Problem Does AsyncIO Solve?

Adding async and await to your code doesn't make it asynchronous. It makes it eligible to be asynchronous. There's a big difference and it bites almost everyone the first time.

The article explains that AsyncIO is designed to improve performance for I/O-bound workloads by using cooperative multitasking: while tasks are waiting, the event loop can run other pending work rather than blocking a single thread. It walks through how the event loop schedules coroutines and why yielding only happens at proper await points. It also clarifies common failure modes: using sequential awaits when concurrency is needed, accidentally blocking the event loop with synchronous libraries or CPU-heavy work, forgetting to actually run the event loop, and mixing sync and async code incorrectly. Through a real FastAPI “before vs after” example and a mental model, the piece shows that async/await are signaling mechanisms, not speed buttons, and that real concurrency requires launching multiple coroutines together (e.g., with asyncio.gather or create_task).
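The difference between sequential awaits and concurrent launches can be seen in a few lines. This is a minimal sketch, not the article's FastAPI example: `fetch` is a hypothetical stand-in for an I/O call, and `asyncio.sleep` simulates the wait.

```python
import asyncio
import time


async def fetch(name: str, delay: float) -> str:
    """Stand-in for an I/O call; asyncio.sleep yields control to the event loop."""
    await asyncio.sleep(delay)
    return name


async def sequential(delay: float) -> list:
    # Each await finishes before the next coroutine even starts.
    return [await fetch("a", delay), await fetch("b", delay), await fetch("c", delay)]


async def concurrent(delay: float) -> list:
    # gather schedules all three coroutines at once, so their waits overlap.
    return list(await asyncio.gather(fetch("a", delay), fetch("b", delay), fetch("c", delay)))


def timed(coro_fn, delay: float):
    """Run a coroutine function to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(delay))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_time = timed(sequential, 0.1)
    _, con_time = timed(concurrent, 0.1)
    print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Both versions use `async` and `await`; only the second one lets the waits overlap, which is the point the article makes about eligibility versus actual concurrency.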
Author(s): Divy Yadav Originally published on Towards AI.

At 9:03 am on a Tuesday, my research agent said hello and stared at an empty /workspace/. Six hours of analysis from the night before. Gone. The cloned repository. The installed packages. The notes it had spent hours writing. Gone. I had assumed that if an agent stopped working for the night, it could simply continue the next morning. That was wrong. Over the next three weeks, I rebuilt the same workflow on Tensorlake, Cloudflare, and Daytona to figure out what had happened. The hardest part of running Claude Managed Agents isn’t the model. It’s everything underneath it. This is the exact code I ran, the things that broke, and the mistake that cost me two weeks to understand.

What Claude Managed Agents is, before anything else

If you’ve never built with Claude Managed Agents, the architecture needs a minute. Skip this if you already know it. Anthropic runs the reasoning. You run the execution. The agent loop, session state, work queue, and retry logic all live on Anthropic’s infrastructure. You configure a Self-hosted Environment in the Claude Console. When your application starts a session, Anthropic queues the work, your orchestrator picks it up, spins up a sandbox, and the model starts issuing tool calls into that sandbox. Every bash, read, write, grep, and edit call executes inside an environment you own. Anthropic never touches it. You decide what that environment looks like, what it can access, and what happens between sessions. Anthropic’s intelligence is fixed. Your engineering determines whether that intelligence has a stable, stateful environment to work in, or a clean slate that forgets everything the moment it goes idle.
What I was building and why it mattered

I needed an agent that could do real deep-work research on a codebase: clone a repository, read through the module structure, build an understanding of how the pieces fit together, write notes, and propose refactoring strategies. The kind of work that takes a senior engineer a full day and an AI agent about six hours. The key constraint: the agent couldn’t do this all at once. Sometimes I’d kick off a session at 8pm, let it run until midnight, and pick it back up the next morning. The filesystem it had built during that first session (the analysis notes, the installed tools, the half-read source files) had to be there when the next session started. Rebuilding from scratch each time wasn’t viable. That constraint is what drove every provider decision I made.

The requirements I didn’t know I had

At the start, I thought I needed a Linux environment that could run Claude Managed Agents. By the end, I realized I actually needed three things. I found them all in one place, but not until I had looked in two others first.

A filesystem that survived between work sessions.
Near-zero cost while the agent was idle.
The ability to branch from an already-completed analysis state.

I did not discover all three requirements on day one. I discovered them one mistake at a time.

How a session actually starts: the code before the sandbox

You drive a session through the reference orchestrator using a simple command:

    make session PROMPT="Clone the repository at github.com/tensorlakeai/tensorlake. \
    Read through the module structure. Write a summary to /workspace/analysis.md. \
    Note any components that look like they could be simplified."

The orchestrator sends this prompt to Anthropic as a new session. Anthropic picks it up, starts the agent loop, and immediately begins issuing tool calls. Those tool calls arrive at your sandbox. The agent reads files, runs bash commands, writes notes.
The session runs until the task is complete or you stop it. The agent stream looks roughly like this as it runs:

    [thinking] The repository appears to be a Python SDK for…
    [bash] git clone https://github.com/tensorlakeai/tensorlake
    [bash] ls -la /workspace/tensorlake/
    [read] /workspace/tensorlake/tensorlake/sandbox.py
    [write] /workspace/analysis.md
    [thinking] The Sandbox class handles…

Each bracketed event is a tool call going into your sandbox. The session accumulates state inside /workspace/ across all those calls. By the end of a six-hour session, that directory contains the cloned repo, installed packages, analysis files, and intermediate notes. That’s the state that needs to survive overnight.

Build 1: Cloudflare

My first assumption was that I needed a platform that could efficiently run Claude Managed Agents. Cloudflare is optimized for high-concurrency execution. My problem turned out to be different. The agent I was building accumulated hours of filesystem state between bursts of work. Notes, cloned repositories, installed dependencies, and intermediate analysis all needed to survive overnight. Cloudflare’s execution model wasn’t designed around that requirement. That was the first time I realized I wasn’t looking for compute. I was looking for persistent state.

Build 2: Daytona

The second build solved part of the problem. The agent could accumulate state throughout a session, which initially felt like progress. Then I wanted to test three different refactoring strategies starting from the same six-hour analysis. Instead of branching from that state, I found myself repeating the setup work each time: rebuilding context, reinstalling dependencies, and re-running analysis before I could begin the actual experiment. That was when I discovered my second requirement. Preserving state wasn’t enough. I also needed a way to branch from an existing state without repeating hours of work.
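The branching requirement can be illustrated locally. This is a minimal sketch under loud assumptions: it copies a directory with the standard library and says nothing about how Cloudflare, Daytona, or Tensorlake implement snapshots or suspension. The helper `branch_workspace` and the strategy names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def branch_workspace(base: Path, branches_root: Path, name: str) -> Path:
    """Copy a finished workspace so an experiment can start from its state."""
    target = branches_root / name
    # copytree creates missing parent directories and raises FileExistsError
    # if a branch with this name already exists.
    shutil.copytree(base, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        base = root / "workspace"
        base.mkdir()
        (base / "analysis.md").write_text("six hours of notes\n")

        # Three strategies, each starting from the same analysis state.
        for strategy in ("strategy-a", "strategy-b", "strategy-c"):
            branch = branch_workspace(base, root / "branches", strategy)
            (branch / "plan.md").write_text(f"plan for {strategy}\n")

        print(sorted(p.name for p in (root / "branches").iterdir()))
```

Each branch starts with the full analysis and can diverge without touching the base, which is the property the second build was missing. A full copy is the simplest way to get it, though it would not be cheap for a workspace with large dependencies.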
Build 3: Tensorlake

The first thing that caught my attention was not a feature. It was an architectural decision. Most platforms preserve state by keeping compute alive. This one treated compute and state as separate problems. The docs described a suspended sandbox that could preserve its state and resume in approximately 0.6 seconds. That was the first time I saw a design that directly addressed the problem I’d been running into. I wanted to know whether it actually worked. I started with […]
Author(s): Bessie Delight Kekeli Originally published on Towards AI.

The Building Blocks of LangGraph (Part 0)

As Large Language Models (LLMs) have become more capable, developers have moved beyond simple chatbots and begun building systems that can reason, make decisions, use tools, retrieve information, interact with APIs, and collaborate with other AI agents. Building these systems introduces a new challenge: How do we coordinate and manage the flow of intelligence? This is the problem that LangGraph was created to solve. At its core, LangGraph is a framework for building stateful, controllable, and production-ready AI workflows. It allows developers to define how AI agents think, make decisions, communicate with tools, and move through complex tasks. If LangChain helps you connect AI components together, LangGraph helps you orchestrate how those components behave over time. LangGraph is an orchestration framework built by the team behind LangChain. It allows developers to model AI applications as a graph.

The Simplest Graph (agent flow)

Let’s build a simple graph with 3 nodes and one conditional edge. The easiest way to understand nodes, edges, and state is to imagine a food delivery process.

A node is simply a task or action that does something. For example: Receive Order is a node. Prepare Food is another node. Deliver Food is another node. Every time some work is performed, you are at a node.

An edge is the path that tells the system where to go next. For example:

    Receive Order
      ↓
    Prepare Food
      ↓
    Deliver Food

Those arrows are the edges. The edge is not doing any work itself. It simply says: “After this step finishes, go to that step.” Think of an edge as a road connecting two cities. The cities are the nodes, and the road is the edge.

A state is the information that travels through the entire process.
Imagine a customer orders:

    Pizza
    Address: 123 Main Street
    Customer: John

When the order is received, that information enters the system. As the order moves from:

    Receive Order
      ↓
    Prepare Food
      ↓
    Deliver Food

the information moves along with it. That information is the state. Let’s build our first simple agent.

State

Think of state as the graph’s shared memory. It is the information that travels through the workflow as it moves from one node to another. Every node can read the state, update it, and pass the updated version to the next node. In this example, the state contains a single piece of information called graph_state. First, define the State of the graph. The State schema serves as the input schema for all Nodes and Edges in the graph. Let’s use the TypedDict class (from Python's typing module, imported here via typing_extensions) as our schema, which provides type hints for the keys.

    from typing_extensions import TypedDict

    class State(TypedDict):
        graph_state: str

Nodes

A node is simply a function that performs some work. When a node runs, it receives the current state, does something with it, and returns an updated state. You can think of a node as a worker in a factory. The worker receives a package (the state), modifies it, and then passes it along. The first positional argument is the state, as defined above. Because the state is a TypedDict with schema as defined above, each node can access the key, graph_state, with state['graph_state']. Each node returns a new value of the state key graph_state. By default, the new value returned by each node will override the prior state value.

    # Each node appends to the running string; the leading space keeps words apart.
    def node_1(state):
        print("---Node 1---")
        return {"graph_state": state['graph_state'] + " I am"}

    def node_2(state):
        print("---Node 2---")
        return {"graph_state": state['graph_state'] + " happy!"}

    def node_3(state):
        print("---Node 3---")
        return {"graph_state": state['graph_state'] + " sad!"}

Edges

An edge is simply a connection between nodes. It tells the graph where to go after a node finishes its work.
A normal edge is a fixed path. After one node completes, the graph always moves to the same next node. For example, if a workflow has “Collect Data” followed by “Analyze Data,” the graph will always move from the first node to the second. A conditional edge is a decision point. Instead of always following the same path, the graph looks at the current state and decides where to go next. For example, after analyzing data, the graph might ask: “Do I have enough information?” If the answer is yes, it moves to “Generate Report.” If the answer is no, it moves back to “Collect More Data.” Conditional edges are implemented as functions that return the next node to visit based on some logic.

    import random
    from typing import Literal

    def decide_mood(state) -> Literal["node_2", "node_3"]:
        # Often, we will use state to decide on the next node to visit
        user_input = state['graph_state']

        # Here, let's just do a 50 / 50 split between nodes 2, 3
        if random.random() < 0.5:
            # 50% of the time, we return Node 2
            return "node_2"

        # 50% of the time, we return Node 3
        return "node_3"

Graph Construction

Now, we build the graph from our components defined above. The StateGraph class is the graph class that we can use. First, we initialize a StateGraph with the State class we defined above. Then, we add our nodes and edges. We use the START Node, a special node that sends user input to the graph, to indicate where to start our graph. The END Node is a special node that represents a terminal node. Finally, we compile our graph to perform a few basic checks on the graph structure. We can visualize the graph as a Mermaid diagram.
    from IPython.display import Image, display
    from langgraph.graph import StateGraph, START, END

    # Build graph (pass the State class, not a lowercase variable)
    builder = StateGraph(State)
    builder.add_node("node_1", node_1)
    builder.add_node("node_2", node_2)
    builder.add_node("node_3", node_3)

    # Logic
    builder.add_edge(START, "node_1")
    builder.add_conditional_edges("node_1", decide_mood)
    builder.add_edge("node_2", END)
    builder.add_edge("node_3", END)

    # Compile
    graph = builder.compile()

    # View
    display(Image(graph.get_graph().draw_mermaid_png()))

Graph Invocation

The compiled graph implements the runnable protocol. This provides a standard way to execute LangChain components. invoke is one of the standard methods in this interface. The input is a dictionary {"graph_state": "Hi, this is lance."}, which sets the initial value for our graph state dict. When invoke is called, the graph starts execution […]
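The node, edge, and state mechanics described in this article can also be sketched without LangGraph installed. The toy runner below is an illustration only: `TinyGraph`, its methods, and the `END` sentinel are hypothetical names for this sketch and are not LangGraph's implementation or API. The random source is injectable so the conditional edge can be exercised deterministically.

```python
import random
from typing import Callable, Dict, TypedDict


class State(TypedDict):
    graph_state: str


END = "__end__"  # hypothetical sentinel for this sketch, not LangGraph's END


class TinyGraph:
    """Toy runner: nodes do work, edges pick the next node, state travels along."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable] = {}
        self.edges: Dict[str, Callable] = {}

    def add_node(self, name: str, fn: Callable) -> None:
        self.nodes[name] = fn

    def add_edge(self, src: str, dst: str) -> None:
        # A normal edge always routes to the same destination.
        self.edges[src] = lambda _state: dst

    def add_conditional_edge(self, src: str, router: Callable) -> None:
        # A conditional edge inspects the state and returns the next node name.
        self.edges[src] = router

    def run(self, start: str, state: dict) -> dict:
        current = start
        while current != END:
            update = self.nodes[current](state)
            state = {**state, **update}  # returned keys override prior values
            current = self.edges[current](state)
        return state


def build(rng: Callable[[], float] = random.random) -> TinyGraph:
    """Wire the three-node mood graph; rng decides which branch is taken."""
    g = TinyGraph()
    g.add_node("node_1", lambda s: {"graph_state": s["graph_state"] + " I am"})
    g.add_node("node_2", lambda s: {"graph_state": s["graph_state"] + " happy!"})
    g.add_node("node_3", lambda s: {"graph_state": s["graph_state"] + " sad!"})
    g.add_conditional_edge("node_1", lambda s: "node_2" if rng() < 0.5 else "node_3")
    g.add_edge("node_2", END)
    g.add_edge("node_3", END)
    return g


if __name__ == "__main__":
    print(build().run("node_1", {"graph_state": "Hi, this is lance."}))
```

The `run` loop shows the override behavior in one line: each node returns only the keys it changed, and those replace the previous values before the edge function chooses where to go next.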
Author(s): Anup Karanjkar Originally published on Towards AI.

Single agent, subagents, skills, agent teams, dynamic workflows: a builder’s map, and the one that isn’t really orchestration

On May 28, Claude Code got its fifth way to run a multi-step job, and I watched a room of good engineers immediately reach for the wrong one.

The article argues that choosing between Claude Code’s multi-step primitives is not primarily about how many agents you want to spawn; agent count is an output, not an input. It presents five ways to run work (single agent, subagents, skills, agent teams, and dynamic workflows), clarifying that skills are orthogonal because they package know-how (e.g., via SKILL.md) and don’t orchestrate or spawn agents. It then sorts the orchestration options with two key questions: who holds the plan (model-held vs code-held, where dynamic workflows move the plan into JavaScript for determinism, repeatability, and verifiable coordination) and how many memories/contexts the task needs (single context vs isolated subagent contexts vs peer coordination via agent teams using shared codebases and hub-and-spoke coordination). Finally, it emphasizes using these questions in order, defaulting to the simplest option that fits, and watching the first run to avoid the “thirty-times tax” from overusing complex orchestration when the job doesn’t require it.
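The two questions above can be read as a small decision function. This is a sketch of one reading of the summary, not the author's code or anything Claude Code exposes: the enum names and `choose_primitive` are hypothetical, and skills are deliberately left out because the article treats them as orthogonal to orchestration.

```python
from enum import Enum


class Plan(Enum):
    MODEL_HELD = "model"  # the model decides the steps as it goes
    CODE_HELD = "code"    # the steps are fixed in code for repeatability


class Contexts(Enum):
    SINGLE = "single"      # one context is enough for the whole job
    ISOLATED = "isolated"  # separate contexts that do not need to talk
    PEERS = "peers"        # separate contexts that must coordinate


def choose_primitive(plan: Plan, contexts: Contexts) -> str:
    """Ask who holds the plan first, then how many contexts the task needs."""
    if plan is Plan.CODE_HELD:
        return "dynamic workflow"
    if contexts is Contexts.SINGLE:
        return "single agent"
    if contexts is Contexts.ISOLATED:
        return "subagents"
    return "agent teams"


if __name__ == "__main__":
    print(choose_primitive(Plan.MODEL_HELD, Contexts.SINGLE))
```

Agent count never appears as an input here, which mirrors the article's claim that it is an output of the other two answers.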
Author(s): Dhanush Kandhan Originally published on Towards AI.

Choose Wisely: Models Should Follow Your Use Case.

A guy in my builder’s discord group blew his entire Codex subscription in eleven days. Two weeks into the month, nothing left. You know what he was building? A billing feature in his SaaS. Not a compiler. Not an operating system kernel. Not a real-time physics simulation. A billing page with subscriptions, invoices, and a Dodo Payments webhook that doesn’t send duplicate emails. He said it with the exhausted pride of someone who just pushed to prod at 2 AM (we devs are batmans, right?). I nodded. I didn’t say anything. But inside I was doing the mental math. I run my full AI stack (coding agent, agent workflows, browser automation, speech to text) for around $10 to $15 a month. And I ship. Regularly (my GitHub is proof of that). With billing features and everything. That conversation is what this post is about.

The Benchmark Theater We All Fell For

Let me describe a pattern you’ve probably noticed. A big AI company/lab drops a new model/version. The announcement lands. Within hours, everyone on X is posting about it. “Our model built a C compiler from scratch.” “Our model achieved gold on the International Math Olympiad.” “Our model solved problems that researchers said required human-level reasoning.” The posts get thousands of likes. Engineers screenshot the benchmark charts. Someone puts together a thread comparing it to the previous generation. Replies flood in from founders saying they’re switching immediately. Then someone from Chennai quietly tries it on their actual codebase and reports back that it’s roughly the same as before for their use case. This tweet gets eleven likes. I’m not mocking the benchmark results. Building a C compiler is impressive. Scoring on the IMO is legitimately hard. These results tell you something real about what the model is capable of in controlled settings.
But here is the question nobody asks loudly enough: when was the last time your actual work required an AI to build a C compiler? Look at what you built last week. Probably a REST endpoint. A React component that talks to it. Some data validation logic. An email template. A webhook handler. A cron job that moves rows between two database tables. Maybe a RAG pipeline if you’re in the AI space. Something with auth. Something with payments. You are not building compiler infrastructure. You are building software for users. Web apps. Mobile apps. Developer tools. Internal automation. The kind of work that, individually, each piece looks boring on a benchmark slide but collectively represents most of the software being written on earth today. The benchmark score tells you the ceiling of what a model can achieve on curated academic tasks. It does not tell you whether the model is the right tool for your Monday morning standup’s ticket queue. I learned this slowly. And expensively. What “Open Source” Actually Means Here? (It’s Not One Thing) Before I get into the specific models, I need to clear up something that trips up engineers constantly. When someone says a model is “open source,” they usually mean one of two very different things, and conflating them leads to bad decisions. The first is open weights. The actual model parameters, the billions of floating point numbers that encode what the model knows are publicly available. You can download them. You can run them on your own hardware. You can fine-tune them on your own data. You can deploy them inside your own VPC and never send a single token to anyone else’s server. You can modify the architecture and release derivatives. Models like GLM-5.2, DeepSeek V4, Kimi K2.6, and Nemotron from NVIDIA are all open-weight models. The weights live on Hugging Face. Most of them ship under MIT licenses, which means you can use them commercially without paying anyone a licensing fee. 
The second is what most of the subscription-based coding tools are: API access. You get to call their endpoint. The model runs on their servers. Their data retention policy applies to your prompts. Their pricing can change next quarter. If their infrastructure has issues on the day you have a demo, that is your problem too. You never see the weights. You cannot run it locally. The model is theirs; you are renting access. The practical difference matters more than most engineers realize until they’ve felt it. With open weights, your inference cost is literally your compute. You can run through OpenRouter or Together AI and pay per token with no monthly subscription, switching to a better model the day it ships. You can cache aggressively. You can self-host if the data sensitivity requires it. You are not locked into anyone’s pricing model. There is also a comfortable middle path, which is what I run: open-weight models accessed through inference providers. Pay per token, no subscription, full flexibility to switch, and the per-token cost is typically a fraction of what the closed model APIs charge.

The Stack. For Real.

I’ve read too many “why I use open source models” posts that are basically just “open source good, closed source bad” with a Hugging Face link at the bottom. Useless. Let me be specific.

GLM-5.2 for Coding via OpenCode

When GLM-5.2 dropped from Z.ai, the Beijing-based lab that used to be called Zhipu AI, the reaction on X (Twitter) was something. Aravind Srinivas posted about it. Guillermo Rauch appreciated it. The Artificial Analysis Intelligence Index ranked it at 51 points, which put it above DeepSeek V4 Pro, Kimi K2.6, and even some Google models. On their GDPval-AA v2 metric, which is their best approximation of real agentic task performance, GLM-5.2 roughly matched GPT-5.5. But you know how it goes. X (Twitter) energy is its own genre. I do not make infra decisions based on who gets quote-tweeted by whom. So I used it.
On a $10/month OpenCode Go plan, using it daily. The billing feature I […]
Author(s): Siddhant Nitin Patil Originally published on Towards AI. You Do Not Need 50 Diffusion Steps. Here Is What Nvidia Proved at GTC. The video diffusion industry has had the same conversation for two years. Better model. More parameters. Higher resolution. Longer clips. Richer motion. And underneath all of it, the same silent constraint that nobody advertises: generating a single second of 720p video still takes long enough to make most real-time use cases a fantasy. At GTC 2026 in San Jose, Nvidia’s Ziv Ilan from the AI Labs team in Paris gave a 20-minute talk that reframed the problem entirely. The title: You Might Not Need 50 Diffusion Steps. The argument was not about a new model. It was about what happens when you stop treating the step count as a fixed constraint and start treating it as an engineering variable. Why Step Count Is the Real Bottleneck Diffusion models generate images and videos through iterative denoising. Random noise gets progressively cleaned up across a series of steps, each step moving the output closer to the final result. Standard production models run 20 to 50 denoising steps. Each step is a full forward pass through a model that, in the case of modern video diffusion architectures, can have 20 to 40 billion parameters. The math compounds fast. A single 1,328 x 1,328 image generated with Qwen-Image involves approximately 12,900 TFLOPs of computation, producing a latency of up to 127 seconds per image on an Nvidia H20 GPU. For video, where you need consistent quality across frames with temporal coherence, the compute demand grows faster than linearly with resolution and duration. This is why Adobe’s Firefly video generation model, before optimization, was architecturally capable but commercially constrained. State-of-the-art image diffusion already took tens of seconds per image. Video diffusion with a 50-step process at production resolution was simply not viable for interactive or real-time applications. 
The path forward was not a bigger model. It was a smarter inference stack. The Three-Technique Stack Ilan’s talk organized the solution space into three composable techniques: quantization, caching, and distillation. Critically, these are not alternatives. They are stackable. You deploy them in combination, and each one adds a multiplier to the performance gains of the others. Quantization: Making Each Step Cheaper Quantization reduces the numerical precision of the model’s weights and activations from 16-bit or 32-bit floating point to lower-precision formats: INT8, FP8, or even FP4 in the latest research. For LLMs, the impact of quantization is well understood and well documented. Diffusion models present a more complex picture because they are attention-heavy in ways that LLMs are not. The multi-head attention mechanisms in transformer-based diffusion architectures (DiT models) are more sensitive to precision loss than the feed-forward layers in autoregressive models. This means that naive quantization approaches developed for LLMs often produce measurable quality degradation in diffusion models even at INT8 precision. The solution Nvidia has deployed in production, demonstrated through their collaboration with Black Forest Labs on Flux 2, uses dynamic quantization rather than static quantization. Static quantization pre-computes the activation range across a calibration dataset and applies fixed scaling factors at inference time. Dynamic quantization computes activation ranges on the fly per batch, adapting to the actual data distribution being processed. For diffusion models where the latent space evolves significantly across denoising steps, dynamic quantization maintains quality that static approaches cannot match. The hardware layer amplifies this further. 
Nvidia’s Blackwell architecture introduced NVFP4 support, a 4-bit floating point format that, combined with Blackwell’s dedicated FP4 tensor cores, delivers performance gains that dwarf what FP8 achieved on Hopper. In ComfyUI benchmarks, NVFP4 optimizations on RTX 50-series cards delivered up to 3x performance boosts over FP16 baselines. For Stable Diffusion 3.5 Large, FP8 quantization alone cuts the VRAM requirement from 18GB to 11GB, opening up mid-range 12GB GPUs for a model that previously required 24GB. The Adobe Firefly case is the most concrete enterprise data point. Using TensorRT with mixed FP8 and BF16 precision on Hopper GPUs via AWS EC2 P5 instances: 60% latency reduction, 40% total cost of ownership reduction, serving more users with fewer GPUs. This is not a research result. It is a production deployment that is live today. One important note from Ilan on diffusion-specific quantization considerations: because these models are more attention-heavy than LLMs, the memory savings from quantization are less dramatic than in the LLM world. The performance gains still matter, but the ratio of memory benefit to compute benefit is different. Quantization should be treated as the entry-point optimization, the lowest-friction gain available, rather than the primary strategy. Quantization gets you into the field. Caching and distillation win the game. Caching: Skipping the Computation You Already Did The second technique exploits a property of diffusion that is counterintuitive until you see it: adjacent denoising steps are highly redundant. When a diffusion model runs 50 steps to generate a video frame, the feature representations in the model’s internal layers do not change dramatically between step 23 and step 24. The high-level structure, the composition, the semantic layout, these are largely determined in the early steps. The middle steps refine. The late steps clean up residual noise and adjust texture. 
Large swaths of the computation happening in steps 24 through 48 are recalculating values that changed very little from the previous step. This is the same insight that motivated KV caching in LLMs: if you have already computed something and it has not changed meaningfully, do not recompute it. In the autoregressive case, KV cache is straightforward because you are generating one token at a time and the previously computed keys and values are definitionally unchanged. In diffusion, the cache mechanics are more complex because you are denoising across a full latent space simultaneously, but the redundancy is real and measurable. T-cache, the approach Ilan referenced in his talk, operates at the full pixel or latent space level. It computes a similarity metric between the current denoising step’s output and the previous step’s output. If the change […]
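The redundancy argument above can be made concrete with a toy loop. What follows is a minimal sketch of threshold-based step caching in general, not the specific rule used by the approach named in the talk, which this excerpt cuts off before describing. The `toy_model` function, the relative-L1 change metric, and the `threshold` value are all illustrative assumptions.

```python
def relative_change(current, reference):
    """Relative L1 distance between two latents (an assumed similarity metric)."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference) or 1.0
    return num / den


def toy_model(latent, step):
    """Hypothetical stand-in for one expensive denoising forward pass."""
    return [-0.1 * x for x in latent]


def denoise_with_cache(latent, steps, threshold):
    """Run `steps` updates, reusing the last model output while the input has barely moved."""
    cached_update = None   # last update actually computed by the model
    reference = None       # latent that produced cached_update
    model_calls = 0
    for step in range(steps):
        if cached_update is not None and relative_change(latent, reference) < threshold:
            update = cached_update            # cache hit: skip the forward pass
        else:
            update = toy_model(latent, step)  # cache miss: pay for a full pass
            cached_update, reference = update, list(latent)
            model_calls += 1
        latent = [x + u for x, u in zip(latent, update)]
    return latent, model_calls


final_latent, calls = denoise_with_cache([1.0, -2.0, 0.5], steps=20, threshold=0.15)
print(f"model calls: {calls} of 20 steps")
```

Raising `threshold` skips more forward passes at the cost of drifting further from the uncached result, which is the trade-off any scheme of this kind has to tune.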
Author(s): Ayo Akinkugbe Originally published on Towards AI. Understanding Reinforcement Learning — A Primer Photo by Girl with red hat on Unsplash Introduction: Learning by Trial and Error Imagine teaching a dog to fetch a ball. You don’t hand the dog a manual titled “The Complete Guide to Ball Retrieval.” Instead, you throw the ball, and when the dog brings it back, you give it a treat. When the dog gets distracted and wanders off, you withhold the treat. Over dozens of repetitions, the dog learns that bringing the ball back leads to rewards, while ignoring the ball doesn’t. This process of learning through interaction, experimentation, and feedback is exactly what reinforcement learning does for artificial intelligence. Teaching a dog to fetch a ball A Different Type of Learning : Supervised, Unsupervised, Reinforced Reinforcement learning is fundamentally different from the other types of machine learning you might be familiar with. In supervised learning, we show the algorithm thousands of examples with correct answers, like showing a child flashcards where one side has a picture of an apple and the other side has the word “apple.” In unsupervised learning, we give the algorithm data without answers and ask it to find patterns, like asking someone to organize a messy drawer without telling them how. But in reinforcement learning, we do something more interesting: we place an agent in an environment, give it a goal, and let it figure out how to achieve that goal through experimentation. The agent doesn’t know the right answer in advance. It doesn’t have a dataset of correct moves to learn from. Instead, it takes actions, observes what happens, receives rewards or penalties, and gradually learns which actions tend to lead to good outcomes and which ones don’t. This is how DeepMind’s AlphaGo learned to beat world champions at Go, how robotic arms learn to grasp objects, and how autonomous vehicles learn to navigate roads. 
The agent learns by doing, making mistakes, and slowly improving its strategy based on the consequences of its actions. “In reinforcement learning, the agent doesn’t know the right answer in advance. It doesn’t have a dataset of correct moves to learn from. Instead, it takes actions, observes what happens, receives rewards or penalties and gradually learns which actions tend to lead to good outcomes and which ones don’t.” The Core Components of Reinforcement Learning At the heart of every reinforcement learning problem are 5 fundamental components that work together in a continuous loop. Understanding each of these components and how they interact is essential to grasping how reinforcement learning actually works. Agent The agent is the learner or decision-maker. In our dog example, the dog is the agent. In a video game, the agent might be the character you control. In a self-driving car, the agent is the AI system making decisions about steering, acceleration, and braking. The agent exists to make decisions, and its entire purpose is to learn which decisions lead to the best outcomes. The agent doesn’t start out knowing anything; it begins with a blank slate and learns entirely from experience. Environment The environment is everything the agent interacts with. It’s the world in which the agent operates. For the dog, the environment includes the room, the ball, you as the trainer, and all the physical laws that govern how balls bounce and roll. For a chess-playing agent, the environment is the chessboard and the rules of chess. For a trading algorithm, the environment is the stock market with all its complexity, volatility, and rules. The environment responds to the agent’s actions and provides feedback. It’s important to note that the agent doesn’t control the environment; it can only influence it through its actions. 
“The agent doesn’t control the environment; it can only influence it through its actions.” State A state represents a specific situation or configuration of the environment at a particular moment in time. When you’re teaching the dog to fetch, one state might be “ball has just been thrown and is in the air,” another state might be “ball has landed fifteen feet away,” and another might be “dog has ball in mouth and is five feet from owner.” States capture all the relevant information the agent needs to make a decision. In a video game, the state might include the positions of all characters, their health levels, available items, and the current score. The quality of the state representation is key: if you don’t include important information in your state, the agent won’t be able to make good decisions. “A state represents a specific situation or configuration of the environment at a particular moment in time.” Action An action is something the agent can do to interact with the environment. Actions are the agent’s way of influencing its world. For the dog, actions might include “run toward ball,” “pick up ball,” “run toward owner,” or “lie down and take a nap.” For a chess agent, actions are the legal moves available given the current board position. For a robot learning to walk, actions are the specific motor commands sent to each joint and actuator. The set of available actions can change depending on the current state. In chess, the legal moves change with every move made. In the fetch example, the dog can’t pick up the ball if the ball isn’t within reach. “An action is the agent interacting with or influencing the environment.” Reward The reward is the feedback signal that tells the agent whether its action was good or bad. Rewards are numbers: positive numbers for good outcomes and negative numbers (penalties) for bad outcomes. When the dog brings the ball back, it gets a positive reward (the treat, which we might represent as +10). 
When it ignores the ball, it gets zero or even a small negative reward (no treat, perhaps represented as -1 or 0). The reward is the only way the environment communicates value to the agent. The agent’s entire learning process is driven by a single objective: maximize […]
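The five components described above (agent, environment, state, action, reward) can be wired together in a few lines. The following is a minimal sketch built on the fetch example, assuming a three-state toy environment and a tabular Q-learning update; Q-learning is one common choice of learning rule, not something this excerpt specifies. The state names, the zero reward for intermediate progress, and the hyperparameters are illustrative assumptions, while the +10 and -1 rewards come from the text.

```python
import random

ACTIONS = ["run_to_ball", "pick_up_ball", "return_to_owner", "nap"]


class FetchEnv:
    """Toy environment with three states: ball_thrown -> at_ball -> has_ball."""

    def reset(self):
        self.state = "ball_thrown"
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        progress = {
            ("ball_thrown", "run_to_ball"): "at_ball",
            ("at_ball", "pick_up_ball"): "has_ball",
        }
        if (self.state, action) in progress:
            self.state = progress[(self.state, action)]
            return self.state, 0, False
        if self.state == "has_ball" and action == "return_to_owner":
            return "done", 10, True      # the treat
        return self.state, -1, False     # wasted action: small penalty


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by trial and error; returns a (state, action) -> value table."""
    rng = random.Random(seed)
    env = FetchEnv()
    q = {}
    for _ in range(episodes):
        state = env.reset()
        for _ in range(20):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))  # exploit
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(q.get((next_state, a), 0.0) for a in ACTIONS)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
            if done:
                break
    return q


q = train()
policy = {s: max(ACTIONS, key=lambda a: q.get((s, a), 0.0))
          for s in ["ball_thrown", "at_ball", "has_ball"]}
print(policy)
```

Nothing in the code tells the agent that fetching is a three-step sequence. The only signal is the number returned by `step`, and the learned table should still settle on run, pick up, return.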
Author(s): Enzo Lombardi Originally published on Towards AI. State machines for multi-step tasks The loop in Part 1 handles a class of question that fits in one breath: read this file, list that directory, answer the user. Two turns, three turns, done. As long as the model can plan and execute inside one conversation, the loop is enough. Beyond the initial loop, the article argues that real multi-step agent work needs a state-machine structure to handle composition of phases, human approval gates, and durability across failures or restarts. It explains how Eugene v0.4 introduces a typed graph system in Rust: nodes that represent phases and return transitions (goto, halt, interrupt), a graph runner that drives execution with checkpointing via a SQLite checkpointer, and an interrupt mechanism for human-in-the-loop pauses. The post also covers plan-mode lineage (permission modes like read-only vs approve-before-destructive), generalized gating via hooks (before/after node hooks for permissions, logging, budgets, and other cross-cutting concerns), and how retries should be placed at the correct scope (HTTP call vs whole node). Finally, it demonstrates a practical three-node “draft → review → revise” graph and concludes with what this design enables next (multi-agent parallelism) plus where to find the full code and related background. Read the full blog for free on Medium.
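The node, transition, and checkpoint ideas summarized above can be illustrated with a small stand-in. The following is a minimal Python sketch, not Eugene's Rust API: the `Goto`, `Halt`, and `Interrupt` classes, the `run` function, the `checkpoints` table layout, and the `approved` flag are all hypothetical names chosen for illustration. It wires up the three-node draft, review, revise graph and pauses at review until a human approves.

```python
import json
import sqlite3
from dataclasses import dataclass


@dataclass
class Goto:
    node: str        # continue at this node


@dataclass
class Halt:
    pass             # the graph is finished


@dataclass
class Interrupt:
    resume_at: str   # node to re-enter after a human responds
    reason: str


def draft(state):
    state["text"] = "first draft"
    return Goto("review")


def review(state):
    if not state.get("approved"):
        return Interrupt(resume_at="review", reason="needs human approval")
    return Goto("revise")


def revise(state):
    state["text"] += " (revised)"
    return Halt()


NODES = {"draft": draft, "review": review, "revise": revise}


def run(db, run_id, updates=None):
    """Drive the graph from its last checkpoint (or from 'draft') until it halts or pauses."""
    db.execute("CREATE TABLE IF NOT EXISTS checkpoints "
               "(run_id TEXT PRIMARY KEY, node TEXT, state TEXT)")
    row = db.execute("SELECT node, state FROM checkpoints WHERE run_id = ?",
                     (run_id,)).fetchone()
    node, state = (row[0], json.loads(row[1])) if row else ("draft", {})
    state.update(updates or {})  # e.g. a human's approval
    while True:
        transition = NODES[node](state)
        if isinstance(transition, Halt):
            return "halted", state
        paused = isinstance(transition, Interrupt)
        node = transition.resume_at if paused else transition.node
        # Persist where to continue, so a crash or restart loses at most one node of work.
        db.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                   (run_id, node, json.dumps(state)))
        db.commit()
        if paused:
            return "interrupted", state


db = sqlite3.connect(":memory:")
print(run(db, "demo"))                               # pauses at review
print(run(db, "demo", updates={"approved": True}))   # resumes and finishes
```

Because the checkpoint is written after every transition, the second call does not rerun `draft`; it picks up at the node recorded in the table. A file-backed database in place of `:memory:` would let the pause survive a process restart.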
Author(s): Enzo Lombardi Originally published on Towards AI. Multi-agent crews The single-agent loop in Part 1 was enough for one question, one tool, one answer. The state machine in Part 4 handled a task with phases. Neither helps when the work itself wants to be divided. Some questions are better answered by a researcher who gathers facts, a skeptic who pokes holes, and an editor who reconciles them: three different jobs, three different system prompts, three different temperatures, three different lenses on the same input. Forcing one agent to wear all three hats is asking it to be three things at once, and the result is the kind of confidently wrong middle that nobody ordered. The article explains why “multi-agent crews” work when roles are genuinely distinct, and why adding agents usually increases latency, cost, and error surface (“the multi-agent trap”). It introduces an `Agent` trait (mapping a free-form query to a free-form answer), then shows how crews orchestrate specialists either in parallel (`run_parallel` with `join_all` for low wall-clock time) or sequentially (`run_sequential` for pipelines). It covers routing (letting a router model pick a specialist), debate/verification protocols (pro/con agents with an optional judge or adversarial critic for high-stakes domains), and how Eugene v0.5 implements these ideas with types, dispatch methods, and example outputs. Finally, it describes how crews compose with earlier crates (skills/graphs) and previews what comes next (provider abstraction for swapping LLM backends via `eugene-providers`). Read the full blog for free on Medium.
Last Updated on June 22, 2026 by Editorial Team Author(s): Dylan Tartarini Originally published on Towards AI. Compounding knowledge using AI Agents Some time ago, Andrej Karpathy released a GitHub Gist containing a guide, or better, an intuition on how to build one’s own personal knowledge base. The core philosophy behind the concept is simple and to the point: [Figure: graph view from my own study notes] The author explains that while the original LLM-wiki idea emphasizes compiling personal notes into a compounding markdown wiki via an LLM agent, most implementations are too developer-centric, so they build their own approach (DyResearch). They outline the shift from a single coding assistant toward a team/faculty of specialized agents integrated with Obsidian, combining a compounding wiki concept with local, lightweight retrieval through a dual storage architecture. They describe the agent roles (Study Coordinator, Professor, Librarian, Researcher, Note Taker), how DyResearch is served via a FastAPI backend and connected to Obsidian through a custom community plugin, and how the system manages sessions/events and source retrieval. Finally, they detail their implementation choices for orchestration (Google ADK), database/session persistence (Postgres + pgvector vs local-first SQLite + LanceDB), and the plugin features that let users chat, ingest documents, and automatically generate or update notes inside their vault. Read the full blog for free on Medium.
Last Updated on June 22, 2026 by Editorial Team Author(s): GSO1 Originally published on Towards AI. Why ChatGPT Is More Than Autocomplete Figure by the author with assistance from Claude (Anthropic) Calling a large language model (LLM) like ChatGPT “autocomplete” is not exactly wrong, but it is deeply misleading. Most of us think of autocomplete as a text-completion tool: a phone keyboard guessing the next word, a search bar finishing a phrase. But that picture is not powerful enough to explain what LLMs actually produce: explanations, analogies, plans, summaries, arguments, code, stories, dialogue. A transformer-based LLM does predict one token at a time (roughly a word, though in practice often a word fragment), but that visible sequence is only the surface trace of a much richer hidden process. Before each word appears, the model has built a high-dimensional internal state that reflects the topic, the context, the tone, the intent, and the likely directions the answer could take. The next token is not read off the prompt. It is read off this internal state. That is why a system trained only to predict the next token produces explanations, arguments, analogies, plans, and dialogue that feel like far more than anything we would call autocomplete. Previous articles in this series developed a geometric picture of the transformer’s internal state in terms of attention: attention as a coupled free-energy minimization and weighted least squares problem [Article 1], simplification of attention to two operators in one space [Article 2], and attention as a geometric flow [Article 3]. This article steps back and asks a plainer question (what is actually happening when a chatbot answers you, and why is autocomplete the wrong picture?) and answers it with a minimum of mathematical machinery. The autocomplete misunderstanding Calling an LLM “autocomplete” is tempting because, at one level, it is true. Given a sequence of text, the model predicts the next token. 
That token is appended to the input, which is fed back in for the next prediction. Stepping through this loop produces the visible response, one token at a time, that has the appearance of autocomplete. The problem is that this picture describes what goes on “across” steps but ignores what goes on “within” steps. Across steps, things feel like autocomplete: the sequential generation of individual words that cohere with previously generated text. (Researchers properly call this process “autoregressive”, but we’ll stick with the less-than-proper autocomplete for now.) The missing piece is the hidden computation taking place within a step that generates the next token. Within a step, before a new token is generated, the input prompt is transformed into a high-dimensional internal state. The tokens of the prompt are not treated as isolated words in a list; they are interpreted in relation to one another via the attention mechanism. A question, a definition, a metaphor, a constraint, a conversational tone: all of these shape the internal state from which the next token is drawn. So the next token is not predicted from the input text alone as autocomplete suggests. It is predicted from a rich internal state built from the text. The model then projects a small part of that state into the vocabulary to choose the next token, appends the token to the input text, and rebuilds the state over the longer text for the next prediction step. The observed output is the result of multiple steps through that loop. “Autocomplete,” then, is technically defensible but conceptually misleading. It names the final visible act of generation and ignores the machinery that makes the act possible. An LLM like ChatGPT does not merely “complete” the observed text. It repeatedly reconstructs meaning over a growing context and projects part of that meaning back into language, one token at a time. 
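The build-state, project, append loop described above can be shown at toy scale. The following is a deliberately tiny sketch, assuming a seven-word vocabulary with hand-picked two-dimensional vectors and a plain average in place of real attention layers; every name and number in it is hypothetical, and a real transformer is far richer. It shows only the shape of the loop: the next token is chosen from a state built over the whole context, not from the last word alone.

```python
VOCAB = ["the", "river", "bank", "money", "water", "flows", "lends"]

# Hypothetical embeddings: axis 0 is roughly "nature", axis 1 roughly "finance".
EMBED = {
    "the": (0.1, 0.1), "river": (1.0, 0.0), "bank": (0.5, 0.5), "money": (0.0, 1.0),
    "water": (0.9, 0.0), "flows": (0.8, 0.1), "lends": (0.1, 0.8),
}


def build_state(tokens):
    """Within a step: mix every context token into one vector (a crude stand-in for attention)."""
    n = len(tokens)
    return tuple(sum(EMBED[t][i] for t in tokens) / n for i in range(2))


def project(state, context):
    """Score each vocabulary token against the state and return the best one.

    Tokens already in the context are skipped, purely to keep the toy from repeating itself.
    """
    scores = {w: sum(s * e for s, e in zip(state, EMBED[w]))
              for w in VOCAB if w not in context}
    return max(scores, key=scores.get)


def generate(prompt, n_new):
    """Across steps: append the chosen token, then rebuild the state over the longer text."""
    tokens = list(prompt)
    for _ in range(n_new):
        state = build_state(tokens)
        tokens.append(project(state, tokens))
    return tokens


print(generate(["the", "river", "bank"], 2))
print(generate(["the", "money", "bank"], 2))
```

Both prompts end in the same word, yet the continuations differ because the state is rebuilt from everything that came before it.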
Where the next token really comes from An LLM predicts the next token not from the input text alone, but from a dense internal representation built by sequentially processing the text through multiple layers of the model. When a prompt is input to the model, each token is turned into a vector, a point in a high-dimensional space whose location already encodes what pre-training has learned about that token: its meanings, its grammatical roles, the company it tends to keep. This initial cloud of points is only the starting arrangement. As the cloud passes through the model’s many layers, each token/point absorbs information from the rest of the text, so its final position reflects not just the word it started as but the role that word plays in context with the rest of the cloud. The word “bank” has a different vector representation in “On the river bank” and “Call the investment bank” because the nearby words change its position. The same is true of the text as a whole: a question, an example, a requested tone, a constraint, a prior phrase; each reshapes the text, changing not only what a given word means but what kind of answer becomes likely. Context does not only disambiguate individual words. It shapes the entire internal state from which the next token is predicted. Consider four prompts: “Explain E = mc².” pushes toward teaching, physics, symbol definitions, accessible explanation. “Explain E = mc² to a 10-year-old.” shifts toward simpler vocabulary, analogy, a gentler tone. “Explain E = mc² in one sentence.” adds brevity and compression. “Explain E = mc² using calculus.” shifts toward a more technical, mathematical treatment. The idea to be explained is identical in all four. The surrounding context changes the kind of answer that becomes likely. This is the key point: the next token is not predicted from the text. It is predicted from the model’s internal state after the text has passed through many layers of interaction and conditioning. 
That state is not a sentence, a paragraph, a private monologue, or a plan. It is a distributed, high-dimensional representation that holds many things at once — topic, syntax, style, intention, discourse structure, […]
Last Updated on June 22, 2026 by Editorial Team Author(s): GSO1 Originally published on Towards AI. Why ChatGPT Is More Than Autocomplete Figure by the author with assistance from Claude (Anthropic) Calling a large language model (LLM) like ChatGPT “autocomplete” is not exactly wrong, but it is deeply misleading. Most of us think of autocomplete as a text-completion tool: a phone keyboard guessing the next word, a search bar finishing a phrase. But that picture is not powerful enough to explain what LLMs actually produce — explanations, analogies, plans, summaries, arguments, code, stories, dialogue. A transformer-based LLM does predict one token at a time — roughly a word, though in practice often a word-fragment — but that visible sequence is only the surface trace of a much richer hidden process. Before each word appears, the model has built a high-dimensional internal state that reflects the topic, the context, the tone, the intent, and the likely directions the answer could take. The next token is not read off the prompt. It is read off this internal state. That is why a system trained only to predict the next token produces explanations, arguments, analogies, plans, and dialogue that feel far more than anything we would call autocomplete. Previous articles in this series developed a geometric picture of the transformer’s internal state in terms of attention— attention as a coupled free-energy minimization and weighted least squares problem [Article 1], simplification of attention to two operators in one space [Article 2], and attention as a geometric flow [Article 3]. This article steps back and asks a plainer question — what is actually happening when a chatbot answers you, and why is autocomplete the wrong picture — and answers it with a minimum of mathematical machinery. The autocomplete misunderstanding Calling an LLM “autocomplete” is tempting because, at one level, it is true. Given a sequence of text, the model predicts the next token. 
That token is appended to the input, which is fed back in for the next prediction. Stepping through this loop produces the visible response, one token at a time, that has the appearance of autocomplete. The problem is that this picture describes what goes on “across” steps but ignores what goes on “within” steps. Across steps, things feel like autocomplete — the sequential generation of individual words that cohere with previously generated text. (Researchers properly call this process “autoregressive”, but we’ll stick with the less than proper autocomplete for now.)” The missing piece is the hidden computation taking place within a step that generates the next token. Within a step, before a new token is generated, the input prompt is transformed into a high-dimensional internal state. The tokens of the prompt are not treated as isolated words in a list; they are interpreted in relation to one another via the attention mechanism. A question, a definition, a metaphor, a constraint, a conversational tone — all of these shape the internal state from which the next token is drawn. So the next token is not predicted from the input text alone as autocomplete suggests. It is predicted from a rich internal state built from the text. The model then projects a small part of that state into the vocabulary to choose the next token, appends the token to the input text, and rebuilds the state over the longer text for the next prediction step. The observed output is the result of multiple steps thru that loop. “Autocomplete,” then, is technically defensible but conceptually misleading. It names the final visible act of generation and ignores the machinery that makes the act possible. An LLM like ChatGPT does not merely “complete” the observed text. It repeatedly reconstructs meaning over a growing context and projects part of that meaning back into language, one token at a time. 
Where the next token really comes from An LLM predicts the next token not from the input text alone, but from a dense internal representation built by sequentially processing the text through multiple layers of the model. When a prompt is input to the model, each token is turned into a vector: a point in a high-dimensional space whose location already encodes what pre-training has learned about that token, including its meanings, its grammatical roles, and the company it tends to keep. This initial cloud of points is only the starting arrangement. As the cloud passes through the model’s many layers, each token/point absorbs information from the rest of the text, so its final position reflects not just the word it started as but the role that word plays in context with the rest of the cloud. The word “bank” has a different vector representation in “On the river bank” and “Call the investment bank” because nearby words change its position. The same is true of the text as a whole: a question, an example, a requested tone, a constraint, a prior phrase each reshape the internal state, changing not only what a given word means but what kind of answer becomes likely. Context not only disambiguates individual words. It shapes the entire internal state from which the next token is predicted. Consider four prompts. “Explain E = mc².” pushes toward teaching, physics, symbol definitions, accessible explanation. “Explain E = mc² to a 10-year-old.” shifts toward simpler vocabulary, analogy, a gentler tone. “Explain E = mc² in one sentence.” adds brevity and compression. “Explain E = mc² using calculus.” shifts toward a more technical, mathematical treatment. The idea to be explained is identical in all four. The surrounding context changes the kind of answer that becomes likely. This is the key point: the next token is not predicted from the text directly. It is predicted from the model’s internal state after the text has passed through many layers of interaction and conditioning.
That state is not a sentence, a paragraph, a private monologue, or a plan. It is a distributed, high-dimensional representation that holds many things at once — topic, syntax, style, intention, discourse structure, […]
Last Updated on June 22, 2026 by Editorial Team Author(s): Utkarsh Mittal Originally published on Towards AI. Part 13: Design the Recommender System. Part 12: https://medium.com/p/75cf0a345156 The article explains how to design a production recommender system using a real end-to-end scenario and concrete latency, data, and training considerations. It argues that business objectives differ from what can be directly labeled, and that ranking (not simple classification) with measurable proxy signals is central. It outlines what the system must do and must never do under tight latency constraints, why scale forces a two-stage architecture (fast retrieval followed by richer ranking), and how cold-start and feedback sparsity shape training data. It covers how labels are constructed, how negatives and time-based splits avoid bias and leakage, how feature stores prevent training-serving skew, and how two-tower retrieval with dot products and softmax training works in practice. It then discusses ranking with baselines like LightGBM and richer wide & deep models, calibration and multi-task refinements, and evaluation using Recall@K for retrieval and NDCG@K for ranking with error slicing. Finally, it walks through the live 200ms execution pipeline, operational optimizations (quantization, batching, caching, fallbacks, shadow mode), online evaluation methods (A/B bucketing, interleaving), and the four major ways such systems “rot” via monitoring gaps, popularity spirals, offline-online mismatch, and training-serving skew, closing with what distinguishes mid-level, senior, and staff engineering work in recommender systems. Read the full blog for free on Medium.
Last Updated on June 22, 2026 by Editorial Team Author(s): Chew Loong Nian – AI ENGINEER Originally published on Towards AI. The Best Engineers Stopped Writing Prompts: The 4 Layers That Replaced Prompt Engineering Boris Cherny built Claude Code. In June 2026 he said the quiet part out loud: “I don’t prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops.” In four years, the highest-value skill in applied AI has been rewritten 4 times — from prompts, to context, to the harness, to the loop. Each rewrite moved the job one layer outward, and each layer trades less manual operation for more system design. After the opening, the article lays out a “through-line” for why the job keeps shifting: each new layer wraps the previous one. It details Layer 1 (prompt engineering), where the key object is a single input string and the challenge is phrasing and tool-use via prompt structure (few-shot, chain-of-thought, ReAct), but notes its brittleness and the assumption that the model already has everything it needs. It then covers Layer 2 (context engineering), focused on filling a limited context window with the right information using retrieval, memory, summarization, and strategies to prevent context rot—yet still observes that humans remain responsible for choosing what gets retrieved and when. Layer 3 (harness engineering) is presented as the environment around the agent—tools, permissions, sandboxing, lifecycle hooks, retries, traces, and sub-agents—moving reliability concerns into configuration rather than just model behavior. Finally, it introduces Layer 4 (loop engineering), where the system is scheduled and run repeatedly without constant manual prompting: triggers, goal/state persistence, scouting tasks, invoking harnessed agents, verifying outputs (often with a second agent), and writing memory across iterations. 
The piece concludes with a diagnostic to identify which layer you’re practicing, a “climb one layer” starting path, and a verdict that the next frontier after loops will likely involve fleets of coordinating loops, pushing the highest-paid skills further outward from direct prompt writing. Read the full blog for free on Medium.
Last Updated on June 22, 2026 by Editorial Team Author(s): Dr Swarneendu AI Originally published on Towards AI. There are next-word predictions your model is mathematically forbidden from making. Not unlikely. Forbidden, the way a piano with too few keys cannot play a note that lies past its keyboard. The proof needs nothing but small whole numbers and patience. We will do every step on paper. Read the full blog for free on Medium.
Last Updated on June 22, 2026 by Editorial Team Author(s): DhanushKumar Originally published on Towards AI. Every Python Concept a Generative AI Developer Actually Needs to Know From async coroutines that power real-time LLM streaming, to memory tricks that let you process million-document datasets: the complete map, written for engineers building with AI today. Most Python tutorials teach you the language. This one teaches you the language as a GenAI engineer uses it, where every concept has a direct line to a real problem you will hit building LLM pipelines, RAG systems, and AI agents. Async / Await: The Heartbeat of Every LLM App Here is the brutal truth about building LLM applications: your code spends most of its time waiting. Waiting for the LLM to respond. Waiting for an embedding API. Waiting for a vector database. Without async, each worker serves one request at a time. With async, a single thread can keep thousands of requests in flight concurrently. What actually happens when you write await When Python hits an await expression whose result is not ready yet, it pauses the current coroutine and hands control back to the event loop. The event loop looks at everything else that's ready to run, makes progress on it, and returns to your coroutine once the awaited result is available. No extra threads. No OS context switches. Pure cooperative multitasking.
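The hand-off described above can be observed directly with a toy example that needs no API key. This is a minimal sketch: `worker` and `events` are illustrative names, and `asyncio.sleep(0)` stands in for any awaited I/O call, since it simply yields control to the event loop once.

```python
import asyncio

events = []  # records the order in which the coroutines actually run


async def worker(name: str, steps: int) -> None:
    for i in range(steps):
        events.append(f"{name}{i}")
        # Stand-in for an I/O wait: pause here and let the event loop
        # run whatever else is ready.
        await asyncio.sleep(0)


async def main() -> list:
    # Both workers share one thread; they take turns at each await.
    await asyncio.gather(worker("a", 2), worker("b", 2))
    return events


order = asyncio.run(main())
print(order)
```

The two workers interleave step by step instead of running one after the other, even though only one thread is involved. Each `await` is the point where a coroutine gives up control.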
import asyncio

import anthropic

# Reads ANTHROPIC_API_KEY from the environment
client = anthropic.AsyncAnthropic()


async def ask_claude(prompt: str, label: str) -> str:
    # While this request is in flight, the await hands control back to the
    # event loop so the other coroutines can make progress
    message = await client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return f"[{label}]{message.content[0].text}"


async def main():
    questions = [
        ("What is a transformer architecture?", "A"),
        ("Explain RAG in one paragraph.", "B"),
        ("What is chain-of-thought prompting?", "C"),
        ("Describe the attention mechanism briefly.", "D"),
        ("What is a vector database used for?", "E"),
    ]
    # All 5 fire at once: total time ≈ slowest single call (~2s)
    # Sequential would take ~10s
    results = await asyncio.gather(
        *[ask_claude(q, label) for q, label in questions]
    )
    for r in results:
        print(r)


asyncio.run(main())

⚡ Real-World Impact
Sequential LLM calls for 100 documents × 3 seconds each = 5 minutes. With asyncio.gather() they run concurrently and can finish in roughly 3–5 seconds, provided the API's rate limits allow that many requests in flight at once. That's up to a 60× speedup with zero extra hardware.

Tasks: fire and forget (then collect later)
asyncio.create_task() schedules a coroutine on the event loop immediately, without waiting for it to finish. This lets you kick off parallel work and collect results later, which is perfect for RAG pipelines where you retrieve from a vector store and a web search at the same time.
# search_vector_db, search_web, build_context and call_llm are placeholders
# for your own retrieval and LLM helpers
async def rag_pipeline(query: str) -> str:
    # Phase 1: kick off both retrievals simultaneously
    task_vector = asyncio.create_task(search_vector_db(query))
    task_web = asyncio.create_task(search_web(query))

    # Both run concurrently while we do other prep work
    system_prompt = "You are a helpful research assistant"

    # Collect results: each await blocks only until that task is ready
    vector_hits, web_hits = await task_vector, await task_web
    context = build_context(vector_hits, web_hits)

    # Phase 2: single LLM call with full context
    return await call_llm(system_prompt, context, query)

# Total latency: max(vector_latency, web_latency) + llm_latency

Streaming tokens in real time with async generators
ChatGPT-style streaming, where tokens appear as they’re generated, maps naturally onto async generators. Instead of waiting for the full response, you yield each chunk of text as it arrives and forward it to the client immediately.

import asyncio

import anthropic

client = anthropic.AsyncAnthropic()


async def stream_response(prompt: str):
    """Async generator: yields text chunks as they arrive from the LLM."""
    async with client.messages.stream(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield text  # each chunk arrives here, roughly 50 ms apart


async def handle_request(prompt: str):
    full_text = ""
    async for token in stream_response(prompt):
        print(token, end="", flush=True)  # real-time display
        full_text += token
    print()
    return full_text


asyncio.run(handle_request("Explain diffusion models simply."))

Locks: protecting shared state across coroutines
Even though asyncio is single-threaded, race conditions exist. If two coroutines both read a shared counter, hit an await, and then write the value back without a lock, one of the updates is lost. asyncio.Lock ensures only one coroutine is inside the critical section at a time.
import asyncio
from collections import defaultdict

request_counts: dict[str, int] = defaultdict(int)
lock = asyncio.Lock()


async def tracked_embed(text: str, model: str) -> list[float]:
    async with lock:
        request_counts[model] += 1
        if request_counts[model] > 1000:
            raise RuntimeError(f"Daily limit hit for {model}")
    # call_embedding_api is a placeholder for your embedding client; the lock
    # is released first so slow network calls are not serialized
    return await call_embedding_api(text, model)

Section 02: Threading: When Your Library Doesn’t Speak Async
Many powerful Python libraries (requests, some database drivers, HuggingFace's synchronous API) are blocking. You can't just slap await on them. But you also can't leave performance on the table. Threading is the answer.

The GIL: what it blocks and what it doesn’t
The Global Interpreter Lock (GIL) is a mutex in CPython that prevents more than one thread from running Python bytecode at the same time. It sounds like threading is useless, but it isn’t, because the GIL is released during:

I/O operations: Network calls, file reads/writes, socket operations. The GIL is released while the OS handles I/O. → Threads work great here
C extensions: NumPy, PyTorch ops, SciPy. Much of their heavy lifting runs in C code that releases the GIL for the duration. → Threads work great here
Pure Python CPU: Loops, string operations, pure Python math. Only one thread can run Python bytecode at a time, so threads don’t help.
→ Use multiprocessing instead

threadpool_embedding.py

from concurrent.futures import ThreadPoolExecutor, as_completed

from sentence_transformers import SentenceTransformer

# Blocking library: can't be awaited, but threads work fine
model = SentenceTransformer("all-MiniLM-L6-v2")


def embed_text(text: str, idx: int) -> tuple:
    embedding = model.encode(text)  # GIL released while the C extension runs
    return idx, embedding.tolist()


texts = [f"Document chunk {i} " for i in range(50)]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(embed_text, t, i): i for i, t in enumerate(texts)}
    results = {}
    for future in as_completed(futures):
        idx, embedding = future.result()
        results[idx] = embedding

print(f"Embedded {len(results)} chunks")

Synchronization primitives: the full toolkit

import threading
import time

model_ready = threading.Event()
api_sem = threading.Semaphore(5)  # max 5 concurrent inferences


def load_model():
    print("Loading model weights...")
    time.sleep(3)  # simulate loading a 7B-parameter model
    model_ready.set()  # unblocks ALL waiting threads at once
    print("Model ready!")


def inference_worker(worker_id: int):
    model_ready.wait()  # block here until the model is loaded
    with api_sem:  # at most 5 […]
Last Updated on June 22, 2026 by Editorial Team Author(s): Alpha Iterations Originally published on Towards AI. Build a Hybrid RAG System with FAISS, BM25, LangGraph and Claude Sonnet Model Combine semantic search and keyword search into one powerful document Q&A app using Claude Sonnet 4.6 API, step by step tutorial Hybrid Retrieval (Image by Alpha Iterations, Created using ChatGPT) Non members read here for free. Introduction With the rapid advancement of Large Language Models and vector embeddings, Retrieval-Augmented Generation (RAG) has become the go-to solution for querying unstructured documents. Upload a PDF, ask a question, get an answer. It feels like magic. But sometimes, it is not enough. The silent failure mode of most RAG systems is not the LLM. It is the retrieval step. Dense vector search is powerful at finding semantically similar text. It understands that “urban spending” and “city expenditure” mean the same thing. But ask it for a specific error code, a contract clause number, or a precise financial figure, and it can silently return the wrong chunks with high confidence. On the other hand, keyword search like BM25 nails exact matches every time. But it has no concept of meaning. “Automobile” and “car” are completely different strings to it, and any paraphrased question will leave it lost. The uncomfortable truth is that neither retriever is universally better. Each dominates on a different class of queries. And in real-world documents like legal contracts, financial reports, and technical manuals, you will always have both kinds. Hybrid RAG solves this by running both retrievers in parallel and fusing their results using Reciprocal Rank Fusion. You get the semantic understanding of vector search and the precision of keyword search, in a single ranked list, at near-zero extra cost. In this article, we will build a complete Hybrid RAG system from scratch. 
It uses FAISS for dense search, BM25 for keyword search, Reciprocal Rank Fusion to merge the two ranked lists into a single, better-ranked result, LangGraph for orchestration, and a Streamlit UI where you can toggle between retrieval modes and inspect every chunk and score behind each answer.

Real-world use cases this solves
Legal teams querying contracts for specific clause numbers (exact match) as well as intent (semantic)
Financial analysts asking about EBITDA definitions and quarterly revenue figures in earnings reports
Support engineers searching error codes in technical manuals while also asking about root-cause explanations
Research teams querying across dozens of papers for both exact citations and conceptual similarity

The complete end-to-end code is available in my GitHub repo: agentic-ai-usecases/beginner/hybrid-rag at main · alphaiterations/agentic-ai-usecases (github.com)

The Problem with Single-Mode Retrieval
Before jumping into code, it helps to understand why hybrid retrieval matters.

Dense vector search
Converts text into high-dimensional embeddings and finds the nearest neighbours by cosine similarity. It excels at paraphrasing: ‘What is the profit margin?’ finds chunks that say ‘net income as a percentage of revenue’ even though none of those words overlap with the query. But it can silently skip a chunk that contains ERR_4021, because a token that was rare in the embedding model's training data can sit in an odd region of the embedding space.

BM25
Best Match 25 is a classical information retrieval algorithm based on term frequency and inverse document frequency. It scores documents based on how many query words appear in them and how rare those words are across the whole corpus. It nails exact matches, part numbers, named entities, and specific terminology.
The weakness is that it has no semantic understanding at all, so ‘automobile’ and ‘car’ are completely different words to BM25.

Test Cases where Semantic Search & BM25 Fail (Image by Alpha Iterations)

Hybrid retrieval
Combines both signals. The merged ranked list tends to surface chunks that are simultaneously semantically relevant and lexically relevant, which is exactly what you want when your document contains a mix of technical terms and descriptive prose.

Hybrid RAG: Best of both. (Image by Alpha Iterations. Created using ChatGPT)

The question is: how do we decide which chunk to prioritize? RRF is the answer.

RRF (Reciprocal Rank Fusion)
RRF is a rank-based merging algorithm that combines multiple ranked lists into a single, unified ranking without caring about the raw score values from any individual retriever. Instead of asking “which chunk scored highest overall?”, it asks “which chunk appeared near the top of the most lists?”

RRF Steps. (Image by Alpha Iterations)

The formula is simple:

RRF score(d) = Σ 1 / (k + rank(d, list))

where k is a smoothing constant (typically 60) and rank(d, list) is the 1-indexed position of chunk d in a given retriever’s result list. The sum runs over every retriever that returned the chunk.

RRF Calculation. (Image by Alpha Iterations)

A few properties make RRF especially well-suited for hybrid retrieval:

Score-scale agnostic: Cosine similarity from FAISS sits in the range [-1, 1]. BM25 scores are unbounded and document-length-dependent. These two numbers are not comparable, so you cannot simply average them. RRF sidesteps the problem entirely by converting everything to ranks first.
Rewards cross-list agreement: A chunk that ranks 1st in BM25 and 2nd in vector search scores higher than a chunk that ranks 1st in only one list. The fusion step amplifies agreement, which is exactly the signal you want.
Robust to outliers: A single retriever that confidently returns a wrong chunk at rank 1 can only contribute 1 / (60 + 1) ≈ 0.016 to the RRF score. If the other retriever did not return that chunk at all, it goes nowhere near the top.

In practice, this means: when both retrievers agree on a chunk, it rises to the top. When only one retriever surfaces it, it still gets credit but not enough to dominate if another chunk had broader support.

System Architecture

Here is the full architecture of what we are going to build:

Fig 1: Architecture of Hybrid RAG (Image by Alpha Iterations)

Architecture note: Key design decision: FAISS and BM25 […]
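The RRF formula above is small enough to sketch in a few lines. The following is a minimal illustration, not the repository's implementation: the function name `reciprocalRankFusion` and the chunk IDs are hypothetical, and it is written in TypeScript purely for illustration. It assumes each retriever returns an ordered list of unique chunk IDs, best first.

```typescript
// Reciprocal Rank Fusion over any number of ranked lists of chunk IDs.
// `k` is the smoothing constant from the formula (60 is the typical value).
function reciprocalRankFusion(
  rankedLists: string[][],
  k: number = 60,
): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((chunkId, index) => {
      const rank = index + 1; // ranks are 1-indexed, as in the formula
      const previous = scores.get(chunkId) ?? 0;
      scores.set(chunkId, previous + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

// Hypothetical chunk IDs, for illustration only.
const vectorHits = ["chunk_a", "chunk_b", "chunk_c"];
const bm25Hits = ["chunk_b", "chunk_d", "chunk_a"];
const fused = reciprocalRankFusion([vectorHits, bm25Hits]);
console.log(fused.map(([id]) => id));
```

Here `chunk_b` (ranks 2 and 1) edges out `chunk_a` (ranks 1 and 3), and both sit well above the chunks that only one retriever returned, which is the cross-list agreement property described above.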
Last Updated on June 22, 2026 by Editorial Team Author(s): Mike Oller Originally published on Towards AI. credit Author: generated by GPT Image 2.0 Loop Engineering: The Missing Governance Layer for Reliable AI Agents By Mike Oller | AI Tool insider I’ve spent the last year building AI agents that do real work — not just answer questions, but write code, generate reports, schedule tasks, and interact with production systems. And I’ve learned an uncomfortable lesson: The smarter the model gets, the more damage it can do before you realize something went wrong. A GPT-generated poem with a hallucinated fact is harmless. A GPT-generated API call that deletes a production database is not. And the difference isn’t the model — it’s the architecture around it. This is the problem Loop Engineering sets out to solve. The Problem with Today’s Agent Architectures Most AI agent systems today follow one of two patterns: Pattern 1: The One-Shot Wonder. Feed the model a prompt, get an output. Fast, cheap, and surprisingly capable — until the task needs more than one step. Then it drifts, forgets context, and produces outputs that look right but aren’t. Pattern 2: The ReAct Loop. Reason, act, observe, repeat. This is the foundation of most modern agent frameworks (LangGraph, AutoGen, the Microsoft Agent Framework). It’s more powerful, but it’s also ungoverned — there’s no explicit mechanism for deciding when to stop, when to change course, or when to escalate to a human. Both patterns share a fundamental blind spot: they treat reliability as a property of the model, not of the system. credit Author: Generated by GPT Image 2.0 What Loop Engineering Proposes Loop engineering re-frames the problem. 
Instead of asking “how do we make the model smarter?” it asks “how do we build a governance architecture that wraps around the model?” Drawing on control theory (Wiener’s cybernetics), state machines, workflow orchestration, and reinforcement learning, the paper synthesizes six components that every reliable agent needs: 1. Goal Representation Not just “write a blog post” but a structured definition: the task, the constraints (budget, time, safety rules), the success criteria, and the stop conditions. Without this, the agent has no fixed reference point. It’s a ship without a destination. 2. State Model Five differentiated layers of state: Static state: The goal, constraints, and configuration Dynamic state: Current outputs, intermediate results Tool state: Which tools are available, their status Reflective state: Lessons learned from previous iterations Governance state: Risk budget, cost budget, remaining iterations Most agent systems collapse all of this into a single context window. Loop engineering explicitly separates them so the agent can distinguish between “what I’m trying to do,” “what I’ve done,” and “what I’ve learned.” 3. Action Executor A controlled boundary around tool use. Every action passes through a risk check before execution. This is the difference between an agent that can call any API it wants and one that must ask permission before spending money or modifying files. 4. Observation Collector The observation collector captures what actually happened — not what the agent intended to happen. This distinction matters because LLMs are famously bad at self-assessment. An agent might believe it successfully saved a file when the file system returned a permissions error. 5. Evaluator Assesses four dimensions on every iteration: Confidence: How sure is the agent about its next step? Progress: Is it getting closer to the goal or spinning its wheels? Drift: Has the agent wandered away from the original task? 
Risk: Could the next action cause harm or exceed budget? 6. Controller The controller is the decision-maker. Given the evaluator’s assessment, it decides one of: Continue — execute the next action Revise — change the plan Rollback — undo the last action Escalate — ask a human Stop — terminate execution This is the component most agent systems lack entirely. They have a model that decides what to do, but no mechanism for deciding whether to keep going. Five Loop Types Not every task needs the same loop structure. The paper identifies five: credit Author: Generated by Typecraft AI created by Author inside Google Opal These loops compose. A single task might cycle through planning, execution, and verification loops, all wrapped in a governance loop that keeps risk in check. Where Current Architectures Fall Short The paper offers a comparative analysis that’s worth laying out in full: One-shot agents are fast and cheap but have no recovery mechanism. If the first output is wrong, you start over. Unguided ReAct loops (the default in most frameworks) are flexible but have no formal termination condition. They keep spending tokens until the context window fills up or a human intervenes. Workflow-orchestrated agents (e.g., Prefect, Airflow, AWS Step Functions) provide excellent traceability and governance — for the failure modes the author anticipated. The moment the task departs from the predefined graph, the system is brittle. Loop-engineered agents are designed for the case where the plan emerges at runtime. The governance isn’t baked into a static graph; it’s baked into a dynamic policy set that applies on every iteration. The Counterargument That Matters The paper is unusually honest about its strongest objection: “Mature workflow orchestration tools already provide state tracking, retries, human-approval gates, and audit logs. 
Isn’t loop engineering just relabeling existing capability?” The response is worth quoting directly: “Governance checks must run every iteration rather than only at exception points, because there is no design-time map of which iterations might fail.” In a workflow-orchestrated system, you define the entire graph upfront. You know where the risky steps are because you placed them there. In a loop-engineered system, the plan is generated by the model at runtime. You don’t know whether step 27 will be the one that tries to call an expensive API or delete a critical file. So you check at every step. This is the core insight: when you can’t predict where the failure will happen, you need a governance layer that’s present everywhere. When NOT to Use Loop Engineering Refreshingly, the paper doesn’t […]
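The evaluator and controller described above can be sketched as a single decision function that runs on every iteration. This is a rough illustration only: the article names the four evaluation dimensions and the five controller decisions, but the 0-to-1 scales, the thresholds, the order of the checks, and every identifier below are assumptions made for the example.

```typescript
// The five controller decisions named in the article.
type Decision = "continue" | "revise" | "rollback" | "escalate" | "stop";

// The four evaluator dimensions, here assumed to be scored from 0 to 1.
interface Evaluation {
  confidence: number;
  progress: number;
  drift: number;
  risk: number;
}

// A slice of the governance state; field names are hypothetical.
interface GovernanceState {
  remainingIterations: number;
  riskBudget: number; // highest risk score tolerated without a human
}

// Runs on every iteration, not only at predefined exception points.
// Thresholds (0.5, 0.4) are placeholders, not values from the paper.
function decide(
  evaluation: Evaluation,
  governance: GovernanceState,
  goalReached: boolean,
): Decision {
  if (goalReached || governance.remainingIterations <= 0) return "stop";
  if (evaluation.risk > governance.riskBudget) return "escalate";
  if (evaluation.drift > 0.5) return "rollback";
  if (evaluation.confidence < 0.4 || evaluation.progress <= 0) return "revise";
  return "continue";
}

// Example: a confident, on-track step whose next action is too risky.
const decision = decide(
  { confidence: 0.9, progress: 0.3, drift: 0.1, risk: 0.8 },
  { remainingIterations: 10, riskBudget: 0.3 },
  false,
);
console.log(decision);
```

The point of the shape is that the model never decides whether to keep going; a deterministic function does, using state the model cannot overwrite.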
Author(s): N Selvaraj Originally published on Towards AI. The Verified Identity Agent Bridge Verify the human at the trusted edge, then carry that identity as explicit context to every downstream agent. Never collapse a user-initiated action into a shared service principal. The problem An enterprise wants its employees to reach an internal assistant from a commercial AI platform they already use every day. The assistant is a low-code agent: a Copilot Studio bot wired to an enterprise search index and a set of Power Automate flows. The integration looks simple. The host platform calls a proxy, the proxy calls the agent, the agent answers. A working prototype ships in a week. Then the question that matters arrives. An employee asks the assistant to act on something only they should see, and someone asks who, exactly, made that request. The proxy authenticated to the downstream agent with a single application credential. Every call the agent received that month carries the same actor: the proxy’s service principal. The humans who actually asked the questions are nowhere in the agent’s context or its flow run history. Personalization is impossible because the agent never learned who is asking. Authorization is impossible for the same reason. The identity was present at the front door and discarded one hop later. This is the default failure mode when a consumer-facing AI platform is bridged into an enterprise agent. The host platform verifies a human. The proxy in the middle, built for speed, authenticates downstream with client credentials because that is the path of least resistance. Identity terminates at the proxy. 
The claim When integrating a host AI platform with a downstream enterprise agent through a proxy, the proxy should verify the human identity using the enterprise identity provider’s delegated flow, resolve the canonical identity claims from the provider’s own source of truth, and forward that verified identity to the downstream agent as explicit per-request context. The app-only service-principal path is reserved for genuinely user-less work and logged as an exception, never used as the default for a user-initiated action. Preserving verified human identity across this bridge is the precondition for any US enterprise to adopt a third-party AI front end over internal systems without losing user-level accountability, which is what governs whether regulated organizations can use commercial AI platforms at all. There is a wrinkle that defines the pattern. The downstream agent here is a low-code SaaS agent. It cannot validate a user-audience OAuth token the way a custom API can. So identity cannot be propagated by token exchange alone. It has to be verified at the bridge and asserted downstream as trusted context. That constraint is the reason this is a distinct pattern rather than a footnote to delegated-token forwarding. The pattern: Verified Identity Agent Bridge The bridge is the one component on the path that both authenticates the human and speaks to the downstream agent. It carries the identity across the gap. Five constraints define it: Verify the human at the bridge with a delegated flow. Use the enterprise IdP’s Authorization Code flow with PKCE [1][3], not the host platform’s opaque session and not a static API key shared across users. The bridge ends up holding a token that was issued to the actual caller. Resolve identity from the IdP’s source of truth. Do not trust identity self-asserted by the host platform. 
Call the directory (Microsoft Graph /me) with the user's delegated token and read the canonical claims [4]: object id, user principal name, display name. The identity is what the IdP says it is. Hold no standing user credential. Tokens are request-scoped. Cache an access token only within its own lifetime and only keyed to its own user. Never share a token across users, never persist one past expiry. Assert the verified identity downstream under an explicit trust contract. The bridge forwards the resolved claims to the agent as per-request context. Because the agent cannot itself validate a user token, the channel between bridge and agent must be authenticated so the agent can trust that the asserted identity came from the bridge and not from an arbitrary caller. Gate the service-principal path. When no user token is present, the temptation is to fall back to app-only client credentials [2]. That silently re-terminates identity. Reserve it for user-less operations, gate it, and log every use as an exception. Source: Image by the author. VIAB flow: identity is verified at the bridge, resolved from the directory, and asserted downstream as per-request context over an authenticated channel. Source: Image by the author. The downstream agent cannot validate a user token, so the bridge does not forward one. It verifies the human, then vouches for them over a channel the agent already trusts. Implementation The worked example bridges ChatGPT Enterprise to a Microsoft Copilot Studio agent through a Node.js MCP proxy. ChatGPT Enterprise is the host AI platform, the Copilot Studio bot is the low-code downstream agent reached through a Power Automate flow [5], and the enterprise IdP is Microsoft Entra ID. This is the exact case VIAB is built for: ChatGPT Enterprise verifies the employee, but the Copilot Studio agent on the far side cannot validate a user-audience token, so the bridge has to verify the human and assert the identity across to it. 
The pattern was validated experimentally against a live integration between these two systems, with a ChatGPT Enterprise action invoking the Copilot Studio agent through the proxy.

Verify the human at the bridge

The delegated flow starts with PKCE so the authorization code cannot be replayed by an interceptor. The bridge builds the authorization URL, the user signs in against Entra, and the returned code is exchanged for a token issued to that user.

import { randomBytes, createHash } from "crypto";

export function generatePKCE() {
  const codeVerifier = randomBytes(32).toString("base64url");
  const codeChallenge = createHash("sha256").update(codeVerifier).digest("base64url");
  return { codeVerifier, codeChallenge };
}

export function getAuthorizationUrl(scope: string, redirectUri: string) {
  const { codeVerifier, codeChallenge } = generatePKCE();
  const state = randomBytes(16).toString("base64url");
  const params = new URLSearchParams({
    client_id: process.env.CLIENT_ID!,
    response_type: "code",
    […]
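Two of the five constraints listed earlier (hold no standing user credential, and gate the service-principal path) lend themselves to a short sketch. What follows is not the author's proxy code: `UserTokenCache`, `chooseAuthPath`, and `exceptionLog` are hypothetical names, the cache is in-memory only, and token acquisition itself is left out.

```typescript
interface CachedToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry time in epoch milliseconds
}

// Request-scoped cache: one entry per user, never served past expiry.
class UserTokenCache {
  private readonly byUser = new Map<string, CachedToken>();

  put(userObjectId: string, accessToken: string, expiresAtMs: number): void {
    this.byUser.set(userObjectId, { accessToken, expiresAtMs });
  }

  get(userObjectId: string, nowMs: number = Date.now()): string | undefined {
    const entry = this.byUser.get(userObjectId);
    if (entry === undefined) return undefined;
    if (nowMs >= entry.expiresAtMs) {
      this.byUser.delete(userObjectId); // drop expired tokens eagerly
      return undefined;
    }
    return entry.accessToken;
  }
}

type AuthPath =
  | { kind: "delegated"; userObjectId: string }
  | { kind: "app-only"; reason: string };

// Every app-only use is recorded as an exception (a stand-in for real logging).
const exceptionLog: string[] = [];

// App-only credentials are allowed only for explicitly user-less work.
function chooseAuthPath(
  userObjectId: string | undefined,
  userlessOperation: boolean,
): AuthPath {
  if (userObjectId !== undefined) return { kind: "delegated", userObjectId };
  if (!userlessOperation) {
    throw new Error("No verified user for a user-initiated action: refusing app-only fallback");
  }
  exceptionLog.push("app-only path used for a user-less operation");
  return { kind: "app-only", reason: "user-less operation" };
}
```

Failing closed when no user is present is the design choice that matters here: a silent fallback to client credentials is exactly the identity termination the pattern is meant to prevent.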
Author(s): Venkat Peri Originally published on Towards AI.

You Can’t Prompt Away Your LLM Problems

When an LLM feature breaks in production, the first instinct in the room is to open the prompt and reword it. We had that instinct too. Then we built a production assistant for financial advisors and kept a record of every LLM-related failure the system hit, along with the fix that actually closed each one. Across the whole build, almost nothing that mattered was fixable by editing a prompt. The durable fixes were architectural. The single time we tried a prompt-only fix on the hardest problem, it made things measurably worse, and we reverted it. This is the case for treating the model as one untrusted component in a larger system, written from the failures that taught us to do it.

The fix that made it worse

Routing was our most unstable surface, and it was unstable in a way that prompt editing could not reach. The same question routed one way on one run and a different way on the next, with no code change in between. On the ambiguous edges, routing accuracy sat around 56 to 64 percent and was non-deterministic from run to run. A question like “rank my households by AUM” came back as a clarification request on one run and a confident answer on the next. The obvious move was to explain the boundary better in the routing prompt. We added a block of guidance describing how to tell the categories apart. The classifier got less stable, not more, so we took the prose back out. The reason mattered more than the symptom. The flakiness was not bad luck. It was structural, and it got worse every time we added a domain. The router guessed an abstract category first, and then code mapped that category to a concrete tool. The category was a lossy middle step. When the guess was wrong, the right tool was unreachable, and there was no signal left to recover from. The fix was to delete the abstract category.
We collapsed routing into a single stage that picks one concrete tool directly from the catalog, with the scope derived from the tool it picked rather than guessed up front. We narrowed what each decision had to weigh, so any one call discriminates among a handful of tools instead of the full set. We grounded each tool with a few example utterances carried as structured data, not as prose inside the prompt. Accuracy moved from a flaky 98 percent on the old design, down through a real regression to 72 percent right after the rewrite, and up to 100 percent on both of our evaluation suites once the grounding was in place. The improvement came from removing decisions the model did not need to make. That set the pattern for most of what followed: take work away from the model wherever code can do it, and catch what is left in deterministic guardrails. A related failure looked like a bad model decision and was actually a missing option. Asked to “show the first account,” the model picked the nearest available tool, a holdings lookup, because no account-listing tool existed, and it invented an ordinal to make the answer fit. The model was choosing the closest thing within reach. We fixed it by building the tool it needed, not by writing a prompt that apologized for the gap. A value the model invented is not the value you computed A model fills a blank confidently, and confidence is the dangerous part. “Create a task for 2pm” put the string “2pm” into a field our code expected to hold a computed timestamp. The parser tried to read “2pm” as an ISO instant, threw, and the user saw a generic server error. This crash only reproduced with real model output. Every offline test we had written passed an empty argument map, and an empty map never triggers the bug. The lesson we wrote down for the next person chasing a live crash they cannot reproduce: vary the arguments the model actually produces, because empty-argument mocks hide a whole class of failure. 
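The “2pm” crash above comes from parsing a model-populated string as if it were a computed timestamp. A minimal sketch of checking the value before parsing it might look like the following. The names `readInstantArg` and `resolveDueTime`, the argument key `dueAt`, and the regular expression are assumptions for illustration, not the team's code; the fallback that derives a time from the user's words is passed in as a hypothetical callback.

```typescript
// ISO 8601 instant with seconds and an explicit "Z" or numeric offset.
const ISO_INSTANT =
  /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{1,3})?(Z|[+-]\d{2}:\d{2})$/;

// Returns a Date only when the model-supplied value really is an ISO instant.
// Anything else ("2pm", a missing key, an object) yields null instead of throwing.
function readInstantArg(raw: unknown): Date | null {
  if (typeof raw !== "string" || !ISO_INSTANT.test(raw)) return null;
  const parsed = new Date(raw);
  return Number.isNaN(parsed.getTime()) ? null : parsed;
}

// Falls back to a time derived in code when the model's value is rejected.
function resolveDueTime(
  modelArg: unknown,
  deriveFromUserText: () => Date | null,
): Date | null {
  return readInstantArg(modelArg) ?? deriveFromUserText();
}

// Vary the arguments the way a live model would, not just the empty map.
const modelArgs: Array<Record<string, unknown>> = [
  {},
  { dueAt: "2pm" },
  { dueAt: "2026-06-22T14:00:00Z" },
];
for (const args of modelArgs) {
  console.log(args, readInstantArg(args["dueAt"]));
}
```

Only the last case produces a Date; the first two return null, so the caller can ask a clarifying question or compute the time itself rather than surface a server error.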
The crash was one instance of a wider habit. When asked to create a task with no subject, the model invented a subject that was a bare echo of the tool’s own name, and because the field was now non-empty, it sent the task straight to confirmation instead of asking. It presented a specific row as “the first” on a list our schema never ordered, reading fixture order as if it were real order. Given a comparison of two figures, it would do the arithmetic itself and sometimes get it wrong. These were not instruction problems, and the fixes had the same shape every time. Detect the value the model invented and refuse it. A matcher catches tool-name-echo placeholders and forces a clarification. A check drops a non-ISO time argument and re-derives the time from the user’s words in code. A flag on unordered lists, plus a guard that returns the full list with a notice instead of asserting an ordinal. Percentages, deltas, and filtered totals moved into code, with the prompt reduced to one line: do not compute this yourself. The principle underneath all of them is to never trust a model-populated value as if it were the typed or computed value. Validate before you parse. The guardrail is also code We run a deterministic grounding check over every answer that cites figures. It compares the numbers in the rendered answer against the numbers the tool actually returned, and when the answer cites a figure with no support, it withholds the answer behind a message saying we could not fully verify it. This is the reason a fabricated “that’s 5% of your $8.3M,” which the renderer invented on three or four runs out of six, never reached a user. The render prompt already banned invented statistics. The ban alone did not hold, so deterministic shaping that […]
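The grounding check described above can be pictured as a comparison between two sets of numbers. The sketch below is a deliberately simple stand-in, since the article does not show its implementation: it treats any digit run as a figure, matches values exactly, and ignores units, rounding, and scale suffixes such as “M”. The function names and the withheld message are hypothetical.

```typescript
const WITHHELD_MESSAGE = "We could not fully verify this answer.";

// Pulls every numeric figure out of a piece of text ("1,200" becomes 1200).
function extractFigures(text: string): number[] {
  const matches = text.match(/\d[\d,]*(\.\d+)?/g) ?? [];
  return matches.map((m) => Number(m.replace(/,/g, "")));
}

// Figures cited in the answer that never appeared in the tool output.
function unsupportedFigures(answer: string, toolOutput: string): number[] {
  const supported = new Set(extractFigures(toolOutput));
  return extractFigures(answer).filter((n) => !supported.has(n));
}

// Deterministic guardrail: withhold the answer if any figure lacks support.
function guardAnswer(answer: string, toolOutput: string): string {
  return unsupportedFigures(answer, toolOutput).length === 0
    ? answer
    : WITHHELD_MESSAGE;
}

// Hypothetical tool output and rendered answers, for illustration only.
const toolOutput = "household_aum: 1200000, total_aum: 8300000";
console.log(guardAnswer("Household AUM is $1,200,000.", toolOutput));
console.log(guardAnswer("That's 5% of your $8.3M.", toolOutput));
```

The first answer passes because its only figure appears in the tool output. The second is withheld: neither 5 nor 8.3 was returned by the tool, which is the kind of invented percentage the check exists to stop.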
Author(s): Ramon Invarato Originally published on Towards AI. The Free Agent Trap Why agents that run on their own are still failing A standalone article from the series “AI and You”. You’ve been promised the same thing as everyone else: that artificial intelligence no longer just answers questions — it does the work on its own. You assign a task, step out for a coffee, and come back when it’s done. That’s the promise of AI “agents,” talked about everywhere. It sounds great. The problem is what happens when nobody’s watching while the AI does the job. An example a developer shared in the comments of a technical blog (article in Spanish) illustrates it perfectly. They asked an agentic AI to display a customer’s orders on a web page. The AI solved it in a way that worked… and was a disaster: for each order, it fired an independent query to the database. A hundred thousand orders, a hundred thousand queries. The code compiled, the tests passed, and the screen showed the correct data. There was just one detail: it took twenty seconds to load and saturated the database, whereas a professional would have solved it with a single query in half a second. It worked up close; it was useless from a distance. That gap between “it works” and “it’s useful” sums up one of the great disappointments of this year: the promise of the autonomous agent — capable of executing long tasks on its own without human supervision — remains unfulfilled. Models are good at doing pieces of work well, but when left to run for fifteen or twenty consecutive steps, they fail in ways a human wouldn’t easily catch, and that can be very costly. This article explains why, covers the data backing it up, and discusses when it makes sense — and when it doesn’t — to use a free agent. This is not an article against AI: it’s a guide for using it and ideas for minimizing real risk. About the extreme cases in this article. 
Some comparisons, scenarios, and diagrams in this text are illustrative: they contrast extremes (utopia / dystopia) to make a range visible. They are not operational recommendations or predictions. The author takes no responsibility for how each reader uses these ideas. Full disclaimer text here. Conceptual representation of silent corruption: the agent iterates turn by turn, the document degrades from within, and the surface keeps looking impeccable. The representative log of how it happens in code, in “Anatomy of a silent disaster: the internal log of an agent.” The promise being sold The narrative of recent years has been clear: the next frontier of generative AI is not answering questions — it’s executing tasks. An agent receives a goal ( “book me a flight to London on Friday”, “refactor this code”, “put together a quarterly report with the CRM data”), breaks the problem into steps, executes those steps, validates whether the result is approaching the goal and, if not, retries. Ideally, you come back when it’s done. That promise is what has moved tens of billions of dollars in investment in 2024–2026. It’s also what justified Uber deploying agentic tools to their 5,000 engineers and burning through their entire annual AI budget in four months (article in Spanish). The actual capability of agents, measured with reproducible benchmarks, however, lags far behind the marketing being sold to us. This contradiction reaches into the heart of the very companies created to lead the AI revolution. The Anthropic Institute published “When AI builds itself” in June 2026, revealing that in May 2026 more than 80% of merged code in Anthropic’s internal repository was written by Claude — before Claude Code launched in preview (February 2025), that figure was in the low single digits. 
In Q2 2026, the typical engineer was merging 8× more code per day than in 2024; the report itself attributes that second jump to the moment models started working autonomously over longer time horizons, with the engineer in the role of director and reviewer rather than typist. Yet in a paradoxical twist, those same security teams and founders at Anthropic have led public warnings calling for caution and strict regulation in deploying advanced autonomy without guardrails. They know better than anyone that the speed of commercial adoption is running well ahead of the theoretical safety net of the models. The figure that keeps not moving: CRMArena-Pro In June 2025, Salesforce AI Research published CRMArena-Pro, the first serious benchmark for evaluating AI agents in realistic enterprise environments. The difference from previous benchmarks matters: CRMArena-Pro doesn’t measure whether the AI can answer an isolated question well. It measures what happens when it’s asked to execute a complete CRM task across multiple turns, with real data (over 83,000 synthetic but structurally representative records), simulating a conversation with a human user who keeps asking for things. The results, after evaluating the top models of the moment (Gemini 2.5 Pro and similar), were: Single-turn (a single interaction): success rate of 58%. Multi-turn (several chained interactions, which is what a realistic workflow looks like): success rate of 35%. Workflow Execution (structured tasks with clear steps): up to 83% in single-turn (which shows that when the structure is clear, the agent works reasonably well). Confidentiality awareness: practically zero without specific prompting. When asked to handle confidentiality, it improves, but the task success rate drops further. Translated into plain terms: a top-performing agent, left to its own devices in a realistic multi-turn workflow, fails two out of every three times. Not from a one-off bug — structurally. 
And when confidentiality oversight is added, it drops further still. DELEGATE-52: silent corruption So far, we’ve measured how often the agent gets it right. The next study measures something different and more unsettling: how much it damages what it touches when it works alone for an extended stretch. If CRMArena-Pro wasn’t enough, in April 2026 Microsoft Research published another study that attacks the problem from another angle. DELEGATE-52 measures what […]
Author(s): Dr Swarneendu AI Originally published on Towards AI. After 500 cycles, a long-running agent is not the same agent that started. Its goal has shifted. Its constraints have eroded. This is measurable, preventable, and nobody has built the instrument. Until now. Fareed Khan’s long-running agent survived a host reboot. It survived context overflow. It survived 31 items over-scoped to 14. The article explains why representational drift is mathematically inevitable in long-running agents: repeated lossy compression (summarisation, decision-logging distillation, and abstraction/consolidation) erases recoverable information, so an agent’s output distribution diverges from its early-cycle behavior (quantified via KL divergence). It then proposes a practical, probe-based drift detector using lightweight multiple-choice questions with known correct answers and statistical hypothesis testing (chi-squared) to detect shifts in interpretation. When drift is detected, the recommended fix is to inject the original instruction as a targeted “drift correction” anchor into the active context, re-grounding the agent before the deviation compounds—keeping the KL divergence near zero. The piece closes by highlighting that this instrumentation can distinguish useful long-running agents from expensive near-misses and provides related references. Read the full blog for free on Medium. Published via Towards AI
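The probe-based drift check summarised above could be sketched roughly as follows. This is an illustration under stated assumptions, not the article's code: it compares correct/incorrect counts on the same multiple-choice probes at baseline and at the current cycle with a Pearson chi-squared test on a 2x2 table, and the names `chi_squared_2x2`, `drift_detected`, and `reanchor`, as well as the "DRIFT CORRECTION" message format, are hypothetical.

```python
# Critical value of the chi-squared distribution with 1 degree of freedom
# at a significance level of 0.05.
CHI2_CRITICAL_DF1_ALPHA_05 = 3.841


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def drift_detected(
    base_correct: int,
    base_total: int,
    cur_correct: int,
    cur_total: int,
    critical: float = CHI2_CRITICAL_DF1_ALPHA_05,
) -> bool:
    """True if probe accuracy differs significantly from the baseline."""
    stat = chi_squared_2x2(
        base_correct,
        base_total - base_correct,
        cur_correct,
        cur_total - cur_correct,
    )
    return stat > critical


def reanchor(context: list[str], original_instruction: str) -> list[str]:
    """Append the original instruction as a drift-correction anchor."""
    return context + [f"DRIFT CORRECTION: {original_instruction}"]


if __name__ == "__main__":
    context = ["summary of cycles 1-500"]
    if drift_detected(48, 50, 30, 50):
        context = reanchor(context, "Process exactly the 14 scoped items.")
    print(context)
```

With small probe sets the chi-squared approximation is weak, so an exact test may be a better fit; the 2x2 form is used here only to keep the sketch short.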
Author(s): Sarfaraz Merchant Originally published on Towards AI. Beyond Chat: Processing Images, PDFs, and Documents with the OpenAI Adapter in Oracle Integration Cloud Exploring File Uploads, Image Processing, and Document Extraction using the Native OpenAI Adapter in OIC. When Oracle introduced the OpenAI Adapter in Oracle Integration Cloud (OIC), most of the examples I came across focused on text generation and chatbot-style interactions. Naturally, I assumed that if I wanted to process files such as images, PDFs, or Word documents, I would probably need to build custom REST integrations and call the OpenAI APIs directly. While exploring the adapter, I noticed a few operations related to file management. That got me curious. Can the OpenAI Adapter upload files? Can those files be processed later using a File ID? Can we extract structured data from invoices, images, or PDF documents without building custom REST calls? To answer these questions, I decided to build a small proof of concept. The results were quite interesting. Using only the native OpenAI Adapter, I was able to: Upload files to OpenAI Retrieve and reuse File IDs Process images and PDF documents Extract structured JSON responses Analyze multilingual content, including Arabic and Persian documents In this article, I’ll walk through the approach, the architecture, and a few practical use cases that I tested while building the POC. By the end, you’ll have a reusable pattern that can be applied to invoice extraction, document processing, content analysis, and many other AI-powered integration scenarios. OpenAI Adapter provides native operations for uploading and processing files. Understanding the Architecture While building the POC, I decided to separate the solution into two integrations. The first integration is responsible for uploading a file to OpenAI and obtaining a File ID. The second integration uses that File ID together with a prompt to process the document and return a structured response. 
I found this approach cleaner because the uploaded file becomes a reusable asset. The same file can be analyzed multiple times using different prompts without uploading it again.

Integration 1 — Upload File
The first integration accepts a file and uploads it using the OpenAI Adapter’s Upload File operation. Supported examples include:
Images (JPG, PNG)
PDF documents
Microsoft Word documents
The adapter returns a unique File ID similar to: file-xxxxxxxxxxxxxxxx
This File ID can then be stored, logged, or passed to another integration for processing.

Integration 2 — Process File
The second integration receives:
A File ID
Processing instructions
It then invokes the OpenAI Adapter using the Responses operation. Depending on the type of document and the prompt provided, OpenAI can:
Extract invoice data
Analyze documents
Summarize content
Return structured JSON
Process multilingual text
The response can be consumed directly within OIC and mapped to downstream systems.
Solution Architecture.

High-Level Flow
The overall process looks like this:

Document (Image/PDF/DOCX)
         |
         v
+------------------+
|   Upload File    |
|  OpenAI Adapter  |
+------------------+
         |
         v
      File ID
         |
         v
+------------------+
|  Responses API   |
|  OpenAI Adapter  |
+------------------+
         |
         v
Structured Output (JSON)

One thing I particularly liked about this pattern is that it relies entirely on the native OpenAI Adapter. There are no custom REST calls, no manual API payload construction, and no need to manage Base64-encoded content.
Upload File to OpenAI Account. Process OpenAI Files.

Building Integration 1 — Uploading Files to OpenAI
The first integration is responsible for uploading a document to OpenAI and obtaining a File ID that can be reused later. In my case, I tested the following file types:
JPG images
PNG images
PDF documents
Microsoft Word documents

Step 1 — Create the OpenAI Connection
Start by creating a connection using the OpenAI Adapter.
Provide:
OpenAI API Key
Connection Name
Security Configuration
Once the connection is successfully tested, it can be used within your integration.
OpenAI Connection.

Step 2 — Create the Upload File Integration
Create an App-Driven Orchestration.
2.1 Add a REST Adapter trigger that accepts a file as input, and select these multipart attachment processing options:
Request is multipart with payload
Multipart request is of type multipart/form-data with HTML form payload
2.2 Add the OpenAI Adapter and select the following operation: Upload File
This operation allows OIC to upload a file directly to OpenAI without invoking the REST API manually.

Step 3 — Configure the Request Mapping
The Upload File operation expects two important pieces of information:
Purpose
The purpose determines how the uploaded file will be used.
For image processing: vision
For document processing (PDF, DOCX, TXT, etc.): user_data
File Content
Map the incoming attachment or file reference to the adapter’s File element. Example mapping:

RequestWrapper
├── Purpose
└── File
    └── streamReference

The adapter handles the upload process and stores the file within OpenAI.
Upload File Mapper.

Step 4 — Execute the Integration
After activation, invoke the integration with a sample document. If the upload succeeds, the response contains a File ID. Example: file-3UoRxBRyqV7pJRgPC4WSgi
This File ID becomes the bridge between the upload step and the document processing step.
Uploaded File to OpenAI Account. Stored Files in OpenAI Account.

Lesson Learned: File Names Matter
While testing the Upload File operation, I ran into an issue that took a little time to troubleshoot. Some uploads failed when the file name contained spaces or special characters. For example: WhatsApp Image 2026-06-05 at 7.25.23 PM.jpeg
Error if the File Name contains a space.
After renaming the file to a simpler format without spaces or special characters, the upload completed successfully.
Example: invoice_20260605.jpeg If you encounter unexpected upload failures, one of the first things to check is the file name. Although this may vary depending on the adapter version and environment, using clean file names is a good practice and helped avoid issues during testing. Why the File ID Matters Initially, I assumed I would need to send Base64 content to OpenAI every time I wanted to analyze a document. The File ID approach is much cleaner. Once the document is uploaded: The file can be reused Multiple prompts can be executed against the same file There is no need to repeatedly transfer the document This makes the […]
Author(s): Enzo Lombardi Originally published on Towards AI. Skills as traits Two tools fit in a match statement. Three start to feel cramped. By six, the dispatcher is a swamp of clones, retries, error formatting, and case branches that all look almost but not quite alike. The agent loop is still simple. The space around it is not. The article proposes replacing the brittle match-based dispatcher with a typed trait approach: define each “skill” as its own trait implementation with associated input types, runtime metadata (read-only vs destructive and concurrency safety), and standardized error handling so the system can reason about failures. It introduces a registry to own dispatch, generate JSON Schema from Rust structs (so model and code stay in sync), and run multiple tool calls in parallel when safe. Beyond dispatch, it covers resilience via retry helpers for transient errors, cross-cutting behavior via pre/post “hooks” (permissions, logging, redaction, caching), and scaling strategies (deferring large tool sets using search hints, aliases, and on-demand loading). Finally, it distills three rules—atomic skills, result-based outputs, and safety-first input handling—and shows how the pattern fits Eugene v0.3 in practice. Read the full blog for free on Medium. Published via Towards AI
Author(s): FS Stance Originally published on Towards AI. Self-Hosting Airflow at Home: Automating Stock Price Data Collection One of the main goals of creating my home lab is to gain a deeper understanding of Machine Learning Operations (MLOps) and how to productionalize AI workflows. Generally speaking, MLOps and productionalization deal with moving AI models from research into a real-life environment with automation and the ability to handle errors gracefully. In my previous articles, I set up a PostgreSQL server and an Airflow server. These serve as the data foundation for the datasets that will be used to train AI models. Now we need to start filling our PostgreSQL databases with data. We can use Airflow to orchestrate data pipelines so that up-to-date data is loaded into our PostgreSQL database. Setting up this data foundation is typically the first step in the machine learning (ML) process after planning, as you need data to train your models. A major reason behind my home lab setup is that I want to show you can self-host the whole ML process with a couple of VMs and containers. Since I have been doing a lot of personal investing lately, let’s work with finance data. With finance data, you can analyze trends, correlate prices, and even try forecasting, making it broadly useful across many scenarios.

Configuring Airflow

Running your Airflow Server
Before we start, here are things I learnt and implemented that will make life easier with Airflow. When I set up the Airflow server previously, I used the following commands:

nohup airflow scheduler > scheduler.log 2>&1 &
nohup airflow dag-processor > dag-processor.log 2>&1 &
nohup airflow triggerer > triggerer.log 2>&1 &
nohup airflow api-server --port 8080 > api-server.log 2>&1 &

The nohup command allows you to run a process continuously even after you log out or close the shell.
The issue I ran into was that the Airflow components would crash and I would have to go to the Airflow server and re-run the commands. Another issue was that starting up all the components required 4 commands, which could be annoying to run all the time. To solve the latter issue, you can write a script to start or restart Airflow.

nano airflow_restart.sh

In the airflow_restart.sh file, copy the below code and save it.

#!/bin/bash
pkill -f "airflow" --ignore-ancestors
sleep 2
echo "Starting scheduler..."
nohup airflow scheduler > scheduler.log 2>&1 & echo $! | tee scheduler.pid
echo "Starting dag-processor..."
nohup airflow dag-processor > dag-processor.log 2>&1 & echo $! | tee dag-processor.pid
echo "Starting triggerer..."
nohup airflow triggerer > triggerer.log 2>&1 & echo $! | tee triggerer.pid
echo "Starting api-server..."
nohup airflow api-server --port 8080 > api-server.log 2>&1 & echo $! | tee api-server.pid
echo "Airflow restarted"

This script will kill any processes with “airflow” and restart all 4 components. So instead of running all 4 commands, you can now just run the script. To go a step further and solve the issue of having to restart the Airflow components every time they crash, you can set up Airflow as a system daemon process. A daemon is a background process that typically starts when you boot up your system, restarts when there are failures, and is detached from the shell so you can keep it running at all times. First, create a file using:

nano /etc/systemd/system/airflow-scheduler.service

Next, copy the below text, put it in the file, and save it.
[Unit]
Description=Airflow Scheduler
After=network.target postgresql.service
Wants=postgresql.service

[Service]
User=<USER>
Group=<GROUP>
Environment=PATH=<AIRFLOW_PATH>:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
Environment=AIRFLOW_HOME=<AIRFLOW_FOLDER>
ExecStart=<AIRFLOW_PATH> scheduler
Restart=on-failure
RestartSec=5s
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Replace <USER>, <GROUP>, <AIRFLOW_PATH>, and <AIRFLOW_FOLDER> depending on how you set up your Airflow server. Repeat this for all the other components of Airflow: dag-processor, triggerer, api-server. Run the below to pick up new system daemon processes.

sudo systemctl daemon-reload

Then, enable and start all the services.

# Enable services on boot
sudo systemctl enable airflow-scheduler
sudo systemctl enable airflow-dag-processor
sudo systemctl enable airflow-triggerer
sudo systemctl enable airflow-api-server
# Start services
sudo systemctl start airflow-scheduler
sudo systemctl start airflow-dag-processor
sudo systemctl start airflow-triggerer
sudo systemctl start airflow-api-server

Voila! You now have Airflow running on your server in a more robust way. You can use systemctl status to check if each component is running and journalctl to see logging information.

Adding your PostgreSQL Connection
Another thing to do is to set up the PostgreSQL connection in your Airflow server so that it can talk to your PostgreSQL database. Airflow has a handy PostgreSQL connection type that lets you specify the connection parameters once and re-use the connection anytime you want to connect to PostgreSQL. Edit the airflow.cfg file.

nano airflow.cfg

Find where it says test_connection and set it equal to Enabled. Restart Airflow (either using the script or by restarting the daemon processes). This allows you to test your connections through the Airflow UI. Now, let’s go to the Airflow UI and add the PostgreSQL connection. On the left panel, select Admin > Connections.
On the top right, select Add Connection. Under Connection Type, select Postgres. Note you may need to run pip install apache-airflow-providers-postgres in your Airflow server if you do not see Postgres as an option. Fill in your PostgreSQL Host (IP address), Login, Password, Port, and Database. You should see your new PostgreSQL connection. To test the connection, click on the graph-like icon. After clicking, it should become like a wi-fi icon. There should also be a green pop-up message on the bottom-right indicating your test connection was successful. Creating the Airflow Pipeline Code Airflow has been set up and configured properly, so now we can get to writing the code for the Airflow pipeline. First, a little bit about Airflow. Airflow works through Directed Acyclic Graphs (DAGs), which is just a term they use to describe structured workflows that can contain multiple data processing tasks. Within a DAG, the tasks can be run in specified orders and with dependencies, so you can set up tasks to run when tasks complete or in parallel with other tasks. To get finance data, I used yfinance, a python package that leverages Yahoo! Finance’s APIs to get market data. We can use this to extract and write stock ticker price data to our PostgreSQL database. Here is the final […]
Last Updated on June 18, 2026 by Editorial Team Author(s): HyperDeep AI Originally published on Towards AI. The 76-Hour Frontier: How the Takedown of Claude Fable 5 Birthed the Military-Industrial-AI Complex On Tuesday, June 9, 2026, Anthropic released Claude Fable 5, the most capable artificial intelligence model ever shipped to a public API. By Friday, June 12, at 5:21 PM ET, it was completely gone. In a move that has sent shockwaves through Silicon Valley and permanently rewritten the boundaries of state intervention in the tech sector, the Trump administration executed the first-ever forced global shutdown of a deployed commercial AI model. Citing an emergency export control directive, the government barred any foreign national — including Anthropic’s own overseas engineers — from accessing the Fable 5 and Mythos 5 architectures. Because selective geolocation and real-time biometric filtering are architectural impossibilities on a live public cluster, Anthropic was forced to execute a global kill-switch, pulling its crowning achievements offline for every customer worldwide. The weekend fallout exposed a deeply chaotic struggle behind closed doors. Reports from The Verge and Semafor revealed that Anthropic executives spent forty-eight frantic hours in crisis meetings with Washington, locked in a bizarre existential paradox: trying to convince regulators that the model they had spent months hyping as a world-altering, hyper-capable frontier engine was, in fact, not powerful or dangerous at all. What we witnessed wasn’t just a regulatory hiccup. It was the violent birth of a new era where software is treated as a weapon of war, and the assumption of “permissionless innovation” in Silicon Valley is officially dead. The Pre-Incident Matrix: When Memory Becomes a Legal Liability Before the White House pulled the plug, Fable 5 had already triggered an intensely volatile week for enterprise adoption. 
Positioned as the sanitized, public-facing sibling of the gated Claude Mythos 5 engine, Fable was designed to bring “Mythos-class” long-horizon execution to mainstream software engineering while filtering out its most dangerous cyber-offensive capabilities. To understand why Washington fired the nuclear option, we have to look past raw token speed and parameter counts. The true paradigm shift of the Fable 5 architecture — and the exact feature that triggered the government’s panic — was its native stateful memory graph. For years, Large Language Models were effectively hyper-intelligent amnesiacs. They treated every API call as a stateless, ephemeral text transaction. Fable 5 changed the calculus entirely. It was designed to autonomously spin up sandboxed environments, evaluate its own code execution against simulated compiler loops, map out complex system dependency trees, and, crucially, remember its state across days of autonomous execution. [Stateless LLM Paradigm] --> User Prompt --> Inference Engine --> Static Response (Forget State) [Mythos Stateful Graph] --> Agent Goal --> Native Memory Graph --> Tool Call Execution Loop ^ | +--- Self-Correction -----+ This is where technical capability crossed the regulatory red line: Memory is the prerequisite for agency. When an AI can hold context, remember its past failures, and iterate on a cyber-offensive or defensive task for 48 hours without human prompting, it ceases to be a “tool.” It becomes an autonomous digital entity. Regulators didn’t shut down Fable 5 because it could write better Python scripts; they shut it down because stateful persistence applied to vulnerability discovery is indistinguishable from a persistent cyber-weapon. The 30-Day Retention Fight This immense capability came with a steep architectural trade-off. 
To maintain real-time safety classification on such a fluid, stateful model, Anthropic quietly rolled back its strict Zero Data Retention (ZDR) policy, enforcing a mandatory 30-day data caching window. This sparked an immediate corporate revolt. Microsoft lawyers, panicking over intellectual property exposure, quickly scrubbed Fable 5 from internal model pickers inside GitHub Copilot, and corporate counsels across Fortune 500 tech firms began drafting internal boycotts. Fable 5 was already facing an enterprise crisis when the federal government delivered its ultimatum. The Math Behind the Machine Anthropic launched the Mythos class with an aggressive pricing structure designed to squeeze legacy API providers while signaling a direct bid for high-value enterprise workflows and autonomous agent clusters. Claude Fable 5 Access Tier: Public API / Console Primary Specialization: General Reasoning, Complex Software Engineering, Advanced Vision Safeguard Strategy: Conservative classifiers; routes flagged queries to Opus 4.8 (~5% false-positive rate). Cost (per 1M Input/Output): $10.00 / $50.00 Claude Mythos 5 Access Tier: Gated (Project Glasswing / Vetted Partners) Primary Specialization: Offensive/Defensive Cyber, Life Sciences, Advanced R&D Safeguard Strategy: Lifted domain safeguards; unrestricted token generation within secure enclaves. Cost (per 1M Input/Output): $10.00 / $50.00 Claude Opus 4.8 Access Tier: General Enterprise Primary Specialization: High-Context Analysis, Legacy Workflows Safeguard Strategy: Standard Constitutional AI guardrails. Cost (per 1M Input/Output): $5.00 / $25.00 While a $10.00 / $50.00 token tier seems steep at first glance compared to commodity models, Anthropic’s implementation of intelligent, persistent prompt caching alters the fundamental unit economics of AI execution. 
By allowing agents to cache up to 90% of their input context — such as entire codebases, system logs, or technical ledgers — the financial cost of long-horizon reasoning loops plummets. Instead of costs scaling linearly with every step of an autonomous loop, the input cost curve flattens completely. This shift transforms the economic viability of utilizing Mythos-class models as permanent digital workforces from a speculative luxury to a pragmatic utility bill. The Micro-Engineering Reality: The Breakdown of Orchestration The brief 72-hour lifecycle of Fable 5 fundamentally dismantled two of the tech industry’s favorite illusions regarding AI safety and the future of engineering labor. The Supervisor’s Dilemma: The Collapse of “Human-in-the-Loop” For years, the industry’s favorite regulatory security blanket was the “human-in-the-loop” (HITL) framework. Fable 5 proved that at the frontier level, HITL is pure compliance theater. When an agentic model is recursively compiling code, deploying isolated containers, and executing multi-step environment diagnostics at machine speed, human oversight becomes an operational bottleneck. Human supervisors cannot realistically audit thousands of lines of recursively generated structural code in real time. During Fable’s brief public window, human operators experienced acute cognitive fatigue, inevitably defaulting to blind trust and […]
Last Updated on June 18, 2026 by Editorial Team Author(s): Chew Loong Nian – AI ENGINEER Originally published on Towards AI. I Trained a Markdown File to Boost GPT-5.5 by 23 Points — It Shouldn’t Work I did not fine-tune anything. I did not touch a single weight. I ran a training loop whose only output was a 1,400-token Markdown file, dropped that file into the context window of a frozen GPT-5.5, and watched its six-benchmark average jump from 58.8 to 82.3. That is +23.5 points from a text file you can open in any editor. After the intro, the article explains the core idea behind SkillOpt: treat the skill document (a Markdown “skill file”) like trainable state while keeping the target model frozen, and use a stronger optimizer model during training to propose bounded edits (add/delete/replace) that are only accepted if they strictly improve a held-out validation score—mirroring concepts from gradient descent stability in text space. It then summarizes the reported results across 52 (model, benchmark, harness) combinations, highlighting that SkillOpt is best-or-tied-best everywhere and that GPT-5.5 in direct chat rises from 58.8 to 82.3 (+23.5), with especially large gains on procedural, format-checked tasks like SpreadsheetBench. The author shows what the resulting “trained” skills look like—rules for checking structure, writing explicitly evaluated values, tracking state in embodied navigation, and anchoring answers to the correct table row—and notes that improvements can come from just a few accepted edits and a small artifact size. It provides a practical setup for reproducing the workflow quickly (installing SkillOpt, configuring backends, running the loop, and deploying by prepending the learned Markdown to the model’s context). The piece also covers SkillOpt-Sleep, a plugin-style extension that learns from a user’s own past transcripts with a review-and-adopt, validation-gated offline consolidation loop. 
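The validation-gated loop summarised above can be illustrated with a small sketch. It is an assumption-laden toy, not SkillOpt's actual API: `optimize_skill`, the `Edit` type, and the scoring function are hypothetical names, and the "optimizer model" is replaced by a fixed list of proposed edits so the example stays self-contained.

```python
from typing import Callable, Iterable

# An edit is any function that takes the current skill document and
# returns a modified copy (an add, delete, or replace).
Edit = Callable[[str], str]


def optimize_skill(
    skill: str,
    proposals: Iterable[Edit],
    score: Callable[[str], float],
) -> str:
    """Keep an edit only if it strictly improves the held-out score."""
    best, best_score = skill, score(skill)
    for edit in proposals:
        candidate = edit(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # ties and regressions are rejected
            best, best_score = candidate, candidate_score
    return best


if __name__ == "__main__":
    # Toy validation score: how many checkable rules the document contains.
    rules = ("check header row", "write explicit values")

    def score(doc: str) -> float:
        return float(sum(rule in doc for rule in rules))

    proposals = [
        lambda d: d + "\n- check header row",
        lambda d: d + "\n- be nice",  # no score gain, so it is rejected
        lambda d: d + "\n- write explicit values",
    ]
    print(optimize_skill("# Skill", proposals, score))
```

The strict inequality is the important design choice: an edit that leaves the validation score unchanged never enters the document, which keeps the artifact small.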
Finally, it addresses two limitations—reliance on automatic scoring judges and the fact it optimizes one document at a time—before concluding that for procedural, checkable agent tasks, training the document instead of the model is a more reliable and cheaper optimization than fine-tuning. Read the full blog for free on Medium. Published via Towards AI
Last Updated on June 18, 2026 by Editorial Team Author(s): Services Ground Originally published on Towards AI. We Replaced ChatGPT With a Local AI Server. Six Months of Honest Data. This is not a “local AI is better” argument. It is a data argument. Six months ago, a number stopped me mid-scroll: Qwen 2.5 Coder 32B scored 92.9 on HumanEval. GPT-4o scored 90.2. HumanEval is the industry-standard coding benchmark — 164 programming problems across languages and problem types, designed to measure real code generation capability. It is not perfect, but it is the closest thing to an objective apples-to-apples comparison the field has. A free, open-source model running on consumer hardware had just outperformed the model our team was paying $30 per user per month for. On the benchmark that matters most for our use case. That number demanded an honest audit of what we were actually paying for. What followed was six months of running both systems in parallel, tracking outputs against real tasks, measuring costs, and documenting the surprises. This article is that documentation — with the honest failures alongside the wins. The Audit: What We Were Actually Paying For Before building anything, we mapped every AI task our team of ten performed in a typical week. The breakdown was more lopsided than expected: ~45% writing tasks — emails, documentation, summaries, proposals ~30% coding tasks — debugging, code review, function generation, test writing ~15% analysis tasks — data interpretation, structured reasoning, research synthesis ~10% edge cases — tasks requiring real-time information, highly specialized reasoning, or frontier-level capability The critical insight from this audit: the 10% of tasks that genuinely required frontier-level intelligence were subsidizing the 90% that didn’t. We were paying per-user-per-month pricing for tasks where a local 14B model would produce output we couldn’t reliably distinguish from GPT-4o. This is the framing that matters. 
The question was never “is local AI better?” It was “for the specific distribution of tasks our team performs, does the quality delta justify the cost delta?” The honest answer for our team: no. Not at $300/month scaling indefinitely with headcount.

The Hardware Decision

We selected an RTX 3090 with 24GB of VRAM, purchased used for $600. The 24GB threshold is the critical inflection point in the local AI hardware tier because it is the minimum required to run 32B parameter models with Q4 quantization. Below 24GB you are running 14B models, which are capable but noticeably weaker on complex multi-step tasks.

The full hardware VRAM tier picture:

Hardware | VRAM | Max Model (Q4) | Quality Tier
CPU only | 16–64GB RAM | 7B (3–8 tok/s) | Acceptable for simple tasks
RTX 3070 / 4060 Ti | 8GB | 7B–8B | Good for daily tasks
RTX 3080 / 4080 | 16GB | 13B–14B | Strong, near-frontier on most tasks
RTX 3090 / 4090 ✅ | 24GB | 32B–34B | Competitive with GPT-4o on benchmarks
Dual 3090 / A6000 | 48GB+ | 70B full | Frontier-adjacent

Total infrastructure cost: ~$1,200 including the GPU, a used workstation, and 2TB NVMe storage. Break-even against our previous ChatGPT Team subscription: four months.

The Model Stack

We ran every major open-source model against our actual task distribution before settling on the final stack. Here is what we landed on and why each choice was made.

General Tasks: Qwen 2.5 14B

Pull command:
ollama pull qwen2.5:14b

Handles writing, email drafting, summarization, analysis, and Q&A. Fits in 9GB VRAM with Q4 quantization, leaving 15GB headroom for other processes or concurrent requests. The quality surprise: on writing tasks, the category where we expected the largest gap, we could not reliably distinguish Qwen 2.5 14B output from GPT-4o output in blind testing. The model’s instruction following is strong, tone control is accurate, and output length calibration is consistent. This is the default model. Most daily queries never need anything larger.
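The four-month break-even quoted in the hardware section follows directly from the figures the article gives (ten users, $30 per user per month, roughly $1,200 of hardware). A minimal sketch of that arithmetic is below; the function name is hypothetical, and it deliberately ignores electricity, maintenance, and staff time, none of which the article quantifies.

```python
def break_even_months(hardware_cost_usd: float, monthly_subscription_usd: float) -> float:
    """Months until a one-time hardware purchase matches cumulative subscription spend."""
    if monthly_subscription_usd <= 0:
        raise ValueError("monthly subscription must be positive")
    return hardware_cost_usd / monthly_subscription_usd


# Figures taken from the article; running costs are assumed to be zero here.
TEAM_SIZE = 10
PRICE_PER_USER_USD = 30
HARDWARE_COST_USD = 1200

monthly_spend = TEAM_SIZE * PRICE_PER_USER_USD
months = break_even_months(HARDWARE_COST_USD, monthly_spend)

print(f"Monthly subscription spend: ${monthly_spend}")
print(f"Break-even after {months:.1f} months")
```

Adding a monthly running-cost term to the denominator (subscription minus electricity, for example) would lengthen the break-even period, so the four-month figure is best read as a lower bound.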
Coding Tasks: Qwen 2.5 Coder 32B

Pull command:
ollama pull qwen2.5-coder:32b

The benchmark data holds in production. This model handles Python, TypeScript, Go, Rust, SQL, and shell scripting with genuine competence: idiomatic output, correct function signatures, accurate debugging explanations. It uses ~20GB VRAM at Q4, leaving minimal headroom on a 24GB card. This means it does not run simultaneously with other large models; Ollama swaps it in on demand and evicts the previous model. The swap latency is 3–5 seconds on NVMe storage. Acceptable for a team that isn’t running multiple models simultaneously.

HumanEval comparison for context:

Model | HumanEval | VRAM (Q4) | Cost
Qwen 2.5 Coder 32B | 92.9 | 20GB | Free
GPT-4o | 90.2 | n/a | $20+/mo
DeepSeek Coder V2 Lite | 90.2 | 10GB | Free
Qwen 2.5 Coder 7B | 83.5 | 5GB | Free

Reasoning Tasks: DeepSeek R1 14B

Pull command:
ollama pull deepseek-r1:14b

DeepSeek R1 uses a chain-of-thought architecture that externalizes its reasoning process before committing to an answer. The visible reasoning trace is not cosmetic; it produces measurably more accurate results on multi-step analytical tasks compared to standard instruction-following models of the same size. The tradeoff is speed. R1 generates its reasoning chain before producing a final answer, which adds latency. For tasks where accuracy matters more than speed (structured analysis, complex data interpretation, multi-constraint planning) it is the correct tool. For quick tasks, Qwen 2.5 7B is faster.

Voice Pipeline

Speech-to-Text:
pip install faster-whisper
# Or via Ollama:
ollama pull whisper

Whisper Large v3 Turbo achieves under 3% word error rate on clean audio, the same quality tier as OpenAI’s paid Whisper API. It runs on 6GB VRAM for real-time processing or CPU for batch transcription. The paid API costs per minute. The local version costs nothing per minute after hardware.

Text-to-Speech:
pip install kokoro

Kokoro (82M parameters) runs entirely on CPU. It produces natural-sounding speech that reviewers consistently rate above models ten times its size, with under 200ms time-to-first-audio on modern hardware. The GPU stays fully allocated to the LLM layer; Kokoro consumes no VRAM.

Document Q&A: RAG with nomic-embed-text

Pull command:
ollama pull nomic-embed-text

nomic-embed-text is the embedding model that enables RAG (Retrieval Augmented Generation). It converts documents into searchable vector representations stored in Qdrant, enabling the AI to retrieve relevant […]
Last Updated on June 14, 2026 by Editorial Team Author(s): Sai Bhargav Rallapalli Originally published on Towards AI. How I built a 98.8% accurate prediction model — and discovered that the “cleanest” fuel is hiding a dirty secret When the Global Automotive Council wants to reduce vehicle emissions, where do they start? Do they target fuel types? Engine sizes? Vehicle classes? The answer, it turns out, is not as straightforward as you’d think — and the data tells a story that completely contradicts common intuition. The article walks through a CO₂ emissions data science workflow: starting with a dataset of 7,000+ vehicles, the author cleans duplicates and verifies target distribution, then addresses multicollinearity (dropping redundant fuel-consumption columns) using variance inflation factor and Ridge regression for stability. They argue against removing high-emission outliers because those “top 1%” vehicles represent the category policy makers most need to regulate. The core result overturns raw fuel-type averages via Simpson’s Paradox: ethanol (E85) appears worst when averaged, but once the model controls for engine size and fuel consumption, ethanol is actually the cleanest fuel in the dataset—its benefit is “hidden” because it’s used in larger, higher-consuming engines. The author describes building a scikit-learn pipeline with one-hot encoding and evaluating performance (very high R², low error), then shows model weaknesses concentrated in rare alternative fuel categories and proposes policy recommendations tying fuel mandates to vehicle/engine constraints and targeted actions against super-emitters.
Last Updated on June 14, 2026 by Editorial Team Author(s): malrsapps Originally published on Towards AI. Training GPT-2 From Scratch on a GTX1050 This project started with curiosity rather than a clear plan. I wanted to understand how language models were actually trained and whether it was possible to build and train one myself on local hardware. At first, I considered building a small coding assistant. Later, I shifted toward a Turkish tax-focused model. Along the way, I found myself digging into datasets, tokenizers, training loops, GPU limitations, and model architecture. What began as a simple learning project eventually turned into training GPT-2 from scratch on a GTX1050. One of the local training runs running at full utilization on a GTX1050-class GPU with only 4GB VRAM. At the time, I only had access to a GTX1050-class GPU, limited VRAM, and a local Linux setup. I knew the hardware was far from ideal, but I wanted to understand how much experimentation was still possible under constrained conditions. Long-running experiments also introduced practical concerns such as heat, system stability, and power interruption risks during multi-day training runs. What started as a small experiment eventually evolved into tokenizer experiments, chained training workflows, runtime recovery systems, schedule-driven orchestration, and a full low-resource experimentation toolkit. Most importantly, it changed the way I think about training workflows on limited hardware. The Original Goal Was Completely Different My first idea was not related to Turkish language models at all. Initially, I wanted to train a coding-oriented model locally. I started downloading large open-source code datasets — some of them tens of gigabytes in size — and attempted to run training experiments directly on my local machine. That failed almost immediately. 
The datasets were too large, memory usage became unstable, and the overall workflow simply did not fit the hardware constraints I was working with. At that stage, I realized the first bottleneck was not the model itself. The real bottleneck was building a workflow that could survive on limited hardware. After those failed coding-model experiments, I changed direction. Instead of a coding model, I decided to experiment with a Turkish tax-oriented assistant. Since I work in accounting and tax-related fields, I already had structured domain knowledge and legal text sources available. I prepared an initial dataset using Turkish tax law documents and attempted a small fine-tuning experiment on GPT2-style models. The results were terrible. The problem quickly became obvious: The base model did not actually understand Turkish well enough. At first, I assumed I simply needed a better pretrained Turkish model. I searched Hugging Face for Turkish GPT-style models, downloaded several alternatives, and tried multiple fine-tuning runs. The outputs were still weak. Some models generated broken Turkish, some produced unstable outputs, and some simply failed to generalize in useful ways. That period taught me an important lesson: Not every published pretrained model is necessarily well-trained or practically usable. At that point, the project direction changed again. Instead of trying to fine-tune weak Turkish models, I decided to focus on a different problem entirely: First, I needed to build a GPT-style model that could actually learn Turkish reasonably well on constrained hardware. Discovering Wikipedia Training Once I realized the main issue was not fine-tuning but the lack of a strong Turkish base model, my focus shifted completely. The new goal became much simpler in theory: Train a small GPT-style model that could at least learn Turkish properly. 
Since GPT2-sized architectures were the only realistic option for my hardware, I started exploring how small-scale Turkish pretraining could work locally. That search eventually led me to Wikipedia datasets. At the time, Wikipedia looked manageable compared to the massive coding datasets I had attempted earlier. But even then, almost every stage introduced new problems. Tokenizer training became one of the first major bottlenecks. I experimented with training tokenizers directly on Turkish Wikipedia text, but repeatedly ran into issues related to vocabulary size, preprocessing stability, and memory constraints. I also discovered that many failures happened before actual model training even started. In practice, preparing the data pipeline turned out to be harder than launching the training itself. At various stages, I experimented with streaming-based approaches and alternative loading strategies to avoid memory problems. Some workflows partially worked, others failed unpredictably, and some became too slow to manage comfortably on my hardware. The deeper I went, the more I realized that low-resource experimentation was fundamentally a workflow engineering problem. The model architecture itself was only one part of the system. Why the Workflow Became Chained One of the biggest turning points came when I started splitting the datasets into smaller parts. Initially, I experimented with a small number of dataset chunks — around six parts — hoping sequential training could reduce runtime pressure and memory instability. The idea worked better than expected. Instead of treating training as one uninterrupted process, I could continue training progressively across multiple dataset parts. That realization slowly evolved into a chained training workflow. However, new problems appeared immediately. Some training parts took an extremely long time to finish. In several experiments, a single dataset part could run for dozens of hours. 
On my hardware, that created practical problems beyond machine learning itself. A training session running continuously for two or three days introduced risks such as:

- unexpected crashes
- overheating
- power interruptions
- unstable runtime behavior
- inability to use the machine normally during training

At that point, the workflow design started changing again. I increased the number of dataset parts repeatedly:

- first 6
- then 12
- eventually around 120 smaller parts

Counterintuitively, increasing the number of parts actually made experimentation more manageable. Smaller parts reduced runtime risk, improved recovery speed, and made debugging significantly easier.

The project gradually shifted from: “Train a model.” to: “Build a system capable of surviving long-running local experiments.” That distinction ended up shaping the entire toolkit architecture.

The Birth of the Training Loop

[Figure: An early chained training session showing tokenizer loading, dataset preparation, and the beginning of a local GPT-style training run.]

As the number […]
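The chained workflow described above (many small dataset parts, trained in order, with fast recovery after a crash) can be sketched in a few lines. This is not the author's toolkit: `run_chain`, the JSON state file, and the simulated failure are all hypothetical, and a real run would also save model and optimizer checkpoints alongside the progress record.

```python
import json
import tempfile
from pathlib import Path


def run_chain(parts, train_part, state_path):
    """Train on each part in order, skipping parts already recorded as done."""
    state_path = Path(state_path)
    done = json.loads(state_path.read_text())["done"] if state_path.exists() else []
    for part in parts:
        if part in done:
            continue  # finished in an earlier session
        train_part(part)  # may raise if the session is interrupted
        done.append(part)
        # Record progress after every part so a crash loses at most one part.
        state_path.write_text(json.dumps({"done": done}))
    return done


# Demo with a stand-in training function that fails once, like a power cut.
parts = [f"part_{i:03d}" for i in range(6)]
trained = []
crash_once = {"armed": True}


def fake_train(part):
    if part == "part_003" and crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated power interruption")
    trained.append(part)


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "chain_state.json"
    try:
        run_chain(parts, fake_train, state_file)  # first session crashes mid-chain
    except RuntimeError:
        pass
    completed = run_chain(parts, fake_train, state_file)  # second session resumes

print("trained order:", trained)
```

The smaller each part is, the less work is repeated after an interruption, which matches the article's observation that moving from 6 to roughly 120 parts made recovery faster.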
Author(s): Praveen Bhavani Originally published on Towards AI.

Principal Component Analysis (PCA) is one of the most widely used techniques for dimensionality reduction and feature extraction. PCA transforms correlated variables into a smaller set of uncorrelated variables called principal components, while preserving as much information (variance) as possible. PCA is fundamentally a linear algebra and statistical method rooted in:

- Covariance structure analysis
- Orthogonal transformations
- Eigenvalue decomposition
- Variance maximization

Modern datasets often contain hundreds or thousands of correlated variables. In finance, quantitative trading, computer vision, genomics, and machine learning, high-dimensional data creates several challenges:

- Computational inefficiency
- Noise accumulation
- Multicollinearity
- Overfitting
- Difficulty in visualization and interpretation

Conceptual Foundation of PCA

Consider a dataset of n observations measured across p features, where those features are often correlated with one another, as height and weight tend to be in a population study, or as pixel intensities tend to be in neighboring regions of an image. Working directly in this high-dimensional, correlated space is statistically inefficient: many of the p dimensions carry redundant information, and the sheer number of variables can obscure the underlying structure we actually care about. PCA addresses this by asking a deceptively simple question: is there a new coordinate system in which the data’s variability becomes easier to see?

Rather than describing each observation through the original features x₁, x₂, …, xₚ, PCA constructs a new set of axes, the principal components PC₁, PC₂, …, PCₚ, that are rotations of the original coordinate space. These axes are chosen sequentially and according to a strict rule: each one must point in the direction of greatest remaining variance in the data, subject to being orthogonal to every component that came before it.
This gives rise to a natural ordering. PC₁ is the single direction through the data that explains more variance than any other possible axis — it is, in a precise geometric sense, the line along which the data is most spread out. PC₂ then captures as much of the residual variance as possible, after the contribution of PC₁ has been removed, and it does so at a right angle to PC₁. PC₃ repeats this logic relative to the first two components, and so on down the chain. The orthogonality constraint is not merely a geometric nicety — it is what guarantees that the principal components are uncorrelated with one another. Information captured by PC₁ is completely absent from PC₂, which means each component contributes something genuinely new to the description of the data. This stands in direct contrast to the original features, which may be entangled by correlations that make it hard to attribute variance to any single variable cleanly. The practical payoff is that the bulk of the dataset’s variance is typically concentrated in the first few principal components. A dataset with fifty correlated features might have 90% of its variance sitting in just three or four PCs. Those components can then stand in for the full feature set — dramatically reducing dimensionality while preserving the structure that matters most for analysis, visualization, or downstream modeling. Geometric Interpretation The algebra of PCA — eigendecompositions, covariance matrices, orthogonal projections — can feel abstract until you see what it is actually doing to the data. Geometrically, PCA is a rotation. Nothing is stretched, nothing is discarded, and no information is destroyed. The coordinate axes simply pivot to a new orientation, one chosen to align with the natural shape of the data rather than with the arbitrary axes of the original measurement space. To make this concrete, consider two financial variables measured daily over a year: a stock’s return and its trading volume. 
Plotted as a scatter of points, this data rarely forms a tidy horizontal or vertical band. Returns and volume tend to move together — high-volume days often coincide with large price swings — so the cloud of points stretches diagonally across the plane, oriented somewhere between the two original axes. The original coordinate system, with its horizontal “return” axis and vertical “volume” axis, cuts across the data at an oblique angle. It describes the cloud accurately, but inefficiently: both variables are needed to characterize even the dominant pattern. PCA identifies that diagonal direction and makes it the first axis. PC₁ runs along the longest dimension of the cloud — the direction in which the data is most spread out. A single number along this axis tells you more about where a given day sits in the distribution than either the return or the volume measurement alone. The second axis, PC₂, is then placed at a right angle to PC₁, pointing across the narrow dimension of the cloud and capturing whatever residual variation remains after the dominant trend is accounted for. The result of this rotation is threefold. First, it compresses information: the dominant structure of the data, which previously required two coordinates to describe, is now legible in one. Second, it removes redundancy: because the original variables were correlated, they were partly saying the same thing twice; the rotated axes separate that shared signal from the independent variation each variable carries. Third, it decorrelates the features: by construction, the projections of the data onto PC₁ and PC₂ have zero correlation — the axes are orthogonal, so knowing a point’s position along one component tells you nothing about its position along the other. What PCA does not do is change the data itself. Every point occupies the same position in space before and after the transformation; only the rulers used to measure that position have changed. 
This is why PCA is lossless when all components are retained — and why choosing to keep only the first k components is a deliberate, interpretable act of compression rather than an accidental distortion. The Data Matrix The starting point for any rigorous treatment of PCA is a precise description of the data it operates on. We represent our dataset as a matrix X ∈ ℝⁿˣᵖ, where n is the number of observations and p is the number of measured variables. Each of […]
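The geometric claims in the preceding section (PCA is a rotation, the rotated coordinates are uncorrelated, and nothing is lost when every component is kept) can be checked numerically in the two-variable case. The sketch below uses only the standard library and the closed-form principal-axis angle for a 2×2 covariance matrix; the data is synthetic and merely stands in for the return and volume example, and all function names are hypothetical.

```python
import math
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def pca_2d(points):
    """Rotate 2-D points onto their principal axes; returns (mean, angle, scores)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Angle of the direction of greatest variance for a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    scores = [(c * (x - mx) + s * (y - my), -s * (x - mx) + c * (y - my))
              for x, y in points]
    return (mx, my), theta, scores


# Synthetic correlated data (an assumption, not real market data).
random.seed(0)
points = []
for _ in range(500):
    first = random.gauss(0.0, 1.0)
    second = 0.8 * first + random.gauss(0.0, 0.5)
    points.append((first, second))

(mx, my), theta, scores = pca_2d(points)
pc1 = [a for a, _ in scores]
pc2 = [b for _, b in scores]

var1, var2 = covariance(pc1, pc1), covariance(pc2, pc2)
cov12 = covariance(pc1, pc2)
total_before = (covariance([p[0] for p in points], [p[0] for p in points])
                + covariance([p[1] for p in points], [p[1] for p in points]))

# Undo the rotation: keeping both components reproduces the original points.
c, s = math.cos(theta), math.sin(theta)
rebuilt = [(mx + c * a - s * b, my + s * a + c * b) for a, b in scores]
max_error = max(max(abs(rx - x), abs(ry - y))
                for (rx, ry), (x, y) in zip(rebuilt, points))

print(f"share of variance on PC1: {var1 / (var1 + var2):.3f}")
print(f"covariance between PC1 and PC2: {cov12:.2e}")
print(f"max reconstruction error: {max_error:.2e}")
```

Dropping `pc2` before rebuilding would turn the exact reconstruction into an approximation whose error is governed by `var2`, which is the deliberate compression step the text describes.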
Author(s): yooiken Originally published on Towards AI. Build a Zero-Cost Web Automation Pipeline With OpenRouter, OpenClaw, and MediaUse I have become less interested in whether a cheap model can “browse the web” and more interested in whether it can run a boring workflow correctly every morning. That is a different problem. Most low-cost or free LLMs fail at web automation because the model has to do too much at once. It has to understand the goal, inspect the page, decide where to click, recover from layout changes, parse the result, and then write something useful. One weak link ruins the whole run. The workaround is simple: do not ask the free model to operate the browser. Use the free model as the dispatcher. Let MediaUse handle the browser work through site plugins. The model calls semantic commands like “get Hacker News top stories” or “read this Reddit thread.” MediaUse turns those commands into stable browser actions and returns structured JSON. In this article, I will build a daily pipeline that: Uses OpenClaw with OpenRouter’s free openrouter/owl-alpha model as the orchestrator. Uses MediaUse Hacker News skill to find today’s technical stories. Uses MediaUse Reddit skill to collect user reactions. Uses MediaUse ChatGPT skill to turn the research into a Medium draft. Saves the article draft locally. Runs every day around 10:00 AM. The result is a low-cost agent that does not depend on a frontier model for every step. The free model plans and routes. MediaUse performs the web operations. ChatGPT is optional and only used at the end because I want the final writing to be good. If you want the strictest “zero API spend” version, skip the ChatGPT step and ask owl-alpha to write the draft from the collected JSON. If you already have access to ChatGPT through the web UI, the MediaUse ChatGPT skill can use that browser workflow instead of sending paid API calls. 
Current caveat: OpenRouter lists openrouter/owl-alpha as free during its current availability window, with tool support and a large context window. Free model availability can change, so check the model page before relying on it in production.

Why this works better than “LLM, please browse the web”

A general browser agent has to reason over pixels and HTML. A site plugin does not. MediaUse skills package website actions into predictable commands.

The Hacker News skill has commands like:

mediause hackernews get top --limit 20 --json
mediause hackernews read item --id <item_id> --depth 2 --replies 20 --max-length 2000 --json

The Reddit skill has commands like:

mediause reddit search posts --query "open source AI agent" --subreddit "LocalLLaMA" --sort relevance --time day --limit 10 --json
mediause reddit read item --post-id <post_id> --sort top --limit 30 --depth 3 --max-length 3000 --json

The LLM does not need to know where the Reddit search box is. It does not need to scroll through nested comments. It just asks for the operation. That is the whole trick.

Low-quality models often struggle when the task is open-ended. They do much better when the action space is small, named, and structured. MediaUse gives them that smaller action space.
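One way to picture the "small, named, and structured" action space is a whitelist that maps action names to the command templates quoted above. The sketch below is an illustration, not part of MediaUse or OpenClaw: the action names (`hn_top`, `reddit_search`) and both helper functions are hypothetical, and `run_action` assumes the CLI prints a single JSON document to stdout, which the article implies but does not specify.

```python
import json
import subprocess

# Command templates copied from the article; the keys are hypothetical labels.
ACTIONS = {
    "hn_top": ["mediause", "hackernews", "get", "top",
               "--limit", "{limit}", "--json"],
    "reddit_search": ["mediause", "reddit", "search", "posts",
                      "--query", "{query}", "--subreddit", "{subreddit}",
                      "--sort", "relevance", "--time", "day",
                      "--limit", "{limit}", "--json"],
}


def build_command(action, **params):
    """Turn a whitelisted action name plus parameters into an argv list."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return [token.format(**params) for token in ACTIONS[action]]


def run_action(action, **params):
    """Run the command and parse stdout as JSON (assumed output format)."""
    completed = subprocess.run(build_command(action, **params),
                               capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


# Building the argv does not require the CLI to be installed.
print(build_command("hn_top", limit=20))
```

Passing an argv list rather than a shell string means a model-supplied query is never interpreted by the shell, and any action the model invents outside the whitelist fails fast with a `ValueError`.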
The pipeline

Here is the workflow I use:

10:00 AM
  |
  v
OpenClaw wakes up the workflow
  |
  v
OpenRouter owl-alpha chooses the plan
  |
  v
MediaUse Hacker News skill fetches today's top tech stories
  |
  v
MediaUse Reddit skill searches for matching user reactions
  |
  v
Research JSON is normalized into one brief
  |
  v
MediaUse ChatGPT skill writes a Medium draft
  |
  v
Draft is saved to ./drafts/YYYY-MM-DD-medium-draft.md

Step 1: install and configure MediaUse

On Windows, install or update the MediaUse CLI:

powershell -C "iwr https://release.mediause.dev/install.ps1 -UseBasicParsing | iex"
mediause --version

Configure your MediaUse key:

mediause manage key <your_mediause_key> --json

Install the site plugins:

mediause plugin add hackernews --json
mediause plugin add reddit --json
mediause plugin add chatgpt --json

Bind the accounts:

mediause auth list --json
# Hacker News supports guest read workflows.
mediause use account hackernews:guest --policy balanced --json
# Reddit usually works best in visible mode.
mediause use account reddit:<account_id> --policy balanced --show --json
# ChatGPT needs your account context if you use it for writing.
mediause use account chatgpt:<account_id> --policy balanced --json
mediause auth health --json

Step 2: configure OpenClaw to use OpenRouter Owl Alpha

Create an OpenRouter API key, then set it in your shell:

$env:OPENROUTER_API_KEY = "<your_openrouter_key>"

Use openrouter/owl-alpha as the model for the orchestration agent. The exact OpenClaw config shape may differ depending on your version, but the important part is the provider, base URL, and model:

provider: openrouter
base_url: https://openrouter.ai/api/v1
model: openrouter/owl-alpha
api_key_env: OPENROUTER_API_KEY
temperature: 0.2

You can sanity-check the model directly with OpenRouter’s OpenAI-compatible API:

$body = @{
  model = "openrouter/owl-alpha"
  messages = @(
    @{
      role = "user"
      content = "Return a JSON plan for collecting today's developer news."
    }
  )
  response_format = @{ type = "json_object" }
} | ConvertTo-Json -Depth 10

Invoke-RestMethod `
  -Uri "https://openrouter.ai/api/v1/chat/completions" `
  -Method Post `
  -Headers @{
    Authorization = "Bearer $env:OPENROUTER_API_KEY"
    "Content-Type" = "application/json"
  } `
  -Body $body

Step 3: give the agent a narrow prompt

The prompt matters. Do not ask the model to be creative with the workflow. Ask it to call a small set of commands and produce a strict output.

Use this as the system prompt for the OpenClaw workflow:

You are a daily technical research dispatcher. Your job is to collect material for one Medium article draft.

Rules:
- Use MediaUse commands only for website data collection.
- Prefer structured JSON outputs.
- Do not browse manually.
- Do not invent article facts.
- Keep the final research bundle under 12,000 words.
- Stop if a site returns a risk prompt, captcha, or account challenge.

Workflow:
1. Get today's top Hacker News stories.
2. Select 3 to 5 stories about projects, developer tools, AI infrastructure, open source, or software engineering.
3. Read each selected HN item with comments.
4. For each selected story, search Reddit for matching discussion.
5. Read the most relevant Reddit thread when available.
6. Build a research bundle with:
   - title
   - source URL
   - why it matters
   - HN discussion summary
   - Reddit user feedback summary
   - notable disagreement
   - possible article angle
7. Send the research bundle to ChatGPT through the MediaUse ChatGPT skill to […]
Last Updated on June 8, 2026 by Editorial Team Author(s): Chew Loong Nian – AI ENGINEER Originally published on Towards AI. I Gave Qwen3.7-Plus a Screenshot and It Found the Exact Pixel to Click for $0.40 I uploaded a messy AWS console screenshot and asked one question: which pixel do I click to launch an instance? The model came back with click at (x=1147, y=283). I overlaid that coordinate on the image. It landed dead center on the orange "Launch instance" button. Then I checked the price: $0.40 per million input tokens — one-sixth what Alibaba charges for the text-only Qwen3.7-Max, and the model scores 79.0 on ScreenSpot Pro, the benchmark that decides whether a "computer use" agent actually works. The author argues that successful “computer use” hinges on GUI grounding: given a screenshot and an instruction, the model must output exact pixel coordinates for the right UI element. They explain how Qwen3.7-Plus (a vision-capable variant that only outputs text) achieves a strong ScreenSpot Pro score (79.0), compare it to Qwen3.7-Max and other benchmarks, and show how to implement it quickly using Alibaba Cloud Model Studio via the OpenAI-compatible SDK. The article walks through four practical “glue” calls—(1) screenshot-to-JSON coordinates, (2) converting coordinates into real clicks with a confidence gate, (3) running an observe-act loop in Playwright for browser tasks, and (4) “screenshot to code” to recreate UI components. Finally, it discusses when to use Plus versus alternatives, highlights the key limitation that Plus is proprietary/API-only (no open weights or self-hosting), and concludes that it’s a cost-effective way to prototype frontier-grade screen grounding before moving to more polished managed or self-hostable solutions.
Last Updated on June 8, 2026 by Editorial Team Author(s): Suchit Majumdar Originally published on Towards AI. Beyond the Prompt: Why Autonomous AI Agents Are Replacing the Chatbot In May 2025, Sebastian Siemiatkowski — the same Klarna CEO who fifteen months earlier had told the world that one OpenAI-powered assistant was doing the work of 700 customer service agents — quietly started hiring humans back. Bloomberg got the quote: “Cost unfortunately seems to have been a too predominant evaluation factor, what you end up having is lower quality.” Headcount over the same window went from 5,527 at the end of 2022 to 3,422 at the end of 2024, per the S-1 Klarna filed in November. The chatbot stayed. The “all-AI customer service” story did not. So the title of this piece is half a lie, and I want to correct it before you read another paragraph. Chatbots are not, in any general sense, being replaced by autonomous agents in 2026. The replacement is happening in one specific place: queue-shaped back-office work where no human is waiting on the other end, and almost nowhere else. That narrow claim is the thesis. The broad version is what every vendor deck says, and it is wrong. If you walked out of your last AI strategy review thinking the agent wave is about to subsume your support org, your sales org, and your engineering org all at once, you are about to spend the next four quarters defending a budget against numbers that will not arrive. That is the claim. The rest is me showing my work. Klarna is evidence for the thesis, in reverse The 2024 Klarna press release is worth re-reading with an engineer’s eye. 2.3 million conversations in month one across 35 languages. Resolution time from 11 minutes down to 2. A CSAT of 4.4 against a human baseline of 4.2, Klarna’s own number, never independently audited. OpenAI mirrored the case study on its own site. It was the most widely cited “AI replaced humans” deployment of the LLM era. It was also a chatbot. Not an agent. 
A user-initiated, real-time, conversational interface with safety rails and a handoff-to-human button. Gergely Orosz pointed this out at the time in his Pragmatic Engineer breakdown: what Klarna had actually built was L1 tier-one support automation, the kind of containment work IVR systems were doing twenty years ago, except now in natural language. The bot was a filter that escalated anything sharp. Then it broke on the seams chatbots always break on. The May 2025 reporting from CX Dive and CNBC converges on a single picture: hallucinations clustered on edge cases. CSAT cratered on emotional tickets where the bot was technically correct but tonally wrong, because being right and being heard are different jobs. Compliance teams refused to let an LLM autonomously close accounts. So Klarna kept the bot for volume and rebuilt the human layer underneath it, “Uber-style,” remote and flexible, hiring students and rural workers as on-demand specialists. Read that as a bull case for chatbots if you want. I read it as a warning about the entire customer-facing slice. The most aggressive chatbot deployment in the world, with founder-level air cover and a workforce reduction of nearly 2,000 people, still bounced off the part of the work where a customer was on the line and cared about being there. That isn’t a story about agents replacing chatbots. It’s a story about customer-facing conversation being a category that resists full automation by either shape of system. The spine of the argument: the meaningful axis isn’t conversational versus autonomous, it’s who triggers the work. Source: builder spec compiled from Klarna S-1, Intercom Fin published metrics, Lemonade 10-K (Q4 2024). Where the chatbot still wins, and it isn’t close Intercom Fin is the cleanest counter to the “agents will eat customer support” narrative. Self-reported resolution rate of 67% globally as of late 2025, on 40 million cumulative conversations, across more than 10,000 business accounts. 
Priced at $0.99 per resolved conversation. Intercom claims the human-agent comparison is $5 to $10 per query and I’ll flag that as a vendor-published number, not an audit — but Teneo’s 2025 cost analysis lands in roughly the same range ($8–$15 per fully-loaded human resolution), so the order of magnitude is real even if Intercom is choosing the friendly end. The caveats matter. “Resolution” is defined by Intercom: the customer exits, or affirms satisfaction, after Fin’s last answer. No public study correlates that signal with actual customer satisfaction. And the variance across accounts is enormous. One Intercom community thread in late 2025 had a customer reporting 27.6% resolution rate next to another at 80.1% over the same 12-week window, with the high performers being the ones who spent two to four weeks cleaning their knowledge base before launch. The published 67% is a marketing mean sitting on a long, ugly tail. But the unit economics survive every caveat. This is a working chatbot business, at scale, on user-initiated conversational work, with no agent loop in sight. If your Q3 roadmap involves wrapping Fin in a LangGraph orchestrator and rebranding it an “agentic support platform,” the question I would ask in your planning meeting is whether the additional dollars per resolution clear the additional tokens per resolution, because the LeanOps numbers I’ll get to below say they usually don’t. There’s also the Air Canada precedent from February 2024, when the BC Civil Resolution Tribunal made the airline liable for its chatbot’s incorrect bereavement-fare advice. The damages were small, roughly $650 CAD. The precedent is not. Any system, conversational or autonomous, that makes binding statements to a customer creates legal exposure, which is one more structural reason the production migration is happening where no customer sits on the other end of the conversation at all. 
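The unit-economics argument in this section can be made concrete with a small calculation. The sketch below is illustrative only: it assumes every conversation the bot does not resolve falls through to a human at full cost, and that the bot bills nothing for unresolved conversations. The function name `blended_cost_per_ticket` and the $5.00 human cost (the low end of the vendor-published range quoted above) are assumptions chosen for the example, not figures from an audit.

```python
def blended_cost_per_ticket(resolution_rate: float, bot_price: float, human_cost: float) -> float:
    """Average cost per incoming ticket when unresolved tickets go to a human.

    Assumption: the bot is billed only for resolved conversations, and every
    unresolved conversation costs one full human resolution.
    """
    if not 0.0 <= resolution_rate <= 1.0:
        raise ValueError("resolution_rate must be between 0 and 1")
    return resolution_rate * bot_price + (1.0 - resolution_rate) * human_cost


BOT_PRICE = 0.99   # per resolved conversation, as quoted in the article
HUMAN_COST = 5.00  # assumed: low end of the vendor-published human range

# The published mean plus the two community-reported extremes.
for rate in (0.276, 0.67, 0.801):
    cost = blended_cost_per_ticket(rate, BOT_PRICE, HUMAN_COST)
    print(f"{rate:.1%} resolved -> ${cost:.2f} per ticket (all-human: ${HUMAN_COST:.2f})")
```

Running it across the three resolution rates shows why the spread between accounts matters as much as the headline mean: the savings scale directly with the fraction of tickets the bot actually closes.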
What actually has to be true for an agent to pay for itself Strip away the framework news cycle. OpenAI Agents SDK in March 2025. Google ADK in April. LangGraph 1.0 in October. Anthropic computer use […]
Last Updated on June 8, 2026 by Editorial Team Author(s): Chew Loong Nian – AI ENGINEER Originally published on Towards AI. Why this matters right now A Chinese lab just shipped a terminal coding agent that does almost everything Claude Code does, released the entire thing under the MIT license, and pointed it at a model that costs $0.60 per million output tokens. Claude Code’s default model, Opus 4.8, costs $25 for the same million tokens. That is roughly 42 times more expensive on the part of the bill that actually hurts. I spent the morning reading Moonshot’s repo line by line, pulling the install script apart, and trying to find the catch. The catch is smaller than you would expect. The article explains that Moonshot’s terminal agent is called Kimi Code CLI and is truly open to build and distribute under MIT, with an emphasis on how it differs from other “terminal coding agents” that are moving toward closed-source distribution. It details what the CLI does (single-binary install, a TUI, OAuth/API key login, and a workflow that reads/edits code and runs shell/web tasks), then highlights standout capabilities such as video input, conversational MCP configuration, built-in subagents, lifecycle hooks, and editor integration via an Agent Client Protocol. It argues the core value is not just the agent wrapper but the pricing and model loop: Kimi Code CLI runs on Kimi K2.6 open-weight models with dramatically cheaper output-token costs, making agent usage far more affordable than Claude Code’s and others’ more expensive proprietary setups. The piece compares how each incumbent approaches openness and licensing, concludes that Kimi Code CLI is an open alternative aligned with developers’ need to avoid lock-in and pricing shocks, and ends with a “verdict” that while the repo is still young and not for everyone (e.g., enterprise constraints), the direction toward open, MIT-licensed tooling is the most durable edge for 2026. Read the full blog for free on Medium. 
Last Updated on June 8, 2026 by Editorial Team
Author(s): Satish Kumar
Originally published on Towards AI.

Connections, Roles, and Warehouses: Getting CoCo Desktop Production-Ready from Day One

Snowflake CoCo Desktop | Part 1 of 8

There’s a moment every data engineer hits when first opening Snowflake’s CoCo Desktop: the welcome screen looks clean, the interface is polished, and then the connect step appears. If your organization uses SSO, has multiple accounts, or runs a non-default role setup, that step is where things quietly fall apart.

Most getting-started content for AI coding tools assumes the connection is the easy part. With CoCo Desktop, authentication is where you make architectural decisions that affect every subsequent session: which credentials get cached, which warehouse runs agent queries, which role the agent operates under. Getting it right upfront saves a lot of friction later. Getting it wrong means your agents either fail silently or run with more privileges than you intended.

This is the first article in an 8-part series on Snowflake CoCo Desktop for data engineering teams. This one covers everything before the first prompt: installation, prerequisites, the onboarding flow, authentication options, connection management, and the decisions you’ll want to make consciously rather than by default.

TL;DR

- CoCo Desktop requires a paid Snowflake account with Cortex Code enabled and the SNOWFLAKE.CORTEX_USER database role; trial accounts won't work. Available for macOS and Windows only (no Linux desktop client).
- The 4-step onboarding flow (welcome → connect → mode → theme) is mostly intuitive, but the Connect step catches teams who rely on SSO without a configured default browser.
- OAuth is the right default for most users. Password auth is available but not recommended; key pair is best for service accounts. PAT and Workload Identity Federation are also supported for specialized use cases.
- Default Warehouse set via the UI persists both server-side on your Snowflake account and locally in connections.toml. That dual-write behavior matters when multiple team members share the same Snowflake user.
- The connections.toml file permissions (chmod 600) are a requirement on macOS/Linux, not just a best practice: Snowflake tools will refuse to read the file otherwise.

What this doesn’t cover: how to configure roles for least-privilege agent use. That’s a permission-modes topic covered in a subsequent article.

Prerequisites: What You Need Before Installing

Before downloading CoCo Desktop, confirm these requirements are met. Skipping this step is the most common source of “it connects but nothing works” issues.

Account requirements:

- A paid Snowflake account (trial accounts are explicitly blocked; see the troubleshooting section below)
- Cortex Code must be enabled on the account
- Your user must have the SNOWFLAKE.CORTEX_USER database role (granted through PUBLIC by default, but your org may have revoked it)
- At least one supported model must be available to your account (check CORTEX_MODELS_ALLOWLIST)

Platform requirements:

- macOS (Apple Silicon or Intel) or Windows
- Linux is not supported for the desktop client (use Cortex Code CLI instead)

Network requirements:

- Network access to your Snowflake server
- If a model you need isn’t available in your region, an ACCOUNTADMIN must configure cross-region inference: ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'AWS_US';

Replace AWS_US with the appropriate region identifier (AWS_EU, AWS_APJ, AZURE_US, or ANY_REGION). This is a common first-run blocker that looks like a connection failure but is actually a model availability issue.
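The chmod 600 requirement noted in the TL;DR is easy to verify before launching anything. The sketch below is one way to do that on macOS or Linux; the helper name `is_owner_only` is hypothetical, and the check (no permission bits for group or others) is an assumption about what the tools enforce, slightly looser than an exact match on mode 600. The demo runs against a throwaway file rather than a real `~/.snowflake/connections.toml`.

```python
import os
import stat
import tempfile
from pathlib import Path


def is_owner_only(path: Path) -> bool:
    """Return True when the file grants no permissions to group or others (e.g. mode 600)."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return (mode & 0o077) == 0


# Demo on a temporary file so the sketch does not depend on a real Snowflake setup.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "connections.toml"
    demo.write_text("# placeholder\n")

    os.chmod(demo, 0o644)  # readable by group and others
    print("0644 owner-only?", is_owner_only(demo))

    os.chmod(demo, 0o600)  # owner read/write only
    print("0600 owner-only?", is_owner_only(demo))
```

Pointing the same function at `Path.home() / ".snowflake" / "connections.toml"` gives a quick pre-flight check for a team onboarding script.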
Quick prerequisite check — run this in any Snowflake worksheet to confirm readiness: SELECT CURRENT_USER() AS user, CURRENT_ROLE() AS role, CURRENT_WAREHOUSE() AS warehouse, CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME() AS account_identifier; If warehouse comes back NULL, you'll need to set a default before CoCo Desktop will execute agent queries. The Onboarding Flow Is Deceptively Simple Opening CoCo Desktop for the first time sends you through four screens: welcome, connect, mode, then theme. The first and last are cosmetic. The middle two are where the real work happens. The Connect step is where you either authenticate against an existing connection or create a new one. If you’ve already set up the Cortex Code CLI or Snowflake CLI, CoCo Desktop detects your ~/.snowflake/connections.toml automatically and shows your existing connections with a status dot. This is genuinely convenient — you don't have to re-enter anything. If you're starting fresh, you'll fill in an account identifier, a connection name, a username, and pick an authentication method. The account identifier format trips people up consistently. It follows the pattern orgname-accountname — not the Snowflake URL format, not the legacy account.region format that older tools use. You can find it at app.snowflake.com under your avatar → "Connect a tool to Snowflake." You can also read it directly from your Snowsight URL: https://app.snowflake.com/orgname/accountname/. Worth bookmarking that path if you're setting up multiple team members. The Mode step asks whether to start in Agent mode or Editor mode. This is not a permanent decision — you can switch at any time — but the choice sets the default layout for your first session. Agent mode is optimized for parallel agent sessions across multiple workspaces; Editor mode is optimized for working with files while keeping agent sessions on the side. More on the practical difference between these in Article 2. 
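The account identifier rule described earlier in this section (orgname-accountname, readable from the Snowsight URL) can be sketched as a small parser. This is an illustration under stated assumptions: the function name is hypothetical, and it assumes the URL has exactly the `https://app.snowflake.com/orgname/accountname/` shape quoted above.

```python
from urllib.parse import urlparse


def account_identifier_from_snowsight_url(url: str) -> str:
    """Build 'orgname-accountname' from a Snowsight URL of the form
    https://app.snowflake.com/<orgname>/<accountname>/ (assumed shape)."""
    parsed = urlparse(url)
    if parsed.hostname != "app.snowflake.com":
        raise ValueError(f"not a Snowsight URL: {url!r}")

    # Keep only non-empty path segments: ['orgname', 'accountname', ...]
    segments = [part for part in parsed.path.split("/") if part]
    if len(segments) < 2:
        raise ValueError("expected /<orgname>/<accountname>/ in the URL path")

    org_name, account_name = segments[0], segments[1]
    return f"{org_name}-{account_name}"


print(account_identifier_from_snowsight_url("https://app.snowflake.com/orgname/accountname/"))
```

Anything after the second path segment, including a URL fragment, is ignored, so a URL copied from a deeper Snowsight page should still yield the same identifier.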
One thing the onboarding flow doesn’t surface clearly: if your browser doesn’t open automatically for OAuth or SSO, there’s a “Browser didn’t open?” fallback link in the app. It’s easy to miss on first run and results in people assuming the connection failed when it just needs a manual URL copy.

Authentication Methods: A Practical Decision Tree

CoCo Desktop supports six authentication methods. The four primary ones cover most use cases; two additional methods serve specialized automation scenarios. Which one you choose should depend on your account’s security posture, not just what’s easiest to configure.

| Authentication Method | Best For | Credential Storage | Notes |
| --- | --- | --- | --- |
| OAuth ✅ Recommended | Most human users | OS Keychain / DPAPI | Add `client_store_temporary_credential = true`; otherwise re-authentication may be required on every launch. |
| External Browser / SSO | Organizations using Okta or Azure AD […]
Last Updated on June 8, 2026 by Editorial Team Author(s): Anubhav Originally published on Towards AI. My First $5,000 Month Writing About AI Engineering on Medium In May, my Medium earnings crossed $5,000 from writing about AI engineering. A month earlier, the same account had done 10.9K views. Two months earlier, it was at 3.9K. The jump looks like an overnight success if you only look at the final revenue screenshot. The tempting explanation is that two posts exploded and carried the entire month. That’s true, but it misses the point. If I had published the exact same viral posts on a brand new account, they would have spiked and died. Instead, they spiked and woke up fifteen other articles. Readers clicked on a post about local LLMs, finished it, looked at my profile, and clicked through three more posts on retrieval-augmented generation and agent architectures. The Month That Changed The Account March was quiet. I was publishing consistently and seeing small signals of life — a stray comment here, a highlighted code snippet there. The traffic was random. By mid-April, the compounding started to become visible. I wasn’t getting viral hits, but my baseline daily views were rising. I would publish a new article and older pieces would get a secondary wave of traffic. The catalog was waking up as readers who found one piece of my work started clicking through to the others. Entering the second week of May, the breakout happened. The recommendations stopped being random. The same kind of builder-focused reader kept showing up. When a post took off, the traffic spilled over into the rest of the catalog. I remember refreshing the dashboard on a Tuesday morning and seeing an old RAG post from March suddenly back on the daily charts, sitting right next to an article I had published twelve hours prior. The Wrong Lesson Would Be “Write Viral Posts” When people see a big month, they immediately try to reverse-engineer the biggest hits. 
Looking at my May dashboard, a few specific articles did most of the numbers. My guide on Claude Code setup pulled in 39,000 views and nearly 20,000 reads, earning about $1,360. A breakdown of the best local LLMs for coding did 39,000 views and 18,900 reads, bringing in just over $2,000 — my single highest-earning post. A curated list of 12 AI books brought in another 28,000 views, 12,800 reads, and $714. The obvious conclusion is that I should just write more listicles and setup guides. Outliers gave me the spike. But the spike only mattered because there were 20 other posts for those readers to land on next. When a developer clicked on the Claude Code setup guide, they got a useful tutorial. At the bottom of that page, however, they saw links to my other work. They found deep technical dives like “RAG Chunking That Works,” “Multi-Agent Systems” and a detailed comparison of LangGraph vs Temporal. If my profile had only contained generic AI news, that developer would have closed the tab. Because it contained dense engineering articles, they realized I was a builder. They hit the follow button, joined the email list, and clicked through to older posts. The Niche Was Narrower Than “AI” I drew a hard boundary around my topics. The niche was not artificial intelligence broadly, but AI engineering specifically. If I wrote about AI in general, I would be competing with news sites and generic content farms. I’d end up writing about how AI is going to change the future of work, which no engineer wants to read. Instead, I broke my niche down into very specific lanes. This specific mapping worked. It had enough search volume to matter, but the code snippets naturally filtered out people looking for ChatGPT prompt hacks. It was also deeply connected. If someone reads an article about RAG chunking, they are a natural fit for a piece about reranking or hybrid search. I had to stay disciplined here. 
Writing a great article about Python agents today and a generic productivity piece tomorrow would just dilute the exact reader base I was trying to build. Compounding Looked Boring Until It Was Obvious By late April, I realized that compounding in content creation looks like absolutely nothing for weeks. It looks like small, boring movements. I would publish an article and get 50 views. I’d publish another and get 70. Then, older posts started firing together. The pattern repeated every time. One new post would catch attention. Readers clicked on it, finished reading, and the recommendation engine pulled up an article from three weeks ago at the bottom of the page. The Claude Code spike directly helped my older AI coding workflow posts. The AI Books article drove traffic back to my roadmap pieces. The RAG Chunking article kept my technical RAG authority alive and fed traffic to my reranking guide. You cannot judge a technical article by its first week of traffic, because an article about Python agents might sit dead for a month until a related LangGraph piece suddenly revives it. Technical Deep Dives Worked, But Not Alone I used to think you had to pick: deep tutorials or broad overviews. May taught me you need both, for different reasons. Deep technical dives proved my competence. My article on RAG Chunking was dense, filled with regex patterns for markdown parsing and detailed explanations of semantic boundaries. It didn’t go viral. It did 2,700 views and 1,100 reads. But the people who clicked it read it, holding a read ratio around 40%. Meanwhile, my deep-dive architectural comparison of LangGraph vs Temporal did only 959 views, but it earned $119.91. That is a high earnings-per-view ratio. It monetized well because it answered a specific architectural question that senior engineers were searching for. 
They were trying to decide between state machines and durable execution for their agent orchestration, and they spent ten minutes reading every word of that post. The biggest discovery came from posts with broader entry points. […]
Last Updated on June 8, 2026 by Editorial Team Author(s): Chew Loong Nian – AI ENGINEER Originally published on Towards AI. Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn't Be This Good A 26-billion-parameter model has no business fitting in 15GB of memory and spitting out 193 tokens a second on a single consumer GPU. That is laptop-and-gaming-rig territory, not a datacenter. Yet that is exactly what Google’s new Gemma 4 QAT checkpoints do, and after digging into how they pulled it off, the part that stuck with me is not the speed. It is that the 4-bit version barely loses anything compared to the full-precision original. By every law of quantization I thought I understood, it should be noticeably dumber. It isn’t. After the lead, the article breaks down why Gemma 4 QAT + Unsloth’s GGUF conversion is unusually effective: it quantizes during training so the model learns to be robust to 4-bit rounding, explains the typical PTQ quality loss, and describes how Unsloth fixes a subtle scale-mismatch bug that otherwise wipes out most of the benefit when converting to llama.cpp formats. It then provides concrete performance and memory numbers for different Gemma 4 variants (especially the 26B-A4B mixture-of-experts model), compares naive vs dynamic conversion accuracy, and summarizes the practical steps to run the model with llama.cpp, plus other deployment options (API server, Ollama/LM Studio, Unsloth Studio, vLLM/SGLang, MLX, and browser ONNX). Finally, it offers guidance on which model to choose based on available hardware, notes the remaining caveat that 4-bit is still 4-bit, and concludes that the usual quality-vs-speed tradeoff is collapsing—making the 26B-A4B feel like a near big-model experience on consumer GPUs. Read the full blog for free on Medium. Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. 
Last Updated on June 8, 2026 by Editorial Team
Author(s): Atul Kumar
Originally published on Towards AI.

LangChain Explained: Understanding Models, Prompts, Chains, Memory, Indexes, and Agents

Large Language Models (LLMs) such as GPT, Gemini, and Claude have made it easier than ever to build intelligent applications. However, developing production-ready AI systems often requires much more than simply calling an API. This is where LangChain comes in.

In this article, we’ll explore the core components of LangChain and understand why they are important.

Introduction to LangChain

LangChain is an open-source framework that simplifies the development of applications powered by Large Language Models. It provides a collection of components that help developers connect LLMs with external data, memory, tools, and workflows.

Benefits of LangChain:

1. Model Agnostic: LangChain provides a unified interface for different LLM providers such as OpenAI, Gemini, and Claude. This makes it easier to switch models without rewriting large portions of code.
2. Reusable Prompt Templates: Instead of hardcoding prompts, developers can create dynamic and reusable prompt templates, improving maintainability and scalability.
3. Simplified AI Workflows: Chains allow multiple operations to be connected, making complex workflows easier to build and manage.
4. Context-Aware Applications: Memory enables applications to remember previous interactions, creating more natural and personalized user experiences.
5. Efficient Knowledge Retrieval: Indexes and vector databases allow applications to retrieve relevant information from large datasets, improving response accuracy.
6. Autonomous Decision Making: Agents can dynamically decide which tools or actions to use, enabling the development of intelligent AI systems.
7. Faster Development: LangChain provides ready-to-use components, reducing development time and allowing developers to focus on application logic rather than infrastructure.
What Can We Build Using LangChain?

1. AI Chatbots: Build conversational assistants capable of answering questions and maintaining context throughout a conversation.
2. Customer Support Systems: Create intelligent support agents that can access company knowledge bases and provide accurate responses.
3. Retrieval-Augmented Generation (RAG) Applications: Build systems that retrieve information from documents and use LLMs to generate context-aware answers.
4. Document Question-Answering Systems: Allow users to upload PDFs, research papers, or reports and ask questions about their contents.
5. AI Agents: Develop autonomous agents that can use tools, APIs, databases, and search engines to complete tasks.
6. Personal AI Assistants: Create assistants that manage schedules, answer questions, summarize information, and perform actions on behalf of users.
7. Content Generation Tools: Generate blogs, social media posts, emails, reports, and marketing content automatically.
8. Recommendation Systems: Use embeddings and semantic search to recommend products, articles, courses, or videos.
9. Research Assistants: Build AI systems that search, summarize, and analyze information from multiple sources.
10. Multi-Agent Systems: Create multiple specialized agents that collaborate to solve complex problems and automate workflows.

Components of LangChain

[Figure: LangChain Components]

1. Models

A Model is the interface through which a LangChain application talks to an LLM. It is one of the core building blocks of a LangChain application.

Problem: Different AI providers, such as Anthropic, OpenAI, and Google Gemini, use different SDKs and API formats. If you write code directly for one provider, switching to another often requires changing significant parts of your code.

Solution: LangChain’s Model abstraction provides a common interface for interacting with different LLMs. Instead of writing provider-specific code, we write against LangChain’s standardized API.

[Figure: Models in LangChain]

2.
Prompts

A Prompt is the text input or instruction provided to a Large Language Model (LLM).

Problem: LLMs perform best when prompts are well structured, but hardcoding prompts every time can lead to:

- Repeated code
- Inconsistent outputs
- Difficult maintenance

Solution: The Prompt component in LangChain provides reusable and dynamic templates that insert user inputs into predefined instructions before sending them to the model.

Prompting Techniques:

1. Dynamic Prompting: The prompt changes based on user input or variables.
[Figure: Example of Dynamic Prompting]
Use Case: Personalised responses, chatbots, AI applications.

2. Role-Based Prompting: Assign a role or persona to the model.
[Figure: Example of Role-Based Prompting]
Use Case: Expert advice, tutoring, coding assistance.

3. Few-Shot Prompting: Provide a few examples so the model learns the desired pattern and generates the desired output for a new input.
[Figure: Example of Few-Shot Prompting]
Use Case: Classification, extraction, formatting tasks.

3. Chains

A Chain in LangChain connects individual components such as prompts, LLMs, and output parsers into a single automated workflow. Take an NLP pipeline as an example: a chain lets us represent the flow as a pipeline; without one, we have to manually take the output of each step and pass it as input to the next.

Problem: Many GenAI applications require multiple steps, not just a single LLM call. Without chains, we would have to connect all the steps manually.

Solution: Chains connect multiple LangChain components into a single workflow, in which the output of one component becomes the input to the next.

Types of Chains:

- Sequential Chain: Steps execute one after another.
- Parallel Chain: Multiple tasks run simultaneously.
- RAG Chain: The most common in industry (chat with PDF, company knowledge base).

[Figure: Types of Chains]

4. Memory

Problem: By default, LLM API calls are stateless, meaning they do not remember previous conversations.
Solution: Memory stores conversation history or important information and automatically provides it to the LLM when needed.

How Memory Works

[Figure: Working of Memory]

Types of Memory:

1. Conversation Buffer Memory: Stores the entire conversation. Simple, but grows large over time. Best for short conversations and prototyping.
[Figure: Conversation Buffer Memory]

2. Conversation Buffer Window Memory: Limits memory to only the last K messages. Best for maintaining recent conversational context while keeping token usage predictable and preventing the context window from overflowing.

3. Summary-Based Memory: Periodically summarizes older chat segments to keep a condensed memory footprint.

4. Custom Memory: For advanced use cases, we can store specialized state, e.g., the user's preferences or key facts about them, in a custom memory class.

💡 Key Takeaway: Choose the right memory type based on your use case, conversation length, and token limit to build an efficient and context-aware AI application.

5. Indexes

Indexes in LangChain connect your application to external knowledge, such as PDFs, […]
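To make the windowed-memory idea from the Memory section concrete, here is a minimal plain-Python sketch. It deliberately does not use LangChain's own classes: `WindowMemory` and its methods are hypothetical names invented for this illustration, and the prompt layout is an assumption. It shows the core trade-off of keeping only the last K messages.

```python
from collections import deque
from typing import Deque, Tuple

Message = Tuple[str, str]  # (role, text)


class WindowMemory:
    """Keeps only the last k messages (illustrative helper, not a LangChain class)."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        # deque(maxlen=k) silently drops the oldest message once full.
        self._messages: Deque[Message] = deque(maxlen=k)

    def add(self, role: str, text: str) -> None:
        self._messages.append((role, text))

    def as_prompt(self, new_user_input: str) -> str:
        """Prepend the remembered window to the new user input."""
        history = "\n".join(f"{role}: {text}" for role, text in self._messages)
        return f"{history}\nuser: {new_user_input}".strip()


memory = WindowMemory(k=2)
memory.add("user", "My name is Ada.")
memory.add("assistant", "Nice to meet you, Ada.")
memory.add("user", "I work on compilers.")

prompt = memory.as_prompt("What is my name?")
print(prompt)
```

With k=2 the first message has already been evicted by the time the question is asked, which is exactly the cost of a window: token usage stays bounded, but older facts are lost unless they are summarized or stored elsewhere.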
Last Updated on June 8, 2026 by Editorial Team Author(s): Sourav Ghosh Originally published on Towards AI. Is JSON Finally Getting a Token-Efficient Alternative for LLMs? For years, JSON has been the default language for APIs, integrations, configuration files, event payloads, and all other types of application-to-application communications. It is an easy language to understand, it is very robust and developers can easily exploit it. But when we transition from traditional software systems to Large Language Model applications, we start to see how JSON comes with an invisible price tag. LLMs do not process JSON the way that applications do. They handle it as tokens. The article explains why JSON becomes token-expensive for LLMs—repeated keys, syntax, and nested structure consume context window and increase cost—then introduces TOON (Token-Oriented Object Notation) as a more token-efficient, prompt-friendly way to represent structured data while preserving the same underlying data model (objects, arrays, strings, numbers, booleans, null). It shows a before/after example converting JSON arrays of records into TOON where field names are declared once, values are arranged in rows, and structure remains readable for the model. The piece argues TOON is especially valuable at the LLM boundary when payloads share a uniform schema with repeated records (common in RAG retrieval results, agent tool outputs, and agent memory), and it provides enterprise scenarios plus code/prompt patterns illustrating how to use TOON as LLM input while keeping JSON for validated outputs. Finally, it outlines best practices and cautions: don’t replace JSON everywhere, use TOON only where it fits (and validate outputs), benchmark against JSON, consider tooling/model reliability and escaping edge cases, and treat TOON as an optimization layer for context representation rather than an enterprise contract substitute. Read the full blog for free on Medium. 
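The layout the summary describes (field names declared once, values arranged in rows) can be sketched in a few lines. This is a simplified illustration modeled on that description, not the official TOON encoder: the function name `to_tabular` is hypothetical, it handles only flat records with identical fields, and it does no quoting or escaping, so values containing commas or newlines would break it. The actual syntax and escaping rules should be taken from the TOON specification.

```python
import json
from typing import Any, Dict, List


def to_tabular(name: str, records: List[Dict[str, Any]]) -> str:
    """Encode uniform records as one header line plus one row per record.

    Simplified TOON-style sketch: no nesting, no quoting, no escaping.
    """
    if not records:
        return f"{name}[0]:"

    fields = list(records[0].keys())
    for record in records:
        if list(record.keys()) != fields:
            raise ValueError("all records must share the same fields in the same order")

    field_list = ",".join(fields)
    header = f"{name}[{len(records)}]{{{field_list}}}:"
    rows = ["  " + ",".join(str(record[field]) for field in fields) for record in records]
    return "\n".join([header] + rows)


users = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]

encoded = to_tabular("users", users)
print(encoded)
print("tabular chars:", len(encoded), "| JSON chars:", len(json.dumps({"users": users})))
```

Character counts are only a rough proxy for tokens; as the summary advises, the real comparison should be benchmarked with the target model's tokenizer.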
Last Updated on June 3, 2026 by Editorial Team
Author(s): Rick Hightower
Originally published on Towards AI.

Part 17: Why your config looks right but Claude ignores it, and the exact commands that diagnose it in seconds

In this article: Most Claude Code failures are not mysteries. They are a small set of recurring patterns: files that load on demand instead of at startup, permission rules that quietly override your mode, descriptions that fail to trigger a subagent. This is a Claude Code troubleshooting catalog grouped into five categories, with the exact diagnostic commands for each. Read it once and you will spend dramatically less time chasing configuration ghosts.

[Figure: Claude Code Troubleshooting]

After the introduction, the article lays out a practical troubleshooting catalog for Claude Code. It begins with four “muscle memory” commands (/doctor, /context, /memory, and /debug) that resolve most issues quickly, then explains five recurring categories of failure: loading problems (often caused by on-demand CLAUDE.md loading rather than startup), permission surprises (ask/allow rules override mode settings), delegation not triggering (subagent/skill descriptions must be specific and trigger-based), integrations going quiet (use /mcp, /hooks, and correct project-scoped configuration; debug failures including auth, paths, and permissions), and environment quirks (UI/input changes, trust/worktree behavior, and performance issues traced to state). It closes with a diagnostic order to follow when you’re stuck, the scope/precedence rule behind “it should work” situations, and a “do this today” checklist to build the instinct for diagnosing problems in seconds.

Read the full blog for free on Medium.
Last Updated on June 3, 2026 by Editorial Team
Author(s): Rick Hightower
Originally published on Towards AI.

Claude Code Personas Part 16: A short, honest map of how much Claude Code setup your work actually needs, so you stop measuring yourself against a maximalist ideal you may never require.

In this article: Claude Code mastery is not a single finish line. It shows up at three levels of investment, casual, pro, and elite, and most developers do not need the top one. This article describes what the setup looks like at each level, the friction that pushes you up, the traps that make people overbuild, and a one-paragraph rule for picking the right amount of structure for any task. By the end, you will be able to locate yourself honestly and name your next concrete move.

[Figure: Claude Code Personas: Casual, Pro, Elite]

The article argues that “Claude Code mastery” isn’t about maximizing tooling, but matching the amount of setup to the depth of the work: casual is one session per task with manual review; pro adds planning, automation via Auto Mode, and a living CLAUDE.md plus targeted skills; elite builds infrastructure such as parallel worktrees, deterministic hooks, and shared MCP-connected verification systems that you treat like CI. It explains that the personas are a menu, not a ladder (different repos and tasks may require different levels), and that you should follow a practical one-paragraph rule: notice the friction you hit today, then make only the single next investment that removes it (or, if you’re higher-level, temporarily try the matching next investment or help someone else move up). Ultimately, true mastery is judgment and restraint: choose the structure your current work demands, earn the next layer only when real pain shows up, and avoid copying artifacts or jumping levels for the sake of identity.

Read the full blog for free on Medium.
Last Updated on June 3, 2026 by Editorial Team Author(s): Chew Loong Nian – AI ENGINEER Originally published on Towards AI. MiniMax M3 Decodes 1M Tokens 15x Faster — and It Shouldn’t Be This Cheap On June 1, a Shanghai lab quietly shipped a model that decodes a 1-million-token context 15.6x faster than its own previous generation — and charges you roughly 8% of what Claude Opus costs to do it. I spent two days poking at MiniMax M3 through the API, and the part that actually rewired how I think about long-context isn’t the benchmark table everyone is screenshotting. It’s the attention mechanism underneath it. After the lead, the article argues that MiniMax’s real breakthrough isn’t just the speed claims or headline SWE-Bench results, but the model’s architecture: MiniMax Sparse Attention (MSA). The author explains why standard attention becomes prohibitively expensive at 1M-token context lengths and contrasts MSA with other approaches like DeepSeek’s latent attention (MLA) and native sparse attention (NSA). MSA is described as using a lightweight index branch on top of grouped-query attention to select relevant KV cache blocks for queries, running attention only on those selected blocks while using real (uncompressed) key-values and optimizing GPU memory access via a “KV outer gather Q” pattern. The piece then revisits reported benchmarks with caveats that scores are vendor-reported and independent testing wasn’t possible at launch because weights weren’t released, noting that while M3 is competitive for coding, it is weaker in multimodal grounding and hallucination-related performance. It emphasizes that the pricing is the standout differentiator—especially the very low per-million-token input and output costs—making long-context agentic workflows economically feasible. 
The author provides quick-start guidance for using M3 via OpenRouter or MiniMax’s API, suggests practical tests for long-context behavior, and concludes with a nuanced verdict: M3 may not be the absolute smartest overall, but its cost and 1M-token economic viability are a genuinely new product category, with remaining uncertainty tied to benchmark independence and the still-pending open weights.
Last Updated on June 3, 2026 by Editorial Team Author(s): Pallav Kant Originally published on Towards AI.

Using Amazon SQS for AI Agent Orchestration

As AI agents become more capable, organizations are moving beyond standalone chatbots and building systems where multiple agents work together to complete complex tasks. A single request may involve one agent gathering information, another analyzing data, a third generating content, and a fourth validating the results. Coordinating agents that work asynchronously requires a reliable way to exchange information, hand off work, and handle failures. Direct communication between agents can quickly create tightly coupled systems that are difficult to scale and maintain.

This is where messaging services play an important role. By introducing a messaging layer, organizations can decouple agents, enable asynchronous processing, improve fault tolerance, and scale components independently. Instead of communicating directly, agents exchange messages through queues or event streams.

Among the various messaging technologies available, Amazon Simple Queue Service (SQS) is one of the most popular options for building scalable multi-agent AI workflows. As a fully managed message queuing service, SQS allows agents to communicate asynchronously through queues, improving reliability and simplifying orchestration. In this article, we’ll explore how Amazon SQS can be used to orchestrate AI agents, discuss common architectural patterns, and walk through a practical implementation example.

Understanding AI Agent Orchestration

AI agent orchestration refers to the process of coordinating multiple agents to accomplish a larger goal. Imagine a user asks: “Research electric vehicles, compare the top three models, and create a presentation.” A multi-agent system might work like this:

Research Agent
- Searches the web
- Collects relevant information
- Stores findings
- Passes the information to the next agent, the Analysis Agent
Analysis Agent
- Compares vehicle specifications
- Identifies strengths and weaknesses
- Generates insights
- Passes the information to the next agent, the Content Generation Agent

Content Generation Agent
- Creates presentation content
- Writes speaker notes
- Sends everything to the Review Agent, the last agent in the flow

Review Agent
- Checks for consistency
- Validates information
- Approves final output

Each agent performs a specialized task and passes results to the next agent. Without orchestration, coordinating these interactions can become difficult and fragile.

Why Use Amazon SQS?

Amazon SQS offers several benefits for AI workflows.

Decoupling Agents

Decoupling means agents do not call each other directly; instead, you place an SQS queue between each pair of agents. For the example above, instead of doing this:

Research Agent → Analysis Agent → Content Generation Agent → Review Agent

you can use:

Research Agent ↓ SQS Queue ↓ Analysis Agent ↓ SQS Queue ↓ Content Generation Agent ↓ SQS Queue ↓ Review Agent

Agents don’t need to know where other agents are running. They simply read messages from a queue, process them, and send results to another queue. This greatly simplifies system design.

Reliability

AI workflows can fail for many reasons, including API timeouts, LLM errors, rate limiting, and infrastructure outages. SQS retains a message until a consumer deletes it after successful processing (or until the queue’s retention period expires). If an agent crashes mid-task, the message becomes visible again once its visibility timeout expires, and another worker can pick it up. This helps prevent task and context loss.

Scalability

Suppose your system receives 10 requests per minute today but 10,000 requests per minute tomorrow. SQS allows you to scale processing independently. You can increase the number of:
- Lambda functions
- ECS containers
- Kubernetes pods

Cost Efficiency

Workers process jobs only when messages exist. This makes SQS especially attractive when combined with Auto Scaling Groups. You pay primarily for actual usage.
Core Architecture

A common AI orchestration architecture looks like this:

User Request ↓ Orchestrator ↓ Task Queue (SQS) ↓ Research Agent ↓ Analysis Queue (SQS) ↓ Analysis Agent ↓ Content Queue (SQS) ↓ Content Generation Agent ↓ Result Store

Each stage consumes messages from one queue and publishes messages to the next queue. This creates a workflow pipeline.

Queue Design Patterns

Pattern 1: Sequential Workflow. This is the simplest approach. Each agent performs one task and forwards the result. This pattern is best for report generation, content creation, and data processing pipelines.

Queue A → Research Agent
Queue B → Analysis Agent
Queue C → Content Agent

Pattern 2: Fan-Out Processing. Sometimes multiple agents need the same data. The orchestrator duplicates messages and sends them to multiple queues. This enables parallel processing, which brings faster execution, independent scaling, and fewer bottlenecks.

Research Result
  |
  +----> Analysis Agent
  +----> Content Agent
  +----> Fact Check Agent

Pattern 3: Dynamic Agent Routing. More advanced systems determine the next agent dynamically. The router uses an LLM to decide which specialized agent should handle the request. This creates intelligent workflows.

Incoming Request
  |
  V
Router Agent
  |
  +----> Analysis Queue
  +----> Content Generation Queue
  +----> Review Queue

Message Structure

A well-designed message is critical. Here is an example of the initial JSON payload sent to the first agent (the Research Agent) in the example used earlier in this article:

{
  "taskId": "12345",
  "workflowId": "wf-001",
  "agentType": "research",
  "status": "pending",
  "input": {
    "query": "Top electric vehicles in 2026"
  }
}

After the first agent finishes processing, it generates the following message, which is passed to the second agent for analysis.
{
  "taskId": "12345",
  "workflowId": "wf-001",
  "agentType": "analysis",
  "status": "completed",
  "researchResults": {
    "vehicles": [
      "Tesla Model Y",
      "Hyundai Ioniq 5",
      "Ford Mustang Mach-E"
    ]
  }
}

Including a workflow identifier in the payload helps track jobs across multiple agents.

Handling Failures

No production AI system is perfect. Failures can happen for many reasons, including API unavailability, network issues, irrelevant prompts, and exceeded token limits. SQS supports dead-letter queues (DLQs) to handle such failures.

Main Queue
  |
  +--> Failure
  +--> Retry
  +--> Retry
  +--> DLQ

Messages that repeatedly fail move to a DLQ for investigation. This prevents endless retry loops.

Multi-Agent Example

Let’s build a document analysis workflow.

Step 1: User Uploads Document. The application places a message in document-processing-queue. Message:

{
  "documentId": "doc123",
  "type": "zoning-report"
}

Step 2: Extraction Agent […]
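The retry-then-DLQ flow described under Handling Failures can be sketched without any AWS dependency. The following is a toy in-memory simulation, not SQS itself: `run_worker`, `flaky_analysis_agent`, and the value `MAX_RECEIVE_COUNT = 3` are hypothetical names and settings chosen for illustration, loosely mirroring the idea of a redrive policy that caps how many times a message may be received before it is set aside.

```python
from collections import deque

MAX_RECEIVE_COUNT = 3  # assumed cap, analogous in spirit to a redrive policy limit


def run_worker(messages, handler, max_receive_count=MAX_RECEIVE_COUNT):
    """Process messages, re-queueing failures and dead-lettering repeat offenders."""
    main_queue = deque({"body": m, "receive_count": 0} for m in messages)
    done, dead_letter = [], []
    while main_queue:
        msg = main_queue.popleft()
        msg["receive_count"] += 1
        try:
            done.append(handler(msg["body"]))
        except Exception:
            if msg["receive_count"] >= max_receive_count:
                dead_letter.append(msg["body"])  # give up: park it for investigation
            else:
                main_queue.append(msg)  # retry later
    return done, dead_letter


def flaky_analysis_agent(body):
    """Hypothetical agent that always fails on one task to show the DLQ path."""
    if body["taskId"] == "bad":
        raise ValueError("token limit exceeded")
    return {**body, "agentType": "analysis", "status": "completed"}


tasks = [
    {"taskId": "12345", "workflowId": "wf-001"},
    {"taskId": "bad", "workflowId": "wf-002"},
]
done, dead_letter = run_worker(tasks, flaky_analysis_agent)
print(done)
print(dead_letter)
```

In a real deployment the queue, the receive counter, and the move to the DLQ are managed by SQS; the sketch only shows why a cap on retries keeps a permanently failing task from looping forever while healthy tasks continue through the pipeline.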
Last Updated on June 3, 2026 by Editorial Team Author(s): Chew Loong Nian – AI ENGINEER Originally published on Towards AI. I Ran a 1.5B-Active Model on My Laptop That Embarrassed a 26B by 46 Points I did not expect a model that activates 1.5 billion parameters to walk all over one that activates 4 billion — and to do it on a benchmark that actually matters for agents. But that is exactly what happened when I put Liquid AI’s new LFM2.5–8B-A1B (released May 28, 2026) up against Google’s Gemma 4 26B. On Tau²-Telecom, a hard multi-turn tool-use benchmark, the tiny Liquid model scored 88.07 against Gemma 4 26B’s 42.11. That is a 46-point gap — in favor of the model small enough to run on my laptop with memory to spare. After the lead, the article explains why an 8B MoE model can matter again for on-device agents: it’s optimized for privacy-sensitive, multi-step tool use and reliable abstention, not just leaderboard-style generality. It details the LFM2.5–8B-A1B architecture (sparse MoE with many short-range gated convolution layers instead of attention) and notes a reasoning-only design intended to improve tool/agent behavior without making inference too expensive on edge hardware. The author walks through benchmarks showing major gains over the prior LFM2 generation and a large Tau²-Telecom advantage over Gemma 4 26B, including a big improvement in non-hallucination/hedging. They then cover practical deployment (vLLM, Transformers, and llama.cpp/quantized GGUF on CPU), runtime performance and tool-call formatting, and the “catch” that the model is strong for agentic tool dispatch but weak for coding and some knowledge-intensive tasks. 
The piece concludes with guidance on when to choose this model (private on-device tool-calling) versus when to use something else (dedicated coding models or systems with retrieval for broad knowledge), culminating in a brief verdict that Liquid built a “scalpel” for agents rather than a “Swiss Army knife.”
Last Updated on June 3, 2026 by Editorial Team Author(s): Naresh Idiga Originally published on Towards AI. How to Build a Self-Improving Company with AI Most founders are using AI wrong. They’re adding it on top of existing company structures. That’s the old model. This image was created using an AI image creation program. This article was written with the assistance of an AI writing program, based on the author’s notes and analysis of Tom Blomfield’s YC batch talk. Tom Blomfield co-founded Monzo — took it from zero to 5M+ customers and $900M raised. Now he’s a General Partner at Y Combinator. In a recent batch talk, he made one core argument, built partly on ideas he borrowed from Jack Dorsey’s tweets and a talk by fellow YC partner Diana: Most founders are using AI wrong. They’re adding it on top of existing company structures. That’s the old model. The new model: your company is a set of recursive, self-improving AI loops. Here’s the framework — and what it means for every developer, builder, and startup founder right now. TL;DR Most companies bolt AI onto old hierarchies. Wrong model. The right model: recursive, self-improving AI loops that get better while you sleep. Key shifts: record everything → make your company legible to AI → burn tokens not headcount → cut middle management → keep humans at the edge. The constraint is shifting from headcount to token spend. The companies that win will maximize what AI can do, not how many people they employ. Table of Contents The Roman Legion Problem The Self-Improving AI Loop — All 5 Layers The Holy Shit Moment at YC Real Examples: Product, Support, and Ops Loops Four Things to Do Right Now Where Humans Still Matter What This Means for Developers and Builders 1. The Roman Legion Problem Blomfield opened with a sharp analogy — one he credits to Jack Dorsey’s recent tweets. 
The Roman legions were built to project power from Rome to the far edges of the empire through nested hierarchies: named individuals with fixed spans of control, passing orders down and sending information back up. Most companies today are still organized like Roman legions. Humans are the conduit for information flowing up and down. That’s not a productivity problem. It’s a structural assumption baked into how we think about organizations. And AI doesn’t just improve this structure; it breaks the assumption underneath it.

The old mental model: add AI co-pilots to your existing workflows. Make engineers 20% more productive. Ship more code. Blomfield’s critique: that’s like putting a more powerful engine onto a horse-drawn cart. You’re taking the old way of working and adding a more powerful engine onto it. The new mental model: redesign what a company is.

2. The Self-Improving AI Loop: All 5 Layers

Instead of thinking about AI as a tool bolted onto your org, think about your company as a set of recursive, self-improving loops. Each loop has 5 layers. If every single step runs with minimal human intervention, your system gets better and better while you’re sleeping. Here are all five layers:

Layer 1: Sensor Layer. This is where real-world data comes in. Emails from customers, support tickets, code changes, people cancelling their subscription, product telemetry. Think of it as the sensory layer that picks up signals from the outside world.

Layer 2: Decision Layer (Policy Layer). This is where rules live. What can the system act on autonomously? What does it have to log? What must it ask a human permission for? This is the policy and decision layer: guardrails for what the AI can and cannot do on its own.

Layer 3: Tool Layer. This is where code executes. Deterministic APIs: query a database, look at a calendar, send an email, update a record. These are the tools the AI can call.
Blomfield describes this as the “skills and code” layer: the actual mechanisms of execution.

Layer 4: Quality Gate. Before any action is committed, it passes through a quality gate. This might be eval checks, safety filters, or human review for high-risk actions. It’s the checkpoint between “the AI decided to do something” and “the thing actually happens.”

Layer 5: Learning Mechanism. This is what makes the whole thing self-improving. The system interacts with the real world, picks up where it didn’t work, and loops back to the top again. Failures become training signal. The loop tightens over time.

The key insight: if you can run every single step of that loop without human intervention, or with minimal human supervision, your system gets better and better overnight. Automatically.

3. The Holy Shit Moment at YC

Blomfield described exactly when this clicked for him at YC. They built a simple internal agent, a tool their partners could ask questions like “when did I last have a meeting with this founder?” It worked well. Partners got answers faster. Useful, but nothing groundbreaking. Think of it as a smarter search bar for their internal database.

Then they put a monitoring agent on top of it. This agent watched every query every YC employee made. It tracked when queries succeeded and when they failed. When they failed, it asked: Why? What would have made this work? Do we need different deterministic tools? A new database view? A different index? Then, overnight, it wrote the fix, opened a pull request to the YC codebase, had another agent review it, and merged and deployed it. By the time a human came in the next morning to ask the same query, it worked. No human involved.

“For me, that was like the holy shit moment. That’s not just AI making you 20 or 30% more valuable. It is the AI going through this loop […]
Last Updated on June 3, 2026 by Editorial Team Author(s): Mehedi Hasan Originally published on Towards AI. Part 3 — Implementation/Engine-Level: Choosing the Runtime That Gives You These for Free You now know how to make the model fast (Part 1) and how to build a stable serving layer around it (Part 2). The final question is: which engine actually implements all of this without forcing you to write a custom scheduler from scratch? The theme of this part: inference engines are not neutral wrappers. They bake in specific opinions about batching, KV cache memory layout, prefix caching, and kernel selection. Pick the engine that aligns with your pain points, and you get chunked prefill, continuous batching, and paged KV cache for free. Pick the wrong one, and you’ll spend sprints reimplementing features the right engine already has. Here is how the four major runtimes compare in 2026, with exact configs and the tradeoffs that matter for production. vLLM: The Production Default vLLM is the safest starting point for most teams. Its core innovation — PagedAttention — treats the KV cache like virtual memory with fixed-size blocks, reducing fragmentation from 60–80% in naive systems to under 4%. This directly translates to 2–4x higher concurrency on the same GPU. 
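To make the virtual-memory analogy concrete, here is a minimal sketch comparing how many KV-cache token slots get reserved when every request pre-allocates the full context window versus when each request takes only as many fixed-size blocks as it needs. This is a toy calculation, not vLLM's allocator: the request lengths, the block size of 16, and the 4096-token window are all assumed values picked for illustration, and the resulting percentages apply only to this made-up workload.

```python
import math


def naive_slots(request_lengths, max_model_len):
    """Contiguous allocation: every request reserves max_model_len token slots."""
    return len(request_lengths) * max_model_len


def paged_slots(request_lengths, block_size):
    """Paged allocation: each request takes ceil(len / block_size) fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)


def wasted_fraction(reserved, used):
    """Share of reserved slots that hold no tokens."""
    return (reserved - used) / reserved


lengths = [120, 900, 37, 2048, 512]  # hypothetical token counts per request
used = sum(lengths)
naive = naive_slots(lengths, max_model_len=4096)
paged = paged_slots(lengths, block_size=16)

print(f"naive reserved={naive}, wasted={wasted_fraction(naive, used):.1%}")
print(f"paged reserved={paged}, wasted={wasted_fraction(paged, used):.1%}")
```

With paging, the only waste left is the partially filled last block of each request, which is why smaller blocks reduce fragmentation at the cost of more bookkeeping.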
What you get out of the box:
- Continuous batching (iteration-level scheduling): requests enter and leave the GPU every token step, not every batch
- Chunked prefill (v0.4+): long prompts are broken into chunks and interleaved with decode steps, so a 3K-token prefill doesn’t starve short chat requests
- Automatic prefix caching (APC): the engine detects shared prompt prefixes and reuses KV cache automatically
- Speculative decoding (EAGLE, Medusa, n-gram): 2–3x latency reduction for memory-bound decode
- Multi-LoRA serving: serve hundreds of fine-tuned adapters on one base model
- Broad quantization support: GPTQ, AWQ, FP8, INT8, INT4, AutoRound
- 200+ model architectures: Llama, Qwen, DeepSeek, Mixtral, MoE, VLMs, embedding models

The config that matters:

    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.3-70B-Instruct \
      --tensor-parallel-size 2 \
      --gpu-memory-utilization 0.85 \
      --max-model-len 8192 \
      --max-num-seqs 64 \
      --enable-prefix-caching \
      --enable-chunked-prefill \
      --quantization fp8

Key flags explained:
- --enable-prefix-caching: turns on automatic prefix caching for shared system prompts (massive for RAG)
- --enable-chunked-prefill: prevents long prefill monopolization; interleaves prefill chunks with decode
- --gpu-memory-utilization 0.85: leaves 15% headroom for CUDA graph capture and KV cache growth; going to 0.95 often causes OOM during graph compilation
- --max-num-seqs 64: caps concurrent sequences. Higher isn’t always better; if you hit memory limits, the engine will evict blocks and thrash.

MRV2 (Model Runner V2): In v0.17.0+, enable VLLM_USE_V2_MODEL_RUNNER=1 for a rewritten backend that delivers significant throughput gains, especially on newer architectures like GB200.
When to choose vLLM: You support many different models and need one engine to handle them all You run on heterogeneous hardware (NVIDIA, AMD, Intel Gaudi, AWS Trainium) Your team wants the largest community, best documentation, and fastest debugging You need to be online in under 90 seconds from a cold start Limitation: Peak throughput on dedicated H100 clusters is ~29% lower than SGLang or LMDeploy in some benchmarks, primarily due to Python orchestration overhead. If you have a fixed model and a specialized team, you can squeeze more out of other engines. But for most teams, vLLM’s breadth outweighs that gap. SGLang: The Throughput Challenger with Automatic Prefix Caching SGLang, developed by LMSYS (the team behind Chatbot Arena), is no longer a niche alternative. It powers xAI’s Grok 3 and Microsoft Azure’s DeepSeek R1 deployments, running on over 400,000 GPUs worldwide. What differentiates it: RadixAttention. Instead of manually configuring prefix caches, SGLang builds a radix tree from request prefixes and automatically reuses KV cache across any requests that share token sequences. This is transformative for multi-turn chat, agent loops, and RAG pipelines where system prompts and retrieved contexts repeat. 
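The prefix-reuse idea behind RadixAttention can be illustrated with a much simpler structure. The sketch below uses a plain token-level trie (a radix tree would additionally compress chains of single-child nodes) and only counts how many leading tokens of each request are already cached; it is not SGLang's implementation, and `PrefixCache`, `serve`, and the example token lists are hypothetical names invented for this illustration.

```python
class PrefixCache:
    """Toy token-level trie; each node stands for one cached token position."""

    def __init__(self):
        self.root = {}

    def match_len(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record the full token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def serve(cache, tokens):
    """Return (reused, computed) token counts for one request, then cache it."""
    reused = cache.match_len(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


cache = PrefixCache()
system = ["<sys>", "You", "are", "helpful", "</sys>"]

first = serve(cache, system + ["Hi"])                  # cold cache
second = serve(cache, system + ["Summarize", "this"])  # shares the system prompt
third = serve(cache, system + ["Hi", "again"])         # extends the first turn

print(first, second, third)
```

The second and third requests only need prefill for their new tokens, which is where the gains on multi-turn chat and shared system prompts come from.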
What you get out of the box:
- RadixAttention: automatic, dynamic prefix caching without manual key management
- Chunked prefill: same interleaving benefit as vLLM
- EAGLE/EAGLE3 speculative decoding: state-of-the-art draft-model speculation
- Prefill-decode disaggregation: separate prefill and decode across different GPU pools for independent scaling
- MLA-optimized kernels: specifically tuned for DeepSeek models
- Zero-overhead CPU scheduler: moves scheduling logic off the GPU thread

The config that matters:

    python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.3-70B-Instruct \
      --tp 2 \
      --quantization fp8 \
      --context-length 8192 \
      --mem-fraction-static 0.92 \
      --enable-flashinfer-mla \
      --host 0.0.0.0 \
      --port 8000

Key flags explained:
- --tp 2: tensor parallelism across 2 GPUs
- --mem-fraction-static 0.92: SGLang’s memory allocator is more aggressive than vLLM’s; 0.92 is typically stable on H100
- --enable-flashinfer-mla: enables optimized Multi-Head Latent Attention kernels for DeepSeek-class models

Performance reality check: In H100 benchmarks with unique prompts (no prefix sharing), SGLang achieves roughly 29% higher throughput than vLLM. However, the gap narrows or reverses on workloads with high memory pressure where vLLM’s PagedAttention is more mature. The real win is in shared-prefix workloads (multi-turn conversations, agent loops, and RAG with fixed retrievers), where RadixAttention provides gains no other engine matches automatically.

When to choose SGLang:
- Your workload is dominated by multi-turn conversations or shared system prompts
- You are serving DeepSeek models (MLA kernels are best-in-class)
- You have a dedicated inference team that can manage dependencies (FlashInfer can be finicky to install)
- You need prefill-decode disaggregation at scale

Limitation: Model coverage is narrower than vLLM. If you serve exotic architectures or need to swap models frequently, vLLM is safer.
TensorRT-LLM: The NVIDIA Optimizer (With a Catch) TensorRT-LLM is NVIDIA’s official inference SDK. It delivers the highest raw throughput and lowest TTFT on NVIDIA hardware when fully tuned. But it makes very specific tradeoffs. The compiled engine tradeoff: Traditionally, TensorRT-LLM required compiling a model into a serialized engine — a process that takes ~28 minutes for a 70B model. This is a one-time cost per model version, but it breaks auto-scaling and blue-green deploys unless you precompile and cache engines. The PyTorch backend (v1.0+): This changed the game. TensorRT-LLM now defaults to a PyTorch backend that loads HuggingFace weights directly, cutting cold start to ~60–90 seconds (comparable to vLLM). You lose some peak throughput compared to the compiled engine, but you […]
Last Updated on June 3, 2026 by Editorial Team Author(s): Mehedi Hasan Originally published on Towards AI. Part 2 — Serve-Level Speed: System Design That Stabilizes P95/P99 You’ve quantized the model, switched to Flash Attention, and maybe even dropped to INT4. Your GPU kernels are now efficient. But users still complain that the app is “sometimes slow.” Welcome to serving hell, where the bottleneck is rarely the model and almost always the system around it. The theme of this part: once the model is efficient, most production wins come from queueing discipline, traffic routing, and stability controls. P95 and P99 latency are not driven by tensor core utilization. They’re driven by queueing, noisy neighbors, long prompts stuck behind short ones, and slow clients holding onto GPU memory. System Level Techniques to reduce LLM production latency The Real Enemy Is Queueing, Not Compute Here is the counterintuitive truth of production LLM serving: most latency is waiting time, not compute time. A request that takes 50ms of actual GPU work can easily spend 800ms in a queue because the batcher decided to wait for one more request, or because a 4K-token RAG prompt monopolized the prefill slot. P95 and P99 latency are almost always caused by: Queueing: Requests piling up behind a large batch, or when the KV cache pool is exhausted and no new slots are free Noisy neighbors: One tenant submits a 10K-token prompt and stalls everyone else Long prompts: Prefill dominates the GPU, starving decode steps Slow clients: A streaming client on a 3G connection buffers tokens and pins GPU memory Cold starts: A freshly scaled replica that hasn’t loaded weights or allocated KV cache If you only optimize median latency, you miss the real user experience. Users remember the one time they waited three seconds for the first token. Your product metrics will look fine while your user trust erodes. Measure the Right Things Before you fix anything, you need metrics that split the problem correctly. 
Most teams log “total request time” and call it a day. That is useless. Log these on every request: Time-to-first-token (TTFT): the user’s perception of responsiveness Time Per Output Token (TPOT): the standard industry metric for decode speed (e.g., 20ms/token). It is the inverse of tokens-per-second, making it easier to calculate SLAs Prompt tokens and output tokens: separates prefill cost from decode cost Queue wait time (KV Cache Starvation): time spent before the GPU starts work. Note: requests usually queue not because the GPU is at 100% compute, but because it has run out of PagedAttention blocks in the KV cache Prefill time and decode time separately: tells you which phase to optimize P95 and P99 per lane: not global P99, but per traffic lane (interactive vs. batch, short vs. long) Why per-lane matters: If you mix a 50-token chat query with a 4K-token legal document summary, your global P99 will be dominated by the long prompt. You’ll optimize the wrong thing. Split your metrics by lane and optimize each lane independently. Practical implementation: Most teams pipe these into Prometheus or Datadog. The key is tagging every metric with lane, model, quantization, and gpu_type. If you can’t segment, you can’t diagnose. Traffic Shaping: Separate Your Lanes The single most effective serving optimization is also the simplest: don’t let different workloads fight over the same GPU. Interactive vs. Batch Interactive traffic (chat, streaming UI) needs low TTFT. Batch traffic (background summarization, embedding generation) needs high throughput. They want opposite things from the scheduler. The rule: Run them on separate replicas, or at minimum, separate queues with different scheduling policies. 
In vLLM, you can approximate this with separate engine instances:

    # Interactive lane: small max batch, prioritize TTFT
    python -m vllm.entrypoints.api_server \
      --model your-model \
      --max-num-seqs 4 \
      --max-model-len 4096 \
      --port 8000

    # Batch lane: larger batch, tolerate higher latency
    python -m vllm.entrypoints.api_server \
      --model your-model \
      --max-num-seqs 16 \
      --max-model-len 8192 \
      --port 8001

In TGI, the scheduler is less configurable per instance, so the cleanest approach is separate deployments behind a router.

Prompt-Length Lanes

Even within interactive traffic, a 100-token prompt and a 3K-token RAG prompt should not share a queue. The long prompt will stall the short one during prefill.

Fast lane: Prompts under ~512 tokens. Strict max wait time (5–10ms), small batch cap.
Slow lane: Prompts of ~512 tokens or more. Longer wait time allowed (50–100ms), larger batch cap acceptable.

Router logic (Python-style sketch; route_to_fast_lane and route_to_slow_lane stand in for your own dispatch functions):

    if prompt_tokens < 512 and streaming:
        route_to_fast_lane()
    else:
        route_to_slow_lane()

The threshold depends on your model and GPU. Measure where your prefill time starts to dominate TTFT and draw the line there.

The Modern Alternative: Chunked Prefill

While routing by prompt length is a great architectural defense, modern inference engines (like vLLM 0.4+ and TGI) now solve this at the scheduler level via chunked prefill. Instead of computing a 3,000-token prefill in one giant block (which starves all other requests of decode steps), the engine breaks the prefill into smaller chunks (e.g., 512 tokens). It computes one chunk, runs a decode step for active streams, computes the next chunk, and so on. (We’ll cover which engines support this in Part 3.)

Continuous Batching and SLA Caps

Static batching is dead. Today, engines use continuous batching (also called iteration-level scheduling). Instead of waiting for a batch to fill, the scheduler greedily injects new requests at the next token step, as soon as a sequence finishes and its KV cache frees up.
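The iteration-level scheduling described above can be sketched as a small simulation. This is a toy model under simplifying assumptions, not how any real engine is implemented: every running sequence emits exactly one token per step, prefill cost and KV-cache limits are ignored, each request needs at least one token, and `continuous_batching` plus the request list are hypothetical names made up for the example.

```python
from collections import deque


def continuous_batching(requests, max_num_seqs):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, tokens_to_generate), each needing >= 1 token.
    Returns {request_id: decode step at which it finished}.
    """
    waiting = deque(requests)
    running = {}       # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit new requests as soon as a slot is free,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_num_seqs:
            rid, n_tokens = waiting.popleft()
            running[rid] = n_tokens
        step += 1
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished_at[rid] = step
    return finished_at


result = continuous_batching(
    [("a", 2), ("b", 5), ("c", 1), ("d", 2)], max_num_seqs=2
)
print(result)
```

Under static batching with the same slot count, requests c and d would have to wait until both a and b were done; here c starts the moment a finishes, which is the throughput benefit, and also the reason a greedy scheduler needs wait-time caps to protect TTFT.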
The danger of continuous batching is that greedy schedulers can ruin TTFT. Derive your maximum queue wait from your TTFT SLA. If your interactive SLA is 100ms, configure the router so no request waits more than 80ms in the queue, leaving 20ms for the actual prefill compute. (Note: this 80/20 math only applies to the fast lane with short prompts; long prompts will need dedicated SLA tracking.)

In vLLM, this is handled internally, but you control the tradeoff via --max-num-seqs (max concurrent sequences) and --max-model-len. For more explicit control, some teams run custom dispatchers:

    # Dispatcher pseudocode
    MAX_WAIT_MS = 10
    MAX_BATCH = 8
    while True:
        batch = queue.collect_until(
            max_items=MAX_BATCH,
            timeout_ms=MAX_WAIT_MS […]
Last Updated on June 3, 2026 by Editorial Team Author(s): Mehedi Hasan Originally published on Towards AI.

3-Part Series: LLM Latency in Production (Part 1)

Originally published at https://mhabir.substack.com.

Part 1, Model-Level Speed: Make the Model Fast on the GPU

If you’re shipping LLMs to production, your first performance bottleneck isn’t serving logic or network overhead; it’s the raw arithmetic happening inside the GPU. Most teams waste weeks tuning their batching logic before realizing their model baseline is 3–4x slower than it should be. This part is about fixing that baseline.

Why LLM Inference Is Memory-Bandwidth Bound (Especially in Decode)

The fundamental misconception: LLMs are not always compute-bound. Decode is typically memory-bandwidth bound, while prefill is mixed (compute + memory) and becomes kernel-sensitive, especially with long contexts. Here’s the intuition that proves it.

A 7B parameter model in FP16 needs 14 GB just for weights. For a single token generation step (decode), you’re moving those 14 GB through GPU memory bandwidth (TB/s-class HBM) to do ~14 GFLOP of computation (about 2 FLOPs per parameter). That’s an arithmetic intensity around 1 FLOP/byte, well below the roofline ridge point where compute becomes the limit. On modern GPUs, you’d need >200 FLOP/byte to saturate tensor cores. In practice, during decode, you’re waiting on HBM reads, not matrix multiplications.

This has two consequences:
- Batching helps because amortizing weight loads across multiple sequences improves effective memory bandwidth utilization.
- Quantization is a bandwidth win: INT4 weights are 4x smaller than FP16, so you move 4x less data per token. That directly translates to lower latency.

Caveat: This is an upper-bound mental model. Effective traffic depends on batching, caching, and parallelization; real workloads see less than this theoretical maximum.

Every LLM request has two phases with completely different performance characteristics.
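The roofline arithmetic in the memory-bandwidth discussion above can be checked in a few lines. This is a back-of-the-envelope sketch under the same simplifying assumptions as the text: roughly 2 FLOPs per parameter per generated token, weights read once per decode step and shared across the batch, and KV-cache and activation traffic ignored. The function name `decode_arithmetic_intensity` and the batch size of 32 are illustrative choices, not values from the article.

```python
def decode_arithmetic_intensity(n_params, bytes_per_weight, batch_size=1):
    """Approximate FLOPs per byte of weight traffic for one decode step."""
    # One multiply and one add per weight, per sequence in the batch.
    flops = 2 * n_params * batch_size
    # Weights are streamed from HBM once per step, regardless of batch size.
    bytes_moved = n_params * bytes_per_weight
    return flops / bytes_moved


fp16 = decode_arithmetic_intensity(7e9, bytes_per_weight=2)
int4 = decode_arithmetic_intensity(7e9, bytes_per_weight=0.5)
batched = decode_arithmetic_intensity(7e9, bytes_per_weight=2, batch_size=32)

print(f"FP16, batch 1:  {fp16:.1f} FLOP/byte")
print(f"INT4, batch 1:  {int4:.1f} FLOP/byte")
print(f"FP16, batch 32: {batched:.1f} FLOP/byte")
```

Both levers raise the ratio in the same way: quantization shrinks the denominator (bytes moved), while batching grows the numerator (useful work per byte), and either way the intensity stays far below the >200 FLOP/byte figure quoted above for saturating tensor cores.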
The Prefill-Decode Asymmetry

Prefill processes the entire prompt in one forward pass, but it’s compute-intensive and memory-heavy because you’re building the KV cache. For a 4K-token prompt, you’re doing attention over a 4K sequence in parallel (not 4K autoregressive steps), creating an O(n²) attention matrix and storing 4K × hidden_dim × num_layers × 2 (K and V) values. This can be multiple GB per request on large models.

Decode generates tokens autoregressively. Each step processes one token, but reuses the KV cache. It’s memory-bandwidth dominated because you’re streaming the entire KV cache through HBM on every step.

This asymmetry means your optimization strategy must be phase-aware. Faster prefill requires better attention kernels (Flash Attention). Faster decode requires better cache management (paged KV, quantization).

Quantization is the single most effective model-level optimization. It reduces memory footprint, improves bandwidth efficiency, and often comes with minimal quality loss.

INT8 (LLM.int8()) uses vector-wise quantization with outlier preservation. It’s the safest starting point; most models show <0.1% perplexity degradation. Implementation is straightforward:

```bash
# bitsandbytes INT8 inference
pip install bitsandbytes
```

In your model loading code:

```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # outlier threshold
    llm_int8_has_fp16_weight=False
)
```

This works out-of-the-box in vLLM and TGI:

```bash
# TGI with bitsandbytes INT8
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes
# vLLM with INT8 (via config)
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --quantization bitsandbytes
```

INT4 is where the real speedup lives. You achieve 4x memory reduction and 2–3x latency improvement, but measurable quality degradation occurs. Always validate with your actual prompt distribution.
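To put rough numbers on the bandwidth argument and on the KV-cache formula above, here is a back-of-the-envelope sketch. The 2 TB/s bandwidth figure and the Llama-2-7B-like shape (hidden_dim=4096, num_layers=32, FP16 cache, no grouped-query attention) are assumptions chosen for illustration; the function names are hypothetical.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> float:
    """Bytes needed to hold the weights at a given precision."""
    return n_params * bits_per_weight / 8


def decode_floor_ms(n_params: int, bits_per_weight: int, hbm_bytes_per_s: float) -> float:
    """Lower bound on per-token decode time if every weight is read once per step."""
    return weight_bytes(n_params, bits_per_weight) / hbm_bytes_per_s * 1000.0


def kv_cache_bytes(seq_len: int, hidden_dim: int, num_layers: int, bytes_per_value: int = 2) -> int:
    """seq_len x hidden_dim x num_layers x 2 (K and V), times bytes per value."""
    return seq_len * hidden_dim * num_layers * 2 * bytes_per_value


N_PARAMS = 7_000_000_000   # the 7B example from the text
HBM_BYTES_PER_S = 2e12     # ASSUMPTION: 2 TB/s as a stand-in for "TB/s-class HBM"

if __name__ == "__main__":
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        gb = weight_bytes(N_PARAMS, bits) / 1e9
        ms = decode_floor_ms(N_PARAMS, bits, HBM_BYTES_PER_S)
        print(f"{name}: {gb:.1f} GB of weights, at least {ms:.2f} ms per decoded token")

    # ASSUMPTION: hidden_dim=4096, num_layers=32, FP16 cache entries.
    gib = kv_cache_bytes(4096, 4096, 32) / 2**30
    print(f"KV cache for a 4K-token prompt: {gib:.1f} GiB")
```

Under these assumptions the single-sequence floor falls from about 7 ms per token at FP16 to about 1.75 ms at INT4, and one 4K-token request holds 2 GiB of KV cache. The floor ignores KV-cache reads, kernel overhead, and batching, so it is a bound for reasoning, not a latency prediction.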
AWQ: Activation-Aware Weight Quantization

AWQ’s key insight: not all weights are equally important. Activation magnitudes reveal which weights matter most. By scaling weights based on activation statistics, AWQ achieves better 4-bit accuracy than naive quantization.

Installation & Usage:

```bash
git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq
pip install -e .
cd awq/kernels && python setup.py install  # Build efficient CUDA kernels
```

Quantize a model:

```bash
# Step 1: AWQ search (calibration)
python -m awq.entry --model_path meta-llama/Llama-2-7b-hf \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq llama-2-7b-w4-g128.pt
# Step 2: Generate quantized weights
python -m awq.entry --model_path meta-llama/Llama-2-7b-hf \
    --w_bit 4 --q_group_size 128 \
    --load_awq llama-2-7b-w4-g128.pt \
    --q_backend real --dump_quant llama-2-7b-w4-g128-awq.pt
```

In vLLM/TGI, use pre-quantized models:

```bash
# vLLM with AWQ (supported in many recent versions)
python -m vllm.entrypoints.api_server --model TheBloke/Llama-2-7B-AWQ --quantization awq
# TGI with AWQ
text-generation-launcher --model-id TheBloke/Llama-2-7B-AWQ
```

AWQ Configuration Details:

- q_group_size=128: Weights are quantized in groups of 128 channels. Smaller groups improve accuracy but increase quantization overhead.
- w_bit=4: 4-bit quantization. AWQ also supports 3-bit for extreme compression.
- version="GEMM": Choose between GEMM (general matrix multiply) or GEMV (vector) kernels. GEMM is faster for batch sizes > 1.

GPTQ: Gradient-Based Post-Training Quantization

GPTQ uses second-order information (Hessian) to minimize quantization error. It’s slightly more computationally expensive to quantize but produces excellent 4-bit models.
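Both AWQ and GPTQ store one scale per group of weights, which is why the group-size setting matters. The toy sketch below shows the effect with plain symmetric round-to-nearest quantization. It is deliberately not AWQ's activation-aware scaling and not GPTQ's Hessian-based update; it only illustrates why smaller groups contain the damage from an outlier. All names and the sample weights are hypothetical.

```python
def quantize_groupwise(weights, group_size, bits=4):
    """Symmetric round-to-nearest quantization with one scale per group.
    Returns the dequantized values so the error can be measured directly."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            q = max(qmin, min(qmax, round(w / scale)))
            out.append(q * scale)
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy weights: mostly small values plus one outlier (4.0) in the last group of four.
WEIGHTS = [0.1, -0.2, 0.15, 0.05, 0.3, -0.1, 0.2, -0.25,
           0.12, -0.08, 0.02, 0.18, 4.0, -0.05, 0.07, 0.11]

if __name__ == "__main__":
    for group_size in (16, 4):
        err = mean_squared_error(WEIGHTS, quantize_groupwise(WEIGHTS, group_size))
        print(f"group_size={group_size}: mse={err:.5f}")
```

With a single group of 16, the outlier sets the scale for every weight and most small values collapse to zero; with groups of 4, only the outlier's own group pays that price. The cost is more scales to store, which is the overhead the AWQ configuration notes mention.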
Installation:

```bash
pip install auto-gptq --no-build-isolation
```

Quantization:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config=BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False,  # False for speed, True for slight quality improvement
    )
)
# Calibrate with ~128-256 samples from your domain
examples = [...]  # List of tokenized samples
model.quantize(examples)
model.save_quantized("llama-2-7b-gptq")
```

GPTQ in Serving:

```bash
# vLLM with GPTQ
python -m vllm.entrypoints.api_server --model TheBloke/Llama-2-7B-GPTQ --quantization gptq
```

Key GPTQ Configs:

- desc_act=False: Disables activation reordering. This is 2-3x faster in inference with minimal quality loss. Set to True only if perplexity degradation is > 2%.
- use_marlin=True: On Ampere and newer GPUs (A100, RTX 30xx/40xx), Marlin kernels are 30-50% faster than default exllamav2.

bitsandbytes NF4/FP4: The No-Precompute Option

bitsandbytes 4-bit (used in QLoRA) quantizes on-the-fly during model loading. No calibration needed, but inference is often slower than AWQ/GPTQ because weights are dequantized on every forward pass.

Use when:

Config:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # or "fp4"
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True  # Compresses quantization constants
)
```

Performance note: NF4 inference is often slower than AWQ/GPTQ for pure inference because of runtime dequantization overhead. Use it for development, not max-throughput serving.

GPU Acceleration: Kernels That Actually Matter

Quantization reduces memory traffic. These kernels make the traffic you do have more efficient.

Flash Attention: The Prefill King

Flash Attention eliminates the need to materialize the full N×N attention matrix.
Instead, it tiles the computation and uses smart memory management to reduce HBM reads/writes by 10–20x in theory, with typical speedups of 30–50% in practice on long sequences.

Installation:

```bash
pip install flash-attn --no-build-isolation
```

In Practice: FlashAttention-2 is integrated into all major inference engines. You just need to install it before building your engine. For custom PyTorch code, use the flash_attn […]
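The tiling idea can be shown without a GPU. The sketch below computes attention for a single query row in two ways: a reference version that holds every score at once, and a tiled version that streams over key/value blocks while keeping only a running max, a running softmax denominator, and a running weighted sum (the "online softmax" trick). This is a pure-Python illustration of the principle, not the FlashAttention CUDA kernel, and the function names are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_row_naive(q, keys, values):
    """Reference: compute all scores for this query, then softmax and mix."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_row_tiled(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [_dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[d] for p, v in zip(probs, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
    values = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.5], [3.0, 1.0, 0.0],
              [0.5, 0.5, 0.5], [2.0, -1.0, 1.0]]
    print(attention_row_naive(q, keys, values))
    print(attention_row_tiled(q, keys, values))
```

The two functions agree up to floating-point rounding, which is the point: the tiled form never needs the full row of scores, so on a GPU the working set can stay in fast on-chip memory instead of round-tripping through HBM.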
Last Updated on June 3, 2026 by Editorial Team Author(s): Rashmi Originally published on Towards AI. Claude Code: The AI Coding Partner Changing How Developers Build Software Claude Code is Anthropic’s AI-powered coding agent that lives directly in your terminal. Unlike chatbot-style assistants that require copy-pasting code back and forth, Claude Code reads your actual codebase, edits real files, runs shell commands, and executes tests, all autonomously from within your development environment. The article explains what Claude Code is and how it differs from other AI coding tools, then walks through installation options (native binary, Homebrew, and deprecated npm), authentication, and first-project setup. It outlines Claude Code’s architecture (a CLI engine coordinating file system access, shell commands, and the Anthropic API), describes what you can use it for (code generation, debugging, documentation, test writing, refactoring, dependency upgrades, and CI/CD/DevOps tasks), and covers advanced capabilities like multi-agent parallel work via Git worktrees. It also provides a command reference (CLI flags, in-session slash commands, automation/print mode), discusses project memory via CLAUDE.md, shows how MCP servers extend Claude Code to external systems, and ends with safety best practices and final guidance on starting simple.
Last Updated on May 29, 2026 by Editorial Team Author(s): Michael Shapiro MD MSc Originally published on Towards AI. The Next Evolution of Data Teams For years, building data products required a chain of specialists: data engineers, data scientists, software engineers, ML engineers, MLOps teams, and product managers. This specialization enabled organizations to tackle increasingly complex problems, but it also introduced handoffs, dependencies, and slower feedback cycles. After the introduction, the article explains why agentic coding is pushing teams toward end-to-end ownership rather than fragmented specialization. It defines the “Full-Stack Data Scientist” as a practitioner who combines data/domain expertise with product thinking and accountability for outcomes, supported by rapid prototyping and modern coding agents. The author argues data scientists are naturally suited to this model because they already operate at the intersection of technology, business, and uncertainty, and because they learn and iterate effectively under ambiguity. They then describe how this approach works in practice (building early product interfaces, focusing on measurable value, and using stakeholder feedback to refine requirements) before concluding that the agentic era favors teams that learn fastest by aligning context, data, validation, and iteration. Finally, it frames this as both a mindset and a management philosophy: empowering smaller, capable teams to own outcomes while AI increases execution leverage, making context and judgment the main differentiators.
Author(s): Satish Kumar Originally published on Towards AI. Building Production-Grade AI Skills with Snowflake Cortex AI Function Studio 1. Enterprise AI Reality Check Here is the uncomfortable truth about enterprise GenAI in 2026: most implementations are unmaintainable — and most teams do not know it yet. Prompts live in Jupyter notebooks. Evaluation means a developer squinting at model outputs and nodding. There is no governance layer. No versioning. No rollback capability. And when prompt drift silently degrades a classification model from 94% to 71% accuracy over six weeks, nobody notices until a compliance audit surfaces the gap. I have watched this pattern repeat across organizations of every size. Teams build genuinely impressive proof-of-concepts, then watch them decay in production because the engineering discipline that governs every other piece of software — CI/CD pipelines, automated testing, observability, release management — never gets applied to the AI logic those systems depend on. The root causes are consistent: Prompt sprawl: templates scattered across notebooks, API scripts, and application code with no single source of truth Manual evaluation: quality assessment depends on human judgment rather than systematic benchmarking with reproducible metrics Zero governance: no RBAC on prompt templates, no audit trail on changes, no approval workflows between environments No versioning: impossible to answer “what exactly changed?” when outputs start degrading No rollback: when a prompt update breaks production, teams scramble to reconstruct the previous state from memory and Slack messages Silent drift: model updates and data distribution shifts cause gradual quality erosion with no alerting mechanism Snowflake Cortex AI Function Studio changes this equation fundamentally. 
It provides a complete lifecycle management system for AI functions — creation, evaluation, optimization, governance, and deployment — running entirely within Snowflake’s governed platform. Combined with Cortex Code Skills for reusable AI engineering workflows, enterprises now have an opinionated, production-grade path from task definition to deployed, monitored function. This is not another AI playground. This is enterprise AI engineering infrastructure. The gap between “it runs in a notebook” and “it works reliably in production for twelve months” is precisely where most AI implementations break down. That gap is what this article addresses. 2. What Is Cortex AI Function Studio? Cortex AI Function Studio is a managed development environment for building production-ready Cortex AI Functions. It surfaces two interfaces: the Cortex Code CLI for engineers who need scriptable, agentic workflows, and the Snowsight AI Studio for analysts who need guided, no-code experiences. Both paths produce the same governed, versioned, testable output. 
Lifecycle Architecture The intended workflow is create → evaluate → optimize, and each stage is deeply instrumented: Creation Layer: Natural language task definition — describe the objective, the system constructs the function Automatic model selection based on task requirements: multimodal support, multilingual needs, reasoning depth, latency tolerance Structured output enforcement via JSON schema, classification label sets, and confidence scores Smoke test generation and execution before the function is registered Support for text, documents, images, audio, and video inputs Evaluation Layer: Three evaluation paths: labeled datasets (ground truth comparison), label generation (a reasoning model generates baselines where labels do not exist), and synthetic dataset generation (bootstrapped from the task definition itself) Configurable metrics: exact match, fuzzy match, contains match, LLM-as-a-judge, and custom metric definitions Per-record scoring with human-in-the-loop review for low-confidence outputs Regression detection across versions — the system flags when a new version performs worse than its predecessor Optimization Layer: Genetic-Pareto Algorithm for systematic prompt exploration across the search space Budget tiers: demo (2 iterations), light (6), medium (12), heavy (18) Multi-model benchmarking — evaluate 6+ models simultaneously against the same dataset Automated prompt restructuring, instruction reordering, and workflow modifications Before/after quality comparison with statistical significance testing Deployment Layer: One-click deployment of optimized configurations to production Functions registered as standard Cortex AI Functions with full RBAC and governance applied immediately Re-optimization possible as new models become available without rebuilding from scratch The critical architectural detail worth internalizing: Custom AI Functions incur no surcharge beyond underlying model inference costs in production. The abstraction layer is free. 
You pay only for the tokens consumed during inference.

3. Production Demo Scenario

Automated Incident Root Cause Analyzer

Consider a realistic enterprise scenario: an organization receives thousands of support tickets daily containing SQL failures, access-control exceptions, performance degradation reports, and infrastructure incidents. Currently, L1 engineers manually triage, classify, and route these tickets, a process that takes 15 to 45 minutes per incident and produces inconsistent, analyst-dependent categorization.

Input signals:

- Raw support ticket text (free-form descriptions from end users)
- Error log excerpts (stack traces, error codes, warehouse context)
- SQL failure messages (compilation errors, runtime exceptions)
- Access-control denial messages (privilege errors, role hierarchy mismatches)

Required output (structured JSON):

```json
{
  "root_cause_summary": "User lacks SELECT privilege on target table due to role hierarchy gap",
  "severity": "P3",
  "category": "ACCESS_CONTROL",
  "remediation": "Grant SELECT on DB.SCHEMA.TABLE to role ANALYST_ROLE via SECURITYADMIN",
  "escalation_team": "IAM_TEAM",
  "confidence_score": 0.92
}
```

This scenario is realistic precisely because it is hard. It requires multi-signal reasoning (correlating free-text descriptions with structured error codes), domain-specific classification (Snowflake’s error taxonomy), actionable output generation (specific remediation steps, not generic advice), and confidence calibration (knowing when to escalate to a human rather than acting autonomously). These are the characteristics that expose prompt fragility fastest.

4. Environment Setup

```sql
-- Production database structure
CREATE DATABASE IF NOT EXISTS PROD_AI;

CREATE SCHEMA IF NOT EXISTS PROD_AI.AI_FUNCTIONS;
CREATE SCHEMA IF NOT EXISTS PROD_AI.AI_EVAL;
CREATE SCHEMA IF NOT EXISTS PROD_AI.AI_SKILLS;
CREATE SCHEMA IF NOT EXISTS PROD_AI.AI_OBSERVABILITY;
CREATE SCHEMA IF NOT EXISTS PROD_AI.AI_GOVERNANCE;

-- Dedicated warehouse for AI workloads (separate cost tracking)
CREATE WAREHOUSE IF NOT EXISTS AI_INFERENCE_WH
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    COMMENT = 'Dedicated compute for AI function inference and evaluation';

-- Evaluation data stage
CREATE STAGE IF NOT EXISTS PROD_AI.AI_EVAL.EVAL_DATASETS
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Incident data table (production input)
CREATE OR REPLACE TABLE PROD_AI.AI_FUNCTIONS.SUPPORT_INCIDENTS (
    incident_id VARCHAR(36) DEFAULT UUID_STRING(),
    created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    ticket_text TEXT,
    error_log TEXT,
    sql_statement TEXT,
    reporter_role VARCHAR(100),
    affected_objects ARRAY,
    raw_error_code VARCHAR(50)
);

-- Roles for AI function governance
USE ROLE SECURITYADMIN;
CREATE ROLE IF NOT EXISTS AI_ENGINEER;
CREATE ROLE IF NOT EXISTS AI_EVALUATOR;
CREATE ROLE IF NOT […]
```
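On the consuming side, the structured output described in the demo scenario can be checked before anything acts on it. A minimal sketch follows, assuming the six fields shown in the example JSON. The 0.8 review threshold and the helper names are assumptions for illustration, not part of Cortex AI Function Studio.

```python
import json

# Field names and types taken from the example output in the scenario.
REQUIRED_FIELDS = {
    "root_cause_summary": str,
    "severity": str,
    "category": str,
    "remediation": str,
    "escalation_team": str,
    "confidence_score": (int, float),
}
REVIEW_THRESHOLD = 0.8  # ASSUMPTION: below this, a human reviews the result


def parse_analysis(raw: str) -> dict:
    """Parse the model's JSON and reject anything that breaks the contract."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
    if not 0.0 <= data["confidence_score"] <= 1.0:
        raise ValueError("confidence_score must be in [0, 1]")
    return data


def needs_human_review(analysis: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Escalate low-confidence results instead of acting on them."""
    return analysis["confidence_score"] < threshold


SAMPLE = json.dumps({
    "root_cause_summary": "User lacks SELECT privilege on target table due to role hierarchy gap",
    "severity": "P3",
    "category": "ACCESS_CONTROL",
    "remediation": "Grant SELECT on DB.SCHEMA.TABLE to role ANALYST_ROLE via SECURITYADMIN",
    "escalation_team": "IAM_TEAM",
    "confidence_score": 0.92,
})

if __name__ == "__main__":
    result = parse_analysis(SAMPLE)
    print(result["category"], needs_human_review(result))
```

The check is intentionally limited to presence, type, and range. The allowed values for severity and category are not enumerated in the text, so this sketch does not guess them.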
Author(s): Amit | AI & Side Hustle Originally published on Towards AI. A practical developer-first comparison of LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, DSPy, and more after real experimentation. Six months ago, I decided to evaluate AI agent frameworks seriously. Not because I needed to (I had a working system), but because the space was moving so fast that I felt like I was missing something. The tools available now are genuinely different from what existed a year prior, and the conversations I was seeing online felt reductive. People would declare one framework “the winner,” then pivot three weeks later. I wanted to understand what was actually happening beneath the hype. The author reports results from experimenting with ten AI agent frameworks (LangGraph, CrewAI, AutoGen, Semantic Kernel, OpenAI’s Agents SDK, PydanticAI, Haystack Agents, LlamaIndex Workflows, Atomic Agents, and DSPy) and argues that the ecosystem is fragmented rather than converging: each framework makes different bets about control vs. abstraction, orchestration style, tool-calling behavior, and state/memory management. Key pain points include messy tool calling (stop/retry/error semantics), underestimated work around state and memory (often requiring wrapper logic), and orchestration complexity that scales poorly beyond a few agents. They highlight which frameworks do well for structured outputs (PydanticAI, DSPy) and where debugging/observability and documentation quality vary dramatically, with many frameworks not built with cross-framework lifecycle visibility in mind. The piece also stresses practical selection criteria (framework fit to the specific problem, integration and dependency footprint, local model and provider support, and production readiness/stability), not just feature checklists.
The conclusion recommends choosing based on immediate constraints (and starting with simpler function-calling loops when orchestration isn’t truly needed) and expecting the market to keep evolving.
Author(s): FutureLens Originally published on Towards AI. How One Spring Boot Optimization Saved Our Startup $30,000 a Year Our AWS bill was $7,400 last month. The month before, $6,900. The month before that, $6,200. The line was going in one direction. Our runway wasn’t. After observing steadily rising AWS costs without user growth, the author traces the issue to Spring Boot defaults and production inefficiencies: tuning HikariCP connection pooling to reduce CPU, then uncovering a costly session-validation query repeatedly executed on every authenticated request and eliminating it with Redis caching. They further cut serialization overhead by caching the products endpoint response, add missing PostgreSQL composite indexes to prevent expensive sequential scans, and finally rightsize ECS Fargate tasks based on real utilization; overall CPU drops sharply, database load decreases, and monthly cloud spend falls from roughly $7,400 to about $3,130, saving an estimated $30,000–$35,000 annually.
Author(s): Akash Dogra Originally published on Towards AI.

Inside Palantir AIP: How the World’s Most Controversial AI Platform Actually Works

Ontology-Augmented Generation, Apollo’s air-gapped deployments, and the k-LLM routing architecture: a deep technical teardown.

Photo: a woman walks past the logo of US big data analytics software company Palantir Technologies during the World Economic Forum (WEF) annual meeting in Davos, January 19, 2023. (Photo by Fabrice COFFRINI / AFP via Getty Images)

The Problem No One Else Solved

The fundamental challenge of modern enterprise architecture is rarely a lack of raw data; it is a catastrophic crisis of meaning. Consider a forward-deployed military operation with seventeen distinct sensor feeds: SIGINT streams, HUMINT reports, GEOINT imagery, drone telemetry. Each operates on a different temporal frequency, adheres to a different schema, possesses a different confidence interval, and is governed by distinct security classifications. No single database, and certainly no human analyst, can synthesize this in real time.

The corporate sector faces a mirror image. A multinational enterprise fragments its operational reality across SAP instances, Salesforce CRM, Oracle databases, and IoT data lakes. The standard industry response for two decades has been building massive ETL pipelines to dump information into centralized warehouses, under the assumption that colocation spontaneously generates intelligence.

Palantir’s foundational insight is that data integration is not a storage problem; it is a meaning problem. Moving a table from Oracle to Snowflake does nothing to resolve the semantic disconnect between how a logistics system defines a “Purchase Order” and how a compliance system defines a “Transaction.” Without a unified semantic layer, data remains inert, siloed, and useless for automated decision-making.
When the industry pivoted to LLMs, the prevailing approach was pointing these probabilistic engines at vector databases via standard RAG. The results: semantically plausible but structurally ungrounded, with unacceptably high hallucination rates and zero deterministic trust.

Palantir’s approach diverges entirely. Instead of retrofitting an LLM onto a flat warehouse, AIP embeds the LLM directly into a bidirectional knowledge graph. The resulting architecture dictates that AI interacts with the enterprise through a strictly governed semantic layer that natively understands relationships, logic, and operational constraints. By prioritizing the world model over the language model, Palantir transforms the LLM from a volatile query interface into a governed operational agent.

The Ontology: Palantir’s Most Important and Least Understood Concept

The Palantir Ontology is not a semantic data model or a metadata catalog. It is a governed, typed, live, bidirectional knowledge graph that acts as the authoritative digital twin of the enterprise, unifying both semantic elements (the “nouns”: objects, properties) and kinetic elements (the “verbs”: actions, logic functions, security policies).

Raw data from any source (relational databases, Kafka streams, SAP instances, Excel files) is ingested, transformed, and mapped into Ontology objects. These objects correspond to real-world entities: Aircraft, Manufacturing Facilities, Suppliers, Soldiers.

Figure: The Palantir Ontology: typed objects with directed edges, fed by multiple data sources through a transformation layer. Backend services (OMS, OSS, Funnel) manage schema, reads, and writes.

Formally, an Ontology object is defined as a triple $(t_i, P_i, L_i)$, where $t_i \in T$ is the object’s type from the governed type set $T$ (e.g., Aircraft, Supplier), $P_i$ is the set of typed key-value property pairs, and $L_i$ is the set of directed, typed edges connecting to other objects.
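The three parts of that definition (a governed type, typed properties, and directed typed links) map naturally onto a small data structure. The sketch below is purely illustrative: the class names, the `supplies` link type, and the two-element type set are hypothetical and are not Palantir's API or schema.

```python
from dataclasses import dataclass, field

# Illustrative subset of the governed type set T; the real set is defined by the schema.
GOVERNED_TYPES = {"Aircraft", "Supplier"}


@dataclass(frozen=True)
class Link:
    """A directed, typed edge to another object (one element of L_i)."""
    link_type: str
    target_id: str


@dataclass
class OntologyObject:
    object_id: str
    object_type: str                                # t_i, must come from the governed set
    properties: dict = field(default_factory=dict)  # P_i, key-value property pairs
    links: list = field(default_factory=list)       # L_i, outgoing typed edges

    def __post_init__(self):
        if self.object_type not in GOVERNED_TYPES:
            raise ValueError(f"unknown object type: {self.object_type}")

    def neighbors(self, link_type: str) -> list:
        """IDs of objects reachable over edges of the given type."""
        return [link.target_id for link in self.links if link.link_type == link_type]


if __name__ == "__main__":
    supplier = OntologyObject(
        object_id="supplier-1",
        object_type="Supplier",
        properties={"name": "Example Parts Co."},
        links=[Link("supplies", "aircraft-7")],  # hypothetical link type
    )
    print(supplier.neighbors("supplies"))
```

Rejecting unknown types at construction time mirrors the "governed" part of the description: an object cannot exist outside the schema's type set.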
Three decoupled backend services power this:

| Service | Function |
| --- | --- |
| OMS (Ontology Metadata Service) | Source of truth for schema: defines all object types, link types, and action types. Enforces global schema integrity and versioning. |
| OSS (Object Set Service) | High-throughput read layer: serves all queries with extremely low latency. LLMs and applications interface through OSS. |
| Funnel (Object Data Funnel) | Orchestrates all write operations: validates actions against governance policies, MAC/DAC security, and schema constraints before mutating state. |

Consider a supply chain deployment: a Supplier object links to hundreds of Purchase_Order objects, which link to Manufacturing_Facility and Inventory_SKU objects. If a fire breaks out at a distribution center, the IoT data updating the Facility object immediately propagates through the graph. The system identifies delayed purchase orders, evaluates downstream inventory impact, and surfaces it to logistics managers, who can execute an Initiate_Reorder action directly through the governed Ontology, writing the decision back to the underlying ERP system.

Traditional data platforms model the nouns but ignore the verbs. The Ontology models decisions.

Foundry vs Gotham: Same Core, Different Missions

A common misconception: Palantir maintains entirely separate engineering stacks for commercial and government clients. At a deep technical level, Foundry and Gotham are built on the identical Ontology foundation. The difference lies in mission profiles, ingestion parsers, and security paradigms.

Figure: Foundry (commercial) and Gotham (defense) share the same Core Ontology: same objects, same links, same OMS/OSS/Funnel stack.

Foundry serves commercial enterprises: Airbus, the UK NHS, Merck, major financial institutions. Its tooling (Code Repositories, Pipeline Builder, Workshop) is optimized for supply chain optimization, manufacturing analytics, and pharmaceutical research. Gotham serves the intelligence community and military.
Ingestion pipelines handle classified formats, unstructured intelligence cables, and kinetic targeting telemetry. The application layer replaces supply chain dashboards with entity resolution, link analysis, pattern-of-life intelligence, and GEOINT fusion across air-gapped networks.

Gotham’s security model implements Mandatory Access Control (MAC), Discretionary Access Control (DAC), and dynamic attribute-based clearance. A logistics officer might see a unit’s supply level but lack clearance to traverse the link to its classified geolocation. The fundamental engineering principle remains identical: heterogeneous data mapped into a governed, typed knowledge graph.

Apollo: The Deployment Engine That Makes Everything Possible

While the Ontology gets attention, Apollo, the autonomous deployment engine, is arguably Palantir’s most architecturally significant component. It controls thousands of microservices, ML models, and schemas across wildly diverse infrastructure environments.

Figure: Apollo’s declarative pull model deploys to multi-cloud, on-premises, air-gapped, and edge environments through autonomous agents.

Traditional CI/CD pipelines (Jenkins, GitLab CI) operate on a linear “push” model that breaks at scale. Apollo reverses this with a declarative pull model:

- Developers define software artifacts with explicit dependencies
- Artifacts are promoted through Release Channels: RELEASE → CANARY → STABLE
- Autonomous Apollo agents in each environment continuously monitor their subscriptions
- When constraints are satisfied (maintenance […]
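The difference between push and pull is easiest to see in code. In the sketch below, nothing is pushed to an environment: each agent reads what has been published to the channel it subscribes to and decides locally whether to upgrade. Everything here is a hypothetical illustration, not Apollo's interface. In particular, the constraint check is left as a caller-supplied function because the source text is cut off before it lists the actual constraints.

```python
CHANNELS = ("RELEASE", "CANARY", "STABLE")  # promotion order described in the text


class Agent:
    """One environment's agent: it pulls desired state instead of being pushed to."""

    def __init__(self, channel, installed=None):
        if channel not in CHANNELS:
            raise ValueError(f"unknown channel: {channel}")
        self.channel = channel
        self.installed = installed

    def reconcile(self, published, constraints_ok):
        """published maps channel name -> version currently promoted there.
        Upgrade only if the desired version differs and local constraints pass.
        Returns True when an upgrade happened."""
        desired = published.get(self.channel)
        if desired is None or desired == self.installed:
            return False  # nothing published, or already up to date
        if not constraints_ok(desired):
            return False  # try again on the next polling cycle
        self.installed = desired
        return True


if __name__ == "__main__":
    published = {"RELEASE": "1.2.0", "CANARY": "1.1.0", "STABLE": "1.1.0"}
    edge_agent = Agent("STABLE", installed="1.0.0")
    print(edge_agent.reconcile(published, constraints_ok=lambda v: False))  # held back
    print(edge_agent.reconcile(published, constraints_ok=lambda v: True))   # upgrades
    print(edge_agent.installed)
```

Because the decision is made inside the environment, the same loop works where no inbound connection is possible, which is the property the air-gapped and edge cases depend on.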
Last Updated on May 29, 2026 by Editorial Team Author(s): Error Originally published on Towards AI. What Is a Reverse Proxy? (And Why Every Backend Developer Should Care) Imagine you’ve built a web app. You have a backend API, maybe a separate service for authentication, another for image processing, and a frontend. They all run on different ports or even different servers. A user types your domain into their browser. How does the request know where to go? After the lead, the article explains what a reverse proxy enables (load balancing, SSL/TLS termination, caching, compression, request routing across multiple services, security hardening, and hiding infrastructure), then clarifies what “proxy” means in general and contrasts forward vs reverse proxies. It walks through a step-by-step request flow from browser to backend (DNS to proxy, routing, forwarding with headers like X-Forwarded-For, backend processing, and proxy response handling) and highlights common real-world tools such as Nginx, HAProxy, Traefik, and Envoy. The piece concludes that if you’re running more than one service or need real security/performance, a reverse proxy is essential, and starting with Nginx can simplify early setup while letting the proxy scale as your system grows.
Last Updated on May 29, 2026 by Editorial Team Author(s): Rajesh Vishnani Originally published on Towards AI. What Claude Opus 4.8 Actually Changes If You’re Building Agents I’ve been building AI agents for long enough now to have developed a healthy reflex: whenever a new frontier model drops, my first question isn’t “is it smarter?” It’s “does it change the shape of the code I have to write?” Most releases don’t. They nudge a benchmark, shave a few cents off a token, and the agent loop I wrote last quarter still looks the same the next morning. Claude Opus 4.8 is one of the rare ones that does change the shape. Anthropic shipped it on May 28, 2026, and on the surface it reads like a polish release — better coding, better tool use, better alignment numbers. But buried inside the announcement are three changes that, taken together, quietly retire a bunch of scaffolding agent developers have been writing for the last year. I want to walk through what those are, why they matter, and where I think they push the next generation of agent architectures. What Anthropic actually shipped The short version, before we go deeper: Model: claude-opus-4-8, available across the API, Claude.ai, and Claude Code. Pricing: unchanged at $5 / $25 per million input/output tokens. Fast mode dropped to $10 / $50 — three times cheaper than the previous generation’s fast tier. Coding: improvements on Terminal-Bench 2.1 and large-scale codebase migrations. Agentic work: 84% on Online-Mind2Web (browser/computer-use benchmark), cleaner multi-step tool calling, and what Anthropic describes as “better judgment.” [1] Honesty: roughly 4× less likely to let a code flaw pass unremarked compared to 4.7. Alignment: new highs on prosocial trait measures, with substantially lower misaligned-behavior rates than 4.7. Three new platform features ship alongside it: Dynamic workflows in Claude Code — orchestrating hundreds of parallel subagents on a single task. 
Effort control — explicit high/extra/max levels you set per request. Mid-message system entries in the Messages API — system instructions you can inject inside the message array, not just at the top. If you only read this far, the rest of the post is mostly about why those three features and the honesty bump are the parts that change how I’d design a new agent today. The centerpiece: dynamic workflows and parallel subagents This is the one I want to spend the most time on, because it’s the change with the biggest implications. Until now, the dominant pattern for “agent that does a big thing” has been some variant of: a planner LLM breaks a task into steps, then a worker loop executes them mostly in sequence, with maybe a couple of parallel branches if you were being adventurous. Anyone who has tried to make this work at scale knows the failure mode. The planner gets it wrong, errors compound, and you spend more time orchestrating than the model spends thinking. Dynamic workflows flip this. Instead of you, the developer, writing the orchestration logic and stitching subagents together, Claude Code itself spawns and coordinates parallel subagents at runtime. Anthropic describes it as “hundreds of parallel subagents on a single task.” [1] That’s not marketing fluff if you take it seriously — it means the model is acting less like a single executor and more like a small organization deciding how to split work. Practically, the things this unlocks for me: Codebase-wide refactors. Touching 80 files used to mean writing a careful plan, dispatching to a worker, and hoping it didn’t go off the rails halfway through. Now I can hand the task to one entry point and let it fan out. Multi-document analysis. “Read these 200 contracts and pull out anything that conflicts with our standard MSA” is the kind of job that was technically possible but operationally painful. Parallel subagents collapse the wall-clock time. Exploration-heavy debugging. 
Instead of a single linear bisect, you can dispatch many small investigation threads at once and consolidate.

The interesting part isn’t the speed. The interesting part is that the orchestration is the model’s problem now, not yours. A lot of the LangGraph/CrewAI/custom-router code I’ve written in the past year was, in retrospect, scaffolding around the limitation that a single model call couldn’t be trusted to coordinate. That limitation is shrinking.

A small example: effort control + a clean tool call

Here’s the kind of code you’d write for an agent task today, pairing effort control (one of the new features) with a plain tool definition, against Opus 4.8:

    import anthropic

    client = anthropic.Anthropic()

    tools = [
        {
            "name": "search_contracts",
            "description": "Search internal contract repository by keyword or clause.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer", "default": 10},
                },
                "required": ["query"],
            },
        }
    ]

    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=4096,
        effort="max",  # high | extra | max: pick your quality/speed point
        tools=tools,
        system="You are a contracts analyst. Flag any MSA that conflicts with our "
               "standard terms. Be specific; cite the clause.",
        messages=[
            {
                "role": "user",
                "content": "Review every contract signed in Q1 and surface anything "
                           "non-standard. Group findings by counterparty.",
            }
        ],
    )

    print(response.content)

Two things to notice. First, effort="max" is a single knob; getting the same effect previously required prompt engineering ("think carefully, take your time, double-check") that never reliably worked. Now it's an explicit lever. Second, the prompt is unusually terse. I'm not babying the model with step-by-step instructions, because in practice Opus 4.8 doesn't need them for tasks of this shape. The tool-calling efficiency gains mean fewer wasted round-trips when it decides to actually call search_contracts.
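To make that round-trip concrete: when the model decides to call search_contracts, the response carries a tool_use content block, and your code answers with a tool_result block in the next user message. The sketch below covers only that local dispatch step, working on plain dicts shaped like those blocks so it runs without an API key. The search_contracts body, the sample contracts, and the dispatch_tool_use helper are hypothetical stand-ins, not code from the article or the SDK.

```python
import json

# Hypothetical in-memory stand-in for the contract repository.
_CONTRACTS = [
    {"id": "C-001", "counterparty": "Acme", "text": "Liability cap: 2x annual fees."},
    {"id": "C-002", "counterparty": "Globex", "text": "Net-90 payment terms apply."},
]


def search_contracts(query: str, limit: int = 10) -> list:
    """Toy keyword search; a real implementation would query your own store."""
    q = query.lower()
    hits = [c for c in _CONTRACTS if q in c["text"].lower()]
    return hits[:limit]


# Map tool names (as declared in `tools`) to local callables.
TOOL_REGISTRY = {"search_contracts": search_contracts}


def dispatch_tool_use(block: dict) -> dict:
    """Turn one tool_use-shaped block into a tool_result-shaped block."""
    handler = TOOL_REGISTRY.get(block["name"])
    if handler is None:
        # Report the failure explicitly instead of silently returning nothing.
        return {
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": f"unknown tool: {block['name']}",
            "is_error": True,
        }
    result = handler(**block["input"])
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": json.dumps(result),
    }


# Example block, shaped like what the model would emit for a tool call.
example_block = {
    "type": "tool_use",
    "id": "toolu_example",
    "name": "search_contracts",
    "input": {"query": "payment terms"},
}
tool_result = dispatch_tool_use(example_block)
```

In a real loop you would append the assistant turn and a user turn containing this tool_result block to messages, then call the API again. Every extra round-trip repeats that cycle, which is why fewer wasted calls matter.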
I’d estimate roughly 30–40% of the system-prompt boilerplate I was carrying in production agents on 4.7 is now dead weight on 4.8. I haven’t deleted it yet — I want a couple more weeks of behavior data first — but it’s coming. The honesty upgrade is more important than it sounds The headline number — “4× less likely to allow code flaws to pass unremarked” — is easy to skim past as a benchmark factoid. It’s not. It’s a load-bearing change for […]
Last Updated on May 29, 2026 by Editorial Team Author(s): Mandar Karhade, MD. PhD. Originally published on Towards AI. When long-horizon tasks, exploratory freedom, some time, and the ability to self-improve are given to a good model, great things happen. Qwen 3.7 Max is a big-boy model. It is supposed to be fairly capable of running long-horizon tasks. In a recent experiment by Alibaba Group, this model was given the task of improving the performance of a software system on unknown hardware. The model was given an unlimited time budget to improve the kernel. It made more than a thousand tool calls, ran hundreds of evaluations, and found incredible opportunities to speed up the kernel one step at a time. The article explains how long-horizon, recursive improvement works: instead of answering once, an AI is given a goal and a looping process (hypothesize, test, inspect, modify, repeat) so it can iteratively refine performance even in unfamiliar settings. It highlights that such success depends on tolerance for failure, information and behavioral persistence, and rapid iteration, capabilities that matter especially when documentation is sparse and hardware is unknown. The author contrasts this with human limitations shaped by specialization and narrow “tunnel vision,” arguing that AI can explore neglected gaps and even enable discoveries across domains like drugs, materials, protein engineering, and chip design. It also notes that this approach isn’t a silver bullet: uncontrolled long runs can waste energy, optimize wrong metrics, accumulate hidden errors, and be hard to audit, so safeguards like sandboxing, cost limits, permissions, checkpoints, logs, and intermittent human review are essential. Overall, the piece argues that Qwen 3.7 Max demonstrates the value of letting well-instrumented long-running experiments proceed. Read the full blog for free on Medium. Join thousands of data leaders on the AI newsletter.
Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor. Published via Towards AI
Last Updated on May 29, 2026 by Editorial Team Author(s): Akash Dogra Originally published on Towards AI. The modern intelligence stack: satellite imagery, tactical knowledge graphs, drone surveillance, and command center fusion. When LLMs Meet Knowledge Graphs on the Battlefield How GNNs, LLMs, and ontology-grounded retrieval fuse into the architecture powering modern military intelligence — with the math, the code, and the failure modes. The 90-Second Decision That Broke Standard AI A highly dispersed sensor network detects anomalous activity across a 200-kilometer front. Synthetic aperture radar identifies irregular vehicle tracks near a tree line. Signals intelligence intercepts encrypted, short-burst radio transmissions. A week-old HUMINT report places a high-value commander within 50 kilometers. Open-source intelligence shows sudden civilian supply chain delays. Drone feeds capture thermal signatures obscured by deliberate multispectral camouflage. A human commander has roughly ninety seconds to synthesize this. The decision must comply with International Humanitarian Law — decisively differentiating combatants from protected civilians while mitigating an imminent threat to friendly forces. Standard supervised classifiers fail catastrophically here. The core IID assumption — that training and test data share identical distributions — is actively violated by an adversary whose explicit goal is to operate outside historical patterns. Vanilla LLMs fare no better: they hallucinate confident claims when navigating sparse data and cannot track entities across evolving operational contexts without catastrophic forgetting. Traditional databases are too rigid to handle contradictory, probabilistic evidence in real time. The defense ecosystem recognized this structural gap. 
DARPA’s Knowledge-directed AI Reasoning Over Schemas (KAIROS) program (2019–2023) initiated the development of systems that identify, link, and temporally sequence complex events from noisy multimedia inputs [1]. Its successor programs now fuel production-grade GNN-LLM hybrids [2]. As of late 2025, the DoD’s GenAI.mil initiative mandates AI-first decision workflows across operational commands [3]. No single algorithm solves this. The real architectural insight is a fusion stack: LLMs provide language-grounded reasoning. GNNs provide relational inference and uncertainty propagation. Knowledge graphs provide the governed semantic substrate that tethers both to operational reality. Knowledge Graphs: The Semantic Backbone Information in warfare is inherently relational. An intercepted signal is meaningless in isolation — its tactical value only emerges when linked to its transmission origin, the adversary’s command hierarchy, and geographic proximity to logistics routes. A knowledge graph models the battlefield through three axes: entities (people, units, locations, weapons), relationships (command hierarchies, supply chains, communications), and temporal dynamics (timestamps, confidence scores, version histories). Formally, the environment is a heterogeneous knowledge graph: G = (V, E, φ, ψ) where V is the node set, E is the edge set, φ: V → T_V maps each node to an entity type, and ψ: E → T_E maps each edge to a relationship type. A heterogeneous military knowledge graph with five typed entities and strictly defined relationships. If the Comms Relay is destroyed, the cascading isolation of Unit Alpha is computable instantly. Even in this reduced five-node configuration, the graph encodes deep tactical structure. If a strike destroys the Communications Relay, the graph explicitly maps cascading consequences: the Commander loses transmission to the Unit contesting the Disputed Territory. 
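The cascade described above can be sketched with a typed edge list and a reachability check. This is a minimal illustration, assuming five hypothetical entities and relation names chosen to resemble the figure's description. It is not the article's implementation, and a production system would run this over a graph store rather than a Python list.

```python
from collections import deque

# phi: node -> entity type (hypothetical five-node example).
NODE_TYPES = {
    "Commander": "person",
    "Comms Relay": "equipment",
    "Unit Alpha": "unit",
    "Disputed Territory": "location",
    "Supply Depot": "logistics",
}

# E together with psi: (source, relation type, target).
EDGES = [
    ("Commander", "transmits_via", "Comms Relay"),
    ("Comms Relay", "relays_to", "Unit Alpha"),
    ("Unit Alpha", "contests", "Disputed Territory"),
    ("Supply Depot", "supplies", "Unit Alpha"),
]


def reachable(start, edges, removed=frozenset()):
    """Nodes reachable from `start` along directed edges, skipping removed nodes."""
    adjacency = {}
    for src, _relation, dst in edges:
        if src in removed or dst in removed:
            continue
        adjacency.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


before = reachable("Commander", EDGES)
after = reachable("Commander", EDGES, removed={"Comms Relay"})

# Everything the Commander could reach before the strike but not after it,
# excluding the destroyed relay itself.
cut_off = before - after - {"Comms Relay"}
```

Here cut_off contains Unit Alpha and the Disputed Territory: removing one node severs the only command path, which is the cascading isolation the figure describes.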
No flat database computes this cascading impact with the latency required for combat.

Intelligence graphs must be dynamic. Edges appear, disappear, and change confidence as new intelligence arrives, transforming G into a temporal sequence G = {G₁, G₂, …, G_T}, where G_t represents the battlespace state at time t. Platforms like Palantir Gotham and AIP operationalize exactly this kind of entity-relationship modeling at enterprise scale [4].

Graph Neural Networks: Inferring the Unknown

Standard graph query languages (Cypher, SPARQL) are exact-match retrieval tools. They return what is explicitly encoded. They cannot infer what is only implied by structure, nor propagate uncertainty through multi-hop reasoning chains. If an adversary uses sophisticated OPSEC to obscure a command link, a standard query returns null.

GNNs transform discrete graph topology into continuous vector spaces, enabling probabilistic reasoning over the network itself:

Link Prediction: Inferring missing edges (“Who is this unknown actor connected to?”).
Node Classification: Inferring entity type from relational context alone.
Entity Resolution: Determining whether a burner phone MAC address and a physically observed combatant represent the same real-world entity.

A GNN classifies an unknown entity as a mobile air defense system, not by observing it directly, but by analyzing the relational gravity of its neighbors.

Consider: a new entity appears with only three known edges. It is proximal to a Logistics Node, detected by an RF Sensor, and observed on a Supply Route. A standard database sees three isolated facts. A GNN propagates neighborhood information, aggregates feature vectors from all three neighbors, and compares the resulting embedding against historical deployments. The output: 87% probability of a mobile air defense system protecting the logistics chain. The model inferred a classified asset by analyzing the negative space of its neighbors.
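As a rough illustration of that aggregate-then-compare step, the toy below averages the feature vectors of the three neighbors (one round of mean aggregation, the simplest form of message passing) and scores the result against class prototypes with cosine similarity. All vectors, prototype names, and numbers are made up for illustration. A trained GNN would learn the transformation and the classifier rather than use raw averages, and nothing here reproduces the 87% figure.

```python
import math

# Hypothetical 3-dimensional feature vectors for the three known neighbors.
neighbor_features = {
    "logistics_node": [0.9, 0.1, 0.3],
    "rf_sensor": [0.7, 0.2, 0.8],
    "supply_route": [0.8, 0.0, 0.4],
}

# Hypothetical prototype embeddings standing in for "historical deployments".
prototypes = {
    "mobile_air_defense": [0.7, 0.2, 0.6],
    "civilian_convoy": [0.1, 0.9, 0.2],
}


def mean_aggregate(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# The unknown entity's embedding is built purely from its neighbors.
embedding = mean_aggregate(list(neighbor_features.values()))
scores = {label: cosine(embedding, proto) for label, proto in prototypes.items()}
best_label = max(scores, key=scores.get)
```

The unknown node contributes no features of its own here. Its label comes entirely from what surrounds it, which is the point the paragraph above makes.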
In heterogeneous intelligence environments, standard GNNs are insufficient: a SIGINT intercept carries profoundly different epistemic weight than a social media post. Heterogeneous Graph Attention Networks (HAN) address this through hierarchical attention. The semantic-level attention computes the importance of each meta-path. The algorithm learns to prioritize reliable transmission chains (dedicated military comms) over noisy ones (civilian cell networks). This directly mimics how an expert analyst weights source reliability.

    import torch
    import torch_geometric.transforms as T
    from torch_geometric.data import HeteroData
    from torch_geometric.nn import HANConv

    # Define a heterogeneous military intelligence graph
    data = HeteroData()

    # Node features: entity embeddings from sensor fusion
    data['commander'].x = torch.randn(5, 64)   # 5 known commanders
    data['unit'].x = torch.randn(20, 64)       # 20 tracked units
    data['location'].x = torch.randn(50, 64)   # 50 geolocations
    data['unknown'].x = torch.randn(3, 64)     # 3 unclassified entities

    # Typed edges encoding relationships (row 0 = source index, row 1 = target index)
    data['commander', 'commands', 'unit'].edge_index = torch.randint(0, 5, (2, 15))
    data['unit', 'operates_at', 'location'].edge_index = torch.randint(0, 20, (2, 40))
    data['unknown', 'proximal_to', 'location'].edge_index = torch.tensor([[0, 1, 2], [5, 12, 30]])

    # HANConv returns None for node types that receive no messages. 'unknown'
    # is only ever an edge source above, so add reverse edge types first.
    data = T.ToUndirected()(data)

    # HAN with meta-path-based attention
    han = HANConv(in_channels=64, out_channels=32, metadata=data.metadata(), heads=4)

    # Forward pass: message passing across typed edges
    out = han(data.x_dict, data.edge_index_dict)

    # out['unknown'] now holds a [3, 32] relational embedding for classification

LLMs as the Reasoning Layer, and Why They Can’t Work Alone

GNNs output high-dimensional vectors and probabilistic logits. A battlefield commander cannot make a targeting decision based on a cosine similarity metric. LLMs bridge this gap as the […]
Last Updated on May 29, 2026 by Editorial Team Author(s): Mehul Ligade Originally published on Towards AI. Fine-Tuning is Dead: Why Context Orchestration Won in 2026 | M009 📍 Abstract Every few months, something in AI gets declared dead. Prompt engineering. RAG. Transformers. And now, fine-tuning. Here is the honest answer: fine-tuning is not dead. But the problem it was solving — “how do I get this model to know and do the right things?” — has a better solution now. And the sooner you understand that distinction, the faster you stop building systems that break the moment the world changes. This article is not another hot take. It is not a LinkedIn post dressed up as a technical guide. It is what I wish I had read before I made expensive decisions in the wrong direction. If you have ever wondered whether to fine-tune or just “engineer the context better” — and felt genuinely confused about what the right answer was — this is for you. 📘 Contents Why “Fine-Tuning is Dead” is Both Wrong and Right The Question Nobody Asks Before They Start Form vs Facts — The Distinction That Does 80% of the Work What Context Orchestration Actually Is The Four Layers of Context Every Production System Needs Why 1M-Token Windows Changed the Math — But Not Everything The Hidden Costs Nobody Puts in Their Blog Post Context Poisoning — The Failure Mode Nobody Talks About A Decision Framework You Can Actually Use When Fine-Tuning Still Wins (It’s a Smaller Case Than You Think) What Most Beginners Misunderstand About Both Approaches What I Would Tell Someone Starting a New AI Product Today What Comes Next in This Series 🔴 Why “Fine-Tuning is Dead” is Both Wrong and Right Let me start by saying something that will frustrate people on both sides. Fine-tuning is not dead. And also, most teams should stop reaching for it as their first answer. Both of those sentences are true at the same time. 
And the reason people argue endlessly about this topic is that they are never talking about the same problem. The people saying fine-tuning is dead are usually reacting to how it was being overused — as a hammer for every nail, including nails that did not need hammering. They are right about the overuse. They are wrong to declare the tool useless. The people defending fine-tuning are usually defending a legitimate use case — one where you genuinely need to change the model’s behavior at a fundamental level. They are right about the use case. They are wrong to ignore how narrow that use case actually is in practice. The real question is not whether fine-tuning is alive or dead. The real question is: what problem are you actually trying to solve? That question, asked honestly, will answer the fine-tuning debate before you ever open a training script. 🔴 The Question Nobody Asks Before They Start I have seen this pattern more times than I can count. A team builds a prototype. The model works well on generic tasks. But it keeps drifting — wrong tone, wrong format, hallucinated facts about the product, outputs that miss the point by just enough to matter. So someone says, “We need to fine-tune.” And they do. Weeks of work. Training runs. Evaluation loops. A frozen model checkpoint tied to a specific dataset version. Three months later, the product changes. The docs update. The tone guide shifts. And now the fine-tuned model is confidently wrong about things that changed after its training cutoff. I have been in that room. I have been that person. What nobody asked at the beginning was this: Is this model failing because it doesn’t know something — or because it doesn’t know how to behave? That distinction is everything. And it is the foundation of understanding why context orchestration won. 🔴 Form vs Facts — The Distinction That Does 80% of the Work Before you write a single line of training code, ask yourself two questions. Question one: Is your problem about form? 
Form means behavior. Tone. Output schema. Reasoning style. How the model structures its responses. Whether it sounds like your brand or like a generic assistant. Whether it refuses certain things, formats JSON a specific way, or follows a multi-step reasoning protocol. Question two: Is your problem about facts? Facts mean knowledge. Product documentation. Internal policies. Last week’s pricing. Customer history. Domain-specific information that the base model simply does not have — and that changes over time. This single distinction does most of the work. Facts that change belong in retrieval, not in weights. Why? Because when you bake facts into model weights, you are making a permanent decision about temporary information. You are freezing what should be fluid. Every time the facts change — and they will — you need another training run. Another evaluation cycle. Another deployment. Form that is stable — where you have genuinely decided how the model should behave across all cases, and that behavior is unlikely to shift — is a real candidate for fine-tuning. Because behavior that lives in weights is consistent. It does not depend on a retrieval system being available. It does not need to be re-engineered every time a new user query arrives. Stable form → fine-tuning is a conversation worth having. Changing facts → context orchestration is almost always the answer. The mistake most teams make is treating “our model gives wrong answers” as a form problem. Usually, it is a facts problem. And baking a facts problem into weights is just an expensive way to lock in a decision you have not actually finished making. 🔴 What Context Orchestration Actually Is Here is where I want to be precise — because this term gets thrown around loosely and that loose usage causes confusion. Context orchestration is not just “writing a better system prompt.” It is not just RAG. And it is not just stuffing more documents into a long context window. Context orchestration […]
Last Updated on May 27, 2026 by Editorial Team Author(s): Sudip P. Originally published on Towards AI.

5 Things Broke When I Shipped a RAG + MCP Agent to Production.

Diagram-1: RAG vs MCP agent architecture: a small LLM router classifies each user query as either a Knowledge request (hybrid search → cross-encoder rerank) or an Action request (validate input → tool call). Both paths converge at a single frontier model for synthesis, then pass through eval and logging before returning a response.

TL;DR (because you’re busy)

Demos lie. Production finds your dumb mistakes.
Vector‑only search is a trap. Hybrid + rerank or go home.
MCP tools will return null and ruin your night. Validate. Timeout. Return structured errors.
Keyword routers die on real users. Use a small LLM as router. Cache stuff.
Build the eval harness on day zero. No evals, no clue.

6:14 a.m. and I want to quit

6:14 a.m. Slack explodes. On-call guy asks our shiny new agent why a pipeline job stalled. Agent replies, all calm and confident like a junior consultant who just read a blog: “This job is covered under the Tier-2 SLA with a 4-hour response window.” The real SLA is 30 minutes. The job had been dead for 90. I open the trace. Someone already filed a PagerDuty ticket. Title: “agent gave wrong SLA again”. The “again” part is what really stung.

Quick context (if you missed my last post): RAG gets knowledge from a static index. MCP lets the model call live tools. On my laptop, with two test queries, it felt like magic. In prod, with 80 users pasting real Slack threads, it felt like a liability. Here’s what broke. Here’s what I did. Some of it worked.

The whole mess (how it should work)

Diagram 2: A production agent pipeline drawn as a single-board layout. User queries enter through edge connector J1, get classified by the U1 router, and check the U2 cache.
On a miss, two parallel rails fire: the RAG pipeline (normalize, hybrid search, rerank) builds context, while the MCP tools (validate, timeout/retry, structured result) handle side-effecting calls. Both feed into U9, the frontier model, which synthesizes the answer. From there, copper traces loop the output through observability and the eval harness before delivering it to J2. Every stage is independent, measurable, and replaceable, just like real PCB components. A small LLM router decides whether the user query needs knowledge (RAG) or a live action (MCP). The RAG pipeline uses hybrid search (BM25 + vector) plus a cross‑encoder reranker to find relevant chunks. MCP tools validate inputs, enforce timeouts and retries, and return structured {ok, error, detail} responses. Both paths feed into a frontier model for synthesis, then through an eval and logging layer before the final response. Every box here exists because something broke without it. If you want the calm, demo-version of this same architecture before any of it broke, the original write-up walks through it end-to-end. Breakage #1: Retrieval gave me the wrong SLA What failed: The model didn’t hallucinate. The retriever just handed it the wrong chunk. “Tier-1 SLA for streaming” and “Tier-2 SLA for batch” are really close in vector space. To a human at 3 a.m., they are completely different. Why: Vector similarity is not relevance. Embeddings smooth over the exact words you actually care about. Fix: Two things, both required. First, hybrid search: BM25 (keyword) plus vector, combined with reciprocal rank fusion. BM25 catches literal terms like “Tier-1” and “streaming” that embeddings blur together. Second, rerank the top 20 candidates with a cross-encoder. Cross-encoders look at the query and chunk together. They catch the “semantically close but factually wrong” cases. Here’s the code I wish I’d started with. Not perfect, but you get the idea. 
    # Not real code, but you get the idea.
    # hybrid_search and rerank_model are placeholders for your own components.
    def retrieve(query, k=5):
        candidates = hybrid_search(query, k=20)  # BM25 + vector
        reranked = rerank_model.rerank(query, candidates, top_n=k)
        return reranked

How to detect: Log the chunk IDs for every answer. When users complain, you need to see instantly if retrieval failed or synthesis failed. Add a “retrieval precision” metric to your eval set. For each golden query, did the right chunk show up in top 5?

Breakage #2: MCP tool returned null and the model said “all good”

What failed: The MCP tool called the job status API. The API timed out and returned null. The model saw null, interpreted it as "no issues", and cheerfully told the user everything was fine. The job had actually been dead for 90 minutes.

Why: No validation on the tool output. No distinction between null (no data) and {"status": "ok"}. The model treated both as success.

Fix: Wrap every tool call in a structured result. Validate inputs. Enforce timeouts. Return an explicit ok flag. Never let a raw null reach the model.

    # validate and call_api are your own helpers (not shown here).
    def get_job_status(job_id):
        try:
            validate(job_id)
            data = call_api(job_id, timeout=5.0)
            if data is None:
                # An empty response is a failure, not "no issues".
                return {"ok": False, "error": "empty_response"}
            return {"ok": True, "data": data}
        except TimeoutError:
            # Adjust to whatever timeout exception your HTTP client raises.
            return {"ok": False, "error": "upstream_timeout"}
        except Exception as e:
            return {"ok": False, "error": "upstream_failure", "detail": str(e)}

When the tool returns {"ok": false, "error": "upstream_timeout"}, the model can say "I couldn't reach the job system – please retry." That is the answer you want at 3 a.m.

How to detect: Emit a metric per tool call. Label it with tool name, ok/error, and error category. Alert on error rate. The first tool to drift is never the one you expect.

What actually happens when a tool fails (vs what you think happens)

Diagram 3: The silent API timeout that turns null into "success", and why your agent needs explicit null checks and timeout handling, not just error handling.

That second path ruined my week.
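The per-tool metric from the “How to detect” note can be sketched as a thin wrapper that counts outcomes by tool name and error category. This assumes the structured result shape with an explicit ok flag described above. The names instrumented, TOOL_CALLS, error_rate, and the fake tool are hypothetical, and an in-process Counter stands in for whatever metrics client you actually use.

```python
from collections import Counter
from functools import wraps

# (tool name, "ok" or "error", error category) -> count.
TOOL_CALLS = Counter()


def instrumented(tool_name):
    """Decorator that records the outcome of every structured tool result."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Only an explicit ok=True counts as success, so a raw None
            # can never be recorded as a healthy call.
            ok = isinstance(result, dict) and result.get("ok") is True
            if ok:
                category = "none"
            elif isinstance(result, dict):
                category = result.get("error", "unknown")
            else:
                category = "malformed_result"
            TOOL_CALLS[(tool_name, "ok" if ok else "error", category)] += 1
            return result
        return wrapper
    return decorator


@instrumented("get_job_status")
def fake_get_job_status(job_id):
    # Hypothetical stand-in: job "42" succeeds, everything else times out.
    if job_id == "42":
        return {"ok": True, "data": {"state": "running"}}
    return {"ok": False, "error": "upstream_timeout"}


def error_rate(tool_name):
    """Fraction of recorded calls for `tool_name` that ended in an error."""
    total = sum(n for (name, _, _), n in TOOL_CALLS.items() if name == tool_name)
    errors = sum(
        n for (name, status, _), n in TOOL_CALLS.items()
        if name == tool_name and status == "error"
    )
    return errors / total if total else 0.0


fake_get_job_status("42")
fake_get_job_status("7")
fake_get_job_status("8")
rate = error_rate("get_job_status")
```

Alerting on rate per tool, split by category, is what lets you tell "the upstream API is timing out" apart from "the tool is returning garbage" when the page arrives.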
Breakage #3: Keyword router didn’t survive first contact with users

What failed: My demo router was 12 lines of if statements. if "job" in query: do this. elif "sla" in query: do that. I knew it was bad. I shipped it anyway. Real users don’t type “what is the SLA for job 1234”. They type “hey is […]
Last Updated on May 27, 2026 by Editorial Team Author(s): Mandar Karhade, MD. PhD. Originally published on Towards AI. Co-scientist is a multi-agent scientific reasoning system designed to accelerate the pace of discovery

Co-scientist is a multi-agent scientific reasoning system designed to accelerate the pace of discovery by generating, debating, refining, and re-ranking hypotheses at hyperscale, something almost impossible for an individual, or even for a group of researchers whose minds were somehow linked. After the lead-in, the author explains that the core bottleneck in science is often the human mind: limited attention, difficulty keeping up with literature, and slow experimental cycles mean researchers get too few effective “shots” and can miss crucial prior work. Co-scientist is presented as a way to overclock discovery by enabling more chances for hypothesis generation and evaluation, potentially compressing time from months to hours while also triaging evidence and narrowing experimental pathways. The article then describes how the system works as a multi-agent setup atop Gemini, with specialized parallel roles for searching, proposing, critiquing, debating, and re-ranking ideas through structured, multi-turn reasoning rather than single-pass prompting. By bringing distributed perspectives together, it aims to connect seemingly unrelated facts across fields and reduce tunnel vision. Finally, the author addresses risks and limitations (hallucinations, unrealistic proposals, gaming of scoring) and stresses that AI can only generate hypotheses; real scientific truth still requires human verification through experiments and clinical trials, with careful, grounded expectations for what “time saved” can and cannot accomplish. Read the full blog for free on Medium.
Last Updated on May 27, 2026 by Editorial Team Author(s): Chew Loong Nian – AI ENGINEER Originally published on Towards AI. Microsoft Just Embarrassed Browser Web Agents — 1,000 Lines Made GPT-5.4 Beat Opus 4.6 on 200 Web Tasks A Microsoft Research lab spent the last few weeks watching every other AI lab build bigger, smarter browser agents — then on May 24 they shipped 1,000 lines of code that beat all of them with a terminal. Webwright pushes GPT-5.4 from 33.5% to 60.1% on the Odysseys long-horizon web benchmark, sailing past Claude Opus 4.6’s 44.5% leaderboard top score. The cheaper, older model just smoked the frontier. The trick: stop predicting clicks and let the model write Playwright code. Webwright’s key idea is replacing click-oriented browser-agent loops with a terminal-based “code agent” that has the model generate and run Playwright scripts, treat code as the durable artifact, and iterate using terminal output rather than fragile UI action prediction. The article explains how this simple 1,000-line harness beats heavier browser-native stacks on long-horizon benchmarks like Odysseys and Online-Mind2Web, why action granularity makes browser agents inefficient (many steps mean many LLM calls and exploding token costs), and which small engineering choices enable reliability—particularly a self-reflection gate for “done” correctness and history compaction to prevent context explosion. It also outlines when Webwright wins (multi-site research, conditional form filling, long-tail scraping, date-picker and element-waiting problems) and where it may lose (canvas-rendered apps, real-time games, shifting DOM IDs, fine-grained drag-and-drop). Finally, it argues that the broader industry lesson is to build less bespoke orchestration and more reusable tool/script libraries, since improving model capabilities make heavily engineered browser agent harnesses increasingly constraining. Read the full blog for free on Medium. 
Last Updated on May 27, 2026 by Editorial Team Author(s): Sunil kumar Reddy Originally published on Towards AI.

The Modern Data Stack Is Broken: Here’s How to Fix It With AI, Governance, and Real Architecture

There’s a painful gap between how most data talks are given and how data actually flows inside real companies. Conference slides show clean, linear pipelines. Production reality is messier: three different teams calling the same Kafka topic with different schemas, a dbt model nobody owns that silently joined the wrong dimension table for six months, and an “AI-powered” feature that turns out to be a single GPT-4 API call with no retry logic, no monitoring, and no idea what happens when the context window fills up.

So let me try to sketch out what a genuinely modern data platform looks like, one that handles AI workloads, enforces governance without requiring a full-time compliance team, and can actually be deployed by a team of four rather than forty.

Why the Old Stack Breaks Under AI Workloads

The traditional warehouse-centric stack, think Redshift or BigQuery at the centre with some Airflow DAGs feeding it, was designed around a pretty simple contract. Structured data comes in, SQL queries go out, BI dashboards update overnight. That was fine.

AI changes the contract pretty fundamentally. Your LLM pipeline might need:

Raw, unstructured text sitting in S3 alongside clean relational data
Embeddings computed from 50-million-row tables in real time
Feature vectors generated in milliseconds for a model serving endpoint
Full data lineage tracked so a regulatory audit can trace exactly which training rows produced a specific prediction

None of these fit cleanly into the old paradigm. The warehouse assumes structure. The data lake has flexibility but no transactions. Neither was built to serve a vector database at 10ms latency. The Lakehouse architecture, Delta Lake, Apache Iceberg, or Apache Hudi sitting on object storage, is the honest answer to this.
It gives you the flexibility of a lake with enough transactional integrity to trust your data. But it’s only part of the answer. The Medallion Architecture: Not Just Bronze and Gold You’ve probably seen the medallion diagram before. Bronze is raw, Silver is cleaned, Gold is business-ready. Simple enough. But there are a few things people tend to gloss over. The fourth tier, what I call the Platinum or AI-native layer, is the piece most teams miss. Once your data is clean and business-ready in Gold, there’s still work left to make it AI-ready. Embeddings need to be computed and stored somewhere fast. Fine-tuning datasets need to be curated, versioned, and tracked. Feature vectors for real-time ML models need to be pre-materialised so your serving endpoint doesn’t have to compute them at query time. One thing I’d push back on here: a lot of teams treat this fourth tier as optional, something to add later. In practice, if you don’t design for it from the start, retrofitting it into an existing Iceberg table structure is genuinely painful. You’ll end up with embedding tables that aren’t linked to their source rows in the lineage graph, which makes auditing almost impossible. Real-Time vs. Batch: The Lambda Debate Is (Mostly) Over For years, data engineers debated Lambda architecture vs. Kappa architecture. Lambda said: run two pipelines, one for real-time and one for batch, then merge the outputs. Kappa said: just use streaming for everything. Both camps were partly right. Kappa’s “streaming for everything” ideal runs into the reality that batch processing is still cheaper and more reliable for large historical datasets. But Lambda’s “dual pipeline” approach creates synchronisation nightmares. The honest answer in 2026 is something like this: The key insight in this architecture is that Iceberg acts as the unification point. Flink can write micro-batches to an Iceberg table every 30 seconds. 
Spark can overwrite entire partitions of that same table in a nightly batch run. Both writes are ACID-safe. Downstream consumers, whether a BI tool or an LLM-powered agent, always see a consistent snapshot. The latency SLOs are also worth calling out explicitly. Your real-time path has to meet latency requirements measured in seconds. Your batch path is measured in hours. Those two pipelines need different monitoring, different alerting, and often different teams owning them. I’ve seen organisations collapse those two SLOs into one on-call rotation and it ends badly. Governance Isn’t a Feature, It’s a Foundation Here’s where I want to spend some real time, because governance is the thing that usually gets designed last and regretted first. Most teams treat data governance like a regulatory tax. You do the minimum required to pass the audit, you slap a data catalogue on top of whatever already exists, and you call it done. That works fine until your LLM training pipeline accidentally ingests PII that was supposed to be masked in Silver. Or until a churn model’s predictions are challenged in a lawsuit and you can’t reconstruct which version of which feature was used at inference time. A few things here worth calling out explicitly. Column-level lineage is not optional when you’re doing AI. Table-level lineage, “this table was derived from those three tables,” is useful. But it doesn’t tell you which specific columns fed a model. If a regulator asks why your credit-scoring model produced a particular output, you need to trace back from the model’s input features to their source columns in the raw data, including any transformations that happened along the way. OpenLineage is the open standard here, and both Airflow and dbt can emit lineage events that populate tools like Marquez or DataHub. Data contracts are the real unlock. 
The idea is simple enough: before a producer team changes a table’s schema or SLA, they have to negotiate that change with all downstream consumers. In practice this means defining a YAML contract (Avro schemas work well here) that specifies the column names, types, nullability guarantees, and expected freshness SLA. Break the contract, break the build. Tools like Soda Core or custom Great Expectations suites can enforce this […]
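The contract idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the `ORDERS_CONTRACT` dictionary stands in for a YAML file, the table and column names are hypothetical, and `contract_violations` is an invented helper, not part of Soda Core or Great Expectations.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract for an "orders" table. In practice this would be
# loaded from a YAML file kept in version control next to the producer code.
ORDERS_CONTRACT = {
    "columns": {
        "order_id": {"type": int, "nullable": False},
        "customer_email": {"type": str, "nullable": True},
        "amount": {"type": float, "nullable": False},
    },
    "freshness": timedelta(hours=1),  # assumed freshness SLA
}


def contract_violations(rows, last_loaded_at, contract, now=None):
    """Return a list of violations; an empty list means the contract holds."""
    now = now or datetime.now(timezone.utc)
    violations = []
    # Freshness: the table must have been loaded within the agreed window.
    if now - last_loaded_at > contract["freshness"]:
        violations.append("freshness SLA exceeded")
    # Column presence, nullability, and type, checked row by row.
    for i, row in enumerate(rows):
        for name, spec in contract["columns"].items():
            if name not in row:
                violations.append(f"row {i}: missing column {name}")
            elif row[name] is None:
                if not spec["nullable"]:
                    violations.append(f"row {i}: {name} is null")
            elif not isinstance(row[name], spec["type"]):
                violations.append(f"row {i}: {name} has wrong type")
    return violations
```

In a CI job, a non-empty result would fail the build, which is the "break the contract, break the build" behaviour described above.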
Last Updated on May 27, 2026 by Editorial Team Author(s): Can Demir Originally published on Towards AI. The token bill, the tool-count cliff, and the “expected behavior” security handoff — three costs MCP delegates without naming them. Scroll through developer forums this month and you’ll see the same obituary in every thread: MCP is dead. Eric Holmes’ “MCP is dead, long live the CLI” hit the top of Hacker News. Pieter Levels called MCP “just as useless of an idea as LLMs.txt.” Thoughtworks put “naive API-to-MCP conversion” in the Hold ring of their Tech Radar. After the lead, the article argues that MCP isn’t really a protocol so much as a cost model disguised as JSON-RPC: each turn forces the model to pay a token “manifest tax” for every tool, large tool lists create a discontinuous “tool-count cliff” where tool selection collapses due to attention/selection saturation, and—per Ox Security—STDIO execution pushes sanitization risk to developers as an “expected behavior,” enabling command injection unless you sanitize before launch. It also highlights an operational failure mode (“stdout corruption”) where stray logs/prints break JSON-RPC framing. Finally, it lays out practical design guidance (limit tools, use progressive disclosure, sanitize at every layer you control, log to stderr only), when MCP is worth the overhead (multi-user/enterprise governance), and what remains uncertain (whether progressive disclosure becomes default, how Anthropic’s stance evolves, and whether model-tool selection cliffs are intrinsic or training artifacts). Read the full blog for free on Medium. Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor. Published via Towards AI
Last Updated on May 27, 2026 by Editorial Team Author(s): Can Demir Originally published on Towards AI. A deep dive into the part of agent harness engineering most tutorials skip If you have written an LLM agent in the last two years, you have written something like this: The article argues that the “while not done” loop is not boring scaffolding but the core engineering challenge: deciding when an LLM agent should stop, and what to do when stopping is uncertain. It reframes termination as a four-part taxonomy—goal-based (declared vs verifiable completion), resource-based (iteration/token/wall-clock/monetary/tool budgets), pathology-based (detecting repeated/no-progress/loop or confidence collapse behaviors), and external termination (user cancellation, upstream timeouts, circuit breakers). It explains why common advice like max_iterations and a single “finish” signal is incomplete, details how “retry” actually contains four different mechanisms with different give-up policies (transient, format, semantic, and strategy retries), and shows how to choose and prioritize checks in the right order. The piece closes with the idea that agent design is termination design, offers a compact Python termination-layer pattern in code, and highlights frequent pitfalls (confusing retries, relying on unverified completion, overly aggressive loop detection, and missing cancellation paths). Read the full blog for free on Medium. Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor. Published via Towards AI
Last Updated on May 27, 2026 by Editorial Team Author(s): Can Demir Originally published on Towards AI. Why the same model behaves like a different system depending on what surrounds it — and the anatomy of a good harness For the past two years, almost every conversation in AI has started with the same question: “Which model is best?” Opus or GPT or Gemini? Which one hallucinates less, which one writes cleaner React, which one holds the longest context? After the introduction, the article lays out harness engineering as the missing half of “agent” behavior: an agent is the model plus everything wrapped around it (prompts, tool definitions, loops, state, memory, sandboxes, observability, etc.). It explains why the discipline emerged now as agent frameworks replaced earlier waves like prompt engineering and RAG, and then breaks down the typical harness components—durable state via filesystems/Git, action via bash/code execution, safe execution via sandboxes, continual learning via memory/search, and techniques to fight context rot (compaction, tool-call offloading, and even session resets). It also emphasizes enforced discipline through hooks, argues for debugging by blaming configuration/system design rather than the model (“skill issue” is often harness design), and introduces the “ratchet” approach where mistakes permanently tighten constraints. The essay grounds these ideas with a concrete three-agent architecture example (planner/generator/evaluator), discusses misconceptions such as “better models mean less harness” and “more tools means a more capable agent,” and closes by highlighting the co-evolution of models and harnesses as capabilities change. Ultimately, it reframes AI engineering around system design—shifting the question from “which model should I switch to?” to “which harness component is failing and how do I adjust it?” Read the full blog for free on Medium. Join thousands of data leaders on the AI newsletter. 
Last Updated on May 27, 2026 by Editorial Team Author(s): Felix Kebaya Originally published on Towards AI. AI Engineers Who Can’t Debug Are Getting Fired (Here’s How I Debug with Claude Code) “AI is creating a generation of developers who can’t debug their own code.” After introducing the “can’t debug” headline and arguing it’s only half right, the author demonstrates a hands-on workflow by intentionally breaking a production poll app with five real bugs. Using Claude Code (Opus 4.7), they show how clear symptom descriptions help the agent trace data flow and event chains across files to find root causes and apply fixes: missing database updates (votes not persisting after refresh), duplicate real-time subscriptions (updates firing twice), division-by-zero causing NaN% display, localStorage key mismatches preventing duplicate-vote blocking, and an async/await issue causing deletes to update only local state while deleted polls reappear. They conclude that AI agents don’t replace debugging skills; they expose gaps, especially the ability to isolate symptoms precisely and point the agent to the right code scope, making competent debuggers faster and less skilled ones feel the mismatch more sharply.
Last Updated on May 27, 2026 by Editorial Team Author(s): Rick Hightower Originally published on Towards AI. Part 7: The five layers of Claude Code memory that stop your AI from repeating its mistakes, and how to route each fact to the right one. You tell Claude Code that your project uses pnpm, not npm. You tell it the auth service signs JWTs with RS256, not HS256. You tell it tests live in tests/ and mirror the source path. Useful facts, all of them. Then tomorrow you open a fresh session, and Claude suggests npm install. By Friday you have explained the build command nine times, and you are starting to wonder why this tool keeps forgetting things. The article explains that Claude isn’t “forgetting” because each session starts with a fresh context window; instead, you failed to place persistent facts into Claude Code’s memory system. It breaks memory into five layered scopes (managed policy, user CLAUDE.md, project CLAUDE.md, local CLAUDE.md, and rules, plus auto memory), clarifying how the layers are concatenated each session and how later/more specific instructions win. It then gives practical guidance on keeping CLAUDE.md disciplined (short, specific, verifiable, and fix-focused), moving repeated corrections into the right files, and using path-scoped rules to grow instructions without bloating token cost. Auto memory is presented as a user-and-machine-local scratchpad Claude maintains for you, while the daily habit is to promote notes into the appropriate layer: project CLAUDE.md for team-wide facts, rules for file-scope conventions, and CLAUDE.local.md for your personal-but-stable preferences. Finally, it distinguishes memory (guidance) from enforcement (hooks that must run every time) and ends with a concrete “do this today” checklist using /memory and auditing + extracting from CLAUDE.md to stop the repeated re-explanation tax.
Last Updated on May 27, 2026 by Editorial Team Author(s): Rick Hightower Originally published on Towards AI. Part 6 Subagents: Subagents are the Claude Code feature most engineers never touch, and skipping them is why long sessions slowly fall apart. Your best Claude Code session quietly rots all afternoon, and you blame the model when the fix was one habit away. Claude Code subagents are the single highest-leverage habit separating casual users from experts. Most engineers never touch them. Here is what that costs you, and how to fix it in one afternoon. After the lead, the article explains how Claude Code degrades over long sessions due to context pollution: file reads from exploratory detours accumulate in your main conversation and make answers fuzzier, even though the model isn’t “getting worse.” It then breaks down what a subagent is (an isolated context window with its own system prompt and tool restrictions) so the parent only receives the final summary rather than all intermediate tool calls. The piece covers the cost/benefit math (token exploration stays in the subagent), the key mechanics and constraints (no nesting, parent sees only outputs, some shared context like MCP/skills), and the three built-in subagents you can use immediately (Explore, Plan, and general-purpose) along with when each makes sense. It shows how to invoke subagents on purpose via prompt naming or delegation (including headless usage), how to create custom subagents using YAML/frontmatter (or the /agents UI), when to choose subagents versus worktrees or agent teams, and concludes with practical “do this today” steps (explicitly delegate with Explore, browse/create agents, and build one reusable custom subagent), arguing that subagents become a compounding habit that keeps your main conversation clean and your engineering judgment reusable.
Last Updated on May 19, 2026 by Editorial Team Author(s): Vinamra Yadav Originally published on Towards AI. The demo worked perfectly. You ran it twenty times. You showed it to your team. You showed it to your CTO. Every prompt returned exactly the right output. Then you deployed it. Three days later, a customer reported that the agent gave them completely wrong information, confidently and without any error. Your logs showed HTTP 200s all the way down. Your monitoring reported zero errors. The agent had been silently hallucinating for 72 hours, and nothing in your infrastructure had noticed. This is not a model quality problem. The model was doing exactly what models do. This is an architecture problem, and it’s the problem nobody writes about, because it only becomes visible after you’ve already deployed. I’ve spent the last year building and reviewing AI agent systems in production. The failure taxonomy is consistent. There are six ways an AI agent dies in production, and almost none of them show up in a demo. The math that should terrify you Before the taxonomy, one number worth sitting with. If your agent achieves 85% accuracy per step, which is a good number, better than many production systems, and your workflow has 10 steps, the probability of completing that workflow successfully is 0.85¹⁰ ≈ 19.7%. In this simplified model, where steps are independent and success is binary, roughly eight out of ten workflows fail despite each individual step being “pretty good.” Real failure modes are messier than this, and steps are rarely fully independent. But the model captures the architecture problem accurately: multi-step workflows compound failure. The only way out is to build failure handling into every step, not just the last one. Now, the six failure modes. Failure 1: Context degradation In a multi-step agent workflow, the model doesn’t remember what happened two steps ago; you send it. Every API call includes the entire conversation history.
And that history grows with every step. What engineers miss: context doesn’t just grow, it degrades. Datadog’s 2026 State of AI Engineering report documents the pattern precisely: the average token count in production agent workflows more than doubled year-over-year for median-use teams, and quadrupled for heavy users. As context grows, the original instruction becomes diluted: newer tool outputs and summaries crowd out the early reasoning, and the agent continues confidently on increasingly corrupted signal. When this drift surfaces, there is no owner to page, no baseline to compare against, no runbook to execute. It surfaces as a customer complaint. The agent doesn’t tell you this is happening. The outputs become subtly wrong in ways that are nearly impossible to detect without evaluation tooling. The pattern that makes it worse: engineers build agents that pass outputs between steps as plain text summaries. The model summarises step 3’s output, passes the summary to step 4, which summarises again for step 5. Each summarisation is a lossy compression. By step 8, you’re acting on a summary of a summary of a summary of the original instruction. The fix: preserve structured outputs between steps, not prose summaries. Use typed data contracts between agent steps rather than natural language handoffs.

    from dataclasses import dataclass

    # Instead of this: lossy text handoff
    result = agent.run("Summarize what you found and pass it to the next step")

    # Do this: structured contract between steps
    @dataclass
    class StepResult:
        extracted_entities: list[str]
        confidence_scores: dict[str, float]
        raw_source_ids: list[str]  # preserve provenance
        step_number: int

Failure 2: Silent failures This is the one that keeps engineers up at night, because it’s the one you don’t know about. Traditional monitoring is completely blind to agent failures. An agent that hallucinates a confident wrong answer still returns HTTP 200. Latency stays normal. Error rate stays at zero. Your dashboards are green. Your Slack alerts are quiet.
Latitude’s production observability research documents the pattern clearly: “Tool misuse is the most common agent-specific failure mode in production — and the most insidious: a single malformed argument at step 2 silently corrupts every subsequent step that depends on that output.” The agent calls a tool with incorrect arguments, selects the wrong tool for the task, or fails to handle a tool error and continues as if the call succeeded. The classic scenario: a customer support agent that answers questions about account status. In testing, all queries are clean, structured English. In production, queries are messy, multilingual, emotionally charged. The agent returns plausible wrong answers with normal latency and HTTP 200s. The only signal is a customer escalation, which arrives hours or days after the degradation began. The fix: add a lightweight LLM evaluator layer that scores every agent output before it reaches the user. Not a human in the loop, but a small, fast model that checks three things: is this response relevant to the query? Does it contradict the source data? Does the confidence language match what the retrieval actually returned?

    # fast_evaluator and AgentQualityError are assumed to be defined elsewhere.
    async def evaluate_before_returning(query: str, response: str, sources: list) -> dict:
        evaluation_prompt = f"""
        Query: {query}
        Response: {response}
        Sources consulted: {sources}

        Score on three dimensions (0-1 each):
        - relevance: does the response answer the actual query?
        - grounding: is the response supported by the sources?
        - calibration: does the certainty language match source quality?
        """
        score = await fast_evaluator.run(evaluation_prompt)
        if score["grounding"] < 0.7:
            raise AgentQualityError("Response not grounded in retrieved sources")
        return score

Failure 3: Tool execution schema drift Your agent calls tools: APIs, database queries, internal services. Those tools change. When they change, your agent doesn’t know. This is the API version problem in disguise.
An agent calling a tool doesn’t validate that the tool’s response schema matches what it was trained or prompted to expect. When a third-party API updates their response format, or when your internal service adds a required field, or when an OAuth token expires — the agent receives a malformed or empty response and, depending on how you’ve built it, either hallucinates a plausible-looking answer from the gap, or enters a retry loop. Datadog’s 2026 State […]
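One way to guard against the schema drift described under Failure 3 is to validate every tool response before the agent reasons over it. The sketch below is a minimal illustration, not the article's implementation: the account-status schema, the `ToolSchemaError` exception, and `validate_tool_response` are all hypothetical names introduced here.

```python
class ToolSchemaError(Exception):
    """Raised when a tool response no longer matches the expected shape."""


# Hypothetical expected schema for an account-status tool: field name -> type.
ACCOUNT_STATUS_SCHEMA = {"account_id": str, "status": str, "balance": float}


def validate_tool_response(response: dict, schema: dict) -> dict:
    """Return the response unchanged if it matches the schema, else raise."""
    # A renamed or dropped field is the most common form of drift.
    missing = [key for key in schema if key not in response]
    if missing:
        raise ToolSchemaError(f"missing fields: {missing}")
    # A field that changed type (for example a number sent as a string).
    wrong = [key for key, expected in schema.items()
             if not isinstance(response[key], expected)]
    if wrong:
        raise ToolSchemaError(f"unexpected types for: {wrong}")
    return response
```

Raising at the tool boundary turns a silent gap into an explicit error, so the agent cannot fill the missing field with a plausible-looking guess or loop on retries without anyone noticing.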
Last Updated on May 19, 2026 by Editorial Team Author(s): Swaraj Patil Originally published on Towards AI. SBERT, also known as Sentence-BERT, is a widely used approach for obtaining sentence embeddings that aim to retain the contextual information within the sentences. However, generating these embeddings can be slow when dealing with large amounts of data. To address this, one option is to utilize batch-based encoding to accelerate the inference. However, this may not necessarily reduce the inference time. In this Medium blog post, we will explore the application of the ONNX (Open Neural Network Exchange) framework and how it aids in reducing the inference time of the model. P.S. This article does not delve into the internal workings of ONNX. For more in-depth information, please consult the official ONNX documentation. Let’s begin by installing the required libraries. We can use pip for the installation:

    pip install onnx
    pip install onnxruntime-gpu
    pip install transformers
    pip install torch
    pip install pandas tqdm

Once ONNX is installed, we verify it using the snippet below:

    import onnx
    print(onnx.__version__)

In order to obtain sentence embeddings, we will utilize the IMDB dataset sourced from Kaggle. Specifically, we will focus on the “Overview of Movie” column to generate embeddings using SBERT. The time needed to create embeddings will be measured for the 1000 sentences present in the dataset. We will run two models, vanilla and ONNX-converted, on both CPU and GPU, which gives four measurements:

Inference time for 1000 sentences using Vanilla SBERT (CPU).
Inference time for 1000 sentences using ONNX converted SBERT (CPU).
Inference time for 1000 sentences using Vanilla SBERT (GPU).
Inference time for 1000 sentences using ONNX converted SBERT (GPU).

The Sentence BERT model that we consider here is all-MiniLM-L6-v2. We can invoke the Sentence BERT model from the Hugging Face library or from the Sentence Transformers library. The output embeddings from both libraries will be the same.
For our experiments, we will use the Hugging Face library. Remember that when we use the Hugging Face library, additional post-processing such as pooling or normalization may be needed after obtaining the embeddings. The required steps are listed on the model page on Hugging Face. Perform those steps to get the final sentence embeddings. Let's first convert the model to ONNX format.

    # Load pretrained model and tokenizer
    import os

    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=True)
    model = AutoModel.from_pretrained(model_name)

    # Mean pooling: take the attention mask into account for correct averaging
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        temp = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return F.normalize(temp, p=2, dim=1)

    # Get the first example data to run the model and export it to ONNX
    sample = ['Hey, how are you today?']
    inputs = tokenizer(sample, padding=True, truncation=True, return_tensors="pt")

    # Convert the model to ONNX format
    device = torch.device("cpu")

    # Set model to inference mode, which is required before exporting
    # the model because some operators behave differently in
    # inference and training mode.
    model.eval()
    model.to(device)

    output_dir = os.path.join(".", "onnx_models")
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    export_model_path = os.path.join(output_dir, 'all_MiniLM_L6-v2.onnx')

    with torch.no_grad():
        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
        torch.onnx.export(
            model,                                   # model being run
            # Pass inputs in the same order as input_names; the tokenizer's own
            # key order is not guaranteed to match the model's positional arguments.
            args=(inputs['input_ids'], inputs['attention_mask'], inputs['token_type_ids']),
            f=export_model_path,                     # where to save the model (can be a file or file-like object)
            opset_version=11,                        # the ONNX version to export the model to
            do_constant_folding=True,                # whether to execute constant folding for optimization
            input_names=['input_ids',                # the model's input names
                         'attention_mask',
                         'token_type_ids'],
            output_names=['start', 'end'],           # the model's output names (outputs are read by position later)
            dynamic_axes={'input_ids': symbolic_names,       # variable length axes
                          'attention_mask': symbolic_names,
                          'token_type_ids': symbolic_names,
                          'start': symbolic_names,
                          'end': symbolic_names})
        print("Model exported at ", export_model_path)

Now that we have converted the Sentence BERT model, let’s get the stats for the models.

Vanilla SBERT (CPU) The inference time obtained for the Vanilla SBERT model on the CPU can be found using the snippet below.

    import time

    import numpy as np
    import pandas as pd
    from tqdm import tqdm

    df = pd.read_csv('./imdb_top_1000.csv', usecols=['Overview'])
    total_samples = len(df)
    latency = []
    outputs_cpu = []
    with torch.no_grad():
        for i in tqdm(range(total_samples)):
            data = [df.loc[i, "Overview"]]
            inputs = tokenizer(data, padding=True, truncation=True, return_tensors="pt")
            start = time.time()
            outputs_cpu.append(mean_pooling(model(**inputs), inputs['attention_mask']).cpu().detach().numpy())
            latency.append(time.time() - start)
    print("\n")
    print("PyTorch {} Inference time = {} ms".format(device.type, np.round(np.average(latency)*1000, 4)))

    100%|██████████| 1000/1000 [00:36<00:00, 27.62it/s]
    PyTorch cpu Inference time = 34.2605 ms

ONNX Converted SBERT (CPU) The inference time obtained for the ONNX SBERT model on the CPU can be found using the snippet below.
    import onnxruntime
    import numpy as np

    sess_options = onnxruntime.SessionOptions()
    session = onnxruntime.InferenceSession(export_model_path, sess_options, providers=['CPUExecutionProvider'])
    latency = []
    ort_outputs_cpu = []
    for i in tqdm(range(total_samples)):
        data = [df.loc[i, "Overview"]]
        inputs = tokenizer(data, padding=True, truncation=True, return_tensors="pt")
        ort_inputs = {k: v.cpu().numpy() for k, v in inputs.items()}
        start = time.time()
        op = session.run(None, ort_inputs)
        op = torch.from_numpy(op[0])
        ort_outputs_cpu.append(mean_pooling([op], inputs['attention_mask']).cpu().detach().numpy())
        latency.append(time.time() - start)
    print("\n")
    print("OnnxRuntime {} Inference time = {} ms".format(device.type, np.round(np.average(latency)*1000, 4)))

    100%|██████████| 1000/1000 [00:16<00:00, 60.80it/s]
    OnnxRuntime cpu Inference time = 15.5696 ms

Outputs

    outputs_cpu[0][:,:10]  ## Vanilla SBERT CPU Output
    array([[-0.06326339,  0.0414625 , -0.04707527, -0.03361899, -0.02562934,
             0.03499832,  0.00804075, -0.05042004,  0.00215668, -0.03816812]],
          dtype=float32)

    ort_outputs_cpu[0][:,:10]  ## Onnx SBERT CPU Output
    array([[-0.06326343,  0.04146247, -0.04707528, -0.033619  , -0.02562926,
             0.03499835,  0.0080408 , -0.05042008,  0.00215669, -0.03816817]],
          dtype=float32)

Vanilla SBERT (GPU) The inference time obtained for the Vanilla SBERT model on the GPU can be found using the snippet below.
    device = torch.device("cuda")

    # Set model to inference mode, because some operators behave
    # differently in inference and training mode.
    model.eval()
    model.to(device)

    total_samples = len(df)
    latency = []
    outputs_gpu = []
    with torch.no_grad():
        for i in tqdm(range(total_samples)):
            data = [df.loc[i, "Overview"]]
            inputs = tokenizer(data, padding=True, truncation=True, return_tensors="pt").to(device)
            start = time.time()
            outputs_gpu.append(mean_pooling(model(**inputs), inputs['attention_mask']).cpu().detach().numpy())
            latency.append(time.time() - start)
    print("\n")
    print("PyTorch {} Inference time = {} ms".format(device.type, np.round(np.average(latency)*1000, 4)))

    100%|██████████| 1000/1000 [00:07<00:00, 135.29it/s]
    PyTorch cuda Inference time = 6.737 ms

ONNX Converted SBERT (GPU) The inference time obtained for the ONNX SBERT model on the GPU can be found using the snippet below.

    import onnxruntime
    import numpy as np

    sess_options = onnxruntime.SessionOptions()
    session = onnxruntime.InferenceSession(export_model_path, sess_options, providers=['CUDAExecutionProvider'])
    latency = []
    ort_outputs_gpu = []
    for i in tqdm(range(total_samples)):
        data = [df.loc[i, "Overview"]]
        inputs = tokenizer(data, padding=True, truncation=True, return_tensors="pt").to(device)
        ort_inputs = {k: v.cpu().numpy() for k, v in inputs.items()}
        start = time.time()
        op = session.run(None, ort_inputs)
        op = torch.from_numpy(op[0])
        ort_outputs_gpu.append(mean_pooling([op], inputs['attention_mask'].cpu()).cpu().detach().numpy())
        latency.append(time.time() - start)
    print("\n")
    print("OnnxRuntime {} Inference time = {} ms".format(device.type, np.round(np.average(latency)*1000, 4)))

    100%|██████████| 1000/1000 […]
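The two CPU outputs printed earlier in this walkthrough look identical to the eye, and that can be checked numerically rather than visually. The sketch below uses only the ten values shown for the first sentence; the tolerance of 1e-6 is an assumption chosen for illustration, not a figure from the article.

```python
# First ten embedding values printed earlier for the first sentence (CPU runs).
vanilla = [-0.06326339, 0.0414625, -0.04707527, -0.03361899, -0.02562934,
           0.03499832, 0.00804075, -0.05042004, 0.00215668, -0.03816812]
onnx_rt = [-0.06326343, 0.04146247, -0.04707528, -0.033619, -0.02562926,
           0.03499835, 0.0080408, -0.05042008, 0.00215669, -0.03816817]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


TOLERANCE = 1e-6  # assumed threshold, not taken from the article
diff = max_abs_diff(vanilla, onnx_rt)
print(f"max abs diff = {diff:.2e}, within tolerance: {diff < TOLERANCE}")
```

With the full arrays in memory, `np.allclose(outputs_cpu[0], ort_outputs_cpu[0], atol=1e-6)` performs the same kind of check in one call.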
Last Updated on May 19, 2026 by Editorial Team Author(s): Simon Corde Originally published on Towards AI. I passed the Terraform certification a few months ago. While learning Terraform, I quickly found myself wanting to provision everything I could through it. That turned out to be a bad idea. Here is why. Do you use Terraform for deploying Helm charts? Terraform vs CI/CD for Serverless Deployments 1. The confusion between infra and app lifecycle Infrastructure lifecycle and application lifecycle differ. Infrastructure changes slowly, whereas application releases can happen multiple times a day. For instance, once you are satisfied with your network configuration, you will not change it much over the next year. On the other hand, if this morning’s release of your application has broken some feature despite all the tests you have put in place, you want to roll back to the previous version quickly. Therefore, changes to the application have to be released much more frequently than changes to pure infrastructure. You cannot handle both in the same place. 2. What Terraform is GREAT at Terraform is an open source Infrastructure as Code tool created by HashiCorp. It lets you define, provision, and manage your infrastructure. It has seen huge adoption in the industry over the years. It allows you to automate deployments across different providers, namely cloud providers like AWS, Azure, and GCP, but also PaaS offerings like Heroku, or even GitHub and GitLab. Today many tools have their own Terraform provider, which lets you declare your configuration in files and communicate with their APIs. The million-dollar question really is: what do I put in Terraform? Terraform shines at defining your desired state in terms of configuration for the following: networking, IAM, buckets, SQL databases, secrets, DNS, and load balancers. These infrastructure components define the core of your systems. Once they are set up, you rarely update them. 3.
Why application deployment is different Over the past few years, the tech industry has become significantly more competitive, and this shift has only accelerated with the arrival of AI-powered coding assistants. Today, development velocity is no longer a nice-to-have; it is a core requirement. Feature teams can no longer afford to wait weeks or even a full month to deliver new functionality. In many organizations, it is now common to ship multiple features per week, and in some cases multiple times per day. This level of speed requires a fundamental change in how software is delivered. Promoting application images across environments must be fast, reliable, and fully automated in order to support rapid testing and release cycles. Equally important is the ability to roll back just as quickly when something goes wrong. In this model, deployment is no longer just about pushing changes to production; it is about enabling continuous, safe, and reversible delivery at high velocity. 4. The Cloud Run example Let us take as a running example Cloud Run, the serverless platform in Google Cloud. It is a container-as-a-service platform that can run at scale. We are going to compare the workflow for deploying an application through Terraform with the workflow for deploying it through continuous deployment with GitHub Actions. 💡 All material in this article can be found in my repository at https://github.com/Redcart/serverless-deployment. 🛠 Terraform example In the following piece of Terraform configuration, we define our Cloud Run service. Once you are satisfied with this configuration, you just need to run terraform apply. But what if you need to promote a new version of your app? First, you need to build the new Docker image of your code and push it to Artifact Registry. Second, you have to pick up the right image name and set it in the container_image variable used below. Finally, run terraform apply.
resource "google_cloud_run_v2_service" "cloud-run-service" {
  name                = var.cloud_run_service_name
  location            = var.region
  ingress             = "INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER"
  deletion_protection = false

  template {
    service_account = var.sa_cloud_run

    containers {
      # The image reference is the value that changes on every app release.
      image = var.container_image

      env {
        name  = "GCP_PROJECT_ID"
        value = var.project_id
      }
      env {
        name  = "INPUT_DATASET"
        value = var.input_dataset
      }
      env {
        name  = "OUTPUT_DATASET"
        value = var.output_dataset
      }
      # The API key is read from a secret instead of being stored in plain text.
      env {
        name = "API_KEY"
        value_source {
          secret_key_ref {
            secret  = var.api_key_secret_name
            version = var.api_key_secret_version
          }
        }
      }
    }
  }
}

🛠 CI/CD in GitHub Actions example

In the following piece of configuration, we deploy a Cloud Run service with the gcloud CLI from GitHub Actions. If we want to deploy a new version of the app, we just have to commit and push the change to the GitHub remote repository. The CI/CD pipeline is then triggered, the Docker image is built and pushed to Artifact Registry, and the Cloud Run service is automatically updated with this new image. All in one step, from one single repository.

IMAGE=${{ env.GCP_REGION }}-docker.pkg.dev/${{ env.GCP_PROJECT_ID }}/cloud-run/app:${{ github.sha }}
gcloud run deploy ${{ env.PROJECT }} \
  --image=$IMAGE \
  --region=${{ env.GCP_REGION }} \
  --platform=managed \
  --memory=1Gi \
  --no-allow-unauthenticated \
  --ingress=internal-and-cloud-load-balancing \
  --service-account=${{ env.SERVICE_ACCOUNT_RUNTIME }} \
  --set-env-vars GCP_PROJECT_ID=${{ env.GCP_PROJECT_ID }} \
  --set-env-vars INPUT_DATASET=${{ env.INPUT_DATASET }} \
  --set-env-vars OUTPUT_DATASET=${{ env.OUTPUT_DATASET }} \
  --set-secrets API_KEY=API_KEY:latest

5. Terraform pain points for deployments

While we have seen in the previous section that deploying a Cloud Run service through Terraform can be easy, it has some limitations. First, as discussed earlier, it mixes infra and app concerns. In some teams, the people in charge of infra and the people in charge of the app are not the same.
You can lose velocity, especially if you need to roll back quickly while an apply is pending (a colleague has run a plan but has not confirmed the apply yet). You can run into state locking, and drift is more likely to happen and to interfere with the behavior of your app in production. The plan for such Terraform workspaces can become very noisy and take a while.

📌 To overcome some of these limitations, one may be tempted to say: OK, I will create more Terraform workspaces so that I am not blocked by someone else's pending apply or by a long plan. But here again, do not underestimate the complexity and challenge of maintaining a bunch of workspaces. […]
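To make the rollback argument concrete, here is a minimal Python sketch of why tagging images by commit SHA, as the GitHub Actions workflow above does with github.sha, turns a rollback into a redeploy of an earlier tag. The helper names image_uri and deploy_command are hypothetical and are not part of the repository. Only the --image and --region flags are taken from the workflow, and the command is assembled as an argument list rather than executed.

```python
from typing import List


def image_uri(region: str, project_id: str, commit_sha: str) -> str:
    """Build the Artifact Registry image URI in the same shape as the workflow.

    The repository path "cloud-run/app" mirrors the article's example.
    """
    return f"{region}-docker.pkg.dev/{project_id}/cloud-run/app:{commit_sha}"


def deploy_command(service: str, region: str, project_id: str, commit_sha: str) -> List[str]:
    """Return the gcloud argument list that deploys one specific commit.

    Hypothetical helper: it only assembles arguments and never calls gcloud.
    """
    return [
        "gcloud", "run", "deploy", service,
        f"--image={image_uri(region, project_id, commit_sha)}",
        f"--region={region}",
    ]


if __name__ == "__main__":
    # Releasing and rolling back are the same operation with a different SHA.
    release = deploy_command("my-service", "europe-west1", "my-project", "bbbbbbb")
    rollback = deploy_command("my-service", "europe-west1", "my-project", "aaaaaaa")
    print(" ".join(release))
    print(" ".join(rollback))
```

The point of the sketch is that nothing about the infrastructure definition changes between the two commands; only the image tag does, which is why it can live outside the Terraform state.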
Author(s): Chew Loong Nian – AI ENGINEER
Originally published on Towards AI.

The 17,300-view AI Engineer Singapore talk that quietly killed half my MLOps job

I watched Merve Noyan’s “Your Agent Can Now Train Models” talk three times this week. It went up on the AI Engineer channel three days ago, hit 17,300 views in 72 hours, and now sits as the second-most-watched talk on the entire @aiDotEngineer feed, beaten only by the “CI/CD Is Dead” pitch from Hugo Santos two slots above it. Both are screaming the same thing in different keys: the loop where a human writes a training script, picks a GPU, watches loss curves, and pushes a checkpoint is about to look as quaint as configuring Tomcat by hand.

Claude reads the dataset card during a live demo at AI Engineer Singapore.

The article discusses how Merve Noyan’s AI Engineer talk illustrated the automation of MLOps processes, highlighting the capabilities of the new huggingface-llm-trainer skill. This skill allows users to fine-tune AI models with minimal human intervention, showcasing significant improvements in efficiency and cost-saving in model training. The author recounts personal experiences with the skill, detailing various training tasks and costs involved, and concludes that while automation is transforming the MLOps landscape, the need for human understanding of the training process remains essential. The piece advocates for a shift in MLOps roles towards oversight rather than manual scripting.
Author(s): Vektor Memory
Originally published on Towards AI.

On Hyperspace, basic swarms, the math nobody wrote down, and why we built the thing they were missing in a single afternoon. Join us as we traverse multiple whitepapers and agentic memory ideas like a ferret on Adderall.

Some rabbit holes start with a GitHub link. Someone drops it in a post on Facebook, Reddit, or Discord. No context, just the GitHub URL and a single line: Someone just built AGI! Wow!

The repo was called hyperspaceai/agi. The name alone should have been a warning. I clicked it anyway because I was curious, of course.

As I delved deeper into the GitHub vibe-code abyss, I could see the attraction: a new frontier of peer-to-peer swarm-bot networks, with the ability to earn base 10 points per epoch of confirmation and crypto tokenomics baked in.

The PlayStation 3 ran something similar a while back: a client for Folding@home, which was available on the PS3 and on PCs (https://en.wikipedia.org/wiki/Folding@home). Folding@home is a distributed computing project that aims to help scientists develop new therapeutics for a variety of diseases by simulating protein dynamics. This includes the process of protein folding and the movements of proteins, and it relies on simulations run on volunteers’ personal computers.

If you would like to read one of the first actual swarm-bot papers: the term “Swarm-bot” originally refers to the landmark 2000–2005 European Union-funded SWARM-BOTS project, coordinated by Marco Dorigo, which successfully created a physical peer-to-peer network of autonomous mobile robots called s-bots. These s-bots connected physically and coordinated via peer-to-peer local sensing. https://www.sciencedirect.com/science/article/abs/pii/S0921889005001478

The AGI That Wasn’t

Hyperspace describes itself as the first distributed AGI system. 660 agents. 27,000 experiments. A peer-reviewed research pipeline running autonomously across a P2P network.
The marketing is excellent and captivating, guaranteed to attract lemmings like flies to juicy GitHub stars. The actual results are a different story.

The swarm’s biggest published discovery (the finding that propagated to 23 agents within hours via gossip protocol, the one they highlight as proof the system works) was Kaiming initialization. Kaiming init has long shipped in PyTorch’s standard torch.nn.init module. It’s covered in week two of every deep learning course. Kaiming He published the paper eleven years ago. A grad student with a coffee and an afternoon would have found it faster. https://arxiv.org/pdf/1502.01852

The infrastructure underneath is genuinely impressive. DiLoCo gradient compression, libp2p gossip, CRDT leaderboards, 32 anonymous nodes completing a collaborative training run in 24 hours. The plumbing is real. I don’t want to dismiss that. But AGI? No. What they built is a parallel random search engine with a shared high score table and excellent branding.

To understand why, you need to understand how the gradient compression actually works, because it’s the most technically interesting part, and it’s completely separate from the intelligence problem.

The Tech That Actually Works: DiLoCo and Gradient Compression

Standard distributed training requires every GPU to synchronise gradients after every forward/backward pass. Every node waits for every other node. This works in a data centre on InfiniBand. It falls apart completely over the internet: latency is too high, bandwidth too variable.

DiLoCo (Distributed Low-Communication training, Google DeepMind, 2023) solves this differently. Instead of syncing every step, each node trains independently for many steps, called “inner steps”, then syncs once. The “delta” being sent is just the net drift: weights_after - weights_before.
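The round structure just described fits in a few lines. What follows is a toy sketch in plain Python, not Hyperspace's code: each "node" is a quadratic loss with its own target, standing in for a node's local data, and the function names (local_training, outer_step, make_quadratic_grad) are my own. One more assumption to flag: the outer step here is a plain average of the deltas, whereas the DiLoCo paper feeds the averaged delta to an outer optimizer.

```python
from typing import Callable, List

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def local_training(weights: Vector, grad_fn: GradFn, inner_steps: int, lr: float) -> Vector:
    """Run plain gradient descent locally for `inner_steps` steps, with no communication."""
    w = list(weights)
    for _ in range(inner_steps):
        g = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def outer_step(global_weights: Vector, node_grad_fns: List[GradFn],
               inner_steps: int = 100, lr: float = 0.01) -> Vector:
    """One round: every node trains locally, then the deltas are averaged once."""
    deltas = []
    for grad_fn in node_grad_fns:
        after = local_training(global_weights, grad_fn, inner_steps, lr)
        # The delta is the net drift: weights_after - weights_before.
        deltas.append([a - b for a, b in zip(after, global_weights)])
    n = len(deltas)
    avg_delta = [sum(column) / n for column in zip(*deltas)]
    return [w + d for w, d in zip(global_weights, avg_delta)]


def make_quadratic_grad(target: Vector) -> GradFn:
    """Gradient of 0.5 * ||w - target||^2, a toy stand-in for one node's data."""
    return lambda w: [wi - ti for wi, ti in zip(w, target)]


if __name__ == "__main__":
    nodes = [make_quadratic_grad(t) for t in ([1.0, 0.0], [0.0, 1.0], [2.0, 2.0])]
    weights = [0.0, 0.0]
    for round_idx in range(5):
        weights = outer_step(weights, nodes)
        print(round_idx, [round(w, 3) for w in weights])
```

Each round costs one exchange of a weight-sized vector per node instead of one exchange per training step, which is the whole trade being made.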
Node A: train 100 steps locally → share delta
Node B: train 100 steps locally → share delta
Node C: train 100 steps locally → share delta
        ↓
average the deltas (outer step)
        ↓
all nodes update → repeat

But even one sync of a model’s full weight delta is massive. A 500M parameter model is roughly 2GB of float32 deltas. Over the internet, per round, that’s unusable. So Hyperspace stacks two compression techniques on top:

SparseLoCo: top-k sparsity. Only send the largest-magnitude weight updates. Most parameter updates are near-zero noise. The high-magnitude updates carry the actual learning signal.

Full delta:   [0.001, -0.0003, 0.89, 0.0001, -0.76, ...]
Top-2% only:  [    0,       0, 0.89,      0, -0.76, ...]  → send as sparse {index: value} pairs

Parcae: layer pooling. Group adjacent transformer layers into blocks of 6, average their gradients before taking top-k. Adjacent layers learn correlated things. Averaging before sparsification means a more stable top-k mask.

The combined result: 195× compression. 5.5MB per round instead of roughly 1GB.

DiLoCo:      sync every N steps not every step    → ~100× less frequent
SparseLoCo:  top-2% of delta values only          → 45× smaller payload
Parcae:      pool layers before sparsification    → 6× additional reduction
Total:       195×

This is real and impressive. The problem is that none of it has anything to do with intelligence. It’s bandwidth optimisation. The agents communicating through this pipe are still completely amnesiac.

Why the Swarm Is Basic: The Architecture Problem

Here is the agents’ complete intelligence loop. Every agent. All 660 of them. Every one of the 27,000 experiments:

1. read current leaderboard (what's the best score?)
2. read last 5 experiment results from shared branch
3. prompt LLM: "given these results, generate hypothesis"
4. run experiment
5. record result
6. gossip to peers
7. goto 1

The LLM’s context window is the memory. When the session resets, everything resets. There is no persistence. There is no structure.
There is no causal understanding of why anything worked.

Hyperspace stores:
    "run_047: threshold 0.30, score 0.67"   ← flat log

Hyperspace does NOT store:
    why threshold 0.30 worked
    what it interacted with
    under what conditions it holds
    what failed before it

So when the Kaiming init “discovery” happened, here is what actually occurred: the LLM generating hypotheses was trained on He et al. 2015. The prompt included “try to improve initialization.” The model recalled Kaiming from pretraining weights. An agent ran the experiment. It worked. The score updated. 23 agents adopted it via gossip.

Not emergence. Not intelligence. Retrieval from a pretrained model, dressed up as swarm discovery.

The plateau problem is the proof. Every RSI paper (Gödel Agent, Darwin Gödel Machine, Reflexion, STOP) hits the same wall:

iterations 1-10: big gains […]
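For reference, the top-k step from the SparseLoCo paragraph earlier is small enough to sketch in plain Python. This is an illustration of the general technique under my own assumptions, not Hyperspace's implementation: the function names are made up, the delta is a flat list of floats, and the toy example keeps 2 of 5 values rather than the 2% used in the article.

```python
from typing import Dict, List


def top_k_sparsify(delta: List[float], fraction: float) -> Dict[int, float]:
    """Keep only the largest-magnitude entries of `delta` as {index: value} pairs."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(delta) * fraction))
    # Rank indices by absolute value, largest first, and keep the first k.
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    return {i: delta[i] for i in sorted(ranked[:k])}


def densify(sparse: Dict[int, float], size: int) -> List[float]:
    """Rebuild a dense vector on the receiving side, with zeros where nothing was sent."""
    dense = [0.0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense


if __name__ == "__main__":
    delta = [0.001, -0.0003, 0.89, 0.0001, -0.76]
    sparse = top_k_sparsify(delta, fraction=0.4)  # keep 2 of the 5 values
    print(sparse)
    print(densify(sparse, len(delta)))
```

Note that each kept value travels with its index, so the payload shrinks by less than the raw sparsity ratio would suggest. Nothing in this step knows or cares what the weights mean, which is the article's point about bandwidth versus intelligence.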
Author(s): Kamrun Nahar
Originally published on Towards AI.

Real World Sales Forecasting Playbook

The Single Most Useful Picture I Have Ever Seen

The thing that changed how I worked was not a model. It was a 2×2 grid. Two professors put it on paper in 2005. The grid has saved me from picking the wrong model approximately 600 times.

The Forecastability Map. Every product on your shelf belongs to one of these four boxes. Pick the wrong box, pick the wrong model.

This is the Syntetos-Boylan classification. Two axes. Two numbers per SKU.

The x-axis is the Average Demand Interval (ADI). Take all the time periods in your history. Count how many had at least one sale. Now divide the total number of periods by that. If you sell something every day, ADI is 1. If you sell it every other day on average, ADI is 2. The bigger ADI gets, the rarer the SKU.

The y-axis is the squared coefficient of variation (CV²) of the non-zero demand sizes. Take only the periods where you sold something. Compute the standard deviation of the quantities. Divide by the mean. Square the result. This tells you how variable the size of demand is when it does happen.

The thresholds are 1.32 for ADI and 0.49 for CV². Why those exact numbers. They came from empirical analysis of real industrial data. The math is in a 2005 paper. Don’t go chasing the original PDF, the boundaries are good enough as rules of thumb.

Four boxes pop out of this.

SMOOTH      | INTERMITTENT
Frequent.   | Rare.
Steady.     | Steady when it happens.
------------|------------------------
ERRATIC     | LUMPY
Frequent.   | Rare AND variable.
Variable.   | The nightmare.

Every product on your shelf needs a different forecasting approach. Treating them all the same is why your one big model fails on the long tail.

Let me walk you through each box. Pour yourself something. We’ll be a while.

Box One. Smooth.

This is the dream. Sells every day. Roughly the same quantity. The textbook stuff. A small grocery store sells milk like this.
So does a hospital pharmacy with basic painkillers. So does a power company billing residential customers. Frequent. Steady. The data scientist’s friend.

For smooth demand, almost any classical method works. AutoARIMA. Exponential Smoothing (ETS). A simple regression with calendar features and a couple of lags. The fancy methods barely beat the simple methods. You will see WAPE around 8 to 15 percent and you will look like a magician. The model will not save you. The data is already easy.

The mistake people make on smooth data is over-engineering. They reach for an LSTM. They tune hyperparameters for two weeks. They get a 0.4 percent improvement and call a meeting.

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, AutoARIMA, SeasonalNaive

df = pd.read_csv("smooth_skus.csv", parse_dates=["ds"])  # cols unique_id ds y

sf = StatsForecast(
    models=[AutoETS(season_length=7), AutoARIMA(season_length=7), SeasonalNaive(season_length=7)],
    freq="D", n_jobs=-1)  # daily data, 7-day weekly cycle, all cores
sf.fit(df)  # fits one of each model per unique_id
fcst = sf.predict(h=28)  # 4 weeks out
print(fcst.head())  # check it. should look boring. boring is good.

Line by line, because the Reddit poster asked for thoroughness.

import pandas as pd. Pandas is the spreadsheet library every Python data person uses. The as pd is a community nickname so we never have to type the long name.

from statsforecast import StatsForecast. The orchestrator class from Nixtla's library. It is fast because it parallelizes across SKUs without you having to write a loop.

from statsforecast.models import AutoETS, AutoARIMA, SeasonalNaive. Three classical models. AutoETS picks the best exponential smoothing variant for you. AutoARIMA does the (p,d,q) search you used to do by hand at 2 AM. SeasonalNaive is the dumb baseline that says "last week's same weekday." Always include the dumb baseline.

df = pd.read_csv(...). Reads the CSV.
The library expects three columns. unique_id (which SKU), ds (which date), y (how many sold).

StatsForecast(models=[...], freq="D", n_jobs=-1). We instantiate. freq="D" is daily. n_jobs=-1 means "use every CPU." The library is genuinely fast.

sf.fit(df). Behind the scenes, it groups by unique_id and fits each model to each SKU's history. No loop. You're welcome.

fcst = sf.predict(h=28). Predict 28 days ahead. Each model produces its own column.

print(fcst.head()). Eyeball test. For a smooth SKU, the three forecasts should agree closely. If they wildly disagree, your data isn't actually smooth, and you're in the wrong box.

Why this matters. If your SKU is genuinely smooth, this snippet is your whole pipeline. You don’t need LightGBM. You don’t need Prophet. You don’t need a paper from NeurIPS.

The smooth demand dream. Live it while you can.

Box Two. Intermittent.

Now it gets interesting. A roofing supply store sells a particular size of slate tile maybe four times a month. When it sells, it sells two or three boxes at a time. Always two or three. Never twenty. Never zero point seven. Frequency is low. Quantity is consistent.

For demand that is rare but steady-quantity, the classical models go quiet. ARIMA wants regular data. ETS wants a trend or a season. There isn’t one. There are just zeros, interrupted by a normal number, then more zeros.

Enter John Croston. 1972. Yorkshire. He probably had a slide rule. He had a brilliant idea. Forecast two things separately. How big an order is when it happens. How often orders happen. Divide the first by the second. That’s the whole method. From the Watergate era. And it still beats every neural network on sparse data most of the time. Most of the time. We will get to the exceptions.

Croston’s method has a known problem. It is positively biased. It tends to over-forecast. The smoothing parameter alpha makes it worse: the larger alpha is, the larger the bias. In 2005, the same Syntetos and Boylan who made the quadrant figured out a fix. Multiply by (1 - alpha/2). That's it.
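Here is what that looks like with no library at all: a minimal sketch of Croston's method with the (1 - alpha/2) correction as an optional switch. Treat the details as my assumptions rather than the canonical recipe: one alpha is shared by the size and interval estimates, both are initialised at the first observed sale, and the function name croston_forecast is made up for this example. Library implementations differ on exactly these choices.

```python
from typing import List, Optional


def croston_forecast(demand: List[float], alpha: float = 0.1, sba: bool = False) -> Optional[float]:
    """Per-period demand forecast using Croston's method.

    Non-zero demand sizes and the intervals between sales are smoothed
    separately with simple exponential smoothing, then size is divided
    by interval. With sba=True the result is multiplied by (1 - alpha/2).
    Returns None when the history contains no sales at all.
    """
    size = None       # smoothed non-zero demand size
    interval = None   # smoothed number of periods between sales
    periods_since_sale = 0
    for qty in demand:
        periods_since_sale += 1
        if qty > 0:
            if size is None:
                # Assumption: initialise both estimates at the first sale.
                size, interval = float(qty), float(periods_since_sale)
            else:
                size += alpha * (qty - size)
                interval += alpha * (periods_since_sale - interval)
            periods_since_sale = 0
    if size is None:
        return None
    forecast = size / interval
    return forecast * (1 - alpha / 2) if sba else forecast


if __name__ == "__main__":
    slate_tiles = [0, 0, 3, 0, 0, 0, 2, 0, 0, 3]
    print(croston_forecast(slate_tiles))            # plain Croston
    print(croston_forecast(slate_tiles, sba=True))  # with the bias correction
```

Notice that nothing updates on the zero periods. Both estimates only move when a sale happens.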
That fix is now called the Syntetos-Boylan Approximation (SBA) and it is the default people should be using. There is a further variant called TSB (Teunter-Syntetos-Babai). This one solves a different problem. Croston only updates […]
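Before reaching for any of these models you need to know which box a SKU is in, so here is a minimal sketch of the ADI and CV² calculation from the quadrant section, using the 1.32 and 0.49 thresholds quoted there. A few choices are mine and not from the 2005 paper: the function name classify_demand, the use of the population standard deviation (so a SKU with a single sale still gets a CV² of zero), and sending values exactly on a threshold to the higher box.

```python
from statistics import mean, pstdev
from typing import List

ADI_THRESHOLD = 1.32
CV2_THRESHOLD = 0.49


def classify_demand(history: List[float]) -> str:
    """Place one SKU's demand history in a Syntetos-Boylan box."""
    nonzero = [qty for qty in history if qty > 0]
    if not nonzero:
        raise ValueError("history has no sales, so ADI is undefined")
    # ADI: total number of periods divided by periods with at least one sale.
    adi = len(history) / len(nonzero)
    # CV squared of the non-zero demand sizes (population std, an assumption).
    cv2 = (pstdev(nonzero) / mean(nonzero)) ** 2
    if adi < ADI_THRESHOLD:
        return "smooth" if cv2 < CV2_THRESHOLD else "erratic"
    return "intermittent" if cv2 < CV2_THRESHOLD else "lumpy"


if __name__ == "__main__":
    print(classify_demand([5, 6, 5, 4, 5, 6]))               # sells every period, steady size
    print(classify_demand([0, 0, 3, 0, 0, 2, 0, 0, 3, 0]))   # rare, steady size
    print(classify_demand([1, 20, 2, 30, 1, 25]))            # frequent, wild size
    print(classify_demand([0, 0, 1, 0, 0, 0, 40, 0, 0, 2]))  # rare and wild
```

Run it per unique_id over your history and you get the label that tells you which section of this playbook applies to that SKU.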
Last Updated on May 15, 2026 by Editorial Team
Author(s): Greg Oliver
Originally published on Towards AI.

Architectural Cubic n{C/A} Ratios and Easy Shifts to Aid Robotics Design

This post provides a toolbox of genetic Cubic coefficient ratios n{C/A} and n{C} ratios in Header Graph 1 applied to a depressed Cubic y = Ax³ − Cx + 0 in black with Roots, Tp’s and in green the Sum Of gradients = −3C at all possible 3 real roots (between Tp(y)’s) as presented in my recent post; Designing Polynomials Using Sum of Gradients at the Roots.

Header Graph 1 Coefficient Ratios n{C/A}

This article discusses genetic Cubic coefficient ratios essential for robotics design, providing efficient formulas to manipulate and shift depressed cubic functions within a coordinate system, thus aiding in various robotic applications, including movement and control mechanisms.
Last Updated on May 15, 2026 by Editorial Team
Author(s): Shahidullah Kawsar
Originally published on Towards AI.

Data Scientist & Machine Learning Interview Preparation

Let’s check your basic knowledge of AdaBoost. Here are 10 Q&A for your next interview.

Source: This image is generated by ChatGPT

The article presents a collection of 20 interview questions and answers focused on AdaBoost, a popular machine learning algorithm. It covers various aspects of the algorithm, including its functionality, applications, and the significance of tuning parameters, while also addressing common misconceptions and the implications of model choices. Each question is answered in detail, helping candidates prepare effectively for technical interviews in the data science and machine learning fields.
Last Updated on May 15, 2026 by Editorial Team
Author(s): Davin Convay
Originally published on Towards AI.

There are a lot of new terms dominating the artificial intelligence world lately, “Agentic AI” and “AI agents” being two of them. They are often used interchangeably, but the two phrases have distinct meanings. Organizations that understand when to deploy AI agents versus agentic AI solutions will automate intelligently while others automate blindly. The revolution isn’t just about AI doing tasks; it’s about AI pursuing goals. That difference changes everything. In this blog, we explore agentic AI vs AI agents, what makes them different, and how they will change the way we work.

What is an AI Agent?

An AI agent is a software program designed to perform specific tasks on behalf of users, responding to inputs with predetermined or learned behaviors. Think of AI agents as sophisticated digital assistants that excel at defined functions within established parameters. They perceive their environment through inputs, process information using programmed logic or trained models, and execute actions to achieve specific outcomes.

The term “agent” implies agency, but AI agents possess limited autonomy. They operate within boundaries, following scripts, rules, or patterns learned from training data. A customer service chatbot represents a classic AI agent: it interprets queries, searches knowledge bases, and provides responses, but cannot independently decide to redesign the customer experience or proactively reach out to at-risk customers.

AI agents have evolved significantly from simple rule-based systems. Modern AI agents leverage machine learning, natural language processing, and sophisticated decision trees to handle complex interactions. They can learn from experience, improving responses over time. Yet they remain fundamentally reactive, task-oriented tools waiting for activation rather than independently pursuing objectives.
Examples of AI agents permeate our digital lives:

Chatbots and Virtual Assistants: From Siri to enterprise customer service bots, these agents respond to queries and execute simple commands. They parse language, match intents, and deliver programmed responses.

Recommendation Engines: Netflix’s content suggestions and Amazon’s product recommendations are AI agents analyzing behavior patterns to predict preferences. They excel at pattern matching but don’t independently decide to revolutionize recommendation strategies.

Robotic Process Automation (RPA) Bots: These agents automate repetitive tasks like data entry, form processing, and report generation. They follow defined workflows efficiently but cannot reimagine business processes.

Trading Bots: Algorithmic trading agents execute trades based on market signals and predetermined strategies. They react quickly to market conditions but don’t independently develop new trading philosophies.

Email Filters: Spam detection agents classify messages using learned patterns. They improve accuracy through feedback but don’t autonomously investigate new spam techniques.

What unites these AI agents is their fundamental characteristic: they are tools wielded by humans rather than autonomous collaborators. They augment human capabilities within defined scopes but don’t independently identify problems to solve or goals to pursue.

Different Categories of AI Agents

Understanding AI agent categories helps clarify why not all agents are agentic. Each category serves specific purposes, with distinct capabilities and limitations that determine their appropriate applications.

Reactive Agents

Reactive agents represent the simplest form, responding directly to current stimuli without memory or planning. They excel at immediate response scenarios where historical context is irrelevant.

Characteristics: No internal state, immediate stimulus-response, consistent behavior for identical inputs.
Examples: Basic chatbots with scripted responses, simple email autoresponders, rule-based alert systems.
Limitations: Cannot learn from experience, no context awareness, fails with complex multi-step tasks.
Use Cases: FAQ responses, simple notifications, basic data validation.

Proactive Agents

Proactive agents anticipate needs and initiate actions without explicit user commands. They monitor conditions and trigger responses when specific criteria are met.

Characteristics: Environmental monitoring, threshold-based activation, predictive capabilities.
Examples: Predictive maintenance systems, inventory reorder agents, calendar scheduling assistants.
Strengths: Reduces human oversight, prevents problems before they occur, improves efficiency.
Limitations: Operates within predefined parameters, cannot adapt strategies autonomously.

Hybrid Agents

Hybrid agents combine reactive and proactive behaviors, switching modes based on context. They respond to requests while also initiating beneficial actions.

Characteristics: Dual-mode operation, context-sensitive behavior, balanced autonomy.
Examples: Modern virtual assistants like Google Assistant, enterprise monitoring systems, smart home controllers.
Advantages: Versatile application, user-friendly interaction, efficient resource utilization.
Challenges: Complex design, mode-switching logic, user expectation management.

Specialized vs Generalist Agents

The specialization spectrum determines an agent’s breadth versus depth of capabilities.

Specialized Agents: Excel at specific tasks with deep expertise. Example: Medical diagnosis agents trained on radiology images.
Generalist Agents: Handle diverse tasks with moderate proficiency. Example: GPT-based assistants answering various queries.
Trade-offs: Specialists offer superior performance in narrow domains. Generalists provide flexibility across multiple applications.

Multi-Agent Systems

Multi-agent systems coordinate multiple specialized agents to achieve complex objectives.
Each agent handles specific sub-tasks while communicating with others.

Architecture: Distributed intelligence, inter-agent communication protocols, coordinated goal pursuit.
Examples: Supply chain optimization systems, smart grid management, autonomous vehicle fleets.
Benefits: Scalability, fault tolerance, parallel processing, emergent intelligence.
Complexities: Coordination overhead, conflict resolution, communication bottlenecks.

Learning Agents

Learning agents improve performance through experience, adapting behaviors based on feedback and outcomes.

Learning Mechanisms: Supervised learning from labeled data, reinforcement learning from rewards, unsupervised pattern discovery.
Examples: Recommendation systems, fraud detection agents, game-playing AI.
Evolution: From simple parameter adjustment to complex strategy development.
Limitations: Requires quality training data, can learn biases, may overfit to specific scenarios.

Autonomous Agents

Autonomous agents operate independently within defined parameters, making decisions without human intervention.

Autonomy Levels: From simple script execution to complex decision-making within boundaries.
Examples: Autonomous testing bots, robotic process automation, industrial control systems.
Requirements: Robust error handling, safety constraints, performance monitoring.
Distinction: Autonomous operation doesn’t equal agentic AI; autonomy can exist without goal-setting capability.

What is Agentic AI?

Agentic AI represents a fundamental leap beyond traditional AI agents: artificial intelligence systems capable of independent goal formulation, strategic planning, and autonomous pursuit of objectives without constant human direction. While AI agents execute tasks, agentic AI owns outcomes. This distinction transforms AI from a tool into a collaborator, from an assistant into a strategic partner.

The “agentic” qualifier signifies genuine agency: the capacity to act independently based on internal goals rather than […]
Last Updated on May 15, 2026 by Editorial Team
Author(s): Towards AI Editorial Team
Originally published on Towards AI.

Good morning, AI enthusiasts! This week, we’re looking at the shift from “AI demos” to real systems: agents that need reliable execution, enterprises building durable AI infrastructure, and architectures that survive production constraints.

We also cover:

A 1-hour practical walkthrough of modern AI engineering, from prompting and RAG to agents, evaluation, and deployment, plus a production lesson on why agent retries quietly break real systems.
Why recursive multi-agent systems may depend less on “more agents” and more on how agents communicate internally.
How enterprises are turning years of operational complexity into an advantage in the emerging “harness era” of AI.
A practical guide to deploying production-ready agents on Google Cloud using Agents CLI.
Why modern AI architecture evolved layer by layer, from LLMs to RAG, agents, and MCP, in response to real system failures.

Let’s get into it!

What’s AI Weekly

This week in What’s AI, I’m sharing something we normally only do for enterprise teams: a 1-hour deep dive into the foundations of AI engineering you need to know in 2026. We go through AI theory without the math, cover the real limitations of current LLMs, and walk through the production techniques such as prompting, context engineering, RAG, agents, fine-tuning, evaluation, and deployment. If you’re building with LLMs or planning to, this is the starting point I wish had existed when I began. Watch the full video on YouTube.

AI Tip of the Day

Agent tool call retries are helpful when a model request times out, a tool fails, or the system loses connection. But retries can cause serious problems if the agent repeats the same action. It might send the same email twice, issue two refunds, create duplicate support tickets, or rerun the same payment step. Checking the tool arguments is not enough.
The arguments can be valid, but the action may have already happened. Give each tool action a unique ID that connects to the user request and the action being taken. Save the action status before running it. Then, before the tool runs again, check whether that same action has already finished. For external APIs, use an idempotency key when they support one. For your own database writes, add a uniqueness rule so the same action cannot be saved twice.

If you’re building agentic LLM applications and want to go deeper into tool use, guardrails, and production architecture, check out our Agentic AI Engineering course.

Louis-François Bouchard, Towards AI Co-founder & Head of Community

Learn AI Together Community Section!

Featured Community post from the Discord

_creepycactus built OpenEar, a Mac dictation app. It hears you when you speak, records your meetings, and remembers every word. It runs on your chip, not the cloud, and doesn’t store any information. It is great for long prompts, meetings, voice journaling, or brain dumps. Check it out here and support a fellow community member. If you have any questions, ask in the thread!

Collaboration Opportunities

The Learn AI Together Discord community is flooding with collaboration opportunities. If you are excited to dive into applied AI, want a study partner, or even want to find a partner for your passion project, join the collaboration channel! Keep an eye on this section, too; we share cool opportunities every week!

1. Lucazsh is building a social media app and is looking for a frontend designer or app designer to improve the UX/UI. If this sounds like something you would enjoy working on, connect with them in the thread!

2. Muneebbaig. wants to dive deeper into ML, LLMs, and open-source AI research and produce one or two papers based on it. If you want to spend time on research projects or build your own, reach out to them in the thread!

3.
Beratgurleer is working on n8n growth systems focused on lead conversion solutions and is looking for partners who can help with the technical side. If you want to enter the space and build something together, contact them in the thread!

Meme of the week!

Meme shared by bin4ry_d3struct0r

TAI Curated Section

Article of the week

Groundbreaking Latent State Recursive Multi-Agent Systems is 2.4x Faster Uses 75.6% Cheaper By Mandar Karhade, MD. PhD.

This article walks you through the paper ‘Recursive Multi-Agent Systems’ that bundles two ideas: passing latent hidden states between agents instead of text, and running agents in iterative critique loops. Recursive loops are well-established since Self-Refine and Reflexion in 2023. The latent channel is the actual contribution. Text-based recursion plateaus or regresses by round three because agents commit uncertainty to words; latent recursion keeps improving. The paper’s own data shows the communication channel, not loop depth, is where multi-agent accuracy stops climbing.

Our must-read articles

1. Designing LLM Pipelines for Clinical Data: A Pattern for ALCOA++ and 21 CFR Part 11 Compliance By Pranav Nandan

Shipping LLM features into regulated clinical workflows reveals a recurring architectural failure: the prototype works, but it can’t answer where the audit trail is, why outputs have changed, or who is accountable. The article outlines a five-layer pipeline treating the LLM as a lossy parser, using constrained decoding to physically prevent hallucinations and deterministic Python for all logic and computation. A conditional judge LLM fires on only 15% of records, and ALCOA++ and 21 CFR Part 11 compliance emerge from the architecture.

2. Harness: The Era Enterprises Were Built For By Fabio Yáñez Romero

The era of prompt engineering favored lean, fast-moving teams who could ship on instinct. The harness era inverts that advantage.
The article traces the arc from model weights through context engineering to the harness, a persistent runtime built on externalized memory, reusable skills, and machine-readable protocols. Enterprises that spent decades documenting procedures, governing data, and stabilizing interfaces now hold exactly the right raw material. The model becomes swappable; the harness becomes the durable intelligence layer the company owns outright. 3. How to Build and Deploy AI Agents on Google […]
Author(s): DrSwarnenduAI Originally published on Towards AI. For a decade, we asked if RNNs can represent what Transformers represent. We proved they can. We forgot to ask how expensively. That omission just cost us ten years. “Can our architecture represent everything a Transformer can?” The benchmarks run. The perplexity scores appear. The answer, roughly, is yes. A paper at ICLR 2026, titled “Transformers are Inherently Succinct,” was awarded Outstanding Paper. The article discusses the limitations of recurrent neural networks (RNNs) compared to transformers, particularly regarding their ability to represent complex structures succinctly. It reveals that while RNNs can compute functions similar to transformers, they require exponentially more parameters, especially in tasks requiring deep compositional structures. The piece highlights that evaluations of model efficiency often overlook the underlying parameter costs, which become apparent at higher nesting depths in tasks. Ultimately, it advocates for hybrid architectures that leverage the strengths of both RNNs and transformers to optimize performance in various computational contexts. Read the full blog for free on Medium. Published via Towards AI
Author(s): Kamrun Nahar Originally published on Towards AI. SARIMAX, Prophet, XGBoost, LSTM, and N-BEATS broken down without any pretentious math. Pick the right model in under five minutes today. The 9 billion dollar lesson. In November 2021, Zillow walked into a conference room and admitted that their AI had set 7,000 houses on fire. Not literally. Financially. They’d built an algorithm to buy and flip homes, and the algorithm spent two years quietly overpaying for everything in Phoenix, Atlanta, and a dozen other markets. By the time they noticed, the company had to take a 304 million dollar write-down, lay off a quarter of the entire staff, and shut down the iBuying business completely. The market cap dropped about 7.8 billion dollars in days. The CEO blamed unpredictability. The data scientists got the side eye on Twitter. The post-mortems all said the same thing in different words. The forecasting model worked great until the world changed and nobody noticed. This is a time series problem. A famously expensive one. And after you finish this article, you’ll understand exactly why it went wrong, exactly how a more careful forecaster would have caught it, and exactly which tests would have screamed before any houses got bought. The face every executive makes when the algorithm spends 9 billion dollars on the wrong thing. You already forecast. You just don’t call it that. Take a moment. Right now. Glance out the window and guess if it’ll rain in two hours. You’ve got an answer. Maybe a confidence level too. Congratulations, you just ran a forecasting model in your head. Your phone does it for you a hundred times a day. Maps predicts traffic. Spotify predicts the next track. Your weather app predicts the rest of the week. Even your body forecasts. It knows when you’ll feel hungry by 7 PM because you usually feel hungry by 7 PM. The technical name for this is time series forecasting, and the math is older than your grandparents. 
The first serious model came from the 1920s. The Yule autoregressive equations. We’ve been refining this for a century. We’re not exactly working with new tools. Your brain does this every morning. It just doesn’t print a confidence interval. What makes time series different from ordinary data? Memory. Today depends on yesterday, and yesterday depends on the day before, all the way back to whenever the data started. You can shuffle a list of customer ages and lose nothing useful. Shuffle a stock price history and you’ve reduced it to nonsense. Order is sacred. Time isn’t a label, it’s the spine. The three ingredients in every messy series Pretty much every time series you’ll ever meet is three things layered on top of each other. Once you can name the three layers, you’ve already passed the first interview question. There’s a leaf blower screeching somewhere on my street that’s making this paragraph harder to write, but the metaphor still holds. Every series is a noisy stack. Trend. The slow drift. Are the numbers generally rising over months? Falling? Flat? Trend is the long story. Earth’s average temperature. Your kid’s height. The number of people who still own a fax machine. Seasonality. The repeating heartbeat. Pumpkin spice sales spike every October. Gym memberships peak the first week of January and die by February 15. Electricity demand in Stockholm crashes every July when everyone leaves for the archipelago. Residual. The leftover wiggle. Whatever’s left after you account for the trend and the season. A surprise rainstorm. A celebrity tweet. A holiday Monday that fell on the wrong week. The fuzz. Three honest squiggles hiding inside one ugly squiggle. Here’s a real one to make this concrete. A 2017 paper from the M-competitions analyzed call centers and found something strange. Banking call centers have 169 seasonal cycles per day. That’s not a typo. They aggregate calls every 5 minutes, and each 5 minute slot has its own pattern. 
Mondays at 10:13 AM look completely different from Mondays at 10:18 AM, and both look different from Tuesdays at the same minute. Reality is gloriously weird. Why this matters for you. Decomposition is the very first thing pros do when they meet new data. You can’t pick a model if you don’t know what’s in your data. Strong yearly cycle. SARIMA. Mostly random. Try the simplest baseline. Naming the parts changes the model. Stationarity. The most boring superpower in the universe. A series is stationary when its statistical properties stop drifting over time. The mean stays put. The variance doesn’t bloat. The way today correlates with yesterday isn’t shape-shifting. The flickering streetlight outside my apartment? Stationary. The price of houses in my neighborhood since 2019? Definitely not. One sits on a bench. The other is sprinting uphill. Left side wandered off to find itself. Right side stayed for tea. Why does anyone care? Because most classical models, especially the ARIMA family, assume your data is stationary before they touch it. Feed them a wandering series and they’ll learn nonsense. Then they’ll forecast nonsense back at you, with a confidence interval that looks suspiciously confident. This is roughly what happened to Zillow in slow motion. Their model was trained on housing markets that were fundamentally non-stationary, and it kept extrapolating an upward trend that the world stopped agreeing with. How do you check stationarity in code? Use the Augmented Dickey Fuller test (ADF). The null hypothesis is that the series is non-stationary. If the p-value comes back below 0.05, you can reject the null and call your series stationary enough for ARIMA-style models. 
    # I am about to ask whether this series sits still long enough to model
    from statsmodels.tsa.stattools import adfuller

    # `series` is assumed to be a pandas Series of observations in time order
    result = adfuller(series.dropna())
    print(f"ADF statistic : {result[0]:.4f}")
    print(f"p-value : {result[1]:.4f}")
    print(f"used lags : {result[2]}")
    print(f"observations : {result[3]}")
    print("critical values:")
    for k, v in result[4].items():
        print(f" {k}: {v:.4f}")
    # below 0.05 means probably stationary
    # more negative ADF statistic means more confident

If the p-value comes back at 0.42, your series is wandering. The fix is the next section. It’s almost embarrassingly easy. […]
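The trend / seasonality / residual split described earlier can be made concrete with a small sketch. This is a minimal additive decomposition in plain Python, assuming a fixed and known period; the function name `decompose_additive` and the toy quarterly series are illustrative and do not come from the original article. In practice a library routine such as `seasonal_decompose` from statsmodels would be the more likely choice.

```python
from statistics import mean


def decompose_additive(values, period):
    """Split a series into trend, seasonal, and residual parts (additive model)."""
    n = len(values)
    half = period // 2
    trend = [None] * n  # the edges have no full window, so they stay None
    for i in range(half, n - half):
        window = values[i - half : i + half + 1]
        if period % 2 == 0:
            # Even period: centered moving average with half-weight endpoints.
            total = sum(window) - 0.5 * (window[0] + window[-1])
            trend[i] = total / period
        else:
            trend[i] = mean(window)

    # Seasonal part: average detrended value at each position in the cycle.
    buckets = [[] for _ in range(period)]
    for i, t in enumerate(trend):
        if t is not None:
            buckets[i % period].append(values[i] - t)
    raw = [mean(b) for b in buckets]
    offset = mean(raw)  # center the pattern so it sums to zero
    pattern = [s - offset for s in raw]
    seasonal = [pattern[i % period] for i in range(n)]

    # Residual: whatever is left once trend and season are removed.
    residual = [
        None if t is None else values[i] - t - seasonal[i]
        for i, t in enumerate(trend)
    ]
    return trend, seasonal, residual


# Toy series (hypothetical): slow upward drift plus a repeating 4-step pattern.
cycle = [2.0, -1.0, -2.0, 1.0]
values = [0.5 * i + cycle[i % 4] for i in range(24)]
trend, seasonal, residual = decompose_additive(values, period=4)
```

Because the toy series has no noise, the residual is essentially zero wherever the trend is defined; on real data that leftover column is where the surprises live.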
Author(s): Kushal Banda Originally published on Towards AI. Claude Chat is reactive: you prompt, Claude answers. Claude Code is terminal-first and demands CLI comfort. Cowork splits the difference: agentic architecture (same as Code) in a GUI anyone can operate. Multi-step planning. File system access. Connectors to Gmail/Slack/Salesforce. Parallel execution. And crucially: the ability to complete work automatically through scheduled tasks, which isn’t possible in regular chats outside of Cowork. The mental model: stop thinking “assistant that writes.” Start thinking “assistant that does.” Claude Cowork is Anthropic’s take on OpenClaw: more secure and built exclusively for the Claude ecosystem. The Architecture Layer Projects: Task isolation. Every major workflow lives in its own project with dedicated folder access. This prevents context bleed when running 5 different projects simultaneously. Create a dedicated folder for each project and grant Claude access only to that folder. That’s your security boundary. Plugins: Role-specific skill bundles. Finance, legal, marketing, data analysis, operations: each plugin pre-packs connectors (Gmail, Slack, etc.), skills (reusable prompts), and sub-agents. Plugins come pre-configured with the skills and connectors relevant to that function, and you can tailor them to your workflow. Example workflow (sales): “New prospect Johnson” → Plugin scans LinkedIn + Gmail + Slack → generates dossier + templates → ready to send. Skills: Slash-command reusable prompts. /summarize, /format, /analyze are built-in. Custom skills are copy-paste-ready instructions that trigger specific actions. Type "/" to see available skills; plugins add more. Dispatch: Remote task assignment from phone. From your phone, hand Claude tasks that use everything on your desktop, including things you can’t open on your phone: pull data from local spreadsheets, search Slack/email, generate reports. 
Syncs continuously. You get push notifications when done or approval needed. Scheduled Tasks: Cron without cron. Define recurring workflows (daily briefing, weekly report, monthly cleanup). Runs on wake; sleeping machine skips and re-runs on next boot. No continuous execution unless your desktop stays on. Scheduled Tasks What Actually Works vs. What Breaks Real-world patterns that ship: Weekly reports: Analyze folder of CSVs → synthesize findings → output PowerPoint + email to stakeholders. Runs Sundays 6am. Email-triggered workflows: New email in label → Cowork extracts data → creates Notion entry + Slack summary. (Requires email connector setup.) File organization: 500 PDFs in Downloads → parse filenames → sort into folders by type → archive old. Runs on demand or scheduled. Contract review triage: Upload contracts → extract key terms → flag risks → bucket into categories (approve/needs-review/reject). Legal teams actually use this. Competitor research: Task → Claude searches web + downloads pricing pages + compiles comparison table → exports to Sheets. Takes 15–20 min per batch of 10 competitors. What breaks: CAPTCHA workflows (can’t solve them). Skip these entirely. High-precision image editing (Photoshop-grade retouching). Use Claude for planning, human for execution. Tasks requiring real-time responsiveness. 10–15 second latency per step adds up. Finance trades or irreversible deletions. Cowork has no rollback. Don’t hand it keys to live systems without approval gates. Desktop apps with no API/MCP bridge (some legacy enterprise software). If it’s not web-accessible or MCP-connected, Cowork struggles. Get started with Claude Cowork Create project folder: /Users/you/Cowork-ProjectName/ (not Documents, not Desktop). Grant access: Cowork settings → pick this folder only. Deny everything else. Set global instructions: Settings → Cowork → Edit. Add tone, output preferences, constraints. 
Install plugins: Customize menu → browse marketplace → install 2–3 role-specific ones (start lightweight). Test one workflow: Small task, 5 files, one output. Debug before scaling. Enable Dispatch: Phone app → authenticate → sync desktop Cowork. Schedule one recurring task: Daily/weekly/monthly. Observe for 2 weeks before adding more. Don’t over-engineer. Start with one plugin, one project, one recurring task. Add complexity after you see it working. Further Reading Claude Cowork Power User Guide 50+ tested tips on plugins, skills, sub-agents, scheduled tasks Anthropic Claude Cowork Thanks for reading! If you have any questions or feedback, please let me know on Medium or LinkedIn. Published via Towards AI
Last Updated on May 11, 2026 by Editorial Team Author(s): Ravi Yogesh Originally published on Towards AI. 10 experiments, 3 models, one honest verdict: the quality story is real, the speed story needs a disclaimer, and there’s a finding in the entropy data nobody talks about. ⏱ ~14 min read🔬 Deep Dive⚙️ LLM Inference🗜 Quantization🚀 Serving Photo by Logan Voss on Unsplash When Google published TurboQuant at ICLR 2026, the headline was hard to ignore: compress your LLM’s key-value cache to 3 bits, keep quality intact, get up to 6× memory savings. I built a 10-experiment evaluation pipeline, ran it across three models — Gemma-2B base, Gemma-2B-IT, and TinyLlama 1.1B Chat — and measured everything I could: factual accuracy, RAG retrieval quality, multi-task generation fidelity, throughput, memory footprint, layer sensitivity, and something most quantization write-ups skip entirely: what compression does to attention entropy. Figure 0: Full experiment suite — quality, memory, throughput, and mixed-bit results across all three models What Is TurboQuant and Why It’s Different Most KV-cache quantization schemes treat compression as a reconstruction problem: minimize mean squared error on cached vectors. TurboQuant is smarter than that. It targets the thing that actually matters for attention — inner-product preservation. Stage 1 — PolarQuant Each pair of consecutive vector coordinates gets converted to polar form (radius + angle). The radius stays in full float. The angle — which lives in a bounded, roughly uniform range after an optional random rotation — is quantized at low bitwidth. The rotation is the real trick: it spreads outlier energy so that scalar quantization on the angle isn’t wrecked by a few dominant values. Stage 2 — QJL (Quantized Johnson-Lindenstrauss) The reconstruction error from Stage 1 gets projected into a random subspace, and only the sign of each projection is stored — 1 bit per projection. 
The Johnson-Lindenstrauss lemma says random projections approximately preserve inner products. So this step specifically corrects for dot-product error that distorts attention scores, not just MSE. That distinction is what separates TurboQuant from naive quantizers. Disclaimer: My implementation hooks into the model’s forward pass and compresses cached K and V tensors between decode steps using int8 containers with per-vector float16 scale factors. It is not a packed 3-bit kernel — it does not use fused CUDA or Triton. This distinction matters a lot for speed results and somewhat for memory results. I will be explicit about both throughout. The Experimental Setup Ten experiments, two phases. All text evaluations used a fixed external encoder (sentence-transformers/all-MiniLM-L6-v2) separate from the tested models. Throughput used a fixed 80-step greedy decode benchmark with alternating load order to reduce warm/cold GPU bias. 1. 3-Bit Is the Real Operating Point — and Entropy Explains Why Start with the synthetic bit-depth comparison — PolarQuant alone vs. the full TurboQuant-style path on attention MSE and KL divergence. At 2 bits, TurboQuant performs worse than PolarQuant alone. At 3 and 4 bits, the combination wins decisively Figure 1 (E1): Attention MSE and KL divergence vs. bit depth — PolarQuant vs. TurboQuant-style Figure 2 (E2): Rate-distortion curve — measured MSE vs. anchored 4^-b reference (log scale) The MSE table shows that 2-bit is worse, but it doesn’t explain why. The attention entropy experiment does. Entropy measures how focused or diffuse the softmax distribution is — high entropy means the model is attending broadly, low entropy means it’s locked onto a few positions. The key distinction is between random key distributions (noise-floor conditions) and structured key distributions — vectors with dominant directional clusters, which is closer to what real LLM attention heads actually produce. 
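To make the entropy measure concrete, here is a small sketch, assuming Shannon entropy (in nats) over a softmax of raw attention scores. The helper names and the example scores are illustrative and are not taken from the author's evaluation pipeline.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(scores):
    """Shannon entropy of the attention distribution: high means diffuse."""
    probs = softmax(scores)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical scores over 8 context positions.
diffuse = [0.0] * 8            # every position equally attended
focused = [8.0] + [0.0] * 7    # one position dominates

h_diffuse = attention_entropy(diffuse)
h_focused = attention_entropy(focused)

# Relative entropy drop when attention moves from diffuse to focused.
collapse = 1.0 - h_focused / h_diffuse
```

A relative drop like `collapse` is one plausible way to read an "entropy collapse" percentage; how the author normalizes the figure is not stated, so treat the ratio as an assumption.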
Structured distributions (closer to real LLM key patterns) show a 65% entropy collapse at 2 bits — the model is forced to hyper-focus on a tiny fraction of context positions At 2-bit compression, structured key distributions experience a 65% entropy collapse — the quantizer is actively reshaping the attention geometry in ways that cut off access to broader context. This is not just elevated MSE. It is the underreported reason why 2-bit compression quietly damages multi-hop reasoning, long-document synthesis, and complex RAG chains. 2. Quality Results Are Stronger Than Expected Factual QA (40 questions, 3 seeds) RAG-Style Context-Grounded QA (12 contexts, 3 seeds) Multi-task semantic similarity (factual, reasoning, coding, summarization) between baseline and TQ-3bit generations: 1.0 across all categories and all three models. Figure 3 (E7): Multi-task semantic similarity radar — TQ-3bit vs. baseline (1.0 across all categories) 3. The Memory Math Has Two Very Different Versions For Gemma-2B (18 layers, MQA with 1 KV head per group, 256-dim heads), the idealized 3-bit target vs. what the int8 prototype actually stores: The 2.69× gap between ideal and prototype is entirely explained by storage format: int8 costs 8 bits/element instead of 3, plus float16 scale metadata per token vector. Figure 4 (E4): Left: ideal targets by bitwidth. Right: prototype storage vs. ideal 3-bit target If someone tells you TurboQuant gives 5× memory savings, ask what storage backend they’re using. An int8 prototype gives ~2×. That’s still useful — it can extend context window or increase batch size on constrained hardware — but it’s a different product story. The ~5× headline requires truly packed sub-byte storage with custom memory layouts. 4. Speed Is Model-Dependent — Here’s the Honest Reason Why Throughput sweep: 80 fixed greedy decode steps, 5 measured trials, 2 alternating rounds, three prompt lengths. 
Headline (512-token prompt) Full prompt-length sweep (TQ / Baseline ratio) Values above 1.0 mean TQ-3bit is faster. TinyLlama is nearly 10× faster at baseline, making it proportionally more sensitive to cache I/O savings The speed question is not “is TurboQuant fast?” It is “is your model’s bottleneck memory bandwidth or arithmetic?” Faster models like TinyLlama spend proportionally more time on cache I/O — compressing the cache loosens that bottleneck. Compute-heavy models like Gemma-2B are less responsive. Know your bottleneck before optimizing for it. 5. Mixed-Bit Schedules Are an Easy Win Nobody Talks About Layer sensitivity analysis showed that for both Gemma variants, compression sensitivity peaks in layers 7–10 — the middle of […]
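The memory arithmetic from section 3 above can be reproduced from the stated Gemma-2B shape (18 layers, 1 KV head, 256-dim heads). The sketch below assumes a float16 baseline and one float16 scale per cached K or V vector, which is my reading of the prototype description; the function name `kv_bytes_per_token` is illustrative.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits, scale_bytes=0):
    """Bytes of KV cache stored per token for a given element bitwidth."""
    vectors = 2 * layers * kv_heads           # one K and one V vector per layer and KV head
    payload = vectors * head_dim * bits / 8   # quantized (or full-precision) elements
    metadata = vectors * scale_bytes          # optional per-vector scale factor
    return payload + metadata


# Shape taken from the article; storage formats are assumptions.
baseline = kv_bytes_per_token(18, 1, 256, bits=16)                 # float16 cache
ideal_3bit = kv_bytes_per_token(18, 1, 256, bits=3)                # packed 3-bit target
prototype = kv_bytes_per_token(18, 1, 256, bits=8, scale_bytes=2)  # int8 + fp16 scale

savings_ideal = baseline / ideal_3bit
savings_prototype = baseline / prototype
gap = prototype / ideal_3bit
```

Under these assumptions `gap` comes out near 2.69 and `savings_prototype` near 2, which lines up with the figures quoted in the article, while the packed 3-bit target works out to roughly 5.3× against float16.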
Last Updated on May 4, 2026 by Editorial Team Author(s): Rishav Saigal Originally published on Towards AI. A technical deep-dive into the LangGraph state machine, Pydantic-driven routing, and Critique Agent design powering the LLM Drift Experiment. In the opening piece of this series, we explored the conceptual “why” behind LLM Drift — how AI agents lose their persona, reasoning quality, and behavioral consistency under sustained adversarial pressure. But for the engineers and architects in the room, the “how” is where the real story lives. Building a system designed to intentionally stress-test agent stability requires more than a sequential script. It requires a stateful, resilient, and adversarial architecture — one where failure modes are first-class citizens, not edge cases to be patched later. To build the LLM Drift Experiment, we chose LangGraph. Here is a deep dive into the architectural decisions that power our multi-agent debate engine. Why LangGraph? Stateful Graphs vs. Naive Loops When building complex agentic workflows, the biggest engineering challenge isn’t the LLM call itself — it’s the logic between the calls. We needed a system capable of: Maintaining stateful context across dozens of debate rounds Implementing conditional loops that re-enter specific nodes upon rejection Supporting node-level retries without restarting the entire workflow Plain Python loops or simple LangChain chains don’t handle this gracefully. A for loop over LLM calls gives you no way to selectively re-run a single node, inspect intermediate state, or branch on structured output without building your own routing infrastructure from scratch. LangGraph’s directed graph model solves all three problems natively. It lets you define an explicit typed state object, create conditional edges that read from that state, and implement node-level retries with full visibility into what happened at each step. 
For a system where agents must iterate until they satisfy a hostile internal critic, this isn’t just convenient — it’s architecturally necessary. “LangGraph enables conditional re-entry — a naive loop cannot.” The Full Debate Graph: How Every Turn Is Orchestrated The heart of the project is the orchestration graph. Every turn of the debate follows a rigorous, deterministic path: The Pros Agent generates an argument The argument passes through the internal refinement loop (more on this below) Only upon approval is it committed to shared memory The Cons Agent reads from shared memory and generates a counter-argument The same internal loop applies before the Cons argument is published The cycle repeats for N configured rounds We rely on LangGraph’s auto-generated graph visualization to monitor this flow in real time. Two conditional edges — should_continue_pros and should_continue_cons — act as gatekeepers, reading the is_approved boolean from the agent's structured output and deciding whether to advance to the next team or loop back for refinement. START → pros_agent → critique_pros → [conditional] → cons_agent → critique_cons → [conditional] → END or loop. The Refinement Loop: An Agent Designed to Reject Its Own Team The most architecturally novel component of this system is the internal Persona → Thinking → Critique loop. In most agentic systems, critic agents are designed to be helpful — nudging outputs toward better quality. In our experiment, the Critique Agent is deliberately adversarial. Every argument generated by the Pros or Cons team must pass through three internal stages before it reaches shared memory: Persona Agent Architects or actively redesigns the team’s adversarial identity each round. It reads persona.json to assess the current persona, then decides whether to reuse the existing identity or design a new strategic persona based on the opponent's latest moves. 
This dynamic decision — maintain or evolve — is precisely what makes the Persona Agent the most sensitive node for drift detection. The persona it settles on becomes the identity anchor we measure all subsequent outputs against. Thinking Agent Stress-tests the argument internally. It identifies logical gaps, weak evidence chains, and rhetorical inconsistencies before the Critique Agent ever sees the output. Critique Agent Acts as a hostile internal auditor. Its sole function is to find grounds for rejection. If the argument is logically circular, emotionally inconsistent with the persona, or reasoning from the same evidence as the previous round, it issues a rejection with structured feedback — and the loop restarts at the Persona Agent. Arguments only exit to shared_memory.json after surviving this audit. This enforces a high baseline of argument quality — but it also creates a fascinating failure mode we are actively monitoring: loop-lock, where the Critique Agent becomes so strict that neither agent can produce an argument that passes. Loop-lock is, in itself, a measurable form of cognitive drift. “Loop-lock monitored as a drift signal.” Memory Architecture: Isolation by Design Memory management is treated as a first-class architectural concern, not an afterthought. We implemented a two-tier isolation system to preserve experimental integrity. Shared Memory (shared_memory.json) The public transcript of the debate. Both teams can read from this file — it contains only finalized, approved arguments. This represents the “official record” of what each agent has argued. Team-Private Memory Each team maintains three private files that are invisible to the opposing agent: persona.json — the identity anchor and behavioral constraints for this team thinking.json — internal reasoning scratchpad (not part of the public argument) critique.json — the Critique Agent's rejection logs and feedback history This isolation is architecturally critical. 
If the Cons team could read the Pros team’s internal thinking.json, they would have access to reasoning that was explicitly not published, which is effectively cheating. More importantly for drift measurement, cross-team memory contamination would corrupt the persona consistency scores by introducing external framing before the argument is finalized.

Two additional design rules enforce experimental integrity:

Append-Only Writes. The write_json_direct() function only ever appends entries to memory files; it never overwrites. Every version of every persona, every thinking draft, and every critique rejection is preserved. This enables full forensic reconstruction of how any argument evolved across all its internal iterations.

Automatic Run Archiving. When a simulation completes, the entire memory state is automatically archived to Research Runs/ using a structured naming convention:

    memory-v{VERSION}-temp-{TEMPERATURE}-max-tokens-{MAX_TOKENS}
    # e.g. memory-v6-temp-1-max-tokens-4096

If a […]
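The append-only rule and the archive naming convention described above can be sketched in a few lines. This is a hypothetical stand-in, not the project's actual `write_json_direct()` (its implementation is not shown in the article); the names `append_entry` and `archive_name`, and the assumption that each memory file holds a JSON list, are mine.

```python
import json
import tempfile
from pathlib import Path


def append_entry(path, entry):
    """Add one entry to a JSON list file without dropping earlier entries.

    The file is rewritten with the extended list, so history is append-only
    at the logical level: nothing already recorded is changed or removed.
    """
    path = Path(path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(entry)
    path.write_text(json.dumps(history, indent=2))
    return len(history)


def archive_name(version, temperature, max_tokens):
    """Build a run archive folder name following the article's convention."""
    return f"memory-v{version}-temp-{temperature}-max-tokens-{max_tokens}"


# Usage: two critique results for the same round are both preserved.
with tempfile.TemporaryDirectory() as tmp:
    critique_log = Path(tmp) / "critique.json"
    append_entry(critique_log, {"round": 1, "is_approved": False})
    count = append_entry(critique_log, {"round": 1, "is_approved": True})
    saved = json.loads(critique_log.read_text())
```

Keeping the rejected entry next to the approved one is what makes the later forensic reconstruction possible; an overwrite would leave only the final verdict.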
Last Updated on May 4, 2026 by Editorial Team Author(s): Rishav Saigal Originally published on Towards AI. Figure 1 — From a plain-English prompt to a fully tracked MLflow experiment, autonomously. TL;DR Wraps PyCaret’s AutoML engine in a Google ADK agent hierarchy One natural language prompt → plan → code → execution → MLflow tracking Self-corrects up to 10 times on failure; isolates artifacts per session Covers Classification, Regression, Clustering, Anomaly Detection, Time Series If you’ve used PyCaret, you know it already cuts ML boilerplate dramatically. PyCaretAgent goes further: a Root Agent reads your intent, a Planner designs the pipeline, and an Executor writes and runs the code — all without you touching a line of Python. How It Works Three layers. The Root Agent validates your CSV and routes to the right specialist. Each specialist is a SequentialAgent: a Planner designs the pipeline and mints a session ID; an Executor writes the code, runs it, and logs everything to MLflow. Figure 2 — Root routes; each SequentialAgent runs Planner → Executor in strict order. The Smart Bits Session IDs via callback. The Planner outputs a free-text plan with a SESSION_ID: AB1X9Z token. A regex callback extracts it and drops it into shared session state — no structured output format needed. 10-retry self-correction. UnsafeLocalCodeExecutor(error_retry_attempts=10) automatically re-runs generated code on failure, letting the model diagnose and fix its own bugs. Failure short-circuit. A before_model_callback checks a check_failure_status flag and skips re-runs if the task already succeeded — no wasted API calls. Figure 3 — Every metric and param is auto-logged. Named classification_AB1X9Z for instant retrieval. The agent doesn’t just run your ML pipeline — it tracks, isolates, and self-heals through every failure. 
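The session-ID handoff described above can be illustrated with a short sketch. It assumes the ID is six uppercase letters or digits, inferred only from the `AB1X9Z` example, and it uses a plain dict as a stand-in for the agent's shared session state; the function name `extract_session_id` is hypothetical and this is not the Google ADK callback API.

```python
import re

# Assumed token format: "SESSION_ID:" followed by six uppercase letters or digits.
SESSION_ID_PATTERN = re.compile(r"SESSION_ID:\s*([A-Z0-9]{6})\b")


def extract_session_id(plan_text, state):
    """Pull the session ID out of a free-text plan and store it in shared state."""
    match = SESSION_ID_PATTERN.search(plan_text)
    if match is None:
        return None  # leave the state untouched when no token is present
    state["session_id"] = match.group(1)
    return match.group(1)


# Usage with a made-up planner output.
state = {}
plan = "Step 1: load heart.csv\nStep 2: compare models\nSESSION_ID: AB1X9Z"
found = extract_session_id(plan, state)
```

Parsing a token out of free text keeps the Planner's output format loose, at the cost of depending on the model reliably emitting the `SESSION_ID:` marker.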
Run It

    git clone https://github.com/Rishav1996/PyCaretAgent.git
    cd PyCaretAgent && uv pip install .
    uv run mlflow ui --port 5000
    uv run adk run pycaretagent

Prompt: “Classify heart.csv where the target is ‘target’.” That’s the entire interface. The agent validates the file, plans, codes, executes, and delivers a tracked experiment. Figure 4: Real-time terminal output. Session ID, retry events, and success signal are all visible in the agent’s log stream.

What’s Next This article is the first in a series. Each subsequent piece does a deep-dive into one task type, walking through a real dataset end-to-end: prompt, plan, generated code, and final MLflow results. Figure 5: Each article in the series covers one task type with a real dataset and annotated agent output.

Classification Deep-Dive (Coming Soon) Heart disease prediction with heart.csv. We trace the full agent run, from CSV validation to compare_models(), and annotate every decision the Planner makes.

Regression Deep-Dive (Coming Soon) House price prediction. How the Executor tunes via tune_model(), and why the 10-retry mechanism matters when XGBoost hits a dependency mismatch mid-run.

Clustering Deep-Dive (Coming Soon) Customer segmentation without a target column. Watch the Root Agent skip target validation entirely and route straight to the unsupervised pipeline.

Anomaly Detection Deep-Dive (Coming Soon) Fraud detection on a transactions dataset. The Planner picks Isolation Forest; we break down why, and show how anomaly scores surface as MLflow metrics.

Time Series Deep-Dive (Coming Soon) Sales forecasting with seasonality detection. The most complex setup: index parsing, horizon selection, and MASE vs. MAPE in the MLflow comparison table.

Future: Deploy Directly to Cloud The current version trains, tracks, and saves models locally. 
The next major milestone closes the loop: pushing finalized models to cloud storage and inference endpoints using PyCaret’s built-in deploy_model(), triggered directly by the agent with no manual steps. The target UX is a single extra sentence in the user prompt: “Classify heart.csv, target=‘target’, deploy to AWS.” The Root Agent will parse the platform and pass it as a session state variable, and the Executor will append a deploy_model() call after finalize_model(), with credentials injected from environment variables. A dedicated article in this series will cover the full credential handoff pattern and multi-cloud configuration. PyCaretAgent is a clean, reusable template for any agent-wrapped AutoML system. The Planner/Executor pattern, state handoff via callbacks, and retry-based self-correction all generalize well beyond PyCaret. GitHub link: https://github.com/Rishav1996/PyCaretAgent Published via Towards AI
Last Updated on May 4, 2026 by Editorial Team Author(s): Muhammad Hassan Ali Originally published on Towards AI. I Ran This Open-Source AI Tool on a Messy Codebase and Got 71x Fewer Tokens: Here Is Exactly What Happened I have spent months watching developers copy-paste entire files into Claude, burn through context windows, and still get vague answers. Screenshot of Graphify Github Repo by Author. The article discusses the capabilities and benefits of Graphify, an open-source AI coding assistant that transforms files into a queryable knowledge graph. It highlights how Graphify significantly reduces token usage during queries compared to traditional methods and explains its three-pass extraction process, which optimizes the handling of various file types, including code and documents. The author emphasizes the ease of installation and the privacy-focused design of the tool, making it suitable for developers looking to enhance their coding workflows. Read the full blog for free on Medium. Published via Towards AI
Last Updated on May 4, 2026 by Editorial Team Author(s): Ala Falaki, PhD Originally published on Towards AI. Month in 4 Papers (April 2026) This series of posts is designed to bring you the newest findings and developments in the NLP field. I’ll delve into four significant research papers each month, offering a comprehensive summary. Be sure to visit my blog regularly or subscribe to my newsletter for monthly updates. Let’s dive in! Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy [paper] The article discusses recent research in the NLP field, focusing on four significant papers: the impact of prompt politeness on language model accuracy, advancements in context engineering for self-improving models, the efficiency of LoRA in fine-tuning large models, and a novel method of semantic communication between models through Cache-to-Cache. Each paper presents findings that challenge existing assumptions, propose new methodologies for improving model performance, and highlight the importance of tone and context in interactions with AI models. Read the full blog for free on Medium. Published via Towards AI
