An end-to-end classical NLP experiment on Kaggle’s Spooky Author Identification task: from Vowpal Wabbit and TF-IDF/NB-SVM baselines to a tuned stacked ensemble, with a compact representation survey of Bag-of-Words, BM25, Word2Vec, and FastText for context.
The post How Far Can Classical NLP Go? From Bag-of-Words to Stacking on Spooky Author Identification appeared first on Towards Data Science.
Small prompt changes can silently break critical behavior in production. This article introduces a practical framework to detect hidden regressions before users notice.
The post Prompt Engineering Fails Quietly — Prompt Regression Is Why appeared first on Towards Data Science.
The tools I use for analytics and reporting have changed more than I expected, yet my questions for any analytics project haven't moved much.
The post I Completed Five Years in Analytics Consulting: 5 Lessons That Changed How I Work appeared first on Towards Data Science.
Behind a customer's API, a high-quality answer isn't enough. It has to be usable, which means on time. Delivering that consistently is a problem about variance, not speed, and the fixes are counterintuitive.
The post Tail Control: The Counterintuitive Engineering of Reliable Agentic Workflows appeared first on Towards Data Science.
A concrete bias–variance lesson: why the smallest model had the best cross-validated fit, and how to know when to reach for the big hammer.
The post I Pitted XGBoost Against Logistic Regression on 358 Matches. The Boring Model Won. appeared first on Towards Data Science.
A team cut their AI inference bill by more than half. Three months later, customer satisfaction was dropping and the cost savings were tied to the quality loss. Cost-optimization routing layers are a Pareto trap, and here's the detection methodology that catches them in days instead of months.
The post We Built a Routing Layer to Cut Our AI Costs. It Broke the Product. appeared first on Towards Data Science.
Using Gemma 4, Ollama, OpenAI Agents SDK, and Tavily MCP to build a lightweight research agent
The post From Local LLM to Tool-Using Agent appeared first on Towards Data Science.
Why memorizing for the exam doesn't mean you understand the subject
The post Water Cooler Small Talk, Ep. 11: Overfitting in RAG evaluation appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #M1] - The thesis behind every architectural choice in this series
The post Amplify the Expert: A Philosophy for Building Enterprise RAG appeared first on Towards Data Science.
I benchmarked raw chat history, vector-only RAG, and a context graph on the same multi-agent conversations. The results exposed a surprising weakness in relational retrieval.
The post Vector RAG Isn’t Enough — I Built a Context Graph Layer for Multi-Agent Memory appeared first on Towards Data Science.
A reproducible benchmark on latency, cost, and reproducibility, and where agents actually earn their keep.
The post The Hot Path Belongs to GBDTs, Agents Own the Cold Path: A Payment-Fraud Benchmark appeared first on Towards Data Science.
Whether you should stick to a classic Ordinary Least Squares regression, introduce interaction terms, or pivot to a Tweedie distribution depends entirely on how your data handles the messy reality of zeros and extreme outliers.
The post Beyond the Straight Line: Choosing Between OLS, Interaction Terms, and Tweedie Regression appeared first on Towards Data Science.
Beat the 8GB VRAM limit. Learn how to run three different LLMs on a single 8GB GPU using C++ layer multiplexing and admission control.
The post 3 Agents. 3 LLMs. 1 Aging GPU: Engineering Parallel Inference on Bare Metal appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #7C] - One LLM call ranks the candidates with reasons. The output is one typed object your auditor can defend
The post An LLM as arbiter in RAG retrieval: picking the right candidate with reasons appeared first on Towards Data Science.
A reflection on the first month of learning data engineering in public, and what actually kept me going.
The post One Month Into Learning Data Engineering in Public: Here’s What I Didn’t Write About appeared first on Towards Data Science.
Turning model coefficients into a 0–1000 score, with risk classes and stability checks
The post How to Build a Credit Scoring Grid From a Logistic Regression Model appeared first on Towards Data Science.
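As a rough illustration of the idea in this teaser, here is a minimal sketch of one common way to rescale logistic regression coefficients into a 0–1000 grid. It is not necessarily the article's method. The variable names, categories, and coefficient values are hypothetical, and the sketch assumes that each coefficient belongs to one category of a one-hot encoded variable (reference category at 0.0) and that a higher coefficient means higher default risk.

```python
from typing import Dict

Coefs = Dict[str, Dict[str, float]]


def build_scoring_grid(coefs: Coefs, scale: float = 1000.0) -> Coefs:
    """Convert per-category coefficients into points that sum to 0..scale."""
    # Spread of the linear predictor, summed over all variables.
    total_range = sum(max(m.values()) - min(m.values()) for m in coefs.values())
    if total_range == 0:
        raise ValueError("All coefficients are identical; nothing to scale.")
    grid: Coefs = {}
    for var, categories in coefs.items():
        riskiest = max(categories.values())
        # The riskiest category gets 0 points, safer categories get more.
        grid[var] = {
            cat: scale * (riskiest - c) / total_range
            for cat, c in categories.items()
        }
    return grid


def score(grid: Coefs, applicant: Dict[str, str]) -> float:
    """Sum the points of the categories the applicant falls into."""
    return sum(grid[var][cat] for var, cat in applicant.items())


# Hypothetical coefficients, for illustration only.
coefs = {
    "age_band": {"18-25": 0.8, "26-40": 0.3, "41+": 0.0},
    "housing": {"tenant": 0.5, "owner": 0.0},
}
grid = build_scoring_grid(coefs)
best = score(grid, {"age_band": "41+", "housing": "owner"})
worst = score(grid, {"age_band": "18-25", "housing": "tenant"})
```

With this scaling, the safest possible profile scores the full scale and the riskiest scores zero, so every applicant lands somewhere in between. Risk classes and stability checks would sit on top of this grid and are not shown here.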
A practical data engineering onboarding workflow for environment setup, automated testing, and AI-assisted development.
The post Your First Task as a Data Engineer in a New Company? Make the ETL Pipeline Testable appeared first on Towards Data Science.
Activation patching reveals how facts are stored, routed, and read out across transformer layers, and why the residual stream does most of the work
The post A Three-Phase Factual Recall Circuit in Gemma-2B and Gemma-12B-IT appeared first on Towards Data Science.
A practical walkthrough using text-to-SQL as the example
The post Why I Stopped Using One Agent and Built a Multi-Agent Pipeline Instead appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #7B] - Retrieval is filtering on structured tables: keywords first, TOC second, embeddings last
The post Anchor Detection for RAG: Parallel Detectors, Then One LLM Call at the End appeared first on Towards Data Science.
Learn about the concept of loops to power your coding agents.
The post How to Create Powerful Loops in Claude Code appeared first on Towards Data Science.
How Gemini solved my Pandas problem in seconds, and why data science fundamentals still matter to spot suboptimal solutions
The post I Spent an Hour on a Data Preprocessing Task Before Asking Gemini appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #7A] - Stop searching strings. Filter line_df and toc_df. Pick anchors small, expand context large
The post Retrieval Is Filtering, Not Search: A Mental Model for Enterprise RAG appeared first on Towards Data Science.
If you are a programmer and you don't feel "special" anymore, you are not alone
The post The Era of No-Code AI: What You Need to Know appeared first on Towards Data Science.
From installing Ollama to launching OpenCode with a local model, step by step.
The post Build Your Own Local AI Coding Agent with Gemma 4 and OpenCode appeared first on Towards Data Science.
Why one-hot encoding isn’t always the best approach, and alternative encodings
The post Encoding Categorical Data for Outlier Detection appeared first on Towards Data Science.
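The teaser does not name the alternative encodings it covers. As one commonly used example, here is a minimal sketch of frequency encoding, which replaces each category with its relative frequency instead of expanding it into one column per category. The function name and the sample data are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def frequency_encode(values: List[str]) -> List[float]:
    """Replace each category with the share of rows that carry it."""
    counts = Counter(values)
    n = len(values)
    # Rare categories map to small numbers, which can help flag unusual rows.
    return [counts[v] / n for v in values]


# Hypothetical column, for illustration only.
colors = ["red", "red", "red", "blue"]
encoded = frequency_encode(colors)
```

Unlike one-hot encoding, this keeps a single numeric column regardless of how many distinct categories exist, at the cost of mapping equally frequent categories to the same value.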
Learn how to apply coding agents to verify work in your browser.
The post How to Use Claude Code in Your Browser appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #6bis] - Ask one focused clarification, learn the default from the answer, stay silent next time
The post When RAG Users Ask Vague Questions: Clarify Once, Learn the Default appeared first on Towards Data Science.
The intuition behind neural networks and why they need activation functions.
The post Neural Networks, Explained for Beginners: Start Here If They’ve Confused You appeared first on Towards Data Science.
Understanding how LLMs interact with the world around them, from returning data to taking action
The post Tool Calling, Explained: How AI Agents Decide What to Do Next appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #5septies] - When a PDF prints a contents page but exposes no outline, two ways to turn it back into structure, plus the page-alignment step everyone forgets
The post Reconstructing the Table of Contents a PDF Forgot to Ship, So RAG Can Scope by Section appeared first on Towards Data Science.
For years, I created date tables with DAX code whenever I didn’t have a way to create them upstream of the data flow. Now I've realised there's another way to do it. Let’s see what the alternatives are and how they compare.
The post What Are the Possibilities to Build Date Tables in Self-Service Environments? appeared first on Towards Data Science.
What data teams need to build with AI to make self-healing data architecture a practical reality
The post 7 Crucial Barriers Between Data Teams and Self-Healing Data Architecture appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #5sexies] - image_df tells you where every picture is. Turning the few that matter into searchable text is a separate, cost-ordered job
The post Making a PDF’s Images Searchable for RAG, Without Paying to Read Them All appeared first on Towards Data Science.
Five surfaces collapsed into one declarative layer. Here's the full story of Materialized Lake Views in Microsoft Fabric - from syntax to the new GA capabilities
The post Materialized Lake Views in Microsoft Fabric: When Your Medallion Fits in a SELECT Statement appeared first on Towards Data Science.
What I thought was a scheduling problem turned out to be a portability problem first
The post I Tried to Schedule My ETL Pipeline. Here’s What I Didn’t Expect. appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #5quinquies] - Same 1974 scanned PDF, two engines. EasyOCR recovers text. Docling recovers text + sections + figures. The structural gap makes one output usable downstream and the other one a flat string.
The post Parse Scanned PDFs for RAG with EasyOCR: Free OCR Gives You Words, Not a Document appeared first on Towards Data Science.
The PCIe transfer latency is silently bottlenecking your agentic inference. Here is how building a custom device-resident vector search kernel bypasses the CPU to unlock deterministic microsecond tail latencies.
The post GPU-Resident Top-K for Agentic RAG: I Built a CUDA Kernel So My Retrieval Step Would Stop Bouncing Off the GPU appeared first on Towards Data Science.
Getting reliable, readable responses out of your LLM, and knowing which tool to reach for
The post Structured Outputs with LLMs: JSON Mode, Function Calling, and When to Use Each appeared first on Towards Data Science.
Learn about the upsides and downsides of Claude Fable 5
The post How Powerful is Claude Fable (Mythos) 5 for Coding? appeared first on Towards Data Science.
For decades, the hydrophobic core, a region in the 3D structure of proteins where hydrophobic amino acids reside together, has been considered a general property of proteins. What we have now found may extend that model. In particular, the remaining amino acids also seem to cluster together according to their chemical type (polar, acidic, basic, special), specifically in groups of ~8 units. This is what we have come to call the Mosaic Q model. Here is how we found it, along with tools for its quantification and visualization.
The post Proteins: A Mosaic Pattern to Rule Them All? appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #6c] - The decisions the parser makes on top of the user string, using the document’s profile: dispatch, activations, full schema, three approaches to deciding what fires, the audit _meta block, and a broker-corpus walkthrough
The post Dispatching the Parsed RAG Question: Chunk Strategy, Model Tier, Activations, Audit appeared first on Towards Data Science.
A hands-on guide to setting up image similarity search in Milvus, and why visual replication isn't always enough.
The post The Power and Pitfalls of Vector-Based Image Search appeared first on Towards Data Science.
How unit economics should set your classification cutoff, and why they rarely do.
The post Your Churn Threshold Is a Pricing Decision appeared first on Towards Data Science.
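To make the idea of a cost-driven cutoff concrete, here is a minimal sketch of a break-even threshold. It is a simplified model, not the article's: it assumes that every targeted customer receives a retention offer with a fixed cost, that the offer saves a would-be churner with a fixed probability, and that a saved customer is worth a fixed amount. All names and numbers are hypothetical.

```python
def break_even_threshold(
    offer_cost: float, retained_value: float, save_rate: float
) -> float:
    """Churn probability above which a retention offer pays for itself.

    Expected profit of targeting a customer with churn probability p is
    p * save_rate * retained_value - offer_cost, which is positive when
    p > offer_cost / (save_rate * retained_value).
    """
    expected_gain = retained_value * save_rate
    if expected_gain <= 0:
        raise ValueError("Retained value and save rate must be positive.")
    # A result above 1.0 means the offer never pays off under these numbers.
    return offer_cost / expected_gain


def should_target(p_churn: float, threshold: float) -> bool:
    """Target the customer only if the predicted churn risk clears the cutoff."""
    return p_churn >= threshold


# Hypothetical unit economics, for illustration only.
threshold = break_even_threshold(
    offer_cost=20.0, retained_value=200.0, save_rate=0.5
)
```

Under these assumed numbers the cutoff is 0.2 rather than the default 0.5, which shows how the threshold moves with the price of the offer and the value of the customer.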
Why a production-level AI optimization modeling agent needs reproducibility and portability, and how IR helps achieve them
The post The Secret to Reproducible and Portable Optimization: ORPilot’s Intermediate Representation (IR) appeared first on Towards Data Science.
Most LLM applications need a clear workflow, not an autonomous agent. Here's how to build one in plain Python.
The post You Probably Don’t Need an Agent Framework appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #6b] - The five field families the parser reads straight from the user’s question, with the code that fills each one
The post What the Question Parser Extracts from a User String: Keywords, Scope, Shape, Decomposition, Clarification appeared first on Towards Data Science.
Budgets for AI tokens can’t be infinite, no matter how much hyperscalers wish they were
The post Drilling Into AI’s Financial Sustainability appeared first on Towards Data Science.
Tired of your monthly API bill? Follow this tested guide to set up a high-performance local LLM on your Mac Mini without the headaches.
The post Run a Local LLM with OpenClaw on Your Mac Mini appeared first on Towards Data Science.
LLM rate limits don't just interrupt agent pipelines—they can silently corrupt structured outputs when fallback models receive incompatible payloads. I built a recovery layer that classifies failures, adapts payloads across model tiers, preserves execution state, and maintains schema integrity during provider swaps.
The post LLM Fallbacks Break Agent Pipelines — I Built the Missing Recovery Layer appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #6a] - Why a user question deserves the same parsing as the document, and how it splits into a retrieval brief and a generation brief before either runs
The post RAG Questions Need Parsing Too: Turn the User’s String Into Briefs for Retrieval and Generation appeared first on Towards Data Science.
A detailed look at MCP that turned my scattered tool definitions into a stable, discoverable server
The post The Protocol That Cleaned Up Our Agent Architecture appeared first on Towards Data Science.
A single model hands you a single answer and no sense of how much it hinges on the dozens of choices buried inside it.
The post I Built 11 Models to Predict the 2026 World Cup. They Crown Four Different Champions. appeared first on Towards Data Science.
How local optimization in last‑mile delivery can quietly break the system
The post The System Always Knows: Why Local Efficiency and System Performance Are Not the Same Problem appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #5quater] - The other parsers read the words on a page. A vision model also reads the pictures
The post Vision LLMs are PDF Parsers Too: Reading Charts and Diagrams for RAG appeared first on Towards Data Science.
A systems-level deep dive into the hidden microarchitectural costs of Kubernetes GPU time-slicing, and what it actually costs to co-locate Agentic AI workloads.
The post GPU Time-Slicing for Concurrent LLM Agents on Kubernetes appeared first on Towards Data Science.
Increasing context size in RAG systems doesn’t improve accuracy for aggregation tasks—it makes errors harder to detect. In this article, I benchmark retrieval-based pipelines against a deterministic full-scan engine across 100,000 rows and show why computation queries must be routed away from RAG entirely.
The post Larger Context Windows Don’t Fix RAG — So I Built a System That Does appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #5ter] - Table cells, OCR, captions, headings: cloud-grade structure, running on your own machine. No key, no per-page bill, nothing leaves the building
The post Parse PDFs for RAG Locally with Docling: Rich Tables, No Cloud Upload appeared first on Towards Data Science.
Let's practice data science thinking through a probability problem
The post Solving the 3Blue1Brown String Probability Problem (Without AI) appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #5bis] - The same relational tables. Native table cells. OCR for scanned pages and images. Captions and headings without regex.
The post When PyMuPDF Can’t See the Table: Parse PDFs for RAG with Azure Layout appeared first on Towards Data Science.
For nearly a decade, this part of neural networks barely changed. DeepSeek is trying to reinvent it.
The post Why Decade-Old Residual Connections Still Power All of AI (And Why That’s a Problem) appeared first on Towards Data Science.
Claude can now write its own harness on the fly, custom-built for the task at hand.
The post A Harness for Every Task: Putting a Team of Claudes on One Job appeared first on Towards Data Science.
I tried to make my ETL pipeline production-ready. Three things broke. Each one taught me something scripting alone never could.
The post I Thought Data Engineering Was Just Writing Scripts. I Was Wrong. appeared first on Towards Data Science.
A story about a broken printer, visual inductive bias, and why the race ended in a tie.
The post Is Language Visual? An Experiment with Chinese Characters appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #5B] - One PDF in, a relational set of DataFrames out: lines, pages, TOC, images, cross-references, captions, spans, and a parsing summary
The post Stop Returning Flat Text from a PDF: The Relational Shape RAG Needs appeared first on Towards Data Science.
Take the next step to building real workflows with Spark on your laptop
The post PySpark for Beginners: Beyond the Basics appeared first on Towards Data Science.
Why “average utilization” lies about how full your GPUs really are
The post When GPU Utilization Lies: The Hidden Systems Problem Slowing Modern AI appeared first on Towards Data Science.
An in-depth performance test comparing NuCS and Choco
The post NuCS vs Choco: A Pure-Python Constraint Solver Meets a JVM Veteran appeared first on Towards Data Science.
A structured methodology for comparing candidate models, testing stability, and selecting a robust final score
The post How to Train a Scoring Model in the Age of Artificial Intelligence appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #5A] - Document signals (metadata, native TOC, source software) and page-level content (text vs scans, tables, images, columns, page profile)
The post Beyond extract_text: The Two Layers of a PDF That Drive RAG Quality appeared first on Towards Data Science.
An intuitive introduction to reasoning with uncertainty, from directed Bayesian networks to undirected Markov networks and weighted logical rules.
The post Bayesian Networks and Markov Networks: An Intuitive Guide to Structured Uncertainty appeared first on Towards Data Science.
A quick guide to separating Physical AI from world models, embodied AI, physics AI, and digital twins
The post Physical AI: What It Is and What It Is Not appeared first on Towards Data Science.
Enterprise Document Intelligence [Vol.1 #4bis] - A coauthor note on the brick-by-brick pitfalls that justified the four-brick split, before Part II walks the fixes
The post 10 Common RAG Mistakes We Keep Seeing in Production appeared first on Towards Data Science.
Stop re-computing the same context. Learn how to build a C++ runtime with copy-on-fork KV snapshots to eliminate redundant LLM prefills in multi-agent pipelines.
The post Prefill Once, Fan Out: KV Snapshot Sharing for Multi-Agent LLM Pipelines appeared first on Towards Data Science.
Follow this framework to build a project that will impress hiring managers
The post The Exact ML Project I’d Build to Get Hired in 2026 appeared first on Towards Data Science.
This is how LLMs are used today to increase precision in recommendation systems
The post Increase Recommendation Systems’ Precision with LLMs, Using Python appeared first on Towards Data Science.
Quantum Machine Learning promises powerful new ways of processing information, but quantum states are extraordinarily fragile. In this article, we explore why quantum information is so difficult to protect, how noise and decoherence introduce errors, and the fundamental ideas behind Quantum Error Correction: the technology that may make large-scale quantum machine learning possible.
The post How to Keep Quantum Information Alive for Machine Learning appeared first on Towards Data Science.
What Fourier analysis misses
The post Sequential Fitting: A Different Perspective on the Spectral Bias of Neural Networks appeared first on Towards Data Science.
The clipping bug has lived in every 3D simulation pipeline for three decades. Here is exactly why it happens, how the math breaks, and how swapping one equation fixes it, along with the Python code to see it for yourself!
The post The Polynomial That Fixed 30 Years of Cloth Simulation appeared first on Towards Data Science.
My approach to guiding the choice between Eppo and Statsig, and the lessons learned
The post Picking an Experimentation Platform: A Retrospective appeared first on Towards Data Science.
What it costs, what it gains, and the three mistakes that I make
The post My SciPy ODE Solver Was Killing My Bayesian Inference: A Cosmologist’s Honest Account of Discovering Diffrax appeared first on Towards Data Science.
I got tired of copying files into an AI chat just to get feedback. So I built a pure Python MCP server that gives AI tools direct access to my local project—no frameworks, no dependencies. It runs over stdio for local use and switches to HTTP/SSE for concurrent clients with a single flag. The result: 5 clients, under 50ms, and a design that stays simple without sacrificing capability.
The post My AI Couldn’t See My Files — I Built a Zero-Dependency MCP Server appeared first on Towards Data Science.
How a simple choice shapes exploration, safety, and efficiency
The post The Fundamental Choice in Reinforcement Learning: On‑Policy vs. Off‑Policy appeared first on Towards Data Science.
Using DSPy to automatically create, evaluate, and optimize your prompts
The post Automate Writing Your LLM Prompts appeared first on Towards Data Science.
Python tutorial for fine-tuning Mistral Small 3.1 on an imbalanced training set to classify 15 emotions in social media communication
The post How to Fine-Tune an SLM for Emotion Recognition appeared first on Towards Data Science.