Technology

Towards Data Science

towardsdatascience.com

Publish AI, ML & data-science insights to a global community of data professionals.

Articles (100)

How Far Can Classical NLP Go? From Bag-of-Words to Stacking on Spooky Author Identification

Prompt Engineering Fails Quietly — Prompt Regression Is Why

I Completed Five Years in Analytics Consulting: 5 Lessons That Changed How I Work

How to Choose Between Small and Frontier Models

Tail Control: The Counterintuitive Engineering of Reliable Agentic Workflows

I Pitted XGBoost Against Logistic Regression on 358 Matches. The Boring Model Won.

We Built a Routing Layer to Cut Our AI Costs. It Broke the Product.

How to Build a Powerful LLM Knowledge Base

From Local LLM to Tool-Using Agent

Water Cooler Small Talk, Ep. 11: Overfitting in RAG evaluation

Amplify the Expert: A Philosophy for Building Enterprise RAG

How to Ace Data and ML Behavioural Interviews

Vector RAG Isn’t Enough — I Built a Context Graph Layer for Multi-Agent Memory

The Hot Path Belongs to GBDTs, Agents Own the Cold Path: A Payment-Fraud Benchmark

Beyond the Straight Line: Choosing Between OLS, Interaction Terms, and Tweedie Regression

3 Agents. 3 LLMs. 1 Aging GPU: Engineering Parallel Inference on Bare Metal

An LLM as arbiter in RAG retrieval: picking the right candidate with reasons

One Month Into Learning Data Engineering in Public: Here’s What I Didn’t Write About

How to Build a Credit Scoring Grid From a Logistic Regression Model

Your First Task as a Data Engineer in a New Company? Make the ETL Pipeline Testable

A Three-Phase Factual Recall Circuit in Gemma-2B and Gemma-12B-IT

Why I Stopped Using One Agent and Built a Multi-Agent Pipeline Instead

Anchor Detection for RAG: Parallel Detectors, Then One LLM Call at the End

How to Create Powerful Loops in Claude Code

I Spent an Hour on a Data Preprocessing Task Before Asking Gemini

Retrieval Is Filtering, Not Search: A Mental Model for Enterprise RAG

The Era of No-Code AI: What You Need to Know

Build Your Own Local AI Coding Agent with Gemma 4 and OpenCode

Encoding Categorical Data for Outlier Detection

How to Use Claude Code in Your Browser

When RAG Users Ask Vague Questions: Clarify Once, Learn the Default

Neural Networks, Explained for Beginners: Start Here If They’ve Confused You

Tool Calling, Explained: How AI Agents Decide What to Do Next

Reconstructing the Table of Contents a PDF Forgot to Ship, So RAG Can Scope by Section

What Are the Possibilities to Build Date Tables in Self-Service Environments?

7 Crucial Barriers Between Data Teams and Self-Healing Data Architecture

Making a PDF’s Images Searchable for RAG, Without Paying to Read Them All

Materialized Lake Views in Microsoft Fabric: When Your Medallion Fits in a SELECT Statement

Python 3.14 and its New JIT Compiler

Building a Custom GStreamer Plugin for NVIDIA DeepStream

I Tried to Schedule My ETL Pipeline. Here’s What I Didn’t Expect.

Parse Scanned PDFs for RAG with EasyOCR: Free OCR Gives You Words, Not a Document

GPU-Resident Top-K for Agentic RAG: I Built a CUDA Kernel So My Retrieval Step Would Stop Bouncing Off the GPU

Structured Outputs with LLMs: JSON Mode, Function Calling, and When to Use Each

How Powerful is Claude Fable (Mythos) 5 for Coding?

Proteins: A Mosaic Pattern to Rule Them All?

Dispatching the Parsed RAG Question: Chunk Strategy, Model Tier, Activations, Audit

The Power and Pitfalls of Vector-Based Image Search

Your Churn Threshold Is a Pricing Decision

The Secret to Reproducible and Portable Optimization: ORPilot’s Intermediate Representation (IR)

You Probably Don’t Need an Agent Framework

What the Question Parser Extracts from a User String: Keywords, Scope, Shape, Decomposition, Clarification

Drilling Into AI’s Financial Sustainability

Run a Local LLM with OpenClaw on Your Mac Mini

LLM Fallbacks Break Agent Pipelines — I Built the Missing Recovery Layer

RAG Questions Need Parsing Too: Turn the User’s String Into Briefs for Retrieval and Generation

How to Effectively Align with Claude Code

The Protocol That Cleaned Up Our Agent Architecture

I Built 11 Models to Predict the 2026 World Cup. They Crown Four Different Champions.

The System Always Knows: Why Local Efficiency and System Performance Are Not the Same Problem

4 Lines You Should Include in Your Claude Skill

Vision LLMs are PDF Parsers Too: Reading Charts and Diagrams for RAG

GPU Time-Slicing for Concurrent LLM Agents on Kubernetes

Larger Context Windows Don’t Fix RAG — So I Built a System That Does

Parse PDFs for RAG Locally with Docling: Rich Tables, No Cloud Upload

Solving the 3Blue1Brown String Probability Problem (Without AI)

When PyMuPDF Can’t See the Table: Parse PDFs for RAG with Azure Layout

Why Decade-Old Residual Connections Still Power All of AI (And Why That’s a Problem)

A Harness for Every Task: Putting a Team of Claudes on One Job

I Thought Data Engineering Was Just Writing Scripts. I Was Wrong.

Is Language Visual? An Experiment with Chinese Characters

BI Is Dead, Long Live BI

Stop Returning Flat Text from a PDF: The Relational Shape RAG Needs

PySpark for Beginners: Beyond the Basics

When GPU Utilization Lies: The Hidden Systems Problem Slowing Modern AI

NuCS vs Choco: A Pure-Python Constraint Solver Meets a JVM Veteran

How to Refactor Code with Claude Code

How to Train a Scoring Model in the Age of Artificial Intelligence

Beyond extract_text: The Two Layers of a PDF That Drive RAG Quality

Bayesian Networks and Markov Networks: An Intuitive Guide to Structured Uncertainty

Physical AI: What It Is and What It Is Not

10 Common RAG Mistakes We Keep Seeing in Production

The Hardware That Makes AI Possible

Prefill Once, Fan Out: KV Snapshot Sharing for Multi-Agent LLM Pipelines

The Exact ML Project I’d Build to Get Hired in 2026

Can Machine Learning Predict the World Cup?

Increase Recommendation Systems’ Precision with LLMs, Using Python

How to Keep Quantum Information Alive for Machine Learning

4 New Techniques to Maximize Claude Code

Sequential Fitting: A Different Perspective on the Spectral Bias of Neural Networks

The Polynomial That Fixed 30 Years of Cloth Simulation

We Should Train AI to Betray Its Users

Building a Multi-Agent System in Python

Picking an Experimentation Platform: A Retrospective

Who Will Win the 2026 Soccer World Cup?

My SciPy ODE Solver Was Killing My Bayesian Inference: A Cosmologist’s Honest Account of Discovering Diffrax

My AI Couldn’t See My Files — I Built a Zero-Dependency MCP Server

The Fundamental Choice in Reinforcement Learning: On‑Policy vs. Off‑Policy

Automate Writing Your LLM Prompts

How to Fine-Tune an SLM for Emotion Recognition