
The Modern Data Stack Is Broken — Here’s How to Fix It With AI, Governance, and Real Architecture

Tuesday, May 26, 2026 · Sunil kumar Reddy
Last Updated on May 27, 2026 by Editorial Team. Author(s): Sunil kumar Reddy. Originally published on Towards AI.

There’s a painful gap between how most data talks are given and how data actually flows inside real companies. Conference slides show clean, linear pipelines. Production reality is messier: three different teams publishing to the same Kafka topic with different schemas, a dbt model nobody owns that silently joined the wrong dimension table for six months, and an “AI-powered” feature that turns out to be a single GPT-4 API call with no retry logic, no monitoring, and no idea what happens when the context window fills up.

So let me try to sketch out what a genuinely modern data platform looks like, one that handles AI workloads, enforces governance without requiring a full-time compliance team, and can actually be deployed by a team of four rather than forty.

Why the Old Stack Breaks Under AI Workloads

The traditional warehouse-centric stack (think Redshift or BigQuery at the centre with some Airflow DAGs feeding it) was designed around a pretty simple contract. Structured data comes in, SQL queries go out, BI dashboards update overnight. That was fine.

AI changes the contract pretty fundamentally. Your LLM pipeline might need:

- Raw, unstructured text sitting in S3 alongside clean relational data
- Embeddings computed from 50-million-row tables in real time
- Feature vectors generated in milliseconds for a model serving endpoint
- Full data lineage tracked so a regulatory audit can trace exactly which training rows produced a specific prediction

None of these fit cleanly into the old paradigm. The warehouse assumes structure. The data lake has flexibility but no transactions. Neither was built to serve a vector database at 10ms latency.

The Lakehouse architecture (Delta Lake, Apache Iceberg, or Apache Hudi sitting on object storage) is the honest answer to this.
It gives you the flexibility of a lake with enough transactional integrity to trust your data. But it’s only part of the answer.

The Medallion Architecture: Not Just Bronze and Gold

You’ve probably seen the medallion diagram before. Bronze is raw, Silver is cleaned, Gold is business-ready. Simple enough. But there are a few things people tend to gloss over.

The fourth tier, what I call the Platinum or AI-native layer, is the piece most teams miss. Once your data is clean and business-ready in Gold, there’s still work left to make it AI-ready. Embeddings need to be computed and stored somewhere fast. Fine-tuning datasets need to be curated, versioned, and tracked. Feature vectors for real-time ML models need to be pre-materialised so your serving endpoint doesn’t have to compute them at query time.

One thing I’d push back on here: a lot of teams treat this fourth tier as optional, something to add later. In practice, if you don’t design for it from the start, retrofitting it into an existing Iceberg table structure is genuinely painful. You’ll end up with embedding tables that aren’t linked to their source rows in the lineage graph, which makes auditing almost impossible.

Real-Time vs. Batch: The Lambda Debate Is (Mostly) Over

For years, data engineers debated Lambda architecture vs. Kappa architecture. Lambda said: run two pipelines, one for real-time and one for batch, then merge the outputs. Kappa said: just use streaming for everything.

Both camps were partly right. Kappa’s “streaming for everything” ideal runs into the reality that batch processing is still cheaper and more reliable for large historical datasets. But Lambda’s “dual pipeline” approach creates synchronisation nightmares. The honest answer in 2026 is a hybrid in which the streaming path and the batch path write to the same tables.

The key insight in this architecture is that Iceberg acts as the unification point. Flink can write micro-batches to an Iceberg table every 30 seconds.
Spark can overwrite entire partitions of that same table in a nightly batch run. Both writes are ACID-safe. Downstream consumers, whether a BI tool or an LLM-powered agent, always see a consistent snapshot.

The latency SLOs are also worth calling out explicitly. Your real-time path has to meet latency requirements measured in seconds. Your batch path is measured in hours. Those two pipelines need different monitoring, different alerting, and often different teams owning them. I’ve seen organisations collapse those two SLOs into one on-call rotation and it ends badly.

Governance Isn’t a Feature, It’s a Foundation

Here’s where I want to spend some real time, because governance is the thing that usually gets designed last and regretted first.

Most teams treat data governance like a regulatory tax. You do the minimum required to pass the audit, you slap a data catalogue on top of whatever already exists, and you call it done. That works fine until your LLM training pipeline accidentally ingests PII that was supposed to be masked in Silver. Or until a churn model’s predictions are challenged in a lawsuit and you can’t reconstruct which version of which feature was used at inference time.

A few things here worth calling out explicitly.

Column-level lineage is not optional when you’re doing AI. Table-level lineage, “this table was derived from those three tables,” is useful. But it doesn’t tell you which specific columns fed a model. If a regulator asks why your credit-scoring model produced a particular output, you need to trace back from the model’s input features to their source columns in the raw data, including any transformations that happened along the way. OpenLineage is the open standard here, and both Airflow and dbt can emit lineage events that populate tools like Marquez or DataHub.

Data contracts are the real unlock.
The idea is simple enough: before a producer team changes a table’s schema or SLA, they have to negotiate that change with all downstream consumers. In practice this means defining a contract, typically in YAML (Avro schemas also work well here), that specifies the column names, types, nullability guarantees, and expected freshness SLA. Break the contract, break the build. Tools like Soda Core or custom Great Expectations suites can enforce this […]
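
To make “break the contract, break the build” a bit more concrete, here is a minimal sketch in plain Python. It deliberately does not use Soda Core or Great Expectations. The `orders` table, its columns, the one-hour freshness SLA, and the `check_contract` helper are all hypothetical names invented for illustration. A real contract would more likely live in a YAML file under version control than in a Python dict.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract for an illustrative "orders" table. The dict mirrors
# what a YAML contract file might contain: column names, types, nullability,
# and a freshness SLA.
CONTRACT = {
    "table": "orders",
    "freshness_sla": timedelta(hours=1),
    "columns": {
        "order_id": {"type": int, "nullable": False},
        "customer_email": {"type": str, "nullable": True},
        "amount": {"type": float, "nullable": False},
    },
}


def check_contract(rows, last_updated, contract, now=None):
    """Return a list of human-readable violations (empty if the contract holds)."""
    now = now or datetime.now(timezone.utc)
    violations = []

    # Freshness: the table must have been updated within the agreed SLA.
    if now - last_updated > contract["freshness_sla"]:
        violations.append("freshness SLA exceeded")

    expected = contract["columns"]
    for i, row in enumerate(rows):
        # Column names: report both missing and unexpected columns.
        for name in sorted(expected.keys() - row.keys()):
            violations.append(f"row {i}: missing column {name}")
        for name in sorted(row.keys() - expected.keys()):
            violations.append(f"row {i}: unexpected column {name}")

        # Types and nullability for the columns that are present.
        for name, spec in expected.items():
            if name not in row:
                continue
            value = row[name]
            if value is None:
                if not spec["nullable"]:
                    violations.append(f"row {i}: {name} is null")
            elif not isinstance(value, spec["type"]):
                violations.append(
                    f"row {i}: {name} has type {type(value).__name__}"
                )
    return violations
```

In a CI job, the producer’s pipeline could run this against a sample of the new table and fail the build whenever the returned list is non-empty, for example with `if violations: raise SystemExit("\n".join(violations))`. Passing `now` explicitly keeps the freshness check deterministic in tests.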