A regulator, an auditor, or a model-risk reviewer asks a simple question: how was this number produced? Which inputs fed this decision, and where did they come from? The question sounds like it should have a one-paragraph answer. In most organizations it takes a week of interviews, a few archaeological digs through old repositories, and a spreadsheet that three people swear is accurate and one person quietly doubts.
You do not need a regulator in the room to feel this. An engineer at a retailer is about to change how a field is computed and needs to know what depends on it. An analyst is about to build a new metric and wants to know whether one already exists. A platform lead inherits a service and cannot tell which downstream teams will notice if it changes. None of them work in a bank. They are all asking a version of the same question, and they all get the same answer: nobody is sure, go ask around.
That gap between the question and the answer is the subject of this essay. It exists for a reason that is easy to state and hard to fix. The data an enterprise runs on is produced, shaped, and moved by code. Every record, every event, every field starts somewhere, created or changed by a line of code in some service that a particular team owns. The meaning of that data, the logic that shaped it, and the path it takes to everywhere it is used are all properties of code. We manage almost none of it where the code lives. We wait until the data lands somewhere downstream, and then we try to reconstruct a story the code already contained.
I have spent the last few years working with enterprises trying to close that gap, and the question that comes first, the one that stops releases and triggers audits and burns engineering weeks, is almost always a question about provenance. Where did this come from. Where does it go. What happens if it changes. This revision puts that question at the center, because it belongs there.
Federation is why this got hard
In the early 2000s, software engineering reorganized itself around speed. Monoliths became services. Centralized teams became autonomous ones. Each team picked its own languages, owned its own datastores, and shipped on its own schedule. The payoff was velocity. The cost was that no single team, and no central group, could see the whole system anymore.
Operations felt this first. In a monolith, a small ops team could plan releases and enforce standards from one place. In a federated world, hundreds of teams shipped thousands of independent changes, and central visibility collapsed. DevOps emerged to close that gap, not by recentralizing control, but by pushing operational responsibility into the development process where the changes actually happened.
The same fragmentation hit data, and we responded worse. When engineering teams started making independent decisions about what to log, which database to use, and how to structure their events, the data team lost the one thing it depended on: a coherent view of where data came from. The response was the data lake. Send everything raw into central storage, and sort it out later. Data engineering grew up around that decision as a reactive discipline, reconstructing meaning after the fact from fragments produced by code it never saw.
This is the root of the lineage problem, and it is worth being precise about it. Lineage did not get hard because storage is complicated. It got hard because the journey now starts in code that the data team has no access to and no visibility into. Dozens of independently owned services define and reshape data before it ever reaches a place the data team can observe. By the time anyone looks, the origin is already off the map.
This pattern has happened before
Shifting left is not a new idea. It has already played out in software, and it followed a consistent shape. DevOps took deployment and infrastructure, which used to be a downstream handoff to a separate operations group, and moved them into the development workflow. The artifact it produced, almost as a side effect, was a continuous record of how software gets built and shipped: version control, CI pipelines, infrastructure as code. You stopped having to ask how something got to production. The system already knew.
The same shift happened twice more. DevSecOps moved security from a gate at the end of the process into automated checks that run on every change, and the byproduct was continuous evidence instead of a quarterly scan. Feature management moved experimentation and rollout control out of a post-launch scramble and into how features are built, and the byproduct was a live record of what shipped, to whom, under what flag. Each time, a function that lived downstream moved into the place where code is written, and a durable record fell out as a side effect. When you move a discipline to the source, the source starts generating the evidence you used to assemble by hand. For data, the discipline is provenance, and the evidence is lineage.
Lineage is the load-bearing pillar
Here is where most lineage efforts go wrong. They start in the warehouse. They map table to table and column to column inside the analytical store, produce a clean diagram, and call it lineage. That diagram captures the cleanest, most structured, and least informative part of the journey. By the time data is sitting in a warehouse table, the interesting decisions about it have already been made, somewhere else, by code nobody mapped.
It helps to be exact about the layers, because the word lineage gets stretched across all of them. There is storage lineage, which tracks how tables relate to other tables. There is transformation lineage, which parses the SQL and the transform jobs running inside the data platform and traces columns through them. Both are useful, and both are database lineage: they describe what happens to data after it has landed in the database. They share the same blind spot. They begin at the edge of the data platform and cannot see anything that happened before data arrived. The service that created the field, the function that set its value, the business rule encoded in application logic, the event that carried it across a system boundary: none of that is visible to a tool that starts reading at the warehouse.
Database lineage is not data lineage. Or more precisely, it is one layer of data lineage, and usually the layer that starts too late. The provenance of a value, the real answer to where it came from and what it means, mostly does not live in the database. It lives in the code that produced it. The database only ever sees the output, after the decisions that matter have already been made upstream. A tool that reads only the database is reading the last chapter and trying to reconstruct the plot.
The lineage that answers the hard question is the kind read from application source code, at the point where data is produced. That means parsing the actual services, in the actual languages they are written in, to extract which code wrote a field, which function transformed it, which contract it crosses on the way out, and which downstream consumers depend on the result. This is the layer where data is born, and it is the earliest layer that knows the answer to where did this come from, before the answer gets laundered through three systems and a transform job.
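As a rough illustration of what "reading lineage from source" can mean at its very simplest, the sketch below uses Python's standard `ast` module to find which functions assign to a given field. The service code, the field names, and the `find_field_writers` helper are all hypothetical, and the sketch assumes Python 3.9 or later and a producer that writes fields as `event["name"] = ...`. A real tool would have to handle many languages and far more write patterns than this.

```python
import ast

# Hypothetical producer code, inlined as a string so the sketch is self-contained.
SERVICE_SOURCE = '''
def build_order_event(order):
    event = {}
    event["order_id"] = order.id
    event["net_revenue"] = order.gross - order.refunds
    return event

def apply_discount(event, pct):
    event["net_revenue"] = event["net_revenue"] * (1 - pct)
    return event
'''


def find_field_writers(source: str, field: str) -> list[tuple[str, int]]:
    """Return (function name, line number) for every `x["field"] = ...` assignment."""
    writers = []
    tree = ast.parse(source)
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            if not isinstance(node, ast.Assign):
                continue
            for target in node.targets:
                # Match subscript targets whose key is the literal field name.
                if (
                    isinstance(target, ast.Subscript)
                    and isinstance(target.slice, ast.Constant)
                    and target.slice.value == field
                ):
                    writers.append((func.name, node.lineno))
    # Sort by line number so the result follows source order.
    return sorted(writers, key=lambda w: w[1])


for name, line in find_field_writers(SERVICE_SOURCE, "net_revenue"):
    print(f"net_revenue is written by {name} (line {line})")
```

Even this toy version surfaces something a warehouse-side tool cannot: the field is shaped in two places, and the second one changes its meaning.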
Not every value originates in a tidy line of application code. Some enters through a vendor feed, a manual correction, an external system you do not control. But even then, the first place your own organization touches and shapes that data is code you own: the ingestion logic, the validation, the service that lands it. That is still upstream of the warehouse, and it is still where the provenance signal first becomes readable. The argument is not that every origin is a Java method. The argument is that the earliest reliable record of where data came from lives in the producer code, and a lineage tool that begins at the warehouse begins too late to capture it.
Warehouse-first lineage maps where data settled. Code-level lineage maps where it came from. Only one of those answers the question an auditor actually asked.
There is a second reason to capture lineage at the code layer, and it is the one engineers feel most directly. Lineage drawn by hand, or inferred from downstream metadata, is stale the moment it is finished. Systems change every day. A diagram produced last quarter describes a system that no longer exists. When lineage is read from source code and tied to each release, it changes when the code changes. It stays true because it is derived from the same artifact that defines the system's behavior in the first place.
Consider how carefully we version code and how casually we version the data it produces. Every change to a service is committed, tagged, reviewed, and traceable to an author and a moment in time. We can check out the exact code that ran on any given day. The data that code produced has no comparable record. A number sits in a table with no memory of which version of which service computed it, under which logic, with which assumptions. The two are tightly coupled. A field's meaning is a property of the code revision that wrote it, and that code changes underneath the data constantly. Yet we treat the data as if it were timeless, then act surprised when last year's figure cannot be reproduced. Code-level lineage closes that gap by pinning each output to the code version that produced it. The data inherits the version history of the code, which is the only place that history ever existed.
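One way to picture that pinning is a value that carries the code version alongside it. The sketch below is only an illustration: the `StampedValue` type, the `orders-service` name, and the `GIT_COMMIT` environment variable are assumptions made for the example, not a description of any particular tool or build system.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class StampedValue:
    """A computed value plus the identity of the code that produced it."""
    field: str
    value: float
    producer: str       # hypothetical service name
    code_version: str   # commit of the producer code that ran


def current_code_version() -> str:
    # Assumption: the build pipeline exports the commit SHA as GIT_COMMIT.
    return os.environ.get("GIT_COMMIT", "unknown")


def compute_net_revenue(gross: float, refunds: float) -> StampedValue:
    # The business rule and the version that encoded it travel together.
    return StampedValue(
        field="net_revenue",
        value=gross - refunds,
        producer="orders-service",
        code_version=current_code_version(),
    )


print(compute_net_revenue(100.0, 15.0))
```

With the version attached, "which logic computed last year's figure" becomes a checkout of one commit instead of an investigation.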

Lineage that stays current with the code is what makes three questions answerable, and these are the questions engineers actually ask, in a regulated industry or not.
The first is what breaks if I change this. A producer is about to change how a field is computed. That field feeds two reported figures, three machine learning features, and a service owned by a team two floors away. Today almost no one can see that before the change ships, because the dependency graph lives in scattered code nobody has assembled. With code-level lineage, the blast radius is visible at the pull request, while the change is still a proposal instead of an incident.
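Once the dependency graph has been assembled, the blast-radius question reduces to a graph traversal. The sketch below assumes the lineage has already been extracted into a simple adjacency map. The node names are invented for illustration and mirror the example above: two reported figures, three features, and one downstream service.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the nodes in its list.
EDGES = {
    "orders.net_revenue": [
        "report.quarterly_revenue",
        "report.gross_margin",
        "feature.customer_ltv",
        "feature.avg_order_value",
        "feature.refund_rate",
    ],
    "feature.customer_ltv": ["service.loyalty_pricing"],
}


def blast_radius(graph: dict[str, list[str]], changed: str) -> set[str]:
    """Return every node reachable downstream of `changed` (breadth-first)."""
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:  # guards against cycles and repeat visits
                seen.add(dependent)
                queue.append(dependent)
    return seen


for node in sorted(blast_radius(EDGES, "orders.net_revenue")):
    print("affected:", node)
```

Running a check like this on every pull request is what moves the discovery from incident to review.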
The second is subtler and more expensive: what will I get wrong. A field can pass every schema check and still be the wrong input for the calculation about to consume it. Revenue that includes refunds is valid data and the wrong number for a margin model that assumes it does not. The bug is not a broken type or a null value. It is a correct-looking value carrying a meaning the consumer did not expect. That kind of error is invisible to validation and obvious in lineage, because lineage shows the logic that produced the value, not just its shape. When you can trace a field back to the code that computed it, you can see what it actually means before you build on top of it.
The third is the one teams rarely even try to ask: what already exists that I can reuse. In a federated organization the same metric gets rebuilt five times because no one can find the four versions that already exist, and each rebuild drifts a little from the others until there are five definitions of revenue and no agreement on which is right. A map from code to outputs is also a map of what has already been built. Before an analyst defines a new metric or an engineer writes another transformation, lineage can answer whether the thing already exists and where. Discovery and deduplication are the quiet payoff of provenance, and over time they matter as much as the dramatic save of catching a breaking change.
Everything else stands on the map
Once lineage is read from the source and kept current, the rest of the data stack stops being a pile of separate tools and becomes a set of capabilities that all refer back to the same map. The order matters. The map comes first, because everything else needs to know where things are before it can act on them.
Ownership resolves through the code. The reason ownership of data is so often unclear is that data does not explain who is responsible for it. Code does. Code lives in a repository, the repository has owners, and a change to it goes through a review. When provenance is anchored to code, the question of who owns a field has an answer: the team that owns the service that produces it.
Governance and policy become enforceable. A policy about sensitive data is only as good as your ability to know which fields carry it and where they flow. Tag a field as sensitive at the point it is produced, and lineage carries that tag everywhere the field travels. Governance stops being a document describing what should be true and becomes a check against what the code actually does.
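The tag-travels-with-the-field idea can be sketched as propagation over the lineage graph followed by a policy check. Everything named here (the nodes, the `pii` tag, the list of forbidden sinks) is an assumption made for illustration. The propagation is a plain fixed-point loop, which is fine for a small graph but not tuned for a large one.

```python
def propagate_tags(
    edges: dict[str, list[str]], source_tags: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Push each tag from where it is declared to everything downstream."""
    tags = {node: set(t) for node, t in source_tags.items()}
    changed = True
    while changed:  # repeat until no node gains a new tag
        changed = False
        for src, dsts in edges.items():
            for dst in dsts:
                missing = tags.get(src, set()) - tags.get(dst, set())
                if missing:
                    tags.setdefault(dst, set()).update(missing)
                    changed = True
    return tags


def policy_violations(
    tags: dict[str, set[str]], forbidden_sinks: list[str], tag: str = "pii"
) -> list[str]:
    """Return the sinks that must not receive `tag` but do."""
    return sorted(n for n in forbidden_sinks if tag in tags.get(n, set()))


# Hypothetical lineage and a single tag declared at the producer.
EDGES = {
    "users.email": ["crm.contact"],
    "crm.contact": ["export.partner_feed"],
    "users.signup_date": ["analytics.cohorts"],
}
TAGS = propagate_tags(EDGES, {"users.email": {"pii"}})

print(policy_violations(TAGS, ["export.partner_feed", "analytics.cohorts"]))
```

The policy is declared once, at the source, and the check is computed from what the graph says actually happens.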
Contracts enforce the boundaries that matter. A data contract is an agreement about the shape and meaning of data at a boundary between a producer and its consumers, and at the edges that carry real risk it is exactly the right control. A contract is an enforcement point, not a map. It governs one boundary you have chosen to protect, which means you first have to know where your boundaries are and which ones carry the figures, decisions, and models that matter. Lineage draws that map. Contracts hold the lines on it that are worth holding. The map tells you where to put them.
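To make the distinction concrete, here is a minimal sketch of a contract as an enforcement point at one boundary. The `FieldSpec` type, the contract contents, and the `check_contract` helper are hypothetical. Note that the check can verify shape, while the stated meaning is documentation that only lineage can confirm against the producing code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    type_: type
    meaning: str  # agreed semantics, recorded next to the shape


# Hypothetical contract for one producer/consumer boundary.
ORDER_EVENT_CONTRACT = {
    "order_id": FieldSpec(str, "unique identifier of the order"),
    "net_revenue": FieldSpec(float, "gross amount minus refunds"),
}


def check_contract(record: dict, contract: dict[str, FieldSpec]) -> list[str]:
    """Return a list of violations; an empty list means the record conforms."""
    problems = []
    for name, spec in contract.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], spec.type_):
            problems.append(
                f"wrong type for {name}: expected {spec.type_.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    return problems


print(check_contract({"order_id": "A1", "net_revenue": 85.0}, ORDER_EVENT_CONTRACT))
print(check_contract({"order_id": "A1", "net_revenue": "85"}, ORDER_EVENT_CONTRACT))
```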
Audit evidence becomes a byproduct. When provenance, policy state, and ownership are all derived from code and tied to releases, the evidence an auditor or reviewer asks for is something the system already has. It is assembled from the same record that runs production, not reconstructed by hand under deadline. The packet you used to spend three weeks building is a query against a map you already maintain.
Regulation is closing the gap for you
Everything above is true on its own terms. For most teams, though, the reason it stops being a someday project and becomes a this-quarter project is that someone with enforcement power is now asking the question and will not accept a week of interviews as the answer.
Regulators rarely use the word lineage. They use words like reproducibility, traceability, model-input documentation, and impact analysis. Those are the same requirement under different names: prove where this number came from, show what feeds this automated decision, demonstrate what you assessed before this change shipped. Risk-data aggregation rules expect a regulated figure to be reproducible on demand. Model-risk guidance expects the inputs to a model to be documented and traceable. The newer wave of AI regulation extends the same expectation to training and feature data. The direction of travel is one way only: heavier, broader, more prescriptive, and increasingly explicit that you must be able to trace a result back to its origin.
This is why regulated institutions buy first. In a bank, an insurer, or a healthcare system, the inability to answer where did this come from shows up as an examination finding with a dollar figure attached, not merely as lost engineering time. There is a gun to the head, and the trigger is a scheduled exam. But the forces creating that pressure are not staying inside regulated industries. As more decisions become automated and more of the business runs on data nobody can fully account for, the same provability requirement spreads outward on its own. Regulated industries are simply where the deadline arrived first.
At the same time, the volume of change the requirement applies to is exploding. Coding agents are now making changes across systems they do not fully understand. A single agent working inside one repository can be reasoned about. An agent making changes that ripple across services, data stores, and team boundaries cannot, because the context it would need to be safe is exactly the context that lives outside the file it is editing. The rate at which producer code changes is climbing, and every change is a change to the provenance of whatever that code produces. Reconstructing that picture after the fact was already slow. Against AI-speed change, it is hopeless. Provenance has to be captured at the moment data is created, by reading the code that creates it, or it will not exist when an auditor, an agent, or the engineer supervising one needs to know what a change is about to affect.
What changes when you get this right
The point of all of this is not a cleaner diagram. It is a different set of answers to questions teams ask every week.
When someone asks where a number came from, the answer takes minutes instead of a week, because the path from source code to output is already mapped. When a producer proposes a change, the downstream impact is visible in the pull request, before merge, instead of discovered in an incident afterward. When an engineer needs to know what a field actually means before consuming it, the logic that produced it is one click away, so the wrong-meaning bug gets caught in review instead of in a quarterly number that looks plausible and is not. When an analyst is about to build a metric, they can find the version that already exists instead of minting a sixth definition of revenue. When a reviewer asks for evidence that a regulated figure or an automated decision can be traced to its inputs, that evidence is a byproduct of shipping rather than a fire drill. When something does go wrong, root cause moves from a multi-day investigation to a traversal of a map that was already there.
The last change is the one that makes the rest stick. When provenance is read from code and surfaced in the pull request, engineers experience it as part of their own workflow rather than as governance work done on someone else's behalf. That is the difference between a data initiative that producers tolerate and one they actually use. The work has to be valuable to the people doing it, or it does not get done.
A handful of questions are worth keeping in front of any data team:
- Who produced this field?
- What logic shaped its value, and what does the value actually mean?
- Which contract does it cross on the way out?
- What downstream models, figures, and decisions depend on it?
- What breaks if it changes?
- Does this already exist somewhere, so I can reuse it instead of rebuilding it?
Most of those answers do not live in the data. The data carries little record of its own origin. The answers live in the system that produced it, and that system is code.
Where to start
The first question I always get is where to begin. The answer is to find the person in your organization who already has to prove something and cannot. A few years ago that was usually a QA engineer who lived by validating code against expectations. Today it is just as often a privacy engineer who has to show where a sensitive field flows, a model-risk partner who has to document what fed a decision, or a platform lead who is tired of redrawing the same manual map every audit cycle. Those people already feel the gap. They are your allies.
Pick one critical data element. A figure that ends up in a regulated report, an input to a model that matters, a field that several teams quietly depend on. The point is not to trace it by hand. Tracing it by hand is the work you are trying to retire, and the diagram you would draw is stale before you finish it. The point is to put a tool on that one element and let it read the source code, recover the path back to the code that produces the value and forward to everything that consumes it, and keep that path current as the code changes. Do not start with a platform evaluation or a governance framework. Start with one element, prove the trace can be generated instead of assembled, and let the result speak for itself. The first time someone answers a where did this come from question in minutes, from a map that built itself, in front of a reviewer who expected to wait a week, the argument is over.
This is the problem we work on at Gable: lineage read from the code that produces data, not only from the tables that store it, so the record of where data came from is captured where it is created and stays true as the code changes. I believe this shift will be as consequential for data as DevOps was for software. Data has always been produced by code. It is time we managed it there.
Good luck.
-Chad





