As data grows more complex, so does big data analytics and, in turn, so do data compliance and regulation.
All of this ripples out to complicate data governance.
While data-driven decision-making actually becomes easier, maintaining competitive data advantages becomes more difficult. Access to quality data is already vital, but understanding how an organization uses that quality data is now essential.
Enter: Data lineage.
Data lineage refers to how specific data is used and transformed over time. Data professionals employ data lineage practices to record and share this data use and transformation as it occurs.
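To make the definition concrete, here is a minimal sketch of what "recording use and transformation" might look like in practice. The dataset names, the transformation names, and the tuple layout are all hypothetical, chosen only for illustration; real lineage tools store far richer metadata.

```python
from collections import defaultdict

# Hypothetical lineage log: each entry records one transformation step as
# (output dataset, transformation name, input datasets).
lineage_events = [
    ("orders_clean", "drop_null_customer_ids", ["orders_raw"]),
    ("customers_clean", "dedupe_emails", ["customers_raw"]),
    ("orders_enriched", "join_on_customer_id", ["orders_clean", "customers_clean"]),
    ("revenue_report", "aggregate_by_month", ["orders_enriched"]),
]

def build_upstream_index(events):
    """Map each dataset to the datasets it was directly derived from."""
    index = defaultdict(set)
    for output, _transformation, inputs in events:
        index[output].update(inputs)
    return index

def trace_upstream(dataset, index):
    """Return every dataset that contributed, directly or indirectly, to `dataset`."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in index.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

upstream = trace_upstream("revenue_report", build_upstream_index(lineage_events))
print(sorted(upstream))
```

Tracing upstream from a report like this answers the core lineage question: which sources, and which intermediate steps, shaped the number someone is about to act on.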
Time out: How is data lineage different from data provenance and data governance?
One of the challenges of data management is that you can’t break out aspects of the data lifecycle and affix each to specific periods of time—past, present, or future.
This is why we sometimes find data, especially in large, complex systems, to be a perceptual challenge. We can’t cleanly break it down into easily categorized chunks. For this reason, it can seem like all data (e.g., data flows, data assets, and data environments) can, at times, be proverbially everything, everywhere, all at once.
However, with data’s growing role in our lives, it’s important everyone works to understand these fundamental aspects of data management—as our collective ability to ensure data security, compliance, and data-driven decision-making relies on it.
Fortunately, we can use our new book on data contracts (now available in early release from our friends at O’Reilly Media) as an illustrated example of how provenance, governance, and lineage all relate.
While the above hopefully parses the different roles that provenance, governance, and lineage play in effective data management, our hope is that it also illustrates their interdependencies—how each of the three functionalities improves the other two when orchestrated together.
Now that we’ve simplified things, let’s go ahead and complicate them again (just a bit).
In data management, there are many different kinds of data lineage. There need to be—because different stakeholders and departments within an organization can use the same data in very different ways. And knowing everything about all data all the time isn’t necessary (despite what that one data analyst two cubicles over would have you believe).
Depending on these varying user requirements and perspectives, some aspects, dimensions, and attributes will make more sense to map in one instance, and less (or not at all) in another. Other factors that determine what should or should not be mapped include data complexity, regulatory compliance, data governance, and strategy, among others. This leads us to nine common types of data lineage in use across organizations.
Settling on the right best practices for data lineage is a lot like settling on the best title for a book on data contracts; both are exercises in limitations and precision. (And another good option always pops to mind the moment you think you’re “done.”)
That said, we think the following nine best practices, as a whole, accomplish two things:
Establishing clear objectives before embarking on any initiative, data-related or otherwise, is crucial. In this instance, clarity helps data leaders ensure that all tools, policies, and procedures that make up an organization’s data lineage practices will be efficient, sustainable, and aligned with ongoing business needs.
It’s worth considering how data contracts help you get the most out of this process, as the value of ongoing and tangible evidence of how data lineage practices positively impact an organization over time can’t be overstated.
Plan to automate data lineage practices as much as possible. Gaining access to accurate and consistent information is the point, after all. Automation, especially automated data discovery, can play a crucial role here, as it reduces the risks of errors inherent in manual processes.
At the same time, automation promotes scalability—ensuring that data lineage practices (especially those related to metadata capture and management) remain functionally efficient over time.
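As one illustration of automated capture, the sketch below wraps a pipeline step so that a lineage entry is written every time the step runs, with no manual bookkeeping. The `track_lineage` decorator, the `LINEAGE_LOG` list, and the dataset names are hypothetical names invented for this example, not part of any particular lineage tool.

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory store; a real system would persist these entries.
LINEAGE_LOG = []

def track_lineage(inputs, output):
    """Decorator that records a lineage entry each time the wrapped step runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Only log after the step succeeds, so failed runs leave no entry.
            LINEAGE_LOG.append({
                "step": func.__name__,
                "inputs": list(inputs),
                "output": output,
                "recorded_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator

@track_lineage(inputs=["orders_raw"], output="orders_clean")
def drop_incomplete_orders(rows):
    """Example transformation: keep only rows that have a customer_id."""
    return [row for row in rows if row.get("customer_id") is not None]

clean = drop_incomplete_orders([{"customer_id": 1}, {"customer_id": None}])
print(LINEAGE_LOG[-1]["step"], "->", LINEAGE_LOG[-1]["output"])
```

Because the record is produced by the same code path that performs the transformation, the lineage cannot quietly drift out of sync with the pipeline the way a hand-maintained spreadsheet can.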
For most of us tasked with implementing data lineage practices, chances are good an established data environment will already be in operation. Audit existing data management tools based on established objectives, ensuring they’ll contribute to maximizing the utility of data lineage information. You can then determine if you’ll need to invest in a dedicated data lineage tool or if some combination of existing tools and systems will provide the needed functionality.
While reviewing integration capabilities, you might request demonstrations and trials, assess levels of support from potential vendors, and conduct cost-benefit analyses as needed.
Maintaining detailed and accurate records of data lineage should be the rule, not the exception. This documentation becomes crucial for understanding data flows and ensuring data quality throughout the lifecycle.
Depending on the size of organizations, data teams may also need to be vigilant regarding whether this documentation remains standardized over time. This contributes to robust data governance, guiding consistent understanding and use across departments and teams—reducing confusion while fostering solid communication.
Security has always been critical for protecting sensitive lineage information. But it’s increasingly critical to position data security as the responsibility of everyone, not just those in IT. To that end, implement secure communications and help your co-workers understand, at the most basic level, why encryption and secure APIs are being used.
Establish a consistent cadence for patching and updating systems and tools, and maintain detailed access logs. Implement monitoring tools to automatically detect unusual access patterns and ensure the right alerts get sent to the right people at the right times. And, as more of the organization leans in to keep data secure, robust access controls become increasingly important. Consider applying security principles such as the principle of least privilege as you define and refine which roles and responsibilities get specific data access.
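The combination of least privilege, access logging, and alerting on unusual patterns can be sketched in a few lines. Everything here is an assumption made for illustration: the role names, the permission strings, the in-memory `ACCESS_LOG`, and the choice of "three or more denied attempts" as the signal worth flagging.

```python
from collections import Counter

# Hypothetical role-to-permission mapping: each role gets only what it needs.
ROLE_PERMISSIONS = {
    "analyst": {"read_lineage"},
    "data_engineer": {"read_lineage", "write_lineage"},
    "admin": {"read_lineage", "write_lineage", "manage_access"},
}

# Every attempt is logged, whether it was allowed or not.
ACCESS_LOG = []

def request_access(user, role, action):
    """Allow an action only if the role explicitly grants it, and log the attempt."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    ACCESS_LOG.append({"user": user, "role": role, "action": action, "allowed": allowed})
    return allowed

def users_with_repeated_denials(log, threshold=3):
    """Flag users whose denied attempts meet or exceed the threshold."""
    denials = Counter(entry["user"] for entry in log if not entry["allowed"])
    return sorted(user for user, count in denials.items() if count >= threshold)

request_access("dana", "data_engineer", "write_lineage")   # allowed
for _ in range(3):
    request_access("sam", "analyst", "write_lineage")      # denied each time

print(users_with_repeated_denials(ACCESS_LOG))
```

The default-deny lookup (`ROLE_PERMISSIONS.get(role, set())`) is the least-privilege part: an unknown role or an unlisted action is refused rather than waved through.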
At this point, you’re more or less ready to approach stakeholders for their support. Make sure you do, and that explicit buy-in is what you actually walk away with.
Depending on your stakeholders, it may help to engage with them early and often. Make sure you clearly outline and identify the benefits the organization’s data lineage practices will have. Demonstrate how the lineage practices align with the business and set realistic expectations, the latter of which can often be aided through a strategic series of pilot programs.
You probably visualized parts of your lineage proposal to help sell it. Carry the visualization forward, representing data lineage in ways that make it easy for employees with different experience levels, skill sets, and backgrounds to understand and digest.
Where possible, promote training that ensures different users understand how to leverage the tool or tools used to map data lineage. (Note: To promote this training, consider pizza.)
Systematic measurement requires systematic monitoring of how effective data lineage practices are over time. In turn, effectively doing both helps data teams ensure that supporting systems are robust, responsive, and stay aligned with organization governance and management objectives.
Ideally, this monitoring and measurement isn’t limited to lineage alone but functions as part of broader data management efforts.
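One simple thing to measure is coverage: what share of known datasets have any documented lineage at all. The sketch below assumes two hypothetical inputs, an inventory of datasets and a list of datasets with lineage records; both lists and the metric itself are illustrative rather than a standard.

```python
def lineage_coverage(all_datasets, datasets_with_lineage):
    """Return the share of known datasets that have documented lineage (0.0 to 1.0)."""
    known = set(all_datasets)
    if not known:
        return 0.0
    # Ignore lineage records for datasets that are not in the inventory.
    documented = known & set(datasets_with_lineage)
    return len(documented) / len(known)

# Hypothetical inventory and lineage records.
all_datasets = ["orders_raw", "orders_clean", "customers_raw", "revenue_report"]
datasets_with_lineage = ["orders_clean", "revenue_report", "retired_table"]

coverage = lineage_coverage(all_datasets, datasets_with_lineage)
missing = sorted(set(all_datasets) - set(datasets_with_lineage))

print(f"coverage: {coverage:.0%}")
print(f"missing lineage: {missing}")
```

Tracked on a regular schedule, a number like this shows whether lineage practices are keeping pace with the data environment or falling behind it.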
The tail of the best-practices snake here consists of regular audits and reviews of the data lineage process as it unfolds.
This is vital, as it allows data teams to adjust the granularity of data lineage mapping, balancing utility across users and uses while optimizing for evolving stakeholder needs.
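Adjusting granularity often means showing the same lineage at different levels of detail for different audiences. As a rough sketch, the function below collapses column-level edges into table-level ones; the `table.column` naming convention and the example edges are assumptions made for this illustration.

```python
# Hypothetical column-level lineage edges, written as "table.column".
column_edges = [
    ("orders_raw.amount", "orders_clean.amount"),
    ("orders_raw.customer_id", "orders_clean.customer_id"),
    ("orders_clean.amount", "revenue_report.monthly_total"),
    ("customers_raw.email", "customers_clean.email"),
]

def to_table_level(edges):
    """Collapse column-level edges into unique, sorted table-level edges."""
    table_edges = set()
    for source, target in edges:
        source_table = source.split(".", 1)[0]
        target_table = target.split(".", 1)[0]
        # Skip edges that stay within a single table.
        if source_table != target_table:
            table_edges.add((source_table, target_table))
    return sorted(table_edges)

table_edges = to_table_level(column_edges)
for source, target in table_edges:
    print(f"{source} -> {target}")
```

An engineer debugging a broken field may need the column-level view, while a stakeholder reviewing a report usually only needs the table-level one.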
A data catalog can also be beneficial here. When embedded with data lineage information, these catalogs make it easier for users to get at and understand the data they need, enhancing overall data management.
The potential impact of a map directly correlates to the quality of information used to create it.
Best practices, in addition to a clear understanding of the concept, certainly make data lineage practices more efficient and effective. But shifting the emphasis and expectation of data quality further left can make them exceptional.
Be among the first to find out the role data contracts can (and increasingly do) play in the lineage of data by signing up for our product waitlist at Gable.ai.