Big data is big enough that, yes, it still warrants the occasional media story or captivating infographic. But, increasingly, our collective attention needs to focus far more on the quality—not quantity—of data.
However, good data quality doesn’t happen on its own. Data governance teams, data stewards, and similar roles must actively manage specific factors and categories to ensure data from producers consistently reaches consumers in its optimal form.
This is why any data quality initiative should arguably begin with internal alignment on what those factors and categories entail.
Like many aspects of data engineering and science, the concept of data quality is quite simple until two professionals sit down to discuss it. Professionals of varying specialties and experience will, understandably, emphasize the data quality characteristics they consider most important.
But at a very foundational level, we can define the quality of a specific dataset by evaluating five key characteristics:
Taken together, data quality characteristics allow us to quantify the quality of data flowing from producers to consumers. A categorical lens, paired with the five characteristics above, also provides valuable context for data governance. Traditionally, these categories are organized to address specific areas of concern, or dimensions.
As such, the four most common categories are as follows:
It bears repeating that these characteristics and categories are in no way mutually exclusive.
For instance, the characteristic of data "completeness" appears in both the relevancy/completeness/timeliness and accuracy/integrity categories. Similarly, the characteristic of "reliability" slots naturally into the accuracy/integrity category but also relates to availability/accessibility issues.
The point is this: whether you work with characteristics, categories, or both, use whichever framing makes the most sense for your needs and those of your organization.
The five data quality characteristics might make more sense for metrics and measurement or data profiling. In contrast, the four categories may be more useful for data governance or managing stakeholder expectations.
Characteristics and categories, leveraged in tandem, can also support a balanced and effective approach to managing data quality in diverse scenarios.
It’s also important to clarify the difference between data quality and data integrity, since the nouns quality (how good or bad something is) and integrity (the quality of being honest and having strong moral principles) are sometimes used interchangeably in everyday speech. When it comes to data, however, they are significantly different.
Here's how:
Data quality, as we've defined above, refers to the condition or state of data at a given point in time. As a point-in-time measurement, the level of quality reflects the data's value for specific, intended uses, such as business operations, planning, or decision-making.
Data integrity, however, refers to the accuracy and consistency of data across its entire lifecycle. Does quality data from a reliable source remain unchanged once it has been ingested into an organization? Has it been accidentally or maliciously tampered with or corrupted? These are questions of data integrity.
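To make the distinction concrete, here is a minimal sketch, assuming a simple file-based pipeline and nothing beyond Python's standard library, of one common integrity safeguard: recording a checksum at ingestion and verifying it later to detect any change. The file name is hypothetical.

```python
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# At ingestion: record the checksum alongside the dataset.
ingested = Path("orders_2024_01_01.csv")  # hypothetical file name
recorded_checksum = fingerprint(ingested)

# Later in the lifecycle: recompute and compare. A mismatch means the data was
# altered (accidentally or maliciously) after ingestion -- an integrity failure,
# no matter how "good" the data looked when it arrived.
if fingerprint(ingested) != recorded_checksum:
    raise RuntimeError("Integrity check failed: data changed after ingestion")
```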
Quality and integrity also differ in scope, implementation, and the concerns, including security concerns, they address. Even so, both ultimately speak to the reliability and trustworthiness of data.
Taken together, they give those contributing to data management a way of referring to the usability of their data (data quality) and to their confidence that this usability will remain consistent over time (data integrity).
As should be clear by now, data quality can have a profound effect, positive or negative, on business initiatives and operations.
Data quality management, be it more or less formally practiced, should be considered table stakes in any modern business environment, not just in situations where data-driven decisions need to be accurate and effective.
But that doesn't mean measuring the quality of data is easy to do (especially at scale). After all, the idea of "quality" can sometimes be frustratingly subjective. Fortunately, utilizing a data quality assessment is an excellent way to both quantify what quality data means in the context of one's organization and assess it using multiple dimensions of data quality.
Performing a data quality assessment involves a series of sequential steps, the basics of which we've outlined here:
Before jumping into measurement, start by outlining clear criteria based on business needs. From those needs, determine which aspects of data quality are most relevant (i.e., which characteristics, be they the five above, more, or fewer).
Determine which KPIs can best be used to measure each aspect of data quality you’ve outlined as part of your criteria. Using our five characteristics to illustrate, these KPIs might be established by the following actions:
Add continuous monitoring tools to the mix as well, as they will help you constantly check and report on how the quality of your data is—or is not—improving over time based on your specific KPIs. To best do so, we always recommend establishing monitoring as early as possible. Setting up alerts to inform you of any significant deviations is also beneficial.
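As a minimal sketch of what such monitoring might look like in practice, assuming tabular data in pandas, the following check computes two of the KPIs described above, completeness and freshness, and raises an alert when they deviate from their targets. The column names and thresholds here are hypothetical; yours would come from the criteria you defined with the business.

```python
import pandas as pd

# Hypothetical KPI thresholds agreed with stakeholders.
COMPLETENESS_TARGET = 0.98    # at least 98% of rows fully populated
FRESHNESS_TARGET_HOURS = 24   # newest record no older than 24 hours

def check_quality(df: pd.DataFrame, required_cols: list[str], ts_col: str) -> list[str]:
    """Return a list of alert messages for KPIs that miss their targets."""
    alerts = []

    # Completeness: share of rows where every required column is populated.
    completeness = df[required_cols].notna().all(axis=1).mean()
    if completeness < COMPLETENESS_TARGET:
        alerts.append(f"Completeness {completeness:.1%} is below target {COMPLETENESS_TARGET:.0%}")

    # Timeliness/freshness: age of the most recent record.
    newest = pd.to_datetime(df[ts_col], utc=True).max()
    age_hours = (pd.Timestamp.now(tz="UTC") - newest).total_seconds() / 3600
    if age_hours > FRESHNESS_TARGET_HOURS:
        alerts.append(f"Data is {age_hours:.1f}h old, exceeding the {FRESHNESS_TARGET_HOURS}h target")

    return alerts

# Example usage against a hypothetical orders table:
# alerts = check_quality(orders_df, required_cols=["order_id", "amount"], ts_col="created_at")
# for message in alerts:
#     send_alert(message)  # hook into whatever alerting channel you already use
```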
The work needed to keep something from breaking is always preferable to the work needed to fix what’s broken. This is why the time and effort required to establish feedback loops is well spent. Enabling data consumers to report anomalies or issues they encounter is valuable. But the fact that feedback loops keep data producers and consumers in lockstep, more often identifying problems before they become problematic, is invaluable.
Data profiling tools analyze data to provide a statistical summary while highlighting inconsistencies or anomalies. That summary can surface outliers, patterns, and potential quality issues. Ultimately, profiling tools prove especially useful for helping teams identify and resolve issues as quickly as possible.
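For illustration, a lightweight version of what a profiling pass produces can be sketched with pandas alone; dedicated profiling tools go much further, and the column names and threshold below are hypothetical.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Produce a per-column statistical summary that surfaces likely quality issues."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_rate": df.isna().mean(),
        "distinct_values": df.nunique(),
    })
    # Basic distribution stats for numeric columns (NaN for non-numeric ones).
    numeric = df.select_dtypes("number")
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    summary["mean"] = numeric.mean()
    return summary

# Example: flag columns whose null rate looks suspicious (threshold is illustrative).
# report = profile(orders_df)
# suspicious = report[report["null_rate"] > 0.05]
```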
Data quality scorecards can prove invaluable for clearly displaying the results of your ongoing data quality measurements against the benchmarks you set. When used correctly, these scorecards also make it easier for stakeholders to understand the current state of the organization's data quality. Ideally, data quality scores should also be tied back to specific data products.
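One simple way to express such a scorecard, assuming you already collect per-characteristic scores for each data product, is a small table comparing measured scores against their benchmarks. The products, scores, and benchmarks below are made up purely for illustration.

```python
import pandas as pd

# Hypothetical measured scores (0-1) per data product and characteristic.
scores = pd.DataFrame(
    {"completeness": [0.99, 0.91], "accuracy": [0.97, 0.95], "timeliness": [0.88, 0.99]},
    index=["orders", "customer_profiles"],
)

# Benchmarks agreed with stakeholders.
benchmarks = pd.Series({"completeness": 0.98, "accuracy": 0.95, "timeliness": 0.90})

# Mark each data product that meets every benchmark.
scorecard = scores.copy()
scorecard["meets_benchmark"] = (scores >= benchmarks).all(axis=1)
print(scorecard)
```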
Audits can be periodic, but they should occur consistently, as the deep dives they require can reveal systemic issues that tend not to appear in day-to-day checks.
Whenever possible, compare your data quality metrics with industry standards or benchmarks. Doing so provides valuable context, helping you maintain a sense of where you stand relative to industry peers and best practices.
A business impact analysis can help you measure the actual bottom-line impact data quality issues are having within your organization. In instances where poor data quality led to a flawed business decision, the cost implications of that decision should be made apparent.
Because communication is a key aspect of maintaining high-quality data, regularly review your measurement strategies to account for how data sources, tools, stakeholder expectations, and the overall business environment evolve, and keep stakeholders informed as you do.
If nothing else, when you keep stakeholders aware of the impact of data quality on business outcomes, they're more likely to support and participate in your quality improvement initiatives.
Based on the above, ensuring data quality is less a singular task than a continuous journey. And the process of managing data quality is inseparable from the tools and strategies employed along the way: data quality tools that scrutinize and enhance accuracy, feedback mechanisms that perpetually refine it, and master data management (MDM) that ensures its consistency and accuracy across the organization.
Truly, the quality of data as it moves downstream to consumers depends on its quality when it enters an organization. Put another way: if data from producers is flawed, the resulting analyses, decisions, and strategies will be, too.
This brings us to a pivotal solution: data contracts. These can be seen as the vigilant gatekeepers ensuring that only the finest, most pristine data enters your organization's repositories.
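To make the idea concrete without presuming anything about Gable's own contract format, here is a minimal, generic sketch of contract-style enforcement at the ingestion boundary: a producer's records are validated against an agreed schema before they are accepted. The schema, field names, and validation library choice (pydantic) are assumptions for illustration only.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError

# A hypothetical, agreed-upon shape for records the producer has committed to send.
class OrderEvent(BaseModel):
    order_id: str
    amount: float
    currency: str
    created_at: datetime

def ingest(record: dict) -> Optional[OrderEvent]:
    """Accept a record only if it honors the contract; otherwise reject it loudly."""
    try:
        return OrderEvent(**record)
    except ValidationError as err:
        # Rejected at the boundary: the flaw never reaches downstream consumers.
        print(f"Contract violation, record rejected: {err}")
        return None
```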
In modern organizations, there will always be an alarming number of things we simply can’t control; often, they add up to the proverbial cost of doing business. This is why the things we can control deserve all the more investment, especially when, as with data, they can make all the difference between organizational success and failure.
This is why we invite you to sign up for our Beta at Gable.ai. Explore how data contracts can transform your approach to data management, ensuring your organization’s data is of the finest quality, and ready to enrich your analytical and strategic endeavors.
Gable is currently in private Beta. Join the product waitlist to be notified when we launch.