December 15, 2023

Data Quality: Why Big Data Needs to Be Its Best Self

Written by Chad Sanderson


Big data is big enough that, yes, it still warrants the occasional media story or captivating infographic. But, increasingly, our collective attention needs to focus far more on the quality—not quantity—of data. 

However, good data quality doesn’t materialize on its own. Data governance teams, data stewards, and similar roles must actively manage specific factors and categories to ensure data from producers consistently reaches consumers in its optimal form.

This is why any data quality initiative should arguably begin with internal alignment on what those factors and categories entail.

What are the 5 characteristics of data quality?

Like many aspects of data engineering and science, the concept of data quality is quite simple until two professionals sit down to discuss it. Professionals of varying specialties and experience will, understandably, emphasize the data quality characteristics they consider most important.

But at a very foundational level, we can define the quality of a specific dataset by evaluating five key characteristics (illustrated in the code sketch after this list):

  1. Accuracy: How accurate is the data at hand? Data values should represent the real-world events or instances they exist to depict.
  2. Consistency: High-quality data contains no contradictions across a given dataset or system.
  3. Reliability: The trustworthiness of a data source, which can often be assessed through data lineage.
  4. Timeliness: Whenever required, data is both up-to-date and available.
  5. Uniqueness and completeness: Quality data is whole, with no missing values or parts, and contains no duplicate records.
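
To make these characteristics concrete, here is a minimal Python sketch showing how a few of them might translate into simple checks over a set of records. The dataset, field names, and seven-day freshness window are all hypothetical; accuracy and reliability require an external reference and lineage information, so they are illustrated in the KPI sketch later on.

    from datetime import datetime, timedelta

    # Hypothetical order records; all field names and values are illustrative.
    records = [
        {"id": 1, "amount": 100.0, "status": "shipped", "updated_at": datetime(2023, 12, 14)},
        {"id": 2, "amount": None, "status": "shipped", "updated_at": datetime(2023, 12, 1)},
        {"id": 2, "amount": 50.0, "status": "canceled", "updated_at": datetime(2023, 12, 13)},
    ]

    # Completeness: no missing values anywhere in the dataset.
    complete = all(v is not None for r in records for v in r.values())

    # Uniqueness: no duplicate primary keys.
    ids = [r["id"] for r in records]
    unique = len(ids) == len(set(ids))

    # Consistency: no contradictions, e.g., one id should map to one status.
    statuses = {}
    consistent = all(
        statuses.setdefault(r["id"], r["status"]) == r["status"] for r in records
    )

    # Timeliness: every record updated within an acceptable window (7 days here).
    now = datetime(2023, 12, 15)
    timely = all(now - r["updated_at"] <= timedelta(days=7) for r in records)

    print(complete, unique, consistent, timely)  # False False False False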

What are the 4 categories of data quality?

Taken together, data quality characteristics allow us to quantify the quality of data flowing from producers to consumers. However, a categorical lens also provides valuable context for data governance when paired with the five factors above. Traditionally, these categories are organized to address specific areas of concern or dimensions.

As such, the four most common categories are as follows:

  1. Availability and accessibility issues: Can the data be used when needed? Can it be accessed by those who need it (and kept inaccessible to those who don’t)? In addition to permissions structures and security protocols, this category includes issues like system errors and data outages.
  2. Accuracy and integrity issues: Here we focus on whether the data is reliable and correct. Can it be trusted? Addressing this category typically involves correcting inconsistencies and inaccuracies and verifying that the relationships between datasets maintain their integrity.
  3. Conformity issues: This category covers whether a dataset or system conforms to predefined business rules based on particular formats, standards, or patterns.
  4. Relevancy, completeness, and timeliness issues: This final category of data quality focuses on whether or not the data is current, complete in terms of covering all necessary aspects, and relevant to its intended use.

It bears repeating that these characteristics and categories are in no way mutually exclusive. 

For instance, the characteristic of data “completeness” appears in both the relevancy/completeness/timeliness and accuracy/integrity categories. And while the characteristic of “reliability” naturally slots into the accuracy/integrity category, it also relates to availability/accessibility issues.

The point is this: whether they’re characteristics, categories, or both, use whichever makes the most sense based on your needs and the needs of your organization.

The five data quality characteristics might make more sense for metrics and measurement or data profiling. In contrast, the four categories may be more useful for data governance or managing stakeholder expectations. 

Characteristics and categories, leveraged in tandem, can also support a balanced and effective approach to managing data quality in diverse scenarios.

Data quality vs. data integrity

It’s also important to clarify the difference between data quality and data integrity, as the nouns quality (how good or bad something is) and integrity (the quality of being honest and having strong moral principles) can be used synonymously in some situations. When it comes to data, however, they are significantly different.

Here's how:

Data quality, as we've defined above, refers to the condition or state of data at a given point in time. As a point-in-time measurement, the level of quality reflects the data's value for specific, intended uses, such as business operations, planning, or decision-making.

Data integrity, however, refers to the accuracy and consistency of data across its entire lifecycle. Does quality data from a reliable source remain unaltered once it's been ingested into an organization? Has it been accidentally or maliciously tampered with or corrupted? These are questions of data integrity.
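
One practical way to answer these questions is to fingerprint data at ingestion and re-verify that fingerprint over time. A minimal sketch in Python (the file name is hypothetical):

    import hashlib

    def fingerprint(path: str) -> str:
        """Return the SHA-256 hex digest of a file's contents."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # At ingestion: record the digest alongside the data.
    ingested_digest = fingerprint("orders_2023-12-15.csv")  # hypothetical file

    # Later in the lifecycle: recompute and compare. A mismatch signals
    # accidental or malicious alteration, i.e., an integrity failure.
    if fingerprint("orders_2023-12-15.csv") != ingested_digest:
        raise RuntimeError("Integrity check failed: data changed after ingestion")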

Other differences between quality and integrity involve scope, implementation, and the security concerns each addresses. In this sense, data quality and data integrity both speak to the reliability and trustworthiness of data.

Together, those contributing to data management have a way of referring to the usability of their data (data quality), and their ability to trust its quality will be consistent over time (data integrity).

The mission-critical impact data quality has on business

As is obvious now, data quality can have a profound effect on business initiatives and operations—both positive and negative.

Benefits of high data quality

  • More efficient operations: High-quality data streamlines processes, reduces operational errors, and can lead to cost savings. The effectiveness of data analytics, reporting, and forecasting also hinges on the availability of high-quality data.
  • Improved decision-making: Timely, accurate, reliable data ensures that data-informed business decisions are grounded in reality, leading to better strategic decisions and outcomes.
  • Better customer relations: Accurate customer data enables businesses to tailor their offers, services, and communication to consumers, increasing customer satisfaction and loyalty.
  • Compliance and mitigated risks: Reliable, high-quality data reduces the risk of penalties and legal consequences by helping businesses comply with various regulations.
  • Increased trust: Organizations that prioritize high data consistency and quality data assets are more easily seen as trustworthy and reliable, which can strengthen their brand reputation in-market.
  • Innovation fuel: Advanced analytics, AI, and machine learning processes live or die based on the reliability of data they have access to. So, too, does their ability to support innovative or completely new data-driven products and solutions.
  • Increased revenue: Ultimately, high-quality data can increase revenue as it enables more accurate pricing strategies, better campaign targeting, and the continual improvement of sales processes.

Issues stemming from low data quality

  • Operational inefficiencies: Business intelligence and other stakeholder initiatives based on outdated or inaccurate data can produce poor—if not outright detrimental—business processes and outcomes.
  • Increased costs: Low-quality, inconsistent data leads to more mistakes, data discrepancies, and handling issues. And costs due to rework and remediation can get out of hand quickly.
  • Hindered digital initiatives: Poor data can grind digital transformation efforts to a halt, especially in cases where an organization wishes to leverage the benefits of advanced data-reliant technologies like artificial intelligence (AI).
  • Compliance risks: Bad data can result in regulatory violations. And, in industries where data accuracy is paramount, organizations can face reputational damage and financial penalties.
  • Loss of trust: Regular errors, issues, and inaccuracies can erode trust in general, both with stakeholders internally and customers externally.
  • Customer dissatisfaction: Flawed customer data can lead to miscommunication, bad targeting and personalization efforts, or errors in service delivery. For increasingly digitally savvy consumers, the resulting dissatisfaction can quickly lead to loss of business.
  • Decreased revenue: A sustained lack of access to high-quality data can ultimately result in ineffective marketing campaigns, decreased sales effectiveness, and missed opportunities.

How to measure data quality

Data quality management, whether practiced formally or informally, should be considered table stakes in any modern business environment, not just in situations where data-driven decisions need to be accurate and effective.

But that doesn't mean measuring the quality of data is easy to do (especially at scale). After all, the idea of "quality" can sometimes be frustratingly subjective. Fortunately, utilizing a data quality assessment is an excellent way to both quantify what quality data means in the context of one's organization and assess it using multiple dimensions of data quality.

Performing a data quality assessment involves a series of sequential steps, the basics of which we've outlined here:

1. Define your criteria

Before jumping into measurement, start by outlining clear criteria grounded in business needs. From those needs, determine which aspects of data quality are most relevant (i.e., which characteristics, be they the five above, more, or fewer).

2. Develop key performance indicators (KPIs)

Determine which KPIs can best be used to measure each aspect of data quality you’ve outlined as part of your criteria. Using our five characteristics to illustrate, these KPIs might be established by the following actions (sketched in code after the list):

  • Comparing a sample of your data against a trusted source or standard, then using the percentage that matches to represent data accuracy.
  • Calculating the percentage of missing data in a dataset to represent data completeness.
  • Quantifying any identified discrepancies within a dataset or between datasets to represent data consistency.
  • Tracking the number of times a data source provides incorrect information or suffers outages to represent data reliability.
  • Measuring the delay between when data is generated and when it becomes available for use to represent data timeliness.
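
Here is a minimal Python sketch of how a few of these KPIs might be computed. The sample data, trusted reference, incident count, and timestamps are all hypothetical:

    # Hypothetical sample and trusted reference, keyed by record id.
    sample = {1: "alice@example.com", 2: "bob@example.com", 3: None, 4: "dan@example.com"}
    reference = {1: "alice@example.com", 2: "bob@example.org", 3: "carol@example.com", 4: "dan@example.com"}

    # Accuracy: percentage of sample values that match the trusted reference.
    matches = sum(1 for k, v in sample.items() if v == reference.get(k))
    accuracy_pct = 100 * matches / len(sample)  # 50.0

    # Completeness: percentage of values that are not missing.
    present = sum(1 for v in sample.values() if v is not None)
    completeness_pct = 100 * present / len(sample)  # 75.0

    # Consistency: count of discrepancies between datasets that should agree.
    discrepancies = sum(1 for k in sample if sample[k] != reference.get(k))  # 2

    # Reliability: incidents per period, tallied from an incident log.
    incidents_this_month = 3  # hypothetical count of bad loads and outages

    # Timeliness: delay between generation and availability, in minutes.
    generated_at, available_at = 1_702_645_200, 1_702_648_800  # Unix timestamps
    timeliness_delay_min = (available_at - generated_at) / 60  # 60.0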

3. Implement continuous monitoring

Add continuous monitoring tools to the mix as well, as they will help you constantly check and report on how the quality of your data is—or is not—improving over time based on your specific KPIs. To best do so, we always recommend establishing monitoring as early as possible. Setting up alerts to inform you of any significant deviations is also beneficial.
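
For illustration, here is a minimal sketch of such a threshold check. The KPI names, thresholds, and print-based alert are hypothetical stand-ins for a scheduled job wired to a real alerting channel:

    def check_and_alert(kpis: dict, thresholds: dict) -> None:
        """Compare the latest KPI readings against minimum thresholds."""
        for name, value in kpis.items():
            minimum = thresholds.get(name)
            if minimum is not None and value < minimum:
                # In practice, route this to a pager, chat channel, or ticket.
                print(f"ALERT: {name} = {value:.1f}, below threshold {minimum}")

    # Run on a schedule (e.g., hourly via cron or an orchestrator).
    check_and_alert(
        kpis={"accuracy_pct": 50.0, "completeness_pct": 75.0},
        thresholds={"accuracy_pct": 95.0, "completeness_pct": 99.0},
    )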

4. Establish feedback loops

Work needed to keep something from breaking is always preferable to work needed to fix what’s broken. This is why the time and effort required to establish feedback loops is well spent. Enabling data consumers to report anomalies or issues they encounter is valuable. But the fact that feedback loops keep data producers and consumers in lockstep, often identifying problems before they become problematic, is invaluable.

5. Implement data profiling tools

Data profiling tools analyze data to provide a statistical summary while highlighting inconsistencies or anomalies. The resulting information can offer insights regarding outliers, patterns, and potential quality issues, ultimately helping teams identify and resolve problems as quickly as possible.
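
For illustration, here is a minimal sketch of the kind of summary such a tool produces, using a simple three-standard-deviation rule on hypothetical values:

    from statistics import mean, stdev

    def profile(values: list[float]) -> dict:
        """Summarize a numeric column; flag values beyond 3 standard deviations."""
        mu, sigma = mean(values), stdev(values)
        return {
            "count": len(values),
            "mean": round(mu, 2),
            "stdev": round(sigma, 2),
            "min": min(values),
            "max": max(values),
            "outliers": [v for v in values if abs(v - mu) > 3 * sigma],
        }

    # Hypothetical unit prices; 500.0 is flagged as an outlier.
    prices = [9.8, 10.1, 10.0, 9.9, 10.2] * 4 + [500.0]
    print(profile(prices))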

6. Score data quality

Data quality scorecards clearly display the results of your ongoing data quality measurements against the benchmarks you set. When used correctly, these scorecards also make it easier for stakeholders to understand the current state of the organization's data quality. Ideally, data quality scores should be tied back to specific data products.
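
One simple way to produce such a score is a weighted roll-up of the characteristic-level KPIs. The scores, weights, and benchmark below are hypothetical:

    # Hypothetical per-characteristic scores (0-100) and weights for one data product.
    scores = {"accuracy": 92, "completeness": 88, "consistency": 97, "timeliness": 75, "reliability": 90}
    weights = {"accuracy": 0.3, "completeness": 0.2, "consistency": 0.2, "timeliness": 0.2, "reliability": 0.1}

    overall = sum(scores[k] * weights[k] for k in scores)  # 88.6
    benchmark = 90
    status = "PASS" if overall >= benchmark else "NEEDS ATTENTION"
    print(f"orders data product: {overall:.1f}/100 ({status})")  # NEEDS ATTENTION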

7. Audit regularly

Audits can be periodic, but they should occur consistently: the deep dives they require can reveal systemic issues that tend not to appear in day-to-day checks.

8. Benchmark externally

Whenever possible, compare your data quality metrics with industry standards or benchmarks. Doing so provides valuable context, helping you maintain a sense of where you stand relative to industry peers and best practices.

9. Conduct a business impact analysis

A business impact analysis can help you measure the actual bottom-line impact data quality issues are having within your organization. In instances where poor data quality led to a flawed business decision, the cost implications of that decision should be made apparent, as in the rough calculation below.
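
As a back-of-the-envelope illustration (every figure here is hypothetical):

    # Hypothetical inputs for one data quality incident.
    bad_records = 12_000              # records affected by the issue
    rework_minutes_per_record = 3     # manual correction time
    hourly_rate = 45.0                # fully loaded analyst cost, USD
    lost_revenue = 25_000.0           # e.g., a campaign mistargeted on flawed data

    rework_cost = bad_records * rework_minutes_per_record / 60 * hourly_rate
    total_impact = rework_cost + lost_revenue
    print(f"Estimated impact: ${total_impact:,.2f}")  # $52,000.00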

10. Review regularly (and continuously adapt)

Review your measurement strategies regularly to account for how data sources, tools, stakeholder expectations, and the overall business environment evolve, and communicate those changes; communication is a key aspect of maintaining high-quality data.

If nothing else, when you keep stakeholders aware of the impact of data quality on business outcomes, they're more likely to support and participate in your quality improvement initiatives.

The best way to improve data quality 

Based on the above, ensuring data quality is less a singular task and more a continuous journey. And the process is fundamentally intertwined with the tools and strategies employed along the way: data quality tools that scrutinize and enhance accuracy, feedback mechanisms that perpetually refine it, and master data management (MDM) that ensures consistency and accuracy across the organization.

The quality of data as it moves downstream to consumers depends on its quality when it enters an organization. More specifically, if data from producers is flawed, the resulting analyses, decisions, and strategies will be flawed, too.

This brings us to a pivotal solution: data contracts. These can be seen as the vigilant gatekeepers ensuring that only the finest, most pristine data enters your organization's repositories.

Why data contracts are quintessential for high-quality data

  • Ensuring consistency: Data contracts establish a standardized format, ensuring incoming data adheres to predetermined quality and structural benchmarks (see the sketch after this list). With clear data expectations and the assurance of data quality and accuracy, organizations can collaborate more effectively.
  • Minimizing errors: By defining the acceptable data parameters, data contracts inherently reduce the influx of erroneous data, safeguarding analytical outcomes.
  • Integration testing: Contracts enable teams to test changes upstream in a CI/CD pipeline before any breaking issues occur. This also helps bring data producers and consumers together, improving collaboration and the general understanding of how data is used.
  • Enhancing compliance: Data contracts ensure that data adheres to regulatory and organizational standards, mitigating compliance risks.
  • Optimizing data management: By ensuring that only quality data enters the system, data management processes are streamlined and optimized.
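
To make this concrete, here is a minimal sketch of a simple data contract and its enforcement in Python. The schema, field names, and business rules are hypothetical; in practice, contracts are typically declared in a format like YAML and enforced by dedicated tooling within a CI/CD pipeline:

    # A hypothetical contract for an "orders" dataset: required fields,
    # expected types, and simple business rules.
    CONTRACT = {
        "order_id": {"type": int, "required": True},
        "email": {"type": str, "required": True, "rule": lambda v: "@" in v},
        "amount": {"type": float, "required": True, "rule": lambda v: v >= 0},
    }

    def validate(record: dict) -> list[str]:
        """Return a list of contract violations for one record (empty means valid)."""
        violations = []
        for field, spec in CONTRACT.items():
            if field not in record or record[field] is None:
                if spec["required"]:
                    violations.append(f"missing required field: {field}")
                continue
            value = record[field]
            if not isinstance(value, spec["type"]):
                violations.append(f"{field}: expected {spec['type'].__name__}")
            elif "rule" in spec and not spec["rule"](value):
                violations.append(f"{field}: failed business rule")
        return violations

    # Run as a gate (e.g., in CI/CD) before data is accepted downstream.
    print(validate({"order_id": 7, "email": "alice@example.com", "amount": 42.0}))  # []
    print(validate({"order_id": "7", "email": "alice", "amount": -1.0}))
    # ['order_id: expected int', 'email: failed business rule', 'amount: failed business rule']

Running checks like these on the producer side, before data ships, is what shifts quality enforcement upstream of consumers.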

With the stakes so high, quality should be contractual

In modern organizations, there will always be an alarming number of things we simply can’t control. Often, these things add up to become the proverbial cost of doing business. This is why what we can control deserves all the more investment, especially when, as with data, it can make all the difference in organizational success or failure.

This is why we invite you to sign up for our Beta at Gable.ai. Explore how data contracts can transform your approach to data management, ensuring your organization’s data is of the finest quality, and ready to enrich your analytical and strategic endeavors.
