September 3, 2024

The Usual Suspects: 7 Common Data Quality Issues

Written by

Mark Freeman

Here’s the deal with big data: It’s undeniably ushering in a new era for modern business. Unfettered access to high-quality data assets is revolutionizing how we use data, as new use cases for machine learning and business intelligence give rise to smarter algorithms, diverse data types, and the ability for business stakeholders to make real-time decisions that drive disruptive competitive advantages.

All this data, in turn, necessitates increasingly sophisticated forms of data quality management and data governance. Because more data inevitably, poetically, begets more data quality problems.

Moreover, data teams that understand the most common issues contributing to these data quality problems are better prepared to help organizational stakeholders succeed.

That said, some data quality issues are much more common than others. Let’s take a look at some of them (and the best way to get ahead of the problems they cause).

The 7 most common data quality issues

Note: In the real world of data engineering, fixing common data quality issues is much more important than ranking them.

But this is the internet. So, we've had a little fun definitively (i.e., 100% subjectively) ranking the following rogue's gallery of data quality issues from most common to least.

1. Human error

18th-century poet Alexander Pope must have spent some time working at an IT helpdesk, because his famous “To err is human” is prescient as it relates to the hands-down most common source of data quality issues.

We jest (a tad), but human-caused errors related to the entry, processing, and handling of data have major impacts on organizational data quality.

  • Examples: Incorrectly coded data, miskeyed info during manual data entry, and mistakes made during data transformation processes.
  • Impact: In 2021, Gartner survey data showed that the average organization lost $12.9 million annually to poor data quality. Some estimate that the figure is now closer to $15 million. Roughly 1-4% of that total is attributed to human error, meaning these too-human errors may cost organizations $150,000-$600,000 (1-4% of $15 million) every single year.
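For anyone checking the math, the $150,000-$600,000 range comes from applying the 1-4% share to the roughly $15 million estimate rather than to the $12.9 million Gartner figure. A minimal sketch, with function and variable names invented here for illustration:

```python
# Annual cost of poor data quality, per the estimates quoted above (USD).
GARTNER_2021_ESTIMATE = 12_900_000
UPDATED_ESTIMATE = 15_000_000


def human_error_cost_range(total_cost, low_pct=1, high_pct=4):
    """Return the (low, high) annual cost attributed to human error."""
    return total_cost * low_pct // 100, total_cost * high_pct // 100


print(human_error_cost_range(UPDATED_ESTIMATE))       # (150000, 600000)
print(human_error_cost_range(GARTNER_2021_ESTIMATE))  # (129000, 516000)
```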

2. Duplicate data

Duplicate data occurs when the same piece of data gets entered multiple times times (<- like so). When this happens in blog writing, it drives Grammarly crazy. However, in a database, duplicate data creates duplicate records, another exceptionally common cause of data quality issues.

  • Examples: A single transaction recorded multiple times, a customer forgetting they’d signed up for a service and signing up again, or two datasets merged from different sources without being properly de-duplicated.
  • Impact: Supply-chain and inventory inefficiencies, increased storage costs, misleading data trends, inflated performance metrics.

3. Incomplete data

As opposed to too much of the same thing (i.e., duplication), incomplete data is another common data quality issue, where gaps occur in a dataset because records are missing entirely or lack one or more required fields.

  • Examples: Transaction details missing fields such as the last four digits of a credit card, incomplete product descriptions, or missing customer contact information.
  • Impact: Increased risks of fraud, difficulties in marketing personalization, ineffective data analysis, and flawed data-driven decision-making.

4. Inconsistent data

Inconsistent data occurs when data formats, entries, or standards differ within a given dataset. Often the result of combining varying data sources, these inconsistencies can arise at many points between data systems and sources as organizational data gets recorded, stored, and interpreted.

  • Examples: Mixed date formats, currency symbol variations, and inconsistent units of measurement.
  • Impact: Complicates data integration, hinders data processing, and invites potential errors in data analysis.

5. Inaccurate data

Regardless of how consistent and complete it is, inaccurate data still impacts overall data quality. In this sense, inaccurate data includes incorrect, misleading, or simply outdated information.

Faulty instruments or methodologies are often the sources of data inaccuracies. However, deliberate alteration or falsification of data can also be a contributing factor.

  • Examples: Incorrect addresses, incorrect pricing information, or outdated inventory levels.
  • Impact: Degradations in customer satisfaction, operational inefficiencies, poor decision-making, and diminished trust in data over time.

6. Ambiguous data

Like some expensive TV series currently available on popular streaming services, ambiguous data lacks clarity, precise definitions, or much-needed context—making it difficult (or impossible) to understand and interpret clearly.

Typically, ambiguous data occurs due to poor or absent standardization, vague data entries, or inadequate metadata.

  • Examples: Data fields labeled only as “value” or “score,” data entries noted to be “Pending” with no additional timing information, and acronyms used without being defined.
  • Impact: Cross-departmental misunderstandings, flawed analysis, inaccurate or shallow reporting, inconsistencies or errors within integrated systems.

7. Hidden data and dark data

Data professionals often use the terms hidden data and dark data interchangeably. And while their effect on data quality tends to be similar, they are slightly different concepts.

Hidden data refers to information overlooked or inaccessible within an existing data management system (DMS). Dark data, on the other hand, refers to data that is collected, processed, and stored within a DMS but remains there, unutilized.

  • Examples: Dormant customer data, unused log files, archived emails, historical transaction data, and sensor data from the Internet of Things (IoT).
  • Impact: Hidden data can lead to increased costs, inefficient data utilization, data siloing, and compliance issues. Comparatively, dark data results in wasted resources, missed insights, security issues, and data management complexities.

Prevention > cure: How data contracts counteract data quality issues

Many traditional methods and data quality tools are available to data teams looking to address data quality issues (common or otherwise)—options ranging from initial efforts to correctly plan and execute an organization’s data quality framework to fully automating all data governance processes.

But it’s hard to argue with the logic that the best way to address an issue is to keep it from occurring in the first place. Data contracts, drafted and enforced upstream, are the preeminent solution in this regard, evidenced by how they can eliminate or counteract every single issue on our list:

Human error: Part of the standard data contract drafting process involves teams working with stakeholders and organizational data consumers to standardize data entry procedures and validation rules. The contract might require all data entries to conform to a single, agreed-upon date format—like the ISO 8601 standard commonly adopted in hospitals, for example.

This alone can vastly reduce the likelihood of commonplace manual errors that so easily plague data quality within organizations.
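To make the date-format rule concrete, here is a minimal sketch of a validator, assuming the contract requires the ISO 8601 calendar form YYYY-MM-DD. The function name `is_contract_date` is made up for this example and is not part of any particular data contract tool:

```python
import re
from datetime import date

# Assumption: the contract mandates the extended calendar form YYYY-MM-DD.
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_contract_date(value: str) -> bool:
    """Return True only for well-formed, real calendar dates."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates like Feb 30
    except ValueError:
        return False
    return True


entries = ["2024-09-03", "09/03/2024", "2024-02-30", "3 Sept 2024"]
rejected = [e for e in entries if not is_contract_date(e)]
print(rejected)  # ['09/03/2024', '2024-02-30', '3 Sept 2024']
```

The regular expression handles the shape of the string, while `date.fromisoformat` catches values that look right but name a day that does not exist.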

Duplicate data: Data contracts can define unique identifiers and enforce deduplication processes during data integration, ensuring that users cannot create duplicate records while utilizing organizational data. Simple initiatives, such as leveraging a data contract to mandate unique customer or user IDs, can significantly contribute to cleaner, more reliable datasets.
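As a rough illustration of deduplicating on a mandated unique ID, the sketch below keeps the first record seen for each ID and sets the repeats aside. The `customer_id` field name and the sample records are assumptions made for this example:

```python
def deduplicate(records, key="customer_id"):
    """Keep the first record seen for each unique ID; collect the rest."""
    seen = {}
    duplicates = []
    for record in records:
        record_id = record[key]
        if record_id in seen:
            duplicates.append(record)
        else:
            seen[record_id] = record
    return list(seen.values()), duplicates


# Hypothetical result of merging two sources without de-duplication.
merged = [
    {"customer_id": "C-001", "email": "ada@example.com"},
    {"customer_id": "C-002", "email": "grace@example.com"},
    {"customer_id": "C-001", "email": "ada@example.com"},
]
unique, dupes = deduplicate(merged)
print(len(unique), len(dupes))  # 2 1
```

Returning the duplicates instead of silently dropping them leaves room for a review step, which matters when two records share an ID but disagree on other fields.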

Incomplete data: Contracts can also specify mandatory fields and establish protocols for handling missing data. For instance, a contract drafted for an ecommerce platform might require all subsequent transaction records to include fields deemed essential, like the transaction date, amount, and customer ID for each purchase made online.

These requirements, then, actively reduce the incidence of incomplete data working downstream, enabling more comprehensive and accurate data analysis. (Not to mention happier data analysts.)
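A mandatory-field rule like this one can be sketched in a few lines. The field names below mirror the ecommerce example above but are otherwise assumptions, as is the choice to treat `None` and empty strings as missing:

```python
# Assumed required fields for an ecommerce transaction record.
REQUIRED_FIELDS = ("transaction_date", "amount", "customer_id")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return required fields that are absent, None, or empty strings."""
    return [f for f in required if record.get(f) in (None, "")]


complete = {"transaction_date": "2024-09-03", "amount": 42.50, "customer_id": "C-001"}
incomplete = {"transaction_date": "2024-09-03", "amount": None}

print(missing_fields(complete))    # []
print(missing_fields(incomplete))  # ['amount', 'customer_id']
```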

Inconsistent data: Data contract enforcement also applies to mandating data format consistency and standards across all data sources. Again, in practice, this could be as simple as a large consortium of insurance companies enforcing a standard currency format across all of its members’ financial records.

Leveraging data contracts to guarantee all organizational data is uniformly formatted in this way facilitates better data integration and analysis.
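Here is one way a standard currency format might be enforced, sketched under some loud assumptions: the target format is a three-letter currency code followed by an amount with two decimal places, "$" always means USD, and commas are thousands separators. A real contract would need to spell out each of these choices.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Illustrative subset; assumes "$" always means US dollars.
SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}
AMOUNT = re.compile(r"([$€£]|[A-Z]{3})\s*([0-9][0-9,]*(?:\.[0-9]+)?)")


def normalize_amount(raw: str) -> str:
    """Convert strings like '$1,234.5' or 'EUR 99' to 'USD 1234.50' style."""
    match = AMOUNT.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized currency format: {raw!r}")
    currency, number = match.groups()
    code = SYMBOL_TO_CODE.get(currency, currency)
    amount = Decimal(number.replace(",", "")).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return f"{code} {amount}"


print(normalize_amount("$1,234.5"))  # USD 1234.50
print(normalize_amount("EUR 99"))    # EUR 99.00
```

Using `Decimal` rather than floats avoids binary rounding surprises in financial records, and raising on unrecognized input keeps malformed values from slipping through quietly.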

Inaccurate data: Contracts that include accuracy checks and validation mechanisms can verify the correctness of data as it is ingested from data producers and throughout the data lifecycle.

Consider logistics and supply-chain applications, where accurate inventory data is literally mission-critical. The checks and validation mechanisms enforced through data contracts help ensure the accuracy of shipments en route, billing, and ongoing compliance with a shifting menagerie of trade regulations. 

More broadly, the accuracy that data contracts help enforce improves the reliability of business intelligence and decision-making processes across industries.
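Accuracy is harder to automate than format, since a check cannot know the true state of a warehouse shelf. What a contract can do is encode plausibility rules. The sketch below assumes three such rules for an inventory record (non-negative quantity, positive price, and a count no older than 24 hours); the field names and thresholds are hypothetical:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # assumed freshness threshold


def accuracy_violations(record, now):
    """Return a list of plausibility rules that the record breaks."""
    problems = []
    if record["quantity_on_hand"] < 0:
        problems.append("quantity_on_hand is negative")
    if record["unit_price"] <= 0:
        problems.append("unit_price must be positive")
    if now - record["last_counted_at"] > MAX_AGE:
        problems.append("inventory count is stale")
    return problems


now = datetime(2024, 9, 3, 12, 0, tzinfo=timezone.utc)
record = {
    "sku": "SKU-123",
    "quantity_on_hand": -4,
    "unit_price": 19.99,
    "last_counted_at": datetime(2024, 9, 1, 8, 0, tzinfo=timezone.utc),
}
print(accuracy_violations(record, now))
# ['quantity_on_hand is negative', 'inventory count is stale']
```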

Ambiguous data: Back to the upfront benefits the drafting process provides: data contracts crystallize definitions and context for data fields unique to an organization. In practice, these definitions improve data clarity and usability within the org, ensuring that data is both easily understood and correctly interpreted.
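One lightweight way to picture these definitions is as structured metadata attached to each field. The `FieldDefinition` class and the sample fields below are invented for illustration and do not reflect any specific contract format:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """Hypothetical shape for a field's definition in a contract."""
    name: str
    description: str
    unit: str = ""
    allowed_values: tuple = ()


CONTRACT_FIELDS = [
    FieldDefinition("credit_score", "Credit risk score assigned at application time",
                    unit="points"),
    FieldDefinition("order_status", "Fulfillment state of the order",
                    allowed_values=("pending", "shipped", "delivered")),
    FieldDefinition("value", ""),  # the kind of vague field a review should catch
]


def undefined_fields(fields):
    """Return names of fields that lack a description."""
    return [f.name for f in fields if not f.description.strip()]


print(undefined_fields(CONTRACT_FIELDS))  # ['value']
```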

Hidden and dark data: Finally, we arrive at the evil twins, the dual issues of hidden and dark data. Well-drafted data contracts outline data management and usage protocols, ensuring that all data an organization collects is easily discoverable, usable, and utilized effectively.

Doing so flips the issue of big data on its head, helping data teams and stakeholders maximize the value of data assets regardless of their respective volume and complexity at any given time.
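As one possible supporting check, a team could periodically scan a dataset catalog for assets nobody has touched in a while. The sketch below assumes a simple catalog of dictionaries with a `last_accessed` date and a 90-day idle threshold; both are illustrative choices, not a prescribed protocol:

```python
from datetime import date, timedelta


def find_dark_datasets(catalog, today, max_idle_days=90):
    """Return names of datasets never accessed or idle past the threshold."""
    cutoff = today - timedelta(days=max_idle_days)
    return [
        entry["name"]
        for entry in catalog
        if entry["last_accessed"] is None or entry["last_accessed"] < cutoff
    ]


# Hypothetical catalog entries.
catalog = [
    {"name": "orders_2024", "last_accessed": date(2024, 8, 30)},
    {"name": "iot_sensor_logs", "last_accessed": date(2023, 11, 2)},
    {"name": "archived_emails", "last_accessed": None},
]
print(find_dark_datasets(catalog, today=date(2024, 9, 3)))
# ['iot_sensor_logs', 'archived_emails']
```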

It’s just common sense: Trade quality issues for quality insights, at scale

Addressing common data quality issues is imperative, as their effects on data quality can be tantamount to operational disaster. Proactively implementing a data contract can save time, energy, and professional reputations down the road.

Increasingly, data teams are fully embracing data contracts to do exactly that—providing a solid foundation for data consistency, accuracy, and usability, ensuring that all data collected is valuable and actionable within their organizations.

And if you’re interested in joining them, learn more by signing up for our own data contract product waitlist (while there’s still time!) at Gable.ai.
