Here’s the deal with big data: It’s undeniably ushering in a new era for modern business. Unfettered access to high-quality data assets is revolutionizing how we use data, as new use cases for machine learning and business intelligence give rise to smarter algorithms, diverse data types, and the ability for business stakeholders to make real-time decisions that drive disruptive competitive advantages.
All this data, in turn, necessitates increasingly sophisticated forms of data quality management and data governance. Because more data inevitably, poetically, begets more data quality problems.
Moreover, data teams that understand the most common issues contributing to these data quality problems are better prepared to help organizational stakeholders succeed.
That said, some data quality issues are much more common than others. Let’s take a look at some of them (and the best way to get ahead of the problems they cause).
Note: In the real world of data engineering, fixing common data quality issues is much more important than ranking them.
But this is the internet. So, we've had a little fun definitively (i.e., 100% subjectively) ranking the following rogue's gallery of data quality issues from most common to least.
18th-century poet Alexander Pope must have spent some time working at an IT helpdesk, because his famous “To err is human” is prescient as it relates to the hands-down most common source of data quality issues.
We jest (a tad), but human-caused errors related to the entry, processing, and handling of data have major impacts on organizational data quality.
Duplicate data occurs when the same piece of data gets entered multiple times times (<- like so). When this happens in blog writing, it drives Grammarly crazy. However, in a database, duplicate data creates duplicate records, another exceptionally common cause of data quality issues.
As opposed to too much of the same thing (i.e., duplication), incomplete data is another common data quality issue where gaps occur in a dataset due to missing data or records that are missing one or more required fields.
Inconsistent data commonly arises from discrepancies in data formats, entries, or standards within a given dataset. Often the result of varying data sources, these inconsistencies can crop up at many points between data systems and sources as flows of organizational data get recorded, stored, and interpreted.
Regardless of how consistent and complete it is, inaccurate data still impacts overall data quality. In this sense, inaccurate data includes incorrect, misleading, or simply outdated information.
Faulty instruments or methodologies are often the sources of data inaccuracies. However, deliberate alteration or falsification of data can also be a contributing factor.
Like some expensive TV series currently available on popular streaming services, ambiguous data lacks clarity, precise definitions, or much-needed context—making it difficult (or impossible) to understand and interpret clearly.
Typically, ambiguous data occurs due to poor or absent standardization, vague data entries, or inadequate metadata.
Data professionals often use the terms hidden data and dark data interchangeably. And while their effect on data quality tends to be similar, they are slightly different concepts.
Hidden data refers to information overlooked or inaccessible within an existing data management system (DMS). Dark data, on the other hand, refers to data that is collected, processed, and stored within a DMS but remains there, unutilized.
Many traditional methods and data quality tools are available to data teams looking to address data quality issues (common or otherwise)—options ranging from initial efforts to correctly plan and execute an organization’s data quality framework to fully automating all data governance processes.
But it’s hard to argue with the logic that the best way to address an issue is to keep it from occurring in the first place. Data contracts, drafted and enforced upstream, are the preeminent solution in this regard, evidenced by how they can eliminate or counteract every single issue on our list:
Human error: Part of the standard data contract drafting process involves teams working with stakeholders and organizational data consumers to standardize data entry procedures and validation rules. The contract might require all data entries to conform to a single, agreed-upon date format—like the ISO 8601 standard commonly adopted in hospitals, for example.
This alone can vastly reduce the likelihood of commonplace manual errors that so easily plague data quality within organizations.
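As a rough illustration of what such a validation rule could look like, here is a minimal Python sketch that accepts only calendar dates written as YYYY-MM-DD, one of the date forms ISO 8601 defines. The function name `is_contract_date` is hypothetical, and a real contract may well allow other ISO 8601 forms (timestamps, week dates) that this sketch rejects.

```python
import re
from datetime import date

# Assumption: the contract allows only the extended calendar form YYYY-MM-DD.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_contract_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not _ISO_DATE.fullmatch(value):
        return False
    year, month, day = (int(part) for part in value.split("-"))
    try:
        # date() raises ValueError for impossible dates such as 2023-02-29.
        date(year, month, day)
    except ValueError:
        return False
    return True
```

A check like this would sit at the point of entry, so that an entry such as "02/29/2024" is rejected before it ever reaches a downstream table.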
Duplicate data: Data contracts can define unique identifiers and enforce deduplication processes during data integration, ensuring that users cannot create duplicate records while utilizing organizational data. Simple initiatives, such as leveraging a data contract to mandate unique customer or user IDs, can significantly contribute to cleaner, more reliable datasets.
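To make the unique-ID idea concrete, here is a small Python sketch of a uniqueness check that could run during integration. The function name `find_duplicate_ids` and the `customer_id` field are illustrative assumptions, not part of any particular contract tooling.

```python
def find_duplicate_ids(records, key="customer_id"):
    """Return the key values that appear more than once, in order of repeat."""
    seen = set()
    duplicates = []
    for record in records:
        value = record[key]
        if value in seen:
            # A second (or later) sighting of the same ID violates uniqueness.
            duplicates.append(value)
        else:
            seen.add(value)
    return duplicates
```

An empty result means the batch satisfies the uniqueness rule; anything else can be rejected or routed for review before it creates duplicate records.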
Incomplete data: Contracts can also specify mandatory fields and establish protocols for handling missing data. For instance, a contract drafted for an ecommerce platform might require all subsequent transaction records to include fields deemed essential, like the transaction date, amount, and customer ID for each purchase made online.
These requirements, then, actively reduce the incidence of incomplete data flowing downstream, enabling more comprehensive and accurate data analysis. (Not to mention happier data analysts.)
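The mandatory-field rule from the ecommerce example might be sketched in Python as follows. The field names mirror the three fields mentioned above, but their exact spelling, and the choice to treat empty strings as missing, are assumptions made for illustration.

```python
# Assumption: these keys stand in for the fields the contract marks mandatory.
REQUIRED_FIELDS = ("transaction_date", "amount", "customer_id")


def missing_fields(record: dict) -> list[str]:
    """List the required fields that are absent, None, or an empty string."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # An amount of 0 is a legitimate value, so test explicitly for gaps.
        if value is None or value == "":
            missing.append(field)
    return missing
```

A record passes only when the returned list is empty, which gives the producer a specific list of gaps to fix rather than a generic failure.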
Inconsistent data: Data contract enforcement also applies to mandating data format consistency and standards across all data sources. Again, in practice, this could be as simple as a large consortium of insurance companies enforcing a standard currency format across all of its members’ financial records.
Leveraging data contracts to guarantee all organizational data is uniformly formatted in this way facilitates better data integration and analysis.
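For a sense of how a shared currency format could be enforced, here is a hedged Python sketch. It assumes, purely for illustration, that the agreed format is a three-letter uppercase currency code, a space, and an amount with exactly two decimal places (for example "USD 1234.50"). The name `parse_amount` is hypothetical.

```python
import re
from decimal import Decimal

# Assumption: the agreed format is "<CODE> <amount>" with two decimal places.
_AMOUNT = re.compile(r"([A-Z]{3}) (-?[0-9]+\.[0-9]{2})")


def parse_amount(value: str) -> tuple[str, Decimal]:
    """Split a contract-formatted amount into (currency code, Decimal amount)."""
    match = _AMOUNT.fullmatch(value)
    if match is None:
        raise ValueError(f"not in the agreed currency format: {value!r}")
    # Decimal avoids the rounding surprises of binary floats for money.
    return match.group(1), Decimal(match.group(2))
```

Values such as "$1,234.5" raise an error instead of being silently reinterpreted, which is the point: every member's records either match the shared format or get flagged.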
Inaccurate data: Contracts that include accuracy checks and validation mechanisms can verify the correctness of data as it is ingested from data producers and throughout the data lifecycle.
Consider logistics and supply-chain applications, where accurate inventory data is literally mission-critical. The checks and validation mechanisms enforced through data contracts help ensure the accuracy of shipments en route, billing, and ongoing compliance with a shifting menagerie of trade regulations.
More broadly, the accuracy that data contracts help enforce improves the reliability of business intelligence and decision-making processes across industries.
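What an accuracy check looks like depends heavily on the domain, but a simple plausibility rule for shipment records could be sketched like this in Python. The field names `quantity_ordered` and `quantity_shipped` and the rules themselves are assumptions chosen for illustration; checks like these catch impossible values, not every inaccurate one.

```python
def shipment_violations(record: dict) -> list[str]:
    """Return human-readable descriptions of plausibility rules the record breaks."""
    problems = []
    ordered = record["quantity_ordered"]
    shipped = record["quantity_shipped"]
    if ordered < 0:
        problems.append("quantity_ordered is negative")
    if shipped < 0:
        problems.append("quantity_shipped is negative")
    # Assumed rule: a shipment cannot contain more units than were ordered.
    if shipped > ordered:
        problems.append("quantity_shipped exceeds quantity_ordered")
    return problems
```

Returning every violation, rather than stopping at the first, gives the data producer a complete picture of what needs correcting in a single pass.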
Ambiguous data: Returning to the upfront benefits of the drafting process, data contracts crystallize definitions and context for data fields unique to an organization. In practice, these definitions improve data clarity and usability within the org, ensuring that data is both easily understood and correctly interpreted.
Hidden and dark data: Finally, we arrive at the evil twins, the dual issues of hidden and dark data. Well-drafted data contracts outline data management and usage protocols, ensuring that all data an organization collects is easily discoverable, usable, and utilized effectively.
Doing so flips the issue of big data on its head, helping data teams and stakeholders maximize the value of data assets regardless of their respective volume and complexity at any given time.
Addressing common data quality issues is imperative, as their effects on data quality can be tantamount to operational disaster. Proactively implementing a data contract can save time, energy, and professional reputations down the road.
Increasingly, data teams are fully embracing data contracts to do exactly that—providing a solid foundation for data consistency, accuracy, and usability, ensuring that all data collected is valuable and actionable within their organizations.
And if you’re interested in joining them, learn more by signing up for our own data contract product waitlist (while there’s still time!) at Gable.ai.