Serving as a (very real, fully accredited, we swear) 101-level collegiate course, this blog article aims to lay a solid, real-world-based foundation regarding the concept and practice of data modeling.
As such, the article will include a summary of data modeling’s historical prevalence in data engineering, its more recent decline, a definition of the concept, and different methods of use.
We’ll conclude by exploring why any attempt to discuss the benefits of one type over another consistently equates to kicking a hornet’s nest.
This foundation will serve as a gateway for newer data engineers, function as a juicy target of ridicule for the more seasoned, and will act to foster an appreciation for the role data contracts will play in data modeling’s future.
At one point in the not-too-distant past, data modeling was the cornerstone of any data management strategy. Due to the technical and business practices that were predominant at the end of the 20th century, data modeling at its zenith placed a strong emphasis on structured, well-defined models.
However, in the late 2000s, the emergence of major cloud service providers like Google Cloud Platform, Microsoft Azure, and Amazon Web Services (AWS) enabled cloud computing to gain traction within business organizations.
By the end of that same decade, the benefits of scalable, on-demand computing resources had led to a proper surge in adoption among business organizations. That surge, in turn, drove the proliferation of what is now commonly referred to as the modern data stack: a group of cloud-based tools and technologies used for the collection, storage, processing, and analysis of data.
Compared to these push-of-a-button, on-demand benefits, data modeling came to be seen by a growing number of practitioners as rigid and inflexible. Data modeling takes time. It can get complicated. The costs and overheads associated with the process reflected this. Perhaps most damaging at the time, it became easy to frame data modeling as a bottleneck: dead weight hampering the speed and flexibility of modern data management.
However, this overemphasis on speed and flexibility, paired with the underutilization of data modeling, wasn’t sustainable. Though there is no single “breaking point” to point to, by the mid-2010s a growing share of the problems organizations were facing could be traced back to data modeling’s diminution.
While far from exhaustive, the following increasingly common factors helped precipitate this recalibration in the data space:
Data modeling is the practice or process of creating a conceptual representation of data objects and the relationships between them. Data modeling is comparable to architecture, in that the process provides a blueprint for how data is stored, managed, and used within and between systems and databases.
In essence, there are three key components of data modeling: entities, attributes, and relationships.
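To make those components concrete, here’s a minimal sketch in Python. The Customer and Order entities, their attributes, and the one-to-many relationship between them are hypothetical examples, not part of any particular methodology:

```python
from dataclasses import dataclass
from datetime import date

# Entity: a thing the business cares about, described by its attributes.
@dataclass
class Customer:
    customer_id: int   # attribute (also serves as the identifier)
    name: str          # attribute
    email: str         # attribute

@dataclass
class Order:
    order_id: int
    order_date: date
    customer_id: int   # relationship: each Order belongs to one Customer

# One Customer can place many Orders -- a one-to-many relationship.
alice = Customer(customer_id=1, name="Alice", email="alice@example.com")
order = Order(order_id=100, order_date=date(2023, 6, 1), customer_id=alice.customer_id)
```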
Traditionally, the role of data modeling primarily focused on designing databases for transactional systems and normalizing data to reduce redundancy, improving database performance. The process itself mainly involved working with structured data in relational databases.
Modern data modeling is highly varied by comparison. And while its practice and process have evolved beyond some of the qualities that were viewed negatively in the past, others are now increasingly accepted as trade-offs to be balanced against.
Data modeling today caters to a wide range of data storage and processing systems, ranging from traditional relational database management systems (RDBMS) to data lakes and NoSQL databases. Data models now facilitate data integration. They can support advanced analytics, data science initiatives, and predictive modeling. Modern models emphasize agility and scalability to quickly adapt to shifting business requirements.
As such, data modeling now also supports efforts in the data space to democratize data, helping to make data more understandable and accessible to a wide range of users.
There are four main types of data models: conceptual, logical, physical, and dimensional. That, at least, is the count when the goal is to keep the categorization of data models simple.
Depending on the business needs of an organization, however, more than these initial four may be considered and utilized. We note the four-type framing simply because of the confusion this distinction can sometimes cause within the data space.
The purpose of conceptual data models is to establish a macro, business-focused view of an organization’s data structure. Conceptual models are often leveraged in the planning phase of database design or a database management system.
In these cases, a data architect or modeler may work with business stakeholders and analysts to identify relevant entities, attributes, and relationships using the Unified Modeling Language (UML) and entity-relationship diagrams (ERDs).
Logical data models work to provide a detailed view of organizational data that is independent of specific technologies and physical considerations. By doing so, logical models are free to focus on capturing business requirements and rules without being biased by technical constraints. As a result, they can provide a clearer understanding of data from a business perspective.
Because less technical stakeholders can understand logical data models more easily, these models also serve as a particularly useful tool for communication between business and technical teams.
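To give a rough sense of what a logical model captures, here’s a small illustrative sketch in Python that describes entities, attributes, logical data types, keys, and relationships as plain data, with no commitment to any particular database technology. The entity and attribute names are hypothetical:

```python
# A logical model spells out attributes, logical data types, keys, and
# relationships without committing to any specific DBMS or storage engine.
logical_model = {
    "Customer": {
        "attributes": {
            "customer_id": "integer",
            "name": "text",
            "email": "text",
        },
        "primary_key": ["customer_id"],
    },
    "Order": {
        "attributes": {
            "order_id": "integer",
            "order_date": "date",
            "customer_id": "integer",
        },
        "primary_key": ["order_id"],
        "relationships": [
            {"references": "Customer", "via": "customer_id", "cardinality": "many-to-one"},
        ],
    },
}
```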
By contrast, physical data modeling aims to capture and represent the detailed structure and design of a database, taking into account the specific features and constraints of a chosen database management system (DBMS), as well as business requirements for performance, access methods, and storage.
For this reason, database administrators and developers focus on the physical aspects of a database: indexes, keys, partitioning, stored procedures, triggers, and so on.
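For contrast with the logical view, here’s a small, hedged sketch that uses Python’s built-in sqlite3 module as a stand-in DBMS. The table, column, and index names are hypothetical; the point is simply that the physical model commits to concrete column types, keys, and indexes for a specific engine and its expected access patterns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway SQLite database

# Physical model: concrete column types, primary/foreign keys, and indexes
# chosen for this specific DBMS and its expected query patterns.
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT NOT NULL UNIQUE
    );
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        order_date  TEXT NOT NULL,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    );
    -- Index added to speed up the most common lookup: orders by customer.
    CREATE INDEX idx_order_customer ON customer_order(customer_id);
""")
```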
For business intelligence and data warehousing applications, dimensional data modeling is often used. This is because a dimensional model employs an efficient, user-friendly, flexible structure that organizes data into fact tables and dimensions to support fast querying and reporting.
Due to this, dimensional data models are particularly well suited to the complex querying, analysis, and reporting needs of these applications.
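Here’s a minimal, hypothetical star schema sketched in Python: the fact table holds numeric measures plus foreign keys into the dimension tables, which carry the descriptive attributes used to slice, filter, and group:

```python
from dataclasses import dataclass

# Dimension tables: descriptive attributes used for slicing and filtering.
@dataclass
class DateDim:
    date_key: int
    year: int
    month: int
    day: int

@dataclass
class ProductDim:
    product_key: int
    name: str
    category: str

# Fact table: numeric measures plus foreign keys into each dimension.
@dataclass
class SalesFact:
    date_key: int
    product_key: int
    units_sold: int
    revenue: float

dates = {20230601: DateDim(20230601, 2023, 6, 1)}
products = {1: ProductDim(1, "Widget", "Hardware")}
facts = [SalesFact(20230601, 1, 3, 29.97)]

# "Total revenue by product category" becomes a simple join-and-sum.
revenue_by_category = {}
for f in facts:
    cat = products[f.product_key].category
    revenue_by_category[cat] = revenue_by_category.get(cat, 0.0) + f.revenue
```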
Based on the principles of object-oriented programming, object-oriented data modeling represents data as objects instead of entities. The objects in this type of data modeling encapsulate both data and behavior. This encapsulation is key, making object-oriented models highly useful in scenarios where data structures must reflect real-world objects and their relationships.
Common examples of these scenarios include ecommerce and inventory management systems, banking and financial systems, customer relationship management (CRM) systems, and educational software.
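To show what “encapsulating both data and behavior” looks like in practice, here’s a deliberately simple, hypothetical banking example in Python. The account object holds its own state and the operations allowed on that state:

```python
class BankAccount:
    """An object bundling data (owner, balance) with behavior (deposit, withdraw)."""

    def __init__(self, owner: str, balance: float = 0.0):
        self.owner = owner
        self.balance = balance

    def deposit(self, amount: float) -> None:
        if amount <= 0:
            raise ValueError("Deposit must be positive")
        self.balance += amount

    def withdraw(self, amount: float) -> None:
        if amount > self.balance:
            raise ValueError("Insufficient funds")
        self.balance -= amount


account = BankAccount("Alice")
account.deposit(100.0)
account.withdraw(40.0)   # account.balance is now 60.0
```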
As the word “vault” implies, data vault modeling is used in data warehousing, but also in business intelligence. Both data warehousing and BI projects benefit from the historical data preservation, scalability, flexibility, and integration capabilities that data vault models provide.
In theory, this makes data vault modeling a potential tool for any organization that needs to integrate data from multiple sources while maintaining data history and lineage (e.g., healthcare organizations, government agencies, and manufacturing companies).
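While the article doesn’t go into the mechanics, data vault models are conventionally built from hubs (business keys), links (relationships between keys), and satellites (descriptive, historized attributes). The hypothetical sketch below shows how load timestamps and record sources are what preserve history and lineage:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class HubCustomer:             # hub: the stable business key
    customer_key: str
    load_ts: datetime
    record_source: str         # lineage: which source system this key came from

@dataclass
class SatCustomerDetails:      # satellite: descriptive attributes, historized
    customer_key: str
    name: str
    email: str
    load_ts: datetime          # every change lands as a new row, preserving history
    record_source: str

@dataclass
class LinkCustomerOrder:       # link: a relationship between business keys
    customer_key: str
    order_key: str
    load_ts: datetime
    record_source: str
```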
Normalized data modeling focuses on two things: reducing data redundancy and improving data integrity. This can be crucial for transactional systems where data integrity and consistency are of prime importance. Normalized models are easier to maintain and update, and they also help prevent data anomalies like inconsistencies and duplication.
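As a small, hypothetical illustration, the normalized layout below stores each customer’s details exactly once, so an update happens in one place and copies can’t drift out of sync:

```python
# Normalized: customer details live in one place; orders only reference them.
customers = {
    1: {"name": "Alice", "city": "Copenhagen"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "total": 29.97},
    {"order_id": 101, "customer_id": 1, "total": 12.50},
]

# Updating Alice's city touches a single row -- no risk of inconsistent copies.
customers[1]["city"] = "Aarhus"
```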
De-normalized data models, by contrast, involve the intentional introduction of redundancies into a dataset in order to improve performance. Through de-normalized modeling, related data can be stored in the same table or document. This reduces the need for computationally expensive join operations, which can slow down query performance.
Because of how they function, de-normalized data models also harmonize with the principles of NoSQL databases, which prioritize flexibility, scalability, and performance.
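For contrast with the normalized sketch above, here’s the same hypothetical data de-normalized into self-contained, document-style records, the shape you might store in a document-oriented NoSQL database. Reads need no join, at the cost of repeating customer details (and having to update every copy when they change):

```python
# De-normalized: each order record repeats the customer details it needs,
# so a read is a single lookup with no join.
orders = [
    {"order_id": 100, "total": 29.97,
     "customer": {"customer_id": 1, "name": "Alice", "city": "Copenhagen"}},
    {"order_id": 101, "total": 12.50,
     "customer": {"customer_id": 1, "name": "Alice", "city": "Copenhagen"}},
]

# Trade-off: updating Alice's city now means touching every order that embeds it.
for order in orders:
    if order["customer"]["customer_id"] == 1:
        order["customer"]["city"] = "Aarhus"
```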
Data scholars agree that discussions around data modeling function similarly to a hornet’s nest in nature—both tend to cause massive amounts of pain when stumbled into. While unfortunate for the stumbler, it helps to understand that, in both cases, the damage results from an attempt to defend what one holds dear.
For hornets, driven to protect the nest’s existing and developing queens, the aggression results from a combination of their innate programming, alarm pheromones, and the instinct to attack in numbers in order to intimidate and dissuade larger foes.
For data practitioners, however, aggressively defending one’s beliefs about the process and practice of data modeling is usually motivated by one or more of the following factors:
The good news is that the tension between the impact of data modeling and the convenience of the modern data stack can be navigated. Organizations hoping to strike that balance should consider employing the following:
As is now abundantly clear, treating data as a product is paramount for any organization looking to succeed in an overwhelmingly data-dependent world. Data contracts are the best way to guarantee the quality of data before it even enters an organization.
For this reason, we’re offering a transformative approach to retaining, developing, and operationalizing data contracts. Make sure to join our product waitlist to be among the first to experience the benefits of Gable.ai.
Gable is currently in private Beta. Join the product waitlist to be notified when we launch.