December 7, 2023

Data Modeling 101: General Foundations to Specific Frustrations

Written by

Chad Sanderson


Serving as a (very real, fully accredited, we swear) 101-level collegiate course, this blog article aims to lay a solid, real-world-based foundation regarding the concept and practice of data modeling. 

As such, the article will include a summary of data modeling’s historical prevalence in data engineering, its more recent dissolution, a definition of the concept, and different methods of use. 

We’ll conclude by exploring why any attempt to discuss the benefits of one type over another so often amounts to kicking a hornet’s nest.

This foundation will serve as a gateway for newer data engineers, function as a juicy target of ridicule for the more seasoned, and will act to foster an appreciation for the role data contracts will play in data modeling’s future.

Course schedule:

  1. Data modeling: An overview
  2. Data modeling defined
  3. Common types of data models
  4. Causes of controversy in the data modeling space
  5. Restoring the model of balance
  6. Suggested next steps

1. Data modeling: An overview

At one point in the not-too-distant past, data modeling was the cornerstone of any data management strategy. Due to the technical and business practices that predominated at the end of the 20th century, data modeling at its zenith placed a strong emphasis on structured, well-defined models.

However, in the late 2000s, the emergence of major cloud service providers like Google Cloud Platform, Microsoft Azure, and Amazon Web Services (AWS) enabled cloud computing to gain traction within business organizations. 

By the end of that same decade, the benefits of scalable, on-demand computing resources had led to a surge in adoption, which in turn drove the proliferation of what is now commonly referred to as the modern data stack—a group of cloud-based tools and technologies used for the collection, storage, processing, and analysis of data.

Compared to those push-of-a-button, on-demand benefits, data modeling came to be seen by a growing number of practitioners as rigid and inflexible. Data modeling takes time. It can get complicated. The costs and overhead associated with the process reflected this. Perhaps most damaging at the time, it became easy to frame data modeling as a bottleneck—dead weight hampering the speed and flexibility of modern data management.

However, this overemphasis on speed and flexibility, paired with the underutilization of data modeling, wasn’t sustainable. Though there is no specific “breaking point” to point to, by the mid-2010s a growing list of problems was increasingly being attributed to data modeling’s diminution.

While far from exhaustive, the following increasingly common factors helped precipitate this recalibration in the data space:

  • Data governance challenges: The abundance of cloud-based data storage and processing fueled an explosive increase in the data sources and repositories the average organization had access to. This sudden abundance, in turn, made maintaining data quality, security, and compliance far harder, significantly complicating the governance process.
  • Data quality issues: The fevered rate at which cloud-based solutions were adopted led to the neglect of data modeling and proper data architecture, resulting in inconsistencies, data quality issues, and difficulties with data integration.
  • Lack of standardization: While cloud environments freed teams to use a variety of tools and platforms, data management practices fragmented, making it harder to ensure consistency and interoperability across an organization.
  • Scalability and performance issues: Without proper data modeling, it became difficult to optimize systems for performance and scalability. Bottlenecks and reduced system efficiency resulted as data volumes grew.
  • Security and compliance risks: Rapid cloud adoption without adequate attention to data modeling and architecture exposed organizations to security vulnerabilities and compliance risks, especially those dealing with sensitive or regulated data.
  • Difficulties in extracting value from data: Without a well-thought-out data model, organizations struggled to extract meaningful insights from their data. Inevitably, these organizations found that simply having data in the cloud did not guarantee it was inherently usable or valuable for decision-making.

2. Data modeling defined

Data modeling is the practice or process of creating a conceptual representation of data objects and the relationships between them. Data modeling is comparable to architecture, in that the process produces a blueprint for how data is stored, managed, and used within and between systems and databases.

In essence, there are three key components of data modeling:

  1. Entities: Entities represent the real-world objects or concepts an organization wants to understand better. Examples of data modeling entities include products, employees, and customers.
  2. Attributes: These are the characteristics or properties of the entities being modeled. Attributes provide details that are used to describe and distinguish instances of an entity—product names, prices, customer names, phone numbers, etc.
  3. Relationships: The connections between entities in a data model are called relationships. They can be one-to-one, one-to-many, or many-to-many. In a typical relational data model, each entity is represented by its own table; every entity has a unique identity, but can have many instances (see the sketch below).

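To ground these three components, here is a minimal sketch in Python (the Customer, Product, and Order entities and their fields are invented for illustration): entities become classes, attributes become fields, and relationships become references between classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Customer:            # entity
    customer_id: int       # identifying attribute (a primary key in a relational database)
    name: str              # descriptive attributes
    phone: str


@dataclass
class Product:             # entity
    product_id: int
    name: str
    price: float


@dataclass
class Order:               # entity
    order_id: int
    customer: Customer                                      # one-to-many: one customer, many orders
    products: List[Product] = field(default_factory=list)   # many-to-many: an order can hold many products
```

In a relational database, each class above would map to a table, the identifying attribute to a primary key, and the relationships to foreign keys or a junction table.
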
Traditionally, the role of data modeling primarily focused on designing databases for transactional systems and normalizing data to reduce redundancy, improving database performance. The process itself mainly involved working with structured data in relational databases.

Modern data modeling is highly varied by comparison. While its practice and process have evolved beyond some of the qualities viewed negatively in the past, others are now increasingly accepted as trade-offs to be balanced.

Data modeling today caters to a wide range of data storage and processing systems, ranging from traditional relational database management systems (RDBMS) to data lakes and NoSQL databases. Data models now facilitate data integration. They can support advanced analytics, data science initiatives, and predictive modeling. Modern models emphasize agility and scalability to quickly adapt to shifting business requirements.

As such, data modeling now also supports efforts in the data space to democratize data, helping to make data more understandable and accessible to a wide range of users.

3. Common types of data models

There are four main types of data models: conceptual, logical, physical, and dimensional. That holds when the goal is to keep the categorization of data models simple.

Depending on the business needs of an organization, however, more than these initial four may be considered and utilized. We note the four-type framing simply because of the confusion this discrepancy can sometimes cause within the data space.

Conceptual data models

The purpose of conceptual data models is to establish a macro, business-focused view of an organization’s data structure. Conceptual models are often leveraged in the planning phase of database design or a database management system.

In these cases, a data architect or modeler may work with business stakeholders and analysts to identify the relevant entities, attributes, and relationships using Unified Modeling Language (UML) diagrams and entity-relationship diagrams (ERDs).

Logical data modeling

Logical data models work to provide a detailed view of organizational data that is independent of specific technologies and physical considerations. By doing so, logical models are free to focus on capturing business requirements and rules without being biased by technical constraints. As a result, they can provide a clearer understanding of data from a business perspective.

Because less technical stakeholders can more easily understand logical data models, they are also a particularly useful tool for communication between business stakeholders and technical teams.

Physical data modeling

Physical data modeling, by contrast, aims to capture and represent the detailed structure and design of a database, taking into account the specific features and constraints of the chosen database management system (DBMS), as well as business requirements for performance, access methods, and storage.

For this reason, database administrators and developers focus on the physical aspects of a database—indexes, keys, partitioning, stored procedures and triggers, and so on.

Dimensional data modeling

Dimensional data modeling is often used for business intelligence and data warehousing applications. This is because a dimensional model employs an efficient, user-friendly, flexible structure that organizes data into fact tables and dimension tables to support fast querying and reporting.

As a result, dimensional data models are well suited to the complex query, analysis, and reporting needs of these applications.

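As a rough sketch of the idea (the table and column names below are invented, and plain Python dictionaries stand in for warehouse tables), a dimensional model keeps numeric measures in a fact table and descriptive context in dimension tables, which makes aggregations straightforward:

```python
# Dimension tables: descriptive context, keyed by surrogate IDs.
dim_product = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gizmo", "category": "Gadgets"},
}
dim_date = {
    20231201: {"year": 2023, "month": 12, "day": 1},
    20231202: {"year": 2023, "month": 12, "day": 2},
}

# Fact table: one row per measurable event, holding dimension keys plus numeric measures.
fact_sales = [
    {"product_key": 1, "date_key": 20231201, "units": 3, "revenue": 30.0},
    {"product_key": 2, "date_key": 20231201, "units": 1, "revenue": 50.0},
    {"product_key": 1, "date_key": 20231202, "units": 5, "revenue": 50.0},
]

# A typical dimensional query: total revenue by product category.
revenue_by_category = {}
for row in fact_sales:
    category = dim_product[row["product_key"]]["category"]
    revenue_by_category[category] = revenue_by_category.get(category, 0.0) + row["revenue"]

print(revenue_by_category)  # {'Hardware': 80.0, 'Gadgets': 50.0}
```
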
Object-oriented data modeling

Based on the principles of object-oriented programming, object-oriented data modeling represents data as objects rather than entities. The objects in this type of data modeling encapsulate both data and behavior. This encapsulation is key, making object-oriented models highly useful in scenarios where data structures must reflect real-world objects and their relationships.

Common examples of these scenarios include ecommerce and inventory management systems, banking and financial systems, customer relationship management (CRM) systems, and educational software.

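As a brief sketch (the Account example is illustrative, not drawn from the article), an object bundles its data together with the behavior that is allowed to act on that data:

```python
class Account:
    """An object that encapsulates both data (owner, balance) and behavior (deposit, withdraw)."""

    def __init__(self, owner: str, balance: float = 0.0):
        self.owner = owner
        self.balance = balance

    def deposit(self, amount: float) -> None:
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amount

    def withdraw(self, amount: float) -> None:
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount


acct = Account("Alice")
acct.deposit(100.0)
acct.withdraw(25.0)
print(acct.balance)  # 75.0
```
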
Data vault modeling

As the word “vault” implies, data vault modeling is used in data warehousing, but also in business intelligence. Both data warehousing and BI projects benefit from the historical data preservation, scalability, flexibility, and integration capabilities that data vault models provide.

In theory, this makes data vault modeling a potential tool for any organization that needs to integrate data from multiple sources while maintaining data history and lineage (e.g., healthcare organizations, government agencies, and manufacturing companies).

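In practice, data vault models are commonly organized into hubs (stable business keys), links (relationships between hubs), and satellites (descriptive attributes recorded over time). The simplified Python sketch below (links omitted, names invented) hints at how satellites preserve history:

```python
from datetime import datetime

# Hub: the stable business key for a customer.
hub_customer = [
    {"customer_hk": "c-001", "customer_bk": "ACME-1001", "load_ts": datetime(2023, 1, 1)},
]

# Satellite: descriptive attributes, one row per change, so prior states are never lost.
sat_customer_details = [
    {"customer_hk": "c-001", "name": "Acme Corp", "city": "Austin", "load_ts": datetime(2023, 1, 1)},
    {"customer_hk": "c-001", "name": "Acme Corp", "city": "Denver", "load_ts": datetime(2023, 6, 1)},
]

# Lineage and history stay queryable: every prior state of the customer remains in the satellite.
history = sorted(
    (row for row in sat_customer_details if row["customer_hk"] == "c-001"),
    key=lambda row: row["load_ts"],
)
```
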
Normalized data modeling

This type of data modeling focuses on two things—reducing data redundancy and improving data integrity. This can be crucial for transactional systems where data integrity and consistency are of prime importance. Normalized models are easier to maintain and update, and they also help prevent data anomalies like inconsistencies and duplication.

De-normalized data modeling

Conversely, de-normalized data models involve the intentional introduction of redundancy into a dataset in order to improve performance. Through de-normalized modeling, related data can be stored in the same table or document. This reduces the need for computationally expensive join operations, which can slow down query performance.

Because of how they function, de-normalized data models also harmonize with the principles of NoSQL databases, which prioritize flexibility, scalability, and performance.

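To make the trade-off concrete (with invented customer and order data, and Python dictionaries standing in for tables), the normalized layout stores each fact exactly once and joins at read time, while the de-normalized layout duplicates customer details into every order so reads need no join:

```python
# Normalized: customer details live in one place; orders reference them by key.
customers = {101: {"name": "Alice", "city": "Austin"}}
orders_normalized = [
    {"order_id": 1, "customer_id": 101, "total": 29.99},
    {"order_id": 2, "customer_id": 101, "total": 15.00},
]

# Reading requires a "join": look up the customer for each order.
report = [
    {**order, "customer_name": customers[order["customer_id"]]["name"]}
    for order in orders_normalized
]

# De-normalized: customer details are duplicated into every order row,
# so reads are join-free but updates must touch every copy.
orders_denormalized = [
    {"order_id": 1, "customer_name": "Alice", "customer_city": "Austin", "total": 29.99},
    {"order_id": 2, "customer_name": "Alice", "customer_city": "Austin", "total": 15.00},
]
```
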
4. Causes of controversy in the data modeling space

Data scholars agree that discussions around data modeling function much like a hornet’s nest in nature—both tend to cause massive amounts of pain when stumbled into. While unfortunate for the stumbler, it helps to understand that, in both cases, the damage results from an attempt to defend what one holds dear.

For hornets, driven to protect the nest’s existing and developing queens, the aggression results from a combination of their innate programming, alarm pheromones, and the instinct to attack in numbers in order to intimidate and dissuade larger foes.

For data practitioners, however, aggressively defending one’s beliefs about the process and practice of data modeling is usually motivated by one or more of the following factors:

  • Diverse perspectives: Data modeling is a field that intersects with numerous disciplines, including data science, software engineering, database design, data analytics, and business intelligence. Though these disciplines overlap to varying degrees, each professional background acts as a frame through which views of “effective data modeling” diverge wildly across the data space.
  • Complexity and trade-offs: Additionally, data modeling tends to involve near-endless tradeoffs between competing priorities. These tradeoffs include speed vs. governance, normalization vs. performance, and structure vs. flexibility—each with passionate advocates on both sides of the aisle.
  • Organizational context: The “right” data model in one organization may not be the same in another, even when operating within the same industry. Differing business rules and goals, data requirements, schema, information systems, and data maturity all but guarantee that there will never be one true data modeling technique or process.
  • Subjectivity in design: Data modeling itself can be quite subjective. Like many design disciplines, there are often multiple ways to model a given dataset. And data modelers themselves often have legitimate reasons for championing one approach over another. This subjectivity is part of why many find the challenges of data modeling so fulfilling.
  • Evolving technologies: Despite the order and logic practitioners attempt to bring to the table, the rapid evolution of data technologies—from traditional relational databases to NoSQL, low- and no-code platforms, and big data—forces approaches to data modeling to diversify continuously.
  • Fluctuating best practices: Due to the ever-evolving modeling landscape, its related best practices invariably need to change. Techniques once considered sacrosanct can find themselves outdated, furthering debates about what the current best approach may be at any given time.
  • Emotional investment: Data practitioners tend to be curious, persistent, analytical thinkers with a high attention to detail. As such, those who practice data modeling (or cross paths with it) tend to invest a great deal of intellectual and emotional capital in their work. Occasionally, this can create an environment where critiques or suggestions for alternate approaches are either delivered as a personal attack, or taken as such.

5. Restoring the model of balance

The good news is that the tension between the impact of data modeling and the convenience of the modern data stack can be navigated. Organizations looking to strike that balance should consider employing the following:

  1. Adopt a hybrid approach: Consider using structured data modeling for core business entities that require consistency and stability above all. In areas that call for more agility and flexibility, employ modern data technologies that enable rapid iteration.
  2. Harmonize flexibility with standardization: Building on a hybrid approach, look to standardize core data elements and processes. At the same time, allow for flexibility in areas where rapid change can be expected. Embrace constant balancing and rebalancing of the strengths of structured data modeling and the modern data stack. 
  3. Use iterative data modeling: Instead of insisting on extensive upfront data modeling, try an iterative approach. Start with a basic model, then evolve it as needed. Iteration can produce the best of both worlds, maintaining a structured approach while responding to requirements as they change over time.
  4. Leverage data virtualization: Data virtualization provides a helpful layer of abstraction that allows diverse data sources to be integrated without extensive modeling. In some organizations, this approach maintains agility while ensuring data is effectively understood and used.
  5. Focus on metadata management: Bridging the gap between structured modeling and agility usually involves a (sometimes renewed) focus on effective metadata management. Robust metadata curation enables organizational flexibility while maintaining clarity about data structures and relationships.
  6. Emphasize data governance: When individuals are empowered to enact consistent data governance, clear policies and standards guiding data quality, usage, and security help ensure a data environment remains as agile as possible.
  7. Enable self-service data access: When implemented with appropriate controls, self-service data access supports agility by allowing users to access data as needed while still operating within the framework of the established data model.
  8. Foster continuous collaboration: Encourage ongoing collaboration between your data architects, engineers, and business users. While passionate data modeling discussions will still take place from time to time, making cross-disciplinary collaboration an important part of the culture helps keep modeling efforts and business needs aligned.
  9. Implement data contracts: Finally, employ data contracts to provide structured agreements on data formats and interfaces. Their ability to foster communication between data producers and consumers promotes balance just as the other tactics here do—but also allows that balance to scale (see the sketch below).

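As a rough illustration of item 9 above (this hand-rolled check is only a sketch of the concept, not Gable’s product or any particular contract format), a data contract can be as simple as an agreed-upon schema that producers validate records against before publishing them:

```python
# A hypothetical contract for an "order_created" event, agreed between a producer and its consumers.
ORDER_CREATED_CONTRACT = {
    "order_id": int,
    "customer_id": int,
    "total": float,
    "currency": str,
}


def validate(record: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the record honors the contract."""
    violations = []
    for field_name, expected_type in contract.items():
        if field_name not in record:
            violations.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected_type):
            violations.append(f"{field_name} should be {expected_type.__name__}")
    return violations


# A producer would run this check before publishing; here the string total violates the contract.
record = {"order_id": 42, "customer_id": 7, "total": "19.99", "currency": "USD"}
print(validate(record, ORDER_CREATED_CONTRACT))  # ['total should be float']
```
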
6. Suggested next steps

As is now abundantly clear, treating data as a product is paramount for any organization looking to succeed in an overwhelmingly data-dependent world. Data contracts are the best way to guarantee the quality of data before it even enters an organization. 

For this reason, we’re offering a transformative approach to retaining, developing, and operationalizing data contracts. Make sure to join our product waitlist to be among the first to experience the benefits of Gable.ai.

