Normalization and denormalization

Introduction

One of the most important jobs of a data analyst in Microsoft Power BI is creating and managing data models. By executing this task effectively, you can make it easier for your team to understand the data. However, one of the questions you’ll often face is “to normalize or denormalize?”


In this reading, you’ll explore the concepts of normalization and denormalization, their effects on the data model, and how to enact these techniques using Power Query.


The data model and its advantages

As you might already be aware, a data model is a conceptual representation of data elements. It’s a visual overview of your tables, their columns, and the relationships between the tables. 


An appropriately designed data model provides the following advantages:


Faster data exploration


Easier aggregate creation


Precise reporting


Quicker report creation


Simpler report maintenance


However, applying normalization or denormalization techniques can provide further advantages.


Normalization

Normalization is a data model design technique that involves structuring data to minimize redundancy and ensure data integrity. It divides data into multiple related tables, each with a specific purpose. This approach reduces data duplication. However, it often requires creating complex relationships between the tables.


Normalization offers the following advantages:


Removal of redundant data


Improved data integrity


Easier maintenance of your data model


With normalization, relationships are established between tables by using primary and foreign keys. Primary keys are columns that uniquely identify each row of data, while foreign keys are columns that reference the primary key of another table.


Normalized sales table

For example, the Adventure Works Product table has a unique key column named ProductId. Additional columns represent product attributes such as name and description.


The second table, Sales, has a unique key column named SaleId and a SalesTotal column. It also has a foreign key column named Product. Each value in this column is an ID from the Product table. The relationship between the Sales table and the Product table is an example of a many-to-one relationship. In other words, many rows in the Sales table can relate to one row in the Product table.


This schema is normalized because the Sales table uses a foreign key to associate each row with a specific product.
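The primary-key/foreign-key relationship above can be sketched in Python with hypothetical sample data (the product names, sale IDs, and totals below are invented for illustration):

```python
# Normalized schema: product attributes live only in the Product table.
# The Sales table carries just a ProductId foreign key.
products = {
    1: {"Name": "Mountain Bike", "Description": "26-inch wheels"},
    2: {"Name": "Road Bike", "Description": "Lightweight frame"},
}

sales = [
    {"SaleId": 101, "SalesTotal": 1200.00, "ProductId": 1},
    {"SaleId": 102, "SalesTotal": 950.00, "ProductId": 2},
    {"SaleId": 103, "SalesTotal": 1200.00, "ProductId": 1},  # many-to-one: same product
]

def product_name(sale):
    """Resolve a sale's product name by following the foreign key."""
    return products[sale["ProductId"]]["Name"]

print(product_name(sales[2]))  # Mountain Bike
```

Note that looking up a product's name always requires a join-style lookup through the foreign key, which is the query cost normalization trades for integrity.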


Denormalization

Denormalization is the reverse process of normalization. It involves converting the normalized schema into a schema that has redundant information. Implementing denormalization helps to avoid expensive queries between the tables but at the cost of creating redundant or duplicated data.


To demonstrate, let’s return to the previous example. To denormalize the Adventure Works schema, the Sales table must contain SaleId, SalesTotal, ProductName, and ProductDescription columns.


Sales table

Note how the product information is now duplicated in multiple rows.
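A minimal sketch of this denormalization step, again with invented sample data, shows the product attributes being copied into every related sales row:

```python
# Denormalize: copy each product's attributes into every related sales row.
# (Hypothetical sample data for illustration.)
products = {
    1: {"Name": "Mountain Bike", "Description": "26-inch wheels"},
}
sales = [
    {"SaleId": 101, "SalesTotal": 1200.00, "ProductId": 1},
    {"SaleId": 103, "SalesTotal": 1200.00, "ProductId": 1},
]

denormalized = [
    {
        "SaleId": s["SaleId"],
        "SalesTotal": s["SalesTotal"],
        "ProductName": products[s["ProductId"]]["Name"],
        "ProductDescription": products[s["ProductId"]]["Description"],
    }
    for s in sales
]

# The same product name now appears in multiple rows:
assert all(row["ProductName"] == "Mountain Bike" for row in denormalized)
```

Queries against `denormalized` no longer need a lookup into a second table, but every duplicated value is a potential inconsistency if the product data later changes.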


On the other hand, denormalization offers some benefits:


Instead of loading multiple data tables into your data model, you have only one large table, which can improve performance. With no relationships between tables, fewer queries are needed to join the data.


Filters applied to a single table can be more efficient than filter propagation across relationships between multiple tables.


Establishing a hierarchy from fields within a table is simpler.


However, despite these benefits, you don’t have to create a denormalized data model for every dataset. During the design stage, you must consider other factors that might also affect your model, like analytical and business requirements, your data sources, and the size of your model.


Difference between normalization and denormalization

There are several differences between normalized and denormalized models that are important to be aware of:


Data integrity


Data integrity is maintained during the normalization process. On the other hand, data integrity is harder to maintain with a denormalized model. This is because multiple rows may need to be updated when data changes.
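To illustrate this update cost with hypothetical data: renaming a product in the normalized model changes exactly one row, while in a denormalized table the same change must be applied to every duplicated row, and missing one leaves the data inconsistent:

```python
# Normalized model: renaming a product touches a single row.
products = {1: {"Name": "Mountain Bike"}}
products[1]["Name"] = "Trail Bike"  # one update, integrity preserved

# Denormalized model: every row that duplicates the name must be updated.
denormalized = [
    {"SaleId": 101, "ProductName": "Mountain Bike"},
    {"SaleId": 103, "ProductName": "Mountain Bike"},
]
for row in denormalized:
    if row["ProductName"] == "Mountain Bike":
        row["ProductName"] = "Trail Bike"
# Forgetting this loop (or filtering incorrectly) would leave stale rows.
```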


Redundant data


Redundant data is eliminated when normalization is performed. However, denormalization increases redundant data.


Model size


Normalization increases the number of tables and join queries (due to relationships between tables). In contrast, denormalization reduces the number of tables and relationships.


Memory


A denormalized schema uses more disk space because the same data is stored in multiple places. In contrast, a normalized schema uses disk space more efficiently.


Normalizing and denormalizing data in Power Query

Now that you’re familiar with normalization and denormalization, let’s review how these techniques are implemented.


Load your data


When data is loaded into Power BI from a single data source, it likely represents a denormalized data structure. In this case, you can use the Power Query editor to transform and shape your source data into multiple normalized tables. Once you’ve performed normalization in Power Query, you can build relationships between your data tables in the Model view of Power BI Desktop.


Design your model


Your responsibility is to design an optimized data model with tables that represent the business requirements. Any decision around normalization or denormalization must be aligned with these requirements.


In the following example, Adventure Works classifies products by category and subcategory. Products are assigned to subcategories, and subcategories are assigned to categories. In this model, the Product dimension is normalized and stored in three separate tables:


Category


Subcategory


Product


Sales data model

Denormalize your model


It’s in the business interest of Adventure Works to denormalize this model. Its data analysts can use Power Query to transform these tables into a single product table using the Merge queries feature.
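The effect of such a merge can be sketched in Python with hypothetical sample data (in Power BI itself, this step is performed with Power Query's Merge queries feature; the table contents below are invented for illustration):

```python
# Three normalized tables: Category, Subcategory, and Product.
categories = {10: "Bikes"}
subcategories = {20: {"Name": "Mountain Bikes", "CategoryId": 10}}
products = [{"ProductId": 1, "Name": "Mountain-100", "SubcategoryId": 20}]

# Merge them into a single denormalized product table, following
# each foreign key up the hierarchy.
merged = [
    {
        "ProductId": p["ProductId"],
        "Product": p["Name"],
        "Subcategory": subcategories[p["SubcategoryId"]]["Name"],
        "Category": categories[subcategories[p["SubcategoryId"]]["CategoryId"]],
    }
    for p in products
]

print(merged[0]["Category"])  # Bikes
```

The resulting single table also makes it simple to build a Category > Subcategory > Product hierarchy, since all three levels are now columns in one table.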


Conclusion

The normalization or denormalization debate is about striking the right balance between data integrity, query performance, and ease of maintenance. There is no one-size-fits-all solution, and the choice depends on your project's specific business and analytical requirements. However, by understanding the fundamentals, differences, and use cases of normalized and denormalized data structures, you can tailor your data model to meet your specific needs.
