Data Profiling in Power BI

Imagine you have just started working at AdventureWorks as a data analyst. You have a lot of data to analyze to determine which products are preferred, by which clients, and why. To perform successful analysis on this many items, you need data that includes fields suitable for analysis, with an adequate amount of data and a variety of data ranges representing the overall data. Over the next few minutes, you will be introduced to data profiling and statistical analysis and why they are important when reviewing data sets. By the end of this video, you will have a high-level understanding of data profiling and statistical analysis, and you will also learn about distribution, anomalies, and outliers in the context of data profiling.

Let's first cover an introduction to data profiling. Before analyzing any data set, it is important to examine and evaluate the data you are working with. Analyzing the data without evaluating its accuracy, completeness, and alignment with your objectives can lead to misleading results. When examining a data set for the first time, there are several aspects you should look at, especially for numerical fields. For each numerical field, you should check these characteristics: minimum (min), maximum (max), average (mean), frequently occurring values (mode), and standard deviation.

A good place to start assessing data is with values you can immediately sanity-check. Imagine you are reviewing a data set that has an age field. There could be someone in the data set with an age of 200, which is extremely unlikely to be true; if so, there may be an outlier in the data. Looking at the minimum and maximum values, you would expect ages to fall in a range such as 21 to 77. These are realistic ages, unlike 200.

The distribution of data refers to how the data points are spread or arranged within a data set. It describes the pattern or shape of the data when plotted on a graph. Understanding the distribution of data is crucial in data analysis because it helps you gain insights into the central tendency, variability, and overall characteristics of the data.

Next, let's consider outliers. The formal definition of an outlier in statistics is a data point that significantly deviates from other observations. Outlier data can be handled by applying a technique called min-max scaling, or normalization. The aim is to adjust the mean and standard deviation of the data proportionately while preserving the relative distances between outlier data and other data points. Analyzing the distribution allows you to make informed decisions, identify outliers, and choose appropriate statistical techniques for further analysis.

There are situations where values in the data set skew the average. For example, most records may be close in age, but let's say there are three individuals aged 80 and above. If you rely solely on the average to evaluate the distribution, these outliers can mislead you by inflating the average. In this case, it would be appropriate to examine the distribution more closely. When taking a closer look at the data, you may find that the distribution is normal, but that the three records mentioned in the example are outliers.

Next, let's look at standard deviation. Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a data set. It provides a way to understand how individual data points differ from the mean, or average, of the data set.
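To make these profiling checks concrete, here is a minimal Python sketch (using pandas) that computes the min, max, mean, mode, and standard deviation of a hypothetical Age field, and then flags implausible values using the interquartile-range (IQR) rule. The sample values, the column name, and the choice of the IQR rule are illustrative assumptions on my part, not anything specific to the AdventureWorks data discussed above.

```python
import pandas as pd

# Hypothetical Age field; in practice this would come from the table you are profiling.
ages = pd.Series([21, 34, 45, 29, 52, 77, 38, 34, 200], name="Age")

# Basic profiling statistics for a numerical field.
print("min :", ages.min())
print("max :", ages.max())
print("mean:", round(ages.mean(), 1))
print("mode:", ages.mode().tolist())   # mode() may return more than one value
print("std :", round(ages.std(), 1))

# Flag potential outliers with the interquartile-range (IQR) rule:
# values beyond 1.5 * IQR from the quartiles are suspicious.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print("potential outliers:", outliers.tolist())   # the implausible age of 200 is flagged here
```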
The main objective here is to prevent outliers from causing deviations in your analysis results, minimizing their impact. Finally, let's return to the distribution of data. A balanced distribution of the data points that are not outliers is another factor that affects data quality and your analysis results. It is important for descriptive variables such as age, gender, income status, occupation, city, and neighborhood to represent as many diverse groups as possible and to be distributed evenly across them. If not, a cluster of records that closely resemble one another will lead to narrow intervals when defining norms, which will mislead your analysis.

The key elements that demonstrate data quality are profiling and statistically analyzing the data: examining its distribution and its min, max, mean, and mode values, detecting any outliers and normalizing them, and ensuring that the data represents the entire data set. Considering these factors will enhance the accuracy and quality of the analysis and predictions made with this data.
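As an illustration of the normalization step mentioned above, the sketch below applies min-max scaling to the same hypothetical Age field, rescaling every value into the 0-to-1 range while preserving the relative distances between data points. The data and column name are illustrative assumptions; a similar transformation could also be expressed in Power Query or DAX when working directly in Power BI.

```python
import pandas as pd

def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale a numerical field into the [0, 1] range (min-max normalization)."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

# The same hypothetical Age field as before, including the implausible value of 200.
ages = pd.Series([21, 34, 45, 29, 52, 77, 38, 34, 200], name="Age")
scaled = min_max_scale(ages)
print(scaled.round(3).tolist())
# Relative distances between points are preserved, so the outlier is still
# visibly far from the rest, but every value now lies between 0 and 1.
```

Note that because min-max scaling uses the extreme values themselves, a severe outlier compresses the remaining data toward the low end of the range, which is one more reason to examine the distribution and the outliers before deciding how to handle such records.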

