Data Mining Concepts 3. Exploring Data


Exploring Data

The third step in the data mining process, as highlighted in the following diagram, is to explore the prepared data.

You must understand the data in order to make appropriate decisions when you create the mining models. Exploration techniques include calculating the minimum and maximum values, calculating the mean and standard deviation, and looking at the distribution of the data. For example, by reviewing the maximum, minimum, and mean values, you might determine that the data is not representative of your customers or business processes, and that you must therefore obtain more balanced data or review the assumptions behind your expectations. Standard deviations and other distribution values can provide useful information about the stability and accuracy of the results. A large standard deviation can indicate that adding more data might help you improve the model. Data that deviates strongly from a standard distribution might be skewed, or it might be an accurate picture of a real-life problem; in either case, it can be difficult to fit a model to such data.
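The summary statistics described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not a feature of Analysis Services; the function name `summarize` and the sample `ages` values are hypothetical, and the skewness shown is the simple moment-based estimate (the mean of cubed z-scores).

```python
import statistics


def summarize(values):
    """Return basic exploration statistics for a non-empty list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 in the denominator); needs n >= 2.
    stdev = statistics.stdev(values)
    # Population standard deviation, used only to standardize for skewness.
    pstdev = statistics.pstdev(values)
    if pstdev > 0:
        # Moment-based skewness: mean of cubed z-scores.
        skewness = sum(((v - mean) / pstdev) ** 3 for v in values) / n
    else:
        skewness = 0.0  # all values identical, so no asymmetry
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "stdev": stdev,
        "skewness": skewness,
    }


# Hypothetical sample: customer ages taken from a prepared dataset.
ages = [23, 25, 31, 35, 38, 41, 44, 52, 67, 89]
for name, value in summarize(ages).items():
    print(f"{name}: {value}")
```

A skewness well above or below zero suggests an asymmetric distribution, which is one signal that the data may need a closer look before you fit a model to it.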

By exploring the data in light of your own understanding of the business problem, you can decide if the dataset contains flawed data, and then you can devise a strategy for fixing the problems or gain a deeper understanding of the behaviors that are typical of your business.

You can use tools such as Master Data Services to canvass available sources of data and determine their availability for data mining. You can use tools such as SQL Server Data Quality Services, or the Data Profiler in Integration Services, to analyze the distribution of your data and repair issues such as wrong or missing data.
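To make the idea of profiling for missing data concrete, here is a minimal Python sketch. It does not use or imitate the API of Data Quality Services or the Data Profiler; the function `profile_missing`, the column names, and the sample rows are all hypothetical, and "missing" is assumed to mean `None` or a blank string.

```python
def profile_missing(rows, columns):
    """Count missing values per column in a list of dict records.

    Returns {column: (missing_count, missing_fraction)}.
    """
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)  # an absent key is treated as missing
            is_blank = isinstance(value, str) and value.strip() == ""
            if value is None or is_blank:
                counts[col] += 1
    total = len(rows)
    return {
        col: (counts[col], counts[col] / total if total else 0.0)
        for col in columns
    }


# Hypothetical customer records with some gaps.
customers = [
    {"id": 1, "age": 34, "region": "West"},
    {"id": 2, "age": None, "region": "East"},
    {"id": 3, "age": 51, "region": ""},
    {"id": 4, "region": "North"},
]
for column, (count, fraction) in profile_missing(customers, ["age", "region"]).items():
    print(f"{column}: {count} missing ({fraction:.0%})")
```

A column with a high missing fraction is a candidate for repair, imputation, or exclusion before it is used as a model input.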

After you have defined your sources, you combine them in a data source view by using the Data Source View Designer in SQL Server Data Tools. For more information, see Data Source Views in Multidimensional Models. This designer also contains several tools that you can use to explore the data and verify that it will work for creating a model. For more information, see Explore Data in a Data Source View (Analysis Services).

Note that when you create a model, Analysis Services automatically creates statistical summaries of the data contained in the model, which you can query to use in reports or further analysis. For more information, see Data Mining Queries.