Data Mining Concepts 2. Preparing Data

EMC2DATA

Preparing Data

Preparing Data The second step in the data mining process, as highlighted in the following diagram, is to consolidate and clean the data that was identified in the Defining the Problem step.

Data can be scattered across a company and stored in different formats, or may contain inconsistencies such as incorrect or missing entries. For example, the data might show that a customer bought a product before the product was offered on the market, or that the customer shops regularly at a store located 2,000 miles from her home.

Data cleaning is not just about removing bad data or interpolating missing values, but about finding hidden correlations in the data, identifying sources of data that are the most accurate, and determining which columns are the most appropriate for use in analysis. For example, should you use the shipping date or the order date? Is the best sales influencer the quantity, total price, or a discounted price? Incomplete data, wrong data, and inputs that appear separate but in fact are strongly correlated all can influence the results of the model in ways you do not expect. +

Therefore, before you start to build mining models, you should identify these problems and determine how you will fix them. For data mining typically you are working with a very large dataset and cannot examine every transaction for data quality; therefore, you might need to use some form of data profiling and automated data cleansing and filtering tools, such as those supplied in Integration Services, Microsoft SQL Server 2012 Master Data Services, or SQL Server Data Quality Services to explore the data and find the inconsistencies. For more information, see these resources: +

  • Integration Services in Business Intelligence Development Studio
  • Master Data Services Overview (MDS)
  • Data Quality Services
  • It is important to note that the data you use for data mining does not need to be stored in an Online Analytical Processing (OLAP) cube, or even in a relational database, although you can use both of these as data sources. You can conduct data mining using any source of data that has been defined as an Analysis Services data source. These can include text files, Excel workbooks, or data from other external providers. For more information, see Supported Data Sources (SSAS - Multidimensional).