Data Mining Concepts 1. Define the Problem

EMC2DATA

Defining the Problem

The first step in the data mining process, as highlighted in the following diagram, is to clearly define the problem and consider how data can be used to answer it.


This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by which the model will be evaluated, and defining specific objectives for the data mining project. These tasks translate into questions such as the following:

  •  What are you looking for? What types of relationships are you trying to find?
  •  Does the problem you are trying to solve reflect the policies or processes of the business?
  •  Do you want to make predictions from the data mining model, or just look for interesting patterns and associations?
  •  Which outcome or attribute do you want to try to predict?
  •  What kind of data do you have and what kind of information is in each column? If there are multiple tables, how are the tables related? Do you need to perform any cleansing, aggregation, or processing to make the data usable?
  •  How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of the business?

To answer these questions, you might have to conduct a data availability study to investigate the needs of the business users with regard to the available data. If the data does not support the needs of the users, you might have to redefine the project. You also need to consider how the results of the model can be incorporated into the key performance indicators (KPIs) that are used to measure business progress.
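
As a rough illustration of what a first pass at a data availability study might look like, the sketch below profiles each column of a small table: how many values are missing, how many distinct values exist, and whether the column looks numeric or categorical. The sample data, the column names, and the profile_columns helper are all hypothetical and chosen only for illustration. They are not part of any particular data mining product, and a real study would also involve talking to the business users.

```python
import csv
import io
from collections import Counter
from statistics import mean, pstdev

# Hypothetical sample data; in practice this would come from your own tables.
SAMPLE_CSV = """customer_id,month,region,purchases
1,2023-01,North,3
2,2023-01,South,
3,2023-02,North,5
4,2023-02,,2
5,2023-03,South,4
"""


def profile_columns(rows):
    """Summarize each column: missing count, distinct count, and basic stats."""
    profile = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row[col] for row in rows]
        present = [v for v in values if v not in ("", None)]
        summary = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        if not present:
            # Nothing to inspect, so the column cannot support any question.
            summary["kind"] = "empty"
        else:
            try:
                numbers = [float(v) for v in present]
            except ValueError:
                # At least one value is not a number: treat as categorical.
                summary["kind"] = "categorical"
                summary["top"] = Counter(present).most_common(1)[0][0]
            else:
                summary["kind"] = "numeric"
                summary["mean"] = mean(numbers)
                summary["stdev"] = pstdev(numbers)
        profile[col] = summary
    return profile


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile_columns(rows)
for name, summary in report.items():
    print(name, summary)
```

A report like this helps answer the questions about what each column contains and whether cleansing is needed. For example, the missing counts for region and purchases show where values would have to be filled in or excluded before modeling. It does not answer questions about seasonality or about how related tables join, which would need separate checks.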