5. Steps of the Knowledge Discovery in Databases Process

Data mining is the core step of the Knowledge Discovery in Databases (KDD) process. Although the two terms are often used interchangeably, they are different: KDD also includes preprocessing steps before data mining and post-processing steps after it, all of which are needed to turn raw data into useful knowledge. Data mining alone might therefore not give you what you are looking for.

KDD is an iterative process that transforms raw data into useful information. The steps of Knowledge Discovery in Databases are:

Understanding: The first step is to understand the requirements. You need a clear understanding of the application domain and of your objectives, for example improving sales or predicting the stock market. You should also know whether you intend to describe your data or to predict information.

Selection of data set: Data mining is performed on current or past records. You therefore select a data set, or a subset of it (data samples), on which to perform the analysis and obtain useful knowledge. The data set must be large enough for data mining to be meaningful.

  1. Data cleaning

    Data cleaning is the step where noise and irrelevant data are removed from the large data set. It is a very important preprocessing step because the outcome depends on the quality of the selected data. As part of data cleaning, you might have to remove duplicate records, fill in logically consistent values for missing entries, remove unnecessary data fields, standardize the data format, keep the data up to date and so on.
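A few of the cleaning operations listed above can be sketched in plain Python. This is only an illustration: the records, the field names (`id`, `city`, `sales`) and the choice to fill missing values with the mean are assumptions made for the example, not part of the KDD process itself.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_records = [
    {"id": 1, "city": " new york ", "sales": 120.0},
    {"id": 2, "city": "Boston", "sales": None},       # missing value
    {"id": 1, "city": " new york ", "sales": 120.0},  # exact duplicate
    {"id": 3, "city": "BOSTON", "sales": 80.0},
]


def clean(records):
    """Remove duplicates, standardize text, and fill missing sales values."""
    # 1. Drop exact duplicates while keeping the original order.
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # 2. Standardize the text format (trim whitespace, use title case).
    for rec in unique:
        rec["city"] = rec["city"].strip().title()

    # 3. Fill missing sales with the mean of the known values.
    #    This is one simple choice; other strategies may suit other data.
    known = [rec["sales"] for rec in unique if rec["sales"] is not None]
    mean_sales = sum(known) / len(known) if known else 0.0
    for rec in unique:
        if rec["sales"] is None:
            rec["sales"] = mean_sales
    return unique


cleaned = clean(raw_records)
```

After this call, `cleaned` holds three records, every `city` value uses the same format, and no `sales` value is missing.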

  2. Data transformation

    Dimensionality reduction or transformation methods reduce the number of effective variables, so that only the features useful for the goal of the task are kept and the data is represented more efficiently. In short, the data is transformed into a form suitable for the data mining step.
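As a rough illustration of this step, the sketch below drops features that never vary (a very simple form of feature selection) and rescales the remaining ones to the range [0, 1]. The column names, the values and the `min_variance` threshold are assumptions chosen for the example; real projects often use more elaborate techniques.

```python
from statistics import pvariance

# Hypothetical numeric data set: each key is a feature, each list a column.
columns = {
    "age":     [25.0, 35.0, 45.0, 55.0],
    "income":  [30.0, 50.0, 70.0, 90.0],
    "country": [1.0, 1.0, 1.0, 1.0],   # constant, carries no information
}


def transform(cols, min_variance=1e-9):
    """Drop near-constant features, then min-max scale the rest to [0, 1]."""
    result = {}
    for name, values in cols.items():
        if pvariance(values) <= min_variance:
            continue  # feature selection: skip columns that never vary
        lo, hi = min(values), max(values)
        result[name] = [(v - lo) / (hi - lo) for v in values]
    return result


transformed = transform(columns)
```

Here `transformed` keeps only `age` and `income`, each scaled so that its smallest value becomes 0 and its largest becomes 1.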

  3. Selection of data mining task

    Based on the objective of data mining, an appropriate task is selected. Common data mining tasks are classification, clustering, association rule discovery, sequential pattern discovery, regression and deviation detection. The choice depends on whether you need to predict information or describe it.

  4. Selection of data mining algorithm

    One or more appropriate methods are selected for finding patterns in the data. You also need to decide which models and parameters are appropriate for the chosen method. Popular data mining methods include decision trees and rules, relational learning models and example-based methods.

  5. Data mining

    Data mining is the actual search for patterns in the available data using the selected data mining method.
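To make "searching for patterns" concrete, here is a minimal sketch of one possible mining task: counting item pairs that frequently occur together, a building block of association rule discovery. The transactions and the `min_support` threshold are hypothetical, and the brute-force pair counting is meant for small examples only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "milk"},
]


def frequent_pairs(baskets, min_support=0.5):
    """Return item pairs whose support (fraction of baskets) meets min_support."""
    counts = Counter()
    for basket in baskets:
        # Sort items so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


patterns = frequent_pairs(transactions)
```

With this sample data, `("bread", "milk")` appears in three of the four baskets, so it is reported with a support of 0.75, while `("butter", "jam")` falls below the threshold and is discarded.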

  6. Pattern evaluation

    This post-processing step in KDD interprets the mined patterns and relationships. If the evaluated patterns are not useful, the process may restart from any of the previous steps, which is what makes KDD an iterative process.
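One way to picture this evaluation is to score candidate rules and keep only those that pass a threshold. The sketch below uses confidence, a common measure for association rules; the transactions, the candidate rules and the `min_confidence` value are assumptions made for the example, and other tasks would call for other measures.

```python
# Hypothetical transactions and candidate rules (antecedent -> consequent).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "milk"},
]
candidate_rules = [("bread", "milk"), ("butter", "jam")]


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


def evaluate(baskets, rules, min_confidence=0.8):
    """Keep only the rules that meet the confidence threshold."""
    scored = {rule: confidence(baskets, *rule) for rule in rules}
    return {rule: c for rule, c in scored.items() if c >= min_confidence}


useful = evaluate(transactions, candidate_rules)
if not useful:
    # Nothing passed the evaluation: go back to an earlier KDD step.
    print("No useful patterns; revisit an earlier KDD step.")
```

In this sample, the rule bread -> milk holds in every basket that contains bread and is kept, whereas butter -> jam holds in only one of three baskets and is rejected.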

  7. Knowledge consolidation

    This is the final step in Knowledge Discovery in Databases (KDD). The discovered knowledge is consolidated and presented to the user in a simple, easy-to-understand format. Visualization techniques are commonly used to help users understand and interpret the information.

Though these are the main steps in any KDD process, some of them can be combined in practice. For example, data selection and data transformation can be carried out together for convenience. Even after the knowledge has been presented to the user, new data can be added to the data set, the mining can be refined further, or a different data mining method can be chosen to get more accurate results. Thus, KDD is a thoroughly iterative process.

Looking at the different steps of the KDD process, it becomes clear that data is mined in order to obtain useful information or knowledge. Knowledge mining would therefore be a more appropriate term than data mining.