Introduction to Data Mining

Data mining involves various algorithms and techniques for database searching, inferring data relationships, pattern recognition, and pattern classification. Pattern recognition is the process of comparing a sample observation against a fixed set of patterns (like those stored in a database) to search for an optimal match. Face recognition, voice recognition, character recognition, fingerprint recognition, and text string matching are all examples of pattern searching and pattern recognition.
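As a rough illustration of matching a sample against a fixed set of stored patterns, the sketch below picks the stored pattern with the smallest Euclidean distance to the sample. The pattern names, the three-number feature vectors, and the choice of Euclidean distance as the measure of an "optimal" match are all assumptions made for this example; real recognition systems use much richer features and measures.

```python
import math

def best_match(sample, patterns):
    """Return (name, distance) of the stored pattern closest to the sample.

    Closeness is measured with Euclidean distance (math.dist, Python 3.8+),
    which is an assumption of this sketch, not the only possible choice.
    """
    name = min(patterns, key=lambda k: math.dist(sample, patterns[k]))
    return name, math.dist(sample, patterns[name])

# Hypothetical "database" of stored patterns (made-up feature vectors).
stored_patterns = {
    "pattern_a": [0.0, 0.0, 1.0],
    "pattern_b": [1.0, 1.0, 0.0],
    "pattern_c": [0.5, 0.9, 0.4],
}

# A sample observation to compare against every stored pattern.
match_name, match_distance = best_match([0.9, 1.0, 0.1], stored_patterns)
print(match_name, round(match_distance, 3))
```

Here the search is exhaustive: every stored pattern is compared with the sample, which is simple but scales linearly with the size of the database.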

Going one step further, given a set of prescribed pattern classes, pattern classification is the process of associating a sample observation with one of the pattern classes. For example, consider a database containing two possible classes of face images: happy faces and sad faces. Pattern classification involves processing an input face image of unknown classification to optimally classify it as either happy or sad. As you will soon see, the optimal pattern match or pattern classification is often defined using probabilistic and statistical measures such as distances, deviations, and confidence intervals.
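The happy/sad example can be sketched with a very simple distance-based rule: summarize each class by the average of its training examples, then assign a new observation to the class whose average is nearest. This nearest-centroid rule is only one of many possible classifiers, and the two numeric features per face (for instance, hypothetical measurements of mouth curvature and eye openness) are invented for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def classify(sample, training):
    """Assign the sample to the class with the nearest centroid.

    Returns the predicted label and the distance to every class centroid.
    """
    distances = {
        label: math.dist(sample, centroid(vectors))
        for label, vectors in training.items()
    }
    label = min(distances, key=distances.get)
    return label, distances

# Hypothetical training data: two made-up features per face image.
training_faces = {
    "happy": [[0.8, 0.6], [0.9, 0.7], [0.7, 0.5]],
    "sad": [[-0.7, 0.3], [-0.9, 0.4], [-0.8, 0.2]],
}

# Classify a face image of unknown class.
predicted, class_distances = classify([0.5, 0.55], training_faces)
print(predicted, class_distances)
```

Unlike the matching task, the output here is a class label rather than a specific stored pattern, and the distances give a crude sense of how clear-cut the decision is.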

What is Data Mining?

Data mining refers to a set of methods applied to large and complex databases to filter out randomness and discover hidden patterns. These methods are almost always computationally intensive. Data mining encompasses the tools, methodologies, and theories for revealing patterns in data, which is a critical step in knowledge discovery. Several driving forces explain why data mining has become such an important area of study:

  1. The explosive growth of data in a great variety of fields in industry and academia, supported by:
    • Cheaper storage with very large capacities, such as cloud storage
    • Faster communication through higher connection speeds
    • Better database management systems and software support
  2. Rapidly increasing computing power.

With such a high volume of varied data available, data mining techniques help extract useful information from it.

Statistical learning methods span a wide range, from linear regression to recently developed, computationally intensive pattern recognition methods with roots in computer science. The main objective of learning methods is prediction, though it need not be the only one. In this course, only prediction methods are considered.
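To make the idea of prediction concrete, here is a minimal sketch of the simplest method mentioned above, linear regression with a single predictor, fitted by ordinary least squares. The data values are made up for illustration, and the function name fit_line is a hypothetical helper, not part of any library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (sum of cross-deviations) / (sum of squared x-deviations).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up observations that roughly follow a straight line.
x_values = [1, 2, 3, 4, 5]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x_values, y_values)

# Prediction for a new, unseen value of x.
prediction = intercept + slope * 6
print(intercept, slope, prediction)
```

The fitted line is learned from observed pairs and then used to predict the response at a value of x that was not in the data, which is the prediction task in its most basic form.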