Introduction to Data Mining

Data mining involves various algorithms and techniques for database searching, inferring data relationships, pattern recognition, and pattern classification. Pattern recognition is the process of comparing a sample observation against a fixed set of patterns (such as those stored in a database) to find an optimal match. Face recognition, voice recognition, character recognition, fingerprint recognition, and text string matching are all examples of pattern searching and pattern recognition.
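The matching step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method prescribed by the course: it assumes each pattern has already been reduced to a numeric feature vector, it assumes Euclidean distance as the measure of closeness, and the stored pattern names and values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(sample, patterns):
    """Return (name, distance) of the stored pattern closest to the sample."""
    name = min(patterns, key=lambda k: euclidean(sample, patterns[k]))
    return name, euclidean(sample, patterns[name])

# Hypothetical database of stored patterns: name -> feature vector.
patterns = {
    "alice": [0.9, 0.1, 0.4],
    "bob": [0.2, 0.8, 0.5],
    "carol": [0.5, 0.5, 0.9],
}

# An observation of unknown identity, compared against every stored pattern.
sample = [0.85, 0.15, 0.35]
match, dist = best_match(sample, patterns)
print(match, round(dist, 3))
```

Here the "optimal match" is simply the stored pattern with the smallest distance to the sample. Other distance measures could be substituted without changing the structure of the search.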

Going one step further, given a set of prescribed pattern classes, pattern classification is the process of associating a sample observation with one of the pattern classes. For example, consider a database containing two possible classes of face images: happy faces and sad faces. Pattern classification involves processing an input face image of unknown classification to optimally classify it as either happy or sad. As you will soon see, the optimal pattern match or pattern classification is often defined using probabilistic and statistical measures such as distances, deviations, and confidence intervals.
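As a rough illustration of classification by distance, the sketch below assigns a sample to the class whose mean feature vector (centroid) is nearest. It is one simple possibility under stated assumptions, not the technique the course necessarily uses: the two-number feature vectors (imagined here as mouth curvature and eye openness) and the training values are hypothetical, and a real face classifier would need far richer features.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def classify(sample, training):
    """Assign the sample to the class whose centroid is nearest."""
    centroids = {label: centroid(vs) for label, vs in training.items()}
    return min(centroids, key=lambda label: euclidean(sample, centroids[label]))

# Hypothetical labeled examples: [mouth curvature, eye openness].
training = {
    "happy": [[0.8, 0.6], [0.9, 0.7], [0.7, 0.5]],
    "sad": [[-0.6, 0.3], [-0.8, 0.2], [-0.7, 0.4]],
}

label_a = classify([0.5, 0.55], training)   # close to the "happy" centroid
label_b = classify([-0.4, 0.35], training)  # close to the "sad" centroid
print(label_a, label_b)
```

Unlike the matching problem, the sample is not compared with individual stored patterns but with a summary of each class, so the answer is always one of the prescribed classes.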

Overview

Rapid advances in information technology have brought explosive growth in data generation and data collection capabilities across all domains. In the business world, retailers and e-commerce companies have generated very large databases of commercial transactions. Huge amounts of scientific data have been generated in various fields as well. One case in point is the Human Genome Project, which has aggregated gigabytes of data on the human genetic code. The World Wide Web provides another example, with billions of web pages of textual and multimedia information used by millions of people. Analyzing such huge bodies of data so that they can be understood and used efficiently remains a challenging problem. Data mining addresses this problem by providing techniques and software to automate the analysis and exploration of large and complex data sets. Research on data mining is being pursued in a wide variety of fields, including statistics, computer science, machine learning, database management, and data visualization, to name a few.

This course on data mining covers commonly used techniques and applications in the field. Though the focus is on applying the methods with the software R, considerable effort is devoted to developing their mathematical basis. Data mining and learning techniques developed in fields other than statistics, e.g., machine learning and signal processing, are also introduced. After completing the course, students should be able to identify situations in which the techniques apply, employ the techniques to derive results, interpret the results, and comprehend the limitations, if any, of the final outcome.


Source: The Pennsylvania State University, https://online.stat.psu.edu/stat508/lesson/1a
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.