Data Mining Techniques in Analyzing Process
Methods
Data Mining Techniques
This study demonstrates how to use
data mining techniques to map the selected features (both action and
time) to students' performance on this problem-solving item from PISA
2012. Because students' item scores are available in the data file,
supervised learning algorithms can be trained to classify students
based on their known item performance (i.e., score category) in the
training dataset, whereas unsupervised learning algorithms group
students based on the input variables alone, without knowing their item
performance. These data mining techniques make no assumptions about the
distribution of the data.
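The contrast between the two families of methods can be illustrated with a minimal sketch. The study's analyses were run in RStudio; the Python code below is only an illustration and is not the code used in the study. The toy records, the feature names, and the nearest-centroid classifier are all hypothetical stand-ins chosen for brevity (the study itself uses CART, gradient boosting, random forest, and SVM as classifiers).

```python
import random

# Hypothetical toy records, not PISA data:
# (number_of_actions, time_on_task_seconds, score_category)
students = [
    (4, 30.0, 0), (5, 35.0, 0), (6, 32.0, 0),
    (14, 90.0, 1), (15, 95.0, 1), (16, 88.0, 1),
]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def fit_nearest_centroid(features, labels):
    """Supervised: uses the known score categories to build one centroid each."""
    by_label = {}
    for x, y in zip(features, labels):
        by_label.setdefault(y, []).append(x)
    return {y: mean_point(pts) for y, pts in by_label.items()}

def predict(centroids, x):
    """Assign x to the score category whose centroid is closest."""
    return min(centroids, key=lambda y: sq_dist(centroids[y], x))

def kmeans(features, k, iters=20, seed=0):
    """Unsupervised: groups students from the features alone (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(features, k)  # k distinct observations as starting centers
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: sq_dist(centers[j], x))
                  for x in features]
        for j in range(k):
            members = [x for x, a in zip(features, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean_point(members)
    return assign

features = [(a, t) for a, t, _ in students]
labels = [s for _, _, s in students]

# Supervised: score categories are used during training.
centroids = fit_nearest_centroid(features, labels)
predicted = predict(centroids, (13, 85.0))

# Unsupervised: score categories are never shown to the algorithm.
clusters = kmeans(features, k=2)
```

In this sketch the features are left unscaled, so time on task dominates the distance; in practice the features would usually be standardized first. The cluster numbers returned by `kmeans` are arbitrary labels and need not match the score categories.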
Four supervised learning methods,
Classification and Regression Tree (CART), gradient boosting, random
forest, and support vector machine (SVM), were explored to develop
classifiers, while two unsupervised learning methods, Self-Organizing
Map (SOM) and k-means, were applied to further examine the different
strategies students used both within and across score categories.
CART was chosen because
it worked effectively in a previous study and
is known for its fast computation and ease of interpretation. However,
it may not perform as well as other methods.
Furthermore, small changes in the data can change the tree structure
dramatically. Thus, gradient boosting and random forest,
which can improve the performance of trees via ensemble methods, were
also used for comparison. Although SVM has not yet been widely used in
the analysis of process data, it is one of the most popular and
flexible supervised learning techniques and has been applied in other
psychometric analyses such as automatic scoring. The two
clustering algorithms, SOM and k-means, have been applied in the
analysis of process data in log files. Researchers have suggested using more than one
clustering method to validate the clustering solutions. All the analyses were conducted in RStudio.
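One way to compare the solutions from two clustering methods is to quantify how well their partitions agree. The sketch below computes the adjusted Rand index between two sets of cluster labels. It is an illustration in Python rather than the R code used in the study, the choice of the adjusted Rand index as the agreement measure is an assumption (the text does not name one), and the label vectors `som_labels` and `kmeans_labels` are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same cases.

    1.0 means identical partitions (up to relabeling); values near 0 mean
    agreement no better than chance; negative values are possible.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two cases")

    # Pairs of cases placed together under both solutions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of cases placed together under each solution separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # largest attainable agreement
    if max_index == expected:
        # Degenerate case, e.g. both solutions put every case in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical cluster labels for the same six students from two methods.
som_labels = [0, 0, 1, 1, 2, 2]
kmeans_labels = [2, 2, 0, 0, 1, 1]   # same grouping, different cluster numbers
agreement = adjusted_rand_index(som_labels, kmeans_labels)
```

Because cluster numbers are arbitrary, the index depends only on which students are grouped together, so the two hypothetical solutions above agree perfectly even though their labels differ.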