Read this article and pay attention to the data mining techniques, classifier development, and evaluation criteria. Then take notes and understand the difference between supervised and unsupervised learning models. Finally, read the summary and discussion section of this article. What distinctions can be made about the three major purposes of problem-solving items using data-mining techniques?
There are different types of data warehouses, and each has a specific purpose within an organization. Remember, it is important to use the correct type of warehouse to support the "decision support" model being employed. Decision support techniques such as classification, prediction, time-series analysis, association, clustering, and so on will each have their own unique data needs. Correctly designing the data warehouse will ensure the best possible evidence to support strategic and daily decisions.
Managing data is an important function in the administrative process. Because organizations use data to guide decisions, decision-makers rely on you to produce a data management plan for sustainability, growth, and strategy. As you start to interact with decision-makers and the decision-support systems they use, you will also find that additional study of the models employed through a course on quantitative methods or decision-support technology will prove useful.
Introduction
With the advance of technology incorporated in
educational assessment, researchers have been intrigued by a new type of
data, process data, generated from computer-based assessment, or new
sources of data, such as keystroke or eye-tracking data. Such data,
often referred to as a "data ocean," is typically very large in volume
and offers few ready-to-use features. Exploring, discovering, and
extracting useful information from such an ocean has been challenging.
What
analyses should be performed on such process data? Even though specific
analytic methods are to be used for different data sources with
specific features, some common analysis methods can be performed based
on the generic characteristics of log files. Hao et al. summarized
several common analytic actions when introducing the Python package
glassPy. First, summary information about the log file, such as the
number of sessions, the time duration of each session, and the
frequency of each event, can be obtained through a summary
function. In addition, event n-grams, or event sequences of different
lengths, can be formed for further utilization of similarity measures to
classify and compare persons' performances. To take the temporal
information into account, hierarchical vectorization of the rank ordered
time intervals and the time interval distribution of event pairs were
also introduced. In addition to these common analytic techniques, other
existing data analytic methods for process data include Social Network
Analysis, Bayesian Networks (Bayes nets), Hidden Markov Models, Markov
Item Response Theory, digraphs, and process mining. Further, modern
data mining
techniques, including cluster analysis, decision trees, and artificial
neural networks, have been used to reveal useful information about
students' problem-solving strategies in various technology-enhanced
assessments.
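The summary function and event n-grams described above can be
illustrated with a minimal sketch. It does not use glassPy or reproduce
its interface; the log layout (session, event, timestamp), the function
names, and the choice of Jaccard similarity to compare two sessions are
all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical log: (session_id, event_name, timestamp_in_seconds) tuples.
LOG = [
    ("s1", "start", 0.0),
    ("s1", "click_map", 2.5),
    ("s1", "select_route", 4.0),
    ("s1", "click_map", 7.0),
    ("s1", "submit", 9.5),
    ("s2", "start", 0.0),
    ("s2", "click_map", 1.0),
    ("s2", "select_route", 3.0),
    ("s2", "submit", 5.0),
]


def summarize(log):
    """Return the session count, per-session durations, and event frequencies."""
    times = {}
    for session, _event, t in log:
        times.setdefault(session, []).append(t)
    durations = {s: max(ts) - min(ts) for s, ts in times.items()}
    event_counts = Counter(event for _s, event, _t in log)
    return {
        "n_sessions": len(times),
        "durations": durations,
        "event_counts": event_counts,
    }


def session_events(log, session):
    """Return the ordered event names of one session."""
    return [event for s, event, _t in log if s == session]


def event_ngrams(events, n):
    """Return all contiguous event sequences of length n."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]


def jaccard(a, b):
    """Jaccard similarity between two collections of n-grams (an assumed measure)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


summary = summarize(LOG)
bigrams_s1 = event_ngrams(session_events(LOG, "s1"), 2)
bigrams_s2 = event_ngrams(session_events(LOG, "s2"), 2)
similarity = jaccard(bigrams_s1, bigrams_s2)
```

Here the two sessions share two of five distinct bigrams, so their
similarity is 0.4. Other similarity measures could be substituted
without changing the n-gram step.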
The current study focuses on data mining techniques, and this
paragraph briefly reviews techniques that have been frequently used
and lessons that have been learned in analyzing process data in
technology-enhanced educational assessment. Two major classes of data
mining techniques are
supervised and unsupervised learning methods. Supervised methods are
used when subjects' memberships are known
and the purpose is to train a classifier that can accurately classify
the subjects into their own categories (e.g., scores) and then be
efficiently generalized to new datasets. Unsupervised methods are
utilized when subjects' memberships are unknown and the goal is to
categorize the subjects into clearly separate groups based on features
that can distinguish them. Decision trees, a supervised data
classification method, have been used very often in analyzing process
data in educational assessment. DiCerbo and Kidwai used
Classification and Regression Tree (CART) methodology to create the
classifier to detect a player's goal in a gaming environment. The
authors demonstrated the building of the classifier, including feature
generation and the pruning process, and evaluated the results using
precision, recall, Cohen's Kappa and A'. This study
showed that CART could be a reliable automated detector and
illustrated how to build such a detector with a relatively
small sample size (n = 527). On the other hand, cluster analysis and
Self-Organizing Maps (SOM) are two well-established
unsupervised techniques that categorize students' problem-solving
strategies. Kerr et al. showed that cluster analysis can
consistently identify key features in 155 students' performances in log
files extracted from an educational gaming and simulation environment
called Save Patch, which measures mathematical
competence. The authors described how they manipulated the data for the
application of clustering algorithms and showed evidence that fuzzy
cluster analysis is more appropriate than hard cluster analysis in
analyzing log file process data from game/simulation environments. Most
importantly, the authors demonstrated how cluster analysis can identify
both effective strategies and misconceptions students have with respect
to the related construct. Soller and Stevens showed the power of
SOM in terms of pattern recognition. They used SOM to categorize 5284
individual problem-solving performances into 36 different
problem-solving strategies, each exhibiting different solution
frequencies. The authors noted that the 36 strategy classifications can
be used as input to a test-level scoring process or externally validated
by associating them with other measures. Such detailed classifications
can also serve as valuable feedback to students and instructors.
Chapters in Williamson et al. also discussed extensively the
promising future of using data mining techniques, like SOM, as an
automated scoring method. Fossey evaluated three unsupervised
methods, namely k-means, SOM, and Robust Clustering Using Links
(ROCK), for analyzing process data in log files from a game-based
assessment scenario.
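Three of the evaluation criteria named above for a supervised
classifier (precision, recall, and Cohen's Kappa) can be computed
directly from actual and predicted labels. The following is a minimal
sketch on made-up binary labels; it is not the procedure or data of any
study cited here, and A' is omitted. The function names are
hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(actual, predicted, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _tn = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(actual, predicted):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    labels = set(actual) | set(predicted)
    expected = sum(
        (actual.count(lab) / n) * (predicted.count(lab) / n) for lab in labels
    )
    if expected == 1.0:
        return 0.0  # Kappa is undefined here; 0.0 is an arbitrary choice.
    return (observed - expected) / (1 - expected)


# Made-up labels: 1 = goal detected, 0 = goal not detected.
ACTUAL = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
PREDICTED = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(ACTUAL, PREDICTED)
kappa = cohens_kappa(ACTUAL, PREDICTED)
```

With these labels, precision and recall are both 0.75, while Kappa is
lower (about 0.58) because part of the observed 80% agreement would be
expected by chance alone.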
To date, however, no study has demonstrated
the utilization of both supervised and unsupervised data mining
techniques for the analysis of the same process data. This study aims
to fill this gap and provides a didactic demonstration of analyzing
process data from the 2012 PISA log files retrieved from one of the
problem-solving items using both types of data mining methods. This
log file is well-structured and representative of what researchers may
encounter in complex assessments, and is thus suitable for
demonstration purposes. The goal of the current study is threefold:
(1) to demonstrate the use of data
mining methods on process data in a systematic way; (2) to evaluate the
consistency of the classification results from different data mining
techniques, either supervised or unsupervised, with one data file; (3)
to illustrate how the results from supervised and unsupervised data
mining techniques can be used to deal with psychometric issues and
challenges.
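The second goal, checking whether groups found without labels agree
with known memberships, can be sketched on toy data. The example below
runs a plain k-means (one of the unsupervised methods named earlier)
and cross-tabulates the resulting clusters against known scores. The
features, the scores, the fixed starting centroids, and the use of
cluster purity as the agreement index are all assumptions for
illustration and are not taken from the study.

```python
from collections import Counter

# Hypothetical features per student: (number of actions, minutes on task).
POINTS = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (2.0, 2.5),
          (8.0, 8.0), (8.5, 7.5), (7.8, 8.3)]
# Hypothetical known item scores for the same students (0 or 1).
SCORES = [0, 0, 0, 1, 1, 1, 1]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=50):
    """Plain k-means from given starting centroids; returns labels and centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def purity(clusters, known):
    """Share of subjects whose known label is the majority label of their cluster."""
    table = Counter(zip(clusters, known))
    best = {}
    for (cluster, _label), count in table.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(clusters)


# Fixed starting centroids keep the sketch deterministic.
labels, centroids = kmeans(POINTS, [POINTS[0], POINTS[-1]])
crosstab = Counter(zip(labels, SCORES))
agreement = purity(labels, SCORES)
```

In this toy case one student with a score of 1 falls into the cluster
dominated by scores of 0, so six of the seven students agree with their
cluster's majority score.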
The subsequent sections are organized as follows.
First, the PISA 2012 public dataset, including participants and the
problem-solving item analyzed, is introduced. Second, the data analytic
methods used in the current study are elaborated and the concrete
classifier development processes are illustrated. Third, the results
from data analyses are reported. Lastly, the interpretations of the
results, limitations of the current study and future research directions
are discussed.