Read this article and pay attention to the data mining techniques, classifier development, and evaluation criteria. Then take notes and understand the difference between supervised and unsupervised learning models. Finally, read the summary and discussion section of this article. What distinctions can be made about the three major purposes of problem-solving items using data-mining techniques?
There are different types of data warehouses, and each has a specific purpose within an organization. Remember, it is important to use the correct type of warehouse to support the "decision support" model being employed. Decision support techniques such as classification, prediction, time-series analysis, association, clustering, and so on will each have their own unique data needs. Correctly designing the data warehouse will ensure the best possible evidence to support strategic and daily decisions.
Managing data is an important function in the administrative process. Because organizations use data to guide decisions, decision-makers rely on you to produce a data management plan for sustainability, growth, and strategy. As you start to interact with decision-makers and the decision-support systems they use, you will also find that additional study of the models employed through a course on quantitative methods or decision-support technology will prove useful.
Methods
Evaluation Criteria
For the supervised methods, students in the test dataset were classified using the classifier developed on the training dataset. The performance of the supervised learning techniques was evaluated in terms of classification accuracy. Outcome measures include overall accuracy, balanced accuracy, sensitivity, specificity, and Kappa. Since item scores fall into three categories (0, 1, and 2), sensitivity, specificity, and balanced accuracy were calculated by treating each score category in turn as the positive class, as follows:

Sensitivity = TP / (TP + FN) (1)

Specificity = TN / (TN + FP) (2)

Balanced accuracy = (Sensitivity + Specificity) / 2 (3)

where TP, FN, TN, and FP are the counts of true positives, false negatives, true negatives, and false positives for a given score category; sensitivity measures the ability to predict positive cases, specificity measures the ability to predict negative cases, and balanced accuracy is the average of the two. Overall accuracy and Kappa were calculated for each method using the following formulas:
Overall accuracy = (number of correct predictions) / (total number of predictions) (4)

Kappa = (p_o - p_e) / (1 - p_e) (5)

where overall accuracy measures the proportion of all correct predictions, and the Kappa statistic is a measure of concordance for categorical data; in its formula, p_o is the observed proportion of agreement and p_e is the proportion of agreement expected by chance. The larger these five statistics are, the better the classification decisions.
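As a minimal, illustrative sketch (not the authors' code), the example below assumes Python with scikit-learn and uses hypothetical score vectors y_true and y_pred. It computes the per-category sensitivity, specificity, and balanced accuracy from a 3 x 3 confusion matrix (Equations 1-3), along with overall accuracy and Cohen's Kappa (Equations 4-5).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, cohen_kappa_score

# Hypothetical true and predicted item scores for students in the test dataset
# (three score categories: 0, 1, 2).
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1, 0, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# One-versus-rest sensitivity, specificity, and balanced accuracy per score
# category (Equations 1-3).
for k, score in enumerate([0, 1, 2]):
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp
    fp = cm[:, k].sum() - tp
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    print(f"score {score}: sens={sensitivity:.2f} spec={specificity:.2f} "
          f"bal.acc={balanced_accuracy:.2f}")

# Overall accuracy (Equation 4) and Cohen's Kappa (Equation 5).
print("overall accuracy:", accuracy_score(y_true, y_pred))
print("Kappa:", cohen_kappa_score(y_true, y_pred))
```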
For the two unsupervised learning methods, the better-fitting method and the number of clusters were determined for the training dataset by the following criteria:
1. The Davies-Bouldin Index (DBI), calculated as in Equation 6, can be applied to compare the performance of multiple clustering algorithms. The algorithm with the lower DBI is considered the better-fitting one, having higher between-cluster variance and smaller within-cluster variance.
DBI = (1/K) * Σ_{i=1}^{K} max_{j ≠ i} [ (σ_i + σ_j) / d_ij ] (6)

where K is the number of clusters, σ_i and σ_j are the average distances from the cluster center to each case in cluster i and cluster j, respectively, and d_ij is the distance between the centers of cluster i and cluster j. For each cluster i, the maximization selects the cluster j that has the smallest between-cluster distance with cluster i, or the highest within-cluster variance, or both.
2. The Kappa value (see Equation 5) is a measure of classification consistency between the two unsupervised algorithms. It is usually expected to be no smaller than 0.8. Both criteria are illustrated in the sketch after this list.
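The article does not restate its two clustering algorithms here, so the sketch below is only illustrative: it uses scikit-learn's KMeans and AgglomerativeClustering as hypothetical stand-ins, davies_bouldin_score for Equation 6, and Cohen's Kappa (Equation 5) between the two partitions after aligning their arbitrary cluster labels with a Hungarian assignment.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import confusion_matrix, cohen_kappa_score, davies_bouldin_score
from scipy.optimize import linear_sum_assignment

# Hypothetical training matrix of item scores (rows = students, columns = items).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 3, size=(200, 20)).astype(float)

# Criterion 1: DBI (Equation 6) for each algorithm across candidate cluster
# numbers; the lower value indicates the better-fitting solution.
for n_clusters in range(2, 6):
    km_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_train)
    hc_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_train)
    print(n_clusters,
          davies_bouldin_score(X_train, km_labels),
          davies_bouldin_score(X_train, hc_labels))

# Criterion 2: Kappa between the two partitions at the chosen cluster number.
# Cluster labels are arbitrary, so align them first (Hungarian assignment on the
# cross-tabulation) before computing Cohen's Kappa.
k = 3  # hypothetical chosen number of clusters
km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_train)
hc_labels = AgglomerativeClustering(n_clusters=k).fit_predict(X_train)
cross = confusion_matrix(km_labels, hc_labels)
rows, cols = linear_sum_assignment(-cross)            # pairing that maximizes agreement
relabel = {old: new for new, old in zip(rows, cols)}
hc_aligned = np.array([relabel[label] for label in hc_labels])
print("Kappa between clusterings:", cohen_kappa_score(km_labels, hc_aligned))
# Repeating these computations on the test matrix gives the stability check described below.
```

Because cluster labels carry no inherent meaning, the alignment step is what allows Kappa to be read as agreement between the two partitions.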
To check the stability and consistency of the classifications obtained in the training dataset, the methods were repeated in the test dataset, and the DBI and Kappa values were recomputed.