Read this article and pay attention to the data mining techniques, classifier development, and evaluation criteria. Then take notes and understand the difference between supervised and unsupervised learning models. Finally, read the summary and discussion section of this article. What distinctions can be made about the three major purposes of problem-solving items using data-mining techniques?
There are different types of data warehouses, and each has a specific purpose within an organization. Remember, it is important to use the correct type of warehouse to support the "decision support" model being employed. Decision support techniques such as classification, prediction, time-series analysis, association, clustering, and so on will each have their own unique data needs. Correctly designing the data warehouse will ensure the best possible evidence to support strategic and daily decisions.
Managing data is an important function in the administrative process. Because organizations use data to guide decisions, decision-makers rely on you to produce a data management plan for sustainability, growth, and strategy. As you start to interact with decision-makers and the decision-support systems they use, you will also find that additional study of the models employed through a course on quantitative methods or decision-support technology will prove useful.
Methods
Classifier Development
The general classifier building process
for the supervised learning methods consists of three steps: (1) train
the classifier by estimating model parameters; (2) determine the
values of the tuning parameters to avoid issues such as "overfitting" (i.e.,
the statistical model fits one dataset too closely but fails to
generalize to other datasets) and finalize the classifier; (3) calculate
the accuracy of the classifier on the test dataset.
Training and tuning are often conducted on the same training
dataset, although some studies further split the training dataset
into two parts, one for training and the other for tuning. Although
tree-based methods are not affected by feature scaling, the training and
test datasets were scaled for SVM, SOM, and k-means.
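The three-step workflow above can be sketched as follows. This is an illustrative example in Python/scikit-learn, not the authors' R code, and the synthetic data merely stand in for the study's dataset:

```python
# Sketch of the three-step workflow: tune on the training set,
# finalize the classifier, then score it on the held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data as a placeholder for the study's dataset.
X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Steps 1-2: estimate model parameters and tune hyperparameters together,
# via cross-validation on the training set only. Scaling is built into the
# pipeline because SVM (unlike tree-based methods) is scale-sensitive.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__gamma": [0.01, 0.1, 1.0], "svc__C": [1, 10, 100]},
    cv=10)
grid.fit(X_train, y_train)

# Step 3: accuracy of the finalized classifier on the untouched test set.
test_accuracy = grid.score(X_test, y_test)
print(round(test_accuracy, 3))
```

The key design point is that the test set is touched only once, in step 3; all tuning decisions are made using the training data alone.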
Given the
relatively small sample size of the current dataset, training and
tuning were both conducted on the training dataset.
Classification accuracy was evaluated with the test dataset. For the
CART technique, the cost-complexity parameter (cp) was tuned to find the
optimal tree depth using the R package rpart. Gradient boosting was carried
out using the R package gbm; its tuning parameters were the number of trees,
the complexity of the trees, the learning rate, and the minimum number of
observations in the trees' terminal nodes. Random forest was tuned over
the number of predictors sampled for splitting at each node (mtry) using
the R package randomForest. A radial basis function kernel SVM, carried
out in the R package kernlab, was tuned through two parameters, the scale
parameter σ and the cost value C, which together determine the complexity
of the decision boundary. After the parameters were tuned, the classifiers
were trained by fitting them to the training dataset.
10-fold cross-validation was conducted for the supervised learning methods in the
training process. Cross-validation is not necessary for random forest
when estimating the test error, because its out-of-bag (OOB) samples
already provide an internal estimate of that error.
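The out-of-bag shortcut for tuning mtry can be sketched as below. This is an illustrative Python/scikit-learn analogue, not the authors' randomForest code; scikit-learn's max_features plays the role of mtry, and the candidate values and data are placeholders:

```python
# Sketch: tune the random-forest mtry analogue (max_features) by
# comparing out-of-bag (OOB) accuracy, with no cross-validation needed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the study's dataset.
X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

oob_scores = {}
for mtry in (2, 3, 5, 8):  # candidate numbers of predictors per split
    rf = RandomForestClassifier(n_estimators=500, max_features=mtry,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    # OOB accuracy: each tree is scored on the samples it never saw,
    # giving an internal estimate of test accuracy.
    oob_scores[mtry] = rf.oob_score_

best_mtry = max(oob_scores, key=oob_scores.get)
print(best_mtry, round(oob_scores[best_mtry], 3))
```

Because every tree is trained on a bootstrap sample, roughly a third of the observations are left out of each tree, and those held-out observations serve the same role as a cross-validation fold.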
For the unsupervised learning methods, SOM was carried out
in the R package kohonen. The learning rate declined from 0.05 to 0.01 over
2000 training iterations. k-means was carried out using the
kmeans function in the R package stats with 2000 iterations. Euclidean
distance was used as the distance measure for both methods. The number of
clusters ranged from 3 to 10: the lower bound was set to 3 to match the
three score categories in this dataset, and the upper bound was set to
10 given the relatively small number of features and small sample size in
the current study.
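The k-means sweep over the cluster range can be sketched as follows. This is an illustrative Python/scikit-learn version, not the authors' stats::kmeans call, and the blob data are synthetic stand-ins for the study's features:

```python
# Sketch: run k-means for k = 3..10 on scaled features, recording the
# within-cluster sum of squares (inertia) for each candidate k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data; scale first, since k-means relies on
# Euclidean distance and is therefore scale-sensitive.
X, _ = make_blobs(n_samples=200, n_features=8, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)

inertia = {}
for k in range(3, 11):  # lower bound 3 (score categories), upper bound 10
    km = KMeans(n_clusters=k, n_init=10, max_iter=2000, random_state=0)
    km.fit(X)
    inertia[k] = km.inertia_  # within-cluster sum of squared distances

for k in sorted(inertia):
    print(k, round(inertia[k], 1))
```

Inertia necessarily shrinks as k grows, so in practice the curve is inspected for an "elbow" (or paired with a cluster-quality index) rather than simply minimized.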