Data Mining Techniques in the Analysis Process
Methods
Classifier Development
The general classifier-building process
for the supervised learning methods consists of three steps: (1) train
the classifier by estimating model parameters; (2) determine the
values of the tuning parameters to avoid issues such as "overfitting" (i.e.,
the statistical model fits one dataset too closely but fails to
generalize to other datasets) and finalize the classifier; (3) calculate
the accuracy of the classifier on the test dataset. Training and
tuning are often conducted on the same training dataset, although some
studies further split the training dataset into two parts, one for
training and the other for tuning. Although tree-based methods are
insensitive to feature scaling, the training and test datasets were
scaled for SVM, SOM, and k-means.
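As a concrete illustration, the R sketch below shows one way to carry
out this split-and-scale step; the data frame dat, the outcome column
score, and the 70/30 split ratio are hypothetical placeholders, not
details reported in the study.

    # Hypothetical data frame 'dat' with a factor outcome 'score'.
    set.seed(123)
    n         <- nrow(dat)
    train_idx <- sample(n, size = round(0.7 * n))  # assumed 70/30 split
    train     <- dat[train_idx, ]
    test      <- dat[-train_idx, ]

    # Scale features for SVM, SOM, and k-means using the training
    # means and standard deviations; tree-based methods skip this.
    feats        <- setdiff(names(dat), "score")
    train_scaled <- scale(train[, feats])
    test_scaled  <- scale(test[, feats],
                          center = attr(train_scaled, "scaled:center"),
                          scale  = attr(train_scaled, "scaled:scale"))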
Given the
relatively small sample size of the current dataset, training and
tuning were both conducted on the training dataset, and classification
accuracy was evaluated on the test dataset. For the CART technique, the
cost-complexity parameter (cp) was tuned to find the optimal tree depth
using the R package rpart.
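A minimal sketch of this tuning step, assuming the train object from
the sketch above: rpart records a cross-validated error (xerror) for
each candidate cp in its cptable, and the tree is pruned at the cp with
the lowest such error.

    library(rpart)

    # Grow a deliberately deep tree; xval = 10 gives 10-fold CV
    # error estimates for each candidate cp in the cptable.
    fit_cart <- rpart(score ~ ., data = train, method = "class",
                      control = rpart.control(cp = 0.001, xval = 10))

    # Prune at the cp value with the lowest cross-validated error.
    best_cp  <- fit_cart$cptable[which.min(fit_cart$cptable[, "xerror"]), "CP"]
    fit_cart <- prune(fit_cart, cp = best_cp)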
Gradient boosting was carried out using the R package gbm. The tuning
parameters for gradient boosting were the number of trees, the
complexity (depth) of the trees, the learning rate, and the minimum
number of observations in the trees' terminal nodes.
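One way to tune these four parameters jointly is a grid search with
the caret package, sketched below; the grid values are illustrative
assumptions, not the values used in the study.

    library(caret)

    # Illustrative grid over the four gbm tuning parameters.
    gbm_grid <- expand.grid(n.trees           = c(100, 500, 1000),
                            interaction.depth = 1:3,
                            shrinkage         = c(0.01, 0.1),
                            n.minobsinnode    = c(5, 10))

    # 10-fold cross-validation selects the best combination.
    fit_gbm <- train(score ~ ., data = train, method = "gbm",
                     trControl = trainControl(method = "cv", number = 10),
                     tuneGrid  = gbm_grid, verbose = FALSE)
    fit_gbm$bestTune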
Random forest was tuned over the number of predictors sampled for
splitting at each node (mtry) using the R package randomForest.
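A minimal sketch using randomForest's own tuneRF(), which steps mtry
up and down and compares out-of-bag error; the ntreeTry and step
settings here are illustrative assumptions.

    library(randomForest)

    # Search over mtry by out-of-bag error; doBest = TRUE refits
    # the forest at the best mtry found and returns it.
    x <- train[, feats]
    y <- train$score  # assumed to be a factor for classification
    fit_rf <- tuneRF(x, y, ntreeTry = 500, stepFactor = 1.5,
                     improve = 0.01, doBest = TRUE)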
A radial basis function (RBF) kernel SVM, implemented in the R package
kernlab, was tuned over two parameters, the kernel parameter σ and the
cost value C, which together determine the complexity of the decision
boundary.
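A sketch of this tuning via caret's "svmRadial" method, whose tuning
parameters are exactly sigma and C; the grid values and the use of the
pre-scaled training matrix from above are assumptions.

    library(caret)
    library(kernlab)

    # Illustrative joint grid over the RBF kernel width and cost.
    svm_grid <- expand.grid(sigma = c(0.01, 0.05, 0.1),
                            C     = c(0.25, 1, 4, 16))
    fit_svm <- train(x = train_scaled, y = train$score,
                     method    = "svmRadial",
                     trControl = trainControl(method = "cv", number = 10),
                     tuneGrid  = svm_grid)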
After the parameters were tuned, the final classifiers were fit to the
training dataset.
10-fold cross-validation was conducted for the supervised learning
methods during training. Cross-validation is not strictly necessary for
estimating the test error of a random forest, because its out-of-bag
(OOB) samples already provide an internal error estimate.
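Step (3) above, evaluating accuracy on the held-out test set, might
then look like the following sketch (shown for the gradient boosting
fit; the other classifiers are evaluated the same way).

    # Classification accuracy on the test dataset.
    pred <- predict(fit_gbm, newdata = test)
    mean(pred == test$score)  # proportion classified correctly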
For the unsupervised learning methods, SOM was carried out with the R
package kohonen. The learning rate declined linearly from 0.05 to 0.01
over 2000 training iterations.
k-means clustering was carried out using the kmeans function in the
stats R package, with a maximum of 2000 iterations. Euclidean distance
was used as the distance measure for both methods. The number of
clusters ranged from 3 to 10: the lower bound was set to 3 to match the
three score categories in this dataset, and the upper bound was set to
10 given the relatively small number of features and the small sample
size of the current study.
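A sketch of the corresponding k-means runs over this range of cluster
counts; nstart = 25 (multiple random starts) is an assumption added for
stability rather than a setting reported in the study.

    # One k-means fit per candidate number of clusters k = 3..10,
    # capped at 2000 iterations as described above.
    set.seed(123)
    fits_km <- lapply(3:10, function(k)
      kmeans(train_scaled, centers = k, iter.max = 2000, nstart = 25))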