Methods

Feature Generation and Selection

Feature Generation

The generated features can be categorized into time features and action features, as summarized in Table 1. Four time features were created: T_time, A_time, S_time, and E_time, indicating the total response time, the action time spent in process, the starting time spent on the first action, and the ending time spent on the last action, respectively. It was assumed that students of different ability levels may differ in the time they take to read the question (starting time spent on the first action), the time they spend during the response (action time spent in process), and the time they take to make the final decision (ending time spent on the last action). Moreover, various joint modeling approaches for response accuracy and response times have been proposed to explain the relationship between the two; thus, the total response times are expected to differ across ability levels as well.
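As a rough illustration, the four time features can be derived from a timestamped action log along the following lines. This is a minimal Python sketch: the log format, the presence of explicit start and submit events, and the exact operational definitions are assumptions made for illustration and may differ from the original coding.

    def time_features(log):
        """Compute T_time, A_time, S_time, and E_time from one student's log.

        `log` is assumed to be a chronologically ordered list of (action, timestamp)
        pairs whose first entry is the item-start event and whose last entry is the
        final (submitting) action.
        """
        times = [t for _, t in log]
        t_time = times[-1] - times[0]        # total response time
        s_time = times[1] - times[0]         # starting time: before the first substantive action
        e_time = times[-1] - times[-2]       # ending time: spent on the last action
        a_time = t_time - s_time - e_time    # action time spent in process
        return {"T_time": t_time, "A_time": a_time, "S_time": s_time, "E_time": e_time}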

Action features, in contrast, were created by coding adjacent actions of different sequence lengths together. In total, this study generated 12 action features consisting of a single action (unigrams), 18 action features containing two ordered adjacent actions (bigrams), and 2 action features created from four sequential actions (four-grams). All generated action sequences were assumed to be of equal importance, and no weights were assigned to them. In Table 1, for example, "concession" is a unigram consisting of only one action, namely that the student bought the concession fare; "S_city," on the other hand, is a bigram consisting of two actions, "Start" and "city subway," representing that the student selected the city subway ticket after starting the item.
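The following sketch illustrates how such n-gram counts can be extracted from a single action sequence; the helper name and the example sequence are hypothetical, and in the study only the 32 predefined n-grams were retained as features.

    from collections import Counter

    def ngram_counts(actions, n):
        """Count all ordered adjacent n-grams in one student's action sequence."""
        return Counter("_".join(actions[i:i + n]) for i in range(len(actions) - n + 1))

    # A hypothetical sequence; only the predefined n-grams served as features in the study.
    sequence = ["Start", "city_subway", "concession", "daily", "Cancel", "Buy"]
    counts = ngram_counts(sequence, 1) + ngram_counts(sequence, 2)
    print(counts["concession"])           # a unigram feature
    print(counts["Start_city_subway"])    # a bigram feature ("S_city" in Table 1)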

Sao Pedro et al. showed that generated features should be theoretically important to the construct in order to achieve better interpretability and efficiency. Following their suggestion, features were generated as indicators of the problem-solving ability measured by this item, as supported by the scoring rubric. For example, one action sequence consisting of four actions, coded as "city_con_daily_cancel," is crucial to scoring. If the student first chose "city_subway" to tour the city, then used his or her concession fare ("concession"), next looked at the price of the daily pass ("daily"), and lastly clicked "Cancel" to see the other option, this action sequence is necessary but not sufficient for full credit.

The final recoded dataset for analysis consists of 426 students as rows and 36 features (32 action-sequence features and 4 time features) as columns. The frequency of each generated action feature was calculated for each student, and each student's score served as the known label when applying supervised learning methods.
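A sketch of how the student-by-feature matrix can be assembled, reusing the helpers sketched above; the variable names, the toy log, and the score shown are illustrative placeholders, not data from the study.

    import pandas as pd

    # Illustrative inputs only; in the study these come from the 426 parsed student logs.
    student_logs = {"s1": [("Start", 0.0), ("city_subway", 12.5),
                           ("concession", 20.1), ("Buy", 31.0)]}
    student_scores = {"s1": 1}
    action_feature_names = ["concession", "Start_city_subway", "city_con_daily_cancel"]

    rows = {}
    for sid, log in student_logs.items():
        actions = [a for a, _ in log]
        counts = ngram_counts(actions, 1) + ngram_counts(actions, 2) + ngram_counts(actions, 4)
        rows[sid] = {name: counts.get(name, 0) for name in action_feature_names}
        rows[sid].update(time_features(log))          # the four time features sketched above

    X = pd.DataFrame.from_dict(rows, orient="index")  # students as rows, features as columns
    y = pd.Series(student_scores).loc[X.index]        # item scores as the known labels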


Feature Selection

Feature selection should be based on both the theoretical framework and the algorithms used. Because the features in this study were generated from a theoretical perspective in the first place, feature selection here only needs to accommodate the requirements of the algorithms.

Two further issues that need consideration are redundant variables and variables with little variance. Tree-based methods handle both issues well and have built-in mechanisms for feature selection. The feature importances indicated by the tree-based methods are shown in Figure 3. In both random forest and gradient boosting, the most important feature is "city_con_daily_cancel." The next most important is "other_buy," which means the student did not choose trip_4 before the action "Buy." The feature importance provided by tree-based methods is especially helpful when a selection has to be made among hundreds of features, as it can narrow down the number of features to track, analyze, and interpret. The classification accuracy of the support vector machine (SVM), by contrast, can be reduced by redundant variables. However, given that the number of features (36) is relatively small in the current study, deleting highly correlated variables (ρ ≥ 0.8) did not improve the classification accuracy of the SVM.

Figure 3. Feature importance indicated by tree-based methods.
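Importances of the kind plotted in Figure 3 can be read off the fitted ensembles. The sketch below uses scikit-learn's impurity-based feature_importances_ with default hyperparameters, which are assumptions since the study's exact settings and importance measure are not specified here; the last lines show one way to drop highly correlated variables before fitting the SVM.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

    # X_train, y_train: the training portion of the 426 x 36 feature matrix and score labels.
    rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

    # Impurity-based importances of the kind plotted in Figure 3.
    rf_importance = pd.Series(rf.feature_importances_,
                              index=X_train.columns).sort_values(ascending=False)
    gb_importance = pd.Series(gb.feature_importances_,
                              index=X_train.columns).sort_values(ascending=False)

    # Drop one member of each highly correlated pair (rho >= 0.8) before fitting the SVM;
    # with only 36 features this did not change SVM accuracy in the study.
    corr = X_train.corr().abs()
    to_drop = [col for i, col in enumerate(corr.columns) if (corr.iloc[:i, i] >= 0.8).any()]
    X_train_svm = X_train.drop(columns=to_drop)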



Clustering algorithms are affected by variables with near-zero variance. Fossey and Kerr et al. discarded variables with five or fewer attempts in their studies. However, their data were binary, and no clear-cut criterion exists for feature elimination when applying clustering algorithms to process data. In the current study, five features with variance no greater than 0.09 in both the training and test datasets were removed to achieve optimal classification results.
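The near-zero-variance screening used for the clustering methods amounts to the following filter (a sketch under the same assumed variable names as above; the 0.09 cutoff is the one reported here).

    # Features whose variance is no greater than 0.09 in both the training and the test
    # data are dropped before SOM and k-means (X_train and X_test as assumed above).
    low_var = [col for col in X_train.columns
               if X_train[col].var() <= 0.09 and X_test[col].var() <= 0.09]
    X_train_cluster = X_train.drop(columns=low_var)   # 31 features remain
    X_test_cluster = X_test.drop(columns=low_var)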

In summary, the full set of 36 features was retained for the tree-based methods and the SVM, while 31 features were retained for SOM and k-means after the deletion of features with little variance.