Data Mining Techniques in Analyzing Process
Methods
Data Description
The PISA 2012 log file dataset for the
problem-solving item was downloaded from
http://www.oecd.org/pisa/pisaproducts/database-cbapisa2012.htm. The
dataset consists of 4722 actions from 426 students as rows and 11
variables as columns. The 11 variables (see Figure 2) are as follows: cnt
indicates the country, which is USA in the present study; schoolid and
StIDStd indicate the unique school and student IDs, respectively;
event_number (ranging from 1 to 47) indicates the cumulative number of
actions the student took; event_value (see the raw event values presented in
Table 1) gives the specific action the student took at a given time stamp;
time indicates the exact time stamp (in seconds) corresponding to
the event_value; and event indicates the nature of the action (start item,
end item, or action in process). Lastly, network, fare_type,
ticket_type, and number_trips all describe the current choice the
student had made. The variables used were schoolid, StIDStd, event_value
and time. ID variables helped to identify students, while event_value
and time variables were used to generate features. Student scores
were not provided in the log file; they were therefore hand-coded and
carefully double-checked against the scoring rule. Among the 426
students, 121 (28.4%) received full credit, 224 (52.6%) received partial
credit, and 81 (19.0%) received no credit. Full, partial, and no credit were
coded as 2, 1, and 0, respectively.
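
As a minimal sketch of the data handling described above, the Python snippet below groups log rows into one action sequence per student and tabulates the coded scores. The rows in toy_rows (IDs, event values, and times) are invented purely for illustration and are not taken from the PISA file; the function names are likewise hypothetical. Only the credit counts (121, 224, 81) and the 2/1/0 coding come from the text.

```python
from collections import defaultdict

# Hypothetical log rows: (schoolid, StIDStd, event_value, time in seconds).
# All IDs, event values, and times here are invented for illustration only.
toy_rows = [
    ("0001", "00001", "START_ITEM", 0.0),
    ("0001", "00001", "click_a", 12.5),
    ("0001", "00001", "END_ITEM", 40.2),
    ("0001", "00002", "START_ITEM", 0.0),
    ("0001", "00002", "END_ITEM", 15.0),
]


def group_by_student(rows):
    """Collect (event_value, time) pairs per (schoolid, StIDStd), in row order."""
    sequences = defaultdict(list)
    for schoolid, stidstd, event_value, time in rows:
        sequences[(schoolid, stidstd)].append((event_value, time))
    return dict(sequences)


# Numeric codes for the credit categories, as stated in the text.
SCORE_CODES = {"full": 2, "partial": 1, "none": 0}

# Student counts per credit category, as reported in the text.
reported_counts = {"full": 121, "partial": 224, "none": 81}


def score_distribution(counts):
    """Return {code: (count, percent)}, with percent rounded to one decimal."""
    total = sum(counts.values())
    return {
        SCORE_CODES[label]: (n, round(100 * n / total, 1))
        for label, n in counts.items()
    }


sequences = group_by_student(toy_rows)
distribution = score_distribution(reported_counts)
total_students = sum(reported_counts.values())
```

Keying each sequence on both schoolid and StIDStd is a cautious choice in this sketch: it keeps students distinct even if student IDs turned out to be unique only within a school.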
Figure 2. A screenshot of the log file for one student.

Table 1. 15 raw event values and 36 generated features.