"The use of OLAP data cube models for psychometrics opens the door to complex and dynamic uses of that data. This paper asserts that data cube modelling would allow larger, aligned, and integrated datasets to be constructed that could be used to build knowledge graphs or feed machine learning systems". Consider what this means and list some opportunities afforded by applying psychometric criteria to classifying content in e-learning systems that could improve your education.
The Foundations of the Data Cube and Its Extensions
Background and Terminology
In the computer science literature, a data cube is a multi-dimensional data structure, or a data array in a computer programming context. Despite the implicit 3D structure suggested by the word "cube," a data cube can represent any number of data dimensions (1D, 2D, …, nD). In scientific computing studies, such as computational fluid dynamics, data structures similar to a data cube are often referred to as scalars (1D), vectors (2D), or tensors (3D). We will briefly discuss the concept of the relational data model and the corresponding relational database management system (RDBMS) developed in the 1970s, followed by the concept of the data warehouse developed in the 1980s. Together they contributed to the development of the data cube concept in the 1990s.
Relational Data Model and Relational Database Management System (RDBMS)
In a relational data model, data are stored in a table with rows and columns, much like a spreadsheet, as shown in Figure 1. The columns are referred to as attributes or fields, the rows are called tuples or records, and the table that comprises a set of columns and rows is called a relation in the RDBMS literature.
Figure 1. A relational database.
The technology was developed when CPU speeds were slow, memory was expensive, and disk space was limited. Consequently, design goals were influenced by the need to eliminate redundancy (duplicated information), such as "2015" in the Year column in Figure 1, through the concept of normalization. The data normalization process involves breaking down a large table into smaller ones through a series of normal forms (procedures). The normalization process is important, but a full discussion is beyond the scope of this paper; readers are referred to Codd for further details.
Information can then be retrieved from these normalized tables by joining them on the unique keys identified during the normalization process. The standard RDBMS language for maintaining and querying a relational database is Structured Query Language (SQL). Variants of SQL can still be found in most modern-day database and spreadsheet systems.
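As a brief illustration of joining normalized tables on a shared key, the following sketch uses pandas in place of SQL; the table and column names (students, scores, student_id) are hypothetical, and the merge call stands in for the equivalent SQL JOIN.

```python
import pandas as pd

# Two small normalized tables; names and values are made up for illustration.
students = pd.DataFrame({"student_id": [1, 2, 3],
                         "name": ["Noah", "Chloe", "Ada"]})
scores = pd.DataFrame({"student_id": [1, 1, 2, 3],
                       "subject": ["Algebra", "Calculus", "Algebra", "Topology"],
                       "year": [2015, 2015, 2015, 2015],
                       "score": [3, 2, 4, 5]})

# Join the two tables on the unique key, analogous to a SQL statement such as
#   SELECT * FROM scores JOIN students USING (student_id);
joined = scores.merge(students, on="student_id", how="inner")
print(joined)
```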
Data Warehousing
The concept of data warehousing was introduced by Devlin and Murphy in 1988, as described by Hayes. A data warehouse is primarily a data repository that integrates data from one or more disparate sources, such as marketing or sales data. Within an enterprise system, such as those commonly found in many large organizations, it is not uncommon to find multiple systems operating independently even though they all rely on the same stored data for market research, data mining, and decision support. The role of data warehousing is to eliminate the duplicated effort in each decision support system. A data warehouse typically includes business intelligence tools; tools to extract, transform, and load (ETL) data into the repository; and tools to manage and retrieve the data. Running complex SQL queries on a large data warehouse, however, can be too time-consuming and costly to be practical.
Data Cube
Due to the limitations of data warehousing described above, data scientists developed the data cube. A data cube is designed to organize the data by grouping it into dimensions, indexing it, and precomputing frequently used queries. Because all the data are indexed and precomputed, a data cube query often runs significantly faster than a standard SQL query. In business intelligence applications, the data cube concept is often referred to as Online Analytical Processing (OLAP).
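The performance idea can be sketched, in very rough form, with pandas: precompute an aggregate once, index it, and answer later questions with fast lookups rather than full scans. The data and column names below are invented for illustration and do not describe any particular OLAP engine.

```python
import pandas as pd

# A flat fact table of scores; values are made up for illustration.
facts = pd.DataFrame({"year":    [2015, 2015, 2016, 2016],
                      "subject": ["Math", "Science", "Math", "Science"],
                      "score":   [2, 4, 3, 5]})

# Precompute (and implicitly index) the aggregate once ...
avg_by_year_subject = facts.groupby(["year", "subject"])["score"].mean()

# ... so that repeated queries become indexed lookups instead of rescanning the table.
print(avg_by_year_subject.loc[(2016, "Math")])
```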
Online Analytical Processing (OLAP) and Business Intelligence
The business sector developed Online Analytical Processing (OLAP) technology to conduct business intelligence analysis and look for insights. An OLAP data cube is essentially a multidimensional array of data. For example, the data cube in Figure 2 represents the same relational data shown in Figure 1, with scores from multiple years (2015–2017) for the same five students (Noah, Chloe, Ada, Jacob, and Emily) in three academic fields (Science, Math, and Technology). Once again, there is no limit on the number of dimensions in an OLAP data cube; the 3D cube in Figure 2 is simply for illustrative purposes. Once a data cube is built and precomputed, intuitive data projections (i.e., mappings of a set onto a subset) can be applied to it through a number of operations.
Figure 2. A 3D data cube.
Describing data as a cube offers many advantages for analysis. Users can interactively navigate their data and visualize the results through slicing, dicing, drilling, rolling, and pivoting.
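To make the operations that follow concrete, here is a minimal sketch of a cube like the one in Figure 2, built as a pandas Series indexed by Year × Name × Subject. The scores are randomly generated rather than taken from the paper's figures.

```python
import numpy as np
import pandas as pd

# A toy 3 x 5 x 3 cube (Year x Name x Subject); scores are random, for illustration only.
index = pd.MultiIndex.from_product(
    [[2015, 2016, 2017],
     ["Noah", "Chloe", "Ada", "Jacob", "Emily"],
     ["Science", "Math", "Technology"]],
    names=["Year", "Name", "Subject"])
cube = pd.Series(np.random.default_rng(0).integers(1, 6, size=len(index)),
                 index=index, name="Score")

# A single cell of the cube is addressed by one value per dimension.
print(cube.loc[(2015, "Emily", "Math")])
```

A labeled index is used here only because it keeps the later slicing, dicing, drilling, and pivoting sketches short; the same structure could equally be built with a dedicated multidimensional array library.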
Slicing
Given a data cube such as the one shown in Figure 2, users can extract part of the data by slicing a rectangular portion off the cube, as highlighted in blue in Figure 3A. The result is a smaller cube that contains only the 2015 data (Figure 3B). Users can slice a cube along any dimension: Figure 4 shows slicing along the Name dimension (highlighted in blue), and Figure 5 shows slicing along the Subject dimension.
Figure 3. (A,B) Slicing along the Year dimension of a data cube.
Figure 4. Slicing along the Name dimension of a data cube.
Figure 5. Slicing along the Subject dimension of a data cube.
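A minimal sketch of slicing, assuming the same toy Year × Name × Subject cube as above (scores again randomly generated):

```python
import numpy as np
import pandas as pd

# Rebuild the toy cube from the earlier sketch.
index = pd.MultiIndex.from_product(
    [[2015, 2016, 2017],
     ["Noah", "Chloe", "Ada", "Jacob", "Emily"],
     ["Science", "Math", "Technology"]],
    names=["Year", "Name", "Subject"])
cube = pd.Series(np.random.default_rng(0).integers(1, 6, size=len(index)),
                 index=index, name="Score")

# Slice along the Year dimension (keep only 2015), as in Figure 3.
print(cube.xs(2015, level="Year"))

# Slice along the Name and Subject dimensions, as in Figures 4 and 5.
print(cube.xs("Emily", level="Name"))
print(cube.xs("Math", level="Subject"))
```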
Dicing
The dicing operation is similar to slicing, except that dicing allows users to pick specific values along multiple dimensions. In Figure 6, the dicing operation is applied to both the Name (Chloe, Ada, and Jacob) and Subject (Calculus and Algebra) dimensions. The result is the small 2 × 3 × 3 cube shown in the second part of Figure 6.
Figure 6. Dicing a 3D data cube.
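A comparable sketch of dicing; here the toy cube's Subject values (Calculus, Algebra, Topology) are hypothetical stand-ins for those shown in Figure 6:

```python
import numpy as np
import pandas as pd

# Toy Year x Name x Subject cube with made-up scores.
index = pd.MultiIndex.from_product(
    [[2015, 2016, 2017],
     ["Noah", "Chloe", "Ada", "Jacob", "Emily"],
     ["Calculus", "Algebra", "Topology"]],
    names=["Year", "Name", "Subject"])
cube = pd.Series(np.random.default_rng(0).integers(1, 6, size=len(index)),
                 index=index, name="Score").sort_index()

# Dice: pick specific values along several dimensions at once.
diced = cube.loc[pd.IndexSlice[:, ["Chloe", "Ada", "Jacob"], ["Calculus", "Algebra"]]]
print(diced)   # 3 years x 3 names x 2 subjects = 18 cells
```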
Drilling
Drilling-up and -down are standard data navigation approaches for multi-dimensional data mining. Drilling-up often involves an aggregation (such as averaging) of a set of attributes, whereas drilling-down brings back the details of a prior drilling-up process.
The drilling operation is particularly useful when dealing with core academic skills that can be best described as a hierarchy. For example, Figure 7A shows four skills of Mathematics (i.e., Number and Quantity; Operations, Algebra, and Functions; Geometry and Measurement; and Statistics and Probability) as defined by the ACT Holistic Framework. Each of these skill sets can be further divided into finer sub-skills. Figure 7B shows an example of dividing the Number and Quantity skill from Figure 7A into eight sub-skills—from Counting and Cardinality to Vectors and Matrices.
Figure 7. (A) Four skills of Mathematics. (B) Eight sub-skills of the Number and Quantity skill.
Figure 8 shows a drill-down operation on a data cube that first slices along the Subject dimension with the value "Math." The result is a slice containing only the Math scores for all five students from 2015 to 2017, as shown in Figure 8. The drill-down operation in Figure 8 then reveals the three Math sub-scores (Calculus, Algebra, and Topology) that are summarized by each single Math score. For example, Emily's 2015 Math score is 2, which is the average of her Calculus (1), Algebra (3), and Topology (2) scores, as depicted in Figure 8.
Figure 8. Drilling-down of a data cube.
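A minimal sketch of drilling, assuming a toy cube whose third dimension holds Math sub-skills (Calculus, Algebra, Topology) and whose scores are randomly generated:

```python
import numpy as np
import pandas as pd

# Toy cube keyed by Year x Name x SubSkill, where the sub-skills roll up to "Math".
index = pd.MultiIndex.from_product(
    [[2015, 2016, 2017],
     ["Noah", "Chloe", "Ada", "Jacob", "Emily"],
     ["Calculus", "Algebra", "Topology"]],
    names=["Year", "Name", "SubSkill"])
cube = pd.Series(np.random.default_rng(0).integers(1, 6, size=len(index)),
                 index=index, name="Score")

# Drill-up: aggregate the sub-skill scores into a single Math score per student per year.
math_scores = cube.groupby(level=["Year", "Name"]).mean()
print(math_scores.loc[(2015, "Emily")])

# Drill-down: recover the detailed sub-skill scores behind that aggregate.
print(cube.loc[(2015, "Emily")])
```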
The drill-up operation can go beyond aggregation: rules or mathematical equations can be applied across multiple dimensions of a cube to create a new dimension. The idea, which is similar to applying a "function" in a spreadsheet, is often referred to as "rolling-up" a data cube.
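A sketch of such a roll-up, assuming the toy cube above and a purely hypothetical weighting rule applied across the Subject dimension to derive a new composite measure:

```python
import numpy as np
import pandas as pd

# Toy Year x Name x Subject cube with made-up scores.
index = pd.MultiIndex.from_product(
    [[2015, 2016, 2017],
     ["Noah", "Chloe", "Ada", "Jacob", "Emily"],
     ["Science", "Math", "Technology"]],
    names=["Year", "Name", "Subject"])
cube = pd.Series(np.random.default_rng(0).integers(1, 6, size=len(index)),
                 index=index, name="Score")

# Roll-up: apply a rule across the Subject dimension to create a new derived measure,
# much like applying a "function" in a spreadsheet. The weights are invented.
wide = cube.unstack("Subject")                    # rows: (Year, Name); columns: subjects
weights = {"Science": 0.4, "Math": 0.4, "Technology": 0.2}
wide["Composite"] = sum(w * wide[s] for s, w in weights.items())
print(wide.head())
```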
Pivoting
Pivoting a data cube allows users to look at the cube from different perspectives. Figure 9 depicts an example of pivoting the data cube from the Name vs. Subject front view in the first part of Figure 9 to a Year vs. Subject view in the third part, which shows not just Emily's 2015 scores but also her scores from 2016 and 2017. The 3D data cube is, in effect, rotated backward along the Subject dimension from the middle image to the last image in Figure 9.
Figure 9. Pivoting a data cube from one perspective (dimensional view) to another.
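A minimal sketch of pivoting on the same toy cube, switching from a Name vs. Subject view of a single year to a Year vs. Subject view for a single student:

```python
import numpy as np
import pandas as pd

# Toy Year x Name x Subject cube with made-up scores.
index = pd.MultiIndex.from_product(
    [[2015, 2016, 2017],
     ["Noah", "Chloe", "Ada", "Jacob", "Emily"],
     ["Science", "Math", "Technology"]],
    names=["Year", "Name", "Subject"])
cube = pd.Series(np.random.default_rng(0).integers(1, 6, size=len(index)),
                 index=index, name="Score")

# Name vs. Subject view for a single year (the "front" of the cube in Figure 9).
print(cube.xs(2015, level="Year").unstack("Subject"))

# Pivot to a Year vs. Subject view for one student, exposing all three years at once.
print(cube.xs("Emily", level="Name").unstack("Subject"))
```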
Beyond Data Cubes
Data cube applications, such as OLAP, take advantage of data pre-aggregated along dimension levels and provide efficient database querying through languages such as MDX. The more pre-aggregation is done on disk, the better the performance for users. However, all operations are conducted at the disk level, which is slow and introduces CPU load and latency issues. As the production cost of computer memory continues to fall while its performance continues to rise, it has become evident that it is more practical to query data in memory than to pre-aggregate data on disk as OLAP data cubes.
In-memory Computation
Today, researchers use computer clusters with as much as 1 TB of memory (or more) per compute node for high-dimensional, in-memory database queries at interactive response times. For example, T-Rex is able to query billions of data records in interactive response time using a Resource Description Framework (RDF, 2014) database and the SPARQL query language running on a Linux cluster with 32 nodes of Intel Xeon processors and ~24.5 TB of memory installed across the 32 nodes. Because such a large amount of information can be queried from a database in interactive time, the role of data warehouses continues to diminish in the big data era as cloud computing becomes the norm.
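As a small illustration of the in-memory RDF-plus-SPARQL pattern (not of the T-Rex system itself), the following sketch builds a tiny rdflib graph in memory and queries it with SPARQL; the namespace, resources, and predicates are all invented.

```python
from rdflib import Graph, Literal, Namespace

# A tiny in-memory RDF graph; every name below is hypothetical.
EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Emily, EX.subject, Literal("Math")))
g.add((EX.Emily, EX.score, Literal(2)))
g.add((EX.Noah, EX.subject, Literal("Science")))
g.add((EX.Noah, EX.score, Literal(4)))

# A SPARQL query over the in-memory graph.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?student ?score
    WHERE { ?student ex:subject "Math" ; ex:score ?score . }
""")
for row in results:
    print(row.student, row.score)
```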
Additionally, in-memory database technology allows researchers to develop newer interactive visualization tools that query a larger number of data dimensions interactively, letting users look at their data from different perspectives simultaneously. For example, T-Rex's "data facets" design, shown in Figure 10A, displays seven data dimensions of a cybersecurity benchmark dataset available in the public domain. After the IP address 172.10.0.6 (in the SIP column) in Figure 10A is selected, the data facets simultaneously update the other six columns, as shown in Figure 10B. The query effort continues in Figure 10B, where the IP address 172.10.1.102 is queried in the DIP column. Figure 10C shows the results after the two consecutive queries, shown in green in the figure.
Figure 10. Interactive database queries of a high dimensional dataset.
The spreadsheet-like visual layout in Figure 10 performs more effectively than many traditional OLAP data interfaces found in business intelligence tools. Most importantly, the data facets design allows users to query data in interactive time without pre-aggregating data into pre-defined options. An accompanying video shows how T-Rex operates on a number of benchmark datasets available in the public domain.
The general in-memory data cube technology has extensive commercial and public domain support and is here to stay until the next great technology comes along.