Addressing the Epistemological Challenges

To keep BDA from sliding into mere empiricism, we examined each BDA step and the critical question associated with performing it in order to identify the main epistemological challenges. As a result, we recommend a "lightweight theory-driven" approach, in contrast to "heavyweight theory-driven" research that is based solely on popular or relevant theories pertaining to the research. The advantage of the latter is the ability to derive generalizable research outcomes that are easily interpreted and compared. The disadvantage is that there may be conflicting theories to choose from, which makes it unclear whether a selected theory would hold in the application domain. Avoiding a theoretical commitment from the outset, a lightweight theory-driven BDA may start with data acquisition, followed by the remaining steps. For each step, we give recommendations so that researchers need not wait to resolve competing or conflicting theories. Lightweight theory-driven acquisition and preprocessing consists of activities such as data summarization, graphical representation, dimension reduction, and outlier detection. Dimension reduction, i.e. reducing the number of dimensions in the dataset, is normally accomplished via methods such as principal component analysis (PCA).

Lightweight theory-driven data and parameter selection for knowledge induction means relying on the existing body of knowledge and theories in order to go beyond a merely quantitative analytics approach. For instance, one way to do this is to map the constructs of the analytics onto known theoretical constructs. Preferably, multiple researchers contribute to the identification of constructs, with cross-disciplinary contribution in the mapping process. The rationale behind being multi-disciplinary is that a priori knowledge of the data and the represented domain sharpens the multi-dimensional understanding of analytics as a process and hence paves the road for sound conclusions that contribute to science and practice. This can still be considered 'lightweight' because reaching out to the data does not proceed hierarchically from theoretical concepts and constructs to the definition of respective variables, but starts from the given datasets, mainly 'to sort things out'.

Examining data quality, validity, and reliability, for example in data warehousing, usually involves questions such as: Is the data complete? Is it correct? Does it contain errors, and if so, how common are they? Are there missing values, and how common are they? In BDA, similar efforts are required, but at scale. In particular, the quality of big data should be assessed not just for each individual data source (e.g. OLTP systems), but also for any data that results from merging sources. Merging data is quite often the case in big data projects, and it causes potential problems, e.g. inconsistencies between the sources, that do not exist in any individual source. Also, in data warehousing the data is modeled upfront, i.e. before utilization, which makes it necessary to build the extract-transform-load (ETL) process before use and to apply the required quality and validity checks. In BDA, by contrast, the data is modeled afterwards, which changes the point at which data validity and reliability are assessed. In BDA research, attention to validity and reliability is required because the scale of the datasets is often huge, the variety of data types is high, and the model is built after the data has already been collected; overlooking data validity and reliability issues risks ending up with contaminated analytics and hence contaminated interpretations.
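Such checks can be sketched in a few lines at the level of merged records. The following is a minimal illustration, not a complete quality framework; all field names and values are invented:

```python
# Hypothetical records from two source systems about to be merged;
# "id" and "price" are illustrative field names, not from any real system.
source_a = [{"id": 1, "price": 10.0}, {"id": 2, "price": None}]
source_b = [{"id": 2, "price": 12.5}, {"id": 3, "price": 7.0}]

def missing_rate(records, field):
    """Share of records where `field` is absent or None (a per-source check)."""
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)

def cross_source_conflicts(a, b, key, field):
    """Keys present in both sources whose `field` values disagree --
    an inconsistency that only appears after merging the sources."""
    b_by_key = {r[key]: r.get(field) for r in b}
    return [r[key] for r in a
            if r[key] in b_by_key and r.get(field) != b_by_key[r[key]]]

print(missing_rate(source_a, "price"))                            # 0.5
print(cross_source_conflicts(source_a, source_b, "id", "price"))  # [2]
```

The second function makes the point in the text concrete: record 2 is unproblematic inside each individual source, and the conflict only becomes visible once the sources are examined together.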

Finally, a theoretical framework should govern BDA because complacency about the modeling technique causes epistemological problems in result interpretation. Without such a framework, the selection of the technique might be based mostly on tool availability, the knowledge of the data scientist or researcher, and/or being politically friendly to stakeholder expectations, with all the bias problems that might arise. Instead, models should be selected based on the given data, the problem at hand, and the model assumptions. For instance, we should not use correlation in a problem for which we need to know cause and effect, since a correlation coefficient does not imply causality. The same goes for association rules, which indicate frequencies and co-occurrences but have little predictive power. Conversely, SVMs are appropriate when the problem is nonlinear and feature selection is required.

Table 1 summarizes this effort and indicates possible theoretical contributions (explained below) that could guide researchers through mastering the identified epistemological challenges.

BDA step: Acquisition
Critical questions: What data do I need? What kinds of data [sets] are available/to be selected?
Epistemological challenge: Data 'sampling'
Possible lightweight theory-driven guidance: Apply data summarization, graphical representation, dimension reduction (e.g. PCA), and outlier detection; ensure multi-expert and multi-disciplinary participation in data reduction and selection

BDA step: Pre-processing
Critical questions: How can data [sets] be represented and processed without falsification or insight loss?
Epistemological challenge: Data validity and reliability
Possible lightweight theory-driven guidance: Trace and examine all stages of extract, transform, load, and merge for completeness, correctness, and consistency

BDA step: Analytics
Critical questions: Which method[s] to use? What rules govern conclusions from these data [sets]?
Epistemological challenge: Knowledge discovery
Possible lightweight theory-driven guidance: Map the constructs of analytics to known theoretical concepts; ensure multi-expert and multi-disciplinary participation in parameter selection and in mapping analytical constructs to theoretical concepts; develop/apply a theoretical framework for the choice of techniques (mining, machine learning, statistics) or models

BDA step: Interpretation
Critical questions: How to interpret such conclusions?
Epistemological challenge: Non-/interpretability; reliability of prediction
Possible lightweight theory-driven guidance: Develop/apply a theoretical framework for result interpretation


Data summarization

The idea of data summarization is simple. For example, in order to understand the relationship between qualifications and income, the dataset could be viewed by plotting the average income by qualification level. Such a summary will be sufficient for some purposes, but if the outcome of summarization is used in fact-based decision making, more time is required to achieve a better understanding of the data. A simple improvement would be to include the standard deviation along with the averages. Further, it may be more revealing, for example, to break down the average income levels by age group, or to exclude outlier incomes. Moreover, the relationship between income and qualifications may vary between men and women, or by geography. Overall, effective summarization involves identifying both overall trends and important exceptions to them.
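The kind of summarization described above can be sketched as follows; the qualification/income records are invented, and the last record is a deliberate outlier to show how it distorts a group's mean and standard deviation:

```python
from statistics import mean, stdev

# Invented (qualification, age_group, income) records for illustration.
records = [
    ("bachelor", "20-29", 40_000), ("bachelor", "30-39", 55_000),
    ("bachelor", "30-39", 52_000), ("master", "20-29", 50_000),
    ("master", "30-39", 70_000),  ("master", "30-39", 250_000),  # outlier
]

def summarize(rows, key_pos):
    """Mean and standard deviation of income, grouped by the column
    at `key_pos` (0 = qualification, 1 = age group)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key_pos], []).append(row[2])
    return {g: (mean(v), stdev(v) if len(v) > 1 else 0.0)
            for g, v in groups.items()}

by_qualification = summarize(records, 0)  # averages plus spread per group
by_age_group = summarize(records, 1)      # the same data, broken down differently
```

The single outlier income inflates both the mean and the standard deviation of the "master" group, which is exactly why the text recommends reporting spread alongside averages and considering outlier exclusion.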


Graphical representation

Graphical techniques aid users in managing and displaying data in an intuitive manner. Visualization can be helpful in the discovery of relationships and dependencies that may exist within the dataset. The core issue here is to represent multidimensional datasets effectively without overwhelming the human ability to comprehend the resulting graphs. Data summarization can reduce the size and complexity of multidimensional datasets. This can highlight the relevant aspects of the dataset more clearly, leading to more coherent visualizations and facilitating more accurate and efficient visual analytics.


Outlier detection

In the following figure, the top-rightmost point appears to be an outlier when we consider how far it lies from the rest of the data points across both axes. However, if we look at the same point with regard to the x-axis or the y-axis alone, it will not be identified as an outlier. Outlier detection techniques can be categorized, based on the number of variables or dimensions used to define the outliers, into two categories: univariate outlier detection, where outliers are detected for only one variable at a time, and multivariate outlier detection, where more than one variable is taken into account when defining the outliers. Univariate outlier detection alone is most probably insufficient, so in BDA analyzing the dataset with only univariate methods leads to an epistemological pitfall. The correction is to rely on multivariate techniques for outlier detection (Fig. 3).


Fig. 3: Outlier detection
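The univariate-versus-multivariate contrast can be made concrete with a small sketch. The points below are invented: they lie roughly on the diagonal, plus one point (2, 8) that is unremarkable on either axis alone but far from the joint pattern, mirroring the figure's argument:

```python
from statistics import mean, stdev

# Invented 2-D points lying roughly on the diagonal, plus (2, 8),
# which is moderate on each axis but far from the joint trend.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
          (6, 6), (7, 7), (8, 8), (9, 9), (2, 8)]

def univariate_z(values):
    """z-score of each value against its own variable only."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def mahalanobis_2d(pts):
    """Multivariate distance using the 2x2 sample covariance matrix."""
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    mx, my, n = mean(xs), mean(ys), len(pts)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pts) / (n - 1)
    det = sxx * syy - sxy * sxy          # invert the 2x2 covariance by hand
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    return [((x - mx) ** 2 * ixx + 2 * (x - mx) * (y - my) * ixy
             + (y - my) ** 2 * iyy) ** 0.5 for x, y in pts]

# Univariate view: (2, 8) has |z| < 1 on both axes, so it is never flagged.
# Multivariate view: its Mahalanobis distance is the largest in the set.
```

This is the epistemological pitfall in miniature: per-axis screening declares the dataset clean, while the joint (multivariate) view singles out exactly the anomalous point.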


Dimension reduction

Dimension reduction refers to the process of converting a dataset with many dimensions into a dataset with fewer dimensions while conveying similar information. Dimension reduction techniques are used in BDA in order to obtain better features for a classification or regression task. We can reduce the n dimensions of a dataset to k dimensions (where k < n). The k dimensions can be directly identified from the originals, be combinations of dimensions (e.g. weighted averages), or be new dimensions that represent multiple existing dimensions. Dimensionality reduction also takes care of multicollinearity by removing redundant features, i.e. variables exhibiting high mutual correlation, which could lower the predictive power of a model. Both factor analysis and principal component analysis are used for dimension reduction. Factor analysis assumes that the dataset contains highly correlated variables. These variables can be grouped by their correlations, so that variables within a group are highly correlated with each other but exhibit low correlation with variables of other groups; each group then represents a single underlying construct or factor. These factors are few in number compared to the large number of original dimensions. Using principal component analysis (PCA), on the other hand, the variables are transformed into a new set of variables that are linear combinations of the original variables, known as principal components. They are obtained in such a way that the first principal component accounts for as much of the variation in the original data as possible, and each succeeding component has the highest possible variance. The second principal component must be orthogonal to the first; in other words, it does its best to capture the variance in the data that is not captured by the first principal component. For a two-dimensional dataset, there can be only two principal components.


ETL

ETL refers to the extract, transform, and load process. Normally it is associated with data coming into the data warehouse from multiple source systems. During extraction, variables are extracted from various sources, e.g. an ERP database. In the transform step, data is converted into the desired structure, e.g. source systems hold price and quantity while the target system, the data warehouse, holds the total, i.e. price multiplied by quantity. The transformation may also include filtering unwanted data, sorting, aggregating, joining, cleaning, and validating data according to the business need. Lastly, in the load step the transformed data is loaded into a destination target, which might be a database or a data warehouse. One of the issues that might occur during ETL is improper or incomplete mapping, i.e. not all columns in the source systems are mapped to the destination system. If this happens during the data acquisition step of BDA, it leads to data loss or errors and hence impacts the knowledge outcome and the model's predictive power.
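The three steps, including the incomplete-mapping pitfall, can be sketched as follows; all column names and values are illustrative, not from any real system:

```python
# Invented source rows; "note" stands in for a column the mapping forgot.
source_rows = [
    {"order_id": 1, "price": 10.0, "quantity": 3, "note": "rush"},
    {"order_id": 2, "price": 4.5,  "quantity": 2, "note": ""},
]

def extract(rows, mapped_columns):
    """Extract only the mapped columns; unmapped ones are silently dropped --
    exactly the incomplete-mapping risk described above."""
    return [{c: r[c] for c in mapped_columns if c in r} for r in rows]

def transform(rows):
    """Derive total = price * quantity, as in the price/quantity example."""
    return [dict(r, total=r["price"] * r["quantity"]) for r in rows]

def load(rows, warehouse):
    """Append the transformed rows to the destination target."""
    warehouse.extend(rows)

warehouse = []
mapped = extract(source_rows, ["order_id", "price", "quantity"])
load(transform(mapped), warehouse)
# "note" never reaches the warehouse: silent data loss from incomplete mapping.
```

Because the loss is silent, any model later trained on the warehouse simply never sees the dropped column, which is how a mapping defect during acquisition propagates into reduced predictive power.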


Merge for completeness

Big data is generated at different source systems, so bringing such data together is a challenge. For example, in order to ensure completeness of the dataset in BDA, the data is to be retrieved from n sources: S1, S2, …, Sn. All source systems must send their data records to the central repository, and it should be possible to define a relational operation that reconstructs the dataset from the multiple sources. Such reconstruction can be horizontal, via the union operator, or vertical, via the natural join operator. For example, in case the dataset is split horizontally, we can retrieve all records into the central repository, the data warehouse, by combining selections with a union. That is, if the dataset comes from two horizontal sources, full reconstruction requires: σ type = 'A' (S1) ∪ σ type = 'A' (S2). Vertically split sources are reconstructed by joining on a shared key, e.g.: Π ProductNo, ProductName (S1 ⋈ S2). Failing to reconstruct the dataset completely leads to low prediction accuracy for the model used.
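The two reconstruction operations can be sketched with plain Python stand-ins for the relational union and natural join; the table and column names below are illustrative:

```python
# Horizontal split: the same schema, rows spread over two sources.
s1 = [{"ProductNo": 1, "type": "A"}, {"ProductNo": 2, "type": "B"}]
s2 = [{"ProductNo": 3, "type": "A"}]

def union(a, b):
    """Horizontal reconstruction: set union of the split sources,
    deduplicating identical rows as the relational operator would."""
    seen, out = set(), []
    for r in a + b:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

# Vertical split: columns spread over two sources sharing the key ProductNo.
v1 = [{"ProductNo": 1, "ProductName": "bolt"}]
v2 = [{"ProductNo": 1, "Price": 0.2}]

def natural_join(a, b, key):
    """Vertical reconstruction: natural join on the shared key column."""
    b_by_key = {r[key]: r for r in b}
    return [dict(r, **b_by_key[r[key]]) for r in a if r[key] in b_by_key]

full_horizontal = union(s1, s2)                     # all three rows recovered
full_vertical = natural_join(v1, v2, "ProductNo")   # all columns recovered
```

If a source fails to deliver, the union is missing rows and the join is missing columns; either way the reconstructed dataset is incomplete, which is the completeness failure the text warns about.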

Fundamentally, BDA automates the knowledge discovery process from data, or datasets, in order to make predictions. Such discovery is a genuine machine science in which all process steps are subject to automation. Generating new theories is among the roles predictive analytics is expected to play (see above), achieved through the development or improvement of models. Such theories aim to predict a variable in the future, given a set of explanatory variables or predictors. Some theories may even be able to explain the causal relationship between the independent and the dependent variables, while others lack such explanatory power.

One of the core questions for science and practice regarding utility is: what are the necessary epistemological preconditions that make predictions based on model-based data processing acceptable to human stakeholders? Here we consider the primary criterion to be performance: the success of the prediction (i.e. it turns out to be true) is far more important than how we have reached it. Our rationale is grounded in the counter-question: what could be more relevant for assessing a prediction than its correctness? If a theory is not able to correctly predict the future, we also start to question every aspect of it (constructs, variables, relationships, assumptions, context) and/or set it aside.

There is an argument that a prediction without explanation is inferior, and that BDA-based predictions, which often lack explanatory power, are therefore 'incomplete'. However, often enough, predictions based on explanatory theories are not accurate (enough). And in many real-life cases prediction precedes explanation, i.e. certain phenomena (e.g. an epidemic or seasonal sales) can and/or should be predicted with the reasons behind them revealed only later (if at all). It is certainly the case that, to some extent, BDA-based prediction lacks explanatory power. But especially at the beginning of exploring new phenomena, this should not be considered an argument to rule out the application of BDA. In such cases, the acceptability of the prediction and the trust in its results should rather be derived from the transparency and the lightweight theory-driven governance of the BDA process.