BDA Challenges

Researchers and practitioners alike face many challenges when using big data analytics (BDA) for prediction, such as the privacy and security of big data, platform scalability, and integration. For our purposes, however, we focus only on those challenges that require epistemological reflection because of the bias they introduce into current practice, namely 'streetlight' research and data monetization.


'Streetlight' research

Big data is passively created and continuously collected, which has opened the door to a great deal of research. Research, however, should be formulated around important problems. It has recently been noted that big data research may suffer from the so-called 'streetlight' effect, that is, the tendency of researchers to study phenomena for which a plethora of data exists instead of studying relevant problems. To explain, most experiments and data-analytic studies rely on data from the biggest data-driven companies, e.g., Facebook, Twitter, Google, LinkedIn and Amazon. A large percentage of such studies focus on data that those companies collected for internal purposes and then made available to researchers. Such data may therefore be biased towards solving those companies' problems, and not necessarily the grand problems.

For instance, it has been shown that Twitter has become one of the favorite destinations for BDA research. Researchers justify the choice of Twitter by its relatively high level of accessibility and the relative openness of its API. Together, these two factors have led to a substantial number of studies dealing with Twitter data. Regardless of the case, however, the relative ease of data collection and analytics always entails the risk, and the bias, of 'streetlight' research in BDA.

The "We Are Social Report" (2016 Digital Yearbook) ranks Twitter 9th in popularity among social platforms, with 320M users, while other platforms have roughly three to five times that number, such as WhatsApp (900M) and Facebook (1.5B). Twitter is certainly not the largest pool of users, and some of its accounts are operated by bots, not humans. Furthermore, many companies use it as a way to boost sales, so analyzing "only" tweets is indeed biased. Lastly, research has observed that Twitter enables effective broadcasting not only of valid news but also of rumors; as a matter of fact, false rumors spread more quickly.
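The bias described above can be illustrated with a minimal simulation. The sketch below is not drawn from any real dataset: the population size, the share of people on the platform, and the two support probabilities are all hypothetical values chosen only to show how an estimate based solely on platform users can diverge from the population-wide value.

```python
import random


def simulate(n_people=100_000, platform_share=0.1, seed=42):
    """Build a hypothetical population of (on_platform, supports_opinion) pairs."""
    rng = random.Random(seed)
    population = []
    for _ in range(n_people):
        on_platform = rng.random() < platform_share
        # Assumption for illustration only: platform users hold the
        # opinion more often (0.7) than everyone else (0.4).
        p_support = 0.7 if on_platform else 0.4
        population.append((on_platform, rng.random() < p_support))
    return population


def support_rate(records):
    """Fraction of records whose second field (support) is True."""
    return sum(1 for _, supports in records if supports) / len(records)


population = simulate()
platform_only = [r for r in population if r[0]]

true_rate = support_rate(population)            # what we want to know
streetlight_rate = support_rate(platform_only)  # what the easy data shows

print(f"population-wide support: {true_rate:.3f}")
print(f"platform-only support:   {streetlight_rate:.3f}")
```

Under these assumptions the platform-only estimate overstates support, because the accessible subgroup differs systematically from the population the research question is about.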

Since researchers can only analyze existing data, many are tempted not to formulate a clear research question or problem that would determine which data are needed. As a consequence, the range of insights that can be generated remains limited, often without researchers being aware of it.


Data monetization

Data monetization is the ability of a company to generate money from its available datasets, in part or as a whole. In today's environment, companies have become aware of what the phrase "data is the new oil" means. Accordingly, each company is sitting on vast amounts of data that need to be utilized for value creation. Data monetization can be implemented either directly or indirectly. Direct data monetization means selling the company's dataset, or part of it, as such. Indirect monetization uses the dataset to create new products and services, as when Amazon uses its customer records to suggest other products, or Alibaba offers targeted finance. Another form of indirect monetization takes place when a company barters its datasets.

Researchers can access data from data-driven companies, e.g., Twitter, Facebook and Google, via two mechanisms: an API (aka 'garden hose') or a 'firehose'. An API, or application programming interface, is a tool created for developers to interact with data producers. For instance, Twitter has created an open API that allows developers to source Twitter data. The major advantage of the API is that it promotes external, data-based innovation. Offering data externally allows developers to create products, platforms, and interfaces without the need to expose the raw data. As a byproduct, Twitter capitalized on this model by acquiring ten different companies in 2012 that had been built around its open API.

The 'firehose' closely resembles the streaming API. The Twitter firehose guarantees delivery of 100% of the tweets that match the researcher's search criteria. Data providers such as GNIP and DataSift handle the Twitter firehose. Access rests on an agreement between the researcher and a distributor of the firehose, e.g., GNIP, on which tweets the researcher should receive. As the data providers receive tweets, they push them directly to the end user.

The Twitter API is offered for free, but the Twitter firehose, which removes many of the usage restrictions imposed by Twitter, comes at a fee that not all researchers can afford. That fee is an instance of "data monetization" by Twitter. Of course, researchers need to delimit their scope based on the data available. The key issue is to be aware of the limitations of the datasets and tools employed, and to detail one's research approach accordingly.
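To make the practical difference between the two access mechanisms concrete, the following is a minimal sketch under stated assumptions. It does not call any real Twitter, GNIP or DataSift interface: the functions firehose and capped_api, the tweet records, and the cap of 500 are hypothetical stand-ins for full delivery and for a usage restriction, respectively.

```python
import random


def make_tweets(n, seed=0):
    """Generate n hypothetical tweets, each with a randomly assigned topic."""
    rng = random.Random(seed)
    topics = ["election", "sports", "music"]
    return [{"id": i, "topic": rng.choice(topics)} for i in range(n)]


def firehose(tweets, topic):
    """Full delivery: every tweet that matches the search criterion."""
    return [t for t in tweets if t["topic"] == topic]


def capped_api(tweets, topic, cap):
    """Restricted delivery: matching tweets, truncated at a hypothetical cap."""
    return firehose(tweets, topic)[:cap]


def coverage(delivered, full):
    """Share of all matching tweets that was actually delivered."""
    return len(delivered) / len(full) if full else 0.0


tweets = make_tweets(10_000)
full = firehose(tweets, "election")
partial = capped_api(tweets, "election", cap=500)

print(f"firehose delivered:   {len(full)} tweets")
print(f"capped API delivered: {len(partial)} tweets")
print(f"coverage of capped API: {coverage(partial, full):.1%}")
```

Reporting a coverage figure of this kind alongside the results is one simple way for a study to state the limitation of its dataset explicitly.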