Designing BI Solutions in the Era of Big Data

4. Description of Case Study Concept

The case study focuses on designing a new "data-centric" business service in the banking domain; it is intentionally artificial and kept simple for ease of understanding. The main parts covered by the case study are (1) the idea behind the data-centric business product, (2) the extraction and load processes, and (3) the transformation of the big data storage's raw data and the implementation of data marts after transformation. The remaining parts, such as implementing the BI system and performing complex analysis, are left as future work.

Since addressing big data's 3Vs (Volume, Velocity and Variety) is important when selecting proper data sources, the following formats are used in the current case study: relational, graph and log.

A bank came up with the idea to attract more clients and to stimulate them to use more of the services offered by the bank. One way to attract new clients is through already existing clients, yet not in a way that these potential clients receive any kind of "spam" from the bank, but rather in a targeted manner, addressing the proper clients only. To meet this targeting goal, proper data sources must be selected and a huge amount of data needs to be analyzed. For this case, the following data sources are selected: (a) the user's social graph; (b) the user's transactional data; (c) logs from the bank's web server.

The user's social graph is used to identify potential new clients. The user's transactional data is used to understand the needs of the current client and to prepare a suitable business offer. The web server logs are used to identify the most active users and, possibly, their interest in particular products.
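To make the variety of the selected sources concrete, the sketch below shows what a single record from each source might look like; all identifiers, field names and values are purely illustrative assumptions, not taken from a real bank's systems.

```python
# Illustrative (made-up) sample records for the three selected data sources.

# (a) Social graph: an edge between two clients, e.g. exported as JSON.
social_edge = {"from_client": "C-10482", "to_client": "C-20931", "relation": "friend"}

# (b) Transactional data: a row from a relational table of card transactions.
transaction_row = ("C-10482", "2015-03-17", 249.90, "EUR", "travel")

# (c) Web server log: one raw line in Apache combined log format.
log_line = ('192.0.2.10 - C-10482 [17/Mar/2015:10:05:12 +0000] '
            '"GET /products/travel-insurance HTTP/1.1" 200 5123')
```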

The current implementation of the "Big Data Storage" contains two main components: an Apache Hadoop cluster (raw data storage) and SAP HANA (serving as an on-demand data analysis platform).

After fulfilling Step 1 of the guideline, the EA step (Section 3), the next step is to perform the extract and load processes. This step is done with Apache Hadoop, which is mainly used as a staging layer for the raw data. The interesting part here is that Hadoop can store a variety of formats, so during this step no aggregation or transformation is needed and, as a result, almost all data stays inside the storage. For the current example, (a) the logs from the web server, (b) the social graph, and (c) the users' transactions are extracted and loaded into the Apache Hadoop cluster, as sketched below.
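A minimal sketch of this load step, assuming the raw extracts are available as local files and that the cluster exposes WebHDFS; the `hdfs` Python client, host name and paths are illustrative assumptions rather than part of the described setup.

```python
from hdfs import InsecureClient  # assumed WebHDFS client library ("pip install hdfs")

# Hypothetical WebHDFS endpoint of the Hadoop cluster used as the staging layer.
client = InsecureClient('http://hadoop-master:50070', user='etl')

# Raw extracts are loaded as-is: no aggregation or transformation at this stage.
raw_extracts = {
    'weblogs/access.log':       '/staging/raw/weblogs/access.log',
    'graph/social_edges.json':  '/staging/raw/graph/social_edges.json',
    'transactions/tx_2015.csv': '/staging/raw/transactions/tx_2015.csv',
}

for local_path, hdfs_path in raw_extracts.items():
    client.upload(hdfs_path, local_path, overwrite=True)
```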

Steps 3 and 4 are performed to understand what exactly is important for the analysis and to transform only the data that is needed for it. Applying the transformation after loading, with the data staged in the distributed cluster, allows the performed transformation to be revised and re-planned, which makes it possible to try more hypotheses during the analysis.
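As one possible illustration of such a transformation, the sketch below shows a Hadoop Streaming mapper that keeps only the fields needed for the analysis from the raw web server logs; the log layout, field positions and client identifier are assumptions made for illustration only.

```python
#!/usr/bin/env python
# Hadoop Streaming mapper: extract (client id, URL, timestamp) from raw web logs.
import re
import sys

# Assumed Apache-style log line: ip - client_id [timestamp] "METHOD url PROTO" ...
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<client>\S+) \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+)')

def main():
    for line in sys.stdin:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip malformed lines instead of failing the whole job
        # Emit a tab-separated record with only the fields relevant to the analysis.
        print('\t'.join([match.group('client'), match.group('url'), match.group('ts')]))

if __name__ == '__main__':
    main()
```

Because the raw logs stay untouched in the staging layer, such an extraction rule can be changed and the job rerun whenever a new hypothesis requires different fields.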

While performing Step 5 of the guideline, which addresses designing the Virtual Data Mart Layer, the data is transformed and loaded into an in-memory database that allows manipulating the layers rapidly and efficiently. SAP HANA is used as the in-memory database; it has multiple data processing engines that meet the needs of online transaction processing (OLTP), online analytical processing (OLAP), graph processing and text processing at once.
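A minimal sketch of loading transformed records into SAP HANA for one such data mart, using the hdbcli Python driver; the connection details, table name, schema and records are illustrative assumptions, not the actual mart design.

```python
from hdbcli import dbapi  # SAP HANA Python client driver

# Hypothetical connection to the SAP HANA instance hosting the virtual data mart layer.
conn = dbapi.connect(address='hana-host', port=30015, user='DATAMART', password='***')
cursor = conn.cursor()

# A column-store table backing one data mart: page visits per client and product.
cursor.execute("""
    CREATE COLUMN TABLE MART_PRODUCT_INTEREST (
        CLIENT_ID NVARCHAR(20),
        PRODUCT   NVARCHAR(50),
        VISITS    INTEGER
    )
""")

# Load a few records produced by the transformation step on the Hadoop cluster.
rows = [('C-10482', 'travel-insurance', 17), ('C-20931', 'credit-card', 4)]
cursor.executemany("INSERT INTO MART_PRODUCT_INTEREST VALUES (?, ?, ?)", rows)
conn.commit()
```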