Designing BI Solutions in the Era of Big Data

Site: Saylor Academy
Course: BUS610: Advanced Business Intelligence and Analytics
Book: Designing BI Solutions in the Era of Big Data

Description

This article examines how business intelligence models must change to remain feasible within an organization. It proposes a new process that challenges both ETL (extract, transform, load) and ELT (extract, load, transform) by extending them into ELTA (extract, load, transform, and analyze). To challenge your thinking, apply the concept to the everyday task of loading dirty dishes into a dishwasher.

Abstract

This work presents a new approach for designing business intelligence solutions. In the era of big data, established and robust analytical concepts and utilities need to adapt to changed market circumstances. The main focus of the work is, on one side, to accelerate the process of building a "data-centric" business intelligence solution and, on the other, to prepare business intelligence solutions for big data utilization. The research addresses the following goals: (a) reduce the time spent during the business intelligence solution design phase; (b) achieve flexibility of the business intelligence solution by removing problems with adding new data sources; (c) prepare the business intelligence solution for utilization of big data concepts. The research proposes an extension of the existing Extract, Load and Transform (ELT) approach to a new one: Extract, Load, Transform and Analyse (ELTA).



Source: Pablo Michel Marín-Ortega, Viktor Dmitriyev, Marat Abilov, and Jorge Marx Gómez, https://www.sciencedirect.com/science/article/pii/S2212017314002424
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.

1. Introduction

Companies follow different strategies in order to be competitive. According to Competitive Advantage: Creating and Sustaining Superior Performance, advantage can be derived from the following two aspects: (1) operational efficiency and (2) unique value creation for customers. Both aspects involve building an enterprise structure and designing business processes in a systemic and unique way.

For these reasons, processes for discovering new business value based on historical business behavior (extracted from data) are emerging as a way to overcome competitors. This can be achieved with the support of business intelligence (BI). According to An architecture for ad-hoc and collaborative business intelligence, current BI implementations suffer from several shortcomings:

  • Missing focus on the individual needs of particular analysts or decision makers. These users are forced to rely on standard reporting and predefined analytical methods that often do not answer all the needs of the individual. They depend strongly on either IT administration or advanced technical skills.
  • The lack of business context information, such as definitions, business goals, and strategies, as well as business rules or best practices for the provided analytical data. Hence, business users have to understand the semantics of the data themselves, and they have to take decisions and derive strategies using additional information sources, which often leads to an escalation of effort and cost.
  • Poor alignment between the business and IT departments. The setup and configuration of current BI systems requires deep insight into both the data to be analysed and the intended analytical tasks. Content and data models have to be provided in advance by the IT department, which must supply all the information needed in the decision-making process.
  • The modal time for new BI implementations is between 3 and 6 months, causing implementation and support costs that often deter companies from wider BI deployment.
  • BI solutions have a strong focus on structured, enterprise-internal data but lack the capability of integrating external and/or unstructured information in an easy, (near) real-time, and effective way. As a consequence, a lot of useful information is never included in the analysis. Not considering this information can provide a distorted or incomplete view of the actual world and, consequently, lead to wrong business decisions.

The current work focuses on presenting a new approach for designing BI solutions. The presented approach addresses the following goals: (a) reducing the time spent on the BI solution design phase; (b) achieving flexibility in the BI solution by removing "data agnosticism"; (c) preparing the BI solution to be used with big data. The research extends the existing ELT (Extract, Load and Transform) concept into ELTA (Extract, Load, Transform and Analyse).

2. Related Works

  

2.1. Business Intelligence and Big Data

Business intelligence systems support and assist in decision-making processes. They also take part in an organization's strategic plan, which normally addresses the achievement of management effectiveness. BI is defined as "a set of methodologies, processes, architectures and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making". Effective BI systems give decision makers access to quality information, enabling them to accurately identify where the company has been, where it is now, and where it needs to be in the future. Despite the immense benefits that an effective BI system can bring, numerous studies have shown that the usage and adoption of BI systems remain low, particularly among smaller institutions and companies with resource constraints.

According to Study on Port Business Intelligence System Combined with Business Performance Management, a BI system should have the following basic features:

  • Data Management: including data extraction, data cleaning, data integration, as well as efficient storage and maintenance of large amounts of data
  • Data Analysis: including information queries, report generation, and data visualization functions
  • Knowledge Discovery: extracting useful information (knowledge) from the rapidly growing volumes of digital data in databases

The most important prerequisite for succeeding in building a BI solution is to perform well at the Data Management stage. Being the foundation of a BI solution, Data Management is usually its most stressful and time-consuming part. Nowadays, many companies offer their own solutions. However, their applications do not ensure that all the information necessary for the decision-making process will be available. Rather than focusing on the information necessary to build good solutions, most of these providers focus on technological aspects. Such behavior does not satisfy real business needs and ignores the fact that there is no alignment between the business and technological domains.

Informally, big data is defined by the limitations of the analytics and storage capabilities of standard data processing tools such as database management systems. Formally, big data is characterized by the triple "V": volume, velocity, and variety. Volume refers to the data processing limitations that come from data's huge size. Velocity argues that data input speed is also crucial, because data is generated and inserted into data storage at high speed. Variety states that data comes from different heterogeneous sources (social networks, sensors, transactional data, etc.).

Although big data is something of a buzzword, businesses cannot ignore it without losing competitiveness in the market. Datameer Inc. (2013) reported that the major goals for companies implementing big data are to: (1) increase revenue, (2) decrease costs, and (3) increase productivity.

Data and the extraction of knowledge from it are two different things, but they cannot be separated. Once data is stored, proper analytical methods must be applied in order to get value out of it. There are mainly two ways to implement analytics over data: SQL and MapReduce. SQL has proved its applicability through a long and robust history of usage (more than 40 years). While MapReduce appeared less than a decade ago, it is already one of the most popular programming models for supporting complex analysis over huge volumes of structured and unstructured data. Multiple studies state that SQL was not designed for current needs, and that new models, like MapReduce, should deal with the analytical challenges posed by the era of big data. However, the works A comparison of approaches to large-scale data analysis and A performance comparison of parallel DBMSs and MapReduce on large-scale text analytics showed that database management systems with SQL on board were significantly faster and required less code to implement information extraction and analytical tasks. On the other hand, the process of database tuning and data loading takes more time in comparison with MapReduce.
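The MapReduce programming model mentioned above can be illustrated with the canonical word-count example, written here as a minimal plain-Python sketch of the map, shuffle, and reduce phases (the function names are illustrative and not tied to any particular framework):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (key, value) pair for every word occurrence.
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework would do between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values of each key.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs new tools", "big data needs analysis"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle_phase(pairs))  # e.g. counts["big"] == 2
```

In a real cluster the map and reduce functions would run in parallel over data partitions; the value of the model is that only these two functions have to be written.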

As mentioned, the major goals pursued with big data are to: (1) increase revenue, (2) decrease costs, and (3) increase productivity. These three outcomes are very desirable for any BI project. In the typical architecture of a BI system, it is very common to have a data warehouse holding all the information needed, or even several data marts joined together to form the data warehouse.

2.2. ETL vs ELT

In a typical BI infrastructure, data extracted from Operational Data Sources (ODS) is first transformed, then cleaned and loaded into a data warehouse. Before data is loaded into a data warehouse, it is necessary to process the "raw data" for a variety of reasons. For example, a data warehouse typically consolidates a multitude of different ODSs with different schemas and metadata behind them. Hence, incoming data must be normalized. Also, an ODS may contain erroneous, corrupted, or missing data, so processes of cleaning and reconsolidation are needed. This preprocessing is commonly known as Extract, Transform and Load (ETL): data is first extracted from the original data source, then transformed (including normalization and cleansing), and finally loaded into the data warehouse.
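The ETL order of operations described above might be sketched as follows; the two source schemas and the cleansing rules are invented for the example:

```python
# Hypothetical rows from two operational data sources with different schemas.
ods_a = [{"cust": " Alice ", "amount": "10.5"}, {"cust": "Bob", "amount": None}]
ods_b = [{"customer_name": "carol", "total": "7.0"}]

def extract():
    # Extract: pull raw rows from every operational data source.
    yield from (("A", row) for row in ods_a)
    yield from (("B", row) for row in ods_b)

def transform(source, row):
    # Transform BEFORE loading: normalize the schemas and clean the values.
    if source == "A":
        name, amount = row["cust"], row["amount"]
    else:
        name, amount = row["customer_name"], row["total"]
    if amount is None:
        # Corrupted/missing data is discarded here, before loading,
        # and is therefore lost to any later analysis.
        return None
    return {"customer": name.strip().title(), "amount": float(amount)}

warehouse = []  # stands in for the data warehouse

for source, row in extract():
    record = transform(source, row)
    if record is not None:
        warehouse.append(record)  # Load: only the transformed subset persists
```

Note how the transformation is fixed before anything reaches the warehouse; this is exactly the rigidity that the next paragraphs criticize.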

While the database technologies used for data warehousing have seen tremendous performance and scalability enhancements over the past decade, ETL has not improved in scalability and performance to the same degree. As a result, most BI infrastructures are increasingly experiencing a bottleneck: data cannot easily reach the data warehouse with the necessary timeliness. Clearly, in order to provide near real-time BI, this bottleneck needs to be resolved.

The cost of data storage was always a significant factor, but storage becomes cheaper over time, and as a result analysis can be performed over larger amounts of data with less investment. In these changing circumstances, the established (but robust) Extract, Transform and Load approach cannot easily answer all the needs of a business that includes work with big data, so a new approach and/or architectural changes are needed. The main disadvantage of ETL is that data must first be transformed and only then loaded. This means that during the transformation phase, mass amounts of potentially valuable data are thrown away. To eliminate the drawbacks of ETL, improvements in recent storage techniques can be used. One approach that addresses these challenges is called Extract, Load, and Transform (ELT). The basic idea is to perform the Load process immediately after the Extract process, and apply the Transformation only after the data is stored. ELT, in comparison with ETL, has the following four advantages: (1) flexibility in adding new data sources (the EL part); (2) aggregation can be applied multiple times on the same raw data (the T part); (3) the Transformation process can be re-adapted, even on legacy data; (4) a sped-up implementation process.
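For contrast, the ELT idea can be sketched with the same kind of toy data: raw rows are loaded untouched, and transformation becomes a re-runnable step over the stored data (all names are illustrative):

```python
raw_store = []  # stands in for cheap raw storage (e.g. a staging layer)

def extract_and_load(rows, source):
    # E + L: persist everything immediately; nothing is thrown away.
    raw_store.extend({"source": source, **row} for row in rows)

def transform(rule):
    # T: applied on demand, and re-runnable later with a different rule.
    results = (rule(row) for row in raw_store)
    return [r for r in results if r is not None]

extract_and_load([{"cust": "Alice", "amount": "10.5"},
                  {"cust": "Bob", "amount": None}], "A")

# First business requirement: the list of valid amounts.
valid = transform(lambda r: float(r["amount"]) if r["amount"] else None)

# Requirements change: now we also want the clients with missing amounts.
# The raw rows are still in place, so we simply re-run with a new rule.
rejected = transform(lambda r: r["cust"] if r["amount"] is None else None)
```

The second call is the point: under ETL the rows with missing amounts would already have been discarded, while under ELT a new transformation can still reach them.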

According to Competitive Advantage: Creating and Sustaining Superior Performance, enterprise competitiveness relies heavily on the time elapsed during the decision-making process. For making decisions, BI solutions have become the de facto standard. As time is such an important factor, it is also crucial to design a BI solution in a shorter period of time. One of the shortcomings, according to An architecture for ad-hoc and collaborative business intelligence, is that the time spent on the implementation phase of a BI solution increases costs. Another shortcoming is that a BI solution needs to be flexible in order to reflect environmental changes and adopt them in the shortest possible period of time.

The ETL process does not address flexibility in terms of reflecting environmental changes, and classical BI solutions need a vast amount of time to be implemented. The ETL process applies transformation after extraction and before loading, which causes data to be inserted into the data warehouse only during the last phase. In contrast, ELT first extracts and loads the data, and then applies transformation on demand according to business needs. Moreover, transformation with ELT can be applied and re-applied to take into account changes in business requirements. For the above reasons, it is preferable to adopt Extract, Load, and Transform (ELT) instead of Extract, Transform, and Load (ETL) in BI solutions.

3. Proposed Methodology

  

3.1. Introduction and Model

ELTA stands for Extract, Load, Transform and Analyse. The authors define the term as follows: the Extract process enables data extraction from heterogeneous sources in heterogeneous formats (transactional data, machine-generated data, etc.); the Load process provides the ability to store data inside a storage system; the Transform process provides the ability to transform data from its raw state (a) on demand and (b) according to the needs of the decision-making process; the Analyse phase lets business users efficiently utilize the preprocessed data to understand enterprise behavior through analysis.
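As a rough sketch (not the authors' implementation), the four ELTA phases could be arranged as a minimal pipeline skeleton, with all names invented for illustration:

```python
class ELTAPipeline:
    def __init__(self):
        self.raw_store = []  # big data storage for untouched raw records

    def extract_and_load(self, source_name, records):
        # E + L: raw records from heterogeneous sources go straight to storage.
        self.raw_store.extend({"source": source_name, **r} for r in records)

    def transform(self, predicate, projection):
        # T: on demand, driven by the current decision-making needs.
        return [projection(r) for r in self.raw_store if predicate(r)]

    def analyse(self, mart, metric):
        # A: business users derive insight from the transformed data mart.
        return metric(mart)

pipeline = ELTAPipeline()
pipeline.extract_and_load("transactions", [{"amount": 10}, {"amount": 30}])
pipeline.extract_and_load("web_logs", [{"hits": 5}])

# A transformation shaped by one concrete reporting need...
mart = pipeline.transform(lambda r: r["source"] == "transactions",
                          lambda r: r["amount"])
# ...and an analysis step over the resulting mart.
average = pipeline.analyse(mart, lambda m: sum(m) / len(m))
```

The separation matters: new sources only touch `extract_and_load`, while changed reporting needs only touch the `transform` and `analyse` arguments.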

Based on the Framework to Design a Business Intelligence Solution approach, a framework to define an Enterprise Architecture (EA) as the solution's foundation is required. There are various EA frameworks. Among them, the Zachman Framework is selected as the base EA framework. Although the Zachman Framework lacks modelling for detailed EA components and the relationships among them, and does not provide a concrete implementation method, it is valuable in that it presents a general framework which every enterprise can use to build its own EA. Besides, "the Zachman Framework is an ontology - a theory of the existence of a structured set of essential components of an object for which explicit expression is necessary, and perhaps even mandatory for creating, operating, and changing the object (the object being an enterprise, a department, a value chain, a solution, a project, an airplane, a building, a product, a profession, or whatever)". We consider only the first four rows of the framework, which are defined as follows: (1) strategy model, (2) business model, (3) system model, and (4) technology model.

In accordance with the above, the proposed model is composed of EA components, a Balanced Scorecard (BSC), BI components, and the relationships between them, as shown in Figure 1.

3.2. Guideline

This section gives the detailed guideline that describes the proposed model.

Step 1 EA fulfilment: According to the structure defined in The Zachman Framework: The Official Concise Definition, the Zachman EA must be completed row by row, where each row represents a higher level with respect to the one that follows; nevertheless, there are strong dependencies among the elements of the columns. Table 1 shows the proposed dependencies between cells. The order in which the cells must be filled in depends on the relationships between them.

 

Fig. 1. ELTA Proposed Model


Step 2 Extract and Load Processes: Based on the information defined in the EA from step 1, the IT users can extract all the information necessary for the business from heterogeneous data sources and load it into a big data storage. This step should be implemented by the IT users.

   Table 1. Proposed rules to fulfil Zachman EA.

 

|                    | What          | How        | Where         | Who           | When          |
|--------------------|---------------|------------|---------------|---------------|---------------|
| Scope Contents     | A1            | B1         | C1            | D1            | E1            |
| Business Concepts  | A2=(A1)       | B2=(B1+A2) | C2=(C1+B2)    | D2=(D1+B2+C2) | E2=(E1+A2+C2) |
| System Logic       | A3=(A2+B2+F2) | B3=(B2+F2) | C3=(C2+A3+B3) | D3=(D2+F2+B3) | E3=(E2+B3+C3) |
| Technology Physics | A4=(A3)       | B4=(B3+A4) | C4=(C3+A4+B4) | D4=(D3+A4+B4) | E4=(E3+D4)    |

 

Step 3 Management Control Tool: The main goal of this step is to define all the information necessary for the decision-making process. The idea is to use data from the big data storage to create new global indicators for the Balanced Scorecard perspectives, as described in Compensatory Fuzzy Logic Uses in Business Indicators Design. This reduces the gap between the strategic and tactical levels, because it becomes possible to know how the indicators from the different management levels are linked, improving enterprise knowledge. The methodology includes one step with Principal Component Analysis (PCA) in order to discover the correlations among all the indicators. This step should be performed by the business users.
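The PCA step can be illustrated for the simplest case of two indicators, where the eigenvalues of the 2x2 covariance matrix have a closed form. This is a toy sketch with invented indicator values; a real scorecard with many indicators would use a linear-algebra library:

```python
import math
from statistics import mean

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def first_component_share(xs, ys):
    # Covariance matrix [[sxx, sxy], [sxy, syy]] of the two indicators.
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    root = math.sqrt(tr * tr / 4 - det)
    l1, l2 = tr / 2 + root, tr / 2 - root
    # Share of total variance explained by the first principal component:
    # close to 1.0 means the two indicators are strongly correlated.
    return l1 / (l1 + l2)

# Two invented, perfectly correlated indicators: one component explains all.
revenue_index = [1.0, 2.0, 3.0, 4.0]
cost_index = [2.0, 4.0, 6.0, 8.0]
explained = first_component_share(revenue_index, cost_index)
```

A high explained-variance share signals redundant indicators, which is exactly the correlation information Step 3 uses when designing the global Balanced Scorecard indicators.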

 

Step 4 Transformation Process: The main goal of this step is to properly transform all data based on the information needs of the decision-making process. Based on the big data storage from step 2 of this guideline and the indicators defined in step 3, it is possible to know which transformations are necessary to support the entire business reporting requirement. This step should be implemented by the IT users.

 

Step 5 Virtual Data Mart Layer: The main goal of this step is to define several virtual data marts in accordance with the business reporting requirements. An in-memory approach is used to accelerate the creation and usage of the data marts. Such a solution brings more flexibility and unprecedented performance due to its in-memory nature.
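Step 5 can be mimicked with an in-memory SQL engine: transformed rows are loaded once, and each "virtual data mart" is just a view defined for a particular report requirement. SQLite's in-memory mode stands in here for an in-memory platform such as SAP HANA, and the table and figures are invented:

```python
import sqlite3

# ":memory:" keeps everything in RAM, mimicking an in-memory data platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", "loans", 100.0),
    ("north", "cards", 50.0),
    ("south", "loans", 70.0),
])

# A virtual data mart per reporting requirement: no data is copied,
# only a view definition is created over the shared base table.
conn.execute("""CREATE VIEW mart_by_region AS
                SELECT region, SUM(amount) AS total
                FROM sales GROUP BY region""")

rows = conn.execute(
    "SELECT region, total FROM mart_by_region ORDER BY region").fetchall()
```

Because a view is only a definition, new marts for new reports can be added or dropped without reloading any data, which is the flexibility the step aims for.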

 

Step 6 Develop BI System: The main goal of this step is to develop a BI system. Based on the data mart structure, it is necessary to define the OLAP schema and the reports defined by business users. There is a big variety of available tools for building BI solutions, and one of the most popular is the Pentaho BI Suite. Pentaho is popular due to its BI features and licensing policies. Moreover, according to the authors' experience, it is possible to achieve great flexibility in a BI solution by combining the Pentaho BI Suite with other tools, such as BIRT Report.

 

Step 7 Analysis: The main goal of this step is to analyse most of the available information to support the decision-making process and to discover new patterns in the business by using data mining techniques. This helps with: (a) redefining the indicators in the Balanced Scorecard (if necessary) and (b) supporting the decision-making process. For this step, any third-party external tool such as Weka, or analytical facilities integrated into the big data platform, can be used.
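As a small stand-in for the data mining tools named above, the pattern-discovery idea can be sketched with a tiny one-dimensional k-means that segments clients by transaction activity; the data and the two-cluster choice are purely illustrative:

```python
from statistics import mean

def kmeans_1d(values, iterations=20):
    # Two centroids, initialised at the extremes of the data.
    centroids = [min(values), max(values)]
    clusters = []
    for _ in range(iterations):
        # Assign each value to its nearest centroid...
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # ...then move each centroid to the mean of its cluster.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Invented monthly transaction counts per client: a low-activity
# and a high-activity segment that the algorithm should recover.
activity = [2, 3, 4, 40, 42, 45]
centroids, segments = kmeans_1d(activity)
```

Discovered segments like these are the kind of pattern that could feed back into redefining Balanced Scorecard indicators, as the step describes.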

4. Description of Case Study Concept

For the current case study, we focus on designing a new "data-centric" business service in the banking domain; it is intentionally selected to be fully artificial and simple to understand. The main parts covered by the case study are (1) the idea behind the data-centric business product, (2) the extract and load processes, and (3) the transformation of the big data storage's raw data and the implementation of data marts after transformation. The remaining parts, such as implementing the BI system and performing complex analysis, are left as future work.

While addressing big data's 3Vs (Volume, Velocity and Variety), it is important to select proper data sources; in the current case study the following formats are used: relational, graph, and log.

A bank came up with an idea to attract more clients and to stimulate them to use more and more of the services offered by the bank. One way to attract new clients is through already existing clients, though not in a way in which those other clients receive any kind of "spam" from the bank, but rather in a target-oriented style, targeting the proper clients only. In order to meet this targeting goal, proper data sources must be selected and a huge amount of data needs to be analyzed. For this case, the following data sources are selected: (a) the user's social graph; (b) the user's transactional data; (c) logs from the bank's web server.

The user social graph is used to identify new potential clients. User transaction data is used to understand the needs of current clients and to prepare good business offers. Web server logs are used to identify the most active users and, potentially, their interest in certain products.

The current implementation of the "Big Data Storage" contains two main components: an Apache Hadoop cluster (raw data storage) and SAP HANA (serving as an on-demand data analysis platform).

After the guideline's Step 1, EA fulfilment (section 3), the next step is to perform the extract and load processes. This step is done with Apache Hadoop, which is mainly used as a staging layer for the raw data. The interesting part here is that Hadoop can store a wide variety of formats, so during this step no "aggregation" or "transformation" is needed, and as a result almost all data stays inside the storage. For this particular example, (a) logs from the web server, (b) the social graph, and (c) users' transactions are extracted and loaded into the Apache Hadoop cluster.

Steps 3 and 4 are performed to understand what exactly is important for the analysis and to transform only the data that is needed for the analysis. Applying transformation after loading, with the data staged in a distributed cluster, allows revision and re-planning of the performed transformations, which makes it possible to try more hypotheses during the analysis.

While performing Step 5 of the guideline, which addresses designing the Virtual Data Mart Layer, data is transformed and loaded into an in-memory database that allows performing manipulations with the layers rapidly and efficiently. SAP HANA is used as the in-memory database; it has multiple data processing engines that meet the needs of online transaction processing (OLTP), online analytical processing (OLAP), and graph and text processing systems at once.

5. Conclusion and Future Work

In the upcoming era of big data, "data-centric" business services, and processes for discovering new business strategies based on historical behavior (mainly data) in order to achieve a competitive edge, will play a huge role. A major factor in succeeding in such competition is to prepare business IT solutions for the new "role-changing" requirements of the market. However, success cannot be achieved by fully applying new, modern, and trending approaches while neglecting robust and well-proven established techniques like BI. In such a situation, it is more appropriate to modify or prepare existing solutions for the market's needs and benefit from both: (1) the robustness of well-proven existing techniques like BI and (2) the promising advantages of new approaches like big data. Our work, which presents ELTA (Extract, Load, Transform and Analyse), is one such new approach; it combines business intelligence with big data by taking the best parts of both while, in parallel, removing the disadvantages of business intelligence.

In future work, the proposed model architecture depicted in Fig. 1 will be enhanced with a focus on the following parts: the Big Data Storage, the Virtual Data Marts Layer, and the Analysis Layer. Future work will also include an evaluation of Extract, Load, Transform and Analyse through the careful implementation of complex business scenarios and discussion of the obtained results.