3. Big Data Analytics Platforms

3.3. Functional Layers of Big Data Architecture

The open source strategy has produced a large number of computing software projects, giving users many options for the big data analytics process. Users are often faced with the challenge of choice, and the wrong choice of a software solution may lead to business failure. Big data solution selection should be driven by the use case, product reputation, team expertise, and knowledge of current developments in the field. Selecting a solution requires knowledge of the functional areas (the big data stack) of a big data solution and mapping the selected software to the correct functional layer. A Hadoop cluster is constructed by combining any number of the software tools described in Section 4.2.

The big data stack, or architecture, can be viewed as a layered model consisting of a number of different functional areas for big data analytics: infrastructure, data repository, data refinery, data factory, data fusion, and business value. Each functional area has tools that allow the organization to leverage Hadoop's power of combining multiple data sources to gain new knowledge. These six layers are explained below.

  • Infrastructure layer. This is the first layer and includes all the hardware, network, and software used to support and operate a Hadoop cluster. The software ranges from operating system versions to the commonly used tools for monitoring and reporting on the Hadoop cluster.
  • Data Repository layer. This is the second layer and deals with the movement and storage of data within a distributed computing environment. The primary repository here is the Hadoop Distributed File System (HDFS). Data transport tools such as Sqoop and Flume are also included, and NoSQL databases such as Accumulo and HBase are further forms of data repository.
  • The Data Refinery layer. This is the third layer and provides a platform for manipulating and processing data using a parallel processing framework. The primary technologies in the data refinery layer are YARN and MapReduce.
  • The Data Factory layer. Software in this class is called data workers. These tools are designed to interface with Hadoop and enable easier access to its full power. Many of them allow users to define compute jobs in an easily understood language, such as SQL, and then translate these inputs into MapReduce jobs. The tools here include Hive, Pig, Spark, and Oozie.
  • The Data Fusion layer. This is the application layer and the true business end of the big data solution. Data analytics and data visualization tools fall under this layer and are used to create applications, implement algorithms, and visualize data. The tools include machine learning technologies, such as Mahout, and data visualization tools, such as Datameer, Pentaho, and Tableau.
  • The Business Value layer. This is the final layer, which utilizes a wide range of tools to express the business requirements, service levels, change requests, costs, and expenditures.
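The map, shuffle, and reduce phases that the data refinery layer's MapReduce framework coordinates can be illustrated with a small, self-contained sketch. The single-process Python word count below is only an analogy: a real MapReduce job distributes these phases across the cluster's nodes, and the function names here are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for each word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all counts emitted for one word."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in order over an in-memory dataset."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    data = ["big data big analytics", "data refinery"]
    # prints {'big': 2, 'data': 2, 'analytics': 1, 'refinery': 1}
    print(word_count(data))
```

Data factory tools such as Hive and Pig spare users from writing this kind of logic by hand: a query like `SELECT word, COUNT(*) ... GROUP BY word` is compiled into an equivalent sequence of map and reduce stages behind the scenes.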