2. Technologies and Methods

1. HADOOP:

In earlier days, an enterprise would accumulate and process big data on a single computer. In this approach, the application handles both data storage and analysis, and the user interacts with that application. The method works well for applications that process modest volumes of data, that is, data that can be accommodated by standard database servers or handled within the limits of the processor doing the work. When dealing with huge amounts of data, however, pushing everything through a single database server becomes a bottleneck. Google solved this problem by introducing an algorithm called Map Reduce. In the Map Reduce algorithm the input data is divided into smaller parts, these parts are assigned to different computers, and the partial results are collected from them and later integrated to form the resultant dataset. Doug Cutting and his team developed an open source project named HADOOP based on the solution provided by Google.

Hadoop applications also work on this Map Reduce algorithm, in which the data is processed in parallel across machines. In short, Hadoop is useful for developing applications that can perform complete statistical analysis on vast quantities of data.

Using simple programming models, Hadoop allows distributed processing of large datasets across clusters of computers. It is written in Java and is an Apache open source framework. The environment in which Hadoop applications operate provides distributed storage and computation across clusters of computers. Hadoop is designed to scale from a single server to thousands of machines, each of which provides local computation and storage.

Hadoop Architecture

Hadoop has 2 major layers at its core, namely:

  • Processing or Computation layer (Map Reduce)
  • Storage layer (Hadoop Distributed File System, HDFS)

Fig b: Hadoop Architecture

Map Reduce

Map Reduce is a parallel programming model used for writing distributed applications. It was devised at Google for efficient processing of vast amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Map Reduce programs run on Hadoop, which is an Apache open source framework.

HDFS Architecture

The underlying file system of a Hadoop cluster is the Hadoop Distributed File System (HDFS). It provides scalable, fault-tolerant storage designed to be deployed on commodity hardware. Several attributes set HDFS apart from other distributed file systems. Among them, some of the key differentiators are that HDFS is:

  • designed with hardware failure in mind
  • built for large datasets, with a default block size of 128 MB
  • optimized for sequential operations
  • rack-aware
  • cross-platform and supports heterogeneous clusters

In a Hadoop cluster, data is broken down into smaller units (called blocks) and distributed throughout the cluster. Each block is duplicated twice (for a total of three copies), and the two replicas are stored on two nodes in a rack elsewhere in the cluster. Since the data has a default replication factor of three, it is highly available and fault-tolerant. If a copy is lost (for example, because of machine failure), HDFS can automatically re-replicate it elsewhere in the cluster, thereby ensuring that the threefold replication factor is maintained.
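As a hedged illustration (not taken from the original text), the Java sketch below uses the HDFS FileSystem API to inspect how a stored file is split into blocks and replicated across nodes; the path /user/demo/photos.csv is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file already stored in HDFS
        Path file = new Path("/user/demo/photos.csv");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());

        // Each block is stored on several nodes, one per replica
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " replicated on hosts: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}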

YARN

YARN (Yet Another Resource Negotiator) is the framework responsible for providing and managing the resources needed for application execution.

YARN consists of three core components:

  • ResourceManager (one per cluster)
  • ApplicationMaster (one per application)
  • NodeManagers (one per node)

Fig c: Components of YARN

Resource Manager:

It is responsible for taking inventory of available resources and runs several critical services, the most important of which is the Scheduler.

Node Managers:

The Node Manager is a per-node agent tasked with overseeing containers throughout their lifecycles, monitoring container resource usage, and periodically communicating with the Resource Manager.
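As a hedged sketch of how these components can be observed programmatically (not part of the original text), the Java code below uses the YarnClient API to ask the Resource Manager for a report of every Node Manager registered with it; the cluster configuration is assumed to be available on the classpath.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnNodes {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager
        Configuration conf = new Configuration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // One NodeReport per NodeManager registered with the ResourceManager
        List<NodeReport> nodes = yarnClient.getNodeReports();
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " state=" + node.getNodeState()
                    + " containers=" + node.getNumContainers()
                    + " capability=" + node.getCapability());
        }
        yarnClient.stop();
    }
}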

HOW DOES HADOOP WORK?

Instead of building bigger servers with heavy configurations to handle large-scale processing, we can tie together many commodity computers, each with a single CPU, into one distributed system. In practice, these clustered machines can read the dataset in parallel and so provide much higher throughput. Moreover, the cluster is cheaper than one high-end server. Running across clustered, low-cost machines is therefore the first important reason for using Hadoop.

Hadoop runs code across a cluster of computers. This process includes the following core tasks performed by Hadoop:

  • Initially, data is divided into directories and files. These files are further divided into uniformly sized blocks of 128 MB or 64 MB (a sketch of writing such a file to HDFS follows this list).
  • These files are then distributed across various cluster nodes for further processing.
  • HDFS, sitting on top of the local file system, supervises the processing.
  • Blocks are replicated to handle hardware failure.
  • Checking that the code was executed successfully.
  • Performing the sort that takes place between the map and reduce stages.
  • Sending the sorted data to a particular computer.
  • Writing the debugging logs for each job.
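The sketch referenced in the first task is given here. It writes a file to HDFS while explicitly requesting a 128 MB block size and a replication factor of three; the path /user/demo/input.txt and the literal values are illustrative assumptions, since in practice the cluster defaults (dfs.blocksize and dfs.replication) normally apply.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/input.txt");   // hypothetical path
        long blockSize = 128L * 1024 * 1024;            // 128 MB blocks
        short replication = 3;                          // three copies of each block
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
        try (BufferedWriter writer =
                     new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            writer.write("example record");
            writer.newLine();
        }
        fs.close();
    }
}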


2. MAP REDUCE FRAMEWORK:

A Map Reduce job typically splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then passed to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The basic unit of information used in Map Reduce is a key and value pair. Before feeding data to the Map Reduce model, every type of structured and unstructured data needs to be translated into this basic unit. As the name suggests, the Map Reduce model consists of two separate routines, namely a Map function and a Reduce function. When coupled with HDFS, Map Reduce can be used to handle big data.

In the Map Reduce model, the computation on a set of key and value pairs occurs in three stages:

  1. Map stage
  2. Shuffle stage
  3. Reduce stage

Fig d: Stages of Map Reduce

1. Map Stage:

Map Reduce logic is not restricted to structured datasets; it also has an extensive capability to handle unstructured data. The map stage is the crucial step which makes this possible, because the mapper brings structure to unstructured data. For example, suppose I want to count the number of photographs on my system by the location (city) in which each photograph was taken. The mapper makes key and value pairs from this dataset: here the key is the location and the value is the photograph. Once the mapper has finished its task, we have a structure over the entire dataset. In the map stage, the mapper takes a single key and value pair as input and generates any number of key and value pairs as output. The user designs a map function that maps an input key and value pair to any number of output pairs. Most of the time, the map phase is simply used to specify the desired location of an input value by changing its key.
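A minimal sketch of such a mapper is given below. It assumes, purely for illustration, that each input line holds a photograph record of the form photoId,city; the mapper emits the city as the key and the photograph identifier as the value.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input:  (byte offset, "photoId,city") lines from a hypothetical photo index
// Output: (city, photoId) pairs, giving the unstructured records a structure
public class PhotoCityMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text city = new Text();
    private final Text photoId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 2) {
            return; // skip malformed records
        }
        photoId.set(fields[0].trim());
        city.set(fields[1].trim());
        context.write(city, photoId); // key = location, value = photograph
    }
}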

2. Shuffle Stage:


The Map Reduce framework handles the shuffle stage automatically. The underlying system executing Map Reduce directs all of the values associated with an individual key to the same reducer.
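Which reduce task receives a given key is decided by a partitioner. As a hedged illustration of the default hash-based behaviour (the class name CityPartitioner is illustrative), the sketch below sends every pair that shares the same city key to the same reducer.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// All pairs with the same city key map to the same partition number,
// so they end up at the same reducer after the shuffle.
public class CityPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Mask off the sign bit so the result is always non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}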

3. Reduce Stage:
In the reduce stage, the reducer accepts all of the values linked with a single key k and outputs any number of key and value pairs. This brings out one of the main aspects of the Map Reduce computation: all of the maps must finish before the reduce stage begins. Since all access to the values sharing a key is confined to a single reducer, the reducer can perform sequential computations on those values. In the reduce step, parallelism is exploited by observing that reducers operating on different keys can be executed at the same time.
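Continuing the hypothetical photograph example, a reducer that counts the photographs per city, together with a minimal driver that wires the mapper, partitioner, and reducer into a single job, might look as follows; the class names and paths are illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PhotoCountJob {

    // Receives (city, [photoId, photoId, ...]) and emits (city, count)
    public static class PhotoCountReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text city, Iterable<Text> photoIds, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (Text ignored : photoIds) {
                count++;
            }
            context.write(city, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "photo count by city");
        job.setJarByClass(PhotoCountJob.class);

        job.setMapperClass(PhotoCityMapper.class);       // sketched in the Map stage above
        job.setPartitionerClass(CityPartitioner.class);  // sketched in the Shuffle stage above
        job.setReducerClass(PhotoCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/demo/photos
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /user/demo/photo-counts

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}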


3. HIVE:

Hive is a data warehousing infrastructure built on top of Hadoop. Its primary responsibility is to provide data summarization, query, and analysis. It supports analysis of large datasets stored in Hadoop's HDFS. Hive provides an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 file system. It provides an SQL-like language called Hive Query Language (HiveQL) with schema on read, and transparently converts queries into map and reduce jobs. By default, metadata is stored in an embedded Apache Derby database, and other client-server databases such as MySQL can alternatively be used.

Other features of Hive include:

  • Indexing types such as compaction and bitmap index (as of version 0.10), with further index types planned, to accelerate queries.
  • Support for various storage types such as plain text, HBase, ORC, RCFile, and others.
  • Metadata storage in an RDBMS, which considerably reduces the time needed to perform semantic checks during query execution.
  • Operation on compressed data stored in the Hadoop ecosystem, using algorithms such as DEFLATE, SNAPPY, BWT, and others.
  • Built-in user defined functions (UDFs) to manipulate strings, dates, and other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions (a sketch of a custom UDF follows this list).
  • SQL-like queries (Hive Query Language) that are implicitly converted into map reduce, Tez, or Spark jobs.
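As a hedged illustration of the user defined function extension point mentioned above (the class and function names are hypothetical), a simple Hive UDF written in Java that normalizes the city name stored with each record could look like this.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// After adding the jar to the session, the function could be registered with, for example:
//   CREATE TEMPORARY FUNCTION normalize_city AS 'NormalizeCityUDF';
public class NormalizeCityUDF extends UDF {

    public Text evaluate(Text city) {
        if (city == null) {
            return null;
        }
        // Trim whitespace and upper-case the city name
        return new Text(city.toString().trim().toUpperCase());
    }
}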

Although Hive is based on SQL, Hive Query Language does not strictly follow the full SQL-92 standard. Hive Query Language offers extensions not in SQL, including multi-table inserts and create table as select, but provides only basic support for indexes. Hive Query Language lacks support for transactions and materialized views, and has only limited subquery support.
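To make the SQL-like interface concrete, the hedged sketch below connects to a HiveServer2 instance over JDBC and runs a HiveQL query against a hypothetical photos table; the host name, credentials, database, and table are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver shipped with Hive (hive-jdbc)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint and database
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "demo", "");
             Statement stmt = conn.createStatement();
             // Hypothetical table: photos(photo_id STRING, city STRING)
             ResultSet rs = stmt.executeQuery(
                     "SELECT city, COUNT(*) AS photo_count FROM photos GROUP BY city")) {
            while (rs.next()) {
                System.out.println(rs.getString("city") + "\t" + rs.getLong("photo_count"));
            }
        }
    }
}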