What tools are used to analyze big data?

Perhaps the most influential and established tool for analyzing big data is Apache Hadoop, a completely open source framework for storing and processing data at large scale. Hadoop runs on commodity hardware, making it easy to deploy in an existing data center or to run analysis in the cloud. Hadoop is broken into four main parts:

  • The Hadoop Distributed File System (HDFS), which is a distributed file system designed for very high aggregate bandwidth;
  • YARN, a platform for managing Hadoop's resources and scheduling programs that will run on the Hadoop infrastructure;
  • MapReduce, as described above, a programming model for processing big data (see the word-count sketch after this list);
  • And Hadoop Common, a set of shared libraries for the other modules to use.
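
To make the MapReduce model a little more concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, which lets plain scripts act as the map and reduce steps. The file names and paths are illustrative, not part of Hadoop itself.

```python
#!/usr/bin/env python3
# mapper.py -- the "map" step: read raw text on stdin and emit
# one "word<TAB>1" record per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

Hadoop sorts the mapper's output by key before the reduce step, so identical words arrive on consecutive lines and can be summed in a single pass:

```python
#!/usr/bin/env python3
# reducer.py -- the "reduce" step: sum the counts for each word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The same two scripts can be tested locally with an ordinary shell pipeline (`cat input.txt | ./mapper.py | sort | ./reducer.py`); on a cluster they would be submitted through the Hadoop Streaming jar, with the exact command depending on the Hadoop version.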

Other tools are out there, too. One that receives a lot of attention is Apache Spark. Spark's main selling point is that it keeps much of the data for processing in memory rather than on disk, which for certain kinds of analysis can be much faster; depending on the operation, analysts may see results a hundred times faster or more. Spark can use HDFS, but it is also capable of working with other data stores, like Apache Cassandra or OpenStack Swift. It is also fairly easy to run Spark on a single local machine, which makes testing and development easier, as the sketch below shows.
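
As a rough illustration of how little setup that takes, the following PySpark snippet runs Spark in local mode and counts words, caching the intermediate data in memory. The input path is made up for the example, and it assumes the pyspark package is installed.

```python
from pyspark.sql import SparkSession

# Start Spark in local mode, using all available cores on this machine.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("wordcount-sketch") \
    .getOrCreate()

# Read a plain-text file; the path is illustrative and could just as
# easily be an hdfs:// URI (or another supported store) on a real cluster.
lines = spark.sparkContext.textFile("data/books.txt")

# Split lines into words and keep the intermediate result in memory,
# which is where Spark's speedup over disk-based processing comes from.
words = lines.flatMap(lambda line: line.split()).cache()

# Classic word count: map each word to (word, 1) and sum by key.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Print the ten most frequent words.
for word, count in counts.takeOrdered(10, key=lambda pair: -pair[1]):
    print(word, count)

spark.stop()
```

Essentially the same code can then run against HDFS or another data store by changing the input URI and the master setting, which is what makes local development with Spark so convenient.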