As you read this article, focus on the definition of big data, the types of data that count as big data, and how it is analyzed. Then take notes on the tools currently used to analyze big data.
Technology continues to advance an organization's ability to collect data. Next, you will learn about storing and making big data accessible by developing the infrastructure.
Other big data tools
Of course, these aren't the only big data tools out there. There are countless open source solutions for working with big data, many of them specialized for providing optimal features and performance for a specific niche or for specific hardware configurations.
The Apache Software Foundation (ASF) supports many of these big data projects. Here are some that you may find useful.
- Apache Beam is "a unified model for defining both batch and streaming data-parallel processing pipelines". It allows developers to write code that works across multiple processing engines.
- Apache Hive is a data warehouse built on Hadoop. A top-level Apache project, it "facilitates reading, writing, and managing large datasets … using SQL".
- Apache Impala is an SQL query engine that runs on Hadoop. It graduated from the Apache Incubator to a top-level project in 2017 and is touted for improving SQL query performance while offering a familiar interface.
- Apache Kafka allows users to publish and subscribe to real-time data feeds. It aims to bring the reliability of other messaging systems to streaming data.
- Apache Lucene is a full-text indexing and search software library that can be used for recommendation engines. It's also the basis for many other search projects, including Solr and Elasticsearch.
- Apache Pig is a platform for analyzing large datasets that runs on Hadoop. Yahoo, which developed it to run MapReduce jobs on large datasets, contributed it to the ASF in 2007.
- Apache Solr is an enterprise search platform built upon Lucene.
- Apache Zeppelin is a web-based notebook that enables interactive data analytics with SQL and other programming languages. It graduated from the Apache Incubator to a top-level project in 2016.
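To make the publish-and-subscribe idea behind Kafka more concrete, here is a minimal in-memory sketch in Python. It is not Kafka's API: the `ToyBroker` class, its `publish` and `poll` methods, and the topic and consumer names are all hypothetical, and the sketch assumes a single process with no persistence, partitioning, or replication. It only illustrates the core model of an append-only log per topic, with each consumer tracking its own read position.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a publish/subscribe log (hypothetical; not Kafka's API)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> append-only list of messages
        self._offsets = defaultdict(int)   # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic's log and return its offset."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def poll(self, topic, consumer):
        """Return messages this consumer has not yet seen, then advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(topic, consumer)]
        self._offsets[(topic, consumer)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("clicks", {"user": "a", "page": "/home"})
broker.publish("clicks", {"user": "b", "page": "/docs"})

first_batch = broker.poll("clicks", "analytics")    # both messages published so far
broker.publish("clicks", {"user": "a", "page": "/pricing"})
second_batch = broker.poll("clicks", "analytics")   # only the newly published message
late_reader = broker.poll("clicks", "audit")        # all three: offsets are per consumer
```

Because each consumer keeps its own offset, a reader that joins late can still replay the whole feed, which is the property that separates a log-based system from a queue that deletes messages once they are delivered.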
Other open source big data tools you may want to investigate include:
- Elasticsearch is another enterprise search engine based on Lucene. It's part of the Elastic Stack (formerly known as the ELK stack for its components: Elasticsearch, Logstash, and Kibana) that generates insights from structured and unstructured data.
- Cruise Control was developed by LinkedIn to run Apache Kafka clusters at large scale.
- TensorFlow is a software library for machine learning that has grown rapidly since Google open sourced it in late 2015. It's been praised for "democratizing" machine learning because of its ease of use.
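Lucene, Solr, and Elasticsearch all rest on full-text indexing. The sketch below shows the general idea with a tiny inverted index in plain Python. It is an illustration under simplifying assumptions, not how any of these projects is implemented or called: the function names `tokenize`, `build_index`, and `search` and the sample documents are invented for this example, and it leaves out ranking, stemming, and on-disk storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Apache Solr is an enterprise search platform",
    2: "Elasticsearch is a search engine based on Lucene",
    3: "Apache Kafka handles streaming data",
}
index = build_index(docs)
hits = search(index, "search apache")   # ids of documents containing both terms
```

Looking up a term in the index is a dictionary access rather than a scan of every document, which is why this structure suits search over large collections.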
As big data grows in size and importance, the list of open source tools for working with it will almost certainly keep growing as well.