Hadoop

Hadoop is an ecosystem of open-source components that fundamentally changes the way enterprises store, process, and analyze data.

Distributions

Hadoop is available as open source through the Apache distribution, or through vendors such as Cloudera (the largest Hadoop vendor by size and scope), MapR, or Hortonworks.

Version Status

Initial release: April 1, 2006.
The 3.3.x release series began with version 3.3.0, released on July 14, 2020.

Hadoop Ecosystem

The term Hadoop is often used to refer not only to the base modules and sub-modules but also to the wider ecosystem: the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.

Performance Comparison

Spark: Spark has been reported to run up to 100 times faster than Hadoop MapReduce in memory, and up to 10 times faster on disk. It has also been used to sort 100 TB of data 3 times faster than Hadoop MapReduce on one-tenth of the machines. Spark has been found to be particularly fast on machine learning applications, such as Naive Bayes and k-means (https://logz.io/blog/hadoop-vs-spark/).
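
The sort figures above combine two factors: a shorter run time and a smaller cluster. As a rough illustration only, the sketch below multiplies the two to estimate per-machine throughput. It assumes the work divides evenly across machines, which is a simplification, and the function name `per_machine_speedup` is hypothetical rather than part of Spark or Hadoop.

```python
def per_machine_speedup(time_speedup: float, machine_reduction: float) -> float:
    """Estimate how much more work each machine does per unit of time.

    time_speedup: how many times faster the job finished (3 in the text).
    machine_reduction: how many times fewer machines were used (10 in the text).
    Assumes the workload is spread evenly across machines (a simplification).
    """
    if time_speedup <= 0 or machine_reduction <= 0:
        raise ValueError("both factors must be positive")
    return time_speedup * machine_reduction


# Figures taken from the paragraph above: 3 times faster on one-tenth of the machines.
ratio = per_machine_speedup(3, 10)
print(f"Estimated per-machine throughput ratio: {ratio}x")
```

Under that even-split assumption, each Spark machine in the sort would have processed roughly 30 times as much data per unit of time as each MapReduce machine.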
