Skip to content

Latest commit

 

History

History
26 lines (11 loc) · 1.22 KB

02-Hadoop.md

File metadata and controls

26 lines (11 loc) · 1.22 KB

Hadoop

  • Hadoop is an ecosystem of open source components that fundamentally changes the way enterprises store, process, and analyze data.

Distributions

  • Hadoop is available either open-source through the Apache distribution, or through vendors such as Cloudera (the largest Hadoop vendor by size and scope), MapR, or HortonWorks.

Version Status

  • Initial release April 1, 2006;
  • Version Series 3.3.x (i.e. 3.3.0 etc) was released on 14 July 2020

Hadoop Ecosystem

  • The term Hadoop is often used for both base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.

Performance Comparison

Spark: Spark has been found to run 100 times faster in-memory, and 10 times faster on disk. It's also been used to sort 100 TB of data 3 times faster than Hadoop MapReduce on one-tenth of the machines. Spark has particularly been found to be faster on machine learning applications, such as Naive Bayes and k-means.(https://logz.io/blog/hadoop-vs-spark/).