Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
https://spark.apache.org/docs/latest/
Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Consisting of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows, Docker enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments. As a result, IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any cloud.
https://www.docker.com/whatisdocker/
Docker images are the basis of containers. An image is read-only; a container adds a writable layer on top of an image. Only containers, not images, are actually run by the operating system.
https://docs.docker.com/terms/image/
Branch | Base Image | Description |
---|---|---|
master | gelog/java:openjdk7 | Spark pre-built for Hadoop |
spark-for-hadoop | gelog/java:openjdk7 | Spark pre-built for Hadoop (dev branch) |
spark-from-source | scala:2.10.4 | Spark built from source |
Note: the spark-from-source image currently takes a long time to build and produces an image with a virtual size of 2.3 GB.
The recommended branch for general use is master.
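Start the Spark master. This publishes the master's web UI on port 8080 and the cluster port 7077:

    docker run -d --name spark-master -h spark-master -p 8080:8080 -p 7077:7077 \
      gelog/spark:1.2-bin-hadoop2.3 spark-class org.apache.spark.deploy.master.Master

Start a worker and register it with the master. The `--link` options require that containers named `hdfs-namenode` and `spark-master` already exist:

    docker run -d --name spark-worker1 -h spark-worker1 --link=hdfs-namenode:hdfs-namenode --link=spark-master:spark-master \
      gelog/spark:1.2-bin-hadoop2.3 spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077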
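
The worker command above can be repeated with a different name to add more workers. The following is a minimal sketch of one way to do that. It only prints the commands so they can be reviewed first. The helper names `worker_cmd` and `print_workers` are hypothetical, and the numbering scheme beyond `spark-worker1` (`spark-worker2`, `spark-worker3`, and so on) is an assumption, not something this repository defines.

```shell
#!/bin/sh
# Sketch: generate "docker run" commands for several Spark workers.
# The image tag and link names are copied from the commands above;
# the worker numbering beyond spark-worker1 is an assumption.

SPARK_IMAGE="gelog/spark:1.2-bin-hadoop2.3"

# Print the docker command that starts worker number $1.
worker_cmd() {
    n="$1"
    printf '%s\n' "docker run -d --name spark-worker$n -h spark-worker$n --link=hdfs-namenode:hdfs-namenode --link=spark-master:spark-master $SPARK_IMAGE spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077"
}

# Print the commands for workers 1..COUNT (COUNT defaults to 3).
print_workers() {
    count="${1:-3}"
    i=1
    while [ "$i" -le "$count" ]; do
        worker_cmd "$i"
        i=$((i + 1))
    done
}

# Dry run: show the commands for two workers without starting anything.
print_workers 2
```

To actually start the containers, the printed commands can be piped to a shell, for example `print_workers 2 | sh`. This still requires the `spark-master` and `hdfs-namenode` containers to exist, and each worker name must not already be in use.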