
## Apache Spark

Apache Spark is a fast and general-purpose cluster computing system. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including [Spark SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html) for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

## Support

Being a JPA provider, Kundera provides support for Spark. It allows performing read-write operations and SQL querying over data in Cassandra and MongoDB. Along with these databases, support for the file system (CSV/JSON) and HDFS is also included.
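As a minimal sketch of how this looks through the standard JPA API (the persistence-unit name, entity, and column names below are illustrative assumptions, not values defined by Kundera), a simple write and read might look like this:

```java
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;
import javax.persistence.Table;

// Hypothetical JPA entity used only for illustration.
@Entity
@Table(name = "person")
class Person {
    @Id
    private String personId;
    private String name;
    private int age;

    public Person() {
        // JPA requires a no-arg constructor
    }

    public Person(String personId, String name, int age) {
        this.personId = personId;
        this.name = name;
        this.age = age;
    }
}

public class KunderaSparkReadWrite {
    public static void main(String[] args) {
        // "spark_cassandra_pu" is a hypothetical persistence-unit name; it would be
        // configured in persistence.xml to use Kundera with the Spark client.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("spark_cassandra_pu");
        EntityManager em = emf.createEntityManager();

        em.persist(new Person("1", "Dave", 25));   // write
        Person found = em.find(Person.class, "1"); // read by primary key

        em.close();
        emf.close();
    }
}
```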

## Why no Update and Delete

Spark does not provide update or delete operations over data. It is a data-processing tool used for processing huge amounts of data, and most of its use cases relate to analytics. So, in keeping with Spark's philosophy, Kundera supports only reading, writing, and querying data.
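To illustrate the querying side (reusing the hypothetical Person entity and persistence-unit name from the sketch above; the query text is likewise an assumption), a SQL query can be issued through the standard JPA native-query API:

```java
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class KunderaSparkSqlQuery {
    public static void main(String[] args) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("spark_cassandra_pu");
        EntityManager em = emf.createEntityManager();

        // SQL over the stored data; table and column names follow the illustrative entity.
        List<?> adults = em
                .createNativeQuery("SELECT * FROM person WHERE age > 20", Person.class)
                .getResultList();

        System.out.println("Matched rows: " + adults.size());

        em.close();
        emf.close();
    }
}
```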

Kundera provides three modules for Spark (a minimal configuration sketch follows the list):

* spark-core : the core module, mandatory for using kundera-spark. It also covers the HDFS and FS (CSV & JSON) support.
* spark-cassandra : the module designed for Cassandra.
* spark-mongodb : the module designed for MongoDB.
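The choice of module is a packaging and configuration concern; the JPA code itself does not change. A minimal sketch under that assumption (the persistence-unit names below are hypothetical and would be defined in persistence.xml against the corresponding module):

```java
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class ModuleChoiceSketch {
    public static void main(String[] args) {
        // Hypothetical persistence units: "spark_cassandra_pu" would be backed by the
        // spark-cassandra module and "spark_mongodb_pu" by the spark-mongodb module;
        // spark-core is required in both cases.
        EntityManagerFactory cassandraEmf =
                Persistence.createEntityManagerFactory("spark_cassandra_pu");
        EntityManagerFactory mongoEmf =
                Persistence.createEntityManagerFactory("spark_mongodb_pu");

        // Entities, persist/find calls, and native SQL queries are written identically
        // against either factory; only the persistence-unit configuration differs.

        cassandraEmf.close();
        mongoEmf.close();
    }
}
```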