Devender Yadav edited this page Jul 27, 2015 · 7 revisions

Apache Spark

Apache Spark is a fast, general-purpose cluster computing system. It has an advanced DAG execution engine that supports acyclic data flow and in-memory computing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including [Spark SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html) for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Spark runs on Hadoop, Mesos, standalone, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3.

## Support

As a JPA provider, Kundera supports Spark. It allows read and write operations and SQL querying over data in Cassandra and MongoDB. In addition to these databases, it supports the file system (CSV/JSON) and HDFS.

Kundera provides three modules for Spark:

  • spark-core: Handles HDFS and the file system (CSV and JSON). Users can read, write, and query data there.
  • spark-cassandra: Designed for Cassandra. Users can read, write, and query data stored in Cassandra.
  • spark-mongodb: Designed for MongoDB. Users can read, write, and query data stored in MongoDB.