Spark Modularized View (SMV) enables users to build enterprise-scale applications on the Apache Spark platform.
- Scales with DATA size
- Scales with CODE size
- Scales with TEAM size
In addition to the data scalability inherited from Spark, SMV also provides code and team scalability through the following features:
- Multi-level modular design allows developers to work on large-scale projects and enables easy code and data reuse
- Multi-grain traceability to support full-scope knowledge transparency for developers and data users
- Interfaces to multiple languages (Scala and R for now) for easy integration with existing code and to leverage existing developer experience
- Pure text code, so modern CM (Configuration Management) tools can be used to track and merge changes among team members
- Automatic data and code version synchronization to enable coordination at both the code and data levels
- Data publishing mechanism to support inter-team coordination
- Built-in data quality management to ensure data quality on a continuous basis
- High-level helper functions and tools for quick data app development
Please refer to the User Guide and API docs for details.
Note: The sections below were extracted from the User Guide, which should be consulted for more detailed instructions.
Install Docker. An installation guide for your machine may be found here.
Pull this repository, navigate to the docker directory, and build the SMV docker image with
docker build -t smv .
Now run SMV with
docker run --rm -it -v /path/to/projects:/projects -v /path/to/data:/data smv
SMV provides a shell script to easily create an example application. The example app can be used for exploring SMV, and it can also serve as an initialization script for a new project.
$ _SMV_HOME_/tools/smv-init MyApp com.mycompany.myapp
$ mvn clean install
$ _SMV_HOME_/tools/smv-run --run-app
The output CSV file and schema can be found in the data/output directory (as configured in the conf/smv-user-conf.props file).
$ cat data/output/com.mycompany.myapp.stage1.EmploymentByState_XXXXXXXX.csv/part-* | head -5
"32",981295
"33",508120
"34",3324188
"35",579916
"36",7279345
$ cat data/output/com.mycompany.myapp.stage1.EmploymentByState_XXXXXXXX.schema/part-*
FIRST('ST): String
EMP: Long
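The two cat commands above can be wrapped in a small helper so the version suffix does not have to be typed by hand. This is only a sketch: show_output is a hypothetical name, not part of SMV, and it assumes the output layout shown above (a `<module>_<version>.csv` directory with a sibling `.schema` directory, each holding part-* files).

```shell
# Print the first rows and the schema of a module's persisted output.
# show_output is a hypothetical helper, not an SMV tool.
# Assumed layout: <out_dir>/<module>_<version>.csv/part-* and
#                 <out_dir>/<module>_<version>.schema/part-*
show_output() {
    out_dir=$1
    module=$2
    rows=${3:-5}    # number of data rows to print, default 5

    for csv_dir in "$out_dir/$module"_*.csv; do
        # An unmatched glob stays literal, so skip anything that is not a directory.
        [ -d "$csv_dir" ] || continue
        schema_dir="${csv_dir%.csv}.schema"

        echo "== $csv_dir"
        cat "$csv_dir"/part-* | head -n "$rows"

        if [ -d "$schema_dir" ]; then
            echo "== $schema_dir"
            cat "$schema_dir"/part-*
        fi
    done
}

# Example invocation, using the paths from the example above:
# show_output data/output com.mycompany.myapp.stage1.EmploymentByState
```

If several versions of the same module have been persisted, the loop prints each of them in glob order.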
See the Getting Started section of the User Guide for further details.
If smv-run is given the -g flag, the module dependency graph is written to a dot file instead of the module being run and persisted. The dot file can be converted to png using the dot command.
$ _SMV_HOME_/tools/smv-run -g -m com.mycompany.myapp.stage1.EmploymentByState
$ dot -Tpng com.mycompany.MyApp.stage1.EmploymentByState.dot -o graph.png
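The conversion step fails with an unhelpful message when Graphviz is not installed or the dot file name is mistyped. A minimal sketch of a guard around the same dot command follows; render_graph is a hypothetical helper name, and the default output name (the input name with .png in place of .dot) is an assumption made for illustration.

```shell
# Convert a dependency graph from dot to png, with basic checks.
# render_graph is a hypothetical helper; it only wraps the dot command shown above.
render_graph() {
    dot_file=$1
    # Default output name: same as the input, with .dot replaced by .png.
    png_file=${2:-${dot_file%.dot}.png}

    if [ ! -f "$dot_file" ]; then
        echo "no such dot file: $dot_file" >&2
        return 1
    fi
    if ! command -v dot >/dev/null 2>&1; then
        echo "Graphviz 'dot' not found on PATH" >&2
        return 2
    fi
    dot -Tpng "$dot_file" -o "$png_file"
}

# Example invocation, using the file name from the example above:
# render_graph com.mycompany.MyApp.stage1.EmploymentByState.dot graph.png
```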
See Run SMV Application for further details.