Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.
Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb
Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/
Supported Java versions
The jars are compatible with Java 11. Support for Java 8 has been dropped.
Supported Spark versions
Spark 3.2 and 3.3 are no longer supported as of RumbleDB 1.22, because the Spark team no longer officially supports them. Spark 3.4 and 3.5 are supported. Spark 4 is currently in preview and not yet supported by RumbleDB, but we are trying it out in order to support it in future releases.
Jars
RumbleDB comes in four jars that you can pick from depending on your needs:
rumbledb-1.22.0-standalone.jar already contains Spark and can be run "out of the box" on Java 11 with java -jar rumbledb-1.22.0-standalone.jar
rumbledb-1.22.0-for-spark-3.4-scala-2-12.jar, rumbledb-1.22.0-for-spark-3.5-scala-2-12.jar, and rumbledb-1.22.0-for-spark-3.5-scala-2-13.jar are smaller, do not contain Spark, and can be run in a matching, existing Spark environment, either local (in which case you need to download and install Spark) or on a cluster (for example EMR, with just a few clicks), with spark-submit rumbledb-....jar -q '1+1'
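As a rough illustration of how the jar names above line up with Spark and Scala versions, the following shell sketch maps a version pair to the matching file name and prints the corresponding spark-submit command. The function rumbledb_jar_name is a hypothetical helper written for this example, not something shipped with RumbleDB; only the jar names and the -q '1+1' invocation come from these notes.

```shell
#!/bin/sh
# Hypothetical helper (not part of RumbleDB): maps a Spark/Scala version pair
# to the jar name listed in these release notes.
rumbledb_jar_name() {
  spark="$1"   # e.g. 3.5
  scala="$2"   # e.g. 2.13
  case "$spark/$scala" in
    3.4/2.12|3.5/2.12|3.5/2.13)
      # The jar name writes the Scala version with a dash instead of a dot.
      echo "rumbledb-1.22.0-for-spark-${spark}-scala-$(echo "$scala" | tr . -).jar"
      ;;
    *)
      echo "no RumbleDB 1.22.0 jar listed for Spark $spark / Scala $scala" >&2
      return 1
      ;;
  esac
}

# Print (not run) the command for an existing Spark 3.5 / Scala 2.13 installation.
jar="$(rumbledb_jar_name 3.5 2.13)" && echo "spark-submit $jar -q '1+1'"
```

If no existing Spark installation is available, the standalone jar is the simpler choice, since it bundles Spark.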
Improvements
Support for the W3C-standardized copy-modify-return expression as a more convenient way to transform JSON objects and arrays with the update syntax (insertion, deletion, replacement, renaming)
Support for persisting updates on objects and arrays read from Delta Lake (with the same update syntax)
Support for scripting: variable assignments, while loops, applying updates in the middle of the execution with visible side effects (under snapshot semantics), statements, block statements, continue, break, exit returning.
Many performance improvements
Many bugfixes
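To give a feel for the copy-modify-return expression mentioned above, here is a minimal sketch that writes a small JSONiq query to a file and, if possible, runs it with the standalone jar. The query contents, the file location, and the assumption that the jar sits in the current directory are all illustrative. Passing -q to the standalone jar in the same way as in the spark-submit example is also an assumption, and the update syntax follows the JSONiq update facility as we understand it.

```shell
#!/bin/sh
# Sketch only: file location, jar location and query contents are illustrative assumptions.
QUERY_FILE="${TMPDIR:-/tmp}/rumbledb-copy-modify-return.jq"
JAR="rumbledb-1.22.0-standalone.jar"   # assumed to be in the current directory

# Quoted heredoc delimiter so the shell does not expand $p inside the query.
# The query copies an object, inserts a new field, appends to an array,
# and returns the modified copy; the original value is left untouched.
cat > "$QUERY_FILE" <<'EOF'
copy $p := { "name": "RumbleDB", "tags": [ "json" ] }
modify (
  insert json { "release": "1.22.0" } into $p,
  append json "spark" into $p.tags
)
return $p
EOF

# Run only if Java and the jar are available; -q is the flag shown in these notes.
if command -v java >/dev/null 2>&1 && [ -f "$JAR" ]; then
  java -jar "$JAR" -q "$(cat "$QUERY_FILE")"
else
  echo "Query written to $QUERY_FILE; run it once Java 11 and $JAR are available."
fi
```

The same update primitives (insertion, deletion, replacement, renaming) are the ones the notes describe as persistable when the objects and arrays come from Delta Lake.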