
Integrate Apache Spark / Sansa #23

Closed
Aklakan opened this issue Oct 8, 2021 · 2 comments

Comments

@Aklakan
Member

Aklakan commented Oct 8, 2021

This would make processing large log files less boring!

@Aklakan
Member Author

Aklakan commented Oct 10, 2021

After many additions and refactorings across aksw-commons, jena-sparql-api (jsa), Sansa, and LSQ, lsq rdfize now runs on Spark.
The central aspect is the marriage of Spark and RxJava: LSQ was already built around rx-based functions, and it was clear that these could in principle be made to run inside an rdd.mapPartitions call. Until now, however, there were always gaps when trying to construct a full pipeline from source data to target data via LSQ.

The missing piece was the introduction of the DatasetOneNg class in both Sansa and jsa-rx: this class represents a Dataset that contains exactly one named graph, which makes it possible to enforce RDDs or Flowables of individual named graphs.

The input adapter turns any source into a Flowable or RDD of DatasetOneNg, and the output is again DatasetOneNg, so any operator that transforms one into the other can be used within a mapPartitions call.
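To illustrate the shape of this adapter pattern, here is a minimal, self-contained Python sketch. It is not the actual Sansa/LSQ API: DatasetOneNg is reduced to a toy class holding one named graph, rdfize stands in for any LSQ operator that maps an iterator of DatasetOneNg to an iterator of DatasetOneNg, and map_partitions mimics the semantics of Spark's rdd.mapPartitions (the operator sees each partition as an iterator and yields results lazily).

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple

@dataclass
class DatasetOneNg:
    """Toy stand-in for a dataset that contains exactly one named graph."""
    graph_name: str
    triples: List[Tuple[str, str, str]]

def rdfize(datasets: Iterator[DatasetOneNg]) -> Iterator[DatasetOneNg]:
    """Hypothetical LSQ-style operator: DatasetOneNg in, DatasetOneNg out.

    Here it merely uppercases the object of each triple; a real operator
    would enrich the graph with RDFized log records.
    """
    for ds in datasets:
        yield DatasetOneNg(
            ds.graph_name,
            [(s, p, o.upper()) for (s, p, o) in ds.triples],
        )

def map_partitions(partitions, fn: Callable) -> list:
    """Mimics rdd.mapPartitions: fn receives each partition as an iterator
    and returns an iterator over the transformed elements."""
    return [list(fn(iter(part))) for part in partitions]

# Two partitions, each holding single-named-graph datasets.
partitions = [
    [DatasetOneNg("urn:lsq:q1", [("q1", "hasText", "select * where { ?s ?p ?o }")])],
    [DatasetOneNg("urn:lsq:q2", [("q2", "hasText", "ask { ?s ?p ?o }")])],
]

result = map_partitions(partitions, rdfize)
```

Because the operator only consumes and produces iterators of DatasetOneNg, the same function can run unchanged over a local iterator, an RxJava Flowable bridged to an iterator, or a Spark partition.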

@Aklakan
Member Author

Aklakan commented Oct 11, 2021

Sansa is now integrated for rdfize and analyze. Benchmarking from within Spark is probably not very useful; if it is desired anyway, a new issue can be raised.

@Aklakan Aklakan closed this as completed Oct 11, 2021