
Integrate Apache Spark / Sansa #23

Closed
Aklakan opened this issue Oct 8, 2021 · 2 comments

Comments

@Aklakan
Member

Aklakan commented Oct 8, 2021

This would make processing large log files less boring!

@Aklakan
Member Author

Aklakan commented Oct 10, 2021

After many additions and refactorings across aksw-commons, jena-sparql-api (jsa), Sansa, and LSQ, lsq rdfize now runs on Spark.
The central aspect is the marriage of Spark and RxJava: LSQ was already built around rx-based functions, and it was clear that these could in principle be made to run inside an rdd.mapPartitions call. Until now, however, there were always gaps when trying to construct a full pipeline from source data to target data via LSQ.

The missing piece was the introduction of the DatasetOneNg class in both Sansa and jsa-rx: this class represents a Dataset that contains exactly one named graph, which makes it possible to enforce RDDs or Flowables of individual named graphs.

The input adapter turns any source into a Flowable or RDD of DatasetOneNg, and the output is again DatasetOneNg, so any operator that transforms one into the other can be used within a mapPartitions call.
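To illustrate the shape of this adapter pattern, here is a minimal, self-contained Python sketch. It is not the actual Sansa/LSQ API: DatasetOneNg is reduced to a toy class holding one named graph, rdfize stands in for any LSQ operator that maps an iterator of DatasetOneNg to an iterator of DatasetOneNg, and map_partitions mimics the semantics of Spark's rdd.mapPartitions (the operator sees each partition as an iterator and yields results lazily).

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple

@dataclass
class DatasetOneNg:
    """Toy stand-in for a dataset that contains exactly one named graph."""
    graph_name: str
    triples: List[Tuple[str, str, str]]

def rdfize(datasets: Iterator[DatasetOneNg]) -> Iterator[DatasetOneNg]:
    """Hypothetical LSQ-style operator: DatasetOneNg in, DatasetOneNg out.

    Here it merely uppercases the object of each triple; a real operator
    would enrich the graph with RDFized log records.
    """
    for ds in datasets:
        yield DatasetOneNg(
            ds.graph_name,
            [(s, p, o.upper()) for (s, p, o) in ds.triples],
        )

def map_partitions(partitions, fn: Callable) -> list:
    """Mimics rdd.mapPartitions: fn receives each partition as an iterator
    and returns an iterator over the transformed elements."""
    return [list(fn(iter(part))) for part in partitions]

# Two partitions, each holding single-named-graph datasets.
partitions = [
    [DatasetOneNg("urn:lsq:q1", [("q1", "hasText", "select * where { ?s ?p ?o }")])],
    [DatasetOneNg("urn:lsq:q2", [("q2", "hasText", "ask { ?s ?p ?o }")])],
]

result = map_partitions(partitions, rdfize)
```

Because the operator only consumes and produces iterators of DatasetOneNg, the same function can run unchanged over a local iterator, an RxJava Flowable bridged to an iterator, or a Spark partition.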

@Aklakan
Member Author

Aklakan commented Oct 11, 2021

Sansa is now integrated for rdfize and analyze. Benchmarking from within Spark is probably not very useful; if it is desired anyway, a new issue can be raised.

@Aklakan Aklakan closed this as completed Oct 11, 2021