After lots of additions and refactoring across aksw-commons, jena-sparql-api (jsa), sansa and lsq, `lsq rdfize` now runs on Spark.
The most central aspect is the marriage of Spark and RxJava: LSQ was already built around rx-based functions, and it was clear that this could somehow be made to work inside `rdd.mapPartitions`, but so far there were always gaps when trying to construct a pipeline from source data to target data via LSQ.
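In essence, the bridge wraps a partition's iterator as a Flowable, applies the rx-based operator, and hands the result back to Spark as a plain iterator. A minimal sketch of that pattern, assuming RxJava 3 and Spark's Java API (the class and method names below are illustrative, not the actual lsq/jsa API):

```java
import java.util.Iterator;

import org.apache.spark.api.java.JavaRDD;

import io.reactivex.rxjava3.core.Flowable;

public class RxInMapPartitions {

    // An rx-based operator: the kind of Flowable-to-Flowable function lsq is built from.
    static Flowable<String> normalize(Flowable<String> in) {
        return in.map(String::trim).filter(s -> !s.isEmpty());
    }

    // Apply the rx operator to every partition of a Spark RDD.
    static JavaRDD<String> applyPerPartition(JavaRDD<String> rdd) {
        return rdd.mapPartitions((Iterator<String> it) -> {
            // Wrap the partition's iterator as a Flowable, run the rx pipeline,
            // and return the result to Spark as a plain iterator.
            Flowable<String> in = Flowable.fromIterable(() -> it);
            return normalize(in).blockingIterable().iterator();
        });
    }
}
```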
The missing piece was the introduction of the DatasetOneNg class to both Sansa and jsa-rx. This class represents a Dataset with only a single named graph and makes it possible to enforce RDDs or Flowables of individual named graphs.
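For illustration only: a single-named-graph dataset can be pictured as a plain Jena Dataset constrained to hold exactly one named graph. This is just a sketch of the idea using the standard Jena API; the actual DatasetOneNg class differs:

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class SingleNamedGraphExample {

    // Build a dataset that contains exactly one named graph.
    static Dataset singleNamedGraph(String graphIri, Model model) {
        Dataset ds = DatasetFactory.create();
        ds.addNamedModel(graphIri, model);
        return ds;
    }

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Dataset ds = singleNamedGraph("urn:example:graph1", model);
        ds.listNames().forEachRemaining(System.out::println); // prints the single graph name
    }
}
```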
The input adapter turns any source into a Flowable or RDD of DatasetOneNg, and the output is DatasetOneNg as well, so any operator that transforms one into the other can be used within a mapPartitions call.
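Putting the two together, any Flowable-to-Flowable operator over single-named-graph datasets can be lifted into a per-partition mapper. The sketch below uses plain Jena Dataset as a stand-in for DatasetOneNg and assumes the Spark context is configured with Kryo serialization for Jena objects (as Sansa's setup provides); the helper names are made up:

```java
import java.io.Serializable;
import java.util.Iterator;

import org.apache.jena.query.Dataset;
import org.apache.spark.api.java.JavaRDD;

import io.reactivex.rxjava3.core.Flowable;

public class NamedGraphPipeline {

    // A serializable Flowable-to-Flowable operator over single-named-graph datasets.
    public interface GraphOp extends Serializable {
        Flowable<Dataset> apply(Flowable<Dataset> in);
    }

    // Lift the operator into a per-partition mapper on an RDD of single-named-graph datasets.
    // Note: shipping Jena objects between executors relies on Kryo registration (not shown here).
    public static JavaRDD<Dataset> mapWithRx(JavaRDD<Dataset> input, GraphOp op) {
        return input.mapPartitions((Iterator<Dataset> it) ->
                op.apply(Flowable.fromIterable(() -> it))
                        .blockingIterable()
                        .iterator());
    }
}
```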
Sansa is now integrated for `rdfize` and `analyze`. Benchmarking is probably not very useful from within Spark; if it is desired anyway, a new issue can be raised.
This would make processing large log files less boring!