Cichlid: Efficient Large Scale RDFS/OWL Reasoning with Spark
Introduction:
Cichlid is a distributed RDFS and OWL reasoning system built on Spark. Cichlid achieves higher efficiency and scalability than existing large-scale reasoning systems based on the MapReduce framework or on P2P self-organizing networks. The major contributions and novelties of Cichlid are summarized as follows:
- We propose an RDFS reasoning algorithm built on the Spark RDD parallel programming model. The parallel RDFS reasoning algorithm is optimized in three respects: the data partitioning model, the execution order of the reasoning rules, and the removal of duplicated data.
- Further, we studied the more complex OWL Horst reasoning rule set and designed a new parallel OWL reasoning algorithm on top of the Spark RDD model. We focused on optimizing the three most time-consuming parts of the algorithm: large-scale data joins, transitive closure computation, and equivalence relation computation.
- In addition to the above optimizations at the reasoning-algorithm level, we also optimized Spark's internal execution mechanism by proposing an off-heap memory storage mechanism for RDDs. This system-level optimization patch has been accepted and integrated into Apache Spark 1.0.
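
To make the rule ordering and duplicate removal mentioned above more concrete, here is a minimal single-machine sketch in plain Python. It is not Cichlid's Spark implementation: the function names `transitive_closure` and `rdfs_reason`, the triple encoding as `(subject, predicate, object)` tuples, and the choice of only two RDFS rules (rdfs11, subClassOf transitivity, and rdfs9, type inheritance) are assumptions made for illustration.

```python
# Illustrative single-machine sketch; NOT Cichlid's Spark implementation.
# All names below are hypothetical and chosen for this example only.
from collections import defaultdict

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def transitive_closure(pairs):
    """Return the transitive closure of a set of (a, b) pairs."""
    closure = set(pairs)
    while True:
        successors = defaultdict(set)
        for a, b in closure:
            successors[a].add(b)
        # Join (a, b) with (b, c) to derive (a, c).
        derived = {(a, c) for a, b in closure for c in successors.get(b, ())}
        if derived <= closure:  # fixpoint reached: nothing new was derived
            return closure
        closure |= derived


def rdfs_reason(triples):
    """Apply rdfs11 (subClassOf transitivity), then rdfs9 (type inheritance)."""
    triples = set(triples)  # using a set drops duplicated input triples
    schema = {(s, o) for s, p, o in triples if p == SUBCLASS}
    # The schema rule runs first, so the instance rule needs only one pass.
    sub_closure = transitive_closure(schema)
    superclasses = defaultdict(set)
    for sub, sup in sub_closure:
        superclasses[sub].add(sup)
    inferred = {(s, SUBCLASS, o) for s, o in sub_closure}
    for s, p, o in triples:
        if p == TYPE:
            inferred |= {(s, TYPE, sup) for sup in superclasses.get(o, ())}
    return triples | inferred  # set union removes duplicated derivations


if __name__ == "__main__":
    data = [
        ("ex:Cat", SUBCLASS, "ex:Mammal"),
        ("ex:Mammal", SUBCLASS, "ex:Animal"),
        ("ex:tom", TYPE, "ex:Cat"),
        ("ex:tom", TYPE, "ex:Cat"),  # duplicate on purpose
    ]
    for triple in sorted(rdfs_reason(data)):
        print(triple)
```

Computing the schema closure before touching instance triples means each `rdf:type` triple is visited once. In a distributed setting the small schema closure could plausibly be shared with every worker instead of being joined against the instance data, but that is a design assumption of this sketch rather than a description of Cichlid's internals.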
Cichlid is around 8 times faster on average than state-of-the-art distributed reasoning systems on both large-scale synthetic and real-world benchmarks.
For more information about the design of Cichlid and up-to-date documentation on many of our research ideas, check out our website: https://github.com/PasaLab/cichlid.