Rainbow: A Distributed and Hierarchical RDF Triple Store

Rainbow -- A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability

Introduction and Architecture

Introduction: Rainbow adopts a distributed and hierarchical storage architecture that uses HBase as the scalable persistent storage and further combines a distributed memory storage to speedup query performance. Moreover, a hybrid RDF data indexing scheme well fitting the storage architecture is designed based on the statistical analysis of user query space. Also, a fault-tolerant mechanism in Rainbow is designed and implemented to assure successful query response.
System Archtecture: The overview of Rainbow's system architecture is below.
Fig 1. System Archtecture.

Indexing Scheme and Data Placement

Query Space Analysis: To better decide what kinds of query pattern indices are more desirable to be stored in the distributed memory storage, we statistically analyze millions of real user queries and synthetic benchmark queries. The real user queries are got from the whole opened DBpedia query logs (around 22 million valid queries), and the synthetic benchmark queries are derived from some widely-used benchmarks, including LUBM, LongWell, SP2Bench and Berlin SPARQL Queries. The statistics of query patterns are shown below.
Fig 2-a. Statistics of real-user query patterns (extracted from the whole DBpedia query logs, it is here).


Fig 2-b. Statistics of Berlin SPARQL Benchmark query patterns (extracted from here).	Fig 2-c. Statistics of LUBM query patterns (extracted from here).	Fig 2-d. Statistics of SP2Bench query patterns (extracted from here).	Fig 2-e. Statistics of LongWell query patterns (extracted from here).

Indexing Scheme: Based on the statistical analysis above, we design a hybrid RDF data indexing scheme with its data placement (shown in Fig. 3) that well fits our distributed hierarchical storage architecture.

Fig 3. The hybrid indexing scheme and data placement of RDF triples.

Implementation and Evaluation

Benchmark: LUBM (the Lehigh University Benchmark) was used as the benchmark for test. LUBM has synthetic OWL and RDF data scalable to arbitrary size, as well as 14 extensional queries reflecting a variety of properties. In our experiments.
Triple Stores:
Rainbow: In our implementation, HBase (ver 0.94) is chosen as the distributed persistent storage. Redis (ver 2.4) is the single-node in-memory store. The whole distributed memory storage is made up of all these Redis stores to provide a unified interface to applications. ZooKeeper (ver 3.3.6) is adopt as the distributed coordinator. The frontal SPARQL engine of Rainbow is built in OpenRDF framework currently. The source code of Rainbow is now available here. During evaluation, the HBase cluster used in Rainbow is exactly the same as that of Jena-HBase with totally same configuration. For Redis, we adopted its default configuration.

Jena-HBase: (download from here). We adopted the Hybrid layout since the authors reported that it gave the best performance. JVM heap size was also set to 8GB.

SHARD: (download from here). We used Hadoop 1.0.3. All the 17 nodes acted as TaskTrackers/DataNodes, and another master node acted as JobTracker/NameNode. The Hadoop configuration used 8 map/reduce slots per node and set each child's JVM heap size to 8GB.

Sesame: (download from here). Version 2.3.2. Tomcat 6.0 as the HTTP interface. selected the native storage layout and set the SPOC, POSC, OPSC indices in the native storage configuration. 8GB memory was also assigned to the JVM heap.
Query Perforamnce: We tested the performance of Rainbow and comparative systems on LUBM-10, LUBM-100 and LUBM-1000 w.r.t. their query execution time. For simplicity, the distributed memory storage in Rainbow is denoted as Rainbow-IM, and the underlying distributed persistent storage is denoted as Rainbow-HBase. For better demonstration, the query execution time on Rainbow-IM and Rainbow-HBase are listed separately. All the time presented here is the average of 10 runs of the queries. Also, both cold start (cold cache) and hot run performance is reported. The detailed results of the LUMB-10 are shown in Table below.
Fig 4. Query execution time on LUBM-10 (1.3 million triples).

Scalability: We evaluated the scalability of Rainbow-IM by carrying out two experiments on a part of queries: (a) scaling the data amount while fixing the number of machines, and (b) scaling the number of machines while fixing the data size. The results are shown in the figures below.Please notice that, for all the queries, Rainbow-IM achieves stable performance regardless of increasing or decreasing the number of machines (shown in Fig. 7(b).). A possible reason for adding machines into Rainbow-IM is to enlarge the storage capacity of the distributed memory, which can potentially achieve more performance improvement when more data come to the system.


Fig 5-a. Varying the amount of data in Rainbow-IM.	Fig 5-b. Varying the number of machines In Rainbow-IM.

Dynamic and Fault Tolerant Features: We designed a simulating experiment to verify the dynamic scalability and fault tolerance features of Rainbow. The scenario is designed as follow: A user submits a sequence of queries (we chose the slow query Q13 here) to Rainbow; each submission is referred to as a query event. Rainbow keeps running without stopping.Initially, Rainbow-HBase held 1.3M triples while Rainbow-IM was disabled.
Fig 6. The hybrid indexing scheme and data placement of RDF triples.
Contact: Rong Gu, Wei Hu, Yihua Huang