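As the Semantic Web continues to develop, the Resource Description Framework (RDF) data published on the Internet has reached the scale of ten billion triples and is growing geometrically, so single-machine SPARQL query methods for RDF data are no longer applicable. To address this, a SPARQL basic graph pattern query algorithm based on the Bulk Synchronous Parallel (BSP) model is proposed. Based on the characteristics of RDF directed-graph data and the definition of the basic graph pattern, the whole query process is divided into two phases, matching and iteration: after the triple patterns required by the query are matched, iteration makes the partial solutions gradually approach the complete solutions, yielding the final query result. The algorithm is implemented with the HAMA distributed computing framework. Experimental results show that, compared with a MapReduce-based SPARQL query algorithm, it achieves higher query efficiency and can support fast SPARQL queries over large-scale RDF data.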
With the advance of the Semantic Web, the Resource Description Framework (RDF) data published on the Web has reached the scale of ten billion triples and continues to grow geometrically, so SPARQL Protocol and RDF Query Language (SPARQL) query methods that run on a single machine are no longer applicable. To address this problem, this paper proposes a SPARQL Basic Graph Pattern (BGP) query algorithm based on the Bulk Synchronous Parallel (BSP) model. Based on the directed-graph nature of RDF data and the definition of BGP, the algorithm divides the query process into a "matching" stage and an "iteration" stage: it first matches each triple pattern in the query and then iterates until the final query results are obtained. The algorithm is implemented on the HAMA distributed computing framework. Experimental results show that it achieves higher query efficiency than a MapReduce-based SPARQL query algorithm and can support SPARQL queries over large-scale RDF data.
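The two stages described above can be illustrated with a minimal single-process sketch. It is not the paper's HAMA implementation: the function names (`match_pattern`, `merge`, `query_bgp`), the `?`-prefix convention for variables, and the simple left-to-right join order are all assumptions made for illustration, and each pass of the loop in the iteration stage only stands in for what a BSP framework would run as a superstep with message exchange between workers.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def is_var(term: str) -> bool:
    # Assumption: variables are written with a leading "?", as in SPARQL.
    return term.startswith("?")


def match_pattern(pattern: Triple, triple: Triple) -> Optional[Binding]:
    """Return the variable binding if `triple` matches `pattern`, else None."""
    binding: Binding = {}
    for p, t in zip(pattern, triple):
        if is_var(p):
            # A variable repeated within one pattern must bind to the same term.
            if binding.get(p, t) != t:
                return None
            binding[p] = t
        elif p != t:
            return None
    return binding


def merge(a: Binding, b: Binding) -> Optional[Binding]:
    """Join two partial solutions if they agree on all shared variables."""
    for key, value in b.items():
        if a.get(key, value) != value:
            return None
    return {**a, **b}


def query_bgp(triples: List[Triple], patterns: List[Triple]) -> List[Binding]:
    if not patterns:
        return [{}]

    # Matching stage: collect candidate bindings for every triple pattern.
    candidates: List[List[Binding]] = []
    for pattern in patterns:
        matched = []
        for triple in triples:
            binding = match_pattern(pattern, triple)
            if binding is not None:
                matched.append(binding)
        candidates.append(matched)

    # Iteration stage: each pass extends the partial solutions by one more
    # pattern, so they approach complete solutions of the whole BGP.
    partial = candidates[0]
    for cand in candidates[1:]:
        extended = []
        for a in partial:
            for b in cand:
                merged = merge(a, b)
                if merged is not None:
                    extended.append(merged)
        partial = extended
    return partial


if __name__ == "__main__":
    data = [
        ("alice", "knows", "bob"),
        ("bob", "knows", "carol"),
        ("alice", "age", "30"),
    ]
    bgp = [("?x", "knows", "?y"), ("?y", "knows", "?z")]
    print(query_bgp(data, bgp))
```

In a distributed setting the candidate bindings would be partitioned across workers and the joins driven by messages between supersteps; the sketch keeps everything in one list only to make the matching-then-iteration structure visible.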