Processing Linked Data at Warp Speed
The Web of Data has grown immensely over the past years. From a single dataset in 2007, the linked portion of the Open Data Cloud grew to over 31 billion triples by 2011. That figure covers only the datasets usually shown in the cloud diagrams; a plethora of open datasets published by individuals, organizations, and governments all over the world is usually not shown. Given this growth, the question arises of how to process these data. Even at 10’000 triples per second, it would still take more than 861 hours to process the whole cloud, so algorithms traveling (or traversing) the linked data cloud using conventional methods are going to be slow. In this talk I will present two methods for processing large numbers of triples. First, I will introduce the distributed graph-processing framework Signal/Collect, which makes it possible to process billions of edges in seconds, and I will highlight its usefulness in three application scenarios. Second, I will briefly touch upon the need for, and the challenges of, processing large graphs as data streams, where the data is not stored in full and only the portions necessary for processing are kept.
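The 861-hour estimate above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation: the constant and function names are illustrative, it uses the 31 billion triples and the 10’000 triples per second quoted in the abstract, and it assumes a single sequential pass at a constant rate.

```python
# Back-of-the-envelope check of the processing-time estimate.
# Names are illustrative; the two figures come from the abstract.
TRIPLES_IN_CLOUD = 31_000_000_000  # linked portion of the cloud, 2011
TRIPLES_PER_SECOND = 10_000        # assumed constant sequential throughput
SECONDS_PER_HOUR = 3600


def hours_to_process(triples: int, rate_per_second: int) -> float:
    """Return the hours needed to touch every triple once at a fixed rate."""
    return triples / rate_per_second / SECONDS_PER_HOUR


hours = hours_to_process(TRIPLES_IN_CLOUD, TRIPLES_PER_SECOND)
print(f"{hours:.1f} hours, about {hours / 24:.0f} days")
```

At this rate a single pass takes roughly 36 days, and any algorithm that needs several passes over the data multiplies that figure.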