Semantic Web and Linked Open Data provide a potential platform for interoperability of scientific data, offering a flexible model for providing machine-readable and queryable metadata. However, RDF and SPARQL gained limited adoption within the scientific community, mainly due to the lack of support for managing massive numeric data, along with certain other important features – such as extensibility with user-defined functions, query modularity, and integration with existing environments and workflows.
We present the design, implementation and evaluation of Scientific SPARQL – a language for querying data and metadata combined, represented using the RDF graph model extended with numeric multidimensional arrays as node values – RDF with Arrays. The techniques used to store RDF with Arrays in a scalable way and process Scientific SPARQL queries and updates are implemented in our prototype software – Scientific SPARQL Database Manager, SSDM, and its integrations with data storage systems and computational frameworks. This includes scalable storage solutions for numeric multidimensional arrays and an efficient implementation of array operations. The arrays can be physically stored in a variety of external storage systems, including files, relational databases, and specialized array data stores, using our Array Storage Extensibility Interface. Whenever possible SSDM accumulates array operations and accesses array contents in a lazy fashion.
In scientific applications numeric computations are often used for filtering or post-processing the retrieved data, which can be expressed in a functional way. Scientific SPARQL allows expressing common query sub-tasks with functions defined as parameterized queries. This becomes especially useful along with functional language abstractions such as lexical closures and second-order functions, e.g. array mappers.
Existing computational libraries can be interfaced and invoked from Scientific SPARQL queries as foreign functions. Cost estimates and alternative evaluation directions may be specified, aiding the construction of better execution plans. Costly array processing, e.g. filtering and aggregation, is thus preformed on the server, saving the amount of communication. Furthermore, common supported operations are delegated to the array storage back-ends, according to their capabilities. Both expressivity and performance of Scientific SPARQL are evaluated on a real-world example, and further performance tests are run using our mini-benchmark for array queries.