General SPARQL Discussion

1.1. What is SPARQL?

SPARQL is a recursive acronym standing for SPARQL Protocol and RDF Query Language. As the name implies, SPARQL is a general term for both a protocol and a query language.

Most uses of the SPARQL acronym refer to the RDF query language. In this usage, SPARQL is a query language with SQL-like syntax for querying RDF graphs via pattern matching. The language’s features include basic conjunctive patterns, value filters, optional patterns, and pattern disjunction.
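To make "querying via pattern matching" concrete, here is a minimal Python sketch of how a conjunction of triple patterns plus a value filter can be evaluated. It is a toy in-memory matcher, not a SPARQL engine; the data, the prefixed names, and the function names (`is_var`, `match_pattern`, `match_bgp`) are all invented for illustration.

```python
# Toy illustration only: a tiny in-memory matcher, not a SPARQL engine.
# Triples are (subject, predicate, object) tuples; strings starting with "?"
# are variables. All names and data here are made up for the example.

TRIPLES = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:bob", "foaf:age", 17),
    ("ex:alice", "foaf:age", 30),
]


def is_var(term):
    """A pattern term is a variable if it is a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable already bound must agree with the new value.
            if p_term in result and result[p_term] != t_term:
                return None
            result[p_term] = t_term
        elif p_term != t_term:
            return None
    return result


def match_bgp(patterns, triples):
    """Match a conjunction of triple patterns (a basic graph pattern)."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Roughly equivalent to:
#   SELECT ?name WHERE { ?p foaf:name ?name . ?p foaf:age ?age . FILTER(?age >= 18) }
solutions = match_bgp(
    [("?p", "foaf:name", "?name"), ("?p", "foaf:age", "?age")],
    TRIPLES,
)
adults = [s["?name"] for s in solutions if s["?age"] >= 18]
print(adults)
```

The shared variable `?p` is what joins the two patterns. No join condition is written out, which is the sense in which SPARQL joins are implicit.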

The SPARQL protocol is a method for remote invocation of SPARQL queries. It specifies a simple interface, which can be supported via HTTP or SOAP, that a client can use to issue SPARQL queries against an endpoint.
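As a rough sketch of the HTTP binding, the Python below builds (but does not send) a GET request carrying a query. The endpoint and graph URLs are placeholders and `build_query_request` is a hypothetical helper. The parameter names `query`, `default-graph-uri`, and `named-graph-uri` and the `application/sparql-results+xml` media type come from the W3C specifications. Individual endpoints may accept other parameters or formats.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Hypothetical endpoint and graph URLs, used only for illustration.
ENDPOINT = "http://example.org/sparql"
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"


def build_query_request(endpoint, query, default_graphs=(), named_graphs=()):
    """Build an HTTP GET request for a SPARQL query (nothing is sent here)."""
    params = [("query", query)]
    params += [("default-graph-uri", g) for g in default_graphs]
    params += [("named-graph-uri", g) for g in named_graphs]
    url = endpoint + "?" + urlencode(params)
    # Ask for the SPARQL Query Results XML Format.
    return Request(url, headers={"Accept": "application/sparql-results+xml"})


request = build_query_request(
    ENDPOINT, QUERY, default_graphs=["http://example.org/data.rdf"]
)
print(request.get_method())
print(request.full_url)

# Decode the query string again to show what the endpoint would receive.
sent = parse_qs(urlsplit(request.full_url).query)
print(sent["query"][0])
```

Passing the request to `urllib.request.urlopen` would actually issue it. That step is omitted because the endpoint above does not exist.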

Both the SPARQL query language and the SPARQL protocol are products of the W3C’s RDF Data Access Working Group. The latest released versions of the Working Group’s specifications (excluding intermediate working drafts) can be found here:

SPARQL Query Language
SPARQL Protocol
SPARQL Query Results XML Format

1.2. How can I learn SPARQL?

There are a variety of SPARQL tutorials and introductions scattered around the Web. Some notable ones include:

Jena/ARQ SPARQL tutorial
Leigh Dodds’ “Introducing SPARQL” article
Philip McCarthy’s “Search RDF data with SPARQL” article
SPARQL By Example

1.3. What are the benefits/drawbacks of SPARQL vis a vis SQL and XQuery?

The jury is still out on best practices for using SPARQL as compared to other query languages. Some benefits of SPARQL include:

Queries RDF data. If your data is in RDF, then SPARQL can query it natively.
Implicit join syntax. SPARQL queries RDF graphs, which consist of triples expressing binary relations between resources, by specifying a subgraph with certain resources replaced by variables. Because all relationships are of a fixed size and data lives in a single graph, SPARQL does not require explicit joins that specify the relationship between differently structured data. That is, SPARQL is a query language for pattern matching against RDF graphs, and the queries themselves look and act like RDF. This is one main point made by Oracle’s Jim Melton in his analysis of SPARQL vis a vis SQL and XQuery, “SQL, XQuery, and SPARQL: What’s Wrong With This Picture?”
SPARQL has strong support for querying semistructured and ragged data, i.e., data with an unpredictable and unreliable structure. Variables may occur in the predicate position to query unknown relationships, and the OPTIONAL keyword provides support for querying relationships that may or may not occur in the data (much like SQL left outer joins).
SPARQL is often an appropriate query language for querying disparate data sources (not sharing a single native representation) in a single query. Because RDF represents all data as a collection of simple binary relations, most data can be easily mapped to RDF and then queried and joined using SPARQL. Often, these mappings can be performed on the fly, meaning that SPARQL can be used to join heterogeneous data at a higher level than that of the native structure of the data.
SPARQL is built to support queries in a networked, web environment. SPARQL introduces the notion of an RDF dataset, which is the pairing of a default graph and zero or more named graphs. Because named graphs are identified by URIs, and the contents of the default graph are usually specified by URI in the query as well, it is common for SPARQL implementations to retrieve a graph by performing an HTTP GET on the graph’s URI. This allows a single query to join information from multiple data sources accessible across different Web sites.
Similarly, the SPARQL GRAPH keyword allows data to be queried along with its provenance information. GRAPH can be used to discover the URI of the graph that contains the data that matches the query.
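The OPTIONAL and GRAPH behaviour described above can be sketched in a few lines of Python. This is a toy model rather than how any particular engine works: the dataset is a plain dictionary from graph URI to triples, and the graph URIs, prefixed names, data, and the `lookup` helper are all invented for the example.

```python
# Toy illustration only; the graph URIs, prefixes, and data are invented.
# A dataset maps a graph URI to the list of triples that graph holds.
DATASET = {
    "http://example.org/staff.rdf": [
        ("ex:alice", "foaf:name", "Alice"),
        ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ],
    "http://example.org/visitors.rdf": [
        ("ex:bob", "foaf:name", "Bob"),
    ],
}


def lookup(triples, subject, predicate):
    """Return every object for (subject, predicate) in a list of triples."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Roughly equivalent to:
#   SELECT ?g ?name ?mbox WHERE {
#     GRAPH ?g { ?p foaf:name ?name . OPTIONAL { ?p foaf:mbox ?mbox } }
#   }
rows = []
for graph_uri, triples in DATASET.items():  # GRAPH ?g binds the graph's URI
    for person, predicate, name in triples:
        if predicate != "foaf:name":
            continue
        mboxes = lookup(triples, person, "foaf:mbox")
        # OPTIONAL: keep the row even when nothing matches; ?mbox stays unbound.
        for mbox in mboxes or [None]:
            rows.append((graph_uri, name, mbox))

for row in rows:
    print(row)
```

Bob still appears in the results with no mailbox, as a row would after a SQL left join, and every row carries the URI of the graph it came from, which is the provenance information GRAPH exposes.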

Some drawbacks are:

Lack of wide deployment. SPARQL is relatively young, and as such there are not many data stores which can be directly queried with SPARQL (as compared with SQL or XPath).
Immaturity. As a young query language, SPARQL lacks the explicit processing model of XQuery or the decades of SQL-optimization research. As with the above point, this is likely to improve as current and new research and implementations contribute to a body of knowledge surrounding SPARQL.
Lack of support for transitive/hierarchical queries. While SPARQL is designed to query RDF graphs, SPARQL has no facilities for easily querying transitive relations or hierarchical structures within a graph. There are some workarounds for this, but SPARQL does not approach the power of, for instance, XQuery’s axes.
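The workarounds are not spelled out above. One possible approach, sketched here as an assumption rather than an established recipe, is to compute the transitive closure on the client by repeating a single-hop query until no new resources turn up. In the Python below, `one_step` is a stand-in for issuing such a query to an endpoint, and the class hierarchy is invented.

```python
from collections import deque

# Invented data: each tuple is (subclass, "rdfs:subClassOf", superclass).
TRIPLES = [
    ("ex:Poodle", "rdfs:subClassOf", "ex:Dog"),
    ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
    ("ex:Cat", "rdfs:subClassOf", "ex:Mammal"),
]


def one_step(node, predicate):
    """Stand-in for a single-hop query such as
    SELECT ?o WHERE { <node> <predicate> ?o }."""
    return [o for s, p, o in TRIPLES if s == node and p == predicate]


def transitive(start, predicate):
    """Follow `predicate` breadth-first from `start`, one hop per query."""
    seen = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in one_step(node, predicate):
            if nxt not in seen:  # also guards against cycles in the data
                seen.append(nxt)
                queue.append(nxt)
    return seen


ancestors = transitive("ex:Poodle", "rdfs:subClassOf")
print(ancestors)
```

The cost is one query per resource visited, which is why this counts as a workaround and not as language support.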

1.4. What SPARQL implementations are available?

The community maintains a list of SPARQL implementations at the W3C ESW Wiki.

1.5. Can I use SPARQL to query data that’s not stored in RDF?

Several software packages exist which allow SPARQL queries to generate answers from data sources other than RDF, such as relational databases, LDAP servers, or XML data. The community maintains a list of these tools at the W3C ESW Wiki.

The W3C recently completed an incubator group examining the state of the art in accessing relational databases via SPARQL. As a result of this incubator group, a new Working Group may be established to produce specifications in this area.

1.6. How can I tell what dataset, functions, or extensions a SPARQL endpoint supports?

There is currently no established, interoperable method for representing or accessing functional descriptions of SPARQL endpoints. (This is not to be confused with the WSDL which describes the SPARQL Protocol itself.)

The Data Access Working Group postponed this topic in 2005, leaving behind a draft “of historical interest only.” In the meantime, implementations have devised their own vocabularies and techniques for specifying and advertising the services and datasets supported by a SPARQL endpoint. For example, HP Labs’ Joseki allows service descriptions to be specified with an RDF configuration vocabulary. See the SPARQL service description wiki page for more information.

1.7. Do SPARQL queries perform well against small datasets? Large datasets?

The performance of a SPARQL query against any particular dataset depends not only upon the size of the dataset but also on the nature of the dataset’s storage (a relational store, a native triple store, LDAP, etc.), the complexity of the query itself, optimizations in use by the SPARQL engine, the distribution of the data, and other environmental factors. To date, little work has been done in analyzing SPARQL query performance in particular, and the field of SPARQL query optimization is relatively inchoate.

Some analysis has been done on the topic of RDF stores that can handle large datasets. (A large dataset in this context is usually considered one on the order of tens or hundreds of millions of triples.) The W3C ESW wiki contains information on a variety of RDF stores that can scale to large numbers of triples, but does not speak specifically to the performance of SPARQL queries against these stores.

1.8. Is there anywhere on the Web where I can try out SPARQL queries?

The creators of several SPARQL implementations provide online services where SPARQL queries can be input and executed against either canned datasets or arbitrary datasets (identified by URLs). The community maintains a list of SPARQL endpoints on the W3C ESW Wiki.