Why does this federated SPARQL query work in TopBraid but not in Apache Fuseki?
I have the following federated SPARQL query, which works as I expect in TopBraid Composer Free Edition (version 5.1.4) but does not work in Apache Fuseki (version 2.3.1):
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix movie: <http://data.linkedmdb.org/resource/movie/>
prefix dcterms: <http://purl.org/dc/terms/>
select ?s {
  service <http://data.linkedmdb.org/sparql> {
    <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
    ?actor movie:actor_name ?actorname .
  }
  service <http://dbpedia.org/sparql?timeout=30000> {
    ?s ?p ?o .
    filter(regex(str(?s), replace(?actorname, " ", "_"))) .
  }
}
I monitored the SPARQL sub-queries being executed under the hood and noticed that TopBraid correctly sends the following query to the http://dbpedia.org/sparql endpoint:
select * { ?s ?p ?o filter regex(str(?s), replace("Paul Reubens", " ", "_")) }
while Apache Fuseki executes the following sub-query:
select * { ?s ?p ?o filter regex(str(?s), replace(?actorname, " ", "_")) }
Notice the difference: TopBraid replaces the variable ?actorname with a particular value, "Paul Reubens", while Apache Fuseki does not. This results in an error from the http://dbpedia.org/sparql endpoint, because ?actorname is used in the result set but not assigned.
Is this a bug in Apache Fuseki or a feature in TopBraid? How can I make Apache Fuseki execute this federated query correctly?
Update 1: To clarify the difference in behaviour between TopBraid and Apache Fuseki a bit more: TopBraid executes the linkedmdb.org sub-query first, and then executes the dbpedia.org sub-query for each result of the linkedmdb.org query (substituting ?actorname with the results from the linkedmdb.org query). I assumed that Apache Fuseki behaves similarly, but its first sub-query to dbpedia.org fails (because ?actorname is used in the result set but not assigned), so it does not continue. I am not sure whether it would execute the sub-query to dbpedia.org multiple times, because it never gets that far.
Update 2: I think both TopBraid and Apache Fuseki use Jena/ARQ, but I noticed in TopBraid's stack traces that the package name is com.topbraid.jena.*, which might indicate that they use a modified version of Jena/ARQ?
Update 3: Joshua Taylor says below: "Surely you wouldn't expect the second service block to be executed for each one of them?" Both TopBraid and Apache Fuseki use this method for the following query:
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix movie: <http://data.linkedmdb.org/resource/movie/>
prefix dcterms: <http://purl.org/dc/terms/>
select ?film ?label ?subject {
  service <http://data.linkedmdb.org/sparql> {
    ?film a movie:film .
    ?film rdfs:label ?label .
    ?film owl:sameAs ?dbpedialink
    filter(regex(str(?dbpedialink), "dbpedia", "i"))
  }
  service <http://dbpedia.org/sparql> {
    ?dbpedialink dcterms:subject ?subject
  }
}
limit 50
But I agree that in principle they should execute both parts once and join them; maybe they chose a different strategy for performance reasons?
Additionally, notice how the above query works on Apache Fuseki, while the first query of this post does not. So Apache Fuseki is behaving like TopBraid in this particular case. It seems to be related to using a URI variable (?dbpedialink) in two triple patterns (which works in Fuseki), compared to using a string variable (?actorname) from a triple pattern in a filter regex function (which does not work in Fuseki).
Updated (simpler) response
In the original answer that I wrote (below), I said that the issue is that SPARQL queries are executed innermost first. I think that still applies here, but I think the problem can be isolated even more easily. If you have
service <ex1> { ... } service <ex2> { ... }
then the results you get have to be what you'd get by executing each query separately on its endpoint and then joining the results. The join merges the results in which the common variables have the same values. E.g.,
service <ex1> { values ?a { 1 2 3 } } service <ex2> { values ?a { 2 3 4 } }
would execute, and you'd have two possible values for ?a in the outer query (2 and 3). In your query, the second service can't produce any results. If you take:
?s ?p ?o . filter(regex(str(?s), replace(?actorname, " ", "_"))) .
and execute it at DBpedia, you shouldn't get any results, because ?actorname isn't bound, so the filter will never succeed. It appears that TopBraid is performing the first service first and then injecting the resulting values into the second service. That's convenient, but I don't think it's correct, because it returns different results than you'd get if the DBpedia query had been executed first and the other query second.
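To make the join argument concrete, here is a minimal Python sketch that models solutions as dictionaries. It is not how Jena or TopBraid are implemented; the helper names (compatible, join, apply_filter) and the sample DBpedia row are made up for illustration. An unbound variable is modelled as a missing dictionary key, and a filter that hits one is treated as an error that drops the row.

```python
import re


def compatible(left, right):
    """Two solutions are compatible if every shared variable has the same value."""
    return all(left[var] == right[var] for var in left.keys() & right.keys())


def join(left_solutions, right_solutions):
    """Merge every pair of compatible solutions."""
    return [{**l, **r} for l in left_solutions for r in right_solutions if compatible(l, r)]


def apply_filter(solutions, condition):
    """Keep rows where the condition holds; an error on an unbound variable drops the row."""
    kept = []
    for solution in solutions:
        try:
            if condition(solution):
                kept.append(solution)
        except KeyError:  # models a filter error caused by an unbound variable
            pass
    return kept


# The two VALUES blocks from the example above.
ex1 = [{"a": 1}, {"a": 2}, {"a": 3}]
ex2 = [{"a": 2}, {"a": 3}, {"a": 4}]
values_join = join(ex1, ex2)  # only 2 and 3 survive


def name_filter(solution):
    # ?actorname is never bound inside the second service block, so this raises KeyError.
    pattern = solution["actorname"].replace(" ", "_")
    return re.search(pattern, solution["s"]) is not None


# Hypothetical row standing in for "?s ?p ?o" at DBpedia.
dbpedia_rows = [{"s": "http://dbpedia.org/resource/Paul_Reubens", "p": "rdfs:label", "o": "Paul Reubens"}]
second_service = apply_filter(dbpedia_rows, name_filter)

first_service = [{"actor": "actor/1", "actorname": "Paul Reubens"}]
overall = join(first_service, second_service)
```

Evaluated this way, the second block yields nothing on its own, so the join with the first block is empty regardless of what LinkedMDB returns.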
Original answer
Subqueries in SPARQL are executed innermost first. That means that a query like
select * { { select ?x { ?x a :cat } } ?x foaf:name ?name }
would first find all the cats, and then find their names. The "candidate" values for ?x are determined first by the subquery, and then those values of ?x are made available to the outer query. Now, when there are two subqueries, e.g.,
select * { { select ?x { ?x a :cat } } { select ?x ?name { ?x foaf:name ?name } } }
the first subquery is going to find all the cats. The second subquery finds the names of everything that has a name, and then, in the outer query, the results are joined to get just the names of the cats. The values of ?x from the first subquery aren't available during the execution of the second subquery. (At least in principle; a query optimizer might be able to figure out that some things should be restricted.)
My understanding is that service blocks have the same kind of semantics. In your query, you have:
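As a rough illustration of the two-subquery case, the following Python sketch evaluates each subquery independently over a tiny, made-up set of triples and only then joins on ?x. The data (:tom, :rex) is hypothetical.

```python
# Hypothetical data: one cat and one dog, both with names.
triples = [
    (":tom", "a", ":cat"),
    (":tom", "foaf:name", "Tom"),
    (":rex", "a", ":dog"),
    (":rex", "foaf:name", "Rex"),
]

# First subquery: everything that is a cat.
cats = [{"x": s} for s, p, o in triples if p == "a" and o == ":cat"]

# Second subquery: names of everything that has one. It never sees the cats.
named = [{"x": s, "name": o} for s, p, o in triples if p == "foaf:name"]

# Outer query: join the two result sets on the shared variable ?x.
joined = [{**c, **n} for c in cats for n in named if c["x"] == n["x"]]
```

Note that the second result set still contains the dog's name; it is the join in the outer query, not the second subquery, that restricts the answer to cats.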
service <http://data.linkedmdb.org/sparql> {
  <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
  ?actor movie:actor_name ?actorname .
}
service <http://dbpedia.org/sparql?timeout=30000> {
  ?s ?p ?o .
  filter(regex(str(?s), replace(?actorname, " ", "_"))) .
}
Your tracing shows that TopBraid is executing
select * { ?s ?p ?o filter regex(str(?s), replace("Paul Reubens", " ", "_")) }
If TopBraid had already executed the first service block and got a unique solution, then that might be an acceptable optimization, but what if, for instance, the first query had returned multiple bindings for ?actorname? Surely you wouldn't expect the second service block to be executed for each one of them? Instead, the second service block is executed as written, and will return a result set that will be joined with the result set from the first.
The reason that it probably "doesn't work" in Jena is that the second query doesn't actually bind any variables, so it's pretty much got to look at every triple in the data, which is going to take a long time.
I think that you can get around this by nesting the service calls. If a nested service is launched by the "local" endpoint (i.e., nesting a service call doesn't ask the remote endpoint to make another remote query), then you might be able to do:
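The difference between the two strategies can be sketched in Python. This is only a model of the behaviour seen in the traces, not actual TopBraid or Jena code; the second actor name, the DBpedia subjects, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical results of the first service block (two bindings for ?actorname).
first_service = [{"actorname": "Paul Reubens"}, {"actorname": "Some Actor"}]

# Hypothetical subjects standing in for DBpedia's data.
dbpedia_subjects = [
    "http://dbpedia.org/resource/Paul_Reubens",
    "http://dbpedia.org/resource/Pee-wee_Herman",
]


def second_service(bindings=None):
    """Evaluate the second block; `bindings` models values injected by the caller."""
    bindings = bindings or {}
    if "actorname" not in bindings:
        return []  # unbound ?actorname: the filter errors for every row
    pattern = bindings["actorname"].replace(" ", "_")
    return [{"s": s, **bindings} for s in dbpedia_subjects if re.search(pattern, s)]


def bottom_up():
    """Run the second block as written, then join with the first block's results."""
    right = second_service()
    return [
        {**l, **r}
        for l in first_service
        for r in right
        if all(l[k] == r[k] for k in l.keys() & r.keys())
    ]


def substitution():
    """Run the second block once per binding from the first block."""
    results, calls = [], 0
    for binding in first_service:
        calls += 1
        results.extend(second_service(binding))
    return results, calls


bottom_up_results = bottom_up()
substituted_results, remote_calls = substitution()
```

In this model the bottom-up evaluation returns nothing, while the substitution strategy finds the Paul_Reubens resource at the cost of one remote call per binding.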
service <http://dbpedia.org/sparql?timeout=30000> {
  service <http://data.linkedmdb.org/sparql> {
    <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
    ?actor movie:actor_name ?actorname .
  }
  ?s ?p ?o .
  filter(regex(str(?s), replace(?actorname, " ", "_"))) .
}
That might get you the kind of optimization you want, but it still seems like it might not work unless DBpedia has some efficient way of figuring out which triples to retrieve based on computing the replace. You're asking DBpedia to look at all its triples, and then to keep the ones where the string form of the subject matches a particular regular expression. It'd probably be better to construct the IRI manually in a subquery and then search for it. I.e.,
service <http://dbpedia.org/sparql?timeout=30000> {
  { select ?resource {
      service <http://data.linkedmdb.org/sparql> {
        <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
        ?actor movie:actor_name ?actorname .
      }
      bind(iri(concat("http://dbpedia.org/resource/", replace(?actorname, " ", "_"))) as ?resource)
  } }
  ?resource ?p ?o
}
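The same IRI construction can be sketched in Python to show why it is cheaper than the regex filter. This assumes that DBpedia resource IRIs are http://dbpedia.org/resource/ followed by the name with spaces replaced by underscores, which will not hold for every actor (disambiguated or specially encoded titles, for instance). The function name and the tiny index are hypothetical.

```python
def dbpedia_iri(actor_name):
    """Build a candidate DBpedia IRI from a plain name (assumed naming scheme)."""
    return "http://dbpedia.org/resource/" + actor_name.replace(" ", "_")


# Hypothetical subject index standing in for DBpedia's store.
index = {
    "http://dbpedia.org/resource/Paul_Reubens": [("rdfs:label", "Paul Reubens")],
    "http://dbpedia.org/resource/Pee-wee_Herman": [("rdfs:label", "Pee-wee Herman")],
}

iri = dbpedia_iri("Paul Reubens")

# A known subject allows a direct lookup instead of a scan over every triple.
properties = index.get(iri, [])
```

With a concrete subject, the endpoint can answer from its index, whereas the regex version has to test the string form of every subject.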