Pl4net.info

Library voices. Independent, daily.

14 November 2014
by nbt

Publishing SPARQL queries live

SPARQL queries are a great way to explore Linked Data sets - be it our STW with its links to other vocabularies, the papers of our repository EconStor, or persons and institutions in economics as authority data. ZBW has therefore been offering public endpoints for quite some time. Yet it is often not easy to figure out the right queries: the classes and properties used in the data sets are unknown, and the overall structure requires some exploration. Therefore, in our new SPARQL Lab, we have started collecting queries that are in use at ZBW and that could serve others as examples for working with our datasets.

A major challenge was to publish queries in a way that allows not only their execution, but also their modification by users. The first approach to this was pre-filled HTML forms (e.g. http://zbw.eu/beta/sparql/stw.html). Yet that couples the query code to the HTML page and to a hard-coded endpoint address. It does not scale to multiple queries on a diversity of endpoints, and it is difficult to test and to keep in sync with changes in the data sets. Besides, offering a simple text area without any editing support makes it quite hard for users to adapt a query to their needs.

And then came YASGUI, an "IDE" for SPARQL queries. Accompanied by the YASQE and YASR libraries, it offers a completely client-side, customizable, Javascript-based editing and execution environment. Particular highlights from the libraries' descriptions include:

  • SPARQL syntax highlighting and error checking
  • Extremely customizable: All functions and handlers from the CodeMirror library are accessible
  • Persistent values (optional): your query is stored for easier reuse between browser sessions
  • Prefix autocompletion (using prefix.cc)
  • Property and class autocompletion (using the Linked Open Vocabularies API)
  • Can handle any valid SPARQL resultset format
  • Integration of preflabel.org for fetching URI labels

With a few lines of custom glue code, and with the friendly support of Laurens Rietveld, author of the YASGUI suite, it is now possible to load any query stored on GitHub into an instance on our beta site and execute it. Check it out - the URI

http://zbw.eu/beta/sparql-lab/?queryRef=https://api.github.com/repos/jneubert/sparql-queries/contents/class_overview.rq&endpoint=http://data.nobelprize.org/sparql

loads, displays and executes the query stored at https://github.com/jneubert/sparql-queries/blob/master/class_overview.rq on the endpoint http://data.nobelprize.org/sparql (which is CORS-enabled - a requirement for queryRef to work).
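
For illustration, the glue code behind the queryRef parameter might look roughly like the following sketch (this is not the actual sparql-lab code; the element ids, the use of fetch/URLSearchParams and the wiring of YASQE and YASR are assumptions made for this example):

// Read queryRef and endpoint from the page URL.
var params = new URLSearchParams(window.location.search);
var queryRef = params.get('queryRef'); // e.g. a GitHub contents API URL
var endpoint = params.get('endpoint'); // must be CORS-enabled

// Result renderer and query editor (YASR / YASQE).
var yasr = YASR(document.getElementById('yasr'), {});
var yasqe = YASQE(document.getElementById('yasqe'), {
  sparql: {
    endpoint: endpoint,
    showQueryButton: true,
    callbacks: { complete: function (data) { yasr.setResponse(data); } }
  }
});

// The GitHub contents API returns JSON with the file content base64-encoded.
fetch(queryRef)
  .then(function (response) { return response.json(); })
  .then(function (file) {
    var query = atob(file.content.replace(/\s/g, '')); // decode base64
    yasqe.setValue(query);  // load the query into the editor
    yasqe.query();          // and execute it against the endpoint
  });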

Links like this, with descriptions of each query's purpose, grouped according to tasks and datasets, and ordered in a sensible way, may provide a much more accessible repository and starting point for explorations than just a directory listing of query files. For ongoing or finished research projects, such a repository - together with versioned data sets deployed on SPARQL endpoints - may offer an easy-to-follow and traceable way to verify presented results. GitHub provides an infrastructure for publicly sharing the version history, and makes contributions easy: changes and improvements to the published queries can be proposed and integrated via pull requests, and an issue queue can handle bugs and suggestions. Links to queries authored by contributors, which may be saved in different repositories and project contexts, can be added straightaway. We would be very happy to include such contributions - please let us know.

23 September 2014
by nbt

Other editions of this work: An experiment with OCLC’s LOD work identifiers

Large library collections, and even more so portals or discovery systems aggregating data from diverse sources, face the problem of duplicate content. Wouldn't it be nice if all editions of a work could be gathered under a single entry in a result set?

The WorldCat catalogue, provided by OCLC, holds more than 320 million bibliographic records. Since early 2014, OCLC has shared its 197 million work descriptions as Linked Open Data: "A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work. ... In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, oclc numbered, editions already shared in WorldCat." The works and editions carry schema.org semantic markup, in particular using schema:exampleOfWork/schema:workExample for the relation from edition to work and vice versa. These properties were recently added to the schema.org spec, as suggested by the W3C Schema Bib Extend Community Group.

ZBW contributes to WorldCat, and has 1.2 million oclc numbers attached to its bibliographic records. So it seemed interesting to find out how many of these editions link to works, and from there to other editions of the very same work.

As a basis for our experiment, we extracted the "oclc subset" from EconBiz. Roughly one third of the instances in this subset are in English - and therefore most likely to be included in WorldCat from other sources as well -, one third are in other languages (mostly German), and for one third the language is unknown. We randomly selected 100,000 instances from the 1.2 million "oclc subset". We looked up the work id for each of the attached oclc numbers, and in turn the editions of this work.

An example of the work/edition linking

We start with oclc number 247780068, which represents an edition of the book "Changes in the structure of employment with economic development" by Amarjit S. Oberai (WorldCat, EconBiz).

A lookup via

curl -LH "Accept: application/ld+json" http://www.worldcat.org/oclc/247780068

returns data in JSON-LD format. Here is a heavily shortened version (the full data of the example is available on GitHub):

{
    "exampleOfWork" : "http://worldcat.org/entity/work/id/14820164",
    "schema:name" : "Changes in the structure of employment with economic development",
    "schema:datePublished" : "1978",
    "@id" : "http://www.worldcat.org/oclc/247780068",
    "workExample" : [
      "http://worldcat.org/isbn/9789221019268",
      "http://worldcat.org/isbn/9789221027737"
    ]
}

We go up the hierarchy by picking the "exampleOfWork" URI:

curl -LH "Accept: application/ld+json" http://worldcat.org/entity/work/id/14820164

and get the data for the work (heavily shortened again)

{
    "schema:name" : [
      "Changes in the structure of employment with economic development",
      {
          "@value" : "Changes in the structure of employment with economic development",
          "@language" : "en"
      },
      {
          "@value" : "Changes in the structure of employment with economic development /",
          "@language" : "en"
      },
      "Changes in the structure of employment with economic development /",
      "Changes in the structure of employment with economic development."
    ],
    "@id" : "http://worldcat.org/entity/work/id/14820164",
    "workExample" : [
      "http://www.worldcat.org/oclc/609785684",
      "http://www.worldcat.org/oclc/784152303",
      "http://www.worldcat.org/oclc/797167669",
      "http://www.worldcat.org/oclc/4563017",
      "http://www.worldcat.org/oclc/245743486",
      "http://www.worldcat.org/oclc/730127832",
      "http://www.worldcat.org/oclc/247780068",
      "http://www.worldcat.org/oclc/716102743",
      "http://www.worldcat.org/oclc/732345392",
      "http://www.worldcat.org/oclc/466428327",
      "http://www.worldcat.org/oclc/716519725",
      "http://www.worldcat.org/oclc/472872207",
      "http://www.worldcat.org/oclc/760439428",
      "http://www.worldcat.org/oclc/254548387",
      "http://www.worldcat.org/oclc/705976718",
      "http://www.worldcat.org/oclc/781155896",
      "http://www.worldcat.org/oclc/8215485",
      "http://www.worldcat.org/oclc/803773951"
    ]
}

As we can observe, different forms of the title of the work (with and without language tag) are collected in the "schema:name" property - WorldCat does not try to determine an authoritative title for the work. Data from different editions is also collected for authors or subjects. Sometimes literal values are complemented by URIs, for example to VIAF (persons) or LCSH (subjects).

We can now look up other editions of this work, given in a "workExample" property, e.g.

curl -LH "Accept: application/ld+json" http://www.worldcat.org/oclc/245743486

which reveals a later edition of the same work:

{
    "exampleOfWork" : "http://worldcat.org/entity/work/id/14820164",
    "schema:name" : "Changes in the structure of employment with economic development",
    "schema:datePublished" : "1981",
    "@id" : "http://www.worldcat.org/oclc/245743486",
    "schema:bookEdition" : "2nd ed",
    "workExample" : "http://worldcat.org/isbn/9789221027737"
}

WorldCat itself seems to use this data in its "View all editions and formats" link on the human-readable web pages for the editions. The ISBN-based "workExample" links on the edition level redirect to oclc numbers; their purpose and use do not seem to be documented yet.

The experiment

The starting point for the experiment outlined above was the question to what extent such work/edition links exist for a real-world collection like EconBiz. For the randomly selected 100,000 instances (editions) with oclc numbers, we looked up the edition by its URI and extracted the related work (if such a work exists). We then looked up the work, and extracted the oclc numbers of all its editions. For each oclc number of the starting set, we saved a list of oclc numbers of other editions as a JSON data structure. This took about 44 hours of runtime. (We deliberately didn't parallelize the network access to avoid overloading the server, and cached results to save some lookups.) For 15 of the lookups we got a 500 "internal server error", for 71 a 404 "not found". These errors occurred on edition as well as on work lookups, with no recognizable pattern. Some random tests revealed that normally a second lookup of the URL was successful. Due to their small number, we ignored these errors in the further course of the experiment.
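
For illustration, the core of the lookup chain can be sketched in a few lines of Javascript (this is not the script actually used for the experiment, which also handled caching, throttling and error logging; the property names are those from the shortened JSON-LD examples above, while the real responses wrap them in a larger @graph structure that has to be walked):

// Resolve one oclc number to its work, and the work to its other editions.
async function otherEditions(oclcNumber) {
  const headers = { Accept: 'application/ld+json' };

  // 1. Look up the edition and pick the exampleOfWork link.
  const edition = await fetch('http://www.worldcat.org/oclc/' + oclcNumber, { headers })
    .then(r => r.json());
  const workUri = edition.exampleOfWork;
  if (!workUri) return []; // no work linked

  // 2. Look up the work and collect the oclc numbers of all its editions.
  const work = await fetch(workUri, { headers }).then(r => r.json());
  const examples = [].concat(work.workExample || []);
  return examples
    .filter(uri => uri.indexOf('/oclc/') > -1)
    .map(uri => uri.split('/oclc/')[1])
    .filter(num => num !== String(oclcNumber)); // other editions only
}

// Example: otherEditions('247780068').then(console.log);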

In a second step, we evaluated the resulting data with respect to the whole of WorldCat, and to the data of our collection. All code and data for the experiment (and for a prior run seven weeks earlier, which did not show significantly different results) is available on GitHub.

The results

For more than 99 % of our test data set we found valid WorldCat work ids. For a total of 880 oclc numbers we couldn't retrieve a work; in 922 cases the work did not link back to the oclc number from which we started. So in this early stage of WorldCat Linked Data (still flagged as "experimental") there seem to be some minor gaps and inconsistencies in the work/edition linkage. Yet, the results show that more than 60 % of the WorldCat editions in our test set link to a work with at least one other edition within WorldCat.

Number of editions per work    re. all OCLC numbers from WorldCat    re. 1,260,337 OCLC numbers from EconBiz
1                              37847                                 92879
2                              13734                                 4826
3-5                            23697                                 1225
6-10                           14663                                 137
11-50                          8563                                  50
51-100                         443                                   2
101-9999                       173                                   1

When we take into account which of these other editions are in the holdings of EconBiz, the number boils down to 6.2 % of the test set.

The resulting edition clusters themselves, the clustering algorithms, and in the end the cataloging practices they result from, require further analysis and discussion. A quick glance at the largest clusters in EconBiz reveals that they result from serials: Indian village surveys, country profiles or economic analyses for different countries. Whether these particular clusters make sense to users seems questionable.

How could this be useful?

One aim in a larger subject portal like EconBiz, which merges several data sources, is the reduction of duplicates in the result sets received by the users. Unfortunately, only a minor part (1.2 of 8 million records) of the EconBiz holdings have oclc numbers, and only a fraction of these form clusters within these holdings. So currently the WorldCat work clusters could only be a tiny piece of the de-duplication puzzle. For the development of custom de-duplication algorithms, however, the data may provide a starting point, firstly as a pool of possible example cases, and secondly as a counterpart for statistical analysis of results. (In a recent blog entry with some answers to early questions about the OCLC work entities, Richard Wallis points to OCLC's FRBR Work-Set Algorithm, which has been described in a 2002 D-Lib Magazine article.) Some random samples revealed a situation where de-duplication even for a few instances can be highly helpful: when working papers or other sources have records with and without attached links to the full text, work clusters could be exploited to always display a link to the PDF when an instance/edition is presented to the users.

Another area where work clusters could be useful immediately is the ranking of search results. If we suppose that works for which multiple instances exist are more relevant, we can use that as a ranking factor (surely among others). Since it does not make a crucial difference where these editions exist, we can base such an assumption on the whole of WorldCat, and thus can add such a ranking factor for a much larger part of our existing data.
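
As a toy illustration (our own sketch, not an implemented EconBiz feature), such a factor could be folded into a relevance score with a dampened boost, so that huge serial clusters do not dominate the ranking:

// Hypothetical ranking tweak: records whose work has many editions get a mild boost.
function editionBoost(editionsInWorldCat) {
  var weight = 0.1; // weight chosen arbitrarily for the example
  return 1 + weight * Math.log(1 + editionsInWorldCat);
}
// score = baseRelevance * editionBoost(numberOfEditions);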

This does not even touch the most exciting field for exploitation: the descriptions on the work as well as on the edition level. For subject indexing and classification, this has been investigated by Magnus Pfeffer (slides) and Kai Eckert, e.g. in the UB Mannheim Linked Data Service, and continued in the Culturegraph project. Possible applications are the collection and merging of index terms or classes from different editions of a work, or perhaps also an evaluation of indexing consistency. Heidrun Wiesenmüller suggested the use of work clusters for enrichment with personal name authority data, or even the enrichment of the authority data itself (slides, in German).

OCLC has announced further development of the service: "WorldCat Works will continue to be enhanced over the coming months and years.  The data will get cleaner, the descriptions will get richer, and the linking will get better."

----

With thanks to Kirsten Jeude, Kim Plassmeier and Timo Borst for hints and discussions.

9 April 2014
by nbt

Link out to DBpedia with a new Web Taxonomy module

ZBW Labs now uses DBpedia resources as tags/categories for articles and projects. The new Web Taxonomy plugin for DBpedia, a Drupal module developed at ZBW, integrates DBpedia labels, stemming from Wikipedia page titles, into the authoring process via a comfortable autocomplete widget. On the term page (example), further information about a keyword can be obtained through a link to the DBpedia resource. At the same time, this connects ZBW Labs to the Linked Open Data Cloud.

The plugin is the first one released for Drupal Web Taxonomy, which makes LOD resources and web services easily available for site builders. Plugins for further taxonomies are to be released within our Economics Taxonomies for Drupal project.

25 September 2013
by nbt

Extending econ-ws Web Services with JSON-LD and Other RDF Output Formats

From the beginning, our econ-ws (terminology) web services for economics have produced tabular output, very much like the results of a SQL query. Not a surprise - they are based on SPARQL, and use the well-defined, table-shaped SPARQL 1.1 query results formats in JSON and XML, which can easily be transformed to HTML. But there are services whose results do not really fit this pattern, because they are inherently tree-shaped. This is true especially for the /combined1 and the /mappings service. For the former, see our prior blog post; an example of the latter may be given here: the mappings of the descriptor International trade policy are shown (in HTML) as:

concept prefLabel relation targetPrefLabel targetConcept target
<http://zbw.eu/stw/descriptor/10616-4> "International trade policy" @en <http://www.w3.org/2004/02/skos/core#exactMatch> "International trade policies" @en <http://aims.fao.org/aos/agrovoc/c_31908> <http://zbw.eu/stw/mapping/agrovoc/target>
<http://zbw.eu/stw/descriptor/10616-4> "International trade policy" @en <http://www.w3.org/2004/02/skos/core#closeMatch> "Commercial policy" @en <http://dbpedia.org/resource/Commercial_policy> <http://zbw.eu/stw/mapping/dbpedia/target>

That's far from perfect - the "concept" and "prefLabel" entries of the source concept(s) of the mappings are identical over multiple rows.

Often, a consuming application will have to re-build the original tree structure, which can be visualized as:

Graph for International trade policy

In order to support such results more appropriately, we have added RDF formats (rdf-xml, n-triples, turtle and json-ld) to the output formats of the /combined1 and /mappings services. Under the hood, these outputs are generated by SPARQL CONSTRUCT (instead of SELECT) queries. In Turtle, the above result looks like this:

<http://zbw.eu/stw/descriptor/10616-4>
  skos:closeMatch  <http://dbpedia.org/resource/Commercial_policy> ;
  skos:exactMatch  <http://aims.fao.org/aos/agrovoc/c_31908> ;
  skos:prefLabel   "Internationale Handelspolitik"@de , "International trade policy"@en .

<http://dbpedia.org/resource/Commercial_policy>
  dcterms:isPartOf  <http://zbw.eu/stw/mapping/dbpedia/target> ;
  skos:prefLabel    "Commercial policy"@en .

<http://aims.fao.org/aos/agrovoc/c_31908>
  dcterms:isPartOf  <http://zbw.eu/stw/mapping/agrovoc/target> ;
  skos:prefLabel    "WELTHANDELSPOLITIK"@de , "International trade policies"@en .

The Turtle syntax reflects the fact that the Agrovoc and the DBpedia concept are both connected to the STW concept. Artificially built names such as prefLabel and targetPrefLabel, which are required in the result table above to distinguish columns, are avoided. And by the way, it allows us to output prefLabels in English and German without having to duplicate every row in the above table.

A great advantage of the RDF output formats is that the results can readily be fed into other RDF-enabled web services. To demonstrate this, we provide two example pages for the /mappings and /combined1 services, where you can enter an arbitrary query or concept. The URI for the invocation of the respective econ-ws service is built from your input by a Javascript function and passed to the VisualRDF service, which asks for the actual RDF and transforms it into a visual graph (like the one shown above).

JSON for Linking Data

For Semantic Web-savvy geeks, Turtle is very familiar and intuitive. However, web programmers without such a background would clearly prefer structured JSON, which can be parsed easily in Javascript (and almost any other language). In order to represent Linked Data in JSON, a W3C community group has developed JSON-LD and an accompanying API. Some days ago both reached W3C Candidate Recommendation state. On json-ld.org, lots of resources and an online playground for experiments are available.

Since our SPARQL server, Fuseki, currently does not deliver JSON-LD formatted results, we included Markus Lanthaler's JsonLD library to post-process the Turtle results delivered by Fuseki. This is working fine - for the above result we get:

{
  "@context": {
    "prefLabel": "http://www.w3.org/2004/02/skos/core#prefLabel",
    "exactMatch": "http://www.w3.org/2004/02/skos/core#exactMatch",
    "closeMatch": "http://www.w3.org/2004/02/skos/core#closeMatch",
    "isPartOf": "http://purl.org/dc/terms/isPartOf"
  },
  "@graph": [{
    "@id": "http://aims.fao.org/aos/agrovoc/c_31908",
    "isPartOf": {
      "@id": "http://zbw.eu/stw/mapping/agrovoc/target"
    },
    "prefLabel": [{
      "@language": "de",
      "@value": "WELTHANDELSPOLITIK"
    }, {
      "@language": "en",
      "@value": "International trade policies"
    }]
  }, {
    "@id": "http://dbpedia.org/resource/Commercial_policy",
    "isPartOf": {
      "@id": "http://zbw.eu/stw/mapping/dbpedia/target"
    },
    "prefLabel": {
      "@language": "en",
      "@value": "Commercial policy"
    }
  }, {
    "@id": "http://zbw.eu/stw/descriptor/10616-4",
    "closeMatch": {
      "@id": "http://dbpedia.org/resource/Commercial_policy"
    },
    "exactMatch": {
      "@id": "http://aims.fao.org/aos/agrovoc/c_31908"
    },
    "prefLabel": [{
      "@language": "de",
      "@value": "Internationale Handelspolitik"
    }, {
      "@language": "en",
      "@value": "International trade policy"
    }]
  }]
}

The result is straight JSON. The context part may or may not be used by an application. The last section of the graph part contains the overall structure for the source concept, whereas the details about the mapped DBpedia and Agrovoc concepts can be found in separate sections.
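
On the server this conversion is done with the PHP JsonLD library; just to illustrate the step itself, the same transformation could be sketched with the Javascript jsonld.js library (assuming a recent version of that library, and that the service is asked for N-Triples, which it also offers):

// Sketch: turn N-Triples from the service into compacted JSON-LD.
const jsonld = require('jsonld');

const context = {
  prefLabel:  'http://www.w3.org/2004/02/skos/core#prefLabel',
  exactMatch: 'http://www.w3.org/2004/02/skos/core#exactMatch',
  closeMatch: 'http://www.w3.org/2004/02/skos/core#closeMatch',
  isPartOf:   'http://purl.org/dc/terms/isPartOf'
};

async function toJsonLd(ntriples) {
  // Parse the triples into expanded JSON-LD, then compact with our context.
  const expanded = await jsonld.fromRDF(ntriples, { format: 'application/n-quads' });
  return jsonld.compact(expanded, context);
}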

To Frame or Not To Frame

The graph shown above could be transformed into an even more intuitive one by embedding the properties of the mapped concepts into the main tree derived from the source concept, resulting in one overall tree structure. This kind of shaping of the output is called "Framing" in JSON-LD. Since such a graph can take multiple forms, depending on application demands, it requires something like a template, called a "Frame" here. We provide a link to the JSON-LD Playground which will load our example and a simple frame document. (Please open the "Framed" tab, and feel free to experiment.)

The JSON-LD Framing spec is not on the W3C Standards Track, and has some open issues. We did not include a framed approach in the current version of econ-ws, but we do see it as quite promising: framed JSON-LD results can be used directly, without further processing, as, e.g., a Javascript data structure. The hard work of assembling the RDF "triple soup" into something meaningful can be done once and for all by the providers of a service, and is taken from the shoulders of the application programmer. However, if this application programmer happens to be a semweb geek, she can still get the raw triples without loss, to combine them with triples from other sources and assemble them into something completely different.
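
With jsonld.js, for example, a frame roughly like the following (illustrative only, not the frame document behind the playground link) would select the source concept and embed the mapped concepts below it:

// Sketch: framing nests the mapped concepts inside the node that references them.
const jsonld = require('jsonld');

// doc: a JSON-LD document like the one shown above; context: its @context.
async function frameMapping(doc, context) {
  const frame = {
    '@context': context,
    exactMatch: { '@embed': '@always' }, // match the source concept ...
    closeMatch: { '@embed': '@always' }  // ... and embed the mapped concepts
  };
  return jsonld.frame(doc, frame);
}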

An emerging pattern for unified Linked Data resource lookup and web service results?

In our Linked Data services, there is a not really satisfactory dichotomy between the econ-ws services, which mostly deliver query results of some kind, and the results you get when you look up resource URIs such as the one for International trade policy: the former delivered (up to now) easily understandable, table-shaped XML or JSON, while the latter deliver RDFa, RDF/XML or Turtle, which requires some effort to parse, especially for web developers who are not deeply into Semantic Web technology. (BTW, this situation is reflected in the fact that we have marked the /narrower and /labels services as "deprecated" - as the data can be obtained by resource lookup - but have not yet removed them.)

Perhaps, JSON-LD could bridge the gap, in that it can be used in both kinds of services, and in that it is able to express complex data structures in a widely familiar and easy-to-use way.

7 August 2013
by nbt

Thesaurus-augmented Search with Jena Text

How can we get the most out of a thesaurus to support user searches? Taking advantage of SKOS thesauri published on the web, their mappings and the latest Semantic Web tools, we can support users both with synonyms (e.g. "accountancy" for "bookkeeping") for their original search terms and with suggestions for neighboring concepts.

Thesauri describe the concepts of a domain (e.g., economics, as covered by STW), enriched with lots of alternate labels (possibly in multiple languages). These terms form a cloud of search terms, which could be used on keyword and free-text fields in arbitrary databases - simply querying for ("term a" OR "term b" OR ...).

The relationships within the thesaurus can additionally be exploited to bring up concepts (and their respective cloud of search terms) which are related to the original query of the user, and may or may not be instrumental for further exploration of the query space.

Basic search algorithm using text retrieval

Unfortunately, however, users rarely use the "correct" thesaurus terms or straight synonyms in their search queries. Especially if the query consists of more than one word, these words may refer to a single concept (such as "free trade zone") or to multiple concepts (such as "financial crisis in china"), or may even contain proper names (e.g., "siebert environmental economics"). And of course, within the single search box, the terms referring to different concepts are not separated neatly, but occur without delimiter in any order.

So the first task is to identify the concepts which may be relevant to a given search query. We see no reasonable chance for an exact analysis of possibly complex search queries in real time in order to split the query string and identify the single concepts (or persons, or things) according to the user's intention. Instead, we prefer to look at thesaurus concepts as text documents, with all their preferred, alternate or even hidden labels (the latter sometimes used to cover common misspellings) as part of the document's text. We then can build a text index on these documents. This done, we can just take the unmodified query string and apply standard text search algorithms to get the best fitting documents for this query. What "best fitting" means exactly depends on the scoring algorithms of the retrieval software. Here, heuristics come into play, which can often be tuned according to the specific use case and document set. Using the defaults of the software described below, we felt no urge to tune, since it worked pretty well out of the box.

Having the matching concepts identified, the thesaurus relationships can be used to offer neighboring concepts. We found the skos:narrower and the skos:related relationships most useful, but this may depend on the domain and on the construction of the particular thesaurus. In a final step, all concepts can be enriched with their labels, in order to be available for building search queries.

From a practical point of view, it is important that the three steps described here are executed in a single query. As it is executed over the network, having more than one round-trip could impact performance considerably.

Implementation as SPARQL query

We used the upcoming Jena Text module as part of the Jena Fuseki RDF/SPARQL server to implement the algorithm described above. On top of a triple store, it feeds the /combined1 web service for economics (1). The text module will be generally available with the next Jena 2.10.2 / Fuseki 0.2.8 version, but is quite stable already (snapshot here).

Jena Text allows us to ask for concepts related to, e.g., "telework" with the statement

  ?concept text:query ('telework' 1)

The Lucene query format could be used to further specify the query. The second, numeric argument in the argument list allows us to limit the results. We suppose that the maximum number of different concepts asked for in the query may not be higher than the number of words, so in general we put the query word count as a limit.
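
In Javascript terms, this heuristic could be sketched like this (illustrative only):

// Derive the text:query limit from the number of words in the user's input.
function textQueryClause(userQuery) {
  var limit = userQuery.trim().split(/\s+/).length;      // at most one concept per word
  var escaped = userQuery.replace(/(['\\])/g, '\\$1');   // naive escaping for the example
  return "?concept text:query ('" + escaped + "' " + limit + ")";
}
// textQueryClause('financial crisis in china')
//   -> "?concept text:query ('financial crisis in china' 4)"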

When we want to include narrower and related concepts (and their relationship to the matched concepts) in the result, we have to extend the query:

 {
   { ?concept text:query ('telework' 1) }
  UNION
   { ?concept text:query ('telework' 1) .
    ?concept skos:narrower ?narrower
   }
  UNION
   { ?tmpConcept text:query ('telework' 1) .
    ?tmpConcept skos:narrower ?concept
   }
  UNION
   { ?concept text:query ('telework' 1) .
    ?concept skos:related ?related
   }
  UNION
   { ?tmpConcept text:query ('telework' 1) .
    ?tmpConcept skos:related ?concept
   }
 }

The repetition of the text query is ugly (perhaps someone can suggest a more elegant solution), but seems to do no real harm in terms of execution time.

The enrichment with the labels is done by a final statement:

 {
   { ?concept skos:prefLabel ?prefLabel }
  UNION
   { ?concept skos:prefLabel ?label }
  UNION
   { ?concept skos:altLabel ?label }
  UNION
   { ?concept skos:hiddenLabel ?label }
 }
 BIND (lcase(?label) AS ?hiddenLabel)

This returns the skos preferred labels as such, and additionally a superset of all skos labels in lowercase form (as ?hiddenLabel), which comes in handy for building search queries.

Making use of thesaurus mappings

More and more mappings between thesauri (and sometimes other datasets, such as DBpedia) are published. Here is an example of a concept from STW mapped to a concept from the Thesaurus for the Social Sciences (generated with Visual RDF).

These mappings can be exploited twofold:

  1. the relevant concepts for a query can be looked up using synonyms (e.g., "Öko-Auditing") defined for the mapped concepts
  2. the list of synonyms returned for the identified concepts can be extended by all synonyms of the mapped concepts

In order to simplify both of these steps, we set up some data preprocessing, eliminating structural particularities of the mapped vocabularies before or while loading them into our triple store. Reducing skos-xl property chains to just the simple skos labeling properties (pref/alt/hiddenLabel), or replacing DBpedia rdfs:label by skos:prefLabel, allows us to use the skos labeling properties consistently in our queries. The complete, executable version looks like this:

PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX skos:  <http://www.w3.org/2004/02/skos/core#>
PREFIX zbwext: <http://zbw.eu/namespaces/zbw-extensions/>
PREFIX text:  <http://jena.apache.org/text#>

SELECT DISTINCT ?concept ?prefLabel ?hiddenLabel ?narrower ?related
WHERE {
 {
   { ?concept text:query ('telework' 2) }
  UNION
   { ?tmpConcept text:query ('telework' 2) .
    ?tmpConcept skos:exactMatch ?concept
   }
  UNION
   { ?concept text:query ('telework' 2) .
    ?concept skos:narrower ?narrower
   }
  UNION
   { ?tmpConcept text:query ('telework' 2) .
    ?tmpConcept skos:narrower ?concept
   }
  UNION
   { ?concept text:query ('telework' 2) .
    ?concept skos:related ?related
   }
  UNION
   { ?tmpConcept text:query ('telework' 2) .
    ?tmpConcept skos:related ?concept
   }
 }
 ?concept rdf:type zbwext:Descriptor .
 {
   { ?concept skos:prefLabel ?prefLabel }
  UNION
   { ?concept skos:prefLabel ?label }
  UNION
   { ?concept skos:altLabel ?label }
  UNION
   { ?concept skos:hiddenLabel ?label }
  UNION
   { ?concept skos:exactMatch/skos:prefLabel ?label }
  UNION
   { ?concept skos:exactMatch/skos:altLabel ?label }
  UNION
   { ?concept skos:exactMatch/skos:hiddenLabel ?label }
 }
 BIND (lcase(?label) AS ?hiddenLabel)
}

The BIND statement returns all synonyms in lower case, which in combination with "SELECT DISTINCT" filters out semi-duplicate labels with different capitalization. The query makes use of the elegant new SPARQL 1.1 feature of property paths (skos:exactMatch/skos:altLabel). It currently uses solely the skos:exactMatch mapping property, which is the only one declared as transitive, so the chains should be valid too. For the relationships, one thesaurus (STW) is used. As you may have spotted, the maximum number of returned concepts was increased in this query, because with multiple thesauri involved, more than one concept can exactly fit the query. (The reduction to STW descriptors follows only later on.)

In this setting, STW serves as a backbone, which is enriched by synonyms from other thesauri. We chose this design because the introduction of multiple, partly overlapping concepts and hierarchies from other thesauri (focused on subject areas such as social sciences or agriculture) might leave users profoundly confused. On the other hand, we would not expect much additional value for the field of economics, because it is covered by STW quite well.

Setting up the text index

To set up a text index with Fuseki, a service and a dataset have to be defined in the config.ttl file:

<#service_xyz> rdf:type fuseki:Service ;
  rdfs:label           "xyz TDB Service (R)" ;
  fuseki:name           "xyz" ;
  fuseki:serviceQuery       "query" ;
  fuseki:serviceQuery       "sparql" ;
  fuseki:serviceReadGraphStore  "data" ;
  fuseki:serviceReadGraphStore  "get" ;
  fuseki:dataset      :xyz ;
  .
:xyz rdf:type   text:TextDataset ;
  text:dataset <#xyzDb> ;
  text:index  <#xyzIndex> ;
  .
<#xyzDb> rdf:type   tdb:DatasetTDB ;
  tdb:location "/path/to/tdb/files" ;
  tdb:unionDefaultGraph true ; # Optional
  .
<#xyzIndex> a text:TextIndexLucene ;
  text:directory <file:/path/to/lucene/files> ;
  text:entityMap <#entMap> ;
  .
<#entMap> a text:EntityMap ;
  text:entityField   "uri" ;
  text:defaultField  "text" ; # Must be defined in the text:map
  text:map (
     [ text:field "text" ; text:predicate skos:prefLabel ]
     [ text:field "text" ; text:predicate skos:altLabel ]
     [ text:field "text" ; text:predicate skos:hiddenLabel ]
     ) .

Basically, this defines locations for the database (Jena TDB) and index files. The entMap statement contains the text indexing logic: the properties to be indexed are listed and mapped to an index field. In our case, we just use one default field ("text") for all properties.

Maintaining database and index

Fuseki supports SPARQL Update, so the simplest way to manage database and index is to just use the HTTP interface to load the data. This creates or updates the database and index files. For our use case, however, we preferred to define the datastore as read-only (as in the config given above), and to rely on command line calls like these:

 java -cp $FUSEKI_HOME/fuseki-server.jar tdb.tdbloader --tdb=config_file data_file
 java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer --desc=config_file

The documentation points out one caveat: index entries are added, but not deleted, when the RDF data is modified. This can be worked around, or, in case of larger changes, the index can just be rebuilt.

Just as a side note: As an alternative to building Lucene indexes as described here, Fuseki supports the use of externally defined and maintained Solr indexes. This should allow for all kinds of neat tricks, but is currently sparsely documented.

Practical use

The web service backend based on Fuseki and Jena Text as described above is available at zbw.eu/beta/econ-ws, and is supported by documentation and executable examples. The integration into an application can be done server-side, in the application's backend. Or it can also be accomplished client-side, by pulling together the synonyms in a search string (for Google, this could be "term a" OR "term b" OR ...), putting it into the search box, and executing the search, just through Javascript on the page.
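
A client-side integration along these lines might be sketched as follows (the service parameter name and the Accept header are assumptions for this example - please consult the econ-ws documentation for the actual interface; the result is expected in the standard SPARQL 1.1 JSON results format):

// Ask the /combined1 service for matching concepts and build an OR-joined search string.
async function synonymSearchString(userQuery) {
  var url = 'http://zbw.eu/beta/econ-ws/combined1?query=' + encodeURIComponent(userQuery);
  var response = await fetch(url, { headers: { Accept: 'application/sparql-results+json' } });
  var result = await response.json();

  // Collect the lowercased synonyms (?hiddenLabel bindings) from the result.
  var labels = {};
  result.results.bindings.forEach(function (binding) {
    if (binding.hiddenLabel) labels[binding.hiddenLabel.value] = true;
  });
  return Object.keys(labels).map(function (label) { return '"' + label + '"'; }).join(' OR ');
}

// synonymSearchString('telework').then(function (s) { /* put s into the search box */ });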

Have a look at ZBW's EconStor digital repository of papers in economics, where the service is in production use. Just try this URL. You will notice a short delay, after which search suggestions from the STW thesaurus are displayed. And when you click on one of these, a search is executed with all the synonyms filled into the search box. We don't yet have enough data for an analysis of response times, but we very seldom see more than half a second - which is fast enough for the current application -, and the vast majority of responses take less than 100 ms (on commonplace hardware).

For sure, the algorithms described here and their implementation as SPARQL queries may be subject to critical discussion or tuning efforts. I'd be happy to receive your feedback. A SPARQL endpoint with the STW dataset and the according text index is publicly available at http://zbw.eu/beta/sparql.

Footnote

1) The service has existed since 2009. Until recently it was based on the SPARQLite implementation by Alistair Miles, Graham Klyne and others. SPARQLite was Jena-based too, designed for scalability, quality of service and reliability (even in case of DoS attacks), and had the then recent LARQ (Lucene + ARQ) module integrated. It thus allowed an early implementation of the algorithms described here - albeit with much uglier query code than now with Jena Text. The development of SPARQLite was discontinued, however, in 2010.

24 April 2013
by nbt

ZBW Labs as Linked Open Data

As a laboratory for new, Linked Open Data based publishing technologies, we now develop the ZBW Labs web site as a Semantic Web Application. The pages are enriched with RDFa, making use of Dublin Core, DOAP (Description of a Project) and other vocabularies. The schema.org vocabulary, which is also applied through RDFa, should support search engine visibility.

With this new version, we aim to provide a playground for testing new possibilities in electronic publishing and in linking data on the web. At the same time, it facilitates editorial contributions from project members about recent developments and allows comments and other forms of participation by web users.

As it is based on Drupal 7, RDFa is built in (in the CMS core) and is easily configured at field level. Enhancements are made through the RDFx, Schema.org and SPARQL Views modules. A lot of other ready-made components in Drupal (most notably the Views and the new Entity Reference modules) make it easy to provide and interlink the data items on the site. The current version of the Zen theme enables HTML5 and the use of RDFa 1.1, and permits a responsive design for smartphones and tablets.