Pl4net.info

Library voices. Independent, daily.

24 October 2019
by Joachim Neubert

20th Century Press Archives: Data donation to Wikidata

ZBW is donating a large open dataset from the 20th Century Press Archives to Wikidata, in order to make it better accessible to various scientific disciplines such as contemporary, economic and business history, media and information science, to journa...

23 October 2018
by Joachim Neubert

ZBW’s contribution to „Coding da Vinci“: Dossiers about persons and companies from 20th Century Press Archives

On 27 and 28 October, the kick-off for the "Kultur-Hackathon" Coding da Vinci is held in Mainz, Germany, organized this time by GLAM institutions from the Rhein-Main area: "For five weeks, devoted fans of culture and hacking alike will prototype...

30 November 2017
by Joachim Neubert

Wikidata as authority linking hub: Connecting RePEc and GND researcher identifiers

In the EconBiz portal for publications in economics, we have data from different sources. In some of these sources, most notably ZBW's "ECONIS" bibliographical database, authors are disambiguated by identifiers of the Integrated Authority File (GND) - ...

2 March 2017
by Joachim Neubert

New version of multi-lingual JEL classification published in LOD

The Journal of Economic Literature Classification Scheme (JEL) was created and is maintained by the American Economic Association. The AEA provides this widely used resource freely for scholarly purposes. Thanks to André Davids (KU Leuven), who has translated the originally English-only labels of the classification to French, Spanish and German, we provide a multi-lingual version of JEL. Its latest version (as of 2017-01) is published in the formats RDFa and RDF download files. These formats and translations are provided "as is" and are not authorized by AEA. To make changes in JEL easier to trace, we have created lists of inserted and removed JEL classes in the context of the skos-history project.

17 January 2017
by Joachim Neubert

Economists in Wikidata: Opportunities of Authority Linking

Wikidata is a large database that connects all of the roughly 300 Wikipedia projects. Besides interlinking all Wikipedia pages in different languages about a specific item (e.g., a person), it also connects to more than 1,000 different sources of authority information.

The linking is achieved by an "authority control" class of Wikidata properties. The values of these properties are identifiers, which unambiguously identify the Wikidata item in external, web-accessible databases. Each property definition includes a URI pattern (called "formatter URL"). When the identifier value is inserted into the URI pattern, the resulting URI can be used to look up the authority entry. The resulting URI may point to a Linked Data resource, as is the case with the GND ID property. On the one hand, this provides a light-weight and robust mechanism to create links in the web of data. On the other hand, these links can be exploited by every application that is driven by one of the authorities to provide additional data: links to Wikipedia pages in multiple languages, images, life data, nationality and affiliations of the persons concerned, and much more.
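The lookup mechanism described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the formatter URL shown for the GND ID property is an assumption that should be checked against the property page, and the identifier value is a made-up placeholder.

```python
def expand_formatter_url(formatter_url: str, identifier: str) -> str:
    """Insert an identifier value into a formatter URL with a $1 placeholder."""
    if "$1" not in formatter_url:
        raise ValueError("formatter URL has no $1 placeholder")
    return formatter_url.replace("$1", identifier)


# Assumed formatter URL of the GND ID property (verify on the property page).
GND_FORMATTER_URL = "https://d-nb.info/gnd/$1"

if __name__ == "__main__":
    # "000000000" is a placeholder, not a real GND identifier.
    print(expand_formatter_url(GND_FORMATTER_URL, "000000000"))
```

The same function works for any authority property, since only the formatter URL and the identifier value change.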

Figure: Wikidata item for the Indian economist Bina Agarwal, visualized via the SQID browser (screenshot)

In 2014, a group of students under the guidance of Jakob Voß published a handbook on "Normdaten in Wikidata" (in German), describing the structures and the practical editing capabilities of the standard Wikidata user interface. The experiment described here focuses on persons from the subject domain of economics. It uses the authority identifiers of the roughly 450,000 economists referenced by their GND ID as creators, contributors or subjects of books, articles and working papers in ZBW's economics search portal EconBiz. These GND IDs were obtained from a prototype of the upcoming EconBiz Research Dataset (EBDS). For about 40,000 of these persons, or 8.7 %, a Wikidata item is connected via the GND ID. If we consider the frequent (more than 30 publications) and the very frequent (more than 150 publications) authors in EconBiz, the coverage increases significantly:

Economics-related persons in EconBiz

Number of publications    Total      In Wikidata    Percentage
> 0                       457,244    39,778          8.7 %
> 30                       18,008     3,232         17.9 %
> 150                       1,225       547         44.7 %

Datasets: EBDS as of 2016-11-18; Wikidata as of 2016-11-07 (query, result)
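As a quick check of the arithmetic, the percentages follow directly from the two count columns. The short sketch below recomputes them; the function and variable names are chosen for illustration only.

```python
# (publication threshold, persons in EconBiz, persons linked in Wikidata)
ROWS = [
    ("> 0", 457_244, 39_778),
    ("> 30", 18_008, 3_232),
    ("> 150", 1_225, 547),
]


def coverage_percent(total: int, linked: int) -> float:
    """Share of persons with a Wikidata link, in percent, one decimal place."""
    return round(100 * linked / total, 1)


if __name__ == "__main__":
    for threshold, total, linked in ROWS:
        print(f"{threshold:>6}: {coverage_percent(total, linked)} %")
```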

These are numbers "out of the box": ready-made opportunities to link out from existing metadata in EconBiz and to enrich user interfaces with biographical data from Wikidata/Wikipedia, without any additional effort to improve the coverage on either the EconBiz or the Wikidata side. However, we can safely assume that many of the EconBiz authors, particularly the high-frequency authors, and even more of the persons who are subjects of publications, are "notable" according to the Wikidata notability guidelines. Probably, their items exist and are just missing the corresponding GND property.

To check this assumption, we take a closer look at the Wikidata persons who have the occupation "economist" (most Wikidata properties accept other Wikidata items, instead of arbitrary strings, as values, which allows for exact queries and is indispensable in a multilingual environment). Of these approximately 20,000 persons, fewer than 30 % have a GND ID property! Even if we restrict that to the 4,800 "internationally recognized economists" (which we define here as having Wikipedia pages in three or more different languages), almost half of them lack a GND ID property. By comparison, more than 50 % of all and 80 % of the internationally recognized Wikidata economists are linked to VIAF (SPARQL Lab live query). Therefore, for a great many of the persons we have looked at here, we can take it for granted that the person exists in Wikidata as well as in the GND, and that the only reason for the lack of a GND ID is that nobody has added it to Wikidata yet.

As an aside: The information about the occupation of persons is to be taken as a very rough approximation. Some Wikidata persons were economists by education or at some point of their career, but are famous now for other reasons (examples include Vladimir Putin or the president of Liberia, Ellen Johnson Sirleaf). On the other hand, EconBiz authors known to Wikidata are often qualified not as economist, but as university teacher, politician, historian or sociologist. Nevertheless, their work was deemed relevant for the broad field of economics, and the conclusions drawn about the "economists" in Wikidata and GND will hold for them, too: There are lots of opportunities for linking already well-defined items.

What can we gain?

The screenshot above demonstrates that not only data about the person herself, her affiliations, awards received, and possibly many other details can be obtained: the "Identifiers" box on the bottom right also shows authority entries. Besides the GND ID, which served as an entry point for us, there are links to VIAF and other national libraries' authorities, but also to non-library identifier systems like ISNI and ORCID. In total, Wikidata comprises more than 14 million authority links, more than 5 million of these for persons.

When we take a closer look at the 40,000 EconBiz persons whom we can look up by their GND ID in Wikidata, an astonishing variety of authorities is addressed from there: 343 different authorities are linked from the subset, ranging from "almost complete" (VIAF, Library of Congress Name Authority File) to authorities that are quite exotic in the given context, e.g., Members of the Belgian Senate, chess players or Swedish Olympic Committee athletes. Some of these entries link to carefully crafted biographies, sometimes behind a paywall (Notable Names Database, Oxford Dictionary of National Biography, Munzinger Archiv, Sächsische Biographie, Dizionario Biografico degli Italiani), or to free text resources (Project Gutenberg authors). Links to the world of museums and archives are also provided, from the Getty Union List of Artist Names to specific links into the British Museum or the Musée d'Orsay collections.

Particular use can be made of properties that express the prominence of the persons concerned: Nobel Prize IDs, for example, definitely should be linked to corresponding GND IDs (and indeed, they are). But TED speakers or persons with an entry in the Munzinger Archive (a famous and long-established German biographical service) can also be assumed to have GND IDs. That opens a road to a very focused improvement of the data quality: A list of persons with those properties, restricted to the subject field (e.g., "occupation economist"), can easily be generated from Wikidata's SPARQL Query Service. In Wikidata, it is very easy to interactively add the missing ID entries discovered during such cross-checks. And if it turns out that a "very important" person from the field is missing from the GND altogether, that is an all-the-more valuable opportunity to improve the data quality at the source.

How can we start improving?

As a proof of concept, and as a practical starting point, we have developed a micro-application for adding missing authority property values. It consists of two SPARQL Lab scripts: missing_property creates a list of Wikidata persons who have a certain authority property (by default: TED speaker ID) and lack another one (by default: GND ID). For each entry in the list, a link to an application is created, which looks up the name in the corresponding authority file (by default: search_person, for a broad yet ranked full-text search of person names in GND). If we can identify the person in the GND list, we can copy the GND ID, return to the first list, click on the link to the Wikidata item of the person and add the property value manually through Wikidata's standard edit interface. (Wikidata is open and welcomes such contributions!) The edit takes effect within a few seconds: when we reload the missing_property list, the improved item should not show up any more.
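The kind of query behind such a list can be sketched as follows. This is not the actual missing_property script, just a minimal Python helper that assembles a comparable SPARQL query. The function name is hypothetical, and the Wikidata identifiers used in the example (P2611 for TED speaker ID, P227 for GND ID, P106 for occupation, Q188094 for economist) are given from memory and should be verified before use.

```python
import re
from typing import Optional

_PID = re.compile(r"^P\d+$")
_QID = re.compile(r"^Q\d+$")


def missing_property_query(has_prop: str, lacks_prop: str,
                           occupation: Optional[str] = None,
                           limit: int = 100) -> str:
    """Build a SPARQL query for items that have one property but lack another."""
    for pid in (has_prop, lacks_prop):
        if not _PID.match(pid):
            raise ValueError(f"not a property id: {pid!r}")
    if occupation is not None and not _QID.match(occupation):
        raise ValueError(f"not an item id: {occupation!r}")
    # Optional restriction to a subject field via the occupation property (assumed P106).
    occupation_clause = f"  ?person wdt:P106 wd:{occupation} .\n" if occupation else ""
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?person ?id WHERE {\n"
        f"  ?person wdt:{has_prop} ?id .\n"
        f"{occupation_clause}"
        f"  FILTER NOT EXISTS {{ ?person wdt:{lacks_prop} [] }}\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Assumed ids: TED speaker ID (P2611), GND ID (P227), economist (Q188094).
    print(missing_property_query("P2611", "P227", occupation="Q188094"))
```

The resulting string can be pasted into any SPARQL endpoint that holds Wikidata; validating the ids up front keeps the helper from producing malformed queries.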

Instead of identifying the most prominent economics-related persons in Wikidata, the other way works too: While most of the GND-identified persons are related to only one or two works, as the corresponding statistics show, a few are related to a disproportionate number of publications. Of the 1,200 persons related to more than 150 publications, fewer than 700 are missing links to Wikidata by their GND ID. By adding this property (for the vast majority of these persons, a Wikidata item should already exist), we could enrich, at a rough estimate, more than 100,000 person links in EconBiz publications. Another micro-application demonstrates how the work could be organized: The list of EconBiz persons by descending publication count provides "SEARCH in Wikidata" links (functional on a custom endpoint). Each link triggers a query which looks up all name variants in GND and executes a search for these names in a full-text indexed Wikidata set, bringing up a ranked list of suggestions (example with the GND ID of John H. Dunning). Again, the GND ID can be added, manually but straightforwardly, to an identified Wikidata item.

While we cannot expect such manual efforts to significantly reduce the quantitative gap between the 450,000 persons in EconBiz and the 40,000 of them linked to Wikidata, we surely can improve the coverage step by step for the most prominent persons. This empowers applications to show biographical background links to Wikipedia where our users are most likely to expect them. Other tools for creating authority links and more automated approaches will be covered in further blog posts. And the great thing about Wikidata is: All efforts add up. While we are making modest improvements in our field of interest, many others do the same, so Wikidata already features an impressive overall amount of authority links.

PS. All queries used in this analysis are published at GitHub. The public Wikidata endpoint cannot be used for research involving large datasets due to its limitations (in particular the 30 second timeout, the preclusion of the "service" clause for federated queries, and the lack of full-text search). Therefore, we’ve loaded the Wikidata dataset (along with others) into custom Apache Fuseki endpoints on a powerful machine. Even there, a "power query" like the one on the number of all authority links in Wikidata takes about 7 minutes. Therefore, we publish the corresponding result files in the GitHub repository alongside the queries.

30 March 2016
by Joachim Neubert

Turning the GND subject headings into a SKOS thesaurus: an experiment

The "Integrated Authority File" (Gemeinsame Normdatei, GND) of the German National Library (DNB), the library networks of the German-speaking countries and many other institutions, is a widely recognized and used authority resource. The authority file comprises persons, institutions, locations and other entity types, in particular subject headings. With more than 134,000 concepts, organized in almost 500 subject categories, the subjects part - the former "Schlagwortnormdatei" (SWD) - is huge. That would make it a nice resource to stress-test SKOS tools, if it were available in SKOS. A seminar at the DNB on requirements for thesauri on the Semantic Web (slides, in German) provided another reason for the experiment described below.

The GND subject headings are defined using a well-thought-out set of custom classes and properties, the GND Ontology (gndo). The GND links to other vocabularies with SKOS mapping properties, which technically implies that some, but not all, of its subject headings are skos:Concepts. Many of the gndo properties already mirror the SKOS or Isothes properties. For the experiment, the relevant subset of the whole GND was selected by the gndo:SubjectHeadingSensoStricto class. One single SPARQL construct query does the selection and conversion (execute on an example concept). For skos:pref/altLabel, derived from gndo:preferred/variantNameForTheSubjectHeading, German language tags are added. The fine-grained hierarchical relations of GNDO (generic, instantial, partitive) are dumbed down to skos:broader/narrower. All original properties of a concept are included in the output of the query.
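The conversion itself is done by the SPARQL construct query mentioned above. Purely as an illustration of the mapping logic, the sketch below applies the same idea to triples held as plain Python tuples. The three gndo:broaderTerm* property names are assumptions about how the generic, instantial and partitive relations are named, and the function name and the (text, language) representation of literals are choices made for this example.

```python
# Mapping from GND Ontology properties to SKOS properties (property names
# for the hierarchical relations are assumed, not taken from the post).
PROPERTY_MAP = {
    "gndo:preferredNameForTheSubjectHeading": "skos:prefLabel",
    "gndo:variantNameForTheSubjectHeading": "skos:altLabel",
    "gndo:broaderTermGeneric": "skos:broader",
    "gndo:broaderTermInstantial": "skos:broader",
    "gndo:broaderTermPartitive": "skos:broader",
}
LABEL_PROPERTIES = {"skos:prefLabel", "skos:altLabel"}


def to_skos(triples):
    """Return the original triples plus their SKOS counterparts."""
    result = set()
    for subject, predicate, obj in triples:
        result.add((subject, predicate, obj))  # keep all original properties
        target = PROPERTY_MAP.get(predicate)
        if target is None:
            continue
        if target in LABEL_PROPERTIES:
            # Labels get a German language tag, modelled as (text, lang).
            result.add((subject, target, (obj, "de")))
        else:
            # Fine-grained hierarchy collapses to skos:broader, with the
            # inverse skos:narrower added for convenience.
            result.add((subject, "skos:broader", obj))
            result.add((obj, "skos:narrower", subject))
    return result
```

Keeping the original triples in the output mirrors the statement that all original properties of a concept are retained.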

Some additional work was required to integrate the GND Subject Categories (gndsc), a skos:ConceptScheme of about 484 skos:Concepts which logically build a hierarchy. (In fact, the currently published file puts all subject categories on one level.) The subject headings invariably link to one or more subject categories, but unfortunately the data has to be downloaded and added separately (with a bit of extension). The linking property from the subject headings, gndo:gndSubjectCategory, was already dumbed down to skos:broader in the former query. Finally we add an explicit skos:notation and some bits of metadata about the concept scheme.

This earns us a large skos:ConceptScheme, which we called swdskos and which is currently available in a SPARQL endpoint. Now we can proceed and try to prove that generic SKOS tools for display, verification and version history comparisons work at that scale.

Skosmos for thesaurus display

Skosmos is an open source web application for browsing controlled vocabularies, developed by the National Library of Finland. It requires a triple store with the vocabulary loaded. (The Skosmos wiki provides detailed installation and configuration help for this.) The configuration for the GND/SWD vocabulary takes only a few lines, following the provided template. The result can be found at http://zbw.eu/beta/skosmos/swdskos.

With marginal effort, we gained a structured concept display, a very nice browsing and hierarchical view interface, and a powerful search, all out of the box. The initial alphabetical display takes a few seconds, due to the large number of terms for most of the index letters. In a production setting, that could be improved by adding a Squid or Varnish cache. The navigation from concept to concept takes far less than one second, so the tool seems well suited for practical use even with larger-than-usual vocabularies. For GND, it offers an alternative to the existing access over the DNB portal, more focused on browsing contexts and with a more precise search.

Quality assurance with qSKOS

Large knowledge organization systems are prone to human mistakes, which creep in even with strict rules and careful editing. Some maintenance systems try to catch some of these errors, but let others slip through. So one of the really great things about SKOS as a general format for knowledge organization systems is that generic tools can be developed, which catch more and more classes of errors. qSKOS has identified a number of widespread possible quality issues, on which it provides detailed analytic information. Of course, it often depends on the vocabulary which types of issues are considered errors. For example, it is expected that most GND subject headings lack a definition, so a list of 100,000+ such concepts is not helpful, whereas the list of the (in total 3) cyclic hierarchical relations is. The parametrization we use for STW seems to provide useful results here too:

java -jar qSKOS-cmd.jar analyze -np -d -c ol,chr,usr,rc,mc,ipl,dlv,urc swdskos.ttl.gz -o qskos_extended.log

The tool has already been tested with very large vocabularies (e.g., LCSH). On the swdskos dataset, it runs for 8 minutes, but it provides results which could not be obtained otherwise. For example, the list of overlapping labels (report) reveals some strange clashes (example). Standard SKOS tools thus could complement the quality assurance procedures which are already in place.
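To make one of these checks concrete, here is a small Python sketch of what detecting cyclic hierarchical relations means. It is not how qSKOS is implemented; the function name and the dictionary representation of skos:broader links are assumptions for illustration, and the simple per-concept traversal is not tuned for a vocabulary of the GND's size.

```python
def find_cyclic_concepts(broader):
    """Return all concepts that can reach themselves via skos:broader links.

    `broader` maps each concept to the set of its direct broader concepts.
    """
    def reachable(start):
        # Iterative depth-first traversal over broader links.
        seen = set()
        stack = list(broader.get(start, ()))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(broader.get(node, ()))
        return seen

    return {concept for concept in broader if concept in reachable(concept)}


if __name__ == "__main__":
    example = {"a": {"b"}, "b": {"c"}, "c": {"a"}, "d": {"a"}}
    print(sorted(find_cyclic_concepts(example)))
```

In the example, "d" points into the cycle but is not part of it, so it is not reported.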

Version comparisons with skos-history

The skos-history method makes it possible to track changes in knowledge organization systems. It was developed in the context of the STW overhaul. With swdskos, it proves to be applicable to much larger KOS. The loading of the three versions and the computation of all version deltas take almost half an hour (on a moderately sized virtual machine). That way, for example, we can see the 638 concepts which were deleted between the Oct 2015 and the Feb 2016 dump of GND. Some checked concept URIs return concepts with different URIs, but the same preferred label, so we can assume that duplicates have been removed here. The added_concepts query can be extended to make use of the often underestimated GND subject categories for organizing the query results, for example in a list filtered by the notation for computer science and data processing.

These queries only scratch the surface of what could be done by comparing multiple versions of the GND subject headings. Custom queries could try to reveal maintenance patterns, or, for example, trace the uptake of the finer-grained hierarchical properties (generic/instantial/partitive) used in GND.

Summing up

Generic SKOS tools seem to be useful to complement custom tools and processes for specialized knowledge organization systems. The tools considered here have shown no scalability issues with large vocabularies. The publication of an additional experimental SKOS version of the subject headings part of the GND linked data dump could perhaps instigate further research on the development of vocabulary.

The code and the data of the experiment are available here.

27 July 2015
by Joachim Neubert

skos-history: New method for change tracking applied to STW Thesaurus for Economics

“What’s new?” and “What has changed?” are questions users of Knowledge Organization Systems (KOS), such as thesauri or classifications, ask when a new version is published. Much more so, when a thesaurus existing since the 1990s has been completely revised, subject area for subject area. After four intermediately published versions in as many consecutive years, ZBW's STW Thesaurus for Economics has been re-launched recently in version 9.0. In total, 777 descriptors have been added; 1,052 (of about 6,000) have been deprecated and in their vast majority merged into others. More subtle changes include modified preferred labels, or merges and splits of existing concepts.

Since STW was first published on the web in 2009, we have gone to great lengths to make changes traceable: no concept and no web page has been deleted, and everything from prior versions is still available. Following a presentation at DC-2013 in Lisbon, I started the skos-history project, which aims to exploit published SKOS files of different versions for change tracking. A first beta implementation of Linked-Data-based change reports went live with STW 8.14, making use of SPARQL "live queries" (as described in a prior post). With the publication of STW 9.0, full reports of the changes are available. How do they work?


The basic idea is to exploit the power of SPARQL on named graphs of different versions of the thesaurus. After loading these versions into a "version store", we can compute deltas (version differences) and save them as named graphs, too. A combination of the dataset versioning ontology (dsv:) by Johan De Smedt, the skos-history ontology (sh:), SPARQL service description (sd:) and VoID (void:) provides the necessary plumbing in a separate version history graph:

 skos-history example graphs
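The delta computation itself boils down to set differences between the triples of two versions. The following Python sketch illustrates the principle only; it is not the skos-history implementation, which works with SPARQL on named graphs. The function name, the `ex:` URIs and the labels are hypothetical, and blank nodes, which need special handling in real RDF data, are left out.

```python
from typing import Set, Tuple

# A triple is modelled as a plain (subject, predicate, object) tuple of strings.
Triple = Tuple[str, str, str]


def compute_delta(old: Set[Triple], new: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (insertions, deletions) between two versions of a vocabulary."""
    insertions = new - old   # triples present only in the new version
    deletions = old - new    # triples present only in the old version
    return insertions, deletions


# Hypothetical mini-versions of a thesaurus.
v_old = {
    ("ex:c1", "skos:prefLabel", "Trade"),
    ("ex:c2", "skos:prefLabel", "Barter"),
}
v_new = {
    ("ex:c1", "skos:prefLabel", "International trade"),
    ("ex:c3", "skos:prefLabel", "E-commerce"),
}
insertions, deletions = compute_delta(v_old, v_new)
```

In the version store, each of the two result sets corresponds to a named graph of its own (insertions and deletions), which is what allows later queries to combine them freely with the full versions.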

With that in place, we can query the version store, e.g. for the concepts added between two versions, like this:

# Identify concepts inserted with a certain version
#
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
PREFIX sh: <http://purl.org/skos-history/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
#
SELECT DISTINCT ?concept ?prefLabel
WHERE {
 # query the version history graph to get a delta and via that the relevant graphs
 # (matched by variable here; substitute the graph's IRI to restrict the lookup)
 GRAPH ?versionHistoryGraph {
  ?delta a sh:SchemeDelta ;
   sh:deltaFrom/dc:identifier "8.14" ;
   sh:deltaTo/dc:identifier "9.0" ;
   sh:deltaFrom/sh:usingNamedGraph/sd:name ?oldVersionGraph ;
   dct:hasPart ?insertions .
  ?insertions a sh:SchemeDeltaInsertions ;
   sh:usingNamedGraph/sd:name ?insertionsGraph .
 }
 # for each inserted concept, a newly inserted prefLabel must exist ...
 GRAPH ?insertionsGraph {
  ?concept skos:prefLabel ?prefLabel
 }
 # ... and the concept must not exist in the old version
 FILTER NOT EXISTS {
  GRAPH ?oldVersionGraph {
   ?concept ?p []
  }
 }
}

The resulting report, cached for better performance and availability, can be found in the change reports section of the STW site, together with reports on deprecation/replacement of concepts, changed preferred labels, hierarchy changes, and merges and splits of concepts (descriptors as well as the higher-level subject categories of STW). The queries used to create the reports are available on GitHub and linked from the report pages.
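To make the logic behind two of these reports more tangible, here is a rough in-memory analogue in Python. It is a sketch under simplifying assumptions, not the code behind the published reports: triples are plain tuples, every concept has exactly one preferred label (a single language), and all function names, URIs and labels are hypothetical.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]
PREF_LABEL = "skos:prefLabel"


def added_concepts(insertions: Set[Triple], old_version: Set[Triple]) -> Dict[str, str]:
    """Concepts with a newly inserted prefLabel that do not occur in the old version."""
    old_subjects = {s for s, _, _ in old_version}
    return {s: o for s, p, o in insertions
            if p == PREF_LABEL and s not in old_subjects}


def changed_pref_labels(insertions: Set[Triple],
                        deletions: Set[Triple]) -> Dict[str, Tuple[str, str]]:
    """Concepts whose prefLabel was deleted and replaced, as (old label, new label)."""
    old_labels = {s: o for s, p, o in deletions if p == PREF_LABEL}
    new_labels = {s: o for s, p, o in insertions if p == PREF_LABEL}
    return {s: (old_labels[s], new_labels[s])
            for s in old_labels.keys() & new_labels.keys()}


# Hypothetical old version and the deltas leading to the new one.
old_version = {
    ("ex:c1", PREF_LABEL, "Trade"),
    ("ex:c2", PREF_LABEL, "Barter"),
}
insertions = {
    ("ex:c1", PREF_LABEL, "International trade"),
    ("ex:c3", PREF_LABEL, "E-commerce"),
}
deletions = set(old_version)

added = added_concepts(insertions, old_version)
changed = changed_pref_labels(insertions, deletions)
```

Here `ex:c3` is reported as added and `ex:c1` as relabelled, while `ex:c2`, whose label was deleted without a replacement, appears in neither result.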

The methodology allows for aggregating changes over multiple versions and levels of the hierarchy of a concept scheme. That enabled us to gather information for the complete overhaul of STW, and to visualize it in change graphics:

STW relaunch: Business economics
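The aggregation over versions and hierarchy levels can be pictured as rolling up per-version change lists to the top-level categories. The Python sketch below assumes, for simplicity, that every concept has at most one broader concept, which need not hold for a real thesaurus. The names, version labels and URIs are hypothetical, and the sketch does not reproduce how the change graphics were actually produced.

```python
from collections import Counter
from typing import Dict, Iterable


def top_ancestor(concept: str, broader: Dict[str, str]) -> str:
    """Follow broader links upwards until a concept without a broader one is reached."""
    seen = set()
    while concept in broader and concept not in seen:
        seen.add(concept)            # guard against accidental cycles
        concept = broader[concept]
    return concept


def changes_per_top_category(changed: Iterable[str], broader: Dict[str, str]) -> Counter:
    """Count changed concepts per top-level category."""
    return Counter(top_ancestor(c, broader) for c in changed)


# Hypothetical hierarchy: descriptors -> subcategory -> top-level category.
broader = {
    "ex:d1": "ex:sub1",
    "ex:d2": "ex:sub1",
    "ex:sub1": "ex:B",
    "ex:d3": "ex:V",
}
# Hypothetical lists of added concepts per version.
added_by_version = {"8.14": ["ex:d1"], "9.0": ["ex:d2", "ex:d3"]}

all_added = [c for concepts in added_by_version.values() for c in concepts]
counts = changes_per_top_category(all_added, broader)
```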

The method applied here to STW is in no way specific to it. It does not rely on transaction logging of the internal thesaurus management system, nor on any other out-of-band knowledge, but solely on the published SKOS files. Thus, it can be applied to other knowledge organization systems, by their publishers as well as by interested users of the KOS. Experiments with TheSoz, Agrovoc and the Finnish YSO have already been conducted; example endpoints with multiple versions of these vocabularies (and of STW, of course) are provided by ZBW Labs.

At the Finnish National Library, as well as at the FAO, efforts are under way to explore the applicability of skos-history to the thesauri and maintenance workflows there. In the context of STW, the change reports are mostly optimized for human consumption. We hope to learn more about how people use them in automatic or semi-automatic processes - for example, to update changed preferred labels in systems working with prior versions of STW, to review indexed titles attached to split-up concepts, or to transfer changes to derived or mapped vocabularies. If you want to experiment, please fork on GitHub. Contributions in the issue queue as well as pull requests are highly welcome.

More detailed information can be found in a paper (Leveraging SKOS to trace the overhaul of the STW Thesaurus for Economics), which will be presented at DC-2015 in São Paulo.

 

 
