Pl4net.info

Library voices. Independent, daily.

29 January 2021
by Joachim Neubert

Data donation to Wikidata, part 2: country/subject dossiers of the 20th Century Press Archives

The world's largest public newspaper clippings archive comprises lots of material of great interest, particularly for authors and readers in the Wikiverse. ZBW has digitized the material from the first half of the last century and has put all available metadata under a CC0 license. Moreover, we are donating that data to Wikidata by adding or enhancing items and providing easy access to the dossiers (called "folders") and clippings from there.

Challenges of modelling a complex faceted classification in Wikidata

That had already been done for the persons' archive in 2019 - see our prior blog post. For persons, we could simply link from existing (or a few newly created) person items to the biographical folders of the archive. The countries/subjects archives posed a different challenge: the folders there were organized by country (or continent, city in a few cases, or other geopolitical category), and within each country by an extended subject category system (available also as SKOS). To put it differently: each folder was defined by a geographical and a subject facet - a method widely used in general-purpose press archives, because it allowed a comprehensible and, supported by a signature system, unambiguous sequential shelf order, indispensable for quick access to the printed material.

Folders specifically about one significant topic (like the Treaty of Sèvres) are rare in the press archives, whereas country/subject combinations are rare among Wikidata items - so direct linking between existing items and PM20 folders was hardly achievable. The folders themselves had to be represented as Wikidata items, just like other sources used there. Here, however, we were dealing not with works or scientific articles, but with thematic mini-collections of press clippings, often not notable in themselves and normally without further formal bibliographic data. So a class PM20 country/subject folder was created (as a subclass of dossier, a collection of documents). Creating items for each folder - and linking them via PM20 folder ID (P4293) to the actual press archive folders - was, however, only part of the solution.

In order to represent the faceted structure of the archive, we needed anchor points for both facets. That was easy for the geographical categories: the vast majority of them already existed as items in Wikidata; a few historical ones, such as Russian peripheral countries, had to be created. For the subject categories, the situation was quite different. Categories such as The country and its people, politics and economy, general or Postal services, telegraphy and telephony were constructed as baskets for collecting articles on certain broader topics. They have no equivalent in Wikidata, which aims to describe real-world entities or clear-cut concepts. We decided therefore to represent the categories of the subject category system by their own items of type PM20 subject category. Each of the about 1,400 categories is connected to its broader category via a "part of" (P361) property, thus forming a five-level hierarchy.
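The resulting hierarchy can be sketched as follows - a toy illustration with invented QIDs and category names, showing how following the "part of" (P361) chain leads from any subject category item up to a top-level category:

```python
# Toy sketch of the P361 hierarchy of PM20 subject category items.
# All QIDs and labels below are made up for illustration.

PART_OF = {                        # child QID -> parent QID (P361)
    "Q-postal-services": "Q-transport",
    "Q-transport": "Q-economy",
    "Q-economy": "Q-subject-root",
}

def path_to_top(qid):
    """Follow P361 links upward and return the full chain of categories."""
    chain = [qid]
    while qid in PART_OF:
        qid = PART_OF[qid]
        chain.append(qid)
    return chain

print(path_to_top("Q-postal-services"))
# ['Q-postal-services', 'Q-transport', 'Q-economy', 'Q-subject-root']
```

In the actual data, the same traversal is expressed as a SPARQL property path over the real category items.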

More implementation subtleties

For both facets, corresponding Wikidata properties were created: "PM20 geo code" (P8483) and "PM20 subject code" (P8484). As external identifiers, they link directly to lists of subjects (e.g., for Japan) or geographical entities (e.g., for The country ..., general). For all countries where the press archives material has been processed - this includes the tedious task of clarifying the intellectual property rights status of each article - the Wikidata item for the country now includes a link to a list of all press archives dossiers about this country, covering the first half of the 20th century.

PM20 country categories

The folders represented in Wikidata (e.g., Japan : The country ..., general) use "facet of" (P1269) and "main subject" (P921) properties to connect to the items for the country and subject categories. Thus each of the 9,200 accessible folders of the PM20 country/subject archive can be reached via Wikidata. Since the structural metadata of PM20 is available too, it can be queried in its various dimensions - see for example the list of top-level subject categories with the number of folders and documents, or a list of folders per country, ordered by signature (with subtleties covered by a "series ordinal" (P1545) qualifier). The interactive map of subject folders shown above is also created by a SPARQL query and gives a first impression of the geographical areas covered in depth - or as yet only sparsely - in the online archive.
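The two-facet modelling can be illustrated in miniature - all QIDs below are invented; the real selections run as SPARQL queries on the Wikidata Query Service:

```python
# Toy illustration: each folder item points to its country via
# "facet of" (P1269) and to its subject category via "main subject"
# (P921), so folders can be selected along either dimension.

folders = [
    {"qid": "Q-f1", "P1269": "Q-japan",  "P921": "Q-country-general"},
    {"qid": "Q-f2", "P1269": "Q-japan",  "P921": "Q-postal-services"},
    {"qid": "Q-f3", "P1269": "Q-brazil", "P921": "Q-country-general"},
]

def folders_for(country=None, subject=None):
    """Select folder QIDs by country facet, subject facet, or both."""
    return [
        f["qid"] for f in folders
        if (country is None or f["P1269"] == country)
        and (subject is None or f["P921"] == subject)
    ]

print(folders_for(country="Q-japan"))            # ['Q-f1', 'Q-f2']
print(folders_for(subject="Q-country-general"))  # ['Q-f1', 'Q-f3']
```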

Core areas: worldwide economy, worldwide colonialism

The online data reveals core areas of attention during 40 years of press clippings collection until 1949. Economy, of course, was the focus of the former HWWA (Hamburg Archive for the International Economy) - in Germany, particularly Hamburg, as well as in every other country. More than half of all subject categories are part of the "n Economy" section of the category system and, in 4,500 folders, give very detailed access to the field. About 100,000 of the almost 270,000 online documents of the archive are part of this section, followed by history and general politics, foreign policy, and public finance, down to more peripheral topics like settling and migration, minorities, justice or literature. Owing to the history of the institution (which was founded as "Zentralstelle des Hamburgischen Kolonialinstituts", the central office of the Hamburg colonial institute), colonial efforts all over the world were monitored closely. We published with priority the material about the former German colonies, listed in the Archivführer Deutsche Kolonialgeschichte (Archive guide to the German Colonial Past, also interconnected with Wikidata). Originally collected to support the aggressive and inhuman policy of the German Empire, it is now available as research material for critical analysis in the emerging field of colonial and postcolonial studies.

Enabling future community efforts

While all material about the German colonies (and some about the Italian ones) is online and now accessible via Wikidata, this is not true for the former British/French/Dutch/Belgian colonies. While Japan or Argentina are accessible completely, China, India and the US are missing, as well as most of the European countries. And while 800+ folders about Hamburg cover its contemporary history quite well, the vast majority of the material about Germany as a whole is only accessible "on premises" within ZBW's locations. It is, however, available as digital images and can be accessed through finding aids (in German), which in the reading rooms link directly to a document viewer.

The metadata for this material is now open data and can be changed and enhanced in Wikidata. A very selective example of how that could work is a topic in German-Danish history: the 1920 Schleswig plebiscites. The PM20 folder about these events was not part of the published material, but attracted some interest with last year's centenary. The PM20 metadata on Wikidata made it possible to create a corresponding folder completely in Wikidata, Nordslesvig : Historical events, with a (provisional) link to a stretch of images on a digitized film. While the checking and activation of these images for the public was a one-time effort in the context of an open science event, the creation of a new PM20 folder on Wikidata may demonstrate how open metadata can be used by a dedicated community of knowledge to enable access to not-yet-open knowledge.

Current intellectual property law in the EU forbids open access to all digitized clippings from newspapers published in 1960 until 2031, and to all those where the death date of a named author is not known until after 2100. Of course, we hope for a change in that obstructive legislation in a not-so-far future.
We are confident that the metadata about the material, now in Wikidata, will help bridge the gap until it finally becomes possible to use all digitized press archives contents as open scientific and educational resources, within and outside of the Wikimedia projects.

More information at WikiProject 20th Century Press Archives, which also links to the code for creating this data donation.

7 December 2020
by Joachim Neubert

Building the SWIB20 participants map

 SWIB20 participant map

Here we describe the process of building the interactive SWIB20 participants map, created by a query to Wikidata. The map was intended to help participants of SWIB20 make contacts in the virtual conference space. However, in compliance with the GDPR, we wanted to avoid publishing personal details. So we chose to publish a map of the institutions to which the participants are affiliated. (Obvious downside: the 9 unaffiliated participants could not be represented on the map.)

We suppose that the method can be applied to other conferences and other use cases - e.g., the downloaders of scientific software or the institutions subscribed to an academic journal. Therefore, we describe the process in some detail.

  1. We started with a list of institution names (with country code and city, but without person IDs), extracted and transformed from our ConfTool registration system and saved in CSV format. Country names were normalized; cities were not (and were only used for context information).

  2. We created an OpenRefine project and reconciled the institution name column with Wikidata items of type Q43229 (organization, and all its subtypes). We included the country column (-> P17, country) as a relevant other detail, and let OpenRefine “Auto-match candidates with high confidence”. Of our original set of 335 country/institution entries, 193 were automatically matched via the Wikidata reconciliation service. At the end of the conference, 400 institutions were identified and put on the map (data set).

  3. We went through all unmatched entries and either
    a) selected one of the suggested items, or
    b) looked up and tweaked the name string in Wikidata, or in Google, until we found a matching Wikipedia page, opened the linked Wikidata object from there, and inserted the QID in OpenRefine, or
    c) created a new Wikidata item (if the institution seemed notable), or
    d) attached “not yet determined” (Q59496158) where no Wikidata item existed (yet), or
    e) attached “undefined value” (Q7883029) where no institution had been given

  4. The results were exported from OpenRefine into a .tsv file (settings).
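For step 2, OpenRefine sends batches of queries to the Wikidata reconciliation service. The sketch below builds such a batch following the Reconciliation Service API, with the type restriction (Q43229) and the country column as a P17 property hint; the exact payload details are assumptions, and no request is actually sent here:

```python
import json

# Build a reconciliation query batch: one query per institution row,
# restricted to organizations (Q43229), with the country passed along
# as a P17 property hint. Details of the payload are assumptions.

def build_reconciliation_queries(rows):
    return {
        f"q{i}": {
            "query": row["institution"],
            "type": "Q43229",  # organization, including subtypes
            "properties": [{"pid": "P17", "v": row["country"]}],
        }
        for i, row in enumerate(rows)
    }

rows = [{"institution": "ZBW", "country": "Germany"}]
# The service expects the batch as a JSON string in a "queries" form field.
form_data = {"queries": json.dumps(build_reconciliation_queries(rows))}
print(form_data["queries"])
```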

  5. Again via a script, we loaded the ConfTool participants data, built a lookup table from all available OpenRefine results (country/name string -> WD item QID), aggregated participant counts per QID, and loaded that data into a custom SPARQL endpoint, which is accessible from the Wikidata Query Service. As in step 1, for all (new) institution name strings which were not yet mapped to Wikidata, a .csv file was produced. (An additional remark: if no approved custom SPARQL endpoint is available, it is feasible to generate a static query with all data in its “values” clause.)

    SWIB20 map data flow
  6. During the preparation of the conference, more and more participants registered, which required multiple loops: use the .csv file of step 5 and re-iterate, starting at step 2. (Since I found no straightforward way to update an existing OpenRefine project with extended data, I created a new project with new input and output files for every iteration.)

  7. Finally, to display the map, we could run a federated query on WDQS. It fetches the institution items from the custom endpoint and enriches them from Wikidata with the name, logo and image of the institution (if present), as well as with geographic coordinates, obtained directly or indirectly as follows:
    a) the item has a “coordinate location” (P625) itself, or
    b) the item has a “headquarters location” item with coordinates (P159/P625), or
    c) the item has a “located in the administrative territorial entity” item with coordinates (P131/P625), or
    d) the item has a “country” item with coordinates (P17/P625)
    Applying this method, only one institution item could not be located on the map.
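The coordinate fallback cascade of step 7 can be rendered as plain Python over toy item records (all QIDs invented); in the real map, the same logic is expressed as OPTIONAL property paths in the federated SPARQL query:

```python
# Coordinate fallback: try the item's own P625, then follow P159,
# P131, and P17 in turn and use the linked item's P625.

def coordinates(qid, items):
    """Return (lat, lon) for an item, or None if it cannot be located."""
    item = items[qid]
    if "P625" in item:                     # a) own coordinate location
        return item["P625"]
    for prop in ("P159", "P131", "P17"):   # b)-d) follow link, use its P625
        target = items.get(item.get(prop))
        if target and "P625" in target:
            return target["P625"]
    return None

items = {
    "Q-institute": {"P159": "Q-hamburg"},   # headquarters link only
    "Q-hamburg": {"P625": (53.55, 10.0)},
}
print(coordinates("Q-institute", items))  # (53.55, 10.0)
```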

SWIB20 participant map - detail

Data improvements

The way to improve the map was to improve the data about the items in Wikidata - which also helps all future Wikidata users.

New items

For a few institutions, new items were created.

For another 14 institutions, mostly private companies, no items were created due to notability concerns. Everything else already had an item in Wikidata!

Improvement of existing items

In order to improve the display on the map, we enhanced selected items in Wikidata in various ways:

  • Add English label
  • Add type (instance of)
  • Add headquarters location
  • Add image and/or logo

And we hope that participants of the conference also took the opportunity to make their institution “look better”, for example by adding an image of it to the Wikidata knowledge base.

Putting Wikidata into use for a completely custom purpose thus created incentives for improving “the sum of all human knowledge” step by tiny step.


24 October 2019
by Joachim Neubert

20th Century Press Archives: Data donation to Wikidata

ZBW is donating a large open dataset from the 20th Century Press Archives to Wikidata, in order to make it better accessible to various scientific disciplines such as contemporary, economic and business history, media and information science, to journa...

23 October 2018
by Joachim Neubert

ZBW’s contribution to „Coding da Vinci“: Dossiers about persons and companies from 20th Century Press Archives

On 27 and 28 October, the kick-off for the "Kultur-Hackathon" Coding da Vinci is held in Mainz, Germany, organized this time by GLAM institutions from the Rhein-Main area: "For five weeks, devoted fans of culture and hacking alike will prototype...

30 November 2017
by Joachim Neubert

Wikidata as authority linking hub: Connecting RePEc and GND researcher identifiers

In the EconBiz portal for publications in economics, we have data from different sources. In some of these sources, most notably ZBW's "ECONIS" bibliographical database, authors are disambiguated by identifiers of the Integrated Authority File (GND) - ...

2 March 2017
by Joachim Neubert

New version of multi-lingual JEL classification published in LOD

The Journal of Economic Literature Classification Scheme (JEL) was created and is maintained by the American Economic Association. The AEA provides this widely used resource freely for scholarly purposes. Thanks to André Davids (KU Leuven), who has translated the originally English-only labels of the classification into French, Spanish and German, we provide a multi-lingual version of JEL. Its latest version (as of 2017-01) is published as RDFa and as RDF download files. These formats and translations are provided "as is" and are not authorized by the AEA. In order to make changes in JEL more easily traceable, we have created lists of inserted and removed JEL classes in the context of the skos-history project.

17 January 2017
by Joachim Neubert

Economists in Wikidata: Opportunities of Authority Linking

Wikidata is a large database which connects all of the roughly 300 Wikipedia projects. Besides interlinking all Wikipedia pages in different languages about a specific item - e.g., a person - it also connects to more than 1,000 different sources of authority information.

The linking is achieved by an "authority control" class of Wikidata properties. The values of these properties are identifiers which unambiguously identify the Wikidata item in external, web-accessible databases. Each property definition includes a URI pattern (called "formatter URL"). When the identifier value is inserted into the URI pattern, the resulting URI can be used to look up the authority entry. The resulting URI may point to a Linked Data resource - as is the case with the GND ID property. This, on the one hand, provides a light-weight and robust mechanism to create links in the web of data. On the other hand, these links can be exploited by every application driven by one of the authorities to provide additional data: links to Wikipedia pages in multiple languages, images, life dates, nationality and affiliations of the persons concerned, and much more.
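The formatter-URL mechanism in miniature: the identifier value is substituted for the "$1" placeholder of the property's URI pattern. The GND pattern below is the real formatter URL of the GND ID property; the identifier is a made-up placeholder:

```python
# Substitute an identifier into a property's formatter URL.

def authority_uri(formatter_url, identifier):
    return formatter_url.replace("$1", identifier)

gnd_formatter = "https://d-nb.info/gnd/$1"    # formatter URL of GND ID
print(authority_uri(gnd_formatter, "123456789"))
# https://d-nb.info/gnd/123456789
```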

Bina Agarwal - SQID screenshot

Wikidata item for the Indian economist Bina Agarwal, visualized via the SQID browser

In 2014, a group of students under the guidance of Jakob Voß published a handbook on "Normdaten in Wikidata" (in German), describing the structures and the practical editing capabilities of the standard Wikidata user interface. The experiment described here focuses on persons from the subject domain of economics. It uses the authority identifiers of the about 450,000 economists referenced by their GND ID as creators, contributors or subjects of books, articles and working papers in ZBW's economics search portal EconBiz. These GND IDs were obtained from a prototype of the upcoming EconBiz Research Dataset (EBDS). For 40,000 of these persons, or 8.7 %, a Wikidata item is connected via the GND ID. If we consider the frequent (more than 30 publications) and the very frequent (more than 150 publications) authors in EconBiz, the coverage increases significantly:

Economics-related persons in EconBiz

Number of publications   total     in Wikidata   percentage
> 0                      457,244   39,778        8.7 %
> 30                     18,008    3,232         17.9 %
> 150                    1,225     547           44.7 %

Datasets: EBDS as of 2016-11-18; Wikidata as of 2016-11-07 (query, result)
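The percentage column follows directly from the raw counts; recomputing it is a one-liner per row:

```python
# Recompute the coverage percentages from the table's raw counts.

rows = [
    ("> 0",   457_244, 39_778),
    ("> 30",  18_008,  3_232),
    ("> 150", 1_225,   547),
]
for label, total, in_wikidata in rows:
    print(label, f"{in_wikidata / total:.1%}")
# > 0 8.7%
# > 30 17.9%
# > 150 44.7%
```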

These are numbers "out of the box" - ready-made opportunities to link out from existing metadata in EconBiz and to enrich user interfaces with biographical data from Wikidata/Wikipedia, without any additional effort to improve the coverage on either the EconBiz or the Wikidata side. However, we can safely assume that many of the EconBiz authors, particularly the high-frequency authors, and even more of the persons who are subjects of publications, are "notable" according to the Wikidata notability guidelines. Probably their items exist and are just missing the GND ID property.

To check this assumption, we take a closer look at the Wikidata persons which have the occupation "economist" (most Wikidata properties accept other Wikidata items - instead of arbitrary strings - as values, which allows for exact queries and is indispensable in a multilingual environment). Of these approximately 20,000 persons, less than 30 % have a GND ID property! Even if we restrict that to the 4,800 "internationally recognized economists" (which we define here as those having Wikipedia pages in three or more different languages), almost half of them lack a GND ID property. When we compare that with the coverage by VIAF IDs, more than 50 % of all and 80 % of the internationally recognized Wikidata economists are linked to VIAF (SPARQL Lab live query). Therefore, for a whole lot of the persons we have looked at here, we can take it for granted that the person exists in Wikidata as well as in the GND, and that the only reason for the lack of a GND ID is that nobody has added it to Wikidata yet.

As an aside: the information about the occupation of persons is to be taken as a very rough approximation. Some Wikidata persons were economists by education or at some point of their career, but are famous now for other reasons (examples include Vladimir Putin or the president of Liberia, Ellen Johnson Sirleaf). On the other hand, EconBiz authors known to Wikidata are often qualified not as economists, but as university teachers, politicians, historians or sociologists. Nevertheless, their work was deemed relevant for the broad field of economics, and the conclusions drawn about the "economists" in Wikidata and GND will hold for them, too: there are lots of opportunities for linking already well-defined items.

What can we gain?

The screenshot above demonstrates that not only data about the person herself, but also her affiliations, awards received, and possibly many other details can be obtained. The "Identifiers" box on the bottom right shows authority entries. Besides the GND ID, which served as an entry point for us, there are links to VIAF and other national libraries' authorities, but also to non-library identifier systems like ISNI and ORCID. In total, Wikidata comprises more than 14 million authority links, more than 5 million of these for persons.

When we take a closer look at the 40,000 EconBiz persons whom we can look up by their GND ID in Wikidata, an astonishing variety of authorities is addressed from there: 343 different authorities are linked from the subset, ranging from "almost complete" (VIAF, Library of Congress Name Authority File) to - in the given context - quite exotic authorities of, e.g., Members of the Belgian Senate, chess players or Swedish Olympic Committee athletes. Some of these entries link to carefully crafted biographies, sometimes behind a paywall (Notable Names Database, Oxford Dictionary of National Biography, Munzinger Archiv, Sächsische Biographie, Dizionario Biografico degli Italiani), or to free text resources (Project Gutenberg authors). Links to the world of museums and archives are also provided, from the Getty Union List of Artist Names to specific links into the British Museum or the Musée d'Orsay collections.

A particular use can be made of properties which express the prominence of the persons concerned: Nobel Prize IDs, for example, should definitely be linked to according GND IDs (and indeed, they are). But also TED speakers or persons with an entry in the Munzinger Archive (a famous and long-established German biographical service) can be assumed to have GND IDs. That opens a road to a very focused improvement of data quality: a list of persons with those properties, restricted to the subject field (e.g., occupation "economist"), can easily be generated from Wikidata's SPARQL Query Service. In Wikidata, it is very easy to add the missing ID entries discovered during such cross-checks interactively. And if it turns out that a "very important" person from the field is missing from the GND altogether, that is an all-the-more valuable opportunity to improve the data quality at the source.

How can we start improving?

As a proof of concept, and as a practical starting point, we have developed a micro-application for adding missing authority property values. It consists of two SPARQL Lab scripts: missing_property creates a list of Wikidata persons which have a certain authority property (by default: TED speaker ID) and lack another one (by default: GND ID). For each entry in the list, a link to an application is created which looks up the name in the according authority file (by default: search_person, for a broad yet ranked full-text search of person names in GND). If we can identify the person in the GND list, we can copy its GND ID, return to the first list, click on the link to the Wikidata item of the person and add the property value manually through Wikidata's standard edit interface. (Wikidata is open and welcomes such contributions!) It takes effect within a few seconds - when we reload the missing_property list, the improved item should not show up any more.
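The core of the missing_property check can be sketched as a toy filter: list persons that have one authority property but lack another. P2611 (TED speaker ID) and P227 (GND ID) are the real property numbers; the items themselves are invented, and the actual script runs as a SPARQL query:

```python
# Toy filter: items with one authority property but without another.

people = [
    {"qid": "Q-a", "P2611": "ted_1", "P227": "11111111X"},
    {"qid": "Q-b", "P2611": "ted_2"},              # GND ID missing
    {"qid": "Q-c"},                                # no TED speaker ID
]

def missing_property(items, have="P2611", lack="P227"):
    """QIDs of items having `have` but lacking `lack`."""
    return [i["qid"] for i in items if have in i and lack not in i]

print(missing_property(people))  # ['Q-b']
```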

Instead of identifying the most prominent economics-related persons in Wikidata, the other way round works too: while most of the GND-identified persons are related to only one or two works, as the according statistics show, a few are related to a disproportionate number of publications. Of the 1,200 persons related to more than 150 publications, less than 700 are missing links to Wikidata by their GND ID. By adding this property (for the vast majority of these persons, a Wikidata item should already exist), we could enrich, at a rough estimate, more than 100,000 person links in EconBiz publications. Another micro-application demonstrates how that work could be organized: the list of EconBiz persons by descending publication count provides "SEARCH in Wikidata" links (functional on a custom endpoint). Each link triggers a query which looks up all name variants in GND and executes a search for these names in a full-text indexed Wikidata set, bringing up a ranked list of suggestions (example with the GND ID of John H. Dunning). Again, the GND ID can be added - manually but straightforwardly - to an identified Wikidata item.

While we cannot expect to significantly reduce the quantitative gap between the 450,000 persons in EconBiz and the 40,000 of them linked to Wikidata by such manual efforts, we surely can improve step by step for the most prominent persons. This empowers applications to show biographical background links to Wikipedia where our users most probably expect them. Other tools for creating authority links and more automated approaches will be covered in further blog posts. And the great thing about Wikidata is: all efforts add up - while we are doing modest improvements in our field of interest, many others do the same, so Wikidata already features an impressive overall amount of authority links.

PS: All queries used in this analysis are published on GitHub. The public Wikidata endpoint cannot be used for research involving large datasets due to its limitations (in particular the 30-second timeout, the preclusion of the "service" clause for federated queries, and the lack of full-text search). Therefore, we have loaded the Wikidata dataset (along with others) into custom Apache Fuseki endpoints on a performant machine. Even there, a "power query" like the one on the number of all authority links in Wikidata takes about 7 minutes. Therefore, we publish the according result files in the GitHub repository alongside the queries.

17. Januar 2017
von Joachim Neubert
Kommentare deaktiviert für Economists in Wikidata: Opportunities of Authority Linking

Economists in Wikidata: Opportunities of Authority Linking

Wikidata is a large database, which connects all of the roughly 300 Wikipedia projects. Besides interlinking all Wikipedia pages in different languages about a specific item – e.g., a person -, it also connects to more than 1000 different sources of authority information.

The linking is achieved by a „authority control“ class of Wikidata properties. The values of these properties are identifiers, which unambiguously identify the wikidata item in external, web-accessible databases. The property definitions includes an URI pattern (called „formatter URL“). When the identifier value is inserted into the URI pattern, the resulting URI can be used to look up the authoritiy entry. The resulting URI may point to a Linked Data resource - as it is the case with the GND ID property. This, on the one hand, provides a light-weight and robust mechanism to create links in the web of data. On the other hand, these links can be exploited by every application which is driven by one of the authorities to provide additional data: Links to Wikipedia pages in multiple languages, images, life data, nationality and affiliations of the according persons, and much more.

Bini Agarwal - Sqid screenshot

Wikidata item for the Indian Economist Bina Agarwal, visualized via the SQID browser

In 2014, a group of students under the guidance of Jakob Voß published a handbook on "Normdaten in Wikidata" (in German), describing the structures and the practical editing capabilities of the the standard Wikidata user interface. The experiment described here focuses on persons from the subject domain of economics. It uses the authority identifiers of the about 450,000 economists referenced by their GND ID as creators, contributors or subjects of books, articles and working papers in ZBW's economics search portal EconBiz. These GND IDs were obtained from a prototype of the upcoming EconBiz Research Dataset (EBDS). To 40,000 of these persons, or 8.7 %, a person in Wikidata is connected by GND. If we consider the frequent (more than 30 publications) and the very frequent (more than 150 publications) authors in EconBiz, the coverage increases significantly:

Economics-related Persons in EconBiz
Number of publications total in Wikidata percentage
Datasets: EBDS as of 2016-11-18; Wikidata as of 2016-11-07 (query, result)
> 0 457,244 39,778 8.7 %
> 30 18,008 3,232 17.9 %
> 150 1,225 547 44.7 %

These are numbers "out of the box" - ready-made opportunities to link out from existing metadata in EconBiz and to enrich user interfaces with biographical data from Wikidata/Wikipedia, without any additional effort to improve the coverage on either the EconBiz or the Wikidata side. However: We can safely assume that many of the EconBiz authors, particularly of the high-frequency authors, and even more of the persons who are subject of publications, are "notable" according the Wikidata notablitiy guidelines. Probably, their items exist and are just missing the according GND property.

To check this assumption, we take a closer look to the Wikidata persons which have the occupation "economist" (most wikidata properties accept other wikidata items - instead of arbitrary strings - as values, which allows for exact queries and is indispensible in a multilingual environment).  Of these approximately 20,000 persons, less than 30 % have a GND ID property! Even if we restrict that to the 4,800 "internationally recognized economists" (which we define here as having Wikipedia pages in three or more different languages), almost half of them lack a GND ID property. When we compare that with the coverage by VIAF IDs, more than 50 % of all and 80 % the internationally recognized Wikidata economists are linked to VIAF (SPARQL Lab live query). Therefore, for a whole lot of the persons we have looked at here, we can take it for granted the person exists in Wikidata as well as in the GND, and the only reason for the lack of a GND ID is that nobody has added it to Wikidata yet.

As an aside: The information about the occupation of persons is to be taken as a very rough approximation: Some Wikidata persons were economists by education or at some point of their career, but are famous now for other reasons (examples include Vladimir Putin or the president of Liberia, Ellen Johnson Sirleaf). On the other hand, EconBiz authors known to Wikidata are often qualified not as economist, but as university teacher, politican, historican or sociologist. Nevertheless, their work was deemed relevant for the broad field of economics, and the conclusions drawn at the "economists" in Wikidata and GND will hold for them, too: There are lots of opportunities for linking already well defined items.

What can we gain?

The screenshot above demonstrates that not only data about the person herself, but also her affiliations, awards received, and possibly many other details can be obtained. The "Identifiers" box on the bottom right shows authority entries. Besides the GND ID, which served as an entry point for us, there are links to VIAF and other national libraries' authorities, but also to non-library identifier systems like ISNI and ORCID. In total, Wikidata comprises more than 14 million authority links, more than 5 million of these for persons.
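Counting such links boils down to selecting all statements whose property is of the type "external identifier". A sketch of the idea - this is the kind of "power query" mentioned at the end of this post, which requires a custom endpoint:

```sparql
# Sketch: count all external identifier statements in Wikidata
PREFIX wikibase: <http://wikiba.se/ontology#>

SELECT (COUNT(*) AS ?authorityLinks)
WHERE {
  ?property wikibase:propertyType wikibase:ExternalId ;  # identifier properties
            wikibase:directClaim ?p .                    # their "truthy" predicate
  ?item ?p ?value .                                      # each identifier statement
}
```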

When we take a closer look at the 40,000 EconBiz persons which we can look up by their GND ID in Wikidata, an astonishing variety of authorities is addressed from there: 343 different authorities are linked from the subset, ranging from "almost complete" (VIAF, Library of Congress Name Authority File) to - in the given context - quite exotic authorities of, e.g., Members of the Belgian Senate, chess players or Swedish Olympic Committee athletes. Some of these entries link to carefully crafted biographies, sometimes behind a paywall (Notable Names Database, Oxford Dictionary of National Biography, Munzinger Archiv, Sächsische Biographie, Dizionario Biografico degli Italiani), or to free text resources (Project Gutenberg authors). Links to the world of museums and archives are also provided, from the Getty Union List of Artist Names to specific links into the British Museum or the Musée d'Orsay collections.

A particular use can be made of properties which express the prominence of the according persons: Nobel Prize IDs, for example, definitely should be linked to according GND IDs (and indeed, they are). But also TED speakers or persons with an entry in the Munzinger Archiv (a famous and long-established German biographical service) can be assumed to have GND IDs. That opens a road to a very focused improvement of the data quality: A list of persons with those properties, restricted to the subject field (e.g., occupation "economist"), can easily be generated with Wikidata's SPARQL Query Service. In Wikidata, it is very easy to add the missing ID entries discovered during such cross-checks interactively. And if it turns out that a "very important" person from the field is missing from the GND altogether, that is an all-the-more valuable opportunity to improve the data quality at the source.
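A cross-check of this kind could be sketched as follows (TED speaker ID = P2611; the parametrized scripts used in practice are described in the next section, this is a hand-written approximation):

```sparql
# Sketch: economists with a TED speaker ID but no GND ID yet
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

SELECT ?person ?personLabel ?tedId
WHERE {
  ?person wdt:P2611 ?tedId ;                    # TED speaker ID
          wdt:P106 wd:Q188094 .                 # occupation: economist
  FILTER NOT EXISTS { ?person wdt:P227 ?gnd }   # GND ID still missing
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```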

How can we start improving?

As a proof of concept, and as a practical starting point, we have developed a micro-application for adding missing authority property values. It consists of two SPARQL Lab scripts: missing_property creates a list of Wikidata persons which have a certain authority property (by default: TED speaker ID) but lack another one (by default: GND ID). For each entry in the list, a link to a second application is created, which looks up the name in the according authority file (by default: search_person, for a broad yet ranked full-text search of person names in GND). If we can identify the person in the GND list, we copy the GND ID, return to the first list, click on the link to the Wikidata item of the person and add the property value manually through Wikidata's standard edit interface. (Wikidata is open and welcomes such contributions!) The change takes effect within a few seconds - when we reload the missing_property list, the improved item should not show up any more.

Instead of identifying the most prominent economics-related persons in Wikidata, the other way round works, too: While most of the GND-identified persons are related to only one or two works, as an according statistic shows, a few are related to a disproportionate amount of publications. Of the 1,200 persons related to more than 150 publications, less than 700 are missing links to Wikidata by their GND ID. By adding this property (for the vast majority of these persons, a Wikidata item should already exist), we could enrich, at a rough estimate, more than 100,000 person links in EconBiz publications. Another micro-application demonstrates how the work could be organized: The list of EconBiz persons by descending publication count provides "SEARCH in Wikidata" links (functional on a custom endpoint): Each link triggers a query which looks up all name variants in GND and executes a search for these names in a full-text indexed Wikidata set, bringing up an according ranked list of suggestions (example with the GND ID of John H. Dunning). Again, the GND ID can be added - manually but straightforwardly - to an identified Wikidata item.

While we cannot expect to significantly reduce the quantitative gap between the 450,000 persons in EconBiz and the 40,000 of them linked to Wikidata by such manual efforts, we surely can improve the situation step by step for the most prominent persons. This empowers applications to show biographical background links to Wikipedia where our users most probably expect them. Other tools for creating authority links and more automated approaches will be covered in further blog posts. And the great thing about Wikidata is: All efforts add up - while we are doing modest improvements in our field of interest, many others do the same, so Wikidata already features an impressive overall amount of authority links.

PS: All queries used in this analysis are published on GitHub. The public Wikidata endpoint cannot be used for research involving large datasets due to its limitations (in particular the 30 second timeout, the preclusion of the "service" clause for federated queries, and the lack of full-text search). Therefore, we've loaded the Wikidata dataset (along with others) into custom Apache Fuseki endpoints on a performant machine. Even there, a "power query" like the one on the number of all authority links in Wikidata takes about 7 minutes. Therefore, we publish the according result files in the GitHub repository alongside the queries.

30. März 2016
von Joachim Neubert
Kommentare deaktiviert für Turning the GND subject headings into a SKOS thesaurus: an experiment

Turning the GND subject headings into a SKOS thesaurus: an experiment

The "Integrated Authority File" (Gemeinsame Normdatei, GND) of the German National Library (DNB), the library networks of the German-speaking countries and many other institutions, is a widely recognized and used authority resource. The authority file comprises persons, institutions, locations and other entity types, in particular subject headings. With more than 134,000 concepts, organized in almost 500 subject categories, the subjects part - the former "Schlagwortnormdatei" (SWD) - is huge. That would make it a nice resource to stress-test SKOS tools - when it would be available in SKOS. A seminar at the DNB on requirements for thesauri on the Semantic Web (slides, in German) provided another reason for the experiment described below.

The GND subject headings are defined using a well-thought-out set of custom classes and properties, the GND Ontology (gndo). The GND links to other vocabularies with SKOS mapping properties, which technically implies that some, but not all, of its subject headings are skos:Concepts. Many of the gndo properties already mirror the SKOS or iso-thes properties. For the experiment, the relevant subset of the whole GND was selected by the gndo:SubjectHeadingSensoStricto class. One single SPARQL construct query does the selection and conversion (execute on an example concept). For skos:prefLabel/altLabel, derived from gndo:preferredNameForTheSubjectHeading/variantNameForTheSubjectHeading, German language tags are added. The fine-grained hierarchical relations of GNDO - generic, instantial, partitive - are collapsed to skos:broader/narrower. All original properties of a concept are included in the output of the query.
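The core of such a conversion could look like the following sketch (the actual query is linked above; property names are taken from the published GND Ontology, and only the generic broader relation is shown):

```sparql
# Sketch: convert GND subject headings to SKOS concepts
PREFIX gndo: <http://d-nb.info/standards/elementset/gnd#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT {
  ?c a skos:Concept ;
     skos:prefLabel ?prefLabelDe ;
     skos:altLabel ?altLabelDe ;
     skos:broader ?broader .
}
WHERE {
  ?c a gndo:SubjectHeadingSensoStricto ;
     gndo:preferredNameForTheSubjectHeading ?pref .
  BIND (STRLANG(STR(?pref), "de") AS ?prefLabelDe)    # add German language tag
  OPTIONAL {
    ?c gndo:variantNameForTheSubjectHeading ?alt .
    BIND (STRLANG(STR(?alt), "de") AS ?altLabelDe)
  }
  OPTIONAL { ?c gndo:broaderTermGeneral ?broader }    # generic hierarchy, collapsed
}
```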

Some additional work was required to integrate the GND Subject Categories (gndsc), a skos:ConceptScheme of 484 skos:Concepts which logically build a hierarchy. (In fact, the currently published file puts all subject categories on one level.) The subject headings invariably link to one or more subject categories, but unfortunately that data has to be downloaded and added separately (with a bit of extension). The linking property from the subject headings, gndo:gndSubjectCategory, was already collapsed to skos:broader in the former query. Finally, we add an explicit skos:notation and some bits of metadata about the concept scheme.

This gives us a large skos:ConceptScheme, which we called swdskos and which is currently available in a SPARQL endpoint. Now we can proceed and try out whether generic SKOS tools for display, verification and version history comparisons work at that scale.

Skosmos for thesaurus display

Skosmos is an open source web application for browsing controlled vocabularies, developed by the National Library of Finland. It requires a triple store with the vocabulary loaded. (The Skosmos wiki provides detailed installation and configuration help for this.) The configuration for the GND/SWD vocabulary takes only a few lines, following the provided template. The result can be found at http://zbw.eu/beta/skosmos/swdskos:


With marginal effort, we gained a structured concept display, a very nice browsing and hierarchical view interface, and a powerful search - out of the box. The initial alphabetical display takes a few seconds due to the large number of terms for most of the index letters. In a production setting, that could be improved by adding a Squid or Varnish cache. The navigation from concept to concept takes far below one second, so the tool seems well suited for practical use even with larger-than-usual vocabularies. For GND, it offers an alternative to the existing access via the DNB portal, more focused on browsing contexts and with a more precise search.

Quality assurance with qSKOS

Large knowledge organization systems are prone to human mistakes, which creep in even with strict rules and careful editing. Some maintenance systems try to catch some of these errors, but let others slip through. So one of the really great things about SKOS as a general format for knowledge organization systems is that generic tools can be developed which catch more and more classes of errors. qSKOS has identified a number of widespread possible quality issues, on which it provides detailed analytic information. Of course, it often depends on the vocabulary which types of issues are considered errors - for example, it is expected that most GND subject headings lack a definition, so a list of 100,000+ such concepts is not helpful, whereas the list of the (in total 3) cyclic hierarchical relations is. The parametrization we use for STW seems to provide useful results here, too:

java -jar qSKOS-cmd.jar analyze -np -d -c ol,chr,usr,rc,mc,ipl,dlv,urc swdskos.ttl.gz -o qskos_extended.log

The tool has already been tested with very large vocabularies (e.g., LCSH). On the swdskos dataset, it runs for 8 minutes, but it provides results which could not be obtained otherwise. For example, the list of overlapping labels (report) reveals some strange clashes (example). Standard SKOS tools thus could complement the quality assurance procedures which are already in place.

Version comparisons with skos-history

The skos-history method allows tracking of changes in knowledge organization systems. It had been developed in the context of the STW overhaul. With swdskos, it proves to be applicable to much larger KOS. The loading of the three versions and the computation of all version deltas takes almost half an hour (on a moderately sized virtual machine). That way, for example, we can see the 638 concepts which were deleted between the Oct 2015 and the Feb 2016 dump of GND. Spot checks of such deleted concept URIs return concepts with different URIs, but the same preferred label, so we can assume that duplicates have been removed here. The added_concepts query can be extended to make use of the - often underestimated - GND subject categories for organizing the query results, as is shown here (list filtered by the notation for computer science and data processing).
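A much simplified version of such a deleted-concepts comparison could look as follows, assuming the two versions are loaded into named graphs (the graph URIs here are placeholders; the actual skos-history queries work on precomputed version deltas):

```sparql
# Sketch: concepts present in the old version, but gone in the new one
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ?prefLabel
WHERE {
  GRAPH <http://example.org/swdskos/version/2015-10> {
    ?concept a skos:Concept ;
             skos:prefLabel ?prefLabel .
  }
  FILTER NOT EXISTS {
    GRAPH <http://example.org/swdskos/version/2016-02> { ?concept ?p ?o }
  }
}
```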


These queries only scratch the surface of what could be done by comparing multiple versions of the GND subject headings. Custom queries could try to reveal maintenance patterns, or, for example, trace the uptake of the finer-grained hierarchical properties (generic/instantial/partitive) used in GND.

Summing up

Generic SKOS tools seem to be useful complements to custom tools and processes for specialized knowledge organization systems. The tools considered here have shown no scalability issues with large vocabularies. The publication of an additional, experimental SKOS version of the subject headings part of the GND linked data dump could perhaps instigate further research on the development of the vocabulary.

The code and the data of the experiment are available here.