Pl4net.info

Library voices. Independent, daily.

23 December 2022
by Joachim Neubert

The PM20 commodities/wares archive: part 4 of the data donation to Wikidata

After the digitized material of the persons, countries/subjects and companies archives of the 20th Century Press Archives had been made available via Wikidata, the last part, from the wares archive, has now been added. This wares archive is about products ...

2 February 2022
by Joachim Neubert

How-to: Matching multilingual thesaurus concepts with OpenRefine

Currently, the STW Thesaurus for Economics is mapped to Wikidata, one sub-thesaurus at a time. For the next part, "B Business Economics", we have improved our prior OpenRefine matching process. Though the use case - matching concepts in a multilingual ...

13 December 2021
by Joachim Neubert

Integrating the PM20 companies archive: part 3 of the data donation to Wikidata

ZBW inherited a large trove of historical company information - annual reports, newspaper clippings and other material about more than 40,000 companies and other organizations around the world. Parts of these, in particular all about German and British entities until 1949, are freely available online in the companies section (list by country) of the 20th Century Press Archives. More digitized folders with material about companies in and outside of Europe up to 1960 are accessible only on ZBW premises, due to intellectual property rights.

As a part of its support for Open Science, ZBW has made all metadata of the 20th Century Press Archives available under a CC0 license. In order to make the folders more easily accessible for business history research as well as for the general public, we have added links for every single folder to Wikidata. In addition to that, the metadata about companies and organizations, such as inception date or links to board members, has been added to the large amount of company data already available in Wikidata. This continues the PM20 data donation of ZBW to Wikidata, as described earlier for the persons archives and the countries/subjects archives. The activities were carried out - with notable help of volunteers - and documented in the WikiProject 20th Century Press Archives.

The mapping process to Wikidata items

Many of the PM20 company and organization folders deal with entities that already have items in Wikidata. If GND identifiers were assigned to these items, we directly created links to PM20 companies with the same ID, and were done. Matching and linking to Wikidata items without the help of a unique identifier, however, proved challenging. Unlike person names, company names change frequently, or are spelled differently at different times or in different languages. Not infrequently, the entities themselves change through mergers and acquisitions, and may or may not have been represented by a new folder in PM20, or by a different item in Wikidata. Subsidiaries may be subsumed under the parent organization, or be separate entities. While it is relatively easy to split items in Wikidata, in the folders with printed newspaper clippings and reports it meant digging through sometimes hundreds of pages to single out a company retrospectively. So early decisions about the cutting and delimitation of folders often stuck for the following decades. All of that made it more difficult not only to obtain matches at all, but also to decide whether the same entity is indeed covered.

For the first matching approach, we used Wikidata's Mix-n-match (M-n-m) tool. In order to get manageable "buckets", we sliced the data according to the main language of the company location (German, English, French, Dutch and Other). In the M-n-m batches, we aimed at entities of type "organization" in that language. Although we also used the available aliases, we found relatively low matching rates (and a number of "false positives" among them).

After having worked through the M-n-m suggestions, we switched to another approach: For each segment, we created a list of search statements for all folders not already linked in Wikidata. For each entry, the company name was searched via Startpage (which in turn uses Google search), supplemented by "site:wikipedia.org". That searches the full text of all Wikipedia pages, so slight differences in the spelling of company names did not matter. Wikipedia pages also turned up where the company name only occurred in some context, e.g. for the founder of the company or as part of a later merger. We could then select the correct Wikipedia page from the result list, follow the "Wikidata item" link and add the PM20 folder ID to the item. Another search link on the list searched "site:wikidata.org" for existing Wikidata items. (It turned out that for Wikidata, DuckDuckGo brought better results than Startpage/Google.)
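The search list described above could be generated along the following lines. This is only a minimal sketch: the function name `search_links` and the URL patterns for Startpage and DuckDuckGo are assumptions made for illustration, not taken from the original query code.

```python
from urllib.parse import urlencode

# Assumed search URL patterns (not from the original post).
STARTPAGE_URL = "https://www.startpage.com/do/search"
DUCKDUCKGO_URL = "https://duckduckgo.com/"


def search_links(company_name: str) -> dict[str, str]:
    """Build one full-text search link per target site for a company name."""
    wikipedia_query = f"{company_name} site:wikipedia.org"
    wikidata_query = f"{company_name} site:wikidata.org"
    return {
        # Wikipedia pages were searched via Startpage
        "wikipedia": f"{STARTPAGE_URL}?{urlencode({'query': wikipedia_query})}",
        # for Wikidata items, DuckDuckGo gave better results
        "wikidata": f"{DUCKDUCKGO_URL}?{urlencode({'q': wikidata_query})}",
    }


links = search_links("Steel Brothers & Company")
for target, url in links.items():
    print(target, url)
```

Since `urlencode` escapes the ampersand and the colon, company names with special characters do not break the generated links.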

When no exactly matching item was found, we sometimes added the PM20 link with a "mapping relation type" of "related match", according to the perceived usefulness for later more detailed work.

The tedious work was facilitated by a second list, which contained statements for creating missing items immediately in Wikidata's QuickStatements (QS) tool. The statements included labels in different languages, descriptions, sometimes aliases, the official name (with the legal form), type(s) and often the GND ID, as in this example:

# Steel Brothers & Company {19}

CREATE
LAST|Lde|"Steel Brothers & Company"
LAST|Len|"Steel Brothers & Company"
LAST|Dde|"Unternehmen; Kolonialgesellschaft"
LAST|Den|"business; colonial society"
LAST|Ade|"W. Strang Steel & Co"
LAST|Aen|"W. Strang Steel & Co"
LAST|P4293|"co/068007"
LAST|P31|Q4830453|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P31|Q1700154|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P227|"2040532-7"|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P571|+1870-01-01T00:00:00Z/9|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P1448|de:"Steel Brothers & Company, Ltd."|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
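Statements of this shape can be produced from folder metadata with a few lines of code. The following is a sketch under assumptions: the `Folder` class and its field names are hypothetical, and only the properties and the reference pattern visible in the example above are used. It is not the conversion code actually used for the donation.

```python
from dataclasses import dataclass

PM20_SOURCE = "Q36948990"  # "stated in" (S248) value from the example above


@dataclass
class Folder:
    """Hypothetical container for the metadata of one PM20 company folder."""
    folder_id: str
    official_name: str
    labels: dict[str, str]
    descriptions: dict[str, str]
    types: list[str]
    retrieved: str  # date already in QuickStatements time format


def reference(folder: Folder) -> str:
    """Reference part that is appended to every sourced statement."""
    return (
        f'S248|{PM20_SOURCE}|S4293|"{folder.folder_id}"'
        f'|S1810|"{folder.official_name}"|S813|{folder.retrieved}'
    )


def to_quickstatements(folder: Folder) -> list[str]:
    """Create the QS lines for a new item: labels, descriptions, ID, types."""
    lines = ["CREATE"]
    lines += [f'LAST|L{lang}|"{text}"' for lang, text in folder.labels.items()]
    lines += [f'LAST|D{lang}|"{text}"' for lang, text in folder.descriptions.items()]
    lines.append(f'LAST|P4293|"{folder.folder_id}"')
    ref = reference(folder)
    lines += [f"LAST|P31|{qid}|{ref}" for qid in folder.types]
    return lines


steel = Folder(
    folder_id="co/068007",
    official_name="Steel Brothers & Company, Ltd.",
    labels={"de": "Steel Brothers & Company", "en": "Steel Brothers & Company"},
    descriptions={"de": "Unternehmen; Kolonialgesellschaft",
                  "en": "business; colonial society"},
    types=["Q4830453", "Q1700154"],
    retrieved="+2021-08-10T00:00:00Z/11",
)
qs_lines = to_quickstatements(steel)
print("\n".join(qs_lines))
```

Keeping the reference in one helper ensures that every sourced statement carries the same provenance (source item, folder ID, name as given in the source, and retrieval date).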


Since both lists followed exactly the same order (primarily by descending number of documents, to put the most relevant companies on top) and were updated every hour, the workflow was easy: step through the list, search for existing items and link them, add the remaining entries one by one or in batches to Wikidata via QS, and repeat until all entries are linked and both lists are empty. As a result, 3897 PM20 folders could be linked to existing items, while 5085 items were created from scratch. (Query code for the search list, the insert list, and the conversion to QS statements is available.)

Enriching the metadata

After the mapping process had been finished, we added missing metadata from PM20 to all linked Wikidata items. That included country and headquarters location (having Geonames identifiers in PM20 helped a lot), inception and dissolution dates, links to predecessor and parent companies, and links to persons in their roles as founders or board members.

Table of organization properties sourced in PM20:

PID    Property                                 Pre-existing items  New items  Total
P452   industry                                               5509       7682  13191
P17    country                                                 769       5105   5874
P31    instance of                                             424       5371   5795
P1448  official name                                            94       5073   5167
P159   headquarters location                                   722       3542   4264
P571   inception                                               371       3800   4171
P227   GND ID                                                  816       1708   2524
P355   subsidiary                                              764        191    955
P749   parent organization                                     384        567    951
P576   dissolved, abolished or demolished date                 204        673    877
P156   followed by                                             331        424    755
P155   follows                                                 538        209    747
P3320  board member                                            460         35    495
P112   founded by                                               78         22    100
P5052  supervisory board member                                 63         20     83

Source

Classification by industry

The PM20 companies archive was organized by industries, in two different ways: firstly, a custom classification was used for all folders, derived from a very old version of the "economic sectors" part of the STW Thesaurus for Economics. Secondly, parts of the folders were classified according to the European economic activities classification NACE Rev. 2.

Here, the approach was to map the custom classification to existing - and a few newly built - industry items in Wikidata (see mapping). This allowed us to fill the "industry" property of all linked Wikidata items with values derived from PM20. Additionally, further matching industries were derived from the "NACE code" property in Wikidata. Interestingly, this combined approach extended the coverage of company folders by NACE significantly - from 3,648 to 6,233.

Due to incompatibilities on the conceptual level, that could not be extended to all industries. To give an example: One of the most important industry sectors in Germany, "Metallindustrie" (metal industry), cannot be represented by a NACE class: "C24 Manufacture of basic metals" is strictly separate from "C25 Manufacture of fabricated metal products", while further processing of metals, e.g. into machinery and equipment, is assigned to still other classes.

Supplemented with "plain" Wikidata industries, it nevertheless proved possible to create a complete hierarchical list of companies with PM20 folders by NACE code and, in the absence of a NACE code, by Wikidata industry label.

chart of PM20 industries

(Wikidata query result with mouse-over labels)

As a result of this data donation, the coverage of 20th century companies and organizations in Wikidata has improved considerably, both in breadth and depth. With the links to the digitized PM20 folders, about 1.2 million document pages have been made available from the corresponding items for FAIR use in research, education and public information.

 

29 January 2021
by Joachim Neubert

Data donation to Wikidata, part 2: country/subject dossiers of the 20th Century Press Archives

The world's largest public newspaper clippings archive comprises lots of material of great interest, particularly for authors and readers in the Wikiverse. ZBW has digitized the material from the first half of the last century, and has put all available metadata under a CC0 license. Moreover, we are donating that data to Wikidata, by adding or enhancing items and providing ways to access the dossiers (called "folders") and clippings easily from there.

Challenges of modelling a complex faceted classification in Wikidata

That had been done for the persons' archive in 2019 - see our prior blog post. For persons, we could just link from existing or a few newly created person items to the biographical folders of the archive. The countries/subjects archives provided a different challenge: The folders there were organized by countries (or continents, or cities in a few cases, or other geopolitical categories), and within the country, by an extended subject category system (available also as SKOS). To put it differently: Each folder was defined by a geo and a subject facet - a method widely used in general purpose press archives, because it allowed a comprehensible and, supported by a signature system, unambiguous sequential shelf order, indispensable for quick access to the printed material.

Folders specifically about one significant topic (like the Treaty of Sèvres) are rare in the press archives, whereas country/subject combinations are rare among Wikidata items - so direct linking between existing items and PM20 folders was hardly achievable. The folders themselves had to be represented as Wikidata items, just like other sources used there. Here, however, we did not have works or scientific articles, but thematic mini-collections of press clippings, often not notable in themselves and normally without further formal bibliographic data. So a class of PM20 country/subject folder was created (as subclass of dossier, a collection of documents). Creating items for each folder, and linking them via PM20 folder ID (P4293) to the actual press archive folders, was however only part of the solution.

In order to represent the faceted structure of the archive, we needed anchor points for both facets. That was easy for the geographical categories: the vast majority of them already existed as items in Wikidata, a few historical ones, such as Russian peripheral countries, had to be created. For the subject categories, the situation was much different. Categories such as The country and its people, politics and economy, general or Postal services, telegraphy and telephony were constructed as baskets for collecting articles on certain broader topics. They do not have an equivalent in Wikidata, which tries to describe real world entities or clear-cut concepts. We decided therefore to represent the categories of the subject category system with their own items of type PM20 subject category. Each of the about 1400 categories is connected to the upper one via a "part of" (P361) property, thus forming a five-level hierarchy.

More implementation subtleties

For both facets, corresponding Wikidata properties were created as "PM20 geo code" (P8483) and "PM20 subject code" (P8484). As external identifiers, they link directly to lists of subjects (e.g., for Japan) or geographical entities (e.g., for The country ..., general). For all countries where the press archives material has been processed - this includes the tedious task of clarifying the intellectual property rights status of each article - the Wikidata item for the country now includes a link to a list of all press archives dossiers about this country, covering the first half of the 20th century.

PM20 country categories

The folders represented in Wikidata (e.g., Japan : The country ..., general) use "facet of" (P1269) and "main subject" (P921) properties to connect to the items for the country and subject categories. Thus, each of the 9,200 published folders of the PM20 country/subject archive is accessible via Wikidata. Moreover, since the structural metadata of PM20 is available, too, it can be queried in its various dimensions - see for example the list of top level subject categories with the number of folders and documents, or a list of folders per country, ordered by signature (with subtleties covered by a "series ordinal" (P1545) qualifier). The interactive map of subject folders as shown above is also created by a SPARQL query, and gives a first impression of the geographical areas covered in depth - or as yet only sparsely - in the online archive.
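To illustrate how this structure can be queried, the following sketch builds a SPARQL query for all folders of one country. It rests on assumptions: that "facet of" (P1269) points to the country item and "main subject" (P921) to the subject category item, as the order in the paragraph above suggests, and that sorting by folder ID is an acceptable stand-in for the signature order (the "series ordinal" qualifier is left out). The function name is hypothetical; the resulting string is meant to be pasted into the Wikidata Query Service.

```python
import re


def folders_for_country_query(country_qid: str) -> str:
    """Return a SPARQL query listing PM20 folders for the given country item."""
    if not re.fullmatch(r"Q[1-9]\d*", country_qid):
        raise ValueError(f"not a Wikidata QID: {country_qid!r}")
    return f"""
SELECT ?folder ?folderLabel ?pm20Id ?subject ?subjectLabel WHERE {{
  ?folder wdt:P4293 ?pm20Id ;             # PM20 folder ID
          wdt:P1269 wd:{country_qid} ;    # facet of: country (assumed)
          wdt:P921 ?subject .             # main subject: category (assumed)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,de" . }}
}}
ORDER BY ?pm20Id
""".strip()


# Q17 is the Wikidata item for Japan
japan_query = folders_for_country_query("Q17")
print(japan_query)
```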

Core areas: worldwide economy, worldwide colonialism

The online data reveals core areas of attention during 40 years of press clippings collection until 1949. Economy, of course, was in the focus of the former HWWA (Hamburg Archive for the International Economy), in Germany and namely Hamburg, as well as in every other country. More than half of all subject categories are part of the n Economy section of the category system and give in 4,500 folders very detailed access to the field. About 100,000 of the almost 270,000 online documents of the archive are part of this section, followed by history and general politics, foreign policy, and public finance, down to more peripheral topics like settling and migration, minorities, justice or literature. Originating in the history of the institution (which was founded as "Zentralstelle des Hamburgischen Kolonialinstituts", the central office of the Hamburg colonial institute) colonial efforts all over the world were monitored closely. We published with priority the material about the former German colonies, listed in the Archivführer Deutsche Kolonialgeschichte (Archive guide to the German Colonial Past, also interconnected to Wikidata). Originally collected to support the aggressive and inhuman policy of the German Empire, it is now available to serve as research material for critical analysis in the emerging field of colonial and postcolonial studies.

Enabling future community efforts

While all material about the German colonies (and some about the Italian ones) is online, and accessible now via Wikidata, this is not true for the former British/French/Dutch/Belgian colonies. While Japan or Argentina are accessible completely, China, India or the US are missing, as well as most of the European countries. And while 800+ folders about Hamburg cover its contemporary history quite well, the vast majority of the material about Germany as a whole is only accessible "on premises" within ZBW's locations. It is, however, available as digital images, and can be accessed through finding aids (in German), which in the reading rooms directly link to a document viewer. The metadata for this material is now open data and can be changed and enhanced in Wikidata. A very selective example of how that could work is a topic in German-Danish history - the 1920 Schleswig plebiscites. The PM20 folder about these events was not part of the published material, but attracted some interest with last year's centenary. The PM20 metadata on Wikidata made it possible to create a corresponding folder completely in Wikidata, Nordslesvig : Historical events, with a (provisional) link to a stretch of images on a digitized film. While the checking and activation of these images for the public was a one-time effort in the context of an open science event, the creation of a new PM20 folder on Wikidata may demonstrate how open metadata can be used by a dedicated community of knowledge to enable access to not-yet-open knowledge. Current intellectual property law in the EU forbids open access to all digitized clippings from newspapers published in 1960 until 2031, and to all clippings where the death date of a named author is not known until after 2100. Of course, we hope for a change in that obstructive legislation in the not-too-distant future.
We are confident that the metadata about the material, now in Wikidata, will help bridge the gap until it is finally possible to use all digitized press archives contents as open scientific and educational resources, within and outside of the Wikimedia projects.

More information at WikiProject 20th Century Press Archives, which links also to the code for creating this data donation.

7 December 2020
by Joachim Neubert

Building the SWIB20 participants map

 SWIB20 participant map

Here we describe the process of building the interactive SWIB20 participants map, created by a query to Wikidata. The map was intended to help participants of SWIB20 make contacts in the virtual conference space. However, in compliance with GDPR, we wanted to avoid publishing personal details. So we chose to publish a map of the institutions to which the participants are affiliated. (Obvious downside: the 9 unaffiliated participants could not be represented on the map.)

We suppose that the method can be applied to other conferences and other use cases - e.g., the downloaders of scientific software or the institutions subscribed to an academic journal. Therefore, we describe the process in some detail.

  1. We started with a list of institution names (with country code and city, but without person IDs), extracted and transformed from our ConfTool registration system, and saved it in CSV format. Country names were normalized, cities were not (and were only used for context information).

  2. We created an OpenRefine project, and reconciled the institution name column with Wikidata items of type Q43229 (organization, and all its subtypes). We included the country column (-> P17, country) as relevant other detail, and let OpenRefine “Auto-match candidates with high confidence”. Of our original set of 335 country/institution entries, 193 were automatically matched via the Wikidata reconciliation service. At the end of the conference, 400 institutions were identified and put on the map (data set).

  3. We went through all un-matched entries and either
    a) selected one of the suggested items, or
    b) looked up and tweaked the name string in Wikidata, or in Google, until we found a corresponding Wikipedia page, opened the linked Wikidata object from there, and inserted the QID in OpenRefine, or
    c) created a new Wikidata item (if the institution seemed notable), or
    d) attached “not yet determined” (Q59496158) where no Wikidata item (yet) exists, or
    e) attached “undefined value” (Q7883029) where no institution had been given

  4. The results were exported from OpenRefine into a .tsv file (settings)

  5. Again via a script, we loaded ConfTool participants data, built a lookup table from all available OpenRefine results (country/name string -> WD item QID), aggregated participant counts per QID, and loaded that data into a custom SPARQL endpoint, which is accessible from the Wikidata Query Service. As in step 1, for all (new) institution name strings which were not yet mapped to Wikidata, a .csv file was produced. (An additional remark: If no approved custom SPARQL endpoint is available, it is feasible to generate a static query with all data in its “values” clause.)

    SWIB20 map data flow
  6. During the preparation of the conference, more and more participants registered, which required multiple loops: Use the .csv file of step 5 and re-iterate, starting at step 2. (Since I found no straightforward way to update an existing OpenRefine project with extended data, I created a new project with new input and output files for every iteration.)

  7. Finally, to display the map we could run a federated query on WDQS. It fetches the institution items from the custom endpoint and enriches them from Wikidata with name, logo and image of the institution (if present), as well as with geographic coordinates, obtained directly or indirectly as follows:
    a) item has “coordinate location” (P625) itself, or
    b) item has “headquarters location” item with coordinates (P159/P625), or
    c) item has “located in administrative entity” item with coordinates (P131/P625), or
    d) item has “country” item (P17/P625)
    Applying this method, only one institution item could not be located on the map.
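The aggregation and the static-query variant mentioned in the steps above can be sketched as follows. This is an illustration under assumptions, not the script used for SWIB20: the function names and the sample data are hypothetical, the QIDs in the example are placeholders rather than the real institution items, and the coordinate fallback is expressed with OPTIONAL patterns and COALESCE, which is one possible way to implement the order a) to d).

```python
from collections import Counter


def count_per_institution(participants, lookup):
    """Aggregate participant counts per QID and collect unmapped entries.

    participants: list of (country, institution name) tuples, one per person
    lookup: dict mapping (country, institution name) to a Wikidata QID
    """
    counts, unmapped = Counter(), []
    for key in participants:
        qid = lookup.get(key)
        if qid is None:
            unmapped.append(key)  # input for the next OpenRefine iteration
        else:
            counts[qid] += 1
    return counts, unmapped


def static_map_query(counts):
    """Build a query with all data in its VALUES clause (no custom endpoint)."""
    values = " ".join(f"(wd:{qid} {n})" for qid, n in sorted(counts.items()))
    return f"""#defaultView:Map
SELECT ?item ?itemLabel ?count ?coord WHERE {{
  VALUES (?item ?count) {{ {values} }}
  OPTIONAL {{ ?item wdt:P625 ?own . }}                # a) own coordinates
  OPTIONAL {{ ?item wdt:P159/wdt:P625 ?hq . }}        # b) headquarters
  OPTIONAL {{ ?item wdt:P131/wdt:P625 ?admin . }}     # c) administrative entity
  OPTIONAL {{ ?item wdt:P17/wdt:P625 ?country . }}    # d) country
  BIND(COALESCE(?own, ?hq, ?admin, ?country) AS ?coord)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}"""


# Placeholder data: names and QIDs are made up for illustration only.
sample_lookup = {("DE", "Example Library"): "Q100", ("FR", "Example Institute"): "Q200"}
sample_participants = [
    ("DE", "Example Library"),
    ("DE", "Example Library"),
    ("FR", "Example Institute"),
    ("US", "Unmatched Org"),
]
sample_counts, sample_unmapped = count_per_institution(sample_participants, sample_lookup)
sample_query = static_map_query(sample_counts)
print(sample_query)
```

COALESCE picks the first bound variable, so an item with its own coordinates is never placed at its headquarters or country location. Items with several values for one of these properties would yield several rows, which a production query would have to handle.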

SWIB20 participant map - detail

Data improvements

The way to improve the map was to improve the data about the items in Wikidata - which also helps all future Wikidata users.

New items

For a few institutions, new items were created:

For another 14 institutions, mostly private companies, no items were created due to notability concerns. Everything else already had an item in Wikidata!

Improvement of existing items

In order to improve the display on the map, we enhanced selected items in Wikidata in various ways:

  • Add English label
  • Add type (instance of)
  • Add headquarter location
  • Add image and/or logo

We also hope that participants of the conference took the opportunity to make their institution “look better”, for example by adding an image of it to the Wikidata knowledge base.

Putting Wikidata into use for a completely custom purpose thus created incentives for improving “the sum of all human knowledge” step by tiny step.

 

 

 

24 October 2019
by Joachim Neubert

20th Century Press Archives: Data donation to Wikidata

ZBW is donating a large open dataset from the 20th Century Press Archives to Wikidata, in order to make it better accessible to various scientific disciplines such as contemporary, economic and business history, media and information science, to journa...