23 December 2022
by Joachim Neubert
The PM20 commodities/wares archive: part 4 of the data donation to Wikidata
All posts by: Joachim Neubert
2 February 2022
by Joachim Neubert
How-to: Matching multilingual thesaurus concepts with OpenRefine
13 December 2021
by Joachim Neubert
Integrating the PM20 companies archive: part 3 of the data donation to Wikidata
ZBW inherited a large trove of historical company information - annual reports, newspaper clippings and other material about more than 40,000 companies and other organizations around the world. Part of this material, in particular everything about German and British entities until 1949, is freely available online in the companies section (list by country) of the 20th Century Press Archives. Further digitized folders with material about companies in and outside of Europe up to 1960 are accessible only on ZBW premises, due to intellectual property rights.
As a part of its support for Open Science, ZBW has made all metadata of the 20th Century Press Archives available under a CC0 license. In order to make the folders more easily accessible for business history research as well as for the general public, we have added links for every single folder to Wikidata. In addition to that, the metadata about companies and organizations, such as inception date or links to board members, has been added to the large amount of company data already available in Wikidata. This continues the PM20 data donation of ZBW to Wikidata, as described earlier for the persons archives and the countries/subjects archives. The activities were carried out - with notable help of volunteers - and documented in the WikiProject 20th Century Press Archives.
The mapping process to Wikidata items
Many of the PM20 company and organization folders deal with entities that already have items in Wikidata. Where GND identifiers had been assigned to these items, we directly created links to the PM20 companies with the same ID, and were done. Matching and linking to Wikidata items without the help of a unique identifier, however, posed some challenges. Unlike person names, company names change frequently, or are spelled differently at different times or in different languages. Not uncommonly, the entities themselves change through mergers and acquisitions, and may or may not have been represented by a new folder in PM20, or by a different item in Wikidata. Subsidiaries may be subsumed under the parent organization, or be separate entities. While it is relatively easy to split items in Wikidata, in the folders with printed newspaper clippings and reports it meant digging through sometimes hundreds of pages to single out a company retrospectively. So early decisions about the cutting and delimitation of folders often stuck for the following decades. All of that made it more difficult not only to obtain matches at all, but also to decide whether the same entity is indeed covered.
For the first matching approach, we used Wikidata's Mix-n-match (M-n-m) tool. In order to get manageable "buckets", we sliced the data according to the main language of the company location (German, English, French, Dutch and Other). In the M-n-m batches, we aimed at entities of type "organization" in that language. Although we also used the available aliases, we found relatively low matching rates (and a number of "false positives" among them).
After having worked through the M-n-m suggestions, we switched to another approach: For each segment, we created a list of search statements for all folders not already linked in Wikidata. For each entry, the company name was searched via Startpage (which in turn uses Google search), supplemented by "site:wikipedia.org". That searches all Wikipedia pages as full text, so slight differences in the spelling of company names did not matter. Wikipedia pages also turned up where the company name only occurred in some context, e.g. for the founder of the company or as part of a later merger. We could then select the correct Wikipedia page from the result list, follow the "Wikidata item" link and add the PM20 folder ID to the item. Another search link on the list searched "site:wikidata.org" for existing Wikidata items. (It turned out that for Wikidata, DuckDuckGo brought better results than Startpage/Google.)
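The search links described above could be generated along the following lines. This is only a minimal sketch: the function name `search_links` and the two URL patterns are assumptions for illustration (the post names the search engines, but not the exact URLs or the code that was used).

```python
from urllib.parse import urlencode

# Assumed search URL patterns; not taken from the original workflow.
STARTPAGE = "https://www.startpage.com/do/search"
DUCKDUCKGO = "https://duckduckgo.com/"


def search_links(company_name: str) -> dict[str, str]:
    """Build one Wikipedia and one Wikidata search link for a company name."""
    # Full-text search restricted to Wikipedia, so spelling variants still match.
    wikipedia_query = f"{company_name} site:wikipedia.org"
    # Second link: look for already existing Wikidata items.
    wikidata_query = f"{company_name} site:wikidata.org"
    return {
        "wikipedia": STARTPAGE + "?" + urlencode({"query": wikipedia_query}),
        "wikidata": DUCKDUCKGO + "?" + urlencode({"q": wikidata_query}),
    }


links = search_links("Steel Brothers & Company")
for target, url in links.items():
    print(target, url)
```

`urlencode` takes care of escaping the ampersand and the colon, so company names can be passed in unchanged.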
When no exactly matching item was found, we sometimes added the PM20 link with a "mapping relation type" of "related match", according to the perceived usefulness for later more detailed work.
The tedious work was facilitated by a second list, which contained statements for creating missing items immediately in Wikidata's QuickStatements (QS) tool. The statements included labels in different languages, descriptions, sometimes aliases, the official name (with the legal form), type(s) and often the GND ID, as in this example:
# Steel Brothers & Company {19}
CREATE
LAST|Lde|"Steel Brothers & Company"
LAST|Len|"Steel Brothers & Company"
LAST|Dde|"Unternehmen; Kolonialgesellschaft"
LAST|Den|"business; colonial society"
LAST|Ade|"W. Strang Steel & Co"
LAST|Aen|"W. Strang Steel & Co"
LAST|P4293|"co/068007"
LAST|P31|Q4830453|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P31|Q1700154|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P227|"2040532-7"|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P571|+1870-01-01T00:00:00Z/9|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
LAST|P1448|de:"Steel Brothers & Company, Ltd."|S248|Q36948990|S4293|"co/068007"|S1810|"Steel Brothers & Company, Ltd."|S813|+2021-08-10T00:00:00Z/11
Since both lists followed exactly the same order (primarily by descending number of documents, to put the most relevant companies on top) and were updated every hour, the workflow was easy: step through the list, search for existing items and link them, add the remaining entries one by one or in batches to Wikidata via QS, and repeat until all entries are linked and both lists are empty. As a result, 3897 PM20 folders could be linked to existing items, while 5085 items were created from scratch. (Query code for the search list, the insert list, and the conversion to QS statements is available.)
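To illustrate how statements like those in the example can be produced from folder metadata, here is a minimal sketch. It is not the conversion code referred to above: the `Company` class, the `quickstatements` function and the `retrieved` parameter are hypothetical, the "de" language tag for the official name is simply copied from the example, and escaping of quotation marks inside names is not handled.

```python
from dataclasses import dataclass, field
from typing import Optional

PM20_ITEM = "Q36948990"  # "stated in" (S248) value, taken from the example above


@dataclass
class Company:
    folder_id: str
    label: str
    official_name: str
    descriptions: dict[str, str]
    types: list[str]
    aliases: list[str] = field(default_factory=list)
    gnd_id: Optional[str] = None
    inception_year: Optional[int] = None


def quickstatements(c: Company, retrieved: str) -> list[str]:
    """Return QuickStatements lines for a new item; `retrieved` is YYYY-MM-DD."""
    # Reference block repeated on every sourced statement.
    ref = (f'S248|{PM20_ITEM}|S4293|"{c.folder_id}"'
           f'|S1810|"{c.official_name}"|S813|+{retrieved}T00:00:00Z/11')
    lines = ["CREATE"]
    lines += [f'LAST|L{lang}|"{c.label}"' for lang in ("de", "en")]
    lines += [f'LAST|D{lang}|"{text}"' for lang, text in c.descriptions.items()]
    for alias in c.aliases:
        lines += [f'LAST|A{lang}|"{alias}"' for lang in ("de", "en")]
    lines.append(f'LAST|P4293|"{c.folder_id}"')
    lines += [f"LAST|P31|{qid}|{ref}" for qid in c.types]
    if c.gnd_id:
        lines.append(f'LAST|P227|"{c.gnd_id}"|{ref}')
    if c.inception_year:
        # Precision 9 marks the date as a year.
        lines.append(f"LAST|P571|+{c.inception_year:04d}-01-01T00:00:00Z/9|{ref}")
    lines.append(f'LAST|P1448|de:"{c.official_name}"|{ref}')
    return lines


example = Company(
    folder_id="co/068007",
    label="Steel Brothers & Company",
    official_name="Steel Brothers & Company, Ltd.",
    descriptions={"de": "Unternehmen; Kolonialgesellschaft",
                  "en": "business; colonial society"},
    types=["Q4830453", "Q1700154"],
    aliases=["W. Strang Steel & Co"],
    gnd_id="2040532-7",
    inception_year=1870,
)
lines = quickstatements(example, "2021-08-10")
print("\n".join(lines))
```

Apart from the leading comment line, the output reproduces the statements of the example above.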
Enriching the metadata
After the mapping process had been finished, we added missing metadata from PM20 to all linked Wikidata items. That included country and headquarters location (having Geonames identifiers in PM20 helped a lot), inception and dissolution dates, links to predecessor and parent companies, and links to persons in their role as founders or board members.
Table of organization properties sourced in PM20:
PID | Property | Pre-existing Items | New Items | Total |
---|---|---|---|---|
P452 | industry | 5509 | 7682 | 13191 |
P17 | country | 769 | 5105 | 5874 |
P31 | instance of | 424 | 5371 | 5795 |
P1448 | official name | 94 | 5073 | 5167 |
P159 | headquarters location | 722 | 3542 | 4264 |
P571 | inception | 371 | 3800 | 4171 |
P227 | GND ID | 816 | 1708 | 2524 |
P355 | subsidiary | 764 | 191 | 955 |
P749 | parent organization | 384 | 567 | 951 |
P576 | dissolved, abolished or demolished date | 204 | 673 | 877 |
P156 | followed by | 331 | 424 | 755 |
P155 | follows | 538 | 209 | 747 |
P3320 | board member | 460 | 35 | 495 |
P112 | founded by | 78 | 22 | 100 |
P5052 | supervisory board member | 63 | 20 | 83 |
Source
Classification by industry
The PM20 companies archive was organized by industries, in two different ways: firstly, a custom classification was used for all folders, derived from an old version of the "economic sectors" part of the STW Thesaurus for Economics. Secondly, some of the folders were classified according to the European economic activities classification NACE Rev. 2.
Here, the approach was to map the custom classification to existing - and a few newly built - industry items in Wikidata (see mapping). This allowed us to fill the "industry" property of all linked Wikidata items with values derived from PM20. Additionally, further matching industries were derived from the "NACE code" property in Wikidata. Interestingly, this combined approach extended the coverage of company folders by NACE significantly - from 3,648 to 6,233.
Due to incompatibilities on the conceptual level, that could not be extended to all industries. To give an example: One of the most important industry sectors in Germany, "Metallindustrie" (metal industry), cannot be represented by a NACE class: "C24 Manufacture of basic metals" is strictly separate from "C25 Manufacture of fabricated metal products", while further processing of metals, e.g. into machinery and equipment, is assigned to still other classes.
Supplemented with "plain" Wikidata industries, it nevertheless proved possible to create a complete hierarchical list of companies with PM20 folders by NACE code, and in the absence of a NACE code, by Wikidata industry label.
(Wikidata query result with mouse-over labels)
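The fallback from NACE code to industry label can be sketched as follows. This is a flat approximation for illustration only, not the query behind the list: the function `group_by_industry`, the record layout and the company names are made up, and only the NACE classes C24 and C25 are taken from the text.

```python
from collections import defaultdict


def group_by_industry(companies):
    """Group company names by NACE code, or by industry label if no code exists."""
    groups = defaultdict(list)
    for company in companies:
        if company.get("nace"):
            key = ("nace", company["nace"])
        else:
            key = ("label", company["industry"])
        groups[key].append(company["name"])
    # NACE-coded groups come first, ordered by code; label-only groups follow.
    return sorted(groups.items(), key=lambda kv: (kv[0][0] != "nace", kv[0][1]))


# Hypothetical records; the names are placeholders, not PM20 folders.
sample = [
    {"name": "Example Metal Group", "nace": None, "industry": "metal industry"},
    {"name": "Example Tool Factory", "nace": "C25",
     "industry": "manufacture of fabricated metal products"},
    {"name": "Example Steelworks", "nace": "C24",
     "industry": "manufacture of basic metals"},
]
grouped = group_by_industry(sample)
for (kind, key), names in grouped:
    print(kind, key, names)
```

Because NACE codes sort hierarchically as strings, ordering by code keeps related classes next to each other.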
As a result of this data donation, the coverage of 20th century companies and organizations in Wikidata has improved considerably, both in breadth and depth. With the links to the digitized PM20 folders, about 1.2 million document pages have been made available from the corresponding items for FAIR use in research, education and public information.
29 January 2021
by Joachim Neubert
Data donation to Wikidata, part 2: country/subject dossiers of the 20th Century Press Archives
The world's largest public newspaper clippings archive comprises lots of material of great interest, particularly for authors and readers in the Wikiverse. ZBW has digitized the material from the first half of the last century, and has put all available metadata under a CC0 license. Moreover, we are donating that data to Wikidata, by adding or enhancing items and providing ways to access the dossiers (called "folders") and clippings easily from there.
Challenges of modelling a complex faceted classification in Wikidata
That had been done for the persons' archive in 2019 - see our prior blog post. For persons, we could just link from existing or a few newly created person items to the biographical folders of the archive. The countries/subjects archives provided a different challenge: The folders there were organized by countries (or continents, or cities in a few cases, or other geopolitical categories), and within the country, by an extended subject category system (available also as SKOS). To put it differently: Each folder was defined by a geo and a subject facet - a method widely used in general purpose press archives, because it allowed a comprehensible and, supported by a signature system, unambiguous sequential shelf order, indispensable for quick access to the printed material.
Folders specifically about one significant topic (like the Treaty of Sèvres) are rare in the press archives, whereas country/subject combinations are rare among Wikidata items - so direct linking between existing items and PM20 folders was hardly achievable. The folders themselves had to be represented as Wikidata items, just like other sources used there. Here, however, we did not have works or scientific articles, but thematic mini-collections of press clippings, often not notable in themselves and normally without further formal bibliographic data. So a class of PM20 country/subject folder was created (as a subclass of dossier, a collection of documents). Creating items for each folder, and linking them via PM20 folder ID (P4293) to the actual press archive folders, was however only part of the solution.
In order to represent the faceted structure of the archive, we needed anchor points for both facets. That was easy for the geographical categories: the vast majority of them already existed as items in Wikidata; only a few historical ones, such as Russian peripheral countries, had to be created. For the subject categories, the situation was quite different. Categories such as The country and its people, politics and economy, general or Postal services, telegraphy and telephony were constructed as baskets for collecting articles on certain broader topics. They do not have an equivalent in Wikidata, which tries to describe real world entities or clear-cut concepts. We therefore decided to represent the categories of the subject category system with their own items of type PM20 subject category. Each of the about 1400 categories is connected to the next higher one via a "part of" (P361) property, thus forming a five-level hierarchy.
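How a chain of "part of" (P361) links yields the level of a category can be shown with a small sketch. The category identifiers below are invented placeholders, not actual PM20 subject notations, and `path_to_root` is a hypothetical helper.

```python
def path_to_root(category: str, part_of: dict[str, str]) -> list[str]:
    """Follow "part of" links upwards and return the path from category to root."""
    path = [category]
    seen = {category}
    while path[-1] in part_of:
        parent = part_of[path[-1]]
        if parent in seen:
            # Guard against accidental loops in the data.
            raise ValueError(f"cycle in part-of chain at {parent}")
        path.append(parent)
        seen.add(parent)
    return path


# Placeholder identifiers: child category -> next higher category.
part_of = {
    "cat-n-1": "cat-n",
    "cat-n-1-2": "cat-n-1",
    "cat-n-1-2-3": "cat-n-1-2",
    "cat-n-1-2-3-4": "cat-n-1-2-3",
}

path = path_to_root("cat-n-1-2-3-4", part_of)
level = len(path)  # 1 for a top-level category, 5 for the deepest level
print(path, level)
```

With one parent per category, the number of steps to the top directly gives the level within the hierarchy.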
More implementation subtleties
For both facets, corresponding Wikidata properties were created as "PM20 geo code" (P8483) and "PM20 subject code" (P8484). As external identifiers, they link directly to lists of subjects (e.g., for Japan) or geographical entities (e.g., for The country ..., general). For all countries where the press archives material has been processed - this includes the tedious task of clarifying the intellectual property rights status of each article - the Wikidata item for the country now includes a link to a list of all press archives dossiers about this country, covering the first half of the 20th century.
The folders represented in Wikidata (e.g., Japan : The country ..., general) use "facet of" (P1269) and "main subject" (P921) properties to connect to the items for the country and subject categories. Thus, each of the 9,200 accessible folders of the PM20 country/subject archive can now be reached via Wikidata. Moreover, since the structural metadata of PM20 is available, too, it can be queried in its various dimensions - see for example the list of top level subject categories with the number of folders and documents, or a list of folders per country, ordered by signature (with subtleties covered by a "series ordinal" (P1545) qualifier). The interactive map of subject folders as shown above is also created by a SPARQL query, and gives a first impression of the geographical areas covered in depth - or as yet only sparsely - in the online archive.
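The two-facet model can be mimicked in memory to show what such queries do. This is an illustrative sketch, not the SPARQL used for the lists: the `Folder` class, the folder IDs and the ordinals are made up, and the assignment of P1269 to the country and P921 to the subject category is an assumption read from the order in the sentence above.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Folder:
    folder_id: str  # placeholder for a PM20 folder ID (P4293) value
    country: str    # geo facet, assumed target of "facet of" (P1269)
    subject: str    # subject facet, assumed target of "main subject" (P921)
    ordinal: str    # "series ordinal" (P1545) value giving the shelf order


def natural_key(text: str) -> list:
    """Sort key that compares digit runs as numbers ("2" before "10")."""
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", text)]


def folders_for_country(folders, country):
    """All folders of one country, in shelf order."""
    hits = [f for f in folders if f.country == country]
    return sorted(hits, key=lambda f: natural_key(f.ordinal))


def folders_per_subject(folders):
    """Number of folders per subject category."""
    return Counter(f.subject for f in folders)


sample = [
    Folder("example-1", "Japan", "Postal services, telegraphy and telephony", "10"),
    Folder("example-2", "Japan", "The country ..., general", "2"),
    Folder("example-3", "Argentina", "The country ..., general", "2"),
]
japan = folders_for_country(sample, "Japan")
per_subject = folders_per_subject(sample)
print([f.folder_id for f in japan], per_subject)
```

Sorting by a natural key rather than by plain string comparison keeps an ordinal such as "2" ahead of "10".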
Core areas: worldwide economy, worldwide colonialism
The online data reveals core areas of attention during 40 years of press clippings collection until 1949. Economy, of course, was the focus of the former HWWA (Hamburg Archive for the International Economy), in Germany and notably Hamburg, as well as in every other country. More than half of all subject categories are part of the n Economy section of the category system and give, in 4,500 folders, very detailed access to the field. About 100,000 of the almost 270,000 online documents of the archive are part of this section, followed by history and general politics, foreign policy, and public finance, down to more peripheral topics like settling and migration, minorities, justice or literature. Owing to the history of the institution (which was founded as "Zentralstelle des Hamburgischen Kolonialinstituts", the central office of the Hamburg colonial institute), colonial efforts all over the world were monitored closely. We published with priority the material about the former German colonies, listed in the Archivführer Deutsche Kolonialgeschichte (Archive guide to the German Colonial Past, also interconnected to Wikidata). Originally collected to support the aggressive and inhuman policy of the German Empire, it is now available to serve as research material for critical analysis in the emerging field of colonial and postcolonial studies.
Enabling future community efforts
While all material about the German colonies (and some about the Italian ones) is online, and now accessible via Wikidata, this is not true for the former British/French/Dutch/Belgian colonies. While Japan or Argentina are accessible completely, China, India or the US are missing, as well as most of the European countries. And while 800+ folders about Hamburg cover its contemporary history quite well, the vast majority of the material about Germany as a whole is only accessible "on premises" within ZBW's locations. It is, however, available as digital images, and can be accessed through finding aids (in German), which in the reading rooms directly link to a document viewer.
The metadata for this material is now open data and can be changed and enhanced in Wikidata. A very selective example of how that could work is a topic in German-Danish history - the 1920 Schleswig plebiscites. The PM20 folder about these events was not part of the published material, but attracted some interest with last year's centenary. The PM20 metadata on Wikidata made it possible to create a corresponding folder completely in Wikidata, Nordslesvig : Historical events, with a (provisional) link to a stretch of images on a digitized film. While the checking and activation of these images for the public was a one-time effort in the context of an open science event, the creation of a new PM20 folder on Wikidata may demonstrate how open metadata can be used by a dedicated community of knowledge to enable access to not-yet-open knowledge. Current intellectual property law in the EU forbids open access to digitized clippings from newspapers published in 1960 until 2031, and to all clippings where the death date of a named author is not known until after 2100. Of course, we hope for a change in that obstructive legislation in the not-too-distant future.
We are confident that the metadata about the material, now in Wikidata, will help bridge the gap until it will finally be possible to use all digitized press archives contents as open scientific and educational resources, within and outside of the Wikimedia projects.
More information at WikiProject 20th Century Press Archives, which links also to the code for creating this data donation.
7 December 2020
by Joachim Neubert
Building the SWIB20 participants map
Here we describe the process of building the interactive SWIB20 participants map, created by a query to Wikidata. The map was intended to help participants of SWIB20 make contacts in the virtual conference space. However, in compliance with GDPR, we wanted to avoid publishing personal details. So we chose to publish a map of the institutions to which the participants are affiliated. (Obvious downside: the 9 unaffiliated participants could not be represented on the map.)
We suppose that the method can be applied to other conferences and other use cases - e.g., the downloaders of scientific software or the institutions subscribed to an academic journal. Therefore, we describe the process in some detail.
1. We started with a list of institution names (with country code and city, but without person IDs), extracted and transformed from our ConfTool registration system, and saved it in CSV format. Country names were normalized, cities were not (and were only used for context information).
2. We created an OpenRefine project, and reconciled the institution name column with Wikidata items of type Q43229 (organization, and all its subtypes). We included the country column (-> P17, country) as relevant other detail, and let OpenRefine “Auto-match candidates with high confidence”. Of our original set of 335 country/institution entries, 193 were automatically matched via the Wikidata reconciliation service. At the end of the conference, 400 institutions were identified and put on the map (data set).
3. We went through all unmatched entries and either
   a) selected one of the suggested items, or
   b) looked up and tweaked the name string in Wikidata, or in Google, until we found a corresponding Wikipedia page, opened the linked Wikidata object from there, and inserted the QID in OpenRefine, or
   c) created a new Wikidata item (if the institution seemed notable), or
   d) attached “not yet determined” (Q59496158) where no Wikidata item (yet) exists, or
   e) attached “undefined value” (Q7883029) where no institution had been given.
4. The results were exported from OpenRefine into a .tsv file (settings).
5. Again via a script, we loaded ConfTool participants data, built a lookup table from all available OpenRefine results (country/name string -> WD item QID), aggregated participant counts per QID, and loaded that data into a custom SPARQL endpoint, which is accessible from the Wikidata Query Service. As in step 1, a .csv file was produced for all (new) institution name strings which were not yet mapped to Wikidata. (An additional remark: If no approved custom SPARQL endpoint is available, it is feasible to generate a static query with all data in its “values” clause.)
6. During the preparation of the conference, more and more participants registered, which required multiple loops: Use the csv file of step 5 and re-iterate, starting at step 2. (Since I found no straightforward way to update an existing OpenRefine project with extended data, I created a new project with new input and output files for every iteration.)
7. Finally, to display the map we could run a federated query on WDQS. It fetches the institution items from the custom endpoint and enriches them from Wikidata with name, logo and image of the institution (if present), as well as with geographic coordinates, obtained directly or indirectly as follows:
   a) item has “coordinate location” (P625) itself, or
   b) item has “headquarters location” item with coordinates (P159/P625), or
   c) item has “located in administrative entity” item with coordinates (P131/P625), or
   d) item has “country” item (P17/P625)
Applying this method, only one institution item could not be located on the map.
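The lookup and aggregation of step 5, the static “values” clause mentioned there, and the coordinate fallback of step 7 can be sketched as below. This is not the script used for the map: all function names, the sample registrations and the variable names in the generated clause are assumptions, and Q43229 (organization) merely stands in for a real institution item.

```python
from collections import Counter

# Property paths tried in this order when locating an institution (step 7).
COORDINATE_PATHS = ("P625", "P159/P625", "P131/P625", "P17/P625")


def aggregate(registrations, lookup):
    """Count participants per QID; collect name strings that are not yet mapped."""
    counts = Counter()
    unmapped = set()
    for country, institution in registrations:
        qid = lookup.get((country, institution))
        if qid is None:
            unmapped.add((country, institution))
        else:
            counts[qid] += 1
    return counts, sorted(unmapped)


def values_clause(counts):
    """Render the counts as a SPARQL VALUES clause for a static query."""
    rows = " ".join(f"(wd:{qid} {n})" for qid, n in sorted(counts.items()))
    return f"VALUES (?item ?participants) {{ {rows} }}"


def first_coordinates(found):
    """Return the first available coordinate value along the fallback chain."""
    for path in COORDINATE_PATHS:
        if found.get(path):
            return found[path]
    return None


# Hypothetical data: (country, institution name) -> QID from OpenRefine.
lookup = {
    ("DE", "Example Library"): "Q43229",
    ("KE", "Example Office"): "Q59496158",
}
registrations = [
    ("DE", "Example Library"),
    ("DE", "Example Library"),
    ("KE", "Example Office"),
    ("FR", "Unknown Institute"),
]
counts, unmapped = aggregate(registrations, lookup)
clause = values_clause(counts)
print(clause)
print(unmapped)
```

The list of unmapped strings corresponds to the .csv file that feeds the next OpenRefine round.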
Data improvements
The way to improve the map was to improve the data about the items in Wikidata - which also helps all future Wikidata users.
New items
For a few institutions, new items were created:
- Burundi Association of Librarians, Archivists and Documentalists
- FAO representation in Kenya
- Aurora Information Technology
- Istituto di Informatica Giuridica e Sistemi Giudiziari
For another 14 institutions, mostly private companies, no items were created due to notability concerns. Everything else already had an item in Wikidata!
Improvement of existing items
In order to improve the display on the map, we enhanced selected items in Wikidata in various ways:
- Add English label
- Add type (instance of)
- Add headquarter location
- Add image and/or logo
We also hope that participants of the conference took the opportunity to make their institution “look better”, for example by adding an image of it to the Wikidata knowledge base.
Putting Wikidata into use for a completely custom purpose thus created incentives for improving “the sum of all human knowledge” step by tiny step.
24 October 2019
by Joachim Neubert
20th Century Press Archives: Data donation to Wikidata