Pl4net.info

Library voices. Independent, daily.

16 November 2020
by Timo Borst

Journal Map: developing an open environment for accessing and analyzing performance indicators from journals in economics

Introduction: Bibliometrics, scientometrics, informetrics and webometrics have been both research topics and practical guidelines for publishing, reading, citing, measuring and acquiring published research for a while (Hood 2001). Citation databases and...

3 June 2016
by Timo Borst

The PM20 commodities/wares archive: part 4 of the data donation to Wikidata

After the digitized material of the persons, countries/subjects and companies archives of the 20th Century Press Archives had been made available via Wikidata, the last part, from the wares archive, has now been added. This wares archive is about products ...

3 June 2016
by Timo Borst

Integrating a Research Data Repository with established research practices

Authors: Timo Borst, Konstantin Ott

In recent years, repositories for managing research data have emerged which are supposed to help researchers upload, describe, distribute and share their data. To promote and foster the distribution of research data in the light of paradigms like Open Science and Open Access, these repositories are normally implemented and hosted as stand-alone applications, meaning that they offer a web interface for manually uploading the data, and a presentation interface for browsing, searching and accessing the data. Sometimes the first component (the interface for uploading the data) is substituted or complemented by a submission interface from another application; in Dataverse or CKAN, for example, data is submitted from remote third-party applications by means of data deposit APIs [1]. However the upload of data is organized and embedded into a publishing framework (data either as a supplement to a journal article, or as a stand-alone research output subject to review and release as part of a 'data journal'), it ultimately means that this data is intended to be made publicly available, which is often reflected in policies and guidelines for data deposit.
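
For illustration, the following sketch shows how such a remote deposit could look against CKAN's action API; the host name, organization and dataset fields are placeholders, not part of the SowiDataNet setup. Dataverse's deposit API is SWORD-based but follows the same third-party submission pattern.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Minimal sketch of a remote deposit into a CKAN instance via its action API
 * (package_create), as one example of the data deposit APIs mentioned above.
 * Host, API key and dataset fields are placeholders; a real deposit would also
 * attach the data files (resources) in a follow-up call.
 */
public class CkanDepositExample {
    public static void main(String[] args) throws Exception {
        String ckanUrl = "https://ckan.example.org/api/3/action/package_create"; // hypothetical host
        String apiKey = System.getenv().getOrDefault("CKAN_API_KEY", "<api-key>"); // never hard-code credentials

        String payload = """
            {
              "name": "survey-2016-wave-1",
              "title": "Survey 2016, Wave 1",
              "notes": "Quantitative survey data, deposited from a third-party application.",
              "owner_org": "my-research-institute"
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ckanUrl))
                .header("Authorization", apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```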

In clear contrast to this publishing model, the vast majority of current research data is, however, not supposed to be published, at least not in the sense of scientific publications. Several studies and surveys on research data management indicate that, at least in the social sciences, there is a strong tendency and practice to process and share data amongst peers in a local and protected environment (often with several local copies on different personal devices) before eventually uploading and disseminating derivatives of this data to a publicly accessible repository. According to a survey among Austrian researchers, for example, the portion of researchers agreeing to share their data on request or among colleagues is 57% and 53% respectively, while only 28% agree to share via a disciplinary repository [2]. In another survey among researchers from a local university and a cooperation partner, almost 70% preferred an institutional local archive, while only 10% agreed to a national or international archive. Even if data is planned to be published via a publicly accessible repository, it will first be stored and processed in a protected environment, carefully shared with peers (project members, institutional colleagues, sponsors) and often subject to access restrictions; in other words, it is used before being published.

With this situation in mind, we designed and developed a central research data repository as part of a funded project called 'SowiDataNet' (SDN, a network of data from the social sciences and economics) [3]. The overall goal of the project is to develop and establish a national web infrastructure for archiving and managing research data in the social sciences, particularly quantitative (statistical) data from surveys. It is aimed at smaller institutional research groups or teams, which often lack institutional support or an infrastructure for managing their research data. As a front-end application, the repository, based on the DSpace software, provides a typical web interface for browsing, searching and accessing the content. As a back-end application, it provides typical forms for capturing metadata and bitstreams, with some enhancements regarding the integration of authority control by means of external web services. From the point of view of the participating research institutions, a central requirement is the development of a local view ('showcase') on the repository's data, so that this view can be smoothly integrated into the website of the institution. The web interface of the view is generated by means of the Play Framework in combination with the Bootstrap framework for the layout, while all of the data is retrieved from the DSpace back end via its Discovery interface and REST API.
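
As a rough sketch of this retrieval path, the showcase could request an institutional collection from the DSpace REST API roughly as follows; the base URL and collection id are placeholders, and the exact endpoints and parameters depend on the DSpace version in use.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Minimal sketch of how a showcase front end might pull an institutional
 * subset of records from the central DSpace repository over its REST API.
 * Base URL and collection id are placeholders; endpoint and parameter names
 * may differ between DSpace versions.
 */
public class ShowcaseFeed {
    public static void main(String[] args) throws Exception {
        String baseUrl = "https://sowidatanet.example.org/rest";   // hypothetical back end
        String collectionId = "123";                               // institutional collection

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/collections/" + collectionId
                        + "/items?expand=metadata&limit=20&offset=0"))
                .header("Accept", "application/json")
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON payload would normally be mapped to view models and rendered
        // by the Play/Bootstrap templates of the institutional showcase.
        System.out.println(response.body());
    }
}
```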

Diagram: SowiDataNet software components (SDN architecture)

The purpose of the showcase application is to provide an institutional subset and view of the central repository's data, which can easily be integrated into any institutional website: either as an iFrame to be embedded by the institution (an easy rather than a satisfactory technical solution), or as a stand-alone subpage linked from the institution's homepage, optionally using a proxy server to preserve the institutional domain namespace. While these solutions imply the standard way of hosting the showcase software, a third approach is to deploy the showcase software on an institution's own server in order to customize the application. In this case, every institution can modify the layout of its institutional view by customizing its institutional CSS file. Since the CSS is compiled from LESS in combination with Bootstrap, a lightweight option is to modify only a few LESS variables and compile them into an institutional CSS file.

As a result of the requirements analysis conducted with the project partners (two research institutes from the social sciences), and in accordance with the survey results cited above, there is a strong demand for managing not only data which is to be published in the central repository, but also data which is protected and circulates only among the members of the institution. Moreover, this data is described by additional specific metadata containing internal notes on availability restrictions and access conditions. Hence, we had to distinguish between the following two basic use cases to be covered by the showcase:

  • To provide a view of the public SDN data (‘data published’)
  • To provide a view of the public SDN data plus the internal institutional data or, at least, their corresponding metadata records, the latter only visible and accessible to institutional members (‘data in use’)


From the perspective of a research institution and data provider, the second use case turned out to be the primary one, since it reflects institutional practices and workflows better than the publishing model does. As a matter of fact, research data is primarily generated, processed and shared in a protected environment before it may eventually be published and distributed to a wider, potentially abstract and unknown community, and this fact must be acknowledged and reflected by a central research data repository aiming at contributions from researchers who are affiliated with an institution.

If ‘data in use’ is to be integrated into the showcase as an internal view on protected data to be shared only within an institution, access to this data has to be restricted on several levels. First, for every community (in the sense of an institution), we introduce a DSpace collection for just those internal data and protect it by assigning it to a DSpace user role ‘internal[COMMUNITY_NAME]’. This role is associated with an IP range, so that only requests from that range will be assigned to the role ‘internal’ and granted access to the internal collection. In the context of our project, we enter only the IP of the showcase application, so that every user of this application will see the protected items. Depending on where the showcase application or server is located, further steps are necessary: if the application or server is located within the institution's intranet, the protected items are only visible and accessible from the institution's network. If the application is externally hosted and accessible via the World Wide Web (which is expected to be the default solution for most of the research institutes), the showcase application needs an authentication procedure, preferably realized by means of the central DSpace SowiDataNet repository, so that every user of the showcase application is granted access by becoming a DSpace user.
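
A minimal sketch of that second variant, assuming the DSpace 5.x REST API (endpoint and header names may differ in other versions; URLs and the collection id are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Sketch of an externally hosted showcase authenticating its users against
 * the central DSpace repository, so that protected ("data in use") items
 * become visible to them. Endpoint and header names follow the DSpace 5.x
 * REST API and are assumptions; they may differ in other versions.
 */
public class ShowcaseLogin {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    private static final String BASE_URL = "https://sowidatanet.example.org/rest"; // hypothetical

    /** Logs a showcase user in and returns the repository's session token. */
    static String login(String email, String password) throws Exception {
        String body = "{\"email\":\"" + email + "\",\"password\":\"" + password + "\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + "/login"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Requests the protected institutional collection on behalf of that user. */
    static String internalItems(String token, String collectionId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + "/collections/" + collectionId + "/items"))
                .header("rest-dspace-token", token)
                .header("Accept", "application/json")
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```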

In the context of an R&D project in which we are partnering with research institutes, it turned out that the management of research data is twofold: while repository providers are focused on publishing and unrestricted access to research data, researchers are mainly interested in locally archiving and sharing their data. In order to manage this data, the researchers' institutional practices need to be reflected and supported, and for this purpose we developed an additional viewing and access component. When it comes to their integration with existing institutional research practices and workflows, the implementation of research data repositories requires concepts and actions which go far beyond the original idea of a central publishing platform. Further research and development is planned in order to better understand and support the sharing of data in both institutional and cross-institutional subgroups, so that the integration with a public central repository will be fostered.

Link to prototype

References

[1] Dataverse Deposit-API. Retrieved 24 May 2016, from http://guides.dataverse.org/en/3.6.2/dataverse-api-main.html#data-deposit-api
[2] Forschende und ihre Daten. Ergebnisse einer österreichweiten Befragung – Report 2015. Version 1.2 - Zenodo. (2015). Retrieved 24 May 2016, from https://zenodo.org/record/32043#.VrhmKEa5pmM
[3] Project homepage: https://sowidatanet.de/. Retrieved 24 May 2016.
[4] Research data management survey: report - Nottingham ePrints. (2013). Retrieved 24 May 2016, from http://eprints.nottingham.ac.uk/1893/
[5] University of Oxford Research Data Management Survey 2012 : The Results | DaMaRO. (2012). Retrieved 24 May 2016, from https://blogs.it.ox.ac.uk/damaro/2013/01/03/university-of-oxford-research-data-management-survey-2012-the-results/

3 June 2016
by Timo Borst

Building the SWIB20 participants map

  Here we describe the process of building the interactive SWIB20 participants map, created by a query to Wikidata. The map was intended to support participants of SWIB20 to make contacts in the virtual conference space. However, in compliance with GD...

3 June 2016
by Timo Borst

Content recommendation by means of EEXCESS

Authors: Timo Borst, Nils Witt

Since their beginnings, libraries and related cultural institutions could be confident that users had to visit them in order to search, find and access their content. With the emergence and massive use of the World Wide Web and associated tools and technologies, this situation has changed drastically: if those institutions still want their content to be found and used, they must adapt themselves to the environments in which users expect digital content to be available. Against this background, the general approach of the EEXCESS project is to ‘inject’ digital content (both metadata and object files) into users' daily environments, such as browsers, authoring environments like content management systems or Google Docs, and e-learning environments. Content is not just provided, but recommended, by means of an organizational and technical framework of distributed partner recommenders and user profiles. Once a content partner has connected to this framework by establishing an Application Programming Interface (API) for constantly responding to EEXCESS queries, its results will be listed and merged with the results of the other partners. Depending on the software component installed either on a user's local machine or on an application server, the list of recommendations is displayed in different ways: from a classical, text-oriented list to a visualization of metadata records.

The Recommender

The EEXCESS architecture comprises three major components: a privacy-preserving proxy; multiple client-side tools for the Chrome browser, WordPress, Google Docs and more; and the central server-side component responsible for generating recommendations, called the recommender. Covering all of these components in detail is beyond the scope of this blog post. Instead, we want to focus on one component, the federated recommender, as it is the heart of the EEXCESS infrastructure.

The recommender’s task is to generate a list of objects like text documents, images and videos (hereafter called documents, for brevity’s sake) in response to a given query. The list is supposed to contain only documents relevant to the user. Moreover, the list should be ordered by descending relevance. To generate such a list, the recommender can pick documents from the content providers that participate in the EEXCESS infrastructure. Technically speaking, but somewhat oversimplified: the recommender receives a query and forwards it to all content provider systems (like EconBiz, Europeana, Mendeley and others). After receiving results from each content provider, the recommender decides in which order documents will be recommended and returns the list to the user who submitted the query.
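
A simplified sketch of this fan-out step (the interface and provider names are illustrative only, not the actual EEXCESS code):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

/**
 * Simplified illustration of the fan-out step: the recommender forwards the
 * query to all (selected) partner systems in parallel and collects their
 * result lists before merging and re-ranking them.
 */
public class FederatedQuery {

    /** A partner recommender wrapping one content provider's search API. */
    interface PartnerRecommender {
        String name();
        List<String> search(String query); // returns document ids, ordered by relevance
    }

    static Map<String, List<String>> fanOut(String query, List<PartnerRecommender> partners) {
        // Query every partner asynchronously so that one slow provider
        // does not block the others.
        Map<String, CompletableFuture<List<String>>> pending = partners.stream()
                .collect(Collectors.toMap(
                        PartnerRecommender::name,
                        p -> CompletableFuture.supplyAsync(() -> p.search(query))));

        // Gather the per-provider result lists; the merging and re-ranking
        // steps described below operate on this map.
        return pending.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().join()));
    }
}
```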

This raises some questions. How can we find relevant documents? The result lists from the content providers are already sorted by relevance; how can we merge them? Can we deal with ambiguity and duplicates? Can we respond within a reasonable time? Can we handle the technical disparities between the different content provider systems? How can we integrate the different document types? In the following, we describe how we tackled some of these questions by explaining in more detail how the recommender compiles the recommendation lists.

Recommendation process

If users wish to obtain personalized recommendations, they can create a local profile (i.e. one stored only on the user's device), in which they can specify their education, age, fields of interest and location. To be clear: this is optional. If the profile is used, the Privacy Proxy [4] takes care of anonymizing the personal information. The overall process of creating personalized recommendations is depicted in the figure and will be described in the following.

After the user has sent a query along with her user profile, a process called Source Selection is triggered. Based on the user's preferences, the Source Selection decides which partner systems will be queried. The reason for this is that most content providers cover only a specific discipline (see figure). For instance, queries from a user who is only interested in biology and chemistry will never receive EconBiz recommendations, whereas a query from a user merely interested in politics and money will get EconBiz recommendations (at present; this may change when other content providers participate). Thereby, Source Selection lowers the network traffic and the latency of the overall process and increases the precision of the results, at the expense of missing results and reduced diversity. Optionally, the user can also select the sources manually.
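
An illustrative sketch of this selection step (the provider/discipline assignments are examples, not the project's actual configuration):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Illustrative sketch of the Source Selection step: only partner systems
 * whose disciplines overlap with the user's declared interests are queried.
 */
public class SourceSelection {

    record Provider(String name, Set<String> disciplines) {}

    static List<Provider> select(Set<String> userInterests, List<Provider> allProviders) {
        return allProviders.stream()
                .filter(p -> p.disciplines().stream().anyMatch(userInterests::contains))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example discipline assignments, chosen only for this sketch.
        List<Provider> providers = List.of(
                new Provider("EconBiz", Set.of("economics", "business")),
                new Provider("Europeana", Set.of("cultural heritage", "history")),
                new Provider("Mendeley", Set.of("biology", "chemistry", "economics")));

        // A user whose interests map to "economics" would get EconBiz and
        // Mendeley recommendations, but no Europeana results.
        System.out.println(select(Set.of("economics"), providers));
    }
}
```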

The subsequent Query Processing step alters the query:

  • Short queries are expanded using Wikipedia knowledge
  • Long queries are split into smaller queries, which are then handled separately (See [1] for more details).

 The queries from the Query Processing step are then used to query the content providers selected during the Source Selection step. With the results from the content providers, two post processing steps are carried out to generate the personalized recommendations:

  • Result Processing: The purpose of the Result Processing is to detect duplicates. A technique called fuzzy hashing is used for this purpose: the words that make up a result list's entry are sorted, counted and condensed with the MD5 hash algorithm [2], which allows convenient comparison.
  • Result Ranking: After the duplicates have been removed, the results are re-ranked. To do so, a slightly modified version of the round robin method is used. Where vanilla round robin would just concatenate slices of the result lists (i.e. the first two documents from list A + the first two documents from list B + …), Weighted Round Robin modifies this behavior by taking the overlap between the query and the result's metadata into account. That is, before merging the lists, each individual list is modified: documents whose metadata show a high accordance with the query are promoted. A sketch of both steps follows this list.
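
The following sketch outlines both post-processing steps as described above; the weighting used in EEXCESS is simplified here for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;
import java.util.stream.Collectors;

/**
 * Sketch of the two post-processing steps: a "fuzzy" duplicate key built from
 * the sorted words of a result entry (condensed with MD5), and a weighted
 * round-robin merge that promotes documents whose metadata overlap with the
 * query. Weighting details are simplified for illustration.
 */
public class ResultPostProcessing {

    /** Fuzzy hash: sort the words of a result entry and hash the normalized form. */
    static String fuzzyHash(String entryText) throws Exception {
        String normalized = Arrays.stream(entryText.toLowerCase().split("\\W+"))
                .sorted()
                .collect(Collectors.joining(" "));
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(normalized.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest);
    }

    /** Promote documents whose metadata share terms with the query, then merge the lists. */
    static List<String> weightedMerge(String query, List<List<String>> providerLists) throws Exception {
        Set<String> queryTerms = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")));
        Set<String> seen = new HashSet<>();
        List<String> merged = new ArrayList<>();

        // Re-order each provider list by query/metadata overlap before merging.
        List<Deque<String>> reordered = new ArrayList<>();
        for (List<String> list : providerLists) {
            reordered.add(list.stream()
                    .sorted(Comparator.comparingLong((String doc) ->
                            Arrays.stream(doc.toLowerCase().split("\\W+"))
                                    .filter(queryTerms::contains).count()).reversed())
                    .collect(Collectors.toCollection(ArrayDeque::new)));
        }

        // Round-robin over the re-ordered lists, skipping fuzzy-hash duplicates.
        boolean remaining = true;
        while (remaining) {
            remaining = false;
            for (Deque<String> list : reordered) {
                if (!list.isEmpty()) {
                    remaining = true;
                    String doc = list.poll();
                    if (seen.add(fuzzyHash(doc))) {
                        merged.add(doc);
                    }
                }
            }
        }
        return merged;
    }
}
```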

Partner Wizard

As the quality of the recommended documents increases with the number and diversity of the participating content providers, a component called Partner Wizard was implemented. Its goal is to simplify the integration of new content providers to a level at which non-experts can manage the process without any support from the EEXCESS consortium. This is achieved by a semi-automatic process triggered from a web front end provided by the EEXCESS consortium. Given a search API, it is relatively easy to obtain search results, but the main point is to obtain results that are meaningful and relevant to the user. Since every search service behaves differently, there is no point in treating all services equally; some sort of customization is needed. That is where the Partner Wizard comes into play. It allows an employee of the new content provider to specify the search API. Afterwards, the wizard submits pre-assembled pairs of search queries to the new service. Each pair is similar but not identical, for example:

  • Query 1: <TERM1> OR <TERM2>
  • Query 2: <TERM1> AND <TERM2>.

The result lists generated in this way are presented to the user, who has to decide which list contains the more relevant results and suits the query better (see figure). Finally, based on the previous steps, a configuration file is generated for the federated recommender, whereupon the recommender mimics the behavior that was previously selected. The wizard can be completed within a few minutes, and it only requires a publicly available search API.

The project started with five initial content providers. Now, thanks to the Partner Wizard, there are more than ten content providers, and negotiations with further candidates are ongoing. Since there are almost no technical issues anymore, legal issues dominate the consultations. As all software developed within the EEXCESS project is published under open source conditions, the Partner Wizard can be found at [3].

Conclusions

The EEXCESS project is about injecting distributed content from different cultural and scientific domains into everyday user environments, so that this content becomes more visible and more accessible. To achieve this goal and to establish a network of distributed content providers, some specification and engineering of software has to be done, apart from the various organizational, conceptual and legal aspects, and not only once, but also with respect to maintaining the technical components. One of the main goals of the project is to establish a community of networked information systems, with a lightweight approach to joining this network by easily setting up a local partner recommender. Future work will focus on this growing network and on the increasing requirements of integrating heterogeneous content via central processing of recommendations.

24 May 2016
by Timo Borst

Journal Map: developing an open environment for accessing and analyzing performance indicators from journals in economics

by Franz Osorio, Timo Borst. Introduction: Bibliometrics, scientometrics, informetrics and webometrics have been both research topics and practical guidelines for publishing, reading, citing, measuring and acquiring published research for a while (Hood 2...

17 May 2016
by Timo Borst

ZBW’s contribution to „Coding da Vinci“: Dossiers about persons and companies from 20th Century Press Archives

On 27 and 28 October, the kick-off for the "Kultur-Hackathon" Coding da Vinci is held in Mainz, Germany, organized this time by GLAM institutions from the Rhein-Main area: "For five weeks, devoted fans of culture and hacking alike will prototype...

17 May 2016
by Timo Borst

Update: New dump of EconStor Linked Dataset available

We are happy to announce that we have updated our EconStor LOD dump. This dataset now comprises 108k metadata records provided with Semantic Web URIs and partially linked to external datasets like STW and JEL. The dataset is available at http://linkedda...