INTEGRATING WIKIDATA INTO AN AUDIOVISUAL ARCHIVE
Sound and Vision has a large thesaurus containing more than half a million entities. This thesaurus, the Common Thesaurus for Audiovisual Archives (Gemeenschappelijke Thesaurus voor Audiovisuele Archieven, GTAA), has grown over time and has become more difficult to manage internally. Research showed that external, publicly available sources held more complete information about the entities covered in the thesaurus, so Sound and Vision set itself the goal of reusing more of this knowledge and reducing the maintenance work needed for the GTAA. In a first project, Jesse and his colleagues decided to link the terms in the Sound and Vision thesaurus to the corresponding items in Wikidata, the open, structured knowledge base that supports Wikipedia.
In an initial attempt, the 137,000 personal names contained in the GTAA were uploaded to the Wikidata Mix'n'match tool. This tool automatically suggests matching Wikidata items for the items in an uploaded dataset and then lets users confirm or reject these suggestions. However, in the vast majority of cases it proved impossible to make a match based solely on the personal name and the limited additional information that the GTAA occasionally contains (such as a person's occupation). A matching Wikidata item was suggested for only 10,000 personal names. Over the course of three years, 8,000 of these were confirmed by the community of Sound and Vision employees and Wikipedians.
It was therefore decided to use the Sound and Vision catalogue as a source of additional information about the people in the GTAA. After all, GTAA terms are used to describe the items in the catalogue. The idea was to determine where personal names are used in the catalogue and to extract other, hopefully related, terms in their immediate vicinity. The personal names and the additional terms would then be used together to find matching Wikidata persons, based on all the information contained in the catalogue entry. Spinque Desk was used to put this idea into practice.
Based on this approach, over 45,000 matches were suggested automatically, and 26,000 of these have since been accepted by the community. The additional context also makes it easier to match items manually. Using the approved matches, Sound and Vision can improve and enrich the data on Dutch media history in Wikidata and, conversely, enrich the data in its own catalogue. For example, based on a maker's date of death, it can automatically determine when a work enters the public domain. And based on a person's date of birth, sex and occupation, it can let researchers run advanced queries over the collection, for example to return all TV programmes featuring female politicians born in the 1970s.
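The two uses of enriched person data mentioned above can be illustrated with a small Python sketch. It is an illustration only, not Sound and Vision's or Spinque's implementation: the Person record, the example people and the function names are hypothetical, and the 70-year protection term (ending at the close of the calendar year of death plus 70) is an assumption that has to be checked against the applicable copyright law.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Assumption: protection lasts 70 years after the maker's death and ends
# at the close of that calendar year. Adjust for the applicable law.
PROTECTION_YEARS = 70


@dataclass
class Person:
    """Hypothetical record holding the fields imported from Wikidata."""
    name: str
    sex: str
    occupation: str
    born: date
    died: Optional[date] = None


def public_domain_year(person: Person) -> Optional[int]:
    """Return the first year in which the maker's work is free, if known."""
    if person.died is None:
        return None  # maker still alive or date of death unknown
    return person.died.year + PROTECTION_YEARS + 1


def select(persons: List[Person], sex: str, occupation: str,
           decade_start: int) -> List[str]:
    """Return names of persons with the given sex and occupation,
    born in the ten years starting at decade_start."""
    return [
        p.name
        for p in persons
        if p.sex == sex
        and p.occupation == occupation
        and decade_start <= p.born.year < decade_start + 10
    ]


# Invented example data, not real catalogue or Wikidata records.
PEOPLE = [
    Person("Person A", "female", "politician", date(1974, 5, 2)),
    Person("Person B", "male", "politician", date(1971, 1, 9)),
    Person("Person C", "female", "politician", date(1965, 3, 3)),
    Person("Person D", "male", "director", date(1900, 1, 1), date(1950, 6, 1)),
]

print(select(PEOPLE, "female", "politician", 1970))
print(public_domain_year(PEOPLE[3]))
```

In a real catalogue, the selected persons would then be joined to the programmes in which they appear. For a work with several makers, the relevant date of death would normally be that of the last surviving maker.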
Spinque puts Jesse in charge of his search and enables him to collaborate closely with Wikidata to enrich the Sound and Vision archive and to improve its services.
What I like about Spinque Desk is its flexibility. We were able to include all the necessary datasets (...) and we could subsequently use various building blocks to search and access them.
BEHIND THE SCREENS
For this project, Spinque Desk was used to design a strategy that searches for matching Wikidata persons for all personal names in the Sound and Vision thesaurus.
In this strategy, the Sound and Vision catalogue is first searched for terms related to each personal name. This combined information is then used to search for matching Wikidata persons.
If a match is found, the person data from Wikidata is imported into the Sound and Vision database, and the additional information is shown wherever the personal name is used.
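The two search steps described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the actual Spinque Desk strategy: the catalogue entries, the candidate records, their identifiers and the simple word-overlap score are all hypothetical, and a real strategy would use proper ranking over much richer data.

```python
from collections import Counter

# Hypothetical catalogue entries: each lists the persons and the other
# thesaurus terms used to describe one item.
CATALOGUE = [
    {"persons": ["Jansen, Anna"], "terms": ["politics", "elections", "parliament"]},
    {"persons": ["Jansen, Anna"], "terms": ["politics", "debate"]},
    {"persons": ["Jansen, Anna", "Vries, Piet de"], "terms": ["football"]},
]

# Hypothetical Wikidata candidates that share the same name.
CANDIDATES = [
    {"id": "example-1", "description": "Dutch politician and member of parliament"},
    {"id": "example-2", "description": "Dutch painter"},
]


def context_terms(name, catalogue):
    """Step 1: count the terms that co-occur with the name in the catalogue."""
    counts = Counter()
    for entry in catalogue:
        if name in entry["persons"]:
            counts.update(entry["terms"])
    return counts


def score(candidate, context):
    """Sum the weights of context terms that appear in the description."""
    words = set(candidate["description"].lower().split())
    return sum(weight for term, weight in context.items() if term in words)


def best_match(name, catalogue, candidates):
    """Step 2: return the id of the best-scoring candidate, or None."""
    context = context_terms(name, catalogue)
    if not candidates:
        return None
    top = max(candidates, key=lambda c: score(c, context))
    # A score of zero means the context gave no evidence for any candidate.
    return top["id"] if score(top, context) > 0 else None


print(best_match("Jansen, Anna", CATALOGUE, CANDIDATES))
```

With a name alone, both candidates would be equally plausible. The co-occurring catalogue terms are what separate them, which is why a suggestion is only returned when at least one context term supports it.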
WHAT WE CAN DO FOR YOU
In this project, Spinque Desk was used to enrich the Sound and Vision thesaurus. This is just one of the many ways in which the application can be used.
Which dataset could you link the entities in your domain to in order to enrich them? Let us know; we are happy to think along with you!