Dealing with multilingual spoken data: Corpus-based approaches


Date and time: September 19, 2014, 10:00 to 15:00
Location: Njalsgade 136, 2300 Copenhagen S, room 27.0.47

Starting from the general question of how to handle spoken multilingual data in linguistic corpora, this course covers both the adequate assessment of multilingual data in general and more specific data challenges.


Timm Lehmberg, Hanna Hedeland and Daniel Jettka (Hamburger Zentrum für Sprachkorpora, University of Hamburg/CLARIN-D), who specialise in multilingual spoken data, present requirement specifications for data formats and tools used in the creation and analysis of corpora of multilingual spoken data.


Thomas Schmidt (Archiv für gesprochenes Deutsch, Institut für deutsche Sprache, Mannheim) outlines the data processing procedure for language island data (Australia German), with regard to a) digitisation, documentation and alignment, b) long-term storage in the CLARIN infrastructure and c) the cross-linking of these language island data with other language island corpora within the same archive and/or other archives, as well as with reference corpora (standard and dialect).


Lunch break


Lene Offersgaard (CST, DigHumLab/CLARIN-DK) discusses the options offered by data repositories, the requirements for making data accessible to researchers other than the primary researchers, and the integration of data into research infrastructures.


Coffee break


Karoline Kühl (research project ‘Danish Voices in the Americas’, INSS) focuses on ambiguous features, i.e. non-language-specific elements in multilingual data (America Danish), and on the challenges of annotating closely related languages that share many features and patterns.