Workshop: Best practices for multilingual spoken data in linguistic corpora

Date and time: September 18th 2014, 10:00-15:00
Location: LANCHART Centre, Njalsgade 136, 27.5, 2300 Copenhagen S

Organizers: CLARIN-DK, research project ‘Danish Voices in the Americas’ (INSS) and the LANCHART Centre

‘Danish Voices in the Americas’ investigates Danish spoken by Danish immigrants and their descendants in the US and in Argentina. Thus, we handle data that are distinctly bi- or trilingual in that they represent Danish, with or without Danish dialectal features, influenced by English or Spanish on all linguistic levels. Such data do not necessarily follow standard language/Standard Danish norms and, as a result, can be a challenge for transcription and annotation as well as for corpus-based analyses.

The aim of the workshop is to discuss best practice for representing multilingual data with:

Dr. Thomas Schmidt, director of ‘Archiv für gesprochenes Deutsch’, Institut für deutsche Sprache, Mannheim

Timm Lehmberg, Hanna Hedeland and Daniel Jettka, ‘Hamburger Zentrum für Sprachkorpora’, University of Hamburg and CLARIN-D

We would like the workshop to be a dialogue between our guests and those who work with corpora and/or multilingual data (at the LANCHART Centre), so please feel free to participate in the discussions!


The 1st part of the programme has been put together with an eye to discussing technical solutions, the best possible representation of data for transcribers and coders, and long-term storage of data.

The 2nd part (after lunch) discusses challenges in the processing (transcription and annotation) of multilingual data.

10-11 s. t.


Vanessa Wolter and Kirsten Appel

Standards for data processing with special focus on the transcription and annotation of non-standard language elements in Transcriber


Tilde Ranis

Data administration in the LANCHART Corpus


Michael Barner Rasmussen

The LANCHART Corpus: Standards, procedures and search tools with special focus on non-standard language elements


11-12 s. t.


Thomas Schmidt and Timm Lehmberg

EXMARaLDA. A system for computer-assisted transcription and annotation of spoken language (Partitur-Editor, EXAKT, COMA)



Lunch (at the canteen)


13-15 s. t.

Best practices for spoken multilingual data


Karoline Kühl and Jan Heegård Petersen (research project ‘Danish Voices in the Americas’)

Discussants: Hanna Hedeland, Daniel Jettka, Timm Lehmberg, Thomas Schmidt

Other contributions are welcome!


Realistic representation of data vs. operationalization in corpora

How to handle non-standard forms (e.g., ungerste instead of yngste ‘youngest’) in the transcription process?

How realistically should the transcription represent the data? How far must a form diverge from the standard before the divergence is noted down?


How much interpretation is okay?

How will we handle non-standard forms that need interpretation in order to match a standard language?


Example: I farm mange acres af land ‘I XXX many acres of land’. The form farm is neither Danish nor English, but it needs a standard orthographic form in order to be found (Engl. farmed, or Engl. farm + Danish past -ede).


What is best practice for cases like this, balancing interpretation, necessary standardisation and realistic representation of our data?


Linguistic ambiguity

English and Danish share many linguistic features, even more if Jutish dialect features are included.


Example: Inversion of subject and verb in declarative main clauses

Det sner. Så busserne (S) kører (V) ikke i dag.

It’s snowing. So the buses (S) do (V) not run today.


We are interested in whether and how language contact reinforces such shared patterns. It has been customary in the LANCHART Corpus to mark non-Danish linguistic elements as such. How about ambiguous patterns and features?