Workshop: Best practices for multilingual spoken data in linguistic corpora

Date and time: September 18th, 2014, 10:00-15:00

Location: LANCHART Centre, Njalsgade 136, 27.5, 2300 Copenhagen S

Organizers: CLARIN-DK, research project ‘Danish Voices in the Americas’ (INSS) and the LANCHART Centre

‘Danish Voices in the Americas’ investigates Danish spoken by Danish immigrants and their descendants in the US and in Argentina. Thus, we handle data that are distinctly bi- or trilingual in that they represent Danish, with or without Danish dialectal features, influenced by English or Spanish on all linguistic levels. Such data do not necessarily follow standard language/Standard Danish norms and, as a result, can be a challenge for transcription and annotation as well as for corpus-based analyses.

The aim of the workshop is to discuss best practice for representing multilingual data with:

Dr. Thomas Schmidt, director of the ‘Archiv für gesprochenes Deutsch’, Institut für deutsche Sprache, Mannheim (https://www.ids-mannheim.de/prag/personal/schmidt/)

Timm Lehmberg, Hanna Hedeland and Daniel Jettka, ‘Hamburger Zentrum für Sprachkorpora’, University of Hamburg and CLARIN-D (http://www.corpora.uni-hamburg.de/)

We would like the workshop to be a dialogue between our guests and those who work with corpora and/or multilingual data (at the LANCHART Centre), so please feel free to participate in the discussions!

Programme

The first part of the programme has been put together with an eye to discussing technical solutions, the best possible representation of data for transcribers and coders, and long-term storage of data.

The second part (after lunch) discusses challenges in the processing (transcription and annotation) of multilingual data.

10-11 s. t.	LANCHART Corpus
Vanessa Wolter and Kirsten Appel	Standards for data processing with special focus on the transcription and annotation of non-standard language elements in Transcriber
Tilde Ranis	Data administration in the LANCHART Corpus
Michael Barner Rasmussen	The LANCHART Corpus: Standards, procedures and search tools with special focus on non-standard language elements
11-12 s. t.	EXMARaLDA
Thomas Schmidt and Timm Lehmberg	EXMARaLDA. A system for computer-assisted transcription and annotation of spoken language (Partitur-Editor, EXAKT, COMA; see www.exmaralda.org)
12-13	Lunch (at the canteen)
13-15 s. t.	Best practices for spoken multilingual data
Karoline Kühl and Jan Heegård Petersen (research project ‘Danish Voices in the Americas’)	Discussants: Hanna Hedeland, Daniel Jettka, Timm Lehmberg, Thomas Schmidt. Other contributions are welcome!
Realistic representation of data vs. operationalization in corpora	How should non-standard forms (e.g., ungerste instead of yngste ‘youngest’) be handled in the transcription process? How realistically should the transcription represent the data? How much divergence from the standard form is required before it is noted down?
How much interpretation is okay?	How should we handle non-standard forms that need interpretation in order to be matched to a standard-language form? Example: I farm mange acres af land ‘I XXX many acres of land’. Here farm is neither Danish nor English, but it needs a standard orthographic form in order to be found (Engl. farmed, or Engl. farm + Danish past -ede). What is best practice for cases like this, balancing interpretation, necessary standardisation and realistic representation of our data?
Linguistic ambiguity	English and Danish share many linguistic features, even more if Jutish dialect features are included. Example: Inversion of subject and verb in declarative main clauses: Det sner. Så busserne (S) kører (V) ikke i dag. ‘It’s snowing. So the buses (S) do (V) not run today.’ We are interested in whether and how language contact reinforces such shared patterns. It has been customary in the LANCHART Corpus to mark non-Danish linguistic elements as such. How should ambiguous patterns and features be treated?
