Recommended standards and formats – University of Copenhagen

Resource type | Format | Description

Text | TEI

CLARIN-DK recommends using the TEI format for metadata annotation and annotation of text corpora.
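
To illustrate how TEI keeps metadata (in the header) and text (in the body) together in one file, the following minimal sketch reads the title and paragraphs of a small TEI document using Python's standard library. The sample document is a generic TEI skeleton invented for illustration, and the function name read_tei is hypothetical; it does not reproduce the DK-CLARIN header or text format.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: a generic TEI skeleton, NOT the DK-CLARIN format.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Written for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>"""


def read_tei(xml_text):
    """Return (title, paragraphs) extracted from a TEI document string."""
    root = ET.fromstring(xml_text)
    # Metadata lives in the header ...
    title = root.findtext(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
        namespaces=TEI_NS,
    )
    # ... while the text itself lives in the body.
    paragraphs = [p.text for p in root.findall("tei:text/tei:body/tei:p", TEI_NS)]
    return title, paragraphs


title, paragraphs = read_tei(SAMPLE_TEI)
print(title)
print(paragraphs)
```

A real corpus file will carry far richer header metadata and annotation than this skeleton; the point is only that both can be reached from the same XML tree.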

In DK-CLARIN (2008-2011), a common TEI format was prepared for all text files. See: https://info.clarin.dk/clarin-dk-infrastrukturen/vejledninger/text-header.pdf and https://info.clarin.dk/clarin-dk-infrastrukturen/vejledninger/text-format.pdf

This format can be generated automatically using a CLARIN service: https://clarin.dk/clarindk/toolchains-upload.jsp

For instructions, see: https://info.clarin.dk/clarin-dk-infrastrukturen/vejledninger/Konvertering-TEI.pdf

See also the description of DK-CLARIN's texts in TEI format: https://info.clarin.dk/clarin-dk-infrastrukturen/tekster-i-tei-format/

Most of these corpora are also available as packaged zip files at the data center: https://repository.clarin.dk/repository/xmlui/

Lexicons | LMF

A widely used and recommended format for dictionaries and online lexical resources is the Lexical Markup Framework (LMF). LMF is the ISO standard (ISO 24613:2008) for representing lexical resources used in natural language processing (NLP) and machine-readable dictionaries.

LMF combines designs and methods from many existing NLP lexical resources. The framework is based on the features that lexicons have in common, and a consistent terminology has been developed to describe their components. On this basis, a model was designed to represent the features of these lexicons as fully as possible.
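
As a rough illustration of the kind of model LMF describes, the sketch below parses a tiny LMF-style XML fragment in which each component (lexicon, lexical entry, lemma, sense) carries its properties as attribute-value features. The element layout follows the feat att/val pattern commonly seen in LMF XML serializations, but the entries, feature names, and the helper names feats and read_entries are assumptions made for illustration; this is not the actual layout of the STO export.

```python
import xml.etree.ElementTree as ET

# Hypothetical LMF-style fragment, invented for illustration.
SAMPLE_LMF = """<LexicalResource>
  <GlobalInformation>
    <feat att="languageCoding" val="ISO 639-3"/>
  </GlobalInformation>
  <Lexicon>
    <feat att="language" val="dan"/>
    <LexicalEntry>
      <feat att="partOfSpeech" val="noun"/>
      <Lemma><feat att="writtenForm" val="hus"/></Lemma>
      <Sense id="hus_1"/>
    </LexicalEntry>
    <LexicalEntry>
      <feat att="partOfSpeech" val="verb"/>
      <Lemma><feat att="writtenForm" val="spise"/></Lemma>
      <Sense id="spise_1"/>
      <Sense id="spise_2"/>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>"""


def feats(element):
    """Collect the direct <feat att=... val=...> children of an element."""
    return {f.get("att"): f.get("val") for f in element.findall("feat")}


def read_entries(xml_text):
    """Return one dict per LexicalEntry with lemma, part of speech, and sense ids."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        entries.append({
            "lemma": feats(lemma).get("writtenForm") if lemma is not None else None,
            "pos": feats(entry).get("partOfSpeech"),
            "senses": [s.get("id") for s in entry.findall("Sense")],
        })
    return entries


for entry in read_entries(SAMPLE_LMF):
    print(entry)
```

Because every component uses the same feature mechanism, one small helper is enough to read properties at any level of the lexicon.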

The website http://www.lexicalmarkupframework.org/ provides examples of dictionary formats for multiple languages. The standard itself can be purchased at Dansk Standard’s webshop: https://webshop.ds.dk/da-dk/standarder/standard/ds-iso-246132008.

In CLARIN-DK, the Language Technology Dictionary for Danish (STO) uses LMF as its export format: https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/22 and https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/23

The dictionary itself is a database that is converted to LMF upon export.

Wordnet

The Danish wordnet, DanNet (https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/25), follows the general standard for wordnets. This means that it uses the top ontology of the European wordnet, EuroWordNet, and the structure of Princeton WordNet, in which one or more synonyms are grouped into a set that is linked to a common hypernym and, where relevant, to other sets through further relations.
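
The Princeton-style structure can be sketched as a small data structure: each synonym set holds its lemmas, a link to its hypernym, and any other relations. The example below is a minimal sketch with invented English entries; the class name Synset, the function hypernym_chain, and the relation label has_part are hypothetical and are not taken from DanNet.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Synset:
    """A group of synonyms with a hypernym link and optional other relations."""
    synset_id: str
    lemmas: List[str]
    hypernym: Optional[str] = None
    relations: Dict[str, List[str]] = field(default_factory=dict)


# Hypothetical toy data, not taken from DanNet.
SYNSETS = {
    s.synset_id: s
    for s in [
        Synset("entity", ["entity"]),
        Synset("animal", ["animal", "creature"], hypernym="entity"),
        Synset("tail", ["tail"], hypernym="entity"),
        Synset("dog", ["dog", "hound"], hypernym="animal",
               relations={"has_part": ["tail"]}),
    ]
}


def hypernym_chain(synset_id, synsets):
    """Follow hypernym links upward and return the ids visited, nearest first."""
    chain = []
    current = synsets[synset_id].hypernym
    while current is not None:
        chain.append(current)
        current = synsets[current].hypernym
    return chain


print(hypernym_chain("dog", SYNSETS))
```

Walking the hypernym links from any synset leads up to the most general concepts, which is where a top ontology such as EuroWordNet's would attach.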

Read more about its organization in the Linguistic Specifications for DanNet, Version 2: https://cst.ku.dk/projekter/dannet/dannetspecifikationer_v2.pdf.

As an export format, you can choose either an RDF/OWL format or a CSV format. The OWL format follows the W3C's RDF/OWL representation of wordnets: http://www.w3.org/TR/wordnet-rdf/.

Multimodal annotation | MUMIN annotation schemes

MUMIN specifications for annotating communicative gestures in the ANVIL and ELAN formats: https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/43
