Material

Content description

Corpus size

Subcorpus            Tokens   Entities   Version
Werther (1787)       41,505        331         1
Adorno               13,233        929         1
Parzival             30,491      2,001         1
Bundestagsdebatten    6,371        488         1

Data formats

CoNLL (TSV)

One token per line; an empty line marks a sentence boundary. Annotations are tab-separated, and multiple annotations are placed in separate columns. B-PER marks the first token of a person annotation, and I-PER marks every subsequent token of that annotation. O (the letter) means ‘no annotation’.

Example

Die B-PER
geringen I-PER
Leute I-PER
des I-PER B-LOC
Orts I-PER I-LOC
kennen O
mich O
schon O
, O
und O
lieben O
mich O
, O
besonders O
die B-PER
Kinder I-PER
. O
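
The format described above could be read along the following lines. This is a minimal sketch, not official tooling: it assumes the columns are tab-separated as stated, treats a missing annotation column as O, and the names `read_conll` and `entity_spans` are illustrative only.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, [tag per annotation column]) pairs.
Sentence = List[Tuple[str, List[str]]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Split tab-separated CoNLL lines into sentences at empty lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, *tags = line.split("\t")
        current.append((token, tags))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(sentence: Sentence) -> List[Tuple[int, int, str]]:
    """Return (start, end, category) token spans, with end exclusive.

    Each annotation column is decoded independently, so nested
    annotations in separate columns yield separate spans.
    """
    spans = []
    width = max((len(tags) for _, tags in sentence), default=0)
    for col in range(width):
        start, category = None, None
        for i, (_, tags) in enumerate(sentence):
            # Assumption: rows with fewer columns carry no annotation there.
            tag = tags[col] if col < len(tags) else "O"
            if start is not None and tag != f"I-{category}":
                spans.append((start, i, category))
                start, category = None, None
            if tag.startswith("B-"):
                start, category = i, tag[2:]
        if start is not None:
            spans.append((start, len(sentence), category))
    return sorted(spans)
```

Applied to the example sentence, `entity_spans` would report the person span over "Die geringen Leute des Orts", the nested location span over "des Orts", and the person span over "die Kinder" as token index ranges.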

Apache UIMA XMI

XML-based, for processing with Apache UIMA. The type system can be downloaded here. The relevant types are subtypes of de.unistuttgart.ims.creta.api.Entity. The annotation category is given either by the name of the subtype (e.g. de.unistuttgart.ims.creta.api.EntityPER) or by the value of the attribute category.
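
For a quick look at the annotations without a UIMA installation, the XMI files can also be read as plain XML. The sketch below rests on several assumptions that should be checked against the actual files: that each annotation is serialized as a child element of the root with `begin` and `end` character offsets into the `sofaString` of the `cas:Sofa` element (the usual UIMA XMI layout), and that the local element names of the entity types start with `Entity`. The function name `read_entities` is illustrative.

```python
import xml.etree.ElementTree as ET


def read_entities(xmi_source):
    """Collect (begin, end, category, covered_text) for entity annotations.

    xmi_source is a file path or an open file object.
    """
    root = ET.parse(xmi_source).getroot()

    # Assumption: a single Sofa element carries the document text.
    text = ""
    for el in root:
        if el.tag.endswith("}Sofa"):
            text = el.get("sofaString", "")

    entities = []
    for el in root:
        # Strip the namespace: "{uri}EntityPER" -> "EntityPER".
        local = el.tag.rsplit("}", 1)[-1]
        # Assumption: all relevant subtypes are named Entity<CATEGORY>.
        if local.startswith("Entity") and el.get("begin") is not None:
            begin, end = int(el.get("begin")), int(el.get("end"))
            # Prefer the category attribute, fall back to the subtype name.
            category = el.get("category") or local[len("Entity"):]
            entities.append((begin, end, category, text[begin:end]))
    return sorted(entities)
```

For anything beyond inspection, loading the files with Apache UIMA and the provided type system remains the more robust route, since it resolves the type hierarchy rather than relying on element names.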

Markdown

This format uses Pandoc Markdown and is intended primarily for reading the annotations manually. Annotations are enclosed in square brackets, followed by the category in subscript.
This format should not be used for automatic processing.

Example

[Die geringen Leute [des Orts ]~LOC~]~PER~kennen mich schon, und lieben mich, besonders [die Kinder]~PER~.

Downloads

Subcorpus            CoNLL (TSV)                          XMI                                  Markdown
Werther (1787)       3_34_12                              3_34_12                              3_34_12
Adorno               Please fill in this form. We ask you to provide random sentences from the text, thereby verifying that you already possess it.
Parzival             Book 3, Book 4, Book 5, Book 6       Book 3, Book 4, Book 5, Book 6       Book 3, Book 4, Book 5, Book 6
Bundestagsdebatten   3_22_26, 3_23_26, 3_24_26, 3_25_26   3_22_26, 3_23_26, 3_24_26, 3_25_26   3_22_26, 3_23_26, 3_24_26, 3_25_26