Content description
- Corpus
- Entity annotations
- Annotation guidelines (in German)
Corpus size
Subcorpus | Tokens | Entities | Version |
---|---|---|---|
Werther (1787) | 41,505 | 331 | 1 |
Adorno | 13,233 | 929 | 1 |
Parzival | 30,491 | 2,001 | 1 |
Bundestagsdebatten | 6,371 | 488 | 1 |
Data formats
CoNLL (TSV)
One token per line; an empty line marks a sentence boundary. Annotations are tab-separated, and multiple annotations on the same token are placed in separate columns. B-PER denotes the first token of a person annotation, I-PER denotes all following tokens of that person annotation, and O (the letter) denotes 'no annotation'.
Example
Die B-PER
geringen I-PER
Leute I-PER
des I-PER B-LOC
Orts I-PER I-LOC
kennen O
mich O
schon O
, O
und O
lieben O
mich O
, O
besonders O
die B-PER
Kinder I-PER
. O
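The column layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official reader: the function names `read_sentences` and `extract_entities` are made up for illustration, and it assumes that a missing or empty annotation column means O and that an I- tag without a preceding B- tag of the same category starts a new entity.

```python
def read_sentences(lines):
    """Yield sentences as lists of (token, [tag, ...]) pairs."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        token, *tags = line.split("\t")
        sentence.append((token, tags))
    if sentence:
        yield sentence


def extract_entities(sentence):
    """Return (category, start, end) token spans, end exclusive.

    Each annotation column is decoded independently, so nested
    annotations in a second column become separate spans.
    """
    spans = []
    n_cols = max((len(tags) for _, tags in sentence), default=0)
    for col in range(n_cols):
        open_span = None  # (category, start index) of the entity being read
        for i, (_token, tags) in enumerate(sentence):
            # Assumption: a missing or empty column is treated as O.
            tag = tags[col] if col < len(tags) and tags[col] else "O"
            prefix, _sep, category = tag.partition("-")
            if prefix == "I" and open_span is not None and open_span[0] == category:
                continue  # the open entity simply continues
            if open_span is not None:
                spans.append((open_span[0], open_span[1], i))
                open_span = None
            if prefix in ("B", "I"):
                open_span = (category, i)
        if open_span is not None:
            spans.append((open_span[0], open_span[1], len(sentence)))
    return spans


# The first tokens of the example above, with real tab characters.
sample = [
    "Die\tB-PER", "geringen\tI-PER", "Leute\tI-PER",
    "des\tI-PER\tB-LOC", "Orts\tI-PER\tI-LOC", "kennen\tO", "",
]
for sent in read_sentences(sample):
    tokens = [tok for tok, _ in sent]
    for category, start, end in extract_entities(sent):
        print(category, " ".join(tokens[start:end]))
```

Decoding each column on its own keeps the nested LOC annotation separate from the enclosing PER annotation, which a single-column BIO reader would lose.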
Apache UIMA XMI
XML-based, for processing with Apache UIMA. The type system is listed under Downloads. The relevant types are subtypes of de.unistuttgart.ims.creta.api.Entity. The annotation category is found in the subtype name (e.g. de.unistuttgart.ims.creta.api.EntityPER) or in the value of the attribute category.
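For a quick look at the entity annotations without a UIMA installation, the XMI files can also be read as plain XML. The sketch below uses only the Python standard library and rests on several assumptions: the namespace URI follows the usual UIMA XMI convention of deriving it from the package name (`http:///de/unistuttgart/ims/creta/api.ecore`), the document text is stored in the `sofaString` attribute of the `cas:Sofa` element, and the demo document (including the type name `EntityLOC`) is hand-made for illustration and not taken from the corpus. The function name `read_entities` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

CAS_NS = "http:///uima/cas.ecore"
# Assumption: namespace derived from the package de.unistuttgart.ims.creta.api.
ENTITY_NS = "http:///de/unistuttgart/ims/creta/api.ecore"


def read_entities(xmi_source):
    """Return (category, begin, end, covered_text) for each entity element."""
    root = ET.parse(xmi_source).getroot()
    sofa = root.find(f"{{{CAS_NS}}}Sofa")
    text = sofa.get("sofaString", "") if sofa is not None else ""
    entities = []
    for elem in root:
        # Keep only elements such as api:EntityPER from the entity namespace.
        if not elem.tag.startswith(f"{{{ENTITY_NS}}}Entity"):
            continue
        begin, end = int(elem.get("begin")), int(elem.get("end"))
        local_name = elem.tag.rsplit("}", 1)[1]
        # Prefer the category attribute, fall back to the subtype name.
        category = elem.get("category") or local_name[len("Entity"):]
        entities.append((category, begin, end, text[begin:end]))
    return entities


# Hand-made demo document (an assumption, not part of the corpus).
DEMO_XMI = (
    b'<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" '
    b'xmlns:cas="http:///uima/cas.ecore" '
    b'xmlns:api="http:///de/unistuttgart/ims/creta/api.ecore" xmi:version="2.0">'
    b'<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" '
    b'mimeType="text" sofaString="die Kinder des Orts"/>'
    b'<api:EntityPER xmi:id="2" sofa="1" begin="0" end="10"/>'
    b'<api:EntityLOC xmi:id="3" sofa="1" begin="11" end="19" category="LOC"/>'
    b'</xmi:XMI>'
)
demo_entities = read_entities(io.BytesIO(DEMO_XMI))
for entity in demo_entities:
    print(entity)
```

UIMA stores offsets as UTF-16 code units, so slicing a Python string with them can be off for text containing characters outside the Basic Multilingual Plane. For anything beyond inspection, loading the files together with the type system in a UIMA-aware library is the safer route.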
Markdown
Pandoc Markdown is used, mainly for reading the annotations manually. Annotations are marked by square brackets, followed by the category in subscript.
This format should not be used for automatic processing.
Example
[Die geringen Leute [des Orts ]~LOC~]~PER~kennen mich schon, und lieben mich, besonders [die Kinder]~PER~.
Downloads
- Call.pdf
- Annotationsrichtlinien.pdf (annotation guidelines, in German)
- UIMA type system
Subcorpus | CoNLL (TSV) | XMI | Markdown |
---|---|---|---|
Werther (1787) | 3_34_12 | 3_34_12 | 3_34_12 |
Adorno | Please fill in | | |
Parzival | Book 3, Book 4, Book 5, Book 6 | Book 3, Book 4, Book 5, Book 6 | Book 3, Book 4, Book 5, Book 6 |
Bundestagsdebatten | 3_22_26, 3_23_26, 3_24_26, 3_25_26 | 3_22_26, 3_23_26, 3_24_26, 3_25_26 | 3_22_26, 3_23_26, 3_24_26, 3_25_26 |