In the context of the Center for Reflected Text Analytics we have been developing tools to support text-based analyses that are relevant for our research questions. We describe some of them below.
The platform CRETAnno which has been developed within CRETA consists of different modules in the front- and backend. The frontend has been specified for web-based use, so that any internet-capable computer can easily access the research data in the project. Part of the backend is WebService-based and offers machine learning (ML) modules in addition to user and data management.
CRETAnno is mainly used for annotating text passages. So far, we’ve done this on different levels: entities, narrative structure, and location segments. In combination with the ML backend, the annotation module was extended by a semi-automatic component, such that the annotation work could be significantly accelerated.
Another module allows interactive disambiguation of person entities and mapping them to the respective character list. Again, the ML backend is included for semi-automatic support. Through these assignments, complete character networks can be exported in a graph format, e.g. to be analyzed with Gephi.
The Reporting Tool for Relational Visualization and Analysis of Character Mentions in Literature is a program for visualization of character networks from literary texts, which is publically available on a web server (see the Online Demo). rCAT takes parameters for network extraction and generates reports in PDF format that enrich the network representations with statistics and word clouds. By combining network extraction, visualization, and analysis, it goes beyond basic graphics toolkits and has the advantages of reproducibility and direct reusability compared to an interactive toolkit.
One of the main obstacles for many Digital Humanities projects is the low data availability. Texts have to be digitized in an expensive and time consuming process whereas Optical Character Recognition (OCR) post-correction is one of the time-critical factors. Using the example of OCR post-correction, we adapt a generic system to solve a specific problem with little data. The system accounts for a diversity of errors encountered in OCRed texts coming from different time periods in the domain of literature. We showed that the combination of different approaches, such as e.g. Statistical Machine Translation and spell checking, with the help of a ranking mechanism tremendously improves over single-handed approaches. Since we consider the accessibility of the resulting tool as a crucial part of Digital Humanities collaborations, we published the workflow we suggest for efficient text recognition and subsequent automatic and manual postcorrection.
Part-of-Speech Tagger for Middle High German
One of the project groups in CRETA is interested in the automatic analysis of medieval literature. Their relevant texts are written in Middle High German. This comes with the challenge that there are fewer tools for the automatic analysis of texts available (compared to other projects in CRETA). Since PoS tagging builds a basic preprocessing step for further automatic analyses, we developed a tagger for Middle High German. We made use of a large lexical resource to semi-automatically create a large training corpus using the rather coarse-grained Universal Dependency tag set. The corresponding parameter file for the TreeTagger is available for download. See also our Online Demo.
Language-Identification and Part-of-Speech Tagging for Mixed Texts
Code-switching is often described as a phenomenon highly frequent in spoken language. However, it actually reaches much further back and can be found in medieval sermons, for example. In today’s multi-cultural society, addressing mixed language in natural language processing appears to be inevitable, as the development of methods close to real-world data touches a nerve in recent computational linguistics. We developed a language identification system and a part-of-speech tagger for Latin-Middle English mixed text. To this end, we annotated data with language IDs and Universal POS tags. As a classifier, we trained a conditional random field classifier for both sub-tasks, including features generated by the TreeTagger models of both languages. The service is available as an Online Demo.
Visual Comparison of Annotations
The annotation of text passages can present severe scaling difficulties on several levels, for example when multiple annotators add complex annotations to a very long text. Based upon such a complex database, one may now have to tackle a series of demanding tasks like cleaning up flawed annotations, exploring those passages in which there is a strong (dis-)agreement among the annotators, or exploring structures resulting from the interplay of annotations of different types. To support these tasks, this web-based visual exploration tool provides a series of aggregation levels and interaction facilities by means of which the user can quickly and continuously change between close and distant views upon long texts containing complex annotations; on each level, the disagreement among annotators can be displayed as well as an interleaving with the text or aggregations thereof.
Visual Text Analytics for Digital Humanities (ViTA)
This web-based tool provides a series of views upon sets of annotations with the purpose of inspecting and verifying them: Word clouds, plot views and graph visualizations support the analysis of entities like figures or places in narrative texts, of the relations holding between them and of the contexts they are in. In doing so, each view allows to directly access the respective text passage. This helps to recognize quickly any mistakes in the annotations or errors resulting from the automatic text processing. Here, of all the views, the interactive graph visualization is of central importance; it maps entities to the nodes and their relations to the edges of a node-link diagram in a force-based layout. This view is supplemented with a finger print visualization that shows where in the text these entities are mentioned respectively. Several filters allow for an interactive exploration of the graph and visualizations of the figures and their relations within the related text view highlight the context of these elements.
Visual Comparison of Networks
In order to support the time-consuming and difficult task of analyzing a novel, this web-based tool offers an overview of a novel’s plot and its ensemble of characters. More particularly, one can analyze how the characters and their relations evolve over the course of a novel’s plot. To this end, the tool visualizes annotations of (character-) entities as a graph, either in the form of a node-link diagram or of a matrix. Both modes of representation are interleaved with a text view, which offers the possibility to consider the characters and their relations in their respective context. Filters allow for a detailed exploration, and automatically extracted descriptions of characters and relations can be shown as word clouds upon demand. The main feature of the tool consists in its ability to show two relation graphs within a single graph of differences, for example graphs that relate to two different text passages or to two different ways to define the concept of “character relation”.