Abstract in English:
This work presents a thorough examination and expansion of the NLP Interchange Format (NIF) ecosystem.
Both core use cases of NIF are addressed: as a format for NLP tool integration and as a corpus pivot
format. NLP tool integration is extended with a new wrapper for the OpenNLP framework, including a
NIF parser, and with a converter from the widely used CoNLL format to NIF. The tools are examined
for scalability and data quality. On the corpus side, a large range of
diverse existing corpora are converted into NIF. The complete contribution of corpora presented
in this work adds up to 841,106,737 triples. Data quality is a special focus: the corpora are
validated with SPARQL test cases to ensure correctness. Finally, an industry use case is presented in
Unicode Text Segmentation, where abbreviation lists are extracted from Linked Data and their
impact on sentence boundary detection is measured using the converted corpora and NLP tools.
Pubdate / Erscheinungsdatum: