Brümmer, Martin
This work presents a thorough examination and expansion of the NIF (NLP Interchange Format) ecosystem. Both core use cases of NIF are addressed: as a format for NLP tool integration and as a corpus pivot format. NLP tool integration is extended with a new wrapper for the OpenNLP framework, which includes a NIF parser, and with a converter that translates the widely used CoNLL format into NIF. The tools are examined for scalability and data quality. On the corpus side, a wide range of existing corpora is converted into NIF. Together, the corpora contributed in this work amount to 841,106,737 triples. Data quality is a special focus: the corpora are validated for correctness using SPARQL test cases. Finally, an industry use case in Unicode Text Segmentation is presented, in which abbreviation lists are extracted from Linked Data and their impact on sentence boundary detection is measured using the converted corpora and NLP tools.
