Automatic linguistic enrichment for Dutch texts using Frog

Metadata

  • Status: Completed
  • Type: Generic
  • Work Package: WP3
  • Coordinators for CLARIAH: Maarten van Gompel
  • Participating Institutes: KNAW HuC, Radboud University Nijmegen (in the past: Tilburg University)
  • End-users: Researchers
  • Developers: Maarten van Gompel (current maintainer), Ko van der Sloot (former lead developer, now retired), Antal van den Bosch, Peter Berck, Bertjan Busser, Walter Daelemans
  • Interest Groups: Text
  • Task IDs: T139 (Frog), T108 (FoLiA)

Description

This is a generic use case, generalising over a variety of possible research projects that focus on text mining for Dutch.

In this use case we focus on one particular solution that has been developed in the scope of CLARIN-NL and CLARIAH: Frog, an NLP-suite for Dutch that integrates various modules for different kinds of linguistic annotation:

  • Tokenizer (through ucto)
  • Multi-word units
  • Lemmatizer
  • Morphological Analyzer
  • Part-of-Speech Tagger
  • Named Entity Recogniser
  • Phrase Chunker (Shallow parsing)
  • Dependency Parser

Most of the modules are based on memory-based learning (k-NN-based techniques) and are the culmination of over two decades of work by various partners, incorporating the output of multiple research projects. This approach contrasts with the deep learning techniques that have gained ground in recent years. Frog was and remains a widely used tool for many researchers.

What is the research about?

Researchers often want to enrich texts with linguistic features such as Part-of-Speech tags, lemmas, dependency relations, and named entities. These annotations often provide useful features for text mining or serve as a preprocessing step for further analysis.

What is needed to do the research?

Tools

Frog (as well as ucto) is a command-line tool. To make it accessible to a wider audience we have:

  • Integrated a low-latency daemon mode (TCP)
  • Made available a Python binding (also one for ucto)
  • Made it available as a webservice that lends itself to further integration with the CLARIAH infrastructure. Note, however, that most users prefer to access this tool through the lower-level interfaces.
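
A client for the daemon mode can be sketched in a few lines of Python using only the standard library. The sketch below rests on several assumptions that should be checked against the Frog documentation for your version: that a Frog server is already listening on the hypothetical address `localhost:12345`, that it accepts one line of UTF-8 text per request, and that it ends each response with a line reading `READY`. The function names `read_response` and `query_frog` are illustrative and not part of Frog.

```python
import socket


def read_response(sock, terminator="READY", encoding="utf-8"):
    """Collect response lines from a connected socket.

    Reading stops at a line equal to `terminator` (an assumed end-of-response
    marker) or when the peer closes the connection.
    """
    buffer = b""
    lines = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection
            break
        buffer += chunk
        # Split on raw bytes first so multi-byte characters are never cut.
        while b"\n" in buffer:
            raw, buffer = buffer.split(b"\n", 1)
            line = raw.decode(encoding).rstrip("\r")
            if line.strip() == terminator:
                return lines
            lines.append(line)
    if buffer:  # keep a trailing line that had no newline
        lines.append(buffer.decode(encoding).rstrip("\r"))
    return lines


def query_frog(text, host="localhost", port=12345):
    """Send one line of text to a (hypothetical) Frog server, return its lines."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(text.encode("utf-8") + b"\r\n")
        return read_response(sock)
```

Keeping the line reader separate from the connection logic makes it possible to exercise the parsing against any socket-like object without a running server.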

Data

  • Various memory-based models have been trained. The tokeniser is rule-based.
    • There are also some PoS and lemma models for historical Dutch from two time periods, developed in the scope of the Nederlab project.
  • A rich data format was needed to represent all the possible annotations. The XML-based format FoLiA was adopted for this purpose; Frog (and ucto) support reading and writing this format and adding the necessary annotation layers where requested.
  • In addition to FoLiA, output in a simple tab-separated format is supported, but some information is lost when this format is chosen.
  • Recent versions of Frog also offer an extra JSON output mode.
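
As an illustration of working with the tab-separated output, the following minimal sketch turns such output into a list of sentences, each a list of per-token dictionaries. It assumes one token per line, a blank line between sentences, and ten columns in the order listed in `COLUMNS`; that order should be verified against the Frog documentation, since it may depend on the version and on which modules are enabled. The sample lines are invented for illustration and are not actual Frog output.

```python
# Assumed column order of the tab-separated output (verify for your version).
COLUMNS = (
    "index", "token", "lemma", "morphology", "pos",
    "confidence", "entity", "chunk", "head", "relation",
)


def parse_frog_columns(text):
    """Parse columnar output into a list of sentences of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        # zip() stops at the shorter sequence, so missing columns are omitted.
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


# Invented sample in the assumed layout, for illustration only.
sample = (
    "1\tDit\tdit\t[dit]\tVNW\t0.9\tO\tB-NP\t2\tsu\n"
    "2\tis\tzijn\t[zijn]\tWW\t0.9\tO\tB-VP\t0\tROOT\n"
)

for sentence in parse_frog_columns(sample):
    print([(tok["token"], tok["lemma"], tok["pos"]) for tok in sentence])
```

All values are kept as strings; converting `index`, `head`, or `confidence` to numbers is left to the caller, because empty or placeholder values may occur when a module is disabled.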

What software and services are involved?

  • Frog - CLI tool and C++ library
    • frogdata - The trained models/data files
  • Ucto - The tokeniser (standalone CLI tool and C++ library)
    • uctodata - Rule-based tokenisation rules for several languages
  • Mbt - Memory-based tagger (C++)
    • Timbl - A k-NN memory-based learning toolkit (C++)
  • FoLiA support via libfolia - An XML-based data format supporting the various kinds of linguistic annotations Frog provides.
  • CLAM - Used to power the webservice
  • Toad - This is a separate set of tools that can be used to train new models for Frog.
  • LaMachine - Because Frog is complex software with many dependencies that may be non-trivial to install for many users, all that is necessary to run and deploy it is bundled in LaMachine, a meta-distribution of various NLP tools. LaMachine is a courtesy to the user here, not a dependency for Frog.

References

Related use-cases:

Relevant publications: