TokenPro

TokenPro is a rule-based tokeniser, i.e. a tool that breaks a text up into elements called tokens (e.g. words, punctuation marks, etc.).

For instance, in the example below, "he'll" is split into two different tokens, "he" and "will".

TokenPro can be fully customized through an XML configuration file, which can define specific splitting rules as well as the tokeniser's behavior when specific UTF-8 symbols are encountered.

Example

Algorithm: Rules defined as regular expressions.
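To illustrate what "rules defined as regular expressions" can look like in practice, here is a minimal Python sketch. It is not TokenPro's implementation: the rule names, the patterns, and the tokenize function are all hypothetical. Unlike the "he'll" case described above, the sketch keeps the surface form "'ll" rather than expanding it to "will".

```python
import re

# Hypothetical rules, tried in order at each position.
# They are illustrative only and are not TokenPro's actual rule set.
RULES = [
    ("CLITIC", r"'(?:ll|re|ve|s|d|m)\b"),  # English clitics such as 'll, 're
    ("WORD", r"\w+"),                       # runs of letters, digits, underscore
    ("PUNCT", r"[^\w\s]"),                  # any single non-space symbol
]

# Combine the rules into one alternation; earlier rules take precedence.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES)
)


def tokenize(text):
    """Return the list of tokens found in text; whitespace is skipped."""
    return [match.group(0) for match in TOKEN_RE.finditer(text)]


print(tokenize("He'll go."))
```

Because the rules are tried in order, putting the clitic rule first is what makes "'ll" come out as a single token instead of an apostrophe followed by a word.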

Resources: List of abbreviations for Italian and English.
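A list of abbreviations is typically used so that the final period of an abbreviation is not split off as a separate token. The Python sketch below shows one simple way to do this; the sample entries and the function name are hypothetical and are not taken from TokenPro's resource files.

```python
import re

# Hypothetical sample entries; TokenPro's actual lists are not reproduced here.
ABBREVIATIONS = {"dr.", "mr.", "e.g.", "sig.", "dott."}


def tokenize_with_abbreviations(text, abbreviations=ABBREVIATIONS):
    """Split text into tokens, keeping known abbreviations in one piece."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in abbreviations:
            # Known abbreviation: keep the trailing period attached.
            tokens.append(chunk)
        else:
            # Otherwise separate words from punctuation symbols.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


print(tokenize_with_abbreviations("Dr. Rossi arrived."))
```

The lookup is case-insensitive and only matches whole whitespace-delimited chunks, so an abbreviation followed directly by other punctuation (for example "e.g.,") would still be split in this simplified version.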