TokenPro is a rule-based tokeniser, i.e. a tool that breaks a text up into elements called tokens (e.g. words, punctuation marks, etc.). For instance, "he'll" is split into two different tokens, "he" and "will". TokenPro can be fully customized through an XML configuration file, which can define specific splitting rules or the tool's behavior when specific UTF-8 symbols are encountered.
Example resources: list of abbreviations for Italian and English.
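To make the idea of rule-based splitting concrete, here is a minimal Python sketch. It is not TokenPro's implementation and does not read its XML configuration; the names `CONTRACTIONS`, `ABBREVIATIONS` and `tokenize` are hypothetical and chosen only for illustration. It shows how a small rule table can turn "he'll" into "he" and "will", and how an abbreviation list keeps a trailing period attached to its word.

```python
import re

# Hypothetical rule table (illustrative only): clitic suffix -> expanded token.
CONTRACTIONS = {"'ll": "will", "'re": "are", "'ve": "have"}

# Hypothetical abbreviation list: these keep their periods and are not split.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Mr.", "Dr."}

# A word (optionally with an apostrophe suffix) or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split text into tokens using the two small rule tables above."""
    tokens = []
    for chunk in text.split():
        # Rule 1: known abbreviations are emitted whole.
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        for piece in TOKEN_RE.findall(chunk):
            # Rule 2: split a known contraction and expand its suffix.
            for suffix, expansion in CONTRACTIONS.items():
                if piece.lower().endswith(suffix) and len(piece) > len(suffix):
                    tokens.extend([piece[:-len(suffix)], expansion])
                    break
            else:
                # No rule applied: keep the word or punctuation mark as is.
                tokens.append(piece)
    return tokens


print(tokenize("He'll arrive, e.g. tomorrow."))
```

A real tokeniser would need many more rules (for example, typographic apostrophes or an abbreviation directly followed by a comma), which is the kind of behavior a configuration file is meant to capture.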