Annotation layers

- December 2016 -

TextPro is a set of modules which perform pre-processing (web page cleaning) and a number of NLP tasks: tokenization, sentence splitting, morphological analysis, PoS tagging, lemmatization, chunking, named-entity recognition, time expression recognition, assignment of geographical coordinates to toponyms, dependency parsing, event and event factuality recognition, temporal relation recognition, and keyword extraction.

The input given to TextPro should contain only pure text (hub text, see the CES guidelines for corpus encoding).
If a text contains any markup, it should be removed. Example of input file:
Barack Obama was elected U.S. president in 2012. He'll complete his term in 
January 2017.



CleanPro enables the use of HTML input files by cleaning them. With the -html option, all mark-up tags and irrelevant text (e.g. navigation menus, common headers and footers) are removed from HTML pages, while the relevant text is kept as input text for the other modules.

NLP processing

The output file produced by TextPro contains:
  • Header lines. Some meta information, such as the name of the file, the document's creation date, the author, etc., can be coded in the first lines. These lines begin with the special character #, followed by the name of the field and its value.
  • Annotation lines. They contain the output information for each token of the input file. TextPro uses a tabular format in which columns are separated by a tab character. The TextPro tabular format is meant for the output of the modules, not for representing information about the structure of the text (e.g. titles, paragraphs, emphasis). The output of each module (except for KeyPro, see below) consists of one or more added columns containing a value for each token. If a module has no output for a token, the corresponding column contains the value __NULL__ for that token. List of all the TextPro annotation columns: token, tokennorm, tokenid, tokenstart, tokenend, tokentype, pos, full_morpho, comp_morpho, lemma, wnpos, chunk, entity, parserid, feats, head, deprel.

    The end of sentence is marked with an empty line.

    Example of header lines:

    # FILE: example.txt
    # ID: 000055740
    # CY: US
    # DNUM: 06002654
    # DATE: 19991214
    # KIND: A
    # FIELDS: token	pos	lemma
    Example of annotation lines:
    Barack	NP0	Barack
    Obama	NP0	Obama
    was	VBD	be
    elected	VVN	elect
    U.S.	NP0	u.s.
    president	NN1	president
    in	PRP	in
    2012	CRD	2012
    .	PUN	.
    The output reported above is obtained with the following command: $> -l eng -c token+pos+lemma -o /tmp/ example.txt
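The format described above (header lines starting with #, a FIELDS header naming the columns, tab-separated annotation lines, an empty line at the end of each sentence, and __NULL__ for missing values) can be read with a few lines of Python. This is a minimal sketch and not part of TextPro: the function name parse_textpro and the choice to turn __NULL__ into None are assumptions made for illustration.

```python
NULL = "__NULL__"  # placeholder TextPro writes when a module has no value

def parse_textpro(text):
    """Return (headers, sentences) from TextPro tabular output.

    headers   -> dict of header field name to value
    sentences -> list of sentences, each a list of {column: value} dicts
    """
    headers, fields, sentences, current = {}, [], [], []
    in_header = True
    for line in text.splitlines():
        if in_header and line.startswith("#"):
            name, _, value = line[1:].partition(":")
            headers[name.strip()] = value.strip()
            if name.strip() == "FIELDS":
                fields = value.strip().split("\t")
            continue
        in_header = False
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        values = line.split("\t")
        current.append({f: (None if v == NULL else v)
                        for f, v in zip(fields, values)})
    if current:
        sentences.append(current)
    return headers, sentences

SAMPLE = "\n".join([
    "# FILE: example.txt",
    "# FIELDS: " + "\t".join(["token", "pos", "lemma"]),
    "\t".join(["2012", "CRD", "2012"]),
    "\t".join([".", "PUN", "."]),
    "",
    "\t".join(["He", "PNP", "he"]),
])

headers, sentences = parse_textpro(SAMPLE)
print(headers["FILE"], len(sentences))
```

The sketch assumes that all header lines come before the first annotation line, as in the examples in this document.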


    TokenPro breaks up a text into elements called tokens (e.g. words, punctuation marks, etc.); when you call TokenPro, the system creates a column called token containing one token in each line.
    For instance, in the example below, the full stop "." is considered a single token, just like each word in the sentence.
    TokenPro can be fully customized from an XML configuration file, in order to define specific splitting rules or to define the behavior for specific UTF-8 symbols.

    The list of all the columns TokenPro can produce is the following:
    token: contains the token
    tokennorm: contains a normalized version of the token (for example, accents are normalized)
    tokenid: contains the ID given to the token by the system
    tokenstart: contains the offset indicating the first character of the token
    tokenend: contains the offset immediately following the last character of the token (in the example below, "They" starts at offset 0 and ends at offset 4)
    tokentype: tokens are labeled according to six categories, i.e. UPP (uppercase), LOW (lowercase), CAP (capitalized), DIG (digit), PUN (punctuation), and OTH (other).
    token	tokennorm	tokenid	tokenstart	tokenend	tokentype
    They	They		1	0		4		CAP
    have	have		2	5		9		LOW
    been	been		3	10		14		LOW
    working	working		4	15		22		LOW
    at	at		5	23		25		LOW
    FBK	FBK		6	26		29		UPP
    for	for		7	30		33		LOW
    15	15		8	34		36		DIG
    years	years		9	37		42		LOW
    .	.		10	42		43		PUN
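The offsets in the table above can be reproduced with a simple regular-expression tokenizer. This is only an illustrative sketch: the pattern below is an assumption and is far simpler than TokenPro's configurable splitting rules. It shows that, in this example, tokenstart is the offset of the first character and tokenend is the offset just after the last one, so that text[tokenstart:tokenend] returns the token.

```python
import re

TEXT = "They have been working at FBK for 15 years."

# Hypothetical splitting rule: runs of word characters, or single
# non-space symbols. TokenPro's real rules are defined in its XML config.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def token_offsets(text):
    """Return (token, tokenid, tokenstart, tokenend) tuples for a text."""
    rows = []
    for token_id, match in enumerate(TOKEN_PATTERN.finditer(text), start=1):
        rows.append((match.group(), token_id, match.start(), match.end()))
    return rows

for token, token_id, start, end in token_offsets(TEXT):
    # Slicing with the two offsets gives back the token itself.
    assert TEXT[start:end] == token
    print(token, token_id, start, end, sep="\t")
```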


    SentencePro identifies sentence boundaries in a text; it marks the end of a sentence when a specific punctuation mark (for instance a full stop, a question mark, an exclamation mark, etc.) is found, unless it is part of a linguistic expression such as an abbreviation (like the full stops in "U.S.", which stands for "United States") or an acronym.
    SentencePro can be fully customized from an XML configuration file, in order to define specific sentence-ending rules.
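The idea of ending a sentence at a punctuation mark unless the mark belongs to an abbreviation can be sketched as follows. This is not SentencePro's implementation: the abbreviation list, the set of sentence-ending marks, and the whitespace-based splitting are all assumptions chosen to keep the example short.

```python
# Hypothetical abbreviation list; SentencePro's rules come from its XML config.
ABBREVIATIONS = {"U.S.", "Mr.", "Dr."}
SENTENCE_END = ".?!"

def split_sentences(text):
    """Split text into sentences at ., ? or ! unless the word is an abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in SENTENCE_END and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # text that does not end with a sentence-ending mark
        sentences.append(" ".join(current))
    return sentences

text = ("Barack Obama was elected U.S. president in 2012. "
        "He'll complete his term in January 2017.")
for sentence in split_sentences(text):
    print(sentence)
```

With the input text used throughout this document, the full stops in "U.S." do not close the first sentence, so two sentences are returned.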


    TagPro marks each word in a text with a PoS (e.g. noun, verb, adjective, etc.) according to a predefined tagset, i.e. a subset of the ELRA tagset for Italian (see Appendix A) and the BNC tagset for English (see Appendix B). Example:
    token		pos
    Barack		NP0
    Obama		NP0
    was		VBD
    elected		VVN
    U.S.		NP0
    president	NN1
    in		PRP
    2012		CRD
    .		PUN
    He		PNP
    'll		VM0
    complete	VVI
    his		DPS
    term		NN1
    in		PRP
    January		NP0
    2017		CRD
    .		PUN


    MorphoPro assigns morphological features (e.g. tense, person and number for verbs) to words in a text, with the possibility of disambiguating the possible morphological readings of a word in context. When you call MorphoPro, the system can create two different columns:

    - full_morpho provides all the possible morphological analyses for a token. For example, "bearing" can be a verb, a noun or an adjective:
    token	full_morpho
    bearing	bear+v+gerund+pres bearing+n+sing bearing+adj+zero 
    Note that the three morphological analyses in the example above are all contained in one column, separated by blanks (column separators, on the other hand, are tabs).

    - comp_morpho contains only the morphological analysis or analyses compatible with the PoS tag selected by TagPro; in our example, TagPro selects a verbal tag (VVG), thus the only compatible morphological analysis is bear+v+gerund+pres:
    token	pos	comp_morpho
    bearing	VVG	bear+v+gerund+pres 

    token		full_morpho
    Barack		__NULL__
    Obama		__NULL__
    was		be+v+indic+past
    elected		elect+v+indic+past elect+v+part+past elected+adj+zero
    U.S.		u.s.+pn
    president	president+n+sing
    in		in+prep in+adv in+n+sing in+adj+zero
    2012		__NULL__
    .		.+punc
    He		he+n+sing he+pron
    'll		will+v+indic+pres+_
    complete	complete+v+indic+pres+no3sing complete+v+infin+pres complete+adj+zero
    his		hi+n+plur his+adj+zero his+pron
    term		term+n+sing term+n+plur term+v+indic+pres+no3sing term+v+infin+pres
    in		in+prep in+adv in+n+sing in+adj+zero
    January		january+pn january+n+sing
    2017		__NULL__
    .		.+punc
    token		pos	comp_morpho
    Barack		NP0	__NULL__
    Obama		NP0	__NULL__
    was		VBD	be+v+indic+past
    elected		VVN	elect+v+part+past
    U.S.		NP0	u.s.+pn
    president	NN1	president+n+sing
    in		PRP	in+prep
    2012		CRD	__NULL__
    .		PUN	.+punc
    He		PNP	he+pron
    'll		VM0	will+v+indic+pres+_
    complete	VVI	complete+v+infin+pres
    his		DPS	his+adj+zero
    term		NN1	term+n+sing term+n+plur
    in		PRP	in+prep
    January		NP0	january+pn
    2017		CRD	__NULL__
    .		PUN	.+punc
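The relation between full_morpho and comp_morpho shown in the two tables above can be approximated by keeping only the analyses whose features agree with the PoS tag. The sketch below is an assumption-laden illustration, not MorphoPro's actual procedure: the REQUIRED_FEATURES table is hypothetical and covers only the tags needed for the examples in this document.

```python
# Hypothetical mapping from a BNC tag to the morphological features an
# analysis must carry to be considered compatible with that tag.
REQUIRED_FEATURES = {
    "VVG": {"v", "gerund"},
    "VVN": {"v", "part", "past"},
    "VVI": {"v", "infin"},
    "VBD": {"v", "past"},
    "NN1": {"n"},
    "NP0": {"pn"},
    "PRP": {"prep"},
    "PNP": {"pron"},
    "DPS": {"adj"},
}

def compatible_analyses(full_morpho, pos):
    """Filter the blank-separated analyses of full_morpho by PoS tag."""
    if full_morpho == "__NULL__":
        return []
    analyses = full_morpho.split()
    required = REQUIRED_FEATURES.get(pos)
    if required is None:
        return analyses  # unknown tag: keep everything (assumption)
    kept = []
    for analysis in analyses:
        # An analysis looks like lemma+feature+feature+...
        features = set(analysis.split("+")[1:])
        if required <= features:
            kept.append(analysis)
    return kept

print(compatible_analyses(
    "bear+v+gerund+pres bearing+n+sing bearing+adj+zero", "VVG"))
```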


    LemmaPro selects a lemma (i.e. a normalized word form, such as the infinitive form for a verb) for words in a text. The module takes as input both the morphological analysis produced by MorphoPro and the PoS label assigned by TagPro, and applies linguistic rules to select the compatible lemma(s). Example:
    token		pos	comp_morpho		lemma
    Barack		NP0	__NULL__		Barack
    Obama		NP0	__NULL__		Obama
    was		VBD	be+v+indic+past		be
    elected		VVN	elect+v+part+past	elect
    U.S.		NP0	u.s.+pn			u.s.
    president	NN1	president+n+sing	president
    in		PRP	in+prep			in
    2012		CRD	__NULL__		2012
    .		PUN	.+punc			.
    He		PNP	he+pron			he
    'll		VM0	will+v+indic+pres+_	will
    complete	VVI	complete+v+infin+pres	complete
    his		DPS	his+adj+zero		his
    term		NN1	term+n+sing term+n+plur	term
    in		PRP	in+prep			in
    January		NP0	january+pn		january
    2017		CRD	__NULL__		2017
    .		PUN	.+punc			.
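In the table above, the lemma coincides with the first element of the compatible morphological analysis, and with the token itself when comp_morpho is __NULL__. The sketch below reproduces only that observable pattern; LemmaPro applies linguistic rules that are not described here, and the way several distinct lemmas are joined (with a blank) is an assumption.

```python
def lemma_from(token, comp_morpho):
    """Derive a lemma from a comp_morpho value, falling back to the token."""
    if comp_morpho == "__NULL__":
        return token
    lemmas = []
    for analysis in comp_morpho.split():
        # The lemma is the part before the first "+" of each analysis.
        lemma = analysis.split("+", 1)[0]
        if lemma not in lemmas:
            lemmas.append(lemma)
    return " ".join(lemmas)

for token, comp in [("was", "be+v+indic+past"),
                    ("term", "term+n+sing term+n+plur"),
                    ("2012", "__NULL__")]:
    print(token, lemma_from(token, comp), sep="\t")
```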
    Mapping from TextPro PoS tag to WordNet PoS tag: the column wnpos contains a mapping between the PoS label assigned by TagPro and the corresponding PoS label used in WordNet, i.e. n (for nouns), v (for verbs), a (for adjectives), and r (for adverbs).

    For Italian (ELRA tagset):
    - n (noun): SS,SP,SN,SPN,YA,YF
    - a (adjective): AS,AP,AN,DS,DP,DN
    - r (adverb): B

    For English (BNC tagset):
    - n (noun): ITJ,NN0,NN1,NN2,NP0,UNC,ZZ0
    - a (adjective): AJ0,AJC,AJS,CRD,DPS,ORD
    - r (adverb): AV0,AVP,AVQ,XX0

    token		pos	wnpos
    Barack		NP0	n
    Obama		NP0	n
    was		VBD	v
    elected		VVN	v
    U.S.		NP0	n
    president	NN1	n
    in		PRP	__NULL__
    2012		CRD	n
    .		PUN	__NULL__
    He		PNP	__NULL__
    'll		VM0	v
    complete	VVI	v
    his		DPS	a
    term		NN1	n
    in		PRP	__NULL__
    January		NP0	n
    2017		CRD	n
    .		PUN	__NULL__
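The English mapping listed above can be expressed as a lookup table. This is a sketch under stated assumptions: the lists for n, a and r are copied from this document, while the rule that every tag starting with V maps to v is an assumption inferred from the example table (VBD, VVN, VM0 and VVI all receive v), since no verb list is given.

```python
WNPOS_ENGLISH = {
    "n": ["ITJ", "NN0", "NN1", "NN2", "NP0", "UNC", "ZZ0"],
    # CRD follows the list in this document; the example table shows n
    # for CRD tokens, so the real mapping may differ.
    "a": ["AJ0", "AJC", "AJS", "CRD", "DPS", "ORD"],
    "r": ["AV0", "AVP", "AVQ", "XX0"],
}

# Invert the table: BNC tag -> WordNet PoS label.
TAG_TO_WNPOS = {tag: wn for wn, tags in WNPOS_ENGLISH.items() for tag in tags}

def wnpos(pos):
    """Map a BNC PoS tag to a WordNet PoS label, or __NULL__ if none applies."""
    if pos in TAG_TO_WNPOS:
        return TAG_TO_WNPOS[pos]
    if pos.startswith("V"):  # assumption: all verbal tags map to v
        return "v"
    return "__NULL__"

for tag in ["NP0", "VBD", "PRP", "DPS"]:
    print(tag, wnpos(tag), sep="\t")
```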

    ChunkPro: CHUNKING

    ChunkPro groups words into flat (i.e. without nesting) syntactic constituents (e.g. nominal phrases, verbal phrases, etc.), providing both a shallow syntactic analysis of a text and an intermediate step toward full parsing.
    The module is available with pre-trained models in the news domain, for two languages, English and Italian.

    For Italian, ChunkPro annotates syntactic chunks of the following types: NP (Nominal Phrase) and VX (Verbal Phrase).

    For English, the list of syntactic chunk types is the following (from the Penn Treebank):

    - ADJP: Adjective Phrase;
    - ADVP: Adverb Phrase;
    - CONJP: Conjunction Phrase;
    - INTJ: Interjection (it corresponds approximately to the PoS tag UH);
    - LST: List marker (it includes surrounding punctuation);
    - NP: Noun Phrase;
    - PP: Prepositional Phrase;
    - PRT: Particle (it corresponds approximately to the PoS tag RP);
    - SBAR: Clause introduced by a (possibly empty) subordinating conjunction;
    - UCP: Unlike Coordinated Phrase;
    - VP: Verb Phrase.

    The column produced by the system (i.e. chunk) contains a label consisting of two parts:
    - the IOB2 tag: B (for "begin") denotes the first token of a chunk, I (for "inside") is used for all other tokens in a chunk, and O (for "outside") is used for tokens that do not belong to any chunk;
    - the tag indicating the type of chunk.

    token		chunk
    Mister		B-NP
    Barack		I-NP
    Obama		I-NP
    was		B-VP
    elected		I-VP
    U.S.		B-NP
    president	I-NP
    in		B-PP
    2012		B-NP
    .		O
    He		B-NP
    'll		B-VP
    complete	I-VP
    his		B-NP
    term		I-NP
    in		B-PP
    January		B-NP
    2017		I-NP
    .		O
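IOB2 labels such as those in the chunk column can be turned back into spans by starting a new span at each B tag and extending it while I tags of the same type follow. The sketch below uses the first sentence of the table above; the function name iob2_spans is an assumption, and the same logic would apply to any other column that uses IOB2 labels.

```python
def iob2_spans(tokens, tags):
    """Group tokens into (type, [tokens]) spans according to IOB2 tags."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_type:
            current_tokens.append(token)  # continue the open span
            continue
        if current_tokens:
            spans.append((current_type, current_tokens))  # close the open span
        if prefix in ("B", "I"):
            # B opens a span; a stray I without a matching B is treated the same.
            current_type, current_tokens = label, [token]
        else:
            current_type, current_tokens = None, []  # O: outside any span
    if current_tokens:
        spans.append((current_type, current_tokens))
    return spans

tokens = ["Mister", "Barack", "Obama", "was", "elected",
          "U.S.", "president", "in", "2012", "."]
tags = ["B-NP", "I-NP", "I-NP", "B-VP", "I-VP",
        "B-NP", "I-NP", "B-PP", "B-NP", "O"]
for chunk_type, words in iob2_spans(tokens, tags):
    print(chunk_type, " ".join(words), sep="\t")
```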


    EntityPro annotates named entities of the following types for English and Italian: Person (PER), Organization (ORG), and Location (LOC). Only for Italian, it further distinguishes entities of type Location (LOC) from entities of type Geo-Political entity (GPE). Only for English, it also annotates entities of type Miscellaneous (MISC).

    The column produced by the system (i.e. entity) contains a label consisting of two parts:
    - the IOB2 tag: B (for "begin") denotes the first token of a Named Entity, I (for "inside") is used for all other tokens in a Named Entity, and O (for "outside") is used for tokens that do not belong to any Named Entity;
    - the Entity type tag: PER (for Person), ORG (for Organization), GPE (for Geo-Political Entity), LOC (for Location), and MISC (Miscellaneous).

    token		entity
    Barack		B-PER
    Obama		I-PER
    was		O
    elected		O
    U.S.		B-LOC
    president	O
    in		O
    2012		O
    .		O
    He		O
    'll		O
    complete	O
    his		O
    term		O
    in		O
    January		O
    2017		O
    .		O


    TimePro annotates both absolute time expressions (e.g. "April 3rd 2016") and relative time expressions (e.g. "tomorrow") in a text.
    Temporal expressions are first classified (e.g. dates, durations, times) and then normalized according to the ISO-TimeML specifications.
    TimePro is available with pre-trained models in the news domain for Italian and English.
    It is trained on the EVENTI@Evalita2014 dataset for time expression recognition and classification. For time expression normalization, the timenorm library has been adapted to Italian.

    For each token it produces the following columns:
    - tmx: it contains the type of the time expression (i.e. DATE, TIME, SET or DURATION) in IOB2 format;
    - tmxid: it contains the ID of the time expression;
    - tmxvalue: it contains the normalized value of the time expression.


    GeoCoder assigns geographical coordinates (e.g. latitude and longitude) to names of locations in a text, making their visualization on a map possible. Given the output of EntityPro, the module disambiguates a location using a repository of toponyms (i.e. GeoNames) as an index and, for each location, it provides both a fine grained category (e.g. City for "Honolulu") and its coordinates.


    SyntaxPro assigns syntactic relations between words in a text based on syntactic dependencies. It is based on MaltParser, a state-of-the-art, data-driven, transition-based dependency parser that induces a parsing model from a treebank and parses new data using the induced model. For Italian, it has been trained on TUT (Turin University Treebank).

    For each token it adds the following information:
    - parserid: the ID of the token within the sentence
    - feats: the morphological features of the token
    - head: the ID of the head of the token
    - deprel: the label of the dependency between the token and its head
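The parserid and head columns together encode a tree: each token points to the ID of its head. The sketch below rebuilds the head-to-dependents structure from such rows. The rows, tokens and dependency labels are hypothetical and were not produced by SyntaxPro; the convention that head 0 marks the root of the sentence is also an assumption (it is the convention of the CoNLL-X format).

```python
from collections import defaultdict

# Hypothetical rows for illustration only (not actual SyntaxPro output).
ROWS = [
    {"token": "Il", "parserid": "1", "head": "2", "deprel": "det"},
    {"token": "governo", "parserid": "2", "head": "3", "deprel": "subj"},
    {"token": "annuncia", "parserid": "3", "head": "0", "deprel": "root"},
    {"token": "tagli", "parserid": "4", "head": "3", "deprel": "obj"},
]

def children_by_head(rows):
    """Map each head ID to the list of IDs of its direct dependents."""
    children = defaultdict(list)
    for row in rows:
        children[int(row["head"])].append(int(row["parserid"]))
    return dict(children)

def root_ids(rows):
    """Return the IDs of tokens whose head is 0 (assumed to mark the root)."""
    return [int(row["parserid"]) for row in rows if int(row["head"]) == 0]

print(children_by_head(ROWS))
print(root_ids(ROWS))
```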


    EventPro is a module for event detection and classification in a text. It annotates events following the TimeML specifications and classifies them into one of the seven TimeML classes (REPORTING, PERCEPTION, I_ACTION, etc.). As an example, the word "annunciato" (announced) is an event of type REPORTING expressed by a verb, while the word "tagli" (cuts) is an event of type OCCURRENCE expressed by a noun. Both event recognition and classification are based on a statistical classifier. EventPro is available with pre-trained models in the news domain for Italian and English.

    For each token EventPro produces the following columns:
    - event: it provides the information whether the token is inside the span of an event in IOB2 format (B-EVENT, I-EVENT or O);
    - eventid: it provides the ID of the event;
    - eventclass: it provides the TimeML class of the event (OCCURRENCE, STATE, I_STATE, REPORTING, PERCEPTION, I_ACTION, ASPECTUAL).


    FactPro assigns a factuality value to each event mentioned in a text. Factuality is described through three attributes: polarity, certainty and event time. The annotation follows the framework used in Fact-Ita Bank. As an example, the event mentioned with "annunciato" has been classified with polarity=POS (the occurrence of the event has been affirmed), with certainty=CERTAIN (the source is certain about the event), and with event time=NON_FUTURE (the event has occurred in the past).
    FactPro is based on a statistical classifier. It is available with pre-trained models in the news domain for Italian.

    It is trained on Fact-Ita Bank. For each token, it adds the following columns, which describe the factuality of the event detected by the EventPro module:
    - polarity: the polarity of the event (POL_POS, POL_NEG, O);
    - certainty: how certain the source is about the event (CERTAIN, UNCERTAIN, O);
    - evtime: when an event occurred or will occur (NON_FUTURE, FUTURE, O).


    TempRelPro identifies in a text the temporal relations holding between events and time expressions and classifies them following the TimeML specification. It identifies temporal relations of 13 different types (e.g. BEFORE, AFTER, INCLUDES, IS_INCLUDED, etc.); for example, a temporal relation of type IS_INCLUDED can hold between the event "annunciato" (announced) and the time expression "ieri" (yesterday). The module is based on a statistical classifier and is available with pre-trained models in the news domain for Italian and English.
    It has been trained on the EVENTI@Evalita2014 dataset. It detects temporal relations between two events, between one event and one time expression, and between two time expressions.
    A line is added in the header containing all the temporal relations extracted by the module:
    # TLINKS: (e1;IS_INCLUDED;tmx1) (tmx0;AFTER;tmx1)
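The TLINKS header line shown above can be split into (source, relation, target) triples with a regular expression. This is a minimal sketch that assumes every relation is written as three semicolon-separated values inside parentheses, exactly as in the example; the function name parse_tlinks is an assumption.

```python
import re

# One relation: three values separated by ";" inside parentheses.
TLINK_PATTERN = re.compile(r"\(([^;()]+);([^;()]+);([^;()]+)\)")

def parse_tlinks(header_line):
    """Return a list of (source, relation, target) tuples from a TLINKS line."""
    prefix = "# TLINKS:"
    if not header_line.startswith(prefix):
        raise ValueError("not a TLINKS header line")
    return TLINK_PATTERN.findall(header_line[len(prefix):])

line = "# TLINKS: (e1;IS_INCLUDED;tmx1) (tmx0;AFTER;tmx1)"
for source, relation, target in parse_tlinks(line):
    print(source, relation, target)
```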


    KeyPro (formerly called KX) is a tool for key-phrase extraction, which exploits basic linguistic annotations combined with simple statistical measures to select a list of weighted keywords from a text. In the header lines, it produces a line in which # KEYWORDS: is followed by the most important keywords of the input text (in lower case). Each keyword is followed by <NUM1 NUM2>: NUM1 indicates the relevance of the keyword, while NUM2 indicates the number of occurrences of the keyword in the text.
    # FILE: example.txt
    # LANGUAGE: eng
    # TIMESTAMP: 2016-10-06T18:08:15+0200
    # KEYWORDS: barack obama <173.91 1> elected <51.2 1> president <49.28 1> january <47.84 1> term <47.79 1>
    # FIELDS: token
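The KEYWORDS header line can be parsed into keyword, relevance and occurrence count. The sketch below assumes the layout of the example above, where each keyword is followed by <NUM1 NUM2> with a single blank between the two numbers; the function name parse_keywords is an assumption.

```python
import re

# A keyword (possibly several words) followed by <relevance occurrences>.
KEYWORD_PATTERN = re.compile(r"\s*(.+?)\s*<([0-9.]+) ([0-9]+)>")

def parse_keywords(header_line):
    """Return a list of (keyword, relevance, occurrences) tuples."""
    prefix = "# KEYWORDS:"
    if not header_line.startswith(prefix):
        raise ValueError("not a KEYWORDS header line")
    body = header_line[len(prefix):]
    return [(keyword, float(relevance), int(count))
            for keyword, relevance, count in KEYWORD_PATTERN.findall(body)]

line = ("# KEYWORDS: barack obama <173.91 1> elected <51.2 1> "
        "president <49.28 1> january <47.84 1> term <47.79 1>")
for keyword, relevance, count in parse_keywords(line):
    print(keyword, relevance, count, sep="\t")
```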

    Appendix A: the ELRA tagset

    Italian Tag	Description	Example
    XPS	punctuation	. ; : ? !
    XPW	comma	,
    XPB	brackets	( )
    XPO	quotation mark, ellipsis, hyphen	`` '' - ... _
    N	number	1, 1999, '76, sei, sesto
    RS	singular article	il, l', la, un, una
    RP	plural article	i, gli, gl', le
    AS	singular qual. adj.	vera, grandissimo, migliore
    AP	plural qual. adj.	vere, grandissimi, maggiori
    AN	qual. adj. neutral for number	rosa, più, super, antincendio
    DS	singular det. adj.	quello, alcuna, mio, quale?
    DP	plural det. adj.	quelli, alcune, miei, quali?
    DN	det. adj. neutral for number	qualsiasi
    E	simple preposition	di, a, dopo, fino, nonostante
    ES	singular articulated preposition	dal, sulla, nello
    EP	plural articulated preposition	dalle, sulle, negli, nei, ai
    B	adverb	molto, invece, esattamente
    C	conjunction	e, ma, bensì, sia, perché
    CCHE	che	che
    CCHI	chi	chi
    CADV	adverbial connective	come, dove, quando
    PS	singular pronoun	ciascuna, lo, mio
    PP	plural pronoun	costoro, esse, nostri, loro
    PN	pronoun neutral for number	ci, cui, sé
    SS	singular noun	aereo, formula
    SP	plural noun	aerei, formule
    SN	noun neutral for number	attività, business, novità
    SPN	proper noun	Alfredo, Ford, Piombino
    QNS	singular relative pronoun	quanto, quanta
    QNP	plural relative pronoun	quanti, quante
    YA	acronym	ANSA, CEE, ONU
    YF	foreign term	city, fiesta, Papier
    I	interjection	oh!
    VI	main verb, ind., subjunctive, cond.	vedo, giungano, saprei
    VIY	aux. verb, ind., subjunctive, cond.	ho, sia, avrebbe
    VF	main verb, inf.	arrivare, vedere
    VFY	aux. verb, inf.	avere, essere
    VSP	main verb, past part., singular	acquisito, interrotto
    VSPY	aux. verb, past part., singular	avente, stato, stata
    VPP	main verb, past part., plural	arrivati
    VPPY	aux. verb, past part., plural	state
    VG	main verb, gerund	cantando, ringraziando
    VGY	aux. verb, gerund	avendo, essendo
    VM	main verb, imperative	cercate, leggi
    VMY	aux. verb, imperative	sia, abbia
    +E	clitic	ne, ci

    Appendix B: the BNC tagset

    English Tag	Description
    AJ0	Adjective (general or positive) (e.g. good, old, beautiful)
    AJC	Comparative adjective (e.g. better, older)
    AJS	Superlative adjective (e.g. best, oldest)
    AT0	Article (e.g. the, a, an, no)
    AV0	General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest)
    AVP	Adverb particle (e.g. up, off, out)
    AVQ	Wh-adverb (e.g. when, where, how, why, wherever)
    CJC	Coordinating conjunction (e.g. and, or, but)
    CJS	Subordinating conjunction (e.g. although, when)
    CJT	The subordinating conjunction that
    CRD	Cardinal number (e.g. one, 3, fifty-five, 3609)
    DPS	Possessive determiner-pronoun (e.g. your, their, his)
    DT0	General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0.
    DTQ	Wh-determiner-pronoun (e.g. which, what, whose, whichever)
    EX0	Existential there, i.e. there occurring in the there is ... or there are ... construction
    ITJ	Interjection or other isolate (e.g. oh, yes, mhm, wow)
    NN0	Common noun, neutral for number (e.g. aircraft, data, committee)
    NN1	Singular common noun (e.g. pencil, goose, time, revelation)
    NN2	Plural common noun (e.g. pencils, geese, times, revelations)
    NP0	Proper noun (e.g. London, Michael, Mars, IBM)
    ORD	Ordinal numeral (e.g. first, sixth, 77th, last)
    PNI	Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody)
    PNP	Personal pronoun (e.g. I, you, them, ours)
    PNQ	Wh-pronoun (e.g. who, whoever, whom)
    PNX	Reflexive pronoun (e.g. myself, yourself, itself, ourselves)
    POS	The possessive or genitive marker 's or '
    PRF	The preposition of
    PRP	Preposition (except for of) (e.g. about, at, in, on, on behalf of, with)
    PUL	Punctuation: left bracket - i.e. ( or [
    PUN	Punctuation: general separating mark - i.e. . , ! , : ; - or ?
    PUQ	Punctuation: quotation mark - i.e. ' or "
    PUR	Punctuation: right bracket - i.e. ) or ]
    TO0	Infinitive marker to
    UNC	Unclassified items which are not appropriately considered as items of the English lexicon.
    VBB	The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative]
    VBD	The past tense forms of the verb BE: was and were
    VBG	The -ing form of the verb BE: being
    VBI	The infinitive form of the verb BE: be
    VBN	The past participle form of the verb BE: been
    VBZ	The -s form of the verb BE: is, 's
    VDB	The finite base form of the verb DO: do
    VDD	The past tense form of the verb DO: did
    VDG	The -ing form of the verb DO: doing
    VDI	The infinitive form of the verb DO: do
    VDN	The past participle form of the verb DO: done
    VDZ	The -s form of the verb DO: does, 's
    VHB	The finite base form of the verb HAVE: have, 've
    VHD	The past tense form of the verb HAVE: had, 'd
    VHG	The -ing form of the verb HAVE: having
    VHI	The infinitive form of the verb HAVE: have
    VHN	The past participle form of the verb HAVE: had
    VHZ	The -s form of the verb HAVE: has, 's
    VM0	Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd)
    VVB	The finite base form of lexical verbs (e.g. forget, send, live, return) [Including the imperative and present subjunctive]
    VVD	The past tense form of lexical verbs (e.g. forgot, sent, lived, returned)
    VVG	The -ing form of lexical verbs (e.g. forgetting, sending, living, returning)
    VVI	The infinitive form of lexical verbs (e.g. forget, send, live, return)
    VVN	The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned)
    VVZ	The -s form of lexical verbs (e.g. forgets, sends, lives, returns)
    XX0	The negative particle not or n't
    ZZ0	Alphabetical symbols (e.g. A, a, B, b, c, d)