Annotation layers
- December 2016 -
TextPro (http://textpro.fbk.eu/) is a set of modules which perform pre-processing (web page cleaning) and a number of NLP tasks: tokenization, sentence splitting, morphological analysis, pos-tagging, lemmatization, chunking, named-entity recognition, time expression recognition, assignment of geographical coordinates to toponyms, dependency parsing, event and event factuality recognition, temporal relation recognition, and keyword extraction.
The input given to TextPro should contain only pure text (hub text, see the CES guidelines for corpus encoding: http://www.cs.vassar.edu/CES/CES1-5.html#ToCOview).
If a text contains any markup, this should be removed. Example of input file:
Barack Obama was elected U.S. president in 2012.
He'll complete his term in January 2017.
Pre-processing
CleanPro: HTML CLEANING
CleanPro enables the use of HTML input files by cleaning them. With the -html option, all mark-up tags and irrelevant text (e.g. navigation menus, common headers and footers) are removed from HTML pages, while the relevant text is kept as input for the other modules.
NLP processing
The output file produced by TextPro contains:
Header lines. Some meta information, such as the name of the file, the document's creation date, the author, etc., can be coded in the first lines. These lines begin with the special character #, followed by the name of the field and its value.
Annotation lines. They contain the output information for each token of the input file. TextPro uses a tabular format in which the columns are separated by a tab character. The TextPro tabular format is meant for the output of the modules, not for representing information about the structure of the text (e.g. titles, paragraphs, emphasis). The output of each module (except for KeyPro, see below) consists of one or more added columns containing a value for each token. If a module does not have any output to give for a token, the corresponding column will contain the value __NULL__ for that token. List of all the TextPro annotation columns: token, tokennorm, tokenid, tokenstart, tokenend, tokentype, pos, full_morpho, comp_morpho, lemma, wnpos, chunk, entity, parserid, feats, head, deprel.
The end of sentence is marked with an empty line.
Example of header lines:
# FILE: example.txt
# ID: 000055740
# CY: US
# DNUM: 06002654
# DATE: 19991214
# KIND: A
# FIELDS: token pos lemma
Example of annotation lines:
Barack NP0 Barack
Obama NP0 Obama
was VBD be
elected VVN elect
U.S. NP0 u.s.
president NN1 president
in PRP in
2012 CRD 2012
. PUN .
The output reported above is obtained with the following command:
$> textpro.sh -l eng -c token+pos+lemma -o /tmp/ example.txt
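The output format described above (header lines starting with #, tab-separated annotation lines, an empty line at the end of each sentence) can be read back with a short script. The following is only a minimal sketch: the function name parse_textpro is hypothetical and not part of TextPro, and it assumes that the # FIELDS: header lists the column names separated by whitespace.

```python
def parse_textpro(text):
    """Parse TextPro tabular output into (header, sentences).

    header: dict mapping header field names to their values.
    sentences: list of sentences, each a list of dicts that map
    a column name to the value for one token.
    """
    header, sentences, current, fields = {}, [], [], []
    for line in text.splitlines():
        if line.startswith("# "):
            # Header line: "# NAME: value"
            name, _, value = line[2:].partition(":")
            header[name.strip()] = value.strip()
            if name.strip() == "FIELDS":
                fields = value.split()
        elif not line.strip():
            # Empty line: end of the current sentence
            if current:
                sentences.append(current)
                current = []
        else:
            # Annotation line: one tab-separated value per column
            current.append(dict(zip(fields, line.split("\t"))))
    if current:
        sentences.append(current)
    return header, sentences


sample = (
    "# FILE: example.txt\n"
    "# FIELDS: token\tpos\tlemma\n"
    "Barack\tNP0\tBarack\n"
    "Obama\tNP0\tObama\n"
    ".\tPUN\t.\n"
    "\n"
    "He\tPNP\the\n"
)
header, sentences = parse_textpro(sample)
print(header["FILE"], len(sentences), sentences[0][0]["pos"])
```

Splitting annotation lines on tabs only (rather than on any whitespace) matters because some columns, such as full_morpho, contain blanks inside a single cell.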
TokenPro: TOKENIZATION
TokenPro breaks up a text into elements called tokens (e.g. words, punctuation marks, etc.); when you call TokenPro, the system creates a column called token containing one token in each line.
For instance, in the example below, the full stops "." are considered as single tokens just as all the words in the sentence. Example:
token
Barack
Obama
was
elected
U.S.
president
in
2012
.
He
'll
complete
his
term
in
January
2017
.
TokenPro can be fully customized from an XML configuration file, in order to define specific splitting rules or to define the behavior for specific UTF-8 symbols.
The list of all the columns TokenPro can produce is the following:
token: contains the token
tokennorm: contains a normalized version of the token (for example, accents are normalized)
tokenid: contains the ID given to the token by the system
tokenstart: contains the offset indicating the first character of the token
tokenend: contains the offset at which the token ends (in the example below, the offset immediately following the last character of the token)
tokentype: tokens are labeled according to six categories, i.e. UPP (uppercase), LOW (lowercase), CAP (capitalized), DIG (digit), PUN (punctuation), and OTH (other).
token tokennorm tokenid tokenstart tokenend tokentype
They They 1 0 4 UPP
have have 2 5 9 LOW
been been 3 10 14 LOW
working working 4 15 22 LOW
at at 5 23 25 LOW
FBK FBK 6 26 29 CAP
for for 7 30 33 LOW
15 15 8 34 36 DIG
years years 9 37 42 LOW
. . 10 42 43 PUN
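The offsets in the table above can be used to recover each token from the input text. The sketch below assumes that the input sentence was "They have been working at FBK for 15 years." with single spaces (reconstructed from the table, not given in this document) and that offsets are 0-based character positions; the helper name check_offsets is hypothetical.

```python
# Input sentence reconstructed from the table (an assumption)
text = "They have been working at FBK for 15 years."

# (token, tokenstart, tokenend) triples copied from the table above
rows = [
    ("They", 0, 4), ("have", 5, 9), ("been", 10, 14),
    ("working", 15, 22), ("at", 23, 25), ("FBK", 26, 29),
    ("for", 30, 33), ("15", 34, 36), ("years", 37, 42), (".", 42, 43),
]


def check_offsets(text, rows):
    """Return True if every token equals text[tokenstart:tokenend]."""
    return all(text[start:end] == token for token, start, end in rows)


print(check_offsets(text, rows))
```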
SentencePro: SENTENCE SPLITTING
SentencePro identifies sentence boundaries in a text; it marks the end of a sentence when a specific punctuation mark (for instance a full stop, a question mark, an exclamation mark, etc.) is found, unless it is part of a linguistic expression, like an abbreviation (as the full stops in "U.S." which stands for "United States") or an acronym. Example:
token
Barack
Obama
was
elected
U.S.
president
in
2012
.

He
'll
complete
his
term
in
January
2017
.
SentencePro can be fully customized from an XML configuration file, in order to define specific sentence-ending rules.
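The basic idea can be sketched at the token level. This is not SentencePro's implementation, only an illustration under the assumption that tokenization has already kept abbreviations such as "U.S." as single tokens, so that a sentence ends only at a token that is itself a punctuation mark; the names SENTENCE_END and split_sentences are hypothetical.

```python
# Tokens that close a sentence in this simplified sketch
SENTENCE_END = {".", "?", "!"}


def split_sentences(tokens):
    """Group tokens into sentences, closing one at each end mark."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens without a final punctuation mark
        sentences.append(current)
    return sentences


tokens = ["Barack", "Obama", "was", "elected", "U.S.", "president",
          "in", "2012", ".", "He", "'ll", "complete", "his", "term",
          "in", "January", "2017", "."]
for sentence in split_sentences(tokens):
    print(" ".join(sentence))
```

Because "U.S." is compared as a whole token, its internal full stops never trigger a split.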
TagPro: POS-TAGGING
TagPro marks each word in a text with a PoS (e.g. noun, verb, adjective, etc.) according to a predefined tagset, i.e. a subset of the ELRA tagset for Italian (see Appendix A) and the BNC tagset for English (see Appendix B). Example:
token pos
Barack NP0
Obama NP0
was VBD
elected VVN
U.S. NP0
president NN1
in PRP
2012 CRD
. PUN
He PNP
'll VM0
complete VVI
his DPS
term NN1
in PRP
January NP0
2017 CRD
. PUN
MorphoPro: MORPHOLOGICAL ANALYSIS
MorphoPro assigns morphological features (e.g. tense, person and number for verbs) to words in a text, with the possibility of disambiguating the possible morphological readings of a word in context. When you call MorphoPro, the system can create two different columns:
- full_morpho provides all the possible morphological analyses for a token. For example, "bearing" can be a verb, a noun or an adjective:
token full_morpho
bearing bear+v+gerund+pres bearing+n+sing bearing+adj+zero
Note that the three morphological analyses in the example above are all contained in one column, separated by blanks (column separators, on the other hand, are tabs).
- comp_morpho contains only the morphological analysis/es that is/are compatible with the PoS tag selected by TagPro; in our example, TagPro selects a verbal tag (VVG), thus the only compatible morphological analysis is bear+v+gerund+pres:
token pos comp_morpho
bearing VVG bear+v+gerund+pres
Example:
token full_morpho
Barack __NULL__
Obama __NULL__
was be+v+indic+past
elected elect+v+indic+past elect+v+part+past elected+adj+zero
U.S. u.s.+pn
president president+n+sing
in in+prep in+adv in+n+sing in+adj+zero
2012 __NULL__
. .+punc
He he+n+sing he+pron
'll will+v+indic+pres+_
complete complete+v+indic+pres+no3sing complete+v+infin+pres complete+adj+zero
his hi+n+plur his+adj+zero his+pron
term term+n+sing term+n+plur term+v+indic+pres+no3sing term+v+infin+pres
in in+prep in+adv in+n+sing in+adj+zero
January january+pn january+n+sing
2017 __NULL__
. .+punc
token pos comp_morpho
Barack NP0 __NULL__
Obama NP0 __NULL__
was VBD be+v+indic+past
elected VVN elect+v+part+past
U.S. NP0 u.s.+pn
president NN1 president+n+sing
in PRP in+prep
2012 CRD __NULL__
. PUN .+punc
He PNP he+pron
'll VM0 will+v+indic+pres+_
complete VVI complete+v+infin+pres
his DPS his+adj+zero
term NN1 term+n+sing term+n+plur
in PRP in+prep
January NP0 january+pn
2017 CRD __NULL__
. PUN .+punc
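The relation between full_morpho and comp_morpho can be illustrated with a coarse filter that keeps only the analyses whose category agrees with the PoS tag. This is a simplified sketch, not MorphoPro's actual rules: the mapping CATEGORY_BY_PREFIX is hypothetical and partial, and the filter ignores finer distinctions (such as participle versus indicative) that the real comp_morpho column makes.

```python
# Hypothetical, partial mapping from BNC tag prefixes to the category
# label used inside the analyses; not TextPro's own table.
CATEGORY_BY_PREFIX = {"V": "v", "NN": "n", "AJ": "adj"}


def parse_analysis(analysis):
    """Split 'bear+v+gerund+pres' into (lemma, category, features)."""
    lemma, category, *features = analysis.split("+")
    return lemma, category, features


def compatible(full_morpho, pos):
    """Keep the analyses whose category matches the PoS tag prefix."""
    wanted = next((cat for prefix, cat in CATEGORY_BY_PREFIX.items()
                   if pos.startswith(prefix)), None)
    if full_morpho == "__NULL__" or wanted is None:
        return []
    # Analyses inside one cell are separated by blanks
    return [a for a in full_morpho.split()
            if parse_analysis(a)[1] == wanted]


full = "bear+v+gerund+pres bearing+n+sing bearing+adj+zero"
print(compatible(full, "VVG"))
```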
LemmaPro: LEMMATIZATION
LemmaPro selects a lemma (i.e. a normalized word form, such as the infinitive form for a verb) for words in a text. The module takes as input both the morphological analysis produced by MorphoPro and the PoS label assigned by TagPro, and applies linguistic rules to select the compatible lemma(s). Example:
token pos comp_morpho lemma
Barack NP0 __NULL__ Barack
Obama NP0 __NULL__ Obama
was VBD be+v+indic+past be
elected VVN elect+v+part+past elect
U.S. NP0 u.s.+pn u.s.
president NN1 president+n+sing president
in PRP in+prep in
2012 CRD __NULL__ 2012
. PUN .+punc .
He PNP he+pron he
'll VM0 will+v+indic+pres+_ will
complete VVI complete+v+infin+pres complete
his DPS his+adj+zero his
term NN1 term+n+sing term+n+plur term
in PRP in+prep in
January NP0 january+pn january
2017 CRD __NULL__ 2017
. PUN .+punc .
Mapping from TextPro POS-tag to WordNet POS-tag: The column wnpos contains a mapping between the PoS-tagging label assigned by TagPro and the corresponding PoS label used in WordNet (https://wordnet.princeton.edu/), i.e. n (for nouns), v (for verbs), a (for adjectives), and r (for adverbs).
For Italian (ELRA tagset):
- n (noun): SS,SP,SN,SPN,YA,YF
- v (verb): VI,VI+E,VIY,VIY+E,VF,VF+E,VFY,VFY+E,VSP,VSP+E,VSPY,VSPY+E,VPP,VPP+E,VPPY,VPPY+E,VG,VG+E,VGY,VGY+E,VM,VM+E,VMY,VMY+E
- a (adjective): AS,AP,AN,DS,DP,DN
- r (adverb): B
For English (BNC tagset):
- n (noun): ITJ,NN0,NN1,NN2,NP0,UNC,ZZ0
- v (verb): VBB,VBD,VBG,VBI,VBN,VBZ,VDB,VDD,VDG,VDI,VDN,VDZ,VHB,VHD,VHG,VHI,VHN,VHZ,VM0,VVB,VVD,VVG,VVI,VVN,VVZ
- a (adjective): AJ0,AJC,AJS,CRD,DPS,ORD
- r (adverb): AV0,AVP,AVQ,XX0
Example:
token pos wnpos
Barack NP0 n
Obama NP0 n
was VBD v
elected VVN v
U.S. NP0 n
president NN1 n
in PRP __NULL__
2012 CRD n
. PUN __NULL__
He PNP __NULL__
'll VM0 v
complete VVI v
his DPS a
term NN1 n
in PRP __NULL__
January NP0 n
2017 CRD n
. PUN __NULL__
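The English mapping listed above can be expressed as a lookup table. This is a minimal sketch with hypothetical names (WNPOS_GROUPS, WNPOS, wnpos); it follows the tag list as given, so a tag such as CRD is mapped according to the list even though the example table reports a different value for it.

```python
# English (BNC tagset) groups, copied from the list above
WNPOS_GROUPS = {
    "n": "ITJ,NN0,NN1,NN2,NP0,UNC,ZZ0",
    "v": ("VBB,VBD,VBG,VBI,VBN,VBZ,VDB,VDD,VDG,VDI,VDN,VDZ,"
          "VHB,VHD,VHG,VHI,VHN,VHZ,VM0,VVB,VVD,VVG,VVI,VVN,VVZ"),
    "a": "AJ0,AJC,AJS,CRD,DPS,ORD",
    "r": "AV0,AVP,AVQ,XX0",
}

# Invert the groups into a tag -> WordNet PoS dictionary
WNPOS = {tag: wn
         for wn, tags in WNPOS_GROUPS.items()
         for tag in tags.split(",")}


def wnpos(pos):
    """Return the WordNet PoS for a BNC tag, or __NULL__ if unmapped."""
    return WNPOS.get(pos, "__NULL__")


for tag in ["NP0", "VBD", "DPS", "PRP"]:
    print(tag, wnpos(tag))
```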
ChunkPro: CHUNKING
ChunkPro groups words into flat (i.e. without nesting) syntactic constituents (e.g. nominal phrases, verbal phrases, etc.), providing both a shallow syntactic analysis of a text and an intermediate step toward full parsing.
The module is available with pre-trained models in the news domain, for two languages, English and Italian.
For Italian ChunkPro annotates syntactic chunks of the following types: NP (Nominal Phrases) and VX (Verbal Phrase).
For English the list of syntactic chunk types is the following (from the Penn Treebank):
- ADJP: Adjective Phrase;
- ADVP: Adverb Phrase;
- CONJP: Conjunction Phrase;
- INTJ: Interjection (it corresponds approximately to the PoS tag UH);
- LST: List marker (it includes surrounding punctuation);
- NP: Noun Phrase;
- PP: Prepositional Phrase;
- PRT: Particle (it corresponds approximately to the PoS tag RP);
- SBAR: Clause introduced by a (possibly empty) subordinating conjunction;
- UCP: Unlike Coordinated Phrase;
- VP: Verb Phrase.
The column produced by the system (i.e. chunk) contains a label consisting of two parts:
- the IOB2 tag: B (for "begin") denotes the first token of a chunk, I (for "inside") is used for all other tokens in a chunk, and O (for "outside") is used for all tokens outside any chunk;
- the tag indicating the type of chunk.
Example:
token chunk
Mister B-NP
Barack I-NP
Obama I-NP
was B-VP
elected I-VP
U.S. B-NP
president I-NP
in B-PP
2012 B-NP
. O
He B-NP
'll B-VP
complete I-VP
his B-NP
term I-NP
in B-PP
January B-NP
2017 I-NP
. O
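An IOB2 column such as the one above can be decoded back into spans with a single pass over the tokens. The sketch below is generic (it applies to any column in IOB2 format) and the function name iob2_spans is hypothetical; an I- tag that does not continue an open span of the same type is simply ignored here, which is one of several possible ways to handle ill-formed input.

```python
def iob2_spans(tokens, tags):
    """Return (type, text) pairs decoded from a sequence of IOB2 tags."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            # Start a new span
            current = (label, [token])
            spans.append(current)
        elif prefix == "I" and current is not None and current[0] == label:
            # Continue the open span
            current[1].append(token)
        else:
            # "O", or an I- tag that does not continue the open span
            current = None
    return [(label, " ".join(words)) for label, words in spans]


tokens = ["Mister", "Barack", "Obama", "was", "elected",
          "U.S.", "president", "in", "2012", "."]
tags = ["B-NP", "I-NP", "I-NP", "B-VP", "I-VP",
        "B-NP", "I-NP", "B-PP", "B-NP", "O"]
for label, text in iob2_spans(tokens, tags):
    print(label, text)
```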
EntityPro: NAMED-ENTITY RECOGNITION
EntityPro annotates named entities of the following types for English and Italian: Person (PER), Organization (ORG), and Location (LOC). Only for Italian, it further distinguishes entities of type Location (LOC) from entities of type Geo-Political entity (GPE). Only for English, it also annotates entities of type Miscellaneous (MISC).
The column produced by the system (i.e. entity) contains a label consisting of two parts:
- the IOB2 tag: B (for "begin") denotes the first token of a Named Entity, I (for "inside") is used for all other tokens in a Named Entity, and O (for "outside") is used for all tokens outside any Named Entity;
- the Entity type tag: PER (for Person), ORG (for Organization), GPE (for Geo-Political Entity), LOC (for Location), and MISC (Miscellaneous).
Example:
token entity
Barack B-PER
Obama I-PER
was O
elected O
U.S. B-LOC
president O
in O
2012 O
. O
He O
'll O
complete O
his O
term O
in O
January O
2017 O
. O
TimePro: TIME EXPRESSION RECOGNITION
TimePro annotates both absolute time expressions (e.g. "April 3rd 2016") and relative time expressions (e.g. "tomorrow") in a text.
Temporal expressions are first classified (e.g. dates, durations, times) and then normalized according to the ISO-TimeML specifications.
TimePro is available with pre-trained models in the news domain for Italian and English.
It is trained on the EVENTI@Evalita2014 dataset for time expression recognition and classification. For time expression normalization we have adapted the timenorm library (https://github.com/bethard/timenorm) to Italian.
For each token it produces the following columns:
- tmx: it contains the type of the time expression in IOB2 format (i.e. DATE, TIME, SET or DURATION);
- tmxid: it contains the ID of the time expression;
- tmxvalue: it contains the normalized value of the time expression.
GeoCoder: GEOGRAPHICAL COORDINATES
GeoCoder assigns geographical coordinates (i.e. latitude and longitude) to names of locations in a text, making their visualization on a map possible. Given the output of EntityPro, the module disambiguates a location using a repository of toponyms (i.e. GeoNames) as an index and, for each location, it provides both a fine-grained category (e.g. City for "Honolulu") and its coordinates.
SyntaxPro: DEPENDENCY PARSING
SyntaxPro assigns syntactic relations between words in a text based on syntactic dependencies. It is based on MaltParser, a state-of-the-art transition-based, data-driven dependency parser that induces a parsing model from a treebank and parses new data using the induced model. The model for Italian has been trained on TUT (Turin University Treebank).
For each token it adds the following information:
- parserid: the ID of the token within the sentence
- feats: the morphological features of the token
- head: the ID of the head of the token
- deprel: the label of the dependency between the token and its head
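The parserid and head columns together encode a tree, which can be rebuilt by grouping tokens under the ID of their head. The rows below are made up for illustration only (they are not SyntaxPro output, and the deprel labels are placeholders rather than the TUT label set); the sketch also assumes that the root token has head 0, as in the CoNLL-X format read by MaltParser. The function name children_by_head is hypothetical.

```python
from collections import defaultdict

# Made-up rows (token, parserid, head, deprel), for illustration only
rows = [
    ("Obama", 1, 2, "SUBJ"),
    ("visita", 2, 0, "ROOT"),
    ("Roma", 3, 2, "OBJ"),
]


def children_by_head(rows):
    """Map each head ID to the (token, deprel) pairs that depend on it."""
    children = defaultdict(list)
    for token, parserid, head, deprel in rows:
        children[head].append((token, deprel))
    return dict(children)


tree = children_by_head(rows)
print(tree[0])  # dependents of the artificial root (assumed ID 0)
print(tree[2])  # dependents of token 2
```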
EventPro: EVENT RECOGNITION
EventPro is a module for event detection and classification in a text. It annotates events following the TimeML specifications and classifies them in one of the seven TimeML classes (REPORTING, PERCEPTION, I_ACTION, etc.). As an example, the word "annunciato" (announced) is an event of type REPORTING expressed by a verb, while the word "tagli" (cuts) is an event of type OCCURRENCE expressed by a noun. Both event recognition and classification are based on a statistical classifier. EventPro is available with pre-trained models in the news domain for Italian and English.
For each token EventPro produces the following columns:
- event: it provides the information whether the token is inside the span of an event in IOB2 format (B-EVENT, I-EVENT or O);
- eventid: it provides the ID of the event;
- eventclass: it provides the TimeML class of the event (OCCURRENCE, STATE, I_STATE, REPORTING, PERCEPTION, I_ACTION, ASPECTUAL).
FactPro: EVENT FACTUALITY
FactPro assigns a factuality value to each event mentioned in a text. Factuality is described through three attributes: polarity, certainty and event time. The annotation follows the framework used in Fact-Ita Bank. As an example, the event mentioned with "annunciato" has been classified with polarity=POS (the occurrence of the event has been affirmed), with certainty=CERTAIN (the source is certain about the event), and with event time=NON_FUTURE (the event has occurred in the past).
FactPro is based on a statistical classifier. It is available with pre-trained models in the news domain for Italian.
It is trained on Fact-Ita Bank (http://hlt-nlp.fbk.eu/technologies/fact-ita-bank) and operates on the events detected by the EventPro module. For each token it adds the following columns:
- polarity: the polarity of the event (POL_POS, POL_NEG, O);
- certainty: how certain the source is about the event (CERTAIN, UNCERTAIN, O);
- evtime: when an event occurred or will occur (NON_FUTURE, FUTURE, O).
TempRelPro: TEMPORAL RELATION RECOGNITION
TempRelPro identifies the temporal relations holding between events and time expressions in a text and classifies them following the TimeML specification. It identifies temporal relations of 13 different types (e.g. BEFORE, AFTER, INCLUDES, IS_INCLUDED, etc.); for example, a temporal relation of type IS_INCLUDED can hold between the event "annunciato" (announced) and the time expression "ieri" (yesterday). The module is based on a statistical classifier. TempRelPro is available with pre-trained models in the news domain for Italian and English.
This module is machine learning based and has been trained on the EVENTI@Evalita2014 dataset. It detects temporal relations between two events, between one event and one time expression, and between two time expressions.
A line is added in the header containing all the temporal relations extracted by the module:
# TLINKS: (e1;IS_INCLUDED;tmx1) (tmx0;AFTER;tmx1)
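The TLINKS header line shown above can be split into (source, relation, target) triples with a regular expression. This is a minimal sketch; the names TLINK and parse_tlinks are hypothetical, and the pattern assumes that IDs and relation names never contain semicolons or parentheses.

```python
import re

# One "(source;RELATION;target)" group
TLINK = re.compile(r"\(([^;()]+);([^;()]+);([^;()]+)\)")


def parse_tlinks(line):
    """Extract (source, relation, target) triples from a TLINKS line."""
    _, _, body = line.partition("TLINKS:")
    return TLINK.findall(body)


line = "# TLINKS: (e1;IS_INCLUDED;tmx1) (tmx0;AFTER;tmx1)"
for source, relation, target in parse_tlinks(line):
    print(source, relation, target)
```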
KeyPro: KEYWORD EXTRACTION
KeyPro (formerly called KX) is a tool for key-phrase extraction, which exploits basic linguistic annotations combined with simple statistical measures to select a list of weighted keywords from a text. In the header lines, it produces a line in which # KEYWORDS: is followed by the most important keywords of the input text (in lower case). Each keyword is followed by <NUM1 NUM2>: NUM1 indicates the relevance of the keyword, while NUM2 indicates the number of occurrences of the keyword in the text.
# FILE: example.txt
# LANGUAGE: eng
# TIMESTAMP: 2016-10-06T18:08:15+0200
# KEYWORDS: barack obama <173.91 1> elected <51.2 1> president <49.28 1> january <47.84 1> term <47.79 1>
# FIELDS: token
Barack
Obama
was
elected
U.S.
president
in
2012
.
He
'll
complete
his
term
in
January
2017
.
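The KEYWORDS header line shown above can be parsed into (keyword, relevance, occurrences) triples. The sketch below is only one possible way to read the line; the names KEYWORD and parse_keywords are hypothetical, and the pattern assumes that keywords themselves never contain the character <.

```python
import re

# A keyword (possibly several words) followed by "<NUM1 NUM2>"
KEYWORD = re.compile(r"\s*(.+?)\s*<([\d.]+) (\d+)>")


def parse_keywords(line):
    """Return (keyword, relevance, occurrences) triples from a KEYWORDS line."""
    _, _, body = line.partition("KEYWORDS:")
    return [(keyword, float(relevance), int(count))
            for keyword, relevance, count in KEYWORD.findall(body)]


line = ("# KEYWORDS: barack obama <173.91 1> elected <51.2 1> "
        "president <49.28 1> january <47.84 1> term <47.79 1>")
for keyword, relevance, count in parse_keywords(line):
    print(keyword, relevance, count)
```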