Annotation layers

- December 2016 -

TextPro (http://textpro.fbk.eu/) is a set of modules which perform pre-processing (web page cleaning) and a number of NLP tasks: tokenization, sentence splitting, morphological analysis, pos-tagging, lemmatization, chunking, named-entity recognition, time expression recognition, assignment of geographical coordinates to toponyms, dependency parsing, event and event factuality recognition, temporal relation recognition, and keyword extraction.

The input given to TextPro should contain only pure text (hub text, see the CES guidelines for corpus encoding: http://www.cs.vassar.edu/CES/CES1-5.html#ToCOview).

If a text contains any markup, this should be removed. Example of input file:

Barack Obama was elected U.S. president in 2012.

He'll complete his term in January 2017.

Pre-processing

CleanPro: HTML CLEANING

CleanPro enables the use of HTML input files by cleaning them. Using the -html option, all mark-up tags and irrelevant text (e.g. words used in navigation menus, common headers and footers) are removed from HTML pages, while the relevant text is kept as input text for the other modules.

NLP processing

The output file produced by TextPro contains:

  • Header lines. Some meta information, such as the name of the file, the document's creation date, the author, etc., can be coded in the first lines. These lines begin with the special character #, followed by the name of the field and its value.

  • Annotation lines. They contain the output information for each token of the input file. TextPro uses a tabular format in which the columns are separated by a tab character. The TextPro tabular format is meant for the output of the modules, not for representing information about the structure of the text (e.g. titles, paragraphs, emphasis). The output of each module (except for KeyPro, see below) consists of one or more additional columns containing a value for each token. If a module has no output for a token, the corresponding column contains the value __NULL__ for that token. List of all the TextPro annotation columns: token, tokennorm, tokenid, tokenstart, tokenend, tokentype, pos, full_morpho, comp_morpho, lemma, wnpos, chunk, entity, parserid, feats, head, deprel.

The end of sentence is marked with an empty line.

Example of header lines:

# FILE: example.txt

# ID: 000055740

# CY: US

# DNUM: 06002654

# DATE: 19991214

# KIND: A

# FIELDS: token pos lemma

Example of annotation lines:

Barack NP0 Barack

Obama NP0 Obama

was VBD be

elected VVN elect

U.S. NP0 u.s.

president NN1 president

in PRP in

2012 CRD 2012

. PUN .

The output reported above is obtained with the following command: $> textpro.sh -l eng -c token+pos+lemma -o /tmp/ example.txt
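
The layout described above (header lines starting with #, a # FIELDS: line naming the columns, tab-separated annotation lines, an empty line after each sentence, and __NULL__ for missing values) can be read with a few lines of Python. The following is a minimal sketch and not part of TextPro: the function name parse_textpro is hypothetical, and it assumes that annotation lines never start with # and that the column names in the # FIELDS: line are separated by whitespace.

```python
def parse_textpro(text):
    """Parse TextPro tabular output into (header, sentences).

    header    -- dict mapping header field names to their values
    sentences -- list of sentences; each sentence is a list of dicts
                 mapping column names to values (None for __NULL__)
    """
    header, fields, sentences, current = {}, [], [], []
    for line in text.splitlines():
        if line.startswith("#"):
            # Header line of the form "# NAME: value".
            # Assumption: no annotation line starts with "#".
            name, _, value = line[1:].partition(":")
            header[name.strip()] = value.strip()
            if name.strip() == "FIELDS":
                fields = value.split()
        elif not line.strip():
            # An empty line marks the end of a sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            values = line.split("\t")
            current.append({
                field: (None if value == "__NULL__" else value)
                for field, value in zip(fields, values)
            })
    if current:
        sentences.append(current)
    return header, sentences


sample = (
    "# FILE: example.txt\n"
    "# FIELDS: token\tpos\tlemma\n"
    "Barack\tNP0\tBarack\n"
    "Obama\tNP0\tObama\n"
    "\n"
    "He\tPNP\the\n"
)
header, sentences = parse_textpro(sample)
print(header["FILE"], len(sentences), sentences[0][0]["pos"])
```

Each token becomes a dictionary keyed by column name, so later columns (chunk, entity, and so on) need no changes to the parser.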

TokenPro: TOKENIZATION

TokenPro breaks up a text into elements called tokens (e.g. words, punctuation marks, etc.); when you call TokenPro, the system creates a column called token containing one token in each line.

For instance, in the example below, the full stops "." are considered as single tokens just as all the words in the sentence. Example:

token

Barack

Obama

was

elected

U.S.

president

in

2012

.

He

'll

complete

his

term

in

January

2017

.

TokenPro can be fully customized from an XML configuration file, in order to define specific splitting rules or to define the behavior for specific UTF-8 symbols.

The list of all the columns TokenPro can produce is the following:

token: contains the token

tokennorm: contains a normalized version of the token (for example, accents are normalized)

tokenid: contains the ID given to the token by the system

tokenstart: contains the offset indicating the first character of the token

tokenend: contains the offset immediately after the last character of the token (in the example below, "They" starts at offset 0 and ends at offset 4)

tokentype: tokens are labeled according to six categories, i.e. UPP (uppercase), LOW (lowercase), CAP (capitalized), DIG (digit), PUN (punctuation), and OTH (other).

token tokennorm tokenid tokenstart tokenend tokentype

They They 1 0 4 CAP

have have 2 5 9 LOW

been been 3 10 14 LOW

working working 4 15 22 LOW

at at 5 23 25 LOW

FBK FBK 6 26 29 UPP

for for 7 30 33 LOW

15 15 8 34 36 DIG

years years 9 37 42 LOW

. . 10 42 43 PUN
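
In the example above, tokenstart and tokenend behave like a Python slice: the end offset is one past the last character of the token. The short check below illustrates this. It is only a sketch: the input sentence is an assumption, reconstructed from the tokens with single spaces, and token_at is a hypothetical helper.

```python
# Assumed input text, reconstructed from the tokens in the example.
text = "They have been working at FBK for 15 years."

# (token, tokenstart, tokenend) triples copied from the example above.
rows = [
    ("They", 0, 4), ("have", 5, 9), ("been", 10, 14),
    ("working", 15, 22), ("at", 23, 25), ("FBK", 26, 29),
    ("for", 30, 33), ("15", 34, 36), ("years", 37, 42), (".", 42, 43),
]


def token_at(text, start, end):
    """Return the substring covered by a token's offsets (end is exclusive)."""
    return text[start:end]


mismatches = [(tok, s, e) for tok, s, e in rows if token_at(text, s, e) != tok]
print(mismatches)  # [] when every offset pair matches its token
```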

SentencePro: SENTENCE SPLITTING

SentencePro identifies sentence boundaries in a text; it marks the end of a sentence when a specific punctuation mark (for instance a full stop, a question mark, an exclamation mark, etc.) is found, unless it is part of a linguistic expression, like an abbreviation (as the full stops in "U.S." which stands for "United States") or an acronym. Example:

token

Barack

Obama

was

elected

U.S.

president

in

2012

.

He

'll

complete

his

term

in

January

2017

.

SentencePro can be fully customized from an XML configuration file, in order to define specific sentence-ending rules.
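
The boundary rule can be illustrated on an already tokenized text. The sketch below is a deliberate simplification and not SentencePro's implementation: it relies on the tokenizer keeping abbreviations such as "U.S." as single tokens, and both the function name split_sentences and the set of sentence-final marks are assumptions.

```python
SENTENCE_FINAL = {".", "?", "!"}  # assumed set of sentence-ending tokens


def split_sentences(tokens):
    """Group a flat token list into sentences.

    A sentence ends at a token that is itself a sentence-final punctuation
    mark. Abbreviations such as "U.S." are single tokens, so their internal
    full stops never trigger a split.
    """
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_FINAL:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences


tokens = ["Barack", "Obama", "was", "elected", "U.S.", "president", "in",
          "2012", ".", "He", "'ll", "complete", "his", "term", "in",
          "January", "2017", "."]
for sentence in split_sentences(tokens):
    print(" ".join(sentence))
```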

TagPro: POS-TAGGING

TagPro marks each word in a text with a PoS (e.g. noun, verb, adjective, etc.) according to a predefined tagset, i.e. a subset of the ELRA tagset for Italian (see Appendix A) and the BNC tagset for English (see Appendix B). Example:

token pos

Barack NP0

Obama NP0

was VBD

elected VVN

U.S. NP0

president NN1

in PRP

2012 CRD

. PUN

He PNP

'll VM0

complete VVI

his DPS

term NN1

in PRP

January NP0

2017 CRD

. PUN

MorphoPro: MORPHOLOGICAL ANALYSIS

MorphoPro assigns morphological features (e.g. tense, person and number for verbs) to words in a text, with the possibility of disambiguating the possible morphological readings of a word in context. When you call MorphoPro, the system can create two different columns:

- full_morpho provides all the possible morphological analyses for a token. For example, "bearing" can be a verb, a noun or an adjective:

token full_morpho

bearing bear+v+gerund+pres bearing+n+sing bearing+adj+zero

Note that the three morphological analyses in the example above are all contained in one column, separated by blanks (column separators, on the other hand, are tabs).

- comp_morpho contains only the morphological analysis/es that is/are compatible with the PoS tag selected by TagPro; in our example, TagPro selects a verbal tag (VVG), thus the only compatible morphological analysis is bear+v+gerund+pres:

token pos comp_morpho

bearing VVG bear+v+gerund+pres

Example:

token full_morpho

Barack __NULL__

Obama __NULL__

was be+v+indic+past

elected elect+v+indic+past elect+v+part+past elected+adj+zero

U.S. u.s.+pn

president president+n+sing

in in+prep in+adv in+n+sing in+adj+zero

2012 __NULL__

. .+punc

He he+n+sing he+pron

'll will+v+indic+pres+_

complete complete+v+indic+pres+no3sing complete+v+infin+pres complete+adj+zero

his hi+n+plur his+adj+zero his+pron

term term+n+sing term+n+plur term+v+indic+pres+no3sing term+v+infin+pres

in in+prep in+adv in+n+sing in+adj+zero

January january+pn january+n+sing

2017 __NULL__

. .+punc

token pos comp_morpho

Barack NP0 __NULL__

Obama NP0 __NULL__

was VBD be+v+indic+past

elected VVN elect+v+part+past

U.S. NP0 u.s.+pn

president NN1 president+n+sing

in PRP in+prep

2012 CRD __NULL__

. PUN .+punc

He PNP he+pron

'll VM0 will+v+indic+pres+_

complete VVI complete+v+infin+pres

his DPS his+adj+zero

term NN1 term+n+sing term+n+plur

in PRP in+prep

January NP0 january+pn

2017 CRD __NULL__

. PUN .+punc
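
A full_morpho or comp_morpho cell can be unpacked by splitting on blanks and then on "+". The sketch below assumes, based on the examples above, that the first "+"-separated element of each analysis is the base form and that the remaining elements are features; parse_analyses is a hypothetical helper, not a TextPro function.

```python
def parse_analyses(cell):
    """Split a full_morpho/comp_morpho cell into (base form, features) pairs."""
    if cell == "__NULL__":
        return []  # the module produced no analysis for this token
    analyses = []
    for analysis in cell.split():
        # Assumption: the base form itself contains no "+" character.
        base, *features = analysis.split("+")
        analyses.append((base, features))
    return analyses


full = parse_analyses("bear+v+gerund+pres bearing+n+sing bearing+adj+zero")
for base, features in full:
    print(base, features)
print(parse_analyses("__NULL__"))
```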

LemmaPro: LEMMATIZATION

LemmaPro selects a lemma (i.e. a normalized word form, such as the infinitive form for a verb) for words in a text. The module takes as input both the morphological analysis produced by MorphoPro and the PoS label assigned by TagPro, and applies linguistic rules to select the compatible lemma(s). Example:

token pos comp_morpho lemma

Barack NP0 __NULL__ Barack

Obama NP0 __NULL__ Obama

was VBD be+v+indic+past be

elected VVN elect+v+part+past elect

U.S. NP0 u.s.+pn u.s.

president NN1 president+n+sing president

in PRP in+prep in

2012 CRD __NULL__ 2012

. PUN .+punc .

He PNP he+pron he

'll VM0 will+v+indic+pres+_ will

complete VVI complete+v+infin+pres complete

his DPS his+adj+zero his

term NN1 term+n+sing term+n+plur term

in PRP in+prep in

January NP0 january+pn january

2017 CRD __NULL__ 2017

. PUN .+punc .

Mapping from TextPro POS-tag to WordNet POS-tag: The column wnpos contains a mapping between the PoS-tagging label assigned by TagPro and the corresponding PoS label used in WordNet (https://wordnet.princeton.edu/), i.e. n (for nouns), v (for verbs), a (for adjectives), and r (for adverbs).

For Italian (ELRA tagset):

- n (noun): SS,SP,SN,SPN,YA,YF

- v (verb): VI,VI+E,VIY,VIY+E,VF,VF+E,VFY,VFY+E,VSP,VSP+E,VSPY,VSPY+E,VPP,VPP+E,VPPY,VPPY+E,VG,VG+E,VGY,VGY+E,VM,VM+E,VMY,VMY+E

- a (adjective): AS,AP,AN,DS,DP,DN

- r (adverb): B

For English (BNC tagset):

- n (noun): ITJ,NN0,NN1,NN2,NP0,UNC,ZZ0

- v (verb): VBB,VBD,VBG,VBI,VBN,VBZ,VDB,VDD,VDG,VDI,VDN,VDZ,VHB,VHD,VHG,VHI,VHN,VHZ,VM0,VVB,VVD,VVG,VVI,VVN,VVZ

- a (adjective): AJ0,AJC,AJS,CRD,DPS,ORD

- r (adverb): AV0,AVP,AVQ,XX0

Example:

token pos wnpos

Barack NP0 n

Obama NP0 n

was VBD v

elected VVN v

U.S. NP0 n

president NN1 n

in PRP __NULL__

2012 CRD n

. PUN __NULL__

He PNP __NULL__

'll VM0 v

complete VVI v

his DPS a

term NN1 n

in PRP __NULL__

January NP0 n

2017 CRD n

. PUN __NULL__
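
The English mapping can be expressed as a simple lookup table. The sketch below is built directly from the BNC tag lists given above and follows them as written (for instance, CRD is looked up under a); the names WNPOS_TO_BNC, TAG_TO_WNPOS and wnpos are hypothetical, and returning __NULL__ for unlisted tags mirrors the convention used in the annotation columns.

```python
# Tag lists copied from the English (BNC tagset) mapping above.
WNPOS_TO_BNC = {
    "n": "ITJ,NN0,NN1,NN2,NP0,UNC,ZZ0",
    "v": ("VBB,VBD,VBG,VBI,VBN,VBZ,VDB,VDD,VDG,VDI,VDN,VDZ,"
          "VHB,VHD,VHG,VHI,VHN,VHZ,VM0,VVB,VVD,VVG,VVI,VVN,VVZ"),
    "a": "AJ0,AJC,AJS,CRD,DPS,ORD",
    "r": "AV0,AVP,AVQ,XX0",
}

# Invert the table: one entry per BNC tag.
TAG_TO_WNPOS = {
    tag: wn for wn, tags in WNPOS_TO_BNC.items() for tag in tags.split(",")
}


def wnpos(pos_tag):
    """Return the WordNet PoS label for a BNC tag, or __NULL__ if unmapped."""
    return TAG_TO_WNPOS.get(pos_tag, "__NULL__")


for tag in ["NP0", "VBD", "DPS", "PRP", "PUN"]:
    print(tag, wnpos(tag))
```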

ChunkPro: CHUNKING

ChunkPro groups words into flat (i.e. without nesting) syntactic constituents (e.g. nominal phrases, verbal phrases, etc.), providing both a shallow syntactic analysis of a text and an intermediate step toward full parsing.

The module is available with pre-trained models in the news domain, for two languages, English and Italian.

For Italian, ChunkPro annotates syntactic chunks of the following types: NP (Nominal Phrase) and VX (Verbal Phrase).

For English, the list of syntactic chunk types is the following (from the Penn Treebank):

- ADJP: Adjective Phrase;

- ADVP: Adverb Phrase;

- CONJP: Conjunction Phrase;

- INTJ: Interjection (it corresponds approximately to the PoS tag UH);

- LST: List marker (it includes surrounding punctuation);

- NP: Noun Phrase;

- PP: Prepositional Phrase;

- PRT: Particle (it corresponds approximately to the PoS tag RP);

- SBAR: Clause introduced by a (possibly empty) subordinating conjunction;

- UCP: Unlike Coordinated Phrase;

- VP: Verb Phrase.

The column produced by the system (i.e. chunk) contains a tag consisting of two parts:

- the IOB2 tag: B (for "begin") denotes the first token of a chunk, I (for "inside") is used for all other tokens in a chunk, and O (for "outside") is used for tokens that do not belong to any chunk;

- the tag indicating the type of chunk.

Example:

token chunk

Mister B-NP

Barack I-NP

Obama I-NP

was B-VP

elected I-VP

U.S. B-NP

president I-NP

in B-PP

2012 B-NP

. O

He B-NP

'll B-VP

complete I-VP

his B-NP

term I-NP

in B-PP

January B-NP

2017 I-NP

. O
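
IOB2 tags such as those in the chunk column can be turned back into labelled spans by opening a span at each B- tag and extending it with the following I- tags. The function below is an illustrative sketch (iob2_spans is a hypothetical name); it silently ignores an I- tag that does not continue an open span of the same type.

```python
def iob2_spans(tokens, tags):
    """Collect (label, [tokens]) spans from parallel token and IOB2 tag lists."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span.
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag extends the open span of the same type.
            current[1].append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            current = None
    return spans


tokens = ["Mister", "Barack", "Obama", "was", "elected", "U.S.",
          "president", "in", "2012", "."]
tags = ["B-NP", "I-NP", "I-NP", "B-VP", "I-VP", "B-NP",
        "I-NP", "B-PP", "B-NP", "O"]
for label, words in iob2_spans(tokens, tags):
    print(label, " ".join(words))
```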

EntityPro: NAMED-ENTITY RECOGNITION

EntityPro annotates named entities of the following types for English and Italian: Person (PER), Organization (ORG), and Location (LOC). Only for Italian, it further distinguishes entities of type Location (LOC) from entities of type Geo-Political entity (GPE). Only for English, it also annotates entities of type Miscellaneous (MISC).

The column produced by the system (i.e. entity) contains a tag consisting of two parts:

- the IOB2 tag: B (for "begin") denotes the first token of a Named Entity, I (for "inside") is used for all other tokens in a Named Entity, and O (for "outside") is used for all other tokens;

- the Entity type tag: PER (for Person), ORG (for Organization), GPE (for Geo-Political Entity), LOC (for Location), and MISC (Miscellaneous).

Example:

token entity

Barack B-PER

Obama I-PER

was O

elected O

U.S. B-LOC

president O

in O

2012 O

. O

He O

'll O

complete O

his O

term O

in O

January O

2017 O

. O

TimePro: TIME EXPRESSION RECOGNITION

TimePro annotates both absolute time expressions (e.g. "April 3rd 2016") and relative time expressions (e.g. "tomorrow") in a text.

Temporal expressions are first classified (e.g. dates, durations, times) and then normalized according to the ISO-TimeML specifications.

TimePro is available with pre-trained models in the news domain for Italian and English.

It is trained on the EVENTI@Evalita2014 dataset for time expression recognition and classification. For time expression normalization we have adapted the timenorm library (https://github.com/bethard/timenorm) to Italian.

For each token it produces the following columns:

- tmx: it contains the type of the time expression in IOB2 format (i.e. DATE, TIME, SET or DURATION);

- tmxid: it contains the ID of the time expression;

- tmxvalue: it contains the normalized value of the time expression.
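
Since every token of a time expression carries the same tmxid, the three columns can be merged into one record per expression. The sketch below shows one way to do this. The sample rows are invented for illustration and are not actual TimePro output: in particular, the use of O in all three columns for tokens outside a time expression and the normalized value 2017-01 are assumptions, and time_expressions is a hypothetical helper.

```python
def time_expressions(rows):
    """Merge token rows sharing a tmxid into one record per time expression.

    rows -- iterable of (token, tmx, tmxid, tmxvalue) tuples
    """
    expressions = {}
    for token, tmx, tmxid, tmxvalue in rows:
        if tmx == "O":
            continue  # token is outside any time expression (assumed marker)
        entry = expressions.setdefault(
            tmxid,
            {"type": tmx.split("-", 1)[1], "value": tmxvalue, "tokens": []},
        )
        entry["tokens"].append(token)
    return expressions


# Hypothetical rows, for illustration only.
rows = [
    ("in", "O", "O", "O"),
    ("January", "B-DATE", "tmx1", "2017-01"),
    ("2017", "I-DATE", "tmx1", "2017-01"),
    (".", "O", "O", "O"),
]
expressions = time_expressions(rows)
for tmxid, entry in expressions.items():
    print(tmxid, entry["type"], entry["value"], " ".join(entry["tokens"]))
```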

GeoCoder: GEOGRAPHICAL COORDINATES

GeoCoder assigns geographical coordinates (i.e. latitude and longitude) to names of locations in a text, making their visualization on a map possible. Given the output of EntityPro, the module disambiguates a location using a repository of toponyms (i.e. GeoNames) as an index and, for each location, it provides both a fine-grained category (e.g. City for "Honolulu") and its coordinates.

SyntaxPro: DEPENDENCY PARSING

SyntaxPro assigns syntactic relations between words in a text based on syntactic dependencies. It is based on MaltParser, a state-of-the-art, data-driven, transition-based dependency parser that induces a parsing model from a treebank and parses new data using the induced model. For Italian, the model has been trained on TUT (Turin University Treebank).

For each token it adds the following information:

- parserid: the ID of the token within the sentence

- feats: the morphological features of the token

- head: the ID of the head of the token

- deprel: the label of the dependency between the token and its head
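
The parserid and head columns are enough to rebuild the dependency tree of a sentence. The sketch below groups tokens under their heads and prints the tree with indentation. The sample rows and dependency labels are invented for illustration and are not actual SyntaxPro output; the convention that head 0 marks the root of the sentence is also an assumption, as are the function names.

```python
from collections import defaultdict


def children_by_head(rows):
    """Map each head ID to the list of (parserid, token, deprel) it governs.

    rows -- iterable of (parserid, token, head, deprel) tuples
    """
    children = defaultdict(list)
    for parserid, token, head, deprel in rows:
        children[head].append((parserid, token, deprel))
    return children


def print_tree(children, head=0, depth=0):
    """Print the tree rooted at `head`, one token per line (no cycles assumed)."""
    for parserid, token, deprel in children.get(head, []):
        print("  " * depth + f"{token} ({deprel})")
        print_tree(children, parserid, depth + 1)


# Hypothetical rows for "Obama parla oggi" (Obama speaks today).
rows = [
    (1, "Obama", 2, "SUBJ"),
    (2, "parla", 0, "ROOT"),
    (3, "oggi", 2, "MOD"),
]
children = children_by_head(rows)
print_tree(children)
```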

EventPro: EVENT RECOGNITION

EventPro is a module for event detection and classification in a text. It annotates events following the TimeML specifications and classifies them into one of the seven TimeML classes (REPORTING, PERCEPTION, I_ACTION, etc.). As an example, the word "annunciato" (announced) is an event of type REPORTING expressed by a verb, while the word "tagli" (cuts) is an event of type OCCURRENCE expressed by a noun. Both event recognition and classification are based on a statistical classifier. EventPro is available with pre-trained models in the news domain for Italian and English.

For each token EventPro produces the following columns:

- event: it provides the information whether the token is inside the span of an event in IOB2 format (B-EVENT, I-EVENT or O);

- eventid: it provides the ID of the event;

- eventclass: it provides the TimeML class of the event (OCCURRENCE, STATE, I_STATE, REPORTING, PERCEPTION, I_ACTION, ASPECTUAL).

FactPro: EVENT FACTUALITY

FactPro assigns a factuality value to each event mentioned in a text. Factuality is described through three attributes: polarity, certainty and event time. The annotation follows the framework used in Fact-Ita Bank. As an example, the event mentioned with "annunciato" has been classified with polarity=POS (the occurrence of the event has been affirmed), with certainty=CERTAIN (the source is certain about the event), and with event time=NON_FUTURE (the event has occurred in the past).

FactPro is based on a statistical classifier. It is available with pre-trained models in the news domain for Italian.

It is trained on Fact-Ita Bank (http://hlt-nlp.fbk.eu/technologies/fact-ita-bank). The events it describes are those detected by the EventPro module. For each token it adds the following information (as columns):

- polarity: the polarity of the event (POL_POS, POL_NEG, O);

- certainty: how certain the source is about the event (CERTAIN, UNCERTAIN, O);

- evtime: when an event occurred or will occur (NON_FUTURE, FUTURE, O).

TempRelPro: TEMPORAL RELATION RECOGNITION

TempRelPro identifies the temporal relations holding between events and time expressions in a text and classifies them following the TimeML specification. It identifies temporal relations of 13 different types (e.g. BEFORE, AFTER, INCLUDES, IS_INCLUDED, etc.); for example, a temporal relation of type IS_INCLUDED can hold between the event "annunciato" (announced) and the time expression "ieri" (yesterday). The module is based on a statistical classifier. TempRelPro is available with pre-trained models in the news domain for Italian and English.

The module has been trained on the EVENTI@Evalita2014 dataset. It detects temporal relations between two events, between one event and one time expression, and between two time expressions.

A line is added in the header containing all the temporal relations extracted by the module:

# TLINKS: (e1;IS_INCLUDED;tmx1) (tmx0;AFTER;tmx1)
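
The # TLINKS: line can be split into (source, relation, target) triples with a regular expression. The sketch below assumes the layout shown above, i.e. parenthesized triples whose three parts are separated by semicolons; parse_tlinks is a hypothetical helper.

```python
import re

# One parenthesized triple: (source;RELATION;target)
TLINK = re.compile(r"\(([^;()]+);([^;()]+);([^;()]+)\)")


def parse_tlinks(line):
    """Return (source, relation, target) triples from a '# TLINKS:' line."""
    prefix = "# TLINKS:"
    if not line.startswith(prefix):
        raise ValueError("not a TLINKS header line")
    return TLINK.findall(line[len(prefix):])


tlinks = parse_tlinks("# TLINKS: (e1;IS_INCLUDED;tmx1) (tmx0;AFTER;tmx1)")
for source, relation, target in tlinks:
    print(source, relation, target)
```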

KeyPro: KEYWORD EXTRACTION

KeyPro (formerly called KX) is a tool for key-phrase extraction, which exploits basic linguistic annotations combined with simple statistical measures to select a list of weighted keywords from a text. In the header lines, it produces a line in which # KEYWORDS: is followed by the most important keywords of the input text (in lower case). Each keyword is followed by <NUM1 NUM2>: NUM1 indicates the relevance of the keyword, while NUM2 indicates the number of occurrences of the keyword in the text.

# FILE: example.txt

# LANGUAGE: eng

# TIMESTAMP: 2016-10-06T18:08:15+0200

# KEYWORDS: barack obama <173.91 1> elected <51.2 1> president <49.28 1> january <47.84 1> term <47.79 1>

# FIELDS: token

Barack

Obama

was

elected

U.S.

president

in

2012

.

He

'll

complete

his

term

in

January

2017

.
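
The # KEYWORDS: line can be read back into (keyword, relevance, occurrences) records. The sketch below assumes the layout shown above, in which each keyword is followed by <NUM1 NUM2>; parse_keywords is a hypothetical helper.

```python
import re

# A keyword (possibly several words) followed by "<relevance occurrences>".
KEYWORD = re.compile(r"\s*(.+?)\s*<([0-9.]+)\s+([0-9]+)>")


def parse_keywords(line):
    """Return (keyword, relevance, occurrences) tuples from a '# KEYWORDS:' line."""
    prefix = "# KEYWORDS:"
    if not line.startswith(prefix):
        raise ValueError("not a KEYWORDS header line")
    return [
        (keyword, float(relevance), int(count))
        for keyword, relevance, count in KEYWORD.findall(line[len(prefix):])
    ]


keywords = parse_keywords(
    "# KEYWORDS: barack obama <173.91 1> elected <51.2 1> "
    "president <49.28 1> january <47.84 1> term <47.79 1>"
)
for keyword, relevance, count in keywords:
    print(keyword, relevance, count)
```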


APPENDIX A: the ELRA tagset

APPENDIX B: the BNC tagset