- December 2016 -

TextPro (http://textpro.fbk.eu/) is a set of modules which perform pre-processing (web page cleaning) and a number of NLP tasks: tokenization, sentence splitting, morphological analysis, PoS-tagging, lemmatization, chunking, named-entity recognition, time expression recognition, assignment of geographical coordinates to toponyms, dependency parsing, event and event factuality recognition, temporal relation recognition, and keyword extraction.

The input given to TextPro should contain only pure text (hub text, see the CES guidelines for corpus encoding: http://www.cs.vassar.edu/CES/CES1-5.html#ToCOview). If a text contains any markup, this should be removed.

Example of input file:

    Barack Obama was elected U.S. president in 2012.

Pre-processing

CleanPro: HTML CLEANING

CleanPro enables the use of HTML input files by cleaning them. With the -html option, all mark-up tags and irrelevant text (i.e. words used in navigation menus, common headers and footers, etc.) are removed from HTML pages, while the relevant text is kept as input text for the other modules.

NLP processing

The end of a sentence is marked with an empty line.

Example of header lines:

    # FILE: example.txt
    # ID: 000055740
    # CY: US
    # DNUM: 06002654
    # DATE: 19991214
    # KIND: A
    # FIELDS: token pos lemma

Example of annotation lines:

token | pos | lemma |
---|---|---|
Barack | NP0 | Barack |
Obama | NP0 | Obama |
was | VBD | be |
elected | VVN | elect |
U.S. | NP0 | u.s. |
president | NN1 | president |
in | PRP | in |
2012 | CRD | 2012 |
. | PUN | . |

The output reported above is obtained with the following command:

    $> textpro.sh -l eng -c token+pos+lemma -o /tmp/ example.txt

TokenPro: TOKENIZATION

TokenPro breaks up a text into elements called tokens (e.g. words, punctuation marks, etc.); when you call TokenPro, the system creates a column called token containing one token in each line. For instance, in the example below, the full stops "." are considered as single tokens just like all the words in the sentence.

Example:

token |
---|
Barack |
Obama |
was |
elected |
U.S. |
president |
in |
2012 |
. |
He |
'll |
complete |
his |
term |
in |
January |
2017 |
. |

TokenPro can be fully customized from an XML configuration file, in order to define specific splitting rules or to define the behavior for specific UTF-8 symbols.

The list of all the columns TokenPro can produce is the following:

- token: contains the token
- tokennorm: contains a normalized version of the token (for example, accents are normalized)
- tokenid: contains the ID given to the token by the system
- tokenstart: contains the offset of the first character of the token
- tokenend: contains the offset at which the token ends (in the example below, the position just after its last character)
- tokentype: tokens are labeled according to six categories, i.e. UPP (uppercase), LOW (lowercase), CAP (capitalized), DIG (digit), PUN (punctuation), and OTH (other).

Example:

token | tokennorm | tokenid | tokenstart | tokenend | tokentype |
---|---|---|---|---|---|
They | They | 1 | 0 | 4 | UPP |
have | have | 2 | 5 | 9 | LOW |
been | been | 3 | 10 | 14 | LOW |
working | working | 4 | 15 | 22 | LOW |
at | at | 5 | 23 | 25 | LOW |
FBK | FBK | 6 | 26 | 29 | CAP |
for | for | 7 | 30 | 33 | LOW |
15 | 15 | 8 | 34 | 36 | DIG |
years | years | 9 | 37 | 42 | LOW |
. | . | 10 | 42 | 43 | PUN |

SentencePro: SENTENCE SPLITTING

SentencePro identifies sentence boundaries in a text; it marks the end of a sentence when a specific punctuation mark (for instance a full stop, a question mark, an exclamation mark, etc.) is found, unless the mark is part of a linguistic expression such as an abbreviation (like the full stops in "U.S.", which stands for "United States") or an acronym.

Example (the empty row marks the sentence boundary):

token |
---|
Barack |
Obama |
was |
elected |
U.S. |
president |
in |
2012 |
. |
 |
He |
'll |
complete |
his |
term |
in |
January |
2017 |
. |

SentencePro can be fully customized from an XML configuration file, in order to define specific sentence-ending rules.

TagPro: POS-TAGGING

TagPro marks each word in a text with a PoS (e.g. noun, verb, adjective, etc.) according to a predefined tagset, i.e. a subset of the ELRA tagset for Italian (see Appendix A) and the BNC tagset for English (see Appendix B).

Example:

token | pos |
---|---|
Barack | NP0 |
Obama | NP0 |
was | VBD |
elected | VVN |
U.S. | NP0 |
president | NN1 |
in | PRP |
2012 | CRD |
. | PUN |
He | PNP |
'll | VM0 |
complete | VVI |
his | DPS |
term | NN1 |
in | PRP |
January | NP0 |
2017 | CRD |
. | PUN |

MorphoPro: MORPHOLOGICAL ANALYSIS

MorphoPro assigns morphological features (e.g. tense, person and number for verbs) to words in a text, with the possibility of disambiguating the possible morphological readings of a word in context. When you call MorphoPro, the system can create two different columns:

- full_morpho provides all the possible morphological analyses for a token. For example, "bearing" can be a verb, a noun or an adjective:

token | full_morpho |
---|---|
bearing | bear+v+gerund+pres bearing+n+sing bearing+adj+zero |

Note that the three morphological analyses in the example above are all contained in one column, separated by blanks (column separators, on the other hand, are tabs).

- comp_morpho contains only the morphological analysis or analyses compatible with the PoS tag selected by TagPro; in our example, TagPro selects a verbal tag (VVG), thus the only compatible morphological analysis is bear+v+gerund+pres:

token | pos | comp_morpho |
---|---|---|
bearing | VVG | bear+v+gerund+pres |

Example:

token | full_morpho |
---|---|
Barack | __NULL__ |
Obama | __NULL__ |
was | be+v+indic+past |
elected | elect+v+indic+past elect+v+part+past elected+adj+zero |
U.S. | u.s.+pn |
president | president+n+sing |
in | in+prep in+adv in+n+sing in+adj+zero |
2012 | __NULL__ |
. | .+punc |
He | he+n+sing he+pron |
'll | will+v+indic+pres+_ |
complete | complete+v+indic+pres+no3sing complete+v+infin+pres complete+adj+zero |
his | hi+n+plur his+adj+zero his+pron |
term | term+n+sing term+n+plur term+v+indic+pres+no3sing term+v+infin+pres |
in | in+prep in+adv in+n+sing in+adj+zero |
January | january+pn january+n+sing |
2017 | __NULL__ |
. | .+punc |

token | pos | comp_morpho |
---|---|---|
Barack | NP0 | __NULL__ |
Obama | NP0 | __NULL__ |
was | VBD | be+v+indic+past |
elected | VVN | elect+v+part+past |
U.S. | NP0 | u.s.+pn |
president | NN1 | president+n+sing |
in | PRP | in+prep |
2012 | CRD | __NULL__ |
. | PUN | .+punc |
He | PNP | he+pron |
'll | VM0 | will+v+indic+pres+_ |
complete | VVI | complete+v+infin+pres |
his | DPS | his+adj+zero |
term | NN1 | term+n+sing term+n+plur |
in | PRP | in+prep |
January | NP0 | january+pn |
2017 | CRD | __NULL__ |
. | PUN | .+punc |

LemmaPro: LEMMATIZATION

LemmaPro selects a lemma (i.e. a normalized word form, such as the infinitive form for a verb) for words in a text. The module takes as input both the morphological analysis produced by MorphoPro and the PoS label assigned by TagPro, and applies linguistic rules to select the compatible lemma(s).

Example:

token | pos | comp_morpho | lemma |
---|---|---|---|
Barack | NP0 | __NULL__ | Barack |
Obama | NP0 | __NULL__ | Obama |
was | VBD | be+v+indic+past | be |
elected | VVN | elect+v+part+past | elect |
U.S. | NP0 | u.s.+pn | u.s. |
president | NN1 | president+n+sing | president |
in | PRP | in+prep | in |
2012 | CRD | __NULL__ | 2012 |
. | PUN | .+punc | . |
He | PNP | he+pron | he |
'll | VM0 | will+v+indic+pres+_ | will |
complete | VVI | complete+v+infin+pres | complete |
his | DPS | his+adj+zero | his |
term | NN1 | term+n+sing term+n+plur | term |
in | PRP | in+prep | in |
January | NP0 | january+pn | january |
2017 | CRD | __NULL__ | 2017 |
. | PUN | .+punc | . |

Mapping from TextPro POS-tag to WordNet POS-tag:

The column wnpos contains a mapping between the PoS label assigned by TagPro and the corresponding PoS label used in WordNet (https://wordnet.princeton.edu/), i.e. n (for nouns), v (for verbs), a (for adjectives), and r (for adverbs).

For Italian (ELRA tagset):

- n (noun): SS,SP,SN,SPN,YA,YF
- v (verb): VI,VI+E,VIY,VIY+E,VF,VF+E,VFY,VFY+E,VSP,VSP+E,VSPY,VSPY+E,VPP,VPP+E,VPPY,VPPY+E,VG,VG+E,VGY,VGY+E,VM,VM+E,VMY,VMY+E
- a (adjective): AS,AP,AN,DS,DP,DN
- r (adverb): B

For English (BNC tagset):

- n (noun): ITJ,NN0,NN1,NN2,NP0,UNC,ZZ0
- v (verb): VBB,VBD,VBG,VBI,VBN,VBZ,VDB,VDD,VDG,VDI,VDN,VDZ,VHB,VHD,VHG,VHI,VHN,VHZ,VM0,VVB,VVD,VVG,VVI,VVN,VVZ
- a (adjective): AJ0,AJC,AJS,CRD,DPS,ORD
- r (adverb): AV0,AVP,AVQ,XX0

Example:

token | pos | wnpos |
---|---|---|
Barack | NP0 | n |
Obama | NP0 | n |
was | VBD | v |
elected | VVN | v |
U.S. | NP0 | n |
president | NN1 | n |
in | PRP | __NULL__ |
2012 | CRD | n |
. | PUN | __NULL__ |
He | PNP | __NULL__ |
'll | VM0 | v |
complete | VVI | v |
his | DPS | a |
term | NN1 | n |
in | PRP | __NULL__ |
January | NP0 | n |
2017 | CRD | n |
. | PUN | __NULL__ |

ChunkPro: CHUNKING

ChunkPro groups words into flat (i.e. without nesting) syntactic constituents (e.g. nominal phrases, verbal phrases, etc.), providing both a shallow syntactic analysis of a text and an intermediate step toward full parsing. The module is available with pre-trained models in the news domain, for two languages, English and Italian.

For Italian, ChunkPro annotates syntactic chunks of the following types: NP (Nominal Phrase) and VX (Verbal Phrase).

For English, the list of syntactic chunk types is the following (from the Penn Treebank):

- ADJP: Adjective Phrase;
- ADVP: Adverb Phrase;
- CONJP: Conjunction Phrase;
- INTJ: Interjection (it corresponds approximately to the PoS tag UH);
- LST: List marker (it includes surrounding punctuation);
- NP: Noun Phrase;
- PP: Prepositional Phrase;
- PRT: Particle (it corresponds approximately to the PoS tag RP);
- SBAR: Clause introduced by a (possibly empty) subordinating conjunction;
- UCP: Unlike Coordinated Phrase;
- VP: Verb Phrase.

The column produced by the system (i.e. chunk) contains a tag consisting of two parts:

- the IOB2 tag: B (for "begin") denotes the first token of a chunk, I (for "inside") is used for all other tokens in a chunk, and O (for "outside") is used for tokens that do not belong to any chunk;
- the tag indicating the type of chunk.

Example:

token | chunk |
---|---|
Mister | B-NP |
Barack | I-NP |
Obama | I-NP |
was | B-VP |
elected | I-VP |
U.S. | B-NP |
president | I-NP |
in | B-PP |
2012 | B-NP |
. | O |
He | B-NP |
'll | B-VP |
complete | I-VP |
his | B-NP |
term | I-NP |
in | B-PP |
January | B-NP |
2017 | I-NP |
. | O |

EntityPro: NAMED-ENTITY RECOGNITION

EntityPro annotates named entities of the following types for English and Italian: Person (PER), Organization (ORG), and Location (LOC). Only for Italian, it further distinguishes entities of type Location (LOC) from entities of type Geo-Political entity (GPE).
Only for English, it also annotates entities of type Miscellaneous (MISC).

The column produced by the system (i.e. entity) contains a tag consisting of two parts:

- the IOB2 tag: B (for "begin") denotes the first token of a Named Entity, I (for "inside") is used for all other tokens in a Named Entity, and O (for "outside") is used for tokens that do not belong to any Named Entity;
- the Entity type tag: PER (for Person), ORG (for Organization), GPE (for Geo-Political Entity), LOC (for Location), and MISC (for Miscellaneous).

Example:

token | entity |
---|---|
Barack | B-PER |
Obama | I-PER |
was | O |
elected | O |
U.S. | B-LOC |
president | O |
in | O |
2012 | O |
. | O |
He | O |
'll | O |
complete | O |
his | O |
term | O |
in | O |
January | O |
2017 | O |
. | O |

TimePro: TIME EXPRESSION RECOGNITION

TimePro annotates both absolute time expressions (e.g. "April 3rd 2016") and relative time expressions (e.g. "tomorrow") in a text. Temporal expressions are first classified (e.g. dates, durations, times) and then normalized according to the ISO-TimeML specifications. TimePro is available with pre-trained models in the news domain for Italian and English. It is trained on the EVENTI@Evalita2014 dataset for time expression recognition and classification. For time expression normalization we have adapted the timenorm library (https://github.com/bethard/timenorm) to Italian.

For each token it produces the following columns:

- tmx: contains the type of the time expression (i.e. DATE, TIME, SET or DURATION) in IOB2 format;
- tmxid: contains the ID of the time expression;
- tmxvalue: contains the normalized value of the time expression.

GeoCoder: GEOGRAPHICAL COORDINATES

GeoCoder assigns geographical coordinates (i.e. latitude and longitude) to names of locations in a text, making their visualization on a map possible. Given the output of EntityPro, the module disambiguates a location using a repository of toponyms (i.e. GeoNames) as an index and, for each location, it provides both a fine-grained category (e.g. City for "Honolulu") and its coordinates.

SyntaxPro: DEPENDENCY PARSING

SyntaxPro assigns syntactic relations between words in a text based on syntactic dependencies. It is based on MaltParser, a transition-based, data-driven dependency parser that induces a parsing model from a treebank and parses new data using the induced model. For Italian, it has been trained on TUT (Turin University Treebank).

For each token it adds the following information:

- parserid: the ID of the token within the sentence
- feats: the morphological features of the token
- head: the ID of the head of the token
- deprel: the label of the dependency between the token and its head

EventPro: EVENT RECOGNITION

EventPro is a module for event detection and classification in a text. It annotates events following the TimeML specifications and classifies them into one of the seven TimeML classes (REPORTING, PERCEPTION, I_ACTION, etc.). As an example, the word "annunciato" (announced) is an event of type REPORTING expressed by a verb, while the word "tagli" (cuts) is an event of type OCCURRENCE expressed by a noun. Both event recognition and classification are based on a statistical classifier. EventPro is available with pre-trained models in the news domain for Italian and English.

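The column format described in the sections above (header lines starting with "#", a "# FIELDS:" line naming the columns, tab-separated annotation lines, and an empty line at the end of each sentence) can be read back with a few lines of code. The following Python sketch is not part of TextPro: the function names parse_textpro and iob2_spans are hypothetical, the sample is assembled by hand from the EntityPro example, and the code assumes that all header lines precede the annotation lines and that "# FIELDS:" is the last of them.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Sentence = List[Row]


def parse_textpro(text: str) -> Tuple[Dict[str, str], List[Sentence]]:
    """Split column output into header metadata and a list of sentences."""
    header: Dict[str, str] = {}
    fields: List[str] = []
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        # Assumption: header lines look like "# KEY: value" and the
        # "# FIELDS:" line is the last one before the annotation lines.
        if not fields and line.startswith("# "):
            key, _, value = line[2:].partition(":")
            header[key.strip()] = value.strip()
            if key.strip() == "FIELDS":
                fields = value.split()
            continue
        if not line.strip():
            # An empty line marks the end of a sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Columns are tab-separated; a single column may contain blanks.
        current.append(dict(zip(fields, line.split("\t"))))
    if current:
        sentences.append(current)
    return header, sentences


def iob2_spans(sentence: Sentence, column: str = "entity") -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens of one sentence into (type, text) pairs."""
    spans: List[Tuple[str, str]] = []
    label, tokens = "", []
    for row in sentence:
        prefix, _, kind = row.get(column, "O").partition("-")
        if prefix == "I" and tokens and kind == label:
            tokens.append(row["token"])  # continue the open span
            continue
        if tokens:  # any other tag closes the open span
            spans.append((label, " ".join(tokens)))
            label, tokens = "", []
        if prefix in ("B", "I"):  # start a new span
            label, tokens = kind, [row["token"]]
    if tokens:
        spans.append((label, " ".join(tokens)))
    return spans


# Hand-built sample: two sentences with the columns token and entity.
SAMPLE = "\n".join([
    "# FILE: example.txt",
    "# FIELDS: token\tentity",
    "Barack\tB-PER", "Obama\tI-PER", "was\tO", "elected\tO",
    "U.S.\tB-LOC", "president\tO", "in\tO", "2012\tO", ".\tO",
    "",
    "He\tO", "'ll\tO", "complete\tO", "his\tO", "term\tO",
    "in\tO", "January\tO", "2017\tO", ".\tO",
])

header, sentences = parse_textpro(SAMPLE)
for sentence in sentences:
    print(iob2_spans(sentence))
```

The same grouping applies to the other IOB2 columns (chunk, tmx) by passing a different column name. Splitting on tabs only, rather than on any whitespace, keeps multi-valued columns such as full_morpho intact.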
APPENDIX A: the ELRA tagset

Italian Tag | Description | Example |
---|---|---|
XPS | punctuation | . ; : ? ! |
XPW | comma | , |
XPB | brackets | ( ) |
XPO | quotation mark, ellipsis, hyphen | `` '' - ... _ |
N | number | 1, 1999, '76, sei, sesto |
RS | singular article | il, l', la, un, una |
RP | plural article | i, gli, gl', le |
AS | singular qual. adj. | vera, grandissimo, migliore |
AP | plural qual. adj. | vere, grandissimi, maggiori |
AN | qual. adj. neutral for number | rosa, più, super, antincendio |
DS | singular det. adj. | quello, alcuna, mio, quale? |
DP | plural det. adj. | quelli, alcune, miei, quali? |
DN | det. adj. neutral for number | qualsiasi |
E | simple preposition | di, a, dopo, fino, nonostante |
ES | singular articulated preposition | dal, sulla, nello |
EP | plural articulated preposition | dalle, sulle, negli, nei, ai |
B | adverb | molto, invece, esattamente |
C | conjunction | e, ma, bensì, sia, perché |
CCHE | che | che |
CCHI | chi | chi |
CADV | adverbial connective | come, dove, quando |
PS | singular pronoun | ciascuna, lo, mio |
PP | plural pronoun | costoro, esse, nostri, loro |
PN | pronoun neutral for number | ci, cui, sé |
SS | singular noun | aereo, formula |
SP | plural noun | aerei, formule |
SN | noun neutral for number | attività, business, novità |
SPN | proper noun | Alfredo, Ford, Piombino |
QNS | singular relative pronoun | quanto, quanta |
QNP | plural relative pronoun | quanti, quante |
YA | acronym | ANSA, CEE, ONU |
YF | foreign term | city, fiesta, Papier |
I | interjection | oh! |
VI | main verb, ind., subjunctive, cond. | vedo, giungano, saprei |
VIY | aux. verb, ind., subjunctive, cond. | ho, sia, avrebbe |
VF | main verb, inf. | arrivare, vedere |
VFY | aux. verb, inf. | avere, essere |
VSP | main verb, past part., singular | acquisito, interrotto |
VSPY | aux. verb, past part., singular | avente, stato, stata |
VPP | main verb, past part., plural | arrivati |
VPPY | aux. verb, past part., plural | state |
VG | main verb, gerund | cantando, ringraziando |
VGY | aux. verb, gerund | avendo, essendo |
VM | main verb, imperative | cercate, leggi |
VMY | aux. verb, imperative | sia, abbia |
+E | clitic | ne, ci |

APPENDIX B: the BNC tagset

English Tag | Description |
---|---|
AJ0 | Adjective (general or positive) (e.g. good, old, beautiful) |
AJC | Comparative adjective (e.g. better, older) |
AJS | Superlative adjective (e.g. best, oldest) |
AT0 | Article (e.g. the, a, an, no) |
AV0 | General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest) |
AVP | Adverb particle (e.g. up, off, out) |
AVQ | Wh-adverb (e.g. when, where, how, why, wherever) |
CJC | Coordinating conjunction (e.g. and, or, but) |
CJS | Subordinating conjunction (e.g. although, when) |
CJT | The subordinating conjunction that |
CRD | Cardinal number (e.g. one, 3, fifty-five, 3609) |
DPS | Possessive determiner-pronoun (e.g. your, their, his) |
DT0 | General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0 |
DTQ | Wh-determiner-pronoun (e.g. which, what, whose, whichever) |
EX0 | Existential there, i.e. there occurring in the there is ... or there are ... construction |
ITJ | Interjection or other isolate (e.g. oh, yes, mhm, wow) |
NN0 | Common noun, neutral for number (e.g. aircraft, data, committee) |
NN1 | Singular common noun (e.g. pencil, goose, time, revelation) |
NN2 | Plural common noun (e.g. pencils, geese, times, revelations) |
NP0 | Proper noun (e.g. London, Michael, Mars, IBM) |
ORD | Ordinal numeral (e.g. first, sixth, 77th, last) |
PNI | Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody) |
PNP | Personal pronoun (e.g. I, you, them, ours) |
PNQ | Wh-pronoun (e.g. who, whoever, whom) |
PNX | Reflexive pronoun (e.g. myself, yourself, itself, ourselves) |
POS | The possessive or genitive marker 's or ' |
PRF | The preposition of |
PRP | Preposition (except for of) (e.g. about, at, in, on, on behalf of, with) |
PUL | Punctuation: left bracket - i.e. ( or [ |
PUN | Punctuation: general separating mark - i.e. . , ! , : ; - or ? |
PUQ | Punctuation: quotation mark - i.e. ' or " |
PUR | Punctuation: right bracket - i.e. ) or ] |
TO0 | Infinitive marker to |
UNC | Unclassified items which are not appropriately considered as items of the English lexicon |
VBB | The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative] |
VBD | The past tense forms of the verb BE: was and were |
VBG | The -ing form of the verb BE: being |
VBI | The infinitive form of the verb BE: be |
VBN | The past participle form of the verb BE: been |
VBZ | The -s form of the verb BE: is, 's |
VDB | The finite base form of the verb DO: do |
VDD | The past tense form of the verb DO: did |
VDG | The -ing form of the verb DO: doing |
VDI | The infinitive form of the verb DO: do |
VDN | The past participle form of the verb DO: done |
VDZ | The -s form of the verb DO: does, 's |
VHB | The finite base form of the verb HAVE: have, 've |
VHD | The past tense form of the verb HAVE: had, 'd |
VHG | The -ing form of the verb HAVE: having |
VHI | The infinitive form of the verb HAVE: have |
VHN | The past participle form of the verb HAVE: had |
VHZ | The -s form of the verb HAVE: has, 's |
VM0 | Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd) |
VVB | The finite base form of lexical verbs (e.g. forget, send, live, return) [including the imperative and present subjunctive] |
VVD | The past tense form of lexical verbs (e.g. forgot, sent, lived, returned) |
VVG | The -ing form of lexical verbs (e.g. forgetting, sending, living, returning) |
VVI | The infinitive form of lexical verbs (e.g. forget, send, live, return) |
VVN | The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned) |
VVZ | The -s form of lexical verbs (e.g. forgets, sends, lives, returns) |
XX0 | The negative particle not or n't |
ZZ0 | Alphabetical symbols (e.g. A, a, B, b, c, d) |
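
The mapping from PoS tags to WordNet PoS labels (the wnpos column described in the LemmaPro section) amounts to a lookup over the two tagsets listed in Appendices A and B. The Python sketch below transcribes the tag lists given in that section; it is an illustration and not TextPro's own code. The function name wnpos and the language key "ita" are assumptions made for this sketch ("eng" is the value used with the -l option in the command example), and returning __NULL__ for unmapped tags follows the wnpos example. The tool's actual output may differ from these lists: the wnpos example shows CRD labeled n, whereas the lists place CRD under adjectives.

```python
from typing import Dict

# Tag lists transcribed from the mapping given in the LemmaPro section.
# The language keys are labels chosen for this sketch.
WNPOS_GROUPS: Dict[str, Dict[str, str]] = {
    "ita": {  # ELRA tagset (Appendix A)
        "n": "SS,SP,SN,SPN,YA,YF",
        "v": "VI,VI+E,VIY,VIY+E,VF,VF+E,VFY,VFY+E,VSP,VSP+E,VSPY,VSPY+E,"
             "VPP,VPP+E,VPPY,VPPY+E,VG,VG+E,VGY,VGY+E,VM,VM+E,VMY,VMY+E",
        "a": "AS,AP,AN,DS,DP,DN",
        "r": "B",
    },
    "eng": {  # BNC tagset (Appendix B)
        "n": "ITJ,NN0,NN1,NN2,NP0,UNC,ZZ0",
        "v": "VBB,VBD,VBG,VBI,VBN,VBZ,VDB,VDD,VDG,VDI,VDN,VDZ,"
             "VHB,VHD,VHG,VHI,VHN,VHZ,VM0,VVB,VVD,VVG,VVI,VVN,VVZ",
        "a": "AJ0,AJC,AJS,CRD,DPS,ORD",
        "r": "AV0,AVP,AVQ,XX0",
    },
}

# Invert the groups once into a per-language tag -> WordNet PoS table.
TAG_TO_WNPOS: Dict[str, Dict[str, str]] = {
    lang: {tag: wn for wn, tags in groups.items() for tag in tags.split(",")}
    for lang, groups in WNPOS_GROUPS.items()
}


def wnpos(tag: str, lang: str) -> str:
    """Return n, v, a or r for a PoS tag, or __NULL__ if it is unmapped."""
    return TAG_TO_WNPOS[lang].get(tag, "__NULL__")


for tag in ("NP0", "VVN", "DPS", "PRP"):
    print(tag, wnpos(tag, "eng"))
```

Inverting the lists once keeps each lookup a single dictionary access and makes it easy to check the table against the lists it was copied from.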