Linguistic Resources

Lexical resources and Corpora

MultiWordNet - website

A multilingual lexical database in which the Italian WordNet is strictly aligned with Princeton WordNet

WordNet Domains - website

A lexical resource created by augmenting WordNet with domain labels. It includes WordNet-Affect

MultiSemCor - website

An English/Italian parallel corpus

T-PAS - website

Typed Predicate Argument Structures for Italian

I-CAB - website

An annotated corpus consisting of 525 news stories taken from a local newspaper

LICO - A lexicon for Italian discourse connectives

LICO, Lexicon for Italian COnnectives, includes about 170 discourse connectives used in Italian, together with their lexical and orthographic variants, syntactic category, semantic relation(s) (according to the Penn Discourse Treebank relation catalogue), and a number of usage examples. Browse LICO on this page, where you can also find connective lexica for other languages.

Contributors:

    • Anna Feltracco
    • Denise Pangrazzi
    • Elisabetta Jezek
    • Bernardo Magnini
    • Manfred Stede

LICO is licensed under a Creative Commons Attribution (CC-BY 4.0) License

Please fill in the request form with your data (they will be stored in a database at FBK).

Publications or presentations containing research results obtained through the use of LICO should cite the following reference:

    • Feltracco Anna; Jezek Elisabetta; Magnini Bernardo; Stede Manfred. LICO: A Lexicon of Italian Connectives. In: Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016), Napoli, December 5-7, 2016.

Evalita NER2011 Dataset - The Dataset of the Evalita 2011 Named Entity Recognition Task

The Evalita NER2011 Dataset contains the test and training data used for the NER task at Evalita 2011.

For more detailed information, please refer to the Evalita website: NER2011.

We distribute:

    • manual transcription and Named Entities annotation;
    • automatic transcription;
    • audio files (please contact us directly to obtain them).

The research licence is granted through an online form to scholars working for academic and research institutions. Please give us clear evidence of your affiliation (for instance, your e-mail account and home page). If you are an undergraduate or master's student, the licence request should be submitted by a professor at your university.

Fill form

SWiiT - Italian Wikipedia annotated with entity mentions

SWiiT is the Italian Wikipedia annotated at five different levels:

    • basic NLP processing (tokenization, sentence splitting and PoS-tagging)
    • entity mentions (person, organization, location and geo-political entities)
    • entity subtypes (not completed)
    • entity co-reference (not completed)
    • dependency parsing (not completed)

References:

    • Silvana Marianela Bernaola Biggio, Roberto Zanoli, Manuela Speranza. Entity Mention Detection using a Combination of Redundancy-Driven Classifiers. Proc. of LREC, 7th edition of the Language Resources and Evaluation Conference, 19-21 May 2010, Valletta (Malta).

SWiiT is licensed under a Creative Commons Attribution 3.0 Unported License. Please fill in a request form with your data (they will be stored in a database at FBK).

Subscribe license

Causal-TimeBank - The TimeBank corpus taken from TempEval-3 task, annotated with causal information

Causal-TimeBank is the TimeBank corpus from the TempEval-3 task (part of the TempEval-3 English training data, TBAQ-cleaned), annotated with causal information. The annotation was performed using the CAT tool (Bartalesi Lenzi et al., 2012), a web-based application with a plugin to import annotated data in CAT-XML format and add new information on top of it, in this case causal information. The CAT-XML task file for annotating causal information on top of TimeML data (using the CAT tool) is available here [CAT User Manual: How to Manage Annotation Tasks].

We provide a tool, written in Python, to convert text annotated with TimeML into CAT-XML format and vice versa.

Note that we only consider gold-annotated EVENTs, TIMEX3s, SIGNALs and TLINKs from the TimeBank corpus (SLINKs and ALINKs are considered irrelevant to this task), and we add new information about causality in the form of C-SIGNAL and CLINK annotations. More information about the causality annotation can be found in the papers listed in the references below.

Contained in the corpus are:

    • 6,811 EVENTs (only events instantiated by the TimeML MAKEINSTANCE tag)
    • 5,118 TLINKs (temporal links)
    • 171 CSIGNALs (causal signals)
    • 318 CLINKs (causal links)
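
As a rough illustration of how the TimeML-side counts above could be reproduced, the following minimal Python sketch counts the EVENTs that have a MAKEINSTANCE, plus TIMEX3s, SIGNALs and TLINKs, while ignoring SLINKs and ALINKs. The sample document and the function name summarize_timeml are hypothetical and invented for illustration; this is not the conversion tool distributed with the corpus, and it does not cover the C-SIGNAL and CLINK layer, whose CAT-XML encoding is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TimeML document, invented for illustration only.
SAMPLE_TIMEML = """<TimeML>
<DCT><TIMEX3 tid="t0" type="DATE" value="1998-01-08">1998-01-08</TIMEX3></DCT>
<TEXT>The storm <EVENT eid="e1" class="OCCURRENCE">hit</EVENT> the coast and
<EVENT eid="e2" class="OCCURRENCE">caused</EVENT> heavy
<EVENT eid="e3" class="OCCURRENCE">flooding</EVENT>.</TEXT>
<MAKEINSTANCE eiid="ei1" eventID="e1" tense="PAST" aspect="NONE" polarity="POS" pos="VERB"/>
<MAKEINSTANCE eiid="ei3" eventID="e3" tense="NONE" aspect="NONE" polarity="POS" pos="NOUN"/>
<TLINK lid="l1" relType="BEFORE" eventInstanceID="ei1" relatedToTime="t0"/>
<SLINK lid="l2" relType="EVIDENTIAL" eventInstanceID="ei1" subordinatedEventInstance="ei3"/>
</TimeML>"""


def summarize_timeml(xml_text):
    """Count the TimeML annotations that are kept; SLINKs and ALINKs are ignored."""
    root = ET.fromstring(xml_text)

    # An EVENT is counted only if some MAKEINSTANCE element refers to it.
    instantiated_ids = {m.get("eventID") for m in root.iter("MAKEINSTANCE")}
    n_events = sum(
        1 for event in root.iter("EVENT") if event.get("eid") in instantiated_ids
    )

    return {
        "EVENT": n_events,
        "TIMEX3": sum(1 for _ in root.iter("TIMEX3")),
        "SIGNAL": sum(1 for _ in root.iter("SIGNAL")),
        "TLINK": sum(1 for _ in root.iter("TLINK")),
    }


if __name__ == "__main__":
    print(summarize_timeml(SAMPLE_TIMEML))
```

In the sample, the event e2 has no MAKEINSTANCE and the SLINK is skipped, so neither contributes to the counts.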

Download Causal-TimeBank

Whenever making reference to this resource please cite one of the following references:

    • Paramita Mirza, Rachele Sprugnoli, Sara Tonelli and Manuela Speranza. 2014. Annotating causality in the TempEval-3 corpus. In Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL), pages 10–19, Gothenburg, Sweden, April. Association for Computational Linguistics. [pdf] [bib]
    • Paramita Mirza and Sara Tonelli. 2014. An Analysis of Causality between Events and its Relation to Temporal Information. To appear in Proceedings of the 25th International Conference on Computational Linguistics, Dublin, Ireland.

Fact-Ita Bank - A subpart of Ita-TimeBank annotated with factuality information

Fact-Ita Bank consists of 170 news stories (for a total of 65,455 tokens) selected from Ita-TimeBank annotated with factuality information on top of TimeML event annotation.

Fact-Ita Bank is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Contributors: Begoña Altuna, Tommaso Caselli, Anne-Lyse Minard, Alessandro Marchetti, Manuela Speranza, Maria Giulia Zen

Publications or presentations containing research results obtained through the use of Fact-Ita Bank should cite the following reference:

To obtain the data, please fill in the request form with your data (they will be stored in a database at FBK):

Fill form

Developments of Fact-Ita Bank

    • Fact-Ita Bank for FactA@EVALITA 2016: a revised version of Fact-Ita Bank (169 news stories) released as training data for the FactA Task at EVALITA 2016.
    • Publications or presentations containing research results obtained through the use of Fact-Ita Bank for FactA@EVALITA 2016 should cite the following reference:
    • Anne-Lyse Minard, Manuela Speranza, and Tommaso Caselli. The EVALITA 2016 Event Factuality Annotation Task (FactA). In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Accademia University Press, Napoli, Italy, December 5-7, 2016.
    • Obtain the FactA version of Fact-Ita Bank: Fill form
    • Fact-Ita Bank Negation: Fact-Ita Bank for FactA@EVALITA 2016 is being enriched with a new level of annotation, namely negation cues, their scope and their focus.
    • Publications or presentations containing research results obtained through the use of Fact-Ita Bank Negation should cite the following reference:
    • Begoña Altuna, Anne-Lyse Minard, and Manuela Speranza. The Scope and Focus of Negation: A Complete Annotation Framework for Italian. In Proceedings of EACL 2017, SEMBEAR workshop, to appear.
    • Obtain Fact-Ita Bank Negation (Beta version: only 71 files available, 62 of which contain at least one negation cue): Fill form

Pilot Task of EVENTI @ Evalita 2014 - Test data set of the EVENTI Pilot Task on "Temporal Processing of Historical Texts"

Evaluation data and gold annotations of the pilot task "Temporal Processing of Historical Texts" organized within the EVENTI exercise at EVALITA 2014.

This dataset consists of more than 5,000 tokens from newspaper articles published in "Il Trentino" by Alcide De Gasperi, one of the founders of the Italian Republic and one of the fathers of the European Union. All the selected news stories date back to 1914, the year of the outbreak of World War I. The dataset has been manually annotated with semantic information (events, temporal expressions and temporal relations) following the It-TimeML guidelines, which are compliant with ISO-TimeML, the international standard based on TimeML (read more about It-TimeML). Data are released in the XML-based stand-off format provided by CAT (Content Annotation Tool), the tool used to annotate the corpus.

Data and annotations are freely available for research purposes (license terms).

Subscribe license

SemEval2015 TimeLine Dataset - Dataset of the SemEval-2015 Task "TimeLine: Cross-Document Event Ordering"

The aim of the "TimeLine: Cross-Document Event Ordering" Task at SemEval 2015 was to build timelines from written news in English. More specifically, the goal was to order on a timeline all the events in which a target entity is involved, focusing mainly on cross-document event coreference resolution and cross-document temporal relation extraction.

All the data provided to task participants are freely available from the task website.

NewsReader MEANTIME Corpus - A semantically annotated corpus of 480 news articles in 4 languages

The MEANTIME Corpus has been developed within the EU-funded project NewsReader, as a joint work between FBK, VUA and EHU.

It consists of a total of 480 news articles: 120 English Wikinews articles on four topics (i.e. 'Airbus and Boeing', 'Apple Inc.', 'Stock market', and 'General Motors, Chrysler and Ford') and their translations in Spanish, Italian, and Dutch.

It has been annotated manually at multiple levels, including entities, events, temporal information, semantic roles, and intra-document and cross-document event and entity coreference.

Distribution: The NewsReader MEANTIME corpus is licensed under a CC-BY license and is available from the dedicated page on the NewsReader website.

Reference: Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begoña Altuna, Marieke van Erp, Anneleen Schoen, and Chantal van Son. 2016. MEANTIME, the NewsReader Multilingual Event and Time Corpus. In Proceedings of LREC 2016. To appear.

Video: a video describing the resource is available here (slides)

NE-annotated-tweets-AL - Tweets annotated with Named Entities following the NEEL-IT guidelines

We annotated this corpus as part of our participation in the NEEL-IT task at EVALITA 2016 (http://www.evalita.it/2016/tasks/neel-it). We used it as additional training data.

The distributed corpus is composed of 1614 annotated tweets. A total of 3127 entities have been annotated.

The guidelines used are those of the NEEL-IT task: NEEL-it guidelines

NE-annotated-tweets-AL is licensed under a Creative Commons Attribution 4.0 International License.

Please fill in the request form with your data (they will be stored in a database at FBK).

Publications or presentations containing research results obtained through the use of NE-annotated-tweets-AL should cite the following reference:

    • Anne-Lyse Minard, Mohammed R. H. Qwaider, and Bernardo Magnini. 2016. FBK-NLP at NEEL-IT: Active Learning for Domain Adaptation. In Proceedings of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2016).

Download

WItaC - NewsReader Wikinews Italian Corpus - The Italian section of the NewsReader MEANTIME corpus

WItaC is the Italian section of the NewsReader MEANTIME corpus. It consists of the Italian translation of 120 English Wikinews (http://en.wikinews.org/) articles on four topics (i.e. Airbus and Boeing, Apple Inc., Stock market, and General Motors, Chrysler and Ford) and has been annotated manually at multiple levels, including entities, events, event factuality, temporal information, semantic roles, and intra-document and cross-document event and entity coreference.

For the annotation guidelines and other information, please refer to the NewsReader website: NewsReader MEANTIME corpus.

WItaC has been used as test data for the Evalita FactA task (Event Factuality Annotation) at EVALITA 2016.

For the annotation guidelines and other information, please refer to the Evalita website: FactA@EVALITA2016.

As part of the NewsReader MEANTIME corpus, WItaC is licensed under a Creative Commons Attribution 4.0 International License.

If you use WItaC, please cite one of the following papers:

If you use WItaC for FactA@EVALITA2016, please cite the following paper:

    • Anne-Lyse Minard, Manuela Speranza, and Tommaso Caselli. The EVALITA 2016 Event Factuality Annotation Task (FactA). In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Accademia University Press, Napoli, Italy, December 5-7, 2016.

Downloads:

Obtain WItaC: Download

Obtain WItaC for FactA@EVALITA2016: Fill form

Contrast-Ita Bank - A corpus annotated with discourse contrast relations in Italian

Contrast-Ita Bank is a corpus annotated with discourse contrast relations in Italian. We annotate both explicit and implicit contrast relations (CONTRAST and CONCESSION), following the schema proposed in the Penn Discourse Treebank.

Contrast-Ita Bank consists of 169 news stories (for a total of 65,455 tokens). The documents correspond to the documents of the Fact-Ita Bank corpus, annotated with factuality information and, partially, with negation (Fact-Ita Bank-Negation). Originally, the documents were selected from the larger corpus Ita-TimeBank: a language resource manually annotated with temporal and event information. The same documents are also part of the I-CAB corpus: a corpus of Italian news annotated with temporal expressions and different types of entities (i.e. persons, organizations, locations, and geo-political entities).

Despite the existing information that had already been annotated in these documents, the annotation of contrast was carried out on raw text. We annotated a total of 372 relations.

Contrast-Ita Bank is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Contributors: Anna Feltracco, Bernardo Magnini, Elisabetta Jezek, Anne-Lyse Minard, Manuela Speranza

Publications or presentations containing research results obtained through the use of Contrast-Ita Bank should cite the following reference:

    • Anna Feltracco, Bernardo Magnini, and Elisabetta Jezek. Contrast-Ita Bank: A corpus for Italian Annotated with Discourse Contrast Relations. To appear in Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017).

To obtain the data, please fill in the request form with your data (they will be stored in a database at FBK):

Download Contrast-Ita Bank

Annotation Tools and Software

TextPro - website

A suite of modular Natural Language Processing (NLP) tools for analysis of Italian and English texts

KnowledgeStore - website

Scalable storage for text and RDF data

Cromer - A Tool for Cross-Document Event and Entity Coreference

CROMER (CROss-document Main Events and entities Recognition) is a web-based tool to manually annotate event and entity coreference across clusters of documents. The tool has been developed to handle large collections of documents, support collaborative annotation (several annotators can work on the same clusters), and enable the linking of the annotated data to external knowledge sources. Given the availability of semantic information encoded in Semantic Web resources, the tool is designed to support annotators in linking entities and events to DBpedia and Wikipedia, so as to facilitate the automatic retrieval of additional semantic information. This eases event modelling and chaining while keeping the annotated data closely interconnected with external resources.

Download

AnnotatorPro - A tool for annotation of linguistic data

AnnotatorPro is a tool for annotation of linguistic data. It offers both complete monitoring functionalities (e.g. inter-annotator agreement and agreement with respect to a gold standard) and highly flexible task design (e.g. token and document level annotation, adjudication and reconciliation procedures). We tested AnnotatorPro in several industrial annotation scenarios, coupled with Active Learning techniques.
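
To make the notion of inter-annotator agreement concrete, the sketch below computes Cohen's kappa for two annotators who labelled the same tokens. This is only one common agreement measure, chosen here as an assumption: the text above does not state which metrics AnnotatorPro actually implements, and the function name cohen_kappa and the example labels are hypothetical. The same function can be applied to one annotator's labels and a gold standard.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical token-level entity labels from two annotators.
    annotator_1 = ["PER", "ORG", "O", "O"]
    annotator_2 = ["PER", "O", "O", "O"]
    print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (such as "O" in entity annotation) dominates the data.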

AnnotatorPro is distributed under the terms of Apache License 2.0.

Download link: https://bitbucket.org/qwaider/annotatorpro

Publications or presentations containing research results obtained through the use of AnnotatorPro should cite the following reference:

    • Mohammed R. H. Qwaider, Anne-Lyse Minard, Manuela Speranza, Bernardo Magnini. Find Problems before They Find You with AnnotatorPro's Monitoring Functionalities. In Proceedings of CLiC-it 2017, Rome, 11-13 December 2017. Download PDF.