My research interests are in the areas of Computational Linguistics, Language Technology, and Digital Humanities. Below are the topics I work (or have worked) on. The currently more active ones are listed first.

Author Analysis

I am interested in the automatic analysis of authorship, both in terms of determining whether two texts are written by the same person or not (authorship attribution) and in terms of identifying characteristics of the author, such as gender and age (author profiling). A few pointers:

Sentiment and Emotion Analysis

I use language processing tools to analyse and predict the way people express themselves on social media. I have pioneered this work on Italian, but I also work with other languages. I am also one of the initiators and scientific coordinators of the Social Media Sensing group at the University of Groningen.
  • We have created TWITA, a corpus of Italian tweets, tokenised, POS-tagged, and (automatically) sentiment annotated.
  • I am the co-organiser of the first and second campaigns for sentiment analysis in Italian: SENTIPOLC 2014 and SENTIPOLC 2016, run within the framework of EVALITA.
  • I co-organise the PEOPLES workshop, which held its second edition in 2018. PEOPLES is a forum for discussing the interplay of various aspects of profiling/sentiment/emotions in social media, and is co-located with major events (COLING 2016, NAACL 2018).
  • I have also worked on emotion and controversy detection, exploiting Facebook reactions as distant silver labels for training.
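
As a rough illustration of the distant-labelling idea in the last point (a sketch under stated assumptions, not the pipeline actually used in this work), reaction counts on a post can be turned into a silver emotion label. The reaction-to-emotion mapping, the function name `silver_label`, and both thresholds are hypothetical choices made for this example.

```python
from collections import Counter

# Hypothetical mapping from reaction names to emotion labels (assumption).
REACTION_TO_EMOTION = {
    "love": "joy",
    "haha": "joy",
    "wow": "surprise",
    "sad": "sadness",
    "angry": "anger",
}


def silver_label(reactions, min_total=10, min_share=0.5):
    """Return a distant emotion label for a post, or None if the signal is weak.

    reactions: dict mapping a reaction name to its count on the post.
    min_total and min_share are illustrative thresholds (assumptions).
    """
    totals = Counter()
    for reaction, count in reactions.items():
        emotion = REACTION_TO_EMOTION.get(reaction)
        if emotion is not None:  # unmapped reactions (e.g. "like") are ignored
            totals[emotion] += count

    total = sum(totals.values())
    if total < min_total:
        return None  # too few informative reactions to trust

    emotion, count = totals.most_common(1)[0]
    # Keep the label only if one emotion clearly dominates.
    return emotion if count / total >= min_share else None
```

Posts that receive a label can then serve as (noisy) training data for a supervised emotion classifier, while posts returning `None` are simply left out.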

Modality

Investigation of factuality and speaker's attitude in a typologically sound, cross-linguistic perspective, including the promotion of a multilingual annotation scheme. Analysis and proposals for the crucial issue of annotation units. Related fields: opinion mining, sentiment analysis.

Multiword units

Analysis of the semantics of complex nominals of the type N+P+N in Italian. Development of models for the interpretation of P based on the lexical semantics of both nouns involved. Integration of hypernym information from MultiWordNet. Analysis of (atypical) compounds of the type N+N in Italian through corpus-based studies and from a language typology perspective. Multiword expressions in Italian: corpus-based work towards characterisation, creation of lexical resources, modelling of internal morphology, and automatic processing.

Regular Polysemy

Extensive corpus-based studies with special attention to proper nouns. Creation of several annotated, freely available datasets. Innovative view of metonymy resolution as a classification task, partially akin to word sense disambiguation. Development of machine learning algorithms for the automatic resolution of metonymy. Innovative integration of a thesaurus to alleviate data sparseness by exploiting the regularity of the phenomenon. Organisation of a shared task within the SemEval 2007 evaluation campaign.

Discourse/Dialogue Structure

Development of specific annotation schemes and creation of annotated corpora for information status in dialogue. Innovative framework for information structure that comprises both spoken and written data. Coordination of an annotation project. Analysis of paraphrases in annotated dialogues with specific attention to alternative constructions (e.g. pre- vs post-nominal genitive). Development of a statistical model for the automatic assignment of information status to discourse entities.

Entity Recognition

Use of statistical models for the recognition of various entities in several kinds of text. Particular attention to geographical, astronomical, and biomedical text. Best system for gene/protein recognition in 2004 (BioCreative challenge). Investigation of issues concerning the granularity of classification and relation extraction. Excellent experience in designing and supervising named entity annotation tasks.

Anaphora Resolution

Theoretical and corpus-based description of bridging anaphors and non-anaphoric definite NPs. Theoretical and corpus-based comparison of definite NPs with genitive constructions. Development of both symbolic and statistical methods for the resolution of different types of lexical anaphora. Innovative use of the Web as a source of knowledge for this task. Several annotation projects in Italian and English on written texts and dialogues.