News

Continued access to the latest NLP news.
Contact us

NLP Technologies
52, rue Le Royer Ouest,
Montréal, Québec,
Canada, H2Y 1W7
Phone: 514.733.8884
Fax: 514.733.5554

Recent News

NLP Technologies and the University of Montreal present an article on the automatic translation of tweets and hashtags at LREC 2014, May 26-31, Reykjavik, Iceland

LREC 2014 is the 9th edition of the Language Resources and Evaluation Conference, which will be held in Reykjavik, Iceland, from May 26 to May 31, 2014.
http://lrec2014.lrec-conf.org/en/
The recent research and development project of NLP Technologies, carried out in collaboration with the Applied Research in Computational Linguistics (RALI) laboratory of the University of Montreal under the supervision of Dr. Philippe Langlais, deals with machine translation of social media texts. The result of this joint R&D effort will be presented at LREC 2014.

Title
Hashtag Occurrences, Layout and Translation: A Corpus-driven Analysis of Tweets Published by the Canadian Government
Fabrizio Gotti, Philippe Langlais, Atefeh Farzindar
RALI-DIRO, Université de Montréal, C.P. 6128, Succ. Centre-Ville, Montréal, Canada, H3C 3J7
NLP Technologies Inc., 52 Le Royer, Montréal, Canada, H2Y 1W7
 
Abstract
In this article, we present an aligned bilingual corpus of 8,758 tweet pairs in French and English, derived from 12 Canadian government agencies. Hashtags account for 6% to 8% of all tokens, and exhibit a Zipfian distribution. They appear in either a tweet’s prologue, announcing its topic, or in the tweet’s text in lieu of traditional words, or in an epilogue. Hashtags are words prefixed with a pound sign in 80% of the cases. The rest is mostly multiword hashtags, for which we describe a simple segmentation algorithm. A manual analysis of the bilingual alignment of 5,000 hashtags shows that 5% (French) to 18% (English) of them don’t have a counterpart in their containing tweet’s translation. This analysis further shows that 80% of multiword hashtags are correctly translated by humans, and that the mistranslation of the rest may be due to incomplete translation directives regarding social media. We show how these resources and their analysis can guide the design of a statistical machine translation pipeline, and its evaluation. A baseline system implementing a tweet-specific tokenizer yields promising results. The system is improved by translating epilogues, prologues, and text separately. We attempt to feed the SMT engine with the original hashtag and some alternatives (“dehashed” version or a segmented version of multi-word hashtags), but translation quality improves at the cost of hashtag recall.
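
To make two ideas from the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not give the details of their segmentation algorithm or tokenizer, so everything here is an assumption for illustration. The function names `split_tweet` and `segment_hashtag`, the toy word list `VOCAB`, and the rule "leading hashtags form the prologue, trailing hashtags form the epilogue" are all hypothetical. The segmentation uses a simple dictionary lookup that prefers the fewest words.

```python
import re

# Hypothetical toy vocabulary; a real system would need a large word list.
VOCAB = {"canada", "day", "job", "jobs", "fair", "health", "care"}

# A token counts as a hashtag if it is '#' followed only by word characters.
HASHTAG = re.compile(r"^#\w+$")


def split_tweet(tweet):
    """Split a tweet into (prologue, text, epilogue) token lists.

    Assumption: leading hashtags form the prologue, trailing hashtags
    form the epilogue, and everything in between is the text, including
    hashtags used in lieu of ordinary words.
    """
    tokens = tweet.split()
    start = 0
    while start < len(tokens) and HASHTAG.match(tokens[start]):
        start += 1
    end = len(tokens)
    while end > start and HASHTAG.match(tokens[end - 1]):
        end -= 1
    return tokens[:start], tokens[start:end], tokens[end:]


def segment_hashtag(hashtag, vocab=VOCAB):
    """Segment a multiword hashtag into known words, preferring fewer words.

    Returns a list of lowercase words, or None if the hashtag cannot be
    fully covered by words from the vocabulary.
    """
    body = hashtag.lstrip("#").lower()
    # best[i] holds the shortest known segmentation of body[:i], or None.
    best = [None] * (len(body) + 1)
    best[0] = []
    for i in range(1, len(body) + 1):
        for j in range(i):
            word = body[j:i]
            if best[j] is not None and word in vocab:
                candidate = best[j] + [word]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(body)]


if __name__ == "__main__":
    prologue, text, epilogue = split_tweet(
        "#Jobs Visit the #jobfair today #Canada #CanadaDay"
    )
    print(prologue, text, epilogue)
    print(segment_hashtag("#CanadaDay"))
```

In this sketch, a hashtag followed by punctuation (for example "#jobs.") is treated as ordinary text; a tweet-specific tokenizer such as the one mentioned in the abstract would presumably handle such cases more carefully.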
 
About LREC 2014
Since its first edition, held in Granada in 1998, LREC has become the major event on Language Resources (LRs) and Evaluation for Language Technologies (LT). The aim of LREC is to provide an overview of the state of the art, explore new R&D directions and emerging trends, and exchange information regarding LRs and their applications, evaluation methodologies and tools, ongoing and planned activities, industrial uses and needs, and requirements coming from the e-society, with respect both to policy issues and to technological and organisational ones.
LREC provides a unique forum for researchers, industry representatives and funding agencies from across a wide spectrum of areas to discuss problems and opportunities, find new synergies and promote initiatives for international cooperation, in support of investigations in language sciences, progress and innovation in language technologies, and the development of corresponding products, services, applications and standards.
For further information, please visit LREC 2014’s website at http://lrec2014.lrec-conf.org/