Identifying translations in comparable corpora is a challenge that has long attracted researchers. It has applications in several fields, including Machine Translation and Cross-lingual Information Retrieval. In this study, we compare three state-of-the-art approaches to this task: the so-called context-based projection method, the projection of word embeddings, and a method dedicated to identifying translations of rare words. We carefully explore the meta-parameters of each method and measure their impact on the task of identifying the French translations of English words in Wikipedia.
Contrary to standard practice, we designed a test case in which we do not resort to heuristics to pre-select the target vocabulary in which to search for the translation, thereby pushing each method to its limit. We show that all the approaches we tested have a clear bias toward frequent words. In fact, the best approach we tested could identify the translation of a third of the frequent test words, but of only around 10% of the rare ones. In the end, we show that the union of the three approaches yields the best results, thus demonstrating their complementarity.
This session will be held by video conference in Room 430, Goldberg Computer Science Building.