Broadening statistical machine translation with comparable corpora and generalized models

Feb 15, 2012 · 3041 views
As we scale statistical machine translation (SMT) systems to general domains, we face many challenges. This talk outlines two approaches to building better broad-domain systems. First, progress in data-driven translation is limited by the availability of parallel data. A promising strategy for mitigating this scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment extraction approaches have shown that including parallel fragments in SMT training data can significantly improve translation quality. We describe efficient and effective generative models for extracting such fragments, and demonstrate that these algorithms yield substantial improvements on out-of-domain test data without degrading in-domain performance. Second, many modern SMT systems are heavily lexicalized. While lexicalized information excels on in-domain test data, quality falls off as the test data broadens. The second part of the talk describes robust generalized models that leverage lexicalization when it is available and back off to linguistic generalizations otherwise. This approach yields large improvements over baseline phrasal systems on broad-domain test sets.
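The backoff idea in the second half of the abstract can be sketched as follows. This is a minimal illustration, not the speaker's actual models: the score tables, the toy part-of-speech tagger, and all numbers are invented for the example.

```python
# Illustrative sketch: prefer a lexicalized score when the exact phrase
# pair was seen in training; otherwise back off to a coarser linguistic
# generalization (here, part-of-speech tag sequences). All tables and
# scores below are invented for illustration.

# Lexicalized scores keyed by the exact source/target phrase pair.
LEXICALIZED = {
    ("the house", "la maison"): 0.9,
}

# Generalized scores keyed by POS-tag sequences (the linguistic backoff).
GENERALIZED = {
    ("DT NN", "DT NN"): 0.6,
}

def pos_tags(phrase):
    """Toy tagger: tag each token via a tiny hypothetical lexicon."""
    lexicon = {"the": "DT", "la": "DT", "a": "DT",
               "house": "NN", "maison": "NN",
               "car": "NN", "voiture": "NN"}
    return " ".join(lexicon.get(tok, "X") for tok in phrase.split())

def score(src, tgt, default=0.1):
    """Use the lexicalized score if available; else the POS backoff."""
    if (src, tgt) in LEXICALIZED:
        return LEXICALIZED[(src, tgt)]
    key = (pos_tags(src), pos_tags(tgt))
    return GENERALIZED.get(key, default)

print(score("the house", "la maison"))  # seen pair -> lexicalized: 0.9
print(score("a car", "la voiture"))     # unseen pair -> POS backoff: 0.6
```

The point of the design is that frequent in-domain phrase pairs keep their sharp lexicalized estimates, while unseen broad-domain pairs still receive an informed score from the generalization rather than a flat unknown penalty.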

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.