Using text mining techniques to maintain translation memories

In this paper, we explore the use of text mining techniques for translation memory maintenance. Language service providers often hold large databases of translations, called translation memories, that have been in use for a long time. Over the years, such a translation memory gradually accumulates content from other domains (e.g. financial content added to a technical-domain translation memory). To the best of our knowledge, no tools exist that effectively separate the content of a translation memory by domain. The ability to extract individual domains from low-quality translation memories could be a significant benefit to language service providers looking to use modern translation methods, such as machine translation and automated terminology management. In the first stage, we used OntoGen, a semi-automatic ontology building tool, to separate the segments in the translation memory by domain. In the second stage, we tested whether OntoGen's topic keywords could serve as shortcuts for building classification models, since manual annotation is costly and time-consuming. If the topics extracted with OntoGen are accurate enough, the manual annotation phase of text classification could potentially be skipped, significantly speeding up the process. We succeeded in building an ontology of the translation memory, but the boundaries between some topics were relatively vague. One reason is that we had to work with individual sentences, as opposed to larger blocks of text, and sentences are difficult to classify. Nevertheless, the results of the ontology creation were promising: manual evaluation showed that around 4 in 5 strings were assigned a correct label.
The results of the second stage were less clear: accuracy improved significantly over the majority class classifier, but did not reach a level that would be considered useful in a professional language service provider environment.
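
To illustrate the second-stage idea, the following minimal Python sketch labels segments by overlap with per-topic keywords and compares the result against a majority class baseline. It is an illustration under stated assumptions, not the method used in the paper: the keyword sets, the example segments, and the function names (`label_by_keywords`, `majority_baseline_accuracy`, `accuracy`) are all hypothetical, and OntoGen itself is not called.

```python
from collections import Counter

# Hypothetical topic keywords, standing in for keywords taken from a topic ontology.
TOPIC_KEYWORDS = {
    "technical": {"valve", "install", "pressure", "screw"},
    "financial": {"interest", "loan", "account", "payment"},
}


def label_by_keywords(segment, topic_keywords):
    """Return the topic whose keywords overlap most with the segment, or None."""
    tokens = set(segment.lower().split())
    scores = {topic: len(tokens & kws) for topic, kws in topic_keywords.items()}
    best = max(scores, key=scores.get)
    # A segment that matches no keyword at all stays unlabelled.
    return best if scores[best] > 0 else None


def majority_baseline_accuracy(gold_labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(gold_labels)
    return counts.most_common(1)[0][1] / len(gold_labels)


def accuracy(predicted, gold):
    """Fraction of segments whose predicted label equals the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical translation memory segments with manually assigned domains.
segments = [
    "Install the valve and check the pressure",
    "Tighten the screw before use",
    "The loan carries a fixed interest rate",
    "Install the pressure gauge",
]
gold = ["technical", "technical", "financial", "technical"]

predicted = [label_by_keywords(s, TOPIC_KEYWORDS) for s in segments]
print("keyword accuracy:", accuracy(predicted, gold))
print("majority baseline:", majority_baseline_accuracy(gold))
```

On real translation memory segments, short sentences often match no keyword or match several topics equally, which is one way the vague topic boundaries described above would show up in such a sketch.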