Corpora and Machine Translation Harold Somers

Rapid development of MT for less-studied languages


An important attraction of corpus-based MT techniques is the possibility that they can be used to quickly develop MT systems for less-studied languages (cf. Article 23), inasmuch as these techniques require only bilingual corpora and appropriate tools for alignment, extraction of linguistic data and so on. It must be said that some of the latest ideas, particularly in SMT, which require treebanks and parsers, make this less relevant. Nevertheless, empirical methods do seem to embody the best hope for resourcing under-resourced languages.

The first attempt to demonstrate the feasibility of this was at the Johns Hopkins Summer Workshop in 1999, when students built a Chinese–English SMT system in one day (Al-Onaizan et al. 1999). Although Chinese is not a less-studied language as such, the experiment is of interest because English and Chinese are typologically quite dissimilar. The corpus used was the 7-million-word “Hong Kong Laws” corpus, and the system was built using the EGYPT SMT toolkit developed at the same workshop and now generally available online.

Germann (2001) tried similar techniques with rapidly developed resources, building a Tamil–English MT system by manually translating 24,000 words of Tamil into English in a six-week period. Weerasinghe (2002) worked on Sinhala–English using a 50,000-word corpus from the World Socialist Web Site. Oard and Och (2003) built a system to translate between English and the Philippine language Cebuano, based on 1.3 million words of parallel text collected from five sources (including Bible translations and online and hard-copy newsletters). Foster et al. (2003) describe a number of difficulties in their attempt to build a Chinese–English MT system in this way.


MT is often described as the historically original task of Natural Language Processing, as well as the archetypical task in that it has a bit of everything, indeed in several languages; so it is no surprise that corpora – or at least collections of texts – have played a significant role in the history of MT. However, it is only in the last 10–15 years that they have really come to the fore with the emergence and now predominance of corpus-based techniques for MT. This article has reviewed that history, from “reference corpora” in the days of rule-based MT via corpus-based translators’ tools, to MT methods based exclusively on corpus information. Many of the tools developed for corpus exploitation and described in other chapters in this book have had their genesis in MT, and research in corpus-based MT is certainly at the forefront of computational linguistics at the moment.


I would like to thank the editors and an anonymous reviewer for their very helpful comments on earlier drafts of this article. I would also like to thank Andy Way for his advice and suggestions on several sections of this article. All errors and infelicities remain of course my own responsibility.


Al-Onaizan, Y./Curin, J./Jahr, M./Knight, K./Lafferty, J./Melamed, D./Och, F.-J./Purdy, D./Smith, N.A./Yarowsky, D. (1999) Statistical machine translation: Final report, JHU Workshop 1999. Technical report, Johns Hopkins University, Baltimore, MD. Available at [accessed 7 June 2005].

Alshawi, H./Srinivas, B./Douglas, S. (1998) Automatic acquisition of hierarchical transduction models for machine translation. In: COLING-ACL ’98: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montreal, pp. 41–47.

Arthern, P.J. (1978) Machine translation and computerized terminology systems: a translator’s viewpoint. In: Snell, B.M. (ed.) Translating and the Computer: Proceedings of a Seminar, London, 14th November 1978, Amsterdam (1979): North Holland, 77–108.

Barlow, M. (1995) ParaConc: A concordancer for parallel texts. In: Computers and Texts 10, 14–16.

Bowker, L. (2002) Computer-Aided Translation Technology. A Practical Introduction. Ottawa: University of Ottawa Press.

Brown, P. F./Cocke, J./Della Pietra, S. A./Della Pietra, V. J./Jelinek, F./Lafferty, J. D./Mercer, R. L./Roossin, P. S. (1990) A statistical approach to machine translation. In: Computational Linguistics 16, 79–85; repr. in Nirenburg et al. (2003), 355–362.

Brown, P. F./Della Pietra, S. A./Della Pietra, V. J./Mercer, R. L. (1993) The mathematics of statistical machine translation: Parameter estimation. In: Computational Linguistics 19, 263–311.

Brown, R. D. (2000) Automated generalization of translation examples. In: Proceedings of the 18th International Conference on Computational Linguistics, Coling 2000 in Europe, Saarbrücken, Germany, 125-131.

Brown, R. D. (2001) Transfer-rule induction for example-based translation. In: MT Summit VIII Workshop on Example-Based Machine Translation, Santiago de Compostela, Spain, 1-11.

Carl, M./Way, A. (2003) (eds) Recent Advances in Example-Based Machine Translation. Dordrecht: Kluwer Academic Press.

Carl, M./Way, A. (2006/7) (eds) Special issue on example-based machine translation. Machine Translation 19(3-4) and 20(1).

Charniak, E./Knight, K./Yamada, K. (2003) Syntax-based language models for statistical machine translation. In: MT Summit IX, Proceedings of the Ninth Machine Translation Summit, New Orleans, LA, 40–46.

Church, K.W./Gale, W.A. (1991) Concordances for parallel texts. In: Using Corpora, Proceedings of the 7th Annual Conference of the UW Centre for the New OED and Text Research, Oxford, 40–62.

Cicekli, I. (2006) Inducing translation templates with type constraints. Machine Translation 19, 281-297.

Cicekli, I./Güvenir, H.A. (1996) Learning translation rules from a bilingual corpus. In: NeMLaP-2: Proceedings of the Second International Conference on New Methods in Language Processing, Ankara, Turkey, 90–97.

Cicekli, I./Güvenir, H.A. (2003) Learning translation templates from bilingual translation examples. In: Carl & Way 2003, 255-286.

Čmejrek, M./Cuřín, J./Havelka, J. (2003) Treebanks in machine translation. In: Proceedings of The Second Workshop on Treebanks and Linguistic Theories (TLT 2003), Växjö, Sweden, pp. 209–212.

Collins, B. (1998) Example-Based Machine Translation: An Adaptation-Guided Retrieval Approach. PhD thesis, Trinity College, Dublin.

Cranias, L./Papageorgiou, H./Piperidis, S. (1997) Example retrieval from a translation memory. In: Natural Language Engineering 3, 255–277.

Dempster, A. P./Laird, N. M./Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm. In: Journal of the Royal Statistical Society Series B 39, 1–38.

Foster, G./Gandrabur, S./Langlais, P./Plamondon, P./Russell, G./Simard, M. (2003) Statistical machine translation: Rapid development with limited resources. In: MT Summit IX, Proceedings of the Ninth Machine Translation Summit, New Orleans, LA, 110–117.

Foster, G./Langlais, P./Lapalme, G. (2002) User-friendly text prediction for translators. In: 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), Philadelphia, PA, 148–155.

García-Varea, I./Casacuberta, F./Ney, H. (1998) An iterative DP-based search algorithm for statistical machine translation. In: Proceedings of the Fifth International Conference on Spoken Language Processing (ICSLP 98), Sydney, pp. 1135–1139.

Gaussier, E./Langé, J.-M./Meunier, F. (1992) Towards bilingual terminology. In: Proceedings of the ALLC/ACH Conference, Oxford, 121–124.

Germann, U. (2001) Building a statistical machine translation system from scratch: How much bang for the buck can we expect? In: ACL-2001 Workshop on Data-Driven Methods in Machine Translation, Toulouse, France, 1–8.

Germann, U./Jahr, M./Knight, K./Marcu, D./Yamada, K. (2001) Fast decoding and optimal decoding for machine translation. In: Association for Computational Linguistics 39th Annual Meeting and 10th Conference of the European Chapter, Toulouse, France, pp. 228–235.

Gildea, D. (2003) Loosely tree-based alignment for machine translation. In: 41st Annual Conference of the Association for Computational Linguistics, Sapporo, Japan, 80–87.

Harris, B. (1988) Bi-text, a new concept in translation theory. In: Language Monthly 54, 8–10.

Hearne, M./Way, A. (2003) Seeing the wood for the trees: data-oriented translation. In: MT Summit IX, Proceedings of the Ninth Machine Translation Summit, New Orleans, LA, 165–172.

Isabelle, P. (1992a) Préface - Preface. In: Quatrième Colloque international sur les aspects théoriques et méthodologiques de la traduction automatique, Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, TMI-92, Montréal, Canada, iii.

Isabelle, P. (1992b) Bi-textual aids for translators. In: Screening Words: User Interfaces for Text, Proceedings of the 8th Annual Conference of the UW Centre for the New OED and Text Research, Waterloo, Ont.; available at

Jurafsky, D./Martin, J. H. (2000) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.

Kay, M. (1980) The proper place of men and machines in language translation. Research Report CSL-80-11, Xerox PARC, Palo Alto, Calif.; repr. in Machine Translation 12 (1997), 3–23; and in Nirenburg et al. (2003), 221–232.

King, G. W. (1956) Stochastic methods of mechanical translation. Mechanical Translation 3(2), 38–39; repr. in Nirenburg et al. (2003), 37–38.

Knight, K. (1999) Decoding complexity in word-replacement translation models. In: Computational Linguistics 25, 607–615.

Knight, K./Koehn, P. (2004) What’s new in statistical machine translation? Tutorial at HLT-NAACL 2004, Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Alberta, Canada.

Koehn, P./Knight, K. (2003) Feature-rich statistical translation of noun phrases. In: 41st Annual Conference of the Association for Computational Linguistics, Sapporo, Japan, 311–318.

Koehn, P./Och, F. J./Marcu, D. (2003) Statistical phrase-based translation. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Alberta, 127–133.

Lehrberger, J. (1982) Automatic translation and the concept of sublanguage. In: Kittredge, R. I. & Lehrberger, J. (eds) Sublanguage: Studies of Language in Restricted Semantic Domains. Berlin: Mouton de Gruyter, 81–106; repr. in Nirenburg et al. (2003), 207–220.

Macdonald, K. (2001) Improving automatic alignment for translation memory creation. In: Translating and the Computer 23: Proceedings from the Aslib Conference, London [pages not numbered].

Macklovitch, E./Russell, G. (2000) What’s been forgotten in translation memory. In: White, J. S. (ed.) Envisioning Machine Translation in the Information Future: 4th Conference of the Association for Machine Translation in the Americas, AMTA 2000, Cuernavaca, Mexico, Berlin: Springer, 137–146.

Manning, C. D./Schütze, H. (1999) Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

Marcu, D./Wong, W. (2002) A phrase-based, joint probability model for statistical machine translation. In: Conference on Empirical Methods for Natural Language Processing (EMNLP 2002), Philadelphia, PA.

McTait, K./Trujillo, A. (1999) A language-neutral sparse-data algorithm for extracting translation patterns. In: Proceedings of the 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI 99), Chester, England, 98–108.

Melby, A. (1981) A bilingual concordance system and its use in linguistic studies. In: Gutwinski, W. & Jolly, G. (eds) LACUS 8: the 8th Lacus Forum, Glendon College, York University, Canada, August 1981. Columbia, SC (1982): Hornbeam Press, 541–554.

Nagao, M. (1984) A framework of a mechanical translation between Japanese and English by analogy principle. In: Elithorn, A. & Banerji, R. (eds) Artificial and Human Intelligence, Amsterdam: North-Holland, pp. 173–180; repr. in Nirenburg et al. (2003), 351–354.

Niessen, S./Vogel, S./Ney, H./Tillmann, C. (1998) A DP-based search algorithm for statistical machine translation. In: COLING-ACL ’98: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montreal, pp. 960–967.

Nirenburg, S./Domashnev, C./Grannes, D.J. (1993) Two approaches to matching in example-based machine translation. In: Proceedings of the Fifth International Conference on Theoretical and Methodological Issues in Machine Translation TMI ’93: MT in the Next Generation, Kyoto, Japan, 47–57.

Nirenburg, S./Somers, H./Wilks, Y. (2003) (eds) Readings in Machine Translation. Cambridge, Mass.: MIT Press.

Oard, D. W./Och, F. J. (2003) Rapid-response machine translation for unexpected languages. In: MT Summit IX, Proceedings of the Ninth Machine Translation Summit, New Orleans, LA, 277–283.

Och, F. J./Gildea, D./Khudanpur, S./Sarkar, A./Yamada, K./Fraser, A./Kumar, S./Shen, L./Smith, D./Eng, K./Jain, V./Jin, Z./Radev, D. (2004) A smorgasbord of features for statistical machine translation. In: Human Language Technology Conference and Annual Meeting of the North American Chapter of the Association for Computational Linguistics, Boston, MA, 161–168.

Och, F. J./Ney, H. (2003) A systematic comparison of various statistical alignment models. In: Computational Linguistics 29, 19–51.

Och, F. J./Ney, H. (2004) The alignment template approach to statistical machine translation. In: Computational Linguistics 30, 417–449.

Och, F. J./Tillmann, C./Ney, H. (1999) Improved alignment models for statistical machine translation. In: Proceedings of the 1999 Joint SIGDAT Conference of Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, MD, pp. 20–28.

Och, F. J./Ueffing, N./Ney, H. (2001) An efficient A* search algorithm for statistical machine translation. In: Proceedings of the Data-Driven Machine Translation Workshop, 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, pp. 55–62.

Planas, E./Furuse, O. (1999) Formalizing translation memories. In: Machine Translation Summit VII, Singapore, 331-330; repr. in Carl & Way 2003, 157-188.

Poutsma, A. (2000) Data-oriented parsing. In: COLING 2000 in Europe: The 18th International Conference on Computational Linguistics, Luxembourg, 635–641.

Rapp, R. (2002) A part-of-speech-based search algorithm for translation memories. In: LREC 2002, Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain, 466–472.

Romary, L./Mehl, N./Woolls, D. (1995) The Lingua parallel concordancing project: Managing multilingual texts for educational purposes. In: Text Technology 5, 206–220.

Sato, S./Nagao, M. (1990) Toward memory-based translation. In: COLING-90, Papers Presented to the 13th International Conference on Computational Linguistics, Helsinki, Finland, Vol. 3, pp. 247–252.

Simard, M./Foster, G./Perrault, F. (1993) TransSearch: A bilingual concordance tool. Industry Canada Centre for Information Technology Innovation (CITI), Laval, Canada, October 1993; available at

Somers, H. (2003) Translation memory systems. In: Somers, H. (ed.) Computers and Translation: A Translator's Guide, Amsterdam: Benjamins, 31–47.

Somers, H./Fernández Díaz, G. (2003) Diferencias e interconexiones existentes entre los sistemas de memorias de traducción y la EBMT. In: Corpas Pastor, G. & Varela Salinas, M.a-J. (eds) Entornos informáticos de la traducción profesional: las memorias de traducción, Granada: Editorial Atrio, 167–192; English version, Translation memory vs. example-based MT: What is the difference? In: International Journal of Translation 16(2) (2004), 5–33.

Somers, H./Tsujii, J./Jones, D. (1990) Machine translation without a source text. In: COLING-90, Papers Presented to the 13th International Conference on Computational Linguistics, Helsinki, Finland, Vol. 3, 271–276; repr. in Nirenburg et al. (2003), 401–406.

Sumita, E./Iida, H./Kohyama, H. (1990) Translating with examples: A new approach to machine translation. In: The Third International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Language, Austin, Texas, pp. 203–212.

Tillmann, C./Ney, H. (2003) Word reordering and a dynamic programming beam search algorithm for statistical machine translation. In: Computational Linguistics 29, 97–133.

Tillmann, C./Vogel, S./Ney, H./Sawaf, H. (2000) Statistical translation of text and speech: First results with the RWTH system. In: Machine Translation 15, 43–74.

Tillmann, C./Vogel, S./Ney, H./Zubiaga, A. (1997) A DP-based search using monotone alignments in statistical translation. In: 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, pp. 289–296.

Ueffing, N./Och, F. J./Ney, H. (2002) Generation of word graphs in statistical machine translation. In: Conference on Empirical Methods for Natural Language Processing (EMNLP 2002), Philadelphia, PA, pp. 156–163.

Wang, Y-Y./Waibel, A. (1997) Decoding algorithm in statistical machine translation. In: 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, pp. 366–372.

Wang, Y-Y./Waibel, A. (1998) Modeling with structures in statistical machine translation. In: COLING-ACL ’98: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montreal, Canada, pp. 1357–1363.

Watanabe, T./Sumita, E. (2003) Example-based decoding for statistical machine translation. In: MT Summit IX, Proceedings of the Ninth Machine Translation Summit, New Orleans, LA, 410–417.

Way, A./Gough, N. (2005) Comparing example-based and statistical machine translation. In: Journal of Natural Language Engineering 11, 295–309.

Weerasinghe, R. (2002) Bootstrapping the lexicon building process for machine translation between ‘new’ languages. In: Richardson, S. D. (ed.) Machine Translation: From Research to Real Users, 5th Conference of the Association for Machine Translation in the Americas, AMTA 2002, Tiburon, CA, Berlin: Springer, 177–186.

Wu, D. (1997) Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. In: Computational Linguistics 23, 377–403.

Wu, D./Wong, H. (1998) Machine translation with a stochastic grammatical channel. In: COLING-ACL ’98: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montreal, Canada, pp. 1408–1414.

Yamada, K./Knight, K. (2001) A syntax-based statistical translation model. In: Association for Computational Linguistics 39th Annual Meeting and 10th Conference of the European Chapter, Toulouse, France, pp. 523–530.

Figure 1. Initial phrasal alignment for example (22) [alignment diagram over “Maria no daba una bofetada a la bruja verda” not reproduced]

Figure 2. Further phrasal identification [alignment diagram over the same sentence not reproduced]
