Contemporary machine translation systems can be divided into three basic groups based on the approach:
1 Rule-based (or knowledge-based) MT systems (RBMT)
2 Statistical MT systems (SMT)
3 Hybrid MT systems
Rule-based machine translation systems (RBMT)
According to Baisa (7), rule-based MT systems were developed and widely used during the 1980s and continued to dominate the field of machine translation in the 1990s. RBMT systems have two critical components: linguistic rules and a lexicon. The rules represent the syntax of the languages involved, and the lexicon contains morphological, syntactic, and semantic information. However, these systems depend on human knowledge and input: they require both programming experts and expert linguists (Lagarda et al. 217), so their development and maintenance are very expensive and time-consuming. Any further adjustments, changes, and enhancements of the rules are also difficult and costly. The last important disadvantage of RBMT systems is that they “fail to adapt to new domains” (Lagarda et al. 217).
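The division of labour between a lexicon and rules can be sketched in a few lines of Python. This is only a toy illustration, not a description of any real RBMT system: the names LEXICON, ADJ_ENDINGS, and translate_noun_phrase are hypothetical, and the single rule covers nothing more than gender agreement of a hard-declension adjective in the nominative singular.

```python
# Toy lexicon: each English word maps to a Czech entry with part of speech
# and, for nouns, grammatical gender. All entries are illustrative.
LEXICON = {
    "young": {"pos": "adj", "stem": "mlad"},
    "man": {"pos": "noun", "form": "muž", "gender": "masc_anim"},
    "woman": {"pos": "noun", "form": "žena", "gender": "fem"},
    "city": {"pos": "noun", "form": "město", "gender": "neut"},
}

# Rule: a hard-declension adjective takes a nominative singular ending
# that agrees with the gender of the noun it modifies.
ADJ_ENDINGS = {"masc_anim": "ý", "masc_inanim": "ý", "fem": "á", "neut": "é"}


def translate_noun_phrase(phrase: str) -> str:
    """Translate an English 'adjective noun' phrase into Czech (nominative singular)."""
    adj_word, noun_word = phrase.lower().split()
    adj, noun = LEXICON[adj_word], LEXICON[noun_word]
    if adj["pos"] != "adj" or noun["pos"] != "noun":
        raise ValueError("expected an adjective followed by a noun")
    return adj["stem"] + ADJ_ENDINGS[noun["gender"]] + " " + noun["form"]


print(translate_noun_phrase("young woman"))
print(translate_noun_phrase("young city"))
```

Any word missing from the lexicon raises a KeyError here, which mirrors on a small scale why such systems are costly to extend and struggle with new domains: every new word and every new construction has to be added by hand.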
Statistical machine translation systems
The first statistical MT system was developed at IBM in the late 1980s, and statistical systems soon replaced the older rule-based systems. Cyril Goutte states in his work Learning Machine Translation:
This may in fact be seen as part of a general move in computational linguistics: Within about a decade, statistical approaches became overwhelmingly dominant in the field, as shown for example, in the proceedings of the annual conference of the Association for Computational Linguistics (ACL). (2)
Statistical systems neither require nor involve any linguistic knowledge or analysis; they simply “rely on probabilistic and statistical models of the translation process trained on large amounts of bilingual corpora” (Trujillo 210) and focus only on the meaning, not on grammatical correctness.
Statistical systems collect data for further translation improvements. They require large corpora of words and sentences in the form of language pairs (a source word or sentence and its counterpart in the target language), in which they monitor and measure features that enable translation prediction. Such features include, for example, the co-occurrence of two or more words in the source and target, the length of sentences, and the position of words within a sentence (Trujillo 210). The effectiveness and reliability of statistical machine translation depend on the size of the corpora: a good translation requires large corpora. Google Translate and Bing Translator are good examples of statistical machine translation systems.
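The co-occurrence feature mentioned above can be illustrated with a minimal Python sketch. It assumes a tiny invented English-Czech corpus (CORPUS) and two hypothetical helper functions, and it only counts raw co-occurrences; real statistical systems train probabilistic alignment and translation models on far larger data.

```python
from collections import Counter
from itertools import product

# Tiny hypothetical English-Czech parallel corpus (invented for illustration).
CORPUS = [
    ("the woman sings", "žena zpívá"),
    ("the woman sleeps", "žena spí"),
    ("the man sings", "muž zpívá"),
]


def cooccurrence_counts(pairs):
    """Count how often each (source word, target word) pair shares a sentence pair."""
    counts = Counter()
    for source, target in pairs:
        # Use sets so a word repeated within one sentence is counted once.
        for s, t in product(set(source.split()), set(target.split())):
            counts[(s, t)] += 1
    return counts


def best_target(word, counts):
    """Return the target word that co-occurs most often with `word`."""
    candidates = {t: c for (s, t), c in counts.items() if s == word}
    return max(candidates, key=candidates.get)


counts = cooccurrence_counts(CORPUS)
print(best_target("woman", counts))
print(best_target("sings", counts))
```

With only three sentence pairs, "woman" and "sings" already have a clear best candidate, but "man" co-occurs equally often with "muž" and "zpívá", so the counts cannot decide between them. This is a small-scale picture of why the quality of statistical translation depends on the size of the corpora.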
Hybrid machine translation systems
Hybrid MT systems are intended to address the issues of the systems described above. Such systems combine statistical and linguistic features, but their disadvantage is that they are even more complex than rule-based systems.
3 Machine translation in the Czech environment
3.1 General information about the Czech language
In their book The Czech Language in the Digital Age (Čeština v digitálním věku), Ondřej Bojar, Georg Rehm, and Hans Uszkoreit compare the contemporary digital revolution to Gutenberg’s invention of the printing press and try to infer the effects of this revolution from the impact that the printing press had on communication and on small languages. The printing press contributed to the exchange of information in Europe and to the standardization of major European languages; however, it also caused the extinction of many smaller languages (Bojar, Rehm, and Uszkoreit 3).
The Czech Republic is a small country in Central Europe with Czech as its official language. There are about 10,000,000 speakers living in the country and approximately another 200,000 speakers living abroad, especially in the United States, Canada, Austria, Germany, Slovakia, and Australia. More detailed information regarding Czech citizens living abroad can be found on the web pages of the Ministry of Foreign Affairs of the Czech Republic (http://www.mzv.cz/file/73462/statistikaANGL2007.pdf). However, these 200,000 people are Czech citizens (holders of Czech passports); there are also significant numbers of Czech emigrants and their children who do not have Czech citizenship but claim Czech origin. Moreover, the statistics mentioned above do not record how many people actually use the Czech language.
The Czech Republic is a member of the European Union, and since 2004 Czech has been one of the official languages of the EU. The European Union plays an important and irreplaceable role in the field of language technologies: it supports linguistic and technological research and currently funds a number of linguistic projects (such as EuroMatrix, EuroMatrix+, and iTranslate4) in order to achieve and preserve a multilingual environment within the EU. Speakers of a smaller language such as Czech face language barriers in their everyday lives and have an urgent need for tools that could help them overcome these barriers. Such tools could be especially important for small and medium-sized organizations and enterprises operating in the Czech Republic.
3.2 Specifics of the Czech language: morphology, syntax, and word order
To understand how machine translation is used for the Czech language, it is necessary to look more closely at the language itself, to explain the specifics of Czech, and to identify the crucial and problematic differences between Czech and English.
Since word order in Czech is relatively free (although a sentence often places known or given information at the beginning and new or most important information at the end), cases, together with prepositions, play a crucial role in expressing the syntactic relationships between the individual parts of a sentence.
“The Czech language is highly inflectional language with a complicated morphology” (Bojar, Rehm, and Uszkoreit 45). The declension of Czech nouns, adjectives, pronouns, and numerals distinguishes seven cases (nominative, genitive, dative, accusative, vocative, locative, and instrumental), two numbers (singular and plural), and four genders (masculine animate, masculine inanimate, feminine, and neuter) that are only partially influenced by natural gender. Each of the genders has its own declension paradigms (masculine animate: pán, muž, předseda, soudce; masculine inanimate: hrad, stroj; feminine: žena, růže, píseň, kost; neuter: město, moře, kuře, stavení), which may or may not unambiguously determine the ending of the word in each case. By way of contrast, the English language has only four cases (nominative, genitive, dative, and accusative), and gender in English exists only in connection with personal or human nouns and reflects natural gender. The endings of English nouns do not change in different cases (with the exception of the genitive, where English adds the ending ’s). The complexity of Czech grammar also extends to adjectives (the hard-declension paradigm mladý and the soft-declension paradigm jarní) and verbs. Czech adjectives change their endings depending on the case and gender of the associated noun, whereas English adjectives do not change their endings at all.
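To make the seven-case declension concrete, the short Python sketch below lists the singular forms of the feminine paradigm noun žena and looks up which cases a given word form can express. The names ZENA_SINGULAR and cases_for_form are hypothetical and serve only this illustration.

```python
# Singular forms of the feminine paradigm noun "žena" (woman), one per case.
ZENA_SINGULAR = {
    "nominative": "žena",
    "genitive": "ženy",
    "dative": "ženě",
    "accusative": "ženu",
    "vocative": "ženo",
    "locative": "ženě",
    "instrumental": "ženou",
}


def cases_for_form(form, paradigm):
    """Return every case in which the paradigm produces the given word form."""
    return [case for case, f in paradigm.items() if f == form]


print(cases_for_form("ženu", ZENA_SINGULAR))  # a single case
print(cases_for_form("ženě", ZENA_SINGULAR))  # two cases share this form
```

The form ženu points to exactly one case, while ženě serves as both dative and locative, so an automatic system has to rely on context, for example a preposition, to decide which case is meant. The corresponding English noun, woman, keeps the same form throughout, apart from the genitive.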
The differences between the very complex system of Czech grammar and English grammar could be analyzed and described in whole volumes, but that is not the aim of this work. The paragraphs above are meant only to illustrate the issues that the dissimilarity of the two languages can cause in translation without human involvement.