A large public-access corpus for Japanese
Tomaz Erjavec, Adam Kilgarriff, Irena Srdanovic
A corpus is a collection of texts used for linguistic or literary research. With computing and the internet now ubiquitous, creating corpora can be fast and using them straightforward. For many kinds of linguistic research, an empiricist approach is viable as never before. We have loaded corpora for a number of languages into our corpus query tool, the Sketch Engine (Kilgarriff et al. 2004), as can be seen at http://www.sketchengine.co.uk. The tool makes it possible to rapidly answer a range of questions in syntax, lexis, language change and language variation.
Among the major world languages, Japanese lags behind in publicly accessible, searchable corpus resources. The corpora that do exist are not easily accessible to researchers without a computational background who study Japanese language, culture, history or literature.
We are currently developing a large corpus of Japanese web text, which we shall then load into the Sketch Engine, where it will take its place alongside Chinese, English, French, German, Italian, Portuguese and Spanish. By the time of the workshop the first version of the corpus will be available. The size of the corpus will be of the order of 100 million words. It is being gathered using methods described by Sharoff 2006, Baroni and Kilgarriff 2006, and for Japanese, Baroni and ??, Kawahara and Kurohashi 2006.
People are often concerned that web corpora will give a partial and distorted view of a language. Describing the nature of the text in a corpus is a challenge because we lack the vocabulary and methods for describing the types of texts in large corpora (Kilgarriff 2001). Experiments carried out on other web corpora, either gathered using methods similar to ours or using statistics taken directly from commercial search engines, have shown that these methods produce a corpus which, broadly speaking, represents the general language well and is suitable for general linguistic research as well as language technology resource development (Kilgarriff and Grefenstette 2003, Keller and Lapata 2003, Sharoff 2005, Ciaramita and Baroni 2006).
The Sketch Engine is a corpus tool with several distinctive features. It is fast, giving immediate responses to most regular queries on corpora of up to two billion words. It is designed for use over the web. It works with all standard browsers, so users need no technical knowledge and do not need to install any software on their machine. All that is required is a computer with a web connection. This makes it particularly useful in Japanese studies and other humanities areas, where researchers often do not have, and should not need, technical computational skills or support staff.
As well as offering standard corpus query functions such as concordancing, sorting, filtering etc., the Sketch Engine is unique in integrating grammatical analysis, which makes it possible to produce word sketches, one-page summaries of a word’s grammatical and collocational behaviour. These will be presented for Japanese at the workshop. Based on the grammatical analysis, we also produce a distributional thesaurus for the language, in which words occurring in similar settings, sharing the same collocates, are put together (Sparck Jones 1986, Grefenstette 1994, Lin 1998, Weeds and Weir 2003).
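The idea behind a distributional thesaurus can be illustrated with a minimal sketch. It is not the Sketch Engine's implementation: the toy triples, the function names and the use of plain Jaccard overlap are all assumptions made for illustration, and the cited work uses more refined, frequency-weighted similarity measures. Each word is represented by the set of (grammatical relation, collocate) contexts it occurs in, and words are ranked by how many of those contexts they share.

```python
from collections import defaultdict

# Hypothetical toy data: (word, grammatical relation, collocate) triples,
# standing in for the output of a grammatical analysis of a corpus.
TRIPLES = [
    ("tea", "object_of", "drink"),
    ("tea", "modifier", "green"),
    ("tea", "object_of", "pour"),
    ("coffee", "object_of", "drink"),
    ("coffee", "object_of", "pour"),
    ("coffee", "modifier", "black"),
    ("book", "object_of", "read"),
    ("book", "modifier", "green"),
]


def build_profiles(triples):
    """Map each word to the set of (relation, collocate) contexts it occurs in."""
    profiles = defaultdict(set)
    for word, relation, collocate in triples:
        profiles[word].add((relation, collocate))
    return dict(profiles)


def jaccard(a, b):
    """Proportion of contexts that two words share (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def thesaurus_entry(word, profiles):
    """Rank all other words by context overlap with `word`, best first."""
    target = profiles[word]
    scored = [
        (other, jaccard(target, contexts))
        for other, contexts in profiles.items()
        if other != word
    ]
    # Sort by descending similarity, then alphabetically to break ties.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


profiles = build_profiles(TRIPLES)
entry = thesaurus_entry("tea", profiles)
print(entry)
```

In this toy data "coffee" ranks above "book" as a neighbour of "tea", because it shares two of tea's three contexts while "book" shares only one. A real thesaurus would be built from many millions of such triples and would weight each context by its salience rather than counting all contexts equally.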
We believe that the Japanese web corpus as loaded into the Sketch Engine will be a useful resource for a wide range of researchers in Japanese language and culture.
References
Baroni, M. and A. Kilgarriff 2006. Large linguistically-processed Web corpora for multiple languages. Proc. EACL. Trento, Italy.
Ciaramita, M. and M. Baroni 2006. A figure of merit for the evaluation of Web-corpus randomness. Proceedings of EACL 2006 (11th Conference of the European Chapter of the Association for Computational Linguistics), East Stroudsburg PA: ACL. 217-224.
Grefenstette, G. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer.
Kawahara, D. and S. Kurohashi 2006. Case Frame Compilation from the Web using High-Performance Computing. Proc. LREC, Genoa, Italy.
Keller, F. and M. Lapata 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics 29:3, 459-484.
Kilgarriff, A. 2001. Comparing Corpora. International Journal of Corpus Linguistics 6 (1): 1-37.
Kilgarriff, A. and G. Grefenstette 2003. Introduction to the Special Issue on Web as Corpus. Computational Linguistics 29 (3).
Kilgarriff, A., P. Rychly, P. Smrz and D. Tugwell 2004. The Sketch Engine. Proc. Euralex. Lorient, France, July: 105-116.
Lin, D. 1998. Automatic retrieval and clustering of similar words. COLING-ACL, Montreal: 768-774.
Sharoff, S. 2006. Creating general-purpose corpora using automated search engine queries. In Baroni and Bernardini, editors, WaCky! Working papers on the Web as Corpus. Gedit, Bologna.
Sparck Jones, K. 1986. Synonymy and Semantic Classification. Edinburgh University Press.
Ueyama, M. and M. Baroni 2006. Automated construction and evaluation of a Japanese web-based reference corpus. Proceedings of Corpus Linguistics 2005, Birmingham, UK.