Draft: March 14, 2008




1.2 Thesis Claims


The goal of this thesis is to automate the morphological analysis of natural language by decomposing lexical items into a network of mutually substitutable substrings. This network enables unsupervised discovery of structures which closely correlate with inflectional paradigms. Additionally,

  1. The discovered paradigmatic structures lead directly to word segmentation algorithms that identify morphemes with a quality on par with state-of-the-art unsupervised morphology analysis systems.

  2. The unsupervised paradigm discovery and word segmentation algorithms achieve this state-of-the-art performance for the diverse set of natural languages which primarily construct words through concatenation of morphemes, e.g. Spanish and Turkish.

  3. The paradigm discovery and word segmentation algorithms are computationally tractable.

  4. Augmenting a morphologically naïve information retrieval (IR) system with induced segmentations improves performance on a real-world IR task. The IR improvements hold across a range of morphologically concatenative languages. Enhanced performance on other natural language processing tasks is likely.

1.3 ParaMor: Paradigms across Morphology



The paradigmatic, syntagmatic, and phoneme sequence constraints of natural language allow ParaMor, the unsupervised morphology induction algorithm described in this thesis, to first reconstruct the morphological structure of a language, and to then deconstruct word forms of that language into constituent morphemes. The structures that ParaMor captures are sets of mutually replaceable word-final strings which closely model emergent paradigm cross-products. To reconstruct these paradigm structures, ParaMor searches a network of paradigmatically and syntagmatically organized schemes of candidate suffixes and candidate stems. ParaMor’s search algorithms are motivated by paradigmatic, syntagmatic, and phoneme sequence constraints.

Figure 1.4 depicts a portion of a morphology scheme network automatically derived from 100,000 words of the Brown Corpus of English (Francis, 1964). Each box in Figure 1.4 is a scheme, which lists in bold a set of candidate suffixes, or c-suffixes, together with an abbreviated list, in italics, of candidate stems, or c-stems. Each of the c-suffixes in a scheme concatenates onto each of the c-stems in that scheme to form a word found in the input text. In Figure 1.4, the highlighted schemes containing the c-suffix sets Ø.ed.es.ing and e.ed.es.ing, where Ø signifies a null suffix, represent paradigmatically opposed sets of suffixes that constitute verbal sub-classes in English. The other candidate schemes in Figure 1.4 are wrong or incomplete.

Chapter 3 details the construction of morphology scheme networks over suffixes and describes a network search procedure that identifies schemes which contain in aggregate 91% of all Spanish inflectional suffixes when training over a corpus of 50,000 types. However, many of the initially selected schemes do not represent true paradigms. And of those that do represent paradigms, most capture only a portion of a complete paradigm. Hence, Chapter 4 describes algorithms to first merge candidate paradigm pieces into larger groups covering more of the affixes in a paradigm, and then filter out the poorer candidates.
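
The defining property of a scheme, that every candidate suffix concatenates onto every candidate stem to form a word attested in the input, can be illustrated with a small sketch. This is not ParaMor's implementation: the function names, the toy vocabulary, and the choice to split every word at every position are assumptions made for illustration, and the network search of Chapter 3 is not reproduced here.

```python
NULL = "Ø"  # marker for the null (empty) candidate suffix


def candidate_splits(vocabulary):
    """Map each candidate stem to the set of candidate suffixes it occurs with.

    Assumption: every split that leaves a non-empty stem is a candidate.
    """
    stem_to_suffixes = {}
    for word in vocabulary:
        for i in range(1, len(word) + 1):
            stem = word[:i]
            suffix = word[i:] or NULL
            stem_to_suffixes.setdefault(stem, set()).add(suffix)
    return stem_to_suffixes


def scheme(c_suffixes, stem_to_suffixes):
    """Return, sorted, the candidate stems that combine with every given suffix."""
    wanted = set(c_suffixes)
    return sorted(
        stem for stem, suffixes in stem_to_suffixes.items() if wanted <= suffixes
    )


# Hypothetical toy vocabulary, not drawn from the Brown Corpus.
vocabulary = {
    "reach", "reached", "reaches", "reaching",
    "walk", "walked", "walking",
    "race", "raced", "races", "racing",
}
splits = candidate_splits(vocabulary)
null_scheme = scheme([NULL, "ed", "es", "ing"], splits)  # ['reach']
e_scheme = scheme(["e", "ed", "es", "ing"], splits)      # ['rac']
small_scheme = scheme(["ed", "ing"], splits)             # ['rac', 'reach', 'walk']
```

On this toy vocabulary the suffix sets Ø.ed.es.ing and e.ed.es.ing select disjoint stems, mirroring the two opposed verbal sub-classes described above, while the smaller set ed.ing covers more stems but only part of either paradigm.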

Now with a handle on the paradigm structures of a language, ParaMor uses the induced morphological knowledge to segment word forms into likely morphemes. Recall that each scheme that ParaMor discovers is intended to model a single morpheme boundary in any particular surface form. To segment a word form, then, ParaMor simply matches discovered schemes against that word and proposes a single morpheme boundary at the match point. Examples will help clarify word segmentation. Assume ParaMor correctly identifies the English scheme Ø.ed.es.ing from Figure 1.4. When requested to segment the word reaches, ParaMor finds that the es c-suffix in the discovered scheme matches the word-final string es in reaches. Hence, ParaMor segments reaches as reach +es. Since more than one paradigm cross-product may match a particular word, a word may be segmented at more than one position. The Spanish word administradas from Section 1.1 contains three suffixes, each of which may match a separate discovered paradigm cross-product, producing the segmentation: administr +ad +a +s.
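
The matching step can be sketched as follows. This is a simplified illustration, not ParaMor's segmentation model from Chapter 5: the function name `segment`, the rule of taking the longest matching candidate suffix within each scheme, and the three Spanish suffix sets are all assumptions introduced here.

```python
NULL = "Ø"  # marker for the null (empty) candidate suffix


def segment(word, schemes):
    """Propose at most one morpheme boundary per scheme and join the pieces.

    Assumption: within a scheme, the longest non-null candidate suffix that
    matches the end of the word (and leaves a non-empty stem) fixes the boundary.
    """
    boundaries = set()
    for c_suffixes in schemes:
        matches = [
            s for s in c_suffixes
            if s != NULL and word.endswith(s) and len(s) < len(word)
        ]
        if matches:
            longest = max(matches, key=len)
            boundaries.add(len(word) - len(longest))
    cuts = sorted(boundaries)
    if not cuts:
        return word  # no scheme matched: leave the word unsegmented
    pieces = [word[:cuts[0]]]
    for start, end in zip(cuts, cuts[1:] + [len(word)]):
        pieces.append("+" + word[start:end])
    return " ".join(pieces)


english_schemes = [{NULL, "ed", "es", "ing"}]
# Hypothetical Spanish suffix sets, chosen only to illustrate multiple matches.
spanish_schemes = [
    {"adas", "ados", "ada", "ado"},
    {"as", "os", "a", "o"},
    {"s", NULL},
]

reaches = segment("reaches", english_schemes)
administradas = segment("administradas", spanish_schemes)
```

Each scheme contributes one boundary, so a word matched by three schemes at three different positions is cut into a stem followed by three suffix pieces.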

To evaluate the morphological segmentations which ParaMor produces, ParaMor competed in Morpho Challenge 2007 (Kurimo et al., 2007), a peer-operated competition that pits against one another algorithms designed to discover the morphological structure of natural languages from nothing more than raw text. Unsupervised morphology induction systems were evaluated in two ways within Morpho Challenge 2007. First, a linguistically motivated metric measured each system at the task of morpheme identification. Second, an information retrieval (IR) system was augmented with the morphological segmentations each system proposed, and the mean average precision of the relevance of the returned documents was measured. Each competing system could have Morpho Challenge 2007 evaluate its morphological segmentations over four languages: English, German, Turkish, and Finnish.
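
For readers unfamiliar with the IR measure, the following sketch computes mean average precision using the standard textbook definition. The exact evaluation setup used in Morpho Challenge 2007 is not described in this section, so the function names, document identifiers, and relevance judgments below are illustrative assumptions only.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision-at-rank over the ranks where a relevant document
    appears, divided by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    runs = list(runs)
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings and relevance judgments for two queries.
query_1 = (["d1", "d2", "d3", "d4"], {"d1", "d3"})  # AP = (1/1 + 2/3) / 2
query_2 = (["d2", "d1"], {"d1"})                    # AP = (1/2) / 1
map_score = mean_average_precision([query_1, query_2])
```

A segmentation that lets the IR system rank relevant documents earlier raises the precision at each relevant rank, and therefore raises the mean average precision.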

Of the four language tracks in Morpho Challenge 2007, ParaMor officially competed in English and German. At morpheme identification in English, ParaMor outperformed an already sophisticated baseline induction algorithm, Morfessor (Creutz, 2006). ParaMor placed fourth in English morpheme identification overall. In German, combining ParaMor’s analyses with analyses from Morfessor resulted in a set of analyses that outperformed either algorithm alone, and that placed first in morpheme identification among all algorithms submitted to Morpho Challenge 2007. The morphological segmentations produced by ParaMor at the time of the official Morpho Challenge did not perform well at the information retrieval task. However, in the months following the May 2007 Morpho Challenge submission deadline, a straightforward change to ParaMor’s segmentation algorithm significantly improved performance at the IR task. ParaMor’s current performance on the Morpho Challenge 2007 IR task is on par with the best officially submitted systems. Additionally, augmenting the IR system used in Morpho Challenge 2007 with ParaMor’s unsupervised morphological segmentations consistently, across languages, outperforms a morphologically naïve baseline system for which no morphological analysis is performed. The same improvements to ParaMor’s segmentation algorithm that improved IR performance also facilitated morphological segmentation of Turkish and Finnish. ParaMor’s current results at Finnish morpheme identification are statistically equivalent to those of the best systems which competed in Morpho Challenge 2007. And ParaMor’s current Turkish morpheme identification score is 13.5 percentage points (absolute) higher than that of the best submitted system.

Chapter 2 begins this thesis with an overview of other work on the problem of unsupervised morphology induction. Chapters 3 and 4 present ParaMor’s core paradigm discovery algorithms. Chapter 5 describes ParaMor’s word segmentation models. And Chapter 6 details ParaMor’s performance in the Morpho Challenge 2007 competition. Finally, Chapter 7 summarizes the contributions of ParaMor and outlines future directions both specifically for ParaMor and more generally for the broader field of unsupervised morphology induction.

