Towards Automated Language Classification: A Clustering Approach
Armin Buch, David Erschler, Gerhard Jäger, and Andrei Lupas



Date: 05.05.2018
Preparing data

Lists of 40 meanings are compiled for all languages of the sample. If the word list for a particular language contains more items, the additional items are excluded from further consideration. (However, even these shorter 40-word lists sometimes contain gaps.)
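The truncation step can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `restrict_to_core`, the dictionary layout (meaning mapped to word form), and the three-item meaning list in the usage note are all assumptions made for the example; the actual list has 40 meanings, which this section does not enumerate.

```python
def restrict_to_core(word_list, core_meanings):
    """Keep only the entries of `word_list` whose meaning is in `core_meanings`.

    `word_list` maps a meaning to a word form (hypothetical layout).
    Missing or empty entries are treated as gaps and left out, so the
    result may hold fewer items than `core_meanings`.
    """
    return {m: word_list[m] for m in core_meanings if word_list.get(m)}
```

For instance, with the toy meaning list `["I", "you", "fish"]`, the word list `{"I": "ik", "you": "du", "fish": "", "tree": "bom"}` is reduced to `{"I": "ik", "you": "du"}`: "tree" lies outside the meaning list, and "fish" is a gap.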

Next, the sound inventory is reduced. All vowels are treated as a single class, and all consonants are collapsed into four classes: bilabials (b, p, f, v); nasals (m, n); velar and uvular fricatives (x, ʁ, etc.); and one further class containing all remaining consonants.
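A minimal Python sketch of this reduction is given below. Several details are assumptions made for illustration only: the one-letter class labels (`V`, `B`, `N`, `X`, `C`), the vowel inventory, and the decision to drop non-letter characters. The fricative set contains only the two symbols named in the text, since the full membership of that class is not listed here.

```python
# Class membership as listed in the text; labels are hypothetical.
VOWELS = set("aeiou")                # assumption: illustrative vowel inventory
BILABIALS = set("bpfv")              # b, p, f, v
NASALS = set("mn")                   # m, n
VELAR_UVULAR_FRICATIVES = set("xʁ")  # x, ʁ; the text's list is open-ended


def collapse(word):
    """Map each letter of `word` to its sound-class label."""
    out = []
    for ch in word.lower():
        if ch in VOWELS:
            out.append("V")
        elif ch in BILABIALS:
            out.append("B")
        elif ch in NASALS:
            out.append("N")
        elif ch in VELAR_UVULAR_FRICATIVES:
            out.append("X")
        elif ch.isalpha():
            out.append("C")          # all remaining consonants
        # non-letter characters (spaces, diacritic marks) are dropped
    return "".join(out)
```

Under these assumptions, `collapse("pater")` gives `"BVCVC"` and `collapse("mama")` gives `"NVNV"`. After the reduction, every word is a string over five symbols, so distinct words often become identical.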


1. Computation of similarity

For each pair of languages, L’ and L’’, only the meanings present in both lists are kept. Let M denote the number of remaining meanings. For each pair of remaining words, one taken from each of the two lists, the Levenshtein distance is computed, regardless of whether the two words express the same meaning. The similarity is then defined in the following manner:
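Before turning to the similarity itself, the distance step described above can be sketched in Python. The sketch covers only the selection of shared meanings and the M × M table of Levenshtein distances; it does not implement the similarity measure. The function names and the dictionary layout (meaning mapped to word form, with gaps already removed) are assumptions made for the example.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pairwise_distances(list1, list2):
    """Return the shared meanings and the M x M distance matrix.

    Entry [i][j] is the distance between the word for shared meaning i
    in `list1` and the word for shared meaning j in `list2`, so the
    diagonal holds the same-meaning pairs.
    """
    shared = sorted(set(list1) & set(list2))
    matrix = [[levenshtein(list1[m1], list2[m2]) for m2 in shared]
              for m1 in shared]
    return shared, matrix
```

For two toy lists `{"I": "ik", "you": "du", "fish": "fis"}` and `{"I": "ih", "you": "du", "tree": "bom"}`, the shared meanings are "I" and "you" (M = 2), and the matrix is `[[1, 2], [2, 0]]`. Computing the off-diagonal entries as well gives a baseline for how similar unrelated words of the two languages are.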
