Towards Automated Language Classification: A Clustering Approach
Armin Buch, David Erschler, Gerhard Jäger, and Andrei Lupas



The lower the value of p_i, the higher the chance that the similarity between v_i and w_i is non-accidental. Assuming that the similarities among different pairs of potential cognates are independent, we take the product of the p_i over all of the 40 meanings for which we have data. Let P denote this product.

Now, we define the similarity S_{L'L''} between L' and L'' as -log(P); since 0 < P ≤ 1, the minus sign makes the value positive. The values S_{L'L''} serve as the input for CLANS.
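To make the computation concrete, the following is a minimal sketch in Python of how such a pairwise score might be computed; the function name and the toy p-values are illustrative and not taken from the paper. Because P is a product of many small p_i, the score is accumulated as a sum of -log(p_i) rather than by forming the product directly, which avoids numerical underflow.

    import math

    def pairwise_similarity(p_values):
        # p_values: the p_i for each of the (up to 40) meanings attested
        # in both languages; each p_i is the probability that the observed
        # word-pair similarity is accidental.
        # Returns S = -log(P), computed as a sum of logs for stability.
        return -sum(math.log(p) for p in p_values)

    # Toy example with hypothetical p-values: small p_i yield a large
    # score, while near-chance p_i contribute almost nothing.
    print(pairwise_similarity([0.01, 0.05, 0.5]))  # about 8.3
    print(pairwise_similarity([0.9, 0.8, 0.95]))   # about 0.4

A matrix of such pairwise S_{L'L''} values over all language pairs is then what would be supplied to CLANS as input.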





Figure 5. Indo-European language cluster with respect to the Word Similarity measure.


The method we use might look suspiciously similar to Greenberg's (1987) "mass comparison", which has been justly criticized by many authors; for a detailed discussion and references see, for example, Campbell and Poser (2008). The crucial difference between our approach and Greenberg's mass comparison is that, unlike in Greenberg's work, the similarity between words is established by an algorithm rather than by a human. This makes the results considerably more reproducible (as long as the same initial dataset is used).


