Towards Automated Language Classification: a clustering Approach Armin Buch, David Erschler, Gerhard Jäger, and Andrei Lupas

Date	05.05.2018

Figure 3. Geography of the language sample.

In this way, features that carry much information about the genetic affiliation of languages receive a high weight (and vice versa). This decision was motivated by the hope of extracting a deep genetic signal from the WALS data.
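The weighting idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that a feature's weight is the mutual information between its values and the family labels of the languages, and that two languages are compared by a weighted share of mismatching features. The exact weighting scheme is not restated in this section, so the function names, the toy data, and the choice of mutual information are all assumptions and not necessarily the authors' procedure.

```python
from collections import Counter
from math import log2


def mutual_information(values, families):
    """Mutual information (in bits) between feature values and family labels.

    Languages with a missing value (None) are skipped. Using this quantity
    as a feature weight is an assumption made for illustration only.
    """
    pairs = [(v, f) for v, f in zip(values, families) if v is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    value_counts = Counter(v for v, _ in pairs)
    family_counts = Counter(f for _, f in pairs)
    mi = 0.0
    for (v, f), c in joint.items():
        mi += (c / n) * log2(c * n / (value_counts[v] * family_counts[f]))
    return mi


def weighted_distance(a, b, weights):
    """Weighted share of mismatching features between two languages.

    Features missing (None) in either language are ignored.
    """
    total = 0.0
    mismatch = 0.0
    for x, y, w in zip(a, b, weights):
        if x is None or y is None:
            continue
        total += w
        if x != y:
            mismatch += w
    return mismatch / total if total else 0.0


# Hypothetical toy data: four languages from two families, two features.
families = ["A", "A", "B", "B"]
word_order = ["SVO", "SVO", "SOV", "SOV"]  # aligns perfectly with family
other = ["x", "y", "x", "y"]               # independent of family

weights = [mutual_information(word_order, families),
           mutual_information(other, families)]
```

In this toy setting the word-order feature receives all of the weight and the family-independent feature receives none, so the distance between two languages is driven entirely by the feature that tracks family membership.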

The resulting cluster map (see Fig. 2) shows a circular structure. There are two large clusters of languages on opposite sides of the circle (shown in gray and black), and a third, smaller cluster (shown in white) in between. The remaining languages are spread along the circle between these three regions without forming distinct groups.

The map in Fig. 3 shows the geographic distribution of the respective languages (colors on the map match those in Fig. 2).²

A manual inspection of this outcome reveals that the cluster map captures a strong typological signal and a somewhat weaker areal signal, but no usable information about genetic affiliations. The cluster shown in gray contains languages with head-initial basic word order (SVO or VSO), small phoneme inventories, and a lack of case marking. The black cluster, on the other hand, is characterized by head-final word order, nominative-accusative alignment for both pronouns and full NPs, a large number of cases (mostly more than 6), and predominant dependent marking. Figure 2 shows that these groupings are neither genetically nor areally motivated.

This agrees well with the findings of Greenhill et al. (2011) and Donohue et al. (2011): the distribution of morphosyntactic features does not reflect genetic relationships between languages sufficiently well.
