Collaborative Project: "Minorities at Risk" Data Base and Explaining Ethnic Violence
NSF Grant Proposal
James D. Fearon -- Stanford University
David D. Laitin -- University of Chicago
The "Minorities at Risk" (hereafter MAR) data base produced by Ted R. Gurr and associates has been widely used in the scientific community (give cites). It has great potential to serve as the evidentiary arbiter of competing theories of ethnic violence. Large-scale ethnic violence is an interesting and important topic both because of the enormous human suffering it causes and because it could be an important piece of evidence in the larger puzzle of how world politics and polities are now evolving. Furthermore, civil and especially ethnic violence is certainly much more common now than is interstate violence of the classical sort, and it tends to be more protracted than interstate wars as well. (See Licklider 1995 and Walter 1997 for evidence on the intractability of civil and ethnic conflicts.) Because of the trend of greater degrees of ethnic violence, and because of its importance for policy and for theories of world politics, the MAR data base will play an increasingly important role in the search for explanations for this violence.
Despite its wide use and great potential, however, the data base suffers from some fundamental flaws. The first purpose of this project is to work with the Gurr team to improve substantially the scientific quality of the MAR data base, such that the social science community will have a much better resource for the study of ethnic violence than is currently available. The second purpose of this project is to exploit the newly created data base in order to make cross-sectional and time-series comparisons of violent and non-violent cases of relations between ethnic groups and states, in order to answer two questions. First, what forms or types of ethnic violence have been the most lethal in the period since 1945? Second, using the MAR case list, are there any obvious features that distinguish the ethnic groups that have been involved in large-scale violence against other groups from those that have not?
I. The Promise and the Flaws of the MAR Data Base
The MAR data set developed by Gurr and his associates contains information on some 268 culturally defined minority groups in 115 countries, with 449 variables coded concerning the social, cultural, political and military situation of these groups vis-a-vis other groups and the state since the end of World War II. The data have been systematically updated to take into account the creation of new states (and new minorities) in the wake of the collapse of the Soviet Union and Yugoslavia. (Minorities at Risk Phase III Dataset: Users' Manual, August 1996, University of Maryland. See also Gurr 1993a, 1993b, 1994).
The literature on ethnic conflict/violence before the development of the MAR data base employed two types of research designs. Scholars have undertaken (1) relatively large-N, cross-sectional comparisons among cases selected because they are marked by significant ethnic conflict or violence, and (2) smaller-N studies that consider the evolution of violence in particular cases over time (or which may compare two or three of these). The failure to systematically sample cases of low conflict or violence tends to undermine the first approach's ability to generate insight into what factors differentiate high and low violence cases. And while the small-N literature has produced a wealth of insights into particular dynamics and mechanisms at work in particular cases, this approach is inherently incapable of providing a "big picture" view of the empirical contours of ethnic violence, and thus an understanding of how particular mechanisms fit within the larger picture. Ultimately, we are interested in developing theories that describe the mechanisms that, under certain conditions, give rise to large-scale ethnic violence. But rather than move immediately to the "micro-level" of specific mechanisms, we want first to establish the larger empirical context, in part so that we can make better initial guesses about what sorts of mechanisms are most common and important empirically.
The MAR data set provides a useful first step in establishing the larger empirical context, in large part because in the selection of cases there is great variance on levels of violent conflict. The data set also has a reasonably good proxy for our dependent variable. Finally, it has useful codings for many independent and control variables. Although we will have a good deal to say on case selection when we identify problems with the data, it should be noted up front that coverage is reasonably good, considering that this is a large-N data set that codes quite subjective entities. If we randomly choose countries and then look to see what groups appear in the list, most of the time we find a good correspondence between the groups included and our own sense of how people in the country code them (keeping in mind the "at risk" criteria of selection).
On the dependent variable side, we are most interested in the number of deaths due to ethnic violence per year and per capita for each minority group. These data are not available; the MAR data set contains two variables, however, that are imperfect measures of levels of group violence: REBEL (for "rebellion") and COMCON (for "communal conflict"), each of them coded for every five-year period from 1945 through 1994. Informal validity checks for REBEL give us confidence in relying upon it as a proxy for our dependent variable. We listed all cases of high REBEL scores and for all but one case, our independent reading of the sources showed very high numbers of deaths directly attributable to the ethnic conflict. Moreover, the Gurr measure "gets right" a comparison that one might get wrong if one looked only at column-inches in the U.S. press: The worst cases of ethnic violence in Western Europe, such as Northern Ireland and Basque separatism in Spain, rate only low scores on the REBEL scale, as examples of "campaigns of terrorism." We believe this is a reasonable reflection of relative fatalities, since in both these cases between 1000 and 3200 have been killed over almost 30 years, a not atypical number for a five- or even one-year period in many of the more serious "third world" cases. Finally, a partial check is afforded by using the rough estimates of fatalities in 50 of the "most serious ethnopolitical conflicts" that Gurr studies in his 1994 article. We constructed a variable, DEATHS, from the estimates for these 50 cases, assigning a value of 0 to all other cases except those that have had some experience of large-scale guerilla activity or protracted civil war since 1945, which were treated as missing data. (For most of these cases, we know that fatalities are quite high, easily above the 1,000 total threshold, such as Ethiopia, Lebanon and Chechnya).
(When Gurr's estimates give a range, we used the midpoint of the range, and we gave the same estimate to each group involved in the conflict. The results do not differ much if the deaths estimate is divided by number of groups involved or by group populations.) The bivariate correlation of the log of this variable (LNDETH) with the maximum rebellion score since 1945 (our recode, MAXREB45) is .73. (To avoid log(0), we add 1 to DEATHS before taking the log.) Even more impressively, we find that if LNDETH is used as the dependent variable in our regressions, the results are again substantially unchanged (in many cases they are even stronger). This too gives us some confidence that REBEL is a reasonable proxy for levels of ethnic violence.
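A minimal sketch of the LNDETH construction and the bivariate check, in Python. The function names and the toy (DEATHS, MAXREB45) pairs are invented for illustration only; they are not MAR data and will not reproduce the .73 figure.

```python
import math

def midpoint(low, high):
    """Midpoint of a fatality range, used when an estimate is given as a range."""
    return (low + high) / 2

def log_deaths(deaths):
    """Log of DEATHS after adding 1, so that zero-fatality cases stay defined."""
    return math.log(deaths + 1)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (DEATHS, MAXREB45) pairs, invented purely for illustration.
toy_cases = [(0, 0), (0, 1), (midpoint(1000, 3000), 3), (50000, 6), (250000, 7)]
lndeth = [log_deaths(d) for d, _ in toy_cases]
maxreb45 = [r for _, r in toy_cases]
toy_r = pearson(lndeth, maxreb45)
```

Adding 1 before taking the log keeps the many zero-fatality groups in the sample instead of dropping them as undefined.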
On the independent variable side of the equation, Gurr and associates have constructed scales for group history and status, opportunities for group political action, global processes shaping context of political action, and international factors facilitating political action. (For the flow diagram of Gurr's theoretical take on the data, see Gurr 1993, 125). We shall comment critically on some of these scales, but as will be explicated below, several of them are quite worthwhile and revealing for the general picture that we hope to draw.
As for controls, the MAR data base codes cases based on region, and some of the best work with the data finds distinctive regional patterns of ethnic conflict (e.g. Scarritt and McMillan 1995). Also the MAR data base allows for controls based upon types of politicized communal groups (ethnonationalists, indigenous peoples, ethnoclasses, militant sects, and both advantaged and disadvantaged communal contenders). This allows analysts to check theories across different types of communal group, or to control for communal group type in regression equations.
The promise of this data set is good. But there are some difficult problems, as yet unresolved, that undermine the validity of the results for all members of the scientific community who rely on these data. These problems include ones of selection of cases, inaccurate coding of existing variables, omission of important variables that speak to standard theories of ethnic conflict, and inclusion of variables that are endogenous to ethnic conflict itself.
Selection Problems
To be included, a group had to reside in a country with population greater than one million in 1990, had to have itself a population greater than 100,000 or 1 percent of country population, and had to meet at least one of the four criteria Gurr et al. used to decide if the group was "at risk." For the "at risk" criteria, Gurr et al. asked whether (1) the group suffers "discrimination" relative to other groups in the country, (2) the group is "disadvantaged from past discrimination," (3) the group is "an advantaged minority being challenged," or (4) the group is "mobilized," meaning that "the group (in whole or part) supports one or more political organizations that advocates greater group rights, privileges, or autonomy" (Manual, pp. 7 and 65).
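The inclusion rule can be read as a simple predicate. The sketch below is one hedged reading in Python: the `Group` record and its field names are hypothetical, and treating "100,000 or 1 percent" as "either threshold suffices" is our assumption about the rule, not a statement from the Manual.

```python
from dataclasses import dataclass

@dataclass
class Group:
    # Hypothetical record; field names are illustrative, not MAR variable names.
    country_pop_1990: int
    group_pop: int
    discriminated: bool = False            # criterion (1)
    disadvantaged_from_past: bool = False  # criterion (2)
    advantaged_challenged: bool = False    # criterion (3)
    mobilized: bool = False                # criterion (4)

def meets_mar_criteria(g: Group) -> bool:
    """True if the group passes the size thresholds and at least one 'at risk' test."""
    country_ok = g.country_pop_1990 > 1_000_000
    # Assumption: either the absolute or the relative threshold is enough.
    size_ok = g.group_pop > 100_000 or g.group_pop > 0.01 * g.country_pop_1990
    at_risk = any([
        g.discriminated,
        g.disadvantaged_from_past,
        g.advantaged_challenged,
        g.mobilized,
    ])
    return country_ok and size_ok and at_risk
```

Even in this form, the four boolean inputs are the subjective part: the predicate is mechanical only once a coder has already judged each "at risk" criterion.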
Obviously, then, the criteria for inclusion are subjective and may be contestable for specific cases. There is also the problem of how to decide what the "group" is in cases where group boundaries and self/other descriptions are contested or unclear. For instance, MAR codes as single groups "Hispanics" in the U.S. and "Pashtuns" in Afghanistan, when one could argue for greater disaggregation in each case; certainly "Southerners" in Chad and Sudan could be greatly disaggregated. "Russians" are coded as a minority in Ukraine, though group boundaries in this case are at best "in formation" (Laitin 1998a). The coding of a number of the African groups in the sample could be criticized on similar grounds, and our impression is that quite a few African groups are omitted altogether that arguably might satisfy the "at risk" criteria. The Africa cases present special problems. The data base does not include ethnic groups in Malawi, the Central African Republic, Gabon, Liberia, and Tanzania. It includes for Somalia the Isaaq, but excludes other clans equally at risk. But not only in Africa are there ambiguities. The lumping together of almost all minority ethnic groups in Latin American countries under the heading "indigenous peoples" may be problematic. And ignoring the Flemish in Belgium is surprising.
Interestingly, Gurr et al. never address the problem of defining what bases and indicators of groupness potentially qualify a group for inclusion in the list. The implicit criterion seems to be that group membership must be mainly reckoned by descent by people in the country. The vast majority of the groups in the data set could be referred to as "ethnic" in ordinary language, and the vast majority in fact are. But if descent is crucial, why are Indian castes omitted, and why are the Baha'is of Iran, a religious sect where being a believer is probably close to necessary and sufficient to be a member of the group, included?
One other important feature of the MAR case selection should be stressed: The most politically dominant ethnic group in a country is not included, unless (this is our best guess) the group is a "minority" in the numerical sense of having less than 50 percent of country population. Thus, "whites" are not listed in the U.S., "French" in France, "Germans" in Germany, "Malays" in Malaysia, "Russians" in Russia, "Estonians" in Estonia, and so on, because these groups comprise numerical majorities -- they are not "minorities at risk." But if a politically dominant ethnic group has less than 50 percent of country population, it may be included, even if it is the largest group in the country. Thus, Pashtuns in Afghanistan (38 percent of population) and Sunnis in Lebanon (30 percent) are included as "advantaged minorities being challenged," though each forms a "majority" in the sense of a plurality. The few other politically dominant groups in the sample are both numerical minorities and smaller in population than some other group in the country, such as Tutsis in Burundi and Rwanda, Alawi in Syria, Kalenjins in Kenya, and Ngbandi in Zaire. (Most of the 48 groups coded as "advantaged minorities being challenged" in the data set are not politically dominant. Twelve of these are "Russians" in the former Soviet Socialist Republics (counting also "Slavs" in Moldova). The list also includes a number of "trader" or "middleman" minorities, such as Chinese in several Southeast Asian countries, and a few cases of Europeans in southern African countries.) The data set also includes some numerical minorities that form a plurality but are not coded as "advantaged" (such as Kikuyu in Kenya, Oromo in Ethiopia, and Bosnian Muslims in Bosnia). Finally, to add to the confusion, five "disadvantaged" groups that form absolute majorities are included (Hutus in Burundi and Rwanda, "Highland Indigenous Peoples" in Bolivia, Shi'is in Iraq, and Taiwanese in Taiwan).
While many of these inconsistencies are the result of judgment calls (and the judgment of the MAR coding team is rather good), one issue concerning selection is of overriding importance. As identified in our first cut through these data (Fearon and Laitin, 1998), and re-emphasized by Hug (1998), if REBEL is the dependent variable, there is a built-in bias in the criteria of selection of cases. To be included in the MAR sample at all, a group must be larger than 100,000 persons in 1990, and be "at risk," which for Gurr et al. essentially means that the group is either "mobilized," subject to discrimination, or at a major economic disadvantage. Thus, the sample does not include the large number of ethnically defined groups that are small or that are not already marked by factors that might increase their odds of being engaged in violent conflict. Another way to put this is that virtually all cases in which there would be a high score for REBEL are included in the data base; yet not all cases are included where REBEL is nil, because under conditions of peace, the group may be insufficiently mobilized to catch the attention of coders. Therefore, the selection of cases induces an overprediction of rebellion.
To be sure, the MAR data show that large scale ethnic violence is relatively rare. Only 15 percent of the cases reach the highest level of REBEL at some time in the period 1945-1994. Only 35 percent reach the level of "small scale guerilla war" or greater at some time in this period. And fully 45 percent never in the whole post-war period rate above 0 on the REBEL scale. Yet even still, given the selection bias, the data overpredict ethnic violence, and this problem requires a solution.
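The direction of this bias can be illustrated with a small simulation in Python. Everything here is an assumption made for illustration: the function name, the 5 percent base rate of rebellion, and the 50 percent chance that a peaceful group is noticed by coders are invented, not estimated from MAR.

```python
import random

def simulate_selection(n_groups=10_000, p_rebel=0.05, p_include_if_quiet=0.5, seed=0):
    """Compare the rebellion rate in a full population with the rate in a selected sample."""
    rng = random.Random(seed)
    rebels = [rng.random() < p_rebel for _ in range(n_groups)]
    # Every rebelling group enters the sample; quiet groups enter only sometimes.
    included = [r or (rng.random() < p_include_if_quiet) for r in rebels]
    population_rate = sum(rebels) / n_groups
    sample = [r for r, inc in zip(rebels, included) if inc]
    sample_rate = sum(sample) / len(sample)
    return population_rate, sample_rate

population_rate, sample_rate = simulate_selection()
```

Because peaceful groups are under-sampled while rebelling groups are not, the rebellion rate in the selected sample exceeds the rate in the full population, which is the overprediction described above.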
Inaccurate Coding
The Phase III MAR data set with which we worked, despite careful coding procedures, still has substantial coding errors. The University of Maryland team has been proactive in correcting many of the errors, and continues to do so. We feel, however, that some of the important variables require major surgery. Rather than provide a catalogue of small problems, we present one important variable that we have found to be flawed.
The MAR variable "culdifx2" measures linguistic difference between the minority and the dominant group. The values of this variable go from "0" (No Difference) through "2" (Extreme Difference). Culdifx2 is an element of the index variable for cultural difference (culdifx), which plays a role in Gurr's own explanation for grievances in regard to group autonomy in the Middle East and Latin America (Gurr 1993, 80-81).
Despite the apparent ease in coding on such a variable, the concept of linguistic difference is a tricky one to nail down. Gurr's scale is not specified in either the book or in the users' manual. Yet, one might legitimately ask, how does one code the language of either of the groups? For example, what is the language of the dominant group in Kenya? At the time of writing, the President's ancestral language is Kalenjin. The Kalenjin have garnered considerable resources due to President Daniel arap Moi's power to influence the distribution of resources. Yet he might speak Swahili -- a lingua franca throughout much of East Africa -- more often than he speaks Kalenjin; but on most official matters, concerning business and high government affairs, he is more likely to speak in English.
One might consider substituting "official language" for "language of dominant group", but that opens more problems than it solves. Many countries do not legislate an official language; and many that do give official recognition to (but no de facto role for) several languages.
Coding on the linguistic differences between dominant and subordinate groups gets compounded by the problem of classifying both languages. Let us go back to our Kenya example, and think about the Luo, who are a minority group vis-a-vis the now-dominant Kalenjin. For culdifx2, should we code the linguistic difference between Luo and Kalenjin, Luo and Swahili, Luo and English, or English (which the Luo elite speaks quite well) and Kalenjin, English and Swahili, or yet again English (for the Luos) and English (for the Kalenjins)? The score for culdifx2 can vary between 0 and 2 depending on the coding rules of what the language of each group is, and the answers are not obvious.
Not only is there a problem of classifying the language of any group, but there is the second problem of assessing the differences between them. On what metric? For example, the linguistic difference between northern and southern Chinese may be greater than between two distant Romance languages, yet because of a common schema of writing, intellectuals from all regions in China can communicate rather easily with one another. This is not because their languages are similar, but because they share an ideographic system that substitutes for speaking.
Insufficient attention to these details, or in fact to any clear coding rules, led Gurr and associates into some glaring anomalies in their MAR codings. Here are a few:
Chinese in Malaysia get a 2, while they get a 0 in Indonesia; but the official languages of Malaysia and Indonesia are virtually the same.
The Hindus in Pakistan get a 0; but the Muslims in India get a 1. If the dominant language of Pakistan is Urdu, the dominant language in India is Hindi, the Hindus in Pakistan are assumed to speak Hindi, and the Muslims in India assumed to speak Urdu, this coding seems inconsistent.
In Ghana the Ashanti (Akan speakers) get a 2; while in Kenya the Kikuyu get a 0. They are both from a minority language group yet one with the highest percentage of speakers compared to all others in the country, and both were, at the time of coding, out of power. In both countries, English is the major language of power. The leaders of the country came from minority language groups (Ewe and Kalenjin). Since Ewe and Akan are both closely related Niger-Congo languages (Atlantic Congo branch; Volta-Congo sub-branch; Kwa sub-sub-branch; and Left Bank Kwa, sub-sub-sub-branch), while Kikuyu (Niger-Congo) and Kalenjin (Nilo-Saharan) are from completely different families, one might have expected the Kikuyu to get a 2 while the Ashanti a 0. But the reverse is the Gurr coding. This leads the analyst of the data to ask what these figures represent.
In South Africa, the Europeans get a 2; but the Asians a 0. This leaves one wondering what the language of the dominant group might be. If it is Xhosa (the language of the President's ancestral group), then the Asians and Europeans are equidistant. If it is English or Afrikaans, the languages of economic power, then the Europeans should be receiving the lower score.
In Nigeria, the Ibos get a 2, while the Yorubas get a 1. If the dominant language is either Hausa (an Afro-Asiatic language) or English (an Indo-European language), it is hard to see why Yoruba or Ibo (both Niger-Congo languages) is closer or further away from either of these two dominant languages.
In the United States, African-Americans get a 0 (reflecting full assimilation) while Native-Americans get a 2 (reflecting maximal difference). It seems that in the former, the criterion was the actual language practice of the group, while in the latter it was the historical language of the ancestors of the actual population.
In our communications with the MAR staff at University of Maryland, we learned that coding for this variable was done at two different sites, with different criteria for dominant language. What is very worrisome here is that culdifx2 is significant when regressed against REBEL, even with several controls. It would be dangerous to make a big deal from this finding before a better index for language difference is devised.
Omission of Important Variables
The MAR data base excludes war-horse variables such as GDP, GDP/capita, GDP growth, population growth, degree of ethnic heterogeneity in the country, colonial power, year of entry of the country into the international system of states, crucial political transitions facing the state, and basic aspects of state structure (e.g. federal vs. unitary structures). We believe that it would be a mistake to allow all users of the MAR data base to "import" data on these variables from other data bases, as this would not allow for easy replication of results. Therefore, standard theories of rebellion (including Gurr's classic of 1970) cannot be tested with the MAR data base as it now stands.
Variables that are Endogenous to Ethnic Conflict
A considerable number of the MAR variables seem to be endogenous to ethnic conflict or cannot be reliably coded independent of observation of the value on the dependent variable. The inclusion of these variables in the data set is almost an invitation for tautology to pose as explanation. Consider the set of variables on group grievances, which codes for public statements airing grievances about autonomy, several political rights (such as participation in decision-making, equal civil rights, policy demands), economic rights, and cultural rights. Grievances spur mobilization, in Gurr's flow diagram (1993, 125), which then raises the probability of communal protest and REBEL. The problem is that the articulation of grievances is more likely to be heard by outsiders if there is already rebellion. In fact, the act of rebellion involves the creation of propaganda machines designed to articulate those grievances. Thus the value on the dependent variable almost necessarily influences the coder's judgment on the independent variable.
An equally problematical explanatory variable-set is that of group identity. The MAR codesheet asks coders to make judgments concerning a group's ethno-cultural distinctness, concerning language, customs, beliefs, and race, which cumulate in an index of ethnic difference (ETHDIFXX). The problem here is that groups that are in violent conflict with the state tend, due to the conflict, to highlight their cultural differences from the dominant political groups. On a purely linguistic scale, for example, Black English is further from Standard American English than is Catalan from Spanish. Yet due to the autonomy movement in Catalonia, the cultural differences between the Catalans and Spaniards are highlighted by activists; in the U.S., attempts to highlight linguistic difference between racial groups are ridiculed by elites representing both groups. Or to take another example: the Isaaqs in Somalia today point to religious, linguistic and even racial differences with the Hawiyes, who control Mogadishu, Somalia's capital. Yet in the 1960s, the very same Isaaqs proudly pointed to their connection to a common Somali culture. The regional war led to greater saliency of what cultural differences there were. In this sense, the ETHDIFXX index is certainly endogenous to the value of REBEL.
II. The Corrective Measures We Plan to Take
Given that the MAR data base is already a valuable public good, our overriding concern is that all our proposed corrections and additions get accepted by the University of Maryland team that developed and continues to serve users of the data base. The UM team has been quite receptive to corrections from users, and has therefore continually updated the data base. As a sub-contractor for this proposal, the UM team will have final say in the discussions (to take place at the end of the grant period) concerning our proposals for changes to the data base. This goes as well for variables we wish to add to the data base, as most of these will come from standard sources, and the question will be precisely which measures of these variables (such as GDP; for the problems in this regard, see below) are best to include. We will then live with the final product -- which will be available to the entire scientific community -- in our own analysis of the data. That said, we now discuss a few issues involving corrections and additions that we expect to address during the grant period.
Dealing with Selection Bias
We propose to add new cases to the data set, though we recognize that these additions will not in principle solve the selection bias problem, although we hope they will make some of the findings more robust. In this section, we will discuss criteria of selection for new cases, and then our plan to deal with selection bias.
We have three criteria for inclusion of new cases and the attendant problem of getting values on these cases for all variables that remain in the data set (but once we eliminate from the data set variables that are clearly endogenous to the dependent variable, and other variables that are not connected to any outstanding theoretical concerns, the costs of adding new cases will be reduced). First, as we noted, the MAR data base has very few dominant groups. We propose to choose a few more, based on the group's relative size in the country. This will enable us to examine statistically the category of dominant groups. Second, we propose to add groups for countries that are not at all included in the data set. Adding these new cases should help us determine how robust earlier findings were. Third, we propose to add a set of cases that are of substantive interest to us, even though they may just miss MAR cut-offs. These cases can be screened out of general tests of hypotheses, but can be included to see how certain kinds of cases compare to the general population. For example, we propose to add a complete set of cases of African diasporas (only some of them appear in the MAR), so that we will have sufficient numbers to do statistical comparisons of African diaspora relations with all states and with other ethnic groups in states having significant African diaspora populations. One of our graduate students expects to work with these data for her doctoral dissertation. We will employ her as an RA from the grant in order to code the new set of African diaspora cases for all remaining variables.
None of these additions, we emphasize, alleviates the selection bias issue, as the very recognition by coders of group-ness is in part conditioned upon the activities of members of that group announcing themselves as a group with grievances against the political order. We therefore propose to build upon a suggestion made by Hug (1998), who incidentally cited our work as understanding the implications of the selection bias problem more so than the developers of the original data base. The proposal is to develop a regression model that would predict the inclusion of a group into the MAR data base, where one of the independent variables would be the maximum REBEL score. Once we get a good model predicting inclusion, we would then run simultaneous regression models on our dependent variable (REBEL) and on the dependent variable predicting inclusion (the selection equation). With these simultaneous equations, we would get truer coefficients of the effects of our independent variables on the probability of ethnic violence. This technique is quite innovative for comparative politics. It would mean appropriating a recently developed tool in econometrics to address a problem that has been up till now better addressed by postmoderns (that is, how to deal with group-ness as a variable, rather than something "out there" to be counted), than by more statistically minded analysts. Thus we would be addressing an unsolved problem in comparative politics (that of the emergence of groupness being explained by factors we have hypothesized to be the effects of groupness) with new econometric techniques.
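One way to write down the two-equation setup just described is sketched below. The notation is assumed for illustration and is not taken from the proposal or from Hug (1998): $y_i^{*}$ is a latent propensity for rebellion, $y_i$ the corresponding observed (maximum) REBEL score, $s_i$ an indicator for inclusion in the MAR list, $x_i$ and $z_i$ covariate vectors, and $\varepsilon_i$ and $u_i$ error terms that may be correlated.

```latex
\begin{align}
y_i^{*} &= x_i'\beta + \varepsilon_i
  && \text{(outcome: propensity for rebellion)} \\
s_i^{*} &= z_i'\gamma + \delta\, y_i + u_i
  && \text{(selection: propensity to be included)} \\
s_i &= \mathbf{1}\left[ s_i^{*} > 0 \right],
  \qquad y_i \text{ observed only if } s_i = 1
\end{align}
```

In this sketch the term $\delta\, y_i$ carries the idea that the maximum REBEL score helps predict inclusion. Estimating the two equations jointly, rather than the outcome equation alone on the selected sample, is what would be expected to correct the coefficients in $\beta$.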
Improving Existing Variables
In using the MAR data base, we have compiled a list of questionable codings. We propose, in our meetings with the Maryland MAR team, to go over these possible errors, but in a way that does not lead to selective overriding of coding norms. We will choose a random sample of values for each variable in which we found coding errors, and include our questionable cases with the random sample. We will then go to the original coding sheets to see if our case was an outlier, or part of a systematic coding rule. Based on what we find, we will decide whether to leave the coding as it was, to change the alleged errors, or to recode entirely for the variable under question.
Where we found systematic errors in the construction of a variable, we will develop new indices, and code values for all cases. We already plan to develop new coding rules for group concentration (REGCON), which, as will become clear, plays an important role in our explanation for violence but is inadequately coded in the MAR scheme. Similarly, the measure of economic differentials between minority and dominant groups is not well conceived in the MAR data base. We are now seeking a better measure. For language difference (culdifx2), we have already made some progress in developing a new indicator, and will describe our efforts here. We could not entertain the proposal by Greenberg (1956) to rely upon glottochronological techniques, as these methods have been rejected in linguistic circles and there are no cross-regional data available (see Laitin 1998b). Instead, we took the world classification of languages, produced by Ethnologue (Grimes 1996), a society of linguists interested in producing versions of the Bible in all languages of the world. Ethnologue linguists rely upon linguistic trees, classifying languages by structure, with branch points for language family (e.g. Indo-European from Afro-Asiatic), language groups, and down to sub-dialects. From this list, we code the language of the dominant group for each of the minorities in the MAR data base, and count the branch point from which the minority group's language breaks off from that of the dominant group. If the two languages are of different language families, as with Spanish and Basque, the score for language distance is 1, but if they break off on the fifth branch from one another, as did Akan from Ewe, the score is 5. The higher the number, the greater the language similarity. If the minority and majority speak the same language (e.g. Serbs in Croatia), we code the minority group with the number 20.
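The branch-counting rule can be sketched in a few lines of Python. The function name is hypothetical, and the classification paths below are abbreviated and illustrative rather than copied from Ethnologue; only the scoring rule (1 for different families, the number of the branch at which the languages split otherwise, 20 for the same language) follows the description above.

```python
SAME_LANGUAGE_SCORE = 20  # value assigned when both groups speak the same language

def language_distance_score(minority_path, dominant_path):
    """Score = number of the branch at which the two classification paths diverge.

    Paths run from language family down to the language itself.
    Different families give 1; higher scores mean more similar languages.
    """
    if minority_path == dominant_path:
        return SAME_LANGUAGE_SCORE
    shared = 0
    for a, b in zip(minority_path, dominant_path):
        if a != b:
            break
        shared += 1
    return shared + 1

# Abbreviated, illustrative paths (assumptions, not verbatim Ethnologue entries).
akan = ("Niger-Congo", "Atlantic-Congo", "Volta-Congo", "Kwa", "Nyo", "Akan")
ewe = ("Niger-Congo", "Atlantic-Congo", "Volta-Congo", "Kwa", "Left Bank", "Ewe")
spanish = ("Indo-European", "Italic", "Romance", "Spanish")
basque = ("Basque",)

akan_ewe = language_distance_score(akan, ewe)            # split at the fifth branch
basque_spanish = language_distance_score(basque, spanish)  # different families
```

Note that the score depends entirely on how finely the tree is drawn in a given family, which is one reason the classification itself needs checking.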
This measure of language distance is not without its own problems. First, we faced the same problem Gurr and associates faced, viz. that there is no accepted criterion for judging the language of the dominant group or the minority. Our criterion was to code the historic language of the country's political leadership as the dominant language (and thus the dominant language of Kenya changed when Jomo Kenyatta, a Kikuyu, died and power was transferred to Moi, a Kalenjin), and the historic language of the minority (and thus Germans in Russia are coded as German speakers even though most cannot speak German). This made sense to us as a general rule, but this decision is still rather arbitrary. Second, there are problems in Ethnologue's classification of languages, in part due to the fact that across language families, the data are not equally sensitive to dialectal differences in different regions. Since Ethnologue linguists have a greater interest in preparing Bible translations for heathens, they have been more sensitive to small differences in Papua-New Guinea than in Germany. And so, the data may overstate linguistic differences among non-Christians. A third problem, as was indicated earlier, is that structural differences are not a good proxy for communicative difficulties. While Castilian and Mexican Spanish are closely related, and equidistant from English, the interference of English-speakers in Sonora is so great as to make Spanish spoken there sound somewhat like a dialect of English. Despite these difficulties, Ethnologue data are available and give a rough and ready measure of linguistic difference. During the grant period, we will consult the specialized linguistic literature to check on the classification scheme. (In our initial use of the Ethnologue coding, the significant impact of culdifx2 on REBEL washes away.)
Recode of the Dependent Variable
As noted, we have in general been satisfied with the coding on REBEL, our dependent variable. But since the REBEL scores are five-year maxima, we are unable to construct cross-sectional pooled samples that would enable us better to measure the effects of regime change, for which we have a compelling commitment model (see Fearon, 1994), on ethnic rebellion. We propose to rely upon the data in the MAR archives to extract annual rather than 5-year scores for REBEL. We would thus be able to do comparative statics on a game-theoretic model of ethnic war.
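A small sketch may make the limitation of five-year maxima concrete. The annual scores below are invented for illustration and are not MAR data; the helper names `five_year_maxima` and `first_year_at_or_above` are likewise hypothetical.

```python
def five_year_maxima(annual, start_year, period=5):
    """Collapse {year: score} into {period_start_year: maximum score in that period}."""
    out = {}
    for year, score in annual.items():
        period_start = start_year + ((year - start_year) // period) * period
        out[period_start] = max(out.get(period_start, score), score)
    return out

def first_year_at_or_above(annual, threshold):
    """Return the first year in which the score reaches the threshold, or None."""
    years = [y for y, s in sorted(annual.items()) if s >= threshold]
    return years[0] if years else None

# Two invented groups in a country with a hypothetical regime change in 1992.
rebel_after = {1990: 0, 1991: 0, 1992: 0, 1993: 5, 1994: 5}
rebel_before = {1990: 5, 1991: 5, 1992: 0, 1993: 0, 1994: 0}

# The five-year maxima are identical, so the existing coding cannot tell
# whether rebellion preceded or followed the regime change.
print(five_year_maxima(rebel_after, 1990))   # {1990: 5}
print(five_year_maxima(rebel_before, 1990))  # {1990: 5}

# Annual scores recover the ordering.
print(first_year_at_or_above(rebel_after, 4))   # 1993
print(first_year_at_or_above(rebel_before, 4))  # 1990
```

With annual scores, the onset of rebellion can be dated relative to a regime change, which is what the proposed recoding is meant to permit.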
III. Preliminary Cross-Sectional Analysis (of the existing MAR Data Base with additional variables)
Our ambitions go beyond the provision of an improved public good. With support from this grant, we also intend to make use of the improved MAR data base to explain variations in levels of large scale ethnic violence since 1945. To give an indication of how we have analyzed the data so far, we present in this section first a general overview of our findings and second, our analysis of one variable, GROUPCON, in explaining levels of REBEL. As will be clear in our discussion, we are reluctant to publish these results until we are more confident with the accuracy of the GROUPCON coding.
In our initial analysis of the MAR data base, we asked whether there were obvious factors that reliably differentiate violent contests and separatist movements from the largely peaceful cases of groups involved in neither. We found that several variables that one might have thought would matter -- and which various theories predict should matter -- do not. These include measures of religious and "racial" difference between the minority and the dominant group; measures of the degree of economic disadvantage of the minority; the level of democracy of the state; whether ethnic brethren of the minority live close by in a neighboring state or "homeland"; and the rate of population growth of the country where the minority resides. (The result on economic differentials replicates a finding reported by Gurr (1993 IPSR, 1994) using essentially the same data, although our dependent variable and model specifications are very different from his.) The major point in each case concerns selection bias in previous research (far worse than the bias we pointed to in the MAR data base) -- the failure to systematically sample relatively peaceful cases. Selection bias led researchers to infer that, for instance, cultural and economic differences cause ethnic violence when in fact there are a great many cases where such differences exist but violence does not occur. Our results here should not be interpreted as saying that these factors are irrelevant in any specific case. Rather, they show that if they matter at all, it is through interaction with other things, which remain to be identified.
Our "first-cut" variables that seemed to differentiate the high- and low-violence cases are (a) GDP per capita, with groups in richer countries less disposed to violent separatism (In a subsample of 50 violent cases, Gurr (1994) found a bivariate correlation between 1990 GDP and an indicator of the magnitude of the violence. However, Gurr does not consider that large-scale ethnic violence surely causes low GDP -- look at Afghanistan, Bosnia, or Somalia -- so that we need to look at GDP data from before the onset of ethnic violence. We used GDP data from 1960, and consider the whole sample rather than just the relatively violent cases); (b) growth in GDP per capita, with faster growing economies in one period less likely to have ethnic groups engaged in large-scale violence in subsequent periods; (c) geographic concentration, with widely dispersed and mainly urban groups being unlikely to be involved in ethnic violence; (d) relative group size, with a weak tendency for larger groups to be more disposed to violence; (e) Sunni and Shiite Islam, with groups with these religious affiliations being more disposed to violence irrespective of the religion of the dominant group in the country and controlled for region; (f) living in mountains or hills; and (g) degree of ethnic heterogeneity of the country, with greater heterogeneity associated with less violence once we control for GDP and other factors.
Because they challenge recent international relations approaches to ethnic conflict, it is worthwhile -- at least to demonstrate how we have been going about analyzing the MAR data -- to illustrate our findings concerning group concentration. Thinking mainly about recent cases in Eastern Europe and Africa, several authors have argued that ethnic conflict is more likely to yield large-scale violence the more the populations of the ethnic groups live interspersed, as in Krajina, Bosnia, Burundi and Rwanda. Posen (1993), Van Evera (1994), and Kaufmann (1996) all maintain that groups face a more extreme "security dilemma" when they are interspersed. Fearon (1996) argues that mixed populations favor violence when "ethnic cleansing" is necessary to render effective a declaration of sovereignty or autonomy. A particularly compelling small-N comparison that supports such views is the contrast between the peaceful breakup of Czechoslovakia, which had little Czech/Slovak geographical intermixing, and the violent breakup of Yugoslavia, which had quite a bit in Krajina, Slavonia, and Bosnia. (Or, just within Yugoslavia, contrast the relatively peaceful secession of homogeneous Slovenia with the more violent cases of Croatia and Bosnia.)
The MAR data contain measures of the geographical distribution of the minorities at risk, coded using data for 1980 whenever possible. Gurr et al. provide a four-point scale, GROUPCON (actually a combination of variables, but not of consequence here), that takes the values and labels indicated in Table 1.
Table 1: Geographic Concentration and Violence

GROUPCON value | Percent of cases in the data set | GROUPCON label | MAXREB45 mean | MAXREB45 std. dev.
0 | 14.5 | Widely dispersed | 1.49 | 2.51
1 | 13.8 | Primarily urban or minority in one region | 0.57 | 1.68
2 | 21.3 | Majority in one region, others dispersed | 3.25 | 2.74
3 | 50.4 | Concentrated in one region | 3.02 | 2.84

Note: MAXREB45 is the highest REBEL score since 1945, on a scale from 0 (none reported) to 7 (protracted civil war).
GROUPCON, we find, is strongly related to large-scale ethnic violence, though not in the manner one would expect from the arguments and examples given above. Instead of dispersion and geographic intermixing in cities being associated with greater levels of ethnic violence, they are associated with less, on average. In fact, the preceding table really masks the nature and strength of the relationship. Being either "widely dispersed" or "primarily urban" (GROUPCON = 0 or 1) proves to be almost a sufficient condition for a group to have a low MAXREB45 score in these data, as only nine of the 93 cases of large scale violence have dispersed populations, and of these, only three are violent conflicts over autonomy issues.
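The contrast between low- and high-concentration groups can be read directly off Table 1 by pooling categories. The sketch below weights each category's mean MAXREB45 by its share of cases. It assumes that the reported percentages and means refer to the same set of cases, which the table does not state explicitly, so the pooled figures are approximate.

```python
# GROUPCON value -> (percent of cases, mean MAXREB45), copied from Table 1
TABLE_1 = {
    0: (14.5, 1.49),
    1: (13.8, 0.57),
    2: (21.3, 3.25),
    3: (50.4, 3.02),
}

def pooled_mean(values):
    """Share-weighted mean MAXREB45 across the given GROUPCON values."""
    total_share = sum(TABLE_1[v][0] for v in values)
    weighted_sum = sum(TABLE_1[v][0] * TABLE_1[v][1] for v in values)
    return weighted_sum / total_share

low = pooled_mean([0, 1])    # dispersed or primarily urban groups
high = pooled_mean([2, 3])   # groups with some regional concentration
print(round(low, 2), round(high, 2))  # 1.04 3.09
```

On this rough calculation the regionally concentrated groups average about three times the MAXREB45 score of the dispersed and urban groups, which is consistent with the pattern described in the text.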
On a little thought, the near sufficiency of low geographic concentration for low ethnic violence is not surprising. Minorities that are primarily urban or widely dispersed in a country are likely to be groups that have no historical attachment or claim to a distinct geographic region within the country. This makes them unlikely candidates for the most common form of large-scale ethnic violence in the post-war period, since greater "autonomy" almost always means a recognition of distinct political powers within some patch of territory. Primarily urban groups lack a contiguous territory over which to fight for greater control. And if a widely dispersed group is engaged in violent conflict at all, it is most likely to try for control of the center, as in Burundi, Jordan, Rwanda, and perhaps Lebanon.
If this argument is correct, we would also expect that dispersed and urban groups would have a greater propensity for the more societal forms of violence such as riots. They might also be more likely victims of genocides, since both urban and highly dispersed groups lack a distinct territory on which to develop and shelter guerrilla forces -- the out-of-sample Jewish and Armenian genocides certainly fit this argument. There is some support for it in the data as well. Whereas separatist and autonomy-related ethnic violence is almost four times more likely for the GROUPCON = 2 or 3 cases, "communal warfare" is almost equally likely for widely dispersed/urban ethnic groups as for the more regionally concentrated groups. And the nine cases of low-concentration groups that engaged in communal warfare include the sole genocide in the sample (Croatia/Roma), plus at least two cases of large-scale ethnic rioting (India/Muslims, Kampuchea/Vietnamese).
Where do these findings leave the arguments that hold that greater geographical intermixing should be related to higher levels of ethnic violence? One possible response would be to concede that urban and widely dispersed groups tend, on average, to fall among the least violent cases cross-sectionally, but to ask about variation within the set of groups that have some degree of regional concentration. Perhaps among the GROUPCON = 2 and 3 cases more intermixing is associated with more violence, for "security dilemma" or other reasons. Table 1 indicates nominal support for this claim, since the average MAXREB45 score for the "majority in one region, others dispersed" cases is slightly higher than for the "concentrated in one region" cases. But this difference is not statistically significant, is not robust across regions, and does not become significant in a regression that controls for the full set of variables considered elsewhere in our analysis.
A second possible response is to question the coding of population distributions. What exactly is GROUPCON measuring, particularly for values 2 and 3? When Gurr et al. say "concentrated in one region," what exactly do they mean by "concentrated" and what by "region"? Although this is unclear from the codebook, we can draw an inference from a variable called REG1P, labelled as "Group's proportion of population in the region" for groups that are "concentrated in one region." Interestingly, in about 21 out of 76 cases for which these data are provided the minority's proportion of regional population is less than one half! For example, Abkhazis comprised only 18 percent of the former Abkhaz ASSR in Georgia, but they are coded as GROUPCON = 3. Thus, "concentrated in one region" must mean "a large majority of the group lives in one region," rather than "the group is concentrated in one place with little geographic intermixing with other groups." (The question of how Gurr et al. decided what constitutes the relevant region remains unclear, and unraveling this and other such problems is one of the goals of this project.)
We can now retest the standard hypothesis by asking if groups that comprise a larger fraction of their region tend to have lower ethnic violence levels. Although there is not much data, again the answer appears to be "no." Group proportion of region (REG1P) is positively correlated with the rebellion and other ethnic violence indicators. The same is true for the cases coded as "majority in one region, others dispersed" (GROUPCON = 2).
What accounts for these relationships? In all likelihood, while the GROUPCON = 2 and 3 codings are not especially good measures of degree of geographic mixing of groups, they do pick up a nominal, historical, and/or emotional connection between a minority and a particular patch of territory. Abkhazis may be a minority within Abkhazia, but it is still "Abkhazia" and they are the Abkhazis. The higher levels of ethnic violence associated with these cases as compared to dispersed and urban minorities are probably due as much to the nominal and historical connection between a group and a piece of territory as to more material and strategic implications of geographic concentration. In particular, given the norms and practices underpinning the modern states system, the coincidence of a named region and (named) ethnic group creates a basis, and even an incentive, for claims to political autonomy or sovereignty. Beyond reach for most dispersed and urban minorities, such claims also have the potential to generate violent conflict with the state that officially controls the territory in question.
This interpretation of the results suggests a more sophisticated argument about the link between ethnic violence and geographic intermixing (Fearon 1996, Kaufmann 1996, Posen 1993, Van Evera 1994). The data clearly reject the theory that intermixing is enough by itself to dispose ethnic groups to violent conflict, and thus the argument that intermixing by itself creates a powerful "security dilemma." But it might still be the case that if we could control for the groups' desire for political autonomy over the territory in question, greater intermixing would imply a greater disposition to violence. The reason is simple: the more the members of two groups both desire political authority in the same patch of territory, the greater the potential for violent efforts to kill or clear out the "other" by force. This outcome is not inevitable, since the use of force is costly and dangerous. But under certain conditions it may occur nonetheless (see Fearon 1996 for such an argument).
We could evaluate this argument if we could somehow control for a group's attachment to a particular territory, but this is a near-impossible task. The only relatively exogenous measure in MAR data that might work here is a variable called TRADITN, which codes a group's length of residence in the country on a five-point scale. Not surprisingly, longer residence is positively correlated with higher GROUPCON values, and controlling for TRADITN reduces the coefficient on GROUPCON when MAXREB45 is regressed on these and other independent variables. But no reversal takes place -- it is not the case that once we control for length of residence, greater intermixing implies a greater disposition to violence. If such an effect exists, it can probably only be observed in more controlled, small-N comparisons.
In any event, geographic concentration as a predictor of a disposition to large-scale violence is quite impressive. The relationship is highly robust across regions -- in all regions, GROUPCON = 0 and 1 cases are much more likely to be peaceful than are GROUPCON = 2 and 3 cases. (Sub-Saharan Africa is a statistically insignificant exception, due entirely to the three cases of high violence among the widely dispersed Hutu and Tutsi of Burundi and Rwanda.) And as the multivariate analyses that we have elsewhere presented (Fearon and Laitin, 1998) indicate, the effect of geographic concentration stays just as strong when we control for other factors, like GDP.
Nonetheless, although "group concentration" appears to be almost a necessary condition for large-scale ethnic violence in the post-war period, it is far from being sufficient. More than two-thirds of the groups in the sample are coded as GROUPCON = 2 or 3, and more than half of these have never reached MAXREB45 = 4 ("small-scale guerrilla war") or greater since 1945. A significant cross-sectional puzzle remains: among the large set of minorities that have some connection to a specific region within a country, why are some engaged in large-scale separatist or autonomy-related violence while others are not? Working out a fuller story is the major intellectual goal of our project.
IV. Refinements Possible with Revised MAR Data Base
There are many questions that are fiercely debated in the literature on ethnic conflict that can be answered rather conclusively with a refined MAR data base. One such question concerns whether federal arrangements (controlling for demographic and regional factors) ameliorate or exacerbate ethnic tensions (see Horowitz 1985, pp. 601-22). One of our students has already coded all MAR countries on seven criteria of federalism, and is doing preliminary examinations as to whether federal arrangements reduce the probability of ethnic rebellion, and what particular aspects of federal arrangements (e.g. taxation rights, ability to elect regional leaders) have the greatest dampening effects on ethnic rebellion. Another question is whether all, or only particular kinds of state breakdowns or transitions, lead to a commitment problem. Another of our students has begun coding all regime changes since 1945 for MAR countries, based on type of change (military coup, independence from a colonial power, break up of a communist regime). We cannot yet regress REBEL against these measures, as we would need annual scores for REBEL to ascertain precisely the direction of the causal arrow. But we plan to do so once we annualize the REBEL scores from the MAR archive. Another task is to make sense of the "finding" that Sunni and Shiite groups are more likely to be engaged in violent action against the state than non-Muslim groups. One of our students has already examined demographic differences (e.g. the relative size of the young male population) without being able to explain away this finding. More work here clearly needs to be done. Yet another question is why the "Polity" data variables for democracy have no explanatory power in regard to REBEL, once GDP is controlled for. Our case analysis leads us to think otherwise, and we need to examine the data more carefully to figure out why the effect of democracy comes out strongly in small-N but weakly in this large-N study. Matthew Kocher, who will be the project director for this grant, plans to write his dissertation on that question, and with the improved data base, we expect to be able to address the relationship of democracy to ethnic violence more conclusively.
In sum, our desire to upgrade the value of the MAR data base is not only as a service to the profession; we have our own empirical agenda that the improved data base will allow us to fulfill. Once we illuminate the broad general patterns, and are confident that our findings are not marred by selection bias, we expect to refine the theoretical models that initially drove our work (e.g. see Fearon and Laitin, 1996), and to test those models with comparative statics that the new MAR data base will permit. Not only will we have stronger theory and better tests, but they will be built upon a data base that other scholars will have available to them, so that our work will be part of a continuing cumulative effort to understand the sources of ethnic violence in our times.
REFERENCES
Grimes, Barbara F., ed. (1996) Ethnologue: Languages of the World, 13th ed. (Dallas: Summer Institute of Linguistics)
Laitin, David (1998a) Identity in Formation: The Russian-speaking populations in the near abroad (Ithaca: Cornell University Press)
Laitin, David (1998b) "What is a Language Community" paper prepared for presentation at the Annual Meeting of the American Political Science Association, Boston, MA