9.3 Calculating conditional probabilities
Following this brief summary of the theoretical normal distribution, we can now return to the question of how to work out the conditional probability p(F1=380|ɪ), which is the probability that a value of F1 = 380 Hz could have come from a distribution of /ɪ/ vowels. The procedure is to take a reasonably large sample of F1 values for /ɪ/ vowels and then to assume that these follow a normal distribution. The assumption here is that the sample of F1 of /ɪ/ deviates from the normal distribution only because not enough samples have been obtained (analogously to the coin flipping experiment, in which the normal distribution was the theoretical distribution obtained by summing the number of Heads over an infinite number of trials, the normal distribution here is the theoretical distribution of an infinite number of F1 samples for /ɪ/). It should be pointed out right away that this assumption of normality could well be wrong. However, firstly the normal distribution is fairly robust, so it may nevertheless be an appropriate probability model even if the sample deviates from normality; and secondly, as outlined in some detail in Johnson (2008) and summarised again below, there are diagnostic tests that can be applied to check the assumption of normality.
As the discussion in 9.2.1 showed, only two parameters are needed to characterise any normal distribution uniquely, and these are μ and σ, the population mean and population standard deviation respectively. In contrast to the coin flipping experiment, these population parameters are unknown for the F1 sample of vowels. However, it can be shown that the best estimates of these are given by m and s, the mean and standard deviation of the sample, which can be calculated with the mean() and sd() functions61. In Fig. 9.4, these are used to fit a normal distribution to F1 of /ɪ/ for data extracted from the temporal midpoint of the male speaker's vowels in the vowlax dataset in the Emu-R library.
temp = vowlax.spkr == "67" & vowlax.l == "I"
f1 = vowlax.fdat.5[temp,1]
m = mean(f1); s = sd(f1)
hist(f1, freq=F, xlab="F1 (Hz)", main="", col="slategray")
curve(dnorm(x, m, s), 150, 550, add=T, lwd=2)
Fig. 9.4 about here
The data in Fig. 9.4 at least look as if they follow a normal distribution and if need be, a test for normality can be carried out with the Shapiro test:
shapiro.test(f1)
Shapiro-Wilk normality test
data: f1
W = 0.9834, p-value = 0.3441
If the test shows that the probability value is greater than some significance threshold, say 0.05, then there is no evidence to suggest that these data are not normally distributed. Another way of testing for normality, described more fully in Johnson (2008), is with a quantile-quantile plot:
qqnorm(f1); qqline(f1)
If the values fall more or less on the straight line, then there is no evidence that the distribution does not follow a normal distribution.
Once a normal distribution has been fitted to the data, the conditional probability can be calculated using the dnorm() function given earlier (see Fig. 9.3). Thus p(F1=380|ɪ), the probability that a value of 380 Hz could have come from this distribution of /ɪ/ vowels, is given by:
conditional = dnorm(380, mean(f1), sd(f1))
conditional
0.006993015
which is the same probability given by the height of the normal curve in Fig. 9.4 at F1 = 380 Hz.
9.4 Calculating posterior probabilities
Suppose you are given a vowel whose F1 you measure to be 500 Hz, but you are not told what the vowel label is except that it is one of /ɪ, ɛ, a/. The task is to find the most likely label, given the evidence that F1 = 500 Hz. In order to do this, three posterior probabilities, one for each of the three vowel categories, have to be calculated, and the unknown is then given the label of whichever category has the greatest posterior probability. As discussed in 9.1, the posterior probability requires calculating the prior and conditional probabilities for each vowel category. Recall also from 9.1 that the prior probabilities can be based on the proportions of each class in the training sample. The proportions in this example can be derived by tabulating the vowels as follows:
temp = vowlax.spkr == "67" & vowlax.l != "O"
f1 = vowlax.fdat.5[temp,1]
f1.l = vowlax.l[temp]
table(f1.l)
E I a
41 85 63
Each of these can be thought of as vowel tokens in a bag: if a token is pulled out of the bag at random, then the prior probability that the token’s label is /a/ is 63 divided by the total number of tokens (i.e. divided by 41+85+63 = 189). Thus the prior probabilities for these three vowel categories are given by:
prior = prop.table(table(f1.l))
prior
E I a
0.2169312 0.4497354 0.3333333
So there is a greater prior probability of retrieving /ɪ/ simply because of its greater proportion compared with the other two vowels. The conditional probabilities have to be calculated separately for each vowel class, given the evidence that F1 = 500 Hz. As discussed in the preceding section, these can be obtained with the dnorm() function. In the instructions below, a for-loop is used to obtain each of the three conditional probabilities, one for each category:
cond = NULL
for(j in names(prior)){
temp = f1.l==j
mu = mean(f1[temp]); sig = sd(f1[temp])
y = dnorm(500, mu, sig)
cond = c(cond, y)
}
names(cond) = names(prior)
cond
E I a
0.0063654039 0.0004115088 0.0009872096
The posterior probability that an unknown vowel could be e.g., /a/ given the evidence that its F1 has been measured to be 500 Hz can now be calculated with the formula given in (2) in 9.1. By substituting the values into (2), this posterior probability, denoted by p(a |F1 = 500), and with the meaning "the probability that the vowel could be /a/, given the evidence that F1 is 500 Hz", is given by:
(3)  p(a|F1 = 500) = p(F1 = 500|a) p(a) / ( p(F1 = 500|ɛ) p(ɛ) + p(F1 = 500|ɪ) p(ɪ) + p(F1 = 500|a) p(a) )
The denominator in (3) looks fearsome but closer inspection shows that it is nothing more than the sum of the conditional probabilities multiplied by the prior probabilities for each of the three vowel classes. The denominator is therefore sum(cond * prior). The
numerator in (3) is the conditional probability for /a/ multiplied by the prior probability for /a/. In fact, the posterior probabilities for all categories, p(ɛ|F1 = 500), p(ɪ|F1 = 500), and p(a|F1 = 500) can be calculated in R in one step as follows:
post = (cond * prior)/sum(cond * prior)
post
E I a
0.72868529 0.09766258 0.17365213
As explained in 9.1, these sum to 1 (as sum(post) confirms). Thus, the unknown vowel with F1 = 500 Hz is categorised as /ɛ/ because, as the above calculation shows, this is the vowel class with the highest posterior probability, given the evidence that F1 = 500 Hz.
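As a quick check (a minimal sketch, assuming the objects post, cond and prior from the commands above are still in the workspace), the sum of the posterior probabilities and the winning category can be inspected directly:
# The three posterior probabilities sum to 1
sum(post)
# The category with the greatest posterior probability
names(post)[which.max(post)]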
All of the above calculations of posterior probabilities can be accomplished with qda() and the associated predict() functions in the MASS library for carrying out a quadratic discriminant analysis (you may need to enter library(MASS) to access these functions). Quadratic discriminant analysis models the probability of each class as a normal distribution and then categorises unknown tokens based on the greatest posterior probabilities (Srivastava et al, 2007): in other words, much the same as the procedure carried out above.
The first step in using this function involves training (see 9.1 for the distinction between training and testing) in which normal distributions are fitted to each of the three vowel classes separately and in which they are also adjusted for the prior probabilities. The second step is the testing stage in which posterior probabilities are calculated (in this case, given that an unknown token has F1 = 500 Hz).
The qda() function expects a matrix as its first argument, but f1 is a vector: so in order to make these two things compatible, the cbind() function is used to turn the vectors into single-column matrices at both the training and testing stages. The training stage, in which the prior probabilities and class means are calculated, is carried out as follows:
f1.qda = qda(cbind(f1), f1.l)
The prior probabilities obtained in the training stage are:
f1.qda$prior
E I a
0.2169312 0.4497354 0.3333333
which are the same as those calculated earlier. The calculation of the posterior probabilities, given the evidence that F1 = 500 Hz, forms part of the testing stage. The predict() function is used for this purpose, in which the first argument is the model calculated in the training stage and the second argument is the value to be classified:
pred500 = predict(f1.qda, cbind(500))
The posterior probabilities are given by:
pred500$post
E I a
0.7286853 0.09766258 0.1736521
which are also the same as those obtained earlier. The most probable category, E, is given by:
pred500$class
E
Levels: E I a
This type of single-parameter classification (single parameter because there is just one parameter, F1) results in n-1 decision points for n categories (thus 2 points, given that there are three vowel categories in this example): at some F1 value, the classification changes from /a/ to /ɛ/ and at another from /ɛ/ to /ɪ/. In fact, these decision points are completely predictable from the points at which the products of the prior and conditional probabilities for the classes cross (the denominator can be disregarded in this case because, as (2) and (3) show, it is the same for all three vowel categories). For example, a plot of the product of the prior and the conditional probabilities over a range from 250 Hz to 800 Hz for /ɛ/ is given by:
Fig. 9.5 about here
temp = vowlax.spkr == "67" & vowlax.l != "O"
f1 = vowlax.fdat.5[temp,1]
f1.l = vowlax.l[temp]
f1.qda = qda(cbind(f1), f1.l)
temp = f1.l=="E"; mu = mean(f1[temp]); sig = sd(f1[temp])
curve(dnorm(x, mu, sig)* f1.qda$prior[1], 250, 800)
Essentially the above two lines are used inside a for-loop in order to superimpose the three distributions of the prior multiplied by the conditional probabilities, one per vowel category, on each other (Fig. 9.5):
xlim = c(250,800); ylim = c(0, 0.0035); k = 1; cols = c("grey","black","lightblue")
for(j in c("a", "E", "I")){
temp = f1.l==j
mu = mean(f1[temp]); sig = sd(f1[temp])
curve(dnorm(x, mu, sig)* f1.qda$prior[j],xlim=xlim, ylim=ylim, col=cols[k], xlab=" ", ylab="", lwd=2, axes=F)
par(new=T)
k = k+1
}
axis(side=1); axis(side=2); title(xlab="F1 (Hz)", ylab="Probability density")
par(new=F)
From Fig. 9.5, it can be seen that the F1 value at which the probability distributions for /ɪ/ and /ɛ/ cross is at around 460 Hz, while for /ɛ/ and /a/ it is about 100 Hz higher.
Thus any F1 value less than (approximately) 460 Hz should be classified as /ɪ/; any value between 460 and 567 Hz as /ɛ/; and any value greater than 567 Hz as /a/. A classification of values at 5 Hz intervals between 445 Hz and 575 Hz confirms this:
# Generate a sequence of values at 5 Hz intervals between 445 and 575 Hz
vec = seq(445, 575, by = 5)
# Classify these using the same model established earlier
vec.pred = predict(f1.qda, cbind(vec))
# This is done to show how each of these values was classified by the model
names(vec) = vec.pred$class
vec
I I I E E E E E E E E E E E E E E E E E E
445 450 455 460 465 470 475 480 485 490 495 500 505 510 515 520 525 530 535 540 545
E E E E a a
550 555 560 565 570 575
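The decision points themselves can also be located numerically rather than read off the plot. The sketch below is not part of the original analysis: the helper function bound() and its search intervals are illustrative, and the objects f1, f1.l and prior from the earlier commands are assumed to be still available. The idea is that the boundary between two categories is the F1 value at which the products of their prior and conditional probabilities are equal, which can be found with uniroot():
bound = function(lab1, lab2, interval){
# Difference between the two products of prior and conditional probability
d = function(x){
temp1 = f1.l == lab1; temp2 = f1.l == lab2
prior[lab1] * dnorm(x, mean(f1[temp1]), sd(f1[temp1])) - prior[lab2] * dnorm(x, mean(f1[temp2]), sd(f1[temp2]))
}
# The decision point is where this difference is zero
uniroot(d, interval)$root
}
# Decision point between /I/ and /E/ (should be close to 460 Hz)
bound("I", "E", c(400, 500))
# Decision point between /E/ and /a/ (should be close to 567 Hz)
bound("E", "a", c(500, 650))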
9.5 Two parameters: the bivariate normal distribution and ellipses
So far, classification has been based on a single parameter, F1. However, the mechanisms for extending this type of classification to two (or more) dimensions are already in place. Essentially, exactly the same formula for obtaining posterior probabilities is used, but in this case the conditional probabilities are based on probability densities derived from the bivariate (two parameters) or multivariate (multiple parameters) normal distribution. In this section, a few details are given of the relationship between the bivariate normal distribution and the ellipse plots that have been used at various stages in this book; in the next section, examples are given of classifications based on two or more parameters.
Fig. 9.6 about here
In the one-parameter classification of the previous section, it was shown how the population mean and standard deviation could be estimated from the mean and standard deviation of the sample for each category, assuming a sufficiently large sample size and that there was no evidence to show that the data did not follow a normal distribution. For the two-parameter case, there are five population parameters to be estimated from the sample: these are the two population means (one for each parameter), the two population standard deviations, and the population correlation coefficient between the parameters. A graphical interpretation of fitting a bivariate normal distribution for some F1 × F2 data for [æ] is shown in Fig. 9.6. On the left is the sample of data points and on the right is a two dimensional histogram showing the count in separate F1 × F2 bins arranged over a two-dimensional grid. A bivariate normal distribution that has been fitted to these data is shown in Fig. 9.7. The probability density of any point in the F1 × F2 plane is given by the height of the bivariate normal distribution above the two-dimensional plane: this is analogous to the height of the bell-shaped normal distribution for the one-dimensional case. The highest probability (the apex of the bell) is at the point defined by the mean of F1 and by the mean of F2: this point is sometimes known as the centroid.
Fig. 9.7 about here
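A minimal numerical sketch of this fitting procedure is given below. It is not taken from the original text: f1vals and f2vals are hypothetical numeric vectors standing for the F1 and F2 samples of a single vowel category, and bvnorm() is an illustrative helper that implements the standard bivariate normal density formula from the five estimated parameters:
# The five parameters estimated from the sample
m1 = mean(f1vals); m2 = mean(f2vals) # the two means (the centroid)
s1 = sd(f1vals); s2 = sd(f2vals) # the two standard deviations
r = cor(f1vals, f2vals) # the correlation coefficient
# Density of the fitted bivariate normal at a point (x, y) in the F1 x F2 plane
bvnorm = function(x, y){
z = ((x-m1)/s1)^2 - 2*r*((x-m1)/s1)*((y-m2)/s2) + ((y-m2)/s2)^2
exp(-z/(2*(1-r^2))) / (2*pi*s1*s2*sqrt(1-r^2))
}
# e.g. the height of the distribution above its own centroid (the apex of the bell)
bvnorm(m1, m2)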
The relationship between a bivariate normal and the two-dimensional scatter can also be interpreted in terms of an ellipse. An ellipse is any horizontal slice cut from the bivariate normal distribution, in which the cut is made at right angles to the probability axis. The lower down on the probability axis that the cut is made – that is, the closer the cut is made to the base of the F1 × F2 plane, the greater the area of the ellipse and the more points of the scatter that are included within the ellipse's outer boundary or circumference. If the cut is made at the very top of the bivariate normal distribution, the ellipse is so small that it includes only the centroid and a few points around it. If on the other hand the cut is made very close to the F1 × F2 base on which the probability values are built, then the ellipse may include almost the entire scatter.
The size of the ellipse is usually measured in ellipse standard deviations from the mean. There is a direct analogy here to the single parameter case. Recall from Fig. 9.3 that the number of standard deviations can be used to calculate the probability that a token falls within a particular range of the mean. So too with ellipse standard deviations. When an ellipse is drawn with a certain number of standard deviations, then there is an associated probability that a token will fall within its circumference. The ellipse in Fig. 9.8 is of F2 × F1 data of [æ] plotted at two standard deviations from the mean and this corresponds to a cumulative probability of 0.865: this is also the probability of any vowel falling inside the ellipse (and so the probability of it falling beyond the ellipse is 1 – 0.865 = 0.135). Moreover, if [æ] is normally, or nearly normally, distributed on F1 × F2 then, for a sufficiently large sample size, approximately 0.865 of the sample should fall inside the ellipse. In this case, the sample size was 140, so roughly 140 × 0.865 ≈ 121 should be within the ellipse, and 19 tokens should be beyond the ellipse's circumference (in fact, there are 20 [æ] tokens outside the ellipse's circumference in Fig. 9.8)62.
Fig. 9.8 about here
Whereas in the one-dimensional case, the association between standard-deviations and cumulative probability was given by qnorm(), for the bivariate case this relationship is determined by the square root of the quantiles from the χ2 distribution with two degrees of freedom. In R, this is given by qchisq(p, df) where the two arguments are the cumulative probability and the degrees of freedom respectively. Thus just under 2.45 ellipse standard deviations correspond to a cumulative probability of 0.95, as the following shows63:
sqrt(qchisq(0.95, 2))
2.447747
The function pchisq() provides the same information but in the other direction. Thus the cumulative probability associated with 2.447747 ellipse standard deviations from the mean is given by:
pchisq(2.447747^2, 2)
0.95
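The same function verifies the figure of 0.865 quoted above for the two standard deviation ellipse in Fig. 9.8, as well as the expected number of the 140 [æ] tokens falling inside it:
# Cumulative probability associated with 2 ellipse standard deviations
pchisq(2^2, 2)
0.8646647
# Expected number of the 140 tokens inside a 2 standard deviation ellipse
140 * pchisq(2^2, 2)
121.0531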
An ellipse is a flattened circle and it has two diameters, a major axis and a minor axis (Fig. 9.8). The point at which the major and minor axes intersect is the distribution's centroid. One definition of the major axis is that it is the longest radius that can be drawn between the centroid and the ellipse circumference. The minor axis is the shortest radius and it is always at right angles to the major axis. Another definition that will be important in the analysis of the data reduction technique in section 9.7 is that the major ellipse axis is the first principal component of the data.
Fig. 9.9 about here
The first principal component is a straight line that passes through the centroid of the scatter and that explains most of the variance of the data. A graphical way to think about what this means is to draw any line through the scatter such that it passes through the scatter's centroid. This chosen line and all the data points are then rotated about the centroid in such a way that the line ends up parallel to the x-axis (parallel to the F2-axis for these data). If the variance of F2 is measured before and after rotation, then the variance will not be the same: it might be smaller or larger after the data have been rotated in this way. The first principal component, which is also the major axis of the ellipse, can now be defined as the line passing through the centroid that produces the greatest amount of variance on the x-axis variable (on F2 for these data) after this type of rotation has taken place.
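The equivalence between the major axis of the ellipse and the first principal component can also be illustrated numerically: the direction of the major axis is the first eigenvector of the covariance matrix of the two parameters. The following is a minimal sketch using simulated correlated data; the mvrnorm() function from the MASS package and the covariance values are illustrative only and not part of the original text:
library(MASS)
# Simulate 100 correlated data points from an arbitrary covariance matrix
sigma = matrix(c(4, 1.5, 1.5, 1), 2, 2)
dat = mvrnorm(100, mu = c(0, 0), Sigma = sigma)
# The first eigenvector of the covariance matrix gives the direction of the
# major ellipse axis, which is also the first principal component
eigen(var(dat))$vectors[,1]
# The same direction (possibly differing in sign) given by prcomp()
prcomp(dat)$rotation[,1]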
If the major axis of the ellipse is exactly parallel to the x-axis (as in the right panel of Fig. 9.9), then there is an exact equivalence between the ellipse standard deviation and the standard deviations of the separate parameters. Fig. 9.10 shows the rotated two standard deviation ellipse from the right panel of Fig. 9.9 aligned with a two standard deviation normal curve of the same F2 data after rotation. The mean of F2 is 1599 Hz and the standard deviation of F2 is 92 Hz. The major axis of the ellipse extends, therefore, along the F2 parameter at ± 2 standard deviations from the mean i.e., between 1599 + (2 × 92) = 1783 Hz and 1599 - (2 × 92) = 1415 Hz. Similarly, the minor axis extends 2 standard deviations in either direction from the mean of the rotated F1 data. However, this relationship only applies as long as the major axis is parallel to the x-axis. At other inclinations of the ellipse, the lengths of the major and minor axes depend on a complex interaction between the correlation coefficient, r, and the standard deviation of the two parameters.
Fig. 9.10 about here
In this special case in which the major axis of the ellipse is parallel to the x-axis, the two parameters are completely uncorrelated or have zero correlation. As discussed in the next section on classification, the less two parameters are correlated with each other, the more information they can potentially contribute to the separation between phonetic classes. If on the other hand two parameters are highly correlated, then it means that one parameter can to a very large extent be predicted from the other: they therefore tend to provide less useful information for distinguishing between phonetic classes than uncorrelated ones.
9.6 Classification in two dimensions
The task in this section will be to carry out a probabilistic classification at the temporal midpoint of the five German fricatives [f, s, ʃ, ç, x] from a two-parameter model of the first two spectral moments (which were shown to be reasonably effective in distinguishing between fricatives in the preceding Chapter). The spectral data extend from 0 Hz to 8000 Hz and were calculated at 5 ms intervals with a frequency resolution of 31.25 Hz. The following objects are available in the Emu-R library:
fr.dft Spectral trackdata object containing spectral data between
the segment onset and offset at 5 ms intervals
fr.l A vector of phonetic labels
fr.sp A vector of speaker labels
The analysis will be undertaken using the first two spectral moments calculated at the fricatives' temporal midpoints over the range 200 Hz - 7800 Hz. The data on these parameters are displayed as ellipses in Fig. 9.11 for each of the five fricative categories and separately for the two speakers. The commands to create the plots in Fig. 9.11 are as follows:
# Extract the spectral data at the temporal midpoint
fr.dft5 = dcut(fr.dft, .5, prop=T)
# Calculate their spectral moments
fr.m = fapply(fr.dft5[,200:7800], moments, minval=T)
# Take the square root of the 2nd spectral moment so that the values are within sensible ranges
fr.m[,2] = sqrt(fr.m[,2])
# Give the matrix some column names
colnames(fr.m) = paste("m", 1:4, sep="")
par(mfrow=c(1,2))
xlab = "1st spectral moment (Hz)"; ylab="2nd spectral moment (Hz)"
# Logical vector to identify the male speaker
temp = fr.sp == "67"
eplot(fr.m[temp,1:2], fr.l[temp], dopoints=T, ylab=ylab, xlab=xlab)
eplot(fr.m[!temp,1:2], fr.l[!temp], dopoints=T, xlab=xlab)
Fig. 9.11 about here
As discussed in section 9.5, the ellipses are horizontal slices each taken from a bivariate normal distribution and the ellipse standard-deviations have been set to the default such that each includes at least 95% of the data points. Also, as discussed earlier, the extent to which the parameters are likely to provide independently useful information is influenced by how correlated they are. For the male speaker, cor.test(fr.m[temp,1], fr.m[temp,2]) shows both that the correlation is low (r = 0.13) and not significant; for the female speaker, it is significant although still quite low (r = -0.29). Thus the 2nd spectral moment may well provide information beyond the first that might be useful for fricative classification and, as Fig. 9.11 shows, it separates [x, s, f] from the other two fricatives reasonably well in both speakers, whereas the first moment provides a fairly good separation within [x, s, f].
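The correlation values just cited can be obtained with commands of the following kind (a brief sketch, re-using the logical vector temp defined in the code above to separate the two speakers):
# Correlation between the first two spectral moments: male speaker
cor.test(fr.m[temp,1], fr.m[temp,2])
# Female speaker
cor.test(fr.m[!temp,1], fr.m[!temp,2])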
The observations can now be quantified probabilistically using the qda() function in exactly the same way for training and testing as in the one-dimensional case:
# Train on the first two spectral moments, speaker 67
temp = fr.sp == "67"
x.qda = qda(fr.m[temp,1:2], fr.l[temp])
Classification is accomplished by calculating whichever of the five posterior probabilities is the highest, using the formula in (2) in an analogous way to one-dimensional classification discussed in section 9.4. Consider then a point [m1, m2] in this two-dimensional moment space with coordinates [4500, 2000]. Its position in relation to the ellipses in the left panel of Fig. 9.11 suggests that it is likely to be classified as /s/ and this is indeed the case:
unknown = c(4500, 2000)
result = predict(x.qda, unknown)
# Posterior probabilities
round(result$post, 5)
C S f s x
0.0013 0.0468 0 0.9519 0
# Classification label:
result$class
s
Levels: C S f s x
In the one-dimensional case, it was shown how the classes were separated by a single point that marked the decision boundary between classes (Fig. 9.5). For two-dimensional classification, the division between classes is not a point but one of a family of quadratics (and hyperquadratics for higher dimensions) that can take on forms such as planes, ellipses, and parabolas of various kinds (see Duda et al, 2001, Chapter 2 for further details)64. This becomes apparent in classifying a large number of points over an entire two-dimensional surface, which can be done with the classplot() function in the Emu-R library: its arguments are the model on which the data was trained (maximally two-dimensional) and the range over which the points are to be classified. The result of such a classification for a dense region of points in a plane with similar ranges to those of the ellipses in Fig. 9.11 is shown in the left panel of Fig. 9.12, while the right panel shows somewhat wider ranges. The first of these was created as follows:
classplot(x.qda, xlim=c(3000, 5000), ylim=c(1900, 2300), xlab="First moment (Hz)", ylab="Second moment (Hz)")
text(4065, 2060, "C", col="white"); text(3820, 1937, "S");
text(4650, 2115, "s"); text(4060, 2215, "f"); text(3380, 2160, "x")
Fig. 9.12 about here
It is evident from the left panel of Fig. 9.12 that points around the edge of the region are classified (clockwise from top left) as [x, f, s, ʃ], with the region for the palatal fricative [ç] (the white space) squeezed in the middle. The right panel of Fig. 9.12, which was produced in exactly the same way except with ylim = c(1500, 3500), shows that these classifications do not necessarily give contiguous regions, especially for regions far away from the class centroids: as the right panel of Fig. 9.12 shows, [ç] is split into two by [ʃ], while the territories for [s] are also non-contiguous and divided by [x]. The reason for this is to a large extent predictable from the orientation of the ellipses. Since, as the left panel of Fig. 9.11 shows, the ellipse for [ç] has a near vertical orientation, points below it are probabilistically quite close to it. At the same time, there is an intermediate region at around m2 = 1900 Hz at which the points are nevertheless probabilistically closer to [ʃ], not just because they are nearer to the [ʃ]-centroid, but also because the orientation of the [ʃ] ellipse in the left panel of Fig. 9.11 is much closer to horizontal. One of the important conclusions that emerges from Figs. 9.11 and 9.12 is that it is not just the distance to the centroids that matters for classification (as it would be in a classification based on whichever Euclidean distance to the centroids was the least), but also the size and orientation of the ellipses, and therefore of the probability distributions, that are established in the training stage of Gaussian classification.
As described earlier, a closed test involves training and testing on the same data; for this two-dimensional spectral moment space, a confusion matrix and 'hit-rate' for the male speaker's data are produced as follows:
# Train on the first two spectral moments, speaker 67
temp = fr.sp == "67"
x.qda = qda(fr.m[temp,1:2], fr.l[temp])
# Classify on the same data
x.pred = predict(x.qda)
# Equivalent to the above
x.pred = predict(x.qda, fr.m[temp,1:2])
# Confusion matrix
x.mat = table(fr.l[temp], x.pred$class)
x.mat
C S f s x
C 17 3 0 0 0
S 3 17 0 0 0
f 1 0 16 1 2
s 0 0 0 20 0
x 0 0 1 0 19
The confusion matrix could then be sorted on place of articulation as follows:
m = match(c("f", "s", "S", "C", "x"), colnames(x.mat))
x.mat[m,m]
x.l f s S C x
f 16 1 0 1 2
s 0 20 0 0 0
S 0 0 17 3 0
C 0 0 3 17 0
x 1 0 0 0 19
The correct classifications are in the diagonals and the misclassifications in the other cells. Thus 16 [f] tokens were correctly classified as /f/, one was misclassified as /s/, and so on. The hit-rate per class is obtained by dividing the scores in the diagonal by the total number of tokens in the same row:
diag(x.mat)/apply(x.mat, 1, sum)
C S f s x
0.85 0.85 0.80 1.00 0.95
The total hit rate across all categories is the sum of the scores in the diagonal divided by the total scores in the matrix:
sum(diag(x.mat))/sum(x.mat)
0.89
The above results show, then, that based on a Gaussian classification in the plane of the first two spectral moments at the temporal midpoint of fricatives, there is an 89% correct separation of the fricatives for the data shown in the left panel of Fig. 9.11 and, compatibly with that same Figure, the greatest confusion is between [ç] and [ʃ].
The score of 89% is encouragingly high and it is a completely accurate reflection of the way in which the data in the left panel of Fig. 9.11 are distinguished after a bivariate normal distribution has been fitted to each class. At the same time, the scores obtained from a closed test of this kind can be, and often are, very misleading because of the risk of over-fitting the training model. When over-fitting occurs, which is more likely when training and testing are carried out in increasingly higher dimensional spaces, then the classification scores and separation may well be nearly perfect, but only for the data on which the model was trained. For example, rather than fitting the data with ellipses and bivariate normal distributions, we could imagine an algorithm which might draw wiggly lines around each of the classes in the left panel of Fig. 9.11 and thereby achieve a considerably higher separation of perhaps nearer 99%. However, this type of classification would in all likelihood be entirely specific to this data, so that if we tried to separate the fricatives in the right panel of Fig. 9.11 using the same wiggly lines established in the training stage, then classification would almost certainly be much less accurate than from the Gaussian model considered so far: that is, over-fitting means that the classification model does not generalise beyond the data that it was trained on.
An open test, in which the training is carried out on the male data and classification on the female data in this example, can be obtained in an analogous way to the closed test considered earlier (the open test could be extended by subsequently training on the female data and testing on the male data and summing the scores across both classifications):
# Train on male data, test on female data.
y.pred = predict(x.qda, fr.m[!temp,1:2])
# Confusion matrix.
y.mat = table(fr.l[!temp], y.pred$class)
y.mat
C S f s x
C 15 5 0 0 0
S 12 2 3 3 0
f 4 0 13 0 3
s 0 0 1 19 0
x 0 0 0 0 20
# Hit-rate per class:
diag(y.mat)/apply(y.mat, 1, sum)
C S f s x
0.75 0.10 0.65 0.95 1.00
# Total hit-rate:
sum(diag(y.mat))/sum(y.mat)
0.69
The total correct classification has now fallen by 20% compared with the closed test to 69% and the above confusion matrix reflects more accurately what we see in Fig. 9.11: that the confusion between [ʃ] and [ç] in this two-dimensional spectral moment space is really quite extensive.
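As noted earlier, the open test could be extended by also training on the female speaker's data and testing on the male speaker's data, and then summing the scores over both classifications. A minimal sketch of this extension is given below (the object names y.qda, z.pred, z.mat and both are illustrative; fr.m, fr.l, temp and y.mat from the earlier commands are assumed):
# Train on the female speaker, test on the male speaker
y.qda = qda(fr.m[!temp,1:2], fr.l[!temp])
z.pred = predict(y.qda, fr.m[temp,1:2])
z.mat = table(fr.l[temp], z.pred$class)
# Confusion counts and total hit-rate summed over both open tests
both = y.mat + z.mat
sum(diag(both))/sum(both)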