The Phonetic Analysis of Speech Corpora
9.7 Classifications in higher dimensional spaces

The same mechanism can be used for carrying out Gaussian training and testing in higher-dimensional spaces, even though spaces beyond three dimensions are impossible to visualise. However, as already indicated, the danger of over-fitting grows as the number of dimensions increases: if the dimensions do not contribute independently useful information to category separation, then the hit-rate in an open test is likely to go down. Consider as an illustration of this point the result of training and testing on all four spectral moments. This is done for the same data as above but, to save repeating the same instructions, the confusion matrix for a closed test on speaker 67 is obtained in one line as follows:


temp = fr.sp == "67"

table(fr.l[temp], predict(qda(fr.m[temp,], fr.l[temp]), fr.m[temp,])$class)

    C  S  f  s  x
  C 20  0  0  0  0
  S  0 20  0  0  0
  f  0  0 19  0  1
  s  0  0  0 20  0
  x  0  0  2  0 18


Here then, there is an almost perfect separation between the categories, and the total hit rate is 97%. The corresponding open test on this four dimensional space in which, as before, training and testing are carried out on the male and female data respectively is given by the following command:
table(fr.l[!temp], predict(qda(fr.m[temp,], fr.l[temp]), fr.m[!temp,])$class)

    C  S  f  s  x
  C 17  0  0  3  0
  S  7  7  1  1  4
  f  3  0 11  0  6
  s  0  0  0 20  0
  x  0  0  1  0 19
Here the hit-rate is 74%, a more modest 5% increase over the two-dimensional space. It seems, then, that including m3 and m4 provides only a very small amount of additional information for separating these fricatives.

An inspection of the correlation coefficients between the four parameters gives some indication of why this is so. Here they are calculated across the male and female data together (see section 9.9 for a reminder of how the object fr.m was created):

round(cor(fr.m), 3)
       [,1]   [,2]   [,3]   [,4]
[1,]  1.000 -0.053 -0.978  0.416
[2,] -0.053  1.000 -0.005 -0.717
[3,] -0.978 -0.005  1.000 -0.429
[4,]  0.416 -0.717 -0.429  1.000
The values on the diagonal are always 1 because each is simply a parameter correlated with itself. The off-diagonal cells show the correlations between different parameters. So the second column of row 1 shows that the correlation between the first and second moments is almost zero at -0.053 (the result is necessarily duplicated in row 2, column 1, the correlation between the second moment and the first). In general, parameters are much more likely to contribute independently useful information to class separation if they are uncorrelated with each other. This is because correlation means predictability: if two parameters are highly correlated (positively or negatively), then one is more or less predictable from the other. In that case, the second parameter contributes hardly any new information beyond the first and so will not improve class separation.
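
To make this predictability concrete, a minimal check (a sketch, assuming the same fr.m object as above) is to regress the third moment on the first: the squared correlation, around 0.96 given the value of -0.978 reported above, is the proportion of the variance in m3 that is predictable from m1 alone.

# Proportion of variance in m3 (column 3) predictable from m1 (column 1)
summary(lm(fr.m[,3] ~ fr.m[,1]))$r.squared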

For this reason, the correlation matrix above shows that including both the first and third moments in the classification is hardly likely to increase class separation, given that the correlation between these parameters is almost complete (-0.978). This comes about for the reasons discussed in Chapter 8: as the first moment, or spectral centre of gravity, shifts up and down the spectrum, then the spectrum becomes left- or right-skewed accordingly and, since m3 is a measure of skew, m1 and m3 are likely to be highly correlated. There may be more value in including m4, however, given that the correlation between m1 and m4 is moderate at 0.416; on the other hand, m2 and m4 for this data are quite strongly negatively correlated at -0.717, so perhaps not much more is to be gained as far as class separation is concerned beyond classifications from the first two moments. Nevertheless, it is worth investigating whether classifications from m1, m2, and m4 give a better hit-rate in an open-test than from m1 and m2 alone:


# Train on male data, test on female data in a 3D-space formed from m1, m2, and m4

temp = fr.sp == "67"

res = table(fr.l[!temp], predict(qda(fr.m[temp,-3], fr.l[temp]), fr.m[!temp,-3])$class)
# Class hit-rate

diag(res)/apply(res, 1, sum)

   C    S    f    s    x
0.95 0.40 0.75 1.00 1.00


# Overall hit rate

sum(diag(res))/sum(res)

0.82
In fact, including m4 does make a difference: the open-test hit-rate is 82% compared with 69% obtained from the open-test classification with m1 and m2 alone. But the result is also interesting from the perspective of over-fitting raised earlier: notice that this open-test score from three moments is higher than the one from all four moments together, which shows not only that there is redundant information in the four-parameter space as far as class separation is concerned, but also that including this redundant information leads to an over-fit and therefore to a poorer generalisation to new data.
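
For comparison, the 69% figure can be reproduced with the same mechanism by restricting the classification to the first two columns of fr.m; a sketch, assuming the objects defined above:

# Open test on m1 and m2 alone: train on the male data, test on the female data
res2 = table(fr.l[!temp], predict(qda(fr.m[temp,1:2], fr.l[temp]), fr.m[!temp,1:2])$class)
sum(diag(res2))/sum(res2)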

The technique of principal components analysis (PCA) can sometimes be used to remove the redundancies that arise when parameters are correlated with each other. In PCA, a new set of dimensions is obtained such that they are orthogonal to, or uncorrelated with, each other and such that the lower dimensions 'explain', or account for, most of the variance in the data (see Figs. 9.9 and 9.10 for a graphical interpretation of 'explanation of the variance'). The new rotated dimensions derived from PCA are weighted linear combinations of the old ones, and the weights are known as the eigenvectors. The relationship between original dimensions, rotated dimensions, and eigenvectors can be demonstrated by applying PCA to the spectral moments data from the male speaker using prcomp(). Before applying PCA, the data should be converted to z-scores by subtracting the mean (which is achieved with the default argument center=T) and dividing by the standard deviation (the argument scale=T needs to be set). Thus, to apply PCA to the male speaker's spectral moment data:


temp = fr.sp == "67"

p = prcomp(fr.m[temp,], scale=T)


The eigenvectors or weights that are used to transform the original data are stored in p$rotation which is a 4 × 4 matrix (because there were four original dimensions to which PCA was applied). The rotated data points themselves are stored in p$x and there is the same number of dimensions (four) as those in the original data to which PCA was applied.
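
These dimensions can be checked directly; a quick sketch, assuming the object p created above:

# 4 x 4 matrix of eigenvectors (weights)
dim(p$rotation)
# Rotated data: one row per segment, four rotated dimensions
dim(p$x)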

The first new rotated dimension is obtained by multiplying the weights in column 1 of p$rotation with the original data and then summing the result (and it is in this sense that the new PCA-dimensions are weighted linear combinations of the original ones). In order to demonstrate this, we first have to carry out the same z-score transformation that was applied by PCA to the original data:


# Function to carry out z-score normalisation

zscore = function(x)(x-mean(x))/sd(x)


# z-score normalised data

xn = apply(fr.m[temp,], 2, zscore)


The value on the rotated first dimension for, say, the 5th segment is stored in p$x[5,1] and is equivalently given by a weighted sum of the original values, thus:
sum(xn[5,] * p$rotation[,1])
The multiplications and summations for the entire data are derived more simply with matrix multiplication using the %*% operator. Thus the rotated data in p$x is equivalently given by:
xn %*% p$rotation
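
The equivalence can be verified numerically; in the following sketch the largest absolute difference between the two should be zero apart from rounding error:

# Compare the matrix product with the rotated data returned by prcomp()
max(abs(xn %*% p$rotation - p$x))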
The fact that the new rotated dimensions are uncorrelated with each other is evident by applying the correlation function, as before:
round(cor(p$x), 8)
    PC1 PC2 PC3 PC4
PC1   1   0   0   0
PC2   0   1   0   0
PC3   0   0   1   0
PC4   0   0   0   1
The importance of the rotated dimensions as far as explaining the variance is concerned is given by either plot(p) or summary(p). The latter returns the following:
Importance of components:

                         PC1   PC2    PC3     PC4
Standard deviation     1.455 1.290 0.4582 0.09236
Proportion of Variance 0.529 0.416 0.0525 0.00213
Cumulative Proportion  0.529 0.945 0.9979 1.00000
The second line shows the proportion of the total variance in the data that is explained by each rotated dimension, and the third line adds this up cumulatively from lower to higher rotated dimensions. Thus 52.9% of the variance is explained by the first rotated dimension, PC1, alone and, as the last line shows, just about all of the variance (99.8%) is explained by the first three rotated dimensions. This result simply confirms what had been established earlier: that there is redundant information in the four dimensions as far as separating the points in this moments-space is concerned.
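
The same proportions can be derived by hand from p$sdev, the standard deviations of the rotated dimensions; a sketch:

# Proportion of variance: squared standard deviations (variances) of the
# rotated dimensions divided by their sum
round(p$sdev^2 / sum(p$sdev^2), 3)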

Rather than pursue this example further, we will explore a much higher dimensional space that is obtained from summed energy values in Bark bands calculated over the same fricative data. The following commands make use of some of the techniques from Chapter 8 to derive the Bark parameters. Here, a filter-bank approach is used in which energy in the spectrum is summed in widths of 1 Bark with centre frequencies extending from 2-20 Bark: that is, the first parameter contains summed energy over the frequency range 1.5 - 2.5 Bark, the next parameter over the frequency range 2.5 - 3.5 Bark and so on up to the last parameter that has energy from 19.5 - 20.5 Bark. As bark(c(1.5, 20.5), inv=T) shows, this covers the spectral range from 161-7131 Hz (and reduces it to 19 values). The conversion from Hz to Bark is straightforward:


fr.bark5 = bark(fr.dft5)
and, for a given integer value of j, the following line is at the core of deriving summed energy values in Bark bands:
fapply(fr.bark5[,(j-0.5):(j+0.5)], sum, power=T)
So when e.g., j is 5, then the above has the effect of summing the energy in the spectrum between 4.5 and 5.5 Bark. This line is put inside a for-loop in order to extract energy values in Bark bands over the spectral range of interest and the results are stored in the matrix fr.bs5:
fr.bs5 = NULL
for(j in 2:20){
	sumvals = fapply(fr.bark5[,(j-0.5):(j+0.5)], sum, power=T)
	fr.bs5 = cbind(fr.bs5, sumvals)
}
colnames(fr.bs5) = paste("Bark", 2:20)


fr.bs5 is now a matrix with the same number of rows as the original spectral matrix fr.dft5 and with 19 columns (so whereas for the spectral-moments data each fricative was represented by a point in a four-dimensional space, for these data each fricative is a point in a 19-dimensional space).
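
A quick check of this layout (a sketch, assuming fr.bs5 as created above):

# One row per fricative token; 19 columns, one per Bark band from 2 to 20
dim(fr.bs5)
colnames(fr.bs5)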

We could now carry out a Gaussian classification over this 19-dimensional space as before, although this is hardly advisable, given that it would almost certainly produce extreme over-fitting for the reasons discussed earlier. The alternative, then, is to use PCA to compress much of the non-redundant information into a smaller number of dimensions. However, in order to maintain the distinction between training and testing - that is, between the data of the male and female speakers in these examples - PCA should not be applied across the entire data in one go: the rotation matrix (eigenvectors) would then be derived from the training and testing sets together and, as a result, the two would not be strictly separated. Thus, in order to maintain the strict separation between training and testing sets, PCA will be applied to the male speaker's data; subsequently, the rotation matrix derived from this PCA will be used to rotate the female speaker's data. The latter operation can be accomplished simply with the generic function predict(), which also applies z-score normalisation to the female speaker's data by subtracting the male speaker's means and dividing by the male speaker's standard deviations (because this is how the male speaker's data was transformed before PCA was applied):


# Separate out the male and female speakers' Bark data

temp = fr.sp == "67"


# Apply PCA to the male speaker's z-normalised data

xb.pca = prcomp(fr.bs5[temp,], scale=T)


# Rotate the female speaker's data using the eigenvectors and z-score

# parameters from the male speaker

yb.pca = predict(xb.pca, fr.bs5[!temp,])
Before carrying out any classifications, the summary() function is used as before to get an overview of the extent to which the variance in the data is explained by the rotated dimensions (the results are shown here for just the first eight dimensions, and the scores are rounded to three decimal places):
sum.pca = summary(xb.pca)

round(sum.pca$im[,1:8], 3)

                         PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8
Standard deviation     3.035 2.110 1.250 1.001 0.799 0.759 0.593 0.560
Proportion of Variance 0.485 0.234 0.082 0.053 0.034 0.030 0.019 0.016
Cumulative Proportion  0.485 0.719 0.801 0.854 0.887 0.918 0.936 0.953


Thus, 48.5% of the variance is explained by the first rotated dimension, PC1, alone and just over 95% by the first eight dimensions: this result suggests that higher dimensions are going to be of little consequence as far as distinguishing between the fricatives is concerned.

We can test this by carrying out an open test Gaussian classification on any number of rotated dimensions using the same functions that were applied to spectral moments. For example, the total hit-rate (of 76%) from training and testing on the first six dimensions of the rotated space is obtained from:


n = 6
# Train on the male speaker's data
xb.qda = qda(xb.pca$x[,1:n], fr.l[temp])
# Test on the female speaker's data
yb.pred = predict(xb.qda, yb.pca[,1:n])
z = table(fr.l[!temp], yb.pred$class)
sum(diag(z))/sum(z)


Fig. 9.13 shows the total hit-rate when training and testing were carried out successively on the first 2, first 3, … up to all 19 dimensions; it was produced by putting the above lines inside a for-loop, thus:
scores = NULL
for(j in 2:19){
	xb.qda = qda(xb.pca$x[,1:j], fr.l[temp])
	yb.pred = predict(xb.qda, yb.pca[,1:j])
	z = table(fr.l[!temp], yb.pred$class)
	scores = c(scores, sum(diag(z))/sum(z))
}
plot(2:19, scores, type="b", xlab="Rotated dimensions 2-n", ylab="Total hit-rate (proportion)")
Fig. 9.13 about here
Apart from providing another very clear demonstration of the damaging effects of over-fitting, the total hit-rate, which peaks when training and testing are carried out on the first six rotated dimensions, reflects precisely what was suggested by the earlier examination of the proportion of variance: beyond this number of dimensions there is no more information that is useful for distinguishing between the data points and therefore between the fricative categories.
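
The position of this peak can be read off the scores vector directly; a sketch, assuming the loop above has been run:

# Number of rotated dimensions at which the open-test hit-rate peaks,
# and the hit-rate at that peak
(2:19)[which.max(scores)]
max(scores)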
9.8 Classifications in time

All of the above classifications so far have been static, because they have been based on information at a single point in time. For fricative noise spectra this may be appropriate under the assumption that the spectral shape of the fricative during the noise does not change appreciably in time between its onset and offset. However, a static classification from a single time slice would obviously not be appropriate for the distinction between monophthongs and diphthongs nor perhaps for differentiating the place of articulation from the burst of oral stops, given that there may be dynamic information that is important for their distinction (Kewley-Port, 1982; Lahiri et al, 1984). An example of how dynamic information can be parameterised and then classified is worked through in the next sections using a corpus of oral stops produced in initial position in German trochaic words.

It will be convenient, by way of an introduction to the next section, to relate the time-based spectral classifications of consonants with vowel formants, discussed briefly in Chapter 6 (e.g., Fig. 6.21). Consider a vowel of duration 100 ms with 20 first formant frequency values at intervals of 5 ms between the vowel onset and offset. As described in Chapter 6, the entire F1 time-varying trajectory can be reduced to just three values either by fitting a polynomial or by using the discrete-cosine-transformation (DCT): in either case, the three resulting values express the mean, the slope, and the curvature of F1 as a function of time. The same can be done to the other two time-varying formants, F2 and F3. Thus after applying the DCT to each formant number separately, a 100 ms vowel, parameterized in raw form by triplets of F1-F3 every 5 ms (60 values in total) is converted to a single point in a nine-dimensional space.

The corresponding data reduction of spectra is the same, but it needs one additional step that precedes this compression in time. Suppose an /s/ is of duration 100 ms and there is a spectral slice every 5 ms (thus 20 spectra between the onset and offset of the /s/). We can use the DCT in the manner discussed in Chapter 8 to compress each spectrum, consisting originally of perhaps 129 dB values (for a 256 point DFT) to 3 values. As a result of this operation, each spectrum is represented by three DCT or cepstral (see 8.4) coefficients which can be denoted by k0, k1, k2. It will now be helpful to think of k0, k1, k2 as analogous to F1-F3 in the case of the vowel. Under this interpretation, /s/ is parameterized by a triplet of DCT coefficients every 5 ms in the same way that the vowel in raw form is parameterized by a triplet of formants every 5 ms. Thus there is time-varying k0, time-varying k1, and time-varying k2 between the onset and offset of /s/ in the same way that there is time-varying F1, time-varying F2, and time-varying F3 between the onset and offset of the vowel. The final step involves applying the DCT to compress separately each such time-varying parameter to three values, as summarised in the preceding paragraph: thus after this second transformation, k0 as a function of time will be compressed to three values, in the same way that F1 as a function of time is reduced to three values. The same applies to time-varying k1 and to time-varying k2 which are also each compressed to three values after the separate application of the DCT. Thus a 100 ms /s/ which is initially parameterised as 129 dB values per spectrum that occur every 5 ms (i.e., 2580 values in total since for 100 ms there are 20 spectral slices) is also reduced after these transformations to a single point in a nine-dimensional space.

These issues are now further illustrated in the next section using the stops data set.

9.8.1 Parameterising dynamic spectral information

The corpus fragment for the analysis of dynamic spectral information includes a word-initial stop, /C = b, d, ɡ/ followed by a tense vowel or diphthong V = /a: au e: i: o: ø: oɪ u:/ in meaningful German words such as baten, Bauten, beten, bieten, boten, böten, Beute, Buden. The data were recorded as part of a seminar in 2003 at the IPDS, University of Kiel from 3 male and 4 female speakers (one of the female speakers only produced the words once or twice rather than 3 times which is why the total number of stops is 470 rather than the expected 504 from 3 stops x 8 vowels x 7 speakers x 3 repetitions). The sampled speech data were digitised at 16 kHz and spectral sections were calculated from a 512 point (32 ms) DFT at intervals of 2 ms. The utterances of the downloadable stops database were segmented into a closure, a burst extending from the stop-release to the periodic vowel onset, and the following vowel (between its periodic onset and offset). The objects from this database in the Emu-R library include the stop burst only (which across all speakers has a mean duration of just over 22 ms):


stops        Segment list of the stop-burst.
stops.l      A vector of stop labels for the above.
stopsvow.l   A vector of labels of the following vowel context.
stops.sp     A vector of labels for the speaker.
stops.dft    Trackdata object of spectral data between the onset and offset of the burst.
stops.bark   Trackdata object, as stops.dft but with the frequency axis converted into Bark.
stops.dct    Trackdata object of the lowest three DCT coefficients derived from stops.bark.
The procedure that will be used here for parameterizing spectral information draws upon the techniques discussed in Chapter 8. There are three main steps, outlined below, which together have the effect of compressing the entire burst spectrum, initially represented by spectral slices at 2 ms intervals, to a single point in a nine-dimensional space, as described above.


  1. Bark-scaled, DFT-spectra (stops.bark). The frequency axis is warped from the physical Hertz to the auditory Bark scale in the frequency range 200 - 7800 Hz (this range is selected both to discount information below 200 Hz that is unlikely to be useful for the acoustic distinction between stop place of articulation and to remove frequency information near the Nyquist frequency that may be unreliable).

  2. Bark-scaled DCT coefficients (stops.dct). A DCT-transformation is applied to the output of 1. in order to obtain Bark-scaled DCT (cepstral) coefficients. Only the first three coefficients are calculated, i.e. k0, k1, k2 which, as explained in Chapter 8, are proportional to the spectrum's mean, linear slope, and curvature. These three parameters are obtained for each spectral slice resulting in three trajectories between the burst onset and offset, each supported by data points at 2 ms intervals.

  3. Polynomial fitting. Following step 2, the equivalent of a 2nd order polynomial is fitted again using the DCT to each of the three trajectories thereby reducing each trajectory to just 3 values (the coefficients of the polynomial).

Steps 1-3 are now worked through in some more detail using a single stop burst token for a /d/, beginning with a perspective plot showing how its spectrum changes in time over the extent of the burst using the persp() function. Fig. 9.14(a), which shows the raw spectra in Hz, was created as follows. There are a couple of fiddly issues in this type of plot to do with arranging the display so that the burst onset is at the front and the vowel onset at the back: this requires changing the time-axis so that increasing negative values are closer to the vowel onset and reversing the row-order of the dB-spectra. The arguments theta and phi in the persp() function define the viewing direction.


# Get the spectral data between the burst onset and offset for the 2nd stop from 200-7800 Hz

d.dft = stops.dft[2,200:7800]


# Rearrange the time-axis

times = tracktimes(d.dft) - max(tracktimes(d.dft))


# These are the frequencies of the spectral slices

freqs = trackfreq(d.dft)


# These are the dB-values at those times (rows) and frequencies (columns)

dbvals = frames(d.dft)

par(mfrow=c(2,2)); par(mar=rep(.75, 4))

persp(times, freqs, dbvals[nrow(dbvals):1,], theta = 120, phi = 25, col="lightblue", expand=.75, ticktype="detailed", main="(a)",xlab="Time (ms)", ylab="Frequency (Hz)", zlab="dB")


Fig. 9.14(a) shows that the overall level of the spectrum increases from the burst onset at t = 0 ms (the front of the display) towards the vowel onset at t = - 20 ms (which is to be expected, given that the burst follows the near acoustic silence of the closure) and there are clearly some spectral peaks, although it is not easy to see where these are. However, the delineation of the peaks and troughs can be brought out much more effectively by smoothing the spectra with the discrete-cosine-transformation in the manner described in Chapter 8 (see Fig. 8.22, right panel) to obtain DCT-smoothed Hz-spectra. The corresponding spectral plot is shown in Fig. 9.14(b). This was smoothed with the first 11 DCT-coefficients, thereby retaining a fair amount of detail in the spectrum:
d.sm = fapply(d.dft, dct, 10, T)

persp(times, freqs, frames(d.sm)[nrow(dbvals):1,], theta = 120, phi = 25, col="lightblue", expand=.75, ticktype="detailed", main="(b)",xlab="Time (ms)", ylab="Frequency (Hz)", zlab="dB")


Fig. 9.14 about here
The smoothed display in Fig. 9.14(b) shows more clearly that there are approximately four peaks and an especially prominent one at around 2.5 kHz.

Steps 1 and 2, outlined earlier (9.8.1), additionally involve warping the frequency axis to the Bark scale and calculating only the first three DCT-coefficients. The commands for this are:


d.bark = bark(d.dft)

d.dct = fapply(d.bark, dct, 2)


d.dct contains the Bark-scaled DCT-coefficients (analogous to MFCC, mel-frequency cepstral coefficients in the speech technology literature). Also, there are three coefficients per time slice (which is why frames(d.dct) is a matrix of 11 rows and 3 columns) which define the shape of the spectrum at that point in time. The corresponding DCT-smoothed spectrum is calculated in the same way, but with the additional argument to the dct() function fit=T. This becomes one of the arguments appended after dct, as described in Chapter 8:
# DCT-smoothed spectra; one per time slice

d.dctf = fapply(d.bark, dct, 2, fit=T)


d.dctf is a spectral trackdata object containing spectral slices at intervals of 2 ms. Each spectral slice will, of course, be very smooth indeed (because of the small number of coefficients). In Fig. 9.14(c), these spectra smoothed with just 3 DCT coefficients are arranged in the same kind of perspective plot:
freqs = trackfreq(d.dctf)

persp(times, freqs, frames(d.dctf)[nrow(dbvals):1,], theta = 120, phi = 25, col="lightblue", expand=.75, ticktype="detailed", main="(c)",xlab="Time (ms)", ylab="Frequency (Hz)", zlab="dB")


The shape of the perspective spectral plot is now more like a billowing sheet and the reader may wonder whether we have not smoothed away all of the salient information in the /d/ spectra! However, the analogous representation of the corresponding Bark-scaled coefficients as a function of time in Fig. 9.14(d) shows that even this radically smoothed spectral representation retains a fair amount of dynamic information. (Moreover, the actual values of these trajectories may be enough to separate out the three separate places of articulation). Since d.dct (derived from the commands above) is a trackdata object, it can be plotted with the generic plot() function. The result of this is shown in Fig. 9.14(d) and given by the following command:
plot(d.dct, type="b", lty=1:3, col=rep(1, 3), lwd=2, main = "(d)", ylab="Amplitude", xlab="Time (ms)")
It should be noted at this point that, as far as classification is concerned, the bottom two panels of Fig. 9.14 contain equivalent information: that is, they are just two ways of looking at the same phenomenon. In the time-series plot, the three Bark-scaled DCT-coefficients are displayed as a function of time. In the 3D-perspective plot, each of these three numbers is expanded into its own spectrum. It looks as if the spectrum contains more information, but it does not. In the time-series plot, the three numbers at each time point are the amplitudes of cosine waves at frequencies 0, ½, and 1 cycles. In the 3D-perspective plot, the three cosine waves at these amplitudes are unwrapped over 256 points and then summed at equal frequencies. Since the shapes of these cosine waves are entirely predictable from the amplitudes (because the frequencies are known and the phase is zero in all cases), there is no more information in the 3D-perspective plot than in plotting the amplitudes of the ½ cycle cosine waves (i.e., the DCT-coefficients) as a function of time.

We have now arrived at the end of step 2 outlined earlier. Before proceeding to the next step, which is yet another compression and transformation of the data in Fig. 9.14(d), here is a brief recap of the information that is contained in the trajectories in Fig. 9.14(d).


k0 (the DCT coefficient at a frequency of k = 0 cycles) encodes the average level in the spectrum. Thus, since k0 (the top track, whose points are marked with circles in Fig. 9.14(d)) rises as a function of time, the mean dB-level of the spectrum must also be rising from one spectral slice to the next. This is evident from any of Figs. 9.14(a, b, c), which all show that the spectra have progressively increasing values on the dB-axis in progressing in time towards the vowel (compare in particular the last spectrum at time t = -20 ms with the first at the burst onset). The correspondence between k0 and the spectral mean can also be verified numerically:
# Calculate the mean dB-value per spectral slice across all frequencies

m = fapply(d.bark, mean)

# This shows a perfect correlation between the mean dB and k0

cor(frames(m), frames(d.dct[,1]))

1
k1, which is the middle track in Fig. 9.14(d), encodes the spectral tilt, i.e., the linear slope calculated with dB on the y-axis and Hz (or Bark) on the x-axis in the manner of Fig. 8.13 of Chapter 8. Fig. 9.14(d) suggests that there is not much change in the spectral slope as a function of time and also that the slope is negative (positive values on k1 denote a falling slope). The fact that the spectrum is tilted downwards with increasing frequency is evident from any of the displays in Fig. 9.14(a, b, c). The association between k1 and the linear slope can be demonstrated in the manner presented in Chapter 8 by showing that they are strongly negatively correlated with each other:
# Function to calculate the linear slope of a spectrum

slope <- function(x)
{
	lm(x ~ trackfreq(x))$coeff[2]
}

specslope = fapply(d.bark, slope)

cor(frames(specslope), frames(d.dct[,2]))

-0.991261


Finally, k2, which is the bottom track in Fig. 9.14(d), shows the spectral curvature as a function of time. If each spectral slice could be modelled entirely by a straight line, then the values on this parameter would be zero. Evidently they are not; and since these values are negative, the spectra should be broadly ∩-shaped, which is most apparent in the heavily smoothed spectrum in Fig. 9.14(c). Once again it is possible to show that k2 is related to curvature by calculating a 2nd order polynomial regression on each spectral slice, i.e. by fitting to each spectral slice a function of the form:
dB = a0 + a1f + a2f²     (4)
The coefficient a2 in (4) defines the curvature: the closer it is to zero, the less curved the trajectory. Fitting a 2nd order polynomial to each spectral slice can be accomplished in an analogous manner to obtaining the linear slope above. Once again, the correlation between k2 and curvature is very high:
# Function to apply 2nd order polynomial regression to a spectrum. Only a2 is stored.

regpoly <- function(x)
{
	lm(x ~ trackfreq(x) + I(trackfreq(x)^2))$coeff[3]
}
# Apply this function to all 11 spectral slices

speccurve = fapply(d.bark, regpoly)

# Demonstrate the correlation with k2

cor(frames(speccurve), frames(d.dct[,3]))

0.9984751
So far, then, the spectral slices as a function of time have been reduced to three tracks that encode the spectral mean, linear slope, and curvature also as a function of time. In step 3 outlined earlier, this information is compressed further still by applying the discrete-cosine-transformation once more to each track in Fig. 9.14(d). In the command below, this transformation is applied to k0:
trapply(d.dct[,1], dct, 2, simplify=T)

63.85595 -18.59999 -9.272398


So, following the earlier discussion, k0 as a function of time must have a positive linear slope (because the middle coefficient, -18.6, which defines the linear slope, is negative). It must also be curved (because the last coefficient, which defines the curvature, is not zero) and it must be ∩-shaped (because the last coefficient is negative): indeed, this is what we see in looking at the overall shape of k0 as a function of time in Fig. 9.14(d).

The same operation can be applied separately to the other two tracks in Fig. 9.14(d), so that we end up with 9 values. Thus the time-varying spectral burst of /d/ has now been reduced to a point in a 9-dimensional space (a considerable compression from the original 11 time slices × 257 DFT values = 2827 dB values). To be sure, the dimensions are now necessarily fairly abstract, but they still have an interpretation: they are the mean, linear slope, and curvature, each calculated on the spectral mean (k0), spectral tilt (k1), and spectral curvature (k2) as a function of time.

This 9-dimensional representation is now derived for all of the stops in this mini-database with the commands below and is stored in a 470 × 9 matrix (470 rows because there are 470 segments). You can leave out the first step (calculating stops.dct) because this object is already in the Emu-R library (and it can take a few minutes to compute, depending on the power of your computer):
# Calculate k0, k1, k2, the first three Bark-scaled DCT coefficients: this object is available.

# stops.dct = fapply(stops.bark, dct, 2)


# Reduce k0, k1, and k2 each to three values:

dct0coefs = trapply(stops.dct[,1], dct, 2, simplify=T)

dct1coefs = trapply(stops.dct[,2], dct, 2, simplify=T)

dct2coefs = trapply(stops.dct[,3], dct, 2, simplify=T)


# Put them into a data-frame after giving the matrix some column names.

d = cbind(dct0coefs, dct1coefs, dct2coefs)

n = c("a0", "a1", "a2")

m = c(paste("k0", n, sep="."), paste("k1", n, sep="."), paste("k2", n, sep="."))

colnames(d) = m
# Add the stop labels as a factor.

bdg = data.frame(d, phonetic=factor(stops.l))
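
As a final check of the layout, the following sketch confirms that each of the 470 bursts is now a single row of nine parameters plus its label:

# 470 rows; nine numeric columns plus the factor phonetic
dim(bdg)
table(bdg$phonetic)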


We are now ready to classify.
