The Phonetic Analysis of Speech Corpora



Fig. 3.17. The utterance bananas showing intensity contours calculated with a window size and shift of 25 ms and 5 ms (above) and 10 ms and 2 ms (below) respectively.
Fig. 3.18. A spectrogram and synchronized zero-crossing rate for the bananas utterance showing word segmentations.
Fig. 3.19. The argument utterance from the aetobi database showing a spectrogram with overlaid formants and a synchronized f0-contour.
Fig. 3.20. The procedure for calculating signal files, in this case fundamental frequency data from a segment list. This tkassp window is accessed from Signal Processing → Speech Signal Analysis after starting Emu. Import the segment list using the Input … button in the top left and choose Select Segment/Event, then select path/seg.txt where path is the directory where seg.txt is stored, then OK to samples. This should bring up the list of segments shown in the left part of the figure. Choose Use f0ana, inspect the f0ana pane if you wish, and finally click on Perform Analysis. If the input was the plain text file seg.txt, then the output will be seg-f0.txt containing the fundamental frequency data in the same directory (by default) as seg.txt.
Fig. 3.21. The Track (above) and View (below) panes of the aetobi database edited to include the new intensity parameter.
Fig. 4.1. The Emu Query Tool accessed from Database Operations followed by Query Database (above) and the segment list (below) that results from the query.
Fig. 4.2. The Legal Labels pane of the Graphical Template Editor showing how the annotations at the Phonetic tier are grouped into different classes (features).
Fig. 4.3. A fragment of the utterance thorsten from the downloadable gt database showing signal (above) and hierarchy (below) views. The signal view has been set to display only the f0 contour (Display → Tracks → F0) and in the range 100-300 Hz by the use of the slider buttons to the left of the f0 pane. The hierarchy window is accessible from the signal window using either the Show Hierarchy button or Display → Detach hierarchy to display both windows simultaneously.
Fig. 4.4 The association between I, Word, and Tone tiers for the same utterance as in Fig. 4.3. This display was produced with Display → Hierarchy Levels and then by deselecting the other tiers.
Fig. 4.5. An example of an annotation of the word person showing many-to-many relationships between both Phonetic and Phoneme, and between Phoneme and Syllable tiers. In addition, the tiers are in a hierarchical relationship in which Syllable immediately dominates Phoneme which immediately dominates Phonetic.
Fig. 4.6 The path for the gt database (far left) showing the inter-tier relationships. (S) and (E) denote segment and event tiers respectively, all other tiers are timeless. A downward arrow denotes a non-linear, one-to-many relationship. The figure also shows the information in the Levels (top left), Labels (top right), and Labfiles (below) panes of the template file for encoding this path.
Fig. 4.7. The hierarchy window of the utterance dort in the gt database before (above) and after (below) annotation.
Fig. 4.8. The annotations for dort showing the annotation numbers (grey) coded in the corresponding hlb file. The annotation numbers can be displayed with Display ➝ Toggle Segment numbers (from the main Emu menu). For simplicity, the Break tier has been removed from the display (Display ➝ Hierarchy levels).
Fig. 4.9. The Praat TextGrid for the utterance dort corresponding to the structured annotation in Emu. This TextGrid was created with the labConvert window shown above the Praat display (accessible from Arrange Tools → Convert Labels → Emu2Praat) by entering the template file (gt) and input file (dort) and specifying an output directory.
Fig. 4.10. A fragment of the utterance K67MR095 from the kielread corpus showing apparent segment deletion in the words und, schreiben, and lernen.
Fig. 4.11. The graphical user interface to the Emu query language. Assuming you have loaded the utterances of the gt database, click on graphical query in the query tool window (Fig 4.1) to open the spreadsheet shown above (initially you will only see the pane on the left). A query of all H% (intonational phrase) labels that dominate more than 5 words is entered at tier I, with the label H% and with num > 5 Word. A tick mark is placed next to Word to make a segment list of annotations at this tier; also click on Word itself to get the # sign, which means that a segment list should be made of words only in this position (rather than across this and the next word). Enter End i at the Word tier to search for intermediate-phrase-final words. At the Tone tier, enter L*+H to search for these annotations. Then click the >> arrow at the top of the spreadsheet beneath Position to enter the search criteria in an analogous manner for the following segment, i.e., L- at tier i and 0 at the Break tier. Finally select the Query button when you are ready and the system will calculate the search instruction (shown next to Querystring).
Fig. 4.12 The utterance anna1 from the aetobi database including the f0-contour as calculated in the exercises of chapter 3.
Fig. 4.13 The modified aetobi template file. Word is made a parent of Tone (left) and path/aetobi.txt is the file containing the Tcl code that is entered in the Variables pane (right) under AutoBuild. Also make sure that Word and Tone are selected in HierarchyViewLevels of the template file's view pane.
Fig. 4.14. The output of the LinkFromTimes function showing links between annotations at the Word and Tone tiers. This display was obtained by clicking on Build Hierarchy.
Fig. 4.15. The Emu AutoBuild tool that is accessible from Database Operations → AutoBuildExtern and which will automatically load any Tcl (AutoBuild) script that is specified in the template file. Follow through the instructions to apply the script to all utterances in the database.
Fig. 4.16. The path for representing parent-child tiers in aetobi.
Fig. 4.17. The modified Levels pane of the aetobi template file to encode the path relationships in Fig. 4.16. The simplest way to do this is in text mode by clicking on txt (inside the ellipse) which will bring up the plain text file of the template. Then replace the existing level statements with those shown on the right of this figure. Once this is done, click anywhere inside the GTemplate Editor and answer yes to the prompt: Do you want to update GTed? to create the corresponding tiers shown here in the Levels pane. Finally, save the template file in the usual way. (There is no need to save the plain text file).
Fig. 4.18. The result of running tobi2hier.txt on the utterance anyway. If the Intonational and Intermediate tiers are not visible when you open this utterance, then select them from Display → Hierarchy levels.
Fig. 4.19. The structural relationships between Word, Morpheme, Phoneme and, in a separate plane, between Word, Syllable and Phoneme for German kindisch (childish).
Fig. 4.20. The two paths for the hierarchical structures in Fig. 4.19 which can be summarized by the inter-tier relationships on the right. Since only Phoneme is a segment tier and all other tiers are timeless, where → denotes dominates, Word → Morpheme → Phoneme and Word → Syllable → Phoneme.
Fig. 4.21. The path relationships for the ae database. All tiers are timeless except Phonetic and Tone.
Fig. 4.22. Structural representations for the utterance msajc010 in the ae database for the tiers from two separate paths that can be selected from Display → Hierarchy Levels.
Fig. 4.23 The utterance schoen from the gt database with an L- phrase tone marked at a point of time in the second syllable of zusammen.
Fig. 4.24. Prosodic structure for [kita] in Japanese. ω, F, σ, μ are respectively prosodic word, foot, syllable, and mora. From Harrington, Fletcher & Beckman (2000).
Fig. 4.25. The coding in Emu of the annotations in Fig. 4.24.
Fig. 4.26. The tree in Fig. 4.24 as an equivalent three-dimensional annotation structure (left) and the derivation from this of the corresponding Emu path structure (right). The path structure gives expression to the fact that the tier relationships Word-Foot-Syll-Phon, Word-Syll-Phon and Word-Syll-Mora-Phon are in three separate planes.
Fig. 5.1 The tongue tip (TT), tongue mid (TM), and tongue-back (TB) sensors glued with dental cement to the surface of the tongue.
Fig. 5.2. The Carstens Medizinelektronik EMA 'cube' used for recording speech movement data.
Fig. 5.3. The sagittal, coronal, and transverse body planes. From http://training.seer.cancer.gov/module_anatomy/unit1_3_terminology2_planes.html
Fig. 5.4 The position of the sensors in the sagittal plane for the upper lip (UL), lower lip (LL), jaw (J), tongue tip (TT), tongue mid (TM), and tongue back (TB). The reference sensors are not shown.
Fig. 5.5. A section from the first utterance of the downloadable ema5 database showing acoustic phonetic (Segment), tongue tip (TT), and tongue body (TB) labeling tiers, a spectrogram, vertical tongue-tip movement, and vertical tongue body movement. Also shown are mid-sagittal (bottom left) and transverse (bottom right) views of the positions of the upper lip (UL), lower lip (LL), jaw (J), tongue tip (TT), tongue mid (TM) and tongue body (TB) at the time of the left vertical cursor (offset time of tongue body raising i.e., at the time of the highest point of the tongue body for /k/).
Fig. 5.6 The annotation structure for the ema5 database in which word-initial /k/ at the Segment tier is linked to a sequence of raise lower annotations at both the TT and TB tiers, and in which raise at the TT tier is linked to raise at the TB tier, and lower at the TT tier to lower at the TB tier.
Fig. 5.7 Relationship between various key functions in Emu-R and their output.
Fig. 5.8. Boxplots showing the median (thick horizontal line), interquartile range (extent of the rectangle) and range (the extent of the whiskers) for VOT in /kl/ (left) and /kn/ (right).
Fig. 5.9. Positions of the tongue body (dashed) and tongue tip (solid) between the onset of tongue-dorsum raising and the offset of tongue tip lowering in a /kl/ token.
Fig. 5.10. Vertical position of the tongue tip in /kn/ and /kl/ clusters synchronized at the point of maximum tongue body raising in /k/ (t = 0 ms) and extending between the tongue tip raising and lowering movements for /n/ (solid) or /l/ (dashed).
Fig. 5.11. Tongue body (solid) and tongue tip (dashed) trajectories averaged separately for /kn/ (black) and /kl/ (gray) after synchronization at t = 0 ms, the time of maximum tongue-body raising for /k/.
Fig. 5.12 A sinusoid (open circles) and the rate of change of the sinusoid (diamonds) obtained by central differencing. The dashed vertical lines are the times at which the rate of change is zero. The dotted vertical line is the time at which the rate of change is maximum.
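Central differencing, as used for Fig. 5.12, estimates the rate of change at each sample from its two neighbours. The book carries out such calculations in Emu-R; the following is a minimal sketch in Python instead, with `central_diff` a hypothetical name (endpoint handling by one-sided differences is my assumption, not necessarily the book's):

```python
import numpy as np

def central_diff(x, dt=1.0):
    """Approximate the derivative of a sampled signal.

    Interior points use the central difference (x[i+1] - x[i-1]) / (2*dt);
    the two endpoints fall back to one-sided differences.
    """
    x = np.asarray(x, dtype=float)
    d = np.empty_like(x)
    d[1:-1] = (x[2:] - x[:-2]) / (2 * dt)
    d[0] = (x[1] - x[0]) / dt          # forward difference at the start
    d[-1] = (x[-1] - x[-2]) / dt       # backward difference at the end
    return d
```

Applied to a sampled sinusoid, the zero crossings of `central_diff` give the times of the dashed vertical lines in the figure (the extrema of the original signal).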
Fig. 5.13. Tongue-body position (left) and velocity (right) as a function of time over an interval of tongue body raising and lowering in the /k/ of /kn/ (solid) and of /kl/ (dashed, gray) clusters.
Fig. 5.14. The same data as in the right panel of Fig. 5.13, but additionally synchronized at the time of the peak-velocity maximum in the tongue-back raising gesture of individual segments (left) and averaged after synchronization by category (right).
Fig. 5.15. The position (left) and velocity (right) of a mass in a critically damped mass-spring system with parameters in equation (3) x(0) = 1, ω = 0.05, v(0) = 0, xtarg = 0.
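A critically damped mass-spring system has a closed-form position trajectory. The sketch below codes the standard textbook solution in Python; it is an assumption that this matches the book's equation (3) term for term (the function name `critically_damped` is mine):

```python
import numpy as np

def critically_damped(t, x0=1.0, v0=0.0, omega=0.05, xtarg=0.0):
    """Position at time t of a critically damped mass-spring system.

    Standard solution: x(t) = xtarg + (A + B*t) * exp(-omega*t),
    with A = x0 - xtarg and B = v0 + omega*A, so that x(0) = x0 and
    x'(0) = v0, and x(t) -> xtarg as t grows.
    """
    A = x0 - xtarg
    B = v0 + omega * A
    return xtarg + (A + B * t) * np.exp(-omega * t)
```

With the parameters of Fig. 5.15 (x0 = 1, v0 = 0, ω = 0.05, xtarg = 0), the mass glides monotonically from 1 towards the target 0 without overshoot, which is what makes this system attractive as a model of articulatory gestures.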
Fig. 5.16. Tongue-body raising and lowering in producing /k/. a: magnitude of the raising gesture. b: Duration of the raising gesture. c: Time to peak velocity in the raising gesture. d: magnitude of the lowering gesture. e: Duration of the lowering gesture. f: Time to peak velocity in the lowering gesture.
Fig. 5.17. Position (row 1) and velocity (row 2) as a function of time in varying the parameters x0 and ω of equation (2). In column 1, x0 was varied in 20 equal steps between 0.75 and 1.5 with constant ω = 0.05. In column 2, ω was varied in 20 equal steps between 0.025 and 0.075 while keeping x0 constant at 1. The peak velocity is marked by a point on each velocity trajectory.
Fig. 5.18. Data from the raising gesture of /kl/ and /kn/. Left: boxplot of the duration between the movement onset and the time to peak velocity. Right: the magnitude of the raising gesture as a function of its peak velocity.
Fig. 5.19. Jaw position as a function of the first formant frequency at the time of the lowest jaw position in two diphthongs showing the corresponding word label at the points.
Fig. 5.20. Vertical (left) and horizontal (right) position of the tongue mid sensor over the interval of the acoustic closure of [p] synchronized at the time of the lip-aperture minimum in Kneipe (black) and Kneipier (dashed, gray).
Fig. 5.21. Boxplot of F2 in [aɪ] and [aʊ] at the time of the lowest position of the jaw in these diphthongs.
Fig. 5.22. 95% confidence ellipses for two diphthongs in the plane of the horizontal position of the tongue-mid sensor and lip-aperture with both parameters extracted at the time of the lowest vertical jaw position in the diphthongs. Lower values on the x-axis correspond to positions nearer to the lips. The lip-aperture is defined as the difference in position between the upper and lower lip sensors.
Fig. 5.23. Jaw height trajectories over the interval between the maximum point of tongue tip raising in /n/ and the minimum jaw aperture in /p/ for Kneipe (solid) and Kneipier (gray, dashed).
Fig. 5.24. Tangential velocity of jaw movement between the time of maximum tongue tip raising in /n/ and the lip-aperture minimum in /p/ averaged separately in Kneipe (solid) and Kneipier (dashed, gray).
Fig. 6.1: 95% ellipse contours for F1 × F2 data extracted from the temporal midpoint of four German lax vowels produced by one speaker.
Fig. 6.2. Left. The [ɪ] ellipse from Fig. 6.1 with the left-context labels superimposed on the data points. Right: the same data partitioned into two clusters using kmeans-clustering.
Fig. 6.3. An illustration of the steps in kmeans clustering for 10 points in two dimensions. X1 and Y1 in the left panel are the initial guesses of the means of the two classes. x and y in the middle panel are the same points classified based on whichever Euclidean distance to X1 and Y1 is least. In the middle panel, X2 and Y2 are the means (centroids) of the points classified as x and y. In the right panel, x and y are derived by re-classifying the points based on the shortest Euclidean distance to X2 and Y2. The means of these two classes in the right panel are X3 and Y3.
Fig. 6.4: F2 trajectories for [ʔɪ] (gray) and for [fɪ] and [vɪ] together (black, dashed) synchronised at the temporal onset.
Fig. 6.5: 95% confidence ellipses for [ɪ] vowels of speaker 67 at segment onset in the F1 × F2 plane before (left) and after (right) removing the outlier at F2 = 0 Hz.
Fig. 6.6: Spectrogram (0 - 3000 Hz) from the utterance K67MR096 showing an [ɪ] vowel with superimposed F1-F3 tracks; the miscalculated F2 values are redrawn on the right. The redrawn values can be saved – which has the effect of overwriting the signal file data containing the formant track for that utterance permanently. See Chapter 3 for further details.
Fig. 6.7. Left: F1 and F2 of [a] for a male speaker of standard German synchronised at time t = 0, the time of the F1 maximum. Right: F1 and F2 from segment onset to offset for an [a]. The vertical line marks the time within the middle 50% of the vowel at which F1 reaches a maximum.
Fig. 6.8: The revised labfiles pane in the template file of the kielread database to include a new Target tier.
Fig. 6.9: German lax monophthongs produced by a male (left) and a female (right) speaker in the F2 × F1 plane. Data extracted at the temporal midpoint of the vowel.
Fig. 6.10: Mean F2 × F1 values for the male (solid) and female (dashed) data from Fig. 6.9.
Fig. 6.11: Lobanov-normalised F1 and F2 of the data in Fig. 6.9. The axes are numbers of standard deviations from the mean.
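Lobanov normalisation is per-speaker z-scoring: each formant is expressed as the number of standard deviations from that speaker's mean, which removes much of the male/female difference seen in Fig. 6.10. A sketch in Python (the book works in Emu-R; `lobanov` is my own function name):

```python
import numpy as np

def lobanov(formants):
    """Lobanov-normalise one speaker's formant data.

    formants: array of shape (n_tokens, n_formants), e.g. columns F1, F2.
    Each column is centred on its mean and scaled by its standard
    deviation, so the result is in standard-deviation units.
    """
    f = np.asarray(formants, dtype=float)
    return (f - f.mean(axis=0)) / f.std(axis=0)
```

The normalised data of the male and female speakers can then be plotted in the same standard-deviation axes, as in Fig. 6.11.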
Fig. 6.12: Relationship between the Hertz and Bark scales. The vertical lines mark interval widths of 1 Bark.
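The curvilinear Hertz–Bark relationship in Fig. 6.12 can be approximated with Traunmüller's (1990) formula; it is an assumption on my part that this is the exact formula behind the figure, but it is a widely used one:

```python
def hz_to_bark(f):
    """Convert frequency in Hertz to the auditory Bark scale using
    Traunmüller's (1990) approximation:
        z = 26.81 * f / (1960 + f) - 0.53
    Roughly linear below about 500 Hz, compressive above it, which is
    why the 1-Bark interval widths in the figure grow with frequency."""
    return 26.81 * f / (1960.0 + f) - 0.53
```

For example, 1000 Hz maps to roughly 8.5 Bark, while doubling to 2000 Hz adds only about 4.5 Bark, illustrating the compression at higher frequencies.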
Fig. 6.13. A 3-4-5 triangle. The length of the solid line is the Euclidean distance between the points (0, 0) and (3, 4). The dotted lines show the horizontal and vertical distances that are used for the Euclidean distance calculation.
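The 3-4-5 triangle makes the calculation concrete: the Euclidean distance is the square root of the summed squared differences along each dimension. A minimal Python sketch (`euclidean` is an illustrative name):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimensionality:
    sqrt of the sum of squared per-dimension differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For the points (0, 0) and (3, 4) the horizontal and vertical sides are 3 and 4, giving sqrt(9 + 16) = 5; the same formula extends unchanged to the F1 × F2 centroid distances used in Figs. 6.14–6.16.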
Fig. 6.14: Lax monophthongs in German for the female speaker in the F2 × F1 plane for data extracted at the vowels' temporal midpoint. X is the centroid, defined as the speaker's mean position across all tokens of the four vowel categories.
Fig. 6.15: Boxplots of Euclidean distances to the centroid (Hz) for speaker 67 (male) and speaker 68 (female) for four lax vowel categories in German.
Fig. 6.16: Histograms of the log Euclidean distance ratios obtained from measuring the relative distance of [ɛ] tokens to the centroids of [ɪ] and [a] in the F1 × F2 space separately for a female (left) and a male (right) speaker.
Fig. 6.17: An F2 trajectory (right) and its linearly time normalised equivalent using 101 data points between t = ± 1.
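Linear time normalisation resamples every trajectory onto the same number of equally spaced points so that tokens of different durations can be compared and averaged. A Python sketch using linear interpolation (the book does this in Emu-R; the function name is mine):

```python
import numpy as np

def time_normalise(track, n=101):
    """Linearly time-normalise a sampled trajectory.

    Maps the track's own time axis onto the interval [-1, +1] and
    resamples it at n equally spaced points by linear interpolation.
    Returns (normalised_times, resampled_values).
    """
    track = np.asarray(track, dtype=float)
    old = np.linspace(-1.0, 1.0, len(track))
    new = np.linspace(-1.0, 1.0, n)
    return new, np.interp(new, old, track)
```

After normalisation, t = -1 is segment onset, t = 0 the temporal midpoint, and t = +1 the offset for every token, which is what allows the averaging in Fig. 6.19 (right).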
Fig. 6.18: A raw F2 trajectory (gray) and a fitted parabola.
Fig. 6.19: F2 of [ɛ] for the male (black) and female (gray) speakers synchronised at the temporal midpoint (left), linearly time normalised (centre), and linearly time normalised and averaged (right).
Fig. 6.20: Boxplots of the c2 coefficient of a parabola fitted to F2 of [ɛ] for the male (left) and female (right) speaker.
Fig. 6.21: A raw F2 formant trajectory of a back vowel [ɔ] (solid, gray) produced by a male speaker of Standard German and two smoothed contours of the raw signal based on fitting a parabola following van Bergem (1993) (dashed) and the first five coefficients of the discrete cosine transformation (dotted).
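Smoothing with the first five DCT coefficients, as for the dotted contour in Fig. 6.21, amounts to transforming the trajectory, discarding all but the lowest-frequency coefficients, and inverting. A self-contained Python sketch of the unnormalised DCT-II and its inverse (`dct_smooth` is my own name; Emu-R provides its own DCT facilities):

```python
import numpy as np

def dct_smooth(x, k=5):
    """Smooth a trajectory by truncating its discrete cosine transform.

    Computes the (unnormalised) DCT-II coefficients of x, keeps only the
    first k, and reconstructs via the DCT-III inverse:
        x_n = C_0/N + (2/N) * sum_{m=1}^{k-1} C_m cos(pi*m*(2n+1)/(2N))
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    C = np.array([np.sum(x * np.cos(np.pi * m * (2 * n + 1) / (2 * N)))
                  for m in range(N)])
    y = np.full(N, C[0] / N)
    for m in range(1, min(k, N)):
        y += (2.0 / N) * C[m] * np.cos(np.pi * m * (2 * n + 1) / (2 * N))
    return y
```

Keeping all N coefficients reconstructs the raw contour exactly; keeping five retains only the trajectory's mean, slope, and gentle curvature, discarding the frame-to-frame jitter.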
Fig. 6.22: Hypothetical F2-trajectories of [bɛb] (solid) and [bob] (dashed) when there is no V-on-C coarticulation (left) and when V-on-C coarticulation is maximal (right). First row: the trajectories as a function of time. Second row: A plot of the F2 values in the plane of the vowel target × vowel onset for the data in the first row. The dotted line bottom left is the line F2Target = F2Onset that can be used to estimate the locus frequency. From Harrington (2009).
Fig. 6.23: F2 trajectories in isolated /dVd/ syllables produced by a male speaker of Australian English for a number of different vowel categories synchronised at the vowel onset (left) and at the vowel offset (right).
Fig. 6.24. Locus equations (solid lines) for /dVd/ words produced by a male speaker of Australian English for F2-onset (left) and F2-offset (right) as a function of F2-target. The dotted line is y = x and is used to estimate the locus frequency at the point of intersection with the locus equation.
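A locus equation is a straight-line regression of F2-onset on F2-target, and the locus frequency is where that line crosses y = x: solving F2 = a + b·F2 gives a / (1 - b). A Python sketch of this estimate (the book fits these regressions in R; `locus_frequency` is an illustrative name):

```python
import numpy as np

def locus_frequency(f2_target, f2_onset):
    """Estimate the F2 locus frequency from paired target/onset values.

    Fits F2_onset = a + b * F2_target by least squares, then returns the
    intersection of this locus equation with the line y = x, i.e.
    a / (1 - b). (Undefined when the slope b is 1.)
    """
    b, a = np.polyfit(f2_target, f2_onset, 1)   # slope first, then intercept
    return a / (1.0 - b)
```

A slope b near 1 (onset tracks the target) indicates heavy vowel-on-consonant coarticulation, while b near 0 means the onset is pinned at the locus regardless of the vowel, linking this figure back to the hypothetical extremes of Fig. 6.22.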
Fig. 6.25: The female speaker's [ɛ] vowels in the plane of F3–F2 Bark and –F1 Bark.
Fig. 6.26: German lax monophthongs produced by a male (left) and a female (right) speaker in the plane of F3–F2 Bark and –F1 Bark. Data extracted at the temporal midpoint of the vowel.
Fig. 6.27: 95% confidence ellipses in the plane of F2 × F1 at the temporal midpoint of [a, ɔ] and at the time at which F1 reaches a maximum in [aʊ] for a male speaker of Standard German.
Fig. 6.28: F2 transitions for a male speaker of Standard German following [d] (solid) and [d] (dashed, gray).
Fig. 7.1: The palate of the EPG3 system in a plaster cast impression of the subject's upper teeth and roof of the mouth (left) and fixed in the mouth (right). Pictures from the Speech Science Research centre, Queen Margaret University College, Edinburgh, http://www.qmuc.ac.uk/ssrc/DownSyndrome/EPG.htm. Bottom left is a figure of the palatographic array as it appears in R showing 6 contacts in the second row. The relationship to phonetic zones and to the row (R1-R8) and column (C1-C8) numbers is also shown.
Fig. 7.2. Palatogram in the /n/ of Grangate in the utterance of the same name from the epgassim database. The palatogram is at the time point shown by the vertical line in the waveform.
Fig. 7.3: Schematic outline of the relationship between electropalatographic objects and functions in R.
Fig. 7.4: Palatograms of said Coutts showing the times (ms) at which they occurred.
Fig. 7.5: Waveform over the same time interval as the palatograms in Fig. 7.4. The vertical dotted lines mark the interval that is selected in Fig. 7.6.
Fig. 7.6: Palatograms over the interval marked by the vertical lines in Fig. 7.5.
Fig. 7.7: Palatograms for 10 [s] (left) and 10 [ʃ] (right) Polish fricatives extracted at the temporal midpoint from homorganic [s#s] and [ʃ#ʃ] sequences produced by an adult male speaker of Polish. (The electrode in column 8 row 5 malfunctioned and was off throughout all productions).
Fig. 7.8: Gray-scale images of the data in Fig. 7.7 for [s] (left) and [ʃ] (right). The darkness of a cell is proportional to the number of times that the cell was contacted.
Fig. 7.9: Sum of the contacts in rows 1-3 (dashed) and in rows 6-8 (solid) showing some phonetic landmarks synchronised with an acoustic waveform in said Coutts produced by an adult female speaker of Australian English.
Fig. 7.10: A histogram of the distribution of the minimum groove width shown separately for palatograms of Polish [s,ʃ]. The minimum groove width is obtained by finding whichever row over rows 1-5, columns 3-6 has the fewest inactive electrodes and then summing the inactive electrodes.
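The minimum groove width computation reduces to counting inactive electrodes row by row in a sub-region of the 8 × 8 contact array and taking the smallest count. A Python sketch over a binary palatogram (1 = contact, 0 = off); the book's own implementation is in Emu-R, and `min_groove_width` is my own name:

```python
import numpy as np

def min_groove_width(palatogram):
    """Minimum groove width of one palatogram.

    palatogram: 8x8 array with 1 for a contacted electrode, 0 for off.
    Restricts attention to rows 1-5 and columns 3-6 (0-indexed slices
    below), counts the inactive electrodes in each row, and returns the
    smallest row count, i.e. the narrowest point of the midline groove.
    """
    p = np.asarray(palatogram)
    region = p[0:5, 2:6]                    # rows 1-5, columns 3-6
    off_per_row = (region == 0).sum(axis=1)
    return int(off_per_row.min())
```

Applied frame by frame between the acoustic onset and offset of a fricative, this yields the groove-width trajectories of Figs. 7.12 and 7.13.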
Fig. 7.11: Palatograms for the first 12 frames between the acoustic onset and offset of a Polish [s]. On the right is the number of inactive electrodes for each palatogram in rows 1-7. The count of inactive electrodes for the palatogram at 1290 ms is highlighted.
Fig. 7.12: Minimum groove width (number of off electrodes in the midline of the palate) between the acoustic onset and offset of a Polish [s].
Fig. 7.13: Minimum groove width between the acoustic onset and offset of Polish [s] (black) and [ʃ] (gray) averaged after linear time normalisation.
Fig. 7.14: Palatograms with corresponding values on the anteriority index shown above.
Fig. 7.15: Palatograms with corresponding values on the centrality index shown above.
Fig. 7.16: Palatograms with corresponding centre of gravity values shown above.
Fig. 7.17: Synchronised waveform (top), anteriority index (middle panel, solid), dorsopalatal index (middle panel, dashed), and centre of gravity (lower panel) for just relax. The palatograms are those that are closest to the time points marked by the vertical dotted lines in the segments [ʤ] and [t] of just, and in [l], [k], [s] of relax.
Fig. 7.18: Grayscale images for 10 tokens each of the Polish fricatives [s,ʃ,ɕ].
Fig. 7.19: Anteriority (AI), dorsopalatal (DI), centrality (CI), and centre of gravity (COG) indices for 10 tokens each of the Polish fricatives [s,ʃ,ɕ] (solid, dashed, gray) synchronised at their temporal midpoints.
Fig. 7.20: Palatograms from the acoustic onset to the acoustic offset of /nk/ (left) in the blend duncourt and /sk/ (right) in the blend bescan produced by an adult female speaker.


