Carnegie Mellon
Towards Communication
with Dolphins
- Segmenting dolphin speech -
Final Report
Date: 12/16/2003
Tal Blum
Jiazhi Ou
11-751 Course Project Final Report
Introduction
The Wild Dolphin Project (WDP), founded by Dr. Denise Herzing in 1985, is engaged in an ambitious, long-term scientific study of a specific pod of Atlantic spotted dolphins that live 40 miles off the coast of the Bahamas, in the Atlantic Ocean. For about 100 days each year, Phase I research has involved the photographing, videotaping, and audio taping of a group of resident dolphins, aiming to learn about their lives. (http://www.wilddolphinproject.org/index.cfm)
As Denise sees it, dolphins produce three main kinds of sound: clicks, pulse trains (clicks spaced at equal intervals), and whistles. Whistles divide into signature whistles, repeating high-frequency sounds characteristic of an individual dolphin, and non-signature whistles, which do not repeat.
Our goal in this project was to understand the characteristics of dolphin sound and see how far we could go with it. The initial goal was to find basic sound units for dolphins. Later we decided on a more limited task: labeling some of the data, working with Janus, and distinguishing signature whistles from what appear to be non-signature whistles. Later still we found that the available data contains almost no non-signature whistles, so we tried to segment the data as well as we could.
Related Work
Our work is based on the dolphin-ID project implemented by Tanja Schultz, Alan W Black, and Yue Pan. The goal of that project was to identify dolphins by their signature whistles. 51 labeled files were used. 10 HMMs were trained to model 10 different dolphins; with a dolphin non-signature whistle model, a garbage model, and a pause model, there were 13 left-to-right HMMs altogether. An ergodic HMM consisting of all trained HMMs was created to recognize an input file. The decision is based on the best state sequence.
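Decoding with such an ergodic composition amounts to a Viterbi search over the combined state space; the winning path's states map back to the model labels. A minimal log-domain sketch (the toy scores below stand in for the trained models' likelihoods and are our own illustration, not the dolphin-ID implementation):

```python
import numpy as np

def viterbi(log_A, log_pi, log_obs):
    """Best state path; log_obs[t, s] is the log-likelihood of frame t
    under state s. Transitions and priors are in the log domain."""
    T, S = log_obs.shape
    delta = log_pi + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state example: frame scores favor state 0, then state 1 twice.
log_A = np.log([[0.8, 0.2], [0.2, 0.8]])
log_pi = np.log([0.5, 0.5])
log_obs = np.log([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
best = viterbi(log_A, log_pi, log_obs)
```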
Data & Labels
The data we used for the project were 163 signature whistle files. Each short file is assumed to contain just one dolphin sound and is about 7 seconds long on average.
These files are also internally labeled (courtesy of Alan Black) with dolphin sounds, silence, machine noise, human noise, etc. Most labels are dolphin whistles and water (silence).
We chose to build 4 main models: PAUSE, SIGWHISTLE, GARBAGE and DOLPHIN.
PAUSE – water
SIGWHISTLE – dolphin signature whistles
GARBAGE – human sounds, machine sounds
DOLPHIN – other dolphin sounds such as clicks, burst-pulse trains, non-signature whistles, etc.
Table 1. Labels statistics

|                                       | PAUSE | SIGWHISTLE | GARBAGE | DOLPHIN |
| #occurrences                          | 756   | 633        | 13      | 24      |
| Accumulated time (in secs)            | 466   | 320        | 7.1     | 11.3    |
| Average time per occurrence (in secs) | 0.6   | 0.5        | 0.55    | 0.47    |
Basically there are 2 main categories in these files, PAUSE and SIGWHISTLE. The other categories do not have enough data to model accurately. GARBAGE is composed of many different noises and is therefore hard to model. DOLPHIN amounts to only about 11 seconds of data and is very hard to distinguish from PAUSE and SIGWHISTLE, since even in PAUSE segments some dolphin noise can be heard.
We did not explicitly model dolphin non-signature whistles, because they are very hard to distinguish from signature whistles. It is not even clear that these files contain non-signature whistles: there are only 6 occurrences labeled as such, and even those are not guaranteed not to be signature whistles.
The data is very sparse, containing about 20 minutes of signal, of which less than 10 minutes are dolphin whistles.
Labeling Problems
The data is also band-limited: there is much more in the dolphin signal than we can observe, because of inadequate equipment. As a result it is hard to track whistles, since a whistle often drifts into the unobserved frequency range.
Another difficulty comes from the fact that there are no words, just segments. It makes a difference whether you label one segment or several, since in the model each segment becomes a left-to-right HMM and may contain several temporally repeating structures. Separating those structures at the labeling level should give better performance than leaving the acoustic model to handle them in an unsupervised manner.
The wave files have been processed by Alan Black to be displayed and played approximately 3 times slower than the normal rate. This has the advantage that it is easier to identify signature whistles, but the disadvantage that the sound no longer sounds natural, making it harder to identify a human sound.
Model Selection
We experimented with three schemes of model selection:
Scheme 1: Our initial goal was to model dolphin signature whistles, dolphin non-signature whistles, garbage, and pause, so we built four acoustic models in the first round of our experiments. Because left-to-right models proved applicable in the dolphin-ID project, we kept this topology (Figure 1). For signature whistles, non-signature whistles, and garbage we used three different states (b, m, e) in the left-to-right HMMs, while for pause we tied all three positions of the HMM to the same state (m).
Figure 1. Left-to-right HMM topologies for signature whistles, non-signature whistles, garbage, and pause
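The left-to-right constraint can be sketched as a transition matrix with only self-loop and forward entries; the probabilities below are illustrative, not the trained Janus values:

```python
import numpy as np

# 3-state left-to-right topology (b, m, e): each state may stay in
# place or advance, never move backward. Values are illustrative.
A = np.array([
    [0.6, 0.4, 0.0],   # b: stay, or advance to m
    [0.0, 0.7, 0.3],   # m: stay, or advance to e
    [0.0, 0.0, 1.0],   # e: stay (segment exit handled by the decoder)
])

# No backward transitions: the strict lower triangle is all zeros.
assert np.all(np.tril(A, k=-1) == 0)
```

The pause model ties all three positions to one shared state (m), which behaves like a single state with a self-loop.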
Scheme 2: From Table 1 we noticed that the total lengths of the non-signature whistle signal and the garbage signal are fairly short (7 and 11 seconds), which might lead to overfitting. To mitigate this to some extent, we combined the data labeled as non-signature whistles and garbage and built a combined garbage model. This leaves three acoustic models: signature whistles, garbage, and pause.
Scheme 3: Finally we wanted to train a separate HMM for each dolphin. In our database there are 10 different dolphin IDs, hence 10 HMMs for signature whistles. With the garbage model and the pause model, there are 12 HMMs altogether.
Evaluation Metric
Since we do not have word labels as in human speech, and since a segmentation label may contain a varying number of signal occurrences, we chose a metric that measures the frame alignment between our labeled segments and the test result segments. For each class we compute the percentage of time that the test results assign it to the same class and the percentages assigned to each other class.
When evaluating the results it is important to realize that the only significant results are those for the classes PAUSE and SIGWHISTLE; the other two classes make up a very small percentage of the data.
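The frame-alignment metric can be sketched as follows; the function name and framing are ours, not part of the Janus toolkit:

```python
from collections import Counter, defaultdict

def frame_confusion(ref_frames, hyp_frames):
    """Per-class frame alignment: for each reference class, the
    fraction of its frames that the hypothesis assigns to each class."""
    assert len(ref_frames) == len(hyp_frames)
    counts = defaultdict(Counter)
    for r, h in zip(ref_frames, hyp_frames):
        counts[r][h] += 1
    return {r: {h: c[h] / sum(c.values()) for h in c}
            for r, c in counts.items()}

# Toy frame sequences: 2 of 3 SIG frames and 2 of 3 PAUSE frames agree.
ref = ["PAUSE", "PAUSE", "SIG", "SIG", "SIG", "PAUSE"]
hyp = ["PAUSE", "SIG",   "SIG", "SIG", "PAUSE", "PAUSE"]
cm = frame_confusion(ref, hyp)
```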
Experiments and Results
162 labeled files were used to evaluate our three schemes of model selection. We used half of the data for training and the other half for testing, then swapped the training and testing data for cross-validation, giving 162 test results per scheme. The Janus toolkit was used to train the HMMs and to perform Viterbi decoding. Feature extraction is the same as in the dolphin-ID project: down-sampling, high-pass filtering, FFT, and LDA, with the final feature vectors based on Fourier coefficients. To make scheme 3 comparable to schemes 1 and 2, we replaced all dolphin IDs with the token 'Signature Whistle' in the Janus output files.
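The front end can be sketched roughly as below; the window size, hop, sampling rate, and cutoff are our assumptions, not the actual dolphin-ID parameters, and the LDA step (which needs the trained transform and labels) is omitted:

```python
import numpy as np

def fft_features(signal, sr, frame_len=512, hop=256, cutoff_hz=300):
    """Framewise log FFT-magnitude features with a crude high-pass
    (zeroing bins below the cutoff). All sizes are illustrative."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        spec = np.abs(np.fft.rfft(frame))
        lo = int(cutoff_hz * frame_len / sr)  # first bin to keep
        spec[:lo] = 0.0                       # high-pass filtering
        feats.append(np.log1p(spec))
    return np.array(feats)

# One second of a 1 kHz tone at a 16 kHz sampling rate.
x = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000)
F = fft_features(x, sr=16000)
```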
The evaluation results with our metric are shown in Table 2 to 4.
Table 2. The Confusion Matrix of Scheme 1

|         | Sig | Non-Sig | Garbage | Pause |
| Sig     | 58% | 6%      | 18%     | 34%   |
| Non-Sig | 33% | 8%      | 37%     | 22%   |
| Garbage | 77% | 0%      | 5%      | 18%   |
| Pause   | 31% | 6%      | 27%     | 34%   |
Table 3. The Confusion Matrix of Scheme 2

|         | Sig | Garbage | Pause |
| Sig     | 79% | 9%      | 21%   |
| Garbage | 52% | 21%     | 27%   |
| Pause   | 48% | 14%     | 38%   |
Table 4. The Confusion Matrix of Scheme 3

|         | Sig | Garbage | Pause |
| Sig     | 91% | 0.6%    | 8%    |
| Garbage | 80% | 10%     | 10%   |
| Pause   | 69% | 1%      | 30%   |
From Table 2 we can see that the model selection of scheme 1 is quite unsuccessful. We attribute this to overfitting of the garbage model and the dolphin non-signature whistle model, which have very little training data. Tables 3 and 4 show better results for schemes 2 and 3. Scheme 3 is analogous to using speaker-dependent models in human speech recognition, and it has the best alignment for signature whistles; on the other hand, all other signals tend to be recognized as signature whistles.
All trained parameters and test results are available on /afs/cs.cmu.edu/user/jzou/dolphin/162.
Conclusions
In this course project we experimented with different models of dolphin acoustic phenomena. We labeled the data, tried different schemes, and compared the results with our own evaluation metric. The results are not very satisfactory, due to the lack of training and testing data for garbage and non-signature whistles.
We also tried two different kinds of topologies.
Figure 2. The “loop-back” we wanted to try.
Figure 2 shows a "loop-back" model for dolphin signature whistles. It has a transition from state 3 back to state 1 and can be used to model repeating patterns within a labeled segment. For some reason, however, it made no difference compared to scheme 1; this might be due to a local minimum in the parameter space. We also wanted to model non-signature whistles and garbage with three tied states, exactly as in the pause model, but the training phase ended with a segmentation fault that we have not been able to fix so far.
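The loop-back topology can be sketched as a modified transition matrix; the probabilities are illustrative, not trained values:

```python
import numpy as np

# Plain left-to-right matrix for states (1, 2, 3); values illustrative.
A = np.array([
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
])

# "Loop-back" variant: state 3 may return to state 1, letting one
# model absorb a repeated whistle pattern within a labeled segment.
A_loop = A.copy()
A_loop[2] = [0.3, 0.0, 0.7]  # 3 -> 1 with prob 0.3, 3 -> 3 with 0.7
```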
Analyzing dolphin sounds is quite different than analyzing human speech. The methods used have to be adjusted to the characteristics of the dolphin sounds. There is a lot of work to be done in the signal processing stage or in the modeling of the classes. It might even be better just to construct a model for the labels we are sure and let the model learn what the units that discriminate between different labels are.