Speech Recognition Application:
Voice Enabled Phone Directory
Introduction:
Speech recognition is seen as one of the most promising market technologies of the near future. For example, companies such as Advanced Recognition Technologies, Inc. (ART) and Microsoft, as well as open source communities, have been integrating speech recognition systems into their software. Voice command based applications are expected to cover many aspects of our daily lives, from telephones to the Internet, and they will make life easier by giving people easy and fast access to information.
It is important to understand the process of speech recognition in order to implement or integrate it into different applications. First of all, there are two types of speech recognition. The first is a ‘speaker dependent’ system, designed for a single speaker; it is easy to develop but not flexible to use. The second is a ‘speaker independent’ system, designed for any speaker; it is harder to develop, less accurate, and more expensive than a speaker dependent system, yet it is more flexible.
The vocabulary size of an Automatic Speech Recognition (ASR) system ranges from a small vocabulary of two words to a very large vocabulary of tens of thousands of words. The size of the vocabulary affects the complexity, processing requirements, and accuracy of the ASR system. There are also two modes of recognition: the first is an isolated-word system that recognizes a single word - either a full word or a letter - at a time. It is the simplest type because it is easy to find the starting and ending points of a word. In the second type, the continuous system, this is much harder because the input consists of whole sentences.
There are a number of factors that can affect an ASR system, such as pronunciation and frequency. The speaker's current mood, age, sex, dialect, and inflections, as well as background noise, can affect the accuracy and performance of such a system. It is thus necessary for the system to overcome these obstacles in order to be more accurate. For example, the system can use filters to deal with some of these problems, such as background noise, coughs, and heavy breathing. Therefore, in most systems filtering is the first stage of speech analysis, where speech is filtered before it reaches the recognizer. Processing the speech requires analog-to-digital conversion, in which the voice's pressure waves are converted into numerical values so that they can be processed digitally.
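As a small illustration of what this digitization step produces, the Perl sketch below reads a headerless .raw file of signed 16-bit little-endian samples (the format described later for the Sphinx input) and scales the values to the range -1 to 1. The file name and the exact sample format are assumptions made for the example, not requirements of any particular recognizer.

# read_raw.pl - minimal sketch: load 16-bit little-endian PCM samples
# Assumes a headerless .raw file; the file name is supplied on the command line.
use strict;
use warnings;

my $file = shift @ARGV or die "usage: read_raw.pl file.raw\n";
open my $fh, '<:raw', $file or die "cannot open $file: $!";
my $bytes = do { local $/; <$fh> };            # slurp the whole file
close $fh;

my @samples = unpack 's<*', $bytes;            # signed 16-bit, little-endian
my @scaled  = map { $_ / 32768.0 } @samples;   # normalize roughly to [-1, 1)
printf "%d samples, first value %.4f\n", scalar @samples, $scaled[0] // 0;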
The Hidden Markov Model (HMM) is a Markov chain where the output symbols, or the probabilistic functions that describe them, are attached either to the states or to the transitions between states. The model consists of a set of nodes chosen to represent a particular vocabulary. These nodes are ordered and connected from left to right, and recursive loops (self-transitions) are allowed. Recognition is based on a transition matrix giving the probability of moving from one node to another, and each node represents the probability of a particular set of codes. Here is a figure that shows the functionality of ASR:
[Figure: overall functionality of an ASR system (11)]
The HMM ‘Q’ is often referred to as a parametric model because the state of the system at each time t is completely described by a finite set of parameters. The training algorithm that estimates the HMM parameters starts from a good first guess: the initialization step computes a first estimate of the HMM parameters from the preprocessed speech data (features) and their associated phoneme labels. The HMM parameters are stored as files and then retrieved by the training procedure. Here is a figure that shows the initialization step in ASR:
[Figure: initialization of the HMM parameters in ASR (11)]
Before estimating the HMM parameters, the basic structure of the HMM must be defined. To be specific, the graph structure - the number of states and their connections - and the number of mixtures per state, M, must be specified. A good way to understand HMMs is through an example. Suppose we build a model Q that recognizes only the word “yes”. Treated as the two phonemes ‘\ye’ and ‘\s’, the word corresponds to the six states of the two three-state phoneme models; more accurately, “yes” is composed of the three phonemes ‘\y’, ‘\eh’ and ‘\s’. The ASR system cannot know which acoustic states the speaker had in mind, so it tries to find W by reconstructing the most likely sequence of states and words W that could have generated the observed features X.
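To make the left-to-right structure concrete, here is a toy sketch, written in Perl (one of the languages used later in the project), of a transition matrix for a one-state-per-phone model of “yes” with a self-loop on each state. The probabilities are invented for illustration; they are not trained parameters.

# toy_hmm.pl - a toy left-to-right HMM topology for the word "yes"
# One state per phone plus a final exit state; probabilities are made up.
use strict;
use warnings;

my @states = qw(Y EH S END);
my %A = (                                   # A{from}{to} = transition probability
    Y   => { Y   => 0.6, EH  => 0.4 },      # stay in the state (self-loop) or advance
    EH  => { EH  => 0.5, S   => 0.5 },
    S   => { S   => 0.7, END => 0.3 },
    END => { END => 1.0 },
);

# Sanity check: each row of the transition matrix must sum to 1.
for my $from (@states) {
    my $sum = 0;
    $sum += $_ for values %{ $A{$from} };
    printf "state %-3s outgoing probability mass = %.2f\n", $from, $sum;
}

A full recognizer would also attach an output (observation) probability to each state; this sketch only shows the graph structure and the transition probabilities.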
Model training is performed by estimating the HMM parameters, and the estimation accuracy grows roughly in proportion to the amount of training data. The HMM is well suited to a speaker-independent system because the speech used during training is modeled with probabilities, or generalizations, which makes it a good choice for handling multiple speakers.
It is important to keep in mind the E-set, which includes b, c, d, e, g, p, t, v, and z: when these letters are pronounced they sound very much alike. It is therefore important to keep them in mind when dealing with issues like pronunciation.
Statement of the Problem:
The focus of my project is an automatic, speech-driven phone directory assistance system that works without human interaction. It is hard to find a speech command based system that looks up numbers for you, because of all the complications mentioned above.
Proposed Solution
My solution consists of three parts. I will go through each of them, explain my approach, and state what I would like to obtain from it, and then demonstrate how they all play a part in the final configuration. A diagram giving an overview of the components accompanies this proposal; the next paragraphs explain each part:
Sphinx:
The first part needed is an ASR system that I can work with to build my speech enabled phone directory: a speaker independent system based on HMMs with a large vocabulary. After researching the matter, I have decided to use Sphinx, from Carnegie Mellon University, as my ASR system.
In Sphinx, the basic sounds of the language are classified into phonemes, or phones. The phones are distinguished according to their position within the word (beginning, end, internal, or single) and are further refined into context-dependent triphones. The acoustic models are built from these triphones. Triphones are modeled by HMMs that usually contain three to five states, and the HMM states are clustered into a much smaller number of groups called senones.
The input audio consists of 16-bit samples at 8 to 16 kHz, stored as .raw files. Training requires good data consisting of spoken text, or utterances. Each utterance:
- Is converted into a linear sequence of triphone HMMs using the pronunciation lexicon.
- Is aligned to find the best state sequence (state alignment) through the HMMs.
For each senone, all the training frames mapped to it are gathered in order to build a suitable statistical model. The language model consists of:
- Unigrams, where the entire set of words and their individual probabilities of occurrence in the language are considered.
- Bigrams: the conditional probability that word2 immediately follows word1 in the language (a small estimation sketch follows this list).
- Information for a subset of possible word pairs.
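As a rough illustration of how such probabilities can be estimated from counts, the sketch below computes unigram and bigram relative frequencies over a tiny made-up corpus. Real language model training uses far more data and applies smoothing; the corpus and words here are purely hypothetical.

# toy_lm.pl - estimate unigram and bigram probabilities from a toy corpus
# The three "sentences" below are made up for the example.
use strict;
use warnings;

my @corpus = ('call sam smith', 'call george adams', 'call sam knight');
my (%uni, %bi);
for my $sentence (@corpus) {
    my @w = split ' ', $sentence;
    $uni{$_}++ for @w;
    $bi{"$w[$_] $w[$_+1]"}++ for 0 .. $#w - 1;
}
my $total = 0;
$total += $_ for values %uni;

# P(word) and P(word2 | word1) by simple relative frequency (no smoothing).
printf "P(sam)       = %.3f\n", $uni{sam} / $total;
printf "P(smith|sam) = %.3f\n", $bi{'sam smith'} / $uni{sam};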
Sphinx also uses a lexicon structure, which is the pronunciation dictionary: a file that specifies word pronunciations as linear sequences of phones. It is essential to know that there can be multiple pronunciations for the same word or letter. The dictionary also includes a silence symbol to represent the user’s silence. As an example, ‘ZERO’ is pronounced ‘Z IH R OW’.
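The small sketch below imitates such a lexicon with a hard-coded Perl hash, using CMU-style phone symbols, and looks up the pronunciations of a few spelled letters. The entries and the spelled example are illustrative assumptions, not the contents of an actual dictionary file.

# toy_lexicon.pl - a tiny pronunciation lexicon using CMU-style phone symbols
# The entries are an illustrative subset only, not a real dictionary.
use strict;
use warnings;

my %lexicon = (
    'A'     => 'EY',
    'B'     => 'B IY',
    'C'     => 'S IY',
    'S'     => 'EH S',
    'ZERO'  => 'Z IH R OW',    # the example given in the text above
    '<sil>' => 'SIL',          # symbol representing the user's silence
);

my @spelled = qw(S A B);       # hypothetical decoder output: the letters S, A, B
for my $letter (@spelled) {
    printf "%-6s -> %s\n", $letter, $lexicon{$letter} // '(not in lexicon)';
}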
Database (ADB)
The second item needed for my project is a database. I decided to use PostgreSQL for this part. The database, named ADB, will contain a “People” entity with the following attributes:
- pid: an attribute that contains the unique identification number of each person; type integer.
- first_name: an attribute that contains the first name of a person; type varchar(20).
- last_name: an attribute that contains the last name of a person; type varchar(20).
- phone_number: an attribute that contains the phone number; type varchar(12), declared UNIQUE (which means the system will not accept the same number more than once).
- city: an attribute that contains the city name; type varchar(15).
The primary key is (pid, first_name, last_name); a sketch of the corresponding table definition follows.
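As an illustration, the table could be created through Perl's DBI module with the PostgreSQL driver roughly as sketched below. The connection details (database user and password) are placeholders for this example, and the entity is created here as a table named people; the columns follow the attribute list above.

# create_adb.pl - sketch: create the people table in the ADB database
# The connection user/password below are placeholders, not real credentials.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=ADB', 'username', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(<<'SQL');
CREATE TABLE people (
    pid          integer,
    first_name   varchar(20),
    last_name    varchar(20),
    phone_number varchar(12) UNIQUE,
    city         varchar(15),
    PRIMARY KEY (pid, first_name, last_name)
);
SQL

$dbh->disconnect;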
Here is an example of what the Database contains:
Pid | first_name | last_name | phone_num | city
------+----------------+---------------+--------------------+--------------
1 | Sam | Smith | 765-973-2743 | Ramallah
2 | George | Adams | 765-973-2741 | Richmond
3 | Sam | Knight | 765-973-2222 | Houston
4 | Kathrin | Smith | 765-973-3343 | Jerusalem
5 | Samer | Abdo | 765-973-2190 | Jacksonville
The database’s function is to support matching. It holds each person’s information and provides the data needed by the phone directory. In other words, it acts as an address book, but at the same time it can select the specific information needed by the application. For example, you can select all names in the directory, or you can select a specific person by first name or last name. Selecting and inserting are therefore important functions of the database (ADB).
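A hedged sketch of such a lookup by first name, again through DBI with placeholder connection details, might look like this:

# lookup.pl - sketch: select matching people by first name
use strict;
use warnings;
use DBI;

my $first = $ARGV[0] // 'Sam';        # name to look up; 'Sam' is only a demo default
my $dbh = DBI->connect('dbi:Pg:dbname=ADB', 'username', 'password',
                       { RaiseError => 1 });

my $sth = $dbh->prepare(
    'SELECT first_name, last_name, phone_number FROM people WHERE first_name = ?');
$sth->execute($first);

while (my ($fn, $ln, $phone) = $sth->fetchrow_array) {
    print "$fn $ln: $phone\n";
}
$dbh->disconnect;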
Application
The application is the third deliverable and one of the most important. It serves as the main connector between the ASR system (Sphinx) and the database (ADB): it handles the communication with Sphinx and with the database to send and receive information. As mentioned briefly before, through Sphinx I will be able to decode speech as the user says the letters of a first name or last name. Once the letters are decoded, the application communicates with the database (ADB). The application’s functionality will include:
Connect to the DB to:
- Add a person.
- Delete a person.
- Edit a person.
- View records (through different select statements, depending on what the user wants); viewing is discussed in the next section.
* Note: a person here includes first name, last name, and phone number.
Connect with Sphinx to:
- Decode the letters said by the user.
- Use the generated log file to view and grep the results.
- Strip the silence symbols, as well as the spaces between letters, in order to join the decoded letters into the person’s first or last name as a single “word”.
Communicate with the user:
- In order to ensure that the decoder actually decoded the letters said by the user, he/she will be asked “did you say ‘word’?”. The word does not have to be a complete word; it could be just a few letters of one. As long as the user says the letters needed, the application runs the commands to connect to the database and gets back the matching words. The user will be asked whether the ‘letters of a word’ are meant as a first name or a last name. If, for example, the user gets back a lot of names, he/she has the option to say more letters; these are combined with the previous letters and the same operation is repeated: the application connects to ADB and returns the results.
The applications, as I said, will communicate with Sphinx and ADB. The programming languages that I will use include Perl, C, PHP, and shell scripts; a rough sketch of the glue logic between the decoder output and the database is given below.
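The sketch below illustrates the core of that glue logic under a few stated assumptions: the decoder hypothesis is hard-coded here as a string of letters separated by <sil> symbols (in the real application it would be grepped out of the Sphinx log file), the connection details are placeholders, and the people table is the one described earlier.

# glue.pl - sketch: turn a decoded letter sequence into a directory lookup
# The hypothesis string below is a hypothetical stand-in for the Sphinx output.
use strict;
use warnings;
use DBI;

my $hypothesis = '<sil> S <sil> A <sil> M <sil>';    # hypothetical decoder output

# Strip the silence symbols and the spaces between letters,
# gluing the decoded letters into a single name.
(my $name = $hypothesis) =~ s/<sil>//g;
$name =~ s/\s+//g;                                   # "SAM"

my $dbh = DBI->connect('dbi:Pg:dbname=ADB', 'username', 'password',
                       { RaiseError => 1 });
my $sth = $dbh->prepare(
    'SELECT first_name, last_name, phone_number FROM people
     WHERE upper(first_name) LIKE ? OR upper(last_name) LIKE ?');
$sth->execute("$name%", "$name%");                   # prefix match on the spelled letters

while (my ($fn, $ln, $phone) = $sth->fetchrow_array) {
    print "Did you say $fn $ln? Number: $phone\n";
}
$dbh->disconnect;

If more than one row comes back, the application would ask the user for additional letters and repeat the query with the longer prefix, as described above.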
Final Paper:
This part will consist of joining parts of the proposal paper with more information on the applications that I want to build. It will also include the relevant source files and reflect on the results I have obtained. It could also list bugs and enhancements that could be implemented at a later stage.
Timeline:
Clarification of tasks:
- Reading: My aim was to move into the field of automatic speech recognition, which was new to me. At first I read general background information on speech recognition, and then I moved deeper into the subject. I read about different approaches and also reviewed some information about databases (PostgreSQL).
- Bibliography: The bibliography part consists of different journals, articles and books that I have read or was reading at the time.
- Survey Paper & Presentation: This part consisted of background knowledge and information on the subject. Working on this part of the paper and presentation gave me a good idea of the content and understanding of the area.
- Find ASR system: During this time I was looking into different, mainly open source, software that I would later build my application out of.
- Test & Configure ASR: This part consisted of getting the software working. I chose the Sphinx ASR system. There were a lot of dependencies that I had to configure for the software to work properly. I tested the system and tried to work with its language. I also tested the software on its raw files and then figured out how to convert .wav files into .raw files.
- Proposal Presentation: Involved setting up and getting different ideas into place. I took the important parts of the survey that I needed to include in the proposal presentation. It included what I wanted to work with, for example, the language and its model.
- Build Database: This part consisted of building the database. I chose to work with PostgreSQL and had to satisfy a couple of dependencies here as well. I created a database called ADB and set up a table and its tuples, inserting over 9 entries for different people, each with a first name, last name, phone number, and city. I also wrote the different select statements that I will use in the application I am building.
- Proposal Paper: This part consisted of writing the proposal paper. It included what I presented in the proposal presentation and, at the same time, I wrote more details about the application process building.
- Build Applications: Consists of implementing the application part, which will join the ASR system (Sphinx) and the PostgreSQL ADB database. I will use C, Perl, PHP, and shell scripting for writing the applications.
- Test Applications: This part will consist of testing the applications described in the paper, including finding bugs and identifying enhancements that could later be implemented to improve the program.
- Final Paper & Revisions: This part will consist of joining parts of the proposal paper with more detail about the applications that I built, as well as the included source files. It will also reflect on the results reached and on bugs and enhancements that could be implemented at a later stage.
- Colloquium Preparation: Consists of preparing the presentation for the colloquium. It will include some of the proposal presentation but will go further, covering the applications, their results, and the reported bugs.
- Colloquium: Making final thoughts and preparation for presenting the colloquium.
Bibliography
White, George M. "Natural Language understanding and Speech Recognition." Communications of the ACM 33 (1990): 74 - 82.
Osada, Hiroyasu. "Evaluation Method for a Voice Recognition System Modeled with Discrete Markov Chain." IEEE 1997: 1 - 3.
Bradford, James H. "The Human Factors of Speech-Based Interfaces: A Research Agenda." SIGCHI Bulletin 27 (1995): 61 - 67.
Shneiderman, Ben. "The Limits of Speech Recognition." Communication of the ACM 43 (2000): 63 - 65.
Danis, Catalina, and John Karat. "Technology-Driven Design of Speech Recognition Systems." ACM 1995: 17 - 24.
Suhm, Bernhard, et al. "Multimodal Error Correction for Speech User Interfaces." ACM Transactions on Computer-Human Interaction 8 (2001): 60 - 98.
Brown, M.G., et al. "Open-Vocabulary Speech Indexing for Voice and Video Mail Retrieval." ACM Multimedia 96, 1996: 307 - 316.
Christian, Kevin, et al. "A Comparison of Voice Controlled and Mouse Controlled Web Browsing." ACM 2000: 72 - 79.
Falavigna, D., et al. "Analysis of Different Acoustic Front-ends for Automatic Voice over IP Recognition." Italy, 2001.
Simons, Sheryl P. "Voice Recognition Market Trends." Faulkner Information Services, 2002.
(11) Becchetti, Claudio, and Lucio Prina Ricotti. Speech Recognition: Theory and C++ Implementation. New York: 1999.
Abbott, Kenneth R. Voice Enabling Web Applications: VoiceXML and Beyond. New York: 2002.
Miller, Mark. VoiceXML: 10 Projects to Voice Enable Your Web Site. New York: 2002.
Syrdal, A., et al. Applied Speech Technology. Ann Arbor: CRC, 1995.
Larson, James A. VoiceXML: Introduction to Developing Speech Applications. New Jersey: 2003.