You Don’t Say? Enriching Human-Computer Interactions through Voice Synthesis




Megan Jeffrey


March 17th, 2010
Com546: Evolutions and Trends in Digital Media
University of Washington: MCDM

Abstract
As computers continue to be an integral part of how individuals communicate, proponents of voice synthesis have claimed that the technology is a way to “humanize” our interactions with machines. Companies seeking to improve their customer service after business hours rely on call centers that use synthetic voices to answer consumer questions, and in-car GPS devices relay instructions to drivers in a safe, personable manner. Furthermore, for those who have lost the ability to speak, synthetic voices offer another chance to be heard and to express their feelings in a way that sounds far less robotic than the text-to-speech technology of the 1970s. Through the use of complex voice concatenation engines, technicians are approaching a time when the synthesized voices of our computers will engage us not only phonetically, but also culturally and emotionally.

Introduction
Speech is the primary way in which humans engage each other. Shortly after birth, infants become aware that by manipulating their vocal cords they can influence others, and this awareness grows alongside an individual’s understanding of rhetoric. By the time many of us enter grade school, we recognize that speech not only enables us to interact with others, but also to access and disseminate information. Hence, it is not surprising that “with the rapid advancement in information technology and communications, computer systems increasingly offer users the opportunity to interact with information through speech” (Al-Said, 2009). After all, as technology continues to increase the ease with which we access data, the convenience of simply talking to our machines to get what we want is appealing. However, effective communication is not one-sided, and if humans want to be clearly understood by machines, machines should be able to respond in a voice of their own.

For decades, scientists have attempted to use technology to synthesize voices that will enable silent machines, and those humans silenced by physical trauma, to communicate. However, Clifford Nass, a professor of communication at Stanford University, argues that 200,000 years of evolution have made us “hard-wired to interpret every voice as if it were human, even when we know it comes from a computer” (Logan, 2007). As a result of this “hard-wiring,” Nass claims consumers find it difficult to respond to artificial voices because they lack personality and social awareness. For instance, due to their monotone inflection, synthesized voices can often sound indifferent, and their overly phonetic pronunciation too often reminds us that we are talking to a machine and not a human. Nevertheless, the recent success of voice synthesis companies like CereProc Ltd. and AT&T Natural Voices has resulted in machine-generated voices that are capable of expressing emotion and demonstrating the kind of nuanced speech that may one day make chatting with one’s computer an enjoyable exercise.


Historical Background
At the 1939 World’s Fair in New York, Homer Dudley debuted the Voder (Voice Operation Demonstrator), a Bell Labs machine that converted electronically generated tones and hiss into speech using a set of controls that were “played” like a musical instrument by a human operator. These controls enabled the operator to alter the rhythm, pitch, and inflection of each available tone until it roughly resembled human speech. However, it was not until the first digital revolution of the late 1970s that synthesized-speech systems gained widespread attention, thanks to the “Speak & Spell,” an educational toy developed by Texas Instruments in 1978 (Logan, 2007). By using Text-to-Speech (or TTS) software that applied mathematical models of sound moving along the human vocal tract, the toy would “speak” any word typed on its keyboard.

In the years that followed, Dennis Klatt of the Massachusetts Institute of Technology (MIT) used similar text-to-speech software to give world-renowned physicist Stephen Hawking a new digital voice. Dubbed “Perfect Paul,” the program enabled Hawking to converse verbally with others in spite of his motor neuron disease (Logan, 2007). However, both “Perfect Paul” and his deeper, more masculine cousin “Huge Harry” still sounded very robotic, and voice technicians struggled to develop a TTS-based technology that would synthesize more natural-sounding and expressive voices.


Voice Synthesis Technology and Methodology: An Overview
Since Klatt’s pioneering research at MIT, voice synthesis has grown to include two primary components: a TTS engine and a library of pre-recorded voices that enable devices to speak in a variety of languages and accents. On its Website, AT&T Natural Voices describes how modern TTS operates in two stages: during the first, the engine decides how the text should be spoken (pronunciation, pitch, etc.), and in the second, the system generates audio that matches the previously identified specifications (2010). However, it is important to note that TTS systems do not actually understand the human language they mimic. Instead, the process is “more like learning to read a foreign language aloud [;] with a good dictionary, grammar rules, etc. you can get better, but still make mistakes obvious to native speakers” (AT&T, 2010). Therefore, before TTS technology can advance to a level where it is self-correcting, programmers would have to create software that teaches machines the meaning of words (both literal and cultural) so that computers would be capable of understanding the text they read.
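To make the two-stage process concrete, here is a minimal sketch, assuming an invented SpeechSpec structure and placeholder audio: a front end decides how the text should be spoken, and a back end renders audio to that specification. None of these names reflect AT&T's actual engine or API.

```python
from dataclasses import dataclass

@dataclass
class SpeechSpec:
    phonemes: list        # pronunciation decided in stage one
    pitch_hz: float       # baseline pitch target
    rate_wpm: int         # speaking rate

def analyze_text(text):
    """Stage one: decide how the text should be spoken."""
    # A real front end would consult a pronunciation dictionary and
    # letter-to-sound rules; here each character stands in for a phoneme.
    phonemes = [ch for word in text.lower().split() for ch in word]
    return SpeechSpec(phonemes=phonemes, pitch_hz=120.0, rate_wpm=160)

def synthesize(spec):
    """Stage two: generate audio matching the specification."""
    # A real back end would concatenate recorded units or drive a vocoder;
    # the byte string below is only a placeholder for audio samples.
    return b"\x00" * (len(spec.phonemes) * 100)

audio = synthesize(analyze_text("Hello world"))
```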
Voice Concatenation
Computer scientist Boris M. Lobanov writes that using TTS software to reproduce the human voice resembles “the widely-known biological problem of cloning, whereby on the basis of a comparatively small amount of genetic information, an attempt is made of reproducing a living being copy as a whole” (2004). In the early 1990s, speech synthesis researchers abandoned attempts to create human sounds from scratch and instead began using “voice concatenation,” which broke down recordings of a person’s voice into small units of speech (phonemes, allophones, etc.) and then put them back together to form new words and sentences (Logan, 2007). A phoneme is an abstract unit that speakers of a particular language recognize as a distinctive sound (ex: The hard “d” in the word “dive”). In comparison, an allophone is a variant of a phoneme; changing the allophone will not change the meaning of a word, but the result may sound unnatural or be unintelligible (ex: “Night Rate” vs. “nitrate”).
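A toy sketch of the concatenation idea follows; the unit names and the tiny voice_db dictionary are invented stand-ins for a real database of recorded phonemes and allophones.

```python
# Hypothetical voice database: each small unit of speech maps to a recorded
# snippet (placeholder bytes here rather than real audio samples).
voice_db = {
    "n": b"...", "ay": b"...", "t": b"...", "r": b"...", "ey": b"...",
}

def concatenate(units):
    """Assemble a new utterance by joining pre-recorded units end to end."""
    missing = [u for u in units if u not in voice_db]
    if missing:
        raise KeyError("voice database lacks units: %s" % missing)
    return b"".join(voice_db[u] for u in units)

# "nitrate" reassembled from stored units; a real engine would also smooth
# the joins so the result does not sound choppy.
nitrate_audio = concatenate(["n", "ay", "t", "r", "ey", "t"])
```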

If we were to think of voice synthesis in terms of cloning, as Lobanov suggests, this use of “speech DNA” makes the task of synthesizing a new voice more manageable for a system; it has only to store the smaller units, rather than a copy of every word in a given language. Moreover, CereProc Ltd. researcher Matthew Aylett argues concatenation is effective because “a critical element of careful speech is to be able to mark important information bearing sections [;] a large amount of clarity can be added by inserting short phrase breaks appropriately” (2003). Hence, by using these small units, concatenation more closely resembles human speech because the words and sounds do not all run together. Nevertheless, because the quality of a speech synthesizer is judged on its similarity to the human voice, it is essential that companies maintain a database of pre-recorded texts and a large number of phonemes that preserve “the individual acoustic characteristics of a speaker’s voice” (Lobanov, 2004).

According to Aylett, the personal acoustic characteristics of the human voice are determined by a number of physical factors, such as the unique shape of each person’s speech organs: the larynx, vocal cords, mouth, etc. (2003). He and other voice synthesis experts like Tian-Swee Tan believe concatenation engines can compensate for this physical complexity by using sound units to generate precise speech. Specifically, Tan argues that selecting a string of “continuous phoneme from the same source, instead of individual phoneme from different sources,” will reduce the number of concatenation points and the tonal distortion they cause, resulting in a more natural-sounding synthesized voice (2008). This is why the best synthesized voices draw their audio from a single database. In the most basic TTS systems at the turn of the century, computers stored more than 100,000 bits of sound data associated with written words (Rae-Dupree, 2001). However, because human communication is more than a string of sounds, voice synthesis researchers have also developed natural language processors (NLP).
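The sketch below illustrates, with assumed inputs, the preference Tan describes for long contiguous runs of phonemes taken from a single recording: the fewer segments the selector returns, the fewer concatenation points, and hence the less tonal distortion. Real unit-selection engines optimize target and join costs; the greedy longest-match here is only illustrative.

```python
def select_units(target, recordings):
    """Cover `target` (a phoneme list) with the longest contiguous runs
    found in any recording, returning (source, run) pairs."""
    selected, i = [], 0
    while i < len(target):
        best_src, best_len = None, 0
        for src, units in recordings.items():
            for start in range(len(units)):
                length = 0
                while (i + length < len(target)
                       and start + length < len(units)
                       and units[start + length] == target[i + length]):
                    length += 1
                if length > best_len:
                    best_src, best_len = src, length
        if best_len == 0:
            raise ValueError("no recording contains unit %r" % target[i])
        selected.append((best_src, target[i:i + best_len]))
        i += best_len
    return selected

# Fewer entries in the result means fewer joins between sources.
runs = select_units(
    ["n", "ay", "t", "r", "ey", "t"],
    {"recording_a": ["n", "ay", "t"], "recording_b": ["r", "ey", "t"]},
)
```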

A natural language processor applies to the synthesized speech the prosody rules that individuals use to give grammatical meaning to a sentence (Economist, 1999). The aptly named “prosodic processor” divides a sentence into accentual units and then determines the amplitude and frequency (i.e., volume and pitch) of each unit. One possible end result of this process would be a spoken text that ends in the upward inflection that differentiates a question from a statement (Vasilopoulos, 2007). Nevertheless, despite these technological refinements, communication experts like Nass continued to denounce even the best artificial voice systems for their lack of convincing emotions; for want of a personality, computers remained silent.
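As a rough illustration of the prosodic processing just described, the sketch below splits a sentence into crude accentual units, assigns each an amplitude and a pitch, and raises the final pitch for questions. The unit boundaries and numeric values are assumptions made for illustration only.

```python
def prosodic_contour(sentence):
    """Assign amplitude (volume) and frequency (pitch) to each accentual unit."""
    words = sentence.rstrip("?.!").split()      # crude accentual units
    is_question = sentence.strip().endswith("?")
    contour = []
    for i, unit in enumerate(words):
        contour.append({
            "unit": unit,
            "amplitude": 0.8,                   # relative volume
            "pitch_hz": 130.0 - 3.0 * i,        # gentle declination over the sentence
        })
    if is_question and contour:
        contour[-1]["pitch_hz"] += 40.0         # upward inflection marks a question
    return contour

print(prosodic_contour("Is it raining?"))
```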


Emotional Expression and Voice Synthesis
In 1999, a student from the University of Florida named D’Arcy Haskins Truluck created the GALE system in an early attempt to make synthesized voices more personable. Truluck developed a series of prosody rules that described how humans sounded when they were angry, sad, happy, or fearful and then coded these rules into a TTS program. For example, if someone using a synthesized voice wanted to express anger, there would be a marked increase in the “frication” of the speech, meaning that consonants would be heavily stressed and clipped, and the pitch would fall at the end of a sentence to demonstrate assertiveness (Economist, 1999).
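The sketch below captures the flavor of rule tables like the one the Economist piece attributes to GALE: each emotion maps to relative prosodic adjustments that are applied to a neutral speaking specification. The parameter names and numbers are invented for illustration and are not Truluck's actual rules.

```python
# Assumed rule table: relative adjustments per emotion (not GALE's real values).
EMOTION_RULES = {
    "anger":     {"frication": +0.40, "rate": +0.15, "final_pitch": -0.30},
    "sadness":   {"frication": -0.10, "rate": -0.25, "final_pitch": -0.10},
    "happiness": {"frication": +0.05, "rate": +0.10, "final_pitch": +0.20},
    "fear":      {"frication": +0.20, "rate": +0.30, "final_pitch": +0.15},
}

def apply_emotion(neutral, emotion):
    """Return a copy of a neutral prosody spec adjusted for the chosen emotion."""
    rules = EMOTION_RULES[emotion]
    return {
        "consonant_stress": neutral["consonant_stress"] * (1 + rules["frication"]),
        "rate_wpm":         neutral["rate_wpm"] * (1 + rules["rate"]),
        "final_pitch_hz":   neutral["final_pitch_hz"] * (1 + rules["final_pitch"]),
    }

# Anger: heavier, clipped consonants and a falling pitch at the sentence end.
angry = apply_emotion(
    {"consonant_stress": 1.0, "rate_wpm": 160, "final_pitch_hz": 110.0}, "anger")
```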

Taking inspiration from Truluck’s research, CereProc Ltd. provides clients with a set of emotional tags that can be entered alongside text to indicate how it should be intoned; a sentence such as “Get that out of my face,” for instance, can be marked up so that it is delivered angrily. However, the CereProc Ltd. software also uses a combination of pre-recorded voice styles and digital signal processing to simulate a fuller range of emotions, even though the company admits that there is a certain point at which strongly emotional speech can sound artificial and unnatural (2010). Therefore, as of this writing, it is far more difficult to synthesize “homicidal rage” than a vocal tone indicating that one is “cross.” Still, CereProc Ltd. claims that it can simulate “a wide variation in the underlying emotion of our voices,” as most emotional states can be categorized along two spectrums: positive-negative and active-passive.1 An active state requires a faster speech rate, and higher volume and pitch, whereas a passive state is slower and lower. CereProc Ltd. has also coded for emotions that are tied to the content of an exchange, such as surprise or disappointment (2010).
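Following the two spectrums described above (and laid out in Table 1 of Appendix A), the sketch below places a few emotions on positive-negative and active-passive axes and derives rate, volume, and pitch from the activation value alone, since the passage ties those settings to how active or passive a state is. The coordinates and scaling factors are assumptions, not CereProc's.

```python
# Assumed coordinates: (evaluation, activation), each in the range [-1, +1].
EMOTION_SPACE = {
    "angry":   (-0.8, +0.9),
    "happy":   (+0.8, +0.8),
    "sad":     (-0.7, -0.6),
    "relaxed": (+0.5, -0.5),
}

def voice_settings(emotion):
    """Active states speak faster, louder, and higher; passive states the reverse."""
    _evaluation, activation = EMOTION_SPACE[emotion]
    return {
        "rate_wpm": 160 * (1 + 0.25 * activation),
        "volume":   0.70 * (1 + 0.30 * activation),
        "pitch_hz": 120 * (1 + 0.20 * activation),
    }

print(voice_settings("angry"))    # faster, louder, higher than neutral
print(voice_settings("relaxed"))  # slower, quieter, lower
```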

For those individuals like famed movie critic Roger Ebert who rely on TTS as a substitute voice for the one they have lost, the ability to add emotional intonations to their speech is of paramount importance. After the removal of his lower jaw, Ebert wrote last August of his frustration with having to converse with others through the use of TTS during business meetings: “I came across as the village idiot. I sensed confusion, impatience and condescension. I ended up having conversations with myself, just sitting there.” Speaking in public or on TV was also unpleasant, as the critic felt he sounded “like Robby the Robot” (2010).

Nass would say that Ebert’s business associates responded negatively to his computer voice because that is how humans react to a voice that sounds bored or insincere, as most machine voices do (Logan, 2007); we respond in kind to what we hear. Furthermore, Nass found that humans are more likely to respond positively to a voice that demonstrates qualities similar to their own, rather than one that sounds alien and false. In 2007, he and his team discovered that participants were more likely to follow the advice of a computer voice whose gender matched their own, and that if an artificial voice is to be trusted as a salesperson, its “personality” can be more important than what it actually says. For instance, those who identified themselves as extroverts preferred a voice that constantly asked them if they needed help, while the introverts preferred the “salesperson” who offered advice only when asked (Logan, 2007). Therefore, Nass’ research indicates that businesses that rely heavily on computer voices would do well to choose a “digital workforce” with which their customers can identify.


Current Applications and Limitations of Voice Synthesis
Traditionally, consumers have viewed TTS as an assistive technology, used on personal devices owned by the visually impaired, by those who have no voice of their own, or by individuals who need to proofread a document. In recent years, synthesized voices have been employed by GPS devices that recite directions and helpful information to drivers who need to keep their eyes on the road and not on a screen. Many companies like AT&T Natural Voices offer a C-based software development kit for engineers seeking to “humanize” their programs, while CereProc Ltd. promises to quickly build voices that “not only sound real, but have character, making them suitable for any application that requires speech output” (2010). Drawing on Nass’ communication research, CereProc Ltd. asks why companies wouldn’t welcome a chance to “talk to customers in their own accent? Or communicate with younger or older customers in a voice they can identify with” (CereProc, 2010)?

Nathan Muller, author of the Desktop Encyclopedia of Telecommunications, feels that help desks and voice response systems are the most commercially important markets for this type of technology, especially when customers need to access information after business hours (2002). The CereProc Ltd. Website also touts the corporation’s “full voice branding and selection service,” which will profile a business’ target market and then use the research to “cast” and “test” the voice that would be the most appealing to customers (2010). If CereProc Ltd. and its competitors are to be believed, now that digital voices are friendlier, human callers will be more inclined to interact with a machine they feel is actually responsive to their needs. However, before we can feel completely at ease chatting with computers, there are numerous issues that voice synthesis researchers will have to address.

In 2001, Adam Greenhalgh, cofounder and CEO of Speaklink, a software company that creates voice-centered applications, predicted that “we're probably two to five years away from having a synthesized voice that will be entirely undetectable by the human ear” (Rae-Dupree, 2001). However, nine years later, voice synthesis companies have yet to produce such a product. Even CereProc Ltd., the Scottish company that promised to give Roger Ebert his own voice back, can only generate a halting, albeit expressive, replica that the critic says “still needs improvement, but at least sounds like me” (2010). Aylett admits that synthetic speech generated from pre-recorded audio still has a “buzzy” quality that results from the vocoding of speech waveforms (2003). Such tonal distortion can be disastrous in a synthesized voice that needs to speak a language like Mandarin Chinese or Thai, where “the meaning of words with the same sequence of phonemes can be different if they have different tones” (Chomphan, 2009).

Furthermore, New York Times journalist Keith Bradsher writes that “researchers have made slow progress in understanding how language works, how human beings speak, and how to program computers with this understanding” (1991). While we have advanced somewhat in our understanding of how humans process and understand language, there have been few efforts to “teach” computers how to think critically and respond appropriately to human speech.

In 2005, Voxify Inc., a voice recognition software company in California, attempted to correct what it saw as one glaring communication oversight: the lack of “cultural affirmative behavior traits” (ex: unconsciously muttering “uh-huh” in response to a statement) in a computer’s speech database. Chief technology officer Amit Desai says that most TTS systems get confused by such “chatter,” which can sour human-computer interactions. His feeling is that “if voice technology is going to get an expanded role in self service business applications, it has to adapt to what people utter” (Hall, 2005). Finally, before voice synthesis technology can be used to enrich our experiences with our personal computers, it is going to have to win over those individuals who can type faster than they talk, and read faster than they listen.
The Future… and What Needs to Happen Before We Can Get There

According to journalist Janet Rae-Dupree, “unrestricted use of the human voice--both to be understood by the computer and to vocalize the computer's output--has long been the holy grail of computing interfaces” (2001). However, Bill Meisel, a veteran of the speech-recognition market, believes that the main use of speech synthesis technology at the moment, and for the next couple of years, will instead be in specialized fields like medicine (Gomes, 2007).

In two years, the declining cost and increasing speed of microprocessors (as anticipated by Moore’s Law2) will help TTS systems synthesize even smoother sentences (Guernsey, 2001). As voice synthesis technology becomes more widely available and cost-effective, patients suffering from aphasia, ALS, throat cancer, or other conditions that rob them of speech will turn to companies like CereProc Ltd. and ModelTalker for a chance to regain some semblance of their old voice and once more be heard. No longer will high-quality voices require “a good voice talent, a soundproof room, professional audio equipment, and hours of written material with thorough coverage of phoneme combinations” (AT&T, 2010). Instead, interested individuals will be able to create their own vocal databases using a personal computer, a microphone, and a predetermined set of “expressive” phrases that together yield an effective inventory of words and emotions. From this relatively small amount of speech data, a voice synthesis company will be able to create a voice that, although still imperfect, will sound like the subject in question. Finally, the “success” of these computer voices will be publicized by organizations like the NIH’s National Institute on Deafness and Other Communication Disorders, which will continue funding these companies and researching the effectiveness of synthesized speech and how social groups (the elderly, men, women, etc.) react to the still-developing technology.

In terms of mobile technology, there will also be an increased use of voice commands to access information stored on a computer. Currently, both CereProc Ltd.’s and AT&T Natural Voices’ software is licensable for nearly any use. Already it has been incorporated into “FeedMe,” an iPhone application that will read the news and other content to phone owners while they drive or have their eyes engaged elsewhere (Herrman, 2010). Similarly, mobile-phone users will be able to search the Web with their voices and hear the selections spoken back to them in the voice of their choice3 (Gomes, 2007). For instance, on the iPhone, users might choose Brian, the wise-confident-navigator; Jerry, the laid-back-young-entrepreneur-who-still-knows-his-stuff; Kate, the perky-clever-fashionista; or Mary, the-voice-of-reason. Furthermore, as we train our devices to respond to the sound of our voices, we will be able to securely access content stored within the cloud, as it will still be difficult to clone voices that are not our own.

As indicated by Nass’ research, when it comes to artificial voices, being able to match the synthesized voice to the user is a vital part of the technology’s success and subsequent adoption. However, any system capable of this will need to be able to detect and respond appropriately to human moods, so that the voice the computer selects is the closest possible match (Logan, 2007). In five years, the mood-detection software pioneered by Microsoft’s Project Natal for the Xbox will be applied to a variety of personal technologies with which people daily interact, including mobile phones, personal computers, and (of course) video games.

By being able to detect when users are upset, excited, or stressed, companies will be better able to respond to their clients’ needs and stay on their good side. For example, if an individual navigating the Internet attempts to complete a transaction and runs into technical difficulties, their computer can guide them through the site’s troubleshooting process and read them suggestions from the site’s FAQ or a similar discussion occurring on the site’s forum. If nothing works and the computer recognizes that the user is growing angrier, it can first attempt to placate the individual by apologizing for the difficulty and then direct the user to an actual human as a last resort. Although in five years’ time human assistance will still be necessary in extreme cases, the newly empathetic synthesized voice of the computer can still be used to maintain a mutually beneficial relationship between users and online organizations. Moreover, recordings of these sessions will help an organization determine at what point in the help process clients start losing control of their negative feelings, as “there are stress characteristics common to all speakers” (Logan, 2007); if a pattern or trend can be established, businesses will have a better understanding of what needs to be done to solve the problem, and when.

In ten years, voice synthesis will be used to develop speech-to-speech (STS) translation, in which a subject’s speech in one language can be used to produce corresponding speech in another language while continuing to sound like the user’s voice. In the world of entertainment, this technology could render subtitles and clunky voice-overs in international media obsolete (Aylett, 2003). For example, with this technology, Christopher Walken will still sound like Christopher Walken even after his latest film has been translated into Hindi. However, in order to accomplish this, voice synthesis engines would have to make significant advancements, both in their processing abilities and in their understanding of foreign languages and the cultural nuances of the words themselves. For instance, computer-speech expert Ruediger Hoffmann stresses that a “decisive factor in creating authentic voices is completeness in the resources and databases used… including vocabulary and grammar” (2004). Yet, as his colleague Lobanov points out, although there are roughly 2,736 Russian vowel allophones, voice synthesizers currently have the power to process fewer than 1,500 of them (2004). Hence, it is extremely difficult to create an authentic-sounding Russian voice because the vocabulary of current databases is limited by an incomplete allophone collection, which in turn makes it difficult to develop natural-sounding English-to-Russian STS translations.

Therefore, based on its previous record of advancement, perhaps a more feasible prediction for voice synthesis in ten years is that companies like CereProc Ltd. will perfect their voice-cloning technology and produce computer speech that is practically “as good as the real thing.” In fact, Muller already foresees a future in which celebrities’ contracts will have to include voice-licensing clauses. Writing in the New York Times, economist Robert Frank argues that “voice cloning is just one of many technologies that expand the market reach of the economy's most able performers [and] creates a winner-take-all market — one in which even small differences in performance give rise to large differences in economic reward” (2001). For example, if Sephora wanted to license Adam Lambert’s voice for a promo about its new line of metallic eye-shadow, it could obtain his permission to feed his vocals into a TTS engine and churn out the audio for the commercial. All this would be done at a fraction of the cost of flying Lambert into town to lay down a 30-second track in the recording studio. However, Frank warns, although “cloning frees up resources […] the downside is that the monetary value of these gains is distributed so unequally” (2001).



Moreover, if voice synthesis technology does become more widespread and voice-cloning is made simpler and more convincing, synthesized voices may be used to perpetrate fraud. Similar to the e-mail and social media scams of today, criminals could obtain sensitive information by tricking people into thinking they were getting phone calls from someone they know (Guernsey, 2001). For instance, if the law did not protect against the misappropriation of an individual’s voice, companies like VoxChange could obtain recordings of a person’s voice, synthesize it using software similar to CereVoice or ModelTalker, and then use it for whatever purpose they like. After all, “the best way to check up woman’s fidelity or to prove man’s infidelity is to talk to them with a voice of common acquaintance or relative whom they let into secrets4” (VoxChange, 2010).
Conclusion
In his original research, Aylett wrote that humans regard vocal mimicry by computers with both awe and suspicion, in part because “perfect vocal mimicry is also the mimicry of our own sense of individuality” (2003). Hence, although we may talk with, scream at, or supplicate our machines, the idea that in the future they may answer us with their own voice both fascinates and frightens us. For the time being, we seem quite content to use TTS systems to give a voice to those who literally have none. However, if humans do grow desirous of a meaningful relationship with the technology that already seems like such an integral part of their lives, perhaps both they and their machines will have to learn how to listen to what the other has to say.

Appendix A: http://www.cereproc.com/images/eval_act_space2.jpg
Table 1 shows how various emotions can be arranged in the evaluation/activation space continuum. The '+' sign means a more extreme value. The (+Content) means that the emotion will be simulated if appropriate content is used.
Table 1:

Active Negative:
  ++ Angry
  ++ Frightened/Scared/Panicked
  + Tense/Frustrated/Stressed/Anxious
  Authoritative/Proud (+Content)

Active Positive:
  ++ Happy
  + Upbeat/Surprised (+Content)/Interested (+Content)

Passive Negative:
  ++ Sad
  Disappointed (+Content)/Bored

Passive Positive:
  + Relaxed
  Concerned/Caring



Bibliography
Al-Said, G., & Abdallah, M. (2009). An Arabic text-to-speech system based on artificial neural networks. Journal of Computer Science, 5(3), 207.
AT&T Labs, Inc. Research. (2010). Text-To-Speech (TTS) -- Frequently Asked Questions. Retrieved 2/20/2010, from http://www2.research.att.com/~ttsweb/tts/faq.php#TechWhat.
Aylett, M., & Yamagishi, J. (2003). Combining Statistical Parametric Speech Synthesis and Unit-Selection for Automatic Voice Cloning. Centre for Speech Technology Research, University of Edinburgh, U.K. Retrieved 2/16/2010, from http://www.cstr.ed.ac.uk/downloads/publications/2008/03_AYLETT.pdf.
Bradsher, K. (1991) Computers, Having Learned to Talk, Are Becoming More Eloquent. New York Times (D6). Retrieved 2/16/10, from ProQuest Historical Newspapers.
CereProc. (2010). CereProc research and development. CereProc. Retrieved 2/20/2010, from http://www.cereproc.com/about/randd.
Chomphan, S. (2009). Towards the development of speaker-dependent and speaker-independent hidden markov model-based Thai speech synthesis. Journal of Computer Science, 5(12), 905. Retrieved 2/20/2010.
Dutton, G. (1991). Breaking communications barriers. Compute!, 13(9), 28. Retrieved 2/20/2010, from Academic Search Elite database.
Ebert, R. (2010). Finding my own voice - Roger Ebert's journal . Retrieved 2/20/2010, from http://blogs.suntimes.com/ebert/2009/08/finding_my_own_voice.html.
Economist. (1999). Once more, with feeling. The Economist 350 (8108), 78. Retrieved 2/20/2010, from http://search.ebscohost.com.ezproxy.lib.calpoly.edu:2048/login.aspx?direct=true&db=afh&AN=1584087&site=ehost-live.
Frank, R. (2001). The Downside of Hearing Whoopi at the Mall. New York Times. Retrieved 2/16/2010, from http://www.robert-h-frank.com/PDFs/NYT.8.7.01.pdf.
Gomes, L.  (2007). After Years of Effort, Voice Recognition Is Starting to Work. Wall Street Journal (Eastern Edition),  p. B.1.  Retrieved 2/21/2010, from ABI/INFORM Global.
Guernsey, L. (2001). Voice Cloning- Software Recreates Voices of Living and Dead. New York Times. Retrieved 2/16/2010, from http://www.rense.com/general12/ld.htm.
Hall, M. (2005). Speech-recognition apps behave…. Computerworld, 39(48), 6. Retrieved 2/21/2010, from http://offcampus.lib.washington.edu/login?url=http://search.ebscohost.com.offcampus.lib.washington.edu/login.aspx?direct=true&db=a9h&AN=19004476&site=ehost-live.
Herrman, J. (2010). How Ebert Will Get His Voice Back. Gizmodo. Retrieved 2/20/2010, from http://gizmodo.com/5474950/how-roger-ebert-will-get-his-voice-back?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+gizmodo/full+(Gizmodo).
Hoffmann, R., Shpilewsky, E., Lobanov, B., & Ronzhin, A. (2004). Development of Multi-Voice and Multi-Language Text to Speech (TTS) and Speech to Text (STT) conversion system. Retrieved 2/16/2010.
Lobanov, B., & Tsirulnik, L. I. (2004). Phonetic-Acoustical Problems of Personal Voice Cloning by TTS. United Institute of Informatics Problems, National Academy of Sciences of Belarus. Retrieved 2/16/2010.
Logan, T. (2007). A little more conversation; Ever enjoyed talking to a machine? One day you might. New Scientist, 34-37. Retrieved 2/20/2010.
Model Talker. (2010). ModelTalker Speech Synthesis System. Retrieved 2/20/2010, from http://www.modeltalker.com/.
Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics Magazine, pp. 4. Retrieved 2/2/2010, from ftp://download.intel.com/museum/Moores_Law/Articles-Press_Releases/Gordon_Moore_1965_Article.pdf.
Muller, N. (2002). Desktop encyclopedia of telecommunications. McGraw-Hill telecommunications. McGraw-Hill Professional, (3), 1134.
Rae-Dupree, J. (2001). A bit of drawl, and a byte of baritone. U.S. News & World Report; 131 (60), 44.
Tan, T. & Sh-Hussain. (2008). Implementation of phonetic context variable length unit selection module for Malay text to speech. Journal of Computer Science, 4(7), 550.
Vasilopoulos, I., Prayati, A. S., & Athanasopoulos, A. V. (2007). Implementation and evaluation of a Greek Text To Speech System based on a Harmonic plus Noise Model. IEEE Transactions on Consumer Electronics, 53(2). Retrieved 2/16/2010.

VoxChange. (2010). “100% imitation of another person's voice.” VOXCHANGE.COM. Retrieved 2/20/2010, from http://www.voxchange.com/voting/imitation-of-another-persons-voice.




1 Anger, for example, would be described as an active-negative emotion. See Appendix A for a graphic representation and Table 1 for more information.

2 “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer”

3 Select voices will be available on select brands/models/carriers, based on market research that indicates which voices (read: personalities) would be most popular amongst a target audience

4 Direct, unaltered quote from the VoxChange Website

