Voice vs. Finger: A Comparative Study of the Use of Speech or Telephone Keypad for Navigation

Jennifer Lai

IBM Corporation/ T.J. Watson Research Center

30 Saw Mill River Road

Hawthorne, NY 10532, USA

+1 914-784-6515

lai@watson.ibm.com



Kwan Min Lee

Department of Communication

Stanford University

Stanford, CA 94305-2050, USA

+1 650-497-7357

kmlee@stanford.edu




ABSTRACT


In this paper, we describe the empirical findings from a user study (N=16) that compares the use of touch-tone and speech modalities to navigate within a telephone-based message retrieval system. Unlike previous studies comparing these two modalities, the speech system used was a working natural language system. Contrary to findings in earlier studies, results indicate that in spite of occasionally low accuracy rates, a majority of users preferred interacting with the system by speech. The interaction with the speech modality was rated as more satisfying, more entertaining, and more natural than the interaction with the touch-tone modality.

Keywords


Speech User Interfaces, Voice User Interfaces (VUIs), Natural Language, Keypad Input, DTMF, Navigation

INTRODUCTION


As Voice User Interfaces (VUIs) become more common, users of these systems have sometimes experienced the discomfort of needing to interact with a computer by voice, over a phone line, in a public setting: “NO, play the NEXT message …… NEEXXXT MESSSSAGE”. This can cause observers to stare, some with horror, some with sympathy, wondering either how anyone could be so uncivil to a co-worker, or how this unfortunate, bedraggled businessman could have been saddled with such an incompetent assistant.

VUIs, as the name implies, rely on the use of one’s voice, usually both for data input (e.g., in response to the question “what city please”) and for navigation within the application (e.g., “open my calendar please”). [For a discussion of voice user interfaces see 1, 15.] As designers and users of VUIs, we too have occasionally found ourselves in similar situations, wishing for silent and 100% accurate interaction.

While embarrassment is the humorous side of using speech-based interfaces in public places, there are more serious reasons for needing a silent mode of interacting with a speech system, such as a need for privacy (e.g., in an airport) or consideration for other people (e.g., in a meeting).

Additionally, subscribers who use a cell phone to call a speech-based system are subject to the vagaries of varying levels of cell coverage. When the signal is weak, speech recognition accuracy degrades substantially, making the system unusable with speech.

Prior to the more generalized availability of speech recognition software and development toolkits, most telephone-based interfaces relied on touch-tone input using the telephone keypad, or Dual-Tone Multi-Frequency (DTMF). Voice processing systems, which first started to appear in the 1980s, used DTMF for input and recorded human speech as output [2]. The fixed set of twelve keys (ten digits as well as the # and * keys) on the keypad lent itself to the construction of applications that present the caller with lists of options (e.g., “press one for Sales”), commonly referred to as menus. Since that time, menu-driven Interactive Voice Response (IVR) applications have become a pervasive phenomenon in the United States [9].

As an input mechanism, DTMF has the advantage of being both instantaneous and 100% accurate. This is in contrast to speech recognition, which is neither. Processing of the speech can cause delays for the caller, and variability of the spoken input will prevent speech from being 100% accurate in the foreseeable future. However, given the limitation of using only twelve keys for input, many users of IVR systems report feelings of “voice mail jail” [10], which often causes them to hang up in frustration.

Given the tradeoff between unconstrained and natural input, which could be erroneously interpreted, and highly constrained but accurate input, we wondered which modality users of a telephone-based message retrieval system would prefer. This paper reports on an experiment that compared using either a DTMF key or natural speech to request a function within a telephone-based application.


Prior Work


While we were not able to find any studies that compared the use of a natural language system with touch-tone input, there are several studies that compare simple speech input systems with DTMF [5, 8, 11].

Delogu et al. [5] found no difference in task completion time or number of turns per task when comparing DTMF input for an IVR system with three different types of speech input. The three types they experimented with were simple digit recognition, simple command recognition, and a natural command recognition system that recognized relatively complete sentences. The digit system recognized only the single digits 1 through 9, while the simple command system recognized a short list of words such as “next,” “skip,” “yes,” or “no.” However, the analysis of user attitudes toward the different systems indicated that users preferred the natural command recognition system to the DTMF-based system for input. The DTMF input was preferred to both the simple digit and the simple command recognition systems. It should be noted that the voice systems used by Delogu et al. did not involve any true speech recognition technology; they used a Wizard of Oz method to compensate for technological barriers. In addition, there was no statistical analysis of the attitudinal survey data reported in their paper. Instead, they provided only the proportions for three questions (1. Which of the three prototypes will be better accepted in the marketplace? 2. Which is the most enjoyable system? 3. Which system would you prefer to use in a real service?).

Similar to the above study, Foster et al. [8] also found that users preferred connected word (CW)¹ speech input and DTMF input to the isolated word (IW)² speech input modality. In addition, they reported an interaction effect between the cognitive abilities (spatial and verbal) of users and their attitudes toward the different modalities tested. Users with high cognitive abilities significantly preferred DTMF over CW and IW input. While the interaction between users’ spatial ability and their preference for DTMF can easily be explained by the positive effect of high spatial ability on the mental mapping of DTMF options, the positive effect of verbal skills on the DTMF preference is harder to explain.

Finally, Goldstein et al. [11] reported no difference in task completion times between a DTMF-based hierarchical-structure navigation system and a flexible-structure voice navigation system. With the hierarchical system, the users needed to first select a general option and then proceed to more specific choices. In the flexible-structure system, the users could move directly from one function to another; such systems have a larger vocabulary and are more error prone. Unlike Foster et al.’s [8] finding of an interaction effect between spatial ability and attitudes toward the different modalities, they found no difference in subjective measures. However, they did find a significant interaction effect with regard to task completion times. Users with high spatial ability finished tasks more quickly when they used the flexible-structure voice navigation system. In contrast, low spatial ability users finished tasks more quickly when they used the DTMF-based hierarchical-structure navigation system. This study also relied on a Wizard of Oz methodology, in which participants believe they are interacting with a machine while in reality a human controls the interaction.

In a study reported by Karis [12], thirty-two subjects performed call management tasks twice, once using speech and once using touch-tone. The tasks involved interacting with a “single telephone number” service that included a variety of call management features. The majority of subjects (58.1%) preferred using DTMF rather than speech, and tasks were completed faster using touch-tone, although both the type of task and whether the subject had a quick reference guide influenced task completion times. In this study there appeared to be only a loose relationship between the accuracy of the speech recognition, ratings of acceptability, and preference choices. Some subjects who experienced low system recognition performance still said they preferred to interact via speech rather than touch-tone.
Although not directly related to the comparison between a speech-based system and a DTMF one, a market report by Nuance [13] shows users’ general attitudes toward their current voice navigation systems, which have a much larger vocabulary than the ones tested above. Even though Nuance systems do not use true natural language, the report nevertheless provides useful data for predicting how people will evaluate large-vocabulary speech systems when compared to other methods such as speaking with a human operator or using touch-tone. In the report, 80% of users said they were either as satisfied or more satisfied using a speech system than they had been using a touch-tone system. Interestingly, only 68% of users agreed either strongly or somewhat strongly with the sentence comparing the two modalities: “I like speaking my responses better than pushing buttons.” In addition, 64% of the queried users responded that their utterances were understood either very well or somewhat well by the system.

In summary, the referenced studies showed that DTMF systems were preferred to isolated-word or simple digit recognition-based systems. When compared to connected-word systems, DTMF systems were rated either similarly or better, especially by users with high spatial abilities. Low spatial ability users were able to finish tasks more quickly with a DTMF-based hierarchical system than with a speech-based flexible system, whereas high spatial ability users finished tasks more quickly with speech than with DTMF.

These results are surprisingly contrary to the widespread belief that most users would prefer a more natural form of interaction (such as speech) to a more artificial one (such as the keypad). We believe that one of the primary explanations for these results is the constraints of the speech systems tested. Fay [7] and Karis [12] also mention that users may actually prefer touch-tone based systems to voice command systems, due at least in part to the limitations of speech recognition technology.



Without digressing too far into the differences among the various underlying speech technologies that can be used in speech-based systems (for an authoritative description of these see Schmandt [15]), speech systems can be either command based (single-word input), grammar based (recognizing specifically predefined sentences), or natural language (NL) based. Natural language recognition refers to systems that act on unconstrained speech [3]. In this case, the user does not need to know a set of predefined phrases to interact with the system successfully. NL systems are often trained on large amounts of statistically analyzed data and, once trained, can understand a phrasing of a request that they have never seen before. This important move from speech “recognition” (the simple matching of an acoustic signal to a word) to speech “understanding” (extracting the meaning behind the words), often described as the holy grail of speech recognition, is working in limited domains today [4]. The Mobile Assistant is one such example of a true NL system.

The Mobile Assistant


This experiment was conducted within the framework of a working application called the Mobile Assistant (MA). It is a system that gives users ubiquitous access to unified messages (email, voice mail and faxes) and calendar information from a telephone. Calendar and messages can be accessed in one of three modalities:

  1. From a desktop computer, using a combination of audio and visual presentation. This is similar to the standard configuration today, except that voicemail is received as an audio attachment and can be listened to from the inbox. For internal calls, the identity of the caller is listed in the header information of the message. Email messages can be created using the MA system and are also received as audio attachments.

  2. From a SmartPhone (a cell phone with a multi-line display and a web browser) in a silent, visual mode. In this case, users connect over the network and can read their email and calendar entries on the phone’s display. Notifications of the arrival of urgent email messages and voicemail are usually sent to this phone; however, the user can tailor which notifications they receive and which email-addressable device the notifications are sent to. Voicemail messages cannot be accessed in the silent visual mode since we do not transcribe them; they must instead be accessed by calling in to the system and listening to them.

  3. From any phone using speech technologies (both recognition and synthesis) in an auditory mode. In this situation the users speak their requests for information which are interpreted by the mobile assistant. Examples of requests are: “do I have any messages from John?”, “what’s on my calendar next Friday at 10:00 a.m.?”, or “play my voicemail messages”. The MA replies to their queries and reads them the requested messages or calendar entries.

The focus of the research has been on supporting the pressing communication needs of mobile workers and overcoming technological hurdles such as high-accuracy speech recognition in noisy environments, natural language understanding, and optimal message presentation on a variety of devices and modalities. This system is currently being used by over 150 users at IBM Research to access their business data.

The component of the system that was targeted by this study is the third component, which allows users to access messages in an auditory mode using speech technologies.


Methodology

Experimental Design


We employed a within-subject design in order to maximize the power of the statistical tests. All participants experienced both the speech-only and the DTMF-only conditions. They were asked to complete identical tasks with both input modalities. We created two test email accounts for the experiment and populated the inboxes with messages. The order of modality and test account used was counterbalanced to eliminate any possible order effect. Each account had fifteen email messages. To eliminate any possible bias due to particularities of an account, we systematically rotated the assignment of accounts to the tested modalities (see Table 1 for the order and assignments). Unlike most of the earlier studies mentioned, we did not use a Wizard of Oz methodology but instead used a real working speech system, which employs both recognition algorithms and natural language understanding models to interpret the user’s requests.

Participant | First condition | Second condition
User 1 | Speech / Account 1 | DTMF / Account 2
User 2 | DTMF / Account 1 | Speech / Account 2
User 3 | Speech / Account 2 | DTMF / Account 1
User 4 | DTMF / Account 2 | Speech / Account 1

Table 1: Order of modalities and accounts used for the experiment
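As an illustration of this counterbalancing, the short sketch below (purely hypothetical; the study does not describe any assignment script) cycles the four order/account combinations from Table 1 across the sixteen participants.

```python
from itertools import cycle

# The four counterbalanced combinations from Table 1:
# (first condition, second condition), each condition being (modality, account).
combinations = [
    (("Speech", "Account 1"), ("DTMF", "Account 2")),
    (("DTMF", "Account 1"), ("Speech", "Account 2")),
    (("Speech", "Account 2"), ("DTMF", "Account 1")),
    (("DTMF", "Account 2"), ("Speech", "Account 1")),
]

# Rotate the pattern across all 16 participants.
for i, (first, second) in zip(range(1, 17), cycle(combinations)):
    print(f"User {i}: {first[0]} / {first[1]}, then {second[0]} / {second[1]}")
```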

Participants


A total of 16 participants (eight females and eight males) were recruited from the IBM research center in Hawthorne, New York. We recruited participants with a wide range of ages (from late teens to over sixty) because our target user population also varies greatly in age. All participants except one were naïve users with regard to speech systems. None of the participants had any experience using the Mobile Assistant. While participants volunteered their time for the sake of science, they were given a parting gift of either a hat or a pen as a token of our appreciation. All participants were videotaped with informed consent and debriefed at the end of the experiment session.

Apparatus

DTMF modality


In this condition the participants were instructed that their only mode of interaction with the system was the telephone keypad. Thus to log on, when prompted by the system, they used the keypad to enter the telephone number for the account and then entered the six-digit password. Each of the 12 telephone keypad buttons had a function (see Table 2 for the list of keys and their corresponding functions). The function mapping was given to the participants on the same page as the task list, and they were told that they could refer to it as often as they wanted. They were also given a sample interaction on the same page: “To listen to the second email message in your inbox, press 5, and after listening to the message press the # key to hear the next message.”

DTMF Key | Function
* (star key) | Interrupt system output
# (pound key) | Next message / Next day
0 | Cancel current request
1 | Yes
2 | No
3 | Play phonemail
4 | Play today’s calendar
5 | Play first email message
6 | Delete a message
7 | Repeat
8 | Reply to a message
9 | Forward a message

Table 2: List of DTMF keys used in the experiment and their corresponding functions

In both conditions the system output was spoken, using synthesized speech. The participants could interrupt the spoken output at any time by using the star (*) key. The pound (#) key is mapped to the “Next” function and is context-dependent. If the user has just listened to today’s calendar (invoked with the 4 key) the pound key will play the calendar entries for the following day. On the other hand, if the user has just heard an email message, the pound key will cause the next message to be played.

Since a real speech application was used for this experiment, we had to deal with the fact that the user is presented with confirmation dialogs at different points in the interaction. For example, when the user asks to delete a message, the system confirms the operation first: “are you sure you want to delete the message from Jane Doe?” Thus the 1 and 2 keys were used to handle replies to the confirmation dialogs.
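To make the keypad condition concrete, the following is a minimal, hypothetical sketch of such a key dispatch table, including the context-dependent pound key and the yes/no confirmation handling described above; it is not the Mobile Assistant’s actual implementation.

```python
class DtmfSession:
    """Hypothetical sketch of the keypad mapping used in the DTMF condition."""

    def __init__(self):
        self.context = "email"            # "email" or "calendar"; set by the last play command
        self.pending_confirmation = None  # e.g. a delete request awaiting yes/no

    def handle_key(self, key: str) -> str:
        # Keys 1 (yes) and 2 (no) answer an outstanding confirmation dialog.
        if self.pending_confirmation and key in ("1", "2"):
            action = self.pending_confirmation
            self.pending_confirmation = None
            return f"{action} confirmed" if key == "1" else f"{action} cancelled"

        if key == "*":
            return "interrupt current output"
        if key == "#":
            # Context-dependent "next": next day after the calendar, otherwise next message.
            return "play next day" if self.context == "calendar" else "play next message"
        if key == "0":
            return "cancel current request"
        if key == "3":
            self.context = "email"
            return "play voicemail"
        if key == "4":
            self.context = "calendar"
            return "play today's calendar"
        if key == "5":
            self.context = "email"
            return "play first email message"
        if key == "6":
            self.pending_confirmation = "delete"
            return "are you sure you want to delete this message?"
        if key == "7":
            return "repeat"
        if key == "8":
            return "reply to the current message"
        if key == "9":
            return "forward the current message"
        return "unrecognized key"
```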

NL-based speech modality


In the speech condition, the participants were instructed that their only mode of interaction with the system was speech. Thus to log on, when prompted by the system, they spoke the name of the test account and then spoke the six-digit password.

Because the system accepts natural language input, it was not necessary (nor would it be feasible) to define for the participants everything that they could say to the system. However, in an effort to balance both conditions, the participants were given sample phrases to give them an idea of the types of things that can be said. The sample phrases included on the task description were:



  • How many messages did I receive yesterday?

  • Do I have any messages from Peggy Jones?

  • What’s on my calendar next Wednesday?

Procedure

The participants completed the study one at a time in a usability lab. They were told that the purpose of the study was to examine certain usability issues related to the use of a phone-based universal messaging and calendar system called the Mobile Assistant. Upon arrival in the lab, the participants were seated and given a booklet with the instructions on the first page. The instruction page described the purpose of the study and listed the required tasks. It also gave the phone number to call to reach the system, the name of the test account to use, and the password. In the DTMF condition, the instruction page also contained the function mapping and the sample interaction. In the speech-only condition, this page contained the sample phrases that could be spoken to the system.

The participants were instructed that their task was to interact with the Mobile Assistant over the telephone to manage several email and calendar tasks. The experimenter assured the participants that all the data collected would be confidential. The experimenter showed the task list to the participant and walked through the sample interaction to make sure the participant understood what was involved. The participant was then shown the questionnaire to be completed after the first condition. The experimenter explained that they would afterwards be asked to complete the same tasks in the alternate condition and showed them the instruction page and questionnaire for the second condition.

After the experimenter left the room, the participants dialed the number for the Mobile Assistant. They used the telephone on the table in front of them. The speech output from the system played through a set of speakers as well as the handset so that the system’s output could be captured in the videotapes.



Task Description


Participants were asked to complete the following tasks in both conditions. The only difference from one condition to the other was the inbox that they were logging in to (and thus the messages contained in it) and the method of interaction used.

Task 1: Log on.

Task 2: Listen to the fourth message in the mailbox.

Task 3: Find out if they received a message from a particular person. If so, listen to it.

Task 4: Reply to the above message.

Task 5: Delete the message.

Task 6: Find out what time they are meeting with David Jones on Friday.

Both test inboxes were balanced for number of messages, type of message, and total number of words across all messages. Also, each message was approximately the same length as the message in the same position in the other test inbox. Thus the first message in test inbox 1 had about the same number of words as the first message in test inbox 2, as did the second message, the third message, and so on.

In the third task, the participants were asked to determine if they had received a message from a particular person (one of the experimenters and authors of this paper). In both inboxes, this message was located toward the middle of the list; it was the ninth message down. The two messages were comparable, as shown below.

Ninth message in inbox 1:

Hello,

I would like to have your input for the workshop proposal that we discussed at lunch the other day. While the deadline is not until sometime in September, it also coincides with the papers deadline so it would be great if we could get this work out of the way. We really only need about 2 pages for the write up. Please let me know what time would be convenient to meet next week. How about a working lunch?

Many thanks,

Jennifer

Ninth message in inbox 2:

Hi Jacob,

I was given the job of compiling the highlights reports for this month and would very much appreciate it if you could send me your input by the end of the day today. Since you have been out of the office on vacation for three out of the four weeks this month, I would totally understand if you did not have much to contribute. Either way, please send me your input as soon as possible.

Many thanks in advance,

Jennifer

We tried to define the tasks in such a way that half of them would be well suited to a sequential traversal of information, and thus favor the DTMF condition, while the remaining half would be better suited to random access of information, and thus favor the speech condition. For example, we believed that the sixth task, which asked participants to find out what time they were meeting with David on Friday, favored random access of data. With speech, the user could simply ask the system “what meetings do I have on Friday” and listen until the meeting with David was mentioned. With DTMF, the user had to first play today’s calendar, interrupt the listing with the star key (or listen to the entire day), and then press the pound key to get the next day’s listing. We expected that Task 1 (logging on), Task 4 (reply), and Task 5 (delete) would favor the DTMF condition, whereas Task 2 (listen to the fourth message), Task 3 (find a message from a particular person), and Task 6 (find the meeting time with David) would favor the speech condition.


Measures


According to ETSI (the European Telecommunications Standards Institute), usability in telephony applications is defined as the level of effectiveness, efficiency, and satisfaction with which a specific user achieves specific goals in a particular environment [6]. Effectiveness is defined here as how well a goal is achieved in a sense of absolute quality; efficiency as the amount of resources and effort used to achieve the specific goal; and satisfaction as the degree to which users are satisfied with a specific system.

In this study, we examined all three elements of usability. We measured the effectiveness of a system by calculating a success rate for each user task. The amount of time to finish each task was used as a proxy measure for efficiency. User satisfaction was evaluated through a series of survey questions asked immediately following the use of each modality.

In order to measure user satisfaction, we administered two types of questions. The first set asked users to evaluate the interaction with the system they had just used. The second set asked users to evaluate the system itself, regardless of their evaluation of the interaction. We speculated that the evaluation of the system and the evaluation of the interaction could differ. That is, it would be possible for a user to evaluate the state-of-the-art NL-based speech system very positively due to the novelty effect (the cool factor). This same user, however, might evaluate his speech-based interaction with the system negatively, because he had difficulty accomplishing the required tasks.

Immediately after completing all tasks, participants were asked to evaluate their interaction with the system by indicating how well certain adjectives described the interaction, on a scale of 1 to 10 (1 = “Describes Very Poorly”, 10 = “Describes Very Well”). Four adjectives (comfortable, exhausting [reverse coded], frustrating [reverse coded], and satisfying) were used to create an index we called “interaction satisfaction.” Five adjectives (boring [reverse coded], cool, entertaining, fun, and interesting) were used to create an “interaction entertainment” index. Finally, another four adjectives (artificial [reverse coded], natural, repetitive [reverse coded], and strained [reverse coded]) were used to form an “interaction naturalness” index.
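As a concrete example of how these indices were formed (our reading of the procedure, not code from the study), negatively worded items are reverse coded on the 1-to-10 scale and the items are then averaged:

```python
def reverse_code(score: int, scale_min: int = 1, scale_max: int = 10) -> int:
    # On a 1-10 scale, a rating of 1 becomes 10, 2 becomes 9, and so on.
    return scale_max + scale_min - score

def interaction_satisfaction(ratings: dict) -> float:
    # "exhausting" and "frustrating" are reverse coded before averaging.
    items = [
        ratings["comfortable"],
        reverse_code(ratings["exhausting"]),
        reverse_code(ratings["frustrating"]),
        ratings["satisfying"],
    ]
    return sum(items) / len(items)

# One hypothetical participant's ratings of the speech interaction.
print(interaction_satisfaction(
    {"comfortable": 7, "exhausting": 4, "frustrating": 6, "satisfying": 5}))  # 6.0
```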

After evaluating the interaction, participants then evaluated their general impression of the system in the same way as above. Three indices were created with regard to the evaluation of the system:


  1. system entertainment: consisting of boring [reverse coded], cool, entertaining, and fun;

  2. system satisfaction: consisting of comfortable, frustrating [reverse coded], satisfying, and reliable;

  3. system easiness: consisting of easy, complicated [reverse coded], confusing [reverse coded], intuitive, and user-friendly.

The reliability for each index is shown in Table 3.
Index | Cronbach’s alpha (Speech) | Cronbach’s alpha (DTMF)
Interaction satisfaction | .77 | .81
Interaction entertainment | .84 | .79
Interaction naturalness | .84 | .77
System satisfaction | .82 | .83
System entertainment | .92 | .75
System easiness | .87 | .68

Table 3. Reliability of each index
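For reference, Cronbach’s alpha for such an index can be computed from the per-participant item ratings with the standard formula α = (k/(k−1))·(1 − Σ item variances / variance of the totals); the sketch below uses made-up ratings, not the study’s data.

```python
def cronbach_alpha(scores):
    """scores: one row per participant, each row a list of item ratings for one index."""
    n_items = len(scores[0])

    def sample_variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_variances = [sample_variance([row[i] for row in scores]) for i in range(n_items)]
    total_variance = sample_variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical ratings: four participants, four items, on the 1-10 scale.
print(round(cronbach_alpha([[7, 6, 7, 8],
                            [3, 4, 2, 3],
                            [8, 7, 9, 8],
                            [5, 5, 4, 6]]), 2))
```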

Lastly, after completing both conditions and their associated questionnaires, users were asked to write the answers to the following questions:



  • Which of the two navigation methods did you prefer?

  • Why?

  • What would it take (either changes to the system or circumstances of use) to get you to use the navigation method that you least preferred?


RESULTS

Effectiveness of the system


Table 4 shows the success rate for each task with each of the two input modalities.

Task | Speech | DTMF
Task 1: Log on | 69% | 100%
Task 2: Listen to the fourth message in the mailbox | 56% | 75%
Task 3: Find out if they received a message from a particular person; if so, listen to it | 56% | 63%
Task 4: Reply to the above message | 44% | 56%
Task 5: Delete the message | 56% | 50%
Task 6: Find out what time they are meeting with David Jones on Friday | 69% | 38%

Table 4. Success rate for each task with each modality

Efficiency of the system


Table 5 shows the average number of seconds to complete each task with both modalities. For each task, only the attempts in which the user succeeded were used to calculate the average.

Task | Speech | DTMF
Task 1: Log on | 42 sec. | 22 sec.
Task 2: Listen to the fourth message in the mailbox | 56 sec. | 47 sec.
Task 3: Find out if they received a message from a particular person; if so, listen to it | 19 sec. | 46 sec.
Task 4: Reply to the above message | 10 sec. | 6 sec.
Task 5: Delete the message | 20 sec. | 9 sec.
Task 6: Find out what time they are meeting with David Jones on Friday | 44 sec. | 102 sec.

Table 5. Average times (in seconds) for task completion

User satisfaction


As discussed above, we used six indices to measure user satisfaction in greater detail. For the statistical analyses, we used a repeated-measures ANOVA with modality as the repeated factor. There was no between-subjects factor.
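A minimal sketch of such an analysis, assuming the per-participant index scores are in a long-format table; it uses the AnovaRM class from statsmodels on made-up scores and is not the actual analysis script used in the study.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one row per participant per modality,
# with the (already reverse-coded and averaged) index score.
data = pd.DataFrame({
    "participant": list(range(1, 17)) * 2,
    "modality": ["speech"] * 16 + ["dtmf"] * 16,
    "interaction_satisfaction": [
        # 16 speech scores followed by 16 DTMF scores (made up for illustration)
        5.2, 6.1, 4.8, 7.0, 3.9, 5.5, 6.3, 4.4, 5.0, 6.8, 4.1, 5.7, 6.0, 3.5, 5.9, 4.6,
        3.8, 4.2, 3.1, 5.0, 2.9, 4.4, 4.8, 3.0, 3.6, 5.1, 2.7, 4.0, 4.5, 2.5, 4.3, 3.2,
    ],
})

# One-way repeated-measures ANOVA with modality as the within-subject factor.
result = AnovaRM(data, depvar="interaction_satisfaction",
                 subject="participant", within=["modality"]).fit()
print(result)  # reports F(1, 15) and the p-value for the modality effect
```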

Interaction satisfaction


Participants evaluated their interaction via the speech modality (M = 5.16, S.D. = 1.99) as more satisfying than their interaction via the DTMF modality (M = 3.84, S.D. = 1.81), F(1, 15) = 4.35, p < .055, η² = .23.

Interaction entertainment


Participants evaluated their interaction via the speech modality (M = 5.79, S.D. = 1.83) as more entertaining than their interaction via the DTMF modality (M = 3.69, S.D. = 1.47), F(1, 15) = 12.77, p < .01, η² = .46.

Interaction naturalness


Participants evaluated their interaction via the speech modality (M = 4.89, S.D. = 2.36) as more natural than their interaction via the DTMF modality (M = 3.65, S.D. = 1.63), F(1, 15) = 4.12, p < .06, η² = .22.
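For a design with a single within-subject factor, these effect sizes can be recovered from the F statistic and its degrees of freedom as η² = F·df1 / (F·df1 + df2); a quick check against the interaction indices reported above (the values agree to within rounding):

```python
def eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    # Effect size recovered from an F statistic and its degrees of freedom.
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Interaction indices reported above, all with df = (1, 15).
print(round(eta_squared(4.35, 1, 15), 2))   # interaction satisfaction, ~.22
print(round(eta_squared(12.77, 1, 15), 2))  # interaction entertainment, ~.46
print(round(eta_squared(4.12, 1, 15), 2))   # interaction naturalness, ~.22
```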

Figure 1 summarizes the mean values for the three indices used to evaluate the interaction.



Figure 1. Mean values for user evaluation of the interaction

System satisfaction


The speech modality (M = 4.95, S.D. = 1.97) and the DTMF modality (M = 4.41, S.D. = 1.67) did not differ significantly with regard to the level of satisfaction with the system, F(1, 15) = .97, p < .34, η² = .061.

System entertainment


Participants evaluated the entertainment value of the system more positively with the speech modality (M = 6.06, S.D. = 2.03) than with the DTMF modality (M = 3.54, S.D. = 1.39), F(1, 15) = 17.27, p < .001, η² = .54.

System easiness


Participants evaluated the system as being easier to use with the speech modality (M = 6.06, S.D. = 2.07) than with the DTMF modality (M = 4.41, S.D. = 1.77), F(1, 15) = 9.05, p < .01, η² = .38.

Figure 2. Mean values for user evaluation of the system

Modality Preference


In response to the question “which of the two navigation methods did you prefer?”, 69% of participants (n = 11) chose speech as their preferred modality, whereas only 25% (n = 4) selected DTMF. One user did not show any preference for a particular modality. When asked why, participants who chose the speech modality mostly indicated that using speech is easy, intuitive, flexible, and fun. The other main reason cited was that it frees up their hands and enables them to do multiple tasks at the same time. The dominant reason given by those who preferred the DTMF modality was that it is less error-prone than speech. One participant wrote that she preferred DTMF simply because she could “interrupt the system more easily.”

When asked what would be required to make them want to use their least preferred modality, we expected to see responses such as a need for hands-free usage (for those who had preferred DTMF) and a need for silent, private interaction (for those who had preferred speech). While one participant mentioned the usefulness of DTMF in a noisy environment, most responded that they would like to use a combination of modalities.


DISCUSSION


Unlike previous findings, our results indicate that users prefer the spoken interaction to the DTMF interaction for the NL-based message retrieval system. This was in spite of the fairly high error rates experienced with the NL recognition technology. While it is not uncommon for a grammar-based system to achieve accuracy levels in the mid-nineties for spoken phrases that are part of the known vocabulary, the Mobile Assistant has an accuracy level that ranges between 75 and 80%, depending on the task. The big advantage of an NL system is that it is highly usable by first-time and novice users, and this might have been a factor in the results.

Another factor that might have contributed to the preference for speech is the trade-off that users make between the advantages of using a speech-based system (such as the ability to control a device in a hands-free mode) and the disadvantage of dealing with recognition errors. In the domain of messaging, the need for hands-free control is well noted when attempting to access messages from a cellular car phone. There is also the fun factor to consider. When we interviewed a participant as to why he preferred the speech modality when his experience with the speech system had been rather dismal (to our observation), he replied, “Well, I guess speech is just more fun.”

Lastly, one can speculate that it may have seemed more natural for users to speak to the system since the system was speaking to them. As mentioned earlier, in both conditions the system spoke to the participants using synthetic speech. The system is designed to be contrite and apologetic when it does not understand what the user is saying. It also makes (rather feeble) attempts at humor, and it is always polite. Perhaps, in keeping with the media equation theory [14], these rather anthropomorphic characteristics of the system contributed to the findings.

ACKNOWLEDGMENTS


We thank David Wood for his critical role in the vision and implementation of the Mobile Assistant project, Marisa Viveros for her leadership and support of this and other research projects, and all the kind participants who took part in the experiment and gave us their feedback. We also thank the other MA team members for their valuable work on the project since without them it never could have happened!

REFERENCES


    1. Ballentine, B., Morgan D. How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues. Published by Enterprise Integration Group, Inc., San Ramon, California, 1999.

    2. Ballentine, B. Re-engineering the Speech Menu: A Device Approach to Interactive List-Selection. In Gardner-Bonneau (ed.) Human Factors and Voice Interactive Systems. Kluwer Academic Publishers, 1999

    3. Boyce, S. Natural Spoken Dialogue Systems for Telephony Applications. In Communications of the ACM, September 2000, Vol. 43, Number 9

    4. Davies, K. et al. The IBM Conversational Telephony System for Financial Applications. In Proceedings of Eurospeech ’99 . Budapest, Hungary, Sept. 1999

    5. Delogu, C., Di Carlo, A., Rotundi, P., & Sartori, D. A comparison between DTMF and ASR IVR services through objective and subjective evaluation. FUB report 5D01398. In Proceedings of IVTTA'98, Turin, September 1998, pp. 145-150.

    6. European Telecommunications Standards Institute (ETSI). Human Factors (HF), Guide for usability evaluations. ETSI Technical Report, ETR 095, 1993.

    7. Fay, D. Interfaces to automated telephone services: Do users prefer touchtone or automatic speech recognition? In Proceedings of the 14th International Symposium on Human Factors in Telecommunications (pp. 339-349). Darmstadt, Germany: R. v. Decker’s Verlag, 1993.

    8. Foster, J.C., McInnes, F.R., Jack, M.A., Love, S., Dutton, R.T., Nairn, I.A., White, L.S. An experimental evaluation of preference for data entry method in automated telephone services. Behaviour & Information Technology, 17 (2), 82-92, 1998.

    9. Gardner-Bonneau, D, Guidelines for Speech Enabled IVR Application Design. In Gardner-Bonneau (ed.) Human Factors and Voice Interactive Systems. Kluwer Academic Publishers, 1999

    10. Greve, F. Dante’s 8th circle of hell: Voice mail. St. Paul Pioneer Press, 1996.

    11. Goldstein, M., Bretan, I., Sallnas, E.-L. & Bjork, H. Navigational abilities in audial voice-controlled dialogue structures. Behaviour & Information Technology, 18 (2), 83-95.

    12. Karis, D. Speech recognition systems: performance, preference, and design. In 16th International Symposium on Human Factors in Telecommunications 1997, P65-72

    13. Nuance. Market Research: 2000 Speech User Scorecard. Menlo Park, CA: Nuance. 2000.

    14. Reeves, B. and Nass, C. The media equation: How people treat computers, television, and new media like real people and places. Cambridge University Press, New York, 1996.



    15. Schmandt, C. Voice Communication with Computers: Conversational Systems. Van Nostrand Reinhold, New York, 1994


1 In a CW-based system, users say a string of words following a system prompt, without any pause required between the words.

2 In the IW system, users say only a single word after a prompt.
