A single observation of speed is not very interesting, however.

If Fred did the task again, he would take a different amount of time, and if someone else did it, the time would probably differ even more. We therefore collect sets of measurements, and compare averages. The sets might be multiple observations of one person performing a task over many **trials**, or of a range of people (experimental **participants**) performing the same task under controlled conditions. As with most human performance, the measured results will usually be found to have a **normal distribution**. A typical HCI experiment involves one or more experimental **treatments** that modify the user interface. A very simple example might test the question “How long does Fred take to finish task A when using a good UI, compared to a bad UI?” The result will often be that the good UI is *usually* faster to use than the bad, but not in *every* trial.
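To make the idea of comparing averages over many trials concrete, here is a minimal Python sketch. The timing values are invented purely for illustration (they are not measurements from any real experiment), and the names `good_ui_times`, `bad_ui_times` and `summarise` are hypothetical.

```python
import statistics

# Hypothetical completion times (seconds) for eight trials of task A.
# These numbers are made up for illustration only.
good_ui_times = [41.2, 38.5, 44.0, 39.8, 42.1, 40.3, 37.9, 43.5]
bad_ui_times = [45.0, 47.3, 42.8, 49.1, 44.6, 46.2, 43.9, 48.0]


def summarise(times):
    """Return the mean and sample standard deviation of a set of times."""
    return statistics.mean(times), statistics.stdev(times)


good_mean, good_sd = summarise(good_ui_times)
bad_mean, bad_sd = summarise(bad_ui_times)

# Compare trial by trial: in how many trials was the good UI faster?
good_wins = sum(g < b for g, b in zip(good_ui_times, bad_ui_times))

print(f"good UI: mean {good_mean:.2f} s, sd {good_sd:.2f} s")
print(f"bad UI:  mean {bad_mean:.2f} s, sd {bad_sd:.2f} s")
print(f"good UI faster in {good_wins} of {len(good_ui_times)} trials")
```

With this invented data the good UI has the lower mean, yet its slowest trial is slower than the bad UI's fastest trial, which is the kind of overlap the text describes.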

If we plot the measurements, we find two overlapping normal distributions, and we must therefore compare the effect of treatments relative to the spread in the population distribution. We need to know whether the difference between the averages is the result of ordinary random variation, or the effect of the changes we made to the user interface. This involves a statistical **significance test** such as the **t-test**. The t-test and other similar tests answer the question “What is the probability that the observed difference in means could have occurred simply by random variation?” The idea that the experimental difference might just have been a random variation is called a **null hypothesis**, and it is important to remember that this is always a possibility in any experiment. We generally hope that the probability was very low – i.e. that the observed difference is because we designed a really good interface, rather than luck. In HCI research, we usually insist that the probability of the result being due to random variation (**p**) is less than 0.05, or 5%. Good quality research results are normally based on experiments with significance values *p* < 0.01, which can be expressed as ‘we reject the null hypothesis, with 99% confidence’.
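As a rough illustration of how such a test produces a value of *p*, the sketch below implements one common variant, Student's two-sample t-test with pooled variance, using only the Python standard library. It assumes two independent groups of participants with roughly equal variances. The timing data is invented, and the helper names (`t_pdf`, `two_sided_p`, `student_t_test`) are hypothetical. In practice a statistics package would normally be used instead.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: area in both tails beyond |t|.

    The area between 0 and |t| is integrated with Simpson's rule
    (steps must be even), then doubled and subtracted from 1.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return max(0.0, 1 - 2 * central_area)


def student_t_test(a, b):
    """Pooled-variance two-sample t-test; returns (t, df, p)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t, df, two_sided_p(t, df)


# Hypothetical task times (seconds) from two separate groups of participants.
good_ui = [41.2, 38.5, 44.0, 39.8, 42.1, 40.3, 37.9, 43.5]
bad_ui = [45.0, 47.3, 42.8, 49.1, 44.6, 46.2, 43.9, 48.0]

t, df, p = student_t_test(good_ui, bad_ui)
print(f"t = {t:.2f}, df = {df}, p = {p:.4f}")
print("significant at p < 0.05" if p < 0.05 else "not significant at p < 0.05")
```

The negative sign of `t` here only reflects the order of the arguments (the good UI times are lower). If the same person used both interfaces, a paired test would be the more usual choice than this independent-samples version.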
