6 Statistical Methods and Measurement
used to combine them affects how easily understood the compound metric will be. This leads to
ratios (e.g., defects per thousand units),
rates (time-based ratios such as number of problem reports per month),
proportions or percentages (e.g., proportion of customers responding “very satisfied” to a survey question),
linear algebraic combinations (e.g., mean repair cost: the sum of all repair costs divided by the total number of repairs), and
indices (dimensionless measures typically based on a sum and then standardized to some baseline value).
Whereas simple metrics are always defined in terms of some measurement unit, compound metrics such as percentages and some linear combinations and indices can be dimensionless.
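The kinds of compound metric listed above can be illustrated with a short Python sketch. All counts, costs, and the baseline value are hypothetical numbers chosen for illustration, and the index shown (a rate standardized so that the baseline equals 100) is only one simple way of constructing an index.

```python
# Hypothetical raw measurements; none of these values come from real data.
defects = 18
units_produced = 12_000
problem_reports = 45
months_observed = 6
very_satisfied = 130
survey_responses = 400
repair_costs = [120.0, 80.0, 200.0, 100.0]
baseline_reports_per_month = 10.0  # assumed reference period for the index

# Ratio: defects per thousand units.
defects_per_thousand = defects / units_produced * 1000

# Rate: a time-based ratio, problem reports per month.
reports_per_month = problem_reports / months_observed

# Proportion: dimensionless share of "very satisfied" responses.
proportion_satisfied = very_satisfied / survey_responses

# Linear combination: mean repair cost (still in currency units).
mean_repair_cost = sum(repair_costs) / len(repair_costs)

# Index: current rate standardized to the baseline (baseline = 100).
report_index = reports_per_month / baseline_reports_per_month * 100

print(defects_per_thousand, reports_per_month, proportion_satisfied,
      mean_repair_cost, report_index)
```

Note that the ratio, rate, and mean each keep a unit, whereas the proportion and the index are dimensionless, which is part of what makes indices harder to interpret.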
The definition of a metric affects its behavior (i.e., the likelihood of its taking on various values), its possible interpretations, and the kinds of analyses which are suitable for it. This argues for the use of simpler, more easily understood metrics rather than the creative development of new, compound ones with poorly understood behavior. Indices in particular raise serious questions of interpretation and comparison, and are best used for showing long-term trends. The range of values a metric can have does not always follow a bell-shaped Normal curve; for example, durations such as repair times almost always have a highly skewed distribution whose tail values pull the mean far from the median. Investigation of the distribution of a metric’s values is one of the first tasks that must be undertaken in a statistical analysis. Furthermore, the range of values a measure can take on can be affected by internal or external limitations; these are referred to as truncation or limitation, and censoring.
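The effect of a skewed distribution on the mean can be seen in a small Python sketch. The repair times below are hypothetical, chosen so that a single long repair forms the tail; the variable names are illustrative only.

```python
from statistics import mean, median

# Hypothetical repair times in hours; one long-running repair forms the tail.
repair_hours = [1, 1, 2, 2, 2, 3, 3, 4, 5, 40]

mean_hours = mean(repair_hours)      # pulled upward by the 40-hour repair
median_hours = median(repair_hours)  # unaffected by how extreme the tail is

# Count how many repairs finished faster than the "average" repair.
below_mean = sum(1 for h in repair_hours if h < mean_hours)

print(mean_hours, median_hours, below_mean)
```

With these values the mean is 6.3 hours and the median is 2.5 hours, and nine of the ten repairs are shorter than the mean, so the mean is a poor description of a typical repair. Comparing the two in this way is a quick first check on the shape of a distribution, not a substitute for plotting it.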
Truncation or limitation refers to situations where a measure never takes on a particular value or range of values. For example, repair time in theory can never have a value of zero (if it does, the measurement scale is too coarse). Or one may have results from a survey question which asks for some count, with an “n or more” response as the highest value; this means that the upper part of the measure is truncated artificially. These situations can sometimes be problematic, and special statistical methods have been developed to handle them (see Long, 1997; Maddala, 1986).
A much more difficult case is that of censoring, which occurs with duration data. If the measure of interest is the time until an event happens (e.g., the time until a defect is repaired), then there necessarily will be cases where the event has not yet happened at the time of measurement. These observations are called “censored” because even though we believe the event will eventually occur and a duration will be defined, we do not know how long that duration will be (only that it has some current lower bound). This problem is often not recognized, and when it is, the typical response is to ignore the missing values. This unfortunately causes the subsequent analysis to be biased. Proper analysis of duration data is an extensive subarea of statistics usually termed “survival analysis” (because of its use in medical research); its methods are essential for analyzing duration data correctly. See Hosmer and Lemeshow (1999) or Kleinbaum (1996) for a good introduction.
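To make the bias from ignoring censored observations concrete, the following Python sketch compares a naive median with one from the Kaplan-Meier product-limit estimator, a standard survival analysis technique. The data are hypothetical, the function names `kaplan_meier` and `median_survival` are illustrative, and the code is a minimal sketch rather than a replacement for a statistical package.

```python
from fractions import Fraction
from statistics import median


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    event_times = sorted({d for d, seen in zip(durations, observed) if seen})
    survival = Fraction(1)
    curve = []
    for t in event_times:
        # Cases still open just before t, whether later repaired or censored.
        at_risk = sum(1 for d in durations if d >= t)
        # Repairs completed exactly at t.
        events = sum(1 for d, seen in zip(durations, observed) if seen and d == t)
        survival *= Fraction(at_risk - events, at_risk)
        curve.append((t, survival))
    return curve


def median_survival(curve):
    """Smallest event time at which the survival estimate is 1/2 or less."""
    for t, s in curve:
        if s <= Fraction(1, 2):
            return t
    return None  # the curve never reaches 1/2 within the observed data


# Hypothetical defect data: days open, and whether the repair was observed.
# False means the defect was still open at measurement time (censored).
days_open = [2, 3, 3, 5, 8, 10, 12, 15]
repaired = [True, True, True, False, True, True, False, False]

# Naive approach: drop the censored cases and summarize the rest.
naive_median = median(d for d, seen in zip(days_open, repaired) if seen)

curve = kaplan_meier(days_open, repaired)
km_median = median_survival(curve)

print(naive_median, km_median)
```

For these values the naive median repair time is 3 days, while the Kaplan-Meier estimate is 8 days: discarding the open defects removes exactly the long-running cases and so understates the typical duration.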
Classical measurement theory (Krantz et al., 1971; Ghiselli et al., 1981) defines four basic types of measurement scale, depending on what kinds of mathematical manipulations make sense for the scale’s values. (Additional types have been proposed, but they are typically special cases for mathematical completeness.) The four are:
160 J. Rosenberg
Nominal. The scale values are unordered categories, and no mathematical manipulation makes sense.
Ordinal. The scale values are ordered, but the intervals between the values are not necessarily of the same size, so only order-preserving manipulations such as ranking make sense.
Interval. The scale values are ordered and have equal intervals, but there is no zero point, so only sums and differences make sense.
Ratio. The scale values are ordered and have equal intervals with a zero point, so any mathematical manipulation makes sense.
These scale types determine which kinds of analyses are appropriate for a measurement’s values. For example, coding nominal categories as numbers (as with serial numbers, say) does not mean that calculating their mean makes any sense. Similarly, calculating the mean of subjective rating scale values (such as defect severity) is not likely to produce meaningful results, since the rating scale’s steps are probably not equal in size.
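The link between scale type and permissible analysis can be sketched as a simple lookup in Python. The pairing of each statistic with a minimum scale type follows the conventional textbook assignment and is an assumption of this sketch, not something specified in the text above; the names `Scale`, `MIN_SCALE`, and `is_meaningful` are illustrative.

```python
from enum import IntEnum


class Scale(IntEnum):
    """The four classical scale types, ordered from weakest to strongest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Weakest scale type on which each summary statistic is conventionally
# considered meaningful (an assumed, commonly used assignment).
MIN_SCALE = {
    "mode": Scale.NOMINAL,
    "median": Scale.ORDINAL,
    "mean": Scale.INTERVAL,
    "coefficient_of_variation": Scale.RATIO,
}


def is_meaningful(statistic, scale):
    """Return True if the statistic makes sense for values on this scale."""
    return scale >= MIN_SCALE[statistic]


# Serial numbers treated as nominal codes: a mean is not meaningful.
print(is_meaningful("mean", Scale.NOMINAL))    # False
# Defect severity ratings treated as ordinal: median yes, mean no.
print(is_meaningful("median", Scale.ORDINAL))  # True
print(is_meaningful("mean", Scale.ORDINAL))    # False
```

A check of this kind is only as good as the scale type assigned to the data, which in practice may not be settled until the metric has been studied.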
It is important to realize that the definition, interpretation, and resulting analyses of a metric are not necessarily fixed in advance. Given the complexities shown in Fig. 1, the actual characteristics of a metric are often not entirely clear until after considerable analysis has been done with it. For example, the values on an ostensibly ordinal scale may behave as if they were coming from an underlying ratio scale (as has been shown for many psychometric measures; see Cliff, 1992). It is commonly the case that serial numbers are assigned in a chronologically ordered manner, so that they can be treated as an ordinal, rather than nominal, scale. Velleman (1993) reports the case where branch store number correlated inversely with sales volume, as older stores (with smaller store numbers) had greater sales.
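If store numbers are treated as an ordinal scale, an order-based statistic such as Spearman’s rank correlation is one natural way to look for the kind of relationship described. The Python sketch below uses invented store numbers and sales figures, not Velleman’s data, and the simple formula used assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: smaller store numbers stand for older stores.
store_numbers = [1, 2, 3, 4, 5, 6]
sales_volume = [900, 850, 700, 720, 500, 400]

rho = spearman_rho(store_numbers, sales_volume)
print(round(rho, 3))  # -0.943
```

The strongly negative value reflects only the ordering of the store numbers, which is all an ordinal interpretation licenses; nothing here depends on the numeric gaps between store numbers.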
There has been much discussion in the software metrics literature about the implications of measurement theory for software metrics (Zuse, 1990; Shepperd and Ince, 1993; Fenton and Pfleeger, 1997). Much of this discussion has been misguided, as Briand et al. (1996) show. Measurement theory was developed by scientists to aid their empirical research; putting the mathematical theory first and the empirical research after is exactly backwards. The prescriptions of measurement theory apply only after we have understood what sort of scale we are working with, and that is often not the case until we have worked with it extensively.
In practical terms, then, one should initially make conservative assumptions about a scale’s type, based on similar scales, and only promote it to a higher type when there is good reason to do so. Above all, however, one should avoid uncritically applying measurement theory or any other methodology in doing research.