When both variables are qualitative, the relationship between the two may be examined by cross-tabulation, using a contingency table.
One qualitative variable here is “Profession,” having the two values “Nurse” and “Doctor.” The other qualitative variable is “Opinion about new procedure,” having the three possible values “Prefer new procedure,” “Prefer old procedure,” and “No preference.” We may examine the relationship between these two variables by calculating percentages (either row percentages or column percentages). To calculate row percentages, we create a new column, called Row Total, containing the total numbers of nurses and doctors surveyed. We then divide each number in each row by the row total to get a proportion:
We calculate column proportions similarly. We can also calculate cell proportions by dividing each frequency by the grand total, the combined number of nurses and doctors. It is clear from these proportions that the nurses tend to have a more favorable opinion of the new procedure than the doctors.
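The three kinds of proportions described above can be sketched in Python. The counts below are hypothetical, invented only to illustrate the arithmetic (they are not the survey's figures), and the function names are likewise assumptions.

```python
# Hypothetical counts, invented for illustration only.
# Columns: prefer new procedure, prefer old procedure, no preference.
counts = {
    "Nurse": [60, 25, 15],
    "Doctor": [30, 50, 20],
}

def row_proportions(table):
    """Divide each count by the total of its row."""
    return {row: [c / sum(vals) for c in vals] for row, vals in table.items()}

def column_proportions(table):
    """Divide each count by the total of its column."""
    col_totals = [sum(col) for col in zip(*table.values())]
    return {
        row: [c / t for c, t in zip(vals, col_totals)]
        for row, vals in table.items()
    }

def cell_proportions(table):
    """Divide each count by the grand total."""
    grand_total = sum(sum(vals) for vals in table.values())
    return {row: [c / grand_total for c in vals] for row, vals in table.items()}

for name, props in row_proportions(counts).items():
    print(name, [round(p, 2) for p in props])
```

With these made-up counts, the row proportions show a larger share of nurses than doctors preferring the new procedure, which is the kind of comparison row percentages are meant to support.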
One Qualitative and One Quantitative Variable
When one variable is qualitative and the other is quantitative, we consider the data set to be divided into several separate sample data sets, one for each value of the qualitative variable. Then we may 1) construct graphs for each of the subsamples, and 2) compute numerical summary statistics for each of the subsamples.
Are there outlying observations in any of the three data sets? To find out, we would find the 5-number summary for each data set and determine whether the extreme values lie within the intervals discussed in Chapter 2.
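One way to carry out this check is sketched below in Python. It assumes that the intervals from Chapter 2 are the usual fences, Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. That assumption, the choice of quartile method, and the sample values are all illustrative and not taken from the text.

```python
import statistics

# Hypothetical subsamples, one per value of the qualitative variable.
groups = {
    "A": [10, 12, 13, 14, 15, 16, 40],
    "B": [21, 22, 24, 25, 27],
    "C": [30, 31, 33, 35, 36, 38],
}

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Quartiles use the 'inclusive' method; other textbooks and
    calculators may use a slightly different quartile rule.
    """
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

def outliers(data):
    """Values outside the assumed fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in data if v < low or v > high]

for name, data in groups.items():
    print(name, five_number_summary(data), outliers(data))
```

In the made-up group A, the value 40 falls above the upper fence and is flagged; the other two groups have no flagged values.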
Two Quantitative Variables
When both variables are quantitative, we may represent the data set as a set of ordered pairs of numbers, (x, y). The variable x is called the input (or independent) variable; the variable y is called the response (or dependent) variable. We may examine the relationship between the two variables graphically using a scatter diagram, or scatterplot.
Example: The following data set, for a sample of 6 randomly selected middle-aged to elderly patients, consists of x = age of patient and y = measured systolic blood pressure of patient. We expect that as people age, their blood pressure will increase. We will examine the relationship between the two variables.
Age, x | Systolic Blood Pressure, y
43 | 128
48 | 120
56 | 135
61 | 143
67 | 141
70 | 152
To construct a scatterplot of the data using the TI-83:
1) Choose STAT, EDIT. Name one column Age; name the other column SBP.
2) Enter the data into the two columns.
3) Choose WINDOW. Set Xmin to be slightly smaller than the smallest value of x. In this case, we set Xmin = 40. Set Xmax to be slightly larger than the largest value of x. In this case, we set Xmax = 72. Set Ymin to be slightly smaller than the smallest value of y; in this case, Ymin = 118. Set Ymax to be slightly larger than the largest value of y; in this case, Ymax = 155. Set Xscl = 1, and Yscl = 1.
4) Choose 2nd, STAT PLOT. Turn Plot 1 On. For Type, choose the first type, scatterplot. For Xlist, enter the name of the x variable; for Ylist, enter the name of the y variable.
5) Hit the GRAPH key.
In this example, we see an increasing, linear trend relationship between age and systolic blood pressure, as expected. If we want to see the coordinates of the data points, we use the TRACE key.
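For readers without a TI-83, the same picture can be roughed out as a text scatterplot. The sketch below uses only the Python standard library; the function name text_scatter and the grid size are assumptions chosen for illustration, and the plot is much cruder than the calculator's.

```python
age = [43, 48, 56, 61, 67, 70]
sbp = [128, 120, 135, 143, 141, 152]

def text_scatter(xs, ys, width=40, height=12):
    """Return a list of strings forming a crude scatterplot.

    The smallest x maps to the left edge, the largest x to the right
    edge, and likewise for y from bottom to top.
    """
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - xmin) / (xmax - xmin) * (width - 1))
        row = round((y - ymin) / (ymax - ymin) * (height - 1))
        # Row 0 is printed first, so flip y to put large values on top.
        grid[height - 1 - row][col] = "*"
    return ["".join(line) for line in grid]

rows = text_scatter(age, sbp)
for line in rows:
    print("|" + line)
print("+" + "-" * 40)
```

The points drift from the lower left to the upper right, which is the increasing trend described above.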
Linear Correlation
The purpose of linear correlation analysis is to measure the strength of the linear relationship between x and y. Note: If the relationship between the two does not appear to be linear, then linear correlation analysis should not be done.
If there is an increasing linear trend relationship, so that larger values of x tend to be associated with larger values of y, then we say that there is a positive correlation between x and y.
If there is a decreasing linear trend relationship, so that larger values of x tend to be associated with smaller values of y, then we say that there is a negative correlation between x and y.
If there is no linear trend present, then we say that the correlation between x and y is zero.
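For the age and blood pressure data above, the direction of the linear trend can be checked numerically. The sketch below computes Pearson's correlation coefficient r from its standard definition (the sum of products of deviations, divided by the square root of the product of the sums of squared deviations); the function name pearson_r is an assumption.

```python
import math

age = [43, 48, 56, 61, 67, 70]
sbp = [128, 120, 135, 143, 141, 152]

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sums of squared deviations for x and for y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(age, sbp)
print(round(r, 3))  # about 0.897
```

The value is positive and fairly close to 1, consistent with the increasing linear trend seen in the scatterplot.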
Definition:
Pearson’s correlation coefficient, r, is a numerical measure of the strength (and direction) of a linear relationship between two quantitative variables. The formula for the correlation coefficient is
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}},$$