Statistics is the science of collecting, organizing, and analyzing numerical data.
Descriptive Statistics
We will be examining the different values that variables take on and how often these variables assume these values. Most of the information is summarized in what is called a distribution.
Here is a list of graphs we will examine now. Boxplots and scatterplots will be covered later.
III. Categorical Graphs
We'll concentrate on categorical variables and their graphs for now.
Bar Charts (Qualitative / Categorical Data)
- Horizontal axis = categories
- Vertical axis = values
Note: Because the categories are distinct and one category does not flow into the next (unlike real numbers on the x-axis), there should be spaces between the bars. (That's not to say everyone gets this right.)
Example: Rainfall (Brase/Brase, Understanding Statistics)
The information listed below gives the average monthly rainfall throughout the year in Honolulu, Hawaii:
Month | Jan. | Feb. | Mar. | Apr. | May | June | July | Aug. | Sept. | Oct. | Nov. | Dec.
Rainfall (in.) | 4.4 | 2.46 | 3.18 | 1.36 | 0.96 | 0.32 | 0.60 | 0.76 | 0.67 | 1.51 | 2.99 | 3.64
- Make a bar graph of this information with month on the horizontal axis and rainfall on the vertical.
- There is the rainy season and the dry season. From the graph, which 6 months would you say make up the rainy season?
- Without rain insurance, which winter month (Nov., Dec., Jan., or Feb.) would be the best time to plan a trip?
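If no graphing tool is handy, a rough text version of the bar graph can be sketched in Python. This is only an illustration: the function name `text_bar_chart` and the scale of one `#` per tenth of an inch are choices made here, and the bars run sideways (categories down the page) because plain text has no vertical axis.

```python
# Average monthly rainfall in Honolulu (inches), copied from the table above.
months = ["Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
          "July", "Aug.", "Sept.", "Oct.", "Nov.", "Dec."]
rainfall = [4.4, 2.46, 3.18, 1.36, 0.96, 0.32,
            0.60, 0.76, 0.67, 1.51, 2.99, 3.64]


def text_bar_chart(labels, values, scale=10):
    """Return one text bar per category; each '#' stands for 1/scale inch."""
    lines = []
    for label, value in zip(labels, values):
        bar = "#" * round(value * scale)
        lines.append(f"{label:<6}{bar} {value}")
    return lines


for line in text_bar_chart(months, rainfall):
    print(line)

# Sorting the months by rainfall is one way to check a reading of the graph.
wettest_first = sorted(zip(months, rainfall), key=lambda pair: pair[1], reverse=True)
print([month for month, _ in wettest_first[:6]])
```

Each bar is separated from the next by a line break, which mirrors the gaps between bars for categorical data.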
Pie Charts or Circle Graphs (Qualitative / Categorical Data)
- Circle divided up into sectors to represent categories.
- Angles for sectors are proportional to percentage weights of categories.
Example: Causes for Lateness (Brase/Brase, Understanding Statistics)
Suppose you want to arrive at your college 15 minutes before your first class so that you can feel relaxed when you walk into class. An early arrival time allows you room for unexpected delays. However, you always find yourself arriving "just in time" or slightly late. What causes you to be late? One student made a list of possible causes and then kept a checklist for 2 months. On some days, more than one item was checked because several events occurred that caused the student to be late. Construct a pie chart using this data.
Causes for Lateness
Cause Frequency
Snoozing after alarm goes off 15
Car trouble 5
Too long over breakfast 13
Last-minute studying 20
Finding something to wear 8
Talking too long with roommate 9
Other 3
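Since sector angles are proportional to each category's share of the total, the angles for this table can be worked out before drawing anything. The sketch below assumes the percentages are taken out of the total number of check marks (not the number of days, since several causes could be checked on one day); the names `causes` and `pie_sectors` are illustrative.

```python
# Frequencies from the "Causes for Lateness" table above.
causes = {
    "Snoozing after alarm goes off": 15,
    "Car trouble": 5,
    "Too long over breakfast": 13,
    "Last-minute studying": 20,
    "Finding something to wear": 8,
    "Talking too long with roommate": 9,
    "Other": 3,
}


def pie_sectors(frequencies):
    """Map each category to (percent of total, sector angle in degrees)."""
    total = sum(frequencies.values())
    return {
        name: (100 * count / total, 360 * count / total)
        for name, count in frequencies.items()
    }


sectors = pie_sectors(causes)
for name, (percent, angle) in sectors.items():
    print(f"{name:<32}{percent:5.1f}%  {angle:6.1f} degrees")
```

The angles should add up to 360 degrees and the percentages to 100, which makes a quick check on the arithmetic.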
Excel:
Note that Excel's chart wizard is a great way to make graphs. I'll post a how-to on the Resources page on the course site. Other spreadsheet programs have similar utilities.
We now turn to several examples of graphs for quantitative data.
IV. Dotplots
Example 6
Let X = the number of letters in the first name of a student in this class. Construct a dotplot of the data obtained from this class.
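A dotplot places one dot per observation next to its value. As a minimal sketch, the Python below builds a text dotplot turned on its side. The name lengths are hypothetical, made up purely for illustration, and should be replaced with the data collected in class; `dotplot_rows` is likewise just an illustrative name.

```python
from collections import Counter

# Hypothetical name lengths, made up for illustration; replace with class data.
name_lengths = [4, 5, 5, 6, 6, 6, 7, 7, 8, 11]


def dotplot_rows(values):
    """Return one text row per integer from min to max, with a dot per observation."""
    counts = Counter(values)
    return [f"{x:>3} | " + "." * counts[x]
            for x in range(min(values), max(values) + 1)]


for row in dotplot_rows(name_lengths):
    print(row)
```

Values with no observations still get a row, so any gaps in the data stay visible.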
Definition (BPS)
An outlier in any graph is an individual observation that falls outside the overall pattern of the graph.
Question: Does the dotplot above appear to have any outliers?
Note: We will study more precise ways of determining if an observation is an outlier of a data set in the weeks to come.
V. Histograms
In many cases, we find a histogram to be a more useful visualization.
We begin with a definition:
Definition: Range
The range of a set of numbers is the difference between the largest and smallest numbers in the set, i.e.
range = (largest value) - (smallest value)
Before we can construct a histogram, we need to build a slightly different sort of frequency table. Here we divide the range into equally-sized classes and look at how many observations fall into each class.
We need to formalize a few more definitions:
Definitions and Properties for Grouped Data:
- We divide the range into equally-sized intervals we call classes or class intervals.
- The smallest number in a class is called the lower class limit.
- The largest number in a class is called the upper class limit.
- The classes must be designed so that each number in the set falls into exactly one class.
- The midpoint of a class lies halfway between the limits, i.e.
midpoint = ((lower class limit) + (upper class limit)) / 2
- The class width is the difference between consecutive lower class limits.
- When we divide data into classes, we produce what we call grouped data.
We examine histograms in the context of an example:
Example 7 (Understanding Statistics): Nurses
Nurses on the eighth floor of Community Hospital believe they need extra staffing at night.
To estimate the night workload, a random sample of 35 nights was used. For each night the total number of room calls to the nurses' station on the eighth floor was recorded as follows:
68 60 69 70 83 58 90 86 71 71
92 95 70 74 46 18 84 82 75 63
101 77 102 80 86 85 73 86 62 100
90 37 88 70 87
Build a histogram by following these steps:
- Enter this data in a list on the calculator and sort the list.
- Compute the range of this data set.
- We'll use a total of five classes for our histogram. How wide should these classes be? What are they?
- Find the midpoints of each of the classes.
- Tally the frequencies for each class.
Classes | Frequency
- Draw the histogram. Put the frequency of each class on the y-axis and plot the point at the midpoint on the x-axis.
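The steps above can also be sketched in Python as a way to check hand or calculator work. One assumption to flag: the class width here is the range divided by the number of classes, bumped up to the next whole number, which is one common convention and not the only reasonable choice. The function name `group_data` is illustrative.

```python
import math

# Room calls per night for the 35 sampled nights (Example 7).
calls = [68, 60, 69, 70, 83, 58, 90, 86, 71, 71,
         92, 95, 70, 74, 46, 18, 84, 82, 75, 63,
         101, 77, 102, 80, 86, 85, 73, 86, 62, 100,
         90, 37, 88, 70, 87]


def group_data(values, num_classes):
    """Return (lower limit, upper limit, midpoint, frequency) for each class.

    Assumption: integer data, and the class width is range / num_classes
    bumped up to the next whole number, so the classes cover every value.
    """
    data_range = max(values) - min(values)
    width = math.floor(data_range / num_classes) + 1
    classes = []
    for i in range(num_classes):
        lower = min(values) + i * width
        upper = lower + width - 1  # last integer before the next lower limit
        midpoint = (lower + upper) / 2
        frequency = sum(lower <= v <= upper for v in values)
        classes.append((lower, upper, midpoint, frequency))
    return classes


for lower, upper, midpoint, frequency in group_data(calls, 5):
    print(f"{lower:>3}-{upper:<3} midpoint {midpoint:5.1f}  frequency {frequency}")
```

Starting the first class at the smallest observation is also a choice made for this sketch; starting a little lower (at a rounder number) is equally acceptable as long as every value lands in exactly one class.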
Notes of importance:
- Once we turn data into grouped data, we can never go back and recover the original data.
- Histograms can be created using the TI graphing calculators. (It's under "Stat Plot," accessed by hitting 2nd then Y=. You need to turn on the plot, choose the histogram, and choose the list. You'll also want to go to ZOOM and select 9:ZoomStat.)
- When constructing a histogram, it is important to remember that all classes should have the same size (width). It is recommended that a histogram have somewhere between 5 and 20 classes (probably closer to 5). To find an appropriate number of classes, it is helpful to use the range, as we did above. Note that the intervals can go beyond the range a bit, within reason. In other words, it is usually more sensible to have integer class widths and go a little too high or a little too low than to use a class width like 14.2.
- Bars of a histogram, in general, should be connected (unlike the bars of a bar graph, which are not).
- A relative frequency histogram has the same shape as a histogram with the exception that the vertical axis measures relative frequencies (percents) instead of frequencies.
Example 8
Construct a relative frequency histogram using the data in Example 7.
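Converting frequencies to relative frequencies is a single division per class. The sketch below assumes the Example 7 data were grouped into the five classes 18-34, 35-51, 52-68, 69-85, and 86-102 (one possible choice; different classes give different frequencies). The function name is illustrative.

```python
# Class frequencies for the Example 7 data, assuming the five classes
# 18-34, 35-51, 52-68, 69-85, 86-102 (one possible choice of classes).
class_labels = ["18-34", "35-51", "52-68", "69-85", "86-102"]
frequencies = [1, 2, 5, 15, 12]


def relative_frequencies(freqs):
    """Divide each class frequency by the total number of observations."""
    total = sum(freqs)
    return [f / total for f in freqs]


for label, rel in zip(class_labels, relative_frequencies(frequencies)):
    print(f"{label:<7}{rel:6.1%}")
```

Because every frequency is divided by the same total, the bars keep their relative heights, which is why the shape of the histogram does not change.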
We want to be able to describe data by interpreting its histogram. The key features of a histogram worth noticing are:
- The center of the histogram (more than one way to measure this)
- The spread of the histogram (more than one way to measure this too)
- The shape of the histogram (usually only one way to describe this)
There are three basic shapes you need to know:
- Symmetric
- Skewed right
- Skewed left
Be careful with skewed left and skewed right!! It's very easy to confuse them!
One way to remember the distinction is to think of starting with a symmetric graph and having someone step on the left side, flattening it into a long tail and pushing more observations to the right. This is the skewed left situation: the long tail is on the left. For skewed right, imagine someone stepping on the right side of the graph and pushing more observations to the left, so that the long tail is on the right.
VI. Stemplots or Stem and Leaf Displays
Example 9 (Understanding Statistics): How old are rich people?
Forbes Richest People gives the profile of the world’s wealthiest men and women. Do you have to be old to be worth at least $2 billion? You can answer this question yourself by studying the following data, the ages of men and women worth at least $2 billion:
40 66 43 82 52 58 77 52 50 48 47
68 66 73 76 53 67 88 40 79 73 66
65 70 72 77 48 75 82 54 76 41 93
65 60 57 74 70 83 67 68 77 66 34
66 59 48 56 71 40 53 63 52 57 83
52 60 56 71 64 61 53 53 73 70
Make a stem and leaf diagram for this data (feel free to use your graphing calculator to sort the numbers in increasing order).
Notes of importance:
- To construct a stem and leaf plot, simply remove the last digit of each number in the data (these will be used as leaves) and use the remaining digits as a row label (called a stem label). An entire row of a stem plot is called a stem.
- Leaves should be arranged in increasing order along each stem.
- A stemplot preserves the original data in a data set whereas a histogram does not.
- Stems may also be split, which is often done when there are a large number of leaves in one row. For example, suppose we use the ages in the last example for our discussion. Instead of listing all ages that were in the 50s, one could list all the ages between 50 and 54 on one stem and 55-59 on another stem. This part of the stemplot would look like this:
5 0 2 2 2 2 3 3 3 3 4
5 * 6 6 7 7 8 9
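The construction rules above, including split stems, can be sketched in Python as follows. This is an illustration only: the function name `stemplot` and the `split` flag are choices made here, and the sketch omits stems that have no leaves, which a hand-drawn stemplot would normally still show.

```python
from collections import defaultdict

# Ages from Example 9.
ages = [40, 66, 43, 82, 52, 58, 77, 52, 50, 48, 47,
        68, 66, 73, 76, 53, 67, 88, 40, 79, 73, 66,
        65, 70, 72, 77, 48, 75, 82, 54, 76, 41, 93,
        65, 60, 57, 74, 70, 83, 67, 68, 77, 66, 34,
        66, 59, 48, 56, 71, 40, 53, 63, 52, 57, 83,
        52, 60, 56, 71, 64, 61, 53, 53, 73, 70]


def stemplot(values, split=False):
    """Return stemplot rows for non-negative integers.

    The last digit is the leaf and the remaining digits are the stem.
    With split=True, leaves 0-4 and 5-9 go on separate rows, and the
    upper row is marked with '*'. Stems with no leaves are omitted.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        upper_half = split and leaf >= 5
        rows[(stem, upper_half)].append(leaf)
    lines = []
    for (stem, upper_half), leaves in sorted(rows.items()):
        label = f"{stem} *" if upper_half else f"{stem}"
        lines.append(f"{label:<4}" + " ".join(str(leaf) for leaf in leaves))
    return lines


for line in stemplot(ages, split=True):
    print(line)
```

Sorting the values first guarantees that the leaves come out in increasing order along each stem.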
Example 10 (Understanding Statistics)
Tel-a-Message is experimenting with computer-delivered telephone advertisements. Of primary concern is how much of the 4-minute advertisement is heard. A study was done to see how long the advertisement ran before the listeners hung up. A random sample of 30 calls gave the information listed below. Construct a stemplot using this data.
1.3 0.7 2.1 0.5 0.2 0.9 1.1 3.2 4.0 3.8
1.4 3.1 2.5 0.6 0.5 2.1 4.0 4.0 0.3 1.2
1.0 1.5 0.4 4.0 2.3 2.7 4.0 0.7 0.5 4.0
Back-to-back Stemplots
Example 11 (BPS)
Here are the numbers of home runs that Babe Ruth hit in each of his 15 years with the New York Yankees, 1920-1934:
54 59 35 41 46 25 47 60
54 46 49 46 41 34 22
Babe Ruth’s home run record for a single season was broken by another Yankee, Roger Maris, who hit 61 home runs in 1961. Here are Maris’ home run totals for his 10 years in the American League:
13 23 26 16 33 61 28 39 14 8
Draw a back-to-back stemplot illustrating this data. Comment on the shapes of the two distributions.
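A back-to-back stemplot shares one column of stems, with one data set's leaves reading outward to the left and the other's to the right. A minimal Python sketch follows; the function names are illustrative, and placing Maris on the left and Ruth on the right is an arbitrary choice.

```python
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [13, 23, 26, 16, 33, 61, 28, 39, 14, 8]


def leaves_by_stem(values):
    """Group the last digits (leaves) of sorted values by their tens digit (stem)."""
    groups = {}
    for v in sorted(values):
        groups.setdefault(v // 10, []).append(v % 10)
    return groups


def back_to_back(left, right):
    """Return rows with left leaves reading outward from a shared stem column."""
    left_groups, right_groups = leaves_by_stem(left), leaves_by_stem(right)
    stems = range(min(min(left_groups), min(right_groups)),
                  max(max(left_groups), max(right_groups)) + 1)
    width = max(len(g) for g in left_groups.values()) * 2
    rows = []
    for stem in stems:
        left_leaves = " ".join(str(d) for d in reversed(left_groups.get(stem, [])))
        right_leaves = " ".join(str(d) for d in right_groups.get(stem, []))
        rows.append(f"{left_leaves:>{width}} | {stem} | {right_leaves}".rstrip())
    return rows


for row in back_to_back(maris, ruth):
    print(row)
```

Every stem from the smallest to the largest is printed, even when one side has no leaves, so the two shapes can be compared row by row.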
Example 12 (SDA): National League Stadiums
Consider the seating capacities of baseball stadiums in the National League, as shown in the table below:
Team Stadium Capacity
Atlanta Braves 52,003
Chicago Cubs 38,170
Cincinnati Reds 52,952
Colorado Rockies approximately 50,000
Florida Marlins 47,226
Houston Astros 54,816
L.A. Dodgers 56,000
Montreal Expos 43,739
New York Mets 55,601
Philadelphia Phillies 62,382
Pittsburgh Pirates 58,727
St. Louis Cardinals 56,227
San Diego Padres 59,022
San Francisco Giants 58,000
a. Draw a stem and leaf plot, using only the first two digits of each number.
b. Draw a histogram illustrating this data.
VII. Time Plots
Time plots are used to illustrate bivariate quantitative data where the independent variable represents time.
Example 13 (text)
The Virginia Department of Motor Vehicles publishes data each year on the number of road fatalities, pedestrian fatalities and alcohol-related fatalities in the state. This information is used as a stimulus for public safety awareness programs, legislation on speed limits and the use of seat belts, highway engineering projects and similar purposes. Here are the results for an eleven-year period:
Year Total Pedestrian Alcohol-related
1986 1118 141 492
1987 1022 117 418
1988 1069 131 522
1989 999 141 480
1990 1071 116 535
1991 938 112 429
1992 839 93 379
1993 875 112 397
1994 925 101 376
1995 900 93 360
1996 869 114 346
a. Make a time plot for the total number of road fatalities. Does there appear to be a trend? If so, describe it. Can you give some possible reasons for what has happened?
b. Make a time plot for the number of alcohol-related fatalities. Answer the same question as in (a).
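As a rough sketch of a time plot without graphing software, the Python below prints one row per year with a marker placed in proportion to the total fatalities, then lists the year-to-year changes. Time runs down the page here instead of along the horizontal axis, and the `low` and `step` display parameters are arbitrary choices for this illustration.

```python
# Total road fatalities in Virginia, 1986-1996, from the table above.
years = list(range(1986, 1997))
total = [1118, 1022, 1069, 999, 1071, 938, 839, 875, 925, 900, 869]


def text_time_plot(times, values, low=800, step=10):
    """Return one row per time point, with an 'o' placed in proportion to the value.

    low and step are arbitrary display choices for this sketch.
    """
    return [f"{t} " + " " * ((v - low) // step) + f"o {v}"
            for t, v in zip(times, values)]


for row in text_time_plot(years, total):
    print(row)

# Year-to-year changes help when describing a trend in words.
changes = [later - earlier for earlier, later in zip(total, total[1:])]
print(changes)
print("net change 1986-1996:", total[-1] - total[0])
```

The same function can be reused for the pedestrian or alcohol-related columns by swapping in those values and adjusting `low` and `step`.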
VIII. Cumulative Relative Frequency and Ogives
Definitions:
- A cumulative frequency is the number of observations less than or equal to a given number.
- A cumulative relative frequency is the cumulative frequency divided by the total number of observations. (So what kind of numbers do we get?)
- The graph of a cumulative relative frequency is called a cumulative relative frequency graph (mmmm…. creative) or an ogive.
Example 14 (Iman)
The following is a list of 20 typing test scores (net words per minute):
68 72 91 47 52 75 53 55 65 35
84 45 58 61 69 22 46 55 66 71
Let's first write down the cumulative frequency and cumulative relative frequency for this data:
Scores | Cumulative Frequency | Cumulative Relative Frequency
22 | |
35 | |
45 | |
46 | |
47 | |
52 | |
55 (two of them) | |
58 | |
61 | |
63 | |
65 | |
66 | |
68 | |
69 | |
71 | |
72 | |
75 | |
84 | |
91 | |
Definition (Iman): An empirical distribution function (e.d.f.) is a graph of the cumulative relative frequency vs. the raw data in the sample. It is a form of a step function.
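The cumulative columns and the e.d.f. can be computed with a short Python sketch, shown here as a way to check hand calculations. It uses the scores exactly as printed in the raw list for Example 14; note that the raw list contains a 53 while the table lists a 63, so the rows produced here will differ from the table at that one value. The function names are illustrative.

```python
from bisect import bisect_right

# Typing scores as printed in the raw list for Example 14.
scores = [68, 72, 91, 47, 52, 75, 53, 55, 65, 35,
          84, 45, 58, 61, 69, 22, 46, 55, 66, 71]


def cumulative_table(values):
    """Return (value, cumulative frequency, cumulative relative frequency) rows."""
    ordered = sorted(values)
    n = len(ordered)
    rows = []
    for x in sorted(set(ordered)):
        cum_freq = bisect_right(ordered, x)  # number of observations <= x
        rows.append((x, cum_freq, cum_freq / n))
    return rows


def edf(values):
    """Return the step function F(x) = (number of observations <= x) / n."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


for value, cum_freq, cum_rel in cumulative_table(scores):
    print(f"{value:>3} | {cum_freq:>2} | {cum_rel:.2f}")

F = edf(scores)
print(F(60))  # proportion of scores that are 60 or below
```

The function `F` stays flat between observed scores and jumps at each one, which is the step behavior described in the definition.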
On the back, draw the empirical distribution function for the above data.
Homework: #1.1-1.11, 1.12a-d, 1.14, 1.17, 1.20-1.25, 1.28