# Basic Statistical Vocabulary and Displaying Distributions with Graphs

 Date 18.10.2016 Size 73.16 Kb.

AP Statistics – Section 1.1

We'll begin our study of statistics by looking at some basic vocabulary and some graphical displays of data.
I. What Exactly is Statistics?
The topic of statistics can be divided as follows:

Statistics

the science of collecting, organizing, and analyzing numerical data.

Descriptive Statistics

collecting and presenting data

Inferential Statistics

drawing conclusions from data that has already been collected

Data Analysis

presenting data in the form of charts and graphs; summarizing data numerically

Data Production

studying how data is collected

II. Variables
Definitions

1. An individual is an object described by a set of data.

Example____________________________________________________

2. A variable is a characteristic of an individual.

Example____________________________________________________

Example 1
Consider the set of students in this class.
The individuals in this set are_____________________________________
Some examples of variables that might describe them are

Variables used in statistics break down into the following categories

Qualitative or Categorical Variables Quantitative Variables
examples: examples:

Now we consider another distinction:

Discrete Variables Continuous Variables
definition: definition:

examples: examples:

Example 2 (from IPS)
Popular magazines often rank cities in terms of how desirable it is to live and work in each city. List five variables that you would measure for each city if you were designing a study. Give reasons for your choices.

Example 3 (from Iman)
Indicate whether the following variables are quantitative or qualitative.

1. Marital status _____________________________________________

2. Sex _____________________________________________________

3. Occupation _______________________________________________

4. Social Security # ___________________________________________

5. Number of children at home____________________________________

6. Annual income______________________________________________

7. Number of telephones in your home _______________________________

8. Whether you own or rent a home _________________________________

9. Type of credit card you use _____________________________________

10. Street number _____________________________________________

We will be examining the different values that variables take on and how often they take on these values. This information is summarized in what is called a distribution.

Definition
The distribution of a variable indicates what values a variable takes on and the frequency (i.e., how often) at which it takes on these values. We are often interested in graphing the distribution of a variable. There are a few types of graphs we could use, depending on what type of variable we are examining and what type of information we’d like to display.

Here is a list of graphs we will examine now. Boxplots and scatterplots will be covered later.

Qualitative or Categorical Variable Graphs Quantitative Variable Graphs

• bar chart

• pie chart

III. Categorical Graphs
We'll concentrate on categorical variables and their graphs for now.
Bar Charts (Qualitative / Categorical Data)

• Horizontal axis = categories

• Vertical axis = values

Note: Because the categories are distinct and one category does not flow into the next, the way real numbers on an x-axis do, there should be spaces between the bars. (That's not to say everyone gets this right.)
Example: Rainfall (Brase/Brase, Understanding Statistics)

The information listed below gives the average monthly rainfall throughout the year in Honolulu, Hawaii:

Month Rainfall (in.)

Jan. 4.4

Feb. 2.46

Mar. 3.18

Apr. 1.36

May 0.96

June 0.32

July 0.60

Aug. 0.76

Sept. 0.67

Oct. 1.51

Nov. 2.99

Dec. 3.64

1. Make a bar graph of this information with month on the horizontal axis and rainfall on the vertical.

2. There is the rainy season and the dry season. From the graph, which 6 months would you say make up the rainy season?

3. Without rain insurance, which winter month (Nov., Dec., Jan., or Feb.) would be the best time to plan a trip?
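If you would like to check a hand-drawn bar graph, the following is a minimal Python sketch that prints a text version of it. The function name `text_bar_chart` and the `scale` parameter are made up for this illustration. Note that the bars run sideways here, so the categories sit on the vertical axis rather than the horizontal one.

```python
# Average monthly rainfall in Honolulu (inches), copied from the table above.
rainfall = {
    "Jan.": 4.4, "Feb.": 2.46, "Mar.": 3.18, "Apr.": 1.36,
    "May": 0.96, "June": 0.32, "July": 0.60, "Aug.": 0.76,
    "Sept.": 0.67, "Oct.": 1.51, "Nov.": 2.99, "Dec.": 3.64,
}


def text_bar_chart(data, scale=10):
    """Return one line per category; each '#' stands for 1/scale of an inch."""
    lines = []
    for label, value in data.items():
        bar = "#" * round(value * scale)
        lines.append(f"{label:<6} {bar} {value}")
    return lines


chart = text_bar_chart(rainfall)
print("\n".join(chart))
```

Each month gets its own separate bar, which mirrors the rule that bars in a bar chart do not touch.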

Pie Charts or Circle Graphs (Qualitative / Categorical Data)

• Circle divided up into sectors to represent categories.

• Angles for sectors are proportional to percentage weights of categories.

Example: Causes for Lateness (Brase/Brase, Understanding Statistics)

Suppose you want to arrive at your college 15 minutes before your first class so that you can feel relaxed when you walk into class. An early arrival time allows you room for unexpected delays. However, you always find yourself arriving "just in time" or slightly late. What causes you to be late? One student made a list of possible causes and then kept a checklist for 2 months. On some days, more than one item was checked because several events occurred that caused the student to be late. Construct a pie chart using this data.

Causes for Lateness

Cause Frequency

Snoozing after alarm goes off 15

Car trouble 5

Too long over breakfast 13

Last-minute studying 20

Finding something to wear 8

Talking too long with roommate 9

Other 3
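Since sector angles are proportional to each category's share of the total, the arithmetic for the pie chart can be sketched in Python as follows. The helper name `sector` is invented for this illustration; the frequencies are the ones in the table above.

```python
# Frequencies from the "Causes for Lateness" table.
causes = {
    "Snoozing after alarm goes off": 15,
    "Car trouble": 5,
    "Too long over breakfast": 13,
    "Last-minute studying": 20,
    "Finding something to wear": 8,
    "Talking too long with roommate": 9,
    "Other": 3,
}

# Total number of check marks (not the number of days, since some days
# had more than one cause checked).
total = sum(causes.values())


def sector(frequency, total):
    """Return (percent of the whole, central angle in degrees) for one category."""
    share = frequency / total
    return 100 * share, 360 * share


sectors = {cause: sector(freq, total) for cause, freq in causes.items()}
for cause, (percent, angle) in sectors.items():
    print(f"{cause:<32} {percent:5.1f}%  {angle:6.1f} degrees")
```

The percents should add to 100 and the angles to 360, apart from rounding in the printed values.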

Excel:

Note that Excel's chart wizard is a great way to make graphs. I'll post a how-to on the Resources page on the course site. Other spreadsheet programs have similar utilities.

We now turn to several examples of graphs for quantitative data.
IV. Dotplots
Example 6

Let X = the number of letters in the first name of a student in this class. Construct a dotplot of the data obtained from this class.

Definition (BPS)

An outlier in any graph is an individual observation that falls outside the overall pattern of the graph.

Question: Does the dotplot above appear to have any outliers?

Note: We will study more precise ways of determining if an observation is an outlier of a data set in the weeks to come.
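As a rough illustration of how a dotplot is built, here is a short Python sketch. The list of names is hypothetical and invented purely for this example (it is not class data), and `dotplot_rows` is a made-up helper name. The plot is printed sideways, with one row per value and one dot per observation.

```python
from collections import Counter

# Hypothetical first names, invented for illustration only.
names = ["Ana", "Ben", "Chloe", "David", "Emma", "Finn", "Grace",
         "Henry", "Isabella", "Jack", "Maximilian"]

# X = number of letters in each first name.
lengths = [len(name) for name in names]
counts = Counter(lengths)


def dotplot_rows(counts):
    """One row per value from the minimum to the maximum; one dot per observation."""
    return [f"{x:>2} | " + "." * counts[x]
            for x in range(min(counts), max(counts) + 1)]


rows = dotplot_rows(counts)
print("\n".join(rows))
```

Values with no observations still get a row, so gaps in the data stay visible. That is what makes an observation sitting far from the rest stand out.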

V. Histograms
In many cases, a histogram is a more useful visualization than a dotplot.
We begin with a definition:

Definition: Range

The range of a set of numbers is the difference between the largest and smallest numbers in the set, i.e.

range = (largest value) - (smallest value)
Before we can construct a histogram, we need to build a slightly different sort of frequency table. Here we divide the range into equally-sized classes and look at how many observations fall into each class.
We need to formalize a few more definitions:

Definitions and Properties for Grouped Data:

• We divide the range into equally-sized intervals we call classes or class intervals.

• The smallest number in a class is called the lower class limit.

• The largest number in a class is called the upper class limit.

• The classes must be designed so that each number in the set falls into exactly one class.

• The midpoint of a class lies halfway between the limits, i.e.
midpoint = (lower class limit + upper class limit) / 2

• The class width is the difference between consecutive lower class limits.

• When we divide data into classes, we produce what we call grouped data.

We examine histograms in the context of an example:

Example 7 (Understanding Statistics): Nurses
Nurses on the eighth floor of Community Hospital believe they need extra staffing at night. To estimate the night workload, a random sample of 35 nights was used. For each night the total number of room calls to the nurses' station on the eighth floor was recorded as follows:
68 60 69 70 83 58 90 86 71 71

92 95 70 74 46 18 84 82 75 63

101 77 102 80 86 85 73 86 62 100

90 37 88 70 87

Build a histogram by following these steps:

1. Enter this data in a list on the calculator and sort the list.

2. Compute the range of this data set.

3. We'll use a total of five classes for our histogram. How wide should these classes be? What are they?

4. Find the midpoints of each of the classes.

5. Tally the frequencies for each class.

Classes Frequency

6. Draw the histogram. Put the frequency of each class on the y-axis and center each bar over its class midpoint on the x-axis.
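The steps above can be sketched in Python to check your work. This sketch assumes one common convention for the class width: divide the range by the number of classes and move up to the next whole number. It also assumes the first class starts at the smallest observation. Other reasonable choices of classes will give a slightly different table.

```python
# Room calls per night, from Example 7.
calls = [68, 60, 69, 70, 83, 58, 90, 86, 71, 71,
         92, 95, 70, 74, 46, 18, 84, 82, 75, 63,
         101, 77, 102, 80, 86, 85, 73, 86, 62, 100,
         90, 37, 88, 70, 87]

num_classes = 5
data_range = max(calls) - min(calls)        # 102 - 18 = 84

# Assumed convention: range / number of classes, moved up to the next
# whole number, so the classes are guaranteed to cover all the data.
width = data_range // num_classes + 1       # 84 // 5 + 1 = 17

# Build (lower class limit, upper class limit) pairs.
classes = []
lower = min(calls)
for _ in range(num_classes):
    upper = lower + width - 1
    classes.append((lower, upper))
    lower += width

midpoints = [(lo + hi) / 2 for lo, hi in classes]
frequencies = [sum(lo <= x <= hi for x in calls) for lo, hi in classes]

for (lo, hi), mid, freq in zip(classes, midpoints, frequencies):
    print(f"{lo:>3}-{hi:<3}  midpoint {mid:5.1f}  frequency {freq:>2}  " + "#" * freq)
```

Because each class runs from its lower limit to its upper limit with no overlap, every observation lands in exactly one class, and the frequencies add up to 35.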

Notes of importance:

1. Once we turn data into grouped data, we can never go back and recover the original data.

2. Histograms can be created using the TI graphing calculators. (It's under "Stat Plot," accessed by hitting 2nd then Y=. You need to turn on the plot, choose the histogram, and choose the list. You'll also want to go to ZOOM and select 9:ZoomStat.)

3. When constructing a histogram, it is important to remember that all classes should have the same size (width). It is recommended that a histogram have somewhere between 5 and 20 classes (probably closer to 5). To find an appropriate number of classes, it is helpful to use the range, as we did above. Note that the intervals can go beyond the range a bit, within reason. In other words, it is usually more sensible to have integer class widths and go a little too high or a little too low than to use a class width like 14.2.

4. Bars of a histogram, in general, should be connected (unlike the bars of a bar graph, which are not).

5. A relative frequency histogram has the same shape as a histogram with the exception that the vertical axis measures relative frequencies (percents) instead of frequencies.

Example 8
Construct a relative frequency histogram using the data in example #7.
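A relative frequency is a class frequency divided by the number of observations. The sketch below computes them for the Example 7 data. The class limits are an assumption (five classes of width 17 starting at the smallest observation, 18), so substitute your own classes if you chose differently.

```python
# Room calls per night, from Example 7.
calls = [68, 60, 69, 70, 83, 58, 90, 86, 71, 71,
         92, 95, 70, 74, 46, 18, 84, 82, 75, 63,
         101, 77, 102, 80, 86, 85, 73, 86, 62, 100,
         90, 37, 88, 70, 87]

# Assumed classes: five classes of width 17 starting at the minimum.
class_limits = [(18, 34), (35, 51), (52, 68), (69, 85), (86, 102)]

n = len(calls)
relative = []
for lo, hi in class_limits:
    freq = sum(lo <= x <= hi for x in calls)   # class frequency
    relative.append(freq / n)                  # relative frequency

for (lo, hi), rel in zip(class_limits, relative):
    print(f"{lo:>3}-{hi:<3}  {rel:6.1%}")
```

The relative frequencies add to 1 (that is, 100%), and the bars keep the same relative heights as in the ordinary histogram.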
We want to be able to describe data by interpreting its histogram. The key features of a histogram worth noticing are:

1. The center of the histogram (more than one way to measure this)

2. The spread of the histogram (more than one way to measure this too)

3. The shape of the histogram (usually only one way to describe this)

There are three basic shapes you need to know:

1. Symmetric

2. Skewed right

3. Skewed left

Be careful with skewed left and skewed right!! It's very easy to confuse them!

One way to remember the distinction is to start with a symmetric graph and imagine someone stepping on the left side, flattening it into a long tail and pushing more observations to the right. This is the skewed left situation: the long tail points left. For skewed right, imagine someone stepping on the right side of the graph and pushing more observations to the left, so the long tail points right.

VI. Stemplots or Stem and Leaf Displays
Example 9 (Understanding Statistics): How old are rich people?

Forbes Richest People gives the profile of the world’s wealthiest men and women. Do you have to be old to be worth at least \$2 billion? You can answer this question yourself by studying the following data: ages of men and women worth at least \$2 billion:
40 66 43 82 52 58 77 52 50 48 47
68 66 73 76 53 67 88 40 79 73 66
65 70 72 77 48 75 82 54 76 41 93
65 60 57 74 70 83 67 68 77 66 34
66 59 48 56 71 40 53 63 52 57 83
52 60 56 71 64 61 53 53 73 70
Make a stem and leaf diagram for this data (feel free to use your graphing calculator to sort the numbers in increasing order).

Notes of importance:

1. To construct a stem and leaf plot, simply remove the last digit of each number in the data (these will be used as leaves) and use the remaining digits as a row label (called a stem label). An entire row of a stem plot is called a stem.

2. Leaves should be arranged in increasing order along each stem.

3. A stemplot preserves the original data in a data set whereas a histogram does not.

4. Stems may also be split (which is often done when there are a large number of leaves in one row). For example, consider the ages in the last example. Instead of listing all the ages in the 50s on one stem, one could list the ages from 50 to 54 on one stem and the ages from 55 to 59 on another. This part of the stemplot would look like this:

5 | 0 2 2 2 2 3 3 3 3 4

5* | 6 6 7 7 8 9
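The notes above translate fairly directly into code. Here is a minimal Python sketch for the ages in Example 9. The function name `stemplot` and its `split` option are made up for this illustration, and a starred label such as `5*` is used for the upper half of a split stem.

```python
from collections import defaultdict

# Ages from Example 9.
ages = [40, 66, 43, 82, 52, 58, 77, 52, 50, 48, 47,
        68, 66, 73, 76, 53, 67, 88, 40, 79, 73, 66,
        65, 70, 72, 77, 48, 75, 82, 54, 76, 41, 93,
        65, 60, 57, 74, 70, 83, 67, 68, 77, 66, 34,
        66, 59, 48, 56, 71, 40, 53, 63, 52, 57, 83,
        52, 60, 56, 71, 64, 61, 53, 53, 73, 70]


def stemplot(values, split=False):
    """Map each stem label to its leaves, in increasing order.

    With split=True, leaves 0-4 go on the plain stem and leaves 5-9 go
    on a starred stem such as '5*'.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)       # last digit is the leaf
        label = f"{stem}*" if split and leaf >= 5 else str(stem)
        stems[label].append(leaf)
    return dict(stems)


plot = stemplot(ages)
for label, leaves in plot.items():
    print(f"{label:>2} | " + " ".join(map(str, leaves)))
```

Calling `stemplot(ages, split=True)` gives the split version. One limitation of this sketch: a stem with no leaves is simply left out, whereas a hand-drawn stemplot should still show the empty stem.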
Example 10 (Understanding Statistics)

Tel-a-Message is experimenting with computer-delivered telephone advertisements. Of primary concern is how much of the 4-minute advertisement is heard. A study was done to see how long the advertisement ran before the listeners hung up. A random sample of 30 calls gave the information listed below. Construct a stemplot using this data.

1.3 0.7 2.1 0.5 0.2 0.9 1.1 3.2 4.0 3.8

1.4 3.1 2.5 0.6 0.5 2.1 4.0 4.0 0.3 1.2

1.0 1.5 0.4 4.0 2.3 2.7 4.0 0.7 0.5 4.0

Back-to-back Stemplots
Example 11 (BPS)

Here are the numbers of home runs that Babe Ruth hit in each of his 15 years with the New York Yankees, 1920-1934:

54 59 35 41 46 25 47 60

54 46 49 46 41 34 22

Babe Ruth’s home run record for a single season was broken by another Yankee, Roger Maris, who hit 61 home runs in 1961. Here are Maris’ home run totals for his 10 years in the American League:
13 23 26 16 33 61 28 39 14 8
Draw a back-to-back stemplot illustrating this data. Comment on the shapes of the two distributions.
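A back-to-back stemplot shares one column of stems, with one data set's leaves to the left and the other's to the right. The following Python sketch lays this out for the Ruth and Maris data. The helper name `leaves_by_stem` is invented for this illustration.

```python
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [13, 23, 26, 16, 33, 61, 28, 39, 14, 8]


def leaves_by_stem(values):
    """Return {stem: leaves in increasing order}, using the tens digit as the stem."""
    table = {}
    for v in sorted(values):
        table.setdefault(v // 10, []).append(v % 10)
    return table


left, right = leaves_by_stem(ruth), leaves_by_stem(maris)

# Use every stem from the smallest to the largest, even if it is empty on one side.
stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)

rows = []
for stem in stems:
    # Left-hand leaves are written in reverse so they increase away from the stem.
    left_leaves = " ".join(str(d) for d in reversed(left.get(stem, [])))
    right_leaves = " ".join(str(d) for d in right.get(stem, []))
    rows.append(f"{left_leaves:>15} | {stem} | {right_leaves}")

print("Ruth".rjust(15) + " |   | Maris")
print("\n".join(rows))
```

Sharing the stems puts both distributions on the same scale, which makes it easier to compare their shapes and centers.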

Example 12 (SDA): National League Stadiums

Consider the seating capacities of baseball stadiums in the National League, as shown in the table below:

Atlanta Braves 52,003

Chicago Cubs 38,170

Cincinnati Reds 52,952

Florida Marlins 47,226

Houston Astros 54,816

L.A. Dodgers 56,000

Montreal Expos 43,739

New York Mets 55,601

Pittsburgh Pirates 58,727

St. Louis Cardinals 56,227

San Francisco Giants 58,000
a. Draw a stem and leaf plot, using only the first two digits of each number.

b. Draw a histogram illustrating this data.

VII. Time Plots
Time plots are used to illustrate bivariate quantitative data where the independent variable represents time.
Example 13 (text)

The Virginia Department of Motor Vehicles publishes data each year on the number of road fatalities, pedestrian fatalities and alcohol-related fatalities in the state. This information is used as a stimulus for public safety awareness programs, legislation on speed limits and the use of seat belts, highway engineering projects and similar purposes. Here are the results for an eleven-year period:

Year Total Pedestrian Alcohol-related

1986 1118 141 492

1987 1022 117 418

1988 1069 131 522

1989 999 141 480

1990 1071 116 535

1991 938 112 429

1992 839 93 379

1993 875 112 397

1994 925 101 376

1995 900 93 360

1996 869 114 346

a. Make a time plot for the total number of road fatalities. Does there appear to be a trend? If so, describe it. Can you give some possible reasons for what has happened?

b. Make a time plot for the number of alcohol-related fatalities. Answer the same questions as in (a).
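For a quick look at the data before drawing the graph by hand, here is a rough Python sketch of a text time plot for the total fatalities column. The function name `time_plot_rows` and the `width` parameter are made up for this illustration. The plot is rotated: time runs down the page, and the `*` sits further to the right for larger values.

```python
# Total road fatalities by year, from the table above.
years = list(range(1986, 1997))
total = [1118, 1022, 1069, 999, 1071, 938, 839, 875, 925, 900, 869]


def time_plot_rows(times, values, width=40):
    """One row per time point; the '*' is placed in proportion to the value."""
    low, high = min(values), max(values)
    rows = []
    for t, v in zip(times, values):
        offset = round((v - low) / (high - low) * width)
        rows.append(f"{t} |" + " " * offset + f"* {v}")
    return rows


rows = time_plot_rows(years, total)
print("\n".join(rows))
```

Replacing `total` with the pedestrian or alcohol-related column gives the corresponding plot for part (b).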

VIII. Cumulative Relative Frequency and Ogives
Definitions:

1. A cumulative frequency is the number of observations less than or equal to a given number.

2. A cumulative relative frequency is the cumulative frequency divided by the total number of observations. (So what kind of numbers do we get?)

3. The graph of a cumulative relative frequency is called a cumulative relative frequency graph (mmmm…. creative) or an ogive.

Example 14 (Iman)

The following is a list of 20 typing test scores (net words per minute):

68 72 91 47 52 75 53 55 65 35

84 45 58 61 69 22 46 55 66 71

Let's first write down the cumulative frequency and cumulative relative frequency for this data:

Scores Cumulative Frequency Cumulative Relative Frequency

22

35

45

46

47

52

55 (two of them)

58

61

63

65

66

68

69

71

72

75

84

91

Definition (Iman): An empirical distribution function (e.d.f.) is a graph of the cumulative relative frequency vs. the raw data in the sample. It is a form of a step function.
On the back, draw the empirical distribution function for the above data.
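The cumulative columns and the e.d.f. can be checked with a short Python sketch. It uses the 20 scores exactly as listed at the start of Example 14. The names `table` and `edf` are made up for this illustration.

```python
from collections import Counter

# Typing test scores, exactly as listed at the start of Example 14.
scores = [68, 72, 91, 47, 52, 75, 53, 55, 65, 35,
          84, 45, 58, 61, 69, 22, 46, 55, 66, 71]

n = len(scores)
counts = Counter(scores)

# Rows of (score, cumulative frequency, cumulative relative frequency).
table = []
running = 0
for score in sorted(counts):
    running += counts[score]
    table.append((score, running, running / n))


def edf(x):
    """Empirical distribution function: the fraction of scores <= x."""
    return sum(s <= x for s in scores) / n


for score, cum, rel in table:
    print(f"{score:>3}  {cum:>2}  {rel:.2f}")
```

The function `edf` is a step function: it stays flat between observed scores and jumps at each one, by 1/20 for a single score and by 2/20 where two scores tie.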

Homework: #1.1-1.11, 1.12a-d, 1.14, 1.17, 1.20-1.25, 1.28
