**Teaching the Reasoning of Statistical
Inference**

*A "Top Ten" List*

**Allan J. Rossman and Beth L. Chance**

"Certainly the present trend toward reemphasizing actual experience with data analysis in beginning instruction before plunging into probability and inference makes sense pedagogically as well as presenting a more balanced introduction to statistical practice.... Yet teachers of statistics, however we face the pedagogical obstacles posed by the difficulty of probability ideas, are obligated to present at least the basic reasoning of confidence intervals and significance testing as essential parts of our subject." - David Moore [16, p. 8]

During the past decade, a reform movement in statistics education
has emphasized that introductory statistics courses should focus on student
experiences with data and understanding of fundamental concepts. Cobb [6]
summarized the guidelines of an MAA/ASA Joint Committee on Undergraduate
Statistics as:

• Emphasize statistical thinking.

• More data and concepts; less theory, fewer recipes.

• Foster active learning.

Additional references on statistics education reform and current statistical practice include Gordon and Gordon [11], Cobb [5], Hoaglin and Moore [13], and Cobb and Moore [7]. The guidelines also correspond with the philosophy of the recently developed Advanced Placement Statistics syllabus [8].

As the quote from David Moore suggests, these principles have been incorporated more readily into the teaching of data analysis than into the teaching of statistical inference. Moreover, instructors of introductory statistics often treat statistical inference as an isolated subject with little connection to the issues of exploratory data analysis and data collection that precede it in most courses. With apologies to David Letterman, we offer the following "Top Ten" list of recommendations for teaching the reasoning of statistical inference. We do not mean to imply an order of importance for these recommendations. Rather, our goal is to focus on the following themes: student investigation and discovery of inferential reasoning, proper interpretation and cautious use of results, and effective communication of findings. We include examples of activities and exercises that illustrate the principles of these suggestions.

**#10 Have students perform physical simulations to discover basic
ideas of inference.**

We contend that simulation, not formal probability, provides the most effective introduction to sampling distributions and to concepts of inference. Surveying recent research in mathematics and statistics education, Garfield [10] stated that the use of simulations can help students learn concepts through visualization and manipulation of concrete representations of abstract ideas. Moore [15] remarked that simulations offer "an alternative to proofs and algebraic derivations as a way of convincing students of the truth of important facts."

While modern technology performs simulations quickly and efficiently, we worry that students fail to connect the numbers and displays being produced with the process being simulated. We therefore advocate beginning with physical simulations, where students literally get a hands-on view of the process.

*Example:* We ask each student in the class to take a sample of
25 Reese’s Pieces candies and to calculate the proportion of orange candies
in the sample (see [19]). Students then aggregate their results with their
classmates’ to discover the simple but fundamental idea of sampling variability.
Students observe first-hand that the outcomes of a statistic vary from
sample to sample under repeated random sampling from the same population.
They also notice that while these values differ, a predictable pattern
emerges. Once this idea is clearly established, students can use technology
to perform simulations more efficiently and to develop their understanding
of properties of the sampling distribution.
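Once students have done the physical version, the same process can be mirrored in software. The Python sketch below simulates many students each drawing a sample of 25 candies; the population proportion of orange candies (.45) is a hypothetical stand-in, since the true proportion is not stated here.

```python
import random
from statistics import mean, stdev

random.seed(1)

# Assumed population proportion of orange candies (hypothetical; the true
# proportion for Reese's Pieces is not stated in the text)
P_ORANGE = 0.45
SAMPLE_SIZE = 25
NUM_STUDENTS = 500   # each "student" draws one sample of candies

def sample_proportion(p, n):
    """Proportion of orange candies in one random sample of n candies."""
    return sum(random.random() < p for _ in range(n)) / n

props = [sample_proportion(P_ORANGE, SAMPLE_SIZE) for _ in range(NUM_STUDENTS)]

# The proportions vary from sample to sample (sampling variability)...
print(f"min = {min(props):.2f}, max = {max(props):.2f}")
# ...but follow a predictable pattern: centered near p, spread near sqrt(p(1-p)/n)
print(f"mean = {mean(props):.3f}, sd = {stdev(props):.3f}")
print(f"theory: mean = {P_ORANGE}, sd = {(P_ORANGE * (1 - P_ORANGE) / SAMPLE_SIZE) ** 0.5:.3f}")
```

The printed summaries echo what the class discovers by pooling its candy samples: individual proportions scatter widely, yet their distribution is predictable.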

*Example:* Landwehr, Swift, and Watkins [14] introduce the idea
of confidence intervals without resorting to formulas. Using a fixed sample
size, they ask groups of students to draw samples using a specific population
proportion, varying this value among the groups. Each group then produces
a "90% boxplot" containing the middle 90% of their generated proportions.
Students then construct a chart which displays the 90% boxplots for the
different values of the population proportion. Given a new sample proportion,
students see which of these boxplots overlap with this value; these boxplots
indicate which population proportions could reasonably have produced the
new sample proportion. Students thus interpret a confidence interval as
a set of plausible values for the population parameter based on the observed
sample statistic.
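This activity, too, can be mirrored on a computer. The following Python sketch uses a hypothetical sample size of 50 and a hypothetical observed proportion of .40 (neither taken from the original activity) to build the "middle 90%" interval for each candidate population proportion and report which candidates could plausibly have produced the observed value:

```python
import random

random.seed(6)

SAMPLE_SIZE = 50     # hypothetical fixed sample size for the activity

def middle_90(p, reps=1000):
    """Middle 90% of simulated sample proportions for population proportion p."""
    props = sorted(sum(random.random() < p for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
                   for _ in range(reps))
    return props[50], props[949]   # approximate 5th and 95th percentiles

observed = 0.40      # a new sample proportion to be classified (hypothetical)

plausible = []
for tenth in range(1, 10):
    p = tenth / 10
    lo, hi = middle_90(p)
    if lo <= observed <= hi:
        plausible.append(p)

# Population proportions whose "90% boxplot" covers the observed value form
# a set of plausible values: the germ of a confidence interval
print(f"plausible values of p: {plausible}")
```

The plausible values cluster around the observed proportion, just as the overlapping boxplots do on the students' chart.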

*Example:* Scheaffer, Gnanadesikan, Watkins, and Witmer [20] introduce
the concept of statistical significance by asking students to shuffle and
deal cards. This activity pertains to a question of sex discrimination
in a company’s firing of employees. The cards represent employees retained
or dismissed, and students use the cards to investigate how often the actual
number of females dismissed would occur by chance. This activity helps
students to understand and to verbalize the concept of a *p*-value
before they have ever seen an inference formula.
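The card shuffling can later be automated. This Python sketch uses hypothetical counts (15 women and 25 men, 10 dismissals, 7 of them women; the activity's actual numbers may differ) to estimate how often chance alone would dismiss at least as many women as were actually dismissed:

```python
import random

random.seed(2)

# Hypothetical counts for illustration (the activity's actual numbers may differ):
# a company with 15 female and 25 male employees dismisses 10 people,
# 7 of whom are women.  Could chance alone plausibly produce 7 or more?
employees = ["F"] * 15 + ["M"] * 25
NUM_DISMISSED = 10
OBSERVED_FEMALES = 7

def females_dismissed_by_chance(deck):
    """Shuffle the 'cards' and deal the dismissed group at random."""
    random.shuffle(deck)
    return deck[:NUM_DISMISSED].count("F")

trials = 10_000
extreme = sum(females_dismissed_by_chance(employees) >= OBSERVED_FEMALES
              for _ in range(trials))

# Empirical p-value: how often chance alone matches or exceeds the observed count
print(f"estimated p-value = {extreme / trials:.3f}")
```

The estimated p-value is exactly what students approximate by tallying their repeated deals: the proportion of shuffles at least as extreme as the observed outcome.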

**#9 Encourage students to use technology to explore properties of
inference procedures.**

The power of modern computers and graphing calculators allows instructors to shift the emphasis from students performing extensive hand calculations to students exploring the underlying concepts and properties of inference procedures. After students have conducted physical simulations to become comfortable with the idea of repeated samples, technology enables us to extend these ideas quickly and efficiently. For example, students can discover for themselves what the phrases "95% confidence" and "significance level" represent.

*Example:* Patterned after an exercise of Moore and McCabe [18],
we provide students with a population of 1000 hypothetical SAT-M scores
and ask them to use the computer to calculate the mean of this population,
which turns out to equal 500. Students then use the computer to take 50
different samples of size 100 from this population and to construct a confidence
interval for the population mean from each sample. Since students know
the value of the population mean, they can then count how many of these
intervals contain the population parameter. They also see first-hand that
the intervals which fail to capture the population mean arise from samples
with unusually high or low sample means. Through this activity students
develop an understanding of the notion of "confidence" as describing how
often a confidence interval captures the population parameter in the long
run. Similarly, students test at the 5% level whether the population mean
equals 500 for each sample and then count how many of the samples would
lead to a false rejection of the null hypothesis. This activity helps students
to develop intuition about Type I error.
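A version of this activity can be sketched in Python. The population below is a randomly generated stand-in for the 1000 hypothetical SAT-M scores (the article's actual data set is not reproduced here), and the intervals use the large-sample z critical value:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(3)

# Stand-in population of 1000 hypothetical SAT-M scores (the article's actual
# data set is not reproduced here), constructed to have mean near 500
population = [round(random.gauss(500, 100)) for _ in range(1000)]
mu = mean(population)

z = NormalDist().inv_cdf(0.975)   # critical value for 95% confidence

def confidence_interval(sample):
    """Large-sample 95% confidence interval for the mean."""
    xbar, s, n = mean(sample), stdev(sample), len(sample)
    half = z * s / n ** 0.5
    return xbar - half, xbar + half

hits = 0
for _ in range(50):
    lo, hi = confidence_interval(random.sample(population, 100))
    hits += lo <= mu <= hi

# In the long run, roughly 95% of such intervals capture the population mean
print(f"{hits} of 50 intervals contain the population mean {mu:.1f}")
```

Since the population mean is known here, students can check directly which intervals miss it, making the long-run meaning of "95% confidence" concrete.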

Technology can also free students from computational drudgery, allowing them to concentrate on exploring properties of the inferential procedures such as the effects of the sample size or confidence level.

*Example:* We ask students whether results from a random sample
provide strong evidence that more than half of a population favors a certain
candidate, telling them only that 54% of the sample are in favor (see [19]).
Students realize that the answer depends on the sample size used, and they
use technology to determine the smallest sample size for which the result
is significant. Students further investigate how this minimum sample size
changes for different significance levels. This activity reinforces the
idea that one cannot base decisions solely on point estimates.
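The search students perform with technology can be sketched directly in Python, using the usual normal approximation for the one-sided test of a proportion:

```python
from statistics import NormalDist

P_HAT = 0.54     # observed sample proportion in favor of the candidate
P_NULL = 0.50    # null hypothesis: exactly half the population is in favor
ALPHA = 0.05

def p_value(n):
    """One-sided p-value for Ha: p > .5 (normal approximation)."""
    se = (P_NULL * (1 - P_NULL) / n) ** 0.5
    z = (P_HAT - P_NULL) / se
    return 1 - NormalDist().cdf(z)

# Search for the smallest sample size giving a significant result
n = 1
while p_value(n) >= ALPHA:
    n += 1
print(f"smallest sample size significant at the {ALPHA:.0%} level: {n}")
# Rerunning with ALPHA = 0.01 shows how the requirement grows at stricter levels
```

Students see that 54% in favor is persuasive only when the sample is large enough, and that stricter significance levels push the required sample size higher still.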

In addition, technology enables students to investigate more complicated sampling distributions, such as those arising in regression or chi-square analyses.

*Example:* Returning to the SAT data, we introduce 1000 corresponding
GPA values and ask students to calculate the population regression equation.
Students then repeatedly sample 100 pairs of observations from this population
and calculate the regression equation for each sample. Students plot the
sample regression lines to visualize how they vary about the population
regression line. By examining the sample regression coefficients, students
observe the normality, variability, and unbiasedness of these sampling
distributions.
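The same exploration can be sketched in Python; the population of (SAT-M, GPA) pairs below is hypothetical, built with a linear trend plus noise rather than taken from the article's data set:

```python
import random
from statistics import mean, stdev

random.seed(4)

# Hypothetical population of 1000 (SAT-M, GPA) pairs with a built-in linear
# trend plus noise; the article's actual data set is not reproduced here
sats = [random.gauss(500, 100) for _ in range(1000)]
gpas = [1.0 + 0.004 * s + random.gauss(0, 0.4) for s in sats]

def slope(xs, ys):
    """Least-squares slope of the regression of y on x."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

beta = slope(sats, gpas)   # the population regression slope

# Repeatedly sample 100 pairs and compute the sample regression slope
slopes = []
for _ in range(200):
    idx = random.sample(range(1000), 100)
    slopes.append(slope([sats[i] for i in idx], [gpas[i] for i in idx]))

# Sample slopes vary around the population slope (approximately unbiased)
print(f"population slope {beta:.4f}, mean of sample slopes {mean(slopes):.4f}")
print(f"sd of sample slopes {stdev(slopes):.5f}")
```

Plotting the 200 sample regression lines (with any graphics package) makes the same point visually: the lines fan out around the population regression line.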

**#8 Present tests of significance in terms of p-values rather
than rejection regions.**

Not only do *p*-values provide more information than simple statements
of rejection, they also better reflect statistical practice. We try to
help students realize that while the significance level allows one to make
a decision, the *p*-value expresses the strength of evidence provided
by the sample data.

*Example:* We ask students to decide (at the 5% significance level)
whether more than half of a company’s customers are women, based on two
random samples of 200 customers (see [19]). In the first, 112 of the customers
are women (*p*-value = .0448), and in the second, 124 are (*p*-value
= .0003). Students indicate whether their report to the company would be
the same in both cases. While they reject the null hypothesis in both cases,
students realize that the sample results are quite different and that the
*p*-value
provides more information than a simple statement of rejection.
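The two quoted p-values can be reproduced with a few lines of Python, using the normal approximation without a continuity correction (which matches the figures in the text):

```python
from statistics import NormalDist

def p_value(successes, n, p_null=0.5):
    """One-sided p-value for Ha: p > p_null (normal approximation, without
    a continuity correction, matching the figures quoted in the text)."""
    se = (p_null * (1 - p_null) / n) ** 0.5
    z = (successes / n - p_null) / se
    return 1 - NormalDist().cdf(z)

# 112 women among 200 customers: significant at the 5% level, but just barely
print(f"p-value for 112/200 = {p_value(112, 200):.4f}")
# 124 women among 200 customers: overwhelming evidence
print(f"p-value for 124/200 = {p_value(124, 200):.4f}")
```

Seeing .0448 next to .0003 makes the point vividly: both reject at the 5% level, but the strength of evidence differs enormously.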

*Example:* In an episode of the television series *ER*, a
doctor excitedly reports that the *p*-value of his study is currently
.06 so that he is just "one successful outcome away from statistical significance."
He eagerly begins looking for that one last patient so that his work can
be published. We have students critique this argument, reinforcing cautions
against using fixed significance levels. The example becomes even more
dramatic when another doctor realizes that unsuccessful outcomes have been
dubiously dropped from the study; we ask students to comment on this practice
as well.

**#7 Accompany tests of significance with confidence intervals whenever
possible.**

Confidence intervals provide more information than tests of significance
but are generally underutilized in statistical teaching and practice. While
a test of significance indicates whether a sample result is statistically
significant, a confidence interval estimates the magnitude of the population
parameter. This allows one to assess the practical significance of the
sample result. Students need to understand the difference between "strong
evidence of an effect" (a low *p*-value) and a "strong effect" (e.g.,
a very large difference in means).

*Example:* Utts [21] discusses a meta-analysis that found a statistically
significant difference in cholesterol reduction between subjects who consumed
oat bran and a control group. However, a 95% confidence
interval for the mean amount of reduction extended from 3.3 mg/dl to 8.4
mg/dl, suggesting that the magnitude of the difference was actually quite
small relative to average cholesterol levels of about 210 mg/dl.

*Example:* Students analyze data reported in *The 1992 Statistical
Abstract of the United States* that 30.5% of a sample of 40,000 American
households own a pet cat (see [19]). We ask whether this sample provides
strong evidence that less than one-third of the population of all American
households owns a cat and then whether it provides evidence that much less
than one-third owns a cat. A significance test answers the first question
in the affirmative (*p*-value < .0001), but a confidence interval
supplies the additional information needed to answer the second question
in the negative (95% c.i.: (.300, .310)). Students discern that large sample
sizes can often lead to statistically significant results that are not
practically significant.
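Both the test and the interval quoted above can be reproduced in a short Python sketch, using the normal approximation and a z critical value of 1.96:

```python
from statistics import NormalDist

N = 40_000
P_HAT = 0.305        # 30.5% of sampled households own a pet cat
P_NULL = 1 / 3

# Significance test of Ha: p < 1/3 (normal approximation)
z = (P_HAT - P_NULL) / (P_NULL * (1 - P_NULL) / N) ** 0.5
p_value = NormalDist().cdf(z)
print(f"z = {z:.1f}, p-value < .0001: {p_value < 0.0001}")

# 95% confidence interval: the proportion is only barely below one-third
half = 1.96 * (P_HAT * (1 - P_HAT) / N) ** 0.5
print(f"95% c.i.: ({P_HAT - half:.3f}, {P_HAT + half:.3f})")
```

The tiny p-value and the narrow interval together tell the whole story: the proportion is certainly below one-third, but only just.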

*Example:* We ask students to gather prices at two different grocery
stores on the same set of products and perform a matched pairs *t*-test.
While the significance test examines whether there is a price difference,
a confidence interval estimates the average amount of savings. Students
use the interval to decide if the amount of savings is enough to compensate
for other factors, such as additional travel time.
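A worked version of this analysis, with entirely hypothetical prices standing in for the students' own survey, can be sketched in Python:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical prices (in dollars) for the same ten products at two stores;
# in class the data would come from the students' own price survey
store_a = [2.49, 3.99, 1.19, 4.79, 2.99, 0.99, 5.49, 3.29, 2.19, 1.89]
store_b = [2.59, 3.89, 1.29, 5.09, 3.19, 1.09, 5.79, 3.49, 2.29, 2.09]

diffs = [b - a for a, b in zip(store_a, store_b)]
n = len(diffs)
dbar, sd = mean(diffs), stdev(diffs)
se = sd / sqrt(n)

t = dbar / se        # matched-pairs t statistic for H0: mean difference = 0
t_crit = 2.262       # t(9) critical value for 95% confidence
print(f"t = {t:.2f}")
print(f"95% c.i. for mean extra cost per item at store B: "
      f"(${dbar - t_crit * se:.2f}, ${dbar + t_crit * se:.2f})")
```

The test says the price difference is real; the interval says how large it is, which is what students need to weigh against travel time.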

**#6 Help students to recognize that insignificant results do not necessarily
mean that no effect exists.**

Just as statistical significance does not establish that an effect is practically important or even guarantee that the effect is present, lack of significance does not constitute proof of no effect. Instead, there may be an effect that the test procedure fails to detect, typically because the sample size was not large enough. We aim to help students develop an intuitive understanding of the subtle but important point that failing to reject the null hypothesis does not establish it to be true. Students should realize that the null hypothesis could be false but that the sample data did not provide sufficient evidence to reject it.

*Example:* In an activity illustrating the famous "Monty Hall Problem,"
we give three playing cards, two red and one black, to pairs of students.
The cards represent prizes behind doors used in a game show, one a winner
(black) and two losers (red). One student (dealer) shuffles the cards and
holds them facing away from the other (contestant). The contestant chooses
a card and the dealer reveals one of the two remaining cards to be red.
The contestant is then asked to either switch to the remaining card or
stay with the original choice, in an effort to find the (winning) black
card. We have the students play this game 20 times and perform a test of
significance to see if the proportion of wins using the switching strategy
is different from 1/2. Most students fail to reject this hypothesis (power
= .152). Since 1/2 agrees with many students’ intuition, they do not find
this result surprising. However, if they continue to play the game or reason
probabilistically, they discover that the actual probability of winning
with the switch strategy is 2/3. Thus, they learn that the original sample
size of 20 was not large enough to enable them to detect this difference.
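The game is easy to simulate in Python, which lets students continue "playing" far beyond the 20 games that a class session allows:

```python
import random

random.seed(5)

def switching_wins(num_games):
    """Count wins in simulated Monty Hall games played with the switching
    strategy; switching wins exactly when the first choice was a losing door."""
    wins = 0
    for _ in range(num_games):
        winning_door = random.randrange(3)
        first_choice = random.randrange(3)
        wins += first_choice != winning_door
    return wins

# A single class session: 20 games rarely give conclusive evidence against 1/2
print(f"wins in 20 games: {switching_wins(20)}")

# Continued play reveals the true winning probability of 2/3
for n in (100, 1_000, 100_000):
    print(f"proportion of wins in {n} games: {switching_wins(n) / n:.3f}")
```

With only 20 games the win count is usually consistent with a fair 1/2, while longer runs settle unmistakably near 2/3, which is exactly the lesson about sample size and power.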

**#5 Stress the limited role that inference plays in statistical analysis.**

While statistical inference is a widely used and very important class of techniques, it is just one component of a statistical analysis. Other important considerations include the design of the data collection procedure and an exploratory analysis of the data. Moreover, in many situations statistical inference procedures cannot even be applied in a meaningful manner. Since inference procedures are often overused, students should be taught to adopt a cautious attitude toward them.

Foremost among its limitations, statistical inference applies only to situations in which sample data have been randomly selected from a population or process or in which experimental subjects have been randomly divided into treatment groups. This reliance on randomization is crucial for helping students to understand what statistical inference is all about.

*Example:* We ask students to use the fact that 9 of the 100 U.S.
Senators in 1998 are women to construct a confidence interval for the proportion
of women in the 1998 Senate (see [19]). While the numbers are very easy
to substitute into the familiar formula, the interval is meaningless since
one knows with certainty the value of the population parameter in this
case.

Such examples help students focus on the purposes of statistical inference: drawing conclusions about a population based on a sample or about a treatment effect based on random allocation of subjects.

**#4 Always consider issues of data collection.**

Another temptation is to ignore issues of random sampling and experimental design when moving on to the inference part of the course. We strongly advocate that students be forced to confront these issues when making inferences. For example, the distinction between an observational study and a controlled experiment determines what conclusions can be drawn from a test of significance. Moreover, applying inference techniques to poorly collected data can produce very misleading conclusions.

*Example:* A classic example often used to illustrate Simpson’s
paradox is the data from a sex discrimination case against the University
of California at Berkeley’s graduate admissions process (see [3] for an
original source and [9] for an interesting account). We suggest having
students also analyze these data in the context of statistical inference.
A significance test reveals that the difference in the acceptance rates
between men and women (.446 and .305, respectively) is highly significant
(*p*-value < .0001), but students should realize that since the
data come from an observational study and not a controlled experiment,
they cannot conclude that discrimination occurred.

*Example:* An infamous example often used to illustrate the pitfalls
of biased sampling methods is the 1936 *Literary Digest* survey, in
which 57% of 2.4 million respondents favored Alf Landon over Franklin Roosevelt
in the presidential election. We introduce students to this example early
in the course as an illustration of how improper sampling techniques can
produce very misleading data (see [9] for a discussion of the sources of
sampling biases). We return to this example when discussing inference,
asking students to use the *Literary Digest *result to estimate the
proportion of Landon supporters in the population (95% c.i.: (.5694,.5706)).
As Roosevelt beat Landon in the actual election by a landslide, students
see that the poor data collection methods in this setting render any inference
results completely invalid.
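The absurdly precise interval quoted above follows directly from the huge sample size, as a few lines of Python confirm:

```python
from statistics import NormalDist

N = 2_400_000
P_HAT = 0.57     # proportion of Literary Digest respondents favoring Landon

z = NormalDist().inv_cdf(0.975)
half = z * (P_HAT * (1 - P_HAT) / N) ** 0.5
print(f"95% c.i.: ({P_HAT - half:.4f}, {P_HAT + half:.4f})")
# The enormous sample makes the interval razor-thin, yet Landon lost in a
# landslide: no formula can compensate for a biased sampling method
```

The interval's width shrinks with the square root of the sample size, but its center is only as trustworthy as the sampling method that produced it.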

**#3 Always examine visual displays of the data.**

An instructor can easily be tempted to limit discussion of exploratory
and graphical methods to the first part of the introductory course. However,
we strongly recommend always having students apply these techniques to
data, including *before* they carry out inference procedures. In many
cases an initial analysis of the data reveals much that a significance
test or confidence interval does not. An exploratory analysis can also
determine whether or not the inference procedure is even appropriate for
the data at hand.

*Example:* We provide students with data on times between eruptions
(in minutes) for the Old Faithful geyser, originally reported in [2] and
also presented in [4] and [12]. If students merely calculate a confidence
interval for the population mean inter-eruption time (95% c.i.: (70.73,
73.90)) without inspecting the data first, they fail to notice the pronounced
bimodal nature of the data with peaks around 55 and 78 minutes. With this
realization students are able to describe the inter-eruption times more
effectively.
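The point is easy to demonstrate in Python. The data below are a stand-in mixture with modes near 55 and 78 minutes (the actual Old Faithful measurements are not reproduced here); a summary statistic hides the two peaks that even a crude histogram exposes:

```python
import random
from statistics import mean

random.seed(7)

# Stand-in for the Old Faithful inter-eruption times: a mixture with modes
# near 55 and 78 minutes (the article's actual data are not reproduced here)
times = [random.gauss(55, 5) if random.random() < 0.35 else random.gauss(78, 5)
         for _ in range(222)]

print(f"mean = {mean(times):.1f} minutes")   # a single 'typical' value...

# ...but even a crude text histogram reveals two distinct peaks
for lo in range(40, 95, 5):
    count = sum(lo <= t < lo + 5 for t in times)
    print(f"{lo:2d}-{lo + 4:2d} | {'*' * count}")
```

The mean suggests a single typical waiting time of about 70 minutes; the histogram shows that almost no eruption actually waits that long.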

*Example:* Anscombe [1] provides a particularly effective
illustration of the need to examine data first. Given four different bivariate
data sets, students calculate the same correlation coefficient (.816) and
very significant regression equation (*p*-value = .002) for each one.
However, when students examine scatterplots of the data, they discover
that the data sets differ dramatically from each other. Students discern
that linear regression is entirely inappropriate for all but one of the
data sets, a fact they cannot recognize from the numerical summaries alone.
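Anscombe's published data make this easy to verify. The Python sketch below computes the correlation for each of the four data sets:

```python
from math import sqrt

# Anscombe's (1973) four data sets; the first three share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# All four sets share essentially the same correlation (about .816), yet
# scatterplots show only the first is really suited to linear regression
for label, (xs, ys) in {"I": (x123, y1), "II": (x123, y2),
                        "III": (x123, y3), "IV": (x4, y4)}.items():
    print(f"set {label}: r = {correlation(xs, ys):.3f}")
```

The four identical correlations, set against four wildly different scatterplots, are the whole argument for plotting before inferring.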

**#2 Help students to see the common elements of inference procedures.**

We want students to see that the reasoning and structure of statistical inference procedures are consistent, regardless of the specific technique being studied. For example, students should see the sampling distributions for several types of statistics to appreciate their similarities and understand the common reasoning process underlying the inference formulas. In addition, students can view these formulas as special cases of one basic idea. For example, confidence intervals in the introductory course have the form

estimate ± (critical value)(standard error of the estimate).

Similarly, test statistics are typically of the form

(estimated value - hypothesized value) / (standard error of the estimate).

By understanding this general structure of the formulas, students can concentrate on understanding one big idea, rather than trying to memorize a series of seemingly unrelated formulas. Students can then focus on the type and number of variables involved in order to properly decide which formula is applicable. This approach also empowers students to extend their knowledge beyond the inference procedures covered in the introductory course.

**#1 Insist on complete presentation and interpretation of results
in the context of the data.**

Students need to realize that the end result of statistical inference is not simply a "yes" or "no" answer. We consider it unacceptable for a student to write a conclusion as brief as "reject the null hypothesis". Instead, students should discuss inference results in the context of the issue at hand, as in "the data provide strong evidence that Vietnam veterans divorce at a rate higher than the general population". Our goal is not only for students to be able to interpret conclusions reported in scholarly and popular literature, but also to be able to explain them clearly to people who are not familiar with statistics. Ideally, students also describe the reasoning behind the inference statement, for example by interpreting the phrases "95% confidence" and "significant result" in their own words. Finally, students should be given the opportunity to submit their interpretations repeatedly, with frequent feedback from the instructor, until they are able to express their ideas clearly. This emphasis on mastering the language further helps students internalize the concepts.

**Conclusion**

Statistical education reform emphasizes active learning on the part of students, conceptual understanding of fundamental statistical ideas, use of engaging applications involving genuine data, and development of student communication skills. While these principles have largely been accepted for teaching data analysis, we believe they have not been sufficiently implemented for teaching inference. To facilitate incorporation of these principles into the teaching of statistical inference, we have provided suggestions and examples that:

- focus on investigation and discovery of inferential reasoning (#10 and #9),
- emphasize appropriate interpretation of inference results (#8, #7, and #6),
- caution students about overuse and misuse of inference procedures (#5, #4, and #3),
- require students to build connections and effectively communicate results (#2 and #1).

1. Francis J. Anscombe, Graphs in statistical analysis, *The
American Statistician*, 27 (1973), 17-21.

2. A. Azzalini and A.W. Bowman, A look at some data on the
Old Faithful geyser,
*Journal of the Royal Statistical Society, Series
C*, 39 (1990), 357-366.

3. P.J. Bickel, E.A. Hammel, and J.W. O'Connell, Sex bias in graduate
admissions: data from Berkeley,
*Science*, 187 (1975), 398-404.

4. Samprit Chatterjee, Mark S. Handcock, and Jeffrey S. Simonoff,
*A Casebook to Accompany a First Course in Data Analysis*, John Wiley &
Sons, 1995.

5. George Cobb, Reconsidering statistics education: a National
Science Foundation conference, *Journal of Statistics Education *[Online].
1 (1993), http://www.stat.ncsu.edu/info/jse/v1n1/cobb.html.

6. George Cobb, Teaching statistics, in Lynn Steen, ed. *Heeding
the Call for Change: Suggestions for Curricular Action*, MAA Notes #22,
Mathematical Association of America, 1992, 3-43.

7. George W. Cobb and David S. Moore, Mathematics, statistics,
and teaching,
*The American Mathematical Monthly*, 104 (1997), 801-824.

8. The College Board, *Advanced Placement course description:
Statistics*, College Entrance Examination Board and Educational Testing
Services, 1996.

9. David Freedman, Robert Pisani, and Roger Purves, *Statistics*
(3rd ed.), W.W. Norton & Co., 1998.

10. Joan Garfield, How students learn statistics, *International
Statistical Review*, 63 (1995), 25-34.

11. Florence and Sheldon Gordon, eds., *Statistics for the
Twenty-First Century*, MAA Notes #26, Mathematical Association of America,
1992.

12. D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski,
eds., *A Handbook of Small Data Sets*, Chapman & Hall, 1994.

13. David Hoaglin and David Moore, eds., *Perspectives on
Contemporary Statistics*, MAA Notes #21, Mathematical Association of
America, 1992.

14. James M. Landwehr, Jim Swift, and Ann E. Watkins, *Exploring
Surveys and Information from Samples*, Dale Seymour Publications,
1987.

15. David S. Moore, New pedagogy and new content: the case
of statistics, *International Statistical Review*, 65 (1997), 123-165.

16. David S. Moore, What is statistics?, in David Hoaglin
and David Moore, eds., *Perspectives on Contemporary Statistics*,
MAA Notes #21, Mathematical Association of America, 1992, 1-17.

17. David S. Moore and George W. Cobb, Mathematics, statistics,
and teaching,
*American Mathematical Monthly*, 104 (1997), 801-824.

18. David S. Moore and George P. McCabe, *Introduction to
the Practice of Statistics* (2nd ed.), W.H. Freeman, 1993.

19. Allan J. Rossman and Beth L. Chance, *Workshop Statistics:
Discovery with Data and Minitab*, Springer-Verlag, 1998.

20. Richard L. Scheaffer, Mrudulla Gnanadesikan, Ann Watkins,
and Jeffrey A. Witmer, *Activity-Based Statistics*, Springer-Verlag,
1996.

21. Jessica M. Utts, *Seeing Through Statistics*, Duxbury
Press, 1996.