INVESTIGATING STATISTICAL CONCEPTS, APPLICATIONS, AND METHODS, Third Edition

NOTES FOR INSTRUCTORS

August, 2020

Chapter 1             Chapter 2               Chapter 3         Chapter 4          Chapter 5

For the third edition, we have pulled the analysis of a single quantitative variable out into its own chapter.  This allows you to parallel the one-sample z-test and introduce the t distribution before moving on to comparing groups.  We still think it would be feasible to discuss Chapter 3 before Chapter 2 if you prefer, though one advantage of the current ordering is that this chapter also introduces the log transformations used in Chapter 3. The main open question is how to handle the simulation.  We have chosen to have students do a few simulations with finite populations to motivate the t-procedures.  Bootstrapping is discussed briefly in Investigation 2.9 as an alternative, especially with statistics other than the mean.

 

CHAPTER 2: ANALYZING QUANTITATIVE DATA

 

Section 1: Descriptive Statistics

 

The primary goals of this section are to have students work with quantitative data and learn how to describe distributions of quantitative data. Some of this material will be review for many students, especially if you worked through Investigation A, so we focus on modeling distributions of data, including assessing model fit and transformations. We hope to remind students that some interesting questions are descriptive in nature.

 

Investigation 2.1: Birth Weights

 

Materials: In this investigation we work with a very rich, large data set, USbirthsJan2016.txt.  To simplify things, we focus on only one month’s worth of data. Even so, it will take a little bit of care to read all of the data values into your technology.  In particular, the webpage doesn’t load the full dataset, so copying and pasting from the webpage no longer works. For most packages, students can save the .txt file directly to their own computer and then open that file. (You may want to add this activity about using ack to process the data file. Perhaps worth half a class period?)

 

Timing: 60 minutes

 

Students begin to see some of the “data cleaning” issues that must be dealt with when they read data from the web, such as missing values and how they are coded.  We have students learn how to subset the data.  R users: you can also do a lot more with the tidyverse here if you want. Students also learn more technology details, such as creating histograms and normal probability plots (these are given more emphasis in later investigations).  We show students a few different methods for assessing model fit, e.g., checking something like the empirical rule (we focus on 2SD), overlaying density curves, and normal probability plots.  The motivation for probability plots is that judging whether the data follow a line is easier than judging the fit of a curve to a histogram, and it is independent of the choice of the number of intervals.  We want students to focus on judging that linearity, but also on what basic deviations from that pattern imply about the shape of the distribution.  Keep in mind that with the huge data set some steps will be slower.

Finally, we use a normal model (as they did in Chapter 1, but in the context of sampling distributions) to estimate the probability of an outcome and compare that to the actual relative frequency.  We encourage you to emphasize to students the distinction between the model and the data. You may also wish to add more practice with the normal distribution at this point.  You may also want to skip the end of this activity or demo it, as the technology instructions can slow students down. We encourage you to see the new homework problem on the empirical rule as a follow-up here. The (updated) Old Faithful data set in PP 2.1 is also fodder for interesting explorations. Exercise #8 gives students a couple of versions of the empirical rule to explore.

 

In (q), students will need to refer back to previous technology instructions. We encourage you to continue to emphasize drawing a sketch and labeling the horizontal axis.

 

Technology notes: In R, you can probably get by with “hist” at this stage, but later you will want “histogram” from the lattice package, so it may be worth getting students in the habit now. You can also change the number of bins, e.g., histogram(birthweight, nint = 30).
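
If helpful, here is a minimal R sketch of these first steps, assuming the file is tab-delimited with a column named “birthweight” (the column name is our assumption; adjust the path and variable names to match your copy):

    births  <- read.table("USbirthsJan2016.txt", header = TRUE, sep = "\t")
    weights <- births$birthweight[!is.na(births$birthweight)]   # drop missing values
    hist(weights, breaks = 30)                                  # base R histogram
    # Empirical rule check: proportion of observations within 2 SDs of the mean
    mean(abs(weights - mean(weights)) < 2 * sd(weights))
    # Normal probability plot with reference line
    qqnorm(weights); qqline(weights)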

 

A different dataset for Practice Problem 2.1A with duration (from Statistical Sleuth) can be found here.

 

Investigation 2.2: How long can you stand it?

 

Materials: honking.txt dataset

 

Timing: 30-40 minutes; you might be able to combine this with either 2.3 or 2.4, depending on technology. You could also choose to skip 2.2 and 2.3 for now and return to 2.2 at the end of the chapter with Investigation 2.7.

 

This investigation focuses on a skewed distribution.  Students are also reminded of the relationship between the mean and median with skewed data and are introduced to boxplots (including technology instructions for modified boxplots) and the 1.5IQR criterion for outliers.  You may be able to move some of this part of the investigation outside of class.  The investigation then explores transformations of the data, as well as other probability models.  For qqplots in R, we set them up so the variable of interest is on the horizontal axis and we compare the observed data to theoretical quantiles (e.g., qexp). We encourage you to look at the histogram of the theoretical quantiles to help students see the shape of the theoretical distribution.   With new R functions, both R and Minitab will allow you to overlay the exponential and lognormal probability models on the sample histogram (Minitab will of course allow many others as well; with R, the functions are simple enough that students could create their own; see the sketch below).  Students should use these visual comparisons, as well as comparisons of probability calculations, to see how well the models fit the sample data. It’s important to remind students that fitting the existing data is one issue, but you also want your model to be ‘robust’ enough that it predicts unobserved observations as well.
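
Along those lines, a minimal R sketch of an exponential qqplot (observed data on the horizontal axis) and a density overlay, assuming honking.txt has a column named “time” (the column name and the rate estimate 1/mean are our assumptions):

    honking <- read.table("honking.txt", header = TRUE)
    times   <- honking$time                       # column name assumed
    n     <- length(times)
    probs <- (1:n - 0.5) / n                      # plotting positions
    plot(sort(times), qexp(probs, rate = 1 / mean(times)),
         xlab = "Observed data", ylab = "Exponential quantiles")
    abline(0, 1)                                  # line of agreement
    # Overlay the fitted exponential density on a density-scale histogram
    hist(times, freq = FALSE, breaks = 20)
    curve(dexp(x, rate = 1 / mean(times)), add = TRUE)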

 

Investigation 2.3: Readability of Cancer Pamphlets

 

Timing:  15 minutes. You could consider assigning this outside of class.

 

This is a fairly short investigation but gives students some practice thinking about distributions and limitations of measures of center.  If students struggle in (f), encourage them to think about creating a simple graph to compare the two distributions (as in the sketch below) and/or to consider the proportion of pamphlets that cannot be read by any patients.
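
For instance, a minimal R sketch of such a comparison, using made-up reading levels (hypothetical values for illustration only, not the study’s data):

    # Hypothetical grade-level data for illustration only
    patients  <- c(3, 6, 6, 7, 8, 8, 9, 10, 12, 12)
    pamphlets <- c(7, 8, 9, 9, 10, 11, 12, 13, 14)
    stripchart(list(Patients = patients, Pamphlets = pamphlets),
               method = "stack", pch = 1)
    # Proportion of pamphlets written above every patient's reading level
    mean(pamphlets > max(patients))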

 

 

 

Section 2: Inference for Population Mean

 

This section focuses on inference for a population mean.  We first provide students with a hypothetical population to sample from to explore properties of the sampling distribution of the sample mean.  We believe this will be more “concrete” for students than sampling from a theoretical probability distribution or bootstrapping. We focus on the use of applets for exploring these conceptual ideas before turning to more standard statistical software for analysis.

 

Investigation 2.4: The Ethan Allen

 

Timing: This can take about a 50-minute class period.  If you need to supplement, you can also talk about variability in means vs. individual observations (e.g., Why do we diversify a stock portfolio? Why do some sporting contests (e.g., rodeo) average over scores rather than just having one score?).

 

Materials: Sampling from a Finite Population applet, WeightPopulations.xls

Note: You can copy in all 3 columns at once and then use the Variable pull-down menu to select Pop1, Pop2, or Pop3. 

 

A true story of a tour boat that sank.  You can find pictures of the incident online.  Twenty of the 47 passengers died. State and federal weight limits have since been modified. Wikipedia claims the date is Oct. 2. We designed this investigation to help students focus on the distribution of the sample mean (rather than individual observations) and to emphasize the distinctions between sample, population, and sampling distribution.  The change from total weight to average weight is not required, but allows us to focus only on the distribution of sample means.  The applet also now allows you to sample from a population model as well. You may also want to supplement the statement of the Central Limit Theorem with derivations of the formulas for the mean and standard deviation of the sample mean (a sketch follows below). You may want to refer back to the Gettysburg Address investigation (1.12) and prior use of the normal distribution (e.g., Investigation 1.8).
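
A sketch of that derivation (in LaTeX), assuming independent observations with mean \mu and standard deviation \sigma (i.e., ignoring the finite population correction):

    E(\bar{X}) = E\Big(\tfrac{1}{n}\textstyle\sum X_i\Big)
               = \tfrac{1}{n}\textstyle\sum E(X_i) = \tfrac{1}{n}(n\mu) = \mu
    \mathrm{Var}(\bar{X}) = \tfrac{1}{n^2}\textstyle\sum \mathrm{Var}(X_i)
               = \tfrac{n\sigma^2}{n^2} = \tfrac{\sigma^2}{n},
    \qquad \mathrm{SD}(\bar{X}) = \sigma/\sqrt{n}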

 

This is a good time to remind students of the difference between number of samples (repetitions of the simulation) and sample size.  In particular, try to curb the common student assumption that large samples make the sample or population data more normally distributed.

 

 

Investigation 2.5: Healthy Body Temperatures

 

Timing: 60 minutes. Before question (o) could be a convenient spot to divide the investigation.

 

Materials: Sampling from a Finite Population applet, BodyTempPop.txt

 

This investigation provides a straightforward application of the CLT with a genuine research question about whether 98.6°F is the “right” temperature to focus on.  See also PP 2.5B for more recent references on this issue.

 

Part of the point of (a) is that students don’t know some of these values, especially the population standard deviation.

 

Students are introduced to the concept of the standard error of the sample mean and then consider the impact on the behavior of the standardized statistic.  The applet allows them to compare the normal and t probability models for the t-statistic with a small sample size first.  Use of the t distribution is also justified by looking at the coverage rate of confidence intervals for the population mean (this applet exploration in questions (o)-(s) could be a stand-alone assignment). Students see that the results don’t differ much with larger sample sizes, but we tell students there is no harm in using the t distribution rather than switching back to the normal distribution.  We focus on a normal population here, but remind students of the lessons from the previous investigation for non-normal data as well. Students are given an opportunity to practice finding critical values directly before seeing the more general technology instructions for t-procedures (see the sketch below).
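
As one possibility, a minimal R sketch of finding the critical value directly and then letting t.test do the work; the vector temps is a placeholder sample of our own invention, so substitute students’ actual data:

    temps <- rnorm(25, mean = 98.25, sd = 0.73)            # placeholder sample only
    n     <- length(temps)
    tstar <- qt(0.975, df = n - 1)                         # 95% critical value
    mean(temps) + c(-1, 1) * tstar * sd(temps) / sqrt(n)   # t-interval by hand
    t.test(temps, mu = 98.6)                               # same interval, plus test of mu = 98.6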

 

This investigation might be a good time to again explore what is meant by “confidence level” and to use simulations to explore the robustness of the t-intervals under different population shapes (use the applet pull-down menu to select Uniform or Exponential populations).

 

Investigation 2.6: Healthy Body Temperatures (cont.)

 

Timing: 45 minutes

 

This is a stand-alone investigation focusing on prediction intervals. You could ask students to literally check what percentage of the sample falls in the confidence interval. Minitab and R do not have a convenient short-cut for calculating prediction intervals, but students could write their own function in R (a sketch follows below).  There is also more discussion of the normality assumption and how to assess it.
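
A minimal sketch of such a function, using the standard one-new-observation prediction interval, x-bar ± t* s sqrt(1 + 1/n), which assumes the data are approximately normal:

    pred.int <- function(x, conf = 0.95) {
      n     <- length(x)
      tstar <- qt(1 - (1 - conf) / 2, df = n - 1)
      mean(x) + c(-1, 1) * tstar * sd(x) * sqrt(1 + 1 / n)
    }
    # e.g., pred.int(temps); then check what percentage of the sample falls inside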

 

 

 

Section 3: Inference for Other Statistics

 

This section is more optional, but it gives students some alternatives to t-procedures: transformations, the sign test, and bootstrapping.  All of these investigations are new to the third edition.

 

Investigation 2.7: Water Oxygen Levels

 

Timing: 30 minutes

 

Materials: WaterQuality.txt

 

You may want to keep (or start) having students take more responsibility for reading some of the background of the study and answering the first few terminology questions before coming to class.  This investigation sneaks in an example of a systematic sample. You can also continue to contrast sampling from a process with sampling from a finite population. Students may struggle with (h), but it’s a key question.  Question (i) can also generate good class discussion.

 

Investigation 2.8:  Turbidity

 

Timing: 15-30 minutes

 

Materials: MermentauTurbidity.txt

 

This investigation builds on the earlier exploration of log transformations. Here we focus on how the interpretation of the parameter should change to be about the median rather than the mean (questions (e) and (f); see the sketch below). You will also want to continue to emphasize to students that you aren’t changing the data so much as rescaling the data while preserving the order of observations.
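
A minimal R sketch of the back-transformation idea, assuming the file has a column named “turbidity” (column name assumed); the back-transformed interval is interpreted as an interval for the population median when the log values are reasonably symmetric:

    water  <- read.table("MermentauTurbidity.txt", header = TRUE)
    logt   <- log(water$turbidity)          # column name assumed
    ci.log <- t.test(logt)$conf.int         # t-interval for the mean of the logs
    exp(ci.log)                             # back-transform: interval for the median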

 

Investigation 2.9: Heroin Treatment Times

 

Timing: 45 minutes

 

Materials: heroin.txt

 

This is a classic data set that also links to survival analysis.  The investigation focuses on the distribution of the median, but you may want to introduce other statistics, like the trimmed mean or 75th percentile, as well. You can also remind students that “smooth” transformations don’t always work with data that are not unimodal.  Here we give a very brief introduction to bootstrapping, motivated by the need to estimate the standard error of the sample median.  We then simply have students construct a 2SD confidence interval, but watch for skewness in the bootstrap distribution (a sketch follows below).  You could easily expand to other methods based on these basic ideas. (There are several ways to use R and JMP to carry out bootstrapping.)
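
A minimal R sketch of this approach, assuming heroin.txt has a column named “time” (column name assumed):

    heroin <- read.table("heroin.txt", header = TRUE)
    times  <- heroin$time                   # treatment times; column name assumed
    set.seed(1)
    boot.medians <- replicate(10000, median(sample(times, replace = TRUE)))
    hist(boot.medians)                              # watch for skewness here
    median(times) + c(-2, 2) * sd(boot.medians)     # 2SD bootstrap interval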

 

Exercises

You might also want to see the Ch. 0 exercises (1-7) for some interesting descriptive statistics problems.  The end of the Ch. 1 exercises also has some additional practice with normal probability distribution calculations.