Stat 301 - HW 7
Due noon, Friday, March 6
If you submit your assignment in Canvas,
remember to upload separate files for each problem and to put your name inside
each file. Remember to show your work/calculations/computer details and to integrate
this output into the body of the solution.
FYI: Solutions to Quiz 24
1) For the study on elephants’ walking distances, we considered the two groups
of elephants as random samples from their respective populations. We considered
these populations to be large, but had no access to the actual population data.
The sample distributions were not particularly normal and the sample sizes were
not particularly large, so this meant we were a little skeptical about the
validity of the t-procedures. How can we investigate this? What can we do if we
don’t want to use the t-procedures? If we had access to the populations, we
could carry out a simulation to investigate whether the distribution of
t-statistics follows a t-distribution, but what do we do when we (more
realistically!) only have the samples? One way to create “populations” to
sample from is to repeat our samples infinitely many times! This is equivalent
to sampling with replacement. To estimate the sample-to-sample variation for a
confidence interval, we can resample from each sample separately. To estimate
the sample-to-sample variation for a test of significance, we can pool the two
samples together and resample from that combined (infinite) sample. Keep in
mind that you will always use the same sample sizes as the observed data.
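R users: the two resampling schemes described above can be sketched roughly as follows. This is only an illustration, not what the applet does internally. The distances are made-up placeholder values (not the elephant data), and the names `asian`, `african`, and `boot_diff` are assumptions introduced here.

```r
set.seed(301)  # for reproducibility

# Hypothetical placeholder data; substitute the real samples
asian   <- c(4.2, 5.1, 3.8, 6.0, 4.9, 5.5)
african <- c(6.3, 7.1, 5.8, 6.9, 8.0, 7.4, 6.6, 5.9)

# One bootstrap difference in means (group 1 minus group 2).
# pooled = FALSE: resample each sample separately (confidence intervals).
# pooled = TRUE:  resample both groups from the combined sample (tests).
# Either way, the resampled groups keep the original sample sizes.
boot_diff <- function(x, y, pooled = FALSE) {
  if (pooled) {
    combined <- c(x, y)
    new_x <- sample(combined, length(x), replace = TRUE)
    new_y <- sample(combined, length(y), replace = TRUE)
  } else {
    new_x <- sample(x, length(x), replace = TRUE)
    new_y <- sample(y, length(y), replace = TRUE)
  }
  mean(new_x) - mean(new_y)
}

unpooled_diffs <- replicate(1000, boot_diff(asian, african))
pooled_diffs   <- replicate(1000, boot_diff(asian, african, pooled = TRUE))

# Compare the centers of the two bootstrap distributions
c(observed        = mean(asian) - mean(african),
  unpooled_center = mean(unpooled_diffs),
  pooled_center   = mean(pooled_diffs))
```

The unpooled distribution should center near the observed difference, while the pooled distribution should center near zero, because pooling builds the null hypothesis of no difference into the resampling.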
(a) Open the Two sample bootstrapping applet. Paste the elephant data into the
Sample data box on the left and press Use data. Confirm the values for the
sample means and standard deviations. Check the Show Sampling Options box and
press Bootstrap Samples. Verify you have selected 23 Asian elephants and 33
African elephants. Look at the Plot and/or Data windows. Include a screen
capture where you identify any elephants that were selected more than once, and
explain how you can tell.
(b) Press Bootstrap
Samples again. Did you find the same
sample means or does this procedure model sample-to-sample variation from the
random sampling process?
(c) Include a screen
capture of this sample selection (including the mean and SD values, Selected
Summary Statistics) and use this bootstrap sample data to calculate a t-statistic (show your work). Use the
pull-down menu to change the Statistic
to t-statistic to confirm your calculation (the blue one).
(d) Now (with t-statistic
selected), take at least 1,000 bootstrap samples. Check the box to overlay the t-distribution. Include a screen
capture. Does this simulation analysis
support the use of the t-distribution
to calculate the p-value? Explain your
answer.
(e) Use the pull-down menu
to switch the statistic back to the difference in means. Where is this distribution centered? Why does this center make sense?
(f) What is the standard
deviation of the difference in sample means distribution? How does it compare to what we predicted with
the Central Limit Theorem? (Cite both values.)
(g) Use the observed difference in sample means and this standard deviation
(from the bootstrap distribution of difference in sample means) to approximate
a 95% confidence interval for the difference in population means (estimate
± 2 SE). How does it compare to the 95% t-confidence interval? (Cite both
intervals.)
(h) Uncheck and then recheck the Show Sampling Options box (to clear out the
previous simulation results) and return the Number of Samples to 1. This time,
check the Pooled box. Press the Bootstrap Samples button until you find a
resampling that demonstrates how this method pools the two samples together and
then selects a group of 23 and a group of 33 (identify some elephants that
change species!). Now generate a bootstrap distribution (at least 1,000
bootstrap samples). What is the mean of this bootstrap distribution, and why
does the value make sense? How does the standard deviation compare to the one
in (f)?
(i) Count how many of the bootstrap differences in sample means are at least as
extreme as the observed difference (in either direction) to approximate a
two-sided p-value. How does the p-value compare to the t-test p-value?
(j) One large benefit of bootstrapping is that it works with statistics other
than differences in sample means. Use the pull-down menu to choose the
difference in sample medians. Report a two-sided p-value. (Include a screen
capture.) How does this p-value compare to the one in (i)? Which p-value is
smaller and why?
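R users: if you want to check your hand calculations for parts (c), (g), and (i), the arithmetic can be sketched as below. The function names `t_stat`, `two_se_interval`, and `two_sided_p` are hypothetical helpers introduced here, the numbers are made up, and the p-value helper assumes the simulated distribution is centered at zero (as with the pooled resampling).

```r
# Unpooled two-sample t-statistic from summary statistics
t_stat <- function(mean1, sd1, n1, mean2, sd2, n2) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)
  (mean1 - mean2) / se
}

# Approximate 95% interval: estimate +/- 2 standard errors
two_se_interval <- function(estimate, se) {
  c(lower = estimate - 2 * se, upper = estimate + 2 * se)
}

# Two-sided p-value: proportion of simulated statistics at least as
# far from zero as the observed statistic
two_sided_p <- function(simulated, observed) {
  mean(abs(simulated) >= abs(observed))
}

# Hypothetical values, for illustration only
t_stat(mean1 = 10, sd1 = 3, n1 = 9, mean2 = 8, sd2 = 4, n2 = 16)
two_se_interval(estimate = 2, se = 0.5)
two_sided_p(simulated = c(-3, -1, 0, 1, 2, 4), observed = 2)
```

In practice `simulated` would be the vector of 1,000 or more bootstrap statistics rather than the six placeholder values used here.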
2) A group of Cal Poly
students wanted to investigate whether men with children tend to live longer
than men without children. They randomly
sampled men from the obituaries page on the San
Luis Obispo Tribune’s website between June and November 2012. For each man selected, they noted the age at
which the person died and whether or not the person had any children.
(a) State appropriate null
and alternative hypotheses for testing whether the average lifespan is longer
for men with children than for men without children.
(b) Identify and classify the explanatory variable
and the response variable in this study.
(c) Does this study
involve random sampling or random assignment or both or neither?
(d) The data are in ChildrenandLifespan.txt. Use R, JMP, or the Theory-Based
Inference applet to create numerical and graphical summaries of the data
comparing the two samples. Summarize what they reveal about the shapes,
centers, and spreads of the two samples. Explain why the shape of the
distribution of the response variable makes sense in this context.
R users: check out
proportion <- table(ChildrenandLifespan$Children) / nrow(ChildrenandLifespan)
boxplot(ChildrenandLifespan$Age ~ ChildrenandLifespan$Children, width = proportion)
JMP users: check out using Fit Y by X and then selecting Boxplots under
Display Options.
(e) Do you consider the t-procedures valid for these data? Explain how you are deciding.
(f) Carry out a two-sample t-test to obtain a p-value for this study. Include
your output, including a well-labeled graph of the null distribution with the
p-value shaded. Would you reject or fail to reject the null hypothesis at the
5% level of significance?
(g) Calculate a 95%
confidence interval for these data. (Interpretation in next question.)
(h) Summarize the
conclusions you would draw from this study including significance, estimation,
causation, and generalizability. Provide
a brief justification for each component.
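R users: a two-sample t-test and confidence interval like those in (f) and (g) can be obtained with `t.test`. The sketch below uses a small made-up data frame named `lifespan` (an assumption, not the real file) with the same column names as the boxplot hint. The group labels "no" and "yes" are also assumed, so check how the groups are coded in the actual data before choosing the direction of the alternative.

```r
# Hypothetical placeholder data with the same column names as the real file
lifespan <- data.frame(
  Age      = c(71, 84, 66, 90, 78, 59, 88, 93, 75, 81, 69, 86),
  Children = c("no", "no", "no", "no", "no", "no",
               "yes", "yes", "yes", "yes", "yes", "yes")
)

# Welch (unpooled) two-sample t-test. Groups are ordered alphabetically
# ("no", then "yes"), so alternative = "less" tests mean(no) - mean(yes) < 0.
test_result <- t.test(Age ~ Children, data = lifespan, alternative = "less")
test_result$statistic
test_result$p.value

# A two-sided call reports the usual 95% confidence interval
# for mean(no) - mean(yes)
ci_result <- t.test(Age ~ Children, data = lifespan, conf.level = 0.95)
ci_result$conf.int
```

By default `t.test` does not assume equal variances; adding `var.equal = TRUE` would give the pooled version instead.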
3) To investigate an association between violent video games and aggressive
behavior, British researchers Hollingdale and Greitemeyer (2014) randomly
assigned 49 students from a university in the United Kingdom to play Call of
Duty: Modern Warfare (a violent video game) and 52 students to play
LittleBigPlanet 2 (a nonviolent/neutral video game). After 30 minutes of
playing the video games, the subjects were asked to complete a
marketing survey investigating a new hot chili sauce recipe. They were told
they were to prepare some chili sauce for a taste tester and that the taste
tester “couldn't stand hot chili sauce but was taking part due to good
payment.” They were then presented with what appeared to be a very hot chili
sauce and asked to spoon what they thought would be an appropriate amount into
a bowl for a new recipe. The amount of chili sauce was weighed in grams after
the participant left the experiment. The amount of chili sauce was used as a
measure of aggression: the more chili sauce, the greater the subject’s
aggression.
(a) Does this study
involve random sampling or random assignment or both or neither?
(b) Load the VideoAgression data into the Comparing Groups – Quantitative applet. Screen capture the numerical and graphical summaries of the data comparing the two groups. Summarize what they reveal about the shapes, centers, and spreads of the two samples.
(c) Do you think t-procedures are likely to be valid with
these data?
(d) Do you think “equal
variances” is a reasonable assumption for these data? Explain.
(e) State appropriate null
and alternative hypotheses to test whether there is an association between type
of video games and level of aggression.
(f) Create a randomization
distribution for the difference in means. Include a screen capture. How does
the SD of this distribution compare to the “pooled standard deviation”
(calculate)?
(g) Use the pull-down menu to select the t-statistic. Report the observed value
of the t-statistic for the actual study (this is unpooled if you want to verify
its value) and use it to determine the simulation-based and the
t-distribution-based p-values. Include a screen capture. How do the p-values
compare?
Questions (h)-(j) all use the same simulation results.
(h) Does 10 appear to be a plausible value for the increase in average
aggression with more violent games? Specify 10 as the hypothesized difference
(or -10; check the direction of subtraction). Set the Number of Shuffles to 1
and select the Plot. Press Shuffle Responses and watch the animation. Explain
in your own words what this animation is doing and why.
(i) Set the Number of Shuffles to 1000 and regenerate the randomization
distribution of the difference in sample means. How do the values of the mean
and standard deviation compare to (f)? Which value(s) change, and why or why
not?
(j) Generate a two-sided p-value (include a screen capture). What conclusion do
you draw in context?
(k) Check the box for a 95% confidence interval (lower left). Interpret the
interval in context and comment on whether it is consistent with your p-value
in (j).
(l) Put the data into Excel or R or JMP and log transform the aggression
scores. (Note: there is a zero, which you can turn into 0.5 first.) Use the
log-transformed data to carry out a two-sample t-test for an association.
(Include a screen capture.) Do the results differ? Does one analysis provide
stronger evidence of an association than the other? Explain.
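R users: the zero replacement and log transformation in (l) might look something like the sketch below. The data frame `sauce` and its columns `grams` and `game` are made-up placeholders (assumptions, not the names in the actual data file), and natural logs are assumed.

```r
# Hypothetical chili-sauce amounts in grams (note the zero)
sauce <- data.frame(
  grams = c(0, 5.2, 12.8, 20.1, 7.5, 30.4, 3.1, 9.9, 15.0, 2.4),
  game  = rep(c("violent", "neutral"), each = 5)
)

# Replace zeros with 0.5 before taking natural logs, since log(0) is -Inf
adjusted <- ifelse(sauce$grams == 0, 0.5, sauce$grams)
sauce$log_grams <- log(adjusted)

# Welch two-sample t-test on the log scale
log_test <- t.test(log_grams ~ game, data = sauce)
log_test$statistic
log_test$p.value
```

Running the same `t.test` call on `grams` instead of `log_grams` gives the untransformed analysis for comparison.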
Possible Extension Assignments
· Dr. Anna Bargagliotti from Loyola Marymount will speak Thursday, Mar 5 at
11:10 in 38-121 about her NSF-funded research on Undergraduate Data
Pathways -- an assessment of how universities provide undergraduate students
opportunities to work with data across a variety of disciplines.
· How many elephants are in North American zoos? (Include your reference(s).)
Does this impact our analysis? If so, how? (Be specific.)
· For problem 1, give an intuitive explanation for why we would expect the
first simulation (unpooled) to give a larger standard deviation for the
distribution of differences in sample means than the pooled approach.
· For the data in problem 2, use R (see p. 277) to carry out a randomization
test to determine whether the ratio of standard deviations is statistically
significant. (Feel free to first explore the difference in means to compare to
the applet output.) Interpret your results in context (when/why might this be
an interesting research question?).
· For problem 3, use the log-transformed data to obtain a 95% confidence
interval and back-transform the endpoints to obtain a confidence interval for
the ratio of the population medians. Interpret the interval in context. Explain
why this becomes a ratio rather than a difference.
· Check out http://datavizcatalogue.com/blog/box-plot-variations/ and use the
data in problem 2 to make boxplots of varying width and some other variations,
and comment on the effectiveness of the different displays (e.g., the bee
swarm!).
· Check out the Guess the p-value applet. How accurately can you anticipate the
p-value from the picture? What other information is important? And how often do
you get a small p-value when the null is true? Does increasing the sample size
change that?
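R users: the last question above (how often a small p-value shows up when the null is true) can also be explored by simulation. The sketch below is one possible setup, not the applet's method: it assumes both groups are drawn from the same standard normal population, and the helper name `null_p_value` is hypothetical.

```r
set.seed(301)  # for reproducibility

# Simulate one two-sample t-test p-value when both groups come from the
# same normal population (so the null hypothesis is true)
null_p_value <- function(n) {
  t.test(rnorm(n), rnorm(n))$p.value
}

# Proportion of p-values below 0.05 for a small and a larger sample size
small_n <- mean(replicate(2000, null_p_value(10)) < 0.05)
large_n <- mean(replicate(2000, null_p_value(100)) < 0.05)
c(n10 = small_n, n100 = large_n)
```

Comparing the two proportions gives a rough sense of whether the sample size affects how often a true null hypothesis is rejected at the 5% level.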