Stat 301 – Review 2
Review Problems (Solutions), Investigation 2.3
By Monday, noon: Submit Exam 2 Question and Exam 2 Example Questions in Canvas, then review the responses to the former before the exam
Optional review session: Monday, 6:30-7:30pm, 10-126
Exam Format:
The exam will cover topics from Chapter 2 (one quantitative variable) and Chapter 3 (comparing two groups on a binary response variable), i.e., Days 15-23 and HW 4-6. The exam format will be similar to Exam 1, with a mix of short-answer and longer-answer questions. You will not be using software, but you will be expected to interpret provided output (e.g., from R, JMP, applets, or other tools), discuss how you would use technology (e.g., R, JMP, or an applet), and interpret “code.” You should also bring your calculator (not a cell phone). You may use two pages of notes.
Study Hints: You should study from the text (including chapter summaries), lecture notes, solutions to investigations, PowerPoint slides, homeworks, homework solutions, Chapter Examples (2.1, 3.1, 3.2), practice problems, and quizzes. See also the overview of procedures pages. In studying, I recommend re-doing investigations and homeworks without looking at the online solutions, then checking your answers. The questions will not be heavily computational, but you are expected to know how to set up the calculations by hand. You are also expected to explain your reasoning, show your steps, and interpret your results. The exam will contain approximately 50 points (so a 2-point problem should only take you about 2 minutes).
Principles to keep in mind:
Standardized/test statistic = (statistic – hypothesized)/(SE of statistic)
Confidence interval = observed statistic ± (critical value) × (SE of statistic)
SE of statistic estimates the sample-to-sample variation of the statistic (vs. variability in the data)
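As a quick illustration of the first two principles, here is a minimal Python sketch. All numbers (sample mean, SD, sample size, hypothesized value, and the multiplier 2) are made up for illustration, and the function names are hypothetical, not from any course software.

```python
import math

def standardized_statistic(statistic, hypothesized, se):
    """(statistic - hypothesized) / (SE of statistic)."""
    return (statistic - hypothesized) / se

def confidence_interval(statistic, critical_value, se):
    """observed statistic +/- (critical value) x (SE of statistic)."""
    margin = critical_value * se
    return statistic - margin, statistic + margin

# Hypothetical summary values, for illustration only
xbar, s, n = 98.2, 0.7, 49
mu0 = 98.6                      # hypothesized population mean
se = s / math.sqrt(n)           # SE of the sample mean, s / sqrt(n)
standardized = standardized_statistic(xbar, mu0, se)
lower, upper = confidence_interval(xbar, 2, se)   # 2 = rough 95% multiplier

print(round(se, 3), round(standardized, 2), round(lower, 2), round(upper, 2))
```

The same two functions apply to any statistic once you know its standard error; only the SE formula and the critical value change from procedure to procedure.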
From Section 2.1 (Investigations 2.1, 2.2, 2.3) you should be able to:
· Create and interpret different types of graphs: histogram, dotplot, stemplot, boxplot
  o Critique effectiveness of display (e.g., labels, bin width)
· Describe the shape of a distribution of a quantitative variable as symmetric, skewed to the left, or skewed to the right
· Assess the normality of a distribution (e.g., overlay, normal probability plot, S-W test)
· Identify possible outlier(s) (in the dataset) and suggest explanations
· Critique justifications for removing outlier(s) from dataset
· Interpret the five number summary and the inter-quartile range (IQR)
· Understand how skewness and/or outliers impact the relative positions of the mean and median and the standard deviation
· Explore data transformations to normalize a distribution
· Transform data and use a normal model to estimate a probability
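To make the five number summary, IQR, and outlier check concrete, here is a minimal Python sketch on made-up data. It assumes the common 1.5 × IQR rule for flagging outliers and Python's "inclusive" quartile method; R, JMP, and by-hand conventions for quartiles can differ slightly, so treat the exact quartile values as one convention among several.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def outlier_fences(q1, q3):
    """Lower and upper fences from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up data with one unusually large value
data = [2, 4, 5, 6, 7, 8, 9, 10, 30]
summary = five_number_summary(data)
low_fence, high_fence = outlier_fences(summary[1], summary[3])
outliers = [x for x in data if x < low_fence or x > high_fence]

# The large value pulls the mean above the median (right skew)
print(summary, (low_fence, high_fence), outliers)
print(statistics.mean(data), statistics.median(data))
```

Flagging a value this way only identifies it as a possible outlier; whether to remove it still needs a justification based on the study context.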
From Section 2.2 (Investigations 2.4, 2.5, 2.6) you should be able to:
· Continue to consider whether or not you are likely to have a representative sample
· Explain the reasoning behind simulating random samples from a finite population
  o Critique assumptions made about the population
· Predict the behavior of the sampling distribution of the sample mean (mean, standard deviation, shape) and compare to the population distribution
  o How/when does the shape of the population matter?
· Apply the Central Limit Theorem for the sample mean
  o Know when it does/does not apply
  o Use technology (R, JMP, or Normal Probability Calculator) to approximate probabilities for sample means with the normal distribution
    § What values to use for mean, SD, observation; direction
    § Sketch, label the corresponding distribution and shade the probability of interest
· Define a population mean in context
· State appropriate null and alternative hypotheses about a population mean for a given research question
· Calculate and interpret the standard error of the sample mean, s/√n
· Determine and interpret the “standardized” distance between the sample mean (x̄) and the hypothesized population mean (μ₀)
· Roughly approximate a 95% confidence interval for μ by x̄ ± 2s/√n
· Use the t-distribution to model the behavior of the standardized statistic
  o Determine the degrees of freedom and their impact on the t distribution
    § As df increases, the t distribution approaches the standard normal distribution
  o Explain the difference between the normal distribution and the t distribution
    § Heavier tails
    § Why that’s helpful to use for inference about the population mean
    § Consequences on p-values, confidence intervals, coverage rate
  o Assess the validity of the t procedures
· Interpret a confidence interval for μ
  o If asked, interpret the confidence level
· Determine and interpret a prediction interval (PI) for a future observation
  o With raw data, by hand
  o Explain the reasoning behind the summation in the SE formula for a PI
  o Compare a confidence interval to a prediction interval
  o Assess the validity of the prediction interval procedure
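The contrast between a confidence interval for the mean and a prediction interval for a single future observation can be sketched in Python as below. The data are made up, and the critical value t* is passed in by hand (2.306 is the 95% value for 8 degrees of freedom); on the exam you would read it from provided output. The sketch also assumes the t procedures are valid for these data.

```python
import math
import statistics

def t_intervals(sample, t_star):
    """Return (CI for the mean, PI for one future observation, SE_mean, SE_pred)."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)            # sample SD (divides by n - 1)
    se_mean = s / math.sqrt(n)              # SE of the sample mean
    se_pred = s * math.sqrt(1 + 1 / n)      # adds the variability of one new observation
    ci = (xbar - t_star * se_mean, xbar + t_star * se_mean)
    pi = (xbar - t_star * se_pred, xbar + t_star * se_pred)
    return ci, pi, se_mean, se_pred

# Made-up sample of n = 9 measurements
sample = [8, 9, 10, 10, 11, 11, 12, 12, 16]
ci, pi, se_mean, se_pred = t_intervals(sample, t_star=2.306)

s = statistics.stdev(sample)
# The "summation" in the PI standard error: variances add
# se_pred^2 = s^2 (individual variation) + s^2/n (uncertainty in x-bar)
print(ci, pi)
print(se_pred**2, s**2 + se_mean**2)
```

Both intervals share the same midpoint, but the prediction interval is always wider, and it does not shrink toward zero width as n grows because the s² term remains.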
From Section 2.3 you should be able to:
· Carry out a sign test
  o Determine observed successes and failures
  o Define the parameter (the probability of success, π) in context
  o State hypotheses about π
  o Use binomial or normal (if valid) distribution to find p-value
  o Relate to a test about a population median
· Take a log transformation, apply a t confidence interval, back-transform the endpoints of the interval to the original measurement units
· Identify and interpret a bootstrap confidence interval for a population parameter
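The sign test steps can be sketched in Python as follows. The data and the hypothesized median of 5 are made up, "success" is defined here as an observation above the hypothesized median, and observations equal to it are dropped (one common convention, stated here as an assumption).

```python
from math import comb

def sign_test_counts(data, hypothesized_median):
    """Count observations above (successes) and below (failures); ties are dropped."""
    above = sum(x > hypothesized_median for x in data)
    below = sum(x < hypothesized_median for x in data)
    return above, below

def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

# Made-up data; H0: pi = 0.5 (median is 5) vs. Ha: pi > 0.5 (median exceeds 5)
data = [6, 7, 8, 9, 5.5, 10, 12, 7.5, 3, 4]
successes, failures = sign_test_counts(data, hypothesized_median=5)
p_value = binomial_upper_tail(successes, successes + failures)

print(successes, failures, p_value)
```

Because only the signs are used, the test makes a statement about the population median rather than the mean and does not require a normal population.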
From Chapter 3
In Ch. 3, we bounced around within the chapter, but the big picture to keep in mind is that we are comparing a categorical response variable between two groups, and that it matters a lot where those groups came from: an observational study with independent random samples or a randomized experiment. This distinction must be considered when drawing your final conclusions (can I generalize to the larger populations? can I draw cause-and-effect conclusions?), and should probably also be considered when you analyze the data (are you modeling random samples from populations or random assignment?). This can impact the standard errors that you use, but with large sample sizes the results won’t differ too much, and analysts tend to apply the same normal approximation.
From Section 3.1 you should be able to:
· Construct a two-way table of counts
· Calculate (appropriate) conditional proportions and compare them
  o Proportion of 6 ft tall men in the NBA vs. proportion of NBA players over 6 ft vs. proportion of men that are 6 ft tall NBA players
· Create a segmented bar graph from a two-way table (may use technology) and describe what it reveals (e.g., do the distributions differ across the groups)
· Define the parameter in terms of the difference in population proportions
· State hypotheses in terms of the difference in population proportions
· Simulate random sampling (independent binomials) from two populations under the null hypothesis
  o Create a null distribution of differences in sample proportions
  o Produce graphical and numerical summaries of this distribution
  o Estimate or obtain a p-value from simulation results
    § Including summing a Boolean expression in JMP
  o Explain the simulation process (e.g., independent random samples)
  o Interpret the p-value in context
· Determine whether a normal approximation to the null distribution should be valid
  o Remember that a simple way of checking this is that all cell counts in the table are at least 5; list the values you are looking at
  o Should also consider the sizes of the (finite) populations you are sampling from
  o Reasoning behind the standard error formula (adding variances)
    § Pooled vs. unpooled variance estimates (TOS vs. CI)
· Calculate a z test statistic and p-value using the normal distribution
  o Interpret the standard error, test statistic, and p-value in context
· Calculate and interpret a z-confidence interval for the difference in two population proportions
  o Make sure the direction is clear
· Discuss factors that will affect standard error, test statistic, p-value, confidence interval, and how
  o e.g., sample size, order of subtraction, size of difference in sample proportions
· Distinguish between the explanatory variable and the response variable from a study description
· Identify and explain a potential confounding variable in observational studies
  o Be sure to explain how there could be a differential effect by the confounding variable on the response variable between the explanatory variable groups (make sure it’s an alternative explanation for the observed difference between groups separate from the explanatory variable, not just another variable)
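The two-sample z-procedures, including the pooled vs. unpooled standard errors, can be sketched in Python as below. The counts (30 of 100 vs. 20 of 100) are made up, the function name is hypothetical, and the sketch assumes the normal approximation is valid (all four cell counts at least 5).

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2, confidence=0.95):
    """Two-sample z test (pooled SE) and z interval (unpooled SE) for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2                                   # direction: group 1 minus group 2

    # Test of significance: pooled estimate, because H0 says the proportions are equal
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled, adding the two separate variances
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_star * se_unpooled, diff + z_star * se_unpooled)

    return {"diff": diff, "z": z, "p_value": p_two_sided, "ci": ci,
            "se_pooled": se_pooled, "se_unpooled": se_unpooled}

# Hypothetical two-way table: 30/100 successes in group 1, 20/100 in group 2
result = two_proportion_z(30, 100, 20, 100)
print(result)
```

Swapping the order of the groups flips the sign of the difference, the test statistic, and the interval endpoints, but leaves the two-sided p-value unchanged.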
From Section 3.2 (Investigations 3.3, 3.4, 3.5) you should be able to:
· Distinguish between an observational study and an experimental study
  o Be able to justify which type of study you have
· Discuss the advantages of using a placebo treatment
· Discuss the advantages of blinding and double-blinding in a study
· Discuss the purposes/goals/merits of “randomization” (aka random assignment to treatment groups)
· Identify when we are allowed to draw cause-and-effect conclusions (perhaps just about the experimental units in the study)
· Interpret and critique a description of a research study (e.g., Inv 3.5)
· Discuss some of the limitations in the type of conclusions that can be drawn from different designs
  o Do not draw cause-and-effect conclusions from an observational study
    § Can still decide whether there is evidence of an association, measure how strong it is
· Identify and justify the appropriate “scope of conclusions” (generalizability, causation) from the study design
From Section 3.3 you should be able to:
· Define the parameter in terms of the difference in (long-run) treatment probabilities
· Simulate random assignment under the null hypothesis, create a null (or randomization) distribution of the difference in two sample proportions
  o Carry out and interpret the results from a randomization simulation for a two-way table (e.g., card shuffling, including how many cards and how many of each color, how many to deal out to each group)
  o Use the Analyzing Two-way Tables applet
  o Understand the equivalence of using the number of successes in group A, difference in group proportions, relative risk, and odds ratio in this simulation
  o Including how to approximate the (one or two-sided) p-value based on the simulation results
· Calculate the exact (one or two-sided) p-value using the hypergeometric distribution (aka Fisher’s Exact Test)
  o Including showing set up by hand and with technology
  o Including writing out the probability statement P(X > …) and the input values of the hypergeometric (N, M, n)
  o Using technology to carry out the full FET p-value calculation
· Approximate (and interpret) the p-value and confidence interval for π1 − π2 using the normal distribution (two-sample z-procedures)
  o Decide whether the procedure is valid (just worry about cell counts, not population size)
  o Consider continuity correction/Adjusted Wald adjustment as an improvement
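The exact p-value from the hypergeometric distribution can be set up as in the Python sketch below. The table is made up (20 subjects, 10 successes overall, 10 assigned to group A, 8 successes observed in group A). The reading of the inputs as N = total subjects, M = total successes, n = size of group A is an assumption about the (N, M, n) notation; check it against the convention used in class.

```python
from math import comb

def hypergeom_pmf(k, N, M, n):
    """P(X = k): k successes land in group A when n of N subjects are
    assigned to group A and M of the N subjects are successes."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

def fisher_upper_tail(k_obs, N, M, n):
    """One-sided exact p-value P(X >= k_obs)."""
    k_max = min(M, n)
    return sum(hypergeom_pmf(k, N, M, n) for k in range(k_obs, k_max + 1))

# Hypothetical study: N = 20 subjects, M = 10 successes, n = 10 in group A,
# and 8 of the successes were observed in group A
p_value = fisher_upper_tail(8, N=20, M=10, n=10)
print(p_value)   # P(X >= 8) = (45*45 + 10*10 + 1) / C(20, 10)
```

Writing the probability statement first, here P(X ≥ 8), and then listing N, M, and n is the same set-up you would show by hand before handing the arithmetic to technology.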
From Section 3.4 (Investigations 3.9, 3.11) you should be able to:
· Calculate and interpret relative risk as an alternative measure of association between two binary variables
  o Remember that the difference in proportions does not take into account the magnitude of the baseline risk
    § Small differences in proportions “seem” much larger when the baseline risk is small
· Simulate a null distribution (using random sampling and/or random assignment) under the null hypothesis and interpret the results using the relative risk as the statistic
  o Create a null distribution of relative risk
  o Including how to approximate the p-value based on the simulation results
· Determine (by hand and with applet) and interpret a confidence interval for the ratio of treatment probabilities, π1/π2, using the normal distribution
  o Including how and why we “transformed” the statistic
  o Calculate the standard error of the transformed statistic
  o Back-transform the confidence interval
· Calculate and interpret odds ratio as an alternative measure of the association between two binary variables
· Distinguish between cohort, case-control, and cross-classified designs of an observational study and how the design affects which numerical summaries you can reasonably interpret
  o Don’t use relative risk or difference in proportions with case-control studies
  o It is always ok to calculate odds ratio
· Calculate and interpret odds ratio
  o How to decide which calculation is being asked for in the context of the problem
  o How to interpret the results of the calculations
· Interpret a confidence interval for the population odds ratio in context
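The relative risk interval (transform, compute the SE, back-transform) and the odds ratio can be sketched in Python as below. The counts are made up, the function names are hypothetical, and the standard error used for ln(relative risk) is the usual large-sample formula, sqrt(1/a − 1/n1 + 1/c − 1/n2), where a and c are the success counts in the two groups.

```python
import math
from statistics import NormalDist

def relative_risk_ci(a, n1, c, n2, confidence=0.95):
    """Relative risk (group 1 / group 2) with a z interval built on the log scale."""
    rr = (a / n1) / (c / n2)
    # The sampling distribution of RR is skewed; ln(RR) is closer to normal
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    log_lower = math.log(rr) - z_star * se_log
    log_upper = math.log(rr) + z_star * se_log
    # Back-transform the endpoints to the ratio scale
    return rr, (math.exp(log_lower), math.exp(log_upper))

def odds_ratio(a, b, c, d):
    """Odds of success in group 1 (a successes, b failures) divided by
    odds of success in group 2 (c successes, d failures)."""
    return (a / b) / (c / d)

# Hypothetical table: group 1 has 30 successes, 70 failures; group 2 has 15 and 85
rr, rr_ci = relative_risk_ci(30, 100, 15, 100)
odds = odds_ratio(30, 70, 15, 85)
print(rr, rr_ci, odds)
```

Note that the back-transformed interval is not symmetric around the observed relative risk, and that an interval entirely above 1 points to a higher probability of success in group 1.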
Code you should be able to interpret/explain/write pseudo-code
· Subsetting data
· Recoding a categorical variable
· Splitting the graph by an explanatory variable
· Create simulations to replicate random sampling and/or random assignment
· You should also be able to interpret generic computer output for the procedures we have learned (one-sample t-procedures, two-sample z-procedures)
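As an example of the kind of simulation code you might be asked to explain, here is a minimal Python sketch of random assignment under the null hypothesis (the "card shuffling" idea). This is not the course's R or JMP code; the counts, function name, number of repetitions, and seed are all made up for illustration.

```python
import random

def randomization_p_value(successes_a, n_a, successes_b, n_b, reps=10_000, seed=1):
    """Approximate one-sided p-value: proportion of shuffles with a difference
    in sample proportions (A minus B) at least as large as the observed one."""
    rng = random.Random(seed)
    total_successes = successes_a + successes_b
    # One "card" per subject: 1 = success, 0 = failure
    cards = [1] * total_successes + [0] * (n_a + n_b - total_successes)
    observed_diff = successes_a / n_a - successes_b / n_b

    sim_diffs = []
    for _ in range(reps):
        rng.shuffle(cards)                     # re-randomize the group labels
        sim_a = sum(cards[:n_a])               # first n_a cards go to group A
        sim_b = total_successes - sim_a
        sim_diffs.append(sim_a / n_a - sim_b / n_b)

    # Summing a Boolean expression: each True counts as 1
    count_extreme = sum(d >= observed_diff for d in sim_diffs)
    return count_extreme / reps

# Hypothetical experiment: 8/10 successes in group A, 2/10 in group B
p_value = randomization_p_value(8, 10, 2, 10)
print(p_value)
```

The response outcomes are held fixed and only the group labels are shuffled, which is what distinguishes modeling random assignment from modeling independent random samples.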
What you should be able to do with the calculator
· Calculate conditional proportions, relative risk, odds ratio
· Calculate confidence intervals for relative risk
Things you need to remember from Exam 1
· Defining variables and parameters in context
· How to interpret probability as a long-run relative frequency
· Showing your work/explaining how you used the computer
· Explaining your simulation process
· One-sided vs. two-sided alternatives
· The reasoning of statistical significance and what a p-value measures
· Making conclusions based on the size of the p-value
· Interpreting confidence intervals and confidence levels
· “Duality” between confidence intervals and tests of significance
· The concept of power and factors that affect power
· Comparing “theoretical” and “simulation” results
Keep in mind
· When to talk in terms of population means, μ, and when to talk in terms of probabilities, π
· When comparing distributions, remember to cite your evidence if you think there is a difference in the groups. In particular, tell me what you see in the summary statistics (e.g., a higher proportion) that leads to your conclusion (e.g., abstainers more likely to develop peanut allergy than consumers)
· Remember that the confidence level refers to the reliability of the method – how often, in the long run, random samples will produce an interval that succeeds in capturing the population parameter
· Remember to think about the direction of subtraction used by the technology
· We can use a one-sample t-procedure even when the sample size is small if we have reason to believe the population distribution is normal. You can try to judge this, especially if you don’t have past experience with the variable, based on graphs of the sample data. If the sample data look reasonably normally distributed (normal probability plots are a useful tool for helping this judgment), you can cite this as evidence that the population distribution is normal. If you aren’t sure, then use an alternative analysis instead (e.g., data transformation).
· Keep in mind that the one-sample t-procedures only tell you about the population mean (vs. other aspects of the distribution)
· Always put your conclusions in the context of the research study
  o Including considering “practical significance”
· Try to avoid the word “accurate” without explaining exactly what you mean by it.
· Try to avoid use of the word “group” but clarify if you mean the sample or the population or the long-run treatment
· Avoid use of the word “it”
Also keep in mind:
· Part of your grade will be based on communication. Be precise in your statements and use of terminology. Avoid unclear statements, and especially don’t use the word “it”! Always relate your comments to the study context.
· Show the details of any of your calculations.
· Organize your notes ahead of time, and don’t plan to rely on your notes too much.
· Be able to both make conclusions from a p-value and provide a detailed interpretation of what the p-value measures in context
· You should continue to focus on the overall statistical process from collecting the data, to looking at the data, to analyzing the data, to interpreting results
· Simulation-based vs. Exact vs. Theory-based (normal) procedures
· Think big picture and be able to apply your knowledge to new situations
Some additional Lessons from HW
HW 4
· Subsetting data and possible consequences on scope of conclusions
· Identifying outliers by 1.5 × IQR
· Calculating P(Weight > 325) vs. P(Average weight > 325)
· Impact of sample size on this probability
· Main benefit of stratified sampling
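The difference between P(Weight > 325) and P(Average weight > 325) can be illustrated with the short Python sketch below. Only the cutoff 325 comes from the homework; the population mean (300), SD (50), and sample sizes are made-up values, and the sketch assumes a normal model is reasonable for both the individual weights and the sample mean.

```python
import math
from statistics import NormalDist

def prob_individual_exceeds(cutoff, mu, sigma):
    """P(one observation > cutoff): uses the population SD."""
    return 1 - NormalDist(mu, sigma).cdf(cutoff)

def prob_mean_exceeds(cutoff, mu, sigma, n):
    """P(sample mean > cutoff): uses the SD of the sample mean, sigma / sqrt(n)."""
    return 1 - NormalDist(mu, sigma / math.sqrt(n)).cdf(cutoff)

# Hypothetical population: mean 300, SD 50
p_individual = prob_individual_exceeds(325, mu=300, sigma=50)
p_mean_25 = prob_mean_exceeds(325, mu=300, sigma=50, n=25)
p_mean_100 = prob_mean_exceeds(325, mu=300, sigma=50, n=100)

print(round(p_individual, 4), round(p_mean_25, 4), round(p_mean_100, 6))
```

With these values, a single weight above 325 is fairly common, while an average above 325 becomes rarer as the sample size grows because sample means vary less than individual observations.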
HW 5
· Interpreting the mean/median in the context of the research question
· If you “control” the distribution of a variable (pick 20 males and 80 females), don’t use your data to estimate the probability of outcomes for that variable (20% of the population are male)
HW 6
· If you can see the result going either direction, use a two-sided alternative
· A huge advantage of odds ratios is they are “invariant” to which variable is explanatory and which is response, and to which outcome is defined as success and which as failure
  o So it is the only reasonable statistic with a case-control study
· Reasoning behind continuity correction
· When/why to recode a variable vs. subset a variable
· Impact of sample size on p-value