Stat 301 – Review 2

Review Problems: Click here (Solutions), Investigation 2.3

By Monday, noon: Submit Exam 2 Question and Exam 2 Example Questions in Canvas, then review the responses to the former before the exam

Optional review session: Monday, 6:30-7:30pm, 10-126

Exam Format: The exam will cover topics from Chapter 2 – one quantitative variable and Chapter 3 – comparing two groups on a binary response variable (Days 15-23, HW 4-6). The exam format will be similar to Exam 1 with a mix of short answer and longer answer questions. You will not be using software but you will be expected to interpret provided output (e.g., from R, JMP, applets, other), discuss how you would use technology (e.g., R or JMP or applet), and interpret “code.” You should also bring your calculator (not a cell phone). You may use two pages of notes.

Study Hints: You should study from the text (including chapter summaries), lecture notes, solutions to investigations, powerpoint slides, homeworks, homework solutions, Chapter Examples (2.1, 3.1, 3.2), practice problems, and quizzes. See also the overview of procedures pages. In studying, I recommend re-doing investigations and homeworks, without looking at the online solutions, then checking your answers. The questions will not be heavily computational, but you are expected to know how to set up the calculations by hand. You are also expected to explain your reasoning, show your steps, and interpret your results. The exam will contain approximately 50 points (so a 2 point problem should only take you about 2 minutes).

Principles to keep in mind:

Standardized/test statistic = (statistic – hypothesized)/(SE of statistic)

Confidence interval = observed statistic ± (critical value) × (SE of statistic)

SE of statistic estimates sample to sample variation of statistic (vs. variability in data)
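
As a rough illustration of these principles (the course uses R, JMP, or applets; this Python sketch is only one way to lay out the arithmetic), the two helper functions below compute a standardized statistic and an interval of the form statistic ± critical value × SE. The function names and all numbers are made up for illustration, and the multiplier 2 stands in for a proper critical value.

```python
import math

def standardized_statistic(statistic, hypothesized, se):
    """(statistic - hypothesized) / (SE of statistic)."""
    return (statistic - hypothesized) / se

def confidence_interval(statistic, critical_value, se):
    """statistic +/- (critical value) x (SE of statistic), as (lower, upper)."""
    margin = critical_value * se
    return statistic - margin, statistic + margin

# Made-up numbers: sample mean 98.2 from n = 25 observations with s = 0.7,
# compared with a hypothesized population mean of 98.6.
se_mean = 0.7 / math.sqrt(25)                      # SE of the sample mean, s / sqrt(n)
t_stat = standardized_statistic(98.2, 98.6, se_mean)
interval = confidence_interval(98.2, 2, se_mean)   # rough 95% interval using 2 as the multiplier

print(round(se_mean, 3), round(t_stat, 2), interval)
```

Note that `se_mean` (0.14) describes sample-to-sample variation in the sample mean, which is much smaller than the variability in the data themselves (s = 0.7).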

From Section 2.1 (Investigations 2.1, 2.2, 2.3) you should be able to:

· Create and interpret different types of graphs: histogram, dotplot, stemplot, boxplot

o Critique effectiveness of display (e.g., labels, bin width)

· Describe the shape of a distribution of a quantitative variable as symmetric, skewed to the left, or skewed to the right

· Assess the normality of a distribution (e.g., overlay, normal probability plot~~, S-W test~~)

· Identify possible outlier(s) (in the dataset) and suggest explanations

· Critique justifications for removing outlier(s) from dataset

· Interpret the five number summary and the inter-quartile range (IQR)

· Understand how skewness and/or outliers impact the relative positions of the mean and median and the standard deviation

· Explore data transformations to normalize a distribution

· Transform data and use a normal model to estimate a probability
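
For a concrete feel for the five-number summary, the IQR, and how an outlier pulls the mean away from the median, here is a small Python sketch with made-up data. Quartile conventions differ across software (R, JMP, and calculators may give slightly different quartiles); the `method="inclusive"` choice below is just one convention, and the function names are illustrative.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile convention."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

def outlier_fences(data):
    """Return the 1.5 x IQR fences (lower, upper) used to flag possible outliers."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 3, 4, 5, 6, 7, 8, 9, 30]   # made-up values with one large observation
summary = five_number_summary(data)
low_fence, high_fence = outlier_fences(data)
flagged = [x for x in data if x < low_fence or x > high_fence]

print(summary, (low_fence, high_fence), flagged)
print(statistics.mean(data), statistics.median(data))  # the mean is pulled toward the large value
```

Here the median (6) is resistant to the value 30, while the mean (about 8.2) and the standard deviation are not.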

From Section 2.2 (Investigations 2.4, 2.5, 2.6) you should be able to:

· Continue to consider whether or not you are likely to have a representative sample

· Explain the reasoning behind simulating random samples from a finite population

o Critique assumptions made about the population

· Predict the behavior of the sampling distribution of the sample mean (mean, standard deviation, shape) and compare to the population distribution

o How/when does the shape of the population matter?

· Apply the Central Limit Theorem of the sample mean

o Know when it does/does not apply

o Use technology (R, JMP, or Normal Probability Calculator) to approximate probabilities for sample means with the normal distribution

§ What values to use for mean, SD, observation; direction

§ Sketch, label the corresponding distribution and shade the probability of interest

· Define a population mean in context

· State appropriate null and alternative hypotheses about a population mean for a given research question

· Calculate and interpret the standard error of the sample mean, s/√n.

· Determine and interpret the “standardized” distance between x̄ and the hypothesized value μ₀

· Roughly approximate a 95% confidence interval for μ by x̄ ± 2s/√n.

· Use the t-distribution to model the behavior of the standardized statistic

o Determine the degrees of freedom and their impact on the t distribution

§ As df increases, t distribution approaches standard normal distribution

o Explain the difference between the normal distribution and the t distribution

§ Heavier tails

§ Why that’s helpful to use for inference about the population mean

§ Consequences on p-values, confidence intervals, coverage rate

o Assess the validity of the t procedures

· Interpret a confidence interval for μ

o If asked, interpret the confidence level

· Determine and interpret a prediction interval (PI) for a future observation

o With raw data, by hand

o Explain the reasoning behind the summation in the SE formula for a PI

o Compare a confidence interval to a prediction interval

o Assess the validity of the prediction interval procedure
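
To compare a confidence interval with a prediction interval by hand, the following Python sketch may help. It assumes the usual forms x̄ ± t* × s/√n for the mean and x̄ ± t* × s × √(1 + 1/n) for a future observation; the sum inside the square root reflects two sources of variation added together (a single new observation, s², plus the sample mean, s²/n). The data are made up, and the critical value t* is passed in by the user (2.262 for df = 9 at 95% confidence) because the standard library has no t distribution.

```python
import math
import statistics

def t_intervals(data, t_critical):
    """Return (confidence interval for the mean, prediction interval for one new value)."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)              # sample SD (divides by n - 1)
    se_mean = s / math.sqrt(n)              # SE of the sample mean
    se_pred = s * math.sqrt(1 + 1 / n)      # SE for predicting a single future observation
    ci = (xbar - t_critical * se_mean, xbar + t_critical * se_mean)
    pi = (xbar - t_critical * se_pred, xbar + t_critical * se_pred)
    return ci, pi

data = [12, 15, 11, 14, 13, 16, 12, 14, 15, 13]   # made-up sample, n = 10
ci, pi = t_intervals(data, t_critical=2.262)      # t* for df = 9, 95% confidence
print(ci, pi)
```

Both intervals share the same midpoint, but the prediction interval is always wider and does not shrink to zero width as n grows. Its validity also depends more heavily on the population itself being approximately normal.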

From Section 2.3 you should be able to:

· ~~Carry out a sign test~~

o ~~Determine observed successes and failures~~

o ~~Define~~ ~~in context~~

o ~~State hypotheses about~~

o ~~Use binomial or normal (if valid) distribution to find p-value~~

o ~~Relate to a test about a population median~~

· Take a log transformation, apply a t confidence interval, back-transform the endpoints of the interval to the original measurement units

· ~~Identify and interpret a bootstrap confidence interval for a population parameter~~
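
The log-transform approach can be sketched in Python as follows: take natural logs, build a t interval on the log scale, then exponentiate the endpoints. The data and the function name are made up, and t* = 2.365 (df = 7, 95% confidence) is supplied by the user. One common reading, assumed here, is that the back-transformed interval describes the population median in the original units rather than the mean.

```python
import math
import statistics

def log_t_interval(data, t_critical):
    """t interval computed on the log scale, back-transformed to the original units."""
    logs = [math.log(x) for x in data]
    n = len(logs)
    center = statistics.mean(logs)
    se = statistics.stdev(logs) / math.sqrt(n)
    lower_log = center - t_critical * se
    upper_log = center + t_critical * se
    return math.exp(lower_log), math.exp(upper_log)   # undo the log transformation

times = [1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 7.5, 12.0]     # made-up right-skewed data, n = 8
lower, upper = log_t_interval(times, t_critical=2.365)
print(lower, upper)
```

The back-transformed interval is not symmetric around its center in the original units, which is expected for skewed data.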

From Chapter 3

In Ch. 3, we bounced around within the chapter but the big picture to keep in mind is comparing a categorical response variable between two groups and that it matters a lot where those groups came from: an observational study with independent random samples or a randomized experiment. This distinction must be considered when drawing your final conclusions (can I generalize to the larger populations, can I draw cause and effect?), and should probably also be considered when you analyze the data (are you modeling random samples from populations or random assignment?). This can impact the standard errors that you use, but with large sample sizes the results won’t differ too much, and analysts tend to apply the same normal approximation.

From Section 3.1 you should be able to:

· Construct a two-way table of counts

· Calculate (appropriate) conditional proportions and compare them

o Proportion of 6 ft tall men in the NBA vs. Proportion of NBA players over 6 ft vs. proportion of men that are 6 foot tall NBA players.

· Create a segmented bar graph from a two-way table ~~(may use technology)~~ and describe what it reveals (e.g., do the distributions differ across the groups)

· Define the parameter in terms of the difference in population proportions

· State hypotheses in terms of the difference in population proportions

· Simulate random sampling (independent binomials) from two populations under the null hypothesis

o Create a null distribution of differences in sample proportions

o Produce graphical and numerical summaries of this distribution

o Estimate or obtain a p-value from simulation results

· ~~Including summing a Boolean expression in JMP~~

o Explain the simulation process (e.g., independent random samples)

o Interpret the p-value in context

· Determine whether a normal approximation to the null distribution should be valid

o Remember that a simple way of checking this is to confirm that all cell counts in the table are at least 5; list the values you are checking

o Should also consider the sizes of the (finite) populations you are sampling from

o ~~Reasoning behind the standard error formula (adding variances)~~

· Pooled vs. unpooled variance estimates (TOS vs. CI)

· Calculate a z test statistic and p-value using the normal distribution

o Interpret the standard error, test statistic, and p-value in context

· Calculate and interpret a z-confidence interval for the difference in two population proportions

o Make sure the direction is clear

· Discuss factors that will affect standard error, test statistic, p-value, confidence interval, and how

o e.g., sample size, order of subtraction, size of difference in sample proportions

· Distinguish between the explanatory variable and the response variable from a study description

· Identify and explain a potential confounding variable in observational studies

o Be sure to explain how there could be a differential effect by the confounding variable on the response variable between the explanatory variable groups (Make sure it’s an alternative explanation for the observed difference between groups separate from the explanatory variable, not just another variable).
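
As a sketch of the two-sample z-procedures listed above, the Python function below uses the pooled proportion in the standard error for the test statistic and the unpooled standard error for the confidence interval. The counts are made up and the function name is illustrative; on the exam you would set up the same calculations by hand or read them from provided output.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2, confidence=0.95):
    """Return (z statistic, two-sided p-value, CI for the difference), group 1 minus group 2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2                                   # direction of subtraction: group 1 - group 2
    pooled = (x1 + x2) / (n1 + n2)                   # pooled estimate under the null hypothesis
    se_test = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_test
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # unpooled SE for the interval
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_star * se_ci, diff + z_star * se_ci)
    return z, p_two_sided, ci

# Made-up counts: 30 successes out of 100 in group 1, 18 out of 100 in group 2.
z, p_value, ci = two_proportion_z(30, 100, 18, 100)
print(round(z, 3), round(p_value, 4), ci)
```

Swapping the order of the groups flips the sign of z and of both interval endpoints but leaves the two-sided p-value unchanged, which is why the direction of subtraction needs to be stated.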

From Section 3.2 (Investigations 3.3, 3.4, 3.5) you should be able to:

· Distinguish between an observational study and an experimental study

o Be able to justify which type of study you have

· Discuss the advantages of using a placebo treatment

· Discuss the advantages to blinding and double-blinding in a study

· Discuss the purposes/goals/merits of “randomization” (aka random assignment to treatment groups)

· Identify when we are allowed to draw cause-and-effect conclusions (perhaps just about the experimental units in the study)

· Interpret and critique a description of a research study (e.g., Inv 3.5)

· Discuss some of the limitations in the type of conclusions that can be drawn from different designs

o Do not draw cause-and-effect conclusions from an observational study

· Can still decide whether there is evidence of an association, measure how strong it is

· Identify and justify the appropriate “scope of conclusions” (generalizability, causation) from the study design

From Section 3.3 you should be able to:

· Define the parameter in terms of the difference in (long-run) treatment probabilities

· Simulate random assignment under the null hypothesis, create a null (or randomization) distribution of the difference in two sample proportions

o Carry out and interpret the results from a randomization simulation for a two-way table (e.g., card shuffling, including how many cards and how many of each color, how many deal out to each group)

o Use the Analyzing Two-way Tables applet

o Understand the equivalence of using the number of successes in group A, difference in group proportions, relative risk, and odds ratio in this simulation

o Including how to approximate the (one or two-sided) p-value based on the simulation results

· Calculate the exact (one or two-sided) p-value using the hypergeometric distribution (aka Fisher’s Exact Test)

o Including showing set up by hand and with technology

o Including writing out the probability statement P(X > …. ) and the input values of the hypergeometric (N, M, n)

o Using technology to carry out the full FET p-value calculation

· Approximate (and interpret) the p-value and confidence interval for π₁ − π₂ using the normal distribution (two-sample z-procedures)

o Decide whether the procedure is valid (just worry about cell counts, not population size)

o Consider continuity correction/Adjusted Wald adjustment as an improvement
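
The card-shuffling simulation and the hypergeometric calculation can be compared directly in a short Python sketch. The setup is made up: N = 20 experimental units, M = 12 total successes, n = 10 units assigned to group A, and 9 observed successes in group A. The function names are illustrative, and the one-sided p-value is taken as P(X ≥ 9).

```python
import math
import random

def hypergeom_upper_tail(k, N, M, n):
    """Exact P(X >= k) when n of N units go to group A and M of the N units are successes."""
    total = math.comb(N, n)
    ways = sum(math.comb(M, x) * math.comb(N - M, n - x) for x in range(k, min(M, n) + 1))
    return ways / total

def shuffle_p_value(k, N, M, n, reps=10000, seed=1):
    """Approximate the same probability by shuffling M success cards and N - M failure cards."""
    rng = random.Random(seed)
    cards = [1] * M + [0] * (N - M)      # 1 = success card, 0 = failure card
    hits = 0
    for _ in range(reps):
        rng.shuffle(cards)
        if sum(cards[:n]) >= k:          # deal the first n cards to group A
            hits += 1
    return hits / reps

exact = hypergeom_upper_tail(9, N=20, M=12, n=10)
approx = shuffle_p_value(9, N=20, M=12, n=10)
print(round(exact, 4), approx)
```

The simulated proportion should land close to the exact value (about 0.0099 here), with the gap shrinking as the number of shuffles increases.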

From Section 3.4 (Investigation 3.9, 3.11) you should be able to:

· Calculate and interpret relative risk as an alternative measure of association between two binary variables

o Remember that the difference in proportions does not take into account the magnitude of the baseline risk

§ Small differences in proportions “seem” much larger when the baseline risk is small

· Simulate a null distribution (using random sampling and/or random assignment) under the null hypothesis and interpret the results using the relative risk as the statistic

o Create a null distribution of relative risk

o Including how to approximate the p-value based on the simulation results

· Determine (by hand and with applet) and interpret a confidence interval for the ratio of treatment probabilities, π₁/π₂, using the normal distribution

o Including how and why we “transformed” the statistic

o Calculate the standard error of the transformed statistic

o Back transform the confidence interval

· Calculate and interpret odds ratio as an alternative measure of the association between two binary variables

· Distinguish between a cohort, case-control, and cross-classified designs of an observational study and how the design affects which numerical summaries you can reasonably interpret

o Don’t use relative risk or difference in proportions with case-control studies

o It is always ok to calculate odds ratio

· Calculate and interpret odds ratio

o How to decide which calculation is being asked for in the context of the problem

o How to interpret the results of the calculations

· Interpret a confidence interval for the population odds ratio in context
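
A minimal Python sketch of the relative risk and odds ratio calculations follows, using a made-up two-way table. It assumes the standard large-sample standard error for the log relative risk, √(1/a − 1/n₁ + 1/c − 1/n₂), where a and c are the success counts in the two groups; the notation in the course materials may differ. The interval is built on the log scale and then back-transformed.

```python
import math
from statistics import NormalDist

def relative_risk_ci(a, n1, c, n2, confidence=0.95):
    """Relative risk (group 1 / group 2) with a CI built on the log scale, then back-transformed."""
    rr = (a / n1) / (c / n2)
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)   # SE of ln(relative risk)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = math.exp(math.log(rr) - z_star * se_log)
    upper = math.exp(math.log(rr) + z_star * se_log)
    return rr, (lower, upper)

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: group 1 has a successes, b failures; group 2 has c, d."""
    return (a * d) / (b * c)

# Made-up table: group 1 has 30 successes and 70 failures; group 2 has 15 and 85.
rr, rr_ci = relative_risk_ci(30, 100, 15, 100)
oratio = odds_ratio(30, 70, 15, 85)
oratio_swapped = odds_ratio(30, 15, 70, 85)   # explanatory and response roles swapped
print(rr, rr_ci, round(oratio, 3), round(oratio_swapped, 3))
```

The odds ratio is the same whichever variable is treated as explanatory, which is why it remains interpretable in a case-control study when the relative risk is not.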

Code you should be able to interpret/explain/write pseudo-code

· Subsetting data

· Recoding a categorical variable

· Splitting the graph by an explanatory variable

· Create simulations to replicate random sampling and/or random assignment

· You should also be able to interpret generic computer output for the procedures we have learned (one-sample t-procedures, two-sample z-procedures)
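
One way to write a random-sampling simulation of this kind is sketched below in Python (the course code is in R or JMP, so treat this as pseudo-code that happens to run). It draws two independent binomial samples with a common success probability, records the difference in sample proportions, and estimates a two-sided p-value by summing a Boolean (True/False) expression. The counts, seed, and function name are made up.

```python
import random

def simulate_null_differences(n1, n2, common_p, reps=5000, seed=301):
    """Null distribution of p-hat1 - p-hat2 from independent binomial samples with the same probability."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        x1 = sum(rng.random() < common_p for _ in range(n1))   # successes in sample 1
        x2 = sum(rng.random() < common_p for _ in range(n2))   # successes in sample 2
        diffs.append(x1 / n1 - x2 / n2)
    return diffs

# Made-up observed data: 30/100 vs 18/100, so the pooled proportion is 48/200.
observed = 30 / 100 - 18 / 100
diffs = simulate_null_differences(100, 100, common_p=48 / 200)

# Summing True/False values counts the simulated differences at least as extreme as observed;
# the tiny tolerance guards against floating-point rounding in the comparison.
p_value = sum(abs(d) >= abs(observed) - 1e-9 for d in diffs) / len(diffs)
print(p_value)
```

To mimic random assignment instead, the loop would shuffle the observed responses and re-deal them to the two groups rather than drawing new binomial samples.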

What you should be able to do with the calculator

Calculate conditional proportions, relative risk, odds ratio

Calculate confidence intervals for relative risk

Things you need to remember from Exam 1

· Defining variables and parameters in context

· How to interpret probability as a long-run relative frequency

· Showing your work/explaining how you used the computer

· Explaining your simulation process

· One-sided vs. two-sided alternatives

· The reasoning of statistical significance and what a p-value measures

· Making conclusions based on the size of the p-value

· Interpreting confidence intervals and confidence levels

· “Duality” between confidence intervals and tests of significance

· The concept of power and factors that affect power

· Comparing “theoretical” and “simulation” results

Keep in mind

· When to talk in terms of population means, μ, and when to talk in terms of probabilities, π

· When comparing distributions, remember to cite your evidence if you think there is a difference in the groups. In particular, tell me what you see in the summary statistics (e.g., a higher proportion) that leads to your conclusion (e.g., abstainers more likely to develop peanut allergy than consumers)

· Remember that the confidence level refers to the reliability of the method – how often, in the long run, random samples will produce an interval that succeeds in capturing the population parameter

· Remember to think about the direction of subtraction used by the technology

· We can use a one-sample t-procedure even when the sample sizes are small if we have reason to believe the population distribution is normally distributed. You can try to judge this, especially if you don’t have past experience with the variable, based on graphs of the sample data. If the sample data looks reasonably normally distributed (normal probability plots are a useful tool for helping this judgment), you can cite this as evidence that the population distribution is normally distributed. If you aren’t sure, then use an alternative analysis instead (e.g., data transformation).

· Keep in mind the one-sample t-procedure only tells you about the population mean (vs. other aspects of the distribution)

· Always put your conclusions in the context of the research study

o Including considering “practical significance”

· Try to avoid the word “accurate” without explaining exactly what you mean by it.

· Try to avoid the word “group”; instead, clarify whether you mean the sample, the population, or the long-run treatment

· Avoid use of the word “it”

Also keep in mind:

· Part of your grade will be based on communication. Be precise in your statements and use of terminology. Avoid unclear statements, and especially don’t use the word “it”! Always relate your comments to the study context.

· Show the details of any of your calculations.

· Organize your notes ahead of time, and don’t plan to rely on your notes too much.

· Be able to both make conclusions from a p-value and provide a detailed interpretation of what the p-value measures in context

· You should continue to focus on the overall statistical process from collecting the data, to looking at the data, to analyzing the data to interpreting results

· Simulation-based vs. Exact vs. Theory-based (normal) procedures

· Think big picture and be able to apply your knowledge to new situations

Some additional Lessons from HW

HW 4

· Subsetting data and possible consequences on scope of conclusions

· Identifying outliers by the 1.5 × IQR rule

· Calculating P(Weight > 325) vs. P(Average weight > 325)

· Impact of sample size on this probability

· Main benefit of stratified sampling
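
The distinction between P(Weight > 325) and P(Average weight > 325) can be illustrated with a normal model in Python. The population mean (300), SD (50), and sample size (25) below are made up and are not the values from the homework. The only change between the two calculations is the standard deviation: σ for one observation, σ/√n for a sample mean.

```python
import math
from statistics import NormalDist

def prob_above(value, mean, sd, n=1):
    """P(mean of n observations > value) under a normal model; n = 1 gives a single observation."""
    return 1 - NormalDist(mean, sd / math.sqrt(n)).cdf(value)

single = prob_above(325, mean=300, sd=50)           # one randomly selected weight
average = prob_above(325, mean=300, sd=50, n=25)    # average of 25 randomly selected weights
print(round(single, 4), round(average, 4))
```

Larger samples make the sample mean less variable, so the probability that the average lands far from the population mean shrinks as n increases. The single-observation calculation relies on the population itself being roughly normal, whereas the calculation for the average can lean on the Central Limit Theorem when n is large enough.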

HW 5

· Interpreting the mean/median in the context of the research question

· If you “control” the distribution of a variable (pick 20 males and 80 females), don’t use your data to estimate the probability of outcomes for that variable (20% of the population are male)

HW 6

· If you can see the result going either direction, use a two-sided alternative

· A huge advantage of odds ratios is they are “invariant” to which variable is explanatory and which is response, and to which outcome is defined as success and which as failure

o So it is the only reasonable statistic with a case-control study

· Reasoning behind continuity correction

· When/Why recode a variable vs. subsetting a variable

· Impact of sample size on p-value