Stat 301 – Exam 1 Preparations

Review Problems: Review problems (solutions)

See also: Investigation 1.16, 1.17, and 1.18, Examples 1.1-1.3, Chapter 1 Summary

Required by noon Tuesday: Submit Review 1 questions (parts 1 and 2) in Canvas discussion boards

Optional: Review session Tuesday, 6:30-7:30 pm, 10-221

Exam Format: The exam will cover topics from Chapter 1 (Lectures 1-12; HW 1-3). The exam questions will be short-answer questions, often with several questions on the same study (but you do not necessarily have to answer (a) to try (b), etc.). There may be a few multiple-choice questions, but be prepared to explain your reasoning. You will have access to JMP, R, and the Applets and Data files page. You may use one page of your own notes (8.5 x 11, front and back). These are the most relevant formulas:

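Binomial: $P(X = k) = \binom{n}{k}\pi^k(1-\pi)^{n-k}$, $E(X) = n\pi$, $SD(X) = \sqrt{n\pi(1-\pi)}$

Normal approximation for $\hat{p}$: $E(\hat{p}) = \pi$, $SD(\hat{p}) = \sqrt{\pi(1-\pi)/n}$

Standard score: (observation – mean)/std dev = $(x - \mu)/\sigma$

One-sample z-test statistic: $z = (\hat{p} - \pi_0)/\sqrt{\pi_0(1-\pi_0)/n}$

One-sample (Wald) z-confidence interval: $\hat{p} \pm z^*\sqrt{\hat{p}(1-\hat{p})/n}$

Adjusted Wald 95% confidence interval: $\tilde{p} \pm 1.96\sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$, where $\tilde{p} = (X+2)/(n+4)$

Sample mean: $\bar{x} = \sum x_i / n$   Sample standard deviation: $s = \sqrt{\sum (x_i - \bar{x})^2/(n-1)}$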
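As a quick check on how the binomial and z-score formulas fit together, here is a minimal Python sketch. It is not one of the course tools (JMP, R, or the applets), and the numbers (15 successes in 20 trials, hypothesized probability 0.5) are hypothetical values chosen only for illustration.

```python
import math

def binomial_pmf(k, n, pi):
    """P(X = k) for X ~ Binomial(n, pi)."""
    return math.comb(n, k) * pi**k * (1 - pi)**(n - k)

# Hypothetical values, for illustration only
n, pi_0, x = 20, 0.5, 15

p_hat = x / n                                  # sample proportion
mean_count = n * pi_0                          # E(X) = n*pi
sd_count = math.sqrt(n * pi_0 * (1 - pi_0))    # SD(X) = sqrt(n*pi*(1-pi))
sd_p_hat = math.sqrt(pi_0 * (1 - pi_0) / n)    # SD(p-hat) = sqrt(pi*(1-pi)/n)
z = (p_hat - pi_0) / sd_p_hat                  # standard score of p-hat

print(f"P(X = {x}) = {binomial_pmf(x, n, pi_0):.4f}")
print(f"E(X) = {mean_count}, SD(X) = {sd_count:.4f}")
print(f"SD(p-hat) = {sd_p_hat:.4f}, z = {z:.4f}")
```

Standardizing the count gives the same z-score as standardizing the proportion, because both the distance from the mean and the standard deviation are divided by n.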

Also see technology hints below. Keep in mind that I am not trying to test you on the technology but you should be ready to use appropriate tools to perform calculations more quickly and to interpret supplied applet or JMP or R output. You may also be expected to use your calculator or to set up calculations by hand (show the values substituted into the formula).

You will be expected to explain your reasoning, show your steps, and interpret your results. Make sure, especially when using technology, that your solution methods are clear!

The exam will be worth approximately 50 points, so plan to spend one minute per point.

Study Advice: You should study from the text (including study conclusions, chapter examples, and chapter summary), lecture notes (ppt), graded homeworks, hw solutions (follow original HW link), and practice problems. The quiz questions/solutions should be accessible to you in Canvas. In studying, I recommend going back through investigations, practice problems, and homeworks, without looking at the solutions, then check your answers, then repeat. (Solutions to ISCAM investigations and Practice problems can be found in Canvas on the Home page under Textbook Resources.) I also strongly believe working on Project 1 is a good study strategy.

Make sure you:

o Review the salmon-colored boxes and Chapter Summaries and Choice of Procedures table (p. 129, but add to it, see technology guide below)

· The hypergeometric distribution will not be covered on Exam 1. We will work with very large populations and use the binomial approximation to the hypergeometric or the normal approximation.

o Review Examples 1.1, 1.2, and 1.3 at the end of chapter

o See HW notes below

Overview: The exam will focus on studies that involve one binary categorical (i.e., yes/no) variable, where the data are a sample of independent (repeat) observations from a random process (the randomness is in the outcome) or a random sample from a large population (the randomness is in which observational units are in your sample). We have studied two main types of statistical inference:

• Statistical significance, where the goal is to assess the degree to which the sample data provide evidence against a null hypothesis and in support of a research conjecture;

• Statistical confidence, where the goal is to estimate a population parameter with an interval of plausible values.

Big Idea:

We have a categorical variable and we have gathered observations from a random process or a random sample from a larger population. From that sample, we want to infer something about the underlying process or population. In other words, we want to use the statistic (which we calculate from our sample data) to test claims about (test of significance) or to estimate (confidence interval) the value of, the parameter (which we don’t know). To do this, we need to assess the amount of “random variation” in our statistic, how much it varies by chance alone. We can use simulation or the binomial distribution or (often) the normal distribution to predict what that variation looks like. If our model is appropriate, then we know how far the statistic might be varying randomly from the parameter.

From Day 1 and Investigations A and B you should be able to:

· Critique and suggest suitable comparisons to answer a research question

· Describe the distribution of a quantitative variable (shape, center, variability, outliers)

o Interpret the mean and standard deviation of a data set

o Interpret a histogram of a quantitative variable

o Remember to talk in terms of distribution not just individual values

· Anticipate and explain variable behavior including outliers

· Interpret probability as a long-run proportion (under identical conditions)

· Interpret expected value as a long-run average

· Use simulation to estimate a probability

· Distinguish between “exact” probability calculations and simulated results
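To make the contrast between simulated and "exact" probabilities concrete, here is a small Python sketch. It is an assumption-laden illustration, not the applet's method: the values (at least 14 successes in 16 trials with success probability 0.5), the seed, and the number of repetitions are all hypothetical choices.

```python
import math
import random

def exact_tail(k, n, pi):
    """Exact P(X >= k) for X ~ Binomial(n, pi)."""
    return sum(math.comb(n, j) * pi**j * (1 - pi)**(n - j)
               for j in range(k, n + 1))

def simulated_tail(k, n, pi, reps, rng):
    """Estimate P(X >= k) as the long-run proportion of simulated
    samples of size n that contain at least k successes."""
    hits = 0
    for _ in range(reps):
        successes = sum(rng.random() < pi for _ in range(n))
        if successes >= k:
            hits += 1
    return hits / reps

rng = random.Random(301)   # fixed seed so the run is repeatable
exact = exact_tail(14, 16, 0.5)
estimate = simulated_tail(14, 16, 0.5, reps=10_000, rng=rng)

print(f"exact probability:     {exact:.5f}")
print(f"simulated (10,000 reps): {estimate:.5f}")
```

The simulated value changes from run to run (with a different seed) and gets closer to the exact value, on average, as the number of repetitions grows; the exact value does not change.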

From Section 1.1 you should be able to:

· Define the observational units and variable of interest in a study

· Classify the variable as quantitative or categorical

· Produce a bar graph to summarize a categorical variable (by hand or with technology using summarized data)

· Calculate a statistic to summarize a binary variable (e.g., sample count, X, or sample proportion, p̂)

· Define a corresponding parameter of interest in the study in words (e.g., process probability, π)

· Describe how to carry out a tactile simulation to represent a “random choice” process (e.g., with a coin or a die or a spinner) and to estimate a p-value

· Describe and interpret the results of a simulation

· Use the One Proportion Inference applet to carry out a simulation to represent a binomial process and to estimate a p-value

· Set up a binomial probability calculation given values for n and π (show numbers plugged into equation, use P(X > k) notation)

· Calculate an exact p-value using the binomial distribution (iscambinomprob or JMP Distribution Calculator or One Proportion Inference applet)

· Provide a “layman’s” interpretation of p-value in your own words in the context of the research question

· Explain what is meant by “statistical significance” and how it is assessed

· Draw a conclusion about the “random chance” hypothesis based on a p-value

· State null and alternative hypothesis in symbols and in words (including choosing less than, greater than, or not equal to for the alternative)

· Carry out a binomial test of significance

1. Define parameter

2. State hypotheses (one or two-sided)

3. Calculate p-value (one or two-sided) using binomial distribution (iscambinomtest or JMP Analyze > Distribution > Test Probabilities, or One Proportion Inference applet)

4. Make a decision to reject or fail to reject the null hypothesis based on the magnitude of the p-value

5. Make a final conclusion in context about the research question

· Interpret a confidence interval as a range of plausible values for the parameter (those not rejected by a two-sided test)

· Use technology to obtain a binomial confidence interval (iscambinomtest or JMP Confidence Interval for One Proportion)

· Define Type I and Type II errors for a particular context

· Know that the level of significance (α) controls the probability of a Type I Error

· Be able to also describe the consequences of each type of error in context

· Use technology (iscambinompower or JMP(View > JMP Starter) DOE > Sample Size and Power or Power Simulation applet) to calculate power using the binomial distribution for a given alternative value

· Remember, it’s a two-step process

· Visual

· Identify the factors that affect power and how

· Understand idea of using technology to determine the sample size necessary to achieve a stated power for a particular value of the alternative
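The exact binomial p-value and the two-step power calculation from this section can be sketched in Python as follows. This is only an illustration under stated assumptions: a one-sided (greater than) alternative, and hypothetical values (n = 20, 15 observed successes, null value 0.5, alternative value 0.7, significance level 0.05). The function names are made up for this sketch and are not the iscam or JMP commands.

```python
import math

def binom_pmf(k, n, pi):
    """P(X = k) for X ~ Binomial(n, pi)."""
    return math.comb(n, k) * pi**k * (1 - pi)**(n - k)

def upper_tail(k, n, pi):
    """P(X >= k) for X ~ Binomial(n, pi)."""
    return sum(binom_pmf(j, n, pi) for j in range(k, n + 1))

def rejection_cutoff(n, pi_0, alpha):
    """Step 1: smallest count k whose upper-tail probability,
    computed under the null value pi_0, is at most alpha."""
    for k in range(n + 2):
        if upper_tail(k, n, pi_0) <= alpha:
            return k

def binomial_power(n, pi_0, pi_alt, alpha):
    """Step 2: probability of landing in the rejection region
    when the process probability is actually pi_alt."""
    return upper_tail(rejection_cutoff(n, pi_0, alpha), n, pi_alt)

# Hypothetical study: H0: pi = 0.5 vs. Ha: pi > 0.5
n, observed, pi_0, alpha = 20, 15, 0.5, 0.05

p_value = upper_tail(observed, n, pi_0)
cutoff = rejection_cutoff(n, pi_0, alpha)
power = binomial_power(n, pi_0, 0.7, alpha)

print(f"exact p-value P(X >= {observed}) = {p_value:.4f}")
print(f"reject H0 when X >= {cutoff}")
print(f"power against pi = 0.7: {power:.4f}")
```

Changing `n`, `alpha`, or the alternative value in the last call is a quick way to see which factors raise or lower power.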

From Section 1.2 you should be able to:

· Explain what is meant by the “sampling distribution of the sample proportion”

· Determine whether or not the normal approximation is reasonable (show details) for the sampling distribution of the sample proportion (be able to label and sketch the predicted distribution)

· Determine the mean and standard deviation for the sampling distribution of the sample proportion

o Apply the CLT to predict the shape of a sampling distribution, including drawing a well-labeled and partially scaled (3-5 values on the horizontal axis) sketch of the distribution and shading the area of interest

o Consider probabilities as areas under a continuous mathematical probability curve

· Calculate and interpret the z-score for a sample proportion

· Carry out a one-proportion z-test of significance

1. Define parameter

2. State hypotheses (one or two-sided)

3. Be able to report and interpret the test statistic

4. Check whether the procedure is valid for the sample size used

5. Calculate a p-value (one or two-sided) using the normal approximation (R iscamonepropztest, or JMP (Journal) Hypothesis Test for One Proportion, or Theory-Based Inference applet)

6. Make a decision to reject or fail to reject the null hypothesis based on the magnitude of the p-value

7. Make a final conclusion in context about the research question

· ~~Apply and explain the logic behind a continuity correction for the p-value~~

· Calculate power using the normal distribution for a given alternative value (p. 88)

· Solve for the sample size necessary to achieve a certain level of power

· Use technology to calculate a one-sample z-interval (R iscamonepropztest, or JMP (Journal) Confidence Interval for One Proportion, or Theory-Based Inference applet)

· Be able to change the confidence level

· Be able to explain the components of the confidence interval formula (e.g., midpoint, width)

· Determine and interpret margin-of-error as the measure of expected random sampling error

· Identify the factors that affect the midpoint and width

· Be able to solve for the sample size necessary to achieve a desired margin of error (p. 78)

· Be able to interpret confidence level in terms of the reliability of the method

· Apply and explain the Adjusted Wald procedure for 95% confidence

· Decide when to use Wald vs. Adjusted Wald vs. Binomial and when they will be similar

· Describe and utilize the duality between two-sided tests and confidence intervals
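The z-test, Wald interval, Adjusted Wald interval, and sample-size calculation from this section can be sketched with Python's standard library. Treat this as an illustration rather than a replacement for the course tools: the function names are invented for this sketch, the data (60 successes in 100 observations, null value 0.5) are hypothetical, and the sample-size function assumes the usual conservative guess of 0.5 for the proportion. The sample-size validity checks are left for you to do separately.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1

def one_prop_z_test(x, n, pi_0):
    """Return (z, two-sided p-value); the SD uses the null value pi_0."""
    p_hat = x / n
    z = (p_hat - pi_0) / math.sqrt(pi_0 * (1 - pi_0) / n)
    return z, 2 * (1 - STD_NORMAL.cdf(abs(z)))

def wald_interval(x, n, conf=0.95):
    """p-hat +/- z* sqrt(p-hat(1 - p-hat)/n)."""
    z_star = STD_NORMAL.inv_cdf(1 - (1 - conf) / 2)
    p_hat = x / n
    moe = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe, p_hat + moe

def adjusted_wald_95(x, n):
    """Wald formula after adding 2 successes and 4 observations."""
    return wald_interval(x + 2, n + 4, conf=0.95)

def sample_size_for_moe(moe, conf=0.95, pi_guess=0.5):
    """Smallest n with z* sqrt(pi_guess(1 - pi_guess)/n) <= moe."""
    z_star = STD_NORMAL.inv_cdf(1 - (1 - conf) / 2)
    return math.ceil(z_star**2 * pi_guess * (1 - pi_guess) / moe**2)

# Hypothetical data: 60 successes in 100 observations, H0: pi = 0.5
z, p_two_sided = one_prop_z_test(60, 100, 0.5)
wald = wald_interval(60, 100)
adjusted = adjusted_wald_95(60, 100)
n_needed = sample_size_for_moe(0.03)

print(f"z = {z:.3f}, two-sided p-value = {p_two_sided:.4f}")
print(f"95% Wald interval:          ({wald[0]:.4f}, {wald[1]:.4f})")
print(f"95% Adjusted Wald interval: ({adjusted[0]:.4f}, {adjusted[1]:.4f})")
print(f"n for a 0.03 margin of error: {n_needed}")
```

Note that the test uses the hypothesized value in the standard deviation while the interval uses the sample proportion, which is one reason the two procedures can occasionally disagree near the boundary.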

From Section 1.3 you should be able to:

· Define the population, sample, sampling frame, statistic, and parameter for a particular study context

· Use appropriate symbols to refer to parameters and statistics (mean, standard deviation, proportion)

· Decide whether a sampling method is unbiased by

· Examining the sampling distribution of the statistic, and determining whether it is (approximately) centered at the parameter value

· Considering whether the sampling frame is complete and the selection method is random, based on a description of the sampling process.

· Be able to conjecture with justification a direction for sampling or nonsampling bias (likely to systematically produce over or underestimates of the parameter value)

· Know the difference between “bias” and an unlucky sample

· Produce a simple random sample from a sampling frame, e.g., with GRN applet, Random.org

· Describe the concept of (random) sampling variability to a nonstatistician

· ~~Identify the following sampling methods from a description: systematic sampling, multistage sampling, stratified sampling~~

· ~~Explain how they differ from a simple random sample~~

· Suggest sampling and nonsampling errors present in a study context (see Investigation 1.15; Example 1.3)

· Describe the difference between statistical significance and practical significance (Investigation 1.17)

· Realize that when we are sampling from a finite population, the binomial distribution is an approximation

· This approximation is more valid the larger the population size compared to the sample size

· When this approximation is valid, we apply all the same techniques (e.g., simulation, binomial, normal) as earlier in the chapter.

· When this approximation is valid, neither the population size nor the percentage of the population sampled influence our statements of significance or confidence
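To see why the population size stops mattering once the population is large relative to the sample, the sketch below compares probabilities for sampling without replacement (hypergeometric) against the binomial approximation. The hypergeometric calculation appears here only to illustrate the approximation; the population sizes, sample size of 20, and 50% success rate are hypothetical values, and the function names are invented for this sketch.

```python
import math

def hypergeom_pmf(k, N, M, n):
    """P(X = k) successes in a sample of n drawn without replacement
    from a population of N units, M of which are successes."""
    return math.comb(M, k) * math.comb(N - M, n - k) / math.comb(N, n)

def binom_pmf(k, n, pi):
    """P(X = k) for X ~ Binomial(n, pi)."""
    return math.comb(n, k) * pi**k * (1 - pi)**(n - k)

def max_gap(N, n, pi):
    """Largest difference between the two probability models
    over all possible sample counts 0..n."""
    M = round(N * pi)  # number of successes in the population
    return max(abs(hypergeom_pmf(k, N, M, n) - binom_pmf(k, n, M / N))
               for k in range(n + 1))

for N in (100, 1_000, 100_000):
    print(f"population size {N:>7}: largest gap = {max_gap(N, 20, 0.5):.6f}")
```

The gap shrinks as the population grows while the sample size stays at 20, which is the sense in which the binomial calculation is an approximation that improves with population size.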

Technology Summary

· To calculate/estimate a probability from a binomial distribution knowing n and π

o One Proportion Inference applet

o JMP: Distribution Calculator (Journal)

o R: iscambinomprob

· To calculate a probability from a normal distribution knowing mean and std dev

o Normal Probability Calculator Applet

§ Easy to label horizontal axis

o JMP: Distribution Calculator (Journal)

o R: iscamnormprob

All three methods allow you to find the probability above, below, between, or outside values

· FYI: To calculate a percentile from a normal distribution knowing mean and std dev (you know the probability and want to find the corresponding observation, z-score)

o Normal Probability Calculator Applet

§ Enter value in probability box and press enter or click mouse elsewhere

o JMP: Distribution Calculator (Input probability and calculate quantiles)

o R: iscaminvnorm

o You can do something like this with the binomial distribution as well

· FYI: To find critical values (z*) from a standard normal distribution (mean = 0, SD = 1)

o Normal Probability Calculator applet, specifying the tail probabilities (1-C)/2 and pressing Enter

o JMP: Distribution Calculator (Input probability and calculate quantiles)

o R: iscaminvnorm

· To calculate the exact binomial p-value

o One Proportion Inference applet

§ Check the Exact Binomial box

o JMP: Analyze > Distribution (one-sided alternative hypothesis)

§ Can also use Distribution Calculator

o R: iscambinomprob

· To approximate a binomial p-value

o Simulation: One Proportion Inference applet, especially when CLT does not apply

§ Make sure to run enough repetitions for simulation-based p-value

§ Can also calculate exact p-value, exact binomial, or normal approximation

o CLT: Theory-Based Inference Applet (one proportion)

§ Includes graph (can paste in raw data) and Ho/Ha statements

§ Uses normal approximation

§ Allows continuity correction

o JMP: (Journal) Hypothesis Test for One Proportion (z-test)

§ Includes Ho/Ha, p-value format

o R: iscamonepropztest

· To calculate an exact binomial confidence interval

o JMP: (Journal) Confidence Interval for One Proportion

o R: iscambinomtest

· To calculate a one-sample z-confidence interval

o Theory-Based Inference applet (one proportion)

o JMP: (Journal) Confidence Interval for One Proportion

§ If you use Analyze > Distribution you get the “score interval” (p. 85)

o R: iscamonepropztest

With 95% confidence, you can use the Adjusted Wald by specifying 2 more successes and 4 more observations.

· To calculate power

o Power Simulation applet (simulation or exact or normal approximation)

o JMP: DOE > Sample Size and Power (binomial = Exact Clopper-Pearson)

o R: iscambinompower, iscamnormpower
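If you want to sanity-check normal probabilities, percentiles, or critical values outside the listed tools, Python's standard library can do the same calculations. This is a minimal sketch, not a description of how the applets, JMP, or the iscam functions work; the helper names and the example values are hypothetical.

```python
from statistics import NormalDist

def normal_prob(mean, sd, lower=None, upper=None):
    """Area under the normal curve between lower and upper.
    Use None for an unbounded side (below or above only)."""
    dist = NormalDist(mean, sd)
    area_below_lower = 0.0 if lower is None else dist.cdf(lower)
    area_below_upper = 1.0 if upper is None else dist.cdf(upper)
    return area_below_upper - area_below_lower

def percentile(mean, sd, prob_below):
    """Value with the stated probability below it."""
    return NormalDist(mean, sd).inv_cdf(prob_below)

def critical_value(conf):
    """z* with (1 - conf)/2 in each tail of the standard normal."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)

between = normal_prob(0, 1, lower=-1.96, upper=1.96)
outside = 1 - between
above = normal_prob(0.5, 0.05, lower=0.6)  # hypothetical mean 0.5, SD 0.05

print(f"between -1.96 and 1.96: {between:.4f}")
print(f"outside -1.96 and 1.96: {outside:.4f}")
print(f"above 0.6 when mean = 0.5, SD = 0.05: {above:.4f}")
print(f"z* for 90%, 95%, 99%: "
      f"{critical_value(0.90):.3f}, {critical_value(0.95):.3f}, {critical_value(0.99):.3f}")
```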

Applets you don’t need to use on Exam 1

· Descriptive Statistics

· Random Babies (Just remember how to interpret “probability”)

· Reese’s Pieces or Colored Candies (are just special cases of One Proportion Inference applet)

· Simulation Confidence Intervals (Just remember how to interpret “confidence”)

· Sampling Words (Just remember role of population size in our calculations)

Which distribution do I use to find a p-value or a confidence interval?

· You have several options for categorical data (assuming you are sampling a binary variable from a process or a large population)

o Simulation, although you don’t have a confidence interval or power formula

o The binomial distribution

o The normal distribution if the conditions for the CLT are met

Miscellaneous

• Be able to define a probability as a long-run proportion (whether it’s a probability from a model, from a normal distribution, from a p-value)

• Clearly differentiate parameters from statistics (e.g., long-run proportion or proportion of all adults)

• Don’t mix counts, proportions, percentages

• Be able to state hypotheses in symbols and/or words

o Use symbols correctly (e.g., know when you are using p̂ and when π or π₀)

• Clearly explain how you are finding your output (e.g., which command used)

• Choice of success is often arbitrary, just make sure you are consistent

• Thinking about your sample size can often help you define the observational units

• Be able to define the observational units and variable in our “null” distributions (aka sampling distributions) vs. the sample distribution

• A calculation will seldom be the end of the question – always be on the lookout for “and interpret”

• We can now give better answers to some of the early “generalizability” questions

• Always put your comments in context

• Be able to sketch and label the predicted null distribution

• Know the difference between “predicted” and “theoretical” values (e.g., for mean and SD, p-value)

• You won’t be asked to take derivatives but should be able to use the lessons learned

o SD(p̂) is maximized at π = 0.5

o Effects of sample size on SD(p̂) are larger than effects of π, but exhibit diminishing returns

o 1/√n is a pretty good approximation of the margin-of-error for 95% confidence for π.

• It’s possible I will say find p-value or interval and if normal approximation is not valid you should not use it

o Remember the sample size checks differ slightly between a test and an interval

o For proportions: Binomial and Adjusted Wald can be used with any sample size

• Be able to explain what is meant by “95% confidence” in your own words, in context, without using the words confidence, probability, sure, or chance

• Be able to interpret a p-value in your own words, not only evaluate

• Know the factors that affect test statistic, p-value, confidence intervals, and power/types of error probabilities

• ~~Be able to perform a continuity correction (for tail probabilities, “outside” and “between”; counts and/or proportions)~~

• Keep in mind we never get evidence for the null, only lack of evidence against it

o Absence of evidence is not evidence of absence

• When making a choice between two options, you should argue both for one and against the other (sometimes you tell me one has one property/advantage but don’t really tell me why the other does not)

o Make sure your explanations/justifications aren’t too “circular” (e.g., I have a larger confidence level because I am more certain the parameter is contained in the interval)

• Be able to evaluate the appropriateness of a model, understand the assumptions underlying a model

o E.g., how to check the four conditions of a binomial model (e.g., is it ok to assume the infants’ choices are independent of each other)

o E.g., how to also check the sample size conditions for a normal approximation to the binomial

• You won’t do a lot of hand calculations but may be asked to set up an equation (e.g., show the values substituted in) or explain a property using the equation (e.g., because n is in the denominator)

• We don’t always want to assume 0.5 in Ho/Ha. The choices of hypothesized value and alternative direction are based entirely on the research question, not anything about the observed sample data.
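The "lessons learned" items in the list above (where the standard deviation of the sample proportion is largest, diminishing returns from sample size, and the 1/√n rule of thumb) can be checked numerically without any calculus. This is a rough sketch with hypothetical sample sizes and a coarse grid of probability values, not a derivation.

```python
import math

def sd_p_hat(pi, n):
    """Standard deviation of the sample proportion: sqrt(pi(1 - pi)/n)."""
    return math.sqrt(pi * (1 - pi) / n)

# Which probability on a grid 0.01, 0.02, ..., 0.99 gives the largest SD?
grid = [i / 100 for i in range(1, 100)]
pi_with_largest_sd = max(grid, key=lambda pi: sd_p_hat(pi, 100))
print(f"SD of the sample proportion is largest at pi = {pi_with_largest_sd}")

# Diminishing returns: quadrupling n only halves the SD.
for n in (100, 400, 1600):
    print(f"n = {n:>4}: SD at pi = 0.5 is {sd_p_hat(0.5, n):.4f}")

# 95% margin of error versus the 1/sqrt(n) shortcut.
for n in (100, 400, 1600):
    exact_moe = 1.96 * sd_p_hat(0.5, n)
    print(f"n = {n:>4}: 1.96*SD = {exact_moe:.4f}, 1/sqrt(n) = {1 / math.sqrt(n):.4f}")
```

Because the shortcut uses 2 in place of 1.96 and 0.5 for the probability, it slightly overstates the margin of error, and more so when the proportion is far from 0.5.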

Advice:

• Part of your grade will be based on communication. Be precise in your statements and use of terminology. Avoid unclear statements, and especially don’t use the word “it”! Always relate your comments to the study context.

o I would also avoid “data,” “results,” “accurate”

o Also say the distribution of what and the standard deviation of what

• Show the details of any of your calculations (including sample size checks)

• Organize notes for efficient retrieval of information/formulas

• Don’t plan to use notes too much

o Prepare as if exam were closed book/notes

o Focus on understanding, not memorization

o Be cognizant of time constraint

• Expect similar questions to what we answer in class every day, on HW

o Also be ready for “what if” questions (small changes that require you to conjecture and explain more than perform additional calculations)

• Be sure to explain any assumptions you are making along the way

• Be prepared to think/explain/interpret

o Not just plug into formulas

o Be ready to explain process of how you would do calculations

§ E.g., p-value = Pr(X ≤ k), where X ~ Binomial(n, π)

o Be able to both make conclusions from a p-value (evaluate) and provide a detailed interpretation of what the p-value measures in context (interpret).

o Be succinct in your answers (using acceptable statistical terms helps with this, but don’t use them incorrectly)

• Be ready to interpret computer output

o You may ask clarifying technology questions during the exam

• Read carefully

• Be sure to answer the question asked

• Take advantage of information provided

• Relate conclusions to context

• Prepare as thoroughly as you would for a closed-book exam

o Re-work in-class investigations

o Re-work HW questions

o Work through examples

o Re-read wrap-up sections

o Come to Tuesday’s class prepared with questions

o Bring questions to office hours, Canvas discussion boards