Stat 414 – Review I
Due (preferably before Monday, 9am): In Canvas,
(1) post at least one question you have on the material described below.
(2) post at least one example question that you think I could ask on the material described below.
Optional Review Session: Monday, 7-8pm, Zoom Office Hour link
Optional Review Problems: review1probs.html
Format of Exam: The exam will cover Lectures 1–8, Quizzes 1–8, Homeworks 1–2.
The exam will mostly be interpretation questions, but I could ask you questions about R commands/R output. The exam will be worth approximately 50 points. Be ready to explain your reasoning, including in "layman's" language. You can use one double-sided 8.5x11 page of notes (hard copy). All reference material should come from this course only.
Advice: Prepare as if it were a closed-book exam. Review Quizzes/commentary, HW solutions/commentary, Lecture notes. The first exam is really all about components of regression models: model assumptions, residual plots (standardized residuals), model equations (E(Y) vs. ŷ, slope vs. intercept), Residual error (MSE, SE residuals), Indicator vs. Effect coding, Centering, Interactions, Adjusted vs. unadjusted associations (individual vs. group).
Italics indicate related material that may not have been explicitly stated.
What you should recall from previous statistics courses
· Mean = average value, SD = typical deviation in data from average value
· The basic regression model, E(Yi) = β0 + β1xi
o εi represents the part of the response (dependent) variable that cannot be explained by the linear function.
o Matrix form: Y = Xβ + ε
§ σ² = Var(εi) measures the unexplained variability of the responses about the regression model
o Least squares (OLS) estimation minimizes SSError = Σ(yi – ŷi)²
§ MSE = SSError/(n – p – 1) (p = number of slopes)
§ s = √MSE estimates σ (the residual standard error)
o Interpreting regression coefficients in context
§ β0 = expected (vs. b0 = predicted) mean response when all x = 0
· But often extrapolating
§ β1 = expected (vs. b1 = predicted) change in mean response for x vs. x + 1
o The difference between the model equation (e.g., E(Y) = β0 + β1x) and the prediction equation (e.g., ŷ = b0 + b1x)
o Checking model assumptions
§ Residuals vs. Fits should show no pattern to satisfy Linearity (E(Y) vs. x)
§ Errors should be Independent
§ Residuals should look roughly Normal (Y|x ~ Normal)
§ Residuals vs. Fits should show no fanning to satisfy Equal variance
· Var(Y|x) same at each value of x
o aka heteroscedasticity vs. homoscedasticity
§ Violation of model assumptions usually leads to bad estimates of the standard errors
· Usually underestimating (overstating significance)
o Also assuming x values are fixed and measured without random error
o Also assuming you have the "right" variables in the model
o Unusual observations can be very interesting to explore
§ High leverage: observation is far from center of x-space
§ Influential: removal of observation substantially changes regression model (e.g., Cook's distance explores change in fitted values)
· Combination of high leverage and/or large residual
· Use individual t-statistic or F-statistic to judge significance of association
o t-statistic is adjusted for all other variables in the model (two-sided p-values assuming the variable was the last one entered into the model)
· Generally t = (estimate – 0)/SE(estimate)
§ SE(estimate) represents the random variation in the statistic (e.g., different random samples from same population)
§ For simple linear regression, SE(b1) simplifies to s/(sx√(n – 1))
o overall F-statistic tests whether all slopes are zero (compares full model to model with only intercept) vs. at least one slope differs from zero
o partial F-test tests whether some slopes are zero (compares full model to reduced model)
§ especially helpful for categorical variables or after adjusting for a specific subset of variables
· Analysis of Variance (ANOVA)
o Traditionally thought of as a method for comparing population means but more generally is an accounting system for partitioning the variability of the response into model and error
o To compare groups, assume equal variances
§ sp = residual standard error = root mean square error
o SSTotal = Σ(yi – ȳ)² = SSModel + SSError, with df = n – 1
o F = MSGroups / MSError (Between group variation / Within group variation)
§ Values larger than 4 are generally significant
§ Equivalent to a pooled two-sample t-test when comparing two groups (F = t²)
· Values larger than 2 are generally significant
· R² = percentage of variation in response explained by the model
o R² = 1 – SSError/SSTotal; R²adj = 1 – MSError/MSTotal
o Explain similarities and differences in how R² and R²adj are calculated (whether or not they adjust for df)
· Changes that you expect to see when you make modifications to the model, e.g., adding a variable
o Explain that adding terms to the multiple regression model "costs a df" for each estimated coefficient and reduces df error
o Advantages/Disadvantages of treating a variable as quantitative or categorical
· Using MSError and R² to evaluate model performance
o Slight differences in interpretation (e.g., typical prediction error vs. comparison to original amount of variation in response)
o Comparing models
· Explain that least squares estimates are unbiased for regression coefficients (and how we decide)
· Explain that SE(b1) measures sample to sample variation in estimated regression coefficients and depends on σ, n, and SD(X).
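A minimal R sketch pulling these pieces together (the data frame dat and variables y and x are hypothetical names):

    model <- lm(y ~ x, data = dat)          # least squares (OLS) fit
    summary(model)                          # b0, b1, SE(b), t-statistics, s = residual SE, R^2, overall F
    anova(model)                            # partition of variability: SSModel, SSError, MSError
    plot(fitted(model), rstandard(model))   # standardized residuals vs. fits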
From Day 0, you should be able to
· Describe the distribution of a quantitative variable
· Describe the association between two quantitative variables
· Identify the cases in a dataset
· Identify multilevel (clustered) data
From Day 1, you should be able to
· Define meaningful variables for a given context (e.g., time vs. speed vs. adjusting for track length)
· Explain the meaning of “least squares” as a possible estimation method
· Use output to write a fitted regression model using good statistical notation
· Interpret the estimated slope and intercept in context
· Identify and evaluate validity conditions (LINE) for inference with least squares regression (in context)
o Can also use residual plots to check for patterns/relationships with other variables not currently in model – suggesting whether a previously unused variable should be added to the model
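As a sketch of that last idea (z is a hypothetical variable not currently in the model):

    model <- lm(y ~ x, data = dat)
    plot(dat$z, resid(model))   # a visible pattern suggests z should be added to the model
    abline(h = 0, lty = 2)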
From Day 2, you should be able to
· Utilize residual plots to evaluate validity conditions
o Identify which plot to use and what you learn from the plot
o The need for an interaction could show up as "curvature" in residual plots
o Also review study design
· Suggest possible remedies for violations of basic model assumptions
o Transformations of y and/or x can improve linearity
o Transformations of y can improve normality and equal variance
o Including polynomial terms can model curvature
· Use residual plots to analyze whether a new model is more valid for the data in hand
o Look at graphs of model vs. data to assess appropriateness of model
· Define/interpret residuals and investigate possible causes of large residuals
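A brief R sketch of these remedies (hypothetical names; the log assumes y > 0):

    model.log  <- lm(log(y) ~ x, data = dat)          # transforming y can improve normality/equal variance
    model.quad <- lm(y ~ x + I(x^2), data = dat)      # polynomial term to model curvature
    plot(fitted(model.quad), rstandard(model.quad))   # recheck residual plots for the new model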
From Day 3, you should be able to
· Interpret residual standard error
· Explain the relationship between residual standard error and MSError
· Use/Contrast t-tests and F-tests for assessing significance of model coefficients
· Define multicollinearity and its consequences
o Use VIF values to measure amount of multicollinearity in a model
· Understand impacts of centering a variable on interpretation and significance of the model (see also Day 5)
o Consider centering to reduce multicollinearity with "product terms" like x and x²
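For example, using vif() from the car package (x and w are hypothetical predictors):

    library(car)
    model <- lm(y ~ x + I(x^2) + w, data = dat)
    vif(model)                      # values above 5-10 are concerning
    dat$xc <- dat$x - mean(dat$x)   # centering reduces multicollinearity between x and x^2
    vif(lm(y ~ xc + I(xc^2) + w, data = dat))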
From Day 4, you should be able to
· Add a categorical variable to a regression model using k – 1 terms for a variable with k categories
o where αk is the kth "group effect"
o Indicator variables change intercepts
o Interpret coefficients as changes in intercepts (e.g., parallel lines, at any value of x)
· Use either effect coding or indicator coding
o Determine the value of the coefficient not displayed in the output
o Interpret signs of coefficients (comparing to overall mean or to reference group)
o Interpretation of intercept (reference group vs. overall mean)
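In R, the two codings correspond to different contrasts for a factor (group is a hypothetical factor with k levels; R's default is indicator coding):

    fit.ind <- lm(y ~ x + group, data = dat)   # indicator coding (contr.treatment):
                                               # coefficients = changes in intercept vs. reference group
    fit.eff <- lm(y ~ x + group, data = dat,
                  contrasts = list(group = contr.sum))   # effect coding: comparisons to overall mean
    # the undisplayed kth effect = -(sum of the k - 1 displayed effects)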
From Day 5, you should be able to
· Explain the concept of an “index”
· Distinguish different types of “feature scaling” (e.g., centering vs. standardizing vs. min-max scaling)
o What are the benefits to using them?
o How do they change interpretation?
· Assess multicollinearity = explanatory variables have a strong linear association
o VIF = variance inflation factor, computed from the R² from regressing one explanatory variable on all the others (VIF = 1/(1 – R²))
o Variables with high VIF will have inflated standard errors/misleading information about significance of the variable in the model
o Usually VIF values larger than 5 or 10 are concerning
o Remedies include reexpressing variables, removing variables, centering
§ Centering (x – x̄) and/or Standardizing ((x – x̄)/sx) helps when variables are "products" (e.g., quadratic, interactions)
§ Centering is also useful in providing a more meaningful intercept
§ Standardizing can also make regression coefficients more comparable by putting the variables on the same SD scale
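A quick sketch of the three scalings (hypothetical variable x in data frame dat):

    dat$x.cen <- dat$x - mean(dat$x)                               # centering
    dat$x.std <- (dat$x - mean(dat$x)) / sd(dat$x)                 # standardizing (same as scale(dat$x))
    dat$x.mm  <- (dat$x - min(dat$x)) / (max(dat$x) - min(dat$x))  # min-max scaling to [0, 1]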
From Day 6, you should be able to
· Use a "partial F-test" to test significance of a categorical variable (see also Day 4) or a subset of variables
o Null is all k – 1 coefficients are zero; Alt. is at least one coefficient differs from 0
o Compares increase in SSError for the reduced model to MSE for the full model to see whether the reduced model is significantly worse (df num = k – 1, df denom = MSE df for full model)
o Special cases of partial F tests:
§ One coefficient (equivalent to t-test as last variable entered into model)
§ All coefficients ("model utility test")
o Use "anova(model1, model2)" to carry out partial F-test in R.
· Distinguish the meaning of sequential vs. adjusted sums of squares
· Carry out tests of significance involving individual or groups of slope coefficients
o State appropriate hypotheses for removing the variable(s)
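For example (hypothetical models; group adds k – 1 terms to the reduced model):

    reduced <- lm(y ~ x, data = dat)
    full    <- lm(y ~ x + group, data = dat)
    anova(reduced, full)   # partial F-test of H0: all k - 1 group coefficients are zero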
From Day 7, you should be able to
· Correctly interpret adjusted vs. unadjusted associations
o From graph
o From multiple regression model
· Interpret intercept and slope coefficients for multiple regression model in context
o Multiple regression: After adjusting for other variables in the model (e.g., comparing individuals in the same sub-population)
§ Effect of x1 on y can differ when x2 is in the model
§ Adjusted and Unadjusted relationships can look very different
· In multiple regression, interpret individual coefficients after adjusting for other variables in the model (e.g., comparing individuals in the same sub-population)
o Identify the other variables you are talking about
o Quote: We interpret the regression slopes as comparisons of individuals that differ in one predictor while being at the same levels of the other predictors. In some settings, one can also imagine manipulating the predictors to change some or hold others constant—but such an interpretation is not necessary. This becomes clearer when we consider situations in which it is logically impossible to change the value of one predictor while keeping the value of another constant (e.g., x and x², or x1, x2, and x1·x2).
· Use graphs, context, and model equation to interpret an interaction in context
o Be able to explain the nature of the interaction
§ Change in slopes = change in effect of x1 on y depending on value of x2
§ NOT the same as x1 and x2 being related to each other
o Be able to interpret signs of coefficients
o Be able to write out separate equations
o Be able to talk about why we don't just fit separate equations
o Be careful when interpreting "main effects" if you have an interaction
§ Can describe slope of x1 on y when x2 is at zero (or mean if centered)
o Why it’s useful to center variables involved in an interaction
From Day 8, you should be able to
· Distinguish between a confidence interval for E(Y) and a prediction interval for y
o Population regression line vs. variability about the line
o Identify a 95% prediction interval for an individual observation as roughly ŷ ± 2s
o What does/does not impact the widths of these intervals
· Explain that transformations may not be able to "solve" more complicated unequal variance patterns
o Consider whether Var(Y) is changing with an explanatory variable (e.g., increasing variability with larger values of x)
o Explain the principle of obtaining an estimate of σ when you have unequal variance
§ e.g., Var(Yi) varies with xi rather than being constant
§ This can impact judgement of significance of terms in model, predictions, etc.
· Explain the principle of weighted least squares to address heterogeneity in residuals
o Var(Yi) = σ²/wi (typo in Day 8 handout?)
o Giving less weight in the estimation of the "best fit" line to observations that are measured with more uncertainty
o Use standardized residuals in residual plots to evaluate the appropriateness of the model
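A sketch of both ideas in R (hypothetical fitted model and new observation; the 1/x weights assume Var(Y) grows proportionally with x):

    new <- data.frame(x = 10)
    predict(model, new, interval = "confidence")   # narrower interval, for E(Y)
    predict(model, new, interval = "prediction")   # wider interval, for an individual y (roughly yhat +/- 2s)

    wls <- lm(y ~ x, data = dat, weights = 1 / x)  # downweight the more uncertain observations
    plot(fitted(wls), rstandard(wls))              # recheck equal variance with standardized residuals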
From HW 1
· Translate the model assumptions into the context of a study
· t-test vs. ANOVA vs. regression
· Write out a regression equation using good statistical notation
· Interpret regression coefficients in context
· Evaluate residual plots
· Components of an ANOVA table
· Recognize clustered data
From HW 2
· Explain the difference between “within group” and “between group” associations
o Aggregating vs Disaggregating data
o Between group regression focuses on the association between ȳj and x̄j, the group means
o The within group regression focuses on the association between yij and xij within each group, assuming it's the same within each group
o The between group regression slope may look very different (even opposite in direction) from the within group regression slope
o Regular regression can be written as a combination of the within and between group regressions, e.g., E(yij) = β0 + β1xij + β2x̄j
§ β1 matches the within group regression coefficient
§ β2 is the between group coefficient minus the within group coefficient (the difference between them)
o Including an interaction between xij and group would allow the within group regression lines to vary by group
· Use residual plots to compare the effectiveness of different transformations, weighted least squares
· Interpreting interactions
· Predict how a weighted regression model will compare to an unweighted regression model
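A sketch of separating the two associations in R (hypothetical clustered data with y, x, and group):

    dat$xbar <- ave(dat$x, dat$group)        # group mean of x, repeated for each observation
    lm(y ~ x + xbar, data = dat)             # x coefficient = within slope; xbar coefficient = between minus within
    lm(y ~ I(x - xbar) + xbar, data = dat)   # I(x - xbar) coefficient = within; xbar coefficient = between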
And, you should be able to
· Determine and interpret a t-test statistic
· Graphically compare data to model predictions
Keeping track of variances
· sy = variation in response variable
· sx = standard deviation of explanatory (aka predictor) variable
· σ = variation about regression line/unexplained variation in regression model, variation in response at a particular x value
o aka s aka se = residual standard error = square root of the mean squared residual ("root MSE")
· SE(b1) = sample to sample variation in regression slope
· SE(fit) = sample to sample variation in estimated predicted value. There are actually two "se fit" values, one for a confidence interval to predict E(Y) and one for a prediction interval to predict an individual y. The latter can be approximated with s (root MSE), but actually depends on SE(fit), n, distance from x̄, etc.
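In R, both "se fit" flavors can be recovered from predict() (hypothetical fitted model and new x value):

    out <- predict(model, data.frame(x = 10), se.fit = TRUE)
    out$se.fit                                  # SE for estimating E(Y) (confidence interval)
    sqrt(out$se.fit^2 + out$residual.scale^2)   # SE for predicting an individual y (prediction interval)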