Stat 414 – Review I
Due (preferably before Monday, 9am): In Canvas,
(1) post at least one question you have on the material described below.
(2) post at least one example question that you think I could ask on the material described below.
Optional Review Session: Monday, 7-8pm, Zoom Office Hour link
Optional Review Problems: review1probs.html
Format of Exam: The exam will cover Lectures 1–8, Quizzes 1–8, Homeworks 1–2.
The exam will mostly be interpretation questions, but I could ask you questions about R commands/R output. The exam will be worth approximately 50 points. Be ready to explain your reasoning, including in "layman's" language. You can use one double-sided 8.5x11 page of notes (hard copy). All reference material should come from this course only.
Advice: Prepare as if it were a closed-book exam. Review Quizzes/commentary, HW solutions/commentary, Lecture notes. The first exam is really all about components of regression models: model assumptions, residual plots (standardized residuals), model equations (E(Y) vs. ŷ, slope vs. intercept), Residual error (MSE, SE residuals), Indicator vs. Effect coding, Centering, Interactions, Adjusted vs. unadjusted associations (individual vs. group).
Italics indicate related material that may not have been explicitly stated.
What you should recall from previous statistics courses
· Mean = average value, SD = typical deviation in data from average value
· The basic regression model, E(Yi) = β0 + β1xi
o εi represents the part of the response (dependent) variable that cannot be explained by the linear function.
o Matrix form: Y = Xβ + ε
§ σ² = Var(εi) measures the unexplained variability of the responses about the regression model
o Least squares (OLS) estimation minimizes SSError = Σ(yi – ŷi)²
§ MSE = SSError/(n – p – 1) (p = number of slopes)
§ s = √MSE estimates σ (the residual standard error)
o Interpreting regression coefficients in context
§ β0 = expected (vs. b0 = predicted) mean response when all x = 0
· But often extrapolating
§ β1 = expected (vs. b1 = predicted) change in mean response for x vs. x + 1
o The difference between the model equation (e.g., E(Y) = β0 + β1x) and the prediction equation (e.g., ŷ = b0 + b1x)
o Checking model assumptions
§ Residuals vs. Fits should show no pattern to satisfy Linearity (E(Y) vs. x)
§ Errors should be Independent
§ Residuals should look roughly Normal (Y|x ~ Normal)
§ Residuals vs. Fits should show no fanning to satisfy Equal variance
· Var(Y|x) same at each value of x
o aka heteroscedasticity vs. homoscedasticity
§ Violation of model assumptions usually leads to bad estimates of the standard errors
· Usually underestimating (overstating significance)
o Also assuming x values are fixed and measured without random error
o Also assuming you have the "right" variables in the model
o Unusual observations can be very interesting to explore
§ High leverage: observation is far from center of x-space
§ Influential: removal of observation substantially changes regression model (e.g., Cook's distance explores change in fitted values)
· Combination of high leverage and/or large residual
· Use individual t-statistic or F-statistic to judge significance of association
o t-statistic is adjusted for all other variables in the model (two-sided p-values assuming the variable was the last one entered into the model)
· Generally t = (estimate – 0)/SE(estimate)
§ SE(estimate) represents the random variation in the statistic (e.g., different random samples from same population)
§ For simple linear regression, SE(b1) simplifies to s/(sx√(n – 1))
o overall F-statistic tests whether all slopes are zero (compares full model to model with only intercept) vs. at least one slope differs from zero
o partial F-test tests whether some slopes are zero (compares full model to reduced model)
§ especially helpful for categorical variables or after adjusting for a specific subset of variables
· Analysis of Variance (ANOVA)
o Traditionally thought of as a method for comparing population means but more generally is an accounting system for partitioning the variability of the response into model and error
o To compare groups, assume equal variances
§ sp = residual standard error = root mean square error
o SSTotal = Σ(yi – ȳ)² = SSModel + SSError, with df = n – 1
o F = MSGroups / MSError (Between group variation / Within group variation)
§ Values larger than 4 are generally significant
§ Equivalent to a pooled two-sample t-test when comparing two groups (F = t²)
· Values larger than 2 are generally significant
· R² = percentage of variation in response explained by the model
o R² = 1 – SSError/SSTotal; R²adj = 1 – MSError/MSTotal
o Explain similarities and differences in how R² and R²adj are calculated (whether or not they adjust for df)
· Changes that you expect to see when you make modifications to the model, e.g., adding a variable
o Explain that adding terms to the multiple regression model "costs a df" for each estimated coefficient and reduces df error
o Advantages/Disadvantages of treating a variable as quantitative or categorical
· Using MSError and R² to evaluate model performance
o Slight differences in interpretation (e.g., typical prediction error vs. comparison to original amount of variation in response)
o Comparing models
· Explain that least squares estimates are unbiased for regression coefficients (and how we decide)
· Explain that SE(b1) measures sample to sample variation in estimated regression coefficients and depends on σ, n, and SD(X).
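A minimal R sketch pulling these pieces together (the data frame dat and variables y and x are hypothetical names):

    model <- lm(y ~ x, data = dat)          # least squares (OLS) fit
    summary(model)                          # b0, b1, SE(b), t-statistics, s = residual SE, R^2, overall F
    anova(model)                            # partition of variability: SSModel, SSError, MSError
    plot(fitted(model), rstandard(model))   # standardized residuals vs. fits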
From Day 0, you should be able to
· Describe the distribution of a quantitative variable
· Describe the association between two quantitative variables
· Identify the cases in a dataset
· Identify multilevel (clustered) data
From Day 1, you should be able to
· Define meaningful variables for a given context (e.g., time vs. speed vs. adjusting for track length)
· Explain the meaning of “least squares” as a possible estimation method
· Use output to write a fitted regression model using good statistical notation
· Interpret the estimated slope and intercept in context
· Identify and evaluate validity conditions (LINE) for inference with least squares regression (in context)
o Can also use residual plots to check for patterns/relationships with other variables not currently in model – suggesting whether a previously unused variable should be added to the model
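As a sketch of that last idea (z is a hypothetical variable not currently in the model):

    model <- lm(y ~ x, data = dat)
    plot(dat$z, resid(model))   # a visible pattern suggests z should be added to the model
    abline(h = 0, lty = 2)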
From Day 2, you should be able to
· Utilize residual plots to evaluate validity conditions
o Identify which plot to use and what you learn from the plot
o The need for an interaction could show up as "curvature" in residual plots
o Also review study design
· Suggest possible remedies for violations of basic model assumptions
o Transformations of y and/or x can improve linearity
o Transformations of y can improve normality and equal variance
o Including polynomial terms can model curvature
· Use residual plots to analyze whether a new model is more valid for the data in hand
o Look at graphs of model vs. data to assess appropriateness of model
· Define/interpret residuals and investigate possible causes of large residuals
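A brief R sketch of these remedies (hypothetical names; the log assumes y > 0):

    model.log  <- lm(log(y) ~ x, data = dat)          # transforming y can improve normality/equal variance
    model.quad <- lm(y ~ x + I(x^2), data = dat)      # polynomial term to model curvature
    plot(fitted(model.quad), rstandard(model.quad))   # recheck residual plots for the new model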
From Day 3, you should be able to
· Interpret residual standard error
· Explain the relationship between residual standard error and MSError
· Use/Contrast t-tests and F-tests for assessing significance of model coefficients
· Define multicollinearity and its consequences
o Use VIF values to measure amount of multicollinearity in a model
· Understand impacts of centering a variable on interpretation and significance of the model (see also Day 5)
o Consider centering to reduce multicollinearity with "product terms" like x and x²
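For example, using vif() from the car package (x and w are hypothetical predictors):

    library(car)
    model <- lm(y ~ x + I(x^2) + w, data = dat)
    vif(model)                      # values above 5-10 are concerning
    dat$xc <- dat$x - mean(dat$x)   # centering reduces multicollinearity between x and x^2
    vif(lm(y ~ xc + I(xc^2) + w, data = dat))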
From Day 4, you should be able to
· Add a categorical variable to a regression model using k – 1 terms for a variable with k categories
o where αk is the kth "group effect"
o Indicator variables change intercepts
o Interpret coefficients as changes in intercepts (e.g., parallel lines, at any value of x)
· Use either effect coding or indicator coding
o Determine the value of the coefficient not displayed in the output
o Interpret signs of coefficients (comparing to overall mean or to reference group)
o Interpretation of intercept (reference group vs. overall mean)
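In R, the two codings correspond to different contrasts for a factor (group is a hypothetical factor with k levels; R's default is indicator coding):

    fit.ind <- lm(y ~ x + group, data = dat)   # indicator coding (contr.treatment):
                                               # coefficients = changes in intercept vs. reference group
    fit.eff <- lm(y ~ x + group, data = dat,
                  contrasts = list(group = contr.sum))   # effect coding: comparisons to overall mean
    # the undisplayed kth effect = -(sum of the k - 1 displayed effects)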
From Day 5, you should be able to
· Explain the concept of an “index”
· Distinguish different types of “feature scaling” (e.g., centering vs. standardizing vs. min-max scaling)
o What are the benefits to using them?
o How do they change interpretation?
· Assess multicollinearity = explanatory variables have a strong linear association
o VIF = variance inflation factor, computed from the R² from regressing one explanatory variable on all the others (VIF = 1/(1 – R²))
o Variables with high VIF will have inflated standard errors/misleading information about significance of the variable in the model
o Usually VIF values larger than 5 or 10 are concerning
o Remedies include reexpressing variables, removing variables, centering
§ Centering (x – x̄) and/or Standardizing ((x – x̄)/sx) helps when variables are "products" (e.g., quadratic, interactions)
§ Centering is also useful in providing a more meaningful intercept
§ Standardizing can also make regression coefficients more comparable by putting the variables on the same SD scale
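A quick sketch of the three scalings (hypothetical variable x in data frame dat):

    dat$x.cen <- dat$x - mean(dat$x)                               # centering
    dat$x.std <- (dat$x - mean(dat$x)) / sd(dat$x)                 # standardizing (same as scale(dat$x))
    dat$x.mm  <- (dat$x - min(dat$x)) / (max(dat$x) - min(dat$x))  # min-max scaling to [0, 1]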
From Day 6, you should be able to
· Use a "partial F-test" to test significance of a categorical variable (see also Day 4) or a subset of variables
o Null is all k – 1 coefficients are zero; Alt. is at least one coefficient differs from 0
o Compares increase in SSError for the reduced model to MSE for the full model to see whether the reduced model is significantly worse (df num = k – 1, df denom = MSE df for full model)
o Special cases of partial F tests:
§ One coefficient (equivalent to t-test as last variable entered into model)
§ All coefficients ("model utility test")
o Use "anova(model1, model2)" to carry out partial F-test in R.
· Distinguish the meaning of sequential vs. adjusted sums of squares
· Carry out tests of significance involving individual or groups of slope coefficients
o State appropriate hypotheses for removing the variable(s)
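For example (hypothetical models; group adds k – 1 terms to the reduced model):

    reduced <- lm(y ~ x, data = dat)
    full    <- lm(y ~ x + group, data = dat)
    anova(reduced, full)   # partial F-test of H0: all k - 1 group coefficients are zero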
From Day 7, you should be able to
· Correctly interpret adjusted vs. unadjusted associations
o From graph
o From multiple regression model
· Interpret intercept and slope coefficients for multiple regression model in context
o Multiple regression: After adjusting for other variables in the model (e.g., comparing individuals in the same sub-population)
§ Effect of x1 on y can differ when x2 is in the model
§ Adjusted and Unadjusted relationships can look very different
· In multiple regression, interpret individual coefficients after adjusting for other variables in the model (e.g., comparing individuals in the same sub-population)
o Identify the other variables you are talking about
o Quote: We interpret the regression slopes as comparisons of individuals that differ in one predictor while being at the same levels of the other predictors. In some settings, one can also imagine manipulating the predictors to change some or hold others constant—but such an interpretation is not necessary. This becomes clearer when we consider situations in which it is logically impossible to change the value of one predictor while keeping the value of another constant (e.g., x and x², or x1, x2, and x1·x2).
· Use graphs, context, and model equation to interpret an interaction in context
o Be able to explain the nature of the interaction
§ Change in slopes = change in effect of x1 on y depending on value of x2
§ NOT the same as x1 and x2 being related to each other
o Be able to interpret signs of coefficients
o Be able to write out separate equations
o Be able to talk about why we don't just fit separate equations
o Be careful when interpreting "main effects" if you have an interaction
§ Can describe slope of x1 on y when x2 is at zero (or mean if centered)
o Why it’s useful to center variables involved in an interaction
From Day 8, you should be able to
· Distinguish between a confidence interval for E(Y) and a prediction interval for y
o Population regression line vs. variability about the line
o Identify a 95% prediction interval for an individual observation as roughly ŷ ± 2s
o What does/does not impact the widths of these intervals
· Explain that transformations may not be able to "solve" more complicated unequal variance patterns
o Consider whether Var(Y) is changing with an explanatory variable (e.g., increasing variability with larger values of x)
o Explain the principle of obtaining an estimate of σ when you have unequal variance
§ e.g., Var(Yi) varies with xi rather than being constant
§ This can impact judgement of significance of terms in model, predictions, etc.
· Explain the principle of weighted least squares to address heterogeneity in residuals
o Var(Yi) = σ²/wi (typo in Day 8 handout?)
o Giving less weight in the estimation of the "best fit" line to observations that are measured with more uncertainty
o Use standardized residuals in residual plots to evaluate the appropriateness of the model
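A sketch of both ideas in R (hypothetical fitted model and new observation; the 1/x weights assume Var(Y) grows proportionally with x):

    new <- data.frame(x = 10)
    predict(model, new, interval = "confidence")   # narrower interval, for E(Y)
    predict(model, new, interval = "prediction")   # wider interval, for an individual y (roughly yhat +/- 2s)

    wls <- lm(y ~ x, data = dat, weights = 1 / x)  # downweight the more uncertain observations
    plot(fitted(wls), rstandard(wls))              # recheck equal variance with standardized residuals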
From HW 1
· Translate the model assumptions into the context of a study
· t-test vs. ANOVA vs. regression
· Write out a regression equation using good statistical notation
· Interpret regression coefficients in context
· Evaluate residual plots
· Components of an ANOVA table
· Recognize clustered data
From HW 2
· Explain the difference between “within group” and “between group” associations
o Aggregating vs Disaggregating data
o Between group regression focuses on the association between ȳj and x̄j, the group means
o The within group regression focuses on the association between yij and xij within each group, assuming it's the same within each group
o The between group regression slope may look very different (even opposite in direction) from the within group regression slope
o Regular regression can be written as a combination of the within and between group regressions, e.g., E(yij) = β0 + β1xij + β2x̄j
§ β1 matches the within group regression coefficient
§ β2 is the between group coefficient minus the within group coefficient (the difference between them)
o Including an interaction between xij and group would allow the within group regression lines to vary by group
· Use residual plots to compare the effectiveness of different transformations, weighted least squares
· Interpreting interactions
· Predict how a weighted regression model will compare to an unweighted regression model
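A sketch of separating the two associations in R (hypothetical clustered data with y, x, and group):

    dat$xbar <- ave(dat$x, dat$group)        # group mean of x, repeated for each observation
    lm(y ~ x + xbar, data = dat)             # x coefficient = within slope; xbar coefficient = between minus within
    lm(y ~ I(x - xbar) + xbar, data = dat)   # I(x - xbar) coefficient = within; xbar coefficient = between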
And, you should be able to
· Determine and interpret a t-test statistic
· Graphically compare data to model predictions
Keeping track of variances
· sy = variation in response variable
· sx = standard deviation of explanatory (aka predictor) variable
· σ = variation about regression line/unexplained variation in regression model, variation in response at a particular x value
o aka s aka se = residual standard error = square root of the mean squared residual ("root MSE")
· SE(b1) = sample to sample variation in regression slope
· SE(fit) = sample to sample variation in estimated predicted value. There are actually two "se fit" values, one for a confidence interval to predict E(Y) and one for a prediction interval to predict an individual y. The latter can be approximated with s (root MSE), but actually depends on SE(fit), n, distance from x̄, etc.
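In R, both "se fit" flavors can be recovered from predict() (hypothetical fitted model and new x value):

    out <- predict(model, data.frame(x = 10), se.fit = TRUE)
    out$se.fit                                  # SE for estimating E(Y) (confidence interval)
    sqrt(out$se.fit^2 + out$residual.scale^2)   # SE for predicting an individual y (prediction interval)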