Stat 414 – HW 3
Due Friday, midnight, Oct. 20
Please submit individual files for each problem. There are a few nice packages for displaying multiple models side by side. For example:
install.packages("stargazer")
library(stargazer)
stargazer(model1, model2, type = "text")
1) Read this Oct. 1, 2023 article by Nate Silver:
https://www.natesilver.net/p/fine-ill-run-a-regression-analysis
(a) The article mentions “true and robust,” and we looked at “robust standard errors” in the course. What is meant by the term “robust” in statistics? Identify another robust procedure you have seen in this course this quarter.
(b) One of the critiques of Nate’s claims was that “unadjusted state comparisons are misleading.” Explain the argument in your own words. Do you agree or disagree? Explain your reasoning.
(c) Nate mentions “this is almost entirely orthogonal to state partisanship.” What is meant by “orthogonal” in this context?
(d) In the regression model, define the “biden” variable. Is this variable quantitative or categorical? What does it mean for the variable to have a negative coefficient in the model? What needs to be true for this coefficient to be meaningful?
(e) What are the observational units (aka cases) in his regression analysis?
(f) Do you agree with his argument to drop Biden from the model?
2) Reconsider the salary data from HW 2.
The “within group” slope (regressing salary on number of semesters after adjusting for major) was -2.186 and the “between group” slope (regressing the major mean salary on the major mean number of semesters) was 1.822. But in the model with both semesters and avgsem, the coefficient of avgsem was “awkward” to interpret (the average semesters for the major increased by one, but everyone in the major stayed the same?). The slope ended up being the difference in the two previous slopes: 1.822 – (-2.186) = 3.990. Here is another way we could run the model.
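For intuition about the “difference in the two previous slopes” relationship, here is a minimal sketch on made-up data. Everything in it is hypothetical (the data frame `toy`, the variables `x`, `xbar`, `y`, and the simulated numbers); it is not the saldata file and is only meant to illustrate the pattern described above.

```r
# Made-up data (NOT saldata): 4 groups of 5 observations each
set.seed(414)
toy <- data.frame(group = factor(rep(c("A", "B", "C", "D"), each = 5)))
toy$x <- rnorm(20, mean = rep(c(2, 4, 6, 8), each = 5))
toy$xbar <- ave(toy$x, toy$group)   # group mean of x, repeated within group
toy$y <- 50 + 2 * toy$xbar - 1.5 * (toy$x - toy$xbar) + rnorm(20)

# "Within group" slope: slope of x after adjusting for group membership
within <- coef(lm(y ~ x + group, data = toy))["x"]

# "Between group" slope: regress the group mean of y on the group mean of x
means <- aggregate(cbind(x, y) ~ group, data = toy, FUN = mean)
between <- coef(lm(y ~ x, data = means))["x"]

# Model with both the raw predictor and its group mean
both <- coef(lm(y ~ x + xbar, data = toy))

# Compare the coefficient of xbar with (between - within)
c(xbar_coef = unname(both["xbar"]), difference = unname(between - within))
```

In this sketch the groups all have the same size, so the two numbers agree exactly. With unequal group sizes, the group-level regression would need to be weighted by group size for the match to be exact.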
(a) First, we will create a “deviation” variable by subtracting the group mean from each observation. This is called “group mean centering,” as opposed to the “grand mean centering” we did before by subtracting the overall mean. Is this a “level 1” or “level 2” variable?
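To make the two kinds of centering concrete, here is a small sketch on made-up numbers. The data frame `toy` and its columns are hypothetical and chosen only so the arithmetic is easy to check by hand.

```r
# Made-up data (NOT saldata): 2 groups, 3 observations each
toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  x     = c(1, 2, 3, 5, 7, 9)
)

# Group mean of x, repeated for every row in that group
toy$group_mean <- ave(toy$x, toy$group)

# Group mean centering: subtract each observation's own group mean
toy$dev_group <- toy$x - toy$group_mean

# Grand mean centering: subtract the single overall mean
toy$dev_grand <- toy$x - mean(toy$x)

print(toy)

# The group-centered values average to zero within every group
tapply(toy$dev_group, toy$group, mean)
```

Comparing the `dev_group` and `dev_grand` columns row by row shows what each kind of centering keeps and what it removes.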
(b) Now include this variable and the group mean variable in the same model:
saldata = read.table("http://www.rossmanchance.com/stat414/data/saldata.txt", sep = "\t", header = TRUE)
# group mean: average number of semesters within each major
# (must be created before it is used in model4)
saldata$avgsem = ave(saldata$semesters, saldata$major)
summary(model4 <- lm(salary ~ semesters + avgsem, data = saldata))
# create a "deviation" variable (group mean centering)
saldata$dev = saldata$semesters - saldata$avgsem
model6 = lm(salary ~ dev + avgsem, data = saldata)
Which coefficient(s) have changed from model 4? How do you now interpret each slope coefficient? (What is going on here?)
3) Recall the squid data, where we looked at several different models that allowed the variances to vary. (See the models/recreate the model summaries in class this week as well as SquidModels.R.)
(a) Look at the first ten observations:
head(Squid, 10)
(b) Install the nlraa package and then look at the variance-covariance matrices for each model for the first 10 observations:
#install.packages("nlraa")
vcmatrix1 = nlraa::var_cov(model1REML); vcmatrix1[1:10, 1:10]
vcmatrix2 = nlraa::var_cov(model2REML); vcmatrix2[1:10, 1:10]
vcmatrix3 = nlraa::var_cov(model3REML); vcmatrix3[1:10, 1:10]
vcmatrix4 = nlraa::var_cov(model4REML); vcmatrix4[1:10, 1:10]
(c) What is true about the diagonal elements for matrix 1? Why? How is the first value related to the residual standard error for model 1?
(d) For matrix 2, how is the very first value related to the residual standard error for model 2? Which observation has the largest variance in matrix 2? Why?
(e) Which observation has the largest variance in matrix 3? Why? How is the very first value related to the residual standard error for model 3? How is the second value related to the residual standard error for model 3?
(f) In matrix 4, explain how/why the variances for observations 8-10 differ.