Workshop Statistics: Discovery with Data and Fathom

Topic 27: Inference for Correlation and Regression

Activity 27-1: Baseball Payrolls

(a) There appears to be a moderate positive association between payroll and winning percentage.
(b)

    These teams seem to have large payrolls and larger winning percentages.

(c) Positive relationship with moderate strength. Student guesses will vary.
Actual r = .630
(d) Answers will vary. These are intended to be sample answers.
Below is one possible random assignment.
payroll win pct
70.3710 0.598765
75.0650 0.444444
55.3685 0.475309
42.1428 0.588957
54.3925 0.395062
15.1500 0.635802
55.5640 0.456790
71.1358 0.413580
42.9274 0.419753
16.3630 0.530864
71.3314 0.617284
30.5165 0.595092
24.2177 0.484472
46.2482 0.475309
45.9322 0.459627
46.0096 0.465839
The correlation coefficient for these two columns is r=-.281
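The correlation for a shuffled assignment like this one can be checked outside Fathom. The sketch below is one way to do it in plain Python; the helper name pearson_r is made up for illustration, and the data are copied from the two columns listed above. The printed value should agree with the r reported above, up to rounding.

```python
import math

# Columns copied from the random assignment listed above.
payroll = [70.3710, 75.0650, 55.3685, 42.1428, 54.3925, 15.1500, 55.5640, 71.1358,
           42.9274, 16.3630, 71.3314, 30.5165, 24.2177, 46.2482, 45.9322, 46.0096]
win_pct = [0.598765, 0.444444, 0.475309, 0.588957, 0.395062, 0.635802, 0.456790, 0.413580,
           0.419753, 0.530864, 0.617284, 0.595092, 0.484472, 0.475309, 0.459627, 0.465839]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r_shuffled = pearson_r(payroll, win_pct)
print(round(r_shuffled, 3))
```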
(e) This correlation is not nearly as large as the one observed in the sample (.630).
(f)

(g) The graph is fairly symmetric, centered around zero.
(h) None of the simulated correlations is close to the correlation we observed in the sample (.630). A correlation this large is therefore very unlikely to happen by chance alone, indicating that the relationship between winning percentage and payroll is statistically significant.
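The repeated random assignment behind (f) through (h) can be sketched in Python as well. This is only an illustration, not the Fathom procedure: the function names, the seed, and the choice of 1000 repetitions are assumptions. It uses the payroll and winning percentage values listed in (d); their pairing there is already random, which does not matter because every repetition re-shuffles them. The comparison value .630 is the sample correlation from (c).

```python
import math
import random

# Values copied from the columns listed in (d); the pairing is re-shuffled below.
payroll = [70.3710, 75.0650, 55.3685, 42.1428, 54.3925, 15.1500, 55.5640, 71.1358,
           42.9274, 16.3630, 71.3314, 30.5165, 24.2177, 46.2482, 45.9322, 46.0096]
win_pct = [0.598765, 0.444444, 0.475309, 0.588957, 0.395062, 0.635802, 0.456790, 0.413580,
           0.419753, 0.530864, 0.617284, 0.595092, 0.484472, 0.475309, 0.459627, 0.465839]


def pearson_r(x, y):
    """Pearson correlation coefficient (hypothetical helper)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shuffled_correlations(x, y, reps, seed=0):
    """Correlations obtained by randomly re-pairing y with x, reps times."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    y_shuffled = list(y)
    results = []
    for _ in range(reps):
        rng.shuffle(y_shuffled)
        results.append(pearson_r(x, y_shuffled))
    return results


sim_r = shuffled_correlations(payroll, win_pct, reps=1000)
mean_r = sum(sim_r) / len(sim_r)
share_extreme = sum(r >= 0.630 for r in sim_r) / len(sim_r)
print(f"mean of simulated r: {mean_r:.3f}")
print(f"share of simulated r >= .630: {share_extreme:.3f}")
```

The simulated correlations should center near zero, and the share at or above .630 estimates how often chance alone produces a correlation as large as the observed one.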
(i) t = r sqrt(n-2)/sqrt(1-r^2) = .630 sqrt(16-2)/sqrt(1-.630^2) = 3.03 with 16-2=14 degrees of freedom.
Table III indicates that .001 < one-sided p-value < .005
With such a small p-value, we have strong evidence against the null hypothesis of no association, suggesting that there is an association between payroll and winning percentage.
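The arithmetic in (i) can be checked with a few lines of Python. This is a minimal sketch using the standard test statistic for a correlation, t = r sqrt(n-2)/sqrt(1-r^2); the function name correlation_t is made up for illustration.

```python
import math


def correlation_t(r, n):
    """t statistic for testing zero correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


n = 16      # number of teams
r = 0.630   # sample correlation from (c)
t_stat = correlation_t(r, n)
df = n - 2
print(f"t = {t_stat:.2f}, df = {df}")
```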

Activity 27-2: Studying and Grades

(a) observational units: students surveyed
variable 1: hours of study (quantitative)
variable 2: GPA (quantitative)
(b)

    There appears to be a fairly weak positive association between hours studied per week and GPA.

(c) Predicted GPA = 2.89 + .0894 hours/week
correlation coefficient, r=.343
(d) The slope (.0894) is the predicted change in GPA for each additional hour of study per week.
(e) No. A nonzero sample slope could be due to sampling variability alone.
(f) Correlation: .00000014; Regression equation: GPA = .000096 hours/week + 3.25
These show that the correlation and the slope coefficient are both essentially zero.
Answers will vary; these are intended as sample answers.
(g)

(h)
Sample # 1 2 3 4 5 6 7 8 9 10
Sample slope .00997 .0315 -.00989 -.0185 -.00481 -.0212 -.0126 .00631 .00722 .0159
Sample intercept 3.26 3.12 3.3 3.35 3.2 3.4 3.3 3.35 3.3 3.22
Sample # 11 12 13 14 15 16 17 18 19 20
Sample slope -.02 -.0193 .0519 .0212 -.0281 .0186 .0291 -.0544 .0124 .0326
Sample intercept 3.36 3.3 3.07 3.24 3.5 3.2 3.23 3.53 3.15 3.1

(i)

(j) The slope coefficient from the UOP students' data (.0894) falls above all of these sample slope coefficients taken from a population with no association between GPA and study time.  This shows that a slope coefficient at least as extreme as that of the UOP students would rarely happen by chance if there were no association in the population.
(k) Mean of slopes: .002395; Standard deviation of slopes: .0256
This mean is reasonably close to zero.
(l) The standard error of the slope coefficient is .0277.  This is reasonably close to the standard deviation of the 20 simulated slope coefficients (.0256).
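The summary values in (j) and (k) can be reproduced from the 20 sample slopes listed in (h). The sketch below uses Python's statistics module; the variable names are illustrative, and stdev is the sample standard deviation (divisor n-1), which is assumed to be what the answer reports.

```python
import statistics

# The 20 simulated sample slopes listed in (h).
sim_slopes = [0.00997, 0.0315, -0.00989, -0.0185, -0.00481, -0.0212, -0.0126,
              0.00631, 0.00722, 0.0159, -0.02, -0.0193, 0.0519, 0.0212, -0.0281,
              0.0186, 0.0291, -0.0544, 0.0124, 0.0326]
observed_slope = 0.0894  # slope from the UOP students' data

mean_slope = statistics.mean(sim_slopes)
sd_slope = statistics.stdev(sim_slopes)  # sample SD, divisor n - 1
n_as_extreme = sum(s >= observed_slope for s in sim_slopes)

print(f"mean of slopes: {mean_slope:.6f}")
print(f"SD of slopes:   {sd_slope:.4f}")
print(f"simulated slopes >= {observed_slope}: {n_as_extreme}")
```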
(m)  t=.0894/.02771 = 3.23
(n) Using df=80, we get .0005 < p-value < .001
(o) Fathom gives a p-value of .0018.
(p) This small p-value is consistent with the simulation results, because both showed a slope coefficient of .0894 to be rare if there is in fact no association between hours of study and GPA.
(q) Yes, since the p-value represents the chance that a slope coefficient at least this large would happen if in fact no association exists, and the p-value here is very low.
(r) t* with n-2 degrees of freedom (using df=80) = 1.990
    b ± t* SE(b) = .0894 ± 1.990 (.0277) = (.0343, .1445)
(s) (.0343, .1446)
(t) We are 95% confident that the increase in GPA for an additional hour of study time is between .0343 and .1446.  Since zero is not in this interval, we have evidence that there is an association between GPA and hours of study time.
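The test statistic in (m) and the interval in (r) through (t) both come from the slope and its standard error. The following Python sketch recomputes them from the values reported above (b = .0894, SE = .0277 from (l), t* = 1.990 for df=80); the function name slope_inference is hypothetical.

```python
def slope_inference(b, se, t_star):
    """Return the t statistic b / se and the interval b +/- t_star * se."""
    t_stat = b / se
    margin = t_star * se
    return t_stat, (b - margin, b + margin)


t_stat, (lower, upper) = slope_inference(b=0.0894, se=0.0277, t_star=1.990)
print(f"t = {t_stat:.2f}")
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
```

Because the interval lies entirely above zero, it leads to the same conclusion as the small p-value.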
(u)

    The actual sample line is the pink line that starts much lower on the left and ends up much higher on the right.  This line has a lower intercept and a higher slope than the simulated sample lines from the population with no association.

(v) Thus a slope this extreme is not very likely to occur if there is no relationship between the two variables in the population. This indicates that we have strong evidence of an association between hours of study and GPA.

Activity 27-3: Studying and Grades (cont.)

(a)


The distribution is relatively symmetric, with a center around 0.  These plots have no marked features suggesting nonnormality.

(b)

This plot shows no relationship between hours of study and the residual.  It reveals no strong patterns, so no evidence suggesting that a linear model is not appropriate.  The variability of the residuals stays relatively constant for different hours of study per week (except maybe at the high hrs/week end, but we don't have many observations there).