Math 37 - Lecture 28

Relations in Categorical Data (2.6)

SITUATION: We have two or more categorical variables. Are they related?

Example: We want to know whether the smoking behavior of parents is related to their children's smoking behavior. We examine data for 5375 students from 8 randomly selected high schools and find that 18.7% of the students smoke, 33% have two smoking parents, and 42% have one smoking parent. Is the parents' behavior related to the child's?

What other information do we need?

Two way Table: Describes two categorical variables jointly

Row variable:

Column variable:

Entries: Counts in each parent-by-student class

NUMERICAL SUMMARIES

1) Each variable separately = marginal distribution (counts or percents)

2) Look at the relationships among the categorical variables

Use percents

Hint: Ask what group represents the total that I want a percent of
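A minimal Python sketch of the marginal distributions follows. The cell counts are an assumption: the notes give only the total (5375) and a few percents, so the counts here were chosen to agree with those figures. The function names are hypothetical. In both functions the "total that I want a percent of" is the grand total.

```python
# Two-way table of counts: rows = parent group, columns = student behavior.
# ASSUMPTION: illustrative cell counts, chosen to agree with the total of
# 5375 students and the percents quoted in the example; not given in the notes.
counts = {
    "Both parents":   {"Student smokes": 400, "Doesn't": 1380},
    "One parent":     {"Student smokes": 416, "Doesn't": 1823},
    "Neither parent": {"Student smokes": 188, "Doesn't": 1168},
}

def marginal_rows(table):
    """Percent of all individuals that fall in each row group."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {name: 100 * sum(row.values()) / grand_total
            for name, row in table.items()}

def marginal_columns(table):
    """Percent of all individuals that fall in each column group."""
    grand_total = sum(sum(row.values()) for row in table.values())
    col_totals = {}
    for row in table.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {col: 100 * n / grand_total for col, n in col_totals.items()}

row_pct = marginal_rows(counts)
col_pct = marginal_columns(counts)

for name, pct in row_pct.items():
    print(f"{name:15s} {pct:5.1f}%")
for name, pct in col_pct.items():
    print(f"{name:15s} {pct:5.1f}%")
```

Neither marginal distribution says anything about the relationship between the two variables; each describes one variable on its own.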

Example: Do smoking habits of parents help explain whether or not their children smoke?

Conditional Distributions

- What % of students who have two smoking parents also smoke?

- Among those with one smoking parent?

- Among those with neither parent a smoker?

Do the conditional distributions in the different parent groups differ?

Complete conditional distributions:

                  Student Smokes    Doesn’t
Both parents      ______            ______
One parent        ______            ______
Neither parent    ______            ______

The influence of parents smoking on their children smoking is found by comparing the three conditional distributions.
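The comparison can be sketched in Python as follows. The cell counts are an assumption (the notes do not list them; they were chosen to agree with the quoted total and percents), and `conditional_distribution` is a hypothetical helper. Here the total for each percent is the row total, not the grand total.

```python
# ASSUMPTION: illustrative cell counts, consistent with the 5375 students
# and the percents quoted earlier; the notes themselves leave them blank.
counts = {
    "Both parents":   {"Student smokes": 400, "Doesn't": 1380},
    "One parent":     {"Student smokes": 416, "Doesn't": 1823},
    "Neither parent": {"Student smokes": 188, "Doesn't": 1168},
}

def conditional_distribution(row):
    """Percents within one parent group; the row total is the base."""
    total = sum(row.values())
    return {level: 100 * n / total for level, n in row.items()}

conditional = {group: conditional_distribution(row)
               for group, row in counts.items()}

for group, dist in conditional.items():
    cells = "  ".join(f"{level}: {pct:4.1f}%" for level, pct in dist.items())
    print(f"{group:15s} {cells}")
```

Each row of the output sums to 100%. If the three rows were nearly identical, the parents' behavior would tell us little about the student's.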

 

GRAPHICAL SUMMARIES: Bar Graph

- Choose either the row or column variable.

- Each bar represents one group of the chosen variable.

- Height of the bar is the % of that group at that level of the variable.

- Segmented Bar Graph: Each bar includes all subgroup percents so the bars total 100% (Minitab: MTB> exec 'bargraph').
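As a rough stand-in for the Minitab macro, a segmented bar graph can be sketched in plain text with Python. This is only an illustration: the counts are assumed (not given in the notes), and `segmented_bar` is a hypothetical helper. Every bar has the same length because each one totals 100%.

```python
# ASSUMPTION: illustrative counts (smokes, doesn't) for each parent group;
# the notes do not list the cell counts.
counts = {
    "Both parents":   (400, 1380),
    "One parent":     (416, 1823),
    "Neither parent": (188, 1168),
}

def segmented_bar(percents, width=50, symbols="#."):
    """Build one text bar; each segment's length is proportional to its percent."""
    bar = ""
    cumulative = 0.0
    for i, pct in enumerate(percents):
        cumulative += pct
        end = round(width * cumulative / 100)  # right edge of this segment
        bar += symbols[i % len(symbols)] * (end - len(bar))
    return bar

bars = {}
for group, cells in counts.items():
    total = sum(cells)
    percents = [100 * n / total for n in cells]  # conditional distribution
    bars[group] = segmented_bar(percents)
    print(f"{group:15s} |{bars[group]}| {percents[0]:.1f}% smoke")
```

The `#` segment is the share of students who smoke in that group, and the `.` segment is the share who do not.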

Example: Look at the three parent groups. What percent of students in each group smoke? Do these percents differ from bar to bar? Conclusions?

Notes

- Conditional distributions differ from marginal (overall) distributions

- May want to use the other variable as the explanatory variable

- Large amount of information given in a two-way table.

Must decide which information answers your question.

Caution - Lurking Variables

In 1973, lawyers considered suing UC Berkeley for discrimination after it was found that 44.6% of male applicants, but only 30.5% of female applicants, were accepted to graduate programs. Is this difference statistically significant?

To investigate which program was to blame, the following breakdown was recorded:

 

                    Men                 Women
             accepted   denied    accepted   denied
program A         511      314          89       19
program B         352      208          17        8
program C         120      205         202      391
program D         137      270         132      243
program E          53      138          95      298
program F          22      351          24      317
Total            ____     ____        ____     ____

Your task is to find evidence of gender discrimination.

Within each of the six programs, calculate and compare the percentage of male applicants who were accepted with the percentage of females who were accepted. Which programs seem most responsible for substantial discrimination?
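The within-program comparison can be sketched in Python as follows. The counts are copied from the table above; the names `admissions` and `acceptance_rate` are hypothetical. For each percent, the total is the number of applicants of that sex to that program.

```python
# (accepted, denied) counts copied from the table above.
admissions = {
    "A": {"men": (511, 314), "women": (89, 19)},
    "B": {"men": (352, 208), "women": (17, 8)},
    "C": {"men": (120, 205), "women": (202, 391)},
    "D": {"men": (137, 270), "women": (132, 243)},
    "E": {"men": (53, 138),  "women": (95, 298)},
    "F": {"men": (22, 351),  "women": (24, 317)},
}

def acceptance_rate(accepted, denied):
    """Percent accepted among all applicants in one group."""
    return 100 * accepted / (accepted + denied)

rates = {
    program: {sex: acceptance_rate(*cells) for sex, cells in groups.items()}
    for program, groups in admissions.items()
}

for program, r in rates.items():
    higher = "men" if r["men"] > r["women"] else "women"
    print(f"program {program}: men {r['men']:5.1f}%  "
          f"women {r['women']:5.1f}%  higher: {higher}")
```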

Simpson’s Paradox:

Direction of association reverses when data from several groups are combined to form a single group.
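The reversal can be checked directly on the admissions table above. The Python sketch below pools the six programs and compares the pooled rates with the program-by-program rates. The helper names are hypothetical. The last part looks at where each sex applied, since the program applied to is a plausible lurking variable here.

```python
# (accepted, denied) counts copied from the admissions table above.
admissions = {
    "A": {"men": (511, 314), "women": (89, 19)},
    "B": {"men": (352, 208), "women": (17, 8)},
    "C": {"men": (120, 205), "women": (202, 391)},
    "D": {"men": (137, 270), "women": (132, 243)},
    "E": {"men": (53, 138),  "women": (95, 298)},
    "F": {"men": (22, 351),  "women": (24, 317)},
}

def rate(cells):
    """Percent accepted for one (accepted, denied) pair."""
    accepted, denied = cells
    return 100 * accepted / (accepted + denied)

def pooled_rate(sex):
    """Acceptance rate after combining all programs into a single group."""
    accepted = sum(g[sex][0] for g in admissions.values())
    denied = sum(g[sex][1] for g in admissions.values())
    return rate((accepted, denied))

def share_of_applicants(sex, programs):
    """Percent of one sex's applications that went to the given programs."""
    total = sum(sum(g[sex]) for g in admissions.values())
    return 100 * sum(sum(admissions[p][sex]) for p in programs) / total

overall = {sex: pooled_rate(sex) for sex in ("men", "women")}
women_higher = [p for p, g in admissions.items()
                if rate(g["women"]) > rate(g["men"])]

print(f"pooled: men {overall['men']:.1f}%, women {overall['women']:.1f}%")
print("programs where women have the higher rate:", women_higher)
for sex in ("men", "women"):
    print(f"{sex}: {share_of_applicants(sex, ['A', 'B']):.1f}% applied to A or B")
```

The pooled rates favor men, yet women have the higher rate in most individual programs. Most men applied to programs A and B, which accept a large share of their applicants, while few women did.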