Math 37 - Lecture 28
Relations in Categorical Data (2.6)
SITUATION: Two or more categorical variables. Are they related?
Example: We want to know whether the smoking behavior of parents is related to their children's behavior. Data for 5375 students from 8 randomly selected high schools show that 18.7% of the students smoke, 33% have two smoking parents, and 42% have one smoking parent. Is the parents' behavior related to the child's?
What other information do we need?
Two-way Table: Describes two categorical variables jointly
Row variable:
Column variable:
Entries: Counts in each parent-by-student class
NUMERICAL SUMMARIES
1) Each variable separately=marginal distribution (counts or percents)
2) Look at the relationships among the categorical variables
Use percents
Hint: Ask, "What group represents the total that I want a percent of?"
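As a rough illustration of summary 1), the sketch below computes both marginal distributions from a two-way table of counts. The counts themselves are an assumption: these notes quote only percentages, so the numbers were chosen to agree with the 5375 students, 18.7%, 33%, and 42% given earlier. The names `counts` and `marginal_distributions` are made up for this example.

```python
# ASSUMED counts (not given in these notes), chosen to be consistent with
# the quoted figures: 5375 students, 18.7% smokers, 33% / 42% parent groups.
counts = {
    "Both parents smoke":    {"Student smokes": 400, "Student doesn't": 1380},
    "One parent smokes":     {"Student smokes": 416, "Student doesn't": 1823},
    "Neither parent smokes": {"Student smokes": 188, "Student doesn't": 1168},
}

def marginal_distributions(table):
    """Return (row_percents, column_percents), each as a percent of the grand total."""
    grand_total = sum(sum(cells.values()) for cells in table.values())
    # Row variable: add across each row, then divide by the grand total.
    row_pct = {row: 100 * sum(cells.values()) / grand_total
               for row, cells in table.items()}
    # Column variable: add down each column, then divide by the grand total.
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    col_pct = {col: 100 * n / grand_total for col, n in col_totals.items()}
    return row_pct, col_pct

row_pct, col_pct = marginal_distributions(counts)
for name, pct in {**row_pct, **col_pct}.items():
    print(f"{name}: {pct:.1f}%")
```

In both marginal distributions the total being divided by is the whole sample, which is why neither one says anything about the relationship between the two variables.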
Example: Do smoking habits of parents help explain whether or not their children smoke?
Conditional Distributions
- What % of students who have two smoking parents also smoke?
- Among those with one smoking parent?
- Among those with neither parent a smoker?
Do the conditional distributions in the different parent groups differ?
Complete conditional distributions:
|                | Student Smokes | Doesn't |
| Both parents   |                |         |
| One parent     |                |         |
| Neither parent |                |         |
The influence of parents smoking on their children smoking is found by comparing the three conditional distributions.
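One way to fill in a table of conditional distributions is sketched below: within each parent group, every count is divided by that group's own total, following the hint about choosing the right total. The counts are assumed for illustration (the notes quote only percentages; these numbers agree with them), and `conditional_distributions` is a hypothetical helper name.

```python
# ASSUMED counts (not given in these notes), consistent with the quoted
# 5375 students, 18.7% smokers, and 33% / 42% parent groups.
counts = {
    "Both parents":   {"Student smokes": 400, "Doesn't": 1380},
    "One parent":     {"Student smokes": 416, "Doesn't": 1823},
    "Neither parent": {"Student smokes": 188, "Doesn't": 1168},
}

def conditional_distributions(table):
    """For each row group, return the percent of that group in each column."""
    result = {}
    for group, cells in table.items():
        group_total = sum(cells.values())  # the total is the group, not the sample
        result[group] = {col: 100 * n / group_total for col, n in cells.items()}
    return result

cond = conditional_distributions(counts)
for group, dist in cond.items():
    cells = ", ".join(f"{col}: {pct:.1f}%" for col, pct in dist.items())
    print(f"{group:<15} {cells}")
```

Each row of the result sums to 100%, so the three rows can be compared directly even though the parent groups have different sizes.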
GRAPHICAL SUMMARIES: Bar Graph
- Choose either the row or column variable.
- Each bar represents one group of the chosen variable.
- Height of the bar is the % of that group at that level of the variable.
- Segmented Bar Graph: Each bar includes all subgroup percents so the bars total 100% (Minitab: MTB> exec 'bargraph').
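The idea of a segmented bar can be sketched in plain text, as a stand-in for the Minitab graph rather than a reproduction of it. Every bar has the same width (100% of its group) and is split in proportion to the subgroup counts. The counts are assumed for illustration, since the notes give only percentages, and `segmented_bar` is a hypothetical helper.

```python
# ASSUMED (smokes, doesn't smoke) counts per parent group; not given in these notes.
groups = {
    "Both parents":   (400, 1380),
    "One parent":     (416, 1823),
    "Neither parent": (188, 1168),
}

def segmented_bar(counts, symbols="#.", width=50):
    """Return a fixed-width text bar whose segments are proportional to counts."""
    total = sum(counts)
    bar, running, filled = "", 0, 0
    for n, sym in zip(counts, symbols):
        running += n
        end = round(width * running / total)  # right edge of this segment
        bar += sym * (end - filled)
        filled = end
    return bar

for name, pair in groups.items():
    pct_smokes = 100 * pair[0] / sum(pair)
    print(f"{name:<15}|{segmented_bar(pair)}| {pct_smokes:.1f}% smoke")
```

Because all bars are the same length, only the position of the boundary between `#` (smokes) and `.` (doesn't) changes from bar to bar, which is the comparison the graph is meant to make easy.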
Example: For each of the three parent groups, what percent of the students in that group smoke? Do these percents differ from bar to bar? Conclusions?
Notes
- Conditional distributions differ from marginal (overall) distributions
- May want to use the other variable as the explanatory variable
- A two-way table gives a large amount of information; you must decide which information answers your question.
Caution - Lurking Variables
In 1973, lawyers considered suing UC Berkeley for discrimination after it was found that 44.6% of males, but only 30.5% of females, were accepted to graduate programs. Is this difference statistically significant?
To investigate which program was to blame, the following breakdown was recorded:
|           | Men accepted | Men denied | Women accepted | Women denied |
| program A | 511          | 314        | 89             | 19           |
| program B | 352          | 208        | 17             | 8            |
| program C | 120          | 205        | 202            | 391          |
| program D | 137          | 270        | 132            | 243          |
| program E | 53           | 138        | 95             | 298          |
| program F | 22           | 351        | 24             | 317          |
| Total     |              |            |                |              |
Your task is to find evidence of gender discrimination.
Within each of the six programs, calculate and compare the percentage of male applicants who were accepted with the percentage of females who were accepted. Which programs seem most responsible for substantial discrimination?
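The within-program comparison can be sketched as follows, using the accepted and denied counts from the table above. Each percentage is a conditional distribution: the total is the number of applicants of one gender to one program. The names `men`, `women`, and `acceptance_rate` are chosen for this example only.

```python
# (accepted, denied) counts copied from the program-by-gender table.
men = {"A": (511, 314), "B": (352, 208), "C": (120, 205),
       "D": (137, 270), "E": (53, 138), "F": (22, 351)}
women = {"A": (89, 19), "B": (17, 8), "C": (202, 391),
         "D": (132, 243), "E": (95, 298), "F": (24, 317)}

def acceptance_rate(accepted, denied):
    """Percent accepted among everyone who applied (accepted + denied)."""
    return 100 * accepted / (accepted + denied)

for program in men:
    m = acceptance_rate(*men[program])
    w = acceptance_rate(*women[program])
    print(f"program {program}: men {m:5.1f}%   women {w:5.1f}%")
```

Reading down the printed rows program by program is the comparison the exercise asks for; the sizes of the gaps, and which direction they point, are what to look at.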
Simpson's Paradox:
The direction of an association can reverse when data from several groups are combined to form a single group.
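A minimal sketch of how combining groups can change the picture, using the Berkeley counts from the table: it pools all six programs for each gender, lists the programs in which women had the higher acceptance rate, and checks where each gender mostly applied. The helper names `pooled_rate`, `rate`, and `share_applying` are hypothetical.

```python
# (accepted, denied) counts copied from the program-by-gender table.
men = {"A": (511, 314), "B": (352, 208), "C": (120, 205),
       "D": (137, 270), "E": (53, 138), "F": (22, 351)}
women = {"A": (89, 19), "B": (17, 8), "C": (202, 391),
         "D": (132, 243), "E": (95, 298), "F": (24, 317)}

def rate(accepted, denied):
    """Percent accepted within one program."""
    return 100 * accepted / (accepted + denied)

def pooled_rate(table):
    """Percent accepted after combining all programs into a single group."""
    accepted = sum(a for a, _ in table.values())
    applied = sum(a + d for a, d in table.values())
    return 100 * accepted / applied

def share_applying(table, programs):
    """Percent of this gender's applications that went to the given programs."""
    applied = {p: a + d for p, (a, d) in table.items()}
    return 100 * sum(applied[p] for p in programs) / sum(applied.values())

overall_men = pooled_rate(men)
overall_women = pooled_rate(women)
women_higher = [p for p in men if rate(*women[p]) > rate(*men[p])]

print(f"combined: men {overall_men:.1f}%, women {overall_women:.1f}%")
print("programs where women's rate is higher:", women_higher)
print(f"applied to A or B: men {share_applying(men, ['A', 'B']):.0f}%, "
      f"women {share_applying(women, ['A', 'B']):.0f}%")
```

The combined rates match the 44.6% and 30.5% quoted in the caution, yet women have the higher rate in most individual programs. The lurking variable is the program applied to: men applied mostly to the programs with high acceptance rates, women mostly to those with low ones.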