
S3 Chapter 6: Goodness of Fit and Contingency Tables

Imagine you are a board game enthusiast. You come across a Kickstarter project called “The Honest Dice.” The founders claim that through precision engineering and special materials, they have created the fairest dice in history. They assert that the probability of rolling any face is exactly $1/6$, unlike standard mass-produced dice, which inevitably have manufacturing imperfections.

But here is the problem:

  • Every physical object has imperfections.
  • In a game session, these tiny biases can accumulate over hundreds of rolls.
  • How can you scientifically verify whether the “Honest Dice” is actually fairer than a cheap plastic one?
  • More importantly, how can the founders provide convincing statistical evidence to potential backers?

This problem isn’t just about dice. In the digital world, verifying randomness is even more critical:

  • Online Gambling: How do regulators verify that a digital slot machine is fair?
  • Lucky Draws: How do we know a promotional lottery isn’t rigged?
  • Cryptography: Security relies on random number generators. If a pattern exists, hackers might exploit it.

This chapter introduces the Chi-Square ($\chi^2$) tests, a powerful statistical framework for answering these questions by comparing what we see (data) with what we expect (theory).

The Fundamental Question: How large is the discrepancy between our observed data and the theoretical prediction? Is this discrepancy just due to random chance, or does it indicate a systematic bias?

The Logic: If the die is truly fair, the observed frequency of each face should be “close enough” to the expected frequency. If the difference is “too large,” we suspect the die is not fair.

We start by defining the null hypothesis ($H_0$), which represents the status quo or the theoretical distribution we are testing against.

  • $H_0$: The data follows the specified distribution (e.g., the die is fair).
  • $H_1$: The data does not follow the specified distribution.

Note: We never “prove” $H_0$ is true. We only check whether there is strong evidence to reject it.

If $H_0$ is true, what should we see? We calculate the Expected Frequency ($E_i$) for each category $i$:

$$E_i = n \times p_i$$

where:

  • $n$ is the total sample size (total number of trials).
  • $p_i$ is the theoretical probability of category $i$ under $H_0$.

We need a single number to summarize the total discrepancy between the Observed ($O_i$) and Expected ($E_i$) values. We use the Chi-Square statistic:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
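The formula translates directly into code. As an illustrative sketch (the function name and the sample counts below are made up, not from the text):

```python
def chi_square_stat(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Illustrative data: 60 rolls of a supposedly fair die, so each E_i = 10
observed = [12, 9, 11, 8, 10, 10]
expected = [10] * 6
print(round(chi_square_stat(observed, expected), 2))  # 1.0
```

Note that each squared deviation is divided by its own expected frequency, so rare categories are not drowned out by common ones.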

The shape of the Chi-Square distribution depends on the degrees of freedom ($df$):

$$\boxed{df = k - 1 - m}$$

where:

  • $k$ = number of categories (bins).
  • $1$ = the constraint due to the fixed sample size (knowing $k-1$ frequencies determines the last one).
  • $m$ = number of population parameters estimated from the sample data to calculate the expected frequencies.

Example 1: The Uniform Distribution (The Honest Dice)


Let’s test the “Honest Dice.” We roll it 600 times.

Observed Data:

| Face | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Observed ($O_i$) | 98 | 102 | 95 | 105 | 96 | 104 |

Task: Test, at the 5% level of significance, whether or not a uniform distribution is a suitable model for these data. State your hypotheses and show your working clearly.

Example 2: The Binomial Distribution

A basketball player shoots 3 free throws per game. We record his number of successful shots ($X$) over 100 games. We want to test whether $X \sim B(3, p)$.

Observed Data:

| $X$ (successes) | 0 | 1 | 2 | 3 | Total |
|---|---|---|---|---|---|
| Observed Freq ($O_i$) | 45 | 40 | 13 | 2 | 100 |

Task:

(a) Show that the estimated probability of a successful shot is $0.24$.

(b) Test, at the 5% level of significance, whether or not a binomial distribution is a suitable model for these data. State your hypotheses and show your working clearly.

Example 3: The Normal Distribution

Testing whether continuous data follow a Normal distribution is slightly more complex because:

  • The Normal distribution is continuous, but the chi-square test requires discrete categories.
  • We typically don’t know the true $\mu$ and $\sigma$, so we must estimate them from the data.

The Solution: Binning

We divide the continuous range into intervals (bins) and count how many observations fall into each bin. This converts continuous data into a frequency table.

Detailed Example: Testing Normality of Exam Scores


A teacher suspects that exam scores follow a Normal distribution. She collects 100 scores and groups them into bins:

| Score Range | $<50$ | 50-60 | 60-70 | 70-80 | $\ge 80$ |
|---|---|---|---|---|---|
| Observed ($O_i$) | 8 | 22 | 35 | 25 | 10 |

From the raw data (before binning), she calculates:

  • Sample mean: $\bar{x} = 64.5$
  • Sample standard deviation: $s = 12.0$

Task:

(a) Assuming the scores follow a $N(64.5, 12^2)$ distribution, show that the expected frequency for the 50-60 bin is approximately $24.01$.

(b) Given that the expected frequencies for the five bins are roughly $E = \{11.35, 24.01, 32.30, 22.48, 9.83\}$, test, at the 5% level of significance, whether or not a normal distribution is a suitable model for these data. State your hypotheses and show your working clearly.

Contingency Tables: Testing for Independence


Introduction: The Mysterious Case of the Titanic


On April 15, 1912, the RMS Titanic sank after hitting an iceberg. Of the 2,224 passengers and crew, more than 1,500 died. In the aftermath, a troubling question arose:

Was survival related to passenger class?

The “Women and children first” protocol was supposed to apply equally, but rumors suggested that first-class passengers had better access to lifeboats. How can we statistically test whether survival was independent of social class, or whether there was a significant association?

| | Survived | Died | Total |
|---|---|---|---|
| 1st Class | 203 | 122 | 325 |
| 2nd Class | 118 | 167 | 285 |
| 3rd Class | 178 | 528 | 706 |
| Total | 499 | 817 | 1316 |

At first glance, the survival rate for 1st class ($203/325 \approx 62\%$) seems much higher than for 3rd class ($178/706 \approx 25\%$). But could this difference be due to random chance? This is exactly the question that a Chi-Square Test for Independence can answer.

Often we want to know if two categorical variables are related.

  • Is gender related to voting preference?
  • Is a new drug treatment related to recovery rate?
  • Was survival on the Titanic related to passenger class?

Two events $A$ and $B$ are independent if knowing that $A$ occurred gives no information about $B$. Mathematically: $P(A \cap B) = P(A) \times P(B)$.

Hypotheses:

  • $H_0$: The two variables are independent (no association).
  • $H_1$: The two variables are not independent (there is an association).

Expected Frequencies: If $H_0$ is true, the probability of falling into cell $(i, j)$ depends only on the row and column totals:

$$E_{ij} = \frac{\text{Row Total}_i \times \text{Column Total}_j}{\text{Grand Total}}$$

Degrees of Freedom: For a table with $r$ rows and $c$ columns:

$$\boxed{df = (r-1)(c-1)}$$
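To make these formulas concrete, here is a short Python sketch that applies them to the Titanic table above; the critical value 5.991 for $df = 2$ is read from standard chi-square tables, not computed:

```python
# Titanic data: rows = class (1st, 2nd, 3rd), columns = (survived, died)
observed = [
    [203, 122],
    [118, 167],
    [178, 528],
]

row_totals = [sum(row) for row in observed]        # 325, 285, 706
col_totals = [sum(col) for col in zip(*observed)]  # 499, 817
grand = sum(row_totals)                            # 1316

# E_ij = (row total * column total) / grand total
expected = [[r * c / grand for c in col_totals] for r in row_totals]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(3) for j in range(2))
df = (3 - 1) * (2 - 1)  # = 2

print(round(chi2, 1), df)  # chi2 is about 133, far above the 5.991 critical value
```

With a statistic this far beyond the critical value, the data give overwhelming evidence of an association between class and survival.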

Example: Coffee Preference vs. Time of Day


A cafe wants to know if drink preference depends on the time of day. They survey 200 customers.

| | Morning | Afternoon | Evening | Total |
|---|---|---|---|---|
| Latte | 70 | 25 | 5 | 100 |
| Espresso | 50 | 47 | 3 | 100 |
| Total | 120 | 72 | 8 | 200 |

Task:

  1. State the hypotheses $H_0$ and $H_1$.
  2. Calculate the Expected Frequencies table. Check the Rule of 5! If necessary, pool columns to ensure all expected frequencies are $\ge 5$.
  3. Calculate the $\chi^2$ statistic.
  4. Determine the degrees of freedom (based on the new table) and find the critical value at $\alpha = 0.05$.
  5. Conclude whether coffee preference is independent of the time of day.

Challenge: Why Does the Chi-Square Statistic Follow a Chi-Square Distribution?


This section guides you through understanding why our test statistic $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$ follows a Chi-Square distribution. This is a challenging but rewarding exploration!

Definition: Chi-Square Distribution. If $Z_1, Z_2, \ldots, Z_k$ are independent standard normal random variables ($Z_i \sim N(0,1)$), then the sum of their squares

$$Q = Z_1^2 + Z_2^2 + \cdots + Z_k^2$$

follows a Chi-Square distribution with $k$ degrees of freedom, written $Q \sim \chi^2_k$.

Key Insight: The Chi-Square distribution is fundamentally about sums of squared standard normal variables.

Our goal is to show that $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$ approximately follows $\chi^2_{k-1}$ when $H_0$ is true.
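One way to build intuition (a simulation, not a proof) is to roll a fair die many times, compute the statistic repeatedly, and check that its empirical mean is close to the mean of $\chi^2_5$, which is 5. A sketch using only the Python standard library:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def chi2_fair_die(n_rolls=600):
    """Roll a fair die n_rolls times and return the chi-square statistic."""
    counts = [0] * 6
    for _ in range(n_rolls):
        counts[random.randrange(6)] += 1
    e = n_rolls / 6
    return sum((o - e) ** 2 / e for o in counts)

# Repeat the experiment many times; a chi-square(df) variable has mean df,
# so the average statistic should be close to df = 6 - 1 = 5.
stats = [chi2_fair_die() for _ in range(2000)]
mean = sum(stats) / len(stats)
print(round(mean, 2))  # close to 5
```

A histogram of `stats` would likewise track the $\chi^2_5$ density, which is the point of the derivation below.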

Part E: Why Estimating Parameters Costs More Degrees of Freedom

(i) When we estimate $m$ parameters from the data, we impose $m$ additional constraints (the estimated parameters must “fit” the data in some optimal way).

(ii) Each constraint removes one degree of freedom from the $(k-1)$-dimensional space.

(iii) Final result: $df = k - 1 - m$.

(iv) Example: For testing normality with 5 bins:

  • $k = 5$ categories
  • $m = 2$ (estimate $\mu$ and $\sigma$)
  • $df = 5 - 1 - 2 = 2$

Explain in your own words why estimating $\mu$ from $\bar{x}$ and $\sigma$ from $s$ each “uses up” one degree of freedom.

Solution to Example 1: The Uniform Distribution

1. Hypotheses

  • $H_0$: A uniform distribution is a suitable model for these data ($P(1)=P(2)=\dots=P(6)=1/6$).
  • $H_1$: A uniform distribution is not a suitable model for these data.

2. Expected Frequencies Total $n = 600$. Under $H_0$, $E_i = 600 \times \frac{1}{6} = 100$ for all faces.

3. Calculate $\chi^2$

$$\begin{aligned} \chi^2 &= \frac{(98-100)^2}{100} + \frac{(102-100)^2}{100} + \frac{(95-100)^2}{100} + \frac{(105-100)^2}{100} + \frac{(96-100)^2}{100} + \frac{(104-100)^2}{100} \\ &= 0.04 + 0.04 + 0.25 + 0.25 + 0.16 + 0.16 = \mathbf{0.90} \end{aligned}$$

4. Degrees of Freedom and Critical Value $k = 6$, $m = 0$ (the probabilities are given by the definition of a fair die). $df = 6 - 1 - 0 = 5$. The critical value ($\alpha=0.05$, $df=5$) is 11.070.

5. Conclusion $0.90 < 11.070$. Fail to reject $H_0$. There is insufficient evidence to suggest the die is unfair; a uniform distribution is a suitable model.
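The working above is easy to check in a few lines of Python (the critical value 11.070 is read from tables, not computed):

```python
observed = [98, 102, 95, 105, 96, 104]
expected = [600 / 6] * 6   # E_i = 100 under H0

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))      # 0.9
print(chi2 < 11.070)       # True -> fail to reject H0
```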

Solution to Example 2: The Binomial Distribution


1. Hypotheses

  • $H_0$: A binomial distribution is a suitable model for these data.
  • $H_1$: A binomial distribution is not a suitable model for these data.

2. Estimate $p$ Total shots $= 3 \times 100 = 300$. Total successes $= 0(45)+1(40)+2(13)+3(2) = 72$. $\hat{p} = 72/300 = 0.24$.

3. Expected Frequencies (Before Pooling) Using $B(3, 0.24)$:

| $X$ | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| $E_i$ | 43.90 | 41.59 | 13.13 | 1.38 |

4. Rule of 5 & Pooling $E_3 < 5$, so we MUST pool $X=2$ and $X=3$.

| $X$ (New Categories) | 0 | 1 | $\ge 2$ |
|---|---|---|---|
| $O_i$ | 45 | 40 | 15 |
| $E_i$ | 43.90 | 41.59 | 14.51 |

5. Calculate $\chi^2$

$$\chi^2 = \frac{(45-43.9)^2}{43.9} + \frac{(40-41.59)^2}{41.59} + \frac{(15-14.51)^2}{14.51} \approx 0.028 + 0.061 + 0.017 = \mathbf{0.106}$$

6. Degrees of Freedom $k = 3$ (after pooling!), $m = 1$ (estimated $p$). $df = 3 - 1 - 1 = \mathbf{1}$.

7. Conclusion The critical value ($\alpha=0.05$, $df=1$) is 3.841. $0.106 < 3.841$. Fail to reject $H_0$. A binomial distribution is a suitable model.
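The whole Example 2 pipeline (estimating $\hat{p}$, building binomial expected frequencies, pooling, computing the statistic) can be sketched in Python; small differences from the hand values come from rounding the expected frequencies:

```python
from math import comb

observed = [45, 40, 13, 2]          # games with X = 0, 1, 2, 3 successes
n_games = sum(observed)             # 100
successes = sum(x * o for x, o in enumerate(observed))  # 72
p_hat = successes / (3 * n_games)   # 3 shots per game -> 300 shots total

# Expected frequencies under B(3, p_hat)
expected = [n_games * comb(3, x) * p_hat**x * (1 - p_hat)**(3 - x)
            for x in range(4)]

# Rule of 5: pool X = 2 and X = 3 into a single "X >= 2" category
obs_pooled = [observed[0], observed[1], observed[2] + observed[3]]
exp_pooled = [expected[0], expected[1], expected[2] + expected[3]]

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
print(round(p_hat, 2))   # 0.24
print(round(chi2, 2))    # about 0.10, well below the critical value 3.841
```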

Solution to Example 3: The Normal Distribution


1. Hypotheses

  • $H_0$: A normal distribution is a suitable model for these data.
  • $H_1$: A normal distribution is not a suitable model for these data.

2. Probability for the 50-60 bin

$$P(50 \le X < 60) = P\left(\frac{50-64.5}{12} \le Z < \frac{60-64.5}{12}\right) = P(-1.208 \le Z < -0.375) = 0.2401$$

3. Expected Frequency $E_2 = 100 \times 0.2401 = 24.01$

4. Calculate $\chi^2$

$$\begin{aligned} \chi^2 &= \frac{(8-11.35)^2}{11.35} + \frac{(22-24.01)^2}{24.01} + \frac{(35-32.30)^2}{32.30} + \frac{(25-22.48)^2}{22.48} + \frac{(10-9.83)^2}{9.83} \\ &= 0.989 + 0.168 + 0.226 + 0.282 + 0.003 = \mathbf{1.668} \end{aligned}$$

5. Degrees of Freedom $k = 5$ bins, $m = 2$ (estimated $\mu$ and $\sigma$). $df = 5 - 1 - 2 = \mathbf{2}$. The critical value ($\alpha=0.05$) is 5.991.

6. Conclusion $1.668 < 5.991$. Fail to reject $H_0$. A normal distribution is a suitable model.
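The same calculation can be reproduced in Python, using the error function for the normal CDF; the result differs slightly from the hand value 1.668 only because the hand working used rounded table probabilities:

```python
from math import erf, sqrt, inf

def phi(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 64.5, 12.0, 100
edges = [-inf, 50, 60, 70, 80, inf]   # bin boundaries for the five score ranges
observed = [8, 22, 35, 25, 10]

# Expected frequency per bin: n * P(a <= X < b) for X ~ N(mu, sigma^2)
expected = [n * (phi((b - mu) / sigma) - phi((a - mu) / sigma))
            for a, b in zip(edges, edges[1:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # close to the hand value of 1.668
```

Using `-inf` and `inf` as the outer edges guarantees the expected frequencies sum to exactly $n$.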

Solution to Example: Coffee Preference vs. Time of Day

1. Hypotheses $H_0$: Coffee preference is independent of time of day. $H_1$: They are not independent.

2. Expected Frequencies (Before Pooling)

$$E_{ij} = \frac{\text{Row Total} \times \text{Col Total}}{\text{Grand Total}}$$

| (Expected) | Morning | Afternoon | Evening |
|---|---|---|---|
| Latte | 60 | 36 | 4 |
| Espresso | 60 | 36 | 4 |

3. Rule of 5 & Pooling Since $E_{\text{Latte, Evening}} < 5$ and $E_{\text{Espresso, Evening}} < 5$, we must pool the “Afternoon” and “Evening” columns.

| (Observed) | Morning | Afternoon/Evening |
|---|---|---|
| Latte | 70 | 30 |
| Espresso | 50 | 50 |

| (Expected) | Morning | Afternoon/Evening |
|---|---|---|
| Latte | 60 | 40 |
| Espresso | 60 | 40 |

4. Calculate $\chi^2$

$$\begin{aligned} \chi^2 &= \frac{(70-60)^2}{60} + \frac{(30-40)^2}{40} + \frac{(50-60)^2}{60} + \frac{(50-40)^2}{40} \\ &= \frac{100}{60} + \frac{100}{40} + \frac{100}{60} + \frac{100}{40} \\ &= 1.667 + 2.5 + 1.667 + 2.5 = \mathbf{8.334} \end{aligned}$$

5. Degrees of Freedom $df = (r-1)(c-1) = (2-1)(2-1) = 1$ (using the pooled table!). The critical value ($\alpha=0.05$, $df=1$) is 3.841.

6. Conclusion $8.334 > 3.841$. Reject $H_0$. There is significant evidence of an association between coffee preference and time of day.
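A quick check of the pooled calculation, with observed and expected values taken from steps 2-3:

```python
# Pooled 2x2 table from the working above
observed = [[70, 30],   # Latte:    Morning, Afternoon/Evening
            [50, 50]]   # Espresso: Morning, Afternoon/Evening
expected = [[60, 40],
            [60, 40]]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(round(chi2, 3))   # 8.333
print(chi2 > 3.841)     # True -> reject H0
```

The exact value is $25/3 \approx 8.333$; the 8.334 above comes from rounding $100/60$ to 1.667 before summing.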