S3 Chapter 5: Correlation and Rank

Introduction: Does Screen Time Ruin Your Sleep?

Review of Product Moment Correlation (PMCC)

Recall from S1 that to measure the strength of a linear relationship between two variables $x$ and $y$, we use the Product Moment Correlation Coefficient (PMCC), denoted by $r$.

Definition: Pearson’s Correlation Coefficient. The sample correlation coefficient is calculated as:

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$$

where:

  • $S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$
  • $S_{xx} = \sum(x_i - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$
  • $S_{yy} = \sum(y_i - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$
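The definition above can be sketched directly in code. This is a minimal illustration, not a library implementation: the function name `pmcc` is an assumption made here, and the data are the five $(x, y)$ pairs that appear in the "Solution to PMCC Calculation" table in this chapter.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / sqrt(s_xx * s_yy)

# The five data pairs from the worked PMCC solution in this chapter
xs = [2, 3, 5, 6, 9]
ys = [50, 60, 70, 80, 90]
r = pmcc(xs, ys)
print(round(r, 3))  # 0.981
```

The guard against a zero $S_{xx}$ or $S_{yy}$ is a design choice of this sketch: when one variable never changes, the denominator is zero and $r$ has no meaning.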

Suppose we calculate $r = -0.4$ for our screen time study ($n = 10$). This suggests a moderate negative relationship.

But wait! Even if there is zero relationship in reality, random sampling might just happen to give us 10 students where the ones who used their phones more slept worse.

Question: How strong does $r$ need to be before we can confidently say it’s not just luck?

Just like $\bar{X}$ estimates $\mu$, the sample correlation $r$ estimates the true population correlation coefficient, which we denote by the Greek letter $\rho$ (rho).

  • $\rho$: The true correlation for all students (unknown parameter).
  • $r$: The correlation for our sample of 10 students (calculated statistic).

To test for evidence of correlation, we set up hypotheses about $\rho$:

Null Hypothesis ($H_0$): $\rho = 0$ (No correlation in the population)
Alternative Hypothesis ($H_1$), one of:
  • $\rho \neq 0$ (Correlation exists; two-tailed)
  • $\rho > 0$ (Positive correlation; one-tailed)
  • $\rho < 0$ (Negative correlation; one-tailed)

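We don’t calculate a $Z$ or $t$ score manually. Instead, we compare our calculated sample $|r|$ to a critical value from statistical tables, chosen for the sample size $n$ and the significance level. If $|r|$ exceeds the critical value, we reject $H_0$.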
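The comparison against a critical value can be sketched as follows. The helper name `reject_h0` is hypothetical, and the critical value 0.6319 is an assumption: it is the entry standard PMCC tables give for $n = 10$ at the 5% two-tailed level, so check it against the table supplied with your course.

```python
def reject_h0(r, critical_value, alternative="two-sided"):
    """Decide whether the sample r gives evidence against H0: rho = 0.

    critical_value must be read from the table column that matches
    the chosen alternative, significance level, and sample size.
    """
    if alternative == "two-sided":
        return abs(r) > critical_value
    if alternative == "greater":   # H1: rho > 0
        return r > critical_value
    if alternative == "less":      # H1: rho < 0
        return r < -critical_value
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

# Assumed table entry: n = 10, 5% significance, two-tailed
CRITICAL_VALUE = 0.6319

r = -0.4  # the screen time sample correlation from the text
decision = reject_h0(r, CRITICAL_VALUE)
print(decision)  # False
```

Under that assumed table value, $|r| = 0.4$ falls short of 0.6319, so a sample of 10 with $r = -0.4$ would not be enough evidence to reject $H_0$.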

PMCC ($r$) is powerful, but it has two major weaknesses. Let’s explore them with hands-on examples.

Spearman’s Rank Correlation Coefficient (denoted $r_s$) is simply the PMCC calculated on the ranks of the data rather than the data itself.

  1. Rank the $x$ values from 1 to $n$ (e.g., 1 = smallest, $n$ = largest).
  2. Rank the $y$ values from 1 to $n$.
  3. Calculate Pearson’s $r$ using these ranks.
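The three steps above can be sketched as below. The data are hypothetical (made-up hours of screen time and hours of sleep for five students), the function names are illustrative, and the ranking helper assumes there are no tied values.

```python
from math import sqrt

def ranks_no_ties(values):
    """Rank values from 1 (smallest) to n (largest); assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def pmcc(xs, ys):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def spearman(xs, ys):
    """Steps 1 to 3: rank x, rank y, then apply Pearson's r to the ranks."""
    return pmcc(ranks_no_ties(xs), ranks_no_ties(ys))

# Hypothetical sample, for illustration only
screen_hours = [1.0, 2.5, 3.0, 4.5, 6.0]
sleep_hours = [8.5, 8.0, 6.5, 7.0, 5.0]

print(ranks_no_ties(sleep_hours))  # [5, 4, 2, 3, 1]
r_s = spearman(screen_hours, sleep_hours)
print(round(r_s, 3))  # -0.9
```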

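If there are no tied ranks (no two values are the same), algebra gives us a much simpler formula:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where $d$ is the difference between the two ranks of each pair and $n$ is the number of pairs.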
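For reference, the standard no-ties shortcut is $r_s = 1 - \frac{6\sum d^2}{n(n^2-1)}$, where $d$ is the difference between the paired ranks. A minimal sketch, using hypothetical ranks and an illustrative function name:

```python
def spearman_no_ties(rank_x, rank_y):
    """Shortcut formula r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).

    Only valid when neither list of ranks contains ties.
    """
    n = len(rank_x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

# Hypothetical ranks for five students
rank_x = [1, 2, 3, 4, 5]
rank_y = [5, 4, 2, 3, 1]

# d = -4, -2, 1, 1, 4, so sum(d^2) = 38 and r_s = 1 - 228/120
r_s = spearman_no_ties(rank_x, rank_y)
print(round(r_s, 3))  # -0.9
```

With no ties, this agrees with applying Pearson's formula to the same ranks; with ties the two can differ, which is why the shortcut is restricted to the untied case.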

If two or more values are identical (e.g., two students both score 85), we give them the average of the ranks they would have occupied.

  • Example: If the 3rd and 4th best scores are tied, both get rank $\frac{3+4}{2} = 3.5$. The next best score gets rank 5.
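The averaging rule for tied ranks can be sketched like this. The function name and the scores are hypothetical; ranks here run from 1 for the smallest value, so reverse the ordering if your question ranks the best score as 1.

```python
def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of the positions they occupy."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]

# Hypothetical test scores with one tie: the two 85s occupy positions 3 and 4
scores = [70, 85, 85, 90, 60]
tied_ranks = average_ranks(scores)
print(tied_ranks)  # [2.0, 3.5, 3.5, 5.0, 1.0]
```

Note that rank 5 is still used for the next score after the tie; the averaged ranks replace positions 3 and 4 rather than shifting everything down.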

We can test for association using ranks too.

  • Null Hypothesis ($H_0$): $\rho_s = 0$ (No association).
  • Alternative Hypothesis ($H_1$): $\rho_s \neq 0$ (Association exists).

Method: Look up the critical value in the Spearman’s Rank Correlation Coefficient table.

Solution to PMCC Calculation:

| $x$ | $y$ | $x^2$ | $y^2$ | $xy$ |
|---|---|---|---|---|
| 2 | 50 | 4 | 2500 | 100 |
| 3 | 60 | 9 | 3600 | 180 |
| 5 | 70 | 25 | 4900 | 350 |
| 6 | 80 | 36 | 6400 | 480 |
| 9 | 90 | 81 | 8100 | 810 |
| $\sum x = 25$ | $\sum y = 350$ | $\sum x^2 = 155$ | $\sum y^2 = 25500$ | $\sum xy = 1920$ |

$$S_{xx} = 155 - \frac{25^2}{5} = 30$$

$$S_{yy} = 25500 - \frac{350^2}{5} = 1000$$

$$S_{xy} = 1920 - \frac{25 \times 350}{5} = 170$$

$$r = \frac{170}{\sqrt{30 \times 1000}} \approx 0.981$$

Solution to The “Zero PMCC” Trap:

  • $\sum x = 0$, $\sum y = 10$, $\sum xy = 0$

Since $S_{xy} = 0 - \frac{0 \times 10}{5} = 0$, we get $r = 0$.
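To see how a perfectly predictable relationship can still give $r = 0$, here is a small check. The data set is an assumption: five points on the curve $y = x^2$, chosen because they reproduce the sums quoted above ($\sum x = 0$, $\sum y = 10$, $\sum xy = 0$). The original exercise data may differ.

```python
from math import sqrt

# Assumed data: a symmetric, non-linear relationship y = x^2
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # [4, 1, 0, 1, 4]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Computational forms of the S-quantities
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
s_yy = sum(y * y for y in ys) - sum_y ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)
print(sum_x, sum_y, sum_xy)  # 0 10 0
print(r)  # 0.0
```

Every $y$ value here is determined exactly by $x$, yet the positive and negative products cancel, so the PMCC reports no linear correlation at all.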

Solution to The “Outlier” Illusion:

  • $\sum x = 26$, $\sum y = 134$, $\sum xy = 1844$
  • $\sum x^2 = 406$, $\sum y^2 = 8588$

$$S_{xx} = 406 - \frac{26^2}{5} = 270.8$$

$$S_{yy} = 8588 - \frac{134^2}{5} = 4996.8$$

$$S_{xy} = 1844 - \frac{26 \times 134}{5} = 1147.2$$

$$r = \frac{1147.2}{\sqrt{270.8 \times 4996.8}} \approx 0.986$$
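When only the summary totals are given, as in this solution, $r$ can be computed from them directly. A minimal sketch with an illustrative function name, using the totals quoted above and $n = 5$:

```python
from math import sqrt

def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r from summary totals, using the computational forms of S_xx, S_yy, S_xy."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)

# Totals from the "Outlier" Illusion solution (n = 5)
r = pmcc_from_sums(n=5, sum_x=26, sum_y=134, sum_x2=406, sum_y2=8588, sum_xy=1844)
print(round(r, 3))  # 0.986
```

Working from totals is convenient for checking hand calculations, since each intermediate value can be compared with the written working line by line.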