Example: Screen Time vs. Sleep Quality
We often hear that using your phone before bed is bad for your sleep. To test this, suppose we survey 10 students and record:
$x$: Number of hours spent on phone before sleep
$y$: Sleep quality score (0-100)
We might see a pattern where higher $x$ generally corresponds to lower $y$.
But is this just a coincidence in our small sample of 10 students?
Or does it provide statistical evidence of a real relationship in the entire student population?
In S1, we learned how to measure correlation. In S3, we learn how to test it.
Recall from S1 that to measure the strength of a linear relationship between two variables $x$ and $y$, we use the Product Moment Correlation Coefficient (PMCC), denoted by $r$.
Definition: Pearson’s Correlation Coefficient
The sample correlation coefficient is calculated as:
$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$$
where:
$S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$
$S_{xx} = \sum(x_i - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$
$S_{yy} = \sum(y_i - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$
Interpreting $r$
$r$ is always between $-1$ and $+1$.
$r = +1$: Perfect positive linear correlation.
$r = -1$: Perfect negative linear correlation.
$r = 0$: No linear correlation (could still be related in a non-linear way!).
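The definition above can be turned into a few lines of Python. This is only a minimal sketch: the function name `pmcc` and the two small data sets are illustrative choices, not part of the course material. It uses the summary-statistic forms of $S_{xy}$, $S_{xx}$ and $S_{yy}$.

```python
from math import sqrt

def pmcc(xs, ys):
    """Return the product moment correlation coefficient r for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    # Summary-statistic forms of S_xy, S_xx and S_yy
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / sqrt(s_xx * s_yy)

# Illustrative data lying exactly on the line y = 2x + 1
r_up = pmcc([1, 2, 3, 4], [3, 5, 7, 9])
# The same x values with y decreasing exactly linearly
r_down = pmcc([1, 2, 3, 4], [9, 7, 5, 3])
print(round(r_up, 6), round(r_down, 6))
```

The two calls correspond to the perfect positive and perfect negative cases in the list above.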
Example: Hands-on — Calculating PMCC
Let’s practice calculating PMCC from scratch. Suppose we have $n = 5$ students, and we record their study hours ($x$) and test scores ($y$):

| Student | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Study Hours ($x$) | 2 | 3 | 5 | 6 | 9 |
| Test Score ($y$) | 50 | 60 | 70 | 80 | 90 |
Task: Calculate the Product Moment Correlation Coefficient (PMCC) for this data. (Try to do this yourself first before looking at the solution in the Appendix!)
Suppose we calculate $r = -0.4$ for our screen time study ($n = 10$).
This suggests a moderate negative relationship.
But wait! Even if there is zero relationship in reality, random sampling might just happen to give us 10 students where the ones who used their phones more slept worse.
Question: How strong does $r$ need to be before we can confidently say it’s not just luck?
Just like $\bar{X}$ estimates $\mu$, the sample correlation $r$ estimates the true population correlation coefficient, which we denote by the Greek letter $\rho$ (rho).
$\rho$: The true correlation for all students (unknown parameter).
$r$: The correlation for our sample of 10 students (calculated statistic).
To test for evidence of correlation, we set up hypotheses about $\rho$:
Null Hypothesis ($H_0$): $\rho = 0$ (No correlation in the population)
Alternative Hypothesis ($H_1$), one of:
$\rho \neq 0$ (Correlation exists, two-tailed)
$\rho > 0$ (Positive correlation, one-tailed)
$\rho < 0$ (Negative correlation, one-tailed)
We don’t calculate a $Z$ or $t$ score manually. Instead, we compare our calculated sample $|r|$ to a Critical Value from statistical tables.
Decision Rule
Look up the critical value for sample size $n$ and significance level $\alpha$ in the “Product Moment Correlation Coefficient” table.
If $|r| > \text{Critical Value}$, reject $H_0$. There is evidence of correlation. (For a one-tailed test, the sign of $r$ must also match the direction stated in $H_1$.)
If $|r| < \text{Critical Value}$, do not reject $H_0$. Insufficient evidence.
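The decision rule can be sketched as a small Python function. The function name `pmcc_test_decision` and its `alternative` argument are hypothetical, and the critical value is passed in rather than computed, because it comes from the tables. The value 0.6319 used below is an assumption: it is what a typical PMCC table lists for $n = 10$ in a two-tailed test at the 5% level, so check it against the table you are given.

```python
def pmcc_test_decision(r, critical_value, alternative="two-sided"):
    """Return True if H0: rho = 0 is rejected, given a table critical value.

    alternative: "two-sided" (H1: rho != 0), "greater" (H1: rho > 0),
                 or "less" (H1: rho < 0).
    """
    if alternative == "two-sided":
        return abs(r) > critical_value
    if alternative == "greater":
        # One-tailed: r must be large AND positive
        return r > critical_value
    if alternative == "less":
        # One-tailed: r must be large in size AND negative
        return r < -critical_value
    raise ValueError("unknown alternative")

# Screen-time example: r = -0.4 with n = 10, two-tailed test at the 5% level.
# 0.6319 is an assumed table value; look it up in your own tables.
reject = pmcc_test_decision(-0.4, 0.6319, "two-sided")
print(reject)  # False
```

Under that assumed critical value, $r = -0.4$ from only 10 students is not strong enough to reject $H_0$.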
Exercise: WST03/01/Jan22/3
A medical research team carried out an investigation into the metabolic rate, MR, of men aged between 30 years and 60 years.
A random sample of 10 men was taken from this age group.
The table below shows for each man his MR and his body mass index, BMI.
| Man | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| MR ($x$) | 6.24 | 5.94 | 6.83 | 6.53 | 6.31 | 7.44 | 7.32 | 8.70 | 7.88 | 7.78 |
| BMI ($y$) | 19.6 | 19.2 | 23.6 | 21.4 | 20.2 | 20.8 | 22.9 | 25.5 | 23.3 | 25.1 |

[You may use $S_{xy} = 15.1608$, $S_{xx} = 6.90181$, $S_{yy} = 45.304$]
(a) Calculate the value of the product moment correlation coefficient between MR and BMI for these 10 men.
(b) Use your value of the product moment correlation coefficient to test, at the 5% significance level, whether or not there is evidence of a positive correlation between MR and BMI. State your hypotheses clearly.
(c) State an assumption that must be made to carry out the test in part (b).
PMCC ($r$) is powerful, but it has two major weaknesses. Let’s explore them with hands-on examples.
Example: The “Zero PMCC” Trap — Perfect Non-linear Relationship
Suppose we record a company’s price deviation from optimal ($x$) and its resulting profit loss ($y$):

| Deviation ($x$) | -2 | -1 | 0 | 1 | 2 |
|---|---|---|---|---|---|
| Loss ($y$) | 4 | 1 | 0 | 1 | 4 |

Clearly, $y = x^2$. There is a perfect deterministic relationship.
Task: Calculate the PMCC for this data. What do you get? Why? (Try it yourself, then check the Appendix!)
Example: The “Outlier” Illusion
Suppose we have 4 students whose hours ($x$) and scores ($y$) are:
$(1, 10), (1, 12), (2, 10), (2, 12)$
These points form a small rectangle. There is no correlation ($r = 0$).
Now, we add just one extreme outlier student who studied 20 hours and scored 90:
$(20, 90)$
Task: Calculate the PMCC for these 5 points. How much does the single outlier affect $r$? (Check the Appendix for the result.)
When should we NOT use Pearson’s $r$?
If the data is:
Non-linear (e.g., curves, exponential growth, but still strictly increasing/decreasing), OR
Not normal / Has Outliers (highly sensitive to extremes), OR
Ordinal (rankings like 1st, 2nd, 3rd, instead of measurements like height or weight),
then Pearson’s $r$ is not appropriate.
Solution: We use Ranks instead of raw values!
Spearman’s Rank Correlation Coefficient (denoted $r_s$) is simply the PMCC calculated on the ranks of the data rather than the data itself.
Rank the $x$ values from 1 to $n$ (e.g., $1 =$ smallest, $n =$ largest).
Rank the $y$ values from 1 to $n$.
Calculate Pearson’s $r$ using these ranks.
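The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names `ranks_no_ties`, `pmcc` and `spearman` are made up for this example, the data are hypothetical, and the ranking helper deliberately refuses tied values.

```python
from math import sqrt

def ranks_no_ties(values):
    """Rank distinct values from 1 (smallest) to n (largest)."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch assumes no tied values")
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

def spearman(xs, ys):
    """Spearman's r_s: the PMCC of the ranks of xs and ys."""
    return pmcc(ranks_no_ties(xs), ranks_no_ties(ys))

# Hypothetical data: y = x**3 is strictly increasing but not linear
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
r = pmcc(xs, ys)        # below 1, because the points are not on a line
r_s = spearman(xs, ys)  # 1, because the rank orders agree exactly
print(round(r, 3), round(r_s, 3))
```

Because ranking throws away the spacing between values, $r_s$ only measures whether the two variables move in the same order, not whether they lie on a straight line.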
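If there are no tied ranks (no two values are the same), algebra gives us a much simpler formula:
$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
where $d$ is the difference between the two ranks of each data pair.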
Challenge Exercise: Derivation
Challenge: Can you prove this formula?
Hint:
The ranks are the integers $1, 2, \ldots, n$.
The mean of the ranks is $\frac{n+1}{2}$.
The variance of the integers $1 \ldots n$ is $\frac{n^2-1}{12}$.
Start with the definition of PMCC and substitute these values!
If two or more values are identical (e.g., two students both score 85), we give them the average of the ranks they would have occupied.
Example: If the 3rd and 4th best scores are tied, both get rank $\frac{3+4}{2} = 3.5$. The next best score gets rank 5.
CRITICAL WARNING: Tied Ranks and the Formula
The shortcut formula $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$ was derived under the strict assumption that the ranks are exactly the integers $1, 2, \ldots, n$.
When you have tied ranks (like $3.5$), this assumption is broken.
Do NOT use the shortcut formula if there are tied ranks!
Correct Method: You must use the original PMCC formula $r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$ on the ranked values.
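To see the warning in action, the sketch below ranks a small hypothetical data set containing one tie and compares the shortcut formula with the PMCC of the ranks. The helper names (`average_ranks`, `pmcc`, `shortcut`) and the scores are invented for illustration only.

```python
from math import sqrt

def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(values)
    ranks = []
    for v in values:
        first = order.index(v) + 1   # first rank position occupied by v
        count = order.count(v)       # how many values are tied at v
        ranks.append(first + (count - 1) / 2)  # mean of first..first+count-1
    return ranks

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

def shortcut(rx, ry):
    """1 - 6*sum(d^2) / (n(n^2 - 1)); only valid when there are no ties."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores: two students share the x value 85
x = [70, 85, 85, 90, 60]
y = [65, 80, 75, 95, 50]
rx = average_ranks(x)   # the tied 85s both receive rank 3.5
ry = average_ranks(y)

exact = pmcc(rx, ry)       # correct method: PMCC on the ranked values
approx = shortcut(rx, ry)  # shortcut applied where it should not be
print(rx, ry)
print(round(exact, 4), round(approx, 4))
```

With a single tie in five points the two answers differ only slightly, but they do differ, and the gap generally grows as more ranks are tied.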
We can test for association using ranks too.
Null Hypothesis ($H_0$): $\rho_s = 0$ (No association).
Alternative Hypothesis ($H_1$): $\rho_s \neq 0$ (Association exists).
Method:
Look up the critical value in the Spearman’s Rank Correlation Coefficient table.
Exercise: 6691/01/May16/3
(a) Describe when you would use Spearman’s rank correlation coefficient rather than the product moment correlation coefficient to measure the strength of the relationship between two variables.
A shop sells sunglasses and ice cream. For one week in the summer the shopkeeper ranked the daily sales of ice cream and sunglasses. The ranks are shown in the table below.
| | Sun | Mon | Tues | Weds | Thurs | Fri | Sat |
|---|---|---|---|---|---|---|---|
| Ice cream | 6 | 4 | 7 | 5 | 3 | 2 | 1 |
| Sunglasses | 6 | 5 | 7 | 2 | 3 | 4 | 1 |
(b) Calculate Spearman’s rank correlation coefficient for these data.
(c) Test, at the 5% level of significance, whether or not there is a positive correlation between sales of ice cream and sales of sunglasses. State your hypotheses clearly.
(d) The shopkeeper calculates the product moment correlation coefficient from his raw data and finds $r = 0.65$. Using this new coefficient, test, at the 5% level of significance, whether or not there is a positive correlation between sales of ice cream and sales of sunglasses.
(e) Using your answers to part (c) and part (d), comment on the nature of the relationship between sales of sunglasses and sales of ice cream.
Exercise: WST03/01/May15/2
Nine dancers, Adilzhan (A), Bianca (B), Chantelle (C), Lee (L), Nikki (N), Ranjit (R), Sergei (S), Thuy (T) and Yana (Y), perform in a dancing competition.
Two judges rank each dancer according to how well they perform. The table below shows the rankings of each judge starting from the dancer with the strongest performance.
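| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Judge 1 | S | N | B | C | T | A | Y | R | L |
| Judge 2 | S | T | N | B | C | Y | L | A | R |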
(a) Calculate Spearman’s rank correlation coefficient for these data. (5)
(b) Stating your hypotheses clearly, test at the 1% level of significance, whether or not the two judges are generally in agreement. (4)
Exercise: 6691/01R/May14/1
A journalist is investigating factors which influence people when they buy a new car. One possible factor is fuel efficiency. The journalist randomly selects 8 car models. Each model’s annual sales and fuel efficiency, in km/litre, are shown in the table below.
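| Car model | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| Annual sales | 1800 | 5400 | 18100 | 7100 | 9300 | 4800 | 12200 | 10700 |
| Fuel efficiency (km/litre) | 5.2 | 18.6 | 14.8 | 13.2 | 18.3 | 11.9 | 16.5 | 17.7 |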
(a) Calculate Spearman’s rank correlation coefficient for these data.
(b) The journalist believes that car models with higher fuel efficiency will achieve higher sales. Stating your hypotheses clearly, test whether or not the data support the journalist’s belief. Use a 5% level of significance.
(c) State the assumption necessary for a product moment correlation coefficient to be valid in this case.
(d) The mean and median fuel efficiencies of the car models in the random sample are 14.5 km/litre and 15.65 km/litre respectively. Considering these statistics, as well as the distribution of the fuel efficiency data, state whether or not the data suggest that the assumption in part (c) might be true in this case. Give a reason for your answer. (No further calculations are required.)
Exercise: WST03/01/June22/1
The table below shows the number of televised tournaments won and the total number of tournaments won by the top 10 ranked darts players in 2020.
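| Player’s rank | Televised tournaments won | Total tournaments won |
|---|---|---|
| 1 | 55 | 135 |
| 2 | 7 | 33 |
| 3 | 5 | 17 |
| 4 | 2 | 14 |
| 5 | 4 | 9 |
| 6 | 2 | 5 |
| 7 | 9 | 36 |
| 8 | 0 | 15 |
| 9 | 3 | 3 |
| 10 | 0 | 13 |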
Michael did not want to calculate Spearman’s rank correlation coefficient between player’s rank and the rank in televised tournaments won because there would be tied ranks.
(a) Explain how Michael could have dealt with these tied ranks.
(b) Given that the largest number of total tournaments won is ranked number 1, calculate the value of Spearman’s rank correlation coefficient between player’s rank and the rank in total tournaments won.
(c) Stating your hypotheses and critical value clearly, test at the 5% level of significance, whether or not there is evidence of a positive correlation between player’s rank and the rank in total tournaments won for these darts players.
(d) Michael does not believe that there is a positive correlation between player’s rank and the rank in total number of tournaments won. Find the largest level of significance, that is given in the tables provided, which could be used to support Michael’s claim. You must state your critical value.
Solution to PMCC Calculation:
| $x$ | $y$ | $x^2$ | $y^2$ | $xy$ |
|---|---|---|---|---|
| 2 | 50 | 4 | 2500 | 100 |
| 3 | 60 | 9 | 3600 | 180 |
| 5 | 70 | 25 | 4900 | 350 |
| 6 | 80 | 36 | 6400 | 480 |
| 9 | 90 | 81 | 8100 | 810 |
| $\sum x = 25$ | $\sum y = 350$ | $\sum x^2 = 155$ | $\sum y^2 = 25500$ | $\sum xy = 1920$ |

$S_{xx} = 155 - \frac{25^2}{5} = 30$
$S_{yy} = 25500 - \frac{350^2}{5} = 1000$
$S_{xy} = 1920 - \frac{25 \times 350}{5} = 170$
$r = \frac{170}{\sqrt{30 \times 1000}} \approx 0.981$
Solution to The “Zero PMCC” Trap:
$\sum x = 0$, $\sum y = 10$, $\sum xy = 0$
Since $S_{xy} = 0 - \frac{0 \times 10}{5} = 0$, we get $r = 0$.
Solution to The “Outlier” Illusion:
$\sum x = 26$, $\sum y = 134$, $\sum xy = 1866$
$\sum x^2 = 410$, $\sum y^2 = 8588$
$S_{xx} = 410 - \frac{26^2}{5} = 274.8$
$S_{yy} = 8588 - \frac{134^2}{5} = 4996.8$
$S_{xy} = 1866 - \frac{26 \times 134}{5} = 1169.2$
$r = \frac{1169.2}{\sqrt{274.8 \times 4996.8}} \approx 0.998$
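A hand calculation like this is easy to get wrong, so it can help to recompute $r$ directly from the points. The sketch below is one way to do that in Python; the names `pmcc` and `r_of` are illustrative only, and the points are the ones given in the “Outlier” example.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

def r_of(points):
    """Split (x, y) pairs into two sequences and return their PMCC."""
    xs, ys = zip(*points)
    return pmcc(xs, ys)

# The four original students, then the same four plus the outlier (20, 90)
four_points = [(1, 10), (1, 12), (2, 10), (2, 12)]
with_outlier = four_points + [(20, 90)]

r_before = r_of(four_points)
r_after = r_of(with_outlier)
print(round(r_before, 3), round(r_after, 3))
```

Comparing `r_before` with `r_after` shows how a single extreme point can turn no correlation into an almost perfect one.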