S3 Chapter 4: Central Limit Theorem and Comparing Means

In Chapter 3, we learned how to evaluate estimators (Bias, Efficiency) and construct Confidence Intervals. However, nearly all our calculations relied on a crucial assumption: “Assume the population is Normally Distributed.” But real data is often skewed, discrete, or just weird. What do we do then?

In this chapter, we generalize those S2 approximations to any distribution.

  • Goal 1: Use CLT to perform inference on a single sample mean from any distribution.
  • Goal 2: Use CLT to compare two sample means from different distributions.

1. Computer Simulation and the Shape of Sample Means

Theorem: Central Limit Theorem

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables with

$$E[X_i] = \mu, \qquad \mathrm{Var}(X_i) = \sigma^2 < \infty.$$

Then as $n \to \infty$,

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \;\Longrightarrow\; N(0, 1),$$

that is, the distribution of $Z$ tends to the standard normal distribution.

For large $n$ this gives the useful approximation

$$\bar{X} \approx N\!\left(\mu, \frac{\sigma^2}{n}\right).$$
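The theorem can be checked by simulation. The sketch below is an illustration under assumptions, not part of the theorem: it uses an exponential population with rate 1 (so $\mu = 1$ and $\sigma = 1$), a sample size of $n = 40$, and a hypothetical helper named `sample_means`. It draws many samples and compares the behaviour of the sample means with what the CLT predicts.

```python
import random
import statistics


def sample_means(draw, n, reps, seed=0):
    """Return `reps` sample means, each computed from `n` independent draws."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]


# ASSUMPTION: a strongly skewed population, exponential with rate 1,
# which has mean mu = 1 and standard deviation sigma = 1.
def draw_exponential(rng):
    return rng.expovariate(1.0)


n = 40
means = sample_means(draw_exponential, n=n, reps=5000)

se = 1.0 / n ** 0.5                   # theoretical sigma / sqrt(n)
center = statistics.fmean(means)      # CLT: close to mu = 1
spread = statistics.stdev(means)      # CLT: close to sigma / sqrt(n)

# Share of standardized means within 1.96 of zero; a standard normal gives 0.95.
coverage = sum(abs((m - 1.0) / se) < 1.96 for m in means) / len(means)

print(f"mean of sample means: {center:.3f}")
print(f"sd of sample means:   {spread:.3f} (theory: {se:.3f})")
print(f"share within 1.96 SE: {coverage:.3f}")
```

Swapping `draw_exponential` for another population with finite variance, or changing `n`, shows how quickly the normal shape takes hold in each case.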

Example: Discrete General Distribution

Consider a highly volatile asset. Its annual return $R$ follows a discrete distribution:

  • Loss (-10%): Probability $0.2$
  • Break-even (0%): Probability $0.5$
  • Gain (+20%): Probability $0.3$

This distribution is discrete and not symmetric.

Task: Suppose you hold a portfolio of $n = 50$ such independent assets. What is the probability that your average return is greater than 5%?

Step 1: Calculate Population Parameters ($\mu, \sigma^2$)

First, we analyze the single asset $R$.

$$E[R] = (-10 \times 0.2) + (0 \times 0.5) + (20 \times 0.3) = -2 + 0 + 6 = 4\%$$

$$E[R^2] = ((-10)^2 \times 0.2) + (0^2 \times 0.5) + (20^2 \times 0.3) = (100 \times 0.2) + 0 + (400 \times 0.3) = 20 + 120 = 140$$

$$\text{Var}(R) = E[R^2] - (E[R])^2 = 140 - 4^2 = 124$$

So, the population has $\mu = 4$ and $\sigma^2 = 124$.

Step 2: Apply CLT to the Sample Mean $\bar{R}$

Since $n = 50$ is large, the average return $\bar{R}$ approximately follows:

$$\bar{R} \approx N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(4, \frac{124}{50}\right) = N(4, 2.48)$$

Standard Deviation of $\bar{R} = \sqrt{2.48} \approx 1.575$.

Step 3: Calculate Probability

We want $P(\bar{R} > 5)$. Standardize:

$$Z = \frac{5 - 4}{1.575} = \frac{1}{1.575} \approx 0.635$$

Using standard normal tables:

$$P(Z > 0.635) = 1 - P(Z < 0.635) \approx 1 - 0.737 = 0.263$$

Conclusion: Even though individual assets have a discrete, “jumpy” distribution, the portfolio average is approximately normal. There is a ~26.3% chance that the portfolio's average return beats 5%.
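The three steps can be reproduced in a few lines, and the normal approximation can be compared with a direct simulation of the portfolio. This is a sketch only: the variable names, the seed, and the number of repetitions are arbitrary choices made for illustration. Because $n = 50$ is finite and the average of discrete returns can only take values on a grid, the simulated frequency should land near the CLT figure rather than match it exactly.

```python
import random
from statistics import NormalDist

# Return distribution from the example, in percentage points.
outcomes = [-10, 0, 20]
probs = [0.2, 0.5, 0.3]
n = 50

# Step 1: population mean and variance of a single asset.
mu = sum(r * p for r, p in zip(outcomes, probs))
var = sum(r ** 2 * p for r, p in zip(outcomes, probs)) - mu ** 2

# Steps 2 and 3: CLT approximation for P(average return > 5).
clt_prob = 1 - NormalDist(mu, (var / n) ** 0.5).cdf(5)

# Monte Carlo check: simulate many 50-asset portfolios directly.
rng = random.Random(1)   # arbitrary seed, for reproducibility only
reps = 20_000
hits = 0
for _ in range(reps):
    total = sum(rng.choices(outcomes, weights=probs, k=n))
    if total > 5 * n:    # same event as "average return > 5"
        hits += 1
sim_prob = hits / reps

print(f"mu = {mu:.1f}, variance = {var:.1f}")
print(f"CLT approximation:   {clt_prob:.3f}")
print(f"simulated frequency: {sim_prob:.3f}")
```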

Sampling Distribution of the Mean (Non-Normal Population)

Under the CLT, when $n$ is large,

$$\bar{X} \approx N\!\left(\mu, \frac{\sigma^2}{n}\right).$$

If $\sigma$ is unknown we estimate it with the sample standard deviation $S$ and approximate

$$\bar{X} \approx N\!\left(\mu, \frac{S^2}{n}\right).$$

Definition: Estimated Standard Error of the Mean

For a large sample of size $n$, the estimated standard error of the sample mean is

$$\mathrm{SE}(\bar{X}) = \frac{S}{\sqrt{n}},$$

where $S$ is the sample standard deviation.

Large-Sample Confidence Interval for a Mean

Using the CLT, for large $n$ we have approximately

$$Z = \frac{\bar{X} - \mu}{S/\sqrt{n}} \approx N(0, 1).$$

Therefore a $100(1-\alpha)\%$ confidence interval for $\mu$ is

$$\bar{X} \pm z^* \cdot \frac{S}{\sqrt{n}},$$

where $z^*$ satisfies $P(-z^* < Z < z^*) = 1 - \alpha$ for $Z \sim N(0,1)$.

| Confidence level | $\alpha$ | $z^*$ |
| --- | --- | --- |
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.96 |
| 99% | 0.01 | 2.576 |
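A minimal sketch of this interval in Python, using only the standard library. The function name `large_sample_ci` and the simulated waiting-time data are hypothetical and chosen for illustration. `NormalDist().inv_cdf` supplies $z^*$, so the values in the table do not need to be typed in by hand.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev


def large_sample_ci(data, level=0.95):
    """Approximate CI for the mean: xbar +/- z* * S / sqrt(n)."""
    n = len(data)
    xbar = fmean(data)
    se = stdev(data) / sqrt(n)   # estimated standard error S / sqrt(n)
    # z* leaves (1 - level) / 2 in each tail of N(0, 1).
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return xbar - z_star * se, xbar + z_star * se


# HYPOTHETICAL data: 200 skewed waiting times (exponential, true mean 5).
rng = random.Random(42)
waits = [rng.expovariate(1 / 5) for _ in range(200)]

for level in (0.90, 0.95, 0.99):
    low, high = large_sample_ci(waits, level)
    print(f"{level:.0%} CI: ({low:.2f}, {high:.2f})")
```

A higher confidence level gives a wider interval around the same sample mean, which matches the increasing $z^*$ values in the table.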

To test

$$H_0: \mu = \mu_0 \quad\text{against}\quad H_1: \mu \ne \mu_0,$$

with a large sample and unknown $\sigma$, we use the test statistic

$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \approx N(0, 1) \quad\text{under } H_0.$$

We reject $H_0$ if $|Z|$ is too large, that is, if $Z$ falls inside the critical region determined by the chosen significance level $\alpha$. For a two-sided test this means $|Z| > z^*$, for example $|Z| > 1.96$ when $\alpha = 0.05$.
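The two-sided test can be sketched as follows. The function name `z_test_mean` and the toy data are hypothetical. The p-value is an equivalent way to state the decision rule: rejecting when $|Z|$ exceeds the critical value is the same as rejecting when the two-sided p-value is below the significance level.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev


def z_test_mean(data, mu0):
    """Large-sample two-sided z-test of H0: mu = mu0, with sigma unknown."""
    n = len(data)
    z = (fmean(data) - mu0) / (stdev(data) / sqrt(n))
    # Two-sided p-value: probability of a |Z| at least this large under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# HYPOTHETICAL toy sample of n = 100 observations with sample mean 11.
data = [10, 12] * 50

z, p = z_test_mean(data, mu0=10.8)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

print(f"Z = {z:.3f}, critical value = {z_crit:.3f}, p-value = {p:.4f}")
print("reject H0" if abs(z) > z_crit else "do not reject H0")
```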

Independent Samples and Difference of Means

Suppose we have two populations:

  • Population A with mean $\mu_A$ and variance $\sigma_A^2$
  • Population B with mean $\mu_B$ and variance $\sigma_B^2$

We take independent random samples:

$$X_1, \ldots, X_{n_A} \quad\text{from population A}, \qquad Y_1, \ldots, Y_{n_B} \quad\text{from population B},$$

and form the sample means $\bar{X}$ and $\bar{Y}$.

If both sample sizes are large, CLT gives

$$\bar{X} \approx N\!\left(\mu_A, \frac{\sigma_A^2}{n_A}\right), \qquad \bar{Y} \approx N\!\left(\mu_B, \frac{\sigma_B^2}{n_B}\right),$$

and, because the samples are independent,

$$\bar{X} - \bar{Y} \approx N\!\left(\mu_A - \mu_B, \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}\right).$$

When the population variances are unknown we estimate them by the sample variances $S_A^2$ and $S_B^2$ and use the estimated standard error

$$\mathrm{SE}(\bar{X} - \bar{Y}) = \sqrt{\frac{S_A^2}{n_A} + \frac{S_B^2}{n_B}}.$$
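The reason the two variances add, even though the means are subtracted, can be written out in one short derivation. It uses only the independence of the two samples, which makes the covariance term vanish:

```latex
\begin{aligned}
\mathrm{Var}(\bar{X} - \bar{Y})
  &= \mathrm{Var}(\bar{X}) + \mathrm{Var}(\bar{Y}) - 2\,\mathrm{Cov}(\bar{X}, \bar{Y}) \\
  &= \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B} - 2 \cdot 0
   = \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}.
\end{aligned}
```

If the samples were paired or otherwise dependent, the covariance term would not be zero and this standard error would no longer apply.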

Confidence Interval for a Difference of Means

A large-sample $100(1-\alpha)\%$ confidence interval for $\mu_A - \mu_B$ is

$$(\bar{X} - \bar{Y}) \pm z^* \cdot \mathrm{SE}(\bar{X} - \bar{Y}).$$

To test

$$H_0: \mu_A - \mu_B = \Delta_0$$

against one- or two-sided alternatives, we use

$$Z = \frac{(\bar{X} - \bar{Y}) - \Delta_0}{\mathrm{SE}(\bar{X} - \bar{Y})} \approx N(0,1) \quad\text{under } H_0$$

for large samples.
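Both the interval and the test statistic for two independent samples fit in one small function. This is a sketch under assumptions: the name `two_sample_z`, the returned dictionary layout, and the toy samples are hypothetical, and only the two-sided p-value is computed.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def two_sample_z(xs, ys, delta0=0.0, level=0.95):
    """Large-sample CI for mu_A - mu_B and z statistic for H0: mu_A - mu_B = delta0."""
    diff = fmean(xs) - fmean(ys)
    # Estimated standard error: sqrt(S_A^2 / n_A + S_B^2 / n_B).
    se = sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    z = (diff - delta0) / se
    return {
        "diff": diff,
        "se": se,
        "ci": (diff - z_star * se, diff + z_star * se),
        "z": z,
        "p_two_sided": 2 * (1 - NormalDist().cdf(abs(z))),
    }


# HYPOTHETICAL toy samples, 100 observations each.
xs = [10, 12] * 50
ys = [9, 11] * 50

result = two_sample_z(xs, ys, delta0=0.0)
low, high = result["ci"]
print(f"difference = {result['diff']:.3f}, SE = {result['se']:.4f}")
print(f"95% CI: ({low:.3f}, {high:.3f})")
print(f"Z = {result['z']:.3f}, two-sided p = {result['p_two_sided']:.2g}")
```

For a one-sided alternative, the same `z` value would be compared with a one-tailed critical value instead of using `p_two_sided`.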

Example: Skewed vs Uniform Distributions

An engineer compares an old server (A) with a new server (B).

  • Server A (Old): Latency is Skewed. Most requests are fast, but some hang. $\mu_A = 205 \text{ ms}, \quad \sigma_A = 50 \text{ ms}$
  • Server B (New): Latency is Uniformly Distributed between 150ms and 210ms. $X_B \sim U[150, 210]$

We collect $n_A = 100$ requests from A and $n_B = 100$ from B.

Question: What is the probability that the sample mean of A is at least 20ms slower (higher) than B? i.e., $P(\bar{X}_A - \bar{X}_B > 20)$.

Step 1: Determine Parameters for B

For Uniform $[a, b]$:

$$\mu_B = \frac{a+b}{2} = \frac{150+210}{2} = 180 \text{ ms}$$

$$\sigma_B^2 = \frac{(b-a)^2}{12} = \frac{(60)^2}{12} = \frac{3600}{12} = 300$$

Step 2: Distribution of the Difference

Mean Difference: $\mu_{diff} = \mu_A - \mu_B = 205 - 180 = 25 \text{ ms}$.

Variance of Difference:

$$\text{Var}(\bar{X}_A - \bar{X}_B) = \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B} = \frac{50^2}{100} + \frac{300}{100} = \frac{2500}{100} + 3 = 25 + 3 = 28$$

Standard Error = $\sqrt{28} \approx 5.29$ ms.

Step 3: Calculate Probability

We want $P(D > 20)$ where $D = \bar{X}_A - \bar{X}_B$ is approximately $N(25, 28)$.

$$Z = \frac{20 - 25}{5.29} = \frac{-5}{5.29} \approx -0.945$$

$$P(Z > -0.945) = P(Z < 0.945) \approx 0.8277$$

Insight: Despite A being skewed and B being uniform, we can easily calculate probabilities about their difference using the Normal distribution!
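The calculation can be reproduced and sanity-checked by simulation. The example states only the mean and standard deviation of Server A's latency, so the sketch below makes a loud assumption: it models A with a gamma distribution matched to those two moments. Any other skewed distribution with mean 205 and standard deviation 50 would serve the same purpose. Variable names, the seed, and the repetition count are arbitrary.

```python
import random
from statistics import NormalDist, fmean

mu_a, sigma_a, n_a = 205.0, 50.0, 100
low_b, high_b, n_b = 150.0, 210.0, 100

# Step 1: parameters of the uniform distribution for Server B.
mu_b = (low_b + high_b) / 2
var_b = (high_b - low_b) ** 2 / 12

# Step 2: mean and standard error of the difference of sample means.
mean_diff = mu_a - mu_b
se_diff = (sigma_a ** 2 / n_a + var_b / n_b) ** 0.5

# Step 3: CLT approximation for P(mean_A - mean_B > 20).
clt_prob = 1 - NormalDist(mean_diff, se_diff).cdf(20)

# ASSUMPTION: Server A latency ~ Gamma(shape, scale) with mean 205 and sd 50.
shape = (mu_a / sigma_a) ** 2     # mean^2 / variance
scale = sigma_a ** 2 / mu_a       # variance / mean

rng = random.Random(7)            # arbitrary seed, for reproducibility only
reps = 2000
hits = 0
for _ in range(reps):
    mean_a = fmean(rng.gammavariate(shape, scale) for _ in range(n_a))
    mean_b = fmean(rng.uniform(low_b, high_b) for _ in range(n_b))
    if mean_a - mean_b > 20:
        hits += 1
sim_prob = hits / reps

print(f"SE of difference:    {se_diff:.3f} ms")
print(f"CLT approximation:   {clt_prob:.4f}")
print(f"simulated frequency: {sim_prob:.4f}")
```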