Population Regression Model
Suppose $y$ and $x$ are two variables describing properties of a population, and one wants to "explain $y$ in terms of $x$". If we could observe everything, their relationship could be expressed as $y = f(x; z_1, z_2, \dots, z_p)$, where the $z_i$ are $p$ extra factors, in addition to $x$, that influence $y$.
But it is not always possible to observe all of these factors. So we instead build a useful model focusing on the relationship between $x$ and $y$ that holds "on average", i.e. $g(x) := \mathbb{E}[y \mid x]$.
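A minimal sketch of this idea: on simulated data (the data-generating function, the unobserved factors `z1`, `z2`, and the bin grid below are hypothetical choices for illustration), $g(x) = \mathbb{E}[y \mid x]$ can be approximated by averaging $y$ within narrow bins of $x$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: y depends on x and two unobserved factors z1, z2.
n = 100_000
x = rng.uniform(0, 10, n)
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
y = 2.0 + 0.5 * x + 0.3 * z1 - 0.2 * z2   # f(x; z1, z2), chosen for illustration

# Approximate g(x) = E[y | x] by averaging y within narrow bins of x.
bins = np.linspace(0, 10, 21)
idx = np.digitize(x, bins)
g_hat = np.array([y[idx == k].mean() for k in range(1, len(bins))])
print(g_hat[:5])  # close to 2 + 0.5 * (bin midpoints), since E[z1] = E[z2] = 0
```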
Linear Regression
From the above definition, we can rewrite $y$ as
$$
y = g(x) + e
$$
If we assume that
- Linearity. $g(x)$ is linear, i.e. $g(x) = b_0 + b_1 x$.
- Exogeneity. $x$ is deterministic, or independent of $e$, i.e. $\mathbb{E}[e \mid x] = 0$.
- Homoscedasticity. $Var(e_i \mid \{x_j\}) = \sigma^2, \ \forall i = 1, 2, \dots, n$.
- No serial correlation. $\mathbb{E}[e_i e_j \mid \{x_\ell\}] = 0, \ \forall i \ne j$.
Given a dataset $\{(x_i, y_i)\}$, we can find the optimal $b_0, b_1$ by minimizing the sum of squared errors (SSE):
$$
\sum_i e_i^2 = \sum_i (y_i - b_0 - b_1 x_i)^2
$$
The solution is
$$
\begin{aligned}
b_0 &= \bar{y} - b_1 \bar{x} \\
b_1 &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\end{aligned}
$$
Equivalently, $b_1 = \frac{s_{x,y}}{s_x^2}$, the sample covariance of $x$ and $y$ divided by the sample variance of $x$.
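As a sanity check, here is a minimal sketch of these closed-form estimates on simulated data; the true parameter values (`b0 = 1.0`, `b1 = 2.0`) and the noise level are hypothetical choices, not anything from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data satisfying the assumptions: linear g(x), errors independent of x,
# constant variance, no serial correlation. True b0 = 1.0, b1 = 2.0 (hypothetical values).
n = 500
x = rng.uniform(0, 5, n)
e = rng.normal(0, 1.0, n)
y = 1.0 + 2.0 * x + e

# Closed-form OLS estimates.
x_bar, y_bar = x.mean(), y.mean()
b1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0_hat = y_bar - b1_hat * x_bar
print(b0_hat, b1_hat)  # should be close to 1.0 and 2.0

# Equivalent form: sample covariance over sample variance of x.
assert np.isclose(b1_hat, np.cov(x, y)[0, 1] / np.var(x, ddof=1))
```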
Besides, let $\text{SST} = \sum_i (y_i - \bar{y})^2$, $\text{SSR} = \sum_i (\hat{y}_i - \bar{y})^2$, and $\text{SSE} = \sum_i (y_i - \hat{y}_i)^2$; then we have
$$
\text{SST} = \text{SSR} + \text{SSE}
$$
Furthermore, we define the coefficient of determination $R^2 = \frac{\text{SSR}}{\text{SST}}$. Since
$$
\text{SSR} = \sum_i \left( (b_0 + b_1 x_i) - (b_0 + b_1 \bar{x}) \right)^2 = b_1^2 \sum_i (x_i - \bar{x})^2,
$$
it follows that
$$
R^2 = b_1^2 \, \frac{\sum_i (x_i - \bar{x})^2}{\sum_i (y_i - \bar{y})^2} = \left( \frac{s_{x,y}}{s_x^2} \right)^2 \cdot \frac{s_x^2}{s_y^2} = r_{x,y}^2
$$
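A short numerical check of the decomposition $\text{SST} = \text{SSR} + \text{SSE}$ and of $R^2 = r_{x,y}^2$, again on simulated data with hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simple simulated dataset (hypothetical parameters) and the OLS fit.
n = 200
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

assert np.isclose(sst, ssr + sse)            # SST = SSR + SSE
r_xy = np.corrcoef(x, y)[0, 1]
assert np.isclose(ssr / sst, r_xy ** 2)      # R^2 = r_{x,y}^2
print(ssr / sst)
```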
Hypothesis Testing with Linear Regression
Hypothesis testing usually focuses on testing $b_1$:
- $H_0: b_1 = b_1^\ast$ or $b_1 \le b_1^\ast$, while $H_1: b_1 > b_1^\ast$
- $H_0: b_1 = b_1^\ast$ or $b_1 \ge b_1^\ast$, while $H_1: b_1 < b_1^\ast$
- $H_0: b_1 = b_1^\ast$, while $H_1: b_1 \ne b_1^\ast$
Under these assumptions, $\hat b_1$ follows a normal distribution with mean $b_1$ and variance $\sigma^2 / \sum_i (x_i - \bar{x})^2$, so we have
$$
\frac{\hat b_1 - b_1}{\sqrt{\sigma^2 / \sum_i (x_i - \bar{x})^2}} \sim \mathcal{N}(0, 1)
$$
We use the following to estimate $\sigma^2$:
$$
\hat{\sigma}^2 = \frac{\sum_i e_i^2}{n - 2}
$$
The reason for $n - 2$ is that two normal equations need to be satisfied, hence we lose 2 degrees of freedom.
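Putting the pieces together, here is a sketch of a two-sided test of $H_0: b_1 = b_1^\ast$ using $\hat\sigma^2$. The simulated data and the null value $b_1^\ast = 0$ are hypothetical choices; replacing $\sigma^2$ with $\hat\sigma^2$ turns the statistic above into a $t$ statistic with $n - 2$ degrees of freedom, which is the standard result used below.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated data (hypothetical true values b0 = 1.0, b1 = 2.0, sigma = 1.0).
n = 100
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

sigma2_hat = np.sum(resid ** 2) / (n - 2)          # sigma_hat^2 with n - 2 degrees of freedom
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))

# Two-sided test of H0: b1 = b1_star (here b1_star = 0, a hypothetical null value).
b1_star = 0.0
t_stat = (b1 - b1_star) / se_b1
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)  # t distribution with n - 2 df
print(t_stat, p_value)
```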
Unbiased Estimators
Since $\mathbb{E}[y_i \mid \{x_i\}] = b_0 + b_1 x_i$, and supposing the OLS procedure gives $\hat b_0, \hat b_1$, we can show that
$$
\begin{aligned}
\mathbb{E}[\hat b_1] &= b_1 \\
\mathbb{E}[\hat b_0] &= b_0
\end{aligned}
$$
Also, the assumptions give that $e_i \sim \mathcal{N}(0, \sigma^2)$. Since $Var(y_i) = Var(b_0 + b_1 x_i + e_i) = Var(e_i) = \sigma^2$, we obtain
$$
\begin{aligned}
Var(\hat b_1) &= \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2} \\
Var(\hat b_0) &= \frac{\sum_i x_i^2}{n \sum_i (x_i - \bar{x})^2} \, \sigma^2 \\
Cov(\hat b_0, \hat b_1) &= -\frac{\bar{x}}{\sum_i (x_i - \bar{x})^2} \, \sigma^2
\end{aligned}
$$
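A Monte Carlo sketch that checks the unbiasedness claims and these variance/covariance formulas by redrawing the errors many times with the design $\{x_i\}$ held fixed; the true parameters and the number of replications are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed design, hypothetical true parameters b0 = 1.0, b1 = 2.0, sigma = 1.0.
n, sigma = 50, 1.0
x = rng.uniform(0, 5, n)
sxx = np.sum((x - x.mean()) ** 2)

b0_hats, b1_hats = [], []
for _ in range(20_000):                     # redraw errors with x held fixed
    y = 1.0 + 2.0 * x + rng.normal(0, sigma, n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    b0_hats.append(b0)
    b1_hats.append(b1)

b0_hats, b1_hats = np.array(b0_hats), np.array(b1_hats)
print(b1_hats.mean(), b0_hats.mean())                       # close to 2.0 and 1.0 (unbiased)
print(b1_hats.var(), sigma**2 / sxx)                        # Var(b1_hat)
print(b0_hats.var(), sigma**2 * np.sum(x**2) / (n * sxx))   # Var(b0_hat)
print(np.cov(b0_hats, b1_hats)[0, 1], -sigma**2 * x.mean() / sxx)  # Cov(b0_hat, b1_hat)
```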