Population Regression Model

Suppose $y$ and $x$ are two variables describing properties of a population, and one wants to "explain $y$ in terms of $x$". If we could observe everything, their relationship could be expressed as $y = f(x; z_1, z_2, \dots, z_p)$, where the $z_i$ are $p$ extra factors that influence $y$ in addition to $x$.

But it is not always possible to observe all of these factors. So we instead build a useful model of the relationship between $x$ and $y$ that holds "on average", i.e. $g(x) := \mathbb{E}[y \mid x]$.

Linear Regression

From the above definition, we can rewrite $y$ as

$$y = g(x) + e$$

If we assume that

  1. Linearity. $g(x)$ is linear, i.e., $g(x) = b_0 + b_1 x$.
  2. Exogeneity. $x$ is deterministic, or independent of $e$; in particular $\mathbb{E}[e \mid x] = 0$.
  3. Homoscedasticity. $\mathrm{Var}(e_i \mid \{x_j\}) = \sigma^2$ for all $i = 1, 2, \dots, n$.
  4. No serial correlation. $\mathbb{E}[e_i e_j \mid \{x_\ell\}] = 0$ for all $i \ne j$.
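To make these assumptions concrete, here is a minimal Python sketch (not from the source) that generates synthetic data satisfying all four; the values of $b_0$, $b_1$, and $\sigma$ are arbitrary illustrative choices.

```python
import numpy as np

# Synthetic data satisfying assumptions 1-4; parameter values are illustrative.
rng = np.random.default_rng(0)
n = 100
b0_true, b1_true, sigma = 1.0, 2.0, 0.5

x = rng.uniform(0, 10, size=n)    # regressor, drawn independently of e
e = rng.normal(0, sigma, size=n)  # i.i.d. N(0, sigma^2) errors: exogenous,
                                  # homoscedastic, no serial correlation
y = b0_true + b1_true * x + e     # y = g(x) + e with linear g
```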

Given a dataset $\{(x_i, y_i)\}_{i=1}^n$, we can find the optimal $b_0, b_1$ by minimizing the sum of squared errors (SSE):

$$\sum_i e_i^2 = \sum_i (y_i - b_0 - b_1 x_i)^2$$

The solution is

$$\begin{aligned} b_0 &= \bar{y} - b_1 \bar{x} \\ b_1 &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \end{aligned}$$
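As a sanity check, here is a minimal NumPy sketch of this closed-form solution, cross-checked against `np.polyfit`; the data-generating values are illustrative.

```python
import numpy as np

# Closed-form OLS for simple linear regression; x and y are 1-D arrays
# of equal length.
def ols_fit(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Illustrative data; cross-check against NumPy's least-squares fit.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=100)
b0, b1 = ols_fit(x, y)
assert np.allclose([b1, b0], np.polyfit(x, y, 1))
```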

Proposition

$$b_1 = \frac{s_{x,y}}{s_x^2}$$

where $s_{x,y}$ is the sample covariance of $x$ and $y$, and $s_x^2$ is the sample variance of $x$.
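This identity is easy to check numerically; a quick sketch on illustrative data:

```python
import numpy as np

# Check b1 = s_xy / s_x^2 (np.cov uses ddof=1, the sample covariance;
# pass ddof=1 to np.var to get the sample variance).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=100)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
assert np.isclose(b1, np.cov(x, y)[0, 1] / np.var(x, ddof=1))
```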

Moreover, let $\text{SST} = \sum_i (y_i - \bar{y})^2$, $\text{SSR} = \sum_i (\hat{y}_i - \bar{y})^2$, and $\text{SSE} = \sum_i (y_i - \hat{y}_i)^2$. Then we have

$$\text{SST} = \text{SSR} + \text{SSE}$$

Furthermore, we define the coefficient of determination $R^2 = \frac{\text{SSR}}{\text{SST}}$. Since $\hat{y}_i = b_0 + b_1 x_i$ and $\bar{y} = b_0 + b_1 \bar{x}$ (by the formula for $b_0$),

$$\text{SSR} = \sum_i \left( (b_0 + b_1 x_i) - (b_0 + b_1 \bar{x}) \right)^2 = b_1^2 \sum_i (x_i - \bar{x})^2$$

then we have

$$R^2 = \frac{b_1^2 \sum_i (x_i - \bar{x})^2}{\sum_i (y_i - \bar{y})^2} = \left( \frac{s_{x,y}}{s_x^2} \right)^2 \cdot \frac{s_x^2}{s_y^2} = r_{x,y}^2$$
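Both the decomposition and $R^2 = r_{x,y}^2$ can be verified numerically; a minimal sketch with illustrative data:

```python
import numpy as np

# Verify SST = SSR + SSE and R^2 = r_{x,y}^2 on illustrative data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=100)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

assert np.isclose(sst, ssr + sse)                           # decomposition
assert np.isclose(ssr / sst, np.corrcoef(x, y)[0, 1] ** 2)  # R^2 = r^2
```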

Hypothesis Testing with Linear Regression

Hypothesis testing usually focuses on testing $b_1$:

  1. $H_0: b_1 = b_1^\ast$ (or $b_1 \le b_1^\ast$) versus $H_1: b_1 > b_1^\ast$
  2. $H_0: b_1 = b_1^\ast$ (or $b_1 \ge b_1^\ast$) versus $H_1: b_1 < b_1^\ast$
  3. $H_0: b_1 = b_1^\ast$ versus $H_1: b_1 \ne b_1^\ast$

Under the above assumptions, $\hat b_1$ follows a normal distribution with mean $b_1$ and variance $\sigma^2 / \sum_i (x_i - \bar{x})^2$, so we have

$$\frac{\hat b_1 - b_1}{\sqrt{\sigma^2 / \sum_i (x_i - \bar{x})^2}} \sim \mathcal{N}(0, 1)$$

Since $\sigma^2$ is unknown, we estimate it from the residuals $e_i = y_i - \hat{y}_i$:

$$\hat{\sigma}^2 = \frac{\sum_i e_i^2}{n - 2}$$

The reason for $n - 2$ is that the residuals must satisfy the two normal equations, hence we lose 2 degrees of freedom. With $\hat{\sigma}^2$ in place of $\sigma^2$, the standardized statistic above follows a $t$-distribution with $n - 2$ degrees of freedom.
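Putting the pieces together, here is a sketch of the two-sided test for $b_1$ (using $b_1^\ast = 0$, the usual default, and illustrative data), cross-checked against `scipy.stats.linregress`:

```python
import numpy as np
from scipy import stats

# Two-sided test of H0: b1 = b1_star; b1_star = 0 tests for no linear
# relationship. Data values are illustrative.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=100)
n, b1_star = len(x), 0.0

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

sigma2_hat = np.sum(resid ** 2) / (n - 2)        # hat(sigma)^2
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))
t_stat = (b1 - b1_star) / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value

# Cross-check against SciPy's built-in simple regression (tests b1 = 0).
res = stats.linregress(x, y)
assert np.isclose(t_stat, res.slope / res.stderr)
assert np.isclose(p_value, res.pvalue)
```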

Unbiased Estimators

Since $\mathbb{E}[y_i \mid \{x_i\}] = b_0 + b_1 x_i$, supposing the OLS procedure gives $\hat b_0, \hat b_1$, we can show that

$$\mathbb{E}[\hat b_1] = b_1, \qquad \mathbb{E}[\hat b_0] = b_0$$

Also, the assumptions give $e_i \sim \mathcal{N}(0, \sigma^2)$. Since $\mathrm{Var}(y_i) = \mathrm{Var}(b_0 + b_1 x_i + e_i) = \mathrm{Var}(e_i) = \sigma^2$, it follows that

$$\begin{aligned} \mathrm{Var}(\hat b_1) &= \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2} \\ \mathrm{Var}(\hat b_0) &= \frac{\sum_i x_i^2}{n \sum_i (x_i - \bar{x})^2} \sigma^2 \\ \mathrm{Cov}(\hat b_0, \hat b_1) &= -\frac{\bar{x}}{\sum_i (x_i - \bar{x})^2} \sigma^2 \end{aligned}$$
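A Monte Carlo sketch can illustrate both claims: holding $x$ fixed and redrawing the errors, the empirical mean and variance of $\hat b_1$ should match the formulas above (all parameter values are illustrative).

```python
import numpy as np

# Monte Carlo check of E[b1_hat] = b1 and Var(b1_hat) = sigma^2 / Sxx,
# with the design x held fixed across replications.
rng = np.random.default_rng(0)
n, reps = 50, 20_000
b0_true, b1_true, sigma = 1.0, 2.0, 0.5
x = rng.uniform(0, 10, size=n)            # fixed design
sxx = np.sum((x - x.mean()) ** 2)

b1_hats = np.empty(reps)
for r in range(reps):
    y = b0_true + b1_true * x + rng.normal(0, sigma, size=n)
    b1_hats[r] = np.sum((x - x.mean()) * (y - y.mean())) / sxx

print(b1_hats.mean(), b1_true)            # ~2.0: unbiased
print(b1_hats.var(), sigma ** 2 / sxx)    # matches the variance formula
```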