Population Regression Model
Suppose $y$ and $x$ are two variables describing properties of a population, and one wants to "explain $y$ in terms of $x$". If we could observe everything, their relationship could be expressed as $y = f(x; z_1, z_2, \dots, z_p)$, where the $z_i$ are $p$ extra factors, in addition to $x$, that influence $y$.
But it is not always possible to observe all of these factors. So we instead build a useful model focusing on the relationship between $x$ and $y$ that holds "on average", i.e. $g(x) := \mathbb{E}[y \mid x]$.
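A minimal sketch of this idea: on simulated data (the data-generating function, the unobserved factors `z1`, `z2`, and the bin grid below are hypothetical choices for illustration), $g(x) = \mathbb{E}[y \mid x]$ can be approximated by averaging $y$ within narrow bins of $x$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: y depends on x and two unobserved factors z1, z2.
n = 100_000
x = rng.uniform(0, 10, n)
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
y = 2.0 + 0.5 * x + 0.3 * z1 - 0.2 * z2   # f(x; z1, z2), chosen for illustration

# Approximate g(x) = E[y | x] by averaging y within narrow bins of x.
bins = np.linspace(0, 10, 21)
idx = np.digitize(x, bins)
g_hat = np.array([y[idx == k].mean() for k in range(1, len(bins))])
print(g_hat[:5])  # close to 2 + 0.5 * (bin midpoints), since E[z1] = E[z2] = 0
```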
Linear Regression
From the above definition, we can rewrite $y$ as
$$
y = g(x) + e
$$
If we assume that
- Linearity. $g(x)$ is linear, i.e. $g(x) = b_0 + b_1 x$.
- Exogeneity. $x$ is deterministic, or independent of $e$, i.e. $\mathbb{E}[e \mid x] = 0$.
- Homoscedasticity. $Var(e_i \mid \{x_j\}) = \sigma^2, \ \forall i = 1, 2, \dots, n$.
- No serial correlation. $\mathbb{E}[e_i e_j \mid \{x_\ell\}] = 0, \ \forall i \ne j$.
Given a dataset $\{(x_i, y_i)\}$, we can find the optimal $b_0, b_1$ by minimizing the sum of squared errors (SSE):
$$
\sum_i e_i^2 = \sum_i (y_i - b_0 - b_1 x_i)^2
$$
The solution is
$$
\begin{aligned}
b_0 &= \bar{y} - b_1 \bar{x} \\
b_1 &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\end{aligned}
$$
Equivalently, $b_1 = \frac{s_{x,y}}{s_x^2}$, the sample covariance of $x$ and $y$ divided by the sample variance of $x$.
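As a sanity check, here is a minimal sketch of these closed-form estimates on simulated data; the true parameter values (`b0 = 1.0`, `b1 = 2.0`) and the noise level are hypothetical choices, not anything from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data satisfying the assumptions: linear g(x), errors independent of x,
# constant variance, no serial correlation. True b0 = 1.0, b1 = 2.0 (hypothetical values).
n = 500
x = rng.uniform(0, 5, n)
e = rng.normal(0, 1.0, n)
y = 1.0 + 2.0 * x + e

# Closed-form OLS estimates.
x_bar, y_bar = x.mean(), y.mean()
b1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0_hat = y_bar - b1_hat * x_bar
print(b0_hat, b1_hat)  # should be close to 1.0 and 2.0

# Equivalent form: sample covariance over sample variance of x.
assert np.isclose(b1_hat, np.cov(x, y)[0, 1] / np.var(x, ddof=1))
```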
Besides, let $\text{SST} = \sum_i (y_i - \bar{y})^2$, $\text{SSR} = \sum_i (\hat{y}_i - \bar{y})^2$, and $\text{SSE} = \sum_i (y_i - \hat{y}_i)^2$; then we have
$$
\text{SST} = \text{SSR} + \text{SSE}
$$
Furthermore, we define the coefficient of determination $R^2 = \frac{\text{SSR}}{\text{SST}}$. Since
$$
\text{SSR} = \sum_i \left( (b_0 + b_1 x_i) - (b_0 + b_1 \bar{x}) \right)^2 = b_1^2 \sum_i (x_i - \bar{x})^2,
$$
it follows that
$$
R^2 = b_1^2 \, \frac{\sum_i (x_i - \bar{x})^2}{\sum_i (y_i - \bar{y})^2} = \left( \frac{s_{x,y}}{s_x^2} \right)^2 \cdot \frac{s_x^2}{s_y^2} = r_{x,y}^2
$$
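A short numerical check of the decomposition $\text{SST} = \text{SSR} + \text{SSE}$ and of $R^2 = r_{x,y}^2$, again on simulated data with hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simple simulated dataset (hypothetical parameters) and the OLS fit.
n = 200
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

assert np.isclose(sst, ssr + sse)            # SST = SSR + SSE
r_xy = np.corrcoef(x, y)[0, 1]
assert np.isclose(ssr / sst, r_xy ** 2)      # R^2 = r_{x,y}^2
print(ssr / sst)
```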
Hypothesis Testing with Linear Regression
Hypothesis testing usually focuses on testing $b_1$:
- $H_0: b_1 = b_1^\ast$ or $b_1 \le b_1^\ast$, while $H_1: b_1 > b_1^\ast$
- $H_0: b_1 = b_1^\ast$ or $b_1 \ge b_1^\ast$, while $H_1: b_1 < b_1^\ast$
- $H_0: b_1 = b_1^\ast$, while $H_1: b_1 \ne b_1^\ast$
Under these assumptions, $\hat b_1$ follows a normal distribution with mean $b_1$ and variance $\sigma^2 / \sum_i (x_i - \bar{x})^2$, so we have
$$
\frac{\hat b_1 - b_1}{\sqrt{\sigma^2 / \sum_i (x_i - \bar{x})^2}} \sim \mathcal{N}(0, 1)
$$
We use the following to estimate $\sigma^2$:
$$
\hat{\sigma}^2 = \frac{\sum_i e_i^2}{n - 2}
$$
The reason for $n - 2$ is that two normal equations need to be satisfied, hence we lose 2 degrees of freedom.
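Putting the pieces together, here is a sketch of a two-sided test of $H_0: b_1 = b_1^\ast$ using $\hat\sigma^2$. The simulated data and the null value $b_1^\ast = 0$ are hypothetical choices; replacing $\sigma^2$ with $\hat\sigma^2$ turns the statistic above into a $t$ statistic with $n - 2$ degrees of freedom, which is the standard result used below.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated data (hypothetical true values b0 = 1.0, b1 = 2.0, sigma = 1.0).
n = 100
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

sigma2_hat = np.sum(resid ** 2) / (n - 2)          # sigma_hat^2 with n - 2 degrees of freedom
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))

# Two-sided test of H0: b1 = b1_star (here b1_star = 0, a hypothetical null value).
b1_star = 0.0
t_stat = (b1 - b1_star) / se_b1
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)  # t distribution with n - 2 df
print(t_stat, p_value)
```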
Unbiased Estimators
Since $\mathbb{E}[y_i \mid \{x_i\}] = b_0 + b_1 x_i$, and supposing the OLS procedure gives $\hat b_0, \hat b_1$, we can show that
$$
\begin{aligned}
\mathbb{E}[\hat b_1] &= b_1 \\
\mathbb{E}[\hat b_0] &= b_0
\end{aligned}
$$
Also, the assumptions give that $e_i \sim \mathcal{N}(0, \sigma^2)$. Since $Var(y_i) = Var(b_0 + b_1 x_i + e_i) = Var(e_i) = \sigma^2$, we obtain
$$
\begin{aligned}
Var(\hat b_1) &= \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2} \\
Var(\hat b_0) &= \frac{\sum_i x_i^2}{n \sum_i (x_i - \bar{x})^2} \, \sigma^2 \\
Cov(\hat b_0, \hat b_1) &= -\frac{\bar{x}}{\sum_i (x_i - \bar{x})^2} \, \sigma^2
\end{aligned}
$$
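A Monte Carlo sketch that checks the unbiasedness claims and these variance/covariance formulas by redrawing the errors many times with the design $\{x_i\}$ held fixed; the true parameters and the number of replications are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed design, hypothetical true parameters b0 = 1.0, b1 = 2.0, sigma = 1.0.
n, sigma = 50, 1.0
x = rng.uniform(0, 5, n)
sxx = np.sum((x - x.mean()) ** 2)

b0_hats, b1_hats = [], []
for _ in range(20_000):                     # redraw errors with x held fixed
    y = 1.0 + 2.0 * x + rng.normal(0, sigma, n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    b0_hats.append(b0)
    b1_hats.append(b1)

b0_hats, b1_hats = np.array(b0_hats), np.array(b1_hats)
print(b1_hats.mean(), b0_hats.mean())                       # close to 2.0 and 1.0 (unbiased)
print(b1_hats.var(), sigma**2 / sxx)                        # Var(b1_hat)
print(b0_hats.var(), sigma**2 * np.sum(x**2) / (n * sxx))   # Var(b0_hat)
print(np.cov(b0_hats, b1_hats)[0, 1], -sigma**2 * x.mean() / sxx)  # Cov(b0_hat, b1_hat)
```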