Properties of LS

Let $(Y, X, U)$ be a random vector where $Y$ and $U$ take values in $\mathbf{R}$ and $X$ takes values in $\mathbf{R}^{k+1}$. Assume further that the first component of $X$ is a constant equal to one. Let $\beta \in \mathbf{R}^{k+1}$ be such that

$$Y=X^{\prime} \beta+U$$

We assume that our model satisfies the following conditions:

  1. $E[XU]=0$

  2. $E\left[X X^{\prime}\right]<\infty$

  3. There is no perfect collinearity in $X$

Denote the marginal distribution of $(Y, X)$ by $P$, and let $\left(Y_1, X_1\right), \ldots,\left(Y_n, X_n\right)$ be an i.i.d. sample of random vectors with distribution $P$.

The properties we will discuss are:

  • Bias

  • Gauss-Markov Theorem

  • Consistency

  • Asymptotic Normality

Bias

Unbiasedness

Under the stronger assumption that $E[U \mid X]=0$ (equivalently, $E[Y \mid X]=X^{\prime} \beta$), it follows that $E[\hat{\beta}]=\beta$.

Proof:

First note that $E[U \mid X]=0$ is stronger than the first assumption: by the law of iterated expectations, $E[XU]=E[E[XU \mid X]]=E[X\,E[U \mid X]]=0$, so $E[U \mid X]=0$ implies $E[XU]=0$, while the converse does not hold in general.
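To see that the converse can fail, here is a small numerical sketch (a hypothetical scalar example, assuming $X \sim N(0,1)$ and $U = X^2 - 1$): then $E[U]=0$ and $E[XU]=E[X^3]-E[X]=0$, yet $E[U \mid X]=X^2-1$ clearly depends on $X$.

```python
import numpy as np

# Hypothetical example: E[XU] = 0 holds, but E[U | X] = X^2 - 1 is not zero.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
u = x**2 - 1

print(np.mean(x * u))                # ~ 0:  E[XU] = 0
print(np.mean(u[np.abs(x) > 1.5]))   # clearly positive: E[U | X] depends on X
```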

Based on the OLS formula, we have that:

$$\hat{\beta}=\left(\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1}\left(\sum_{i=1}^n X_i Y_i\right)$$

Substituting $Y_i = X_i^{\prime}\beta + u_i$ into this formula, we obtain:

$$\begin{aligned} \hat{\beta} & =\left(\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \sum_{i=1}^n X_i\left(X_i^{\prime} \beta+u_i\right) \\ & =\left(\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \sum_{i=1}^n X_i X_i^{\prime} \beta+\left(\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \sum_{i=1}^n X_i u_i \\ & =\beta+\left(\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \sum_{i=1}^n X_i u_i \end{aligned}$$

Since $E[\hat{\beta}]=E\left[E\left[\hat{\beta} \mid X_1, \ldots, X_n\right]\right]$ by the law of iterated expectations, we first compute the conditional expectation:

$$\mathbb{E}\left[\hat{\beta} \mid X_1, \ldots, X_n\right]=\beta+\left(\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \sum_{i=1}^n X_i \,\mathbb{E}\left[u_i \mid X_1, \ldots, X_n\right]$$

Note that:

  • If the observations are i.i.d., then $\mathbb{E}\left[u_i \mid X_1, \ldots, X_n\right] = \mathbb{E}\left[u_i \mid X_i\right]$. Otherwise, this equality may not hold.

  • Either the assumption $E[U \mid X]=0$ (together with i.i.d. sampling) or the assumption $\mathbb{E}\left[u_i \mid X_1, \ldots, X_n\right]=0$ directly is sufficient to complete the proof.

Since $\mathbb{E}\left[u_i \mid X_1, \ldots, X_n\right] = \mathbb{E}\left[u_i \mid X_i\right]=0$, we obtain $\mathbb{E}\left[\hat{\beta} \mid X_1, \ldots, X_n\right]=\beta$.

Therefore:

$$E[\hat{\beta}]=E\left[E\left[\hat{\beta} \mid X_1, \ldots, X_n\right]\right] = E[\beta] = \beta$$

This completes the proof of unbiasedness.
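For illustration, here is a minimal simulation sketch in Python/NumPy (the true coefficients and sample sizes are made up) that draws many samples with $E[u \mid X]=0$ and averages the OLS estimates; the average should be close to the true $\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0])      # hypothetical true coefficients (intercept, slope)
n, reps = 100, 5_000

estimates = np.empty((reps, 2))
for r in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])    # first regressor is the constant
    u = rng.normal(size=n)                  # E[u | X] = 0 by construction
    y = X @ beta + u
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)   # OLS: (X'X)^{-1} X'y

print(estimates.mean(axis=0))    # close to [1.0, 2.0]
```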

Bias from an Omitted Variable

If $\mathbb{E}[\hat{\beta}] \neq \beta$, then $\hat{\beta}$ is biased, with $\text{bias}(\hat{\beta}) = \mathbb{E}[\hat{\beta}] - \beta$. A common source of bias is an omitted variable: a relevant regressor that is correlated with the included regressors but left out of the model.

  • If $\operatorname{bias}(\hat{\beta}) > 0$: $\hat{\beta}$ overestimates $\beta$ on average.

  • If $\operatorname{bias}(\hat{\beta}) < 0$: $\hat{\beta}$ underestimates $\beta$ on average.

For example, consider a regression of wage on its determinants:

$$\text{wage}=\beta_0+\beta_1 \,\text{edu}+u$$

We regress wage on the education level (edu). However, other factors that also affect wage, such as ability and motivation, are unlikely to be independent of education; if they are left in the error term, then $\operatorname{Cov}(\text{edu}, u) \neq 0$, which leads to a biased estimate of $\beta_1$.

To calculate the omitted variable bias, we can use the following steps:

  1. Consider the long regression: $Y=X_1^{\prime} \beta_1+X_2^{\prime} \beta_2+e$, where $\mathbb{E}\left[X_1 e\right]=0$ and $\mathbb{E}\left[X_2 e\right]=0$.

  2. Consider the short regression: $Y=X_1^{\prime} \gamma_1+u$, where $\mathbb{E}\left[X_1 u\right]=0$.

  3. By the population projection formula (the population analogue of OLS), we have:

$$\begin{aligned} \gamma_1 & =\mathbb{E}\left[X_1 X_1^{\prime}\right]^{-1} \mathbb{E}\left[X_1 Y\right] \\ & =\mathbb{E}\left[X_1 X_1^{\prime}\right]^{-1} \mathbb{E}\left[X_1\left(X_1^{\prime} \beta_1+X_2^{\prime} \beta_2+e\right)\right] \\ & =\beta_1+\left[\mathbb{E}\left[X_1 X_1^{\prime}\right]^{-1} \mathbb{E}\left[X_1 X_2^{\prime}\right] \beta_2\right]+\left[\mathbb{E}\left[X_1 X_1^{\prime}\right]^{-1} \mathbb{E}\left[X_1 e\right]\right] \end{aligned}$$

Since $\mathbb{E}\left[X_1 e\right]=0$, and writing $\Gamma_{12} = \mathbb{E}\left[X_1 X_1^{\prime}\right]^{-1} \mathbb{E}\left[X_1 X_2^{\prime}\right]$, we obtain:

$$\gamma_1 =\beta_1+\Gamma_{12} \beta_2$$

Note that:

  • $\Gamma_{12}$ is the coefficient matrix from the linear projection of $X_2$ on $X_1$, i.e., $X_2^{\prime} = X_1^{\prime} \Gamma_{12} + v^{\prime}$ with $\mathbb{E}\left[X_1 v^{\prime}\right]=0$. In practice it is hard to estimate, because the omitted variables $X_2$ are usually not observed.

  • In the wage example, $X_1$ stands for education and $X_2$ for ability or motivation. Based on real-world experience, we can plausibly argue that:

    • Higher ability/motivation tends to lead to a higher wage: $\beta_2 > 0$ is likely.

    • Higher ability/motivation tends to go together with more education: $\Gamma_{12} > 0$ is likely.

  • Therefore the omitted-variable bias $\Gamma_{12} \beta_2$ is likely positive, so the short regression very likely overestimates the effect of education on wage (see the simulation sketch below).
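Here is a minimal simulation sketch of this story (all coefficient values are made up): ability raises both education and wage, so the education coefficient from the short regression exceeds the true $\beta_1$ from the long regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

ability = rng.normal(size=n)                    # the omitted variable
edu = 12 + 2.0 * ability + rng.normal(size=n)   # ability raises education
wage = 5 + 1.0 * edu + 3.0 * ability + rng.normal(size=n)   # beta_1 = 1, beta_2 = 3

X_long = np.column_stack([np.ones(n), edu, ability])
X_short = np.column_stack([np.ones(n), edu])

b_long = np.linalg.lstsq(X_long, wage, rcond=None)[0]
b_short = np.linalg.lstsq(X_short, wage, rcond=None)[0]

print("long-regression edu coefficient: ", b_long[1])    # ~ 1.0
print("short-regression edu coefficient:", b_short[1])   # ~ 2.2 = 1.0 + bias
```

In this setup the projection coefficient of ability on education is $2/5 = 0.4$, so the bias is roughly $\Gamma_{12}\beta_2 = 0.4 \times 3 = 1.2$, matching the gap between the two printed coefficients.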

Gauss-Markov Theorem

Homoskedasticity and Heteroskedasticity

Suppose $E[U \mid X]=0$ and that $\operatorname{Var}[U \mid X]=\sigma^2$.

  • When $\operatorname{Var}[U \mid X]$ is constant (and therefore does not depend on $X$), we say that $U$ is homoskedastic.

  • Otherwise, we say that $U$ is heteroskedastic.

Since $\operatorname{Var}[U \mid X]$ is not directly observed, there are two common ways to check for heteroskedasticity using the squared residuals $\hat{u}_i^2$ as a proxy:

  1. Regress the squared residuals $\hat{u}_i^2$ on $X_i$; if the coefficients on $X_i$ are significantly different from zero, this indicates heteroskedasticity (as sketched below).

  2. Plot the squared residuals against the regressors (or the fitted values) and check whether the plot shows any systematic trend.
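A minimal sketch of the first check, a Breusch-Pagan-style auxiliary regression of the squared residuals on the regressors, using simulated data with a made-up heteroskedastic error:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(scale=0.5 * x, size=n)            # Var[u | x] grows with x: heteroskedastic
y = X @ np.array([1.0, 2.0]) + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS fit
resid_sq = (y - X @ beta_hat) ** 2               # squared residuals

# Auxiliary regression of squared residuals on X
gamma_hat = np.linalg.solve(X.T @ X, X.T @ resid_sq)
print("auxiliary slope:", gamma_hat[1])          # clearly nonzero: evidence of heteroskedasticity
```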

Gauss-Markov Theorem

Gauss-Markov Theorem: under these assumptions, the OLS estimator is "best" in the sense that it has the "smallest" value of $\operatorname{Var}\left[\mathbb{A}^{\prime} \mathbb{Y} \mid X_1, \ldots, X_n\right]$ among all estimators of the form

$$\mathbb{A}^{\prime} \mathbb{Y}$$

for some matrix $\mathbb{A}=\mathbb{A}\left(X_1, \ldots, X_n\right)$ satisfying

$$E\left[\mathbb{A}^{\prime} \mathbb{Y} \mid X_1, \ldots, X_n\right]=\beta$$

Note that $\mathbb{A}^{\prime} \mathbb{Y}$ is linear in $\mathbb{Y}$, so the theorem only compares estimators within this linear, unbiased class and relies on homoskedasticity; under heteroskedasticity, a weighted least squares (WLS) estimator can achieve a smaller variance than OLS.

Here "smallest" is understood in terms of the partial order on variance matrices: $B \geq \tilde{B}$ if $B-\tilde{B}$ is positive semi-definite.

With this matrix ordering, and since under homoskedasticity $\operatorname{Var}\left[\mathbb{A}^{\prime} \mathbb{Y} \mid X_1, \ldots, X_n\right]=\sigma^2 \mathbb{A}^{\prime} \mathbb{A}$, the "best" linear unbiased estimator is obtained by finding the matrix $\mathbb{A}_0$ satisfying $\mathbb{A}_0^{\prime} \mathbb{X}=\mathbb{I}_{k+1}$ such that $\mathbb{A}_0^{\prime} \mathbb{A}_0$ is minimal in this order: for any other matrix $\mathbb{A}$ satisfying $\mathbb{A}^{\prime} \mathbb{X}=\mathbb{I}_{k+1}$, the difference $\mathbb{A}^{\prime} \mathbb{A}-\mathbb{A}_0^{\prime} \mathbb{A}_0$ is positive semi-definite.

This class of estimators includes the OLS estimator as a special case (by setting $\mathbb{A}^{\prime}=\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime}$). The property is sometimes expressed by saying that the OLS estimator is the "best linear unbiased estimator" (BLUE) of $\beta$ under these assumptions.

The Gauss-Markov theorem provides a lower bound on the variance matrix of unbiased linear estimators under the assumption of homoskedasticity. It says that no unbiased linear estimator can have a variance matrix smaller (in the positive semi-definite sense) than $\sigma^2\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}$.

We now prove the following two claims:

  • The estimator $\mathbb{A}^{\prime} \mathbb{Y}$, with $\mathbb{A}=\mathbb{A}\left(X_1, \ldots, X_n\right)$ satisfying $\mathbb{A}^{\prime} \mathbb{X}=\mathbb{I}$, is conditionally unbiased: $E\left[\mathbb{A}^{\prime} \mathbb{Y} \mid X_1, \ldots, X_n\right]=\beta$.

Proof:

Since $\mathbb{Y}=\mathbb{X} \beta+\mathbb{U}$, we have $\mathbb{A}^{\prime} \mathbb{Y}=\mathbb{A}^{\prime}(\mathbb{X} \beta+\mathbb{U})=\mathbb{A}^{\prime} \mathbb{X} \beta+\mathbb{A}^{\prime} \mathbb{U}$.

Hence $E\left[\mathbb{A}^{\prime} \mathbb{Y} \mid X_1, \ldots, X_n\right]=E\left[\mathbb{A}^{\prime} \mathbb{X} \beta+\mathbb{A}^{\prime} \mathbb{U} \mid X_1, \ldots, X_n\right]$.

Since $\mathbb{A}=\mathbb{A}\left(X_1, \ldots, X_n\right)$ is a function of $X_1, \ldots, X_n$ only, and $E\left[\mathbb{U} \mid X_1, \ldots, X_n\right]=0$ (from $E[U \mid X]=0$ and i.i.d. sampling), we have:

$$E\left[\mathbb{A}^{\prime} \mathbb{X} \beta \mid X_1, \ldots, X_n\right]=\mathbb{A}^{\prime} \mathbb{X} \beta \quad \text{and} \quad E\left[\mathbb{A}^{\prime} \mathbb{U} \mid X_1, \ldots, X_n\right]=0$$

So $E\left[\mathbb{A}^{\prime} \mathbb{Y} \mid X_1, \ldots, X_n\right]=\mathbb{A}^{\prime} \mathbb{X} \beta$; combining this with $\mathbb{A}^{\prime} \mathbb{X}=\mathbb{I}$ completes the proof that

$$E\left[\mathbb{A}^{\prime} \mathbb{Y} \mid X_1, \ldots, X_n\right]=\beta$$

  • Show that $\mathbb{A}^{\prime} \mathbb{A}-\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}$ is positive semi-definite for any $\mathbb{A}$ satisfying $\mathbb{A}^{\prime} \mathbb{X}=\mathbb{I}$.

Proof:

We need to show that $\mathbb{A}^{\prime} \mathbb{A}-\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \geq 0$ in the positive semi-definite sense.

Set $\mathbb{C}=\mathbb{A}-\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}$, so that $\mathbb{A}=\mathbb{C}+\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}$. Note that $\mathbb{X}^{\prime} \mathbb{C}=\mathbb{X}^{\prime} \mathbb{A}-\mathbb{I}=0$, since $\mathbb{A}^{\prime} \mathbb{X}=\mathbb{I}$. We calculate that

$$\begin{aligned} \mathbb{A}^{\prime} \mathbb{A}-\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} & =\left(\mathbb{C}+\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}\right)^{\prime}\left(\mathbb{C}+\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}\right)-\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \\ & =\mathbb{C}^{\prime} \mathbb{C}+\mathbb{C}^{\prime} \mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}+\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime} \mathbb{C} \\ & \quad +\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime} \mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}-\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \\ & =\mathbb{C}^{\prime} \mathbb{C} \\ & \geq 0 \end{aligned}$$

The final inequality holds because $\mathbb{C}^{\prime} \mathbb{C}$ is positive semi-definite, a standard property of this quadratic form: for any vector $a$, $a^{\prime} \mathbb{C}^{\prime} \mathbb{C} a=\|\mathbb{C} a\|^2 \geq 0$.
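As a numerical illustration, the sketch below (made-up design matrix and weights) builds an alternative matrix $\mathbb{A}$ satisfying $\mathbb{A}^{\prime} \mathbb{X}=\mathbb{I}$ from a weighted-least-squares form and checks that $\mathbb{A}^{\prime} \mathbb{A}-\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}$ has no negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 2

X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
XtX_inv = np.linalg.inv(X.T @ X)

# An alternative linear estimator A'Y with A' = (X'WX)^{-1} X'W for arbitrary
# positive weights W; it still satisfies A'X = I, hence it is unbiased.
W = np.diag(rng.uniform(0.5, 2.0, size=n))
A = W @ X @ np.linalg.inv(X.T @ W @ X)

print(np.allclose(A.T @ X, np.eye(k)))               # True: A'X = I
diff = A.T @ A - XtX_inv                             # should be positive semi-definite
print(np.linalg.eigvalsh(diff).min() >= -1e-10)      # True: no (numerically) negative eigenvalues
```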

Consistency

Under our three main assumptions, $\hat{\beta} \stackrel{P}{\rightarrow} \beta$ as $n \rightarrow \infty$.

Proof:

Based on the OLS formula, we have:

$$\hat{\beta}=\left(\frac{1}{n} \sum_{i=1}^n X_i X_i^{\prime}\right)^{-1}\left(\frac{1}{n} \sum_{i=1}^n X_i Y_i\right)$$

Substituting $Y_i = X_i^{\prime}\beta + u_i$, we obtain:

$$\begin{aligned} \hat{\beta} & =\left(\frac{1}{n}\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \frac{1}{n}\sum_{i=1}^n X_i\left(X_i^{\prime} \beta+u_i\right) \\ & =\left(\frac{1}{n}\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \frac{1}{n}\sum_{i=1}^n X_i X_i^{\prime} \beta+\left(\frac{1}{n}\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \frac{1}{n}\sum_{i=1}^n X_i u_i \\ & =\beta+\left(\frac{1}{n}\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n X_i u_i\right) \end{aligned}$$

Now denote $B_n = \left(\frac{1}{n}\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \left(\frac{1}{n}\sum_{i=1}^n X_i u_i\right)$. Since the observations are i.i.d., the law of large numbers gives:

  • $\frac{1}{n}\sum_{i=1}^n X_i X_i^{\prime} \stackrel{P}{\rightarrow} \mathbb{E}\left[X_i X_i^{\prime}\right]$, which is finite (assumption 2) and invertible (assumption 3), so by the continuous mapping theorem $\left(\frac{1}{n}\sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \stackrel{P}{\rightarrow} \mathbb{E}\left[X_i X_i^{\prime}\right]^{-1}$

  • $\frac{1}{n} \sum_{i=1}^n X_i u_i \stackrel{P}{\rightarrow} \mathbb{E}\left[X_i u_i\right]=0$ (assumption 1)

Therefore:

$$\begin{aligned} \operatorname{plim}_{n \rightarrow \infty} \hat{\beta} & =\beta+\operatorname{plim}_{n \rightarrow \infty} B_n \\ & =\beta+\mathbb{E}\left[X_i X_i^{\prime}\right]^{-1} \mathbb{E}\left[X_i u_i\right] \\ & =\beta+0=\beta \end{aligned}$$

This completes the proof of consistency.
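A minimal simulation sketch of consistency (hypothetical data-generating process and sample sizes): the OLS estimate moves toward the true $\beta$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)
beta = np.array([1.0, 2.0])       # hypothetical true coefficients

for n in (100, 10_000, 1_000_000):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta + rng.standard_t(df=5, size=n)     # E[Xu] = 0, non-normal errors
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(n, np.abs(beta_hat - beta).max())         # maximum error shrinks toward 0
```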

Asymptotic Normality

We continue to assume that our model satisfies the three assumptions from the previous analysis:

  1. $E[XU]=0$

  2. $E\left[X X^{\prime}\right]<\infty$

  3. There is no perfect collinearity in $X$

Now we add a fourth assumption:

  4. $\operatorname{Var}[X U]=E\left[X X^{\prime} U^2\right]<\infty$

Then, as nn \rightarrow \infty, we have

$$\sqrt{n}(\hat{\beta}-\beta) \stackrel{d}{\rightarrow} N(0, \mathbb{V}) \text{ where } \mathbb{V}=\left(E\left[X X^{\prime}\right]\right)^{-1} E\left[X X^{\prime} U^2\right]\left(E\left[X X^{\prime}\right]\right)^{-1}$$

As we have calculated before,

$$\begin{aligned} \hat{\beta} & =\left(\frac{1}{n} \sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \frac{1}{n} \sum_{i=1}^n X_i Y_i \\ & =\left(\frac{1}{n} \sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \frac{1}{n} \sum_{i=1}^n X_i\left(X_i^{\prime} \beta+U_i\right) \\ & =\beta+\left(\frac{1}{n} \sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \frac{1}{n} \sum_{i=1}^n X_i U_i \end{aligned}$$

Therefore, we have

$$\hat{\beta}-\beta=\left(\frac{1}{n} \sum_{i=1}^n X_i X_i^{\prime}\right)^{-1} \frac{1}{n} \sum_{i=1}^n X_i U_i$$

Multiplying both sides by $\sqrt{n}$ gives

$$\sqrt{n}(\hat{\beta}-\beta)=\left(\frac{1}{n} \sum_{i=1}^n X_i X_i^{\prime}\right)^{-1}\left(\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i U_i\right)$$

We have already shown that $\frac{1}{n} \sum_{i=1}^n X_i X_i^{\prime} \stackrel{P}{\rightarrow} E\left[X X^{\prime}\right]$, so its inverse converges in probability to $\left(E\left[X X^{\prime}\right]\right)^{-1}$. By the central limit theorem (using the fourth assumption), $\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i U_i \stackrel{d}{\rightarrow} N\left(0, E\left[X X^{\prime} U^2\right]\right)$. Combining the two by Slutsky's theorem yields $\sqrt{n}(\hat{\beta}-\beta) \stackrel{d}{\rightarrow} N(0, \mathbb{V})$, which completes the proof.
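In practice, $\mathbb{V}$ is estimated by its sample analogue, the heteroskedasticity-robust "sandwich" estimator. Here is a minimal sketch with simulated data (the parameter values and the names Q_inv, meat, and V_hat are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000
beta = np.array([1.0, 2.0])

x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(scale=1 + np.abs(x), size=n)       # heteroskedastic errors with E[Xu] = 0
y = X @ beta + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Sample analogue of V = (E[XX'])^{-1} E[XX'U^2] (E[XX'])^{-1}
Q_inv = np.linalg.inv(X.T @ X / n)
meat = (X * resid[:, None] ** 2).T @ X / n
V_hat = Q_inv @ meat @ Q_inv

print(np.sqrt(np.diag(V_hat) / n))   # approximate (robust) standard errors for beta_hat
```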
