Linear Regression When E[XU]=0

Assumptions for Linear Regression when E[XU]=0

Let $(Y, X, U)$ be a random vector where $Y$ and $U$ take values in $\mathbf{R}$ and $X$ takes values in $\mathbf{R}^{k+1}$. Assume further that $X=\left(X_0, X_1, \ldots, X_k\right)^{\prime}$ with $X_0=1$, and let $\beta=\left(\beta_0, \beta_1, \ldots, \beta_k\right)^{\prime} \in \mathbf{R}^{k+1}$ be such that

$$Y=X^{\prime} \beta+U$$

We will make the following three assumptions for this model.

  1. $E[X U]=0$: The justification of this assumption varies depending on which of the three interpretations of the model we invoke. Since $X$ contains the constant $X_0=1$, the condition implies $E[U]=0$ and hence that each regressor is uncorrelated with the error term $U$.

    1. For Linear Conditional Expectation: In this interpretation, the regression is viewed as modeling the conditional expectation of $Y$ given $X$. That is, $E[Y \mid X]=X^{\prime} \beta$.

      Here, $U$ represents the deviation of $Y$ from its conditional expectation given $X$. Since $U$ is defined as $Y-E[Y \mid X]$, we have $E[U \mid X]=0$ by construction. By the law of iterated expectations, $E[X U]=E\left[X \, E[U \mid X]\right]=0$: the error term has no systematic relationship with $X$.

    2. For Best Linear Approximation: In this interpretation, the regression equation is considered the best linear approximation to the true relationship between $Y$ and $X$, regardless of the true underlying relationship.

      The error term $U$ in this context captures all factors that affect $Y$ but are not captured by $X$. For the linear approximation to be the "best", these omitted factors (captured in $U$) must be uncorrelated with $X$. If $X$ were correlated with $U$, there would be systematic information in $X$ that could improve the approximation, contradicting the premise that the model is the "best" linear approximation. Hence, $E[X U]=0$ is necessary to ensure that the linear model is indeed the best approximation under the given circumstances.

    3. For Causal Model: Here the regression is understood as modeling a causal relationship between $X$ and $Y$, and $\beta$ is interpreted as the causal effect of $X$ on $Y$.

      For $\beta$ to represent the causal effect of $X$ on $Y$, it is crucial that all other factors influencing $Y$ are either controlled for or uncorrelated with $X$. The error term $U$ collects all these other factors. If $U$ were correlated with $X$, there would be omitted variables that both affect $Y$ and are related to $X$, which would bias the estimation of the causal effect. Therefore, $E[X U]=0$ is essential for a valid causal interpretation: it implies that there are no omitted confounders correlated with $X$.

  2. $E\left[X X^{\prime}\right]<\infty$: this assumption ensures that the second moments of $X$ are finite, so that $E\left[X X^{\prime}\right]$ exists (entrywise).

    The matrix $E\left[X X^{\prime}\right]$ is the population second-moment matrix of the regressors, sometimes called the design matrix.

  3. There is NO PERFECT COLLINEARITY in $X$; equivalently, the matrix $E\left[X X^{\prime}\right]$ is invertible.

    Since $E\left[X X^{\prime}\right]$ is positive semi-definite, invertibility of $E\left[X X^{\prime}\right]$ is equivalent to $E\left[X X^{\prime}\right]$ being positive definite. This guarantees a unique solution for $\beta$, so this assumption is also called the identification condition for $\beta$.

More detail on invertibility:

Definition:

There is perfect collinearity (multicollinearity) in $X$ if there exists a nonzero $c \in \mathbf{R}^{k+1}$ such that $P\{c^{\prime} X=0\}=1$ (here $X$ is treated as a random vector), i.e., if one component of $X$ can be expressed as a linear combination of the others with probability one.

Lemma:

Let $X$ be such that $E\left[X X^{\prime}\right]<\infty$. Then $E\left[X X^{\prime}\right]$ is invertible if and only if there is no perfect collinearity in $X$.

Proof:

Note first that, by the matrix rule $(A B)^{\prime}=B^{\prime} A^{\prime}$, we have $\left(X^{\prime} c\right)^{\prime}=c^{\prime} X$, so for any $c \in \mathbf{R}^{k+1}$,

$$c^{\prime} E\left[X X^{\prime}\right] c=E\left[c^{\prime} X X^{\prime} c\right]=E\left[c^{\prime} X\left(c^{\prime} X\right)^{\prime}\right]=E\left[\left(c^{\prime} X\right)^2\right] \geqslant 0 .$$

Hence $E\left[X X^{\prime}\right]$ is always positive semi-definite, and it is invertible if and only if it is positive definite, i.e., if and only if $E\left[\left(c^{\prime} X\right)^2\right]>0$ for every $c \neq 0$.

If there is perfect collinearity, there exists $c \neq 0$ with $P\{c^{\prime} X=0\}=1$, so $c^{\prime} E\left[X X^{\prime}\right] c=E\left[\left(c^{\prime} X\right)^2\right]=0$. Then $E\left[X X^{\prime}\right]$ is not positive definite and therefore not invertible.

Conversely, if there is no perfect collinearity, then for every $c \neq 0$ we have $P\{c^{\prime} X=0\}<1$, so $E\left[\left(c^{\prime} X\right)^2\right]>0$. Thus $E\left[X X^{\prime}\right]$ is positive definite and therefore invertible. $\square$
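
As a quick numerical illustration of the lemma (a minimal sketch; the data-generating process and all variable names below are invented for this example), the sample analog of $E[XX^{\prime}]$ becomes singular when one regressor is an exact linear combination of the others:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Design with an intercept and two regressors; x2 = 3*x1 - 2 is an exact
# linear combination of the constant and x1 (perfect collinearity).
x1 = rng.normal(size=n)
x2 = 3.0 * x1 - 2.0
X = np.column_stack([np.ones(n), x1, x2])

M = X.T @ X / n                      # sample analog of E[X X']
print(np.linalg.matrix_rank(M))      # 2 < 3: the matrix is rank deficient
print(np.linalg.cond(M))             # enormous condition number (numerically singular)

# With c = (2, -3, 1)', c'X = 0 with probability one, so c'Mc = 0 up to rounding.
c = np.array([2.0, -3.0, 1.0])
print(c @ M @ c)                     # ~0: M is not positive definite
```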

Solving for Beta

  • $E[X U]=0$ implies that $E\left[X\left(Y-X^{\prime} \beta\right)\right]=0$, which is the first-order condition (FOC) of the optimization problem $\underset{b \in \mathbf{R}^{k+1}}{\operatorname{argmin}}\ E\left[\left(Y-X^{\prime} b\right)^2\right]$: $\frac{\partial E\left[\left(Y-X^{\prime} b\right)^2\right]}{\partial b}=E\left[\frac{\partial\left(Y-X^{\prime} b\right)^2}{\partial b}\right]=-2\, E\left[X\left(Y-X^{\prime} b\right)\right]=0$

  • Since $\beta$ solves this optimization problem, we have $E\left[X\left(Y-X^{\prime} \beta\right)\right]=0$.

  • $E[X Y]-E\left[X X^{\prime}\right] \beta=0 \Rightarrow E[X Y]=E\left[X X^{\prime}\right] \beta$

  • Since $E\left[X X^{\prime}\right]<\infty$ and $E\left[X X^{\prime}\right]$ is positive definite, it exists and is invertible. We have:

$$\beta=\left(E\left[X X^{\prime}\right]\right)^{-1} E[X Y]$$

Note that the $\beta$ obtained here is the linear projection coefficient (a numerical check of this formula appears after this list).

  • From $E[X U]=0$ we also have $E[U]=0$.

Proof:

Since the first component of $X$ is the constant $X_0=1$, the first component of the vector equation $E[X U]=0$ reads $E\left[X_0 U\right]=E[U]=0$.

It follows that, for $j=1, \ldots, k$,

$$\operatorname{Cov}\left(X_j, U\right)=E\left[X_j U\right]-E\left[X_j\right] E[U]=0-0=0 ,$$

so $U$ is uncorrelated with (though not necessarily independent of) each regressor.

  • If $E\left[X X^{\prime}\right]$ is not invertible (not positive definite), there is more than one solution to this system of equations. Any two solutions $\beta$ and $\tilde{\beta}$ necessarily satisfy $X^{\prime} \beta=X^{\prime} \tilde{\beta}$ with probability one.
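
The sketch below illustrates the closed form and the moment condition with simulated data (the data-generating process, coefficient values, and sample size are all assumptions chosen for the illustration): solving $E\left[X X^{\prime}\right] \beta=E[X Y]$ via sample analogs recovers the coefficients used to generate $Y$, and the orthogonality condition $E\left[X\left(Y-X^{\prime} \beta\right)\right]=0$ holds up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta_true = np.array([1.0, 2.0, -0.5])   # (beta_0, beta_1, beta_2), an assumption

# Regressors with an intercept; U is drawn independently, so E[XU] = 0 by construction.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(-1, 1, size=n)])
U = rng.normal(scale=0.7, size=n)
Y = X @ beta_true + U

Exx = X.T @ X / n                 # sample analog of E[X X']
Exy = X.T @ Y / n                 # sample analog of E[X Y]
beta = np.linalg.solve(Exx, Exy)  # beta = (E[X X'])^{-1} E[X Y]

print(beta)                       # close to beta_true
print(X.T @ (Y - X @ beta) / n)   # sample analog of E[X(Y - X'beta)]: ~0
```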

Estimating Beta using OLS

Let $(Y, X, U)$ be as described and let $P$ be the marginal distribution of $(Y, X)$. Let $\left(Y_1, X_1\right), \ldots,\left(Y_n, X_n\right)$ be an i.i.d. sequence of random vectors with distribution $P$.

A natural estimator of $\beta=\left(E\left[X X^{\prime}\right]\right)^{-1} E[X Y]$ is simply:

$$\hat{\beta}=\left(\frac{1}{n} \sum_{1 \leq i \leq n} X_i X_i^{\prime}\right)^{-1}\left(\frac{1}{n} \sum_{1 \leq i \leq n} X_i Y_i\right)=\left(\sum_{1 \leq i \leq n} X_i X_i^{\prime}\right)^{-1}\left(\sum_{1 \leq i \leq n} X_i Y_i\right)$$

This estimator is called the ordinary least squares (OLS) estimator of $\beta$ because it can also be derived as the solution to the following minimization problem:

$$\min _{b \in \mathbf{R}^{k+1}} \frac{1}{n} \sum_{1 \leq i \leq n}\left(Y_i-X_i^{\prime} b\right)^2$$

Here:

  • $\hat{\beta}$ is a random vector, since it is a function of the sample $\left(X_i, Y_i\right)$, $i=1, \ldots, n$.

  • By the Law of Large Numbers: $\frac{1}{n} \sum_{1 \leq i \leq n} X_i X_i^{\prime} \stackrel{P}{\longrightarrow} E\left[X X^{\prime}\right]$

  • By the Law of Large Numbers: $\frac{1}{n} \sum_{1 \leq i \leq n} X_i Y_i \stackrel{P}{\longrightarrow} E\left[X Y\right]$

  • Note that one estimator of $\beta$ is said to be more efficient than another when its variance is smaller (a small consistency simulation follows this list).
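
A minimal simulation of the consistency argument above (the data-generating process and the sample sizes are assumptions made for illustration): as $n$ grows, the sample moments converge and $\hat{\beta}$ tightens around the $\beta$ used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(2)
beta_true = np.array([0.5, 1.5])          # intercept and one slope (an assumption)

def beta_hat(n):
    """OLS estimate from one simulated sample of size n."""
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ beta_true + rng.normal(size=n)
    return np.linalg.solve(X.T @ X, X.T @ Y)   # solves the normal equations

for n in [100, 1_000, 10_000, 100_000]:
    print(n, beta_hat(n))                      # estimates approach beta_true
```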

Proof:

Define the objective function:

$$Q(b)=\frac{1}{n} \sum_{i=1}^n\left(Y_i-X_i^{\prime} b\right)^2$$

We take the derivative of the objective function w.r.t. $b$ and get the FOC:

$$\frac{\partial Q}{\partial b}=-\frac{2}{n} \sum_{i=1}^n X_i\left(Y_i-X_i^{\prime} b\right)=0$$

Then by the FOC, we have:

$$\begin{aligned} & -\frac{2}{n} \sum_{i=1}^n X_i\left(Y_i-X_i^{\prime} b\right)=0 \\ & \sum_{i=1}^n X_i Y_i-\sum_{i=1}^n X_i X_i^{\prime} b=0 \end{aligned}$$

Now, we rearrange to solve for $b$:

$$\begin{aligned} & \sum_{i=1}^n X_i X_i^{\prime} b=\sum_{i=1}^n X_i Y_i \\ & \left(\sum_{i=1}^n X_i X_i^{\prime}\right) b=\sum_{i=1}^n X_i Y_i \end{aligned}$$

Dividing by $n$ to match the formulation above:

$$\left(\frac{1}{n} \sum_{i=1}^n X_i X_i^{\prime}\right) b=\frac{1}{n} \sum_{i=1}^n X_i Y_i$$

Finally, solving for $b$ gives our $\hat{\beta}$:

$$\hat{\beta}=b=\left(\frac{1}{n} \sum_{i=1}^n X_i X_i^{\prime}\right)^{-1}\left(\frac{1}{n} \sum_{i=1}^n X_i Y_i\right)$$

This completes the proof.
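
As a sanity check on this derivation (a sketch with an invented design and coefficients), solving the normal equations $\left(\sum_i X_i X_i^{\prime}\right) b=\sum_i X_i Y_i$ numerically gives the same answer as a generic least-squares routine that minimizes the sum of squared residuals directly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # includes X_0 = 1
Y = X @ np.array([1.0, -2.0, 0.3]) + rng.normal(size=n)

# Solve the normal equations from the FOC above.
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)

# Compare with numpy's least-squares solver, which minimizes ||Y - Xb||^2 directly.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_normal_eq, beta_lstsq))   # True
```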

Note that we can also apply weights to the above problem to obtain a weighted least squares (WLS) estimator:

$$\min _{b \in \mathbf{R}^{k+1}} \frac{1}{n} \sum_{1 \leq i \leq n} W_i\left(Y_i-X_i^{\prime} b\right)^2$$

So OLS is a special case of WLS, namely the case in which all the $W_i$ are equal to $1$.
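
A short sketch of the weighted version (the weights here are arbitrary positive numbers chosen only for illustration): the WLS estimator solves the weighted normal equations $\left(\sum_i W_i X_i X_i^{\prime}\right) b=\sum_i W_i X_i Y_i$, and setting all $W_i=1$ reproduces the OLS estimate exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)

def wls(X, Y, w):
    """Solve (sum_i w_i x_i x_i') b = sum_i w_i x_i y_i."""
    XtW = X.T * w                       # scales observation i by w_i
    return np.linalg.solve(XtW @ X, XtW @ Y)

w = rng.uniform(0.5, 2.0, size=n)       # arbitrary positive weights (an assumption)
print(wls(X, Y, w))                     # WLS estimate
print(wls(X, Y, np.ones(n)))            # with W_i = 1 this is exactly OLS
print(np.linalg.solve(X.T @ X, X.T @ Y))
```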

Matrix Notation

Define

$$\begin{aligned} \mathbb{Y} & =\left(Y_1, \ldots, Y_n\right)^{\prime} \\ \mathbb{X} & =\left(X_1, \ldots, X_n\right)^{\prime} \\ \hat{\mathbb{Y}} & =\left(\hat{Y}_1, \ldots, \hat{Y}_n\right)^{\prime}=\mathbb{X} \hat{\beta} \\ \mathbb{U} & =\left(U_1, \ldots, U_n\right)^{\prime} \\ \hat{\mathbb{U}} & =\left(\hat{U}_1, \ldots, \hat{U}_n\right)^{\prime}=\mathbb{Y}-\hat{\mathbb{Y}}=\mathbb{Y}-\mathbb{X} \hat{\beta} . \end{aligned}$$

In this notation,

$$\hat{\beta}=\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime} \mathbb{Y}$$

and may be equivalently described as the solution to

$$\min _{b \in \mathbf{R}^{k+1}}\|\mathbb{Y}-\mathbb{X} b\|^2 .$$

Hence, $\mathbb{X} \hat{\beta}$ is the vector in the column space of $\mathbb{X}$ that is the closest (in terms of Euclidean distance) to $\mathbb{Y}$.

$$\mathbb{X} \hat{\beta}=\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime} \mathbb{Y}$$

is the orthogonal projection of $\mathbb{Y}$ onto the ($(k+1)$-dimensional) column space of $\mathbb{X}$.

The matrix

$$\mathbb{P}=\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime}$$

is known as the Projection Matrix. It projects a vector in $\mathbf{R}^n$ (such as $\mathbb{Y}$) onto the column space of $\mathbb{X}$.

Note that $\mathbb{P}^2=\mathbb{P}$, which reflects the fact that projecting something that already lies in the column space of $\mathbb{X}$ onto the column space of $\mathbb{X}$ does nothing.

The matrix $\mathbb{P}$ is also symmetric. The matrix

$$\mathbb{M}=\mathbb{I}-\mathbb{P}$$

is also a projection matrix. It projects a vector onto the ($(n-k-1)$-dimensional) subspace orthogonal to the column space of $\mathbb{X}$. Hence, $\mathbb{M} \mathbb{X}=0$. Note that $\mathbb{M} \mathbb{Y}=\hat{\mathbb{U}}$. For this reason, $\mathbb{M}$ is sometimes called the "residual maker" matrix.

Proof:

  • $\mathbb{P} \mathbb{X}=\mathbb{X}$:

$$\mathbb{P} \mathbb{X}=\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime} \mathbb{X}=\mathbb{X}\left[\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}\left(\mathbb{X}^{\prime} \mathbb{X}\right)\right]=\mathbb{X} \mathbb{I}=\mathbb{X}$$
  • $\mathbb{P}^2=\mathbb{P}$:

$$\mathbb{P}^2=\left(\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime}\right)\left(\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime}\right)=\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1}\left(\mathbb{X}^{\prime} \mathbb{X}\right)\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime}=\mathbb{X}\left(\mathbb{X}^{\prime} \mathbb{X}\right)^{-1} \mathbb{X}^{\prime}=\mathbb{P}$$
  • $\mathbb{P} \mathbb{M}=0$:

$$\mathbb{P} \mathbb{M}=\mathbb{P}(\mathbb{I}-\mathbb{P})=\mathbb{P}-\mathbb{P}^2=\mathbb{P}-\mathbb{P}=0$$
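
These properties are easy to verify numerically. The sketch below (with a small simulated design; all names and dimensions are invented for the example) checks idempotence, symmetry, $\mathbb{P}\mathbb{X}=\mathbb{X}$, $\mathbb{M}\mathbb{X}=0$, and $\mathbb{M}\mathbb{Y}=\hat{\mathbb{U}}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T    # projection onto the column space of X
M = np.eye(n) - P                       # "residual maker"

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residuals = Y - X @ beta_hat            # U_hat

print(np.allclose(P @ P, P))            # idempotent: P^2 = P
print(np.allclose(P, P.T))              # symmetric
print(np.allclose(P @ X, X))            # P X = X
print(np.allclose(M @ X, 0))            # M X = 0
print(np.allclose(M @ Y, residuals))    # M Y = U_hat
```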

Sub-Vectors of Beta

Let $(Y, X, U)$ be a random vector where $Y$ and $U$ take values in $\mathbf{R}$ and $X$ takes values in $\mathbf{R}^{k+1}$. Let $\beta=\left(\beta_0, \beta_1, \ldots, \beta_k\right)^{\prime} \in \mathbf{R}^{k+1}$ be such that

$$Y=X^{\prime} \beta+U$$

Partition $X$ into $X_1$ and $X_2$, where $X_1$ takes values in $\mathbf{R}^{k_1}$ and $X_2$ takes values in $\mathbf{R}^{k_2}$ (with $k_1+k_2=k+1$). Partition $\beta$ into $\beta_1$ and $\beta_2$ analogously. In this notation,

$$Y=X_1^{\prime} \beta_1+X_2^{\prime} \beta_2+U$$

Our preceding results imply that:

$$\left(\begin{array}{l} \beta_1 \\ \beta_2 \end{array}\right)=\left[E\left(\begin{array}{l} X_1 \\ X_2 \end{array}\right)\left(\begin{array}{ll} X_1^{\prime} & X_2^{\prime} \end{array}\right)\right]^{-1} E\left[\left(\begin{array}{l} X_1 \\ X_2 \end{array}\right) Y\right]$$
$$\left(\begin{array}{l} \beta_1 \\ \beta_2 \end{array}\right)=\left(\begin{array}{ll} E\left[X_1 X_1^{\prime}\right] & E\left[X_1 X_2^{\prime}\right] \\ E\left[X_2 X_1^{\prime}\right] & E\left[X_2 X_2^{\prime}\right] \end{array}\right)^{-1}\left(\begin{array}{l} E\left[X_1 Y\right] \\ E\left[X_2 Y\right] \end{array}\right)$$

This is the general formula $\beta=\left(E\left[X X^{\prime}\right]\right)^{-1} E[X Y]$ (the population analog of OLS) written in partitioned form, under the assumptions that (a numerical sketch follows this list):

  • The expected value of the error term $U$ is zero and $U$ is uncorrelated with $X$.

  • Existence and invertibility of the matrix:

$$\left(\begin{array}{ll} E\left[X_1 X_1^{\prime}\right] & E\left[X_1 X_2^{\prime}\right] \\ E\left[X_2 X_1^{\prime}\right] & E\left[X_2 X_2^{\prime}\right] \end{array}\right)$$
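
The block formula above is just the full moment equation written in partitioned form. The sketch below (with a made-up split of the regressors into $X_1$ and $X_2$ and invented coefficients) assembles the block matrix explicitly from its pieces and confirms it reproduces the full-regression solution.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
X1 = rng.normal(size=(n, 2))                             # first group of regressors (assumption)
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])   # second group, including the constant
X = np.column_stack([X1, X2])
Y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=n)

# Full-regression solution: (E[X X'])^{-1} E[X Y], estimated by sample moments.
beta_full = np.linalg.solve(X.T @ X / n, X.T @ Y / n)

# Same solution assembled from the blocks E[X1 X1'], E[X1 X2'], E[X2 X1'], E[X2 X2'].
A = np.block([[X1.T @ X1, X1.T @ X2],
              [X2.T @ X1, X2.T @ X2]]) / n
b = np.concatenate([X1.T @ Y, X2.T @ Y]) / n
beta_blocks = np.linalg.solve(A, b)

print(np.allclose(beta_full, beta_blocks))               # True
```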

Question: Can we derive formulae for $\beta_1$ and $\beta_2$ that admit some interesting interpretations?

  1. Partial Effects: The coefficients in $\beta_1$ and $\beta_2$ can be interpreted as the partial effects of the variables in $X_1$ and $X_2$ on $Y$, controlling for the other variables. This means that $\beta_1$ captures the impact of $X_1$ on $Y$ while holding $X_2$ constant, and vice versa for $\beta_2$.

  2. Multicollinearity Consideration: In cases where there is multicollinearity (i.e., high correlation) between some variables in $X_1$ and $X_2$, partitioning the regression can help us understand how each group of variables uniquely contributes to explaining the variation in $Y$.

  3. Group-Specific Analysis: This partitioning is particularly useful when $X_1$ and $X_2$ represent distinct groups of variables, such as demographic factors versus economic indicators. It allows for the isolation of the effects of each group on $Y$.

  4. Interpreting in Context: The interpretation of $\beta_1$ and $\beta_2$ depends heavily on the specific context of the study. For instance, in an economic growth model, $X_1$ might include labor and capital variables, while $X_2$ includes policy variables. The coefficients would then tell us about the distinct impacts of these groups on economic growth.

  5. Statistical Significance: It is important to evaluate the statistical significance of the estimated coefficients in $\beta_1$ and $\beta_2$ to determine whether the observed relationships reflect more than random chance.

  6. Estimation Challenges: If $X_1$ and $X_2$ are highly correlated, the inverse of the matrix in the formula may be difficult to compute accurately, leading to estimation problems. This is a common issue in models with multicollinearity.

Result Based on BLP

BLP: for a random variable $A$ and a random vector $B$, denote by $\operatorname{BLP}(A \mid B)$ the best linear predictor of $A$ given $B$, i.e.

$$\operatorname{BLP}(A \mid B) \equiv B^{\prime}\left(E\left[B B^{\prime}\right]\right)^{-1} E[B A] .$$

If $A$ is a random vector, then define $\operatorname{BLP}(A \mid B)$ component-wise.

We also define $\tilde{A}=A-\operatorname{BLP}(A \mid B)$.

Now, coming back to our problem: $Y=X_1^{\prime} \beta_1+X_2^{\prime} \beta_2+U$.

Define $\tilde{Y}=Y-\operatorname{BLP}\left(Y \mid X_2\right)$ and $\tilde{X}_1=X_1-\operatorname{BLP}\left(X_1 \mid X_2\right)$. Consider the linear regression

$$\tilde{Y}=\tilde{X}_1^{\prime} \tilde{\beta}_1+\tilde{U}, \quad \text{where } E\left[\tilde{X}_1 \tilde{U}\right]=0 ,$$

understood, for example, in the second interpretation of the linear regression model described above (Best Linear Approximation). Then we have:

$$\tilde{\beta}_1=\left(E\left[\tilde{X}_1 \tilde{X}_1^{\prime}\right]\right)^{-1} E\left[\tilde{X}_1 \tilde{Y}\right]=\beta_1$$

Proof:

First, the definitions of $\tilde{Y}$ and $\tilde{X}_1$ give the following:

  • Write $Y=X_2^{\prime} \gamma+U_2$ with $\gamma=\left(E\left[X_2 X_2^{\prime}\right]\right)^{-1} E\left[X_2 Y\right]$, so that $E\left[X_2 U_2\right]=0$ and $U_2=Y-\operatorname{BLP}\left(Y \mid X_2\right)=\tilde{Y}$.

  • Write $X_1=\eta^{\prime} X_2+e$ (the component-wise BLP of $X_1$ given $X_2$, with $\eta$ a $k_2 \times k_1$ coefficient matrix), so that $E\left[X_2 e^{\prime}\right]=0$, $e=\tilde{X}_1$, and hence $E\left[X_2 \tilde{X}_1^{\prime}\right]=0$.

These identities hold because $U_2$ and $e$ are, by construction, the projection errors from the corresponding best linear predictions.

For the formula $\tilde{\beta}_1=\left(E\left[\tilde{X}_1 \tilde{X}_1^{\prime}\right]\right)^{-1} E\left[\tilde{X}_1 \tilde{Y}\right]$, we first work on the term $E\left[\tilde{X}_1 \tilde{Y}\right]$:

$$E\left[\tilde{X}_1 \tilde{Y}\right]=E\left[\tilde{X}_1\left(Y-X_2^{\prime} \gamma\right)\right]=E\left[\tilde{X}_1 Y\right]-E\left[\tilde{X}_1 X_2^{\prime} \gamma\right]$$

Since $\gamma$ is a constant vector and $E\left[X_2 \tilde{X}_1^{\prime}\right]=0$ (equivalently, $E\left[\tilde{X}_1 X_2^{\prime}\right]=0$), we have $E\left[\tilde{X}_1 X_2^{\prime} \gamma\right]=E\left[\tilde{X}_1 X_2^{\prime}\right] \gamma=0$.

So $E\left[\tilde{X}_1 \tilde{Y}\right]=E\left[\tilde{X}_1 Y\right]$. Substituting the original equation $Y=X_1^{\prime} \beta_1+X_2^{\prime} \beta_2+U$ into this expression, we have:

  • $E\left[X_1 U\right]=0$ and $E\left[X_2 U\right]=0$, since under the Best Linear Approximation interpretation $E[X U]=0$.

  • $E\left[\tilde{X}_1 Y\right]=E\left[\tilde{X}_1\left(X_1^{\prime} \beta_1+X_2^{\prime} \beta_2+U\right)\right]=E\left[\tilde{X}_1 X_1^{\prime}\right] \beta_1+E\left[\tilde{X}_1 X_2^{\prime}\right] \beta_2+E\left[\tilde{X}_1 U\right]$

    • As shown above, $E\left[\tilde{X}_1 X_2^{\prime}\right]=0$, so $E\left[\tilde{X}_1 X_2^{\prime}\right] \beta_2=0$.

    • Since $\tilde{X}_1=X_1-\eta^{\prime} X_2$ is a linear function of $X_1$ and $X_2$, $E\left[X_1 U\right]=E\left[X_2 U\right]=0$ implies $E\left[\tilde{X}_1 U\right]=0$.

We now have $E\left[\tilde{X}_1 \tilde{Y}\right]=E\left[\tilde{X}_1 Y\right]=E\left[\tilde{X}_1 X_1^{\prime}\right] \beta_1$. For $E\left[\tilde{X}_1 X_1^{\prime}\right]$, we have:

$$E\left[\tilde{X}_1 X_1^{\prime}\right]=E\left[\tilde{X}_1\left(\eta^{\prime} X_2+e\right)^{\prime}\right]=E\left[\tilde{X}_1 X_2^{\prime}\right] \eta+E\left[\tilde{X}_1 e^{\prime}\right]$$

  • For $E\left[\tilde{X}_1 X_2^{\prime}\right] \eta$: since $E\left[\tilde{X}_1 X_2^{\prime}\right]=0$ and $\eta$ is a constant matrix, this term is $0$.

  • For $E\left[\tilde{X}_1 e^{\prime}\right]$: since $e=\tilde{X}_1$, we have $E\left[\tilde{X}_1 e^{\prime}\right]=E\left[\tilde{X}_1 \tilde{X}_1^{\prime}\right]$.

We therefore have $E\left[\tilde{X}_1 \tilde{Y}\right]=E\left[\tilde{X}_1 Y\right]=E\left[\tilde{X}_1 X_1^{\prime}\right] \beta_1=E\left[\tilde{X}_1 \tilde{X}_1^{\prime}\right] \beta_1$.

Substituting this into the formula for $\tilde{\beta}_1$, we get:

$$\tilde{\beta}_1=\left(E\left[\tilde{X}_1 \tilde{X}_1^{\prime}\right]\right)^{-1} E\left[\tilde{X}_1 \tilde{Y}\right]=\left(E\left[\tilde{X}_1 \tilde{X}_1^{\prime}\right]\right)^{-1} E\left[\tilde{X}_1 \tilde{X}_1^{\prime}\right] \beta_1=\mathbb{I} \beta_1=\beta_1$$

This proves the equation above.

Interpretation

The equation above can be interpreted as follows: $\beta_1$ in the linear regression of $Y$ on $X_1$ and $X_2$ is equal to the coefficient in a linear regression of the error term from a linear regression of $Y$ on $X_2$ on the error terms from linear regressions of the components of $X_1$ on $X_2$.

This formalizes the common description of $\beta_1$ as the "effect" of $X_1$ on $Y$ after "controlling for $X_2$."

Take $X_2=$ constant and $X_1 \in \mathbf{R}$. Then $\tilde{Y}=Y-E[Y]$ and $\tilde{X}_1=X_1-E\left[X_1\right]$. Hence,

$$\beta_1=\left(E\left[\left(X_1-E\left[X_1\right]\right)^2\right]\right)^{-1} E\left[\left(X_1-E\left[X_1\right]\right)(Y-E[Y])\right]=\frac{\operatorname{Cov}\left[X_1, Y\right]}{\operatorname{Var}\left[X_1\right]}$$

Proof:

Since $X_2$ is a constant, we set $X_2=1$ for simplicity. The model becomes $Y=\beta_2+\beta_1 X_1+U$.

We then project $X_1$ and $Y$ on $X_2$ (the constant):

  • $X_1=C+e$: to get the BLP of $X_1$, we minimize $\min _C E\left[\left(X_1-C\right)^2\right]$; the FOC $-2 E\left[X_1-C\right]=0$ gives $C=E\left[X_1\right]$.

  • $Y=D+e$: to get the BLP of $Y$, we minimize $\min _D E\left[(Y-D)^2\right]$; the FOC $-2 E[Y-D]=0$ gives $D=E[Y]$.

In this way, we get $\tilde{Y}=Y-E[Y]$ and $\tilde{X}_1=X_1-E\left[X_1\right]$.

Now, based on the result shown above, we have:

$$\tilde{Y}=\beta_1 \tilde{X}_1+\epsilon$$

So, to solve for $\beta_1$, we minimize $E\left[\left(\tilde{Y}-\beta_1 \tilde{X}_1\right)^2\right]$ over $\beta_1$.

The FOC for this problem is $E\left[\left(\tilde{Y}-\beta_1 \tilde{X}_1\right) \tilde{X}_1\right]=0 \Rightarrow E\left[\tilde{Y} \tilde{X}_1\right]=\beta_1 E\left[\tilde{X}_1^2\right]$

So we obtain

$$\beta_1=\frac{E\left[\tilde{Y} \tilde{X}_1\right]}{E\left[\tilde{X}_1^2\right]}$$

Substituting $\tilde{Y}=Y-E[Y]$ and $\tilde{X}_1=X_1-E\left[X_1\right]$ into this equation gives $\beta_1=\left(E\left[\left(X_1-E\left[X_1\right]\right)^2\right]\right)^{-1} E\left[\left(X_1-E\left[X_1\right]\right)(Y-E[Y])\right]=\frac{\operatorname{Cov}\left[X_1, Y\right]}{\operatorname{Var}\left[X_1\right]}$, which proves the claim.
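
A quick numerical check of this special case (simulated data; the intercept and slope values are arbitrary): the OLS slope from a regression of $Y$ on a constant and a single $X_1$ coincides with $\operatorname{Cov}\left[X_1, Y\right] / \operatorname{Var}\left[X_1\right]$ computed from the same sample.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
x1 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)        # arbitrary intercept and slope

X = np.column_stack([np.ones(n), x1])
slope_ols = np.linalg.solve(X.T @ X, X.T @ y)[1]

# Sample covariance over sample variance (both with the same 1/n normalization).
cov_over_var = np.cov(x1, y, ddof=0)[0, 1] / np.var(x1)
print(slope_ols, cov_over_var)                 # identical up to floating point
```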

Now we examine the individual elements of the vector $\beta$:

If we use our formula to interpret the coefficient $\beta_j$, we obtain:

$$\beta_j=\frac{\operatorname{Cov}\left[\tilde{X}_j, Y\right]}{\operatorname{Var}\left[\tilde{X}_j\right]} ,$$

where $\tilde{X}_j$ is the residual from a linear regression of $X_j$ on all the other regressors (including the constant).

$\Rightarrow$ each coefficient in a multivariate regression is the bivariate slope coefficient for the corresponding regressor, after "partialling out" all the other variables in the model.

Estimating Sub-Vectors of Beta

Partition $X$ and $\beta$ as before and consider

$$Y=X_1^{\prime} \beta_1+X_2^{\prime} \beta_2+U$$

We use $\hat{\beta}=\left(\hat{\beta}_1^{\prime}, \hat{\beta}_2^{\prime}\right)^{\prime}$ to denote the least squares (OLS) estimator of $\beta$ in a regression of $Y$ on $X$.

We now derive estimation counterparts to the previous results about solving for sub-vectors of $\beta$. That is, $\hat{\beta}_1$ can also be obtained from a "residualized" regression.

Since $\tilde{Y}=\tilde{X}_1^{\prime} \beta_1+\epsilon$, we can treat this residualized regression as an ordinary regression of the kind shown before, where $\tilde{Y}_i$ and $\tilde{X}_{1 i}$ now denote the sample residuals from least squares regressions of $Y_i$ and (each component of) $X_{1 i}$ on $X_{2 i}$. Applying OLS gives the estimator of $\beta_1$:

$$\hat{\beta}_1=\left(\frac{1}{n} \sum_{1 \leq i \leq n} \tilde{X}_{1 i} \tilde{X}_{1 i}^{\prime}\right)^{-1}\left(\frac{1}{n} \sum_{1 \leq i \leq n} \tilde{X}_{1 i} \tilde{Y}_i\right)=\left(\sum_{1 \leq i \leq n} \tilde{X}_{1 i} \tilde{X}_{1 i}^{\prime}\right)^{-1}\left(\sum_{1 \leq i \leq n} \tilde{X}_{1 i} \tilde{Y}_i\right)$$

Matrix Form

Let $\mathbb{X}_1=\left(X_{1,1}, \ldots, X_{1, n}\right)^{\prime}$ and $\mathbb{X}_2=\left(X_{2,1}, \ldots, X_{2, n}\right)^{\prime}$.

Denote by $\mathbb{P}_1$ the projection matrix onto the column space of $\mathbb{X}_1$ and by $\mathbb{P}_2$ the projection matrix onto the column space of $\mathbb{X}_2$.

Define the residual makers $\mathbb{M}_1=\mathbb{I}-\mathbb{P}_1$ and $\mathbb{M}_2=\mathbb{I}-\mathbb{P}_2$.

  • Apply $\mathbb{M}_2$ to $\mathbb{X}_1$ to get the residualized version of $\mathbb{X}_1$: $\tilde{\mathbb{X}}_1=\mathbb{M}_2 \mathbb{X}_1$

  • Apply $\mathbb{M}_2$ to $\mathbb{Y}$ to get the residualized version of $\mathbb{Y}$: $\tilde{\mathbb{Y}}=\mathbb{M}_2 \mathbb{Y}$

The regression model now becomes $\tilde{\mathbb{Y}}=\tilde{\mathbb{X}}_1 \beta_1+\epsilon$, where $\epsilon$ is the error term.

Then use Ordinary Least Squares (OLS) to estimate $\beta_1$. The formula for $\hat{\beta}_1$ is similar to before:

$$\hat{\beta}_1=\left(\tilde{\mathbb{X}}_1^{\prime} \tilde{\mathbb{X}}_1\right)^{-1} \tilde{\mathbb{X}}_1^{\prime} \tilde{\mathbb{Y}}$$
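
The sketch below carries out this matrix-form computation on simulated data (the split of the regressors and all coefficient values are invented for the example): residualize $\mathbb{Y}$ and $\mathbb{X}_1$ on $\mathbb{X}_2$ with the residual maker $\mathbb{M}_2$, run OLS on the residuals, and compare with the corresponding sub-vector of the full OLS estimate; the two agree exactly, in line with the Frisch–Waugh–Lovell-style result proved above.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1_000
X1 = rng.normal(size=(n, 2))                                  # first block (assumption)
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # second block, with constant
X = np.column_stack([X1, X2])
Y = X @ np.array([1.0, -0.5, 2.0, 0.3, 0.8]) + rng.normal(size=n)

# Full OLS; the first two entries correspond to beta_1_hat.
beta_full = np.linalg.solve(X.T @ X, X.T @ Y)

# Residual maker for X2, then OLS of Y_tilde on X1_tilde.
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
X1_tilde = M2 @ X1
Y_tilde = M2 @ Y
beta1_resid = np.linalg.solve(X1_tilde.T @ X1_tilde, X1_tilde.T @ Y_tilde)

print(np.allclose(beta_full[:2], beta1_resid))                # True
```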
