Let $(Y, X, U)$ be a random vector where $Y$ and $U$ take values in $\mathbf{R}$ and $X$ takes values in $\mathbf{R}^{k+1}$. Assume further that $X = (X_0, X_1, \ldots, X_k)'$ with $X_0 = 1$, and let $\beta = (\beta_0, \beta_1, \ldots, \beta_k)' \in \mathbf{R}^{k+1}$ be such that

$$Y = X'\beta + U$$
We will make the following three assumptions for this model.
E[XU]=0: The justification of this first assumption varies depending on which of the three interpretations of the linear model we invoke. Since X contains the constant X0=1, this condition implies E[U]=0 and hence that the error term U is uncorrelated with each regressor in X.
For Linear Conditional Expectation: In this interpretation, the regression is viewed as modeling the conditional expectation of Y given X. That is, E[Y∣X] = X′β.
Here, U represents the deviation of Y from its conditional expectation given X. Since U is defined as Y − E[Y∣X], it satisfies E[U∣X] = 0 by construction. Consequently, E[XU] = 0 by the law of iterated expectations, because the error term U has no systematic relationship with X once we condition on X.
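In symbols, the step from E[U∣X] = 0 to E[XU] = 0 is just the law of iterated expectations:

$$E[XU] = E\big[\,E[XU \mid X]\,\big] = E\big[\,X\,E[U \mid X]\,\big] = E[X \cdot 0] = 0 .$$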
For Best Linear Approximation: In this interpretation, the regression function X′β is viewed as the best linear approximation (in mean-squared error) to the relationship between Y and X, regardless of what the true underlying relationship is.
The error term U in this context captures all factors that affect Y but are not captured by X′β. Under this interpretation, E[XU]=0 holds by construction: β is defined as the minimizer of E[(Y−X′b)²], and E[X(Y−X′β)]=0 is exactly the first-order condition of that problem. Intuitively, if X were correlated with U, there would be systematic information in X that could improve the approximation, contradicting the premise that X′β is the best linear approximation.
For Causal Model: In this interpretation, the regression is understood as modeling a causal relationship between X and Y. Here, β1 is interpreted as the causal effect of X1 on Y.
For β1 to represent the causal effect of X1 on Y, it is crucial that all other factors influencing Y are either controlled for or uncorrelated with X. The error term U includes all of these other factors. If U were correlated with X, there would be omitted variables that both affect Y and are related to X, which would bias the causal interpretation of β. Therefore, E[XU]=0 is essential for a valid causal interpretation, as it implies that there are no omitted confounders that are correlated with X.
E[XX′]<∞: this second assumption requires that every entry of E[XX′] is finite, i.e., that the second moments of X exist.
The matrix E[XX′] is sometimes called the (population) design matrix.
There is NO PERFECT COLLINEARITY in X; equivalently, the matrix E[XX′] is invertible.
Since E[XX′] is always positive semi-definite, invertibility of E[XX′] is equivalent to E[XX′] being positive definite. This ensures that there is a unique solution for β, so this assumption is also called the identification condition for β.
We now discuss invertibility in more detail:
Definition:
Lemma:
Proof:
Solving for Beta
Proof:
Estimating Beta using OLS
Here:
Proof:
Define Objective Function:
Then by FOC, we have that:
This completes the proof.
Note that we can also apply weights in the above formula to obtain a weighted least squares (WLS) estimator.
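As a sketch of what that looks like (the observation weights $w_i > 0$ are left unspecified here, and the data $(Y_i, X_i)$ are as introduced in the estimation section below), the WLS estimator solves

$$\min_{b \in \mathbf{R}^{k+1}} \frac{1}{n} \sum_{i=1}^{n} w_i (Y_i - X_i'b)^2 \;\Longrightarrow\; \hat{\beta}_{\mathrm{WLS}} = \Big(\sum_{i=1}^{n} w_i X_i X_i'\Big)^{-1} \sum_{i=1}^{n} w_i X_i Y_i .$$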
Matrix Notation
Define
In this notation,
and may be equivalently described as the solution to
The matrix
The matrix $\mathbb{P}$ is also symmetric. The matrix
Proof:
Sub-Vectors of Beta
Our preceding results imply that:
This is derived from the general linear model estimation method, specifically the Ordinary Least Squares (OLS) method, under the assumption that:
Existence and invertibility of matrix:
Result Based on BLP
as is the case, for example, under the second interpretation of the linear regression model (Best Linear Approximation) described before. Then we have that:
Proof:
This proves the above equation.
Interpretation
Proof:
Now, based on the interpretation shown above, we have that:
So, we obtain:
Estimating Sub-Vectors of Beta
Matrix Form
There is perfect collinearity (or exact multicollinearity) in X if there exists a nonzero $c \in \mathbf{R}^{k+1}$ such that $P\{c'X = 0\} = 1$ (here we treat X as a random vector), i.e., if one component of X can be expressed as a linear combination of the other components with probability one.
Let X be such that E[XX′]<∞. Then E[XX′] is invertible iff there is no perfect collinearity in X.
First, suppose there is perfect collinearity, i.e., there exists $c \neq 0$ such that $P\{c'X = 0\} = 1$. Then

$$c' E[XX'] c = E[c'XX'c] = E[(c'X)^2] = 0,$$

since $(c'X)^2 = 0$ with probability one. Hence $E[XX']$ is not positive definite, and therefore not invertible.

Conversely, note that by the transpose rule $(AB)' = B'A'$ we have $(c'X)' = X'c$, so for any $c \in \mathbf{R}^{k+1}$,

$$c' E[XX'] c = E[c'XX'c] = E[(c'X)(c'X)'] = E[(c'X)^2] \geq 0.$$

So $E[XX']$ is always positive semi-definite. If there is no perfect collinearity, then for every $c \neq 0$ we have $P\{c'X \neq 0\} > 0$, so $E[(c'X)^2] > 0$. Thus $E[XX']$ is positive definite and therefore invertible.
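A tiny numerical illustration of the lemma (simulated data; the setup and variable names are made up for this sketch): when one regressor is an exact linear function of another, the sample analogue of E[XX′] is rank deficient and hence not invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

x1 = rng.normal(size=n)
x2 = 2.0 * x1                                  # perfect collinearity: x2 = 2 * x1 with probability one
X = np.column_stack([np.ones(n), x1, x2])      # rows are X_i' = (1, x1_i, x2_i)

S = X.T @ X / n                                # sample analogue of E[XX']
print(np.linalg.matrix_rank(S))                # 2 < 3: rank deficient
print(np.linalg.det(S))                        # ~0: S is singular, so not invertible

# Dropping the redundant column restores positive definiteness / invertibility.
S_ok = X[:, :2].T @ X[:, :2] / n
print(np.linalg.det(S_ok))                     # strictly positive
```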
$E[XU] = 0$ implies that $E[X(Y - X'\beta)] = 0$, which is the first-order condition of the optimization problem $\min_{b \in \mathbf{R}^{k+1}} E[(Y - X'b)^2]$:

$$\frac{\partial}{\partial b} E[(Y - X'b)^2] = E\left[\frac{\partial}{\partial b}(Y - X'b)^2\right] = -2\,E[X(Y - X'b)] = 0$$
Since β is the solution to this optimization problem, we have $E[X(Y - X'\beta)] = 0$, i.e.,
$E[XY] - E[XX'\beta] = 0 \;\Rightarrow\; E[XY] = E[XX']\beta$
Since E[XX′]<∞ and E[XX′] is positive definite, it exists and is invertible. We therefore have:
β=[E[XX′]]−1E[XY]
Note that the β obtained here is the linear projection coefficient.
Also, E[XU]=0 implies E[U]=0. Indeed, E[XU]=0 means E[XjU]=0 for every j=0,1,…,k, and since the first component of X is the constant X0=1, the first of these equations reads E[U]=0. It then also follows that Cov(Xj,U) = E[XjU] − E[Xj]E[U] = 0 for every j, i.e., U is uncorrelated with each regressor (which is weaker than independence).
If E[XX′] is not invertible (not positive definite), there will be more than one solution to this system of equations. Any two solutions β and β~ will necessarily satisfy X′β = X′β~ with probability one.
Let (Y,X,U) be as described above and let P denote the marginal distribution of (Y,X). Let (Y1,X1),…,(Yn,Xn) be an i.i.d. sequence of random vectors with distribution P.
A natural estimator of β=(E[XX′])−1E[XY] is simply the sample analogue $\hat{\beta} = \big(\frac{1}{n}\sum_{1 \le i \le n} X_i X_i'\big)^{-1} \frac{1}{n}\sum_{1 \le i \le n} X_i Y_i$.
This estimator is called the ordinary least squares (OLS) estimator of β because it can also be derived as the solution to the following minimization problem:
$$\min_{b \in \mathbf{R}^{k+1}} \frac{1}{n} \sum_{1 \le i \le n} (Y_i - X_i'b)^2$$
$\hat{\beta}$ is a random variable, since it is a function of the sample $(X_i, Y_i)$, $i = 1, \ldots, n$.
By the Law of Large Numbers, $\frac{1}{n}\sum_{1 \le i \le n} X_i X_i' \overset{p}{\longrightarrow} E[XX']$ and $\frac{1}{n}\sum_{1 \le i \le n} X_i Y_i \overset{p}{\longrightarrow} E[XY]$. Combining these with the continuous mapping theorem (and the invertibility of E[XX′]) gives $\hat{\beta} \overset{p}{\longrightarrow} \beta$, i.e., the OLS estimator is consistent.
Note also that, when comparing estimators of β, an estimator is said to be more efficient the smaller its variance is.
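A minimal simulation sketch (the coefficient values and distributions below are arbitrary illustration choices): we draw i.i.d. data from a known linear model and compute $\hat{\beta} = \big(\sum_i X_iX_i'\big)^{-1}\sum_i X_iY_i$, which should approach the true β as n grows, consistent with the consistency argument above.

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true = np.array([1.0, 2.0, -0.5])          # (beta_0, beta_1, beta_2), arbitrary illustration values

def simulate_and_estimate(n):
    # Regressors: a constant plus two random covariates.
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(-1, 1, size=n)])
    U = rng.normal(size=n)                      # error with E[U] = 0, independent of X
    Y = X @ beta_true + U
    # OLS estimator: beta_hat = (X'X)^{-1} X'Y (solve is more stable than an explicit inverse).
    return np.linalg.solve(X.T @ X, X.T @ Y)

for n in (100, 10_000, 1_000_000):
    print(n, simulate_and_estimate(n))          # estimates approach beta_true as n grows
```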
$$Q(b) = \frac{1}{n} \sum_{i=1}^{n} (Y_i - X_i'b)^2$$
We take the derivative of the objective function w.r.t. b and get the FOC:
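Explicitly, the first-order condition and its solution are (assuming $\sum_i X_iX_i'$ is invertible):

$$\frac{\partial Q(b)}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} X_i (Y_i - X_i'b) = 0 \;\Longrightarrow\; \hat{\beta} = \Big(\sum_{i=1}^{n} X_i X_i'\Big)^{-1} \sum_{i=1}^{n} X_i Y_i .$$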
Stacking the observations, let $Y$ denote the $n \times 1$ vector with $i$-th entry $Y_i$ and let $X$ denote the $n \times (k+1)$ matrix with $i$-th row $X_i'$. In this notation $\hat{\beta} = (X'X)^{-1}X'Y$, and hence $X\hat{\beta}$ is the vector in the column space of $X$ that is closest (in terms of Euclidean distance) to $Y$.
Xβ^=X(X′X)−1X′Y
is the orthogonal projection of Y onto the ((k+1)-dimensional) column space of X.
P=X(X′X)−1X′
is known as the projection matrix. It projects a vector in $\mathbf{R}^n$ (such as Y) onto the column space of X.
Note that $P^2 = P$, which reflects the fact that projecting something that already lies in the column space of X onto the column space of X does nothing.
M=I−P
is also a projection matrix. It projects a vector onto the ((n−k−1)-dimensional) subspace orthogonal to the column space of X. Hence, MX = 0. Note that $MY = \hat{U}$, the vector of residuals. For this reason, M is sometimes called the "residual maker" matrix.
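As a quick check of these claims (using only $(X'X)^{-1}X'X = I$):

$$P^2 = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = P, \qquad MX = (I - P)X = X - X = 0,$$

and consequently $PY = X\hat{\beta}$ and $MY = Y - X\hat{\beta} = \hat{U}$.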
We assume that the expected value of the error term U is zero and that U is uncorrelated with X.
Partition X as $X = (X_1', X_2')'$ and partition β conformably as $\beta = (\beta_1', \beta_2')'$. Then $E[XX']$ is partitioned as

$$E[XX'] = \begin{pmatrix} E[X_1X_1'] & E[X_1X_2'] \\ E[X_2X_1'] & E[X_2X_2'] \end{pmatrix}.$$
Question: Can we derive formulae for β1 and β2 that admit some interesting interpretations?
Partial Effects: The coefficients in β1 and β2 can be interpreted as the partial effects of the variables in X1 and X2 on Y, controlling for the other variables. This means that β1 captures the impact of X1 on Y while holding X2 constant, and vice versa for β2.
Multicollinearity Consideration: In cases where there is multicollinearity (i.e., high correlation) between some variables in X1 and X2, partitioning the regression can help understand how each group of variables uniquely contributes to explaining the variation in Y.
Group-Specific Analysis: This partitioning is particularly useful when X1 and X2 represent distinct groups of variables, such as demographic factors versus economic indicators. It allows for the isolation of the effects of each group on Y.
Interpreting in Context: The interpretation of β1 and β2 will highly depend on the specific context of the study. For instance, in an economic growth model, X1 might include labor and capital variables, while X2 includes policy variables. The coefficients would then tell us about the distinct impacts of these groups on economic growth.
Statistical Significance: It is important to evaluate the statistical significance of the estimated coefficients in β1 and β2 to determine whether the observed relationships could be due to random chance.
Estimation Challenges: If X1 and X2 are highly correlated, the inverse of the matrix in the formula may be difficult to compute accurately, leading to estimation problems. This is a common issue in models with multicollinearity.
BLP: for a random variable A and a random vector B, denote by BLP(A∣B) the best linear predictor of A given B, i.e.
BLP(A∣B)≡B′(E[BB′])−1E[BA].
If A is a random vector, then define BLP(A∣B) component-wise.
And we also define A~=A−BLP(A∣B).
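As a small worked case of this definition (the labels $\pi_0, \pi_1$ are just names used in this sketch): if $B = (1, B_1)'$ consists of a constant and a scalar $B_1$, then

$$\mathrm{BLP}(A \mid B) = \pi_0 + \pi_1 B_1, \qquad \pi_1 = \frac{\mathrm{Cov}[B_1, A]}{\mathrm{Var}[B_1]}, \qquad \pi_0 = E[A] - \pi_1 E[B_1],$$

and if B consists only of the constant, then $\mathrm{BLP}(A \mid B) = E[A]$. In either case $\tilde{A} = A - \mathrm{BLP}(A \mid B)$ has mean zero and satisfies $E[B\tilde{A}] = 0$.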
Now, returning to our problem: Y=X1′β1+X2′β2+U
Define Y~=Y−BLP(Y∣X2) and X~1=X1−BLP(X1∣X2). Consider the linear regression
Y~=X~1′β~1+U~ where E[X~1U~]=0
β~1=(E[X~1X~1′])−1E[X~1Y~]=β1
First, we record the following results, which follow from the definitions of Y~ and X~1:
Write the linear projection of Y on X2 as $Y = X_2'\gamma + U_2$ with $E[X_2U_2] = 0$; then $U_2 = Y - X_2'\gamma = \tilde{Y}$.
Similarly, write the (component-wise) linear projection of X1 on X2 as $X_1 = \eta'X_2 + e$ with $E[X_2e'] = 0$; then $e = \tilde{X}_1$ and hence $E[X_2\tilde{X}_1'] = 0$.
These hold because $U_2$ and $e$ are, by construction, the projection errors from the linear projections on $X_2$.
For the formula $\tilde{\beta}_1 = (E[\tilde{X}_1\tilde{X}_1'])^{-1}E[\tilde{X}_1\tilde{Y}]$, we first work on the term $E[\tilde{X}_1\tilde{Y}]$ on the right-hand side:
$E[\tilde{X}_1\tilde{Y}] = E[\tilde{X}_1(Y - X_2'\gamma)] = E[\tilde{X}_1Y] - E[\tilde{X}_1X_2'\gamma]$
Since $X_2'\gamma$ is a scalar and $E[X_2\tilde{X}_1'] = 0$ (equivalently, $E[\tilde{X}_1X_2'] = 0$), we have $E[\tilde{X}_1X_2'\gamma] = E[\tilde{X}_1X_2']\gamma = 0$.
So we now have $E[\tilde{X}_1\tilde{Y}] = E[\tilde{X}_1Y]$. Now substitute the original equation $Y = X_1'\beta_1 + X_2'\beta_2 + U$ into this expression, noting that:
$E[X_1U] = 0$ and $E[X_2U] = 0$ (U is orthogonal to both X1 and X2), since $E[XU] = 0$ under the Best Linear Approximation interpretation.
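Carrying out the substitution (using $E[\tilde{X}_1X_2'] = 0$, $X_1 = \eta'X_2 + \tilde{X}_1$, and $E[\tilde{X}_1U] = E[X_1U] - \eta'E[X_2U] = 0$):

$$E[\tilde{X}_1Y] = E\big[\tilde{X}_1(X_1'\beta_1 + X_2'\beta_2 + U)\big] = E[\tilde{X}_1X_1']\,\beta_1 + \underbrace{E[\tilde{X}_1X_2']}_{=0}\,\beta_2 + \underbrace{E[\tilde{X}_1U]}_{=0} = E[\tilde{X}_1X_1']\,\beta_1,$$

and since $E[\tilde{X}_1X_1'] = E[\tilde{X}_1(\tilde{X}_1 + \eta'X_2)'] = E[\tilde{X}_1\tilde{X}_1'] + E[\tilde{X}_1X_2']\,\eta = E[\tilde{X}_1\tilde{X}_1']$, it follows that

$$\tilde{\beta}_1 = (E[\tilde{X}_1\tilde{X}_1'])^{-1}E[\tilde{X}_1\tilde{Y}] = (E[\tilde{X}_1\tilde{X}_1'])^{-1}E[\tilde{X}_1\tilde{X}_1']\,\beta_1 = \beta_1 .$$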
The above result can be interpreted as follows: β1 in the linear regression of Y on X1 and X2 is equal to the coefficient in a linear regression of the error term from a linear regression of Y on X2 on the error terms from linear regressions of the components of X1 on X2.
This formalizes the common description of β1 as the "effect" of X1 on Y after "controlling for" X2.
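A short numerical sketch of this result (simulated data; all coefficient values and variable names are made up for illustration): the coefficient on X1 from the full regression coincides with the coefficient from regressing the residualized Y on the residualized X1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Simulated data: X1 is correlated with X2, and both affect Y.
x2 = np.column_stack([np.ones(n), rng.normal(size=n)])   # X2 includes a constant
x1 = 0.8 * x2[:, 1] + rng.normal(size=n)                 # X1 correlated with X2
y = 1.0 + 2.0 * x1 - 1.5 * x2[:, 1] + rng.normal(size=n)

def ols(X, Y):
    """Least-squares coefficients (X'X)^{-1} X'Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# (1) Full regression of Y on (X1, X2): coefficient on X1 is the first entry.
beta_full = ols(np.column_stack([x1, x2]), y)

# (2) Residualized regression: residualize Y and X1 on X2, then regress residual on residual.
y_tilde = y - x2 @ ols(x2, y)
x1_tilde = x1 - x2 @ ols(x2, x1)
beta_resid = ols(x1_tilde.reshape(-1, 1), y_tilde)

print(beta_full[0], beta_resid[0])   # the two estimates of the X1 coefficient coincide
```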
Take X2 = constant and X1 ∈ R. Then Y~ = Y − E[Y] and X~1 = X1 − E[X1], and hence β1 = Cov[X1, Y]/Var[X1], as we now verify.
As X2 is a constant, we define X2=1 for simplicity. In this way, we have Y=β2+β1X1+U.
Therefore, we compute the best linear predictors of X1 and Y given the constant regressor X2:
Write $X_1 = C + e$. To obtain BLP(X1 ∣ X2) we solve $\min_C E[(X_1 - C)^2]$; the FOC $-2E[X_1 - C] = 0$ gives $C = E[X_1]$.
Write $Y = D + e$. To obtain BLP(Y ∣ X2) we solve $\min_D E[(Y - D)^2]$; the FOC $-2E[Y - D] = 0$ gives $D = E[Y]$.
In this way, we can get that Y~=Y−E[Y] and X~1=X1−E[X1].
Y~=β1X~1+ϵ
So, to solve for β1, we minimize $E[(\tilde{Y} - \beta_1\tilde{X}_1)^2]$ over $\beta_1$.
The FOC for this problem is $E[(\tilde{Y} - \beta_1\tilde{X}_1)\tilde{X}_1] = 0 \Rightarrow E[\tilde{Y}\tilde{X}_1] = \beta_1 E[\tilde{X}_1^2]$, so that

$$\beta_1 = \frac{E[\tilde{Y}\tilde{X}_1]}{E[\tilde{X}_1^2]} .$$

Substituting $\tilde{Y} = Y - E[Y]$ and $\tilde{X}_1 = X_1 - E[X_1]$ into this equation, we get

$$\beta_1 = \big(E[(X_1 - E[X_1])^2]\big)^{-1} E[(X_1 - E[X_1])(Y - E[Y])] = \frac{\mathrm{Cov}[X_1, Y]}{\mathrm{Var}[X_1]} ,$$

which proves the claim.
Now, consider the individual elements of the vector β:
If we use our formula to interpret the coefficient βj, we obtain:
$$\beta_j = \frac{\mathrm{Cov}[\tilde{X}_j, Y]}{\mathrm{Var}[\tilde{X}_j]},$$ where $\tilde{X}_j$ is the error from a linear regression of $X_j$ on all of the other regressors.
⇒ each coefficient in a multivariate regression is the bivariate slope coefficient for the corresponding regressor, after "partialling out" all the other variables in the model.
Partition X and β as before and consider
Y=X1′β1+X2′β2+U
We use β^=(β^1′,β^2′)′ to denote the LS estimator of β in a regression of Y on X.
We now derive estimation counterparts to the previous results about solving for sub-vectors of β. That is, β^1 can also be obtained from a "residualized" regression.
Since Y~ = X~1′β1 + ε, we can treat this residualized regression as an ordinary regression of the kind shown before and apply least squares, now using sample residuals, to estimate β1, as sketched below.
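Concretely, a sketch of the estimation counterpart: let $\tilde{Y}_i$ and $\tilde{X}_{1i}$ denote the residuals from OLS regressions of $Y_i$ and of (each component of) $X_{1i}$ on $X_{2i}$. Then

$$\hat{\beta}_1 = \Big(\sum_{i=1}^{n} \tilde{X}_{1i}\tilde{X}_{1i}'\Big)^{-1} \sum_{i=1}^{n} \tilde{X}_{1i}\tilde{Y}_i ,$$

which coincides with the X1 block of the full-regression estimator $\hat{\beta} = (X'X)^{-1}X'Y$; this is the sample version of the result above (often called the Frisch–Waugh–Lovell theorem).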