There are three ways to interpret Linear Regression:
Linear Conditional Expectation
Best Linear Approximation
Causal Model
We now discuss each interpretation in detail.
Linear Conditional Expectation
Suppose that:
E[Y∣X]=X′β
and define: U=Y−E[Y∣X]
Recall the Conditional Expectation Function (CEF): E[Y∣X = x] = m(x), which is the best predictor of Y given X in the mean-squared-error sense.
For E[Y∣X], we have E[Y∣X] = E[Y∣X1, X2, …, Xk] = m(X1, X2, …, Xk). But in practice, the form of this function m(⋅) is unknown to us.
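To make the "best predictor" property concrete, here is a minimal simulation sketch (assuming numpy is available; the functional form of m(⋅) and the sample size are made up purely for illustration). It compares the mean squared prediction error of the true CEF with that of two other predictors of Y:

```python
import numpy as np

# Hypothetical example: scalar X with a made-up nonlinear CEF m(x) = E[Y | X = x].
rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(size=n)

def m(x):
    return 1 + 2 * x + x**2               # the (usually unknown) CEF, chosen arbitrarily here

Y = m(X) + rng.normal(size=n)             # Y = m(X) + noise, with E[noise | X] = 0

# Mean squared prediction error of a few candidate predictors of Y given X
mse_cef    = np.mean((Y - m(X)) ** 2)          # predict with the true CEF
mse_linear = np.mean((Y - (1 + 2 * X)) ** 2)   # predict with some other function of X
mse_const  = np.mean((Y - Y.mean()) ** 2)      # ignore X entirely

print(mse_cef, mse_linear, mse_const)          # the CEF attains the smallest MSE
```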
This linear conditional expectation assumption has several implications:
E[U] = 0.
E[U] = E_X[E[U∣X]] = E_X[E[Y − X′β ∣ X]] = E_X[E[Y∣X] − X′β] = E_X[X′β − X′β] = 0.
E[XU] = 0.
E[XU] = E[X⋅(Y − X′β)] = E[XY] − E[XX′β] = E[E[XY∣X]] − E[XX′β] = E[X⋅E[Y∣X]] − E[XX′β].
Substituting E[Y∣X] = X′β into this, we have
E[XU] = E[XX′β] − E[XX′β] = 0.
Cov(X, U) = 0.
Cov(X, U) = E[XU] − E[X]⋅E[U] = 0.

However, the β obtained from this defining method does not necessarily have a causal interpretation: when one regressor Xj changes, the other determinants of Y (in particular U) may change as well. Therefore, we also cannot interpret βj as the ceteris paribus (i.e., holding X−j and U constant) effect of a one-unit change in Xj on Y; we need more information to check whether we can do that. We are only holding the other regressors Xk, k ≠ j, constant, but we are not holding U constant in this structure.

Since U = Y − E[Y∣X], we have Y = E[Y∣X] + U = m(X) + U, so that

∂Y/∂Xj = ∂m(X)/∂Xj + ∂U/∂Xj,

and ∂m(X)/∂Xj = βj only if m(X) is a linear function, e.g. m(X) = β0 + β1X1 + ⋯ + βkXk.
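Before turning to the second interpretation, here is a minimal numerical check of the three implications above (assuming numpy; the coefficients and distributions are made up purely for illustration). The conditional expectation is linear by construction, so the sample analogues of E[U], E[XU], and Cov(X, U) should all be close to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical design: X = (1, X1, X2)' and E[Y | X] = X'beta, with beta made up.
X1 = rng.normal(size=n)
X2 = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), X1, X2])
beta = np.array([0.5, -1.0, 2.0])

Y = X @ beta + rng.normal(size=n)        # the error has conditional mean zero given X
U = Y - X @ beta                         # U = Y - E[Y | X]

print(U.mean())                          # ~ 0,  sample analogue of E[U] = 0
print((X * U[:, None]).mean(axis=0))     # ~ 0,  sample analogue of E[XU] = 0 (componentwise)
print(np.cov(X1, U)[0, 1])               # ~ 0,  sample analogue of Cov(X1, U) = 0
```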
"Best" Linear Approximation
In general, the conditional expectation is probably NOT linear; we are simply using a linear function to approximate it.

Suppose that E[Y²] < ∞ and E[XX′] < ∞ (or, E[Xj²] < ∞ for 1 ≤ j ≤ k).

Under these assumptions, one may consider what the "best" linear approximation to the conditional expectation is, i.e., the best function of the form X′b for some choice of b ∈ R^{k+1}. To this end, consider the minimization problem for the approximation/prediction error:

min_{b ∈ R^{k+1}} E[(E[Y∣X] − X′b)²]

Minimize over b and denote by β the solution to this minimization problem. Then:

β is called the best linear predictor; it is the linear projection coefficient.
b is the generic coefficient vector.

But we still cannot interpret βj as the ceteris paribus (i.e., holding X−j and U constant) effect of a one-unit change in Xj on Y, because we still do not know anything about the error term; the partial derivative only holds the other regressors fixed.

Best Linear Predictor (BLP)

The best linear predictor (BLP) can also be defined as

β ∈ argmin_{b ∈ R^{k+1}} E[(Y − X′b)²].

This β is also a convenient way of summarizing the "best" linear predictor of Y given X.

Proof: Write Y − X′b = (Y − E[Y∣X]) + (E[Y∣X] − X′b). The cross term has expectation zero by the law of iterated expectations, so

E[(Y − X′b)²] = E[(Y − E[Y∣X])²] + E[(E[Y∣X] − X′b)²].

Only the part E[(E[Y∣X] − X′b)²] depends on b, so minimizing the whole expression over b is the same as minimizing this part. Therefore, this definition of the BLP is the same as the previous one.

So, we get:

β ∈ argmin_{b ∈ R^{k+1}} E[(E[Y∣X] − X′b)²]  and  β ∈ argmin_{b ∈ R^{k+1}} E[(Y − X′b)²].
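This equivalence can also be checked numerically. The sketch below (assuming numpy; the nonlinear CEF and sample size are made up) solves the sample analogues of both least-squares problems and obtains essentially the same coefficient vector:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hypothetical example with a known nonlinear CEF: E[Y | X1] = sin(X1) + X1**2.
X1 = rng.normal(size=n)
mX = np.sin(X1) + X1**2
Y = mX + rng.normal(size=n)                 # noise has conditional mean zero
X = np.column_stack([np.ones(n), X1])       # regressors (1, X1)

# (1) b minimizing the sample analogue of E[(E[Y|X] - X'b)^2]
b1, *_ = np.linalg.lstsq(X, mX, rcond=None)
# (2) b minimizing the sample analogue of E[(Y - X'b)^2]
b2, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(b1, b2)   # the two solutions agree up to simulation noise
```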
Summary: Two in One

Two interpretations arise from equivalent optimization problems:

β ∈ argmin_{b ∈ R^{k+1}} E[(E[Y∣X] − X′b)²]  and  β ∈ argmin_{b ∈ R^{k+1}} E[(Y − X′b)²]

Note that E[(Y − X′b)²] is convex (as a function of b), and this has the following implications. We can take the derivative and use the first-order condition (FOC) to solve for b and obtain β. However, in order to do this, we need to make more assumptions, and they will be introduced in detail later, in the estimation of β.

Therefore, we have

∂/∂b E[(Y − X′b)²] = E[∂/∂b (Y − X′b)²] = −2⋅E[X(Y − X′b)],

and setting this to zero at b = β gives the first-order condition E[X(Y − X′β)] = 0, i.e. E[XX′]β = E[XY], which we can solve for β. (Here we use that X′b = b′X, since both are scalars.)

We can change the order of the expectation and the partial derivative because of the dominated convergence theorem.

Lemma: Let X ∈ 𝒳 be a random variable and g : R × 𝒳 → R a function such that g(t, X) is integrable for all t and g is continuously differentiable w.r.t. t. Assume that there is a random variable Z such that |∂g(t, X)/∂t| ≤ Z a.s. for all t and E[Z] < ∞. Then

∂/∂t E[g(t, X)] = E[∂g(t, X)/∂t].

Proof: We have

(E[g(t + h, X)] − E[g(t, X)])/h = E[(g(t + h, X) − g(t, X))/h],

and by the mean value theorem |(g(t + h, X) − g(t, X))/h| ≤ Z almost surely for every h ≠ 0, and thus we can use the dominated convergence theorem to conclude

lim_{h→0} E[(g(t + h, X) − g(t, X))/h] = E[lim_{h→0} (g(t + h, X) − g(t, X))/h] = E[∂g(t, X)/∂t].

This completes the proof.
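As a sanity check on the FOC, the following sketch (assuming numpy and scipy are available; the data-generating process is made up) solves the sample analogue of E[X(Y − X′β)] = 0 and compares it with a generic numerical minimizer of the empirical squared loss:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 100_000

# Made-up data-generating process; regressors are (1, X1).
X1 = rng.normal(size=n)
Y = np.exp(0.5 * X1) + rng.normal(size=n)
X = np.column_stack([np.ones(n), X1])

# Solve the sample analogue of the FOC  E[X(Y - X'beta)] = 0,  i.e.  (X'X) beta = X'Y.
beta_foc = np.linalg.solve(X.T @ X, X.T @ Y)

# Minimize the empirical mean squared error directly with a generic optimizer.
def mse(b):
    return np.mean((Y - X @ b) ** 2)

beta_opt = minimize(mse, x0=np.zeros(2)).x

print(beta_foc, beta_opt)   # both routes give (essentially) the same coefficients
```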
Causal Model
In order to analyze the causal effect using a model, suppose that Y = g(X, U), where X are the observed determinants of Y and U are the unobserved determinants of Y.
Such a relationship is a model of how Y is determined and may come from physics, economics, etc. The effect of Xj on Y holding X−j and U constant (i.e., ceteris paribus) is determined by this function g.
If g is differentiable, then it is given by the partial derivative
D_{Xj} g(X, U).
If we assume further that g is linear, so that
g(X,U)=X′β+U,
then the ceteris paribus effect of Xj on Y is simply βj. We may normalize U so that E[U] = 0; if this is not the case originally, it can be achieved by replacing U with U − E[U] and β0 with β0 + E[U].
On the other hand, E[U∣X], E[U∣Xj], and E[U⋅Xj] for 1 ≤ j ≤ k may or may not equal zero. This aspect is crucial in causal inference, as it concerns whether the unobserved determinants are independent of, or correlated with, the observed variables. So the regression coefficient may be biased for the ceteris paribus effect βj; to recover the ceteris paribus effect, we need more assumptions.
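The following sketch (assuming numpy; all numbers are made up) illustrates this point: when U is correlated with X, the least-squares / best-linear-predictor coefficient differs from the structural (ceteris paribus) coefficient in g:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

# Structural (causal) model with made-up coefficients: Y = 1 + 2*X1 + U,
# where the unobserved determinant U is correlated with X1, so E[U | X1] != 0.
common = rng.normal(size=n)                  # a factor driving both X1 and U
X1 = common + rng.normal(size=n)             # observed determinant
U = 0.8 * common + rng.normal(size=n)        # unobserved determinant
Y = 1.0 + 2.0 * X1 + U

X = np.column_stack([np.ones(n), X1])
beta_blp, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_blp[1])   # about 2.4 here, not the structural (ceteris paribus) effect 2.0
```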
Potential Outcomes

Potential outcomes are an easy way to think about causal relationships.

Illustration: a randomized controlled experiment in which individuals are randomly assigned to a treatment (a drug) that is intended to improve their health status.

Notation: let Y denote the observed health status and X ∈ {0, 1} denote whether the individual takes the drug or not.
The causal relationship between X and Y can be described using the so-called potential outcomes:
Y(0): the potential outcome in the absence of treatment
Y(1): the potential outcome in the presence of treatment
For health status, there are thus two potential health-status variables, (Y(0), Y(1)), where
Y(0) is the value of the outcome that would have been observed if (possibly counter-to-fact) X were 0.
Y(1) is the value of the outcome that would have been observed if (possibly counter-to-fact) X were 1.
The difference Y(1)−Y(0) is called the treatment effect (TE).
The quantity E[Y(1)−Y(0)] is usually referred to as the average treatment effect (ATE).
Treatment Effects

Using this notation, we may rewrite the observed outcome as

Y = X⋅Y(1) + (1 − X)⋅Y(0) = Y(0) + X⋅(Y(1) − Y(0)).

Setting β0 = E[Y(0)] and U = Y(0) − E[Y(0)], this becomes

Y = β0 + β1X + U with β1 = Y(1) − Y(0).

Note that this is not quite "the" linear model, since the coefficient β1 is random. For β1 to be constant, we need to assume that Y(1) − Y(0) is constant across individuals.

Under all these assumptions, we end up with a linear constant-effect causal model with U independent of X (from the nature of the randomized experiment), E[U] = 0, and so E[XU] = 0.

Without assuming constant treatment effects, it can be shown that (given the random assignment) a regression of Y on X identifies the average treatment effect,

E[Y∣X = 1] − E[Y∣X = 0] = E[Y(1) − Y(0)],

which is often called a causal parameter given that it is an average of causal effects.
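Here is a small simulation sketch of such a randomized experiment (assuming numpy; the distributions are made up): treatment effects vary across individuals, X is assigned at random, and the regression slope of Y on X is close to both the difference in means and the ATE:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Made-up potential outcomes with heterogeneous treatment effects; ATE = 0.7 by construction.
Y0 = rng.normal(loc=1.0, scale=1.0, size=n)
Y1 = Y0 + rng.normal(loc=0.7, scale=0.5, size=n)

# Random assignment: X is independent of (Y(0), Y(1)).
X = rng.integers(0, 2, size=n)
Y = X * Y1 + (1 - X) * Y0                    # observed outcome

slope, _ = np.polyfit(X, Y, 1)               # slope from regressing Y on X
diff_means = Y[X == 1].mean() - Y[X == 0].mean()

print(slope, diff_means, (Y1 - Y0).mean())   # all approximately equal to the ATE (0.7)
```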