Interpretations of Linear Regression

There are three ways to interpret Linear Regression:

  1. Linear Conditional Expectation

  2. Best Linear Approximation

  3. Causal Model

We now discuss each interpretation in detail.

Linear Conditional Expectation

Suppose that

$$\mathbb{E}[Y \mid X]=X^{\prime} \beta$$

and define $U=Y-\mathbb{E}[Y \mid X]$.

Recall the Conditional Expectation Function (CEF): $\mathbb{E}[Y \mid X=x]=m(x)$, which is the best predictor of $Y$ given $X$ in the mean squared error sense.

For $\mathbb{E}[Y \mid X]$, we have $\mathbb{E}[Y \mid X] = \mathbb{E}\left[Y \mid X_1, X_2, \ldots, X_k\right] = m(X_1, X_2, \ldots, X_k)$. But in practice, the form of this function $m(\cdot)$ is unknown to us.

This Linear Conditional Expectation has several implications:

  1. $\mathbb{E}[U]=0$.

    $\mathbb{E}[U]=\mathbb{E}_{X}[\mathbb{E}[U \mid X]]=\mathbb{E}_{X}[\mathbb{E}[Y - X^{\prime}\beta \mid X]] =\mathbb{E}_X[\mathbb{E}[Y \mid X]-X^{\prime} \beta]=\mathbb{E}_X[X^{\prime} \beta - X^{\prime} \beta] = 0$

  2. $\mathbb{E}[X U]=0$.

    $\mathbb{E}[X U]=\mathbb{E}\left[X\left(Y-X^{\prime} \beta\right)\right]=\mathbb{E}[X Y]-\mathbb{E}\left[X X^{\prime} \beta\right]=\mathbb{E}[\mathbb{E}[X Y \mid X]]-\mathbb{E}\left[X X^{\prime} \beta\right]$

    Substituting $\mathbb{E}[Y \mid X]=X^{\prime} \beta$, this becomes

    $\mathbb{E}[X \mathbb{E}[Y \mid X]]-\mathbb{E}\left[X X^{\prime} \beta\right]=\mathbb{E}\left[X X^{\prime} \beta\right]-\mathbb{E}\left[X X^{\prime} \beta\right]=0$

  3. $\operatorname{Cov}(X, U)=0$, which follows from the previous two implications since $\operatorname{Cov}(X, U)=\mathbb{E}[XU]-\mathbb{E}[X]\,\mathbb{E}[U]$ (see the simulation check below).
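
As a quick sanity check of these three implications, here is a minimal simulation sketch (not part of the original notes; the two-regressor design, coefficient values, and error distribution are invented purely for illustration) in which the linear CEF assumption holds by construction:

```python
# Sketch: when E[Y | X] = X'beta holds by construction, the residual
# U = Y - X'beta satisfies E[U] = 0, E[XU] = 0 and Cov(X, U) = 0
# (up to simulation noise). The design below is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
beta = np.array([1.0, 2.0, -0.5])                 # (intercept, beta1, beta2)

X = np.column_stack([np.ones(n),
                     rng.normal(size=n),
                     rng.uniform(-1, 1, n)])
U = rng.normal(size=n)                            # mean zero, independent of X
Y = X @ beta + U                                  # so E[Y | X] = X'beta

U_check = Y - X @ beta                            # U = Y - E[Y | X]
print(U_check.mean())                             # ~ 0   (E[U] = 0)
print((X * U_check[:, None]).mean(axis=0))        # ~ 0   (E[XU] = 0)
print(np.cov(X[:, 1], U_check)[0, 1])             # ~ 0   (Cov(X1, U) = 0)
```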

The $\beta$ defined this way does not necessarily have a causal interpretation: when one component of $X$ changes, the other determinants of $Y$ might change as well.

Therefore, we also cannot interpret $\beta_j$ as the ceteris paribus (i.e., holding $X_{-j}$ and $U$ constant) effect of a one-unit change in $X_j$ on $Y$. We need more information before we can make that interpretation.

  • We are only holding $X_i, i \neq j$ constant; nothing in this structure holds $U$ constant.

  • Since $U=Y-\mathbb{E}[Y \mid X]$, we have $Y=\mathbb{E}[Y \mid X]+U = m(X) + U$.

  • So $\frac{\partial Y}{\partial X_j}=\frac{\partial m(X)}{\partial X_j}+\frac{\partial U}{\partial X_j}$.

  • $\frac{\partial m(X)}{\partial X_j} = \beta_j$ only if $m(X)$ is a linear function, e.g. $m(X)=\beta_0+\beta_1 X_1+\cdots+\beta_k X_k$ (see the symbolic check after this list).
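
A tiny symbolic check of the last point (a sketch using sympy; both functional forms for $m$ are made up for illustration):

```python
# Sketch: d m(X) / d X1 equals beta1 only when m is linear in X1.
import sympy as sp

x1, x2, b0, b1, b2 = sp.symbols('x1 x2 beta0 beta1 beta2')

m_linear = b0 + b1 * x1 + b2 * x2
print(sp.diff(m_linear, x1))        # beta1: a constant ceteris paribus effect

m_nonlinear = b0 + b1 * x1 + b2 * x1 * x2
print(sp.diff(m_nonlinear, x1))     # beta1 + beta2*x2: depends on where we evaluate it
```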

"Best" Linear Approximation

In general, the conditional expectation is probably NOT linear; we are simply using a linear function to approximate it.
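
To make this concrete, here is a minimal simulation sketch (not from the notes; the sine-shaped CEF, the noise level, and the crude binning estimator are all chosen only for illustration) in which no straight line can match the true conditional expectation:

```python
# Sketch: the CEF m(x) = E[Y | X = x] need not be linear.
# Simulate Y = sin(2X) + noise, estimate m(x) crudely by binning,
# and compare with the best straight-line fit.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.uniform(-2, 2, n)
Y = np.sin(2 * X) + rng.normal(0, 0.3, n)     # true m(x) = sin(2x), unknown in practice

# crude nonparametric estimate of m(x): average Y within bins of X
bins = np.linspace(-2, 2, 21)
idx = np.digitize(X, bins)
m_hat = np.array([Y[idx == i].mean() for i in range(1, len(bins))])
print(m_hat)                                  # rises and falls: clearly nonlinear

# best linear fit (slope and intercept) cannot reproduce that shape
slope, intercept = np.polyfit(X, Y, 1)
print(slope, intercept)
```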

Suppose that $\mathbb{E}\left[Y^2\right]<\infty$ and $\mathbb{E}\left[X X^{\prime}\right]<\infty$ (or, equivalently, $\mathbb{E}\left[X_j^2\right]<\infty$ for $1 \leq j \leq k$).

Under these assumptions, one may ask what the "best" linear approximation to the conditional expectation is, i.e., the best function of the form $X^{\prime} b$ for some choice of $b \in \mathbf{R}^{k+1}$.

To this end, consider the minimization problem for the approximation error:

$$\min _{b \in \mathbf{R}^{k+1}} \mathbb{E}\left[\left(\mathbb{E}[Y \mid X]-X^{\prime} b\right)^2\right]$$

Minimize over $b$ and denote the solution to this minimization problem by $\beta$. Then,

  • $\beta$ is called the best linear predictor; it is the linear projection coefficient.

  • $b$ is the generic coefficient vector over which we minimize.

But we still cannot interpret $\beta_j$ as the ceteris paribus (i.e., holding $X_{-j}$ and $U$ constant) effect of a one-unit change in $X_j$ on $Y$, because we still have no information about the error term: the partial derivative only holds the other $X_i$'s constant, not $U$.

Best Linear Predictor (BLP)

The best linear predictor (BLP) can also be defined as:

$$\beta \in \underset{b \in \mathbf{R}^{k+1}}{\operatorname{argmin}} \mathbb{E}\left[\left(Y-X^{\prime} b\right)^2\right]$$

This $\beta$ is also a convenient way of summarizing the "best" linear predictor of $Y$ given $X$.

Proof:

$$\mathbb{E}\left[\left(Y-X^{\prime} b\right)^2\right] =\mathbb{E}\left[\left(Y-\mathbb{E}[Y \mid X]+\mathbb{E}[Y \mid X]-X^{\prime} b\right)^2\right]$$

$$= \mathbb{E}\left[ (Y - \mathbb{E}[Y \mid X])^2 \right] + \mathbb{E}\left[ (\mathbb{E}[Y \mid X] - X^{\prime} b)^2 \right]+2\, \mathbb{E}\left[(Y-\mathbb{E}[Y \mid X])\left(\mathbb{E}[Y \mid X]-X^{\prime} b\right)\right]$$

For the cross term $\mathbb{E}\left[(Y-\mathbb{E}[Y \mid X])\left(\mathbb{E}[Y \mid X]-X^{\prime} b\right)\right]$, the law of iterated expectations gives

$$\mathbb{E}\left[(Y-\mathbb{E}[Y \mid X])\left(\mathbb{E}[Y \mid X]-X^{\prime} b\right)\right]=\mathbb{E}\left[\mathbb{E}\left[(Y-\mathbb{E}[Y \mid X])\left(\mathbb{E}[Y \mid X]-X^{\prime} b\right) \mid X\right]\right]$$

and since $\mathbb{E}[Y \mid X]-X^{\prime} b$ depends only on $X$, this equals

$$\mathbb{E}\left[\left(\mathbb{E}[Y \mid X]-X^{\prime} b\right)\mathbb{E}\left[Y-\mathbb{E}[Y \mid X] \mid X\right]\right]=\mathbb{E}\left[\left(\mathbb{E}[Y \mid X]-X^{\prime} b\right)\left(\mathbb{E}[Y \mid X]-\mathbb{E}[Y \mid X]\right)\right] = 0$$

So we obtain:

$$\mathbb{E}\left[\left(Y-X^{\prime} b\right)^2\right]=\mathbb{E}\left[(Y-\mathbb{E}[Y \mid X])^2\right]+\mathbb{E}\left[\left(\mathbb{E}[Y \mid X]-X^{\prime} b\right)^2\right]$$

However, only the term $\mathbb{E}\left[\left(\mathbb{E}[Y \mid X]-X^{\prime} b\right)^2\right]$ depends on $b$, so minimizing the whole expression over $b$ is the same as minimizing this term. Therefore, this definition of the BLP coincides with the previous one.
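
Here is a numerical sketch of that equivalence (an illustrative design in which the simulator knows $m(x)$; the exponential CEF and sample size are arbitrary choices): the sample analogues of the two minimization problems return essentially the same coefficients.

```python
# Sketch: the minimizers of E[(E[Y|X] - X'b)^2] and E[(Y - X'b)^2] coincide.
# Use a design where m(x) = E[Y | X = x] is known to the simulator,
# solve both least-squares problems, and compare.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0, 2, n)
X = np.column_stack([np.ones(n), x])            # regressors: constant and x

m = np.exp(x)                                   # nonlinear CEF: m(x) = e^x
Y = m + rng.normal(size=n)                      # Y = m(X) + U with E[U | X] = 0

b_from_m = np.linalg.lstsq(X, m, rcond=None)[0] # argmin of sum (m(X) - X'b)^2
b_from_Y = np.linalg.lstsq(X, Y, rcond=None)[0] # argmin of sum (Y - X'b)^2
print(b_from_m)
print(b_from_Y)                                 # nearly identical to b_from_m
```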

Summary: Two in One

Two interpretations from equivalent optimization problems:

$$\beta \in \underset{b \in \mathbf{R}^{k+1}}{\operatorname{argmin}} \mathbb{E}\left[\left(\mathbb{E}[Y \mid X]-X^{\prime} b\right)^2\right] \quad \text{and} \quad \beta \in \underset{b \in \mathbf{R}^{k+1}}{\operatorname{argmin}} \mathbb{E}\left[\left(Y-X^{\prime} b\right)^2\right]$$

Note that $\mathbb{E}\left[\left(Y-X^{\prime} b\right)^2\right]$ is convex as a function of $b$, and this has the following implications.

  • We can take its derivative and use the first-order condition (FOC) to solve for $b$ and obtain $\beta$.

  • However, in order to do this we need to make more assumptions; I will introduce them in detail later, in the estimation of $\beta$.

Therefore, we have:

$$\frac{\partial \mathbb{E}\left[\left(Y-X^{\prime} b\right)^2\right]}{\partial b} =\mathbb{E}\left[\frac{\partial\left(Y-X^{\prime} b\right)^2}{\partial b}\right]=-2\,\mathbb{E}\left[X\left(Y-X^{\prime} b\right)\right]=-2\left(\mathbb{E}[X Y]-\mathbb{E}\left[X X^{\prime}\right] b\right)=0$$

Solving this for $b$ gives $\beta$ (a numerical sketch of the solution follows the two notes below). This is the First Order Condition (FOC); note that in the equations above:

  1. We can change the order between the expectation and partial derivative because of the dominated convergence theorem.

    Lemma:

    Let $X \in \mathcal{X}$ be a random variable and $g: \mathbb{R} \times \mathcal{X} \rightarrow \mathbb{R}$ a function such that $g(t, X)$ is integrable for all $t$ and $g$ is continuously differentiable w.r.t. $t$. Assume that there is a random variable $Z$ such that $\left|\frac{\partial}{\partial t} g(t, X)\right| \leq Z$ a.s. for all $t$ and $\mathbb{E}(Z)<\infty$. Then

    $$\frac{\partial}{\partial t} \mathbb{E}(g(t, X))=\mathbb{E}\left(\frac{\partial}{\partial t} g(t, X)\right).$$

    Proof:

    We have

    $$\begin{aligned} \frac{\partial}{\partial t} \mathbb{E}(g(t, X)) & =\lim _{h \rightarrow 0} \frac{1}{h}\left(\mathbb{E}(g(t+h, X))-\mathbb{E}(g(t, X))\right) \\ & =\lim _{h \rightarrow 0} \mathbb{E}\left(\frac{g(t+h, X)-g(t, X)}{h}\right) \\ & =\lim _{h \rightarrow 0} \mathbb{E}\left(\frac{\partial}{\partial t} g(\tau(h), X)\right), \end{aligned}$$

    where $\tau(h) \in(t, t+h)$ exists by the mean value theorem. By assumption we have

    $$\left|\frac{\partial}{\partial t} g(\tau(h), X)\right| \leq Z$$

    and thus we can use the dominated convergence theorem to conclude

    $$\frac{\partial}{\partial t} \mathbb{E}(g(t, X))=\mathbb{E}\left(\lim _{h \rightarrow 0} \frac{\partial}{\partial t} g(\tau(h), X)\right)=\mathbb{E}\left(\frac{\partial}{\partial t} g(t, X)\right).$$

    This completes the proof.

  2. We can change the order of $X$ and $X^{\prime} b$ (i.e., $X^{\prime} b\, X = X\, X^{\prime} b$) because $X^{\prime} b$ is a scalar.
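
The FOC says $\mathbb{E}[XX^{\prime}]\,\beta=\mathbb{E}[XY]$, so $\beta=\mathbb{E}[XX^{\prime}]^{-1}\mathbb{E}[XY]$ whenever $\mathbb{E}[XX^{\prime}]$ is invertible (one of the extra assumptions alluded to above). A minimal sketch of the sample analogue (the data-generating design below is invented for illustration):

```python
# Sketch: solve the sample analogue of the FOC, E[XX'] b = E[XY].
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
Y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

Exx = X.T @ X / n                        # sample analogue of E[XX']
Exy = X.T @ Y / n                        # sample analogue of E[XY]
beta_hat = np.linalg.solve(Exx, Exy)     # requires E[XX'] to be nonsingular
print(beta_hat)                          # ~ [1.0, 2.0, -0.5]

# identical to ordinary least squares on the same sample
print(np.linalg.lstsq(X, Y, rcond=None)[0])
```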

Causal Model

In order to analyze the causal effect using the model, suppose that $Y=g(X, U)$, where $X$ are the observed determinants of $Y$ and $U$ are the unobserved determinants of $Y$.

Such a relationship is a model of how $Y$ is determined and may come from physics, economics, etc. The effect of $X_j$ on $Y$ holding $X_{-j}$ and $U$ constant (i.e., ceteris paribus) is determined by this function $g$.

If $g$ is differentiable, then this effect is given by

$$D_{x_j} g(X, U).$$

If we assume further that $g$ is linear, such that

$$g(X, U)=X^{\prime} \beta+U,$$

then the ceteris paribus effect of $X_j$ on $Y$ is simply $\beta_j$. We may normalize $U$ so that $\mathbb{E}[U]=0$: if $\mathbb{E}[U] \neq 0$ originally, replace $U$ with $U-\mathbb{E}[U]$ and $\beta_0$ with $\beta_0+\mathbb{E}[U]$.

On the other hand, $\mathbb{E}[U \mid X]$, $\mathbb{E}\left[U \mid X_j\right]$ and $\mathbb{E}\left[U X_j\right]$ for $1 \leq j \leq k$ may or may not equal zero. This is crucial for causal inference, since it concerns whether the unobserved determinants are independent of (or correlated with) the observed variables. If they are correlated, $\beta_j$ can be biased for the ceteris paribus effect; to recover that effect, we need more assumptions.
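
A minimal simulation sketch of this point (the design, with a common unobserved driver $Z$, is invented purely for illustration): when $\mathbb{E}[U \mid X] \neq 0$, the projection coefficient differs from the structural ceteris paribus effect.

```python
# Sketch: Y = beta*X + U with Cov(X, U) != 0. The projection (regression)
# slope then differs from the structural ceteris paribus coefficient.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
beta_true = 2.0                              # ceteris paribus effect of X on Y

Z = rng.normal(size=n)                       # unobserved common driver
X = Z + rng.normal(size=n)                   # X depends on Z
U = 3.0 * Z + rng.normal(size=n)             # U depends on Z too, so E[U | X] != 0
Y = beta_true * X + U

slope = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)
print(slope)                                 # ~ 3.5, not 2.0: biased for the causal effect
```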

Potential Outcomes

Potential outcomes are an easy way to think about causal relationships.

Illustration: randomized controlled experiment where individuals are randomly assigned to a treatment (a drug) that is intended to improve their health status.

Notation: Let $Y$ denote the observed health status and $X \in \{0,1\}$ denote whether the individual takes the drug or not.

The causal relationship between $X$ and $Y$ can be described using the so-called potential outcomes:

  • $Y(0)$: potential outcome in the absence of treatment

  • $Y(1)$: potential outcome in the presence of treatment

For health status, we therefore have two potential outcome variables, $(Y(0), Y(1))$, where

  • $Y(0)$ is the value of the outcome that would have been observed if (possibly counter-to-fact) $X$ were 0.

  • $Y(1)$ is the value of the outcome that would have been observed if (possibly counter-to-fact) $X$ were 1.

Treatment Effects

  • The difference $Y(1)-Y(0)$ is called the treatment effect (TE).

  • The quantity $\mathbb{E}[Y(1)-Y(0)]$ is usually referred to as the average treatment effect (ATE).

Using this notation, we may rewrite the observed outcome as:

$$Y=\beta_0+\beta_1 X+U \quad \text{with} \quad \beta_1=Y(1)-Y(0).$$

Note that this is not quite "the" linear model, since the coefficient $\beta_1$ is random. For $\beta_1$ to be constant, we need to assume that $Y(1)-Y(0)$ is constant across individuals.

Under all these assumptions, we end up with a linear constant-effect causal model in which $U$ is independent of $X$ (by the nature of the randomized experiment), $\mathbb{E}[U]=0$, and therefore $\mathbb{E}[X U]=0$.

Without assuming constant treatment effects, it can be shown that a regression of $Y$ on $X$ identifies the average treatment effect,

$$\beta_1=\frac{\operatorname{Cov}[Y, X]}{\operatorname{Var}[X]}=\mathbb{E}[Y(1)-Y(0)],$$

which is often called a causal parameter given that it is an average of causal effects.
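
A simulation sketch of this identification result (the outcome distributions and sample size are invented for illustration): treatment effects vary across individuals, $X$ is randomly assigned, and the regression slope recovers the ATE.

```python
# Sketch: randomized assignment with heterogeneous treatment effects.
# The slope Cov(Y, X) / Var(X) recovers the ATE, E[Y(1) - Y(0)].
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

Y0 = rng.normal(50, 10, n)                   # potential outcome without the drug
Y1 = Y0 + rng.normal(5, 3, n)                # heterogeneous effects, ATE = 5

X = rng.integers(0, 2, n)                    # random assignment, independent of (Y0, Y1)
Y = np.where(X == 1, Y1, Y0)                 # observed outcome

slope = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)
print(slope)                                 # ~ 5 = E[Y(1) - Y(0)]
print((Y1 - Y0).mean())                      # sample ATE, for comparison
```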
