Data visualization is an important tool for understanding the relationship between variables. We often care about complex relationships between covariates, such as the relationship between \(y\) and \(x\) conditional on \(z\), rather than pure bivariate relationship. This creates a problem: we want to show someone how variation in x effects y, but if we know z is a confounder, a simple plot of y as a function of x will not tell us much.
library(ggplot2)
z = rnorm(1000, 1, 2)
x = 2*z + rnorm(1000)
y = x + 3*z + rnorm(1000)
df = cbind.data.frame(z, x, y)
ggplot(df, aes(x,y, size = z)) +
geom_point(alpha = .5) +
labs(x = 'x',
y = 'y',
size = 'z') +
theme_minimal()
If we look at our plot above, we see that y is generally increasing in \(x\), but we also know from our simulation that \(x\) and \(y\) are both an increasing function of \(z\). We can see this with the sizing of each point based on the value of \(z\), larger values of \(z\) appear associated with high \(x\) and high \(y\), and vice versa. Maybe if z is a discrete variable without too many categories, we can create a facet plot showing the relationship between \(x\) and \(y\) for each bucket of \(z\), but what if it’s continuous, as it is here? Or if there are so many categories in \(z\) that our facet plot would be unreadable? What can we do?
This brings us to an excellent econometric tool: the Frisch–Waugh–Lovell (FWL) theorem. The idea behind the theorem is simple but powerful: we can turn one multiple regression of \(y \sim \beta_0 + \beta_1 x + \beta_2 z + \epsilon\) into two regressions \(r_{y \sim z}\) and \(r_{x \sim z}\), where r is meant to signify the residuals from the regression in the subscript (a reminder: the residual is the actual outcome minus the predicted value from our regression). The theorem goes on to say we can obtain \(\beta_1\) from the multiple regression with the bivariate regression: \(r_{y \sim z} \sim r_{x \sim z}\).
There are excellent proofs of this theorem in abundance, so I will spare the technical details here. Instead, I’ll focus on the intuition for this result.
Plainly, we can think of \(r_{x \sim z}\) as variation in \(x\) not explained by z. Why is this the case? Because the residual is taking the actual \(x\) value and subtracting  what we would predict \(x\) to be based on \(z\). What is left, then, is variation that doesn’t come from \(z\). We can think of \(r_{y \sim z}\) in the same way: it is variation in \(y\) that doesn’t come from z. So the regression \(r_{y \sim z} \sim r_{x \sim z}\) is looking at how y (netting out z) is shaped by x (netting out z). What we are left with is how x effects y while adjusting for z, which is the same as including \(z\) in a multiple regression.
So that’s neat…but what does it have to do with data visualization? Let’s return back to our motivating problem: we want to show a plot of y on x conditional on z, but we don’t want to make a confusing three dimensional plot and we can’t just make a facet plot based on values of z. We can solve our problem using the insights of the FWL theorem. Since FWL turns a multivariate problem into a bivariate one, we can produce simple plots while adjusting for many covariates. Here are our steps:
Let’s write a function in R to do this for us. Let’s make the inputs the char strings of the pieces of the linear model that we will put in (for example, response will be ‘y’, predictor will be ‘x’, covariates will be ‘z+m’).
plot_fwl <- function(data, response, predictor, covariates) {
data$resid_y = resid(lm(as.formula(paste0(response, "~", covariates)), data = data))
data$resid_x = resid(lm(as.formula(paste0(predictor, "~", covariates)), data = data))
ggplot(data, aes(x = resid_x, y = resid_y)) +
geom_point(alpha = .5) +
labs(
x = as.character(predictor),
y = as.character(response),
title = "Relationship Net of Controls"
) +
theme_minimal()
}
Now, let’s apply our function to our simulated data.
plot_fwl(df, 'y', 'x', 'z')