[This article was first published on ouR data generation, and kindly contributed to R-bloggers].

In the previous post, I worked my way through some key elements of TMLE theory in an effort to understand how it all works. At its essence, TMLE is focused on getting the efficient influence function (EIF) to behave properly. When that happens, the estimator of the target parameter behaves as if it were based on a random sample from the true data-generating distribution.

Estimating the outcome and treatment (or exposure) models is an important part of constructing the EIF, but these models are treated as nuisance components and do not need to be perfectly specified. The targeting step can adjust for errors in the nuisance estimates, often recovering the desired empirical behavior of the EIF and improving the resulting estimate of the target parameter, even when one of the nuisance models is misspecified.

In that previous post, I described how TMLE does not simply try to improve the nuisance models, but instead makes a targeted adjustment so that the empirical mean of the estimated efficient influence function is brought back to zero. In this post, I use simulation to look directly at what targeting changes. In particular, I compare two estimators of the average treatment effect (ATE), the plug-in estimator and TMLE, with the goal of understanding what the targeting step is doing mechanically and how it affects the final estimate.

I focus on two questions. First, how far is the empirical mean of the estimated influence function from zero before and after targeting?
Second, how do the plug-in and TMLE estimates of the ATE behave across different nuisance-model scenarios?

A quick recap of the target

For a binary treatment \(A\), covariates \(X\), and outcome \(Y\), define the conditional outcome regression and propensity score as

\[Q_a(X) = E[Y \mid A = a, X], \qquad g(X) = P(A = 1 \mid X).\]

Our target is the average treatment effect (represented as \(\psi\)):

\[\psi_0 = E\big[Y(1) - Y(0)\big].\]

Under the usual identification assumptions, this can be written as

\[\psi(P) = E_P\big[Q_1(X) - Q_0(X)\big].\]

The efficient influence function for the ATE is

\[\phi_P(Z) = \big(Q_1(X) - Q_0(X) - \psi(P)\big) + \frac{A}{g(X)}\big(Y - Q_1(X)\big) - \frac{1-A}{1-g(X)}\big(Y - Q_0(X)\big).\]

If we knew the true nuisance functions, this quantity would be centered under the true distribution, and its empirical mean would fluctuate around zero because of sampling variability. In practice, of course, we plug in estimated nuisance functions, and then the empirical mean need not be close to zero at all. TMLE updates the initial outcome regression just enough to remove that empirical imbalance.

The simulation below is designed to make that step visible.

Data-generating process

To keep the story aligned with the previous simulation post, I use the same data-generating process. Covariates influence both treatment assignment and outcome, so there is genuine confounding.
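To make the first question concrete before getting to the full simulation, here is a minimal, self-contained base-R sketch of computing the plug-in ATE and the empirical mean of the estimated EIF. The toy data, model formulas, and variable names are my own illustrative assumptions (true ATE set to 1.5), not the post's simstudy code.

```r
# Toy data roughly mimicking the post's setup (an assumption of this sketch)
set.seed(123)
n  <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
a  <- rbinom(n, 1, plogis(-0.2 + 0.8 * x1 + 0.6 * x2))
y  <- 1.5 * a + 1.0 * x1 + 1.0 * x2 + 1.5 * x1 * x2 + rnorm(n)

# Estimate the nuisance functions (both correctly specified here)
g_fit <- glm(a ~ x1 + x2, family = binomial)
Q_fit <- lm(y ~ a + x1 + x2 + x1:x2)

g_hat  <- predict(g_fit, type = "response")
Q1_hat <- predict(Q_fit, newdata = data.frame(a = 1, x1, x2))
Q0_hat <- predict(Q_fit, newdata = data.frame(a = 0, x1, x2))

# Plug-in estimate of the ATE
psi_hat <- mean(Q1_hat - Q0_hat)

# Empirical mean of the estimated EIF, evaluated at the plug-in estimate
eif <- (Q1_hat - Q0_hat - psi_hat) +
  a / g_hat * (y - Q1_hat) -
  (1 - a) / (1 - g_hat) * (y - Q0_hat)
mean(eif)
```

With both nuisance models correctly specified, the empirical mean of the EIF should be small; the interesting cases are when one or both models are misspecified, which is what the scenarios below explore.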
The true ATE is constant and equal to \(\tau\).

```r
knitr::opts_chunk$set(eval = FALSE)

library(simstudy)
library(data.table)
library(ggplot2)

gen_dgp <- function(n, tau) {

  # Note: the defData() calls defining the covariates x1 and x2 were lost in
  # this excerpt; independent standard normal covariates are assumed here.
  def <-
    defData(varname = "x1", formula = 0, variance = 1, dist = "normal") |>
    defData(varname = "x2", formula = 0, variance = 1, dist = "normal") |>
    defData(
      varname = "a",
      formula = "-0.2 + 0.8 * x1 + 0.6 * x2",
      dist = "binary",
      link = "logit"
    ) |>
    defData(
      varname = "y",
      formula = "..tau * a + 1.0 * x1 + 1.0 * x2 + 1.5 * x1 * x2",
      variance = 1,
      dist = "normal"
    )

  genData(n, def)[]
}
```

Nuisance-model scenarios

To see how the estimators behave under different modeling assumptions, I consider four scenarios:

- both nuisance models correctly specified
- outcome model misspecified, propensity model correct
- propensity model misspecified, outcome model correct
- both nuisance models misspecified

The nuisance fitting functions vary under each scenario:

fit_nuisance
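The post's own fit_nuisance code is truncated in this excerpt. As a rough illustration of what scenario-dependent nuisance fitting might look like, here is a hypothetical helper; the specific misspecifications (dropping the x1:x2 interaction from the outcome model, dropping x2 from the propensity model) are my assumptions, not necessarily the author's choices.

```r
# Hypothetical sketch of a scenario-dependent nuisance-fitting helper.
# dd is a data frame (or data.table) with columns y, a, x1, x2.
fit_nuisance_sketch <- function(dd, q_correct = TRUE, g_correct = TRUE) {

  # Misspecification here means omitting a term that is in the true model
  q_form <- if (q_correct) y ~ a + x1 + x2 + x1:x2 else y ~ a + x1 + x2
  g_form <- if (g_correct) a ~ x1 + x2 else a ~ x1

  Q_fit <- lm(q_form, data = dd)
  g_fit <- glm(g_form, family = binomial, data = dd)

  # Return predicted potential outcomes and propensity scores
  list(
    Q1 = predict(Q_fit, newdata = transform(dd, a = 1)),
    Q0 = predict(Q_fit, newdata = transform(dd, a = 0)),
    g  = predict(g_fit, type = "response")
  )
}
```

Each scenario then amounts to a different (q_correct, g_correct) pair, with the downstream plug-in and TMLE computations unchanged.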