How conservative is the Neyman estimator of the variance of the SATE? Very.

Abstract: How conservative is the Neyman estimator of the variance of the SATE? It is very conservative. But you can do better. Thanks Aronow, Green, and Lee (2014)!

In statistical work we use standard errors all the time to express our uncertainty about some estimated effect. Roughly speaking, we often say that an estimate is “significant” if it is at least twice as large as its associated standard error. That’s all well and good, but there is a problem: the standard error itself has to be estimated, and in fact we often do a very bad job of estimating it.

Experimentalists are in a good position to estimate standard errors since they can do it from what they know about the data generating process—the way treatment is assigned. A great approach is to use the Neyman standard error. This can be used to calculate the standard error for the estimand defined at the level of the cases actually studied—the sample average treatment effect—or for the “population average treatment effect,” if you are willing to think of the estimate as being relevant for some larger population. (For many purposes you might care about the sample average effect: for example, if you want to know what the effect of a program was in some set of units, then it’s the sample effect you want. If you are going to be cautious about claims to external validity then you might want to restrict claims to the sample. But if you want to make a more ambitious claim then go for the population, and justify it.)

It is well known that, for the sample average treatment effect, this variance estimate is conservative. That means it is too big.

But just how conservative is it in the benchmark case where treatment and control groups are of the same size, the sample is very large, the variances of potential outcomes in the treatment and control groups are measured without error, and there is no correlation between treatment and control potential outcomes?

It is very conservative: the estimated variance is twice the true variance. That means that we might be routinely failing to reject null hypotheses that we ought to be rejecting. Things might be more significant than they seem.

Aronow, Green, and Lee (2014) consider a case where n experimental units are sampled from a population of N units and m of these are then assigned to treatment. The quantity of interest is the average treatment effect for the N units, and the challenge is to estimate the sampling variance of the difference-in-means estimator of that effect.

From their Eq 3, the sampling variance of the difference-in-means estimator, for the simple case in which N = n = 2m (the sample is the population and the arms are of equal size) and the covariance of potential outcomes \sigma_n(y_1,y_0)=0, is:

V_n = \frac{1}{n-1}\left(\sigma^2_n(y_1) + \sigma^2_n(y_0) \right)

However, from Eq 4 of the same paper, the conservative Neyman estimator of the variance (optimistically substituting the true within-group variances of potential outcomes for their estimates) is:

\hat{V}_n = \frac{2}{n-1}\left(\sigma^2_n(y_1) + \sigma^2_n(y_0) \right)

So in this case:

\hat{V}_n = 2V_n
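This factor of two is easy to check by simulation. Here is a minimal sketch (my own, not from the paper): draw treatment and control potential outcomes independently so their in-sample covariance is close to zero, assign half the units to treatment in each replication, and compare the sampling variance of the difference in means to the (expected) Neyman estimate.

n  <- 1000
y0 <- rnorm(n)        # control potential outcomes
y1 <- 1 + rnorm(n)    # treatment potential outcomes, drawn independently of y0
diff_means <- replicate(10000, {
  z <- sample(rep(c(0, 1), n/2))              # complete random assignment, m = n/2
  mean(y1[z == 1]) - mean(y0[z == 0])
  })
true_var   <- var(diff_means)                 # sampling variance of the estimator
neyman_var <- var(y1)/(n/2) + var(y0)/(n/2)   # Neyman estimate using the "true" variances
c(true_var, neyman_var, neyman_var/true_var)  # the last entry should be close to 2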

This is enough to make nonsense of p values. With a variance estimate twice the true variance, the estimated standard error is \sqrt{2} \approx 1.414 times the true standard error, so a rejection at the nominal 5% level requires the estimate to be at least 2.77 (that is, 1.96 times 1.414) true standard errors from zero. Under a null of no effect this happens with probability about 0.0056, so the probability of falsely rejecting at the 5% level is nothing close to 5%.
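To restate that arithmetic in code:

2*(1 - pnorm(1.96*sqrt(2)))   # actual rejection probability under the null; roughly 0.0056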

Another way of thinking about it: a researcher with a one unit treatment effect might be sweating over an estimated standard error of, say, 0.7, which means a t-stat of about 1.43, way too small for significance. But the true standard error might be 0.7/\sqrt{2} \approx 0.5, giving a t-stat of 2 and statistical significance.

Although shown here for the Neyman variance estimator, the regression (OLS) standard error is exactly the same under this balanced design: the regression estimate of the standard error is likewise conservative if you are interested in the sample average treatment effect.

Things will be better, in a sense, if potential outcomes are positively correlated. In that case the true variance is larger, so the conservative estimator will be closer to the truth (though of course negative correlation is better for you, since then the true variance is lower).
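As an extreme illustration (mine, not a case worked through in the paper): with a constant treatment effect the potential outcomes are perfectly positively correlated, and the Neyman estimator is not conservative at all. Rerunning the simulation above with y1 = y0 + 1:

n  <- 1000
y0 <- rnorm(n)
y1 <- y0 + 1                                  # constant effect: perfectly correlated potential outcomes
diff_means <- replicate(10000, {
  z <- sample(rep(c(0, 1), n/2))
  mean(y1[z == 1]) - mean(y0[z == 0])
  })
c(var(diff_means), var(y1)/(n/2) + var(y0)/(n/2))  # the two now (approximately) agree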

This all comes fairly immediately from Aronow et al, “Sharp Bounds on the Variance in Randomized Experiments.” The paper then goes a lot further, deriving a consistent estimator of sharp bounds on the variance by putting bounds on the covariance of potential outcomes.

Who would have thought that was possible? It can be done because the marginal distributions of the observed outcomes contain a lot more information than we typically use. For intuition: if we know there is no variation at all in the control outcomes, then we know the covariance between treatment and control potential outcomes is zero. Aronow et al provide code, reproduced below (fixing carriage-return bugs from the journal formatting):

# yt, yc: observed outcomes in the treatment and control groups
# N: population size (defaults to the sample size)
# upper = TRUE returns the upper bound on the variance, FALSE the lower bound
sharp.var <- function(yt, yc, N = length(c(yt, yc)), upper = TRUE) {
  m <- length(yt)
  n <- m + length(yc)
  # finite-population variance
  FPvar <- function(x, N) (N - 1)/(N*(length(x) - 1)) * sum((x - mean(x))^2)
  # for the upper bound, couple the marginals by sorting both in the same order;
  # for the lower bound, sort them in opposite orders
  yt <- sort(yt)
  if (upper == TRUE) {yc <- sort(yc)} else {yc <- sort(yc, decreasing = TRUE)}
  # grid of points at which either group's empirical quantile function jumps
  p_i <- unique(sort(c(seq(0, n - m, 1)/(n - m), seq(0, m, 1)/m))) - .Machine$double.eps^.5
  p_i[1] <- .Machine$double.eps^.5
  yti <- yt[ceiling(p_i*m)]        # quantiles of the treated outcomes
  yci <- yc[ceiling(p_i*(n - m))]  # quantiles of the control outcomes
  p_i_minus <- c(NA, p_i[1:(length(p_i) - 1)])
  # bound on the sampling variance of the difference-in-means estimator
  return(((N - m)/m * FPvar(yt, N) + (N - (n - m))/(n - m) * FPvar(yc, N) +
          2*sum(((p_i - p_i_minus)*yti*yci)[2:length(p_i)]) -
          2*mean(yt)*mean(yc))/(N - 1))
}

Let’s illustrate with a million-observation dataset: treatment randomly assigned to half the units, Y(0)=0 for everyone, and Y(1) distributed normally with mean 0 and standard deviation 1000. In this case the true standard error is 1, but using the Neyman estimator (or OLS) we estimate a standard error of \sqrt{2} \approx 1.414 instead. To wit:

n  <- 1000000
Y  <- c(rep(0, n/2), 1000*rnorm(n/2))   # Y(0) = 0 for controls, Y(1) ~ N(0, 1000^2) for treated
X  <- c(rep(0, n/2), rep(1, n/2))       # treatment indicator
round(coef(summary(lm(Y~X)))[2,],3)
##   Estimate Std. Error    t value   Pr(>|t|) 
##     -0.030      1.413     -0.021      0.983

So the standard error is wrong, just as we expected it to be: 1.414 instead of 1.

Let’s see how well Aronow et al do:

round(c(sharp.var(Y[X==1], Y[X==0], upper = FALSE), sharp.var(Y[X==1], Y[X==0], upper = TRUE)),3)
## [1] 1 1

Exactly. (Note that sharp.var returns a variance, so in general you would take its square root to get a standard error; here both happen to equal 1.)
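And as a sketch of how you might use the result in practice (my addition, not from the paper): take the square root of the upper bound as a tighter, still conservative, standard error and form the t-statistic from it.

est      <- coef(summary(lm(Y ~ X)))[2, 1]                        # difference-in-means estimate
se_sharp <- sqrt(sharp.var(Y[X == 1], Y[X == 0], upper = TRUE))   # conservative but sharp SE
c(estimate = est, se = se_sharp, t = est/se_sharp)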

People certainly fetishize p values too much. But the surprising news is that if we do a better job of calculating the ps we might find them smaller than we thought.