Chapter 2 Design

2.1 Units

We carefully define the units of observation. They may be households, automobile registrations, 911 callers, or students, about which we measure application status, traffic tickets, primary care visits, or days of attendance in school, respectively.

2.2 Level of Randomization

We define the level of randomization, which is not always the same as the unit of observation. In particular, the units of observation may be assigned in clusters. For example, we may assign some classrooms to a particular teacher-based intervention, and all students in intervention classrooms are assigned to the same intervention. When clusters are meaningful, this has substantial implications for our design and analysis.

Generally, we prefer to randomize at lower levels of aggregation – at the student level rather than the classroom level, for example – because we will have more randomization units. However, there are often logistical or statistical reasons for assigning conditions in clusters. For example, logistically, a teacher can only deliver one particular curriculum to their class; statistically, we may be concerned about interference between students, where students in one condition interact with students in the other, making the causal effects of the intervention difficult to isolate.
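
For instance, a minimal sketch of a clustered assignment, assuming the randomizr package and a hypothetical classroom_id identifier:

library(randomizr)

# Hypothetical data: 40 students nested in 8 classrooms:
students <- data.frame(student_id = 1:40,
                       classroom_id = rep(1:8, each = 5))

# Assign whole classrooms to conditions, so that all students in a
# classroom share the same condition:
students$treatment <- cluster_ra(clusters = students$classroom_id)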

2.3 Blocking on Background Characteristics

To create balance on potential outcomes, which reduces estimation error and improves precision, we block on prognostic covariates. A blocked randomization first creates groups of similar randomization units, then randomizes within those groups. In a random allocation, by contrast, one assigns a proportion of units to each treatment condition from a single pool. See Moore (2012) for an introduction and discussion.
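
For example, a minimal sketch of a blocked assignment, assuming the randomizr package and a hypothetical block_id built from prognostic covariates:

library(randomizr)

# Hypothetical data: 60 units grouped into 3 blocks of similar units:
units <- data.frame(unit_id = 1:60,
                    block_id = rep(c("A", "B", "C"), each = 20))

# Randomize to treatment and control separately within each block:
units$treatment <- block_ra(blocks = units$block_id)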

2.3.1 Examples

The Lab’s TANF recertification experiment (Moore et al. 2022) blocked participants on service center and assigned visit date.

2.4 Setting the Assignment Seed

Whenever our design or analysis involves randomization, we set the seed so that our results can be perfectly replicated.

We set the seed at the top of the file, just after the commands (e.g., library()) that load and attach that file’s packages. In a short random assignment file, for example, we might have

# Packages:
library(dplyr) 

# Set seed:
set.seed(SSSSSSSSS)

# Conduct Bernoulli randomization:
df <- df %>% mutate(
  treatment = sample(0:1, 
                     size = n(),
                     replace = TRUE))

2.4.1 Seeding Procedures

We use two types of seeds: date seeds and sampled seeds, which we describe below. By default, we use them in the following conditions:

  1. Public-relevant implementations: date
  2. Other implementations: sampled

We consider public-relevant implementations to include situations like a random assignment of a program to some members of a waitlist, a random selection of some households to participate in a survey, and random assignment of hypothetical treatments during confirmatory randomization inference.

We consider other implementations to include simulations to create example datasets, simulations to estimate power or bias, or other design or diagnostic procedures.

2.4.1.1 Date Seeds

To set the random seed, run at the R prompt

set.seed(YYMMDDHH)

where YY is the two-digit year (23 for 2023), MM is the two-digit month, DD is the two-digit date, and HH is the two-digit hour (between 00 and 23) of implementation.
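
If helpful, we can look up the current digits rather than assemble them by hand; the sketch below only prints the current YYMMDDHH string, and we still hard-code the resulting number inside set.seed() so the script replicates exactly:

# Print the current date-hour in YYMMDDHH form, then paste the
# digits into set.seed():
format(Sys.time(), "%y%m%d%H")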

2.4.1.2 Sampled Seeds

To set the random seed, first run the following at the R prompt only once

sample(1e9, 1)

then copy and paste the result as the argument of set.seed(). If the result of sample(1e9, 1) is SSSSSSSSS, then set the seed with

set.seed(SSSSSSSSS)

2.4.2 Updating the Seed

We update date seeds to reflect the last time that the code was run for implementation.

We do not need to update sampled seeds in our other design, demonstration, or diagnostic code.

2.4.3 Motivation

We want to ensure that our stochastic work can be exactly replicated. We do not manipulate seeds to obtain particular results. However, we do not want our draws to be entirely dependent within a given date, and we note that seemingly random phenomena are sometimes later found to have patterns. For an example of dependence, consider these two different draws that use the same seed:

set.seed(758296545)
sample(100, 10)
##  [1] 11 22 88 94 25 76  1  4 64 12
set.seed(758296545)
sample(100, 20)
##  [1] 11 22 88 94 25 76  1  4 64 12 35 45 65 77 34 91 90  6 27  3

Note that the first 10 cases are identical.

2.5 Power

We conduct power analysis to determine how precise our experiments are likely to be. Power analysis should go beyond sample size and attempt to account for as many features of the design, implementation, and analysis as possible. Sometimes we can achieve this with “formula-based” power strategies; other times we need to use simulation-based techniques. Power analyses should be conducted in code, so that they are replicable. If we use an online calculator for a quick calculation, we replicate the results in code.1

If our design and analysis plan match the assumptions of a formula-based power calculation well, we perform a formula-based power calculation. For example, if the design is complete randomization and the analysis plan is to conduct a two-sample \(t\)-test, we might use R’s power.t.test(), as below. However, if the design includes blocked assignment in several waves, untreated units stay in the pool across waves, and assignment probabilities vary by wave, with an analysis plan of covariance-adjusted Lin estimation with strong covariates, then we need to use simulation. If we can’t find a formula-based approach that sufficiently approximates our design and analysis plan, we use simulation.

2.5.1 Formula-based Power Analysis

An example of formula-based power calculation, where the design is complete randomization and the analysis plans for a two-sample \(t\)-test:

power_out <- power.t.test(delta = 1, 
                          sd = 1,
                          sig.level = 0.05,
                          power = 0.8)
power_out
## 
##      Two-sample t test power calculation 
## 
##               n = 16.71477
##           delta = 1
##              sd = 1
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

For two equally-sized samples drawn from a population with standard normal outcomes, we need 17 observations in each group to have a probability of 0.8 of detecting a true effect that is one standard deviation of the outcome in size, where “detecting” means rejecting a null hypothesis of \(H_0: \mu_{Tr} = \mu_{Co}\) against an alternative of \(H_a: \mu_{Tr} \neq \mu_{Co}\) using \(\alpha = 0.05\).
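
To carry this forward into later planning code, we can pull the per-group sample size out of the returned object and round up to a whole number of observations (a small sketch using the power_out object created above):

# Per-group sample size, rounded up to whole observations:
ceiling(power_out$n)
## [1] 17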

2.5.2 Formula-based MDE (minimum detectable effect)

An example of a formula-based MDE calculation follows, where the analysis plans for a two-sample test of proportions. The sample size is 75 (in each group), and we want to detect stipulated effects with probability 0.8. Below, we make the most conservative (SE-maximizing) assumption possible about the base rate: that it is 0.5.

power_out_mde <- power.prop.test(n = 75, 
                                 p1 = 0.5,
                                 power = 0.8)
power_out_mde
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 75
##              p1 = 0.5
##              p2 = 0.7213224
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
power_out_mde$p2 - power_out_mde$p1
## [1] 0.2213224

We see a minimum detectable effect of about 0.22, over a base rate of 0.5.

2.5.3 Simulation-based Power Analysis

Simulation-based power analysis allows us to estimate the power of any combination of randomization technique, missing data treatment, estimation strategy, etc. that we like.

Create some data to illustrate simulation-based power analysis:

library(estimatr)
library(here)
library(tidyverse)

set.seed(988869862)

n_samp <- 100
df <- tibble(x = rchisq(n_samp, df = 3),
             z = rbinom(n_samp, 1, prob = 0.5),
             y = x + z + rnorm(n_samp, sd = 1.1))

save(df, file = here("data", "02-01-df.RData"))

Suppose the estimation strategy is linear regression \(y_i = \beta_0 + \beta_1 z_i + \beta_2 x_i + \epsilon_i\) with heteroskedasticity-robust HC2 standard errors, and the coefficient of interest is \(\beta_1\). Perform 1000 reassignments and determine what proportion of them yields a \(\hat{\beta}_1\) that is statistically significant at \(\alpha = 0.05\).

n_sims <- 1000
alpha <- 0.05
true_te <- 1

is_stat_sig <- vector("logical", n_sims) # Storage

for(idx in 1:n_sims){
  
  # Re-assign treatment and recalculate outcomes n_sims times:
  # (Note: conditions on original x in df, but not original y.)
  df <- df %>% mutate(z_tmp = rbinom(n_samp, 1, prob = 0.5),
                      y_tmp = true_te * z_tmp + x + rnorm(n_samp, sd = 1.1))
  
  # Estimation:
  lm_out <- lm_robust(y_tmp ~ z_tmp + x, data = df)
  
  # Extract the p-value for the treatment coefficient:
  p_value_tmp <- lm_out$p.value["z_tmp"]
  
  # Store whether the true effect is 'detected':
  is_stat_sig[idx] <- (p_value_tmp <= alpha)
}

mean(is_stat_sig)
## [1] 0.995

So the probability of detecting the true average treatment effect of 1 is about 0.995. This high power comes largely from the strongly predictive nature of the covariate x. Note that a naïve formula-based approach that ignores the data generating process estimates the power to be roughly 0.45.
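
For comparison, one way to arrive at a naïve figure of that magnitude is to plug the marginal outcome standard deviation implied by the simulation’s data-generating process (covariate variance of 6 from the chi-squared draw plus noise variance of 1.1^2) into power.t.test(), assuming roughly 50 units per group; this is a sketch of one such calculation rather than the only one:

# Naive formula-based power check that ignores the covariate adjustment:
power.t.test(n = 50,
             delta = true_te,
             sd = sqrt(6 + 1.1^2))

This gives power of roughly 0.45, in line with the figure above.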

2.6 Balance Checking

To increase our certainty that our treatment and control conditions are balanced on predictive covariates, we compare the distributions of covariates. For example, in Moore et al. (2022), we describe

“the median absolute deviation (MAD) of appointment dates is about 0.15 days (about 3.5 hours) or less in 99% of the trios. In other words, the medians of the treatment and control groups tend to vary by much less than a day across the months and Service Centers.”
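
As a minimal sketch with the simulated data from Section 2.5.3 (using the tidyverse and estimatr packages loaded there), we might compare the covariate’s distribution across conditions with a grouped summary and a difference-in-means check:

# Compare the distribution of covariate x across conditions:
df %>%
  group_by(z) %>%
  summarise(mean_x = mean(x),
            median_x = median(x),
            sd_x = sd(x))

# Estimate the difference in covariate means between conditions:
difference_in_means(x ~ z, data = df)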

References

Moore, Ryan T. 2012. “Multivariate Continuous Blocking to Improve Political Science Experiments.” Political Analysis 20 (4): 460–79.
Moore, Ryan T., Katherine N. Gan, Karissa Minnich, and David Yokum. 2022. “Anchor Management: A Field Experiment to Encourage Families to Meet Critical Programme Deadlines.” Journal of Public Policy 42 (4): 615–36. https://doi.org/10.1017/S0143814X21000131.

  1. An online power calculator is available from EGAP, for example.