```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE,
                      fig.width=7, fig.height=5)
options(width = 200)

notice <- "Note: if the `optmatch` package is not available, the subsequent lines will not run."

use <- {
  if (requireNamespace("optmatch", quietly = TRUE)) "full"
  else if (requireNamespace("quickmatch", quietly = TRUE)) "quick"
  else "none"
}

me_ok <- requireNamespace("marginaleffects", quietly = TRUE) &&
  requireNamespace("sandwich", quietly = TRUE)
```

## Introduction

`MatchIt` implements the suggestions of @ho2007 for improving parametric statistical models for estimating treatment effects in observational studies and reducing model dependence by preprocessing data with semi-parametric and non-parametric matching methods. After appropriately preprocessing with `MatchIt`, researchers can use whatever parametric model they would have used without `MatchIt` and produce inferences that are more robust and less sensitive to modeling assumptions. `MatchIt` reduces the dependence of causal inferences on commonly made, but hard-to-justify, statistical modeling assumptions using a large range of sophisticated matching methods. The package includes several popular approaches to matching and provides access to methods implemented in other packages through its single, unified, and easy-to-use interface.

Matching is used in the context of estimating the causal effect of a binary treatment or exposure on an outcome while controlling for measured pre-treatment variables, typically confounding variables or variables prognostic of the outcome. Here and throughout the `MatchIt` documentation we use the word "treatment" to refer to the focal causal variable of interest, with "treated" and "control" reflecting the names of the treatment groups.
The goal of matching is to produce *covariate balance*, that is, for the distributions of covariates in the two groups to be approximately equal to each other, as they would be in a successful randomized experiment. The importance of covariate balance is that it allows for increased robustness to the choice of model used to estimate the treatment effect; in perfectly balanced samples, a simple difference in means can be a valid treatment effect estimate. Here we do not aim to provide a full introduction to matching or causal inference theory, but simply to explain how to use `MatchIt` to perform nonparametric preprocessing. For excellent and accessible introductions to matching, see @stuart2010 and @austin2011b.

A matching analysis involves four primary steps: 1) planning, 2) matching, 3) assessing the quality of matches, and 4) estimating the treatment effect and its uncertainty. Here we briefly discuss these steps and how they can be implemented with `MatchIt`; in the other included vignettes, these steps are discussed in more detail.

We will use Lalonde's data on the evaluation of the National Supported Work program to demonstrate `MatchIt`'s capabilities. First, we load `MatchIt` and bring in the `lalonde` dataset.

```{r}
library("MatchIt")
data("lalonde")

head(lalonde)
```

The statistical quantity of interest is the causal effect of the treatment (`treat`) on 1978 earnings (`re78`). The other variables are pre-treatment covariates. See `?lalonde` for more information on this dataset. In particular, the analysis is concerned with the marginal, total effect of the treatment for those who actually received the treatment, that is, the average treatment effect in the treated (ATT).

In what follows, we briefly describe the four steps of a matching analysis and how to implement them in `MatchIt`. For more details, we recommend reading the other vignettes, `vignette("matching-methods")`, `vignette("assessing-balance")`, and `vignette("estimating-effects")`, especially for users less familiar with matching methods.
For the use of `MatchIt` with sampling weights, also see `vignette("sampling-weights")`. It is important to recognize that the ease of using `MatchIt` does not imply the simplicity of matching methods; matching is an advanced statistical method that requires many decisions and careful use, and it should only be performed by those with statistical training.

**Selecting covariates to balance.** Selecting covariates carefully is critical for ensuring the resulting treatment effect estimate is free of confounding and can be validly interpreted as a causal effect. To estimate total causal effects, all covariates must be measured prior to treatment (or otherwise not be affected by the treatment). Covariates should be those that cause variation in the outcome and selection into treatment group; these are known as confounding variables. See @vanderweele2019 for a guide on covariate selection. Ideally these covariates are measured without error and are free of missingness.

## Check Initial Imbalance

After planning and prior to matching, it can be a good idea to view the initial imbalance in one's data that matching is attempting to eliminate. We can do this using the code below:

```{r}
# No matching; constructing a pre-match matchit object
m.out0 <- matchit(treat ~ age + educ + race + married + 
                    nodegree + re74 + re75,
                  data = lalonde,
                  method = NULL)
```

The first argument is a `formula` relating the treatment to the covariates used in estimating the propensity score and for which balance is to be assessed. The `data` argument specifies the dataset where these variables exist. Typically, the `method` argument specifies the method of matching to be performed; here, we set it to `NULL` so we can assess balance prior to matching[^1].
The `distance` argument specifies the method for estimating the propensity score, a one-dimensional summary of all the included covariates, computed as the predicted probability of being in the treated group given the covariates; here, we leave it at its default, `"glm"`, for generalized linear model, which implements logistic regression by default (see `?distance` for other options).

[^1]: Note that the default for `method` is `"nearest"` to perform nearest neighbor matching. To prevent any matching from taking place in order to assess pre-matching imbalance, `method` must be set to `NULL`.

Below we assess balance on the unmatched data using `summary()`:

```{r}
# Checking balance prior to matching
summary(m.out0)
```

We can see severe imbalances as measured by the standardized mean differences (`Std. Mean Diff.`), variance ratios (`Var. Ratio`), and empirical cumulative distribution function (eCDF) statistics. Values of standardized mean differences and eCDF statistics close to zero and values of variance ratios close to one indicate good balance, and here many of them are far from their ideal values.

## Matching

Now, matching can be performed. There are several different classes and methods of matching, described in `vignette("matching-methods")`. A widely used approach is 1:1 nearest neighbor (NN) matching on the propensity score, which is appropriate for estimating the ATT. One by one, each treated unit is paired with an available control unit that has the closest propensity score to it. Any remaining control units are left unmatched and excluded from further analysis. Due to the theoretical balancing properties of the propensity score described by @rosenbaum1983, propensity score matching can be an effective way to achieve covariate balance in the treatment groups.

Here, we begin with a different method, coarsened exact matching (CEM), which matches units exactly on coarsened (binned) versions of the covariates; nearest neighbor matching is tried afterwards. Below we demonstrate the use of `matchit()` to perform coarsened exact matching.
```{r}
# Coarsened exact matching (CEM); the distance argument is ignored
m.out1 <- matchit(treat ~ age + educ + race + married + 
                    nodegree + re74 + re75,
                  data = lalonde,
                  method = "cem",
                  distance = "mahalanobis")
```

We use the same syntax as before, but this time specify `method = "cem"` to implement coarsened exact matching. You will get a warning message saying that the distance measure is ignored. This is because CEM matches units exactly on coarsened versions of the covariates and does not use a distance measure. The matching outputs are contained in the `m.out1` object. Printing this object gives a description of the type of matching performed:

```{r}
m.out1
```

The key components of a `matchit` object are `weights` (the computed matching weights), `subclass` (matching pair or stratum membership), `distance` (the estimated propensity score, when one is estimated), and `match.matrix` (which control units are matched to each treated unit, for pair matching methods). Not every component is necessarily present for every matching method. How these can be used for estimating the effect of the treatment after matching is detailed in `vignette("estimating-effects")`.

## Assessing the Quality of Matches

Although matching on the propensity score is often effective at eliminating differences between the treatment groups to achieve covariate balance, its performance in this regard must be assessed. If covariates remain imbalanced after matching, the matching is considered unsuccessful, and a different matching specification should be tried. `MatchIt` offers a few tools for the assessment of covariate balance after matching. These include graphical and statistical methods. More detail on the interpretation of the included plots and statistics can be found in `vignette("assessing-balance")`.

In addition to covariate balance, the quality of the match is determined by how many units remain after matching. Matching often involves discarding units that are not paired with other units, and some matching options, such as setting restrictions for common support or calipers, can further decrease the number of remaining units.
If, after matching, the remaining sample size is small, the resulting effect estimate may be imprecise. In many cases, there will be a trade-off between balance and remaining sample size. How to optimally choose among them is an instance of the fundamental bias-variance trade-off problem that cannot be resolved without substantive knowledge of the phenomena under study. Prospective power analyses can be used to determine how small a sample can be before necessary precision is sacrificed.

To assess the quality of the resulting matches numerically, we can use the `summary()` function on `m.out1` as before. Here we set `un = FALSE` to suppress display of the balance before matching for brevity and because we already saw it. (Leaving it as `TRUE`, its default, would display balance both before and after matching.)

```{r}
# Checking balance after coarsened exact matching
summary(m.out1, un = FALSE)
```

At the top is a summary of covariate balance after matching. Although balance has improved for some covariates, in general balance is still quite poor, indicating that coarsened exact matching is not sufficient for removing confounding in this dataset. The final column, `Std. Pair Diff`, displays the average absolute within-pair difference of each covariate.

Let's also visualize the result.

```{r}
plot(m.out1, interactive = FALSE)
```

We can visually examine balance on the covariates using `plot()` with `type = "density"`:

```{r, fig.alt="Density plots of age, married and re75 in the unmatched and matched samples."}
plot(m.out1, type = "density", interactive = FALSE,
     which.xs = c("age", "married", "re75"))
```

### Trying a Different Matching Specification

Given the poor performance of coarsened exact matching in this example, we can try a different matching method or make other changes to the matching algorithm or distance specification.
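
When comparing matching specifications, it can help to see what the main balance statistics reported by `summary()` measure. The following is a minimal base R sketch, not `MatchIt`'s internal code: it uses a small made-up data frame (`toy`) and two hypothetical helper functions (`smd()` and `var_ratio()`), and it assumes the convention of standardizing the mean difference by the treated group's standard deviation.

```r
# Toy data: hypothetical values, not taken from lalonde
toy <- data.frame(
  treat = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  age   = c(25, 30, 22, 27, 40, 35, 45, 38, 29, 50)
)

# Standardized mean difference: difference in group means divided by
# the treated-group standard deviation (an assumed convention)
smd <- function(x, treat) {
  (mean(x[treat == 1]) - mean(x[treat == 0])) / sd(x[treat == 1])
}

# Variance ratio: treated-group variance over control-group variance
var_ratio <- function(x, treat) {
  var(x[treat == 1]) / var(x[treat == 0])
}

smd_age <- smd(toy$age, toy$treat)
vr_age  <- var_ratio(toy$age, toy$treat)
```

Values of `smd_age` near zero and `vr_age` near one would indicate good balance on `age`; in this toy example both are far from those targets, which is the kind of pattern that motivates trying another specification.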
Below, we try 1:1 nearest neighbor matching on the Mahalanobis distance.

```{r}
# 1:1 nearest neighbor matching on the Mahalanobis distance
m.out2 <- matchit(treat ~ age + educ + race + married + 
                    nodegree + re74 + re75,
                  data = lalonde,
                  method = "nearest",
                  distance = "mahalanobis")
m.out2
```

We can examine balance on this new matching specification.

```{r}
# Checking balance after nearest neighbor matching
summary(m.out2, un = FALSE)
```

```{r, fig.alt = "A love plot with matched dots below the threshold lines, indicating good balance after matching, in contrast to the unmatched dots far from the threshold lines, indicating poor balance before matching."}
plot(summary(m.out2))
```

Love plots are a simple and straightforward way to summarize balance visually. See `vignette("assessing-balance")` for more information on how to customize `MatchIt`'s Love plot and how to use `cobalt`, a package designed specifically for balance assessment and reporting that is compatible with `MatchIt`.

## Estimating the Treatment Effect

First we will get the matched dataset:

```{r}
m.data <- match.data(m.out2)

head(m.data)
```

Let's take a look at the matched set by using `View(m.data)`.

Let's estimate the treatment effect by comparing the treated units with their matched control units. For now, we will use the difference in means between the two groups. Later this semester, we will discuss regression in more detail and how it can be used in this setting.

```{r}
# Difference in mean 1978 earnings between treated and matched control units
mean(m.data$re78[m.data$treat == 1]) - mean(m.data$re78[m.data$treat == 0])
```

## Conclusion

Although we have covered the basics of performing a matching analysis here, to use matching to its full potential, the more advanced methods available in `MatchIt` should be considered. We recommend reading the other vignettes included here to gain a better understanding of all that `MatchIt` has to offer and how to use it responsibly and effectively.
As previously stated, the ease of using `MatchIt` does not imply that matching or causal inference in general are simple matters; matching is an advanced statistical technique that should be used with care and caution. We hope the capabilities of `MatchIt` ease and encourage the use of nonparametric preprocessing for estimating causal effects in a robust and well-justified way.

## References