Oct 26, 2020
Approaches for spatial analysis in observational epidemiology using R
A significant portion of my K01 grant focuses on geospatial approaches for health data: spatial epidemiology. While there is a rich history of geography in epidemiological analyses - think John Snow's cholera work in the mid nineteenth century - it is only relatively recently that rigorous quantitative approaches have been developed to allow a variety of exploratory data analyses as well as hypothesis testing studies. This post is an attempt to distill an entire field into a brief writeup. As such, by no means is this meant to be comprehensive or complete; it is merely an inventory of the more common approaches in use and may be useful as a jumping off point for those interested in these types of analyses.
The big picture question that one grapples with in spatial epidemiology is whether the observed distribution of an outcome, in our case a health state, is random over space - that is, geography - or is there an underlying spatial process that explains it? Our null hypothesis is that it is random, and our alternative hypothesis, which we hope to see, is that it is not random. If it turns out to be non-random, we call this spatial dependence or spatial autocorrelation. I will use the terms interchangeably herein. If it is positive, an outcome in one area, say high rates of coronavirus infection will result in higher rates of coronavirus infection in neighboring areas. If the dependence is negative, high rates of coronavirus infection in one area will result in lower rates of coronavirus infection in neighboring areas (visualize a chessboard, where white squares may mean high rates and black squares may mean low rates).
Then, if there is spatial dependence present, we next need to understand how does the X-->Y relation exist overall accounting for spatial effects. Are the inputs to our regression model (whether X, Y, or error) spatially dependent? And if so, how should we include that dependency in my regression model? These questions will be considered later in this post.
For a longer treatment of this topic, see Chapter 4 in the excellent book Neighborhoods and Health.
Preparing the data
As a brief aside, most of the methods assume that the data are aggregated or contextual measures that have been correspondingly mapped into polygons. This is very typical in spatial epidemiology where we focus on census tracts, ZIP codes, states, and so on. For example, one may have counts of coronavirus cases by ZIP code in a state. Less frequently we may also be interested in spatial analyses of point data: that actual locations of each coronavirus case. In this case, a point pattern analysis is employed. We can still ask the same question: is there spatial dependence in the points, but the techniques slightly differ.
To prepare the data for analysis, we need the geographic representation of the area of interest in a computer readable format (often a shapefile, see this previous blog post and along with the variables interests: exposure, outcome, and any covariables. The epidemiologic data are merged with the geographic data to create a geographic health dataset. Then, we need two additional objects for modeling purposes. The first is a mapping that defines what is a neighbor. This could be as simple as contiguity (does a given polygon share a border with another, dubbed the Queen's contiguity), or a distance based measure (all polygons within a given distance). It could also be more sophisticated, for example, if we were mapping an intensive care unit and wanted to define patient to patient transmission of infectious disease as a result of healthcare workers who may have arbitrary patient assignments. The second object we need is a weighting scheme based on the neighbor definitions, which defines how much influence a given neighbor should have in the model. As with the algorithm that defines neighbors, the algorithm that defines weight has several approaches. For example, row standardization says that each neighbor is "weighted" by the number of total neighbors. The choice with which weighting approach to use is dictated by theory and the research question. One may also take a pragmatic approach and choose an approach that describes the data the best, for example, the lowest measure of an information criterion.
Diagnosing spatial dependence
Typically, the first step one undertakes is an assessment of spatial dependence to test the hypothesis of whether there is complete spatial randomness or not. One can start with simple exploratory techniques - typically a choropleth map - and qualitatively identify if there are any patterns. Then we can employ some quantitative assessment to mathematically describe the autocorrelation, and whether it is statistically meaningful.
There are a growing number of techniques to diagnose spatial autocorrelation depending on the data type (areas versus points), cluster type (global or local), and other considerations. By far the most common approach is using Moran's I statistic, which has been around since 1950. In its original incarnation, Moran's I is a global test of area autocorrelation, but has also been adapted to local measures of autocorrelation. The difference between global and local measures of autocorrelation has to do with whether the clusters are identifiable. In a global test, we merely assess whether any spatial dependence exists in our data. We are unable, at least through the test, to pinpoint where the spatial dependency arises. One may be able to do this visually be inspection of a choropleth map. To pinpoint the location(s) of spatial dependence, a local statistics, such as Moran's local I can be used to identify and map the exact areas where spatial autocorrelation is occurring in the data. Keep in mind it is possible to have local spatial autocorrelation without global spatial autocorrelation, so a prudent approach tests for both. Interpreting Moran's I is straightforward: a value >0 suggests positive spatial autocorrelation, <0 suggests negative spatial autocorrelation, and a value of 0 equates to no spatial autocorrelation. One can compute confidence intervals around Moran's I to see if the statistic meaningfully differs from 0; a p-value is often also produced during the Moran's I test. Another popular test for autocorrelation is the spatial scan statistic, which is flexible to global and local tests for areas or point data, and can incorporate time dependence as well (spatio-temporal autocorrelation). The freely available SaTScan software can compute this measure in an intuitive, easy to use interface.
If there is spatial dependence in your data, and you employ a typical ordinary least squares (OLS) regression model without considering this phenomenon, the error terms may be biased and lead to incorrect inference. This is because a basic assumption of regression has been violated: the independence of observations and errors. The solution to this problem is spatial regression modeling. The research question under study is to understand how the X-->Y relation exists after accounting for spatial effects. As a comparison, we can first construct the OLS model. This can then be used to demonstrate whether failure to account for spatial dependence has led to biased inference. One may use one of the aforementioned tests for spatial dependence on the residuals of an OLS model to see if spatial autocorrelation is present. If spatial dependence is present, then a spatial regression model may be warranted (or inclusion of the variables leading to the spatial autocorrelation, if those data are available).
The simplest and most frequently used spatial regression models fall into a category called simultaneous autoregression (SAR) models. These models are considered "global" in nature: regardless of where Y is, the models produce a global prediction for any location. These can also be thought of as marginal models in that the effect estimates are global, averaged across all geographies in the analysis.
Starting with an OLS model, every component of this original model may be subject to spatial autocorrelation: the X term(s), the Y term, and even the error term. The spatial regression model approaches below account for the autocorrelation that may be present in the various combinations across these terms.
In a spatial error model the error term contains spatial effect: Y is dependent on our X and uncaptured effect (error) of our neighbors. The spatial component is measured through the parameter dubbed lambda, it can be viewed as nuisance parameter capturing error. The spatial error model is the most straightforward model and can be interpreted just like an OLS model.
In a spatial lag model or spatially lagged Y model, Y is dependent on our X and neighboring Y. That is, there is a spillover effect present. If, for example, our Y variable corresponds to coronavirus, a spillover suggests that high rates of coronavirus infection in a given area may be partially the result of higher rates of coronavirus infection in neighboring areas. This should make intuitive sense as often geographical areas are created based on arbitrary geopolitical or administrative definitions. Many health outcomes do not occur with respect these boundaries (unless, for example, a river served as a boundary and it was impossible to traverse this river, then contagion is blocked). The spatial component is captured in the parameter rho>: the effect of neighborhood average. Rho is not directly interpretable but one can examine statistical significance of this parameter to see if the spatial lag is meaningful.
Another type of spatial lag model is the spatially lagged X model. This model is very similar to the spatially lagged Y model, the difference being Y is dependent on our X and neighboring X (not the neighboring Y). As an example of this, suppose the exposure (X) was vaccination rates. It may stand to reason that vaccination rates will be similar in surrounding areas due to homophily, thus a spatially lagged X model may be warranted. The spatial component in the spatially lagged X model is dubbed theta: the effect of neighborhood average. As before, theta is not directly interpretable but one can examine statistical significance of this parameter to see if the spatial lag is meaningful.
Finally, one may consider a model with both X and Y lags: where Y is dependent on our X, and neighboring X and Y. This is known as a spatial Durbin model or lagged-mixed model. The parameters of spatial autocorrelation are as defined in the spatial lag models for X and Y. To continue the example from previously, suppose the exposure (X) was vaccination rates and the outcome (Y) was a count of measles cases. One can reasonably posit that both X and Y will have spatial dependency: vaccination rates due to homophily and measles due to contagion. In this case, a Durbin model is likely a good option.
There are additional permutations of spatial regression models to consider in special circumstances. A good resource is BurkeyAcademy. To help guide the decision in model choice among the many SAR models, Lagrange multiplier tests (basic and robust) are frequently used.
In all models with spatial lags, the parameter estimates cannot be directly interpreted, rather we need to decompose effects into total, direct, and indirect. A spillover is present when outcome or covariate levels for a given county affect the outcome for a neighboring county, and the effects in the local and neighboring counties (direct and indirect effects) are in the same direction.
A brief tangent is necessary before concluding the discussion of global SAR models. Thus far we have focused on SAR models. There is another category of modeling that known as conditional autoregressive (CAR) models that are sometimes employed to account for spatial dependency. The difference between the two is based on the conceptual distinction of spatial dependency, known as 1st order and 2nd order dependency. A 1st order effect suggests that spatial dependency is a function of space itself, whereas a 2nd order effect suggests that dependency is a function of neighboring observations, and not the space per se. CAR models pertain to the former, and SAR models pertain to the latter. To quote an example from a former professor of mine: "Imagine oak trees grow better in certain places on a hillside than others. You might say that there is something about the space overall in the data generation process that determines where oak trees tend to be. But oak trees also propagate by way of acorns from existing trees. Acorns may move around, but not all that far, so oak trees tend to grow near each other. The data generating process comes from the trees. CAR looks at how that broader terrain (space) is the source of spatial dependency in oak tree data and SAR looks at how the locations of oak trees are dependent on one another."
SAR models are dubbed global models because the produce estimates that assume homogeneity throughout the data. That is, the estimates are marginal and apply the same to all areas under study. In reality, this is an overly restrictive assumption: we can expect that the X-->Y relation can and will differ by geography. Therefore a newer technique known as geographically weighted regression (GWR) modeling has been developed to attempt to understand local phenomenon, or how place may modify the relationship under study.
GWR analysis in essence creates a separate regression analysis on each geographical area in the data; as such, one receives regression estimates and p-values for each geography. This feature is not without drawbacks: it can be difficult to interpret the data. Also, the assumptions put in to the GWR model have proven to be rather sensitive. Thus GWR is recommended as more of an exploratory tool rather than confirming or disproving hypotheses, and should be considered alongside other spatial regression approaches.
The most important part of the GWR analysis is the kernel. Heuristically, one can think of the kernel as the subset of data defined by neighbors. The kernel is a function of bandwidth (neighbors) and weighting function (influence of the neighbors define by a distribution, for example, geographies closest may have greatest effect). It can be constant across the data (fixed kernel) or can change over space (aka adaptive kernel).
Conducting a GWR analysis is relatively straightforward. First, one "calibrates" (i.e., determine size of) the bandwidth based on an OLS model and weighting function. Second, the GWR is estimated. The output from this estimation is per geography, analogous to a regression model for each geographical unit in the data. This indicates how much does the XY relation vary by geography. Third and finally, the results are mapped. There are a variety of options herein including plotting the slopes of each predictor and how they vary over space, or plotting the statistical significance of slopes (e.g. local t values). Diagnostic plots are also useful to assess spatial dependency post modeling, for example, by plotting residuals on map. A much more informative discussion of mapping options is presented by Mennis in his paper Mapping the Results of Geographically Weighted Regression.
Other approaches to spatial analysis
Thus far I have considered only the more common - and some may say straightforward - approaches to considering spatial dependence in epidemiological analysis. There are, of course, other approaches that have additional strengths and features compared to the presented approaches. For those with interest and expertise in Bayesian statistics, there are a variety of spatial approaches possible the resolve some key limitations in "frequentist" analyses. For example, using a Bayesian perspective, one can incorporate prior knowledge into the model, and not just rely upon the sample at hand. Bayesian models are robust to high parameter models and can include parameters for which there is no observed data (non-identifiability).
Another approach is to use eigenvectors to describe spatial dependency. I have previously blogged about their use. In short, this approach seeks to account for spatial dependency through an additional parameter into an OLS model. This makes this approach flexible to a variety of applications, such as a spatial multilevel model. The downside is to this approach is it merely seeks to account for the spatial dependency through a latent variable: we cannot decompose it or identify what is causing the spatial dependency to arise.
Sample Codes in R
### READ and PREPARE DATA ### #read shapefile, using either readOGR (rgdal package) or st_read (sf package) functions map_data = readOGR("shapefile_path","shapefile_layer") map_data = st_read("shapfile_path/layer") #if using readOGR, recast as an sf object map_data = st_as_sf(map_data1) #check CRS st_crs(map_data) #if needed, (re)project the shapefile: for CRS lookup, see https://epsg.io map_proj = st_transform(map_data, crs) #check CRS st_crs(map_proj) #test plot of map data: note that geometry is in a variable named geometer; can also access via st_geometry plot(map_data$geometry) #at this point, can merge epidemiologic data with map data map_proj_merged = merge(x=map_proj, y=your_data, by="id", all.x=T, duplicateGeoms=T) #define neighbors using Queen's contiguity (spdep package) nb_list = poly2nb(map_proj_merged) #define weighting scheme wt_list = nb2listw(nb_list) #using row standardization (default) wt_list = nb2listw(nb_list, style="B") #using binary weights ### ASSESS SPATIAL DEPENDENCY ### #diagnose spatial dependence using global Moran's I moran.plot(map_proj_merged$outcome, wt_list) moran.test(map_proj_merged$outcome, queen_listw) moran.mc(map_proj_merged$outcome, queen_listw, nsim=1000) #for rate data, use empirical Bayes index modification of Moran's I EBImoran.mc(n=map_proj_merged$numerator, x=map_proj_merged$denominator, listw=wt_list, nsim=10000) #local Moran's I: these values can then be plotted local_i = localmoran(map_proj_merged$outcome, queen_listw) #construct an OLS model: we can use this for comparison and test the residuals for spatial autocorrelation model1 = lm(outcome ~ exposure, data=map_proj_merged) moran.test(model1$residuals, wt_list, randomisation=F, alternative="two.sided") lm.morantest(model1, wt_list) #for linear models #if there is spatial dependence, can use a LaGrange Multiplier tests to inform the spatial regression approach lm.LMtests(model1, wt_list, test="all") #for linear models ### SAR MODELING (spatialreg package) ### #spatial error model2 = errorsarlm(outcome ~ exposure, data=map_proj_merged, wt_list) summary(model2) #spatial lag Y model model3 = lagsarlm(outcome ~ exposure, data=map_proj_merged, wt_list) summary(model3) #spatial lag X model model4 = lmSLX(outcome ~ exposure, data= map_proj_merged, wt_list) summary(model4) #lagged-mixed (Durbin) model: X and Y lagged model5 = lagsarlm(outcome ~ exposure, data= map_proj_merged, wt_list, type="mixed") #for models that have a lag, need to compute direct, indirect, total estimates summary(impacts(model3, tr=trW(as(wt_list, "CsparseMatrix"), type="mult"), R=1000), zstats=T) summary(impacts(model4, tr=trW(as(wt_list, "CsparseMatrix"), type="mult"), R=1000), zstats=T) summary(impacts(model5, tr=trW(as(wt_list, "CsparseMatrix"), type="mult"), R=1000), zstats=T) #can test for improved fit over OLS model anova(model3, model1) ### GWR MODELING (spgwr package) ### #calibrate bandwidth using linear model significant variables gwr_bw <- gwr.sel(model1, data=map_proj_merged, gweight=gwr.Gauss, verbose = FALSE) #derive and summarize the gwr model model6 <- gwr(model1, data=map_proj_merged, bandwidth=gwr_br, gweight=gwr.Gauss, hatmatrix = T) #extract the GWR output and add to the map data for mapping map_proj_gwr = merge(x=map_proj_merged, y= slot(model6$SDF, "data"), by="id", all.x=T, duplicateGeoms=T) #can check residuals for spatial autocorrelation moran.test(model6$gwr.e, wt_list, randomisation=F, alternative="two.sided") #can now map various aspects of the GWR results #e.g., local significance mapping shows where the model is meaningful by geographical unit #p-values can be obtained via: gwr.t.adjust(model6)