---
bibliography: references.bib
---

# Basic statistics for spatial analysis

This section provides some basic statistical tools for studying the spatial distribution of epidemiological data. If you wish to go further into spatial statistics applied to epidemiology and their limitations, you can consult the tutorial "[Spatial Epidemiology](https://mkram01.github.io/EPI563-SpatialEPI/index.html)" by M. Kramer, from which the statistical analysis in this section was adapted.

## Import and visualize epidemiological data

In this section, we load data recording the cases of an imaginary disease, W fever, throughout Cambodia. Each point corresponds to the geolocation of a case.

```{r load_cases, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}
library(dplyr)
library(sf)

# Import Cambodia country border
country <- st_read("data_cambodia/cambodia.gpkg", layer = "country", quiet = TRUE)
# Import provincial administrative border of Cambodia
education <- st_read("data_cambodia/cambodia.gpkg", layer = "education", quiet = TRUE)
# Import district administrative border of Cambodia
district <- st_read("data_cambodia/cambodia.gpkg", layer = "district", quiet = TRUE)

# Import locations of cases from an imaginary disease
cases <- st_read("data_cambodia/cambodia.gpkg", layer = "cases", quiet = TRUE)
cases <- subset(cases, Disease == "W fever")

```

The first step of any statistical analysis consists in visualizing the data to check that they were loaded correctly and to observe the general pattern of the cases.

```{r cases_visualization, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}

# View the cases object
head(cases)

# Map the cases
library(mapsf)

mf_map(x = district, border = "white")
mf_map(x = country, lwd = 2, col = NA, add = TRUE)
mf_map(x = cases, lwd = .5, col = "#990000", pch = 20, add = TRUE)
mf_layout(title = "W Fever infections in Cambodia")
```

In epidemiology, the true meaning of a point is often questionable. Although it usually gives the location of an observation, we cannot always tell whether this observation represents an event of interest (e.g., illness, death, ...) or a person at risk (e.g., a participant who may or may not experience the disease). While the population at risk can be considered uniformly distributed over a small area (within a city, for example), this is unlikely to hold at the country scale. Considering the ratio of events to the population at risk is therefore often more informative than considering cases alone. Administrative divisions are convenient areal units for aggregating cases because data on population counts and structure are available for them. In this study, we use the district as the areal unit.

```{r district_aggregate, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}
# Aggregate cases over districts
district$cases <- lengths(st_intersects(district, cases))

# Plot number of cases using proportional symbol 
mf_map(x = district) 
mf_map(
  x = district, 
  var = "cases",
  val_max = 50,
  type = "prop",
  col = "#990000", 
  leg_title = "Cases")
mf_layout(title = "Number of cases of W Fever")
```

The incidence ($\frac{cases}{population}$), expressed per 100,000 population, is commonly used to relate the distribution of cases to the population, but other indicators exist. For example, the standardized incidence ratio (SIR) measures the deviation of the observed number of cases from the expected number and is expressed as $SIR = \frac{Y_i}{E_i}$, with $Y_i$ the observed number of cases and $E_i$ the expected number of cases. In this study, we compute the expected number of cases in each district by assuming that infections are homogeneously distributed across Cambodia, i.e., that the incidence is the same in every district. The SIR therefore represents the deviation of a district's incidence from the average incidence across Cambodia.
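
As a worked illustration of this definition, with invented numbers that are not taken from the Cambodia data, write $P_i$ for the population of district $i$ (a symbol introduced here for convenience). Under a homogeneous risk, the expected count is the district population multiplied by the overall rate:

$$E_i = P_i \times \frac{\sum_{j} Y_j}{\sum_{j} P_j}$$

For a hypothetical district of 20,000 inhabitants with 6 cases, in a country of 1,000,000 inhabitants with 150 cases:

$$E_i = 20\,000 \times \frac{150}{1\,000\,000} = 3, \qquad SIR = \frac{6}{3} = 2$$

i.e., twice as many cases as expected under a homogeneous risk.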

```{r indicators, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, fig.height=4, class.output="code-out", warning=FALSE, message=FALSE}

# Compute incidence in each district (per 100 000 population)
district$incidence <- district$cases/district$T_POP * 100000

# Compute the global risk
rate <- sum(district$cases)/sum(district$T_POP)

# Compute expected number of cases 
district$expected <- district$T_POP * rate
district$SIR <- district$cases / district$expected
```

```{r inc_visualization, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=11, fig.height=7, class.output="code-out", warning=FALSE, message=FALSE}
par(mfrow = c(1, 2))

# Plot incidence
mf_map(x = district)
mf_map(x = district,
       var = c("T_POP", "incidence"),
       type = "prop_choro",
       pal = "Reds",
       inches = .1,
       breaks = exp(mf_get_breaks(log(district$incidence+1), breaks = "pretty"))-1,
       leg_title = c("Population", "Incidence \n(per 100 000)"))
mf_layout(title = "Incidence of W Fever")

# Plot SIRs
# create breaks and associated color palette
break_SIR <- c(0, exp(mf_get_breaks(log(district$SIR), nbreaks = 8, breaks = "pretty")))
col_pal <- c("#273871", "#3267AD", "#6496C8", "#9BBFDD", "#CDE3F0", "#FFCEBC", "#FF967E", "#F64D41", "#B90E36")
mf_map(x = district)
mf_map(x = district,
       var = c("T_POP", "SIR"),
       type = "prop_choro",
       breaks = break_SIR,
       pal = col_pal,
       inches = .1,
       #cex = 2,
       leg_title = c("Population", "SIR"))
mf_layout(title = "Standardized Incidence Ratio of W Fever")
```

These maps illustrate the spatial heterogeneity of the cases. The incidence shows how the disease varies from one district to another, while the SIR highlights districts that have:

-   higher risk than average (SIR \> 1) when standardized for population

-   lower risk than average (SIR \< 1) when standardized for population

-   average risk (SIR \~ 1) when standardized for population

::: callout-tip
### To go further ...
In this example, we standardized the case distribution by population count. This simple standardization assumes that the risk of contracting the disease is similar for every person. However, this assumption does not hold for all diseases or all observed events, since confounding effects can bias the interpretation (e.g., the numbers of childhood illnesses and deaths in a district usually depend on its age pyramid). A confounding factor is a variable that influences both the dependent variable and the independent variable, causing a spurious association. Keep in mind that other standardizations can be performed based on these confounding factors, i.e., variables known to have an effect but that you do not want to analyze (e.g., sex ratio, occupations, age pyramid).
![](img/Stat_Confounders.jpg){fig-align="center" width="300"}

In addition, one may wonder what an SIR \~ 1 means, i.e., what threshold decides whether the SIR is greater than, lower than or equivalent to 1. The significance of the SIR can be tested globally (to determine whether or not the incidence is homogeneously distributed) and locally in each district (to determine which districts have an SIR different from 1). We won't perform these analyses in this tutorial, but you can look at the functions `?achisq.test()` (from the `DCluster` package [@DCluster]) and `?probmap()` (from the `spdep` package [@spdep]) to compute these statistics.
:::
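
As a minimal sketch of what testing a single SIR against 1 can look like, the exact Poisson test from base R (`poisson.test()`) can be applied to one district. This is only an illustration, assuming that the observed count follows a Poisson distribution whose mean is the expected count; it is not the procedure implemented by `achisq.test()` or `probmap()`, and the observed and expected values below are made up.

```r
# Hypothetical district: 12 observed cases for 6.5 expected cases (invented values)
observed <- 12
expected <- 6.5

# H0: the ratio observed / expected (the SIR) equals 1
sir_test <- poisson.test(x = observed, T = expected, r = 1,
                         alternative = "two.sided")

sir_test$estimate  # estimated SIR (observed / expected)
sir_test$conf.int  # 95% confidence interval of the SIR
sir_test$p.value   # probability of such a deviation under H0
```

Repeating such a test in every district raises the multiple testing issue discussed later in this section.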

### General introduction

Why study clusters in epidemiology? Cluster analysis helps identify unusual patterns that occur during a given period of time. The ultimate goal of such analysis is to explain why these patterns are observed. In epidemiology, we can distinguish two types of processes that would explain heterogeneity in the distribution of cases:

-   The **1st order effects** are the spatial variations in the distribution of cases caused by underlying properties of the environment or by the population structure itself. In such processes, individuals get infected independently from the rest of the population. They include infection through an at-risk environment such as, for example, air pollution, contaminated water or soil, and UV exposure. This effect assumes that the observed pattern is caused by a difference in risk intensity.

-   The **2nd order effects** describe processes of spread, contagion and diffusion of diseases caused by interactions between individuals. This includes the transmission of infectious diseases by proximity, but also the transmission of non-infectious diseases, for example through the diffusion of social norms within networks. This effect assumes that the observed pattern is caused by correlations or co-variations.

![](img/Stat_order_effects.png){fig-align="center" width="500"}

No statistical method can distinguish between these competing processes, since they result in similar point patterns. Cluster analysis helps describe the magnitude and location of patterns but can in no way answer the question of why such patterns occur. It is therefore a step that helps detect clusters for description and surveillance purposes and raise hypotheses on the underlying processes, which will guide further investigations.

Knowledge about the disease and its transmission process can guide the choice of study methods. In this brief tutorial, we present two methods of cluster detection: Moran's I test, which tests for spatial independence (likely related to 2nd order effects), and the scan statistic, which tests for a homogeneous distribution (likely related to 1st order effects). It is up to the epidemiologist to select the tools that best serve the question studied.

::: callout-note
### Statistical tests and distributions

In statistics, problems are usually expressed by defining two hypotheses: the null hypothesis (H0), i.e., an *a priori* hypothesis about the studied phenomenon (e.g., the situation is random), and the alternative hypothesis (H1), e.g., the situation is not random. The main principle is to measure how likely it is that the observed situation belongs to the set of situations that are possible under H0.

In mathematics, a probability distribution is a mathematical expression that represents what we would expect due to random chance. The choice of the probability distribution depends on the type of data you use (continuous, count, binary). In general, three distributions are used when studying disease rates: the binomial, the Poisson and the Poisson-gamma mixture (also known as negative binomial) distributions.

Many statistical tests assume by default that data are normally distributed. This implies that your variable is continuous and that all data can be summarized by two parameters, the mean and the variance, i.e., each value has the same level of certainty. While many measures can be analyzed under the normality assumption, this is usually not the case in epidemiology, where strictly positive rates and count values 1) do not fit the normal distribution and 2) do not carry the same degree of certainty, since variances likely differ between districts due to different population sizes, i.e., some districts have very sparse data (with high variance) while others have adequate data (with lower variance).

```{r distribution, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}

# dataset statistics
m_cases <- mean(district$incidence)
sd_cases <- sd(district$incidence)

hist(district$incidence, probability = TRUE, ylim = c(0, 0.4), xlim = c(-5, 16),
     xlab = "Incidence (per 100 000)", ylab = "Probability",
     main = "Histogram of observed incidence compared\nto Normal and Poisson distributions")

# Normal density with the same mean and standard deviation as the data
curve(dnorm(x, m_cases, sd_cases), col = "blue", lwd = 1, add = TRUE)

# Poisson probabilities with the same mean as the data
points(0:max(district$incidence), dpois(0:max(district$incidence), m_cases),
       type = 'b', pch = 20, col = "red", lty = 2)

legend("topright", legend = c("Normal distribution", "Poisson distribution", "Observed distribution"),
       col = c("blue", "red", "black"), pch = c(NA, 20, NA), lty = c(1, 2, 1))
```

In this tutorial, we used the Poisson distribution in our statistical tests.
:::
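
To make the point about unequal certainty concrete, the short sketch below compares two hypothetical districts that share the same underlying risk but differ in population size. It assumes Poisson-distributed counts, for which the variance equals the mean; the risk and population values are invented for the illustration.

```r
# Same true risk in both districts: 5 cases per 100 000 inhabitants (invented)
risk <- 5 / 100000

# Two hypothetical districts with very different population sizes
pop <- c(small = 10000, large = 1000000)

# Under a Poisson model, the expected count is also the variance of the count
expected <- pop * risk

# Standard deviation of the incidence (per 100 000) implied by the Poisson model
sd_incidence <- sqrt(expected) / pop * 100000

data.frame(pop, expected, sd_incidence)
```

Although the true risk is identical, the incidence of the small district is ten times less precise than that of the large one, which is why rates from sparsely populated districts should be interpreted with caution.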

### Test for spatial autocorrelation (Moran's I test)

#### The global Moran's I test

A popular test for spatial autocorrelation is Moran's I test. This test tells us whether nearby units tend to exhibit similar incidences. The statistic usually ranges from -1 to +1. A value of -1 denotes that units with low rates are located near units with high rates, while a value of +1 indicates a concentration of spatial units exhibiting similar rates.

::: callout-note
##### Moran's I test

Moran's I statistic is:

$$I = \frac{N}{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}}\frac{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}(Y_i-\bar{Y})(Y_j - \bar{Y})}{\sum_{i=1}^N(Y_i-\bar{Y})^2}$$ with:

-   $N$: the number of polygons,

-   $w_{ij}$: the elements of a spatial weight matrix with zeroes on the diagonal (i.e., $w_{ii}=0$). For example, if polygons $i$ and $j$ are neighbors, the weight takes the value $1$, otherwise it takes the value $0$.

-   $Y_i$: the variable of interest,

-   $\bar{Y}$: the mean value of $Y$.

Under Moran's I test, the statistical hypotheses are:

-   **H0**: the distribution of cases is spatially independent, i.e., $I=0$.

-   **H1**: the distribution of cases is spatially autocorrelated, i.e., $I\ne0$.
:::

We will compute Moran's I statistic using the `spdep` [@spdep] and `DCluster` [@DCluster] packages. The `spdep` package provides a collection of functions to analyze spatial correlations of polygons and accepts both `sp` and `sf` objects. In this example, we use `poly2nb()` and `nb2listw()`. These functions respectively detect the neighboring polygons and assign weights corresponding to $1/\#\ of\ neighbors$. The `DCluster` package provides a set of functions for the detection of spatial clusters of disease using count data.
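
To connect the formula with the weights described above, here is a small base R sketch that computes Moran's I by hand on a toy example. The four regions, their values and their adjacency are hypothetical; the weights are row-standardized (each neighbor receives $1/\#\ of\ neighbors$), as with `nb2listw(style = 'W')`.

```r
# Toy example: 4 regions aligned in a row (1-2-3-4), hypothetical values
y <- c(1, 2, 3, 4)

# Binary adjacency matrix: 1 if two regions share a border, 0 otherwise
adj <- matrix(c(0, 1, 0, 0,
                1, 0, 1, 0,
                0, 1, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE)

# Row-standardized weights: each row sums to 1
w <- adj / rowSums(adj)

n  <- length(y)
s0 <- sum(w)        # global sum of weights
d  <- y - mean(y)   # deviations from the mean

# Moran's I: weighted cross-products of deviations over total squared deviation
moran_i <- (n / s0) * sum(w * outer(d, d)) / sum(d^2)
moran_i
```

Neighboring regions have similar values in this toy example, so the statistic is positive.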

```{r MoransI, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}
#install.packages("spdep")
#install.packages("DCluster")
library(spdep) # Functions for creating spatial weights and spatial analysis
library(DCluster)  # Package with functions for spatial cluster analysis

set.seed(345) # fix the random seed so the simulations are reproducible

queen_nb <- poly2nb(district) # Neighbors according to queen case
q_listw <- nb2listw(queen_nb, style = 'W') # row-standardized weights

# Moran's I test
m_test <- moranI.test(cases ~ offset(log(expected)),
                      data = district,
                      model = 'poisson',
                      R = 499,
                      listw = q_listw,
                      n = length(district$cases), # number of regions
                      S0 = Szero(q_listw)) # Global sum of weights
print(m_test)
plot(m_test)

```

Moran's I statistic is here $I =$ `r signif(m_test$t0, 2)`. When comparing its value to the H0 distribution (built from `r m_test$R` simulations), the probability of observing such an I value under the null hypothesis, i.e., the distribution of cases is spatially independent, is $p_{value} =$ `r signif(( 1+ (sum((-abs(as.numeric(m_test$t0-mean(m_test$t))))>as.numeric(m_test$t-mean(m_test$t)))) + (sum(abs(as.numeric(m_test$t0-mean(m_test$t)))<as.numeric(m_test$t-mean(m_test$t)))) )/(m_test$R+1), 2)`. We therefore reject H0 with an error risk of $\alpha = 5\%$. The distribution of cases is therefore autocorrelated across districts in Cambodia.

#### The Local Moran's I LISA test

The global Moran's test provides a single statistic indicating whether autocorrelation occurs over the territory, but it does not indicate where these correlations occur, i.e., where the clusters are located. To identify such clusters, we can decompose the Moran's I statistic to extract local information on the level of correlation between each district and its neighbors. This is called the Local Moran's I (LISA) statistic. Because the Local Moran's I statistic tests each district for autocorrelation independently, it raises concerns about multiple testing, which increases the Type I error ($\alpha$) of the statistical tests. Local tests should therefore be used to explore and describe clusters once the global test has detected autocorrelation.

::: callout-note
##### Statistical test

For each district $i$, the Local Moran's I statistic is:

$$I_i = \frac{(Y_i-\bar{Y})}{\frac{1}{N}\sum_{k=1}^N(Y_k-\bar{Y})^2}\sum_{j=1}^Nw_{ij}(Y_j - \bar{Y}) \text{ with }  I = \sum_{i=1}^NI_i/N \text{ for row-standardized weights}$$
:::
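
The decomposition can be checked numerically on a toy example. The sketch below is a hedged illustration in base R: the regions, values and adjacency are hypothetical, the weights are row-standardized, and the local statistic uses the common convention in which the deviation of each region is divided by the mean squared deviation. Under these assumptions, the average of the local values equals the global Moran's I.

```r
# Toy example: 4 regions aligned in a row (1-2-3-4), hypothetical values
y <- c(1, 2, 3, 4)
adj <- matrix(c(0, 1, 0, 0,
                1, 0, 1, 0,
                0, 1, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE)
w <- adj / rowSums(adj)       # row-standardized weights

d     <- y - mean(y)          # deviations from the mean
m2    <- mean(d^2)            # mean squared deviation
lag_d <- as.vector(w %*% d)   # weighted average deviation of the neighbors

# Local Moran's I of each region
local_i <- d / m2 * lag_d

# Global Moran's I computed from its own formula
global_i <- (length(y) / sum(w)) * sum(w * outer(d, d)) / sum(d^2)

local_i
all.equal(mean(local_i), global_i)
```

The two regions at the ends of the row contribute most to the global statistic because their values deviate most from the mean in the same direction as their only neighbor.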

The `localmoran()` function from the `spdep` package treats the variable of interest as if it were normally distributed. In some cases, this assumption can be reasonable for incidence rates, especially when the areal units of analysis have sufficiently large population counts, suggesting that the values have similar levels of variance. Unfortunately, the local Moran's test has not been implemented for the Poisson distribution (needed when the population is not large enough in some districts) in the `spdep` package. However, Bivand *et al.* [@bivand2008applied] provided some code to perform the analysis manually using the Poisson distribution, and this code was further implemented in the course "[Spatial Epidemiology](https://mkram01.github.io/EPI563-SpatialEPI/index.html)".

```{r LocalMoransI, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}

# Step 1 - Create the standardized deviation of observed from expected
sd_lm <- (district$cases - district$expected) / sqrt(district$expected)

# Step 2 - Create a spatially lagged version of standardized deviation of neighbors
wsd_lm <- lag.listw(q_listw, sd_lm)

# Step 3 - the local Moran's I is the product of step 1 and step 2
district$I_lm <- sd_lm * wsd_lm

# Step 4 - setup parameters for simulation of the null distribution

# Specify number of simulations to run
nsim <- 499

# Specify dimensions of result based on number of regions
N <- length(district$expected)

# Create a matrix of zeros to hold results, with a row for each district, and a column for each simulation
sims <- matrix(0, ncol = nsim, nrow = N)

# Step 5 - Start a for-loop to iterate over simulation columns
for(i in 1:nsim){
  y <- rpois(N, lambda = district$expected) # generate a random event count, given expected
  sd_lmi <- (y - district$expected) / sqrt(district$expected) # standardized local measure
  wsd_lmi <- lag.listw(q_listw, sd_lmi) # standardized spatially lagged measure
  sims[, i] <- sd_lmi * wsd_lmi # this is the I(i) statistic under this iteration of null
}
# Step 6 - For each district, test where the observed value ranks with respect to the null simulations
xrank <- apply(cbind(district$I_lm, sims), 1, function(x) rank(x)[1])
# Step 7 - Calculate the difference between observed rank and total possible (nsim)
diff <- nsim - xrank
diff <- ifelse(diff > 0, diff, 0)

# Step 8 - Assuming a uniform distribution of ranks, calculate p-value for observed
# given the null distribution generate from simulations
district$pval_lm <- punif((diff + 1) / (nsim + 1))
```

Briefly, the process consists of 1) computing the I statistic for the observed data, 2) estimating the null distribution of the I statistic by random sampling from a Poisson distribution and 3) comparing the observed I statistic with the null distribution to determine the probability of observing such a value if the numbers of cases were spatially independent. For each district, we obtain a p-value based on the comparison of the observed value with the null distribution.
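
The comparison in step 3 can be illustrated with a generic Monte Carlo p-value. This is a simplified sketch, not the code of this tutorial: it uses a single hypothetical count instead of the local I statistic, and the observed and expected values are invented.

```r
set.seed(1)  # reproducible simulations

nsim     <- 499
expected <- 4    # hypothetical expected count under H0
observed <- 10   # hypothetical observed count

# Null distribution: counts simulated under H0
sims <- rpois(nsim, lambda = expected)

# Upper-tail Monte Carlo p-value: share of values (observed included)
# that are at least as extreme as the observed one
p_value <- (1 + sum(sims >= observed)) / (nsim + 1)
p_value
```

Adding 1 to the numerator and to the denominator counts the observed value as one draw from the null distribution, so the smallest attainable p-value with 499 simulations is 1/500 = 0.002.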

A conventional way of plotting these results is to classify the districts into 5 classes based on the local Moran's I output. Districts that are significantly autocorrelated with their neighbors are classified by comparing the scaled incidence in the district with the scaled weighted average incidence of its neighboring districts (computed with `lag.listw()`):

-   Districts that have higher-than-average rates in both index regions and their neighbors and show statistically significant positive values for the local $I_i$ statistic are defined as **High-High** (hotspot of the disease).
-   Districts that have lower-than-average rates in both index regions and their neighbors and show statistically significant positive values for the local $I_i$ statistic are defined as **Low-Low** (cold spot of the disease).
-   Districts that have higher-than-average rates in the index regions and lower-than-average rates in their neighbors, and show statistically significant negative values for the local $I_i$ statistic, are defined as **High-Low** (outlier with high incidence in an area with low incidence).
-   Districts that have lower-than-average rates in the index regions and higher-than-average rates in their neighbors, and show statistically significant negative values for the local $I_i$ statistic, are defined as **Low-High** (outlier with low incidence in an area with high incidence).
-   Districts with non-significant values for the $I_i$ statistic are defined as **Non-significant**.

```{r LocalMoransI_plt, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}

# create lagged local raw_rate - in other words the average of the queen neighbors value
# values are scaled (centered and reduced) to be compared to average
district$lag_std   <- scale(lag.listw(q_listw, var = district$incidence))
district$incidence_std <- scale(district$incidence)

# extract pvalues
# district$lm_pv <- lm_test[,5]

# Classify local moran's outputs
district$lm_class <- NA
district$lm_class[district$incidence_std >=0 & district$lag_std >=0] <- 'High-High'
district$lm_class[district$incidence_std <=0 & district$lag_std <=0] <- 'Low-Low'
district$lm_class[district$incidence_std <=0 & district$lag_std >=0] <- 'Low-High'
district$lm_class[district$incidence_std >=0 & district$lag_std <=0] <- 'High-Low'
district$lm_class[district$pval_lm >= 0.05] <- 'Non-significant'

district$lm_class <- factor(district$lm_class, levels=c("High-High", "Low-Low", "High-Low",  "Low-High", "Non-significant") )

# create map
mf_map(x = district,
       var = "lm_class",
       type = "typo",
       cex = 2,
       col_na = "white",
       #val_order = c("High-High", "Low-Low", "High-Low",  "Low-High", "Non-significant") ,
       pal = c("#6D0026" , "blue",  "white") , # "#FF755F","#7FABD3" ,
       leg_title = "Clusters")

mf_layout(title = "Cluster using Local Moran's I statistic")
```

### Spatial scan statistics

While Moran's indices focus on testing for autocorrelation between neighboring polygons (under the null assumption of spatial independence), the spatial scan statistic aims at identifying an abnormally high risk in a given region compared to the risk outside this region (under the null assumption of a homogeneous distribution). The concept of a cluster therefore differs between the two methods.

The function `kulldorff()` from the package `SpatialEpi` [@SpatialEpi] is a simple tool to implement spatial-only scan statistics.

::: callout-note
##### Kulldorff test

Under the Kulldorff test, the statistical hypotheses are:

-   **H0**: the risk is constant over the area, i.e., there is a spatial homogeneity of the incidence.

-   **H1**: the observed window has a higher incidence than the rest of the area, i.e., there is a spatial heterogeneity of incidence.
:::

Briefly, the Kulldorff scan statistic scans the area for clusters in several steps:

1.  It creates a circular window of observation by defining a single location and an associated radius, which varies from 0 to a large value that depends on the population distribution (the largest window can include 50% of the population).

2.  It aggregates the count of events and the population at risk (or an expected count of events) inside and outside the window of observation.

3.  It computes the likelihood ratio and tests whether the risk is equal inside and outside the window (H0) or greater inside the observed window (H1). The H0 distribution is estimated by simulating the distribution of counts under the null hypothesis (homogeneous risk).

4.  These 3 steps are repeated for each location and each possible window radius.
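
The first three steps can be sketched in base R on a toy example. This is a simplified illustration, not the implementation of `kulldorff()`: the five areas, their coordinates, populations and cases are hypothetical, windows are built by adding areas in order of distance from each centroid, the statistic is the usual Poisson log-likelihood ratio, and the Monte Carlo significance step is omitted.

```r
# Toy data (hypothetical): 5 areas with centroid coordinates, population and cases
xy    <- cbind(x = c(0, 1, 2, 5, 6), y = c(0, 0, 1, 5, 5))
pop   <- c(1000, 1500, 1200, 2000, 1800)
cases <- c(12, 15, 11, 6, 6)

# Expected cases under H0 (same risk everywhere)
expected <- pop * sum(cases) / sum(pop)

# Poisson log-likelihood ratio of a window (0 when the risk is not higher inside)
llr <- function(c_in, e_in, c_tot) {
  c_out <- c_tot - c_in
  e_out <- c_tot - e_in
  if (c_in / e_in <= c_out / e_out) return(0)
  c_in * log(c_in / e_in) + c_out * log(c_out / e_out)
}

d    <- as.matrix(dist(xy))   # distances between centroids
best <- list(llr = 0, ids = integer(0))

for (i in seq_len(nrow(xy))) {
  ord <- order(d[i, ])        # areas sorted by distance to centroid i
  for (k in seq_along(ord)) {
    ids <- ord[1:k]           # step 1: window made of the k closest areas
    if (sum(pop[ids]) > 0.5 * sum(pop)) break   # at most 50% of the population
    # steps 2 and 3: aggregate counts inside the window and compute the statistic
    stat <- llr(sum(cases[ids]), sum(expected[ids]), sum(cases))
    if (stat > best$llr) best <- list(llr = stat, ids = sort(ids))
  }
}

best  # window with the highest log-likelihood ratio (most likely cluster)
```

In the real test, the same maximum is recomputed on counts simulated under H0 to obtain the null distribution against which the observed maximum is compared.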

Since we test the significance of a large number of observation windows, one can raise concerns about multiple testing and Type I error. However, this approach assumes that we are not interested in a set of significant clusters but only in the single most likely cluster. This *a priori* restriction removes the concern about multiple comparisons, since the test reduces to the statistical significance of one single most likely cluster.

Because we tested all possible locations and window radii, we can also choose to look at secondary clusters. In this case, keep in mind that the more secondary clusters you select, the higher the risk of Type I error.

```{r spatialEpi, eval = TRUE, echo = TRUE, nm = TRUE, class.output="code-out", warning=FALSE, message=FALSE}
#install.packages("SpatialEpi")
library("SpatialEpi")
```

The `kulldorff()` function does not accept R spatial objects. Instead, it uses a matrix of xy coordinates that represent the centroids of the districts. A given district is included in the observed circular window if its centroid falls inside the circle.

```{r kd_centroids, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}

district_xy <- st_centroid(district) %>% 
  st_coordinates()

head(district_xy)
```

We can then call the `kulldorff()` function (you are strongly encouraged to read `?kulldorff` to call it properly). The `alpha.level` threshold filters the secondary clusters that will be retained. The most likely cluster is saved whatever its significance.

```{r kd_test, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}

kd_Wfever <- kulldorff(district_xy,
                       cases = district$cases,
                       population = district$T_POP,
                       expected.cases = district$expected,
                       pop.upper.bound = 0.5, # include at most 50% of the population in a window
                       n.simulations = 499,
                       alpha.level = 0.2)

```

The function plots the histogram of the log-likelihood ratios simulated under the null hypothesis by Monte Carlo simulation. The observed value of the most likely cluster identified from all possible scans is compared to this distribution to determine its significance. All outputs are saved into an R object, here called `kd_Wfever`. Unfortunately, the package does not provide any summary or visualization of the results, but we can explore the output object.

```{r kd_outputs, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}
names(kd_Wfever)

```

First, we can focus on the most likely cluster and explore its characteristics.

```{r kd_mlc, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}

# We can see which districts (row number) belong to this cluster
kd_Wfever$most.likely.cluster$location.IDs.included

# standardized incidence ratio
kd_Wfever$most.likely.cluster$SMR

# number of observed and expected cases in this cluster
kd_Wfever$most.likely.cluster$number.of.cases
kd_Wfever$most.likely.cluster$expected.cases

```
`r length(kd_Wfever$most.likely.cluster$location.IDs.included)` districts belong to the cluster and its number of cases is `r signif(kd_Wfever$most.likely.cluster$SMR, 2)` times higher than the expected number of cases.

Similarly, we can study the secondary clusters. Results are saved in a list.

```{r kd_sc, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}

# Number of secondary clusters
length(kd_Wfever$secondary.clusters)

# retrieve data for all secondary clusters into a table
df_secondary_clusters <- data.frame(SMR = sapply(kd_Wfever$secondary.clusters, '[[', 5),  
                          number.of.cases = sapply(kd_Wfever$secondary.clusters, '[[', 3),
                          expected.cases = sapply(kd_Wfever$secondary.clusters, '[[', 4),
                          p.value = sapply(kd_Wfever$secondary.clusters, '[[', 8))

print(df_secondary_clusters)
```

We only have one secondary cluster composed of one district.

```{r plt_clusters, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}

# create empty column to store cluster informations
district$k_cluster <- NA

# save cluster information from kulldorff outputs
district$k_cluster[kd_Wfever$most.likely.cluster$location.IDs.included] <- 'Most likely cluster'

for(i in seq_along(kd_Wfever$secondary.clusters)){
  district$k_cluster[kd_Wfever$secondary.clusters[[i]]$location.IDs.included] <- paste(
    'Secondary cluster', i, sep = ' ')
}
#district$k_cluster[is.na(district$k_cluster)] <- "No cluster"


# create map
mf_map(x = district,
       var = "k_cluster",
       type = "typo",
       cex = 2,
       col_na = "white",
       pal = mf_get_pal(palette = "Reds", n = 3)[1:2],
       leg_title = "Clusters")

mf_layout(title = "Cluster using Kulldorff scan statistic")
```

::: callout-tip
#### To go further ...
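In this example, the expected number of cases was defined using the population count, but note that standardization over other variables such as age could also be implemented, for example by computing stratified expected counts (see the `expected()` function of `SpatialEpi`) and passing them through the `expected.cases` argument of the `kulldorff()` function.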

In addition, this cluster analysis was performed using the spatial scan only, but keep in mind that this method of cluster detection can also be applied to spatio-temporal data, where a cluster is defined as an abnormal number of cases in a delimited spatial area during a given period of time. The windows of observation are then defined by a center, a radius and a time period. For this analysis, you can take a look at the function `scan_eb_poisson()` in the package `scanstatistics` [@scanstatistics].
:::

## Conclusion

```{r conclusion, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}

par(mfrow = c(1, 2))

# create map
mf_map(x = district,
       var = "lm_class",
       type = "typo",
       cex = 2,
       col_na = "white",
       pal = c("#6D0026" , "blue",  "white") , # "#FF755F","#7FABD3" ,
       leg_title = "Clusters")

mf_layout(title = "Cluster using Local Moran's I statistic")

# create map
mf_map(x = district,
       var = "k_cluster",
       type = "typo",
       cex = 2,
       col_na = "white",
       pal = mf_get_pal(palette = "Reds", n = 3)[1:2],
       leg_title = "Clusters")

mf_layout(title = "Cluster using Kulldorff scan statistic")

```

Both methods identified significant clusters, and both identified a cluster around Phnom Penh after standardization for population counts. However, the identified clusters do not rely on the same assumptions. While Moran's test asks whether there is any autocorrelation between neighboring districts (i.e., second order effects of infection), the Kulldorff scan statistic asks whether there is any heterogeneity in the distribution of cases. Neither test can tell which infection process (first or second order) drives the studied disease; prior knowledge of the disease will help select the most appropriate test.

::: callout-tip
In this example, Cambodia is treated as an island, i.e., there are no data outside its borders. In reality, some clusters can extend across country borders. You should be aware that such clusters will likely not be detected by these analyses. This border effect is still a hot topic in spatial studies and there is no conventional way to deal with it. The literature offers some suggestions on how to deal with this border effect, such as assigning weights or extrapolating data.
:::