kulldorf scan statistics

ff30d4c0 · lea.douchet_ird.fr · ef5f30f0 · ff30d4c0 · ef5f30f0 · ff30d4c0
Commit ff30d4c0 authored 2 years ago by lea.douchet_ird.fr
--- a/07-basic_statistics.qmd
+++ b/07-basic_statistics.qmd
@@ -11,18 +11,19 @@ This section aims at providing some basic statistical tools to study the spatial
 In this section, we load data that reference the cases of an imaginary disease throughout Cambodia. Each point correspond to the geolocalisation of a case.

 ```{r load_cases, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}
+library(dplyr)
 library(sf)

 #Import Cambodia country border
-country = st_read("data_cambodia/cambodia.gpkg", layer = "country", quiet = TRUE)
+country <- st_read("data_cambodia/cambodia.gpkg", layer = "country", quiet = TRUE)
 #Import provincial administrative border of Cambodia
-education = st_read("data_cambodia/cambodia.gpkg", layer = "education", quiet = TRUE)
+education <- st_read("data_cambodia/cambodia.gpkg", layer = "education", quiet = TRUE)
 #Import district administrative border of Cambodia
-district = st_read("data_cambodia/cambodia.gpkg", layer = "district", quiet = TRUE)
+district <- st_read("data_cambodia/cambodia.gpkg", layer = "district", quiet = TRUE)

 # Import locations of cases from an imaginary disease
-cases = st_read("data_cambodia/cambodia.gpkg", layer = "cases", quiet = TRUE)
-cases = subset(cases, Disease == "W fever")
+cases <- st_read("data_cambodia/cambodia.gpkg", layer = "cases", quiet = TRUE)
+cases <- subset(cases, Disease == "W fever")

 ```

@@ -42,7 +43,7 @@ mf_map(x = cases, lwd = .5, col = "#990000", pch = 20, add = TRUE)

 ```

-In epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, ...) or a person at risk (e.g. a participant that may or may not experience the disease).  Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use district as the areal unit of the study.
+In epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, ...) or a person at risk (e.g. a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use the district as the areal unit of the study.

 ```{r district_aggregate, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}
 # Aggregate cases over districts
@@ -50,21 +51,21 @@ district$cases <- lengths(st_intersects(district, cases))

 ```

-The incidence ($\frac{cases}{population}$) is commonly use to represent cases distribution related to population density but other indicators exists. As example, the standardized incidence ratios (SIRs) represents the deviation of observed and expected number of cases and is expressed as $SIR = \frac{Y_i}{E_i}$ with $Y_i$, the observed number of cases and $E_i$, the expected number of cases. In this study, we computed the expected number of cases in each district by assuming infections are homogeneously distributed across Cambodia, i.e. the incidence is the same in each district.
+The incidence ($\frac{cases}{population}$) is commonly use to represent cases distribution related to population density but other indicators exists. As example, the standardized incidence ratios (SIRs) represents the deviation of observed and expected number of cases and is expressed as $SIR = \frac{Y_i}{E_i}$ with $Y_i$, the observed number of cases and $E_i$, the expected number of cases. In this study, we computed the expected number of cases in each district by assuming infections are homogeneously distributed across Cambodia, i.e. the incidence is the same in each district. The SIR therefore represents the deviation of incidence compared to the averaged average incidence across Cambodia.

 ```{r indicators, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, fig.height=4, class.output="code-out", warning=FALSE, message=FALSE}

 # Compute incidence in each district (per 100 000 population)
-district$incidence = district$cases/district$T_POP * 100000
+district$incidence <- district$cases/district$T_POP * 100000

 # Compute the global risk
-rate = sum(district$cases)/sum(district$T_POP)
+rate <- sum(district$cases)/sum(district$T_POP)

 # Compute expected number of cases 
-district$expected = district$T_POP * rate
+district$expected <- district$T_POP * rate

 # Compute SIR
-district$SIR = district$cases / district$expected
+district$SIR <- district$cases / district$expected
 ```

 ```{r inc_visualization, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, fig.height=4, class.output="code-out", warning=FALSE, message=FALSE}
@@ -90,8 +91,8 @@ mf_layout(title = "Incidence of W Fever")

 # Plot SIRs
 # create breaks and associated color palette
-break_SIR = c(0, exp(mf_get_breaks(log(district$SIR), nbreaks = 8, breaks = "pretty")))
-col_pal = c("#273871", "#3267AD", "#6496C8", "#9BBFDD", "#CDE3F0", "#FFCEBC", "#FF967E", "#F64D41", "#B90E36")
+break_SIR <- c(0, exp(mf_get_breaks(log(district$SIR), nbreaks = 8, breaks = "pretty")))
+col_pal <- c("#273871", "#3267AD", "#6496C8", "#9BBFDD", "#CDE3F0", "#FFCEBC", "#FF967E", "#F64D41", "#B90E36")

 mf_map(x = district,
       var = "SIR",
@@ -115,54 +116,201 @@ In this example, we standardized the cases distribution for population count. Th

 ## Cluster analysis

-Since this W fever seems to have a heterogeneous distribution across Cambodia, it would be interesting to study where excess of cases appears, i.e. to identify clusters of the disease. The first question is to wonder if data are auto correlated or spatially independent, i.e. study if neighboring districts are likely to have similar incidence.
+Since this W fever seems to have a heterogeneous distribution across Cambodia, it would be interesting to study where excess of cases appears, i.e. to identify clusters of the disease. The definition of clusters emcompass many XXXXXXX

-In statistics, problems are usually expressed by defining two hypothesis : the null hypothesis (H0), i.e. an a priori hypothesis of the studied phenomenon (e.g. the situation is a random) and the alternative hypothesis (HA), e.g. the situation is not random. The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis.
+The first question is to wonder if data are auto correlated or spatially independent, i.e. study if neighboring districts are likely to have similar incidence.

-### Spatial autocorrelation (Moran's I test)
+### Test for spatial autocorrelation (Moran's I test)

 A popular test for spatial autocorrelation is the Moran's test. This test tells us whether nearby units tend to exhibit similar incidences. It ranges from -1 to +1. A value of -1 denote that units with low rates are located near other units with high rates, while a Moran's I value of +1 indicates a concentration of spatial units exhibiting similar rates.

-Here the statistics hypothesis are :
+::: callout-note
+## Statistical test

-   H0 : the distribution of cases is spatially independant, i.e. Moran's I value is 0.
+In statistics, problems are usually expressed by defining two hypothesis : the null hypothesis (H0), i.e. an *a priori* hypothesis of the studied phenomenon (e.g. the situation is a random) and the alternative hypothesis (HA), e.g. the situation is not random. The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis.

-   H1: the distribution of cases is spatially autocorrelated, i.e. Moran's I value is different than 0.
+The Moran's statistics is :

-We will compute the Moran's statistics using `spdep` and `Dcluster` packages. `spdep` package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. `Dcluster` package provides a set of functions for the detection of spatial clusters of disease using count data.
+$$I = \frac{N}{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}}\frac{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}(Y_i-\bar{Y})(Y_j - \bar{Y})}{\sum_{i=1}^N(Y_i-\bar{Y})^2}$$ with :
+
+-   $N$: the number of polygons,
+
+-   $w_{ij}$: is a matrix of spatial weight with zeroes on the diagonal (i.e., $w_{ii}=0$). For example, if polygons are neighbors, the weight takes the value $1$ otherwise it take the value $0$.
+
+-   $Y_i$: the variable of interest,
+
+-   $\bar{Y}$: the mean value of $Y$.
+
+Under the Moran's test, the statistics hypothesis are :
+
+-   **H0** : the distribution of cases is spatially independent, i.e. $I=0$.
+
+-   **H1**: the distribution of cases is spatially autocorrelated, i.e. $I\ne0$.
+:::
+
+We will compute the Moran's statistics using `spdep` and `Dcluster` packages. `spdep` package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. In this example, we use `poly2nb()` and `nb2listw()`. These function respectively detect the neighboring polygons and assign weight corresponding to $1/\#\ of\ neighbors$. `Dcluster` package provides a set of functions for the detection of spatial clusters of disease using count data.

 ```{r MoransI, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}

 library(spdep) # Functions for creating spatial weight, spatial analysis
-library(DCluster)  # Package with functions for spatial cluster analysis)
+library(DCluster)  # Package with functions for spatial cluster analysis

-qnb <- poly2nb(district)
-q_listw <- nb2listw(qnb, style = 'W') # row-standardized weights
+queen_nb <- poly2nb(district) # Neighbors according to queen case
+q_listw <- nb2listw(queen_nb, style = 'W') # row-standardized weights

 # Moran's I test
-moranI.test(cases ~ offset(log(expected)), 
+m_test <- moranI.test(cases ~ offset(log(expected)), 
                  data = district,
                  model = 'poisson',
                  R = 499,
                  listw = q_listw,
-                  n = 159,
-                  S0 = Szero(q_listw))
+                  n = length(district$cases), # number of regions
+                  S0 = Szero(q_listw)) # Global sum of weights
+print(m_test)
+plot(m_test)

 ```

+The Moran's statistics is here $I =$ `r signif(m_test$t0, 2)`. When comparing its value to the H0 distribution (built under `r m_test$R` simulations), the probability of observing such a I value under the null hypothesis, i.e. the distribution of cases is spatially independent, is $p_{value} =$ `r signif(( 1+ (sum((-abs(as.numeric(m_test$t0-mean(m_test$t))))>as.numeric(m_test$t-mean(m_test$t)))) + (sum(abs(as.numeric(m_test$t0-mean(m_test$t)))<as.numeric(m_test$t-mean(m_test$t)))) )/(m_test$R+1), 2)`. We therefore reject H0 with error risk of $\alpha = 5\%$. The distribution of cases is therefore autocorrelated across districts in Cambodia.
+
+::: callout-note
+## Statistic distributions
+
+In mathematics, a probability distribution is a mathematical expression that represents what we would expect due to random chance. The choice of the probability distribution relies on the type of data you use (continuous, count, binary). In general, three distribution a used while studying disease rates, the binomial, the poisson and the Poisson-gamma mixture (a.k.a negative binomial) distributions.
+
+The default Global Moran's I test assume data are normally distributed. It implies that the mean However, in epidemiology, rates and count values are usually not normally distributed and their variance is not homogeneous across the districts since the size of population at risk differs. to be the same since more variability occurs when we study smaller populations.
+
+While many measures may be appropriately assessed under the normality assumptions of the previous Global Moran's I, in general disease rates are not best assessed this way. This is because the rates themselves may not be normally distributed, but also because the variance of each rate likely differs because of different size population at risk. For example the previous test assumed that we had the same level of certainty about the rate in each county, when in fact some counties have very sparse data (with high variance) and others have adequate data (with relatively lower variance).
+
+```{r distribution, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}
+
+# dataset statistics
+m_cases <- mean(district$cases)
+sd_cases <- sd(district$cases)
+
+curve(dnorm(x, m_cases, sd_cases), from = -5, to = 16, ylim = c(0, 0.4), col = "blue", lwd = 1, 
+      xlab = "Number of cases", ylab = "Probability", main = "Histogram of observed data compared\nto Normal and Poisson distributions")
+points(0:max(district$cases), dpois(0:max(district$cases), m_cases),type = 'b ', pch = 20, col = "red", ylim = c(0, 0.6), lty = 2)
+hist(district$cases,  add = TRUE, probability = TRUE)
+
+legend("topright", legend = c("Normal distribution", "Poisson distribution", "Observed distribution"), col = c("blue", "red", "black"),pch = c(NA, 20, NA), lty = c(1, 2, 1))
+```
+:::
+
 ### Spatial scan statistics

-While Moran's indice focuses on finding correlation between neighboring polygons, the spatial scan statistic compare the incidence level of a given windows of observation with the incidence level outside of this windows.
+While Moran's indice focuses on testing for autocorrelation between neighboring polygons (under the null assumption of spatial independance), the spatial scan statistic aims at identifying an abnormal higher risk in a given region compared to the risk outside of this region (under the null assumption of homogeneous distribution). The conception of a cluster is therefore different between the two methods.
+
+The function `kulldorf` from the package `SpatialEpi`is a simple tool to implement spatial-only scan statistics. Briefly, the kulldorf scan statistics scan the area for clusters using several steps:
+
+1.  It create a circular window of observation by defining a single location and an associated radius of the windows varying from 0 to a large number that depends on population distribution (largest radius could includes 50% of the population).
+
+2.  It aggregates the count of events and the population at risk (or an expected count of events) inside and outside the window of observation.
+
+3.  Finally, it computes the likelihood ratio to test whether the risk is equal inside versus outside the windows (H0) or greater inside the observed window
+
+4.  These 3 steps are repeted for each location and each possible windows-radii.
+
+```{r spatialEpi, eval = TRUE, echo = TRUE, nm = TRUE, class.output="code-out", warning=FALSE, message=FALSE}
+
+library("SpatialEpi")

-The package `SpatialEpi`
+```
+
+The use of R spatial object is not implementes in `kulldorf()` function. It uses instead matrix of xy coordinates that represents the centroids of the districts. A given district is included into the observed circular window if its centroids falls into the circle.
+
+```{r kd_centroids, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}
+
+district_xy <- st_centroid(district) %>% 
+  st_coordinates()

+head(district_xy)

-### Population-based clusters (kulldorf statistic)
+```
+
+We can then call kulldorff function (you are strongly encourage to call `?kulldorf` to properly call the function). The `alpha.level` threshold filter for the secondary clusters that will be retained. The most-likely cluster will be saved whatever its significance.

-Kulldorff 's spatial scan statistic identifies the most likely disease clusters maximizing the likelihood that disease cases are located within a set of concentric circles that are moved across the study area.
+```{r kd_test, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}

-### Expectation-based cluster
+kd_Wfever <- kulldorff(district_xy, 
+                cases = district$cases,
+                population = district$T_POP,
+                expected.cases = district$expected,
+                pop.upper.bound = 0.5, # include maximum 50% of the population in a windows
+                n.simulations = 499,
+                alpha.level = 0.2)
+
+```
+
+All outputs are saved into the R object `kd_Wfever`. Unfortunately the package did not developed any summary and visualization of the results but we can explore the output object. 
+
+```{r kd_outputs, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}
+names(kd_Wfever)
+
+```
+
+First, we can focus on the most likely cluster and explore its characteristics. 
+
+```{r kd_mlc, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}
+
+# We can see which districts (r number) belong to this cluster
+kd_Wfever$most.likely.cluster$location.IDs.included
+
+# standardized incidence ratio
+kd_Wfever$most.likely.cluster$SMR
+
+# number of observed and expected cases in this cluster
+kd_Wfever$most.likely.cluster$number.of.cases
+kd_Wfever$most.likely.cluster$expected.cases
+
+```
+`r length(kd_Wfever$most.likely.cluster$location.IDs.included)` districts belong to the cluster and its number of cases is `r signif(kd_Wfever$most.likely.cluster$SMR, 2)` times higher than the expected number of case.
+
+Similarly, we could study the secondary clusters. Results are saved in a list. 
+
+```{r kd_sc, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE}
+
+# We can see which districts (r number) belong to this cluster
+length(kd_Wfever$secondary.clusters)
+
+# retrieve data for all secondary clusters into a table
+df_secondary_clusters <- data.frame(SMR = sapply(kd_Wfever$secondary.clusters, '[[', 5),  
+                          number.of.cases = sapply(kd_Wfever$secondary.clusters, '[[', 3),
+                          expected.cases = sapply(kd_Wfever$secondary.clusters, '[[', 4),
+                          p.value = sapply(kd_Wfever$secondary.clusters, '[[', 8))
+
+print(df_secondary_clusters)
+
+
+```
+
+We only have one secondary cluster composed of one district.
+
+
+```{r plt_clusters, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}
+
+# create empty column to store cluster informations
+district$k_cluster <- NA
+
+# save cluster informations from kulldorff outputs
+district$k_cluster[kd_Wfever$most.likely.cluster$location.IDs.included] <- 'Most likely cluster'
+
+for(i in 1:length(kd_Wfever$secondary.clusters)){
+district$k_cluster[kd_Wfever$secondary.clusters[[i]]$location.IDs.included] <- paste(
+  'Secondary cluster ', i, sep = '')
+}
+
+# create map
+mf_map(x = district,
+       var = "k_cluster",
+       type = "typo",
+       cex = 2,
+       leg_title = "Clusters")
+mf_layout(title = "Cluster using kulldorf scan statistic")
+
+
+
+```
+This cluster analysis was performed solely using the spatial 

-In many case, population is not specific enough to

-### To go further ...
--- a/public/.07-basic_statistics.html.swp
+++ b/public/.07-basic_statistics.html.swp
--- a/public/01-introduction.html
+++ b/public/01-introduction.html
@@ -2,7 +2,7 @@
 <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head>

 <meta charset="utf-8">
-<meta name="generator" content="quarto-1.1.251">
+<meta name="generator" content="quarto-1.1.189">

 <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">


--- a/public/07-basic_statistics.html
+++ b/public/07-basic_statistics.html
--- a/public/07-basic_statistics_files/figure-html/MoransI-1.png
+++ b/public/07-basic_statistics_files/figure-html/MoransI-1.png
--- a/public/07-basic_statistics_files/figure-html/distribution-1.png
+++ b/public/07-basic_statistics_files/figure-html/distribution-1.png
--- a/public/07-basic_statistics_files/figure-html/inc_visualization-1.png
+++ b/public/07-basic_statistics_files/figure-html/inc_visualization-1.png
--- a/public/07-basic_statistics_files/figure-html/incidence_visualization-1.png
+++ b/public/07-basic_statistics_files/figure-html/incidence_visualization-1.png
--- a/public/07-basic_statistics_files/figure-html/kd_test-1.png
+++ b/public/07-basic_statistics_files/figure-html/kd_test-1.png
--- a/public/07-basic_statistics_files/figure-html/plt_clusters-1.png
+++ b/public/07-basic_statistics_files/figure-html/plt_clusters-1.png
--- a/public/search.json
+++ b/public/search.json
--- a/public/search.json.orig
+++ b/public/search.json.orig