diff --git a/07-basic_statistics.qmd b/07-basic_statistics.qmd index cdfb0ecc8949c1504e63afd9b3b9ccc9720b4a1c..3e8443b375c144f8fafcea86e03b7ba311a6ca45 100644 --- a/07-basic_statistics.qmd +++ b/07-basic_statistics.qmd @@ -4,11 +4,11 @@ bibliography: references.bib # Basic statistics for spatial analysis -This section aims at providing some basic statistical tools to study the spatial distribution of epidemiological data. If you wish to go further into these analysis and their limitations you can consult the tutorial "[Spatial Epidemiology](https://mkram01.github.io/EPI563-SpatialEPI/index.html)" from M. Kramer from which the statistical analysis of his section were adapted. +This section aims at providing some basic statistical tools to study the spatial distribution of epidemiological data. If you wish to go further into spatial statistics applied to epidemiology and their limitations you can consult the tutorial "[Spatial Epidemiology](https://mkram01.github.io/EPI563-SpatialEPI/index.html)" from M. Kramer from which the statistical analysis of this section was adapted. We will use ## Import and visualize epidemiological data -In this section, we load data that reference the cases of an imaginary disease throughout Cambodia. Each point correspond to the geolocalisation of a case. +In this section, we load data that reference the cases of an imaginary disease, the W fever, throughout Cambodia. Each point corresponds to the geo-localization of a case. ```{r load_cases, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE} library(dplyr) @@ -43,7 +43,7 @@ mf_map(x = cases, lwd = .5, col = "#990000", pch = 20, add = TRUE) ``` -In epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, ...) or a person at risk (e.g. a participant that may or may not experience the disease). 
Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use the district as the areal unit of the study. +In epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, we cannot precisely tell if this observation represents an event of interest (e.g., illness, death, ...) or a person at risk (e.g., a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appear as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use the district as the areal unit of the study. ```{r district_aggregate, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE} # Aggregate cases over districts @@ -51,7 +51,7 @@ district$cases <- lengths(st_intersects(district, cases)) ``` -The incidence ($\frac{cases}{population}$) is commonly use to represent cases distribution related to population density but other indicators exists. As example, the standardized incidence ratios (SIRs) represents the deviation of observed and expected number of cases and is expressed as $SIR = \frac{Y_i}{E_i}$ with $Y_i$, the observed number of cases and $E_i$, the expected number of cases. In this study, we computed the expected number of cases in each district by assuming infections are homogeneously distributed across Cambodia, i.e. the incidence is the same in each district. The SIR therefore represents the deviation of incidence compared to the averaged average incidence across Cambodia. 
+The incidence ($\frac{cases}{population}$) expressed per 100,000 population is commonly use to represent cases distribution related to population density but other indicators exists. As example, the standardized incidence ratios (SIRs) represent the deviation of observed and expected number of cases and is expressed as $SIR = \frac{Y_i}{E_i}$ with $Y_i$, the observed number of cases and $E_i$, the expected number of cases. In this study, we computed the expected number of cases in each district by assuming infections are homogeneously distributed across Cambodia, i.e., the incidence is the same in each district. The SIR therefore represents the deviation of incidence compared to the average incidence across Cambodia. ```{r indicators, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, fig.height=4, class.output="code-out", warning=FALSE, message=FALSE} @@ -86,12 +86,13 @@ mf_map(x = district, var = "incidence", type = "choro", pal = "Reds 3", + breaks = exp(mf_get_breaks(log(district$incidence+1), breaks = "pretty"))-1, leg_title = "Incidence \n(per 100 000)") mf_layout(title = "Incidence of W Fever") # Plot SIRs # create breaks and associated color palette -break_SIR <- c(0, exp(mf_get_breaks(log(district$SIR), nbreaks = 8, breaks = "pretty"))) +break_SIR <- c(0,exp(mf_get_breaks(log(district$SIR), nbreaks = 8, breaks = "pretty"))) col_pal <- c("#273871", "#3267AD", "#6496C8", "#9BBFDD", "#CDE3F0", "#FFCEBC", "#FF967E", "#F64D41", "#B90E36") mf_map(x = district, @@ -104,7 +105,7 @@ mf_map(x = district, mf_layout(title = "Standardized Incidence Ratio of W Fever") ``` -These maps illustrates the spatial heterogenity of the cases. The incidence shows how the disease vary from one district to another while the SIR highlight districts that have : +These maps illustrate the spatial heterogeneity of the cases. 
The incidence shows how the disease vary from one district to another while the SIR highlight districts that have: - higher risk than average (SIR \> 1) when standardized for population @@ -113,20 +114,22 @@ These maps illustrates the spatial heterogenity of the cases. The incidence show - average risk (SIR \~ 1) when standardized for population ::: callout-tip -### To go futher ... +### To go further ... -In this example, we standardized the cases distribution for population count. This simple standardization assume that the risk of contracting the disease is similar for each person. However, assumption does not hold for all diseases and for all observed events since confounding effects can create nuisance into the interpretations (e.g. the number of childhood illness and death outcomes in a district are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don't want to analyze (e.g. sex ratio, occupations, age pyramid). +In this example, we standardized the cases distribution for population count. This simple standardization assumes that the risk of contracting the disease is similar for each person. However, assumption does not hold for all diseases and for all observed events since confounding effects can create nuisance into the interpretations (e.g., the number of childhood illness and death outcomes in a district are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don't want to analyze (e.g., sex ratio, occupations, age pyramid). + +In addition, one can wonder what does an $SIR \~ 1$ means, i.e., what is the threshold to decide whether the SIR is greater, lower or equivalent to 1. 
The significant of the SIR can be tested globally (to determine whether or not the incidence is homogeneously distributed) and locally in each district (to determine Which district have an SIR different than 1). We won't perform these analyses in this tutorial but you can look at the function `?achisq.test()` (from `Dcluster` package [@DCluster]) and `?probmap()` (from `spdep` package [@spdep]) to compute these statistics. ::: ## Cluster analysis ### General introduction -Why studying clusters in epidemiology ? Cluster analysis help identifying unusual patterns that occurs during a given period of time. The underlying ultimate goal of such analysis is to explain the observation of such patterns. In epidemiology, we can distinguish two types of process that would explain heterogeneity in case distribution : +Why studying clusters in epidemiology? Cluster analysis help identifying unusual patterns that occurs during a given period of time. The underlying ultimate goal of such analysis is to explain the observation of such patterns. In epidemiology, we can distinguish two types of process that would explain heterogeneity in case distribution: -- The **1st order effects** are the spatial variations of cases distribution caused by underlying properties of environment or the population structure itself. In such process individual get infected independently from the rest of the population. Such process includes the infection through a environment at risk as, for example, air pollution, contaminated waters or soils and UV exposition. This effect assume that the observed pattern are caused by a difference in risk intensity. +- The **1st order effects** are the spatial variations of cases distribution caused by underlying properties of environment or the population structure itself. In such process individual get infected independently from the rest of the population. 
Such process includes the infection through an environment at risk as, for example, air pollution, contaminated waters or soils and UV exposition. This effect assume that the observed pattern is caused by a difference in risk intensity. -- The **2nd order effects** describes process of spread, contagion and diffusion of diseases caused by interactions between individuals. This includes transmission of infectious disease by proximity, but also the transmission of non-infectious disease, for example, with the diffusion of social norms within networks. This effect assume that the observed pattern are caused by correlations or co-variations. +- The **2nd order effects** describes process of spread, contagion and diffusion of diseases caused by interactions between individuals. This includes transmission of infectious disease by proximity, but also the transmission of non-infectious disease, for example, with the diffusion of social norms within networks. This effect assume that the observed pattern is caused by correlations or co-variations. No statistical methods could distinguish between these competing processes since their outcome results in similar pattern of points. The cluster analysis help describing the magnitude and the location of pattern but in no way could answer the question of why such patterns occurs. It is therefore a step that help detecting cluster for description and surveillance purpose and rising hypothesis on the underlying process that will lead further investigations. @@ -135,11 +138,11 @@ Knowledge about the disease and its transmission process could orientate the cho ::: callout-note ### Statistic tests and distributions -In statistics, problems are usually expressed by defining two hypothesis : the null hypothesis (H0), i.e. an *a priori* hypothesis of the studied phenomenon (e.g. the situation is a random) and the alternative hypothesis (HA), e.g. the situation is not random. 
The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis. +In statistics, problems are usually expressed by defining two hypotheses: the null hypothesis (H0), i.e., an *a priori* hypothesis of the studied phenomenon (e.g., the situation is a random) and the alternative hypothesis (HA), e.g., the situation is not random. The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis. -In mathematics, a probability distribution is a mathematical expression that represents what we would expect due to random chance. The choice of the probability distribution relies on the type of data you use (continuous, count, binary). In general, three distribution a used while studying disease rates, the Binomial, the Poisson and the Poisson-gamma mixture (a.k.a negative binomial) distributions. +In mathematics, a probability distribution is a mathematical expression that represents what we would expect due to random chance. The choice of the probability distribution relies on the type of data you use (continuous, count, binary). In general, three distribution a used while studying disease rates, the Binomial, the Poisson and the Poisson-gamma mixture (also known as negative binomial) distributions. -Many the statistical tests assume by default that data are normally distributed. It implies that your variable is continuous and that all data could easily be represented by two parameters, the mean and the variance, i.e. each value have the same level of certainty. If many measure can be assessed under the normality assumption, this is usually not the case in epidemiology with strictly positives rates and count values that 1) does not fit the normal distribution and 2) does not provide with the same degree of certainty since variances likely differ between district due to different population size, i.e. 
some district have very sparse data (with high variance) while other have adequate data (with lower variance). +Many the statistical tests assume by default that data are normally distributed. It implies that your variable is continuous and that all data could easily be represented by two parameters, the mean and the variance, i.e., each value have the same level of certainty. If many measure can be assessed under the normality assumption, this is usually not the case in epidemiology with strictly positives rates and count values that 1) does not fit the normal distribution and 2) does not provide with the same degree of certainty since variances likely differ between district due to different population size, i.e., some district have very sparse data (with high variance) while other have adequate data (with lower variance). ```{r distribution, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE} @@ -154,7 +157,7 @@ points(0:max(district$incidence), dpois(0:max(district$incidence), m_cases),type legend("topright", legend = c("Normal distribution", "Poisson distribution", "Observed distribution"), col = c("blue", "red", "black"),pch = c(NA, 20, NA), lty = c(1, 2, 1)) ``` -In this tutorial, we used the poisson distribution in our statistical tests. +In this tutorial, we used the Poisson distribution in our statistical tests. ::: ### Test for spatial autocorrelation (Moran's I test) @@ -166,26 +169,26 @@ A popular test for spatial autocorrelation is the Moran's test. 
This test tells ::: callout-note ##### Moran's I test -The Moran's statistics is : +The Moran's statistics is: -$$I = \frac{N}{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}}\frac{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}(Y_i-\bar{Y})(Y_j - \bar{Y})}{\sum_{i=1}^N(Y_i-\bar{Y})^2}$$ with : +$$I = \frac{N}{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}}\frac{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}(Y_i-\bar{Y})(Y_j - \bar{Y})}{\sum_{i=1}^N(Y_i-\bar{Y})^2}$$ with: - $N$: the number of polygons, -- $w_{ij}$: is a matrix of spatial weight with zeroes on the diagonal (i.e., $w_{ii}=0$). For example, if polygons are neighbors, the weight takes the value $1$ otherwise it take the value $0$. +- $w_{ij}$: is a matrix of spatial weight with zeroes on the diagonal (i.e., $w_{ii}=0$). For example, if polygons are neighbors, the weight takes the value $1$ otherwise it takes the value $0$. - $Y_i$: the variable of interest, - $\bar{Y}$: the mean value of $Y$. -Under the Moran's test, the statistics hypothesis are : +Under the Moran's test, the statistics hypotheses are: -- **H0** : the distribution of cases is spatially independent, i.e. $I=0$. +- **H0**: the distribution of cases is spatially independent, i.e., $I=0$. -- **H1**: the distribution of cases is spatially autocorrelated, i.e. $I\ne0$. +- **H1**: the distribution of cases is spatially autocorrelated, i.e., $I\ne0$. ::: -We will compute the Moran's statistics using `spdep`[@spdep] and `Dcluster`[@DCluster] packages. `spdep` package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. In this example, we use `poly2nb()` and `nb2listw()`. These function respectively detect the neighboring polygons and assign weight corresponding to $1/\#\ of\ neighbors$. `Dcluster` package provides a set of functions for the detection of spatial clusters of disease using count data. +We will compute the Moran's statistics using `spdep`[@spdep] and `Dcluster`[@DCluster] packages. 
`spdep` package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. In this example, we use `poly2nb()` and `nb2listw()`. These functions respectively detect the neighboring polygons and assign weight corresponding to $1/\#\ of\ neighbors$. `Dcluster` package provides a set of functions for the detection of spatial clusters of disease using count data. ```{r MoransI, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE} @@ -212,17 +215,17 @@ The Moran's statistics is here $I =$ `r signif(m_test$t0, 2)`. When comparing it #### Moran's I local test -The global Moran's test provides us a global statistical value informing whether autocorrelation occurs over the territory but does not inform on where does these correlation occurs, i.e. what is the locations of the clusters. To identify such cluster we can decompose the Moran's I statistic to extract local informations of the level of correlation of each district and its neighbors. This is called the Local Moran's I LISA statistic. Because the Local Moran's I LISA statistic test each district for autocorrelation independently, concern are raised about multiple testing limitations that increase the Type I error ($\alpha$) of the statistical tests. The use of local test should therefore be study in light of explore and describes clusters once the global test detected autocorrelation. +The global Moran's test provides us a global statistical value informing whether autocorrelation occurs over the territory but does not inform on where does these correlations occurs, i.e., what is the locations of the clusters. To identify such cluster, we can decompose the Moran's I statistic to extract local information of the level of correlation of each district and its neighbors. This is called the Local Moran's I LISA statistic. 
Because the Local Moran's I LISA statistic test each district for autocorrelation independently, concern is raised about multiple testing limitations that increase the Type I error ($\alpha$) of the statistical tests. The use of local test should therefore be study in light of explore and describes clusters once the global test detected autocorrelation. ::: callout-note ##### Statistical test -For each district $i$, the Moran's statistics is : +For each district $i$, the Local Moran's I statistics is: $$I_i = \frac{(Y_i-\bar{Y})}{\sum_{i=1}^N(Y_i-\bar{Y})^2}\sum_{j=1}^Nw_{ij}(Y_j - \bar{Y}) \text{ with } I = \sum_{i=1}^NI_i/N$$ ::: -The `localmoran()`function from the package `spdep` treats the variable of interest as if it was normally distributed. In some cases, this assumption could be reasonable for incidence rate, especially when the areal units of analysis have sufficiently large population count suggesting that the values have similar level of variances. Unfortunately, the local moran's test has not been implemented for poisson distribution (population not large enough in some districts) in `spdep` package. However Bivand **et al.** [@bivand2008applied] provided some code to manual perform the analysis using poisson distribution and was further implemented in the course "[Spatial Epidemiology](https://mkram01.github.io/EPI563-SpatialEPI/index.html)" . +The `localmoran()`function from the package `spdep` treats the variable of interest as if it was normally distributed. In some cases, this assumption could be reasonable for incidence rate, especially when the areal units of analysis have sufficiently large population count suggesting that the values have similar level of variances. Unfortunately, the local Moran’s test has not been implemented for Poisson distribution (population not large enough in some districts) in `spdep` package. 
However, Bivand **et al.** [@bivand2008applied] provided some code to manual perform the analysis using Poisson distribution and was further implemented in the course "[Spatial Epidemiology](https://mkram01.github.io/EPI563-SpatialEPI/index.html)”. @@ -271,15 +274,15 @@ district$pval_lm <- punif((diff + 1) / (nsim + 1)) For each district, we obtain a p-value based on permutations process -A conventional way of plotting these results is to classify the districts into 5 classes based on local Moran's I outputs. The classification of cluster that are significantly autocorrelated to their neighbors is performed based on a comparison of the scaled incidence in the district compared to the scaled weighted averaged incidence of it neighboring districts (computed with `lag.listw()`) : +A conventional way of plotting these results is to classify the districts into 5 classes based on local Moran's I output. The classification of cluster that are significantly autocorrelated to their neighbors is performed based on a comparison of the scaled incidence in the district compared to the scaled weighted averaged incidence of it neighboring districts (computed with `lag.listw()`): - Districts that have higher-than-average rates in both index regions and their neighbors and showing statistically significant positive values for the local $I_i$ statistic are defined as __High-High__ (hotspot of the disease) -- Districts that have lower-than-average rates in both index regions and their neighbors adn showing statistically significant positive values for the local $I_i$ statistic are defined as __Low-Low__ (coldspot of the disease). +- Districts that have lower-than-average rates in both index regions and their neighbors and showing statistically significant positive values for the local $I_i$ statistic are defined as __Low-Low__ (cold spot of the disease). 
- Districts that have higher-than-average rates in the index regions and lower-than-average rates in their neighbors, and showing statistically significant negative values for the local $I_i$ statistic are defined as __High-Low__(outlier with high incidence in an area with low incidence). -- Districts that have lower-than-average rates in the index regions and higher-than-average rates in their neighbors, and showing statistically significant negative values for the local $I_i$ statistic are defined as __Low-High__(outlier of low incidence in area with high incidence). +- Districts that have lower-than-average rates in the index regions and higher-than-average rates in their neighbors, and showing statistically significant negative values for the local $I_i$ statistic are defined as __Low-High__ (outlier of low incidence in area with high incidence). - Districts with non-significant values for the $I_i$ statistic are defined as __Non-significant__. @@ -314,7 +317,7 @@ mf_map(x = district, pal = c("#6D0026" , "blue", "white") , # "#FF755F","#7FABD3" , leg_title = "Clusters") -mf_layout(title = "Cluster using Local moran'I statistic") +mf_layout(title = "Cluster using Local Moran's I statistic") @@ -324,17 +327,17 @@ mf_layout(title = "Cluster using Local moran'I statistic") ### Spatial scan statistics -While Moran's indice focuses on testing for autocorrelation between neighboring polygons (under the null assumption of spatial independance), the spatial scan statistic aims at identifying an abnormal higher risk in a given region compared to the risk outside of this region (under the null assumption of homogeneous distribution). The conception of a cluster is therefore different between the two methods. 
+While Moran's indices focus on testing for autocorrelation between neighboring polygons (under the null assumption of spatial independence), the spatial scan statistic aims at identifying an abnormal higher risk in a given region compared to the risk outside of this region (under the null assumption of homogeneous distribution). The conception of a cluster is therefore different between the two methods. The function `kulldorff` from the package `SpatialEpi` [@SpatialEpi] is a simple tool to implement spatial-only scan statistics. Briefly, the kulldorff scan statistics scan the area for clusters using several steps: -1. It create a circular window of observation by defining a single location and an associated radius of the windows varying from 0 to a large number that depends on population distribution (largest radius could includes 50% of the population). +1. It create a circular window of observation by defining a single location and an associated radius of the windows varying from 0 to a large number that depends on population distribution (largest radius could include 50% of the population). 2. It aggregates the count of events and the population at risk (or an expected count of events) inside and outside the window of observation. 3. Finally, it computes the likelihood ratio to test whether the risk is equal inside versus outside the windows (H0) or greater inside the observed window -4. These 3 steps are repeted for each location and each possible windows-radii. +4. These 3 steps are repeated for each location and each possible windows-radii. ```{r spatialEpi, eval = TRUE, echo = TRUE, nm = TRUE, class.output="code-out", warning=FALSE, message=FALSE} @@ -342,7 +345,7 @@ library("SpatialEpi") ``` -The use of R spatial object is not implementes in `kulldorff()` function. It uses instead matrix of xy coordinates that represents the centroids of the districts. A given district is included into the observed circular window if its centroids falls into the circle. 
+The use of R spatial object is not implements in `kulldorff()` function. It uses instead matrix of xy coordinates that represents the centroids of the districts. A given district is included into the observed circular window if its centroids fall into the circle. ```{r kd_centroids, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE} @@ -353,7 +356,7 @@ head(district_xy) ``` -We can then call kulldorff function (you are strongly encourage to call `?kulldorff` to properly call the function). The `alpha.level` threshold filter for the secondary clusters that will be retained. The most-likely cluster will be saved whatever its significance. +We can then call kulldorff function (you are strongly encouraged to call `?kulldorff` to properly call the function). The `alpha.level` threshold filter for the secondary clusters that will be retained. The most-likely cluster will be saved whatever its significance. ```{r kd_test, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE} @@ -367,7 +370,7 @@ kd_Wfever <- kulldorff(district_xy, ``` -All outputs are saved into an R object, here called `kd_Wfever`. Unfortunately the package did not developed any summary and visualization of the results but we can explore the output object. +All outputs are saved into an R object, here called `kd_Wfever`. Unfortunately, the package did not develop any summary and visualization of the results but we can explore the output object. ```{r kd_outputs, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=6, class.output="code-out", warning=FALSE, message=FALSE} names(kd_Wfever) @@ -390,7 +393,7 @@ kd_Wfever$most.likely.cluster$expected.cases ``` -`r length(kd_Wfever$most.likely.cluster$location.IDs.included)` districts belong to the cluster and its number of cases is `r signif(kd_Wfever$most.likely.cluster$SMR, 2)` times higher than the expected number of case. 
+`r length(kd_Wfever$most.likely.cluster$location.IDs.included)` districts belong to the cluster and its number of cases is `r signif(kd_Wfever$most.likely.cluster$SMR, 2)` times higher than the expected number of cases. Similarly, we could study the secondary clusters. Results are saved in a list. @@ -415,7 +418,7 @@ We only have one secondary cluster composed of one district. # create empty column to store cluster informations district$k_cluster <- NA -# save cluster informations from kulldorff outputs +# save cluster information from kulldorff outputs district$k_cluster[kd_Wfever$most.likely.cluster$location.IDs.included] <- 'Most likely cluster' for(i in 1:length(kd_Wfever$secondary.clusters)){ @@ -440,7 +443,7 @@ mf_layout(title = "Cluster using kulldorf scan statistic") ``` ::: callout-tip -#### To go futher ... +#### To go further ... In this example, the expected number of cases was defined using the population count but note that standardization over other variables as age could also be implemented with the `strata` parameter in the `kulldorff()` function. 
diff --git a/public/01-introduction.html b/public/01-introduction.html index 86f41f46e2a3c051c52d96df369faaa741186bbe..f8fb5f8063b6f346df7cd933595c055bb016ebe6 100644 --- a/public/01-introduction.html +++ b/public/01-introduction.html @@ -2,7 +2,7 @@ <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head> <meta charset="utf-8"> -<meta name="generator" content="quarto-1.1.251"> +<meta name="generator" content="quarto-1.1.189"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> @@ -302,12 +302,12 @@ div.csl-indent { <div class="cell"> <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(sf)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> <div class="cell-output cell-output-stderr"> -<pre><code>Linking to GEOS 3.10.2, GDAL 3.4.3, PROJ 8.2.1; sf_use_s2() is TRUE</code></pre> +<pre><code>Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE</code></pre> </div> <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>district <span class="ot"><-</span> <span class="fu">st_read</span>(<span class="st">"data_cambodia/district.shp"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> <div class="cell-output cell-output-stdout"> <pre class="code-out"><code>Reading layer `district' from data source - `/home/lucas/Documents/ForgeIRD/rspatial-for-onehealth/data_cambodia/district.shp' + `C:\Users\UNiK\Documents\R_works\IRD\Rspatial\rspatial-for-onehealth\data_cambodia\district.shp' using driver `ESRI Shapefile' Simple feature collection with 197 features and 10 fields Geometry type: MULTIPOLYGON @@ 
-348,7 +348,7 @@ Available layers: <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>road <span class="ot"><-</span> <span class="fu">st_read</span>(<span class="st">"data_cambodia/cambodia.gpkg"</span>, <span class="at">layer =</span> <span class="st">"road"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> <div class="cell-output cell-output-stdout"> <pre class="code-out"><code>Reading layer `road' from data source - `/home/lucas/Documents/ForgeIRD/rspatial-for-onehealth/data_cambodia/cambodia.gpkg' + `C:\Users\UNiK\Documents\R_works\IRD\Rspatial\rspatial-for-onehealth\data_cambodia\cambodia.gpkg' using driver `GPKG' Simple feature collection with 6 features and 9 fields Geometry type: MULTILINESTRING diff --git a/public/07-basic_statistics.html b/public/07-basic_statistics.html index 3e6528c54f888c18c6703dfba8cc9286da3f254c..6c97dbb1445c5d655469b4abcac690522ec5eb35 100644 --- a/public/07-basic_statistics.html +++ b/public/07-basic_statistics.html @@ -268,10 +268,10 @@ div.csl-indent { </header> -<p>This section aims at providing some basic statistical tools to study the spatial distribution of epidemiological data. If you wish to go further into these analysis and their limitations you can consult the tutorial “<a href="https://mkram01.github.io/EPI563-SpatialEPI/index.html">Spatial Epidemiology</a>” from M. Kramer from which the statistical analysis of his section were adapted.</p> +<p>This section aims at providing some basic statistical tools to study the spatial distribution of epidemiological data. If you wish to go further into spatial statistics applied to epidemiology and their limitations you can consult the tutorial “<a href="https://mkram01.github.io/EPI563-SpatialEPI/index.html">Spatial Epidemiology</a>” from M. 
Kramer from which the statistical analysis of this section was adapted. We will use</p> <section id="import-and-visualize-epidemiological-data" class="level2" data-number="7.1"> <h2 data-number="7.1" class="anchored" data-anchor-id="import-and-visualize-epidemiological-data"><span class="header-section-number">7.1</span> Import and visualize epidemiological data</h2> -<p>In this section, we load data that reference the cases of an imaginary disease throughout Cambodia. Each point correspond to the geolocalisation of a case.</p> +<p>In this section, we load data that reference the cases of an imaginary disease, the W fever, throughout Cambodia. Each point corresponds to the geo-localization of a case.</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(dplyr)</span> <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(sf)</span> @@ -315,12 +315,12 @@ Projected CRS: WGS 84 / UTM zone 48N <p><img src="07-basic_statistics_files/figure-html/cases_visualization-1.png" class="img-fluid" width="768"></p> </div> </div> -<p>In epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, …) or a person at risk (e.g. a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use the district as the areal unit of the study.</p> +<p>In epidemiology, the true meaning of point is very questionable. 
If it usually gives the location of an observation, we cannot precisely tell if this observation represents an event of interest (e.g., illness, death, …) or a person at risk (e.g., a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appear as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use the district as the areal unit of the study.</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Aggregate cases over districts</span></span> <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>district<span class="sc">$</span>cases <span class="ot"><-</span> <span class="fu">lengths</span>(<span class="fu">st_intersects</span>(district, cases))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> </div> -<p>The incidence (<span class="math inline">\(\frac{cases}{population}\)</span>) is commonly use to represent cases distribution related to population density but other indicators exists. As example, the standardized incidence ratios (SIRs) represents the deviation of observed and expected number of cases and is expressed as <span class="math inline">\(SIR = \frac{Y_i}{E_i}\)</span> with <span class="math inline">\(Y_i\)</span>, the observed number of cases and <span class="math inline">\(E_i\)</span>, the expected number of cases. In this study, we computed the expected number of cases in each district by assuming infections are homogeneously distributed across Cambodia, i.e. the incidence is the same in each district. 
The SIR therefore represents the deviation of incidence compared to the averaged average incidence across Cambodia.</p> +<p>The incidence (<span class="math inline">\(\frac{cases}{population}\)</span>) expressed per 100,000 population is commonly use to represent cases distribution related to population density but other indicators exists. As example, the standardized incidence ratios (SIRs) represent the deviation of observed and expected number of cases and is expressed as <span class="math inline">\(SIR = \frac{Y_i}{E_i}\)</span> with <span class="math inline">\(Y_i\)</span>, the observed number of cases and <span class="math inline">\(E_i\)</span>, the expected number of cases. In this study, we computed the expected number of cases in each district by assuming infections are homogeneously distributed across Cambodia, i.e., the incidence is the same in each district. The SIR therefore represents the deviation of incidence compared to the average incidence across Cambodia.</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Compute incidence in each district (per 100 000 population)</span></span> <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>district<span class="sc">$</span>incidence <span class="ot"><-</span> district<span class="sc">$</span>cases<span class="sc">/</span>district<span class="sc">$</span>T_POP <span class="sc">*</span> <span class="dv">100000</span></span> @@ -352,27 +352,28 @@ Projected CRS: WGS 84 / UTM zone 48N <span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a> <span class="at">var =</span> <span class="st">"incidence"</span>,</span> <span id="cb7-16"><a href="#cb7-16" aria-hidden="true" tabindex="-1"></a> <span class="at">type =</span> <span class="st">"choro"</span>,</span> <span id="cb7-17"><a href="#cb7-17" 
aria-hidden="true" tabindex="-1"></a> <span class="at">pal =</span> <span class="st">"Reds 3"</span>,</span> -<span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a> <span class="at">leg_title =</span> <span class="st">"Incidence </span><span class="sc">\n</span><span class="st">(per 100 000)"</span>)</span> -<span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_layout</span>(<span class="at">title =</span> <span class="st">"Incidence of W Fever"</span>)</span> -<span id="cb7-20"><a href="#cb7-20" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb7-21"><a href="#cb7-21" aria-hidden="true" tabindex="-1"></a><span class="co"># Plot SIRs</span></span> -<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a><span class="co"># create breaks and associated color palette</span></span> -<span id="cb7-23"><a href="#cb7-23" aria-hidden="true" tabindex="-1"></a>break_SIR <span class="ot"><-</span> <span class="fu">c</span>(<span class="dv">0</span>, <span class="fu">exp</span>(<span class="fu">mf_get_breaks</span>(<span class="fu">log</span>(district<span class="sc">$</span>SIR), <span class="at">nbreaks =</span> <span class="dv">8</span>, <span class="at">breaks =</span> <span class="st">"pretty"</span>)))</span> -<span id="cb7-24"><a href="#cb7-24" aria-hidden="true" tabindex="-1"></a>col_pal <span class="ot"><-</span> <span class="fu">c</span>(<span class="st">"#273871"</span>, <span class="st">"#3267AD"</span>, <span class="st">"#6496C8"</span>, <span class="st">"#9BBFDD"</span>, <span class="st">"#CDE3F0"</span>, <span class="st">"#FFCEBC"</span>, <span class="st">"#FF967E"</span>, <span class="st">"#F64D41"</span>, <span class="st">"#B90E36"</span>)</span> -<span id="cb7-25"><a href="#cb7-25" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb7-26"><a href="#cb7-26" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_map</span>(<span class="at">x =</span> district,</span> 
-<span id="cb7-27"><a href="#cb7-27" aria-hidden="true" tabindex="-1"></a> <span class="at">var =</span> <span class="st">"SIR"</span>,</span> -<span id="cb7-28"><a href="#cb7-28" aria-hidden="true" tabindex="-1"></a> <span class="at">type =</span> <span class="st">"choro"</span>,</span> -<span id="cb7-29"><a href="#cb7-29" aria-hidden="true" tabindex="-1"></a> <span class="at">breaks =</span> break_SIR, </span> -<span id="cb7-30"><a href="#cb7-30" aria-hidden="true" tabindex="-1"></a> <span class="at">pal =</span> col_pal, </span> -<span id="cb7-31"><a href="#cb7-31" aria-hidden="true" tabindex="-1"></a> <span class="at">cex =</span> <span class="dv">2</span>,</span> -<span id="cb7-32"><a href="#cb7-32" aria-hidden="true" tabindex="-1"></a> <span class="at">leg_title =</span> <span class="st">"SIR"</span>)</span> -<span id="cb7-33"><a href="#cb7-33" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_layout</span>(<span class="at">title =</span> <span class="st">"Standardized Incidence Ratio of W Fever"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> +<span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a> <span class="at">breaks =</span> <span class="fu">exp</span>(<span class="fu">mf_get_breaks</span>(<span class="fu">log</span>(district<span class="sc">$</span>incidence<span class="sc">+</span><span class="dv">1</span>), <span class="at">breaks =</span> <span class="st">"pretty"</span>))<span class="sc">-</span><span class="dv">1</span>,</span> +<span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a> <span class="at">leg_title =</span> <span class="st">"Incidence </span><span class="sc">\n</span><span class="st">(per 100 000)"</span>)</span> +<span id="cb7-20"><a href="#cb7-20" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_layout</span>(<span class="at">title =</span> <span class="st">"Incidence of W Fever"</span>)</span> +<span 
id="cb7-21"><a href="#cb7-21" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a><span class="co"># Plot SIRs</span></span> +<span id="cb7-23"><a href="#cb7-23" aria-hidden="true" tabindex="-1"></a><span class="co"># create breaks and associated color palette</span></span> +<span id="cb7-24"><a href="#cb7-24" aria-hidden="true" tabindex="-1"></a>break_SIR <span class="ot"><-</span> <span class="fu">c</span>(<span class="dv">0</span>,<span class="fu">exp</span>(<span class="fu">mf_get_breaks</span>(<span class="fu">log</span>(district<span class="sc">$</span>SIR), <span class="at">nbreaks =</span> <span class="dv">8</span>, <span class="at">breaks =</span> <span class="st">"pretty"</span>)))</span> +<span id="cb7-25"><a href="#cb7-25" aria-hidden="true" tabindex="-1"></a>col_pal <span class="ot"><-</span> <span class="fu">c</span>(<span class="st">"#273871"</span>, <span class="st">"#3267AD"</span>, <span class="st">"#6496C8"</span>, <span class="st">"#9BBFDD"</span>, <span class="st">"#CDE3F0"</span>, <span class="st">"#FFCEBC"</span>, <span class="st">"#FF967E"</span>, <span class="st">"#F64D41"</span>, <span class="st">"#B90E36"</span>)</span> +<span id="cb7-26"><a href="#cb7-26" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb7-27"><a href="#cb7-27" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_map</span>(<span class="at">x =</span> district,</span> +<span id="cb7-28"><a href="#cb7-28" aria-hidden="true" tabindex="-1"></a> <span class="at">var =</span> <span class="st">"SIR"</span>,</span> +<span id="cb7-29"><a href="#cb7-29" aria-hidden="true" tabindex="-1"></a> <span class="at">type =</span> <span class="st">"choro"</span>,</span> +<span id="cb7-30"><a href="#cb7-30" aria-hidden="true" tabindex="-1"></a> <span class="at">breaks =</span> break_SIR, </span> +<span id="cb7-31"><a href="#cb7-31" aria-hidden="true" tabindex="-1"></a> <span class="at">pal =</span> col_pal, 
</span> +<span id="cb7-32"><a href="#cb7-32" aria-hidden="true" tabindex="-1"></a> <span class="at">cex =</span> <span class="dv">2</span>,</span> +<span id="cb7-33"><a href="#cb7-33" aria-hidden="true" tabindex="-1"></a> <span class="at">leg_title =</span> <span class="st">"SIR"</span>)</span> +<span id="cb7-34"><a href="#cb7-34" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_layout</span>(<span class="at">title =</span> <span class="st">"Standardized Incidence Ratio of W Fever"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> <div class="cell-output-display"> <p><img src="07-basic_statistics_files/figure-html/inc_visualization-1.png" class="img-fluid" width="768"></p> </div> </div> -<p>These maps illustrates the spatial heterogenity of the cases. The incidence shows how the disease vary from one district to another while the SIR highlight districts that have :</p> +<p>These maps illustrate the spatial heterogeneity of the cases. The incidence shows how the disease vary from one district to another while the SIR highlight districts that have:</p> <ul> <li><p>higher risk than average (SIR > 1) when standardized for population</p></li> <li><p>lower risk than average (SIR < 1) when standardized for population</p></li> @@ -384,11 +385,12 @@ Projected CRS: WGS 84 / UTM zone 48N <i class="callout-icon"></i> </div> <div class="callout-caption-container flex-fill"> -To go futher … +To go further … </div> </div> <div class="callout-body-container callout-body"> -<p>In this example, we standardized the cases distribution for population count. This simple standardization assume that the risk of contracting the disease is similar for each person. However, assumption does not hold for all diseases and for all observed events since confounding effects can create nuisance into the interpretations (e.g. 
the number of childhood illness and death outcomes in a district are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don’t want to analyze (e.g. sex ratio, occupations, age pyramid).</p> +<p>In this example, we standardized the cases distribution for population count. This simple standardization assumes that the risk of contracting the disease is similar for each person. However, assumption does not hold for all diseases and for all observed events since confounding effects can create nuisance into the interpretations (e.g., the number of childhood illness and death outcomes in a district are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don’t want to analyze (e.g., sex ratio, occupations, age pyramid).</p> +<p>In addition, one can wonder what does an <span class="math inline">\(SIR \~ 1\)</span> means, i.e., what is the threshold to decide whether the SIR is greater, lower or equivalent to 1. The significant of the SIR can be tested globally (to determine whether or not the incidence is homogeneously distributed) and locally in each district (to determine Which district have an SIR different than 1). We won’t perform these analyses in this tutorial but you can look at the function <code>?achisq.test()</code> (from <code>Dcluster</code> package <span class="citation" data-cites="DCluster">(<a href="references.html#ref-DCluster" role="doc-biblioref">Gómez-Rubio et al. 2015</a>)</span>) and <code>?probmap()</code> (from <code>spdep</code> package <span class="citation" data-cites="spdep">(<a href="references.html#ref-spdep" role="doc-biblioref">R. Bivand et al. 
2015</a>)</span>) to compute these statistics.</p> </div> </div> </section> @@ -396,10 +398,10 @@ To go futher … <h2 data-number="7.2" class="anchored" data-anchor-id="cluster-analysis"><span class="header-section-number">7.2</span> Cluster analysis</h2> <section id="general-introduction" class="level3" data-number="7.2.1"> <h3 data-number="7.2.1" class="anchored" data-anchor-id="general-introduction"><span class="header-section-number">7.2.1</span> General introduction</h3> -<p>Why studying clusters in epidemiology ? Cluster analysis help identifying unusual patterns that occurs during a given period of time. The underlying ultimate goal of such analysis is to explain the observation of such patterns. In epidemiology, we can distinguish two types of process that would explain heterogeneity in case distribution :</p> +<p>Why studying clusters in epidemiology? Cluster analysis help identifying unusual patterns that occurs during a given period of time. The underlying ultimate goal of such analysis is to explain the observation of such patterns. In epidemiology, we can distinguish two types of process that would explain heterogeneity in case distribution:</p> <ul> -<li><p>The <strong>1st order effects</strong> are the spatial variations of cases distribution caused by underlying properties of environment or the population structure itself. In such process individual get infected independently from the rest of the population. Such process includes the infection through a environment at risk as, for example, air pollution, contaminated waters or soils and UV exposition. This effect assume that the observed pattern are caused by a difference in risk intensity.</p></li> -<li><p>The <strong>2nd order effects</strong> describes process of spread, contagion and diffusion of diseases caused by interactions between individuals. 
This includes transmission of infectious disease by proximity, but also the transmission of non-infectious disease, for example, with the diffusion of social norms within networks. This effect assume that the observed pattern are caused by correlations or co-variations.</p></li> +<li><p>The <strong>1st order effects</strong> are the spatial variations of cases distribution caused by underlying properties of environment or the population structure itself. In such process individual get infected independently from the rest of the population. Such process includes the infection through an environment at risk as, for example, air pollution, contaminated waters or soils and UV exposition. This effect assume that the observed pattern is caused by a difference in risk intensity.</p></li> +<li><p>The <strong>2nd order effects</strong> describes process of spread, contagion and diffusion of diseases caused by interactions between individuals. This includes transmission of infectious disease by proximity, but also the transmission of non-infectious disease, for example, with the diffusion of social norms within networks. This effect assume that the observed pattern is caused by correlations or co-variations.</p></li> </ul> <p>No statistical methods could distinguish between these competing processes since their outcome results in similar pattern of points. The cluster analysis help describing the magnitude and the location of pattern but in no way could answer the question of why such patterns occurs. It is therefore a step that help detecting cluster for description and surveillance purpose and rising hypothesis on the underlying process that will lead further investigations.</p> <p>Knowledge about the disease and its transmission process could orientate the choice of the methods of study. 
We presented in this brief tutorial two methods of cluster detection, the Moran’s I test that test for spatial independence (likely related to 2nd order effects) and the scan statistics that test for homogeneous distribution (likely related 1st order effects). It relies on epidemiologist to select the tools that best serve the studied question.</p> @@ -413,9 +415,9 @@ Statistic tests and distributions </div> </div> <div class="callout-body-container callout-body"> -<p>In statistics, problems are usually expressed by defining two hypothesis : the null hypothesis (H0), i.e. an <em>a priori</em> hypothesis of the studied phenomenon (e.g. the situation is a random) and the alternative hypothesis (HA), e.g. the situation is not random. The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis.</p> -<p>In mathematics, a probability distribution is a mathematical expression that represents what we would expect due to random chance. The choice of the probability distribution relies on the type of data you use (continuous, count, binary). In general, three distribution a used while studying disease rates, the Binomial, the Poisson and the Poisson-gamma mixture (a.k.a negative binomial) distributions.</p> -<p>Many the statistical tests assume by default that data are normally distributed. It implies that your variable is continuous and that all data could easily be represented by two parameters, the mean and the variance, i.e. each value have the same level of certainty. If many measure can be assessed under the normality assumption, this is usually not the case in epidemiology with strictly positives rates and count values that 1) does not fit the normal distribution and 2) does not provide with the same degree of certainty since variances likely differ between district due to different population size, i.e. 
some district have very sparse data (with high variance) while other have adequate data (with lower variance).</p> +<p>In statistics, problems are usually expressed by defining two hypotheses: the null hypothesis (H0), i.e., an <em>a priori</em> hypothesis of the studied phenomenon (e.g., the situation is a random) and the alternative hypothesis (HA), e.g., the situation is not random. The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis.</p> +<p>In mathematics, a probability distribution is a mathematical expression that represents what we would expect due to random chance. The choice of the probability distribution relies on the type of data you use (continuous, count, binary). In general, three distribution a used while studying disease rates, the Binomial, the Poisson and the Poisson-gamma mixture (also known as negative binomial) distributions.</p> +<p>Many the statistical tests assume by default that data are normally distributed. It implies that your variable is continuous and that all data could easily be represented by two parameters, the mean and the variance, i.e., each value have the same level of certainty. 
If many measure can be assessed under the normality assumption, this is usually not the case in epidemiology with strictly positives rates and count values that 1) does not fit the normal distribution and 2) does not provide with the same degree of certainty since variances likely differ between district due to different population size, i.e., some district have very sparse data (with high variance) while other have adequate data (with lower variance).</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># dataset statistics</span></span> <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>m_cases <span class="ot"><-</span> <span class="fu">mean</span>(district<span class="sc">$</span>incidence)</span> @@ -430,7 +432,7 @@ Statistic tests and distributions <p><img src="07-basic_statistics_files/figure-html/distribution-1.png" class="img-fluid" width="576"></p> </div> </div> -<p>In this tutorial, we used the poisson distribution in our statistical tests.</p> +<p>In this tutorial, we used the Poisson distribution in our statistical tests.</p> </div> </div> </section> @@ -449,22 +451,22 @@ Moran’s I test </div> </div> <div class="callout-body-container callout-body"> -<p>The Moran’s statistics is :</p> -<p><span class="math display">\[I = \frac{N}{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}}\frac{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}(Y_i-\bar{Y})(Y_j - \bar{Y})}{\sum_{i=1}^N(Y_i-\bar{Y})^2}\]</span> with :</p> +<p>The Moran’s statistics is:</p> +<p><span class="math display">\[I = \frac{N}{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}}\frac{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}(Y_i-\bar{Y})(Y_j - \bar{Y})}{\sum_{i=1}^N(Y_i-\bar{Y})^2}\]</span> with:</p> <ul> <li><p><span class="math inline">\(N\)</span>: the number of polygons,</p></li> -<li><p><span class="math inline">\(w_{ij}\)</span>: is a matrix of spatial 
weight with zeroes on the diagonal (i.e., <span class="math inline">\(w_{ii}=0\)</span>). For example, if polygons are neighbors, the weight takes the value <span class="math inline">\(1\)</span> otherwise it take the value <span class="math inline">\(0\)</span>.</p></li> +<li><p><span class="math inline">\(w_{ij}\)</span>: is a matrix of spatial weight with zeroes on the diagonal (i.e., <span class="math inline">\(w_{ii}=0\)</span>). For example, if polygons are neighbors, the weight takes the value <span class="math inline">\(1\)</span> otherwise it takes the value <span class="math inline">\(0\)</span>.</p></li> <li><p><span class="math inline">\(Y_i\)</span>: the variable of interest,</p></li> <li><p><span class="math inline">\(\bar{Y}\)</span>: the mean value of <span class="math inline">\(Y\)</span>.</p></li> </ul> -<p>Under the Moran’s test, the statistics hypothesis are :</p> +<p>Under the Moran’s test, the statistics hypotheses are:</p> <ul> -<li><p><strong>H0</strong> : the distribution of cases is spatially independent, i.e. <span class="math inline">\(I=0\)</span>.</p></li> -<li><p><strong>H1</strong>: the distribution of cases is spatially autocorrelated, i.e. <span class="math inline">\(I\ne0\)</span>.</p></li> +<li><p><strong>H0</strong>: the distribution of cases is spatially independent, i.e., <span class="math inline">\(I=0\)</span>.</p></li> +<li><p><strong>H1</strong>: the distribution of cases is spatially autocorrelated, i.e., <span class="math inline">\(I\ne0\)</span>.</p></li> </ul> </div> </div> -<p>We will compute the Moran’s statistics using <code>spdep</code><span class="citation" data-cites="spdep">(<a href="references.html#ref-spdep" role="doc-biblioref">R. Bivand et al. 2015</a>)</span> and <code>Dcluster</code><span class="citation" data-cites="DCluster">(<a href="references.html#ref-DCluster" role="doc-biblioref">Gómez-Rubio et al. 2015</a>)</span> packages. 
<code>spdep</code> package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. In this example, we use <code>poly2nb()</code> and <code>nb2listw()</code>. These function respectively detect the neighboring polygons and assign weight corresponding to <span class="math inline">\(1/\#\ of\ neighbors\)</span>. <code>Dcluster</code> package provides a set of functions for the detection of spatial clusters of disease using count data.</p> +<p>We will compute the Moran’s statistics using <code>spdep</code><span class="citation" data-cites="spdep">(<a href="references.html#ref-spdep" role="doc-biblioref">R. Bivand et al. 2015</a>)</span> and <code>Dcluster</code><span class="citation" data-cites="DCluster">(<a href="references.html#ref-DCluster" role="doc-biblioref">Gómez-Rubio et al. 2015</a>)</span> packages. <code>spdep</code> package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. In this example, we use <code>poly2nb()</code> and <code>nb2listw()</code>. These functions respectively detect the neighboring polygons and assign weight corresponding to <span class="math inline">\(1/\#\ of\ neighbors\)</span>. 
<code>Dcluster</code> package provides a set of functions for the detection of spatial clusters of disease using count data.</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(spdep) <span class="co"># Functions for creating spatial weight, spatial analysis</span></span> <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(DCluster) <span class="co"># Package with functions for spatial cluster analysis</span></span> @@ -488,18 +490,18 @@ Moran’s I test Model used when sampling: Poisson Number of simulations: 499 Statistic: 0.1566449 - p-value : 0.014 </code></pre> + p-value : 0.012 </code></pre> </div> <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="fu">plot</span>(m_test)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> <div class="cell-output-display"> <p><img src="07-basic_statistics_files/figure-html/MoransI-1.png" class="img-fluid" width="768"></p> </div> </div> -<p>The Moran’s statistics is here <span class="math inline">\(I =\)</span> 0.16. When comparing its value to the H0 distribution (built under 499 simulations), the probability of observing such a I value under the null hypothesis, i.e. the distribution of cases is spatially independent, is <span class="math inline">\(p_{value} =\)</span> 0.014. We therefore reject H0 with error risk of <span class="math inline">\(\alpha = 5\%\)</span>. The distribution of cases is therefore autocorrelated across districts in Cambodia.</p> +<p>The Moran’s statistics is here <span class="math inline">\(I =\)</span> 0.16. 
When comparing its value to the H0 distribution (built under 499 simulations), the probability of observing such a I value under the null hypothesis, i.e. the distribution of cases is spatially independent, is <span class="math inline">\(p_{value} =\)</span> 0.012. We therefore reject H0 with error risk of <span class="math inline">\(\alpha = 5\%\)</span>. The distribution of cases is therefore autocorrelated across districts in Cambodia.</p> </section> <section id="morans-i-local-test" class="level4" data-number="7.2.2.2"> <h4 data-number="7.2.2.2" class="anchored" data-anchor-id="morans-i-local-test"><span class="header-section-number">7.2.2.2</span> Moran’s I local test</h4> -<p>The global Moran’s test provides us a global statistical value informing whether autocorrelation occurs over the territory but does not inform on where does these correlation occurs, i.e. what is the locations of the clusters. To identify such cluster we can decompose the Moran’s I statistic to extract local informations of the level of correlation of each district and its neighbors. This is called the Local Moran’s I LISA statistic. Because the Local Moran’s I LISA statistic test each district for autocorrelation independently, concern are raised about multiple testing limitations that increase the Type I error (<span class="math inline">\(\alpha\)</span>) of the statistical tests. The use of local test should therefore be study in light of explore and describes clusters once the global test detected autocorrelation.</p> +<p>The global Moran’s test provides us a global statistical value informing whether autocorrelation occurs over the territory but does not inform on where does these correlations occurs, i.e., what is the locations of the clusters. To identify such cluster, we can decompose the Moran’s I statistic to extract local information of the level of correlation of each district and its neighbors. This is called the Local Moran’s I LISA statistic. 
Because the Local Moran’s I LISA statistic test each district for autocorrelation independently, concern is raised about multiple testing limitations that increase the Type I error (<span class="math inline">\(\alpha\)</span>) of the statistical tests. The use of local test should therefore be study in light of explore and describes clusters once the global test detected autocorrelation.</p> <div class="callout-note callout callout-style-default callout-captioned"> <div class="callout-header d-flex align-content-center"> <div class="callout-icon-container"> @@ -510,11 +512,11 @@ Statistical test </div> </div> <div class="callout-body-container callout-body"> -<p>For each district <span class="math inline">\(i\)</span>, the Moran’s statistics is :</p> +<p>For each district <span class="math inline">\(i\)</span>, the Local Moran’s I statistics is:</p> <p><span class="math display">\[I_i = \frac{(Y_i-\bar{Y})}{\sum_{i=1}^N(Y_i-\bar{Y})^2}\sum_{j=1}^Nw_{ij}(Y_j - \bar{Y}) \text{ with } I = \sum_{i=1}^NI_i/N\]</span></p> </div> </div> -<p>The <code>localmoran()</code>function from the package <code>spdep</code> treats the variable of interest as if it was normally distributed. In some cases, this assumption could be reasonable for incidence rate, especially when the areal units of analysis have sufficiently large population count suggesting that the values have similar level of variances. Unfortunately, the local moran’s test has not been implemented for poisson distribution (population not large enough in some districts) in <code>spdep</code> package. However Bivand <strong>et al.</strong> <span class="citation" data-cites="bivand2008applied">(<a href="references.html#ref-bivand2008applied" role="doc-biblioref">R. S. Bivand et al. 
2008</a>)</span> provided some code to manual perform the analysis using poisson distribution and was further implemented in the course “<a href="https://mkram01.github.io/EPI563-SpatialEPI/index.html">Spatial Epidemiology</a>”.</p> +<p>The <code>localmoran()</code>function from the package <code>spdep</code> treats the variable of interest as if it was normally distributed. In some cases, this assumption could be reasonable for incidence rate, especially when the areal units of analysis have sufficiently large population count suggesting that the values have similar level of variances. Unfortunately, the local Moran’s test has not been implemented for Poisson distribution (population not large enough in some districts) in <code>spdep</code> package. However, Bivand <strong>et al.</strong> <span class="citation" data-cites="bivand2008applied">(<a href="references.html#ref-bivand2008applied" role="doc-biblioref">R. S. Bivand et al. 2008</a>)</span> provided some code to manual perform the analysis using Poisson distribution and was further implemented in the course “<a href="https://mkram01.github.io/EPI563-SpatialEPI/index.html">Spatial Epidemiology</a>”.</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Step 1 - Create the standardized deviation of observed from expected</span></span> <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>sd_lm <span class="ot"><-</span> (district<span class="sc">$</span>cases <span class="sc">-</span> district<span class="sc">$</span>expected) <span class="sc">/</span> <span class="fu">sqrt</span>(district<span class="sc">$</span>expected)</span> @@ -560,12 +562,12 @@ Statistical test <span id="cb13-10"><a href="#cb13-10" aria-hidden="true" tabindex="-1"></a>district<span class="sc">$</span>pval_lm <span 
class="ot"><-</span> <span class="fu">punif</span>((diff <span class="sc">+</span> <span class="dv">1</span>) <span class="sc">/</span> (nsim <span class="sc">+</span> <span class="dv">1</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> </div> <p>For each district, we obtain a p-value based on permutations process</p> -<p>A conventional way of plotting these results is to classify the districts into 5 classes based on local Moran’s I outputs. The classification of cluster that are significantly autocorrelated to their neighbors is performed based on a comparison of the scaled incidence in the district compared to the scaled weighted averaged incidence of it neighboring districts (computed with <code>lag.listw()</code>) :</p> +<p>A conventional way of plotting these results is to classify the districts into 5 classes based on local Moran’s I output. The classification of cluster that are significantly autocorrelated to their neighbors is performed based on a comparison of the scaled incidence in the district compared to the scaled weighted averaged incidence of it neighboring districts (computed with <code>lag.listw()</code>):</p> <ul> <li><p>Districts that have higher-than-average rates in both index regions and their neighbors and showing statistically significant positive values for the local <span class="math inline">\(I_i\)</span> statistic are defined as <strong>High-High</strong> (hotspot of the disease)</p></li> -<li><p>Districts that have lower-than-average rates in both index regions and their neighbors adn showing statistically significant positive values for the local <span class="math inline">\(I_i\)</span> statistic are defined as <strong>Low-Low</strong> (coldspot of the disease).</p></li> +<li><p>Districts that have lower-than-average rates in both index regions and their neighbors and showing statistically significant positive values for the local <span class="math 
inline">\(I_i\)</span> statistic are defined as <strong>Low-Low</strong> (cold spot of the disease).</p></li> <li><p>Districts that have higher-than-average rates in the index regions and lower-than-average rates in their neighbors, and showing statistically significant negative values for the local <span class="math inline">\(I_i\)</span> statistic are defined as <strong>High-Low</strong>(outlier with high incidence in an area with low incidence).</p></li> -<li><p>Districts that have lower-than-average rates in the index regions and higher-than-average rates in their neighbors, and showing statistically significant negative values for the local <span class="math inline">\(I_i\)</span> statistic are defined as <strong>Low-High</strong>(outlier of low incidence in area with high incidence).</p></li> +<li><p>Districts that have lower-than-average rates in the index regions and higher-than-average rates in their neighbors, and showing statistically significant negative values for the local <span class="math inline">\(I_i\)</span> statistic are defined as <strong>Low-High</strong> (outlier of low incidence in area with high incidence).</p></li> <li><p>Districts with non-significant values for the <span class="math inline">\(I_i\)</span> statistic are defined as <strong>Non-significant</strong>.</p></li> </ul> <div class="cell" data-nm="true"> @@ -597,7 +599,7 @@ Statistical test <span id="cb14-26"><a href="#cb14-26" aria-hidden="true" tabindex="-1"></a> <span class="at">pal =</span> <span class="fu">c</span>(<span class="st">"#6D0026"</span> , <span class="st">"blue"</span>, <span class="st">"white"</span>) , <span class="co"># "#FF755F","#7FABD3" ,</span></span> <span id="cb14-27"><a href="#cb14-27" aria-hidden="true" tabindex="-1"></a> <span class="at">leg_title =</span> <span class="st">"Clusters"</span>)</span> <span id="cb14-28"><a href="#cb14-28" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb14-29"><a href="#cb14-29" aria-hidden="true" 
tabindex="-1"></a><span class="fu">mf_layout</span>(<span class="at">title =</span> <span class="st">"Cluster using Local moran'I statistic"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> +<span id="cb14-29"><a href="#cb14-29" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_layout</span>(<span class="at">title =</span> <span class="st">"Cluster using Local Moran's I statistic"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> <div class="cell-output-display"> <p><img src="07-basic_statistics_files/figure-html/LocalMoransI_plt-1.png" class="img-fluid" width="768"></p> </div> @@ -606,18 +608,18 @@ Statistical test </section> <section id="spatial-scan-statistics" class="level3" data-number="7.2.3"> <h3 data-number="7.2.3" class="anchored" data-anchor-id="spatial-scan-statistics"><span class="header-section-number">7.2.3</span> Spatial scan statistics</h3> -<p>While Moran’s indice focuses on testing for autocorrelation between neighboring polygons (under the null assumption of spatial independance), the spatial scan statistic aims at identifying an abnormal higher risk in a given region compared to the risk outside of this region (under the null assumption of homogeneous distribution). The conception of a cluster is therefore different between the two methods.</p> +<p>While Moran’s indices focus on testing for autocorrelation between neighboring polygons (under the null assumption of spatial independence), the spatial scan statistic aims at identifying an abnormal higher risk in a given region compared to the risk outside of this region (under the null assumption of homogeneous distribution). 
The conception of a cluster is therefore different between the two methods.</p> <p>The function <code>kulldorff</code> from the package <code>SpatialEpi</code> <span class="citation" data-cites="SpatialEpi">(<a href="references.html#ref-SpatialEpi" role="doc-biblioref">Kim and Wakefield 2010</a>)</span> is a simple tool to implement spatial-only scan statistics. Briefly, the kulldorff scan statistics scan the area for clusters using several steps:</p> <ol type="1"> -<li><p>It create a circular window of observation by defining a single location and an associated radius of the windows varying from 0 to a large number that depends on population distribution (largest radius could includes 50% of the population).</p></li> +<li><p>It create a circular window of observation by defining a single location and an associated radius of the windows varying from 0 to a large number that depends on population distribution (largest radius could include 50% of the population).</p></li> <li><p>It aggregates the count of events and the population at risk (or an expected count of events) inside and outside the window of observation.</p></li> <li><p>Finally, it computes the likelihood ratio to test whether the risk is equal inside versus outside the windows (H0) or greater inside the observed window</p></li> -<li><p>These 3 steps are repeted for each location and each possible windows-radii.</p></li> +<li><p>These 3 steps are repeated for each location and each possible windows-radii.</p></li> </ol> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(<span class="st">"SpatialEpi"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> </div> -<p>The use of R spatial object is not implementes in <code>kulldorff()</code> 
function. It uses instead matrix of xy coordinates that represents the centroids of the districts. A given district is included into the observed circular window if its centroids falls into the circle.</p> +<p>The use of R spatial object is not implements in <code>kulldorff()</code> function. It uses instead matrix of xy coordinates that represents the centroids of the districts. A given district is included into the observed circular window if its centroids fall into the circle.</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a>district_xy <span class="ot"><-</span> <span class="fu">st_centroid</span>(district) <span class="sc">%>%</span> </span> <span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">st_coordinates</span>()</span> @@ -633,7 +635,7 @@ Statistical test 6 360528.3 1516339</code></pre> </div> </div> -<p>We can then call kulldorff function (you are strongly encourage to call <code>?kulldorff</code> to properly call the function). The <code>alpha.level</code> threshold filter for the secondary clusters that will be retained. The most-likely cluster will be saved whatever its significance.</p> +<p>We can then call kulldorff function (you are strongly encouraged to call <code>?kulldorff</code> to properly call the function). The <code>alpha.level</code> threshold filter for the secondary clusters that will be retained. 
The most-likely cluster will be saved whatever its significance.</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb18"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a>kd_Wfever <span class="ot"><-</span> <span class="fu">kulldorff</span>(district_xy, </span> <span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a> <span class="at">cases =</span> district<span class="sc">$</span>cases,</span> @@ -646,7 +648,7 @@ Statistical test <p><img src="07-basic_statistics_files/figure-html/kd_test-1.png" class="img-fluid" width="576"></p> </div> </div> -<p>All outputs are saved into an R object, here called <code>kd_Wfever</code>. Unfortunately the package did not developed any summary and visualization of the results but we can explore the output object.</p> +<p>All outputs are saved into an R object, here called <code>kd_Wfever</code>. Unfortunately, the package did not develop any summary and visualization of the results but we can explore the output object.</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb19"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="fu">names</span>(kd_Wfever)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> <div class="cell-output cell-output-stdout"> @@ -676,7 +678,7 @@ Statistical test <pre class="code-out"><code>[1] 52.97195</code></pre> </div> </div> -<p>17 districts belong to the cluster and its number of cases is 2.3 times higher than the expected number of case.</p> +<p>17 districts belong to the cluster and its number of cases is 2.3 times higher than the expected number of cases.</p> <p>Similarly, we could study the secondary clusters. 
Results are saved in a list.</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb29"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a><span class="co"># We can see which districts (r number) belong to this cluster</span></span> @@ -693,7 +695,7 @@ Statistical test <span id="cb31-7"><a href="#cb31-7" aria-hidden="true" tabindex="-1"></a><span class="fu">print</span>(df_secondary_clusters)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> <div class="cell-output cell-output-stdout"> <pre class="code-out"><code> SMR number.of.cases expected.cases p.value -1 3.767698 16 4.246625 0.004</code></pre> +1 3.767698 16 4.246625 0.008</code></pre> </div> </div> <p>We only have one secondary cluster composed of one district.</p> @@ -701,7 +703,7 @@ Statistical test <div class="sourceCode cell-code" id="cb33"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb33-1"><a href="#cb33-1" aria-hidden="true" tabindex="-1"></a><span class="co"># create empty column to store cluster informations</span></span> <span id="cb33-2"><a href="#cb33-2" aria-hidden="true" tabindex="-1"></a>district<span class="sc">$</span>k_cluster <span class="ot"><-</span> <span class="cn">NA</span></span> <span id="cb33-3"><a href="#cb33-3" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb33-4"><a href="#cb33-4" aria-hidden="true" tabindex="-1"></a><span class="co"># save cluster informations from kulldorff outputs</span></span> +<span id="cb33-4"><a href="#cb33-4" aria-hidden="true" tabindex="-1"></a><span class="co"># save cluster information from kulldorff outputs</span></span> <span id="cb33-5"><a href="#cb33-5" aria-hidden="true" tabindex="-1"></a>district<span class="sc">$</span>k_cluster[kd_Wfever<span class="sc">$</span>most.likely.cluster<span class="sc">$</span>location.IDs.included] 
<span class="ot"><-</span> <span class="st">'Most likely cluster'</span></span> <span id="cb33-6"><a href="#cb33-6" aria-hidden="true" tabindex="-1"></a></span> <span id="cb33-7"><a href="#cb33-7" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span>(i <span class="cf">in</span> <span class="dv">1</span><span class="sc">:</span><span class="fu">length</span>(kd_Wfever<span class="sc">$</span>secondary.clusters)){</span> @@ -732,7 +734,7 @@ Statistical test <i class="callout-icon"></i> </div> <div class="callout-caption-container flex-fill"> -To go futher … +To go further … </div> </div> <div class="callout-body-container callout-body"> diff --git a/public/07-basic_statistics_files/figure-html/LocalMoransI-1.png b/public/07-basic_statistics_files/figure-html/LocalMoransI-1.png index f93e192f62002be37a8c8d480133e26a96891cba..fe0f3cdb57e1064e4264569d8af69d3c66451bab 100644 Binary files a/public/07-basic_statistics_files/figure-html/LocalMoransI-1.png and b/public/07-basic_statistics_files/figure-html/LocalMoransI-1.png differ diff --git a/public/07-basic_statistics_files/figure-html/LocalMoransI_plt-1.png b/public/07-basic_statistics_files/figure-html/LocalMoransI_plt-1.png index 567565cfa7861893745dec4daac1580ed2f18485..ff3163fba9e3801f4b0c53aefb5844775faaf11d 100644 Binary files a/public/07-basic_statistics_files/figure-html/LocalMoransI_plt-1.png and b/public/07-basic_statistics_files/figure-html/LocalMoransI_plt-1.png differ diff --git a/public/07-basic_statistics_files/figure-html/MoransI-1.png b/public/07-basic_statistics_files/figure-html/MoransI-1.png index 6fe0c65fb8d1606dfd85e49e73a29cdcc7723d3b..cf991396958a2c9e41e06c7b4cbd430de36959e3 100644 Binary files a/public/07-basic_statistics_files/figure-html/MoransI-1.png and b/public/07-basic_statistics_files/figure-html/MoransI-1.png differ diff --git a/public/07-basic_statistics_files/figure-html/inc_visualization-1.png b/public/07-basic_statistics_files/figure-html/inc_visualization-1.png index 
f51d9d8ad227415eba0b327eb30d51e5bbb0ca9c..631e529f9b65f28451f2428a723aa42572800024 100644 Binary files a/public/07-basic_statistics_files/figure-html/inc_visualization-1.png and b/public/07-basic_statistics_files/figure-html/inc_visualization-1.png differ diff --git a/public/07-basic_statistics_files/figure-html/incidence_visualization-1.png b/public/07-basic_statistics_files/figure-html/incidence_visualization-1.png new file mode 100644 index 0000000000000000000000000000000000000000..573c23b0e3e6d18f1b59f99d3f20f6d3aeab929c Binary files /dev/null and b/public/07-basic_statistics_files/figure-html/incidence_visualization-1.png differ diff --git a/public/07-basic_statistics_files/figure-html/kd_test-1.png b/public/07-basic_statistics_files/figure-html/kd_test-1.png index 5f7ec7926a8c35a0e5abeb86280e58b47ec40142..925564b06a5c677095386aa29f9448c647115a68 100644 Binary files a/public/07-basic_statistics_files/figure-html/kd_test-1.png and b/public/07-basic_statistics_files/figure-html/kd_test-1.png differ diff --git a/public/search.json b/public/search.json index 51a756ffdb7bced48902b63c174529172cb259b1..40f3788f550b80523db5e45702ccf14f82813ce5 100644 --- a/public/search.json +++ b/public/search.json @@ -11,14 +11,14 @@ "href": "07-basic_statistics.html#import-and-visualize-epidemiological-data", "title": "7 Basic statistics for spatial analysis", "section": "7.1 Import and visualize epidemiological data", - "text": "7.1 Import and visualize epidemiological data\nIn this section, we load data that reference the cases of an imaginary disease throughout Cambodia. 
Each point correspond to the geolocalisation of a case.\n\nlibrary(dplyr)\nlibrary(sf)\n\n#Import Cambodia country border\ncountry <- st_read(\"data_cambodia/cambodia.gpkg\", layer = \"country\", quiet = TRUE)\n#Import provincial administrative border of Cambodia\neducation <- st_read(\"data_cambodia/cambodia.gpkg\", layer = \"education\", quiet = TRUE)\n#Import district administrative border of Cambodia\ndistrict <- st_read(\"data_cambodia/cambodia.gpkg\", layer = \"district\", quiet = TRUE)\n\n# Import locations of cases from an imaginary disease\ncases <- st_read(\"data_cambodia/cambodia.gpkg\", layer = \"cases\", quiet = TRUE)\ncases <- subset(cases, Disease == \"W fever\")\n\nThe first step of any statistical analysis always consists on visualizing the data to check they were correctly loaded and to observe general pattern of the cases.\n\n# View the cases object\nhead(cases)\n\nSimple feature collection with 6 features and 2 fields\nGeometry type: MULTIPOINT\nDimension: XY\nBounding box: xmin: 255891 ymin: 1179092 xmax: 506647.4 ymax: 1467441\nProjected CRS: WGS 84 / UTM zone 48N\n id Disease geom\n1 0 W fever MULTIPOINT ((280036.2 12841...\n2 1 W fever MULTIPOINT ((451859.5 11790...\n3 2 W fever MULTIPOINT ((255891 1467441))\n4 5 W fever MULTIPOINT ((506647.4 12322...\n5 6 W fever MULTIPOINT ((440668 1197958))\n6 7 W fever MULTIPOINT ((481594.5 12714...\n\n# Map the cases\nlibrary(mapsf)\n\nmf_map(x = district, border = \"white\")\nmf_map(x = country,lwd = 2, col = NA, add = TRUE)\nmf_map(x = cases, lwd = .5, col = \"#990000\", pch = 20, add = TRUE)\n\n\n\n\nIn epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, …) or a person at risk (e.g. a participant that may or may not experience the disease). 
Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use the district as the areal unit of the study.\n\n# Aggregate cases over districts\ndistrict$cases <- lengths(st_intersects(district, cases))\n\nThe incidence (\\(\\frac{cases}{population}\\)) is commonly use to represent cases distribution related to population density but other indicators exists. As example, the standardized incidence ratios (SIRs) represents the deviation of observed and expected number of cases and is expressed as \\(SIR = \\frac{Y_i}{E_i}\\) with \\(Y_i\\), the observed number of cases and \\(E_i\\), the expected number of cases. In this study, we computed the expected number of cases in each district by assuming infections are homogeneously distributed across Cambodia, i.e. the incidence is the same in each district. 
The SIR therefore represents the deviation of incidence compared to the averaged average incidence across Cambodia.\n\n# Compute incidence in each district (per 100 000 population)\ndistrict$incidence <- district$cases/district$T_POP * 100000\n\n# Compute the global risk\nrate <- sum(district$cases)/sum(district$T_POP)\n\n# Compute expected number of cases \ndistrict$expected <- district$T_POP * rate\n\n# Compute SIR\ndistrict$SIR <- district$cases / district$expected\n\n\npar(mfrow = c(1, 3))\n# Plot number of cases using proportional symbol \nmf_map(x = district) \nmf_map(\n x = district, \n var = \"cases\",\n val_max = 50,\n type = \"prop\",\n col = \"#990000\", \n leg_title = \"Cases\")\nmf_layout(title = \"Number of cases of W Fever\")\n\n# Plot incidence \nmf_map(x = district,\n var = \"incidence\",\n type = \"choro\",\n pal = \"Reds 3\",\n leg_title = \"Incidence \\n(per 100 000)\")\nmf_layout(title = \"Incidence of W Fever\")\n\n# Plot SIRs\n# create breaks and associated color palette\nbreak_SIR <- c(0, exp(mf_get_breaks(log(district$SIR), nbreaks = 8, breaks = \"pretty\")))\ncol_pal <- c(\"#273871\", \"#3267AD\", \"#6496C8\", \"#9BBFDD\", \"#CDE3F0\", \"#FFCEBC\", \"#FF967E\", \"#F64D41\", \"#B90E36\")\n\nmf_map(x = district,\n var = \"SIR\",\n type = \"choro\",\n breaks = break_SIR, \n pal = col_pal, \n cex = 2,\n leg_title = \"SIR\")\nmf_layout(title = \"Standardized Incidence Ratio of W Fever\")\n\n\n\n\nThese maps illustrates the spatial heterogenity of the cases. The incidence shows how the disease vary from one district to another while the SIR highlight districts that have :\n\nhigher risk than average (SIR > 1) when standardized for population\nlower risk than average (SIR < 1) when standardized for population\naverage risk (SIR ~ 1) when standardized for population\n\n\n\n\n\n\n\nTo go futher …\n\n\n\nIn this example, we standardized the cases distribution for population count. 
This simple standardization assume that the risk of contracting the disease is similar for each person. However, assumption does not hold for all diseases and for all observed events since confounding effects can create nuisance into the interpretations (e.g. the number of childhood illness and death outcomes in a district are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don’t want to analyze (e.g. sex ratio, occupations, age pyramid)." + "text": "7.1 Import and visualize epidemiological data\nIn this section, we load data that reference the cases of an imaginary disease, the W fever, throughout Cambodia. Each point corresponds to the geo-localization of a case.\n\nlibrary(dplyr)\nlibrary(sf)\n\n#Import Cambodia country border\ncountry <- st_read(\"data_cambodia/cambodia.gpkg\", layer = \"country\", quiet = TRUE)\n#Import provincial administrative border of Cambodia\neducation <- st_read(\"data_cambodia/cambodia.gpkg\", layer = \"education\", quiet = TRUE)\n#Import district administrative border of Cambodia\ndistrict <- st_read(\"data_cambodia/cambodia.gpkg\", layer = \"district\", quiet = TRUE)\n\n# Import locations of cases from an imaginary disease\ncases <- st_read(\"data_cambodia/cambodia.gpkg\", layer = \"cases\", quiet = TRUE)\ncases <- subset(cases, Disease == \"W fever\")\n\nThe first step of any statistical analysis always consists on visualizing the data to check they were correctly loaded and to observe general pattern of the cases.\n\n# View the cases object\nhead(cases)\n\nSimple feature collection with 6 features and 2 fields\nGeometry type: MULTIPOINT\nDimension: XY\nBounding box: xmin: 255891 ymin: 1179092 xmax: 506647.4 ymax: 1467441\nProjected CRS: WGS 84 / UTM zone 48N\n id Disease geom\n1 0 W fever MULTIPOINT ((280036.2 12841...\n2 1 W fever MULTIPOINT ((451859.5 11790...\n3 2 W fever MULTIPOINT ((255891 1467441))\n4 5 W fever 
MULTIPOINT ((506647.4 12322...\n5 6 W fever MULTIPOINT ((440668 1197958))\n6 7 W fever MULTIPOINT ((481594.5 12714...\n\n# Map the cases\nlibrary(mapsf)\n\nmf_map(x = district, border = \"white\")\nmf_map(x = country,lwd = 2, col = NA, add = TRUE)\nmf_map(x = cases, lwd = .5, col = \"#990000\", pch = 20, add = TRUE)\n\n\n\n\nIn epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, we cannot precisely tell if this observation represents an event of interest (e.g., illness, death, …) or a person at risk (e.g., a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appear as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use the district as the areal unit of the study.\n\n# Aggregate cases over districts\ndistrict$cases <- lengths(st_intersects(district, cases))\n\nThe incidence (\\(\\frac{cases}{population}\\)) expressed per 100,000 population is commonly use to represent cases distribution related to population density but other indicators exists. As example, the standardized incidence ratios (SIRs) represent the deviation of observed and expected number of cases and is expressed as \\(SIR = \\frac{Y_i}{E_i}\\) with \\(Y_i\\), the observed number of cases and \\(E_i\\), the expected number of cases. In this study, we computed the expected number of cases in each district by assuming infections are homogeneously distributed across Cambodia, i.e., the incidence is the same in each district. 
The SIR therefore represents the deviation of incidence compared to the average incidence across Cambodia.\n\n# Compute incidence in each district (per 100 000 population)\ndistrict$incidence <- district$cases/district$T_POP * 100000\n\n# Compute the global risk\nrate <- sum(district$cases)/sum(district$T_POP)\n\n# Compute expected number of cases \ndistrict$expected <- district$T_POP * rate\n\n# Compute SIR\ndistrict$SIR <- district$cases / district$expected\n\n\npar(mfrow = c(1, 3))\n# Plot number of cases using proportional symbol \nmf_map(x = district) \nmf_map(\n x = district, \n var = \"cases\",\n val_max = 50,\n type = \"prop\",\n col = \"#990000\", \n leg_title = \"Cases\")\nmf_layout(title = \"Number of cases of W Fever\")\n\n# Plot incidence \nmf_map(x = district,\n var = \"incidence\",\n type = \"choro\",\n pal = \"Reds 3\",\n breaks = exp(mf_get_breaks(log(district$incidence+1), breaks = \"pretty\"))-1,\n leg_title = \"Incidence \\n(per 100 000)\")\nmf_layout(title = \"Incidence of W Fever\")\n\n# Plot SIRs\n# create breaks and associated color palette\nbreak_SIR <- c(0,exp(mf_get_breaks(log(district$SIR), nbreaks = 8, breaks = \"pretty\")))\ncol_pal <- c(\"#273871\", \"#3267AD\", \"#6496C8\", \"#9BBFDD\", \"#CDE3F0\", \"#FFCEBC\", \"#FF967E\", \"#F64D41\", \"#B90E36\")\n\nmf_map(x = district,\n var = \"SIR\",\n type = \"choro\",\n breaks = break_SIR, \n pal = col_pal, \n cex = 2,\n leg_title = \"SIR\")\nmf_layout(title = \"Standardized Incidence Ratio of W Fever\")\n\n\n\n\nThese maps illustrate the spatial heterogeneity of the cases. 
The incidence shows how the disease vary from one district to another while the SIR highlight districts that have:\n\nhigher risk than average (SIR > 1) when standardized for population\nlower risk than average (SIR < 1) when standardized for population\naverage risk (SIR ~ 1) when standardized for population\n\n\n\n\n\n\n\nTo go further …\n\n\n\nIn this example, we standardized the cases distribution for population count. This simple standardization assumes that the risk of contracting the disease is similar for each person. However, assumption does not hold for all diseases and for all observed events since confounding effects can create nuisance into the interpretations (e.g., the number of childhood illness and death outcomes in a district are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don’t want to analyze (e.g., sex ratio, occupations, age pyramid).\nIn addition, one can wonder what does an \\(SIR \\~ 1\\) means, i.e., what is the threshold to decide whether the SIR is greater, lower or equivalent to 1. The significant of the SIR can be tested globally (to determine whether or not the incidence is homogeneously distributed) and locally in each district (to determine Which district have an SIR different than 1). We won’t perform these analyses in this tutorial but you can look at the function ?achisq.test() (from Dcluster package (Gómez-Rubio et al. 2015)) and ?probmap() (from spdep package (R. Bivand et al. 2015)) to compute these statistics." }, { "objectID": "07-basic_statistics.html#cluster-analysis", "href": "07-basic_statistics.html#cluster-analysis", "title": "7 Basic statistics for spatial analysis", "section": "7.2 Cluster analysis", - "text": "7.2 Cluster analysis\n\n7.2.1 General introduction\nWhy studying clusters in epidemiology ? Cluster analysis help identifying unusual patterns that occurs during a given period of time. 
The underlying ultimate goal of such analysis is to explain the observation of such patterns. In epidemiology, we can distinguish two types of process that would explain heterogeneity in case distribution :\n\nThe 1st order effects are the spatial variations of cases distribution caused by underlying properties of environment or the population structure itself. In such process individual get infected independently from the rest of the population. Such process includes the infection through a environment at risk as, for example, air pollution, contaminated waters or soils and UV exposition. This effect assume that the observed pattern are caused by a difference in risk intensity.\nThe 2nd order effects describes process of spread, contagion and diffusion of diseases caused by interactions between individuals. This includes transmission of infectious disease by proximity, but also the transmission of non-infectious disease, for example, with the diffusion of social norms within networks. This effect assume that the observed pattern are caused by correlations or co-variations.\n\nNo statistical methods could distinguish between these competing processes since their outcome results in similar pattern of points. The cluster analysis help describing the magnitude and the location of pattern but in no way could answer the question of why such patterns occurs. It is therefore a step that help detecting cluster for description and surveillance purpose and rising hypothesis on the underlying process that will lead further investigations.\nKnowledge about the disease and its transmission process could orientate the choice of the methods of study. We presented in this brief tutorial two methods of cluster detection, the Moran’s I test that test for spatial independence (likely related to 2nd order effects) and the scan statistics that test for homogeneous distribution (likely related 1st order effects). 
It relies on epidemiologist to select the tools that best serve the studied question.\n\n\n\n\n\n\nStatistic tests and distributions\n\n\n\nIn statistics, problems are usually expressed by defining two hypothesis : the null hypothesis (H0), i.e. an a priori hypothesis of the studied phenomenon (e.g. the situation is a random) and the alternative hypothesis (HA), e.g. the situation is not random. The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis.\nIn mathematics, a probability distribution is a mathematical expression that represents what we would expect due to random chance. The choice of the probability distribution relies on the type of data you use (continuous, count, binary). In general, three distribution a used while studying disease rates, the Binomial, the Poisson and the Poisson-gamma mixture (a.k.a negative binomial) distributions.\nMany the statistical tests assume by default that data are normally distributed. It implies that your variable is continuous and that all data could easily be represented by two parameters, the mean and the variance, i.e. each value have the same level of certainty. If many measure can be assessed under the normality assumption, this is usually not the case in epidemiology with strictly positives rates and count values that 1) does not fit the normal distribution and 2) does not provide with the same degree of certainty since variances likely differ between district due to different population size, i.e. 
some district have very sparse data (with high variance) while other have adequate data (with lower variance).\n\n# dataset statistics\nm_cases <- mean(district$incidence)\nsd_cases <- sd(district$incidence)\n\nhist(district$incidence, probability = TRUE, ylim = c(0, 0.4), xlim = c(-5, 16), xlab = \"Number of cases\", ylab = \"Probability\", main = \"Histogram of observed incidence compared\\nto Normal and Poisson distributions\")\ncurve(dnorm(x, m_cases, sd_cases),col = \"blue\", lwd = 1, add = TRUE)\npoints(0:max(district$incidence), dpois(0:max(district$incidence), m_cases),type = 'b', pch = 20, col = \"red\", ylim = c(0, 0.6), lty = 2)\n\nlegend(\"topright\", legend = c(\"Normal distribution\", \"Poisson distribution\", \"Observed distribution\"), col = c(\"blue\", \"red\", \"black\"),pch = c(NA, 20, NA), lty = c(1, 2, 1))\n\n\n\n\nIn this tutorial, we used the poisson distribution in our statistical tests.\n\n\n\n\n7.2.2 Test for spatial autocorrelation (Moran’s I test)\n\n7.2.2.1 The global Moran’s I test\nA popular test for spatial autocorrelation is the Moran’s test. This test tells us whether nearby units tend to exhibit similar incidences. It ranges from -1 to +1. A value of -1 denote that units with low rates are located near other units with high rates, while a Moran’s I value of +1 indicates a concentration of spatial units exhibiting similar rates.\n\n\n\n\n\n\nMoran’s I test\n\n\n\nThe Moran’s statistics is :\n\\[I = \\frac{N}{\\sum_{i=1}^N\\sum_{j=1}^Nw_{ij}}\\frac{\\sum_{i=1}^N\\sum_{j=1}^Nw_{ij}(Y_i-\\bar{Y})(Y_j - \\bar{Y})}{\\sum_{i=1}^N(Y_i-\\bar{Y})^2}\\] with :\n\n\\(N\\): the number of polygons,\n\\(w_{ij}\\): is a matrix of spatial weight with zeroes on the diagonal (i.e., \\(w_{ii}=0\\)). 
For example, if polygons are neighbors, the weight takes the value \\(1\\) otherwise it take the value \\(0\\).\n\\(Y_i\\): the variable of interest,\n\\(\\bar{Y}\\): the mean value of \\(Y\\).\n\nUnder the Moran’s test, the statistics hypothesis are :\n\nH0 : the distribution of cases is spatially independent, i.e. \\(I=0\\).\nH1: the distribution of cases is spatially autocorrelated, i.e. \\(I\\ne0\\).\n\n\n\nWe will compute the Moran’s statistics using spdep(R. Bivand et al. 2015) and Dcluster(Gómez-Rubio et al. 2015) packages. spdep package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. In this example, we use poly2nb() and nb2listw(). These function respectively detect the neighboring polygons and assign weight corresponding to \\(1/\\#\\ of\\ neighbors\\). Dcluster package provides a set of functions for the detection of spatial clusters of disease using count data.\n\nlibrary(spdep) # Functions for creating spatial weight, spatial analysis\nlibrary(DCluster) # Package with functions for spatial cluster analysis\n\nqueen_nb <- poly2nb(district) # Neighbors according to queen case\nq_listw <- nb2listw(queen_nb, style = 'W') # row-standardized weights\n\n# Moran's I test\nm_test <- moranI.test(cases ~ offset(log(expected)), \n data = district,\n model = 'poisson',\n R = 499,\n listw = q_listw,\n n = length(district$cases), # number of regions\n S0 = Szero(q_listw)) # Global sum of weights\nprint(m_test)\n\nMoran's I test of spatial autocorrelation \n\n Type of boots.: parametric \n Model used when sampling: Poisson \n Number of simulations: 499 \n Statistic: 0.1566449 \n p-value : 0.014 \n\nplot(m_test)\n\n\n\n\nThe Moran’s statistics is here \\(I =\\) 0.16. When comparing its value to the H0 distribution (built under 499 simulations), the probability of observing such a I value under the null hypothesis, i.e. the distribution of cases is spatially independent, is \\(p_{value} =\\) 0.014. 
We therefore reject H0 with error risk of \\(\\alpha = 5\\%\\). The distribution of cases is therefore autocorrelated across districts in Cambodia.\n\n\n7.2.2.2 Moran’s I local test\nThe global Moran’s test provides us a global statistical value informing whether autocorrelation occurs over the territory but does not inform on where does these correlation occurs, i.e. what is the locations of the clusters. To identify such cluster we can decompose the Moran’s I statistic to extract local informations of the level of correlation of each district and its neighbors. This is called the Local Moran’s I LISA statistic. Because the Local Moran’s I LISA statistic test each district for autocorrelation independently, concern are raised about multiple testing limitations that increase the Type I error (\\(\\alpha\\)) of the statistical tests. The use of local test should therefore be study in light of explore and describes clusters once the global test detected autocorrelation.\n\n\n\n\n\n\nStatistical test\n\n\n\nFor each district \\(i\\), the Moran’s statistics is :\n\\[I_i = \\frac{(Y_i-\\bar{Y})}{\\sum_{i=1}^N(Y_i-\\bar{Y})^2}\\sum_{j=1}^Nw_{ij}(Y_j - \\bar{Y}) \\text{ with } I = \\sum_{i=1}^NI_i/N\\]\n\n\nThe localmoran()function from the package spdep treats the variable of interest as if it was normally distributed. In some cases, this assumption could be reasonable for incidence rate, especially when the areal units of analysis have sufficiently large population count suggesting that the values have similar level of variances. Unfortunately, the local moran’s test has not been implemented for poisson distribution (population not large enough in some districts) in spdep package. However Bivand et al. (R. S. Bivand et al. 
2008) provided some code to manual perform the analysis using poisson distribution and was further implemented in the course “Spatial Epidemiology”.\n\n# Step 1 - Create the standardized deviation of observed from expected\nsd_lm <- (district$cases - district$expected) / sqrt(district$expected)\n\n# Step 2 - Create a spatially lagged version of standardized deviation of neighbors\nwsd_lm <- lag.listw(q_listw, sd_lm)\n\n# Step 3 - the local Moran's I is the product of step 1 and step 2\ndistrict$I_lm <- sd_lm * wsd_lm\n\n# Step 4 - setup parameters for simulation of the null distribution\n\n# Specify number of simulations to run\nnsim <- 499\n\n# Specify dimensions of result based on number of regions\nN <- length(district$expected)\n\n# Create a matrix of zeros to hold results, with a row for each county, and a column for each simulation\nsims <- matrix(0, ncol = nsim, nrow = N)\n\n# Step 5 - Start a for-loop to iterate over simulation columns\nfor(i in 1:nsim){\n y <- rpois(N, lambda = district$expected) # generate a random event count, given expected\n sd_lmi <- (y - district$expected) / sqrt(district$expected) # standardized local measure\n wsd_lmi <- lag.listw(q_listw, sd_lmi) # standardized spatially lagged measure\n sims[, i] <- sd_lmi * wsd_lmi # this is the I(i) statistic under this iteration of null\n}\n\nhist(sims[1,])\n\n\n\n# Step 6 - For each county, test where the observed value ranks with respect to the null simulations\nxrank <- apply(cbind(district$I_lm, sims), 1, function(x) rank(x)[1])\n\n# Step 7 - Calculate the difference between observed rank and total possible (nsim)\ndiff <- nsim - xrank\ndiff <- ifelse(diff > 0, diff, 0)\n\n# Step 8 - Assuming a uniform distribution of ranks, calculate p-value for observed\n# given the null distribution generate from simulations\ndistrict$pval_lm <- punif((diff + 1) / (nsim + 1))\n\nFor each district, we obtain a p-value based on permutations process\nA conventional way of plotting these results is to 
classify the districts into 5 classes based on local Moran’s I outputs. The classification of cluster that are significantly autocorrelated to their neighbors is performed based on a comparison of the scaled incidence in the district compared to the scaled weighted averaged incidence of it neighboring districts (computed with lag.listw()) :\n\nDistricts that have higher-than-average rates in both index regions and their neighbors and showing statistically significant positive values for the local \\(I_i\\) statistic are defined as High-High (hotspot of the disease)\nDistricts that have lower-than-average rates in both index regions and their neighbors adn showing statistically significant positive values for the local \\(I_i\\) statistic are defined as Low-Low (coldspot of the disease).\nDistricts that have higher-than-average rates in the index regions and lower-than-average rates in their neighbors, and showing statistically significant negative values for the local \\(I_i\\) statistic are defined as High-Low(outlier with high incidence in an area with low incidence).\nDistricts that have lower-than-average rates in the index regions and higher-than-average rates in their neighbors, and showing statistically significant negative values for the local \\(I_i\\) statistic are defined as Low-High(outlier of low incidence in area with high incidence).\nDistricts with non-significant values for the \\(I_i\\) statistic are defined as Non-significant.\n\n\n# create lagged local raw_rate - in other words the average of the queen neighbors value\n# values are scaled (centered and reduced) to be compared to average\ndistrict$lag_std <- scale(lag.listw(q_listw, var = district$incidence))\ndistrict$incidence_std <- scale(district$incidence)\n\n# extract pvalues\n# district$lm_pv <- lm_test[,5]\n\n# Classify local moran's outputs\ndistrict$lm_class <- NA\ndistrict$lm_class[district$incidence_std >=0 & district$lag_std >=0] <- 
'High-High'\ndistrict$lm_class[district$incidence_std <=0 & district$lag_std <=0] <- 'Low-Low'\ndistrict$lm_class[district$incidence_std <=0 & district$lag_std >=0] <- 'Low-High'\ndistrict$lm_class[district$incidence_std >=0 & district$lag_std <=0] <- 'High-Low'\ndistrict$lm_class[district$pval_lm >= 0.05] <- 'Non-significant'\n\ndistrict$lm_class <- factor(district$lm_class, levels=c(\"High-High\", \"Low-Low\", \"High-Low\", \"Low-High\", \"Non-significant\") )\n\n# create map\nmf_map(x = district,\n var = \"lm_class\",\n type = \"typo\",\n cex = 2,\n col_na = \"white\",\n #val_order = c(\"High-High\", \"Low-Low\", \"High-Low\", \"Low-High\", \"Non-significant\") ,\n pal = c(\"#6D0026\" , \"blue\", \"white\") , # \"#FF755F\",\"#7FABD3\" ,\n leg_title = \"Clusters\")\n\nmf_layout(title = \"Cluster using Local moran'I statistic\")\n\n\n\n\n\n\n\n7.2.3 Spatial scan statistics\nWhile Moran’s indice focuses on testing for autocorrelation between neighboring polygons (under the null assumption of spatial independance), the spatial scan statistic aims at identifying an abnormal higher risk in a given region compared to the risk outside of this region (under the null assumption of homogeneous distribution). The conception of a cluster is therefore different between the two methods.\nThe function kulldorff from the package SpatialEpi (Kim and Wakefield 2010) is a simple tool to implement spatial-only scan statistics. 
Briefly, the kulldorff scan statistics scan the area for clusters using several steps:\n\nIt create a circular window of observation by defining a single location and an associated radius of the windows varying from 0 to a large number that depends on population distribution (largest radius could includes 50% of the population).\nIt aggregates the count of events and the population at risk (or an expected count of events) inside and outside the window of observation.\nFinally, it computes the likelihood ratio to test whether the risk is equal inside versus outside the windows (H0) or greater inside the observed window\nThese 3 steps are repeted for each location and each possible windows-radii.\n\n\nlibrary(\"SpatialEpi\")\n\nThe use of R spatial object is not implementes in kulldorff() function. It uses instead matrix of xy coordinates that represents the centroids of the districts. A given district is included into the observed circular window if its centroids falls into the circle.\n\ndistrict_xy <- st_centroid(district) %>% \n st_coordinates()\n\nhead(district_xy)\n\n X Y\n1 330823.3 1464560\n2 749758.3 1541787\n3 468384.0 1277007\n4 494548.2 1215261\n5 459644.2 1194615\n6 360528.3 1516339\n\n\nWe can then call kulldorff function (you are strongly encourage to call ?kulldorff to properly call the function). The alpha.level threshold filter for the secondary clusters that will be retained. The most-likely cluster will be saved whatever its significance.\n\nkd_Wfever <- kulldorff(district_xy, \n cases = district$cases,\n population = district$T_POP,\n expected.cases = district$expected,\n pop.upper.bound = 0.5, # include maximum 50% of the population in a windows\n n.simulations = 499,\n alpha.level = 0.2)\n\n\n\n\nAll outputs are saved into an R object, here called kd_Wfever. 
Unfortunately the package did not developed any summary and visualization of the results but we can explore the output object.\n\nnames(kd_Wfever)\n\n[1] \"most.likely.cluster\" \"secondary.clusters\" \"type\" \n[4] \"log.lkhd\" \"simulated.log.lkhd\" \n\n\nFirst, we can focus on the most likely cluster and explore its characteristics.\n\n# We can see which districts (r number) belong to this cluster\nkd_Wfever$most.likely.cluster$location.IDs.included\n\n [1] 48 93 66 180 133 29 194 118 50 144 31 141 3 117 22 43 142\n\n# standardized incidence ratio\nkd_Wfever$most.likely.cluster$SMR\n\n[1] 2.303106\n\n# number of observed and expected cases in this cluster\nkd_Wfever$most.likely.cluster$number.of.cases\n\n[1] 122\n\nkd_Wfever$most.likely.cluster$expected.cases\n\n[1] 52.97195\n\n\n17 districts belong to the cluster and its number of cases is 2.3 times higher than the expected number of case.\nSimilarly, we could study the secondary clusters. Results are saved in a list.\n\n# We can see which districts (r number) belong to this cluster\nlength(kd_Wfever$secondary.clusters)\n\n[1] 1\n\n# retrieve data for all secondary clusters into a table\ndf_secondary_clusters <- data.frame(SMR = sapply(kd_Wfever$secondary.clusters, '[[', 5), \n number.of.cases = sapply(kd_Wfever$secondary.clusters, '[[', 3),\n expected.cases = sapply(kd_Wfever$secondary.clusters, '[[', 4),\n p.value = sapply(kd_Wfever$secondary.clusters, '[[', 8))\n\nprint(df_secondary_clusters)\n\n SMR number.of.cases expected.cases p.value\n1 3.767698 16 4.246625 0.004\n\n\nWe only have one secondary cluster composed of one district.\n\n# create empty column to store cluster informations\ndistrict$k_cluster <- NA\n\n# save cluster informations from kulldorff outputs\ndistrict$k_cluster[kd_Wfever$most.likely.cluster$location.IDs.included] <- 'Most likely cluster'\n\nfor(i in 1:length(kd_Wfever$secondary.clusters)){\ndistrict$k_cluster[kd_Wfever$secondary.clusters[[i]]$location.IDs.included] <- paste(\n 
'Secondary cluster', i, sep = '')\n}\n\n#district$k_cluster[is.na(district$k_cluster)] <- \"No cluster\"\n\n\n# create map\nmf_map(x = district,\n var = \"k_cluster\",\n type = \"typo\",\n cex = 2,\n col_na = \"white\",\n pal = mf_get_pal(palette = \"Reds\", n = 3)[1:2],\n leg_title = \"Clusters\")\n\nmf_layout(title = \"Cluster using kulldorf scan statistic\")\n\n\n\n\n\n\n\n\n\n\nTo go futher …\n\n\n\nIn this example, the expected number of cases was defined using the population count but note that standardization over other variables as age could also be implemented with the strata parameter in the kulldorff() function.\nIn addition, this cluster analysis was performed solely using the spatial scan but you should keep in mind that this method of cluster detection can be implemented for spatio-temporal data as well where the cluster definition is an abnormal number of cases in a delimited spatial area and during a given period of time. The windows of observation are therefore defined for a different center, radius and period of time. You should look at the function scan_ep_poisson() function in the package scanstatistic (Allévius 2018) for this analysis.\n\n\n\n\n\n\nAllévius, Benjamin. 2018. “Scanstatistics: Space-Time Anomaly Detection Using Scan Statistics.” Journal of Open Source Software 3 (25): 515.\n\n\nBivand, Roger S, Edzer J Pebesma, Virgilio Gómez-Rubio, and Edzer Jan Pebesma. 2008. Applied Spatial Data Analysis with r. Vol. 747248717. Springer.\n\n\nBivand, Roger, Micah Altman, Luc Anselin, Renato Assunção, Olaf Berke, Andrew Bernat, and Guillaume Blanchet. 2015. “Package ‘Spdep’.” The Comprehensive R Archive Network.\n\n\nGómez-Rubio, Virgilio, Juan Ferrándiz-Ferragud, Antonio López-Quílez, et al. 2015. “Package ‘DCluster’.”\n\n\nKim, Albert Y, and Jon Wakefield. 2010. “R Data and Methods for Spatial Epidemiology: The SpatialEpi Package.” Dept of Statistics, University of Washington." 
+ "text": "7.2 Cluster analysis\n\n7.2.1 General introduction\nWhy study clusters in epidemiology? Cluster analysis helps identify unusual patterns that occur during a given period of time. The ultimate goal of such analysis is to explain why these patterns are observed. In epidemiology, we can distinguish two types of processes that would explain heterogeneity in case distribution:\n\nThe 1st order effects are the spatial variations in case distribution caused by underlying properties of the environment or the population structure itself. In such processes, individuals get infected independently from the rest of the population. These processes include infection through an at-risk environment, for example air pollution, contaminated water or soil, and UV exposure. This effect assumes that the observed pattern is caused by a difference in risk intensity.\nThe 2nd order effects describe processes of spread, contagion and diffusion of diseases caused by interactions between individuals. This includes transmission of infectious diseases by proximity, but also the transmission of non-infectious diseases, for example through the diffusion of social norms within networks. This effect assumes that the observed pattern is caused by correlations or co-variations.\n\nNo statistical method can distinguish between these competing processes, since their outcomes result in similar point patterns. Cluster analysis helps describe the magnitude and location of a pattern but cannot answer the question of why such a pattern occurs. It is therefore a step that helps detect clusters for description and surveillance purposes and raise hypotheses about the underlying processes that will guide further investigations.\nKnowledge about the disease and its transmission process can guide the choice of study methods. 
We present in this brief tutorial two methods of cluster detection: the Moran’s I test, which tests for spatial independence (likely related to 2nd order effects), and the scan statistics, which test for homogeneous distribution (likely related to 1st order effects). It is up to the epidemiologist to select the tools that best serve the question under study.\n\n\n\n\n\n\nStatistical tests and distributions\n\n\n\nIn statistics, problems are usually expressed by defining two hypotheses: the null hypothesis (H0), i.e., an a priori hypothesis about the studied phenomenon (e.g., the situation is random), and the alternative hypothesis (HA), e.g., the situation is not random. The main principle is to measure how likely it is that the observed situation belongs to the set of situations that are possible under H0.\nIn mathematics, a probability distribution is a mathematical expression that represents what we would expect due to random chance. The choice of the probability distribution depends on the type of data you use (continuous, count, binary). In general, three distributions are used when studying disease rates: the Binomial, the Poisson and the Poisson-gamma mixture (also known as negative binomial) distributions.\nMany statistical tests assume by default that data are normally distributed. This implies that your variable is continuous and that all data can be represented by two parameters, the mean and the variance, i.e., each value has the same level of certainty. 
Although many measures can be assessed under the normality assumption, this is usually not the case in epidemiology, where strictly positive rates and count values 1) do not fit the normal distribution and 2) do not carry the same degree of certainty, since variances likely differ between districts due to different population sizes, i.e., some districts have very sparse data (with high variance) while others have adequate data (with lower variance).\n\n# dataset statistics\nm_cases <- mean(district$incidence)\nsd_cases <- sd(district$incidence)\n\nhist(district$incidence, probability = TRUE, ylim = c(0, 0.4), xlim = c(-5, 16), xlab = \"Number of cases\", ylab = \"Probability\", main = \"Histogram of observed incidence compared\\nto Normal and Poisson distributions\")\ncurve(dnorm(x, m_cases, sd_cases),col = \"blue\", lwd = 1, add = TRUE)\npoints(0:max(district$incidence), dpois(0:max(district$incidence), m_cases),type = 'b', pch = 20, col = \"red\", ylim = c(0, 0.6), lty = 2)\n\nlegend(\"topright\", legend = c(\"Normal distribution\", \"Poisson distribution\", \"Observed distribution\"), col = c(\"blue\", \"red\", \"black\"),pch = c(NA, 20, NA), lty = c(1, 2, 1))\n\n\n\n\nIn this tutorial, we used the Poisson distribution in our statistical tests.\n\n\n\n\n7.2.2 Test for spatial autocorrelation (Moran’s I test)\n\n7.2.2.1 The global Moran’s I test\nA popular test for spatial autocorrelation is the Moran’s test. This test tells us whether nearby units tend to exhibit similar incidences. The statistic ranges from -1 to +1. 
A value of -1 denotes that units with low rates are located near other units with high rates, while a Moran’s I value of +1 indicates a concentration of spatial units exhibiting similar rates.\n\n\n\n\n\n\nMoran’s I test\n\n\n\nThe Moran’s statistic is:\n\\[I = \\frac{N}{\\sum_{i=1}^N\\sum_{j=1}^Nw_{ij}}\\frac{\\sum_{i=1}^N\\sum_{j=1}^Nw_{ij}(Y_i-\\bar{Y})(Y_j - \\bar{Y})}{\\sum_{i=1}^N(Y_i-\\bar{Y})^2}\\] with:\n\n\\(N\\): the number of polygons,\n\\(w_{ij}\\): the elements of a matrix of spatial weights with zeroes on the diagonal (i.e., \\(w_{ii}=0\\)). For example, if polygons are neighbors, the weight takes the value \\(1\\), otherwise it takes the value \\(0\\).\n\\(Y_i\\): the variable of interest,\n\\(\\bar{Y}\\): the mean value of \\(Y\\).\n\nUnder the Moran’s test, the statistical hypotheses are:\n\nH0: the distribution of cases is spatially independent, i.e., \\(I=0\\).\nH1: the distribution of cases is spatially autocorrelated, i.e., \\(I\\ne0\\).\n\n\n\nWe will compute the Moran’s statistic using the spdep (R. Bivand et al. 2015) and DCluster (Gómez-Rubio et al. 2015) packages. The spdep package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. In this example, we use poly2nb() and nb2listw(). These functions respectively detect the neighboring polygons and assign weights corresponding to \\(1/\\#\\ of\\ neighbors\\). 
The DCluster package provides a set of functions for the detection of spatial clusters of disease using count data.\n\nlibrary(spdep) # Functions for creating spatial weight, spatial analysis\nlibrary(DCluster) # Package with functions for spatial cluster analysis\n\nqueen_nb <- poly2nb(district) # Neighbors according to queen case\nq_listw <- nb2listw(queen_nb, style = 'W') # row-standardized weights\n\n# Moran's I test\nm_test <- moranI.test(cases ~ offset(log(expected)), \n data = district,\n model = 'poisson',\n R = 499,\n listw = q_listw,\n n = length(district$cases), # number of regions\n S0 = Szero(q_listw)) # Global sum of weights\nprint(m_test)\n\nMoran's I test of spatial autocorrelation \n\n Type of boots.: parametric \n Model used when sampling: Poisson \n Number of simulations: 499 \n Statistic: 0.1566449 \n p-value : 0.012 \n\nplot(m_test)\n\n\n\n\nThe Moran’s statistic is here \\(I =\\) 0.16. When comparing its value to the H0 distribution (built from 499 simulations), the probability of observing such an I value under the null hypothesis, i.e., that the distribution of cases is spatially independent, is \\(p_{value} =\\) 0.012. We therefore reject H0 with an error risk of \\(\\alpha = 5\\%\\). The distribution of cases is therefore autocorrelated across districts in Cambodia.\n\n\n7.2.2.2 Moran’s I local test\nThe global Moran’s test provides a single statistical value indicating whether autocorrelation occurs over the territory, but it does not indicate where this correlation occurs, i.e., where the clusters are located. To identify such clusters, we can decompose the Moran’s I statistic to extract local information on the level of correlation between each district and its neighbors. This is called the Local Moran’s I LISA statistic. Because the Local Moran’s I LISA statistic tests each district for autocorrelation independently, concerns are raised about multiple testing, which increases the Type I error (\\(\\alpha\\)) of the statistical tests. 
The local test should therefore be used to explore and describe clusters once the global test has detected autocorrelation.\n\n\n\n\n\n\nStatistical test\n\n\n\nFor each district \\(i\\), the Local Moran’s I statistic is:\n\\[I_i = \\frac{(Y_i-\\bar{Y})}{\\sum_{i=1}^N(Y_i-\\bar{Y})^2}\\sum_{j=1}^Nw_{ij}(Y_j - \\bar{Y}) \\text{ with } I = \\sum_{i=1}^NI_i/N\\]\n\n\nThe localmoran() function from the package spdep treats the variable of interest as if it were normally distributed. In some cases, this assumption could be reasonable for incidence rates, especially when the areal units of analysis have sufficiently large population counts, suggesting that the values have similar levels of variance. Unfortunately, the local Moran’s test has not been implemented for the Poisson distribution (population not large enough in some districts) in the spdep package. However, Bivand et al. (R. S. Bivand et al. 2008) provided some code to manually perform the analysis using the Poisson distribution, which was further implemented in the course “Spatial Epidemiology”.\n\n# Step 1 - Create the standardized deviation of observed from expected\nsd_lm <- (district$cases - district$expected) / sqrt(district$expected)\n\n# Step 2 - Create a spatially lagged version of standardized deviation of neighbors\nwsd_lm <- lag.listw(q_listw, sd_lm)\n\n# Step 3 - the local Moran's I is the product of step 1 and step 2\ndistrict$I_lm <- sd_lm * wsd_lm\n\n# Step 4 - setup parameters for simulation of the null distribution\n\n# Specify number of simulations to run\nnsim <- 499\n\n# Specify dimensions of result based on number of regions\nN <- length(district$expected)\n\n# Create a matrix of zeros to hold results, with a row for each county, and a column for each simulation\nsims <- matrix(0, ncol = nsim, nrow = N)\n\n# Step 5 - Start a for-loop to iterate over simulation columns\nfor(i in 1:nsim){\n y <- rpois(N, lambda = district$expected) # generate a random event count, given expected\n sd_lmi <- (y - 
district$expected) / sqrt(district$expected) # standardized local measure\n wsd_lmi <- lag.listw(q_listw, sd_lmi) # standardized spatially lagged measure\n sims[, i] <- sd_lmi * wsd_lmi # this is the I(i) statistic under this iteration of null\n}\n\nhist(sims[1,])\n\n\n\n# Step 6 - For each county, test where the observed value ranks with respect to the null simulations\nxrank <- apply(cbind(district$I_lm, sims), 1, function(x) rank(x)[1])\n\n# Step 7 - Calculate the difference between observed rank and total possible (nsim)\ndiff <- nsim - xrank\ndiff <- ifelse(diff > 0, diff, 0)\n\n# Step 8 - Assuming a uniform distribution of ranks, calculate p-value for observed\n# given the null distribution generate from simulations\ndistrict$pval_lm <- punif((diff + 1) / (nsim + 1))\n\nFor each district, we obtain a p-value based on permutations process\nA conventional way of plotting these results is to classify the districts into 5 classes based on local Moran’s I output. The classification of cluster that are significantly autocorrelated to their neighbors is performed based on a comparison of the scaled incidence in the district compared to the scaled weighted averaged incidence of it neighboring districts (computed with lag.listw()):\n\nDistricts that have higher-than-average rates in both index regions and their neighbors and showing statistically significant positive values for the local \\(I_i\\) statistic are defined as High-High (hotspot of the disease)\nDistricts that have lower-than-average rates in both index regions and their neighbors and showing statistically significant positive values for the local \\(I_i\\) statistic are defined as Low-Low (cold spot of the disease).\nDistricts that have higher-than-average rates in the index regions and lower-than-average rates in their neighbors, and showing statistically significant negative values for the local \\(I_i\\) statistic are defined as High-Low(outlier with high incidence in an area with low 
incidence).\nDistricts that have lower-than-average rates in the index regions and higher-than-average rates in their neighbors, and showing statistically significant negative values for the local \\(I_i\\) statistic are defined as Low-High (outlier of low incidence in area with high incidence).\nDistricts with non-significant values for the \\(I_i\\) statistic are defined as Non-significant.\n\n\n# create lagged local raw_rate - in other words the average of the queen neighbors value\n# values are scaled (centered and reduced) to be compared to average\ndistrict$lag_std <- scale(lag.listw(q_listw, var = district$incidence))\ndistrict$incidence_std <- scale(district$incidence)\n\n# extract pvalues\n# district$lm_pv <- lm_test[,5]\n\n# Classify local moran's outputs\ndistrict$lm_class <- NA\ndistrict$lm_class[district$incidence_std >=0 & district$lag_std >=0] <- 'High-High'\ndistrict$lm_class[district$incidence_std <=0 & district$lag_std <=0] <- 'Low-Low'\ndistrict$lm_class[district$incidence_std <=0 & district$lag_std >=0] <- 'Low-High'\ndistrict$lm_class[district$incidence_std >=0 & district$lag_std <=0] <- 'High-Low'\ndistrict$lm_class[district$pval_lm >= 0.05] <- 'Non-significant'\n\ndistrict$lm_class <- factor(district$lm_class, levels=c(\"High-High\", \"Low-Low\", \"High-Low\", \"Low-High\", \"Non-significant\") )\n\n# create map\nmf_map(x = district,\n var = \"lm_class\",\n type = \"typo\",\n cex = 2,\n col_na = \"white\",\n #val_order = c(\"High-High\", \"Low-Low\", \"High-Low\", \"Low-High\", \"Non-significant\") ,\n pal = c(\"#6D0026\" , \"blue\", \"white\") , # \"#FF755F\",\"#7FABD3\" ,\n leg_title = \"Clusters\")\n\nmf_layout(title = \"Cluster using Local Moran's I statistic\")\n\n\n\n\n\n\n\n7.2.3 Spatial scan statistics\nWhile Moran’s indices focus on testing for autocorrelation between neighboring polygons (under the null assumption of spatial independence), the spatial scan statistic aims at identifying an abnormal higher risk in a given region 
compared to the risk outside of this region (under the null assumption of homogeneous distribution). The conception of a cluster is therefore different between the two methods.\nThe function kulldorff from the package SpatialEpi (Kim and Wakefield 2010) is a simple tool to implement spatial-only scan statistics. Briefly, the kulldorff scan statistics scan the area for clusters using several steps:\n\nIt create a circular window of observation by defining a single location and an associated radius of the windows varying from 0 to a large number that depends on population distribution (largest radius could include 50% of the population).\nIt aggregates the count of events and the population at risk (or an expected count of events) inside and outside the window of observation.\nFinally, it computes the likelihood ratio to test whether the risk is equal inside versus outside the windows (H0) or greater inside the observed window\nThese 3 steps are repeated for each location and each possible windows-radii.\n\n\nlibrary(\"SpatialEpi\")\n\nThe use of R spatial object is not implements in kulldorff() function. It uses instead matrix of xy coordinates that represents the centroids of the districts. A given district is included into the observed circular window if its centroids fall into the circle.\n\ndistrict_xy <- st_centroid(district) %>% \n st_coordinates()\n\nhead(district_xy)\n\n X Y\n1 330823.3 1464560\n2 749758.3 1541787\n3 468384.0 1277007\n4 494548.2 1215261\n5 459644.2 1194615\n6 360528.3 1516339\n\n\nWe can then call kulldorff function (you are strongly encouraged to call ?kulldorff to properly call the function). The alpha.level threshold filter for the secondary clusters that will be retained. 
The most-likely cluster will be saved whatever its significance.\n\nkd_Wfever <- kulldorff(district_xy, \n cases = district$cases,\n population = district$T_POP,\n expected.cases = district$expected,\n pop.upper.bound = 0.5, # include maximum 50% of the population in a windows\n n.simulations = 499,\n alpha.level = 0.2)\n\n\n\n\nAll outputs are saved into an R object, here called kd_Wfever. Unfortunately, the package did not develop any summary and visualization of the results but we can explore the output object.\n\nnames(kd_Wfever)\n\n[1] \"most.likely.cluster\" \"secondary.clusters\" \"type\" \n[4] \"log.lkhd\" \"simulated.log.lkhd\" \n\n\nFirst, we can focus on the most likely cluster and explore its characteristics.\n\n# We can see which districts (r number) belong to this cluster\nkd_Wfever$most.likely.cluster$location.IDs.included\n\n [1] 48 93 66 180 133 29 194 118 50 144 31 141 3 117 22 43 142\n\n# standardized incidence ratio\nkd_Wfever$most.likely.cluster$SMR\n\n[1] 2.303106\n\n# number of observed and expected cases in this cluster\nkd_Wfever$most.likely.cluster$number.of.cases\n\n[1] 122\n\nkd_Wfever$most.likely.cluster$expected.cases\n\n[1] 52.97195\n\n\n17 districts belong to the cluster and its number of cases is 2.3 times higher than the expected number of cases.\nSimilarly, we could study the secondary clusters. 
Results are saved in a list.\n\n# We can see which districts (r number) belong to this cluster\nlength(kd_Wfever$secondary.clusters)\n\n[1] 1\n\n# retrieve data for all secondary clusters into a table\ndf_secondary_clusters <- data.frame(SMR = sapply(kd_Wfever$secondary.clusters, '[[', 5), \n number.of.cases = sapply(kd_Wfever$secondary.clusters, '[[', 3),\n expected.cases = sapply(kd_Wfever$secondary.clusters, '[[', 4),\n p.value = sapply(kd_Wfever$secondary.clusters, '[[', 8))\n\nprint(df_secondary_clusters)\n\n SMR number.of.cases expected.cases p.value\n1 3.767698 16 4.246625 0.008\n\n\nWe only have one secondary cluster composed of one district.\n\n# create empty column to store cluster informations\ndistrict$k_cluster <- NA\n\n# save cluster information from kulldorff outputs\ndistrict$k_cluster[kd_Wfever$most.likely.cluster$location.IDs.included] <- 'Most likely cluster'\n\nfor(i in 1:length(kd_Wfever$secondary.clusters)){\ndistrict$k_cluster[kd_Wfever$secondary.clusters[[i]]$location.IDs.included] <- paste(\n 'Secondary cluster', i, sep = '')\n}\n\n#district$k_cluster[is.na(district$k_cluster)] <- \"No cluster\"\n\n\n# create map\nmf_map(x = district,\n var = \"k_cluster\",\n type = \"typo\",\n cex = 2,\n col_na = \"white\",\n pal = mf_get_pal(palette = \"Reds\", n = 3)[1:2],\n leg_title = \"Clusters\")\n\nmf_layout(title = \"Cluster using kulldorf scan statistic\")\n\n\n\n\n\n\n\n\n\n\nTo go further …\n\n\n\nIn this example, the expected number of cases was defined using the population count but note that standardization over other variables as age could also be implemented with the strata parameter in the kulldorff() function.\nIn addition, this cluster analysis was performed solely using the spatial scan but you should keep in mind that this method of cluster detection can be implemented for spatio-temporal data as well where the cluster definition is an abnormal number of cases in a delimited spatial area and during a given period of time. 
The windows of observation are therefore defined for a different center, radius and period of time. You should look at the function scan_ep_poisson() function in the package scanstatistic (Allévius 2018) for this analysis.\n\n\n\n\n\n\nAllévius, Benjamin. 2018. “Scanstatistics: Space-Time Anomaly Detection Using Scan Statistics.” Journal of Open Source Software 3 (25): 515.\n\n\nBivand, Roger S, Edzer J Pebesma, Virgilio Gómez-Rubio, and Edzer Jan Pebesma. 2008. Applied Spatial Data Analysis with r. Vol. 747248717. Springer.\n\n\nBivand, Roger, Micah Altman, Luc Anselin, Renato Assunção, Olaf Berke, Andrew Bernat, and Guillaume Blanchet. 2015. “Package ‘Spdep’.” The Comprehensive R Archive Network.\n\n\nGómez-Rubio, Virgilio, Juan Ferrándiz-Ferragud, Antonio López-Quílez, et al. 2015. “Package ‘DCluster’.”\n\n\nKim, Albert Y, and Jon Wakefield. 2010. “R Data and Methods for Spatial Epidemiology: The SpatialEpi Package.” Dept of Statistics, University of Washington." }, { "objectID": "01-introduction.html", @@ -32,7 +32,7 @@ "href": "01-introduction.html#the-package-sf", "title": "1 Introduction", "section": "1.2 The package sf", - "text": "1.2 The package sf\n The package sf was released in late 2016 by Edzer Pebesma (also author of sp). Its goal is to combine the feature of sp, rgeos and rgdal in a single, more ergonomic package. This package offers simple objects (following the simple feature standard) which are easier to manipulate. 
Particular attention has been paid to the compatibility of the package with the pipe syntax and the operators of the tidyverse.\nsf directly uses the GDAL, GEOS and PROJ libraries.\n\n\n\n\n\nFrom r-spatial.org\n\n\n\n\n\n\nWebsite of package sf : Simple Features for R\n\n\n\nMany of the spatial data available on the internet are in shapefile format, which can be opened in the following way\n\nlibrary(sf)\n\nLinking to GEOS 3.10.2, GDAL 3.4.3, PROJ 8.2.1; sf_use_s2() is TRUE\n\ndistrict <- st_read(\"data_cambodia/district.shp\")\n\nReading layer `district' from data source \n `/home/lucas/Documents/ForgeIRD/rspatial-for-onehealth/data_cambodia/district.shp' \n using driver `ESRI Shapefile'\nSimple feature collection with 197 features and 10 fields\nGeometry type: MULTIPOLYGON\nDimension: XY\nBounding box: xmin: 211534.7 ymin: 1149105 xmax: 784612.1 ymax: 1625495\nProjected CRS: WGS 84 / UTM zone 48N\n\n\n\n\n\n\n\n\nShapefile format limitations\n\n\n\nFor the multiple limitations of this format (multi-file, limited number of records…) we advise you to prefer another format such as the geopackage *.gpkg. 
All the good reasons not to use the shapefile are here.\n\n\nA geopackage is a database, to load a layer, you must know its name\n\nst_layers(\"data_cambodia/cambodia.gpkg\")\n\nDriver: GPKG \nAvailable layers:\n layer_name geometry_type features fields crs_name\n1 country Multi Polygon 1 10 WGS 84 / UTM zone 48N\n2 district Multi Polygon 197 10 WGS 84 / UTM zone 48N\n3 education Multi Polygon 25 19 WGS 84 / UTM zone 48N\n4 hospital Point 956 13 WGS 84 / UTM zone 48N\n5 cases Multi Point 972 2 WGS 84 / UTM zone 48N\n6 road Multi Line String 6 9 WGS 84 / UTM zone 48N\n\n\n\nroad <- st_read(\"data_cambodia/cambodia.gpkg\", layer = \"road\")\n\nReading layer `road' from data source \n `/home/lucas/Documents/ForgeIRD/rspatial-for-onehealth/data_cambodia/cambodia.gpkg' \n using driver `GPKG'\nSimple feature collection with 6 features and 9 fields\nGeometry type: MULTILINESTRING\nDimension: XY\nBounding box: xmin: 212377 ymin: 1152214 xmax: 784654.7 ymax: 1625281\nProjected CRS: WGS 84 / UTM zone 48N\n\n\n\n1.2.1 Format of spatial objects sf\n\n\n\n\n\nObjectssf are objects in data.frame which one of the columns contains geometries. This column is the class of sfc (simple feature column) and each individual of the column is a sfg (simple feature geometry). This format is very practical insofa as the data and the geometries are intrinsically linked in the same object.\n\n\n\n\n\n\nThumbnail describing the simple feature format: Simple Features for R\n\n\n\n\n\n\n\n\n\nTip\n\n\n\nA benchmark of vector processing libraries is available here." + "text": "1.2 The package sf\n The package sf was released in late 2016 by Edzer Pebesma (also author of sp). Its goal is to combine the feature of sp, rgeos and rgdal in a single, more ergonomic package. This package offers simple objects (following the simple feature standard) which are easier to manipulate. 
Particular attention has been paid to the compatibility of the package with the pipe syntax and the operators of the tidyverse.\nsf directly uses the GDAL, GEOS and PROJ libraries.\n\n\n\n\n\nFrom r-spatial.org\n\n\n\n\n\n\nWebsite of package sf : Simple Features for R\n\n\n\nMany of the spatial data available on the internet are in shapefile format, which can be opened in the following way\n\nlibrary(sf)\n\nLinking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE\n\ndistrict <- st_read(\"data_cambodia/district.shp\")\n\nReading layer `district' from data source \n `C:\Users\UNiK\Documents\R_works\IRD\Rspatial\rspatial-for-onehealth\data_cambodia\district.shp' \n using driver `ESRI Shapefile'\nSimple feature collection with 197 features and 10 fields\nGeometry type: MULTIPOLYGON\nDimension: XY\nBounding box: xmin: 211534.7 ymin: 1149105 xmax: 784612.1 ymax: 1625495\nProjected CRS: WGS 84 / UTM zone 48N\n\n\n\n\n\n\n\n\nShapefile format limitations\n\n\n\nFor the multiple limitations of this format (multi-file, limited number of records…) we advise you to prefer another format such as the geopackage *.gpkg. 
All the good reasons not to use the shapefile are here.\n\n\nA geopackage is a database, to load a layer, you must know its name\n\nst_layers(\"data_cambodia/cambodia.gpkg\")\n\nDriver: GPKG \nAvailable layers:\n layer_name geometry_type features fields crs_name\n1 country Multi Polygon 1 10 WGS 84 / UTM zone 48N\n2 district Multi Polygon 197 10 WGS 84 / UTM zone 48N\n3 education Multi Polygon 25 19 WGS 84 / UTM zone 48N\n4 hospital Point 956 13 WGS 84 / UTM zone 48N\n5 cases Multi Point 972 2 WGS 84 / UTM zone 48N\n6 road Multi Line String 6 9 WGS 84 / UTM zone 48N\n\n\n\nroad <- st_read(\"data_cambodia/cambodia.gpkg\", layer = \"road\")\n\nReading layer `road' from data source \n `C:\\Users\\UNiK\\Documents\\R_works\\IRD\\Rspatial\\rspatial-for-onehealth\\data_cambodia\\cambodia.gpkg' \n using driver `GPKG'\nSimple feature collection with 6 features and 9 fields\nGeometry type: MULTILINESTRING\nDimension: XY\nBounding box: xmin: 212377 ymin: 1152214 xmax: 784654.7 ymax: 1625281\nProjected CRS: WGS 84 / UTM zone 48N\n\n\n\n1.2.1 Format of spatial objects sf\n\n\n\n\n\nObjectssf are objects in data.frame which one of the columns contains geometries. This column is the class of sfc (simple feature column) and each individual of the column is a sfg (simple feature geometry). This format is very practical insofa as the data and the geometries are intrinsically linked in the same object.\n\n\n\n\n\n\nThumbnail describing the simple feature format: Simple Features for R\n\n\n\n\n\n\n\n\n\nTip\n\n\n\nA benchmark of vector processing libraries is available here." }, { "objectID": "01-introduction.html#package-mapsf", @@ -235,7 +235,7 @@ "href": "07-basic_statistics.html", "title": "7 Basic statistics for spatial analysis", "section": "", - "text": "This section aims at providing some basic statistical tools to study the spatial distribution of epidemiological data. 
If you wish to go further into these analysis and their limitations you can consult the tutorial “Spatial Epidemiology” from M. Kramer from which the statistical analysis of his section were adapted." + "text": "This section aims at providing some basic statistical tools to study the spatial distribution of epidemiological data. If you wish to go further into spatial statistics applied to epidemiology and their limitations you can consult the tutorial “Spatial Epidemiology” from M. Kramer from which the statistical analysis of this section was adapted. We will use" }, { "objectID": "07-basic_statistics.html#import-and-visualize-epidemiological-data",