diff --git a/07-basic_statistics.qmd b/07-basic_statistics.qmd index a515b24f227e430c6a930023720bfbad9cc4a2fa..9b729dbe677327d2f6b1900d03dda965c033f4bf 100644 --- a/07-basic_statistics.qmd +++ b/07-basic_statistics.qmd @@ -42,7 +42,7 @@ mf_map(x = cases, lwd = .5, col = "#990000", pch = 20, add = TRUE) ``` -In epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, ...) or a person at risk (e.g. a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggreagation since they make available data on population count and structures. In this study, we will use district as the areal unit of the study. +In epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, ...) or a person at risk (e.g. a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use district as the areal unit of the study. ```{r district_aggregate, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE} # Aggregate cases over districts @@ -89,7 +89,6 @@ mf_map(x = district, mf_layout(title = "Incidence of W Fever") # Plot SIRs - # create breaks and associated color palette break_SIR = c(0, exp(mf_get_breaks(log(district$SIR), nbreaks = 8, breaks = "pretty"))) col_pal = c("#273871", "#3267AD", "#6496C8", "#9BBFDD", "#CDE3F0", "#FFCEBC", "#FF967E", "#F64D41", "#B90E36") @@ -112,35 +111,51 @@ These maps illustrates the spatial heterogenity of the cases. The incidence show - average risk (SIR \~ 1) when standardized for population -In this example, we standardized the cases distribution for population count. This simple standardization assume that the risk of contracting the disease is similar for each person. Howerver, this case does not apply for all diseases and for all observed events (e.g. the number of childhood illness and death outcomes in a district are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don't want to analyze (e.g. sex ratio, occupations, age pyramid). +In this example, we standardized the cases distribution for population count. This simple standardization assume that the risk of contracting the disease is similar for each person. However, assumption does not hold for all diseases and for all observed events since confounding effects can create nuisance into the interpretations (e.g. the number of childhood illness and death outcomes in a district are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don't want to analyze (e.g. sex ratio, occupations, age pyramid). ## Cluster analysis -Since this W fever seems to have a heterogenous distribution across Cambodia, it would be interesting to study where excess of cases appears, i.e. to identify clusters of the disease. +Since this W fever seems to have a heterogeneous distribution across Cambodia, it would be interesting to study where excess of cases appears, i.e. to identify clusters of the disease. The first question is to wonder if data are auto correlated or spatially independent, i.e. study if neighboring districts are likely to have similar incidence. In statistics, problems are usually expressed by defining two hypothesis : the null hypothesis (H0), i.e. an a priori hypothesis of the studied phenomenon (e.g. the situation is a random) and the alternative hypothesis (HA), e.g. the situation is not random. The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis. ### Spatial autocorrelation (Moran's I test) -A popular test for spatial autocorrelation is the Moran's test. +A popular test for spatial autocorrelation is the Moran's test. This test tells us whether nearby units tend to exhibit similar incidences. It ranges from -1 to +1. A value of -1 denote that units with low rates are located near other units with high rates, while a Moran's I value of +1 indicates a concentration of spatial units exhibiting similar rates. + +Here the statistics hypothesis are : -Moran's I test tells us whether nearby units tend to exhibit similar rates. It ranges from -1 to +1, whith a value of -1 denoting that units with low rates are located near other units with high rates, while a Moran's I value of +1 indicates a concentration of spatial units exhibiting similar rates. +- H0 : + +- H1: , i.e. Moran's I value is different than 0. We will compute the Moran's statistics using `spdep` and `Dcluster` packages. `spdep` package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. `Dcluster` package provides a set of functions for the detection of spatial clusters of disease using count data. ```{r MoransI, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE} -# Plot the incidence histogramm -hist(log(district$incidence)) - +library(spdep) # Functions for creating spatial weight, spatial analysis +library(DCluster) # Package with functions for spatial cluster analysis) +qnb <- poly2nb(district) +q_listw <- nb2listw(qnb, style = 'W') # row-standardized weights +# Moran's I test +moranI.test(cases ~ offset(log(expected)), + data = district, + model = 'poisson', + R = 499, + listw = q_listw, + n = 159, + S0 = Szero(q_listw)) ``` -## Cluster analysis +### Spatial scan statistics + +While Moran's indice focuses on finding correlation between neighboring polygons, the spatial scan statistic compare the incidence level of a given windows of observation with the incidence level outside of this windows. + +The package `SpatialEpi` -In epidemiology, the definition of a cluster ### Population-based clusters (kulldorf statistic) diff --git a/public/07-basic_statistics.html b/public/07-basic_statistics.html index 310c59914b74b3300399429f23a412135b7a1fc0..758307f0fdc489e6697de1d964a2f2a0a27b3cc8 100644 --- a/public/07-basic_statistics.html +++ b/public/07-basic_statistics.html @@ -216,15 +216,13 @@ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warni <ul> <li><a href="#import-and-visualize-epidemiological-data" id="toc-import-and-visualize-epidemiological-data" class="nav-link active" data-scroll-target="#import-and-visualize-epidemiological-data"><span class="toc-section-number">7.1</span> Import and visualize epidemiological data</a></li> - <li><a href="#basics-statistics" id="toc-basics-statistics" class="nav-link" data-scroll-target="#basics-statistics"><span class="toc-section-number">7.2</span> Basics statistics</a> + <li><a href="#cluster-analysis" id="toc-cluster-analysis" class="nav-link" data-scroll-target="#cluster-analysis"><span class="toc-section-number">7.2</span> Cluster analysis</a> <ul class="collapse"> <li><a href="#spatial-autocorrelation-morans-i-test" id="toc-spatial-autocorrelation-morans-i-test" class="nav-link" data-scroll-target="#spatial-autocorrelation-morans-i-test"><span class="toc-section-number">7.2.1</span> Spatial autocorrelation (Moran’s I test)</a></li> - </ul></li> - <li><a href="#cluster-analysis" id="toc-cluster-analysis" class="nav-link" data-scroll-target="#cluster-analysis"><span class="toc-section-number">7.3</span> Cluster analysis</a> - <ul class="collapse"> - <li><a href="#population-based-clusters-kulldorf-statistic" id="toc-population-based-clusters-kulldorf-statistic" class="nav-link" data-scroll-target="#population-based-clusters-kulldorf-statistic"><span class="toc-section-number">7.3.1</span> Population-based clusters (kulldorf statistic)</a></li> - <li><a href="#expectation-based-cluster" id="toc-expectation-based-cluster" class="nav-link" data-scroll-target="#expectation-based-cluster"><span class="toc-section-number">7.3.2</span> Expectation-based cluster</a></li> - <li><a href="#to-go-further" id="toc-to-go-further" class="nav-link" data-scroll-target="#to-go-further"><span class="toc-section-number">7.3.3</span> To go further …</a></li> + <li><a href="#spatial-scan-statistics" id="toc-spatial-scan-statistics" class="nav-link" data-scroll-target="#spatial-scan-statistics"><span class="toc-section-number">7.2.2</span> Spatial scan statistics</a></li> + <li><a href="#population-based-clusters-kulldorf-statistic" id="toc-population-based-clusters-kulldorf-statistic" class="nav-link" data-scroll-target="#population-based-clusters-kulldorf-statistic"><span class="toc-section-number">7.2.3</span> Population-based clusters (kulldorf statistic)</a></li> + <li><a href="#expectation-based-cluster" id="toc-expectation-based-cluster" class="nav-link" data-scroll-target="#expectation-based-cluster"><span class="toc-section-number">7.2.4</span> Expectation-based cluster</a></li> + <li><a href="#to-go-further" id="toc-to-go-further" class="nav-link" data-scroll-target="#to-go-further"><span class="toc-section-number">7.2.5</span> To go further …</a></li> </ul></li> </ul> </nav> @@ -294,7 +292,7 @@ Projected CRS: WGS 84 / UTM zone 48N <p><img src="07-basic_statistics_files/figure-html/cases_visualization-1.png" class="img-fluid" width="768"></p> </div> </div> -<p>In epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, …) or a person at risk (e.g. a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggreagation since they make available data on population count and structures. In this study, we will use district as the areal unit of the study.</p> +<p>In epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, …) or a person at risk (e.g. a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use district as the areal unit of the study.</p> <div class="cell" data-nm="true"> <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Aggregate cases over districts</span></span> <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>district<span class="sc">$</span>cases <span class="ot"><-</span> <span class="fu">lengths</span>(<span class="fu">st_intersects</span>(district, cases))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> @@ -335,19 +333,18 @@ Projected CRS: WGS 84 / UTM zone 48N <span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_layout</span>(<span class="at">title =</span> <span class="st">"Incidence of W Fever"</span>)</span> <span id="cb7-20"><a href="#cb7-20" aria-hidden="true" tabindex="-1"></a></span> <span id="cb7-21"><a href="#cb7-21" aria-hidden="true" tabindex="-1"></a><span class="co"># Plot SIRs</span></span> -<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb7-23"><a href="#cb7-23" aria-hidden="true" tabindex="-1"></a><span class="co"># create breaks and associated color palette</span></span> -<span id="cb7-24"><a href="#cb7-24" aria-hidden="true" tabindex="-1"></a>break_SIR <span class="ot">=</span> <span class="fu">c</span>(<span class="dv">0</span>, <span class="fu">exp</span>(<span class="fu">mf_get_breaks</span>(<span class="fu">log</span>(district<span class="sc">$</span>SIR), <span class="at">nbreaks =</span> <span class="dv">8</span>, <span class="at">breaks =</span> <span class="st">"pretty"</span>)))</span> -<span id="cb7-25"><a href="#cb7-25" aria-hidden="true" tabindex="-1"></a>col_pal <span class="ot">=</span> <span class="fu">c</span>(<span class="st">"#273871"</span>, <span class="st">"#3267AD"</span>, <span class="st">"#6496C8"</span>, <span class="st">"#9BBFDD"</span>, <span class="st">"#CDE3F0"</span>, <span class="st">"#FFCEBC"</span>, <span class="st">"#FF967E"</span>, <span class="st">"#F64D41"</span>, <span class="st">"#B90E36"</span>)</span> -<span id="cb7-26"><a href="#cb7-26" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb7-27"><a href="#cb7-27" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_map</span>(<span class="at">x =</span> district,</span> -<span id="cb7-28"><a href="#cb7-28" aria-hidden="true" tabindex="-1"></a> <span class="at">var =</span> <span class="st">"SIR"</span>,</span> -<span id="cb7-29"><a href="#cb7-29" aria-hidden="true" tabindex="-1"></a> <span class="at">type =</span> <span class="st">"choro"</span>,</span> -<span id="cb7-30"><a href="#cb7-30" aria-hidden="true" tabindex="-1"></a> <span class="at">breaks =</span> break_SIR, </span> -<span id="cb7-31"><a href="#cb7-31" aria-hidden="true" tabindex="-1"></a> <span class="at">pal =</span> col_pal, </span> -<span id="cb7-32"><a href="#cb7-32" aria-hidden="true" tabindex="-1"></a> <span class="at">cex =</span> <span class="dv">2</span>,</span> -<span id="cb7-33"><a href="#cb7-33" aria-hidden="true" tabindex="-1"></a> <span class="at">leg_title =</span> <span class="st">"SIR"</span>)</span> -<span id="cb7-34"><a href="#cb7-34" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_layout</span>(<span class="at">title =</span> <span class="st">"Standardized Incidence Ratio of W Fever in Cambodia"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> +<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a><span class="co"># create breaks and associated color palette</span></span> +<span id="cb7-23"><a href="#cb7-23" aria-hidden="true" tabindex="-1"></a>break_SIR <span class="ot">=</span> <span class="fu">c</span>(<span class="dv">0</span>, <span class="fu">exp</span>(<span class="fu">mf_get_breaks</span>(<span class="fu">log</span>(district<span class="sc">$</span>SIR), <span class="at">nbreaks =</span> <span class="dv">8</span>, <span class="at">breaks =</span> <span class="st">"pretty"</span>)))</span> +<span id="cb7-24"><a href="#cb7-24" aria-hidden="true" tabindex="-1"></a>col_pal <span class="ot">=</span> <span class="fu">c</span>(<span class="st">"#273871"</span>, <span class="st">"#3267AD"</span>, <span class="st">"#6496C8"</span>, <span class="st">"#9BBFDD"</span>, <span class="st">"#CDE3F0"</span>, <span class="st">"#FFCEBC"</span>, <span class="st">"#FF967E"</span>, <span class="st">"#F64D41"</span>, <span class="st">"#B90E36"</span>)</span> +<span id="cb7-25"><a href="#cb7-25" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb7-26"><a href="#cb7-26" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_map</span>(<span class="at">x =</span> district,</span> +<span id="cb7-27"><a href="#cb7-27" aria-hidden="true" tabindex="-1"></a> <span class="at">var =</span> <span class="st">"SIR"</span>,</span> +<span id="cb7-28"><a href="#cb7-28" aria-hidden="true" tabindex="-1"></a> <span class="at">type =</span> <span class="st">"choro"</span>,</span> +<span id="cb7-29"><a href="#cb7-29" aria-hidden="true" tabindex="-1"></a> <span class="at">breaks =</span> break_SIR, </span> +<span id="cb7-30"><a href="#cb7-30" aria-hidden="true" tabindex="-1"></a> <span class="at">pal =</span> col_pal, </span> +<span id="cb7-31"><a href="#cb7-31" aria-hidden="true" tabindex="-1"></a> <span class="at">cex =</span> <span class="dv">2</span>,</span> +<span id="cb7-32"><a href="#cb7-32" aria-hidden="true" tabindex="-1"></a> <span class="at">leg_title =</span> <span class="st">"SIR"</span>)</span> +<span id="cb7-33"><a href="#cb7-33" aria-hidden="true" tabindex="-1"></a><span class="fu">mf_layout</span>(<span class="at">title =</span> <span class="st">"Standardized Incidence Ratio of W Fever"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> <div class="cell-output-display"> <p><img src="07-basic_statistics_files/figure-html/inc_visualization-1.png" class="img-fluid" width="768"></p> </div> @@ -358,39 +355,62 @@ Projected CRS: WGS 84 / UTM zone 48N <li><p>lower risk than average (SIR < 1) when standardized for population</p></li> <li><p>average risk (SIR ~ 1) when standardized for population</p></li> </ul> -<p>In this example, we standardized the cases distribution for population count. This simple standardization assume that the risk of contracting the disease is similar for each person. Howerver, this case does not apply for all disease and for all observed events (e.g. the number of childhood illness and death outcomes are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don’t want to analyze (e.g. sex ratio, occupations, age pyramid).</p> +<p>In this example, we standardized the cases distribution for population count. This simple standardization assume that the risk of contracting the disease is similar for each person. However, assumption does not hold for all diseases and for all observed events since confounding effects can create nuisance into the interpretations (e.g. the number of childhood illness and death outcomes in a district are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don’t want to analyze (e.g. sex ratio, occupations, age pyramid).</p> </section> -<section id="basics-statistics" class="level2" data-number="7.2"> -<h2 data-number="7.2" class="anchored" data-anchor-id="basics-statistics"><span class="header-section-number">7.2</span> Basics statistics</h2> -<p>The problem is usually expressed by defining two hypothesis : the null hypothesis (H0), i.e. an a priori hypothesis of the studied phenomenon (e.g. the situation is a random) and the alternative hypothesis (HA), e.g. the situation is not random. The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis.</p> -<p>The statistical analysis performed relies on the type of data.</p> +<section id="cluster-analysis" class="level2" data-number="7.2"> +<h2 data-number="7.2" class="anchored" data-anchor-id="cluster-analysis"><span class="header-section-number">7.2</span> Cluster analysis</h2> +<p>Since this W fever seems to have a heterogeneous distribution across Cambodia, it would be interesting to study where excess of cases appears, i.e. to identify clusters of the disease. The first question is to wonder if data are auto correlated or spatially independent, i.e. study if neighboring districts are likely to have similar incidence.</p> +<p>In statistics, problems are usually expressed by defining two hypothesis : the null hypothesis (H0), i.e. an a priori hypothesis of the studied phenomenon (e.g. the situation is a random) and the alternative hypothesis (HA), e.g. the situation is not random. The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis.</p> <section id="spatial-autocorrelation-morans-i-test" class="level3" data-number="7.2.1"> <h3 data-number="7.2.1" class="anchored" data-anchor-id="spatial-autocorrelation-morans-i-test"><span class="header-section-number">7.2.1</span> Spatial autocorrelation (Moran’s I test)</h3> -<p>A popular test for spatial autocorrelation is the Moran’s test.</p> -<p>Moran’s I test tells us whether nearby units tend to exhibit similar rates. It ranges from -1 to +1, whith a value of -1 denoting that units with low rates are located near other units with high rates, while a Moran’s I value of +1 indicates a concentration of spatial units exhibiting similar rates.</p> +<p>A popular test for spatial autocorrelation is the Moran’s test. This test tells us whether nearby units tend to exhibit similar incidences. It ranges from -1 to +1. A value of -1 denote that units with low rates are located near other units with high rates, while a Moran’s I value of +1 indicates a concentration of spatial units exhibiting similar rates.</p> +<p>Here the statistics hypothesis are :</p> +<ul> +<li><p>H0 :</p></li> +<li><p>H1: , i.e. Moran’s I value is different than 0.</p></li> +</ul> <p>We will compute the Moran’s statistics using <code>spdep</code> and <code>Dcluster</code> packages. <code>spdep</code> package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. <code>Dcluster</code> package provides a set of functions for the detection of spatial clusters of disease using count data.</p> <div class="cell" data-nm="true"> -<div class="sourceCode cell-code" id="cb8"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Plot the incidence histogramm</span></span> -<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="fu">hist</span>(<span class="fu">log</span>(district<span class="sc">$</span>incidence))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> -<div class="cell-output-display"> -<p><img src="07-basic_statistics_files/figure-html/MoransI-1.png" class="img-fluid" width="768"></p> +<div class="sourceCode cell-code" id="cb8"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(spdep) <span class="co"># Functions for creating spatial weight, spatial analysis</span></span> +<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(DCluster) <span class="co"># Package with functions for spatial cluster analysis)</span></span> +<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a>qnb <span class="ot"><-</span> <span class="fu">poly2nb</span>(district)</span> +<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a>q_listw <span class="ot"><-</span> <span class="fu">nb2listw</span>(qnb, <span class="at">style =</span> <span class="st">'W'</span>) <span class="co"># row-standardized weights</span></span> +<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a><span class="co"># Moran's I test</span></span> +<span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a><span class="fu">moranI.test</span>(cases <span class="sc">~</span> <span class="fu">offset</span>(<span class="fu">log</span>(expected)), </span> +<span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a> <span class="at">data =</span> district,</span> +<span id="cb8-10"><a href="#cb8-10" aria-hidden="true" tabindex="-1"></a> <span class="at">model =</span> <span class="st">'poisson'</span>,</span> +<span id="cb8-11"><a href="#cb8-11" aria-hidden="true" tabindex="-1"></a> <span class="at">R =</span> <span class="dv">499</span>,</span> +<span id="cb8-12"><a href="#cb8-12" aria-hidden="true" tabindex="-1"></a> <span class="at">listw =</span> q_listw,</span> +<span id="cb8-13"><a href="#cb8-13" aria-hidden="true" tabindex="-1"></a> <span class="at">n =</span> <span class="dv">159</span>,</span> +<span id="cb8-14"><a href="#cb8-14" aria-hidden="true" tabindex="-1"></a> <span class="at">S0 =</span> <span class="fu">Szero</span>(q_listw))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> +<div class="cell-output cell-output-stdout"> +<pre class="code-out"><code>Moran's I test of spatial autocorrelation + + Type of boots.: parametric + Model used when sampling: Poisson + Number of simulations: 499 + Statistic: 0.1264291 + p-value : 0.016 </code></pre> </div> </div> </section> +<section id="spatial-scan-statistics" class="level3" data-number="7.2.2"> +<h3 data-number="7.2.2" class="anchored" data-anchor-id="spatial-scan-statistics"><span class="header-section-number">7.2.2</span> Spatial scan statistics</h3> +<p>While Moran’s indice focuses on finding correlation between neighboring polygons, the spatial scan statistic compare the incidence level of a given windows of observation with the incidence level outside of this windows.</p> +<p>The package <code>SpatialEpi</code></p> </section> -<section id="cluster-analysis" class="level2" data-number="7.3"> -<h2 data-number="7.3" class="anchored" data-anchor-id="cluster-analysis"><span class="header-section-number">7.3</span> Cluster analysis</h2> -<p>In epidemiology, the definition of a cluster</p> -<section id="population-based-clusters-kulldorf-statistic" class="level3" data-number="7.3.1"> -<h3 data-number="7.3.1" class="anchored" data-anchor-id="population-based-clusters-kulldorf-statistic"><span class="header-section-number">7.3.1</span> Population-based clusters (kulldorf statistic)</h3> +<section id="population-based-clusters-kulldorf-statistic" class="level3" data-number="7.2.3"> +<h3 data-number="7.2.3" class="anchored" data-anchor-id="population-based-clusters-kulldorf-statistic"><span class="header-section-number">7.2.3</span> Population-based clusters (kulldorf statistic)</h3> <p>Kulldorff ’s spatial scan statistic identifies the most likely disease clusters maximizing the likelihood that disease cases are located within a set of concentric circles that are moved across the study area.</p> </section> -<section id="expectation-based-cluster" class="level3" data-number="7.3.2"> -<h3 data-number="7.3.2" class="anchored" data-anchor-id="expectation-based-cluster"><span class="header-section-number">7.3.2</span> Expectation-based cluster</h3> +<section id="expectation-based-cluster" class="level3" data-number="7.2.4"> +<h3 data-number="7.2.4" class="anchored" data-anchor-id="expectation-based-cluster"><span class="header-section-number">7.2.4</span> Expectation-based cluster</h3> <p>In many case, population is not specific enough to</p> </section> -<section id="to-go-further" class="level3" data-number="7.3.3"> -<h3 data-number="7.3.3" class="anchored" data-anchor-id="to-go-further"><span class="header-section-number">7.3.3</span> To go further …</h3> +<section id="to-go-further" class="level3" data-number="7.2.5"> +<h3 data-number="7.2.5" class="anchored" data-anchor-id="to-go-further"><span class="header-section-number">7.2.5</span> To go further …</h3> </section> diff --git a/public/07-basic_statistics_files/figure-html/inc_visualization-1.png b/public/07-basic_statistics_files/figure-html/inc_visualization-1.png index 573c23b0e3e6d18f1b59f99d3f20f6d3aeab929c..1a2440ffe6b9c03562d1a847f0bd173ed1be0dcd 100644 Binary files a/public/07-basic_statistics_files/figure-html/inc_visualization-1.png and b/public/07-basic_statistics_files/figure-html/inc_visualization-1.png differ diff --git a/public/search.json b/public/search.json index 0dcb34e30017ada10ce172eefec1e54c53533dd1..d95d9282deec95f2613cb9c11c1e41e9da5a33f5 100644 --- a/public/search.json +++ b/public/search.json @@ -220,8 +220,8 @@ "objectID": "07-basic_statistics.html#cluster-analysis", "href": "07-basic_statistics.html#cluster-analysis", "title": "7 Basic statistics for spatial analysis", - "section": "7.3 Cluster analysis", - "text": "7.3 Cluster analysis\nIn epidemiology, the definition of a cluster\n\n7.3.1 Population-based clusters (kulldorf statistic)\nKulldorff ’s spatial scan statistic identifies the most likely disease clusters maximizing the likelihood that disease cases are located within a set of concentric circles that are moved across the study area.\n\n\n7.3.2 Expectation-based cluster\nIn many case, population is not specific enough to\n\n\n7.3.3 To go further …" + "section": "7.2 Cluster analysis", + "text": "7.2 Cluster analysis\nSince this W fever seems to have a heterogeneous distribution across Cambodia, it would be interesting to study where excess of cases appears, i.e. to identify clusters of the disease. The first question is to wonder if data are auto correlated or spatially independent, i.e. study if neighboring districts are likely to have similar incidence.\nIn statistics, problems are usually expressed by defining two hypothesis : the null hypothesis (H0), i.e. an a priori hypothesis of the studied phenomenon (e.g. the situation is a random) and the alternative hypothesis (HA), e.g. the situation is not random. The main principle is to measure how likely the observed situation belong to the ensemble of situation that are possible under the H0 hypothesis.\n\n7.2.1 Spatial autocorrelation (Moran’s I test)\nA popular test for spatial autocorrelation is the Moran’s test. This test tells us whether nearby units tend to exhibit similar incidences. It ranges from -1 to +1. A value of -1 denote that units with low rates are located near other units with high rates, while a Moran’s I value of +1 indicates a concentration of spatial units exhibiting similar rates.\nHere the statistics hypothesis are :\n\nH0 :\nH1: , i.e. Moran’s I value is different than 0.\n\nWe will compute the Moran’s statistics using spdep and Dcluster packages. spdep package provides a collection of functions to analyze spatial correlations of polygons and works with sp objects. Dcluster package provides a set of functions for the detection of spatial clusters of disease using count data.\n\nlibrary(spdep) # Functions for creating spatial weight, spatial analysis\nlibrary(DCluster) # Package with functions for spatial cluster analysis)\n\nqnb <- poly2nb(district)\nq_listw <- nb2listw(qnb, style = 'W') # row-standardized weights\n\n# Moran's I test\nmoranI.test(cases ~ offset(log(expected)), \n data = district,\n model = 'poisson',\n R = 499,\n listw = q_listw,\n n = 159,\n S0 = Szero(q_listw))\n\nMoran's I test of spatial autocorrelation \n\n Type of boots.: parametric \n Model used when sampling: Poisson \n Number of simulations: 499 \n Statistic: 0.1264291 \n p-value : 0.016 \n\n\n\n\n7.2.2 Spatial scan statistics\nWhile Moran’s indice focuses on finding correlation between neighboring polygons, the spatial scan statistic compare the incidence level of a given windows of observation with the incidence level outside of this windows.\nThe package SpatialEpi\n\n\n7.2.3 Population-based clusters (kulldorf statistic)\nKulldorff ’s spatial scan statistic identifies the most likely disease clusters maximizing the likelihood that disease cases are located within a set of concentric circles that are moved across the study area.\n\n\n7.2.4 Expectation-based cluster\nIn many case, population is not specific enough to\n\n\n7.2.5 To go further …" }, { "objectID": "07-basic_statistics.html", @@ -242,6 +242,6 @@ "href": "07-basic_statistics.html#import-and-visualize-epidemiological-data", "title": "7 Basic statistics for spatial analysis", "section": "7.1 Import and visualize epidemiological data", - "text": "7.1 Import and visualize epidemiological data\nIn this section, we load data that reference the cases of an imaginary disease throughout Cambodia. Each point correspond to the geolocalisation of a case.\n\nlibrary(sf)\n\n#Import Cambodia country border\ncountry = st_read(\"data_cambodia/cambodia.gpkg\", layer = \"country\", quiet = TRUE)\n#Import provincial administrative border of Cambodia\neducation = st_read(\"data_cambodia/cambodia.gpkg\", layer = \"education\", quiet = TRUE)\n#Import district administrative border of Cambodia\ndistrict = st_read(\"data_cambodia/cambodia.gpkg\", layer = \"district\", quiet = TRUE)\n\n# Import locations of cases from an imaginary disease\ncases = st_read(\"data_cambodia/cambodia.gpkg\", layer = \"cases\", quiet = TRUE)\ncases = subset(cases, Disease == \"W fever\")\n\nThe first step of any statistical analysis always consists on visualizing the data to check they were correctly loaded and to observe general pattern of the cases.\n\n# View the cases object\nhead(cases)\n\nSimple feature collection with 6 features and 2 fields\nGeometry type: MULTIPOINT\nDimension: XY\nBounding box: xmin: 255891 ymin: 1179092 xmax: 506647.4 ymax: 1467441\nProjected CRS: WGS 84 / UTM zone 48N\n id Disease geom\n1 0 W fever MULTIPOINT ((280036.2 12841...\n2 1 W fever MULTIPOINT ((451859.5 11790...\n3 2 W fever MULTIPOINT ((255891 1467441))\n4 5 W fever MULTIPOINT ((506647.4 12322...\n5 6 W fever MULTIPOINT ((440668 1197958))\n6 7 W fever MULTIPOINT ((481594.5 12714...\n\n# Map the cases\nlibrary(mapsf)\n\nmf_map(x = district, border = \"white\")\nmf_map(x = country,lwd = 2, col = NA, add = TRUE)\nmf_map(x = cases, lwd = .5, col = \"#990000\", pch = 20, add = TRUE)\n\n\n\n\nIn epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, …) or a person at risk (e.g. a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggreagation since they make available data on population count and structures. In this study, we will use district as the areal unit of the study.\n\n# Aggregate cases over districts\ndistrict$cases <- lengths(st_intersects(district, cases))\n\nThe incidence (\\(\\frac{cases}{population}\\)) is commonly use to represent cases distribution related to population density but other indicators exists. As example, the standardized incidence ratios (SIRs) represents the deviation of observed and expected number of cases and is expressed as \\(SIR = \\frac{Y_i}{E_i}\\) with \\(Y_i\\), the observed number of cases and \\(E_i\\), the expected number of cases. In this study, we computed the expected number of cases in each district by assuming infections are homogeneously distributed across Cambodia, i.e. the incidence is the same in each district.\n\n# Compute incidence in each district (per 100 000 population)\ndistrict$incidence = district$cases/district$T_POP * 100000\n\n# Compute the global risk\nrate = sum(district$cases)/sum(district$T_POP)\n\n# Compute expected number of cases \ndistrict$expected = district$T_POP * rate\n\n# Compute SIR\ndistrict$SIR = district$cases / district$expected\n\n\npar(mfrow = c(1, 3))\n# Plot number of cases using proportional symbol \nmf_map(x = district) \nmf_map(\n x = district, \n var = \"cases\",\n val_max = 50,\n type = \"prop\",\n col = \"#990000\", \n leg_title = \"Cases\")\nmf_layout(title = \"Number of cases of W Fever\")\n\n# Plot incidence \nmf_map(x = district,\n var = \"incidence\",\n type = \"choro\",\n pal = \"Reds 3\",\n leg_title = \"Incidence \\n(per 100 000)\")\nmf_layout(title = \"Incidence of W Fever\")\n\n# Plot SIRs\n\n# create breaks and associated color palette\nbreak_SIR = c(0, exp(mf_get_breaks(log(district$SIR), nbreaks = 8, breaks = \"pretty\")))\ncol_pal = c(\"#273871\", \"#3267AD\", \"#6496C8\", \"#9BBFDD\", \"#CDE3F0\", \"#FFCEBC\", \"#FF967E\", \"#F64D41\", \"#B90E36\")\n\nmf_map(x = district,\n var = \"SIR\",\n type = \"choro\",\n breaks = break_SIR, \n pal = col_pal, \n cex = 2,\n leg_title = \"SIR\")\nmf_layout(title = \"Standardized Incidence Ratio of W Fever in Cambodia\")\n\n\n\n\nThese maps illustrates the spatial heterogenity of the cases. The incidence shows how the disease vary from one district to another while the SIR highlight districts that have :\n\nhigher risk than average (SIR > 1) when standardized for population\nlower risk than average (SIR < 1) when standardized for population\naverage risk (SIR ~ 1) when standardized for population\n\nIn this example, we standardized the cases distribution for population count. This simple standardization assume that the risk of contracting the disease is similar for each person. Howerver, this case does not apply for all disease and for all observed events (e.g. the number of childhood illness and death outcomes are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don’t want to analyze (e.g. sex ratio, occupations, age pyramid)." + "text": "7.1 Import and visualize epidemiological data\nIn this section, we load data that reference the cases of an imaginary disease throughout Cambodia. Each point correspond to the geolocalisation of a case.\n\nlibrary(sf)\n\n#Import Cambodia country border\ncountry = st_read(\"data_cambodia/cambodia.gpkg\", layer = \"country\", quiet = TRUE)\n#Import provincial administrative border of Cambodia\neducation = st_read(\"data_cambodia/cambodia.gpkg\", layer = \"education\", quiet = TRUE)\n#Import district administrative border of Cambodia\ndistrict = st_read(\"data_cambodia/cambodia.gpkg\", layer = \"district\", quiet = TRUE)\n\n# Import locations of cases from an imaginary disease\ncases = st_read(\"data_cambodia/cambodia.gpkg\", layer = \"cases\", quiet = TRUE)\ncases = subset(cases, Disease == \"W fever\")\n\nThe first step of any statistical analysis always consists on visualizing the data to check they were correctly loaded and to observe general pattern of the cases.\n\n# View the cases object\nhead(cases)\n\nSimple feature collection with 6 features and 2 fields\nGeometry type: MULTIPOINT\nDimension: XY\nBounding box: xmin: 255891 ymin: 1179092 xmax: 506647.4 ymax: 1467441\nProjected CRS: WGS 84 / UTM zone 48N\n id Disease geom\n1 0 W fever MULTIPOINT ((280036.2 12841...\n2 1 W fever MULTIPOINT ((451859.5 11790...\n3 2 W fever MULTIPOINT ((255891 1467441))\n4 5 W fever MULTIPOINT ((506647.4 12322...\n5 6 W fever MULTIPOINT ((440668 1197958))\n6 7 W fever MULTIPOINT ((481594.5 12714...\n\n# Map the cases\nlibrary(mapsf)\n\nmf_map(x = district, border = \"white\")\nmf_map(x = country,lwd = 2, col = NA, add = TRUE)\nmf_map(x = cases, lwd = .5, col = \"#990000\", pch = 20, add = TRUE)\n\n\n\n\nIn epidemiology, the true meaning of point is very questionable. If it usually gives the location of an observation, its not clear if this observation represents an event of interest (e.g. illness, death, …) or a person at risk (e.g. a participant that may or may not experience the disease). Considering a ratio of event compared to a population at risk is often more informative than just considering cases. Administrative divisions of countries appears as great areal units for cases aggregation since they make available data on population count and structures. In this study, we will use district as the areal unit of the study.\n\n# Aggregate cases over districts\ndistrict$cases <- lengths(st_intersects(district, cases))\n\nThe incidence (\\(\\frac{cases}{population}\\)) is commonly use to represent cases distribution related to population density but other indicators exists. As example, the standardized incidence ratios (SIRs) represents the deviation of observed and expected number of cases and is expressed as \\(SIR = \\frac{Y_i}{E_i}\\) with \\(Y_i\\), the observed number of cases and \\(E_i\\), the expected number of cases. In this study, we computed the expected number of cases in each district by assuming infections are homogeneously distributed across Cambodia, i.e. the incidence is the same in each district.\n\n# Compute incidence in each district (per 100 000 population)\ndistrict$incidence = district$cases/district$T_POP * 100000\n\n# Compute the global risk\nrate = sum(district$cases)/sum(district$T_POP)\n\n# Compute expected number of cases \ndistrict$expected = district$T_POP * rate\n\n# Compute SIR\ndistrict$SIR = district$cases / district$expected\n\n\npar(mfrow = c(1, 3))\n# Plot number of cases using proportional symbol \nmf_map(x = district) \nmf_map(\n x = district, \n var = \"cases\",\n val_max = 50,\n type = \"prop\",\n col = \"#990000\", \n leg_title = \"Cases\")\nmf_layout(title = \"Number of cases of W Fever\")\n\n# Plot incidence \nmf_map(x = district,\n var = \"incidence\",\n type = \"choro\",\n pal = \"Reds 3\",\n leg_title = \"Incidence \\n(per 100 000)\")\nmf_layout(title = \"Incidence of W Fever\")\n\n# Plot SIRs\n# create breaks and associated color palette\nbreak_SIR = c(0, exp(mf_get_breaks(log(district$SIR), nbreaks = 8, breaks = \"pretty\")))\ncol_pal = c(\"#273871\", \"#3267AD\", \"#6496C8\", \"#9BBFDD\", \"#CDE3F0\", \"#FFCEBC\", \"#FF967E\", \"#F64D41\", \"#B90E36\")\n\nmf_map(x = district,\n var = \"SIR\",\n type = \"choro\",\n breaks = break_SIR, \n pal = col_pal, \n cex = 2,\n leg_title = \"SIR\")\nmf_layout(title = \"Standardized Incidence Ratio of W Fever\")\n\n\n\n\nThese maps illustrates the spatial heterogenity of the cases. The incidence shows how the disease vary from one district to another while the SIR highlight districts that have :\n\nhigher risk than average (SIR > 1) when standardized for population\nlower risk than average (SIR < 1) when standardized for population\naverage risk (SIR ~ 1) when standardized for population\n\nIn this example, we standardized the cases distribution for population count. This simple standardization assume that the risk of contracting the disease is similar for each person. However, assumption does not hold for all diseases and for all observed events since confounding effects can create nuisance into the interpretations (e.g. the number of childhood illness and death outcomes in a district are usually related to the age pyramid) and you should keep in mind that other standardization can be performed based on variables known to have an effect but that you don’t want to analyze (e.g. sex ratio, occupations, age pyramid)." } ] \ No newline at end of file