---
bibliography: references.bib
---

# Basic statistics for spatial analysis

This section introduces some basic statistical tools for studying the spatial distribution of disease cases.

## Import and visualize epidemiological data

In this section, we load data recording the cases of an imaginary disease throughout Cambodia.

```{r load_cases, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}
library(sf)

# Import Cambodia country border
country <- st_read("data_cambodia/cambodia.gpkg", layer = "country", quiet = TRUE)
# Import the education layer (provincial administrative units of Cambodia)
education <- st_read("data_cambodia/cambodia.gpkg", layer = "education", quiet = TRUE)
# Import district administrative borders of Cambodia
district <- st_read("data_cambodia/cambodia.gpkg", layer = "district", quiet = TRUE)

# Import locations of cases of an imaginary disease and keep only "W fever"
cases <- st_read("data_cambodia/cambodia.gpkg", layer = "cases", quiet = TRUE)
cases <- subset(cases, Disease == "W fever")

# Aggregate cases over districts: count the case points intersecting each district
district$cases <- lengths(st_intersects(district, cases))
```
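The aggregation step above relies on `lengths(st_intersects(...))`. The following minimal sketch shows what that idiom does on toy data. The two squares, the three points, and the names `toy_districts` and `toy_cases` are invented for illustration and are not part of the Cambodia dataset.

```{r toy_counts, eval = TRUE, echo = TRUE}
library(sf)

# Helper building a unit square whose lower-left corner is (x0, 0) (toy geometry)
unit_square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
}

# Two adjacent squares standing in for districts
toy_districts <- st_sf(name = c("A", "B"),
                       geometry = st_sfc(unit_square(0), unit_square(1)))

# Three points standing in for cases: two inside A, one inside B
toy_cases <- st_sf(id = 1:3,
                   geometry = st_sfc(st_point(c(0.2, 0.5)),
                                     st_point(c(0.7, 0.1)),
                                     st_point(c(1.5, 0.5))))

# st_intersects() returns, for each polygon, the indices of the intersecting points;
# lengths() turns that list into one count per polygon
hits <- st_intersects(toy_districts, toy_cases)
toy_districts$cases <- lengths(hits)
toy_districts$cases
```

Note that a point lying exactly on a shared border intersects both polygons and would therefore be counted twice with this approach.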

The first step of any statistical analysis is to visualize the data, both to check that they were loaded correctly and to observe the general pattern of the cases.

```{r cases_visualization, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}

# View the cases object
head(cases)

# Map the cases
library(mapsf)

mf_map(x = district, border = "white")
mf_map(x = country, lwd = 2, col = NA, add = TRUE)
mf_map(x = cases, lwd = 0.5, col = "#990000", pch = 20, add = TRUE)

```

## Basic statistics

A statistical test is usually framed by defining two hypotheses: the null hypothesis (H0), i.e. an a priori hypothesis about the studied phenomenon (e.g. the cases are randomly distributed), and the alternative hypothesis (HA), e.g. the cases are not randomly distributed. The main principle is to measure how likely it is that the observed situation belongs to the set of situations that are possible under H0.

The appropriate statistical analysis depends on the type of data.
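The H0/HA reasoning can be illustrated with a small Monte Carlo test. This is only a sketch under stated assumptions: the counts in `observed` are hypothetical, the test statistic (the largest count in any unit) is chosen for simplicity, and H0 is taken to mean that each case falls in any of the units with equal probability.

```{r h0_monte_carlo, eval = TRUE, echo = TRUE}
set.seed(42)  # make the simulations reproducible

# Hypothetical observed counts: 100 cases over 20 units, 30 of them in the first unit
observed <- c(30, 6, 2, 5, 4, 3, 7, 4, 5, 3, 4, 2, 6, 3, 4, 2, 3, 4, 1, 2)
n_units <- length(observed)
n_cases <- sum(observed)

# Test statistic: the largest count observed in a single unit
stat_obs <- max(observed)

# Situations possible under H0: every case is allocated to a unit at random
n_sim <- 999
stat_h0 <- replicate(
  n_sim,
  max(rmultinom(1, size = n_cases, prob = rep(1 / n_units, n_units)))
)

# Monte Carlo p-value: share of simulated situations at least as extreme as the observed one
p_value <- (1 + sum(stat_h0 >= stat_obs)) / (n_sim + 1)
p_value
```

A small `p_value` means the observed situation is unlikely to belong to the set of situations generated under H0, which argues for rejecting H0 in favour of HA.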

### Spatial autocorrelation (Moran's I test)

A popular test for spatial autocorrelation is Moran's I test.

Moran's I test tells us whether nearby units tend to exhibit similar rates. The statistic typically ranges from -1 to +1: a value close to -1 indicates that units with low rates are located near units with high rates, a value close to +1 indicates a concentration of spatial units exhibiting similar rates, and a value close to 0 suggests no spatial autocorrelation.
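For reference, with rates $x_i$ in $n$ units and spatial weights $w_{ij}$, Moran's I is commonly written as

$$I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}$$

The sketch below applies this formula in base R to a toy example. The function name `morans_i`, the six units arranged in a line, and the two rate vectors are assumptions made for illustration only.

```{r moran_by_hand, eval = TRUE, echo = TRUE}
# Moran's I from a vector of values x and a square matrix of spatial weights w
morans_i <- function(x, w) {
  n <- length(x)
  z <- x - mean(x)                          # deviations from the mean
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Binary weights for 6 units in a line: each unit neighbours the previous and next one
n <- 6
w <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  w[i, i + 1] <- 1
  w[i + 1, i] <- 1
}

clustered   <- c(1, 1, 1, 9, 9, 9)  # similar rates sit next to each other
alternating <- c(1, 9, 1, 9, 1, 9)  # low rates always sit next to high rates

i_clustered   <- morans_i(clustered, w)    # positive (0.6)
i_alternating <- morans_i(alternating, w)  # negative (-1)
c(clustered = i_clustered, alternating = i_alternating)
```

The statistic alone does not say whether the pattern is significant; that requires comparing it with its distribution under H0, which is what the test does.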

We will compute Moran's statistic using the `spdep` and `DCluster` packages. The `spdep` package provides a collection of functions to analyze the spatial correlation of polygons and works with `sp` objects. The `DCluster` package provides a set of functions for detecting spatial clusters of disease using count data.

```{r MoransI, eval = TRUE, echo = TRUE, nm = TRUE, fig.width=8, class.output="code-out", warning=FALSE, message=FALSE}

# Compute incidence in each district (per 100,000 population)
district$incidence <- district$cases / district$T_POP * 100000

# Plot the histogram of log-incidence
# (districts with zero cases give log(0) = -Inf and are dropped by hist())
hist(log(district$incidence))

```
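A rate such as the incidence computed above can then be passed to a Moran's I test. The following is a minimal sketch with `spdep` on a toy 4 x 4 grid rather than on the district layer. The grid, the `rate` values, and the choices of queen contiguity and row-standardized weights (`style = "W"`) are assumptions made for illustration, not necessarily the settings best suited to the Cambodia data.

```{r moran_spdep_toy, eval = TRUE, echo = TRUE}
library(sf)
library(spdep)

# Toy study area: a 4 x 4 grid of square polygons
bbox <- st_bbox(c(xmin = 0, ymin = 0, xmax = 4, ymax = 4))
grid <- st_sf(geometry = st_make_grid(st_as_sfc(bbox), n = c(4, 4)))

# Hypothetical rates: high in the two western columns, low in the two eastern ones
centroids <- st_coordinates(st_centroid(st_geometry(grid)))
grid$rate <- ifelse(centroids[, "X"] < 2, 10, 1)

# Neighbours: polygons sharing at least one boundary point (queen contiguity)
nb <- poly2nb(grid, queen = TRUE)

# Row-standardized spatial weights
lw <- nb2listw(nb, style = "W")

# Moran's I test under the randomisation assumption (the default)
res <- moran.test(grid$rate, lw)
res$estimate["Moran I statistic"]
res$p.value
```

With real data, polygons without neighbours (for instance islands) may require the `zero.policy` argument of `nb2listw()` and `moran.test()`.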

## Cluster analysis

In epidemiology, the definition of a cluster

### Population-based clusters (Kulldorff statistic)

Kulldorff's spatial scan statistic identifies the most likely disease clusters by maximizing the likelihood that disease cases are located within a set of concentric circles that are moved across the study area.
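To make the idea concrete, the sketch below scans a toy study area in base R. It assumes the Poisson form of the log-likelihood ratio, $LLR = c \log(c/e) + (C - c)\log\big((C - c)/(C - e)\big)$ when $c > e$ and 0 otherwise, where $c$ and $e$ are the observed and expected cases inside a circle and $C$ is the total number of cases. The six units placed along a line, their populations and counts, and the cap of candidate circles at half the total population are all hypothetical choices for illustration.

```{r scan_llr_toy, eval = TRUE, echo = TRUE}
# Toy data: 6 units along a line (hypothetical coordinates, populations and counts)
units <- data.frame(
  x     = c(0, 1, 2, 2.8, 4, 5),
  pop   = rep(1000, 6),
  cases = c(2, 3, 12, 14, 3, 2)
)
total_cases <- sum(units$cases)
total_pop   <- sum(units$pop)

# Poisson log-likelihood ratio of a zone with c_in observed and e_in expected cases
llr <- function(c_in, e_in, c_tot) {
  if (c_in <= e_in) return(0)  # only zones with an excess of cases are of interest
  c_out <- c_tot - c_in
  e_out <- c_tot - e_in
  inside  <- c_in * log(c_in / e_in)
  outside <- if (c_out > 0) c_out * log(c_out / e_out) else 0
  inside + outside
}

# Candidate zones: circles of growing radius centred on each unit,
# skipping circles that hold more than half of the total population
candidates <- do.call(rbind, lapply(seq_len(nrow(units)), function(i) {
  d <- abs(units$x - units$x[i])
  do.call(rbind, lapply(sort(unique(d)), function(r) {
    in_zone <- d <= r
    if (sum(units$pop[in_zone]) > total_pop / 2) return(NULL)
    c_in <- sum(units$cases[in_zone])
    e_in <- total_cases * sum(units$pop[in_zone]) / total_pop
    data.frame(centre = i, radius = r, cases = c_in, expected = e_in,
               llr = llr(c_in, e_in, total_cases))
  }))
}))

# Most likely cluster: the candidate zone with the highest LLR
best <- candidates[which.max(candidates$llr), ]
best
```

Here the most likely cluster is the circle covering the third and fourth units (26 observed cases against 12 expected). In a full analysis, its significance would be assessed by recomputing the maximum LLR on many datasets simulated under H0.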

### Expectation-based clusters

In many cases, population is not specific enough to

### To go further ...