Preparing the NDVI data cube
The Normalized Difference Vegetation Index (NDVI) is a commonly used index for estimating vegetation changes from satellite imagery. It does not correspond to a physical quantity, but it is a good proxy for biomass. It exploits the fact that vegetation has a very low reflectance in the red band (part of the photosynthetically active radiation) and a very high reflectance in the near-infrared band (an unused part of the sun's light spectrum). NDVI usually ranges between -1 and 1: strongly negative values correspond to open water bodies, strongly positive values to densely vegetated areas, and values around 0 to mineral or human-made surfaces.
In the modspa_pixel processing chain, it is calculated as follows:

\[ NDVI = \frac{NIR - (RED + corr_{ACORVI})}{NIR + (RED + corr_{ACORVI})} \]

Where:

- RED: red band
- NIR: near-infrared band
- \(corr_{ACORVI}\): correction parameter applied to the red band to smooth extreme NDVI values, equal to \(5\%\) of the red band range
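As a minimal illustration, the formula can be written as follows for numpy reflectance arrays (the function name and arguments are illustrative, not part of the package):

    import numpy as np

    def ndvi_acorvi(red: np.ndarray, nir: np.ndarray, corr: float) -> np.ndarray:
        # corr is the ACORVI correction added to the red band to smooth
        # extreme NDVI values (about 5% of the red band range).
        red_corr = red + corr
        return (nir - red_corr) / (nir + red_corr)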
Download satellite imagery
The Sentinel-2 images can be automatically downloaded with the eodag module using the following function:
- modspa_pixel.preprocessing.download_S2.download_S2_data(start_date: str, end_date: str, preferred_provider: str, save_path: str, shapefile: str, mode: str = 'pixel', cloud_cover_limit: int = 80) List[str] [source]
download_S2_data uses the eodag module to look for all products of a given provider (copernicus or theia) during a specific time window and covering the whole shapefile envelope (several Sentinel-2 tiles might be needed; only one can be chosen in pixel mode). It then downloads that data into the download path set in the config file. Paths to the downloaded data are returned and saved as a csv file. An extra month of data is downloaded for a better interpolation; it is then discarded, and the final NDVI cube has the dates defined in the config file.

Arguments
- start_date: str
  beginning of the time window to download (format: YYYY-MM-DD)
- end_date: str
  end of the time window to download (format: YYYY-MM-DD)
- preferred_provider: str
  chosen source of the Sentinel-2 data (copernicus or theia)
- save_path: str
  path where a csv file containing the product paths will be saved
- shapefile: str
  path to the shapefile (.shp) for which the data is downloaded
- mode: str, default = 'pixel'
  run the download code in 'pixel' or 'parcel' mode
- cloud_cover_limit: int, default = 80
  maximum cloud cover percentage a product may have to pass the filter before download (between 0 and 100)

Returns

- product_paths: list[str]
  a list of the paths to the downloaded data
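A hypothetical call, using the arguments documented above (dates and paths are placeholders to adapt to your own project):

    from modspa_pixel.preprocessing.download_S2 import download_S2_data

    product_paths = download_S2_data(
        start_date="2021-04-01",          # placeholder dates
        end_date="2021-10-31",
        preferred_provider="copernicus",  # or "theia"
        save_path="/data/products.csv",   # placeholder paths
        shapefile="/data/area.shp",
        mode="pixel",
        cloud_cover_limit=80,
    )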
This will download all the Sentinel-2 images found during the specified time window and over the specified area (the shapefile declared in the config file) into the download directory as zip or tar archives. An additional month before and after the start and end dates is downloaded to make sure there is enough clear data for daily interpolation; the extra data is then discarded. Specific bands can then be extracted from the archives using this function:
- modspa_pixel.preprocessing.download_S2.extract_zip_archives(download_path: str, list_paths: List[str] | str, preferred_provider: str, resolution: int, save_path: str, remove_archive: bool = False) List[str] [source]
Extracts specific bands from a list of zip or tar archives.
Arguments
- download_path: str
  path in which the archives will be extracted (usually where the archives are located)
- list_paths: List[str]
  list of paths to the zip archives
- bands_to_extract: List[str]
  list of strings used to match specific bands. For example, if you are looking for bands B3 and B4 in a given archive, bands_to_extract = ['*_B3.TIF', '*_B4.TIF']. This depends on the product architecture.
- resolution: int
  resolution of the images in meters. Either 10 or 20.
- save_path: str
  path where a csv file containing the product paths will be saved
- remove_archive: bool, default = False
  whether to remove the archive after extraction

Returns

- product_list: List[str]
  list of the paths to the extracted products
The archives can then be deleted, freeing up some disk space.
Note
The scripts to download Landsat data will be added soon.
Calculate NDVI
Pixel mode
The NDVI calculation is done using the xarray module, which allows for easy parallelization of the calculation (through the integrated dask module). The first step is to calculate the NDVI for the existing images and save them in a data cube (a stack of two-dimensional images along a time dimension). This first data cube is called the NDVI pre_cube. The chosen file format is netCDF4 (.nc) for a more efficient reading and writing process.
To limit the size of the input datasets, the NDVI data is converted to the uint8 data type (one byte per pixel). This means NDVI values are saved as integers between 1 and 255 (corresponding to NDVI values between 0 and 1, with 0 reserved as the no_data value). This gives a precision of about 0.4 % for the NDVI values, which is below the uncertainty of the satellite measurements, so little actual data is lost.
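A sketch of this encoding for a numpy array; the exact rounding used by the package may differ, so treat this as an assumption about the scheme rather than the package's code:

    import numpy as np

    SCALING = 255  # matches the default scaling argument of calculate_ndvi

    def encode_ndvi(ndvi: np.ndarray) -> np.ndarray:
        # Map NDVI in [0, 1] to integers in [1, 255]; 0 is the no_data value.
        scaled = np.round(np.clip(ndvi, 0.0, 1.0) * (SCALING - 1)) + 1
        return np.where(np.isnan(ndvi), 0, scaled).astype(np.uint8)

    def decode_ndvi(encoded: np.ndarray) -> np.ndarray:
        # Map integers 1..255 back to NDVI in [0, 1]; 0 becomes NaN.
        ndvi = (encoded.astype(np.float32) - 1) / (SCALING - 1)
        return np.where(encoded == 0, np.nan, ndvi)

With 254 usable steps, the quantization error is at most 1/254, which is about 0.4 %.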
The NDVI pre_cube can be created with the following function:
- modspa_pixel.preprocessing.calculate_ndvi.calculate_ndvi(extracted_paths: List[str] | str, config_file: str, chunk_size: dict = {'time': -1, 'x': 1000, 'y': 1000}, scaling: int = 255, acorvi_corr: int = 500) str [source]
Calculate ndvi images in an xarray dataset (a data cube) and save it. ndvi values are scaled and saved as uint8 (0 to 255).

Warning

Current version is for Copernicus Sentinel-2 images.

Warning

Only 10 and 20 meter resolutions are currently supported.
Arguments
- extracted_paths: Union[List[str], str]
  list of paths to extracted Sentinel-2 products or path to a csv file containing those paths
- config_file: str
  path to the configuration file
- chunk_size: dict, default = {'x': 1000, 'y': 1000, 'time': -1}
  dictionary containing the chunk size for the xarray dask calculation
- scaling: int, default = 255
  integer scaling used to save NDVI data as integer values
- acorvi_corr: int, default = 500
  acorvi correction parameter added to the red band to adjust ndvi values

Returns

- ndvi_precube_path: str
  path to the saved ndvi pre-cube
Once this data cube is written, it needs to be interpolated along the time dimension. The processing chain requires NDVI data at a daily frequency, and high resolution satellite imagery rarely has a revisit time shorter than 5 days. The daily interpolation is also done with the xarray module, and the resulting dataset is also saved with the uint8 data type.
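The daily interpolation step looks roughly like the following (file names are placeholders; the package's actual implementation may differ):

    import xarray as xr

    # The whole time dimension must sit in one chunk ("time": -1), since
    # interpolation cannot be split along the interpolated axis.
    ds = xr.open_dataset("ndvi_precube.nc",
                         chunks={"time": -1, "x": 500, "y": 500})

    # Resample the irregular acquisition dates to a daily frequency.
    daily = ds.resample(time="1D").interpolate("linear")
    daily.to_netcdf("ndvi_cube.nc")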
The final NDVI cube can be created with the following function:
- modspa_pixel.preprocessing.calculate_ndvi.interpolate_ndvi(ndvi_path: str, config_file: str, chunk_size: dict = {'time': -1, 'x': 500, 'y': 500}) str [source]
Interpolate the ndvi cube to a daily frequency between the desired dates defined in the json config file. The extra month of downloaded data is used for a better interpolation; it is then discarded, and the final NDVI cube has the dates defined in the config file.

Arguments
- ndvi_path: str
  path to the ndvi pre-cube
- config_file: str
  path to the json config file
- chunk_size: dict, default = {'x': 500, 'y': 500, 'time': -1}
  chunk size used by dask for the calculation; 'time': -1 means each chunk contains the whole time dimension, since the dataset can't be divided along the time axis for interpolation

Returns

None
Warning
Both of the previous functions are resource hungry (CPU and RAM); they can take up to an hour or more depending on the size of the dataset and the specifications of your machine.
Parcel mode
The preparation of the NDVI data can take longer when using the parcel mode. The calculations are done in several steps, and intermediary data is saved as pandas.DataFrames in csv format.
NDVI calculation
First, NDVI images are created from the raw satellite images using this function:
- modspa_pixel.preprocessing.calculate_ndvi.calculate_ndvi_parcel(download_path: List[str] | str, save_dir: str, save_path: str, overwrite: bool = False, max_cpu: int = 4, ACORVI_correction: int = 500) List[str] [source]
Opens the red, near-infrared and scene classification bands to calculate the ndvi and applies a mask to keep only clear land data (no clouds, no snow, no water, no unclassified or erroneous pixels). This array is then saved as a GeoTiff image with the parent file name and the '_NDVI.tif' extension. If overwrite is false (the default value), already existing files won't be calculated and saved again.

This function calls the calculate_ndvi_image function as a subprocess to allow multiprocessing. A modified version of the tqdm module (p_tqdm) is used for progress bars.

Arguments
- download_path: list[str] or str
  list of paths to the products to extract for ndvi calculation, or path to a csv file that contains this list
- save_dir: str
  directory in which to save the ndvi images
- save_path: str
  path to a csv file containing the paths to the saved ndvi images
- overwrite: bool, default = False
  whether to rewrite already existing files
- max_cpu: int, default = 4
  max number of CPUs that the pool can use; if max_cpu is bigger than the number of available CPUs, the pool only uses the available CPUs
- ACORVI_correction: int, default = 500
  correction parameter applied to the red band for stability of NDVI values

Returns

- ndvi_path: list[str]
  list of paths to the saved ndvi images
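The "clear land" masking described above can be sketched as follows; the kept scene classification classes (4 = vegetation, 5 = not vegetated) are an assumption based on the Sentinel-2 SCL conventions, not the package's exact list:

    import numpy as np

    def masked_ndvi(red: np.ndarray, nir: np.ndarray, scl: np.ndarray,
                    acorvi_correction: int = 500) -> np.ndarray:
        # Keep only clear land pixels (assumed SCL classes 4 and 5).
        clear_land = np.isin(scl, [4, 5])
        red_corr = red.astype(np.float32) + acorvi_correction
        ndvi = (nir - red_corr) / (nir + red_corr)
        return np.where(clear_land, ndvi, np.nan)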
This function uses multiprocessing to call the following functions:
- modspa_pixel.preprocessing.calculate_ndvi.calculate_ndvi_image(args: tuple) str [source]
Opens the zip file of the given product to extract the red, near-infrared and mask bands (without extracting the whole archive) and calculates the ndvi, applying a mask to keep only clear land data (no clouds, no snow, no water, no unclassified or erroneous pixels). This array is then saved as a GeoTiff image with the parent file name and the '_NDVI.tif' extension. If overwrite is false (the default value), already existing files won't be calculated and saved again.

This function is called by the calculate_ndvi_parcel function as a subprocess to allow multiprocessing. Variables that have served their purpose are deleted immediately to reduce memory usage.
Arguments (packed in args: tuple)

- product_path: str
  path to the product to extract for ndvi calculation
- save_dir: str
  directory in which to save the ndvi images
- overwrite: bool
  whether to rewrite already existing files
- ACORVI_correction: int
  correction parameter applied to the red band for stability of NDVI values

Returns

- ndvi_path: str
  path to the saved ndvi image
- modspa_pixel.preprocessing.calculate_ndvi.write_geotiff(path: str, data: ndarray, transform: tuple, projection: str, scaling: int = 255) None [source]
Writes a GeoTiff image using rasterio. Takes an array and georeferencing data (obtained by rasterio on an image with the same georeferencing) to write a well formatted image. Data format: uint8

Arguments
- path: str
  full path of the file to save the data in (path = path + save name)
- data: np.ndarray
  numpy array containing the data
- geotransform: tuple
  tuple containing the geo-transform data
- projection: str
  string containing the epsg projection data
- scaling: int, default = 255
  scaling factor for saving NDVI images as integer arrays

Returns

None
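A minimal sketch of such a write with rasterio, reusing the profile of a reference image for the georeferencing (the function and file names are illustrative, not the package's API):

    import numpy as np
    import rasterio

    def write_ndvi_geotiff(save_path: str, data: np.ndarray,
                           reference_image: str) -> None:
        # Reuse the georeferencing profile of an image on the same grid.
        with rasterio.open(reference_image) as src:
            profile = src.profile
        profile.update(dtype=rasterio.uint8, count=1, nodata=0)
        with rasterio.open(save_path, "w", **profile) as dst:
            dst.write(data.astype(np.uint8), 1)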
The NDVI images are kept and not rewritten unless specified in the json configuration file.
NDVI extraction
In this section, most of the functions call another function designed to run on a portion of the original data; this allows the use of the multiprocessing python package for faster runtime.
Zonal statistics are applied to each NDVI image to produce an NDVI dataframe: it contains the mean NDVI value over each polygon for each date. The raw dataframe is built with this function:
- modspa_pixel.preprocessing.extract_ndvi.extract_ndvi_stats(ndvi_paths: List[str] | str, shapefile: str, save_raw_dataframe_path: str, scaling: int = 255, buffer_distance: int = -10, max_cpu: int = 4) DataFrame [source]
extract_ndvi_stats extracts ndvi statistics (mean and pixel count) from a list of ndvi images and a shapefile (.shp). It iterates over the list of ndvi images and, for each image, calls the rasterstats zonal_stats method. This information is stored in a pandas DataFrame.

This function calls the extract_ndvi_image function to allow for the use of multiprocessing. A modified version of the tqdm module (p_tqdm) is used for progress bars.

It returns a pandas DataFrame that contains the statistics, a feature id, the date and the tile for every date in the dataset and every polygon in the shapefile geometry. This DataFrame is saved as a csv file.

Arguments
- ndvi_paths: list[str] or str
  list of paths to the ndvi images or path to a csv file containing that list
- shapefile: str
  path to the shapefile
- save_raw_dataframe_path: str
  path to save the DataFrame as csv
- scaling: int, default = 255
  integer scaling used to save NDVI values as integers
- buffer_distance: int, default = -10
  distance (in meters for Sentinel-2 and Landsat 8 images) by which to buffer the shapefile polygons, to prevent extracting pixels that are not entirely inside a polygon; < 0 shrinks the polygons
- max_cpu: int, default = 4
  max number of CPU cores to use for calculation

Returns

- ndvi_dataframe: pd.DataFrame
  pandas DataFrame containing ndvi statistics, feature information and image information for every polygon in the shapefile and every date in the ndvi image dataset
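For a single image, the underlying rasterstats call looks roughly like this (file names are placeholders):

    from rasterstats import zonal_stats

    stats = zonal_stats(
        "parcels.shp",      # polygon features (optionally buffered)
        "image_NDVI.tif",   # one NDVI image
        stats=["mean", "count"],
        nodata=0,           # 0 is the no_data value in the uint8 encoding
    )
    # One dict per polygon: [{"mean": ..., "count": ...}, ...]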
This raw dataframe is then filtered (averaging of multiple images on a single date, removal of measurements with too few valid pixels) with this function:
- modspa_pixel.preprocessing.extract_ndvi.filter_raw_ndvi(ndvi_raw_dataframe: DataFrame | str, save_filtered_dataframe_path: str, min_pixel_ratio: float = 0.7, max_cpu: int = 4, custom: bool = False) DataFrame [source]
filter_raw_ndvi takes the raw ndvi dataframe and filters it for further processing. It removes measurements with too few valid pixels and averages the statistics over identical dates where multiple products cover the same parcel. It takes three steps:

1. collect all measurements for a single id value
2. find the max pixel count (taken as the total pixel count) and remove measurements with too few valid pixels (min_pixel_ratio is entered as an argument)
3. average over dates to smooth multiple values for a single date

It returns a pandas DataFrame that has the same structure as the raw ndvi DataFrame, but filtered. This DataFrame is saved as a csv file.

This function uses multiprocessing for faster calculations.

Arguments
- ndvi_raw_dataframe: pd.DataFrame or str
  DataFrame containing the raw ndvi statistics, or path to a csv file containing that dataframe
- save_filtered_dataframe_path: str
  path to save the DataFrame as csv
- min_pixel_ratio: float, default = 0.7
  minimum ratio of valid pixels to total pixels of a feature. Between 0 and 1.
- max_cpu: int, default = 4
  max number of CPU cores to use for calculation
- custom: bool, default = False
  if the function is called with custom set to true, count values are not used for filtering

Returns

- filtered_ndvi: pd.DataFrame
  pandas DataFrame containing ndvi statistics, feature information and image information for every polygon in the shapefile and every date in the ndvi image dataset, filtered according to pixel count and duplicate dates
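The three filtering steps can be sketched with pandas as follows; the column names ('id', 'date', 'count', 'mean') are assumptions about the raw dataframe layout:

    import pandas as pd

    def filter_raw(df: pd.DataFrame,
                   min_pixel_ratio: float = 0.7) -> pd.DataFrame:
        # 1. per feature id, take the max pixel count as the total pixel count
        total = df.groupby("id")["count"].transform("max")
        # 2. drop measurements with too few valid pixels
        df = df[df["count"] >= min_pixel_ratio * total]
        # 3. average duplicate dates (several products covering the same parcel)
        return df.groupby(["id", "date"], as_index=False)["mean"].mean()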
This dataframe is then cleaned (removal of values flagged as abnormal) and interpolated to a daily timestep with this function:
- modspa_pixel.preprocessing.extract_ndvi.interpolate_ndvi_parcel(ndvi_dataframe: DataFrame | str, save_path: str, start_date: str, end_date: str, interp_method: str = 'linear', threshold_ratio_deriv: int = 25, threshold_median: float = 0.1, median_window: int = 3, max_cpu: int = 4) DataFrame [source]
interpolate_ndvi_parcel takes in the filtered NDVI DataFrame and builds a DataFrame with daily values between the first and last dates, filling in the missing values with the chosen interpolation method (default is linear), using the pandas library. The interpolated DataFrame is then saved as a csv file and returned.

Arguments
- ndvi_dataframe: pd.DataFrame or str
  DataFrame containing the filtered NDVI data, or path to a csv file containing it
- save_path: str
  path to save the interpolated DataFrame as csv
- start_date: str
  beginning of the time window for interpolation (format: yyyy-mm-dd)
- end_date: str
  end of the time window for interpolation (format: yyyy-mm-dd)
- interp_method: str, default = 'linear'
  chosen interpolation method; possible values are 'linear' or 'pchip'
- max_cpu: int, default = 4
  max number of cores to use for calculation

Returns

- interpolated_dataframe: pd.DataFrame
  pandas DataFrame containing interpolated NDVI values for every shapefile feature
This produces an NDVI dataframe on a daily index for every polygon in the shapefile.
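For one feature, the daily interpolation amounts to something like this (the function, the column names and the per-feature split are assumptions, not the package's code):

    import pandas as pd

    def interpolate_feature(feature_df: pd.DataFrame,
                            start_date: str, end_date: str) -> pd.Series:
        # NDVI series of a single feature, indexed by acquisition date.
        series = feature_df.set_index(pd.to_datetime(feature_df["date"]))["mean"]
        # Union the observations with a daily index, interpolate,
        # then keep only the daily dates.
        daily_index = pd.date_range(start_date, end_date, freq="D")
        return (series.reindex(series.index.union(daily_index))
                      .interpolate(method="linear")
                      .reindex(daily_index))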
NDVI formatting
This dataframe is then converted to an xarray dataset with this function:
- modspa_pixel.preprocessing.parcel_to_pixel.convert_dataframe_to_xarray(dataframe_in: str | DataFrame, save_path: str, variables: List[str], data_types: List[str], time_dimension: bool = True) None [source]
Convert pandas dataframes of the parcel mode into xarray datasets for the model calculations. The resulting xarray dataset has dimensions time: number of dates, y: 1, x: number of polygons (to make a 3D dataset), or dimensions y: 1, x: number of polygons (to make a 2D dataset).

Arguments
- dataframe_in: Union[str, pd.DataFrame]
  dataframe to convert, or path to that dataframe
- save_path: str
  save path of the output xarray dataset
- variables: List[str]
  names of the variables (the list can have a single element) to put in the output dataset
- data_types: List[str]
  xarray datatypes corresponding to the variable names, for correct saving of the dataset
- time_dimension: bool, default = True
  boolean indicating whether the dataframe has a time dimension

Returns

None
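A sketch of the reshaping for the 3D case (the column names 'date', 'id' and 'NDVI' are assumptions about the dataframe layout):

    import pandas as pd
    import xarray as xr

    def dataframe_to_cube(df: pd.DataFrame, save_path: str) -> None:
        da = (df.set_index(["date", "id"])["NDVI"]
                .to_xarray()                        # dims: date, id
                .rename({"date": "time", "id": "x"})
                .expand_dims(y=[0])                 # singleton y dimension
                .transpose("time", "y", "x"))
        da.to_dataset(name="NDVI").to_netcdf(save_path)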
This dataset can then be used for the model calculation.