RQCP: Reads Quality Control Pipeline
Description
RQCP checks the quality of NGS (Illumina) reads and cleans them if needed, according to your settings, using:
- Cutadapt to trim NGS sequencing adapters
- Sickle-trim to trim reads based on base-calling quality scores
- Fastq-join to join mate reads (forward R1 and reverse R2) when possible
- FastQC to check global quality
- FastqScreen to check putative contamination(s)
- MultiQC to generate HTML reports
Badges
Visuals
Installation
Conda (prerequisite)
Download and install Conda: Latest Miniconda Installer
- Download the conda installer (e.g. for Miniconda3 with Python 3.9 on Linux 64-bit):
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.10.3-Linux-x86_64.sh
- Install conda with the installer bash script, following the prompts on the installer screens:
bash Miniconda3-py39_4.10.3-Linux-x86_64.sh
- Remove the conda installer:
rm Miniconda3-py39_4.10.3-Linux-x86_64.sh
- Restart your shell: close the terminal window and open a new one
Snakemake (prerequisite)
Install Snakemake using the Conda package management system
Follow the prompts on the installer screens
conda install -c bioconda -c conda-forge snakemake
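Before going further, it can help to confirm that both prerequisites are reachable from your shell. This is a minimal sketch, assuming a POSIX shell; the helper name `check_tool` is hypothetical and not part of RQCP:

```shell
#!/bin/sh
# check_tool NAME: succeed if NAME is found on the PATH (hypothetical helper).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found, install it before running RQCP"
        return 1
    fi
}

# Report on both prerequisites without stopping at the first missing one.
for tool in conda snakemake; do
    check_tool "$tool" || true
done
```

If either tool is reported as not found right after installation, restarting the shell is usually the first thing to try.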
RQCP
Download or clone the Reads Quality Control Pipeline project
Download
- Download source code archive (zip, tar.gz, tar.bz2, tar): RQCP on GitLab
- Extract and remove the archive (e.g. tar.gz):
tar -xzvf path/to/archive/Reads_Quality_Control_Pipeline.tar.gz
rm path/to/archive/Reads_Quality_Control_Pipeline.tar.gz
Clone
Clone with SSH when you want to authenticate only once
Authenticate with GitLab by following the instructions in the SSH documentation
git clone git@gitlab.com:ird_transvihmi/Reads_Quality_Control_Pipeline.git path/to/workdir/Reads_Quality_Control_Pipeline/
cd path/to/workdir/Reads_Quality_Control_Pipeline/
Clone with HTTPS when you want to authenticate each time you perform an operation between your computer and GitLab
git clone https://gitlab.com/ird_transvihmi/Reads_Quality_Control_Pipeline.git path/to/workdir/Reads_Quality_Control_Pipeline/
cd path/to/workdir/Reads_Quality_Control_Pipeline/
Difference between download and clone
To create a copy of a remote repository’s files on your computer, you can either download or clone the repository
If you download it, you cannot sync the repository with the remote repository on GitLab
Cloning a repository is the same as downloading, except it preserves the Git connection with the remote repository
You can then modify the files locally and upload the changes to the remote repository on GitLab
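To tell which kind of copy you have, you can look for the .git directory that only a clone carries. A small sketch; `is_clone` is a hypothetical helper name, not something shipped with RQCP:

```shell
#!/bin/sh
# is_clone DIR: succeed if DIR contains a .git directory, i.e. it was cloned.
# A downloaded archive has no .git directory and cannot be synced.
is_clone() {
    [ -d "$1/.git" ]
}

if is_clone .; then
    echo "clone: 'git pull' can fetch updates from GitLab"
else
    echo "download: get a fresh archive to update"
fi
```

Run it from the project directory to check the copy you are working in.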
Usage
- Copy your paired-end reads (fastq.gz files) into the ./resources/reads/ directory
- Edit the config.yaml file in the ./config/ directory, if needed
- Edit the fastq-screen.conf file in the ./config/ directory, if needed
- Make sure the bash script is executable; if it is not, run in a terminal:
cd Reads_Quality_Control_Pipeline/
sudo chmod +x RQCP.sh
- Run RQCP.sh bash script by double-clicking on it
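Since the pipeline expects paired-end reads, a quick check that every forward file has its mate can save a failed run. This is a minimal sketch, assuming file names contain `_R1` and `_R2` (a naming convention assumed here, check it against your own files); `check_pairs` is a hypothetical helper, not part of RQCP:

```shell
#!/bin/sh
# check_pairs DIR: list forward (R1) fastq.gz files whose reverse (R2) mate is missing.
# Assumes mates differ only by "_R1" versus "_R2" in the file name.
check_pairs() {
    status=0
    for r1 in "$1"/*_R1*.fastq.gz; do
        [ -e "$r1" ] || continue    # the glob matched nothing
        name=${r1##*/}
        r2="$1/${name%%_R1*}_R2${name#*_R1}"
        if [ ! -e "$r2" ]; then
            echo "missing mate for: $r1"
            status=1
        fi
    done
    return "$status"
}

check_pairs ./resources/reads || echo "fix the files above before running RQCP.sh"
```

From a terminal opened in the project directory, the script can also be started with `./RQCP.sh` instead of double-clicking.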
Configuration
Resources
Edit to match your hardware configuration
Environments
Edit if you change an environment (e.g. a new tool version) in the ./workflow/envs/tools-version.yaml files
Datasets
Edit to choose the datasets you want to quality-check with FastQC and Fastq-Screen
Cutadapt
- length: Discard reads shorter than this length after trimming (default config: '75')
- kit: Sequence of an adapter ligated to the 3' end of the first read (default config: truseq / nextera / small)
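For orientation, the kit names usually correspond to the standard Illumina adapter sequences below. This is a sketch under assumptions: the sequences are the commonly published TruSeq, Nextera and small RNA 3' adapters, and the pipeline's own rule may use different or longer ones. `adapter_for_kit` and the file names are hypothetical. The Cutadapt command is printed, not executed:

```shell
#!/bin/sh
# adapter_for_kit KIT: print the 3' adapter commonly associated with KIT.
adapter_for_kit() {
    case "$1" in
        truseq)  echo "AGATCGGAAGAGC" ;;
        nextera) echo "CTGTCTCTTATACACATCT" ;;
        small)   echo "TGGAATTCTCGGGTGCCAAGG" ;;
        *)       echo "unknown kit: $1" >&2; return 1 ;;
    esac
}

# Dry run: show a Cutadapt call for one read pair (file names are placeholders).
# -a is the 3' adapter of the first read, -m the minimum length kept after trimming.
kit=truseq
length=75
adapter=$(adapter_for_kit "$kit")
echo cutadapt -a "$adapter" -m "$length" \
    -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
    sample_R1.fastq.gz sample_R2.fastq.gz
```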
Sickle-trim
- command: The pipeline expects paired-end reads (default config: 'pe'); see rule sickletrim in the ./workflow/rules/reads_quality_control_pipeline.smk Snakemake file
- encoding: If your data come from a recent Illumina run, keep 'sanger' (default config: 'sanger')
- quality: Phred quality score threshold (default config: '30')
- length: Read length threshold, after trimming (default config: '75')
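Put together, these four settings map onto a Sickle command line of roughly this shape. A sketch only: the real call lives in rule sickletrim, and the helper `sickle_cmd` and all file names here are placeholders. The command is echoed, not executed:

```shell
#!/bin/sh
# Values mirroring the default config described above.
sickle_command=pe   # paired-end mode
encoding=sanger     # quality encoding of recent Illumina runs
quality=30          # Phred threshold used for trimming
length=75           # discard reads shorter than this after trimming

# sickle_cmd R1 R2: print the Sickle call for one read pair (dry run).
sickle_cmd() {
    echo sickle "$sickle_command" -t "$encoding" -q "$quality" -l "$length" \
        -f "$1" -r "$2" \
        -o sickle_R1.fastq -p sickle_R2.fastq -s sickle_single.fastq
}

sickle_cmd cutadapt_R1.fastq.gz cutadapt_R2.fastq.gz
```

A threshold of 30 corresponds to an estimated base-calling error rate of 1 in 1000.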
Fastq-Join
- percent: Percent maximum difference (default config: 5)
- overlap: Minimum overlap (default config: 25)
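As a rough illustration, the two settings correspond to the `-p` and `-m` options of fastq-join. The helper `join_cmd` and the file names are hypothetical, and the pipeline's own rule may differ. The command is printed, not executed:

```shell
#!/bin/sh
percent=5    # maximum difference, in percent
overlap=25   # minimum overlap, in bases

# join_cmd R1 R2: print the fastq-join call for one read pair (dry run).
# fastq-join replaces "%" in the -o template to name its output files.
join_cmd() {
    echo fastq-join -p "$percent" -m "$overlap" "$1" "$2" -o "joined_%.fastq"
}

join_cmd sickle_R1.fastq sickle_R2.fastq
```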
Fastq-Screen
- config: Path to the fastq-screen configuration file (default config: ./config/fastq-screen.conf)
- subset: Don't use the whole sequence file, but create a temporary dataset of the specified number of reads (default config: '10000', set '0' to use the whole dataset)
- aligner: Specify the aligner to use for the mapping. Valid arguments are 'bowtie', 'bowtie2' or 'bwa' (default config: 'bwa')
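These three settings line up with FastQ Screen's `--conf`, `--subset` and `--aligner` options. A dry-run sketch with the defaults listed above; `screen_cmd` and the input file name are hypothetical, and the exact call made by the pipeline may include more options:

```shell
#!/bin/sh
config=./config/fastq-screen.conf
subset=10000   # 0 would screen the whole file
aligner=bwa

# screen_cmd FASTQ: print the FastQ Screen call for one file (dry run).
screen_cmd() {
    echo fastq_screen --conf "$config" --subset "$subset" --aligner "$aligner" "$1"
}

screen_cmd sample_R1.fastq.gz
```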
fastq-screen.conf
- path: Set this value to tell the program where to find your chosen aligner (default: /usr/local/<tool>)
- bismark: Same for Bismark (for bisulfite sequencing only)
- threads: Set this value to the number of cores you want to use for mapping reads (default: 1, but overridden by Snakemake and the config.yaml file)
- databases: This section lets you configure multiple genome databases (aligner index files) to search against in your screen
databases
For each genome, you need to provide a database name (which can't contain spaces) and the location of the aligner index files
The path to the index files should include the basename of the index (e.g. ./resources/databases//Human/Homo_sapiens_h38)
Thus, the index files (Homo_sapiens_h38.1.bt2, Homo_sapiens_h38.2.bt2, etc.) are found in a folder named 'Human'
If, for example, the Bowtie, Bowtie2 and BWA indices of a given genome reside in the same folder, a single path may be provided for all of the indices
The index used will be the one compatible with the chosen aligner (as specified using the --aligner option)
The entries shown in ./config/fastq-screen.conf are only suggested examples:
- You can add as many database sections as required
- You can comment out or remove as many of the existing entries as desired
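A database entry is a single line made of the keyword DATABASE, a name and the index path. The sketch below appends such a line while enforcing the no-spaces rule, then lists the active entries. `add_database` and `list_databases` are hypothetical helpers, the paths are placeholders, and a temporary file stands in for ./config/fastq-screen.conf:

```shell
#!/bin/sh
# add_database NAME INDEX_PATH CONF: append a DATABASE line to CONF.
# Rejects empty names and names containing spaces.
add_database() {
    case "$1" in
        ""|*" "*) echo "invalid database name: '$1'" >&2; return 1 ;;
    esac
    printf 'DATABASE\t%s\t%s\n' "$1" "$2" >> "$3"
}

# list_databases CONF: print the names of the active (uncommented) entries.
list_databases() {
    awk '$1 == "DATABASE" { print $2 }' "$1"
}

conf=$(mktemp)
add_database PhiX path/to/PhiX/basename "$conf"
add_database "Homo sapiens" path/to/Human/basename "$conf" || echo "rejected, as expected"
list_databases "$conf"
rm -f "$conf"
```

Commenting out an entry with a leading `#` is enough to hide it from `list_databases`, which mirrors how an entry is disabled in the configuration file.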
It is suggested to include genomes and sequences that:
- may be sources of contamination because they were run on your sequencer previously
- may have contaminated your sample during the library preparation step
For IRD_U233_TransVIHMI, the following indexes were created:
- Human: main source of lab contamination (except if a Boston Dynamics Atlas robot did the job) ¡not included!
- Mouse: main model in biological experimentation, very frequent in NGS core facilities ¡not included!
- Arabidopsis: frequent plant model in NGS core facilities associated with plant research (IRD, CIRAD, INRAE, ...) ¡not included!
- Ecoli: frequent bacterial model, also an indicator of human contamination, also found in feces and stool samples
- PhiX: useful control in Illumina sequencing runs
- Adapters: used for library generation
- Vector: used in general molecular biology
- Gorilla: species studied in TransVIHMI ¡not included!
- Chimpanzee: species studied in TransVIHMI ¡not included!
- Bat: species studied in TransVIHMI ¡not included!
- HIV: species studied in TransVIHMI
- Ebola: species studied in TransVIHMI
- SARS-CoV-2: species studied in TransVIHMI
Indexes not included:
Indexes for large genomes can be heavy (~ 3 GB) and GitLab limits each project to 10 GB. Downloading all these databases can also take too long
Code is commonly shared on git, but large resources (input data, databases, references, ...) can always be downloaded from somewhere else
These databases were generated and are available at the lab. Feel free to ask for a copy, by USB drive or FileSender, to add them to your analyses
You can also ask for new databases for references not listed here whose presence or absence you want to check in your data
Support
- RTFM! (Read The Fabulous Manual! ^^.)
- Read the awesome wiki ;)
- Create a new issue: Issues > New issue > Describe your issue
- Send an email to nicolas.fernandez@ird.fr
- Call me at +33.(0)4.67.41.55.xx (No, please don't O_o!)
Roadmap
Add new features
Contributing
Open to contributions :)
Testing code, finding issues, asking for updates, proposing new features ...
Use Git tools to share!
Authors and acknowledgment
- Nicolas Fernandez (Maintainer)
- Christelle Butel for testing all versions of this script
License
Project status
I'm out of time for this project: development has slowed down and is close to stopping completely
You can volunteer to step in as a maintainer ;)
Or you can fork this project to keep it going!
For information :
- Guests are not active contributors in private projects, they can only see, and leave comments and issues.
- Reporters are read-only contributors, they can't write to the repository, but can on issues.
- Developers are direct contributors, they have access to everything to go from idea to production, unless something has been explicitly restricted.
- Maintainers are super-developers, they are able to push to master, deploy to production. This role is often held by maintainers and engineering managers.
- Owners are essentially group-admins, they can give access to groups and have destructive capabilities.