Nicolas FERNANDEZ NUÑEZ authored

RQCP: Reads Quality Control Pipeline

Description

RQCP checks the quality of NGS (Illumina) reads and cleans them if needed, according to your settings, using:

  • Cutadapt to trim NGS sequencing adapters
  • Sickle-trim to trim reads on base-calling quality score
  • Fastq-join to join mate reads (forward R1 and reverse R2) when possible
  • FastQC to check global quality
  • Fastq-Screen to check for putative contamination(s)
  • MultiQC to generate HTML reports

Badges

Maintainer MacOS Issues closed Issues opened Maintained Wiki Open Source GNU AGPL v3 Bash Python Snakemake Conda

Visuals

Good idea to include screenshots or GIFs (see ttygif or Asciinema)

Installation

Conda (prerequisite)

Download and install Conda: Latest Miniconda Installer

  1. Download the conda installer (e.g. for Miniconda3 with Python 3.9 on Linux 64-bit):
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.10.3-Linux-x86_64.sh
  2. Install conda using the installer bash script, following the prompts on the installer screens:
bash Miniconda3-py39_4.10.3-Linux-x86_64.sh
  3. Remove the conda installer:
rm Miniconda3-py39_4.10.3-Linux-x86_64.sh
  4. Restart your shell: close the terminal window and open a new one

Snakemake (prerequisite)

Install Snakemake using the Conda package manager
Follow the prompts shown by conda during the installation

conda install -c bioconda -c conda-forge snakemake
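
Before going further, it can help to confirm that both prerequisites are reachable from your shell. The following is a minimal sketch, not part of RQCP: the helper name check_tool is hypothetical, and it only looks the commands up on the PATH.

```shell
# Report whether a command is available on the PATH.
# check_tool is a hypothetical helper, not shipped with RQCP.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
        return 1
    fi
}

for tool in conda snakemake; do
    check_tool "$tool" || echo "Install $tool before running RQCP" >&2
done
```

If a tool is reported as not found right after installation, restarting the shell as described above is usually the first thing to try.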

RQCP

Download or clone the Reads Quality Control Pipeline project

Download

  1. Download source code archive (zip, tar.gz, tar.bz2, tar): RQCP on GitLab


  2. Extract, then remove, the archive (e.g. tar.gz):
tar -xzvf path/to/archive/Reads_Quality_Control_Pipeline.tar.gz
rm path/to/archive/Reads_Quality_Control_Pipeline.tar.gz 

Clone

Clone with SSH when you want to authenticate only once
Authenticate with GitLab by following the instructions in the SSH documentation

git clone git@gitlab.com:ird_transvihmi/Reads_Quality_Control_Pipeline.git path/to/workdir/Reads_Quality_Control_Pipeline/
cd path/to/workdir/Reads_Quality_Control_Pipeline/

Clone with HTTPS when you want to authenticate each time you perform an operation between your computer and GitLab

git clone https://gitlab.com/ird_transvihmi/Reads_Quality_Control_Pipeline.git path/to/workdir/Reads_Quality_Control_Pipeline/
cd path/to/workdir/Reads_Quality_Control_Pipeline/

Difference between download and clone

To create a copy of a remote repository’s files on your computer, you can either download or clone the repository
If you download it, you cannot sync the repository with the remote repository on GitLab
Cloning a repository is the same as downloading, except it preserves the Git connection with the remote repository
You can then modify the files locally and upload the changes to the remote repository on GitLab
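
To tell which kind of copy you have on disk, one simple heuristic is to look for the .git directory that a clone keeps and an extracted archive lacks. The sketch below assumes a standard (non-bare) clone; is_git_clone is a hypothetical helper and the path is the placeholder used above.

```shell
# Succeed if the directory holds a .git directory, i.e. looks like a clone.
# is_git_clone is a hypothetical helper, not part of RQCP.
is_git_clone() {
    [ -d "$1/.git" ]
}

dir=path/to/workdir/Reads_Quality_Control_Pipeline
if is_git_clone "$dir"; then
    echo "clone: update it with 'git -C $dir pull'"
else
    echo "not a clone: download a fresh archive to update"
fi
```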

Usage

  • Copy your paired-end reads (fastq.gz files) into the ./resources/reads/ directory
  • Edit the config.yaml file in the ./config/ directory, if needed
  • Edit the fastq-screen.conf file in the ./config/ directory, if needed
  • Make sure the RQCP.sh bash script is executable; if it is not, run in a terminal:
cd Reads_Quality_Control_Pipeline/
sudo chmod +x RQCP.sh
  • Run the RQCP.sh bash script by double-clicking on it
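
Since the pipeline expects paired-end reads, a quick check that every forward file has its reverse mate can save a failed run. This is a sketch under an assumed naming scheme (<sample>_R1.fastq.gz and <sample>_R2.fastq.gz), which may not match your files; missing_mates is a hypothetical helper.

```shell
# List R1 files that have no matching R2 mate in a reads directory.
# Assumption: mates are named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
missing_mates() {
    for r1 in "$1"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue    # no R1 file matched the pattern
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        [ -e "$r2" ] || echo "missing mate for: $(basename "$r1")"
    done
}

missing_mates ./resources/reads
```

No output means every R1 file found has an R2 mate (or that no file matched the assumed pattern).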

Configuration

Resources

Edit to match your hardware configuration

Environments

Edit if you change an environment (e.g. a new tool version) in the ./workflow/envs/tools-version.yaml files

Datasets

Edit to choose the datasets you want to quality-control with FastQC and Fastq-Screen

Cutadapt

  • length: Discard reads shorter than length, after trim (default config: '75')
  • kit: Sequence of an adapter ligated to the 3' end of the first read (default config: truseq / nextera / small)
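
These two settings correspond to Cutadapt's -m (minimum length) and -a (3' adapter of the first read) options. The sketch below only prints the command it would run. The kit-to-sequence table lists commonly cited adapter sequences and is an assumption: the values used by the pipeline itself may differ. kit_adapter and the file names are hypothetical.

```shell
# Map a kit name to a 3' adapter sequence (assumed values, see above).
kit_adapter() {
    case "$1" in
        truseq)  echo "AGATCGGAAGAGC" ;;
        nextera) echo "CTGTCTCTTATACACATCT" ;;
        small)   echo "TGGAATTCTCGGGTGCCAAGG" ;;
        *)       echo "unknown kit: $1" >&2; return 1 ;;
    esac
}

length=75
kit=truseq
adapter=$(kit_adapter "$kit")

# Print (do not run) the command; add -A for the second read if needed.
echo cutadapt -a "$adapter" -m "$length" \
    -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
    sample_R1.fastq.gz sample_R2.fastq.gz
```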

Sickle-trim

  • command: The pipeline expects paired-end reads (default config: 'pe'); see rule sickletrim in the ./workflow/rules/reads_quality_control_pipeline.smk Snakemake file
  • encoding: If your data come from a recent Illumina run, keep 'sanger' (default config: 'sanger')
  • quality: Phred quality score threshold (default config: '30')
  • length: Minimum read length to keep, after trimming (default config: '75')
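
To see how these four values fit together, here is a sketch that prints a paired-end Sickle command line built from them. It is an illustration, not the rule used by the pipeline: sickle_cmd is a hypothetical helper and the file names are placeholders.

```shell
# Print the Sickle command line for a given set of config values.
# sickle_cmd is a hypothetical helper; file names are placeholders.
sickle_cmd() {
    # $1: command, $2: encoding, $3: quality, $4: length
    echo sickle "$1" -t "$2" -q "$3" -l "$4" \
        -f in_R1.fastq -r in_R2.fastq \
        -o out_R1.fastq -p out_R2.fastq -s out_singles.fastq
}

sickle_cmd pe sanger 30 75
```

In paired-end mode, reads whose mate was discarded are written to the singles file given with -s.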

Fastq-Join

  • percent: Maximum percent difference allowed in the overlap (default config: 5)
  • overlap: Minimum overlap (default config: 25)
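
On the fastq-join command line these settings correspond to -p and -m. The sketch below prints such a command without running it; fastq_join_cmd is a hypothetical helper and the file names are placeholders.

```shell
# Print the fastq-join command line for a given percent and overlap.
# fastq_join_cmd is a hypothetical helper; file names are placeholders.
fastq_join_cmd() {
    # $1: percent (maximum difference), $2: overlap (minimum, in bases)
    echo fastq-join -p "$1" -m "$2" \
        in_R1.fastq in_R2.fastq -o out_%.fastq
}

fastq_join_cmd 5 25
```

fastq-join replaces the % in the output name to produce one file for joined reads and one for each set of unjoined mates.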

Fastq-Screen

  • config: Path to the fastq-screen configuration file (default config: ./config/fastq-screen.conf)
  • subset: Don't use the whole sequence file, but create a temporary dataset of the specified number of reads (default config: '10000', set '0' to use the whole dataset)
  • aligner: Specify the aligner to use for the mapping. Valid arguments are 'bowtie', 'bowtie2' or 'bwa' (default config: 'bwa')
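
These three values map onto the --conf, --subset and --aligner options of fastq_screen. As a sketch only (fastq_screen_cmd is a hypothetical helper and the reads file name is a placeholder), the resulting call could look like this:

```shell
# Print the fastq_screen command line for a given set of config values.
# fastq_screen_cmd is a hypothetical helper, not part of RQCP.
fastq_screen_cmd() {
    # $1: config file, $2: subset size, $3: aligner, $4: reads file
    echo fastq_screen --conf "$1" --subset "$2" --aligner "$3" "$4"
}

fastq_screen_cmd ./config/fastq-screen.conf 10000 bwa sample_R1.fastq.gz
```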

fastq-screen.conf

  • path: Set this value to tell the program where to find your chosen aligner (default: /usr/local/<tool>)
  • bismark: Same for bismark (for bisulfite sequencing only)
  • threads: Set this value to the number of cores you want for mapping reads (default: 1, but overridden by Snakemake and the config.yaml file)
  • databases: This section enables you to configure multiple genome databases (aligner index files) to search against in your screen

databases

For each genome you need to provide a database name (which can't contain spaces) and the location of the aligner index files

The path to the index files should include the basename of the index (e.g. ./resources/databases/Human/Homo_sapiens_h38)
Thus, the index files (Homo_sapiens_h38.1.bt2, Homo_sapiens_h38.2.bt2, etc.) are found in a folder named 'Human'
If, for example, the Bowtie, Bowtie2 and BWA indices of a given genome reside in the same folder, a single path may be provided for all of the indices

The index used will be the one compatible with the chosen aligner (as specified using the --aligner option)

The entries shown in ./config/fastq-screen.conf are only suggested examples:

  • You can add as many database sections as required
  • You can comment out or remove as many of the existing entries as desired
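
In the FastQ Screen configuration format, each database is one line starting with the DATABASE keyword, followed by the name and the index path. The sketch below appends such a line to a file; add_database is a hypothetical helper, and it writes to a temporary file here rather than to the real ./config/fastq-screen.conf.

```shell
# Append a DATABASE entry (name + index path with basename) to a conf file.
# add_database is a hypothetical helper, not part of RQCP or FastQ Screen.
add_database() {
    # $1: database name, $2: index path including basename, $3: conf file
    case "$1" in
        *" "*) echo "database name must not contain spaces: $1" >&2
               return 1 ;;
    esac
    printf 'DATABASE\t%s\t%s\n' "$1" "$2" >> "$3"
}

conf=$(mktemp)    # stand-in for ./config/fastq-screen.conf
add_database Human ./resources/databases/Human/Homo_sapiens_h38 "$conf"
cat "$conf"
rm -f "$conf"
```

To disable an entry without deleting it, prefix the line with # in the configuration file.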

It is suggested to include genomes and sequences that:

  • may be sources of contamination because they were run on your sequencer previously
  • may have contaminated your sample during the library preparation step

For IRD_U233_TransVIHMI, these indexes were created:

  • Human: main source of lab contamination (except if a Boston Dynamics Atlas robot did the job) ¡not included!
  • Mouse: main model organism in biology, very frequent in NGS core facilities ¡not included!
  • Arabidopsis: frequent plant model in NGS core facilities associated with plant research (IRD, CIRAD, INRAE, ...) ¡not included!
  • Ecoli: frequent bacterial model, also an indicator of human contamination, and present in feces and stool samples
  • PhiX: useful control in Illumina sequencing runs
  • Adapters: used for library generation
  • Vector: used in general molecular biology
  • Gorilla: species studied in TransVIHMI ¡not included!
  • Chimpanzee: species studied in TransVIHMI ¡not included!
  • Bat: species studied in TransVIHMI ¡not included!
  • HIV: species studied in TransVIHMI
  • Ebola: species studied in TransVIHMI
  • SARS-CoV-2: species studied in TransVIHMI

Indexes not included:
Indexes for large genomes can be heavy (~3 GB) and GitLab limits each project to 10 GB. Downloading all these databases can also take too long
Code is commonly shared on Git, but large resources (input data, databases, references, ...) can always be downloaded from elsewhere
These databases were generated and are available at the lab. Feel free to ask for a copy, shared via USB drive or FileSender, to add them to your analyses
You can also ask for new databases for references not listed here whose presence or absence you want to check in your data

Support

  1. RTFM! (Read The Fabulous Manual! ^^.)
  2. Read the awesome wiki ;)
  3. Create a new issue: Issues > New issue > Describe your issue
  4. Send an email to nicolas.fernandez@ird.fr
  5. Call me at +33.(0)4.67.41.55.xx (No, please don't O_o!)

Roadmap

Add new features

Contributing

Open to contributions :)
Testing code, finding issues, asking for update, proposing new features ...
Use Git tools to share!

Authors and acknowledgment

  • Nicolas Fernandez (Maintainer)
  • Christelle Butel for testing all versions of this script

License

GPLv3

Project status

I'm out of time for this project; development has slowed down and is close to stopping completely
You can volunteer to step in as a maintainer ;)
Or choose to fork this project so that it can keep going!

For information:

  • Guests are not active contributors in private projects, they can only see, and leave comments and issues.
  • Reporters are read-only contributors, they can't write to the repository, but can on issues.
  • Developers are direct contributors, they have access to everything to go from idea to production,
    unless something has been explicitly restricted.
  • Maintainers are super-developers, they are able to push to master, deploy to production.
    This role is often held by maintainers and engineering managers.
  • Owners are essentially group-admins, they can give access to groups and have destructive capabilities.