Skip to content
Snippets Groups Projects
Commit 289f33c5 authored by Nicolas FERNANDEZ NUÑEZ's avatar Nicolas FERNANDEZ NUÑEZ
Browse files

'Initial' commit for these copy, after reached the free storage limit of 10 GB...

'Initial' commit for these copy, after reached the free storage limit of 10 GB on main first one... :s
parents
No related branches found
No related tags found
No related merge requests found
*~
*#
.DS_Store
#.snakemake
*.temp
*.fastq
*.fastq.gz
*.done
*.pdf
*.txt
/results
\ No newline at end of file
# RQCP: Reads Quality Control Pipeline #
## Description ##
RQCP check NGS (illumina) reads quality and clean it if needed, as you set, using:
- Cutadapts to trim NGS sequencing adapters
- Sickle-trim to trim reads on base-calling quality score
- Fastq-join to join mates reads (forward R1 and Reverse R2) when it's possible
- FastQC to check global quality
- FastqScreen to check putative contamination(s)
- MultiQC to generate HTML reports
## Badges ##
![Maintener](<https://badgen.net/badge/Maintener/Nicolas Fernandez/blue?scale=0.9>)
![MacOS](<https://badgen.net/badge/icon/Hight Sierra (10.13),Catalina (10.15),Big Sure (11)/cyan?icon=apple&label&list=|&scale=0.9>)
![Issues closed](<https://badgen.net/badge/Issues closed/0/green?scale=0.9>)
![Issues opened](<https://badgen.net/badge/Issues opened/0/yellow?scale=0.9>)
![Maintened](<https://badgen.net/badge/Maintened/No/red?scale=0.9>)
![Wiki](<https://badgen.net/badge/icon/Wiki/pink?icon=wiki&label&scale=0.9>)
![Open Source](<https://badgen.net/badge/icon/Open Source/purple?icon=https://upload.wikimedia.org/wikipedia/commons/4/44/Corazón.svg&label&scale=0.9>)
![GNU AGPL v3](<https://badgen.net/badge/Licence/GNU AGPL v3/grey?scale=0.9>)
![Bash](<https://badgen.net/badge/icon/Bash 3.2.57/black?icon=terminal&label&scale=0.9>)
![Python](<https://badgen.net/badge/icon/Python 3.8.7/black?icon=https://upload.wikimedia.org/wikipedia/commons/0/0a/Python.svg&label&scale=0.9>)
![Snakemake](<https://badgen.net/badge/icon/Snakemake 5.11.2/black?icon=https://upload.wikimedia.org/wikipedia/commons/d/d3/Python_icon_%28black_and_white%29.svg&label&scale=0.9>)
![Conda](<https://badgen.net/badge/icon/Conda 4.10.3/black?icon=codacy&label&scale=0.9>)
## Visuals ##
_Good idea to include screenshots or GIFs (see ttygif or Asciinema)_
## Installation ##
### Command-Line ###
Clone with SSH when you want to authenticate only one time
`git clone git@gitlab.com:ird_transvihmi/Reads_Quality_Control_Pipeline.git`
Authenticate with GitLab by following the instructions in the [SSH documentation] (https://docs.gitlab.com/ee/ssh/index.html)
Clone with HTTPS when you want to authenticate each time you perform an operation between your computer and GitLab
`git clone https://gitlab.com/ird_transvihmi/Reads_Quality_Control_Pipeline.git`
### Web-Interface ###
Download reads_quality_control project directory
![Image of download option](./Download_screen.png)
### Difference between download and clone ###
To create a copy of a remote repository’s files on your computer, you can either download or clone the repository
If you download it, you cannot sync the repository with the remote repository on GitLab
Cloning a repository is the same as downloading, except it preserves the Git connection with the remote repository
You can then modify the files locally and upload the changes to the remote repository on GitLab
## Usage ##
Make sure your bash script is executable
_Edit config.yaml file on config directory (option)_
_Edit fastq-screen config file on config directory (option)_
Run RQC.sh bash script by double-clicking on it
### manual ###
_to do quick !_
## Support ##
1. RTFM! (Read The Fabulous Manual! ^^.)
2. Read de awsome wiki ;)
3. Create a new issue: Issues > New issue > Describe your issue
4. Send an email to [nicolas.fernandez@ird.fr](url)
5. Call me to `+33.(0)4.67.41.55.°°` ; No, don't please ! O_o
## Roadmap ##
Add new features
## Contributing ##
Open to contributions :)
Testing code, finding issues, asking for update, proposing new features ...
Use Git tools to share !
## Authors and acknowledgment ##
- Nicolas Fernandez (Maintener)
- Christelle Butel for testing all versions of this script
## License ##
GPLv3 : https://www.gnu.org/licenses/gpl-3.0.html
## Project status ##
I'm out of time for this project, development has slowed down, close to stopped completely
You can be volunteer to step in as a maintainer ;)
Or choose to fork this project allowing this project to keep going !
For information :
- **Guests** are _not active contributors_ in private projects, they can only see, and leave comments and issues.
- **Reporters** are _read-only contributors_, they can't write to the repository, but can on issues.
- **Developers** are _direct contributors_, they have access to everything to go from idea to production, unless something has been explicitly restricted.
- **Maintainers** are _super-developers_, they are able to push to master, deploy to production. This role is often held by maintainers and engineering managers.
- **Owners** are essentially _group-admins_, they can give access to groups and have destructive capabilities.
## Project status
If you have run out of energy or time for your project, put a note at the top of the README saying that development has slowed down or stopped completely. Someone may choose to fork your project or volunteer to step in as a maintainer or owner, allowing your project to keep going. You can also make an explicit request for maintainers.
RQCP.sh 0 → 100755
#!/bin/bash
##### About #####
echo ""
echo "##### ABOUT #####"
echo "-----------------"
echo "Name: Reads Quality Control Pipeline"
echo "Author: Nicolas Fernandez"
echo "Affiliation: IRD_U233_TransVIHMI"
echo "Aim: Reads Quality Control and Cleaning"
echo "Date: 2021.04.30"
echo "Run: snakemake --use-conda -s readsquality_control_pipeline.smk --cores"
echo "Latest modification: 2021.10.05"
echo "Todo: na"
echo "________________________________________________________________________"
###### Hardware check #####
echo ""
echo "##### HARDWARE #####"
echo "--------------------"
physicalcpu=$(sysctl -n hw.physicalcpu) # Get physical cpu
echo "Physical CPU: ${physicalcpu}" # Print physical cpu
logicalcpu=$(sysctl -n hw.logicalcpu) # Get logical cpu
echo "Logical CPU: ${logicalcpu}" # Print logical cpu
hwmemsize=$(sysctl -n hw.memsize) # Get memory size
ramsize=$(expr $hwmemsize / $((1024**3))) # 1024**3 = GB
echo "System Memory: ${ramsize} GB" # Print RAM size
echo "________________________________________________________________________"
###### Working directory
echo ""
echo "##### WORKING DIRECTORY #####"
echo "-----------------------------"
WORKDIR=${0%/*}
echo "CWD: ${WORKDIR}"
echo "________________________________________________________________________"
###### Rename samples, comment first line if you want do keep barcode-ID in name
rename -v 's/_S\d+_/_/' ${WORKDIR}/resources/reads/*.fastq.gz 2> /dev/null # Remove barcode-ID like {_S001_}
rename -v 's/_L\d+_/_/' ${WORKDIR}/resources/reads/*.fastq.gz 2> /dev/null # Remove line-ID ID like {_L001_}
rename -v 's/_\d+.fastq.gz/.fastq.gz/' ${WORKDIR}/resources/reads/*.fastq.gz 2> /dev/null # Remove end-name ID like {_001}.fastq.gz
###### Create links for rax samples
mkdir ${WORKDIR}/results/ ${WORKDIR}/results/reads/ ${WORKDIR}/results/reads/raw/ 2> /dev/null
for FASTQ in ${WORKDIR}/resources/reads/*.fastq.gz ; do ln -s $FASTQ ${WORKDIR}/results/reads/raw/$(basename $FASTQ) 2> /dev/null ; done # Create Symbol links
##### Create dir for QC
mkdir ${WORKDIR}/results/reads/cutadapt/ ${WORKDIR}/results/reads/sickle-trim/ ${WORKDIR}/results/reads/fastq-join/ 2> /dev/null
###### Call snakemake pipeline
snakemake --directory ${WORKDIR} --use-conda --keep-going --rerun-incomplete -s ${WORKDIR}/workflow/rules/reads_quality_control_pipeline.smk --cores --unlock # unlock first, if previous error
snakemake --directory ${WORKDIR} --use-conda --keep-going --rerun-incomplete -s ${WORKDIR}/workflow/rules/reads_quality_control_pipeline.smk --cores --dryrun # then dry run, check error like missing file
snakemake --directory ${WORKDIR} --use-conda --keep-going --rerun-incomplete -s ${WORKDIR}/workflow/rules/reads_quality_control_pipeline.smk --cores # at last, real run
###### Create usefull graphs and summary
mkdir ${WORKDIR}/results/Graphics/ 2> /dev/null
snakemake --directory ${WORKDIR} --use-conda -s ${WORKDIR}/workflow/rules/reads_quality_control_pipeline.smk --dag | dot -Tpdf > ${WORKDIR}/results/Graphics/DAGraph.pdf
snakemake --directory ${WORKDIR} --use-conda -s ${WORKDIR}/workflow/rules/reads_quality_control_pipeline.smk --rulegraph | dot -Tpdf > ${WORKDIR}/results/Graphics/RuleGraph.pdf
snakemake --directory ${WORKDIR} --use-conda -s ${WORKDIR}/workflow/rules/reads_quality_control_pipeline.smk --filegraph | dot -Tpdf > ${WORKDIR}/results/Graphics/FileGraph.pdf
snakemake --directory ${WORKDIR} --use-conda -s ${WORKDIR}/workflow/rules/reads_quality_control_pipeline.smk --summary > ${WORKDIR}/results/Graphics/Summary.txt
---
###############################################################################
# Author: Nicolas Fernandez
# Affiliation: IRD_TransVIHMI
# Aim: Config file for the Reads Quality Control Pipeline (RQCP)
# Date: 2021.04.30
# Use: Edit or (de)comment values you want modify
# Latest modification: 2021.10.04
# Todo: ...
###############################################################################
## CLUSTER -------------------------------------------------------------------------------------------
resources:
cpus: 12 # cpus
mem_mb: 64000 # mem in Mb
mem_gb: 64 # mem in Gb
time: 1140 # time in Min
## ENVIRONNEMENTS ------------------------------------------------------------------------------------
conda:
cutadapt: '../envs/cutadapt-3.4.yaml'
sickle-trim: '../envs/sickle-trim-1.33.yaml'
fastq-join: '../envs/fastq-join-1.3.1.yaml'
fastqc: '../envs/fastqc-0.11.9.yaml'
fastq-screen: '../envs/fastq-screen-0.14.0.yaml'
multiqc: '../envs/multiqc-1.10.1.yaml'
## DATASETS ------------------------------------------------------------------------------------------
datasets:
quality-directory:
- 'raw'
- 'cutadapt'
- 'sickle-trim'
#- 'sickle-single'
- 'fastq-join'
#- 'fastq-join-unique'
quality-tool:
- 'fastqc'
- 'fastq-screen'
## CUTADAPT ------------------------------------------------------------------------------------------
cutadapt:
length: '75' # Discard reads shorter than length, after trim
kits: # Sequence of an adapter ligated to the 3' end of the first read
truseq: 'AGATCGGAAGAGC' # Illumina TruSeq / ScriptSeq based kits libraries
nextera: 'CTGTCTCTTATACACATC' # Illumina Nextera / TruSight based kits libraries
small: 'TGGAATTCTCGGGTGCCAAGG' # Illumina Small based kits libraries
#TruSeqR1: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA' # TruSeq-LT and TruSeq-HT based kits R1
#TruSeqR2: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT' # TruSeq-LT and TruSeq-HT based kits R2
#ScriptSeqR1: 'AGATCGGAAGAGCACACGTCTGAAC' # ScriptSeq and TruSeq DNA Methylation based kits R1
#ScriptSeqR2: 'AGATCGGAAGAGCGTCGTGTAGGGA' # ScriptSeq and TruSeq DNA Methylation based kits R2
#TruSeqRibo: 'AGATCGGAAGAGCACACGTCT' # TruSeq Ribo Profile based kits
## SICKLE-TRIM ---------------------------------------------------------------------------------------
sickle-trim:
command: # choose one
#- 'se' # if single-end sequence
- 'pe' # if paired-end sequence
encoding: # choose one
- 'sanger' # if sanger (CASAVA >= 1.8)
#- 'illumina' # if illumina (CASAVA 1.3 to 1.7)
#- 'solexa' # if solexa (CASAVA < 1.3)
quality: '30' # phred score limit
length: '75' # read length limit, after trim
## FASTQ-JOIN ----------------------------------------------------------------------------------------
fastq-join:
percent: '10' # N-percent maximum difference (default: 8)
overlap: '10' # N-minimum overlap (default: 6)
## FASTQSCREEN ---------------------------------------------------------------------------------------
fastq-screen:
config: 'config/fastq-screen.conf' # path to the fastq-screen
subset: '1000' #
aligner:
- 'bwa' #
#- 'bowtie' #
#- 'bowtie2' #
## ---------------------------------------------------------------------------------------------------
# This is Nicolas Fernandez (IRD) version of configuration file for FastQ-Screen
############################
## Bowtie, Bowtie 2 or BWA #
############################
## If the Bowtie, Bowtie 2 or BWA binary is not in your PATH, you can set this value to tell the program where to find your chosen aligner.
## Uncomment the relevant line below and set the appropriate location.
## Please note, this path should INCLUDE the executable filename.
#BOWTIE /usr/local/bowtie
#BOWTIE2 /usr/local/bowtie2-2.3.4.1/bowtie2
BWA .snakemake/conda/db59fc2b/bin/bwa
############################################
## Bismark (for bisulfite sequencing only) #
############################################
## If the Bismark binary is not in your PATH then you can set this value to tell the program where to find it.
## Uncomment the line below and set the appropriate location.
## Please note, this path should INCLUDE the executable filename.
#BISMARK /usr/local/bin/bismark/bismark
############
## Threads #
############
## Genome aligners can be made to run across multiple CPU cores to speed up searches.
## Set this value to the number of cores you want for mapping reads.
THREADS 1
##############
## DATABASES #
##############
## This section enables you to configure multiple genomes databases (aligner index files) to search against in your screen.
## For each genome you need to provide a database name (which can't contain spaces) and the location of the aligner index files.
##
## The path to the index files SHOULD INCLUDE THE BASENAME of the index, e.g: /opt/Genomes/Human_Bowtie/GRCh37/Homo_sapiens.GRCh37
## Thus, the index files (Homo_sapiens.GRCh37.1.bt2, Homo_sapiens.GRCh37.2.bt2, etc.) are found in a folder named 'GRCh37'.
##
## If, for example, the Bowtie, Bowtie2 and BWA indices of a given genome reside in the SAME FOLDER, a SINLGE path may be provided to ALL the of indices.
## The index used will be the one compatible with the chosen aligner (as specified using the --aligner flag).
##
## The entries shown below are only suggested examples, you can add as many DATABASE sections as required, and you can comment out or remove as many of the existing entries as desired.
## We suggest including genomes and sequences that may be sources of contamination either because they where run on your sequencer previously, or may have contaminated your sample during the library preparation step.
##
## Human h38
#DATABASE Human resources/databases/bwa/Human/Homo_sapiens_h38
##
## Mouse m39
#DATABASE Mouse resources/databases/bwa/Mouse/Mus_musculus_m39
##
## Gorilla g4
#DATABASE Gorilla resources/databases/bwa/Gorilla/Gorilla_gorilla_g4
##
## Chimpanzee
#DATABASE Chimpanzee resources/databases/bwa/Chimpanzee/Pan_troglodytes_t3
##
## Bat 10
#DATABASE Bat resources/databases/bwa/Bat/Pteropus_vampyrus_v1
##
## Ecoli - sequence available from EMBL accession U00096.2
DATABASE Ecoli resources/databases/bwa/Ecoli/Echerichia_coli_U00096
##
## HIV - HXB2
DATABASE HIV resources/databases/bwa/HIV/HXB2
##
## Ebola - sequence available from NCBI accession NC_002549
DATABASE Ebola resources/databases/bwa/Ebola/ZEBOV
##
## SARS-CoV-2 - sequence from Whuhan available from NCBI accession NC_045512.2
DATABASE SARS-CoV-2 resources/databases/bwa/SARS-CoV-2/Wuhan-Hu-1
##
## PhiX - sequence available from Refseq accession NC_001422.1
DATABASE PhiX resources/databases/bwa/Phix/PhiX
##
## Adapters - sequence derived from the FastQC contaminats file found at: www.bioinformatics.babraham.ac.uk/projects/fastqc
DATABASE Adapters resources/databases/bwa/Adapters/Adapters
##
## Vector - Sequence taken from the UniVec database: http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html, BUT, without phi-X174
DATABASE Vectors resources/databases/bwa/Vectors/UniVec_wo_phi-X174
##
## Arabidopsis thaliana - sequence from NCBI: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.fna.gz
DATABASE Arabidopsis resources/databases/bwa/Arabidopsis/Arabidopsis_thaliana_t10
visuals/download_button.png

10.8 KiB

name: cutadapt-3.4
channels:
- bioconda
- conda-forge
- r
- anaconda
- defaults
dependencies:
- ca-certificates=2021.4.13=hecd8cb5_1
- certifi=2020.12.5=py39h6e9494a_1
- cutadapt=3.4=py39ha492530_1
- dnaio=0.5.1=py39ha492530_0
- isa-l=2.30.0=h0d85af4_4
- libcxx=11.1.0=habf9029_0
- libffi=3.3=h046ec9c_2
- ncurses=6.2=h2e338ed_4
- openssl=1.1.1k=h0d85af4_0
- pigz=2.6=h5dbffcc_0
- pip=21.1=pyhd8ed1ab_0
- python=3.9.4=h88f2d9e_0
- python-isal=0.10.0=py39h89e85a6_0
- python_abi=3.9=1_cp39
- readline=8.1=h05e3726_0
- setuptools=52.0.0=py39hecd8cb5_0
- sqlite=3.35.5=h44b9ce1_0
- tk=8.6.10=h0419947_1
- tzdata=2021a=he74cb21_0
- wheel=0.36.2=pyhd3deb0d_0
- xopen=1.1.0=py39h6e9494a_2
- xz=5.2.5=haf1e3a3_1
- zlib=1.2.11=h7795811_1010
prefix: /Users/2021nf001/miniconda3/envs/cutadapt-3.4
name: fastq-join-1.3.1
channels:
- bioconda
- conda-forge
- r
- anaconda
- defaults
dependencies:
- fastq-join=1.3.1=hb280591_4
- libcxx=11.1.0=habf9029_0
prefix: /Users/2021nf001/miniconda3/envs/fastq-join-1.3.1
name: fastq-screen
channels:
- bioconda
- conda-forge
- r
- anaconda
- defaults
dependencies:
- bowtie=1.0.0=1
- bowtie2=2.4.2=py39h7f6fa2c_2
- bwa=0.7.17=h188c3c3_8
- ca-certificates=2021.4.13=hecd8cb5_1
- certifi=2020.12.5=py39h6e9494a_1
- expat=2.3.0=he49afe7_0
- fastq-screen=0.14.0=pl5262hdfd78af_1
- fontconfig=2.13.1=h10f422b_1005
- freetype=2.10.4=h4cff582_1
- giflib=5.2.1=hbcb3906_2
- icu=67.1=hb1e8313_0
- jpeg=9d=hbcb3906_0
- libcxx=11.1.0=habf9029_0
- libffi=3.3=h046ec9c_2
- libgd=2.2.5=h190906e_1008
- libiconv=1.16=haf1e3a3_0
- libpng=1.6.37=h7cec526_2
- libtiff=4.2.0=h7c11950_1
- libwebp=1.2.0=h1648767_0
- libwebp-base=1.2.0=h0d85af4_2
- libxml2=2.9.10=h2c6e4a5_2
- lz4-c=1.9.3=h046ec9c_0
- ncurses=6.2=h2e338ed_4
- openssl=1.1.1k=h0d85af4_0
- perl=5.26.2=hbcb3906_1008
- perl-gd=2.71=pl526h5c9b4e4_0
- perl-gdgraph=1.54=pl526_0
- perl-gdtextutil=0.86=pl526h1de35cc_5
- pip=21.1=pyhd8ed1ab_0
- python=3.9.4=h88f2d9e_0
- python_abi=3.9=1_cp39
- readline=8.1=h05e3726_0
- setuptools=52.0.0=py39hecd8cb5_0
- sqlite=3.35.5=h44b9ce1_0
- tbb=2020.3=h879752b_0
- tk=8.6.10=h0419947_1
- tzdata=2021a=he74cb21_0
- wheel=0.36.2=pyhd3deb0d_0
- xz=5.2.5=haf1e3a3_1
- zlib=1.2.11=h7795811_1010
- zstd=1.4.9=h582d3a0_0
prefix: /Users/2021nf001/miniconda3/envs/fastq-screen
name: fastqc-0.11.9
channels:
- bioconda
- conda-forge
- r
- anaconda
- defaults
dependencies:
- fastqc=0.11.9=hdfd78af_1
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
- fontconfig=2.13.1=h10f422b_1005
- freetype=2.10.4=h4cff582_1
- icu=68.1=h74dc148_0
- libcxx=11.1.0=habf9029_0
- libiconv=1.16=haf1e3a3_0
- libpng=1.6.37=h7cec526_2
- libxml2=2.9.10=h93ec3fd_4
- openjdk=11.0.9.1=hcf210ce_1
- perl=5.32.0=hbcb3906_0
- xz=5.2.5=haf1e3a3_1
- zlib=1.2.11=h7795811_1010
prefix: /Users/2021nf001/miniconda3/envs/fastqc
name: multiqc-1.10.1
channels:
- bioconda
- conda-forge
- r
- anaconda
- defaults
dependencies:
- brotlipy=0.7.0=py39hcbf5805_1001
- ca-certificates=2021.4.13=hecd8cb5_1
- certifi=2020.12.5=py39h6e9494a_1
- cffi=1.14.5=py39h319c39b_0
- chardet=4.0.0=py39h6e9494a_1
- click=7.1.2=pyh9f0ad1d_0
- colorama=0.4.4=pyh9f0ad1d_0
- coloredlogs=15.0=py39h6e9494a_0
- colormath=3.0.0=py_2
- commonmark=0.9.1=py_0
- cryptography=3.4.7=py39ha2c9959_0
- cycler=0.10.0=py_2
- decorator=5.0.7=pyhd8ed1ab_0
- freetype=2.10.4=h4cff582_1
- future=0.18.2=py39h6e9494a_3
- humanfriendly=9.1=py39h6e9494a_0
- idna=2.10=pyh9f0ad1d_0
- importlib-metadata=4.0.1=py39h6e9494a_0
- jinja2=2.11.3=pyh44b312d_0
- jpeg=9d=hbcb3906_0
- kiwisolver=1.3.1=py39hedf5dff_1
- lcms2=2.12=h577c468_0
- libblas=3.9.0=8_openblas
- libcblas=3.9.0=8_openblas
- libcxx=11.1.0=habf9029_0
- libffi=3.3=h046ec9c_2
- libgfortran=5.0.0=9_3_0_h6c81a4c_22
- libgfortran5=9.3.0=h6c81a4c_22
- liblapack=3.9.0=8_openblas
- libopenblas=0.3.12=openmp_h54245bb_1
- libpng=1.6.37=h7cec526_2
- libtiff=4.2.0=h7c11950_1
- libwebp-base=1.2.0=h0d85af4_2
- llvm-openmp=11.1.0=hda6cdc1_1
- lz4-c=1.9.3=h046ec9c_0
- lzstring=1.0.4=py_1001
- markdown=3.3.4=pyhd8ed1ab_0
- markupsafe=1.1.1=py39hcbf5805_3
- matplotlib-base=3.4.1=py39hb07454d_0
- multiqc=1.10.1=py_0
- ncurses=6.2=h2e338ed_4
- networkx=2.5=py_0
- numpy=1.20.2=py39h7eed0ac_0
- olefile=0.46=pyh9f0ad1d_1
- openssl=1.1.1k=h0d85af4_0
- pillow=8.2.0=py39h5270095_0
- pip=21.1=pyhd8ed1ab_0
- pycparser=2.20=pyh9f0ad1d_2
- pygments=2.8.1=pyhd8ed1ab_0
- pyopenssl=20.0.1=pyhd8ed1ab_0
- pyparsing=2.4.7=pyh9f0ad1d_0
- pysocks=1.7.1=py39h6e9494a_3
- python=3.9.4=h88f2d9e_0
- python-dateutil=2.8.1=py_0
- python_abi=3.9=1_cp39
- pyyaml=5.4.1=py39hcbf5805_0
- readline=8.1=h05e3726_0
- requests=2.25.1=pyhd3deb0d_0
- rich=10.1.0=py39h6e9494a_0
- setuptools=52.0.0=py39hecd8cb5_0
- simplejson=3.17.2=py39hcbf5805_2
- six=1.15.0=pyh9f0ad1d_0
- spectra=0.0.11=py_1
- sqlite=3.35.5=h44b9ce1_0
- tk=8.6.10=h0419947_1
- tornado=6.1=py39hcbf5805_1
- typing_extensions=3.7.4.3=py_0
- tzdata=2021a=he74cb21_0
- urllib3=1.26.4=pyhd8ed1ab_0
- wheel=0.36.2=pyhd3deb0d_0
- xz=5.2.5=haf1e3a3_1
- yaml=0.2.5=haf1e3a3_0
- zipp=3.4.1=pyhd8ed1ab_0
- zlib=1.2.11=h7795811_1010
- zstd=1.4.9=h582d3a0_0
prefix: /Users/2021nf001/miniconda3/envs/multiqc
name: sickle-trim-1.33
channels:
- bioconda
- conda-forge
- r
- anaconda
- defaults
dependencies:
- sickle-trim=1.33=h188c3c3_6
- zlib=1.2.11=h7795811_1010
prefix: /Users/2021nf001/miniconda3/envs/sickle-1.33
###############################################################################
# Name: Reads Quality Control Pipeline
# Author: Nicolas Fernandez
# Affiliation: IRD_U233_TransVIHMI
# Aim: Reads Quality Control
# Date: 2021.04.30
# Run: snakemake --use-conda -s reads_quality_control_pipeline.smk --cores
# Latest modification: 2021.10.04
# Todo: ND
###############################################################################
# CONFIGURATION #
configfile: "config/config.yaml"
# FUNCTIONS #
# WILDCARDS #
SAMPLE, = glob_wildcards("results/reads/raw/{sample}_R1.fastq.gz")
QCDIR = config["datasets"]["quality-directory"]
# ENVIRONMENT #
CUTADAPT = config["conda"]["cutadapt"] # Cutadapt
SICKLETRIM = config["conda"]["sickle-trim"] # Sickle-trim
FASTQJOIN = config["conda"]["fastq-join"] # Fastq-join
FASTQC = config["conda"]["fastqc"] # FastQC
FASTQSCREEN = config["conda"]["fastq-screen"] # FastQ Screen
MULTIQC = config["conda"]["multiqc"] # MultiQC
# RESOURCES #
CPUS = config["resources"]["cpus"] # Resources thread
MEM_GB = config["resources"]["mem_gb"] # Resources mem in Gb
# PARAMETERS #
LENGTHC = config["cutadapt"]["length"] # Cutadapt --minimum-length
TRUSEQ = config["cutadapt"]["kits"]["truseq"] # Cutadapt --adapter Illumina TruSeq
NEXTERA = config["cutadapt"]["kits"]["nextera"] # Cutadapt --adapter Illumina Nextera
SMALL = config["cutadapt"]["kits"]["small"] # Cutadapt --adapter Illumina Small
COMMAND = config["sickle-trim"]["command"] # Sickle-trim command
ENCODING = config["sickle-trim"]["encoding"] # Sickle-trim --qual-type
QUALITY = config["sickle-trim"]["quality"] # Sickle-trim --qual-threshold
LENGTHS = config["sickle-trim"]["length"] # Sickle-trim --length-treshold
PERCENT = config["fastq-join"]["percent"] # Fastq-join -p (percent maximum difference)
OVERLAP = config ["fastq-join"]["overlap"] # Fastq-join -m (minimum overlap)
CONFIG = config["fastq-screen"]["config"] # Fastq-screen --conf
ALIGNER = config["fastq-screen"]["aligner"] # Fastq-screen --aligner
SUBSET = config["fastq-screen"]["subset"] # Fastq-screen --subset
###############################################################################
rule all:
input:
joined = expand("results/reads/fastq-join/{sample}_cutadapt_sickle-trim_fastq-join_Mate.fastq.gz",
sample = SAMPLE),
multiqc = expand("results/quality/{qcdir}/multiqc/",
qcdir = QCDIR)
###############################################################################
rule multiqc:
# Aim: aggregates bioinformatics analyses results into a single report
# Use: multiqc [OPTIONS] [--output dir] <analysis directory>
message:
"MultiQC aggregate quality results from {wildcards.qcdir} reads"
conda:
MULTIQC
log:
"results/reports/multiqc/{qcdir}.log"
input:
fastqc = "results/quality/{qcdir}/fastqc/",
fastqscreen = "results/quality/{qcdir}/fastqscreen/"
output:
multiqc = directory("results/quality/{qcdir}/multiqc/")
log:
"results/reports/multiqc/{qcdir}.log"
shell:
"multiqc " # Multiqc, searches in given directories for analysis & compiles a HTML report
"--quiet " # -q: Only show log warning
"--outdir {output.multiqc} " # -o: Create report in the specified output directory
"{input.fastqc} " # Input FastQC files
"{input.fastqscreen} " # Input Fastq-Screen files
#"--no-ansi " # Disable coloured log
"&> {log}" # Add redirection for log
###############################################################################
rule fastqscreen:
# Aim: screen if the composition of the library matches with what you expect
# Use fastq_screen [OPTIONS] <fastq.1>... <fastq.n>
message:
"Fastq-Screen on samples {wildcards.qcdir} reads"
conda:
FASTQSCREEN
resources:
cpus = CPUS
params:
config = CONFIG,
aligner = ALIGNER,
subset = SUBSET
input:
fastq = "results/reads/{qcdir}/"
output:
fastqscreen = directory("results/quality/{qcdir}/fastqscreen/")
log:
"results/reports/fastq-screen/{qcdir}.log"
shell:
"fastq_screen " # FastqScreen, what did you expect ?
"--quiet " # -q: Only show log warning
"--threads {resources.cpus} " # -t: Specifies across how many threads bowtie will be allowed to run
"--conf {params.config} " # path to configuration file
"--aligner {params.aligner} " # -a: choose aligner 'bowtie', 'bowtie2', 'bwa'
"--subset {params.subset} " # Don't use the whole sequence file, but create a subset of specified size
"--outdir {output.fastqscreen} " # Output directory
"{input.fastq}/*.fastq.gz " # Input file.fastq
"&> {log}" # Add redirection for log
###############################################################################
rule fastqc:
# Aim: reads sequence files and produces a quality control report
# Use: fastqc [OPTIONS] [--output dir] <fastq.1> ... <fastq.n>
message:
"FastQC on samples {wildcards.qcdir} reads"
conda:
FASTQC
resources:
cpus = CPUS
input:
fastq = "results/reads/{qcdir}/"
output:
fastqc = directory("results/quality/{qcdir}/fastqc/")
log:
"results/reports/fastqc/{qcdir}.log"
shell:
"fastqc " # FastQC, a high throughput sequence QC analysis tool
"--quiet " # -q: Supress all progress messages on stdout and only report errors
"--threads {resources.cpus} " # -t: Specifies files number which can be processed simultaneously
"--outdir {output.fastqc} " # -o: Create all output files in the specified output directory
"{input.fastq}/*.fastq.gz " # Input file.fastq
"&> {log}" # Add redirection for log
###############################################################################
rule fastqjoin:
# Aim: joins two paired-end reads on the overlapping ends
# Use: fastq-join [OPTIONS] <read1.fastq> <read2.fastq> [mate.fastq] -o <read.fastq> -o <read.fastq> -o <read.fastq>
priority: 1
message:
"Fastq-join assemble R1 and R2 from {wildcards.sample} reads"
conda:
FASTQJOIN
params:
percent = PERCENT,
overlap = OVERLAP
input:
forward = "results/reads/sickle-trim/{sample}_cutadapt_sickle-trim_R1.fastq.gz",
reverse = "results/reads/sickle-trim/{sample}_cutadapt_sickle-trim_R2.fastq.gz"
output:
forward= "results/reads/fastq-join-unique/{sample}_cutadapt_sickle-trim_fastq-join_Unique_R1.fastq.gz",
reverse = "results/reads/fastq-join-unique/{sample}_cutadapt_sickle-trim_fastq-join_Unique_R2.fastq.gz",
joined = "results/reads/fastq-join/{sample}_cutadapt_sickle-trim_fastq-join_Mate.fastq.gz"
log:
"results/reports/fastq-join/{sample}.log"
shell:
"fastq-join " # Fastq-Join, joins two paired-end reads on the overlapping ends
"{input.forward} " # Input forward
"{input.reverse} " # Input reverse
"-p {params.percent} " # Percent maximum difference (default: 8)
"-m {params.overlap} " # Minimum overlap (default: 6)
"-o {output.forward} " # for unique forward files
"-o {output.reverse} " # for unique reverse files
"-o {output.joined} " # for join files
#"-r {log} " # Verbose stitch length report
"&> {log}" # Add redirection for log
###############################################################################
rule sickletrim:
# Aim: windowed adaptive trimming tool for FASTQ files using quality
# Use: sickle <command> [OPTIONS]
message:
"Sickle-trim remove low quality sequences from {wildcards.sample} reads"
conda:
SICKLETRIM
params:
command = COMMAND,
encoding = ENCODING,
quality = QUALITY,
length = LENGTHS
input:
forward = "results/reads/cutadapt/{sample}_cutadapt_R1.fastq.gz",
reverse = "results/reads/cutadapt/{sample}_cutadapt_R2.fastq.gz"
output:
forward = "results/reads/sickle-trim/{sample}_cutadapt_sickle-trim_R1.fastq.gz",
reverse = "results/reads/sickle-trim/{sample}_cutadapt_sickle-trim_R2.fastq.gz",
single = "results/reads/sickle-single/{sample}_cutadapt_sickle-trim_Single.fastq.gz"
log:
"results/reports/sickle-trim/{sample}.log"
shell:
"sickle " # Sickle, a windowed adaptive trimming tool for FASTQ files using quality
"{params.command} " # Paired-end or single-end sequence trimming
"-t {params.encoding} " # --qual-type: Type of quality values, solexa ; illumina ; sanger ; CASAVA, < 1.3 ; 1.3 to 1.7 ; >= 1.8
"-q {params.quality} " # --qual-threshold: Threshold for trimming based on average quality in a window (default: 20)
"-l {params.length} " # --length-treshold: Threshold to keep a read based on length after trimming (default: 20)
"-f {input.forward} " # --pe-file1: Input paired-end forward fastq file
"-r {input.reverse} " # --pe-file2: Input paired-end reverse fastq file
"-g " # --gzip-output: Output gzipped files
"-o {output.forward} " # --output-pe1: Output trimmed forward fastq file
"-p {output.reverse} " # --output-pe2: Output trimmed reverse fastq file (must use -s option)
"-s {output.single} " # --output-single: Output trimmed singles fastq file
"&> {log}" # Add redirection for log
###############################################################################
rule cutadapt:
# Aim: finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads
# Use: cutadapt -a ADAPTER [OPTIONS] [-o output.forward] [-p output.reverse] <input.forward> <input.reverse>
# Rmq: multiple adapter sequences can be given using further -a options, but only the best-matching adapter will be removed
message:
"Cutadapt remove adapters from {wildcards.sample} reads"
conda:
CUTADAPT
resources:
cpus = CPUS
params:
length = LENGTHC,
truseq = TRUSEQ,
nextera = NEXTERA,
small = SMALL
input:
forward = "results/reads/raw/{sample}_R1.fastq.gz",
reverse = "results/reads/raw/{sample}_R2.fastq.gz"
output:
forward = "results/reads/cutadapt/{sample}_cutadapt_R1.fastq.gz",
reverse = "results/reads/cutadapt/{sample}_cutadapt_R2.fastq.gz"
log:
"results/reports/cutadapt/{sample}.log"
shell:
"cutadapt " # Cutadapt, finds and removes unwanted sequence from your HT-seq reads
"--cores {resources.cpus} " # -j: Number of CPU cores to use. Use 0 to auto-detect (default: 1)
"--trim-n " # --trim-n: Trim N's on ends of reads
"--minimum-length {params.length} " # -m: Discard reads shorter than length
"--adapter {params.truseq} " # -a: Sequence of an adapter ligated to the 3' end of the first read
"-A {params.truseq} " # -A: 3' adapter to be removed from second read in a pair
"--adapter {params.nextera} " # -a: Sequence of an adapter ligated to the 3' end of the first read
"-A {params.nextera} " # -A: 3' adapter to be removed from second read in a pair
"--adapter {params.small} " # -a: Sequence of an adapter ligated to the 3' end of the first read
"-A {params.small} " # -A: 3' adapter to be removed from second read in a pair
"--output {output.forward} " # -o: Write trimmed reads to FILE
"--paired-output {output.reverse} " # -p: Write second read in a pair to FILE
"{input.forward} " # Input forward reads R1.fastq
"{input.reverse} " # Input reverse reads R2.fastq
"&> {log} " # Add redirection for log
###############################################################################
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment