From c77418c3dcca6039d436b8a6ce5d93d126419987 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Nicolas=20FERNANDEZ=20NU=C3=91EZ?= <fernandez.nunez.nicolas@gmail.com>
Date: Thu, 27 Jan 2022 11:24:21 +0100
Subject: [PATCH] Working on README.md and bwa ref

---
 .gitignore               |  2 +-
 README.md                | 88 +++++++++++++++++++++++----------------
 RQC.sh                   | 14 +++----
 config/config.yaml       |  7 ++--
 config/fastq-screen.conf | 19 ++++++++-
 5 files changed, 79 insertions(+), 51 deletions(-)

diff --git a/.gitignore b/.gitignore
index af9a39f..d6a461b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,4 +11,4 @@
 *.txt
 /results*
-#/resources/databases/bwa
+
diff --git a/README.md b/README.md
index 85a33e4..8088858 100644
--- a/README.md
+++ b/README.md
@@ -57,7 +57,7 @@ mamba install -c conda-forge -c bioconda snakemake=6.12.1 --yes
 ### RQC ###
-**Download** _OR_ clone the **RQC pipeline** project
+**Download** _OR_ **Clone** the **RQC pipeline** project
 #### Difference between **Download** and **Clone** ####
 To create a copy of a remote repository’s files on your computer, you can either download or clone the repository
@@ -75,6 +75,7 @@ wget https://gitlab.com/ird_transvihmi/Reads_Quality_Control_Pipeline/-/archive/
 tar -xzvf Reads_Quality_Control_Pipeline-main.tar.gz
 rm -f Reads_Quality_Control_Pipeline-main.tar.gz
 mv Reads_Quality_Control_Pipeline-main/ ~/Desktop/RQC_Pipeline/
+cd ~/Desktop/RQC_Pipeline/
 ```
 #### Clone ####
@@ -83,14 +84,20 @@ mv Reads_Quality_Control_Pipeline-main/ ~/Desktop/RQC_Pipeline/
 Authenticate with GitLab by following the instruction in the [2FA documentation](https://docs.gitlab.com/ee/user/profile/account/two_factor_authentication.html)
 ```shell
 git clone https://gitlab.com/ird_transvihmi/Reads_Quality_Control_Pipeline.git
+```
+```shell
 mv Reads_Quality_Control_Pipeline/ ~/Desktop/RQC_Pipeline/
+cd ~/Desktop/RQC_Pipeline/
 ```
 - Clone with **SSH** (_when you want to authenticate only one time_)
 Authenticate with GitLab by following the instructions in the [SSH documentation](https://docs.gitlab.com/ee/ssh/index.html)
 ```shell
 git clone git@gitlab.com:ird_transvihmi/Reads_Quality_Control_Pipeline.git
+```
+```shell
 mv Reads_Quality_Control_Pipeline/ ~/Desktop/RQC_Pipeline/
+cd ~/Desktop/RQC_Pipeline/
 ```
@@ -110,32 +117,36 @@ sudo chmod +x path/to/Reads_Quality_Control_Pipeline/RQCP.sh
 Yours results are available in results directory as follow:
-... TODO ...
+1. fastq-screen: Your search libraries might contain the genomes of all of the organisms you work on, along with PhiX, Vectors or other contaminants commonly seen in sequencing experiments.
+2. fastqc: modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
+3. multiqc: compiled HTML report.
+4. reports: log from tools
+5. summary: Snakemake rules graph and files summary.
+
+
+### Configuration ###
-### Configuration ###
+#### _./config/config.yaml_ ####
-#### Resources ####
+##### Resources #####
-Edit to match your hardware configuration
+Edit to match your hardware configuration
+_(given when you run RQC.sh)_
-#### Environments ####
+##### Environments #####
 Edit if you change some environments (i.e.new version) in ./workflow/envs/tools-version.yaml files
+_(you should not change this)_
-#### Fastq-Screen #####
+##### Fastq-Screen ######
 - **config**: Path to the fastq-screen configuration file (default config: ./config/fastq-screen.conf)
 - **subset**: Don't use the whole sequence file, but create a temporary dataset of this specified number of read (default config: '10000', set '0' for all dataset)
 - **aligner**: Specify the aligner to use for the mapping. Valid arguments are 'bowtie', bowtie2' or 'bwa' (default config: 'bwa')
-##### fastq-screen.conf #####
+#### _./config/fastq-screen.conf_ ####
-- **path**: Set this value to tell the program where to find your chosen aligner (default :/usr/local/\<tool\>
-- **bismark**: Same for bismark (for bisulfite sequencing only)
-- **threads**: Set this value to the number of cores you want for mapping reads (default: 1, but overwrited by Snakemake and config.yaml file)
-- **databases**: This section enables you to configure multiple genomes databases (aligner index files) to search against in your screen
-
-##### databases #####
+###### databases ######
 For each genome you need to provide a database name (which **can't** contain spaces) and the location of the aligner index files
@@ -154,37 +165,39 @@ It's suggested including genomes and sequences that:
 - may be sources of contamination either because they where run on your sequencer previously
 - may have contaminated your sample during the library preparation step
-For IRD_U233_TransVIHMI, cretaed this indexes:
-
-- **Human**: main sources of lab. contaminations _(exepted if Boston Dynamics Atlas robot did the job)_ **¡not included!**
-- **Mouse**: main model in biology experimentation, very frequent in NGS facility core **¡not included!**
-- **Arabidopsis**: frequent plant model in NGS facility core associated with plants researches (IRD, CIRAD, INRAE, ...) **¡not included!**
-- **Ecoli**: frequent bacteria model, also an indicator of human contaminations, also in feces and stool samples
-- **PhiX**: usefull control in Illumina sequencing run technology
-- **Adapters**: use for libraries generation
-- **Vector**: use in general molecular biology
-- **Gorilla**: species studied in TransVIHMI **¡not included!**
-- **Chimpanzee**: species studied in TransVIHMI **¡not included!**
-- **Bat**: species studied in TransVIHMI **¡not included!**
-- **HIV**: species studied in TransVIHMI
-- **Ebola**: species studied in TransVIHMI
-- **SARS-CoV-2**: species studied in TransVIHMI
-
-**Not included indexes:**
-Indexes for large genomes can be heavy (~ 3Gb) and git limit each project to 10Gb. Download all this databases can be also to long.
-Commonly it's share on git only code, but not larger resources _(data input, databases, references, ...).
-This data can always be download somewhere (online servers).
-Databases for below genomes where generated and available at IRD_U233_TransVIHMI lab.
-You can freely ask for sharing (with USB supports or FileSender) to add it to your analyses.
-You can also ask for new databases, for genomes references not yet included, to check putative presence / absence on your dataset.
+For IRD_U233_TransVIHMI, we can provid:
+
+- Human: main sources of laboratory contaminations
+- Mouse: main model in biology experimentation, very frequent in NGS facility core
+- Arabidopsis: frequent plant model in NGS facility core associated with plants researches _(IRD, CIRAD, INRAE, ...)_
+- Ecoli: frequent bacteria model, also an indicator of human contaminations, also in feces and stool samples
+- PhiX: usefull control in Illumina sequencing run technology
+- Adapters: use for libraries generation
+- Vector: use in general molecular biology
+- Gorilla: species studied in TransVIHMI
+- Chimpanzee: species studied in TransVIHMI
+- Bat: species studied in TransVIHMI
+- HIV: species studied in TransVIHMI
+- Ebola: species studied in TransVIHMI
+- SARS-CoV-2: species studied in TransVIHMI
+- Coronavirus: species studied in Trans
+
+Indexes for larger genomes can be heavy (~ 3Gb) and gitlab limit each project to 10Gb.
+Download all this databases can be also very long. So we commonly share on gitlab code but resources.
+This data can be download separatly, from dedicated servers.
+Or you can freely ask for share (with physical support or FileSender), to add it to your analyses.
+You can also ask for new indexes, for your favorite genomes not yet included.
+
 ## Support ##
+
 1. RTFM! (Read The Fabulous Manual! ^^.)
 2. Read de awsome wiki ;)
 3. Create a new issue: Issues > New issue > Describe your issue
 4. Send an email to [nicolas.fernandez@ird.fr](url)
 5. Call me to `+33.(0)4.67.41.55.xx` (No don't please _O\_o_!)
+
 ## Roadmap ##
 Add a wiki !
@@ -211,6 +224,7 @@ Use Git tools to share!
 ## Project status ##
+
 This project is regularly update and actively maintened
 However, you can be volunteer to step in as a maintainer
diff --git a/RQC.sh b/RQC.sh
index cee215a..89dd8d6 100755
--- a/RQC.sh
+++ b/RQC.sh
@@ -4,13 +4,13 @@ echo ""
 echo "##### ABOUT #####"
 echo "-----------------"
-echo "Name: Reads Quality Control pipeline"
+echo "Name: RQC.sh"
 echo "Author: Nicolas Fernandez"
 echo "Affiliation: IRD_U233_TransVIHMI"
 echo "Aim: Bash script for RQC pipeline"
 echo "Date: 2021.04.30"
 echo "Run: snakemake --snakemake rqc.smk --cores --use-conda"
-echo "Latest modification: 2022.01.25"
+echo "Latest modification: 2022.01.27"
 echo "Todo: done"
 echo "________________________________________________________________________"
@@ -56,12 +56,12 @@ echo "##### RENAME FASTQ FILES #####"
 echo "------------------------------"
 # With rename command from macOSX
-rename --verbose 's/_S\d+_/_/' ${workdir}/resources/reads/*.fastq.gz 2> /dev/null # Remove barcode-ID like {_S001_}
+#rename --verbose 's/_S\d+_/_/' ${workdir}/resources/reads/*.fastq.gz 2> /dev/null # Remove barcode-ID like {_S001_}
 rename --verbose 's/_L\d+_/_/' ${workdir}/resources/reads/*.fastq.gz 2> /dev/null # Remove line-ID ID like {_L001_}
 rename --verbose 's/_001.fastq.gz/.fastq.gz/' ${workdir}/resources/reads/*.fastq.gz 2> /dev/null # Remove end-name ID like {_001}.fastq.gz
 # With rename command as part of the util-linux package
-rename --verbose _S\d+_ _ ${workdir}/resources/reads/*.fastq.gz 2> /dev/null # Remove barcode-ID like {_S001_}
+#rename --verbose _S\d+_ _ ${workdir}/resources/reads/*.fastq.gz 2> /dev/null # Remove barcode-ID like {_S001_}
 rename --verbose _L\d+_ _ ${workdir}/resources/reads/*.fastq.gz 2> /dev/null # Remove line-ID ID like {_L001_}
 rename --verbose _001.fastq.gz .fastq.gz ${workdir}/resources/reads/*.fastq.gz 2> /dev/null # Remove end-name ID like {_001}.fastq.gz
@@ -163,7 +163,7 @@ echo ""
 echo "##### SNAKEMAKE PIPELINE GRAPHS ######"
 echo "--------------------------------------"
-mkdir ${workdir}/results/Summary/ 2> /dev/null
+mkdir ${workdir}/results/summary/ 2> /dev/null
 graphList="dag rulegraph filegraph"
 extentionList="pdf png"
@@ -175,14 +175,14 @@ for graph in ${graphList} ; do
 --snakefile ${workdir}/workflow/rules/rqc.smk \
 --${graph} | \
 dot -T${extention} > \
- ${workdir}/results/Summary/${graph}.${extention} ;
+ ${workdir}/results/summary/${graph}.${extention} ;
 done ;
 done
 snakemake \
 --directory ${workdir} \
 --snakefile ${workdir}/workflow/rules/rqc.smk \
- --summary > ${workdir}/results/Summary/files_summary.txt
+ --summary > ${workdir}/results/summary/files_summary.txt
 echo "________________________________________________________________________"
diff --git a/config/config.yaml b/config/config.yaml
index 7b7ec15..f4eff3d 100644
--- a/config/config.yaml
+++ b/config/config.yaml
@@ -5,15 +5,14 @@
 # Aim: Config file for RQC pipeline
 # Date: 2021.04.30
 # Use: Edit or (de)comment settings
-# Latest modification: 2022.01.25
+# Latest modification: 2022.01.27
 # Todo: done
-
 ###############################################################################
 ## RESOURCES -----------------------------------------------------------------------------------------
 resources:
- cpus: 12 # cpus
- mem_gb: 16 # mem in Gb
+ cpus: 4 # cpus
+ mem_gb: 4 # mem in Gb
 ## ENVIRONNEMENTS ------------------------------------------------------------------------------------
 conda:
diff --git a/config/fastq-screen.conf b/config/fastq-screen.conf
index 5c4d281..22a8f72 100644
--- a/config/fastq-screen.conf
+++ b/config/fastq-screen.conf
@@ -5,7 +5,7 @@
 # Aim: Config file for Fastq-Screen, with bwa, for RQC pipeline
 # Date: 2021.04.30
 # Use: Edit, (de)comment settingss you want modify
-# Latest modification: 2021.01.26
+# Latest modification: 2021.01.27
 # Todo: na
 ###############################################################################
@@ -18,6 +18,7 @@ DATABASE Adapters resources/indexes/bwa/Adapters
 DATABASE Vectors resources/indexes/bwa/UniVec_wo_phi-X174
 ## LAB ORGANISMS -------------------------------------------------------------------------------------
+
 # Smallest -----
 # HIV - HXB2
 DATABASE HIV resources/indexes/bwa/HIV_HXB2
@@ -26,6 +27,20 @@ DATABASE Ebola resources/indexes/bwa/Ebola_ZEBOV
 # SARS-CoV-2 - sequence from Whuhan available from NCBI (accession NC_045512.2)
 DATABASE SARS-CoV-2 resources/indexes/bwa/SARS-CoV-2_Wuhan-WIV04_2019
+# Coronas -----
+# aCoV_DuvinaCoV - sequence 229E-CoV_NC_002645.1
+#DATABASE aCoV_DuvinaCoV resources/indexes/bwa/aCoV_Duvinacov
+# bCoV_EmbeCoV - sequence HKU1_HM034837.1
+#DATABASE bCoV_EmbeCoV resources/indexes/bwa/bCoV_Embecov
+# bCoV_HibeCoV - sequence bat_NC_025217.1
+#DATABASE bCoV_HibeCoV resources/indexes/bwa/bCoV_Hibecov
+# bCoV_MerbeCoV - sequence bat_MF593268.1
+#DATABASE bCoV_MerbCoV resources/indexes/bwa/bCoV_Merbecov
+# bCoV_NobeCoV - sequence bat_NC_048212.1
+#DATABASE bCoV_NobeCoV resources/indexes/bwa/bCoV_Nobecov
+# bCoV_SarbeCoV - sequence RaTG13_MN996532.2
+#DATABASE bCoV_SarbeCoV resources/indexes/bwa/bCoV_Sarbecov
+
 # Larger -----
 # Gorilla g4
 #DATABASE Gorilla resources/indexes/bwa/Gorilla_gorilla_g4
@@ -42,4 +57,4 @@ DATABASE SARS-CoV-2 resources/indexes/bwa/SARS-CoV-2_Wuhan-WIV04_2019
 # Arabidopsis t10 - sequence from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.fna.gz)
 #DATABASE Arabidopsis resources/indexes/bwa/Arabidopsis_thaliana_t10
 # Ecoli - sequence available from EMBL (accession U00096.2)
-DATABASE Ecoli resources/indexes/bwa/Echerichia_coli_U00096
+DATABASE Ecoli resources/indexes/bwa/Echerichia_coli
--
GitLab
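
Note on the `RENAME FASTQ FILES` step touched in `RQC.sh`: the two groups of `rename` calls target different implementations (the Perl-style `rename` takes a substitution expression, while the util-linux `rename` replaces literal strings), so the result depends on which one is installed. As a minimal sketch only, and not part of this patch, the same clean-up could be done without `rename`, using `sed -E` and `mv`. The function names `clean_fastq_name` and `rename_fastq_dir` are hypothetical, and the barcode ID (`_S1_`) is deliberately left in place on the assumption that this is why the patch comments out the `_S\d+_` lines.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the cleaned name for one FASTQ file name.
# Removes the lane ID (e.g. _L001_) and the trailing _001 before .fastq.gz.
# The barcode ID (e.g. _S1_) is kept on purpose (assumption, see note above).
clean_fastq_name() {
    printf '%s\n' "$1" | sed -E 's/_L[0-9]+_/_/; s/_001\.fastq\.gz$/.fastq.gz/'
}

# Hypothetical helper: apply the clean-up to every *.fastq.gz file in a directory.
rename_fastq_dir() {
    local dir="$1" path new
    for path in "${dir}"/*.fastq.gz; do
        [ -e "${path}" ] || continue            # glob matched nothing
        new="${dir}/$(clean_fastq_name "$(basename "${path}")")"
        # Skip names that are already clean; -n refuses to overwrite existing files
        [ "${path}" = "${new}" ] || mv -n -- "${path}" "${new}"
    done
}
```

In the script this would be called as `rename_fastq_dir "${workdir}/resources/reads"`, in place of the `rename` lines.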