From c77418c3dcca6039d436b8a6ce5d93d126419987 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Nicolas=20FERNANDEZ=20NU=C3=91EZ?=
 <fernandez.nunez.nicolas@gmail.com>
Date: Thu, 27 Jan 2022 11:24:21 +0100
Subject: [PATCH] Working on README.md and bwa ref

---
 .gitignore               |  2 +-
 README.md                | 88 +++++++++++++++++++++++-----------------
 RQC.sh                   | 14 +++----
 config/config.yaml       |  7 ++--
 config/fastq-screen.conf | 19 ++++++++-
 5 files changed, 79 insertions(+), 51 deletions(-)

diff --git a/.gitignore b/.gitignore
index af9a39f..d6a461b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,4 +11,4 @@
 *.txt
 /results*
 
-#/resources/databases/bwa
+
diff --git a/README.md b/README.md
index 85a33e4..8088858 100644
--- a/README.md
+++ b/README.md
@@ -57,7 +57,7 @@ mamba install -c conda-forge -c bioconda snakemake=6.12.1 --yes
 
 ### RQC ###
 
-**Download** _OR_ clone the **RQC pipeline** project  
+**Download** _OR_ **Clone** the **RQC pipeline** project  
 
 #### Difference between **Download** and **Clone** ####
 To create a copy of a remote repository’s files on your computer, you can either download or clone the repository  
@@ -75,6 +75,7 @@ wget https://gitlab.com/ird_transvihmi/Reads_Quality_Control_Pipeline/-/archive/
 tar -xzvf Reads_Quality_Control_Pipeline-main.tar.gz
 rm -f Reads_Quality_Control_Pipeline-main.tar.gz 
 mv Reads_Quality_Control_Pipeline-main/ ~/Desktop/RQC_Pipeline/
+cd ~/Desktop/RQC_Pipeline/
 ```
 
 #### Clone ####
@@ -83,14 +84,20 @@ mv Reads_Quality_Control_Pipeline-main/ ~/Desktop/RQC_Pipeline/
 Authenticate with GitLab by following the instruction in the [2FA documentation](https://docs.gitlab.com/ee/user/profile/account/two_factor_authentication.html)  
 ```shell
 git clone https://gitlab.com/ird_transvihmi/Reads_Quality_Control_Pipeline.git
+```
+```shell
 mv Reads_Quality_Control_Pipeline/ ~/Desktop/RQC_Pipeline/
+cd ~/Desktop/RQC_Pipeline/
 ```
 
 - Clone with **SSH** (_when you want to authenticate only one time_)  
 Authenticate with GitLab by following the instructions in the [SSH documentation](https://docs.gitlab.com/ee/ssh/index.html)  
 ```shell
 git clone git@gitlab.com:ird_transvihmi/Reads_Quality_Control_Pipeline.git
+```
+```shell
 mv Reads_Quality_Control_Pipeline/ ~/Desktop/RQC_Pipeline/
+cd ~/Desktop/RQC_Pipeline/
 ```
 
 
@@ -110,32 +117,36 @@ sudo chmod +x path/to/Reads_Quality_Control_Pipeline/RQCP.sh
 
 Yours results are available in results directory as follow:
 
-... TODO ...
+1. fastq-screen: screens your reads against a set of search libraries, which might contain the genomes of all the organisms you work on, along with PhiX, vectors or other contaminants commonly seen in sequencing experiments.
+2. fastqc: a modular set of analyses that gives a quick impression of whether your data has any problems you should be aware of before doing further analysis.
+3. multiqc: compiled HTML report.
+4. reports: logs from the tools.
+5. summary: Snakemake rules graph and files summary.
+
+
+### Configuration ###
 
-###  Configuration ###
+#### _./config/config.yaml_ ####
 
-#### Resources ####
+##### Resources #####
 
-Edit to match your hardware configuration  
+Edit to match your hardware configuration
+_(given when you run RQC.sh)_
 
-#### Environments ####
+##### Environments #####
 
 Edit if you change some environments (i.e.new version) in ./workflow/envs/tools-version.yaml files
+_(you should not change this)_
 
-#### Fastq-Screen #####
+##### Fastq-Screen #####
 
 - **config**: Path to the fastq-screen configuration file (default config: ./config/fastq-screen.conf)
 - **subset**: Don't use the whole sequence file, but create a temporary dataset of this specified number of read (default config: '10000', set '0' for all dataset)
 - **aligner**: Specify the aligner to use for the mapping. Valid arguments are 'bowtie', bowtie2' or 'bwa' (default config: 'bwa')
 
-##### fastq-screen.conf #####
+#### _./config/fastq-screen.conf_ ####
 
-- **path**: Set this value to tell the program where to find your chosen aligner (default :/usr/local/\<tool\>
-- **bismark**: Same for bismark (for bisulfite sequencing only)
-- **threads**: Set this value to the number of cores you want for mapping reads (default: 1, but overwrited by Snakemake and config.yaml file)
-- **databases**: This section enables you to configure multiple genomes databases (aligner index files) to search against in your screen
-
-##### databases #####
+##### databases #####
 
 For each genome you need to provide a database name (which **can't** contain spaces) and the location of the aligner index files  
 
@@ -154,37 +165,39 @@ It's suggested including genomes and sequences that:
 - may be sources of contamination either because they where run on your sequencer previously
 - may have contaminated your sample during the library preparation step
 
-For IRD_U233_TransVIHMI, cretaed this indexes:
-
-- **Human**: main sources of lab. contaminations _(exepted if Boston Dynamics Atlas robot did the job)_ **¡not included!**
-- **Mouse**: main model in biology experimentation, very frequent in NGS facility core **¡not included!**
-- **Arabidopsis**: frequent plant model in NGS facility core associated with plants researches (IRD, CIRAD, INRAE, ...) **¡not included!**
-- **Ecoli**: frequent bacteria model, also an indicator of human contaminations, also in feces and stool samples
-- **PhiX**: usefull control in Illumina sequencing run technology
-- **Adapters**: use for libraries generation
-- **Vector**: use in general molecular biology
-- **Gorilla**: species studied in TransVIHMI **¡not included!**
-- **Chimpanzee**: species studied in TransVIHMI **¡not included!**
-- **Bat**: species studied in TransVIHMI **¡not included!**
-- **HIV**: species studied in TransVIHMI
-- **Ebola**: species studied in TransVIHMI
-- **SARS-CoV-2**: species studied in TransVIHMI
-
-**Not included indexes:**  
-Indexes for large genomes can be heavy (~ 3Gb) and git limit each project to 10Gb. Download all this databases can be also to long.  
-Commonly it's share on git only code, but not larger resources _(data input, databases, references, ...).  
-This data can always be download somewhere (online servers).   
-Databases for below genomes where generated and available at IRD_U233_TransVIHMI lab.  
-You can freely ask for sharing (with USB supports or FileSender) to add it to your analyses.  
-You can also ask for new databases, for genomes references not yet included, to check putative presence / absence on your dataset.  
+For IRD_U233_TransVIHMI, we can provide:
+
+- Human: main source of laboratory contamination
+- Mouse: main model in biological experimentation, very common in NGS core facilities
+- Arabidopsis: common plant model in NGS core facilities associated with plant research _(IRD, CIRAD, INRAE, ...)_
+- Ecoli: common bacterial model, also an indicator of human contamination, and found in feces and stool samples
+- PhiX: useful control in Illumina sequencing runs
+- Adapters: used for library generation
+- Vector: used in general molecular biology
+- Gorilla: species studied in TransVIHMI
+- Chimpanzee: species studied in TransVIHMI
+- Bat: species studied in TransVIHMI
+- HIV: species studied in TransVIHMI
+- Ebola: species studied in TransVIHMI
+- SARS-CoV-2: species studied in TransVIHMI
+- Coronavirus: species studied in TransVIHMI
+
+Indexes for larger genomes can be heavy (~3 GB) and GitLab limits each project to 10 GB.  
+Downloading all these databases can also take a long time, so we share only code on GitLab, not resources.  
+This data can be downloaded separately from dedicated servers.  
+You can also ask us to share it (on physical media or via FileSender) to add it to your analyses.  
+You can also request new indexes for genomes not yet included.  
+
 
 ## Support ##
+
 1. RTFM! (Read The Fabulous Manual! ^^.)
 2. Read de awsome wiki ;)
 3. Create a new issue: Issues > New issue > Describe your issue
 4. Send an email to [nicolas.fernandez@ird.fr](url)
 5. Call me to `+33.(0)4.67.41.55.xx` (No don't please _O\_o_!)
 
+
 ## Roadmap ##
 
 Add a wiki !  
@@ -211,6 +224,7 @@ Use Git tools to share!
 
 
 ## Project status ##
+
 This project is regularly update and actively maintened  
 However, you can be volunteer to step in as a maintainer  
 
diff --git a/RQC.sh b/RQC.sh
index cee215a..89dd8d6 100755
--- a/RQC.sh
+++ b/RQC.sh
@@ -4,13 +4,13 @@
 echo ""
 echo "##### ABOUT #####"
 echo "-----------------"
-echo "Name: Reads Quality Control pipeline"
+echo "Name: RQC.sh"
 echo "Author: Nicolas Fernandez"
 echo "Affiliation: IRD_U233_TransVIHMI"
 echo "Aim: Bash script for RQC pipeline"
 echo "Date: 2021.04.30"
 echo "Run: snakemake --snakemake rqc.smk --cores --use-conda"
-echo "Latest modification: 2022.01.25"
+echo "Latest modification: 2022.01.27"
 echo "Todo: done"
 echo "________________________________________________________________________"
 
@@ -56,12 +56,12 @@ echo "##### RENAME FASTQ FILES #####"
 echo "------------------------------"
 
 # With rename command from macOSX
-rename --verbose 's/_S\d+_/_/' ${workdir}/resources/reads/*.fastq.gz 2> /dev/null                # Remove barcode-ID like {_S001_}
+#rename --verbose 's/_S\d+_/_/' ${workdir}/resources/reads/*.fastq.gz 2> /dev/null                # Remove barcode-ID like {_S001_}
 rename --verbose 's/_L\d+_/_/' ${workdir}/resources/reads/*.fastq.gz 2> /dev/null                # Remove line-ID ID like {_L001_}
 rename --verbose 's/_001.fastq.gz/.fastq.gz/' ${workdir}/resources/reads/*.fastq.gz 2> /dev/null # Remove end-name ID like {_001}.fastq.gz
 
 # With rename command as part of the util-linux package
-rename --verbose _S\d+_ _ ${workdir}/resources/reads/*.fastq.gz 2> /dev/null                # Remove barcode-ID like {_S001_}
+#rename --verbose _S\d+_ _ ${workdir}/resources/reads/*.fastq.gz 2> /dev/null                # Remove barcode-ID like {_S001_}
 rename --verbose _L\d+_ _ ${workdir}/resources/reads/*.fastq.gz 2> /dev/null                # Remove line-ID ID like {_L001_}
 rename --verbose _001.fastq.gz .fastq.gz ${workdir}/resources/reads/*.fastq.gz 2> /dev/null # Remove end-name ID like {_001}.fastq.gz
 
@@ -163,7 +163,7 @@ echo ""
 echo "##### SNAKEMAKE PIPELINE GRAPHS ######"
 echo "--------------------------------------"
 
-mkdir ${workdir}/results/Summary/ 2> /dev/null
+mkdir ${workdir}/results/summary/ 2> /dev/null
 
 graphList="dag rulegraph filegraph"
 extentionList="pdf png"
@@ -175,14 +175,14 @@ for graph in ${graphList} ; do
             --snakefile ${workdir}/workflow/rules/rqc.smk \
             --${graph} | \
             dot -T${extention} > \
-                ${workdir}/results/Summary/${graph}.${extention} ;
+                ${workdir}/results/summary/${graph}.${extention} ;
     done ;
 done
 
 snakemake \
     --directory ${workdir} \
     --snakefile ${workdir}/workflow/rules/rqc.smk \
-    --summary > ${workdir}/results/Summary/files_summary.txt
+    --summary > ${workdir}/results/summary/files_summary.txt
 
 echo "________________________________________________________________________"
 
diff --git a/config/config.yaml b/config/config.yaml
index 7b7ec15..f4eff3d 100644
--- a/config/config.yaml
+++ b/config/config.yaml
@@ -5,15 +5,14 @@
 # Aim: Config file for RQC pipeline
 # Date: 2021.04.30
 # Use: Edit or (de)comment settings
-# Latest modification: 2022.01.25
+# Latest modification: 2022.01.27
 # Todo: done
-
 ###############################################################################
 
 ## RESOURCES -----------------------------------------------------------------------------------------
 resources:
-  cpus: 12  # cpus
-  mem_gb: 16 # mem in Gb
+  cpus: 4   # cpus
+  mem_gb: 4 # mem in Gb
 
 ## ENVIRONNEMENTS ------------------------------------------------------------------------------------
 conda:
diff --git a/config/fastq-screen.conf b/config/fastq-screen.conf
index 5c4d281..22a8f72 100644
--- a/config/fastq-screen.conf
+++ b/config/fastq-screen.conf
@@ -5,7 +5,7 @@
 # Aim: Config file for Fastq-Screen, with bwa, for RQC pipeline
 # Date: 2021.04.30
 # Use: Edit, (de)comment settingss you want modify
-# Latest modification: 2021.01.26
+# Latest modification: 2022.01.27
 # Todo: na
 ###############################################################################                                                                                                                                  
 
@@ -18,6 +18,7 @@ DATABASE	Adapters        resources/indexes/bwa/Adapters
 DATABASE	Vectors		 resources/indexes/bwa/UniVec_wo_phi-X174
 
 ## LAB ORGANISMS -------------------------------------------------------------------------------------
+
 # Smallest -----
 # HIV - HXB2
 DATABASE	HIV	 resources/indexes/bwa/HIV_HXB2
@@ -26,6 +27,20 @@ DATABASE	Ebola	 resources/indexes/bwa/Ebola_ZEBOV
 # SARS-CoV-2 - sequence from Whuhan available from NCBI (accession NC_045512.2)
 DATABASE	SARS-CoV-2	 resources/indexes/bwa/SARS-CoV-2_Wuhan-WIV04_2019
 
+# Coronas ----- 
+# aCoV_DuvinaCoV - sequence 229E-CoV_NC_002645.1
+#DATABASE	aCoV_DuvinaCoV	resources/indexes/bwa/aCoV_Duvinacov
+# bCoV_EmbeCoV - sequence HKU1_HM034837.1
+#DATABASE	bCoV_EmbeCoV	resources/indexes/bwa/bCoV_Embecov
+# bCoV_HibeCoV - sequence bat_NC_025217.1
+#DATABASE	bCoV_HibeCoV	resources/indexes/bwa/bCoV_Hibecov
+# bCoV_MerbeCoV - sequence bat_MF593268.1
+#DATABASE	bCoV_MerbeCoV	resources/indexes/bwa/bCoV_Merbecov
+# bCoV_NobeCoV - sequence bat_NC_048212.1
+#DATABASE	bCoV_NobeCoV	resources/indexes/bwa/bCoV_Nobecov
+# bCoV_SarbeCoV - sequence RaTG13_MN996532.2
+#DATABASE	bCoV_SarbeCoV	resources/indexes/bwa/bCoV_Sarbecov
+
 # Larger -----
 # Gorilla g4	
 #DATABASE	Gorilla	 resources/indexes/bwa/Gorilla_gorilla_g4
@@ -42,4 +57,4 @@ DATABASE	SARS-CoV-2	 resources/indexes/bwa/SARS-CoV-2_Wuhan-WIV04_2019
 # Arabidopsis t10 - sequence from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.fna.gz)
 #DATABASE	Arabidopsis	 resources/indexes/bwa/Arabidopsis_thaliana_t10
 # Ecoli - sequence available from EMBL (accession U00096.2)
-DATABASE	Ecoli	 resources/indexes/bwa/Echerichia_coli_U00096
+DATABASE	Ecoli	 resources/indexes/bwa/Echerichia_coli
-- 
GitLab