diff --git a/README.rst b/README.rst index 79cac06fe444f045183fe78bb17ff8adfd6a638a..ba73a993abe116d15a32e437eb00baa941980c4c 100644 --- a/README.rst +++ b/README.rst @@ -1,72 +1,81 @@ -.. contents:: Table of Contents - :depth: 2 +# About this Workflows -About this DNAnnot -==================== +This pipeline is an automatic structural annotation workflows written in snakemake. The annotation is based on two tools + which uses RNA-Seq and/or protein homology information for predict coding sequence. One of this tools is **[BRAKER](https://github.com/Gaius-Augustus/BRAKER)** which use **[GeneMark-EX](http://exon.gatech.edu/GeneMark/)** and **[AUGUSTUS](http://augustus.gobics.de/)**. And the other tool is **[AUGUSTUS](http://augustus.gobics.de/)** alone for improve annotation of small coding sequences with few or no intron. Before the annotation, the repeat element of genome are masked for avoid annotation probleme. In addition this workflows can perform a illumina assembly with ABySS using différent value of kmere. + -This pipeline is an automatic structural annotation workflows written in snakemake. The annotation is based on two tools which uses RNA-Seq and/or protein homology information for predict coding sequence. One of this tools is BRAKER which use GeneMark-EX and AUGUSTUS. And the other tool is AUGUSTUS alone for improve annotation of small coding sequences with few or no intron. Before the annotation, the repeat element of genome are masked for avoid annotation probleme. In addition this workflows can perform a illumina assembly with ABySS using différent value of kmere. - -Installation -============ +#Installation For install the annotation Workflows, please use this command : -git clone https://github.com/FlorianCHA/git@github.com:FlorianCHA/DNAnnot.git +``` +git clone https://github.com/FlorianCHA/AssemblyAndAnnotation_pipeline.git +``` -This workflows use many tools for assembly, mapping, annotation and quality control. For installation of softwre two option are available. You can install all tools mannually or you can use the singularity launcher without install any tools needed in this workflows. -Manually install +This workflows use many tools for assembly, mapping, annotation and quality control. For installation of softwre two + option are + available. You can install all tools mannually or you can use the singularity launcher without install any tools + needed in + this workflows. + +## Manually install If you want download all software, please complete the software part of config.yaml file. -Mandatory installation - - Hisat2 - samtools - BRAKER - BUSCO - R & Rmarckdown package - Python >=3.7 - Snakemake >= 5.2 - -Optional installation -_____________________ - - RepeatMasker if you want mask the repeat element of your genomes - ABySS if you want assembled you illumina fastq - -Sigularity containers -_____________________ - -All containers for the workflows are available here. If you use the 'Launcher_singularty.sh', the workflows download all singularity containers needed. You only need to download the genemark-ES licence here -Defining Workflows -Prepare config file - -Run Workflow -============ - -To run the workflows you have to provide the data path for all input file. Please complete the config.yaml file for launch the workflow. - -Assembly option -_______________ +### Mandatory installation + + * **[Hisat2](https://ccb.jhu.edu/software/hisat2/manual.shtml#obtaining-hisat2)** + * **[samtools](https://github.com/samtools/samtools)** + * **[BRAKER](https://github.com/Gaius-Augustus/BRAKER)** + * **[BUSCO](https://gitlab.com/ezlab/busco/-/tree/master)** + * **[R](https://cran.r-project.org/bin/linux/ubuntu/README.html>) & [Rmarckdown package](https://rmarkdown.rstudio.com/lesson-1.html>)** + * **[Python >=3.7](https://www.python.org/downloads/)** + * **[Snakemake >= 5.2](https://snakemake.readthedocs.io/en/stable/)** + +### Optional installation + + * [RepeatMasker](http://www.repeatmasker.org/RMDownload.html) if you want mask the repeat element of your genomes + * [ABySS](https://bioinformaticshome.com/tools/wga/descriptions/ABySS.html#Download_and_documentation) if you want assembled you illumina fastq + +## Sigularity containers + +All containers for the workflows are available **[here](https://singularity-hub.org/collections/4091)**. If you use the + 'Launcher_singularty.sh', the workflows download all singularity containers needed. You only need to download the genemark-ES licence **[here](http://exon.gatech.edu/GeneMark/license_download.cgi)** + +# Defining Workflows + +## Prepare config file + +To run the workflows you have to provide the data path for all input file. Please complete the config.yaml file for + launch the workflow. + +### 1. Providing data + +#### Assembly option + +``` # If you want assembly with ABySS you illumina data please complete this part else, pass this part (keep every path empty '') FASTQ: '/path/to/fastq/directory/' SUFFIX_FASTQ_R1 : '_R1.fastq.gz' SUFFIX_FASTQ_R2 : '_R1.fastq.gz' +``` - FASTQ : Path of you directory which contain all your fastq file to assemble, if you let empty the path the workflown don't assembled and use fasta file (give in the FASTA option) for the annotation step. - SUFFIX_FASTQ_R1 : Etension of your R1 fastq files contains in FASTQ directory (for exemple : '_R1.fastq.gz' ) - SUFFIX_FASTQ_R2 : Etension of your R2 fastq files contains in FASTQ directory (for exemple : '_R2.fastq.gz' ) - -Repeat element masking option -_____________________________ +* **FASTQ** : Path of you directory which contain all your fastq file to assemble, if you let empty the path the + workflown don't assembled and use fasta file (give in the **FASTA** option) for the annotation step. +* **SUFFIX_FASTQ_R1** : Etension of your R1 fastq files contains in FASTQ directory (for exemple : '_R1.fastq.gz' ) +* **SUFFIX_FASTQ_R2** : Etension of your R2 fastq files contains in FASTQ directory (for exemple : '_R2.fastq.gz' ) + +#### Repeat element masking option +``` ET_DB: '/path/to/repeat_element_db.fasta' +``` +* **ET_DB** : Path of the repeat element data base for repeatMasker, if you let empty the path, the workflow don't + mask the repeat element of the genome + +#### Annotation option - ET_DB : Path of the repeat element data base for repeatMasker, if you let empty the path, the workflow don't mask the repeat element of the genome - -Annotation option -_________________ - +``` FASTA:'/path/to/fasta/directory/' SUFFIX_FASTA : '.fasta' RNAseq_DIR : 'path/to/RNA_seqfastq/directory/' @@ -74,13 +83,23 @@ _________________ ID_SPECIES: 'arabidopsis' PROTEIN_REF: '/path/to/protein_ref.fasta' GM_KEY : '/path/to/gm_key_64' +``` +* **FASTA** : Path of you directory which contain all your fasta file to annotate. If the **FASTQ** option is empty + please give a correct path else you can let empty this option. +* **RNAseq_DIR** : Path of the directory which contain all RNAseq data, if you kepts this path empty this pipeline + run only augustus +* **SUFFIX_RNAseq** : Etension of your fastq files contains in FASTQ directory (for exemple : '.fastq.gz','fq.gz +','fq' , etc. ) +* **ID_SPECIES** : ID of species for augustus trainings, please refers to augustus main page for this option +* **PROTEIN_REF** : Path of the protein fasta file, if you don't have this file you can kept empty this option ('') +* **GM_KEY** : Path of the licence for Genemarks-ES (please clik **[here](http://exon.gatech.edu/GeneMark/license_download.cgi)** for download the licence). +* **OUTPUT** : Output directory for all results of this pipeline + + +### 2. Parameters for some specific tools - FASTA : Path of you directory which contain all your fasta file to annotate. If the FASTQ option is empty please give a correct path else you can let empty this option. - RNAseq_DIR : Path of the directory which contain all RNAseq data, if you kepts this path empty this pipeline run only augustus - SUFFIX_RNAseq : Etension of your fastq files contains in FASTQ directory (for exemple : '.fastq.gz','fq.gz ','fq' , etc. ) - ID_SPECIES : ID of species for augustus trainings, please refers to augustus main page for this option - PROTEIN_REF : Path of the protein fasta file, if you don't have this file you can kept empty this option ('') - GM_KEY : Path of the licence for Genemarks-ES (please clik here for download the licence). - OUTPUT : Output directory for all results of this pipeline +## Launching workflow +### 1. On a HPC clusters +### 2. On a single machine