Commit f443a214 authored by florianCHA

Merge branch 'master' of github.com:FlorianCHA/DNAnnot

parents 7ca56b37 c181c3f6
Branches master
README.md 0 → 100644
# About this workflow
This pipeline is an automatic structural annotation workflow written in Snakemake. The annotation relies on two tools that use RNA-Seq and/or protein homology information to predict coding sequences. The first is **[BRAKER](https://github.com/Gaius-Augustus/BRAKER)**, which uses **[GeneMark-EX](http://exon.gatech.edu/GeneMark/)** and **[AUGUSTUS](http://augustus.gobics.de/)**. The second is **[AUGUSTUS](http://augustus.gobics.de/)** run alone, which improves the annotation of small coding sequences with few or no introns. Before annotation, the repeat elements of the genome are masked to avoid annotation problems. In addition, this workflow can assemble Illumina reads with ABySS using different k-mer values.
# Installation
To install the annotation workflow, use this command:
```
git clone https://github.com/FlorianCHA/AssemblyAndAnnotation_pipeline.git
```
This workflow uses many tools for assembly, mapping, annotation and quality control. Two options are available for installing
the software: you can install all the tools manually, or you can use the Singularity launcher, which does not require
installing any of the tools yourself.
## Manual installation
If you install the software yourself, complete the software section of the config.yaml file.
### Mandatory installation
* **[Hisat2](https://ccb.jhu.edu/software/hisat2/manual.shtml#obtaining-hisat2)**
* **[samtools](https://github.com/samtools/samtools)**
* **[BRAKER](https://github.com/Gaius-Augustus/BRAKER)**
* **[BUSCO](https://gitlab.com/ezlab/busco/-/tree/master)**
* **[R](https://cran.r-project.org/bin/linux/ubuntu/README.html) & [R Markdown package](https://rmarkdown.rstudio.com/lesson-1.html)**
* **[Python >=3.7](https://www.python.org/downloads/)**
* **[Snakemake >= 5.2](https://snakemake.readthedocs.io/en/stable/)**
### Optional installation
* [RepeatMasker](http://www.repeatmasker.org/RMDownload.html) if you want to mask the repeat elements of your genomes
* [ABySS](https://bioinformaticshome.com/tools/wga/descriptions/ABySS.html#Download_and_documentation) if you want to assemble your Illumina FASTQ files
## Singularity containers
All containers for the workflow are available **[here](https://singularity-hub.org/collections/4091)**. If you use
'Launcher_singularty.sh', the workflow downloads all the Singularity containers it needs. You only need to download the GeneMark-ES licence **[here](http://exon.gatech.edu/GeneMark/license_download.cgi)**.
# Defining the workflow
## Prepare config file
To run the workflow, you have to provide the paths to all input files. Complete the config.yaml file before
launching the workflow.
### 1. Providing data
#### Assembly option
```
# If you want to assemble your Illumina data with ABySS, complete this part; otherwise skip it (keep every path empty: '')
FASTQ: '/path/to/fastq/directory/'
SUFFIX_FASTQ_R1 : '_R1.fastq.gz'
SUFFIX_FASTQ_R2 : '_R2.fastq.gz'
```
* **FASTQ** : Path of the directory containing all the FASTQ files to assemble. If you leave this path empty, the
workflow skips the assembly and uses the FASTA files (given in the **FASTA** option) for the annotation step.
* **SUFFIX_FASTQ_R1** : Extension of the R1 FASTQ files contained in the FASTQ directory (for example: '_R1.fastq.gz')
* **SUFFIX_FASTQ_R2** : Extension of the R2 FASTQ files contained in the FASTQ directory (for example: '_R2.fastq.gz')
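
The way the two suffixes identify read pairs can be sketched in Python. This is only an illustration, not the workflow's own code: the function name `pair_fastq` is hypothetical, and it assumes that a sample name is whatever remains of the file name once the R1 suffix is removed.

```python
from pathlib import Path


def pair_fastq(fastq_dir, suffix_r1, suffix_r2):
    """Return {sample_name: (r1_path, r2_path)} for every complete read pair."""
    if suffix_r1 == suffix_r2:
        # Identical suffixes would make R1 and R2 resolve to the same file.
        raise ValueError("SUFFIX_FASTQ_R1 and SUFFIX_FASTQ_R2 must differ")
    pairs = {}
    for r1 in sorted(Path(fastq_dir).glob("*" + suffix_r1)):
        # Assumption: the sample name is the file name minus the R1 suffix.
        sample = r1.name[: -len(suffix_r1)]
        r2 = r1.with_name(sample + suffix_r2)
        if r2.is_file():
            pairs[sample] = (r1, r2)
    return pairs
```

For instance, `pair_fastq('/path/to/fastq/directory/', '_R1.fastq.gz', '_R2.fastq.gz')` would list the samples for which both read files are present. A sample with an R1 file but no R2 file is silently left out in this sketch.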
#### Repeat element masking option
```
ET_DB: '/path/to/repeat_element_db.fasta'
```
* **ET_DB** : Path of the repeat element database for RepeatMasker. If you leave this path empty, the workflow does not
mask the repeat elements of the genome.
#### Annotation option
```
FASTA: '/path/to/fasta/directory/'
SUFFIX_FASTA : '.fasta'
RNAseq_DIR : '/path/to/RNA_seqfastq/directory/'
SUFFIX_RNAseq : '.fastq.gz'
ID_SPECIES: 'arabidopsis'
PROTEIN_REF: '/path/to/protein_ref.fasta'
GM_KEY : '/path/to/gm_key_64'
OUTPUT: '/path/to/output/directory/'
```
* **FASTA** : Path of the directory containing all the FASTA files to annotate. If the **FASTQ** option is empty,
this path is required; otherwise you can leave it empty.
* **SUFFIX_FASTA** : Extension of the FASTA files contained in the FASTA directory (for example: '.fasta')
* **RNAseq_DIR** : Path of the directory containing all the RNA-Seq data. If you leave this path empty, the pipeline
runs only AUGUSTUS.
* **SUFFIX_RNAseq** : Extension of the FASTQ files contained in the RNAseq_DIR directory (for example: '.fastq.gz',
'.fq.gz', '.fq', etc.)
* **ID_SPECIES** : Species ID used for AUGUSTUS training; refer to the AUGUSTUS main page for this option
* **PROTEIN_REF** : Path of the protein FASTA file. If you do not have this file, you can leave this option empty ('')
* **GM_KEY** : Path of the licence for GeneMark-ES (click **[here](http://exon.gatech.edu/GeneMark/license_download.cgi)** to download the licence).
* **OUTPUT** : Output directory for all results of this pipeline
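
The options above decide which parts of the pipeline run. The Python sketch below restates those rules as read from this README; it is not the workflow's actual logic. The function name `plan_steps` and the step labels are hypothetical, and the config is passed as a plain dictionary rather than read from config.yaml.

```python
def plan_steps(config):
    """Return the stages implied by a config mapping, in the order they would run."""

    def is_set(key):
        # An option counts as unset when it is missing, None or an empty string.
        return bool(str(config.get(key) or "").strip())

    if not is_set("FASTQ") and not is_set("FASTA"):
        raise ValueError("Provide FASTQ (reads to assemble) or FASTA (genomes to annotate)")

    steps = []
    if is_set("FASTQ"):
        steps.append("abyss_assembly")   # assemble the Illumina reads first
    if is_set("ET_DB"):
        steps.append("repeat_masking")   # mask repeats with RepeatMasker
    if is_set("RNAseq_DIR"):
        steps.append("braker")           # BRAKER needs the RNA-Seq data
    steps.append("augustus")             # AUGUSTUS alone runs in every case
    return steps
```

With only **FASTA** set, `plan_steps` returns `['augustus']`, which matches the case where the RNAseq_DIR path is left empty.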
### 2. Parameters for some specific tools
## Launching workflow
### 1. On an HPC cluster
### 2. On a single machine
.. contents:: Table of Contents
   :depth: 2

About DNAnnot
=============
This pipeline is an automatic structural annotation workflow written in Snakemake. The annotation relies on two tools that use RNA-Seq and/or protein homology information to predict coding sequences. The first is BRAKER, which uses GeneMark-EX and AUGUSTUS. The second is AUGUSTUS run alone, which improves the annotation of small coding sequences with few or no introns. Before annotation, the repeat elements of the genome are masked to avoid annotation problems. In addition, this workflow can assemble Illumina reads with ABySS using different k-mer values.

Installation
============
To install the annotation workflow, use this command::

    git clone https://github.com/FlorianCHA/DNAnnot.git

This workflow uses many tools for assembly, mapping, annotation and quality control. Two options are available for installing the software: you can install all the tools manually, or you can use the Singularity launcher, which does not require installing any of the tools yourself.

Manual installation
___________________
If you install the software yourself, complete the software section of the config.yaml file.

Mandatory installation
______________________
* Hisat2
* samtools
* BRAKER
* BUSCO
* R & R Markdown package
* Python >=3.7
* Snakemake >= 5.2

Optional installation
_____________________
* RepeatMasker, if you want to mask the repeat elements of your genomes
* ABySS, if you want to assemble your Illumina FASTQ files

Singularity containers
______________________
All containers for the workflow are available `here <https://singularity-hub.org/collections/4091>`__. If you use 'Launcher_singularty.sh', the workflow downloads all the Singularity containers it needs. You only need to download the GeneMark-ES licence `here <http://exon.gatech.edu/GeneMark/license_download.cgi>`__.

Defining Workflows
Prepare config file
Run Workflow
============
To run the workflow, you have to provide the paths to all input files. Complete the config.yaml file before launching the workflow.

Assembly option
_______________
::

    # If you want to assemble your Illumina data with ABySS, complete this part; otherwise skip it (keep every path empty: '')
    FASTQ: '/path/to/fastq/directory/'
    SUFFIX_FASTQ_R1 : '_R1.fastq.gz'
    SUFFIX_FASTQ_R2 : '_R2.fastq.gz'

* **FASTQ** : Path of the directory containing all the FASTQ files to assemble. If you leave this path empty, the workflow skips the assembly and uses the FASTA files (given in the FASTA option) for the annotation step.
* **SUFFIX_FASTQ_R1** : Extension of the R1 FASTQ files contained in the FASTQ directory (for example: '_R1.fastq.gz')
* **SUFFIX_FASTQ_R2** : Extension of the R2 FASTQ files contained in the FASTQ directory (for example: '_R2.fastq.gz')

Repeat element masking option
_____________________________
::

    ET_DB: '/path/to/repeat_element_db.fasta'

* **ET_DB** : Path of the repeat element database for RepeatMasker. If you leave this path empty, the workflow does not mask the repeat elements of the genome.

Annotation option
_________________
::

    FASTA: '/path/to/fasta/directory/'
    SUFFIX_FASTA : '.fasta'
    RNAseq_DIR : '/path/to/RNA_seqfastq/directory/'
    SUFFIX_RNAseq : '.fastq.gz'
    ID_SPECIES: 'arabidopsis'
    PROTEIN_REF: '/path/to/protein_ref.fasta'
    GM_KEY : '/path/to/gm_key_64'

* **FASTA** : Path of the directory containing all the FASTA files to annotate. If the FASTQ option is empty, this path is required; otherwise you can leave it empty.
* **RNAseq_DIR** : Path of the directory containing all the RNA-Seq data. If you leave this path empty, the pipeline runs only AUGUSTUS.
* **SUFFIX_RNAseq** : Extension of the FASTQ files contained in the RNAseq_DIR directory (for example: '.fastq.gz', '.fq.gz', '.fq', etc.)
* **ID_SPECIES** : Species ID used for AUGUSTUS training; refer to the AUGUSTUS main page for this option
* **PROTEIN_REF** : Path of the protein FASTA file. If you do not have this file, you can leave this option empty ('')
* **GM_KEY** : Path of the licence for GeneMark-ES (click `here <http://exon.gatech.edu/GeneMark/license_download.cgi>`__ to download the licence).
* **OUTPUT** : Output directory for all results of this pipeline