GeVarLi: GEnome assembly, VARiant calling and LIneage assignation

~ ABOUT ~
GeVarLi
GeVarLi is a FAIR, open-source, scalable, modular and traceable Snakemake pipeline for SARS-CoV-2 (and other viruses) genome assembly and variant monitoring, using Illumina Inc. short-read sequencing of COVIDSeq™ libraries.
GeVarLi was initially developed for the AFROSCREEN project.
Genomic sequencing, a public health tool
The establishment of a surveillance and sequencing network is an essential public health tool for detecting and containing pathogens with epidemic potential. Genomic sequencing makes it possible to identify pathogens, monitor the emergence and impact of variants, and adapt public health policies accordingly.
The Covid-19 epidemic has highlighted the disparities that remain between continents in terms of surveillance and sequencing systems. At the end of October 2021, of the 4,600,000 sequences shared on the public and free GISAID tool worldwide, only 49,000 came from the African continent, i.e. less than 1% of the cases of Covid-19 diagnosed on this continent.
Features
- Reads quality control
- Fastq-Screen (contamination check)
- FastQC (quality metrics)
- MultiQC (html reports)
- Reads cleaning
- Cutadapt (adapters trimming & amplicon primers hard-clipping)
- Sickle-trim (quality trimming)
- Reads mapping
- (bam files)
- (bed files)
- Visualization (IGV)
- Variants calling and filtering (vcf files)
- Genome coverage (statistics reports)
- Consensus sequences (fasta file)
- Genomes classification
- Nextclade (consensus quality and lineages reports)
- Pangolin (lineages reports)
Version
V.2023.06
Rulegraph



~ SUPPORT ~
- Read The Fabulous Manual!
- Read the Awesome Wiki!
- Create a new issue: Issues > New issue > Describe your issue
- Send an email to nicolas.fernandez@ird.fr
~ CITATION ~
If you use this pipeline, please cite GeVarLi, its GitLab IRDForge repository and its authors:
GitLab IRDForge repository: https://forge.ird.fr/transvihmi/nfernandez/GeVarLi
GeVarLi, a FAIR, open-source, scalable, modular and traceable Snakemake pipeline for reference-based genome assembly, variant calling and lineage assignment, from SARS-CoV-2 to other (re)emerging viruses, using Illumina short-read sequencing.
Nicolas FERNANDEZ NUÑEZ (1)
(1) UMI 233 - Recherches Translationnelles sur le VIH et les Maladies Infectieuses endémiques et émergentes (TransVIHMI), University of Montpellier (UM), French Institute of Health and Medical Research (INSERM), French National Research Institute for Sustainable Development (IRD)
~ AUTHORS & ACKNOWLEDGMENTS ~
- Nicolas Fernandez - IRD (Developer and Maintainer)
- Christelle Butel - IRD (Reporter)
- Eddy Kinganda-Lusamaki - INRB (Source)
- DALL•E mini - OpenAI Git (Repo. avatar)
~ LICENSE ~
Licensed under GPLv3
Intellectual property belongs to IRD and authors.
~ ROADMAP ~
- Publish GeVarLi paper!
- Add GISAID submission files generation
- Add MultiQC config template
~ PROJECT STATUS ~
This project is regularly updated and actively maintained.
However, you are welcome to volunteer as a developer or maintainer.
~ CONTRIBUTING ~
Open to contributions!
- Asking for update
- Proposing new feature
- Reporting issue
- Fixing issue
- Sharing code
- Citing tool
~ INSTALLATIONS ~
Conda (dependencies)
GeVarLi uses the useful Conda environment manager.
If, and only if, Conda is not already installed, please install it first!
Download and install the latest Miniconda installer for your operating system.
e.g. for macOS 64-bit (x86_64) systems:
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o ~/Miniconda3-latest-MacOSX-x86_64.sh && \
bash ~/Miniconda3-latest-MacOSX-x86_64.sh -b -p ~/miniconda3/ && \
rm -f ~/Miniconda3-latest-MacOSX-x86_64.sh && \
~/miniconda3/condabin/conda update conda --yes && \
~/miniconda3/condabin/conda init && \
exit
e.g. for Linux-64-bit systems:
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o ~/Miniconda3-latest-Linux-x86_64.sh && \
bash ~/Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3/ && \
rm -f ~/Miniconda3-latest-Linux-x86_64.sh && \
~/miniconda3/condabin/conda update conda --yes && \
~/miniconda3/condabin/conda init && \
exit
Update Conda:
conda update -n base -c defaults conda
GeVarLi
Clone the GeVarLi GitLab IRDForge repository (ID: 399) into your home directory (128 MB required):
git clone --depth 1 https://forge.ird.fr/transvihmi/nfernandez/GeVarLi.git ~/GeVarLi/
Update GeVarLi:
cd ~/GeVarLi/ && git reset --hard HEAD && git pull --verbose
Otherwise, you can download GeVarLi as an archive (no update through "git pull") (75 MB required):
curl https://forge.ird.fr/transvihmi/nfernandez/GeVarLi/-/archive/main/GeVarLi-main.tar.gz -o ~/GeVarLi-main.tar.gz && \
tar -xzvf ~/GeVarLi-main.tar.gz -C ~/ && \
mv ~/GeVarLi-main/ ~/GeVarLi/ && \
rm -f ~/GeVarLi-main.tar.gz
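Before running the installation commands above, it can help to check that the command-line tools they call are available. The following is a minimal sketch, not part of GeVarLi: the helper name `missing_commands` is hypothetical, and the list of tools to check (git, curl, conda) is simply taken from the commands shown in this section.

```shell
# Hypothetical helper (not shipped with GeVarLi).
# missing_commands CMD...: print each command that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        # "command -v" succeeds only when the command can be resolved
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
```

For example, `missing_commands git curl conda` prints nothing when all three tools are installed, and prints the name of each one that is missing otherwise.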
~ USAGE ~
- Copy your paired-end read files, in .fastq.gz format, into the ./resources/reads/ directory. Without reads, the SARS-CoV-2 test data from the ./resources/test_data/ directory will be used.
- Execute the Start_GeVarLi.sh bash script to run the GeVarLi pipeline, in whichever way you prefer:
- with a double-click on it (if you have made .sh files executable with Terminal.app)
- or with a right-click > Open with > Terminal.app
- or with the CLI from a terminal:
bash Start_GeVarLi.sh
- Your analyses will start with the default configuration settings.
Option-1: Edit the config.yaml file in the ./configuration/ directory
Option-2: Edit the fastq-screen.conf file in the ./configuration/ directory
The first run will automatically create (only once):
- the Workflow-Base conda environment (with Snakemake, Mamba, Yq, Rename and GraphViz)
- all GeVarLi tools conda environments (tools used by GeVarLi rules)
- indexes for the BWA and Bowtie2 aligners (for each FASTA genome in the resources/ directory)
This may take some time, depending on your internet connection and your computer.
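Since the pipeline expects paired-end reads, a quick check of the reads directory before launching can catch a forgotten R2 file. The sketch below is illustrative only: the helper `find_unpaired_reads` is hypothetical, and it assumes the `{SAMPLE}_R1.fastq.gz` / `{SAMPLE}_R2.fastq.gz` naming shown in the GeVarLi map. It only reports R1 files that lack an R2 mate.

```shell
# Hypothetical helper (not shipped with GeVarLi).
# find_unpaired_reads DIR: print every sample whose R1 file has no matching R2 file.
find_unpaired_reads() (
    reads_dir=$1
    for r1 in "$reads_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                       # the glob matched nothing
        sample=$(basename "$r1" _R1.fastq.gz)          # strip directory and suffix
        [ -e "$reads_dir/${sample}_R2.fastq.gz" ] || printf '%s\n' "$sample"
    done
)
```

Running `find_unpaired_reads ~/GeVarLi/resources/reads` prints nothing when every R1 file has its R2 mate.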
~ RESULTS ~
Your results are available in the ./results/ directory, as follows:
Some [temp] tagged files are removed by default, to save disk usage
🧩 GeVarLi/
├── 📂 archives/
│   └── 📦 Results_{YYYY-MM-DD_HHhMM}_{REFERENCE}_{ALIGNER}_{MINCOV}_{SAMPLES}_archive.tar.gz
└── 📂 results/
    ├── 🧬 All_{REFERENCE}_consensus_sequences.fasta
    ├── 📊 All_{REFERENCE}_genome_coverages.tsv
    ├── 📊 All_{REFERENCE}_nextclade_lineages.tsv
    ├── 📊 All_{REFERENCE}_pangolin_lineages.tsv
    ├── 🌐 All_readsQC_reports.html
    ├── 📂 00_Quality_Control/
    │   ├── 📂 fastq-screen/
    │   │   ├── 🌐 {SAMPLE}_R{1/2}_screen.html
    │   │   ├── 📈 {SAMPLE}_R{1/2}_screen.png
    │   │   └── 📄 {SAMPLE}_R{1/2}_screen.txt
    │   ├── 📂 fastqc/
    │   │   ├── 🌐 {SAMPLE}_R{1/2}_fastqc.html
    │   │   └── 📦 {SAMPLE}_R{1/2}_fastqc.zip
    │   └── 📂 multiqc/
    │       ├── 🌐 multiqc_report.html
    │       └── 📂 multiqc_data/
    │           ├── 📝 multiqc.log
    │           ├── 📄 multiqc_citations.txt
    │           ├── 🌀 multiqc_data.json
    │           ├── 📄 multiqc_fastq_screen.txt
    │           ├── 📄 multiqc_fastqc.txt
    │           ├── 📄 multiqc_general_stats.txt
    │           └── 📄 multiqc_sources.txt
    ├── 📂 01_Trimmidapt/
    │   ├── 📂 cutadapt/
    │   │   └── 📦 {SAMPLE}_cutadapt-removed_R{1/2}.fastq.gz # [temp]
    │   └── 📂 sickle/
    │       ├── 📦 {SAMPLE}_sickle-trimmed_R{1/2}.fastq.gz # [temp]
    │       └── 📦 {SAMPLE}_sickle-trimmed_SE.fastq.gz # [temp]
    ├── 📂 02_Mapping/
    │   ├── 🧭 {SAMPLE}_{REFERENCE}_{ALIGNER}_mark-dup.bam
    │   ├── 🗂️ {SAMPLE}_{REFERENCE}_{ALIGNER}_mark-dup.bam.bai
    │   ├── 🧭 {SAMPLE}_{REFERENCE}_{ALIGNER}_mark-dup.primerclipped.bam
    │   ├── 🗂️ {SAMPLE}_{REFERENCE}_{ALIGNER}_mark-dup.primerclipped.bam.bai
    │   ├── 🧭 {SAMPLE}_{ALIGNER}-mapped.sam # [temp]
    │   ├── 🧭 {SAMPLE}_{REFERENCE}_{ALIGNER}_sorted-by-names.bam # [temp]
    │   ├── 🧭 {SAMPLE}_{REFERENCE}_{ALIGNER}_fixed-mate.bam # [temp]
    │   └── 🧭 {SAMPLE}_{REFERENCE}_{ALIGNER}_sorted.bam # [temp]
    ├── 📂 03_Coverage/
    │   ├── 📊 {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_coverage-stats.tsv
    │   ├── 🛏️ {SAMPLE}_{REFERENCE}_{ALIGNER}_genome-cov.bed # [temp]
    │   ├── 🛏️ {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_min-cov-filt.bed # [temp]
    │   └── 🛏️ {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_low-cov-mask.bed # [temp]
    ├── 📂 04_Variants/
    │   ├── 🧬 {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_masked-ref.fasta
    │   ├── 🗂️ {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_masked-ref.fasta.fai
    │   ├── 🧭 {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_indel-qual.bam
    │   ├── 🗂️ {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_indel-qual.bai
    │   ├── 🧮️ {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_variant-call.vcf
    │   ├── 🧮️ {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_variant-filt.vcf
    │   ├── 📦 {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_variant-filt.vcf.bgz # [temp]
    │   └── 🗂️ {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_variant-filt.vcf.bgz.tbi # [temp]
    ├── 📂 05_Consensus/
    │   └── 🧬 {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_consensus.fasta
    ├── 📂 06_Lineages/
    │   ├── 📊 {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_nextclade-report.tsv
    │   ├── 📊 {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_pangolin-report.csv
    │   └── 📂 {SAMPLE}_{REFERENCE}_{ALIGNER}_{MINCOV}_nextclade-all/
    │       ├── 🧬 nextclade.aligned.fasta
    │       ├── 📊 nextclade.csv
    │       ├── 📊 nextclade.errors.csv
    │       ├── 📊 nextclade.insertions.csv
    │       ├── 🌀 nextclade.json
    │       ├── 🌀 nextclade.ndjson
    │       ├── 🌀 nextclade.auspice.json
    │       └── 🧬 nextclade_{GENE}.translation.fasta
    └── 📂 10_Reports/
        ├── ⚙️ config.log
        ├── 📝 settings.log
        ├── 🍜 gevarli-base_v.{VERSION}.yaml
        ├── 🍜 gevarli-tools_v.{VERSION}.yaml
        ├── 📂 files-summaries/
        │   └── 📄 {PIPELINE}_files-summary.txt
        ├── 📂 graphs/
        │   ├── 📈 {PIPELINE}_dag.{PNG/PDF}
        │   ├── 📈 {PIPELINE}_filegraph.{PNG/PDF}
        │   └── 📈 {PIPELINE}_rulegraph.{PNG/PDF}
        └── 📂 tools-log/
            ├── 📂 awk/
            ├── 📂 bcftools/
            ├── 📂 bedtools/
            ├── 📂 bgzip/
            ├── 📂 bowtie2/
            ├── 📂 bwa/
            ├── 📂 cutadapt/
            ├── 📂 lofreq/
            ├── 📂 nextclade/
            ├── 📂 pangolin/
            ├── 📂 samtools/
            ├── 📂 sed/
            ├── 📂 sickle-trim/
            ├── 📂 tabix/
            ├── 📝 fastq-screen.log
            ├── 📝 fastqc.log
            └── 📝 multiqc.log
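As a quick sanity check on the consensus FASTA files listed above, the sketch below reports, for each sequence, its length and its number of N bases. It is illustrative only: the helper name `consensus_n_summary` is hypothetical, and it assumes that masked low-coverage positions appear as N in the consensus, which this README does not state explicitly.

```shell
# Hypothetical helper (not shipped with GeVarLi).
# consensus_n_summary FASTA: print "name<TAB>length<TAB>n_count" for each record.
consensus_n_summary() {
    awk '
        # print the record collected so far, if any
        function emit() { if (name != "") printf "%s\t%d\t%d\n", name, len, n }
        /^>/ { emit(); name = substr($1, 2); len = 0; n = 0; next }
        { seq = $0; len += length(seq); n += gsub(/[Nn]/, "", seq) }
        END { emit() }
    ' "$1"
}
```

For example, `consensus_n_summary results/All_SARS-CoV-2_Wuhan_MN-908947-3_consensus_sequences.fasta` gives one line per sample when the default reference is used; a high N count suggests that large parts of the genome fell below the coverage threshold.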

Files Glossary
- BAM: Binary Alignment Map, compressed binary representation of the SAM files.
- BAI: BAM Indexes.
- FASTA: Fast-All, text-based format for representing either nucleotide sequences or amino acid (protein) sequences.
- FASTQ: FASTA with Quality, text-based format storing both a biological sequence and its corresponding quality scores.
- FAI: FASTA Indexes.
- SAM: Sequence Alignment Map, text-based format consists of a header and an alignment section.
- YAML: Commonly used for configuration files and in applications where data is being stored or transmitted.
- GZ: format used for file compression and decompression, normally used to compress just single files.
- TAR: Tarball, format collecting many files into one archive file; extract with `tar -xzvf archive.tar.gz`.
~ CONFIGURATION ~
You can edit the default settings in the config.yaml file, in the ./configuration/ directory:
Resources
Edit to match your hardware configuration
- cpus: for tools that support it (e.g. bwa), use at most n CPUs in parallel (default config: '8')
Note: snakemake (with the default Start bash script) will always use all CPUs to parallelize jobs
- ram: for tools that support it (e.g. samtools), limit memory usage to at most n GB (default config: '16' GB)
- tmpdir: for tools that support it (e.g. pangolin), specify where temporary files are written (default config: '$TMPDIR')
Consensus
- mincov: minimum coverage depth; regions covered below this depth are masked in the final consensus sequence (default: '30')
- minaf: minimum allele frequency allowed at the variant calling step (default: '0.2')
- reference: name of the reference sequence (FASTA file) used for mapping (default: 'SARS-CoV-2_Wuhan_MN-908947-3')
- iupac: output variants as IUPAC ambiguity codes (default: deactivated -> '')
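To illustrate what the mincov threshold means in practice, the sketch below turns a per-position depth table into BED intervals of positions covered below the threshold. It is an assumption-laden illustration, not GeVarLi's actual rule: the helper `low_cov_bed` is hypothetical, and the three-column input (chromosome, 1-based position, depth) is assumed to look like the output of `samtools depth -a`.

```shell
# Hypothetical helper (not shipped with GeVarLi).
# low_cov_bed MINCOV DEPTH_FILE: print BED intervals (0-based, half-open)
# covering consecutive positions whose depth is below MINCOV.
low_cov_bed() {
    awk -v mincov="$1" '
        BEGIN { OFS = "\t" }
        # print the interval being built, if any, and reset the state
        function emit() { if (in_run) print chrom, start, stop; in_run = 0 }
        {
            low = ($3 + 0 < mincov + 0)
            # close the interval on adequate depth, a chromosome change or a gap
            if (in_run && (!low || $1 != chrom || $2 - 1 != stop)) emit()
            if (low) {
                if (!in_run) { chrom = $1; start = $2 - 1; in_run = 1 }
                stop = $2
            }
        }
        END { emit() }
    ' "$2"
}
```

With the default threshold this would be called as `low_cov_bed 30 sample_depth.tsv`, where `sample_depth.tsv` is a placeholder file name.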
Aligner
- aligner: Map your reads using either bwa or bowtie2
Fastq-Screen
- config: path to the fastq-screen configuration file (default: 'configuration/fastq-screen/' [*] )
- subset: do not use the whole sequence file, but create a temporary dataset of this specified number of reads (default: '1000')
[*] configuration/fastq-screen/{aligner}.conf
- DATABASE: comment or uncomment (#) entries, or add your own 'DATABASE', to configure multi-genome screening
Cutadapt
- length: discard reads shorter than length, after trimming (default: '50')
- kits: sequence of an adapter ligated to the 3' end of the first read (default: 'truseq', 'nextera' and 'small' Illumina kits)
Sickle-trim
- quality: Q-phred score limit (default: '30')
- length: read length limit, after trimming (default: '50')
- command: the pipeline expects paired-end reads (default and should be: 'pe')
- encoding: if your data come from a recent Illumina run, keep 'sanger' (default and should be: 'sanger')
BWA
- path: path to BWA indexes (default: 'resources/indexes/bwa/')
- algorithm: algorithm for constructing the BWA index (default: deactivated -> '')
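Because indexes are built only once, on the first run, it can be useful to confirm that a BWA index is complete before troubleshooting a mapping failure. The sketch below is illustrative: the helper `bwa_index_complete` is hypothetical, the five extensions are those listed under indexes/bwa/ in the GeVarLi map, and it assumes the `{GENOME}` prefix equals the reference name.

```shell
# Hypothetical helper (not shipped with GeVarLi).
# bwa_index_complete INDEX_DIR GENOME: succeed only if all five BWA index files exist.
bwa_index_complete() {
    for ext in amb ann bwt pac sa; do
        [ -e "$1/$2.$ext" ] || return 1    # stop at the first missing file
    done
}
```

For example: `bwa_index_complete resources/indexes/bwa SARS-CoV-2_Wuhan_MN-908947-3 || echo "BWA index incomplete"`.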
Bowtie2
- sensitivity: preset for bowtie2 sensitivity (default config: '--sensitive')
- path: path to Bowtie2 indexes (default: 'resources/indexes/bowtie2/')
- algorithm: algorithm for constructing the Bowtie2 index (default: deactivated -> '')
Nextclade
- path: path to nextclade datasets
- dataset: Nextclade dataset (not used, set by Start_GeVarLi.sh depending on your reference genome)
GISAID (soon)
- username:
- threshold:
- name:
- country:
- identifier:
- year:
Environments
- frontend: conda frontend (default: 'mamba')
- osx/linux: conda environment paths/names for the osx and linux operating systems (default: workflow/envs/{tools}_v.{version}.yaml)
Note: edit only if you want to change some environments (e.g. to test a new version)
Operating System
- osx: operating system (default: 'osx', but it will be set by Start_GeVarLi.sh)
Note: only 'osx' or 'linux' are supported
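The operating system value can be derived from the kernel name reported by `uname -s`. The sketch below shows one way this mapping could be done; it is not taken from Start_GeVarLi.sh, and the helper name `map_os` is hypothetical.

```shell
# Hypothetical helper (not the actual logic of Start_GeVarLi.sh).
# map_os KERNEL_NAME: translate `uname -s` output into 'osx' or 'linux'.
map_os() {
    case "$1" in
        Darwin) echo "osx" ;;
        Linux)  echo "linux" ;;
        *)      echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}
```

Typical use would be `os=$(map_os "$(uname -s)")`, which fails with a message on any system other than macOS or Linux.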
GeVarLi map
🧩 GeVarLi/
├── 🖥️️ Start_GeVarLi.sh
├── 📚 README.md
├── 🪪 LICENSE
├── 🛑 .gitignore
├── 📂 .git/
├── 📂 .snakemake/
├── 📂 configuration/
│   ├── ⚙️ config.yaml
│   ├── ⚙️ fastq-screen.conf
│   └── ⚙️ multiqc.yaml
├── 📂 resources/
│   ├── 📂 genomes/
│   │   ├── 🧬 SARS-CoV-2_Wuhan_MN-908947-3.fasta
│   │   ├── 🧬 Monkeypox-virus_Zaire_AF-380138-1.fasta
│   │   ├── 🧬 Monkeypox-virus_UK_MT-903345-1.fasta
│   │   ├── 🧬 Swinepox-virus_India_MW-036632-1.fasta
│   │   ├── 🧬 Ebola-virus_Zaire_AF-272001-1.fasta
│   │   ├── 🧬 Ebola-virus_Sudan_MH-121162-1.fasta
│   │   ├── 🧬 Nipah-virus_Malaysia_AJ-564622-1.fasta
│   │   ├── 🧬 HIV-1_HXB2_K-03455-1.fasta
│   │   ├── 🧬 {your_favorite_genome_reference}.fasta
│   │   ├── 🧬 Echerichia-coli_CP-060121-1.fasta
│   │   ├── 🧬 Kanamycin-Resistance-Gene.fasta
│   │   ├── 🧬 NGS-adapters.fasta
│   │   ├── 🧬 Phi-X174_Coliphage_NC-001422-1.fasta
│   │   ├── 🧬 UniVec_wo_phiX-kanamycin-NGSseq.fasta
│   │   └── 🧬 {your_favorite_control_reference}.fasta
│   ├── 📂 indexes/
│   │   ├── 📂 bwa/
│   │   │   ├── 🗂️ {GENOME}.amb
│   │   │   ├── 🗂️ {GENOME}.ann
│   │   │   ├── 🗂️ {GENOME}.bwt
│   │   │   ├── 🗂️ {GENOME}.pac
│   │   │   └── 🗂️ {GENOME}.sa
│   │   └── 📂 bowtie2/
│   │       ├── 🗂️ {GENOME}.1.bt2
│   │       ├── 🗂️ {GENOME}.2.bt2
│   │       ├── 🗂️ {GENOME}.3.bt2
│   │       ├── 🗂️ {GENOME}.4.bt2
│   │       ├── 🗂️ {GENOME}.rev.1.bt2
│   │       └── 🗂️ {GENOME}.rev.2.bt2
│   ├── 📂 nextclade/
│   │   ├── 📂 sars-cov-2/
│   │   │   ├── 🌍 genemap.gff
│   │   │   ├── 🧪 primers.csv
│   │   │   ├── ✅ qc.json
│   │   │   ├── 🦠 reference.fasta
│   │   │   ├── 🧬 sequences.fasta
│   │   │   ├── 🏷️ tag.json
│   │   │   └── 🌳 tree.json
│   │   ├── 📂 MPXV/
│   │   │   ├── 🌍 genemap.gff
│   │   │   ├── 🧪 primers.csv
│   │   │   ├── ✅ qc.json
│   │   │   ├── 🦠 reference.fasta
│   │   │   ├── 🧬 sequences.fasta
│   │   │   ├── 🏷️ tag.json
│   │   │   └── 🌳 tree.json
│   │   ├── 📂 hMPWV/
│   │   │   ├── 🌍 genemap.gff
│   │   │   ├── 🧪 primers.csv
│   │   │   ├── ✅ qc.json
│   │   │   ├── 🦠 reference.fasta
│   │   │   ├── 🧬 sequences.fasta
│   │   │   ├── 🏷️ tag.json
│   │   │   └── 🌳 tree.json
│   │   └── 📂 hMPXV_B1/
│   │       ├── 🌍 genemap.gff
│   │       ├── 🧪 primers.csv
│   │       ├── ✅ qc.json
│   │       ├── 🦠 reference.fasta
│   │       ├── 🧬 sequences.fasta
│   │       ├── 🏷️ tag.json
│   │       └── 🌳 tree.json
│   ├── 📂 reads/
│   │   ├── 🛡️ .gitkeep
│   │   ├── 📦 {SAMPLE}_R1.fastq.gz
│   │   └── 📦 {SAMPLE}_R2.fastq.gz
│   ├── 📂 test_data/
│   │   ├── 🛡️ .gitkeep
│   │   ├── 📦 SARS-CoV-2_Omicron-BA.1.1_Covid-Seq-Lib-on-MiSeq_250000-reads_R1.fastq.gz
│   │   └── 📦 SARS-CoV-2_Omicron-BA.1.1_Covid-Seq-Lib-on-MiSeq_250000-reads_R2.fastq.gz
│   └── 📂 visuals/
│       ├── 📈 gevarli_filegraph.png
│       ├── 📈 gevarli_rulegraph.png
│       ├── 📈 indexing_genomes_rulegraph.png
│       └── 📈 quality_control_rulegraph.png
└── 📂 workflow/
    ├── 📂 environments/
    │   ├── 📂 linux/
    │   │   ├── 🍜 bamclipper_v.1.0.0.yaml
    │   │   ├── 🍜 bcftools_v.1.17.yaml
    │   │   ├── 🍜 bedtools_v.2.31.0.yaml
    │   │   ├── 🍜 bowtie2_v.2.5.1.yaml
    │   │   ├── 🍜 bwa_v.0.7.17.yaml
    │   │   ├── 🍜 cutadapt_v.4.4.yaml
    │   │   ├── 🍜 fastq-screen_v.0.15.3.yaml
    │   │   ├── 🍜 fastqc_v.0.12.1.yaml
    │   │   ├── 🍜 gawk_v.5.1.0.yaml
    │   │   ├── 🍜 lofreq_v.2.1.5.yaml
    │   │   ├── 🍜 multiqc_v.1.14.yaml
    │   │   ├── 🍜 nextclade_v.2.14.0.yaml
    │   │   ├── 🍜 pangolin_v.4.3.yaml
    │   │   ├── 🍜 samtools_v.1.17.yaml
    │   │   ├── 🍜 sickle-trim_v.1.33.yaml
    │   │   └── 🍜 workflow-base_v.2023.06.yaml
    │   └── 📂 osx/
    │       ├── 🍜 bamclipper_v.1.0.0.yaml
    │       ├── 🍜 bcftools_v.1.17.yaml
    │       ├── 🍜 bedtools_v.2.31.0.yaml
    │       ├── 🍜 bowtie2_v.2.5.1.yaml
    │       ├── 🍜 bwa_v.0.7.17.yaml
    │       ├── 🍜 cutadapt_v.4.4.yaml
    │       ├── 🍜 fastq-screen_v.0.15.3.yaml
    │       ├── 🍜 fastqc_v.0.12.1.yaml
    │       ├── 🍜 gawk_v.5.1.0.yaml
    │       ├── 🍜 lofreq_v.2.1.5.yaml
    │       ├── 🍜 multiqc_v.1.14.yaml
    │       ├── 🍜 nextclade_v.2.14.0.yaml
    │       ├── 🍜 pangolin_v.4.3.yaml
    │       ├── 🍜 samtools_v.1.17.yaml
    │       ├── 🍜 sickle-trim_v.1.33.yaml
    │       └── 🍜 workflow-base_v.2023.06.yaml
    └── 📂 snakefiles/
        ├── 📜 gevarli.smk
        ├── 📜 indexing_genomes.smk
        └── 📜 quality_control.smk
~ REFERENCES ~
Sustainable data analysis with Snakemake
Felix Mölder, Kim Philipp Jablonski, Brice Letcher, Michael B. Hall, Christopher H. Tomkins-Tinch, Vanessa Sochat, Jan Forster, Soohyun Lee, Sven O. Twardziok, Alexander Kanitz, Andreas Wilm, Manuel Holtgrewe, Sven Rahmann, Sven Nahnsen, Johannes Köster
F1000Research (2021)
DOI: https://doi.org/10.12688/f1000research.29032.2
Publication: https://f1000research.com/articles/10-33/v1
Source code: https://github.com/snakemake/snakemake
Documentation: https://snakemake.readthedocs.io/en/stable/index.html
Anaconda Software Distribution
Team
Computer software (2016)
DOI:
Publication: https://www.anaconda.com
Source code: https://github.com/conda/conda (conda)
Documentation: https://docs.conda.io (conda)
Source code: https://github.com/mamba-org/mamba (mamba)
Documentation: https://mamba.readthedocs.io/en/latest/index.html (mamba)
HAVoC, a bioinformatic pipeline for reference-based consensus assembly and lineage assignment for SARS-CoV-2 sequences
Phuoc Thien Truong Nguyen, Ilya Plyusnin, Tarja Sironen, Olli Vapalahti, Ravi Kant & Teemu Smura
BMC Bioinformatics volume 22, Article number: 373 (2021)
DOI: https://doi.org/10.1186/s12859-021-04294-2
Publication: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04294-2#Bib1
Source code: https://bitbucket.org/auto_cov_pipeline/havoc
Documentation: https://www2.helsinki.fi/en/projects/havoc
Nextclade: clade assignment, mutation calling and quality control for viral genomes
Ivan Aksamentov, Cornelius Roemer, Emma B. Hodcroft and Richard A. Neher
The Journal of Open Source Software
DOI: https://doi.org/10.21105/joss.03773
Publication: https://joss.theoj.org/papers/10.21105/joss.03773
Source code: https://github.com/nextstrain/nextclade
Documentation: https://clades.nextstrain.org
Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool
Áine O’Toole, Emily Scher, Anthony Underwood, Ben Jackson, Verity Hill, John T McCrone, Rachel Colquhoun, Chris Ruis, Khalil Abu-Dahab, Ben Taylor, Corin Yeats, Louis du Plessis, Daniel Maloney, Nathan Medd, Stephen W Attwood, David M Aanensen, Edward C Holmes, Oliver G Pybus and Andrew Rambaut
Virus Evolution, Volume 7, Issue 2 (2021)
DOI: https://doi.org/10.1093/ve/veab064
Publication: https://academic.oup.com/ve/article/7/2/veab064/6315289
Source code: https://github.com/cov-lineages/pangolin (pangolin)
Source code: https://github.com/cov-lineages/scorpio (scorpio)
Documentation: https://cov-lineages.org/index.html
Tabix: fast retrieval of sequence features from generic TAB-delimited files
Heng Li
Bioinformatics, Volume 27, Issue 5 (2011)
DOI: https://doi.org/10.1093/bioinformatics/btq671
Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042176/
Source code: https://github.com/samtools/samtools
Documentation: http://samtools.sourceforge.net
LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets
Andreas Wilm, Pauline Poh Kim Aw, Denis Bertrand, Grace Hui Ting Yeo, Swee Hoe Ong, Chang Hua Wong, Chiea Chuen Khor, Rosemary Petric, Martin Lloyd Hibberd and Niranjan Nagarajan
Nucleic Acids Research, Volume 40, Issue 22 (2012)
DOI: https://doi.org/10.1093/nar/gks918
Publication: https://pubmed.ncbi.nlm.nih.gov/23066108/
Source code: https://gitlab.com/treangenlab/lofreq (v2 used)
Source code: https://github.com/andreas-wilm/lofreq3 (see also v3 in Nim)
Documentation: https://csb5.github.io/lofreq
The AWK Programming Language
Al Aho, Brian Kernighan and Peter Weinberger
Addison-Wesley (1988)
ISBN: https://www.biblio.com/9780201079814
Publication:
Source code: https://github.com/onetrueawk/awk
Documentation: https://www.gnu.org/software/gawk/manual/gawk.html
BEDTools: a flexible suite of utilities for comparing genomic features
Aaron R. Quinlan and Ira M. Hall
Bioinformatics, Volume 26, Issue 6 (2010)
DOI: https://doi.org/10.1093/bioinformatics/btq033
Publication: https://academic.oup.com/bioinformatics/article/26/6/841/244688
Source code: https://github.com/arq5x/bedtools2
Documentation: https://bedtools.readthedocs.io/en/latest/
ARTIC Network
Authors
Journal (year)
DOI:
Publication:
Source code: https://github.com/artic-network/primer-schemes
Documentation:
Twelve years of SAMtools and BCFtools
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies and Heng Li
GigaScience, Volume 10, Issue 2 (2021)
DOI: https://doi.org/10.1093/gigascience/giab008
Publication: https://academic.oup.com/gigascience/article/10/2/giab008/6137722
Source code: https://github.com/samtools/samtools
Documentation: http://samtools.sourceforge.net
Fast and accurate short read alignment with Burrows-Wheeler Transform
Heng Li and Richard Durbin
Bioinformatics, Volume 25, Issue 14, Pages 1754-60 (2009)
DOI: https://doi.org/10.1093/bioinformatics/btp324
Publication: https://pubmed.ncbi.nlm.nih.gov/19451168
Source code: https://github.com/lh3/bwa
Documentation: http://bio-bwa.sourceforge.net
Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files
Joshi NA and Fass JN
(2011)
DOI: https://doi.org/
Publication:
Source code: https://github.com/najoshi/sickle
Documentation:
Cutadapt Removes Adapter Sequences From High-Throughput Sequencing Reads
Marcel Martin
EMBnet Journal, Volume 17, Article 1 (2011)
DOI: https://doi.org/10.14806/ej.17.1.200
Publication: http://journal.embnet.org/index.php/embnetjournal/article/view/200
Source code: https://github.com/marcelm/cutadapt
Documentation: https://cutadapt.readthedocs.io/en/stable/
MultiQC: summarize analysis results for multiple tools and samples in a single report
Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
Bioinformatics, Volume 32, Issue 19 (2016)
DOI: https://doi.org/10.1093/bioinformatics/btw354
Publication: https://academic.oup.com/bioinformatics/article/32/19/3047/2196507
Source code: https://github.com/ewels/MultiQC
Documentation: https://multiqc.info
FastQ Screen: A tool for multi-genome mapping and quality control
Wingett SW and Andrews S
F1000Research (2018)
DOI: https://doi.org/10.12688/f1000research.15931.2
Publication: https://f1000research.com/articles/7-1338/v2
Source code: https://github.com/StevenWingett/FastQ-Screen
Documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen
FastQC: A quality control tool for high throughput sequence data
Simon Andrews
Online (2010)
DOI: https://doi.org/
Publication:
Source code: https://github.com/s-andrews/FastQC
Documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc
Seqtk: A fast and lightweight tool for processing sequences in the FASTA or FASTQ format
Heng Li
Online (2014)
DOI: https://doi.org/
Publication:
Source code: https://github.com/lh3/seqtk
Documentation: https://bioweb.pasteur.fr/packages/pack@seqtk@1.3