frangiPANe_stats.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0516f8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "ref_png = '/scratch/tranchant/rice-output/04-stats/04-plots/00_ref.png'\n",
    "ref_csv = '/scratch/tranchant/rice-output/04-stats/04-summary/00_ref.txt'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "\n",
    "[<img src=\"Images/up-arrow.png\" alt=\"Top\" width=2% align=\"right\">](#home \"Go back to the top\")\n",
    "\n",
    "\n",
    "# <span style=\"color: #3987C4;\">I - Workflow configuration <a class=\"anchor\" id=\"workflow\"></a></span>\n",
    "\n",
    "### <span style=\"color: #919395\"> _Parameters_  <a class=\"anchor\" id=\"configinput\"></a></span>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(out_dir, ref_file, vec_file, group_file, fastq_dir) #,cpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color: #919395\">_Preparing Genome Reference for next analysis_\n",
    "\n",
    "#### __Genome indexation__ and __Genome dashboard__\n",
    "\n",
    "This step is done with `bwa index` if index are absent. Indexation is required before performing reads mapping against genome reference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#from pathlib import Path\n",
    "import sys\n",
    "\n",
    "sys.path.append(\"/home/christine/Documents/Dev/frangiPANe_snake/workflow\")\n",
    "from scripts import generate_stats as gs\n",
    "gs.dashboard_genome2(\"400\",png,csv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color: #919395\">_Analyzing Group File_</span> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'read_group_file' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-1-e7c314ec7bb8>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# Reading group file\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mid_dict\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdf_group\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mread_group_file\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mgroup_file\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mlogger\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;31m# Group file dashboard\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mdashboard_group\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf_group\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mNameError\u001b[0m: name 'read_group_file' is not defined"
     ]
    }
   ],
   "source": [
    "# Reading group file\n",
    "id_dict, df_group = read_group_file(group_file.value,logger)\n",
    "\n",
    "# Group file dashboard\n",
    "dashboard_group(df_group)\n",
    "bgc('LightBlue')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "\n",
    "[<img src=\"Images/up-arrow.png\" alt=\"Top\" width=2% align=\"right\">](#home \"Go back to the top\")\n",
    "\n",
    "\n",
    "# <span style=\"color: #3987C4;\">II - frangiPANe Workflow <a class=\"anchor\" id=\"workflow\"></a></span>\n",
    "\n",
    "### <span style=\"color: #919395\"> _1 - Stats about raw data (fastq files)_\n",
    "\n",
    "#### __Generating fastq statistics with `fastq_stats`__\n",
    "    \n",
    "After this stat analysis, several files have been created and saved into 00_fastq_stats directory :\n",
    "* one file (fastq-stat) by fastq file\n",
    "* one file with all stats : all_fastq-stats.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Raw data dashboard\n",
    "dashboard_fastq(fastqstat_csv,total_genome_size,df_group)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### <span style=\"color: #919395\">_2 - Mapping the individuals reads against the reference genome_  <a class=\"anchor\" id=\"mapping\"></a></span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### __Generating mapping stats <a class=\"anchor\" id=\"mappingstat\">__\n",
    "    \n",
    "Statistics are generated by `samtools flagstat` and they are saved into the directory _01_mapping-against_reference_ and the subdirectory _stat_\n",
    "\n",
    "* One \"flagtstat file\" is generated for each bam file (http://www.htslib.org/doc/samtools-flagstat.html).\n",
    "\n",
    "* _all_flagstat.csv_ file compiling all the stats\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Dashboard\n",
    "dashboard_flagstat(stat_file,df_group)\n",
    "\n",
    "bgc('LightBlue')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[<img src=\"Images/up-arrow.png\" alt=\"Top\" width=2% align=\"right\">](#home \"Go back to the top\")\n",
    "\n",
    "\n",
    "### <span style=\"color: #919395\">3 - Assembly of the individuals' reads that do not map (properly) on the reference genome <a class=\"anchor\" id=\"assembly\"></a></span>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "dashboard_ab(stat_len,stats_N,stats_L,output_assembly_testplots)\n",
    "\n",
    "bgc('LightBlue')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### __Assembly step 2 : assembly with the final k value__\n",
    "\n",
    "### Running ABySS for each individual\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dashboard_assembly(stat_file,df_group)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "[<img src=\"Images/up-arrow.png\" alt=\"Top\" width=2% align=\"right\">](#home \"Go back to the top\")\n",
    "    \n",
    "### <span style=\"color: #919395\"> 4 - Removing contamination<a class=\"anchor\" id=\"contamination\"></a></span>\n",
    "\n",
    "#### __VecScreen__\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dashboard_ass(final_stat_file,df_group)\n",
    "\n",
    "bgc('LightBlue')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "[<img src=\"Images/up-arrow.png\" alt=\"Top\" width=2% align=\"right\">](#home \"Go back to the top\")\n",
    "    \n",
    "### <span style=\"color: #919395\"> 5 - Reducing Sequence Redundancy<a class=\"anchor\" id=\"redundancy\"></a></span>\n",
    "\n",
    "frangiPANe uses CD-HIT to cluster sequences and to reduce sequence redundancy (inter and intra-species).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Dashboard\n",
    "dashboard_cdhit(df_cdhit)\n",
    "\n",
    "bgc('LightBlue')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[<img src=\"Images/up-arrow.png\" alt=\"Top\" width=2% align=\"right\">](#home \"Go back to the top\")\n",
    "    \n",
    "### <span style=\"color: #919395\"> 6 - Anchoring Clusters on Reference Genome<a class=\"anchor\" id=\"anchoring\"></a></span>\n",
    "\n",
    "#### __Generating panreference__\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dashboard_flagstat(stat2_file,df_group)\n",
    "\n",
    "\n",
    "bgc('LightBlue')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### __Panreference dashboard__\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dashboard_anchoring(cdhit_fasta,panref_keep_file,panref_bed_file, output_dir, anc_stat_dict)\n",
    "\n",
    "bgc('LightBlue')"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}