Integrating plot ref

Commit 2ac24a64 authored 1 year ago by christine.tranchant_ird.fr
--- a/frangiPANe/report/frangiPANe_stats.ipynb
+++ b/frangiPANe/report/frangiPANe_stats.ipynb
@@ -3,15 +3,12 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "6c56e93f",
+   "id": "f0516f8e",
   "metadata": {},
   "outputs": [],
   "source": [
-    "out_dir = '/scratch/tranchant/rice-output'\n",
-    "fastq_dir = '/scratch/tranchant/data_test/fastq'\n",
-    "group_file = '/scratch/tranchant/data_test/rice_group.txt'\n",
-    "ref_file = '/scratch/tranchant/data_test/ref.fasta'\n",
-    "vec_file = '/scratch/tranchant/data_test/bank/UniVec_Core'\n"
+    "ref_png = '/scratch/tranchant/rice-output/04-stats/04-plots/00_ref.png'\n",
+    "ref_csv = '/scratch/tranchant/rice-output/04-stats/04-summary/00_ref.txt'\n"
   ]
  },
  {
@@ -34,7 +31,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "print(project_name, out_dir, ref_file, vec_file, group_file, fastq_dir) #,cpus)"
+    "print(out_dir, ref_file, vec_file, group_file, fastq_dir) #,cpus)"
   ]
  },
  {

-%% Cell type:code id:6c56e93f tags:
+%% Cell type:code id:f0516f8e tags:

 ``` python
-out_dir = '/scratch/tranchant/rice-output'
-fastq_dir = '/scratch/tranchant/data_test/fastq'
-group_file = '/scratch/tranchant/data_test/rice_group.txt'
-ref_file = '/scratch/tranchant/data_test/ref.fasta'
-vec_file = '/scratch/tranchant/data_test/bank/UniVec_Core'
+ref_png = '/scratch/tranchant/rice-output/04-stats/04-plots/00_ref.png'
+ref_csv = '/scratch/tranchant/rice-output/04-stats/04-summary/00_ref.txt'
 ```

 %% Cell type:markdown id: tags:

 ***

 [<img src="Images/up-arrow.png" alt="Top" width=2% align="right">](#home "Go back to the top")


 # <span style="color: #3987C4;">I - Workflow configuration <a class="anchor" id="workflow"></a></span>

 ### <span style="color: #919395"> _Parameters_  <a class="anchor" id="configinput"></a></span>

 %% Cell type:code id: tags:

 ``` python
-print(project_name, out_dir, ref_file, vec_file, group_file, fastq_dir) #,cpus)
+print(out_dir, ref_file, vec_file, group_file, fastq_dir) #,cpus)
 ```

 %% Cell type:markdown id: tags:

 ### <span style="color: #919395">_Preparing Genome Reference for next analysis_

 #### __Genome indexation__ and __Genome dashboard__

 This step runs `bwa index` when the index files are absent. The reference genome must be indexed before reads can be mapped against it.
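
The "index only if absent" check described above can be sketched as follows. This is a minimal illustration, not the frangiPANe implementation: the helper names are hypothetical, and the suffix list assumes the five files that classic `bwa index` writes next to the reference. The command is only built here, not executed.

```python
from pathlib import Path

# Files that `bwa index ref.fasta` writes next to the reference (classic bwa).
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index(ref_file):
    """Return the expected index files that do not exist yet (hypothetical helper)."""
    return [ref_file + s for s in BWA_INDEX_SUFFIXES if not Path(ref_file + s).exists()]


def bwa_index_command(ref_file):
    """Return the command to run, or None when every index file is already present."""
    if missing_bwa_index(ref_file):
        return ["bwa", "index", ref_file]
    return None
```

The returned list can be passed to `subprocess.run(cmd, check=True)` when it is not `None`.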

 %% Cell type:code id: tags:

 ``` python
 #from pathlib import Path
 import sys

 sys.path.append("/home/christine/Documents/Dev/frangiPANe_snake/workflow")
 from scripts import generate_stats as gs
 gs.dashboard_genome2("400",png,csv)
 ```

 %% Cell type:markdown id: tags:

 ### <span style="color: #919395">_Analyzing Group File_</span>

 %% Cell type:code id: tags:

 ``` python
 # Reading group file
 id_dict, df_group = read_group_file(group_file.value,logger)

 # Group file dashboard
 dashboard_group(df_group)
 bgc('LightBlue')
 ```

 %% Output

    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-1-e7c314ec7bb8> in <module>
          1 # Reading group file
    ----> 2 id_dict, df_group = read_group_file(group_file.value,logger)
          3
          4 # Group file dashboard
          5 dashboard_group(df_group)
    NameError: name 'read_group_file' is not defined

 %% Cell type:markdown id: tags:

 ***

 [<img src="Images/up-arrow.png" alt="Top" width=2% align="right">](#home "Go back to the top")


 # <span style="color: #3987C4;">II - frangiPANe Workflow <a class="anchor" id="workflow"></a></span>

 ### <span style="color: #919395"> _1 - Stats about raw data (fastq files)_

 #### __Generating fastq statistics with `fastq_stats`__

 This analysis creates several files, saved in the 00_fastq_stats directory:
 * one fastq-stat file per fastq file
 * one file compiling all the stats: all_fastq-stats.csv
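
To give an idea of what such per-file statistics contain, here is a small pure-Python sketch that counts reads and bases in a FASTQ file and derives a sequencing depth from the genome size. It is an assumption-laden illustration (4-line FASTQ records, hypothetical function and key names), not the tool the workflow actually calls.

```python
def fastq_summary(lines, genome_size):
    """Summarize an iterable of FASTQ lines, assuming strict 4-line records."""
    reads = 0
    bases = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each record holds the sequence
            reads += 1
            bases += len(line.strip())
    return {
        "reads": reads,
        "bases": bases,
        "mean_length": bases / reads if reads else 0.0,
        "depth": bases / genome_size,  # average coverage of the genome
    }


# Two hypothetical reads against a hypothetical 100 bp genome.
example_records = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGTAC", "+", "IIIIII"]
example_summary = fastq_summary(example_records, genome_size=100)
```

One such dictionary per fastq file could then be written as one row of a combined CSV.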

 %% Cell type:code id: tags:

 ``` python
 #Raw data dashboard
 dashboard_fastq(fastqstat_csv,total_genome_size,df_group)
 ```

 %% Cell type:markdown id: tags:

 ### <span style="color: #919395">_2 - Mapping the individuals reads against the reference genome_  <a class="anchor" id="mapping"></a></span>

 %% Cell type:markdown id: tags:

 #### __Generating mapping stats <a class="anchor" id="mappingstat">__

 Statistics are generated by `samtools flagstat` and saved in the _stat_ subdirectory of the _01_mapping-against_reference_ directory:

 * One flagstat file is generated for each bam file (http://www.htslib.org/doc/samtools-flagstat.html).

 * The _all_flagstat.csv_ file compiles all the stats.
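
As a rough sketch of how one flagstat file could be turned into a row of the compiled table, the snippet below parses the default text output of `samtools flagstat` and computes mapping rates. The function names and the example counts are hypothetical; only three categories are extracted, and the workflow's own parsing may differ.

```python
import re

# A flagstat line looks like: "<passed> + <failed> <label> (<optional details>)"
LINE_RE = re.compile(r"^(\d+) \+ (\d+) (.+?)(?: \(.*\))?$")
WANTED = ("in total", "mapped", "properly paired")


def parse_flagstat(text):
    """Return QC-passed counts for the categories listed in WANTED."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(3) in WANTED:
            counts[match.group(3)] = int(match.group(1))
    return counts


def mapping_rates(counts):
    """Express mapped and properly paired reads as percentages of the total."""
    total = counts["in total"]
    return {key: round(100 * counts[key] / total, 2) for key in ("mapped", "properly paired")}


# Made-up counts, laid out like flagstat's default text output.
example_flagstat = """1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
950 + 0 mapped (95.00% : N/A)
1000 + 0 paired in sequencing
900 + 0 properly paired (90.00% : N/A)
"""
example_counts = parse_flagstat(example_flagstat)
example_rates = mapping_rates(example_counts)
```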

 %% Cell type:code id: tags:

 ``` python
 ### Dashboard
 dashboard_flagstat(stat_file,df_group)

 bgc('LightBlue')
 ```

 %% Cell type:markdown id: tags:

 [<img src="Images/up-arrow.png" alt="Top" width=2% align="right">](#home "Go back to the top")


 ### <span style="color: #919395">3 - Assembly of the individuals' reads that do not map (properly) on the reference genome <a class="anchor" id="assembly"></a></span>

 %% Cell type:code id: tags:

 ``` python
 dashboard_ab(stat_len,stats_N,stats_L,output_assembly_testplots)

 bgc('LightBlue')
 ```

 %% Cell type:markdown id: tags:

 #### __Assembly step 2: assembly with the final k value__

 ### Running ABySS for each individual

 %% Cell type:code id: tags:

 ``` python
 dashboard_assembly(stat_file,df_group)
 ```

 %% Cell type:markdown id: tags:

 [<img src="Images/up-arrow.png" alt="Top" width=2% align="right">](#home "Go back to the top")

 ### <span style="color: #919395"> 4 - Removing contamination<a class="anchor" id="contamination"></a></span>

 #### __VecScreen__

 %% Cell type:code id: tags:

 ``` python
 dashboard_ass(final_stat_file,df_group)

 bgc('LightBlue')
 ```

 %% Cell type:markdown id: tags:

 [<img src="Images/up-arrow.png" alt="Top" width=2% align="right">](#home "Go back to the top")

 ### <span style="color: #919395"> 5 - Reducing Sequence Redundancy<a class="anchor" id="redundancy"></a></span>

 frangiPANe uses CD-HIT to cluster sequences and reduce sequence redundancy (inter- and intra-species).
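
To illustrate how clustering reduces redundancy, the sketch below counts the members of each cluster in a CD-HIT `.clstr` file. It assumes the usual layout (a `>Cluster N` header followed by one line per member sequence); the function name and the contig names in the example are hypothetical.

```python
def cluster_sizes(clstr_text):
    """Map each cluster header to its number of member sequences."""
    sizes = {}
    current = None
    for line in clstr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # drop the leading ">"
            sizes[current] = 0
        elif current is not None:
            sizes[current] += 1  # any other line is one member sequence
    return sizes


# Hypothetical clustering of three contigs into two clusters.
example_clstr = """>Cluster 0
0\t1500nt, >contig_A... *
1\t1480nt, >contig_B... at +/98.50%
>Cluster 1
0\t900nt, >contig_C... *
"""
example_sizes = cluster_sizes(example_clstr)
sequences_in = sum(example_sizes.values())  # sequences before clustering
sequences_out = len(example_sizes)          # one representative kept per cluster
```

Here three input sequences collapse to two representatives, which is the kind of reduction the dashboard summarizes.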

 %% Cell type:code id: tags:

 ``` python
 #Dashboard
 dashboard_cdhit(df_cdhit)

 bgc('LightBlue')
 ```

 %% Cell type:markdown id: tags:

 [<img src="Images/up-arrow.png" alt="Top" width=2% align="right">](#home "Go back to the top")

 ### <span style="color: #919395"> 6 - Anchoring Clusters on Reference Genome<a class="anchor" id="anchoring"></a></span>

 #### __Generating panreference__

 %% Cell type:code id: tags:

 ``` python
 dashboard_flagstat(stat2_file,df_group)


 bgc('LightBlue')
 ```

 %% Cell type:markdown id: tags:

 #### __Panreference dashboard__

 %% Cell type:code id: tags:

 ``` python
 dashboard_anchoring(cdhit_fasta,panref_keep_file,panref_bed_file, output_dir, anc_stat_dict)

 bgc('LightBlue')
 ```