polish awk part

Commit b8ebf795 authored 3 weeks ago by jacques.dainat_ird.fr
--- a/docs/pages/bash_manip/bash_manip-3-grep.md
+++ b/docs/pages/bash_manip/bash_manip-3-grep.md
@@ -9,7 +9,7 @@

 ## Concept

-`Grep` stands for "global regular expression print". It searches through the contents of files for lines that match a specified **pattern**.  
+`Grep` stands for "global regular expression print". It searches through the contents of files (or streams) for lines that match a specified **pattern**.  
 The basic syntax is:  
 ```bash
 grep [options] pattern [file...]

--- a/docs/pages/bash_manip/bash_manip-4-awk.md
+++ b/docs/pages/bash_manip/bash_manip-4-awk.md
-# AWK
-
-## setup
-
-??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!"
-    {%
-    include-markdown "pages/bash_manip/bash_manip-0-setup.md"
-    %}
-
-
-## Filtering a file (awk)
-
-
-
-## Replacing patterns (sed)
-
-
-
-
-
--- a/docs/pages/bash_manip/bash_manip-4-regex.md
+++ b/docs/pages/bash_manip/bash_manip-4-regex.md
 # Regular Expression

-Regular expressions (regex) are sequences of characters that define a search pattern. They are used for pattern matching within strings.  
+Regular expressions (regex) are sequences of characters that define a search pattern. They are used for pattern matching within **strings**.  
 It is a powerful tools for text processing and can be used in various command-line utilities like `grep`, `sed`, and `awk` to search, match, and manipulate text.

 ## Regular Expression Summary
@@ -40,7 +40,50 @@ It is possible to use POSIX character classes:
 !!! Warning
    Do not confound with **Globbing** (Pathname expansion) used to match filename!
    `?`  Any single character  
-    `*`  Zero or more characters
-    `[]` Specify a range. Any character of the range or none of them using `!` inside the bracket.
-    `{term1,term2}`  Specify a list of terms separated by commas and each term must be a name or a wildcard.
-    `{term1..term2}` Called brace expansion, this syntax expands all the terms between term1 and term2 (Letters or Integers).
\ No newline at end of file
+    `*`  Zero or more characters  
+    `[]` Specify a range. Any character of the range or none of them using `!` inside the bracket.  
+    `{term1,term2}`  Specify a list of terms separated by commas and each term must be a name or a wildcard.  
+    `{term1..term2}` Called brace expansion, this syntax expands all the terms between term1 and term2 (Letters or Integers).  
+
+## Example
+
+```
+??????@ start @??????
+I love having cake on Sundays.
+Macarons are great, but Mille-feuille is on another level!
+What are you up to next Sunday?
+Feel free to reach out by email at me@example.com.
+Otherwise give me a call at 123-456-789.
+Cheers!
+??????@ end @??????
+```
+
+!!! question "Find lines with a question"
+
+Right, these are the lines ending with `?`, but how can you avoid matching the first and last lines? 
+
+??? example "Click to show the solution"  
+
+    ```bash
+    grep -E "[A-Za-z ]+\?$" text.txt
+    grep -E "[[:alnum:] ]+\?$" text.txt
+    grep -E "[^?]+\?$" text.txt
+    ```
+
+!!! question "Find lines with email address"
+
+??? example "Click to show the solution"  
+
+    ```bash
+    grep -E "[a-zA-Z0-9]+@[a-zA-Z0-9]+\.com" text.txt
+    grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" text.txt # more generalized
+    ```
+
+!!! question "Find the phone number ensuring the format XXX-XXX-XXX"
+
+??? example "Click to show the solution"  
+
+    ```bash
+    grep -E '[0-9]{3}-[0-9]{3}-[0-9]{3}' text.txt
+    grep -P '\d{3}-\d{3}-\d{3}' text.txt # \d is a Perl-style class: it needs -P (GNU grep), not -E
+    ```
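To check the patterns above, the sample text can be saved to a file and each regex run against it. The following is a minimal sketch: the temporary directory and the variable names are only illustrative, and `grep -o` (print only the matched part) is assumed to be available, as it is in GNU and BSD grep.

```shell
# Work in a throwaway directory (illustrative; any directory works)
workdir=$(mktemp -d)

# Save the sample text from the example above
cat > "$workdir/text.txt" <<'EOF'
??????@ start @??????
I love having cake on Sundays.
Macarons are great, but Mille-feuille is on another level!
What are you up to next Sunday?
Feel free to reach out by email at me@example.com.
Otherwise give me a call at 123-456-789.
Cheers!
??????@ end @??????
EOF

# Lines ending with "?" where letters or spaces come right before the "?"
question=$(grep -E '[A-Za-z ]+\?$' "$workdir/text.txt")

# -o prints only the part of the line that matched the regex
email=$(grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' "$workdir/text.txt")
phone=$(grep -Eo '[0-9]{3}-[0-9]{3}-[0-9]{3}' "$workdir/text.txt")

printf '%s\n' "$question" "$email" "$phone"
rm -rf "$workdir"
```

The first and last lines also end with `?`, but the character just before that final `?` is another `?`, so the bracket expression cannot match there.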
--- a/docs/pages/bash_manip/bash_manip-5-grep2.md
+++ b/docs/pages/bash_manip/bash_manip-5-grep2.md
@@ -10,74 +10,47 @@

 ## Searching patterns (grep)

-!!! question "Select all line related of the year 2001 in `nat2021.csv` file"
+Back to our data file `nat2021.csv` containing first names given to children born in France since 1900.  
+Let's play with some RegEx...

-??? example "Click to show the solution"  
-    ```bash
-    grep ";2021;" nat2021.csv
-    ```
-
-!!! question "How many names have been provided in 2021?"
+!!! question "How would you describe the data structure with a regex that matches all lines except the header (e.g. `1;PRENOMS;1904;1430`)?"

 ??? example "Click to show the solution"  
    ```bash
-    grep ";2021;" nat2021.csv | wc -l
-    # result: 13501
+    [12];[A-Za-z]+;[0-9]{4};[0-9]+
+    [12];[A-Za-z-]+;[0-9]{4};[0-9]+ # to also handle compound first names (-)
+    [12];[A-Za-z_-]+;[0-9]{4};[0-9]+ # to also handle _PRENOMS_RARES (- and _)
    ```

-!!! question "Is there more diversity in male or female names in 2021"?
+!!! question "What names have been provided more than 10 000 times in 1980?"

 ??? example "Click to show the solution"  
    ```bash
-    # female
-    grep ";2021;" nat2021.csv | grep "^2" | wc -l
-    # result: 7112
-    # male
-    grep ";2021;" nat2021.csv | grep "^1" | wc -l
-    # result: 6389
+    grep -E '[12];[A-Za-z]+;1980;[0-9]{5,}' nat2021.csv # add | wc -l to count 
    ```
-
-!!! question "How many person are called PARIS in 2021"?
+!!! question "What names have been provided more than 20 000 times in 1980?"

 ??? example "Click to show the solution"  
    ```bash
-    # female
-    grep "PARIS;2021;" nat2021.csv
-    # result 16 (5 male and 11 female)
+    grep -E '[12];[A-Za-z]+;1980;[2-9][0-9]{4,}' nat2021.csv 
    ```

-The rare name ([see here for documentation](https://www.insee.fr/fr/statistiques/2540004?sommaire=4767262#documentation)) are set as `_PRENOMS_RARES`.
-
-!!! question "Could you find all rare name ? Do you see any pattern?"
+!!! question "List all names provided more than 20 000 times in a year, over all the years. Then remove redundancy (using `cut`, `sort` and `uniq`). Finally, count the number of lines."

 ??? example "Click to show the solution"  
    ```bash
-    grep ";_PRENOMS_RARES;" nat2021.csv
+    grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv 
    ```
-    People tends to provide more and more rare names.
-
-
-!!! question "What year was the most prolific fot the name ZINEDINE?"

-??? example "Click to show the solution"  
+??? example "Click to show the solution without redundancy"  
    ```bash
-    # command
-    grep ";ZINEDINE;" nat2021.csv | sort -n -t ';' -k4
-    # result: 1998
+    grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv | cut -d ';' -f 2 | sort -u 
    ```

-
-You can redirect a result and store it in a file thanks to the `>` redirection:  
-`command > filename`
-
-!!! question "Select all the names from 2005 in a dedicated file?"
-
-??? example "Click to show the solution"  
+??? example "Click to show the solution without redundancy and with count"  
    ```bash
-    # command
-    grep ";2005;" nat2021.csv 
+    grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv | cut -d ';' -f 2 | sort -u | wc -l
+    # Result = 21 
    ```
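The filter, deduplicate and count pipeline can be tried on a tiny file first. This is only a sketch: the rows below are invented for illustration (they merely follow the `sexe;name;year;count` layout of `nat2021.csv`), and the regex is anchored with `^` and `$` so that it has to describe the whole line.

```shell
workdir=$(mktemp -d)

# Invented sample rows, NOT real INSEE figures
cat > "$workdir/sample.csv" <<'EOF'
sexe;preusuel;annais;nombre
1;JEAN;1946;53000
1;JEAN;1947;51000
2;MARIE;1946;30500
2;MARIE;1980;4200
1;NICOLAS;1980;27000
1;ZINEDINE;1998;300
EOF

# Keep rows whose count has 5+ digits and starts with 2-9,
# then keep the name column and remove duplicates
names=$(grep -E '^[12];[A-Za-z-]+;[0-9]{4};[2-9][0-9]{4,}$' "$workdir/sample.csv" \
  | cut -d ';' -f 2 | sort -u)

# Count the distinct names (tr strips the padding some wc versions add)
count=$(printf '%s\n' "$names" | wc -l | tr -d ' ')

printf '%s\n' "$names" "$count"
rm -rf "$workdir"
```

Note that a leading digit of `[2-9]` followed by four or more digits also skips six-digit counts starting with `1`, which is a limit of describing a numeric threshold with a regex.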


-
-
--- a/docs/pages/bash_manip/bash_manip-5-sed.md
+++ b/docs/pages/bash_manip/bash_manip-5-sed.md
-# Extracting from files
-
-To do this exercice you will need to download French First name data from "Institut national de la statistique
-et des études économiques"
-
-```bash
-wget https://www.insee.fr/fr/statistiques/fichier/2540004/nat2021_csv.zip
-unzip nat2021_csv.zip 
-```
-
-You should now have a file called `nat2021.csv` in your working directory.
-
-
-## Filtering a file (awk)
-
-
-
-## Replacing patterns (sed)
-
-
-
-
-
--- a/docs/pages/bash_manip/bash_manip-6-awk.md
+++ b/docs/pages/bash_manip/bash_manip-6-awk.md
+# AWK
+
+## setup
+
+??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!"
+    {%
+    include-markdown "pages/bash_manip/bash_manip-0-setup.md"
+    %}
+
+
+## Concept
+
+`AWK` is a powerful programming language designed for text processing and typically used as a data extraction and reporting tool. Named after its creators (Aho, Weinberger, and Kernighan), `AWK` is particularly well-suited for processing **columnar data** and performing complex text manipulations.
+
+
+```bash
+awk 'BEGIN { Initial action(s) } /pattern/ { by line action(s) } END { final action(s) }' file
+```
+
+**BEGIN Block**  
+This block runs once, before any lines are processed. It’s often used for initialization (e.g., setting field separators, printing headers).
+
+**Pattern-Action Pair**  
+This is the core of awk. The pattern is matched against each line of the input, and when it matches, the action is executed.
+
+**END Block**  
+This block runs once, after all lines have been processed. It’s useful for final actions like printing summary results.
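The three blocks can be seen working together in a small report. This is a minimal sketch on invented rows (the file name `sample.csv` and the counts are hypothetical, only the `;`-separated layout follows `nat2021.csv`).

```shell
workdir=$(mktemp -d)

# Invented sample rows, NOT real INSEE figures
cat > "$workdir/sample.csv" <<'EOF'
sexe;preusuel;annais;nombre
1;PIERRE;2010;100
1;PIERRE;2011;90
2;MARIE;2010;200
EOF

report=$(awk '
  BEGIN      { FS = ";"; print "year count" }   # runs once, before the first line
  /;PIERRE;/ { print $3, $4; total += $4 }      # runs for each line matching the pattern
  END        { print "total", total }           # runs once, after the last line
' "$workdir/sample.csv")

printf '%s\n' "$report"
rm -rf "$workdir"
```

Setting `FS` in `BEGIN` has the same effect as passing `-F ';'` on the command line.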
+
+## Variables in awk
+
+| Variable | Description |
+|----------|----------|
+| `$0` | The entire current record (line). |
+| `$1`, `$2`, …, `$NF` | The fields of the current record. `$1` is the first field, `$2` is the second, etc. `$NF` is the last field. |
+| `NF` | The number of fields in the current record (i.e., the number of columns in a line). |
+| `NR` | The number of records (lines) processed so far. |
+| `FS` | The field separator, which determines how awk splits each line into fields. Default is whitespace. You can change it using `-F` (`-F 'separator'`) or inside awk code. |
+| `OFS` | The output field separator used when printing fields. Default is a single space. |
+| `ORS` | The output record separator used when printing records. Default is a newline. |
+| `RS` | The record separator, which determines how awk separates input records. Default is a newline. |
+| `FNR` | The record number in the current input file (resets for each new file). |
+| `ARGV` | An array containing the command-line arguments passed to awk. |
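A quick way to get a feel for these variables is to print a few of them for every line. The sketch below uses a two-line input made up for the purpose.

```shell
summary=$(printf 'a;b;c\nd;e\n' | awk '
  BEGIN { FS = ";"; OFS = "\t" }   # split input on ";", join printed fields with a tab
  { print NR, NF, $1, $NF }        # record number, field count, first and last field
')

printf '%s\n' "$summary"
```

The second line has only two fields, so `NF` is 2 there and `$NF` refers to its second field.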
+
+## Programming in awk
+
+awk is a full-fledged programming language that supports control structures such as if-else and loops, making it powerful for text processing. Here’s a brief overview:
+
+### if/else statement
+
+```bash
+awk '{ if (condition) { action1 } else { action2 } }' file.txt
+```
+
+**Comparison Operators**
+
+You can compare numbers or strings using these operators:
+
+| Operator | Description |
+|----------|----------|
+| `==` | Equal to |
+| `!=` | Not equal to |
+| `<` | Less than |
+| `<=` | Less than or equal to |
+| `>` | Greater than |
+| `>=` | Greater than or equal to |
+
+**Logical Operators**
+
+You can combine conditions using logical operators:
+
+| Operator | Description |
+|----------|----------|
+| `&&` | AND (both conditions must be true) |
+| `||` | OR (at least one condition must be true) |
+| `!` | NOT (negates the condition) |
+
+**Pattern Matching with Regular Expressions**
+
+You can use regular expressions with the `~` (matches) or `!~` (does not match) operators.
+
+| Operator | Description |
+|----------|----------|
+| `~` | Matches a regex pattern |
+| `!~` | Does NOT match a regex pattern |
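Comparison, logical and matching operators can be combined in a single `if` condition. The following is a sketch on invented rows (hypothetical counts, same `;`-separated layout as `nat2021.csv`) that labels each line as kept or skipped.

```shell
workdir=$(mktemp -d)

# Invented sample rows, NOT real INSEE figures
cat > "$workdir/sample.csv" <<'EOF'
sexe;preusuel;annais;nombre
1;JEAN;1946;53000
1;JEAN-PIERRE;1946;4000
1;JEAN;1980;9000
1;JEAN;2005;1500
2;JEANNE;2005;800
EOF

labelled=$(awk -F ';' '
  NR > 1 {   # skip the header line
    # name starts with JEAN, has no hyphen, and year is before 1950 or from 2000 on
    if ($2 ~ /^JEAN/ && $2 !~ /-/ && ($3 < 1950 || $3 >= 2000)) {
      print "keep", $2, $3
    } else {
      print "skip", $2, $3
    }
  }
' "$workdir/sample.csv")

printf '%s\n' "$labelled"
rm -rf "$workdir"
```

The parentheses around the `||` part matter: `&&` binds more tightly than `||`, so without them the year test `$3 >= 2000` would apply on its own.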
+
+### loop
+
+You can also use for loops and while loops in awk.
+
+```bash
+# FOR loop
+awk '{ for (i = 1; i <= NF; i++) { print $i } }' file.txt
+# WHILE loop
+awk '{ i=1; while (i <= NF) { print $i; i++ } }' file.txt
+```
+
+Both loops print each field (`$i`) of every line, one field per output line. `NF` is the number of fields in the current record.
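Loops become more useful when they accumulate something. As a small sketch on made-up input, the following sums the fields of each line.

```shell
sums=$(printf '1 2 3\n4 5\n' | awk '{
  total = 0                                   # reset the accumulator for each line
  for (i = 1; i <= NF; i++) { total += $i }   # visit every field of the line
  print "line " NR ": " total
}')

printf '%s\n' "$sums"
```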
+
+### Using Bash Variables in awk
+
+You can pass variables from Bash into an awk program using the -v option.
+
+```bash
+my_var="PIERRE"
+awk -v var="$my_var" '$2 == var { print $0 }' file.txt
+```
+
+## Exercise
+
+!!! question "Print first lines using awk and head"
+
+??? example "Click to show the solution"  
+    ```bash
+    awk -F ';' '{print $0}' nat2021.csv | head
+    ```
+
+!!! question "Print first lines using awk and head but skipping the first line"
+
+??? example "Click to show the solution"  
+    ```bash
+    awk -F ';' 'NR > 1 {print $0}' nat2021.csv | head
+    ```
+
+!!! question "Print the second column of the file. Then remove redundancy (using `sort` and `uniq`). Finally, count the number of lines."
+
+
+??? example "Click to show the solution"  
+    ```bash
+    awk -F ';' '{print $2}' nat2021.csv
+    ```
+
+??? example "Click to show the solution without redundancy"  
+    ```bash
+    awk -F ';' '{print $2}' nat2021.csv | sort -u 
+    ```
+
+??? example "Click to show the solution without redundancy and with count"  
+    ```bash
+    awk -F ';' '{print $2}' nat2021.csv | sort -u | wc -l
+    # 36172 - this is the diversity of names in our file
+    ```
+
+!!! question "List the names containing PIERRE."
+
+??? example "Click to show the solution"  
+    ```bash
+    # 3 solutions
+    awk -F ';' '/PIERRE/ {print $2}' nat2021.csv | sort -u
+    awk -F ';' '{if($2 ~ /PIERRE/ ) {print $2} }' nat2021.csv | sort -u
+    awk -F ';' '{print $2}' nat2021.csv | grep PIERRE | sort -u
+    ```
+
+!!! question "How many times has the name PIERRE been given each year after 2010? (Print the full lines.)"
+
+??? example "Click to show the solution"  
+    ```bash
+    awk -F ';' '{if($2 == "PIERRE" && ($3 > 2010 )) {print $0} }' nat2021.csv
+    ```
+
+!!! question "Print the lines whose name is exactly PIERRE for the years 1920-1929 and 2010-2019."
+
+??? example "Click to show the solution"  
+    ```bash
+    awk -F ';' '{if($2 == "PIERRE" && ($3 ~ /192[0-9]/ || $3 ~ /201[0-9]/ )) {print $0} }' nat2021.csv
+    ```
+
+!!! question "List all the names and count how many times they have been given in total over the years"
+
+??? example "Click to show the solution"  
+    ```bash
+    awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv
+    ```
+
+!!! question "Can you sort the previous result by number of times each name has been given?"
+
+??? example "Click to show the solution"  
+    ```bash
+    awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv | sort -n -k 2
+    ```
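The last two solutions rely on an awk associative array: `names[$2] += $4` creates an entry per name on first use and adds the count to it. Here is a sketch on invented rows (hypothetical counts) that also skips the header with `NR > 1`, which the solutions above do not do.

```shell
workdir=$(mktemp -d)

# Invented sample rows, NOT real INSEE figures
cat > "$workdir/sample.csv" <<'EOF'
sexe;preusuel;annais;nombre
1;PIERRE;2010;100
1;PIERRE;2011;90
2;MARIE;2010;200
2;ANNE;2011;50
EOF

totals=$(awk -F ';' '
  NR > 1 { names[$2] += $4 }                            # one running total per name
  END    { for (name in names) print name, names[name] }
' "$workdir/sample.csv" | sort -k 2,2n)                 # "for in" order is unspecified, so sort

printf '%s\n' "$totals"
rm -rf "$workdir"
```

Without `NR > 1`, the header adds one extra entry keyed by the column title, with a total of 0.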
--- a/docs/pages/bash_manip/bash_manip-7-sed.md
+++ b/docs/pages/bash_manip/bash_manip-7-sed.md
+# Extracting from files
+
+
+```bash
+# or curl -O instead of wget
+wget https://ftp.ensembl.org/pub/release-113/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz
+gunzip Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz
+mv Saccharomyces_cerevisiae.R64-1-1.113.gff3 yeast.gff
+```
+
+You should now have a file called `yeast.gff` in your working directory.
+
+```
+##gff-version 3
+###
+I	sgd	gene	335	649	.	+	.	ID=gene:YAL069W;biotype=protein_coding;description=Dubious open reading frame%3B unlikely to encode a functional protein%2C based on available experimental and comparative sequence data [Source:SGD%3BAcc:S000002143];gene_id=YAL069W;logic_name=sgd
+I	sgd	mRNA	335	649	.	+	.	ID=transcript:YAL069W_mRNA;Parent=gene:YAL069W;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=YAL069W_mRNA
+I	sgd	exon	335	649	.	+	.	Parent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=YAL069W_mRNA-E1;rank=1
+I	sgd	CDS	335	649	.	+	0	ID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA;protein_id=YAL069W
+###
+```
+
+The GFF/GTF format describes genomic features, such as genes, exons and CDS, in a standardized way.
+Every line starting with `#` is a comment. 
+Every other line describes one feature and contains 9 tab-separated fields.
+
+
+## Concept
+
+`sed` (short for Stream Editor) is a powerful command-line tool used for text manipulation, allowing you to **search**, **replace**, **delete**, and **modify** text within files or streams.
+
+**Syntax**
+
+```bash
+sed [Option(s)] 'Command(s)' [File(s)]
+```
+??? Note "Available Options"
+    | Option | Description |
+    |----------|----------|
+    | `-n` | Suppress automatic printing of pattern space |
+    | `-e` | Add the script to the commands to be executed |
+    | `-f` | Add the script file to the commands to be executed |
+    | `-i` | Edit files in place (makes backup if extension supplied) |
+    | `-r` | Use extended regular expressions in the script |
+    | `-s` | Treat files as separate rather than as a single continuous long stream |
+
+At our level, the most useful options are `-n` and `-i`.
+
+## Line selection 
+
+**Syntax**
+
+```bash
+sed -n 'line p' file
+```
+
+| Command | Description |
+|----------|----------|
+| `sed -n '8p' file` | Print line 8 |
+| `sed -n '8p; 16p' file` | Print lines 8 and 16 |
+| `sed -n '8,16 p' file` | Print lines from 8 to 16 |
+| `sed -n '8,$ p' file` | Print lines from line 8 to the end of the file |
+| `sed -n '1~8 p' file` | Print from line 1, every 8 lines |
+
+
+## Line deletion 
+
+**Syntax**
+
+```bash
+sed 'line d' file
+```
+
+| Command | Description |
+|----------|----------|
+| `sed '8d' file` | Delete line 8 |
+| `sed '8d; 16d' file` | Delete lines 8 and 16 |
+| `sed '8,16 d' file` | Delete lines from 8 to 16 |
+| `sed '8,$ d' file` | Delete lines from line 8 to the end of the file |
+| `sed '1~8d' file` | Delete from line 1, every 8 lines |
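Line addresses work the same way for printing and deleting, which is easy to verify on a numbered stream. The sketch below generates ten lines with awk (the `line N` content is made up) and applies a few of the addresses from the tables above.

```shell
# Build a 10-line stream: "line 1" ... "line 10"
numbered=$(awk 'BEGIN { for (i = 1; i <= 10; i++) print "line " i }')

printed=$(printf '%s\n' "$numbered" | sed -n '3,5p')     # print only lines 3 to 5
tail_part=$(printf '%s\n' "$numbered" | sed -n '9,$p')   # print from line 9 to the end
deleted=$(printf '%s\n' "$numbered" | sed '2,$d')         # delete line 2 to the end

printf '%s\n' "$printed" "$tail_part" "$deleted"
```

Printing needs `-n` to silence the default output, whereas deleting does not: every line that is not deleted is printed automatically.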
+
+## Use of Regular Expression 
+
+**Syntax**
+
+```bash
+sed '/RegEx/ command' file
+```
+
+| Command | Description |
+|----------|----------|
+| `sed '/^#/d' file` | Delete lines starting with # |
+| `sed -n '/\tmRNA\t/p' file` | Print lines matching mRNA surrounded by tabulations |
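Both commands can be tried on a few GFF-like lines. In this sketch the file `mini.gff` and its shortened attribute column are made up for illustration. Because `\t` inside a sed regex is a GNU extension, a literal tab stored in a shell variable is used instead, which should work with any sed.

```shell
workdir=$(mktemp -d)
tab=$(printf '\t')   # a literal tab character

# Four lines in GFF3 layout; the 9th column is shortened for readability
printf '##gff-version 3\n' > "$workdir/mini.gff"
printf 'I\tsgd\tgene\t335\t649\t.\t+\t.\tID=gene:YAL069W\n' >> "$workdir/mini.gff"
printf 'I\tsgd\tmRNA\t335\t649\t.\t+\t.\tID=transcript:YAL069W_mRNA\n' >> "$workdir/mini.gff"
printf 'I\tsgd\texon\t335\t649\t.\t+\t.\tName=YAL069W_mRNA-E1\n' >> "$workdir/mini.gff"

# Drop comment lines, then count what is left
features=$(sed '/^#/d' "$workdir/mini.gff" | wc -l | tr -d ' ')

# Print only lines whose feature type is mRNA, then keep the 3rd column
mrna=$(sed -n "/${tab}mRNA${tab}/p" "$workdir/mini.gff" | cut -f 3)

printf '%s\n' "$features" "$mrna"
rm -rf "$workdir"
```

Surrounding `mRNA` with tabs is what keeps the exon line out of the result, even though its attributes contain the text `mRNA`.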
+
+## Substitution
+
+**Syntax**
+
+```bash
+sed 's/pattern/replacement/' file
+```
+
+| Command | Description |
+|----------|----------|
+| `s/pattern/replacement/` | Substitute the first occurrence of pattern with replacement |
+| `s/pattern/replacement/2` | Substitute the second occurrence of pattern with replacement |
+| `s/pattern/replacement/g` | Substitute all occurrences of pattern with replacement |
+| `s/pattern/replacement/i` | Substitute the first occurrence of pattern with replacement, ignoring case |
+| `s/pattern/replacement/gi` | Substitute all occurrences of pattern with replacement, ignoring case |
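The effect of the occurrence flags is easiest to see on a string that contains the pattern several times. A minimal sketch on a made-up input:

```shell
first=$(printf 'a-b-c-d\n' | sed 's/-/_/')     # first occurrence only
second=$(printf 'a-b-c-d\n' | sed 's/-/_/2')   # second occurrence only
all=$(printf 'a-b-c-d\n' | sed 's/-/_/g')      # every occurrence

printf '%s\n' "$first" "$second" "$all"
```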
+
+## Extract value
+
+It is possible to extract part of a line. Let's take the example of the extraction of a value from an attribute (`tag=value`) with tag `Name` of the 9th column of a GFF/GTF file.
+
+**Syntax**
+
+```bash
+sed -n 's/.*Name=\([^;]*\);.*/\1/p' file
+```
+
+* `-n` Suppresses the default output, so only lines printed explicitly appear.
+* `s/pattern/replacement/` Replaces what the pattern matches with the replacement.
+* `.*Name=` Matches everything up to and including `Name=`.
+* `\([^;]*\)` Captures everything after `Name=` up to the first `;`.
+* `;.*` Matches that `;` and everything after it (without capturing it).
+* `\1` Stands for the captured group (the `Name` value), which becomes the whole line.
+* `p` Prints the line, but only if a substitution was made.
\ No newline at end of file
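The extraction can be tested on two feature lines, one with a `Name` attribute and one without. In this sketch the exon line is adapted from the `yeast.gff` excerpt with its attribute list shortened, and the shell variables are only illustrative.

```shell
tab=$(printf '\t')   # a literal tab character to build the 9 columns

# Exon line (has Name=...) and CDS line (has no Name attribute)
exon_line="I${tab}sgd${tab}exon${tab}335${tab}649${tab}.${tab}+${tab}.${tab}Parent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;constitutive=1;rank=1"
cds_line="I${tab}sgd${tab}CDS${tab}335${tab}649${tab}.${tab}+${tab}0${tab}ID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA;protein_id=YAL069W"

# Only the line where the substitution succeeds is printed
name=$(printf '%s\n' "$exon_line" "$cds_line" | sed -n 's/.*Name=\([^;]*\);.*/\1/p')

printf '%s\n' "$name"
```

The pattern requires a `;` after the value, so a `Name` attribute placed last on the line would not be extracted by this expression.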
--- a/docs/pages/cheat_sheet/bash/bash.md
+++ b/docs/pages/cheat_sheet/bash/bash.md
@@ -6,8 +6,3 @@
 </br>
 # level 3 - Programming
 <iframe id="iframepdf" src="../Bash_cheat_sheet_level3.pdf" frameborder="0" width="640" height="480" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe> 
-
-# Interesting ressources
-
-* [Software carpentry](https://swcarpentry.github.io/shell-novice/index.html)
-* [gentoo linux](https://devmanual.gentoo.org/tools-reference/bash/index.html)
\ No newline at end of file
--- a/docs/pages/cheat_sheet/interesting_ressources.md
+++ b/docs/pages/cheat_sheet/interesting_ressources.md
-List of interesting ressources:
+# Interesting resources

 If you are interested in learning more, here are some
 reading tips for you:

-* The Unix Shell (datacarpentry): https://swcarpentry.github.io/shell-novice/01-intro.html
-* Linux For Dummies (Southgreen): https://southgreenplatform.github.io/trainings/linux/
-* Intro to Linux (NBIS): https://nbisweden.github.io/workshop-ngsintro/2311/home_schedule.html 
-* Introducing the Unix Command Line: https://iopn.library.illinois.edu/pressbooks/demystifyingtechnology/back-matter/introducing-the-unix-command-line/
-* What is a Terminal?: https://itconnect.uw.edu/tools-services-support/teaching-learning/workshops/online-tutorials/what-is-a-terminal/#:~:text=Terminals%2C%20also%20known%20as%20command,of%20a%20graphical%20user%20interface.
\ No newline at end of file
+* [The Unix Shell (Software Carpentry)](https://swcarpentry.github.io/shell-novice/01-intro.html)
+* [Linux For Dummies (Southgreen)](https://southgreenplatform.github.io/trainings/linux/)
+* [Intro to Linux (NBIS)](https://nbisweden.github.io/workshop-ngsintro/2311/home_schedule.html)
+* [Introducing the Unix Command Line](https://iopn.library.illinois.edu/pressbooks/demystifyingtechnology/back-matter/introducing-the-unix-command-line/)
+* [What is a Terminal?](https://itconnect.uw.edu/tools-services-support/teaching-learning/workshops/online-tutorials/what-is-a-terminal/#:~:text=Terminals%2C%20also%20known%20as%20command,of%20a%20graphical%20user%20interface)
+* [Gentoo Linux](https://devmanual.gentoo.org/tools-reference/bash/index.html)
\ No newline at end of file
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -110,8 +110,8 @@ nav:
        - Grep (part1): pages/bash_manip/bash_manip-3-grep.md
        - Regular expressions: pages/bash_manip/bash_manip-4-regex.md
        - Grep (part2): pages/bash_manip/bash_manip-5-grep2.md
-        - Awk: pages/bash_manip/bash_manip-4-awk.md
-        - Sed: pages/bash_manip/bash_manip-5-sed.md
+        - Awk: pages/bash_manip/bash_manip-6-awk.md
+        - Sed: pages/bash_manip/bash_manip-7-sed.md
    - Bash scripting:
        - Course overview: pages/bash_script/bash_script-0-overview.md
        - Introduction: pages/bash_script/bash_script-1-intro.md