From b8ebf7953d74ea82c1aef1e98432ea8eb00e0ec1 Mon Sep 17 00:00:00 2001 From: Jacques Dainat <jacques.dainat@ird.fr> Date: Wed, 5 Mar 2025 17:12:53 +0100 Subject: [PATCH] polish awk part --- docs/pages/bash_manip/bash_manip-3-grep.md | 2 +- docs/pages/bash_manip/bash_manip-4-awk.md | 20 -- docs/pages/bash_manip/bash_manip-4-regex.md | 53 +++++- docs/pages/bash_manip/bash_manip-5-grep2.md | 61 ++---- docs/pages/bash_manip/bash_manip-5-sed.md | 23 --- docs/pages/bash_manip/bash_manip-6-awk.md | 179 ++++++++++++++++++ docs/pages/bash_manip/bash_manip-7-sed.md | 127 +++++++++++++ docs/pages/cheat_sheet/bash/bash.md | 5 - .../cheat_sheet/interesting_ressources.md | 13 +- mkdocs.yml | 4 +- 10 files changed, 381 insertions(+), 106 deletions(-) delete mode 100644 docs/pages/bash_manip/bash_manip-4-awk.md delete mode 100644 docs/pages/bash_manip/bash_manip-5-sed.md create mode 100644 docs/pages/bash_manip/bash_manip-6-awk.md create mode 100644 docs/pages/bash_manip/bash_manip-7-sed.md diff --git a/docs/pages/bash_manip/bash_manip-3-grep.md b/docs/pages/bash_manip/bash_manip-3-grep.md index ecfebc4..0b5ce27 100644 --- a/docs/pages/bash_manip/bash_manip-3-grep.md +++ b/docs/pages/bash_manip/bash_manip-3-grep.md @@ -9,7 +9,7 @@ ## Concept -`Grep` stands for "global regular expression print". It searches through the contents of files for lines that match a specified **pattern**. +`Grep` stands for "global regular expression print". It searches through the contents of files (or streams) for lines that match a specified **pattern**. The basic syntax is: ```bash grep [options] pattern [file...] diff --git a/docs/pages/bash_manip/bash_manip-4-awk.md b/docs/pages/bash_manip/bash_manip-4-awk.md deleted file mode 100644 index bf121f8..0000000 --- a/docs/pages/bash_manip/bash_manip-4-awk.md +++ /dev/null @@ -1,20 +0,0 @@ -# AWK - -## setup - -??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!" 
- {% - include-markdown "pages/bash_manip/bash_manip-0-setup.md" - %} - - -## Filtering a file (awk) - - - -## Replacing patterns (sed) - - - - - diff --git a/docs/pages/bash_manip/bash_manip-4-regex.md b/docs/pages/bash_manip/bash_manip-4-regex.md index d846cca..bfb7a6d 100644 --- a/docs/pages/bash_manip/bash_manip-4-regex.md +++ b/docs/pages/bash_manip/bash_manip-4-regex.md @@ -1,6 +1,6 @@ # Regular Expression -Regular expressions (regex) are sequences of characters that define a search pattern. They are used for pattern matching within strings. +Regular expressions (regex) are sequences of characters that define a search pattern. They are used for pattern matching within **strings**. It is a powerful tools for text processing and can be used in various command-line utilities like `grep`, `sed`, and `awk` to search, match, and manipulate text. ## Regular Expression Summary @@ -40,7 +40,50 @@ It is possible to use POSIX character classes: !!! Warning Do not confound with **Globbing** (Pathname expansion) used to match filename! `?` Any single character - `*` Zero or more characters - `[]` Specify a range. Any character of the range or none of them using `!` inside the bracket. - `{term1,term2}` Specify a list of terms separated by commas and each term must be a name or a wildcard. - `{term1..term2}` Called brace expansion, this syntax expands all the terms between term1 and term2 (Letters or Integers). \ No newline at end of file + `*` Zero or more characters + `[]` Specify a range. Any character of the range or none of them using `!` inside the bracket. + `{term1,term2}` Specify a list of terms separated by commas and each term must be a name or a wildcard. + `{term1..term2}` Called brace expansion, this syntax expands all the terms between term1 and term2 (Letters or Integers). + +## Example + +``` +??????@ start @?????? +I love having cake on Sundays. +Macarons are great, but Mille-feuille is on another level! +What are you up to next Sunday? 
+Feel free to reach out by email at me@example.com. +Otherwise give me a call at 123-456-789. +Cheers! +??????@ end @?????? +``` + +!!! question "Find lines with a question" + +Right, these are the lines ending with `?`, but how do we avoid matching the first and last lines? + +??? example "Click to show the solution" + + ```bash + grep -E "[A-Za-z ]+\?$" text.txt + grep -E "[[:alnum:] ]+\?$" text.txt + grep -E "[^?]+\?$" text.txt + ``` + +!!! question "Find lines with an email address" + +??? example "Click to show the solution" + + ```bash + grep -E "[a-zA-Z0-9]+@[a-zA-Z0-9]+\.com" text.txt + grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" text.txt # more general + ``` + +!!! question "Find the phone number ensuring the format XXX-XXX-XXX" + +??? example "Click to show the solution" + + ```bash + grep -E '[0-9]{3}-[0-9]{3}-[0-9]{3}' text.txt + grep -P '\d{3}-\d{3}-\d{3}' text.txt # \d is not part of ERE; it needs -P (PCRE, GNU grep) + ``` diff --git a/docs/pages/bash_manip/bash_manip-5-grep2.md b/docs/pages/bash_manip/bash_manip-5-grep2.md index 81f11cf..a43adc8 100644 --- a/docs/pages/bash_manip/bash_manip-5-grep2.md +++ b/docs/pages/bash_manip/bash_manip-5-grep2.md @@ -10,74 +10,47 @@ ## Searching patterns (grep) -!!! question "Select all line related of the year 2001 in `nat2021.csv` file" +Back to our data file `nat2021.csv` containing first names given to children born in France since 1900. +Let's play with some RegEx... -??? example "Click to show the solution" - ```bash - grep ";2021;" nat2021.csv - ``` - -!!! question "How many names have been provided in 2021?" +!!! question "How would you describe the data structure with a regex that matches every line except the header (e.g. `1;PRENOMS;1904;1430`)?" ??? example "Click to show the solution" ```bash - grep ";2021;" nat2021.csv | wc -l - # result: 13501 + [12];[A-Za-z]+;[0-9]{4};[0-9]+ + [12];[A-Za-z-]+;[0-9]{4};[0-9]+ # in case we want to handle compound first names (-) + [12];[A-Za-z_-]+;[0-9]{4};[0-9]+ # in case we also want to match _PRENOMS_RARES (- and _) ``` -!!! 
question "Is there more diversity in male or female names in 2021"? +!!! question "Which names were given at least 10 000 times in 1980?" ??? example "Click to show the solution" ```bash - # female - grep ";2021;" nat2021.csv | grep "^2" | wc -l - # result: 7112 - # male - grep ";2021;" nat2021.csv | grep "^1" | wc -l - # result: 6389 + grep -E '[12];[A-Za-z]+;1980;[0-9]{5,}' nat2021.csv # add | wc -l to count ``` - -!!! question "How many person are called PARIS in 2021"? +!!! question "Which names were given at least 20 000 times in 1980?" ??? example "Click to show the solution" ```bash - # female - grep "PARIS;2021;" nat2021.csv - # result 16 (5 male and 11 female) + grep -E '[12];[A-Za-z]+;1980;[2-9][0-9]{4,}' nat2021.csv ``` -The rare name ([see here for documentation](https://www.insee.fr/fr/statistiques/2540004?sommaire=4767262#documentation)) are set as `_PRENOMS_RARES`. - -!!! question "Could you find all rare name ? Do you see any pattern?" +!!! question "List all names given at least 20 000 times in a single year, over all years. Then remove redundancy (using `cut`, `sort` and `uniq`). Finally, count the number of lines." ??? example "Click to show the solution" ```bash - grep ";_PRENOMS_RARES;" nat2021.csv + grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv ``` - People tends to provide more and more rare names. - - -!!! question "What year was the most prolific fot the name ZINEDINE?" -??? example "Click to show the solution" +??? example "Click to show the solution without redundancy" ```bash - # command - grep ";ZINEDINE;" nat2021.csv | sort -n -t ';' -k4 - # result: 1998 + grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv | cut -d ';' -f 2 | sort -u ``` - -You can redirect a result and store it in a file thanks to the `>` redirection: -`command > filename` - -!!! question "Select all the names from 2005 in a dedicated file?" - -??? example "Click to show the solution" +??? 
example "Click to show the solution without redundancy and with count" ```bash - # command - grep ";2005;" nat2021.csv + grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv | cut -d ';' -f 2 | sort -u | wc -l + # Result = 21 ``` - - diff --git a/docs/pages/bash_manip/bash_manip-5-sed.md b/docs/pages/bash_manip/bash_manip-5-sed.md deleted file mode 100644 index 4498598..0000000 --- a/docs/pages/bash_manip/bash_manip-5-sed.md +++ /dev/null @@ -1,23 +0,0 @@ -# Extracting from files - -To do this exercice you will need to download French First name data from "Institut national de la statistique -et des études économiques" - -```bash -wget https://www.insee.fr/fr/statistiques/fichier/2540004/nat2021_csv.zip -unzip nat2021_csv.zip -``` - -You should now have a file called `nat2021.csv` in your working directory. - - -## Filtering a file (awk) - - - -## Replacing patterns (sed) - - - - - diff --git a/docs/pages/bash_manip/bash_manip-6-awk.md b/docs/pages/bash_manip/bash_manip-6-awk.md new file mode 100644 index 0000000..796f480 --- /dev/null +++ b/docs/pages/bash_manip/bash_manip-6-awk.md @@ -0,0 +1,179 @@ +# AWK + +## setup + +??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!" + {% + include-markdown "pages/bash_manip/bash_manip-0-setup.md" + %} + + +## Concept + +`AWK` is a powerful programming language designed for text processing and typically used as a data extraction and reporting tool. Named after its creators (Aho, Weinberger, and Kernighan), `AWK` is particularly well-suited for processing **columnar data** and performing complex text manipulations. + + +```bash +awk 'BEGIN { initial action(s) } /pattern/ { per-line action(s) } END { final action(s) }' file +``` + +**BEGIN Block** +This block runs once, before any lines are processed. It’s often used for initialization (e.g., setting field separators, printing headers). + +**Pattern-Action Pair** +This is the core of awk. 
The pattern is matched against each line of the input, and when it matches, the action is executed. + +**END Block** +This block runs once, after all lines have been processed. It’s useful for final actions like printing summary results. + +## Variables in awk + +| Variable | Description | +|----------|----------| +$0 | The entire current record (line). +$1, $2, …, $NF | The fields of the current record. $1 is the first field, $2 is the second, etc. $NF is the last field. +NF | The number of fields in the current record (i.e., the number of columns in a line). +NR | The number of records (lines) processed so far. +FS | The field separator, which determines how awk splits each line into fields. Default is whitespace. You can change it using -F (`-F 'separator'`) or inside awk code. +OFS | The output field separator used when printing fields. Default is a single space. +ORS | The output record separator used when printing records. Default is a newline. +RS | The record separator, which determines how awk separates input records. Default is a newline. +FNR | The record number in the current input file (resets for each new file). +ARGV | An array containing the command-line arguments passed to awk. + +## Programming in awk + +awk is a full-fledged programming language that supports control structures such as if-else and loops, making it powerful for text processing. Here’s a brief overview: + +### if/else statement + +```bash +awk '{ if (condition) { action1 } else { action2 } }' file.txt +``` + +**Comparison Operators** + +You can compare numbers or strings using these operators: + +| Operator | Description | +|----------|----------| +== | Equal to +!= | Not equal to +< | Less than +<= | Less than or equal to +> | Greater than +>= | Greater than or equal to + +**Logical Operators** + +You can combine conditions using logical operators: + +| Operator | Description +|----------|----------| +&& | AND (Both conditions must be true) +`||` | OR (At least one condition must be true) +! 
| NOT (Negates the condition) + +**Pattern Matching with Regular Expressions** + +You can use regular expressions with the ~ (matches) or !~ (does not match) operators. + +| Operator | Description +|----------|----------| +~ | Matches a regex pattern +!~ | Does NOT match a regex pattern + +### loop + +You can also use for loops and while loops in awk. + +```bash +# FOR loop +awk '{ for (i = 1; i <= NF; i++) { print $i } }' file.txt +# WHILE loop +awk '{ i=1; while (i <= NF) { print $i; i++ } }' file.txt +``` + +Both loops print each field ($i) of every line on its own line. NF is the number of fields in the current record. + +### Using Bash Variables in awk + +You can pass variables from Bash into an awk program using the -v option. + +```bash +my_var="PIERRE" +awk -v var="$my_var" '$2 == var { print $0 }' file.txt +``` + +## Exercise + +!!! question "Print the first lines using awk and head" + +??? example "Click to show the solution" + ```bash + awk -F ';' '{print $0}' nat2021.csv | head + ``` + +!!! question "Print the first lines using awk and head, but skip the header line" + +??? example "Click to show the solution" + ```bash + awk -F ';' 'NR > 1 {print $0}' nat2021.csv | head + ``` + +!!! question "Print the second column of the file. Then remove redundancy (using `sort` and `uniq`). Finally, count the number of lines." + + +??? example "Click to show the solution" + ```bash + awk -F ';' '{print $2}' nat2021.csv + ``` + +??? example "Click to show the solution without redundancy" + ```bash + awk -F ';' '{print $2}' nat2021.csv | sort -u + ``` + +??? example "Click to show the solution without redundancy and with count" + ```bash + awk -F ';' '{print $2}' nat2021.csv | sort -u | wc -l + # 36172 - this is the diversity of names in our file + ``` + +!!! question "List the names containing PIERRE." + +??? 
example "Click to show the solution" + ```bash + # 3 solutions + awk -F ';' '/PIERRE/ {print $2}' nat2021.csv | sort -u + awk -F ';' '{if($2 ~ /PIERRE/ ) {print $2} }' nat2021.csv | sort -u + awk -F ';' '{print $2}' nat2021.csv | grep PIERRE | sort -u + ``` + +!!! question "How many times has the name PIERRE been given each year after 2010? (Print the full lines.)" + +??? example "Click to show the solution" + ```bash + awk -F ';' '{if($2 == "PIERRE" && ($3 + 0 > 2010 )) {print $0} }' nat2021.csv # $3 + 0 forces a numeric comparison even if the year field is not a number + ``` + +!!! question "Print the lines whose name is exactly PIERRE for the years 1920-1929 and 2010-2019." + +??? example "Click to show the solution" + ```bash + awk -F ';' '{if($2 == "PIERRE" && ($3 ~ /192[0-9]/ || $3 ~ /201[0-9]/ )) {print $0} }' nat2021.csv + ``` + +!!! question "List all the names and count how many times each has been given in total over the years" + +??? example "Click to show the solution" + ```bash + awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv + ``` + +!!! question "Can you sort the previous result by the number of times each name has been given?" + +??? example "Click to show the solution" + ```bash + awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv | sort -n -k 2 + ``` diff --git a/docs/pages/bash_manip/bash_manip-7-sed.md b/docs/pages/bash_manip/bash_manip-7-sed.md new file mode 100644 index 0000000..301a84f --- /dev/null +++ b/docs/pages/bash_manip/bash_manip-7-sed.md @@ -0,0 +1,127 @@ +# Extracting from files + + +```bash +# or curl -O instead of wget +wget https://ftp.ensembl.org/pub/release-113/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz +gunzip Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz +mv Saccharomyces_cerevisiae.R64-1-1.113.gff3 yeast.gff +``` + +You should now have a file called `yeast.gff` in your working directory. + +``` +##gff-version 3 +### +I sgd gene 335 649 . + . 
ID=gene:YAL069W;biotype=protein_coding;description=Dubious open reading frame%3B unlikely to encode a functional protein%2C based on available experimental and comparative sequence data [Source:SGD%3BAcc:S000002143];gene_id=YAL069W;logic_name=sgd +I sgd mRNA 335 649 . + . ID=transcript:YAL069W_mRNA;Parent=gene:YAL069W;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=YAL069W_mRNA +I sgd exon 335 649 . + . Parent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=YAL069W_mRNA-E1;rank=1 +I sgd CDS 335 649 . + 0 ID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA;protein_id=YAL069W +### +``` + +The GFF/GTF format describes genomic features, such as genes, exons and CDS, in a standardized format. +Every line starting with `#` is a comment. +Every other line describes one feature and contains 9 tab-separated fields. + + +## Concept + +`sed` (short for Stream Editor) is a powerful command-line tool used for text manipulation, allowing you to **search**, **replace**, **delete**, and **modify** text within files or streams. + +**Syntax** + +```bash +sed [Option(s)] 'Command(s)' [File(s)] +``` +??? 
Note "Available Options" + | Option | Description | + |----------|----------| + | -n | Suppress automatic printing of pattern space + | -e | Add the script to the commands to be executed + | -f | Add the script file to the commands to be executed + | -i | Edit files in place (makes backup if extension supplied) + | -r | Use extended regular expressions in the script + | -s | Treat files as separate rather than as a single continuous long stream + +At our level, the most useful options are `-n` and `-i`. + +## Line selection + +**Syntax** + +```bash +sed -n 'line p' file +``` + +| Command | Description | +|----------|----------| +| `sed -n '8p' file` | Print line 8 | +| `sed -n '8p; 16p' file` | Print lines 8 and 16 | +| `sed -n '8,16 p' file` | Print lines from 8 to 16 | +| `sed -n '8,$ p' file` | Print lines from line 8 to the end of the file | +| `sed -n '1~8 p' file` | Print every 8th line, starting from line 1 (GNU sed) | + + +## Line deletion + +**Syntax** + +```bash +sed 'line d' file +``` + +| Command | Description | +|----------|----------| +| `sed '8d' file` | Delete line 8 | +| `sed '8d; 16d' file` | Delete lines 8 and 16 | +| `sed '8,16 d' file` | Delete lines from 8 to 16 | +| `sed '8,$ d' file` | Delete lines from line 8 to the end of the file | +| `sed '1~8d' file` | Delete every 8th line, starting from line 1 (GNU sed) | + +## Use of Regular Expressions + +**Syntax** + +```bash +sed '/RegEx/ command' file +``` + +| Command | Description | +|----------|----------| +| `sed '/^#/d' file` | Delete lines starting with `#` | +| `sed -n '/\tmRNA\t/p' file` | Print lines matching mRNA surrounded by tabs | + +## Substitution + +**Syntax** + +```bash +sed 's/pattern/replacement/' file +``` + +| Command | Description | +|----------|----------| +| `s/pattern/replacement/` | Substitute the first occurrence of pattern on each line with replacement | +| `s/pattern/replacement/2` | Substitute the second occurrence of pattern on each line with replacement | +| 
`s/pattern/replacement/g` | Substitute all occurrences of pattern with replacement 
| +| `s/pattern/replacement/i` | Substitute the first occurrence of pattern with replacement, ignoring case | +| `s/pattern/replacement/gi` | Substitute all occurrences of pattern with replacement, ignoring case | + +## Extract value + +It is possible to extract part of a line. As an example, let's extract the value of the attribute (`tag=value`) whose tag is `Name`, found in the 9th column of a GFF/GTF file. + +**Syntax** + +```bash +sed -n 's/.*Name=\([^;]*\);.*/\1/p' file +``` + +* `-n` Suppresses default output (only explicitly printed lines are shown). +* `s/.../.../p` Substitutes text and prints the line only if a substitution was made. +* `.*Name=` Matches everything up to and including Name=. +* `\([^;]*\)` Captures everything after Name= until the first ;. +* `.*` Matches everything after ; (but doesn’t capture it). +* `\1` Replaces the whole line with the captured group (the Name value). +* `/p` Prints the result of a successful substitution. \ No newline at end of file diff --git a/docs/pages/cheat_sheet/bash/bash.md b/docs/pages/cheat_sheet/bash/bash.md index cd96e05..322e75e 100644 --- a/docs/pages/cheat_sheet/bash/bash.md +++ b/docs/pages/cheat_sheet/bash/bash.md @@ -6,8 +6,3 @@ </br> # level 3 - Programming <iframe id="iframepdf" src="../Bash_cheat_sheet_level3.pdf" frameborder="0" width="640" height="480" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe> - -# Interesting ressources - -* [Software carpentry](https://swcarpentry.github.io/shell-novice/index.html) -* [gentoo linux](https://devmanual.gentoo.org/tools-reference/bash/index.html) \ No newline at end of file diff --git a/docs/pages/cheat_sheet/interesting_ressources.md b/docs/pages/cheat_sheet/interesting_ressources.md index 3164292..a1fba38 100644 --- a/docs/pages/cheat_sheet/interesting_ressources.md +++ b/docs/pages/cheat_sheet/interesting_ressources.md @@ -1,10 +1,11 @@ -List of interesting ressources: +# Interesting resources If you are interested in learning more, here are some reading tips for you: -* The Unix Shell (datacarpentry): 
https://swcarpentry.github.io/shell-novice/01-intro.html -* Linux For Dummies (Southgreen): https://southgreenplatform.github.io/trainings/linux/ -* Intro to Linux (NBIS): https://nbisweden.github.io/workshop-ngsintro/2311/home_schedule.html -* Introducing the Unix Command Line: https://iopn.library.illinois.edu/pressbooks/demystifyingtechnology/back-matter/introducing-the-unix-command-line/ -* What is a Terminal?: https://itconnect.uw.edu/tools-services-support/teaching-learning/workshops/online-tutorials/what-is-a-terminal/#:~:text=Terminals%2C%20also%20known%20as%20command,of%20a%20graphical%20user%20interface. \ No newline at end of file +* [The Unix Shell (datacarpentry)](https://swcarpentry.github.io/shell-novice/01-intro.html) +* [Linux For Dummies (Southgreen)](https://southgreenplatform.github.io/trainings/linux/) +* [Intro to Linux (NBIS)](https://nbisweden.github.io/workshop-ngsintro/2311/home_schedule.html) +* [Introducing the Unix Command Line](https://iopn.library.illinois.edu/pressbooks/demystifyingtechnology/back-matter/introducing-the-unix-command-line/) +* [What is a Terminal?](https://itconnect.uw.edu/tools-services-support/teaching-learning/workshops/online-tutorials/what-is-a-terminal/#:~:text=Terminals%2C%20also%20known%20as%20command,of%20a%20graphical%20user%20interface) +* [gentoo linux](https://devmanual.gentoo.org/tools-reference/bash/index.html) \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index a989ce0..9177095 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -110,8 +110,8 @@ nav: - Grep (part1): pages/bash_manip/bash_manip-3-grep.md - Regular expressions: pages/bash_manip/bash_manip-4-regex.md - Grep (part2): pages/bash_manip/bash_manip-5-grep2.md - - Awk: pages/bash_manip/bash_manip-4-awk.md - - Sed: pages/bash_manip/bash_manip-5-sed.md + - Awk: pages/bash_manip/bash_manip-6-awk.md + - Sed: pages/bash_manip/bash_manip-7-sed.md - Bash scripting: - Course overview: pages/bash_script/bash_script-0-overview.md - 
Introduction: pages/bash_script/bash_script-1-intro.md -- GitLab