Commit b8ebf795 authored by jacques.dainat_ird.fr's avatar jacques.dainat_ird.fr

polish awk part

parent ffb19f96
Pipeline #84419 passed
......@@ -9,7 +9,7 @@
## Concept
`Grep` stands for "global regular expression print". It searches through the contents of files for lines that match a specified **pattern**.
`Grep` stands for "global regular expression print". It searches through the contents of files (or streams) for lines that match a specified **pattern**.
The basic syntax is:
```bash
grep [options] pattern [file...]
......
# AWK
## setup
??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!"
{%
include-markdown "pages/bash_manip/bash_manip-0-setup.md"
%}
## Filtering a file (awk)
## Replacing patterns (sed)
# Regular Expression
Regular expressions (regex) are sequences of characters that define a search pattern. They are used for pattern matching within strings.
Regular expressions (regex) are sequences of characters that define a search pattern. They are used for pattern matching within **strings**.
They are a powerful tool for text processing and can be used in various command-line utilities like `grep`, `sed`, and `awk` to search, match, and manipulate text.
## Regular Expression Summary
......@@ -40,7 +40,50 @@ It is possible to use POSIX character classes:
!!! Warning
Do not confuse regular expressions with **Globbing** (pathname expansion), which is used to match filenames!
`?` Any single character
`*` Zero or more characters
`[]` Specify a range. Any character of the range or none of them using `!` inside the bracket.
`{term1,term2}` Specify a list of terms separated by commas and each term must be a name or a wildcard.
`{term1..term2}` Called brace expansion, this syntax expands all the terms between term1 and term2 (Letters or Integers).
\ No newline at end of file
`*` Zero or more characters
`[]` Specify a range. Any character of the range or none of them using `!` inside the bracket.
`{term1,term2}` Specify a list of terms separated by commas and each term must be a name or a wildcard.
`{term1..term2}` Called brace expansion, this syntax expands all the terms between term1 and term2 (Letters or Integers).
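To make the difference between globbing and regular expressions concrete, here is a minimal sketch. The file names (`sample1.txt`, `notes.md`, ...) are hypothetical and are created in a temporary directory, so nothing in your working directory is touched.

```bash
# Work in a throwaway directory (hypothetical file names)
workdir=$(mktemp -d)
touch "$workdir/sample1.txt" "$workdir/sample2.txt" "$workdir/sample10.txt" "$workdir/notes.md"

# Globbing: the shell expands the pattern into file names before the command runs
glob_one=$(cd "$workdir" && echo sample?.txt)   # ? matches exactly one character
glob_any=$(cd "$workdir" && echo sample*.txt)   # * matches zero or more characters

# Regex: grep matches the pattern against lines of text (here, the output of ls)
regex_count=$(ls "$workdir" | grep -c -E '^sample[0-9]\.txt$')

echo "glob ?     : $glob_one"
echo "glob *     : $glob_any"
echo "regex hits : $regex_count"

rm -rf "$workdir"
```

In a glob, `?` stands for one character on its own, whereas in a regex `?` is a quantifier applied to the preceding element, which is why the two patterns above look so different.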
## Example
```
??????@ start @??????
I love having cake on Sundays.
Macarons are great, but Mille-feuille is on another level!
What are you up to next Sunday?
Feel free to reach out by email at me@example.com.
Otherwise give me a call at 123-456-789.
Cheers!
??????@ end @??????
```
!!! question "Find lines with a question"
Right, these are the lines ending with `?`, but how do we avoid the first and last lines?
??? example "Click to show the solution"
```bash
grep -E "[A-Za-z ]+\?$" text.txt
grep -E "[[:alnum:] ]+\?$" text.txt
grep -E "[^?]+\?$" text.txt
```
!!! question "Find lines with email address"
??? example "Click to show the solution"
```bash
grep -E "[a-zA-Z0-9]+@[a-zA-Z0-9]+\.com" text.txt
grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" text.txt # more generalized
```
!!! question "Find the phone number ensuring the format XXX-XXX-XXX"
??? example "Click to show the solution"
```bash
grep -E '[0-9]{3}-[0-9]{3}-[0-9]{3}' text.txt
grep -P '\d{3}-\d{3}-\d{3}' text.txt # \d requires -P (Perl-compatible regex, GNU grep)
```
......@@ -10,74 +10,47 @@
## Searching patterns (grep)
!!! question "Select all lines related to the year 2021 in the `nat2021.csv` file"
Back to our data file `nat2021.csv` containing first names given to children born in France since 1900.
Let's play with some RegEx...
??? example "Click to show the solution"
```bash
grep ";2021;" nat2021.csv
```
!!! question "How many names have been provided in 2021?"
!!! question "How would you describe the data structure with a regex that matches every line except the header (e.g. `1;PRENOMS;1904;1430`)?"
??? example "Click to show the solution"
```bash
grep ";2021;" nat2021.csv | wc -l
# result: 13501
[12];[A-Za-z]+;[0-9]{4};[0-9]+
[12];[A-Za-z-]+;[0-9]{4};[0-9]+ # also handles compound first names (-)
[12];[A-Za-z_-]+;[0-9]{4};[0-9]+ # also handles _PRENOMS_RARES (- and _)
```
!!! question "Is there more diversity in male or female names in 2021?"
!!! question "What names have been provided more than 10 000 times in 1980?"
??? example "Click to show the solution"
```bash
# female
grep ";2021;" nat2021.csv | grep "^2" | wc -l
# result: 7112
# male
grep ";2021;" nat2021.csv | grep "^1" | wc -l
# result: 6389
grep -E '[12];[A-Za-z]+;1980;[0-9]{5,}' nat2021.csv # add | wc -l to count
```
!!! question "How many people were named PARIS in 2021?"
!!! question "What names have been provided more than 10 000 times in 1980?"
??? example "Click to show the solution"
```bash
# female
grep "PARIS;2021;" nat2021.csv
# result 16 (5 male and 11 female)
grep -E '[12];[A-Za-z]+;1980;[2-9][0-9]{4,}' nat2021.csv
```
Rare names ([see here for documentation](https://www.insee.fr/fr/statistiques/2540004?sommaire=4767262#documentation)) are recorded as `_PRENOMS_RARES`.
!!! question "Can you find all rare names? Do you see any pattern?"
!!! question "List all names given more than 20 000 times in a single year, across all years. Next, remove redundancy (using `cut`, `sort` and `uniq`). Finally, count the number of lines."
??? example "Click to show the solution"
```bash
grep ";_PRENOMS_RARES;" nat2021.csv
grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv
```
People tend to give more and more rare names.
!!! question "What year was the most prolific for the name ZINEDINE?"
??? example "Click to show the solution"
??? example "Click to show the solution without redundancy"
```bash
# command
grep ";ZINEDINE;" nat2021.csv | sort -n -t ';' -k4
# result: 1998
grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv | cut -d ';' -f 2 | sort -u
```
You can redirect a result and store it in a file thanks to the `>` redirection:
`command > filename`
!!! question "Select all the names from 2005 in a dedicated file?"
??? example "Click to show the solution"
??? example "Click to show the solution without redundancy and with count"
```bash
# command
grep ";2005;" nat2021.csv
grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv | cut -d ';' -f 2 | sort -u | wc -l
# Result = 21
```
# Extracting from files
To do this exercise, you will need to download French first-name data from the "Institut national de la statistique et des études économiques" (INSEE):
```bash
wget https://www.insee.fr/fr/statistiques/fichier/2540004/nat2021_csv.zip
unzip nat2021_csv.zip
```
You should now have a file called `nat2021.csv` in your working directory.
## Filtering a file (awk)
## Replacing patterns (sed)
# AWK
## setup
??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!"
{%
include-markdown "pages/bash_manip/bash_manip-0-setup.md"
%}
## Concept
`AWK` is a powerful programming language designed for text processing and typically used as a data extraction and reporting tool. Named after its creators (Aho, Weinberger, and Kernighan), `AWK` is particularly well-suited for processing **columnar data** and performing complex text manipulations.
```bash
awk 'BEGIN { Initial action(s) } /pattern/ { by line action(s) } END { final action(s) }' file
```
**BEGIN Block**
This block runs once, before any lines are processed. It’s often used for initialization (e.g., setting field separators, printing headers).
**Pattern-Action Pair**
This is the core of awk. The pattern is matched against each line of the input, and when it matches, the action is executed.
**END Block**
This block runs once, after all lines have been processed. It’s useful for final actions like printing summary results.
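The three blocks can be sketched together on a small inline sample. The data below is hypothetical; it only mimics the `sex;name;year;count` layout of `nat2021.csv`.

```bash
# Hypothetical sample in the same layout as nat2021.csv (sex;name;year;count)
sample='1;PIERRE;2010;1500
2;MARIE;2010;1200
1;PIERRE;2011;1400'

summary=$(printf '%s\n' "$sample" | awk '
  BEGIN    { FS = ";"; print "Lines for PIERRE:" }   # runs once, before any input
  /PIERRE/ { print $3, $4; total += $4 }             # runs for each matching line
  END      { print "Total:", total }                 # runs once, after all input
')
echo "$summary"
```

The `BEGIN` block sets the field separator before the first line is split, the pattern-action pair accumulates the count for matching lines only, and the `END` block reports the sum.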
## Variables in awk
| Variable | Description |
|----------|----------|
| `$0` | The entire current record (line). |
| `$1`, `$2`, …, `$NF` | Represents the fields in the current record. `$1` is the first field, `$2` is the second, etc. `$NF` is the last field. |
| `NF` | The number of fields in the current record (i.e., the number of columns in a line). |
| `NR` | The number of records (lines) processed so far. |
| `FS` | The field separator, which determines how awk splits each line into fields. Default is whitespace. You can change it using -F (`-F 'separator'`) or inside awk code. |
| `OFS` | The output field separator used when printing fields. Default is a single space. |
| `ORS` | The output record separator used when printing records. Default is a newline. |
| `RS` | The record separator, which determines how awk separates input records. Default is a newline. |
| `FNR` | The record number in the current input file (resets for each new file). |
| `ARGV` | An array containing the command-line arguments passed to awk. |
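A short sketch of the most common variables in action, using a hypothetical two-line, semicolon-separated sample:

```bash
# Hypothetical two-line sample, semicolon-separated
sample='1;PIERRE;2010;1500
2;MARIE;2010;1200'

# -F sets FS (input separator); OFS is inserted wherever print has a comma
report=$(printf '%s\n' "$sample" | awk -F ';' '
  BEGIN { OFS = "\t" }
  { print NR, NF, $1, $NF }   # record number, field count, first and last field
')
echo "$report"
```

Each output line holds the record number, the number of fields (4 here), the first field and the last field, separated by tabs because `OFS` was changed.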
## Programming in awk
awk is a full-fledged programming language that supports control structures such as if-else and loops, making it powerful for text processing. Here’s a brief overview:
### if/else statement
```bash
awk '{ if (condition) { action1 } else { action2 } }' file.txt
```
**Comparison Operators**
You can compare numbers or strings using these operators:
| Operator | Description |
|----------|----------|
| `==` | Equal to |
| `!=` | Not equal to |
| `<` | Less than |
| `<=` | Less than or equal to |
| `>` | Greater than |
| `>=` | Greater than or equal to |
**Logical Operators**
You can combine conditions using logical operators:
| Operator | Description |
|----------|----------|
| `&&` | AND (Both conditions must be true) |
| `||` | OR (At least one condition must be true) |
| `!` | NOT (Negates the condition) |
**Pattern Matching with Regular Expressions**
You can use regular expressions with the ~ (matches) or !~ (does not match) operators.
| Operator | Description |
|----------|----------|
| `~` | Matches a regex pattern |
| `!~` | Does NOT match a regex pattern |
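The operators above can be combined freely in a condition. The following sketch uses a hypothetical four-line sample and shows one condition per family of operators:

```bash
# Hypothetical sample (sex;name;year;count)
sample='1;PIERRE;1925;9000
1;JEAN-PIERRE;2012;300
2;MARIE;2012;1200
1;PIERRE;2012;800'

# == tests exact string equality; && requires both conditions to hold
exact=$(printf '%s\n' "$sample" | awk -F ';' '$2 == "PIERRE" && $3 >= 2000 { print $0 }')

# ~ tests a regex match; ! negates the condition in parentheses
recent=$(printf '%s\n' "$sample" | awk -F ';' '$2 ~ /PIERRE/ && !($3 < 2000) { print $2 }')

# !~ tests the absence of a match; || requires at least one condition to hold
other=$(printf '%s\n' "$sample" | awk -F ';' '$2 !~ /PIERRE/ || $4 > 5000 { print $2 }')

echo "$exact"
echo "$recent"
echo "$other"
```

Note that `==` only selects the exact name `PIERRE`, while `~ /PIERRE/` also selects `JEAN-PIERRE` because the regex may match anywhere in the field.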
### loop
You can also use for loops and while loops in awk.
```bash
# FOR loop
awk '{ for (i = 1; i <= NF; i++) { print $i } }' file.txt
# WHILE loop
awk '{ i=1; while (i <= NF) { print $i; i++ } }' file.txt
```
Both loops print each field (`$i`) of every line, one field per output line. `NF` is the number of fields in the current record.
### Using Bash Variables in awk
You can pass variables from Bash into an awk program using the -v option.
```bash
my_var="PIERRE"
awk -v var="$my_var" '$2 == var { print $0 }' file.txt
```
## Exercise
!!! question "Print first lines using awk and head"
??? example "Click to show the solution"
```bash
awk -F ';' '{print $0}' nat2021.csv | head
```
!!! question "Print first lines using awk and head but skipping the first line"
??? example "Click to show the solution"
```bash
awk -F ';' 'NR > 1 {print $0}' nat2021.csv | head
```
!!! question "Print the second column of the file. Next, remove redundancy (using `cut`, `sort` and `uniq`). Finally, count the number of lines."
??? example "Click to show the solution"
```bash
awk -F ';' '{print $2}' nat2021.csv
```
??? example "Click to show the solution without redundancy"
```bash
awk -F ';' '{print $2}' nat2021.csv | cut -d ';' -f 2 | sort -u
```
??? example "Click to show the solution without redundancy and with count"
```bash
awk -F ';' '{print $2}' nat2021.csv | cut -d ';' -f 2 | sort -u | wc -l
# 36172 - this is the diversity of names in our file
```
!!! question "List the names containing PIERRE."
??? example "Click to show the solution"
```bash
# 3 solutions
awk -F ';' '/PIERRE/ {print $2}' nat2021.csv | sort -u
awk -F ';' '{if($2 ~ /PIERRE/ ) {print $2} }' nat2021.csv | sort -u
awk -F ';' '{print $2}' nat2021.csv | grep PIERRE | sort -u
```
!!! question "How many times has the name PIERRE been given each year after 2010? (Print the full line.)"
??? example "Click to show the solution"
```bash
awk -F ';' '{if($2 == "PIERRE" && ($3 > 2010 )) {print $0} }' nat2021.csv
```
!!! question "Print lines containing only PIERRE as name between 1920-1929 and between 2010-2019."
??? example "Click to show the solution"
```bash
awk -F ';' '{if($2 == "PIERRE" && ($3 ~ /192[0-9]/ || $3 ~ /201[0-9]/ )) {print $0} }' nat2021.csv
```
!!! question "List all the names and count how many times they have been given in total over the years"
??? example "Click to show the solution"
```bash
awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv
```
!!! question "Can you sort the previous result by number of times each name has been given?"
??? example "Click to show the solution"
```bash
awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv | sort -n -k 2
```
# Extracting from files
```bash
# or curl -O instead of wget
wget https://ftp.ensembl.org/pub/release-113/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz
gunzip Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz
mv Saccharomyces_cerevisiae.R64-1-1.113.gff3 yeast.gff
```
You should now have a file called `yeast.gff` in your working directory.
```
##gff-version 3
###
I sgd gene 335 649 . + . ID=gene:YAL069W;biotype=protein_coding;description=Dubious open reading frame%3B unlikely to encode a functional protein%2C based on available experimental and comparative sequence data [Source:SGD%3BAcc:S000002143];gene_id=YAL069W;logic_name=sgd
I sgd mRNA 335 649 . + . ID=transcript:YAL069W_mRNA;Parent=gene:YAL069W;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=YAL069W_mRNA
I sgd exon 335 649 . + . Parent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=YAL069W_mRNA-E1;rank=1
I sgd CDS 335 649 . + 0 ID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA;protein_id=YAL069W
###
```
The GFF/GTF format describes genomic features, such as genes, exons and CDS, in a standardized way.
Every line starting with `#` is a comment.
Every other line describes one feature and contains 9 tab-separated fields.
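The structure described above can be checked with a few lines of `awk`. This is a minimal sketch on a hypothetical miniature GFF3 file written to a temporary location; the same commands should apply to `yeast.gff`.

```bash
# Hypothetical miniature GFF3 file (tab-separated), written to a temporary file
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf 'I\tsgd\tgene\t335\t649\t.\t+\t.\tID=gene:YAL069W\n' >> "$gff"
printf 'I\tsgd\tmRNA\t335\t649\t.\t+\t.\tID=transcript:YAL069W_mRNA\n' >> "$gff"
printf 'I\tsgd\texon\t335\t649\t.\t+\t.\tName=YAL069W_mRNA-E1\n' >> "$gff"
printf '###\n' >> "$gff"

# Skip comment lines, then list the distinct field counts of the feature lines
n_fields=$(awk -F '\t' '!/^#/ { print NF }' "$gff" | sort -u)

# Print the type (column 3) and the length (end - start + 1) of each feature
features=$(awk -F '\t' '!/^#/ { print $3, $5 - $4 + 1 }' "$gff")

echo "fields per feature line: $n_fields"
echo "$features"
rm -f "$gff"
```

Setting `-F '\t'` matters here: with the default whitespace separator, any space inside the 9th column (for instance in a `description=` attribute) would split it into several fields.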
## Concept
`sed` (short for Stream Editor) is a powerful command-line tool used for text manipulation, allowing you to **search**, **replace**, **delete**, and **modify** text within files or streams.
**Syntax**
```bash
sed [Option(s)] 'Command(s)' [File(s)]
```
??? Note "Available Options"
| Option | Description |
|----------|----------|
| -n | Suppress automatic printing of pattern space
| -e | Add the script to the commands to be executed
| -f | Add the script file to the commands to be executed
| -i | Edit files in place (makes backup if extension supplied)
| -r | Use extended regular expressions in the script
| -s | Treat files as separate rather than as a single continuous long stream
At our level, the most useful options are `-n` and `-i`.
## Line selection
**Syntax**
```bash
sed -n 'line p' file
```
| Command | Description |
|----------|----------|
| `sed -n '8p' file` | Print line 8 |
| `sed -n '8p; 16p' file` | Print lines 8 and 16 |
| `sed -n '8,16 p' file` | Print lines from 8 to 16 |
| `sed -n '8,$ p' file` | Print lines from line 8 to the end of the file |
| `sed -n '1~8 p' file` | Print from line 1, every 8 lines |
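A quick way to try these addresses is to feed `sed` the output of `seq`, where each line contains its own line number. The sketch below assumes `seq` is available and leaves out the `first~step` form, which is a GNU extension.

```bash
# seq 1 20 prints the numbers 1 to 20, one per line
single=$(seq 1 20 | sed -n '8p')                        # line 8 only
range=$(seq 1 20 | sed -n '8,10p' | tr '\n' ' ')        # lines 8 to 10
to_end=$(seq 1 20 | sed -n '18,$p' | tr '\n' ' ')       # line 18 to the last line

# Without -n every line is printed once by default, and the p command
# prints the addressed line a second time
duplicated=$(seq 1 3 | sed '2p' | tr '\n' ' ')

echo "single     : $single"
echo "range      : $range"
echo "to the end : $to_end"
echo "without -n : $duplicated"
```

The last command illustrates why `-n` goes together with `p`: forgetting it does not raise an error, it silently duplicates the selected lines.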
## Line deletion
**Syntax**
```bash
sed 'line d' file
```
| Command | Description |
|----------|----------|
| `sed '8d' file` | Delete line 8 |
| `sed '8d; 16d' file` | Delete lines 8 and 16 |
| `sed '8,16 d' file` | Delete lines from 8 to 16 |
| `sed '8,$ d' file` | Delete lines from line 8 to the end of the file |
| `sed '1~8d' file` | Delete from line 1, every 8 lines |
## Use of Regular Expression
**Syntax**
```bash
sed 'RegEx' file
```
| Command | Description |
|----------|----------|
| `sed '/^#/d' file` | Delete lines starting with # |
| `sed -n '/\tmRNA\t/p' file` | Print lines matching mRNA surrounded by tabs |
## Substitution
**Syntax**
```bash
sed 's/pattern/replacement/' file
```
| Command | Description |
|----------|----------|
| `s/pattern/replacement/` | Substitute the first occurrence of pattern with replacement |
| `s/pattern/replacement/2` | Substitute the second occurrence of pattern with replacement |
| `s/pattern/replacement/g` | Substitute all occurrences of pattern with replacement |
| `s/pattern/replacement/i` | Substitute the first occurrence of pattern with replacement, ignoring case |
| `s/pattern/replacement/gi` | Substitute all occurrences of pattern with replacement, ignoring case |
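The effect of the flags is easiest to see on a single line containing several occurrences. This is a small sketch on a made-up string; the case-insensitive `i` flag is left out because it is a GNU extension that may not be available in every `sed`.

```bash
# Made-up line with two exact occurrences of "mRNA"
line='mRNA mrna MRNA mRNA'

first=$(echo "$line" | sed 's/mRNA/transcript/')     # first occurrence only
second=$(echo "$line" | sed 's/mRNA/transcript/2')   # second occurrence only
all=$(echo "$line" | sed 's/mRNA/transcript/g')      # every occurrence

echo "$first"
echo "$second"
echo "$all"
```

Matching is case-sensitive by default, so `mrna` and `MRNA` are left untouched in all three cases.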
## Extract value
It is possible to extract part of a line. Let's take the example of the extraction of a value from an attribute (`tag=value`) with tag `Name` of the 9th column of a GFF/GTF file.
**Syntax**
```bash
sed -n 's/.*Name=\([^;]*\);.*/\1/p' file
```
* `-n` Suppresses default output (only prints matches).
* `s/.../.../p` Substitutes text and prints only the matched part.
* `.*Name=` Matches everything before Name=.
* `\([^;]*\)` Captures everything after Name= until the first ;.
* `.*` Matches everything after ; (but doesn’t capture it).
* `\1` Outputs only the captured group (the Name value).
* `/p` Prints the line, but only when a substitution was made.
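As a sketch of how this behaves, the command can be run on two hypothetical feature lines modelled on the GFF excerpt, one with a `Name` attribute and one without:

```bash
# Two hypothetical tab-separated feature lines: only the first has a Name attribute
with_name=$(printf 'I\tsgd\texon\t335\t649\t.\t+\t.\tParent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;constitutive=1')
without_name=$(printf 'I\tsgd\tCDS\t335\t649\t.\t+\t0\tID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA')

# -n plus the p flag: only lines where the substitution succeeded are printed
name=$(printf '%s\n%s\n' "$with_name" "$without_name" | sed -n 's/.*Name=\([^;]*\);.*/\1/p')
echo "$name"
```

Only the value of `Name` from the first line is printed; the second line is dropped because nothing was substituted. Note that the pattern expects a `;` after the value, so a `Name` attribute placed last on the line would not be extracted by this exact expression.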
\ No newline at end of file
......@@ -6,8 +6,3 @@
</br>
# level 3 - Programming
<iframe id="iframepdf" src="../Bash_cheat_sheet_level3.pdf" frameborder="0" width="640" height="480" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
# Interesting resources
* [Software carpentry](https://swcarpentry.github.io/shell-novice/index.html)
* [gentoo linux](https://devmanual.gentoo.org/tools-reference/bash/index.html)
\ No newline at end of file
List of interesting resources:
# Interesting resources
If you are interested in learning more, here are some
reading tips for you:
* The Unix Shell (Software Carpentry): https://swcarpentry.github.io/shell-novice/01-intro.html
* Linux For Dummies (Southgreen): https://southgreenplatform.github.io/trainings/linux/
* Intro to Linux (NBIS): https://nbisweden.github.io/workshop-ngsintro/2311/home_schedule.html
* Introducing the Unix Command Line: https://iopn.library.illinois.edu/pressbooks/demystifyingtechnology/back-matter/introducing-the-unix-command-line/
* What is a Terminal?: https://itconnect.uw.edu/tools-services-support/teaching-learning/workshops/online-tutorials/what-is-a-terminal/#:~:text=Terminals%2C%20also%20known%20as%20command,of%20a%20graphical%20user%20interface.
\ No newline at end of file
* [The Unix Shell (Software Carpentry)](https://swcarpentry.github.io/shell-novice/01-intro.html)
* [Linux For Dummies (Southgreen)](https://southgreenplatform.github.io/trainings/linux/)
* [Intro to Linux (NBIS)](https://nbisweden.github.io/workshop-ngsintro/2311/home_schedule.html)
* [Introducing the Unix Command Line](https://iopn.library.illinois.edu/pressbooks/demystifyingtechnology/back-matter/introducing-the-unix-command-line/)
* [What is a Terminal?](https://itconnect.uw.edu/tools-services-support/teaching-learning/workshops/online-tutorials/what-is-a-terminal/#:~:text=Terminals%2C%20also%20known%20as%20command,of%20a%20graphical%20user%20interface)
* [gentoo linux](https://devmanual.gentoo.org/tools-reference/bash/index.html)
\ No newline at end of file
......@@ -110,8 +110,8 @@ nav:
- Grep (part1): pages/bash_manip/bash_manip-3-grep.md
- Regular expressions: pages/bash_manip/bash_manip-4-regex.md
- Grep (part2): pages/bash_manip/bash_manip-5-grep2.md
- Awk: pages/bash_manip/bash_manip-4-awk.md
- Sed: pages/bash_manip/bash_manip-5-sed.md
- Awk: pages/bash_manip/bash_manip-6-awk.md
- Sed: pages/bash_manip/bash_manip-7-sed.md
- Bash scripting:
- Course overview: pages/bash_script/bash_script-0-overview.md
- Introduction: pages/bash_script/bash_script-1-intro.md
......