bash_manip-6-awk.md 5.83 KiB
jacques.dainat_ird.fr committed
# AWK

## Setup

??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!"
    {%
    include-markdown "pages/bash_manip/bash_manip-0-setup.md"
    %}


## Concept

`AWK` is a powerful programming language designed for text processing and typically used as a data extraction and reporting tool. Named after its creators (Aho, Weinberger, and Kernighan), `AWK` is particularly well-suited for processing **columnar data** and performing complex text manipulations.


```bash
awk 'BEGIN { Initial action(s) } /pattern/ { by line action(s) } END { final action(s) }' file
```

**BEGIN Block**  
This block runs once, before any lines are processed. It’s often used for initialization (e.g., setting field separators, printing headers).

**Pattern-Action Pair**  
This is the core of awk. The pattern is matched against each line of the input, and when it matches, the action is executed.

**END Block**  
This block runs once, after all lines have been processed. It’s useful for final actions like printing summary results.
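
As an illustration, the three parts can be combined in one small program. This is only a sketch: the file `toy.csv` and its content are invented for the example and are not taken from the course data.

```bash
# Toy data invented for this sketch (not taken from nat2021.csv)
cat > toy.csv <<'EOF'
sexe;preusuel;annais;nombre
1;PIERRE;2011;10
2;MARIE;2011;20
1;PIERRE;2012;5
EOF

awk 'BEGIN { FS = ";"; print "Lines for PIERRE:" }   # runs once, before the first line
     /PIERRE/ { print $0; total += $4 }              # runs for every line matching the pattern
     END { print "Total:", total }                   # runs once, after the last line
' toy.csv
# Lines for PIERRE:
# 1;PIERRE;2011;10
# 1;PIERRE;2012;5
# Total: 15
```

The field separator is set in the `BEGIN` block so that it is already in effect when the first line is split into fields.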

## Variables in awk

| Variable | Description |
|----------|----------|
| `$0` | The entire current record (line). |
| `$1`, `$2`, …, `$NF` | Represents the fields in the current record. `$1` is the first field, `$2` is the second, etc. `$NF` is the last field. |
| `NF` | The number of fields in the current record (i.e., the number of columns in a line). |
| `NR` | The number of records (lines) processed so far. |
| `FS` | The field separator, which determines how awk splits each line into fields. Default is whitespace. You can change it using `-F` (`-F 'separator'`) or inside awk code. |
| `OFS` | The output field separator used when printing fields. Default is a single space. |
| `ORS` | The output record separator used when printing records. Default is a newline. |
| `RS` | The record separator, which determines how awk separates input records. Default is a newline. |
| `FNR` | The record number in the current input file (resets for each new file). |
| `ARGV` | An array containing the command-line arguments passed to awk. |
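
To see some of these variables in action, here is a minimal sketch. The file `fields.txt` is a made-up two-line example, not part of the course data.

```bash
# Toy data invented for this sketch: one line with 3 fields, one with 2
printf 'a;b;c\nd;e\n' > fields.txt

# NR = line number, NF = number of fields, $NF = last field
awk -F ';' '{ print NR, NF, $NF }' fields.txt
# 1 3 c
# 2 2 e

# OFS is the separator that print puts between its comma-separated arguments
awk -F ';' 'BEGIN { OFS = "\t" } { print $1, $2 }' fields.txt

# With several input files, NR keeps counting while FNR restarts at 1
awk '{ print NR, FNR }' fields.txt fields.txt
# 1 1
# 2 2
# 3 1
# 4 2
```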

## Programming in awk

awk is a full-fledged programming language that supports control structures such as if-else and loops, making it powerful for text processing. Here’s a brief overview:

### if/else statement

```bash
awk '{ if (condition) { action1 } else { action2 } }' file.txt
```
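
A concrete instance of this template could look as follows. The file `names.csv`, its content and the threshold of 15 are arbitrary choices made for this sketch.

```bash
# Toy data invented for this sketch (columns: sex;name;year;count)
printf '1;PIERRE;2011;10\n2;MARIE;2011;20\n' > names.csv

# Label each name depending on the value of the 4th column
awk -F ';' '{ if ($4 >= 15) { print $2, "frequent" } else { print $2, "rare" } }' names.csv
# PIERRE rare
# MARIE frequent
```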

**Comparison Operators**

You can compare numbers or strings using these operators:

| Operator | Description |
|----------|----------|
| `==` | Equal to |
| `!=` | Not equal to |
| `<` | Less than |
| `<=` | Less than or equal to |
| `>` | Greater than |
| `>=` | Greater than or equal to |

**Logical Operators**

You can combine conditions using logical operators:

| Operator | Description |
|----------|----------|
| `&&` | AND (Both conditions must be true) |
| <code>&#124;&#124;</code> | OR (At least one condition must be true) |
| `!` | NOT (Negates the condition) |
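
The sketch below combines comparison and logical operators. Here the condition is written directly as the pattern of a pattern-action pair, which is equivalent to wrapping it in an `if`. The file `ops.csv` and its values are invented for the example.

```bash
# Toy data invented for this sketch (columns: sex;name;year;count)
cat > ops.csv <<'EOF'
1;PIERRE;2009;12
1;PIERRE;2011;10
2;MARIE;2011;20
1;PAUL;2012;7
EOF

# AND: both conditions must hold
awk -F ';' '$2 == "PIERRE" && $3 > 2010 { print $0 }' ops.csv
# 1;PIERRE;2011;10

# OR: at least one condition must hold
awk -F ';' '$2 == "MARIE" || $4 < 10 { print $0 }' ops.csv
# 2;MARIE;2011;20
# 1;PAUL;2012;7

# NOT: negate a condition
awk -F ';' '!($1 == 1) { print $0 }' ops.csv
# 2;MARIE;2011;20
```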

**Pattern Matching with Regular Expressions**

You can use regular expressions with the ~ (matches) or !~ (does not match) operators.

| Operator | Description |
|----------|----------|
| `~` | Matches a regex pattern |
| `!~` | Does NOT match a regex pattern |
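
A short sketch of both operators, using a made-up list of names in `names.txt`. Note that an unanchored regex matches anywhere inside the field, while `^` and `$` force the whole field to match.

```bash
# Toy data invented for this sketch: one name per line
printf 'PIERRE\nJEAN-PIERRE\nPIERRETTE\nMARIE\n' > names.txt

# ~ keeps lines whose first field contains the regex
awk '$1 ~ /PIERRE/ { print $1 }' names.txt
# PIERRE
# JEAN-PIERRE
# PIERRETTE

# Anchors ^ and $ restrict the match to the whole field
awk '$1 ~ /^PIERRE$/ { print $1 }' names.txt
# PIERRE

# !~ keeps lines whose first field does NOT match
awk '$1 !~ /PIERRE/ { print $1 }' names.txt
# MARIE
```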

### Loops

You can also use for loops and while loops in awk.

```bash
# FOR loop
awk '{ for (i = 1; i <= NF; i++) { print $i } }' file.txt
# WHILE loop
awk '{ i=1; while (i <= NF) { print $i; i++ } }' file.txt
```

Both loops print each field (`$i`) of every line, one field per output line. `NF` is the number of fields in the current record.
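
Loops over fields are also handy for computing something per line. The sketch below sums the fields of each line; the file `numbers.txt` is invented for the example.

```bash
# Toy data invented for this sketch: whitespace-separated numbers
printf '1 2 3\n10 20\n' > numbers.txt

awk '{
    sum = 0                          # reset the sum for each new line
    for (i = 1; i <= NF; i++) {
        sum += $i                    # add the i-th field of the current line
    }
    print "line", NR, "sum:", sum
}' numbers.txt
# line 1 sum: 6
# line 2 sum: 30
```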

### Using Bash Variables in awk

You can pass variables from Bash into an awk program using the -v option.

```bash
my_var="PIERRE"
awk -v var="$my_var" '$2 == var { print $0 }' file.txt
```
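
Here is a self-contained sketch of the same idea, with a made-up file `people.txt`. It also shows why `-v` is needed: inside the single-quoted awk program the shell does not expand `$my_var`, so awk would compare against the literal text.

```bash
# Toy data invented for this sketch (columns: sex name)
printf '1 PIERRE\n2 MARIE\n1 PIERRE\n' > people.txt

my_var="PIERRE"

# The Bash variable is copied into the awk variable "var"
awk -v var="$my_var" '$2 == var { count++ } END { print var, count }' people.txt
# PIERRE 2

# Without -v, "$my_var" stays a literal string inside the single quotes,
# so no line matches and nothing is printed
awk '$2 == "$my_var" { print $0 }' people.txt
```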

## Exercises

!!! question "Print the first lines using awk and head"

??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{print $0}' nat2021.csv | head
    ```

!!! question "Print the first lines using awk and head, but skip the first line of the file"

??? example "Click to show the solution"  
    ```bash
    awk -F ';' 'NR > 1 {print $0}' nat2021.csv | head
    ```

!!! question "Print the second column of the file. Then try to remove the redundancy (using `cut`, `sort` and `uniq`). Finally, count the number of lines."


??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{print $2}' nat2021.csv
    ```

??? example "Click to show the solution without redundancy"  
    ```bash
    awk -F ';' '{print $2}' nat2021.csv | cut -d ';' -f 2 | sort -u 
    ```

??? example "Click to show the solution without redundancy and with count"  
    ```bash
    awk -F ';' '{print $2}' nat2021.csv | cut -d ';' -f 2 | sort -u | wc -l
    # 36172 - this is the diversity of names in our file
    ```

!!! question "List the names containing PIERRE."

??? example "Click to show the solution"  
    ```bash
    # 3 solutions
    awk -F ';' '/PIERRE/ {print $2}' nat2021.csv | sort -u
    awk -F ';' '{if($2 ~ /PIERRE/ ) {print $2} }' nat2021.csv | sort -u
    awk -F ';' '{print $2}' nat2021.csv | grep PIERRE | sort -u
    ```

!!! question "How many times has the name PIERRE been given each year after 2010? (Print the full line.)"

??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{if($2 == "PIERRE" && ($3 > 2010 )) {print $0} }' nat2021.csv
    ```

!!! question "Print the lines where the name is exactly PIERRE, for the years 1920-1929 and 2010-2019."

??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{if($2 == "PIERRE" && ($3 ~ /192[0-9]/ || $3 ~ /201[0-9]/ )) {print $0} }' nat2021.csv
    ```

!!! question "List all the names and count how many times each has been given in total over the years"

??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv
    ```

!!! question "Can you sort the previous result by number of times each name has been given?"

??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv | sort -n -k 2
    ```
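
To check how the last two solutions behave, it can help to run them on a few lines first. The sketch below uses a small made-up file, `toy_names.csv`, with the same column layout; the counts are invented. The array `names` is indexed by name, and the order in which `for (name in names)` visits the entries is unspecified, which is why the final `sort` is needed.

```bash
# Toy data invented for this sketch (columns: sex;name;year;count)
cat > toy_names.csv <<'EOF'
1;PIERRE;2011;10
2;MARIE;2011;20
1;PIERRE;2012;5
2;MARIE;2012;30
1;PAUL;2012;7
EOF

# Accumulate the 4th column per name, then sort numerically on the 2nd output column
awk -F ';' '{ names[$2] += $4 } END { for (name in names) print name, names[name] }' toy_names.csv | sort -n -k 2
# PAUL 7
# PIERRE 15
# MARIE 50
```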