# AWK

## Setup

??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!"
    {%
    include-markdown "pages/bash_manip/bash_manip-0-setup.md"
    %}


## Concept

`AWK` is a powerful programming language designed for text processing and typically used as a data extraction and reporting tool. Named after its creators (Aho, Weinberger, and Kernighan), `AWK` is particularly well-suited for processing **columnar data** and performing complex text manipulations.

```bash
awk 'BEGIN { Initial action(s) } /pattern/ { by line action(s) } END { final action(s) }' file
```

**BEGIN Block**  
This block runs once, before any lines are processed. It’s often used for initialization (e.g., setting field separators or printing headers).

**Pattern-Action Pair**  
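This is the core of awk. The pattern is matched against each line of the input, and when it matches, the action is executed. If the pattern is omitted, the action runs on every line; if the action is omitted, the matching line is printed.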

**END Block**  
This block runs once, after all lines have been processed. It’s useful for final actions like printing summary results.
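
As a minimal sketch of how the three blocks cooperate, the example below runs on a small made-up input (the item names, the prices, and the `sales` shell variable are assumptions for illustration, not part of the course data). It prints a header, then the lines matching `apple`, then a total:

```bash
# Hypothetical input: one item name and one price per line
sales=$(printf 'apple 3\nbanana 5\napple 4\n' |
  awk 'BEGIN   { print "item price" }         # runs once, before the first line
       /apple/ { print $1, $2; sum += $2 }    # runs for every line containing "apple"
       END     { print "total", sum }')       # runs once, after the last line
echo "$sales"
# item price
# apple 3
# apple 4
# total 7
```

The variable `sum` is never declared: awk treats an unset variable as 0 (or as an empty string), so it can be used as an accumulator directly.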

## Variables in awk

| Variable | Description |
|----------|----------|
| `$0` | The entire current record (line). |
| `$1`, `$2`, …, `$NF` | Represents the fields in the current record. `$1` is the first field, `$2` is the second, etc. `$NF` is the last field. |
| `NF` | The number of fields in the current record (i.e., the number of columns in a line). |
| `NR` | The number of records (lines) processed so far. |
| `FS` | The field separator, which determines how awk splits each line into fields. Default is whitespace. You can change it using `-F` (`-F 'separator'`) or inside awk code. |
| `OFS` | The output field separator used when printing fields. Default is a single space. |
| `ORS` | The output record separator used when printing records. Default is a newline. |
| `RS` | The record separator, which determines how awk separates input records. Default is a newline. |
| `FNR` | The record number in the current input file (resets for each new file). |
| `ARGV` | An array containing the command-line arguments passed to awk. |
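
To make a few of these variables concrete, here is a small sketch on made-up colon-separated input (the records and the `report` shell variable are hypothetical). It sets the input separator with `-F`, sets `OFS` to a tab, and prints `NR`, `NF`, the first field, and the last field of each line:

```bash
# Hypothetical colon-separated records (name:placeholder:id)
report=$(printf 'root:x:0\nalice:x:1000\n' |
  awk -F ':' 'BEGIN { OFS = "\t" }         # output fields are joined with a tab
              { print NR, NF, $1, $NF }')   # line number, field count, first and last field
echo "$report"
```

Each output line contains four tab-separated values, for example `1`, `3`, `root`, and `0` for the first record. Note that `OFS` only applies to items separated by commas in `print`.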

## Built-in Functions

Awk provides built-in functions for string manipulation and numeric operations, making it a powerful tool for text processing and calculations.

### Numeric Manipulation

awk includes several numeric functions for performing mathematical operations.

| Function | Description |
|----------|----------|
| `sin(x)` | Returns the sine of x (x in radians) |
| `cos(x)` | Returns the cosine of x (x in radians) |
| `atan2(y, x)` | Returns the arctangent of y/x |
| `log(x)` | Returns the natural logarithm of x |
| `exp(x)` | Returns the exponential of x |
| `sqrt(x)` | Returns the square root of x |
| `int(x)` | Returns the integer part of x |
| `rand()` | Returns a random number between 0 and 1 |
| `srand([x])` | Sets the seed for rand() and returns the previous seed |
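
The following sketch exercises some of these functions in a `BEGIN` block, so no input file is needed. The chosen values and the `numbers` shell variable are illustrative assumptions. Because the value returned by `rand()` depends on the awk implementation, the example only checks that it falls in the expected range:

```bash
numbers=$(awk 'BEGIN {
    print int(3.9)                  # integer part: 3
    print sqrt(16)                  # 4
    printf "%.4f\n", atan2(0, -1)   # pi, printed as 3.1416
    printf "%.3f\n", exp(1)         # e, printed as 2.718
    srand(42)                       # fixed seed, so repeated runs give the same sequence
    r = rand()
    print (r >= 0 && r < 1)         # 1 (true): rand() is in the range [0, 1)
}')
echo "$numbers"
```

Using `printf` with an explicit format avoids depending on the default number formatting of `print`.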

### String Manipulation

Awk provides several string functions to manipulate text. 

| Function | Description |
|----------|----------|
| `length([string])` | Returns the length of the string (or the length of `$0` if no string is given) |
| `substr(string, start, [length])` | Returns the substring of string starting at start position with optional length |
| `index(string, search)` | Returns the position of search in string, or 0 if not found |
| `match(string, regex)` | Returns the position of the match of regex in string, or 0 if no match |
| `split(string, array, [separator])` | Splits string into array elements using separator (default is `FS`) |
| `tolower(string)` | Returns a copy of string with all characters converted to lowercase |
| `toupper(string)` | Returns a copy of string with all characters converted to uppercase |
| `sprintf(format, expressions)` | Returns a formatted string using format and expressions |
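
A short sketch of several string functions applied to one made-up string (the string itself and the `strings` shell variable are assumptions for illustration). Keep in mind that positions in awk start at 1, not 0:

```bash
strings=$(awk 'BEGIN {
    s = "Hello, world"
    print length(s)                  # 12
    print substr(s, 1, 5)            # Hello
    print index(s, "world")          # 8
    print toupper(s)                 # HELLO, WORLD
    print sprintf("%.2f", 3.14159)   # 3.14
}')
echo "$strings"
```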

## Programming in awk

awk is a full-fledged programming language. It has two kinds of data structures (variables and arrays) and supports control structures such as if-else and loops, which makes it powerful for text processing. Here’s a brief overview:

### Data Structure
**variable**

```bash
var = "string"   # assigns a string to a variable (string literals need double quotes)
var = $1 + $2    # assigns the result of a calculation on two fields
```

**array**

AWK arrays are associative: they are indexed by keys (strings or numbers) rather than only by numerical positions, like dictionaries in Python. They are dynamic, i.e. there is no need to declare a size and you can add elements at any time.

```bash
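awk 'BEGIN {
    # fill an array (string values must be quoted)
    array["fruit1"] = "apple"
    array["fruit2"] = "banana"
    array["fruit3"] = "cherry"

    # delete an entry (the key is a quoted string)
    delete array["fruit2"]

    # print the remaining entries (the iteration order is not guaranteed)
    for (key in array) {        # loop over the keys of the array
        print key, array[key]   # print the key and the value of the current item
    }
}'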
```
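
A common use of arrays is counting occurrences, with the value being counted as the key. The sketch below assumes a made-up list of fruit names and a hypothetical `counts` shell variable. Since `for (key in array)` visits keys in an unspecified order, the output is piped through `sort` to make it reproducible:

```bash
# Hypothetical input: one fruit name per line
counts=$(printf 'apple\nbanana\napple\ncherry\napple\n' |
  awk '{ seen[$1]++ }                             # increment the counter stored under this name
       END { for (k in seen) print k, seen[k] }' |
  sort)
echo "$counts"
# apple 3
# banana 1
# cherry 1
```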

It is possible to split a string into an array:

```bash
awk 'BEGIN {
    str = "apple,banana,cherry";
    split(str, myArray, ",");
    print myArray[1];  # Output: apple
}'
```
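
`split` also returns the number of elements it created, and the resulting array is indexed from 1. As a small sketch (the names `n`, `parts`, and the `fruits` shell variable are illustrative), a numeric `for` loop over that count visits the elements in their original order, which `for (key in array)` does not guarantee:

```bash
fruits=$(awk 'BEGIN {
    n = split("apple,banana,cherry", parts, ",")   # n is the number of elements (3)
    for (i = 1; i <= n; i++) {
        print i, parts[i]                          # index and value, in order
    }
}')
echo "$fruits"
# 1 apple
# 2 banana
# 3 cherry
```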

### If/else Statement

```bash
awk '{ if (condition) { action1 } else { action2 } }' file.txt
```
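
A concrete instance of this template, as a minimal sketch on made-up numbers (the threshold of 10, the labels, and the `labels` shell variable are assumptions for illustration):

```bash
# Hypothetical input: one number per line
labels=$(printf '5\n12\n8\n' |
  awk '{ if ($1 >= 10) { print $1, "high" } else { print $1, "low" } }')
echo "$labels"
# 5 low
# 12 high
# 8 low
```

Because the fields look like numbers, awk compares them numerically here, so 5 is correctly treated as smaller than 10.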

**Comparison Operators**

You can compare numbers or strings using these operators:

| Operator | Description |
|----------|----------|
| `==` | Equal to |
| `!=` | Not equal to |
| `<` | Less than |
| `<=` | Less than or equal to |
| `>` | Greater than |
| `>=` | Greater than or equal to |

**Logical Operators**

You can combine conditions using logical operators:

| Operator | Description |
|----------|----------|
| `&&` | AND (both conditions must be true) |
| <code>&#124;&#124;</code> | OR (at least one condition must be true) |
| `!` | NOT (negates the condition) |
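
The sketch below combines comparison and logical operators in three separate rules, each used directly as a pattern. The input (name, age, city) and the `picked` shell variable are made up for illustration. A line can trigger several rules, and the rules are evaluated in order for each line:

```bash
# Hypothetical input: name, age, city
picked=$(printf 'ana 25 paris\nbob 17 lyon\ncleo 40 lyon\n' |
  awk '$2 >= 18 && $3 == "lyon" { print "and:", $1 }    # adult AND living in lyon
       $2 < 18 || $3 == "paris" { print "or:", $1 }     # minor OR living in paris
       !($3 == "lyon")          { print "not:", $1 }')  # NOT living in lyon
echo "$picked"
# or: ana
# not: ana
# or: bob
# and: cleo
```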

**Pattern Matching with Regular Expressions**

You can use the following approach:  

```bash
awk '/pattern/ { by line action(s) }' file
```

Or you can use regular expressions with the ~ (matches) or !~ (does not match) operators.

| Operator | Description |
|----------|----------|
| `~` | Matches a regex pattern |
| `!~` | Does NOT match a regex pattern |

```bash
awk '{ if ($2 ~ /pattern/) { print $0 } }' file
```
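
The difference between a partial match, an exact match, and a non-match can be sketched as follows. The input lines and the `matches` shell variable are hypothetical. The anchors `^` and `$` force the regex to cover the whole field:

```bash
# Hypothetical input: an identifier and a name
matches=$(printf 'x JEAN-PIERRE\ny PIERRE\nz MARIE\n' |
  awk '$2 ~ /^PIERRE$/ { print "exact:", $2 }      # field is exactly PIERRE
       $2 ~ /PIERRE/   { print "contains:", $2 }   # field contains PIERRE anywhere
       $2 !~ /PIERRE/  { print "other:", $2 }')    # field does not contain PIERRE
echo "$matches"
# contains: JEAN-PIERRE
# exact: PIERRE
# contains: PIERRE
# other: MARIE
```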

### Loops

You can also use for loops and while loops in awk.

```bash
# FOR loop
awk '{ for (i = 1; i <= NF; i++) { print $i } }' file.txt
# WHILE loop
awk '{ i=1; while (i <= NF) { print $i; i++ } }' file.txt
```

Both loops print each field ($i) of every line on its own line. NF is the number of fields in the current record.
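
Looping over fields is also useful for computations on a whole row. As a minimal sketch on made-up numeric input (the `row_sums` shell variable is an assumption for illustration), the following sums all fields of each line:

```bash
# Hypothetical input: rows with a varying number of numeric fields
row_sums=$(printf '1 2 3\n10 20\n' |
  awk '{ sum = 0                                  # reset the accumulator for each line
         for (i = 1; i <= NF; i++) sum += $i      # add every field of the line
         print NR ": " sum }')                    # line number, then the row total
echo "$row_sums"
# 1: 6
# 2: 30
```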

### Using Bash Variables in awk

You can pass variables from Bash into an awk program using the -v option.

```bash
my_var="PIERRE"
awk -v var="$my_var" '$2 == var { print $0 }' file.txt
```
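
The `-v` option is needed because the awk program is usually wrapped in single quotes, where Bash does not expand `$my_var`. The same mechanism works for numbers. In this sketch, the threshold `min_count`, the sample data, and the `frequent` shell variable are hypothetical:

```bash
min_count=10   # hypothetical threshold defined in Bash
frequent=$(printf 'PIERRE 12\nMARIE 7\nJEAN 30\n' |
  awk -v min="$min_count" '$2 >= min { print $1 }')   # keep names whose count reaches the threshold
echo "$frequent"
# PIERRE
# JEAN
```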

## Exercises

!!! question "Print first lines using awk and head"

??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{print $0}' nat2021.csv | head
    ```

!!! question "Print first lines using awk and head but skipping the first line"

??? example "Click to show the solution"  
    ```bash
    awk -F ';' 'NR > 1 {print $0}' nat2021.csv | head
    ```

!!! question "Print the second column of the file. Then remove the redundancy (using `sort` and `uniq`). Finally, count the number of remaining lines."


??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{print $2}' nat2021.csv
    ```

??? example "Click to show the solution without redundancy"  
    ```bash
    awk -F ';' '{print $2}' nat2021.csv | sort -u 
    ```

??? example "Click to show the solution without redundancy and with count"  
    ```bash
    awk -F ';' '{print $2}' nat2021.csv | sort -u | wc -l
    # 36172 - this is the diversity of names in our file
    ```

!!! question "List the names containing PIERRE."

??? example "Click to show the solution"  
    ```bash
    # 3 solutions
    awk -F ';' '/PIERRE/ {print $2}' nat2021.csv | sort -u
    awk -F ';' '{if($2 ~ /PIERRE/ ) {print $2} }' nat2021.csv | sort -u
    awk -F ';' '{print $2}' nat2021.csv | grep PIERRE | sort -u
    ```

!!! question "How many times has the name PIERRE been given each year after 2010? (Print the full lines.)"

??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{if($2 == "PIERRE" && ($3 > 2010 )) {print $0} }' nat2021.csv
    ```

!!! question "Print the lines where the name is exactly PIERRE and the year is between 1920 and 1929 or between 2010 and 2019."

??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{if($2 == "PIERRE" && ($3 ~ /192[0-9]/ || $3 ~ /201[0-9]/ )) {print $0} }' nat2021.csv
    ```

!!! question "List all the names and count how many times each has been given in total over the years."

??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv
    ```

!!! question "Can you sort the previous result by number of times each name has been given?"

??? example "Click to show the solution"  
    ```bash
    awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv | sort -n -k 2
    ```