# AWK

## setup

??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!"
    {% include-markdown "pages/bash_manip/bash_manip-0-setup.md" %}

## Concept

`AWK` is a powerful programming language designed for text processing and typically used as a data extraction and reporting tool. Named after its creators (Aho, Weinberger, and Kernighan), `AWK` is particularly well-suited for processing **columnar data** and performing complex text manipulations.

```bash
awk 'BEGIN { initial action(s) } /pattern/ { by line action(s) } END { final action(s) }' file
```

**BEGIN Block** This block runs once, before any lines are processed. It’s often used for initialization (e.g., setting field separators, printing headers).

**Pattern-Action Pair** This is the core of awk. The pattern is matched against each line of the input, and when it matches, the action is executed.

**END Block** This block runs once, after all lines have been processed. It’s useful for final actions like printing summary results.

## Variables in awk

| Variable | Description |
|----------|----------|
| $0 | The entire current record (line). |
| $1, $2, …, $NF | Represents the fields in the current record. $1 is the first field, $2 is the second, etc. $NF is the last field. |
| NF | The number of fields in the current record (i.e., the number of columns in a line). |
| NR | The number of records (lines) processed so far. |
| FS | The field separator, which determines how awk splits each line into fields. Default is whitespace. You can change it using -F (`-F 'separator'`) or inside awk code. |
| OFS | The output field separator used when printing fields. Default is a single space. |
| ORS | The output record separator used when printing records. Default is a newline. |
| RS | The record separator, which determines how awk separates input records. Default is a newline. |
| FNR | The record number in the current input file (resets for each new file). |
| ARGV | An array containing the command-line arguments passed to awk. |

## Built-in Functions

Awk provides built-in functions for string manipulation and numeric operations, making it a powerful tool for text processing and calculations.

### Numeric Manipulation

awk includes several numeric functions for performing mathematical operations.

| Function | Description |
|----------|----------|
| sin(x) | Returns the sine of x (x in radians) |
| cos(x) | Returns the cosine of x (x in radians) |
| atan2(y, x) | Returns the arctangent of y/x (in radians) |
| log(x) | Returns the natural logarithm of x |
| exp(x) | Returns the exponential of x |
| sqrt(x) | Returns the square root of x |
| int(x) | Returns the integer part of x (truncated toward zero) |
| rand() | Returns a random number n with 0 <= n < 1 |
| srand([x]) | Sets the seed for rand() and returns the previous seed |

### String Manipulation

Awk provides several string functions to manipulate text.

| Function | Description |
|----------|----------|
| length([string]) | Returns the length of the string (or the length of $0 if no string is given) |
| substr(string, start, [length]) | Returns the substring of string starting at position start (counted from 1), with an optional length |
| index(string, search) | Returns the position of search in string, or 0 if not found |
| match(string, regex) | Returns the position of the match of regex in string, or 0 if no match (also sets RSTART and RLENGTH) |
| split(string, array, [separator]) | Splits string into array elements using separator (default is FS) and returns the number of elements |
| tolower(string) | Returns a copy of string with all characters converted to lowercase |
| toupper(string) | Returns a copy of string with all characters converted to uppercase |
| sprintf(format, expressions) | Returns a formatted string using format and expressions |

## Programming in awk

awk is a full-fledged programming language that uses two types of data structure (variables and arrays) and supports control structures such as if-else and loops, making it powerful for text processing.
Here’s a brief overview:

### Data Structure

**variable**

```bash
var = "string"   # assigns a string value to a variable
var = $1 + $2    # stores the result of a calculation
```

**array**

AWK arrays are associative: they are indexed by keys rather than only by numerical indices (like dictionaries in Python). They are dynamic, i.e. there is no need to declare a size, and you can add elements at any time.

```bash
# fill an array
array["fruit1"] = "apple"; array["fruit2"] = "banana"; array["fruit3"] = "cherry";
# delete an entry
delete array["fruit2"]
# print results
for (key in array) {          # loop over the keys of the array
    print key, array[key]     # print the key and its value
}
```

It is possible to split a string into an array:

```bash
awk 'BEGIN {
    str = "apple,banana,cherry";
    split(str, myArray, ",");
    print myArray[1];  # Output: apple
}'
```

### If/else Statement

```bash
awk '{ if (condition) { action1 } else { action2 } }' file.txt
```

**Comparison Operators**

You can compare numbers or strings using these operators:

| Operator | Description |
|----------|----------|
| == | Equal to |
| != | Not equal to |
| < | Less than |
| <= | Less than or equal to |
| > | Greater than |
| >= | Greater than or equal to |

**Logical Operators**

You can combine conditions using logical operators:

| Operator | Description |
|----------|----------|
| && | AND (both conditions must be true) |
| \|\| | OR (at least one condition must be true) |
| ! | NOT (negates the condition) |

**Pattern Matching with Regular Expressions**

You can use the following approach:

```bash
awk '/pattern/ { by line action(s) }' file
```

Or you can use regular expressions with the ~ (matches) or !~ (does not match) operators.

| Operator | Description |
|----------|----------|
| ~ | Matches a regex pattern |
| !~ | Does NOT match a regex pattern |

```bash
awk '{ if ($2 ~ /pattern/) { action } }' file
```

### loop

You can also use for loops and while loops in awk.

```bash
# FOR loop
awk '{ for (i = 1; i <= NF; i++) { print $i } }' file.txt

# WHILE loop
awk '{ i = 1; while (i <= NF) { print $i; i++ } }' file.txt
```

Both loops print each field ($i) of every line, one field per output line.
NF is the number of fields in the current record.

### Using Bash Variables in awk

You can pass variables from Bash into an awk program using the -v option.

```bash
my_var="PIERRE"
awk -v var="$my_var" '$2 == var { print $0 }' file.txt
```

## Exercises

!!! question "Print the first lines using awk and head"

    ??? example "Click to show the solution"
        ```bash
        awk -F ';' '{print $0}' nat2021.csv | head
        ```

!!! question "Print the first lines using awk and head, but skip the first line"

    ??? example "Click to show the solution"
        ```bash
        awk -F ';' 'NR > 1 {print $0}' nat2021.csv | head
        ```

!!! question "Print the second column of the file. Then remove the redundancy (using `sort` and `uniq`). Finally, count the number of lines."

    ??? example "Click to show the solution"
        ```bash
        awk -F ';' '{print $2}' nat2021.csv
        ```

    ??? example "Click to show the solution without redundancy"
        ```bash
        awk -F ';' '{print $2}' nat2021.csv | sort -u
        ```

    ??? example "Click to show the solution without redundancy and with count"
        ```bash
        awk -F ';' '{print $2}' nat2021.csv | sort -u | wc -l
        # 36172 - this is the diversity of names in our file
        ```

!!! question "List the names containing PIERRE."

    ??? example "Click to show the solution"
        ```bash
        # 3 solutions
        awk -F ';' '/PIERRE/ {print $2}' nat2021.csv | sort -u
        awk -F ';' '{if($2 ~ /PIERRE/ ) {print $2} }' nat2021.csv | sort -u
        awk -F ';' '{print $2}' nat2021.csv | grep PIERRE | sort -u
        ```

!!! question "How many times has the name PIERRE been given each year after 2010? (Print the full line.)"

    ??? example "Click to show the solution"
        ```bash
        awk -F ';' '{if($2 == "PIERRE" && ($3 > 2010 )) {print $0} }' nat2021.csv
        ```

!!! question "Print the lines whose name is exactly PIERRE for the years 1920-1929 and 2010-2019."

    ??? example "Click to show the solution"
        ```bash
        awk -F ';' '{if($2 == "PIERRE" && ($3 ~ /192[0-9]/ || $3 ~ /201[0-9]/ )) {print $0} }' nat2021.csv
        ```

!!! question "List all the names and count how many times each has been given in total over the years"

    ??? example "Click to show the solution"
        ```bash
        awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv
        ```

!!! question "Can you sort the previous result by the number of times each name has been given?"

    ??? example "Click to show the solution"
        ```bash
        awk -F ';' '{names[$2]+=$4}END{for (name in names) print name, names[name] }' nat2021.csv | sort -n -k 2
        ```
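To try the last two solutions without the full dataset, here is a minimal, self-contained sketch. It assumes the same `;`-separated layout the solutions above rely on (name in column 2, year in column 3, count in column 4) and a header on the first line. The sample rows and the `sample` and `totals` variable names are made up for illustration and are not taken from `nat2021.csv`. As a variation on the solution above, it skips the header with `NR > 1` and sorts in descending order.

```shell
# Hypothetical sample data (NOT real rows from nat2021.csv).
# Assumed layout: column 2 = name, column 3 = year, column 4 = count.
sample='sexe;preusuel;annais;nombre
1;PIERRE;2011;100
1;PIERRE;2012;90
1;PAUL;2011;50
2;MARIE;2011;120
2;MARIE;2012;80'

# Sum the counts per name in an associative array, skipping the header line,
# then sort numerically on the second output column, largest total first.
totals=$(printf '%s\n' "$sample" |
  awk -F ';' 'NR > 1 { names[$2] += $4 }
              END    { for (name in names) print name, names[name] }' |
  sort -k 2,2nr)

echo "$totals"
# MARIE 200
# PIERRE 190
# PAUL 50
```

The order in which `for (name in names)` visits the keys is unspecified, which is why the sorting is delegated to `sort` rather than done inside awk.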