# GREP (part 2)

## Setup

??? quote "Click and follow the instructions below only if you are starting the course at this stage. Otherwise, skip this step."
    {%
    include-markdown "pages/bash_manip/bash_manip-0-setup.md"
    %}


## Searching patterns (grep)

`grep` uses basic regular expressions by default. To use extended regular expressions, where `+`, `{n}` and `|` act as operators without a backslash, use the `-E` option:  

```bash
grep -E pattern file
```
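
As a quick illustration of what `-E` changes, the sketch below runs the same pattern with and without the option. The file `sample.csv`, its header and its rows are made up for the demo (they only mimic a four-column, `;`-separated layout) and are not taken from `nat2021.csv`.

```bash
# Tiny made-up sample (hypothetical header and rows, not real data)
printf '%s\n' \
  'sex;name;year;count' \
  '1;ALPHA;1950;111' \
  '2;BETA;1980;22222' > sample.csv

# Basic mode: "{4}" and "+" are literal characters, so nothing matches
# ("|| true" keeps the script going, since grep exits with 1 on no match)
bre_count=$(grep -c '[0-9]{4};[0-9]+' sample.csv || true)

# Extended mode: "{4}" and "+" are quantifiers, so both data rows match
ere_count=$(grep -cE '[0-9]{4};[0-9]+' sample.csv)

echo "basic: $bre_count, extended: $ere_count"
```

On this sample the basic-mode count should be 0 and the extended-mode count 2, because only `-E` reads `{4}` as "exactly four times" and `+` as "one or more".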

Back to our data file `nat2021.csv`, which contains the first names given to children born in France since 1900.  
Let's play with some RegEx...

!!! question "How would you describe the data structure with a regex that matches every line except the header (e.g. `1;PRENOMS;1904;1430`)?"

??? example "Click to show the solution"  
    ```bash
    [12];[A-Za-z]+;[0-9]{4};[0-9]+
    [12];[A-Za-z-]+;[0-9]{4};[0-9]+   # also handles compound first names (-)
    [12];[A-Za-z_-]+;[0-9]{4};[0-9]+  # also handles _PRENOMS_RARES (- and _)
    ```

!!! question "Which names were given more than 10 000 times in 1980?"

??? example "Click to show the solution"  
    ```bash
    grep -E '[12];[A-Za-z]+;1980;[0-9]{5,}' nat2021.csv # append | wc -l to count
    ```

!!! question "Which names were given more than 20 000 times in 1980?"

??? example "Click to show the solution"  
    ```bash
    grep -E '[12];[A-Za-z]+;1980;[2-9][0-9]{4,}' nat2021.csv
    ```

!!! question "List all names given more than 20 000 times in a single year, across all years. Then remove the redundancy (using `cut`, `sort` and `uniq`). Finally, count the number of lines."

??? example "Click to show the solution"  
    ```bash
    grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv
    ```

??? example "Click to show the solution without redundancy"  
    ```bash
    grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv | cut -d ';' -f 2 | sort -u # sort -u gives the same result as sort | uniq
    ```

??? example "Click to show the line count"  
    ```bash
    grep -E '[12];[A-Za-z]+;.*;[2-9][0-9]{4,}' nat2021.csv | cut -d ';' -f 2 | sort -u | wc -l
    # Result = 21
    ```
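
A note on the first field: inside a bracket expression, `|` is an ordinary character rather than an "or", so `[1|2]` and `[12]` do not accept exactly the same input. The sketch below shows the difference on two made-up lines (the file name `bracket_demo.txt` and its content are assumptions for the demo only).

```bash
# Hypothetical test lines: only the first one starts with a valid code
printf '%s\n' '1;ANNA;1980;12000' '|;ANNA;1980;12000' > bracket_demo.txt

# [1|2] matches "1", "2" or a literal "|", so both lines are counted
loose=$(grep -cE '^[1|2];' bracket_demo.txt)

# [12] matches only "1" or "2"
strict=$(grep -cE '^[12];' bracket_demo.txt)

# (1|2) is the alternation form; here it accepts the same lines as [12]
alt=$(grep -cE '^(1|2);' bracket_demo.txt)

echo "loose=$loose strict=$strict alt=$alt"
```

On a clean file where the first column only ever holds `1` or `2`, the three forms should select the same lines; the stricter forms simply avoid accepting a stray `|`.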