Commit ffb19f96 authored by jacques.dainat_ird.fr's avatar jacques.dainat_ird.fr

advance for course2

parent bcd1acb6
Pipeline #84275 passed
For this exercise you will use data on first names given to children born in France since 1900, downloaded from the "Institut national de la statistique et des études économiques" (INSEE; see [here](https://www.insee.fr/fr/statistiques/8205621?sommaire=8205628) for details).
```bash
wget https://www.insee.fr/fr/statistiques/fichier/2540004/nat2021_csv.zip
unzip nat2021_csv.zip
```
You should now have a file called `nat2021.csv` in your working directory.
The data contained in this file have this shape:
```
sexe;preusuel;annais;nombre
2;SANDRINE;1973;17605
1;JEAN;1960;17607
1;_PRENOMS_RARES;1904;1430
```
The first line is the header, where `preusuel` means `prénom usuel` (usual first name) and `annais` means `année de naissance` (year of birth).
The subsequent lines are the data.
In the `sexe` column, 1 means male and 2 means female.
`_PRENOMS_RARES` are rare first names. They are classified as rare following criteria described [here](https://www.insee.fr/fr/statistiques/8205621?sommaire=8205628#documentation).
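The layout described above can be explored on a tiny sample before working on the full file. This is only a sketch: `sample.csv` is a hypothetical file name, and its rows are the ones shown in the excerpt above.

```bash
# Build a tiny sample with the same layout as nat2021.csv
# (sample.csv is a hypothetical name; the rows come from the excerpt above).
printf '%s\n' \
  'sexe;preusuel;annais;nombre' \
  '2;SANDRINE;1973;17605' \
  '1;JEAN;1960;17607' \
  '1;_PRENOMS_RARES;1904;1430' > sample.csv

# Show the header line only
head -n 1 sample.csv

# Count the data lines (everything after the header)
data_lines=$(tail -n +2 sample.csv | wc -l)
echo "data lines: $data_lines"

# Extract the second column (first names), using ';' as the field separator
names=$(tail -n +2 sample.csv | cut -d ';' -f 2)
echo "$names"
```

The same commands should work unchanged on `nat2021.csv`, since only the file name differs.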
\ No newline at end of file
# Extracting from files
# Basic commands
For this exercise you will use data on first names given to children born in France since 1900, downloaded from the "Institut national de la statistique et des études économiques" (INSEE; see [here](https://www.insee.fr/fr/statistiques/8205621?sommaire=8205628) for details).
## setup
{%
include-markdown "pages/bash_manip/bash_manip-0-setup.md"
%}
## Displaying sample (head, tail)
......@@ -139,12 +124,12 @@ The `uniq` command can be used to remove the redundancy. But result need to be s
You can redirect a result and store it in a file thanks to the `>` redirection:
`command > filename`
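As a minimal illustration of the redirection described above, the sketch below writes to a hypothetical file `demo_out.txt`. The `>>` operator is not covered in this section; it is shown only to contrast appending with the overwriting behaviour of `>`.

```bash
# demo_out.txt is a hypothetical file name used only for this illustration.
echo "first run" > demo_out.txt     # '>' creates the file, or overwrites it if it exists
echo "second run" > demo_out.txt    # the previous content is replaced
echo "third run" >> demo_out.txt    # '>>' appends to the file instead of overwriting it

cat demo_out.txt
# second run
# third run
```

Because `>` silently overwrites, it is worth checking the target file name before redirecting a long-running command.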
!!! question "Save all the names from 2005 in a dedicated file?"
!!! question "Choose a command used before and save the result in a dedicated file?"
??? example "Click to show the solution"
```bash
# command
grep ";2005;" nat2021.csv > names2005.txt
sort -n -t ';' -k4 nat2021.csv | tail -n 100 | cut -d";" -f 2 | sort | uniq > names2005.txt
```
## Final question
......
# Extracting from files
# Grep
For this exercise you will need to download French first name data from the "Institut national de la statistique
et des études économiques".
## setup
??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!"
{%
include-markdown "pages/bash_manip/bash_manip-0-setup.md"
%}
## Concept
`Grep` stands for "global regular expression print". It searches through the contents of files for lines that match a specified **pattern**.
The basic syntax is:
```bash
grep [options] pattern [file...]
```
`grep` has many options; the most common ones are:
| Option | Explanation |
|---------|-------------|
| -i | Ignore case distinctions in patterns and data. |
| -v | Invert the match, showing lines that do not match the pattern. |
| -n | Prefix each line of output with the line number. |
| -c | Print only a count of matching lines per file. |
| -o | Print only the matched part of each line, one match per output line. |
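The options in the table can be tried on a small sample. This is a sketch under assumptions: `grep_demo.csv` is a hypothetical file name, and its rows are the ones from the data excerpt in the setup page.

```bash
# Tiny sample (hypothetical file name; rows taken from the course data excerpt).
printf '%s\n' \
  'sexe;preusuel;annais;nombre' \
  '2;SANDRINE;1973;17605' \
  '1;JEAN;1960;17607' \
  '1;_PRENOMS_RARES;1904;1430' > grep_demo.csv

grep -i 'jean' grep_demo.csv        # -i: matches JEAN despite the lowercase pattern
grep -v '^1;' grep_demo.csv         # -v: keeps the lines that do NOT start with "1;"
grep -n 'JEAN' grep_demo.csv        # -n: prefixes the matching line with its line number
grep -c '7' grep_demo.csv           # -c: number of LINES containing at least one 7
grep -o '7' grep_demo.csv | wc -l   # -o: one output line per match, so wc -l counts OCCURRENCES
```

The last two commands give different numbers on this sample (2 lines, 4 occurrences), which is the distinction the first two exercise questions rely on.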
## Exercise
!!! question "How many lines contain the number `2` in the `nat2021.csv` file?"
## Searching patterns (grep)
??? example "Click to show the solution"
```bash
    # -c prints the number of matching lines instead of the lines themselves
    grep -c 2 nat2021.csv
# 553461
```
!!! question "How many occurrences of the digit `2` are there in the `nat2021.csv` file?"
??? example "Click to show the solution"
```bash
grep -o 2 nat2021.csv | wc -l
# -o makes grep print each match on a new line.
# wc -l counts the number of lines, which equals the total occurrences
# 871258
```
!!! question "Select all lines related to the year 2021 in the `nat2021.csv` file"
Note that the value 2021 may occur in two different columns: `annais` (column 3) and `nombre` (column 4).
??? example "Click to show the solution"
```bash
grep ";2021;" nat2021.csv
```
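To see why the separators matter in the pattern, here is a small sketch. The rows below are invented for illustration only (they are not real INSEE counts), and `year_demo.csv` is a hypothetical file name.

```bash
# Hypothetical rows, invented for illustration only (not real INSEE data).
printf '%s\n' \
  'sexe;preusuel;annais;nombre' \
  '1;NAME_A;2021;150' \
  '2;NAME_B;1950;2021' \
  '2;NAME_C;2021;2021' > year_demo.csv

loose=$(grep -c '2021' year_demo.csv)     # matches 2021 anywhere on the line
strict=$(grep -c ';2021;' year_demo.csv)  # 2021 must sit between two separators

echo "loose=$loose strict=$strict"
```

Since `nombre` is the last column, its value is never followed by `;`, so the pattern `;2021;` can only match the `annais` column.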
!!! question "How many names have been provided in 2021?"
!!! question "How many different names were given in 2021 (`_PRENOMS_RARES` counts as 1)?"
??? example "Click to show the solution"
```bash
......@@ -29,19 +63,19 @@ You should now have a file called `nat2021.csv` in your working directory.
# result: 13501
```
!!! question "Is there more diversity in male or female names in 2021"?
!!! question "Is there more diversity in male or female names in 2021?"
??? example "Click to show the solution"
```bash
# female
grep ";2021;" nat2021.csv | grep "^2" | wc -l
# female - field one contains male female information (-f 1) then count female (grep -c 2)
grep ";2021;" nat2021.csv | cut -d ';' -f 1 | grep -c 2
# result: 7112
# male
grep ";2021;" nat2021.csv | grep "^1" | wc -l
# male - field one contains male female information (-f 1) then count male (grep -c 1)
grep ";2021;" nat2021.csv | cut -d ';' -f 1 | grep -c 1
# result: 6389
```
!!! question "How many person are called PARIS in 2021"?
!!! question "How many people were named PARIS in 2021?"
??? example "Click to show the solution"
```bash
......@@ -52,7 +86,7 @@ You should now have a file called `nat2021.csv` in your working directory.
Rare names ([see here for documentation](https://www.insee.fr/fr/statistiques/2540004?sommaire=4767262#documentation)) are recorded as `_PRENOMS_RARES`.
!!! question "Could you find all rare name ? Do you see any pattern?"
!!! question "Could you find the number of rare names per year? Do you see any pattern?"
??? example "Click to show the solution"
```bash
......@@ -61,7 +95,7 @@ The rare name ([see here for documentation](https://www.insee.fr/fr/statistiques
People tend to give more and more rare names.
!!! question "What year was the most prolific fot the name ZINEDINE?"
!!! question "What year was the most prolific year for the name ZINEDINE?"
??? example "Click to show the solution"
```bash
......@@ -71,20 +105,3 @@ The rare name ([see here for documentation](https://www.insee.fr/fr/statistiques
```
## Redirecting an output (>)
You can redirect a result and store it in a file thanks to the `>` redirection:
`command > filename`
!!! question "Save all the names from 2005 in a dedicated file?"
??? example "Click to show the solution"
```bash
# command
grep ";2005;" nat2021.csv > names2005.txt
```
# Extracting from files
For this exercise you will need to download French first name data from the "Institut national de la statistique
et des études économiques".
## Filtering a file (awk)
## Replacing patterns (sed)
# AWK
## setup
??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!"
{%
include-markdown "pages/bash_manip/bash_manip-0-setup.md"
%}
## Filtering a file (awk)
## Replacing patterns (sed)
# Regular Expression
Regular expressions (regex) are sequences of characters that define a search pattern. They are used for pattern matching within strings.
They are a powerful tool for text processing and can be used in various command-line utilities such as `grep`, `sed`, and `awk` to search, match, and manipulate text.
## Regular Expression Summary
| Symbol | Description | Example | Matches |
|--------|-------------|---------|---------|
| `.` | Any single character except newline | `a.b` | `aab`, `acb`, `a1b` |
| `^` | Start of a line | `^abc` | `abc` at the start of a line |
| `$` | End of a line | `abc$` | `abc` at the end of a line |
| `*` | Zero or more of the preceding element | `ab*c` | `ac`, `abc`, `abbc` |
| `+` | One or more of the preceding element | `ab+c` | `abc`, `abbc` |
| `?` | Zero or one of the preceding element | `ab?c` | `ac`, `abc` |
| `{n}` | Exactly n of the preceding element | `a{3}` | `aaa` |
| `{n,}` | n or more of the preceding element | `a{2,}` | `aa`, `aaa`, `aaaa` |
| `{n,m}`| Between n and m of the preceding element | `a{2,3}` | `aa`, `aaa` |
| `[]` | Any one of the characters within the brackets | `[abc]` | `a`, `b`, `c` |
| `[^]` | Any one character not within the brackets | `[^abc]` | Any character except `a`, `b`, `c` |
| `\|` | Alternation (OR) | `a\|b` | `a`, `b` |
| `()` | Grouping | `(abc)` | `abc` |
| `\d` | Any digit (0-9) | `\d` | `0`, `1`, `2`, ..., `9` |
| `\D` | Any non-digit | `\D` | Any character except `0-9` |
| `\w` | Any word character (alphanumeric + underscore) | `\w` | `a`, `b`, `1`, `_` |
| `\W` | Any non-word character | `\W` | Any character except `a-z`, `A-Z`, `0-9`, `_` |
| `\s` | Any whitespace character | `\s` | Space, tab, newline |
| `\S` | Any non-whitespace character | `\S` | Any character except space, tab, newline |
POSIX character classes can also be used inside a bracket expression (for example `[[:digit:]]`):
| Symbol | Description |
|--------|-------------|
| [:alnum:] | equivalent to A-Za-z0-9 |
| [:alpha:] | equivalent to A-Za-z |
| [:blank:] | equivalent to space or tab |
| [:digit:] | equivalent to 0-9 |
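A few of the symbols above can be tried with `grep -E` (extended regular expressions, where `+`, `?`, `{}`, `|` and `()` need no backslash). This is a sketch under assumptions: `regex_demo.csv` is a hypothetical file name holding the rows from the course data excerpt. Shorthands such as `\d` are Perl-style and generally require `grep -P` in GNU grep, so the POSIX class `[[:digit:]]` is used here instead.

```bash
# Tiny sample (hypothetical file name; rows taken from the course data excerpt).
printf '%s\n' \
  'sexe;preusuel;annais;nombre' \
  '2;SANDRINE;1973;17605' \
  '1;JEAN;1960;17607' \
  '1;_PRENOMS_RARES;1904;1430' > regex_demo.csv

# ^ anchors the pattern at the start of the line: first field equal to 1
grep -E '^1;' regex_demo.csv

# {5} and $: lines that end with exactly five digits after the last separator
grep -E ';[[:digit:]]{5}$' regex_demo.csv

# () and |: the name field is either JEAN or SANDRINE
grep -E ';(JEAN|SANDRINE);' regex_demo.csv

# . and *: an uppercase A followed, anywhere later on the line, by an uppercase E
grep -E 'A.*E' regex_demo.csv
```

Single quotes around the pattern keep the shell from interpreting characters such as `*`, `|` or `$` before `grep` sees them.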
!!! Warning
Do not confuse regular expressions with **globbing** (pathname expansion), which the shell uses to match file names!
`?` Any single character
`*` Zero or more characters
`[]` Any one character from the given set or range; put `!` first inside the brackets to match any character not in the set.
`{term1,term2}` Brace expansion over a comma-separated list of terms; each term can be a name or a wildcard.
`{term1..term2}` Brace expansion over a sequence: expands to all the terms between term1 and term2 (letters or integers).
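The globbing rules listed above can be checked in a throwaway directory. This is a sketch: the file names are hypothetical, and the two brace expansion lines assume the shell is bash.

```bash
# Create a scratch directory with a few hypothetical files.
demo_dir=$(mktemp -d)
touch "$demo_dir/data1.csv" "$demo_dir/data2.csv" "$demo_dir/data3.csv" "$demo_dir/notes.txt"

# Globbing is done by the shell, against existing file names.
csv_files=$(cd "$demo_dir" && echo *.csv)       # '*' matches any run of characters
one_char=$(cd "$demo_dir" && echo data?.csv)    # '?' matches exactly one character
not_one=$(cd "$demo_dir" && echo data[!1].csv)  # '[!1]' matches one character other than 1

echo "$csv_files"
echo "$one_char"
echo "$not_one"

# Brace expansion (bash) generates strings whether or not matching files exist.
echo data{1..3}.csv
echo {notes,data1}.txt
```

Note the contrast with regular expressions: in a glob, `*` stands for any characters on its own, whereas in a regex it only repeats the preceding element (so the regex equivalent of the glob `*` is `.*`).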
\ No newline at end of file
# GREP (part2)
## setup
??? quote "First click and follow the instructions below only if you start the course at this stage! Otherwise skip this step!"
{%
include-markdown "pages/bash_manip/bash_manip-0-setup.md"
%}
## Searching patterns (grep)
!!! question "Select all lines related to the year 2021 in the `nat2021.csv` file"
??? example "Click to show the solution"
```bash
grep ";2021;" nat2021.csv
```
!!! question "How many names were given in 2021?"
??? example "Click to show the solution"
```bash
grep ";2021;" nat2021.csv | wc -l
# result: 13501
```
!!! question "Is there more diversity in male or female names in 2021?"
??? example "Click to show the solution"
```bash
# female
grep ";2021;" nat2021.csv | grep "^2" | wc -l
# result: 7112
# male
grep ";2021;" nat2021.csv | grep "^1" | wc -l
# result: 6389
```
!!! question "How many people were named PARIS in 2021?"
??? example "Click to show the solution"
```bash
# one line per sex; the last field of each line is the count
grep "PARIS;2021;" nat2021.csv
# result 16 (5 male and 11 female)
```
Rare names ([see here for documentation](https://www.insee.fr/fr/statistiques/2540004?sommaire=4767262#documentation)) are recorded as `_PRENOMS_RARES`.
!!! question "Could you find all rare names? Do you see any pattern?"
??? example "Click to show the solution"
```bash
grep ";_PRENOMS_RARES;" nat2021.csv
```
People tend to give more and more rare names.
!!! question "What was the most prolific year for the name ZINEDINE?"
??? example "Click to show the solution"
```bash
# command
grep ";ZINEDINE;" nat2021.csv | sort -n -t ';' -k4
# result: 1998
```
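The sorting step in the solution above can be extended to print only the year. The rows in this sketch are invented for illustration (`NAME_X`, `NAME_Y` and their counts are hypothetical, not INSEE data), and `sort_demo.csv` is a hypothetical file name.

```bash
# Hypothetical rows: the names and counts are invented for illustration only.
printf '%s\n' \
  '1;NAME_X;1997;120' \
  '1;NAME_X;1998;1500' \
  '1;NAME_X;1999;900' \
  '1;NAME_Y;1998;9999' > sort_demo.csv

# Keep one name, sort numerically (-n) on field 4 only (-k4,4) with ';' as separator (-t),
# keep the last line (largest count), then extract the year (field 3).
best_year=$(grep ';NAME_X;' sort_demo.csv | sort -t ';' -k4,4 -n | tail -n 1 | cut -d ';' -f 3)
echo "$best_year"
```

The `-n` flag matters here: without it, `sort` compares the counts as text, so `900` would sort after `1500` and the wrong year would be reported.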
You can redirect a result and store it in a file thanks to the `>` redirection:
`command > filename`
!!! question "Save all the names from 2005 in a dedicated file?"
??? example "Click to show the solution"
```bash
# command
grep ";2005;" nat2021.csv > names2005.txt
```
......@@ -5,4 +5,9 @@
<iframe id="iframepdf" src="../Bash_cheat_sheet_level2.pdf" frameborder="0" width="640" height="480" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
</br>
# level 3 - Programming
<iframe id="iframepdf" src="../Bash_cheat_sheet_level3.pdf" frameborder="0" width="640" height="480" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
\ No newline at end of file
<iframe id="iframepdf" src="../Bash_cheat_sheet_level3.pdf" frameborder="0" width="640" height="480" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
# Interesting resources
* [Software carpentry](https://swcarpentry.github.io/shell-novice/index.html)
* [gentoo linux](https://devmanual.gentoo.org/tools-reference/bash/index.html)
\ No newline at end of file
......@@ -107,8 +107,9 @@ nav:
- Course overview: pages/bash_manip/bash_manip-0-overview.md
- Introduction: pages/bash_manip/bash_manip-1-introduction.md
- Basic commands: pages/bash_manip/bash_manip-2-basics.md
- RegEx: pages/bash_manip/bash_manip-3-grep.md
- Grep: pages/bash_manip/bash_manip-3-grep.md
- Grep (part1): pages/bash_manip/bash_manip-3-grep.md
- Regular expressions: pages/bash_manip/bash_manip-4-regex.md
- Grep (part2): pages/bash_manip/bash_manip-5-grep2.md
- Awk: pages/bash_manip/bash_manip-4-awk.md
- Sed: pages/bash_manip/bash_manip-5-sed.md
- Bash scripting:
......