# Extracting from files
For this exercise you will use data on the first names given to children born in France since 1900, downloaded from the "Institut national de la statistique et des études économiques" (see [here](https://www.insee.fr/fr/statistiques/8205621?sommaire=8205628) for details).
```bash
wget https://www.insee.fr/fr/statistiques/fichier/2540004/nat2021_csv.zip
unzip nat2021_csv.zip
```
You should now have a file called `nat2021.csv` in your working directory.
The data contained in this file have this shape:
```
sexe;preusuel;annais;nombre
2;SANDRINE;1973;17605
1;JEAN;1960;17607
1;_PRENOMS_RARES;1904;1430
```
The first line is the header, where `preusuel` stands for `prénom usuel` (usual first name) and `annais` stands for `année de naissance` (year of birth).
The subsequent lines are the data.
`_PRENOMS_RARES` groups the rare first names. They are classified as rare according to the criteria described [here](https://www.insee.fr/fr/statistiques/8205621?sommaire=8205628#documentation).
## Displaying sample (head, tail)
When you have a huge dataset, it is often useful to display only the beginning or the end of the file to get an idea of how it is structured.
The `head` and `tail` commands do exactly this.
!!! question "Display the first 20 lines of the `nat2021.csv` file"
??? example "Click to show the solution"
```bash
head -n 20 nat2021.csv
```
!!! question "Display the last 10 lines of the `nat2021.csv` file"
??? example "Click to show the solution"
```bash
tail -n 10 nat2021.csv
```
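Because both commands also read from standard input, they can be chained with a pipe to display a slice from the middle of a file. The following is a minimal sketch; it assumes a small hypothetical file `sample.csv`, built from the rows shown earlier, rather than the real dataset.

```bash
# Hypothetical sample built from the rows shown earlier (not the real dataset).
workdir=$(mktemp -d)
printf '%s\n' \
  'sexe;preusuel;annais;nombre' \
  '2;SANDRINE;1973;17605' \
  '1;JEAN;1960;17607' \
  '1;_PRENOMS_RARES;1904;1430' > "$workdir/sample.csv"

# Keep the first 3 lines, then the last 2 of those: lines 2 to 3 of the file.
slice=$(head -n 3 "$workdir/sample.csv" | tail -n 2)
echo "$slice"
```

The same pattern should work on `nat2021.csv` with larger numbers, for example to look at lines 16 to 20.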
## Counting words/lines (wc)
!!! question "Count the number of characters in the `nat2021.csv` file"
??? example "Click to show the solution"
```bash
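# -c counts bytes, which equals the number of characters for single-byte text;
# use -m instead to count characters in a multi-byte encoding
wc -c nat2021.csv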
```
!!! question "Count the number of words in the `nat2021.csv` file"
??? example "Click to show the solution"
```bash
wc -w nat2021.csv
```
!!! question "Count the number of lines in the `nat2021.csv` file"
??? example "Click to show the solution"
```bash
wc -l nat2021.csv
```
!!! question "Could you explain why the word count and the line count are so similar?"
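When `wc` is given a file name, it prints that name after the count. Feeding the file through standard input instead yields a bare number, which is easier to reuse in a script. A small sketch, assuming a hypothetical four-line `sample.csv` (the variable names are illustrative only):

```bash
# Hypothetical sample built from the rows shown earlier (not the real dataset).
workdir=$(mktemp -d)
printf '%s\n' \
  'sexe;preusuel;annais;nombre' \
  '2;SANDRINE;1973;17605' \
  '1;JEAN;1960;17607' \
  '1;_PRENOMS_RARES;1904;1430' > "$workdir/sample.csv"

# Reading from stdin suppresses the file name in the output.
# Some wc implementations pad the number with spaces, so strip them.
lines=$(wc -l < "$workdir/sample.csv" | tr -d ' ')

# The first line is the header, so the number of data rows is one less.
records=$((lines - 1))
echo "$records data rows"
```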
## Sorting a tabular file (sort)
It is possible to sort a file or tabulated output using the `sort` command:
```bash
sort nat2021.csv | head
```
Sort is particularly useful when you use some key options:
* `-n` to sort numerically
* `-t` to specify a separator (the default separator is a space or a tab)
* `-k` to specify on which column you want to sort the lines (use together with `-t`)
Try the numerical sort.
```bash
sort -n nat2021.csv | head
```
!!! question "Do you observe any difference?"
!!! question "Which name was given the most often in a single year among the records? Which year was that?"
??? example "Click to show the solution"
```bash
# sort numerically on the 4th column; the largest count ends up on the last line
sort -n -t ';' -k4 nat2021.csv | tail -n 1
```
!!! question "Can you refine the previous command to list the 100 name/year records with the highest counts?"
??? example "Click to show the solution"
```bash
# command
sort -n -t ';' -k4 nat2021.csv | tail -n 100
```
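The commands above sort in ascending order and let the header line take part in the sort. One possible variant, sketched here on a hypothetical `sample.csv`, skips the header with `tail -n +2`, restricts the key to the fourth column only (`-k4,4`), and adds the `r` modifier to list the largest counts first. The `r` modifier and `tail -n +2` are not introduced in the text above; they are shown as an optional refinement.

```bash
# Hypothetical sample built from the rows shown earlier (not the real dataset).
workdir=$(mktemp -d)
printf '%s\n' \
  'sexe;preusuel;annais;nombre' \
  '2;SANDRINE;1973;17605' \
  '1;JEAN;1960;17607' \
  '1;_PRENOMS_RARES;1904;1430' > "$workdir/sample.csv"

# tail -n +2  : print from line 2 onwards, i.e. drop the header
# -k4,4nr     : sort on field 4 only, numerically, in reverse order
top=$(tail -n +2 "$workdir/sample.csv" | sort -t ';' -k4,4nr | head -n 2)
echo "$top"
```

With the largest values first, `head` replaces `tail` for selecting the top records.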
## Extracting columns (cut)
The `cut` command splits each line at a given separator character and extracts the selected field: `cut -d";" -f 2`
* `-d` specifies the separator
* `-f` specifies the field to extract
!!! question "How can you extract only the names from the 100 name/year records with the highest counts?"
??? example "Click to show the solution"
```bash
# command
sort -n -t ';' -k4 nat2021.csv | tail -n 100 | cut -d";" -f 2
```
The `uniq` command can be used to remove duplicates, but its input needs to be sorted for it to work properly.
!!! question "Can you now find a way to remove the duplicates?"
??? example "Click to show the solution"
```bash
# command
sort -n -t ';' -k4 nat2021.csv | tail -n 100 | cut -d";" -f 2 | sort | uniq
```
!!! warning
Keep in mind that `uniq` only removes duplicate lines that are adjacent, which is why it needs sorted data to work properly.
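The effect of sorting can be seen on a tiny example. This sketch uses three made-up input lines in which the repeated name is not on consecutive lines:

```bash
# Three illustrative lines; the two JEAN entries are not adjacent.
names=$(printf '%s\n' JEAN MARIE JEAN)

# Without sorting, uniq sees no adjacent duplicates and keeps all 3 lines.
unsorted=$(printf '%s\n' "$names" | uniq)

# After sorting, the two JEAN lines are adjacent and are collapsed into one.
sorted=$(printf '%s\n' "$names" | sort | uniq)

echo "without sort:"; echo "$unsorted"
echo "with sort:";    echo "$sorted"
```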
## Redirecting an output (>)
You can redirect a result and store it in a file thanks to the `>` redirection:
`command > filename`
!!! question "Save all the records from 2005 in a dedicated file"
??? example "Click to show the solution"
```bash
# command
grep ";2005;" nat2021.csv > names2005.txt
```
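Note that `>` replaces the content of the target file if it already exists. The sketch below illustrates this in a temporary directory with a hypothetical file `names.txt`. The `>>` operator, which appends instead of overwriting, is not covered above and is shown only for contrast.

```bash
# Work in a temporary directory so that no existing file is touched.
workdir=$(mktemp -d)
out="$workdir/names.txt"   # hypothetical output file

echo "JEAN" > "$out"       # creates the file
echo "MARIE" > "$out"      # overwrites it: JEAN is gone
echo "SANDRINE" >> "$out"  # appends a line after MARIE

content=$(cat "$out")
echo "$content"
```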
!!! question "How many times has the name JEAN been given in total?"
??? example "Click to show the solution"
This is starting to be too complicated for the commands you have seen so far: you need `awk`, a command designed for column-based data.
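A minimal sketch of what such an `awk` command could look like, assuming the name is in the second column and the count in the fourth, as in the header shown earlier. It runs on a small hypothetical `sample.csv` rather than the real file, and the second JEAN row is invented for illustration.

```bash
# Hypothetical sample; the 1961 row is invented so that JEAN appears twice.
workdir=$(mktemp -d)
printf '%s\n' \
  'sexe;preusuel;annais;nombre' \
  '2;SANDRINE;1973;17605' \
  '1;JEAN;1960;17607' \
  '1;JEAN;1961;10' > "$workdir/sample.csv"

# -F ';'          : use ';' as the field separator
# $2 == "JEAN"    : keep only lines whose 2nd field is exactly JEAN
# sum += $4       : add the 4th field (the count) to a running total
# END { ... }     : print the total once all lines have been read
total=$(awk -F ';' '$2 == "JEAN" { sum += $4 } END { print sum + 0 }' "$workdir/sample.csv")
echo "$total"
```

Replacing the sample path with `nat2021.csv` should give the total over the whole dataset, for both values of `sexe` combined.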