Manipulating files
To do this exercise, you will need to download French first-name data from the "Institut national de la statistique et des études économiques" (INSEE):
wget https://www.insee.fr/fr/statistiques/fichier/2540004/nat2021_csv.zip
unzip nat2021_csv.zip
You should now have a file called `nat2021.csv` in your working directory.
Displaying sample (head, tail)
When you have a huge dataset, it is often useful to display only the beginning or the end of the file to get an idea of how it is structured.
The `head` and `tail` commands do exactly this.
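
As a quick illustration of what the two commands return, here is a minimal sketch on a throwaway file. The `sample.txt` name and its contents are made up for this example and have nothing to do with the INSEE data.

```bash
# Build a 100-line throwaway file (hypothetical data, one number per line).
tmpdir=$(mktemp -d)
seq 1 100 > "$tmpdir/sample.txt"

# Keep the first 3 lines, then the last 2 lines.
first=$(head -n 3 "$tmpdir/sample.txt")
last=$(tail -n 2 "$tmpdir/sample.txt")

echo "$first"
echo "$last"

rm -r "$tmpdir"
```

Without `-n`, both commands default to 10 lines.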
!!! question "Display the first 20 lines of the `nat2021.csv` file"
??? example "Click to show the solution"
```bash
head -n 20 nat2021.csv
```
!!! question "Display the last 10 lines of the `nat2021.csv` file"
??? example "Click to show the solution"
```bash
tail -n 10 nat2021.csv
```
Counting words/lines (wc)
!!! question "Count the number of characters in the `nat2021.csv` file"
??? example "Click to show the solution"
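```bash
# -c counts bytes; use -m instead to count characters when the file
# contains multibyte (e.g. accented) letters
wc -c nat2021.csv
```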
!!! question "Count the number of lines in the `nat2021.csv` file"
??? example "Click to show the solution"
```bash
wc -l nat2021.csv
```
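
The counts reported by `wc` can be checked on a file small enough to count by hand. This is a minimal sketch; the file name and the two rows are invented for illustration.

```bash
# Two made-up rows: 16 bytes and 14 bytes, newline included.
tmpdir=$(mktemp -d)
printf '1;ALPHA;2020;12\n2;BETA;2021;7\n' > "$tmpdir/sample.csv"

# Reading from standard input keeps the file name out of the output;
# tr strips the padding that some wc implementations add.
lines=$(wc -l < "$tmpdir/sample.csv" | tr -d ' ')
bytes=$(wc -c < "$tmpdir/sample.csv" | tr -d ' ')

echo "lines=$lines bytes=$bytes"

rm -r "$tmpdir"
```

Feeding the file through `<` is handy when the number is reused in another command, since only the count is printed.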
Searching patterns (grep)
!!! question "Select all lines related to the year 2021 in the `nat2021.csv` file"
??? example "Click to show the solution"
```bash
grep ";2021;" nat2021.csv
```
!!! question "How many names have been given in 2021?"
??? example "Click to show the solution"
```bash
grep ";2021;" nat2021.csv | wc -l
# or, equivalently
grep -c ";2021;" nat2021.csv
# result: 13501
```
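
The semicolons around the year in these patterns matter: they pin the match to a whole field. The sketch below shows the difference on a few made-up rows. The sample file and its values are hypothetical, laid out in the sex;name;year;count order that the solutions assume.

```bash
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.csv" <<'EOF'
1;ALPHA;2021;2021
2;BETA;2020;15
2;GAMMA;2021;8
1;DELTA;2019;2021
EOF

# Without separators, "2021" also matches the count column.
loose=$(grep -c "2021" "$tmpdir/sample.csv")

# With a separator on both sides, only the year field can match.
strict=$(grep -c ";2021;" "$tmpdir/sample.csv")

echo "loose=$loose strict=$strict"

rm -r "$tmpdir"
```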
Combining commands (| && ;)
!!! question "Is there more diversity in male or female names in 2021?"
??? example "Click to show the solution"
```bash
# female
grep ";2021;" nat2021.csv | grep "^2" | wc -l
# result: 7112
# male
grep ";2021;" nat2021.csv | grep "^1" | wc -l
# result: 6389
```
!!! question "How many people were named PARIS in 2021?"
??? example "Click to show the solution"
```bash
# the leading ";" makes sure the whole name field is PARIS
grep ";PARIS;2021;" nat2021.csv
# result: 16 (5 male and 11 female)
```
Rare names (see here for documentation) are grouped under the label `_PRENOMS_RARES`.
!!! question "Can you find all the rare names? Do you see any pattern?"
??? example "Click to show the solution"
```bash
grep ";_PRENOMS_RARES;" nat2021.csv
```
People tend to give more and more rare names.
!!! question "Now display the `_PRENOMS_RARES` lines for 2021, followed by the command `echo 'this is the number of rare firstnames in 2021 for boys and girls'`"
??? example "Click to show the solution"
```bash
grep ";_PRENOMS_RARES;2021;" nat2021.csv && echo "this is the number of rare firstnames in 2021 for boys and girls"
```
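
The three operators in this section's title behave differently: `|` feeds the output of one command to the next, `&&` runs the second command only if the first one succeeded, and `;` runs it regardless. Here is a short sketch on a made-up file; the file name and rows are hypothetical.

```bash
tmpdir=$(mktemp -d)
printf '1;ALPHA;2021;12\n2;BETA;2021;7\n' > "$tmpdir/sample.csv"

# |  : grep's output becomes wc's input.
piped=$(grep ";2021;" "$tmpdir/sample.csv" | wc -l | tr -d ' ')

# && : echo runs only when grep finds a match (exit status 0).
and_hit=$(grep -q ";2021;" "$tmpdir/sample.csv" && echo "found")
# No line matches 1900, so grep fails and echo never runs.
and_miss=$(grep -q ";1900;" "$tmpdir/sample.csv" && echo "found") || true

# ;  : echo runs whether or not grep found anything.
semi_miss=$(grep -q ";1900;" "$tmpdir/sample.csv"; echo "done")

echo "piped=$piped and_hit=$and_hit and_miss=$and_miss semi_miss=$semi_miss"

rm -r "$tmpdir"
```

This is why `&&` suits the exercise above: the message is printed only if `grep` actually found rare-name lines.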
Sorting a tabular file (sort)
It is possible to sort a file or tabulated output using the `sort` command:
sort nat2021.csv | head
`sort` is particularly useful when you use some key options:

- `-n` to sort numerically
- `-t` to specify a separator (the default separator is a space or a tab)
- `-k` to specify on which column you want to sort the lines (use together with `-t`)
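
To see what these options change, the sketch below sorts a handful of made-up rows by their fourth column, first as text and then as numbers. The rows are hypothetical data in the same sex;name;year;count layout, and `-k4,4` is used here to limit the sort key to column 4 only.

```bash
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.csv" <<'EOF'
1;ALPHA;2021;9
2;BETA;2021;100
1;GAMMA;2021;25
EOF

# Text sort on column 4: "100" < "25" < "9", compared character by character.
text_last=$(sort -t ';' -k4,4 "$tmpdir/sample.csv" | tail -n 1)

# Numeric sort on column 4: 9 < 25 < 100.
num_last=$(sort -n -t ';' -k4,4 "$tmpdir/sample.csv" | tail -n 1)

echo "text sort ends with:    $text_last"
echo "numeric sort ends with: $num_last"

rm -r "$tmpdir"
```

Only the numeric sort puts the largest count at the end, which is what makes `sort -n ... | tail` a way to find the biggest values.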
!!! question "Look at the first lines after sorting numerically. Do you observe any difference?"
??? example "Click to show the solution"
```bash
sort nat2021.csv | head
sort -n nat2021.csv | head
```
!!! question "Which name was the most popular among all records? Which year?"
??? example "Click to show the solution"
```bash
# command
sort -n -t ';' -k4 nat2021.csv | tail
# result: JEAN in 1946
```
!!! question "By combining commands, try to find which year was the most prolific for the name ZINEDINE"
??? example "Click to show the solution"
```bash
# command
grep ";ZINEDINE;" nat2021.csv | sort -n -t ';' -k4
# result: 1998
```
Extracting columns (cut)
The `cut` command splits each line at a given separator and extracts the selected field(s): `cut -d";" -f 2`

- `-d` specifies the separator
- `-f` specifies the field(s) to extract
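
As a minimal sketch of these two options, the example below extracts one field and then two fields from made-up rows. The file name and values are hypothetical.

```bash
tmpdir=$(mktemp -d)
printf '1;ALPHA;2021;12\n2;BETA;2020;7\n' > "$tmpdir/sample.csv"

# A single field: the name (column 2).
names=$(cut -d ';' -f 2 "$tmpdir/sample.csv")

# Several fields at once: columns 2 and 3, with the separator kept between them.
name_year=$(cut -d ';' -f 2,3 "$tmpdir/sample.csv")

echo "$names"
echo "$name_year"

rm -r "$tmpdir"
```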
!!! question "By combining `cut` with the `sort` command, how can you extract the top 50 most popular names and the corresponding years?"
??? example "Click to show the solution"
```bash
# command
sort -n -t ';' -k4 nat2021.csv | tail -n 50 | cut -d";" -f 2,3
```
The `uniq` command removes duplicate lines. Its input needs to be sorted first, because `uniq` only compares adjacent lines.
!!! question "Using `uniq`, can you now find a way to remove the redundancy and keep only the names?"
??? example "Click to show the solution"
```bash
# command
sort -n -t ';' -k4 nat2021.csv | tail -n 50 | cut -d";" -f 2 | sort | uniq
```
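
The reason for the extra `sort` before `uniq` is easier to see on a tiny list. This is a sketch with made-up names in a hypothetical `names.txt` file.

```bash
tmpdir=$(mktemp -d)
printf 'BETA\nALPHA\nBETA\nALPHA\nALPHA\n' > "$tmpdir/names.txt"

# Unsorted input: only the two adjacent ALPHA lines collapse, so 4 lines remain.
unsorted=$(uniq "$tmpdir/names.txt" | wc -l | tr -d ' ')

# Sorted input: all duplicates are adjacent, so 2 distinct names remain.
sorted=$(sort "$tmpdir/names.txt" | uniq | wc -l | tr -d ' ')

echo "unsorted=$unsorted sorted=$sorted"

rm -r "$tmpdir"
```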
Redirecting an output (>)
You can redirect the output of a command and store it in a file with the `>` redirection:
command > filename
!!! question "Save all the names from 2005 in a file called `selected_names.txt`"
??? example "Click to show the solution"
```bash
grep ";2005;" nat2021.csv > selected_names.txt
```
!!! question "Add all names from 2006 and 2007 into this file. Check the number of lines after each action"
??? example "Click to show the solution"
```bash
wc -l selected_names.txt
grep ";2006;" nat2021.csv >> selected_names.txt
wc -l selected_names.txt
grep ";2007;" nat2021.csv >> selected_names.txt
wc -l selected_names.txt
```
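
The difference between `>` and `>>` is that the first overwrites the target file while the second appends to it. A minimal sketch, using a hypothetical `out.txt` file in a temporary directory:

```bash
tmpdir=$(mktemp -d)
out="$tmpdir/out.txt"

echo "first" > "$out"      # creates the file
echo "second" > "$out"     # overwrites it: only "second" remains
after_overwrite=$(cat "$out")

echo "third" >> "$out"     # appends: "second" is kept, "third" is added
after_append=$(cat "$out")

echo "$after_overwrite"
echo "$after_append"

rm -r "$tmpdir"
```

Using `>` for 2006 in the exercise above would have silently discarded the 2005 lines.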
Replacing patterns (sed)
!!! question "Using the `sed` command, convert the CSV file into a tab-separated file and save it in a new file called `nat2021.tsv`"
??? example "Click to show the solution"
```bash
sed "s/;/\t/g" nat2021.csv > nat2021.tsv
```
!!! question "In the tabular file, replace the sex codes 1/2 with M/F"
??? example "Click to show the solution"
```bash
sed -i "s/^1/M/g" nat2021.tsv
sed -i "s/^2/F/g" nat2021.tsv
```
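
The `s/pattern/replacement/flags` form used in these solutions can be tried on a single made-up line first. In this sketch the line is hypothetical and a comma is used as the replacement so the result is easy to read.

```bash
line='1;ALPHA;2021;12'

# Without the g flag, only the first ";" on the line is replaced.
first_only=$(echo "$line" | sed 's/;/,/')

# With g, every ";" on the line is replaced.
all=$(echo "$line" | sed 's/;/,/g')

# ^ anchors the pattern to the start of the line, so the "1" in "12" is untouched.
anchored=$(echo "$line" | sed 's/^1/M/')

echo "$first_only"
echo "$all"
echo "$anchored"
```

Testing a substitution this way before adding `-i` is a cheap safeguard, since `-i` modifies the file in place.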
Filtering a file (awk)
!!! question "Using `awk` on the tabular file, display the most popular names (given more than 10000 times) between 1980 and 1990"
??? example "Click to show the solution"
```bash
# $2 is the name column ($1 holds the sex code)
awk -F '\t' '$3 >= 1980 && $3 <= 1990 && $4 > 10000 { print $2 }' nat2021.tsv
```
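
An `awk` program is a series of `pattern { action }` rules: the action runs only on lines where the pattern is true. Here is a minimal sketch of the same kind of filter on three made-up rows. The file name and values are hypothetical, in the sex, name, year, count column order.

```bash
tmpdir=$(mktemp -d)
printf 'M\tALPHA\t1985\t12000\nF\tBETA\t1985\t300\nF\tGAMMA\t1995\t20000\n' > "$tmpdir/sample.tsv"

# -F '\t' sets the field separator to a tab.
# Only ALPHA is both inside the year range and above the count threshold.
popular=$(awk -F '\t' '$3 >= 1980 && $3 <= 1990 && $4 > 10000 { print $2 }' "$tmpdir/sample.tsv")

echo "$popular"

rm -r "$tmpdir"
```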