# Extracting from files


```bash
# or curl -O instead of wget
wget https://ftp.ensembl.org/pub/release-113/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz
gunzip Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz
mv Saccharomyces_cerevisiae.R64-1-1.113.gff3 yeast.gff
```

You should now have a file called `yeast.gff` in your working directory. Here is an excerpt of its content:

```
##gff-version 3
###
I	sgd	gene	335	649	.	+	.	ID=gene:YAL069W;biotype=protein_coding;description=Dubious open reading frame%3B unlikely to encode a functional protein%2C based on available experimental and comparative sequence data [Source:SGD%3BAcc:S000002143];gene_id=YAL069W;logic_name=sgd
I	sgd	mRNA	335	649	.	+	.	ID=transcript:YAL069W_mRNA;Parent=gene:YAL069W;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=YAL069W_mRNA
I	sgd	exon	335	649	.	+	.	Parent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=YAL069W_mRNA-E1;rank=1
I	sgd	CDS	335	649	.	+	0	ID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA;protein_id=YAL069W
###
```

The GFF/GTF format describes genomic features, such as genes, exons and CDS, in a standardized way.
Every line starting with `#` is a comment.
Every other line describes one feature and contains 9 tab-separated fields.
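
As a quick check of that layout, the sketch below builds a tiny sample file and counts the tab-separated fields on each non-comment line. The file name `sample.gff` is hypothetical and its attributes are abridged from the excerpt above. `grep` and `awk` are used here only for filtering and counting.

```bash
# Build a small sample file (hypothetical name, attributes abridged from the excerpt).
workdir=$(mktemp -d)
sample="$workdir/sample.gff"
printf '##gff-version 3\n' > "$sample"
printf 'I\tsgd\tgene\t335\t649\t.\t+\t.\tID=gene:YAL069W;biotype=protein_coding\n' >> "$sample"
printf 'I\tsgd\tCDS\t335\t649\t.\t+\t0\tID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA\n' >> "$sample"

# Skip comment lines, then report the distinct field counts found.
field_counts=$(grep -v '^#' "$sample" | awk -F'\t' '{ print NF }' | sort -u)
echo "$field_counts"   # prints 9
```

If anything other than a single `9` appears, some feature line is not a well-formed 9-column record.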


## Concept

`sed` (short for Stream Editor) is a powerful command-line tool used for text manipulation, allowing you to **search**, **replace**, **delete**, and **modify** text within files or streams.

**Syntax**

```bash
sed [Option(s)] 'Command(s)' [File(s)]
```
??? Note "Available Options"
    | Option | Description |
    |----------|----------|
    | -n | Suppress automatic printing of the pattern space |
    | -e | Add the script to the commands to be executed |
    | -f | Add the script file to the commands to be executed |
    | -i | Edit files in place (makes a backup if an extension is supplied) |
    | -r | Use extended regular expressions in the script (`-E` is the more portable spelling) |
    | -s | Treat files as separate rather than as a single continuous long stream |

At our level, the most useful options are `-n` and `-i`.
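
To make the effect of these two options concrete, here is a minimal sketch on a hypothetical three-line file (`demo.txt` is not part of the course data). The `-i.bak` form, with the suffix attached to the option, is assumed here because it is accepted by both GNU and BSD `sed`.

```bash
# Hypothetical three-line file used only for this demonstration.
workdir=$(mktemp -d)
demo="$workdir/demo.txt"
printf 'one\ntwo\nthree\n' > "$demo"

# Without -n, sed prints every line, and 'p' prints line 2 a second time.
without_n=$(sed '2p' "$demo")
# With -n, only the line explicitly printed by 'p' is shown.
with_n=$(sed -n '2p' "$demo")
echo "$without_n"   # one, two, two, three (one per line)
echo "$with_n"      # two

# -i edits the file itself; the .bak suffix keeps a copy of the original.
sed -i.bak '2d' "$demo"
cat "$demo"         # one, three
cat "$demo.bak"     # one, two, three
```

Since `-i` overwrites the input, it is usually safer to run the command without `-i` first and inspect the output.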

## Line selection 

**Syntax**

```bash
sed -n 'line p' file
```

| Command | Description |
|----------|----------|
| `sed -n '8p' file` | Print line 8 |
| `sed -n '8p; 16p' file` | Print lines 8 and 16 |
| `sed -n '8,16 p' file` | Print lines from 8 to 16 |
| `sed -n '8,$ p' file` | Print lines from line 8 to the end of the file |
| `sed -n '1~8 p' file` | Print every 8th line, starting at line 1 (GNU `sed` extension) |
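
The commands in the table can be tried on a file where each line contains its own line number, which makes the selection easy to read. This is a small sketch: `numbers.txt` is a hypothetical file, and the last command assumes GNU `sed`.

```bash
# Hypothetical input: the numbers 1 to 20, one per line, so line N contains N.
workdir=$(mktemp -d)
numbers="$workdir/numbers.txt"
seq 1 20 > "$numbers"

single=$(sed -n '8p' "$numbers")          # 8
pair=$(sed -n '8p; 16p' "$numbers")       # 8 and 16
range=$(sed -n '8,10 p' "$numbers")       # 8, 9, 10
to_end=$(sed -n '18,$ p' "$numbers")      # 18, 19, 20
stepped=$(sed -n '1~8 p' "$numbers")      # 1, 9, 17 (GNU sed only)

printf '%s\n' "$single" "$pair" "$range" "$to_end" "$stepped"
```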


## Line deletion 

**Syntax**

```bash
sed 'line d' file
```

| Command | Description |
|----------|----------|
| `sed '8d' file` | Delete line 8 |
| `sed '8d; 16d' file` | Delete lines 8 and 16 |
| `sed '8,16 d' file` | Delete lines from 8 to 16 |
| `sed '8,$ d' file` | Delete lines from line 8 to the end of the file |
| `sed '1~8d' file` | Delete every 8th line, starting at line 1 (GNU `sed` extension) |
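
The same kind of check works for deletion. In this sketch the input is a hypothetical file holding the numbers 1 to 10, and `paste` is used only to show the surviving lines on a single row. The `first~step` address assumes GNU `sed`.

```bash
# Hypothetical input: the numbers 1 to 10, one per line.
workdir=$(mktemp -d)
numbers="$workdir/numbers.txt"
seq 1 10 > "$numbers"

one_line=$(sed '3d' "$numbers" | paste -s -d ' ' -)      # 1 2 4 5 6 7 8 9 10
a_range=$(sed '3,8 d' "$numbers" | paste -s -d ' ' -)    # 1 2 9 10
to_end=$(sed '4,$ d' "$numbers" | paste -s -d ' ' -)     # 1 2 3
stepped=$(sed '1~4d' "$numbers" | paste -s -d ' ' -)     # 2 3 4 6 7 8 10 (GNU sed only)

printf '%s\n' "$one_line" "$a_range" "$to_end" "$stepped"
```

None of these commands change `numbers.txt`: without `-i`, `sed` only writes the result to standard output.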

## Use of Regular Expression 

**Syntax**

```bash
sed '/RegEx/command' file
```

| Command | Description |
|----------|----------|
| `sed '/^#/d' file` | Delete lines starting with `#` |
| `sed -n '/\tmRNA\t/p' file` | Print lines containing `mRNA` surrounded by tabs (`\t` is understood by GNU `sed`) |
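
The sketch below illustrates why the tabs matter: in a GFF file the string `mRNA` also appears inside attributes, so an unanchored pattern selects too many lines. The file `sample.gff` is hypothetical and its attributes are abridged from the excerpt at the top of the page. Storing a literal tab in a shell variable is one assumed way to stay portable where `\t` is not supported.

```bash
# Hypothetical sample: one comment, then a gene, an mRNA and an exon feature.
workdir=$(mktemp -d)
sample="$workdir/sample.gff"
printf '##gff-version 3\n' > "$sample"
printf 'I\tsgd\tgene\t335\t649\t.\t+\t.\tID=gene:YAL069W;biotype=protein_coding\n' >> "$sample"
printf 'I\tsgd\tmRNA\t335\t649\t.\t+\t.\tID=transcript:YAL069W_mRNA;Parent=gene:YAL069W\n' >> "$sample"
printf 'I\tsgd\texon\t335\t649\t.\t+\t.\tParent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;rank=1\n' >> "$sample"

# Remove comment lines: the 3 feature lines remain.
feature_count=$(sed '/^#/d' "$sample" | wc -l)

# Unanchored pattern: also matches the exon line, whose attributes contain "mRNA".
loose_count=$(sed -n '/mRNA/p' "$sample" | wc -l)

# Pattern surrounded by tabs: matches only the line whose 3rd column is mRNA.
tab=$(printf '\t')
strict_count=$(sed -n "/${tab}mRNA${tab}/p" "$sample" | wc -l)

echo "$feature_count $loose_count $strict_count"
```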

## Substitution

**Syntax**

```bash
sed 's/pattern/replacement/' file
```

| Command | Description |
|----------|----------|
| `s/pattern/replacement/` | Substitute the first occurrence of pattern on each line with replacement |
| `s/pattern/replacement/2` | Substitute the second occurrence of pattern on each line with replacement |
| `s/pattern/replacement/g` | Substitute all occurrences of pattern with replacement |
| `s/pattern/replacement/i` | Substitute the first occurrence of pattern on each line with replacement, ignoring case (GNU `sed` extension) |
| `s/pattern/replacement/gi` | Substitute all occurrences of pattern with replacement, ignoring case (GNU `sed` extension) |
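
A short sketch of how the flags differ, applied to a single made-up line (the string is hypothetical, chosen so that each flag gives a different result). The `i` flag assumes GNU `sed`.

```bash
# Hypothetical one-line input with the same word in different cases.
line='chrI chrI CHRI chri'

first=$(echo "$line" | sed 's/chrI/I/')          # I chrI CHRI chri
second=$(echo "$line" | sed 's/chrI/I/2')        # chrI I CHRI chri
all=$(echo "$line" | sed 's/chrI/I/g')           # I I CHRI chri
exact_case=$(echo "$line" | sed 's/chri/I/')     # chrI chrI CHRI I
first_nocase=$(echo "$line" | sed 's/chri/I/i')  # I chrI CHRI chri (GNU sed only)
all_nocase=$(echo "$line" | sed 's/chri/I/gi')   # I I I I (GNU sed only)

printf '%s\n' "$first" "$second" "$all" "$exact_case" "$first_nocase" "$all_nocase"
```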

## Extract value

It is possible to extract part of a line. Let's take the example of the extraction of a value from an attribute (`tag=value`) with tag `Name` of the 9th column of a GFF/GTF file.

**Syntax**

```bash
sed -n 's/.*Name=\([^;]*\);.*/\1/p' file
```

* `-n` Suppresses default output (only prints matches).
* `s/.../.../p` Substitutes text and prints the line only if a substitution was made.
* `.*Name=` Matches everything before Name=.
* `\([^;]*\)` Captures everything after Name= until the first ;.
* `.*` Matches everything after ; (but doesn’t capture it).
* `\1` Outputs only the captured group (the Name value).
* `/p` Prints the result.
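
Here is a minimal sketch of the extraction on a hypothetical `sample.gff`. The first exon line is abridged from the excerpt at the top of the page; the second one is invented so that `Name` is the last attribute. Because the pattern requires a `;` after the value, that second line is skipped. The variant without the `;` is an assumption added for illustration, not part of the original command.

```bash
# Hypothetical sample: a gene without Name, an exon with Name followed by ';',
# and an invented exon where Name is the last attribute (no trailing ';').
workdir=$(mktemp -d)
sample="$workdir/sample.gff"
printf 'I\tsgd\tgene\t335\t649\t.\t+\t.\tID=gene:YAL069W;biotype=protein_coding\n' > "$sample"
printf 'I\tsgd\texon\t335\t649\t.\t+\t.\tParent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;rank=1\n' >> "$sample"
printf 'I\tsgd\texon\t538\t792\t.\t+\t.\tParent=transcript:YAL068W-A_mRNA;Name=YAL068W-A_mRNA-E1\n' >> "$sample"

# Command from this section: the value must be followed by ';'.
with_semicolon=$(sed -n 's/.*Name=\([^;]*\);.*/\1/p' "$sample")
echo "$with_semicolon"   # YAL069W_mRNA-E1

# Variant (assumption): dropping the ';' also accepts Name as the last attribute.
either=$(sed -n 's/.*Name=\([^;]*\).*/\1/p' "$sample")
echo "$either"           # YAL069W_mRNA-E1 then YAL068W-A_mRNA-E1
```

The gene line has no `Name` attribute, so no substitution happens on it and, thanks to `-n`, it is not printed.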