# Extracting from files
```bash
# or curl -O instead of wget
wget https://ftp.ensembl.org/pub/release-113/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz
gunzip Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz
mv Saccharomyces_cerevisiae.R64-1-1.113.gff3 yeast.gff
```
You should now have a file called `yeast.gff` in your working directory.
```
##gff-version 3
###
I sgd gene 335 649 . + . ID=gene:YAL069W;biotype=protein_coding;description=Dubious open reading frame%3B unlikely to encode a functional protein%2C based on available experimental and comparative sequence data [Source:SGD%3BAcc:S000002143];gene_id=YAL069W;logic_name=sgd
I sgd mRNA 335 649 . + . ID=transcript:YAL069W_mRNA;Parent=gene:YAL069W;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=YAL069W_mRNA
I sgd exon 335 649 . + . Parent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=YAL069W_mRNA-E1;rank=1
I sgd CDS 335 649 . + 0 ID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA;protein_id=YAL069W
###
```
The GFF/GTF format describes genomic features, such as genes, exons, and CDS, in a standardized way.
Every line starting with `#` is a comment or a directive (for example `##gff-version 3`).
Every other line describes one feature and contains 9 tab-separated fields.
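As a quick check of the nine-column layout, the sketch below splits one feature line on tabs and counts the fields. The sample line is taken from the excerpt above with its attributes shortened. The variable names are only illustrative, and the column names in the comment follow the GFF3 specification.

```bash
# One feature line (attributes shortened), with fields separated by real tabs.
feature=$(printf 'I\tsgd\tgene\t335\t649\t.\t+\t.\tID=gene:YAL069W;biotype=protein_coding')

# Columns: seqid, source, type, start, end, score, strand, phase, attributes.
n_fields=$(printf '%s\n' "$feature" | awk -F'\t' '{ print NF }')
feature_type=$(printf '%s\n' "$feature" | cut -f3)

echo "fields: $n_fields, type: $feature_type"
```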
## Concept
`sed` (short for Stream Editor) is a powerful command-line tool used for text manipulation, allowing you to **search**, **replace**, **delete**, and **modify** text within files or streams.
**Syntax**
```bash
sed [Option(s)] 'Command(s)' [File(s)]
```
??? Note "Available Options"
    | Option | Description |
    |----------|----------|
    | -n | Suppress automatic printing of pattern space |
    | -e | Add the script to the commands to be executed |
    | -f | Add the script file to the commands to be executed |
    | -i | Edit files in place (makes backup if extension supplied) |
    | -r | Use extended regular expressions in the script |
    | -s | Treat files as separate rather than as a single continuous long stream |
At this level, the most useful options are `-n` and `-i`.
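A minimal sketch of these two options, run on a throwaway file so that `yeast.gff` stays untouched. The file name `demo.txt` is made up for the illustration. The `-i.bak` form (suffix attached to the option) keeps a backup copy of the original.

```bash
# Build a small three-line file to experiment on.
printf 'one\ntwo\nthree\n' > demo.txt

# -n: nothing is printed by default; the p command prints only line 2.
second=$(sed -n '2p' demo.txt)
echo "$second"

# -i.bak: delete line 2 directly in demo.txt, keeping the original as demo.txt.bak.
sed -i.bak '2d' demo.txt
cat demo.txt
```

Without a backup suffix, `-i` overwrites the file with no way back, so trying the command without `-i` first is a reasonable habit.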
## Line selection
**Syntax**
```bash
sed -n 'line p' file
```
| Command | Description |
|----------|----------|
| `sed -n '8p' file` | Print line 8 |
| `sed -n '8p; 16p' file` | Print lines 8 and 16 |
| `sed -n '8,16 p' file` | Print lines from 8 to 16 |
| `sed -n '8,$ p' file` | Print lines from line 8 to the end of the file |
| `sed -n '1~8 p' file` | Print every 8th line, starting at line 1 (GNU sed) |
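The selections above can be tried on a file whose content equals its line numbers, which makes the result easy to verify. This is only a sketch: `numbers.txt` is a made-up file generated with `seq`, and the `first~step` address assumes GNU sed.

```bash
# Generate 20 lines containing 1..20, so each line's content is its line number.
seq 1 20 > numbers.txt

line8=$(sed -n '8p' numbers.txt)          # a single line
range=$(sed -n '8,10p' numbers.txt)       # a closed range
to_end=$(sed -n '18,$p' numbers.txt)      # from line 18 to the last line
stepped=$(sed -n '1~8p' numbers.txt)      # lines 1, 9, 17 (GNU sed only)

printf '%s\n' "$line8" "---" "$range" "---" "$to_end" "---" "$stepped"
```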
## Line deletion
**Syntax**
```bash
sed 'line d' file
```
| Command | Description |
|----------|----------|
| `sed '8d' file` | Delete line 8 |
| `sed '8d; 16d' file` | Delete lines 8 and 16 |
| `sed '8,16 d' file` | Delete lines from 8 to 16 |
| `sed '8,$ d' file` | Delete lines from line 8 to the end of the file |
| `sed '1~8d' file` | Delete every 8th line, starting at line 1 (GNU sed) |
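Deletion works on the output stream only: unless `-i` is given, the input file is left unchanged. A small sketch on a generated file (`numbers.txt` is a made-up name):

```bash
# Five lines containing 1..5.
seq 1 5 > numbers.txt

without_2_4=$(sed '2d; 4d' numbers.txt)   # drops lines 2 and 4
head_only=$(sed '3,$d' numbers.txt)       # drops line 3 through the end

printf '%s\n' "$without_2_4" "---" "$head_only"

# The file itself still has all five lines.
wc -l < numbers.txt
```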
## Use of Regular Expressions
**Syntax**
```bash
sed '/RegEx/ command' file
```
| Command | Description |
|----------|----------|
| `sed '/^#/d' file` | Delete lines starting with `#` |
| `sed -n '/\tmRNA\t/p' file` | Print lines containing `mRNA` surrounded by tabs (`\t` is understood by GNU sed) |
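Both patterns can be tested on a tiny hand-made GFF file. In this sketch, `mini.gff` and the shell variables are made up. The tab is passed to `sed` through a shell variable as one way of not relying on `\t` support in the regular expression.

```bash
# A three-line GFF: one directive and two features (attributes shortened).
printf '##gff-version 3\n' > mini.gff
printf 'I\tsgd\tgene\t335\t649\t.\t+\t.\tID=gene:YAL069W\n' >> mini.gff
printf 'I\tsgd\tmRNA\t335\t649\t.\t+\t.\tID=transcript:YAL069W_mRNA;Parent=gene:YAL069W\n' >> mini.gff

# Remove comment and directive lines, keeping only features.
features=$(sed '/^#/d' mini.gff)

# Keep only lines where mRNA is a whole field, i.e. surrounded by tabs.
tab=$(printf '\t')
mrna=$(sed -n "/${tab}mRNA${tab}/p" mini.gff)

printf '%s\n' "$features" "---" "$mrna"
```

Requiring the surrounding tabs keeps the `gene` line out of the result, even though its attributes could mention `mRNA` elsewhere in a real file.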
## Substitution
**Syntax**
```bash
sed 's/pattern/replacement/' file
```
| Command | Description |
|----------|----------|
| `s/pattern/replacement/` | Substitute the first occurrence of pattern with replacement |
| `s/pattern/replacement/2` | Substitute the second occurrence of pattern with replacement |
| `s/pattern/replacement/g` | Substitute all occurrences of pattern with replacement |
| `s/pattern/replacement/i` | Substitute the first occurrence of pattern with replacement, ignoring case (GNU sed) |
| `s/pattern/replacement/gi` | Substitute all occurrences of pattern with replacement, ignoring case (GNU sed) |
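The effect of each flag is easiest to see on a single line with several matches. The input string below is made up for the illustration, and the case-insensitive variant assumes GNU sed.

```bash
# Made-up input with three variants of the same prefix.
line='chrI chrI chri'

first=$(echo "$line" | sed 's/chrI/I/')      # first occurrence only
second=$(echo "$line" | sed 's/chrI/I/2')    # second occurrence only
all=$(echo "$line" | sed 's/chrI/I/g')       # every exact-case occurrence
all_ci=$(echo "$line" | sed 's/chrI/I/gi')   # every occurrence, any case (GNU sed)

printf '%s\n' "$first" "$second" "$all" "$all_ci"
```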
## Extract value
It is possible to extract part of a line. As an example, take the extraction of the value of an attribute (`tag=value`) with the tag `Name` from the 9th column of a GFF/GTF file.
**Syntax**
```bash
sed -n 's/.*Name=\([^;]*\);.*/\1/p' file
```
* `-n` Suppresses default output (only explicitly printed lines are shown).
* `s/.../.../p` Substitutes text and prints the line only if a substitution was made.
* `.*Name=` Matches everything up to and including `Name=`.
* `\([^;]*\)` Captures everything after `Name=` up to the first `;`.
* `;.*` Matches the `;` and everything after it (without capturing it).
* `\1` Replaces the whole line with the captured group (the `Name` value).
* `/p` Prints the result.
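The extraction can be checked on two feature lines, one with a `Name` attribute and one without. In this sketch, `exon.gff` is a made-up file whose lines are shortened from the excerpt at the top of the page. The second command is a variant, not part of the original recipe, that also accepts `Name` as the last attribute (no trailing `;`).

```bash
# An exon line carrying Name=..., and a CDS line without it.
printf 'I\tsgd\texon\t335\t649\t.\t+\t.\tParent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;rank=1\n' > exon.gff
printf 'I\tsgd\tCDS\t335\t649\t.\t+\t0\tID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA\n' >> exon.gff

# Only the line where the substitution succeeds is printed.
names=$(sed -n 's/.*Name=\([^;]*\);.*/\1/p' exon.gff)
echo "$names"

# Variant without the mandatory ';', for a Name at the end of the line.
last=$(printf 'I\tsgd\texon\t1\t9\t.\t+\t.\tID=x;Name=foo\n' | sed -n 's/.*Name=\([^;]*\).*/\1/p')
echo "$last"
```

Because the `CDS` line has no `Name=`, no substitution happens there and `-n` keeps it out of the output.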