# Extracting from files

```bash
# or curl -O instead of wget
wget https://ftp.ensembl.org/pub/release-113/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz
gunzip Saccharomyces_cerevisiae.R64-1-1.113.gff3.gz
mv Saccharomyces_cerevisiae.R64-1-1.113.gff3 yeast.gff
```

You should now have a file called `yeast.gff` in your working directory. An excerpt of the file:

```
##gff-version 3
###
I	sgd	gene	335	649	.	+	.	ID=gene:YAL069W;biotype=protein_coding;description=Dubious open reading frame%3B unlikely to encode a functional protein%2C based on available experimental and comparative sequence data [Source:SGD%3BAcc:S000002143];gene_id=YAL069W;logic_name=sgd
I	sgd	mRNA	335	649	.	+	.	ID=transcript:YAL069W_mRNA;Parent=gene:YAL069W;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=YAL069W_mRNA
I	sgd	exon	335	649	.	+	.	Parent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=YAL069W_mRNA-E1;rank=1
I	sgd	CDS	335	649	.	+	0	ID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA;protein_id=YAL069W
###
```

The GFF/GTF formats describe genomic features, such as genes, exons, and CDS, in a standardized way. Every line starting with `#` is a comment. Every other line describes one feature and contains 9 tab-separated fields.

## Concept

`sed` (short for Stream Editor) is a command-line tool for text manipulation. It lets you **search**, **replace**, **delete**, and **modify** text within files or streams.

**Syntax**

```bash
sed [Option(s)] 'Command(s)' [File(s)]
```

??? Note "Available Options"

    | Option | Description |
    |----------|----------|
    | -n | Suppress automatic printing of pattern space |
    | -e | Add the script to the commands to be executed |
    | -f | Add the script file to the commands to be executed |
    | -i | Edit files in place (makes backup if extension supplied) |
    | -r | Use extended regular expressions in the script |
    | -s | Treat files as separate rather than as a single continuous long stream |

At our level, the most useful options are `-n` and `-i`.

## Line selection

**Syntax**

```bash
sed -n 'line p' file
```

| Command | Description |
|----------|----------|
| `sed -n '8p' file` | Print line 8 |
| `sed -n '8p; 16p' file` | Print lines 8 and 16 |
| `sed -n '8,16 p' file` | Print lines from 8 to 16 |
| `sed -n '8,$ p' file` | Print lines from line 8 to the end of the file |
| `sed -n '1~8 p' file` | Print every 8th line, starting from line 1 (GNU sed) |

## Line deletion

**Syntax**

```bash
sed 'line d' file
```

| Command | Description |
|----------|----------|
| `sed '8d' file` | Delete line 8 |
| `sed '8d; 16d' file` | Delete lines 8 and 16 |
| `sed '8,16 d' file` | Delete lines from 8 to 16 |
| `sed '8,$ d' file` | Delete lines from line 8 to the end of the file |
| `sed '1~8d' file` | Delete every 8th line, starting from line 1 (GNU sed) |

## Use of regular expressions

**Syntax**

```bash
sed '/RegEx/command' file
```

| Command | Description |
|----------|----------|
| `sed '/^#/d' file` | Delete lines starting with `#` |
| `sed -n '/\tmRNA\t/p' file` | Print lines matching `mRNA` surrounded by tabs (`\t` requires GNU sed) |

## Substitution

**Syntax**

```bash
sed 's/pattern/replacement/' file
```

| Command | Description |
|----------|----------|
| `s/pattern/replacement/` | Substitute the first occurrence of pattern with replacement |
| `s/pattern/replacement/2` | Substitute the second occurrence of pattern with replacement |
| `s/pattern/replacement/g` | Substitute all occurrences of pattern with replacement |
| `s/pattern/replacement/i` | Substitute the first occurrence of pattern with replacement, ignoring case |
| `s/pattern/replacement/gi` | Substitute all occurrences of pattern with replacement, ignoring case |

## Extract value

It is possible to extract part of a line. As an example, let's extract the value of an attribute (`tag=value`) with the tag `Name` from the 9th column of a GFF/GTF file.

**Syntax**

```bash
sed -n 's/.*Name=\([^;]*\);.*/\1/p' file
```

* `-n` Suppresses the default output, so only explicitly printed lines appear.
* `s/.../.../` Substitutes the matched text (here, the whole line) with the replacement.
* `.*Name=` Matches everything up to and including `Name=`.
* `\([^;]*\)` Captures everything after `Name=` up to the first `;`.
* `;.*` Matches the `;` and everything after it (without capturing it).
* `\1` Replaces the line with the captured group (the `Name` value).
* `p` Prints the line only if a substitution was made.
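
To see the extraction at work without downloading the full annotation, the sketch below builds a three-feature sample from the `yeast.gff` excerpt shown earlier and runs the command on it. The temporary file and the variable names `sample`, `names`, and `protein_ids` are illustrative assumptions, and the second command (extracting `protein_id`) is a variation that is not part of the original example.

```bash
#!/usr/bin/env bash
# Build a small tab-separated GFF sample (lines copied from the excerpt above).
sample=$(mktemp)
printf '##gff-version 3\n' > "$sample"
printf 'I\tsgd\tmRNA\t335\t649\t.\t+\t.\tID=transcript:YAL069W_mRNA;Parent=gene:YAL069W;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=YAL069W_mRNA\n' >> "$sample"
printf 'I\tsgd\texon\t335\t649\t.\t+\t.\tParent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=YAL069W_mRNA-E1;rank=1\n' >> "$sample"
printf 'I\tsgd\tCDS\t335\t649\t.\t+\t0\tID=CDS:YAL069W;Parent=transcript:YAL069W_mRNA;protein_id=YAL069W\n' >> "$sample"

# Only the exon line carries a Name attribute, so a single value is printed.
names=$(sed -n 's/.*Name=\([^;]*\);.*/\1/p' "$sample")
echo "$names"          # prints: YAL069W_mRNA-E1

# Variation (assumption): protein_id is the last attribute on the CDS line,
# so the pattern does not require a trailing ';' after the captured value.
protein_ids=$(sed -n 's/.*protein_id=\([^;]*\).*/\1/p' "$sample")
echo "$protein_ids"    # prints: YAL069W

rm -f "$sample"
```

The class `[^;]*` stops the capture at the first `;`, whereas `.*` would run on to the last `;` of the line. Lines without the tag are left unmatched, and because of `-n` they produce no output.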