TP 3.2: Data mining from files

Prerequisites

TP 3.1 must be done beforehand.

You must be in the directory ~/save/tp_unix.

For this practical, we will use blastn. It can be run with the following commands:

module load bioinfo/NCBI_Blast+/2.10.0+
blastn -query data/ab005233.fasta -db ensembl_arabidopsis_thaliana_cdna -outfmt 7

Use "blast" to produce tabular files

Question

Run the blastn command and redirect the results to a file named ab005233.blast.

Solution
module load bioinfo/NCBI_Blast+/2.10.0+
blastn -query data/ab005233.fasta -db ensembl_arabidopsis_thaliana_cdna -outfmt 7 > ab005233.blast
cat ab005233.blast

Sort result file

Question

Sort the file ab005233.blast by % identity, in descending order. To find the right column, display the beginning of the file.

Remember to skip the first 5 lines (the header comments) with the tail command.

Solution
tail -n +6 ab005233.blast | sort -k 3,3 -n -r
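
The sort step can be tried without running BLAST. The following is a minimal sketch, assuming a small made-up file (the name demo.blast and all hit values are invented for illustration) laid out like a tabular BLAST report: comment lines start with #, and the third tab-separated column holds the % identity. It filters on # rather than counting header lines, which also removes any comment line found at the end of the file, and it restricts the sort key to column 3 with -k 3,3.

```shell
# Build a tiny stand-in for a BLAST tabular file (values are invented).
workdir=$(mktemp -d)
printf '# BLASTN\n# Query: demo\n' > "$workdir/demo.blast"
printf 'q1\tsubjA\t87.50\t200\n' >> "$workdir/demo.blast"
printf 'q1\tsubjB\t99.10\t180\n' >> "$workdir/demo.blast"
printf 'q1\tsubjC\t92.30\t150\n' >> "$workdir/demo.blast"
printf '# BLAST processed 1 queries\n' >> "$workdir/demo.blast"

# Drop every comment line, then sort on column 3 only (-k 3,3),
# numerically (-n) and in descending order (-r).
sorted=$(grep -v '^#' "$workdir/demo.blast" | sort -k 3,3 -n -r)
printf '%s\n' "$sorted"
```

On the real file, the same pipeline would read: grep -v '^#' ab005233.blast | sort -k 3,3 -n -r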

Display some columns

Question

Using the same blast file, display only the subject names.

Solution
head ab005233.blast
tail -n +6 ab005233.blast | cut -f 2
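
When a subject has several hits, its name appears several times in the output of cut. As a hedged sketch on invented data (the query and subject names below are not from the real file), the names can be piped through sort -u to list each subject once:

```shell
# Invented hit lines: query, subject, % identity (tab-separated).
hits=$(printf 'q1\tsubjA\t99.1\nq1\tsubjB\t95.0\nq1\tsubjA\t88.2\n')

# Keep column 2 (the subject name), then remove duplicate names.
subjects=$(printf '%s\n' "$hits" | cut -f 2 | sort -u)
printf '%s\n' "$subjects"
```

cut splits on tabs by default, which is why no -d option is needed for a tab-separated file.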

Concatenate data files

Question

Go to the directory ~/save/tp_unix/data and concatenate the fasta files matching ab005*.fasta into a new file called mes_sequences.fasta.

Count the number of sequences in the new file.

Solution
cd ~/save/tp_unix/data
cat ab005*.fasta > mes_sequences.fasta
grep -c ">" mes_sequences.fasta

Question

Append the sequence from ab017070.fasta to the file mes_sequences.fasta.

Solution
cat ab017070.fasta >> mes_sequences.fasta
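
The difference between > and >> is easy to check on throwaway files. The sketch below works in a temporary directory with invented file names and sequences (a1.fasta, a2.fasta, b1.fasta and merged.fasta are placeholders, not files from the practical) and counts the headers before and after the append:

```shell
# Toy FASTA files in a temporary directory (names and sequences are invented).
workdir=$(mktemp -d)
printf '>seq1\nacgt\n' > "$workdir/a1.fasta"
printf '>seq2\nttga\n' > "$workdir/a2.fasta"
printf '>seq3\nccat\n' > "$workdir/b1.fasta"

# ">" creates the target file, or overwrites it if it already exists.
cat "$workdir"/a*.fasta > "$workdir/merged.fasta"
before=$(grep -c '>' "$workdir/merged.fasta")

# ">>" appends to the end of the target file without erasing it.
cat "$workdir/b1.fasta" >> "$workdir/merged.fasta"
after=$(grep -c '>' "$workdir/merged.fasta")

echo "before: $before, after: $after"
```

The output file is deliberately given a name that the a*.fasta wildcard does not match, so that it cannot end up among its own inputs.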

Display page by page

Question

Display the file mes_sequences.fasta page by page.

Search for the string AB017070 to check that the sequence was added correctly. Use / in the pager to start a search.

Solution

First, run a pager:

less mes_sequences.fasta

Then type / followed by the search string (here AB017070) and press Enter to search inside the document.

When finished, press q to quit.

Count the number of sequences

Question

Count the number of sequences using the grep command.

Solution
grep -c ">" mes_sequences.fasta
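
grep -c counts matching lines rather than individual matches, so a header that contains more than one > is still counted once. A minimal sketch on invented FASTA content, with the pattern additionally anchored by ^ so that only lines beginning with > are counted:

```shell
# Invented FASTA content: two records, the first header contains an extra ">".
fasta=$(printf '>seq1 length>100\nacgtacgt\n>seq2\nttgacc\n')

# Count lines that begin with ">" (one per sequence).
n_seq=$(printf '%s\n' "$fasta" | grep -c '^>')
echo "$n_seq"
```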

Compare two files

Question

Use the diff command to compare the file ab106670.fasta with the file /save/user/formation/tp_unix/ab106670_bis.fasta.

Solution
diff ab106670.fasta /save/user/formation/tp_unix/ab106670_bis.fasta
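
To see what diff reports, here is a small sketch on two invented files (one.fasta and two.fasta are placeholders) that differ on a single sequence line. diff prints the differing lines, prefixed with < for the first file and > for the second, and its exit status tells a script whether the files are identical:

```shell
# Two invented FASTA files that differ on their last line.
workdir=$(mktemp -d)
printf '>seq1\nacgt\nttga\n' > "$workdir/one.fasta"
printf '>seq1\nacgt\nttca\n' > "$workdir/two.fasta"

# diff exits with status 0 when the files are identical and 1 when they differ,
# so it can be used directly as the condition of an "if".
if diff "$workdir/one.fasta" "$workdir/two.fasta"; then
    result="identical"
else
    result="different"
fi
echo "$result"
```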

Search in a directory

Question

Search the fasta files for sequences that contain the pattern ttatatatc.

Solution
grep "ttatatatc" *.fasta
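
When only the names of the matching files are of interest, grep -l can be used. The sketch below assumes two invented files (x.fasta and y.fasta, with made-up sequences) in a temporary directory:

```shell
# Two invented FASTA files; only the first one contains the pattern.
workdir=$(mktemp -d)
printf '>seq1\nacgttatatatcgg\n' > "$workdir/x.fasta"
printf '>seq2\nacgtacgtacgt\n' > "$workdir/y.fasta"

# -l prints only the names of the files that contain the pattern.
matching=$(cd "$workdir" && grep -l 'ttatatatc' *.fasta)
echo "$matching"
```

grep works line by line, so an occurrence of the pattern that is split across two lines of a wrapped sequence is not found.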