TP 3.2: Data mining from files
Prerequisites¶
TP 3.1 must be done beforehand.
You must be in the directory ~/save/tp_unix.
For this practical, we will use blastn. It can be run with the following commands:
module load bioinfo/NCBI_Blast+/2.10.0+
blastn -query data/ab005233.fasta -db ensembl_arabidopsis_thaliana_cdna -outfmt 7
Use "blast" to produce tabular files¶
Question
Run the blastn command and redirect the results to a file named ab005233.blast.
Solution
module load bioinfo/NCBI_Blast+/2.10.0+
blastn -query data/ab005233.fasta -db ensembl_arabidopsis_thaliana_cdna -outfmt 7 > ab005233.blast
cat ab005233.blast
Sort result file¶
Question
Sort the file ab005233.blast by % identity, in reverse (descending) order. To find the right column, display the beginning of the file first.
Think about removing the first 5 lines (the comment header) with the tail command.
Solution
head ab005233.blast
tail -n +6 ab005233.blast | sort -k 3,3 -r -n
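Counting header lines by hand is fragile, and BLAST output with comments usually also ends with a trailing comment line. As a possible alternative, the sketch below filters out every line starting with # before sorting. The file demo.blast and its contents are made up for illustration; only the layout (tab-separated, % identity in column 3) is assumed to match the real file.

```shell
# Hypothetical sample that mimics commented tabular BLAST output.
workdir=$(mktemp -d)
printf '%s\n' '# BLASTN 2.10.0+' '# Query: demo' '# Database: demo_db' \
    '# Fields: query, subject, % identity' '# 3 hits found' > "$workdir/demo.blast"
printf 'q1\tsubjA\t87.50\nq1\tsubjB\t100.00\nq1\tsubjC\t9.25\n' >> "$workdir/demo.blast"
printf '%s\n' '# BLAST processed 1 queries' >> "$workdir/demo.blast"

# Remove all comment lines, then sort on column 3 only (-k 3,3),
# numerically (n) and in descending order (r).
# LC_ALL=C makes sure "." is read as the decimal separator.
tab=$(printf '\t')
sorted=$(grep -v '^#' "$workdir/demo.blast" | LC_ALL=C sort -t "$tab" -k 3,3nr)
printf '%s\n' "$sorted"

rm -rf "$workdir"
```

Without the n flag, sort compares the values as text, so 9.25 would be placed before 100.00 in ascending order.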
Display some columns¶
Question
Using the same BLAST file, display only the subject names.
Solution
head ab005233.blast
tail -n +6 ab005233.blast | cut -f 2
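The same subject can appear in several hits, so the extracted column may contain repeats. A minimal sketch of one way to list each subject once, using a made-up file hits.tsv that is assumed to have the subject name in column 2:

```shell
# Hypothetical tab-separated hits: query, subject, % identity.
workdir=$(mktemp -d)
printf 'q1\tsubjA\t99.0\nq1\tsubjB\t95.0\nq1\tsubjA\t90.0\n' > "$workdir/hits.tsv"

# cut -f 2 keeps the second tab-separated field;
# sort -u sorts the names and drops duplicates.
subjects=$(cut -f 2 "$workdir/hits.tsv" | sort -u)
printf '%s\n' "$subjects"

rm -rf "$workdir"
```

cut splits on tabs by default, which is why no delimiter option is needed for this kind of file.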
Concatenate data files¶
Question
Go into the directory ~/save/tp_unix/data and concatenate the FASTA files matching ab005*.fasta into a new file called mes_sequences.fasta.
Count the number of sequences in the new file.
Solution
cd ~/save/tp_unix/data
cat ab005*.fasta > mes_sequences.fasta
grep -c ">" mes_sequences.fasta
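The sketch below reproduces the concatenate-then-count pattern on two made-up FASTA files, so it can be tried without the course data. The names demo001.fasta, demo002.fasta and all.fasta are illustrative only.

```shell
# Two hypothetical FASTA files with one sequence each.
workdir=$(mktemp -d)
printf '>seq1 demo\nACGT\n' > "$workdir/demo001.fasta"
printf '>seq2 demo\nGGCC\nTTAA\n' > "$workdir/demo002.fasta"

# The output name must not match the glob, otherwise cat would
# be asked to read the file it is writing to.
cat "$workdir"/demo00*.fasta > "$workdir/all.fasta"

# Each sequence has exactly one header line starting with ">".
count=$(grep -c '^>' "$workdir/all.fasta")
echo "$count"

rm -rf "$workdir"
```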
Question
Append the sequence from ab017070.fasta to the file mes_sequences.fasta.
Solution
cat ab017070.fasta >> mes_sequences.fasta
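The difference between >> (append) and > (overwrite) is easy to get wrong. The following sketch, built on made-up files, shows the effect of each on the number of sequences in the target file.

```shell
# Hypothetical starting files.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$workdir/all.fasta"
printf '>seq2\nTTGG\n' > "$workdir/extra.fasta"

# ">>" adds to the end of the existing file.
cat "$workdir/extra.fasta" >> "$workdir/all.fasta"
after_append=$(grep -c '^>' "$workdir/all.fasta")

# ">" replaces the existing content entirely.
cat "$workdir/extra.fasta" > "$workdir/all.fasta"
after_overwrite=$(grep -c '^>' "$workdir/all.fasta")

echo "after append: $after_append, after overwrite: $after_overwrite"

rm -rf "$workdir"
```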
Display page by page¶
Question
Display the file mes_sequences.fasta page by page.
Search for the string AB017070 to check that the sequence was added correctly. Use the / key in the pager to start a search.
Solution
First, run a pager:
less mes_sequences.fasta
Then type / followed by the search string (here AB017070) and press Enter to search inside the document.
When finished, press q to quit.
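When an interactive pager is not convenient, for example inside a script, the same check can be sketched with grep. The file all.fasta below is a made-up stand-in for the real concatenated file.

```shell
# Hypothetical concatenated FASTA file.
workdir=$(mktemp -d)
printf '>AB005233 demo\nACGT\n>AB017070 demo\nTTGG\n' > "$workdir/all.fasta"

# -n prefixes each matching line with its line number.
match=$(grep -n 'AB017070' "$workdir/all.fasta")
echo "$match"

# Interactive alternative (not run here): open the pager directly
# at the first match with
#   less -p AB017070 all.fasta

rm -rf "$workdir"
```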
Count the number of sequences¶
Question
Count the number of sequences using the grep command.
Solution
grep -c ">" mes_sequences.fasta
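When several files are given, grep -c reports one count per file, which can help to spot which file contributes how many sequences. A small sketch on made-up files:

```shell
# Hypothetical FASTA files: one with 1 sequence, one with 2.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$workdir/a.fasta"
printf '>seq2\nGGCC\n>seq3\nTTAA\n' > "$workdir/b.fasta"

# With more than one file, each count is prefixed by the file name.
counts=$(cd "$workdir" && grep -c '^>' *.fasta)
printf '%s\n' "$counts"

rm -rf "$workdir"
```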
Compare two files¶
Question
Use the diff command to compare the file ab106670.fasta with the file /save/user/formation/tp_unix/ab106670_bis.fasta.
Solution
diff ab106670.fasta /save/user/formation/tp_unix/ab106670_bis.fasta
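To get a feel for how diff reports a change, the sketch below compares two made-up files that differ by a single base on line 3. It also uses the exit status of diff (0 when the files are identical, 1 when they differ), which is handy in scripts.

```shell
# Two hypothetical FASTA files differing on the last line only.
workdir=$(mktemp -d)
printf '>seq1\nACGT\nGGCC\n' > "$workdir/a.fasta"
printf '>seq1\nACGT\nGGCA\n' > "$workdir/b.fasta"

# Capture the report and branch on the exit status of diff.
if changes=$(diff "$workdir/a.fasta" "$workdir/b.fasta"); then
    status="identical"
else
    status="different"
fi

echo "$status"
# "3c3" means line 3 of the first file was changed into line 3 of the second;
# lines from the first file start with "<", lines from the second with ">".
printf '%s\n' "$changes"

rm -rf "$workdir"
```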
Search in a directory¶
Question
Search the FASTA files for sequences that contain the pattern ttatatatc.
Solution
grep "ttatatatc" *.fasta
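Two limits of a plain grep are worth keeping in mind: it is case-sensitive, and it works line by line, so a motif split across two lines of a wrapped sequence is not found. The sketch below illustrates both on made-up files; -i ignores case and -l prints only the names of matching files.

```shell
# Hypothetical files: lowercase match, uppercase match,
# and a motif broken across two lines.
workdir=$(mktemp -d)
printf '>seq1\nacgttatatatcgg\n' > "$workdir/one.fasta"
printf '>seq2\nACGTTATATATCGG\n' > "$workdir/two.fasta"
printf '>seq3\nacgttat\natatcgg\n' > "$workdir/three.fasta"

# Case-sensitive search: only the lowercase sequence matches.
exact=$(cd "$workdir" && grep -l 'ttatatatc' *.fasta)

# Case-insensitive search: the uppercase sequence matches too,
# but the wrapped sequence in three.fasta is still missed.
nocase=$(cd "$workdir" && grep -il 'ttatatatc' *.fasta)

echo "exact: $exact"
printf 'ignoring case:\n%s\n' "$nocase"

rm -rf "$workdir"
```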