TP 1.1: Prepare data

Goal: Refresh your memory of Linux commands.

Prepare

Connect to cluster

Start your machine and open a terminal (on Windows, please use MobaXterm). You can now connect to the Genotoul server with ssh.

ssh -X <username>@genobioinfo.toulouse.inrae.fr

Don't forget to replace <username> with your own username.

Create project

Question

In the work directory, create a new directory named cluster and go inside it.

Solution

cd work/
mkdir cluster
cd cluster/
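
The same steps can be written so that they are safe to run more than once. This is a minimal sketch, assuming the work directory sits directly under your home directory (as the relative path in the solution suggests); adapt the path if your account is laid out differently.

```shell
# Assumption: work/ lives directly under your home directory.
# -p creates any missing parent directory and does not fail
# if cluster/ already exists.
mkdir -p "$HOME/work/cluster"
cd "$HOME/work/cluster"

# Print the current directory to confirm where you are.
pwd
```

Using an absolute path means the commands work from any starting directory, not only from your home directory.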

Get data

Question

Download the transcript file from https://web-genobioinfo.toulouse.inrae.fr/~formation/cluster/data/contigs.fasta.gz

Solution

wget https://web-genobioinfo.toulouse.inrae.fr/~formation/cluster/data/contigs.fasta.gz

Uncompress files

Question

Uncompress the file.

Solution

gunzip contigs.fasta.gz
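
Note that gunzip replaces contigs.fasta.gz with contigs.fasta. If you would rather keep the archive, one option is to write the decompressed stream to a new file. The sketch below is only an illustration: it works on a tiny made-up file, demo.fasta, in a temporary directory, so it does not touch your real data.

```shell
# Work in a throwaway directory so nothing in work/cluster is modified.
tmpdir=$(mktemp -d)
cd "$tmpdir"

# Hypothetical two-record FASTA file, compressed to mimic contigs.fasta.gz.
printf '>seq1 demo\nACGTACGT\n>seq2 demo\nGGCCTTAA\n' > demo.fasta
gzip demo.fasta                    # leaves only demo.fasta.gz

# Check that the archive is intact before extracting it.
gzip -t demo.fasta.gz && echo "archive OK"

# -c writes to standard output, so the .gz file is left in place.
gunzip -c demo.fasta.gz > demo.fasta

# Both files are now present.
ls demo.fasta demo.fasta.gz
```

On the cluster, the equivalent command would be gunzip -c contigs.fasta.gz > contigs.fasta.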

Note

Manipulating files (compressing, zipping, ...) can use a lot of resources, so it should be done on a cluster node whenever possible. We will learn how to connect to a node in the next practical sessions.

Look at data

Question

Display the first ten lines of the contigs.fasta file, then the first twenty lines.

What is the file format?

What kind of data does it contain?

Solution

The commands:

The first ten lines:
```
head contigs.fasta
```
The first twenty lines:
```
head -n 20 contigs.fasta
```
The file format:
```
file contigs.fasta
```
This will return contigs.fasta: ASCII text

The file contigs.fasta is a FASTA file. It is a text file made up of blocks of data. Each block begins with a description line: a > followed by a description of the data, all on a single line. The lines immediately following the description line hold the sequence data, which can be nucleic acid or protein.

Here, contigs.fasta contains nucleic acid sequences.
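
To check this description yourself, you can count the description lines and list the characters used in the sequence lines. This is a minimal sketch on a made-up file, demo.fasta (hypothetical; on the cluster, replace it with contigs.fasta).

```shell
# Build a tiny example file in a temporary directory.
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf '>seq1 demo\nACGTACGT\n>seq2 demo\nGGCCTTAA\n' > demo.fasta

# Each record has exactly one line starting with '>',
# so counting those lines counts the sequences.
grep -c '^>' demo.fasta

# Drop the description lines, put one character per line,
# then count how often each letter occurs.
# Seeing only A, C, G, T (and possibly N) suggests nucleic data.
grep -v '^>' demo.fasta | fold -w 1 | sort | uniq -c
```

A protein file would show many more distinct letters in the second command, since amino acids use a larger alphabet.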