
TP 2: Multithreading and ecological behavior

Objective: Speed up a job by using several CPUs on one node. Create efficient jobs in a context of digital sobriety/ecological practices.

Going faster

Multithreading

We use the work done in TP 1 as the basis for our script:

Question

Create a script file named blastx.sh with the following content:

blastx.sh
#!/bin/sh

INPUT="contigs.fasta"

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
blastx -db ensembl_danio_rerio_pep -query $INPUT -out $INPUT.blastx_dr -evalue 10e-10

Edit blastx.sh so that blastx runs with 8 CPUs on the same node.

Check the execution in detail while the job is running.

Solution

The script file:

blastx.sh
#!/bin/sh
#SBATCH --cpus-per-task=8 (1)

INPUT="contigs.fasta"

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
blastx -db ensembl_danio_rerio_pep -query $INPUT -out $INPUT.blastx_dr \
  -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK # (2)!

  1. This defines the number of CPUs reserved by Slurm.

  2. Two things:

    • The \ at the end of the blastx command allows splitting it over several lines for readability.
    • $SLURM_CPUS_PER_TASK holds the value defined by --cpus-per-task (annotation 1).

The script blastx.sh is submitted as a job with the following command:

sbatch blastx.sh
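On submission, sbatch prints the job ID; the output looks something like this (the job number below is purely illustrative):

Submitted batch job 123456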

An inline version, without relying on a script file, is also possible using the --wrap option:

sbatch  --cpus-per-task 8 -J blastx_dr \
--wrap="blastx -num_threads 8 -db ensembl_danio_rerio_pep -query contigs.fasta -evalue 10e-10 -out contigs.blastx_dr2"

The running job can be checked with one of the following commands:

squeue
squeue -u "$(whoami)" # (1)!

  1. $() is a command substitution, executed in a subshell: it means 'run the command and get back its output'. Here whoami returns the username.
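A minimal illustration of command substitution (the variable name is only for the example):

USER_NAME="$(whoami)"   # run whoami and store its output in a variable
squeue -u "$USER_NAME"  # equivalent to squeue -u "$(whoami)"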

How much faster?

Question

When the job has ended, take a look at the resources used. How much time and memory were consumed?

Here is an extract of the seff output for the blastx job on 1 CPU. What is the speedup provided by the blastx job on 8 CPUs? Compare the memory consumption.

Job ID: ...
Cluster: genobioinfo
User/Group: ...
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:05:57
CPU Efficiency: 95.20% of 00:06:15 core-walltime
Job Wall-clock time: 00:06:15
Memory Utilized: 18.14 MB
Memory Efficiency: 0.89% of 2.00 GB

Tip

It is good practice to check the resources a job has consumed.
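For example, once the job has finished (replace <jobid> with your actual job ID):

sacct -u "$(whoami)" --format=JobID,JobName,Elapsed,MaxRSS,State   # list your recent jobs
seff <jobid>                                                       # CPU and memory efficiency summary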

Solution

8x the CPUs doesn't mean 8x faster (~3.6x in this example). For blast, 4 CPUs is a good trade-off (~2.7x in this example).
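As an illustration (the 8-CPU time is inferred from the ~3.6x figure, not measured): speedup = wall-clock time on 1 CPU / wall-clock time on 8 CPUs. With a single-CPU wall-clock time of 00:06:15 (375 s), a ~3.6x speedup corresponds to roughly 375 s / 3.6 ≈ 105 s, i.e. about 1 min 45 s on 8 CPUs.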

Digital sobriety

Genotoul-bioinfo provides some ressources about digital sobriety applied to bioinformatics.

Alternative tools

Question

Some alternative tools can be faster than blast for protein searches. Create a script diamondx.sh where blastx is replaced with diamond.

When the job has ended, look at the resources used. What can you conclude regarding time and memory?

Solution

The script:

diamondx.sh
#!/bin/sh
#SBATCH --cpus-per-task=8

INPUT="contigs.fasta"

module purge
module load bioinfo/DIAMOND/2.1.4
diamond blastx --db /bank/diamonddb/ensembl_danio_rerio_pep.dmnd \
  --query $INPUT --out $INPUT.diamondx_dr --outfmt 0 \
  --evalue 10e-10 --threads $SLURM_CPUS_PER_TASK

Run with:

sbatch diamondx.sh

The speedup provided (x100 to x1000 faster) makes diamond a good tool when targeting digital sobriety.

Tuning the Slurm parameters

Question

Reduce the amount of memory requested for the diamond job. What happens if you reduce it too much?

Tip

Setting resources correctly (number of CPUs, memory, max time) ensures a job doesn't waste resources. As a side effect, your jobs may also start sooner. However, it requires some knowledge to set them beforehand.

We provide a page with some tools to help you. In addition, ask the community for help choosing the right tools and setting efficient parameters.
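For instance, resources can be set explicitly at submission time (the values below are only illustrative, not recommendations for this job):

sbatch --cpus-per-task=4 --mem=1G --time=00:30:00 diamondx.sh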

Solution

The breaking point is under 170 MB.

sbatch --mem=150M diamondx.sh

or edit the script diamondx.sh to keep track of the parameters:

diamondx.sh
#!/bin/sh
#SBATCH --cpus-per-task=8
#SBATCH --mem=150M

INPUT="contigs.fasta"

module purge
module load bioinfo/DIAMOND/2.1.4
diamond blastx --db /bank/diamonddb/ensembl_danio_rerio_pep.dmnd \
  --query $INPUT --out $INPUT.diamondx_dr --outfmt 0 \
  --evalue 10e-10 --threads $SLURM_CPUS_PER_TASK
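If the request is set too low, Slurm typically kills the job when it exceeds its memory limit. One way to check afterwards (replace <jobid> with the failed job's ID; the State column should then report a failure such as OUT_OF_MEMORY):

sacct -j <jobid> --format=JobID,State,ReqMem,MaxRSS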

Even with roughly 10x the memory consumption, diamond is still a good tool for digital sobriety. Some diamond options allow reducing memory consumption by trading off speed, as sketched below.
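A sketch of that trade-off, assuming the --block-size/-b and --index-chunks/-c options of this DIAMOND version (the values below are illustrative):

diamond blastx --db /bank/diamonddb/ensembl_danio_rerio_pep.dmnd \
  --query contigs.fasta --out contigs.fasta.diamondx_dr --outfmt 0 \
  --evalue 10e-10 --threads $SLURM_CPUS_PER_TASK \
  --block-size 0.5 --index-chunks 4   # smaller block size: less memory, more passes over the data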