TP 3: Job array

Objective: Speed up a job by splitting it into independent jobs running on many nodes.

Prepare data

Split data

Question

Split the fasta file into 10 fasta files in a directory called contigs_split.

Tip

The fastasplit program from exonerate can be used for this purpose. Here is an extract from the fastasplit help.

module load bioinfo/Exonerate/2.2.0
fastasplit <path> <dirpath> \
    -f --fasta [mandatory]  <*** not set ***> \
    -o --output [mandatory]  <*** not set ***> \
    -c --chunk [2]

Solution
mkdir contigs_split
module load bioinfo/Exonerate/2.2.0
fastasplit -f contigs.fasta -c 10 -o contigs_split

Check the number of files

Question

Check the number of files obtained in the previous step.

Solution
ls contigs_split/* | wc -l

Check the content of files

Question

Check that the number of sequences in the contigs.fasta file is the same as the total number of sequences in the split files.

Solution
grep ">" contigs_split/* | wc -l
grep -c ">" contigs.fasta
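If you prefer an automated check, here is a minimal sketch (not part of the original solution) that compares the two counts directly:

[ "$(cat contigs_split/* | grep -c '>')" -eq "$(grep -c '>' contigs.fasta)" ] \
  && echo "counts match" || echo "counts differ"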

The usual way

The structure

We slightly modify the script blastx.sh from TP 2 to obtain the script blastx_pe.sh:

blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task 1

INPUT="contigs_split/contigs.fasta_chunk_0000000"
OUTPUT="blastx_split/contigs.fasta_chunk_0000000.blast"

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $OUTPUT \
  -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK # (1)!
  1. For the sake of debugging, we echo the command line until we are satisfied with the result. Then we remove the echo in order to run the real command.

Compute output name from input

Question

Run the following commands in your interactive session:

INPUT="contigs_split/contigs.fasta_chunk_0000000"
echo "blastx_split/$(basename "$INPUT").blast" # (1)!

  1. Some explanations:
    • $() is called a subshell (command substitution): the shell runs the command and substitutes its output in place.
    • basename is a command that extracts the filename (here contigs.fasta_chunk_0000000) from a path (here contigs_split/contigs.fasta_chunk_0000000). It can also remove an extension from the filename if needed, as shown just below.
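For instance, basename accepts an optional suffix argument that is stripped from the result (standard behaviour of the command, illustrated here on the names used in this TP):

basename contigs_split/contigs.fasta_chunk_0000000              # prints: contigs.fasta_chunk_0000000
basename blastx_split/contigs.fasta_chunk_0000000.blast .blast  # prints: contigs.fasta_chunk_0000000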

From what you observe, edit the script blastx_pe.sh in order to compute the OUTPUT variable (line 5) from the INPUT variable (line 4).

Solution
blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task 1

INPUT="contigs_split/contigs.fasta_chunk_0000000"
OUTPUT="blastx_split/$(basename "$INPUT").blast"

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $OUTPUT \
  -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK

Don't forget to create the blastx_split directory (mkdir blastx_split) before running the script.

Get one distinct input per task (SLURM_ARRAY_TASK_ID)

Question

Run the following commands in your interactive session:

SLURM_ARRAY_TASK_ID=1
INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")" # (1)!
echo "I am task $SLURM_ARRAY_TASK_ID using $INPUT"
  1. In awk, NR is the Number of Records: the current line number of the input.
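A quick illustration of NR that you can run anywhere:

printf 'one\ntwo\nthree\n' | awk 'NR==2' # prints: two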

SLURM_ARRAY_TASK_ID=2
INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")"
echo "I am task $SLURM_ARRAY_TASK_ID using $INPUT"
SLURM_ARRAY_TASK_ID=5
INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")"
echo "I am task $SLURM_ARRAY_TASK_ID using $INPUT"

Based on what you observe in the outputs, adapt the blastx_pe.sh script so that each array task applies blastx to a different input file.

Solution
blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task 1

INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")"
OUTPUT="blastx_split/$(basename "$INPUT").blast"

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $OUTPUT \
  -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK

Tip

Did you know that we offer a sed & awk training?

Dry run

Question

Run the script blastx_pe.sh as an array of jobs. Check the resulting log files.

Solution
  • First, get the number of files (there are 10 files):
ls contigs_split/*.fasta_chunk* | wc -l
  • Then, run the array of jobs on all files:
sbatch --array 1-10 blastx_pe.sh
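By default, each task of a Slurm job array writes its log to a file named slurm-<jobID>_<taskID>.out in the submission directory. For this dry run, each log should contain one echoed blastx command line:

cat slurm-*_*.out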

Run the job for real

Question

Remove the echo before the blastx command (line 9) and run the script blastx_pe.sh again as an array of jobs.

While the jobs are running, check their status.

Solution
  • First, get the number of files (there are 10 files):
ls contigs_split/*.fasta_chunk* | wc -l
  • Then, run the array of jobs on all files:
sbatch --array 1-10 blastx_pe.sh
  • Check the running jobs:
sq_long -u "$(whoami)"

Run the job again

Question

Run the job array again on the first 4 split files, limiting it to 2 simultaneously running tasks.

Check the running jobs

Solution
sbatch --array 1-4%2 blastx_pe.sh
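As before, you can check the running jobs and observe the throttling (at most 2 of the 4 tasks running at any time):

sq_long -u "$(whoami)"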

Merge results

Question

Concatenate all blast results obtained from the job array into one file.

Solution
cat blastx_split/*.blast > result.blast

The Genotoul-bioinfo sarray wrapper

We provide a wrapper called sarray that helps you run job arrays.

Give it a file containing one job per line and it will run the jobs as a job array. Here, a job is a list of commands chained on a single line.

We create a script named generate_blastx_array_cmds.sh that will generate such a file.

generate_blastx_array_cmds.sh
#!/bin/sh

NB_CPUS=2

for INPUT in contigs_split/*.fasta_chunk*; do
    OUTPUT="blastx_split/$(basename "$INPUT").blast"
    echo "module purge \
       && module load bioinfo/NCBI_Blast+/2.10.0+ \
       && blastx -db ensembl_danio_rerio_pep -query $INPUT -out $OUTPUT \
          -evalue 10e-10 -num_threads $NB_CPUS"
done

Then, we run the script generate_blastx_array_cmds.sh in an interactive session in order to generate the blastx_array.cmds file:

bash generate_blastx_array_cmds.sh > blastx_array.cmds
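Each line of blastx_array.cmds should look something like this (one single line per chunk, whitespace aside):

module purge && module load bioinfo/NCBI_Blast+/2.10.0+ && blastx -db ensembl_danio_rerio_pep -query contigs_split/contigs.fasta_chunk_0000000 -out blastx_split/contigs.fasta_chunk_0000000.blast -evalue 10e-10 -num_threads 2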

Finally, we run the array of jobs in blastx_array.cmds with the sarray command, with a maximum of 4 tasks in parallel.

sarray -J blastx --cpus-per-task 2 --%=4 blastx_array.cmds

where the options are the same as the Slurm options, with the exception of --%:

  • -J is the job name
  • --cpus-per-task is the number of CPUs reserved by each task
  • --% is the maximum number of tasks running in parallel