TP 3: Job array

Objective: Speed up a job by splitting it into independent jobs running on many nodes.

Prepare data

Split data

Question

Split the fasta file into 10 fasta files in a directory called contigs_split.

Tip

The fastasplit program from exonerate can be used for this purpose. Here is an extract from the fastasplit help.

module load bioinfo/Exonerate/2.2.0
fastasplit <path> <dirpath> \
    -f --fasta [mandatory]  <*** not set ***> \
    -o --output [mandatory]  <*** not set ***> \
    -c --chunk [2]

Solution
mkdir contigs_split
module load bioinfo/Exonerate/2.2.0
fastasplit -f contigs.fasta -c 10 -o contigs_split

Check the number of files

Question

Check the number of files obtained in the previous step.

Solution
ls contigs_split/* | wc -l

Check the content of files

Question

Check that the number of sequences in the contigs.fasta file is the same as the total number of sequences in the split files.

Solution
grep ">" contigs_split/* | wc -l
grep -c ">" contigs.fasta
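If you prefer an automated check, here is a minimal sketch (not part of the original solution) that compares the two counts directly:

[ "$(cat contigs_split/* | grep -c '>')" -eq "$(grep -c '>' contigs.fasta)" ] \
  && echo "counts match" || echo "counts differ"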

The usual way

The structure

We slightly modify the script blastx.sh from TP 2 to obtain the script blastx_pe.sh:

blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task 1

INPUT="contigs_split/contigs.fasta_chunk_0000000"
OUTPUT="blastx_split/contigs.fasta_chunk_0000000.blast"

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $OUTPUT \
  -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK # (1)!
  1. For the sake of debugging, we echo the command line until we are satisfied with the result. Then we remove the echo in order to run the real command.

Compute output name from input

Question

Run the following commands in your interactive session:

INPUT="contigs_split/contigs.fasta_chunk_0000000"
echo "blastx_split/$(basename "$INPUT").blast" # (1)!

  1. Some explanations:
    • $() is called a subshell (command substitution): the shell runs the command and substitutes its output in place.
    • basename is a command that extracts the filename (here contigs.fasta_chunk_0000000) from a path (here contigs_split/contigs.fasta_chunk_0000000). It can also remove an extension from the filename if needed, as shown just below.
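For instance, basename accepts an optional suffix argument that is stripped from the result (standard behaviour of the command, illustrated here on the names used in this TP):

basename contigs_split/contigs.fasta_chunk_0000000              # prints: contigs.fasta_chunk_0000000
basename blastx_split/contigs.fasta_chunk_0000000.blast .blast  # prints: contigs.fasta_chunk_0000000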

From what you observe, edit the script blastx_pe.sh in order to compute the OUTPUT variable (line 5) from the INPUT variable (line 4).

Solution
blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task 1

INPUT="contigs_split/contigs.fasta_chunk_0000000"
OUTPUT="blastx_split/$(basename "$INPUT").blast"

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $OUTPUT \
  -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK

Don't forget to create the blastx_split directory (mkdir blastx_split) before running the script.

Get one distinct input per task (SLURM_ARRAY_TASK_ID)

Question

Run the following commands in your interactive session:

SLURM_ARRAY_TASK_ID=1
INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")" # (1)!
echo "I am task $SLURM_ARRAY_TASK_ID using $INPUT"
  1. In awk, NR is the Number of Records: the current line number of the input.
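A quick illustration of NR that you can run anywhere:

printf 'one\ntwo\nthree\n' | awk 'NR==2' # prints: two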

SLURM_ARRAY_TASK_ID=2
INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")"
echo "I am task $SLURM_ARRAY_TASK_ID using $INPUT"
SLURM_ARRAY_TASK_ID=5
INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")"
echo "I am task $SLURM_ARRAY_TASK_ID using $INPUT"

Based on what you observe in the outputs, adapt the blastx_pe.sh script so that each array task applies blastx to a different input file.

Solution
blastx_pe.sh
#!/bin/sh
#SBATCH --cpus-per-task 1

INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")"
OUTPUT="blastx_split/$(basename "$INPUT").blast"

module purge
module load bioinfo/NCBI_Blast+/2.10.0+
echo blastx -db ensembl_danio_rerio_pep -query $INPUT -out $OUTPUT \
  -evalue 10e-10 -num_threads $SLURM_CPUS_PER_TASK

Tip

Did you know that we offer a sed & awk training?

Dry run

Question

Run the script blastx_pe.sh as an array of jobs. Check the resulting log files.

Solution
  • First, get the number of files (there are 10 files):
ls contigs_split/*.fasta_chunk* | wc -l
  • Then, run the array of jobs on all files:
sbatch --array 1-10 blastx_pe.sh
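By default, each task of a Slurm job array writes its log to a file named slurm-<jobID>_<taskID>.out in the submission directory. For this dry run, each log should contain one echoed blastx command line:

cat slurm-*_*.out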

Run the job for real

Question

Remove the echo before the blastx command (line 9) and run the script blastx_pe.sh again as an array of jobs.

While the jobs are running, check their status.

Solution
  • First, get the number of files (there are 10 files):
ls contigs_split/*.fasta_chunk* | wc -l
  • Then, run the array of jobs on all files:
sbatch --array 1-10 blastx_pe.sh
  • Check the running jobs:
sq_long -u "$(whoami)"

Run the job again

Question

Run the job array again on the first 4 split files, limiting it to 2 simultaneously running tasks.

Check the running jobs

Solution
sbatch --array 1-4%2 blastx_pe.sh
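As before, you can check the running jobs and observe the throttling (at most 2 of the 4 tasks running at any time):

sq_long -u "$(whoami)"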

Merge results

Question

Concatenate all blast results obtained from the job array into one file.

Solution
cat blastx_split/*.blast > result.blast

The Genotoul-bioinfo sarray wrapper

We provide a wrapper called sarray that helps you run job arrays.

Give it a file containing one job per line and it will run the jobs as a job array. Here, a job is a list of commands chained on a single line.

We create a script named generate_blastx_array_cmds.sh that will generate such a file.

generate_blastx_array_cmds.sh
#!/bin/sh

NB_CPUS=2

for INPUT in contigs_split/*.fasta_chunk*; do
    OUTPUT="blastx_split/$(basename "$INPUT").blast"
    echo "module purge \
       && module load bioinfo/NCBI_Blast+/2.10.0+ \
       && blastx -db ensembl_danio_rerio_pep -query $INPUT -out $OUTPUT \
          -evalue 10e-10 -num_threads $NB_CPUS"
done

Then, we run the script generate_blastx_array_cmds.sh in an interactive session in order to generate the blastx_array.cmds file:

bash generate_blastx_array_cmds.sh > blastx_array.cmds
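Each line of blastx_array.cmds should look something like this (one single line per chunk, whitespace aside):

module purge && module load bioinfo/NCBI_Blast+/2.10.0+ && blastx -db ensembl_danio_rerio_pep -query contigs_split/contigs.fasta_chunk_0000000 -out blastx_split/contigs.fasta_chunk_0000000.blast -evalue 10e-10 -num_threads 2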

Finally, we run the array of jobs in blastx_array.cmds with the sarray command, with a maximum of 4 tasks in parallel.

sarray -J blastx --cpus-per-task 2 --%=4 blastx_array.cmds

where the options are the same as the Slurm options, with the exception of --%:

  • -J is the job name
  • --cpus-per-task is the number of CPUs reserved by each task
  • --% is the maximum number of tasks running in parallel