TP 3: Job array
Objective: Speed up a job by splitting it into independent jobs running on several nodes.
Prepare data¶
Split data¶
Question
Split the fasta file into 10 fasta files in a directory called contigs_split.
Tip
The fastasplit program from exonerate can be used for this purpose. Here is an extract from the fastasplit help:
module load bioinfo/Exonerate/2.2.0
fastasplit <path> <dirpath> \
-f --fasta [mandatory] <*** not set ***> \
-o --output [mandatory] <*** not set ***> \
-c --chunk [2]
Solution
mkdir contigs_split
module load bioinfo/Exonerate/2.2.0
fastasplit -f contigs.fasta -c 10 -o contigs_split
Check the number of files¶
Question
Check the number of files obtained in the previous step.
Solution
ls contigs_split/* | wc -l
Check the content of files¶
Question
Check that the number of sequences in the contigs.fasta file is the same as the total number of sequences across the split files.
Solution
grep ">" contigs_split/* | wc -l
grep -c ">" contigs.fasta
The usual way¶
The structure¶
We slightly modify the script blastx.sh from TP 2 into the script blastx_pe.sh:
[blastx_pe.sh: 10-line script listing]
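The listing itself is not reproduced above. As a hedged sketch only (the module name, the swissprot database name and the blastx options are assumptions carried over from a typical TP 2 setup, not confirmed by this page), the starting script could look like this, with INPUT on line 4, OUTPUT on line 5 and the echoed blastx command on line 9:

```shell
#!/bin/bash
#SBATCH -J blastx

INPUT="contigs_split/contigs.fasta_chunk_0000000"  # line 4: input, hardcoded for now
OUTPUT="blastx_split/chunk_0000000.blast"          # line 5: output, hardcoded for now

module load bioinfo/NCBI_Blast+/2.10.0             # assumed module name

echo blastx -query "$INPUT" -db swissprot -out "$OUTPUT"  # line 9: echo kept for debugging
```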
- For the sake of debugging, we echo the command line until we are satisfied with the result. Then we remove the echo in order to run the real command.
Compute output name from input¶
Question
Run the following command in your interactive session:
INPUT="contigs_split/contigs.fasta_chunk_0000000"
echo "blastx_split/$(basename "$INPUT").blast" # (1)!
- Some explanations:
  - $() is called a subshell. It means: run the command and substitute its output.
  - basename is a command that extracts the filename (here contigs.fasta_chunk_0000000) from a path (here contigs_split/contigs.fasta_chunk_0000000). It can also remove the extension part from the filename if needed.
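As a quick check in your session (the second call, with a .fasta suffix to strip, is a hypothetical illustration, not a file from this TP):

```shell
# Extract the filename from a path:
basename contigs_split/contigs.fasta_chunk_0000000   # prints: contigs.fasta_chunk_0000000
# A second argument removes that suffix from the result:
basename results/sample.fasta .fasta                 # prints: sample
```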
From what you observe, edit the script blastx_pe.sh in order to compute the OUTPUT variable (line 5) from the INPUT variable (line 4).
Solution
[blastx_pe.sh: 10-line script listing]
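The solution listing is not reproduced above. As a hedged sketch (module and database names are assumptions), the relevant change is line 5, which now derives OUTPUT from INPUT:

```shell
#!/bin/bash
#SBATCH -J blastx

INPUT="contigs_split/contigs.fasta_chunk_0000000"  # line 4: still hardcoded
OUTPUT="blastx_split/$(basename "$INPUT").blast"   # line 5: computed from INPUT

module load bioinfo/NCBI_Blast+/2.10.0             # assumed module name

echo blastx -query "$INPUT" -db swissprot -out "$OUTPUT"  # line 9: still echoed
```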
Don't forget to create the blastx_split directory before running the script:
mkdir blastx_split
Get one distinct input per task (SLURM_ARRAY_TASK_ID)¶
Question
Run in your interactive session the following commands:
SLURM_ARRAY_TASK_ID=1
INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")" # (1)!
echo "I am task $SLURM_ARRAY_TASK_ID using $INPUT"
- NR means 'Number of Records' (the current input line number) in awk.
SLURM_ARRAY_TASK_ID=2
INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")"
echo "I am task $SLURM_ARRAY_TASK_ID using $INPUT"
SLURM_ARRAY_TASK_ID=5
INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")"
echo "I am task $SLURM_ARRAY_TASK_ID using $INPUT"
Based on what you observe in the outputs, adapt the blastx_pe.sh script so that each task runs blastx on its own input file.
Solution
[blastx_pe.sh: 10-line script listing]
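Again, the listing is not reproduced above. A hedged sketch of the adapted script (module and database names are assumptions), where line 4 now picks the task's own input with SLURM_ARRAY_TASK_ID:

```shell
#!/bin/bash
#SBATCH -J blastx

INPUT="$(ls contigs_split/*.fasta_chunk* | awk "NR==$SLURM_ARRAY_TASK_ID")"  # line 4: one file per task
OUTPUT="blastx_split/$(basename "$INPUT").blast"                             # line 5: computed from INPUT

module load bioinfo/NCBI_Blast+/2.10.0  # assumed module name

echo blastx -query "$INPUT" -db swissprot -out "$OUTPUT"  # line 9: remove the echo for the real run
```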
Tip
Did you know that we offer a sed & awk training?
Dry run¶
Question
Run the script blastx_pe.sh
as an array of jobs. Check the resulting log files.
Solution
- First, get the number of files (there are 10 files):
ls contigs_split/*.fasta_chunk* | wc -l
- Then, run the array of jobs on all files:
sbatch --array 1-10 blastx_pe.sh
Run the job for real¶
Question
Remove the echo before the blastx command (line 9) and run the script blastx_pe.sh again as an array of jobs.
When running, check the jobs status.
Solution
- First, get the number of files (there are 10 files):
ls contigs_split/*.fasta_chunk* | wc -l
- Then, run the array of jobs on all files:
sbatch --array 1-10 blastx_pe.sh
- Check the running jobs:
sq_long -u "$(whoami)"
Run the job again¶
Question
Run the job array again on the first 4 split files, limiting it to 2 simultaneously running tasks.
Check the running jobs.
Solution
sbatch --array 1-4%2 blastx_pe.sh
Merge results¶
Question
Concatenate all blast results obtained from the job array into one file.
Solution
cat blastx_split/*.blast > result.blast
The Genotoul-bioinfo sarray wrapper¶
We provide a wrapper called sarray that helps you run job arrays. Given a file containing one job per line, it runs the lines as a job array: each line is the command (or list of commands) for one job.
We create a script named generate_blastx_array_cmds.sh that will generate such a file.
[generate_blastx_array_cmds.sh: 11-line script listing]
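The 11-line listing is not reproduced above. A hedged sketch of what such a generator could look like (the module name, swissprot database and blastx options are assumptions): it simply prints one blastx command per chunk file, one per line.

```shell
#!/bin/bash
# For each split file, print the corresponding command on its own line;
# sarray will run each line as one task of the job array.
for INPUT in contigs_split/*.fasta_chunk*; do
    OUTPUT="blastx_split/$(basename "$INPUT").blast"
    # assumed module name and database; each task loads its own environment
    echo "module load bioinfo/NCBI_Blast+/2.10.0; blastx -query $INPUT -db swissprot -out $OUTPUT"
done
```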
Then, we run the script generate_blastx_array_cmds.sh in an interactive session in order to generate blastx_array.cmds:
bash generate_blastx_array_cmds.sh > blastx_array.cmds
Finally, we run the array of jobs in blastx_array.cmds with the sarray command, with a maximum of 4 tasks in parallel.
sarray -J blastx --cpus-per-task 2 --%=4 blastx_array.cmds
where the options are the same as Slurm's, with the exception of --%:
- -J is the job name
- --cpus-per-task is the number of CPUs reserved by each task
- --% is the maximum number of tasks running in parallel