Nextflow
Nextflow is a workflow manager, widely used in bioinformatics, that runs tasks across multiple compute infrastructures in a portable manner. The community develops and maintains workflows for several kinds of high-throughput data in the nf-core repository (https://github.com/nf-core).
The rnaseq workflow is available to all genotoul cluster users. “The workflow processes raw data from FastQ inputs (FastQC, Trim Galore!), aligns the reads (STAR or HiSAT2), generates gene counts (featureCounts, StringTie) and performs extensive quality-control on the results (RSeQC, dupRadar, Preseq, edgeR, MultiQC). See the output documentation for more details of the results.”
An advanced course on how to use Nextflow on genotoul is available at https://genotoul-bioinfo.pages.mia.inra.fr/use-nextflow-nfcore-course.
Here we are going to launch the RNAseq workflow on example data and explore the results.
Set up the genotoul environment
On genologin, run the following script once. It creates two directories, ~/work/.nextflow and ~/work/.singularity, and creates symbolic links in your home directory:
sh /usr/local/bioinfo/src/NextflowWorkflows/create_nfx_dirs.sh
Explore nextflow
Load the module bioinfo/nfcore-Nextflow-v21.10.6. This module loads nextflow, nf-core and singularity.
Execute nextflow without any option:
nextflow
Get help on the run command: execute nextflow run with option -help
nextflow run -help
Get the list of nf-core workflows available locally with:
nf-core list
Get help on the rnaseq workflow: execute nextflow run with the workflow nf-core/rnaseq and option --help (add option -r to select a pipeline revision if necessary)
nextflow run nf-core/rnaseq --help
Create a samplesheet called samples.csv with the following lines:
sample,fastq_1,fastq_2,strandedness
MT,MT_rep1_1_Ch6.fastq.gz,MT_rep1_2_Ch6.fastq.gz,unstranded
WT,WT_rep1_1_Ch6.fastq.gz,WT_rep1_2_Ch6.fastq.gz,unstranded
If your samples are called *R1.fastq.gz and *R2.fastq.gz, you can use the following loop to generate the samplesheet (header line included):
{ echo "sample,fastq_1,fastq_2,strandedness"; for i in *R1.fastq.gz; do ech=${i%R1.fastq.gz}; r2=${ech}R2.fastq.gz; echo "$ech,$i,$r2,reverse"; done; } > samples.csv
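Before launching the pipeline, it can help to check that the samplesheet is consistent with the files on disk. The sketch below is one possible way to do that in shell. The function name check_samplesheet is made up for this example and is not part of Nextflow or nf-core. It assumes the four-column header shown above, accepts an empty fastq_2 column, and looks for the FASTQ files relative to the current directory.

```shell
# Hypothetical helper (not part of Nextflow or nf-core): basic sanity check
# of a samplesheet with columns sample,fastq_1,fastq_2,strandedness.
check_samplesheet() {
    local sheet="$1"
    local status=0
    local n=0
    local sample fq1 fq2 strand extra f
    # "|| [ -n "$sample" ]" also processes a last line without trailing newline.
    while IFS=, read -r sample fq1 fq2 strand extra || [ -n "$sample" ]; do
        n=$((n + 1))
        # The first line is expected to be the header.
        if [ "$n" -eq 1 ]; then
            if [ "$sample,$fq1,$fq2,$strand" != "sample,fastq_1,fastq_2,strandedness" ]; then
                echo "line 1: unexpected header" >&2
                status=1
            fi
            continue
        fi
        # Data rows: sample, fastq_1 and strandedness are required, no 5th field.
        if [ -z "$sample" ] || [ -z "$fq1" ] || [ -z "$strand" ] || [ -n "$extra" ]; then
            echo "line $n: expected 4 comma-separated fields" >&2
            status=1
            continue
        fi
        # Every FASTQ file that is listed must exist.
        for f in "$fq1" "$fq2"; do
            if [ -n "$f" ] && [ ! -e "$f" ]; then
                echo "line $n: file not found: $f" >&2
                status=1
            fi
        done
    done < "$sheet"
    return "$status"
}

# Possible usage, from the directory that contains the FASTQ files:
#   check_samplesheet samples.csv && echo "samplesheet looks consistent"
```

The function returns a non-zero status and prints one message per problem on standard error, so it can be chained with && in front of the nextflow command.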
- Run the pipeline with the following data and parameters:
nextflow run nf-core/rnaseq \
--fasta star-index/ITAG2.3_genomic_Ch6.fasta \
--gtf star-index/ITAG_pre2.3_gene_models_Ch6.gtf \
--input samples.csv \
--outdir results \
--aligner star_rsem \
--skip_bbsplit \
--skip_trimming \
--skip_markduplicates \
-profile genotoul \
-r 3.6
Get a summary of the executions:
For all executions in the current directory:
nextflow log
For the last execution in the current directory:
nextflow log $(nextflow log | cut -f 3 | tail -n 1) -f hash,name,exit,status
Get the commands of all processes:
nextflow log $(nextflow log | cut -f 3 | tail -n 1) -f hash,name,status,script
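The two commands above repeat the same cut and tail pipeline to recover the name of the last run. As a minimal sketch, that step can be wrapped in a small function. The name last_run_name is made up for this example, and the sketch assumes, like the commands above, that nextflow log prints a tab-separated table with the run name in the third column.

```shell
# Hypothetical helper: read the table printed by `nextflow log` on stdin and
# print the run name (third tab-separated column) of the last line,
# with the padding spaces around it removed.
last_run_name() {
    awk -F '\t' '
        NF >= 3 { name = $3 }
        END { gsub(/^ +| +$/, "", name); print name }
    '
}

# Possible usage on the cluster:
#   run=$(nextflow log | last_run_name)
#   nextflow log "$run" -f hash,name,exit,status
#   nextflow log "$run" -f hash,name,status,script
```

Storing the run name in a variable avoids querying the history several times and makes it easy to inspect an older run by setting the variable by hand.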
Results
The pipeline will create the following files in your working directory:
work # Directory containing the nextflow working files
results # Finished results (configurable, see below)
.nextflow.log # Log file from Nextflow
Other Nextflow hidden files, e.g. history of pipeline runs and old logs.
- Copy the directories results/MultiQC and results/pipeline_info into ~/public_html
- Explore the HTML files of MultiQC at genoweb.toulouse.inra.fr/~USERNAME
- Explore pipeline_info/execution_report*.html: which process takes the longest?
- On the command line, explore the results directory and find the BAM files, the quantification files and the new transcript files.
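For the last exercise, find is usually enough to locate files by name. The sketch below wraps it in a small function. The name list_outputs is made up for this example, and the suffixes in the usage comments are assumptions: .bam for alignments and .genes.results, the usual name of RSEM gene-level quantification files. Check them against what is actually present in your results directory.

```shell
# Hypothetical helper: list the files below a directory whose names end
# with a given suffix, sorted for stable output.
list_outputs() {
    local dir="$1"
    local suffix="$2"
    find "$dir" -type f -name "*${suffix}" | sort
}

# Possible usage (suffixes are assumptions, adapt them to your results):
#   list_outputs results .bam
#   list_outputs results .genes.results
```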
Relaunch an aborted workflow
Warning: if you relaunch the nextflow command line in the same directory without the option -resume, the entire workflow will be run again.
nextflow run nf-core/rnaseq -resume \
-profile genotoul \
--fasta star-index/ITAG2.3_genomic_Ch6.fasta \
--gtf star-index/ITAG_pre2.3_gene_models_Ch6.gtf \
--input samples.csv \
--outdir results \
--aligner star_rsem \
--skip_bbsplit \
--skip_trimming \
--skip_markduplicates \
-r 3.6
Use an already indexed genome
List all files available in results/genome
Here is the format of a [configuration for a genome](https://nf-co.re/docs/usage/reference_genomes):
params {
genomes {
'YOUR-ID' {
fasta = '/path/to/data/genome.fa'
}
'OTHER-GENOME' {
// [..]
}
}
// Optional - default genome. Ignored if --genome 'OTHER-GENOME' specified on command line
genome = 'YOUR-ID'
}
Here is an example of a full genome configuration:
genomes {
GRCh37 {
fasta = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa'
bwa = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa'
bowtie2 = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/'
star = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/'
bismark = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/'
gtf = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf'
bed12 = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed'
readme = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt'
mito_name = 'MT'
macs_gsize = '2.7e9'
blacklist = '/home/cnoirot/.nextflow/assets/nf-core/rnaseq/assets/blacklists/GRCh37-blacklist.bed'
}
}
- Add the appropriate configuration for the tomato genome to a file called nextflow.config in the current directory:
params {
genomes {
'Itag' {
fasta = './star-index/ITAG2.3_genomic_Ch6.fasta'
fasta_index = './results/genome/ITAG2.3_genomic_Ch6.fasta.fai'
... TODO: ADD expected files ... LOOK at files in `results/genome`
}
}
... NOTE Here you can parametrize pipeline.
}
NB: in version 3.6, when using --aligner star_rsem, both the STAR and RSEM indices should be present in the path specified by --rsem_index (see issue #568)
%accordion%Solution
params {
genomes {
'Itag' {
fasta = './star-index/ITAG2.3_genomic_Ch6.fasta'
fasta_index = './results/genome/ITAG2.3_genomic_Ch6.fasta.fai'
gtf = './star-index/ITAG_pre2.3_gene_models_Ch6.gtf'
bed12 = './results/genome/ITAG_pre2.3_gene_models_Ch6.bed'
}
}
rsem = './results/genome/index/rsem/'
aligner = 'star_rsem'
skip_bbsplit = true
skip_trimming = true
skip_markduplicates = true
}
%/accordion%
Launch
nextflow config nf-core/rnaseq -profile genotoul
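The output of this command can be long. As a minimal sketch, the printed configuration can be searched for a genome identifier instead of being read in full. The helper name config_has_genome is made up for this example. It assumes that the genome appears in the output as a block opening such as Itag { or 'Itag' {, and that the identifier contains no regular-expression metacharacters.

```shell
# Hypothetical helper: succeed if the configuration text read on stdin
# contains a block opening for the given identifier, quoted or not,
# for example:   Itag {   or   'Itag' {
config_has_genome() {
    local id="$1"
    grep -Eq "^[[:space:]]*['\"]?${id}['\"]?[[:space:]]*\{"
}

# Possible usage on the cluster:
#   nextflow config nf-core/rnaseq -profile genotoul | config_has_genome Itag \
#       && echo "Itag genome found in the configuration"
```

The check only tells whether a block with that name exists. It does not verify that the paths declared inside the block point to existing files.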
Is the configuration of the Itag genome loaded?
Relaunch the rnaseq pipeline using the options
--genome Itag --igenomes_base ~/work/tp_rnaseq/results/genome
instead of --fasta and --gtf.
Are the indexes used?
Kill the process once you have seen the answer, to avoid wasting CPU and carbon!
%accordion%Solution
nextflow run nf-core/rnaseq -profile genotoul \
--genome Itag --igenomes_base ~/work/tp_rnaseq/results/genome \
-r 3.6 --input samples.csv --outdir ResultTestConfig
%/accordion%