Nextflow

Nextflow is a workflow manager, widely used in bioinformatics, that runs tasks across multiple compute infrastructures in a portable manner. The nf-core community develops and maintains workflows for several kinds of high-throughput data in the nf-core repository (https://github.com/nf-core).

The rnaseq workflow is available to all genotoul cluster users. “The workflow processes raw data from FastQ inputs (FastQC, Trim Galore!), aligns the reads (STAR or HiSAT2), generates gene counts (featureCounts, StringTie) and performs extensive quality-control on the results (RSeQC, dupRadar, Preseq, edgeR, MultiQC). See the output documentation for more details of the results.”

An advanced course on how to use nextflow on genotoul is available at https://genotoul-bioinfo.pages.mia.inra.fr/use-nextflow-nfcore-course.

Slides

Here we are going to launch the RNAseq workflow on example data and explore the results.

Set up the genotoul environment

On genobioinfo, run the following script once. It creates two directories, ~/work/.nextflow and ~/work/.singularity, and symbolic links to them in your home directory.

sh /usr/local/bioinfo/src/NextflowWorkflows/create_nfx_dirs.sh
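
To make the effect of that script concrete, here is a minimal sketch of an equivalent setup. It assumes the script does no more than what is described above (the real create_nfx_dirs.sh may differ). The function name `setup_nfx_dirs` is hypothetical, and both paths are parameters so the sketch can be tried in a sandbox rather than in your real home directory.

```shell
# Hypothetical equivalent of the setup step: create the cache directories
# under a work area and link them from a home directory.
setup_nfx_dirs() {
    local work_dir=$1 home_dir=$2 name
    for name in .nextflow .singularity; do
        mkdir -p "$work_dir/$name"
        # -s: symbolic link, -f: replace an existing link, -n: do not follow it
        ln -sfn "$work_dir/$name" "$home_dir/$name"
    done
}

# Sandbox try-out; on the cluster the arguments would be ~/work and ~
sandbox=$(mktemp -d)
mkdir -p "$sandbox/work" "$sandbox/home"
setup_nfx_dirs "$sandbox/work" "$sandbox/home"
ls -la "$sandbox/home"
```

The point of this layout is that the bulky Nextflow and Singularity caches live in the work area while the tools still find them at their default location in the home directory.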

Explore nextflow

Load the module bioinfo/nfcore-Nextflow-v21.10.6. This module loads nextflow, nf-core and singularity.
Execute nextflow without any option:
```
nextflow
```
Get help on nextflow: execute the nextflow run command with option -help
```
nextflow run -help
```
Get the list of nf-core workflows available locally with:
```
nf-core list
```
Get help on the rnaseq workflow: execute the nextflow run command with workflow nf-core/rnaseq and option --help
```
nextflow run nf-core/rnaseq --help   # add option -r if necessary
```

Create a samplesheet named samples.csv with the following lines:

sample,fastq_1,fastq_2,strandedness
MT,MT_rep1_1_Ch6.fastq.gz,MT_rep1_2_Ch6.fastq.gz,unstranded
WT,WT_rep1_1_Ch6.fastq.gz,WT_rep1_2_Ch6.fastq.gz,unstranded
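
Before launching a run, it can help to check that the samplesheet has the expected header and that every FASTQ file it lists exists. The following is a minimal bash sketch, not part of nf-core. The function name `check_samplesheet` is hypothetical, and paths are resolved relative to the current directory.

```shell
# Return 0 if the header matches and all listed FASTQ files exist, 1 otherwise.
# Requires bash (process substitution).
check_samplesheet() {
    local sheet=$1 header sample fq1 fq2 strand f status=0
    IFS= read -r header < "$sheet"
    if [ "$header" != "sample,fastq_1,fastq_2,strandedness" ]; then
        echo "unexpected header: $header" >&2
        status=1
    fi
    # Skip the header line, then test each FASTQ path (fastq_2 may be empty)
    while IFS=, read -r sample fq1 fq2 strand || [ -n "$sample" ]; do
        for f in "$fq1" "$fq2"; do
            [ -z "$f" ] || [ -e "$f" ] || { echo "$sample: missing $f" >&2; status=1; }
        done
    done < <(tail -n +2 "$sheet")
    return "$status"
}
```

Typical use would be `check_samplesheet samples.csv && echo "samplesheet looks fine"`.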

If your sample files are called *R1.fastq.gz and *R2.fastq.gz, you can use the following loop to generate the content of the samplesheet (the header line shown above still has to be added as the first line):

for i in *R1.fastq.gz; do ech=${i%*R1.fastq.gz} ; r2=${i/R1/R2}; echo "$ech,$i,$r2,reverse" ; done > samples.csv
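
The one-liner above keeps any separator that precedes R1 in the sample name (for example `MT_`) and does not write the header. A slightly more defensive variant could look like the sketch below. The function name `make_samplesheet` is hypothetical, and the sketch assumes pairs named `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz`.

```shell
# Print a samplesheet (header included) for the read pairs found in a directory.
# Usage: make_samplesheet DIR [STRANDEDNESS] > samples.csv
make_samplesheet() {
    local dir=$1 strandedness=${2:-reverse} r1 r2 sample
    echo "sample,fastq_1,fastq_2,strandedness"
    for r1 in "$dir"/*R1.fastq.gz; do
        [ -e "$r1" ] || continue              # no match: the glob stays literal
        r2=${r1%R1.fastq.gz}R2.fastq.gz       # swap only the trailing R1
        sample=$(basename "${r1%R1.fastq.gz}")
        sample=${sample%[_.-]}                # drop one trailing separator
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            continue
        fi
        echo "$sample,$r1,$r2,$strandedness"
    done
}
```

For the current directory this would be called as `make_samplesheet . reverse > samples.csv`.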

Run the pipeline with the following data and parameters:

nextflow run nf-core/rnaseq \
  --fasta star-index/ITAG2.3_genomic_Ch6.fasta \
  --gtf star-index/ITAG_pre2.3_gene_models_Ch6.gtf \
  --input samples.csv \
  --outdir results \
  --aligner star_rsem \
  --skip_bbsplit \
  --skip_trimming \
  --skip_markduplicates \
  -profile genotoul \
  -r 3.6

Get a summary of the executions:

For all executions in the current directory:

nextflow log

For the last execution in the current directory:

nextflow log $(nextflow log | cut -f 3 | tail -n 1) -f hash,name,exit,status

Get the commands of all processes:

nextflow log $(nextflow log | cut -f 3 | tail -n 1) -f hash,name,status,script
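
The per-task listing can get long, so a small summary of task statuses is sometimes handy. The sketch below assumes that `nextflow log ... -f name,status` prints one task per line with tab-separated fields. The helper name `count_task_status` is hypothetical.

```shell
# Count tasks per status from "name<TAB>status" lines read on stdin
count_task_status() {
    awk -F '\t' '{ n[$2]++ } END { for (s in n) print s "\t" n[s] }' | sort
}

# On the cluster, with the run name obtained as in the commands above:
#   nextflow log $(nextflow log | cut -f 3 | tail -n 1) -f name,status | count_task_status
```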

Results

The pipeline will create the following files in your working directory:

work            # Directory containing the nextflow working files
results         # Finished results (location set with --outdir)
.nextflow.log   # Log file from Nextflow
# Other nextflow hidden files, e.g. history of pipeline runs and old logs.

Copy the directories results/MultiQC and results/pipeline_info into ~/public_html.
Explore the MultiQC html files at web-genobioinfo.toulouse.inrae.fr/~USERNAME.
Explore pipeline_info/execution_report*.html: which process takes the longest?
On the command line, explore the results directory and find the bam files, the quantification files and the new transcript files.
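
For the last step, `find` is usually enough to locate files by name. The sketch below wraps it in a small helper. The name `find_results` is hypothetical, and the file name patterns in the comments are assumptions about how the tools name their outputs, to be adapted to what you actually see in the results directory.

```shell
# List the files under a directory whose names match a shell pattern
find_results() {
    local dir=$1 pattern=$2
    find "$dir" -type f -name "$pattern" | sort
}

# Examples (patterns are assumptions, adapt as needed):
#   find_results results '*.bam'             # alignments
#   find_results results '*.genes.results'   # RSEM gene-level quantification
```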

Relaunch an aborted workflow

Warning: if you relaunch the nextflow command line in the same directory without the -resume option, the entire workflow is run again from scratch.

nextflow run nf-core/rnaseq -resume \
  -profile genotoul \
  --fasta star-index/ITAG2.3_genomic_Ch6.fasta \
  --gtf star-index/ITAG_pre2.3_gene_models_Ch6.gtf \
  --input samples.csv \
  --outdir results \
  --aligner star_rsem \
  --skip_bbsplit \
  --skip_trimming \
  --skip_markduplicates \
  -r 3.6

Use a genome already indexed

List all files available in results/genome

Here is the format of a [configuration for a genome](https://nf-co.re/docs/usage/reference_genomes):

params {
  genomes {
    'YOUR-ID' {
      fasta = '/path/to/data/genome.fa'
    }
    'OTHER-GENOME' {
      // [..]
    }
  }
  // Optional - default genome. Ignored if --genome 'OTHER-GENOME' specified on command line
  genome = 'YOUR-ID'
}

Here is an example of a full genome configuration:

genomes {
      GRCh37 {
         fasta = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa'
         bwa = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa'
         bowtie2 = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/'
         star = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/'
         bismark = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/'
         gtf = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf'
         bed12 = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed'
         readme = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt'
         mito_name = 'MT'
         macs_gsize = '2.7e9'
         blacklist = '/home/cnoirot/.nextflow/assets/nf-core/rnaseq/assets/blacklists/GRCh37-blacklist.bed'
      }
 }

Add the appropriate configuration for the tomato genome to a file called nextflow.config in the current directory:

params {
 genomes {
    'Itag' {
       fasta = './star-index/ITAG2.3_genomic_Ch6.fasta'
       fasta_index = './results/genome/ITAG2.3_genomic_Ch6.fasta.fai'
       ... TODO: ADD expected files ... LOOK at files in `results/genome`
    }
 }
 ... NOTE Here you can parametrize pipeline.
}

NB: in version 3.6, when using --aligner star_rsem, both the STAR and RSEM indices should be present in the path specified by --rsem_index (see issue #568)

%accordion%Solution

   params {
      genomes {
         'Itag' {
            fasta = './star-index/ITAG2.3_genomic_Ch6.fasta'
            fasta_index = './results/genome/ITAG2.3_genomic_Ch6.fasta.fai'
            gtf = './star-index/ITAG_pre2.3_gene_models_Ch6.gtf'
            bed12 = './results/genome/ITAG_pre2.3_gene_models_Ch6.bed'
         }
      }
      rsem_index = './results/genome/index/rsem/'
      aligner = 'star_rsem'
      skip_bbsplit = true
      skip_trimming = true
      skip_markduplicates = true
   }

%/accordion%
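
A typo in one of the paths is easy to miss, so it can be worth checking that every relative path written in nextflow.config exists before relaunching. The sketch below is a minimal helper for that. The name `missing_config_paths` is hypothetical, and it only looks at single-quoted paths starting with `./`, resolved from the current directory.

```shell
# Print each single-quoted './...' path found in a config file that does not exist
missing_config_paths() {
    local config=$1 path
    grep -o "'\./[^']*'" "$config" | tr -d "'" | while IFS= read -r path; do
        [ -e "$path" ] || echo "$path"
    done
}

# Usage, from the directory that contains nextflow.config:
#   missing_config_paths nextflow.config
```

An empty output means that all referenced files and directories were found.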

Launch `nextflow config nf-core/rnaseq -profile genotoul`. Is the configuration of the Itag genome loaded?
Relaunch the rnaseq pipeline using options --genome Itag --igenomes_base ~/work/tp_rnaseq/results/genome instead of --fasta and --gtf.

Are the indexes used?

Kill the process once you have seen the answer, to avoid wasting CPU and carbon!