Nextflow


Nextflow is a workflow manager, widely used in bioinformatics, that runs tasks across multiple compute infrastructures in a portable manner. The community develops and maintains workflows for several kinds of high-throughput data in the nf-core repository (https://github.com/nf-core).

The rnaseq workflow is available to all genotoul cluster users. “The workflow processes raw data from FastQ inputs (FastQC, Trim Galore!), aligns the reads (STAR or HiSAT2), generates gene counts (featureCounts, StringTie) and performs extensive quality-control on the results (RSeQC, dupRadar, Preseq, edgeR, MultiQC). See the output documentation for more details of the results.”

An advanced course on how to use nextflow on genotoul is available at https://genotoul-bioinfo.pages.mia.inra.fr/use-nextflow-nfcore-course.


Here we are going to launch the RNAseq workflow on example data and explore the results.

Set up the genotoul environment

On genobioinfo, run the following script once. It creates two directories, ~/work/.nextflow and ~/work/.singularity, and creates symbolic links in the home directory.

sh /usr/local/bioinfo/src/NextflowWorkflows/create_nfx_dirs.sh
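To confirm that the script did its job, a quick check along the following lines may help. This is a minimal sketch: the two directories under ~/work come from the text above, but the link names ~/.nextflow and ~/.singularity are an assumption, since the names of the links are not stated here.

```shell
# describe_path PATH: print whether PATH is a symbolic link, a directory, or missing.
# The link test comes first because [ -d ] follows symbolic links.
describe_path() {
    if [ -L "$1" ]; then
        echo "$1: symbolic link to $(readlink "$1")"
    elif [ -d "$1" ]; then
        echo "$1: directory"
    else
        echo "$1: missing"
    fi
}

# Directories named in the text above.
describe_path "$HOME/work/.nextflow"
describe_path "$HOME/work/.singularity"
# Assumed link names in the home directory (hypothetical, adjust if they differ).
describe_path "$HOME/.nextflow"
describe_path "$HOME/.singularity"
```

Any line reporting "missing" suggests running the setup script again.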

Explore nextflow

  • Load the module bioinfo/nfcore-Nextflow-v21.10.6. This module loads nextflow, nf-core and singularity.

  • Execute nextflow without option

    nextflow
    
  • Get help on nextflow: execute the nextflow run command with the option -help

    nextflow run -help
    
  • Get the list of nf-core workflows available locally with:

    nf-core list
    
  • Get help on the rnaseq workflow: execute the nextflow run command with the workflow nf-core/rnaseq and the option --help (add the option -r with a revision if necessary)

    nextflow run nf-core/rnaseq --help
    
  • Create a samplesheet named samples.csv with the following lines:

    sample,fastq_1,fastq_2,strandedness
    MT,MT_rep1_1_Ch6.fastq.gz,MT_rep1_2_Ch6.fastq.gz,unstranded
    WT,WT_rep1_1_Ch6.fastq.gz,WT_rep1_2_Ch6.fastq.gz,unstranded
    

If your samples are named *R1.fastq.gz and *R2.fastq.gz, you can use the following loop to generate the content of the samplesheet, header line included:

{ echo "sample,fastq_1,fastq_2,strandedness"; for i in *R1.fastq.gz; do ech=${i%R1.fastq.gz}; r2=${ech}R2.fastq.gz; echo "$ech,$i,$r2,reverse"; done; } > samples.csv
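Before launching anything, it can be worth checking that the samplesheet has the shape shown above. The function below is a minimal sketch, not the pipeline's own validation: check_samplesheet is a hypothetical helper name, and the rule that fastq_2 may be left empty (single-end data) is an assumption about the samplesheet format.

```shell
# check_samplesheet FILE: succeed only if FILE starts with the expected header
# and every data row has four comma-separated fields with sample, fastq_1 and
# strandedness filled in (fastq_2 is allowed to be empty: assumption).
check_samplesheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2,strandedness") {
                print "line 1: unexpected header: " $0
                bad = 1
            }
            next
        }
        NF != 4 || $1 == "" || $2 == "" || $4 == "" {
            print "line " NR ": malformed row: " $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A typical call would be check_samplesheet samples.csv && echo "samples.csv looks fine".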
  • Run the pipeline with the following data and parameters:

    nextflow run nf-core/rnaseq \
      --fasta star-index/ITAG2.3_genomic_Ch6.fasta \
      --gtf star-index/ITAG_pre2.3_gene_models_Ch6.gtf \
      --input samples.csv \
      --outdir results \
      --aligner star_rsem \
      --skip_bbsplit \
      --skip_trimming \
      --skip_markduplicates \
      -profile genotoul \
      -r 3.6
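A run that stops early on a mistyped path wastes a submission, so a pre-flight check of the input files can be useful. The sketch below only tests that files exist; check_inputs and fastq_files are hypothetical helper names, and the column positions follow the samplesheet layout shown earlier.

```shell
# check_inputs FILE...: report every missing file; fail if at least one is missing.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# fastq_files SAMPLESHEET: print the FASTQ paths found in columns 2 and 3,
# one per line, skipping the header and empty fields.
fastq_files() {
    tail -n +2 "$1" | cut -d',' -f2,3 | tr ',' '\n' | grep -v '^$'
}
```

For the data used here, one possible call is check_inputs star-index/ITAG2.3_genomic_Ch6.fasta star-index/ITAG_pre2.3_gene_models_Ch6.gtf $(fastq_files samples.csv).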
    

Get a summary of the executions:

For all executions in the current directory:

nextflow log

For the last execution in the current directory:

nextflow log $(nextflow log | cut -f 3 | tail -n 1) -f hash,name,exit,status

Get commands of all processes

nextflow log $(nextflow log | cut -f 3 | tail -n 1) -f hash,name,status,script
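The two commands above can be wrapped into small helpers so that only the tasks needing attention are shown. This is a sketch under stated assumptions: last_run_name and unfinished_tasks are hypothetical names, the fields printed by -f are assumed to be tab-separated, and COMPLETED and CACHED are assumed to be the status values of tasks that ended well.

```shell
# last_run_name: name of the most recent run recorded in the current directory,
# extracted as above from the third tab-separated column of the last line.
last_run_name() {
    nextflow log | cut -f 3 | tail -n 1 | tr -d ' '
}

# unfinished_tasks: tasks of the last run whose status (4th field) is neither
# COMPLETED nor CACHED.
unfinished_tasks() {
    nextflow log "$(last_run_name)" -f hash,name,exit,status |
        awk -F'\t' '$4 != "COMPLETED" && $4 != "CACHED"'
}
```

Calling unfinished_tasks after a failed run should then print one line per problematic task, with its hash, name and exit code.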

Results

The pipeline will create the following files in your working directory:

work            # Directory containing the nextflow working files
results         # Finished results (location set with --outdir)
.nextflow.log   # Log file from Nextflow
# Other nextflow hidden files, e.g. history of pipeline runs and old logs.

  • Copy the directories results/MultiQC and results/pipeline_info into ~/public_html
  • Explore the HTML files of MultiQC at web-genobioinfo.toulouse.inrae.fr/~USERNAME
  • Explore pipeline_info/execution_report*.html: which process takes the longest?
  • On the command line, explore the results directory and find the bam files, the quantification files and the new transcript files.
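For the last exploration step, a small wrapper around find is one way to locate files by name. This is only a sketch: find_outputs is a hypothetical helper, and the '*.genes.results' pattern in the usage note assumes RSEM's usual naming of its gene-level quantification files.

```shell
# find_outputs DIR PATTERN: list, in sorted order, the regular files under DIR
# whose name matches the shell pattern PATTERN.
find_outputs() {
    find "$1" -type f -name "$2" | sort
}
```

For example, find_outputs results '*.bam' lists the alignment files, and find_outputs results '*.genes.results' would list the gene-level quantification files if the naming assumption holds.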

Relaunch an aborted workflow

Warning: if you relaunch the nextflow command line in the same directory, the entire workflow will be relaunched unless you set the option -resume.

nextflow run nf-core/rnaseq -resume \
-profile genotoul \
--fasta star-index/ITAG2.3_genomic_Ch6.fasta \
--gtf star-index/ITAG_pre2.3_gene_models_Ch6.gtf \
--input samples.csv \
--outdir results \
--aligner star_rsem \
--skip_bbsplit \
--skip_trimming \
--skip_markduplicates \
-r 3.6
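Retyping this long command for every relaunch invites typos, and a changed parameter invalidates the cached tasks. One possible way around this, sketched below, is to store the pipeline parameters in a file and pass it with the Nextflow option -params-file. The file name params.yaml is an arbitrary choice, and the keys are assumed to mirror the double-dash options used in this tutorial.

```shell
# Write the pipeline parameters once; each key mirrors a --option of the run.
cat > params.yaml <<'EOF'
fasta: star-index/ITAG2.3_genomic_Ch6.fasta
gtf: star-index/ITAG_pre2.3_gene_models_Ch6.gtf
input: samples.csv
outdir: results
aligner: star_rsem
skip_bbsplit: true
skip_trimming: true
skip_markduplicates: true
EOF

# The relaunch then becomes (not executed here):
#   nextflow run nf-core/rnaseq -r 3.6 -profile genotoul -params-file params.yaml -resume
```

Single-dash options such as -profile, -r and -resume belong to nextflow itself and stay on the command line.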

Use a genome already indexed

  • List all files available in results/genome

  • Here is the format of a configuration for a genome (https://nf-co.re/docs/usage/reference_genomes)

    params {
      genomes {
        'YOUR-ID' {
          fasta = '/path/to/data/genome.fa'
        }
        'OTHER-GENOME' {
          // [..]
        }
      }
      // Optional - default genome. Ignored if --genome 'OTHER-GENOME' specified on command line
      genome = 'YOUR-ID'
    }
    
  • Here is an example of a full genome configuration:

genomes {
      GRCh37 {
         fasta = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa'
         bwa = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa'
         bowtie2 = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/'
         star = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/'
         bismark = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/'
         gtf = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf'
         bed12 = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed'
         readme = 's3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt'
         mito_name = 'MT'
         macs_gsize = '2.7e9'
         blacklist = '/home/cnoirot/.nextflow/assets/nf-core/rnaseq/assets/blacklists/GRCh37-blacklist.bed'
      }
 }
  • Add the appropriate configuration for the tomato genome to a file called nextflow.config in the current directory

params {
 genomes {
    'Itag' {
       fasta = './star-index/ITAG2.3_genomic_Ch6.fasta'
       fasta_index = './results/genome/ITAG2.3_genomic_Ch6.fasta.fai'
       // TODO: add the other expected files; look at the files in results/genome
    }
 }
 // NOTE: pipeline parameters can also be set here.
}

NB: in version 3.6, when using --aligner star_rsem, both the STAR and RSEM indices should be present in the path specified by --rsem_index (see issue #568)

%accordion%Solution

   params {
     genomes {
       'Itag' {
         fasta = './star-index/ITAG2.3_genomic_Ch6.fasta'
         fasta_index = './results/genome/ITAG2.3_genomic_Ch6.fasta.fai'
         gtf = './star-index/ITAG_pre2.3_gene_models_Ch6.gtf'
         bed12 = './results/genome/ITAG_pre2.3_gene_models_Ch6.bed'
       }
     }
     rsem_index = './results/genome/index/rsem/'
     aligner = 'star_rsem'
     skip_bbsplit = true
     skip_trimming = true
     skip_markduplicates = true
   }

%/accordion%

  • Launch nextflow config nf-core/rnaseq -profile genotoul. Is the configuration of the Itag genome loaded?

  • Relaunch the rnaseq pipeline using the options --genome Itag --igenomes_base ~/work/tp_rnaseq/results/genome instead of --fasta and --gtf

Are the indices used?

Kill the run once you have the answer, to avoid wasting CPU and carbon!
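For the first question above, whether the Itag entry is loaded, the output of nextflow config can be searched instead of read by eye. The helper below is a minimal sketch: genome_is_configured is a hypothetical name, and the pattern assumes that the resolved configuration prints each genome as its ID, optionally quoted, followed by an opening brace.

```shell
# genome_is_configured ID: succeed if the resolved configuration contains a
# block opened by ID, with or without single quotes around it.
genome_is_configured() {
    nextflow config nf-core/rnaseq -profile genotoul |
        grep -qE "^[[:space:]]*'?${1}'?[[:space:]]*\{"
}
```

A possible call is genome_is_configured Itag && echo "Itag is loaded", run from the directory that holds nextflow.config.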

%accordion%Solution

        nextflow run nf-core/rnaseq -profile genotoul \
        --genome Itag --igenomes_base ~/work/tp_rnaseq/results/genome \
        -r 3.6 --input samples.csv --outdir ResultTestConfig

%/accordion%
