This lesson is still being designed and assembled (Pre-Alpha version)

Introduction to Bioinformatics workflows with Nextflow and nf-core: Groovy syntax

Key Points

Getting Started with Nextflow
  • A workflow is a sequence of tasks that process a set of data.

  • A workflow management system (WfMS) is a computational platform that provides an infrastructure for the set-up, execution and monitoring of workflows.

  • Nextflow is a workflow management system that comprises both a runtime environment and a domain specific language (DSL).

  • Nextflow scripts consist of channels for controlling inputs and outputs, and processes for defining workflow tasks.

  • You run a Nextflow script using the nextflow run command.

Nextflow scripting
  • Nextflow is a Domain Specific Language (DSL) implemented on top of the Groovy programming language.

  • To define a variable, assign a value to it, e.g., a = 1.

  • Comments use the same syntax as in the C-family programming languages: // for a single line or /* */ for multiple lines.

  • Multiple values can be stored in lists [value1, value2, value3, …] or maps [chromosome: 1, start: 1].

  • Lists are indexed and sliced with square brackets (e.g., list[0] and list[2..9]).

  • String interpolation (variable interpolation, variable substitution, or variable expansion) is the process of evaluating a string literal containing one or more placeholders, yielding a result in which the placeholders are replaced with their corresponding values.

  • A closure is an expression (block of code) encased in {} e.g. { it * it }.
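
The scripting points above can be tried in a few lines of plain Groovy. This is a minimal sketch, not part of the lesson's own material; the variable names and values (sample, kmers, region) are made up for illustration.

```groovy
// Define variables by assigning values to them.
a = 1
sample = 'liver_rep1'   // hypothetical sample name

/* Lists hold ordered values; maps hold key: value pairs.
   (This comment uses the multiline syntax.) */
kmers = [11, 21, 27, 31]
region = [chromosome: 1, start: 1]

// Lists are indexed and sliced with square brackets.
first = kmers[0]        // first element
middle = kmers[1..2]    // elements at index 1 and 2

// Placeholders in a double-quoted string are replaced with their values.
label = "sample ${sample} on chromosome ${region.chromosome}"

// A closure is a block of code in {}; 'it' is its implicit parameter.
square = { it * it }
squared = kmers.collect(square)

println label
println squared
```

Single-quoted strings are not interpolated, so 'sample ${sample}' would be printed literally.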

Workflow parameterization
  • Pipeline parameters are specified by prepending the prefix params to a variable name, separated by a dot character.

  • To specify a pipeline parameter on the command line for a Nextflow run, use the --variable_name syntax.

  • You can add parameters to a JSON or YAML formatted file and pass them to the script using option -params-file.
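
As a rough illustration of the three points above, the sketch below sets two defaults and shows, in comments, how they might be overridden. The file names (params_demo.nf, my_params.yaml) and parameter values are assumptions chosen for the example.

```groovy
// params_demo.nf (hypothetical file name)

// Defaults are set by prefixing a variable name with 'params.'.
params.reads  = 'data/*_{1,2}.fq.gz'   // hypothetical input pattern
params.outdir = 'results'

println "reads:  ${params.reads}"
println "outdir: ${params.outdir}"

// Override a default on the command line with a double dash:
//   nextflow run params_demo.nf --outdir my_results
//
// Or collect the values in a YAML (or JSON) file, e.g. my_params.yaml
// containing the line
//   outdir: "my_results"
// and pass it with the single-dash option:
//   nextflow run params_demo.nf -params-file my_params.yaml
```

Note the difference in dashes: pipeline parameters take two (--outdir), while Nextflow's own options such as -params-file take one.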

Channels
  • Channels must be used to import data into Nextflow.

  • Nextflow has two different kinds of channels: queue channels and value channels.

  • Data in value channels can be used multiple times in a workflow.

  • Data in queue channels are consumed when they are used by a process or an operator.

  • Channel factory methods, such as Channel.of, are used to create channels.

  • Channel factory methods have optional parameters e.g., checkIfExists, that can be used to alter the creation and behaviour of a channel.
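
A short sketch of the two channel types and of an optional factory parameter follows. The values and the data/*.fq.gz pattern are hypothetical; with checkIfExists: true the last statement is expected to stop the run if no matching file is present.

```groovy
workflow {
    // Queue channel: emits each value once, in order.
    chr_ch = Channel.of('chr1', 'chr2', 'chr3')
    chr_ch.view()

    // Value channel: holds a single value that can be read any number of times.
    genome_ch = Channel.value('GRCh38')
    genome_ch.view()

    // Factory methods accept optional parameters. Here checkIfExists asks
    // Nextflow to fail early if nothing matches the (hypothetical) pattern.
    reads_ch = Channel.fromPath('data/*.fq.gz', checkIfExists: true)
    reads_ch.view()
}
```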

Processes
  • A Nextflow process is an independent step in a workflow.

  • Processes contain up to five definition blocks: directives, inputs, outputs, a when clause and, finally, a script block.

  • The script block contains the commands you would like to run.

  • A process should have a script block, but the other four blocks are optional.

  • Inputs are defined in the input block with a type qualifier and a name.

Processes Part 2
  • Outputs of a process are defined using the output block.

  • You can group input and output data from a process using the tuple qualifier.

  • The execution of a process can be controlled using the when declaration and conditional statements.

  • Files produced within a process and defined as output can be saved to a directory using the publishDir directive.
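
The process blocks described in the two sections above could be combined as in the sketch below. The process name COUNT_LINES, the results/counts directory, the skip_me sample ID and the input pattern are all assumptions made for illustration, not names from the lesson.

```groovy
process COUNT_LINES {
    // Directive: copy declared outputs into a results directory.
    publishDir 'results/counts', mode: 'copy'

    // Inputs: a tuple groups a sample ID with its file.
    input:
    tuple val(sample_id), path(reads)

    // Outputs: the same ID travels with the file the script creates.
    output:
    tuple val(sample_id), path("${sample_id}.count.txt")

    // Conditional execution: skip tasks for one (hypothetical) sample ID.
    when:
    sample_id != 'skip_me'

    // Script: the commands to run for each task.
    script:
    """
    gzip -dc ${reads} | wc -l > ${sample_id}.count.txt
    """
}

workflow {
    // Pair each file with an ID derived from its name (hypothetical input pattern).
    reads_ch = Channel.fromPath('data/*.fq.gz')
        .map { f -> tuple(f.simpleName, f) }

    COUNT_LINES(reads_ch)
}
```

Only the script block is required here; removing the directive, the when clause, or even the output block would still leave a valid process.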

Workflow
  • A Nextflow workflow is defined by invoking processes inside the workflow scope.

  • A process is invoked like a function inside the workflow scope, passing any required inputs as arguments, e.g. FASTQC(reads_ch).

  • Process outputs can be accessed using the out attribute of the respective process object, or by assigning the output to a Nextflow variable.

  • Multiple outputs from a single process can be accessed using the list syntax [] and an index, or by referencing a named process output.
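
To make the invocation and output-access points concrete, here is a minimal sketch with a made-up process (SAY_HELLO) and a made-up output name (greeting).

```groovy
process SAY_HELLO {
    input:
    val name

    output:
    path 'greeting.txt', emit: greeting   // named output

    script:
    """
    echo "Hello ${name}" > greeting.txt
    """
}

workflow {
    names_ch = Channel.of('Ada', 'Grace')

    // Invoke the process like a function, passing the input channel.
    SAY_HELLO(names_ch)

    // Access the output by its name ...
    SAY_HELLO.out.greeting.view()

    // ... or by position; index 0 is the first declared output.
    SAY_HELLO.out[0].view()

    // Alternative: capture the output when invoking the process,
    //   greeting_ch = SAY_HELLO(names_ch)
    // (a process can only be invoked once per workflow scope).
}
```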

Operators
  • Nextflow operators are methods that allow you to modify, set or view channels.

  • Operators can be separated into several groups: filtering, transforming, splitting, combining, forking and maths operators.

  • To use an operator, use the dot notation after the channel object, e.g. my_ch.view().

  • You can parse text items emitted by a channel, that are formatted using the CSV format, using the splitCsv operator.
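
The sketch below chains a filtering and a transforming operator, then parses CSV text with splitCsv. The numbers and the small sample sheet are invented for the example.

```groovy
workflow {
    // Operators are chained with dot notation.
    Channel.of(1, 2, 3, 4)
        .filter { n -> n % 2 == 0 }   // filtering operator: keep even numbers
        .map { n -> n * n }           // transforming operator: square them
        .view()

    // splitCsv parses CSV-formatted text items into rows.
    // With header: true, each row is a map keyed by column name.
    Channel.of('sample_id,fastq\nliver,liver.fq\nlung,lung.fq')
        .splitCsv(header: true)
        .view { row -> "${row.sample_id} -> ${row.fastq}" }
}
```

In a real pipeline the CSV text would more likely come from a file, for example via Channel.fromPath on a sample sheet, followed by the same splitCsv call.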

Nextflow configuration
  • Nextflow configuration can be managed using a Nextflow configuration file.

  • Nextflow configuration files are plain text files containing a set of properties.

  • You can define process specific settings, such as cpus and memory, within the process scope.

  • You can assign different resources to different processes using the process selectors withName or withLabel.

  • You can define a profile for different configurations using the profiles scope. These profiles can be selected when launching a pipeline execution by using the -profile command-line option.

  • Nextflow configuration settings are evaluated in the order they are read in.

  • Workflow configuration settings can be inspected using nextflow config <script> [options].
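
A configuration file touching the points above might look like the following sketch. Every value, process name, label and profile name here is a placeholder, not a recommendation.

```groovy
// nextflow.config (all values below are hypothetical)

process {
    // Defaults applied to every process.
    cpus   = 1
    memory = '2 GB'

    // Override the defaults for one process by name ...
    withName: 'FASTQC' {
        cpus = 2
    }

    // ... or for every process carrying a given label.
    withLabel: 'big_mem' {
        memory = '16 GB'
    }
}

profiles {
    // Selected with: nextflow run <script> -profile test
    test {
        params.reads = 'data/test/*.fq.gz'
    }

    // Selected with: nextflow run <script> -profile cluster
    cluster {
        process.executor = 'slurm'
    }
}
```

Running nextflow config in the directory containing this file prints the resolved settings, which is a quick way to check that selectors and profiles are being picked up as intended.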

Simple RNA-Seq pipeline
  • Nextflow can combine tasks (processes) and manage data flows using channels in a single pipeline/workflow.

  • A workflow can be parameterised using params. The values of the parameters can be captured in a log file using log.info.

  • Nextflow can handle a workflow’s software requirements using several technologies, including the conda package and environment manager.

  • Workflow steps are connected via their inputs and outputs using Channels.

  • Intermediate pipeline results can be transformed using Channel operators such as combine.

  • Nextflow can execute an action when the pipeline completes the execution using the workflow.onComplete event handler to print a confirmation message.

  • Nextflow can produce reports and charts with runtime metrics and execution information using the command-line options -with-report, -with-trace and -with-timeline, and can produce a graph of the workflow using -with-dag.

Modules
  • A module file is a Nextflow script containing one or more process definitions that can be imported from another Nextflow script.

  • To import a module into a workflow use the include keyword.

  • A module script can define one or more parameters using the same syntax as a Nextflow workflow script.

  • The module inherits the parameters defined before the include statement; therefore, any parameter set later is ignored.

Sub-workflows
  • Nextflow allows for the definition of reusable sub-workflow libraries.

  • A sub-workflow groups workflow processes so that they can be included from any other script and invoked as a custom function within the new workflow scope. This enables reuse of workflow components.

  • The -entry option of the nextflow run command specifies the name of the workflow to be executed.
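
As a sketch of how a sub-workflow is defined and invoked, the single-file example below wraps one process in a named workflow. The names TO_UPPER, SHOUT and shouted are invented for illustration.

```groovy
process TO_UPPER {
    input:
    val word

    output:
    stdout

    script:
    """
    printf '%s' "${word}" | tr '[:lower:]' '[:upper:]'
    """
}

// A named sub-workflow: 'take' declares its inputs, 'emit' its outputs.
workflow SHOUT {
    take:
    words_ch

    main:
    TO_UPPER(words_ch)

    emit:
    shouted = TO_UPPER.out
}

// The unnamed entry workflow invokes the sub-workflow like a function.
workflow {
    SHOUT(Channel.of('hello', 'world'))
    SHOUT.out.shouted.view()
}
```

In a multi-file project, SHOUT could instead live in its own script and be brought in with the include keyword, for example include { SHOUT } from './workflows/shout' (a hypothetical path).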

Reporting
  • Nextflow can produce a custom execution report with run information using the nextflow log command.

  • You can generate a report using the -t option specifying a template file.

Workflow caching and checkpointing
  • Nextflow automatically keeps track of all the processes executed in your pipeline via checkpointing.

  • Nextflow caches intermediate data in task directories within the work directory.

  • Nextflow caching and checkpointing allow re-entrancy into a workflow after a pipeline error or when using new data, skipping steps that have already executed successfully.

  • Re-entrancy is enabled using the -resume option.

Deploying nf-core pipelines
  • nf-core is a community-led project to develop a set of best-practice pipelines built using the Nextflow workflow management system.

  • The nf-core tool (nf-core) is a suite of helper tools that aims to help people run and develop nf-core pipelines.

  • nf-core pipelines can be found using nf-core list, or by checking the nf-core website.

  • nf-core launch nf-core/<pipeline> can be used to write a parameter file for an nf-core pipeline. This can be supplied to the pipeline using the -params-file option.

  • An nf-core workflow is run using nextflow run nf-core/<pipeline> syntax.

  • nf-core pipelines can be reconfigured by using custom config files and/or adding command line parameters.

Nextflow coding practices
  • Nextflow is not sensitive to whitespace. Use it to lay out code for readability.

  • Use comments and whitespace to group chunks of code to describe big picture functionality.

  • Report tool versions in the scripts.

  • Name channel outputs using the emit: keyword.

  • Avoid params.parameter in a process. Pass all parameters using input channels.

  • Input files should be passed using input channels.

  • Group short running commands into a larger process.

  • Include a test profile which runs the workflow on a small test data set.

  • Write your processes to reuse existing containers/software bundles.

  • Use compressed files and temporary disk space when possible.

  • Use consistent naming conventions.

Groovy syntax

Here are some fundamental concepts of the Groovy language.

Groovy is very syntax-rich and supports many more operations. A full description of Groovy semantics can be found in the Groovy Documentation.
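
As a small taste of those concepts, the sketch below shows a function definition, a conditional, and iteration over a map with closures. The function names (gcContent, classify) and the sequences are made up for this example.

```groovy
// A function defined with 'def'; the last expression is the return value.
def gcContent(String seq) {
    def upper = seq.toUpperCase()
    def gc = upper.count('G') + upper.count('C')
    gc / seq.size()
}

// Conditional logic with if / else.
def classify(String seq) {
    if (gcContent(seq) > 0.5) {
        return 'GC-rich'
    } else {
        return 'AT-rich or balanced'
    }
}

// A map of hypothetical sequence names to sequences.
seqs = [s1: 'GATC', s2: 'GGGCCA', s3: 'ATAT']

// 'each' iterates over a map; the closure receives key and value.
labels = [:]
seqs.each { name, seq -> labels[name] = classify(seq) }

// 'findAll' keeps the entries for which the closure returns true.
rich = seqs.findAll { name, seq -> gcContent(seq) > 0.5 }.keySet().toList()

println labels
println rich
```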

Glossary

Term Description
Dataflow programming A programming paradigm that models a program as a directed graph of the data flowing between operations.
Domain Specific Language A domain-specific language (DSL) is a computer language specialized to a particular application domain. Nextflow is a DSL for computational workflows.
Tuple An ordered, immutable list of elements. A tuple can be seen as an ordered collection of objects of different types. These objects do not necessarily relate to each other in any way, but collectively they will have some meaning.
