Multi-Container Nextflow Tutorial¶

Introduction¶

In this notebook, we will set up and explain each file in a Nextflow workflow which uses Python scripts. Our workflow consists of two Python scripts that are each executed in a different Conda environment.

The workflow consists of several components: the main Nextflow script (main.nf), two Python scripts (script1.py and script2.py), two Conda environment files (env1.yaml and env2.yaml), and a Nextflow configuration file (nextflow.config).

We will explain each file in the following sections. We will also show how to create the files and directories necessary for this workflow using the %%writefile magic command.

Directory Structure Setup¶

Before we start creating our files, we need to ensure that the appropriate directories are in place. Our workflow will use a specific directory structure:

  • Python scripts will be located in the scripts directory.
  • Conda environment files will be located in the envs directory.

We can create these directories using a single line of code:

In [1]:
!mkdir -p scripts envs

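This command uses the Unix mkdir (make directory) command. The -p flag creates any missing parent directories and does not raise an error if a directory already exists, so the cell can be re-run safely. The ! at the beginning of the line is a Jupyter feature that runs shell commands directly from the notebook.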
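For readers who prefer to stay in Python rather than shell out, the same setup can be sketched with pathlib. This is only an illustrative equivalent; the function name make_workflow_dirs is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def make_workflow_dirs(root="."):
    """Create the scripts/ and envs/ directories under root, like `mkdir -p`."""
    created = []
    for name in ("scripts", "envs"):
        path = Path(root) / name
        # parents=True and exist_ok=True together mirror the -p flag:
        # missing parents are created and existing directories are not an error.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling make_workflow_dirs() from the notebook's working directory should leave the same two directories in place as the shell command above.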

Now that our directories are set up, we can start creating the Conda environment files and the Python scripts.

Creating the Conda Environment Files¶

Next, we will create a YAML file to define our first Conda environment. This environment will be used when running script1.py in our Nextflow workflow.

In [2]:
%%writefile envs/env1.yaml

name: env1
channels:
  - defaults
dependencies:
  - python
  - numpy
Overwriting envs/env1.yaml

The second Conda environment will be used when running script2.py in our Nextflow workflow.

In [3]:
%%writefile envs/env2.yaml

name: env2
channels:
  - defaults
dependencies:
  - python
  - pandas
Overwriting envs/env2.yaml

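Each environment installs only what its script needs: script1.py requires numpy, and script2.py requires pandas. As a result, script2.py cannot run in env1, which does not contain pandas. The reverse is not strictly true, because pandas itself depends on numpy, so env2 also ends up with numpy installed. Keeping one environment per process still makes each step's dependencies explicit and independent of the other.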
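One way to confirm which modules a given environment actually provides is to ask the interpreter directly. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical and is not part of the workflow files.

```python
import importlib.util


def missing_modules(required):
    """Return the top-level module names in `required` that cannot be found."""
    # find_spec returns None when a top-level module is not installed
    # in the interpreter that is running this code.
    return [name for name in required if importlib.util.find_spec(name) is None]
```

Run inside an activated env1, missing_modules(["numpy", "pandas"]) would be expected to report pandas as missing, assuming the environment was built from env1.yaml as written.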

Writing the Python Scripts¶

In script1.py, we import the numpy library, create a simple numpy array, multiply each element in the array by 2, and then write the result to a file named results_script1.txt.

In [4]:
%%writefile scripts/script1.py

import numpy as np

# Create an array
array = np.array([1, 2, 3, 4, 5])

# Perform a simple operation
result = array * 2

# Output the result
with open('results_script1.txt', 'w') as f:
    f.write(f'Result: {result}')
Overwriting scripts/script1.py

In script2.py, we import the pandas library, create a pandas DataFrame, compute summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) with the describe method, and then write the result to a file named results_script2.txt. By default, describe summarizes only numeric columns, so the output covers Age and skips Name.

In [5]:
%%writefile scripts/script2.py

import pandas as pd

# Create a DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32]
}
df = pd.DataFrame(data)

# Perform a simple operation
result = df.describe()

# Output the result
with open('results_script2.txt', 'w') as f:
    f.write(f'Result:\n{result}')
Overwriting scripts/script2.py
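To make the contents of results_script2.txt easier to interpret, the statistics that describe reports for the Age column can be recomputed with the standard library. This is an illustrative sketch, not how pandas implements describe; the function name describe_numeric is hypothetical, and it assumes that pandas' default quartiles use linear interpolation, which corresponds to method="inclusive" here.

```python
import statistics


def describe_numeric(values):
    """Summarize a list of numbers with the fields that describe() reports."""
    ordered = sorted(values)
    # method="inclusive" interpolates linearly between the data points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation (n - 1)
        "min": ordered[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": ordered[-1],
    }


ages = [28, 24, 35, 32]  # the Age column from script2.py
summary = describe_numeric(ages)
```

For these four ages the mean is 29.75 and the quartiles are 27.0, 30.0, and 32.75.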

The Nextflow Configuration File¶

The Nextflow configuration file (nextflow.config) specifies how the processes defined in main.nf are run:

  • process: This keyword is used to define process-specific configurations.

  • withName: SCRIPT1 and withName: SCRIPT2: These specify configurations for each of our processes, which we've named SCRIPT1 and SCRIPT2. withName is used to apply configurations to specific processes by name.

  • conda = "${baseDir}/envs/env1.yaml" and conda = "${baseDir}/envs/env2.yaml": These lines tell Nextflow to run each process inside a Conda environment built from the given YAML file. ${baseDir} is a Nextflow variable holding the directory that contains the main script; recent Nextflow releases name this variable projectDir and treat baseDir as deprecated.

In [6]:
%%writefile nextflow.config

process {
    withName: SCRIPT1 {
        conda = "${baseDir}/envs/env1.yaml"
    }
    withName: SCRIPT2 {
        conda = "${baseDir}/envs/env2.yaml"
    }
}
Overwriting nextflow.config

The Main Nextflow Script¶

The last component of the workflow is the main Nextflow script with the following parts:

  • process SCRIPT1 and process SCRIPT2: These lines define the two processes that our workflow will execute. Each process represents a computational task in our workflow. In this case, our tasks are running the two Python scripts.

  • publishDir "${baseDir}/results", mode: 'copy': This directive tells Nextflow to copy the outputs of the process to the results directory located in the base directory of the project.

  • output: path("results_script1.txt"), emit: result and output: path("results_script2.txt"), emit: result: These lines specify the output files of the processes. The emit keyword allows us to assign a name (result) to the output for further use in the workflow.

  • script: """ python3 ${baseDir}/scripts/script1.py """ and script: """ python3 ${baseDir}/scripts/script2.py """: These are the commands that each process runs. They are simple shell commands that call the Python scripts with python3. As in the configuration file, ${baseDir} resolves to the base directory of the project.

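  • workflow { SCRIPT1(); SCRIPT2() }: This is the main workflow block, which invokes both processes. Nextflow schedules processes by their data dependencies rather than by the order in which they are written. Because SCRIPT1 and SCRIPT2 share no inputs or outputs, they are independent and may run in parallel or in either order.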

In [7]:
%%writefile main.nf

process SCRIPT1 {
    publishDir "${baseDir}/results", mode: 'copy'
    
    output:
        path("results_script1.txt"), emit: result

    script:
    """
    python3 ${baseDir}/scripts/script1.py
    """
}

process SCRIPT2 {
    publishDir "${baseDir}/results", mode: 'copy'
    
    output:
        path("results_script2.txt"), emit: result

    script:
    """
    python3 ${baseDir}/scripts/script2.py
    """
}

workflow {
    SCRIPT1()
    SCRIPT2()
}
Overwriting main.nf

Running the Nextflow Workflow¶

In [ ]:
!nextflow run main.nf
N E X T F L O W  ~  version 22.10.6
Launching `main.nf` [reverent_golick] DSL2 - revision: 7e1971871e

When this command is run, Nextflow executes our workflow. It builds a Conda environment from each of env1.yaml and env2.yaml (which can take a while on the first run), runs script1.py in the first environment and script2.py in the second, and then copies results_script1.txt and results_script2.txt to the results directory.
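After the run finishes, a quick check that both published files exist can catch a failed or partial run. The sketch below assumes the results directory and file names used in main.nf; the helper name missing_outputs is hypothetical.

```python
from pathlib import Path

# File names taken from the output declarations in main.nf.
EXPECTED_OUTPUTS = ("results_script1.txt", "results_script2.txt")


def missing_outputs(results_dir="results", expected=EXPECTED_OUTPUTS):
    """Return the expected output files that are not present in results_dir."""
    results = Path(results_dir)
    return [name for name in expected if not (results / name).is_file()]
```

An empty list from missing_outputs() suggests that both processes completed and that publishDir copied their outputs.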