In this notebook, we will set up and explain each file in a Nextflow workflow which uses Python scripts. Our workflow consists of two Python scripts that are each executed in a different Conda environment.
The workflow consists of several components: the main Nextflow script (main.nf), two Python scripts (script1.py and script2.py), two Conda environment files (env1.yaml and env2.yaml), and a Nextflow configuration file (nextflow.config).
We will explain each file in the following sections. We will also show how to create the necessary directories with a shell command and how to write each file from within the notebook using the %%writefile magic command.
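For readers working outside Jupyter, the effect of %%writefile can be approximated in plain Python. The following is a minimal sketch, not how IPython implements the magic; the helper name write_cell and the demo file name are hypothetical.

```python
from pathlib import Path
import tempfile


def write_cell(path, body):
    """Write `body` to `path`, replacing any existing file (hypothetical helper)."""
    target = Path(path)
    existed = target.exists()
    target.write_text(body)
    # Mirror the status message that %%writefile prints in the notebook
    return f"{'Overwriting' if existed else 'Writing'} {path}"


# Demo in a temporary directory so no project files are touched
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    print(write_cell(demo, "hello\n"))        # first call creates the file
    print(write_cell(demo, "hello again\n"))  # second call replaces it
```

Like the magic command, this sketch fails if the parent directory does not exist, which is why the directories are created before any file is written.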
Before we start creating our files, we need to ensure that the appropriate directories are in place. Our workflow will use a specific directory structure:
- The Python scripts are stored in the scripts directory.
- The Conda environment files are stored in the envs directory.
We can create these directories using a single line of code:
!mkdir -p scripts envs
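This command uses the Unix mkdir (make directory) command. The -p flag creates any missing parent directories and does not report an error if a directory already exists, so the cell can be rerun safely. The ! at the beginning of the command is a special Jupyter feature that allows us to run shell commands directly from the notebook.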
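The same directories can also be created from Python rather than the shell. This is a minimal sketch using pathlib; the helper name make_dirs is hypothetical, and the demo runs in a temporary directory instead of the project directory.

```python
from pathlib import Path
import tempfile


def make_dirs(base, names=("scripts", "envs")):
    """Create each directory under `base`; do nothing if it already exists."""
    created = []
    for name in names:
        directory = Path(base) / name
        # parents=True and exist_ok=True together behave like `mkdir -p`
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


with tempfile.TemporaryDirectory() as tmp:
    for directory in make_dirs(tmp):
        print(directory.name, directory.is_dir())
```

Calling make_dirs(".") from the notebook's working directory would have the same effect as the shell command above.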
Now that we have our directories set up, we can start creating our Conda environment files and Python scripts.
First, we will create a YAML file to define our first Conda environment. This environment will be used when running script1.py in our Nextflow workflow.
%%writefile envs/env1.yaml
name: env1
channels:
  - defaults
dependencies:
  - python
  - numpy
Overwriting envs/env1.yaml
The second Conda environment will be used when running script2.py in our Nextflow workflow.
%%writefile envs/env2.yaml
name: env2
channels:
  - defaults
dependencies:
  - python
  - pandas
Overwriting envs/env2.yaml
Each script is intended to run in its own environment: script1.py imports numpy, which env1 installs, and script2.py imports pandas, which env2 installs. Each environment lists only the package its script needs. Note that script2.py would fail in env1, which lacks pandas, whereas env2 also contains numpy because pandas depends on it.
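One way to see which of these modules a given environment provides is to ask Python whether each one can be imported. The following is a minimal sketch using only the standard library; the helper name available is hypothetical, and the result depends on the environment in which it is run.

```python
import importlib.util


def available(modules):
    """Map each module name to True if it can be imported in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# The printed values depend on the active environment
print(available(["numpy", "pandas"]))
```

Running this inside each Conda environment would show directly which script that environment can support.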
In script1.py, we import the numpy library, create a simple numpy array, multiply each element in the array by 2, and then write the result to a file named results_script1.txt.
%%writefile scripts/script1.py
import numpy as np

# Create an array
array = np.array([1, 2, 3, 4, 5])

# Perform a simple operation: multiply each element by 2
result = array * 2

# Output the result
with open('results_script1.txt', 'w') as f:
    f.write(f'Result: {result}')
Overwriting scripts/script1.py
In script2.py, we import the pandas library, create a pandas DataFrame, calculate summary statistics such as the mean, standard deviation, and percentiles using the describe method, and then write the result to a file named results_script2.txt. By default, describe summarizes only the numeric columns, so here it reports on Age and omits Name.
%%writefile scripts/script2.py
import pandas as pd

# Create a DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32]
}
df = pd.DataFrame(data)

# Perform a simple operation: compute summary statistics
result = df.describe()

# Output the result
with open('results_script2.txt', 'w') as f:
    f.write(f'Result:\n{result}')
Overwriting scripts/script2.py
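To make the contents of results_script2.txt easier to interpret, the sketch below recomputes the main statistics for the Age column using only the standard library. It assumes the sample standard deviation and linearly interpolated quartiles, which are the pandas defaults for describe; the variable names are illustrative and do not appear in script2.py.

```python
import statistics

ages = [28, 24, 35, 32]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    # Sample standard deviation (divides by n - 1)
    "std": statistics.stdev(ages),
    "min": min(ages),
    "max": max(ages),
}

# method="inclusive" interpolates linearly between the sorted data points
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
summary.update({"25%": q1, "50%": median, "75%": q3})

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The values printed here should agree with the Age column that describe writes to the results file, up to rounding.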
The Nextflow configuration provides specifications for how to run the main Nextflow file:
- process: This keyword is used to define process-specific configurations.
- withName: SCRIPT1 and withName: SCRIPT2: These selectors hold the configuration for each of our processes, which we've named SCRIPT1 and SCRIPT2. withName is used to apply configurations to specific processes by name.
- conda = "${baseDir}/envs/env1.yaml" and conda = "${baseDir}/envs/env2.yaml": These lines tell Nextflow to use the specified Conda environment when running the process. ${baseDir} is a variable in Nextflow that represents the base directory of the project, that is, the directory containing main.nf (newer Nextflow releases also expose it as projectDir).
%%writefile nextflow.config
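// Conda support must be enabled explicitly in Nextflow 22.08.0-edge and later
conda.enabled = true

process {
    withName: SCRIPT1 {
        conda = "${baseDir}/envs/env1.yaml"
    }
    withName: SCRIPT2 {
        conda = "${baseDir}/envs/env2.yaml"
    }
}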
Overwriting nextflow.config
The last component of the workflow is the main Nextflow script with the following parts:
- process SCRIPT1 and process SCRIPT2: These lines define the two processes that our workflow will execute. Each process represents a computational task in our workflow. In this case, our tasks are running the two Python scripts.
- publishDir "${baseDir}/results", mode: 'copy': This directive tells Nextflow to copy the outputs of the process to the results directory located in the base directory of the project.
- output: path("results_script1.txt"), emit: result and output: path("results_script2.txt"), emit: result: These lines specify the output files of the processes. The emit keyword allows us to assign a name (result) to the output for further use in the workflow.
- script: """ python3 ${baseDir}/scripts/script1.py """ and script: """ python3 ${baseDir}/scripts/script2.py """: These are the commands that will be run for each process. They are simple shell commands that run the Python scripts. Each script writes its result file to the current working directory, which is the task's work directory, and that is where Nextflow looks for the declared output. ${baseDir} is a variable in Nextflow that represents the base directory of the project.
- workflow { SCRIPT1(); SCRIPT2() }: This is the main workflow block, which specifies the processes to run. Because SCRIPT1 and SCRIPT2 do not depend on each other's output, Nextflow may run them in parallel, so the order in which they are written does not determine the order in which they execute.
%%writefile main.nf
process SCRIPT1 {
    publishDir "${baseDir}/results", mode: 'copy'

    output:
    path("results_script1.txt"), emit: result

    script:
    """
    python3 ${baseDir}/scripts/script1.py
    """
}

process SCRIPT2 {
    publishDir "${baseDir}/results", mode: 'copy'

    output:
    path("results_script2.txt"), emit: result

    script:
    """
    python3 ${baseDir}/scripts/script2.py
    """
}

workflow {
    SCRIPT1()
    SCRIPT2()
}
Overwriting main.nf
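Before launching the workflow, it can help to confirm that every file created in this notebook is in place. The following is a minimal sketch of such a check; the helper name missing_files and the constant EXPECTED_FILES are hypothetical, and the list simply mirrors the file names used in this notebook.

```python
from pathlib import Path

# Files described in this notebook, relative to the project directory
EXPECTED_FILES = [
    "main.nf",
    "nextflow.config",
    "envs/env1.yaml",
    "envs/env2.yaml",
    "scripts/script1.py",
    "scripts/script2.py",
]


def missing_files(project_dir=".", expected=EXPECTED_FILES):
    """Return the expected files that are absent from `project_dir`."""
    root = Path(project_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_files()
print("All files present" if not missing else f"Missing: {missing}")
```

An empty list means the project layout matches what main.nf and nextflow.config refer to through ${baseDir}.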
!nextflow run main.nf
N E X T F L O W  ~  version 22.10.6
Launching `main.nf` [reverent_golick] DSL2 - revision: 7e1971871e
When this command is run, Nextflow will execute our workflow. It will create and activate the Conda environments specified in env1.yaml and env2.yaml, run script1.py in env1 and script2.py in env2, and then copy the outputs to the results directory.
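Once the run has finished, the published files can be inspected from the notebook. The following is a minimal sketch that assumes the results directory sits in the current working directory; the helper name read_results is hypothetical.

```python
from pathlib import Path


def read_results(results_dir="results"):
    """Return {file name: contents} for every results_*.txt file found."""
    directory = Path(results_dir)
    return {
        path.name: path.read_text()
        for path in sorted(directory.glob("results_*.txt"))
    }


# Prints nothing if the workflow has not produced any results yet
for name, text in read_results().items():
    print(f"--- {name} ---")
    print(text)
```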