Computational workflows, which describe complex, multi-step procedures for automated execution, are essential for ensuring reproducibility, scalability, and efficiency in scientific research. The FAIRagro Scientific Workflow Infrastructure (SciWIn) supports scientists in creating, executing, sharing, and publishing these workflows, fostering collaboration and transparency.
Why you need s4n?
SciWIn-Client (s4n) is designed to meet scientists right at the in silico workbench, where iterative and highly interactive processes such as data extraction, cleaning, visualization, exploration, analysis and transformation are carried out. It is a command-line tool for easily creating, recording, annotating and executing computational workflows. What Git does for versioning, s4n does for provenance management. From simple one-step calculations to complex multi-branch pipelines, s4n records the chain of provenance for data and code artifacts. These records can be re-executed, also on remote computers. The individual artifacts and computational steps form a graph that can be annotated with semantic metadata, and s4n supports this annotation as well. s4n can package the resulting workflow as a Workflow RO-Crate and publish it through WorkflowHub.
What is CWL?
CWL is an acronym for Common Workflow Language.
Available commands
SciWIn client provides commands for project initialization (s4n init), working with CWL CommandLineTools (s4n tool) and CWL Workflows (s4n workflow), metadata annotation (s4n annotate), the execution of CWL (s4n execute) and synchronization with a remote server (s4n sync).
Usage
Client tool for Scientific Workflow Infrastructure (SciWIn)
Usage: s4n <COMMAND>
Commands:
init Initializes project folder structure and repository
tool Provides commands to create and work with CWL CommandLineTools
workflow Provides commands to create and work with CWL Workflows
annotate
execute Execution of CWL Files locally or on remote servers [aliases: ex]
sync
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
Example Usage
This example is a sample use case for building a small project with s4n. It covers the creation of two command-line scripts, their combination into a workflow, and the execution of this workflow using the internal CWL runner.
Add s4n to your PATH environment variable if not done already.
To verify that s4n is available on the PATH, it can be called with one of the options listed under Usage, for example s4n --version to print its version.
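The PATH check can also be scripted. The following is a minimal Python sketch, not part of s4n itself: the helper name find_executable is hypothetical, and it only relies on shutil.which from the standard library, which searches the directories listed in PATH.

```python
import shutil


def find_executable(name):
    """Return the absolute path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable("s4n")
    if path is None:
        print("s4n was not found on PATH")
    else:
        print(f"s4n found at {path}")
```

If the script reports that s4n was not found, the folder containing the s4n binary still needs to be added to PATH.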
To initialize a new project use the s4n init command. A project folder can be specified using the -p argument.
The command will initialize a git repository in this folder if there is none already. Furthermore, a workflows folder will be created.
s4n init -p test_project
# 📂 s4n project initialisation sucessfully:
# test_project (Base)
# ├── workflows
For this example some data is needed. Create a new folder named data and download the raw data files into it, for example with wget.
wget https://raw.githubusercontent.com/fairagro/m4.4_sciwin_client/refs/heads/main/tests/test_data/hello_world/data/population.csv
wget https://raw.githubusercontent.com/fairagro/m4.4_sciwin_client/refs/heads/main/tests/test_data/hello_world/data/speakers_revised.csv
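Where wget is not available, the same two files could be fetched with Python's standard library. This is a sketch under assumptions: the function names plan_downloads and download_all are made up for illustration, and the target folder data is taken from the text above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names are copied from the wget commands above.
BASE_URL = (
    "https://raw.githubusercontent.com/fairagro/m4.4_sciwin_client/"
    "refs/heads/main/tests/test_data/hello_world/data"
)
FILES = ["population.csv", "speakers_revised.csv"]


def plan_downloads(target_dir="data"):
    """Return (url, destination) pairs without touching the network."""
    return [(f"{BASE_URL}/{name}", Path(target_dir) / name) for name in FILES]


def download_all(target_dir="data"):
    """Create the target folder and download both demo files into it."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    for url, destination in plan_downloads(target_dir):
        urlretrieve(url, str(destination))
```

Calling download_all() from the project root would place both CSV files in the data folder.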
To keep the demo project organized, the workflows folder will also be used to house the scripts used in this demo. The following Python script needs to be created as workflows/calculation/calculation.py
import argparse
import csv


def calculate_total_population(population_file):
    """Sum the population column (second column) of the CSV file."""
    total_population = 0
    with open(population_file, 'r') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            try:
                total_population += int(row[1])
            except ValueError:
                print(f"Error: Invalid population value in {row[0]}")
                return None
    return total_population


def calculate_speaker_percentages(speakers_file, total_population):
    """Print one CSV line per language with its share of the total population."""
    print("Language,Speakers,Percentage")
    with open(speakers_file, 'r') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            try:
                language = row[0]
                speakers = int(row[1])
                percentage = (speakers / total_population) * 100
                print(f"{language},{speakers},{percentage:.2f}%")
            except ValueError:
                print(f"Error: Invalid speakers value in {row[0]}")


def main():
    parser = argparse.ArgumentParser(description='Calculate population-based percentages.')
    parser.add_argument('--population', required=True, help='CSV file containing population data')
    parser.add_argument('--speakers', required=True, help='CSV file containing speakers data')
    args = parser.parse_args()
    try:
        total_population = calculate_total_population(args.population)
        if total_population is None:
            return  # the invalid value has already been reported
        # The function prints the CSV rows itself and returns nothing,
        # so its return value must not be printed (it would add a "None" line).
        calculate_speaker_percentages(args.speakers, total_population)
    except FileNotFoundError as e:
        print(f"Error: File not found: {e.filename}")
        return


if __name__ == "__main__":
    main()
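The arithmetic behind each output row is a plain share of the total. As a worked sketch with made-up numbers (they are not taken from the demo CSV files), and with speaker_percentage as a hypothetical helper name:

```python
def speaker_percentage(speakers, total_population):
    """Share of speakers in the total population, in percent."""
    return (speakers / total_population) * 100


# Hypothetical numbers, not taken from the demo data files.
populations = [1_000_000, 3_000_000]
total = sum(populations)  # 4_000_000
share = speaker_percentage(1_000_000, total)

# Same row layout as the script: Language,Speakers,Percentage
print(f"Example,{1_000_000},{share:.2f}%")  # prints Example,1000000,25.00%
```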
To run the tool creation command the changes need to be committed beforehand. The Python script would usually be called with the command python workflows/calculation/calculation.py --speakers data/speakers_revised.csv --population data/population.csv > results.csv. To create a CommandLineTool this only needs to be prefixed with s4n tool create or s4n run. However, the > operator needs to be escaped using a backslash so that the shell passes it on to s4n instead of redirecting the output itself.
s4n tool create python workflows/calculation/calculation.py --speakers data/speakers_revised.csv --population data/population.csv \> results.csv
# 📂 The current working directory is /home/ubuntu/test_project
# ⏳ Executing Command: `python workflows/calculation/calculation.py --speakers data/speakers_revised.csv --population data/population.csv`
# 📜 Found changes:
# - results.csv
#
# 📄 Created CWL file workflows/calculation/calculation.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
- class: InitialWorkDirRequirement
  listing:
  - entryname: workflows/calculation/calculation.py
    entry:
      $include: calculation.py
inputs:
- id: speakers
  type: File
  default:
    class: File
    location: '../../data/speakers_revised.csv'
  inputBinding:
    prefix: '--speakers'
- id: population
  type: File
  default:
    class: File
    location: '../../data/population.csv'
  inputBinding:
    prefix: '--population'
outputs:
- id: results
  type: File
  outputBinding:
    glob: results.csv
stdout: results.csv
baseCommand:
- python
- workflows/calculation/calculation.py
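To see how this description maps back to a command line, the following Python sketch rebuilds the argument list from baseCommand and the inputBinding prefixes. It is a deliberate simplification and not how s4n or cwltool are implemented: the dictionary only mirrors the relevant fields of the CWL file above, build_command is a hypothetical helper, and real CWL runners also handle positions, staging and types.

```python
# A hand-written mirror of the relevant fields of calculation.cwl.
tool = {
    "baseCommand": ["python", "workflows/calculation/calculation.py"],
    "inputs": [
        {"id": "speakers", "inputBinding": {"prefix": "--speakers"}},
        {"id": "population", "inputBinding": {"prefix": "--population"}},
    ],
    "stdout": "results.csv",
}


def build_command(tool, values):
    """Assemble the argument list: base command, then prefix/value pairs."""
    argv = list(tool["baseCommand"])
    for spec in tool["inputs"]:
        argv += [spec["inputBinding"]["prefix"], values[spec["id"]]]
    return argv


command = build_command(
    tool,
    {
        "speakers": "data/speakers_revised.csv",
        "population": "data/population.csv",
    },
)
print(" ".join(command), ">", tool["stdout"])
```

The printed line corresponds to the command that was originally passed to s4n tool create.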
The tool create command created a description of the script which can be used to build workflows. In this example a second script, saved as workflows/plot/plot.py, will be used to obtain a linear two-step workflow at the end (matplotlib needs to be installed beforehand!)
import argparse
import csv
import matplotlib.pyplot as plt


def generate_bar_plot(results_file):
    languages = []
    percentages = []
    with open(results_file, 'r') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            language = row[0]
            percentage = float(row[2].replace('%', ''))
            languages.append(language)
            percentages.append(percentage)
    plt.bar(languages, percentages)
    plt.xlabel('Language')
    plt.ylabel('Percentage of Total Population')
    plt.title('Language Speakers as Percentage of Total Population')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig("figure.png")


def main():
    parser = argparse.ArgumentParser(description='Generate a bar plot from results.csv.')
    parser.add_argument('--data', required=True, help='CSV file containing the results data for bar plot')
    args = parser.parse_args()
    try:
        generate_bar_plot(args.data)
    except FileNotFoundError as e:
        print(f"Error: File not found: {e.filename}")
        return


if __name__ == "__main__":
    main()
The following command creates the CWL definition of the tool for this script. SciWIn client automatically determines that figure.png shall be listed as an output for this tool.
s4n tool create python workflows/plot/plot.py --data results.csv
# 📂 The current working directory is /home/ubuntu/test_project
# ⏳ Executing Command: `python workflows/plot/plot.py --data results.csv`
# 📜 Found changes:
# - figure.png
#
# 📄 Created CWL file workflows/plot/plot.cwl
The freshly created plot tool should look like this:
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
- class: InitialWorkDirRequirement
  listing:
  - entryname: workflows/plot/plot.py
    entry:
      $include: plot.py
inputs:
- id: data
  type: File
  default:
    class: File
    location: '../../results.csv'
  inputBinding:
    prefix: '--data'
outputs:
- id: figure
  type: File
  outputBinding:
    glob: figure.png
baseCommand:
- python
- workflows/plot/plot.py
To check that everything has been created correctly, the list command can be used to visualize the tools and their inputs and outputs.
s4n tool list -a
# 📂 Scanning for tools in: /home/ubuntu/test_project
# +-------------+----------------------------------------------+---------------------+
# | Tool | Inputs | Outputs |
# +-------------+----------------------------------------------+---------------------+
# | plot | plot/data | plot/figure |
# +-------------+----------------------------------------------+---------------------+
# | calculation | calculation/speakers, calculation/population | calculation/results |
# +-------------+----------------------------------------------+---------------------+
To execute those tools and to really benefit from using CWL, a workflow connecting the tools can be created. To create a blank workflow file the
Connections between tools and their inputs and outputs are created with the connect command. Its two arguments, --from and --to, draw a line from one node to another. Furthermore, a workflow needs inputs and outputs to process its steps. The slot names can be copied from the output of the aforementioned s4n tool list command. A connection to workflow inputs or outputs needs to be prefixed with @ and results in the creation of a new input or output slot.
s4n workflow connect main --from @inputs/speakers --to calculation/speakers
# ➕ Added step calculation to workflow
# ➕ Added or updated connection from inputs.speakers to calculation/speakers in workflow
# ✔️ Updated Workflow workflows/main/main.cwl!
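The slot naming used by --from and --to can be modelled in a few lines. The sketch below only illustrates the convention described above (owner/name, with @ marking a workflow-level input or output); parse_slot is a hypothetical helper and does not reflect how s4n parses its arguments internally.

```python
def parse_slot(slot):
    """Split 'owner/name' and flag workflow-level slots prefixed with '@'."""
    owner, _, name = slot.partition("/")
    is_workflow_slot = owner.startswith("@")
    return owner.lstrip("@"), name, is_workflow_slot


# The two ends of the connection created above.
source = parse_slot("@inputs/speakers")
target = parse_slot("calculation/speakers")
print(source)  # ('inputs', 'speakers', True)
print(target)  # ('calculation', 'speakers', False)
```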
A report of the current status of the workflow can be obtained with the status command. Running it now results in the following table. As it states, there is currently one input (speakers) and no output. The only step is the calculation step, whose speakers slot is connected to that input. The population slot would use its default value, and nothing is done yet with the step's output results.
s4n workflow status main
# Status report for Workflow workflows/main/main.cwl
# +--------------------------------+------------------+---------------+
# | Tool | Inputs | Outputs |
# +================================+==================+===============+
# | <Workflow> | ✅ speakers | |
# +--------------------------------+------------------+---------------+
# | Steps: | | |
# +--------------------------------+------------------+---------------+
# | ../calculation/calculation.cwl | ✅ speakers | ❌ results |
# | | 🔘 population | |
# +--------------------------------+------------------+---------------+
# ✅ : connected - 🔘 : tool default - ❌ : no connection
To connect another input to the population slot, the above command can be reused with the necessary adjustments. The stdout will be one line shorter now, as the step will not be created a second time.
s4n workflow connect main --from @inputs/population --to calculation/population
# ➕ Added or updated connection from inputs.population to calculation/population in workflow
# ✔️ Updated Workflow workflows/main/main.cwl!
As before, the result of the calculation step shall be used in the plot script. Therefore a connection between both steps is needed. The s4n tool ls -a command can be used to copy and paste the slot names for this connection.
s4n workflow connect main --from calculation/results --to plot/data
# 🔗 Found step calculation in workflow. Not changing that!
# ➕ Added step plot to workflow
# ✔️ Updated Workflow workflows/main/main.cwl!
To complete the workflow a connection to an output is needed. Otherwise no file will be copied back after running the workflow.
s4n workflow connect main --from plot/figure --to @outputs/image
# ➕ Added or updated connection from plot/figure to outputs.image in workflow!
# ✔️ Updated Workflow workflows/main/main.cwl!
Running the status command again will show that everything is fine now!
s4n workflow status main
# Status report for Workflow workflows/main/main.cwl
# +--------------------------------+------------------+---------------+
# | Tool | Inputs | Outputs |
# +================================+==================+===============+
# | <Workflow> | ✅ speakers | ✅ image |
# | | ✅ population | |
# +--------------------------------+------------------+---------------+
# | Steps: | | |
# +--------------------------------+------------------+---------------+
# | ../calculation/calculation.cwl | ✅ speakers | ✅ results |
# | | ✅ population | |
# +--------------------------------+------------------+---------------+
# | ../plot/plot.cwl | ✅ data | ✅ figure |
# +--------------------------------+------------------+---------------+
# ✅ : connected - 🔘 : tool default - ❌ : no connection
The finished CWL workflow file looks like this:
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: Workflow
inputs:
- id: speakers
  type: File
- id: population
  type: File
outputs:
- id: image
  type: File
  outputSource: plot/figure
steps:
- id: calculation
  in:
    population: population
    speakers: speakers
  run: '../calculation/calculation.cwl'
  out:
  - results
- id: plot
  in:
    data: calculation/results
  run: '../plot/plot.cwl'
  out:
  - figure
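The order in which a runner has to execute the steps follows from these connections: a step whose input refers to step/output depends on that step. The following Python sketch derives the order with a topological sort from the standard library (graphlib, Python 3.9 or newer). The steps dictionary is a hand-written mirror of the "in" sections above, and execution_order is a hypothetical helper, not s4n's scheduler.

```python
from graphlib import TopologicalSorter

# Hand-written mirror of the "in" sections of main.cwl.
steps = {
    "calculation": {"population": "population", "speakers": "speakers"},
    "plot": {"data": "calculation/results"},
}


def execution_order(steps):
    """Order steps so that each one runs after the steps it reads from."""
    graph = {}
    for step_id, sources in steps.items():
        # A source of the form "step/output" points to another step;
        # a plain name refers to a workflow input and adds no dependency.
        graph[step_id] = {
            src.split("/")[0] for src in sources.values() if "/" in src
        }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(steps))  # ['calculation', 'plot']
```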
The workflow can now be executed using a standard CWL runner like cwltool or using SciWIn-Client's internal runner. The internal runner is still in a testing phase, so it does not support every feature cwltool does, but it does support everything the client can generate. CWL can be executed either with command-line arguments or with an input file in YAML format. For this demo the following input file (inputs.yml) is used:
population:
  class: File
  location: data/population.csv
speakers:
  class: File
  location: data/speakers_revised.csv
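Before starting a run it can be useful to confirm that every file named in the input file exists. The sketch below is an optional pre-flight check under assumptions: the job dictionary is a hand-written copy of inputs.yml (so no YAML library is needed), and missing_inputs is a hypothetical helper that is not part of s4n.

```python
from pathlib import Path

# Hand-written copy of inputs.yml.
job = {
    "population": {"class": "File", "location": "data/population.csv"},
    "speakers": {"class": "File", "location": "data/speakers_revised.csv"},
}


def missing_inputs(job, base_dir="."):
    """Return the ids of File inputs whose location does not exist."""
    return sorted(
        key
        for key, value in job.items()
        if value.get("class") == "File"
        and not (Path(base_dir) / value["location"]).is_file()
    )


if __name__ == "__main__":
    missing = missing_inputs(job)
    if missing:
        print("Missing input files for:", ", ".join(missing))
    else:
        print("All input files are present")
```

Run from the project root, the check reports any input whose file has not been downloaded yet.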
Before running the workflow the output files of the script execution should be deleted to verify the correct execution. The workflow can be executed locally by using the following command which concludes this example.
s4n execute local workflows/main/main.cwl inputs.yml
# 💻 Executing "workflows/main/main.cwl" using SciWIn's custom runner. Use `--runner cwltool` to use reference runner (if installed).
# ⚠️ The internal runner currently is for testing purposes only and does not support containerization, yet!
# 🚲 Executing CommandLineTool "workflows/main/../calculation/calculation.cwl" ...
# 📁 Created staging directory: "/tmp/.tmpo1VAdn"
# ⏳ Executing Command: `python workflows/calculation/calculation.py --speakers /tmp/.tmpo1VAdn/data/speakers_revised.csv --population /tmp/.tmpo1VAdn/data/population.csv`
# 📜 Wrote output file: "/tmp/.tmp7ol6bV/results.csv"
# ✔️ CommandLineTool "workflows/main/../calculation/calculation.cwl" executed successfully in 196ms!
# 🚲 Executing CommandLineTool "workflows/main/../plot/plot.cwl" ...
# 📁 Created staging directory: "/tmp/.tmphnWjUa"
# ⏳ Executing Command: `python workflows/plot/plot.py --data /tmp/.tmphnWjUa/results.csv`
# 📜 Wrote output file: "/tmp/.tmp7ol6bV/figure.png"
# ✔️ CommandLineTool "workflows/main/../plot/plot.cwl" executed successfully in 1s!
# {
# "image": {
# "location": "file:///home/ubuntu/test_project/figure.png",
# "basename": "figure.png",
# "class": "File",
# "checksum": "sha1$65a86b4fa5d42ee81ecda344fc1030c61ad6cb06",
# "size": 40074,
# "path": "/home/ubuntu/test_project/figure.png"
# }
# }
# ✔️ Workflow "workflows/main/main.cwl" executed successfully in 1s!
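The checksum field in the output object has the form sha1$ followed by a hexadecimal digest. To compare a result file against a recorded checksum, a digest in the same notation can be computed with Python's hashlib. This is a minimal sketch; the function names are made up for illustration.

```python
import hashlib


def sha1_of_bytes(data):
    """Format the SHA-1 digest of `data` in the 'sha1$<hex>' notation."""
    return "sha1$" + hashlib.sha1(data).hexdigest()


def cwl_checksum(path):
    """Compute the 'sha1$<hex>' checksum of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return "sha1$" + digest.hexdigest()
```

Calling cwl_checksum("figure.png") in the project folder yields a string that can be compared with the checksum value printed by the runner. The figure is regenerated by matplotlib on every run, so the value on another machine may differ from the one shown above.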