Skip to content

Tool creation

CWL command line tools can be created easily using s4n. The simplest approach is to just add the s4n tool create or s4n run as prefix to the command.

A command line tool consists of a baseCommand which is usually any kind of executable which accepts inputs and writes outputs. In- and Outputs can be of multiple kinds of value type but Files are generally the most used kind. The baseCommand can be a single term like echo or a list like [python, script.py]. All of this is handle by SciWIn Client.

!!! note The following examples assume that they are being executed in a git repository with clean status. If there is no repository yet, use s4n init to create an environment.

A common example is to wrap the echo command for its simplicity. To create the tool echo "Hello World" is prefixed with s4n tool create.

Command
s4n tool create echo "Hello World"
echo.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
inputs:
- id: hello_world
type: string
default: Hello World
inputBinding:
position: 0
outputs: []
baseCommand: echo

The baseCommand was correctly determined as echo and an input slot was created named with the value of the input as slug hello_world. This could be renamed by editing the file, but for now we leave it as is. Currrently the tool is not producing any outputs. Assuming we want to create a file using the echo command which is a common use case a redirection > operator can be used. To not redirect the s4n output it needs to be shielded by a backslash \>. Assuming the following yaml file needs to be created

message: "Hello World"

The usual command would be echo 'message: "Hello World"' > hello.yaml. To create the command line tool the command will be

Command
s4n tool create --name echo2 echo 'message: "Hello World"' \> hello.yaml
echo2.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
inputs:
- id: message_hello_world
type: string
default:
message: Hello World
inputBinding:
position: 0
outputs:
- id: hello
type: File
outputBinding:
glob: hello.yaml
stdout: hello.yaml
baseCommand: echo

The stdout part will tell the tool to redirect output to the file called hello.yaml. Noticed the --name option? This is used to specify the file name of the to be created tool.

A common usecase is to wrap a script in an interpreted language like python or R. Wrapping a python script follows the same principles like shown in the previous example where the echo command was wrapped.

Command
s4n tool create --name echo_python python echo.py --message "SciWIn rocks!" --output-file out.txt
echo.py
import argparse;
parser = argparse.ArgumentParser(description='Echo your input')
parser.add_argument('--message', help='Message to echo', required=True)
parser.add_argument('--output-file', help='File to save the message', required=True)
args = parser.parse_args()
with open(args.output_file, 'w') as f:
f.write(args.message)
print(args.message)
echo_python.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
- class: InitialWorkDirRequirement
listing:
- entryname: echo.py
entry:
$include: '../../echo.py'
- class: InlineJavascriptRequirement
inputs:
- id: message
type: string
default: SciWIn rocks!
inputBinding:
prefix: '--message'
- id: outputfile
type: string
default: out.txt
inputBinding:
prefix: '--output-file'
outputs:
- id: out
type: File
outputBinding:
glob: $(inputs.outputfile)
baseCommand:
- python
- echo.py

As shown in echo_python.cwl the outputBinding for the output file is set to $(inputs.outputfile) and therefore automatically gets the name given in the input. s4n also automatically detects the usage of python and adds the used script as an InitialWorkDirRequirement which makes the script available for the execution engine.

Sometimes it is neccessary to run a highly complicated script on a remote environment because it would take to long on a simple machine. But how to get the CWL file than? In the example python file the script will sleep for 1 minute and then writes a file. One could use the s4n tool create command as shown above and just wait 60 seconds. But what if the calculation takes a week? This is possible for example in quantum chemical calculations like DFT. There is the --no-run flag which tells s4n to not run the script. However this will not create an output and therefore can not detect any output files.

Command
s4n tool create --no-run python sleep.py
sleep.py
from time import sleep
sleep(60)
with open('sleep.txt', 'w') as f:
f.write('I slept for 60 seconds')
sleep.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
- class: InitialWorkDirRequirement
listing:
- entryname: sleep.py
entry:
$include: '../../sleep.py'
inputs: []
outputs: []
baseCommand:
- python
- sleep.py

For this cases there is the possibility to specify outputs via the commandline using the -o or --outputs argument which tells the parser to add a output slot.

Command
s4n tool create --name sleep2 --no-run -o sleep.txt python sleep.py
sleep2.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
- class: InitialWorkDirRequirement
listing:
- entryname: sleep.py
entry:
$include: '../../sleep.py'
inputs: []
outputs:
- id: sleep
type: File
outputBinding:
glob: sleep.txt
baseCommand:
- python
- sleep.py

This CWL file can then be executed remotely by using any runner e.g. cwltool and will write the sleep.txt file after 60 seconds.

Like shown in the above example there is also the possibility to specify inputs explictly. This is needed e.g. if the scripts loads a hardcoded file like in the following example.

Command
s4n tool create -i file.txt -o out.txt python load.py
load.py
with open('file.txt', 'r') as file:
data = file.read()
with open('out.txt', 'w') as out:
out.write(data)
load.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
- class: InitialWorkDirRequirement
listing:
- entryname: file.txt
entry:
$include: '../../file.txt'
- entryname: load.py
entry:
$include: '../../load.py'
inputs: []
outputs:
- id: out
type: File
outputBinding:
glob: out.txt
baseCommand:
- python
- load.py

Using the pipe operator | is a common usecase when using the commandline. Let’s assume the first 5 lines of a file are needed e.g cat speakers.csv | head -n 5 > speakers_5.csv

Command
s4n tool create cat speakers.csv \| head -n 5 \> speakers_5.csv
cat.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
- class: ShellCommandRequirement
inputs:
- id: speakers_csv
type: File
default:
class: File
location: '../../speakers.csv'
inputBinding:
position: 0
outputs:
- id: speakers_5
type: File
outputBinding:
glob: speakers_5.csv
baseCommand: cat
arguments:
- position: 1
valueFrom: '|'
shellQuote: false
- position: 1
valueFrom: head
- position: 2
valueFrom: '-n'
- position: 3
valueFrom: '5'
- position: 4
valueFrom: '>'
- position: 5
valueFrom: speakers_5.csv

For full reproducibility it is recommended to use containers e.g. docker as requirement inside of the CWL files. Adding an existing container image is quite easy. The s4n tool create command needs to be called using -c or --container-image argument. For testing a python script using pandas is used together with the pandas/pandas container.

Command
s4n tool create -c pandas/pandas:pip-all python calculation.py --population population.csv --speakers speakers_revised.csv
calculation.py
import pandas as pd
import argparse
parser = argparse.ArgumentParser(prog="python calculation.py", description="Calculates the percentage of speakers for each language")
parser.add_argument("-p", "--population", required=True, help="Path to the population.csv File")
parser.add_argument("-s", "--speakers", required=True, help="Path to the speakers.csv File")
args = parser.parse_args()
df = pd.read_csv(args.population)
sum = df["population"].sum()
print(f"Total population: {sum}")
df = pd.read_csv(args.speakers)
df["percentage"] = df["speakers"] / sum * 100
df.to_csv("results.csv")
print(df.head(10))
calculation.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
- class: InitialWorkDirRequirement
listing:
- entryname: calculation.py
entry:
$include: '../../calculation.py'
- class: DockerRequirement
dockerPull: pandas/pandas:pip-all
inputs:
- id: population
type: File
default:
class: File
location: '../../population.csv'
inputBinding:
prefix: '--population'
- id: speakers
type: File
default:
class: File
location: '../../speakers_revised.csv'
inputBinding:
prefix: '--speakers'
outputs:
- id: results
type: File
outputBinding:
glob: results.csv
baseCommand:
- python
- calculation.py

When the tool is executed by a runner supporting containerization e.g. cwltool it is using the pandas/pandas:pip-all container to run the script in a reproducible environment.

Using a complex research environments a custom container is may needed. The same example from above will be executed in a container built from a Dockerfile. This can be achieved by using the -c argument with a path to a Dockerfile. A tag can be specified by using -t.

Command
s4n tool create -c Dockerfile -t my-docker python calculation.py --population population.csv --speakers speakers_revised.csv
calculation.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
requirements:
- class: InitialWorkDirRequirement
listing:
- entryname: calculation.py
entry:
$include: '../../calculation.py'
- class: DockerRequirement
dockerFile:
$include: '../../Dockerfile'
dockerImageId: my-docker
inputs:
- id: population
type: File
default:
class: File
location: '../../population.csv'
inputBinding:
prefix: '--population'
- id: speakers
type: File
default:
class: File
location: '../../speakers_revised.csv'
inputBinding:
prefix: '--speakers'
outputs:
- id: results
type: File
outputBinding:
glob: results.csv
baseCommand:
- python
- calculation.py
Dockerfile
FROM python
RUN pip install pandas

A runner will check whether a container with the specified image is existent and build it using the Dockerfile otherwise.