NAME
toil - Toil Documentation

Vivian, J., Rao, A. A., Nothaft, F. A.,
Ketchum, C., Armstrong, J., Novak, A., … Paten, B. (2017). Toil enables
reproducible, open source, big biomedical data analyses. Nature Biotechnology,
35(4), 314–316. http://doi.org/10.1038/nbt.3772
QUICKSTART EXAMPLES
Running a basic workflow
A Toil workflow can be run with just two steps:
- 1.
- Copy and paste the following code block into a new file called helloWorld.py:

from toil.common import Toil
from toil.job import Job


def helloWorld(message, memory="1G", cores=1, disk="1G"):
    return f"Hello, world!, here's a message: {message}"


if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    options.clean = "always"
    with Toil(options) as toil:
        output = toil.start(Job.wrapFn(helloWorld, "You did it!"))
    print(output)

- 2.
- Specify the name of the job store and run the workflow:
python3 helloWorld.py file:my-job-store
Running a basic CWL workflow
The Common Workflow Language (CWL) is an emerging standard for writing workflows that are portable across multiple workflow engines and platforms. Running CWL workflows using Toil is easy.
- 1.
- Copy and paste the following code block into example.cwl:

cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
stdout: output.txt
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  output:
    type: stdout

and this code into example-job.yaml:

message: Hello world!

- 2.
- To run the workflow simply enter
$ toil-cwl-runner example.cwl example-job.yaml
$ cat output.txt
Hello world!
Running a basic WDL workflow
The Workflow Description Language (WDL) is another emerging language for writing workflows that are portable across multiple workflow engines and platforms. WDL support in Toil is still in alpha and should be considered experimental. Toil currently supports basic workflow syntax (see WDL in Toil for more details and examples). Here we go over running a basic WDL helloworld workflow.
- 1.
- Copy and paste the following code block into wdl-helloworld.wdl:

workflow write_simple_file {
  call write_file
}
task write_file {
  String message
  command { echo ${message} > wdl-helloworld-output.txt }
  output { File test = "wdl-helloworld-output.txt" }
}

and this code into wdl-helloworld.json:

{
  "write_simple_file.write_file.message": "Hello world!"
}

- 2.
- To run the workflow simply enter
$ toil-wdl-runner wdl-helloworld.wdl wdl-helloworld.json
$ cat wdl-helloworld-output.txt
Hello world!
A (more) real-world example
For a more detailed example and explanation, we've developed a sample pipeline that merge-sorts a temporary file. This is not meant to be an efficient sorting program, but rather a more fully worked example of what Toil is capable of.
Running the example
- 1.
- Download the example code
- 2.
- Run it with the default settings:
$ python3 sort.py file:jobStore
Delete fileToSort.txt before moving on
to #3. This example introduces options that specify dimensions for
fileToSort.txt, if it does not already exist. If it exists, this
workflow will use the existing file and the results will be the same as
#2.
- 3.
- Run with custom options:
$ python3 sort.py file:jobStore \
    --numLines=5000 \
    --lineLength=10 \
    --overwriteOutput=True \
    --workDir=/tmp/
Describing the source code
To understand the details of what's going on inside, let's start with the main() function. It looks like a lot of code, but don't worry---we'll break it down piece by piece.

def main(options=None):
    if not options:
        # deal with command line arguments
        parser = ArgumentParser()
        Job.Runner.addToilOptions(parser)
        parser.add_argument('--numLines', default=defaultLines, help='Number of lines in file to sort.', type=int)
        parser.add_argument('--lineLength', default=defaultLineLen, help='Length of lines in file to sort.', type=int)
        parser.add_argument("--fileToSort", help="The file you wish to sort")
        parser.add_argument("--outputFile", help="Where the sorted output will go")
        parser.add_argument("--overwriteOutput", help="Write over the output file if it already exists.", default=True)
        parser.add_argument("--N", dest="N",
                            help="The threshold below which a serial sort function is used to sort file. "
                                 "All lines must be of length less than or equal to N or program will fail",
                            default=10000)
        parser.add_argument('--downCheckpoints', action='store_true',
                            help='If this option is set, the workflow will make checkpoints on its way through '
                                 'the recursive "down" part of the sort')
        parser.add_argument("--sortMemory", dest="sortMemory",
                            help="Memory for jobs that sort chunks of the file.",
                            default=None)
        parser.add_argument("--mergeMemory", dest="mergeMemory",
                            help="Memory for jobs that collate results.",
                            default=None)
        options = parser.parse_args()
    if not hasattr(options, "sortMemory") or not options.sortMemory:
        options.sortMemory = sortMemory
    if not hasattr(options, "mergeMemory") or not options.mergeMemory:
        options.mergeMemory = sortMemory

    # do some input verification
    sortedFileName = options.outputFile or "sortedFile.txt"
    if not options.overwriteOutput and os.path.exists(sortedFileName):
        print(f'Output file {sortedFileName} already exists.  '
              f'Delete it to run the sort example again or use --overwriteOutput=True')
        exit()

    fileName = options.fileToSort
    if options.fileToSort is None:
        # make the file ourselves
        fileName = 'fileToSort.txt'
        if os.path.exists(fileName):
            print(f'Sorting existing file: {fileName}')
        else:
            print(f'No sort file specified. Generating one automatically called: {fileName}.')
            makeFileToSort(fileName=fileName, lines=options.numLines, lineLen=options.lineLength)
    else:
        if not os.path.exists(options.fileToSort):
            raise RuntimeError("File to sort does not exist: %s" % options.fileToSort)

    if int(options.N) <= 0:
        raise RuntimeError("Invalid value of N: %s" % options.N)

    # Now we are ready to run
    with Toil(options) as workflow:
        sortedFileURL = 'file://' + os.path.abspath(sortedFileName)
        if not workflow.options.restart:
            sortFileURL = 'file://' + os.path.abspath(fileName)
            sortFileID = workflow.importFile(sortFileURL)
            sortedFileID = workflow.start(Job.wrapJobFn(setup,
                                                        sortFileID,
                                                        int(options.N),
                                                        options.downCheckpoints,
                                                        options=options,
                                                        memory=sortMemory))
        else:
            sortedFileID = workflow.restart()
        workflow.exportFile(sortedFileID, sortedFileURL)

def setup(job, inputFile, N, downCheckpoints, options):
    """
    Sets up the sort.
    Returns the FileID of the sorted file
    """
    RealtimeLogger.info("Starting the merge sort")
    return job.addChildJobFn(down,
                             inputFile, N, 'root',
                             downCheckpoints,
                             options=options,
                             preemptible=True,
                             memory=sortMemory).rv()

def down(job, inputFileStoreID, N, path, downCheckpoints, options, memory=sortMemory):
    """
    Input is a file, a subdivision size N, and a path in the hierarchy of jobs.
    If the range is larger than a threshold N the range is divided recursively and
    a follow on job is then created which merges back the results else
    the file is sorted and placed in the output.
    """
    RealtimeLogger.info("Down job starting: %s" % path)

    # Read the file
    inputFile = job.fileStore.readGlobalFile(inputFileStoreID, cache=False)
    length = os.path.getsize(inputFile)
    if length > N:
        # We will subdivide the file
        RealtimeLogger.critical("Splitting file: %s of size: %s"
                                % (inputFileStoreID, length))
        # Split the file into two copies
        midPoint = getMidPoint(inputFile, 0, length)
        t1 = job.fileStore.getLocalTempFile()
        with open(t1, 'w') as fH:
            fH.write(copySubRangeOfFile(inputFile, 0, midPoint+1))
        t2 = job.fileStore.getLocalTempFile()
        with open(t2, 'w') as fH:
            fH.write(copySubRangeOfFile(inputFile, midPoint+1, length))
        # Call down recursively. By giving the rv() of the two jobs as inputs to the follow-on job, up,
        # we communicate the dependency without hindering concurrency.
        result = job.addFollowOnJobFn(up,
                                      job.addChildJobFn(down, job.fileStore.writeGlobalFile(t1), N, path + '/0',
                                                        downCheckpoints, checkpoint=downCheckpoints, options=options,
                                                        preemptible=True, memory=options.sortMemory).rv(),
                                      job.addChildJobFn(down, job.fileStore.writeGlobalFile(t2), N, path + '/1',
                                                        downCheckpoints, checkpoint=downCheckpoints, options=options,
                                                        preemptible=True, memory=options.mergeMemory).rv(),
                                      path + '/up', preemptible=True, options=options,
                                      memory=options.sortMemory).rv()
    else:
        # We can sort this bit of the file
        RealtimeLogger.critical("Sorting file: %s of size: %s"
                                % (inputFileStoreID, length))
        # Sort the copy and write back to the fileStore
        shutil.copyfile(inputFile, inputFile + '.sort')
        sort(inputFile + '.sort')
        result = job.fileStore.writeGlobalFile(inputFile + '.sort')

    RealtimeLogger.info("Down job finished: %s" % path)
    return result

def up(job, inputFileID1, inputFileID2, path, options, memory=sortMemory):
    """
    Merges the two files and places them in the output.
    """
    RealtimeLogger.info("Up job starting: %s" % path)

    with job.fileStore.writeGlobalFileStream() as (fileHandle, outputFileStoreID):
        fileHandle = codecs.getwriter('utf-8')(fileHandle)
        with job.fileStore.readGlobalFileStream(inputFileID1) as inputFileHandle1:
            inputFileHandle1 = codecs.getreader('utf-8')(inputFileHandle1)
            with job.fileStore.readGlobalFileStream(inputFileID2) as inputFileHandle2:
                inputFileHandle2 = codecs.getreader('utf-8')(inputFileHandle2)
                RealtimeLogger.info("Merging %s and %s to %s"
                                    % (inputFileID1, inputFileID2, outputFileStoreID))
                merge(inputFileHandle1, inputFileHandle2, fileHandle)
        # Clean up the input files - these deletes will occur after the completion is successful.
        job.fileStore.deleteGlobalFile(inputFileID1)
        job.fileStore.deleteGlobalFile(inputFileID2)

        RealtimeLogger.info("Up job finished: %s" % path)

        return outputFileStoreID

if __name__ == '__main__':
    main()

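The split-then-merge structure of down() and up() can be easier to see without the job and file store plumbing. The following is a minimal sketch in plain Python, not part of sort.py: it works on an in-memory list instead of files, uses the number of items (rather than the file size in bytes) as the threshold, and runs the two halves sequentially where Toil would schedule them as independent child jobs. The function bodies here are illustrative assumptions.

```python
import heapq


def down(lines, n):
    """Split `lines` recursively until a chunk has at most `n` items, then sort it."""
    if n < 1:
        # Mirrors the "Invalid value of N" check in main().
        raise ValueError("n must be at least 1")
    if len(lines) > n:
        mid = len(lines) // 2
        # The halves do not depend on each other; in the Toil example each
        # half is a child job and the merge is a follow-on job.
        left = down(lines[:mid], n)
        right = down(lines[mid:], n)
        return up(left, right)
    # Small enough: sort this chunk directly.
    return sorted(lines)


def up(sorted_a, sorted_b):
    """Merge two already-sorted lists into one sorted list."""
    return list(heapq.merge(sorted_a, sorted_b))


if __name__ == "__main__":
    print(down(["pear", "apple", "fig", "kiwi", "date"], n=2))
```

A smaller threshold produces a deeper recursion with more, smaller sorting steps, which is the same trade-off the --N option controls in the example.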
Logging
By default, Toil logs a lot of information related to the current environment in addition to messages from the batch system and jobs. This can be configured with the --logLevel flag. For example, to only log CRITICAL level messages to the screen:

$ python3 sort.py file:jobStore \
    --logLevel=critical \
    --overwriteOutput=True

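As a rough illustration of how such a threshold filters messages, the sketch below maps level names onto Python's standard logging levels and checks whether a message would be shown. It assumes that Toil's level names behave like the standard library's thresholds and that OFF is treated as CRITICAL; the LEVELS table and the shown() helper are hypothetical and not part of Toil.

```python
import logging

# Assumed mapping from level names to standard logging thresholds.
LEVELS = {
    "OFF": logging.CRITICAL,
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def shown(log_level, message_level):
    """Return True if a message at message_level passes the log_level threshold."""
    return LEVELS[message_level.upper()] >= LEVELS[log_level.upper()]


if __name__ == "__main__":
    for level in ("debug", "info", "warning", "error", "critical"):
        print(level, shown("critical", level))
```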
Error Handling and Resuming Pipelines
With Toil, you can recover gracefully from a bug in your pipeline without losing any progress from successfully completed jobs. To demonstrate this, let's add a bug to our example code to see how Toil handles a failure and how we can resume a pipeline after that happens. Add a bad assertion at line 52 of the example (the first line of down()):

def down(job, inputFileStoreID, N, downCheckpoints, memory=sortMemory):
    ...
    assert 1 == 2, "Test error!"

$ python3 sort.py file:jobStore
...
---TOIL WORKER OUTPUT LOG---
...
m/j/jobonrSMP    Traceback (most recent call last):
m/j/jobonrSMP      File "toil/src/toil/worker.py", line 340, in main
m/j/jobonrSMP        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
m/j/jobonrSMP      File "toil/src/toil/job.py", line 1270, in _runner
m/j/jobonrSMP        returnValues = self._run(jobGraph, fileStore)
m/j/jobonrSMP      File "toil/src/toil/job.py", line 1217, in _run
m/j/jobonrSMP        return self.run(fileStore)
m/j/jobonrSMP      File "toil/src/toil/job.py", line 1383, in run
m/j/jobonrSMP        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
m/j/jobonrSMP      File "toil/example.py", line 30, in down
m/j/jobonrSMP        assert 1 == 2, "Test error!"
m/j/jobonrSMP    AssertionError: Test error!

$ python3 sort.py file:jobStore \
    --restart \
    --overwriteOutput=True

$ python3 sort.py file:jobStore \
    --retryCount 2 \
    --restart \
    --overwriteOutput=True

$ python3 sort.py file:jobStore \
    --restart \
    --overwriteOutput=True

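The idea behind resuming, that results of completed jobs are kept so a restart repeats only unfinished work, can be illustrated with a toy model. This is a conceptual sketch only and not how Toil is implemented: the job store is a plain dictionary, jobs are ordinary functions, and run_pipeline() is a hypothetical helper.

```python
def run_pipeline(jobs, job_store):
    """Run (name, function) pairs in order, recording each result in job_store.

    Jobs whose names are already recorded are skipped, so a rerun after a
    failure repeats only the work that did not finish. Returns the names of
    the jobs that were actually executed in this call.
    """
    executed = []
    for name, fn in jobs:
        if name in job_store:
            continue
        # If fn() raises, results recorded earlier stay in job_store.
        job_store[name] = fn()
        executed.append(name)
    return executed


if __name__ == "__main__":
    store = {}
    state = {"bug": True}

    def split():
        return "split done"

    def sort_chunk():
        if state["bug"]:
            raise AssertionError("Test error!")
        return "sorted"

    jobs = [("split", split), ("sort", sort_chunk)]
    try:
        run_pipeline(jobs, store)
    except AssertionError:
        pass
    state["bug"] = False  # fix the bug, then "restart"
    print(run_pipeline(jobs, store))
```

In the second call only the previously failed job runs, because the result of the first job was preserved.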
Collecting Statistics
Please see the Stats Command section for more on gathering runtime and resource info on jobs.
Launching a Toil Workflow in AWS
After having installed the aws extra for Toil during the Installation and set up AWS (see Preparing your AWS environment), the user can run the basic helloWorld.py script (Running a basic workflow) on a VM in AWS just by modifying the run command.
- 1.
- Launch a cluster in AWS using the Launch-Cluster Command command:
$ toil launch-cluster <cluster-name> \
    --keyPairName <AWS-key-pair-name> \
    --leaderNodeType t2.medium \
    --zone us-west-2a
- 2.
- Copy helloWorld.py to the /tmp directory on the leader node using the Rsync-Cluster Command command:
$ toil rsync-cluster --zone us-west-2a <cluster-name> helloWorld.py :/tmp
- 3.
- Log in to the cluster leader node using the Ssh-Cluster Command command:
$ toil ssh-cluster --zone us-west-2a <cluster-name>
- 4.
- Run the Toil script in the cluster:
$ python3 /tmp/helloWorld.py aws:us-west-2:my-S3-bucket
- 5.
- Exit from the SSH connection.
$ exit
- 6.
- Use the Destroy-Cluster Command command to destroy the cluster:
$ toil destroy-cluster --zone us-west-2a <cluster-name>
Running a CWL Workflow on AWS
After having installed the aws and cwl extras for Toil during the Installation and set up AWS (see Preparing your AWS environment), the user can run a CWL workflow with Toil on AWS.
- 1.
- First launch a node in AWS using the Launch-Cluster Command command:
$ toil launch-cluster <cluster-name> \
    --keyPairName <AWS-key-pair-name> \
    --leaderNodeType t2.medium \
    --zone us-west-2a
- 2.
- Copy example.cwl and example-job.yaml from the CWL example to the node using the Rsync-Cluster Command command:
toil rsync-cluster --zone us-west-2a <cluster-name> example.cwl :/tmp
toil rsync-cluster --zone us-west-2a <cluster-name> example-job.yaml :/tmp
- 3.
- SSH into the cluster's leader node using the Ssh-Cluster Command utility:
$ toil ssh-cluster --zone us-west-2a <cluster-name>
- 4.
- Once on the leader node, it's a good idea to update and install the following:
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get -y dist-upgrade
sudo apt-get -y install git
sudo pip install mesos.cli
- 5.
- Now create a new virtualenv with the --system-site-packages option and activate:
virtualenv --system-site-packages venv
source venv/bin/activate
- 6.
- Now run the CWL workflow:
(venv) $ toil-cwl-runner \
    --provisioner aws \
    --jobStore aws:us-west-2:any-name \
    /tmp/example.cwl /tmp/example-job.yaml
When running a CWL workflow on AWS, input
files can be provided either on the local file system or in S3 buckets using
s3:// URI references. Final output files will be copied to the local
file system of the leader node.
- 7.
- Finally, log out of the leader node and from your local computer, destroy the cluster:
$ toil destroy-cluster --zone us-west-2a <cluster-name>
Running a Workflow with Autoscaling - Cactus
Cactus is a reference-free, whole-genome multiple alignment program that can be run on any of the cloud platforms Toil supports.
Cloud Independence:
This example provides a "cloud agnostic" view of running Cactus with
Toil. Most options will not change between cloud providers. However, each
provisioner has unique inputs for --leaderNodeType, --nodeType
and --zone. We recommend the following:
When executing toil launch-cluster with gce specified for
--provisioner, the option --boto must be specified and given a
path to your .boto file. See Running in Google Compute Engine (GCE) for
more information about the --boto option.
Option           | Used in        | AWS        | GCE           |
--leaderNodeType | launch-cluster | t2.medium  | n1-standard-1 |
--zone           | launch-cluster | us-west-2a | us-west1-a    |
--zone           | cactus         | us-west-2  |               |
--nodeType       | cactus         | c3.4xlarge | n1-standard-8 |
- 1.
- Download pestis.tar.gz
- 2.
- Launch a leader node using the Launch-Cluster Command command:
(venv) $ toil launch-cluster <cluster-name> \
    --provisioner <aws, gce> \
    --keyPairName <key-pair-name> \
    --leaderNodeType <type> \
    --zone <zone>
A Helpful Tip
When using AWS, setting the TOIL_AWS_ZONE environment variable eliminates
having to specify the --zone option for each command. This will be supported
for GCE in the future.
$ export TOIL_AWS_ZONE=us-west-2c
- 3.
- Create an appropriate directory for uploading files:
$ toil ssh-cluster --provisioner <aws, gce> <cluster-name>
$ mkdir /root/cact_ex
$ exit
- 4.
- Copy the required files, i.e., seqFile.txt (a text file containing the locations of the input sequences as well as their phylogenetic tree, see here), organisms' genome sequence files in FASTA format, and configuration files (e.g. blockTrim1.xml, if desired), up to the leader node:
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> pestis-short-aws-seqFile.txt :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000169655.1_ASM16965v1_genomic.fna :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000006645.1_ASM664v1_genomic.fna :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000182485.1_ASM18248v1_genomic.fna :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000013805.1_ASM1380v1_genomic.fna :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> setup_leaderNode.sh :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> blockTrim1.xml :/root/cact_ex
$ toil rsync-cluster --provisioner <aws, gce> <cluster-name> blockTrim3.xml :/root/cact_ex
- 5.
- Log in to the leader node:
$ toil ssh-cluster --provisioner <aws, gce> <cluster-name>
- 6.
- Set up the environment of the leader node to run Cactus:
$ bash /root/cact_ex/setup_leaderNode.sh
$ source cact_venv/bin/activate
(cact_venv) $ cd cactus
(cact_venv) $ pip install --upgrade .
- 7.
- Run Cactus as an autoscaling workflow:
(cact_venv) $ TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.14.0 cactus \
    --provisioner <aws, gce> \
    --nodeType <type> \
    --maxNodes 2 \
    --minNodes 0 \
    --retry 10 \
    --batchSystem mesos \
    --logDebug \
    --logFile /logFile_pestis3 \
    --configFile /root/cact_ex/blockTrim3.xml \
    <aws, google>:<zone>:cactus-pestis \
    /root/cact_ex/pestis-short-aws-seqFile.txt \
    /root/cact_ex/pestis_output3.hal
Pieces of the Puzzle:
TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:3.14.0 --- specifies the
version of Toil being used, 3.14.0; to use the latest version instead,
omit this variable.
--nodeType --- determines the instance type used for worker nodes. The
instance type specified here must be on the same cloud provider as the one
specified with --leaderNodeType
--maxNodes 2 --- creates up to two instances of the type specified with
--nodeType and launches Mesos worker containers inside them.
--logDebug --- equivalent to --logLevel DEBUG.
--logFile /logFile_pestis3 --- writes logs to a file named
logFile_pestis3 in the / folder.
--configFile --- optional; needed only when a specific configuration
file should be used to run the alignment.
<aws, google>:<zone>:cactus-pestis --- creates a bucket,
named cactus-pestis, with the specified cloud provider to store
intermediate job files and metadata. NOTE: If you want to use a
GCE-based jobstore, specify google here, not gce.
The result file, named pestis_output3.hal, is stored in the
/root/cact_ex folder of the leader node.
Use cactus --help to see all the Cactus and Toil flags available.
- 8.
- Log out of the leader node:
(cact_venv) $ exit
- 9.
- Download the resulting output to your local machine:
(venv) $ toil rsync-cluster \
    --provisioner <aws, gce> <cluster-name> \
    :/root/cact_ex/pestis_output3.hal \
    <path-of-folder-on-local-machine>
- 10.
- Destroy the cluster:
(venv) $ toil destroy-cluster --provisioner <aws, gce> <cluster-name>
INTRODUCTION
Toil runs in various environments, including locally and in the cloud (Amazon Web Services and Google Compute Engine). Toil also supports two DSLs: CWL and WDL (experimental).
- •
- Job Store API: A filepath or url that can host and centralize all files for a workflow (e.g. a local folder, or an AWS s3 bucket url).
- •
- Batch System API: Specifies either a local single-machine or a currently supported HPC environment (lsf, parasol, mesos, slurm, torque, htcondor, kubernetes, or grid_engine). Mesos is a special case, and is launched for cloud environments.
- •
- Provisioner: For running in the cloud only. This specifies which cloud provider provides instances to do the "work" of your workflow.
Job Store
The job store is a storage abstraction which contains all of the information used in a Toil run. This centralizes all of the files used by jobs in the workflow and also the details of the progress of the run. If a workflow crashes or fails, the job store contains all of the information necessary to resume with minimal repetition of work.
File Job Store
The file job store is for use locally, and keeps the workflow information in a directory on the machine where the workflow is launched. This is the simplest and most convenient job store for testing or for small runs.
Cloud Job Stores
Toil currently supports the following cloud storage systems as job stores:
- •
- AWS Job Store: An AWS S3 bucket formatted as "aws:<region>:<bucketname>" where only numbers, letters, and dashes are allowed in the bucket name. Example: aws:us-west-2:my-aws-jobstore-name.
- •
- Google Job Store: A Google Cloud Storage bucket formatted as "google:<projectID>:<bucketname>" where only numbers, letters, and dashes are allowed in the bucket name. Example: google:<projectID>:my-google-jobstore-name.
Batch System
A Toil batch system is either a local single-machine (one computer) or a currently supported HPC cluster of computers (lsf, parasol, mesos, slurm, torque, htcondor, or grid_engine). Mesos is a special case, and is launched for cloud environments. These environments manage individual worker nodes under a leader node to process the work required in a workflow. The leader and its workers all coordinate their tasks and files through a centralized job store location.
Provisioner
The Toil provisioner provides a tool set for running a Toil workflow on a particular cloud platform.
COMMANDLINE OPTIONS
A quick way to see all of Toil's commandline options is by executing the following on a toil script:
$ toil example.py --help
The Job Store
Running toil scripts requires a filepath or url to a centralizing location for all of the files of the workflow. This is Toil's one required positional argument: the job store. To use the quickstart example, if you're on a node that has a large /scratch volume, you can specify that the jobstore be created there by executing: python3 helloWorld.py /scratch/my-job-store, or more explicitly, python3 helloWorld.py file:/scratch/my-job-store.
Local: file:job-store-name
AWS: aws:region-here:job-store-name
Google: google:projectID-here:job-store-name
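The three locator forms above can be told apart by their prefix. The sketch below is a hypothetical helper, not Toil's own parser: parse_job_store() and its return shape are assumptions made for illustration. It treats a string without a recognized prefix as a local path, matching the /scratch/my-job-store example.

```python
def parse_job_store(locator):
    """Split a job store locator into (kind, parts).

    kind is "file", "aws", or "google". parts is a 1-tuple holding the path
    for file job stores, or (region_or_project, name) for cloud job stores.
    """
    kind, sep, rest = locator.partition(":")
    if not sep or kind not in ("file", "aws", "google"):
        # No recognized prefix: treat the whole string as a local path.
        return ("file", (locator,))
    if kind == "file":
        return ("file", (rest,))
    place, sep, name = rest.partition(":")
    if not sep or not place or not name:
        raise ValueError(
            f"expected {kind}:<region-or-project>:<name>, got {locator!r}"
        )
    return (kind, (place, name))


if __name__ == "__main__":
    for text in ("/scratch/my-job-store",
                 "file:/scratch/my-job-store",
                 "aws:us-west-2:my-aws-jobstore-name"):
        print(text, "->", parse_job_store(text))
```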
Commandline Options
Core Toil Options
Options to specify the location of the Toil workflow and turn on stats collation about the performance of jobs.
- --workDir WORKDIR
- Absolute path to directory where temporary files generated during the Toil run should be placed. Standard output and error from batch system jobs (unless --noStdOutErr) will be placed in this directory. A cache directory may be placed in this directory. Temp files and folders will be placed in a toil-<workflowID> within workDir. The workflowID is generated by Toil and will be reported in the workflow logs. Default is determined by the variables (TMPDIR, TEMP, TMP) via mkdtemp. This directory needs to exist on all machines running jobs; if capturing standard output and error from batch system jobs is desired, it will generally need to be on a shared file system. When sharing a cache between containers on a host, this directory must be shared between the containers.
- --coordinationDir COORDINATION_DIR
- Absolute path to directory where Toil will keep state and lock files. When sharing a cache between containers on a host, this directory must be shared between the containers.
- --noStdOutErr
- Do not capture standard output and error from batch system jobs.
- --stats
- Records statistics about the toil workflow to be used by 'toil stats'.
- --clean=STATE
- Determines the deletion of the jobStore upon completion of the program. Choices: 'always', 'onError', 'never', or 'onSuccess'. The --stats option requires information from the jobStore upon completion, so the jobStore will never be deleted with that flag. If you wish to be able to restart the run, choose 'never' or 'onSuccess'. Default is 'never' if stats is enabled, and 'onSuccess' otherwise.
- --cleanWorkDir STATE
- Determines deletion of temporary worker directory upon completion of a job. Choices: 'always', 'onError', 'never', or 'onSuccess'. Default = always. WARNING: This option should be changed for debugging only. Running a full pipeline with this option could fill your disk with intermediate data.
- --clusterStats FILEPATH
- If enabled, writes out JSON resource usage statistics to a file. The default location for this file is the current working directory, but an absolute path can also be passed to specify where this file should be written. This option only applies when using scalable batch systems.
- --restart
- If --restart is specified, Toil will attempt to restart the existing workflow at the location pointed to by the --jobStore option. An exception is raised if the workflow does not exist.
- --logOff
- Only CRITICAL log levels are shown. Equivalent to --logLevel=OFF or --logLevel=CRITICAL.
- --logCritical
- Only CRITICAL log levels are shown. Equivalent to --logLevel=OFF or --logLevel=CRITICAL.
- --logError
- Only ERROR, and CRITICAL log levels are shown. Equivalent to --logLevel=ERROR.
- --logWarning
- Only WARN, ERROR, and CRITICAL log levels are shown. Equivalent to --logLevel=WARNING.
- --logInfo
- All log statements are shown, except DEBUG. Equivalent to --logLevel=INFO.
- --logDebug
- All log statements are shown. Equivalent to --logLevel=DEBUG.
- --logLevel=LOGLEVEL
- May be set to: OFF (or CRITICAL), ERROR, WARN (or WARNING), INFO, or DEBUG.
- --logFile FILEPATH
- Specifies a file path to write the logging output to.
- --rotatingLogging
- Turn on rotating logging, which prevents log files from getting too big (set using --maxLogFileSize BYTESIZE).
- --maxLogFileSize BYTESIZE
- The maximum size of a job log file to keep, in bytes. Log files larger than this will be truncated to the last X bytes. Setting this option to zero will prevent any truncation. Setting this option to a negative value will truncate from the beginning. Default=62.5KiB. Also sets the maximum log file size in bytes when --rotatingLogging is active.
- --log-dir DIRPATH
- For CWL and local file system only. Log stdout and stderr (if tool requests stdout/stderr) to the DIRPATH.
- --batchSystem BATCHSYSTEM
- The type of batch system to run the job(s) with, currently can be one of aws_batch, parasol, single_machine, grid_engine, lsf, mesos, slurm, tes, torque, htcondor, kubernetes. (default: single_machine)
- --disableAutoDeployment
- Should auto-deployment of the user script be deactivated? If True, the user script/package should be present at the same location on all workers. Default = False.
- --maxLocalJobs MAXLOCALJOBS
- For batch systems that support a local queue for housekeeping jobs (Mesos, GridEngine, htcondor, lsf, slurm, torque). Specifies the maximum number of these housekeeping jobs to run concurrently on the local system. The default is equal to the number of cores.
- --manualMemArgs
- Do not add the default arguments: 'hv=MEMORY' & 'h_vmem=MEMORY' to the qsub call, and instead rely on TOIL_GRIDENGINE_ARGS to supply alternative arguments. Requires that TOIL_GRIDENGINE_ARGS be set.
- --runCwlInternalJobsOnWorkers
- Whether to run CWL internal jobs (e.g. CWLScatter) on the worker nodes instead of the primary node. If false (default), then all such jobs are run on the primary node. Setting this to true can speed up the pipeline for very large workflows with many sub-workflows and/or scatters, provided that the worker pool is large enough.
- --coalesceStatusCalls
- Coalesce status calls to prevent the batch system from being overloaded. Currently only supported for LSF.
- --statePollingWait STATEPOLLINGWAIT
- Time, in seconds, to wait before doing a scheduler query for job state. Return cached results if within the waiting period. Only works for grid engine batch systems such as gridengine, htcondor, torque, slurm, and lsf.
- --parasolCommand PARASOLCOMMAND
- The name or path of the parasol program. Will be looked up on PATH unless it starts with a slash. (default: parasol)
- --parasolMaxBatches PARASOLMAXBATCHES
- Maximum number of job batches the Parasol batch is allowed to create. One batch is created for jobs with a unique set of resource requirements. (default: 1000)
- --mesosEndpoint MESOSENDPOINT
- The host and port of the Mesos server separated by a colon. (default: <leader IP>:5050)
- --kubernetesHostPath KUBERNETES_HOST_PATH
- Path on Kubernetes hosts to use as shared inter-pod temp directory.
- --kubernetesOwner KUBERNETES_OWNER
- Username to mark Kubernetes jobs with.
- --kubernetesServiceAccount KUBERNETES_SERVICE_ACCOUNT
- Service account to run jobs as.
- --kubernetesPodTimeout KUBERNETES_POD_TIMEOUT
- Seconds to wait for a scheduled Kubernetes pod to start running. (default: 120s)
- --tesEndpoint TES_ENDPOINT
- The http(s) URL of the TES server. (default: http://<leader IP>:8000)
- --tesUser TES_USER
- User name to use for basic authentication to TES server.
- --tesPassword TES_PASSWORD
- Password to use for basic authentication to TES server.
- --tesBearerToken TES_BEARER_TOKEN
- Bearer token to use for authentication to TES server.
- --awsBatchRegion AWS_BATCH_REGION
- The AWS region containing the AWS Batch queue to submit to.
- --awsBatchQueue AWS_BATCH_QUEUE
- The name or ARN of the AWS Batch queue to submit to.
- --awsBatchJobRoleArn AWS_BATCH_JOB_ROLE_ARN
- The ARN of an IAM role to run AWS Batch jobs as, so they can e.g. access a job store. Must be assumable by ecs-tasks.amazonaws.com
- --scale SCALE
- A scaling factor applied to the number of cores requested by every submitted task. Used in the single_machine batch system. Useful for running workflows on smaller machines than they were designed for, by setting a value less than 1. (default: 1)
- --linkImports
- When using a filesystem based job store, CWL input files are by default symlinked in. Specifying this option instead copies the files into the job store, which may protect them from being modified externally. When not specified and as long as caching is enabled, Toil will protect the file automatically by changing the permissions to read-only.
- --moveExports
- When using a filesystem based job store, output files are by default moved to the output directory, and a symlink to the moved exported file is created at the initial location. Specifying this option instead copies the files into the output directory. Applies to filesystem-based job stores only.
- --disableCaching
- Disables caching in the file store. This flag must be set to use a batch system that does not support cleanup, such as Parasol.
- --caching BOOL
- Set caching options. This must be set to "false" to use a batch system that does not support cleanup, such as Parasol. Set to "true" if caching is desired.
- --provisioner CLOUDPROVIDER
- The provisioner for cluster auto-scaling. This is the main Toil --provisioner option, and defaults to None for running on single_machine and non-auto-scaling batch systems. The currently supported choices are 'aws' or 'gce'.
- --nodeTypes NODETYPES
- Specifies a list of comma-separated node types, each of which is composed of slash-separated instance types, and an optional spot bid set off by a colon, making the node type preemptible. Instance types may appear in multiple node types, and the same node type may appear as both preemptible and non-preemptible.
- Valid argument specifying two node types:
- c5.4xlarge/c5a.4xlarge:0.42,t2.large
- Node types:
- c5.4xlarge/c5a.4xlarge:0.42 and t2.large
- Instance types:
- c5.4xlarge, c5a.4xlarge, and t2.large
- Semantics:
- Bid $0.42/hour for either c5.4xlarge or c5a.4xlarge instances, treated interchangeably, while they are available at that price, and buy t2.large instances at full price
- --minNodes MINNODES
- Minimum number of nodes of each type in the cluster, if using auto-scaling. This should be provided as a comma-separated list of the same length as the list of node types. default=0
- --maxNodes MAXNODES
- Maximum number of nodes of each type in the cluster, if using autoscaling, provided as a comma-separated list. The first value is used as a default if the list length is less than the number of nodeTypes. default=10
- --targetTime TARGETTIME
- Sets how rapidly you aim to complete jobs in seconds. Shorter times mean more aggressive parallelization. The autoscaler attempts to scale up/down so that it expects all queued jobs will complete within targetTime seconds. (Default: 1800)
- --betaInertia BETAINERTIA
- A smoothing parameter to prevent unnecessary oscillations in the number of provisioned nodes. This controls an exponentially weighted moving average of the estimated number of nodes. A value of 0.0 disables any smoothing, and a value of 0.9 will smooth so much that few changes will ever be made. Must be between 0.0 and 0.9. (Default: 0.1)
- --scaleInterval SCALEINTERVAL
- The interval (seconds) between assessing if the scale of the cluster needs to change. (Default: 60)
- --preemptibleCompensation PREEMPTIBLECOMPENSATION
- The preference of the autoscaler to replace preemptible nodes with non-preemptible nodes, when preemptible nodes cannot be started for some reason. Defaults to 0.0. This value must be between 0.0 and 1.0, inclusive. A value of 0.0 disables such compensation, and a value of 0.5 compensates two missing preemptible nodes with one non-preemptible node. A value of 1.0 replaces every missing preemptible node with a non-preemptible one.
- --nodeStorage NODESTORAGE
- Specify the size, in gigabytes, of the root volume of worker nodes when they are launched. You may want to set this if your jobs require a lot of disk space. The default value is 50.
- --nodeStorageOverrides NODESTORAGEOVERRIDES
- Comma-separated list of nodeType:nodeStorage that are used to override the default value from --nodeStorage for the specified nodeType(s). This is useful for heterogeneous jobs where some tasks require much more disk than others.
- --metrics
- Enable the prometheus/grafana dashboard for monitoring CPU/RAM usage, queue size, and issued jobs.
- --assumeZeroOverhead
- Ignore scheduler and OS overhead and assume jobs can use every last byte of memory and disk on a node when autoscaling.
- --maxServiceJobs MAXSERVICEJOBS
- The maximum number of service jobs that can be run concurrently, excluding service jobs running on preemptible nodes. default=9223372036854775807
- --maxPreemptibleServiceJobs MAXPREEMPTIBLESERVICEJOBS
- The maximum number of service jobs that can run concurrently on preemptible nodes. default=9223372036854775807
- --deadlockWait DEADLOCKWAIT
- Time, in seconds, to tolerate the workflow running only the same service jobs, with no jobs to use them, before declaring the workflow to be deadlocked and stopping. default=60
- --deadlockCheckInterval DEADLOCKCHECKINTERVAL
- Time, in seconds, to wait between checks to see if the workflow is stuck running only service jobs, with no jobs to use them. Should be shorter than --deadlockWait. May need to be increased if the batch system cannot enumerate running jobs quickly enough, or if polling for running jobs is placing an unacceptable load on a shared cluster. default=30
- --defaultMemory INT
- The default amount of memory to request for a job. Only applicable to jobs that do not specify an explicit value for this requirement. Standard suffixes like K, Ki, M, Mi, G or Gi are supported. Default is 2.0G
- --defaultCores FLOAT
- The default number of CPU cores to dedicate to a job. Only applicable to jobs that do not specify an explicit value for this requirement. Fractions of a core (for example 0.1) are supported on some batch systems, namely Mesos and singleMachine. Default is 1.0
- --defaultDisk INT
- The default amount of disk space to dedicate to a job. Only applicable to jobs that do not specify an explicit value for this requirement. Standard suffixes like K, Ki, M, Mi, G or Gi are supported. Default is 2.0G
- --defaultAccelerators ACCELERATOR
- The default amount of accelerators to request for a job. Only applicable to jobs that do not specify an explicit value for this requirement. Each accelerator specification can have a type (gpu [default], nvidia, amd, cuda, rocm, opencl, or a specific model like nvidia-tesla-k80), and a count [default: 1]. If both a type and a count are used, they must be separated by a colon. If multiple types of accelerators are used, the specifications are separated by commas. Default is [].
- --defaultPreemptible BOOL
- Make all jobs able to run on preemptible (spot) nodes by default.
- --maxCores INT
- The maximum number of CPU cores to request from the batch system at any one time. Standard suffixes like K, Ki, M, Mi, G or Gi are supported.
- --maxMemory INT
- The maximum amount of memory to request from the batch system at any one time. Standard suffixes like K, Ki, M, Mi, G or Gi are supported.
- --maxDisk INT
- The maximum amount of disk space to request from the batch system at any one time. Standard suffixes like K, Ki, M, Mi, G or Gi are supported.
- --retryCount RETRYCOUNT
- Number of times to retry a failing job before giving up and labeling job failed. default=1
- --enableUnlimitedPreemptibleRetries
- If set, preemptible failures (or any failure due to an instance getting unexpectedly terminated) will not count towards job failures and --retryCount.
- --doubleMem
- If set, batch jobs which die due to reaching memory limit on batch schedulers will have their memory doubled and they will be retried. The remaining retry count will be reduced by 1. Currently only supported by LSF. default=False.
- --maxJobDuration MAXJOBDURATION
- Maximum runtime of a job (in seconds) before we kill it (this is a lower bound, and the actual time before killing the job may be longer).
- --rescueJobsFrequency RESCUEJOBSFREQUENCY
- Period of time to wait (in seconds) between checking for missing/overlong jobs, that is jobs which get lost by the batch system. Expert parameter.
- --maxLogFileSize MAXLOGFILESIZE
- The maximum size of a job log file to keep (in bytes), log files larger than this will be truncated to the last X bytes. Setting this option to zero will prevent any truncation. Setting this option to a negative value will truncate from the beginning. Default=62.5 K
- --writeLogs FILEPATH
- Write worker logs received by the leader into their own files at the specified path. Any non-empty standard output and error from failed batch system jobs will also be written into files at this path. The current working directory will be used if a path is not specified explicitly. Note: By default only the logs of failed jobs are returned to the leader. Set the log level to 'debug' or enable --writeLogsFromAllJobs to get logs back from successful jobs, and adjust --maxLogFileSize to control the truncation limit for worker logs.
- --writeLogsGzip FILEPATH
- Identical to --writeLogs except the log files are gzipped on the leader.
- --writeMessages FILEPATH
- File to send messages from the leader's message bus to.
- --realTimeLogging
- Enable real-time logging from workers to leader.
- --disableChaining
- Disables chaining of jobs (chaining uses one job's resource allocation for its successor job if possible).
- --disableJobStoreChecksumVerification
- Disables checksum verification for files transferred to/from the job store. Checksum verification is a safety check to ensure the data is not corrupted during transfer. Currently only supported for non-streaming AWS files
- --sseKey SSEKEY
- Path to file containing 32 character key to be used for server-side encryption on awsJobStore or googleJobStore. SSE will not be used if this flag is not passed.
- --setEnv NAME, -e NAME
- Set an environment variable early on in the worker. The argument may be given as NAME=VALUE or as NAME alone, with either --setEnv or -e. If VALUE is omitted, it will be looked up in the current environment. Independently of this option, the worker will try to emulate the leader's environment before running a job, except for some variables known to vary across systems. Using this option, a variable can be injected into the worker process itself before it is started.
- --servicePollingInterval SERVICEPOLLINGINTERVAL
- Interval of time service jobs wait between polling for the existence of the keep-alive flag (default=60)
- --forceDockerAppliance
- Disables sanity checking the existence of the docker image specified by TOIL_APPLIANCE_SELF, which Toil uses to provision mesos for autoscaling.
- --statusWait INT
- Seconds to wait between reports of running jobs. (default=3600)
- --disableProgress
- Disables the progress bar shown when standard error is a terminal.
- --debugWorker
- Experimental no forking mode for local debugging. Specifically, workers are not forked and stderr/stdout are not redirected to the log. (default=False)
- --disableWorkerOutputCapture
- Let worker output go to worker's standard out/error instead of per-job logs.
- --badWorker BADWORKER
- For testing purposes, randomly kill --badWorker proportion of jobs using SIGKILL. (Default: 0.0)
- --badWorkerFailInterval BADWORKERFAILINTERVAL
- When killing the job, pick uniformly within the interval from 0.0 to --badWorkerFailInterval seconds after the worker starts. (Default: 0.01)
- --kill_polling_interval KILL_POLLING_INTERVAL
- Interval of time (in seconds) the leader waits between polling for the kill flag inside the job store set by the "toil kill" command. (default=5)
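The --nodeTypes format described above (comma-separated node types, slash-separated instance types, and an optional colon-delimited spot bid) can be illustrated with a small parser. This is a minimal sketch for reading such strings, not Toil's own parsing code; NodeType and parse_node_types are hypothetical names introduced here for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class NodeType(NamedTuple):
    """One entry of a --nodeTypes list (hypothetical helper type)."""
    instance_types: Tuple[str, ...]  # instance types treated interchangeably
    spot_bid: Optional[float]        # dollars per hour; None means full price


def parse_node_types(spec: str) -> List[NodeType]:
    """Split a --nodeTypes string into its node types."""
    node_types = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            raise ValueError(f"empty node type in {spec!r}")
        # An optional ":bid" suffix makes the node type preemptible.
        names, sep, bid = entry.partition(":")
        spot_bid = float(bid) if sep else None
        instance_types = tuple(names.split("/"))
        if any(not name for name in instance_types):
            raise ValueError(f"empty instance type in {entry!r}")
        node_types.append(NodeType(instance_types, spot_bid))
    return node_types


for node_type in parse_node_types("c5.4xlarge/c5a.4xlarge:0.42,t2.large"):
    kind = "preemptible" if node_type.spot_bid is not None else "non-preemptible"
    print(node_type.instance_types, node_type.spot_bid, kind)
```

With the example string from the --nodeTypes entry, the first node type carries two interchangeable instance types and a bid, and the second carries one instance type and no bid.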
Restart Option
In the event of failure, Toil can resume the pipeline by adding the argument --restart and rerunning the Python script. Toil pipelines (but not CWL pipelines) can even be edited and resumed, which is useful for development or troubleshooting.
Running Workflows with Services
Toil supports jobs, or clusters of jobs, that run as services to other accessor jobs. Example services include server databases or Apache Spark clusters. Because service jobs exist to provide services to accessor jobs, their runtime depends on their accessor jobs running concurrently. These dependencies can create deadlocks, where the workflow hangs because only service jobs are running and their accessor jobs cannot be scheduled: there are too few resources to run both at once. Toil attempts to schedule services and accessors intelligently to cope with this situation. However, to avoid a deadlock in workflows that run service jobs, it is advisable to use the following parameters:
- •
- --maxServiceJobs: The maximum number of service jobs that can be run concurrently, excluding service jobs running on preemptible nodes.
- •
- --maxPreemptibleServiceJobs: The maximum number of service jobs that can run concurrently on preemptible nodes.
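As a rough illustration of how these limits might be passed to a workflow script, the sketch below assembles the relevant command-line arguments and checks the documented advice that --deadlockCheckInterval should be shorter than --deadlockWait. The helper name service_job_args and the script name workflow.py are hypothetical; only the option names come from the option reference.

```python
from typing import List


def service_job_args(max_service_jobs: int,
                     max_preemptible_service_jobs: int,
                     deadlock_wait: int = 60,
                     deadlock_check_interval: int = 30) -> List[str]:
    """Build command-line arguments that cap concurrent service jobs."""
    if min(max_service_jobs, max_preemptible_service_jobs) < 0:
        raise ValueError("service job limits must not be negative")
    # The option reference says the check interval should be shorter
    # than the deadlock wait.
    if deadlock_check_interval >= deadlock_wait:
        raise ValueError("deadlockCheckInterval should be shorter than deadlockWait")
    return [
        "--maxServiceJobs", str(max_service_jobs),
        "--maxPreemptibleServiceJobs", str(max_preemptible_service_jobs),
        "--deadlockWait", str(deadlock_wait),
        "--deadlockCheckInterval", str(deadlock_check_interval),
    ]


# workflow.py is a placeholder for your own Toil script.
command = ["python3", "workflow.py", "file:my-job-store"] + service_job_args(4, 2)
print(" ".join(command))
```

Capping service jobs below the number the cluster can hold leaves room for the accessor jobs that depend on them.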
Setting Options directly with the Toil Script
It's good to remember that command-line options can be overridden in the Toil script itself. For example, toil.job.Job.Runner.getDefaultOptions() can be used to run Toil with all default options. In this example, it overrides any command-line arguments, running with the default options and always using the "./toilWorkflow" directory as the job store:
    from toil.common import Toil
    from toil.job import Job

    options = Job.Runner.getDefaultOptions("./toilWorkflow")  # Get the options object
    with Toil(options) as toil:
        toil.start(Job())  # Run the script

    options = Job.Runner.getDefaultOptions("./toilWorkflow")  # Get the options object
    options.logLevel = "DEBUG"  # Set the log level to the debug level.
    options.clean = "always"  # Always delete the jobStore after a run
    with Toil(options) as toil:
        toil.start(Job())  # Run the script

    parser = Job.Runner.getDefaultArgumentParser()  # Get the parser
    options = parser.parse_args()  # Parse user args to create the options object
    with Toil(options) as toil:
        toil.start(Job())  # Run the script

    parser = Job.Runner.getDefaultArgumentParser()  # Get the parser
    options = parser.parse_args()  # Parse user args to create the options object
    options.logLevel = "DEBUG"  # Set the log level to the debug level.
    options.clean = "always"  # Always delete the jobStore after a run
    with Toil(options) as toil:
        toil.start(Job())  # Run the script
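The reason script-level assignments win is simply ordering: the options object is a parsed namespace, and anything assigned after parse_args() replaces what the user typed. The sketch below shows this with a plain argparse parser standing in for Toil's; the parser and the helper parse_with_overrides are illustrative stand-ins, not Toil's real parser or defaults.

```python
import argparse
from typing import List


def parse_with_overrides(argv: List[str]) -> argparse.Namespace:
    """Parse argv, then force two settings regardless of what was passed."""
    # Stand-in parser: only mimics the option names used above.
    parser = argparse.ArgumentParser()
    parser.add_argument("jobStore")
    parser.add_argument("--logLevel", default=None)
    parser.add_argument("--clean", default=None)
    options = parser.parse_args(argv)
    # Assignments made after parsing replace the command-line values.
    options.logLevel = "DEBUG"
    options.clean = "always"
    return options


options = parse_with_overrides(["file:my-job-store", "--logLevel", "INFO"])
print(options.jobStore, options.logLevel, options.clean)
```

Values the script does not touch, such as the job store location here, still come from the command line.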
TOIL DEBUGGING
Toil has a number of tools to assist in debugging. Here we provide help in working through potential problems that a user might encounter in attempting to run a workflow.
Introspecting the Jobstore
Note: Currently these features are only implemented for use locally (single machine) with the fileJobStore.
$ toil debug-file file:path-to-jobstore-directory \
      --listFilesInJobStore
$ toil debug-file file:path-to-jobstore \
      --fetch overview.txt *.bam *.fastq \
      --localFilePath=/home/user/localpath
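When the fetch command is launched from a script rather than typed into a shell, it can be assembled as an argument list so that patterns such as *.bam reach Toil unexpanded. The sketch below only builds and prints the command; the helper name debug_file_fetch_command is hypothetical, and the assumption that the patterns should arrive unexpanded is ours rather than something stated above.

```python
import shlex
from typing import List, Sequence


def debug_file_fetch_command(job_store: str,
                             patterns: Sequence[str],
                             local_path: str) -> List[str]:
    """Build the argv for fetching files out of a local file job store."""
    return [
        "toil", "debug-file", job_store,
        "--fetch", *patterns,
        f"--localFilePath={local_path}",
    ]


command = debug_file_fetch_command(
    "file:path-to-jobstore",
    ["overview.txt", "*.bam", "*.fastq"],
    "/home/user/localpath",
)
# shlex.join shows how the same command would need to be quoted in a shell.
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing a list to subprocess.run avoids the shell entirely, so no glob expansion happens before Toil sees the arguments.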
Stats and Status
See Stats Command for more about gathering statistics about job success, runtime, and resource usage from workflows.
Using a Python debugger
If you execute a workflow using the --debugWorker flag, Toil will not fork in order to run jobs, which means you can use either pdb or an IDE that supports debugging Python, as you would normally. Note that the --debugWorker flag will only work with the singleMachine batch system (the default), and not any of the custom job schedulers.
RUNNING IN THE CLOUD
Toil supports Amazon Web Services (AWS) and Google Compute Engine (GCE) in the cloud and has autoscaling capabilities that can adapt to the size of your workflow, whether your workflow requires 10 instances or 20,000.
Managing a Cluster of Virtual Machines (Provisioning)
Toil can use its provisioner to launch and manage a cluster of virtual machines, so that a workflow runs distributed over several nodes. The provisioner can also automatically scale the cluster up or down to handle dynamic changes in computational demand (autoscaling). Currently we have working provisioners for AWS and GCE (Azure support has been deprecated).
Storage (Toil jobStore)
Toil can make use of cloud storage such as AWS or Google buckets to take care of storage needs.
- •
- AWS Job Store
- •
- Google Job Store
CLOUD PLATFORMS
Running on Kubernetes
Kubernetes is a very popular container orchestration tool that has become a de facto cross-cloud-provider API for accessing cloud resources. Major cloud providers like Amazon, Microsoft, Kubernetes owner Google, and DigitalOcean have invested heavily in making Kubernetes work well on their platforms, by writing their own deployment documentation and developing provider-managed Kubernetes-based products. Using minikube, Kubernetes can even be run on a single machine.
Preparing your Kubernetes environment
- 1.
- Get a Kubernetes cluster
To run Toil workflows on Kubernetes, you need to have a Kubernetes cluster set up. This will not be covered here, but there are many options available, and which one you choose will depend on which cloud ecosystem, if any, you use already, and on pricing. If you are just following along with the documentation, use minikube on your local machine. Note that currently the only way to run a Toil workflow on Kubernetes is to use the AWS Job Store, so your Kubernetes workflow will currently have to store its data in Amazon's cloud regardless of where you run it. This can result in significant egress charges from Amazon if you run it outside of Amazon.
Kubernetes Cluster Providers:
- •
- Your own institution
- •
- Amazon EKS
- •
- Microsoft Azure AKS
- •
- Google GKE
- •
- DigitalOcean Kubernetes
- •
- minikube
- 2.
- Get a Kubernetes context on your local machine
There are two main ways to run Toil workflows on Kubernetes. You can either run the Toil leader on a machine outside the cluster, with jobs submitted to and run on the cluster, or you can submit the Toil leader itself as a job and have it run inside the cluster. Either way, you will need to configure your own machine to be able to submit jobs to the Kubernetes cluster. Generally, this involves creating and populating a file named .kube/config in your user's home directory, and specifying the cluster to connect to, the certificate and token information needed for mutual authentication, and the Kubernetes namespace within which to work. However, Kubernetes configuration can also be picked up from other files in the .kube directory, environment variables, and the enclosing host when running inside a Kubernetes-managed container.
You will have to do different things here depending on where you got your Kubernetes cluster:
- •
- Configuring for Amazon EKS
- •
- Configuring for Microsoft Azure AKS
- •
- Configuring for Google GKE
- •
- Configuring for DigitalOcean Kubernetes Clusters
- •
- Configuring for minikube
- 3.
- If running the Toil leader in the cluster, get a service account
If you are going to run your workflow's leader within the Kubernetes cluster (see Option 1: Running the Leader Inside Kubernetes), you will need a service account in your chosen Kubernetes namespace. Most namespaces should have a service account named default which should work fine. If your cluster requires you to use a different service account, you will need to obtain its name and use it when launching the Kubernetes job containing the Toil leader.
- 4.
- Set up appropriate permissions
Your local Kubernetes context and/or the service account you are using to run the leader in the cluster will need to have certain permissions in order to run the workflow. Toil needs to be able to interact with jobs and pods in the cluster, and to retrieve pod logs. You as a user may need permission to set up an AWS credentials secret, if one is not already available. Additionally, it is very useful for you as a user to have permission to interact with nodes, and to shell into pods.
The appropriate permissions may already be available to you and your service account by default, especially in managed or ease-of-use-optimized setups such as EKS or minikube. However, if the appropriate permissions are not already available, you or your cluster administrator will have to grant them manually. The following Role (toil-user) and ClusterRole (node-reader), to be applied with kubectl apply -f filename.yaml, should grant sufficient permissions to run Toil workflows when bound to your account and the service account used by Toil workflows. Be sure to replace YOUR_NAMESPACE_HERE with the namespace you are running your workflows in.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: YOUR_NAMESPACE_HERE
  name: toil-user
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["explain", "get", "watch", "list", "describe", "logs", "attach", "exec", "port-forward", "proxy", "cp", "auth"]
- apiGroups: ["batch"]
  resources: ["*"]
  verbs: ["get", "watch", "list", "create", "run", "set", "delete"]
- apiGroups: [""]
  resources: ["secrets", "pods", "pods/attach", "podtemplates", "configmaps", "events", "services"]
  verbs: ["patch", "get", "update", "watch", "list", "create", "run", "set", "delete", "exec"]
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "describe"]
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["get", "list", "describe"]
- apiGroups: ["metrics.k8s.io"]
  resources: ["*"]
  verbs: ["*"]

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: toil-developer-member
  namespace: toil
subjects:
- kind: User
  name: YOUR_KUBERNETES_USERNAME_HERE
  apiGroup: rbac.authorization.k8s.io
- kind: ServiceAccount
  name: YOUR_SERVICE_ACCOUNT_NAME_HERE
  namespace: YOUR_NAMESPACE_HERE
roleRef:
  kind: Role
  name: toil-user
  apiGroup: rbac.authorization.k8s.io

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: read-nodes
subjects:
- kind: User
  name: YOUR_KUBERNETES_USERNAME_HERE
  apiGroup: rbac.authorization.k8s.io
- kind: ServiceAccount
  name: YOUR_SERVICE_ACCOUNT_NAME_HERE
  namespace: YOUR_NAMESPACE_HERE
roleRef:
  kind: ClusterRole
  name: node-reader
  apiGroup: rbac.authorization.k8s.io
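Filling in the placeholders by hand is error-prone when several users or namespaces are involved. As one possible approach, the sketch below generates the RoleBinding as JSON, which kubectl apply -f also accepts. The function name toil_role_binding and the example values passed to it are illustrative assumptions; the field values mirror the RoleBinding manifest above.

```python
import json
from typing import Any, Dict

RBAC_GROUP = "rbac.authorization.k8s.io"


def toil_role_binding(namespace: str, username: str,
                      service_account: str) -> Dict[str, Any]:
    """Build a RoleBinding that binds the toil-user Role to a user and a service account."""
    return {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "toil-developer-member", "namespace": namespace},
        "subjects": [
            {"kind": "User", "name": username, "apiGroup": RBAC_GROUP},
            {"kind": "ServiceAccount", "name": service_account,
             "namespace": namespace},
        ],
        "roleRef": {"kind": "Role", "name": "toil-user", "apiGroup": RBAC_GROUP},
    }


# Example values only; substitute your own namespace, user, and service account.
manifest = toil_role_binding("toil", "demo-user", "default")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied in the same way as the YAML manifests.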
AWS Job Store for Kubernetes
Toil uses a job store to exchange data between jobs. Currently, the only job store that works with jobs running on Kubernetes is the AWS Job Store. This requires that the Toil leader and Kubernetes jobs be able to connect to and use Amazon S3 and Amazon SimpleDB. It also requires that you have an Amazon Web Services account.
- 1.
- Get access to AWS S3 and SimpleDB
In your AWS account, you need to create an AWS access key. First go to the IAM dashboard; for "us-west-1", the link would be:
https://console.aws.amazon.com/iam/home?region=us-west-1#/home
- 1.
- On the IAM Dashboard page, choose your account name in the navigation bar, and then choose My Security Credentials.
- 2.
- Expand the Access keys (access key ID and secret access key) section.
- 3.
- Choose Create New Access Key. Then choose Download Key File to save the access key ID and secret access key to a file on your computer. After you close the dialog box, you can't retrieve this secret access key again.
- 2.
- Configure AWS access from the local machine
This only really needs to happen if you run the leader on the local machine. But we need the files in place to fill in the secret in the next step. Run:
$ aws configure
This should leave you with a ~/.aws/credentials file that looks like this:
[default]
aws_access_key_id = BLAH
aws_secret_access_key = blahblahblah
- 3.
- Create a Kubernetes secret to give jobs access to AWS
Go into the directory where the credentials file is:
$ cd ~/.aws
Then, create a Kubernetes secret that contains it. We'll call it aws-credentials:
$ kubectl create secret generic aws-credentials --from-file credentials
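For clusters managed through manifests rather than ad hoc commands, roughly the same secret can be expressed declaratively. The sketch below builds a Secret manifest whose single key, credentials, holds the base64-encoded file contents, which is the shape Kubernetes expects in a Secret's data field. The function name and the placeholder credential text are assumptions for illustration; base64 is an encoding, not encryption, so the output should be treated as sensitive.

```python
import base64
import json
from typing import Any, Dict


def aws_credentials_secret(credentials_text: str,
                           name: str = "aws-credentials") -> Dict[str, Any]:
    """Build a Secret manifest holding an AWS credentials file under the key 'credentials'."""
    encoded = base64.b64encode(credentials_text.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {"credentials": encoded},
    }


# Placeholder contents; in practice read the text from ~/.aws/credentials.
example_text = (
    "[default]\n"
    "aws_access_key_id = BLAH\n"
    "aws_secret_access_key = blahblahblah\n"
)
print(json.dumps(aws_credentials_secret(example_text), indent=2))
```

The JSON output can be applied with kubectl apply -f, and the secret name should match whatever TOIL_AWS_SECRET_NAME is set to.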
Configuring Toil for your Kubernetes environment
To configure your workflow to run on Kubernetes, you will have to configure several environment variables, in addition to passing the --batchSystem kubernetes option. Doing the research to figure out what values to give these variables may require talking to your cluster provider.
- 1.
- TOIL_AWS_SECRET_NAME is the most important, and must be set to the secret that contains your AWS credentials file, if your cluster nodes don't otherwise have access to S3 and SimpleDB (such as through IAM roles). This is required for the AWS job store to work, which is currently the only job store that can be used on Kubernetes. In this example we are using aws-credentials.
- 2.
- TOIL_KUBERNETES_HOST_PATH can be set to allow Toil jobs on the same physical host to share a cache. It should be set to a path on the host where the shared cache should be stored. It will be mounted as /var/lib/toil, or at TOIL_WORKDIR if specified, inside the container. This path must already exist on the host, and must have as much free space as your Kubernetes node offers to jobs. In this example, we are using /data/scratch. To actually make use of caching, make sure not to use --disableCaching.
- 3.
- TOIL_KUBERNETES_OWNER should be set to the username of the user running the Toil workflow. The jobs that Toil creates will include this username, so they can be more easily recognized, and cleaned up by the user if anything happens to the Toil leader. In this example we are using demo-user.
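When the leader is started from a wrapper script, the three variables above can be collected into an environment mapping before launching the workflow. The following is a minimal sketch under that assumption; toil_kubernetes_env is a hypothetical helper, and the absolute-path check is our own sanity check rather than a rule stated by Toil.

```python
import os
import posixpath
from typing import Dict, Mapping, Optional


def toil_kubernetes_env(owner: str,
                        aws_secret_name: Optional[str] = None,
                        host_path: Optional[str] = None,
                        base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the Toil Kubernetes variables set."""
    env = dict(os.environ if base is None else base)
    env["TOIL_KUBERNETES_OWNER"] = owner
    if aws_secret_name is not None:
        env["TOIL_AWS_SECRET_NAME"] = aws_secret_name
    if host_path is not None:
        # Assumption: a path on the Kubernetes host should be absolute.
        if not posixpath.isabs(host_path):
            raise ValueError(f"host path must be absolute: {host_path!r}")
        env["TOIL_KUBERNETES_HOST_PATH"] = host_path
    return env


env = toil_kubernetes_env("demo-user", "aws-credentials", "/data/scratch", base={})
for key in sorted(env):
    print(f"{key}={env[key]}")
# To launch a workflow with it: subprocess.run([...], env=env, check=True)
```

Building a copy instead of mutating os.environ keeps the settings scoped to the one workflow launch.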
Running workflows
To run the workflow, you will need to run the Toil leader process somewhere. It can either be run inside Kubernetes as a Kubernetes job, or outside Kubernetes as a normal command.
Option 1: Running the Leader Inside Kubernetes
Once you have determined a set of environment variable values for your workflow run, write a YAML file that defines a Kubernetes job to run your workflow with that configuration. Some configuration items (such as your username, and the name of your AWS credentials secret) need to be written into the YAML so that they can be used from the leader as well.
apiVersion: batch/v1
kind: Job
metadata:
  # It is good practice to include your username in your job name.
  # Also specify it in TOIL_KUBERNETES_OWNER
  name: demo-user-toil-test
# Do not try and rerun the leader job if it fails
spec:
  backoffLimit: 0
  template:
    spec:
      # Do not restart the pod when the job fails, but keep it around so the
      # log can be retrieved
      restartPolicy: Never
      volumes:
      - name: aws-credentials-vol
        secret:
          # Make sure the AWS credentials are available as a volume.
          # This should match TOIL_AWS_SECRET_NAME
          secretName: aws-credentials
      # You may need to replace this with a different service account name as
      # appropriate for your cluster.
      serviceAccountName: default
      containers:
      - name: main
        image: quay.io/ucsc_cgl/toil:5.5.0
        env:
        # Specify your username for inclusion in job names
        - name: TOIL_KUBERNETES_OWNER
          value: demo-user
        # Specify where to find the AWS credentials to access the job store with
        - name: TOIL_AWS_SECRET_NAME
          value: aws-credentials
        # Specify where per-host caches should be stored, on the Kubernetes hosts.
        # Needs to be set for Toil's caching to be efficient.
        - name: TOIL_KUBERNETES_HOST_PATH
          value: /data/scratch
        volumeMounts:
        # Mount the AWS credentials volume
        - mountPath: /root/.aws
          name: aws-credentials-vol
        resources:
          # Make sure to set these resource limits to values large enough
          # to accommodate the work your workflow does in the leader
          # process, but small enough to fit on your cluster.
          #
          # Since no request values are specified, the limits are also used
          # for the requests.
          limits:
            cpu: 2
            memory: "4Gi"
            ephemeral-storage: "10Gi"
        command:
        - /bin/bash
        - -c
        - |
          # This Bash script will set up Toil and the workflow to run, and run them.
          set -e
          # We make sure to create a work directory; Toil can't hot-deploy a
          # script from the root of the filesystem, which is where we start.
          mkdir /tmp/work
          cd /tmp/work
          # We make a virtual environment to allow workflow dependencies to be
          # hot-deployed.
          #
          # We don't really make use of it in this example, but for workflows
          # that depend on PyPI packages we will need this.
          #
          # We use --system-site-packages so that the Toil installed in the
          # appliance image is still available.
          virtualenv --python python3 --system-site-packages venv
          . venv/bin/activate
          # Now we install the workflow. Here we're using a demo workflow
          # script from Toil itself.
          wget https://raw.githubusercontent.com/DataBiosphere/toil/releases/4.1.0/src/toil/test/docs/scripts/tutorial_helloworld.py
          # Now we run the workflow. We make sure to use the Kubernetes batch
          # system and an AWS job store, and we set some generally useful
          # logging options. We also make sure to enable caching.
          python3 tutorial_helloworld.py \
              aws:us-west-2:demouser-toil-test-jobstore \
              --batchSystem kubernetes \
              --realTimeLogging \
              --logInfo
$ kubectl apply -f leader.yaml
Monitoring and Debugging Kubernetes Jobs and Pods
The following techniques are most useful for looking at the pod which holds the Toil leader, but they can also be applied to individual Toil jobs on Kubernetes, even when the leader is outside the cluster.
$ kubectl get pods | grep demo-user-toil-test
demo-user-toil-test-g5496   1/1   Running   0   2m
$ kubectl get pods | grep demo-user
$ kubectl logs demo-user-toil-test-g5496
$ kubectl logs -f demo-user-toil-test-g5496
$ kubectl describe pod demo-user-toil-test-g5496
Type     Reason            Age                  From               Message
----     ------            ----                 ----               -------
Warning  FailedScheduling  13s (x79 over 100m)  default-scheduler  0/4 nodes are available: 1 Insufficient cpu, 1 Insufficient ephemeral-storage, 4 Insufficient memory.
$ kubectl exec -ti demo-user-toil-test-g5496 -- /bin/bash
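The grep step in the commands above can also be done in a script, which is handy when the pod name is needed for a follow-up kubectl logs or kubectl describe call. The sketch below parses text in the shape of kubectl get pods output and keeps only the first three columns; PodStatus and parse_pod_lines are hypothetical names, and the sample text is modeled on the output line shown earlier rather than captured from a real cluster.

```python
from typing import List, NamedTuple


class PodStatus(NamedTuple):
    name: str
    ready: str
    status: str


def parse_pod_lines(text: str, name_prefix: str) -> List[PodStatus]:
    """Pick out pods whose names start with name_prefix from 'kubectl get pods' text."""
    pods = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        if fields[0].startswith(name_prefix):
            pods.append(PodStatus(fields[0], fields[1], fields[2]))
    return pods


SAMPLE = """\
NAME                        READY   STATUS    RESTARTS   AGE
demo-user-toil-test-g5496   1/1     Running   0          2m
other-pod                   1/1     Running   0          5m
"""

for pod in parse_pod_lines(SAMPLE, "demo-user"):
    print(pod.name, pod.status)
```

In a real script the text would come from running kubectl get pods through subprocess and reading its standard output.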
When Things Go Wrong
The Toil Kubernetes batch system includes cleanup code to terminate worker jobs when the leader shuts down. However, if the leader pod is removed by Kubernetes, is forcibly killed or otherwise suffers a sudden existence failure, it can go away while its worker jobs live on. It is not recommended to restart a workflow in this state, as jobs from the previous invocation will remain running and will be trying to modify the job store concurrently with jobs from the new invocation.
Leftover jobs belonging to a given owner can be deleted with:
$ kubectl get jobs | grep demo-user | cut -f1 -d' ' | xargs -n10 kubectl delete job
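The cleanup pipeline filters job names by owner and hands them to kubectl delete job ten at a time. A scripted equivalent of the filter-and-batch logic might look like the sketch below; delete_job_commands is a hypothetical helper that only builds the commands, and it matches the owner string against the job name alone, whereas grep matches anywhere in the output line.

```python
from typing import Iterable, List


def delete_job_commands(job_names: Iterable[str],
                        owner: str = "demo-user",
                        batch_size: int = 10) -> List[List[str]]:
    """Group matching job names into 'kubectl delete job' commands of batch_size names."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    matching = [name for name in job_names if owner in name]
    return [
        ["kubectl", "delete", "job", *matching[start:start + batch_size]]
        for start in range(0, len(matching), batch_size)
    ]


# Made-up job names for illustration.
names = [f"demo-user-toil-job-{i}" for i in range(12)] + ["someone-else-job"]
for command in delete_job_commands(names):
    print(" ".join(command))
# To actually delete: subprocess.run(command, check=True) for each command
```

Reviewing the printed commands before running them is a cheap safeguard against deleting another user's jobs.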
Option 2: Running the Leader Outside Kubernetes
If you don't want to run your Toil leader inside Kubernetes, you can run it locally instead. This can be useful when developing a workflow; files can be hot-deployed from your local machine directly to Kubernetes. However, your local machine will have to have (ideally role-assumption- and MFA-free) access to AWS, and access to Kubernetes. Real time logging will not work unless your local machine is able to listen for incoming UDP packets on arbitrary ports on the address it uses to contact the IPv4 Internet; Toil does no NAT traversal or detection.
$ export TOIL_KUBERNETES_OWNER=demo-user  # This defaults to your local username if not set
$ export TOIL_AWS_SECRET_NAME=aws-credentials
$ export TOIL_KUBERNETES_HOST_PATH=/data/scratch
$ virtualenv --python python3 --system-site-packages venv
$ . venv/bin/activate
$ wget https://raw.githubusercontent.com/DataBiosphere/toil/releases/4.1.0/src/toil/test/docs/scripts/tutorial_helloworld.py
$ python3 tutorial_helloworld.py \
      aws:us-west-2:demouser-toil-test-jobstore \
      --batchSystem kubernetes \
      --realTimeLogging \
      --logInfo
Running CWL Workflows
Running CWL workflows on Kubernetes can be challenging, because executing CWL can require toil-cwl-runner to orchestrate containers of its own, within a Kubernetes job running in the Toil appliance container.
$ export TOIL_KUBERNETES_OWNER=demo-user  # This defaults to your local username if not set
$ export TOIL_AWS_SECRET_NAME=aws-credentials
$ export TOIL_KUBERNETES_HOST_PATH=/data/scratch
$ virtualenv --python python3 --system-site-packages venv
$ . venv/bin/activate
$ pip install toil[kubernetes,cwl]==5.8.0
$ toil-cwl-runner \
      --jobStore aws:us-west-2:demouser-toil-test-jobstore \
      --batchSystem kubernetes \
      --realTimeLogging \
      --logInfo \
      --disableCaching \
      path/to/cwl/workflow \
      path/to/cwl/input/object
AppArmor and Singularity
Kubernetes clusters based on Ubuntu hosts often will have AppArmor enabled on the host. AppArmor is a capability-based security enhancement system that integrates with the Linux kernel to enforce lists of things which programs may or may not do, called profiles. For example, an AppArmor profile could be applied to a web server process to stop it from using the mount() system call to manipulate the filesystem, because it has no business doing that under normal circumstances but might attempt to do it if compromised by hackers.
Running in AWS
Toil jobs can be run on a variety of cloud platforms. Of these, Amazon Web Services (AWS) is currently the best-supported solution. Toil provides the Cluster Utilities to conveniently create AWS clusters, connect to the leader of the cluster, and then launch a workflow. The leader handles distributing the jobs over the worker nodes and autoscaling to optimize costs.
Preparing your AWS environment
To use Amazon Web Services (AWS) to run Toil or to just use S3 to host the files during the computation of a workflow, first set up and configure an account with AWS:
- 1.
- If necessary, create and activate an AWS account
- 2.
- Next, generate a key pair for AWS with the command (do NOT generate your key pair with the Amazon browser):
$ ssh-keygen -t rsa
- 3.
- This should prompt you to save your key. Please save it in
~/.ssh/id_rsa
- 4.
- Now move this to where your OS can see it as an authorized key:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- 5.
- Next, you'll need to add your key to the ssh-agent:
$ eval `ssh-agent -s`
$ ssh-add
- 6.
- You'll also need to chmod your private key (good practice but also enforced by AWS):
$ chmod 400 ~/.ssh/id_rsa
- 7.
- Now you'll need to add the key to AWS via the browser. For example, for us-west-1, the page would be accessible at:
https://us-west-1.console.aws.amazon.com/ec2/v2/home?region=us-west-1#KeyPairs:sort=keyName
- 8.
- Now click on the "Import Key Pair" button to add your key:
Adding an Amazon Key Pair.
- 9.
- Next, you need to create an AWS access key. First, go to the IAM dashboard again; for "us-west-1", the example link would be:
https://console.aws.amazon.com/iam/home?region=us-west-1#/home
- 10.
- The directions (transcribed from: https://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html ) are now:
- 1.
- On the IAM Dashboard page, choose your account name in the navigation bar, and then choose My Security Credentials.
- 2.
- Expand the Access keys (access key ID and secret access key) section.
- 3.
- Choose Create New Access Key. Then choose Download Key File to save the access key ID and secret access key to a file on your computer. After you close the dialog box, you can't retrieve this secret access key again.
- 11.
- Now you should have a newly generated "AWS Access Key ID" and "AWS Secret Access Key". We can now install the AWS CLI and make sure that it has the proper credentials:
$ pip install awscli --upgrade --user
- 12.
- Now configure your AWS credentials with:
$ aws configure
- 13.
- Add your "AWS Access Key ID" and "AWS Secret Access Key" from earlier and your region and output format:
AWS Access Key ID [****************Q65Q]:
AWS Secret Access Key [****************G0ys]:
Default region name [us-west-1]:
Default output format [json]:
- 14.
- If not done already, install toil (example uses version 5.3.0, but we recommend the latest release):
$ virtualenv venv
$ source venv/bin/activate
$ pip install toil[all]==5.3.0
- 15.
- Now that toil is installed and you are running a virtualenv, an example of launching a toil leader node would be the following (again, note that we set TOIL_APPLIANCE_SELF to toil version 5.3.0 in this example, but please set the version to the installed version that you are using if you're using a different version):
$ TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:5.3.0 \
      toil launch-cluster clustername \
      --leaderNodeType t2.medium \
      --zone us-west-1a \
      --keyPairName id_rsa
TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:latest
--- This is optional. It specifies a mesos docker image that we maintain with
the latest version of toil installed on it. If you want to use a different
version of toil, please specify the image tag you need from
https://quay.io/repository/ucsc_cgl/toil?tag=latest&tab=tags.
toil launch-cluster --- Base command in toil to launch a cluster.
clustername --- Just choose a name for your cluster.
--leaderNodeType t2.medium --- Specify the leader node type. Make a
t2.medium (2CPU; 4Gb RAM; $0.0464/Hour). List of available AWS instances:
https://aws.amazon.com/ec2/pricing/on-demand/
--zone us-west-1a --- Specify the AWS zone you want to launch the
instance in. Must have the same prefix as the zone in your awscli credentials
(which, in the example of this tutorial is: "us-west-1").
--keyPairName id_rsa --- The name of your key pair, which should be
"id_rsa" if you've followed this tutorial.
You can set the TOIL_AWS_TAGS
environment variable to a JSON object to specify arbitrary tags for AWS
resources. For example, if you export
TOIL_AWS_TAGS='{"project-name": "variant-calling"}' in
your shell before using Toil, AWS resources created by Toil will be tagged
with a project-name tag with the value variant-calling.
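When Toil is driven from a Python wrapper rather than an interactive shell, the same variable can be set from Python. The following is a minimal sketch, assuming the wrapper process is the one that later starts Toil (a value set this way is only visible to that process and its children). It builds the JSON object with json.dumps so that quoting is handled correctly:

```python
import json
import os

# Tags to apply to AWS resources created by Toil; the key and value are the
# ones used in the example above.
tags = {"project-name": "variant-calling"}

# TOIL_AWS_TAGS must hold a JSON object, so serialize the dict with json.dumps.
os.environ["TOIL_AWS_TAGS"] = json.dumps(tags)

print(os.environ["TOIL_AWS_TAGS"])
```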
AWS Job Store
Using the AWS job store is straightforward after you've finished Preparing your AWS environment; all you need to do is specify the prefix for the job store name.
$ python3 sort.py aws:us-west-2:my-aws-sort-jobstore
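The locator in the command above has the form aws:<region>:<name>. As a minimal sketch, a script that generates such locators could compose them as follows; aws_job_store_locator is a hypothetical helper written for this example and is not part of Toil:

```python
def aws_job_store_locator(region: str, name: str) -> str:
    """Compose a job store locator of the form aws:<region>:<name>.

    Hypothetical helper for illustration; not part of the Toil API.
    """
    if not region or not name:
        raise ValueError("both region and name are required")
    if ":" in region or ":" in name:
        raise ValueError("region and name must not contain ':'")
    return f"aws:{region}:{name}"


locator = aws_job_store_locator("us-west-2", "my-aws-sort-jobstore")
print(locator)  # aws:us-west-2:my-aws-sort-jobstore
```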
Toil Provisioner
The Toil provisioner is included in Toil alongside the [aws] extra and allows us to spin up a cluster.
- 1.
- Make sure you have Toil installed with the AWS extras. For detailed instructions see Installing Toil with Extra Features.
- 2.
- You will need an AWS account and you will need to save your AWS credentials on your local machine. For help setting up an AWS account see here. For setting up your AWS credentials follow instructions here.
- Choosing Toil Appliance Image
- When using the Toil provisioner, the appliance image will be automatically chosen based on the pip-installed version of Toil on your system. That choice can be overridden by setting the environment variables TOIL_DOCKER_REGISTRY and TOIL_DOCKER_NAME or TOIL_APPLIANCE_SELF. See Environment Variables for more information on these variables. If you are developing with autoscaling and want to test and build your own appliance have a look at Developing with Docker.
Details about Launching a Cluster in AWS
Using the provisioner to launch a Toil leader instance is simple using the launch-cluster command. For example, to launch a cluster named "my-cluster" with a t2.medium leader in the us-west-2a zone, run
(venv) $ toil launch-cluster my-cluster \
             --leaderNodeType t2.medium \
             --zone us-west-2a \
             --keyPairName <your-AWS-key-pair-name>
(venv) $ toil launch-cluster --help
Static Provisioning
Toil can be used to manage a cluster in the cloud by using the Cluster Utilities. The cluster utilities also make it easy to run a toil workflow directly on this cluster. We call this static provisioning because the size of the cluster does not change. This is in contrast with Running a Workflow with Autoscaling.
(venv) $ toil launch-cluster my-cluster \
             --leaderNodeType t2.small -z us-west-2a \
             --keyPairName your-AWS-key-pair-name \
             --nodeTypes m3.large,t2.micro -w 1,4
Uploading Workflows
Now that our cluster is launched, we use the Rsync-Cluster Command utility to copy the workflow to the leader. For a simple workflow in a single file this might look like
(venv) $ toil rsync-cluster -z us-west-2a my-cluster toil-workflow.py :/
If your toil workflow has dependencies, have a
look at the Auto-Deployment section for a detailed explanation of how
to include them.
Running a Workflow with Autoscaling
Autoscaling is a feature of running Toil in a cloud whereby additional cloud instances are launched to run the workflow. Autoscaling leverages Mesos containers to provide an execution environment for these workflows.
Make sure you've done the AWS setup in Preparing your AWS environment.
- 1.
- Download sort.py
- 2.
- Launch the leader node in AWS using the Launch-Cluster Command:
(venv) $ toil launch-cluster <cluster-name> \
             --keyPairName <AWS-key-pair-name> \
             --leaderNodeType t2.medium \
             --zone us-west-2a
- 3.
- Copy the sort.py script up to the leader node:
(venv) $ toil rsync-cluster -z us-west-2a <cluster-name> sort.py :/root
- 4.
- Login to the leader node:
(venv) $ toil ssh-cluster -z us-west-2a <cluster-name>
- 5.
- Run the script as an autoscaling workflow:
$ python3 /root/sort.py aws:us-west-2:<my-jobstore-name> \
      --provisioner aws \
      --nodeTypes c3.large \
      --maxNodes 2 \
      --batchSystem mesos
In this example, the autoscaling Toil code
creates up to two instances of type c3.large and launches Mesos slave
containers inside them. The containers are then available to run jobs defined
by the sort.py script. Toil also creates a bucket in S3 called
aws:us-west-2:autoscaling-sort-jobstore to store intermediate job
results. The Toil autoscaler can also provision multiple different node types,
which is useful for workflows that have jobs with varying resource
requirements. For example, one could execute the script with --nodeTypes
c3.large,r3.xlarge --maxNodes 5,1, which would allow the provisioner to
create up to five c3.large nodes and one r3.xlarge node for memory-intensive
jobs. In this situation, the autoscaler would avoid creating the more
expensive r3.xlarge node until needed, running most jobs on the c3.large
nodes.
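The two comma-separated lists are matched by position: the first entry of --maxNodes limits the first entry of --nodeTypes, and so on. The sketch below illustrates that pairing for the example values above. pair_node_limits is a hypothetical helper written for this illustration, not Toil's own option parser:

```python
from typing import Dict


def pair_node_limits(node_types: str, max_nodes: str) -> Dict[str, int]:
    """Pair each node type with its node limit by position.

    Hypothetical helper mirroring how the two lists line up; Toil's real
    option handling may differ in detail.
    """
    types = [t.strip() for t in node_types.split(",")]
    limits = [int(n) for n in max_nodes.split(",")]
    if len(types) != len(limits):
        raise ValueError(
            f"{len(types)} node types given but {len(limits)} node limits"
        )
    return dict(zip(types, limits))


limits = pair_node_limits("c3.large,r3.xlarge", "5,1")
print(limits)  # {'c3.large': 5, 'r3.xlarge': 1}
```

Checking the two lists like this before launching a long workflow catches a mismatched count early.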
- 1.
- View the generated file to sort:
$ head fileToSort.txt
- 2.
- View the sorted file:
$ head sortedFile.txt
$ python3 my-toil-script.py --help
Some important caveats about starting a toil
run through an ssh session are explained in the Ssh-Cluster Command
section.
Preemptibility
Toil can run on a heterogeneous cluster of both preemptible and non-preemptible nodes. Being a preemptible node simply means that the node may be shut down at any time, while jobs are running. These jobs can then be restarted later somewhere else.
$ python /root/sort.py aws:us-west-2:<my-jobstore-name> \
      --provisioner aws \
      --nodeTypes c3.4xlarge:2.00 \
      --maxNodes 2 \
      --batchSystem mesos \
      --defaultPreemptible
- Specify Preemptibility Carefully
- Ensure that your choices for --nodeTypes and --maxNodes make sense for your workflow and won't cause it to hang. You should make sure the provisioner is able to create nodes large enough to run the largest job in the workflow, and that non-preemptible node types are allowed if there are non-preemptible jobs in the workflow.
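The two checks described above can be sketched as a small pre-flight test run before launching. Everything in this sketch is an assumption made for illustration: the NodeType and JobRequirement classes are hypothetical, the core and memory figures are placeholders rather than authoritative instance specifications, and it assumes a preemptible job may run on any node while a non-preemptible job needs a non-preemptible node.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeType:
    name: str
    cores: int
    memory_gib: float
    preemptible: bool


@dataclass
class JobRequirement:
    name: str
    cores: int
    memory_gib: float
    preemptible: bool


def fits(job: JobRequirement, node: NodeType) -> bool:
    """Return True if the node is big enough and allowed to run the job."""
    # Assumption: non-preemptible jobs must not land on preemptible nodes.
    if not job.preemptible and node.preemptible:
        return False
    return job.cores <= node.cores and job.memory_gib <= node.memory_gib


def unschedulable_jobs(jobs: List[JobRequirement], nodes: List[NodeType]) -> List[str]:
    """Names of jobs that no configured node type could ever run."""
    return [job.name for job in jobs if not any(fits(job, n) for n in nodes)]


# Placeholder figures; look up real instance sizes before relying on them.
nodes = [NodeType("c3.4xlarge:2.00", cores=16, memory_gib=30, preemptible=True)]
jobs = [
    JobRequirement("align", cores=8, memory_gib=16, preemptible=True),
    JobRequirement("merge", cores=2, memory_gib=4, preemptible=False),
]
print(unschedulable_jobs(jobs, nodes))  # ['merge']
```

Here the "merge" job would never be scheduled because only a preemptible node type is configured, which is the kind of hang the note above warns about.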
Using MinIO and S3-Compatible object stores
Toil can be configured to access files stored in an S3-compatible object store such as MinIO. The following environment variables can be used to configure the S3 connection used:
- •
- TOIL_S3_HOST: the IP address or hostname to use for connecting to S3
- •
- TOIL_S3_PORT: the port number to use for connecting to S3, if needed
- •
- TOIL_S3_USE_SSL: enable or disable the usage of SSL for connecting to S3 ( True by default)
TOIL_S3_HOST=127.0.0.1
TOIL_S3_PORT=9010
TOIL_S3_USE_SSL=False
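To see which endpoint a given combination of these variables describes, the sketch below assembles a URL from them. It is an illustration only: s3_endpoint_from_env is a hypothetical helper, and the way it interprets TOIL_S3_USE_SSL (anything other than "false", case-insensitively, counts as enabled) is an assumption rather than Toil's documented parsing.

```python
import os
from typing import Mapping


def s3_endpoint_from_env(env: Mapping[str, str] = os.environ) -> str:
    """Build an endpoint URL from the TOIL_S3_* variables (illustrative only)."""
    host = env.get("TOIL_S3_HOST")
    if not host:
        raise ValueError("TOIL_S3_HOST is not set")
    port = env.get("TOIL_S3_PORT")
    # SSL is on by default; assume only an explicit "False" turns it off.
    use_ssl = env.get("TOIL_S3_USE_SSL", "True").strip().lower() != "false"
    scheme = "https" if use_ssl else "http"
    return f"{scheme}://{host}:{port}" if port else f"{scheme}://{host}"


example_env = {
    "TOIL_S3_HOST": "127.0.0.1",
    "TOIL_S3_PORT": "9010",
    "TOIL_S3_USE_SSL": "False",
}
print(s3_endpoint_from_env(example_env))  # http://127.0.0.1:9010
```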
Dashboard
Toil provides a dashboard for viewing the RAM and CPU usage of each node, the number of issued jobs of each type, the number of failed jobs, and the size of the jobs queue. To launch this dashboard for a toil workflow, include the --metrics flag in the toil script command. The dashboard can then be viewed in your browser at localhost:3000 while connected to the leader node through toil ssh-cluster:
(venv) $ toil ssh-cluster -z us-west-2a --grafana_port 8000 <cluster-name>
Running in Google Compute Engine (GCE)
Toil supports a provisioner with Google, and a Google Job Store. To get started, follow instructions for Preparing your Google environment.
Preparing your Google environment
Toil supports using the Google Cloud Platform. Setting this up is easy!
- 1.
- Make sure that the google extra (Installing Toil with Extra Features) is installed
- 2.
- Follow Google's Instructions to download credentials and set the GOOGLE_APPLICATION_CREDENTIALS environment variable
- 3.
- Create a new ssh key with the proper format. To create a new ssh key run the command
$ ssh-keygen -t rsa -f ~/.ssh/id_rsa -C [USERNAME]
This command could overwrite an old ssh key
you may be using. If you have an existing ssh key you would like to use, it
will need to be called id_rsa and it needs to have no password set.
$ chmod 400 ~/.ssh/id_rsa ~/.ssh/id_rsa.pub
- 4.
- Add your newly formatted public key to Google. To do this, log into your Google Cloud account and go to metadata section under the Compute tab. [image] Near the top of the screen click on 'SSH Keys', then edit, add item, and paste the key. Then save: [image]
Google Job Store
To use the Google Job Store you will need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable by following Google's instructions.
$ python3 sort.py google:my-project-id:my-google-sort-jobstore
Running a Workflow with Autoscaling
WARNING: Google Autoscaling is in beta!
- 1.
- Download sort.py
- 2.
- Launch the leader node in GCE using the Launch-Cluster Command:
(venv) $ toil launch-cluster <CLUSTER-NAME> \
             --provisioner gce \
             --leaderNodeType n1-standard-1 \
             --keyPairName <SSH-KEYNAME> \
             --zone us-west1-a
- 3.
- Upload the sort example and ssh into the leader:
(venv) $ toil rsync-cluster --provisioner gce <CLUSTER-NAME> sort.py :/root
(venv) $ toil ssh-cluster --provisioner gce <CLUSTER-NAME>
- 4.
- Run the workflow:
$ python3 /root/sort.py google:<PROJECT-ID>:<JOBSTORE-NAME> \
      --provisioner gce \
      --batchSystem mesos \
      --nodeTypes n1-standard-2 \
      --maxNodes 2
- 5.
- Clean up:
$ exit  # this exits the ssh from the leader node
(venv) $ toil destroy-cluster --provisioner gce <CLUSTER-NAME>
Cluster Utilities
There are several utilities used for starting and managing a Toil cluster using the AWS provisioner. They are installed via the [aws] or [google] extra. For installation details see Toil Provisioner. The cluster utilities are used for Running in AWS and are comprised of toil launch-cluster, toil rsync-cluster, toil ssh-cluster, and toil destroy-cluster entry points.
stats --- Reports runtime and resource
usage for all jobs in a specified jobstore (workflow must have originally been
run using the --stats option).
status --- Inspects a job store to see which jobs have failed, run
successfully, etc.
destroy-cluster --- For autoscaling. Terminates the specified cluster and
associated resources.
launch-cluster --- For autoscaling. This is used to launch a toil leader
instance with the specified provisioner.
rsync-cluster --- For autoscaling. Used to transfer files to a cluster
launched with toil launch-cluster.
ssh-cluster --- SSHs into the toil appliance container running on the
leader of the cluster.
clean --- Delete the job store used by a previous Toil workflow
invocation.
kill --- Kills any running jobs in a rogue toil.
toil launch-cluster --help
By default, all of the cluster utilities
expect to be running on AWS. To run with Google you will need to specify the
--provisioner gce option for each utility.
Boto must be configured with AWS
credentials before using cluster utilities.
Running in Google Compute Engine (GCE) contains instructions for
Stats Command
To use the stats command, a workflow must first be run using the --stats option. Using this option makes certain that toil does not delete the job store, no matter what other options are specified (i.e. normally the option --clean=always would delete the job store, but --stats will override this).
python3 discoverfiles.py file:my-jobstore --stats

import os
import subprocess

from toil.common import Toil
from toil.job import Job


class discoverFiles(Job):
    """Views files at a specified path using ls."""

    def __init__(self, path, *args, **kwargs):
        self.path = path
        super().__init__(*args, **kwargs)

    def run(self, fileStore):
        if os.path.exists(self.path):
            subprocess.check_call(["ls", self.path])


def main():
    options = Job.Runner.getDefaultArgumentParser().parse_args()
    options.clean = "always"

    job1 = discoverFiles(path="/sys/", displayName='sysFiles')
    job2 = discoverFiles(path=os.path.expanduser("~"), displayName='userFiles')
    job3 = discoverFiles(path="/tmp/")

    job1.addChild(job2)
    job2.addChild(job3)

    with Toil(options) as toil:
        if not toil.options.restart:
            toil.start(job1)
        else:
            toil.restart()


if __name__ == '__main__':
    main()
toil stats file:my-jobstore
Batch System: singleMachine
Default Cores: 1  Default Memory: 2097152K
Max Cores: 9.22337e+18
Total Clock: 0.56  Total Runtime: 1.01
Worker
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    1 | 0.14 0.14 0.14 0.14 0.14 | 0.13 0.13 0.13 0.13 0.13 | 0.01 0.01 0.01 0.01 0.01 | 76K 76K 76K 76K 76K
Job
 Worker Jobs | min med ave max
             | 3 3 3 3
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    3 | 0.01 0.06 0.05 0.07 0.14 | 0.00 0.06 0.04 0.07 0.12 | 0.00 0.01 0.00 0.01 0.01 | 76K 76K 76K 76K 229K
 sysFiles
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    1 | 0.01 0.01 0.01 0.01 0.01 | 0.00 0.00 0.00 0.00 0.00 | 0.01 0.01 0.01 0.01 0.01 | 76K 76K 76K 76K 76K
 userFiles
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    1 | 0.06 0.06 0.06 0.06 0.06 | 0.06 0.06 0.06 0.06 0.06 | 0.01 0.01 0.01 0.01 0.01 | 76K 76K 76K 76K 76K
 discoverFiles
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    1 | 0.07 0.07 0.07 0.07 0.07 | 0.07 0.07 0.07 0.07 0.07 | 0.00 0.00 0.00 0.00 0.00 | 76K 76K 76K 76K 76K
toil clean file:my-jobstore
Status Command
Continuing the example from the stats section above, if we ran our workflow with the command
python3 discoverfiles.py file:my-jobstore --stats
toil status file:my-jobstore
2018-01-11 19:31:29,739 - toil.lib.bioio - INFO - Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
2018-01-11 19:31:29,740 - toil.utils.toilStatus - INFO - Parsed arguments
2018-01-11 19:31:29,740 - toil.utils.toilStatus - INFO - Checking if we have files for Toil
The root job of the job store is absent, the workflow completed successfully.
There are x unfinished jobs, y
parent jobs with children, z jobs with services, a services, and
b totally failed jobs currently in c.
Clean Command
If a Toil pipeline didn't finish successfully, or was run using --clean=never or --stats, the job store will exist until it is deleted. toil clean <jobStore> ensures that all artifacts associated with a job store are removed. This is particularly useful for deleting AWS job stores, which reserve an SDB domain as well as an S3 bucket.
Launch-Cluster Command
Running toil launch-cluster starts up a leader for a cluster. Workers can be added to the initial cluster by specifying the -w option. An example would be
$ toil launch-cluster my-cluster \
      --leaderNodeType t2.small -z us-west-2a \
      --keyPairName your-AWS-key-pair-name \
      --nodeTypes m3.large,t2.micro -w 1,4
$ toil launch-cluster --help
- --help
- -h also accepted. Displays this help menu.
- --tempDirRoot TEMPDIRROOT
- Path to the temporary directory where all temp files are created, by default uses the current working directory as the base.
- --version
- Display version.
- --provisioner CLOUDPROVIDER
- -p CLOUDPROVIDER also accepted. The provisioner for cluster auto-scaling. Both AWS and GCE are currently supported.
- --zone ZONE
- -z ZONE also accepted. The availability zone of the leader. This parameter can also be set via the TOIL_AWS_ZONE or TOIL_GCE_ZONE environment variables, or by the ec2_region_name parameter in your .boto file if using AWS, or derived from the instance metadata if using this utility on an existing EC2 instance.
- --leaderNodeType LEADERNODETYPE
- Non-preemptable node type to use for the cluster leader.
- --keyPairName KEYPAIRNAME
- The name of the AWS or ssh key pair to include on the instance.
- --owner OWNER
- The owner tag for all instances. If not given, the value in TOIL_OWNER_TAG will be used, or else the value of --keyPairName.
- --boto BOTOPATH
- The path to the boto credentials directory. This is transferred to all nodes in order to access the AWS jobStore from non-AWS instances.
- --tag KEYVALUE
- KEYVALUE is specified as KEY=VALUE. -t KEY=VALUE also accepted. Tags are added to the AWS cluster for this node and all of its children. Tags are of the form: -t key1=value1 --tag key2=value2. Multiple tags are allowed and each tag needs its own flag. By default the cluster is tagged with: { "Name": clusterName, "Owner": IAM username }.
- --vpcSubnet VPCSUBNET
- VPC subnet ID to launch cluster leader in. Uses default subnet if not specified. This subnet needs to have auto assign IPs turned on.
- --nodeTypes NODETYPES
- Comma-separated list of node types to create while launching the leader. The syntax for each node type depends on the provisioner used. For the AWS provisioner this is the name of an EC2 instance type followed by a colon and the price in dollars to bid for a spot instance, for example 'c3.8xlarge:0.42'. Must also provide the --workers argument to specify how many workers of each node type to create.
- --workers WORKERS
- -w WORKERS also accepted. Comma-separated list of the number of workers of each node type to launch alongside the leader when the cluster is created. This can be useful if running toil without auto-scaling but with need of more hardware support.
- --leaderStorage LEADERSTORAGE
- Specify the size (in gigabytes) of the root volume for the leader instance. This is an EBS volume.
- --nodeStorage NODESTORAGE
- Specify the size (in gigabytes) of the root volume for any worker instances created when using the -w flag. This is an EBS volume.
- --nodeStorageOverrides NODESTORAGEOVERRIDES
- Comma-separated list of nodeType:nodeStorage that are used to override the default value from --nodeStorage for the specified nodeType(s). This is useful for heterogeneous jobs where some tasks require much more disk than others.
- --logOff
- Same as --logCritical.
- --logCritical
- Turn on logging at level CRITICAL and above. (default is INFO)
- --logError
- Turn on logging at level ERROR and above. (default is INFO)
- --logWarning
- Turn on logging at level WARNING and above. (default is INFO)
- --logInfo
- Turn on logging at level INFO and above. (default is INFO)
- --logDebug
- Turn on logging at level DEBUG and above. (default is INFO)
- --logLevel LOGLEVEL
- Log at given level (may be either OFF (or CRITICAL), ERROR, WARN (or WARNING), INFO or DEBUG). (default is INFO)
- --logFile LOGFILE
- File to log in.
- --rotatingLogging
- Turn on rotating logging, which prevents log files getting too big.
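Two of the options above take small structured strings: --nodeTypes entries may carry a spot bid after a colon (for example 'c3.8xlarge:0.42'), and --nodeStorageOverrides takes nodeType:nodeStorage pairs. The sketch below shows one way to read those formats. Both functions are hypothetical helpers written for illustration, not Toil's own parsing code:

```python
from typing import Dict, List, Optional, Tuple


def parse_node_types(spec: str) -> List[Tuple[str, Optional[float]]]:
    """Split 'type[:bid],...' into (type, bid) pairs; bid is None if absent."""
    parsed = []
    for entry in spec.split(","):
        name, sep, bid = entry.strip().partition(":")
        # A colon marks a spot bid in dollars; without one the bid is None.
        parsed.append((name, float(bid) if sep else None))
    return parsed


def parse_storage_overrides(spec: str) -> Dict[str, int]:
    """Split 'nodeType:gigabytes,...' into a mapping of type to size."""
    overrides = {}
    for entry in spec.split(","):
        node_type, _, size = entry.strip().partition(":")
        overrides[node_type] = int(size)
    return overrides


print(parse_node_types("c3.8xlarge:0.42,t2.micro"))
print(parse_storage_overrides("m3.large:100"))
```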
Ssh-Cluster Command
Toil provides the ability to ssh into the leader of the cluster. This can be done as follows:
$ toil ssh-cluster CLUSTER-NAME-HERE
$ script
$ screen
$ toil ssh-cluster CLUSTER-NAME-HERE remoteCommand
Rsync-Cluster Command
The most frequent use case for the rsync-cluster utility is deploying your Toil script to the Toil leader. Note that the syntax is the same as traditional rsync with the exception of the hostname before the colon. This is not needed in toil rsync-cluster since the hostname is automatically determined by Toil.
$ toil rsync-cluster CLUSTER-NAME-HERE \
      ~/localFile :/remoteDestination
Destroy-Cluster Command
The destroy-cluster command is the advised way to get rid of any Toil cluster launched using the Launch-Cluster Command. It ensures that all attached nodes, volumes, security groups, etc. are deleted. If a node or cluster is shut down using Amazon's online portal, residual resources may still be in use in the background. To delete a cluster run
$ toil destroy-cluster CLUSTER-NAME-HERE
Kill Command
To kill all currently running jobs for a given jobstore, use the command
toil kill file:my-jobstore
HPC ENVIRONMENTS
Toil is a flexible framework that can be leveraged in a variety of environments, including high-performance computing (HPC) environments. Toil provides support for a number of batch systems, including Grid Engine, Slurm, Torque and LSF, which are popular schedulers used in these environments. Toil also supports HTCondor, which is a popular scheduler for high-throughput computing (HTC). To use one of these batch systems specify the "--batchSystem" argument to the toil script.
Standard Output/Error from Batch System Jobs
Standard output and error from batch system jobs (except for the Parasol and Mesos batch systems) are redirected to files in the toil-<workflowID> directory created within the temporary directory specified by the --workDir option; see Commandline Options. Each file is named as follows: toil_job_<Toil job ID>_batch_<name of batch system>_<job ID from batch system>_<file description>.log, where <file description> is std_output for standard output, and std_error for standard error. HTCondor will also write job event log files with <file description> = job_events.
CWL IN TOIL
The Common Workflow Language (CWL) is an emerging standard for writing workflows that are portable across multiple workflow engines and platforms. Toil has full support for the CWL v1.0, v1.1, and v1.2 standards.
Running CWL Locally
The toil-cwl-runner command provides cwl-parsing functionality using cwltool, and leverages the job-scheduling and batch system support of Toil.
$ toil-cwl-runner example.cwl example-job.yml
Note for macOS + Docker + Toil
When invoking CWL documents that make use of Docker containers, if you see errors that look like
docker: Error response from daemon: Mounts denied: The paths /var/...tmp are not shared from OS X and are not known to Docker.
export TMPDIR=/tmp/docker_tmp
Detailed Usage Instructions
Help information can be found by using this toil command:
$ toil-cwl-runner -h
$ toil-cwl-runner \
      --singularity \
      --jobStore my_jobStore \
      --batchSystem lsf \
      --workDir `pwd` \
      --outdir `pwd` \
      --logFile cwltoil.log \
      --writeLogs `pwd` \
      --logLevel DEBUG \
      --retryCount 2 \
      --maxLogFileSize 20000000000 \
      --stats \
      standard_bam_processing.cwl \
      inputs.yaml
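When a long invocation like the one above is launched from a Python driver script, building it as an argument list avoids shell quoting problems. This is a minimal sketch that only assembles and prints the command, using the same flags and file names as the example above; the call that would actually run it is left commented out, since it requires toil-cwl-runner and an LSF cluster.

```python
import os
import shlex

# The shell example uses `pwd`; os.getcwd() is the Python equivalent.
work_dir = os.getcwd()

cmd = [
    "toil-cwl-runner",
    "--singularity",
    "--jobStore", "my_jobStore",
    "--batchSystem", "lsf",
    "--workDir", work_dir,
    "--outdir", work_dir,
    "--logFile", "cwltoil.log",
    "--writeLogs", work_dir,
    "--logLevel", "DEBUG",
    "--retryCount", "2",
    "--maxLogFileSize", "20000000000",
    "--stats",
    "standard_bam_processing.cwl",
    "inputs.yaml",
]

# Print a copy-pasteable form of the command for logging.
print(shlex.join(cmd))

# To run it for real (needs toil-cwl-runner on PATH):
# import subprocess
# subprocess.check_call(cmd)
```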
Running CWL in the Cloud
To run in cloud and HPC configurations, you may need to provide additional command line parameters to select and configure the batch system to use.
Running CWL within Toil Scripts
A CWL workflow can be run indirectly in a native Toil script. However, this is not the standard way to run CWL workflows with Toil and doing so comes at the cost of job efficiency. For some use cases, such as running one process on multiple files, it may be useful. For example, if you want to run a CWL workflow with 3 YML files specifying different sample inputs, it could look something like:

import os
import subprocess
import tempfile

from toil.common import Toil
from toil.job import Job


def initialize_jobs(job):
    job.fileStore.logToMaster('initialize_jobs')


def runQC(job, cwl_file, cwl_filename, yml_file, yml_filename, outputs_dir, output_num):
    job.fileStore.logToMaster("runQC")
    tempDir = job.fileStore.getLocalTempDir()

    # copy the CWL document and its inputs file into the job's temp directory
    cwl = job.fileStore.readGlobalFile(cwl_file, userPath=os.path.join(tempDir, cwl_filename))
    yml = job.fileStore.readGlobalFile(yml_file, userPath=os.path.join(tempDir, yml_filename))

    subprocess.check_call(["toil-cwl-runner", cwl, yml])

    output_filename = "output.txt"
    output_file = job.fileStore.writeGlobalFile(output_filename)
    job.fileStore.readGlobalFile(output_file, userPath=os.path.join(outputs_dir, "sample_" + output_num + "_" + output_filename))
    return output_file


if __name__ == "__main__":
    jobstore: str = tempfile.mkdtemp("tutorial_cwlexample")
    os.rmdir(jobstore)
    options = Job.Runner.getDefaultOptions(jobstore)
    options.logLevel = "INFO"
    options.clean = "always"

    with Toil(options) as toil:
        # specify the folder where the cwl and yml files live
        inputs_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "cwlExampleFiles")
        # specify where you wish the outputs to be written
        outputs_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "cwlExampleFiles")

        job0 = Job.wrapJobFn(initialize_jobs)

        cwl_filename = "hello.cwl"
        cwl_file = toil.importFile("file://" + os.path.abspath(os.path.join(inputs_dir, cwl_filename)))

        # add list of yml config inputs here or import and construct from file
        yml_files = ["hello1.yml", "hello2.yml", "hello3.yml"]
        i = 0
        for yml in yml_files:
            i = i + 1
            yml_file = toil.importFile("file://" + os.path.abspath(os.path.join(inputs_dir, yml)))
            yml_filename = yml
            job = Job.wrapJobFn(runQC, cwl_file, cwl_filename, yml_file, yml_filename, outputs_dir, output_num=str(i))
            job0.addChild(job)

        toil.start(job0)
Running CWL workflows with InplaceUpdateRequirement
Some CWL workflows use the InplaceUpdateRequirement feature, which requires that operations on files have visible side effects that Toil's file store cannot support. If you need to run a workflow like this, you can make sure that all of your worker nodes have a shared filesystem, and use the --bypass-file-store option to toil-cwl-runner. This will make it leave all CWL intermediate files on disk and share them between jobs using file paths, instead of storing them in the file store and downloading them when jobs need them.
Toil & CWL Tips
See logs for just one job by using the full log file
cat cwltoil.log | grep jobVM1fIs
pcregrep -M "\[job .*\.cwl.*$\n(.* .*$\n)*" cwltoil.log # ^allows for multiline matching
find . | grep -P '^./out_tmpdir.*_MD\.bam$'
cat log/cwltoil.log | grep -oP "\[job .*.cwl\]" | sort | uniq
cat log/cwltoil.log | grep -i "issued job"
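Where grep -P is not available, the "list each distinct [job ...cwl] tag" pipeline above can be approximated in Python. This is a sketch under stated assumptions: the sample lines are invented for the demonstration and are not real Toil log output, and the pattern escapes the dot before "cwl", which is slightly stricter than the grep expression above.

```python
import re
from typing import Iterable, List

# Same idea as: grep -oP "\[job .*.cwl\]" | sort | uniq
JOB_PATTERN = re.compile(r"\[job .*\.cwl\]")


def unique_cwl_jobs(lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated [job ...cwl] tags found in the lines."""
    found = set()
    for line in lines:
        found.update(JOB_PATTERN.findall(line))
    return sorted(found)


# Invented sample lines, for illustration only.
sample = [
    "INFO [job bwa-mem.cwl] starting",
    "INFO [job group_waltz_files.cwl] starting",
    "INFO [job bwa-mem.cwl] completed success",
]
print(unique_cwl_jobs(sample))
# ['[job bwa-mem.cwl]', '[job group_waltz_files.cwl]']
```

To apply it to a real log, pass an open file object, for example unique_cwl_jobs(open("log/cwltoil.log")).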
$ toil status /home/johnsoni/TEST_RUNS_3/TEST_run/tmp/jobstore-09ae0acc-c800-11e8-9d09-70106fb1697e
<hostname> 2018-10-04 15:01:44,184 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
<hostname> 2018-10-04 15:01:44,185 MainThread INFO toil.utils.toilStatus: Parsed arguments
<hostname> 2018-10-04 15:01:47,081 MainThread INFO toil.utils.toilStatus: Traversing the job graph gathering jobs. This may take a couple of minutes.

Of the 286 jobs considered, there are 179 jobs with children, 107 jobs ready to run, 0 zombie jobs, 0 jobs with services, 0 services, and 0 jobs with log files currently in file:/home/user/jobstore-09ae0acc-c800-11e8-9d09-70106fb1697e.
$ toil stats /path/to/jobstore
<hostname> 2018-10-15 12:06:19,003 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
<hostname> 2018-10-15 12:06:19,004 MainThread INFO toil.utils.toilStats: Parsed arguments
<hostname> 2018-10-15 12:06:19,004 MainThread INFO toil.utils.toilStats: Checking if we have files for toil
<hostname> 2018-10-15 12:06:19,004 MainThread INFO toil.utils.toilStats: Checked arguments
Batch System: lsf
Default Cores: 1  Default Memory: 10485760K
Max Cores: 9.22337e+18
Total Clock: 106608.01  Total Runtime: 86634.11
Worker
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    1659 | 0.00 0.80 264.87 12595.59 439424.40 | 0.00 0.46 449.05 42240.74 744968.80 | -35336.69 0.16 -184.17 4230.65 -305544.39 | 48K 223K 1020K 40235K 1692300K
Job
 Worker Jobs | min med ave max
             | 1077 1077 1077 1077
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    1077 | 0.04 1.18 407.06 12593.43 438404.73 | 0.01 0.28 691.17 42240.35 744394.14 | -35336.83 0.27 -284.11 4230.49 -305989.41 | 135K 268K 1633K 40235K 1759734K
 ResolveIndirect
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    205 | 0.04 0.07 0.16 2.29 31.95 | 0.01 0.02 0.02 0.14 3.60 | 0.02 0.05 0.14 2.28 28.35 | 190K 266K 256K 314K 52487K
 CWLGather
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    40 | 0.05 0.17 0.29 1.90 11.62 | 0.01 0.02 0.02 0.05 0.80 | 0.03 0.14 0.27 1.88 10.82 | 188K 265K 250K 316K 10039K
 CWLWorkflow
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    205 | 0.09 0.40 0.98 13.70 200.82 | 0.04 0.15 0.16 1.08 31.78 | 0.04 0.26 0.82 12.62 169.04 | 190K 270K 257K 316K 52826K
 file:///home/johnsoni/pipeline_0.0.39/ACCESS-Pipeline/cwl_tools/expression_tools/group_waltz_files.cwl
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    99 | 0.29 0.49 0.59 2.50 58.11 | 0.14 0.26 0.29 1.04 28.95 | 0.14 0.22 0.29 1.48 29.16 | 135K 135K 135K 136K 13459K
 file:///home/johnsoni/pipeline_0.0.39/ACCESS-Pipeline/cwl_tools/expression_tools/make_sample_output_dirs.cwl
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    11 | 0.34 0.52 0.74 2.63 8.18 | 0.20 0.30 0.41 1.17 4.54 | 0.14 0.20 0.33 1.45 3.65 | 136K 136K 136K 136K 1496K
 file:///home/johnsoni/pipeline_0.0.39/ACCESS-Pipeline/cwl_tools/expression_tools/consolidate_files.cwl
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    8 | 0.31 0.59 0.71 1.80 5.69 | 0.18 0.35 0.37 0.63 2.94 | 0.13 0.27 0.34 1.17 2.75 | 136K 136K 136K 136K 1091K
 file:///home/johnsoni/pipeline_0.0.39/ACCESS-Pipeline/cwl_tools/bwa-mem/bwa-mem.cwl
    Count | Time* | Clock | Wait | Memory
    n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total
    22 | 895.76 3098.13 3587.34 12593.43 78921.51 | 2127.02 7910.31 8123.06 16959.13 178707.34 | -11049.84 -3827.96 -4535.72 19.49 -99785.83 | 5659K 5950K 5854K 6128K 128807K
file:<path to cwl tool>.cwl_<job ID>.log
file:---home-johnsoni-pipeline_1.1.14-ACCESS--Pipeline-cwl_tools-marianas-ProcessLoopUMIFastq.cwl_I-O-jobfGsQQw000.log
WDL IN TOIL
WDL support is still in the alpha phase and should be able to handle basic WDL files. See the specification below for more details.
How to Run a WDL file in Toil
The recommended practice when running WDL files is to first use the Broad's wdltool to validate the syntax and to generate the required JSON input file. Full documentation can be found on the repository, and a precompiled jar binary can be downloaded here: wdltool (this requires Java 7).
ENCODE Example from ENCODE-DCC
To follow this example, you will need Docker installed. The original workflow can be found here: https://github.com/ENCODE-DCC/pipeline-container
    {
      "encode_mapping_workflow.fastqs": "Array[File]",
      "encode_mapping_workflow.trimming_parameter": "String",
      "encode_mapping_workflow.reference": "File"
    }

    {
      "encode_mapping_workflow.fastqs": ["/path/to/unzipped/ENCODE_data/ENCFF000VOL_chr21.fq.gz"],
      "encode_mapping_workflow.trimming_parameter": "native",
      "encode_mapping_workflow.reference": "/path/to/unzipped/ENCODE_data/reference/GRCh38_chr21_bwa.tar.gz"
    }
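The filled-in JSON object above is a plain mapping from fully qualified input names to values, so it can also be generated from a script. The following is a minimal sketch using only the Python standard library. The helper name and the output filename are hypothetical, and the data paths are the placeholders from the example, not real files.

```python
import json
import os
import tempfile


def write_wdl_inputs(path, workflow, inputs):
    """Write a WDL inputs file, prefixing each key with the workflow name."""
    qualified = {f"{workflow}.{name}": value for name, value in inputs.items()}
    with open(path, "w") as handle:
        json.dump(qualified, handle, indent=2)
    return qualified


data_dir = "/path/to/unzipped/ENCODE_data"  # placeholder from the example above
out_path = os.path.join(tempfile.mkdtemp(), "encode_mapping_workflow.json")
written = write_wdl_inputs(
    out_path,
    "encode_mapping_workflow",
    {
        "fastqs": [f"{data_dir}/ENCFF000VOL_chr21.fq.gz"],
        "trimming_parameter": "native",
        "reference": f"{data_dir}/reference/GRCh38_chr21_bwa.tar.gz",
    },
)
print(json.dumps(written, indent=2))
```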
GATK Examples from the Broad
Simple examples of WDL can be found on the Broad's website as tutorials: https://software.broadinstitute.org/wdl/documentation/topic?name=wdl-tutorials.
- •
- Absolute filepath inputs are recommended for local testing.
toilwdl.py Options
'-o' or '--outdir': Specifies the output folder. Defaults to the current working directory if not specified by the user.
Running WDL within Toil Scripts
NOTE: A cromwell.jar file is needed in order to run a WDL workflow.
    import os
    import subprocess
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    def initialize_jobs(job):
        job.fileStore.logToMaster("initialize_jobs")

    def runQC(job, wdl_file, wdl_filename, json_file, json_filename, outputs_dir, jar_loc, output_num):
        job.fileStore.logToMaster("runQC")
        tempDir = job.fileStore.getLocalTempDir()

        wdl = job.fileStore.readGlobalFile(wdl_file, userPath=os.path.join(tempDir, wdl_filename))
        json = job.fileStore.readGlobalFile(json_file, userPath=os.path.join(tempDir, json_filename))

        subprocess.check_call(["java", "-jar", jar_loc, "run", wdl, "--inputs", json])

        output_filename = "output.txt"
        output_file = job.fileStore.writeGlobalFile(outputs_dir + output_filename)
        job.fileStore.readGlobalFile(output_file, userPath=os.path.join(outputs_dir, "sample_" + output_num + "_" + output_filename))
        return output_file

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_wdlexample")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        with Toil(options) as toil:
            # specify the folder where the wdl and json files live
            inputs_dir = "wdlExampleFiles/"
            # specify where you wish the outputs to be written
            outputs_dir = "wdlExampleFiles/"
            # specify the location of your cromwell jar
            jar_loc = os.path.abspath("wdlExampleFiles/cromwell-35.jar")

            job0 = Job.wrapJobFn(initialize_jobs)

            wdl_filename = "hello.wdl"
            wdl_file = toil.importFile("file://" + os.path.abspath(os.path.join(inputs_dir, wdl_filename)))

            # add list of yml config inputs here or import and construct from file
            json_files = ["hello1.json", "hello2.json", "hello3.json"]
            i = 0
            for json in json_files:
                i = i + 1
                json_file = toil.importFile("file://" + os.path.join(inputs_dir, json))
                json_filename = json
                job = Job.wrapJobFn(runQC, wdl_file, wdl_filename, json_file, json_filename, outputs_dir, jar_loc, output_num=str(i))
                job0.addChild(job)

            toil.start(job0)
WDL Specifications
WDL language specifications can be found here: https://github.com/broadinstitute/wdl/blob/develop/SPEC.md
- CURRENTLY IMPLEMENTED:
- •
- Scatter
- •
- Many Built-In Functions
- •
- Docker Calls
- •
- Handles Priority, and Output File Wrangling
- •
- Currently Handles Primitives and Arrays
- TO BE IMPLEMENTED:
- •
- Integrate Cloud Autoscaling Capacity More Robustly
- •
- WDL Files That "Import" Other WDL Files (Including URI Handling for 'http://' and 'https://')
WORKFLOW EXECUTION SERVICE (WES)
The GA4GH Workflow Execution Service (WES) is a standardized API for submitting and monitoring workflows. Toil has experimental support for setting up a WES server and executing CWL, WDL, and Toil workflows using the WES API. More information about the WES API specification can be found here.
Preparing your WES environment
The WES server requires Celery to distribute and execute workflows. To set up Celery:
- 1.
- Start RabbitMQ, which is the broker between the WES server and Celery workers:
docker run -d --name wes-rabbitmq -p 5672:5672 rabbitmq:3.9.5
- 2.
- Start Celery workers:
celery -A toil.server.celery_app worker --loglevel=INFO
Starting a WES server
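To start a WES server on the default port 8080, run the Toil command:
    $ toil server
The WES API will then be hosted at the following URL:
    http://localhost:8080/ga4gh/wes/v1
To use another port, for example 3000, specify the --port argument:
    $ toil server --port 3000
For the full list of optional arguments, see the help message:
    $ toil server --help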
- --debug
- Enable debug mode.
- --bypass_celery
- Skip sending workflows to Celery and just run them under the server. For testing.
- --host HOST
- The host interface that the Toil server binds on. (default: "127.0.0.1").
- --port PORT
- The port that the Toil server listens on. (default: 8080).
- --swagger_ui
- If True, the swagger UI will be enabled and hosted on the {api_base_path}/ui endpoint. (default: False)
- --cors
- Enable Cross Origin Resource Sharing (CORS). This should only be turned on if the server is intended to be used by a website or domain. (default: False).
- --cors_origins CORS_ORIGIN
- Ignored if --cors is False. This sets the allowed origins for CORS. For details about CORS and its security risks, see the GA4GH docs on CORS. (default: "*").
- --workers WORKERS, -w WORKERS
- Ignored if --debug is True. The number of worker processes launched by the WSGI server. (default: 2).
- --work_dir WORK_DIR
- The directory where workflows should be stored. This directory should be empty or only contain previous workflows. (default: './workflows').
- --state_store STATE_STORE
- The local path or S3 URL where workflow state metadata should be stored. (default: in --work_dir)
- --opt OPT, -o OPT
- Specify the default parameters to be sent to the workflow engine for each run. Options taking arguments must use = syntax. Accepts multiple values. Example: --opt=--logLevel=CRITICAL --opt=--workDir=/tmp.
- --dest_bucket_base DEST_BUCKET_BASE
- Direct CWL workflows to save output files to dynamically generated unique paths under the given URL. Supports AWS S3.
- --wes_dialect DIALECT
- Restrict WES responses to a dialect compatible with clients that do not fully implement the WES standard. (default: 'standard')
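Because every engine option passed through --opt must use the = form, it can be convenient to build the command line from a mapping rather than by hand. The sketch below assumes a hypothetical helper name and uses only the --port and --opt flags documented above; it builds the argument list and does not start a server.

```python
import shlex


def toil_server_command(port, engine_opts):
    """Build a `toil server` argv with one --opt=KEY=VALUE per engine option."""
    argv = ["toil", "server", "--port", str(port)]
    for key, value in engine_opts.items():
        argv.append(f"--opt={key}={value}")
    return argv


argv = toil_server_command(3000, {"--logLevel": "CRITICAL", "--workDir": "/tmp"})
# shlex.join quotes arguments where needed so the line can be pasted into a shell.
print(shlex.join(argv))
```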
Running the Server with docker-compose
Instead of manually setting up the server components (toil server, RabbitMQ, and Celery), you can use the following docker-compose.yml file to orchestrate and link them together.
    # docker-compose.yml
    version: "3.8"

    services:
      rabbitmq:
        image: rabbitmq:3.9.5
        hostname: rabbitmq
      celery:
        image: ${TOIL_APPLIANCE_SELF}
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
          - /var/lib/docker:/var/lib/docker
          - /var/lib/toil:/var/lib/toil
          - /var/lib/cwl:/var/lib/cwl
          - /tmp/toil-workflows:/tmp/toil-workflows
        command: celery --broker=amqp://guest:guest@rabbitmq:5672// -A toil.server.celery_app worker --loglevel=INFO
        depends_on:
          - rabbitmq
      wes-server:
        image: ${TOIL_APPLIANCE_SELF}
        volumes:
          - /tmp/toil-workflows:/tmp/toil-workflows
        environment:
          - TOIL_WES_BROKER_URL=amqp://guest:guest@rabbitmq:5672//
        command: toil server --host 0.0.0.0 --port 8000 --work_dir /tmp/toil-workflows
        expose:
          - 8000
        labels:
          - "traefik.enable=true"
          - "traefik.http.routers.wes.rule=Host(`localhost`)"
          - "traefik.http.routers.wes.entrypoints=web"
          - "traefik.http.routers.wes.middlewares=auth"
          - "traefik.http.middlewares.auth.basicauth.users=test:$$2y$$12$$ci.4U63YX83CwkyUrjqxAucnmi2xXOIlEF6T/KdP9824f1Rf1iyNG"
          - "traefik.http.routers.wespublic.rule=Host(`localhost`) && Path(`/ga4gh/wes/v1/service-info`)"
        depends_on:
          - rabbitmq
          - celery
      traefik:
        image: traefik:v2.2
        command:
          - "--providers.docker"
          - "--providers.docker.exposedbydefault=false"
          - "--entrypoints.web.address=:8080"
        ports:
          - "8080:8080"
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
NOTE: docker-compose is not installed on the Toil appliance by default. See the following section to set up the WES server on a Toil cluster.
Running on a Toil cluster
To run the server on a Toil leader instance on EC2:
- 1.
- Launch a Toil cluster with the toil launch-cluster command with the AWS provisioner
- 2.
- SSH into your cluster with the --sshOption=-L8080:localhost:8080 option to forward port 8080
- 3.
- Install Docker Compose by running the following commands from the Docker docs:
    curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
    chmod +x /usr/local/bin/docker-compose

    # check installation
    docker-compose --version
- 4.
- Copy the docker-compose.yml file from (Running the Server with docker-compose) to an empty directory, and modify the configuration as needed.
- 5.
- Now, run docker-compose up -d to start the WES server in detached mode on the Toil appliance.
- 6.
- To stop the server, run docker-compose down.
WES API Endpoints
As defined by the GA4GH WES API specification, the following endpoints with base path ga4gh/wes/v1/ are supported by Toil:
GET /service-info | Get information about the Workflow Execution Service. |
GET /runs | List the workflow runs. |
POST /runs | Run a workflow. This endpoint creates a new workflow run and returns a run_id to monitor its progress. |
GET /runs/{run_id} | Get detailed info about a workflow run. |
POST /runs/{run_id}/cancel | Cancel a running workflow. |
GET /runs/{run_id}/status | Get the status (overall state) of a workflow run. |
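Each endpoint in the table is reached by appending its path to the server address and the base path. The following sketch shows one way to assemble those URLs with the standard library. The helper name is hypothetical, the host assumes a server on the default port, and the run ID is only an example value.

```python
from urllib.parse import quote

BASE_PATH = "/ga4gh/wes/v1"


def wes_url(host, *parts):
    """Join a WES server address, the base path, and URL-quoted path segments."""
    segments = "/".join(quote(str(part), safe="") for part in parts)
    return f"{host.rstrip('/')}{BASE_PATH}/{segments}"


host = "http://localhost:8080"  # assumes the default port
run_id = "4deb8beb24894e9eb7c74b0f010305d1"  # example value only

print(wes_url(host, "service-info"))
print(wes_url(host, "runs"))
print(wes_url(host, "runs", run_id, "status"))
```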
Submitting a Workflow
Now that the WES API is up and running, we can submit and monitor workflows remotely using the WES API endpoints. A workflow can be submitted for execution using the POST /runs endpoint.
    # example.cwl
    cwlVersion: v1.0
    class: CommandLineTool
    baseCommand: echo
    stdout: output.txt
    inputs:
      message:
        type: string
        inputBinding:
          position: 1
    outputs:
      output:
        type: stdout

    $ curl --location --request POST 'http://localhost:8080/ga4gh/wes/v1/runs' \
        --user test:test \
        --form 'workflow_url="example.cwl"' \
        --form 'workflow_type="cwl"' \
        --form 'workflow_type_version="v1.0"' \
        --form 'workflow_params="{\"message\": \"Hello world!\"}"' \
        --form 'workflow_attachment=@"./toil_test_files/example.cwl"'
    {
      "run_id": "4deb8beb24894e9eb7c74b0f010305d1"
    }
workflow_url | The URL of the workflow to run. This can refer to a file from workflow_attachment. |
workflow_type | The type of workflow language. Toil currently supports one of the following: "CWL", "WDL", or "py". To run a Toil native python script, set this to "py". |
workflow_type_version | The version of the workflow language. Supported versions can be found by accessing the GET /service-info endpoint of your WES server. |
workflow_params | A JSON object that specifies the inputs of the workflow. |
workflow_attachment | A list of files associated with the workflow run. |
workflow_engine_parameters | A JSON key-value map of workflow engine parameters to send to the runner. Example: {"--logLevel": "INFO", "--workDir": "/tmp/"} |
tags | A JSON key-value map of metadata associated with the workflow. |
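Several of the parameters above are JSON documents sent as form fields, which is easy to get wrong when quoting by hand. The sketch below assembles the text fields of a POST /runs request as a dictionary of strings. The function name is hypothetical, and file uploads for workflow_attachment are deliberately left out because they are sent as separate multipart file parts.

```python
import json


def build_run_request(workflow_url, workflow_type, version, params,
                      engine_params=None, tags=None):
    """Return the text form fields for POST /runs as a dict of strings."""
    fields = {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": version,
        "workflow_params": json.dumps(params),
    }
    # Optional fields are JSON key-value maps, serialized the same way.
    if engine_params:
        fields["workflow_engine_parameters"] = json.dumps(engine_params)
    if tags:
        fields["tags"] = json.dumps(tags)
    return fields


fields = build_run_request(
    "example.cwl", "cwl", "v1.0", {"message": "Hello world!"},
    engine_params={"--logLevel": "INFO", "--workDir": "/tmp/"},
)
for name, value in fields.items():
    print(f"{name} = {value}")
```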
Upload multiple files
Looking at the body of the request of the previous example, note that the workflow_url is a relative URL that refers to the example.cwl file uploaded from the local path ./toil_test_files/example.cwl.
    $ curl --location --request POST 'http://localhost:8080/ga4gh/wes/v1/runs' \
        --user test:test \
        --form 'workflow_url="example.cwl"' \
        --form 'workflow_type="cwl"' \
        --form 'workflow_type_version="v1.0"' \
        --form 'workflow_params="{\"message\": \"Hello world!\"}"' \
        --form 'workflow_attachment=@"./toil_test_files/example.cwl"' \
        --form 'workflow_attachment=@"./toil_test_files/2.fasta";filename=inputs/test.fasta' \
        --form 'workflow_attachment=@"./toil_test_files/2.fastq";filename=inputs/test.fastq'
    execution/
    ├── example.cwl
    ├── inputs
    │   ├── test.fasta
    │   └── test.fastq
    └── wes_inputs.json
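The directory listing above suggests that each attachment is stored under the filename given in the request, relative to the execution directory. Under that assumption, the following sketch predicts where a set of attachment names would land. The function is illustrative only and is not part of Toil; the check for unsafe names is an extra precaution added for the example.

```python
from pathlib import PurePosixPath


def staged_paths(attachment_names, root="execution"):
    """Map each attachment filename to its path under the execution directory."""
    staged = []
    for name in attachment_names:
        relative = PurePosixPath(name)
        # Reject names that would land outside the execution directory.
        if relative.is_absolute() or ".." in relative.parts:
            raise ValueError(f"unsafe attachment name: {name}")
        staged.append(str(PurePosixPath(root) / relative))
    return staged


print(staged_paths(["example.cwl", "inputs/test.fasta", "inputs/test.fastq"]))
```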
Specify Toil options
To pass Toil-specific parameters to the workflow, you can include the workflow_engine_parameters parameter along with your request.
    {"--logLevel": "INFO", "--workDir": "/tmp/"}
Monitoring a Workflow
With the run_id returned when submitting the workflow, we can check the status or get the full logs of the workflow run.
Checking the state
The GET /runs/{run_id}/status endpoint can be used to get a simple result with the overall state of your run:
    $ curl --user test:test http://localhost:8080/ga4gh/wes/v1/runs/4deb8beb24894e9eb7c74b0f010305d1/status
    {
      "run_id": "4deb8beb24894e9eb7c74b0f010305d1",
      "state": "RUNNING"
    }
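In practice the status endpoint is usually polled until the run finishes. The sketch below shows a polling loop that stops at one of the terminal states named in the GA4GH WES specification. The function name and parameters are hypothetical, and the HTTP request is replaced by a stand-in callable so the example runs without a server.

```python
import time

# Terminal states defined by the GA4GH WES specification.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def wait_for_run(get_status, run_id, interval=5.0, max_polls=100, sleep=time.sleep):
    """Poll get_status(run_id) until the run reaches a terminal state."""
    for _ in range(max_polls):
        state = get_status(run_id)["state"]
        if state in TERMINAL_STATES:
            return state
        sleep(interval)
    raise TimeoutError(f"run {run_id} still not finished after {max_polls} polls")


# Stand-in for the HTTP call: replays a fixed sequence of responses.
responses = iter([{"state": "QUEUED"}, {"state": "RUNNING"}, {"state": "COMPLETE"}])
final = wait_for_run(
    lambda run_id: next(responses),
    "4deb8beb24894e9eb7c74b0f010305d1",
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(final)
```

With a real server, get_status would issue the GET request shown above and return the parsed JSON body.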
Getting the full logs
To get the detailed information about a workflow run, use the GET /runs/{run_id} endpoint:
    $ curl --user test:test http://localhost:8080/ga4gh/wes/v1/runs/4deb8beb24894e9eb7c74b0f010305d1
    {
      "run_id": "4deb8beb24894e9eb7c74b0f010305d1",
      "request": {
        "workflow_attachment": [
          "example.cwl"
        ],
        "workflow_url": "example.cwl",
        "workflow_type": "cwl",
        "workflow_type_version": "v1.0",
        "workflow_params": {
          "message": "Hello world!"
        }
      },
      "state": "RUNNING",
      "run_log": {
        "cmd": [
          "toil-cwl-runner --outdir=/home/toil/workflows/4deb8beb24894e9eb7c74b0f010305d1/outputs --jobStore=file:/home/toil/workflows/4deb8beb24894e9eb7c74b0f010305d1/toil_job_store /home/toil/workflows/4deb8beb24894e9eb7c74b0f010305d1/execution/example.cwl /home/workflows/4deb8beb24894e9eb7c74b0f010305d1/execution/wes_inputs.json"
        ],
        "start_time": "2021-08-30T17:35:50Z",
        "end_time": null,
        "stdout": null,
        "stderr": null,
        "exit_code": null
      },
      "task_logs": [],
      "outputs": {}
    }
Canceling a run
To cancel a workflow run, use the POST /runs/{run_id}/cancel endpoint:
    $ curl --location --request POST 'http://localhost:8080/ga4gh/wes/v1/runs/4deb8beb24894e9eb7c74b0f010305d1/cancel' \
        --user test:test
    {
      "run_id": "4deb8beb24894e9eb7c74b0f010305d1"
    }
DEVELOPING A WORKFLOW
This tutorial walks through the features of Toil necessary for developing a workflow using the Toil Python API.
NOTE: "script" and "workflow" will be used interchangeably.
Scripting Quick Start
To begin, consider this short Toil script, which illustrates defining a workflow:
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    def helloWorld(message, memory="2G", cores=2, disk="3G"):
        return f"Hello, world!, here's a message: {message}"

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_quickstart")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "OFF"
        options.clean = "always"

        hello_job = Job.wrapFn(helloWorld, "Woot")

        with Toil(options) as toil:
            print(toil.start(hello_job))  # prints "Hello, world!, ..."
Job Basics
The atomic unit of work in a Toil workflow is a Job. User scripts inherit from this base class to define units of work. For example, here is a more long-winded class-based version of the job in the quick start example:
    from toil.job import Job

    class HelloWorld(Job):
        def __init__(self, message):
            Job.__init__(self, memory="2G", cores=2, disk="3G")
            self.message = message

        def run(self, fileStore):
            return f"Hello, world! Here's a message: {self.message}"

    ...
        def run(self, fileStore):
            self.log(f"Hello, world! Here's a message: {self.message}")
Invoking a Workflow
We can add to the previous example to turn it into a complete workflow by adding the necessary function calls to create an instance of HelloWorld and to run this as a workflow containing a single job. For example:
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    class HelloWorld(Job):
        def __init__(self, message):
            Job.__init__(self, memory="2G", cores=2, disk="3G")
            self.message = message

        def run(self, fileStore):
            return f"Hello, world!, here's a message: {self.message}"

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_invokeworkflow")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "OFF"
        options.clean = "always"

        hello_job = HelloWorld("Woot")

        with Toil(options) as toil:
            print(toil.start(hello_job))
NOTE: Do not include a . in the name of your Python script (besides .py at the end). This allows Toil to import the types and functions defined in your file while starting a new process.
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    class HelloWorld(Job):
        def __init__(self, message):
            Job.__init__(self, memory="2G", cores=2, disk="3G")
            self.message = message

        def run(self, fileStore):
            return f"Hello, world!, I have a message: {self.message}"

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_invokeworkflow2")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        with Toil(options) as toil:
            if not toil.options.restart:
                job = HelloWorld("Woot!")
                output = toil.start(job)
            else:
                output = toil.restart()
        print(output)
Specifying Commandline Arguments
To allow command-line control of the options, we can use the toil.job.Job.Runner.getDefaultArgumentParser() method to create an argparse.ArgumentParser object, which can be used to parse command-line options for a Toil script. For example:
    from toil.common import Toil
    from toil.job import Job

    class HelloWorld(Job):
        def __init__(self, message):
            Job.__init__(self, memory="2G", cores=2, disk="3G")
            self.message = message

        def run(self, fileStore):
            return "Hello, world!, here's a message: %s" % self.message

    if __name__ == "__main__":
        parser = Job.Runner.getDefaultArgumentParser()
        options = parser.parse_args()
        options.logLevel = "OFF"
        options.clean = "always"

        hello_job = HelloWorld("Woot")

        with Toil(options) as toil:
            print(toil.start(hello_job))
Resuming a Workflow
If a workflow fails, whether because of a programming error in the jobs being run or because of node failure, it can be resumed. The only case in which a workflow cannot be reliably resumed is when the job store itself becomes corrupt.
Functions and Job Functions
Defining jobs by creating class definitions generally involves the boilerplate of creating a constructor. To avoid this, the classes toil.job.FunctionWrappingJob and toil.job.JobFunctionWrappingJob allow functions to be directly converted to jobs. For example, the quick start example (repeated here):
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    def helloWorld(message, memory="2G", cores=2, disk="3G"):
        return f"Hello, world!, here's a message: {message}"

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_quickstart")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "OFF"
        options.clean = "always"

        hello_job = Job.wrapFn(helloWorld, "Woot")

        with Toil(options) as toil:
            print(toil.start(hello_job))  # prints "Hello, world!, ..."
Job.wrapFn(helloWorld, "Woot")
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    def helloWorld(job, message):
        job.log(f"Hello world, I have a message: {message}")

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_jobfunctions")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        hello_job = Job.wrapJobFn(helloWorld, "Woot!")

        with Toil(options) as toil:
            toil.start(hello_job)
hello_job = Job.wrapJobFn(helloWorld, "Woot")
Workflows with Multiple Jobs
A parent job can have child jobs and follow-on jobs. These relationships are specified by methods of the job class, e.g. toil.job.Job.addChild() and toil.job.Job.addFollowOn().
    from toil.common import Toil
    from toil.job import Job

    def helloWorld(job, message, memory="2G", cores=2, disk="3G"):
        job.log(f"Hello world, I have a message: {message}")

    if __name__ == "__main__":
        parser = Job.Runner.getDefaultArgumentParser()
        options = parser.parse_args()
        options.logLevel = "INFO"
        options.clean = "always"

        j1 = Job.wrapJobFn(helloWorld, "first")
        j2 = Job.wrapJobFn(helloWorld, "second or third")
        j3 = Job.wrapJobFn(helloWorld, "second or third")
        j4 = Job.wrapJobFn(helloWorld, "last")

        j1.addChild(j2)
        j1.addChild(j3)
        j1.addFollowOn(j4)

        with Toil(options) as toil:
            toil.start(j1)

    from toil.common import Toil
    from toil.job import Job

    def helloWorld(job, message, memory="2G", cores=2, disk="3G"):
        job.log(f"Hello world, I have a message: {message}")

    if __name__ == "__main__":
        parser = Job.Runner.getDefaultArgumentParser()
        options = parser.parse_args()
        options.logLevel = "INFO"
        options.clean = "always"

        j1 = Job.wrapJobFn(helloWorld, "first")
        j2 = j1.addChildJobFn(helloWorld, "second or third")
        j3 = j1.addChildJobFn(helloWorld, "second or third")
        j4 = j1.addFollowOnJobFn(helloWorld, "last")

        with Toil(options) as toil:
            toil.start(j1)

    from toil.common import Toil
    from toil.job import Job

    def helloWorld(job, message, memory="2G", cores=2, disk="3G"):
        job.log(f"Hello world, I have a message: {message}")

    if __name__ == "__main__":
        parser = Job.Runner.getDefaultArgumentParser()
        options = parser.parse_args()
        options.logLevel = "INFO"
        options.clean = "always"

        j1 = Job.wrapJobFn(helloWorld, "first")
        j2 = j1.addChildJobFn(helloWorld, "second or third")
        j3 = j1.addChildJobFn(helloWorld, "second or third")
        j4 = j2.addChildJobFn(helloWorld, "last")
        j3.addChild(j4)

        with Toil(options) as toil:
            toil.start(j1)
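The messages in these examples ("first", "second or third", "last") reflect the ordering rule: a job's children run after the job itself, and its follow-ons run after the children and all of their successors. The toy model below illustrates that rule for a simple tree of jobs. ToyJob and serial_order are stand-ins written for this illustration and are not the Toil API. Real children may run in parallel, so this prints only one valid serial order.

```python
class ToyJob:
    """Stand-in for a job; NOT the Toil API, just enough to show ordering."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self.follow_ons = []

    def add_child(self, job):
        self.children.append(job)
        return job

    def add_follow_on(self, job):
        self.follow_ons.append(job)
        return job


def serial_order(job):
    """One valid serial order: the job, its children's subtrees, then follow-ons."""
    order = [job.name]
    for child in job.children:
        order.extend(serial_order(child))
    for follow_on in job.follow_ons:
        order.extend(serial_order(follow_on))
    return order


j1 = ToyJob("first")
j1.add_child(ToyJob("second or third (a)"))
j1.add_child(ToyJob("second or third (b)"))
j1.add_follow_on(ToyJob("last"))
print(serial_order(j1))
```

The model handles trees only; it does not cover a job with several parents, as in the last example above.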
Dynamic Job Creation
The previous examples show a workflow being defined outside of a job. However, Toil also allows jobs to be created dynamically within jobs. For example:
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    def binaryStringFn(job, depth, message=""):
        if depth > 0:
            job.addChildJobFn(binaryStringFn, depth-1, message + "0")
            job.addChildJobFn(binaryStringFn, depth-1, message + "1")
        else:
            job.log(f"Binary string: {message}")

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_dynamic")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        with Toil(options) as toil:
            toil.start(Job.wrapJobFn(binaryStringFn, depth=5))
Promises
The previous example of dynamic job creation shows variables from a parent job being passed to a child job. Such forward variable passing is naturally specified by recursive invocation of successor jobs within parent jobs. This can also be achieved statically by passing around references to the return variables of jobs. In Toil this is achieved with promises, as illustrated in the following example:
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    def fn(job, i):
        job.log("i is: %s" % i, level=100)
        return i + 1

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_promises")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        j1 = Job.wrapJobFn(fn, 1)
        j2 = j1.addChildJobFn(fn, j1.rv())
        j3 = j1.addFollowOnJobFn(fn, j2.rv())

        with Toil(options) as toil:
            toil.start(j1)

    j2 = j1.addChildJobFn(fn, j1.rv())

    def parent(job):
        indexable = Job.wrapJobFn(fn)
        job.addChild(indexable)
        job.addFollowOnFn(raiseWrap, indexable.rv(2))

    def raiseWrap(arg):
        raise RuntimeError(arg)  # raises "2"

    def fn(job):
        return (0, 1, 2, 3)

    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    def binaryStrings(job, depth, message=""):
        if depth > 0:
            s = [job.addChildJobFn(binaryStrings, depth - 1, message + "0").rv(),
                 job.addChildJobFn(binaryStrings, depth - 1, message + "1").rv()]
            return job.addFollowOnFn(merge, s).rv()
        return [message]

    def merge(strings):
        return strings[0] + strings[1]

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_promises2")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "OFF"
        options.clean = "always"

        with Toil(options) as toil:
            print(toil.start(Job.wrapJobFn(binaryStrings, depth=5)))
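To see what the binaryStrings workflow computes without running Toil, the same recursion can be written in plain Python. In this sketch ordinary function calls stand in for child jobs and list concatenation stands in for the merge follow-on. It mirrors the logic only; it does not reproduce Toil's scheduling or promise machinery.

```python
def binary_strings(depth, message=""):
    """Plain-Python mirror of binaryStrings: recursion replaces child jobs."""
    if depth > 0:
        left = binary_strings(depth - 1, message + "0")
        right = binary_strings(depth - 1, message + "1")
        return left + right  # what the merge follow-on does with the two promises
    return [message]


result = binary_strings(5)
print(len(result), result[:3], result[-1])
# prints: 32 ['00000', '00001', '00010'] 11111
```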
Promised Requirements
Promised requirements are a special case of Promises that allow a job's return value to be used as another job's resource requirements.
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job, PromisedRequirement

    def parentJob(job):
        downloadJob = Job.wrapJobFn(stageFn, "file://" + os.path.realpath(__file__), cores=0.1, memory='32M', disk='1M')
        job.addChild(downloadJob)

        analysis = Job.wrapJobFn(analysisJob, fileStoreID=downloadJob.rv(0), disk=PromisedRequirement(downloadJob.rv(1)))
        job.addFollowOn(analysis)

    def stageFn(job, url, cores=1):
        importedFile = job.fileStore.import_file(url)
        return importedFile, importedFile.size

    def analysisJob(job, fileStoreID, cores=2):
        # now do some analysis on the file
        pass

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_requirements")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        with Toil(options) as toil:
            toil.start(Job.wrapJobFn(parentJob))

    def parentJob(job):
        aggregator = []
        for fileNum in range(0, 10):
            downloadJob = Job.wrapJobFn(stageFn, "file://" + os.path.realpath(__file__), cores=0.1, memory='32M', disk='1M')
            job.addChild(downloadJob)
            aggregator.append(downloadJob)

        analysis = Job.wrapJobFn(analysisJob, fileStoreID=downloadJob.rv(0),
                                 disk=PromisedRequirement(lambda xs: sum(xs), [j.rv(1) for j in aggregator]))
        job.addFollowOn(analysis)
- Limitations
- Just like regular promises, the return value must be determined prior to scheduling any job that depends on the return value. In the example above, notice that the dependent jobs are follow-ons of the parent, while the promising jobs are children of the parent. This ordering ensures that all promises are properly fulfilled.
FileID
The toil.fileStores.FileID class is a small wrapper around Python's builtin string class. It is used to represent a file's ID in the file store, and has a size attribute that is the file's size in bytes. This object is returned by importFile and writeGlobalFile.
Managing files within a workflow
A workflow frequently needs to create files, both persistent and temporary, during its run. The toil.fileStores.abstractFileStore.AbstractFileStore class is used by jobs to manage these files in a manner that guarantees cleanup and resumption on failure.
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    class LocalFileStoreJob(Job):
        def run(self, fileStore):
            # self.tempDir will always contain the name of a directory within the allocated disk space reserved for the job
            scratchDir = self.tempDir

            # Similarly create a temporary file.
            scratchFile = fileStore.getLocalTempFile()

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_managing")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        # Create an instance of LocalFileStoreJob which will have at least 2 gigabytes of storage space.
        j = LocalFileStoreJob(disk="2G")

        # Run the workflow
        with Toil(options) as toil:
            toil.start(j)

    def localFileStoreJobFn(job):
        scratchDir = job.tempDir
        scratchFile = job.fileStore.getLocalTempFile()

    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    def globalFileStoreJobFn(job):
        job.log("The following example exercises all the methods provided "
                "by the toil.fileStores.abstractFileStore.AbstractFileStore class")

        # Create a local temporary file.
        scratchFile = job.fileStore.getLocalTempFile()

        # Write something in the scratch file.
        with open(scratchFile, 'w') as fH:
            fH.write("What a tangled web we weave")

        # Write a copy of the file into the file-store; fileID is the key that can be used to retrieve the file.
        # This write is asynchronous by default
        fileID = job.fileStore.writeGlobalFile(scratchFile)

        # Write another file using a stream; fileID2 is the
        # key for this second file.
        with job.fileStore.writeGlobalFileStream(cleanup=True) as (fH, fileID2):
            fH.write(b"Out brief candle")

        # Now read the first file; scratchFile2 is a local copy of the file that is read-only by default.
        scratchFile2 = job.fileStore.readGlobalFile(fileID)

        # Read the second file to a desired location: scratchFile3.
        scratchFile3 = os.path.join(job.tempDir, "foo.txt")
        job.fileStore.readGlobalFile(fileID2, userPath=scratchFile3)

        # Read the second file again using a stream.
        with job.fileStore.readGlobalFileStream(fileID2) as fH:
            print(fH.read())  # This prints "Out brief candle"

        # Delete the first file from the global file-store.
        job.fileStore.deleteGlobalFile(fileID)

        # It is unnecessary to delete the file keyed by fileID2 because we used the cleanup flag,
        # which removes the file after this job and all its successors have run (if the file still exists)

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_managing2")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        with Toil(options) as toil:
            toil.start(Job.wrapJobFn(globalFileStoreJobFn))
Staging of Files into the Job Store
External files can be imported into or exported out of the job store prior to running a workflow when the toil.common.Toil context manager is used on the leader. The context manager provides the methods toil.common.Toil.importFile() and toil.common.Toil.exportFile() for this purpose. The destination and source locations of such files are described with URLs passed to the two methods. Local files can be imported and exported as relative paths, which are resolved against the directory from which the Toil workflow is initially run.
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    class HelloWorld(Job):
        def __init__(self, id):
            Job.__init__(self, memory="2G", cores=2, disk="3G")
            self.inputFileID = id

        def run(self, fileStore):
            with fileStore.readGlobalFileStream(self.inputFileID, encoding='utf-8') as fi:
                with fileStore.writeGlobalFileStream(encoding='utf-8') as (fo, outputFileID):
                    fo.write(fi.read() + 'World!')
            return outputFileID

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_staging")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        with Toil(options) as toil:
            # Defined before the branch so that it is also available when restarting.
            ioFileDirectory = os.path.join(os.path.dirname(os.path.abspath(__file__)), "stagingExampleFiles")
            if not toil.options.restart:
                inputFileID = toil.importFile("file://" + os.path.abspath(os.path.join(ioFileDirectory, "in.txt")))
                outputFileID = toil.start(HelloWorld(inputFileID))
            else:
                outputFileID = toil.restart()

            toil.exportFile(outputFileID, "file://" + os.path.abspath(os.path.join(ioFileDirectory, "out.txt")))
Using Docker Containers in Toil
Docker containers are commonly used with Toil. The combination of Toil and Docker allows for pipelines to be fully portable between any platform that has both Toil and Docker installed. Docker eliminates the need for the user to do any other tool installation or environment setup.
    dockerCall(job=job,
               tool='quay.io/ucsc_cgl/bwa',
               workDir=job.tempDir,
               parameters=['index', '/data/reference.fa'])

    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job
    from toil.lib.docker import apiDockerCall

    align = Job.wrapJobFn(apiDockerCall,
                          image='ubuntu',
                          working_dir=os.getcwd(),
                          parameters=['ls', '-lha'])

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_docker")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        with Toil(options) as toil:
            toil.start(align)
entrypoint=["/bin/bash","-c"]
Services
It is sometimes desirable to run services, such as a database or server, concurrently with a workflow. The toil.job.Job.Service class provides a simple mechanism for spawning such a service within a Toil workflow, allowing precise specification of the start and end time of the service, and providing start and end methods to use for initialization and cleanup. The following simple, conceptual example illustrates how services work:
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    class DemoService(Job.Service):
        def start(self, fileStore):
            # Start up a database/service here
            # Return a value that enables another process to connect to the database
            return "loginCredentials"

        def check(self):
            # A function that if it returns False causes the service to quit
            # If it raises an exception the service is killed and an error is reported
            return True

        def stop(self, fileStore):
            # Cleanup the database here
            pass

    j = Job()
    s = DemoService()
    loginCredentialsPromise = j.addService(s)

    def dbFn(loginCredentials):
        # Use the login credentials returned from the service's start method to connect to the service
        pass

    j.addChildFn(dbFn, loginCredentialsPromise)

    if __name__ == "__main__":
        jobstore: str = tempfile.mkdtemp("tutorial_services")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        with Toil(options) as toil:
            toil.start(j)
Checkpoints
Services complicate resuming a workflow after failure, because they can create complex dependencies between jobs. For example, consider a service that provides a database that multiple jobs update. If the database service fails and loses state, it is not clear that just restarting the service will allow the workflow to be resumed, because jobs that created that state may have already finished. To get around this problem Toil supports checkpoint jobs, specified as the boolean keyword argument checkpoint to a job or wrapped function, e.g.:

    j = Job(checkpoint=True)
Encapsulation
Let A be a root job potentially with children and follow-ons. Without an encapsulated job the simplest way to specify a job B which runs after A and all its successors is to create a parent of A, call it Ap, and then make B a follow-on of Ap. e.g.:

    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    if __name__ == "__main__":
        # A is a job with children and follow-ons, for example:
        A = Job()
        A.addChild(Job())
        A.addFollowOn(Job())

        # B is a job which needs to run after A and its successors
        B = Job()

        # The way to do this without encapsulation is to make a parent of A, Ap, and make B a follow-on of Ap.
        Ap = Job()
        Ap.addChild(A)
        Ap.addFollowOn(B)

        jobstore: str = tempfile.mkdtemp("tutorial_encapsulations")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        with Toil(options) as toil:
            print(toil.start(Ap))
    import os
    import tempfile

    from toil.common import Toil
    from toil.job import Job

    if __name__ == "__main__":
        # A
        A = Job()
        A.addChild(Job())
        A.addFollowOn(Job())

        # Encapsulate A
        A = A.encapsulate()

        # B is a job which needs to run after A and its successors
        B = Job()

        # With encapsulation A and its successor subgraph appear to be a single job, hence:
        A.addChild(B)

        jobstore: str = tempfile.mkdtemp("tutorial_encapsulations2")
        os.rmdir(jobstore)
        options = Job.Runner.getDefaultOptions(jobstore)
        options.logLevel = "INFO"
        options.clean = "always"

        with Toil(options) as toil:
            print(toil.start(A))
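The ordering that the wrapper-parent pattern relies on can be shown with a toy model. This is a minimal sketch under stated assumptions: ToyJob and run_order are hypothetical, execution is serial, and a job's follow-ons run only after the job's children and all their successors have finished. It does not use or reproduce toil.job.Job.

```python
from typing import List


class ToyJob:
    """Hypothetical job node with children and follow-ons (not toil.job.Job)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.children: List["ToyJob"] = []
        self.follow_ons: List["ToyJob"] = []

    def add_child(self, job: "ToyJob") -> "ToyJob":
        self.children.append(job)
        return job

    def add_follow_on(self, job: "ToyJob") -> "ToyJob":
        self.follow_ons.append(job)
        return job


def run_order(job: ToyJob) -> List[str]:
    """Run a job, then its children's subtrees, then its follow-ons' subtrees."""
    order = [job.name]
    for child in job.children:
        order.extend(run_order(child))
    for follow_on in job.follow_ons:
        order.extend(run_order(follow_on))
    return order


a = ToyJob("A")
a.add_child(ToyJob("A.child"))
a.add_follow_on(ToyJob("A.followOn"))

# Wrapper parent Ap: A is its child, B is its follow-on.
ap = ToyJob("Ap")
ap.add_child(a)
ap.add_follow_on(ToyJob("B"))

order = run_order(ap)
print(order)  # ['Ap', 'A', 'A.child', 'A.followOn', 'B']
```

B comes last because it is a follow-on of Ap, so it waits for the whole subtree rooted at A.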
Depending on Toil
If you are packaging your workflow(s) as a pip-installable distribution on PyPI, you might be tempted to declare Toil as a dependency in your setup.py, via the install_requires keyword argument to setup(). Unfortunately, this does not work, for two reasons. First, Toil uses Setuptools' extras mechanism to manage its own optional dependencies. If you explicitly declared a dependency on Toil, you would have to hard-code a particular combination of extras (or no extras at all), robbing the user of the choice of which Toil extras to install. Second, and more importantly, declaring a dependency on Toil would only lead to Toil being installed on the leader node of a cluster, but not on the worker nodes. Auto-deployment does not work here because Toil cannot auto-deploy itself: the classic "Which came first, chicken or egg?" problem.
Best Practices for Dockerizing Toil Workflows
Computational Genomics Lab's Dockstore-based production system provides workflow authors a way to run Dockerized versions of their pipeline in an automated, scalable fashion. To be compatible with this system, a workflow should meet the following requirements. In addition to the Docker container, a Common Workflow Language (CWL) descriptor file is needed. For inputs:
- •
- Only command line arguments should be used for configuring the workflow. If the workflow relies on a configuration file, like Toil-RNAseq or ProTECT, a wrapper script inside the Docker container can be used to parse the CLI and generate the necessary configuration file.
- •
- All inputs to the pipeline should be explicitly enumerated rather than implicit. For example, don't rely on one FASTQ read's path to discover the location of its pair. This is necessary since all inputs are mapped to their own isolated directories when the Docker is called via Dockstore.
- •
- All inputs must be documented in the CWL descriptor file. Examples of this file can be seen in both Toil-RNAseq and ProTECT.
For outputs:
- •
- All outputs should be written to a local path rather than S3.
- •
- Take care to package outputs in a local and user-friendly way. For example, don't tar up all output if there are specific files that users will care to see individually.
- •
- All output file names should be deterministic and predictable. For example, don't prepend the name of an output file with PASS/FAIL depending on the outcome of the pipeline.
- •
- All outputs must be documented in the CWL descriptor file. Examples of this file can be seen in both Toil-RNAseq and ProTECT.
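The wrapper-script idea from the input requirements can be sketched as follows. This is an illustration only: the option names (--fastq-r1, --fastq-r2, --output-dir), the report_name key, and the choice of JSON as the configuration format are all assumptions made for the example, not the conventions of Toil-RNAseq or ProTECT.

```python
import argparse
import json
from typing import List, Optional


def build_config(argv: Optional[List[str]] = None) -> dict:
    """Parse explicit CLI inputs and return the configuration as a dict."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline wrapper")
    # Both reads are listed explicitly; the pair is never inferred from a path.
    parser.add_argument("--fastq-r1", required=True, help="Path to read 1")
    parser.add_argument("--fastq-r2", required=True, help="Path to read 2")
    parser.add_argument("--output-dir", default="./output",
                        help="Local directory for outputs (not S3)")
    args = parser.parse_args(argv)
    return {
        "fastq_r1": args.fastq_r1,
        "fastq_r2": args.fastq_r2,
        "output_dir": args.output_dir,
        # Deterministic output name that does not depend on the outcome.
        "report_name": "report.txt",
    }


config = build_config(["--fastq-r1", "s_R1.fq.gz", "--fastq-r2", "s_R2.fq.gz"])
config_text = json.dumps(config, indent=2, sort_keys=True)
print(config_text)
```

Inside a container, the generated text would be written to the configuration file that the pipeline expects before the pipeline itself is launched.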
TOIL CLASS API
The Toil class configures and starts a Toil run.
- class toil.common.Toil(options: Namespace)
- A context manager that represents a Toil workflow. Specifically the batch system, job store, and its configuration.
- __init__(options: Namespace) -> None
- Initialize a Toil object from the given options. Note that this is very light-weight and that the bulk of the work is done when the context is entered.
- Parameters
- options -- command line options specified by the user
- start(rootJob: Job) -> Any
- Invoke a Toil workflow with the given job as the root for an initial run. This method must be called in the body of a with Toil(...) as toil: statement. This method should not be called more than once for a workflow that has not finished.
- Parameters
- rootJob -- The root job of the workflow
- Returns
- The root job's return value
- restart() -> Any
- Restarts a workflow that has been interrupted.
- Returns
- The root job's return value
- classmethod getJobStore(locator: str) -> AbstractJobStore
- Create an instance of the concrete job store implementation that matches the given locator.
- Parameters
- locator (str) -- The location of the job store to be represented by the instance
- Returns
- an instance of a concrete subclass of AbstractJobStore
- static createBatchSystem(config: Config) -> AbstractBatchSystem
- Create an instance of the batch system specified in the given config.
- Parameters
- config -- the current configuration
- Returns
- an instance of a concrete subclass of AbstractBatchSystem
- import_file(src_uri: str, shared_file_name: str, symlink: bool = False) -> None
- import_file(src_uri: str, shared_file_name: None = None, symlink: bool = False) -> FileID
- Import the file at the given URL into the job store. See toil.jobStores.abstractJobStore.AbstractJobStore.importFile() for a full description
- export_file(file_id: FileID, dst_uri: str) -> None
- Export file to destination pointed at by the destination URL. See toil.jobStores.abstractJobStore.AbstractJobStore.exportFile() for a full description
- static normalize_uri(uri: str, check_existence: bool = False) -> str
- Given a URI, if it has no scheme, prepend "file:".
- Parameters
- check_existence -- If set, raise an error if a URI points to a local file that does not exist.
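The documented behavior can be approximated in a few lines. The function below is a sketch, not the library's implementation: normalize_uri_sketch is a hypothetical name, and converting a bare path to an absolute path before adding the scheme is an assumption of this example.

```python
import os
from urllib.parse import urlparse


def normalize_uri_sketch(uri: str, check_existence: bool = False) -> str:
    """Give scheme-less inputs a file: scheme; leave other URIs untouched."""
    parsed = urlparse(uri)
    if parsed.scheme == "":
        # Assumption: a bare path is made absolute before the scheme is added.
        local_path = os.path.abspath(uri)
        normalized = "file://" + local_path
    elif parsed.scheme == "file":
        local_path = parsed.path
        normalized = uri
    else:
        return uri  # remote schemes such as s3: are not checked locally
    if check_existence and not os.path.exists(local_path):
        raise FileNotFoundError(f"Local file does not exist: {local_path}")
    return normalized


remote = normalize_uri_sketch("s3://bucket/key")
local = normalize_uri_sketch("data/reads.fq")
print(remote)                       # s3://bucket/key
print(local.startswith("file://"))  # True
```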
- static getToilWorkDir(configWorkDir: Optional[str] = None) -> str
- Return a path to a writable directory under which per-workflow directories exist. This directory is always required to exist on a machine, even if the Toil worker has not run yet. If your workers and leader have different temp directories, you may need to set TOIL_WORKDIR.
- Parameters
- configWorkDir -- Value passed to the program using the --workDir flag
- Returns
- Path to the Toil work directory, constant across all machines
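One plausible way to resolve such a directory is sketched below. The precedence used here (explicit --workDir value, then the TOIL_WORKDIR environment variable, then the system temporary directory) is an assumption of this example, and pick_work_dir is a hypothetical helper; consult the Toil source for the authoritative rules.

```python
import os
import tempfile
from typing import Mapping, Optional


def pick_work_dir(config_work_dir: Optional[str] = None,
                  env: Optional[Mapping[str, str]] = None) -> str:
    """Choose a work directory: explicit value, then TOIL_WORKDIR, then temp dir."""
    env = os.environ if env is None else env
    # Assumed precedence; the first non-empty candidate wins.
    work_dir = config_work_dir or env.get("TOIL_WORKDIR") or tempfile.gettempdir()
    if not os.path.isdir(work_dir):
        # The directory is required to exist before any worker runs.
        raise RuntimeError(f"Work directory does not exist: {work_dir}")
    return os.path.abspath(work_dir)


demo_dir = tempfile.mkdtemp()
from_flag = pick_work_dir(demo_dir, env={})
from_env = pick_work_dir(None, env={"TOIL_WORKDIR": demo_dir})
fallback = pick_work_dir(None, env={})
print(from_flag == from_env)  # True
```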
- classmethod get_toil_coordination_dir(config_work_dir: Optional[str], config_coordination_dir: Optional[str]) -> str
- Return a path to a writable directory, which will be in memory if convenient. Ought to be used for file locking and coordination.
- Parameters
- •
- config_work_dir -- Value passed to the program using the --workDir flag
- •
- config_coordination_dir -- Value passed to the program using the --coordinationDir flag
- Returns
- Path to the Toil coordination directory. Ought to be on a POSIX filesystem that allows directories containing open files to be deleted.
- classmethod getLocalWorkflowDir(workflowID: str, configWorkDir: Optional[str] = None) -> str
- Return the directory where worker directories and the cache will be located for this workflow on this machine.
- Parameters
- configWorkDir -- Value passed to the program using the --workDir flag
- Returns
- Path to the local workflow directory on this machine
- classmethod get_local_workflow_coordination_dir(workflow_id: str, config_work_dir: Optional[str], config_coordination_dir: Optional[str]) -> str
- Return the directory where coordination files should be located for this workflow on this machine. These include internal Toil databases and lock files for the machine. If an in-memory filesystem is available, it is used. Otherwise, the local workflow directory, which may be on a shared network filesystem, is used.
- Parameters
- •
- workflow_id -- Unique ID of the current workflow.
- •
- config_work_dir -- Value used for the work directory in the current Toil Config.
- •
- config_coordination_dir -- Value used for the coordination directory in the current Toil Config.
- Returns
- Path to the local workflow coordination directory on this machine.
JOB STORE API
The job store interface is an abstraction layer that hides the specific details of file storage, for example standard file systems, S3, etc. The AbstractJobStore API is implemented to support a given file store, e.g. S3. Implement this API to support a new file store.
- class toil.jobStores.abstractJobStore.AbstractJobStore(locator: str)
- Represents the physical storage for the jobs and files in a Toil workflow. JobStores are responsible for storing toil.job.JobDescription (which relate jobs to each other) and files. Actual toil.job.Job objects are stored in files, referenced by JobDescriptions. All the non-file CRUD methods the JobStore provides deal in JobDescriptions and not full, executable Jobs. To actually get ahold of a toil.job.Job, use toil.job.Job.loadJob() with a JobStore and the relevant JobDescription.
- __init__(locator: str) -> None
- Create an instance of the job store. The instance will not be fully functional until either initialize() or resume() is invoked. Note that the destroy() method may be invoked on the object with or without prior invocation of either of these two methods. Takes and stores the locator string for the job store, which will be accessible via self.locator.
- initialize(config: Config) -> None
- Initialize this job store. Create the physical storage for this job store, allocate a workflow ID and persist the given Toil configuration to the store.
- Parameters
- config -- the Toil configuration to initialize this job store with. The given configuration will be updated with the newly allocated workflow ID.
- Raises
- JobStoreExistsException -- if the physical storage for this job store already exists
- write_config() -> None
- Persists the value of the AbstractJobStore.config attribute to the job store, so that it can be retrieved later by other instances of this class.
- resume() -> None
- Connect this instance to the physical storage it represents and load the Toil configuration into the AbstractJobStore.config attribute.
- Raises
- NoSuchJobStoreException -- if the physical storage for this job store doesn't exist
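The initialize()/resume() contract can be illustrated with a toy directory-backed store. Everything in this sketch is hypothetical (ToyDirJobStore, the two exception classes, the config.json file name, and the use of a plain dict as the configuration); it only mirrors the behavior described above and is not Toil's file job store.

```python
import json
import os
import tempfile
import uuid


class JobStoreExistsSketch(Exception):
    """Stand-in for the 'already exists' error raised by initialize()."""


class NoSuchJobStoreSketch(Exception):
    """Stand-in for the 'does not exist' error raised by resume()."""


class ToyDirJobStore:
    """Hypothetical directory-backed store used only for illustration."""

    def __init__(self, locator: str) -> None:
        self.locator = locator
        self.config: dict = {}

    def initialize(self, config: dict) -> None:
        try:
            os.mkdir(self.locator)  # fails if the storage already exists
        except FileExistsError:
            raise JobStoreExistsSketch(self.locator) from None
        config["workflowID"] = str(uuid.uuid4())  # allocate a workflow ID
        self.config = config
        self.write_config()

    def write_config(self) -> None:
        with open(os.path.join(self.locator, "config.json"), "w") as f:
            json.dump(self.config, f)

    def resume(self) -> None:
        if not os.path.isdir(self.locator):
            raise NoSuchJobStoreSketch(self.locator)
        with open(os.path.join(self.locator, "config.json")) as f:
            self.config = json.load(f)


locator = os.path.join(tempfile.mkdtemp(), "toy-job-store")
first = ToyDirJobStore(locator)
first.initialize({"logLevel": "INFO"})

second = ToyDirJobStore(locator)  # a different instance of the same store
second.resume()
print(second.config == first.config)  # True
```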
- property config: Config
- Return the Toil configuration associated with this job store.
- property locator: str
- Get the locator that defines the job store, which can be used to connect to it.
- setRootJob(rootJobStoreID: FileID) -> None
- Set the root job of the workflow backed by this job store.
- set_root_job(job_id: FileID) -> None
- Set the root job of the workflow backed by this job store.
- Parameters
- job_id -- The ID of the job to set as root
- load_root_job() -> JobDescription
- Loads the JobDescription for the root job in the current job store.
- Raises
- toil.job.JobException -- If no root job is set or if the root job doesn't exist in this job store
- Returns
- The root job.
- create_root_job(job_description: JobDescription) -> JobDescription
- Create the given JobDescription and set it as the root job in this job store.
- Parameters
- job_description -- JobDescription to save and make the root job.
- get_root_job_return_value() -> Any
- Parse the return value from the root job. Raises an exception if the root job hasn't fulfilled its promise yet.
- import_file(src_uri: str, shared_file_name: str, hardlink: bool = False, symlink: bool = False) -> None
- import_file(src_uri: str, shared_file_name: None = None, hardlink: bool = False, symlink: bool = False) -> FileID
- Imports the file at the given URL into the job store. The ID of the newly imported file is returned. If a shared file name is provided, the file will be imported as such and None is returned. If an executable file on the local filesystem is uploaded, its executability will be preserved when it is downloaded. Currently supported schemes are:
- •
- 's3' for objects in Amazon S3
- e.g. s3://bucket/key
- •
- 'file' for local files
- e.g. file:///local/file/path
- •
- 'http'
- e.g. http://someurl.com/path
- •
- 'gs'
- e.g. gs://bucket/file
- Parameters
- •
- src_uri (str) -- URL that points to a file or object in the storage mechanism of a supported URL scheme e.g. a blob in an AWS s3 bucket.
- •
- shared_file_name (str) -- Optional name to assign to the imported file within the job store
- Returns
- The jobStoreFileID of the imported file or None if shared_file_name was given
- Return type
- toil.fileStores.FileID or None
- export_file(file_id: FileID, dst_uri: str) -> None
- Exports file to destination pointed at by the destination URL. The exported file will be executable if and only if it was originally uploaded from an executable file on the local filesystem. Refer to AbstractJobStore.import_file() documentation for currently supported URL schemes. Note that the helper method _exportFile is used to read from the source and write to destination. To implement any optimizations that circumvent this, the _exportFile method should be overridden by subclasses of AbstractJobStore.
- Parameters
- •
- file_id (str) -- The id of the file in the job store that should be exported.
- •
- dst_uri (str) -- URL that points to a file or object in the storage mechanism of a supported URL scheme e.g. a blob in an AWS s3 bucket.
- classmethod list_url(src_uri: str) -> List[str]
- List the directory at the given URL. Returned path components can be joined with '/' onto the passed URL to form new URLs. Those that end in '/' correspond to directories. The provided URL may or may not end with '/'. Currently supported schemes are:
- •
- 's3' for objects in Amazon S3
- e.g. s3://bucket/prefix/
- •
- 'file' for local files
- e.g. file:///local/dir/path/
- Parameters
- src_uri (str) -- URL that points to a directory or prefix in the storage mechanism of a supported URL scheme e.g. a prefix in an AWS s3 bucket.
- Returns
- A list of URL components in the given directory, already URL-encoded.
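For the 'file' scheme, the listing rules above can be sketched with the standard library alone. list_file_url is a hypothetical helper written for this example; sorting the entries is a choice made here for reproducible output and is not something the description promises.

```python
import os
import tempfile
from typing import List
from urllib.parse import quote, unquote, urlparse


def list_file_url(src_uri: str) -> List[str]:
    """List a local directory given as a file: URL, URL-encoding each component."""
    parsed = urlparse(src_uri)
    if parsed.scheme != "file":
        raise ValueError(f"Only file: URLs are handled in this sketch: {src_uri}")
    directory = unquote(parsed.path)
    with os.scandir(directory) as entries:
        ordered = sorted(entries, key=lambda entry: entry.name)
    components = []
    for entry in ordered:
        # Directories get a trailing '/', files do not.
        suffix = "/" if entry.is_dir() else ""
        components.append(quote(entry.name) + suffix)
    return components


root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub dir"))
with open(os.path.join(root, "a.txt"), "w") as f:
    f.write("hello")

listing = list_file_url("file://" + root + "/")
print(listing)  # ['a.txt', 'sub%20dir/']
```

Each returned component can be joined onto the original URL with '/' to form a new URL.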
- classmethod get_is_directory(src_uri: str) -> bool
- Return True if the thing at the given URL is a directory, and False if it is a file. The URL may or may not end in '/'.
- classmethod read_from_url(src_uri: str, writable: IO[bytes]) -> Tuple[int, bool]
- Read the given URL and write its content into the given writable stream.
- Returns
- The size of the file in bytes and whether the executable permission bit is set
- Return type
- Tuple[int, bool]
- abstract classmethod get_size(src_uri: ParseResult) -> None
- Get the size in bytes of the file at the given URL, or None if it cannot be obtained.
- Parameters
- src_uri -- URL that points to a file or object in the storage mechanism of a supported URL scheme e.g. a blob in an AWS s3 bucket.
- abstract destroy() -> None
- The inverse of initialize(), this method deletes the physical storage represented by this instance. While not being atomic, this method is at least idempotent, as a means to counteract potential issues with eventual consistency exhibited by the underlying storage mechanisms. This means that if the method fails (raises an exception), it may (and should) be invoked again. If the underlying storage mechanism is eventually consistent, even a successful invocation is not an ironclad guarantee that the physical storage vanished completely and immediately. A successful invocation only guarantees that the deletion will eventually happen. It is therefore recommended to not immediately reuse the same job store location for a new Toil workflow.
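Because the operation is idempotent, a caller can simply retry it. The sketch below shows that pattern for a local directory; destroy_dir and destroy_with_retries are hypothetical helpers, and the retry count and delay are arbitrary values chosen for the example.

```python
import os
import shutil
import tempfile
import time


def destroy_dir(path: str) -> None:
    """Idempotent delete: succeeds whether or not the directory still exists."""
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        pass  # already gone, so a repeated call is a no-op


def destroy_with_retries(path: str, attempts: int = 3, delay: float = 0.0) -> int:
    """Call destroy_dir until it succeeds; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        try:
            destroy_dir(path)
            return attempt
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")


store = tempfile.mkdtemp()
with open(os.path.join(store, "job.json"), "w") as f:
    f.write("{}")

attempts_used = destroy_with_retries(store)
destroy_dir(store)  # second call is harmless
print(os.path.exists(store))  # False
```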
- get_env() -> Dict[str, str]
- Returns a dictionary of environment variables that this job store requires to be set in order to function properly on a worker.
- Return type
- dict[str,str]
- clean(jobCache: Optional[Dict[Union[str, TemporaryID], JobDescription]] = None) -> JobDescription
- Function to clean up the state of a job store after a restart. Fixes jobs that might have been partially updated. Resets the try counts and removes jobs that are not successors of the current root job.
- Parameters
- jobCache -- if given, it must be a dict from job ID keys to JobDescription object values. Jobs will be loaded from the cache (which can be downloaded from the job store in a batch) instead of piecemeal when recursed into.
- abstract assign_job_id(job_description: JobDescription) -> None
- Get a new jobStoreID to be used by the described job, and assigns it to the JobDescription. Files associated with the assigned ID will be accepted even if the JobDescription has never been created or updated.
- Parameters
- job_description (toil.job.JobDescription) -- The JobDescription to give an ID to
- batch() -> Iterator[None]
- If supported by the batch system, calls to create() with this context manager active will be performed in a batch after the context manager is released.
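The deferred-write behavior can be sketched with contextlib. ToyBatchingStore is a hypothetical in-memory stand-in; how a real job store groups its writes, and what happens to pending writes when the block raises, are not specified above, so discarding them on error is an assumption of this example.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional, Tuple


class ToyBatchingStore:
    """Hypothetical in-memory store that can defer job creation."""

    def __init__(self) -> None:
        self.saved: Dict[str, str] = {}
        self._pending: Optional[List[Tuple[str, str]]] = None

    def create_job(self, job_id: str, description: str) -> None:
        if self._pending is not None:
            # A batch is active: remember the write instead of performing it.
            self._pending.append((job_id, description))
        else:
            self.saved[job_id] = description

    @contextmanager
    def batch(self) -> Iterator[None]:
        self._pending = []
        try:
            yield
            # Reached only if the block finished without raising.
            for job_id, description in self._pending:
                self.saved[job_id] = description
        finally:
            self._pending = None


store = ToyBatchingStore()
with store.batch():
    store.create_job("job-1", "first")
    store.create_job("job-2", "second")
    visible_inside = dict(store.saved)

print(visible_inside)       # {}
print(sorted(store.saved))  # ['job-1', 'job-2']
```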
- abstract create_job(job_description: JobDescription) -> JobDescription
- Writes the given JobDescription to the job store. The job must have an ID assigned already. Must call jobDescription.pre_update_hook()
- Returns
- The JobDescription passed.
- Return type
- toil.job.JobDescription
- abstract job_exists(job_id: str) -> bool
- Indicates whether a description of the job with the specified jobStoreID exists in the job store
- Return type
- bool
- abstract get_public_url(file_name: str) -> str
- Returns a publicly accessible URL to the given file in the job store. The returned URL may expire as early as 1h after it has been returned. Throws an exception if the file does not exist.
- Parameters
- file_name (str) -- the jobStoreFileID of the file to generate a URL for
- Raises
- NoSuchFileException -- if the specified file does not exist in this job store
- Return type
- str
- abstract get_shared_public_url(shared_file_name: str) -> str
- Differs from getPublicUrl() in that this method is for generating URLs for shared files written by writeSharedFileStream(). Returns a publicly accessible URL to the given file in the job store. The returned URL starts with 'http:', 'https:' or 'file:'. The returned URL may expire as early as 1h after it has been returned. Throws an exception if the file does not exist.
- Parameters
- shared_file_name (str) -- The name of the shared file to generate a publicly accessible URL for.
- Raises
- NoSuchFileException -- raised if the specified file does not exist in the store
- Return type
- str
- abstract load_job(job_id: str) -> JobDescription
- Loads the description of the job referenced by the given ID, assigns it the job store's config, and returns it. May declare the job to have failed (see toil.job.JobDescription.setupJobAfterFailure()) if there is evidence of a failed update attempt.
- Parameters
- job_id -- the ID of the job to load
- Raises
- NoSuchJobException -- if there is no job with the given ID
- abstract update_job(job_description: JobDescription) -> None
- Persists changes to the state of the given JobDescription in this store atomically. Must call jobDescription.pre_update_hook()
- Parameters
- job_description (toil.job.JobDescription) -- the job to write to this job store
- abstract delete_job(job_id: str) -> None
- Removes the JobDescription from the store atomically. You may not then subsequently call load(), write(), update(), etc. with the same jobStoreID or any JobDescription bearing it. This operation is idempotent, i.e. deleting a job twice or deleting a non-existent job will succeed silently.
- Parameters
- job_id (str) -- the ID of the job to delete from this job store
- jobs() -> Iterator[JobDescription]
- Best-effort attempt to return an iterator over JobDescriptions for all jobs in the store. The iterator may not return all jobs and may also contain orphaned jobs that have already finished successfully and should not be rerun. To guarantee you get any and all jobs that can be run, construct a more expensive ToilState object instead.
- Returns
- An iterator over jobs in the store. The iterator may or may not contain all jobs and may contain invalid jobs
- Return type
- Iterator[toil.job.JobDescription]
- abstract write_file(local_path: str, job_id: Optional[str] = None, cleanup: bool = False) -> str
- Takes a file (as a path) and places it in this job store. Returns an ID that can be used to retrieve the file at a later time. The file is written in an atomic manner. It will not appear in the jobStore until the write has successfully completed.
- Parameters
- •
- local_path (str) -- the path to the local file that will be uploaded to the job store. The last path component (basename of the file) will remain associated with the file in the file store, if supported, so that the file can be searched for by name or name glob.
- •
- job_id (str) -- the id of a job, or None. If specified, the file may be associated with that job in a job-store-specific way. This may influence the returned ID.
- •
- cleanup (bool) -- Whether to attempt to delete the file when the job whose ID was given as job_id is deleted with jobStore.delete(job). If job_id was not given, does nothing.
- Raises
- •
- ConcurrentFileModificationException -- if the file was modified concurrently during an invocation of this method
- •
- NoSuchJobException -- if the job specified via job_id does not exist
- Returns
- an ID that references the newly created file and can be used to read the file in the future.
- Return type
- str
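A common way to obtain this kind of atomicity on a local filesystem is to copy into a temporary name and then rename. The sketch below shows that technique; write_file_atomically is a hypothetical helper, the ID format is invented for the example, and nothing here claims to be how any particular job store implements write_file.

```python
import os
import shutil
import tempfile
import uuid


def write_file_atomically(store_dir: str, local_path: str) -> str:
    """Copy local_path into store_dir under a new ID and return that ID."""
    # Keep the basename in the ID so the file can be found by name later.
    file_id = f"{uuid.uuid4().hex}-{os.path.basename(local_path)}"
    final_path = os.path.join(store_dir, file_id)
    # Copy to a temporary name first so readers never see a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=store_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(local_path, tmp_path)
        os.replace(tmp_path, final_path)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return file_id


store_dir = tempfile.mkdtemp()
source = os.path.join(tempfile.mkdtemp(), "reads.txt")
with open(source, "w") as f:
    f.write("ACGT")

file_id = write_file_atomically(store_dir, source)
with open(os.path.join(store_dir, file_id)) as f:
    stored = f.read()
print(stored)  # ACGT
```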
- abstract write_file_stream(job_id: Optional[str] = None, cleanup: bool = False, basename: Optional[str] = None, encoding: Optional[str] = None, errors: Optional[str] = None) -> Iterator[Tuple[IO[bytes], str]]
- Similar to writeFile, but returns a context manager yielding a tuple of 1) a file handle which can be written to and 2) the ID of the resulting file in the job store. The yielded file handle does not need to and should not be closed explicitly. The file is written in an atomic manner. It will not appear in the jobStore until the write has successfully completed.
- Parameters
- •
- job_id (str) -- the id of a job, or None. If specified, the file may be associated with that job in a job-store-specific way. This may influence the returned ID.
- •
- cleanup (bool) -- Whether to attempt to delete the file when the job whose ID was given as job_id is deleted with jobStore.delete(job). If job_id was not given, does nothing.
- •
- basename (str) -- If supported by the implementation, use the given file basename so that when searching the job store with a query matching that basename, the file will be detected.
- •
- encoding (str) -- the name of the encoding used to encode the file. Encodings are the same as for encode(). Defaults to None which represents binary mode.
- •
- errors (str) -- an optional string that specifies how encoding errors are to be handled. Errors are the same as for open(). Defaults to 'strict' when an encoding is specified.
- Raises
- •
- ConcurrentFileModificationException -- if the file was modified concurrently during an invocation of this method
- •
- NoSuchJobException -- if the job specified via job_id does not exist
- Returns
- a context manager yielding a file handle which can be written to and an ID that references the newly created file and can be used to read the file in the future.
- Return type
- Iterator[Tuple[IO[bytes], str]]
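The shape of such a context manager, including the switch between binary and text mode, can be sketched as follows. write_file_stream_sketch is a hypothetical function for a local directory; the ID format and the decision to discard the temporary file when the block raises are assumptions of this example.

```python
import os
import tempfile
import uuid
from contextlib import contextmanager
from typing import IO, Iterator, Optional, Tuple


@contextmanager
def write_file_stream_sketch(store_dir: str,
                             basename: Optional[str] = None,
                             encoding: Optional[str] = None,
                             errors: Optional[str] = None
                             ) -> Iterator[Tuple[IO, str]]:
    """Yield (writable handle, file ID); publish the file only on success."""
    file_id = uuid.uuid4().hex + (f"-{basename}" if basename else "")
    fd, tmp_path = tempfile.mkstemp(dir=store_dir, suffix=".tmp")
    # No encoding means binary mode, matching the parameter description.
    mode = "wb" if encoding is None else "w"
    try:
        with open(fd, mode, encoding=encoding, errors=errors) as handle:
            yield handle, file_id
        # The handle is closed here; the rename makes the file visible.
        os.replace(tmp_path, os.path.join(store_dir, file_id))
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


stream_dir = tempfile.mkdtemp()
with write_file_stream_sketch(stream_dir, basename="notes.txt",
                              encoding="utf-8") as (handle, stream_id):
    handle.write("hello")
    visible_during_write = os.path.exists(os.path.join(stream_dir, stream_id))

with open(os.path.join(stream_dir, stream_id), encoding="utf-8") as f:
    content = f.read()
print(visible_during_write, content)  # False hello
```

The caller never closes the handle; leaving the with block does that and then publishes the file.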
- abstract get_empty_file_store_id(job_id: Optional[str] = None, cleanup: bool = False, basename: Optional[str] = None) -> str
- Creates an empty file in the job store and returns its ID. A call to fileExists(getEmptyFileStoreID(jobStoreID)) will return True.
- Parameters
- •
- job_id (str) -- the id of a job, or None. If specified, the file may be associated with that job in a job-store-specific way. This may influence the returned ID.
- •
- cleanup (bool) -- Whether to attempt to delete the file when the job whose ID was given as job_id is deleted with jobStore.delete(job). If job_id was not given, does nothing.
- •
- basename (str) -- If supported by the implementation, use the given file basename so that when searching the job store with a query matching that basename, the file will be detected.
- Returns
- a jobStoreFileID that references the newly created file and can be used to reference the file in the future.
- Return type
- str
- abstract read_file(file_id: str, local_path: str, symlink: bool = False) -> None
- Copies or hard links the file referenced by jobStoreFileID to the given local file path. The version will be consistent with the last copy of the file written/updated. If the file in the job store is later modified via updateFile or updateFileStream, it is implementation-defined whether those writes will be visible at localFilePath. The file is copied in an atomic manner. It will not appear in the local file system until the copy has completed. The file at the given local path may not be modified after this method returns! Note! Implementations of readFile need to respect/provide the executable attribute on FileIDs.
- Parameters
- •
- file_id (str) -- ID of the file to be copied
- •
- local_path (str) -- the local path indicating where to place the contents of the given file in the job store
- •
- symlink (bool) -- whether the reader can tolerate a symlink. If set to true, the job store may create a symlink instead of a full copy of the file or a hard link.
- abstract read_file_stream(file_id: Union[FileID, str], encoding: Literal[None] = None, errors: Optional[str] = None) -> ContextManager[IO[bytes]]
- abstract read_file_stream(file_id: Union[FileID, str], encoding: str, errors: Optional[str] = None) -> ContextManager[IO[str]]
- Similar to readFile, but returns a context manager yielding a file handle which can be read from. The yielded file handle does not need to and should not be closed explicitly.
- Parameters
- •
- file_id (str) -- ID of the file to get a readable file handle for
- •
- encoding (str) -- the name of the encoding used to decode the file. Encodings are the same as for decode(). Defaults to None which represents binary mode.
- •
- errors (str) -- an optional string that specifies how encoding errors are to be handled. Errors are the same as for open(). Defaults to 'strict' when an encoding is specified.
- Returns
- a context manager yielding a file handle which can be read from
- Return type
- Iterator[Union[IO[bytes], IO[str]]]
- abstract delete_file(file_id: str) -> None
- Deletes the file with the given ID from this job store. This operation is idempotent, i.e. deleting a file twice or deleting a non-existent file will succeed silently.
- Parameters
- file_id (str) -- ID of the file to delete
- fileExists(jobStoreFileID: str) -> bool
- Determine whether a file exists in this job store.
- abstract file_exists(file_id: str) -> bool
- Determine whether a file exists in this job store.
- Parameters
- file_id -- an ID referencing the file to be checked
- getFileSize(jobStoreFileID: str) -> int
- Get the size of the given file in bytes.
- abstract get_file_size(file_id: str) -> int
- Get the size of the given file in bytes, or 0 if it does not exist when queried. Note that job stores which encrypt files might return overestimates of file sizes, since the encrypted file may have been padded to the nearest block, augmented with an initialization vector, etc.
- Parameters
- file_id (str) -- an ID referencing the file to be checked
- Return type
- int
- updateFile(jobStoreFileID: str, localFilePath: str) -> None
- Replaces the existing version of a file in the job store.
- abstract update_file(file_id: str, local_path: str) -> None
- Replaces the existing version of a file in the job store. Throws an exception if the file does not exist.
- Parameters
- •
- file_id -- the ID of the file in the job store to be updated
- •
- local_path -- the local path to a file that will overwrite the current version in the job store
- Raises
- •
- ConcurrentFileModificationException -- if the file was modified concurrently during an invocation of this method
- •
- NoSuchFileException -- if the specified file does not exist
- abstract update_file_stream(file_id: str, encoding: Optional[str] = None, errors: Optional[str] = None) -> Iterator[IO[Any]]
- Replaces the existing version of a file in the job store. Similar to writeFile, but returns a context manager yielding a file handle which can be written to. The yielded file handle does not need to and should not be closed explicitly.
- Parameters
- •
- file_id (str) -- the ID of the file in the job store to be updated
- •
- encoding (str) -- the name of the encoding used to encode the file. Encodings are the same as for encode(). Defaults to None which represents binary mode.
- •
- errors (str) -- an optional string that specifies how encoding errors are to be handled. Errors are the same as for open(). Defaults to 'strict' when an encoding is specified.
- Raises
- •
- ConcurrentFileModificationException -- if the file was modified concurrently during an invocation of this method
- •
- NoSuchFileException -- if the specified file does not exist
- abstract write_shared_file_stream(shared_file_name: str, encrypted: Optional[bool] = None, encoding: Optional[str] = None, errors: Optional[str] = None) -> Iterator[IO[bytes]]
- Returns a context manager yielding a writable file handle to the global file referenced by the given name. File will be created in an atomic manner.
- Parameters
- •
- shared_file_name (str) -- A file name matching AbstractJobStore.fileNameRegex, unique within this job store
- •
- encrypted (bool) -- True if the file must be encrypted, None if it may be encrypted or False if it must be stored in the clear.
- •
- encoding (str) -- the name of the encoding used to encode the file. Encodings are the same as for encode(). Defaults to None which represents binary mode.
- •
- errors (str) -- an optional string that specifies how encoding errors are to be handled. Errors are the same as for open(). Defaults to 'strict' when an encoding is specified.
- Raises
- ConcurrentFileModificationException -- if the file was modified concurrently during an invocation of this method
- Returns
- a context manager yielding a writable file handle
- Return type
- Iterator[IO[bytes]]
- abstract read_shared_file_stream(shared_file_name: str, encoding: Optional[str] = None, errors: Optional[str] = None) -> Iterator[IO[bytes]]
- Returns a context manager yielding a readable file handle to the global file referenced by the given name.
- Parameters
- •
- shared_file_name (str) -- A file name matching AbstractJobStore.fileNameRegex, unique within this job store
- •
- encoding (str) -- the name of the encoding used to decode the file. Encodings are the same as for decode(). Defaults to None which represents binary mode.
- •
- errors (str) -- an optional string that specifies how encoding errors are to be handled. Errors are the same as for open(). Defaults to 'strict' when an encoding is specified.
- Returns
- a context manager yielding a readable file handle
- Return type
- Iterator[IO[bytes]]
- abstract write_logs(msg: str) -> None
- Stores a message as a log in the jobstore.
- Parameters
- msg (str) -- the string to be written
- Raises
- ConcurrentFileModificationException -- if the file was modified concurrently during an invocation of this method
- abstract read_logs(callback: Callable[[...], Any], read_all: bool = False) -> int
- Reads logs accumulated by the write_logs() method. For each log this method calls the given callback function with the message as an argument (rather than returning logs directly, this method must be supplied with a callback which will process log messages). Only unread logs will be read unless the read_all parameter is set.
- Parameters
- •
- callback (Callable) -- a function to be applied to each of the stats file handles found
- •
- read_all (bool) -- a boolean indicating whether to read the already processed stats files in addition to the unread stats files
- Raises
- ConcurrentFileModificationException -- if the file was modified concurrently during an invocation of this method
- Returns
- the number of stats files processed
- Return type
- int
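The callback-driven reading pattern can be modelled with a toy in-memory store. This is a sketch only: InMemoryLogStore is a hypothetical class, and it follows the wording above in which the callback receives the message itself; a real job store implementation may pass something else, such as a file handle.

```python
class InMemoryLogStore:
    """Toy stand-in for a job store's log area (not a Toil class)."""

    def __init__(self):
        self._messages = []
        self._read_upto = 0  # index of the first unread message

    def write_logs(self, msg):
        self._messages.append(msg)

    def read_logs(self, callback, read_all=False):
        # Start from the beginning only when read_all is requested.
        start = 0 if read_all else self._read_upto
        pending = self._messages[start:]
        for message in pending:
            callback(message)
        self._read_upto = len(self._messages)
        return len(pending)


store = InMemoryLogStore()
store.write_logs("first")
store.write_logs("second")

seen = []
first_pass = store.read_logs(seen.append)   # both messages are unread
second_pass = store.read_logs(seen.append)  # nothing new to deliver
replayed = store.read_logs(seen.append, read_all=True)
```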
- write_leader_pid() -> None
- Write the pid of this process to a file in the job store. Overwriting the current contents of pid.log is a feature, not a bug of this method. Other methods will rely on always having the most current pid available. So far there is no reason to store any old pids.
- read_leader_pid() -> int
- Read the pid of the leader process from a file in the job store.
- Raises
- NoSuchFileException -- If the PID file doesn't exist.
- write_leader_node_id() -> None
- Write the leader node id to the job store. This should only be called by the leader.
- read_leader_node_id() -> str
- Read the leader node id stored in the job store.
- Raises
- NoSuchFileException -- If the node ID file doesn't exist.
- write_kill_flag(kill: bool = False) -> None
- Write a file inside the job store that serves as a kill flag. The initialized file contains the characters "NO". This should only be changed when the user runs the "toil kill" command. Changing this file to a "YES" triggers a kill of the leader process. The workers are expected to be cleaned up by the leader.
- read_kill_flag() -> bool
- Read the kill flag from the job store. Returns True if the leader has been killed, and False otherwise.
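The "NO"/"YES" flag-file convention described above can be sketched with plain files. The directory layout and the file name kill_flag are assumptions made for illustration; the job store decides where the flag actually lives.

```python
import os
import tempfile

FLAG_NAME = "kill_flag"  # hypothetical file name, chosen for this sketch


def write_kill_flag(directory, kill=False):
    """Write "YES" to request a kill, "NO" otherwise."""
    with open(os.path.join(directory, FLAG_NAME), "w") as f:
        f.write("YES" if kill else "NO")


def read_kill_flag(directory):
    """Return True if the flag file asks for the leader to be killed."""
    with open(os.path.join(directory, FLAG_NAME)) as f:
        return f.read() == "YES"


store_dir = tempfile.mkdtemp()
write_kill_flag(store_dir)             # initial state: "NO"
before = read_kill_flag(store_dir)
write_kill_flag(store_dir, kill=True)  # what a kill command would do
after = read_kill_flag(store_dir)
```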
- default_caching() -> bool
- The job store's preference as to whether caching should be enabled. Some job stores benefit from caching, but on some local configurations it can be flaky. See https://github.com/DataBiosphere/toil/issues/4218
TOIL JOB API
Functions to wrap jobs and return values (promises).
FunctionWrappingJob
The subclass of Job for wrapping user functions.
- class toil.job.FunctionWrappingJob(userFunction, *args, **kwargs)
- Job used to wrap a function. In its run method the wrapped function is called.
- __init__(userFunction, *args, **kwargs)
- Parameters
- userFunction (callable) -- The function to wrap. It will be called with *args and **kwargs as arguments.
- run(fileStore)
- Override this function to perform work and dynamically create successor jobs.
- Parameters
- fileStore -- Used to create local and globally sharable temporary files and to send log messages to the leader process.
- Returns
- The return value of the function can be passed to other jobs by means of toil.job.Job.rv().
JobFunctionWrappingJob
The subclass of FunctionWrappingJob for wrapping user job functions.
- class toil.job.JobFunctionWrappingJob(userFunction, *args, **kwargs)
- A job function is a function whose first argument is a Job instance that is the wrapping job for the function. This can be used to add successor jobs for the function and perform all the functions the Job class provides. To enable the job function to get access to the toil.fileStores.abstractFileStore.AbstractFileStore instance (see toil.job.Job.run()), it is made a variable of the wrapping job called fileStore. To specify a job's resource requirements the following default keyword arguments can be specified:
- •
- memory
- •
- disk
- •
- cores
- •
- accelerators
- •
- preemptible
Job.wrapJobFn(myJob, memory='100k', disk='1M', cores=0.1)
- run(fileStore)
- Override this function to perform work and dynamically create successor jobs.
- Parameters
- fileStore -- Used to create local and globally sharable temporary files and to send log messages to the leader process.
- Returns
- The return value of the function can be passed to other jobs by means of toil.job.Job.rv().
EncapsulatedJob
The subclass of Job for encapsulating a job, allowing a subgraph of jobs to be treated as a single job.
- class toil.job.EncapsulatedJob(job, unitName=None)
- A convenience Job class used to make a job subgraph appear to be a single job. Let A be the root job of a job subgraph and B be another job we'd like to run after A and all its successors have completed. To do this, use encapsulate:
# Job A and subgraph, Job B
A, B = A(), B()
Aprime = A.encapsulate()
Aprime.addChild(B)
# B will run after A and all its successors have completed. A and its subgraph of
# successors in effect appear to be just one job.
- __init__(job, unitName=None)
- Parameters
- •
- job (toil.job.Job) -- the job to encapsulate.
- •
- unitName (str) -- human-readable name to identify this job instance.
- addChild(childJob)
- Add a childJob to be run as child of this job. Child jobs will be run directly after this job's toil.job.Job.run() method has completed.
- Returns
- childJob: for call chaining
- addService(service, parentService=None)
- Add a service. The toil.job.Job.Service.start() method of the service will be called after the run method has completed but before any successors are run. The service's toil.job.Job.Service.stop() method will be called once the successors of the job have been run. Services allow things like databases and servers to be started and accessed by jobs in a workflow.
- Raises
- toil.job.JobException -- If service has already been made the child of a job or another service.
- Parameters
- •
- service -- Service to add.
- •
- parentService -- Service that will be started before 'service' is started. Allows trees of services to be established. parentService must be a service of this job.
- Returns
- a promise that will be replaced with the return value from toil.job.Job.Service.start() of service in any successor of the job.
- addFollowOn(followOnJob)
- Add a follow-on job. Follow-on jobs will be run after the child jobs and their successors have been run.
- Returns
- followOnJob for call chaining
- rv(*path) -> Promise
- Create a promise (toil.job.Promise). The promise represents the return value of the job's run method or, in the case of a function-wrapping job, the wrapped function's return value.
- Parameters
- path ((Any)) -- Optional path for selecting a component of the promised return value. If absent or empty, the entire return value will be used. Otherwise, the first element of the path is used to select an individual item of the return value. For that to work, the return value must be a list, a dictionary, or any other type implementing the __getitem__() magic method. If the selected item is yet another composite value, the second element of the path can be used to select an item from it, and so on. For example, if the return value is [6, {'a': 42}], rv(0) would select 6, rv(1) would select {'a': 42}, and rv(1, 'a') would select 42. To select a slice from a return value that is sliceable, e.g. a tuple or list, the path element should be a slice object. For example, assuming that the return value is [6, 7, 8, 9], rv(slice(1, 3)) would select [7, 8]. Note that slicing really only makes sense at the end of the path.
- Returns
- A promise representing the return value of this job's toil.job.Job.run() method.
- Return type
- toil.job.Promise
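The path selection rule for rv() amounts to repeated indexing. The sketch below reproduces it on plain Python values; select is a hypothetical helper written for this illustration, not a Toil function, and it operates on an already-computed return value rather than on a promise.

```python
def select(value, *path):
    """Apply each path element in turn via __getitem__ (hypothetical helper)."""
    for key in path:
        value = value[key]
    return value


result = [6, {"a": 42}]
whole = select(result)           # empty path: the entire return value
first = select(result, 0)        # 6
second = select(result, 1)       # {"a": 42}
nested = select(result, 1, "a")  # 42

# A slice object as the last path element selects a sub-sequence.
middle = select([6, 7, 8, 9], slice(1, 3))
```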
- prepareForPromiseRegistration(jobStore)
- Set up to allow this job's promises to register themselves. Prepare this job (the promisor) so that its promises can register themselves with it when the jobs they are promised to (promisees) are serialized. The promisee holds the reference to the promise (usually as part of the job arguments), and when it is pickled, so are the promises it refers to. Pickling a promise triggers it to be registered with the promisor.
Promise
The class used to reference return values of jobs/services not yet run/started.
- class toil.job.Promise(*args)
- References a return value from a toil.job.Job.run() or toil.job.Job.Service.start() method as a promise before the method itself is run. Let T be a job. Instances of Promise (termed a promise) are returned by T.rv(), which is used to reference the return value of T's run function. When the promise is passed to the constructor (or as an argument to a wrapped function) of a different, successor job, the promise will be replaced by the actual referenced return value. This mechanism allows a return value from one job's run method to be an input argument to another job before the former job's run function has been executed.
- filesToDelete = {}
- A set of IDs of files containing promised values when we know we won't need them anymore
- __init__(job: Job, path: Any)
- Initialize this promise.
- Parameters
- •
- job (Job) -- the job whose return value this promise references
- •
- path -- see Job.rv()
- class toil.job.PromisedRequirement(valueOrCallable, *args)
- Class for dynamically allocating job function resource requirements (involving toil.job.Promise instances). Use when resource requirements depend on the return value of a parent function. PromisedRequirements can be modified by passing a function that takes the Promise as input. For example, let f, g, and h be functions. Then a Toil workflow can be defined as follows:
A = Job.wrapFn(f)
B = A.addChildFn(g, cores=PromisedRequirement(A.rv()))
C = B.addChildFn(h, cores=PromisedRequirement(lambda x: 2*x, B.rv()))
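The idea of a requirement whose value is computed only once a predecessor's result exists can be modelled without Toil. In this sketch, Placeholder and LazyRequirement are hypothetical stand-ins for a promise and a promised requirement; they are not the Toil classes and skip serialization entirely.

```python
class Placeholder:
    """Stands in for a promise: its value is filled in after the producer runs."""

    def __init__(self):
        self.value = None


class LazyRequirement:
    """Hypothetical model of a requirement resolved from placeholders."""

    def __init__(self, value_or_callable, *args):
        if callable(value_or_callable):
            self._func, self._args = value_or_callable, args
        else:
            # A bare value or placeholder is passed through unchanged.
            self._func, self._args = (lambda x: x), (value_or_callable,)

    def get_value(self):
        resolved = [a.value if isinstance(a, Placeholder) else a
                    for a in self._args]
        return self._func(*resolved)


a_result = Placeholder()
same = LazyRequirement(a_result)
doubled = LazyRequirement(lambda x: 2 * x, a_result)

a_result.value = 2  # the producing job has now finished
cores_same = same.get_value()
cores_doubled = doubled.get_value()
```

Both requirements are constructed before the value is known, which mirrors how a workflow graph is declared before any job runs.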
- __init__(valueOrCallable, *args)
- Initialize this Promised Requirement.
- Parameters
- •
- valueOrCallable -- A single Promise instance or a function that takes args as input parameters.
- •
- args (int or .Promise) -- variable length argument list
- getValue()
- Return PromisedRequirement value.
- static convertPromises(kwargs: Dict[str, Any]) -> bool
- Return True if reserved resource keyword is a Promise or PromisedRequirement instance. Converts Promise instance to PromisedRequirement.
- Parameters
- kwargs -- function keyword arguments
JOB METHODS API
Jobs are the units of work in Toil which are composed into workflows.
- class toil.job.Job(memory: Optional[Union[str, int]] = None, cores: Optional[Union[str, int, float]] = None, disk: Optional[Union[str, int]] = None, accelerators: Optional[Union[str, int, Mapping[str, Any], AcceleratorRequirement, Sequence[Union[str, int, Mapping[str, Any], AcceleratorRequirement]]]] = None, preemptible: Optional[Union[str, int, bool]] = None, preemptable: Optional[Union[str, int, bool]] = None, unitName: Optional[str] = '', checkpoint: Optional[bool] = False, displayName: Optional[str] = '', descriptionClass: Optional[str] = None)
- Represents a unit of work in Toil.
- __init__(memory: Optional[Union[str, int]] = None, cores: Optional[Union[str, int, float]] = None, disk: Optional[Union[str, int]] = None, accelerators: Optional[Union[str, int, Mapping[str, Any], AcceleratorRequirement, Sequence[Union[str, int, Mapping[str, Any], AcceleratorRequirement]]]] = None, preemptible: Optional[Union[str, int, bool]] = None, preemptable: Optional[Union[str, int, bool]] = None, unitName: Optional[str] = '', checkpoint: Optional[bool] = False, displayName: Optional[str] = '', descriptionClass: Optional[str] = None) -> None
- Job initializer. This method must be called by any overriding constructor.
- Parameters
- •
- memory (int or string convertible by toil.lib.conversions.human2bytes to an int) -- the maximum number of bytes of memory the job will require to run.
- •
- cores (float, int, or string convertible by toil.lib.conversions.human2bytes to an int) -- the number of CPU cores required.
- •
- disk (int or string convertible by toil.lib.conversions.human2bytes to an int) -- the amount of local disk space required by the job, expressed in bytes.
- •
- accelerators (int, string, dict, or list of those. Strings and dicts must be parseable by AcceleratorRequirement.parse.) -- the computational accelerators required by the job. If a string, can be a string of a number, or a string specifying a model, brand, or API (with optional colon-delimited count).
- •
- preemptible (bool, int in {0, 1}, or string in {'false', 'true'} in any case) -- if the job can be run on a preemptible node.
- •
- preemptable -- legacy preemptible parameter, for backwards compatibility with workflows not using the preemptible keyword
- •
- unitName (str) -- Human-readable name for this instance of the job.
- •
- checkpoint (bool) -- if any of this job's successor jobs completely fails, exhausting all its retries, remove any successor jobs and rerun this job to restart the subtree. The job must be a leaf vertex in the job graph when initially defined, see toil.job.Job.checkNewCheckpointsAreLeafVertices().
- •
- displayName (str) -- Human-readable job type display name.
- •
- descriptionClass (class) -- Override for the JobDescription class used to describe the job.
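The accepted forms of the preemptible argument (a bool, an int in {0, 1}, or the strings 'true'/'false' in any case) can be normalised as in the sketch below. parse_preemptible is a hypothetical function that only illustrates the rule stated above; it is not the parser Toil uses.

```python
def parse_preemptible(value):
    """Normalise a preemptible setting to a bool (hypothetical helper)."""
    # bool must be tested first because bool is a subclass of int.
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        if value in (0, 1):
            return bool(value)
    elif isinstance(value, str):
        lowered = value.lower()
        if lowered in ("true", "false"):
            return lowered == "true"
    raise ValueError(f"not a valid preemptible setting: {value!r}")


from_string = parse_preemptible("TRUE")
from_int = parse_preemptible(0)
from_bool = parse_preemptible(True)
```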
- property jobStoreID
- Get the ID of this Job.
- Return type
- str|toil.job.TemporaryID
- property description
- Expose the JobDescription that describes this job.
- Return type
- toil.job.JobDescription
- property disk: int
- The maximum number of bytes of disk the job will require to run.
- Return type
- int
- property memory
- The maximum number of bytes of memory the job will require to run.
- Return type
- int
- property cores
- The number of CPU cores required.
- Return type
- int|float
- property accelerators
- Any accelerators, such as GPUs, that are needed.
- Return type
- list
- property preemptible
- Whether the job can be run on a preemptible node.
- Return type
- bool
- property checkpoint
- Determine if the job is a checkpoint job or not.
- Return type
- bool
- assignConfig(config: Config)
- Assign the given config object. It will be used by various actions implemented inside the Job class.
- Parameters
- config -- Config object to query
- run(fileStore: AbstractFileStore) -> Any
- Override this function to perform work and dynamically create successor jobs.
- Parameters
- fileStore -- Used to create local and globally sharable temporary files and to send log messages to the leader process.
- Returns
- The return value of the function can be passed to other jobs by means of toil.job.Job.rv().
- addChild(childJob: Job) -> Job
- Add a childJob to be run as child of this job. Child jobs will be run directly after this job's toil.job.Job.run() method has completed.
- Returns
- childJob: for call chaining
- hasChild(childJob: Job) -> bool
- Check if childJob is already a child of this job.
- Returns
- True if childJob is a child of the job, else False.
- addFollowOn(followOnJob: Job) -> Job
- Add a follow-on job. Follow-on jobs will be run after the child jobs and their successors have been run.
- Returns
- followOnJob for call chaining
- hasPredecessor(job: Job) -> bool
- Check if a given job is already a predecessor of this job.
- hasFollowOn(followOnJob: Job) -> bool
- Check if given job is already a follow-on of this job.
- Returns
- True if the followOnJob is a follow-on of this job, else False.
- addService(service: Service, parentService: Optional[Service] = None) -> Promise
- Add a service. The toil.job.Job.Service.start() method of the service will be called after the run method has completed but before any successors are run. The service's toil.job.Job.Service.stop() method will be called once the successors of the job have been run. Services allow things like databases and servers to be started and accessed by jobs in a workflow.
- Raises
- toil.job.JobException -- If service has already been made the child of a job or another service.
- Parameters
- •
- service -- Service to add.
- •
- parentService -- Service that will be started before 'service' is started. Allows trees of services to be established. parentService must be a service of this job.
- Returns
- a promise that will be replaced with the return value from toil.job.Job.Service.start() of service in any successor of the job.
- hasService(service: Service) -> bool
- Return True if the given Service is a service of this job, and False otherwise.
- addChildFn(fn: Callable, *args, **kwargs) -> FunctionWrappingJob
- Add a function as a child job.
- Parameters
- fn -- Function to be run as a child job with *args and **kwargs as arguments to this function. See toil.job.FunctionWrappingJob for reserved keyword arguments used to specify resource requirements.
- Returns
- The new child job that wraps fn.
- addFollowOnFn(fn: Callable, *args, **kwargs) -> FunctionWrappingJob
- Add a function as a follow-on job.
- Parameters
- fn -- Function to be run as a follow-on job with *args and **kwargs as arguments to this function. See toil.job.FunctionWrappingJob for reserved keyword arguments used to specify resource requirements.
- Returns
- The new follow-on job that wraps fn.
- addChildJobFn(fn: Callable, *args, **kwargs) -> FunctionWrappingJob
- Add a job function as a child job. See toil.job.JobFunctionWrappingJob for a definition of a job function.
- Parameters
- fn -- Job function to be run as a child job with *args and **kwargs as arguments to this function. See toil.job.JobFunctionWrappingJob for reserved keyword arguments used to specify resource requirements.
- Returns
- The new child job that wraps fn.
- addFollowOnJobFn(fn: Callable, *args, **kwargs) -> FunctionWrappingJob
- Add a follow-on job function. See toil.job.JobFunctionWrappingJob for a definition of a job function.
- Parameters
- fn -- Job function to be run as a follow-on job with *args and **kwargs as arguments to this function. See toil.job.JobFunctionWrappingJob for reserved keyword arguments used to specify resource requirements.
- Returns
- The new follow-on job that wraps fn.
- property tempDir: str
- Shortcut to calling job.fileStore.getLocalTempDir(). The temp dir is created on the first call, and the same path is returned on that call and all later calls.
- Returns
- Path to tempDir. See job.fileStore.getLocalTempDir().
- log(text: str, level=20) -> None
- Convenience wrapper for fileStore.logToMaster().
- static wrapFn(fn, *args, **kwargs)
- Makes a Job out of a function. Convenience function for constructor of toil.job.FunctionWrappingJob.
- Parameters
- fn -- Function to be run with *args and **kwargs as arguments. See toil.job.JobFunctionWrappingJob for reserved keyword arguments used to specify resource requirements.
- Returns
- The new job that wraps fn.
- Return type
- toil.job.FunctionWrappingJob
- static wrapJobFn(fn, *args, **kwargs)
- Makes a Job out of a job function. Convenience function for constructor of toil.job.JobFunctionWrappingJob.
- Parameters
- fn -- Job function to be run with *args and **kwargs as arguments. See toil.job.JobFunctionWrappingJob for reserved keyword arguments used to specify resource requirements.
- Returns
- The new job that wraps the job function fn.
- Return type
- toil.job.JobFunctionWrappingJob
- encapsulate(name=None)
- Encapsulates the job, see toil.job.EncapsulatedJob. Convenience function for constructor of toil.job.EncapsulatedJob.
- Parameters
- name (str) -- Human-readable name for the encapsulated job.
- Returns
- an encapsulated version of this job.
- Return type
- toil.job.EncapsulatedJob
- rv(*path) -> Any
- Create a promise (toil.job.Promise). The promise represents the return value of the job's run method or, in the case of a function-wrapping job, the wrapped function's return value.
- Parameters
- path ((Any)) -- Optional path for selecting a component of the promised return value. If absent or empty, the entire return value will be used. Otherwise, the first element of the path is used to select an individual item of the return value. For that to work, the return value must be a list, a dictionary, or any other type implementing the __getitem__() magic method. If the selected item is yet another composite value, the second element of the path can be used to select an item from it, and so on. For example, if the return value is [6, {'a': 42}], rv(0) would select 6, rv(1) would select {'a': 42}, and rv(1, 'a') would select 42. To select a slice from a return value that is sliceable, e.g. a tuple or list, the path element should be a slice object. For example, assuming that the return value is [6, 7, 8, 9], rv(slice(1, 3)) would select [7, 8]. Note that slicing really only makes sense at the end of the path.
- Returns
- A promise representing the return value of this job's toil.job.Job.run() method.
- Return type
- toil.job.Promise
- prepareForPromiseRegistration(jobStore: AbstractJobStore) -> None
- Set up to allow this job's promises to register themselves. Prepare this job (the promisor) so that its promises can register themselves with it when the jobs they are promised to (promisees) are serialized. The promisee holds the reference to the promise (usually as part of the job arguments), and when it is pickled, so are the promises it refers to. Pickling a promise triggers it to be registered with the promisor.
- checkJobGraphForDeadlocks()
- Ensures that a graph of Jobs (that hasn't yet been saved to the JobStore) doesn't contain any pathological relationships between jobs that would result in deadlocks if we tried to run the jobs. See toil.job.Job.checkJobGraphConnected(), toil.job.Job.checkJobGraphAcylic() and toil.job.Job.checkNewCheckpointsAreLeafVertices() for more info.
- Raises
- toil.job.JobGraphDeadlockException -- if the job graph is cyclic, contains multiple roots or contains checkpoint jobs that are not leaf vertices when defined (see toil.job.Job.checkNewCheckpointsAreLeafVertices()).
- getRootJobs() -> Set[Job]
- Returns the set of root job objects that contain this job. A root job is a job with no predecessors (i.e. one that is not a child, follow-on, or service of any other job). Only deals with jobs created here, rather than loaded from the job store.
- checkJobGraphConnected()
- Raises
- toil.job.JobGraphDeadlockException -- if toil.job.Job.getRootJobs() does not contain exactly one root job.
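The "exactly one root" condition can be checked on a plain adjacency mapping. The sketch below is an assumption-laden model: jobs are strings, the graph is a dict from job to its successors, find_roots and check_connected are hypothetical helpers, and ValueError stands in for the Toil exception.

```python
def find_roots(successors):
    """Return the jobs that never appear as anyone's successor."""
    all_jobs = set(successors)
    has_predecessor = set()
    for targets in successors.values():
        all_jobs.update(targets)
        has_predecessor.update(targets)
    return all_jobs - has_predecessor


def check_connected(successors):
    """Raise unless the graph has exactly one root; return that root."""
    roots = find_roots(successors)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root job, found {sorted(roots)}")
    return next(iter(roots))


diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
root = check_connected(diamond)

two_roots = {"A": ["C"], "B": ["C"]}
```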
- checkJobGraphAcylic()
- Raises
- toil.job.JobGraphDeadlockException -- if the connected component of jobs containing this job contains any cycles of child/followOn dependencies in the augmented job graph (see below). Such cycles are not allowed in valid job graphs.
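A cycle check of this kind can be sketched with the standard library's graphlib module (Python 3.9 or later). This is an illustration on a plain dependency mapping, not Toil's implementation, and it ignores the augmentation of the job graph mentioned above; is_acyclic is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(predecessors):
    """Return True if the mapping job -> set of predecessors has no cycle."""
    try:
        # prepare() raises CycleError when the graph contains a cycle.
        TopologicalSorter(predecessors).prepare()
    except CycleError:
        return False
    return True


chain = {"A": set(), "B": {"A"}, "C": {"B"}}
loop = {"A": {"C"}, "B": {"A"}, "C": {"B"}}

chain_ok = is_acyclic(chain)
loop_ok = is_acyclic(loop)
```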
- checkNewCheckpointsAreLeafVertices()
- A checkpoint job is a job that is restarted if either it fails, or if any of its successors completely fails, exhausting its retries. A job is a leaf if it has no successors. A checkpoint job must be a leaf when initially added to the job graph. When its run method is invoked it can then create direct successors. This restriction is made to simplify implementation. Only works on connected components of jobs not yet added to the JobStore.
- Raises
- toil.job.JobGraphDeadlockException -- if there exists a job being added to the graph for which checkpoint=True and which is not a leaf.
- defer(function, *args, **kwargs)
- Register a deferred function, i.e. a callable that will be invoked after the current attempt at running this job concludes. A job attempt is said to conclude when the job function (or the toil.job.Job.run() method for class-based jobs) returns, raises an exception or after the process running it terminates abnormally. A deferred function will be called on the node that attempted to run the job, even if a subsequent attempt is made on another node. A deferred function should be idempotent because it may be called multiple times on the same node or even in the same process. More than one deferred function may be registered per job attempt by calling this method repeatedly with different arguments. If the same function is registered twice with the same or different arguments, it will be called twice per job attempt. Examples for deferred functions are ones that handle cleanup of resources external to Toil, like Docker containers, files outside the work directory, etc.
- Parameters
- •
- function (callable) -- The function to be called after this job concludes.
- •
- args (list) -- The arguments to the function
- •
- kwargs (dict) -- The keyword arguments to the function
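The contract for deferred functions (run after the attempt concludes, even on failure, and possibly more than once) can be modelled in a few lines. Attempt is a hypothetical class for this sketch; it covers normal return and exceptions only, not abnormal process termination, which Toil handles separately.

```python
class Attempt:
    """Toy model of one job attempt with deferred cleanup (not a Toil class)."""

    def __init__(self):
        self._deferred = []

    def defer(self, function, *args, **kwargs):
        self._deferred.append((function, args, kwargs))

    def run(self, body):
        try:
            return body(self)
        finally:
            # Deferred functions run whether the body returned or raised.
            for function, args, kwargs in self._deferred:
                function(*args, **kwargs)


resources = {"container-1"}
calls = []


def remove_resource(name):
    # Idempotent: discarding an already-removed name is a no-op.
    resources.discard(name)
    calls.append(name)


def failing_body(attempt):
    attempt.defer(remove_resource, "container-1")
    attempt.defer(remove_resource, "container-1")  # registered twice, called twice
    raise RuntimeError("job failed")


outcome = "succeeded"
try:
    Attempt().run(failing_body)
except RuntimeError:
    outcome = "failed"
```

Because remove_resource is idempotent, the second call is harmless, which is the property the description above asks deferred functions to have.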
- getTopologicalOrderingOfJobs()
- Returns
- a list of jobs such that for all pairs of indices i, j for which i < j, the job at index i can be run before the job at index j.
- Return type
- list[Job]
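The ordering property (a job at index i can run before any job at a later index j) is that of a topological sort. A minimal sketch using the standard library's graphlib (Python 3.9 or later) on a hypothetical four-job graph, with strings standing in for jobs:

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the jobs that must run after it.
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# TopologicalSorter expects node -> predecessors, so invert the mapping.
predecessors = {job: set() for job in successors}
for parent, children in successors.items():
    for child in children:
        predecessors[child].add(parent)

order = list(TopologicalSorter(predecessors).static_order())
position = {job: index for index, job in enumerate(order)}
```

Several valid orderings may exist (B and C can appear in either order here), so only the relative positions of dependent jobs are guaranteed.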
- saveBody(jobStore)
- Save the execution data for just this job to the JobStore, and fill in the JobDescription with the information needed to retrieve it. The Job's JobDescription must have already had a real jobStoreID assigned to it. Does not save the JobDescription.
- Parameters
- jobStore (toil.jobStores.abstractJobStore.AbstractJobStore) -- The job store to save the job body into.
- saveAsRootJob(jobStore: AbstractJobStore) -> JobDescription
- Save this job to the given jobStore as the root job of the workflow.
- Returns
- the JobDescription describing this job.
- classmethod loadJob(jobStore: AbstractJobStore, jobDescription: JobDescription) -> Job
- Retrieves a toil.job.Job instance from a JobStore
- Parameters
- •
- jobStore -- The job store.
- •
- jobDescription -- the JobDescription of the job to retrieve.
- Returns
- The job referenced by the JobDescription.
JobDescription
The class used to store all the information that the Toil Leader ever needs to know about a Job.
- class toil.job.JobDescription(requirements: Mapping[str, Union[int, str, bool]], jobName: str, unitName: str = '', displayName: str = '', command: Optional[str] = None)
- Stores all the information that the Toil Leader ever needs to know about a Job (requirements information, dependency information, commands to issue, etc.). Can be obtained from an actual (i.e. executable) Job object, and can be used to obtain the Job object from the JobStore. Never contains other Jobs or JobDescriptions: all reference is by ID. Subclassed into variants for checkpoint jobs and service jobs that have their specific parameters.
- __init__(requirements: Mapping[str, Union[int, str, bool]], jobName: str, unitName: str = '', displayName: str = '', command: Optional[str] = None) -> None
- Create a new JobDescription.
- Parameters
- •
- requirements -- Dict from string to number, string, or bool describing the resource requirements of the job. 'cores', 'memory', 'disk', and 'preemptible' fields, if set, are parsed and broken out into properties. If unset, the relevant property will be unspecified, and will be pulled from the assigned Config object if queried (see toil.job.Requirer.assignConfig()).
- •
- jobName -- Name of the kind of job this is. May be used in job store IDs and logging. Also used to let the cluster scaler learn a model for how long the job will take. Ought to be the job class's name if no real user-defined name is available.
- •
- unitName -- Name of this instance of this kind of job. May appear with jobName in logging.
- •
- displayName -- A human-readable name to identify this particular job instance. Ought to be the job class's name if no real user-defined name is available.
- serviceHostIDsInBatches() -> Iterator[List[str]]
- Find all batches of service host job IDs that can be started at the same time, in the order they need to start in.
- successorsAndServiceHosts() -> Iterator[str]
- Get an iterator over all child, follow-on, and service job IDs.
- allSuccessors()
- Get an iterator over all child and follow-on job IDs.
- property services
- Get a collection of the IDs of service host jobs for this job, in arbitrary order. Will be empty if the job has no unfinished services.
- nextSuccessors() -> List[str]
- Return the collection of job IDs for the successors of this job that are ready to run. If those jobs have multiple predecessor relationships, they may still be blocked on other jobs. Returns None when at the final phase (all successors done), and an empty collection if there are more phases but they can't be entered yet (e.g. because we are waiting for the job itself to run).
- property stack: Tuple[Tuple[str, ...], ...]
- Get IDs of successors that still need to run. Batches of successors are in reverse order of the order they need to run in. Some successors in each batch may have already been finished. Batches may be empty. Exists so that code that used the old stack list immutably can still work. New development should use nextSuccessors(), and all mutations should use filterSuccessors() (which automatically removes completed phases).
- Returns
- Batches of successors that still need to run, in reverse order. An empty batch may exist under a non-empty batch, or at the top when the job itself is not done.
- Return type
- tuple(tuple(str))
- filterSuccessors(predicate: Callable[[str], bool]) -> None
- Keep only successor jobs for which the given predicate function approves. The predicate function is called with the job's ID. Treats all other successors as complete and forgets them.
- filterServiceHosts(predicate: Callable[[str], bool]) -> None
- Keep only services for which the given predicate approves. The predicate function is called with the service host job's ID. Treats all other services as complete and forgets them.
- clear_nonexistent_dependents(job_store: AbstractJobStore) -> None
- Remove all references to child, follow-on, and associated service jobs that do not exist (i.e. have been completed and removed) in the given job store.
- clear_dependents() -> None
- Remove all references to child, follow-on, and associated service jobs.
- is_subtree_done() -> bool
- Return True if the job appears to be done, and all related child, follow-on, and service jobs appear to be finished and removed.
- replace(other: JobDescription) -> None
- Take on the ID of another JobDescription, retaining our own state and type. When updated in the JobStore, we will save over the other JobDescription. Useful for chaining jobs: the chained-to job can replace the parent job. Merges cleanup state from the job being replaced into this one.
- Parameters
- other -- Job description to replace.
- addChild(childID: str) -> None
- Make the job with the given ID a child of the described job.
- addFollowOn(followOnID: str) -> None
- Make the job with the given ID a follow-on of the described job.
- addServiceHostJob(serviceID, parentServiceID=None)
- Make the ServiceHostJob with the given ID a service of the described job. If a parent ServiceHostJob ID is given, that parent service will be started first, and must have already been added.
- hasChild(childID: str) -> bool
- Return True if the job with the given ID is a child of the described job.
- hasFollowOn(followOnID: str) -> bool
- Test if the job with the given ID is a follow-on of the described job.
- hasServiceHostJob(serviceID) -> bool
- Test if the ServiceHostJob is a service of the described job.
- renameReferences(renames: Dict[TemporaryID, str]) -> None
- Apply the given dict of ID renames to all references to jobs. Does not modify our own ID or those of finished predecessors. IDs not present in the renames dict are left as-is.
- Parameters
- renames -- Rename operations to apply.
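The renaming rule (mapped IDs are replaced, unmapped IDs are left as-is) can be shown on plain data. In this sketch, references are modelled as batches of string IDs and rename_references is a hypothetical function; the real method updates the JobDescription in place.

```python
def rename_references(batches, renames):
    """Return batches with each ID replaced when it appears in `renames`."""
    return [[renames.get(job_id, job_id) for job_id in batch]
            for batch in batches]


# Hypothetical temporary IDs being replaced by job store IDs.
references = [["tmp-1", "job-7"], ["tmp-2"]]
renames = {"tmp-1": "job-10", "tmp-2": "job-11"}

renamed = rename_references(references, renames)
```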
- addPredecessor() -> None
- Notify the JobDescription that a predecessor has been added to its Job.
- onRegistration(jobStore: AbstractJobStore) -> None
- Called by the Job saving logic when this JobDescription meets the JobStore and has its ID assigned. Overridden to perform setup work (like hooking up flag files for service jobs) that requires the JobStore.
- Parameters
- jobStore -- The job store we are being placed into
- setupJobAfterFailure(exit_status: Optional[int] = None, exit_reason: Optional[BatchJobExitReason] = None)
- Reduce the remainingTryCount if greater than zero and set the memory to be at least as big as the default memory (in case of exhaustion of memory, which is common). Requires a configuration to have been assigned (see toil.job.Requirer.assignConfig()).
- Parameters
- •
- exit_status -- The exit code from the job.
- •
- exit_reason -- The reason the job stopped, if available from the batch system.
- getLogFileHandle(jobStore)
- Returns a context manager that yields a file handle to the log file. Assumes logJobStoreFileID is set.
- property remainingTryCount
- The try count set on the JobDescription, or the default based on the retry count from the config if none is set.
- clearRemainingTryCount() -> bool
- Clear remainingTryCount and set it back to its default value.
- Returns
- True if a modification to the JobDescription was made, and False otherwise.
- pre_update_hook() -> None
- Called by the job store before pickling and saving a created or updated version of a job.
- get_job_kind() -> str
- Returns an identifier of the job for use with the message bus. This is either the unit name, job name, or display name, which identifies the kind of job it is to Toil. If no identifier is available, returns Unknown Job.
JOB.RUNNER API
The Runner contains the methods needed to configure and start a Toil run.
- class Job.Runner
- Used to set up and run a Toil workflow.
- static getDefaultArgumentParser() -> ArgumentParser
- Get argument parser with added toil workflow options.
- Returns
- The argument parser used by a toil workflow with added Toil options.
- Return type
- argparse.ArgumentParser
- static getDefaultOptions(jobStore: str) -> Namespace
- Get default options for a toil workflow.
- Parameters
- jobStore (string) -- A string describing the jobStore for the workflow.
- Returns
- The options used by a toil workflow.
- Return type
- argparse.ArgumentParser values object
- static addToilOptions(parser)
- Adds the default toil options to an optparse or argparse parser object.
- Parameters
- parser (optparse.OptionParser or argparse.ArgumentParser) -- Options object to add toil options to.
- static startToil(job, options)
- Run the Toil workflow using the given options (see Job.Runner.getDefaultOptions and Job.Runner.addToilOptions), starting with this job. Deprecated by toil.common.Toil.start.
- Parameters
- job (toil.job.Job) -- Root job of the workflow.
- Raises
- toil.leader.FailedJobsException -- If failed jobs remain at the end of the function.
- Returns
- The return value of the root job's run function.
- Return type
- Any
JOB.FILESTORE API
The AbstractFileStore is an abstraction of a Toil run's shared storage.
- class toil.fileStores.abstractFileStore.AbstractFileStore(jobStore: AbstractJobStore, jobDesc: JobDescription, file_store_dir: str, waitForPreviousCommit: Callable[[], Any])
- Interface used to allow user code run by Toil to read and write files. Also provides the interface to other Toil facilities used by user code, including:
- •
- normal (non-real-time) logging
- •
- finding the correct temporary directory for scratch work
- •
- importing and exporting files into and out of the workflow
- __init__(jobStore: AbstractJobStore, jobDesc: JobDescription, file_store_dir: str, waitForPreviousCommit: Callable[[], Any]) -> None
- Create a new file store object.
- Parameters
- •
- jobStore -- the job store in use for the current Toil run.
- •
- jobDesc -- the JobDescription object for the currently running job.
- •
- file_store_dir -- the per-worker local temporary directory where the file store should store local files. Per-job directories will be created under here by the file store.
- •
- waitForPreviousCommit -- the waitForCommit method of the previous job's file store, when jobs are running in sequence on the same worker. Used to prevent this file store's startCommit and the previous job's startCommit methods from running at the same time and racing. If they did race, it might be possible for the later job to be fully marked as completed in the job store before the earlier job was.
- static createFileStore(jobStore: AbstractJobStore, jobDesc: JobDescription, file_store_dir: str, waitForPreviousCommit: Callable[[], Any], caching: Optional[bool]) -> Union[NonCachingFileStore, CachingFileStore]
- Create a concrete FileStore.
- static shutdownFileStore(workflowID: str, config_work_dir: Optional[str], config_coordination_dir: Optional[ str]) -> None
- Carry out any necessary filestore-specific cleanup. This is a destructive operation and it is important to ensure that there are no other running processes on the system that are modifying or using the file store for this workflow. This is intended to be the last call to the file store in a Toil run, called by the batch system cleanup function upon batch system shutdown.
- Parameters
- •
- workflowID -- The workflow ID for this invocation of the workflow
- •
- config_work_dir -- The path to the work directory in the Toil Config.
- •
- config_coordination_dir -- The path to the coordination directory in the Toil Config.
- open(job: Job) -> Generator[None, None, None]
- Create the context manager that wraps the tasks performed before and after a job is run. File operations are only permitted inside the context manager. Implementations must only yield from within with super().open(job):.
- Parameters
- job -- The job instance of the toil job to run.
- getLocalTempDir() -> str
- Get a new local temporary directory in which to write files. The directory will only persist for the duration of the job.
- Returns
- The absolute path to a new local temporary directory. This directory will exist for the duration of the job only, and is guaranteed to be deleted once the job terminates, removing all files it contains recursively.
- getLocalTempFile(suffix: Optional[str] = None, prefix: Optional[str] = None) -> str
- Get a new local temporary file that will persist for the duration of the job.
- Parameters
- •
- suffix -- If not None, the file name will end with this string. Otherwise, default value ".tmp" will be used
- •
- prefix -- If not None, the file name will start with this string. Otherwise, default value "tmp" will be used
- Returns
- The absolute path to a local temporary file. This file will exist for the duration of the job only, and is guaranteed to be deleted once the job terminates.
- getLocalTempFileName(suffix: Optional[str] = None, prefix: Optional[str] = None) -> str
- Get a valid name for a new local file. Don't actually create a file at the path.
- Parameters
- •
- suffix -- If not None, the file name will end with this string. Otherwise, default value ".tmp" will be used
- •
- prefix -- If not None, the file name will start with this string. Otherwise, default value "tmp" will be used
- Returns
- Path to valid file
- abstract writeGlobalFile(localFileName: str, cleanup: bool = False) -> FileID
- Upload a file (as a path) to the job store. If the file is in a FileStore-managed temporary directory (i.e. from toil.fileStores.abstractFileStore.AbstractFileStore.getLocalTempDir()), it will become a local copy of the file, eligible for deletion by toil.fileStores.abstractFileStore.AbstractFileStore.deleteLocalFile(). If an executable file on the local filesystem is uploaded, its executability will be preserved when it is downloaded again.
- Parameters
- •
- localFileName -- The path to the local file to upload. The last path component (basename of the file) will remain associated with the file in the file store, if supported by the backing JobStore, so that the file can be searched for by name or name glob.
- •
- cleanup -- if True then the copy of the global file will be deleted once the job and all its successors have completed running. If not the global file must be deleted manually.
- Returns
- an ID that can be used to retrieve the file.
- writeGlobalFileStream(cleanup: bool = False, basename: Optional[str] = None, encoding: Optional[str] = None, errors: Optional[str] = None) -> Iterator[Tuple[WriteWatchingStream, FileID]]
- Similar to writeGlobalFile, but allows the writing of a stream to the job store. The yielded file handle does not need to and should not be closed explicitly.
- Parameters
- •
- encoding -- The name of the encoding used to decode the file. Encodings are the same as for decode(). Defaults to None which represents binary mode.
- •
- errors -- Specifies how encoding errors are to be handled. Errors are the same as for open(). Defaults to 'strict' when an encoding is specified.
- •
- cleanup -- is as in toil.fileStores.abstractFileStore.AbstractFileStore.writeGlobalFile().
- •
- basename -- If supported by the backing JobStore, use the given file basename so that when searching the job store with a query matching that basename, the file will be detected.
- Returns
- A context manager yielding a tuple of 1) a file handle which can be written to and 2) the toil.fileStores.FileID of the resulting file in the job store.
- logAccess(fileStoreID: Union[FileID, str], destination: Optional[str] = None) -> None
- Record that the given file was read by the job, so that it can be announced if the job fails. If destination is not None, it gives the path that the file was downloaded to. Otherwise, assumes that the file was streamed. Must be called by readGlobalFile() and readGlobalFileStream() implementations.
- abstract readGlobalFile(fileStoreID: str, userPath: Optional[str] = None, cache: bool = True, mutable: bool = False, symlink: bool = False) -> str
- Make the file associated with fileStoreID available locally. If mutable is True, then a copy of the file will be created locally so that the original is not modified and does not change the file for other jobs. If mutable is False, then a link can be created to the file, saving disk resources. The file that is downloaded will be executable if and only if it was originally uploaded from an executable file on the local filesystem. If a user path is specified, it is used as the destination. If a user path isn't specified, the file is stored in the local temp directory with an encoded name. The destination file must not be deleted by the user; it can only be deleted through deleteLocalFile. Implementations must call logAccess() to report the download.
- Parameters
- •
- fileStoreID -- job store id for the file
- •
- userPath -- a path to the name of file to which the global file will be copied or hard-linked (see below).
- •
- cache -- Described in toil.fileStores.CachingFileStore.readGlobalFile()
- •
- mutable -- Described in toil.fileStores.CachingFileStore.readGlobalFile()
- Returns
- An absolute path to a local, temporary copy of the file keyed by fileStoreID.
- abstract readGlobalFileStream(fileStoreID: str, encoding: Optional[str] = None, errors: Optional[str] = None) -> ContextManager[Union[ IO[bytes], IO[str]]]
- Read a stream from the job store; similar to readGlobalFile. The yielded file handle does not need to and should not be closed explicitly.
- Parameters
- •
- encoding -- the name of the encoding used to decode the file. Encodings are the same as for decode(). Defaults to None which represents binary mode.
- •
- errors -- an optional string that specifies how encoding errors are to be handled. Errors are the same as for open(). Defaults to 'strict' when an encoding is specified.
- Returns
- a context manager yielding a file handle which can be read from.
- getGlobalFileSize(fileStoreID: Union[FileID, str]) -> int
- Get the size of the file pointed to by the given ID, in bytes. If a FileID or something else with a non-None 'size' field, gets that. Otherwise, asks the job store to poll the file's size. Note that the job store may overestimate the file's size, for example if it is encrypted and had to be augmented with an IV or other encryption framing.
- Parameters
- fileStoreID -- File ID for the file
- Returns
- File's size in bytes, as stored in the job store
- abstract deleteLocalFile(fileStoreID: Union[FileID, str]) -> None
- Delete local copies of files associated with the provided job store ID. Raises an OSError with an errno of errno.ENOENT if no such local copies exist. Thus, cannot be called multiple times in succession. The files deleted are all those previously read from this file ID via readGlobalFile by the current job into the job's file-store-provided temp directory, plus the file that was written to create the given file ID, if it was written by the current job from the job's file-store-provided temp directory.
- Parameters
- fileStoreID -- File Store ID of the file to be deleted.
- abstract deleteGlobalFile(fileStoreID: Union[FileID, str]) -> None
- Delete local files and then permanently delete them from the job store. To ensure that the job can be restarted if necessary, the delete will not happen until after the job's run method has completed.
- Parameters
- fileStoreID -- the File Store ID of the file to be deleted.
- logToMaster(text: str, level: int = 20) -> None
- Send a logging message to the leader. The message will also be logged by the worker at the same level.
- Parameters
- •
- text -- The string to log.
- •
- level -- The logging level.
- abstract startCommit(jobState: bool = False) -> None
- Update the status of the job on the disk. May start an asynchronous process. Call waitForCommit() to wait on that process.
- Parameters
- jobState -- If True, commit the state of the FileStore's job, and file deletes. Otherwise, commit only file creates/updates.
- abstract waitForCommit() -> bool
- Blocks while startCommit is running. This function is called by this job's successor to ensure that it does not begin modifying the job store until after this job has finished doing so. Might be called when startCommit is never called on a particular instance, in which case it does not block.
- Returns
- Always returns True
- abstract classmethod shutdown(shutdown_info: Any) -> None
- Shutdown the filestore on this node. This is intended to be called on batch system shutdown.
- Parameters
- shutdown_info -- The implementation-specific shutdown information, for shutting down the file store and removing all its state and all job local temp directories from the node.
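The interplay between startCommit, waitForCommit and the waitForPreviousCommit constructor argument can be sketched with plain threads. The following is a minimal illustration and not Toil's implementation: the SketchFileStore class and its name, log and delay arguments are hypothetical. The only behaviour taken from the descriptions above is that a commit first waits for the previous job's commit, and that waitForCommit does not block if startCommit was never called.

```python
import threading
import time
from typing import Callable, List, Optional


class SketchFileStore:
    """Hypothetical stand-in; it does not subclass AbstractFileStore."""

    def __init__(
        self,
        name: str,
        log: List[str],
        waitForPreviousCommit: Callable[[], bool],
        delay: float = 0.0,
    ) -> None:
        self.name = name
        self.log = log  # shared list standing in for the job store
        self.waitForPreviousCommit = waitForPreviousCommit
        self.delay = delay  # simulated duration of the commit
        self._thread: Optional[threading.Thread] = None

    def startCommit(self) -> None:
        def commit() -> None:
            # Never touch the "job store" before the previous job is done.
            self.waitForPreviousCommit()
            time.sleep(self.delay)
            self.log.append(self.name)

        self._thread = threading.Thread(target=commit)
        self._thread.start()

    def waitForCommit(self) -> bool:
        # Does not block if startCommit was never called on this instance.
        if self._thread is not None:
            self._thread.join()
        return True
```

Passing first.waitForCommit as the waitForPreviousCommit argument of a second store keeps the two commits in order even when the first one is slow.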
- class toil.fileStores.FileID(fileStoreID: str, *args: Any)
- A small wrapper around Python's builtin string class. It is used to represent a file's ID in the file store, and has a size attribute that is the file's size in bytes. This object is returned by importFile and writeGlobalFile. Calls into the file store can use bare strings; size will be queried from the job store if unavailable in the ID.
- __init__(fileStoreID: str, size: int, executable: bool = False) -> None
- pack() -> str
- Pack the FileID into a string so it can be passed through external code.
- classmethod unpack(packedFileStoreID: str) -> FileID
- Unpack the result of pack() into a FileID object.
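The idea of a string subclass that carries a size and survives a round trip through pack() and unpack() can be sketched as follows. This is an illustration only: SketchFileID is a hypothetical stand-in, and the "size:executable:id" packed format is an assumption made for the example, not the format Toil uses.

```python
class SketchFileID(str):
    """Hypothetical stand-in for illustration; not toil.fileStores.FileID."""

    def __new__(cls, file_store_id: str, size: int, executable: bool = False):
        # The string value itself is the file store ID.
        return super().__new__(cls, file_store_id)

    def __init__(self, file_store_id: str, size: int, executable: bool = False) -> None:
        self.size = size
        self.executable = executable

    def pack(self) -> str:
        # Assumed format for this sketch: "<size>:<0 or 1>:<id>".
        return f"{self.size}:{int(self.executable)}:{str(self)}"

    @classmethod
    def unpack(cls, packed: str) -> "SketchFileID":
        # Split only twice so that colons inside the ID are preserved.
        size, executable, file_store_id = packed.split(":", 2)
        return cls(file_store_id, int(size), bool(int(executable)))
```

Because the object is still a string, it can be handed to any call that accepts a bare ID, while code that knows about the wrapper can read the size without asking the job store.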
BATCH SYSTEM API
The batch system interface is used by Toil to abstract over different ways of running batches of jobs, for example Slurm, GridEngine, Mesos, Parasol and a single node. The toil.batchSystems.abstractBatchSystem.AbstractBatchSystem API is implemented to run jobs using a given job management system, e.g. Mesos.
Batch System Environmental Variables
Environmental variables allow passing of scheduler-specific parameters.
export TOIL_SLURM_ARGS="-t 1:00:00 -q fatq"
export TOIL_SLURM_PE='multicore'
export TOIL_TORQUE_ARGS="-q fatq"
export TOIL_TORQUE_REQS="walltime=1:00:00"
export TOIL_GRIDENGINE_PE='smp'
export TOIL_GRIDENGINE_ARGS='-q batch.q'
export TOIL_HTCONDOR_PARAMS='requirements = TARGET.has_sse4_2 == true; accounting_group = test'
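When a workflow is launched from another Python program rather than from a shell, the same variables can be placed in the child process environment. The sketch below assumes nothing about Toil beyond the TOIL_SLURM_ARGS variable shown above; the helper name with_slurm_args and its parameters are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def with_slurm_args(
    time_limit: str,
    qos: str,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with TOIL_SLURM_ARGS set.

    Hypothetical helper: it only builds the same string as the
    export line above and never modifies os.environ itself.
    """
    env = dict(os.environ if base is None else base)
    env["TOIL_SLURM_ARGS"] = f"-t {time_limit} -q {qos}"
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the workflow script.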
Batch System API
- class toil.batchSystems.abstractBatchSystem.AbstractBatchSystem
- An abstract base class to represent the interface the batch system must provide to Toil.
- abstract classmethod supportsAutoDeployment() -> bool
- Whether this batch system supports auto-deployment of the user script itself. If it does, the setUserScript() can be invoked to set the resource object representing the user script. Note to implementors: If your implementation returns True here, it should also override
- abstract classmethod supportsWorkerCleanup() -> bool
- Indicates whether this batch system invokes BatchSystemSupport.workerCleanup() after the last job for a particular workflow invocation finishes. Note that the term worker refers to an entire node, not just a worker process. A worker process may run more than one job sequentially, and more than one concurrent worker process may exist on a worker node, for the same workflow. The batch system is said to shut down after the last worker process terminates.
- setUserScript(userScript: Resource) -> None
- Set the user script for this workflow. This method must be called before the first job is issued to this batch system, and only if supportsAutoDeployment() returns True, otherwise it will raise an exception.
- Parameters
- userScript -- the resource object representing the user script or module and the modules it depends on.
- set_message_bus(message_bus: MessageBus) -> None
- Give the batch system an opportunity to connect directly to the message bus, so that it can send informational messages about the jobs it is running to other Toil components.
- abstract issueBatchJob(jobDesc: JobDescription, job_environment: Optional[Dict[str, str]] = None) -> int
- Issues a job with the specified command to the batch system and returns a unique jobID.
- Parameters
- •
- jobDesc -- a toil.job.JobDescription
- •
- job_environment -- a collection of job-specific environment variables to be set on the worker.
- Returns
- a unique jobID that can be used to reference the newly issued job
- abstract killBatchJobs(jobIDs: List[int]) -> None
- Kills the given job IDs. After returning, the killed jobs will not appear in the results of getRunningBatchJobIDs. The killed job will not be returned from getUpdatedBatchJob.
- Parameters
- jobIDs -- list of IDs of jobs to kill
- abstract getIssuedBatchJobIDs() -> List[int]
- Gets all currently issued jobs
- Returns
- A list of jobs (as jobIDs) currently issued (may be running, or may be waiting to be run). Despite the result being a list, the ordering should not be depended upon.
- abstract getRunningBatchJobIDs() -> Dict[int, float]
- Gets a map of jobs as jobIDs that are currently running (not just waiting) and how long they have been running, in seconds.
- Returns
- dictionary with currently running jobID keys and how many seconds they have been running as the value
- abstract getUpdatedBatchJob(maxWait: int) -> Optional[UpdatedBatchJobInfo]
- Returns information about a job that has updated its status (i.e. ceased running, either successfully or with an error). Each such job will be returned exactly once. Does not return info for jobs killed by killBatchJobs, although they may cause None to be returned earlier than maxWait.
- Parameters
- maxWait -- the number of seconds to block, waiting for a result
- Returns
- If a result is available, returns UpdatedBatchJobInfo. Otherwise it returns None. wallTime is the number of seconds (a strictly positive float) in wall-clock time the job ran for, or None if this batch system does not support tracking wall time.
- getSchedulingStatusMessage() -> Optional[str]
- Get a log message fragment for the user about anything that might be going wrong in the batch system, if available. If no useful message is available, return None. This can be used to report what resource is the limiting factor when scheduling jobs, for example. If the leader thinks the workflow is stuck, the message can be displayed to the user to help them diagnose why it might be stuck.
- Returns
- User-directed message about scheduling state.
- abstract shutdown() -> None
- Called at the completion of a toil invocation. Should cleanly terminate all worker threads.
- setEnv(name: str, value: Optional[str] = None) -> None
- Set an environment variable for the worker process before it is launched. The worker process will typically inherit the environment of the machine it is running on but this method makes it possible to override specific variables in that inherited environment before the worker is launched. Note that this mechanism is different to the one used by the worker internally to set up the environment of a job. A call to this method affects all jobs issued after this method returns. Note to implementors: This means that you would typically need to copy the variables before enqueuing a job. If no value is provided it will be looked up from the current environment.
- classmethod add_options(parser: Union[ArgumentParser, _ArgumentGroup]) -> None
- If this batch system provides any command line options, add them to the given parser.
- classmethod setOptions(setOption: OptionSetter) -> None
- Process command line or configuration options relevant to this batch system.
- Parameters
- setOption -- A function with signature setOption(option_name, parsing_function=None, check_function=None, default=None, env=None) returning nothing, used to update run configuration as a side effect.
- getWorkerContexts() -> List[ContextManager[ Any]]
- Get a list of picklable context manager objects to wrap worker work in, in order. Can be used to ask the Toil worker to do things in-process (such as configuring environment variables, hot-deploying user scripts, or cleaning up a node) that would otherwise require a wrapping "executor" process.
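The contract described above (unique job IDs, each finished job reported exactly once, killed jobs never reported, None on timeout) can be illustrated with a toy in-memory class. This is a sketch under stated assumptions: ToyBatchSystem does not subclass AbstractBatchSystem, it takes a command string instead of a JobDescription, and the SketchUpdate tuple and the finish() hook are hypothetical names introduced only for the example.

```python
import queue
from typing import Dict, List, NamedTuple, Optional


class SketchUpdate(NamedTuple):
    """Hypothetical stand-in for the information returned about a job."""
    jobID: int
    exitStatus: int


class ToyBatchSystem:
    """Toy illustration of the interface semantics; runs nothing."""

    def __init__(self) -> None:
        self._next_id = 0
        self._issued: Dict[int, str] = {}
        self._updates: "queue.Queue[SketchUpdate]" = queue.Queue()

    def issueBatchJob(self, command: str) -> int:
        job_id = self._next_id
        self._next_id += 1
        self._issued[job_id] = command
        return job_id

    def getIssuedBatchJobIDs(self) -> List[int]:
        return list(self._issued)

    def killBatchJobs(self, jobIDs: List[int]) -> None:
        # Killed jobs are forgotten and never produce an update.
        for job_id in jobIDs:
            self._issued.pop(job_id, None)

    def finish(self, job_id: int, exit_status: int = 0) -> None:
        # Hypothetical hook simulating a job that ceased running.
        del self._issued[job_id]
        self._updates.put(SketchUpdate(job_id, exit_status))

    def getUpdatedBatchJob(self, maxWait: int) -> Optional[SketchUpdate]:
        # Block for up to maxWait seconds; each update is handed out once.
        try:
            return self._updates.get(timeout=maxWait)
        except queue.Empty:
            return None
```

A leader loop would repeatedly call getUpdatedBatchJob with a small maxWait and treat None as "nothing new yet".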
JOB.SERVICE API
The Service class allows databases and servers to be spawned within a Toil workflow.
- class Job.Service(memory=None, cores=None, disk=None, accelerators=None, preemptible=None, unitName=None)
- Abstract class used to define the interface to a service. Should be subclassed by the user to define services. Is not executed as a job; runs within a ServiceHostJob.
- __init__(memory=None, cores=None, disk=None, accelerators=None, preemptible=None, unitName=None)
- Memory, core and disk requirements are specified in the same way as in toil.job.Job.__init__().
- abstract start(job)
- Start the service.
- Parameters
- job (toil.job.Job) -- The underlying host job that the service is being run in. Can be used to register deferred functions, or to access the fileStore for creating temporary files.
- Returns
- An object describing how to access the service. The object must be pickleable and will be used by jobs to access the service (see toil.job.Job.addService()).
- abstract stop(job)
- Stops the service. Function can block until complete.
- Parameters
- job (toil.job.Job) -- The underlying host job that the service is being run in. Can be used to register deferred functions, or to access the fileStore for creating temporary files.
- check()
- Checks the service is still running.
- Raises
- exceptions.RuntimeError -- If the service failed, this will cause the service job to be labeled failed.
- Returns
- True if the service is still running, else False. If False then the service job will be terminated, and considered a success. Important point: if the service job exits due to a failure, it should raise a RuntimeError, not return False!
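The start/stop/check contract, and in particular the difference between returning False and raising RuntimeError from check(), can be sketched with a subprocess-backed stand-in. SketchService is hypothetical and does not subclass Job.Service. Treating exit status 0 as a clean stop and any other status as a failure is an assumption made for this example.

```python
import subprocess
import sys
from typing import List, Optional


class SketchService:
    """Hypothetical stand-in mirroring the start/stop/check contract."""

    def __init__(self, argv: List[str]) -> None:
        self.argv = argv
        self.process: Optional[subprocess.Popen] = None

    def start(self, job=None) -> int:
        self.process = subprocess.Popen(self.argv)
        # Return something picklable that describes how to reach the
        # service; in this sketch that is just the process ID.
        return self.process.pid

    def check(self) -> bool:
        if self.process is None:
            raise ValueError("start() must be called before check()")
        code = self.process.poll()
        if code is None:
            return True  # still running
        if code == 0:
            return False  # stopped cleanly: counted as a success
        # A failure must be raised, not reported as False.
        raise RuntimeError(f"service exited with status {code}")

    def stop(self, job=None) -> None:
        if self.process is None:
            return
        if self.process.poll() is None:
            self.process.terminate()
        self.process.wait()  # block until the service is gone
```

For example, SketchService([sys.executable, "-m", "http.server"]) would keep returning True from check() until stop() is called.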
EXCEPTIONS API
Toil specific exceptions.
- exception toil.job.JobException(message: str)
- General job exception.
- __init__(message: str) -> None
- exception toil.job.JobGraphDeadlockException(string)
- An exception raised in the event that a workflow contains an unresolvable dependency, such as a cycle. See toil.job.Job.checkJobGraphForDeadlocks().
- __init__(string)
- exception toil.jobStores.abstractJobStore.ConcurrentFileModificationException(jobStoreFileID: FileID)
- Indicates that multiple processes attempted to modify the file at once.
- __init__(jobStoreFileID: FileID)
- Parameters
- jobStoreFileID -- the ID of the file that was modified by multiple workers or processes concurrently
- exception toil.jobStores.abstractJobStore.JobStoreExistsException(locator: str)
- Indicates that the specified job store already exists.
- __init__(locator: str)
- Parameters
- locator (str) -- The location of the job store
- exception toil.jobStores.abstractJobStore.NoSuchFileException(jobStoreFileID: FileID, customName: Optional[str] = None, *extra: Any)
- Indicates that the specified file does not exist.
- __init__(jobStoreFileID: FileID, customName: Optional[ str] = None, *extra: Any)
- Parameters
- •
- jobStoreFileID -- the ID of the file that was mistakenly assumed to exist
- •
- customName -- optionally, an alternate name for the nonexistent file
- •
- extra (list) -- optional extra information to add to the error message
- exception toil.jobStores.abstractJobStore.NoSuchJobException(jobStoreID: FileID)
- Indicates that the specified job does not exist.
- __init__(jobStoreID: FileID)
- Parameters
- jobStoreID (str) -- the jobStoreID that was mistakenly assumed to exist
- exception toil.jobStores.abstractJobStore.NoSuchJobStoreException(locator: str)
- Indicates that the specified job store does not exist.
- __init__(locator: str)
- Parameters
- locator (str) -- The location of the job store
RUNNING TESTS
Test make targets, invoked as $ make <target>, subject to which environment variables are set (see Running Integration Tests).
TARGET | DESCRIPTION |
test | Invokes all tests. |
integration_test | Invokes only the integration tests. |
test_offline | Skips building the Docker appliance and only invokes tests that have no Docker dependencies. |
integration_test_local | Makes integration tests easier to debug locally by running them serially and without redirecting output, so that the output appears on the terminal as expected. |
$ make test
$ export TOIL_TEST_QUICK=True; make test
$ make test tests=src/toil/test/sort/sortTest.py::SortTest::testSort
$ make test tests="-m 'not aws' src"
$ make test tests="-m 'not aws and not parasol' src"
Running Tests with pytest
Often it is simpler to use pytest directly, instead of calling the make wrapper. This usually works as expected, but some tests need some manual preparation. To run a specific test with pytest, use the following:
python3 -m pytest src/toil/test/sort/sortTest.py::SortTest::testSort
Running Integration Tests
These tests are generally only run in our CI workflow due to their resource requirements and cost. However, they can be made available for local testing:
- •
- Running tests that make use of Docker (e.g. autoscaling tests and Docker tests) requires an appliance image to be hosted. First, make sure you have gone through the setup found in Using Docker with Quay. Then, to build and host the appliance image, run the make target push_docker.
$ make push_docker
- •
- Running integration tests requires activation via an environment variable as well as exporting information relevant to the desired tests. Enable the integration tests:
$ export TOIL_TEST_INTEGRATIVE=True
- •
- Finally, set the environment variables for keyname and desired zone:
$ export TOIL_X_KEYNAME=[Your Keyname]
$ export TOIL_X_ZONE=[Desired Zone]
- •
- See the above sections for guidance on running tests.
Test Environment Variables
TOIL_TEST_TEMP | An absolute path to a directory where Toil tests will write their temporary files. Defaults to the system's standard temporary directory. |
TOIL_TEST_INTEGRATIVE | If True, this allows the integration tests to run. Only valid when running the tests from the source directory via make test or make test_parallel. |
TOIL_AWS_KEYNAME | An AWS keyname (see Preparing your AWS environment), which is required to run the AWS tests. |
TOIL_GOOGLE_PROJECTID | A Google Cloud account projectID (see Running in Google Compute Engine (GCE)), which is required to run the Google Cloud tests. |
TOIL_TEST_QUICK | If True, long running tests are skipped. |
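A test can honour a variable such as TOIL_TEST_QUICK with the standard library alone. The sketch below is not taken from Toil's test suite: the quick_mode helper, the SketchSlowTest class, and the rule that only the case-insensitive string "true" enables quick mode are assumptions made for illustration.

```python
import os
import unittest
from typing import Mapping


def quick_mode(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if TOIL_TEST_QUICK is set to "True" (any letter case).

    Hypothetical helper; the exact parsing Toil applies is not
    described in this document.
    """
    return env.get("TOIL_TEST_QUICK", "").strip().lower() == "true"


class SketchSlowTest(unittest.TestCase):
    # The decision is made once, when the class body is evaluated.
    @unittest.skipIf(quick_mode(), "TOIL_TEST_QUICK is set; skipping long-running test")
    def test_long_running(self) -> None:
        self.assertEqual(sum(range(10)), 45)
```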
- Partial install and failing tests
- Some tests may fail with an ImportError if the required extras are not installed. Install Toil with all of the extras to prevent such errors.
Using Docker with Quay
Docker is needed for some of the tests. Follow the appropriate installation instructions for your system on the Docker website to get started.
$ make test
Please set TOIL_DOCKER_REGISTRY, e.g. to quay.io/USER.
$ TOIL_DOCKER_REGISTRY=quay.io/USER make test
$ echo 'export TOIL_DOCKER_REGISTRY=quay.io/USER' >> $HOME/.bashrc
Running Mesos Tests
If you're running Toil's Mesos tests, be sure to create the virtualenv with --system-site-packages to include the Mesos Python bindings. Verify this by activating the virtualenv and running pip list | grep mesos. On macOS, this may come up empty. To fix it, run the following:
for i in /usr/local/lib/python2.7/site-packages/*mesos*; do ln -snf $i venv/lib/python2.7/site-packages/; done
DEVELOPING WITH DOCKER
To develop on features reliant on the Toil Appliance (the Docker image Toil uses for AWS autoscaling), you should consider setting up a personal registry on Quay or Docker Hub. Because the Toil Appliance images are tagged with the Git commit they are based on and because only commits on our master branch trigger an appliance build on Quay, as soon as a developer makes a commit or dirties the working copy they will no longer be able to rely on Toil to automatically detect the proper Toil Appliance image. Instead, developers wishing to test any appliance changes in autoscaling should build and push their own appliance image to a personal Docker registry. This is described in the next section.
Making Your Own Toil Docker Image
Note! Toil checks if the Docker image specified by TOIL_APPLIANCE_SELF exists prior to launching by using the Docker v2 schema. This should be valid for any major Docker repository, but there is an option to override this if desired using the option: --forceDockerAppliance.
- 1.
- Make some changes to the provisioner of your local version of Toil
- 2.
- Go to the location where you installed the Toil source code and run
$ make docker
- 3.
- If Docker is not already installed, install it, then log into Quay. Also make sure that your Quay account is public.
- 4.
- Set the environment variable TOIL_DOCKER_REGISTRY to your Quay account. If you find yourself doing this often, you may want to add the following line to your shell startup file (for example $HOME/.bashrc):
export TOIL_DOCKER_REGISTRY=quay.io/<MY_QUAY_USERNAME>
- 5.
- Now you can run
$ make push_docker
- 6.
- Finally you will need to tell Toil from where to pull the Appliance image you've created (it uses the Toil release you have installed by default). To do this set the environment variable TOIL_APPLIANCE_SELF to the url of your image. For more info see Environment Variables.
- 7.
- Now you can launch your cluster! For more information see Running a Workflow with Autoscaling.
Running a Cluster Locally
The Toil Appliance container can also be useful as a test environment since it can simulate a Toil cluster locally. An important caveat for this is autoscaling, since autoscaling will only work on an EC2 instance and cannot (at this time) be run on a local machine.
docker run \
    --entrypoint=mesos-master \
    --net=host \
    -d \
    --name=leader \
    --volume=/home/jobStoreParentDir:/jobStoreParentDir \
    quay.io/ucsc_cgl/toil:3.6.0 \
    --registry=in_memory \
    --ip=127.0.0.1 \
    --port=5050 \
    --allocation_interval=500ms
docker run \
    --entrypoint=mesos-slave \
    --net=host \
    -d \
    --name=worker \
    --volume=/home/jobStoreParentDir:/jobStoreParentDir \
    quay.io/ucsc_cgl/toil:3.6.0 \
    --work_dir=/var/lib/mesos \
    --master=127.0.0.1:5050 \
    --ip=127.0.0.1 \
    --attributes=preemptable:False \
    --resources=cpus:2
docker exec -it leader bash
- Docker-in-Docker issues
- If you want to run Docker inside this Docker cluster (Dockerized tools, perhaps), you should also mount in the Docker socket via -v /var/run/docker.sock:/var/run/docker.sock. This will give the Docker client inside the Toil Appliance access to the Docker engine on the host. Client/engine version mismatches have been known to cause issues, so we recommend using Docker version 1.12.3 on the host to be compatible with the Docker client installed in the Appliance. Finally, be careful where you write files inside the Toil Appliance - 'child' Docker containers launched in the Appliance will actually be siblings to the Appliance since the Docker engine is located on the host. This means that the 'child' container can only mount in files from the Appliance if the files are located in a directory that was originally mounted into the Appliance from the host - that way the files are accessible to the sibling container. Note: if Docker can't find the file/directory on the host it will silently fail and mount in an empty directory.
MAINTAINER'S GUIDELINES
In general, as developers and maintainers of the code, we adhere to the following guidelines:- •
- We strive to never break the build on master. All development should be done on branches, in either the main Toil repository or in developers' forks.
- •
- Pull requests should be used for any and all changes (except truly trivial ones).
- •
- Pull requests should be in response to issues. If you find yourself making a pull request without an issue, you should create the issue first.
Naming Conventions
- •
- Commit messages should be great. Most importantly, they must:
- •
- Have a short subject line. If in need of more space, drop down two lines and write a body to explain what is changing and why it has to change.
- •
- Write the subject line as a command: Destroy all humans, not All humans destroyed.
- •
- Reference the issue being fixed in a Github-parseable format, such as (resolves #1234) at the end of the subject line, or This will fix #1234. somewhere in the body. If no single commit on its own fixes the issue, the cross-reference must appear in the pull request title or body instead.
- •
- Branches in the main Toil repository must start with issues/, followed by the issue number (or numbers, separated by a dash), followed by a short, lowercase, hyphenated description of the change. (There can be many open pull requests with their associated branches at any given point in time and this convention ensures that we can easily identify branches.) Say there is an issue numbered #123 titled Foo does not work. The branch name would be issues/123-fix-foo and the title of the commit would be Fix foo in case of bar (resolves #123).
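The naming rules above lend themselves to a quick automated check. This is a minimal sketch under stated assumptions: the regular expressions are one reading of the conventions, the helper names are hypothetical, and only a subset of the keywords Github recognizes is matched.

```python
import re

# issues/ + one or more issue numbers joined by dashes + a lowercase,
# hyphenated description, e.g. issues/123-fix-foo
BRANCH_RE = re.compile(r"^issues/\d+(?:-\d+)*-[a-z][a-z0-9]*(?:-[a-z0-9]+)*$")

# A closing keyword followed by an issue number, e.g. "resolves #123".
# Assumption: only these four keywords are checked.
ISSUE_REF_RE = re.compile(r"\b(?:resolves|fixes|fix|closes)\s+#\d+", re.IGNORECASE)


def branch_name_ok(name):
    """True if `name` follows the issues/<numbers>-<description> pattern."""
    return BRANCH_RE.match(name) is not None


def commit_references_issue(message):
    """True if the commit message contains a parseable issue reference."""
    return ISSUE_REF_RE.search(message) is not None


print(branch_name_ok("issues/123-fix-foo"))
print(branch_name_ok("fix-foo"))
print(commit_references_issue("Fix foo in case of bar (resolves #123)"))
```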
Pull Requests
- •
- All pull requests must be reviewed by a person other than the request's author. Review the PR by following the Reviewing Pull Requests checklist.
- •
- Modified pull requests must be re-reviewed before merging. Note that Github does not enforce this!
- •
- Merge pull requests by following the Merging Pull Requests checklist.
- •
- When merging a pull request, make sure to update the Draft Changelog on the Github wiki, which we will use to produce the changelog for the next release. The PR template tells you to do this, so don't forget. New entries should go at the bottom.
- •
- Pull requests will not be merged unless CI tests pass. Gitlab tests are only run on code in the main Toil repository on some branch, so it is the responsibility of the approving reviewer to make sure that pull requests from outside repositories are copied to branches in the main repository. This can be accomplished with (from a Toil clone):
./contrib/admin/test-pr theirusername their-branch issues/123-fix-description-here
- •
- Prefer using "Squash and merge" when merging pull requests to master, especially when the PR contains a "single unit" of work (i.e. if one were to rewrite the PR from scratch with all the fixes included, they would have one commit for the entire PR). This makes the commit history on master more readable and easier to debug in case of a breakage. When squashing a PR from multiple authors, please add Co-authored-by to give credit to all contributing authors. See Issue #2816 for more details.
Publishing a Release
These are the steps to take to publish a Toil release:- •
- Determine the release version X.Y.Z. This should follow semantic versioning; if user-workflow-breaking changes are made, X should be incremented, and Y and Z should be zero. If non-breaking changes are made but new functionality is added, X should remain the same as the last release, Y should be incremented, and Z should be zero. If only patches are released, X and Y should be the same as the last release and Z should be incremented.
- •
- If it does not exist already, create a release branch in the Toil repo named X.Y.x, where x is a literal lower-case "x". For patch releases, find the existing branch and make sure it is up to date with the patch commits that are to be released. They may be cherry-picked over from master.
- •
- On the release branch, edit version_template.py in the root of the repository. Find the line that looks like this (slightly different for patch releases):
baseVersion = 'X.Y.0a1'
Change it to read:
baseVersion = 'X.Y.Z'
- •
- Tag the current state of the release branch as releases/X.Y.Z.
- •
- Make the Github release here, referencing that tag. For a non-patch release, fill in the description with the changelog from the wiki page, which you should clear. For a patch release, just describe the patch.
- •
- For a non-patch release, set up the main branch so that development builds will declare themselves to be alpha versions of what the next release will probably be. Edit version_template.py in the root of the repository on the main branch to set baseVersion like this:
baseVersion = 'X.Y+1.0a1'
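The version arithmetic in the steps above can be sketched as follows. The function names are hypothetical helpers for illustration and are not part of Toil or its release tooling.

```python
def next_release(last, change):
    """Compute the next X.Y.Z from the last release and the kind of change.

    `change` is "breaking", "feature", or "patch".
    """
    x, y, z = (int(part) for part in last.split("."))
    if change == "breaking":
        return f"{x + 1}.0.0"
    if change == "feature":
        return f"{x}.{y + 1}.0"
    if change == "patch":
        return f"{x}.{y}.{z + 1}"
    raise ValueError(f"unknown change type: {change!r}")


def release_branch(version):
    """Name of the release branch: X.Y.x with a literal lower-case x."""
    x, y, _ = version.split(".")
    return f"{x}.{y}.x"


def next_dev_base_version(version):
    """baseVersion for the main branch after a non-patch release."""
    x, y, _ = version.split(".")
    return f"{x}.{int(y) + 1}.0a1"


print(next_release("5.9.0", "feature"))
print(release_branch("5.9.0"))
print(next_dev_base_version("5.9.0"))
```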
Using Git Hooks
In the contrib/hooks directory, there are two scripts, mypy-after-commit.py and mypy-before-push.py, that can be set up as Git hooks to make sure you don't accidentally push commits that would immediately fail type-checking. These are supposed to eliminate the need to run make mypy constantly. You can install them into your Git working copy like this:

    ln -rs ./contrib/hooks/mypy-after-commit.py .git/hooks/post-commit
    ln -rs ./contrib/hooks/mypy-before-push.py .git/hooks/pre-push
Adding Retries to a Function
See toil.lib.retry.

    from requests import get
    from requests.exceptions import HTTPError
    from toil.lib.retry import retry

    # Retry on any HTTPError.
    @retry(errors=[HTTPError])
    def update_my_wallpaper():
        return get('https://www.deviantart.com/')

    from requests import get
    from requests.exceptions import HTTPError
    from toil.lib.retry import retry

    # Retry on either of two error types.
    @retry(errors=[HTTPError, ValueError])
    def update_my_wallpaper():
        return get('https://www.deviantart.com/')

    from requests import get
    from requests.exceptions import HTTPError
    from toil.lib.retry import ErrorCondition, retry

    # Retry only on specific HTTP status codes.
    @retry(errors=[
        ErrorCondition(
            error=HTTPError,
            error_codes=[500, 502, 503, 504]
        )])
    def update_my_wallpaper():
        return get('https://www.deviantart.com/')

    from requests import get
    from requests.exceptions import HTTPError
    from toil.lib.retry import ErrorCondition, retry

    # Retry only when the error message contains a phrase.
    @retry(errors=[
        ErrorCondition(
            error=HTTPError,
            error_message_must_include="NotFound"
        )])
    def update_my_wallpaper():
        return get('https://www.deviantart.com/')

    from requests import get
    from requests.exceptions import HTTPError
    from toil.lib.retry import ErrorCondition, retry

    # Retry on HTTPError in general, but NOT when the message contains "NotFound".
    @retry(errors=[
        HTTPError,
        ErrorCondition(
            error=HTTPError,
            error_message_must_include="NotFound",
            retry_on_this_condition=False
        )])
    def update_my_wallpaper():
        return get('https://www.deviantart.com/')

    import boto3
    from botocore.exceptions import ClientError
    from toil.lib.retry import ErrorCondition, retry

    # Retry on a specific boto error code.
    @retry(errors=[
        ErrorCondition(
            error=ClientError,
            boto_error_codes=["BucketNotFound"]
        )])
    def boto_bucket(bucket_name):
        boto_session = boto3.session.Session()
        s3_resource = boto_session.resource('s3')
        return s3_resource.Bucket(bucket_name)
- 1.
- Retrying on a normal error, like a KeyError.
- 2.
- Retrying on HTTP error codes (use ErrorCondition).
- 3.
- Retrying on boto's specific status errors, like "BucketNotFound" (use ErrorCondition).
- 4.
- Retrying when an error message contains a certain phrase (use ErrorCondition).
- 5.
- Explicitly NOT retrying on a condition (use ErrorCondition).
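The first case in the list, retrying on a normal error such as a KeyError, can be illustrated without any network access. The decorator below is a simplified stand-in written for this example; it is not toil.lib.retry, and its name and its attempts and delay parameters are assumptions.

```python
import functools
import time


def simple_retry(errors, attempts=3, delay=0.0):
    """Retry the wrapped function when it raises one of `errors`.

    A simplified illustration only; it is not toil.lib.retry and supports
    none of its ErrorCondition features.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except tuple(errors):
                    if attempt == attempts:
                        raise  # out of attempts: surface the error
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"count": 0}


@simple_retry(errors=[KeyError])
def flaky_lookup():
    # Fails twice with KeyError, then succeeds on the third call.
    calls["count"] += 1
    if calls["count"] < 3:
        raise KeyError("not ready yet")
    return "ok"


result = flaky_lookup()
print(result, calls["count"])
```

Errors that are not listed propagate immediately, which is the behaviour one would want for failures that retrying cannot fix.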
PULL REQUEST CHECKLISTS
This document contains checklists for dealing with PRs. More general PR information is available at Pull Requests.
Reviewing Pull Requests
This checklist is to be kept in sync with the checklist in the pull request template.- •
- Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
- •
-
If it is coming from an external repo, make sure to pull it in for CI with:
contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
- •
-
If there is no associated issue, create one.
- •
- Read through the code changes. Make sure that it doesn't have:
- •
-
Addition of trailing whitespace.
- •
-
New variable or member names in camelCase that want to be in snake_case.
- •
-
New functions without type hints.
- •
-
New functions or classes without informative docstrings.
- •
-
Changes to semantics not reflected in the relevant docstrings.
- •
-
New or changed command line options for Toil workflows that are not reflected in docs/running/cliOptions.rst
- •
-
New features without tests.
- •
-
Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
- •
-
Finish the review with an overall description of your opinion.
Merging Pull Requests
This checklist is to be kept in sync with the checklist in the pull request template.- •
-
Make sure the PR passes tests.
- •
-
Make sure the PR has been reviewed since its last modification. If not, review it.
- •
- Merge with the Github "Squash and merge" feature.
- •
- If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
- •
-
Copy its recommended changelog entry to the Draft Changelog.
- •
-
Append the issue number in parentheses to the changelog entry.
TOIL ARCHITECTURE
The following diagram lays out the software architecture of Toil.
[Figure 1: The basic components of Toil's architecture: the leader, the job store, the worker processes, the batch system, the node provisioner, and the stats and logging monitor.]
- These components are described below:
- •
- the leader:
- The leader is responsible for deciding which jobs should be run. To do this it traverses the job graph. Currently this is a single-threaded process, but we take aggressive steps to prevent it from becoming a bottleneck (see Read-only Leader described below).
- •
- the job-store:
- Handles all files shared between the components. Files in the job-store are the means by which the state of the workflow is maintained. Each job is backed by a file in the job store, and atomic updates to this state are used to ensure the workflow can always be resumed upon failure. The job-store can also store all user files, allowing them to be shared between jobs. The job-store is defined by the AbstractJobStore class. Multiple implementations of this class allow Toil to support different back-end file stores, e.g.: S3, network file systems, Google file store, etc.
- •
- workers:
- The workers are temporary processes responsible for running jobs, one at a time per worker. Each worker process is invoked with a job argument that it is responsible for running. The worker monitors this job and reports back success or failure to the leader by editing the job's state in the job-store. If the job defines successor jobs the worker may choose to immediately run them (see Job Chaining below).
- •
- the batch-system:
- Responsible for scheduling the jobs given to it by the leader, creating a worker command for each job. The batch-system is defined by the AbstractBatchSystem class. Toil uses multiple existing batch systems to schedule jobs, including Apache Mesos, GridEngine and a multi-process single node implementation that allows workflows to be run without any of these frameworks. Toil can therefore fairly easily be made to run a workflow using an existing cluster.
- •
- the node provisioner:
- Creates worker nodes in which the batch system schedules workers. It is defined by the AbstractProvisioner class.
- •
- the statistics and logging monitor:
- Monitors logging and statistics produced by the workers and reports them. Uses the job-store to gather this information.
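How the leader, the workers and the job-store divide the work can be sketched with a temporary directory standing in for the job-store. This is a toy model under stated assumptions: the function names are hypothetical, a plain file per job replaces Toil's real job-store classes, and failure handling is reduced to leaving the job file in place.

```python
import os
import tempfile


def worker_run(job_store_dir, job_id, work):
    """Run one job; on success delete its file to mark it complete."""
    job_file = os.path.join(job_store_dir, job_id)
    try:
        work()
    except Exception:
        return  # leave the job file in place so the failure is visible
    os.remove(job_file)


def leader_job_is_done(job_store_dir, job_id):
    """The leader only reads: a missing job file means the job finished."""
    return not os.path.exists(os.path.join(job_store_dir, job_id))


def failing_work():
    raise RuntimeError("simulated job failure")


with tempfile.TemporaryDirectory() as job_store:
    for job_id in ("job-a", "job-b"):
        # Each job is backed by a file in the job store.
        with open(os.path.join(job_store, job_id), "w") as handle:
            handle.write("pending")

    worker_run(job_store, "job-a", lambda: None)
    worker_run(job_store, "job-b", failing_work)

    status = {job: leader_job_is_done(job_store, job) for job in ("job-a", "job-b")}

print(status)
```

The point of the design is that the leader never writes job state here; all updates come from the worker that ran the job.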
Jobs and JobDescriptions
As noted in Job Basics, a job is the atomic unit of work in a Toil workflow. User scripts inherit from the Job class to define units of work. These jobs are pickled and stored in the job-store by the leader, and are retrieved and un-pickled by the worker when they are scheduled to run.
jobName | Name of the kind of job this is. This may be used in job store IDs and logging. Also used to let the cluster scaler learn a model for how long the job will take. Defaults to the job class's name if no real user-defined name is available. For a FunctionWrappingJob, the jobName is replaced by the wrapped function's name. For a CWL workflow, the jobName is the class name of the internal job that is running the CWL workflow, such as "CWLJob". |
unitName | Name of this instance of this kind of job. If set by the user, it will appear with the jobName in logging. For a CWL workflow, the unitName is set to a descriptive name that includes the CWL file name and the ID in the file if set. |
displayName | A human-readable name to identify this particular job instance. Used as an identifier of the job class in the stats report. Defaults to the job class's name if no real user-defined name is available. For a CWL workflow, the displayName is the absolute workflow URI. |
Optimizations
Toil implements lots of optimizations designed for scalability. Here we detail some of the key optimizations.
Read-only leader
The leader process is currently implemented as a single thread. Most of the leader's tasks revolve around processing the state of jobs, each stored as a file within the job-store. To minimise the load on this thread, each worker does as much work as possible to manage the state of the job it is running. As a result, with a couple of minor exceptions, the leader process never needs to write or update the state of a job within the job-store. For example, when a job is complete and has no further successors the responsible worker deletes the job from the job-store, marking it complete. The leader then only has to check for the existence of the file when it receives a signal from the batch-system to know that the job is complete. This off-loading of state management is orthogonal to future parallelization of the leader.
Job chaining
The scheduling of successor jobs is partially managed by the worker, reducing the number of individual jobs the leader needs to process. Currently this is very simple: if there is a single next successor job to run, and its resources fit within and closely match the resources of the current job, then the successor is run immediately on the worker without returning to the leader. Further extensions of this strategy are possible, but for many workflows which define a series of serial successors (e.g. map sequencing reads, post-process mapped reads, etc.) this pattern is very effective at reducing leader workload.
Preemptable node support
Critical to running at large scale is dealing with intermittent node failures. Toil is therefore designed to always be resumable, provided the job-store does not become corrupt. This robustness allows Toil to run on preemptable nodes, which are only available when others are not willing to pay more to use them. Designing workflows that divide into many short individual jobs that can use preemptable nodes allows for workflows to be efficiently scheduled and executed.
Caching
Running bioinformatic pipelines often requires passing large datasets between jobs. Toil caches the results from jobs such that child jobs running on the same node can directly use the same file objects, thereby eliminating the need for an intermediary transfer to the job store. Caching also reduces the burden on the local disks, because multiple jobs can share a single file. The resulting drop in I/O allows pipelines to run faster, and, by the sharing of files, allows users to run more jobs in parallel by reducing overall disk requirements.
[Figure 2: Efficiency gain from caching. The lower half of each plot describes the disk used by the pipeline recorded every 10 minutes over the duration of the pipeline, and the upper half shows the corresponding stage of the pipeline that is being processed. Since jobs requesting the same file shared the same inode, the effective load on the disk is considerably lower than in the uncached case where every job downloads a personal copy of every file it needs. We see that in all cases, the uncached run uses almost 300-400GB more than the cached run in the resource heavy mutation calling step. We also see a benefit in terms of wall time for each stage since we eliminate the time taken for file transfers.]
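The caption's remark that jobs requesting the same file share the same inode can be demonstrated with hard links. This is a minimal sketch, assuming a filesystem that supports hard links. Toil's actual caching implementation is more involved, and the file and directory names here are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as node_disk:
    # One cached copy of a file on the node's local disk.
    cached = os.path.join(node_disk, "cache-reference.fa")
    with open(cached, "wb") as handle:
        handle.write(b"ACGT" * 1024)

    job_copies = []
    for job in ("job-1", "job-2"):
        job_dir = os.path.join(node_disk, job)
        os.mkdir(job_dir)
        local = os.path.join(job_dir, "reference.fa")
        os.link(cached, local)  # hard link: no data is copied
        job_copies.append(local)

    # All three paths resolve to a single inode, so the data is stored once.
    inodes = {os.stat(path).st_ino for path in [cached] + job_copies}
    link_count = os.stat(cached).st_nlink

print(len(inodes), link_count)
```

Each job sees what looks like a private file in its own directory, while the disk holds only one copy of the data.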
Toil support for Common Workflow Language
The CWL document and input document are loaded using the 'cwltool.load_tool' module. This performs normalization and URI expansion (for example, relative file references are turned into absolute file URIs), validates the document against the CWL schema, initializes Python objects corresponding to major document elements (command line tools, workflows, workflow steps), and performs static type checking that sources and sinks have compatible types.
MINIMUM AWS IAM PERMISSIONS
Toil requires at least the following permissions in an IAM role to operate on a cluster. These are added by default when launching a cluster. However, ensure that they are present if creating a custom IAM role when launching a cluster with the --awsEc2ProfileArn parameter.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:*",
            "s3:*",
            "sdb:*",
            "iam:PassRole"
          ],
          "Resource": "*"
        }
      ]
    }
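When writing a custom IAM role, a policy document can be checked against the four actions listed above before it is used. This is a minimal sketch with a hypothetical helper name. It compares action strings literally and does not evaluate broader wildcards, Deny statements or resource scoping.

```python
import json

# The actions named in the minimum policy above.
REQUIRED_ACTIONS = {"ec2:*", "s3:*", "sdb:*", "iam:PassRole"}


def missing_actions(policy_json):
    """Return the required actions not listed in any Allow statement."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement may be a bare object
    allowed = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]  # a single action may be a bare string
        allowed.update(actions)
    return REQUIRED_ACTIONS - allowed


# Hypothetical custom policy that forgot two of the actions.
custom_policy = """
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["ec2:*", "s3:*"], "Resource": "*"}
  ]
}
"""
print(sorted(missing_actions(custom_policy)))
```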
AUTO-DEPLOYMENT
If you want to run your workflow in a distributed environment, on multiple worker machines, either in the cloud or on a bare-metal cluster, your script needs to be made available to those other machines. If your script imports other modules, those modules also need to be made available on the workers. Toil can automatically do that for you, with a little help on your part. We call this feature auto-deployment of a workflow.

    $ virtualenv --system-site-packages venv
    $ . venv/bin/activate

    $ tree
    .
    ├── util
    │   ├── __init__.py
    │   └── sort
    │       ├── __init__.py
    │       └── quick.py
    └── workflow
        ├── __init__.py
        └── main.py

    3 directories, 5 files
    $ pip install matplotlib
    $ cp -R workflow util venv/lib/python2.7/site-packages

    $ tree
    .
    ├── util
    │   ├── __init__.py
    │   └── sort
    │       ├── __init__.py
    │       └── quick.py
    ├── workflow
    │   ├── __init__.py
    │   └── main.py
    └── setup.py

    3 directories, 6 files
    $ pip install .

    $ pip install my-project

    $ python3 main.py --batchSystem=mesos …
If the workflow's external dependencies contain
native code (i.e. are not pure Python), they must be manually installed on
each worker.
Neither python3 setup.py develop nor
pip install -e . can be used in this process as, instead of copying the
source files, they create .egg-link files that Toil can't auto-deploy.
Similarly, python3 setup.py install doesn't work either as it installs
the project as a Python .egg which is also not currently supported by
Toil (though it could be in the future).
Also note that using the --single-version-externally-managed flag with
setup.py will prevent the installation of your package as an
.egg. It will also disable the automatic installation of your project's
dependencies.
Auto Deployment with Sibling Modules
This scenario applies if the user script imports modules that are its siblings:

    $ cd my_project
    $ ls
    userScript.py utilities.py
    $ ./userScript.py --batchSystem=mesos …
Auto-Deploying a Package Hierarchy
Recall that in Python, a package is a directory containing one or more .py files, one of which must be called __init__.py, and optionally other packages. For more involved workflows that contain a significant amount of code, this is the recommended way of organizing the source code. Because we use a package hierarchy, we can't really refer to the user script as such; we call it the user module instead. It is merely one of the modules in the package hierarchy. We need to inform Toil that we want to use a package hierarchy by invoking Python's -m option. That enables Toil to identify the entire set of modules belonging to the workflow and copy all of them to each worker. Note that while using the -m option is optional in the scenarios above, it is mandatory in this one.

    $ cd my_project
    $ tree
    .
    ├── utils
    │   ├── __init__.py
    │   └── sort
    │       ├── __init__.py
    │       └── quick.py
    └── workflow
        ├── __init__.py
        └── main.py

    3 directories, 5 files
    $ python3 -m workflow.main --batchSystem=mesos …

    $ cd my_project
    $ export PYTHONPATH="$PWD"
    $ cd /some/other/dir
    $ python3 -m workflow.main --batchSystem=mesos …
Relying on Shared Filesystems
Bare-metal clusters typically mount a shared file system like NFS on each node. If every node has that file system mounted at the same path, you can place your project on that shared filesystem and run your user script from there. Additionally, you can clone the Toil source tree into a directory on that shared file system and you won't even need to install Toil on every worker. Be sure to add both your project directory and the Toil clone to PYTHONPATH. Toil replicates PYTHONPATH from the leader to every worker.
- Using a shared filesystem
- Toil currently only supports a tempdir set to a local, non-shared directory.
Toil Appliance
The term Toil Appliance refers to the Mesos Docker image that Toil uses to simulate the machines in the virtual Mesos cluster. It's easily deployed, only needs Docker, and allows for workflows to be run in single-machine mode and for clusters of VMs to be provisioned. To specify a different image, see the Toil Environment Variables section. For more information on the Toil Appliance, see the Running in AWS section.
ENVIRONMENT VARIABLES
There are several environment variables that affect the way Toil runs.
TOIL_CHECK_ENV | A flag that determines whether Toil will try to refer back to a Python virtual environment in which it is installed when composing commands that may be run on other hosts. If set to True and Toil is installed in the current virtual environment, it will use absolute paths to its own executables (and the virtual environment must thus be available at the same path on all nodes). Otherwise, Toil internal commands such as _toil_worker will be resolved according to the PATH on the node where they are executed. This setting can be useful in a shared HPC environment, where users may have their own Toil installations in virtual environments. |
TOIL_WORKDIR | An absolute path to a directory where Toil will write its temporary files. This directory must exist on each worker node and may be set to a different value on each worker. The --workDir command line option overrides this. When using the Toil docker container, such as on Kubernetes, this defaults to /var/lib/toil. When using Toil autoscaling with Mesos, this is somewhere inside the Mesos sandbox. In all other cases, the system's standard temporary directory is used. |
TOIL_WORKDIR_OVERRIDE | An absolute path to a directory where Toil will write its temporary files. This overrides TOIL_WORKDIR and the --workDir command line option. |
TOIL_COORDINATION_DIR | An absolute path to a directory where Toil will write its lock files. This directory must exist on each worker node and may be set to a different value on each worker. The --coordinationDir command line option overrides this. |
TOIL_COORDINATION_DIR_OVERRIDE | An absolute path to a directory where Toil will write its lock files. This overrides TOIL_COORDINATION_DIR and the --coordinationDir command line option. |
TOIL_KUBERNETES_HOST_PATH | A path on Kubernetes hosts that will be mounted as the Toil work directory in the workers, to allow for shared caching. Will be created if it doesn't already exist. |
TOIL_KUBERNETES_OWNER | A name prefix for easy identification of Kubernetes jobs. If not set, Toil will use the current user name. |
TOIL_KUBERNETES_SERVICE_ACCOUNT | A service account name to apply when creating Kubernetes pods. |
TOIL_KUBERNETES_POD_TIMEOUT | Seconds to wait for a scheduled Kubernetes pod to start running. |
KUBE_WATCH_ENABLED | A boolean variable that lets users use the Kubernetes watch stream feature instead of polling for running jobs. Default value is set to False. |
TOIL_TES_ENDPOINT | URL to the TES server to run against when using the tes batch system. |
TOIL_TES_USER | Username to use with HTTP Basic Authentication to log into the TES server. |
TOIL_TES_PASSWORD | Password to use with HTTP Basic Authentication to log into the TES server. |
TOIL_TES_BEARER_TOKEN | Token to use to authenticate to the TES server. |
TOIL_APPLIANCE_SELF | The fully qualified reference for the Toil Appliance you wish to use, in the form REPO/IMAGE:TAG. quay.io/ucsc_cgl/toil:3.6.0 and cket/toil:3.5.0 are both examples of valid options. Note that since Docker defaults to Dockerhub repos, only quay.io repos need to specify their registry. |
TOIL_DOCKER_REGISTRY | The URL of the registry of the Toil Appliance image you wish to use. Docker will use Dockerhub by default, but the quay.io registry is also very popular and easily specifiable by setting this option to quay.io. |
TOIL_DOCKER_NAME | The name of the Toil Appliance image you wish to use. Generally this is simply toil but this option is provided to override this, since the image can be built with arbitrary names. |
TOIL_AWS_SECRET_NAME | For the Kubernetes batch system, the name of a Kubernetes secret which contains a credentials file granting access to AWS resources. Will be mounted as ~/.aws inside Kubernetes-managed Toil containers. Enables the AWSJobStore to be used with the Kubernetes batch system, if the credentials allow access to S3 and SimpleDB. |
TOIL_AWS_ZONE | Zone to use when using AWS. Also determines region. Overrides TOIL_AWS_REGION. |
TOIL_AWS_REGION | Region to use when using AWS. |
TOIL_AWS_AMI | ID of the AMI to use in node provisioning. If in doubt, don't set this variable. |
TOIL_AWS_NODE_DEBUG | Determines whether to preserve nodes that have failed health checks. If set to True, nodes that fail EC2 health checks won't immediately be terminated so they can be examined and the cause of failure determined. If any EC2 nodes are left behind in this manner, the security group will also be left behind by necessity as it cannot be deleted until all associated nodes have been terminated. |
TOIL_AWS_BATCH_QUEUE | Name or ARN of an AWS Batch Queue to use with the AWS Batch batch system. |
TOIL_AWS_BATCH_JOB_ROLE_ARN | ARN of an IAM role to run AWS Batch jobs as with the AWS Batch batch system. If the jobs are not run with an IAM role or on machines that have access to S3 and SimpleDB, the AWS job store will not be usable. |
TOIL_GOOGLE_PROJECTID | The Google project ID to use when generating Google job store names for tests or CWL workflows. |
TOIL_SLURM_ARGS | Arguments for sbatch for the slurm batch system. Do not pass CPU or memory specifications here. Instead, define resource requirements for the job. There is no default value for this variable. If neither --export nor --export-file is in the argument list, --export=ALL will be provided. |
TOIL_SLURM_PE | Name of the slurm partition to use for parallel jobs. There is no default value for this variable. |
TOIL_GRIDENGINE_ARGS | Arguments for qsub for the gridengine batch system. Do not pass CPU or memory specifications here. Instead, define resource requirements for the job. There is no default value for this variable. |
TOIL_GRIDENGINE_PE | Parallel environment arguments for qsub and for the gridengine batch system. There is no default value for this variable. |
TOIL_TORQUE_ARGS | Arguments for qsub for the Torque batch system. Do not pass CPU or memory specifications here. Instead, define extra parameters for the job such as queue. Example: -q medium Use TOIL_TORQUE_REQS to pass extra values for the -l resource requirements parameter. There is no default value for this variable. |
TOIL_TORQUE_REQS | Arguments for the resource requirements for Torque batch system. Do not pass CPU or memory specifications here. Instead, define extra resource requirements as a string that goes after the -l argument to qsub. Example: walltime=2:00:00,file=50gb There is no default value for this variable. |
TOIL_LSF_ARGS | Additional arguments for LSF's bsub command. Do not pass CPU or memory specifications here. Instead, define extra parameters for the job such as queue. Example: -q medium. There is no default value for this variable. |
TOIL_HTCONDOR_PARAMS | Additional parameters to include in the HTCondor submit file passed to condor_submit. Do not pass CPU or memory specifications here. Instead define extra parameters which may be required by HTCondor. This variable is parsed as a semicolon-separated string of parameter = value pairs. Example: requirements = TARGET.has_sse4_2 == true; accounting_group = test. There is no default value for this variable. |
TOIL_CUSTOM_DOCKER_INIT_COMMAND | Any custom bash command to run in the Toil docker container prior to running the Toil services. Can be used for any custom initialization in the worker and/or primary nodes such as private docker authentication. Example for AWS ECR: pip install awscli && eval $(aws ecr get-login --no-include-email --region us-east-1). |
TOIL_CUSTOM_INIT_COMMAND | Any custom bash command to run prior to starting the Toil appliance. Can be used for any custom initialization in the worker and/or primary nodes such as private docker authentication for the Toil appliance itself (i.e. from TOIL_APPLIANCE_SELF). |
TOIL_S3_HOST | the IP address or hostname to use for connecting to S3. Example: TOIL_S3_HOST=127.0.0.1 |
TOIL_S3_PORT | a port number to use for connecting to S3. Example: TOIL_S3_PORT=9001 |
TOIL_S3_USE_SSL | enable or disable the usage of SSL for connecting to S3 (True by default). Example: TOIL_S3_USE_SSL=False |
TOIL_WES_BROKER_URL | An optional broker URL to use to communicate between the WES server and Celery task queue. If unset, amqp://guest:guest@localhost:5672// is used. |
TOIL_WES_JOB_STORE_TYPE | Type of job store to use by default for workflows run via the WES server. Can be file, aws, or google. |
TOIL_OWNER_TAG | This will tag cloud resources with a tag reading: "Owner: $TOIL_OWNER_TAG". This is used internally at UCSC to stop a bot we have that terminates untagged resources. |
TOIL_AWS_PROFILE | The name of an AWS profile to run Toil with. |
TOIL_AWS_TAGS | This will tag cloud resources with any arbitrary tags given in a JSON format. These are overwritten in favor of CLI options when using launch cluster. For information on valid AWS tags, see AWS Tags. |
SINGULARITY_DOCKER_HUB_MIRROR | An http or https URL for the Singularity wrapper in the Toil Docker container to use as a mirror for Docker Hub. |
OMP_NUM_THREADS | The number of cores set for OpenMP applications in the workers. If not set, Toil will use the number of job threads. |
GUNICORN_CMD_ARGS | Specify additional Gunicorn configurations for the Toil WES server. See Gunicorn settings. |
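The table states a precedence among TOIL_WORKDIR_OVERRIDE, the --workDir option and TOIL_WORKDIR. A minimal sketch of that ordering follows. The function name and its arguments are hypothetical, this is not Toil's own resolution code, and the container and Mesos defaults mentioned in the table are left out.

```python
import tempfile


def resolve_work_dir(environ, cli_work_dir=None):
    """Pick the work directory using the precedence in the table above."""
    if environ.get("TOIL_WORKDIR_OVERRIDE"):
        # Overrides both --workDir and TOIL_WORKDIR.
        return environ["TOIL_WORKDIR_OVERRIDE"]
    if cli_work_dir:
        # --workDir overrides TOIL_WORKDIR.
        return cli_work_dir
    if environ.get("TOIL_WORKDIR"):
        return environ["TOIL_WORKDIR"]
    # Container and Mesos defaults are ignored in this sketch.
    return tempfile.gettempdir()


env = {"TOIL_WORKDIR": "/scratch/env", "TOIL_WORKDIR_OVERRIDE": "/scratch/override"}
print(resolve_work_dir(env, cli_work_dir="/scratch/cli"))
print(resolve_work_dir({"TOIL_WORKDIR": "/scratch/env"}, cli_work_dir="/scratch/cli"))
print(resolve_work_dir({"TOIL_WORKDIR": "/scratch/env"}))
```

The same ordering applies to TOIL_COORDINATION_DIR_OVERRIDE, --coordinationDir and TOIL_COORDINATION_DIR according to the table.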
AUTHOR
UCSC Computational Genomics Lab
COPYRIGHT
2023 – 2023 UCSC Computational Genomics Lab
November 20, 2023 | 5.9.0 |