Using GridWizard
Added by Neil Jones, last edited by Neil Jones on Apr 19, 2007

Note: do not forget to set GWIZ_HOME. This must be done both on the machine from which you submit jobs and on every resource that runs jobs (it is not necessary on data resources). The best place to do this is your ~/.bashrc file. The command should look like

export GWIZ_HOME=/home/me/gwiz

where me is replaced with the appropriate string.

Note: You must install GWiz (but not configure it!) on each of the compute resources you intend to use. If you skip this step, jobs will successfully schedule on remote resources, but won't run properly.

Introduction

This document describes how to use GridWizard. To do this, we will rely on a number of examples that demonstrate how to get work done; to that end, we will create a fictitious "grid" environment. This grid will consist of a cluster or maybe a couple of clusters (this doesn't matter so much), as well as an external file server on which you store your data. In our experience, most clusters don't have much storage space, so it's best to keep your data separate from the cluster itself; GWiz was built around that design principle. A feature planned for the next release candidate will do two things: first, allow you to use data on your local submit machine (e.g., a laptop) rather than a publicly available file server; and second, allow you to iterate over data stored on the cluster without going through the overhead of a staging file transfer.

Suppose your data is stored on host.com, which is a server that a bunch of people in your lab use to store data. The kind system administrator there has created a directory for you, called /opt/mri, since you'll mainly be analyzing MRI images. Within /opt/mri, you have created the following hierarchy:

/opt/mri
      `- subjects/
      |     s1.img
      |     s1.hdr
      |     s2.img
      |     s2.hdr
      |     s3.img
      |     s3.hdr
      `- atlas/
      |     a.img
      |     a.hdr
      `- misc/
      |     foo
      |     bar
      |     `- testdir/
      |          baz    
      `- results/

We will cover several ways of moving various chunks of the host.com data hierarchy over to a remote cluster, as well as ways of running computation on collections of files. For the purpose of this document, let's assume there's some hypothetical program, prog, which takes different types of inputs:

% prog --help
Usage
  prog [options]
where options is one of:
  -t #  -- an integer specifying a threshold
  -d #  -- an integer specifying a distance
  -i <file> -- a filename for input
  -o <file> -- a filename for output
Prog can be called with any, all, or none of its options.  It does nothing.

Syntax guide

The goal of the default configuration in GWiz is to give you as much flexibility in specifying a command as possible, so that you can launch many types of analyses directly from the command line. The result is that the syntax can get a little messy. In fact, the "syntax" can change entirely, depending on the default Task Generator that's listed in the config.xml file. The remainder of this page describes the syntax for TemplateTaskGenerator, which allows for more flexibility than most other task generators, but might not be what you need to make headway on your own problem. In that case, write your own, or use the BagOfTasksGenerator, which basically just takes in a literal list of commands and grinds away at them.

GWiz is (our attempt at) a modular framework for running jobs. By itself, this is not terribly useful without an application that enables you to use that framework. Thus, GWiz comes with one application, gwiz-run, with which you can access the framework from a command line. This particular application is a static scheduler: it figures out all the work that needs to be done, starts it running, and then exits. Many other applications are possible: a GUI that monitors and presents job state in a meaningful way; a daemon that polls a database; a command line daemon that comes with its own CLI. None of these are impossible, they just haven't yet been written. If a particular application sounds like it might be extraordinarily useful, please enter it as a "desirable feature" in JIRA, and vote on it.

So, within the confines of this static scheduling application, the basic idea is that you specify the command you want to run (say, tar) and you preface it with gwiz-run. However, as you might imagine, it gets somewhat more complicated.

Specifying an executable

The first string after gwiz-run is the executable that you will run on the remote cluster. Because multiple clusters may have different paths, what seems to work the best is to modify your PATH variable on each cluster so that the name of the program is sufficient to invoke it. For example, add /opt/BIRN/lddmm/1.0.1 to your path if you plan to use lddmm at BIRN, and then invoke it as gwiz-run lddmm <args>. Some applications are not statically compiled, and in that case you will need to also alter your LD_LIBRARY_PATH. This can be done from within the cluster.xml file (see the configuration documentation) or from within your .profile (or its equivalent) on each cluster.

Specifying arguments

Anything that comes after the executable is considered an argument (unless it is surrounded by [...], see below). Arguments may be defined in terms of other arguments, they may be composed of lists, or take on special values that are only known at runtime (eg, a random number, or a GUID, or the current date).

Lists

Many applications require you to set a parameter or two, perhaps a threshold or distance metric. In order to iterate over a number of different values, you can specify a list as start:end[:step]. For example, suppose your executable is the prog example from above. Executing bin/gwiz-run prog -t 1:4 will create 4 tasks:

prog -t 1
prog -t 2
prog -t 3
prog -t 4

These tasks will be scheduled and executed in parallel as your cluster allows. Similarly, bin/gwiz-run prog -t 1:4 -d 0:12:4 will generate

prog -t 1 -d 0
prog -t 2 -d 0
prog -t 3 -d 0
prog -t 4 -d 0
prog -t 1 -d 4
prog -t 2 -d 4
...
prog -t 4 -d 12

These tasks will be scheduled and executed in parallel as your cluster allows.
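
To make the expansion rule concrete, here is a rough Python sketch of how start:end[:step] lists could be turned into tasks. This is not GWiz's actual implementation; the names expand_range and expand_tasks are made up for illustration, and the assumption that the first list varies fastest is taken only from the listing above.

```python
from itertools import product


def expand_range(spec):
    """Expand 'start:end[:step]' into an inclusive list of integers."""
    parts = [int(p) for p in spec.split(":")]
    if len(parts) == 2:
        start, end = parts
        step = 1
    elif len(parts) == 3:
        start, end, step = parts
    else:
        raise ValueError(f"expected start:end[:step], got {spec!r}")
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return list(range(start, end + 1, step))


def expand_tasks(executable, options):
    """Build one command per combination of list values.

    options is a list of (flag, spec) pairs, e.g. [("-t", "1:4")].
    """
    flags = [flag for flag, _ in options]
    value_lists = [expand_range(spec) for _, spec in options]
    tasks = []
    # itertools.product varies its last input fastest, so reverse the
    # inputs to make the first option vary fastest (assumed ordering).
    for combo in product(*reversed(value_lists)):
        values = reversed(combo)
        args = " ".join(f"{flag} {value}" for flag, value in zip(flags, values))
        tasks.append(f"{executable} {args}")
    return tasks


if __name__ == "__main__":
    for task in expand_tasks("prog", [("-t", "1:4"), ("-d", "0:12:4")]):
        print(task)
```

Running the sketch prints the sixteen prog invocations, one per (threshold, distance) pair, which is the count you should expect from the second command above.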

By using a file URI glob, you can also iterate over collections of files. The command
bin/gwiz-run prog -i sftp://host.com/opt/mri/subjects/*.img will result in

prog -i sftp://host.com/opt/mri/subjects/s1.img
prog -i sftp://host.com/opt/mri/subjects/s2.img
prog -i sftp://host.com/opt/mri/subjects/s3.img
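
The glob expansion can be pictured with the following Python sketch. It assumes GWiz has already fetched a directory listing from the data host; the remote_listing argument and the function name expand_uri_glob are hypothetical stand-ins, not part of GWiz.

```python
import posixpath
from fnmatch import fnmatchcase


def expand_uri_glob(executable, flag, uri_pattern, remote_listing):
    """Return one command per remote file whose name matches the glob.

    remote_listing is a hypothetical list of file names found in the
    remote directory; how GWiz really obtains it is not shown here.
    """
    directory, pattern = posixpath.split(uri_pattern)
    matches = sorted(name for name in remote_listing if fnmatchcase(name, pattern))
    return [f"{executable} {flag} {directory}/{name}" for name in matches]


if __name__ == "__main__":
    listing = ["s1.img", "s1.hdr", "s2.img", "s2.hdr", "s3.img", "s3.hdr"]
    pattern = "sftp://host.com/opt/mri/subjects/*.img"
    for command in expand_uri_glob("prog", "-i", pattern, listing):
        print(command)
```

Note that only the .img files match, which is exactly why the .hdr files need the separate treatment discussed next.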

If you are experienced with the split-file method of storing images (Analyze or FITS formats, for example), you will realize that this is problematic, as the .hdr portion of the file is just as important as the .img portion. This is covered below under "Phantom transfers". For programs (like prog) that have absolutely no idea what to do with a URI like sftp://host.com/opt/mri/subjects/s1.img, use a file transfer designator (in, out, etc) to tell GWiz that you would like the file to be moved around automatically. This practice is described below.

Arguments built from other arguments

Programs often generate output files, and when running a large number of program invocations in parallel, you frequently want to distinguish the output files based on the value of other parameters. Continuing the example above, you may want to run multiple thresholds (-t) over the same input, but call the output something unique based on the threshold. In this case, bin/gwiz-run prog -t 1:4 -i input -o output.$(-t) would generate

prog -t 1 -i input -o output.1
prog -t 2 -i input -o output.2
prog -t 3 -i input -o output.3
prog -t 4 -i input -o output.4

In this case, the file names are considered to be local; the next section describes file handling in more detail.

The text between $(...) determines the variable that should be substituted. These variables are created based on a fairly standard analysis of the command line you feed in: anything that looks like an option and isn't a number can be referred to by its name or by its position; anything that isn't obviously an option is added as a positional argument. To refer to the third command-line argument, use $(#3). To refer to the value of the -t parameter, use $(-t). One can use a mixture of these methods.

There are some special variables you can refer to: $(iter) returns the current value of a counter; if you have an invocation of gwiz-run that results in 450 tasks to schedule, the value of $(iter) will be one of 1, 2, 3, ... 450, in order. $(rand) is a random number chosen at the start of gwiz-run; $(uuid) is a globally-unique identifier chosen at the start of gwiz-run.
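
As a rough illustration of the lookup rules, the Python sketch below resolves $(-name), $(#n) and $(iter) references in a list of arguments. It is only a guess at the behavior: the function name substitute is invented, positions are assumed to count every token after the executable starting from 1, and $(rand) and $(uuid) are left out.

```python
import re

# Matches $(name) and captures the text between the parentheses.
VARIABLE = re.compile(r"\$\(([^)]+)\)")


def _is_number(token):
    """True for tokens like '4' or '-0.5', which are values, not options."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def substitute(args, iteration=1):
    """Resolve $(...) references in args (the tokens after the executable)."""
    # Anything that looks like an option and is not a number can be
    # referred to by name; its value is assumed to be the next token.
    named = {}
    for index, token in enumerate(args[:-1]):
        if token.startswith("-") and not _is_number(token):
            named[token] = args[index + 1]

    def lookup(match):
        name = match.group(1)
        if name == "iter":
            return str(iteration)
        if name.startswith("#"):
            # Positional reference, 1-based (an assumption of this sketch).
            return args[int(name[1:]) - 1]
        return named[name]

    return [VARIABLE.sub(lookup, token) for token in args]


if __name__ == "__main__":
    print(substitute(["-t", "3", "-i", "input", "-o", "output.$(-t)"]))
```

A reference to an option that was never given raises KeyError in this sketch; what GWiz itself does in that case is not covered here.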

Further, you can use certain functions to make simple changes to values. Functions have an awkward syntax: $(function:arg1,arg2,...). For example, the base function returns the base filename of a parameter. Thus,

gwiz-run prog -i /some/directory/file.txt -o $(base:-i)

will yield

prog -i /some/directory/file.txt -o file.txt

Similarly, the dir function will yield the directory name of its argument. The function s does a regular expression substitution, so

gwiz-run prog -i /some/directory/file.txt -o $(s:-i,".txt",".dat")

returns

prog -i /some/directory/file.txt -o /some/directory/file.dat
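
The three functions behave roughly like the Python equivalents sketched below. These are stand-ins written for this page, not the GridWizard source, and the real functions may differ in edge cases. The name dir_ is used only because dir is a Python builtin.

```python
import posixpath
import re


def base(path):
    """Like $(base:...): the final component of the path."""
    return posixpath.basename(path)


def dir_(path):
    """Like $(dir:...): everything before the final component."""
    return posixpath.dirname(path)


def s(value, pattern, replacement):
    """Like $(s:...): regular expression substitution on the value."""
    return re.sub(pattern, replacement, value)


if __name__ == "__main__":
    path = "/some/directory/file.txt"
    print(base(path))
    print(dir_(path))
    print(s(path, ".txt", ".dat"))
```

Because the pattern is a regular expression, the dot in ".txt" matches any character, not just a literal period. In Python's regex dialect a pattern such as \.txt$ would be stricter; whether GWiz's regex dialect accepts the same escapes is an assumption you should check.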

Additional functions can be added fairly easily (within the GridWizard source code) as the need arises, so if a function doesn't exist that you would like, please suggest it in the JIRA bug tracker and vote for it.

File transfers

When scheduling many program executions on multiple clusters, one of the most annoying tasks is to handle data transfers. Accordingly, GWiz tries to handle this for you as much as it can, but doing so makes the syntax a little bit worse.

For any entity on the command line to be recognized as a file, it needs to be a properly-formatted URI (uniform resource identifier). These will generally look like protocol://host/path; the protocols currently supported are sftp and http, though more can be added as the need arises. For the purposes of specifying staging instructions, the protocol specifier is prepended with a direction, and a local path can be appended to the end, in the following way: direction:protocol://host/path>local. For example, in:sftp://host.com/opt/mri/subjects/s1.img>images/subject.img will stage in a file to the computation host; the file will be the contents of host.com:/opt/mri/subjects/s1.img and will be located at ./images/subject.img, relative to the execution root on the compute host.

In most cases, you will not need to specify a local path for file staging. It turns out that sftp://foo.bar/path/to/file is a perfectly valid path in unix, so files will by default come and go directly to the path described by their URL. This makes it somewhat easier for you to debug complicated program executions because you can see what got transferred, from where, and whether or not the correct file was moved in its entirety (by using md5sum). With local file names, you'll always have this lingering doubt that maybe you specified the local name a little strangely or that GWiz maybe screwed up.
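
The direction:protocol://host/path>local form can be taken apart as in the Python sketch below. The regular expression, the Staging tuple and the function parse_staging are all invented for illustration. The default local name is assumed to be the URI with its doubled slash collapsed, as the ./sftp:/host.com/... paths elsewhere in this document suggest.

```python
import posixpath
import re
from typing import NamedTuple, Optional

DIRECTIVE = re.compile(
    r"^(?:(inout|inbase|outbase|in|out):)?"  # optional direction
    r"([a-z][a-z0-9+.-]*://[^>]+)"           # protocol://host/path
    r"(?:>(.+))?$"                           # optional >local
)


class Staging(NamedTuple):
    direction: Optional[str]
    uri: str
    local: str


def parse_staging(text):
    """Split a staging directive into direction, URI and local path."""
    match = DIRECTIVE.match(text)
    if match is None:
        raise ValueError(f"not a staging directive: {text!r}")
    direction, uri, local = match.groups()
    if local is None:
        # Assumed default: the URI itself, used as a relative path.
        local = posixpath.normpath(uri)
    return Staging(direction, uri, local)


if __name__ == "__main__":
    print(parse_staging("in:sftp://host.com/opt/mri/subjects/s1.img>images/subject.img"))
    print(parse_staging("in:sftp://host.com/opt/mri/atlas/a.img"))
```

The local path is relative to the execution root on the compute host in both cases.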

in

The in direction instructs GWiz to stage a file from a data host to a compute host when the job runs. Since many protocols (sftp especially) require authentication, authentication and GWiz configuration information need to be moved over to the compute host to be used at a later point in time. Unfortunately, this opens an enormous security hole, so the authentication information is encrypted with a temporary key. These are the EncryptedAuthInfoArtifact and EncryptedConfigArtifact generators listed in the main config file. If you will be transferring data in and out using GWiz as the staging system, make sure your job managers have those artifact generators enabled, and also make sure that the inputFiles property for the Condor job manager (if you're crazy enough to use Condor) is properly set:

<cluster name="birn">
   ...
   <job-manager type="condor">
       <artifact-generator type="condor submit script" submittable="true">
          <properties>
             <name>inputFiles</name>
             <value>auth.xml.enc,config.xml.enc</value>
          </properties>
       </artifact-generator>
       <artifact-generator type="encrypted auth info">
          ....
       </artifact-generator> 
       ....
  </job-manager>
</cluster>

SGE and other PBS systems don't need the inputFiles property set.

out

The out designator informs GWiz that it needs to wait for the computation to finish, and then transfer the named file from the compute host to the named data host. The same configuration directives as above apply.

Note that you don't really need to do anything to make sure that directories exist prior to the file getting created. GWiz is not shy about creating directories for you.

inout

This was probably a bad idea on my part, but some programs are crazy and modify their input files. In these cases, the inout parameter allows a file to be staged in before the computation, then staged out after.

Phantom transfers

All of the above methods for listing command line parameters generated some sort of trace on the command line. That is, bin/gwiz-run prog -t 1:4 -i in:sftp://host.com/opt/mri/atlas/a.img converted itself into

prog -t 1 -i sftp://host.com/opt/mri/atlas/a.img
prog -t 2 -i sftp://host.com/opt/mri/atlas/a.img
prog -t 3 -i sftp://host.com/opt/mri/atlas/a.img
prog -t 4 -i sftp://host.com/opt/mri/atlas/a.img

(with the internal plumbing of GWiz handling the movement of a.img from host.com to the file ./sftp:/host.com/opt/mri/atlas/a.img relative to where the program prog is being started from.)

In some cases, programs expect pairs (or sets) of files to be present. For example, the LDDMM application operates on images that are composed of data (foo.img) and headers (foo.hdr). In the case of LDDMM, you specify the .img portion but the .hdr part of the file is assumed to exist in the same directory. Here, you need to specify a staging directive to GWiz but you don't want it to alter the command line. In this case, you use the [...] specifiers. For example

bin/gwiz-run prog -i in:sftp://host.com/opt/mri/subjects/s1.img 
  [in:sftp://host.com/opt/mri/subjects/s1.hdr]

would transfer s1.img and s1.hdr (both into the relative directory ./sftp:/host.com/opt/mri/subjects) and run prog on them.
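
In other words, anything wrapped in [...] is a staging instruction that never reaches the program. A minimal Python sketch of that separation, with the made-up function name split_phantoms, might look like this:

```python
def split_phantoms(tokens):
    """Separate visible arguments from [...] phantom directives.

    Returns (visible, phantoms); phantoms have their brackets removed.
    """
    visible, phantoms = [], []
    for token in tokens:
        if token.startswith("[") and token.endswith("]"):
            phantoms.append(token[1:-1])
        else:
            visible.append(token)
    return visible, phantoms


if __name__ == "__main__":
    tokens = [
        "-i",
        "in:sftp://host.com/opt/mri/subjects/s1.img",
        "[in:sftp://host.com/opt/mri/subjects/s1.hdr]",
    ]
    visible, phantoms = split_phantoms(tokens)
    print("command line:", visible)
    print("staged only: ", phantoms)
```

The visible list is what prog ends up seeing; the phantom list is handled purely by the staging machinery.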

Files can be combined with the tricks from above, so

bin/gwiz-run prog -t 1:4 
  -i in:sftp://host.com/opt/mri/subjects/*.img 
  -o out:$(dir:-i)/$(-t)/$(base:-i).out 
[in:$(s:-i,".img",".hdr")]

would expand to

prog -t 1 -i ./sftp:/host.com/opt/mri/subjects/s1.img -o ./sftp:/host.com/opt/mri/subjects/1/s1.out
prog -t 1 -i ./sftp:/host.com/opt/mri/subjects/s2.img -o ./sftp:/host.com/opt/mri/subjects/1/s2.out
prog -t 1 -i ./sftp:/host.com/opt/mri/subjects/s3.img -o ./sftp:/host.com/opt/mri/subjects/1/s3.out
prog -t 2 -i ./sftp:/host.com/opt/mri/subjects/s1.img -o ./sftp:/host.com/opt/mri/subjects/2/s1.out
prog -t 2 -i ./sftp:/host.com/opt/mri/subjects/s2.img -o ./sftp:/host.com/opt/mri/subjects/2/s2.out
prog -t 2 -i ./sftp:/host.com/opt/mri/subjects/s3.img -o ./sftp:/host.com/opt/mri/subjects/2/s3.out
...

The s?.img and s?.hdr files will automatically be transferred in before the computation, and the ?/s?.out files will be transferred back to the data host once the computation is complete.

outbase

Many image processing programs run on a single input and generate a host of output files in some output directory. The outbase designator informs GWiz that all the files that match the given URI need to be transferred from the compute host back to the data host. For example, [outbase:sftp://host.com/foo/bar/baz] will transfer any file appearing in ./sftp:/host.com/foo/bar that matches baz*.

The outbase designator can only be used as a phantom. The outbase designator cannot currently be used to transfer files recursively, though we are working on this problem.
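
The selection rule for outbase can be sketched in Python as follows. The function outbase_matches and the list of local files are hypothetical, and the local directory name assumes the default URI-as-path layout described earlier.

```python
import posixpath


def outbase_matches(outbase_uri, local_files):
    """Pick the local files an outbase directive would send back.

    A file qualifies when it sits directly in the directory named by
    the URI (no recursion) and its name starts with the URI's last part.
    """
    local_prefix = posixpath.normpath(outbase_uri)
    directory = posixpath.dirname(local_prefix)
    return [
        path
        for path in local_files
        if path.startswith(local_prefix) and posixpath.dirname(path) == directory
    ]


if __name__ == "__main__":
    files = [
        "sftp:/host.com/foo/bar/baz.log",
        "sftp:/host.com/foo/bar/baz2.img",
        "sftp:/host.com/foo/bar/other.txt",
        "sftp:/host.com/foo/bar/bazdir/deep.txt",
    ]
    for path in outbase_matches("sftp://host.com/foo/bar/baz", files):
        print(path)
```

The file under bazdir/ is skipped, which mirrors the current lack of recursive transfers.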

inbase

Files can be staged in bulk from the remote data host by using the inbase specifier. Like outbase, this can only be done within a phantom transfer, but unlike outbase, it can be used to transfer files recursively by including a double asterisk: [inbase:sftp://host.com/opt/mri/misc/**].

Example commands

In the following examples, I will use the sftp protocol to transfer files, and the hypothetical host host.com as the source and sink for data. Computations will get run on whatever cluster(s) you have defined in your cluster.xml configuration file.

Computing the md5sum of a large number of files

bin/gwiz-run \
   md5sum \
     in:sftp://host.com/opt/mri/subjects/*.img \
     > out:$(#1).md5

Results: host.com will have a blah.img.md5 file in /opt/mri/subjects for each image blah. This is a bit different than usual, and is a good example of why GWiz seems so screwed up at first. Let's assume that you have three images in that directory: s1.img, s2.img, and s3.img.

If you typed md5sum /opt/mri/subjects/*.img from a command prompt, what will actually happen is that the shell will expand this out to

md5sum /opt/mri/subjects/s1.img /opt/mri/subjects/s2.img /opt/mri/subjects/s3.img

This means that the md5sum program gets executed one time on all three images, with the result being that the command is not itself parallelized. Instead, GWiz takes this glob and turns it into

md5sum sftp://host.com/opt/mri/subjects/s1.img
md5sum sftp://host.com/opt/mri/subjects/s2.img
md5sum sftp://host.com/opt/mri/subjects/s3.img

Each of these commands is then handled as an independent task and is scheduled according to whatever policy you chose in your configuration file.

Tarring some input files

bin/gwiz-run \
   tar cvfz \
     out:sftp://host.com/opt/mri/misc/file.tgz>file.tgz \
     in:sftp://host.com/opt/mri/misc/foo>testdir/foo \
     in:sftp://host.com/opt/mri/misc/bar>testdir/bar

Result shows up in sftp://host.com/opt/mri/misc/file.tgz. Untarring this will result in ./testdir/foo and ./testdir/bar.

Why all the ">"s? When files get staged in, by default they get a local name of the URL you entered (it's a valid filename!), so they will be in the sftp: directory of the GWIZ_WORK_DIR. By using the > you define the local path. Note that this feature is necessary for tar for two reasons: first, tar will try to interpret the sftp: itself, so you can't use the default local name. Second, you probably want reasonable names within your tar archive, so you want the layout of the files in the archive to not necessarily mirror the layout of the files on the remote data host.

Tarring all files in a directory

bin/gwiz-run \
   tar cvfz \
     out:sftp://host.com/opt/mri/misc/file.tgz>file.tgz \
     testdir \
     [inbase:sftp://host.com/opt/mri/misc/**>testdir/]

Running nonrigid registration

bin/gwiz-run nreg \
  /home/me/atlas/icbm.hdr \
  in:sftp://host.com/opt/mri/subjects/*.hdr \
  -dofout out:sftp://host.com/opt/mri/nonrigid/$(base:#2).dofout \
  -ds 20 \
  -parameter /home/me/atlas/parameterNr \
[in:$(s:#2,".hdr",".img")]

This example iterates over all the .hdr files in a directory on host.com, using both command line substitutions for parameters and so-called phantom transfers. The command runs nreg, which takes as input a pre-staged atlas file (atlas/icbm.hdr) and performs the nonrigid registration against each of the images on host.com. Before each calculation is performed, a file is staged in (without modifying the command line) that has the .img extension instead of the .hdr extension; if the input file specified as the second command line argument is opt/mri/subjects/s1.hdr, then the file opt/mri/subjects/s1.img is also transferred. When working with MRI or FITS images where headers and data are stored separately, this feature is quite useful.

The portions of the command line that look like $(...) are variable substitutions. Each argument on the command line has a name, as described above. Further, one can call certain functions on templated arguments---the functions supported so far are described above.

Note that the entire ITK nonrigid registration toolkit relies on a separate library, fltk, being compiled and loadable. For this reason, you will need to modify your LD_LIBRARY_PATH prior to running this example.

Running LDDMM in parallel

This may require you to set an X509_USER_PROXY which requires you to log in to MyProxy first. More later.

This needs to be updated to include the GridWizard GUI. Also, changes to gwiz-run, along with gwiz-kill, gwiz-state, gwiz-notify.

Posted by Neil Jones at Jul 03, 2007 08:37