SciNet is being decommissioned. It is no longer possible to submit compute jobs, and disk access will be available only until May 9th, 2018. To run MICe tools on Compute Canada systems, see the Pydpiper on Graham page.
Most of the information here not specific to Pydpiper is condensed from the detailed SciNet wiki.
- Apply for a Compute Canada account.
- Apply for a SciNet account. (You need to be logged in using your Compute Canada account to see this page.) This may take several days to be approved.
SSH into SciNet (you may wish to generate SSH keys and run ssh-add to streamline the login process; see these instructions):
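For example (USER is a placeholder for your Compute Canada username; the hostname below assumes the standard SciNet login address):

```
ssh -X USER@login.scinet.utoronto.ca
```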
(The -X flag is for X windows forwarding, and may be omitted.)
Don't set paths or load modules in your ~/.bashrc. It's sourced by the scripts submitted to the remote machines, so doing so will eventually cause your pipelines to fail in mysterious ways.
Running a job
Once logged in, SSH from your current login node to one of eight devel nodes:
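For example (the node number is arbitrary):

```
ssh gpc04
```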
This is effectively ssh gpc0N, where N is from 1 to 8, depending on node availability.
Modules are a bit finicky. If you get weird errors here, type `module purge` and try again. Also try `module help` for some other commands.
In addition to loading prerequisite modules and putting Pydpiper scripts on your path, this will also set $PYDPIPER_CONFIG_FILE to point to a file specifying some SciNet-specific options. For instance:
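A sketch of the kind of options such a file might contain; all values here are illustrative, chosen from the queue flags mentioned below, not copied from the actual file:

```
# Illustrative only -- see the file $PYDPIPER_CONFIG_FILE points to for real values
queue-type=pbs
ppn=8
proc=8
mem=14
time=16:00:00
```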
You can use command-line flags to override these defaults.
Run your Pydpiper command as usual. This must be done in /scratch, since /home is not writeable from the cluster. Initial models are kept in:
You don't need to specify any queue options yourself (so you do not have to specify --proc, --mem, --ppn, --queue-name, --queue-type, etc.; see above regarding configuration). However, you should specify --time=HH:MM:SS, giving the total (sequential) time you expect your job to take, as well as --num-executors=N, where N is the number of brains in your study divided by 4 (for 56-micron data) or by 2 (for 40-micron data). The explanation is as follows: we launch one executor to manage each 8-CPU compute node, so you might expect one executor for every 8 brains, but memory constraints force us to register fewer brains per node. The current situation is a bit of a hack; in the future we might be able to guess an appropriate number of nodes to request automatically.
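The rule above amounts to dividing the brain count by the per-node process count and rounding up; a quick shell sketch (variable names are illustrative):

```shell
n_brains=20
per_node=4    # 4 processes per node for 56-micron data, 2 for 40-micron data
# Integer division, rounded up:
num_executors=$(( (n_brains + per_node - 1) / per_node ))
echo "$num_executors"    # prints 5
```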
The information below about time settings is somewhat dated and turned out not to work well for very large pipelines, revealing limitations of SciNet's suitability for use with our pipelines.
Note that Pydpiper executes pipelines on SciNet by running batches of one server plus a number of executors (you will specify the number of executors for your command using --num-executors=N). Because it is easier to get onto the SciNet compute machines when you request a smaller time slot, it's better not to request 48 hours for each job you have. For an MBM.py pipeline, for instance, the time requested should be a bit longer than the longest job in the pipeline (the last nonlinear stage). For a 56-micron mouse brain pipeline, that mincANTS stage takes about 6-7 hours; for 40-micron data it can take up to 9-10 hours. Currently the Pydpiper configuration file requests 16 hours for each server/executor. This is a bit longer than necessary, but far better than asking for 48 hours.
Some --time examples
Example 1: you have 20 56-micron mouse brains. The final mincANTS stages take about 3G of memory, so 4 processes can run on a compute node. The current config file runs the executors for 16 hours, and the final nonlinear stage takes about 8 hours, which should leave enough time to finish the rest of the pipeline. So we can submit 5 executors with 16 hours each:
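A hedged sketch of the corresponding command; the time value, input files, and remaining options are placeholders that depend on your study:

```
MBM.py --time=HH:MM:SS --num-executors=5 [other MBM.py options] img_*.mnc
```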
Size: 56 micron
What will be submitted:
job1: server/executor (16 hours)
job2, job3, job4, job5: executors tied to job1 (16 hours each, hopefully starting at approximately the same time)
Example 2: you have 20 40-micron mouse brains. The final mincANTS stages take about 7G of memory, so 2 processes can run on a compute node (we need 10 executors in this case). The current config file runs the executors for 16 hours, and the final nonlinear stage takes about 10 hours, which means we probably need 2 rounds of executors to finish all the other stages as well. Here is what we'll do:
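As before, a hedged sketch with placeholders for study-specific options; the two rounds of executors listed below are submitted on your behalf:

```
MBM.py --time=HH:MM:SS --num-executors=10 [other MBM.py options] img_*.mnc
```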
Size: 40 micron
What will be submitted:
job1: server/executor (16 hours)
job2, job3, ..., job10: executors tied to job1 (16 hours each)
job11: server/executor (16 hours)
job12, job13, ..., job20: executors tied to job11 (16 hours each)
Make sure to transfer your results to another filesystem - /scratch is not backed up and unused files there are deleted after 3 months, though you'll get an email before this happens.
For transferring small volumes of data (less than 10G), such as your input .mnc files, you can use rsync as usual:
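For instance (USER and the paths are placeholders; the hostname assumes the standard SciNet login address, and $SCRATCH in quotes is expanded on the remote side):

```
rsync -av my_scans/*.mnc USER@login.scinet.utoronto.ca:'$SCRATCH/my_study/'
```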
For larger volumes, SciNet requests that you transfer via the datamover nodes. However, the two datamover nodes are behind a firewall and inaccessible from the outside, so you must log into one of them from elsewhere on SciNet; you won't be able to transfer data from behind the Sick Kids firewall unless we expose a node to the outside. Instead, transfer data via the login nodes, relying on rsync's ability to resume interrupted transfers in the event that your transfer process runs afoul of the login nodes' 5-minute CPU time limit (follow the instructions linked above to generate and use SSH keys to avoid being repeatedly prompted for your password):
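One way to do this is to retry rsync in a loop, keeping partially transferred files so each attempt resumes where the last one stopped (USER and paths are placeholders):

```
# --partial keeps interrupted files so the next attempt resumes them
until rsync -av --partial my_data/ USER@login.scinet.utoronto.ca:'$SCRATCH/my_data/'
do
    sleep 30    # wait a little before resuming the transfer
done
```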
On SciNet, you have access to the directories $HOME, $SCRATCH, and $PROJECT (which you may need to create). The compute nodes have read-only access to $HOME and $PROJECT but can only write to $SCRATCH, so you should run pipelines in this directory. Here's an example of transferring some data there:
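A hedged sketch, run from a MICe machine (USER and the directory names are placeholders; $SCRATCH is expanded on the SciNet side):

```
rsync -av my_study/ USER@login.scinet.utoronto.ca:'$SCRATCH/my_study/'
```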
When transferring results back to MICe, you can use rsync flags to avoid transferring unnecessary files:
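Which files are unnecessary depends on your pipeline; a hedged sketch that skips temporary directories and log files (patterns and paths are illustrative):

```
rsync -av --exclude='*/tmp/' --exclude='*.log' \
    USER@login.scinet.utoronto.ca:'$SCRATCH/my_study/' my_study/
```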
You can check your disk quotas via the diskUsage command (available after module load extras, or directly in /scinet/gpc/bin6/). Note that we also have access to a large volume of tape storage on SciNet.
Also see https://support.scinet.utoronto.ca/wiki/index.php/Data_Management#Data_Transfer.
Monitoring your jobs
First and foremost, you can run the check_pipeline_status.py command when a server is actually running at SciNet, similar to what you'd do at MICe (first load the modules as discussed in running a job):
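For example, assuming the server has written its URI to a file named uri in the pipeline directory (the filename is Pydpiper's default but may differ in your setup):

```
check_pipeline_status.py uri
```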
If your pipeline is not running, however (e.g. jobs are idle/blocked in the queue), you can find out whether any stages have failed by running the following command:
A rough estimate of how far along your pipeline is when it's not currently running can be gotten by comparing the total number of stages in your pipeline and the number of finished stages:
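One hedged way to do both checks from the pipeline directory; the filenames and log layout below are assumptions about Pydpiper's conventions at the time, not verified defaults:

```
# Count the finished stages recorded by the server (filename is an assumption):
wc -l *_finished_stages

# List per-stage log files mentioning a failure (log path is an assumption):
grep -ril "failed" *_logs/
```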
You can use the following commands to look at your jobs in the queue:
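For example, using the Moab/Torque tools available on SciNet:

```
showq -u $USER    # your jobs as seen by the Moab scheduler
qstat -u $USER    # the Torque view of the same queue
```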
Using the showstart command, you can get an estimate of when your job might start running:
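A sketch of the invocation (the job ID is a placeholder; use one from the showq output):

```
showstart <jobid>
```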
To get detailed information on a job:
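For example (job ID again a placeholder):

```
checkjob <jobid>
```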
You can cancel jobs with qdel as usual, though SciNet recommends using the Moab canceljob command instead.
The March 2015 tech talk slides have many details on scheduling/allocation, monitoring jobs, disk quota, and various usage summaries, while the slightly older job monitoring tech talk has further details on monitoring running jobs, finding what nodes your jobs are on (also see the executor log files), SSHing into the compute nodes, finding the stderr and stdout of the jobs (although we now redirect stdout into a file), &c.