Compute Canada's Graham cluster is a new HPC system at the University of Waterloo. While the Niagara system is a more direct replacement for SciNet, we're moving to Graham instead because it has better support for our workloads. In general, the HPF grid is slightly more convenient for typical uses, but Graham has some advantages (in particular, it's easier to get collaborators a Graham account).
Unlike SciNet, you don't need to apply for a separate account after getting your Compute Canada account.
Log in over SSH with your Compute Canada username and your Compute Canada web password.
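As a rough sketch of the login step, a small helper for your local shell might look like the following. The hostname graham.computecanada.ca is my assumption (check the Graham wiki), and graham_login, CC_USER, GRAHAM_HOST and DRY_RUN are hypothetical names invented for this example.

```shell
# Open an SSH session on Graham.
# CC_USER: your Compute Canada username ("user" is only a placeholder).
# GRAHAM_HOST: assumed login hostname; verify it against the Graham wiki.
# DRY_RUN: if set to a non-empty value, print the command instead of running it.
graham_login() {
    ${DRY_RUN:+echo} ssh "${CC_USER:-user}@${GRAHAM_HOST:-graham.computecanada.ca}"
}
```

Running `DRY_RUN=1 CC_USER=alice graham_login` prints the ssh command without connecting, which is a cheap way to check the settings first.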
There's also no separation of login vs. dev/submit nodes, so you can submit jobs directly from the node you log in to.
Graham runs the Slurm scheduler (and the sbatch, ..., suite) just as at MICe; see, e.g., here. We have qbatch installed as a convenient interface.
The Graham admins ask that you not submit large numbers of separate jobs in a short time; instead, either use an array job or wait 1 s between submissions. Similarly, they ask that you not spam squeue from a script.
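If you do need to submit several separate job scripts, a throttled loop along these lines should keep the admins happy. This is only a sketch: submit_slowly, SUBMIT_CMD and SUBMIT_DELAY are hypothetical names, defaulting to sbatch and the 1 s pause mentioned above.

```shell
# Submit each job script in turn, pausing between submissions.
# SUBMIT_CMD: submission command (defaults to sbatch).
# SUBMIT_DELAY: seconds to wait after each submission (defaults to 1).
submit_slowly() {
    for script in "$@"; do
        # Stop at the first failed submission rather than carrying on.
        ${SUBMIT_CMD:-sbatch} "$script" || return 1
        sleep "${SUBMIT_DELAY:-1}"
    done
}
```

For example, `submit_slowly jobs/*.sh` submits every script in a (hypothetical) jobs directory. When the jobs differ only by an index, a single array job (`sbatch --array=...`, with each task reading `$SLURM_ARRAY_TASK_ID`) is probably the better choice.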
Add the following to your Graham
and then source this file or log in again.
If your PI has a Compute Canada RAC allocation, you'll want to use it to run your jobs to take advantage of the increased priority:
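A minimal sketch of the kind of settings meant here, assuming the standard Slurm input environment variables SBATCH_ACCOUNT and SALLOC_ACCOUNT; the account name is taken from the CCDB listing on this page, so substitute your own group's. Until the allocation is available on Graham, only the default account (def-jlerch) will be accepted.

```shell
# Account to charge jobs to; replace with your own group's allocation.
SLURM_ACCOUNT=rrg-jlerch-ac
export SLURM_ACCOUNT
# sbatch and salloc read these when no --account flag is given.
export SBATCH_ACCOUNT="$SLURM_ACCOUNT"
export SALLOC_ACCOUNT="$SLURM_ACCOUNT"
```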
However, it seems that Jason's existing compute allocation (iww-954-ad-004) is still attached to the SciNet GPC and hasn't been transferred to Graham:
From https://ccdb.computecanada.ca/ (My Account -> Account Details) one sees:
| iww-954-aa | def-jlerch | Default Resource Allocation Project | 2 allocations |
| iww-954-ad | rrg-jlerch-ac | Using Medical Imaging to Understand the Relationship Between Genetics, Development and Disease | 5 allocations, RAC 2015 |
The idea here is that jobs are submitted by default under the allocation of group def-jlerch. To submit with our non-default allocation, we'd want to pass --account=rrg-jlerch-ac to sbatch or, equivalently, set the corresponding *_ACCOUNT environment variables (much more convenient, since Pydpiper is unaware of any Slurm-related details). However, this currently gives an error ("Please use one of the following accounts: RAS default accounts: def-jlerch"), which agrees with the fact that the CCDB portal shows iww-954-ad-004 assigned to the GPC.
Running a Screen session
SSH sessions from MICe to Graham seem to be randomly dropped. For this and other reliability reasons, you may wish to run a more permanent interactive process - i.e., a terminal multiplexer, likely either screen or tmux - on a Graham login node. (See the Pydpiper on HPF page for more details.) If a subsequent SSH connection to Graham is routed to a different node than the one running your multiplexer, simply connect from the former to the latter (e.g., '[gra-login-1 ~]$ ssh gra-login-3').
(Note that due to the non-interactive mode of running Pydpiper on Graham, this won't cause your pipeline to stop.)
Unlike HPF, the login nodes are not intended for interactive work such as running CPU-intensive statistical models; instead, use salloc to get an interactive session (for up to 24:00:00) as per the Graham wiki. I recommend doing this from inside a screen/tmux session (see above), since SSH connections are currently being dropped.
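As a sketch, an interactive request could be wrapped like this. The resource values (one core, 4G of memory, a 3 hr default walltime) are placeholders I picked for illustration, and graham_interactive and DRY_RUN are hypothetical names; --time, --account, --cpus-per-task and --mem are standard salloc options.

```shell
# Request an interactive session on a compute node.
# $1: walltime as HH:MM:SS (at most 24:00:00 on Graham); defaults to 3 hr.
# SLURM_ACCOUNT: account to charge; falls back to the default group.
# DRY_RUN: if set to a non-empty value, print the command instead of running it.
graham_interactive() {
    walltime="${1:-03:00:00}"
    ${DRY_RUN:+echo} salloc --time="$walltime" \
        --account="${SLURM_ACCOUNT:-def-jlerch}" \
        --cpus-per-task=1 --mem=4G
}
```

Calling `graham_interactive 08:00:00` from inside screen/tmux means a dropped SSH connection shouldn't take the allocation down with it.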
(Currently missing: RMINC, visual tools.)
Of course, you may load a specific version of this module, e.g., to fix software versions for a specific analysis.
Data transfer is broadly similar to SciNet, except that there's less need to wrap rsync commands in a loop since there's no hard CPU time limit on the login nodes.
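A single rsync invocation along these lines is probably enough for most transfers. This is a sketch only: the hostname is my assumption (check the Graham wiki), the paths in the usage line are invented, and push_to_graham, CC_USER, GRAHAM_HOST and DRY_RUN are hypothetical names.

```shell
# Copy a local directory to Graham in one rsync call.
# $1: local source path; $2: destination path on Graham.
# --partial keeps partially transferred files so a rerun can resume them.
# DRY_RUN: if set to a non-empty value, print the command instead of running it.
push_to_graham() {
    src="$1"
    dest="$2"
    ${DRY_RUN:+echo} rsync -av --partial "$src" \
        "${CC_USER:-user}@${GRAHAM_HOST:-graham.computecanada.ca}:$dest"
}
```

For example, `push_to_graham my-scans/ /scratch/user/my-scans/`. If the connection drops, rerunning the same command picks up where it left off.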
Your main disk spaces are $HOME, $SCRATCH, and $PROJECT. $HOME is small and should be used mostly for text files, etc. $SCRATCH should be used for running pipelines. It is not backed up and inactive files are deleted regularly (you will receive email prior to any deletion). $PROJECT can be used for storing completed pipelines and analyses.
Writing to $PROJECT
Note that while $SCRATCH has a large per-user quota, the $PROJECT allocation is per-group. As such, you should always write files with the group given by your allocation (e.g., def-jlerch), or you'll fill up your tiny (2MB!) personal $PROJECT quota. Thus, either chgrp files before moving them to $PROJECT or, if creating files, set your active group beforehand so that they are automatically created with the appropriate group ownership. (Note: don't add the group-setting command to your ~/.bashrc - it breaks it somehow.)
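The chgrp-then-move route can be sketched as a small helper. archive_to_project is a hypothetical name, and the group defaults to the example allocation group used on this page.

```shell
# Re-group a finished pipeline directory, then move it into $PROJECT.
# $1: directory to archive; $2: destination directory; $3: group (optional).
archive_to_project() {
    src="$1"
    dest="$2"
    group="${3:-def-jlerch}"
    # Only move if the group change succeeded, so nothing lands in
    # $PROJECT under your personal group by accident.
    chgrp -R "$group" "$src" && mv "$src" "$dest"
}
```

For example, `archive_to_project "$SCRATCH/my-pipeline" "$PROJECT"` (with an invented directory name) moves the whole pipeline directory over with group ownership already fixed.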
Since long-running jobs such as Pydpiper servers are discouraged on the login nodes, we'll submit the Pydpiper server itself to a compute node. The salloc command seems to have a 24hr time limit, so we'll submit it as a non-interactive job, e.g., using qbatch (the maximum walltime allowed is 672hr, which should be more than enough for most pipelines; try to choose a more reasonable limit):
The server is submitted as a job to the queue; once launched, it will submit additional executors itself as needed.
Monitoring your pipeline
As usual, information is written to the pipeline.log file. The server's standard output, which is normally visible in your terminal, is redirected to logs/slurm-STDOUT-<jobid>.out. If you like, you can redirect it to a different location by submitting with echo 'MBM.py ... > some-file.txt' | qbatch ... ; note that the output will be buffered by the OS, so you won't always see things immediately as they are written out.
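To avoid looking up the job id by hand, something like the following can locate the newest server log. It's a sketch that assumes the logs/slurm-STDOUT-<jobid>.out naming described above; latest_server_log is a hypothetical helper name.

```shell
# Print the path of the most recently modified server stdout log, if any.
# $1: log directory (defaults to ./logs).
latest_server_log() {
    ls -t "${1:-logs}"/slurm-STDOUT-*.out 2>/dev/null | head -n 1
}
```

You can then follow the log with `tail -f "$(latest_server_log)"`, bearing in mind that buffering may delay what you see.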
TODO: are the default stdout logs visible in real time? Can you make Pyro (TCP) connections to the compute nodes via check_pipeline_status.py?
Acknowledging Graham/Compute Canada
See here; please edit if you find a paper for Graham (as there is for SciNet).