
Compute Canada's Graham cluster is a new HPC system at the University of Waterloo.  While the Niagara system is a more direct replacement of SciNet, we're moving to Graham instead as it has better support for our workloads.  In general, the HPF grid is slightly more convenient for typical uses, but Graham has some advantages (in particular, it's easier to get collaborators a Graham account).

For general references, see the Compute Canada wiki here and here.

Compute Canada account creation

If you don't have one already, apply for a Compute Canada account here.  Most people will need an identifier corresponding to their PI; contact Ben or Chris if you're associated with Jason.

Login

Unlike SciNet, you don't need to apply for a separate shell account after getting your Compute Canada account.
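Logging in is a plain SSH connection to the Graham endpoint (graham.computecanada.ca at the time of writing):

    ssh user@graham.computecanada.ca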

where user is your Compute Canada username and your password is your Compute Canada web password.

There's also no separation of login vs. dev/submit nodes, so you can submit jobs directly from the node you log in to.

Initial setup

Add the following to your Graham ~/.bashrc:
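(The exact contents depend on the lab's current setup; the lines below are a placeholder only - the module name is hypothetical, so ask around for what the lab actually uses.)

    # placeholder ~/.bashrc additions; the module name is hypothetical
    module load mice-env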

and then source this file or log in again.

Scheduler

Graham runs the Slurm scheduler (and its squeue, sbatch, etc. suite of tools) just as at MICe; see, e.g., here.  We have qbatch installed as a convenient interface.

Graham's admins request that you not submit large numbers of separate jobs in a short time; either use an array job or wait at least 1s between submissions.  Similarly, they request that you not poll 'squeue' repeatedly from a script.
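qbatch helps with the first request, since it bundles a file of commands into (chunked) array jobs.  A minimal sketch - the file name, walltime, and resources are illustrative, and the flags should be checked against qbatch --help for the installed version:

    # submit each line of joblist.txt as part of an array job
    qbatch --walltime 2:00:00 joblist.txt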

Special allocations

If your PI has a Compute Canada RAC allocation above the default, you'll need to tell the scheduler which allocation to use to run your jobs or you'll get an error from sbatch ("You are associated with multiple _cpu allocations...").  One way to do so is via environment variables:
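For example, to submit under the RAC allocation shown in the table below (substitute your own group name):

    # make sbatch, salloc, and srun use this allocation by default
    export SLURM_ACCOUNT=rrg-jlerch-ac
    export SBATCH_ACCOUNT=$SLURM_ACCOUNT
    export SALLOC_ACCOUNT=$SLURM_ACCOUNT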

You can also pass --account=... to sbatch directly, although the environment-variable approach is more generally useful since, e.g., Pydpiper is unaware of any Slurm-related details; there may also be a way to set a per-user default through the CCDB web portal.

Looking up allocation information

From https://ccdb.computecanada.ca/ (My Account -> Account Details) one sees:

RAPI       | Group Name    | Title                                                                                           | Allocations
iww-954-aa | def-jlerch    | Default Resource Allocation Project                                                             | 2 allocations
...        | ...           | ...                                                                                             | ...
iww-954-ad | rrg-jlerch-ac | Using Medical Imaging to Understand the Relationship Between Genetics, Development and Disease | 5 allocations, RAC 2015

The idea here is that jobs are submitted under the def-jlerch allocation by default; to run under the RAC allocation, specify rrg-jlerch-ac as described above.

Running a Screen session

SSH sessions from MICe to Graham seem to be dropped at random.  For this and other reliability reasons, you may wish to run a more permanent interactive process - i.e., a terminal multiplexer, likely either screen or tmux - on a Graham login node.  (See the Pydpiper on HPF page for more details.)  If a subsequent SSH connection to Graham is routed to a different login node than the one running your multiplexer, simply connect from the former to the latter (e.g., '[gra-login-1 ~]$ ssh gra-login-3').

(Note that due to the non-interactive mode of running Pydpiper on Graham, this won't cause your pipeline to stop.)
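A typical screen workflow on a login node (the session name is arbitrary):

    screen -S pipeline     # start a named session
    # ... start your long-running work, then detach with Ctrl-a d ...
    screen -ls             # later: list sessions on this login node
    screen -r pipeline     # reattach (from the same login node)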

Interactive use

Unlike HPF, the login nodes are not intended for interactive work such as running CPU-intensive statistical models; instead, use salloc to get an interactive session (for up to 24:00:00) as per the Graham wiki.  I recommend doing this from inside a screen/tmux session (see above) as at present there seems to be an issue with ssh connections being dropped.
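For example, something along these lines requests a few cores for a few hours (resource values are illustrative; add --account=... if you have multiple allocations):

    salloc --time=3:00:00 --cpus-per-task=4 --mem=8G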

Software
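Software on Graham is managed through Lmod modules; a generic sketch, where mice-env is a hypothetical placeholder for the lab's actual module name:

    module spider mice        # search for available modules
    module load mice-env      # load the default version
    module load mice-env/1.2  # or pin a specific version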

(Currently missing: RMINC, visual tools.)

Of course, you may load a specific version of this module, e.g., to fix software versions for a specific analysis.

Data transfer

Broadly similar to SciNet, except that there's no need to wrap rsync commands in a loop since there's no hard CPU time limit on the login nodes, though again you may benefit from running screen as per above.
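A typical transfer into your Graham scratch space might look like this (paths and username are illustrative):

    rsync -avP /path/to/local/data user@graham.computecanada.ca:/scratch/user/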

Disk space

Your main disk spaces are $HOME, $SCRATCH, and $PROJECT.  $HOME is small and should be used mostly for text files, etc. $SCRATCH should be used for running pipelines.  It is not backed up and inactive files are deleted regularly (you will receive email prior to any deletion).  $PROJECT can be used for storing completed pipelines and analyses.
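Compute Canada clusters also provide a quota-reporting tool (diskusage_report, if memory serves) that you can run on a login node to check current usage against these limits:

    diskusage_report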

Writing to $PROJECT

Note that while $SCRATCH has a large per-user quota, $PROJECT space is allocated per group.  As such, you should always write files there with the group of your allocation (e.g., def-jlerch), or you'll fill up your tiny (2MB!) personal $PROJECT quota.  Thus, either chgrp files before moving them to $PROJECT or, if creating files directly, switch your default group beforehand (newgrp is the usual way to do this):
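    # switch the default group of the current shell (group name from the example above)
    newgrp def-jlerch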

Files created afterwards will then automatically have the appropriate group ownership.  (Note: don't add this to your ~/.bashrc - it breaks it, presumably because newgrp starts a new shell.)

Running Pydpiper

Since long-running jobs such as Pydpiper servers are discouraged on the login nodes, we'll submit the Pydpiper server itself to a compute node.  The salloc command seems to have a 24hr time limit, so we'll submit it as a non-interactive job, e.g., using qbatch (the maximum walltime allowed is 672hr, which should be more than enough for most pipelines; try to choose a more reasonable limit):
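For example (pipeline options elided; the walltime is illustrative, and the stdin syntax should be checked against qbatch --help):

    echo 'MBM.py ...' | qbatch --walltime 168:00:00 -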

The server is submitted as a job to the queue; once launched, it will submit additional executors itself as needed.

Monitoring your pipeline

As usual, information is written to the pipeline.log file.  The server's standard output, which would normally be visible in your terminal, is redirected to logs/slurm-STDOUT-<jobid>.out.  If you like, you can redirect it to a different location by submitting with echo 'MBM.py ... > some-file.txt' | qbatch ...; note that the output is buffered by the OS, so you won't always see things immediately as they are written out.
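For a quick look at progress, you can watch the queue and tail the log files mentioned above:

    squeue -u $USER                         # is the server (and are its executors) still running?
    tail -f pipeline.log                    # pipeline progress
    tail -f logs/slurm-STDOUT-<jobid>.out   # server stdout (may lag due to buffering)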

TODO: are the default stdout logs visible in real time?  Can you make Pyro (TCP) connections to the compute nodes via check_pipeline_status.py?

Acknowledging Graham/Compute Canada

See here; please edit if you find a paper for Graham (as there is for SciNet).
