This is mostly condensed from HPF documentation; see https://wiki.mouseimaging.ca/download/attachments/7897100/HPF%20User%20Documentation.pptx?api=v2, https://ccm.sickkids.ca/?page_id=61, and https://hpc.ccm.sickkids.ca.
- Transfer data onto /hpf/largeprojects/MICe.
- Start a new `qlogin` session with sufficient walltime and memory to run Pydpiper.
- Load modules.
- Run your Pydpiper job as you would at MICe.
Read on for details.
A directory for our projects on the HPF file system is mounted at MICe as /hpf/largeprojects/MICe, so it's trivial to move data back and forth: simply place it in this directory, and it'll be visible from both systems. At the moment, MICe users have write permissions on this directory, but you should create a single directory named after your username. (NOTE: if you don't want the contents of this directory to be world-visible at SickKids, run `chmod o-rx` on it from either MICe or SickKids.)
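A minimal sketch of the directory setup described above, assuming your username on HPF is what `id -un` reports. The fallback to a temporary directory is an addition of this example (not part of the HPF setup) so that the commands can be tried on a machine where /hpf/largeprojects/MICe is not mounted.

```shell
# Shared project area named in the text above. Fall back to a scratch
# directory when it is not available; the fallback is an assumption of this sketch.
base=/hpf/largeprojects/MICe
if [ ! -d "$base" ] || [ ! -w "$base" ]; then
    base=$(mktemp -d)
fi

# One directory per user, named after the username.
mydir="$base/$(id -un)"
mkdir -p "$mydir"

# Optional: hide the contents from other users at SickKids.
chmod o-rx "$mydir"

# Show the resulting permissions.
ls -ld "$mydir"
```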
The HPF also has data nodes, but we don't really need these since data transfer is trivial and the qlogin nodes now have network access for installing software.
Troubleshooting a pipeline
See here for some detailed information on how to troubleshoot a pipeline: Troubleshooting pydpiper pipelines
Example calls for MBM.py
Disk Clean up
When you're finished with a registration, have the data analysed and are ready to archive the pipeline, you can remove more than just the tmp directories from the pipeline. See these notes on how to perform a thorough disk clean up after an MBM run.
ssh into a login node:
You may omit the "username@" if your local user is the same (which will usually be the case here); `-A` enables ssh agent forwarding. No projects (`/hpf/projects`, etc.) or tools (`/hpf/tools`) directories are mounted, so you can't do anything useful from here.
qlogin (interactive/submit) nodes
Next, from a login node:
From the qlogin nodes, you may `qsub` jobs (or run Pydpiper) as usual.
The `-l` flag specifies resource requests for your qlogin session. If not specified, the defaults are 2 hours and 2 GB of RAM (for larger pipelines, consider requesting 8 or even 16 GB). You may have up to 6 interactive sessions of 16 GB of RAM or less, each for a maximum of 5 days, as well as a single interactive session of up to 48 GB of RAM for a maximum of 24 hours. (There's apparently also a `ppn` option, but I don't know how to use it.) We might consider asking for even longer time limits.
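As a rough illustration of the limits above, the following sketch checks a memory/walltime request against them and prints a matching `qlogin` command without running it. The function name `qlogin_cmd` is hypothetical, and the `mem=`/`walltime=` resource names are taken from the example request quoted later on this page (`mem=8G,walltime=72:00:00`); the limit of 6 concurrent sessions is not checked.

```shell
# Print a qlogin command for the requested memory (whole GB) and walltime
# (whole hours), or fail if the request exceeds the limits quoted above:
#   up to 16 GB for up to 5 days (120 hours), or
#   up to 48 GB for up to 24 hours.
# qlogin_cmd is a hypothetical helper; it only prints, it does not submit.
qlogin_cmd() {
    mem_gb=$1
    hours=$2
    if [ "$mem_gb" -le 16 ] && [ "$hours" -le 120 ]; then
        :   # within the limits for a regular interactive session
    elif [ "$mem_gb" -le 48 ] && [ "$hours" -le 24 ]; then
        :   # within the limits for the single large-memory session
    else
        echo "request exceeds the documented qlogin limits" >&2
        return 1
    fi
    printf 'qlogin -l mem=%sG,walltime=%s:00:00\n' "$mem_gb" "$hours"
}

qlogin_cmd 8 72    # prints: qlogin -l mem=8G,walltime=72:00:00
```

Printing rather than executing keeps the check usable from any machine; on a login node you would copy or `eval` the printed command.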
Surviving network interruptions with GNU Screen
Using a terminal multiplexer such as Screen or Tmux is useful to avoid losing your HPF session/pipeline due to a network interruption or local machine issues – particularly useful if you're working from a laptop.
For instance, you could achieve this by starting a `screen` session on the login nodes (but apparently not the qlogin nodes) as follows:
[bdarwin@hpclogin3 ~]$ screen
[bdarwin@hpclogin3 ~]$ qlogin ...
[bdarwin@qlogin1 ~]$ MBM.py ...
# now press Control-"a" (to send a command to screen instead of the shell) followed by Control-"d" (for detach)
[bdarwin@hpclogin3 ~]$ ^D
bdarwin@vinnie ~$ # at this point we have no ssh connection to HPF at all
# now a screen session is running independently on hpclogin3 ... to reattach:
[bdarwin@hpclogin3 ~]$ screen -r # reattach
[bdarwin@qlogin1 ~]$ MBM.py ........................... # <-- lots of dots emitted by Pydpiper
# if you're satisfied with your pipeline's progress, you can detach again ...
HPF has several different login machines, currently `hpclogin4`. When you log back into HPF you might end up on one of the other login nodes. You can find out which one you are on by running the following command:
Now what do you do? The answer is that you can simply ssh into the login node you want:
Next up, multiple screens! After you have detached from a screen, you do not necessarily have to reattach to it. You can also simply run screen again and start a second one:
TODO: move this section to a separate page and link to it from the HPF/SciNet/Graham wikipages ?
How long will your qlogin session still run for?
This section is out of date following the CentOS 6 → CentOS 7 upgrade on March 1, 2021. You can use `/opt/qlogin_torque/bin/qstat` in a similar way as below, but it doesn't seem to show the wall time elapsed, making this endeavour somewhat futile.
In the previous command you might have specified mem=8G,walltime=72:00:00. A day or two later you can find out how much time is left in that session using:
modulecmd(1) to set paths and other environment variables for our software:
(In the future, we intend to make manual loading of the prerequisite modules unnecessary.)
The Pydpiper module also sets some system-specific settings via a config file at `$PYDPIPER_CONFIG_FILE` which you can
Make sure you have enough time remaining in your qlogin session!
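Since `qstat` apparently no longer reports elapsed wall time, one possible workaround is to note the session start time yourself and compute what is left from the walltime you requested. The sketch below assumes exactly that; `walltime_left` and its arguments are hypothetical names for this example, not part of the HPF tooling.

```shell
# Report how much of a requested walltime remains.
#   $1  requested walltime as HH:MM:SS (e.g. 72:00:00)
#   $2  session start time in seconds since the epoch
#   $3  optional "now" in seconds since the epoch (defaults to the current time)
# walltime_left is a hypothetical helper for this sketch.
walltime_left() {
    # awk treats fields such as "08" as decimal, avoiding octal surprises.
    requested=$(echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')
    started=$2
    now=${3:-$(date +%s)}
    left=$(( requested - (now - started) ))
    if [ "$left" -lt 0 ]; then
        left=0
    fi
    printf '%dh %02dm\n' $(( left / 3600 )) $(( left % 3600 / 60 ))
}

# Fixed numbers: a 72-hour request, 25 hours (90000 s) after the start.
walltime_left 72:00:00 0 90000    # prints: 47h 00m
```

In practice you would store `date +%s` in a file right after `qlogin` starts and pass its contents as the second argument before launching a pipeline.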
/hpf/largeprojects/MICe/tools/protocols for relevant files.
You generally don't need to specify `--mem`, `--proc`, `--time`, `--queue-type`, etc., since this is done in the configuration file.
As an example, a simple MBM pipeline should look something like this, if you choose not to run MAGeT:
You should see something like the following:
 indicates that Pydpiper has submitted an "array job", i.e., an array of nearly identical jobs. You can get detailed information about the individual elements of the array using `qstat -t 3725292` (note the brackets). (Note: this is standard at MICe as well, but I don't know where to put general Pydpiper information yet ...)
- You may occasionally get sporadic errors about executors dying shortly after they start. This happens when a node's ramdisk (`/dev/shm/`) is full, causing the executor to crash. For now, `--max-failed-executors` has been set to some large number in the config file.
- If executors are dying systematically or a specific stage is repeatedly failing, this probably indicates stages are being cancelled by the scheduler for using too much virtual memory ('vmem') due to its very conservative accounting scheme. Try a 'grep vmem *-e.log*' if you suspect this. As a temporary workaround I've increased the default job memory in the configuration file, but this means executors may be slower to start running.
- The server itself doesn't use much memory for all but very large pipelines, but the scheduler sees double that amount at the moment when jobs are being submitted. This could be fixed if it's problematic.
- If the scheduler is not accepting jobs, you'll see some further errors when trying to submit. Currently Pydpiper will just repeatedly try to submit, but this isn't really tested yet. If Pydpiper gets into a state where it believes some executors are still queued but `qstat -u ...` shows otherwise and restarting your pipeline is time-consuming, you may submit the appropriate number of executors manually via `pipeline_executor.py --uri-file=/full/path/to/build-model-jan27_uri --num-executors=n`.
- Pydpiper >= 2.0.8 by default resamples your data to consensus space and requires a common space template. This is good practice as it allows for future meta-analysis. That said, this feature may not be appropriate for many studies (for example, studies with neonatal brains). Add the "--no-common-space-registration" flag to skip this registration step.
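The `grep vmem` check suggested in the list above can be wrapped in a small helper that reports only the log files that actually mention vmem, with a count for each. This is a sketch: `vmem_hits` is a hypothetical name, and it assumes the executor error logs match `*-e.log*` as in the grep command above.

```shell
# List executor error logs that mention "vmem", one "file:count" per line.
# Returns non-zero when no log mentions it (or no logs are found).
# vmem_hits is a hypothetical helper for this sketch.
vmem_hits() {
    logdir=${1:-.}
    # /dev/null is passed as an extra file so grep always prints file names,
    # even when the glob matches a single log; zero counts are filtered out.
    grep -c vmem "$logdir"/*-e.log* /dev/null 2>/dev/null | grep -v ':0$'
}
```

Run it as `vmem_hits .` from the directory where you would otherwise run the grep; stages whose logs show up repeatedly are likely candidates for a larger memory request.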