Your pipeline is seemingly stuck - not processing anymore

The executors that run on the compute nodes of the cluster use the /dev/shm space to initialize communication with the server. /dev/shm is writable memory: you can use it as if it were a normal hard drive, but it is a lot faster because it is actually RAM. The way HPF handles job submissions on their cluster means that /dev/shm is not cleaned/wiped after a job finishes. (It is wiped on SciNet, for instance, but there you always get exclusive access to an entire compute node, which makes this much easier to manage.) If other users on HPF have filled up /dev/shm on a compute node and your job (an executor) tries to start there, it will crash right away, because it needs a little bit of space in /dev/shm. The server will then still think that your executor is waiting in the queue, because the executor never managed to contact it. So, if you're in the following situation:

  • check_pipeline_status.py [path to uri file] indicates no running jobs, but does indicate that executors are waiting in the queue AND
  • qstat -u [your username] does not show any of your jobs in the queue

You are in trouble. To know for sure that an executor has failed/died this way, run the following:

grep "No space left on device" logs/*

If this returns something along these lines:

logs/pipeline-executor-2017-06-05-at-19-52-36-14765941[2]-e.log-2:OSError: [Errno 28] No space left on device

You can examine the content of that file:

cat logs/pipeline-executor-2017-06-05-at-19-52-36-14765941[2]-e.log-2
 
Filesystem     1K-blocks    Used Available Use% Mounted on
none             1321924 1321924         0 100% /dev/shm
Traceback (most recent call last):
  File "/hpf/largeprojects/MICe/tools/pydpiper/2.0.7/bin/pipeline_executor.py", line 4, in <module>
    __import__('pkg_resources').run_script('pydpiper==2.0.7', 'pipeline_executor.py')
  File "/hpf/largeprojects/MICe/tools/python/3.5.2/lib/python3.5/site-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/hpf/largeprojects/MICe/tools/python/3.5.2/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1504, in run_script
    exec(code, namespace, namespace)
  File "/hpf/largeprojects/MICe/tools/pydpiper/2.0.7/lib/python/pydpiper-2.0.7-py3.5.egg/EGG-INFO/scripts/pipeline_executor.py", line 661, in <module>
    main()
  File "/hpf/largeprojects/MICe/tools/pydpiper/2.0.7/lib/python/pydpiper-2.0.7-py3.5.egg/EGG-INFO/scripts/pipeline_executor.py", line 647, in main
    local_launch(options)
  File "/hpf/largeprojects/MICe/tools/pydpiper/2.0.7/lib/python/pydpiper-2.0.7-py3.5.egg/EGG-INFO/scripts/pipeline_executor.py", line 636, in local_launch
    pe = pipelineExecutor(options=options, uri_file=options.urifile)  #, pipeline_name=options.pipeline_name)
  File "/hpf/largeprojects/MICe/tools/pydpiper/2.0.7/lib/python/pydpiper-2.0.7-py3.5.egg/EGG-INFO/scripts/pipeline_executor.py", line 238, in __init__
    self.lock = Lock()
  File "/hpf/largeprojects/MICe/tools/python/3.5.2/lib/python3.5/multiprocessing/context.py", line 66, in Lock
    return Lock(ctx=self.get_context())
  File "/hpf/largeprojects/MICe/tools/python/3.5.2/lib/python3.5/multiprocessing/synchronize.py", line 163, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/hpf/largeprojects/MICe/tools/python/3.5.2/lib/python3.5/multiprocessing/synchronize.py", line 60, in __init__
    unlink_now)
OSError: [Errno 28] No space left on device

Note the following lines that indicate the exact error:

none             1321924 1321924         0 100% /dev/shm       <------- the writable memory space is completely filled up
 
and
 
OSError: [Errno 28] No space left on device                    <------- indicates that that's the reason for the crash of this executor

Resolution

Simply press Ctrl+C to stop your pipeline. Then rerun your original command, and the pipeline will continue where it left off. Also, it might be worth sending an email to the Research Help Desk (if this happens on HPF) to let them know that they have one or more compute nodes where /dev/shm is completely filled up.
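
To get a sense of how many executors were hit (and to give the Help Desk something concrete), you could list the executor error logs in which the df output reports /dev/shm as completely full. A small sketch, based on the df line that appears at the top of those logs:

# list the executor error logs in which /dev/shm was reported as 100% full:
grep -l "100% /dev/shm" logs/*-e.log*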

error: [Errno 107] Transport endpoint is not connected

N.B.: this serious issue is believed fixed in Pydpiper 2.0 and higher.

We run into network-related issues especially when running pipelines on HPF. The error above signals that the established communication between the server and its executors has dropped. Once this message appears, the server has stopped working. Press Ctrl+C to fully stop/kill your pipeline, and rerun your command.

Here's a longer example of the type of error you might see:

Process Process-1:
Traceback (most recent call last):
  File "/hpf/tools/centos6/python/2.7.9/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/hpf/tools/centos6/python/2.7.9/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/hpf/largeprojects/MICe/tools/pydpiper/1.18.1/lib/python/Pyro4-4.47-py2.7.egg/Pyro4/core.py", line 992, in requestLoop
    self.transportServer.loop(loopCondition=condition)
  File "/hpf/largeprojects/MICe/tools/pydpiper/1.18.1/lib/python/Pyro4-4.47-py2.7.egg/Pyro4/socketserver/multiplexserver.py", line 180, in loop
    server.events(fileobjs)
  File "/hpf/largeprojects/MICe/tools/pydpiper/1.18.1/lib/python/Pyro4-4.47-py2.7.egg/Pyro4/socketserver/multiplexserver.py", line 71, in events
    active = self.handleRequest(s)
  File "/hpf/largeprojects/MICe/tools/pydpiper/1.18.1/lib/python/Pyro4-4.47-py2.7.egg/Pyro4/socketserver/multiplexserver.py", line 151, in handleRequest
    log.debug("disconnected %s", conn.sock.getpeername())
  File "/hpf/tools/centos6/python/2.7.9/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 107] Transport endpoint is not connected
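
If the traceback scrolled by unnoticed, you can check afterwards whether this error occurred by searching the server log and the executor logs for it. This is only a sketch; exactly which file the message ends up in can vary, so it simply searches both:

# search the server log and all executor logs for the dropped-connection error:
grep -rn "Transport endpoint is not connected" pipeline.log logs/ 2>/dev/null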

 

Jobs require more memory than available

Issues of the type:

Shutting down due to jobs which require more memory ([...]G) than available anywhere.

Here's an example error message from a pydpiper pipeline:

(eeyore)(eeyore)lcahill@chuck:/hpf/largeprojects/MICe/lcahill/Hypoxia_Autism/P23$ MBM.py \
--pipeline-name=06sep16_HA_23days \
--init-model=/hpf/largeprojects/MICe/tools/initial-models/Pydpiper-40-micron-basket-dec-2014/basket_mouse_brain_40micron.mnc \
--registration-method=mincANTS \
--num-executors 34 \
/hpf/largeprojects/MICe/lcahill/Hypoxia_Autism/P23/img*.mnc 

Total number of stages in the pipeline:  4979
Number of stages already processed:      580

Shutting down due to jobs which require more memory (22.07G) than available anywhere. Please use the --mem argument to increase the amount of memory that executors can request.

First, an explanation of what's happening here. The pydpiper code estimates for each job how much memory is required to run it. Executors are then submitted with a memory request based on what the currently runnable jobs need. One of the command line flags to any pydpiper program (--mem) determines the maximum amount of memory an executor can request. This was originally used for pipelines at SciNet, where each compute node only had 14G available and we wanted to make sure we wouldn't exceed that. This parameter can be set as high as the maximum amount of memory available on the system you are working on. For instance, at MICe, farm40 has 192G of memory, so that is the absolute upper limit, but normally you won't need that much. When you load a quarantine, a configuration file determines some of these defaults. You can find it like so:

echo $PYDPIPER_CONFIG_FILE 

At MICe you would currently (September 2016) find:

echo $PYDPIPER_CONFIG_FILE 
/axiom2/projects/software/arch/pydpiper/master-v1.18.1/python/pydpiper-1.18.1-py2.7.egg/config/MICe.cfg

# and the content is:
cat /axiom2/projects/software/arch/pydpiper/master-v1.18.1/python/pydpiper-1.18.1-py2.7.egg/config/MICe.cfg 
[MICe]
queue-type sge
queue-name all.q
proc 1
mem  14
pe   smp
time-to-accept-jobs 180

This means that the maximum amount of memory an executor can request is 14G (mem  14). If you get the error message displayed above, you can change the --mem flag to ensure that your pipeline continues running. Here we set the maximum amount of memory for an executor to 24G, more than the 22.07G that is required:

(eeyore)(eeyore)lcahill@chuck:/hpf/largeprojects/MICe/lcahill/Hypoxia_Autism/P23$ MBM.py \
--pipeline-name=06sep16_HA_23days \
--init-model=/hpf/largeprojects/MICe/tools/initial-models/Pydpiper-40-micron-basket-dec-2014/basket_mouse_brain_40micron.mnc \
--registration-method=mincANTS \
--num-executors 34 \
--mem=24 \
/hpf/largeprojects/MICe/lcahill/Hypoxia_Autism/P23/img*.mnc 
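
If you find yourself passing --mem on every run, another option is to keep a personal copy of the configuration file with a higher memory ceiling and point PYDPIPER_CONFIG_FILE at it. This is only a sketch: the file name below is just an example, and it assumes pydpiper picks the file up through this environment variable (as the quarantine setup above suggests), so the export has to happen after loading the quarantine:

# keep a personal copy of the site config with a higher per-executor memory limit:
cp "$PYDPIPER_CONFIG_FILE" ~/pydpiper-highmem.cfg
# edit ~/pydpiper-highmem.cfg and change the line "mem  14" to e.g. "mem  24", then:
export PYDPIPER_CONFIG_FILE=~/pydpiper-highmem.cfg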

Jobs killed due to exceeding memory - the summary

If jobs are being killed because they are using more memory than requested, information about that will end up in the error log file of the executor that was trying to run that job. For all pydpiper pipelines run on HPF, the following command will tell you if that's the case:

# here the "logs" directory is the one located in the directory you are running your pipeline from:

grep "job killed: vmem" logs/*-e.log*

If there is some output, executors are being killed. For instance:

logs/pipeline-executor-2016-08-25-at-08-14-13-6914496[9]-e.log-9:=>> PBS: job killed: vmem 5475532800 exceeded limit 5368709120

This means that the executor requested roughly 5.37G of memory, and used (at least) 5.47G. The problem here is that the code thinks 5.37G should be enough memory to run this job, but in reality it isn't. One way to fix this is to indicate that all jobs in your pipeline need at least 8G to run; this overrides the 5.37G that the pipeline wants to request. You can do that with the following flag and argument:

--default-job-mem=8
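
If you want to double-check where those gigabyte figures come from, the byte counts in the PBS line convert to (decimal) gigabytes like this; a quick awk sketch:

# convert the vmem limit and usage from the PBS message into gigabytes:
awk 'BEGIN { printf "limit: %.2f G  used: %.2f G\n", 5368709120/1e9, 5475532800/1e9 }'
# prints: limit: 5.37 G  used: 5.48 G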

Jobs killed due to exceeding memory - an example

Sometimes your pipeline fails and you get messages indicating that too many executors have died. The message you'll get will be similar to this one:

Error: 35 executors have died. This is more than the number (20) allowed by --max-failed-executors.  No executors remain; exiting...

The first thing to do is to figure out which stage(s) failed. You can do this by querying the pipeline.log file (in case you don't see any messages in your shell pointing to log files from the stages that are failing):

grep "Lost Stage " pipeline.log

Here's an example output:

[2016-08-25 09:27:41.214,pydpiper.pipeline,WARNING] Lost Stage 9703: mincANTS 3 --number-of-affine-iterations 0 -m 'CC[/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/resampled/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12.mnc,/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_nlin/second_level-nlin-1.mnc,1,3]' -m 'CC[/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/tmp/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_fwhm0.056_dxyz.mnc,/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_nlin/second_level-nlin-1/tmp/second_level-nlin-1_fwhm0.056_dxyz.mnc,1,3]' -t 'SyN[0.4]' -r 'Gauss[5,1]' -i 100x100x100x20 -o /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/tmp/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_to_second_level-nlin-1_nlin_0.xfm -x /hpf/largeprojects/MICe/vousdend/CREB_RAM/init_model/p65_MEMRI_bruker_mouse_brain_mask.mnc: 

So in this case a mincANTS stage is failing. We will need to find the log file that belongs to this stage and see if there is any information in there about what's going wrong. The output file from the stage can be used to find the log directory:

# output file:
/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/tmp/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_to_second_level-nlin-1_nlin_0.xfm

# which means that the tmp directory for this file is:
/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/tmp/

# and the log directory is:
/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/log
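
If you prefer to derive the log directory on the command line rather than by eye, here is a small sketch using the output file above (it assumes the tmp/log layout shown here):

# strip the file name and the tmp directory from the output file path, then append /log:
out=/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/tmp/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_to_second_level-nlin-1_nlin_0.xfm
echo "$(dirname "$(dirname "$out")")/log"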

Generally, when a pipeline is stuck like this, the file in the log directory that was written to last is the one you're interested in. You can find it using the following ls command, which sorts the files by modification time with the most recently modified file at the bottom of the list:

ls -ltr /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/log

which in this case provided (abbreviated):

...
-rw-r--r-- 1 vousdend mice 18105 Aug 24 18:50 G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-avg-lsq12.log
-rw-r--r-- 1 vousdend mice  1095 Aug 24 18:52 G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12.log
-rw-r--r-- 1 vousdend mice  1017 Aug 24 18:52 G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_fwhm0.056_blur.log
-rw-r--r-- 1 vousdend mice  9687 Aug 24 19:15 G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_to_second_level_lsq12-pairs_nlin_0.log
-rw-r--r-- 1 vousdend mice  1344 Aug 24 19:33 G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_to_second_level_lsq12-pairs_nlin_0-resampled.log
-rw-r--r-- 1 vousdend mice 44106 Aug 25 12:51 G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_to_second_level-nlin-1_nlin_0.log
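
If you only want the name of the most recently modified file, a sketch that prints just that:

# print only the newest file in the log directory:
ls -t /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/log | head -n 1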

So now we know which log file to inspect:

nano [or any other text editor, or simply cat] /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/log/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_to_second_level-nlin-1_nlin_0.log

Here's part of what is in that text file:

Stage 9703 running on r3a-8 (PYRO:obj_2765f22ef51546f6bb0b664aa0f09ecf@10.10.3.15:56025) at 2016-08-24 21:05:34.996272:
mincANTS 3 --number-of-affine-iterations 0 -m 'CC[/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_sec...
 Run Reg
 values 2
  Fixed image file: /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processe...
  Moving image file: /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_nlin/se...
Metric 0:  Not a Point-set
  Fixed image file: /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processe...
  Moving image file: /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_nlin/se...
  similarity metric weight: 1
  Radius: [3, 3, 3]
  Radius: [3, 3, 3]
...
...
E-Slope 0.000119865 iteration 54 energy 0 : -403743 energy 1 : -435389
 E-Slope 0.000117451 iteration 55 energy 0 : -403797 energy 1 : -435374
 E-Slope 0.0001169 iteration 56 energy 0 : -403851 energy 1 : -435354
 E-Slope 0.000115879 iteration 57 energy 0 : -403900 energy 1 : -435339
 E-Slope 0.000112928 iteration 58 energy 0 : -403943 energy 1 : -435319
 E-Slope 0.00010065 iteration 59 energy 0 : -403983 energy 1 : -435302
 E-Slope 8.91015e-05 iteration 60 energy 0 : -404017 energy 1 : -435281
 Converged due to oscillation in optimization  metric 0 bad 1   metric 1 bad 18
 this->m_ScaleFactor 1 nlev 4 curl 3
 Its at this level 20
Stage 9703 running on r2a-5 (PYRO:obj_00aa017897e2421bb1312bb0b9b626ae@10.10.2.9:38196) at 2016-08-25 08:16:03.778378:
mincANTS 3 --number-of-affine-iterations 0 -m 'CC[/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_s...
 Run Reg
 values 2
  Fixed image file: /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed...
  Moving image file: /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_nlin/sec...
Metric 0:  Not a Point-set
  Fixed image file: /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed...
  Moving image file: /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_nlin/sec...
  similarity metric weight: 1
  Radius: [3, 3, 3]
  Radius: [3, 3, 3]

The mincANTS command was run with the following iteration levels: 100x100x100x20. This means that 20 iterations should be performed at the true file resolution. We can see in the log file that right after mincANTS started at this level (the line "Its at this level 20"), the program must have stopped running, because the next line is a new entry indicating that the stage started again (i.e., it was restarted in a second attempt):

 Its at this level 20
Stage 9703 running on r2a-5 (PYRO:obj_00aa017897e2421bb1312bb0b9b626ae@10.10.2.9:38196) at 2016-08-25 08:16:03.778378:


The first line indicates that mincANTS was starting the final round of iterations (of the 100x100x100x20):
 Its at this level 20

The line after that indicates that the stage started; in this case that means it was restarted, because the previous attempt failed:
Stage 9703 running on r2a-5 (PYRO:obj_00aa017897e2421bb1312bb0b9b626ae@10.10.2.9:38196) at 2016-08-25 08:16:03.778378:
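
A quick way to confirm that a stage was attempted more than once is to count its "Stage ... running" header lines in the log file; a sketch using the log file above:

# count the number of attempts recorded for stage 9703 in its log file:
grep -c "^Stage 9703 running" /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/log/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_to_second_level-nlin-1_nlin_0.log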

So why did the first attempt fail? We can check what happened to that particular executor. Its address/identifier can be seen in the first line:

Stage 9703 running on r3a-8 (PYRO:obj_2765f22ef51546f6bb0b664aa0f09ecf@10.10.3.15:56025) at 2016-08-24 21:05:34.996272:

So we look for the log files of this executor in the overall logs directory (the one in the top directory, where the main pipeline directories are):

grep 2765f22ef51546f6bb0b664aa0f09ecf logs/*

This resulted in:

logs/pipeline-executor-2016-08-24-at-21-03-51-6914219[25].log:[2016-08-24 21:05:34.916,__main__,INFO] Client URI is PYRO:obj_2765f22ef51546f6bb0b664aa0f09ecf@10.10.3.15:56025
logs/pipeline-executor-2016-08-24-at-21-03-51-6914219[25].log:[2016-08-24 21:05:34.944,__main__,INFO] Running stage 9703 (on PYRO:obj_2765f22ef51546f6bb0b664aa0f09ecf@10.10.3.15:56025)

So now we know which log files belong to this executor:

Full log file of the executor:
logs/pipeline-executor-2016-08-24-at-21-03-51-6914219[25].log

Output log from queuing system:
logs/pipeline-executor-2016-08-24-at-21-03-51-6914219[25]-o.log-25

Error log from queuing system:
logs/pipeline-executor-2016-08-24-at-21-03-51-6914219[25]-e.log-25

The actual log file does not tell us much:

cat logs/pipeline-executor-2016-08-24-at-21-03-51-6914219\[25\].log 


[2016-08-24 21:05:34.357,Pyro4,INFO] Pyro log configured using built-in defaults, level=INFO
[2016-08-24 21:05:34.902,Pyro4.multiplexserver,INFO] starting multiplexed socketserver
[2016-08-24 21:05:34.916,__main__,INFO] Connected to PYRO:obj_01fd8e1499814146a005c9d0ae4c694d@10.10.1.6:33824
[2016-08-24 21:05:34.916,__main__,INFO] Client URI is PYRO:obj_2765f22ef51546f6bb0b664aa0f09ecf@10.10.3.15:56025
[2016-08-24 21:05:34.916,__main__,INFO] Connected to the server at: 2016-08-24 21:05:34.916599
[2016-08-24 21:05:34.939,Pyro4.core,INFO] daemon 10.10.3.15:56025 entering requestloop
[2016-08-24 21:05:34.944,__main__,INFO] Running stage 9703 (on PYRO:obj_2765f22ef51546f6bb0b664aa0f09ecf@10.10.3.15:56025)
[2016-08-24 21:05:34.952,__main__,INFO] mincANTS 3 --number-of-affine-iterations 0 -m 'CC[/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/resampled/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12.mnc,/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_nlin/second_level-nlin-1.mnc,1,3]' -m 'CC[/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/tmp/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_fwhm0.056_dxyz.mnc,/hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_nlin/second_level-nlin-1/tmp/second_level-nlin-1_fwhm0.056_dxyz.mnc,1,3]' -t 'SyN[0.4]' -r 'Gauss[5,1]' -i 100x100x100x20 -o /hpf/largeprojects/MICe/vousdend/CREB_RAM/registration/19aug16/19aug16_secondlevel/second_level_processed/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3/tmp/G7_pre_8jul16.1_1.may2016_dist_corr-nlin-3-resampled-lsq12_to_second_level-nlin-1_nlin_0.xfm -x /hpf/largeprojects/MICe/vousdend/CREB_RAM/init_model/p65_MEMRI_bruker_mouse_brain_mask.mnc

However, the error log file reveals the issue!

cat logs/pipeline-executor-2016-08-24-at-21-03-51-6914219\[25\]-e.log-25 


Filesystem           1K-blocks      Used Available Use% Mounted on
none                   1321928    189084   1132844  15% /dev/shm
=>> PBS: job killed: vmem 5531455488 exceeded limit 5368709120

This means that the executor requested roughly 5.37G of memory, and used (at least) 5.53G. As before, the code thinks 5.37G should be enough memory to run this job, but in reality it isn't. One way to fix this is to indicate that all jobs in your pipeline need at least 8G to run, overriding the 5.37G that the pipeline wants to request. You can do that with the following flag and argument:

--default-job-mem=8
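
The flag is simply appended to your original pipeline command when you rerun it, just like --mem earlier; a sketch with placeholders rather than a real command:

# rerun your original pydpiper command, adding the flag:
[your original pydpiper command and arguments] --default-job-mem=8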