locks
HPC and Data for Lattice QCD
Fixing problems caused by stale APErun locks
In particular situations it can happen that lock files created by the APErun tools are not removed. In the user's CAOS.log file this
will typically produce messages like:
[APErun]: Sleeping while starter-lock is active... sommer 49257
and finally:
[APErun]: ERROR: Waiting for too long for starter lock. Please contact the administrators
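A minimal sketch of how a CAOS.log could be scanned for these two messages. The patterns are derived only from the example lines above; reading the trailing "sommer 49257" as username and job-ID of the lock holder is an assumption, and the function name scan_caos_log is hypothetical.

```python
import re

# Assumed message shape, taken from the two example lines above. Treating the
# trailing pair as "username job-ID" of the lock holder is an assumption.
SLEEPING = re.compile(
    r"\[APErun\]: Sleeping while starter-lock is active\.\.\.\s*"
    r"(?P<user>\S+)\s+(?P<job_id>\d+)"
)
TIMEOUT = re.compile(r"\[APErun\]: ERROR: Waiting for too long for starter lock")


def scan_caos_log(lines):
    """Return (holders, timed_out) for an iterable of log lines.

    holders is a set of (user, job_id) pairs seen in "Sleeping" messages;
    timed_out is True if the final error message was found.
    """
    holders = set()
    timed_out = False
    for line in lines:
        match = SLEEPING.search(line)
        if match:
            holders.add((match.group("user"), int(match.group("job_id"))))
        elif TIMEOUT.search(line):
            timed_out = True
    return holders, timed_out
```

With an open log file this could be called as scan_caos_log(open("CAOS.log")). A timeout together with a single, unchanging holder is a hint that the starter lock may be stale, but it is not proof.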
What kind of locks are there and when should they exist?
To know when a lock can be removed by hand, it is important to understand for what purpose each lock is created:
1) apemaster?:/apeshare/locks/starter.lock
This lock is created on the corresponding apemasters by the scripts caos-startd and caos-stopd while starting or stopping the CAOS daemons. It is actually a directory which contains a file "info" with the username and job-ID for the job that created the lock.
This lock should only exist while the scripts caos-startd or caos-stopd are executed.
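The following sketch shows one way to inspect the starter lock before deciding whether it is stale. It relies only on what is stated above (the lock is a directory containing a file "info"). The internal layout of "info" is not specified here, so its content is returned as raw text. The function name describe_starter_lock and the use of the directory's modification time as the lock age are assumptions.

```python
import os
import time


def describe_starter_lock(lock_dir):
    """Return None if the lock directory is absent, else a small report dict."""
    if not os.path.isdir(lock_dir):
        return None
    info_path = os.path.join(lock_dir, "info")
    try:
        with open(info_path) as handle:
            info = handle.read().strip()
    except OSError:
        # The lock directory exists but "info" is missing or unreadable.
        info = None
    # Assumption: the directory's modification time approximates lock creation.
    age_seconds = time.time() - os.stat(lock_dir).st_mtime
    return {"info": info, "age_seconds": age_seconds}
```

On the apemaster this could be called as describe_starter_lock("/apeshare/locks/starter.lock"). A lock that is much older than a normal run of caos-startd or caos-stopd is a candidate for closer inspection.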
This lock file is created by caos-startd on the unit-PC where the CAOS root daemons are started. It indicates that caos-startd prepared the execution of a job which needs a running rootd. This lock will be removed by the script caos-stopd which is executed after job completion. These locks will automatically be removed during a reboot of the PC.
This lock should only exist if the corresponding unit is used by any job.
Similar to "aperun.lock". This lock is created by APErun while it is requesting locks on the unit PCs described below.
This lock should only exist during start-up and finish of APErun.
For each board which is needed for job execution, this lock file is created by caos-startd on the PC of the corresponding unit(s). It indicates that the job needs a running unit daemon. The lock(s) will be removed by the script caos-stopd, which is executed after job completion. The lock(s) will also be removed automatically during a reboot of the PC.
This lock should only exist if the corresponding board is used by any job.
The lock(s) are created by the APErun instance that is going to spawn a job on the corresponding board. They are removed by APErun when the job is completed. Any remaining stale locks are removed by caos-stopd. The lock(s) will also be removed automatically during a reboot of the PC.
This lock should only exist while APErun is running a job on the corresponding board.
How to handle stale locks
If there is reason to assume that a lock is stale, proceed in the following way:
1) Possibly disable affected parallel environments.
2) Identify both the job which created the lock and the jobs which are
affected by the lock.
4) Try to identify the reason for a lock to become stale.
5) Remove all stale locks.
6) Check whether any queues have been put into error mode using "qstat -f".
Clear error flags using "qmod -c <queue>".
8) Inform users whose jobs are in state 'Rq', since the current version of the
batch queuing system might give such jobs a very low priority, and advise them to resubmit their jobs.
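The queue check in step 6 can be sketched as follows. The parser assumes a Grid Engine style layout for the output of "qstat -f" (queue name, type, slot counter, load, architecture, optional state column); that layout is an assumption about the local batch system, and the helper names queues_in_error and clear_commands are hypothetical. The sketch only builds the "qmod -c <queue>" commands and does not run them.

```python
import re

# Assumed layout of a queue line in "qstat -f" output (Grid Engine style):
#   queuename  qtype  used/tot.  load_avg  arch  [states]
SLOTS = re.compile(r"^\d+(/\d+)+$")


def queues_in_error(qstat_output):
    """Return names of queues whose state column contains 'E'."""
    flagged = []
    for line in qstat_output.splitlines():
        # Job lines are indented; blank lines carry no information.
        if not line or line[0].isspace():
            continue
        fields = line.split()
        # Queue lines have a slot counter such as "0/4" as third field;
        # header and separator lines do not.
        if len(fields) >= 6 and SLOTS.match(fields[2]) and "E" in fields[5]:
            flagged.append(fields[0])
    return flagged


def clear_commands(queues):
    """Build the 'qmod -c <queue>' commands without running them."""
    return [["qmod", "-c", queue] for queue in queues]
```

Each returned command list could be passed to subprocess.run once the reason for the error state is understood. Clearing the flag earlier may only hide the underlying problem.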