HPC and Data for Lattice QCD

Fixing problem caused by stale APErun locks

In certain situations, lock files created by the APErun tools are not
removed. The user's CAOS.log file will then typically contain messages
like:
[APErun]: Sleeping while starter-lock is active... sommer 49257
and finally:
[APErun]: ERROR: Waiting for too long for starter lock. Please contact the administrators
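
To check whether a job is affected, the log can be searched for these
messages. The following is a minimal shell sketch, not part of the APErun
tools: the function names are hypothetical, the default path (CAOS.log in
the current directory) is an assumption, and reading the last two fields of
the "Sleeping" message as the user name and job-ID of the lock holder is a
guess based on the example above.

```shell
# Print all starter-lock messages found in a CAOS.log file.
# The patterns are copied from the two example messages above.
# $1: path to the log file (assumed default: ./CAOS.log)
starter_lock_messages() {
    grep -E 'Sleeping while starter-lock is active|Waiting for too long for starter lock' "${1:-CAOS.log}"
}

# Print the last two fields of the most recent "Sleeping" message.
# Assumption: these fields are the user name and job-ID of the lock holder.
starter_lock_holder() {
    grep 'Sleeping while starter-lock is active' "${1:-CAOS.log}" |
        tail -n 1 |
        awk '{ print $(NF-1), $NF }'
}
```

For example, "starter_lock_holder CAOS.log" would print "sommer 49257" for
the message shown above.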

What kind of locks are there and when should they exist?

To decide when a lock can safely be removed by hand, it is important to
understand the purpose for which each lock is created:
1) apemaster?:/apeshare/locks/starter.lock
	This lock is created on the corresponding apemasters by the scripts
	caos-startd and caos-stopd while starting or stopping the CAOS daemons.
	It is actually a directory which contains a file "info" with the
	username and job-ID for the job that created the lock.
	This lock should only exist while the scripts caos-startd or caos-stopd
	are executed.
2) unit?0:/var/run/aperun/rootd.lock.u[0-3]
	This lock file is created by caos-startd on the unit PC where the CAOS
	root daemons are started. It indicates that caos-startd has prepared
	the execution of a job which needs a running rootd. The lock is removed
	by the script caos-stopd, which is executed after job completion. It is
	also removed automatically during a reboot of the PC.
	This lock should only exist while the corresponding unit is used by a
	job.
3) apemaster?:/apeshare/locks/aperun.lock
	Similar to "starter.lock". This lock is created by APErun while it is
	requesting locks on the unit PCs, as described below.
	This lock should only exist during start-up and finish of APErun.
4) unit??:/var/run/aperun/unitd.lock.b[0-3]
	For each board which is needed for job execution, this lock file is
	created by caos-startd on the PC of the corresponding unit(s). It
	indicates that the job needs a running unit daemon. These locks are
	removed by the script caos-stopd, which is executed after job
	completion. They are also removed automatically during a reboot of the
	PC.
	This lock should only exist while the corresponding board is used by a
	job.
5) unit??:/var/run/aperun/run.lock.b[0-3]
	This lock is created by an APErun instance that is going to spawn a job
	on the corresponding board. It is removed by APErun when the job is
	completed. Possible stale locks are removed by caos-stopd. These locks
	are also removed automatically during a reboot of the PC.
	This lock should only exist while APErun is running a job on the
	corresponding board.
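
The list above can be condensed into a small lookup. The following shell
sketch is only an illustration: the function name lock_expected_while is
hypothetical, and the patterns simply mirror the five paths listed above
(with or without a host prefix).

```shell
# Map a lock path to the condition under which the lock is expected to exist.
# Returns non-zero for paths that match none of the five lock types.
lock_expected_while() {
    case "$1" in
        */apeshare/locks/starter.lock)
            echo "caos-startd or caos-stopd is running" ;;
        */apeshare/locks/aperun.lock)
            echo "APErun is starting up or finishing" ;;
        */var/run/aperun/rootd.lock.u[0-3])
            echo "the unit is used by a job" ;;
        */var/run/aperun/unitd.lock.b[0-3])
            echo "the board is used by a job" ;;
        */var/run/aperun/run.lock.b[0-3])
            echo "APErun is running a job on the board" ;;
        *)
            echo "unknown lock"
            return 1 ;;
    esac
}
```

A lock found outside the condition printed for it is a candidate for being
stale.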

How to handle stale locks

If there is reason to assume that a lock is stale, proceed in the following way:
1) Possibly disable affected parallel environments.
2) Identify both the job which created the lock and the jobs which are
   affected by the lock.
3) Verify that the lock is indeed not needed anymore.
4) Try to identify the reason why the lock became stale.
5) Remove all stale locks.
6) Check whether any queues have been put into error state using "qstat -f".
   Clear error flags using "qmod -c <queue>".
7) Re-enable affected parallel environments.
8) Inform users whose jobs are in state 'Rq', since the current version of the
   batch queuing system might give them a very low priority, and advise them to
   resubmit their jobs.
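
As a starting point for steps 2) and 3), lock files on a unit PC that have
not been touched for a long time can be listed. This is a hedged sketch
only: the function name old_locks and the default threshold of 60 minutes
are assumptions, and the "-maxdepth" and "-mmin" options require a find
implementation that supports them (for example GNU find). Nothing is
removed, and age alone does not prove that a lock is stale; the checks in
step 3) are still required.

```shell
# List lock files in a directory that were last modified more than a given
# number of minutes ago. This only reports candidates; it removes nothing.
# $1: lock directory (default: /var/run/aperun, as in the list above)
# $2: age threshold in minutes (assumed default: 60)
old_locks() {
    dir=${1:-/var/run/aperun}
    minutes=${2:-60}
    find "$dir" -maxdepth 1 -name '*.lock.*' -mmin +"$minutes" -print
}
```

For example, "old_locks /var/run/aperun 120" would list the rootd, unitd
and run lock files older than two hours.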