Batch system
HPC and Data for Lattice QCD
Batch system administration
In the following, some useful commands/procedures are listed which help to maintain the QPACE batch system. The latter consists of a combination of Maui, Torque, Parastation and custom wrapper scripts and configuration programs.
It is assumed that you are root on the servers or nodes; otherwise, some of the described functionality may not be accessible.
Startup
Starting with Ramdisk v07.rc09, the Parastation and Torque daemons are automatically started on the nodes. In case you need to start or stop one of these daemons, please use the following commands on the nodes:
service pbs_mom start
service pbs_mom stop
service parastation start
service parastation stop
If you need to do this on a range of nodes, e.g. on one or more backplanes, it is advised that you use qpsh (described elsewhere).
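Where qpsh is not at hand, the same effect can be approximated with a plain ssh loop. The following is only a sketch: it assumes passwordless root ssh access to the nodes, the node names are hypothetical, and the function name service_on_nodes as well as the REMOTE_SH override variable are made up for this example.

```bash
#!/usr/bin/env bash
# Run "service <name> <action>" on each listed node, one after another.
# REMOTE_SH is a hypothetical override hook; it defaults to plain ssh.
service_on_nodes() {
    local name="$1" action="$2"
    shift 2
    local node
    for node in "$@"; do
        # -n keeps ssh from consuming the caller's stdin inside the loop
        ${REMOTE_SH:-ssh -n} "$node" service "$name" "$action" \
            || echo "failed on $node" >&2
    done
}

# Example (hypothetical node names):
#   service_on_nodes pbs_mom start node01 node02 node03
```

Nodes on which the command fails are reported on stderr, and the loop continues with the remaining nodes.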
See what is running
Torque jobs
Use qstat. Useful options: -f (full status of each job), -a (all jobs, in an alternative format).
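The output of qstat -f is a list of "name = value" lines per job, so a single attribute can be picked out with a small filter. The sketch below assumes that layout and that the value is not wrapped onto a continuation line; the function name job_field is hypothetical.

```bash
#!/usr/bin/env bash
# Print the value of one attribute from "qstat -f" style output on stdin.
# Assumes lines of the form "    <name> = <value>" (value on one line).
job_field() {
    awk -v key="$1" -F' = ' '{
        f = $1
        sub(/^[ \t]+/, "", f)      # drop the leading indentation
        if (f == key) print $2
    }'
}

# Example (JOBID is a placeholder):
#   qstat -f JOBID | job_field job_state
```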
Torque nodes
To see which nodes are available to Torque (i.e. on which nodes pbs_mom is up and running), use pbsnodes:
pbsnodes -l
pbsnodes -al
pbsnodes [-l] :prop
pbsnodes without parameters lists all nodes known to Torque (same as pbsnodes -a). pbsnodes -l lists only those nodes which are in DOWN, OFFLINE or UNKNOWN state. pbsnodes -la does the same, although in the long format (listing all known attributes).
It is possible to specify properties to limit the range of nodes shown. For QPACE, these properties are identical to the configured partitions. For example, to see which of the nodes of partition 1x1x4_25_00 are available to Torque, use the following command:
pbsnodes [-l] :1x1x4_25_00
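For a larger partition, a per-state count is often easier to read than the raw list. The following sketch assumes that pbsnodes -l prints one "<nodename> <state>" pair per line; the function name count_node_states is hypothetical. Combined states such as "down,offline" are counted as a state of their own.

```bash
#!/usr/bin/env bash
# Count nodes per state from "pbsnodes -l" style output on stdin.
# Assumes each line is "<nodename> <state>", separated by whitespace.
count_node_states() {
    awk 'NF >= 2 { n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Example:
#   pbsnodes -l :1x1x4_25_00 | count_node_states
```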
Maui jobs
While Torque manages the nodes/queues and actually runs jobs, Maui does the scheduling of the jobs. As such, it also has a view on which jobs/nodes are running. Maui comes with a set of tools which exposes this view to the admin/user, and does this in an arguably nicer way than Torque's tools.
To see which jobs are running, use the following commands:
showq
showq -r
showq -i
showq -b
showq -u USER
diagnose -j
If called without arguments, showq displays all currently queued jobs. The output can be limited to running (-r), idle (-i) or blocked (-b) jobs, or to the jobs of a certain user (-u USER). diagnose -j also shows which jobs are queued, but presents different information than showq.
If you need detailed information about a job (e.g. which nodes the job is running on), the command of choice is
checkjob JOBID
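To inspect all running jobs of one user in a single pass, checkjob can be combined with Torque's qselect. This is a sketch under assumptions: it takes for granted that checkjob accepts the numeric part of the Torque job ID (the part before the first dot), and the function name checkjobs_of_user is hypothetical.

```bash
#!/usr/bin/env bash
# Run checkjob for every running job of one user.
# qselect prints Torque job IDs as "<number>.<server>"; Maui's checkjob is
# assumed here to expect only the number, so the suffix is stripped.
checkjobs_of_user() {
    local user="$1" jobid
    qselect -u "$user" -s R | while read -r jobid; do
        checkjob "${jobid%%.*}"
    done
}

# Example (USER is a placeholder):
#   checkjobs_of_user USER
```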
Maui nodes
To get a quick overview of the state of all nodes, use
diagnose -n
To get detailed information about a specific node, use
checknode NODEID
Parastation nodes
To see which nodes are available to Parastation, use the tool psiadmin. For convenience, there is a tool called pslist on the nodes. Call it without parameters for further information.
Torque configuration
Most of Torque's configuration is done using the command qmgr. Listed below are some configuration parameters and useful subcommands:
QUEUE
list queue batch
set queue batch max_running = 200
properties:
max_running
resources_default.walltime
SERVER
set server mail_domain = qlogin.qpace.uni-wuppertal.de
properties:
mail_domain
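The subcommands above can also be passed to qmgr non-interactively with the -c option, which is convenient for scripted setups. The sketch below is an illustration only: the function name configure_batch_queue is hypothetical, and the walltime value is a placeholder rather than the QPACE setting.

```bash
#!/usr/bin/env bash
# Apply a few queue settings with "qmgr -c" and print the result.
# The walltime value is a hypothetical placeholder.
configure_batch_queue() {
    local queue="${1:-batch}"
    qmgr -c "set queue $queue max_running = 200"
    qmgr -c "set queue $queue resources_default.walltime = 01:00:00"
    qmgr -c "list queue $queue"
}

# Example:
#   configure_batch_queue batch
```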