HPC

HPC and Data for Lattice QCD

Batch system

Batch system administration

This page lists commands and procedures that help to maintain the QPACE batch system, which is a combination of Maui, Torque, Parastation, and custom wrapper scripts and configuration programs.

It is assumed that you are root on the servers or nodes. Otherwise, some of the functionality described here may not be accessible.

Startup

Starting with Ramdisk v07.rc09, the Parastation and Torque daemons are started automatically on the nodes. To start or stop one of these daemons manually, use the following commands on the nodes:

service pbs_mom start
service pbs_mom stop
service parastation start
service parastation stop

To do this on a range of nodes, e.g. on one or more backplanes, use qpsh (described elsewhere).
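
As a minimal sketch, the start/stop commands above can be wrapped in a small helper that restarts a daemon on the local node. The function name restart_daemon is hypothetical and not part of the QPACE tooling; it only calls service with the two daemon names listed above.

```shell
# Hypothetical helper (not part of the QPACE tooling): restart one of the
# batch-system daemons on the local node using the service commands above.
restart_daemon() {
    case "${1:-}" in
        pbs_mom|parastation) ;;
        *)
            echo "usage: restart_daemon pbs_mom|parastation" >&2
            return 2
            ;;
    esac
    # Stop first; a failing stop (e.g. daemon not running) only warns,
    # so the start is still attempted.
    service "$1" stop || echo "warning: could not stop $1" >&2
    service "$1" start
}
```

For example, restart_daemon pbs_mom runs "service pbs_mom stop" followed by "service pbs_mom start".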

See what is running

Torque jobs

Use qstat. Useful options: -f (full status of each job) and -a (all jobs, in the alternative format).

Torque nodes

To see which nodes are available to Torque (i.e. on which nodes pbs_mom is up and running), use pbsnodes:

pbsnodes -l
pbsnodes -al
pbsnodes [-l] :prop

Without parameters, pbsnodes lists all nodes known to Torque (same as pbsnodes -a). pbsnodes -l lists only those nodes that are in the DOWN, OFFLINE or UNKNOWN state. pbsnodes -al does the same, but in the long format (listing all known attributes).

It is possible to specify properties to limit the range of nodes shown. For QPACE, these properties are identical to the configured partitions. For example, to see which of the nodes of partition 1x1x4_25-00 are available to Torque, use the following command:

pbsnodes [-l] :1x1x4_25_00
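
To act on the list of unavailable nodes, for instance to pass it on to another tool, the output of pbsnodes -l can be filtered. The sketch below assumes that each output line has the whitespace-separated form "NODENAME STATE"; this format and the function name nodes_in_state are assumptions, so check them against the output on your system.

```shell
# Hypothetical helper: read `pbsnodes -l` style output on stdin and print the
# names of all nodes whose state field contains the given keyword.
# Assumes one "NODENAME STATE[,STATE...]" pair per line; default keyword: down.
nodes_in_state() {
    awk -v want="${1:-down}" 'NF >= 2 && index($2, want) > 0 { print $1 }'
}

# Usage on a server with Torque installed:
#   pbsnodes -l | nodes_in_state offline
```

The match is a plain substring test on the second field, so a node reported with a combined state such as "down,offline" is printed for either keyword.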

Maui jobs

While Torque manages the nodes and queues and actually runs the jobs, Maui schedules them. As such, Maui has its own view of the state of jobs and nodes. It comes with a set of tools that expose this view to the admin or user, arguably in a nicer way than Torque's tools.

To see which jobs are running, use the following commands:

showq
showq -r
showq -i
showq -b
showq -u USER
diagnose -j

Called without arguments, showq displays all currently queued jobs. The options -r, -i and -b limit the output to running, idle or blocked jobs, respectively, and -u limits it to the jobs of a given user. diagnose -j also shows which jobs are queued, but presents different information from showq.
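
The mapping between job states and showq options can be captured in a small wrapper. This is only a sketch: the function name queue_view and the view names are hypothetical, and the wrapper uses nothing beyond the showq options listed above.

```shell
# Hypothetical wrapper: select a showq view by name instead of by option.
queue_view() {
    case "${1:-all}" in
        all)     showq ;;
        running) showq -r ;;
        idle)    showq -i ;;
        blocked) showq -b ;;
        user)
            # The user view needs a user name as second argument.
            if [ -z "${2:-}" ]; then
                echo "usage: queue_view user USER" >&2
                return 2
            fi
            showq -u "$2"
            ;;
        *)
            echo "unknown view: $1" >&2
            return 2
            ;;
    esac
}
```

For example, queue_view blocked is equivalent to showq -b, and queue_view user USER to showq -u USER.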

If you need detailed information about a job (e.g. which nodes the job is running on), the command of choice is

checkjob JOBID

Maui nodes

To get a quick overview of the state of all nodes, use

diagnose -n

To get detailed information about a specific node, use

checknode NODEID

Parastation nodes

To see which nodes are available to Parastation, use the tool psiadmin. For convenience, there is also a tool called pslist on the nodes. Call it without parameters for further information.

Torque configuration

Most of Torque's configuration is done using the command qmgr. Listed below are some configuration parameters and useful subcommands:

QUEUE

list queue batch
set queue batch max_running = 200

properties:
  max_running
  resources_default.walltime

SERVER

set server mail_domain = qlogin.qpace.uni-wuppertal.de

properties:
  mail_domain
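
The subcommands above can also be run without entering the interactive qmgr prompt, using qmgr -c to execute a single command. The following is a minimal sketch; the function name set_queue_attr is hypothetical, and the walltime value in the usage comment is purely illustrative.

```shell
# Hypothetical helper: set one attribute of a Torque queue non-interactively.
# Relies on `qmgr -c "COMMAND"`, which executes a single qmgr command.
set_queue_attr() {
    if [ "$#" -ne 3 ]; then
        echo "usage: set_queue_attr QUEUE ATTRIBUTE VALUE" >&2
        return 2
    fi
    qmgr -c "set queue $1 $2 = $3"
}

# Usage (the walltime value is illustrative, not a QPACE setting):
#   set_queue_attr batch max_running 200
#   set_queue_attr batch resources_default.walltime 24:00:00
```

The result can be checked afterwards with qmgr -c "list queue batch".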