HPC and Data for Lattice QCD

Batch Queuing System

QPACE currently uses the Torque resource manager and the MAUI scheduler for its queuing system.

Submit jobs

Use the tool qpace-sub to submit your jobs. It accepts two parameters:

--topo=BLUExGREENxRED
This will submit a job that can run on any partition that has the given topology.
--part=PNAME
This will schedule the job for the partition with name PNAME only.

Examples

The following example is a minimal script to run a job:

#!/bin/bash
#
#  defines the name of the job:
#PBS -N my_job_name
#
#  defines the name of the queue
#  that the job will be submitted to:
#PBS -q queue_name

qpace-exec my-tool

IMPORTANT: The command qpace-exec must be called only once within a job script. After job completion, the TNW links are taken offline and are only re-enabled during the prologue of the next job.

The job name, my_job_name, will appear when you list the currently queued and running jobs using qstat. The binary to be run is my-tool, and it has to be started via qpace-exec. Additionally, a queue must be specified (see below); in this example, the job will be submitted to the queue queue_name.
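
For illustration, the qstat listing will then show a line with that job name, roughly like this (the job ID, user, and server name are made up for this example):

Job id                 Name             User        Time Use S Queue
---------------------- ---------------- ----------- -------- - -----
1234.jupace            my_job_name      someuser    00:12:34 R pro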

Suppose the job script is named my-script.bash. You can then run the job on a 4-by-4-by-1 partition in the following way:

qpace-sub --topo=4x4x1 -q pro my-script.bash

qpace-sub will search the list of available partitions and will schedule the job to run on any partition that has the required topology (and is available via the selected queue, pro in this example).

To run the job on a dedicated, fixed partition, use for example:

qpace-sub --part=4x4x1_05_00 -q pro my-script.bash

This will run the job on the partition named 4x4x1_05_00, which is by convention a 4-by-4-by-1 partition starting at nodecard 00 in backplane 05. Note that the job will not (and cannot) be started if the partition is not available via the selected queue.

IMPORTANT: Note that you must submit the job from a subdirectory under /work/ in order to run QPACE jobs. Otherwise, your job's output to stderr/stdout will be lost.
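
For example, a safe submission sequence could look like this (the project directory is just a placeholder; any subdirectory under /work/ will do):

cd /work/$USER/my_project
qpace-sub --topo=4x4x1 -q pro my-script.bash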

A more elaborate job script could look like this:

#!/bin/bash
#
#PBS -N my_job_name
#PBS -j eo
#PBS -v MYVAR=myvalue
#PBS -l walltime=48:00:00
#PBS -q dev
#
qpace-exec my-tool

Every line in the job script that starts with #PBS is evaluated as a command-line option to the tool qsub, which belongs to the Torque resource manager.
Please do not use qsub directly; always use qpace-sub.
Basically, you can use any command-line option that qsub supports, but only a few of them are useful on QPACE. In the above example, -j eo means that standard output and standard error will be merged into a single file. -v MYVAR=myvalue means that the environment variable MYVAR will be exported to the job with the value myvalue. -l walltime=48:00:00 requests a walltime of 48 hours. -q dev selects the development queue (see below).
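
Since #PBS lines are simply qsub options, the same settings can in principle be given on the qpace-sub command line, as the -q examples above already do. A sketch, assuming qpace-sub passes such options through to qsub:

qpace-sub --topo=1x4x4 -j eo -v MYVAR=myvalue -l walltime=48:00:00 -q dev my-script.bash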

Job control

You can control your jobs after submission. In order to see which jobs are scheduled, just use

qstat

or, for a more verbose output,

qstat -f

In case you change your mind and want to delete an already scheduled job, use qdel with the job ID reported by qstat:

qdel job_id
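
For example, assuming qstat reports your job under the (illustrative) ID 1234.jupace, the job is removed with:

qdel 1234.jupace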

Queues

The QPACE Torque resource manager is configured for two types of jobs: production and development. In order to run a job, you have to specify the type of the job by selecting the appropriate queue: "pro" for production runs and "dev" for development runs. This can be done in two ways:

  1. Add the following line to your job script:
    #PBS -q dev
    
  2. Add "-q dev" to the qpace-sub command line (see the examples above).

Please note the following:

  • Jobs will not be scheduled if no queue is selected.
  • Development jobs are limited to one backplane (BP).
  • Development jobs can only be run at the Wuppertal site.
  • In order to be allowed to start production runs, additional measures have to be taken. Please contact one of the administrators.
  • Development runs are limited to a runtime (walltime) of one week; the default setting is 24 hours. If your development jobs need to run longer than the default, specify the desired walltime as shown above (provided it is less than one week).
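
Putting these rules together, a development job that needs more than the default 24 hours could use a script like the following sketch (my-tool again stands for your binary; the five-day walltime is an arbitrary value below the one-week limit):

#!/bin/bash
#
#PBS -N my_dev_job
#PBS -q dev
#
#  five days: more than the 24-hour default,
#  less than the one-week limit of the dev queue
#PBS -l walltime=120:00:00

qpace-exec my-tool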

Email updates

It is possible to send mails from jupace.fz-juelich.de and qlogin.qpace-uni-wuppertal.de via sendmail. This is especially useful for status updates of batch jobs.

To receive Torque status emails, add the following line to your job script:

#PBS -m abe

-m enables mail delivery, and the letters a, b, and e specify on which occasions mail will be sent: b when a job begins, e after a job ends, and a when a job is aborted.
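
The letters can be combined as needed. For example, to be notified only when a job ends or is aborted (but not when it begins), use:

#PBS -m ae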

Mail will be sent to USER@qloginJ.qpace (or USER@qloginW.qpace). To actually receive this mail outside QPACE, use either of the following methods (assuming your mail address is user@domain.com):

  1. Add the following line to your job script:
    #PBS -M user@domain.com
    
  2. Create a file named .forward in your $HOME directory that contains your mail address.
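
For the second method, a one-liner like the following is sufficient (replace the address with your own):

echo "user@domain.com" > $HOME/.forward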

Supported topologies

The QPACE queuing system can support the following topologies on any installation:

  • 1x1x1
  • 1x1x2
  • 1x1x4
  • 1x1x8
  • 1x2x4
  • 1x2x8
  • 1x4x4
  • 1x4x8

Installations with a one-rack configuration (e.g. Wuppertal) may also support these topologies:

  • 2x4x4
  • 2x4x8
  • 2x8x4
  • 2x8x8
  • 2x12x4
  • 2x12x8
  • 2x16x4
  • 2x16x8

Installations with a two-rack configuration (Wuppertal and currently also Jülich) can also support these topologies:

  • 4x4x4
  • 4x4x8
  • 4x8x4
  • 4x8x8
  • 4x12x4
  • 4x12x8
  • 4x16x4
  • 4x16x8

Finally, installations with a four-rack configuration (currently none, but possibly Jülich in the future) may support these topologies:

  • 8x4x4
  • 8x4x8
  • 8x8x4
  • 8x8x8
  • 8x12x4
  • 8x12x8
  • 8x16x4
  • 8x16x8
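
As the partition naming convention above suggests, a BLUExGREENxRED topology appears to occupy BLUE*GREEN*RED nodecards (e.g. 1x4x8 would correspond to 32 nodecards, i.e. one backplane); this is an inference, not confirmed above. A small bash sketch, under that assumption, to compute the size of a topology request:

#!/bin/bash
#  Sketch: number of nodecards used by a topology string,
#  assuming the three dimensions simply multiply.
topo=${1:-4x4x1}
IFS=x read -r blue green red <<< "$topo"
echo "$topo uses $((blue * green * red)) nodecards"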