Batch Queuing System
QPACE currently uses the TORQUE resource manager and the Maui scheduler for its queuing system.
Submitting jobs
Use the tool qpace-sub to submit your jobs. The tool can take two parameters:
- --topo=XxYxZ - This will submit a job that can run on any partition that has the given topology (for example, 4x4x1).
- --part=PNAME - This will schedule the job for the partition with name PNAME only.
Examples
The following example is a minimal script to run a job:
#!/bin/bash
#
# defines the name of the job:
#PBS -N my_job_name
#
# defines the name of the queue
# that the job will be submitted to:
#PBS -q queue_name
qpace-exec my-tool
IMPORTANT: The command qpace-exec must be called only once within a job script. After job completion the TNW links will be taken offline and will only be re-enabled during the prologue of the next job.
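If a job needs to run more than one binary, one way to respect this restriction is to wrap all steps in a single script and pass that wrapper to qpace-exec. The following is a minimal sketch, assuming qpace-exec accepts an arbitrary executable; the wrapper name my-steps.sh is a hypothetical example:
#!/bin/bash
#PBS -N multi_step_job
#PBS -q queue_name
#
# qpace-exec is called exactly once; my-steps.sh is assumed to run
# all individual steps in sequence:
qpace-exec ./my-steps.sh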
The name of the job in the example above, my_job_name, will appear if one lists the currently queued and running jobs using qstat. The binary to be run is my-tool, and it has to be started via qpace-exec. Additionally, a queue must be specified (see below). In this example, the job will be submitted to the queue queue_name.
Let's say that the name of the job script is my-script.bash. Then you can run the job on a 4-by-4-by-1 partition in the following way:
qpace-sub --topo=4x4x1 -q pro my-script.bash
qpace-sub will search the list of available partitions and will schedule the job to run on any partition that has the required topology (and is available via the specified queue, here pro).
To run the job on a dedicated, fixed partition, use for example:
qpace-sub --part=4x4x1_05_00 -q pro my-script.bash
This will run the job on the partition named 4x4x1_05_00, which is by convention a 4-by-4-by-1 partition starting at nodecard 00 in backplane 05. Obviously, the job will not (and cannot) be started if the partition is not available via the specified queue.
IMPORTANT: Note that you must submit the job from a subdirectory under /work/ in order to run QPACE jobs. Otherwise, your job's output to stderr/stdout will be lost.
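A typical submission therefore starts from a working directory under /work/. A minimal sketch (the project path is a hypothetical example):
cd /work/$USER/my_project
qpace-sub --topo=4x4x1 -q pro my-script.bash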
A more elaborate job script could look like this:
#!/bin/bash
#
#PBS -N my_job_name
#PBS -j eo
#PBS -v MYVAR=myvalue
#PBS -l walltime=48:00:00
#PBS -q dev
#
qpace-exec my-tool
Every line in the submit script that starts with #PBS is evaluated as a command line option to the tool qsub, which belongs to the TORQUE resource manager. Please do not use qsub directly; always use qpace-sub. Basically, you can use any command line option that qsub supports, but only a few of them are useful on QPACE.
In the above example:
- -j eo means that standard output and standard error will be merged into a single file.
- -v MYVAR=myvalue means that the environment variable MYVAR will be exported to the job with the value myvalue.
- -l walltime=48:00:00 requests a walltime of 48 hours.
- -q dev selects the development queue (see below).
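Other standard qsub options can be added in the same way. As one hedged example: with -j eo the merged stream ends up in the stderr file, while the counterpart -j oe sends it to the stdout file, whose location can then be set with the standard qsub option -o (the log path below is a hypothetical example):
# Merge both streams into the stdout file and place it under /work/:
#PBS -j oe
#PBS -o /work/myuser/logs/my_job_name.out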
Job control
You can control your jobs after submission. In order to see which jobs are scheduled, just use
qstat
or, for a more verbose output,
qstat -f
In case you change your mind and want to delete an already scheduled job, you can use qdel with the job ID reported by qstat:
qdel job_id
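A typical sequence might look like this (the job ID 1234 is a made-up example; the ID appears in the first column of the qstat listing and may also be given in its full form, e.g. 1234.servername):
# List your queued and running jobs:
qstat
# Delete the job with ID 1234:
qdel 1234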
Queues
The QPACE TORQUE resource manager is configured for two types of jobs: production and development. In order to run a job, you have to specify the type of the job by selecting the appropriate queue: "pro" for production runs and "dev" for development runs. This can be done in two ways:
- Add the following line to your job script:
#PBS -q dev
- Add "
-q dev
" to the qpace-sub cmdline. For examples, see above.
Please note the following:
- Jobs will not be scheduled if no queue is selected.
- Development jobs are limited to one backplane (BP).
- Development jobs can only be run at the Wuppertal site.
- In order to be allowed to start production runs, additional measures have to be taken. Please contact one of the administrators.
- Development runs are limited to a walltime of one week; the default setting is 24 hours. If you need development jobs longer than that, you have to specify the desired walltime as shown above (provided it is less than one week); see the sketch after this list.
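As an example, a sketch of a development job header that requests five days of walltime (within the one-week limit; job and binary names are placeholders):
#!/bin/bash
#PBS -N long_dev_job
#PBS -q dev
#PBS -l walltime=120:00:00
qpace-exec my-tool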
Email updates
It is possible to send mails from jupace.fz-juelich.de and qlogin.qpace-uni-wuppertal.de via sendmail. This is especially useful for status updates of batch jobs.
To receive TORQUE status emails, add the following line to your job script:
#PBS -m abe
-m enables sending mails, whereas the letters a, b, and e specify on which occasions mails will be sent: b when a job begins, e after a job ends, and a when a job is aborted.
Mail will be sent to USER@qloginJ.qpace (or USER@qloginW.qpace). To actually receive mail from outside QPACE, use either of the following methods (assuming your mail address is user@domain.com):
- Add the following line to your job script:
#PBS -M user@domain.com
- Generate a file named .forward in your $HOME directory, containing your mail address (a sketch follows this list).
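A minimal sketch of the .forward method (user@domain.com stands in for your real address):
# Forward all local mail, including the TORQUE status mails, to an
# external address:
echo "user@domain.com" > $HOME/.forward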
Supported topologies
The QPACE queuing system can support the following topologies on any installation:
- 1x1x1
- 1x1x2
- 1x1x4
- 1x1x8
- 1x2x4
- 1x2x8
- 1x4x4
- 1x4x8
Installations with a one-rack configuration (e.g. Wuppertal) may also support these topologies:
- 2x4x4
- 2x4x8
- 2x8x4
- 2x8x8
- 2x12x4
- 2x12x8
- 2x16x4
- 2x16x8
Installations with a two-rack configuration (Wuppertal and currently also Jülich) can also support these topologies:
- 4x4x4
- 4x4x8
- 4x8x4
- 4x8x8
- 4x12x4
- 4x12x8
- 4x16x4
- 4x16x8
And finally installations with a four-rack configuration (currently none, but maybe Jülich in the future) may support these topologies:
- 8x4x4
- 8x4x8
- 8x8x4
- 8x8x8
- 8x12x4
- 8x12x8
- 8x16x4
- 8x16x8
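As a final sketch, submitting to one of the larger topologies works exactly as in the examples above; whether a matching partition exists depends on the installation:
qpace-sub --topo=4x8x8 -q pro my-script.bash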