HPC

HPC and Data for Lattice QCD

Batch

Queue related information

Jobs on the APE systems in Zeuthen are submitted using the SGE queueing system. The environment settings are provided by sourcing
  /afs/ifh.de/group/ape/nroot/nlogin.csh
or
  /afs/ifh.de/group/ape/nroot/nlogin.sh

Scheduler

The scheduler is responsible for deciding which job to run next. It first selects a project based on recent usage (which is accumulated and decayed with a half-life of currently 600 hours). It then selects a job within that project, taking into account the job share (see Priorities below). If job shares are equal, the oldest job is selected.
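To illustrate the 600-hour half-life, the decay of accumulated usage can be sketched as follows. This is an illustration of the formula only, not the scheduler's actual implementation, and the numbers are made up:

```shell
# Illustration only: exponential decay of accumulated project usage
# with a 600-hour half-life (values are hypothetical).
usage=1000   # hypothetical accumulated usage of a project
hours=600    # time elapsed since that usage was recorded
awk -v u="$usage" -v t="$hours" \
    'BEGIN { printf "%.1f\n", u * 2^(-t/600) }'
# after one half-life (600 h) the effective usage is halved
```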

Common commands

Check queues

    qs              # default job display

Check accounting

    qshowacct -a
Shows the accounting information for all projects.

Submit a job

    qsub -pe <PEname> <PEsize> -P <Project> <script>
To keep separate accounting for crate (or rack), unit, and board jobs, <Project> as given above is only valid for crate/rack jobs and has to be changed to
    <Project>_u for unit jobs
    <Project>_b for board jobs
To keep accounting for APEmille and apeNEXT separate, prefix all apeNEXT jobs with an 'n'.
One board queue of APEmille is configured with a time limit of 0.5 hours on Monday-Friday from 7 am until 9 pm; otherwise the time limit is 2 hours. Users who want to submit board jobs running for more than 0.5 hours should specify the maximum run time in their script using the qsub option -l h_rt=h:mm:ss, e.g. qsub -l h_rt=1:15:00 for board jobs running up to 1.25 hours.
Notes
  • Specification of a valid project name by option "-P" or by a corresponding entry in a file $HOME/.grd_request is mandatory!
  • Job dependencies can be specified by adding
       -hold_jid <jobid>
    where <jobid> is the id of a previous job on whose completion the current one depends.
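Putting these rules together, a submission might look like this. The project name, PE size, and job id below are made up for illustration:

```shell
# Hypothetical example: submit an APEmille board job for project "qcdsf"
# (board accounting via the _b suffix), requesting 1.25 h of run time.
qsub -pe b830 1 -P qcdsf_b -l h_rt=1:15:00 run_board.sh

# A follow-up job that starts only after job 4711 has completed
# (4711 is a made-up job id).
qsub -pe b830 1 -P qcdsf_b -hold_jid 4711 analyse.sh
```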

Delete a job

    qdel <JOB-ID>     (may take a while)
See manpages for qsub, qdel, qstat.

Change job

    qalter <options> <job_id>
This allows, for instance, changing the requested parallel environment of a previously submitted job (e.g.: qalter -pe u00 4 <jobid>)

Current configuration

For SGE, APE jobs run in so-called Parallel Environments (PEs) that manage one or more low-level batch queues. PEs are specified at job submission time (just like a queue specification), but include the "size" (measured in board-queues):
  qsub -pe c1 16 <jobscript>
  qsub -pe u* 4 <jobscript>
The second example uses a wildcard to request any PE whose name starts with 'u'. All PEs start the necessary daemons by themselves, also for non-root users. apeNEXT parallel environments have to be prefixed with 'n'.
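Combining the PE wildcards with the project-naming rules from above, submissions might look like this ("qcdsf" is a placeholder project name):

```shell
# APEmille: any unit PE (4 board-queues), unit accounting via the _u suffix.
qsub -pe u* 4 -P qcdsf_u myjob.sh

# apeNEXT: note the leading 'n' on both the PE name and the project.
qsub -pe nc* 16 -P nqcdsf myjob.sh
```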

PE Nomenclature

The geometry that is served by a PE is easily recognized by its name. Examples:
   u80       APEmille unit 0 in crate 8
   c3        Crate 3
   b830      Board 0 in APEunit 3 of crate 8
   nc1       Crate 1 of apeNEXT
   nb833     apeNEXT development board

Embedded flags

Options for qsub can be specified by embedded flags beginning with '#$' in the script (see below)
    #$ -q <queue_name>             ... specifies queue
    #$ -pe <pe_name> <n>           ... specifies PE and slots
    #$ -cwd                        ... assume current directory as default
    #$ -j y                        ... join stderr/stdout into one file
    #$ -o <name_of_stdout_file>    ... (output is appended!)
    #$ -e <name_of_stderr_file>    ...
    #$ -V                          ... pass the current environment setting to
                                       the job
    #$ -N <job_name>               ... sets name of job as it will
                                       appear in the queue

None of these options is required in the file, since all of them can also be specified on the command line; see the qsub manpage. NOTE that the executing shell is specified by a line such as
#!/bin/sh
as the first line of the script and NOT by using the '-S' option. SGE allows either variant (but not both), and we decided on the first one.
Priorities
To change the priority of a job you have to change its share relative to other jobs. The job share is set using the option
  -js <n>
Higher values of n mean higher priority. Only integers greater than or equal to zero are allowed.
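For example (the job id is made up; that -js can also be changed with qalter is assumed from standard SGE behaviour):

```shell
# Submit a job with an increased job share:
qsub -js 5 -pe c1 16 -P qcdsf myjob.sh

# Raise the share of an already queued job (4711 is a made-up job id):
qalter -js 5 4711
```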
Sample script for APEmille
#!/usr/bin/zsh
#$ -V
#$ -cwd
#$ -pe c* 16
#$ -P qcdsf
#$ -N myjob

source /nroot/nlogin.sh

APErun -- -o yz myprogram.jex || exit 1
echo -n "done"
Sample script for apeNEXT crate
#!/bin/tcsh
#
# request apeNEXT parallel environment for a crate
#$ -pe nc* 16
#
# usual project, but with 'n' in front for apeNEXT
#$ -P nalpha
#
# Execute job from current working directory
#$ -cwd
#
# send mail on beginning, end and abort
#$ -m abe

# create local directory (if it doesn't exist) and change working directory
test -d /data/$USER || mkdir /data/$USER/
cd /data/$USER/

# source environment for apeNEXT
source /nroot/nlogin.csh

nrun testJob.mem || exit 1

# at the end of the run produced data should be copied either to afs or directly to tape
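The copy step hinted at in the last comment might look like this (tcsh syntax, matching the script above; the file name and AFS destination are placeholders for your own data and scratch area):

```shell
# Sketch only: copy the produced data from the local /data partition
# to AFS scratch space (paths are placeholders).
cp /data/$USER/results.dat /afs/ifh.de/user/$USER/scratch/
```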
Notes
  • The apeNEXT testboard (PE nb833) has a hard time limit of 20 minutes during daytime, while during the night you can run for up to 3 hours. If you know that your job will run longer than 20 minutes, make sure you explicitly request the desired run time in your job script
    #$ -l h_rt=3:00:00
    
    Jobs running longer than the allowed time will be automatically killed without a warning.
  • The computer serving the apeNEXT testboard (PE nb833) does not have a local /data partition, so make sure you use a directory that is available on that computer (e.g. AFS scratch space).
  • The geometry does not have to be specified, it is determined automatically from the PE.
    Exception: The apeNEXT testboard (PE nb833) is not fully equipped with nodes, therefore you have to specify explicitly the geometry via
      -slice 2266 2377
    or, when running with nose (instead of dnose)
      -slice 266 377
    (The difference between nose and dnose is that nose accesses the hardware directly, while dnose uses intermediate hosts to do so. Since the hosts serving crates are not directly connected to the hardware, nose is only usable on the testboard.)
  • All APErun options after the "--" are passed to the APE operating system (see corresponding documentation HOWTO-caos). The most relevant options are:
    -o Z        ... open communication along Z (default is closed)
    -o Y        ... open communication along Y (default is closed)
    -o YZ       ... open communication along Y and Z
    -j 0x88     ... mask irrelevant DenOut exceptions of Jn
    -f 0x3202   ... use aggressive memory refresh rate
    
  • Specification of the machine configuration "-C ..." should be omitted for the APErun call, since the parallel environment knows its geometry.
  • You don't need to specify the '-H' option to perform a hard reset of the machine - this option is added automatically by APErun. To *avoid* a hard reset, add the -noreset option to APErun, e.g. APErun -noreset -- myprog.jex
  • Run 'APErun -h' to see all options. Note that the '-debug' option is valid only if krun is used.
  • Using a special version of CAOS: use the
     -v CAOS_VERSION=projects
    option to specify a specific variant of CAOS and its daemons. (For debugging only!)
  • Note how the shell is specified for that script. Don't rely on the -S option.
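Putting several of the options above together, an APErun call might look like this (the program name is a placeholder, and the exact option combination is an assumption based on the examples on this page):

```shell
# Hypothetical run: skip the hard reset, open communication along Y and Z,
# and mask irrelevant DenOut exceptions of Jn.
APErun -noreset -- -o YZ -j 0x88 myprog.jex
```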