HPC and Data for Lattice QCD

multi-unit

How to run jobs on multiple units ...

... or, more generally: how to run jobs with the distributed krun/caos. There are several collections of scripts to do the job. The 'New' ones are preferred.

New scripts:

It boils down to the following:
  -  log into 'apemaster'
  -  start the unit daemons with appropriate options   --> StartUnitds
  -  run the .jex file                                 --> APErun
  -  kill the unit daemons                             --> KillUnitds
The scripts are located in /zroot/multi_unit/, so make sure that directory is in your PATH (it should be there by default). There is actually only one script, with links to it under different names. It accepts a number of command line options to specify the debug level, geometry, etc. Use the '-h' option to get a list.
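
The PATH requirement above can be checked with a short sketch like the following. The function name ensure_in_path and the SCRIPT_DIR variable are made up for illustration; only the directory and the command names come from the text.

```shell
#!/bin/sh
# Sketch: make sure the multi-unit scripts are reachable via PATH.
# SCRIPT_DIR is a hypothetical variable; its default is the directory named in the text.
SCRIPT_DIR="${SCRIPT_DIR:-/zroot/multi_unit}"

# ensure_in_path DIR: append DIR to PATH unless it is already listed.
ensure_in_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                        # already present, nothing to do
        *) PATH="$PATH:$1"; export PATH ;;
    esac
}

ensure_in_path "$SCRIPT_DIR"

# Report which of the documented command names resolve.
for cmd in StartUnitds APErun KillUnitds RestartUnitds Unlock; do
    if command -v "$cmd" >/dev/null 2>&1; then
        echo "$cmd: found"
    else
        echo "$cmd: not found in PATH"
    fi
done
```

Once the names resolve, running any of them with '-h' lists the available options, as noted above.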

Sample session:

    StartUnitds -crate 1 -logdir `pwd` 
    APErun -crate 1 -- -H big-write-xtc.jex
    KillUnitds -crate 1
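
The same three steps can be wrapped in a small function so that the unit daemons are stopped even when the run fails. This is only a sketch: run_on_crate, run and DRY_RUN are hypothetical names, the options are copied unchanged from the sample session, and it assumes that the commands return a non-zero exit status on failure.

```shell
#!/bin/sh
# Sketch: run one .jex file on one crate and stop the unit daemons afterwards.
# Set DRY_RUN=1 to print the commands instead of executing them.

# run CMD [ARGS...]: execute the command, or just print it in dry-run mode.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# run_on_crate CRATE JEXFILE
run_on_crate() {
    crate=$1
    jex=$2
    # If the daemons cannot be started (e.g. someone else holds the lock),
    # stop here and do not kill anything.
    run StartUnitds -crate "$crate" -logdir "$(pwd)" || return 1
    run APErun -crate "$crate" -- -H "$jex"
    status=$?
    # Stop the daemons whether or not the run succeeded.
    run KillUnitds -crate "$crate"
    return "$status"
}
```

For example, after setting DRY_RUN=1, calling `run_on_crate 1 big-write-xtc.jex` prints the three commands of the sample session without touching the units.
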
These commands use the file /zroot/HostConfig to determine the hostnames of the unit pcs.

Other available commands: RestartUnitds, Unlock

Locking: Both APErun and StartUnitds write lock files to /apeshare/locks/ in order to prevent someone else from doing the same. The locks are removed at the end of the run and by running KillUnitds. After a crash they are usually left behind; use the Unlock command in such cases.

caos: To use CAOS instead of KRUN, pass the -caos option to StartUnitds and APErun.

kome: Listed as a command line option, but not supported yet. mico is the default.

IF YOU'RE NOT ROOT:
   The scripts work only if you're allowed to access the unit pcs ==> Have a
   superuser add you to the 'sudoers' file (/etc/sudoers on apemaster).
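
Before using Unlock after a crash, it can help to see which lock files are actually present and who owns them. The sketch below only lists the files in the lock directory mentioned above; list_locks and LOCKDIR are hypothetical names, and nothing is assumed about how the lock files are named.

```shell
#!/bin/sh
# Sketch: list lock files so a leftover lock can be spotted before running Unlock.
# LOCKDIR is a hypothetical variable; its default is the directory named in the text.
LOCKDIR="${LOCKDIR:-/apeshare/locks}"

# list_locks: print one "ls -l" line per regular file in LOCKDIR,
# or a short note if there are none.
list_locks() {
    found=0
    for f in "$LOCKDIR"/*; do
        [ -f "$f" ] || continue      # skips the unmatched glob and non-files
        found=1
        ls -l "$f"                   # shows owner and modification time
    done
    [ "$found" -eq 1 ] || echo "no lock files in $LOCKDIR"
}

list_locks
```

A lock whose owner is no longer running anything is a candidate for Unlock; a lock owned by someone with an active run should be left alone.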

Old scripts: (Please ignore!)

  -  log into 'apemaster' as root
  -  cd /zroot/multi-unit

  -  make sure the units are empty, e.g. run 'lsmod' on the units and
     check if the 'use count' of plddrv is 0.
  -  adjust 'env.z' to contain a current OSUNITLIST and HOSTLIST
  -  adjust 'unitd' so it sets OSROOTMASKAU correctly
     See examples there.
     This is setting the unit mask register of the root board, so it's
     used by the unit pc that's controlling the root board.
  -  In your run*.cmd script, make sure that the setting of OSMASTERUID
     reflects the number of the unit that controls the Root Board.
  -  Start the unit daemons on all required apepcs : 
       ./startallunitd.cmd [version]
     [version] is optional and indicates a program 
     version in this directory. As of this writing there
     are 'try54' (from Rome, not working ? ) and 'noe1'
     (compiled here, works occasionally ;-)
     default is 'noe1' (see env.z)
     Also, by default, this will start the CORBA (mico) version of the
     unit daemons.
  -  run the program : 
       ./runonecrate.cmd jexfile [version] [options] 
     or 
       ./runoneunit.cmd  jexfile [version] [options]

     This uses a script that sets a number of OS* variables.
     Check and adjust if necessary. For a different geometry make
     sure to also set the correct '-o' option to 'krun'.
  -  kill or restart all unit daemons (suggested after every run)
        ./killallunitd.cmd [version]
     or
        ./restartallunitd.cmd [version]
  -  Make sure the units are empty if you don't need them anymore.

  -  Make sure to specify the correct version for the kill-, start-, restart-
     and run scripts.
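
The "units are empty" check from the list above can be sketched as a small filter on lsmod output. The helper name use_count is made up for illustration; it assumes the usual lsmod layout, where the third column is the use count.

```shell
#!/bin/sh
# use_count MODULE: read lsmod-style text on stdin and print the third
# column (the use count) for MODULE, or "not loaded" if it is absent.
use_count() {
    awk -v mod="$1" '
        $1 == mod { print $3; found = 1 }
        END       { if (!found) print "not loaded" }
    '
}

# Typical call on a unit pc:
#   lsmod | use_count plddrv
```

A printed value of 0 means that nothing is using plddrv on that unit.
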
The logfiles of the unit daemons are written to their local /data/multi_unit_logs/ disk, which is visible to apemaster under /nfs/apepc*/data/multi_unit_logs/. For speed it is desirable to write them to a private disk of the apepcs. The location is determined by setting LOGS in 'unitd'.