single-error
HPC and Data for Lattice QCD
# ----------------------------------------------------------
# Setting single error counters to (maxcount -1) for testing
# ----------------------------------------------------------
Tz:
single error counters:
----------------------
0x02 [0x0eeeeeee]   value set to (maxcount -1)
call krun with the option
-c, --set-tz-serc=VAL   sets the Tarzan single error counter before running the program; VAL is a 32-bit value, given in decimal or hex (0x...)
e.g.:
--set-tz-serc=0x0eeeeeee
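Since VAL may be given in decimal or hex, the same preset can be written either way: 0x0eeeeeee is 250539758 in decimal. The helper below is a hypothetical sketch, not part of krun: it normalizes either form, checks the 32-bit range and prints the option string. The only real piece it relies on is the --set-tz-serc=VAL syntax quoted above.

```sh
#!/bin/sh
# Hypothetical helper (not part of krun): normalize a Tarzan single error
# counter value given in decimal or hex (0x...) and print the krun option.
tz_serc_option() {
    # printf evaluates numeric arguments as C constants, so %d accepts
    # both decimal and 0x-prefixed hex input
    dec=$(printf '%d' "$1" 2>/dev/null) || {
        echo "invalid value: $1" >&2
        return 1
    }
    # --set-tz-serc takes a 32-bit value
    if [ "$dec" -lt 0 ] || [ "$dec" -gt 4294967295 ]; then
        echo "value out of 32-bit range: $1" >&2
        return 1
    fi
    # always emit the value as 8 hex digits
    printf '%s0x%08x\n' '--set-tz-serc=' "$dec"
}

tz_serc_option 0x0eeeeeee   # prints --set-tz-serc=0x0eeeeeee
tz_serc_option 250539758    # same value given in decimal
```

Printing the normalized hex form makes it easy to compare the option against the register values listed above.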
Jn:
single error counters:
----------------------
0x01 [0xeeeeeeee]   value set to (maxcount -1)
0x02 [0xfefefefe]   value set to (maxcount -1)
0x03 [0x......fe]   value set to (maxcount -1)
actions: (unit)
(1) prior to execution, run this script on the unit:
/zroot/tools/set-serr-unit
(2) start krun WITHOUT a hard reset (do not use -H)
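The two actions can be chained in a small wrapper. This is a sketch under stated assumptions: run_with_preset_serc, SET_SERR and KRUN are hypothetical names, set-serr-unit is simply invoked on the local host, and all krun arguments other than -H are passed through untouched because their syntax is not described here.

```sh
#!/bin/sh
# Hypothetical wrapper for the two actions above. SET_SERR and KRUN can
# be overridden (e.g. with true/echo) for a dry run.
SET_SERR=${SET_SERR:-/zroot/tools/set-serr-unit}
KRUN=${KRUN:-krun}

run_with_preset_serc() {
    # The note requires starting krun WITHOUT a hard reset (-H),
    # presumably because a hard reset would clear the preset counters.
    for arg in "$@"; do
        if [ "$arg" = "-H" ]; then
            echo "refusing to run krun with -H (hard reset)" >&2
            return 1
        fi
    done
    "$SET_SERR" || return 1   # action (1): preset counters to (maxcount -1)
    "$KRUN" "$@"              # action (2): start the program
}
```

Refusing -H up front keeps a mistyped command line from silently discarding the preset.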
files:
/zroot/tools/jsercnt.tac
# ----------------------------------------------------------
# SETTING and READING SERR-COUNTERS WITH CAOS
# ----------------------------------------------------------
Take the following CODINE script as an example:
#-----------------------------------------------------------
#!/usr/bin/zsh
#$ -q unit0_q
#$ -V -cwd
#$ -S /usr/bin/zsh
#$ -j y
#$ -N ubik-0
UNIT=00
APECAOS=APErun
PRG=u1-v3-unit-xtc-psk.jex
echo -n ".start : "
date
echo
# Step 1 - just initialize machine
$APECAOS -caos -unit $UNIT -- /zroot/tools/jinit0.jex
# Step 2 - set single error counters to (max-1)
$APECAOS -caos -unit $UNIT -- -p /zroot/tools/caos-serr.tac
# Step 3 - run your application *WITHOUT* hard reset
$APECAOS -caos -unit $UNIT -- -o z -j 0x88 $PRG >! out.u0.cmp.${JOB_ID}
echo
# Step 4 - read single error counters (serc) after application has finished
# serc's are read board-wise, so you have to cycle through all the
# boards involved (here it is a unit)
#
# script "serr-grep" expects "-coords ${CID}${UID}${BID}"
# in this example, the status is collected in a file "Serr.${JOB_ID}"
#
touch Serr.${JOB_ID}
for BOARD in 0 1 2 3
do /zroot/tools/serr-grep -coords ${UNIT}${BOARD} >> Serr.${JOB_ID}
done
echo -n ".end : "
date
#-----------------------------------------------------------
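The board loop in Step 4 builds the -coords argument by plain string concatenation of crate, unit and board ID. A minimal sketch of that convention, assuming single-digit IDs and four boards per unit as in the example; serc_coords and unit_coords are hypothetical helpers, not part of the CAOS tools:

```sh
#!/bin/sh
# Sketch of the "-coords ${CID}${UID}${BID}" convention, assuming every
# ID is a single digit and a unit holds boards 0..3.
serc_coords() {
    # $1 = crate ID, $2 = unit ID, $3 = board ID
    printf '%s%s%s\n' "$1" "$2" "$3"
}

unit_coords() {
    # $1 = crate ID, $2 = unit ID; prints one coords string per board
    for bid in 0 1 2 3; do
        serc_coords "$1" "$2" "$bid"
    done
}

# The boards covered by UNIT=00 in the CODINE script (crate 0, unit 0):
unit_coords 0 0
```

Each printed string is what the script hands to /zroot/tools/serr-grep -coords; with UNIT=00 the loop covers 000 to 003.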
to do:
(1) hook /zroot/tools/serr-grep into the CODINE .epilog files
(2) create a global "single error count database"
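One possible shape for such an epilog step, sketched under loud assumptions: it repeats the Step 4 loop and appends every line serr-grep prints, tagged with the job ID, a timestamp and the board coordinates, to a shared text file standing in for the planned database. UNIT, SERR_GREP, SERR_DB and collect_serc are hypothetical names; apart from reusing the JOB_ID variable the example script already relies on, nothing here reflects how CODINE actually passes job data to an epilog, and serr-grep's output is treated as opaque text.

```sh
#!/bin/sh
# Hypothetical epilog sketch: UNIT, SERR_GREP and SERR_DB are assumed
# settings, not existing CODINE variables.
UNIT=${UNIT:-00}
SERR_GREP=${SERR_GREP:-/zroot/tools/serr-grep}
SERR_DB=${SERR_DB:-./serr-db.txt}

collect_serc() {
    stamp=$(date '+%Y-%m-%dT%H:%M:%S')
    for board in 0 1 2 3; do
        # prefix each line serr-grep prints with job ID, time and coords
        "$SERR_GREP" -coords "${UNIT}${board}" |
            while IFS= read -r line; do
                printf '%s %s %s%s %s\n' "${JOB_ID:-unknown}" "$stamp" \
                    "$UNIT" "$board" "$line"
            done
    done >> "$SERR_DB"
}
```

The sketch does no locking, so concurrent epilogs appending to one shared file would need an extra lock or a per-job file merged later.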