HPC and Data for Lattice QCD

HW Counters

apeNEXT runtime Counter HOWTO, October 2005 (author: Nils Christian)

The apeNEXT system provides several counters that can help you optimize your code. This HOWTO gives a quick overview of these counters. All counters are in units of clock ticks.

clk
Number of ticks during which the processor was not in I2C mode, i.e. it was running the program rather than waiting for I/O to finish. The processor may still have been waiting for something else (see below).
run
Ticks during which the processor was neither in I2C mode nor stalling.
mbusy, mwait
These two counters give the ticks during which the processor was waiting for memory. You may be able to decrease mwait by using prefetching instead of explicit local memory access (this may, however, increase qwait).
qwait
Number of ticks during which the processor was waiting for the queue. All network traffic and, by default, all memory accesses (unless explicitly marked as local) are handled via the queue. This value is likely to increase while the processor waits for data to arrive over the network. A first optimization here is to use prefetching, which can also help for local memory access.
nbusy
Clock ticks during which the processor was waiting to be able to send something over the network.
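
To make the relation between the counters concrete, here is a minimal host-side sketch in C that expresses each counter as a fraction of clk and derives the stalled ticks as clk minus run, which follows from the definitions above. The struct `ape_counters` and the helper names are hypothetical and are not part of nlibc or TAO. The sketch does not assume that the individual wait counters add up to the stalled ticks.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one set of counter values, in clock ticks.
 * Field names follow the counter names in this HOWTO; the struct itself
 * is an illustration and not part of nlibc or TAO. */
typedef struct {
    unsigned long long clk, run, mbusy, mwait, qwait, nbusy;
} ape_counters;

/* Fraction of the non-I2C ticks (clk) accounted for by one counter.
 * Returns 0.0 for an empty interval so callers never divide by zero. */
static double tick_fraction(unsigned long long ticks, unsigned long long clk)
{
    return clk == 0 ? 0.0 : (double)ticks / (double)clk;
}

/* Ticks that were counted in clk (not in I2C mode) but not in run,
 * i.e. ticks during which the processor was stalling. */
static unsigned long long stall_ticks(const ape_counters *c)
{
    assert(c->run <= c->clk); /* run counts a subset of the clk ticks */
    return c->clk - c->run;
}

/* Print every counter as a percentage of clk. */
static void print_breakdown(const ape_counters *c)
{
    printf("run    %5.1f%%\n", 100.0 * tick_fraction(c->run, c->clk));
    printf("stall  %5.1f%%\n", 100.0 * tick_fraction(stall_ticks(c), c->clk));
    printf("mbusy  %5.1f%%\n", 100.0 * tick_fraction(c->mbusy, c->clk));
    printf("mwait  %5.1f%%\n", 100.0 * tick_fraction(c->mwait, c->clk));
    printf("qwait  %5.1f%%\n", 100.0 * tick_fraction(c->qwait, c->clk));
    printf("nbusy  %5.1f%%\n", 100.0 * tick_fraction(c->nbusy, c->clk));
}
```

A large stall share together with a large qwait share would point towards the prefetching advice given under qwait.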

When using these counters to judge the performance of your program, always either reset the counters at program start or look at the difference between counter values taken at the beginning and at the end of the measured region, since program startup can change these counters massively. Also make sure that you do not use I/O functions in the measured region, since these also make heavy use of memory and network.
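
The differencing approach can be sketched in C as follows. This is only an illustration of the arithmetic: `counter_snapshot` and `counters_elapsed` are hypothetical names, not the nlibc interface, and the sketch assumes that no counter overflows between the two snapshots.

```c
#include <assert.h>

/* Hypothetical snapshot of all counters at one point in time (clock ticks).
 * This is not the nlibc data type; see the nlibc documentation for that. */
typedef struct {
    unsigned long long clk, run, mbusy, mwait, qwait, nbusy;
} counter_snapshot;

/* Ticks accumulated between two snapshots. Startup cost cancels out
 * because it is contained in both the start and the stop values. */
static counter_snapshot counters_elapsed(const counter_snapshot *start,
                                         const counter_snapshot *stop)
{
    /* Assumption of this sketch: no counter wrapped around in between. */
    assert(stop->clk >= start->clk);

    counter_snapshot d;
    d.clk   = stop->clk   - start->clk;
    d.run   = stop->run   - start->run;
    d.mbusy = stop->mbusy - start->mbusy;
    d.mwait = stop->mwait - start->mwait;
    d.qwait = stop->qwait - start->qwait;
    d.nbusy = stop->nbusy - start->nbusy;
    return d;
}
```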

For accessing these counters in C, please have a look at the nlibc documentation.

For using the counters in TAO, here is a small code snippet:

/include <confregs>
/NUM=100
begin main
  clockcnt startcnt, stopcnt
  complex z1[NUM], z2[NUM]
  physreg complex z3
  localint ttclk, ttmbusy, ttmwait, ttnbusy, ttqwait, ttrun
  physreg real ir

  !! initialize array
  ir=0.
  do i=0, NUM
    z1[i]=(1.+ir, 1.+1./(1.+ir))
    z2[i]=(2.+ir, 2.+1./(1.+ir))
    ir=ir+1.
  enddo

  !! read the counters before the measured region
  get clockcnt in startcnt

  z3=(0.,0.)
  begin cache
    do i=0, NUM
      z3=z3 + z1[i]~*z2[i];
    enddo
  end cache

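  !! read the counters again after the measured region
  get clockcnt in stopcnt

  !! the differences give the ticks spent in the measured region only
  ttclk=clk(stopcnt)-clk(startcnt)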
  ttmbusy=mbusy(stopcnt)-mbusy(startcnt)
  ttmwait=mwait(stopcnt)-mwait(startcnt)
  ttnbusy=nbusy(stopcnt)-nbusy(startcnt)
  ttqwait=qwait(stopcnt)-qwait(startcnt)
  ttrun=run(stopcnt)-run(startcnt)
  write " clk ",ttclk,"  run ", ttrun,"  mbusy ", ttmbusy,"  mwait ", ...
        ttmwait,"  nbusy ", ttnbusy, "  qwait ", ttqwait, "\n"
end main
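
The line written by this snippet can be post-processed on a workstation. The following C sketch parses such a line, assuming that the values appear as whitespace-separated decimal integers in the order of the write statement above; the exact number formatting produced by TAO is an assumption here. The names `tao_counter_line` and `parse_counter_line` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Counter differences as printed by the TAO snippet, in print order. */
typedef struct {
    unsigned long long clk, run, mbusy, mwait, nbusy, qwait;
} tao_counter_line;

/* Parse one output line of the form
 *   " clk <n>  run <n>  mbusy <n>  mwait <n>  nbusy <n>  qwait <n>"
 * Returns 1 if all six values were read, 0 otherwise.
 * Assumption: values are plain decimal integers separated by whitespace. */
static int parse_counter_line(const char *line, tao_counter_line *out)
{
    assert(line != NULL && out != NULL);

    /* A blank in a scanf format matches any amount of whitespace. */
    int n = sscanf(line,
                   " clk %llu run %llu mbusy %llu mwait %llu nbusy %llu qwait %llu",
                   &out->clk, &out->run, &out->mbusy,
                   &out->mwait, &out->nbusy, &out->qwait);
    return n == 6;
}
```

Once parsed, the values can be compared between runs, for example before and after introducing prefetching.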