HW Counters
apeNEXT runtime Counter HOWTO, October 2005 (author: Nils Christian)
The apeNEXT system provides several counters that can help you optimize your code. This HOWTO gives a quick overview of these counters. All counters are in units of clock ticks.
- clk
- Number of ticks during which the processor was not in I2C mode, i.e. the processor was running the program and not, for example, waiting for I/O to finish. The processor may still have been waiting for something else (see below).
- run
- Ticks during which the processor was neither in I2C mode nor stalled.
- mbusy, mwait
- These two counters give the ticks during which the processor was waiting for memory. You may be able to decrease mwait by using prefetching instead of explicit local memory access; this may, however, increase qwait.
- qwait
- Number of ticks during which the processor was waiting for the queue. All network traffic and, by default, all memory accesses (unless explicitly marked as local) are handled via the queue. This value is likely to increase while the processor waits for data to become available via the network. A first optimization here is to use prefetching, which can also help for local memory access.
- nbusy
- Clock ticks during which the processor was waiting to be able to send something via the network.
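Since run counts a subset of the ticks counted by clk, the difference clk - run is the number of ticks the processor spent stalled. The following C sketch computes that difference and expresses any counter as a share of clk. It is only an illustration in plain C: the struct `tick_counts` and the helpers `stall_ticks` and `share_of_clk` are hypothetical names, not part of nlibc, and the 64-bit width is an assumption. This HOWTO does not state whether mwait, qwait and nbusy add up to the stalled ticks, so the sketch reports each share separately.

```c
#include <assert.h>
#include <stdint.h>

/* Counter values for one measured region, in clock ticks.
   Field names mirror the counters above; the struct itself is hypothetical. */
typedef struct {
    uint64_t clk, run, mbusy, mwait, qwait, nbusy;
} tick_counts;

/* Ticks in which the processor was outside I2C mode but stalled. */
static uint64_t stall_ticks(const tick_counts *c)
{
    assert(c->run <= c->clk);   /* run counts a subset of the clk ticks */
    return c->clk - c->run;
}

/* Share of the clk ticks taken by a given tick count (0.0 to 1.0). */
static double share_of_clk(const tick_counts *c, uint64_t ticks)
{
    assert(c->clk > 0);         /* an empty measurement has no shares */
    return (double)ticks / (double)c->clk;
}
```

For example, `share_of_clk(&c, c.qwait)` gives the fraction of the measured ticks spent waiting for the queue, and `share_of_clk(&c, stall_ticks(&c))` gives the overall stalled fraction.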
When using these counters to judge the performance of your program, always either reset the counters at program start or look at the difference between counter values taken at the beginning and end of the measured region, since program startup can change these counters massively. Also make sure that you do not use I/O functions in the measured period, since these also make heavy use of memory and network.
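The difference approach can be sketched in plain C as follows. This is a minimal illustration, not the nlibc interface: `counter_snapshot` and `counter_delta` are hypothetical names, reading the counters from the hardware is deliberately left out, and the sketch assumes that the counters only count upwards and do not wrap between the two readings.

```c
#include <assert.h>
#include <stdint.h>

/* One reading of all six counters (hypothetical layout, not the nlibc type). */
typedef struct {
    uint64_t clk, run, mbusy, mwait, qwait, nbusy;
} counter_snapshot;

/* Per-counter difference stop - start: the ticks spent in the measured region. */
static counter_snapshot counter_delta(const counter_snapshot *start,
                                      const counter_snapshot *stop)
{
    counter_snapshot d;

    /* Assumption: counters are monotonic and did not wrap in between. */
    assert(stop->clk >= start->clk);

    d.clk   = stop->clk   - start->clk;
    d.run   = stop->run   - start->run;
    d.mbusy = stop->mbusy - start->mbusy;
    d.mwait = stop->mwait - start->mwait;
    d.qwait = stop->qwait - start->qwait;
    d.nbusy = stop->nbusy - start->nbusy;
    return d;
}
```

Take one snapshot immediately before and one immediately after the code of interest, with no I/O calls in between, and evaluate only the resulting differences.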
For accessing these counters in C, please have a look at the nlibc documentation.
For using the counters in TAO, here is a small code snippet:
    /include <confregs>
    /NUM=100

    begin main
      clockcnt startcnt, stopcnt
      complex z1[NUM], z2[NUM]
      physreg complex z3
      localint ttclk, ttmbusy, ttmwait, ttnbusy, ttqwait, ttrun
      physreg real ir

      !! initialize array
      ir=0.
      do i=0, NUM
        z1[i]=(1.+ir, 1.+1./(1.+ir))
        z2[i]=(2.+ir, 2.+1./(1.+ir))
        ir=ir+1.
      enddo

      get clockcnt in startcnt

      z3=(0.,0.)
      begin cache
      do i=0, NUM
        z3=z3 + z1[i]~*z2[i];
      enddo
      end cache

      get clockcnt in stopcnt

      ttclk=clk(stopcnt)-clk(startcnt)
      ttmbusy=mbusy(stopcnt)-mbusy(startcnt)
      ttmwait=mwait(stopcnt)-mwait(startcnt)
      ttnbusy=nbusy(stopcnt)-nbusy(startcnt)
      ttqwait=qwait(stopcnt)-qwait(startcnt)
      ttrun=run(stopcnt)-run(startcnt)

      write " clk ",ttclk," run ", ttrun," mbusy ", ttmbusy," mwait ", ...
            ttmwait," nbusy ", ttnbusy, " qwait ", ttqwait, "\n"
    end main