HPC and Data for Lattice QCD

Trouble-Shooting and Debugging of User Program

Problem: Exception during program execution

In case of an exception all error messages should be preserved. In particular, the following information is important:
  1. Exception Type (see below)
  2. PMA Value (see below)
  3. OS type and flags
Check also:
  • Whether jinit0.jex has been run before starting the program (this initializes all Jn memories to the well-defined value 0); if not, repeat the run.
  • Whether you have opened communications along a direction which is invalid for your machine configuration.

How to understand the type of exception:

When running caos a file "Exception.log" is created, where the details of the exception status are reported. They are transformed into more human-readable format by the command
	exview Exception.log
Note: If a masked exception was encountered, the corresponding bit remains set (and hence is shown in the exception message when an unmasked exception is encountered). For instance, if the generation of denormalized numbers ("LutDen" and "AluDenOut") is masked by -j 0x88 and another exception, like "AluOvfl", is encountered, the "Exception.log" will also show "AluDenOut" if such a denormalized result has occurred, even though it did not generate an exception.
Jn Exceptions which typically can be produced by programming errors are (see also Jn Registers):
   00000001 	BndEx	  Invalid Jn Data Memory Address
   00000004	LutEx	  Overflow of a LUT operation
   00000008	LutDen	  Denormalized LUT result (usually masked)
   00000010	AluBadOp  Bad FILU Operand
   00000020	AluOvfl	  FILU Overflow
   00000080	AluDenOut Denormalized FILU result (usually masked)
   00000100	AluDenOp  Denormalized FILU operand 
   00400000	LaguEx	  Invalid local address
Note that the exceptions LutDen and AluDenOut should usually be masked ("-j 0x88"). These exceptions indicate that an arithmetic operation produced a very small number (magnitude < 1.175e-38 in single precision, which is called "denormalized" in the IEEE format) that has been rounded to 0 (because APE treats all denormalized results as 0).
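The rounding of denormalized results to 0 can be illustrated with a small Python model. This is only a sketch: it assumes the 32-bit IEEE single-precision format, whose smallest normalized value is 2^-126 (about 1.18e-38), and the helper names to_float32, is_denormal and flush_to_zero are made up for this illustration and are not part of any APE tool.

```python
import struct

# Smallest positive *normalized* IEEE single-precision value (bit pattern 0x00800000).
FLT_MIN_NORMAL = struct.unpack(">f", struct.pack(">I", 0x00800000))[0]

def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def is_denormal(x):
    """True if x, rounded to single precision, is non-zero but below the normalized range."""
    y = abs(to_float32(x))
    return 0.0 < y < FLT_MIN_NORMAL

def flush_to_zero(x):
    """Toy model of the behaviour described above: denormalized results become 0."""
    y = to_float32(x)
    return 0.0 if is_denormal(y) else y

print(is_denormal(1.0e-40))    # True
print(flush_to_zero(1.0e-40))  # 0.0
print(is_denormal(1.0e-30))    # False
```

A result such as 1.0e-40 would therefore set LutDen or AluDenOut (depending on the unit that produced it) and be replaced by 0, whereas 1.0e-30 is still a normalized number.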
All other Jn Exceptions are most likely due to a compiler or HW problem and should always be communicated to experts.
Tz Exceptions which typically can be produced by programming errors are (see also Tz Registers):
   00800000	DMBndEx	  Invalid Tz Data Memory Address
   01000000	AddOvf	  Adder Overflow of the ALU
   02000000	MulOvf	  Multiplier Overflow of the ALU
   04000000	ShOvf	  Shifter Overflow of the ALU
   08000000	DivOvf	  Division Overflow of the ALU
   10000000	DivBy0	  Division by zero of the ALU
   20000000	AguOvf	  Overflow of the AGU
All other Tz Exceptions are most likely due to a compiler or HW problem and should always be communicated to experts.
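The codes in the two tables are single-bit values, so a status word can be split into named exceptions with bitwise tests. The following Python sketch is not a replacement for exview; it assumes that the exception status is available as one integer per processor type, and the function names decode and mask_for are hypothetical.

```python
# Codes copied from the Jn and Tz tables above.
JN_EXCEPTIONS = {
    0x00000001: "BndEx",
    0x00000004: "LutEx",
    0x00000008: "LutDen",
    0x00000010: "AluBadOp",
    0x00000020: "AluOvfl",
    0x00000080: "AluDenOut",
    0x00000100: "AluDenOp",
    0x00400000: "LaguEx",
}
TZ_EXCEPTIONS = {
    0x00800000: "DMBndEx",
    0x01000000: "AddOvf",
    0x02000000: "MulOvf",
    0x04000000: "ShOvf",
    0x08000000: "DivOvf",
    0x10000000: "DivBy0",
    0x20000000: "AguOvf",
}

def decode(status, table):
    """Return (names of listed exceptions set in status, remaining unlisted bits)."""
    names = [name for bit, name in sorted(table.items()) if status & bit]
    known = 0
    for bit in table:
        known |= bit
    return names, status & ~known

def mask_for(names, table):
    """Combine the codes of the given exception names into one mask value."""
    by_name = {name: bit for bit, name in table.items()}
    mask = 0
    for name in names:
        mask |= by_name[name]
    return mask

print(hex(mask_for(["LutDen", "AluDenOut"], JN_EXCEPTIONS)))  # 0x88
print(decode(0xA0, JN_EXCEPTIONS))  # (['AluOvfl', 'AluDenOut'], 0)
```

The first call reproduces the usual mask 0x88. The second corresponds to the situation described in the note on masked exceptions: an AluOvfl exception with the masked AluDenOut bit still set. Any non-zero second element of the returned pair marks bits that are not listed in the tables and should be reported to experts.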

How to locate the exception in the user program

The PMA tells where in the program the exception was generated. The script ~simma/public/bin/pmtrace may help to recover from the PMA the relevant sections in the TAO and assembly source code if the program has been compiled with rtc. The pedestrian (and sometimes more instructive) way is as follows, where we assume that you have an exception at PMA 0x000XYZ on program.jex:
  1. Perform
            disjex program.jex | less
    
    (which gives you an ASCII dump of the microcode).
  2. Search for a line starting with "XYZ:" (which gives you the microcode line corresponding to the PMA, where the exception happened)
  3. Search upward (towards lower hexadecimal PMAs in the first column) for the first line of the form
         ">>>>>>>>>      LABEL 0x0000LLLL (NNN)   <<<<<<<<<"
    
    (which tells you the number of the label by which the section starts). If the hexadecimal label number 0xLLLL is less than 0x1000, continue searching upward until you have found the first label bigger than 0x1000.
  4. Inspect the jasm file of your program:
    • If the jasm was generated by rtc, search a line of the form: "LABEL *0xLLLL"
    • If the jasm was generated by xtc, search a line of the form: "LABEL *<0xHHH+XTC_FIRST_LABEL>" where the hexadecimal number 0xHHH is computed as 0xHHH = 0xLLLL - 0x1000 (without leading zeroes in 0xHHH, i.e. 0x123 instead of 0x00000123)
    This label mark is the beginning of the section, in which the exception happened. Provided you are fortunate and NOT using /include's, you should find some jasm comment starting with "!! ---" which shows the original TAO source line. Otherwise, you can try to meditate over the assembly code in order to derive the original TAO code from it (not really recommended :-)
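Steps 2 to 4 above can be sketched in Python, working on the text produced by disjex. This is only an illustration under assumptions: the exact layout of the dump is not specified here, so the sketch assumes that microcode lines begin with a hexadecimal PMA followed by a colon and that label lines contain "LABEL 0x..." as shown in step 3. The function names and the sample dump are made up for the example.

```python
import re

# Assumed line formats (see the hedges in the text above).
PMA_RE = re.compile(r"\s*([0-9A-Fa-f]+):")
LABEL_RE = re.compile(r">+\s*LABEL\s+0x([0-9A-Fa-f]+)")

def find_section_label(dump_lines, pma):
    """Return the label number of the section containing pma, or None."""
    # Step 2: find the microcode line whose leading hex field equals the PMA.
    hit = None
    for i, line in enumerate(dump_lines):
        m = PMA_RE.match(line)
        if m and int(m.group(1), 16) == pma:
            hit = i
            break
    if hit is None:
        return None
    # Step 3: search upward, skipping labels below 0x1000.
    for line in reversed(dump_lines[:hit]):
        m = LABEL_RE.search(line)
        if m and int(m.group(1), 16) >= 0x1000:
            return int(m.group(1), 16)
    return None

def jasm_search_strings(label):
    """Step 4: strings to search for in the jasm file, for rtc and for xtc."""
    return {
        "rtc": "LABEL *0x%x" % label,
        "xtc": "LABEL *<0x%x+XTC_FIRST_LABEL>" % (label - 0x1000),
    }

# Invented sample dump, only to exercise the functions.
sample = [
    ">>>>>>>>>      LABEL 0x00001002 (12)   <<<<<<<<<",
    "0a0: ...",
    ">>>>>>>>>      LABEL 0x00000010 (3)   <<<<<<<<<",
    "0a1: ...",
    "0a2: ...",
]
label = find_section_label(sample, 0x0000A2)
print(hex(label))                  # 0x1002
print(jasm_search_strings(label))
```

In practice the list of lines would come from the saved output of disjex program.jex. The number of hexadecimal digits that rtc writes after "LABEL *" is an assumption here, so a looser search may be needed.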

Problems due to the use of uninitialized variables

One of the most subtle reasons for various kinds of exceptions may be the access to an uninitialized variable (either due to a compiler bug or, more likely, due to a programming error). In the worst case, this means that your program finishes without exception but gives incorrect results. In the best case, you will get some kind of exception, which may be:
  • On Tz due to integer arithmetic using a wrong integer value
  • On Jn due to the use of wrong floating-point data (either because you access memory at a wrong address computed from uninitialized and hence wrong integer values, or because you access, at the correct address, FP data that has never been initialized).

Case 1: Your code accesses uninitialized integer variables (on Tz):

To identify such situations, it may help to run the program twice: once after executing tinit0.jex (which initializes all Tz memory to 0), and once after executing tinitf.jex (which initializes all Tz memory to -1). The behaviour (exceptions, results) is likely to differ between the two runs if the program accesses uninitialized Tz variables.
Unfortunately, no method is 100% guaranteed to find every use of uninitialized integer variables in your code.
Warning: If you forget to run tinit0.jex or tinitf.jex before your program, the integer variables on Tz have undefined values. Depending on the programs that ran before (e.g. different programs on different boards or units), the uninitialized variables may then have different values on different Tz's and cause extremely abnormal behaviour of the machine.
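The idea behind the two runs can be shown with a toy model in Python. It does not run anything on the machine: the list passed to each function stands for Tz memory as left behind by tinit0.jex (all 0) or tinitf.jex (all -1), and the three "programs" are invented examples.

```python
def buggy(memory):
    memory[0] = 3
    # BUG: memory[1] is read but was never written by the program.
    return sum(range(memory[0])) + memory[1]

def correct(memory):
    memory[0] = 3
    memory[1] = 5
    return sum(range(memory[0])) + memory[1]

def hidden_bug(memory):
    # Reads the uninitialized memory[1], but the value cancels out.
    return memory[1] * 0

def depends_on_fill(program, size=4):
    """Run the toy program on 0-filled and on -1-filled memory and compare results."""
    result_after_0 = program([0] * size)
    result_after_f = program([-1] * size)
    return result_after_0 != result_after_f

print(depends_on_fill(buggy))       # True
print(depends_on_fill(correct))     # False
print(depends_on_fill(hidden_bug))  # False
```

The last case shows why the comparison is not a guarantee: an uninitialized value may be read without changing the result for these two particular fill patterns.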

Case 2: Your code accesses uninitialized floating-point variables (on Jn):

Such situations can often be detected by an exception like AluBadOp or LutEx when you run jinitf.jex (which initializes all Jn memory to NaN) before your program, instead of jinit0.jex (which initializes all Jn memory to 0). Unfortunately, this method does not always work: after running jinitf.jex you might get an AluBadOp exception even in a correct program (the compiler does not presently guarantee that a NaN from memory cannot arrive at the FP unit, even if that address is never accessed as user data).
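Why a NaN fill exposes such errors while a 0 fill can hide them is illustrated by the following toy model in Python. It is only a sketch: the list stands for Jn memory after jinit0.jex or jinitf.jex, the function buggy_sum is an invented example, and Python silently propagates the NaN where the hardware would raise an exception.

```python
import math

def buggy_sum(fill):
    """Toy program: fills three of four memory words, then sums all four."""
    memory = [fill] * 4            # state left behind by the init program
    for i in range(3):
        memory[i] = float(i + 1)   # writes 1.0, 2.0, 3.0
    return sum(memory)             # BUG: also reads memory[3]

print(buggy_sum(0.0))                       # 6.0
print(math.isnan(buggy_sum(float("nan"))))  # True
```

With the 0 fill the result looks plausible and the error goes unnoticed; with the NaN fill the uninitialized word contaminates the result.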