debug
HPC and Data for Lattice QCD
Trouble-Shooting and Debugging of User Program
Problem: Exception during program execution
In case of an exception, all error messages should be preserved. In particular, the following information is important:
- Exception Type (see below)
- PMA Value (see below)
- OS type and flags
- Whether jinit0.jex was run before starting the program (it initializes all Jn memories to the well-defined value 0); if it was not, run it and repeat the run.
- Whether you have opened communications along a direction which is invalid for your machine configuration.
How to understand the type of exception:
When running caos, a file "Exception.log" is created in which the details of the exception status are reported. They are transformed into a more human-readable format by the command
exview Exception.log
Note: If a masked exception was encountered, the corresponding bit remains set (and hence is shown in the exception message when an unmasked exception is encountered). For instance, if the generation of denormalized numbers ("LutDen" and "AluDenOut") is masked by -j 0x88 and another exception, like "AluOvfl", is encountered, the "Exception.log" will also show "AluDenOut" if such a denormalized result has occurred, even though it did not generate an exception.
Jn Exceptions which typically can be produced by programming errors are (see also Jn Registers):
00000001  BndEx      Invalid Jn Data Memory Address
00000004  LutEx      Overflow of a LUT operation
00000008  LutDen     Denormalized LUT result (usually masked)
00000010  AluBadOp   Bad FILU Operand
00000020  AluOvfl    FILU Overflow
00000080  AluDenOut  Denormalized FILU result (usually masked)
00000100  AluDenOp   Denormalized FILU operand
00400000  LaguEx     Invalid local address
Note that the exceptions LutDen and AluDenOut should usually be masked ("-j 0x88"). These exceptions indicate that an arithmetic operation produced a very small number (< 1.75e-38, which is called "denormalized" in the IEEE format) that has been rounded to 0 (because APE treats all denormalized results as 0).
All other Jn Exceptions are most likely due to a compiler or HW problem and should always be communicated to experts.
Tz Exceptions which typically can be produced by programming errors are (see also Tz Registers):
00800000  DMBndEx  Invalid Tz Data Memory Address
01000000  AddOvf   Adder Overflow of the ALU
02000000  MulOvf   Multiplier Overflow of the ALU
04000000  ShOvf    Shifter Overflow of the ALU
08000000  DivOvf   Division Overflow of the ALU
10000000  DivBy0   Division by zero of the ALU
20000000  AguOvf   Overflow of the AGU
All other Tz Exceptions are most likely due to a compiler or HW problem and should always be communicated to experts.
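The bit values listed above can be combined and taken apart mechanically. The following is a minimal Python sketch, not a replacement for exview: it assumes that the exception status is available as a single integer and that the -j mask uses the same bit positions as the exception bits (which is consistent with 0x88 = 0x08 + 0x80 for LutDen and AluDenOut). The function names decode and split_by_mask are hypothetical helpers introduced here for illustration.

```python
# Bit values copied from the Jn and Tz exception lists of this page.
JN_EXCEPTIONS = {
    0x00000001: "BndEx",
    0x00000004: "LutEx",
    0x00000008: "LutDen",
    0x00000010: "AluBadOp",
    0x00000020: "AluOvfl",
    0x00000080: "AluDenOut",
    0x00000100: "AluDenOp",
    0x00400000: "LaguEx",
}
TZ_EXCEPTIONS = {
    0x00800000: "DMBndEx",
    0x01000000: "AddOvf",
    0x02000000: "MulOvf",
    0x04000000: "ShOvf",
    0x08000000: "DivOvf",
    0x10000000: "DivBy0",
    0x20000000: "AguOvf",
}


def decode(status, table):
    """Return (names of listed bits set in status, remaining unlisted bits)."""
    names = [name for bit, name in table.items() if status & bit]
    # The keys are distinct single bits, so their sum is the OR of all of them.
    unlisted = status & ~sum(table)
    return names, unlisted


def split_by_mask(status, mask, table):
    """Split the set bits into (unmasked, masked) name lists.

    Assumption: a bit set in mask means that exception is masked, as with -j 0x88.
    """
    unmasked, _ = decode(status & ~mask, table)
    masked, _ = decode(status & mask, table)
    return unmasked, masked


# Example from the text: AluOvfl raised while the denormalized flags are masked.
status = 0x00000020 | 0x00000080
print(split_by_mask(status, 0x88, JN_EXCEPTIONS))
```

Bits that decode reports as unlisted belong to the "all other exceptions" category, which according to this page should be communicated to experts.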
How to locate the exception in the user program
The PMA tells where in the program the exception was generated. If the program has been compiled with rtc, the script ~simma/public/bin/pmtrace may help to recover from the PMA the relevant sections in the TAO and assembly source code. The pedestrian (and sometimes more instructive) way is as follows, where we assume that you have an exception at PMA 0x000XYZ on program.jex:
1) Perform
disjex program.jex | less
(which gives you an ASCII dump of the microcode).
2) Search for a line starting with "XYZ:" (which gives you the microcode line corresponding to the PMA where the exception happened).
3) Search upward (towards lower hexadecimal PMAs in the first column) for the first line of the form
">>>>>>>>> LABEL 0x0000LLLL (NNN) <<<<<<<<<"
(which tells you the number of the label by which the section starts). If the hexadecimal label number 0xLLLL is less than 0x1000, continue searching upward until you have found the first label bigger than 0x1000.
4) Inspect the jasm file of your program:
- If the jasm was generated by rtc, search for a line of the form: "LABEL *0xLLLL"
- If the jasm was generated by xtc, search for a line of the form: "LABEL *<0xHHH+XTC_FIRST_LABEL>", where the hexadecimal number 0xHHH is computed as 0xHHH = 0xLLLL - 0x1000 (without leading zeroes in 0xHHH, i.e. 0x123 instead of 0x00000123).
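The search described above can be sketched in Python, operating on the text produced by disjex (for example after redirecting it to a file). This is only an illustration under stated assumptions: the regular expressions are guesses based on the line forms quoted on this page, the helpers find_label and xtc_offset are hypothetical names, a label equal to 0x1000 is accepted, and the sample dump lines are made up.

```python
import re

# Assumed line forms, modelled on the examples quoted on this page.
LABEL_RE = re.compile(r">+\s*LABEL\s+0x([0-9A-Fa-f]+)\s+\([^)]*\)\s*<+")
PMA_RE = re.compile(r"\s*([0-9A-Fa-f]+):")

FIRST_LABEL = 0x1000  # labels below this value are skipped


def find_label(dump_lines, pma):
    """Return the nearest label >= 0x1000 above the line for pma, or None."""
    hit = None
    for i, line in enumerate(dump_lines):
        m = PMA_RE.match(line)
        if m and int(m.group(1), 16) == pma:
            hit = i
            break
    if hit is None:
        return None
    # Walk upward from the PMA line, skipping labels below 0x1000.
    for line in reversed(dump_lines[:hit]):
        m = LABEL_RE.search(line)
        if m and int(m.group(1), 16) >= FIRST_LABEL:
            return int(m.group(1), 16)
    return None


def xtc_offset(label):
    """Return 0xHHH = 0xLLLL - 0x1000 without leading zeroes."""
    return f"0x{label - FIRST_LABEL:x}"


# Made-up dump fragment for illustration only.
dump = [
    ">>>>>>>>> LABEL 0x00001123 (291) <<<<<<<<<",
    "1a0: (microcode)",
    ">>>>>>>>> LABEL 0x00000004 (4) <<<<<<<<<",
    "1a1: (microcode)",
    "1a2: (microcode)",
]
label = find_label(dump, 0x1A2)
print(hex(label), xtc_offset(label))
```

The hexadecimal digits are printed in lower case here; if the jasm file uses upper case, the search string has to be adapted accordingly.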
Problems due to the use of uninitialized variables
One of the most subtle reasons for various kinds of exceptions may be the access to an uninitialized variable (either due to a compiler bug or, more likely, due to a programming error). In the worst case, this means that your program finishes without exception but gives incorrect results. In the best case, you will get some kind of exception, which may be:
- On Tz due to integer arithmetic using a wrong integer value
- On Jn due to the use of wrong floating-point data (either because you access memory at a wrong address due to uninitialized and hence wrong integer addresses, or because you access, at the correct address, some wrong FP data which has never been initialized).
Case 1: Your code accesses uninitialized integer variables (on Tz):
To identify such situations, it may be helpful to run the program twice: once after executing tinit0.jex (which initializes all Tz memory to 0) before your program, and once after tinitf.jex (which initializes all Tz memory to -1). The behaviour (exceptions, results) is likely to be different if the program accesses uninitialized Tz variables.
Unfortunately, there is no method which 100% guarantees to find the use of uninitialized integer variables in your code.
Warning: If you forget to run tinit0.jex or tinitf.jex before your program, the integer variables on Tz have an undefined value. Depending on the programs that ran before (e.g. different programs on different boards or units), the uninitialized variables may hence have different values on different Tz's and cause extremely abnormal behaviour of the machine.
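The comparison of the two runs can be automated once the output of each run has been saved as text. The following Python sketch assumes exactly that; how the output is captured is not specified here, and the helper differing_lines as well as the sample output strings are hypothetical.

```python
from itertools import zip_longest


def differing_lines(run_after_tinit0, run_after_tinitf):
    """Return (line number, text from run 1, text from run 2) for each differing line.

    A line missing in the shorter output is reported as None.
    """
    a = run_after_tinit0.splitlines()
    b = run_after_tinitf.splitlines()
    return [
        (number, x, y)
        for number, (x, y) in enumerate(zip_longest(a, b), start=1)
        if x != y
    ]


# Made-up outputs for illustration only.
out0 = "plaquette 0.5\ncounter 3\n"
outf = "plaquette 0.5\ncounter -1\n"
for number, x, y in differing_lines(out0, outf):
    print(f"line {number}: {x!r} vs {y!r}")
```

Any reported difference hints at an uninitialized Tz variable. An empty result does not prove that there is none, in line with the remark above that no method is 100% reliable.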