HPC and Data for Lattice QCD

Debugging of APEmille Communication Links

Notations:

- Comm Registers and bits refer to Sasha's documentation
  "Ch-registers.rtc 14-Dec-99"

Slices:

- Comm slices/Piggy Backs are numbered such that
  	Comm0 is closest to Jn1
  	Comm1 is closest to Jn3
  	Comm2 is closest to Jn5
  	Comm3 is closest to Jn7
- "caos" prints the four Comm slices in the order of growing BID
  and in each line in the order: Comm0 Comm1 Comm2 Comm3
- Data errors seen in the bits 23-16 and 7-0 of any (even or odd bank)
  32-bit data on Jn correspond to the UPPER cables, i.e. Comm0 and Comm1.
- Data errors seen in the bits 31-24 and 15-8 of any (even or odd bank)
  32-bit data on Jn correspond to the LOWER cables, i.e. Comm2 and Comm3
- Schematically, the nibbles of the hex-representation of the data
  pass through U(pper) or L(ower) slices as follows: 0xLLUULLUU
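
The byte-lane rule above can be turned into a small mask check. This is only a sketch and not part of the APEmille tools: the function name classify_error_bits and the idea of XOR-ing an expected word with the received one are assumptions made for illustration. Only the two masks follow from the 0xLLUULLUU layout.

```python
# Masks derived from the 0xLLUULLUU layout described above.
UPPER_MASK = 0x00FF00FF  # bits 23-16 and 7-0  -> upper cables (Comm0, Comm1)
LOWER_MASK = 0xFF00FF00  # bits 31-24 and 15-8 -> lower cables (Comm2, Comm3)


def classify_error_bits(expected, received):
    """Return the cable group(s) hit by the differing bits of a 32-bit word.

    Hypothetical helper: 'expected' and 'received' are the 32-bit data
    words as seen on Jn (even or odd bank).
    """
    diff = (expected ^ received) & 0xFFFFFFFF
    groups = []
    if diff & UPPER_MASK:
        groups.append("upper (Comm0/Comm1)")
    if diff & LOWER_MASK:
        groups.append("lower (Comm2/Comm3)")
    return groups
```

For example, a word that differs from the expected one only in bit 0 is attributed to the upper cables, while a difference in bit 24 points to the lower cables.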

Directions:

- (Most) directions in the Comm-world refer to the active, i.e. sender,
  point of view. In particular:
- "+x" on the PB front panel labels the connector over which data
  is sent out when communication is done in send_x_plus direction
- TAO directions refer to the passive (receiver) point of view:
  send direction    remote address    TAO direction
  ---------------------------------------------------
  send_x_plus       0x01000000        x_minus
  send_x_minus      0x01800000        x_plus
  send_y_plus       0x02000000        y_minus
  send_z_plus       0x03000000        z_minus
  i.e. in the assignment b = a[x_plus] the node [x,y,z] receives data
  from node [x+1,y,z] which corresponds to data transfer along "send_x_minus"
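
The sender/receiver convention can be written down as a small lookup. This is a sketch under assumptions: the names REMOTE_ADDRESS and tao_direction are hypothetical, the addresses are copied only from the rows listed in the table, and the sign flip for directions the table does not list (send_y_minus, send_z_minus) is extrapolated from the listed rows.

```python
# Remote addresses copied from the table above; directions that the
# table does not list are deliberately left out.
REMOTE_ADDRESS = {
    "send_x_plus": 0x01000000,
    "send_x_minus": 0x01800000,
    "send_y_plus": 0x02000000,
    "send_z_plus": 0x03000000,
}


def tao_direction(send_direction):
    """Map a sender-view direction such as 'send_x_plus' to the TAO
    (receiver-view) direction, i.e. the same axis with the sign flipped.
    """
    parts = send_direction.split("_")
    if (len(parts) != 3 or parts[0] != "send"
            or parts[1] not in ("x", "y", "z")
            or parts[2] not in ("plus", "minus")):
        raise ValueError("unexpected direction: %r" % send_direction)
    axis, sign = parts[1], parts[2]
    flipped = "minus" if sign == "plus" else "plus"
    return axis + "_" + flipped
```

With this, tao_direction("send_x_minus") gives "x_plus", matching the assignment b = a[x_plus] discussed above.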
- Exception registers refer to the direction of data movement;
  for instance, removing the "-x" cable on PB 1 causes exceptions
  in the registers labeled Xplus when transferring data in the send_x_plus
  direction:
       PB0                     PB1
       |                       |
       |+x  -----.             |+x
       |          \            |
       |           \           |
       |            \          |
       |-x           `-------> |-x
                 send_x_plus

EDAC Error Registration:

NO EDAC errors/exceptions are registered unless ALL directions are opened!!!

Dummy Transfers:

Whenever in RUN-mode, the Comms perform dummy communications (dummy data
with alternating 0 and 1 bits) along the last activated direction.
At the beginning of a program (or during the whole program if no remote
communications are done) the dummy communications are along the
send_*_plus directions.
For this reason, it is advisable to perform a HARDRESET to stop any
ongoing dummy transfers (e.g. in case of an exception???).
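
When looking at captured link data, it can help to tell dummy traffic apart from real data. The following is a minimal sketch, assuming the alternating pattern holds across a whole 32-bit word (which would mean 0x55555555 or 0xAAAAAAAA). The notes above do not state the exact dummy word, and the function name looks_like_dummy_word is hypothetical.

```python
def looks_like_dummy_word(word):
    """Return True if every bit of the 32-bit word differs from its
    neighbour, i.e. the word is a strictly alternating 0/1 pattern.

    Assumption: the dummy pattern alternates across the full 32 bits.
    """
    word &= 0xFFFFFFFF
    # XOR with the word shifted by one: each of the 31 adjacent bit
    # pairs must differ, so the low 31 bits of the result must all be 1.
    return ((word ^ (word >> 1)) & 0x7FFFFFFF) == 0x7FFFFFFF
```

Both 0x55555555 and 0xAAAAAAAA pass this check, while ordinary data words such as 0x12345678 do not.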