HPC and Data for Lattice QCD

C Extensions for apeNEXT

Syntax Extensions

Data Types,
Field Access,
Complex conjugation,
Complex and vector constants,
Where,
any / all / none,
register structs,
Static and dynamic cache,
Loop unrolling

Data types

Name              Size (in bytes)
int               8
short             8
long              8
float             8
double            8
complex           2 x 8
vector            2 x 8
vector int        2 x 8
vector unsigned   2 x 8

Note: there is only limited support for type char, mostly for string constants.

Field access

The real and imaginary parts of complex expressions can be accessed as in TAO, i.e.

complex c;
float f;

c.im = 4;
f = c.re;

  

Note: Alternatively, the ISO C syntax, e.g. float _Complex, is supported.

The components of vector type expressions are accessed analogously:

vector v;
float f;

v.hi = 4;
f = v.lo;

  

Complex conjugation

The conjugation operator is '~' (i.e. like bitwise NOT for integer types).
Note that this is a prefix operator.



register complex c1, c2, c3;
float f;

c1 = ~c2 * c2;
f = (c3 + c1 * ~c2).re;
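
For readers more familiar with ISO C, the two expressions above can be compared with a minimal C99 sketch based on <complex.h>. It assumes that the prefix operator ~ corresponds to conj() and that .re corresponds to creal(); the function names norm_squared and real_part_example are made up for this illustration and are not part of nlcc.

```c
#include <complex.h>

/* Standard C99 counterpart of the nlcc expression  ~c * c  (illustrative only). */
double norm_squared(double complex c)
{
    /* conj(c) * c has a zero imaginary part; creal() extracts the real value */
    return creal(conj(c) * c);
}

/* Standard C99 counterpart of the nlcc expression  (c3 + c1 * ~c2).re */
double real_part_example(double complex c1, double complex c2, double complex c3)
{
    return creal(c3 + c1 * conj(c2));
}
```

For instance, norm_squared(3.0 + 4.0*I) evaluates (3 - 4i)(3 + 4i), which is the squared modulus of the argument.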


  

Complex and vector constants



register complex c1 = {1.4,-22.2};
vector v;
vector int vi = (vector int){0x45,0x12};

float f = 12.1;

c1 = {13.1,f};
v  = {13.1,f};
vi = {45,f};    // includes conversion from float to int



  

Note that these two-element constants are treated as complex by default. When using one as a vector, add a cast to be on the safe side.

where

The where construct is used as in TAO, but its syntax is that of a C if statement:

where( a < b ) {
        // do something
} else {
        // do something else
}

  

There are no restrictions on the types used in the condition; even char may be used.
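
The effect of where can be pictured with a small model in standard C. This is only a sketch under the assumption that where keeps its TAO meaning of a node-local condition: every node evaluates the condition on its own data and executes either the first or the else branch. The function where_less_than and the representation of the nodes as array elements are hypothetical and serve only as an illustration.

```c
#include <stddef.h>

/*
 * Hypothetical model: element n of each array is the copy of the
 * variable that lives on node n.
 */
void where_less_than(const double *a, const double *b, double *out,
                     double then_value, double else_value, size_t nodes)
{
    for (size_t n = 0; n < nodes; n++) {
        if (a[n] < b[n])
            out[n] = then_value;    /* body of where( a < b ) { ... } on node n */
        else
            out[n] = else_value;    /* body of else { ... } on node n */
    }
}
```

In this model different nodes can take different branches within the same call, which is the property that distinguishes where from an ordinary if.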

any / all / none

These constructs are also used as in TAO, but may be mixed with other conditions:

if( any(a < b) ) {
        // do something
} else {
        // do something else
}
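
As an illustration, any, all and none can be modelled in standard C as reductions over the per-node values of a condition. The sketch assumes the TAO meaning of these constructs (a single result shared by all nodes); the names any_node, all_nodes and no_node and the array representation of the nodes are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* cond[n] is the value of the condition on node n (hypothetical model). */
bool any_node(const bool *cond, size_t nodes)
{
    for (size_t n = 0; n < nodes; n++)
        if (cond[n])
            return true;    /* at least one node fulfils the condition */
    return false;
}

bool all_nodes(const bool *cond, size_t nodes)
{
    for (size_t n = 0; n < nodes; n++)
        if (!cond[n])
            return false;   /* one node fails, so not all nodes fulfil it */
    return true;
}

bool no_node(const bool *cond, size_t nodes)
{
    return !any_node(cond, nodes);
}
```

Because the result is the same on every node in this model, it can be used in a regular if statement, which matches the usage shown above.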

  

register structs

For reasons of optimization, structs can be placed in regsets.
The syntax is the same as for regular register variables:

void fun() {
        typedef struct {
                float f;
                int i;
        } bla;

        register bla b;
        b.f = 42.0;
}


  

Static and dynamic cache

Loops and functions can be explicitly cached by adding a "#pragma cache" directly before them:

#pragma cache
complex fun(int i) {
        return (complex)i;
}

void main() {
        int i;

        #pragma cache
        for(i=0;i<10;i++) {
                // do something
        }
}

  

Loop unrolling

Regular loops with constant bounds can be explicitly unrolled by adding a "#pragma unroll" directly before them:

/* complete unrolling, no jumps left */
# pragma unroll
for(int i=0; i<5; i++) {
        /* some code */
}

/* 3-fold unrolling (i.e. 10 iterations) */
# pragma unroll 3
for( int j=0; j<30; j++ ) {
        /* some code */
}
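
To make the effect of "# pragma unroll 3" concrete, the following standard C sketch writes the 3-fold unrolled form of a 30-iteration loop by hand. It only illustrates the transformation; the code nlcc actually generates may differ, and the functions sum_rolled and sum_unrolled3 are made up for this example.

```c
/* Rolled form: 30 iterations, one copy of the loop body. */
long sum_rolled(const long *x)
{
    long s = 0;
    for (int j = 0; j < 30; j++)
        s += x[j];
    return s;
}

/* Hand-written 3-fold unrolling: 10 iterations, three copies of the body. */
long sum_unrolled3(const long *x)
{
    long s = 0;
    for (int j = 0; j < 30; j += 3) {
        s += x[j];
        s += x[j + 1];
        s += x[j + 2];
    }
    return s;
}
```

Both functions compute the same result; the unrolled version executes a third of the loop jumps at the price of a larger loop body.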

  

Slice Specification for Initializers

For global initializers (except character strings), a slice specifier can be prepended to the actual value or list of values. Further slice/initializer combinations may follow, separated by blanks. The slice specifiers have the same syntax as in TAO:

        [ x1 , y1 , z1 ] [ x2 , y2 , z2 ]

The second set of coordinates may be omitted if it is identical to the first.
Example:



int a[4] = [1,2,1][1,2,2]{ 1,2,3,4 } [0,1,1]{ 9,8,7,6 };

int xx = [1,1,0] 56;

struct { int a; float b; } sss = [1,2,3] { 4, 5.6 };

complex aa = [1,1,0] { 334.5, 12.3};
vector vv = [1,2,0] { 334.5, 12.3};

union { double d; unsigned i; } u = [2,0,0] { 5 };


  


Note: The memory range occupied by the identifier will be initialized to zero on all nodes that are not covered by the given slice(s).
Another example: a quite elegant way to define node coordinates:

double node_x = [1,0,0][1,1,1] 1;
double node_y = [0,1,0][1,1,1] 1;
double node_z = [0,0,1][1,1,1] 1;
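
The node-coordinate example can be checked with a small model in standard C. The sketch rests on an assumption that is inferred from that example and not stated explicitly here: the two coordinate triples are taken as opposite corners of an inclusive box of nodes. The type node_coord and the functions in_slice and slice_init are hypothetical.

```c
#include <stdbool.h>

typedef struct { int x, y, z; } node_coord;

/* Assumption: a slice [lo][hi] covers every node with lo <= n <= hi per axis. */
bool in_slice(node_coord n, node_coord lo, node_coord hi)
{
    return lo.x <= n.x && n.x <= hi.x &&
           lo.y <= n.y && n.y <= hi.y &&
           lo.z <= n.z && n.z <= hi.z;
}

/* Value seen on node n: the initializer inside the slice, zero elsewhere. */
double slice_init(node_coord n, node_coord lo, node_coord hi, double value)
{
    return in_slice(n, lo, hi) ? value : 0.0;
}
```

Under this assumption, and for node coordinates restricted to 0 and 1, the initializer [1,0,0][1,1,1] 1 yields 1 exactly on the nodes with x = 1, so node_x ends up holding the x coordinate of each node.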

  

Inline assembly

Inline assembly *should* work similarly to the gcc implementation. It uses similar assembly templates but currently supports only a small set of constraints.
Simple cases are fine; stick to the following examples:


// literal or string
asm( "\tatr 0x20 0x0010000\n\trtc 0x21 0x20" );
asm( string );

// Read access for expr1 and expr2 in registers.
// "r" means that the expressions should be read
// from a register. %0 and %1 are the 1st and 2nd
// input arguments.
// Note the missing output specification between
// the first two colons.
asm("mnemonic %0 %1" :  : "r" (expr1), "r" (expr2) );

// read access for expr1 and expr2 in registers and
// output to register outexpr1
asm( "mnemonic %0 %1 %2" : "=r" (outexpr1) : \
     "r" (expr1), "r" (expr2) );

// an argument type "I" translates into a literal
// in the output string:
asm( "mnemonic %0 :%1" : : "r" (a), "I" (sizeof(a)>>4) );



  

Queue Access

Explicit queue access is provided by some macros in queue.h.

Example:


#include <queue.h>
#include <topology.h>           // not necessary as it's included by queue.h

int result1, result2;

typedef struct {
        int a[5];
} mystruct;

int a[4]   = { 101, 102, 103, 104 };
mystruct s = {{ 201, 202, 203, 204, 205 }};

void main () {
        int r, i = 2;
        register mystruct rs, rs2;

        /* MTQ */
        prefetch( a[i+X_PLUS] );

        // QTR
        fetch( r );
        result1 = r;

        rs = s;

        // RTQ
        rprefetch( X_PLUS, rs );

        // QTR
        fetch( rs2 );
        result2 = rs2.a[2];
}


  

Note: When using explicit queue commands, it's probably wise to understand and use memory access pragmas and qualifiers.

Local Memory Access

Direct access to the memory is usually faster than the default access path via the queue mechanism. In some cases it is even required, to avoid messing up the queue contents.
nlcc currently optimizes memory accesses where possible (if the option is used) but cannot do so if indirection is used (e.g. array access via some index variable).
nlcc currently provides two ways to enforce local memory access:


  1. A #pragma localmem will enforce local memory access from the location of the pragma until the end of the current block.
    Example:

    
    
    #include <queue.h>

    int a[2] = { 0x100, 0x200 };
    int result[4];

    void main() {
            int i = 0, r;

            prefetch( a[1] );                       // pre-filling the local queue
            prefetch( a[1] );
            prefetch( a[1] );

            {
                    result[0] = a[i];               // access via MTQ/QTR

                    #pragma localmem
                    result[1] = a[i];               // local access via MTR

                    {
                            result[2] = a[i];       // local access via MTR
                    }

                    result[3] = a[i];               // local access via MTR
            }

            fetch(r);                               // empty the queue
            fetch(r);
            fetch(r);

            /*
            at this point, result[0] contains 0x200 while
            result[1 .. 3] contain 0x100 .
            */
    }
    
      
    
    
  2. The nodeprivate qualifier
    Variables of all types can have a nodeprivate qualifier, in which case all accesses to or via this variable will be local.
    Unlike other qualifiers, nodeprivate applies to the actual identifier and therefore works for pointers as well.
    Explicit queue access completely ignores this qualifier.
    Example:




    int nodeprivate a[4] = { 1,2,3,4 };
    int b[4]             = { 5,6,7,8 };

    void main () {
            int i = 2;
            int r1,r2,r3,r4;
            int *p;
            int nodeprivate *q;

            r1 = a[i];              // local access via MTR

            r2 = b[i];              // access via MTQ/QTR

            p = &a[i];
            r3 = *p;                // access via MTQ/QTR

            q = &b[i];
            r4 = *q;                // local access via MTR
    }


  

Note: The exact name of this qualifier is under discussion. If you have a better suggestion, please submit it. The terms 'local' and 'global' have been intentionally avoided since they are already used to describe scopes.
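
The result stated at the end of the #pragma localmem example (result[0] contains 0x200, result[1 .. 3] contain 0x100) can be reproduced with a small simulation in standard C. The sketch assumes that the queue behaves as a plain first-in, first-out buffer and that a default memory access consists of one prefetch followed by one fetch; the type queue_model, its capacity and all function names are hypothetical and do not mirror the macros in queue.h.

```c
#include <assert.h>

#define QUEUE_CAPACITY 16   /* arbitrary size for the model */

typedef struct {
    int data[QUEUE_CAPACITY];
    int head;               /* index of the next value to be fetched */
    int count;              /* number of values currently queued */
} queue_model;

/* Model of a memory-to-queue transfer: append a value to the queue. */
void model_prefetch(queue_model *q, int value)
{
    assert(q->count < QUEUE_CAPACITY);
    q->data[(q->head + q->count) % QUEUE_CAPACITY] = value;
    q->count++;
}

/* Model of a queue-to-register transfer: remove the oldest value. */
int model_fetch(queue_model *q)
{
    assert(q->count > 0);
    int value = q->data[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return value;
}

/* Default access path: the value goes through the queue. */
int load_via_queue(queue_model *q, int value_in_memory)
{
    model_prefetch(q, value_in_memory);
    return model_fetch(q);  /* returns the oldest entry, not necessarily ours */
}

void run_localmem_example(int result[4])
{
    int a[2] = { 0x100, 0x200 };
    queue_model q = { .head = 0, .count = 0 };
    int i = 0;

    model_prefetch(&q, a[1]);               /* pre-filling the queue */
    model_prefetch(&q, a[1]);
    model_prefetch(&q, a[1]);

    result[0] = load_via_queue(&q, a[i]);   /* picks up a stale a[1] */
    result[1] = a[i];                       /* local access bypasses the queue */
    result[2] = a[i];
    result[3] = a[i];

    (void)model_fetch(&q);                  /* empty the queue */
    (void)model_fetch(&q);
    (void)model_fetch(&q);
}
```

In the model, the queued access returns the value that was prefetched first, which is why result[0] differs from the locally accessed elements.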

IO Macros (for OS0 !)

By including io.h, the following macros are defined:


write a global int variable.

write a global float variable, terminated by .

write a global complex variable, terminated by .

write a global vector variable, terminated by .
,
write a string, constant or via pointer.

The results can be expanded via from the file.
Note that nlcc uses the copying version of the IO macros, so stack and register variables can be printed as well.
Example:


#include <io.h>

complex c;
char *p;

void main() {
        c = {29.4, -100};
        p = "Hello World\n";

        puts("complex : ");
        write_complex(c);
        puts(p);
}

  

More elaborate I/O functions will be provided in the course of the OS and C library development.

Memory Allocation

Currently, in absence of a more complete C library and the definition of a final memory layout, the only possible dynamic memory allocation is on the stack.
Two methods are commonly used:

Variable length arrays:
    This is a new C99 feature. It has been implemented in nlcc since version 0.3.37.
alloca():
    This is not a standard C feature, but it is supported by most compilers (including nlcc). The alloca() function allocates memory in the current stack frame and frees it when returning from the allocating function. Its prototype is in stdlib.h.

Note: All memory allocating functions are defined (in the C standard) to accept requested sizes in bytes. Thus one may be tempted to use a seemingly portable expression like:


int *array = (int *)alloca( 100 * sizeof(int) );  



This, however, will fail due to the somewhat special alignment demands of apeNEXT: integer and float types (i.e. all types that strictly occupy only one 64-bit memory bank) report sizes of 8 bytes but are each aligned to 16 bytes (128 bits), even when packed inside an array.
So, a more portable expression would also involve a call to the (non-standard) function alignof() (in stdlib.h) or the nlcc-specific sizeof_aligned(), as in


int *array = (int *)alloca( 100 * sizeof_aligned(int) );  




Note 2: alloca() cannot be used as an argument to a function call (the result, however, can). The reason is that it would mess up the other function arguments that are placed onto the stack later.
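
The size arithmetic behind the two alloca() calls above can be sketched in standard C. The numbers 8 and 16 are taken from the text; the functions bytes_naive and bytes_aligned are hypothetical helpers that only model the effect of padding every element to its alignment, they are not the implementation of sizeof_aligned().

```c
#include <assert.h>
#include <stddef.h>

/* Size computed from sizeof() alone, ignoring per-element padding. */
size_t bytes_naive(size_t count, size_t elem_size)
{
    return count * elem_size;
}

/* Size when every element is padded up to a multiple of its alignment. */
size_t bytes_aligned(size_t count, size_t elem_size, size_t alignment)
{
    assert(alignment > 0);
    size_t padded = (elem_size + alignment - 1) / alignment * alignment;
    return count * padded;
}
```

For 100 elements of 8 bytes aligned to 16 bytes, the naive computation requests 800 bytes whereas the padded array occupies 1600 bytes, so the first alloca() call reserves only half of the required space.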




NLCC General C Extensions

  • Local and global memory

    • localint keyword

  • the where condition

  • all, any and none conditions

  • nodeglobal and nodeprivate qualifier keywords.

  • vectors

    • vector of float,

    • vector of ints

  • variable length arrays

  • Pragmas

    • Pragmas for functions.

  • Initialisations

    • vectors and complex values

    • node and grid specification

  • __func__ : containing the function name.