C Extensions for apeNEXT
HPC and Data for Lattice QCD
Contents: Data Types, Field Access, Complex conjugation, Complex and vector constants, Where, any / all / none, register structs, Static and dynamic cache, Loop unrolling
| Name | Size (in bytes) |
|---|---|
| int | 8 |
| short | 8 |
| long | 8 |
| float | 8 |
| double | 8 |
| complex | 2 x 8 |
| vector | 2 x 8 |
| vector int | 2 x 8 |
| vector unsigned | 2 x 8 |
Note: there is only limited support for type char, mostly for string constants.
Field access
The real and imaginary parts of complex expressions can be accessed as in TAO:

    complex c;
    float f;
    c.im = 4;
    f = c.re;
Note: Alternatively, ISO C syntax such as float _Complex is supported.
Vector type expressions:

    vector v;
    float f;
    v.hi = 4;
    f = v.lo;
The conjugation operator is '~' (i.e. the same symbol as bitwise NOT for integer types). Note that this is a prefix operator.

    register complex c1,c2,c3;
    float f;
    c1 = ~c2 * c2;
    f = (c3+c1*~c2).re;
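To make the arithmetic behind the conjugation operator concrete, here is a minimal sketch in plain ISO C. It does not use nlcc syntax: the `cplx` struct and the `cplx_*` helpers are hypothetical stand-ins for the built-in complex type and its operators, chosen only so the example compiles with any C compiler.

```c
/* Plain-C stand-in for the nlcc complex type (hypothetical name). */
typedef struct { double re, im; } cplx;

/* Equivalent of the prefix operator ~a: negate the imaginary part. */
static cplx cplx_conj(cplx a)
{
    cplx r = { a.re, -a.im };
    return r;
}

/* Equivalent of a + b. */
static cplx cplx_add(cplx a, cplx b)
{
    cplx r = { a.re + b.re, a.im + b.im };
    return r;
}

/* Equivalent of a * b. */
static cplx cplx_mul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im,
               a.re * b.im + a.im * b.re };
    return r;
}

/* Equivalent of (c3 + c1 * ~c2).re */
double real_of_expr(cplx c1, cplx c2, cplx c3)
{
    return cplx_add(c3, cplx_mul(c1, cplx_conj(c2))).re;
}
```

Multiplying a value by its own conjugate, as in `~c * c`, yields a purely real result (the squared magnitude), which is why this pattern is common in lattice code.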
    register complex c1 = {1.4,-22.2};
    vector v;
    vector int vi = (vector int){0x45,0x12};
    float f = 12.1;
    c1 = {13.1,f};
    v = {13.1,f};
    vi = {45,f}; // includes conversion from float to int

Note that these two-element constants are treated as complex by default. When one is meant as a vector, add a cast to be on the safe side.
The where construct is used as in TAO but has the syntax of a C if statement:

    where( a < b ) {
        // do something
    } else {
        // do something else
    }
The types used in the conditions are not restricted; even char may be used.
The any, all and none constructs are also used as in TAO but may be mixed with other conditions:

    if( any(a < b) ) {
        // do something
    } else {
        // do something else
    }
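The difference between a per-node where and the any / all / none conditions can be modelled in plain C. This is only a sketch under an assumption carried over from TAO: where selects a branch independently on each node, while any, all and none reduce the condition over all nodes to a single value. The array-of-nodes representation and all function names below are hypothetical.

```c
#include <stddef.h>

/* any(cond): true if the condition holds on at least one node. */
int any_of(const int *cond, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cond[i])
            return 1;
    return 0;
}

/* all(cond): true only if the condition holds on every node. */
int all_of(const int *cond, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!cond[i])
            return 0;
    return 1;
}

/* none(cond): true if the condition holds on no node. */
int none_of(const int *cond, size_t n)
{
    return !any_of(cond, n);
}

/* Model of: where( a < b ) { out = x; } else { out = y; }
   Each node (array element) takes its own branch. */
void where_lt(const double *a, const double *b, size_t n,
              double x, double y, double *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (a[i] < b[i]) ? x : y;
}
```

In this model an `if( any(...) )` makes every node take the same branch, whereas `where` lets neighbouring nodes diverge.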
For reasons of optimization, structs can be placed in regsets. The syntax is the same as for regular register variables:

    void fun() {
        typedef struct { float f; int i; } bla;
        register bla b;
        b.f = 42.0;
    }
Loops and functions can be explicitly cached by adding a "#pragma cache" immediately before them:

    #pragma cache
    complex fun(int i) {
        return (complex)i;
    }

    void main() {
        int i;
    #pragma cache
        for(i=0;i<10;i++) {
            // do something
        }
    }
Regular loops with constant bounds can be explicitly unrolled by adding a "#pragma unroll" immediately before them:

    /* complete unrolling, no jumps left */
    # pragma unroll
    for(int i=0; i<5; i++) {
        /* some code */
    }

    /* 3-fold unrolling (i.e. 10 iterations) */
    # pragma unroll 3
    for( int j=0; j<30; j++ ) {
        /* some code */
    }
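As an illustration of what 3-fold unrolling means for a 30-iteration loop, the following sketch performs the transformation by hand in plain C. It is not the code nlcc generates; the function names and the summation body are assumptions made for the example.

```c
/* Rolled reference: 30 iterations, one element per iteration. */
long sum_rolled(const int *v)
{
    long s = 0;
    for (int j = 0; j < 30; j++)
        s += v[j];
    return s;
}

/* Hand-written 3-fold unrolling: the body is replicated three
   times, so only 10 loop iterations (and jumps) remain. */
long sum_unrolled3(const int *v)
{
    long s = 0;
    for (int j = 0; j < 30; j += 3) {
        s += v[j];
        s += v[j + 1];
        s += v[j + 2];
    }
    return s;
}
```

Both functions return the same result; the unrolled form trades code size for fewer branch instructions, which is why the trip count must be a constant divisible by the unroll factor in this hand-written version.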
Slice Specification for Initializers
For global initializers (except character strings), a slice specifier can be prepended to the actual value or list of values. Further slice/initializer combinations may follow, separated by blanks. The slice specifiers have the same syntax as in TAO:
[ x1 , y1 , z1 ] [ x2 , y2, z2 ]
The second set of coordinates may be omitted if it is identical to the first.
Example:

    int a[4] = [1,2,1][1,2,2]{ 1,2,3,4 } [0,1,1]{ 9,8,7,6 };
    int xx = [1,1,0] 56;
    struct { int a; float b; } sss = [1,2,3] { 4, 5.6 };
    complex aa = [1,1,0] { 334.5, 12.3};
    vector vv = [1,2,0] { 334.5, 12.3};
    union { double d; unsigned i; } u = [2,0,0] { 5 };
Note: The memory range occupied by the identifier will be initialized to zero on all nodes that are not covered by the given slice(s).
Another example: a quite elegant way to define node coordinates:

    double node_x = [1,0,0][1,1,1] 1;
    double node_y = [0,1,0][1,1,1] 1;
    double node_z = [0,0,1][1,1,1] 1;
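The effect of a slice specifier can be modelled in plain C as a per-node selection. The sketch below assumes that the two coordinate triples denote the inclusive corners of a box of nodes and that the machine in the node-coordinate example is a 2 x 2 x 2 grid; both are assumptions, and the `coord` type and function names are hypothetical.

```c
/* Node position on the grid (hypothetical helper type). */
typedef struct { int x, y, z; } coord;

/* 1 if node lies inside the inclusive box spanned by lo and hi. */
int in_slice(coord node, coord lo, coord hi)
{
    return node.x >= lo.x && node.x <= hi.x &&
           node.y >= lo.y && node.y <= hi.y &&
           node.z >= lo.z && node.z <= hi.z;
}

/* Value seen by a node: the initializer inside the slice,
   zero everywhere else (as stated in the note above). */
double slice_init(coord node, coord lo, coord hi, double value)
{
    return in_slice(node, lo, hi) ? value : 0.0;
}

/* Model of: double node_x = [1,0,0][1,1,1] 1; */
double node_x_on(coord node)
{
    coord lo = { 1, 0, 0 };
    coord hi = { 1, 1, 1 };
    return slice_init(node, lo, hi, 1.0);
}
```

Under these assumptions every node with x = 1 sees the value 1 and every node with x = 0 sees 0, so the variable ends up holding the node's own x coordinate.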
Inline assembly *should* work similarly to the gcc implementation. It uses similar assembly templates but currently supports only a small set of constraints. Simple cases are fine; stick to the following examples:

    // literal or string
    asm( "\tatr 0x20 0x0010000\n\trtc 0x21 0x20" );
    asm( string );

    // read access for expr1 and expr2 in registers
    // "r" means that the expressions should be read
    // from a register. %0 and %1 are the 1st and 2nd
    // input arguments
    // Note the missing output specification between
    // the first two colons.
    asm("mnemonic %0 %1" : : "r" (expr1), "r" (expr2) );

    // read access for expr1 and expr2 in registers and
    // output to register outexpr1
    asm( "mnemonic %0 %1 %2" : "=r" (outexpr1) : \
         "r" (expr1), "r" (expr2) );

    // an argument type "I" translates into a literal
    // in the output string:
    asm( "mnemonic %0 :%1" : : "r" (a), "I" (sizeof(a)>>4) );
Explicit queue access is provided by some macros in <queue.h>.
Example:

    #include <queue.h>
    #include <topology.h> // not necessary as it's included by queue.h

    int result1, result2;
    typedef struct { int a[5]; } mystruct;
    int a[4] = { 101, 102, 103, 104 };
    mystruct s = {{ 201, 202, 203, 204, 205 }};

    void main () {
        int r, i = 2;
        register mystruct rs, rs2;

        /* MTQ */
        prefetch( a[i+X_PLUS] );
        // QTR
        fetch( r );
        result1 = r;

        rs = s;
        // RTQ
        rprefetch( X_PLUS, rs );
        // QTR
        fetch( rs2 );
        result2 = rs2.a[2];
    }
Note: When using explicit queue commands, it's probably wise to understand and use memory access pragmas and qualifiers.
Direct access to the memory is usually faster than the default
access path via the queue mechanism. In some cases it's even required
to avoid messing up the queue contents.
nlcc currently optimizes
memory accesses where possible (if the option is used)
but cannot do so if indirection is used (e.g. array access via some
index variable).
nlcc currently provides two ways to enforce local memory access:
- A "#pragma localmem" will enforce local memory access from the location of the pragma until the end of the current block.
  Example:

      #include <queue.h>

      int a[2] = { 0x100, 0x200 };
      int result[4];

      void main() {
          int i = 0, r;
          prefetch( a[1] ); // pre-filling the local queue
          prefetch( a[1] );
          prefetch( a[1] );
          {
              result[0] = a[i]; // access via MTQ/QTR
      #pragma localmem
              result[1] = a[i]; // local access via MTR
              {
                  result[2] = a[i]; // local access via MTR
              }
              result[3] = a[i]; // local access via MTR
          }
          fetch(r); // empty the queue
          fetch(r);
          fetch(r);
          /* at this point, result[0] contains 0x200 while
             result[1 .. 3] contain 0x100 . */
      }
- The nodeprivate qualifier
  Variables of all types can have a nodeprivate qualifier, in which case all accesses to or via this variable will be local. Unlike other qualifiers, the nodeprivate qualifier applies to the actual identifier and therefore works for pointers as well. Explicit queue access completely ignores this qualifier.
  Example:

      int nodeprivate a[4] = { 1,2,3,4 };
      int b[4] = { 5,6,7,8 };

      void main () {
          int i = 2;
          int r1,r2,r3,r4;
          int *p;
          int nodeprivate *q;

          r1 = a[i]; // local access via MTR
          r2 = b[i]; // access via MTQ/QTR
          p = &a[i];
          r3 = *p;   // access via MTQ/QTR
          q = &b[i];
          r4 = *q;   // local access via MTR
      }
Note: The exact name of this qualifier is under discussion. If you have a better suggestion, please submit it. The terms 'local' and 'global' have been intentionally avoided since they are already used to describe scopes.
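Why a queued read can return stale data, as in the "#pragma localmem" example where result[0] ends up as 0x200, becomes clearer with a toy model. The sketch below assumes the queue behaves as a simple first-in, first-out buffer in which every MTQ appends a value and every QTR removes the oldest one. The `queue_model` type and its functions are hypothetical and are not part of queue.h.

```c
#define QUEUE_CAP 16

/* Toy FIFO standing in for the hardware queue (no overflow or
   underflow checks; this is only a model). */
typedef struct {
    int slot[QUEUE_CAP];
    int head;   /* index of the oldest entry */
    int count;  /* number of entries held */
} queue_model;

/* Model of MTQ: append a memory value to the queue. */
void q_push(queue_model *q, int v)
{
    q->slot[(q->head + q->count) % QUEUE_CAP] = v;
    q->count++;
}

/* Model of QTR: remove and return the oldest queue entry. */
int q_pop(queue_model *q)
{
    int v = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return v;
}

/* Replays the start of the pragma example and returns the value
   that the queued read result[0] = a[i] (with i == 0) receives. */
int replay_queued_read(void)
{
    int a[2] = { 0x100, 0x200 };
    queue_model q = { {0}, 0, 0 };

    q_push(&q, a[1]);   /* prefetch( a[1] ) */
    q_push(&q, a[1]);   /* prefetch( a[1] ) */
    q_push(&q, a[1]);   /* prefetch( a[1] ) */

    q_push(&q, a[0]);   /* MTQ half of result[0] = a[i] */
    return q_pop(&q);   /* QTR half: oldest entry comes out */
}
```

In this model the read returns the oldest pre-filled entry rather than a[0], which matches the comment in the example and shows why local access is sometimes required when the queue is used explicitly.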
By including <io.h>, the following macros are defined:
-
- write a global int variable.
-
- write a global float variable, terminated by .
-
- write a global complex variable, terminated by .
-
- write a global vector variable, terminated by .
- ,
- write a string, constant or via pointer.
The results can be expanded via from the
file.
Note that nlcc uses the copying version of the IO
macros, so stack and register variables can be printed as
well.
Example:

    #include <io.h>

    complex c;
    char *p;

    void main() {
        c = {29.4, -100};
        p = "Hello World\n";
        puts("complex : ");
        write_complex(c);
        puts(p);
    }
More elaborate I/O functions will be provided in the course of the OS and C library development.
Currently, in absence of a more complete C library and the
definition of a final memory layout, the only possible dynamic memory
allocation is on the stack.
Two methods are commonly
used:
- Variable length arrays:
  This is a new C99 feature. It has been implemented in nlcc since version 0.3.37.
- alloca():
  This is not a standard C feature, but it is supported by most compilers (including nlcc). The alloca() function allocates memory in the current stack frame and frees it when the allocating function returns. The prototype is in stdlib.h.

  Note: All memory allocating functions are defined (in the C standard) to accept requested sizes in bytes. Thus one may be tempted to use a seemingly portable expression like:

      int *array = (int *)alloca( 100 * sizeof(int) );

  This, however, will fail due to the somewhat special alignment demands of apeNEXT: integer, float (i.e. all types that strictly occupy only one 64-bit memory bank) report sizes of 8 bytes but are each aligned to 16 bytes (128 bits), even when packed inside an array. So a more portable expression would also involve a call to the (non-standard) function alignof() (in stdlib.h) or the nlcc-specific sizeof_aligned(), as in:

      int *array = (int *)alloca( 100 * sizeof_aligned(int) );

  Note 2: alloca() cannot be used as an argument to a function call (its result, however, can). The reason is that it would corrupt the other function arguments that are placed onto the stack later.
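The size mismatch described above can be checked with a little arithmetic in plain C. The sketch assumes only the figures given in the text (an element size of 8 bytes and an alignment of 16 bytes); the helper names are hypothetical and do not reproduce the nlcc implementation of sizeof_aligned().

```c
#include <stddef.h>

/* Round size up to the next multiple of align (align must be > 0). */
size_t round_up(size_t size, size_t align)
{
    return (size + align - 1) / align * align;
}

/* Bytes needed for count array elements when every element is
   padded to the given alignment. */
size_t aligned_array_bytes(size_t count, size_t elem_size, size_t align)
{
    return count * round_up(elem_size, align);
}
```

With 100 elements, `100 * 8` requests 800 bytes, while `aligned_array_bytes(100, 8, 16)` gives 1600, so the naive sizeof expression would reserve only half of the memory the array actually occupies.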
NLCC General C Extensions
- Local and global memory
- localint keyword
- the where condition
- all, any and none conditions
- nodeglobal and nodeprivate qualifier keywords
- vectors
- vector of float
- vector of ints
- variable length arrays
- Pragmas
- Pragmas for functions
- Initialisations
- vectors and complex values
- node and grid specification
- __func__: containing the function name