HPC and Data for Lattice QCD
Hints on optimization

With the tools described below you can change something in your program and then compare the pipeline performance, or individual parts of the microcode, with the previous version. The following guidelines may help you achieve good performance, but feel free to try other tricks as well.
  1. Every load and store burst has some latency. So keep the total number of memory read and write statements low, and try to read/write as many values as possible with one statement. This is usually possible by loading the needed values into registers, performing all the calculations on these registers, and then writing the registers back with a single statement.
    An example for this would be:
    	matrix complex vec.[4]
    	vec  v[2]
    	!! put something in v[0] here...
    	v[1].[0]=v[0].[0]~*(-2.0)
    	v[1].[1]=v[0].[1]~*(-3.0)
    	v[1].[2]=v[0].[2]~*(1.5)
    	v[1].[3]=v[0].[3]~*(2.5)
    
    The number of bursts can be reduced dramatically (from 4 to 1 for both reading and writing) with the following code:
    	matrix complex vec.[4]
    	vec  v[2]
    	!! put something in v[0] here...
    	register vec vr,vw
    	vr=v[0]
    	vw.[0]=vr.[0]~*(-2.0)
    	vw.[1]=vr.[1]~*(-3.0)
    	vw.[2]=vr.[2]~*(1.5)
    	vw.[3]=vr.[3]~*(2.5)
    	v[1]=vw
    
    You could gain a little more in this code by using the :local statement when loading vr, although there are some problems with the compiler, so at the moment this might not work everywhere. Be careful with registers, though: they get lost at the boundaries of an optimization block (end of an if statement, end of a do-enddo loop, calls to subroutines), so you have to write back the ones you still need before the end of an optimization block, and reinitialize them when they are needed again.
    Sometimes some values are needed in every iteration of a do-enddo loop. If you would load them in every iteration anyway, you can declare them as physical registers (physreg) and initialize them outside the loop. Their values are kept across boundaries of optimization blocks such as the end of an if statement or of a do-enddo loop. But again, be careful: after a call to a subroutine the values of physical registers are (usually) lost as well (unless you declare them global in some way, see the NEWSYNTAX file).
    Use registers as much as possible. They are not the real registers of the APE, so there are infinitely many of them. You just tell the compiler that those values don't have to be saved (or that you do it yourself), which minimizes the number of needed load/store bursts.
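    As a schematic illustration of the physreg idea (the exact do-loop syntax and the declarations are only sketched here; check the NEWSYNTAX file for the forms your compiler actually accepts):
    	matrix complex vec.[4]
    	vec v[100]
    	register vec vr
    	physreg vec scale
    	scale=v[0]              !! loaded once, survives block boundaries
    	do i=1,99
    	  vr=v[i]
    	  vr.[0]=vr.[0]*scale.[0]
    	  v[i]=vr
    	enddo
    Here scale stays in a physical register across all iterations, so its load from memory happens only once instead of in every iteration.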
  2. Don't split the optimization blocks at unnecessary places. If you want the maximum gain from your registers, you should try to make the optimization blocks as long as possible. If you need if statements or calls to subroutines, try to move them to the beginning (or end) of the loop you want to optimize. The shaker can move statements only within optimization blocks; if a block is interrupted in the middle, no statements from one half can be moved to the other. In addition you lose your registers.
    If there is a do loop inside the loop you want to speed up, it helps to unroll this inner loop. There are ZZ statements to help you (the /for statement).
    where-endwhere statements don't split your optimization block. But be aware that the code inside takes time even if the condition was false! Also be careful with the ZZ statements you use: be sure you know what is inside them (if statements, calls, etc.), so you know where you lose your registers or your physical registers. For example, the widely used random number generator ranfloat contains both an if statement and some calls to subroutines. You can rewrite it so that it contains no calls (so you no longer lose your physical registers), but you will still lose your registers after using ranfloat.
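    A schematic illustration of moving an if statement to the boundary of a loop (the bodies are placeholders; the point is only the position of the if):
    	!! bad: the if statement splits each iteration into two
    	!! optimization blocks, and the registers are lost in between
    	do i=1,99
    	  !! ...first half of the work...
    	  if (...) ... endif
    	  !! ...second half of the work...
    	enddo
    	!! better: one large optimization block per iteration
    	do i=1,99
    	  !! ...first half of the work...
    	  !! ...second half of the work...
    	  if (...) ... endif
    	enddo
    In the second version the shaker can move statements freely between the two halves of the work.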
  3. You should usually try to push the pipeline performance as high as possible, but not always. If you can reduce the number of floating point operations, this will most likely lower your pipeline performance, yet your program will still be faster than before. This can happen when you improve your algorithm, but sometimes a trick achieves the same. If you need some values complex conjugated, you would usually load them into registers (which takes time), apply the complex conjugation operator ~ to them (which again takes one cycle per value), and then use them. If you only ever need those values complex conjugated, there is a better way: they can be loaded complex conjugated on the fly using the ^* operator. In the above example this would be:
    	matrix complex vec.[4]
    	vec  v[20]
    	!! put something in v[0] here...
    	register vec vr,vw
    	vr=v[0]^*         !! load complex conjugated on the fly
    	vw.[0]=vr.[0]*(-2.0)
    	vw.[1]=vr.[1]*(-3.0)
    	vw.[2]=vr.[2]*(1.5)
    	vw.[3]=vr.[3]*(2.5)
    	v[1]=vw
    
    This means you still need the time for loading those values, but they arrive already complex conjugated, so you save the cycles for performing the complex conjugation.
    However, there are limitations to this trick. In particular, the corresponding array must contain more than 18 complex values, which is why the array was enlarged to v[20] in this example. Again, see NEWSYNTAX for more information.
    You can see in the microcode when something is loaded complex conjugated on the fly: in one of the last columns you usually see that a value is loaded by VIN, but if it is complex conjugated there is a CIN instead. (Some reference to more explanations about e.g. TTOJ, MCI, MCO, etc. would be in place here.)
    If you don't use this trick where it would be useful, you can sometimes see that in the microcode dump, too. In the dump, the first column gives the cycle number. The next block of columns tells you (well, if you could read microcode) what the integer unit (ALU) does, or at least how much it does. The following block of columns tells you what (or how much) the floating point unit does, and the last column contains (mainly) information about memory access (e.g. loads and stores by VIN, CIN, VOUT). If, shortly after some VINs, there are some 7s in the first column of the floating point part, then you can most likely gain something by using complex conjugation on the fly.
    (By the way, you don't access the real and imaginary parts directly, right? Since one full complex operation can be done in a single cycle, it is highly inefficient to access the real or imaginary parts alone.)
  4. Look at the microcode dump to see where exactly there are unfilled cycles in the floating point part (resulting in a lower pipeline performance) and why they are there. Typically you are interested in the performance inside some do-enddo loop. In this loop you usually start by reading in some values, so the integer unit has to do a lot of calculations before the floating point unit can really start calculating. If this integer part is too large, it might be a good idea to try to reduce it. One reason might be that you use multiarrays. Using the multindex statement might help, and you can move parts of the multindex calculation to an outer loop (remember, you can add multindices!). The best solution might be to reduce the number of indices of the array. An additional way around this problem is to load some values for the next iteration into a physical register at the end of an iteration, so you can calculate on this data right from the beginning. This may help to overlap index calculation, memory access and floating point calculations.
    Another thing you might see is cycles where the floating point unit does nothing, although it already calculated something before and calculates again a little later. Even worse, you might see a very long list of VINs (CINs) before the first calculation is done. The shaker tries to prevent this, but the result will sometimes be better if you help it by reshuffling parts of your code. Try to load the data that is needed for the first calculation in the loop first. The shaker might not be able to change the order of the loads if the address of a value has to be calculated. If you can rearrange your program this way, it becomes possible to do some calculations while the next values are read from memory, which should speed things up a lot. If there are still many empty lines for the floating point unit, there might be logical dependencies that prevent filling those cycles. Try to find out why. If it is caused by too many index calculations, try to reduce them (as described above).
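    A schematic sketch of the reshuffling idea, reusing the vec type from the examples above and assuming a, b, c are arrays of type vec (which ordering actually helps has to be checked in the microcode dump):
    	register vec ra,rb,rc,vw
    	!! unfavourable: the value needed first is loaded last
    	ra=a[i]
    	rb=b[i]
    	rc=c[i]
    	vw.[0]=rc.[0]*(2.0)     !! has to wait until the last load is done
    	!! better: load first what is needed first
    	rc=c[i]
    	ra=a[i]
    	rb=b[i]
    	vw.[0]=rc.[0]*(2.0)     !! can start while ra and rb are still loading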
  5. If you need data from a floating point unit which is not a direct neighbour of yours, it might be useful to use the set remote <ident> statement.
  6. Unrolling in latency limited codes (/for): in codes which are dominated by the latencies of the FP operations (e.g. due to excessive data dependencies), unrolling with /for usually helps. Of course this works only if there are no dependencies between the different for-iterations. Such dependencies may arise from
    	- store operations
    	- where constructs
    
    
    Note: sqrt contains a hidden where, while isqrt does not. Hence one can calculate an unsafe sqrt as x*isqrt(x).
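    A schematic example combining both remarks (the exact /for syntax is described in the ZZ documentation; treat this as illustrative):
    	/for i = 0 to 3
    	  !! the four iterations are independent, so after unrolling the
    	  !! shaker can interleave them and hide the FP latencies
    	  s[i]=x[i]*isqrt(x[i])   !! unsafe sqrt(x[i]), no hidden where
    	/endfor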

Tools for analyzing the performance:

After compilation of your program following HOWTO-compile you will have
	myprog.jasm
	myprog.jex
For analyzing the performance you should make a dump of the microcode:
	disjex myprog-xtc.jex > myprog-xtc.mcd

You can now look at the pipeline performance of your program:
	xperf -c myprog-xtc.mcd | less
(You could use myprog-xtc.jex instead of the microcode dump, but this way is faster.) The tool xperf allows you to look at different aspects of your program. To get complete information about some routine, that routine should be directly in myprog.zzt, i.e. it should not be in a file that is included via the /include statement. You might want to write a wrapper for this, use the C preprocessor (cpp), or just copy that routine into the main file.
The first thing xperf tells you are the boundaries of the optimization blocks. If you are not sure where these blocks are, you can use xperf to see them. The start of each block is marked in the first column of the output by a cycle number, e.g. 0043A.
Further useful information appears in this line:
	0 %  C: 0  D: 0  V: 0  S: 0  I: 0  L: 0  IN: 0/0  OUT: 0/0
This appears at the end of each block. The first value is very important for optimization: it shows the pipeline performance, i.e. the fraction of cycles in this block in which the processor really performs floating point operations. Values between 50% and 95% seem to be reasonable, but of course this depends a lot on your actual algorithm. Be aware that 95% does *not* mean that you reach 95% of peak performance. If you multiply 2 real numbers you use one cycle, and this counts towards the 95%. But you could multiply 2 complex numbers and add a third one, and it would still cost only one cycle. So the pipeline performance may overstate your real performance by up to a factor of 8.
The next 4 values tell you something about the number of the different floating point operations. The IN (OUT) value tells you how many values are read (written) and how many bursts are needed for that. You can get additional information by looking directly at the microcode dump. From xperf you know the cycle number of an optimization block (e.g. 0043A as above). You can find the place of that block in the microcode with
	pmtrace myprog-xtc.mcd 0x43A
Among other things this tells you the last label before that cycle. If you have the cycle number directly from xperf, this should be exactly the label for that cycle. Let's say pmtrace tells you that the last label is 4200 (0x1068). Now you can look at the microcode dump and search for (4200). Read the code from this label down to the next label; this is the microcode for the optimization block you are interested in.