BALANCE OF POWER

Enhancing PowerPC Native Speed

DAVE EVANS

When you convert your applications to native PowerPC code, they run lightning fast.
To get the most out of RISC processors, however, you need to pay close attention to
your code structure and execution. Fast code is no longer measured solely by an
instruction timing table. The PowerPC 601 processor includes pipelining,
multiple-issue and speculative execution, branch prediction, and a set-associative cache.
All these features make it hard to predict which code will run fastest on a Power Macintosh.

Writing tight code for the PowerPC processor isn't hard, especially with a good
optimizing compiler to help you. In this column I'll pass on some of what I've learned
about tuning PowerPC code. There are gotchas and coding habits to avoid, and there are
techniques for squeezing the most from your speed-critical native code. For a good
introduction to RISC pipelining and related concepts that appear in this column, see
"Making the Leap to PowerPC" in Issue 16.

MEASURING YOUR SPEED
The power of RISC lies in the ability to execute one or more instructions every
machine clock cycle, but RISC processors can do this only in the best of circumstances.
At their worst they're as slow as CISC processors. The following loop, for example,
averages only one calculation every 2.8 cycles:

float *a, *b, *c, d, e;
int i;
for (i = 0; i < gArraySize; i++) {
  e = b[i] + c[i] / d;
  a[i] = MySubroutine(b[i], e);
}

By restructuring the code and using other techniques from this column, you can make
significant improvements. This next loop generates the same result, yet averages one
calculation every 1.9 cycles -- about 50% faster.

reciprocalD = 1 / d;
for (i=0; i < gArraySize; i+=2) {
  float result, localB, localC, localE;
  float result2, localB2, localC2, localE2;

  localB = b[i];
  localC = c[i];
  localB2 = b[i+1];
  localC2 = c[i+1];

  localE = localB + (localC * reciprocalD);
  localE2 = localB2 + (localC2 * reciprocalD);
  InlineSubroutine(&result, localB, localE);
  InlineSubroutine(&result2, localB2, localE2);

  a[i] = result;
  a[i+1] = result2;
}

The rest of this column explains the techniques I just used for that speed gain. They
include expanding loops, scoping local variables, using inline routines, and using
faster math operations.

UNDERSTANDING YOUR COMPILER
Your compiler is your best friend, and you should try your hardest to understand its
point of view. You should understand how it looks at your code and what assumptions
and optimizations it's allowed to make. The more you empathize with your compiler,
the more you'll recognize opportunities for optimization.

An optimizing compiler reorders instructions to improve speed. Executing your code
line by line usually isn't optimal, because the processor stalls to wait for dependent
instructions. The compiler tries to move instr uctions that are independent into the
stall points. For example, consider this code:

first = input * numerator;
second = first / denominator;
output = second + adjustment;

Each line depends on the previous line's result, and the compiler will be hard pressed
to keep the pipeline full of useful work. This simple example could cause 46 stalled
cycles on the PowerPC 601, so the compiler will look at other nearby code for
independent instructions to move into the stall points.
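To see what the compiler is looking for, here's a minimal sketch of that idea: the same dependent chain from above, alongside a version that processes two elements at once. (The routine names and the half-sum math are hypothetical, just for illustration.) With two independent chains interleaved, the scheduler can slot the second chain's instructions into the first chain's stall points.

```c
#include <assert.h>

/* One dependent chain: each step needs the previous result, so on its
   own it leaves the pipeline stalled between instructions. */
float Chain(float input, float numerator, float denominator,
            float adjustment)
{
    float first  = input * numerator;
    float second = first / denominator;
    return second + adjustment;
}

/* Two elements per call: the two chains are independent, so the
   compiler can interleave their instructions to fill the stalls. */
void ChainPair(const float in[2], float out[2], float numerator,
               float denominator, float adjustment)
{
    float first1  = in[0] * numerator;    /* chain 1 */
    float first2  = in[1] * numerator;    /* chain 2, independent */
    float second1 = first1 / denominator;
    float second2 = first2 / denominator;
    out[0] = second1 + adjustment;
    out[1] = second2 + adjustment;
}
```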

EXPANDING YOUR LOOPS
Loops are often your most speed-critical code, and you can improve their performance
in several ways. Loop expanding is one of the simplest methods. The idea is to perform
more than one independent operation in a loop, so that the compiler can reorder more
work in the pipeline and thus prevent the processor from stalling.

For example, in this loop there's too little work to keep the processor busy:

float *a, *b, *c, d;
int i;
for (i = 0; i < multipleOfThree; i++) {
  a[i] = b[i] + c[i] * d;
}

If we know the data always occurs in certain sized increments, we can do more steps in
each iteration, as in the following:

for (i=0; i < multipleOfThree; i+=3) {
  a[i] = b[i] + c[i] * d;
  a[i+1] = b[i+1] + c[i+1] * d;
  a[i+2] = b[i+2] + c[i+2] * d;
}

On a CISC processor the second loop wouldn't be much faster, but on the PowerPC
processor the second loop is twice as fast as the first. This is because the compiler can
schedule independent instructions to keep the pipeline constantly moving. (If the data
doesn't occur in nice increments, you can still expand the loop; just add a small loop at
the end to handle the extra iterations.)

Be careful not to expand a loop too much, though. Very large loops won't fit in the
cache, causing cache misses for each iteration. In addition, the larger a loop gets, the
less work can be done entirely in registers. Expand too much and the compiler will have
to use memory to store intermediate results, outweighing your marginal gains. Besides,
you get the biggest gains from the first few expansions.
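Here's a sketch of the remainder-loop technique for data that doesn't come in nice increments (the routine name is hypothetical): the main loop runs four elements at a time, and a small loop mops up the zero to three leftovers.

```c
/* Expand by 4, then handle the 0-3 leftover elements when n isn't a
   multiple of 4. */
void ScaleAdd(float *a, const float *b, const float *c, float d, int n)
{
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        a[i]   = b[i]   + c[i]   * d;
        a[i+1] = b[i+1] + c[i+1] * d;
        a[i+2] = b[i+2] + c[i+2] * d;
        a[i+3] = b[i+3] + c[i+3] * d;
    }
    for (; i < n; i++)              /* remainder loop */
        a[i] = b[i] + c[i] * d;
}
```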

SCOPING YOUR VARIABLES
If you're new to RISC, you'll be impressed by the number of registers available on the
PowerPC chip -- 32 general registers and 32 floating-point registers. By having so
many, the processor can often avoid slow memory operations. Your compiler will take
advantage of this when it can, but you can help it by carefully scoping your variables
and using lots of local variables.

The "scope" of a variable is the area of code in which it is valid. Your compiler
examines the scope of each variable when it schedules registers, and your code can
provide valuable information about the usage of each variable. Here's an example:

for (i=0; i < gArraySize; i++) {
  a[i] = MyFirstRoutine(b[i], c[i]);
  b[i] = MySecondRoutine(a[i], c[i]);
}

In this loop, the global variable gArraySize is scoped for the whole program. Because
we call a subroutine in the loop, the compiler can't tell if gArraySize will change
during each iteration. Since the subroutine might modify gArraySize, the compiler has
to be conservative. It will reload gArraySize from memory on every iteration, and it
won't optimize the loop any further. This is wastefully slow.

On the other hand, if we use a local variable, we tell the compiler that gArraySize and
c[i] won't be modified and that it's all right to just keep them handy in registers. In
addition, we can store data as temporary variables scoped only within the loop. This
tells the compiler how we intend to use the data, so that the compiler can use free
registers and discard them after the loop. Here's what this would look like:

arraySize = gArraySize;
for (i=0; i < arraySize; i++) {
  float localC;
  localC = c[i];
  a[i] = MyFirstRoutine(b[i], localC);
  b[i] = MySecondRoutine(a[i], localC);
}

These minor changes give the compiler more information about the data, in this
instance accelerating the resulting code by 25%.

STYLING YOUR CODE
Be wary of code that looks complicated. If each line of source code contains complicated
dereferences and typecasting, chances are the object code has wasteful memory
instructions and inefficient register usage. A great compiler might optimize well
anyway, but don't count on it. Judicious use of temporary variables (as mentioned
above) will help the compiler understand exactly what you're doing -- plus your code
will be easier to read.

Excessive memory dereferencing is a problem exacerbated by the heavy use of handles
on the Macintosh. Code often contains double memory dereferences, which is important
when memory can move. But when you can guarantee that memory won't move, use a
local pointer, so that you only dereference a handle once. This saves load instructions
and allows further optimizations. Casting data types is usually a free operation --
you're just telling the compiler that you know you're copying seemingly incompatible
data. But it's not free if the data types have different bit sizes, which adds conversion
instructions. Again, avoid this by using local variables for the commonly cast data.
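Here's a sketch of the local-pointer technique. The Handle typedef stands in for the Mac Toolbox type, and the routine names are hypothetical; the point is that the first version reloads the master pointer on every iteration, while the second dereferences once -- safe only while the block is guaranteed not to move.

```c
typedef char **Handle;    /* stands in for the Toolbox Handle type */

/* Slow: the handle's master pointer is reloaded on every access. */
long SumSlow(Handle h, int n)
{
    long sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += (*(short **)h)[i];   /* double dereference each time */
    return sum;
}

/* Fast: while the block can't move, dereference once into a local
   pointer; this saves a load per iteration and frees the compiler
   to optimize further. */
long SumFast(Handle h, int n)
{
    short *p = *(short **)h;        /* single dereference */
    long sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
```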

I've heard many times that branches are "free" on the PowerPC processor. It's true
that often the pipeline can keep moving even though a branch is encountered, because
the branch execution unit will try to resolve branches very early in the pipeline or
will predict the direction of the branch. Still, the more subroutines you have, the less
your compiler will be able to reorder and intelligently schedule instructions. Keep
speed-critical code together, so that more of it can be pipelined and the compiler can
schedule your registers better. Use inline routines for short operations, as I did in the
improved version of the first example loop in this column.
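For a short operation, an inline routine can be as simple as a macro (or an inline function, with compilers that support one); the names and the half-sum math below are hypothetical. Because the body expands into the caller, the compiler can schedule its instructions together with the surrounding code instead of across a subroutine branch.

```c
/* A short operation expanded inline: no branch, no parameter
   passing, and the compiler schedules it with the caller's code. */
#define INLINE_BLEND(result, x, y)  ((result) = ((x) + (y)) * 0.5f)

float BlendCaller(float x, float y)
{
    float result;
    INLINE_BLEND(result, x, y);
    return result;
}
```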

KNOWING YOUR PROCESSOR
As with all processors, the PowerPC chip has performance tradeoffs you should know
about. Some are processor model specific. For example, the PowerPC 601 has 32K of
cache, while the 603 has 16K split evenly into an instruction cache and a data cache.
But in general you should know about floating-point performance and the virtues of
memory alignment.

Floating-point multiplication is wicked fast -- up to nine times the speed of integer
multiplication. Use floating-point multiplication if you can. Floating-point division,
by contrast, takes about 17 times as long as floating-point multiplication, so when
possible multiply by a reciprocal instead of dividing.
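The reciprocal trick looks like this sketch (the routine name is hypothetical): hoist the single divide out of the loop, then multiply inside it. Note that multiplying by a reciprocal can round differently from a true divide, so reserve it for code where that last bit of precision doesn't matter.

```c
/* One divide outside the loop, then cheap multiplies inside it,
   instead of a slow divide on every iteration. */
void DivideAll(float *a, const float *b, float d, int n)
{
    float reciprocalD = 1.0f / d;
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] * reciprocalD;
}
```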

Memory accesses go fastest if addressed on 64-bit memory boundaries. Accesses to
unaligned data stall while the processor loads different words and then shifts and
splices them. For example, be sure to align floating-point data to 64-bit boundaries,
or you'll stall for four cycles while the processor loads 32-bit halves with two
64-bit accesses.
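Field order inside a structure is one place alignment bites, as this sketch shows (the structure names are made up). A double declared after a lone char forces the compiler to insert padding, and a carelessly arranged structure can leave floating-point fields off their natural boundaries.

```c
#include <stddef.h>

/* Padding is inserted after 'tag' so 'value' lands on an aligned
   boundary; putting the double first avoids wasted space and keeps
   it naturally aligned in arrays of the structure. */
struct Padded  { char tag; double value; };  /* padding after tag */
struct Compact { double value; char tag; };  /* double stays aligned */
```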

MAKING THE DIFFERENCE
Native PowerPC code runs really fast, so in many cases you don't need to worry about
tweaking its performance at all. For your speed-critical code, though, these tips I've
given you can make the difference between "too slow" and "fast enough."

DAVE EVANS may be able to tune PowerPC code for Apple, but for the last year he's
been repeatedly thwarted when tuning his 1978 Harley-Davidson XLCH motorcycle.
Fixing engine stalls, poor timing, and rough starts proved difficult, but he was
recently rewarded with the guttural purr of a well-tuned Harley. *

Code examples were compiled with the PPCC compiler using the speed optimization
option, and then run on a Power Macintosh 6100/66 for profiling. A PowerPC 601
microsecond timing library is provided on this issue's CD. *