If you care about the performance of code you write for the Power Macintosh, memory
usage should be your foremost concern. With the PowerPC™ 601 processor today, and
even more important with future processors, memory usage of your code will have the
greatest effect on its performance. Poorly written code will execute at a fraction of its
potential, and often very simple changes will greatly improve the execution speed of
your critical code.
Processors are improving much faster than the memory subsystems that support
them. As the PowerPC chips move from 80 MHz to 100 MHz and beyond, their thirst
for data to process and instructions to execute will increasingly tax memory. Memory
caches attempt to mitigate that thirst, and all PowerPC processors come equipped with
built-in caches. But your code can work well with a cache or it can work very poorly
with a cache. I'll show you why and discuss what you can do to optimize your memory
usage.
CACHE BASICS
As you know, a cache is simply very fast RAM that the processor uses to store recently
referenced data and code. On the PowerPC processors, any data stored in the cache can be
accessed without stalling the processor's pipeline. Accessing data that isn't in the
cache takes about 20 times as long, since it must be read from main memory, or even a
million times as long if the access causes a page fault with virtual memory.
memory. Getting and keeping your performance-critical code and data in the cache are
therefore key to your execution speed.
A cache is divided into small blocks called cache lines. On the PowerPC 601, for
example, the cache has 1024 cache line blocks, each holding 32 bytes. In addition, the
601 will fetch two blocks when it can, making the cache line size effectively 64 bytes.
The first PowerPC processors have set associative caches of different sizes. The 601
has an eight-way set associative, unified cache that's 32K in size, and the 603 has a
two-way set associative, split cache with 8K for data and 8K for instruction code. The
term set associative refers to the way the cache relates to main memory, which is
important to your performance. In some simple caching schemes, each cache line maps
directly to specific areas of main memory; any access to one of these areas loads bytes
into that cache line. But on the PowerPC processor, sets of cache lines are combined
and then mapped to memory. There are eight cache lines in each set on the 601, and
two in each set on the 603. An access to one of the areas mapped to a set will load bytes
into the least recently used cache line of the set, keeping the most frequently used cache lines
from being purged. This more complicated scheme typically yields much better
performance than the directly mapped cache.
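To make the mapping concrete, here's a minimal sketch of how an address picks a cache
set, assuming the numbers above: 1024 lines of 32 bytes, eight lines per set, giving
128 sets. The function is mine for illustration, not anything the system provides.

#include <stdio.h>

/* A sketch of set mapping, assuming the 601's unified cache:
   1024 lines of 32 bytes, eight-way set associative, so
   1024 / 8 = 128 sets. */
#define kLineSize 32
#define kNumSets  128

static unsigned long CacheSetIndex(unsigned long address)
{
    /* Bits 0-4 choose a byte within the line; the next seven
       bits choose the set. */
    return (address / kLineSize) % kNumSets;
}

int main(void)
{
    /* Addresses 4096 bytes apart land in the same set, so a ninth
       such access must evict one of the eight lines already there. */
    printf("%lu %lu\n", CacheSetIndex(0x1000), CacheSetIndex(0x2000));
    return 0;
}

Notice that two addresses exactly 4K apart produce the same set index; that collision
is at the heart of the thrashing example below.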
CACHE THRASHING
The cache will most affect your performance when you're accessing large amounts of
data. A typical example of this is walking through arrays to perform some operation.
The best strategy is to minimize cache collisions during your accesses, and the best
tactic for this is to access your data as sequentially as possible. If you walk through
memory sequentially, you'll load the cache every 64 bytes, but all 64 bytes will be
available for fast processing. Here's an example:
unsigned long data[64][1024];
int row, column;

for (row = 2; row < 64; row++)
    for (column = 0; column < 1024; column++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];
This example performs additions on each element of a large matrix and accesses that
matrix sequentially in memory. It walks across each row, adding elements and storing
the result. But just inverting the loops can significantly change the way memory is
accessed:
unsigned long data[64][1024];
int row, column;

for (column = 0; column < 1024; column++)
    for (row = 2; row < 64; row++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];
Reversing the loops leads to less than optimal performance, since we now perform the
addition for every row of a column before moving on to the next column. Instead of
sequential access, this access pattern jumps across memory in even steps of 4K.
Unfortunately, on the PowerPC processor these accesses map to the same set of cache
lines, and every operation causes the cache to reload from main memory. This second
example takes twice as long to execute the same calculation on a Power Macintosh
6100/60.
By paying attention to how your code accesses memory, you can avoid serious cache
thrashing like that caused by the second example. Things to look out for are loops that
step through memory in power-of-2 strides (128 bytes, 256 bytes, and so on) and code
whose memory accesses are not close together.
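When you can't reorder the loops, one remedy is to change the stride itself. Here's a
minimal sketch using the matrix from the example above: padding each row by 16 longs
(64 bytes, one effective cache line, an arbitrary choice for illustration) makes
successive rows map to different cache sets instead of colliding in the same one.

unsigned long data[64][1024 + 16];   /* rows padded; only the first
                                        1024 columns hold real data */
int row, column;

for (column = 0; column < 1024; column++)
    for (row = 2; row < 64; row++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];

The loop order is still the slow, column-major one, but each row is now 4160 bytes
long, so consecutive rows no longer map to the same set of cache lines.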
An approach called blocking may help your loops. Often your code isn't as simple as
above, and your memory accesses aren't regular during the loop. If you're walking two
different arrays with different increments through memory, it may be impossible to
serialize your accesses. Blocking performs the calculations in blocks of rows and
columns. Instead of iterating across all the columns and then proceeding to the next
row, you divide the dimensional space into blocks and calculate one whole block at a
time. In this next example, we calculate the multiplication of two matrices.
long result[64][64], foo[64][128], bar[128][64];
int row, column, i;

for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column++) {
        long sum = 0;
        for (i = 0; i < 128; i++)
            sum += foo[row][i] * bar[i][column];
        result[row][column] = sum;
    }
As this algorithm walks through memory, it accesses result and foo sequentially, but
bar is accessed in 256-byte steps. Accessing bar by jumping through memory causes
cache misses, and sequential elements of bar are flushed from the cache before they're
needed.
By performing this operation in small blocks, we can better use the cache. The key is
to use all the elements of foo and bar that are in a cache line before moving on. One
way to do this is to expand the loop and perform four operations in a single iteration:
long result[64][64], foo[64][128], bar[128][64];
int row, column, i;

for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column += 4) {
        long sum1 = 0, sum2 = 0;
        long sum3 = 0, sum4 = 0;
        for (i = 0; i < 128; i++) {
            sum1 += foo[row][i] * bar[i][column];
            sum2 += foo[row][i] * bar[i][column+1];
            sum3 += foo[row][i] * bar[i][column+2];
            sum4 += foo[row][i] * bar[i][column+3];
        }
        result[row][column] = sum1;
        result[row][column+1] = sum2;
        result[row][column+2] = sum3;
        result[row][column+3] = sum4;
    }
This expanded loop calculates a block of four cells in each iteration. It executes
faster because elements of bar are read from the cache instead of causing a cache miss
on every access, as in the earlier example. Notice that in the expanded inner loop, a cache line of
the bar matrix will be loaded the first time that it's referenced; then the following
three references to bar will occur without stalling. Using the bar elements while
they're still in the cache gives us a significant improvement.
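You can carry blocking further by tiling both dimensions instead of unrolling just the
column loop. Here's a sketch of a fully blocked version of the same multiplication; the
16-by-16 block size is my assumption, chosen so that one block of each matrix sits
comfortably in the cache.

long result[64][64], foo[64][128], bar[128][64];
int rowBase, colBase, iBase, row, column, i;

/* Clear the result, then accumulate one block at a time. */
for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column++)
        result[row][column] = 0;

for (rowBase = 0; rowBase < 64; rowBase += 16)
    for (colBase = 0; colBase < 64; colBase += 16)
        for (iBase = 0; iBase < 128; iBase += 16)
            /* Within this block, every cache line of foo and bar
               that's loaded gets used 16 times before moving on. */
            for (row = rowBase; row < rowBase + 16; row++)
                for (column = colBase; column < colBase + 16; column++) {
                    long sum = result[row][column];
                    for (i = iBase; i < iBase + 16; i++)
                        sum += foo[row][i] * bar[i][column];
                    result[row][column] = sum;
                }

Whether this beats the unrolled version depends on your matrix sizes and your cache;
profile before committing to the extra complexity.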
CODE STYLE
Good compilers pay attention to your memory accesses and can optimize how memory is
accessed. For example, the compiler can reorder load and store operations to occur when
the data is most likely to be available. The first access to data tends to cause a cache
line to load, and subsequent accesses to nearby data must also wait for the cache load
to complete. The compiler may be able to help by inserting a few instructions between
the loads, so that the cache line is fully loaded by the time the subsequent accesses
are needed.
For more information on loop expansion and instruction reordering, see the
Balance of Power column in develop Issue 18.
You can help your compiler by using local variables wherever possible. Locals tell the
compiler exactly how the data will be used, enabling it to easily reorder the loads and
stores for this data.
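For instance, here's a minimal sketch, with an invented structure type, of why locals
help. If the loop accumulated directly into *total, the compiler would have to reload
list->count on every iteration, because *total might point into the list itself; with
locals, it knows better.

typedef struct {
    long count;
    long *elements;
} MyList;   /* invented for illustration */

void SumElements(MyList *list, long *total)
{
    long count = list->count;        /* loaded once into locals, */
    long *elements = list->elements; /* then held in registers */
    long sum = 0;                    /* accumulate in a local, */
    long i;

    for (i = 0; i < count; i++)
        sum += elements[i];
    *total = sum;                    /* store to memory just once */
}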
You should also watch memory dereferences carefully, especially double dereferences.
Although it may be obvious to you, the compiler often can't tell whether two pointers
address the same object in memory. Just because two operations contain dereferences,
the compiler may be prevented from reordering them, since it can't tell whether the
operations are really dependent on each other. Here's an example:
paramBlock->size = myStructure->size;
paramBlock->offset = myStructure->offset;
Although it appears obvious, the compiler usually can't tell if paramBlock references
the same memory as myStructure. In the resulting binary, the compiler will be
conservative and not reorder these operations for best execution. Replacing the
dereference of myStructure with local variables for size and offset will allow the
compiler to fully optimize this example.
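Here's a sketch of that rewrite, with an invented structure type standing in for the
real ones:

typedef struct {
    long size;
    long offset;
} Record;   /* invented for illustration */

void CopyFields(Record *paramBlock, Record *myStructure)
{
    /* Reading both fields into locals first removes any possible
       dependency through memory, so the compiler is free to
       reorder the loads and stores. */
    long size = myStructure->size;
    long offset = myStructure->offset;

    paramBlock->size = size;
    paramBlock->offset = offset;
}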
INSTRUCTION THRASHING
Your code binary itself can cause the cache to thrash as it's loaded for execution. This is
very hard to detect and optimize. The basic problem is that your subroutines may map
to the same areas of the cache, and frequent calls among them will stall to reload the
cache. Some code profilers for RISC workstations have attempted to detect this
problem, but for the Macintosh I can't suggest much help. Just changing the link
order of your code and then profiling again may have an effect; some link orders
will thrash more than others.
DATA STRUCTURES
The layout of your data structures can greatly affect your cache usage and your
memory usage in general. For example, memory accesses that cross 64-bit memory
boundaries take twice as long to process, as this forces two bus transactions. On the
PowerPC 601 processor, any misaligned data access within a memory boundary takes
the standard amount of time, which (because of typical Macintosh data structures) is a
valuable feature of the chip; future PowerPC processors, however, may take longer to
access misaligned data. If you can align your data structures, do so now. A good tactic is
to keep 64-bit data at the top of your structure, followed by your 32-bit data, and so
on to prevent accidental misalignment of elements. Pad the end of the structure to an
even 64-bit increment if you will have arrays of structures or will allocate them on
the stack. And if certain parts of your structure are accessed much more often than
other parts, keep these together so that they stay in the cache, and make sure they're
aligned.
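Here's a sketch of a structure laid out by those rules; the field names are invented
for illustration:

struct SpriteRecord {
    double   position;   /* 64-bit data first */
    long     count;      /* then 32-bit data */
    long     flags;
    short    state;      /* then 16-bit data */
    short    type;
    char     tag;        /* 8-bit data last */
    char     pad[3];     /* pad to an even 8-byte multiple */
};
/* sizeof(struct SpriteRecord) == 24, so arrays of these
   stay 64-bit aligned */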
DON'T FORGET
The memory usage of your speed-critical code will greatly affect its performance
today, and current problems will just get worse when PowerPC processors go above
100 MHz. Profile your code to find the most critical bottlenecks; then pay close
attention to how that code addresses memory. You'll be rewarded with an excellent
return on your investment.
DAVE EVANS occasionally uses the combinatorics skills he learned at the
Massachusetts Institute of Technology, but more often he's been practicing his
combination punches at a Thai kickboxing gym. Designing fast algorithms for Apple's
OS Platforms Group is definitely rewarding, but developing a fast left hook really gets
him pumped up.
Thanks to Tom Adams, Mike Cappella, Rob Johnston, and Mike Neil for reviewing this
column.