See if you can solve this programming puzzle, presented in the form of a dialog
between guest puzzlers Dave Hersey and Cameron Esfahani (cam). The dialog gives
clues to help you. Keep guessing until you're done; your score is the number to the left
of the clue that gave you the correct answer. Even if you never run into the particular
problems being solved here, you'll learn some valuable debugging techniques that will
help you solve your own programming conundrums.
Dave Hey cam, it's kinda quiet. Where are KON and BAL?
cam Since the local salad bar closed, I haven't seen KON. BAL disappeared after he left
the video game industry. Have you been getting enough sleep? You look tired.
Dave I've been under a lot of pressure to track down this bug.
cam Maybe I can help. What's the problem?
Dave I have a Power Mac 6100/66 running System 7.5 with QuickDraw GX 1.1.
When I try to print from a word processor, I get the message "The application has
unexpectedly quit, because an error of type 11 occurred." What's an error of type 11?
cam That's an unhandled exception from native code. What word processor are you
using?
Dave Um, a very large one in a very large office suite from a very large company up
north.
cam Have you updated to version 1.1.3 of QuickDraw GX?
Dave Yeah. The problem still happens.
cam Does it happen on any other machine?
Dave Yes. It crashes on any Power Mac but works fine on 680x0 machines.
cam Hmm. Is the word processor native on the Power Mac?
Dave Yes -- it's fat.
cam It sure is. But I have the same version of system software and the same word
processor, yet my machine doesn't crash.
Dave Well, I have a standard system installed, but I added a bunch of whizzy fonts.
cam If I install one of your fonts, will my machine crash?
Dave Sometimes. If you install all my fonts, it crashes all the time.
cam That's easy, then: bad fonts. Here, take out this Thingamajigs font.
Dave No way, man. This is a standard bitmap-only font. It should work. Ike's machine
doesn't have Thingamajigs on it and his machine still crashes.
cam Does he have bitmap-only fonts installed?
Dave Yes.
cam At what point in the printing process do you crash?
Dave The crash occurs just as the application starts spooling the print file.
cam Is this word processor QuickDraw GX-aware?
Dave Yes. It has support for the new QuickDraw GX print dialogs, and it calls the
QuickDraw GX translator to translate QuickDraw drawing commands into QuickDraw
GX shapes during printing.
cam Good for them. Have you tried to reproduce the crash with other QuickDraw
GX-aware applications?
Dave Yup. I tried to reproduce it with several QuickDraw GX-aware and QuickDraw
GX-savvy applications. No luck.
cam Try running the 680x0 version of this program on your Power Mac. It will be
slow and piggy, but try it anyway.
Dave The problem went away! So, the crash seems to have something to do with the
PowerPC code in this application.
cam Hmm. Let's install MacsBug and take a look at this from the debugger.
Dave I tried that before, but I couldn't see any symbols in the PowerPC code where it
crashes. I couldn't tell which routine the PC was in.
cam You should install the new version of MacsBug. Version 6.5.2 understands native
exceptions and can use embedded symbols.
Dave Nifty. . . . OK, I've done that. But I still crash.
cam Why do you crash? Type how.
Dave MacsBug claims that there was a "PowerPC access exception at 001DB030
ConstructNFNTDirectory+002B4."
cam What does ConstructNFNTDirectory do? Hey, wait, there's Alex Beaman. Alex,
can you help us out here?
Alex Sure. QuickDraw GX views all fonts as type 'sfnt'. It's really elegant:
ConstructNFNTDirectory will make an NFNT font appear to have an 'sfnt' directory. It
can build either just the directory header or the entire directory, and this is
controlled by a Boolean parameter passed into the function. OK, gotta run!
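To see why that Boolean matters so much, it helps to know the directory's size. Per the TrueType 'sfnt' format, the directory starts with a 12-byte offset table (version, numTables, searchRange, entrySelector, rangeShift), followed by one 16-byte entry per table. The C below is a hypothetical model, not QuickDraw GX's code. It shows that a header-only request needs 12 bytes, while the whole directory needs 12 + 16 × numTables. It also computes the binary-search fields. We are only assuming, from its name, that this is what the ComputeSearchFields call in the listing does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes from the TrueType 'sfnt' format: a 12-byte offset table
   (version, numTables, searchRange, entrySelector, rangeShift)
   followed by one 16-byte entry (tag, checkSum, offset, length) per table. */
enum { kSfntHeaderSize = 12, kSfntEntrySize = 16 };

typedef struct {
    uint16_t numTables;
    uint16_t searchRange;    /* largest power of 2 <= numTables, times 16 */
    uint16_t entrySelector;  /* log2 of that power of 2 */
    uint16_t rangeShift;     /* numTables * 16 - searchRange */
} SfntSearchFields;

/* Hypothetical stand-in for ComputeSearchFields: we are only guessing
   from its name that it fills in these binary-search hints. */
static SfntSearchFields compute_search_fields(uint16_t numTables) {
    SfntSearchFields f = { numTables, 0, 0, 0 };
    if (numTables == 0)
        return f;                       /* no tables, no search hints */
    uint32_t pow2 = 1;
    while (pow2 * 2 <= numTables) {     /* find largest power of 2 <= numTables */
        pow2 *= 2;
        f.entrySelector++;
    }
    f.searchRange = (uint16_t)(pow2 * kSfntEntrySize);
    f.rangeShift = (uint16_t)((uint32_t)numTables * kSfntEntrySize - f.searchRange);
    return f;
}

/* Bytes the directory occupies, following Alex's description of the
   Boolean: false builds only the header, true builds the whole directory. */
static size_t directory_bytes(uint16_t numTables, bool wholeDirectory) {
    return kSfntHeaderSize +
           (wholeDirectory ? (size_t)numTables * kSfntEntrySize : 0);
}
```

With 10 tables, for example, a caller that allocated room for the 12-byte header but got the whole 172-byte directory would have 160 bytes written past the end of its buffer.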
Dave Thanks, Alex. When I disassemble ConstructNFNTDirectory with MacsBug, I get
this:
ilp ConstructNFNTDirectory
Disassembling PowerPC code from ConstructNFNTDirectory
ConstructNFNTDirectory
+00000 001DAD7C stmw r14,-0x0048(SP)
+00004 001DAD80 mflr r0
+00008 001DAD84 clrlwi r27,r5,0x18
+0000C 001DAD88 addi r28,r3,0x0000
+00010 001DAD8C mfcr r12
...
+00060 001DADDC addi r3,r30,0x0000
+00064 001DADE0 addi r4,r28,0x0000
+00068 001DADE4 bl GetNoLoadResource
...
+000E4 001DAE60 addi r3,r20,0x0000
+000E8 001DAE64 bl ComputeSearchFields
+000EC 001DAE68 crmove cr7_SO,cr7_SO
+000F0 001DAE6C cmpwi cr2,r27,0x0000
...
+002B4 001DB030 *lwzx r5,r19,r5
...
+002F0 001DB06C lhz r5,0x0004(r20)
+002F4 001DB070 li r16,0x0001
+002F8 001DB074 addic r5,r5,0x0001
+002FC 001DB078 sth r5,0x0004(r20)
+00300 001DB07C beq cr2,ConstructNFNTDirectory+00324
...
+003C8 001DB144 addic SP,SP,0x00A0
+003CC 001DB148 mtcrf 0x38,r12
+003D0 001DB14C mtlr r0
+003D4 001DB150 lmw r16,-0x0040(SP)
+003D8 001DB154 blr
cam An access exception means we're trying to read or write to an invalid address.
That, of course, could be caused by many things, such as uninitialized variables or
trashed memory. Let's check the heaps with hc.
Dave Both the system heap and the application heap are fine.
cam OK, I restart the program and use brp in MacsBug to set a breakpoint at
ConstructNFNTDirectory. brp is just like br, except it works for PowerPC code.
After I start printing and the breakpoint is hit, I step through this function to follow
the code flow.
Dave At offset 0x0300 you don't take that branch, and you eventually begin executing
code that will corrupt the QuickDraw GX heap.
cam But that's wrong -- we should've taken that branch. The caller didn't ask
ConstructNFNTDirectory to create the entire directory, just its header; it didn't
allocate enough space for all of it. Check the heaps again.
Dave The heaps seem fine. QuickDraw GX allocates out of its own heap, which MacsBug
doesn't know about. Even if it did know about it, it wouldn't be able to tell us if the heap
was corrupt, as QuickDraw GX has its own memory manager.
cam Darn, memory corruption bugs are the worst. You can trash memory and not see
the effects of it until you're miles away from that code. OK, why didn't it take the
branch at offset 0x0300?
Dave Well, CR2 is true, so the branch won't be taken.
cam How can you tell that CR2 is true?
Dave The PowerPC chip has eight condition register fields, CR0 through CR7, stored
in nibbles in a 32-bit condition register (Dave Evans talked about this in his column
in develop Issue 21). The chip numbers its bits from 0 through 31, from left to right,
so CR2 occupies bits 8 through 11 of the condition register. Each field holds four
flags: less than, greater than, equal, and summary overflow. We can tell that CR2
contains a true value because its equal bit (bit 2 of the field, counting from 0, or
bit 10 of the whole register) isn't set. Since that bit is 0, the comparison that set
this field found its operands not equal.
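Dave's bit arithmetic can be checked in a few lines of C. This sketch assumes only what he describes: eight 4-bit fields in a 32-bit CR, with bit 0 as the most significant bit, and, within each field, the standard PowerPC order LT, GT, EQ, SO. The helper names are our own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions within a 4-bit CR field, counting from the left (bit 0). */
enum { kCrLT = 0, kCrGT = 1, kCrEQ = 2, kCrSO = 3 };

/* Extract field crN (0..7) from the 32-bit condition register.
   CR0 is the most significant nibble (bits 0-3), so CR2 is bits 8-11. */
static unsigned cr_field(uint32_t cr, unsigned crN) {
    return (cr >> (28 - 4 * crN)) & 0xFu;
}

/* Test one bit of a field, using PowerPC's left-to-right numbering. */
static bool cr_bit(uint32_t cr, unsigned crN, unsigned bit) {
    return (cr_field(cr, crN) >> (3 - bit)) & 1u;
}
```

For instance, a CR value of 0x00200000 has only CR bit 10 set, which is CR2's EQ bit (a "false" Boolean in this story), while 0x00400000 has CR2's GT bit set and EQ clear, which is what Dave calls true.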
cam Who sets up CR2?
Dave The code at offset 0x00F0. As Alex mentioned, one of the parameters to this
function is a Boolean that controls whether the whole directory is created or only the
header. Parameters passed in PowerPC code are put from left to right into registers R3
through R10; since this parameter is the third parameter to the function, it's passed
to the routine in register R5. (A much better description of this is in Inside
Macintosh: PowerPC System Software.) At offset 0x0008, clrlwi keeps just the low byte
of R5 and copies it into R27. Because the parameter is a Boolean, the code at 0x00F0
can simply compare R27 against 0 and keep the result in CR2 as a flag for later
branches.
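Here is a C model of the three instructions involved: clrlwi r27,r5,0x18 at offset 0x0008 keeps only the low 8 bits of R5 (the Boolean byte), cmpwi cr2,r27,0x0000 at offset 0x00F0 records the comparison in CR2, and beq cr2 at offset 0x0300 branches when the EQ bit is set, that is, when the Boolean is false and only the header is wanted. The function names are illustrative, not Apple's, and we assume the SO flag is clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* clrlwi rA,rS,n: clear the leftmost n bits of rS (0 <= n < 32).
   With n = 0x18 (24), only the low byte, here the Boolean, survives. */
static uint32_t clrlwi(uint32_t rs, unsigned n) {
    assert(n < 32);
    return rs & (0xFFFFFFFFu >> n);
}

/* cmpwi crN,rA,imm: signed compare, returning the 4-bit field
   LT GT EQ SO as 8/4/2/1. SO mirrors XER[SO]; we assume it is clear. */
static unsigned cmpwi(uint32_t ra, int16_t imm) {
    int32_t a = (int32_t)ra;
    if (a < imm) return 0x8;   /* LT */
    if (a > imm) return 0x4;   /* GT */
    return 0x2;                /* EQ */
}

/* beq crN,target: taken when the field's EQ bit is set. */
static bool beq_taken(unsigned field) {
    return (field & 0x2u) != 0;
}
```

With an illustrative R5 of 0xFFFFFF01 (junk in the upper bytes, Boolean true), the field comes out GT and the branch falls through to build the whole directory. With a false Boolean such as 0x12345600, masking leaves 0, the EQ bit is set, and the branch is taken. Without the mask, that same register would have compared as nonzero.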
cam I love this chip. I'll reexecute the program and get back to the start of this
function and examine CR2.
Dave It starts out false.
cam So someone's trashing it along the way. Well, we can't use some of our normal
tricks for detecting when memory gets trashed. One problem is that step spy doesn't
work yet for PowerPC. Another problem is that we would want to step spy on CR2,
which is a register, and step spy never worked on registers. We'll have to do this the
hard way: let's step through this function, watching CR2 to see just when it gets
changed.
Dave The subroutine GetNoLoadResource at offset 0x0068 changes CR2 from false to
true. GetNoLoadResource is a wrapper to GetResource.
cam I restart the program and trace over the GetResource call.
Dave Yep, that's the function that trashes CR2.
cam Is it legal for the compiler to rely on CR2 being preserved across function calls?
Dave Yes. According to the PowerPC ABI (Application Binary Interface) documentation
(section 3.6 in the first edition), CR2 through CR4 are nonvolatile: a function that
changes them must save them on entry and restore them before it returns.
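ConstructNFNTDirectory itself shows this kind of save and restore: mfcr r12 at offset 0x0010, and mtcrf 0x38,r12 in the epilog. The C model below, our own illustration, implements mtcrf's field mask. The leftmost of the 8 mask bits selects CR0, so with a mask of 0x38 only CR2, CR3, and CR4 are written back. Those are the fields this routine restores for its caller.

```c
#include <assert.h>
#include <stdint.h>

/* mtcrf FXM,rS: for each CR field i (0..7), if mask bit i is set, copy
   that 4-bit field from rS into CR. Mask bit i is (0x80 >> i), so the
   leftmost mask bit selects CR0 and 0x38 selects CR2, CR3, and CR4. */
static uint32_t mtcrf(uint32_t cr, uint8_t fxm, uint32_t rs) {
    for (unsigned i = 0; i < 8; i++) {
        if (fxm & (0x80u >> i)) {
            uint32_t fieldMask = 0xF0000000u >> (4 * i);
            cr = (cr & ~fieldMask) | (rs & fieldMask);
        }
    }
    return cr;
}
```

For example, if CR2's EQ bit was set on entry (saved value 0x00200000) and something later changed CR2 to GT (current value 0x80400000), mtcrf 0x38 puts CR2 back while leaving the volatile CR0 alone, giving 0x80200000. A callee that changes CR2 without such a restore hands its caller a different CR2 from the one it saved.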
cam Look at the code for GetResource. Since in System 7.5 GetResource is a native
trap with a routine descriptor, I can use the MacsBug dcmd drd to dump that out.
Here's what I get:
drd GetResource
The RoutineDescriptor at: 011EDFEC
Mixed Mode Magic Trap: AAFE, version: 07,
routine descriptor flags: 00 (NotIndexable),
loadLocation: 00000000, reserved2: 00,
selectorInfo: 00 (No Selector),
routine count: 0000
--- Routine Record 00000000 -----
procInfo: 000002F0, reserved1: 00, ISAType: 01 (kPowerPCISA),
Routine Flags: 0004 (IsAbsolute, IsPrepared, NativeISA,
PassSelector, IsNotDefault), procPtr: 01219EEC,
storedOffset: 00000000, selector: 00000000
Dave There's only one routine associated with the trap and it's the native
implementation.
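The procInfo field is worth decoding too, because it confirms we're looking at GetResource. This sketch assumes the Mixed Mode Manager's ProcInfoType encoding as we recall it from MixedMode.h. Bits 0-3 hold the calling convention (0 is kPascalStackBased), bits 4-5 hold the result size, and each stack parameter takes 2 bits starting at bit 6. The size codes are 0 = none, 1 = one byte, 2 = two bytes, and 3 = four bytes. Here, bit 0 is the least significant bit, unlike the register numbering Dave used earlier. The constant and function names are our own.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ProcInfoType layout (names are ours, not Apple's). */
enum {
    CONV_MASK    = 0xF,  /* bits 0-3: calling convention */
    RESULT_PHASE = 4,    /* bits 4-5: result size code */
    PARAM_PHASE  = 6,    /* 2 bits per stack parameter from bit 6 up */
    SIZE_MASK    = 0x3
};

/* Convert a 2-bit size code to a byte count: 0, 1, 2, or 4. */
static unsigned size_code_bytes(uint32_t code) {
    static const unsigned bytes[4] = { 0, 1, 2, 4 };
    return bytes[code & SIZE_MASK];
}

static unsigned calling_convention(uint32_t procInfo) {
    return procInfo & CONV_MASK;
}

static unsigned result_bytes(uint32_t procInfo) {
    return size_code_bytes(procInfo >> RESULT_PHASE);
}

/* Parameters are numbered from 1, left to right. */
static unsigned param_bytes(uint32_t procInfo, unsigned n) {
    assert(n >= 1 && n <= 13);
    return size_code_bytes(procInfo >> (PARAM_PHASE + 2 * (n - 1)));
}
```

Under these assumptions, 0x000002F0 decodes to a Pascal stack-based routine with a 4-byte result, a 4-byte first parameter, and a 2-byte second parameter. That matches Handle GetResource(ResType theType, short theID).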
cam Where's that function? On the Power Mac, every ProcPtr actually points to a small
data structure that contains the routine's real address and its TOC (table of contents)
pointer. This structure is called a TVector (transition vector). It allows every
fragment to have its own globals, because the runtime environment loads the correct TOC
for each routine. So, to find the routine's address, you need to dereference the
ProcPtr.
Address 00E77B78 is in the "Porky WProcessor" heap at 00DFC430
The address is in a CFM fragment "Porky WProcessor" [non-write exec]
It is 00073058 bytes into this heap block:
   Start     Length       Tag  Mstr Ptr  Lock  Prg  Type  ID  File  Name
 * 00E04B20  003D35D8+0C  N
Dave Apparently it's in the heap of the application.
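Here is a minimal C model of that dereference. It assumes the two-word layout cam describes, with the code address first and then the TOC, stored big-endian as on the Power Mac. Reading the first word is what MacsBug's ^ suffix does in the ilp 1219eec^ command that follows. The TOC word in the usage check is a zero placeholder, because the dump doesn't show it. The same arithmetic also confirms the heap report: the block starts at 00E04B20, and 00E04B20 + 00073058 = 00E77B78.

```c
#include <assert.h>
#include <stdint.h>

/* A CFM transition vector: the code address followed by the TOC
   (table of contents) pointer, each a 32-bit big-endian word. */
typedef struct {
    uint32_t codeAddress;
    uint32_t tocAddress;
} TransitionVector;

static uint32_t read_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* "Dereference" a ProcPtr: p points at the TVector's bytes in memory. */
static TransitionVector load_tvector(const uint8_t *p) {
    assert(p != 0);
    TransitionVector tv = { read_be32(p), read_be32(p + 4) };
    return tv;
}
```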
cam So this program is patching GetResource. At least they have a native patch, which
is a good habit these days, because you don't know which traps will go native from now
on. If you patch native PowerPC code with 680x0 code, every call has to go through a
Mixed Mode switch, so performance-sensitive code will run slower. For this reason, you
should make all of your patches fat. Let's disassemble the patch on GetResource.
ilp 1219eec^
Disassembling PowerPC code from 1219eec^
No procedure name
00E77B78 stwu SP,-0x0058(SP)
00E77B7C mflr r12
00E77B80 stw r12,0x0060(SP)
00E77B84 stmw r26,0x0040(SP)
00E77B88 stw r3,0x0070(SP)
00E77B8C sth r4,0x0074(SP)
00E77B90 extsh r4,r4
00E77B94 lis r5,0x4D42
00E77B98 ori r5,r5,0x4446
00E77B9C cmplw cr2,r3,r5
...
00E77C10 lmw r26,0x0040(SP)
00E77C14 lwz r12,0x0060(SP)
00E77C18 mtlr r12
00E77C1C addic SP,SP,0x0058
00E77C20 blr
Dave At 0x00E77B9C they do a compare and store the result in CR2. However, they
don't save and restore CR2 across this function, so it's trashed when we return to
ConstructNFNTDirectory.
cam OK, I restart the program and manually save and restore the value of CR2 across
the GetResource calls. I do this by futzing with bit 2 in CR2.
Dave Everything prints fine.
cam It looks like a compiler bug. Either they shouldn't be using CR2 or they should be
preserving it. In any case, the GetResource patch is trashing CR2, which flips the
Boolean and makes ConstructNFNTDirectory build the entire directory instead of just the
header. The caller never allocated enough space for the extra data, so the QuickDraw GX
heap gets corrupted.
Dave Holy cow! A compiler bug. Shouldn't we notify the compiler developer?
cam Well, this company has their own in-house development tools group. They write
their own compilers, linkers, and debuggers. We should contact them anyway, so that
they can create a patch that fixes this problem. [This patch, "Office4.2x Update for
Power Mac," is now available on most online services.]
Dave Why are they patching GetResource?
cam It looks like they were looking for resources of type 'MBDF' (menu bar
definition procedures). I can tell this from the instructions at addresses 0x00E77B94
through 0x00E77B9C. The PowerPC architecture limits an immediate constant to 16 bits,
so to compare a value against a 32-bit constant, you have to build the constant with
two instructions. That's what happens at addresses 0x00E77B94 and 0x00E77B98: lis
loads 0x4D42 into the upper half of R5, and ori fills in 0x4446 as the lower half,
giving 0x4D424446. If you look at the ASCII of this constant, it's 'MBDF'. At address
0x00E77B9C, cmplw compares this constant to the resource type parameter passed to
GetResource and puts the result in CR2. Since that parameter is the first parameter,
it's in register R3.
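The lis/ori pair is easy to model in C. This sketch, with helper names of our own, builds the constant the same way the patch does and checks it against the four characters 'MBDF' packed high byte first, the way a ResType is stored. To keep it simple, it models only the equality part of cmplw.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* lis rD,imm: load imm into the upper 16 bits, clearing the lower 16. */
static uint32_t lis(uint16_t imm) { return (uint32_t)imm << 16; }

/* ori rA,rS,imm: OR a zero-extended 16-bit immediate into rS. */
static uint32_t ori(uint32_t rs, uint16_t imm) { return rs | imm; }

/* Pack four characters into a 32-bit ResType, first character highest. */
static uint32_t four_char_code(const char *s) {
    assert(s != 0);
    return ((uint32_t)(uint8_t)s[0] << 24) | ((uint32_t)(uint8_t)s[1] << 16) |
           ((uint32_t)(uint8_t)s[2] << 8)  |  (uint32_t)(uint8_t)s[3];
}

/* The patch's test: compare the ResType in R3 with the constant built
   in R5 (cmplw is an unsigned compare; only equality is modeled here). */
static bool patch_matches(uint32_t r3) {
    uint32_t r5 = ori(lis(0x4D42), 0x4446);
    return r3 == r5;
}
```

Note the side effect that caused the trouble: cmplw writes CR2 on every call, whatever the resource type, so the Boolean that ConstructNFNTDirectory kept there is lost even when the request isn't for an 'MBDF'.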
Dave Why didn't we crash when we had only one NFNT font installed?
cam This patch would cause ConstructNFNTDirectory to always overwrite the buffer
passed in. But that wouldn't always cause your machine to freak out. By adding enough
NFNT fonts, we trashed the QuickDraw GX heap significantly enough to cause the crash.
Dave Wow, all this and it was an application patch that caused the problem! It sure
would have been cool if we could have used the patch dcmd.
cam Yeah. The patch dcmd does work on the Power Mac -- but we didn't know that
was the problem when we started.
Dave It's interesting that it was an application bug. That would explain why I crash in
a spreadsheet application by the same company. They share the same patch.
cam Nasty.
Dave Yeah.
DAVE HERSEY (AppleLink HERSEY) works in the QuickDraw GX PrintShop level 4
bio-containment facility, thousands of feet beneath the Cupertino R&D campus. There,
he develops PowerPC-native QuickDraw GX printing code, works on Copland, and
relaxes by dabbling with an occasional hot agent over lunch.
CAMERON ESFAHANI (AppleLink DIRTY, Internet dirty@powertalk.apple.com) is
the shortest member of the Graphics team at Apple. To add a few more inches to his
height, he sometimes wears roller blades in meetings. If that doesn't help, he has been
known to don his large purple hat with sparkles.
SCORING
80-100  You could have a promising career writing compilers for a company up north.
45-70   Dr. MacsBug could always use another assistant.
25-40   Don't worry, it took us a while to figure it out too.
5-20    Visual Basic fan, are you?
Thanks to Alex Beaman, Tom Dowdy, Ron Voss, KON (Konstantin Othmer), and BAL
(Bruce Leak) for reviewing this column.