Balance of Power: Sleuthing Through Your Code

DAVE EVANS

The night was well advanced, but the bright glow of fluorescent
lamps misrepresented time. As I sat back in my comfortable chair,
rubbing tired eyes, I wondered what the venerable but fictional Mr.
Sherlock Holmes would offer me as advice. Perhaps because I was
so weary from the long hours of debugging, I easily imagined Mr.
Holmes sitting near me in a tweed suit smoking his pipe. Certainly
he would address me as he once addressed his compatriot Dr.
Watson, with a slightly condescending tone, and he would tell me
that in my debugging I was missing the key iota of information.

At that moment, a solitary number seemed brighter on my monitor. Perhaps I have an
overactive imagination, but it seemed as if MacsBug were magically illuminating that
crucial, overlooked information. My computer was at interrupt level 2, yet it was
waiting for a driver request to complete. How could I have missed the interrupt level
earlier? It was no wonder that the computer froze. My software had most likely called
the driver synchronously at exactly the wrong time. The voice of Mr. Holmes rang
again in my ears. This time he quoted from that unfortunate story "A Case of Identity"
when he said, "It has long been an axiom of mine that the little things are infinitely the
most important."

Sir Arthur's famous detective was unsurpassed as an observer of detail. He believed
that keen attention to all things -- even the mundane -- was the key to good detective
work. In debugging software, I've found this advice is also true. Although many
software bugs can be solved quite easily, the most challenging problems demand more
attention. This is especially true of crashes or freezes in your software. To find the
detail we need for those, we often have to go below source-level tools and get
comfortable with lower-level aids.

In this column I'll take you through some low-level debugging techniques. I'll start
with basic strategy and then discuss particular methods and examples. Although many
details will be PowerPC-specific, much of the information here is useful on all
Macintosh computers.

THE STRATEGY OF A SLEUTH

The experienced engineer starts with a basic strategy when faced with a troublesome
software crash or freeze. The strategy is similar to Mr. Holmes's approach to solving
difficult crimes. Using the scientific method, he starts by collecting key information
and details. When he has finished researching, he begins to analyze the information and
eliminates hypothesis after hypothesis. Once close to a solution, he seeks out more
detail to narrow his suspects to a single culprit. Similarly, your strategy for
debugging software should start with careful observation and research. Then you
should hypothesize, test your theories, and collect more detail. This narrowing
approach will draw you closer to the pernicious coding error in your software.

It's tempting when faced with a difficult crash to experiment instead of researching it
first. But beware! Don't just reimplement your code with new approaches until it
stops crashing. Though some may cynically suggest that that's the Macintosh way to
program, don't be lulled into this strategy. I've found that it usually produces unstable
code and ultimately takes longer than researching the original problem.

In researching a crash or a freeze, the private bug detective should first ask these few
basic questions:

What kind of crash or freeze is this?
What code did the computer stop in?
How did I get to that code?

For these, you'll need a low-level debugger (such as MacsBug). Let's look at each one
in turn.

GET YOUR BEARINGS

The first step is to determine the kind of problem you've got. For crashes there are a
number of possible problems, including the all-too-familiar illegal instruction and
bus errors. Note that PowerPC exception handlers don't currently distinguish between
these or other types. In MacsBug the correct type will be reported, but your debugger
may instead describe all crashes as general spurious interrupts or type 11 errors.

If your crash is from an illegal instruction error, it's possible that the processor
jumped to an invalid address or the intended code moved in memory. In this case you'll
notice (in a disassembly where execution stopped) that most instructions are invalid
or nonsense. This can also occur if the emulator tries to emulate PowerPC code, or if
the processor tries to execute 680x0 code as PowerPC code. Try disassembling
memory as both PowerPC code (using ipp pc) and 680x0 code (using ip pc).

If your crash is from a bus error, the most likely cause is an invalid address in some
register. Disassemble memory where execution stopped and examine the instructions.
If there are instructions that dereference registers, inspect those registers for
addresses that aren't in a valid range. If you're debugging 680x0 code on a Power
Macintosh, you'll need to look at all the instructions near the crash, because the
680x0 emulator won't tell you exactly which instruction caused the error.

Researching a freeze requires a different approach. If the freeze prevents you from
using any debugging tools, you must isolate the offending code by watching the
computer execute up to the freeze. Setting breakpoints, tracing, and stopping execution
at known locations will bring you closer. This approach is slow but will lead you to the
code that caused the error or to the state that prompted it. If the computer is frozen
but you can still use debugging tools, it's very possible that you're in an infinite loop.

THE LAYOUT OF THE CRIME SCENE

Sherlock Holmes sometimes astonished readers by deducing crimes just from hearing
second-hand details. He was also known, however, to walk the back alleys of London and
gumshoe the scene of a crime when necessary. Learning the layout of the crime scene
was crucial for a number of his deductions. When staring at your newly crashed
software, do you recognize the code that your debugger is displaying? Disassemble
memory near the location of the crash and snoop around for clues. Check for the
following to determine how your computer came to this final resting place:

If you're using MacsBug, use the wh pc command to check where the code
is.
Display memory and disassemble from the beginning of the code's block of
memory.
Does the code nearby reference strings or Gestalt selectors?
Look for text symbols and strings in the code.

If you've crashed in PowerPC code, most low-level debuggers will give great
information about where you are. This is because most PowerPC code is registered and
linked using the Code Fragment Manager, which these debuggers can access for hints.
For example, if you use the wh pc command in MacsBug, after crashing in PowerPC
code you'll see something like this:

 Address 000BAE18 is in the System heap
    at 00002800 at NQDColor2Index+00018
 The address is in a CFM fragment "NQD"

 It is 0001AD28 bytes into this heap block:
     Start    Length      Tag  Mstr Ptr Lock
  * 000A00F0 0003DB00+04   R   00002AC4   L

Here we see that the computer crashed at a location 24 bytes from the beginning of the
NQDColor2Index routine. This routine is in the NQD (or Native QuickDraw) code
fragment. Since this address is close to the beginning of the routine, we can
disassemble from its start and examine the six instructions that executed before the
crash for more clues:

Disassembling PowerPC code from bae00
  NQDColor2Index
    +00000 000BAE00   li      r5,0x0000
    +00004 000BAE04   lwz     r4,TheGDevice(r0)
    +00008 000BAE08   sth     r5,QDErr(r0)
    +0000C 000BAE0C   stw     r31,-0x0004(SP)
    +00010 000BAE10   lwz     r5,0x0000(r4)
    +00014 000BAE14   addi    r31,r3,0x0000
    +00018 000BAE18  *lwz     r3,0x000C(r5)

A bus error at NQDColor2Index+00018 would occur if register R5 contained an
invalid address. Look at the register display to validate that hypothesis. Notice in the
code that R5 is a dereference of R4, which comes from the low-memory global
TheGDevice. Here we crashed because TheGDevice had become invalid, so now your
investigation turns toward that global.
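
The address arithmetic in this example is worth checking whenever a display looks
suspicious. Here's a minimal Python sketch of that check. It isn't a MacsBug feature;
the names are invented for illustration, the values are copied from the displays above,
and it assumes only that every PowerPC instruction is four bytes long.

```python
# Values copied from the heap block and disassembly displays above.
HEAP_BLOCK_START = 0x000A00F0   # "Start" column of the heap block
OFFSET_IN_BLOCK = 0x0001AD28    # "It is 0001AD28 bytes into this heap block"
ROUTINE_START = 0x000BAE00      # first instruction of NQDColor2Index
PPC_INSTRUCTION_SIZE = 4        # PowerPC instructions are a fixed 4 bytes


def crash_address(block_start, offset):
    """Address of the stopped instruction, derived from the heap block display."""
    return block_start + offset


def instructions_before(crash, routine_start):
    """Number of instructions between the routine's start and the crash."""
    return (crash - routine_start) // PPC_INSTRUCTION_SIZE


crash = crash_address(HEAP_BLOCK_START, OFFSET_IN_BLOCK)
offset_in_routine = crash - ROUTINE_START

print(f"crash address      {crash:08X}")
print(f"offset in routine  +{offset_in_routine:05X} ({offset_in_routine} bytes)")
print(f"instructions executed first: {instructions_before(crash, ROUTINE_START)}")
```

If the heap block arithmetic and the routine offset don't agree on one address,
distrust the display before you distrust the code.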

A freeze will typically occur because of a double page fault or exception or because of
an infinite loop. Synchronous driver calls will also freeze if called when the interrupt
level is above 0. A double fault or exception is common only if you're writing driver
software. Your computer can handle only one page fault or exception at a time. A double
fault or exception occurs when software that services a fault subsequently causes a
second fault. For example, disk drivers are sometimes called by the Virtual Memory
Manager to help service page faults; therefore, if you develop a disk driver you must
take care not to cause page faults since you may be asked to service one as well.

A good way to detect infinite loops is to trace for a few instructions using your
debugger. If you notice the same set of instructions being repetitively executed, you
could be in an infinite loop. Look at branch instructions for clues to why the loop isn't
completing. A special case of these loops is the vSyncWait routine. It looks like this:

MOVE.W      $0010(A0),D0
BGT.S       *-$0004

This tight loop is waiting for the two-byte value located 16 bytes from register A0 to
become 0 or negative. This is a standard sequence to wait for a driver request to
complete. The driver request is described in an IOParam record pointed to by register
A0. When the driver is done servicing the request, it will interrupt the loop and
modify the ioResult field 16 bytes into that record. It will then return from the
interrupt, and the loop will complete normally. A freeze in this loop means the driver
isn't servicing the request. If you typed dm a0 iopb in MacsBug, you might see
something like this:

 Displaying IOParamBlockRec at 000003A4
  000003A4  qLink              NIL
  000003A8  qType              0002
  000003AA  ioTrap             A003
  000003AC  ioCmdAddr          NIL
  000003B0  ioCompletion       NIL
  000003B4  ioResult           0001
  000003B6  ioNamePtr          NIL
  000003BA  ioVRefNum          0008
  000003BC  ioRefNum           FFDF
  000003BE  ioVersNum          #0
  000003BF  ioPermssn          #23
  000003C0  ioMisc             NIL
  000003C4  ioBuffer           01C7E2B0
  000003C8  ioReqCount         00010000
  000003CC  ioActCount         00010000
  000003D0  ioPosMode          0001
  000003D2  ioPosOffset        1B84AA00

Take note of the ioTrap and ioRefNum fields. In this case, ioTrap is $A003, which is
the synchronous Write trap (a synchronous Read would be $A002). Using the drvr dcmd
in MacsBug, you'll find that the driver with refNum $FFDF is .ASYC00, which is the
SCSI driver. This hang, then, occurs during a synchronous Write call to the SCSI
driver. Perhaps I should next check the current interrupt level.
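
To make the wait loop concrete, here's a rough Python model of it. This is a sketch
under stated assumptions, not system code: the parameter block is a plain byte buffer,
the two field offsets come from the display above, and the service_interrupts callback
and max_polls limit are inventions that let the example terminate. The real loop has no
such limit, which is exactly why it freezes.

```python
import struct

IO_RESULT_OFFSET = 0x10   # ioResult is 16 bytes into the parameter block
IO_REFNUM_OFFSET = 0x18   # ioRefNum, from the display above ($3BC - $3A4)


def read_word(memory, offset):
    """Read a signed, big-endian 16-bit value, as MOVE.W sees it on a 680x0."""
    return struct.unpack_from(">h", memory, offset)[0]


def sync_wait(param_block, service_interrupts, max_polls=1000):
    """Poll ioResult until it is 0 or negative; return None if it never changes."""
    for _ in range(max_polls):
        result = read_word(param_block, IO_RESULT_OFFSET)  # MOVE.W $0010(A0),D0
        if result <= 0:                                    # BGT.S falls through
            return result
        service_interrupts(param_block)  # hypothetical stand-in for interrupts
    return None  # the real loop would spin here forever


def driver_completes(block):
    """The driver's interrupt handler stores noErr (0) in ioResult."""
    struct.pack_into(">h", block, IO_RESULT_OFFSET, 0)


def interrupts_blocked(block):
    """The interrupt level is too high, so the driver never gets to run."""


# A pending request: ioResult is 1 and ioRefNum is $FFDF, as in the display.
pb = bytearray(0x32)
struct.pack_into(">h", pb, IO_RESULT_OFFSET, 1)
struct.pack_into(">H", pb, IO_REFNUM_OFFSET, 0xFFDF)

completed = sync_wait(bytearray(pb), driver_completes)
frozen = sync_wait(bytearray(pb), interrupts_blocked)
ref_num = read_word(pb, IO_REFNUM_OFFSET)
print(completed, frozen, ref_num)
```

Note that $FFDF read as a signed 16-bit value is -33; driver reference numbers are
negative, so that's the form you'll usually see in documentation.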

HOW DID WE GET THERE?

After a long, ponderous silence, while sharply focused on the current enigma, Holmes
might startle you by saying, "Let us reconstruct, Watson." Then he would describe the
probable series of events that preceded that particular criminal act. If the
reconstruction wasn't adequate to identify a perpetrator, at least it would review the
crucial discoveries so far. It would show Holmes's appreciable progress toward a
solution. Similarly, while in the midst of a difficult debugging task, you should
reconstruct the turn of events to gain extremely helpful information.

Figuring out what happened, once the computer is stopped cold in a crash or a freeze,
isn't easy. In effect, you're looking for footsteps in the sand that are often obscured or
covered with other false marks. For this task, the technique we most often use is the
stack crawl.

Procedural programming on the Macintosh uses a stack. For each procedure call, the
stack is added to, and vital clues such as return addresses and stack frame pointers are
left for us to find. In PowerPC code, the link register adds to our clues: it holds the
return address of the most recent call, so it often points back to the penultimate
procedure of interest. Your low-level
debugger will certainly have a stack crawl tool to use as well.

In MacsBug, the sc and sc7 commands are your basic stack-crawling aids. Start your
search with the sc command, which looks for stack frames. Frames are structures
found on the stack containing both the return address and a pointer to the previous
frame. In PowerPC code the frames also contain a standard area to preserve basic
registers. Fortunately, frames are required in PowerPC code and follow a standard
format. Most 680x0 compilers will generate stack frames as well, although much of
the 680x0 system software was written in assembly language without frames. If
during your crash you have a valid stack frame address in register A6 or R1, the sc
command will show you a history of the code that executed before your software's
demise. Listing 1 shows the result of a basic sc command.

Listing 1. Display from the sc command

 Calling chain using A6/R1 links
  Back chain  ISA  Caller
  01C8A0AC    68K  01C139CA  'CODE 0001 0F6E Main'+03A1A
  01C8A0A0    68K  01C132EA  'CODE 0001 0F6E Main'+0333A
  01C89F4A    68K  00058748  'scod BFB1 011C'+01A38
  01C89E6A    68K  00064090  'scod BFB1 011C'+0D380
  01C89E40    68K  408787FC  CHECKUPDATESEARCH+0003E
  01C89E16    68K  40878426  __GETSUBWINDOWS+000D6

In this example the first two links are in a CODE resource from file number $0F6E.
Use the MacsBug file command to determine which file they were loaded from. It's
likely that they're from the current application, and the return addresses displayed in
the Caller column (01C139CA and 01C132EA) are most likely in the application's
binary. The return addresses listed are crucial to your sleuthing. They not only point
out where execution would have returned to but, more important, they show which
instructions were recently executed: the ones just before the return address. Those
addresses are your footprints in the sand. They are clues in your reconstruction, and
they hint at the turn of events that led to the crash or freeze.
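
The idea behind a frame-based crawl can be sketched in a few lines of Python. This is
an illustration only, not how MacsBug is implemented. It assumes the usual 680x0 layout
in which each frame holds the caller's saved A6 followed by the return address, and it
uses a small simulated stack whose three frames are reconstructed from the addresses in
Listings 1 and 2. The function names and the limit parameter are hypothetical.

```python
import struct


def read_long(memory, base, address):
    """Read a big-endian 32-bit value from simulated memory, or None if unmapped."""
    offset = address - base
    if offset < 0 or offset + 4 > len(memory):
        return None
    return struct.unpack_from(">L", memory, offset)[0]


def crawl_frames(memory, base, a6, limit=32):
    """Follow the A6 back chain, collecting (frame, return address) pairs."""
    chain = []
    frame = a6
    while len(chain) < limit:
        if frame % 2:                      # frame addresses must be even
            break
        back_chain = read_long(memory, base, frame)       # caller's saved A6
        return_addr = read_long(memory, base, frame + 4)  # where we'd return to
        if back_chain is None or return_addr is None:
            break                          # walked off the memory we can see
        chain.append((frame, return_addr))
        if back_chain <= frame:            # links must move toward the stack base
            break
        frame = back_chain
    return chain


# Simulated stack: (frame address, back chain, return address).
STACK_LOW = 0x01C89DF0
stack = bytearray(0x58)
for frame, back, ret in [
    (0x01C89DF0, 0x01C89E16, 0x40878426),
    (0x01C89E16, 0x01C89E40, 0x408787FC),
    (0x01C89E40, 0x01C89E6A, 0x00064090),
]:
    struct.pack_into(">LL", stack, frame - STACK_LOW, back, ret)

chain = crawl_frames(stack, STACK_LOW, a6=0x01C89DF0)
for frame, return_addr in chain:
    print(f"frame {frame:08X}  caller {return_addr:08X}")
```

This sketch lists the most recent caller first, whereas the sc display in Listing 1
puts the most recent caller last. The crawl stops as soon as a link is odd, points the
wrong way, or leaves known memory, which is why one damaged frame hides everything
beyond it.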

Note the third and fourth lines in Listing 1, which show return addresses in an 'scod'
resource. Those 'scod' resources implement the Process Manager. It's possible that the
application binary, probably at the instruction just before address 1C132EA, made a
call to the Process Manager.

The fifth and sixth lines of the display show return addresses in the Macintosh ROM.
The symbols are shown because I've installed a ROM map file in my MacsBug
Preferences folder. You should use the provided ROM map file for your computer,
because it will often give you better stack crawl information. You can also deduce that
these return addresses are in the ROM from the addresses themselves. Most Macintosh
ROMs begin at memory address $40800000. PCI-based Macintosh ROMs currently
begin at $FFC00000, and PowerPC processor-based PowerBook ROMs at
$40000000. You can determine the beginning address of your ROM by looking at the
ROMBase low-memory global. In MacsBug, for example, type dl ROMBase to display
the beginning ROM address.
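
Sorting return addresses into ROM and non-ROM is a simple range test. The Python sketch
below shows the idea using the callers from Listing 1. The 4 MB ROM size is my
assumption for illustration, not something the debugger reports, so adjust it for your
machine; the function name is likewise made up.

```python
ASSUMED_ROM_SIZE = 4 * 1024 * 1024   # assumption: a 4 MB ROM


def in_rom(address, rom_base, rom_size=ASSUMED_ROM_SIZE):
    """True if address falls inside the ROM that starts at rom_base."""
    return rom_base <= address < rom_base + rom_size


ROM_BASE = 0x40800000   # the value ROMBase holds on most machines, per the text

# Return addresses from the Caller column of Listing 1.
callers = [0x01C139CA, 0x01C132EA, 0x00058748, 0x00064090, 0x408787FC, 0x40878426]

rom_callers = [address for address in callers if in_rom(address, ROM_BASE)]
for address in callers:
    where = "ROM" if in_rom(address, ROM_BASE) else "RAM"
    print(f"{address:08X}  {where}")
```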

The sc7 command in MacsBug gives you less precise information. In cases when you
don't have stack frames, you can ask your debugger to display all possible return
addresses on the stack. Your debugger will intelligently guess which values on the
stack are possible return addresses, but most of the information displayed will be
extraneous. You must pick through this information for clues -- an arduous task. The
stack frame-based crawl is neat and tidy, whereas the same situation would produce
the sc7 display shown in Listing 2. I've added an asterisk (*) on each line that's also
in the sc command's display.

Listing 2. Display from the sc7 command

Return addresses on the stack
 Stack Addr Frame Addr ISA  Caller
  01C8A0B0             68K  01C16D62 'CODE 0001 0F6E Main'+06DB2
  01C8A0A4   01C8A0A0  68K  01C139CA 'CODE 0001 0F6E Main'+03A1A    *
  01C8A094             68K  40849116 UNLOADSEG+00046
  01C8A06A   01C8A066  68K  409CFFFC DISPTABLE+8D0BC
  01C8A018             68K  4087EAF0 GETRESOURCE+000B2
  01C8A00E             68K  408806F6
  01C8A008             PPC  00094BE8 EmToNatEndMoveParams+00014
  01C89FF8             68K  0011ACDA
  01C89FE0             68K  4087ECFE VRMGRSTDENTRY+000B0
  01C89FDC             68K  4087ECFE VRMGRSTDENTRY+000B0
  01C89FD8             68K  0011A5B4
  01C89F4E   01C89F4A  68K  01C132EA 'CODE 0001 0F6E Main'+0333A    *
  01C89F4A             68K  01C8A09E
  01C89F22   01C89F1E  68K  00058748 'scod BFB1 011C'+01A38         *
  01C89F1E             68K  01C89F48
  01C89EDE   01C89EDA  68K  00163E30
  01C89EDA             68K  01C89F1C
  01C89E62             68K  01C8AFBE
  01C89E44   01C89E40  68K  00064090 'scod BFB1 011C'+0D380         *
  01C89E1A   01C89E16  68K  408787FC CHECKUPDATESEARCH+0003E        *
  01C89DF4   01C89DF0  68K  40878426 __GETSUBWINDOWS+000D6          *
  01C89DE2             68K  4087876E CALCANCESTORRGNS+0002A
  01C89DDE             68K  001191E6

In this example, there were a number of values on the stack that might have been valid
return addresses. The six we saw in the sc command's display are there. Many of the
other lines will not be relevant return addresses, because many procedures reserve
space on the stack but don't always use it or initialize it. There will often be old return
addresses in that unused part of the stack. These old return addresses are like very
faint footprints in the sand -- from some previous execution -- and they may tell you
what occurred much earlier in time. More often, though, they'll just be distracting and
irrelevant to your search.
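
To see why this kind of crawl is so noisy, consider a naive version of it. The Python
sketch below is not MacsBug's actual heuristic, which isn't described here. It assumes
one simple rule: report any even-aligned long word on the stack whose value is even and
lands in a known code range. The stack contents borrow a few values from Listing 2, and
the end of the application's code range is a guess made for illustration.

```python
import struct

# Code ranges as (low, high) pairs. The application range starts where the
# CODE segment in the listings begins; its end is a hypothetical guess.
APP_CODE = (0x01C0FFB0, 0x01C20000)
ROM_CODE = (0x40800000, 0x40C00000)   # assumes a 4 MB ROM


def scan_for_return_addresses(stack, stack_low, code_ranges):
    """Report every word-aligned long on the stack that points into known code."""
    hits = []
    for offset in range(0, len(stack) - 3, 2):    # 680x0 stacks are word aligned
        value = struct.unpack_from(">L", stack, offset)[0]
        if value % 2 == 0 and any(lo <= value < hi for lo, hi in code_ranges):
            hits.append((stack_low + offset, value))
    return hits


# A 32-byte slice of simulated stack starting at $01C8A094.
STACK_LOW = 0x01C8A094
stack = bytearray(0x20)
struct.pack_into(">L", stack, 0x00, 0x40849116)   # UNLOADSEG+00046, possibly stale
struct.pack_into(">L", stack, 0x0C, 0x01C8A0AC)   # a saved A6, not a code address
struct.pack_into(">L", stack, 0x10, 0x01C139CA)   # return address in a real frame
struct.pack_into(">L", stack, 0x1C, 0x01C16D62)   # another candidate

hits = scan_for_return_addresses(stack, STACK_LOW, [APP_CODE, ROM_CODE])
for stack_addr, value in hits:
    print(f"{stack_addr:08X}  {value:08X}")
```

The scan can't tell a live return address from a leftover one. Both pass the same test,
so it's up to you to decide which footprints are fresh.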

Be very wary of an sc7 command when tracing through PowerPC code. PowerPC code
typically has large stack frames, at least 56 bytes for each procedure, and the code
often doesn't use all those bytes. This will cause many old return addresses to stay in
the unused parts of the stack frame, and those old addresses will appear in your sc7
command's display.

Sometimes you'll notice that the sc and sc7 commands fail to work. In MacsBug, you
may see the error

Bad stack: stack pointer must be even and
   <= stack base

There's more than one stack that the system uses, but the stack base that MacsBug
refers to in this error is the application stack's base or top address. The sc and sc7
commands first check to see if the A6, A7, and R1 registers point to locations below
the application stack's base. If they don't, MacsBug returns this error. The executing
code may be using a different stack, however. Many parts of the Mac OS system
software use separate stacks. To force MacsBug to execute a stack crawl anyway,
specify the register to use and the amount of memory to search through. For example,
the MacsBug commands sc7 a7 4000 and sc a6 4000 will execute a stack crawl
even if the A6 and A7 registers point above the application stack's base.

System stacks vary in size from about 8000 bytes up to 48000 bytes. There's no easy
way to determine the base of a system stack that's in use. If you don't get interesting
clues from 16384 bytes ($4000 in hex), vary the number of bytes you specify and
compare your results.
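
The check behind the "Bad stack" error, and the effect of forcing a crawl, can be
summarized in a short Python sketch. The two function names and the sample addresses
are hypothetical, and I'm assuming that a forced crawl searches upward from the
register you name for the number of bytes you give.

```python
def stack_pointer_ok(sp, stack_base):
    """The test from the error message: even, and at or below the stack base."""
    return sp % 2 == 0 and sp <= stack_base


def forced_crawl_range(sp, byte_count=0x4000):
    """Memory range searched when you name a register and a byte count."""
    return (sp, sp + byte_count)


APP_STACK_BASE = 0x01C8B000    # hypothetical application stack base

for sp in (0x01C89DDE, 0x01C89DDF, 0x00200000, 0x02000000):
    verdict = "ok" if stack_pointer_ok(sp, APP_STACK_BASE) else "Bad stack"
    print(f"{sp:08X}  {verdict}")

low, high = forced_crawl_range(0x02000000)
print(f"forced crawl covers {low:08X} through {high - 1:08X}")
```

A pointer into some other stack that happens to sit below the application stack's base
passes this test too, so a crawl that runs without complaint isn't proof that you're
looking at the application's stack.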

ELEMENTARY, OF COURSE

Don't be pacified by source-level debuggers. Lower-level tools give you a much better
understanding of the Mac OS and your code. These tools also give you the ability to
research the most complicated problems. Strive to be a software sleuth, and you'll gain
some truly useful expertise.

DAVE EVANS still works at Apple in the Mac OS System Software group. He always
enjoyed Sherlock Holmes stories while he was growing up, and he was excited to learn
that most of the stories are no longer protected under copyright and are easily
accessible on the Internet (see the 221B Baker Street Web page at
http://www.contrib.andrew.cmu.edu/u/mset/holmes.html).

Thanks to Geoff Chatterton, Doug Clarke, Michael Dautermann, and Tim Maroney for
reviewing this column.