The night was well advanced, but the bright glow of fluorescent
lamps misrepresented time. As I sat back in my comfortable chair,
rubbing tired eyes, I wondered what the venerable but fictional Mr.
Sherlock Holmes would offer me as advice. Perhaps because I was
so weary from the long hours of debugging, I easily imagined Mr.
Holmes sitting near me in a tweed suit smoking his pipe. Certainly
he would address me as he once addressed his compatriot Dr.
Watson, with a slightly condescending tone, and he would tell me
that in my debugging I was missing the key iota of information.
At that moment, a solitary number seemed brighter on my monitor. Perhaps I have an
overactive imagination, but it seemed as if MacsBug were magically illuminating that
crucial, overlooked information. My computer was at interrupt level 2, yet it was
waiting for a driver request to complete. How could I have missed the interrupt level
earlier? It was no wonder that the computer froze. My software had most likely called
the driver synchronously at exactly the wrong time. The voice of Mr. Holmes rang
again in my ears. This time he quoted from that unfortunate story "A Case of Identity"
when he said, "It has long been an axiom of mine that the little things are infinitely the
most important."
Sir Arthur's famous detective was unsurpassed as an observer of detail. He believed
that keen attention to all things -- even the mundane -- was the key to good detective
work. In debugging software, I've found this advice is also true. Although many
software bugs can be solved quite easily, the most challenging problems demand more
attention. This is especially true of crashes or freezes in your software. To find the
detail we need for those, we often have to go below source-level tools and get
comfortable with lower-level aids.
In this column I'll take you through some low-level debugging techniques. I'll start
with basic strategy and then discuss particular methods and examples. Although many
details will be PowerPC-specific, much of the information here is useful on all
Macintosh computers.
The experienced engineer starts with a basic strategy when faced with a troublesome
software crash or freeze. The strategy is similar to Mr. Holmes's approach to solving
difficult crimes. Using the scientific method, he starts by collecting key information
and details. When he has finished researching, he begins to analyze the information and
eliminates hypothesis after hypothesis. Once close to a solution, he seeks out more
detail to narrow his suspects to a single culprit. Similarly, your strategy for
debugging software should start with careful observation and research. Then you
should hypothesize, test your theories, and collect more detail. This narrowing
approach will draw you closer to the pernicious coding error in your software.
It's tempting when faced with a difficult crash to experiment instead of researching it
first. But beware! Don't just reimplement your code with new approaches until it
stops crashing. Though some may cynically suggest that that's the Macintosh way to
program, don't be lulled into this strategy. I've found that it usually produces unstable
code and ultimately takes longer than researching the original problem.
In researching a crash or a freeze, the private bug detective should first ask a few
basic questions: What kind of problem is it? Where did it occur? What happened just
before it?
For these, you'll need a low-level debugger (such as MacsBug). Let's look at each one
in turn.
The first step is to determine the kind of problem you've got. For crashes there are a
number of possible problems, including the all-too-familiar illegal instruction and
bus errors. Note that PowerPC exception handlers don't currently distinguish between
these or other types. In MacsBug the correct type will be reported, but your debugger
may instead describe all crashes as general spurious interrupts or type 11 errors.
If your crash is from an illegal instruction error, it's possible that the processor
jumped to an invalid address or the intended code moved in memory. In this case you'll
notice (in a disassembly where execution stopped) that most instructions are invalid
or nonsense. This can also occur if the emulator tries to emulate PowerPC code, or if
the processor tries to execute 680x0 code as PowerPC code. Try disassembling
memory as both PowerPC code (using ipp pc) and 680x0 code (using ip pc).
If your crash is from a bus error, the most likely cause is an invalid address in some
register. Disassemble memory where execution stopped and examine the instructions.
If there are instructions that dereference registers, inspect those registers for
addresses that aren't in a valid range. If you're debugging 680x0 code on a Power
Macintosh, you'll need to look at all the instructions near the crash, because the
680x0 emulator won't tell you exactly which instruction caused the error.
Researching a freeze requires a different approach. If the freeze prevents you from
using any debugging tools, you must isolate the offending code by watching the
computer execute up to the freeze. Setting breakpoints, tracing, and stopping execution
at known locations will bring you closer. This approach is slow but will lead you to the
code that caused the error or to the state that prompted it. If the computer is frozen
but you can still use debugging tools, it's very possible that you're in an infinite loop.
Sherlock Holmes sometimes astonished readers by deducing crimes just from hearing
second-hand details. He was also known, however, to walk the back alleys of London and
gumshoe the scene of a crime when necessary. Learning the layout of the crime scene
was crucial for a number of his deductions. When staring at your newly crashed
software, do you recognize the code that your debugger is displaying? Disassemble
memory near the location of the crash and snoop around for clues. Check for the
following to determine how your computer came to this final resting place:
If you've crashed in PowerPC code, most low-level debuggers will give great
information about where you are. This is because most PowerPC code is registered and
linked using the Code Fragment Manager, which these debuggers can access for hints.
For example, if you use the wh pc command in MacsBug, after crashing in PowerPC
code you'll see something like this:
Address 000BAE18 is in the System heap
at 00002800 at NQDColor2Index+00018
The address is in a CFM fragment "NQD"
It is 0001AD28 bytes into this heap block:
Start Length Tag Mstr Ptr Lock
* 000A00F0 0003DB00+04 R 00002AC4 L
Here we see that the computer crashed at a location 24 bytes from the beginning of the
NQDColor2Index routine. This routine is in the NQD (or Native QuickDraw) code
fragment. Since this address is close to the beginning of the routine, we can
disassemble from its start and examine the six instructions that executed before the
crash for more clues:
Disassembling PowerPC code from bae00
NQDColor2Index
+00000 000BAE00 li r5,0x0000
+00004 000BAE04 lwz r4,TheGDevice(r0)
+00008 000BAE08 sth r5,QDErr(r0)
+0000C 000BAE0C stw r31,-0x0004(SP)
+00010 000BAE10 lwz r5,0x0000(r4)
+00014 000BAE14 addi r31,r3,0x0000
+00018 000BAE18 *lwz r3,0x000C(r5)
A bus error at NQDColor2Index+00018 would occur if register R5 contained an
invalid address. Look at the register display to validate that hypothesis. Notice in the
code that R5 is a dereference of R4, which comes from the low-memory global
TheGDevice. Here we crashed because TheGDevice had become invalid, so now your
investigation turns toward that global.
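The offset arithmetic in this example is easy to check. The following Python sketch is only an illustration and not a MacsBug feature; the function name is hypothetical, and the one assumption it relies on is that every PowerPC instruction is 4 bytes long. The crash address is the one marked with an asterisk in the disassembly.

```python
# Illustrative helper (hypothetical name), not part of MacsBug.
PPC_INSTRUCTION_SIZE = 4  # every PowerPC instruction is 4 bytes long


def locate_in_routine(crash_address, offset):
    """Return (routine_start, instructions_before_crash) for a PowerPC crash."""
    routine_start = crash_address - offset
    instructions_before = offset // PPC_INSTRUCTION_SIZE
    return routine_start, instructions_before


# NQDColor2Index+00018, with the crash address taken from the disassembly.
start, count = locate_in_routine(0x000BAE18, 0x18)
print(f"routine starts at {start:08X}; {count} instructions precede the crash")
```

The same subtraction works for any symbol-plus-offset pair your debugger reports.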
A freeze will typically occur because of a double page fault or exception or because of
an infinite loop. Synchronous driver calls will also freeze if called when the interrupt
level is above 0. A double fault or exception is common only if you're writing driver
software. Your computer can handle only one page fault or exception at a time. A double
fault or exception occurs when software that services a fault subsequently causes a
second fault. For example, disk drivers are sometimes called by the Virtual Memory
Manager to help service page faults; therefore, if you develop a disk driver you must
take care not to cause page faults since you may be asked to service one as well.
A good way to detect infinite loops is to trace for a few instructions using your
debugger. If you notice the same set of instructions being repetitively executed, you
could be in an infinite loop. Look at branch instructions for clues to why the loop isn't
completing. A special case of these loops is the vSyncWait routine. It looks like this:
MOVE.W $0010(A0),D0
BGT.S *-6
This tight loop is waiting for the two-byte value located 16 bytes from register A0 to
become 0 or negative. This is a standard sequence to wait for a driver request to
complete. The driver request is described in an IOParam record pointed to by register
A0. When the driver is done servicing the request, it will interrupt the loop and
modify the ioResult field 16 bytes into that record. It will then return from the
interrupt, and the loop will complete normally. A freeze in this loop means the driver
isn't servicing the request. If you typed dm a0 iopb in MacsBug, you might see
something like this:
Displaying IOParamBlockRec at 000003A4
000003A4 qLink NIL
000003A8 qType 0002
000003AA ioTrap A003
000003AC ioCmdAddr NIL
000003B0 ioCompletion NIL
000003B4 ioResult 0001
000003B6 ioNamePtr NIL
000003BA ioVRefNum 0008
000003BC ioRefNum FFDF
000003BE ioVersNum #0
000003BF ioPermssn #23
000003C0 ioMisc NIL
000003C4 ioBuffer 01C7E2B0
000003C8 ioReqCount 00010000
000003CC ioActCount 00010000
000003D0 ioPosMode 0001
000003D2 ioPosOffset 1B84AA00
Take note of the ioTrap and ioRefNum fields. In this case, ioTrap is $A003, which is
the synchronous Write trap. Using the drvr dcmd in MacsBug, you'll find that the
driver with refNum $FFDF is .ASYC00, which is the SCSI driver. This hang, then,
occurs during a synchronous Write call to the SCSI driver. Perhaps I should next check
the current interrupt level.
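To see why the interrupt level matters, here is a deliberately simplified Python model of the wait loop. It is a sketch under stated assumptions, not how the Device Manager is implemented: the function and parameter names are hypothetical, the driver's completion interrupt is assumed to arrive at a single fixed priority, and an interrupt is assumed to be serviced only when its priority is higher than the level the processor is currently running at.

```python
# Simplified model of a synchronous driver wait; all names are hypothetical.
IN_PROGRESS = 1  # ioResult stays positive while the request is pending
NO_ERR = 0


def sync_wait(param_block, driver_interrupt_level, cpu_interrupt_level,
              max_polls=1000):
    """Poll ioResult the way the MOVE.W / BGT.S loop does.

    Returns the number of polls needed, or None if the loop did not finish
    within max_polls (this model's stand-in for a freeze).
    """
    for poll in range(1, max_polls + 1):
        # Assumption: the completion interrupt gets through only if its
        # priority is higher than the processor's current interrupt level.
        if driver_interrupt_level > cpu_interrupt_level:
            param_block["ioResult"] = NO_ERR
        if param_block["ioResult"] <= 0:  # BGT.S falls through
            return poll
    return None


print(sync_wait({"ioResult": IN_PROGRESS}, 2, 0))  # completes
print(sync_wait({"ioResult": IN_PROGRESS}, 2, 2))  # never completes
```

In the second call the processor is already at the driver's interrupt level, so the completion never arrives and ioResult stays positive for as long as you care to poll.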
After a long, ponderous silence, while sharply focused on the current enigma, Holmes
might startle you by saying, "Let us reconstruct, Watson." Then he would describe the
probable series of events that preceded that particular criminal act. If the
reconstruction wasn't adequate to identify a perpetrator, at least it would review the
crucial discoveries so far. It would show Holmes's appreciable progress toward a
solution. Similarly, while in the midst of a difficult debugging task, you should
reconstruct the turn of events to gain extremely helpful information.
Figuring out what happened, once the computer is stopped cold in a crash or a freeze,
isn't easy. In effect, you're looking for footsteps in the sand that are often obscured or
covered with other false marks. For this task, the technique we most often use is the
stack crawl.
Procedural programming on the Macintosh uses a stack. For each procedure call, the
stack is added to, and vital clues such as return addresses and stack frame pointers are
left for us to find. In PowerPC code, the link register adds to our clues and is
guaranteed to point back to the penultimate procedure of interest. Your low-level
debugger will certainly have a stack crawl tool to use as well.
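A frame-based crawl amounts to following a linked list. The Python sketch below walks a simulated stack; it is an illustration only, and it assumes the usual 680x0 LINK A6 layout, in which the long word at the frame pointer holds the caller's frame pointer and the long word 4 bytes above it holds the return address. The function name and all memory contents are invented for the example.

```python
# Sketch of a frame-based stack crawl over a simulated stack.
# The memory contents below are invented for illustration.

def crawl_frames(memory, frame_pointer, max_frames=32):
    """Follow saved frame pointers, collecting (frame, return_address) pairs.

    Assumes the 680x0 LINK A6 layout: the long at the frame pointer is the
    caller's frame pointer, and the long 4 bytes above it is the return
    address. Stops at a zero or unreadable link, or after max_frames.
    """
    chain = []
    while frame_pointer in memory and len(chain) < max_frames:
        return_address = memory.get(frame_pointer + 4)
        if return_address is None:
            break
        chain.append((frame_pointer, return_address))
        previous = memory[frame_pointer]
        if previous <= frame_pointer:  # links must move toward the stack base
            break
        frame_pointer = previous
    return chain


# Simulated stack: address -> 32-bit value (all values are made up).
stack = {
    0x1000: 0x1020, 0x1004: 0x00AA0010,  # innermost frame
    0x1020: 0x1050, 0x1024: 0x00AA0200,
    0x1050: 0x0000, 0x1054: 0x00BB0300,  # outermost frame, link is zero
}
for frame, ret in crawl_frames(stack, 0x1000):
    print(f"{frame:08X}  {ret:08X}")
```

The check that each link moves toward higher addresses is a cheap guard against a corrupted chain that would otherwise loop forever.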
In MacsBug, the sc and sc7 commands are your basic stack-crawling aids. Start your
search with the sc command, which looks for stack frames. Frames are structures
found on the stack containing both the return address and a pointer to the previous
frame. In PowerPC code the frames also contain a standard area to preserve basic
registers. Fortunately, frames are required in PowerPC code and follow a standard
format. Most 680x0 compilers will generate stack frames as well, although much of
the 680x0 system software was written in assembly language without frames. If
during your crash you have a valid stack frame address in register A6 or R1, the sc
command will show you a history of which code execution preceded your software's
demise. Listing 1 shows a basic sc command's result.
Listing 1. Display from the sc command
Calling chain using A6/R1 links
 Back chain  ISA  Caller
 01C8A0AC    68K  01C139CA 'CODE 0001 0F6E Main'+03A1A
 01C8A0A0    68K  01C132EA 'CODE 0001 0F6E Main'+0333A
 01C89F4A    68K  00058748 'scod BFB1 011C'+01A38
 01C89E6A    68K  00064090 'scod BFB1 011C'+0D380
 01C89E40    68K  408787FC CHECKUPDATESEARCH+0003E
 01C89E16    68K  40878426 __GETSUBWINDOWS+000D6
In this example the first two links are in a CODE resource from file number $0F6E.
Use the MacsBug file command to determine which file they were loaded from. It's
likely that they're from the current application, and the return addresses displayed in
the Caller column (01C139CA and 01C132EA) are most likely in the application's
binary. The return addresses listed are crucial to your sleuthing. They not only point
out where execution would have returned to but, more important, they show which
instructions were recently executed: the ones just before the return address. Those
addresses are your footprints in the sand. They are clues in your reconstruction, and
they hint at the turn of events that led to the crash or freeze.
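As a quick sanity check on a stack crawl, you can subtract each reported offset from its return address. The short Python sketch below does this for the two 'CODE 0001' entries in the sc display; the helper name is hypothetical, and the numbers are copied from that display. If both entries really lie in the same resource, they should yield the same starting address.

```python
# Values copied from the sc display; the helper name is hypothetical.
def segment_base(return_address, offset):
    """Address at which the named resource or routine begins."""
    return return_address - offset


first = segment_base(0x01C139CA, 0x03A1A)   # 'CODE 0001 0F6E Main'+03A1A
second = segment_base(0x01C132EA, 0x0333A)  # 'CODE 0001 0F6E Main'+0333A
print(f"{first:08X} {second:08X}")
```

Both subtractions give 01C0FFB0, so the two return addresses do sit in the same loaded segment.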
Note the third and fourth lines in Listing 1, which show return addresses in an 'scod'
resource. Those 'scod' resources implement the Process Manager. It's possible that the
application binary, probably at the instruction just before address 01C132EA, made a
call to the Process Manager.
The fifth and sixth lines of the display show return addresses in the Macintosh ROM.
The symbols are shown because I've installed a ROM map file in my MacsBug
Preferences folder. You should use the provided ROM map file for your computer,
because it will often give you better stack crawl information. You can also deduce that
these return addresses are in the ROM from the addresses themselves. Most Macintosh
ROMs begin at memory address $40800000. PCI-based Macintosh ROMs currently
begin at $FFC00000, and PowerPC processor-based PowerBook ROMs at
$40000000. You can determine the beginning address of your ROM by looking at the
ROMBase low-memory global. In MacsBug, for example, type dl ROMBase to display
the beginning ROM address.
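Once you know ROMBase, sorting return addresses into ROM and RAM is a simple range test. The Python sketch below shows the idea; the function name is hypothetical, and the 4 MB ROM size is purely an assumption for illustration, so substitute the size that applies to your machine.

```python
# rom_size is an assumption for illustration (4 MB here); check your machine.
def is_rom_address(address, rom_base, rom_size=0x400000):
    """True if address falls within [rom_base, rom_base + rom_size)."""
    return rom_base <= address < rom_base + rom_size


ROM_BASE = 0x40800000  # the value of ROMBase on the machine in this example
for caller in (0x408787FC, 0x40878426, 0x00058748, 0x01C139CA):
    where = "ROM" if is_rom_address(caller, ROM_BASE) else "RAM"
    print(f"{caller:08X} {where}")
```

The first two addresses come from the ROM entries in the sc display; the other two are the 'scod' and 'CODE' return addresses, which land in RAM.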
The sc7 command in MacsBug gives you less precise information. In cases when you
don't have stack frames, you can ask your debugger to display all possible return
addresses on the stack. Your debugger will intelligently guess which values on the
stack are possible return addresses, but most of the information displayed will be
extraneous. You must pick through this information for clues -- an arduous task. The
stack frame-based crawl is neat and tidy, whereas the same situation would produce
the sc7 display shown in Listing 2. I've added an asterisk (*) on each line that's also
in the sc command's display.
Listing 2. Display from the sc7 command
Return addresses on the stack
 Stack Addr  Frame Addr  ISA  Caller
 01C8A0B0                68K  01C16D62 'CODE 0001 0F6E Main'+06DB2
 01C8A0A4    01C8A0A0    68K  01C139CA 'CODE 0001 0F6E Main'+03A1A *
 01C8A094                68K  40849116 UNLOADSEG+00046
 01C8A06A    01C8A066    68K  409CFFFC DISPTABLE+8D0BC
 01C8A018                68K  4087EAF0 GETRESOURCE+000B2
 01C8A00E                68K  408806F6
 01C8A008                PPC  00094BE8 EmToNatEndMoveParams+00014
 01C89FF8                68K  0011ACDA
 01C89FE0                68K  4087ECFE VRMGRSTDENTRY+000B0
 01C89FDC                68K  4087ECFE VRMGRSTDENTRY+000B0
 01C89FD8                68K  0011A5B4
 01C89F4E    01C89F4A    68K  01C132EA 'CODE 0001 0F6E Main'+0333A *
 01C89F4A                68K  01C8A09E
 01C89F22    01C89F1E    68K  00058748 'scod BFB1 011C'+01A38 *
 01C89F1E                68K  01C89F48
 01C89EDE    01C89EDA    68K  00163E30
 01C89EDA                68K  01C89F1C
 01C89E62                68K  01C8AFBE
 01C89E44    01C89E40    68K  00064090 'scod BFB1 011C'+0D380 *
 01C89E1A    01C89E16    68K  408787FC CHECKUPDATESEARCH+0003E *
 01C89DF4    01C89DF0    68K  40878426 __GETSUBWINDOWS+000D6 *
 01C89DE2                68K  4087876E CALCANCESTORRGNS+0002A
 01C89DDE                68K  001191E6
In this example, there were a number of values on the stack that might have been valid
return addresses. The six we saw in the sc command's display are there. Many of the
other lines will not be relevant return addresses, because many procedures reserve
space on the stack but don't always use it or initialize it. There will often be old return
addresses in that unused part of the stack. These old return addresses are like very
faint footprints in the sand -- from some previous execution -- and they may tell you
what occurred much earlier in time. More often, though, they'll just be distracting and
irrelevant to your search.
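If you find yourself picking through a long sc7 display, it can help to filter the candidates yourself. The Python sketch below applies one possible filter: keep only even values that fall inside code ranges you already know about. This is my own assumption about a useful heuristic, not a description of how MacsBug chooses its candidates; the function name and the end addresses of both ranges are hypothetical.

```python
# A hand-rolled filter for stack values; not MacsBug's own heuristic.
def candidate_return_addresses(stack_words, code_ranges):
    """Keep (stack_address, value) pairs whose value looks like code.

    stack_words: list of (stack_address, value) pairs read from the stack.
    code_ranges: list of (start, end) address ranges known to hold code.
    """
    hits = []
    for stack_address, value in stack_words:
        if value % 2 != 0:
            continue  # instructions never start at odd addresses
        if any(start <= value < end for start, end in code_ranges):
            hits.append((stack_address, value))
    return hits


# Three entries copied from the sc7 display.
words = [
    (0x01C8A0B0, 0x01C16D62),
    (0x01C8A094, 0x40849116),
    (0x01C89F4A, 0x01C8A09E),
]
# Range ends are assumptions for illustration.
ranges = [(0x01C0FFB0, 0x01C20000),   # the application's CODE segment
          (0x40800000, 0x40C00000)]   # ROM
for stack_address, value in candidate_return_addresses(words, ranges):
    print(f"{stack_address:08X} {value:08X}")
```

This filter is stricter than the display in Listing 2: it drops 01C8A09E, which is itself an address within the stack rather than an address in code.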
Be very wary of an sc7 command when tracing through PowerPC code. PowerPC code
typically has large stack frames, at least 56 bytes for each procedure, and the code
often doesn't use all those bytes. This will cause many old return addresses to stay in
the unused parts of the stack frame, and those old addresses will appear in your sc7
command's display.
Sometimes you'll notice that the sc and sc7 commands fail to work. In MacsBug, you
may see the error
Bad stack: stack pointer must be even and <= stack base
There's more than one stack that the system uses, but the stack base that MacsBug
refers to in this error is the application stack's base or top address. The sc and sc7
commands first check to see if the A6, A7, and R1 registers point to locations below
the application stack's base. If they don't, MacsBug returns this error. The executing
code may be using a different stack, however. Many parts of the Mac OS system
software use separate stacks. To force MacsBug to execute a stack crawl anyway,
specify the register to use and the amount of memory to search through. For example,
the MacsBug commands sc7 a7 4000 and sc a6 4000 will execute a stack crawl
even if the A6 and A7 registers point above the application stack's base.
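The test behind the "Bad stack" message can be written in one line. The Python sketch below mirrors the rule stated in the error text (the stack pointer must be even and no higher than the stack base); the function name and the stack base value are hypothetical, chosen only to make the example concrete.

```python
# Mirrors the rule in the error message; names and values are hypothetical.
def stack_pointer_ok(stack_pointer, stack_base):
    """True if the pointer is even and at or below the stack base."""
    return stack_pointer % 2 == 0 and stack_pointer <= stack_base


APP_STACK_BASE = 0x01C8B000  # made-up application stack base
print(stack_pointer_ok(0x01C8A0AC, APP_STACK_BASE))  # even, below the base
print(stack_pointer_ok(0x01C8A0AD, APP_STACK_BASE))  # odd
print(stack_pointer_ok(0x01D00000, APP_STACK_BASE))  # above the base
```

Only the first call passes. The third case is the one you hit when the executing code is running on a separate system stack, which is when the explicit register and byte count become necessary.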
System stacks vary in size from about 8000 bytes up to 48000 bytes. There's no easy
way to determine the base of a system stack that's in use. If you don't get interesting
clues from 16384 bytes ($4000 in hex), vary the number of bytes you specify and
compare your results.
Don't be pacified by source-level debuggers. Lower-level tools give you a much better
understanding of the Mac OS and your code. These tools also give you the ability to
research the most complicated problems. Strive to be a software sleuth, and you'll gain
some truly useful expertise.
DAVE EVANS still works at Apple in the Mac OS System Software group. He always
enjoyed Sherlock Holmes stories while he was growing up, and he was excited to learn
that most of the stories are no longer protected under copyright and are easily
accessible on the Internet (see the 221B Baker Street Web page at
http://www.contrib.andrew.cmu.edu/u/mset/holmes.html).
Thanks to Geoff Chatterton, Doug Clarke, Michael Dautermann, and Tim Maroney for
reviewing this column.