WRITING LOCALIZABLE APPLICATIONS

JOSEPH TERNASKY AND BRYAN K. ("BEAKER")
RESSLER

More and more
software companies are finding rich new markets overseas. Unfortunately, many of
these developers have also discovered that localizing an application involves a lot more
than translating a bunch of STR# resources. In fact, localization often becomes an
unexpectedly long, complex, and expensive development cycle. This article describes
some common problems and gives proactive engineering advice you can use during
initial U.S. development to speed your localization efforts later on.

 

Most software localization headaches are associated with text drawing and character
handling, so that's what this article stresses. Four common areas of difficulty are:

- keyboard input
- font selection
- formatting and parsing of dates, times, numbers, and currency values
- character encodings

We discuss each of these potential pitfalls in detail and provide data structures and
example code.

PRELIMINARIES

Throughout the discussion, we assume you're developing primarily for the U.S.
market, but you're planning to publish internationally eventually (or at least you're
trying to keep your options open). As you're developing your strategy, here are a few
points to keep in mind:

This article concentrates on features for western Europe and Japan because those are
the markets we're most familiar with. We encourage you to investigate other markets
on your own.

LINGO LESSON 101
This international software thing is rife with specialized lingo. For a complete
explanation of all the terms, see the hefty "Worldwide Software Overview," Chapter
14 of Inside Macintosh Volume VI. But we're not here to intimidate, so let's go over a
few basic terms.

Script. A writing system that can be used to represent one or more human languages.
For example, the Roman script is used to represent English, Spanish, Hungarian, and
so on. Scripts fall into several categories, as described in the next section, "Script
Categories."

Script code. An integer that identifies a script on the Macintosh.

Encoding. A mapping between characters and integers. Each character in the
character set is assigned a unique integer, called its character code. If a character
appears in more than one character set, it may have more than one encoding, a situation
discussed later in the section "Dealing With Character Encodings." Since each script
has a unique encoding, the terms script and encoding are sometimes used
interchangeably.

Character code. An integer that's associated with a given character in a script.

Glyph. The displayed form of a character. The glyph for a given character code may
not always be the same -- in some scripts the codes of the surrounding characters
provide a context for choosing a particular glyph.

Line orientation. The overall direction of text flow within a line. For instance,
English has left-to-right line orientation, while Japanese can use either
top-to-bottom (vertical) or left-to-right (horizontal) line orientation.

Character orientation. The relationship between a character's baseline and the
line orientation. When the line orientation and the character baselines go in the same
direction, it's called with-stream character orientation. When the line orientation
differs from the character baseline direction, it's called cross-stream character
orientation. For instance, in Japanese, when the line orientation is left-to-right,
characters are also oriented left-to-right (with-stream). Japanese can also be
formatted with a top-to-bottom (vertical) line orientation, in which case character
baselines can be left-to-right (cross-stream) or top-to-bottom (with-stream). See
Figure 1.

 Figure 1 Line and Character Orientation in Mixed Japanese/English Text

SCRIPT CATEGORIES
Scripts fall into different categories that require different software solutions. Here
are the basic categories:

There are a few exceptional scripts that fall into more than one of these categories,
such as Arabic and Urdu. Arabic, for instance, is both context sensitive and
bidirectional.

Now with the preliminaries out of the way, we're ready to discuss some localization
pitfalls.

KEYBOARD INPUT

Sooner or later, your users are going to start typing. You can't stop them. So now what
do you do? One approach is to simply ignore keyboard input. While this approach may
seem perfectly acceptable to open-minded engineers like yourself, your Marketing
colleagues will probably disagree. So, let's examine what happens when two-byte script
users type on their keyboards.

Obviously, a Macintosh keyboard doesn't have enough keys to allow users of two-byte
script systems to simply press the key corresponding to the one character they want
out of 28,000. Instead, two-byte systems are equipped with a software input method,
also called a front-end processor or FEP, which allows users to type phonetically on a
keyboard similar to the standard U.S. keyboard. (Some input methods use strokes or
codes instead of phonetics, but the mechanism is the same.)

As soon as the user begins typing, a small input window appears at the bottom of the
screen. When the user signals the input method, it displays various readings that
correspond to the typed input. These readings may include one or more two-byte
characters. There may be more than one valid reading of a given "clause" of input, in
which case the user must choose the appropriate reading.

When satisfied, the user accepts the readings, which are then flushed from the input
window and sent to the application as key-down events. Since the Macintosh was never
really designed for two-byte characters, a two-byte character is sent to the
application as two separate one-byte key-down events. Interspersed in the stream of
key-down events there may also be one-byte characters, encoded as ASCII.

Before getting overwhelmed by all this, consider two important points. First, the input
method is taking the keystrokes for you. The keystrokes the user types are not being
sent directly into your application -- they're being processed first. Second, since the
user can type a lot into the input method before accepting the processed input, you can
get a big chunk of key-down events at once.

So let's see what your main event loop should look like in its simplest form if you want
to properly accept mixed one- and two-byte characters:

// Globals
unsigned short  gCharBuf;       // Buffer that holds our (possibly
                                // two-byte) character
Boolean         gNeed2ndByte;   // Flag that tells us we're waiting
                                // for the second byte of a two-byte
                                // character

void EventLoop(void)
{
    EventRecord     event;        // The current event
    short           cbResult;     // The result of our CharByte call
    unsigned char   oneByte;      // Single byte extracted from event
    Boolean         processChar;  // Whether we should send our
                                  // application a key message
   
    if (WaitNextEvent(everyEvent, &event, SleepTime(), nil)) {
        switch (event.what) {
            . . .
            case keyDown:
            case autoKey:
                . . .
                // Your code checks for Command-key equivalents here.
                . . .
                processChar = false;
                oneByte = (event.message & charCodeMask);
                if (gNeed2ndByte) {
                    // We're expecting the second byte of a two-byte
                    // character, so shift the first byte up and OR
                    // this byte into the low byte of our accumulated
                    // two-byte character.
                    gCharBuf = (gCharBuf << 8) | oneByte;
                    cbResult = CharByte((Ptr)&gCharBuf, 1);
                    if (cbResult == smLastByte)
                        processChar = true;
                    gNeed2ndByte = false;
                } else {
                    // We're not expecting anything in particular. We
                    // might get a one-byte character, or we might
                    // get the first byte of a two-byte character.
                    gCharBuf = oneByte;
                    cbResult = CharByte((Ptr)&gCharBuf, 1);
                    if (cbResult == smFirstByte)
                        gNeed2ndByte = true;
                    else if (cbResult == smSingleByte)
                        processChar = true;
                }
               
                // Now possibly send the typed character to the rest
                // of the application.
                if (processChar)
                    AppKey(gCharBuf);
                break;

            case . . .
        }
    }
}

CharByte returns smSingleByte, smFirstByte, or smLastByte. You use this
information to determine what to do with a given key event. Notice that the AppKey
routine takes an unsigned short as a parameter. That's very important. For an
application to be two-byte script compatible, you need to always pass unsigned shorts
around for a single character. This example is also completely one-byte compatible --
if you put this event loop in your application, it works in the U.S.

The example assumes that the grafPort is set to the document window and the port's
font is set correctly, which is important because the Script Manager's behavior is
governed by the font of the current grafPort (see "Script Manager Caveats"). Although
this event loop works fine on both one-byte and two-byte systems, it could be made
more efficient. For example, since input methods sometimes send you a whole mess of
characters at a time, you could buffer up the characters into a string and send them
wholesale to AppKey, making it possible for your application to do less redrawing on
the screen.
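As a sketch of that buffering idea, completed characters could be accumulated and delivered to the application in one call. This is portable C, not Toolbox code; AppKeyString, BufferKey, and kKeyBufSize are illustrative names we've made up for this example, and a real application would insert the characters and redraw once per flush.

```c
#define kKeyBufSize 64

static unsigned short gKeyBuf[kKeyBufSize];
static int gKeyBufCount = 0;

/* Stub for the application's string-at-a-time key handler; a real
   program would insert the characters and redraw here. */
static int gCharsDelivered = 0;
static void AppKeyString(const unsigned short *chars, int count)
{
    gCharsDelivered += count;
}

/* Flush any buffered characters to the application in one call. */
static void FlushKeys(void)
{
    if (gKeyBufCount > 0) {
        AppKeyString(gKeyBuf, gKeyBufCount);
        gKeyBufCount = 0;
    }
}

/* Called once per completed (possibly two-byte) character, in place
   of calling AppKey directly from the event loop. */
static void BufferKey(unsigned short ch)
{
    gKeyBuf[gKeyBufCount++] = ch;
    if (gKeyBufCount == kKeyBufSize)    /* buffer full: flush early */
        FlushKeys();
}
```

You'd call FlushKeys whenever the event queue runs dry, so a burst of input-method characters produces one screen update instead of dozens.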

AVOIDING FONT TYRANNY

Have you ever written the following lines of code?

void DrawMessage(short messageNum)
{
    Str255          theString;

    GetIndString(theString, kMessageStrList, messageNum);
    TextFont(geneva);
    TextSize(9);
    MoveTo(kMessageXPos, kMessageYPos);
    DrawString(theString);
}

If so, you're overdue for a good spanking. While we're very proud of you for putting
that string into a resource like a good international programmer, the font, size, and
pen position are a little too, well, specific. Granted, it's hard to talk yourself out of
using all those nice constants defined in Fonts.h, but if you're trying to write a
localizable application, this is definitely the wrong approach.

A better approach is to do this:

TextFont(applFont);
TextSize(0);
GetFontInfo(&fontInfo);
MoveTo(kMessageXPos, kMessageYMargin + fontInfo.ascent +
    fontInfo.leading);

Since applFont is always a font in the system script, and TextSize(0) gives a size
appropriate to the system script, you get the right output. Plus, you're now
positioning the pen based on the font, instead of using absolute coordinates. This is
important. For instance, on a Japanese system, TextSize(0) results in a point size of
12, so the code in the preceding example might not work if the pen-positioning
constants were set up to assume a 9-point font height.

If you want to make life even easier for your localizers, you could eliminate the
pen-positioning constants altogether. Instead, use an existing resource type (the
'DITL' type is appropriate for this example) to store the layout of the text items in the
window. Even though you're drawing the items yourself, you can still use the
information in the resource to determine the layout, and the localizers can then change
the layout using a resource editor -- which is a lot better than hacking your code.

There are some other interesting ways to approach this problem. Depending on what
you're drawing, the Script Manager may be able to tell you both which font and which
size to use. Suppose you need to draw some help text. You can use the following code:

void DrawHelpText(Str255 helpText, Rect *helpZone)
{
    long fondSize;

    fondSize = GetScript(smSystemScript, smScriptHelpFondSize);
    TextFont(HiWord(fondSize));
    TextSize(LoWord(fondSize));
    NeoTextBox(&helpText[1], helpText[0], helpZone, GetSysJust(),
        0, nil, nil);
}

Here the Script Manager tells you the appropriate font and size for help text. On a U.S.
system, that would be Geneva 9; on a Japanese system, it's Osaka 9. NeoTextBox is a
fast, flexible replacement for the Toolbox routine TextBox and is Script Manager
compatible. You can learn more about NeoTextBox by reading "The TextBox You've
Always Wanted" in develop Issue 9.

The Script Manager has some other nice combinations:

smScriptMonoFondSize    // Default monospace font and size (use when
                        // you feel the urge to use Courier 12)
smScriptSmallFondSize   // Default small font and size (use when you
                        // feel the urge to use Geneva 9)
smScriptSysFondSize     // Default system font and size (use when you
                        // feel the urge to use Chicago 12)
smScriptAppFondSize     // Default application font and size (use as
                        // default document font)

The various FondSize constants are available only in System 7. If you're writing for
earlier systems, you should at least use GetSysFont, GetAppFont, and GetDefFontSize, as
described in Chapter 17 of Inside Macintosh Volume V. And if you're too lazy to do even
that, please use TextFont(0) and TextSize(0) to get the system font, which will be
appropriate for the system script. This is, by the way, how grafPorts are initialized
by QuickDraw. In other words, if you don't touch the port, it will already be set up
correctly for drawing text in the system script.

INTERNATIONAL DATING

Before you get too excited, you should know that we're not talking about the true-love
variety of date here. No, we're talking about something much more tedious -- input
and output of international dates, times, numbers, and currency values. First we'll
look at output formatting, and then input parsing.

OUTPUT OF DATES, TIMES, NUMBERS, AND CURRENCY VALUES
To output dates, times, numbers, and currency values (which we'll call formatted
values), you need to know the script you're formatting for. This can be a user
preference, or you can determine the script from the current font of the field
associated with the value you're formatting (use Font2Script). You can use these
International Utilities routines to format dates, times, and numbers:

Formatting a currency value is a bit trickier. You have to format the number and then
add the currency symbol in the right place. We'll show you how to get the currency
symbol and the positioning information from the 'itl0' resource.

First, let's look at an example of date and time formatting:

#define kWantSeconds    true            // For IUTimeString
#define kNoSeconds      false

unsigned long       secs;
Str255              theDate, theTime;


// Get the current date and time into Pascal strings.
GetDateTime(&secs);
IUDateString(secs, shortDate, theDate);
IUTimeString(secs, kNoSeconds, theTime);

Formatting a number with FormatX2Str is a little more complicated, because
FormatX2Str requires a canonical number format string (type NumFormatString)
that describes the output format. You make a NumFormatString by converting a literal
string, like

##,###.00;-##,###.00;0.00

The strings are in the format

positiveFormat;negativeFormat;zeroFormat

where the last two parts are optional. The example string would format the number
32767 as 32,767.00, -32767 as -32,767.00, and zero as 0.00. The exact format of
these strings can be quite complicated and is described in Macintosh Worldwide
Development: Guide to System Software.
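To make the three-part structure concrete, here's a standalone sketch of just the section-selection step (the actual digit placement is FormatX2Str's job). This is portable C we've written for illustration; the fallback to the positive format when a section is missing is our assumption about reasonable behavior, not a statement of what the Toolbox does.

```c
#include <string.h>

/* Given a "positive;negative;zero" format string, copy out the
   section that applies to num. Missing sections fall back to the
   positive format (an assumption for this sketch). */
static void PickFormatSection(const char *formatStr, double num,
    char *section, int sectionMax)
{
    const char *start = formatStr, *end;
    int which = (num > 0.0) ? 0 : (num < 0.0) ? 1 : 2;
    int i, len;

    /* Walk forward past 'which' semicolons. */
    for (i = 0; i < which; i++) {
        const char *semi = strchr(start, ';');
        if (semi == NULL) {        /* section missing: use positive */
            start = formatStr;
            break;
        }
        start = semi + 1;
    }
    end = strchr(start, ';');      /* section ends at next ';' or NUL */
    len = end ? (int)(end - start) : (int)strlen(start);
    if (len >= sectionMax)
        len = sectionMax - 1;
    memcpy(section, start, len);
    section[len] = '\0';
}
```

With the example string above, a positive number selects "##,###.00", a negative number selects "-##,###.00", and zero selects "0.00".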

The following handy routine formats a number using a format read from a string list.
You provide the string list resource and specify which item in the list to use when
formatting a given number.

OSErr FormatANum(short theFormat, extended theNum, Str255 theString)
{
    NItl4Handle     itl4;
    OSErr           err;
    NumberParts     numberParts;
    Str255          textFormatStr;  // "Textual" number format spec
    NumFormatString formatStr;      // Opaque number format

    // Load the 'itl4' and copy the NumberParts record out of it.
    itl4 = (NItl4Handle)IUGetIntl(4);
    if (itl4 == nil)
        return resNotFound;
    numberParts = *(NumberParts *)((char *)*itl4 +
        (*itl4)->defPartsOffset);
    // Get the format string, convert it to a NumFormatString, and
    // then use it to format the input number.
    GetIndString(textFormatStr, kFormatStrs, theFormat);
    err = Str2Format(textFormatStr, &numberParts, &formatStr);
    if (err != noErr)
        return err;
    err = FormatX2Str(theNum, &formatStr, &numberParts, theString);
    return err;
}

Given a currency value, the following routine formats the number and then adds the
currency symbol in the appropriate place. This routine assumes that you use a
particular number format for currency values, but you can easily modify it to include
an argument that specifies the format item in the string list.

OSErr FormatCurrency(extended theNum, Str255 theString)
{
    Intl0Hndl   itl0;
    OSErr       err;
    Str255      currencySymbol, formattedValue;

    // First, format the number like this: ##,###.00. FormatX2Str
    // will replace the "," and "." separators
    // appropriately for the font script.
    err = FormatANum(kCurrencyFormat, theNum, formattedValue);
    if (err != noErr)
        return err;
   
    // Get the currency symbol from the 'itl0' resource. The currency
    // symbol is stored as up to three bytes. If any of the bytes
    // aren't used they're set to zero. So, we use strncpy to copy
    // out the currency symbol as a C string and forcibly terminate
    // it in case it's three bytes long.
    itl0 = (Intl0Hndl)IUGetIntl(0);
    if (itl0 == nil)
        return resNotFound;
    strncpy((char *)currencySymbol, &(*itl0)->currSym1, 3);
    currencySymbol[3] = 0x00;
    c2pstr((char *)currencySymbol);

    // Now put the currency symbol and the formatted value together
    // according to the currency symbol position.
    if ((*itl0)->currFmt & currSymLead) {
        StringCopy(theString, currencySymbol);
        StringAppend(theString, formattedValue);
    } else {
        StringCopy(theString, formattedValue);
        StringAppend(theString, currencySymbol);
    }
    return noErr;
}

The 'itl0' resource also includes the decimal and thousands separators. These should be
the same values used by FormatX2Str, which gets these symbols from the
NumberParts structure in the 'itl4' resource.

If using the extended type in your application makes you queasy, you can easily modify
these routines to work with the Fixed type. Just use Fix2X in the FormatX2Str call to
convert the Fixed type to extended.

INPUT OF DATES, TIMES, AND NUMBERS
The Script Manager includes routines for parsing formatted values to retrieve a date,
time, or number. The process is logically the reverse of formatting a value for output.
Most applications don't even deal with formatted numbers. They just read raw
numbers (no thousands separators or currency symbols), locate the decimal
separator, convert the integer and fraction parts using StringToNum, and then put the
integer and fraction parts back together.
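The raw-number approach can be sketched in portable C (using the C library in place of the Toolbox routines). The decimal separator is passed in as a parameter here; in a real application it would come from the 'itl0' resource rather than being hard-coded.

```c
#include <string.h>
#include <stdlib.h>

/* Parse a raw number such as "1234,56": split at the localized
   decimal separator, convert the two halves, and recombine. */
static double ParseRawNumber(const char *text, char decimalSep)
{
    char buf[64];
    char *sepPos;
    long intPart, fracPart;
    double frac = 0.0;
    int fracDigits, i;

    strncpy(buf, text, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    sepPos = strchr(buf, decimalSep);
    if (sepPos != NULL) {
        *sepPos = '\0';                   /* split at the separator */
        fracPart = strtol(sepPos + 1, NULL, 10);
        fracDigits = (int)strlen(sepPos + 1);
        frac = (double)fracPart;
        for (i = 0; i < fracDigits; i++)  /* scale by digit count */
            frac /= 10.0;
    }
    intPart = strtol(buf, NULL, 10);
    if (buf[0] == '-')                    /* keep fraction's sign */
        return (double)intPart - frac;
    return (double)intPart + frac;
}
```

For example, ParseRawNumber("1234,56", ',') yields 1234.56 on a system whose decimal separator is a comma.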

DEALING WITH CHARACTER ENCODINGS

When writing Macintosh applications, most developers make certain assumptions that
cause problems when the application is used in other countries. One of these
assumptions is that all characters are represented by a single byte; another is that a
given character code always represents the same character. The first assumption
causes immediate problems because the two-byte script systems use both one-byte
and two-byte character codes. An application that relies on one-byte character codes
often breaks up a two-byte character into two one-byte characters, rendering the
application useless for two-byte text. The second assumption causes more subtle
problems, which prevent the user from mixing text in several different scripts
together in one document.

Different versions of the Macintosh system software use a different script by default.
Systems sold in the U.S. and Europe use the Roman script. Those sold in Japan, Hong
Kong, or Korea use the Japanese, traditional Chinese, or Korean script, respectively.
In addition, some sophisticated users have several script systems installed at one time,
and System 7.1 makes this even easier. Actually, even unsophisticated users can have
two script systems installed at one time. All systems have the Roman script installed,
so Japanese users, for example, have both the Japanese and the Roman script
available.

For an application to work correctly with any international system software, it must
be able to handle different character encodings simultaneously. That is, the user should
be able to enter characters in different scripts and edit the text without damaging the
associated script information. This section discusses three ways to handle character
encodings. These methods require different amounts of effort to implement and provide
different capabilities. Of course, those that require the most effort also provide the
most flexibility and power for your users. Before we discuss these methods, let's
define some more terms.

Language. A human language that's written using a particular script. Several
languages can share the same script. For example, the Roman script is used by
English, French, German, and so on. It's also possible for the same language to be
written in more than one script, although that's a rare exception.

Alphabet, syllabary, ideograph set. A collection of characters used by a
language. Some scripts include more than one of these collections. As a simple example,
the Roman script includes both an uppercase and a lowercase alphabet. As a more
complicated example, the Japanese script includes the Roman alphabet, the Hiragana
and Katakana syllabaries, and the Kanji ideograph set. An alphabet, syllabary, or
ideograph set isn't necessarily encoded in the same way in two different scripts. For
example, the Roman alphabet in the Roman script uses one-byte codes, but the Roman
alphabet in the Japanese script uses either one-byte or two-byte codes.

Segment. A subset of an encoding that may be shared by one or more scripts. For
example, the simple (7-bit) ASCII characters make up a segment that's shared by all
the scripts on the Macintosh. Characters in this segment have the same code in any
Macintosh encoding.

Unicode. An international character encoding that encompasses all the written
languages of the world. Each character is assigned a unique 16-bit integer. Unicode is
a unified encoding -- all characters that have the same abstract shape share a common
character code, even if they're used in more than one language.

METHOD 1: NATIVE ENCODING
The easiest method is to simply pick one character encoding for your localization and
stick with it throughout the application. This is usually the native character encoding
for the country (and language) that you're targeting with the localized application. For
example, if you're localizing an application for the Japanese market, you choose the
shift-JIS (Shifted Japanese Industrial Standard) character encoding and modify all
your text-handling routines to use this encoding.

The shift-JIS encoding uses both one-byte and two-byte character codes, so you need
to use the Script Manager's CharByte routine whenever you're stepping through a
string. For a random byte in a shift-JIS encoded string, CharByte tells you if the byte
represents a one-byte character, the low byte of a two-byte character, or the high
byte of a two-byte character. You also have to handle two-byte characters on input (as
described earlier in the section "Keyboard Input") and use the native system and
application fonts for text (as described in the section "Avoiding Font Tyranny").
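On the Macintosh you'd call CharByte for this, but the underlying test is worth seeing. In shift-JIS, the first byte of a two-byte character falls in the ranges 0x81-0x9F or 0xE0-0xEF, while bytes 0x00-0x7F (ASCII) and 0xA1-0xDF (one-byte Katakana) stand alone. Here's a standalone sketch, ours rather than Apple's, of stepping through a shift-JIS string by character:

```c
/* True if b can begin a two-byte shift-JIS character. */
static int IsShiftJISFirstByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF);
}

/* Count characters (not bytes) in a shift-JIS string, stepping
   one or two bytes at a time as CharByte would tell you to. */
static int CountShiftJISChars(const unsigned char *text, int length)
{
    int chars = 0;
    while (length > 0) {
        if (IsShiftJISFirstByte(*text) && length >= 2) {
            text += 2;                  /* two-byte character */
            length -= 2;
        } else {
            text += 1;                  /* one-byte character */
            length -= 1;
        }
        chars++;
    }
    return chars;
}
```

This is exactly why naive byte-at-a-time loops break two-byte text: a loop that advances one byte at a time would count (and potentially split) each two-byte character as two characters.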

To summarize, the native encoding method has a few advantages:

Unfortunately, this method has many disadvantages:

METHOD 2: MULTIPLE ENCODINGS
The most complete method for handling character encodings is to keep track of the
encoding for every bit of text that your application stores. In this method the encoding
(or the script code) is stored with a run of text just like a font family, style, or point
size. The first step is to determine which languages you may want to support. Once this
is determined, you can decide which encodings are necessary to implement support for
those languages. For example, suppose your Marketing department wants to do
localized versions for French, German, Italian, Russian, Japanese, and Korean.
French, German, and Italian all use the Roman script. Russian uses the Cyrillic script;
Japanese uses the Japanese script; and Korean uses the Korean script. To support these
languages, you have to handle the Roman, Cyrillic, Japanese, and Korean encodings.

In general, each script that you include requires support for its encoding and,
possibly, additional features that are specific to that script. For example, Japanese
script can be drawn left-to-right or top-to-bottom, so a complete implementation
would handle vertical text. There are other features specific to the Japanese script
(amikake, furigana, and so on) that you may also want to implement.

If any of the encodings include two-byte characters, the data structures that you use to
represent text runs must be able to handle two-byte codes. When you're processing a
text run, the encoding of that run determines how you can treat the characters in the
run. For example, you can munge a Roman text run in the usual way, safe and secure in
the familiar world of one-byte character codes. In contrast, your dealings with
Japanese text runs may be fraught with angst, since these runs can include both
one-byte and two-byte characters in any combination.

The Script Manager is designed to support applications that tag text runs with a script
code. As long as the font of the current grafPort is set correctly, all the Script
Manager routines work with the correct encoding for that script. For example, if you
specify a Japanese font in the current grafPort, the Script Manager routines assume
that any text passed to them is stored in the shift-JIS encoding.

Keyboard script. During this discussion of the multiple encodings method, we've
been assuming that you already know the script (and therefore the encoding) of text
that the user has entered. How exactly do you know this? The Script Manager keeps
track of the script of the text being entered from the keyboard in a global variable.
Your application should read this variable programmatically after receiving a
keyboard event, as follows:

short keyboardScript;

keyboardScript = GetEnvirons(smKeyScript);

Once you know the keyboard script, make sure that this information stays with the
character as it becomes part of a text run. If the keyboard script is the same as the
script of the text run, you can just add this character to the text run. Otherwise, you
must create a new text run, tag it with the keyboard script, and place the character in
it.
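The extend-or-start-a-new-run decision can be sketched as follows. This is illustrative portable C: the TextRun type, the fixed-size run storage, and AddTypedChar are our own simplified stand-ins for whatever styled-text structures your application really uses.

```c
#include <string.h>

#define kMaxRuns    16
#define kMaxRunText 32

/* Hypothetical run storage: each run is tagged with a script code. */
typedef struct {
    short script;
    char  text[kMaxRunText];
} TextRun;

static TextRun gRuns[kMaxRuns];
static int     gRunCount = 0;

/* Add a typed character: extend the last run if the keyboard script
   matches its script tag, otherwise start a new run tagged with the
   keyboard script. */
static void AddTypedChar(char ch, short keyboardScript)
{
    TextRun *last = (gRunCount > 0) ? &gRuns[gRunCount - 1] : 0;

    if (last != 0 && last->script == keyboardScript &&
            strlen(last->text) < kMaxRunText - 1) {
        int len = (int)strlen(last->text);
        last->text[len] = ch;           /* same script: extend run */
        last->text[len + 1] = '\0';
    } else {
        TextRun *run = &gRuns[gRunCount++];
        run->script = keyboardScript;   /* new script: new run */
        run->text[0] = ch;
        run->text[1] = '\0';
    }
}
```

Typing "ab" with a Roman keyboard script and then "c" after switching scripts would leave two runs, each correctly tagged.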

You can also set the keyboard script directly when the user selects text with the mouse
or changes the current font. The question is, which script do you set the keyboard to
use? That depends on the font of the selected text or the new font the user has chosen.
The first step is to convert the font into a script and then use the resulting script code
to set the keyboard script. This process is known as keyboard forcing.

short fontScript;

fontScript = Font2Script(myFontID);
KeyScript(fontScript);

The user can always change the keyboard script by clicking the keyboard icon (in
System 6) or by choosing a keyboard layout from the Keyboard menu (in System 7).
As a result, you're no longer sure that the keyboard script and the font script agree
when the user actually types something. You should always check the keyboard script
against the font script before entering a typed character into a text run. If the
keyboard script and the font script don't agree, a new current font is derived from the
keyboard script. This process is known as font forcing.

short fontScript, keyboardScript;

fontScript = Font2Script(myFontID);
keyboardScript = GetEnvirons(smKeyScript);
if (fontScript != keyboardScript)
    myFontID = GetScript(keyboardScript, smScriptAppFond);

The combination of keyboard forcing and font forcing is called font/keyboard
synchronization. Both keyboard and font forcing should be optional; the user should be
able to turn these features off with a preferences setting.

Changing fonts. An application that works with multiple encodings must pay special
attention to font changes. For each text run in the selection, the application should
check the script of the text run against the script of the new font. If the scripts agree,
the text run can use the new font. If the scripts don't agree, the application can either
ignore the new font for that text run or apply some special processing.

short fontScript;
short textRunIndex, textRunCount;

fontScript = Font2Script(myNewFontID);
for (textRunIndex = 0;
        textRunIndex < textRunCount;
        textRunIndex++) {
    if (textRunStore[textRunIndex].script == fontScript)
        textRunStore[textRunIndex].fontID = myNewFontID;
    else
        SpecialProcessing(&textRunIndex, &textRunCount, myNewFontID);
}

All the encodings used by the Macintosh script systems include the simple (7-bit)
ASCII characters, so it's often possible to convert these characters from one script to
another. The special processing consists of these two steps:

1. Breaking a text run into pieces, some of which contain only simple ASCII
characters and others that contain all the characters not included in simple ASCII

2. Applying the new font to the runs that contain only simple ASCII characters and
leaving the other runs with the old font

Boolean FindASCIIRun(unsigned char *textPtr, long textLength,
    long *runLength)
{
    *runLength = 0;

    if (*textPtr < 0x80) {
        // We know that this character is simple ASCII, since values
        // less than 128 can't be the first byte of a two-byte
        // character, and they're shared among all scripts. So, let's
        // block up a run of simple ASCII.
        while (textLength-- > 0 && *textPtr++ < 0x80)
            (*runLength)++;
        return true;            // Run is simple ASCII.
    } else {
        // We know this character is not simple ASCII. It may be
        // two-byte or it may be some character in a non-Roman
        // script. So, let's block up a run of non-simple-ASCII
        // characters.
        while (textLength > 0) {
            if (CharByte(textPtr, 0) == smFirstByte) {
                // Skip over two-byte character.
                textPtr += 2;
                textLength -= 2;
                *runLength += 2;
            } else if (*textPtr >= 0x80) {
                // Skip over one-byte character.
                textPtr++;
                textLength--;
                (*runLength)++;
            } else
                break;
        }
        return false;           // Run is NOT simple ASCII.
    }
}

void SpecialProcessing(short *runIndex, short *runCount,
    short myNewFontID)
{
    TextRunRecord       originalRun, createdRun;
    unsigned char       *textPtr;
    long                textLength, runLength, runFollow;
    Boolean             simpleASCII;

    // Retrieve this run and remove it from the run list.
    GetTextRun(*runIndex, &originalRun);
    RemoveTextRun(*runIndex);
   
    // Get the pointer and length of the original text.
    textPtr = originalRun.text;
    textLength = originalRun.count;

    // Loop through all of the sub-runs in this run.
    runFollow = *runIndex;
    while (textLength > 0) {
        // Find the length of the sub-run and its type.
        TextFont(originalRun.fontID);
        simpleASCII= FindASCIIRun(textPtr, textLength, &runLength);

        // Create the sub-run and duplicate the characters.
        createdRun = originalRun;   // Same formats.
        createdRun.text = NewPtr(runLength);
        // Real programs check for nil pointer here.
        createdRun.count = runLength;
        BlockMove(textPtr, createdRun.text, runLength);

        // Roman runs can use the new font.
        if (simpleASCII)
            createdRun.fontID = myNewFontID;

        // Add the new sub-run and advance the run index.
        AddTextRun(runFollow++, createdRun);

        // Advance over this sub-run and continue looping.
        textPtr += runLength;
        textLength -= runLength;
    }

    // Dispose of the original run information.
    DisposeTextRun(originalRun);
}

Searching and sorting. Applications that work with multiple encodings must also
take care during searching or sorting operations. An application that uses only the
native encoding can assume that character codes are unique and that any two text runs
can be compared directly using the sorting routines in the International Utilities
Package. On the other hand, an application that uses multiple encodings must always
consider a character code within the context of a text run and its associated script. In
this case, character codes are unique only within a script, not across script
boundaries, so text runs can't be compared directly using International Utilities
routines unless they have the same script. If the script codes are different, the
International Utilities routines provide a mechanism for first ordering the scripts
themselves (IUScriptOrder).
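The two-level comparison can be sketched in plain C. Here ScriptOrder and strcmp are hypothetical stand-ins for IUScriptOrder and the script-specific comparison routines a real application would call:

```c
#include <string.h>

// A text run tagged with its script, as in the multiple
// encodings method.
typedef struct {
    short script;           // script code of this run
    const char *text;       // the run's characters
} TextRun;

// Stand-in for IUScriptOrder: order scripts by a fixed priority.
static int ScriptOrder(short s1, short s2)
{
    if (s1 < s2) return -1;
    if (s1 > s2) return 1;
    return 0;
}

// Compare two runs: order the scripts first; only within the
// same script can the character codes be compared directly.
// strcmp stands in for the script's own comparison routine.
static int CompareTextRuns(const TextRun *a, const TextRun *b)
{
    int order = ScriptOrder(a->script, b->script);
    if (order != 0)
        return order;                   // different scripts win
    return strcmp(a->text, b->text);    // same script: compare text
}
```

The point of the structure is that a raw byte comparison across scripts would be meaningless; the script tag must be consulted before the character codes are.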

You have the same problems with searching as with sorting. In addition, search
commands usually include options for case sensitivity that could be extended in a
multiple encodings application. For example, the Japanese script includes both
one-byte and two-byte versions of the Roman characters. For purposes of searching,
the user might want to consider these equivalent. The simplified Chinese script also
includes both one-byte and two-byte versions of the Roman characters, and these
should also be equivalent to Roman characters in the Roman script and in the Japanese
script. Just like case sensitivity, considering one- and two-byte versions of a
character as equivalent should be an option in your search dialog box. You can use the
Script Manager's Transliterate routine to implement a byte-size insensitive
comparison. Use Transliterate to convert both the source and the target text into
one-byte characters, then compare the resulting strings. Because all the scripts share
the same simple (7-bit) ASCII character codes, this mechanism treats all the Roman
characters, both one-byte and two-byte, in every script as equivalent.
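To make the folding step concrete, here's a minimal sketch in plain C. A real application would call Transliterate; the FoldChar routine below is our own, and the fullwidth ranges are assumptions extrapolated from the 'A' = 0x8260 example above (a complete version would also pass other two-byte characters through intact):

```c
#include <stddef.h>

// Fold one character to its one-byte form, if it has one. Only the
// two-byte Roman letters are handled here. Returns the number of
// source bytes consumed; the folded character is stored in *dst.
static size_t FoldChar(const unsigned char *src, char *dst)
{
    if (src[0] == 0x82) {                   // assumed fullwidth Roman row
        unsigned char lo = src[1];
        if (lo >= 0x60 && lo <= 0x79) {     // fullwidth 'A'..'Z'
            *dst = (char)('A' + (lo - 0x60));
            return 2;
        }
        if (lo >= 0x81 && lo <= 0x9A) {     // fullwidth 'a'..'z'
            *dst = (char)('a' + (lo - 0x81));
            return 2;
        }
    }
    *dst = (char)src[0];                    // already one byte
    return 1;
}
```

Folding both the source and target text this way before comparing makes the one-byte 'A' and the two-byte 0x8260 match, which is exactly the behavior the search option should provide.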

Summary. The multiple encodings method has several advantages:

 

The disadvantages of this method are apparent from the examples:

METHOD 3: POOR MAN'S UNIFICATION
Our favorite method combines the power of the multiple encodings method with the
simplicity of the native encoding method. The idea is to create a single "native"
encoding that encompasses all the scripts included in the multiple encodings method. In
the multiple encodings method, some characters are encoded several times: a character
can have the same code value in different scripts, or it can have different code values
in the same script. For example, the letter A has the code value 0x41 in the Roman
script and the same one-byte code value in Japanese and traditional Chinese. However,
Japanese also encodes the letter A as the two-byte value 0x8260, and traditional
Chinese also encodes it as the two-byte value 0xA2CF. A unified encoding would map all
of the identical characters in the multiple encodings to one unique code value.

You might have noticed that this method has some of the same goals as the Unicode
scheme -- a single character encoding for all languages with one unique code for every
character. Unicode extends this goal to the unification of the two-byte scripts.
Characters that have the same abstract shape in the simplified Chinese, traditional
Chinese, Japanese, and Korean scripts have been grouped together as a single character
under Unicode. Our method doesn't go that far. We unify the simple ASCII characters
from all scripts but leave the various two-byte scripts to their unique encodings.  
Thus the name for this method -- poor man's unification.

Segments. The poor man's unification method relies on the concept of a segment.  A
segment is a subset of an encoding with characters that are all the same byte size. For
example, the Roman script is divided into two segments -- the simple ASCII segment
and the extended ASCII/European segment. The Japanese script has three segments --
the simple ASCII segment, the one-byte Katakana segment, and the two-byte segment
(including symbols, Hiragana, Katakana, Roman, Cyrillic, and Kanji).

The key to poor man's unification is the simple ASCII segment. This segment is shared
among all the scripts on the Macintosh (see Figure 2). Furthermore, poor man's
unification treats the various encodings as collections of segments that can be shared
among encodings. There's logically only one simple ASCII segment, and all the scripts
share it. In the multiple encodings method, characters in this range could be found in
each script. That is, the word "Beaker" could be stored in both a Roman text run and in
a Japanese text run (as one-byte ASCII). In contrast, an application that uses poor
man's unification would tag text runs with the segment, not the script, so these two
occurrences of "Beaker" would be indistinguishable.   

Figure 2 Scripts Sharing the ASCII Segment

 The best way to see the advantages of this method is to solve a problem we already
considered -- changing the font of the selection. With the multiple encodings method,
this entailed breaking text runs into smaller runs using our FindASCIIRun routine.
With poor man's unification, the same problem is much easier to solve because the
runs are already divided into segments and the simple ASCII segment is allowed to take
on any font. Other segments are only allowed to use fonts that belong to the same script
they do.

#define asciiSegment        0
#define europeanSegment     1
#define katakanaSegment     2
#define japaneseSegment     3

short fontScript, runSegment;
short textRunIndex, textRunCount;

fontScript = Font2Script(myNewFontID);
for (textRunIndex = 0;
        textRunIndex < textRunCount;
        textRunIndex++) {
    runSegment = textRunStore[textRunIndex].segment;
    if (SegmentAllowedInScript(runSegment, fontScript))
        textRunStore[textRunIndex].fontID = myNewFontID;
}

The special processing is gone (surprise). Once you know that a segment isn't included
in the script of the font, you can't go any further. Such a segment consists entirely of
characters that aren't in the script of the font.

Boolean SegmentAllowedInScript(short segment, short script)
{
    switch (script) {
        case smRoman:
            switch (segment) {
                case asciiSegment:
                case europeanSegment:
                    return true;
                default:
                    return false;
            }
        case smJapanese:
            switch (segment) {
                case asciiSegment:
                case katakanaSegment:
                case japaneseSegment:
                    return true;
                default:
                    return false;
            }
        default:
            switch (segment) {
                case asciiSegment:
                    return true;
                default:
                    return false;
            }
    }
}

Determining a segment from keyboard input. How do you determine the
segment of a character when it's entered from the keyboard?

1. First determine which script the character belongs to by checking the
   keyboard script.

2. Then use the character-code value and the encoding definitions to assign
   the character a particular segment.

#define ksJISSpace  0x8140

unsigned short keyboardScript;
unsigned short charSegment;
EventRecord     lowByteEvent;

keyboardScript = GetEnvirons(smKeyScript);
charSegment = ScriptAndByteToSegment(keyboardScript, charCode);

if (charSegment == japaneseSegment) {
    // Get low byte of two-byte character from keyboard.
    do {
        // You can get null events between two bytes of a two-byte
        // character.
        GetNextEvent(keyDownMask | keyUpMask | autoKeyMask,
            &lowByteEvent);
        if (lowByteEvent.what == nullEvent)
            GetNextEvent(keyDownMask | keyUpMask | autoKeyMask,
                &lowByteEvent);
    } while (lowByteEvent.what == keyUp);

    if ((lowByteEvent.what == keyDown) ||
            (lowByteEvent.what == autoKey))
        charCode = (charCode << 8) |
            (lowByteEvent.message & charCodeMask);
    else
        // We've gotten a valid high byte under the Japanese keyboard
        // with no subsequent low byte forthcoming. Something serious
        // is wrong with the current input method. Return a Japanese
        // space for now. Hmmmm.
        charCode = ksJISSpace;
}

#define kASCIILow       0x00
#define kASCIIHigh      0x7f
#define kRange1Low      0x81
#define kRange1High     0x9f
#define kRange2Low      0xe0
#define kRange2High     0xfc

short ScriptAndByteToSegment(unsigned short script,
    unsigned char byte)
{
    switch (script) {
        case smRoman:
            if ((byte >= kASCIILow) && (byte <= kASCIIHigh))
                return asciiSegment;
            else
                return europeanSegment;
        case smJapanese:
            if ((byte >= kASCIILow) && (byte <= kASCIIHigh))
                return asciiSegment;
            else if ((byte >= kRange1Low) && (byte <= kRange1High))
                return japaneseSegment;
            else if ((byte >= kRange2Low) && (byte <= kRange2High))
                return japaneseSegment;
            else
                return katakanaSegment;
        default:
            // New scripts and segments added before this.
            return asciiSegment;
    }
}

You might think this is quite a bit of effort just to get the low byte of a two-byte
character. You're right. And as for Joe's use of the antiquated GetNextEvent instead of
the more modern WaitNextEvent, Beaker notes that "It's a cooperative multitasking
world and Joe's not cooperating." Joe replies, "Yeah, but I don't want a context switch
while I'm trying to get the low byte of a two-byte character."

Changing fonts. Applications that employ poor man's unification still have to worry
about font forcing. Here's an algorithm for "smart" font forcing that tries to
anticipate which fonts the user will select for text in each segment. When you find a
case where the current keyboard script and font script don't agree, instead of using the
application font for the keyboard script, search the surrounding text runs for a font
that does agree with the keyboard script. Only if you can't find a font that agrees do you
default to the application font of the keyboard script. From the user's perspective, this
is much nicer. Once the user has selected a font for each script, the application goes
back and forth between the fonts automatically as the keyboard script is changed.

short fontScript, keyboardScript;
short textRunIndex;     // textRunCount, currentRunIndex, myFontID, and
                        // textRunStore are declared as before.

fontScript = Font2Script(myFontID);
keyboardScript = GetEnvirons(smKeyScript);
// Search backward.
if (fontScript != keyboardScript) {
    for (textRunIndex = currentRunIndex - 1;
            textRunIndex >= 0;
            textRunIndex--) {
        myFontID = textRunStore[textRunIndex].fontID;
        fontScript = Font2Script(myFontID);
        if (fontScript == keyboardScript)
            break;
    }
}

// Search forward.
if (fontScript != keyboardScript) {
    for (textRunIndex = currentRunIndex + 1;
            textRunIndex < textRunCount;
            textRunIndex++) {
        myFontID = textRunStore[textRunIndex].fontID;
        fontScript = Font2Script(myFontID);
        if (fontScript == keyboardScript)
            break;
    }
}

// Punt if we couldn't find an appropriate run.
if (fontScript != keyboardScript)
    myFontID = GetScript(keyboardScript, smScriptAppFond);

Applications that use font forcing also have to worry about keyboard forcing. However,
if the application includes the feature just described, keyboard forcing is not as
important. Many users will prefer to leave the keyboard completely under manual
control and allow the "smart" font forcing to choose the correct font when they start
typing. The keyboard script is always visible in the menu bar, but the current font is
not.

Summary. The poor man's unification method has more advantages than the other
two:

Unfortunately, this method has two disadvantages when compared to pure Unicode:

Unicode does away with both of these disadvantages by making all characters two bytes
wide and insisting on one huge set of unique character codes.

PAY ME NOW OR PAY ME LATER

Perhaps the moral of the story is "globalization, not localization," as Joe says. The
more generalizations you can build into your application during initial development,
the more straightforward your localization process is destined to be, and the less your
localized code base will diverge from your original product.

Weigh the size and growth potential of a given language market against the amount of
effort required to implement that language. Stick to the markets where your product is
most likely to flourish. This article has shown that in some cases you can dramatically
reduce code complexity by taking shortcuts -- the poor man's unification scheme in
this article is a good example. A healthy balance between Script Manager techniques
and custom code will help you bring your localized product to market fast and make it a
winner.

 

SCRIPT MANAGER CAVEATS
When you use a char to store a character or part of a character, use an unsigned char.
In two-byte scripts, the high byte of a two-byte character often has the high bit set,
which would make a signed char negative, possibly ruining your day. The same goes for
the use of a short to store a full one- or two-byte character -- use an unsigned short.

Another important point is that most Script Manager routines rely on the font of the
current grafPort for their operation. That means you should always be sure that the
port is set appropriately and that the font of the current port is correct before making
any Script Manager calls.

A new set of interfaces has been provided for System 7.1. While the old Script
Manager's text routines still work, the new routines add flexibility. For example, you
can use CharacterByteType instead of CharByte.

THE DEMISE OF ONE-BYTE CHARACTERS
The point of poor man's unification is to simplify your life. On that theme, there's
another technique that will help. You can simply decide that characters are two bytes.
Period. Expand one-byte characters into an unsigned short, with the character code in
the low byte and the segment code in the high byte. Then just use unsigned shorts
everywhere instead of unsigned chars. You'll find that your code gets easier to write
and easier to understand, and that lots of special cases where you would have broken
everything out into one- and two-byte cases collapse into one case.

Putting the segment code into the high byte of one-byte characters ensures that the
one-byte character codes are unique. If your program handles only one two-byte
script, the two-byte codes are also unique. When both these conditions are true,
there's no need to store the segment codes in runs, since they're implied by the high
byte of each character code.

Here are a few examples using the codes from the sample segments in the section on
poor man's unification:
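The expansion can be sketched in C, using the segment codes defined earlier (0xB1 here is an assumed sample code for a one-byte Katakana character):

```c
// Segment codes from the sample definitions earlier.
#define asciiSegment    0
#define europeanSegment 1
#define katakanaSegment 2
#define japaneseSegment 3

// Expand a one-byte character into an unsigned short: segment code
// in the high byte, character code in the low byte. Two-byte
// characters keep their codes as is, so with only one two-byte
// script every resulting code stays unique.
static unsigned short ExpandOneByte(unsigned char code, short segment)
{
    return (unsigned short)((segment << 8) | code);
}

// For example:
//   ExpandOneByte('A', asciiSegment)      -> 0x0041
//   ExpandOneByte(0xB1, katakanaSegment)  -> 0x02B1
//   A two-byte Kanji such as 0x8868 is stored unchanged.
```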

In other words, one-byte characters carry their segment code in their high byte, and
the two-byte characters all belong to the same segment. You can imagine how much
easier searching and sorting algorithms are if you know you can always advance your
pointers by two bytes instead of constantly calling CharByte to find out how big each
character is. Plus, you might as well get used to storing 16 bits per character, since
that's how Unicode works. Yes, it's an extra byte per character -- deal with it.

RECOMMENDED READING

JOSEPH TERNASKY wrote accounting software for a "Big Eight" firm until a senior
partner recruited him into the Order of the Free Masons. He showed great promise as
an Adept, and the Order sent him to the Continent to continue his studies under the
notorious Aleister Crowley, founder of the Temple of the Golden Dawn. After years of
study in the Great Art, Joseph was sent back to America to accelerate the breakdown  of
civil order and the Immanentizing of the Eschaton. He resumed his former identity and
now spends his remaining years adding hopelessly complicated international features
to the Macintosh system software and various third-party applications.*

BRYAN K. ("BEAKER") RESSLER (AppleLink ADOBE.BEAKER) had his arm
twisted by develop editor Caroline Rose, forcing him to write develop articles on
demand. He resides in a snow cave in Tibet, where he fields questions ranging from
"Master, what is the meaning of life?" to "Master, why would anyone want to live in a
Tibetan snow cave and answer questions for free?" When he's not busy answering the
queries of his itinerant clientele, he can usually be found writing some esoteric sound
or MIDI application. Back in his days of worldly endeavor, Beaker wrote some of the
tools that were used for testing Kanji TrueType fonts, and then worked on System 7 in
the TrueType group. Hence the retreat to his current colder but more enlightened and
sane environment.*

System 7.1 provides a standard inline input interface and the system input method
supports inline input. With inline input, the input translation process can occur
within the document window, and no input window is used. *

Amikake is a variable shading behind text. Furigana is annotation of a Kanji
character that appears in small type above the character (or to the right of the
character, in the case of vertical line orientation). *

THANKS TO OUR TECHNICAL REVIEWERS Jeanette Cheng, Peter Edberg, Neville
Nason, Gideon Shalom-Bendor *