
Originally presented at the AVIOS 2000 conference, May 23, 2000, in San Jose, California. This paper is found as pp. 9-14 in the proceedings under the title "Internationalizing Speech Applications."

LOCALIZATION OF SPEECH RECOGNITION SOFTWARE FOR JAPAN

Over-the-phone speech applications are harder to translate into other languages than web applications or traditional desktop GUI applications. Using a mouse and keyboard requires training, but using the phone and speech itself is a familiar experience for which callers already have many linguistic and cultural expectations. Speech applications must adapt to these expectations, since longstanding habit cannot be trained away, even for the most cooperative callers. The adaptations required for internationalizing speech applications often go far beyond substituting one language's vocabulary for another.

Examples from the translation of two applications from English to Japanese show that even apparently simple, common items—names, numbers, answers to simple questions—pose a wide range of surprising challenges for both speech input and output. However, addressing the problems raised by one language can lead to a better solution for speakers of all languages. Internationalizing an application also reveals some of the peculiarities of English and American English, shedding light on the voice user interface design process.

[1] Apple Computer, Inc., Inside Macintosh: Volume VI. 1991, pp. 2:4-12.

 

Introduction

Application developers are usually cautioned to consider international users and markets early in the design process. In the domain of graphical user interfaces, internationalization means translating labels, allowing right-to-left text and extra room for languages that are not as terse as English, dealing with new currencies and date formats, and checking icons for cultural appropriateness. [1] Rarely if ever is the logic of the application affected or the look and feel significantly altered.

Why is this not true for speech applications? Many common items in a speech recognition application—dates, times, names, money amounts, yes/no answers—exhibit some of the least logical traits of human language. Take dates as an example. Nearly every culture and language provides a completely numerical method for writing dates. Hardly anyone uses these numeric forms when speaking, however. Instead of saying "five, thirty, two thousand," most people would say "Tuesday next week," and the people or programs who need the numbers can figure them out for themselves. This does not make things easy or uniform to program, but it does make for an interesting challenge for the developer who wants an application that sounds and behaves naturally across cultures.

The examples here are based on two demonstrations developed for a multinational company. Both applications were developed originally for American English and then localized for Japan.

The first system gives order status reports and deals with numbers, money amounts, dates, and company and product names. It was not originally designed to be translated.

The second is a dialer, or auto-attendant. The dialer was designed to be localized for Germany as well, but only the English and Japanese versions were completed. The dialer handles personal names, department names, phone numbers, and commands. It provides separate phone numbers for internal and external callers, and web-based interfaces in English and Japanese. Unlike their speech counterparts, the web interfaces were localized simply by replacing all user-visible text strings. Although each language has a separate phone number, all languages share all data.

The most interesting problems arise with personal names, numbers, dates, and phone numbers. Commands and proper names had straightforward translations or transliterations. To translate these applications, I worked with a translator who is a native speaker of Japanese, familiar with both US and Japanese business practices. I had completed a class in third-semester Japanese myself. This prepared me to ask questions, but it is difficult to imagine how a speech application could be successfully localized without a native speaker who is not only fluent in the language, but observant of cultural differences and knowledgeable about business practices.

I will review cultural differences, and look in detail at personal names, numbers, and phone numbers, before offering recommendations about designing and building speech applications for an international audience.

Localization vs. Translation

Localization is not translation. A statement that is appropriate in one culture may, when translated directly, come across in another as too abrupt, too patronizing, or redundant. Some concepts may not translate at all. The dialer is a case in point. A Japanese company that started using a machine to answer its calls would alarm its customers and others, because it would seem to be indicating that it is going through such severe financial difficulty that it can no longer afford a receptionist. The internally accessible version, however, was appealing to employees in Japan because they wanted to have calls routed without bothering a receptionist. German colleagues also pointed out that the German office would be very unlikely to use a publicly available auto-attendant for the same reasons.

There are also differing expectations about how machines should speak. As an example, speech applications for American audiences take broadcast media as a model for how they should sound: friendly, but brief and to the point, using as few words as possible. Though talking cars and vending machines have been brought to market in the US, Americans have rejected these devices as mechanical and impersonal. By contrast, Japan is full of talking machines and other automated announcements—the automated reporting of train arrivals is considered useful and helpful information. There are even shops in Japan that use motion detectors to trigger recorded shouts of irasshaimase (welcome). Invariably, because they are providing customer service, these recorded announcements use polite language, even when shorter, more efficient alternatives are available. Often, the useful information is surrounded by words of thanks and warnings to be careful. The background noise of polite recordings is essential to any Japanese train station. Female voices dominate, because this is considered more polite. The English dialer uses a male voice, but the Japanese dialer unquestionably needed a female voice.

When speaking to machines, Japanese callers are likely to be much less wordy in their responses, and the filler words are more uniform. An American asked for a quantity might say "forty-five", "I'd like forty-five," "forty-five of them," or "forty-five, please." The translator and callers offered few alternatives to the equivalent of "forty-five units." In Japanese, sentence subjects are frequently omitted, no verb is really necessary in this case, and the equivalent of please would be incorporated into the verb ending. There is nothing to add to this simple statement. This makes some of the Japanese recognition grammars much simpler than their English counterparts. Below in the section about numbers, I explain why Japanese callers would say "forty-five units" instead of just "forty-five."

Names

Name order

The Japanese say their family name before their given name. (To avoid confusion, I'll use the term "family name" for what English speakers would call a last name or surname, and "given name" for first name.) Even people who have Japanese business associates may be unaware of this, though, because when speaking English, the Japanese adjust to Western ways.

You might think, therefore, that swapping the family name and given name would be all you have to do. This would be true, except that in Japanese practice, a Western name is still given in its Western order. How do you tell if someone is Japanese? The database included an employee's location, but this field was not sufficient for distinguishing Japanese from Western names—the Tokyo office had employees from all over the world, and there were many Japanese and Japanese-Americans elsewhere. Japanese-Americans pose yet another problem: some Japanese callers will say an obviously Japanese name in Japanese order, but others, knowing that the person doesn't even speak Japanese, will use the Western order.

The solution to this problem was to check whether a name followed the rules of Japanese and, if so, to put it in Japanese order for the Japanese version. To accommodate Japanese-Americans, Western order was also allowed for Japanese names. So Western order was allowed for all names, and Japanese order was also allowed for names that sound Japanese (technically, names that follow the rules of standard romanization).
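
The ordering rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not the dialer's actual code: the function names are hypothetical, and the regular expression is a rough approximation of romanized Japanese syllables. Checking both the given and the family name is also an assumption. Some Western names (Naomi, for example) pass the check, which matches the "names that sound Japanese" behavior described above.

```python
import re

# Rough approximation of one romanized Japanese syllable (an assumption, not
# the dialer's real rule): an optional consonant or consonant pair followed by
# a vowel, or a syllable-final "n".
_SYLLABLE = r"(?:(?:[kgnhbpmr]y|sh|ch|ts|[kgsztdnhbpmyrwjf])?[aiueo]|n(?![aiueoy]))"
_ROMANIZED = re.compile(rf"(?:{_SYLLABLE})+", re.IGNORECASE)


def looks_romanized_japanese(name: str) -> bool:
    """Return True if the whole name is built only from such syllables."""
    return _ROMANIZED.fullmatch(name) is not None


def allowed_name_orders(given: str, family: str) -> list:
    """List the word orders a Japanese-language grammar would accept."""
    orders = [f"{given} {family}"]  # Western order: allowed for every name
    if looks_romanized_japanese(given) and looks_romanized_japanese(family):
        orders.insert(0, f"{family} {given}")  # Japanese order: family name first
    return orders
```

For a name such as Hiromi Takahashi, both "Takahashi Hiromi" and "Hiromi Takahashi" are accepted. For Brian Krause, only the Western order is.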

This paper and the projects it describes would not have been possible without the many talents of Hiromi Takahashi Yampol, who served as both translator and voice.

Thanks also to Todd Yampol and James Giangola of Nuance Communications for their help with this paper.
 


[2] Mangajin's Basic Japanese through Comics, Part 2. 1996, pp. 54-59.

 

Given names, titles, and ambiguity

Despite our attention to the format of full names, the Japanese do not use a person's given name in business. The person's family name and title (Fig. 1) are used instead.

This means that callers would not be likely to use a person's full name, even if they know it. There was no plausible way to politely ask a caller to give a person's full name. There would have to be so much explanation and apology that the call would be quite long. Even considering that Japanese callers tolerate more wordy explanations from machines and tend to show more cooperation, we decided that we could not expect people to break with tradition.

We decided to ask for the name in the simplest terms possible, even if this would lead to callers omitting the given name. Because the dialer was designed to deal with thousands of employees, the logic to handle ambiguity was already in place, but it would be exercised more in the Japanese version. In the English version, the grammar had only full names, but in the Japanese version, given names were an optional part of the name. In the English version, it was assumed that since callers would say full names, any ambiguity would be resolved by saying a person's department or location. In the Japanese version, given names could be the most common disambiguating factor.
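
A minimal sketch of the lookup described above, with the given name optional and disambiguation triggered only when more than one employee matches. The record fields, function names, and dialog-step labels are all hypothetical; the original system's data model is not described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    family: str
    given: str
    department: str


def match_employees(directory, family, given=None):
    """Return every employee matching what the caller said.

    The given name is optional, as in the Japanese grammar. When it is
    missing, all employees sharing the family name are returned.
    """
    return [
        e for e in directory
        if e.family == family and (given is None or e.given == given)
    ]


def next_step(matches):
    """Decide what the dialog should do with the list of matches."""
    if not matches:
        return "reprompt"
    if len(matches) == 1:
        return "transfer"
    return "ask_given_name"  # several matches: ask a disambiguating question
```

With two employees named Sato in the directory, a caller who says only "Sato" is asked a follow-up question, while a caller who gives the full name is transferred directly.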


shacho        company president
daitoryo      country president
kaicho        chairman, director
fukushacho    company vice president
bucho         department head
jicho         assistant chief

Figure 1.
Some Japanese business titles. [2]


[3] Kodansha Encyclopedia of Japan. 1983, Vol 5, pp. 324-5.

To make matters worse, the most common Japanese family names (Fig. 2) are very common indeed. In Japan, the names Sato and Suzuki each account for more than 1.5% of the population. A Japanese company of just 72 employees is more likely than not to have two Satos or two Suzukis! Certain German names, both family and given, are also quite common, so there were many German employees with the same full names.
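
The 72-employee figure can be checked with a short calculation. The sketch below assumes each employee is independently a Sato with probability 1.5%, a Suzuki with probability 1.5%, and has some other name otherwise. The function names are illustrative.

```python
def prob_shared_family_name(n: int, p: float = 0.015) -> float:
    """Probability that n employees include two Satos or two Suzukis.

    Assumes each employee is independently a Sato with probability p, a
    Suzuki with probability p, and something else otherwise.
    """
    other = 1.0 - 2.0 * p
    none_shared = (
        other ** n                                 # no Sato and no Suzuki
        + 2 * n * p * other ** (n - 1)             # exactly one of one name, none of the other
        + n * (n - 1) * p * p * other ** (n - 2)   # exactly one Sato and one Suzuki
    )
    return 1.0 - none_shared


def smallest_company(p: float = 0.015) -> int:
    """Smallest headcount at which a shared name is more likely than not."""
    n = 2
    while prob_shared_family_name(n, p) <= 0.5:
        n += 1
    return n
```

Under these assumptions, `smallest_company()` returns 72, which agrees with the figure in the text. Since the real frequencies are somewhat above 1.5%, the true threshold would be a little lower.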

Japanese custom offers some help, but our database did not. Though Japanese business people omit the person's given name, they give the person's title, for example "President Tanaka" and "Department Head Yamamoto." Fortunately, these titles are shorter in Japanese than in English, just a couple of syllables. Unfortunately, title information was unavailable—my client was not a Japanese company, and so did not assign these traditional titles. A Japanese company would use these titles, and a properly designed auto-attendant ought to use the titles to distinguish people with the same family names but different corporate positions. Asking callers which President Tanaka they mean would not be a good idea.


Sato
Suzuki
Tanaka
Yamamoto
Watanabe
Kobayashi
Saito
Tamura
Ito
Takahashi

Figure 2.
Common Japanese family names. [3]

 

Numbers

Like many Asian languages, Japanese does not make a distinction between singular and plural for nouns and verbs. This simplifies some things for both speech input and output. Otherwise, Japanese numbers are much more complicated, especially for creating natural-sounding speech output.

In Japanese, there are multiple ways of counting. For counts of less than ten, there is a generic system, but otherwise each class of object to be counted has its own counting suffix, and fluent-sounding Japanese requires them. So there is one set of numbers for long, slender things like pencils, another for people, and days of the month have their own counting system (Fig. 3). This means that many versions of the numbers must be recorded, even for small applications.

For speech input, the counter can be used to resolve ambiguity. For example, in making a hotel reservation, the number of people is easy to distinguish from the number of nights. Even when the numeric value is the same, the spoken forms are distinct words with distinct sounds.

[4] Lampkin, Rita L. Japanese Verbs and Essentials of Grammar, 1997. pp. 110-117.

 
     Digit or plain  General    Years old    People              Pencils, etc.    Day of month
0    zero, rei
1    ichi            hitotsu    issai        hitori              ippon            tsuitachi
2    ni              futatsu    nisai        futari              nihon            futsuka
3    san             mittsu     sansai       sannin              sanbon           mikka
4    yon, shi        yottsu     yonsai       yonin               yonhon           yokka
5    go              itsutsu    gosai        gonin               gohon            itsuka
6    roku            muttsu     rokusai      rokunin             roppon, rokuhon  muika
7    shichi, nana    nanatsu    nanasai      nananin, shichinin  nanahon          nanoka
8    hachi           yattsu     hassai       hachinin            happon           yoka
9    kyu, ku         kokonotsu  kyusai       kyunin              kyuhon           kokonoka
10   ju              to         jussai       junin               juppon           toka
11   ju-ichi                    ju-issai     ju-ichinin          ju-ippon         juichi-nichi
20   ni ju                      hatachi      ni junin            ni juppon        hatsuka
21   ni ju-ichi                 ni ju-issai  ni ju-ichinin       ni ju-ippon      nijuichi-nichi

Figure 3.
Japanese numbers. [4]
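
To illustrate how counters affect both output and input, here is a small sketch built from three columns of Figure 3. The dictionary layout and function name are assumptions made for illustration. Where the figure lists two variants, only the first is used.

```python
# Spoken forms for 1 through 10, copied from Figure 3
# (first variant where two are listed).
COUNT_FORMS = {
    "people": ["hitori", "futari", "sannin", "yonin", "gonin",
               "rokunin", "nananin", "hachinin", "kyunin", "junin"],
    "pencils": ["ippon", "nihon", "sanbon", "yonhon", "gohon",
                "roppon", "nanahon", "happon", "kyuhon", "juppon"],
    "day_of_month": ["tsuitachi", "futsuka", "mikka", "yokka", "itsuka",
                     "muika", "nanoka", "yoka", "kokonoka", "toka"],
}

# Reverse index for speech input: the word alone identifies kind and value.
WORD_TO_MEANING = {
    word: (kind, i + 1)
    for kind, forms in COUNT_FORMS.items()
    for i, word in enumerate(forms)
}


def spoken_count(n: int, kind: str) -> str:
    """Return the word to play for n things of the given kind."""
    forms = COUNT_FORMS[kind]
    if not 1 <= n <= len(forms):
        raise ValueError(f"no recorded form for {n} ({kind})")
    return forms[n - 1]
```

The prompt logic has to pass the kind of object along with the number: `spoken_count(3, "people")` and `spoken_count(3, "pencils")` select different recordings. In the other direction, a recognized word such as "futari" can only mean two people, never the second day of the month.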

 

Phone Numbers

Japanese digits are different from the counting forms of numbers. The traditional form of the digit four, shi, sounds the same as the word for death, and is the first syllable of shichi, one of the words for seven. The form shi, therefore, is almost never heard in phone numbers, replaced with yon. Shichi and nana are both used for seven, even though shichi contains shi and sounds like ichi (one) and hachi (eight). Zero is heard as both zero and rei. This leads to a more complex recognition grammar.

When saying a phone number, many Japanese people use the particle no between the parts of the number (Fig. 4), more commonly between the prefix and last four digits than after the area code. Japanese area codes have varying length, and the length of the phone number varies depending on the area code. Though it is tempting to allow variable length phone numbers and an optional no in any position, variable length digit strings result in less accurate recognition than more constrained recognition grammars. Moreover, no sounds like go, the digit five. For phone numbers read back by the dialer, we omitted the no, using a pause instead. This practice is also common, and sounds modern and efficient. We hoped callers would imitate it.

For the US, a simple grammar for ten digit numbers is sufficient, and it does not need to be modified as new area codes appear. For Japan, the grammar must account for each area code, and this grammar must be maintained, since area codes are sometimes split, and additional digits are added as cities grow. Variable-length area codes are also used in the United Kingdom and elsewhere.
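
The constraints described above can be sketched as a small interpreter for recognized word strings. This is an illustration under stated assumptions, not the grammar used in the dialer: the function and table names are hypothetical, and the area code table holds only the three codes that appear in Figure 4.

```python
# Spoken variants for each digit, following the text. The form "shi" is left
# out on purpose because callers almost never use it in phone numbers.
DIGIT_WORDS = {
    "zero": "0", "rei": "0", "ichi": "1", "ni": "2", "san": "3", "yon": "4",
    "go": "5", "roku": "6", "shichi": "7", "nana": "7", "hachi": "8",
    "kyu": "9",
}

# Illustrative table only: area code -> number of digits that follow it.
# A production grammar would need every area code and regular maintenance.
DIGITS_AFTER_AREA_CODE = {"03": 8, "025": 7, "0476": 6}


def parse_spoken_number(utterance: str):
    """Turn a recognized word string into (area_code, rest), or None if invalid."""
    # The particle "no" separates groups but carries no digit.
    words = [w for w in utterance.split() if w != "no"]
    try:
        digits = "".join(DIGIT_WORDS[w] for w in words)
    except KeyError:
        return None  # a word outside the digit vocabulary
    # Try longer area codes first so that a long code is never cut short.
    for code in sorted(DIGITS_AFTER_AREA_CODE, key=len, reverse=True):
        rest = digits[len(code):]
        if digits.startswith(code) and len(rest) == DIGITS_AFTER_AREA_CODE[code]:
            return code, rest
    return None
```

Tying the expected length to the area code is what keeps the digit string constrained. Readings that use number words rather than digits, such as "go sen ban" in Figure 4, are outside this sketch and return None.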

(03) 3224-5000, US Embassy, Tokyo
zero san no san ni ni yon no go zero zero zero
or, zero san no san ni ni yon no go sen ban

(025) 245-3331, Hotel Niigata, Niigata
zero ni go no ni yon go no san san san ichi

(0476) 28-1010, English directory, Narita airport
zero yon nana roku no ni hachi no ichi zero ichi zero

Figure 4.
Some Japanese phone numbers and their readings.
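
The digit-by-digit readings in Figure 4 can be generated mechanically. The following sketch is an assumption about how prompt logic might do this, with a hypothetical function name. It uses yon and nana, the forms shown in the figure, and takes the separator as a parameter so that the particle no can be replaced by a pause marker.

```python
# One spoken form per digit, index 0 through 9, as used in Figure 4.
READBACK_WORDS = ["zero", "ichi", "ni", "san", "yon",
                  "go", "roku", "nana", "hachi", "kyu"]


def read_back(parts, separator="no"):
    """Spell out each group of digits, joining the groups with the separator."""
    groups = [" ".join(READBACK_WORDS[int(d)] for d in part) for part in parts]
    joiner = f" {separator} " if separator else " "
    return joiner.join(groups)
```

For example, `read_back(["03", "3224", "5000"])` gives "zero san no san ni ni yon no go zero zero zero", the first reading in Figure 4. Passing a pause marker as the separator corresponds to the style the dialer used when reading numbers back.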

 

Recommendations

Good software engineering practice leads to separating the language-dependent code and data from the language-independent components. The language-dependent parts include the outgoing speech (prompt) logic, the recognition targets (grammars or vocabulary lists), and even the application logic itself. Early attention to all target languages gives you the best chance to avoid problems.

Some systems allow outgoing speech to be customized by providing a series of blanks to be filled in. Unless it allows the order of these blanks to be switched around for various languages, this will surely be insufficient—it wouldn't even be able to handle Japanese names, for example. Even filling in blanks is sometimes insufficient. In Japanese, the type of object being counted is needed in order to produce a natural sounding number—this is an additional parameter to the prompting logic. Furthermore, context information may need to be carried from one question to another—for example, the grammatical gender of an item may need to be known in order to ask "how many of them?"
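
A minimal sketch of prompt logic that meets both requirements: blanks that can be reordered per language, and an extra parameter for the kind of object being counted. The locale tags, table names, and function names are assumptions made for illustration. The Japanese words come from Figure 3, and the English branch ignores singular and plural for brevity.

```python
# Named blanks let each language place the parts in its own order.
NAME_TEMPLATES = {
    "en-US": "{given} {family}",
    "ja-JP": "{family} {given}",
}

# A few counted forms from Figure 3, keyed by kind of object and then by number.
JA_COUNTED = {
    "people": {1: "hitori", 2: "futari", 3: "sannin"},
    "pencils": {1: "ippon", 2: "nihon", 3: "sanbon"},
}


def render_name(locale: str, given: str, family: str) -> str:
    """Fill the name template for a locale."""
    return NAME_TEMPLATES[locale].format(given=given, family=family)


def render_count(locale: str, n: int, kind: str) -> str:
    """Render a quantity; Japanese needs the kind to pick the spoken form."""
    if locale == "ja-JP":
        return JA_COUNTED[kind][n]
    return f"{n} {kind}"  # simplification: no singular/plural handling
```

Here `render_name("ja-JP", "Hiromi", "Takahashi")` yields "Takahashi Hiromi". A fuller version would choose the template per name rather than per locale, since Western names keep their Western order in Japanese.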

The parsing (or natural language) capabilities of the underlying system go a long way in abstracting the differences among languages and deriving the meaning in a form useful to the rest of the application. Since the parser deals with ambiguity, keep in mind that some forms may be ambiguous in one country but always clear in another. The 12 hour clock, for example, results in ambiguous times. It is common in the US but less common elsewhere, where a 24 hour clock is used.
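
As a small illustration of locale-dependent ambiguity, the sketch below returns every hour that a spoken hour could mean. The function name and parameters are hypothetical, and a real parser would handle far more than this one case.

```python
def candidate_hours(hour, meridiem=None, uses_24_hour_clock=False):
    """Return the hours (0 to 23) that a spoken hour could refer to."""
    if uses_24_hour_clock or hour > 12 or hour == 0:
        return [hour]                    # already unambiguous
    if meridiem == "am":
        return [hour % 12]
    if meridiem == "pm":
        return [hour % 12 + 12]
    return [hour % 12, hour % 12 + 12]   # ambiguous: the caller must be asked
```

A result with two entries signals that the dialog needs a follow-up question. With a 24 hour clock, the same input never does.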

The flow of an application may need to be adjusted for localization. In the dialer, Japanese callers are asked more often to disambiguate employees with matching family names. For a smaller company, this disambiguation logic would not have been necessary in English. Even so, additional effort to accommodate users can benefit speakers of many languages, and allow the software to adapt to more situations.

Finally, since a particular language may require context information that other languages don't use (Japanese titles, for example), the languages you target may affect the design of the underlying database, and affect the cost of collecting, encoding, or maintaining the data (as with Japanese area codes). Knowing which languages and cultures you are targeting, and knowing something about them, must be part of the early design stages, not postponed until after the first language is complete.

Since improved customer service and the promise of friendly applications are driving the adoption of speech recognition applications, developers need to pay special attention to the finer points of language and culture. Comfortable callers are likely to be more cooperative. Not only does this mean that they will be more successful using the application, but also that they will choose it over more expensive alternatives or competitors' services.



Last updated by Brian Krause, brk@adducive.com, August 7, 2002