RISC OS Japanese Support Functional Specification

Drawing number: 2205,203/FS
Issue: 2
Status: Released
Author: Kevin Bracey
Date: 2nd July 2003

Contents

  1. Introduction
  2. UTF-8
  3. UCS ⇔ RISC OS 8-bit alphabet mapping
  4. General UCS mapping/utilities
  5. Japanese territory
  6. Japanese keyboard driver
  7. New low-level keys
  8. Future enhancements
  9. References
  10. Glossary
  11. History

1. Introduction

This document describes miscellaneous changes to RISC OS to support the Japanese language and UTF-8.

2. UTF-8

UTF-8 has been assigned the RISC OS alphabet number 111, with name "UTF8". The International module provides this number->name mapping. When *Alphabet UTF8 is issued, it will define character shapes for &80 to &FF as simple symbols showing the correct code, with visual cues to differentiate between UTF-8 start bytes and continuation bytes.

At some future time the kernel VDU driver will be taught about UTF-8, and how to display large character sets, and these definitions will not be used. In the meantime, these definitions will be useful for debugging.
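
The distinction between start bytes and continuation bytes follows directly from the top bits of each byte. As a minimal sketch (the type and function names here are invented for illustration and are not part of the International module), a byte could be classified like this:

```c
#include <assert.h>

/* Roles a single byte can play in a UTF-8 stream (names are hypothetical). */
typedef enum {
    BYTE_ASCII,         /* &00-&7F: a complete character on its own */
    BYTE_CONTINUATION,  /* &80-&BF: second or later byte of a sequence */
    BYTE_START,         /* &C0-&FD: first byte of a multibyte sequence */
    BYTE_INVALID        /* &FE-&FF: never appear in UTF-8 */
} utf8_byte_kind;

/* Classify one byte value (0-255) purely from its top bits. */
utf8_byte_kind utf8_classify_byte(int byte)
{
    assert(byte >= 0 && byte <= 0xFF);
    if (byte < 0x80) return BYTE_ASCII;
    if (byte < 0xC0) return BYTE_CONTINUATION;
    if (byte < 0xFE) return BYTE_START;
    return BYTE_INVALID;
}
```

The ranges assume the original ISO 10646 form of UTF-8, in which sequences of up to six bytes encode codes up to &7FFFFFFF.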

UTF8 is to be regarded as a "special" alphabet by applications. All other RISC OS alphabets are simple 8-bit encodings, with the lower set identical to ASCII. These can be mapped to Unicode using the tables/calls described below. Unicode-aware applications should detect the system alphabet being set to UTF8, and should then assume that incoming text files and keyboard input are UTF-8 streams. The vast majority of applications do not do any real text processing at all: they just treat input and output as a sequence of bytes, looking for ASCII characters that are significant to them and treating top-bit-set characters as alphabetic. These applications will not generally be bothered by the system alphabet being UTF8.

3. UCS ⇔ RISC OS 8-bit alphabet mapping

Service_International has been extended to have two new reason codes:

Service_International 7 (Service Call &43)

Define a character from the Universal Character Set

On entry

R1=&43 (reason code)
R2=7 (sub-reason code)
R3=character in system font to redefine
R4=UCS character number (&00000000 - &7FFFFFFF)

On exit

R1=0 if claimed, otherwise preserved
R2-R4 preserved

Use

Any module providing UCS character definitions should check to see if it can provide an 8×8 definition for the specified UCS character. If it can, then it should claim the service call and define the character specified by R3, using VDU 23.

If the character number is not recognised, all registers should be preserved.

This call is intended as an interim measure before the kernel supports large character sets. The standard International module contains several hundred character definitions - this call allows easy access to them.
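
The decision a claiming module makes for reason code 7 can be modelled in portable C. This is only a sketch: the glyph list, its single made-up bitmap and the function name are assumptions for illustration, and a real module would work on the registers directly and send the bytes to the VDU stream rather than to a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of a hypothetical glyph list: a UCS code and its 8x8 bitmap,
   top row first, most significant bit leftmost. */
typedef struct {
    unsigned long ucs;
    unsigned char rows[8];
} glyph8x8;

/* Placeholder data invented for this sketch (a filled square for U+25A0). */
static const glyph8x8 demo_glyphs[] = {
    { 0x25A0ul, { 0x00, 0x7E, 0x7E, 0x7E, 0x7E, 0x7E, 0x7E, 0x00 } }
};

/* ucs models R4 and sys_char models R3. If a definition is known, build
   the ten-byte sequence VDU 23,sys_char,row0..row7 in vdu[] and return 1
   (the module would then claim by setting R1 to 0). Otherwise return 0,
   meaning all registers are preserved. */
int define_ucs_char(unsigned long ucs, int sys_char, unsigned char vdu[10])
{
    size_t i;
    assert(vdu != NULL && sys_char >= 0 && sys_char <= 0xFF);
    for (i = 0; i < sizeof demo_glyphs / sizeof demo_glyphs[0]; i++) {
        if (demo_glyphs[i].ucs == ucs) {
            vdu[0] = 23;
            vdu[1] = (unsigned char)sys_char;
            memcpy(&vdu[2], demo_glyphs[i].rows, 8);
            return 1;
        }
    }
    return 0;
}
```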

Service_International 8 (Service Call &43)

Return a pointer to a UCS conversion table

On entry

R1=&43 (reason code)
R2=8 (sub-reason code)
R3=alphabet number

On exit

R1=0 if claimed, otherwise preserved
R2-R3 preserved
R4=pointer to table if recognised, otherwise preserved

Use

Any module providing additional alphabets should compare the given alphabet number with each alphabet number provided by the module. If the given alphabet number matches a known alphabet number, then it should claim the service (by setting R1 to 0), and set R4 to point to a UCS conversion table for that alphabet.

The table is a table of 256 32-bit words, one for each character, giving the equivalent UCS code for that character in the alphabet. If a character is not defined in the alphabet, its entry should contain &FFFFFFFF (not a valid UCS identifier).

Characters 0-31 and 127 are control codes under RISC OS, so their entries in the tables of all alphabets should be 0-31 and 127. This guarantees a sensible translation to UTF-8.
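
As an illustration of how a client might use such a table, the sketch below converts a string in an 8-bit alphabet to UTF-8. The function names are invented for this example, and substituting U+FFFD for undefined entries is an assumption of the sketch, not something this specification requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UCS_UNDEFINED   0xFFFFFFFFu  /* table entry for "no such character" */
#define UCS_REPLACEMENT 0xFFFDu      /* U+FFFD, this sketch's substitute */

/* Encode one UCS code (0 to &7FFFFFFF) as UTF-8; returns bytes written (1-6). */
size_t ucs_to_utf8(uint32_t ucs, unsigned char out[6])
{
    size_t len, i;
    assert(ucs <= 0x7FFFFFFFu);
    if (ucs < 0x80u) { out[0] = (unsigned char)ucs; return 1; }
    len = ucs < 0x800u ? 2 : ucs < 0x10000u ? 3 : ucs < 0x200000u ? 4
        : ucs < 0x4000000u ? 5 : 6;
    /* Fill the continuation bytes from the end, six bits at a time. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80u | (ucs & 0x3Fu));
        ucs >>= 6;
    }
    /* The start byte has len leading 1 bits, then the remaining bits. */
    out[0] = (unsigned char)(((0xFF00u >> len) | ucs) & 0xFFu);
    return len;
}

/* Convert in_len bytes of an 8-bit alphabet to UTF-8 using a 256-word
   table. Returns the number of bytes written; stops early if out is full. */
size_t alphabet_to_utf8(const uint32_t table[256],
                        const unsigned char *in, size_t in_len,
                        unsigned char *out, size_t out_size)
{
    size_t i, used = 0;
    for (i = 0; i < in_len; i++) {
        unsigned char buf[6];
        uint32_t ucs = table[in[i]];
        size_t n, j;
        if (ucs == UCS_UNDEFINED) ucs = UCS_REPLACEMENT;
        n = ucs_to_utf8(ucs, buf);
        if (used + n > out_size) break;
        for (j = 0; j < n; j++) out[used++] = buf[j];
    }
    return used;
}
```

Because entries 0-31 and 127 map to themselves, control codes pass through such a conversion unchanged.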

If the given alphabet number is not recognised, all registers should be preserved.

It is nonsensical to issue this service call for alphabet UTF8 (111). No module should claim such a service call.
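
The claim logic for reason code 8 can be modelled as follows. This is a portable sketch rather than real module code: the register array, the alphabet number 200 and the table are hypothetical stand-ins, and a real handler would normally be written against the module service call entry.

```c
#include <assert.h>
#include <stdint.h>

#define SERVICE_INTERNATIONAL   0x43
#define INTERNATIONAL_UCS_TABLE 8
#define ALPHABET_UTF8           111
#define DEMO_ALPHABET           200  /* hypothetical alphabet number */

/* Would hold the 256 UCS codes for DEMO_ALPHABET in a real module. */
static uint32_t demo_table[256];

/* r[0]..r[4] model registers R0-R4 on entry to the service handler. */
void service_handler(uintptr_t r[5])
{
    assert(r != 0);
    if (r[1] != SERVICE_INTERNATIONAL || r[2] != INTERNATIONAL_UCS_TABLE)
        return;                    /* some other service call */
    if (r[3] == ALPHABET_UTF8 || r[3] != DEMO_ALPHABET)
        return;                    /* UTF8 or unknown: preserve everything */
    r[1] = 0;                      /* claim the service call */
    r[4] = (uintptr_t)demo_table;  /* R4 points to the conversion table */
}
```

The explicit test for alphabet 111 is redundant here, since 111 can never equal the module's own alphabet number, but it makes the rule above visible in the code.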

4. General UCS mapping/utilities

For general translation to and from other character encodings, and to aid processing of UTF-8 text, a new RISC OS library, UnicodeLib, has been written. Its facilities can be broken down by header file.

<Unicode/charsets.h>

charsets.h defines constants of the form cs{name}, giving the IANA character encoding registration number for each character encoding. For example, csShiftJIS is defined as 17. These numbers can be used as convenient fixed character set identifiers.

<Unicode/languages.h>

languages.h defines constants specifying the ISO 639 two-letter language codes. For example, lang_JAPANESE is defined as "ja".

<Unicode/iso10646.h>

iso10646.h provides a couple of basic type definitions for UCS-handling applications.

typedef unsigned int UCS4;
typedef unsigned short UCS2;

#define NULL_UCS4 ((UCS4)0xFFFFFFFF)
#define NULL_UCS2 ((UCS2)0xFFFF)

The NULL defines are defined to be the standard "not a UCS character" values. These are used, for example, in mapping tables.

<Unicode/utf8.h>

This header file provides basic utilities for manipulating UTF-8 text. Detailed documentation is provided in the header file, but the basic functions provided are:

extern char *UCS4_to_UTF8(char *out, UCS4 code);

extern int UTF8_codelen(UCS4 code);
extern int UTF8_seqlen(char c);

extern int UTF8_to_UCS4(const char *in, UCS4 *code_out);

extern char *UTF8_next(const char *p);
extern char *UTF8_prev(const char *p);

extern char *UTF8_next_n(const char *p, int n_chars);

extern int UTF8_strlen(const char *p);
extern int UTF8_strlen_n(const char *p, int n_bytes);
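
To show the kind of work these functions do, here is an independent sketch of sequence-length detection, decoding and character counting. The sketch_ names are invented; the exact behaviour of the real functions (for example on malformed input) is defined by the header file, not by this code.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes (1-6) of the sequence introduced by start byte c,
   or 0 if c is a continuation byte or can never start a sequence. */
int sketch_seqlen(unsigned char c)
{
    if (c < 0x80) return 1;
    if (c < 0xC0) return 0;
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    if (c < 0xF8) return 4;
    if (c < 0xFC) return 5;
    if (c < 0xFE) return 6;
    return 0;
}

/* Decode the character at p into *code_out. Returns the number of bytes
   consumed, or 0 if the sequence is malformed or truncated. */
int sketch_decode(const char *p, unsigned long *code_out)
{
    const unsigned char *s = (const unsigned char *)p;
    unsigned long code;
    int len, i;
    assert(p != NULL && code_out != NULL);
    len = sketch_seqlen(s[0]);
    if (len == 0) return 0;
    if (len == 1) { *code_out = s[0]; return 1; }
    code = s[0] & (0x7Fu >> len);             /* payload bits of start byte */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* also stops at a terminator */
        code = (code << 6) | (s[i] & 0x3Fu);
    }
    *code_out = code;
    return len;
}

/* Count the characters in a zero-terminated UTF-8 string, or return -1
   if the string is malformed. */
int sketch_strlen(const char *p)
{
    unsigned long code;
    int count = 0;
    assert(p != NULL);
    while (*p != '\0') {
        int n = sketch_decode(p, &code);
        if (n == 0) return -1;
        p += n;
        count++;
    }
    return count;
}
```

For example, the six bytes E6 97 A5 E6 9C AC ("日本") decode as two characters, the first being U+65E5.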

<Unicode/encoding.h>

encoding.h provides access to the character encoding translation facilities of UnicodeLib. It provides facilities for stream-based translation between Unicode and a wide range of other encodings. Encodings are specified by IANA registry number (see charsets.h above), but a function to find the number given a MIME charset name is provided.
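
The name-to-number lookup might work along the following lines. This is a stand-in written for illustration only: the function name, the short list of names and the use of 0 for "not found" are assumptions, not the encoding.h interface. The numbers themselves are the registered IANA values.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef struct {
    const char *mime_name;
    int number;            /* IANA registration number */
} charset_entry;

/* A handful of registrations relevant to Japanese text. */
static const charset_entry known_charsets[] = {
    { "US-ASCII",      3 },
    { "ISO-8859-1",    4 },
    { "Shift_JIS",    17 },
    { "EUC-JP",       18 },
    { "ISO-2022-JP",  39 },
    { "UTF-8",       106 }
};

/* MIME charset names are compared without regard to case. */
static int same_name(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Returns the registration number for name, or 0 if it is not listed. */
int charset_number_from_name(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < sizeof known_charsets / sizeof known_charsets[0]; i++)
        if (same_name(name, known_charsets[i].mime_name))
            return known_charsets[i].number;
    return 0;
}
```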

Translation is generally via mapping tables. These are stored in Unicode:Encodings. Unicode$Path is set up by the !Unicode system resource, or points into an equivalent directory in ResourceFS.

Most tables are referenced by their ISO 2022 escape sequence, and stored in Unicode:Encodings.ISO2022, split by set type (C0/C1/G94/G94x94/G96). Only the first two characters of these filenames are significant - they give the hexadecimal code of the character set the table is for. Any remaining characters can be used as a comment (eg "Unicode:Encodings.ISO2022.G94.41[646-GB]"). New ISO character sets can be added here and the encoding library will find them if an appropriate escape sequence is placed at the start of the text.
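
The filename rule can be sketched as a small parser that reads the two significant hexadecimal digits and ignores the rest. The function name is invented for this example, and returning -1 for a malformed name is an assumption of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = toupper(c);
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Extract the character set code from a table leafname such as
   "41[646-GB]". Only the first two characters are examined. */
int table_code_from_leafname(const char *leaf)
{
    int hi, lo;
    assert(leaf != NULL);
    hi = hex_value((unsigned char)leaf[0]);
    if (hi < 0) return -1;   /* also catches an empty name */
    lo = hex_value((unsigned char)leaf[1]);
    if (lo < 0) return -1;   /* also catches a one-character name */
    return hi * 16 + lo;
}
```

With this rule, "41[646-GB]" and a bare "41" identify the same table.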

Current ISO 2022 character sets understood are:

C0      C0 set of ISO 646
C1      C1 set of ISO 4873
G94     ISO 646-IRV (40), ISO 646-GB (41), ISO 646-IRV (42), Finland/Sweden (43), ISO 646-SE (47), ISO 646-SE (48), Katakana (49), JIS Roman (4A), ISO 646-DE (4B), ISO 646 Portuguese (4C), GB 1988-80 (54), UK Teletext (56), ISO 646-IT (59), ISO 646-ES (5A), ISO 646-NO (60), ISO 646-FR (66), ISO 646-HU (69), ASMO 449 Arabic (6B), ISO 6937:1983 (6C), Serbocroatian/Slovenian (7A)
G96     Latin-1 (41), Latin-2 (42), Latin-3 (43), Latin-4 (44), Greek (46), Arabic (47), Hebrew (48), Cyrillic (4C), Latin-5 (4D), Latin-1,2,5 supplementary (50), ISO 6937:1994 (52), Latin-6 (56), Latin-6 supplementary (58), Baltic Rim supplementary (59), Welsh (5C), Sami (5D), Latin-8 (5F), Latin-9 (62)
G94x94  JIS C 6226-1978 (40), GB 2312-80 (41), JIS X 0208-1983 (42), KSC 5601 (43), JIS X 0212-1990 (44), CNS 11643 Sets 1-7 (47-4D)

The encoding files are data files containing packed 16-bit words, one per character in the set, giving its UCS equivalent. &FFFF is used as a null value. C0 and C1 sets contain 32 characters, and G94, G96 and G94x94 sets contain, unsurprisingly, 94, 96 and 8836 characters respectively.
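
Purely for illustration, a lookup in such a table might look like the sketch below. Two details are assumptions not stated above: that the 16-bit words are stored least significant byte first, and that a G94x94 table is ordered row by row (first byte major). The function names are invented.

```c
#include <assert.h>
#include <stddef.h>

#define NO_UCS2 0xFFFFu   /* null value used in the tables */

/* Read entry 'index' from a table of packed 16-bit words held in memory.
   Little-endian storage is assumed. Out-of-range reads return NO_UCS2. */
unsigned int table_entry(const unsigned char *data, size_t n_entries,
                         size_t index)
{
    assert(data != NULL);
    if (index >= n_entries) return NO_UCS2;
    return (unsigned int)data[2 * index]
         | ((unsigned int)data[2 * index + 1] << 8);
}

/* Index of the two-byte character (b1, b2) in a G94x94 table, assuming
   row-major order. Both bytes must lie in &21-&7E; returns -1 if not. */
long index_94x94(int b1, int b2)
{
    if (b1 < 0x21 || b1 > 0x7E || b2 < 0x21 || b2 > 0x7E) return -1;
    return (long)(b1 - 0x21) * 94 + (b2 - 0x21);
}
```

Under these assumptions the last character, &7E &7E, has index 8835, which matches the 8836 entries of a G94x94 set.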

There are a few character sets that aren't ISO registered. These are stored separately:

Acorn.Latin1      The standard RISC OS alphabet
Apple.MacRoman    Standard Macintosh encoding
BigFive           The de facto Chinese standard
KOI8-R            The de facto Russian Internet standard
Microsoft.CP1250  Microsoft Windows Eastern European code page
Microsoft.CP1251  Microsoft Windows Cyrillic code page
Microsoft.CP1252  Microsoft Windows Western European code page

All of these, except BigFive, contain 128 characters, for code positions 128-255. 0-127 are assumed to be ASCII. BigFive contains 14758 characters (a 94×157 array), listing the double-byte characters of Big Five.

There should be no need to manipulate any of these extra tables; you should use the encoding library instead. This information is for completeness only. The interface to the encoding facilities is fully documented by the encoding.h header file.

<Unicode/unictype.h>

This header file provides access to information from the Unicode Character Database. At present, it only provides a call to find the character category, and a call to check whether a character is ideographic.

<Unicode/autojp.h>

autojp.h provides a method of autodetecting which encoding a Japanese text file is written in (Japanese web pages typically rely on this autodetection). It can detect EUC, ISO 2022-JP, Shift-JIS and pure ASCII.
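
The general idea behind this kind of autodetection can be shown with a much simplified heuristic. This is not the algorithm used by autojp.h, and the names are invented: it looks for an ISO 2022 escape sequence, then for any byte pattern that cannot occur in EUC, and treats everything else with top-bit-set bytes as EUC.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { JP_ASCII, JP_ISO2022JP, JP_EUC, JP_SHIFTJIS } jp_encoding;

/* Guess the encoding of a block of Japanese text (simplified heuristic). */
jp_encoding guess_jp_encoding(const unsigned char *text, size_t len)
{
    size_t i = 0;
    int high_seen = 0;
    assert(text != NULL || len == 0);
    while (i < len) {
        unsigned char c = text[i];
        size_t need, k;
        /* ESC $ or ESC ( starts an ISO 2022 designation sequence. */
        if (c == 0x1B && i + 1 < len
                && (text[i + 1] == '$' || text[i + 1] == '('))
            return JP_ISO2022JP;
        if (c < 0x80) { i++; continue; }
        high_seen = 1;
        if (c == 0x8E) need = 1;                    /* EUC half-width kana */
        else if (c == 0x8F) need = 2;               /* EUC JIS X 0212 */
        else if (c >= 0xA1 && c <= 0xFE) need = 1;  /* EUC JIS X 0208 */
        else return JP_SHIFTJIS;          /* lead byte impossible in EUC */
        for (k = 1; k <= need; k++)
            if (i + k >= len || text[i + k] < 0xA1 || text[i + k] > 0xFE)
                return JP_SHIFTJIS;       /* trail byte impossible in EUC */
        i += need + 1;
    }
    return high_seen ? JP_EUC : JP_ASCII;
}
```

A heuristic this simple will misjudge some input, for example Shift-JIS text made up solely of half-width katakana, which is one reason to rely on the library instead.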

5. Japanese territory

Japan is already assigned RISC OS territory number 32. The territory module, "Japan", specifies the following settings:

Setting                       Value
Alphabet                      111: UTF8
Keyboard                      32: Japan
Time zones                    "JST", +9:00 hours (for both standard and DST)
Weekdays - full (%WE)         "日曜日", "月曜日", "火曜日", "水曜日", "木曜日", "金曜日", "土曜日"
Weekdays - short (%W3)        "日", "月", "火", "水", "木", "金", "土"
Months (%MO/%M3)              "1月", "2月", "3月", "4月", "5月", "6月", "7月", "8月", "9月", "10月", "11月", "12月"
Ordinal suffix (%ST)          "日"
AM/PM (%AM/%PM)               "午前"/"午後"
Era start dates               "水, 01-7月-1868", "火, 30-7月-1912", "土, 25-12月-1926", "日, 08-1月-1989"
Eras - full (%JE)             "明治", "大正", "昭和", "平成"
Eras - short (%J1)            "M", "T", "S", "H"
Standard date format          "%ce%yr年%m3%zdy日"
Standard time format          "%z24:%mi:%se"
Standard date+time format     "%z24:%mi:%se %ce%yr年%m3%zdy日"
Territory names (1-36,48,49)  "イギリス", "Master", "Compact", "イタリア", "スペイン", "フランス", "ドイツ", "ポルトガル", "エスペラント", "ギリシア", "スウェーデン", "フィンランド", "--unused--", "デンマーク", "ノルウェー", "アイスランド", "カナダ1", "カナダ2", "カナダ", "トルコ", "アラビア", "アイルランド", "香港", "ロシア", "ロシア2", "イスラエル", "メキシコ", "ラテンアメリカ", "オーストラリア", "オーストリア", "ベルギー", "日本", "中東", "オランダ", "スイス", "ウェールズ", "アメリカ", "ウェールズ2"
Writing direction             Horizontal, left-to-right, top-to-bottom
Character properties          0-127, as per UK territory; 128-255 will have no properties.
Lower case table              0-127, as per UK territory; 128-255 don't change
Upper case table              0-127, as per UK territory; 128-255 don't change
Control table                 0-127, as per UK territory; 128-255 don't change
Plain table                   0-127, as per UK territory; 128-255 don't change
Value table                   0-127, as per UK territory; 128-255 don't change
Representation table          "0123456789ABCDEF"
Collate call                  UCS order, case insensitive only in Basic Latin
Symbols                       Decimal point: "."
                              Thousands separator: ","
                              Grouping: 3,0
                              International currency symbol: "JPY "
                              Currency symbol: "¥"
                              Monetary decimal point: "."
                              Monetary thousands separator: " "
                              Monetary grouping: 3,0
                              Monetary positive sign: ""
                              Monetary negative sign: "-"
                              International monetary fractional digits: 0
                              Monetary fractional digits: 0
                              Monetary formats: ¥50, -¥50
                              List separator: ";"
Calendar information          First working day: 2 (Monday/月曜日)
                              Last working day: 6 (Friday/金曜日)
                              Months in year: 12
                              Days in month: As per UK territory
                              Max %AM/%PM length: 6 bytes
                              Max %WE length: 9 bytes
                              Max %W3 length: 3 bytes
                              Max %DY length: 2 bytes
                              Max %ST length: 3 bytes
                              Max %MO length: 5 bytes
                              Max %M3 length: 5 bytes
                              Max %TZ length: 3 bytes

The Japan territory will provide three special time format strings related to the Japanese era system of year numbering. Information about eras comes from the territory's Messages file, so a new era can be added easily. The initial settings are listed above. You can't use Territory_ReadCalendarInformation to find the maximum length of these format strings, so you should assume the maximum lengths listed here.

%JE  Japanese era name in full (max length 12 bytes)
%J1  Abbreviation for Japanese era name (max length 2 bytes)
%JY  Year number within the era (max length 3 bytes)
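
The relationship between a calendar date and the era values can be sketched as follows, using the era start dates from the settings table. The structure and function names are invented for this example; the real territory reads its era list from the Messages file, and the sketch assumes the usual convention that the year in which an era starts counts as year 1.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *letter;    /* %J1 abbreviation */
    int year, month, day;  /* first day of the era */
} era;

/* Era start dates as given in the settings table, oldest first. */
static const era eras[] = {
    { "M", 1868,  7,  1 },
    { "T", 1912,  7, 30 },
    { "S", 1926, 12, 25 },
    { "H", 1989,  1,  8 }
};

#define N_ERAS ((int)(sizeof eras / sizeof eras[0]))

/* Find the era containing the given date. Returns the era index and
   stores the %JY value in *era_year, or returns -1 for earlier dates. */
int era_for_date(int year, int month, int day, int *era_year)
{
    long key = (long)year * 10000 + month * 100 + day;
    int i;
    assert(era_year != NULL);
    for (i = N_ERAS - 1; i >= 0; i--) {
        long start = (long)eras[i].year * 10000
                   + eras[i].month * 100 + eras[i].day;
        if (key >= start) {
            *era_year = year - eras[i].year + 1;
            return i;
        }
    }
    return -1;
}

/* %J1 abbreviation for an era index returned by era_for_date. */
const char *era_letter(int index)
{
    assert(index >= 0 && index < N_ERAS);
    return eras[index].letter;
}
```

With these dates, 2nd July 2003 falls in year 15 of the fourth era ("H"), while 7th January 1989 is still year 64 of the third ("S").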

Note that many of the alphabet settings (such as the character property table) assume a single-byte, 8-bit alphabet. The values suggested here should cause most applications that use these calls (and the C <ctype.h> functions) not to interfere with any UTF-8 strings.

See Future enhancements for more notes on this.

6. Japanese keyboard driver

A Japanese keyboard layout will be added to the InternationalKeyboard module. The layout will be the standard 109-key Japanese PC keyboard layout (JIS X 6002-1985), and it will output UTF-8.

All drivers will detect the system alphabet being UTF8, and will then output UTF-8 rather than their native alphabet. This will allow any keyboard layout to be used in conjunction with the Japanese IME. (Hence the Japanese keyboard will always output UTF-8.)

The Japanese keyboard has a number of special keys that are used to drive the IME. These will output special function-key-like buffer codes that the IME will pick up and act on. If the IME is not being used, they will do nothing.

Key                                       Label          Code    Position
Windows                                   Windows logo   &C0     A01a and A04a (outside the Alt keys)
Menu                                      Menu symbol    &C1     A04b (to the left of right Ctrl)
Hankaku/Zenkaku (Half-width/Full-width)   半角/全角      &C2     E01 (to the left of 1)
Kanji (Toggle IME)                        漢字           &C3     Alt+E01
Eisuu (Roman)                             英数           &C4     C01 (normally Caps Lock)
Muhenkan (Non-convert)                    無変換         &C5     A02a (between left Alt and Space)
Henkan/Jikouhou (Convert/Next Candidate)  変換 (次候補)  &C6     A03a (between Space and Kana)
Kana (Hiragana)                           ひらがな       &C7     A03b (between Henkan and right Alt)
Zenkouho (All candidates)                 全候補         &C8     Alt+A03a
Kanji Bangou (Kanji number)               漢字番号       &C9     Alt+C01
Caps Lock                                 Caps Lock      (none)  Shift+C01
Shift-Henkan (Previous convert)           前候補         &D6     Shift+A03a
Shift-Kana (Katakana)                     カタカナ       &D7     Shift+A03b
Kana/Romaji Lock                          ローマ字       (none)  Alt+A03b

All these keys generate different codes in conjunction with Shift and/or Ctrl, as the normal function keys do (+&10 for Shift, +&20 for Ctrl, +&30 for both). The only exception is Shift-Eisuu, which is Caps Lock (actually it will be Shift-Caps). Only the shifted forms with specific labels are listed above. These codes - &C0-&C9, &D0-&D9, &E0-&E9, &F0-&F9 - were previously unallocated in RISC OS.

This specification states that these key codes (apart from the ones for Windows/Menu) can be used for other purposes by other drivers. It is nevertheless recommended that they be used for special keys in the same physical position (ie the same low-level key number). These mappings from physical key to pseudo-function key will probably be retained in all drivers, even where the keys themselves are not physically present. The Wimp passes these codes to applications as &1C0-&1C9 ... &1F0-&1F9.
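
The arithmetic behind these codes is small enough to show directly. In this sketch the modifier offsets are read as hexadecimal (Shift-Henkan is listed as &D6, which is &C6 plus &10), and the function and macro names are invented for illustration.

```c
#include <assert.h>

#define KEY_SHIFT_OFFSET 0x10   /* added when Shift is held */
#define KEY_CTRL_OFFSET  0x20   /* added when Ctrl is held */
#define WIMP_KEY_OFFSET  0x100  /* Wimp codes are &1C0-&1F9 */

/* Buffer code for one of the special keys. base is the unshifted code
   (&C0-&C9). The Shift-Eisuu exception (Caps Lock) is not modelled. */
int ime_buffer_code(int base, int shift, int ctrl)
{
    assert(base >= 0xC0 && base <= 0xC9);
    return base + (shift ? KEY_SHIFT_OFFSET : 0)
                + (ctrl ? KEY_CTRL_OFFSET : 0);
}

/* Code the Wimp passes to applications for a given buffer code. */
int ime_wimp_code(int buffer_code)
{
    assert(buffer_code >= 0xC0 && buffer_code <= 0xF9);
    return WIMP_KEY_OFFSET + buffer_code;
}
```

For example, Shift-Henkan is buffer code &D6 and reaches applications through the Wimp as &1D6.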

Kana/Romaji lock (Alt-Kana) is handled internally by the driver, in the same way as Caps Lock. No buffer code is generated. OS_Byte 202's bit 5 [PRM 1-884] is now assigned to Kana/Romaji:

Bit  Value  Meaning
5    0      Kana Lock (Hiragana entry)
     1      Romaji Lock (Latin entry, as UK)

When Kana Lock is active, an alphabetic key will output Hiragana; Alt plus a key will output Katakana. When Romaji Lock is active, the alphabetic keys will output Latin characters; Alt plus a key will output Hiragana. The default state will be Romaji Lock. The Kana/Romaji state will be orthogonal to the Caps Lock state, although Caps Lock is not meaningful in Kana Lock mode.
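
The four combinations described above can be summarised in code. This is a model of the stated behaviour only, with invented names; it takes the keyboard status byte (as read with OS_Byte 202) and the state of Alt, and reports which script an alphabetic key would produce.

```c
#include <assert.h>

/* Bit 5 of the keyboard status byte: set for Romaji Lock, clear for Kana Lock. */
#define KBSTATUS_ROMAJI_LOCK (1u << 5)

typedef enum { SCRIPT_LATIN, SCRIPT_HIRAGANA, SCRIPT_KATAKANA } script;

/* Script output by an alphabetic key for a given status byte and Alt state. */
script script_for_key(unsigned int keyboard_status, int alt_down)
{
    assert(keyboard_status <= 0xFFu);
    if (keyboard_status & KBSTATUS_ROMAJI_LOCK)
        return alt_down ? SCRIPT_HIRAGANA : SCRIPT_LATIN;
    return alt_down ? SCRIPT_KATAKANA : SCRIPT_HIRAGANA;
}
```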

Note that the IME will not care whether the keyboard driver outputs Katakana or Hiragana, because its own Eisuu/Katakana/Hiragana tristate overrides the driver's choice.

Kana Lock in RISC OS is more effective than in Windows - many of the Kana characters marked on a Japanese keyboard are in fact unavailable on PCs. Our use of Unicode keyboard driver output, as opposed to JIS X 0201 output, makes these available.

Kana Lock will be Hangul Lock in Korean. The meaning can be similarly adapted for other territories.

7. New low-level keys

The extra physical keys on the Japanese keyboard are assigned the following low-level codes:

Key            PS/2 code  Low-level code  Internal code  −ve INKEY code
Windows left   E0 1F      68              125            INKEY −126
Windows right  E0 27      69              126            INKEY −127
Menu           E0 2F      6A              127            INKEY −128
No convert     67         6B              109            INKEY −110
Convert        64         6C              110            INKEY −111
Kana           13         6D              111            INKEY −112
Yen/Bar        6A         1D              46             INKEY −47
\ _            51         6E              95             INKEY −96

Notes: The Yen/Bar key is to the left of backspace (ie where the £ sign was on A-series keyboards). It hence resurrects the old 1D low level code.
The \ _ key is to the left of the right shift key (analogous to the \ | key on UK keyboards). It has its own PS/2 scan code, hence it is given its own RISC OS code, rather than borrowing the \ | code, as there is no reason why a keyboard couldn't have both of these keys.

The keyboard hardware driver maps the new codes to low-level codes, and the InternationalKeyboard module maps the new low-level codes to internal codes.
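
The second mapping, and the relationship between the internal code and the negative INKEY number, can be written down as a small table. The structure and function names are invented for this sketch; the values are those listed in the table above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    int low_level;   /* low-level code, hexadecimal in the table */
    int internal;    /* internal key number, decimal in the table */
} new_key;

static const new_key new_keys[] = {
    { "Windows left",  0x68, 125 },
    { "Windows right", 0x69, 126 },
    { "Menu",          0x6A, 127 },
    { "No convert",    0x6B, 109 },
    { "Convert",       0x6C, 110 },
    { "Kana",          0x6D, 111 },
    { "Yen/Bar",       0x1D,  46 },
    { "\\ _",          0x6E,  95 }
};

/* Internal code for one of the new low-level codes, or -1 if the
   low-level code is not one of those added by this specification. */
int internal_from_low_level(int low_level)
{
    size_t i;
    for (i = 0; i < sizeof new_keys / sizeof new_keys[0]; i++)
        if (new_keys[i].low_level == low_level)
            return new_keys[i].internal;
    return -1;
}

/* Negative INKEY number used to test a key with a given internal code. */
int inkey_from_internal(int internal)
{
    assert(internal >= 0 && internal <= 127);
    return -(internal + 1);
}
```

So the Kana key (low-level code 6D, internal code 111) is tested with INKEY −112.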

8. Future enhancements

Full character property and sorting support for UTF-8 will require a new interface. This will be defined at some future point. It seems likely that the International module will provide some sort of interface to the Unicode Character Database, which individual territories will use, modifying any locale-specific settings, to provide the necessary facilities.

The Territory_Collate entry point can provide full functionality for UTF-8; however initially it will just collate characters 0-127 correctly. Remaining characters will just be sorted into UCS order.
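
The initial collation behaviour is easy to approximate, because comparing well-formed UTF-8 strings byte by byte gives the same result as comparing their UCS codes. The sketch below relies on that property and folds case only for A-Z; the function name and the ignore_case parameter are assumptions for illustration, not the Territory_Collate interface.

```c
#include <assert.h>
#include <stddef.h>

/* Fold upper case to lower case in Basic Latin only. */
static int fold_basic_latin(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Compare two zero-terminated UTF-8 strings in UCS order.
   Returns -1, 0 or 1. Bytes of multibyte sequences are all &80 or
   above, so the case folding never touches them. */
int collate_utf8(const char *a, const char *b, int ignore_case)
{
    const unsigned char *x = (const unsigned char *)a;
    const unsigned char *y = (const unsigned char *)b;
    assert(a != NULL && b != NULL);
    for (;; x++, y++) {
        int cx = *x, cy = *y;
        if (ignore_case) {
            cx = fold_basic_latin(cx);
            cy = fold_basic_latin(cy);
        }
        if (cx != cy) return cx < cy ? -1 : 1;
        if (cx == 0) return 0;
    }
}
```

For example, "日" (U+65E5) sorts before "月" (U+6708) under this ordering, which is UCS order rather than any dictionary order.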

9. References

HTML 4.0
http://www.w3.org/TR/REC-html40/
ISO 2022
International Standard ISO/IEC 2022:1994: Information technology - Character code structure and extension techniques
ISO 10646
International Standard ISO/IEC 10646-1: Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane
JIS X 6002-1985
Japanese Standard JIS X 6002-1985: Keyboard Layout for Information Processing Using the JIS 7 Bit Coded Character Set
UJIP
Understanding Japanese Information Processing - Ken Lunde
Unicode
The Unicode Standard, Version 2.0 - The Unicode Consortium

10. Glossary

FEP       Front End Processor - another name for IME (qv)
Kanji     The Japanese ideographic characters
Hiragana  The Japanese phonetic alphabet used for Japanese words
IANA      Internet Assigned Numbers Authority
IME       Input Method Engine
Kana      Katakana or Hiragana
Katakana  The Japanese phonetic alphabet used for foreign words, or for emphasis
Romaji    The Japanese name for the Latin alphabet
UCS       Universal Character Set - the character set defined in ISO 10646 and the Unicode Standard
UCS-2     The encoding of the UCS (qv) into 16-bit words. What most people mean if they refer to "Unicode" in the context of an encoding.
UTF-8     UCS Transformation Format 8 - the standard multibyte encoding of UCS data.

11. History

Issue  Date         Author        Description of change
1      14 Sep 1998  Kevin Bracey  Initial release.
2      02 Jul 2003  Tematic       Revised for public release.