| Drawing number: | 2205,203/FS |
| Issue: | 2 |
| Status: | Released |
| Author: | Kevin Bracey |
| Date: | 2nd July 2003 |
At some future time the kernel VDU driver will be taught about UTF-8, and how to display large character sets, and these definitions will not be used. In the meantime, these definitions will be useful for debugging.
UTF8 is to be regarded as a "special" alphabet by applications. All other RISC OS alphabets are simple 8-bit encodings, with the lower set identical to ASCII. These can be mapped to Unicode using the tables/calls described below. Unicode-aware applications should detect the system alphabet being set to UTF8, and be aware that incoming text-files or keyboard input should be assumed to be a UTF-8 stream. The vast majority of applications do not do any real text processing at all, just treating input and output as a sequence of bytes, looking for ASCII characters that are significant to them and treating top-bit set characters as alphabetic. These applications will not generally be bothered by the system alphabet being UTF8.
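The reason byte-oriented applications are generally unaffected is a property of UTF-8 itself: every byte of a multibyte sequence has its top bit set, so a search for an ASCII character can never match in the middle of one. A minimal sketch of such a byte-level search follows; the name `find_ascii` is hypothetical and not part of any RISC OS API.

```c
#include <stddef.h>

/* Hypothetical helper: find the first occurrence of an ASCII character
 * (0-127) in a byte string, treating the text purely as bytes.
 * Returns its byte offset, or -1 if not found (or if ch is not ASCII). */
long find_ascii(const char *text, char ch)
{
    size_t i;
    if ((unsigned char)ch >= 0x80)
        return -1;
    for (i = 0; text[i] != '\0'; i++) {
        /* Bytes of a UTF-8 multibyte sequence are all >= 0x80,
         * so they can never compare equal to an ASCII ch. */
        if (text[i] == ch)
            return (long)i;
    }
    return -1;
}
```

The same loop works unchanged whether the system alphabet is Latin1 or UTF8, which is the behaviour the paragraph above relies on.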
If the character number is not recognised, all registers should be preserved.
This call is intended as an interim measure before the kernel supports large character sets. The standard International module contains several hundred character definitions - this call allows easy access to them.
The table is a table of 256 32-bit words, one for each character, giving the equivalent UCS code for that character in the alphabet. If a character is not defined in the alphabet, its entry should contain &FFFFFFFF (not a valid UCS identifier).
Characters 0-31 and 127 are control codes under RISC OS - their entries in the tables of all alphabets should be 0-31 and 127 - this will guarantee a sensible translation to UTF8.
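As an illustration of how a client might use such a table once it has obtained one, the sketch below maps a byte string through a 256-entry table. The function name `alphabet_to_ucs` is hypothetical, and substituting U+FFFD for undefined characters is a choice made for this sketch, not something this specification requires.

```c
#include <stddef.h>

typedef unsigned int UCS4;
#define NULL_UCS4 ((UCS4)0xFFFFFFFF)

/* Hypothetical helper: map each byte of `in` through a 256-entry alphabet
 * table of the form described above. Undefined characters (NULL_UCS4) are
 * replaced by U+FFFD; that substitution is this sketch's own choice.
 * `out` must have room for `len` codes. Returns the number written. */
size_t alphabet_to_ucs(const UCS4 table[256], const unsigned char *in,
                       size_t len, UCS4 *out)
{
    size_t i;
    for (i = 0; i < len; i++) {
        UCS4 code = table[in[i]];
        out[i] = (code == NULL_UCS4) ? 0xFFFDu : code;
    }
    return len;
}
```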
If the given alphabet number is not recognised, all registers should be preserved.
It is nonsensical to issue this service call for alphabet UTF8 (111). No module should claim such a service call.
typedef unsigned int UCS4;
typedef unsigned short UCS2;
#define NULL_UCS4 ((UCS4)0xFFFFFFFF)
#define NULL_UCS2 ((UCS2)0xFFFF)
The NULL defines are defined to be the standard "not a UCS character" values. These are used, for example, in mapping tables.
extern char *UCS4_to_UTF8(char *out, UCS4 code);
extern int UTF8_codelen(UCS4 code);
extern int UTF8_seqlen(char c);
extern int UTF8_to_UCS4(const char *in, UCS4 *code_out);
extern char *UTF8_next(const char *p);
extern char *UTF8_prev(const char *p);
extern char *UTF8_next_n(const char *p, int n_chars);
extern int UTF8_strlen(const char *p);
extern int UTF8_strlen_n(const char *p, int n_bytes);
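The declarations above do not spell out the exact semantics of each function, so the following is only a sketch of what an encoder in this style might look like. The names `sketch_utf8_codelen` and `sketch_ucs4_to_utf8` are hypothetical stand-ins, the return convention (a pointer just past the last byte written) is an assumption, and the sketch limits itself to the one- to four-byte forms (codes up to &10FFFF) without checking for surrogates.

```c
typedef unsigned int UCS4;

/* Number of bytes needed to encode `code` in UTF-8 (0 if out of range). */
int sketch_utf8_codelen(UCS4 code)
{
    if (code < 0x80u)     return 1;
    if (code < 0x800u)    return 2;
    if (code < 0x10000u)  return 3;
    if (code < 0x110000u) return 4;
    return 0;
}

/* Write the UTF-8 form of `code` at `out`. Returns a pointer just past the
 * last byte written (an assumed convention), or `out` unchanged if the
 * code is out of range. */
char *sketch_ucs4_to_utf8(char *out, UCS4 code)
{
    static const unsigned char lead[5] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };
    unsigned char *p = (unsigned char *)out;
    int len = sketch_utf8_codelen(code);
    int i;
    if (len == 0)
        return out;
    /* Continuation bytes (10xxxxxx), filled from the last byte backwards. */
    for (i = len - 1; i > 0; i--) {
        p[i] = (unsigned char)(0x80u | (code & 0x3Fu));
        code >>= 6;
    }
    /* Lead byte: 0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx. */
    p[0] = (unsigned char)(lead[len] | code);
    return out + len;
}
```

For example, U+65E5 (日) encodes as the three bytes &E6 &97 &A5, which is why each weekday abbreviation in the Japan territory occupies 3 bytes.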
Translation is generally via mapping tables. These are stored in Unicode:Encodings. Unicode$Path is set up by the !Unicode system resource, or points into an equivalent directory in ResourceFS.
Most tables are referenced by their ISO 2022 escape sequence, and stored in Unicode:Encodings.ISO2022, split by set type (C0/C1/G94/G94x94/G96). Only the first two characters of these filenames are significant - they give the hexadecimal code of the character set the table is for. Any remaining characters can be used as a comment (eg "Unicode:Encodings.ISO2022.G94.41[646-GB]"). New ISO character sets can be added here and the encoding library will find them if an appropriate escape sequence is placed at the start of the text.
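To illustrate the filename convention, the sketch below extracts the character set code from a leaf name by reading only its first two hexadecimal characters. The function name `table_code_from_leaf` is hypothetical; the encoding library's own lookup may well work differently.

```c
#include <ctype.h>

/* Hypothetical helper: return the character set code (0-255) given by the
 * first two hexadecimal characters of a table leaf name such as
 * "41[646-GB]", or -1 if the name does not start with two hex digits.
 * Anything after the first two characters is treated as a comment. */
int table_code_from_leaf(const char *leaf)
{
    int value = 0, i;
    for (i = 0; i < 2; i++) {
        int c = (unsigned char)leaf[i];
        if (!isxdigit(c))
            return -1;
        c = toupper(c);
        value = value * 16 + (isdigit(c) ? c - '0' : c - 'A' + 10);
    }
    return value;
}
```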
Current ISO 2022 character sets understood are:
| C0 | C0 set of ISO 646 |
| C1 | C1 set of ISO 4873 |
| G94 | ISO 646-IRV (40), ISO 646-GB (41), ISO 646-IRV (42), Finland/Sweden (43), ISO 646-SE (47), ISO 646-SE (48), Katakana (49), JIS Roman (4A), ISO 646-DE (4B), ISO 646 Portuguese (4C), GB 1988-80 (54), UK Teletext (56), ISO 646-IT (59), ISO 646-ES (5A), ISO 646-NO (60), ISO 646-FR (66), ISO 646-HU (69), ASMO 449 Arabic (6B), ISO 6937:1983 (6C), Serbocroatian/Slovenian (7A) |
| G96 | Latin-1 (41), Latin-2 (42), Latin-3 (43), Latin-4 (44), Greek (46), Arabic (47), Hebrew (48), Cyrillic (4C), Latin-5 (4D), Latin-1,2,5 supplementary (50), ISO 6937:1994 (52), Latin-6 (56), Latin-6 supplementary (58), Baltic Rim supplementary (59), Welsh (5C), Sami (5D), Latin-8 (5F), Latin-9 (62) |
| G94x94 | JIS C 6226-1978 (40), GB 2312-80 (41), JIS X 0208-1983 (42), KSC 5601 (43), JIS X 0212-1990 (44), CNS 11643 Sets 1-7 (47-4D) |
The encoding files are data files containing packed 16-bit words, one per character in the set, giving its UCS equivalent. &FFFF is used as a null value. C0 and C1 sets contain 32 characters, and G94, G96 and G94x94 sets contain, unsurprisingly, 94, 96 and 8836 characters respectively.
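A sketch of reading such a table image once it is in memory is given below. Two points are assumptions rather than facts stated here: that the 16-bit words are stored little-endian, and that a G94x94 table is laid out row-major with the first byte of the character selecting the row. The function names are hypothetical; applications should use the encoding library rather than code like this.

```c
#include <stddef.h>

typedef unsigned short UCS2;
#define NULL_UCS2 ((UCS2)0xFFFF)

/* Read entry `index` from a table image of `n_entries` packed 16-bit words.
 * ASSUMPTION: words are stored little-endian.
 * Out-of-range indices yield the null value. */
UCS2 table_entry(const unsigned char *image, size_t n_entries, size_t index)
{
    if (index >= n_entries)
        return NULL_UCS2;
    return (UCS2)(image[2 * index] | (image[2 * index + 1] << 8));
}

/* Index of a two-byte G94x94 character (each byte in 0x21-0x7E).
 * ASSUMPTION: the table is row-major, first byte selecting the row.
 * Returns 8836 (one past the last valid index) for invalid bytes. */
size_t g94x94_index(unsigned char b1, unsigned char b2)
{
    if (b1 < 0x21 || b1 > 0x7E || b2 < 0x21 || b2 > 0x7E)
        return 94 * 94;
    return (size_t)(b1 - 0x21) * 94 + (size_t)(b2 - 0x21);
}
```

Passing the result of `g94x94_index` to `table_entry` with `n_entries` set to 8836 returns the null value for any invalid byte pair, so the two functions compose without further range checks.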
There are a few character sets that aren't ISO registered. These are stored separately:
| Acorn.Latin1 | The standard RISC OS alphabet |
| Apple.MacRoman | Standard Macintosh encoding |
| BigFive | The de facto Chinese standard |
| KOI8-R | The de facto Russian Internet standard |
| Microsoft.CP1250 | Microsoft Windows Eastern European code page |
| Microsoft.CP1251 | Microsoft Windows Cyrillic code page |
| Microsoft.CP1252 | Microsoft Windows Western European code page |
All of these, except BigFive, contain 128 characters, for code positions 128-255. 0-127 are assumed to be ASCII. BigFive contains 14758 characters (a 94×157 array), listing the double-byte characters of Big Five.
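The 94×157 shape follows from the structure of Big Five codes: 94 possible lead bytes (&A1-&FE) and 157 possible trail bytes (&40-&7E and &A1-&FE). The sketch below computes an index on that basis. The row and column ordering is an assumption, not something this specification states, and the function name `bigfive_index` is hypothetical.

```c
/* Index into a 94x157 Big Five table for the double-byte character
 * (lead, trail), or -1 if the pair is not a valid Big Five code.
 * ASSUMPTION: rows are lead bytes 0xA1-0xFE; columns are trail bytes
 * 0x40-0x7E followed by 0xA1-0xFE. */
int bigfive_index(unsigned char lead, unsigned char trail)
{
    int col;
    if (lead < 0xA1 || lead > 0xFE)
        return -1;
    if (trail >= 0x40 && trail <= 0x7E)
        col = trail - 0x40;          /* columns 0-62 */
    else if (trail >= 0xA1 && trail <= 0xFE)
        col = trail - 0xA1 + 63;     /* columns 63-156 */
    else
        return -1;
    return (lead - 0xA1) * 157 + col;
}
```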
There should be no need to manipulate any of these extra tables; you should use the encoding library instead. This information is for completeness only. The interface to the encoding facilities is fully documented by the encoding.h header file.
| Setting | Value |
|---|---|
| Alphabet | 111: UTF8 |
| Keyboard | 32: Japan |
| Time zones | "JST", +9:00 hours (for both standard and DST) |
| Weekdays - full (%WE) | "日曜日", "月曜日", "火曜日", "水曜日", "木曜日", "金曜日", "土曜日" |
| Weekdays - short (%W3) | "日", "月", "火", "水", "木", "金", "土" |
| Months (%MO/%M3) | "1月", "2月", "3月", "4月", "5月", "6月", "7月", "8月", "9月", "10月", "11月", "12月" |
| Ordinal suffix (%ST) | "日" |
| AM/PM (%AM/%PM) | "午前"/"午後" |
| Era start dates | "水, 01-7月-1868", "火, 30-7月-1912", "土, 25-12月-1926", "日, 08-1月-1989" |
| Eras - full (%JE) | "明治", "大正", "昭和", "平成" |
| Eras - short (%J1) | "M", "T", "S", "H" |
| Standard date format | "%ce%yr年%m3%zdy日" |
| Standard time format | "%z24:%mi:%se" |
| Standard date+time format | "%z24:%mi:%se %ce%yr年%m3%zdy日" |
| Territory names (1-36,48,49) | "イギリス", "Master", "Compact", "イタリア", "スペイン", "フランス", "ドイツ", "ポルトガル", "エスペラント", "ギリシア", "スウェーデン", "フィンランド", "--unused--", "デンマーク", "ノルウェー", "アイスランド", "カナダ1", "カナダ2", "カナダ", "トルコ", "アラビア", "アイルランド", "香港", "ロシア", "ロシア2", "イスラエル", "メキシコ", "ラテンアメリカ", "オーストラリア", "オーストリア", "ベルギー", "日本", "中東", "オランダ", "スイス", "ウェールズ", "アメリカ", "ウェールズ2" |
| Writing direction | Horizontal, left-to-right, top-to-bottom |
| Character properties | 0-127, as per UK territory; 128-255 will have no properties. |
| Lower case table | 0-127, as per UK territory; 128-255 don't change |
| Upper case table | 0-127, as per UK territory; 128-255 don't change |
| Control table | 0-127, as per UK territory; 128-255 don't change |
| Plain table | 0-127, as per UK territory; 128-255 don't change |
| Value table | 0-127, as per UK territory; 128-255 don't change |
| Representation table | "0123456789ABCDEF" |
| Collate call | UCS order, case insensitive only in Basic Latin |
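| Symbols - decimal point | "." |
| Symbols - thousands separator | "," |
| Symbols - grouping | 3,0 |
| Symbols - international currency symbol | "JPY " |
| Symbols - currency symbol | "¥" |
| Symbols - monetary decimal point | "." |
| Symbols - monetary thousands separator | " " |
| Symbols - monetary grouping | 3,0 |
| Symbols - monetary positive sign | "" |
| Symbols - monetary negative sign | "-" |
| Symbols - international monetary fractional digits | 0 |
| Symbols - monetary fractional digits | 0 |
| Symbols - monetary formats | ¥50, -¥50 |
| Symbols - list separator | ";" |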
| Calendar - first working day | 2 (Monday/月曜日) |
| Calendar - last working day | 6 (Friday/金曜日) |
| Calendar - months in year | 12 |
| Calendar - days in month | As per UK territory |
| Calendar - max %AM/%PM length | 6 bytes |
| Calendar - max %WE length | 9 bytes |
| Calendar - max %W3 length | 3 bytes |
| Calendar - max %DY length | 2 bytes |
| Calendar - max %ST length | 3 bytes |
| Calendar - max %MO length | 5 bytes |
| Calendar - max %M3 length | 5 bytes |
| Calendar - max %TZ length | 3 bytes |
The Japan territory will provide three special time format strings related to the Japanese era system of year numbering. Information about eras comes from the territory's Messages file, so a new era can be added easily. The initial settings are listed above. You can't use Territory_ReadCalendarInformation to find the maximum length of these format strings, so you should assume the maximum lengths listed here.
| %JE | Japanese era name in full (max length 12 bytes) |
| %J1 | Abbreviation for Japanese era name (max length 2 bytes) |
| %JY | Year number within the era (max length 3 bytes) |
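The relationship between a Gregorian date and the values produced by %J1 and %JY can be sketched as follows. The era table is copied from the initial settings listed above, whereas the real territory reads it from its Messages file. The function name `era_for_date` is hypothetical, and the year numbering (the calendar year in which an era starts counts as year 1) is the usual convention, assumed here rather than specified.

```c
/* Era start dates and abbreviations copied from the settings table above;
 * the real territory reads them from its Messages file. */
struct era { int year, month, day; char abbrev; };

static const struct era eras[] = {
    { 1868,  7,  1, 'M' },   /* 明治 */
    { 1912,  7, 30, 'T' },   /* 大正 */
    { 1926, 12, 25, 'S' },   /* 昭和 */
    { 1989,  1,  8, 'H' },   /* 平成 */
};

/* Hypothetical helper: find the era containing the given date.
 * On success stores the %J1-style abbreviation and the %JY-style year
 * number and returns 1; returns 0 for dates before the first era. */
int era_for_date(int year, int month, int day, char *abbrev, int *era_year)
{
    long key = year * 10000L + month * 100L + day;
    int i;
    /* Search from the most recent era backwards. */
    for (i = (int)(sizeof eras / sizeof eras[0]) - 1; i >= 0; i--) {
        long start = eras[i].year * 10000L + eras[i].month * 100L
                     + eras[i].day;
        if (key >= start) {
            *abbrev = eras[i].abbrev;
            *era_year = year - eras[i].year + 1;
            return 1;
        }
    }
    return 0;
}
```

Because the search is driven entirely by the table, adding a new era amounts to appending one entry, which mirrors the intent of keeping era information in the Messages file.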
Note that many of the alphabet settings (such as the character property table) assume a single-byte, 8-bit alphabet. The values suggested here should cause most applications that use these calls (and the C <ctype.h> functions) not to interfere with any UTF-8 strings.
See Future enhancements for more notes on this.
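As an illustration of why these table settings leave UTF-8 strings intact, the sketch below builds a lower-case table of the suggested shape. It assumes that the UK territory's treatment of 0-127 matches the C locale's `tolower()`; the function name `build_lower_table` is hypothetical.

```c
#include <ctype.h>

/* Fill a 256-entry lower-case table in the style suggested above:
 * entries 0-127 fold ASCII upper case to lower case, and entries
 * 128-255 map to themselves so that the bytes of UTF-8 multibyte
 * sequences pass through unchanged.
 * ASSUMPTION: the UK territory's behaviour for 0-127 matches the
 * C locale's tolower(). */
void build_lower_table(unsigned char table[256])
{
    int c;
    for (c = 0; c < 256; c++)
        table[c] = (unsigned char)((c < 128) ? tolower(c) : c);
}
```

An application that lower-cases a string byte by byte through such a table changes only its ASCII letters, so any UTF-8 sequences in the string remain valid.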
All drivers will detect the system alphabet being UTF8, and will output UTF8 rather than their native alphabet. This will allow any keyboard layout to be used in conjunction with the Japanese IME. (Hence the Japanese keyboard will always output UTF-8.)
The Japanese keyboard has a number of special keys that are used to drive the IME. These will output special function-key-like buffer codes that the IME will pick up and act on. If the IME is not being used, they will do nothing.
| Key | Label | Code | Position |
|---|---|---|---|
| Windows | Windows logo | &C0 | A01a and A04a (outside the Alt keys) |
| Menu | Menu symbol | &C1 | A04b (to the left of right Ctrl) |
| Hankaku/Zenkaku (Half-width/Full-width) | 半角/全角 | &C2 | E01 (to the left of 1) |
| Kanji (Toggle IME) | 漢字 | &C3 | Alt+E01 |
| Eisuu (Roman) | 英数 | &C4 | C01 (normally Caps Lock) |
| Muhenkan (Non-convert) | 無変換 | &C5 | A02a (between left Alt and Space) |
| Henkan/Jikouhou (Convert/Next Candidate) | 変換 (次候補) | &C6 | A03a (between Space and Kana) |
| Kana (Hiragana) | ひらがな | &C7 | A03b (between Henkan and right Alt) |
| Zenkouho (All candidates) | 全候補 | &C8 | Alt+A03a |
| Kanji Bangou (Kanji number) | 漢字番号 | &C9 | Alt+C01 |
| Caps Lock | Caps Lock | | Shift+C01 |
| Shift-Henkan (Previous convert) | 前候補 | &D6 | Shift+A03a |
| Shift-Kana (Katakana) | カタカナ | &D7 | Shift+A03b |
| Kana/Romaji Lock | ローマ字 | | Alt+A03b |
All these keys generate different codes in conjunction with Shift and/or Ctrl, as the normal function keys do (+&10 for Shift, +&20 for Ctrl, +&30 for both). The only exception is Shift-Eisuu, which is Caps Lock (actually it will be Shift-Caps). Only the shifted forms with specific labels are listed above. These codes - &C0-&C9, &D0-&D9, &E0-&E9, &F0-&F9 - were previously unallocated in RISC OS.
This specification states that these key codes (apart from the ones for Windows/Menu) can be used for other purposes by other drivers, but it is recommended that they be used for special keys in the same physical position (ie the same low-level key number). These mappings from physical key to pseudo-function key will probably be retained in all drivers, even where the keys themselves are not physically present. The Wimp passes these codes to applications as &1C0-&1C9 ... &1F0-&1F9.
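The code arithmetic described above can be sketched as follows. The modifier offsets are hexadecimal (&10, &20, &30), consistent with the ranges &C0-&C9 to &F0-&F9. The function names are hypothetical, and the Shift-Eisuu exception (Caps Lock) is deliberately not modelled.

```c
/* Buffer code for one of the special keys (base code &C0-&C9) with the
 * given modifiers: +&10 for Shift, +&20 for Ctrl, +&30 for both.
 * The Shift-Eisuu (Caps Lock) exception is not modelled here. */
unsigned int special_key_code(unsigned int base, int shift, int ctrl)
{
    return base + (shift ? 0x10u : 0u) + (ctrl ? 0x20u : 0u);
}

/* The code the Wimp passes to applications for a buffer code &C0-&F9. */
unsigned int wimp_key_code(unsigned int buffer_code)
{
    return 0x100u + buffer_code;
}
```

For instance, Henkan is &C6, so Shift-Henkan is &D6 and reaches a Wimp application as &1D6.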
Kana/Romaji lock (Alt-Kana) is handled internally by the driver, in the same way as Caps Lock. No buffer code is generated. OS_Byte 202's bit 5 [PRM 1-884] is now assigned to Kana/Romaji:
| Bit | Value | Meaning |
|---|---|---|
| 5 | 0 | Kana Lock (Hiragana entry) |
| 5 | 1 | Romaji Lock (Latin entry, as UK) |
When Kana Lock is active, an alphabetic key will output Hiragana; Alt plus a key will output Katakana. When Romaji Lock is active, the alphabetic keys will output Latin characters; Alt plus a key will output Hiragana. The default state will be Romaji Lock. The Kana/Romaji state will be orthogonal to the Caps Lock state, although Caps Lock is not meaningful in Kana Lock mode.
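A sketch of interpreting bit 5 of the keyboard status byte is shown below. It operates on a status value that is assumed to have been read already (for example with OS_Byte 202); the macro and function names are hypothetical.

```c
/* Bit 5 of the keyboard status byte (as read with OS_Byte 202). */
#define KBSTATUS_ROMAJI_LOCK (1u << 5)

/* Returns 1 if the status byte indicates Romaji Lock, 0 for Kana Lock. */
int is_romaji_lock(unsigned int status)
{
    return (status & KBSTATUS_ROMAJI_LOCK) != 0;
}

/* Returns the status byte with Kana Lock selected (bit 5 cleared),
 * leaving all other bits, such as the Caps Lock state, untouched. */
unsigned int select_kana_lock(unsigned int status)
{
    return status & ~KBSTATUS_ROMAJI_LOCK;
}
```

Changing only bit 5 reflects the statement that the Kana/Romaji state is orthogonal to the Caps Lock state.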
Note that the IME will not care whether the keyboard driver outputs Katakana or Hiragana - its Eisuu/Katakana/Hiragana tristate overrides.
Kana Lock in RISC OS is more effective than in Windows - many of the Kana characters marked on a Japanese keyboard are in fact unavailable on PCs. Our use of Unicode keyboard driver output, as opposed to JIS X 0201 output, makes these available.
Kana Lock will be Hangul Lock in Korean. The meaning can be similarly adapted for other territories.
| Key | PS/2 code | Low-level code | Internal code | −ve INKEY code |
|---|---|---|---|---|
| Windows left | E0 1F | 68 | 125 | INKEY −126 |
| Windows right | E0 27 | 69 | 126 | INKEY −127 |
| Menu | E0 2F | 6A | 127 | INKEY −128 |
| No convert | 67 | 6B | 109 | INKEY −110 |
| Convert | 64 | 6C | 110 | INKEY −111 |
| Kana | 13 | 6D | 111 | INKEY −112 |
| Yen/Bar | 6A | 1D | 46 | INKEY −47 |
| \ _ | 51 | 6E | 95 | INKEY −96 |
Notes: The Yen/Bar key is to the left of Backspace (ie where the £ sign was on A-series keyboards). It hence resurrects the old 1D low-level code.
The \ _ key is to the left of the right Shift key (analogous to the \ | key on UK keyboards). It has its own PS/2 scan code, hence it is given its own RISC OS code, rather than borrowing the \ | code, as there is no reason why a keyboard couldn't have both of these keys.
The keyboard hardware driver maps the new codes to low-level codes, and the InternationalKeyboard module maps the new low-level codes to internal codes.
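The negative INKEY values in the table follow a simple pattern from the internal codes, sketched below. The function name `inkey_for_internal` is hypothetical, and the rule is inferred from the table entries rather than stated by this specification.

```c
/* Negative INKEY argument for an internal key code, following the
 * pattern in the table above (for example internal 125 -> INKEY -126). */
int inkey_for_internal(int internal_code)
{
    return -(internal_code + 1);
}
```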
Full character-property and sorting support for UTF-8 will require a new interface, which will be defined at some future point. It seems likely that the International module will provide some sort of interface to the Unicode Character Database, which individual territories will use, modifying any locale-specific settings, to provide the necessary facilities.
The Territory_Collate entry point can provide full functionality for UTF-8; however initially it will just collate characters 0-127 correctly. Remaining characters will just be sorted into UCS order.
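The initial behaviour can be approximated by a bytewise comparison, because UTF-8 preserves UCS ordering and never uses a byte below &80 inside a multibyte sequence. The sketch below treats "collating 0-127 correctly" as nothing more than ASCII case folding, which is a simplifying assumption; the function name `utf8_collate_sketch` is hypothetical and this is not the territory's actual implementation.

```c
#include <ctype.h>

/* Hypothetical comparison in the spirit of the initial collation:
 * ASCII letters compare case-insensitively; all other characters
 * compare in UCS order. Comparing bytes is sufficient because UTF-8
 * byte order matches UCS code order.
 * Returns <0, 0 or >0 in the manner of strcmp(). */
int utf8_collate_sketch(const char *a, const char *b)
{
    for (;;) {
        unsigned char ca = (unsigned char)*a++;
        unsigned char cb = (unsigned char)*b++;
        /* Fold case only for Basic Latin (bytes below 0x80). */
        if (ca < 0x80) ca = (unsigned char)tolower(ca);
        if (cb < 0x80) cb = (unsigned char)tolower(cb);
        if (ca != cb)
            return (int)ca - (int)cb;
        if (ca == '\0')
            return 0;
    }
}
```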
| FEP | Front End Processor - another name for IME(qv) |
| Kanji | The Japanese ideographic characters |
| Hiragana | The Japanese phonetic alphabet used for Japanese words |
| IANA | Internet Assigned Numbers Authority |
| IME | Input Method Engine |
| Kana | Katakana or Hiragana |
| Katakana | The Japanese phonetic alphabet used for foreign words, or for emphasis |
| Romaji | The Japanese name for the Latin alphabet |
| UCS | Universal Character Set - the character set defined in ISO 10646 and the Unicode Standard |
| UCS-2 | The encoding of the UCS(qv) into 16-bit words. What most people mean if they refer to "Unicode" in the context of an encoding. |
| UTF-8 | UCS Transformation Format 8 - the standard multibyte encoding of UCS data. |
| Issue | Date | Author | Description of change |
| 1 | 14 Sep 1998 | Kevin Bracey | Initial release. |
| 2 | 02 Jul 2003 | Tematic | Revised for public release. |