Why Does Android Shift-JIS Encoding Of Yen (U+00A5) Symbol Produce -4,-4?

July 30, 2023

Running the following code seems to generate the wrong values: byte[] data = "\u00a5".getBytes("Shift_JIS"); It produces [ -4, -4 ], but I expect [ 0x5c ]. I've tried various alter

Solution 1:

A partial answer: back when Microsoft created its East Asian code pages for Windows, such as the Japanese code page 932 and the Korean code page 949, they made the byte 0x5C render as a currency symbol (a yen sign or a won sign respectively) while it still acted syntactically as a backslash in file paths. A file path on a Japanese system might therefore look like this:

C:¥Documents¥something.doc

Thus the byte was in one sense a yen sign and in another sense a backslash. On a Japanese system, the same byte could even be rendered as either symbol depending on the font, according to http://archives.miloush.net/michkap/archive/2005/09/17/469941.html.

Because the byte has no consistent meaning within the encoding, a Shift-JIS encoder can sensibly map both \ and ¥ to the byte 0x5C, but a decoder that maps a Shift-JIS-encoded string to a sequence of Unicode code points has no way of knowing whether the byte 0x5C should become a backslash or a yen sign. Japanese users used to make that choice via their font selection (if they were able to make it at all).

In the face of this unfixable ambiguity, decoders seem to settle on decoding 0x5C to a backslash. (At least, Python does this, and the WHATWG Encoding Standard dictates it.)
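The many-to-one mapping described above can be sketched by hand. This is not the code of any real Shift-JIS codec: it covers only the single-byte range below 0x80, and the names Byte5CSketch, encode, and decode are invented for illustration.

```java
final class Byte5CSketch {
    static final char BACKSLASH = '\\';    // U+005C
    static final char YEN_SIGN = '\u00a5'; // U+00A5

    /** Encodes one character to a single byte value, or returns -1 if it is outside this sketch. */
    static int encode(char c) {
        if (c == YEN_SIGN) {
            return 0x5C; // the yen sign shares the byte of the backslash
        }
        if (c < 0x80) {
            return c;    // ASCII range, which includes the backslash at 0x5C
        }
        return -1;       // everything else is out of scope here
    }

    /** Decodes a byte below 0x80 to the code point of the same value. */
    static char decode(int b) {
        if (b < 0 || b >= 0x80) {
            throw new IllegalArgumentException("only bytes below 0x80 are handled");
        }
        return (char) b; // 0x5C always comes back as a backslash, never as a yen sign
    }

    public static void main(String[] args) {
        int encoded = encode(YEN_SIGN);
        char roundTripped = decode(encoded);
        // The yen sign does not survive the round trip.
        System.out.println(roundTripped == BACKSLASH); // prints "true"
    }
}
```

The round trip is lossy by construction: two distinct characters go in, but only one can come back out of the byte 0x5C.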

As for the details of what Java/Android in particular are doing when asked to encode a Yen sign in shift_jis, I'm afraid I don't know.
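Whatever the cause, it helps to remember when reading the output that Java bytes are signed, so [ -4, -4 ] is the byte pair 0xFC 0xFC rather than a one-byte result. If the goal is simply to get 0x5C, one possible workaround is to replace the yen sign with a backslash before encoding. The sketch below assumes that the platform's Shift_JIS encoder maps the ASCII backslash (U+005C) to 0x5C, which should be verified on the target device. YenWorkaround, encodeYenAsBackslash, and toHex are names made up for this example.

```java
import java.nio.charset.Charset;

final class YenWorkaround {
    /** Replaces every yen sign (U+00A5) with a backslash (U+005C), then encodes. */
    static byte[] encodeYenAsBackslash(String text, Charset charset) {
        return text.replace('\u00a5', '\\').getBytes(charset);
    }

    /** Formats bytes as unsigned two-digit hex, so that -4 is shown as FC. */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The bytes reported in the question, viewed as unsigned values.
        System.out.println(toHex(new byte[] { -4, -4 })); // prints "FC FC"

        // Shift_JIS is not guaranteed to exist on every runtime, so check first.
        if (Charset.isSupported("Shift_JIS")) {
            byte[] data = encodeYenAsBackslash("\u00a5", Charset.forName("Shift_JIS"));
            System.out.println(toHex(data));
        }
    }
}
```

Note that this throws away the distinction between the two characters before encoding, which is the same loss a decoder would introduce later anyway.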

Android Tech Blog
