Manually decoding GB-2312 text from email headers or a mail log

Page created 2022-09-21.

Encoding-independent part

An encoded header field such as the subject is written as =?charset?encoding?encoded-data?= (an "encoded word", as defined in RFC 2047). The encoding is either B, for base64, or Q, for a variant of quoted-printable.

Example: =?ISO-8859-1?B?RXjkbXBs6CBz/GJqZWN0Cg==?= is base64-encoded ISO-8859-1-coded text. Decoding the base64 provides the byte sequence 45 78 e4 6d 70 6c e8 20 73 fc 62 6a 65 63 74 0a, and decoding those bytes as ISO-8859-1 gives the text "Exämplè sübject" followed by a newline.
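The splitting step can also be done by the shell itself. The following is a minimal sketch, assuming a POSIX shell plus the base64 and iconv tools used elsewhere on this page; the variable names (word, prefix, charset, encoding, data, suffix) are made up for illustration, and only the B encoding is handled.

```shell
# An encoded word, taken from the example above.
word='=?ISO-8859-1?B?RXjkbXBs6CBz/GJqZWN0Cg==?='

# Split on "?": the outer pieces are the "=" signs of "=?" and "?=",
# the three inner pieces are charset, encoding and encoded data.
IFS='?' read -r prefix charset encoding data suffix <<EOF
$word
EOF

echo "charset:  $charset"
echo "encoding: $encoding"
echo "data:     $data"

# B means base64: decode the data, then convert from the named charset.
if [ "$encoding" = "B" ] || [ "$encoding" = "b" ]; then
    printf '%s' "$data" | base64 -d | iconv -f "$charset" -t UTF-8
fi
```

On a UTF-8 terminal the last command should show the same text as the example. Q-encoded words would need a different decoding step, which this sketch leaves out.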

Second example: =?UTF-8?B?0JXRhdCw0LzRgNOPyZsg0ZXPhdCy0ZjOtdGB0YIK?=. With a terminal configured to use UTF-8 one can easily read this by passing the base64-encoded text to base64 -d:

$ echo "0JXRhdCw0LzRgNOPyZsg0ZXPhdCy0ZjOtdGB0YIK" | base64 -d ; echo
Ехамрӏɛ ѕυвјεст

echo "base64text" | base64 -d ; echo

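For ISO-8859-1-encoded text, use iconv. (Windows-1252 agrees with ISO-8859-1 on every printable character, so it works as the source encoding here.)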

echo "base64text" | base64 -d | iconv -f Windows-1252 -t UTF-8

If one wants to manually decode UTF-8 (with a table, perhaps), pipe the result of the base64 decoding through xxd.
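For the manual route, the table lookup boils down to some bit arithmetic. Here is a rough sketch for a two-byte sequence, using D0 95, the first two bytes that the second example decodes to; the variable names b1, b2 and cp are arbitrary, and longer sequences (three or four bytes) follow the same pattern with more continuation bytes.

```shell
# A two-byte UTF-8 sequence has the bit layout 110xxxxx 10yyyyyy.
b1=0xD0
b2=0x95

# Keep the low 5 bits of the first byte and the low 6 bits of the
# second byte, then join them into one code point.
cp=$(( (($b1 & 0x1F) << 6) | ($b2 & 0x3F) ))

printf 'U+%04X\n' "$cp"
```

This prints U+0415, the Cyrillic capital letter "Е" that starts the decoded subject.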

The GB-2312 part

GB2312 is a National Standard of the People's Republic of China (hence the "GB"). As a charset name it is an alias for the EUC-CN encoding. EUC, the Extended Unix Code, is a multibyte character encoding with different variants for various CJK languages, and EUC-CN is the variant that encodes Simplified Chinese.

One can use a very similar command line to decode Chinese, using iconv to decode the resulting binary data. (One can specify the encoding as either "GB2312" (no hyphen allowed) or "EUC-CN".)

Example input: =?gb2312?B?yr7A/db3zOLX1rbO?=

$ echo "yr7A/db3zOLX1rbO" | base64 -d | iconv -f GB2312 -t UTF-8 ; echo
示例主题字段

If one wants to manually try decoding this, use xxd for printing out the bytes:

$ echo "yr7A/db3zOLX1rbO" | base64 -d | xxd
00000000: cabe c0fd d6f7 cce2 d7d6 b6ce            ............

Wikipedia has pretty okay code charts for decoding.

Decoding is a bit of a hassle, since every character takes a couple of lookups. Examples:

CA BE: lead byte CA, decimal 202; subtract 160 to get code table row 42. Trailing byte BE, decimal 190; subtract 160 from this, now we have 30: row 42, column 30, "示". Columns start at 1, so this is the 30th (not 31st) character in the row.
C0 FD: lead byte C0, decimal 192; subtract 160 to get code table row 32. Trailing byte FD, decimal 253, subtract 160, get 93: 93rd character of row 32, or the second-to-last (there are 94 characters per row): "例".
And so on.
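The subtraction in these lookups is easy to script, which leaves only the chart lookup itself to do by hand. A small sketch, assuming a POSIX shell; gb_pos is a made-up helper name, and it takes the two bytes as hex digits the way xxd prints them.

```shell
# Print the code table row and column for one two-byte GB2312 character.
# $1 = lead byte, $2 = trailing byte, each as two hex digits.
gb_pos() {
    echo "row $(( 0x$1 - 0xA0 )), column $(( 0x$2 - 0xA0 ))"
}

# The six byte pairs from the xxd output above.
for pair in "CA BE" "C0 FD" "D6 F7" "CC E2" "D7 D6" "B6 CE"; do
    # $pair is left unquoted on purpose so it splits into two arguments.
    echo "$pair: $(gb_pos $pair)"
done
```

The first two lines of output are "CA BE: row 42, column 30" and "C0 FD: row 32, column 93", matching the examples worked out by hand.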

echo "base64text" | base64 -d | iconv -f GB2312 -t UTF-8 ; echo