Not especially optimized, but supposed to be clear. This is here so I don't have to always re-figure-out how to do this when I need it.
Summarized short form, leaving out the nuances of forbidden characters:
| Byte range | AND with | Sequence length |
|---|---|---|
| 00-7F (0-127) | Nothing | 1 |
| 80-BF (128-191) | 3F (63) | Continuation |
| C0-DF (192-223) | 1F (31) | 2 |
| E0-EF (224-239) | 0F (15) | 3 |
| F0-F7 (240-247) | 07 (7) | 4 |
| F8-FB (248-251) | 03 (3) | 5 |
| FC-FD (252-253) | 01 (1) | 6 |
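The table can be read as a lookup on a single byte. Here is a minimal sketch of that lookup; the name `classifyByte` and the shape of the returned object are made up for illustration and are not used elsewhere on this page. Like the table, it ignores the forbidden-byte nuances.

```javascript
// Hypothetical helper: classify one byte (0-255) according to the table above.
// "payload" is the byte ANDed with the mask from the "AND with" column.
function classifyByte(byte) {
  if (byte <= 0x7F) return { kind: "single", length: 1, payload: byte };
  if (byte <= 0xBF) return { kind: "continuation", length: 0, payload: byte & 0x3F };
  if (byte <= 0xDF) return { kind: "lead", length: 2, payload: byte & 0x1F };
  if (byte <= 0xEF) return { kind: "lead", length: 3, payload: byte & 0x0F };
  if (byte <= 0xF7) return { kind: "lead", length: 4, payload: byte & 0x07 };
  if (byte <= 0xFB) return { kind: "lead", length: 5, payload: byte & 0x03 };
  if (byte <= 0xFD) return { kind: "lead", length: 6, payload: byte & 0x01 };
  return { kind: "invalid", length: 0, payload: 0 }; // FE and FF are not in the table
}

// 0xE2 starts a 3-byte sequence and contributes the bits 0010.
console.log(classifyByte(0xE2));
```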
list is an array of bytes (integers 0-255). Returns a string; malformed input comes out as U+FFFD.
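function utf8dec(list) {
  let out = "";
  let buf = 0;        // code point being assembled
  let expecting = 0;  // continuation bytes still needed
  for (const byte of list) {
    if (expecting > 0) {
      expecting--;
      if (byte >= 0x80 && byte <= 0xBF) {
        // Continuation byte: append its low 6 bits.
        const value = byte & 63;
        buf = (buf << 6) + value;
        if (expecting == 0) {
          // String.fromCodePoint throws above 0x10FFFF, which lead bytes F4-F7 can reach.
          out += buf <= 0x10FFFF ? String.fromCodePoint(buf) : "\uFFFD";
          buf = 0;
        }
      } else {
        // Broken sequence: emit U+FFFD. The offending byte itself is dropped.
        expecting = 0;
        out += "\uFFFD";
      }
    } else {
      if (byte <= 127) {
        out += String.fromCodePoint(byte);
      } else if (byte >= 0xC0 && byte <= 0xDF) {
        buf = byte & 31;
        expecting = 1;
      } else if (byte >= 0xE0 && byte <= 0xEF) {
        buf = byte & 15;
        expecting = 2;
      } else if (byte >= 0xF0 && byte <= 0xF7) {
        buf = byte & 7;
        expecting = 3;
      } else {
        // Stray continuation byte, or F8 and above.
        out += "\uFFFD";
      }
    }
  }
  // Input ended in the middle of a sequence.
  if (expecting > 0) out += "\uFFFD";
  return out;
}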
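One way to sanity-check a decoder like this is to compare it with the built-in `TextDecoder` on a few well-formed inputs. The sketch below assumes a runtime where `TextDecoder` is a global (browsers, recent Node.js); `samples` and `agreesWithBuiltin` are hypothetical names. Only well-formed sequences are used, because replacement behaviour for malformed input can legitimately differ between decoders.

```javascript
// Hypothetical test data: one well-formed sequence per length.
const samples = [
  [0x41],                    // U+0041, one byte
  [0xC3, 0xA9],              // U+00E9, two bytes
  [0xE2, 0x82, 0xAC],        // U+20AC, three bytes
  [0xF0, 0x9F, 0x98, 0x80],  // U+1F600, four bytes
];

// Returns true if decodeFn gives the same string as TextDecoder for every sample.
function agreesWithBuiltin(decodeFn) {
  const builtin = new TextDecoder("utf-8");
  return samples.every(
    (bytes) => decodeFn(bytes) === builtin.decode(new Uint8Array(bytes))
  );
}

// Usage, with the decoder from this page in scope:
//   console.log(agreesWithBuiltin(utf8dec));
```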
Given a code point, a natural number N between 0 and 0x10FFFF = 1,114,111 inclusive, find the sequence of bytes that makes up its UTF-8 encoding. In the algorithm below, the pipe character | is bitwise OR, the ampersand & is bitwise AND, and >> is the shift-right operator.
- If N < 0x80 (128), the encoding is the single byte N.
- If N < 0x800 (2048), the encoding is two bytes:
  - 0xC0 | ((N & 0x7C0) >> 6), or in decimal, 192 | ((N & 1984) >> 6).
  - 0x80 | (N & 0x3F), or in decimal, 128 | (N & 63).
- If N < 0x10000 (65536), the encoding is three bytes:
  - 0xE0 | ((N & 0xF000) >> 12), or in decimal, 224 | ((N & 61440) >> 12).
  - 0x80 | ((N & 0xFC0) >> 6), or in decimal, 128 | ((N & 4032) >> 6).
  - 0x80 | (N & 0x3F), or in decimal, 128 | (N & 63).
- Otherwise, the encoding is four bytes:
  - 0xF0 | ((N & 0x1C0000) >> 18), or in decimal, 240 | ((N & 1835008) >> 18).
  - 0x80 | ((N & 0x3F000) >> 12), or in decimal, 128 | ((N & 258048) >> 12).
  - 0x80 | ((N & 0xFC0) >> 6), or in decimal, 128 | ((N & 4032) >> 6).
  - 0x80 | (N & 0x3F), or in decimal, 128 | (N & 63).

If you want them, here are the constants given above, written out in binary, showing how they work:
Two-byte: 110x-xxxx 10yy-yyyy
    xxx xxyy yyyy    Bare code point
    111 1100 0000    0x7C0 = 1984
    000 0011 1111    0x03F = 63

Three-byte: 1110-xxxx 10yy-yyyy 10zz-zzzz
    xxxx yyyy yyzz zzzz    Bare code point
    1111 0000 0000 0000    0xF000 = 61440
    0000 1111 1100 0000    0x0FC0 = 4032
    0000 0000 0011 1111    0x003F = 63

Four-byte: 1111-0xxx 10yy-yyyy 10zz-zzzz 10ww-wwww
    x xxyy yyyy zzzz zzww wwww    Bare code point
    1 1100 0000 0000 0000 0000    0x1C0000 = 1835008
    0 0011 1111 0000 0000 0000    0x03F000 = 258048
    0 0000 0000 1111 1100 0000    0x000FC0 = 4032
    0 0000 0000 0000 0011 1111    0x00003F = 63
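The masks and shifts above can be put together into a small encoder. This is a sketch rather than a definitive implementation: the name `utf8enc` is made up here to mirror the decoder, and the range boundaries (0x80, 0x800, 0x10000) are the standard UTF-8 cut-offs for one, two, and three bytes. It does not reject surrogate code points (0xD800-0xDFFF), which strict UTF-8 forbids.

```javascript
// Hypothetical encoder: N is a code point, returns an array of bytes.
function utf8enc(N) {
  if (!Number.isInteger(N) || N < 0 || N > 0x10FFFF) {
    throw new RangeError("code point out of range: " + N);
  }
  if (N < 0x80) {
    return [N]; // one byte, unchanged
  }
  if (N < 0x800) {
    return [
      0xC0 | ((N & 0x7C0) >> 6),
      0x80 | (N & 0x3F),
    ];
  }
  if (N < 0x10000) {
    return [
      0xE0 | ((N & 0xF000) >> 12),
      0x80 | ((N & 0xFC0) >> 6),
      0x80 | (N & 0x3F),
    ];
  }
  return [
    0xF0 | ((N & 0x1C0000) >> 18),
    0x80 | ((N & 0x3F000) >> 12),
    0x80 | ((N & 0xFC0) >> 6),
    0x80 | (N & 0x3F),
  ];
}

// U+20AC (euro sign) encodes as E2 82 AC.
console.log(utf8enc(0x20AC).map((b) => b.toString(16)));
```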
2020-04-20, 2022-03-22