Not especially optimized, but supposed to be clear. This is here so I don't have to re-figure-out how to do this every time I need it.
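Summarized short form, leaving out the nuances of forbidden characters (the 5- and 6-byte rows come from the original UTF-8 design and are not valid in current UTF-8, which stops at 4 bytes):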
Byte range | AND with | Sequence length |
---|---|---|
00-7F (0-127) | Nothing | 1 |
80-BF (128-191) | 3F (63) | Continuation |
C0-DF (192-223) | 1F (31) | 2 |
E0-EF (224-239) | 0F (15) | 3 |
F0-F7 (240-247) | 07 (7) | 4 |
F8-FB (248-251) | 03 (3) | 5 |
FC-FD (252-253) | 01 (1) | 6 |
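Read as a lookup, the table maps a byte to the mask that extracts its payload bits and to the role it plays in a sequence. A minimal sketch of that lookup follows; `classifyByte` and the shape of its return value are made up for illustration, and it treats F8 and above as invalid, which is an assumption in line with current UTF-8 rather than with the last two table rows.

```javascript
// Hypothetical helper: classify one byte using the ranges in the table.
// Returns the payload mask and the total sequence length a lead byte announces.
function classifyByte(byte) {
  // 0x7F keeps all seven payload bits, so ANDing with it changes nothing.
  if (byte <= 0x7F) return { kind: "single", mask: 0x7F, length: 1 };
  if (byte <= 0xBF) return { kind: "continuation", mask: 0x3F, length: 0 };
  if (byte <= 0xDF) return { kind: "lead", mask: 0x1F, length: 2 };
  if (byte <= 0xEF) return { kind: "lead", mask: 0x0F, length: 3 };
  if (byte <= 0xF7) return { kind: "lead", mask: 0x07, length: 4 };
  // Assumption: F8-FD (old 5- and 6-byte leads) and FE/FF are rejected.
  return { kind: "invalid", mask: 0, length: 0 };
}

// E2 is the lead byte of the three-byte sequence E2 82 AC (U+20AC).
const info = classifyByte(0xE2);
console.log(info.length, (0xE2 & info.mask).toString(2)); // prints: 3 10
```
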
`list` is an array of bytes. Returns a string.
    function utf8dec(list) {
      let out = "";
      let buf = 0;        // code point being assembled
      let expecting = 0;  // continuation bytes still needed
      for (const byte of list) {
        if (expecting > 0) {
          expecting--;
          if (byte >= 0x80 && byte <= 0xBF) {
            // Continuation byte: append its low 6 bits.
            const value = byte & 63;
            buf = (buf << 6) + value;
            if (expecting == 0) {
              // String.fromCodePoint throws above 0x10FFFF (possible after F5-F7).
              out += buf <= 0x10FFFF ? String.fromCodePoint(buf) : "\uFFFD";
              buf = 0;
            }
          } else {
            // Sequence cut short: emit U+FFFD and start over.
            expecting = 0;
            out += "\uFFFD";
          }
        } else {
          if (byte <= 127) {
            out += String.fromCodePoint(byte);
          } else if (byte >= 0xC0 && byte <= 0xDF) {
            buf = byte & 31;
            expecting = 1;
          } else if (byte >= 0xE0 && byte <= 0xEF) {
            buf = byte & 15;
            expecting = 2;
          } else if (byte >= 0xF0 && byte <= 0xF7) {
            buf = byte & 7;
            expecting = 3;
          } else {
            // Stray continuation byte or an unsupported lead byte.
            out += "\uFFFD";
          }
        }
      }
      return out;
    }
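For a quick sanity check of a hand-written decoder like the one above, the built-in `TextDecoder` (available in browsers and Node.js) can serve as a reference on well-formed input. The sketch below only shows the built-in; the sample byte sequences are chosen for illustration. On malformed input the two may disagree, since the built-in also rejects things this page leaves out, such as overlong encodings and surrogates.

```javascript
// Reference decoder: TextDecoder is a standard built-in, no import needed.
const decoder = new TextDecoder("utf-8");

// E2 82 AC is the three-byte encoding of U+20AC (euro sign).
const euro = decoder.decode(new Uint8Array([0xE2, 0x82, 0xAC]));
console.log(euro === "\u20AC"); // true

// By default (fatal: false) a stray continuation byte becomes U+FFFD.
const stray = decoder.decode(new Uint8Array([0x41, 0x80, 0x42]));
console.log(stray === "A\uFFFDB"); // true
```
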
Given a code point, a natural number N between 0 and 0x10FFFF = 1,114,111 inclusive, find the sequence of bytes that makes up its UTF-8 encoding. In the algorithm below, the pipe character | is bitwise OR, the ampersand & is bitwise AND, and >> is the shift-right operator.
- If N < 0x80 (128), the encoding is the single byte N itself.
- If N < 0x800 (2048), the encoding is two bytes:
    - `0xC0 | ((N & 0x7C0) >> 6)`, or in decimal, `192 | ((N & 1984) >> 6)`.
    - `0x80 | (N & 0x3F)`, or in decimal, `128 | (N & 63)`.
- If N < 0x10000 (65536), the encoding is three bytes:
    - `0xE0 | ((N & 0xF000) >> 12)`, or in decimal, `224 | ((N & 61440) >> 12)`.
    - `0x80 | ((N & 0xFC0) >> 6)`, or in decimal, `128 | ((N & 4032) >> 6)`.
    - `0x80 | (N & 0x3F)`, or in decimal, `128 | (N & 63)`.
- Otherwise, the encoding is four bytes:
    - `0xF0 | ((N & 0x1C0000) >> 18)`, or in decimal, `240 | ((N & 1835008) >> 18)`.
    - `0x80 | ((N & 0x3F000) >> 12)`, or in decimal, `128 | ((N & 258048) >> 12)`.
    - `0x80 | ((N & 0xFC0) >> 6)`, or in decimal, `128 | ((N & 4032) >> 6)`.
    - `0x80 | (N & 0x3F)`, or in decimal, `128 | (N & 63)`.

If you want them, here are the constants given above, written out in binary, showing how they work:
    Two-byte: 110x-xxxx 10yy-yyyy
        xxx xxyy yyyy    Bare code point
        111 1100 0000    0x7C0 = 1984
        000 0011 1111    0x03F = 63

    Three-byte: 1110-xxxx 10yy-yyyy 10zz-zzzz
        xxxx yyyy yyzz zzzz    Bare code point
        1111 0000 0000 0000    0xF000 = 61440
        0000 1111 1100 0000    0x0FC0 = 4032
        0000 0000 0011 1111    0x003F = 63

    Four-byte: 1111-0xxx 10yy-yyyy 10zz-zzzz 10ww-wwww
        x xxyy yyyy zzzz zzww wwww    Bare code point
        1 1100 0000 0000 0000 0000    0x1C0000 = 1835008
        0 0011 1111 0000 0000 0000    0x03F000 = 258048
        0 0000 0000 1111 1100 0000    0x000FC0 = 4032
        0 0000 0000 0000 0011 1111    0x00003F = 63
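The masks and shifts listed above can be put together into a small encoder. This is a sketch, not a hardened implementation: `utf8enc` and `hex` are made-up names, the length is picked using the standard range boundaries 0x80, 0x800 and 0x10000, and nothing here rejects surrogates (0xD800 to 0xDFFF) or values above 0x10FFFF.

```javascript
// Sketch of an encoder built from the masks and shifts in the text.
// N is assumed to be an integer code point between 0 and 0x10FFFF.
function utf8enc(N) {
  if (N < 0x80) {
    // One byte: the code point itself.
    return [N];
  } else if (N < 0x800) {
    // Two bytes: 5 payload bits, then 6.
    return [0xC0 | ((N & 0x7C0) >> 6), 0x80 | (N & 0x3F)];
  } else if (N < 0x10000) {
    // Three bytes: 4 payload bits, then 6, then 6.
    return [
      0xE0 | ((N & 0xF000) >> 12),
      0x80 | ((N & 0xFC0) >> 6),
      0x80 | (N & 0x3F),
    ];
  } else {
    // Four bytes: 3 payload bits, then 6, 6, 6.
    return [
      0xF0 | ((N & 0x1C0000) >> 18),
      0x80 | ((N & 0x3F000) >> 12),
      0x80 | ((N & 0xFC0) >> 6),
      0x80 | (N & 0x3F),
    ];
  }
}

// Hypothetical formatting helper for readable output.
const hex = (bytes) => bytes.map((b) => b.toString(16).toUpperCase()).join(" ");

console.log(hex(utf8enc(0xE9)));    // C3 A9
console.log(hex(utf8enc(0x20AC)));  // E2 82 AC
console.log(hex(utf8enc(0x1F600))); // F0 9F 98 80
```

The result is a plain array of numbers; wrapping it in `Uint8Array.from(...)` gives something that byte-oriented APIs accept.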
2020-04-20, 2022-03-22