UTF-8 decoding code

Not especially optimized, but meant to be clear. This is here so I don't have to re-figure-out how to do this every time I need it.

In prose

Decode a stream of bytes as follows. All ranges are inclusive.
If a byte is between 0x00 and 0x7F (0 and 127), output the byte as an ASCII character.
If the byte is between 0x80 and 0xBF (128 and 191), you're looking at a continuation byte. AND it with 0x3F (63) for the value it carries. Shift whatever you had previously six bits left and add the value. If you weren't expecting a continuation byte you're in an error condition: perhaps just output the replacement character U+FFFD �.
If the byte is between 0xC0 and 0xDF (192 and 223) you're looking at the first byte of a two-byte sequence. AND it with 0x1F (31) for the value it carries.
If the byte is between 0xE0 and 0xEF (224 and 239) you're looking at the first byte of a three-byte sequence. AND it with 0x0F (15) for the value it carries.
If the byte is between 0xF0 and 0xF7 (240 and 247) you're looking at the first byte of a four-byte sequence. AND it with 0x07 (7) for the value it carries.
Properly-formed UTF-8 only supports up to 4-byte characters (because Unicode code points only go up to 0x10FFFF), but in principle the system can be extended to 5- and 6-byte sequences: 5-byte sequences start with 0xF8, 0xF9, 0xFA or 0xFB (248 to 251) and are ANDed with 3, and 6-byte sequences start with 0xFC or 0xFD (252 or 253) and are ANDed with 1.
0xFE and 0xFF (254, 255) have no business existing in a UTF-8 bytestream. The bytes 0xF8 to 0xFD (248 to 253) are also disallowed, because they start 5- and 6-byte sequences, which are forbidden. Also forbidden are the bytes 0xC0 and 0xC1 (192 and 193): they nominally start a two-byte sequence, but any such sequence decodes to a character at or below U+007F, which is encodable in one byte. That makes it an "overlong" encoding, and overlong encodings are forbidden.
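The forbidden-byte and overlong rules can be sketched as two small checks. This is only an illustration: the names isForbiddenByte, MIN_FOR_LENGTH and isOverlong are made up here. It also rejects 0xF5 to 0xF7, which the list does not single out; those bytes can only begin four-byte sequences that decode above U+10FFFF.

```javascript
// Sketch only: isForbiddenByte, MIN_FOR_LENGTH and isOverlong are hypothetical names.

// True for bytes that can never appear in well-formed UTF-8.
function isForbiddenByte(byte) {
  return byte === 0xC0 || byte === 0xC1 || // can only start overlong two-byte sequences
         byte >= 0xF5;                     // 0xF5-0xF7 (above U+10FFFF), 0xF8-0xFD, 0xFE, 0xFF
}

// Smallest code point that genuinely needs a sequence of each length.
const MIN_FOR_LENGTH = { 1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000 };

// True if codePoint was decoded from more bytes than it needs.
function isOverlong(codePoint, length) {
  return codePoint < MIN_FOR_LENGTH[length];
}
```

For instance, the bytes 0xE0 0x80 0xAF decode to 0x2F under the rules above, and isOverlong(0x2F, 3) flags that as overlong even though 0xE0 itself is a legal lead byte.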

Summarized short form, leaving out the nuances of forbidden bytes:

Byte range	AND with	Sequence length
00-7F (0-127)	Nothing	1
80-BF (128-191)	3F (63)	Continuation
C0-DF (192-223)	1F (31)	2
E0-EF (224-239)	0F (15)	3
F0-F7 (240-247)	07 (7)	4
F8-FB (248-251)	03 (3)	5
FC-FD (252-253)	01 (1)	6
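The table translates fairly directly into a lookup. The sketch below is one possible way to write it; ROWS and classifyByte are hypothetical names, and like the table it ignores the forbidden-byte nuances.

```javascript
// Sketch only: ROWS and classifyByte are hypothetical names.
// Each row mirrors one line of the table: byte range, AND mask, sequence length.
const ROWS = [
  { lo: 0x00, hi: 0x7F, mask: 0x7F, kind: "single",       length: 1 }, // 0x7F leaves the byte unchanged
  { lo: 0x80, hi: 0xBF, mask: 0x3F, kind: "continuation", length: null },
  { lo: 0xC0, hi: 0xDF, mask: 0x1F, kind: "lead",         length: 2 },
  { lo: 0xE0, hi: 0xEF, mask: 0x0F, kind: "lead",         length: 3 },
  { lo: 0xF0, hi: 0xF7, mask: 0x07, kind: "lead",         length: 4 },
  { lo: 0xF8, hi: 0xFB, mask: 0x03, kind: "lead",         length: 5 },
  { lo: 0xFC, hi: 0xFD, mask: 0x01, kind: "lead",         length: 6 },
];

// Returns the kind of byte, the sequence length it announces (if any),
// and the payload bits it carries; null for 0xFE and 0xFF.
function classifyByte(byte) {
  for (const row of ROWS) {
    if (byte >= row.lo && byte <= row.hi) {
      return { kind: row.kind, length: row.length, value: byte & row.mask };
    }
  }
  return null;
}
```

For example, classifyByte(0xE2) reports a lead byte of a three-byte sequence carrying the value 2, and classifyByte(0x82) reports a continuation byte carrying the value 2.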

JavaScript

list is an array of bytes. Returns a string.

function utf8dec(list) {
  let out = "";
  let buf = 0;
  let expecting = 0;
  for (const byte of list) {
    if (expecting > 0) {
      expecting--;
      if (byte >= 0x80 && byte <= 0xBF) {
        const value = byte & 63;
        buf = (buf << 6) + value;
        if (expecting == 0) {
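          // String.fromCodePoint throws a RangeError above 0x10FFFF (e.g. F7 BF BF BF)
          out += buf <= 0x10FFFF ? String.fromCodePoint(buf) : "\uFFFD";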
          buf = 0;
        }
      } else {
        expecting = 0;
        out += "\uFFFD";
      }
    } else {
      if (byte <= 127) {
        out += String.fromCodePoint(byte);
      } else if (byte >= 0xC0 && byte <= 0xDF) {
        buf = byte & 31;
        expecting = 1;
      } else if (byte >= 0xE0 && byte <= 0xEF) {
        buf = byte & 15;
        expecting = 2;
      } else if (byte >= 0xF0 && byte <= 0xF7) {
        buf = byte & 7;
        expecting = 3;
      } else {
        out += "\uFFFD";
      }
    }
  }
  return out;
}
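One way to sanity-check a hand-written decoder is to compare it against the platform's built-in TextDecoder, which exists in modern browsers and Node.js. A minimal sketch follows; decodeWithBuiltin and isWellFormedUtf8 are hypothetical helper names. The built-in decoder's policy for placing replacement characters in malformed input may differ from the function above, so the comparison is most useful on well-formed input.

```javascript
// Sketch only: decodeWithBuiltin and isWellFormedUtf8 are hypothetical names.

// Decode an array of bytes with the built-in decoder (malformed input becomes U+FFFD).
function decodeWithBuiltin(list) {
  return new TextDecoder("utf-8").decode(Uint8Array.from(list));
}

// With { fatal: true } the built-in decoder throws a TypeError on malformed input,
// which gives a simple well-formedness test.
function isWellFormedUtf8(list) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(list));
    return true;
  } catch (e) {
    return false;
  }
}

// "hi", a space, U+20AC (three bytes), a space, U+1F600 (four bytes)
const sample = [0x68, 0x69, 0x20, 0xE2, 0x82, 0xAC, 0x20, 0xF0, 0x9F, 0x98, 0x80];
console.log(decodeWithBuiltin(sample));
console.log(isWellFormedUtf8([0xC0, 0x80])); // false: overlong encoding of U+0000
```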

Reverse operation

Given a code point, a natural number N between 0 and 0x10FFFF = 1,114,111 inclusive, find the sequence of bytes that makes up its UTF-8 encoding. In the algorithm below, the pipe character | is bitwise OR, the ampersand & is bitwise AND, and >> is the shift-right operator.

If N is between 0 and 0x7F = 127, inclusive, N is the output byte itself.
If N is between 0x80 and 0x7FF, or 128 and 2047 inclusive, we have a two-byte sequence.
Calculate the first byte: it's 0xC0 | ((N & 0x7C0) >> 6), or in decimal, 192 | ((N & 1984) >> 6).
Calculate the second byte: it's 0x80 | (N & 0x3F), or in decimal, 128 | (N & 63).
If N is between 0x800 and 0xFFFF, or 2048 and 65535 inclusive, we have a three-byte sequence.
Calculate the first byte: it's 0xE0 | ((N & 0xF000) >> 12), or in decimal, 224 | ((N & 61440) >> 12).
Calculate the second byte: it's 0x80 | ((N & 0xFC0) >> 6), or in decimal, 128 | ((N & 4032) >> 6).
Calculate the third byte: it's 0x80 | (N & 0x3F), or in decimal, 128 | (N & 63).
If N is between 0x10000 and 0x10FFFF, or 65536 and 1114111 inclusive, we have a four-byte sequence.
Calculate the first byte: it's 0xF0 | ((N & 0x1C0000) >> 18), or in decimal, 240 | ((N & 1835008) >> 18).
Calculate the second byte: it's 0x80 | ((N & 0x3F000) >> 12), or in decimal, 128 | ((N & 258048) >> 12).
Calculate the third byte: it's 0x80 | ((N & 0xFC0) >> 6), or in decimal, 128 | ((N & 4032) >> 6).
Calculate the fourth byte: it's 0x80 | (N & 0x3F), or in decimal, 128 | (N & 63).
There are no code points above 0x10FFFF (1114111), so if your number wasn't handled by the cases above, you have an error.
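The steps above can be sketched in JavaScript roughly as follows. The function name utf8enc is made up here to mirror the decoder, and the masks are the ones from the text. This sketch does not reject surrogates (U+D800 to U+DFFF), which the steps above do not address either.

```javascript
// Sketch only: utf8enc is a hypothetical name.
// N is a code point. Returns an array of bytes.
function utf8enc(N) {
  if (!Number.isInteger(N) || N < 0 || N > 0x10FFFF) {
    throw new RangeError("not a code point: " + N);
  }
  if (N <= 0x7F) {
    return [N]; // one byte: N itself
  }
  if (N <= 0x7FF) {
    return [
      0xC0 | ((N & 0x7C0) >> 6),
      0x80 | (N & 0x3F),
    ];
  }
  if (N <= 0xFFFF) {
    return [
      0xE0 | ((N & 0xF000) >> 12),
      0x80 | ((N & 0xFC0) >> 6),
      0x80 | (N & 0x3F),
    ];
  }
  return [
    0xF0 | ((N & 0x1C0000) >> 18),
    0x80 | ((N & 0x3F000) >> 12),
    0x80 | ((N & 0xFC0) >> 6),
    0x80 | (N & 0x3F),
  ];
}
```

For example, utf8enc(0x20AC) gives the three bytes 0xE2, 0x82, 0xAC, and utf8enc(0x1F600) gives 0xF0, 0x9F, 0x98, 0x80.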

If you want them, here are the constants given above, written out in binary, showing how they work:

Two-byte: 110x-xxxx 10yy-yyyy
xxx xxyy yyyy   Bare code point
111 1100 0000   0x7C0 = 1984
000 0011 1111   0x03F = 63

Three-byte: 1110-xxxx 10yy-yyyy 10zz-zzzz
xxxx yyyy yyzz zzzz   Bare code point
1111 0000 0000 0000   0xF000 = 61440
0000 1111 1100 0000   0x0FC0 = 4032
0000 0000 0011 1111   0x003F = 63

Four-byte: 1111-0xxx 10yy-yyyy 10zz-zzzz 10ww-wwww
x xxyy yyyy zzzz zzww wwww  Bare code point
1 1100 0000 0000 0000 0000  0x1C0000 = 1835008
0 0011 1111 0000 0000 0000  0x03F000 = 258048
0 0000 0000 1111 1100 0000  0x000FC0 = 4032
0 0000 0000 0000 0011 1111  0x00003F = 63
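To see the three-byte masks at work on a concrete value, here is a small worked example for U+20AC (the euro sign). It is only a sketch; the helper bin and the variable names are made up for the illustration.

```javascript
// Sketch only: bin and the variable names are hypothetical.

// Pad a number to `width` binary digits.
function bin(n, width) {
  return n.toString(2).padStart(width, "0");
}

const N = 0x20AC;            // U+20AC
const x = (N & 0xF000) >> 12; // top 4 bits
const y = (N & 0x0FC0) >> 6;  // middle 6 bits
const z = N & 0x003F;         // bottom 6 bits

console.log(bin(N, 16));                      // 0010000010101100
console.log(bin(x, 4), bin(y, 6), bin(z, 6)); // 0010 000010 101100

// Attach the 1110, 10 and 10 prefixes to get the encoded bytes.
const bytes = [0xE0 | x, 0x80 | y, 0x80 | z];
console.log(bytes.map((b) => b.toString(16)).join(" ")); // e2 82 ac
```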

2020-04-20, 2022-03-22