Not especially optimized, but meant to be clear. This is here so I don't have to figure out how to do this again every time I need it.

In prose

Summarized short form, leaving out the nuances of forbidden characters:

Byte range        AND with   Sequence length
00-7F (0-127)     Nothing    1
80-BF (128-191)   3F (63)    Continuation
C0-DF (192-223)   1F (31)    2
E0-EF (224-239)   0F (15)    3
F0-F7 (240-247)   07 (7)     4
F8-FB (248-251)   03 (3)     5
FC-FD (252-253)   01 (1)     6
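
The table can be turned into a small lookup, sketched here. The names UTF8_ROWS and classifyByte are made up for illustration, and using length 0 to mark a continuation byte is just a convention of this sketch. The 5- and 6-byte rows come from the table above; current UTF-8 (RFC 3629) stops at four bytes.

```javascript
// Each row: [low, high, mask, length]. Length 0 marks a continuation byte.
// For 00-7F the table says "Nothing"; 0x7F leaves those bytes unchanged.
const UTF8_ROWS = [
  [0x00, 0x7F, 0x7F, 1],
  [0x80, 0xBF, 0x3F, 0],
  [0xC0, 0xDF, 0x1F, 2],
  [0xE0, 0xEF, 0x0F, 3],
  [0xF0, 0xF7, 0x07, 4],
  [0xF8, 0xFB, 0x03, 5],
  [0xFC, 0xFD, 0x01, 6],
];

// Hypothetical helper: look up one byte and return the sequence length
// it announces plus the payload bits left after masking.
function classifyByte(byte) {
  for (const [low, high, mask, length] of UTF8_ROWS) {
    if (byte >= low && byte <= high) {
      return { length, payload: byte & mask };
    }
  }
  return null; // 0xFE and 0xFF are not covered by the table
}
```

For example, classifyByte(0xE2) reports a three-byte sequence with payload 2, and classifyByte(0x82) reports a continuation byte with payload 2.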

JavaScript

list is an array of bytes. Returns a string.

function utf8dec(list) {
  let out = "";
  let buf = 0;
  let expecting = 0;
  for (const byte of list) {
    if (expecting > 0) {
      expecting--;
      if (byte >= 0x80 && byte <= 0xBF) {
        const value = byte & 63;
        buf = (buf << 6) + value;
        if (expecting == 0) {
          out += String.fromCodePoint(buf);
          buf = 0;
        }
      } else {
        expecting = 0;
        out += "\uFFFD";
      }
    } else {
      if (byte <= 127) {
        out += String.fromCodePoint(byte);
      } else if (byte >= 0xC0 && byte <= 0xDF) {
        buf = byte & 31;
        expecting = 1;
      } else if (byte >= 0xE0 && byte <= 0xEF) {
        buf = byte & 15;
        expecting = 2;
      } else if (byte >= 0xF0 && byte <= 0xF7) {
        buf = byte & 7;
        expecting = 3;
      } else {
        out += "\uFFFD";
      }
    }
  }
  return out;
}
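
To sanity-check a hand-written decoder, its output can be compared against the built-in TextDecoder, which is available in browsers and in recent Node.js. This is only a sketch, and referenceDecode is a made-up name. The two should agree on well-formed input; on malformed input the built-in may place U+FFFD differently, and it also drops a leading byte order mark by default.

```javascript
// Reference decoder built on the standard TextDecoder API.
// `list` is an array of bytes, as in the hand-written version.
function referenceDecode(list) {
  return new TextDecoder("utf-8").decode(Uint8Array.from(list));
}

console.log(referenceDecode([0x68, 0x69]));       // hi
console.log(referenceDecode([0xE2, 0x82, 0xAC])); // the euro sign, U+20AC
```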

Reverse operation

Given a code point, a natural number N between 0 and 0x10FFFF = 1,114,111 inclusive, find the sequence of bytes that makes up its UTF-8 encoding. Throughout, the pipe character | is bitwise OR, the ampersand & is bitwise AND, and >> is the shift-right operator.
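
One way to write this in JavaScript, as a sketch rather than a definitive version. The name utf8enc is an assumption, chosen to mirror utf8dec. Like the short form at the top, it leaves out the nuances of forbidden characters, so surrogates (0xD800 to 0xDFFF) are encoded without complaint.

```javascript
// n is a code point. Returns an array of bytes.
function utf8enc(n) {
  if (!Number.isInteger(n) || n < 0 || n > 0x10FFFF) {
    throw new RangeError("code point out of range: " + n);
  }
  if (n <= 0x7F) {
    // One byte: the code point itself.
    return [n];
  }
  if (n <= 0x7FF) {
    // Two bytes: 5 bits, then 6 bits.
    return [
      0xC0 | ((n & 0x7C0) >> 6),
      0x80 | (n & 0x3F),
    ];
  }
  if (n <= 0xFFFF) {
    // Three bytes: 4 bits, then 6, then 6.
    return [
      0xE0 | ((n & 0xF000) >> 12),
      0x80 | ((n & 0x0FC0) >> 6),
      0x80 | (n & 0x3F),
    ];
  }
  // Four bytes: 3 bits, then 6, 6 and 6.
  return [
    0xF0 | ((n & 0x1C0000) >> 18),
    0x80 | ((n & 0x03F000) >> 12),
    0x80 | ((n & 0x000FC0) >> 6),
    0x80 | (n & 0x3F),
  ];
}
```

For example, utf8enc(0x20AC) gives [0xE2, 0x82, 0xAC].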

Here are the masks used when encoding, written out in binary to show how they work:

Two-byte: 110x-xxxx 10yy-yyyy
xxx xxyy yyyy   Bare code point
111 1100 0000   0x7C0 = 1984
000 0011 1111   0x03F = 63

Three-byte: 1110-xxxx 10yy-yyyy 10zz-zzzz
xxxx yyyy yyzz zzzz   Bare code point
1111 0000 0000 0000   0xF000 = 61440
0000 1111 1100 0000   0x0FC0 = 4032
0000 0000 0011 1111   0x003F = 63

Four-byte: 1111-0xxx 10yy-yyyy 10zz-zzzz 10ww-wwww
x xxyy yyyy zzzz zzww wwww  Bare code point
1 1100 0000 0000 0000 0000  0x1C0000 = 1835008
0 0011 1111 0000 0000 0000  0x03F000 = 258048
0 0000 0000 1111 1100 0000  0x000FC0 = 4032
0 0000 0000 0000 0011 1111  0x00003F = 63
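
As a worked example of the three-byte masks, here is U+20AC (the euro sign) split into its x, y and z fields. The helper name threeByteFields is made up for this sketch, and it assumes the input is in the three-byte range 0x800 to 0xFFFF.

```javascript
// Split a three-byte-range code point into its fields using the masks
// 0xF000, 0x0FC0 and 0x003F, then attach the 1110, 10 and 10 prefixes.
function threeByteFields(n) {
  const x = (n & 0xF000) >> 12; // top 4 bits
  const y = (n & 0x0FC0) >> 6;  // middle 6 bits
  const z = n & 0x003F;         // low 6 bits
  return {
    // Fields as binary strings, padded to 4, 6 and 6 digits.
    fields: [x, y, z].map((v, i) => v.toString(2).padStart(i === 0 ? 4 : 6, "0")),
    bytes: [0xE0 | x, 0x80 | y, 0x80 | z],
  };
}

const euro = threeByteFields(0x20AC);
console.log(euro.fields.join(" "));                           // 0010 000010 101100
console.log(euro.bytes.map((b) => b.toString(16)).join(" ")); // e2 82 ac
```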

2020-04-20, 2022-03-22