Online UTF-8 Encode and Decode

Tool Introduction

Online UTF-8 encoding and decoding tool

Introduction to UTF-8

UTF-8 is a variable-length character encoding for Unicode (Unicode is also known as the universal code).

UTF-8 encodes each Unicode character in 1 to 4 bytes. The original specification allowed sequences of up to 6 bytes.

UTF-8 encoding rules

If a character is encoded in a single byte, the highest bit of that byte is 0.

If a character is encoded in multiple bytes, the number of consecutive 1 bits at the start of the first byte, counted from the highest bit, gives the total number of bytes in the sequence. Those 1 bits are followed by a 0 bit, and every remaining byte starts with 10.
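The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `utf8_sequence_length` is made up for this example, and it follows the original 1 to 6 byte scheme as described here, whereas a decoder that follows RFC 3629 would also reject lengths 5 and 6.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by the first byte.

    Hypothetical helper for illustration; follows the original
    1 to 6 byte scheme rather than the stricter RFC 3629 limits.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("expected a single byte value (0 to 255)")
    if first_byte & 0x80 == 0:
        # Highest bit is 0: a single-byte character.
        return 1
    # Count the consecutive 1 bits, starting from the highest bit.
    ones = 0
    mask = 0x80
    while mask and first_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a first byte")
    if ones > 6:
        raise ValueError("0xFE and 0xFF never appear in UTF-8")
    return ones
```

For example, `utf8_sequence_length("中".encode("utf-8")[0])` returns 3, because the first byte of that character is 0xE4 (1110 0100).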

The UTF-8 conversion table is as follows (each X is a bit taken from the character's code value):

| Unicode/UCS-4 (hex) | Number of bits | UTF-8 byte pattern | Number of bytes | Remark |
|---|---|---|---|---|
| 0000~007F | 0~7 | 0XXXXXXX | 1 | |
| 0080~07FF | 8~11 | 110XXXXX 10XXXXXX | 2 | |
| 0800~FFFF | 12~16 | 1110XXXX 10XXXXXX 10XXXXXX | 3 | Basic definition range: 0~FFFF |
| 10000~1FFFFF | 17~21 | 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX | 4 | Unicode 6.1 definition range: 0~10FFFF |
| 200000~3FFFFFF | 22~26 | 111110XX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX | 5 | Not Unicode; this range belongs to UCS-4 only (see the description below) |
| 4000000~7FFFFFFF | 27~31 | 1111110X 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX | 6 | Not Unicode; this range belongs to UCS-4 only (see the description below) |

Description: The 5-byte and 6-byte ranges are not Unicode; they belong to UCS-4 encoding. The early UTF-8 specification allowed sequences of up to 6 bytes, which can cover up to 31 bits (the original limit of the Universal Character Set). In November 2003, however, UTF-8 was re-specified by RFC 3629 to use only the range originally defined by Unicode, U+0000 to U+10FFFF. Under that specification, the first bytes of the 5-byte and 6-byte forms never appear in a legal UTF-8 sequence.
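As a rough sketch of how the first four rows of the table translate into mask and shift operations, the following Python function encodes one code point by hand. The name `encode_code_point` is hypothetical, the function is limited to the RFC 3629 range (U+0000 to U+10FFFF), and the rejection of surrogates is an assumption added here to match Python's built-in encoder rather than something the table states.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point as UTF-8 by following the table rows.

    Illustrative helper, not a replacement for str.encode("utf-8").
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+0000 to U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption: reject surrogates, as Python's own encoder does.
        raise ValueError("surrogate code points are not encodable")
    if cp <= 0x7F:
        # 0XXXXXXX
        return bytes([cp])
    if cp <= 0x7FF:
        # 110XXXXX 10XXXXXX
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110XXXX 10XXXXXX 10XXXXXX
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, `encode_code_point(0x4E2D).hex()` gives `'e4b8ad'`, which matches `"中".encode("utf-8")`.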

UTF-8 advantages

UTF-8 can be read and written quickly using bit masks and shift operations.

Comparing UTF-8 strings byte by byte with strcmp() gives the same ordering as comparing the corresponding wide-character strings with wcscmp() (assuming wchar_t holds whole code values), which makes sorting easier.

The bytes FF and FE never occur in UTF-8, so they can be used to identify UTF-16 or UTF-32 text (see BOM).

UTF-8 is byte-order independent. Its byte sequence is the same on all systems, so it does not really need a BOM.
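Two of these properties can be checked with a short Python sketch. The helper names `same_order` and `forbidden_bytes_seen` are invented for this example, and Python's built-in string comparison (which orders by code point) stands in for wcscmp().

```python
def same_order(strings):
    """True if sorting by code point and sorting by UTF-8 bytes agree."""
    by_code_point = sorted(strings)  # Python compares str by code point
    by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_bytes


def forbidden_bytes_seen(code_points):
    """Return any 0xFE or 0xFF bytes found when encoding the code points."""
    seen = set()
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        seen.update(b for b in chr(cp).encode("utf-8") if b in (0xFE, 0xFF))
    return seen
```

Calling `same_order(["zebra", "Zürich", "中文", "apple", "😀"])` returns `True`, and `forbidden_bytes_seen(range(0x110000))` returns an empty set, which is consistent with FE and FF being free to serve as a byte order mark for other encodings.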

UTF-8 disadvantages

You cannot tell the number of bytes in a UTF-8 text from its number of Unicode characters, because UTF-8 is a variable-length encoding.

Characters that take only 1 byte in the extended ASCII character set ISO Latin-1 (those above 7F) take 2 bytes in UTF-8. ISO Latin-1 is a subset of Unicode, but its byte encoding is not a subset of UTF-8.

The 8-bit bytes in UTF-8 text may be filtered by email gateways, because Internet mail was originally designed for 7-bit ASCII. Hence the UTF-7 encoding.

UTF-8 frequently uses byte values of the form 100xxxxx (80 to 9F) in its multi-byte sequences, and existing implementations of ISO 2022, 4873, 6429, and 8859 systems mistake them for C1 control codes. Hence the UTF-7.5 encoding.
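The size and Latin-1 points above can be seen in a small Python sketch. The helper names `sizes` and `is_valid_utf8` are hypothetical, and the sample text is only an illustration.

```python
def sizes(text: str):
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "na\u00efve caf\u00e9"            # contains two Latin-1 letters above 7F
char_count, utf8_bytes = sizes(sample)     # (10, 12)
latin1_bytes = sample.encode("latin-1")    # 10 bytes, one per character
```

Here the 10-character sample needs 12 bytes in UTF-8 but only 10 in Latin-1, and `is_valid_utf8(latin1_bytes)` returns `False`, because a lone byte such as E9 announces a 3-byte sequence that never arrives.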