Code Point
May 20, 2023
A code point is a numerical value that represents a specific character or symbol in a given character set or encoding scheme. It is the fundamental unit of information in Unicode, which is a character encoding standard that assigns a unique code point to every character in the world’s writing systems.
The purpose of code points is to provide a standard way of representing characters across different computer systems and software applications. In the early days of computing, different manufacturers and operating systems used their own proprietary character sets, which made it difficult to exchange data and documents between different systems. Unicode was developed to address this problem by providing a universal character encoding standard that can represent all characters used in human writing systems.
How Code Points Work
In Unicode, each code point is assigned a unique number, typically written in hexadecimal notation. For example, the code point for the letter “A” in the basic Latin alphabet is U+0041. This means that “A” is represented by the numerical value 65 in decimal notation.
Unicode defines a code space of 1,114,112 possible code points, running from U+0000 to U+10FFFF, of which more than 140,000 are currently assigned to characters. The first 128 code points (0 to 127) match the ASCII character set, while higher code points represent characters in other writing systems, such as Chinese, Arabic, or Cyrillic.
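As a rough illustration, Python’s built-in ord() and chr() functions convert between a character and its code point, which makes the notation above easy to verify (a minimal sketch; the sample characters are arbitrary choices):

```python
# Inspect the code points behind a few characters from different scripts.
for ch in ["A", "Я", "中", "ع"]:
    cp = ord(ch)                                  # character -> integer code point
    print(f"{ch!r} is U+{cp:04X} (decimal {cp})")

# And back again: an integer code point -> the character it identifies.
print(chr(0x0041))                                # prints "A"
```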
Unicode Planes
Unicode divides its code points into 17 planes, each containing 65,536 code points. The first plane, called the Basic Multilingual Plane (BMP), contains the most commonly used characters, including those in the ASCII and Latin-1 Supplement character sets. The BMP includes code points U+0000 to U+FFFF.
The remaining 16 planes hold more specialized characters, such as those used in historic scripts, musical notation, or mathematical symbols, as well as many emoji. These supplementary planes are needed less often than the BMP, but modern software is expected to handle them correctly.
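Because each plane spans exactly 0x10000 code points, the plane a character belongs to can be computed by integer-dividing its code point by 0x10000. A minimal sketch (the example characters are arbitrary):

```python
# Plane number = code point // 0x10000 (each plane holds 65,536 code points).
for ch in ["A", "€", "𝄞", "😀"]:
    cp = ord(ch)
    plane = cp // 0x10000
    print(f"{ch!r} U+{cp:04X} -> plane {plane}")
# "A" and "€" are in plane 0 (the BMP); the musical symbol and the emoji
# are in plane 1, the Supplementary Multilingual Plane.
```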
Unicode Characters and Glyphs
It is important to distinguish between Unicode characters and the visual glyphs that represent them. A character is an abstract concept that represents a specific symbol or idea, while a glyph is the visual representation of that character.
For example, the Unicode character U+005A represents the letter “Z”, but the glyph used to display that character may look different depending on the font and style used. This distinction is important because the same character can be rendered with many different glyphs, and visually similar glyphs can belong to entirely different characters in different writing systems.
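One way to see the abstract character behind a glyph is to ask for its code point and Unicode name, which stay the same no matter which font renders it. A small sketch using Python’s standard unicodedata module:

```python
import unicodedata

# The abstract character is identified by its code point and name,
# regardless of which font or glyph is used to draw it.
for ch in ["Z", "z", "Ζ"]:   # the last one is the Greek capital zeta, not a Latin Z
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```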
Code Points and Encodings
One of the challenges of using code points is that they must be encoded in a way that can be stored and transmitted by computer systems. There are many different encoding schemes that can be used to represent Unicode code points, each with its own advantages and disadvantages.
UTF-8 Encoding
The most commonly used Unicode encoding scheme is UTF-8, which is a variable-length encoding scheme that can represent all Unicode code points using one to four bytes. In UTF-8, the first 128 code points (0 to 127) are represented by a single byte, while other code points are represented by two to four bytes.
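The variable-length behaviour is easy to observe: encoding a string as UTF-8 in Python and counting the bytes shows one byte for ASCII and two to four bytes for everything else (a sketch; the sample characters are arbitrary):

```python
# UTF-8 uses 1 byte for ASCII and 2-4 bytes for all other code points.
for ch in ["A", "é", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r} U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```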
UTF-8 has become the de facto standard for encoding Unicode on the web because it is compatible with ASCII and supports all Unicode code points. It is also efficient because it uses fewer bytes to represent the most commonly used characters.
Other Encoding Schemes
There are other encoding schemes besides UTF-8, including UTF-16 and UTF-32, as well as older single-byte encodings such as the ISO-8859 family. Each encoding scheme has its own advantages and disadvantages, depending on the specific use case.
UTF-16 is a variable-length encoding scheme that uses two bytes for code points in the Basic Multilingual Plane and four bytes, known as a surrogate pair, for code points in the supplementary planes. It is commonly used in Windows-based systems and supports all Unicode code points.
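Repeating the earlier experiment with UTF-16 shows both the two-byte and the four-byte (surrogate pair) cases; a minimal sketch:

```python
# BMP characters take 2 bytes in UTF-16; supplementary characters take 4
# (a surrogate pair). "utf-16-be" is used here to avoid the byte-order mark.
for ch in ["A", "中", "😀"]:
    encoded = ch.encode("utf-16-be")
    print(f"{ch!r} U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```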
UTF-32 is a fixed-length encoding scheme that uses four bytes to represent each code point. It is less commonly used than UTF-8 or UTF-16 but provides a simple and consistent way of representing all Unicode code points.
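With UTF-32 every code point occupies exactly four bytes, which is what makes it simple to index into but wasteful for mostly-ASCII text; a quick check:

```python
# Every code point is exactly 4 bytes in UTF-32, regardless of the character.
for ch in ["A", "😀"]:
    print(f"{ch!r} -> {len(ch.encode('utf-32-be'))} bytes")
```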
ISO-8859 is a family of single-byte character encodings, each covering ASCII plus a different set of up to 96 additional characters, such as ISO-8859-1 (Latin-1) for Western European languages or ISO-8859-5 for Cyrillic. These encodings predate widespread Unicode adoption, are aimed primarily at European languages, and cannot represent the full range of Unicode code points.
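Their limits show up as soon as a character falls outside the chosen subset; for example, Latin-1 can encode “é” but not a CJK character or an emoji. A sketch using Python’s error handling:

```python
# ISO-8859-1 (Latin-1) covers only 256 code points, so most of Unicode fails.
for ch in ["é", "中"]:
    try:
        print(f"{ch!r} -> 0x{ch.encode('iso-8859-1').hex()}")
    except UnicodeEncodeError:
        print(f"{ch!r} cannot be represented in ISO-8859-1")
```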