
Encoding text information

Unicode is a character encoding standard used by computers to store and exchange textual data. Unicode assigns a unique number (code point) to each character in the world's major writing systems. It also covers technical symbols, punctuation marks, and many other characters used in writing.

In addition to being a character map, Unicode includes algorithms for collation and for encoding bidirectional scripts such as Arabic, as well as specifications for normalizing text forms.

This section provides a general overview of Unicode. For more complete information, and for a list of supported scripts whose characters can be encoded with Unicode, see the Unicode Consortium website.

Code points

Characters are units of information that roughly correspond to a unit of text in the writing of a natural language. Unicode defines how characters are interpreted, not how they are displayed.

A glyph is the visual representation of a character: the image that appears on a monitor screen or printed page. In some writing systems one character may correspond to several glyphs, or several characters to a single glyph. For example, "ll" in Spanish is one glyph but two characters: "l" and "l".

In Unicode, characters are mapped to code points. Code points are the numbers the Unicode Consortium assigns to each character of each writing system. A code point is written as "U+" followed by four or more hexadecimal digits. For example: lowercase l is U+006C, lowercase u with umlaut (ü) is U+00FC, beta (β) is U+03B2, and lowercase e with acute (é) is U+00E9.

Unicode contains 1,114,112 code points; to date, more than 96,000 of them have been assigned to characters.

Planes

The Unicode code space for characters is divided into 17 planes, each of which contains 65,536 code points.

The first plane, plane 0, is the Basic Multilingual Plane (BMP). Most commonly used characters are encoded in the BMP, and it is the plane in which most characters have been encoded to date. The BMP contains code points for almost all characters of modern languages as well as many special characters. About 6,300 code points in the BMP remain unused and are reserved for future additions.

The next plane, plane 1, is the Supplementary Multilingual Plane (SMP). The SMP is used to encode historic scripts as well as musical and mathematical symbols.

Character encoding

A character encoding defines each character, its code point, and how that code point is represented in bits. Without knowing which encoding was used, you cannot interpret a string of characters correctly.

There are a great many encoding schemes, converting data between them is difficult, and few of them can accommodate characters from more than two or three languages. For example, if your PC uses OEM-Latin II by default and you browse a website that uses IBM EBCDIC-Cyrillic, any Cyrillic characters that have no counterpart in the Latin II scheme will not be displayed; they will be replaced by other characters, such as question marks or squares.

Since Unicode contains code points for most characters of all modern languages, using a Unicode character encoding lets your computer interpret almost every known character.

There are three main Unicode character encoding schemes: UTF-8, UTF-16, and UTF-32. UTF stands for Unicode Transformation Format. The number after UTF indicates the size, in bits, of the code units used by the encoding.

  • UTF-8 uses variable-width 8-bit code units. UTF-8 uses 1 to 4 bytes to encode a character, which may be fewer, the same, or more bytes than UTF-16 would use for the same character. In UTF-8, each code point from 0 to 127 (U+0000 to U+007F) is stored in a single byte; code points U+0080 and above are stored using 2 to 4 bytes.
  • UTF-16 uses 16-bit code units. It is relatively compact, and all of the most commonly used characters fit in a single 16-bit code unit. Other characters are accessed through pairs of 16-bit code units.
  • UTF-32 uses 4 bytes to encode every character: each character occupies one fixed-width 32-bit code unit. A UTF-32 document is, in most cases, about twice the size of the equivalent UTF-16 document. Use UTF-32 if disk space is not a concern and you want exactly one code unit per character.

All three forms of encoding can encode the same characters and can be translated from one to another without loss of data.
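As a quick illustration, here is a minimal Python 3 sketch (the test string is arbitrary) showing the same text round-tripping through all three encodings without loss:

    # Round-trip one string through UTF-8, UTF-16 and UTF-32.
    text = "l ü β é"   # mixes ASCII, Latin-1 and Greek characters

    for codec in ("utf-8", "utf-16", "utf-32"):
        encoded = text.encode(codec)      # str -> bytes
        decoded = encoded.decode(codec)   # bytes -> str
        print(codec, len(encoded), decoded == text)

The byte counts differ between the three forms, but the final comparison prints True for each of them.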

There are other encodings as well, such as UTF-7 and UTF-EBCDIC. There is also GB18030, a Chinese national standard that, like UTF-8, covers the full Unicode range and supports both simplified and traditional Chinese characters. For Russian text, windows-1251 is also convenient.


Unicode is a vast and complex world: the standard makes it possible to represent and process all of the world's major scripts on a computer. Some writing systems have existed for over a thousand years, and many evolved almost independently in different parts of the world. People have invented so many writing devices, often so different from one another, that combining them all into a single standard was an extremely difficult and ambitious task.

To really understand Unicode, you would need at least a passing acquaintance with every script the standard covers. But does every developer really need that? We would say no. For most everyday tasks, a reasonable minimum of knowledge is enough; you can dig deeper into the standard as the need arises.

In this article, we will talk about the basic principles of Unicode and highlight those important practical issues that developers will certainly face in their daily work.

Why was Unicode needed?

Before Unicode, single-byte encodings were used almost universally, and the boundaries between characters themselves, their representation in computer memory, and their display on screen were rather blurry. If you worked with a particular national language, the corresponding encoding-specific fonts were installed on your system, so that bytes from disk could be drawn on screen in a way that made sense to the user.

If you printed a text file and saw a page of incomprehensible garbage (what Russian speakers call "krakozyabry", i.e. mojibake), it meant that the appropriate fonts had not been loaded into the printer and it was interpreting the bytes differently than you intended.

This approach in general and single-byte encodings in particular had a number of significant drawbacks:

  1. Only 256 characters could be used at a time: the first 128 were reserved for Latin letters and control characters, and in the second half, alongside the characters of a national alphabet, room had to be found for pseudographic characters (╔ ╗).
  2. The fonts were tied to a specific encoding.
  3. Each encoding represented its own set of characters, and conversion from one to another was possible only with partial loss: missing characters were replaced by graphically similar ones.
  4. Transferring files between devices running different operating systems was difficult: you needed either a converter program or extra fonts carried along with the file. The Internet as we know it could not have existed.
  5. Some of the world's writing systems are not alphabetic (for example, ideographic scripts) and cannot be represented in a single-byte encoding at all.

Basic principles of Unicode

We all understand perfectly well that a computer knows nothing about ideal entities; it operates on bits and bytes. But computer systems are built by people, and sometimes it is more convenient for us to think in abstract concepts first and move from the abstract to the concrete afterwards.

Important! One of the central tenets in Unicode philosophy is a clear distinction between characters, their representation in a computer, and their display on an output device.

The standard introduces the notion of an abstract Unicode character, which exists purely as a conceptual agreement between people, fixed in the standard. Each Unicode character is associated with a non-negative integer called its code point.

So, for example, the Unicode character U+041F is the Cyrillic capital letter П. There are several ways to represent this character in computer memory, just as there are several thousand ways to display it on a monitor screen, but through it all, П stays П (and U+041F stays U+041F) anywhere in the world.

This is the familiar encapsulation, the separation of interface from implementation, a concept that has served programming well.

It turns out that, guided by the standard, any text can be encoded as a sequence of Unicode characters:

Привет   =   U+041F U+0440 U+0438 U+0432 U+0435 U+0442

Write it down on a piece of paper, pack it in an envelope, and send it to any part of the world. If the recipients know about the existence of Unicode, they will perceive the text exactly the same way you and I do. They will not have the slightest doubt that the penultimate character is the Cyrillic lowercase е (U+0435) and not, say, the Latin small e (U+0065). Notice that we have not said a word about byte representation yet.
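In Python, for instance, code points can be written directly as escape sequences, and ord()/chr() convert between a character and its code point; a small sketch:

    # Assemble the word from its code points and inspect them.
    s = "\u041f\u0440\u0438\u0432\u0435\u0442"   # "Привет"
    for ch in s:
        print(f"U+{ord(ch):04X}", ch)

    # Cyrillic е and Latin e look alike but are distinct code points.
    print(hex(ord("е")), hex(ord("e")))   # 0x435 vs 0x65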

Although Unicode characters are called characters, they do not always correspond to a character in the traditional, naive sense, such as a letter, digit, punctuation mark, or ideograph (see the examples below).

Examples of various Unicode characters

There are purely technical Unicode characters, for example:

  • U+0000: the null character;
  • U+D800–U+DFFF: the high and low surrogates, used for the technical representation of code points in the range 10000 to 10FFFF (read: outside the BMP) in the UTF-16 family of encodings;
  • etc.
There are direction markers, such as U+200F: the right-to-left mark, which changes the writing direction.

There is a whole cohort of spaces of various widths and purposes (see the excellent Habr article on the subject):

  • U+0020 (space);
  • U+00A0 (no-break space, &nbsp; in HTML);
  • U+2002 (en space);
  • U+2003 (em space);
  • etc.
There are combining diacritical marks: all kinds of strokes, dots, tildes, and so on, which modify or clarify the meaning of the preceding character and its appearance. For example:
  • U+0301 and U+0300: the acute and grave accents, marks of primary and secondary stress;
  • U+0306: the combining breve, as in й;
  • U+0303: the combining tilde;
  • etc.
There are even such exotica as language tags (U+E0001, U+E0020–U+E007E, and U+E007F), which are now in limbo (deprecated). They were conceived as a way to mark stretches of text as belonging to a particular language variant (say, American versus British English), which could affect the details of how the text is displayed.

What exactly a character is, and how a grapheme cluster (read: what is perceived as a single character image) differs from a Unicode character and from a code unit, we will discuss next time.

Unicode code space

The Unicode code space consists of 1,114,112 code points, ranging from 0 to 10FFFF. Of these, only 128,237 have been assigned values as of version 9 of the standard. Part of the space is reserved for private use, and the Unicode Consortium promises never to assign values to positions in those special areas.

For convenience, the entire space is divided into 17 planes (six of them are currently in use). Until recently, it was customary to say that you would most likely only ever encounter the Basic Multilingual Plane (BMP), which covers Unicode characters U+0000 through U+FFFF. (Looking ahead a bit: BMP characters are represented in UTF-16 by two bytes rather than four.) In 2016 this thesis is already in doubt: popular emoji characters, for example, may well appear in a user message, and you need to be able to process them correctly.

Encodings

If we want to send text over the Internet, then we need to encode a sequence of unicode characters as a sequence of bytes.

The Unicode standard includes a number of Unicode encodings, such as UTF-8 and UTF-16BE / UTF-16LE, that allow the entire code point space to be encoded. Conversion between these encodings can be freely performed without loss of information.

Single-byte encodings have not gone away either, but each of them covers its own very narrow slice of the Unicode spectrum: 256 or fewer code points. Tables for such encodings exist and are available to everyone, mapping each byte value to a Unicode character (see, for example, CP1251.TXT). Despite their limitations, single-byte encodings turn out to be very practical when working with large bodies of monolingual text.
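A single-byte table such as CP1251 is literally a 256-entry mapping into Unicode, and the same byte means different things under different tables; a sketch in Python:

    # One byte value, three different characters under three encodings.
    b = b"\xc0"
    for codec in ("cp1251", "koi8-r", "latin-1"):
        print(codec, b.decode(codec))   # А, ю, À respectively

    print("Ё".encode("cp1251"))         # b'\xa8', per the CP1251.TXT table

This is precisely why text cannot be interpreted without knowing its encoding.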

UTF-8 is the most widely used Unicode encoding on the Internet (it took the lead in 2008), mainly thanks to its economy and its transparent compatibility with seven-bit ASCII. Latin letters, control characters, basic punctuation and digits, i.e. all seven-bit ASCII characters, are encoded in UTF-8 in one byte, the same byte as in ASCII. The characters of many other major scripts, apart from some of the rarer ideographs, are represented by two or three bytes. The largest code point defined by the standard, 10FFFF, is encoded in four bytes.

Note that UTF-8 is a variable-length encoding. Each Unicode character is represented by a sequence of code units, at least one unit long. The number 8 refers to the bit length of the code unit: 8 bits. For the UTF-16 family of encodings the code unit is 16 bits; for UTF-32, 32 bits.

If you are sending an HTML page with Cyrillic text over the network, UTF-8 can give a very tangible benefit, because all the markup, as well as the JavaScript and CSS blocks, is encoded at one byte per character. For example, the Habr home page in UTF-8 is 139 KB, while in UTF-16 it is 256 KB. For comparison, using win-1251 (and losing the ability to store some characters) would shave off only 11 KB relative to UTF-8, down to 128 KB.
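The effect is easy to reproduce; a rough sketch (the exact numbers depend on the proportion of markup to Cyrillic text, of course):

    # Encoded sizes of mostly-Cyrillic text wrapped in ASCII markup.
    page = "<p>" + "Привет, хабр! " * 1000 + "</p>"

    print(len(page.encode("utf-8")))      # Cyrillic 2 bytes/char, markup 1
    print(len(page.encode("utf-16-le")))  # almost everything 2 bytes/char
    print(len(page.encode("cp1251")))     # everything 1 byte/char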

For storing string data inside applications, 16-bit Unicode encodings are often used for their simplicity and because the characters of the world's major writing systems fit in a single 16-bit code unit. Java, for example, uses UTF-16 for the internal representation of strings, and the Windows operating system uses UTF-16 internally as well.

In any case, as long as we stay within the Unicode space, it does not matter much how string data is stored inside a single application. If the internal format can correctly encode all million-plus code points and no information is lost at the application boundary (for example, when reading a file or copying to the clipboard), then all is well.

To interpret text read from disk or from a network socket correctly, you must first determine its encoding. This is done either from user-supplied metadata written in or alongside the text, or heuristically.

The bottom line

There is a lot of information here, so it makes sense to give a brief summary of everything written above:
  • Unicode postulates a clear distinction between characters, their representation in a computer, and their display on an output device.
  • Unicode characters do not always correspond to a character in the traditional naive sense, such as a letter, number, punctuation mark, or hieroglyph.
  • The Unicode code space consists of 1,114,112 code points, ranging from 0 to 10FFFF.
  • The Basic Multilingual Plane includes Unicode characters U+0000 through U+FFFF, which UTF-16 encodes in two bytes.
  • Any Unicode encoding can encode the entire space of Unicode code points, and conversion between such encodings loses no information.
  • Single-byte encodings cover only a small part of the Unicode spectrum, but can be useful when working with large amounts of monolingual text.
  • UTF-8 and UTF-16 are variable-length encodings: in UTF-8 a character occupies one, two, three, or four bytes; in UTF-16, two or four bytes.
  • The internal format for storing text within an application can be arbitrary, provided it handles the entire space of Unicode code points correctly and loses nothing when data crosses the application boundary.

A quick note on coding

Some confusion arises around the term encoding, because within Unicode encoding happens twice. First, the character set is encoded, in the sense that each Unicode character is assigned a code point; this turns the Unicode character set into a coded character set. Second, a sequence of Unicode characters is converted into a byte string, and this process is also called encoding.

In English terminology there are two different verbs, to code and to encode, but even native speakers often confuse them. In addition, the term character set (charset) is used as a synonym for coded character set.

The point of all this is that it pays to watch the context and to distinguish situations where we are talking about the code position of an abstract Unicode character from those where we mean its byte representation.

Finally

There are so many different aspects of Unicode that covering everything in one article is impossible, and unnecessary. The information above is enough to avoid confusion about the basic principles and to work with text in most everyday tasks (read: without going beyond the BMP). In the following articles we will talk about normalization, give a fuller historical overview of the development of encodings, discuss the problems of Russian-language Unicode terminology, and look at the practical aspects of using UTF-8 and UTF-16.

Unicode

Unicode Consortium logo

Unicode (in Russian most often «Юникод», sometimes «Уникод»; English Unicode) is a character encoding standard that makes it possible to represent the characters of almost all written languages.

The standard was proposed in 1991 by the non-profit organization Unicode Consortium (Unicode Inc.).

The use of this standard makes it possible to encode a very large number of characters from different scripts: Chinese ideographs, mathematical symbols, letters of the Greek, Latin, and Cyrillic alphabets can coexist in a Unicode document, so switching code pages becomes unnecessary.

The standard consists of two main parts: the universal character set (UCS) and the family of encodings (UTF, Unicode Transformation Format).

The universal character set defines a one-to-one correspondence between characters and codes: elements of the code space, which are non-negative integers. The family of encodings defines the machine representation of a sequence of UCS codes.

Unicode codes are divided into several areas. The area with codes U+0000 through U+007F contains the ASCII characters with their usual codes. Beyond it lie the areas for the characters of various scripts, punctuation marks, and technical symbols.

Some codes are reserved for future use. Cyrillic characters occupy the ranges U+0400 to U+052F, U+2DE0 to U+2DFF, and U+A640 to U+A69F (see Cyrillic in Unicode).

  • 1 Prerequisites for the creation and development of Unicode
  • 2 Unicode versions
  • 3 Code space
  • 4 Coding system
    • 4.1 Consortium Policy
    • 4.2 Combining and Duplicating Symbols
  • 5 Modifying characters
  • 6 Normalization algorithms
    • 6.1 NFD
    • 6.2 NFC
    • 6.3 NFKD
    • 6.4 NFKC
    • 6.5 Examples
  • 7 Bidirectional writing
  • 8 Featured Symbols
  • 9 ISO/IEC 10646
  • 10 Presentation methods
    • 10.1 UTF-8
    • 10.2 Byte order
    • 10.3 Unicode and traditional encodings
    • 10.4 Implementations
  • 11 Input Methods
  • 12 Unicode Problems
  • 13 «Юникод» or «Уникод»?

Prerequisites for the creation and development of Unicode

By the late 1980s, 8-bit characters had become the standard. At the same time, there were many different 8-bit encodings, and new ones constantly appeared.

This was driven both by the constantly expanding range of supported languages and by the desire to create encodings partially compatible with existing ones (a typical example is the emergence of the "alternative" encoding for Russian, designed so that Western programs built for the CP437 encoding would keep working).

As a result, several problems appeared:

  1. the problem of "krakozyabry" (mojibake, garbled text);
  2. the problem of limited character set;
  3. the problem of converting one encoding to another;
  4. the problem of duplicate fonts.

The mojibake ("krakozyabry") problem is the problem of documents displayed in the wrong encoding. It could be solved either by consistently introducing ways to specify which encoding is used, or by introducing a single common encoding for everyone.

The limited character set problem. It could be solved either by switching fonts within a document or by introducing a "wide" encoding. Font switching had long been practiced in word processors, often using fonts with non-standard encodings, so-called "dingbat fonts". As a result, when a document was transferred to another system, all the non-standard characters turned into mojibake.

The problem of converting one encoding to another. It could be solved either by compiling conversion tables for every pair of encodings, or by converting through an intermediate third encoding that includes the characters of all the others.

The duplicate fonts problem. A separate font was created for each encoding, even when the character sets of those encodings matched partially or completely. The problem could be solved by creating "large" fonts from which the characters needed for a given encoding would be selected. However, this required a single registry of characters in order to determine what corresponds to what.

The need for a single "wide" encoding was recognized. Variable-length encodings, widely used in East Asia, were found to be too difficult to use, so it was decided to use fixed-width characters.

Using 32-bit characters seemed too wasteful, so it was decided to use 16-bit ones.

The first version of Unicode was an encoding with a fixed character size of 16 bits, i.e. a total of 2^16 (65,536) codes. Ever since, characters have been denoted by four hexadecimal digits (for example, U+04F0). It was planned to encode not all existing characters, but only those needed in everyday use; rarely used characters were to be placed in a "private use area", which originally occupied codes U+D800…U+F8FF.

In order to use Unicode also as an intermediate in converting different encodings to each other, all characters represented in all the most famous encodings were included in it.

In the future, however, it was decided to encode all the symbols and, in connection with this, significantly expand the code domain.

At the same time, character codes began to be considered not as 16-bit values, but as abstract numbers that can be represented in a computer in many different ways (see representation methods).

Since a number of computer systems (for example, Windows NT) already used fixed 16-bit characters as their default encoding, it was decided to encode all the most important characters within the first 65,536 positions, the so-called Basic Multilingual Plane (BMP).

The rest of the space is used for "supplementary characters": writing systems of extinct languages, very rarely used Chinese ideographs, and mathematical and musical symbols.

For compatibility with older 16-bit systems, the UTF-16 system was invented, in which the first 65,536 positions, except positions in the range U+D800…U+DFFF, are represented directly as 16-bit numbers, while the rest are represented as "surrogate pairs" (the first element of the pair from the range U+D800…U+DBFF, the second from U+DC00…U+DFFF). For surrogate pairs, 2,048 positions of the code space previously allocated for "private use" were repurposed.

Since UTF-16 can represent only 2^20 + 2^16 − 2048 (1,112,064) characters, this number was chosen as the final size of the Unicode code space (code range 0x000000–0x10FFFF).
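The surrogate mechanism itself is plain arithmetic; here is a Python sketch of how a code point beyond the BMP is split into a surrogate pair under the rules just described:

    # Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair.
    def to_surrogate_pair(code_point):
        assert 0x10000 <= code_point <= 0x10FFFF
        v = code_point - 0x10000        # 20 significant bits remain
        high = 0xD800 + (v >> 10)       # first element: U+D800..U+DBFF
        low = 0xDC00 + (v & 0x3FF)      # second element: U+DC00..U+DFFF
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x1F600)])   # ['0xd83d', '0xde00']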

Although the Unicode code area was extended beyond 2^16 as early as version 2.0, the first characters in the "upper" area were placed there only in version 3.1.

The role of this encoding on the web is growing steadily: at the beginning of 2010, the share of websites using Unicode was about 50%.

Unicode versions

Work on finalizing the standard continues: new versions are released as the character tables are changed and updated. In parallel, new ISO/IEC 10646 documents are issued.

The first version of the standard was released in 1991, the most recent in 2016, and the next is expected in the summer of 2017. Versions 1.0 through 5.0 of the standard were published as books and have ISBNs.

The version number of the standard is made up of three digits (for example, "4.0.1"). The third digit is changed when minor changes are made to the standard that do not add new characters.

Code space

Although the notation forms UTF-8 and UTF-32 allow up to 2^31 (2,147,483,648) code points to be encoded, it was decided to use only 1,112,064 for compatibility with UTF-16. Even this is more than enough for the moment: in version 6.0, a little under 110,000 code points are used (109,242 graphic characters and 273 others).

The code space is divided into 17 planes of 2^16 (65,536) characters each. The zero plane (plane 0) is called basic and contains the characters of the most common scripts; the other planes are supplementary. The first plane (plane 1) is used mainly for historic scripts, the second (plane 2) for rarely used Chinese ideographs (CJK), and the third (plane 3) is reserved for archaic Chinese ideographs. Planes 15 and 16 are set aside for private use.

Unicode characters are written in the form "U+xxxx" (for codes 0…FFFF), "U+xxxxx" (for codes 10000…FFFFF), or "U+xxxxxx" (for codes 100000…10FFFF), where the x's are hexadecimal digits. For example, the character "я" (U+044F) has the code 044F in hexadecimal, i.e. 1103 in decimal.

Coding system

The universal coding system (Unicode) is a set of graphic symbols and a way of encoding them for computer processing of text data.

Graphic characters are characters that have a visible image; they are contrasted with control and formatting characters.

Graphic symbols include the following groups:

  • letters contained in at least one of the supported alphabets;
  • numbers;
  • punctuation marks;
  • special signs (mathematical, technical, ideograms, etc.);
  • separators.

Unicode is a system for the linear representation of text. Characters with additional superscript or subscript marks can be represented either as a sequence of codes built according to certain rules (the composite variant, using combining characters) or as a single character (the monolithic variant, a precomposed character). As of 2014, it is considered that all letters of the major scripts are included in Unicode, and if a character is available in a composite variant, it need not be duplicated in monolithic form.

Consortium policy

The Consortium does not invent anything new; it codifies established practice. For example, the emoji pictures were added because Japanese mobile operators were already using them extensively.

Adding a character involves a complex process. The Russian ruble sign, for example, passed through it in just three months, but only because it had received official status.

Trademarks are encoded only by way of exception: Unicode has neither a Windows flag nor an Apple apple.

Once a character has appeared in the encoding, it will never move or disappear. If the order of characters needs changing, that is done not by changing code positions but through the national sort order. There are other, subtler guarantees of stability as well: for example, the normalization tables will not change.

Combining and Duplicating Symbols

The same character can take several forms; in Unicode, all of those forms live in a single code point:

  • when that is how it developed historically: for example, Arabic letters have four forms (isolated, initial, medial, and final);
  • or when one form is customary in one language and a different form in another: Bulgarian Cyrillic differs from Russian, and Chinese ideographs differ from Japanese ones.

On the other hand, if fonts historically contained two different code points, they remain different in Unicode. The lowercase Greek sigma has two forms, and they have different code positions. The extended Latin letter Å (A with a ring) and the angstrom sign Å, or the Greek letter μ and the "micro" prefix µ, are different characters.

Of course, similar characters in unrelated scripts get different code points: the letter A in Latin, Cyrillic, Greek, and Cherokee are different characters.

Only very rarely is the same character placed at two different code positions to simplify text processing: the mathematical prime and the identical-looking stroke used to indicate softness of sounds are different characters, the latter being considered a letter.

Modifying characters

Representation of the character «Й» (U+0419) as the base character «И» (U+0418) plus the modifying character ◌̆ (U+0306)

Graphic characters in Unicode are divided into extended and non-extended (zero-width). Non-extended characters take up no space of their own in a line when displayed; they include, in particular, accent marks and other diacritics. Both extended and non-extended characters have their own codes. Extended characters are also called base characters, and non-extended ones modifying or combining characters; the latter cannot occur on their own. For example, the character "á" can be represented either as the sequence of the base character "a" (U+0061) and the modifier "́" (U+0301), or as the monolithic character "á" (U+00E1).
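A sketch in Python: the monolithic "á" and the pair "a" plus combining acute are different code-point sequences that render identically, which is exactly what the normalization algorithms of the next sections deal with:

    import unicodedata

    precomposed = "\u00e1"    # 'á' as a single monolithic character
    combining = "a\u0301"     # 'a' followed by a combining acute accent

    print(precomposed, combining)    # look the same on screen
    print(precomposed == combining)  # False: different code points
    print(unicodedata.normalize("NFC", combining) == precomposed)   # True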

A special type of modifying character is the variation selector. Variation selectors apply only to characters for which such variants are defined. In version 5.0, variants are defined for a number of mathematical symbols, for characters of the traditional Mongolian alphabet, and for characters of the Mongolian square script.

Normalization algorithms

Since the same characters can be represented by different codes, comparing strings byte by byte becomes impossible. Normalization algorithms (normalization forms) solve this problem by converting text to a standard form.

Conversion is carried out by replacing characters with equivalent ones using tables and rules. "Decomposition" replaces one character with several constituent characters; "composition", conversely, replaces several constituent characters with one.

The Unicode standard defines 4 text normalization algorithms: NFD, NFC, NFKD, and NFKC.

NFD

NFD (Normalization Form D, where "D" stands for decomposition) is canonical decomposition: an algorithm that recursively replaces monolithic (precomposed) characters with sequences of their constituent characters, according to the decomposition tables. Examples:

Å (U+00C5)                →  A (U+0041) + ◌̊ (U+030A)
ṩ (U+1E69)                →  s (U+0073) + ◌̣ (U+0323) + ◌̇ (U+0307)
ḍ̇ (U+1E0B U+0323)         →  d (U+0064) + ◌̣ (U+0323) + ◌̇ (U+0307)
q̣̇ (U+0071 U+0307 U+0323)  →  q (U+0071) + ◌̣ (U+0323) + ◌̇ (U+0307)

NFC

NFC (Normalization Form C, where "C" stands for composition) is an algorithm that performs canonical decomposition followed by canonical composition. First, canonical decomposition (the NFD algorithm) reduces the text to form D. Then canonical composition, the inverse of NFD, processes the text from beginning to end, observing the following rules:

  • a character S is considered a starter if its combining class in the Unicode character table is zero;
  • in any sequence of characters beginning with a starter S, a character C is blocked from S only if there is some character B between S and C that is either a starter itself or has a combining class greater than or equal to that of C; this rule applies only to strings that have already gone through canonical decomposition;
  • a character is considered a primary composite if it has a canonical decomposition in the Unicode character table (or a canonical decomposition for Hangul and is not on the exclusion list);
  • a character X can be primarily combined with a character Y if and only if there exists a primary composite Z canonically equivalent to the sequence <X, Y>;
  • if the next character C is not blocked from the last encountered starter base character L, and it can be primarily combined with L, then L is replaced by the composite L-C and C is removed.
For example, o (U+006F) followed by the combining circumflex ◌̂ (U+0302) composes to ô; if a starter such as H (U+0048) stood between them, the composition would be blocked.

NFKD

NFKD (Normalization Form KD, where "K" stands for compatibility) is compatibility decomposition: an algorithm that recursively replaces characters by both their canonical and their compatibility decompositions.

NFKC

NFKC (Normalization Form KC) is compatibility decomposition followed by canonical composition.

Examples

Source             | NFD                       | NFC                | NFKD                               | NFKC
① (U+2460)         | ① (U+2460)                | ① (U+2460)         | 1 (U+0031)                         | 1 (U+0031)
ｶ (U+FF76)         | ｶ (U+FF76)                | ｶ (U+FF76)         | カ (U+30AB)                        | カ (U+30AB)
ﬁ (U+FB01)         | ﬁ (U+FB01)                | ﬁ (U+FB01)         | fi (U+0066 U+0069)                 | fi (U+0066 U+0069)
2⁵ (U+0032 U+2075) | 2⁵ (U+0032 U+2075)        | 2⁵ (U+0032 U+2075) | 25 (U+0032 U+0035)                 | 25 (U+0032 U+0035)
ẛ̣ (U+1E9B U+0323)  | ẛ̣ (U+017F U+0323 U+0307) | ẛ̣ (U+1E9B U+0323)  | ṩ (U+0073 U+0323 U+0307)          | ṩ (U+1E69)
й (U+0439)         | й (U+0438 U+0306)         | й (U+0439)         | й (U+0438 U+0306)                  | й (U+0439)
ё (U+0451)         | ё (U+0435 U+0308)         | ё (U+0451)         | ё (U+0435 U+0308)                  | ё (U+0451)
А (U+0410)         | А (U+0410)                | А (U+0410)         | А (U+0410)                         | А (U+0410)
が (U+304C)        | が (U+304B U+3099)       | が (U+304C)        | が (U+304B U+3099)                | が (U+304C)
Ⅷ (U+2167)         | Ⅷ (U+2167)                | Ⅷ (U+2167)         | VIII (U+0056 U+0049 U+0049 U+0049) | VIII (U+0056 U+0049 U+0049 U+0049)
ç (U+00E7)         | ç (U+0063 U+0327)         | ç (U+00E7)         | ç (U+0063 U+0327)                  | ç (U+00E7)
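Python's standard unicodedata module implements all four forms, so rows of the table above can be reproduced directly; a sketch using the ẛ̣ example:

    import unicodedata

    s = "\u1e9b\u0323"   # long s with dot above, plus combining dot below
    for form in ("NFD", "NFC", "NFKD", "NFKC"):
        result = unicodedata.normalize(form, s)
        print(form, [f"U+{ord(c):04X}" for c in result])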

Bidirectional writing

The Unicode standard supports both languages written left to right (LTR) and languages written right to left (RTL), such as Arabic and Hebrew. In both cases, characters are stored in "natural" (logical) order; their display in the appropriate direction is handled by the application.

In addition, Unicode supports mixed texts that combine fragments with different writing directions. This feature is called bidirectionality (bidirectional text, BiDi). Some simplified text processors (for example, in cell phones) may support Unicode without supporting bidirectionality. All Unicode characters are divided into several categories: written left to right, written right to left, and written in any direction. Characters of the last category (mostly punctuation marks) take the direction of the surrounding text when displayed.

Featured symbols


Diagram of the basic multilingual plane of Unicode

Unicode includes virtually all modern scripts, including:

  • Arabic,
  • Armenian,
  • Bengali,
  • Burmese,
  • Glagolitic,
  • Greek,
  • Georgian,
  • Devanagari,
  • Hebrew,
  • Cyrillic,
  • Chinese (Chinese characters are actively used in Japanese, and occasionally in Korean),
  • Coptic,
  • Khmer,
  • Latin,
  • Tamil,
  • Korean (Hangul),
  • Cherokee,
  • Ethiopic,
  • Japanese (which, in addition to the syllabaries, also uses Chinese characters),

and others.

For academic purposes, many historical scripts have been added, including Germanic runes, Old Turkic runes, ancient Greek scripts, Egyptian hieroglyphs, cuneiform, Maya script, and the Etruscan alphabet.

Unicode provides a wide range of mathematical and musical symbols and pictograms.

As a matter of principle, Unicode does not include state flags or company and product logos, although these do occur in fonts (for example, the Apple logo at 0xF0 in the MacRoman encoding, or the Windows logo at 0xFF in the Wingdings font). In Unicode fonts, logos may be placed only in the private use area.

ISO / IEC 10646

The Unicode Consortium works closely with the ISO/IEC JTC1/SC2/WG2 working group, which develops International Standard 10646 (ISO/IEC 10646). The Unicode standard and ISO/IEC 10646 are synchronized, although each standard uses its own terminology and documentation system.

The Consortium's cooperation with the International Organization for Standardization (ISO) began in 1991. In 1993, ISO issued the standard DIS 10646.1. To synchronize with it, the Consortium approved version 1.1 of the Unicode standard, supplemented with additional characters from DIS 10646.1. As a result, the values of the coded characters in Unicode 1.1 and DIS 10646.1 match exactly.

Cooperation between the two organizations continued. In 2000, the Unicode 3.0 standard was synchronized with ISO/IEC 10646-1:2000. The upcoming third version of ISO/IEC 10646 will be synchronized with Unicode 4.0. These specifications may even be published as a single standard.

Like UTF-16 and UTF-32 in the Unicode standard, ISO/IEC 10646 has two main character encoding forms: UCS-2 (2 bytes per character, similar to UTF-16) and UCS-4 (4 bytes per character, similar to UTF-32). UCS stands for universal multiple-octet (multibyte) coded character set. UCS-2 can be considered a subset of UTF-16 (UTF-16 without surrogate pairs), while UCS-4 is a synonym for UTF-32.

Differences between the Unicode and ISO/IEC 10646 standards:

  • minor differences in terminology;
  • ISO/IEC 10646 does not include the sections needed for a full implementation of Unicode support:
    • no data on the binary encoding of characters;
    • no description of the comparison (collation) and rendering algorithms;
    • no list of character properties (for example, the properties required to implement support for bidirectional text).

Presentation methods

Unicode has several representation forms (Unicode Transformation Format, UTF): UTF-8, UTF-16 (UTF-16BE, UTF-16LE), and UTF-32 (UTF-32BE, UTF-32LE). A UTF-7 representation form was also designed for transmission over seven-bit channels, but because of its incompatibility with ASCII it never saw wide adoption and was not included in the standard. On April 1, 2005, two joke representation forms were proposed: UTF-9 and UTF-18 (RFC 4042).

Microsoft Windows NT, and the Windows 2000 and Windows XP systems based on it, mainly use the UTF-16LE form. UNIX-like operating systems such as GNU/Linux, BSD, and Mac OS X have adopted UTF-8 for files and UTF-32 or UTF-8 for processing characters in memory.

Punycode is another form of encoding sequences of Unicode characters, into so-called ACE sequences consisting only of the alphanumeric characters permitted in domain names.

UTF-8

UTF-8 is the Unicode representation that provides the best compatibility with older systems that used 8-bit characters.

Text containing only characters numbered below 128 turns into plain ASCII text when written in UTF-8. Conversely, in UTF-8 text, any byte with a value below 128 represents the ASCII character with the same code.

The remaining Unicode characters are represented by sequences of 2 to 6 bytes (in practice, only up to 4 bytes, since there are no characters beyond 10FFFF in Unicode and no plans to introduce them), in which the first byte always has the form 11xxxxxx and the rest 10xxxxxx. UTF-8 does not use surrogate pairs; 4 bytes suffice to write any Unicode character.

UTF-8 was invented by Ken Thompson and Rob Pike on September 2, 1992, and implemented in the Plan 9 operating system. The UTF-8 standard is now officially enshrined in RFC 3629 and ISO/IEC 10646 Annex D.

UTF-8 characters are derived from Unicode as follows:

Unicode range             UTF-8
0x00000000 - 0x0000007F:  0xxxxxxx
0x00000080 - 0x000007FF:  110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Theoretically possible, but also not included in the standard:

0x00200000 - 0x03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Although UTF-8 allows the same character to be specified in several ways, only the shortest form is correct; the other (overlong) forms must be rejected for security reasons.
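The table above translates directly into shift-and-mask code; a minimal Python sketch of the standard (up to 4 bytes) scheme, without the validity checks a real encoder would add:

    # Encode one code point per the RFC 3629 bit patterns above.
    def utf8_encode(cp):
        if cp < 0x80:                              # 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                             # 110xxxxx 10xxxxxx
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp < 0x10000:                           # 1110xxxx 10xxxxxx 10xxxxxx
            # (a real encoder would reject surrogates U+D800..U+DFFF here)
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,   # 11110xxx + tails
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    assert utf8_encode(0x041F) == "П".encode("utf-8")             # 2 bytes
    assert utf8_encode(0x10FFFF) == "\U0010FFFF".encode("utf-8")  # 4 bytes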

Byte order

In a UTF-16 data stream, the low byte can be written either before the high byte (UTF-16 little-endian) or after it (UTF-16 big-endian). Similarly, there are two variants of the four-byte encoding: UTF-32LE and UTF-32BE.

To identify the representation format, a signature is written at the beginning of a text file: the character U+FEFF (zero-width no-break space), also called the byte order mark (BOM). This makes it possible to distinguish UTF-16LE from UTF-16BE, since the character U+FFFE does not exist. The signature is also sometimes used to mark UTF-8, although the notion of byte order does not apply to that format. Files following this convention begin with these byte sequences:

UTF-8     EF BB BF
UTF-16BE  FE FF
UTF-16LE  FF FE
UTF-32BE  00 00 FE FF
UTF-32LE  FF FE 00 00

Unfortunately, this method does not reliably distinguish UTF-16LE from UTF-32LE, since Unicode permits the character U+0000 (although real texts rarely begin with it).

Files in the UTF-16 and UTF-32 encodings that contain no BOM must be in big-endian byte order (unicode.org).
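A sketch of signature-based detection following the table above (note that the UTF-32LE pattern must be tested before the UTF-16LE one, precisely because of the ambiguity just mentioned):

    BOMS = [                       # longest signatures first
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\xef\xbb\xbf", "utf-8"),
        (b"\xfe\xff", "utf-16-be"),
        (b"\xff\xfe", "utf-16-le"),
    ]

    def sniff_encoding(data, default="utf-8"):
        # Guess the Unicode encoding of `data` from its BOM, if any.
        for bom, name in BOMS:
            if data.startswith(bom):
                return name
        return default

    print(sniff_encoding("текст".encode("utf-16")))   # utf-16-le on a little-endian machine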

Unicode and traditional encodings

The introduction of Unicode changed the approach to traditional 8-bit encodings. Where an encoding used to be defined by its font, it is now defined by a correspondence table between that encoding and Unicode.

In effect, 8-bit encodings became representations of subsets of Unicode. This made it much easier to write programs that must handle many different encodings: to add support for one more encoding, you just add another Unicode lookup table.

In addition, many data formats allow arbitrary Unicode characters to be inserted even when the document uses an old 8-bit encoding; in HTML, for example, you can use numeric character references such as &#x41F;.

Implementation

Most modern operating systems provide some degree of Unicode support.

Operating systems of the Windows NT family use the two-byte UTF-16LE encoding for the internal representation of file names and other system strings. System calls that take string parameters exist in single-byte and two-byte variants. For more information, see the article on Unicode in the Microsoft Windows family of operating systems.

UNIX-like operating systems, including GNU/Linux, BSD, and OS X, use the UTF-8 encoding to represent Unicode. Most programs can treat UTF-8 like a traditional single-byte encoding, regardless of the fact that a character may be represented by several consecutive bytes. To work with individual characters, strings are usually recoded to UCS-4, so that each character corresponds to a machine word.

One of the first successful commercial implementations of Unicode was the Java programming environment. It abandoned the 8-bit representation of characters in favor of a 16-bit one. This decision increased memory consumption but restored an important abstraction to programming: an arbitrary single character (the char type). In particular, a programmer could work with a string as with a simple array. The success was not final, however: Unicode outgrew the 16-bit limit, and by J2SE 5.0 an arbitrary character once again occupied a variable number of storage units, one char or two (see surrogate pair).
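The same effect is easy to observe from any language; in Python, counting the 16-bit code units (what Java calls chars) of a string containing an emoji shows the variable width:

    s = "П😀"    # a BMP character plus a character outside the BMP

    print(len(s))                           # 2 code points
    print(len(s.encode("utf-16-le")) // 2)  # 3 sixteen-bit code units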

Most programming languages ​​now support Unicode strings, although their representation may differ depending on the implementation.

Input Methods

Since no keyboard layout can provide for entering all Unicode characters at once, operating systems and applications must support alternative methods for entering arbitrary Unicode characters.

Microsoft Windows

Although the Character Map utility (charmap.exe) has supported Unicode characters since Windows 2000 and allows copying them to the clipboard, this support is limited to the basic plane (character codes U+0000…U+FFFF); characters with codes U+10000 and above are not displayed.

There is a similar table, for example, in Microsoft Word.

In some editors, such as WordPad and Microsoft Word, you can type a character's hexadecimal code and press Alt+X, and the code will be replaced by the corresponding character. Alt+X also performs the reverse transformation.

In many MS Windows programs, to enter a Unicode character you hold down the Alt key and type the decimal value of the character code on the numeric keypad. For example, the combinations Alt+0171 («), Alt+0187 (»), and Alt+0769 (the combining accent mark) are useful when typing Cyrillic text; Alt+0133 (the ellipsis …) and Alt+0151 (the em dash) are handy as well.

Macintosh

Mac OS 8.5 and later supports an input method called "Unicode Hex Input": holding down the Option key, you type the four-digit hexadecimal code of the required character. This method also allows entering characters with codes above U+FFFF via surrogate pairs, which the operating system automatically replaces with single characters. Before use, this input method must be activated in the corresponding section of the system settings and then selected as the current input method in the keyboard menu.

Starting with Mac OS X 10.2, there is also a Character Palette application, which lets you pick characters from a table, either by block or by the characters supported by a specific font.

GNU / Linux

GNOME also has a Character Map utility (gucharmap) that displays the characters of a specific block or writing system and offers search by character name or description. When the code of the desired character is known, it can be entered in accordance with the ISO 14755 standard: holding down Ctrl+⇧ Shift, type the hexadecimal code (in recent versions of GTK+, entry of the code must be preceded by pressing "U"). The hexadecimal code can be up to 32 bits long, allowing any Unicode character to be entered without surrogate pairs.

All X Window applications, including GNOME and KDE, support Compose key input. For keyboards that do not have a dedicated Compose key, you can assign any key for this purpose - for example, ⇪ Caps lock.

The GNU/Linux console also allows entering a Unicode character by its code: hold down the Alt key and type the decimal code on the numeric keypad. To enter characters by hexadecimal code, hold down the AltGr key and, for the digits A-F, use the numeric keypad keys from NumLock to ↵ Enter (clockwise). Input in accordance with ISO 14755 is supported as well. For these methods to work, Unicode mode must be enabled in the console by calling unicode_start(1), and a suitable font selected by calling setfont(8).

Mozilla Firefox for Linux supports ISO 14755 character input.

Unicode problems

In Unicode, English "a" and Polish "a" are the same character. In the same way, the Russian "a" and the Serbian "a" are considered the same symbol (but different from the Latin "a"). This coding principle is not universal; apparently, a solution "for all occasions" cannot exist at all.

  • Chinese, Korean, and Japanese texts are traditionally written top to bottom, starting in the top right corner. Unicode does not provide for switching between horizontal and vertical writing in these languages; that must be done with markup languages or the internal mechanisms of word processors.
  • Unicode allows for different glyph styles of the same character depending on the language. Chinese ideographs can have different styles in Chinese, Japanese (kanji), and Korean (hanja), yet in Unicode they are denoted by the same character (so-called CJK unification), although simplified and traditional ideographs still have different codes. Likewise, Russian and Serbian use different italic forms of the letters п and т (in Serbian italics they resemble ū and ш̄; see Serbian italics). It is therefore necessary to ensure that text is always correctly tagged as belonging to one language or the other.
  • Conversion from lowercase to uppercase also depends on the language. For example, Turkish has the letters İi and Iı, so the Turkish case rules conflict with the English ones, which require "i" to map to "I". Similar problems exist in other languages: in Canadian French, for example, case is converted somewhat differently than in France. (See the sketch after this list.)
  • Even Arabic numerals have typographic subtleties: digits come in "uppercase" and "lowercase", proportional and monospaced variants, and Unicode makes no difference between them. Such nuances are left to the software.
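Even without locale machinery, built-in case conversion shows that it cannot be a per-character table lookup; a Python sketch (proper language-sensitive rules, such as the Turkish dotless I, would need an external library such as PyICU, which we name here only as an assumption):

    # One character can become two on uppercasing: the German sharp s.
    print("straße".upper())              # STRASSE
    print(len("ß"), len("ß".upper()))    # 1 -> 2

    # The default rules are language-blind: Turkish would want 'i' -> 'İ'.
    print("i".upper())                   # always 'I' in the standard library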

Some of the disadvantages are not related to Unicode itself, but rather to the capabilities of the text processors.

  • Non-Latin text files in Unicode always take up more space, since one character is encoded not as one byte, as in the various national encodings, but as a sequence of bytes (the exceptions are UTF-8 for languages whose alphabets fit in ASCII, and texts mixing characters of two or more languages whose alphabets do not fit in ASCII). A font file able to display all the characters of the Unicode table takes up a comparatively large amount of memory and is more computationally expensive than a font for the user's national language alone. As computing power grows and memory and disk space get cheaper, this problem matters less and less; it remains relevant, however, for portable devices such as mobile phones.
  • Although Unicode support is implemented in the most common operating systems, not all application software handles it correctly. In particular, byte order marks (BOM) are not always processed, and accented characters are often poorly supported. The problem is temporary and stems from the comparative novelty of the Unicode standards relative to single-byte national encodings.
  • The performance of all string processing programs (including sorts in the database) decreases when Unicode is used instead of single-byte encodings.

Some rare writing systems are still not represented properly in Unicode. The depiction of "long" superscript characters extending over several letters, as, for example, in Church Slavonic, has not yet been implemented.

«Юникод» or «Уникод»?

"Unicode" is both a proper name (or part of one, as in Unicode Consortium) and a common noun borrowed from English, and in Russian it can be rendered as either «Юникод» or «Уникод».

At first glance, the spelling «Уникод» seems preferable: Russian already has the morphemes «уни-» (words with the Latin element "uni-" have traditionally been translated and written with «уни-»: universal, unipolar, unification, uniform) and «код». By contrast, trademarks borrowed from English are usually rendered by practical transcription, in which the de-etymologized letter combination "uni-" becomes «юни-» (Unilever, Unix, etc.), just as in letter-by-letter acronyms like UNICEF, «ЮНИСЕФ».

The spelling «Юникод» is already firmly established in Russian-language texts; Wikipedia uses this more common variant, and MS Windows uses «Юникод» as well.

There is a special page on the Consortium's website discussing how to render the word "Unicode" in various languages and writing systems; for Russian Cyrillic, the specified form is «Юникод».

The problems associated with encodings are usually taken care of by the software, so there is usually no difficulty in using encodings. If difficulties arise, then they are usually generated by bad programs - feel free to send them to the trash.

I invite everyone to share their thoughts in the comments.

What is encoding

The word "encoding" can refer to several things: the "character set" table itself; the process of using such a table to translate information from its computer representation into a human-readable one; and a property of a text file indicating which system of codes it uses to display text.

How text is encoded

The set of symbols used in writing text is called, in computer terminology, an alphabet, and the number of symbols in it is called its size. To represent textual information in a computer, an alphabet of 256 characters is most often used. One character of such an alphabet carries 8 bits of information, so the binary code of each character occupies 1 byte of computer memory. All the characters of the alphabet are numbered from 0 to 255, and each number corresponds to an 8-bit binary code, the ordinal number of the character in binary notation, from 00000000 to 11111111. Only the first 128 characters, numbered from zero (binary 00000000) to 127 (01111111), are standardized internationally. They include the lowercase and uppercase letters of the Latin alphabet, digits, punctuation marks, brackets, and so on. The remaining 128 codes, from 128 (binary 10000000) to 255 (11111111), are used to encode the letters of national alphabets and service and scientific symbols.

Types of encodings

The most famous encoding table is ASCII (American Standard Code for Information Interchange). It was originally developed for the transmission of texts by telegraph, and at that time it was 7-bit, that is, only 128 7-bit combinations were used to encode English characters, service and control characters. In this case, the first 32 combinations (codes) served to encode control signals (start of text, end of line, carriage return, call, end of text, etc.). In the development of the first IBM computers, this code was used to represent symbols in a computer. Since in source code ASCII was only 128 characters, for their encoding were enough byte values ​​with the 8th bit equal to 0. Byte values ​​with the 8th bit equal to 1 began to be used to represent pseudo-graphic characters, mathematical signs and some characters from languages ​​other than English (Greek, German umlauts, French diacritics, etc.). When they began to adapt computers for other countries and languages, there was no longer enough room for new symbols. To fully support languages ​​other than English, IBM has introduced several country-specific code tables. So for the Scandinavian countries, table 865 (Nordic) was proposed, for the Arab countries - table 864 (Arabic), for Israel - table 862 (Israel), and so on. In these tables, some of the codes from the second half of the code table were used to represent the characters of the national alphabets (by excluding some pseudo-graphic characters). The situation with the Russian language developed in a special way. Obviously, the replacement of characters in the second half of the code table can be done different ways... So, several different tables of Cyrillic character encoding appeared for the Russian language: KOI8-R, IBM-866, CP-1251, ISO-8551-5. All of them represent the symbols of the first half of the table in the same way (from 0 to 127) and differ in the representation of the symbols of the Russian alphabet and pseudo-graphics. For languages ​​like Chinese or Japanese, 256 characters are generally not enough. In addition, there is always the problem of outputting or saving in one file at the same time texts on different languages(for example, when quoting). Therefore, a universal code table UNICODE, containing symbols used in the languages ​​of all peoples of the world, as well as various service and auxiliary symbols (punctuation marks, mathematical and technical symbols, arrows, diacritics, etc.). Obviously, one byte is not enough to encode such a large number of characters. Therefore UNICODE uses 16-bit (2-byte) codes to represent 65,536 characters. To date, about 49,000 codes have been used (the last significant change was the introduction of the EURO currency symbol in September 1998). For compatibility with previous encodings, the first 256 codes are the same as in the ASCII standard. In the UNICODE standard, except for the specific binary code(these codes are usually denoted by the letter U, followed by a + sign and the actual code in hexadecimal representation) each character is assigned a specific name. Another component of the UNICODE standard is algorithms for one-to-one conversion of UNICODE codes in a sequence of bytes of variable length. The need for such algorithms is due to the fact that not all applications are able to work with UNICODE. Some applications only understand 7-bit ASCII codes, other applications understand 8-bit ASCII codes. 
Such applications use so-called extended ASCII codes, in which characters that do not fit into the 128-character or 256-character set are encoded with byte strings of variable length. UTF-7 is used to reversibly convert UNICODE codes into extended 7-bit ASCII codes, and UTF-8 to reversibly convert them into extended 8-bit ASCII codes. Note that neither ASCII nor UNICODE nor any other character encoding standard defines the images of the characters; they define only the composition of the character set and the way it is represented in a computer. In addition (and this may not be immediately obvious), the order in which the characters are enumerated in the set is very important, since it strongly affects sorting algorithms. The table of correspondence for the symbols of a certain set (say, the symbols used to represent information in English, or in many languages at once, as in the case of UNICODE) is denoted by the term character encoding table, or charset. Each standard encoding has a name, for example KOI8-R, ISO_8859-1, ASCII. Unfortunately, there is no standard for encoding names.
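As a rough sketch of these reversible conversions, the following Python fragment (using the interpreter's built-in utf-7 and utf-8 codecs; the sample string is arbitrary) round-trips a small piece of text through both forms:

    # UTF-8 produces a variable-length, 8-bit-safe byte string; UTF-7
    # produces one where every byte stays within 7-bit ASCII.
    text = "naïve 😀"               # ASCII plus a Latin-1-range and a non-BMP character

    utf8 = text.encode("utf-8")
    utf7 = text.encode("utf-7")

    assert utf8.decode("utf-8") == text   # both round trips are lossless
    assert utf7.decode("utf-7") == text
    print(utf8)
    print(utf7)
    print(all(b < 0x80 for b in utf7))    # True: UTF-7 never sets the eighth bit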

Common encodings

ISO 646
  o ASCII
EBCDIC
ISO 8859:
  o ISO 8859-1 through ISO 8859-11, ISO 8859-13, ISO 8859-14, ISO 8859-15
CP437, CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP863, CP865, CP866, CP869
Microsoft Windows encodings:
  o Windows-1250 for Central European languages that use the Latin alphabet
  o Windows-1251 for Cyrillic alphabets
  o Windows-1252 for Western languages
  o Windows-1253 for Greek
  o Windows-1254 for Turkish
  o Windows-1255 for Hebrew
  o Windows-1256 for Arabic
  o Windows-1257 for Baltic languages
  o Windows-1258 for Vietnamese
MacRoman, MacCyrillic
KOI8 (KOI8-R, KOI8-U, ...), KOI-7
Bulgarian encoding
ISCII
VISCII
Big5 (the best-known variant being Microsoft CP950)
  o HKSCS
Guobiao:
  o GB2312
  o GBK (Microsoft CP936)
  o GB18030
Shift JIS for Japanese (Microsoft CP932)
EUC-KR for Korean (Microsoft CP949)
ISO-2022 and EUC for Chinese writing
UTF-8 and UTF-16 encodings of Unicode characters
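Many of the names in this list can be tried directly in any environment that ships conversion tables for them. A minimal Python sketch (codec names spelled the way Python registers them) shows how the same single byte names a different character in several of these encodings:

    # One byte, five interpretations: the upper half of each code table
    # assigns its own character to the value 0xE0.
    raw = bytes([0xE0])
    for enc in ("cp1251", "koi8-r", "cp437", "iso8859-7", "cp866"):
        print(enc, raw.decode(enc))   # e.g. cp1251 а / koi8-r Ю / cp437 α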

In the ASCII coding system (American Standard Code for Information Interchange), each character is represented by one byte, which makes it possible to encode 256 characters.

ASCII has two encoding tables: basic and extended. The basic table fixes the values of codes from 0 to 127, and the extended one covers characters with numbers from 128 to 255. This is enough to express, with various combinations of eight bits, all the characters of the English and Russian languages, both lowercase and uppercase, as well as punctuation marks, symbols for basic arithmetic operations, and the common special symbols that can be seen on the keyboard.

The first 32 codes of the basic table, starting with zero, are reserved for hardware manufacturers (primarily manufacturers of computers and printing devices). This area contains the so-called control codes, which do not correspond to any language characters; accordingly, these codes are not displayed on the screen or on printing devices, but they can control how other data is output. Codes 32 through 127 hold the symbols of the English alphabet, punctuation marks, digits, arithmetic operations, and auxiliary symbols, all of which can be seen on the Latin part of a computer keyboard.
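A compact way to look at the visible half of the basic table is the following Python sketch (code 127, DEL, is left out because it is a control code again):

    # Codes 32 through 126 are the printable Latin-keyboard characters;
    # print them 16 per row.
    for code in range(32, 127):
        end = "\n" if code % 16 == 15 else " "
        print(chr(code), end=end)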

The second, extended part is given over to national coding systems. There are many non-Latin alphabets in the world (Arabic, Hebrew, Greek, and others), including the Cyrillic alphabet. In addition, the German, French, and Spanish keyboard layouts differ from the English one.

The English part of the keyboard once had many standards, but they have all now been replaced by the single ASCII code. For the Russian keyboard there were also many standards: GOST, alternative GOST, ISO (International Organization for Standardization), but these three standards have effectively died out, although they may still be encountered somewhere, in some antediluvian computers or computer networks.

The main character encoding for the Russian language, used on computers running the Windows operating system, is called Windows-1251; it was developed for Cyrillic alphabets by Microsoft. Naturally, the absolute majority of computer text data is encoded in Windows-1251. Incidentally, Microsoft developed encodings with other four-digit numbers for other common alphabets: Arabic, Japanese, and others.
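A minimal Python sketch of this table at work (the sample word is arbitrary): every Russian letter becomes exactly one byte from the upper half of the code page.

    # Windows-1251 packs each Cyrillic letter into a single byte, 128-255.
    text = "Привет"
    data = text.encode("cp1251")
    print(list(data))             # [207, 240, 232, 226, 229, 242]
    print(data.decode("cp1251"))  # Привет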

Another common encoding is called KOI-8 (code for information exchange, 8-bit); its origins date back to the times of the Council for Mutual Economic Assistance of the Eastern European states. Today the KOI-8 encoding is widespread in computer networks on the territory of Russia and in the Russian sector of the Internet. When the text of a letter or some other document turns out to be unreadable, it often means that it needs to be converted from KOI-8 to Windows-1251.
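The conversion amounts to re-reading the same bytes with the right table. A sketch in Python (the garbled string shown in the comment is what this particular sample happens to produce):

    # Bytes written in KOI8-R but read as Windows-1251 look like gibberish;
    # re-encoding with the wrong table and decoding with the right one
    # recovers the text.
    original = "Привет"
    raw = original.encode("koi8-r")   # the bytes actually transmitted

    garbled = raw.decode("cp1251")    # wrong table: "рТЙЧЕФ"
    fixed = garbled.encode("cp1251").decode("koi8-r")
    print(garbled)
    print(fixed)                      # Привет

The same trick works in the other direction when Windows-1251 text has been read as KOI-8.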

In the 1990s, the largest software manufacturers, Microsoft, Borland, and Adobe among them, decided that a different text encoding system was needed, one in which each character would be allocated not 1 but 2 bytes. It was named Unicode, and it can encode 65,536 characters: this field is enough to fit the national alphabets of all the languages of the planet into one table. Most of Unicode (about 70%) is occupied by Chinese characters; India alone has 11 different national alphabets; and there are many exotic scripts, for example the writing of Canada's aboriginal peoples.

Since each character in Unicode is allocated not 8 but 16 bits, the size of a text file doubles. This was once an obstacle to the introduction of the 16-bit system, but now, with gigabyte hard drives, hundreds of megabytes of RAM, and gigahertz processors, doubling the volume of text files, which take up very little space compared with, say, graphics, does not really matter.
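The doubling is easy to make concrete. A small Python sketch (the phrase is arbitrary; "utf-16-le" is the 2-byte-per-character form without a byte-order mark):

    # One byte per character in an 8-bit code page versus two in UTF-16.
    text = "Привет, мир"
    print(len(text.encode("cp1251")))     # 11 bytes
    print(len(text.encode("utf-16-le")))  # 22 bytes: exactly twice as many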

The Cyrillic alphabet in Unicode occupies code points 1024 to 1279 (the block U+0400 to U+04FF): the basic Russian letters sit at 1040 to 1103, and the rest of the block holds various less common national letters. If a program is not adapted for Cyrillic Unicode, the text characters may be recognized not as Cyrillic but as extended Latin (codes from 256 to 511). In that case, instead of text, a meaningless jumble of exotic symbols appears on the screen.
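These boundaries can be checked directly from any Unicode-aware environment; a quick Python sketch with a few sample letters:

    # The basic Russian letters А..я occupy 1040-1103 inside the Cyrillic
    # block U+0400-U+04FF; Ё and ё sit elsewhere in the same block.
    for ch in "АЯаяЁё":
        print(ch, ord(ch), hex(ord(ch)))
    # А 1040 0x410, Я 1071 0x42f, а 1072 0x430, я 1103 0x44f,
    # Ё 1025 0x401, ё 1105 0x451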

This can happen if the program is outdated (created before 1995), or so rare that no one bothered to adapt it for Russian. It is also possible that the Windows OS installed on the computer is not fully configured for the Cyrillic alphabet; in that case, the appropriate entries need to be made in the registry.