Special unicode characters. The problem of distinguishing externally similar numbers and letters.

Every Internet user, trying to configure one or another of its functions, at least once saw the written word "Unicode" on the display. You will find out what it is by reading this article.

Definition

Unicode encoding is a character encoding standard. It was offered by the non-profit organization Unicode Inc. in 1991. The standard is designed to combine as many different types of characters as possible in one document. The page that was created on its basis may contain letters and hieroglyphs from different languages(from Russian to Korean) and mathematical signs... In this case, all characters in this encoding are displayed without problems.

Reasons for creation

Once upon a time, long before unified system"Unicode", the encoding was chosen based on the preferences of the author of the document. For this reason, it is often necessary to use different tables to read one document. Sometimes this had to be done several times, which significantly complicated the life of an ordinary user. As already mentioned, a solution to this problem in 1991 was proposed by the non-profit organization Unicode Inc., which proposed a new type of character encoding. It was intended to combine obsolete and diverse standards. "Unicode" is an encoding that allowed to achieve the unthinkable at that time: to create a tool that supports a huge number of characters. The result surpassed many expectations - documents appeared simultaneously containing both English and Russian text, Latin and mathematical expressions.

But the creation of a single coding was preceded by the need to resolve a number of problems that arose due to the huge variety of standards that already existed at that time. The most common ones are:

elven letters, or "krakozyabry";
limited character set;
the problem of converting encodings;
duplication of fonts.

A small historical excursion

Imagine that it is the 80s. Computer technology is not yet so widespread and has a different form from today. At that time, each OS is unique in its own way and is modified by each enthusiast for specific needs. The need for information exchange turns into additional refinement of everything in the world. An attempt to read a document created under a different OS often displays an incomprehensible set of characters on the screen, and games with an encoding begin. It is not always possible to do this quickly, and sometimes the necessary document can be opened after six months, or even later. People who exchange information frequently create conversion tables for themselves. And so the work on them reveals an interesting detail: they need to be created in two directions: "from mine to yours" and vice versa. The machine cannot make a banal inversion of calculations, for it in the right column is the source, and in the left - the result, but not vice versa. If there was a need to use any Special symbols in the document, they had to first be added, and then also explained to the partner what he needed to do so that these symbols did not turn into "krakozyabry". And let's not forget that for each encoding, you had to develop or implement your own fonts, which led to the creation of a huge number of duplicates in the OS.

Imagine also that on the page of fonts you will see 10 identical Times New Roman with small annotations: for UTF-8, UTF-16, ANSI, UCS-2. Do you understand now that it was imperative to develop a universal standard?

"Creator Fathers"

The origins of Unicode can be traced back to 1987, when Joe Becker of Xerox, along with Lee Collins and Mark Davis of Apple began research into the practical creation of a universal character set. In August 1988, Joe Becker published a draft proposal for a 16-bit international multilingual coding system.

A few months later, the Unicode WG was expanded to include Ken Whistler and Mike Kernegan of the RLG, Glenn Wright of Sun Microsystems and several others, completing the preliminary work on a common coding standard.

general description

Unicode is based on the concept of a character. This definition is understood as an abstract phenomenon that exists in a specific form of writing and is realized through graphemes (their "portraits"). Each character is specified in "Unicode" unique code belonging to a specific block of the standard. For example, there is grapheme B in both English and Russian alphabets, but in Unicode it corresponds to 2 different characters. A transformation is applied to them, that is, each of them is described by a database key, a set of properties and a full name.

Benefits of Unicode

The Unicode encoding differed from the rest of its contemporaries by a huge supply of characters for “encrypting” characters. The fact is that its predecessors had 8 bits, that is, they supported 28 characters, but new development already had 216 characters, which was a giant step forward. This made it possible to encode almost all existing and common alphabets.

With the advent of "Unicode" there was no need to use conversion tables: as a single standard, it simply eliminated their need. Likewise, the "krakozyabry" have sunk into oblivion - a single standard made them impossible, as well as eliminated the need to create duplicate fonts.

Development of Unicode

Of course, progress does not stand still, and 25 years have passed since the first presentation. However, the Unicode encoding stubbornly maintains its position in the world. In many ways, this became possible due to the fact that it became easily implemented and became widespread, being recognized as developers of proprietary (paid) and open source software.

At the same time, one should not assume that today the same Unicode encoding is available to us as a quarter of a century ago. On this moment its version changed to 5.х.х, and the number of encoded characters increased to 231. The ability to use a larger supply of characters was abandoned in order to still maintain support for Unicode-16 (encodings where their maximum number was limited to 216). Since its inception and up to version 2.0.0, the "Unicode Standard" has almost doubled the number of characters it contains. The growth of opportunities continued in the following years. By version 4.0.0, there was a need to increase the standard itself, which was done. As a result, "Unicode" acquired the form in which we know it today.

What else is there in Unicode?

In addition to the huge, constantly growing number of symbols, it has another useful feature. This is the so-called normalization. Instead of scrolling through the entire document character by character and substituting the appropriate icons from the look-up table, one of the existing normalization algorithms is used. What are we talking about?

Instead of wasting computing resources on regular checking of the same symbol, which may be similar in different alphabets, a special algorithm is used. It allows you to take out similar characters in a separate column of the substitution table and refer to them, rather than re-checking all the data over and over again.

Four such algorithms have been developed and implemented. In each of them, the transformation takes place according to a strictly defined principle that differs from the others, therefore it is not possible to name any one of them the most effective. Each has been developed for specific needs, has been implemented and is successfully used.

Distribution of the standard

Over the 25 years of its history, the Unicode encoding is probably the most widely used in the world. Programs and web pages are also tailored to this standard. The fact that Unicode is used today by more than 60% of Internet resources can indicate the breadth of application.

Now you know when the Unicode standard came into being. What it is, you also know and will be able to appreciate the full significance of the invention made by a group of specialists from Unicode Inc. over 25 years ago.

Do you need hosting or domain? Click here! Do you want to create an online store? Click here! (Shopify)

Sometimes, when writing a post, there is a need for a character (sign) that is not on the keyboard, in such situations the unicode character table will help you. Today we will consider online service, in which all unicode characters are grouped ...

Unicode character table

For those who are interested in the background of the appearance Unicode- here is the link to wikipedia

So let's designate our interests in unicode characters- this is the use of them in their articles, on their sites.
First, let's go to the page service unicode characters:

Let's take a look at the interface of this service a little. At the very top there is a search field, in it it is enough to drive in the name of the element you are looking for, for example: "Arrow" or "Ellipsis", after entering, click on the search to get the result.

Next to the search there is a page language switcher.

Below is a list of frequently requested symbols, perhaps among them there will be the one you need, if so, just click on the symbol to go to the page with detailed information about it.

The main part of the page is occupied by Unicode character table, for a more convenient search, you can also click on "Control Characters" to select a group of characters, for example: "Greek Characters" if you need to insert a Greek character.

Find the item you want in the Unicode character table

For example, let's use search and enter the word "Arrow" into it and press search.

On the search results page, we are looking for the symbol we need and click on it to go to the page detailed information about him.

On the page Unicode character we are interested in its HTML code or Mnemonic code, both can be used on a web page, to do this, copy the code and paste it in the right place in the HTML markup, the browser interprets it and displays it as a symbol on the page.

Please note that on the Unicode character page, there is a choice of font. Always test how your font will display with Verdana, Arial (and other web fonts). not all characters are supported by them.

(codes from 0 to 127), i.e. one byte encodes Latin letters, numbers and special characters. Russian letters (Cyrillic) are represented by 16-bit (two-byte) codes:

110XXXXX 10XXXXXX,

where X denotes binary digits for placing the character code in accordance with the table UNICODE.

Unicode (English Unicode) is a character encoding standard that allows characters to be represented in almost all written languages. Unicode characters are encoded as unsigned integers. These numbers will be called unicode character codes or simply UNICODE... Unicode has several forms of representation of characters in a computer: UTF-8, UTF-16 (UTF-16BE, UTF-16LE) and UTF-32 (UTF-32BE, UTF-32LE)... (English Unicode transformation format - UTF).

Consider how it is encoded in UTF-8 letter F... Her UNICODE- 1046 10 or 0416 16 or 10000 010110 2. UNICODE in binary, it is split into two parts: five left bits and six right bits. The left side is padded to a byte with a sign 110 two-byte code UTF-8: 110 10000. Two bits are assigned to the right side 10 sign of continuation of multibyte code: 10 010110. Final letter code F v UTF-8 looks like that:

110 10000 10 010110 2
or D0 96 16

Thus, the Russian letter is encoded twice: first into 11-bit UNICODE and then into 16-bit UTF-8.

In the table below, in addition to the codes UNICODE and UTF-8 in hexadecimal notation, codes are given UTF-8 in decimal notation and for comparison Cyrillic codes in encoding CP-1251, otherwise called windovs-1251.

Cyrillic UTF-8 Codes Table

Symbol	UNICODE		UTF-8		CP-1251
Symbol	Hex	Ten	Hex	Ten	CP-1251
A	0410	1040	D090	208 144	192
B	0411	1041	D091	208 145	193
V	0412	1042	D092	208 146	194
G	0413	1043	D093	208 147	195
D	0414	1044	D094	208 148	196
E	0415	1045	D095	208 149	197
F	0416	1046	D096	208 150	198
Z	0417	1047	D097	208 151	199
AND	0418	1048	D098	208 152	200
Th	0419	1049	D099	208 153	201
TO	041A	1050	D09A	208 154	202
L	041B	1051	D09B	208 155	203
M	041C	1052	D09C	208 156	204
N	041D	1053	D09D	208 157	205
O	041E	1054	D09E	208 158	206
NS	041F	1055	D09F	208 159	207
R	0420	1056	D0A0	208 160	208
WITH	0421	1057	D0A1	208 161	209
T	0422	1058	D0A2	208 162	210
Have	0423	1059	D0A3	208 163	211
F	0424	1060	D0A4	208 164	212
NS	0425	1061	D0A5	208 165	213
C	0426	1062	D0A6	208 166	214
H	0427	1063	D0A7	208 167	215
NS	0428	1064	D0A8	208 168	216
SCH	0429	1065	D0A9	208 169	217
B	042A	1066	D0AA	208 170	218
NS	042B	1067	D0AB	208 171	219
B	042C	1068	D0AC	208 172	220
NS	042D	1069	D0AD	208 173	221
NS	042E	1070	D0AE	208 174	222
I AM	042F	1071	D0AF	208 175	223
a	0430	1072	D0B0	208 176	224
b	0431	1073	D0B1	208 177	225
v	0432	1074	D0B2	208 178	226
G	0433	1075	D0B3	208 179	227
d	0434	1076	D0B4	208 180	228
e	0435	1077	D0B5	208 181	229
f	0436	1078	D0B6	208 182	230
s	0437	1079	D0B7	208 183	231
and	0438	1080	D0B8	208 184	232
th	0439	1081	D0B9	208 185	233
To	043A	1082	D0BA	208 186	234
l	043B	1083	D0BB	208 187	235
m	043C	1084	D0BC	208 188	236
n	043D	1085	D0BD	208 189	237
O	043E	1086	D0BE	208 190	238
NS	043F	1087	D0BF	208 191	239
R	0440	1088	D180	209 128	240
with	0441	1089	D181	209 129	241
T	0442	1090	D182	209 130	242
at	0443	1091	D183	209 131	243
f	0444	1092	D184	209 132	244
NS	0445	1093	D185	209 133	245
c	0446	1094	D186	209 134	246
h	0447	1095	D187	209 135	247
NS	0448	1096	D188	209 136	248
SCH	0449	1097	D189	209 137	249
b	044A	1098	D18A	209 138	250
NS	044B	1099	D18B	209 139	251
b	044C	1100	D18C	209 140	252
NS	044D	1101	D18D	209 141	253
NS	044E	1102	D18E	209 142	254
I am	044F	1103	D18F	209 143	255
Symbols outside the general rule
Yo	0401	1025	D001	208 101	168
e	0451	1025	D191	209 145	184

Sometimes you need to add an icon to your design, but don't feel like inserting additional images or an entire icon font like Font Awesome? Then we have good news for you - there is an extensive library of available icons and symbols already in your browser. It's called Unicode, and it's the standard that assigns unique identifiers for an ever-growing number (currently over 110,000) of symbols and icons.

This does not mean that you have a selection of hundreds of thousands of icons, though. It depends on the browser that renders them, and it uses the fonts that are installed on the system to do this. In this article, we have compiled a number of character sets that are available on Windows, Linux, OS X, Android, and IOS. You can use them in your designs today!

Tip: which explains everything there is to know about encodings and Unicode, which we recommend every software developer read.

How to use these icons

The icons shown in the tables below are common symbols that you can copy and paste as if they were letters of the alphabet. But if the encoding used to save the HTML / CSS files not UTF-8 they will not be displayed. This is why we introduced HTML escape code that will always work. Here's what you need to do to use these icons.:

Find the icon you like. We have provided small and large previews.
Copy the code.
Paste it into HTML as plain text. In CSS you can use them as property value content... In JS, PHP, and other programming languages, you can use them as plain text in strings.
You can customize icons by setting font size, color, text and shadows just like normal text.

Icons

Name	Preview		Code
Smiley	☺	☺	☺
Warning Sign	⚠	⚠	⚠
Hot springs	♨	♨	♨
Wheelchair	♿	♿	♿
Recycle	♻	♻	♻
8-Ball	➑	➑	➑
High Voltage	⚡	⚡	⚡
White star	☆	☆	☆
Black star	★	★	★
White heart	♡	♡	♡
Black heart	❤	❤	❤
Coffee	☕	☕	☕
Airplane	✈	✈	✈
Hourglass	⌛	⌛	⌛
Clock	⌚	⌚	⌚
Black scissors	✂	✂	✂
White scissors	✄	✄	✄
Crown	♕	♕	♕
Anchor	⚓	⚓	⚓
Cross	✝	✝	✝
Black-white circle	◑	◑	◑
Eight note	♪	♪	♪
Beamed eighth notes	♫	♫	♫
Four Balloon-Spoked Asterisk	✣	✣	✣
Circled White Star	✪	✪	✪
White star	✰	✰	✰
White four pointed star	✧	✧	✧
Black four pointed star	✦	✦	✦
Ballot box check	☑	☑	☑
Check Mark	✔	✔	✔
Cross mark	✘	✘	✘
Pencil	✎	✎	✎
Writing hand	✍	✍	✍
Female	♀	♀	♀
Male	♂	♂	♂
Black telephone	☎	☎	☎
White Telephone	☏	☏	☏
Envelope	✉	✉	✉
Telephone Location	✆	✆	✆

Unicode arrows

Name	Preview		Code
Leftwards arrow	←	←	←
Rightwards arrow	→	→	→
Upwards Arrow
Downwards arrow	↓	↓	↓
Left Right Arrow	↔	↔	↔
Up down arrow	↕	↕	↕
Right And Left Arrows	⇄	⇄	⇄
Up and down arrows	⇅	⇅	⇅
Down-Left 90deg Arrow	↲	↲	↲
Down-Right 90deg Arrow	↳	↳	↳
Up-Left 90deg Arrow	↰	↰	↰
Up-Right 90deg Arrow	↱	↱	↱
North West Arrow To Corner	⇱	⇱	⇱
South East Arrow To Corner	⇲	⇲	⇲
Leftwards Arrow To Bar	⇤	⇤	⇤
Rightwards Arrow To Bar	⇥	⇥	⇥
Anticlockwise semicircle arrow	↶	↶	↶
Clockwise semicircle arrow	↷	↷	↷
Anticlockwise circle arrow	↺	↺	↺
Clockwise circle arrow	↻	↻	↻
Wide-Headed Rightwards Arrow	➔	➔	➔
Downwards zigzag arrow	↯	↯	↯
North west arrow	↖	↖	↖
Heavy south east arrow	➘	➘	➘
Heavy rightwards arrow	➙	➙	➙
Heavy north east arrow	➚	➚	➚
Dashed Rightwards Arrow	➟	➟	➟
Dotted Leftwards Arrow	⇠	⇠	⇠
Black rightwards arrowhead	➤	➤	➤
Leftwards white arrow	⇦	⇦	⇦
Rightwards white arrow	⇨	⇨	⇨
Left Angle Quotation Mark	«	«	«
Right Angle Quotation Mark	»	»	»
Right Black Pointer
Left Black Pointer	◀	◀	◀
Up Black Pointer	▲	▲	▲
Down black pointer	▼	▼	▼
Right White Pointer	▷	▷	▷
Left White Pointer	◁	◁	◁
Up White Pointer	△	△	△
Down white pointer	▽	▽	▽
Bow arrow	➴	➴	➴

Special characters in unicode

Unicode currency

Weather icons

Name	Preview		Code
Degree	°	°	°
Small sun	☀	☀	☀
Big sun	☼	☼	☼
Cloud	☁	☁	☁
Umbrella	☔	☔	☔
Snowflake 1	❆	❆	❆
Snowflake 2	❅	❅	❅
Snowflake 3	❄	❄	❄

Unicode pointers

Name	Preview		Code
Pointer Left Black	☚	☚	☚
Pointer Right Black	☛	☛	☛
Pointer Left White	☜	☜	☜
Pointer Up White	☝	☝	☝
Pointer Right White	☞	☞	☞
Pointer Down White	☟	☟	☟

Zodiac signs in unicode

Name	Preview		Code
Aries	♈	♈	♈
Taurus	♉	♉	♉
Twins	♊	♊	♊
Cancer	♋	♋	♋
a lion	♌	♌	♌
Virgo	♍	♍	♍
scales	♎	♎	♎
Scorpion	♏	♏	♏
Sagittarius	♐	♐	♐
Capricorn	♑	♑	♑
Aquarius	♒	♒	♒
Fishes	♓	♓	♓

Unicode card symbols

Name	Preview		Code
Clubs Black	♠	♠	♠
Hearts Black	♥	♥	♥
Diamonds black	♦	♦	♦
Spades black	♣	♣	♣
Clubs White	♤	♤	♤
Hearts White	♡	♡	♡
Diamonds white	♢	♢	♢
Spades white	♧	♧	♧

Chess pieces in unicode

Name	Preview		Code
King white	♔	♔	♔
Queen white	♕	♕	♕
Rook white	♖	♖	♖
Bishop White	♗	♗	♗
Knight white	♘	♘	♘
Pawn white	♙	♙	♙
King black	♚	♚	♚
Queen black	♛	♛	♛
Rook black	♜	♜	♜
Bishop Black	♝	♝	♝
Knight black	♞	♞	♞
Pawn black	♟	♟	♟

Game of dice

Name	Preview		Code
Dice roll one	⚀	⚀	⚀
Dice roll two	⚁	⚁	⚁
Dice roll three	⚂	⚂	⚂
Dice roll four	⚃	⚃	⚃
Dice roll five	⚄	⚄	⚄
Dice roll six	⚅	⚅	⚅

Unicode math symbols

Name	Preview		Code
Infinity	∞	∞	∞
Plus minus	±	±	±
Less-Than Or Equal To	≤	≤	≤
More-Than Or Equal To	≥	≥	≥
Not Equal To	≠	≠	≠
Division	÷	÷	÷
Multiplication x	×	×	×
Heavy Multiplication x	✖	✖	✖
Superscript one	¹	¹	¹
Superscript Two	²	²	²
Superscript three	³	³	³
Circled Plus	⊕	⊕	⊕
Circled Multiplication	⊗	⊗	⊗
Logical AND	∧	∧	∧
Logical OR	∨	∨	∨
Delta	∆	∆	∆
Pie	∏	∏	∏
Sigma (SUM)	∑	∑	∑
Omega	Ω	Ω	Ω
Empty Set	∅	∅	∅
Angle	∠	∠	∠
Parallel	∥	∥	∥
Perpendicular	⊥	⊥	⊥
Almost Equal To	≈	≈	≈
Triangle	△	△	△
Circle	○	○	○
Square	□	□	□

Fractions

Name	Preview		Code
One Quarter (1/4)	¼	¼	¼
One Half (1/2)	½	½	½
Three Quarters (3/4)	¾	¾	¾
One Third (1/3)	⅓	⅓	⅓
Two Thirds (2/3)	⅔	⅔	⅔
One Eight (1/8)	⅛	⅛	⅛
Three Eights (3/8)	⅜	⅜	⅜
Five Eights (5/8)	⅝	⅝	⅝
Seven Eights (7/8)	⅞	⅞	⅞

Roman numerals in unicode

Name	Preview		Code
Roman Numeral One	Ⅰ	Ⅰ	Ⅰ
Roman Numeral Two	Ⅱ	Ⅱ	Ⅱ
Roman Numeral Three	Ⅲ	Ⅲ	Ⅲ
Roman Numeral Four	Ⅳ	Ⅳ	Ⅳ
Roman Numeral Five	Ⅴ	Ⅴ	Ⅴ
Roman Numeral Six	Ⅵ	Ⅵ	Ⅵ
Roman Numeral Seven	Ⅶ	Ⅶ	Ⅶ
Roman Numeral Eight	Ⅷ	Ⅷ	Ⅷ
Roman Numeral Nine	Ⅸ	Ⅸ	Ⅸ
Roman Numeral Ten	Ⅹ	Ⅹ	Ⅹ
Roman Numeral Eleven	Ⅺ	Ⅺ	Ⅺ
Roman Numeral Twelve	Ⅻ	Ⅻ	Ⅻ

There are some differences in the rendering of these symbols in different operating systems... This is caused by the different font families that are used. In addition, iOS and Android replace some Unicode characters with emojis, so be sure to check the added characters to make sure they don't and the icons are showing as intended.

The elements of the code space that represent non-negative integers. The family of encodings defines the machine representation of a sequence of UCS codes.

Unicode codes are divided into several areas. The area with codes U + 0000 through U + 007F contains the ASCII characters with corresponding codes. Next are the areas of signs of various scripts, punctuation marks and technical symbols. Some of the codes are reserved for future use. Under the Cyrillic characters areas of characters with codes from U + 0400 to U + 052F, from U + 2DE0 to U + 2DFF, from U + A640 to U + A69F are allocated (see Cyrillic in Unicode).

Prerequisites for the creation and development of Unicode

Since in a number of computer systems (for example, Windows NT) fixed 16-bit characters were already used as the default encoding, it was decided to encode all the most important characters only within the first 65,536 positions (the so-called English. basic multilingual plane, BMP). The rest of the space is used for "additional characters" (eng. supplementary characters): writing systems of extinct languages or very rarely used Chinese characters, mathematical and musical symbols.

For compatibility with old 16-bit systems, the UTF-16 system was invented, where the first 65,536 positions, with the exception of positions from the interval U + D800 ... U + DFFF, are displayed directly as 16-bit numbers, and the rest are represented as "surrogate pairs "(The first element of the pair from the U + D800… U + DBFF region, the second element of the pair from the U + DC00… U + DFFF region). For surrogate pairs, a portion of the code space (2048 positions) previously reserved for "characters for private use" was used.

Since UTF-16 can display only 2 20 + 2 16 −2048 (1 112 064) characters, this number was chosen as the final value for the Unicode code space.

Although the Unicode code area was extended beyond 2-16 as early as version 2.0, the first characters in the "top" area were only placed in version 3.1.

The role of this encoding in the web sector is constantly growing, at the beginning of 2010 the share of websites using Unicode was about 50%.

Unicode versions

As the Unicode character table changes and replenishes and new versions of this system are released - and this work is ongoing, since the original Unicode system included only Plane 0 - two-byte codes - new ISO documents are also released. The Unicode system exists in total in the following versions:

1.1 (conforms to ISO / IEC 10646-1: 1993), 1991-1995 standard.
2.0, 2.1 (same ISO / IEC 10646-1: 1993 standard plus additions: "Amendments" 1 to 7 and "Technical Corrigenda" 1 and 2), 1996 standard.
3.0 (ISO / IEC 10646-1: 2000 standard) 2000 standard.
3.1 (ISO / IEC 10646-1: 2000 and ISO / IEC 10646-2: 2001 standards) 2001 standard.
3.2 2002 standard.
4.0, standard 2003.
4.01, standard 2004.
4.1, standard 2005.
5.0, standard 2006.
5.1, standard 2008.
5.2, standard 2009.
6.0, standard 2010.
6.1, standard 2012.
6.2, standard 2012.

Code space

Although the notation forms UTF-8 and UTF-32 allow up to 2,331 (2,147,483,648) code points to be encoded, it was decided to use only 1,112,064 for compatibility with UTF-16. However, even this is more than enough - today (in version 6.0) slightly less than 110,000 code points (109,242 graphic and 273 other symbols) are used.

The code space is split into 17 planes 2 16 (65536) characters each. The zero plane is called basic, it contains the symbols of the most common scripts. The first plane is used mainly for historical scripts, the second - for rarely used CJK characters, the third is reserved for archaic Chinese characters. Planes 15 and 16 are reserved for private use.

To denote Unicode characters a notation of the form “U + xxxx"(For codes 0 ... FFFF), or" U + xxxxx"(For codes 10000 ... FFFFF), or" U + xxxxxx"(For codes 100000 ... 10FFFF), where xxx- hexadecimal digits. For example, the character "i" (U + 044F) has the code 044F = 1103.

Coding system

The universal coding system (Unicode) is a set of graphic symbols and a way of encoding them for computer processing of text data.

Graphic symbols are symbols that have a visible image. Graphical characters are opposed to control and formatting characters.

Graphic symbols include the following groups:

letters contained in at least one of the supported alphabets;
numbers;
punctuation marks;
special signs (mathematical, technical, ideograms, etc.);
separators.

Unicode is a system for the linear representation of text. Characters with additional superscripts or subscripts can be represented as a sequence of codes built according to certain rules (composite character) or as a single character (monolithic version, precomposed character).

Modifying characters

Representation of the character "Y" (U + 0419) in the form of the base character "I" (U + 0418) and the modifying character "" (U + 0306)

Graphic characters in Unicode are divided into extended and non-extended (widthless). Non-extended characters do not take up space in the line when displayed. These include, in particular, accent marks and other diacritical marks. Both extended and non-extended characters have their own codes. Extended symbols are otherwise called basic (eng. base characters), and non-extended ones - modifying (eng. combining characters); and the latter cannot meet independently. For example, the character "á" can be represented as a sequence of the base character "a" (U + 0061) and the modifier character "́" (U + 0301), or as a monolithic character "á" (U + 00C1).

A special type of modifying characters are face style selectors (eng. variation selectors). They only apply to those symbols for which such variants are defined. In version 5.0, style options are defined for a series mathematical symbols, for the symbols of the traditional Mongolian alphabet and for the symbols of the Mongolian square writing.

Normalization forms

Since the same symbols can be represented different codes, which sometimes complicates processing, there are normalization processes designed to bring the text to a certain standard form.

The Unicode standard defines 4 forms of text normalization:

Normalization Form D (NFD) - Canonical Decomposition. In the process of converting the text into this form, all compound characters are recursively replaced by several compound ones, in accordance with the decomposition tables.
Normalization Form C (NFC) is canonical decomposition followed by canonical composition. First, the text is reduced to the D form, after which the canonical composition is performed - the text is processed from beginning to end and the following rules are followed:
- The S symbol is initial if it has a modification class of zero in the Unicode character base.
- In any sequence of characters starting with an initial character S, a character C is blocked from S if and only if there is any character B between S and C that is either an initial character or has the same or greater modification class than C. This rule applies only to strings that have gone through canonical decomposition.
- Primary A composite is a character that has a canonical decomposition in the Unicode character base (or canonical decomposition for Hangul and is not included in the exceptions list).
- The X character can be primary aligned with the Y character if and only if there is a primary composite Z canonically equivalent to the sequence .
- If the next character C is not blocked by the last encountered initial base character L and it can be successfully aligned with it, then L is replaced by the composite L-C, and C is removed.
Normalization Form KD (NFKD) - Compatible Decomposition. When cast into this form, all composite characters are replaced using both the canonical Unicode decomposition maps and compatible decomposition maps, after which the result is placed in canonical order.
Normalization Form KC (NFKC) - Compatible Decomposition followed by canonical composition.

The terms "composition" and "decomposition" mean, respectively, the connection or decomposition of symbols into their constituent parts.

Examples of

Source text	NFD	NFC	NFKD	NFKC
Français	Franc \ u0327ais	Fran \ xe7ais	Franc \ u0327ais	Fran \ xe7ais
A, E, Y		\ u0410, \ u0401, \ u0419	\ u0410, \ u0415 \ u0308, \ u0418 \ u0306	\ u0410, \ u0401, \ u0419
が	\ u304b \ u3099	\ u304c	\ u304b \ u3099	\ u304c
Henry iv	Henry iv	Henry iv	Henry iv	Henry iv
Henry Ⅳ	Henry \ u2163	Henry \ u2163	Henry iv	Henry iv

Bi-directional letter

The Unicode standard supports writing languages with a left-to-right direction (eng. left-to-right, LTR), and with writing from right to left (eng. right-to-left, RTL) - for example, Arabic and Hebrew letters. In both cases, the characters are stored in a "natural" order; their display, taking into account the desired direction of the letter, is provided by the application.

In addition, Unicode supports combined texts that combine fragments with different directions of the letter. This feature is called bidirectionality(eng. bidirectional text, BiDi). Some simplified text processors (for example, in cell phones) can support Unicode, but not bidirectional support. All Unicode characters are divided into several categories: written from left to right, written from right to left, and written in any direction. Symbols of the latter category (mainly punctuation marks), when displayed, take the direction of the surrounding text.

Featured symbols

Unicode includes virtually all modern scripts, including:

other.

For academic purposes, many historical scripts have been added, including: runes, ancient Greek, Egyptian hieroglyphs, cuneiform, Mayan writing, Etruscan alphabet.

Unicode provides a wide range of mathematical and musical symbols and pictograms.

However, Unicode fundamentally does not include company and product logos, although they are found in fonts (for example, the Apple logo in the MacRoman encoding (0xF0) or the Windows logo in the Wingdings font (0xFF)). In Unicode fonts, logos must be placed in the custom character area only.

ISO / IEC 10646

The Unicode Consortium works closely with working group ISO / IEC / JTC1 / SC2 / WG2, which is developing international standard 10646 (ISO / IEC 10646). Synchronization is established between the Unicode standard and ISO / IEC 10646, although each standard uses its own terminology and documentation system.

Cooperation of the Unicode Consortium with the International Organization for Standardization (eng. International Organization for Standardization, ISO ) began in 1991. In 1993, ISO issued the DIS 10646.1 standard. To synchronize with it, the Consortium approved the version 1.1 of the Unicode standard, which was supplemented with additional characters from DIS 10646.1. As a result, the values of the encoded characters in Unicode 1.1 and DIS 10646.1 are exactly the same.

In the future, cooperation between the two organizations continued. In 2000 Unicode standard 3.0 has been synchronized with ISO / IEC 10646-1: 2000. The upcoming third version of ISO / IEC 10646 will be synchronized with Unicode 4.0. Perhaps these specifications will even be published as a single standard.

Similar to the UTF-16 and UTF-32 formats in the Unicode standard, the ISO / IEC 10646 standard also has two main forms of character encoding: UCS-2 (2 bytes per character, similar to UTF-16) and UCS-4 (4 bytes per character, similar to UTF-32). UCS means universal multi-octet(multibyte) coded character set(eng. universal multiple-octet coded character set ). UCS-2 can be considered a subset of UTF-16 (UTF-16 without surrogate pairs) and UCS-4 is a synonym for UTF-32.

Presentation methods

Unicode has several forms of representation (eng. Unicode transformation format, UTF ): UTF-8, UTF-16 (UTF-16BE, UTF-16LE) and UTF-32 (UTF-32BE, UTF-32LE). The UTF-7 representation form was also developed for transmission over seven-bit channels, but due to incompatibility with ASCII, it was not spread and was not included in the standard. On April 1, 2005, two humorous submissions were proposed: UTF-9 and UTF-18 (RFC 4042).

Unicode UTF-8: 0x00000000 - 0x0000007F: 0xxxxxxx 0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx 0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx 0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxx

Theoretically possible, but also not included in the standard:

0x00200000 - 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 0x04000000 - 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Although UTF-8 allows you to specify the same character in several ways, only the shortest one is correct. The rest of the forms should be rejected for security reasons.

Byte order

In a UTF-16 data stream, the high byte can be written either before the low (eng. UTF-16 big-endian), or after the younger (eng. UTF-16 little-endian). Similarly, there are two options for the four-byte encoding - UTF-32BE and UTF-32LE.

To define the format of the Unicode representation at the beginning text file the signature is written - the character U + FEFF (non-breaking space with zero width), also called byte order mark(eng. byte order mark, BOM ). This makes it possible to distinguish between UTF-16LE and UTF-16BE since the U + FFFE character does not exist. It is also sometimes used to denote the UTF-8 format, although the notion of byte order does not apply to this format. Files that follow this convention begin with these byte sequences:

UTF-8 EF BB BF UTF-16BE FE FF UTF-16LE FF FE UTF-32BE 00 00 FE FF UTF-32LE FF FE 00 00

Unfortunately, this method does not reliably distinguish between UTF-16LE and UTF-32LE, since the character U + 0000 is allowed by Unicode (although real texts rarely start with it).

Files in UTF-16 and UTF-32 encodings that do not contain a BOM must be in big-endian (unicode.org) byte order.

Unicode and traditional encodings

The introduction of Unicode changed the approach to traditional 8-bit encodings. If earlier the encoding was specified by the font, now it is specified by the correspondence table between this encoding and Unicode. In fact, 8-bit encodings have become a representation of a subset of Unicode. This made it much easier to create programs that have to work with many different encodings: now, to add support for one more encoding, you just need to add another Unicode lookup table.

In addition, many data formats allow any Unicode characters to be inserted, even if the document is written in the old 8-bit encoding. For example, you can use ampersand codes in HTML.

Implementation

Most modern operating systems provide some degree of Unicode support.

In operating systems of the Windows NT family, the double-byte UTF-16LE encoding is used for the internal representation of file names and other system strings. System calls that take string parameters are available in single-byte and double-byte variants. For more details see the article