From Unicode codes to letters. The problem of distinguishing visually similar digits and letters.

Sometimes you need to add an icon to your design but don't feel like inserting additional images or an entire icon font like Font Awesome. Then we have good news for you: there is an extensive library of icons and symbols already available in your browser. It's called Unicode, and it's the standard that assigns unique identifiers to an ever-growing number (currently over 110,000) of symbols and icons.

This does not mean that you have hundreds of thousands of icons to choose from, though. What you see depends on the browser that renders them, which in turn uses the fonts installed on the system. In this article, we have compiled a number of character sets that are available on Windows, Linux, OS X, Android, and iOS. You can use them in your designs today!

Tip: we recommend that every software developer read up on how encodings and Unicode work.

How to use these icons

The icons shown in the tables below are ordinary characters that you can copy and paste as if they were letters of the alphabet. However, if the HTML/CSS files are saved in an encoding other than UTF-8, they will not be displayed. This is why we also provide an HTML escape code, which will always work. Here's what you need to do to use these icons:

  • Find the icon you like. We have provided small and large previews.
  • Copy the code.
  • Paste it into HTML as plain text. In CSS, you can use it as the value of the content property. In JS, PHP, and other programming languages, you can use it as plain text in strings.
  • You can customize the icons by setting the font size, color, and text shadow, just like normal text.
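
The escape codes mentioned in the steps above can be derived from a character's code point. The following is a minimal sketch, assuming you want the hexadecimal HTML character reference and the CSS escape form; the helper names `html_escape_code` and `css_escape_code` are made up for this illustration.

```python
import html


def html_escape_code(ch: str) -> str:
    """Hexadecimal HTML character reference for a single character."""
    return f"&#x{ord(ch):X};"


def css_escape_code(ch: str) -> str:
    """CSS escape for a single character, usable as a content property value."""
    return "\\" + f"{ord(ch):X}"


star = "\u2605"  # BLACK STAR
print(html_escape_code(star))  # character reference for use in HTML
print(css_escape_code(star))   # escape for use in CSS

# The standard library can decode the reference back to the character.
print(html.unescape(html_escape_code(star)) == star)
```

Both forms consist of ASCII characters only, which is why they survive even when a file is not saved as UTF-8.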

Icons

Name  Preview  Code
Smiley
Warning Sign
Hot springs
Wheelchair
Recycle
8-Ball
High Voltage
White star
Black star
White heart
Black heart
Coffee
Airplane
Hourglass
Clock
Black scissors
White scissors
Crown
Anchor
Cross
Black-white circle
Eighth Note
Beamed eighth notes
Four Balloon-Spoked Asterisk
Circled White Star
White star
White four pointed star
Black four pointed star
Ballot box check
Check Mark
Cross mark
Pencil
Writing hand
Female
Male
Black telephone
White Telephone
Envelope
Telephone Location

Unicode arrows

Name  Preview  Code
Leftwards arrow
Rightwards arrow
Upwards Arrow
Downwards arrow
Left Right Arrow
Up Down Arrow
Right And Left Arrows
Up and down arrows
Down-Left 90deg Arrow
Down-Right 90deg Arrow
Up-Left 90deg Arrow
Up-Right 90deg Arrow
North West Arrow To Corner
South East Arrow To Corner
Leftwards Arrow To Bar
Rightwards Arrow To Bar
Anticlockwise semicircle arrow
Clockwise semicircle arrow
Anticlockwise circle arrow
Clockwise circle arrow
Wide-Headed Rightwards Arrow
Downwards Zigzag Arrow
North west arrow
Heavy south east arrow
Heavy rightwards arrow
Heavy north east arrow
Dashed Rightwards Arrow
Dotted Leftwards Arrow
Black rightwards arrowhead
Leftwards white arrow
Rightwards white arrow
Left Angle Quotation Mark « « «
Right Angle Quotation Mark » » »
Right Black Pointer
Left Black Pointer
Up Black Pointer
Down Black Pointer
Right White Pointer
Left White Pointer
Up White Pointer
Down white pointer
Bow arrow

Special characters in unicode

Unicode currency

Weather icons

Name  Preview  Code
Degree ° ° °
Small sun
Big sun
Cloud
Umbrella
Snowflake 1
Snowflake 2
Snowflake 3

Unicode pointers

Name  Preview  Code
Pointer Left Black
Pointer Right Black
Pointer Left White
Pointer Up White
Pointer Right White
Pointer Down White

Zodiac signs in Unicode

Name  Preview  Code
Aries
Taurus
Gemini
Cancer
Leo
Virgo
Libra
Scorpio
Sagittarius
Capricorn
Aquarius
Pisces

Unicode card characters

Name  Preview  Code
Clubs Black
Hearts Black
Diamonds black
Spades black
Clubs White
Hearts White
Diamonds white
Spades white

Chess pieces in Unicode

Name  Preview  Code
King white
Queen white
Rook white
Bishop White
Knight white
Pawn white
King black
Queen black
Rook black
Bishop Black
Knight black
Pawn black

Game of dice

Name  Preview  Code
Dice roll one
Dice roll two
Dice roll three
Dice roll four
Dice roll five
Dice roll six

Unicode math symbols

Name  Preview  Code
Infinity
Plus minus ± ± ±
Less-Than Or Equal To
Greater-Than Or Equal To
Not Equal To
Division ÷ ÷ ÷
Multiplication x × × ×
Heavy Multiplication x
Superscript one ¹ ¹ ¹
Superscript Two ² ² ²
Superscript three ³ ³ ³
Circled Plus
Circled Multiplication
Logical AND
Logical OR
Delta
Pi
Sigma (SUM)
Omega Ω Ω Ω
Empty Set
Angle
Parallel
Perpendicular
Almost Equal To
Triangle
Circle
Square

Fractions

Name  Preview  Code
One Quarter (1/4) ¼ ¼ ¼
One Half (1/2) ½ ½ ½
Three Quarters (3/4) ¾ ¾ ¾
One Third (1/3)
Two Thirds (2/3)
One Eighth (1/8)
Three Eighths (3/8)
Five Eighths (5/8)
Seven Eighths (7/8)

Roman numerals in Unicode

Name  Preview  Code
Roman Numeral One
Roman Numeral Two
Roman Numeral Three
Roman Numeral Four
Roman Numeral Five
Roman Numeral Six
Roman Numeral Seven
Roman Numeral Eight
Roman Numeral Nine
Roman Numeral Ten
Roman Numeral Eleven
Roman Numeral Twelve

These symbols render somewhat differently on different operating systems because different font families are used. Additionally, iOS and Android replace some Unicode characters with emoji, so be sure to check the characters you add to make sure that doesn't happen and that the icons show as intended.

Unicode is a character encoding standard. Simply put, it is a table that maps text characters (digits, letters, punctuation marks) to binary codes. A computer understands only sequences of zeros and ones. For it to know what exactly to display on the screen, each character must be assigned a unique number. In the 1980s, characters were encoded in one byte, that is, in eight bits (each bit is 0 or 1). As a result, one table (also called an encoding or character set) could hold only 256 characters, which may not be enough even for a single language. Many different encodings therefore appeared, and confusing them often meant that strange garbled characters ("krakozyabry", also known as mojibake) appeared on the screen instead of readable text. A single standard was needed, and Unicode became that standard. The most widely used encoding is UTF-8 (Unicode Transformation Format), which uses 1 to 4 bytes to represent a character.
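
As a small illustration of the "1 to 4 bytes" claim, the sketch below encodes one sample character from each length class with Python's built-in UTF-8 codec. The sample characters and the helper name `utf8_length` are chosen for this example only.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes that UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


samples = {
    "A": "Latin capital letter A",
    "\u044f": "Cyrillic small letter ya",
    "\u20ac": "euro sign",
    "\U0001F600": "grinning face emoji",
}

for ch, description in samples.items():
    # Show the code point, the byte count and the raw bytes in hex.
    print(f"U+{ord(ch):04X} {description}: "
          f"{utf8_length(ch)} byte(s), {ch.encode('utf-8').hex(' ')}")
```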

Symbols

Characters in Unicode tables are numbered in hexadecimal. For example, the Cyrillic capital letter М is designated U+041C, which means that it sits at the intersection of row 041 and column C. A character can simply be copied and then pasted elsewhere. Rather than scrolling through an extremely long list, use the search. On a character's page you will see its Unicode number and how it is drawn in different fonts. You can also paste the character itself into the search bar, even if it is rendered as an empty square, if only to find out what it was. The site also offers special (and random) sets of similar icons, collected from different sections for ease of use.
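
The same lookup can be done locally. This is a minimal sketch that uses Python's standard `unicodedata` module to turn a pasted character into its U+ number and official name; the function name `describe` is an assumption made for this example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the U+ notation and the Unicode name of one character."""
    name = unicodedata.name(ch, "<no name in this Python's Unicode database>")
    return f"U+{ord(ch):04X} {name}"


print(describe("\u041c"))   # the Cyrillic capital letter from the text
print(describe(chr(0x41C)))  # the same character, built from its number
```

Going the other way, `chr(0x41C)` converts the number back into the character, which is handy when a font shows only an empty square.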

The Unicode standard is international. It includes characters from almost all of the world's scripts, including those that are no longer in use: Egyptian hieroglyphs, Germanic runes, Mayan writing, cuneiform, and the alphabets of ancient states. It also covers units of weight and measure, musical notation, and mathematical symbols.

The Unicode Consortium itself does not invent new characters. Symbols that have found use in society are added to the tables. For example, the ruble sign was actively used for six years before being added to Unicode. Emoji pictograms were likewise widely used in Japan before they were included in the encoding. Trademarks and company logos, however, are not added as a matter of principle, even ones as common as the Apple logo or the Windows flag. Today, in version 8.0, about 120 thousand characters are encoded.

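The standard consists of two main parts: the universal character set (UCS) and the family of encodings (UTF). The universal character set maps each character to a code, an element of the code space that is a non-negative integer. The family of encodings defines the machine representation of a sequence of UCS codes.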

Unicode codes are divided into several areas. The area with codes from U+0000 to U+007F contains the ASCII characters with the corresponding codes. Next come the areas for the characters of various scripts, punctuation, and technical symbols. Some of the codes are reserved for future use. Cyrillic characters are allocated the areas with codes from U+0400 to U+052F, from U+2DE0 to U+2DFF, and from U+A640 to U+A69F (see Cyrillic in Unicode).

Prerequisites for the creation and development of Unicode

Since a number of computer systems (for example, Windows NT) already used fixed 16-bit characters as the default encoding, it was decided to encode all the most important characters within the first 65,536 positions (the so-called Basic Multilingual Plane, BMP). The rest of the space is used for supplementary characters: writing systems of extinct languages, very rarely used Chinese characters, and mathematical and musical symbols.

For compatibility with old 16-bit systems, UTF-16 was invented. In it, the first 65,536 positions, except for the interval U+D800...U+DFFF, are represented directly as 16-bit numbers, and the rest are represented as "surrogate pairs" (the first element of the pair comes from the range U+D800...U+DBFF, the second from the range U+DC00...U+DFFF). The surrogate pairs use a part of the code space (2048 positions) that was previously reserved for private-use characters.

Since UTF-16 can represent only 2^20 + 2^16 − 2048 (1,112,064) characters, this number was chosen as the final size of the Unicode code space.
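
The surrogate-pair scheme and the resulting size of the code space can be sketched in a few lines. This is an illustrative implementation written from the ranges given above, not code taken from the standard; the function name `to_surrogate_pair` is made up for the example.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000...U+10FFFF use surrogates")
    offset = cp - 0x10000             # 20 significant bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits -> U+D800...U+DBFF
    low = 0xDC00 + (offset & 0x3FF)   # low 10 bits -> U+DC00...U+DFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# 1024 high surrogates times 1024 low surrogates give 2^20 supplementary
# code points; the BMP contributes 2^16 minus the 2048 surrogate positions.
print(2**20 + 2**16 - 2048)
```

The result can be cross-checked against Python's own codec: `"\U0001F600".encode("utf-16-be")` yields the same two 16-bit units.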

Although the Unicode code area was extended beyond 2^16 as early as version 2.0, the first characters in the "upper" area were placed only in version 3.1.

The role of this encoding on the web is constantly growing: at the beginning of 2010, the share of websites using Unicode was about 50%.

Unicode versions

The Unicode character table keeps changing and growing, and new versions are released as this work continues (the original Unicode system included only Plane 0, the two-byte codes). New ISO documents are released alongside them. The Unicode versions include the following:

  • 1.1 (conforms to ISO/IEC 10646-1:1993), 1991-1995 standard.
  • 2.0, 2.1 (same standard ISO/IEC 10646-1:1993 plus additions: "Amendments" 1 to 7 and "Technical Corrigenda" 1 and 2), 1996 standard.
  • 3.0 (ISO/IEC 10646-1:2000), 2000 standard.
  • 3.1 (ISO/IEC 10646-1:2000 and ISO/IEC 10646-2:2001), 2001 standard.
  • 3.2, 2002 standard.
  • 4.0, 2003 standard.
  • 4.0.1, 2004 standard.
  • 4.1, 2005 standard.
  • 5.0, 2006 standard.
  • 5.1, 2008 standard.
  • 5.2, 2009 standard.
  • 6.0, 2010 standard.
  • 6.1, 2012 standard.
  • 6.2, 2012 standard.

Code space

Although the UTF-8 and UTF-32 forms could encode up to 2^31 (2,147,483,648) code points, it was decided to use only 1,112,064 for compatibility with UTF-16. Even this is more than enough: today (in version 6.0) slightly fewer than 110,000 code points (109,242 graphic and 273 other characters) are in use.

The code space is split into 17 planes of 2^16 (65,536) code points each. Plane 0 is called the basic plane; it contains the characters of the most common scripts. Plane 1 is used mainly for historical scripts, plane 2 for rarely used CJK characters, and plane 3 is reserved for archaic Chinese characters. Planes 15 and 16 are reserved for private use.

Unicode characters are written in the form "U+xxxx" (for codes 0...FFFF), "U+xxxxx" (for codes 10000...FFFFF), or "U+xxxxxx" (for codes 100000...10FFFF), where each x is a hexadecimal digit. For example, the character "я" (U+044F) has the code 044F in hexadecimal, which is 1103 in decimal.
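
Both the plane number and the U+ notation follow directly from the numeric value of a code point. The sketch below shows one way to compute them; `plane_of` and `u_notation` are names invented for this illustration.

```python
def plane_of(cp: int) -> int:
    """Plane number (0...16): each plane holds 2^16 code points."""
    return cp >> 16


def u_notation(cp: int) -> str:
    """Format a code point as U+ followed by at least four hex digits."""
    return f"U+{cp:04X}"


for cp in (0x044F, 0x1F600, 0x10FFFF):
    print(u_notation(cp), "plane", plane_of(cp), "decimal", cp)

# The example from the text: hexadecimal 044F is decimal 1103.
print(int("044F", 16))
```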

Coding system

The universal coding system (Unicode) is a set of graphic symbols and a way of encoding them for computer processing of text data.

Graphic symbols are symbols that have a visible image. They are contrasted with control and formatting characters.

Graphic symbols include the following groups:

  • letters contained in at least one of the supported alphabets;
  • digits;
  • punctuation marks;
  • special signs (mathematical, technical, ideograms, etc.);
  • separators.

Unicode is a system for the linear representation of text. Characters that have additional superscripts or subscripts can be represented as a sequence of codes built according to certain rules (composite character) or as a single character (monolithic version, precomposed character).

Modifying characters

Representation of the character "Й" (U+0419) as the base character "И" (U+0418) followed by the combining character "◌̆" (U+0306)

Graphic characters in Unicode are divided into spacing and non-spacing (zero-width) characters. Non-spacing characters take up no room of their own in the line when displayed. These include, in particular, accent marks and other diacritical marks. Both kinds of characters have their own codes. Spacing characters are also called base characters, and non-spacing ones are called combining characters; the latter cannot occur on their own. For example, the character "á" can be represented as a sequence of the base character "a" (U+0061) and the combining character "́" (U+0301), or as a single precomposed character "á" (U+00E1).
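
The two representations of "á" look identical on screen but are different strings. A minimal sketch using Python's standard `unicodedata` module makes the difference visible; the variable names are arbitrary.

```python
import unicodedata

decomposed = "a\u0301"   # base character + combining acute accent
precomposed = "\u00e1"   # single precomposed character

# Different code point sequences, so a plain comparison fails.
print(decomposed == precomposed)
print(len(decomposed), len(precomposed))

# A non-zero combining class marks a combining character;
# base characters have combining class 0.
print(unicodedata.combining("\u0301"), unicodedata.combining("a"))

# Composing the sequence yields the precomposed character.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

This is why strings that come from different sources are usually brought to one normalization form before they are compared.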

A special kind of combining character is the variation selector. Variation selectors apply only to those characters for which such variants are defined. In version 5.0, variants are defined for a number of mathematical symbols, for characters of the traditional Mongolian script, and for characters of the Mongolian square script (Phags-pa).

Normalization forms

Since the same characters can be represented by different codes, which sometimes complicates processing, there are normalization processes designed to bring text to a certain standard form.

The Unicode standard defines 4 forms of text normalization:

  • Normalization Form D (NFD): canonical decomposition. When text is converted to this form, all precomposed characters are recursively replaced by sequences of their components, in accordance with the decomposition tables.
  • Normalization Form C (NFC): canonical decomposition followed by canonical composition. The text is first reduced to form D, and then canonical composition is performed: the text is processed from beginning to end according to the following rules:
    • A character S is a starter if it has a combining class of zero in the Unicode character database.
    • In any sequence of characters beginning with a starter S, a character C is blocked from S if and only if there is some character B between S and C that is either a starter or has the same or a higher combining class than C. This rule applies only to strings that have gone through canonical decomposition.
    • A primary composite is a character that has a canonical decomposition in the Unicode character database (or a canonical decomposition for Hangul) and is not included in the exclusion list.
    • A character X can be primary combined with a character Y if and only if there is a primary composite Z that is canonically equivalent to the sequence <X, Y>.
    • If the next character C is not blocked from the last encountered starter L and can be primary combined with it, then L is replaced by the composite L-C, and C is removed.
  • Normalization Form KD (NFKD): compatibility decomposition. When text is converted to this form, all composite characters are replaced using both the canonical and the compatibility decomposition mappings of Unicode, after which the result is put into canonical order.
  • Normalization Form KC (NFKC): compatibility decomposition followed by canonical composition.

The terms "composition" and "decomposition" mean, respectively, the connection or decomposition of symbols into their constituent parts.

Examples

Source text | NFD | NFC | NFKD | NFKC
Français | Franc\u0327ais | Fran\xe7ais | Franc\u0327ais | Fran\xe7ais
А, Ё, Й | \u0410, \u0415\u0308, \u0418\u0306 | \u0410, \u0401, \u0419 | \u0410, \u0415\u0308, \u0418\u0306 | \u0410, \u0401, \u0419
が | \u304b\u3099 | \u304c | \u304b\u3099 | \u304c
Henry IV | Henry IV | Henry IV | Henry IV | Henry IV
Henry Ⅳ | Henry \u2163 | Henry \u2163 | Henry IV | Henry IV
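
The rows of the examples table can be reproduced with Python's standard `unicodedata.normalize` function. The sketch below is only a convenience wrapper; the name `all_forms` is invented for this example.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def all_forms(text: str) -> dict[str, str]:
    """Return the text in each of the four normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


for source in ("Fran\u00e7ais", "\u0401", "\u304c", "Henry \u2163"):
    for form, result in all_forms(source).items():
        # Print escaped code points so that combining marks stay visible.
        escaped = result.encode("unicode_escape").decode("ascii")
        print(f"{form:5} {escaped}")
    print()
```

Note how the Roman numeral U+2163 survives the canonical forms and is replaced by the two letters "IV" only in the compatibility forms.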

Bidirectional text

The Unicode standard supports languages written left to right (LTR) as well as languages written right to left (RTL), such as Arabic and Hebrew. In both cases, the characters are stored in their "natural" order; displaying them in the proper writing direction is the application's job.

In addition, Unicode supports mixed texts that combine fragments with different writing directions. This feature is called bidirectionality (bidirectional text, BiDi). Some simplified text processors (for example, in cell phones) may support Unicode without supporting bidirectionality. All Unicode characters are divided into several categories: written from left to right, written from right to left, and written in any direction. Characters of the last category (mainly punctuation marks) take the direction of the surrounding text when displayed.

Featured Symbols

Unicode includes virtually all modern scripts.

For academic purposes, many historical scripts have been added, including: runes, ancient Greek, Egyptian hieroglyphs, cuneiform, Mayan writing, Etruscan alphabet.

Unicode provides a wide range of mathematical and musical symbols and pictograms.

However, Unicode as a matter of principle does not include company and product logos, although they do occur in fonts (for example, the Apple logo in the MacRoman encoding (0xF0) or the Windows logo in the Wingdings font (0xFF)). In Unicode fonts, logos should be placed only in the private use area.

ISO/IEC 10646

The Unicode Consortium works closely with the working group ISO/IEC JTC1/SC2/WG2, which develops the international standard 10646 (ISO/IEC 10646). The Unicode standard and ISO/IEC 10646 are kept synchronized, although each standard uses its own terminology and documentation system.

Cooperation between the Unicode Consortium and the International Organization for Standardization (ISO) began in 1991. In 1993, ISO issued the DIS 10646.1 standard. To synchronize with it, the Consortium approved version 1.1 of the Unicode standard, which added characters from DIS 10646.1. As a result, the values of the encoded characters in Unicode 1.1 and DIS 10646.1 are exactly the same.

Cooperation between the two organizations continued afterwards. In 2000, the Unicode 3.0 standard was synchronized with ISO/IEC 10646-1:2000. The upcoming third version of ISO/IEC 10646 will be synchronized with Unicode 4.0. Perhaps these specifications will even be published as a single standard.

Similar to the UTF-16 and UTF-32 formats in the Unicode standard, ISO/IEC 10646 also has two main character encoding forms: UCS-2 (2 bytes per character, similar to UTF-16) and UCS-4 (4 bytes per character, similar to UTF-32). UCS stands for universal multiple-octet (multibyte) coded character set. UCS-2 can be considered a subset of UTF-16 (UTF-16 without surrogate pairs), and UCS-4 is a synonym for UTF-32.

Presentation methods

Unicode has several encoding forms (Unicode transformation formats, UTF): UTF-8, UTF-16 (UTF-16BE, UTF-16LE), and UTF-32 (UTF-32BE, UTF-32LE). The UTF-7 form was also developed for transmission over seven-bit channels, but because of its incompatibility with ASCII it did not become widespread and was not included in the standard. On April 1, 2005, two joke encodings were proposed: UTF-9 and UTF-18 (RFC 4042).

Unicode | UTF-8
0x00000000 - 0x0000007F | 0xxxxxxx
0x00000080 - 0x000007FF | 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Theoretically possible, but also not included in the standard:

0x00200000 - 0x03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Although UTF-8 allows you to specify the same character in several ways, only the shortest one is correct. The rest of the forms should be rejected for security reasons.
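
The bit patterns in the table translate into a short encoder. The following is an illustrative sketch restricted to the valid Unicode range (up to U+10FFFF, excluding surrogates), not a replacement for the built-in codec; the function name `encode_utf8` is chosen for this example.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point using the shortest UTF-8 form."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:          # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0x44F, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ')}")

# An overlong two-byte form of "/" (U+002F) is rejected by Python's decoder.
try:
    b"\xc0\xaf".decode("utf-8")
except UnicodeDecodeError:
    print("overlong sequence rejected")
```

Because each branch is chosen by the size of the code point, the encoder can only ever produce the shortest form; the overlong forms exist only as byte sequences that a decoder has to refuse.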

Byte order

In a UTF-16 data stream, the high byte can be written either before the low byte (UTF-16 big-endian) or after it (UTF-16 little-endian). Similarly, there are two variants of the four-byte encoding: UTF-32BE and UTF-32LE.

To identify the Unicode representation format, a signature is written at the beginning of a text file: the character U+FEFF (zero-width no-break space), also called the byte order mark (BOM). This makes it possible to distinguish between UTF-16LE and UTF-16BE, since U+FFFE is not a valid character. The signature is also sometimes used to mark the UTF-8 format, although the notion of byte order does not apply to it. Files that follow this convention begin with these byte sequences:

UTF-8 | EF BB BF
UTF-16BE | FE FF
UTF-16LE | FF FE
UTF-32BE | 00 00 FE FF
UTF-32LE | FF FE 00 00

Unfortunately, this method does not reliably distinguish between UTF-16LE and UTF-32LE, since the character U+0000 is allowed by Unicode (although real texts rarely start with it), so the bytes FF FE 00 00 could also be a UTF-16LE signature followed by U+0000.

According to unicode.org, files in UTF-16 and UTF-32 encodings that do not contain a BOM must be in big-endian byte order.

Unicode and traditional encodings

The introduction of Unicode changed the approach to traditional 8-bit encodings. If earlier the encoding was specified by the font, now it is specified by the correspondence table between this encoding and Unicode. In fact, 8-bit encodings have become a representation of a subset of Unicode. This made it much easier to create programs that need to work with many different encodings: now, to add support for one more encoding, you just need to add another Unicode lookup table.

In addition, many data formats allow you to insert any Unicode character, even if the document is written in an old 8-bit encoding. For example, in HTML you can use character references (ampersand codes).

Implementation

Most modern operating systems provide some degree of Unicode support.

In operating systems of the Windows NT family, the two-byte UTF-16LE encoding is used for the internal representation of file names and other system strings. System calls that take string parameters are available in single-byte and two-byte variants.

If you only need to enter a few special characters or symbols, you can use the Character Map or keyboard shortcuts. For a list of ASCII characters, see the tables below, or see Inserting national alphabets using keyboard shortcuts.

Inserting ASCII Characters

To insert an ASCII character, press and hold the ALT key and type the character code on the numeric keypad. For example, to insert the degree sign (°), hold down the ALT key and type 0176 on the numeric keypad.

Inserting Unicode Characters

Important: Some Microsoft Office programs, such as PowerPoint and InfoPath, cannot convert Unicode character codes. If you need a Unicode character and are using one of the programs that do not support this, use the Character Map to enter the characters you need.

Notes:

    Numbers should be typed on the numeric keypad, not on the alphanumeric keys. If the NUM LOCK key must be pressed to enter numbers on the numeric keypad, make sure it is on.

    If you're having trouble converting a Unicode code to a character, type the code, select it, and then press Alt + X.

    In Microsoft Windows XP and later versions, the universal Unicode font is installed automatically. In Microsoft Windows 2000, the Unicode font must be installed manually.

    On Microsoft Windows 2000

    1. Quit all programs.

    2. Double-click the Add/Remove Programs icon in Control Panel.

    3. Do one of the following:

      • If the Office application was installed as part of Microsoft Office, select Microsoft Office in the Currently installed programs box, and then click Change.

      • If the Office application was installed separately, click its name in the Currently installed programs list, and then click Change.

    4. In the Microsoft Office 2003 Setup dialog box, select Add or Remove Features, and then click Next.

    5. Select Choose advanced customization of applications, and then click Next.

    6. Expand the Office Shared Features list.

    7. Expand the International Support list.

    8. Click the icon next to Universal Font, and select the installation option you want.

Using the Character Map

Character Map is a program built into Microsoft Windows that lets you view the characters available in the selected font. Using Character Map, you can copy individual characters or groups of characters to the clipboard and then paste them into any program that supports them.

Click the Start button, and then select Programs, Accessories, System Tools, and Character Map.

To select a character in Character Map, click it, click Select, right-click in the document where you want to add the character, and then choose Paste.

Common character codes

For more characters, see the Character Map installed on your computer, ASCII character codes, or the Unicode character code charts by script.

Currency symbols

Legal Symbols

Math symbols

Fractions

Punctuation and dialect symbols

Form symbols

Common diacritical codes

For a complete list of glyphs and associated character codes, see.

Non-printable ASCII control characters

The numbers 0–31 in the ASCII table are assigned to control characters used to control some peripheral devices such as printers. For example, the number 12 represents the form feed (new page) function. This command advances the printer to the top of the next page.

Non-printable ASCII control character table

Decimal | Sign
0 | null
1 | start of heading
2 | start of text
3 | end of text
4 | end of transmission
5 | enquiry
6 | acknowledge
7 | bell
8 | backspace
9 | horizontal tab
10 | line feed / new line
11 | vertical tab
12 | form feed / new page
13 | carriage return
14 | shift out
15 | shift in
16 | data link escape
17 | device control 1
18 | device control 2
19 | device control 3
20 | device control 4
21 | negative acknowledge
22 | synchronous idle
23 | end of transmission block
24 | cancel
25 | end of medium
26 | substitute
27 | escape
28 | file separator
29 | group separator
30 | record separator
31 | unit separator
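
A few of these control codes still have escape sequences in most programming languages. The short sketch below, using only Python's standard library, checks that the first 32 code points are classified as control characters and shows the decimal values of the familiar escapes.

```python
import unicodedata

# Every code point from 0 to 31 has the general category "Cc" (control).
control_chars = [chr(i) for i in range(32)]
print(all(unicodedata.category(c) == "Cc" for c in control_chars))

# Escape sequences that Python provides for some of these characters.
escapes = {"\\a": "\a", "\\t": "\t", "\\n": "\n",
           "\\v": "\v", "\\f": "\f", "\\r": "\r"}
for written, ch in escapes.items():
    print(written, ord(ch))
```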

Unicode is an international character encoding standard that allows text to be displayed consistently on any computer in the world, regardless of the system language it uses.

The basics

To understand what the Unicode character table is for, let's first look at how text gets displayed on a screen. A computer, as we know, processes all information in digital form, yet it must display that information graphically for people to read. So, for us to be able to read this text, at least two tasks must be solved:

  • Digitize printable characters.
  • Give the operating system a way to match those digital codes to glyphs, in other words, to find the right letters.

First encodings

The American ASCII is considered the ancestor of all encodings. It described the Latin alphabet used in English, along with punctuation marks and Arabic numerals. The 128 characters it defined became the basis for subsequent developments; even the modern Unicode character table uses them. Since then, the letters of the Latin alphabet have occupied the first positions in any encoding.

A single byte can hold 256 values, but ASCII occupied only the first 128, so the remaining 128 came to be used all over the world for national standards. For example, in Russia, CP866 and KOI8-R were created on this basis. Such variations were called extended versions of ASCII.

Code pages and garbled text ("krakozyabry")

Further development of technology and the emergence of graphical interfaces led to the creation of the so-called ANSI encoding, named after the American National Standards Institute. Russian users, especially experienced ones, know its variant as Windows-1251. It was the first to use the concept of a "code page". Code pages, which contained the characters of national alphabets other than Latin, were what established "mutual understanding" between computers used in different countries.

However, having a large number of different encodings for one language started to cause problems. The so-called krakozyabry (garbled characters, or mojibake) appeared. They arose from a mismatch between the code page in which a text was created and the code page used by default on the end user's computer.

The Cyrillic encodings CP866 and KOI8-R mentioned above are a good example. Their letters differed in code positions and in the principle of placement. In the first, they were arranged in alphabetical order; in the second, in an order that follows the Latin transliteration of the letters rather than the Russian alphabet. You can imagine what users saw when they tried to open such a text without the required code page, or when the computer misinterpreted it.
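
The effect is easy to reproduce. The sketch below writes a Russian word in KOI8-R and then reads the same bytes as CP866, using the codec names that ship with Python (`koi8_r` and `cp866`); the sample word is arbitrary.

```python
original = "\u041f\u0440\u0438\u0432\u0435\u0442"  # the Russian word for "hello"

raw = original.encode("koi8_r")   # bytes as written by the author
garbled = raw.decode("cp866")     # the same bytes read with the wrong table

print(original)
print(garbled)                    # unreadable mix of unrelated characters

# Nothing is lost: reversing the wrong step restores the text.
print(garbled.encode("cp866").decode("koi8_r") == original)
```

The bytes themselves never change; only the table used to interpret them does, which is why the damage is reversible as long as the garbled text has not been re-saved with characters dropped.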

Creation of Unicode

The spread of the Internet and related technologies such as email meant that garbled text eventually stopped being acceptable to anyone. Leading IT companies formed the Unicode Consortium, and the first version of its standard appeared in 1991. The fixed-width four-byte form, known today as UTF-32, can address over a billion values, far more than Unicode actually uses. It was a crucial step toward text that reads the same everywhere.

However, UTF-32 was not widely adopted for storing and transmitting text. The main reason was the redundancy of the stored information: for countries that use the Latin alphabet, text takes four times the space it would take in an extended ASCII table.

Development of Unicode

The two-byte form, UTF-16, addresses this problem. It uses half as many bits per code unit, but a single 16-bit unit can represent only 65,536 values instead of billions. Nevertheless, it was so successful that the Consortium designated these 65,536 positions as the basic storage space for Unicode characters (the Basic Multilingual Plane); characters outside it are encoded as surrogate pairs.

Despite this success, UTF-16 did not suit everyone, since the amount of stored and transmitted information for Latin text was still doubled. The universal solution was UTF-8, a variable-length Unicode encoding. This can be called a breakthrough in this area.


Thus, with the introduction of the last two encodings, Unicode solved the problem of a single code space for all scripts in use today.
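
The space trade-off between the three encodings can be measured directly. This sketch uses Python's built-in codecs; the `-le` variants are used so that no byte order mark is counted, and the helper name `encoded_sizes` and the sample strings are chosen for this illustration.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text: str) -> dict[str, int]:
    """Size in bytes of the text in each encoding (no BOM included)."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


samples = {
    "Latin": "Unicode",
    "Cyrillic": "\u041f\u0440\u0438\u0432\u0435\u0442",
    "Georgian": "\u10d0\u10d1\u10d2",
}

for script, text in samples.items():
    print(script, len(text), "characters:", encoded_sizes(text))
```

For Latin text, UTF-8 is as compact as ASCII, UTF-16 doubles the size and UTF-32 quadruples it; for Cyrillic, UTF-8 and UTF-16 come out equal, and for Georgian UTF-8 is the larger of the two.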

Unicode for Russian

Thanks to the variable length of the code, Latin letters are encoded in UTF-8 exactly as in its ancestor ASCII, that is, in one byte. For other alphabets the picture is different: the letters of the Georgian alphabet take three bytes each, and the letters of the Cyrillic alphabet take two. All this holds within the UTF-8 encoding of Unicode. The Cyrillic alphabet, which includes Russian, occupies 448 positions in the total code space, divided into five blocks.

These five blocks include the basic Cyrillic and Church Slavonic alphabets, as well as additional letters of other languages that use the Cyrillic alphabet. A number of positions are allocated to old forms of Cyrillic letters, and 22 of the positions are still unassigned.

Current version of Unicode

Having solved its primary task, standardizing character encodings and creating a single code space for them, the Consortium did not stop its work. Unicode is constantly evolving and expanding. At the time of writing, the latest version of the standard, 9.0, was released in 2016. It added six scripts and expanded the list of standardized emoji.

To simplify research, even so-called dead languages are added to Unicode. They are called that because no people remain for whom they would be a native language. This group also includes languages that have come down to us only in the form of written monuments.

In principle, anyone can apply to have characters added to a new Unicode specification. For that, however, you have to fill in a fair amount of paperwork and spend a lot of time. A real example is the story of the programmer Terence Eden. In 2013, he applied for the inclusion of the symbols used on computer power control buttons. They have been used in technical documentation since the mid-1970s, but until the 9.0 specification they were not part of Unicode.

Character table

Every computer, regardless of the operating system, uses a Unicode character table. How do you use these tables, where can you find them, and why might they be useful to an ordinary user?

In Windows, the Character Map is located in the System Tools section of the menu. In the Linux family of operating systems, it can usually be found in the Accessories subsection, and in macOS, in the keyboard preferences. The main purpose of this table is to enter characters that are not on the keyboard into text documents.

Such tables have the widest range of uses: from entering technical symbols and national currency signs to writing instructions for the practical use of Tarot cards.

Conclusion

Unicode is used everywhere and entered our lives along with the development of the Internet and mobile technologies. Thanks to it, international communication has become significantly simpler. We can say that the introduction of Unicode is a telling, yet from the outside completely invisible, example of technology being used for the common good of all mankind.