
The Unicode standard assigns a code to every character. What is Unicode?

Unicode is a coding system that assigns a unique code to any character, regardless of platform, regardless of program, regardless of language.

In fact, the very title of this note contains an inaccuracy, or rather an ambiguity. There is still no unanimity on how to transliterate the English word Unicode into Russian letters, and two competing spellings are in use. We will use the more common of the two.

Sometimes, when you open a document or view a web page, you see rows of garbage characters instead of Polish letters. Such errors are caused by an incorrect character encoding. The text you see on the screen is stored by the computer as zeros and ones grouped into eights: a single zero or one is a bit, and a group of eight bits is a byte. Electronic devices store all information, including text, in bytes.

But how can these bits represent text? Writing numbers with only zeros and ones is what we call binary notation, whereas in everyday life we have learned to use decimal numbers. The trick is that every character is assigned a number, so a sentence such as "Alla has a cat" is, to the computer, just a sequence of numbers.

Unicode is not just a multibyte encoding for representing many characters, as many people think, although such a definition can be considered correct to a certain extent. Officially, the definition is this: Unicode is a coding system that assigns a unique code to any character, regardless of platform, regardless of the program, regardless of the language.

Note that to a computer, uppercase and lowercase letters are separate characters, and the space is a character too, even though it is invisible. Individual digits are treated the same way: the character "1" is not assigned the number 1, but its own code (49 in ASCII).

But how does the computer know which character a given number corresponds to? The answer is a character encoding: essentially, a table listing all the characters and the numbers assigned to them. Over time this set was no longer enough: computers acquired new capabilities and became popular in countries whose alphabets contain special letters.

The Unicode standard is split into two parts. The first is a set of numbers corresponding to each character in each supported alphabet. It is called the Universal Character Set, or simply UCS. The numbers in the UCS code space are all non-negative integers. These numbers, called code positions, are designated U+0000, U+0001, U+0002, and so on. The code space is not homogeneous, however, but is divided into several semantic areas. Codes from U+0000 to U+007F are the ASCII characters, followed by characters from various national alphabets, punctuation, and technical characters such as carriage return and line feed. The second part of the standard consists of the encodings themselves, which provide a bitwise representation of each code in the text.
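As an illustration (Python is not part of the original article), the built-in ord function reports exactly these code positions for any character:

```python
# Each character has a single non-negative code position,
# conventionally written as "U+" followed by hexadecimal digits.
for ch in ("A", "€", "Ж"):
    print(f"{ch!r} -> U+{ord(ch):04X}")
# prints 'A' -> U+0041, '€' -> U+20AC, 'Ж' -> U+0416
```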

Although an 8-bit encoding offers more than two hundred and fifty slots, this number is still not enough to cover all languages, so each language family received its own variant of the encoding, called a code page. All code pages share a common origin, so they all include punctuation marks, uppercase and lowercase English letters, and digits. The remaining characters, however, vary from code page to code page.

But even the introduction of code pages did not solve all the problems. The first is interoperability between standards: although the first 127 characters are the same in every code page, the remaining characters are ordered differently in each one. A word written under one code page turns into gibberish when the file is opened on a system using another.

All Unicode characters are divided into two types: base (spacing) characters and zero-width (combining) characters. The base characters are the ordinary letters with which this note is typed. The combining ones are marks placed over letters: accents, dots, "caps," and so on. Most letters carrying such marks are represented, uniformly across alphabets, as a sequence of a base character and combining characters. The Cyrillic "ё" and "й", however, are also available as separate precomposed characters.
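This split between precomposed letters and base-plus-combining sequences can be seen with Python's standard unicodedata module (an illustrative sketch, not from the original article):

```python
import unicodedata

precomposed = "\u0439"  # single precomposed character й
decomposed = unicodedata.normalize("NFD", precomposed)

# NFD splits the letter into a base character plus a combining mark.
print([f"U+{ord(c):04X}" for c in decomposed])
# prints ['U+0438', 'U+0306']  (base и plus combining breve)
```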

Another problem is that 256 characters are still not enough to write documents freely. In most cases they suffice, but trouble begins when you need to refer to Mr. Möller, Pütz, or Strassmann in a master's thesis, insert Cyrillic, or enter Greek letters into a complex mathematical formula.

A new encoding method was needed, one that would include all possible characters used in all the languages of the world; this would also make such errors less common. But another problem arose: so many characters could no longer fit into 1 byte. One byte is composed of eight ones or zeros, which can be arranged in only so many different ways — exactly as many characters as a code page holds. To increase the number of characters an encoding supports, you have to increase the number of bytes used to store a letter.
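The arithmetic behind this is simple; the following sketch (illustrative, not from the original article) shows how the number of representable characters grows with each additional byte:

```python
# One byte holds 8 bits, so n bytes can distinguish 2**(8*n) values.
for nbytes in (1, 2, 3, 4):
    print(nbytes, "byte(s):", 2 ** (8 * nbytes), "possible characters")
# prints 256, 65536, 16777216, 4294967296
```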

The most common practical implementation of Unicode is UTF-8. This standard provides good compatibility with old ASCII texts because the characters of the English alphabet and other common characters (that is, characters with ASCII codes from 0 to 127) are written in one byte. The remaining characters are written in a larger number of bytes — from two to four (the original specification allowed up to six).
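The variable byte length of UTF-8 is easy to observe directly (an illustrative Python sketch, not part of the original article):

```python
# ASCII characters stay 1 byte in UTF-8; other characters take 2-4 bytes.
for ch in ("A", "é", "Ж", "€", "𝄞"):
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s)")
# prints 1, 2, 2, 3 and 4 bytes respectively
```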

Recently, the Turkish lira sign was added to Unicode and assigned its own code. If every letter, even those from the beginning of the table, occupied several bytes, the space required to save a file to disk would grow considerably.

So the developers decided that characters need not occupy a constant number of bytes: common characters are stored in fewer bytes, rarer ones in more. For text to display correctly on a computer, the declared encoding must match the actual one. If a page displays incorrectly, try switching the character encoding in your viewer and see whether all the letters appear; if you need to re-encode a file, a free editor more powerful than Notepad will do the job.

Unicode and related standards are developed by the Unicode Consortium. As stated on the official website of this organization, www.unicode.org, "The Unicode Consortium is a non-profit organization founded to develop, extend, and promote the Unicode standard, which defines the representation of textual information in modern software products and standards. The members of the Consortium are a large number of corporations and organizations working in information processing and the computer industry. The Consortium is financed exclusively through the membership fees of its members. Membership in the Unicode Consortium is open to any organization or individual who supports the Unicode standard and wants to help spread and implement it." Join us! :)

The description in this article is based on beta 8, which is available only to a small circle of beta testers. As beta testing draws to a close, it is very likely that most of the features and functionality described here will actually appear in the final release. In each thematic block we will cover the most significant changes and news relative to the current version 51.

Working with files and the appearance of the application

The new, sixth version does not change this organization of the workspace, but it significantly expands the capabilities of the two file windows. If you ask how, the answer is "tabs." Each window can now quickly switch between any folders we define: you can create as many tabs as you like and jump between different folders within a single window. Each tab behaves like a separate file panel, so we can work with it in the familiar way.

The Unicode standard, or ISO/IEC 10646, is the result of collaboration between the International Organization for Standardization (ISO) and leading manufacturers of computers and software. The reasons stated on the previous page led them to a fundamentally new formulation of the question: why spend effort developing separate code tables when it is possible to create a single table for all national languages? Such a task seems overly ambitious, but only at first glance. Of the 6,700 living languages, about fifty are the official languages of states, and these use about 25 different scripts: quite manageable numbers for our computer age.

Tabs are displayed only when a window contains at least two of them. When copying large amounts of data, which can take a long time, previous releases already offered background copying. Naturally, this popular functionality has been retained, and the new version adds several new capabilities to it.

If you select it, you will see an extended background-copy dialog where you can pause copying and set the maximum copying speed. In addition, files can be added to this window continuously. This mode addresses the problems of users who copy intensively in multiple windows at once: they can cap the transfer speed and queue files sequentially, significantly reducing the overall disk load.

A preliminary estimate showed that a 16-bit range is sufficient to encode all these scripts, that is, the range from 0000 to FFFF. Each script was allocated its own block in this range, which was gradually filled with the codes of the characters of this script. Today, the coding of all living official scripts can be considered complete.

The folder-creation function can now create a whole directory structure at once. There are further improvements in file handling: Directory Synchronization receives two big improvements, and the multi-rename module adds options for saving settings and a visual preview of the resulting names.

The client can also detect interrupted connections and automatically reconnect and resume the transfer. Other new features include support for new types of proxy servers, automatic replacement of forbidden characters in file names, and the configuration of new connections in the Preferences dialog box.

A well-developed methodology for analyzing and describing writing systems has allowed the Unicode Consortium to move on recently to encoding the rest of the Earth's scripts of any interest: the scripts of dead languages, Chinese ideographs that have dropped out of modern use, artificially created alphabets, and so on. To represent all this wealth, 16-bit coding is no longer enough, and today Unicode uses a 21-bit code space (000000 - 10FFFF), which is split into 17 zones called planes. So far, Unicode assigns the following planes:

With all the options added in earlier versions, the original configuration dialog had become cluttered. The new configuration dialog is divided into more readable categories, with each section given its own page. You will find most of the parameters known from previous versions, along with many new options.

Unicode is a world-class character encoding: the vast majority of special characters in all the languages of the world are already in place, and its official introduction is all one needs to read to get started. A double-byte character set is a multilingual two-byte character code. Most of the symbols used on computers around the world, including technical and special symbols, can be represented as double-byte Unicode characters.

    Plane 0 (codes 000000 - 00FFFF) — the Basic Multilingual Plane (BMP), corresponding to the original Unicode range.

    Plane 1 (codes 010000 - 01FFFF) — the Supplementary Multilingual Plane (SMP), intended for the scripts of dead languages.

    Plane 2 (codes 020000 - 02FFFF) — the Supplementary Ideographic Plane (SIP), intended for ideographs that did not fit into the BMP.

    Since each double-byte character is represented in a fixed 16-bit size, the fixed character width simplifies programming with international character sets. Fixed-width characters typically take up more memory than multibyte characters, but they are faster to process.

    If you inspect a regular string and a Unicode string, the interpreter will report different types for them. Note also that a terminal window is usually configured to display characters from only a limited set of languages, so if you issue a print instruction on a Unicode string, it may not display correctly. To store text in a file, each Unicode character must be encoded as one or more bytes.

    Plane 14 (codes 0E0000 - 0EFFFF) — the Supplementary Special-purpose Plane (SSP), intended for special-purpose characters.

    Plane 15 (codes 0F0000 - 0FFFFF) — a Private-Use Plane, available for private-use characters such as artificially created scripts.

    Plane 16 (codes 100000 - 10FFFF) — a second Private-Use Plane, with the same purpose.

    We have avoided low-level data encoding so far, but understanding a little about bits and bytes will help you figure it out. A bit is the smallest unit of information, limited to two possibilities, which we conventionally write as 0 or 1; computers store bits as electric charges, magnetic polarities, or in some other way we need not bother with. A sequence of eight such bits is called a byte.

    There are many possible encodings. In UTF-8, one Unicode character is mapped to a sequence of up to four bytes. Everything will work fine until you try to print the text or write it to a file: if the terminal window is not configured to display that language, you may get strange output.

A breakdown of the BMP into blocks is given in WDH: Unicode Standard. Here we only note that the first 128 codes (0000 - 007F) coincide with ASCII codes and encode the Basic Latin block. The layout of scripts across the Unicode range will be described in detail in my article "Unicode and the scripts of the world." Since we will only be interested in BMP characters from here on, I use their 16-bit codes of the form XXXX (the most significant bits are zero and are omitted).
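Because each plane spans exactly 0x10000 code points, the plane of any character can be computed by shifting its code right by 16 bits. A minimal illustrative sketch (the helper name is my own, not from the article):

```python
def plane(code_point: int) -> int:
    # Each plane covers 0x10000 consecutive code points,
    # so the plane number is simply the bits above the low 16.
    return code_point >> 16

print(plane(0x0416))   # 0  - BMP (Cyrillic Ж)
print(plane(0x1D11E))  # 1  - SMP (a musical symbol)
print(plane(0xE0001))  # 14 - SSP (a language-tag character)
```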

If you try to write such text to a file under the wrong encoding, you may receive an error message. UTF-8, by contrast, lets you mix in one document, for example, Cyrillic and the Polish alphabet. It is by far the most versatile standard. All the Unicode transformation formats use the same character repertoire and have the same capabilities; the only difference is in how the codes are written out. Unicode documents give you the greatest portability: every parser must support UTF-8, whereas with single-code-page standards you have no such guarantee.
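The failure mode described above is easy to reproduce (an illustrative Python sketch; the sample word is my own choice): a single-byte codec refuses letters outside its range, while UTF-8 round-trips any Unicode text.

```python
text = "Żółć"  # Polish letters outside ASCII

try:
    text.encode("ascii")  # a limited single-byte codec cannot hold them
except UnicodeEncodeError as err:
    print("ascii failed:", err.reason)

data = text.encode("utf-8")           # UTF-8 accepts any Unicode text
print(data.decode("utf-8") == text)   # prints True
```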

If you use this approach, you do not need any extra encoding information. Some information about programs that can help you encode and convert characters can be found in the "Tools" section. This tutorial shows only the steps needed to add to a font the uppercase and lowercase letters with the diacritics specific to the Romanian language. Writing without diacritics can lead to ambiguous expressions such as "12-year-old tank" or "novel born in Rome."

General description

Unicode is based on the concept of a character. A character is an abstract concept that exists in a specific writing system and is realized through its images (graphemes). Each character is given a unique code and belongs to a specific Unicode block. For example, the grapheme A exists in the English, Russian, and Greek alphabets, but in Unicode it corresponds to three different characters: "Latin capital letter A" (code 0041), "Cyrillic capital letter A" (code 0410), and "Greek capital letter ALPHA" (code 0391). If we now apply lowercase conversion to these characters, we get, respectively, "Latin small letter A" (code 0061, grapheme a), "Cyrillic small letter A" (code 0430, grapheme а), and "Greek small letter ALPHA" (code 03B1, grapheme α), i.e. different graphemes.
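The three look-alike capitals and their distinct lowercase mappings can be verified directly (an illustrative Python sketch, not part of the original article):

```python
# Latin A, Cyrillic А and Greek Α look alike but are distinct characters,
# and each lowercases to its own code position.
for ch in ("\u0041", "\u0410", "\u0391"):
    low = ch.lower()
    print(f"U+{ord(ch):04X} -> U+{ord(low):04X}")
# prints U+0041 -> U+0061, U+0410 -> U+0430, U+0391 -> U+03B1
```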

The Romanian diacritics are a circumflex ("hat"), a breve, and a comma below the letter. Unicode is designed so that any letter in any language, on any hardware or software platform, has its own unique and unambiguous number; the glyph then supplies the shape and size.

We choose the comma glyph; a red outline around a glyph means it is selected. The same adaptations are made for the other Romanian diacritic letters. The Font Information dialog box appears, and in the Family Name field we enter the name we want for the font.

The question may arise: what is lowercase conversion? Here we come to the most interesting and important point in the standard. The point is that Unicode is not just a code table. The concept of an abstract character allowed the creators of Unicode to build a character database in which each character is described by its unique code (the database key), its full name, and a set of properties. For example, the character with code 0410 is described in this database as follows:

0410;CYRILLIC CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0430;

Let's decipher this entry. It means that the code 0410 is assigned to "CYRILLIC CAPITAL LETTER A" (the full name of the character), which has the following properties:

General category: uppercase letter (Lu = Letter, uppercase)
Canonical combining class: 0
Bidirectional category: left to right (L)
Character decomposition: none
Decimal digit value: none
Digit value: none
Numeric value: none
Mirrored: no (N)
Unicode 1.0 name: none
Comment: none
Uppercase mapping: none
Lowercase mapping: 0430
Titlecase mapping: none

The listed properties are defined for every Unicode character. This allowed the developers to create standard algorithms that, based on character properties, define the rules for rendering, sorting, and upper/lower case conversion.
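These same database fields are exposed by Python's standard unicodedata module, which is built from exactly this character database (an illustrative sketch, not from the original article):

```python
import unicodedata

ch = "\u0410"  # the character described by the record above
print(unicodedata.name(ch))           # prints CYRILLIC CAPITAL LETTER A
print(unicodedata.category(ch))       # prints Lu
print(unicodedata.combining(ch))      # prints 0
print(unicodedata.bidirectional(ch))  # prints L
print(f"U+{ord(ch.lower()):04X}")     # prints U+0430 (lowercase mapping)
```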

In summary, we can say that the Unicode standard consists of three interrelated parts:

    the character database;

    the grapheme (glyph) base, which defines the visual representation of these characters;

    a set of algorithms that define the rules for working with characters.

In conclusion of this section, we present the graphemes of the Cyrillic block (codes 0400 - 04FF). Please note that it includes not only the letters of the modern Cyrillic alphabets (Russian, Ukrainian, Belarusian, Bulgarian, Serbian, Macedonian, etc.), but also all the letters of the original Cyrillic alphabet used in Church Slavonic writing.

Transformation formats

As we have seen, each Unicode character has a unique 21-bit code point. However, for practical implementation, such a character encoding is inconvenient. The fact is that operating systems and network protocols traditionally treat data as streams of bytes. This leads to at least two problems:

    The byte order within a word differs between processors. Intel, DEC, and other little-endian processors store the least significant byte first in a machine word, while Motorola, Sparc, and other big-endian processors store the most significant byte first. (The terms come from the little-endians and big-endians of Swift's Gulliver's Travels, who argued over which end an egg should be broken from.)

    Many byte-oriented systems and protocols allow only bytes from a specific range as data; the rest are treated as service bytes. In particular, it is customary to use the null byte as the end-of-string marker. Since Unicode assigns character codes contiguously, transmitting its codes directly as byte strings may conflict with the rules of a data-transfer protocol.

To overcome these problems, the standard includes three transformation formats, UTF-8, UTF-16, and UTF-32, which define the rules for encoding Unicode characters as byte sequences, 16-bit words (with surrogate pairs for characters outside the BMP), and 32-bit words, respectively. The choice of format depends on the architecture of the computing system and on data storage and transmission standards. A brief description of the transformation formats can be found in WDH: Unicode Standard.
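The three formats, and the byte-order issue from the first list item, can be observed by encoding a single character each way (an illustrative Python sketch, not from the original article):

```python
ch = "\u0416"  # Ж, code position U+0416
for codec in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be"):
    print(codec, ch.encode(codec).hex())
# prints d096, 0416, 1604 and 00000416:
# UTF-16 shows the same 16-bit word in big- and little-endian byte order.
```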

Implementation problems

I think that even from the brief description above it is clear that full support of the Unicode standard by the major operating systems will mark a revolution in word processing. A user at any Web terminal will be able to choose any keyboard layout, type text in any language, and send it to any computer, which will display that text correctly. Databases will be able to store, correctly sort, and print textual information, again in any language. For this paradise to arrive, five things are necessary:

    Operating systems must support Unicode transform formats at the level of input, storage, and display of text strings.

    We need smart keyboard drivers that let us enter characters from any Unicode block and pass their codes to the operating system.

    Text editors should support displaying all Unicode characters and perform a common set of character operations on them.

    The same must be done correctly by the DBMS for text and memo fields.

    Since national encodings will coexist with Unicode for a long time, it is necessary to support text transformations between them.

Unfortunately, we must admit that in the ten years since Unicode 1.0 appeared in 1991, much less has been done in this direction than we would like. Even Windows, which contains the most consistent system-level support for Unicode, is full of irrational limitations owing purely to its historical development. The situation is even worse on Unix, where Unicode support is relegated from the system level to individual applications. Arguably, Unicode is best supported today in two environments: web browsers and the Java virtual machine. This is not surprising, since both were originally designed to be system-independent.

Objective difficulties in supporting Unicode should also be noted. Consider just the display of graphemes, which requires installing the appropriate fonts in the system. The problem is that a font containing all Unicode graphemes has a completely unwieldy size: the TrueType font Arial Unicode MS, which contains a large portion of the Unicode characters, "weighs" 24 MB, and as Unicode fills with new blocks, the size of such fonts will approach 100 MB. One way out is the on-demand loading of characters proposed by Microsoft and adopted in its browser, Internet Explorer. However, the standards are so far silent about the rules for building Unicode fonts.

Ways to work with Unicode characters and national encodings in the most important environments and programming systems will be discussed in the following articles.
