
Russian letters in the windows code table. Cyrillic encoding - Russian

Encodings

When I first started programming in C, my first program (apart from HelloWorld) was a program for transcoding text files from the main GOST encoding (remember that one? :-) to the alternative one. That was back in 1991. Much has changed since then, but unfortunately such programs have not lost their relevance over the past 10 years. Too much data has already been accumulated in various encodings, and too many programs are in use that can work with only one of them. There are at least a dozen different encodings for the Russian language, which makes the problem even more confusing.

Where did all these encodings come from and what are they for? Computers, by their very nature, can only work with numbers. In order to store letters in a computer's memory, each letter has to be assigned a certain number (roughly the same principle was used before the advent of computers - remember Morse code). Moreover, the number should preferably be small - the fewer binary digits are involved, the more efficiently the memory can be used. This correspondence between a set of characters and numbers is, in fact, an encoding. The desire to save memory at any cost, as well as the disunity of different groups of computer scientists, led to the current state of affairs. The most common encoding method now is to use one byte (8 bits) per character, which limits the total number of characters to 256. The set of the first 128 characters is standardized (the ASCII set) and is the same in all common encodings (encodings where this is not the case are practically out of use). English letters and punctuation symbols fall into this range, which explains their amazing survivability in computer systems :-). Other languages are not so lucky - they all have to huddle in the remaining 128 numbers.

Unicode

In the late 1980s, many realized the need to create a single standard for character encoding, which led to the emergence of Unicode. Unicode is an attempt to fix a specific number for a specific character once and for all. It is clear that 256 characters will not fit here no matter how much you want them to. For quite a long time it seemed that 2 bytes (65536 characters) should be enough. But no - the latest version of the Unicode standard (3.1) already defines 94,140 characters. For that many characters you probably already have to use 4 bytes (4294967296 characters). Maybe that will be enough for a while... :-)

The Unicode character set includes all sorts of letters with all sorts of dashes and diacritics, Greek, mathematical, hieroglyphic and pseudo-graphic symbols, etc., etc. - including our favorite Cyrillic characters (the range of values is 0x0400-0x04FF). So there is no discrimination on this side.

If you are interested in specific character codes, it is convenient to use the "Character Map" program from WinNT to view them. For example, here is the Cyrillic range:

If you have a different OS or are interested in the official interpretation, then the full charts can be found on the official Unicode website (http://www.unicode.org/charts/web.html).

The char and byte types

Java has a separate 2-byte char data type for characters. This often creates confusion in the minds of beginners (especially if they have previously programmed in other languages, such as C/C++), because most other languages use 1-byte data types to process characters. For example, in C/C++ char is mostly used both for character handling and for byte handling - there is no separation. Java has its own type for bytes - the byte type. Thus a C char corresponds to a Java byte, while the closest analogue of a Java char in the C world is the wchar_t type. The concepts of characters and bytes must be clearly separated - otherwise misunderstanding and problems are guaranteed.

Java has been using the Unicode standard for character encoding almost since its inception. Java library functions expect to see Unicode codes in char values. In principle, of course, you can stuff anything in there - numbers are numbers, the processor will endure everything - but for any processing the library functions will act on the assumption that they were given the Unicode encoding. So you can safely assume that the encoding of char is fixed. But that is inside the JVM. When data is read from outside or passed outside, it can be represented by only one type - the byte type. All other types are constructed from bytes, depending on the data format used. This is where encodings come into play - in Java an encoding is simply a data format for transferring characters, which is used to form data of type char. For each code page the library has 2 conversion classes (ByteToChar and CharToByte). These classes live in the sun.io package. If, when converting from char to byte, no corresponding character is found, it is replaced with the '?' character.

By the way, these codepage files in some early versions of JDK 1.1 contain bugs that cause conversion errors or even runtime exceptions. For example, this concerns the KOI8_R encoding. The best remedy here is to upgrade to a later version. Judging by Sun's description, most of these problems were resolved in JDK 1.1.6.

Prior to JDK 1.4, the set of available encodings was determined solely by the JDK vendor. Starting with 1.4, a new API appeared (the java.nio.charset package), with which you can already create your own encoding (for example, to support some rarely used encoding that you badly need).
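
Creating a full custom encoding means implementing a CharsetProvider, which is beyond the scope of this text, but the same java.nio.charset API is also convenient for ordinary conversions. A minimal sketch (JDK 1.4+; the sample text is arbitrary):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Look up an existing code page by name (synonyms work too)
Charset cs = Charset.forName("KOI8-R");
// char -> byte: encode a string into KOI8-R bytes
ByteBuffer bytes = cs.encode("Русский текст");
// byte -> char: decode the bytes back into characters
CharBuffer chars = cs.decode(bytes);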

String class

In most cases, Java uses an object of type java.lang.String to represent strings. This is a regular class that internally stores an array of characters (char) and contains many useful methods for manipulating characters. The most interesting for us are the constructors that take a byte array as the first parameter, and the getBytes() methods. With these methods you can perform conversions from byte arrays to strings and vice versa. To specify which encoding to use, these methods have a string parameter that sets its name. For example, here is how you can convert bytes from KOI-8 to Windows-1251:

// Data encoded in KOI-8
byte[] koi8Data = ...;
// Convert from KOI-8 to Unicode
String string = new String(koi8Data, "KOI8_R");
// Convert from Unicode to Windows-1251
byte[] winData = string.getBytes("Cp1251");

The list of 8-bit encodings available in modern JDKs and supporting Russian letters can be found below, in the section on 8-bit encodings of Russian letters.

Since an encoding is a data format for characters, in addition to the familiar 8-bit encodings there are also multibyte encodings in Java on an equal footing. These include UTF-8, UTF-16, Unicode, etc. For example, this is how you can get bytes in the UnicodeLittleUnmarked format (16-bit Unicode encoding, low byte first, no byte order mark):

// Convert from Unicode to 16-bit little-endian Unicode
byte[] data = string.getBytes("UnicodeLittleUnmarked");

It is easy to make a mistake with such conversions - if the encoding of the byte data does not correspond to the specified parameter when converting from byte to char, then the recoding will not be performed correctly. Sometimes after that it is possible to pull out the correct characters, but more often than not, part of the data will be irretrievably lost.

In a real program it is not always convenient to specify the code page explicitly (although it is more reliable). For this, a default encoding was introduced. By default it depends on the system and its settings (for Russian Windows the Cp1251 encoding is adopted), and in old JDKs it can be changed by setting the file.encoding system property. In JDK 1.3 changing this setting sometimes works and sometimes doesn't. The reason is the following: initially file.encoding is set according to the regional settings of the computer, and the default encoding reference is remembered internally during the first conversion. file.encoding is used for that conversion, but it happens even before the JVM startup arguments are applied (in fact, while they are being parsed). Actually, according to Sun, this property reflects the system encoding and should not be changed on the command line (see, for example, the comments on BugID) However, in JDK 1.4 Beta 2 changing this setting once again began to have an effect. Whether this is a deliberate change or a side effect that may disappear again - the folks at Sun have not yet given a clear answer.

This encoding is used whenever the page name is not explicitly specified. This should always be kept in mind - Java will not try to guess the encoding of the bytes you pass when creating a String (and it can't read your mind either :-). It just uses the current default encoding. Because this setting is the same for all conversions, you can sometimes run into trouble.
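
A small sketch of the difference (the byte array and encodings here are assumed for illustration):

// Bytes that are actually encoded in KOI8-R
byte[] data = ...;
// Uses the current default encoding (e.g. Cp1251 on Russian Windows) - likely mojibake
String wrong = new String(data);
// The encoding is stated explicitly - correct result
String right = new String(data, "KOI8_R");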

To convert from bytes to characters and vice versa, use only these methods. In most cases a simple type cast cannot be used - the character encoding will not be taken into account. For example, one of the most common mistakes is to read data byte by byte using the read() method of an InputStream and then cast the resulting value to the char type:

InputStream is = ..;
int b;
StringBuffer sb = new StringBuffer();
while ((b = is.read()) != -1) {
    sb.append((char) b); // <- don't do this
}
String s = sb.toString();

Pay attention to the type cast - "(char) b". The byte values are simply copied into char instead of being re-encoded (the range of values is 0-0xFF, not the one where the Cyrillic alphabet lives). This copying corresponds to the ISO-8859-1 encoding (which maps one-to-one to the first 256 Unicode values), which means this code effectively uses that encoding (instead of the one in which the characters in the original data are actually encoded). If you try to display the received value, you will see either question marks or mojibake on the screen. For example, when reading the string "АБВ" in the Windows encoding, something like "ÀÁÂ" can easily be displayed. This kind of code is often written by programmers in the West - it works with English letters, and that's fine with them. Fixing this code is easy - you just need to replace the StringBuffer with a ByteArrayOutputStream:

InputStream is = ..;
int b;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
while ((b = is.read()) != -1) {
    baos.write(b);
}
// Convert bytes to string using the default encoding
String s = baos.toString();
// If you need a specific encoding, just specify it when calling toString():
//
// s = baos.toString("Cp1251");
For more information on common errors, see the section on typical mistakes below.

8-bit encodings of Russian letters

Here are the main 8-bit encodings of Russian letters that have become widespread:

In addition to the main name, synonyms can be used. The set of them may differ in different versions of the JDK. Here is a list from JDK 1.3.1:

  • Cp1251:
    • Windows-1251
  • Cp866:
    • IBM866
    • IBM-866
    • CP866
    • CSIBM866
  • KOI8_R:
    • KOI8-R
    • CSKOI8R
  • ISO8859_5:
    • ISO8859-5
    • ISO-8859-5
    • ISO_8859-5
    • ISO_8859-5:1988
    • ISO-IR-144
    • 8859_5
    • Cyrillic
    • CSISOLatinCyrillic
    • IBM915
    • IBM-915
    • Cp915

Moreover, synonyms, unlike the main name, are case-insensitive - this is a feature of the implementation.

It is worth noting that these encodings may not be available on some JVMs. For example, you can download two different versions of the JRE from the Sun site - US and International. The US version contains only a minimum - ISO-8859-1, ASCII, Cp1252, UTF8, UTF16 and several variations of double-byte Unicode. Everything else is available only in the International version. Sometimes because of this you can run into problems launching a program, even if it does not need Russian letters at all. A typical error that occurs in this case:

Error occurred during initialization of VM
java/lang/ClassNotFoundException: sun/io/ByteToCharCp1251

It arises, as you can easily guess, because the JVM, based on the Russian regional settings, tries to set the default encoding to Cp1251, but since the supporting class is absent in the US version, it naturally fails.

Files and data streams

Just as bytes are conceptually separated from characters, Java distinguishes between byte streams and character streams. Working with bytes is represented by classes that directly or indirectly inherit from InputStream or OutputStream (plus the one-of-a-kind RandomAccessFile class). Working with characters is represented by the sweet couple of Reader/Writer classes (and their descendants, of course).

Streams of bytes are used to read / write unconverted bytes. If you know that bytes represent only characters in a certain encoding, you can use the special converter classes InputStreamReader and OutputStreamWriter to get a stream of characters and work with it directly. This is usually useful for plain text files or when working with many of the Internet's network protocols. In this case, the character encoding is specified in the constructor of the converter class. Example:

// Unicode string
String string = "...";
// Write the string to a text file in Cp866 encoding
PrintWriter pw = new PrintWriter(       // class with methods for writing strings
    new OutputStreamWriter(             // converter class
        new FileOutputStream("file.txt"), "Cp866"));
pw.println(string); // write the line to the file
pw.close();

If the stream may contain data in different encodings, or characters are mixed with other binary data, then it is better to read and write byte arrays (byte), and use the already mentioned methods of the String class for conversion. Example:

// Unicode string
String string = "...";
// Write the string to a text file in two encodings (Cp866 and Cp1251)
OutputStream os = new FileOutputStream("file.txt"); // class for writing bytes to a file
// Write the string in Cp866 encoding
os.write(string.getBytes("Cp866"));
// Write the string in Cp1251 encoding
os.write(string.getBytes("Cp1251"));
os.close();

The console in Java is traditionally represented by streams, but unfortunately byte streams rather than character streams. The fact is that character streams appeared only in JDK 1.1 (along with the entire encoding mechanism), while access to console I/O was designed in JDK 1.0, which led to the appearance of a freak in the form of the PrintStream class. This class is used in the System.out and System.err variables, which actually give access to console output. By all accounts this is a byte stream, but with a bunch of methods for writing strings. When you write a string to it, it is converted to bytes internally using the default encoding, which is usually unacceptable on Windows - the default encoding will be Cp1251 (Ansi), while the console window usually needs Cp866 (OEM). This error was registered back in '97 () but the folks at Sun seem to be in no hurry to fix it. Since there is no method for setting the encoding in PrintStream, you can solve the problem by replacing the standard class with your own using the System.setOut() and System.setErr() methods. For example, here is the usual beginning of my programs:

...
public static void main(String[] args) {
    // Set up console message output in the desired encoding
    try {
        String consoleEnc = System.getProperty("console.encoding", "Cp866");
        System.setOut(new CodepagePrintStream(System.out, consoleEnc));
        System.setErr(new CodepagePrintStream(System.err, consoleEnc));
    } catch (UnsupportedEncodingException e) {
        System.out.println("Unable to setup console codepage: " + e);
    }
    ...

You can find the sources of the CodepagePrintStream class on this site: CodepagePrintStream.java.

If you are constructing the data format yourself, I recommend that you use one of the multibyte encodings. The most convenient format is usually UTF8 - the first 128 values ​​(ASCII) in it are encoded in one byte, which can often significantly reduce the total amount of data (it is not for nothing that this encoding is taken as the basis in the XML world). But UTF8 has one drawback - the number of bytes required depends on the character code. Where this is critical, you can use one of the two-byte Unicode formats (UnicodeBig or UnicodeLittle).
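
A small sketch illustrating the difference (the strings are arbitrary; the resulting lengths are noted in the comments):

String latin = "Hello";       // 5 ASCII characters
String cyrillic = "Привет";   // 6 Cyrillic characters
byte[] latinUtf8 = latin.getBytes("UTF8");                    // 5 bytes - one per ASCII character
byte[] cyrUtf8 = cyrillic.getBytes("UTF8");                   // 12 bytes - two per Cyrillic character
byte[] cyrUtf16 = cyrillic.getBytes("UnicodeLittleUnmarked"); // 12 bytes - always two per character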

Database

In order to read characters from a database correctly, it is usually enough to tell the JDBC driver which character encoding is used in the database. How exactly depends on the specific driver. Nowadays many drivers already support this setting, unlike in the recent past. Here are some examples I know of.

JDBC-ODBC bridge

This is one of the most commonly used drivers. The bridge in JDK 1.2 and later can easily be configured for the desired encoding. This is done by adding an additional charSet property to the set of parameters passed when opening a connection to the database. The default is file.encoding. It is done something like this:

// Establish a connection
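
Only this comment survives of the original example; here is a minimal sketch of the rest, assuming a hypothetical ODBC data source named MyDataSource and user/password variables defined elsewhere:

// Database connection parameters
Properties connInfo = new Properties();
connInfo.put("user", user);
connInfo.put("password", password);
connInfo.put("charSet", "Cp1251"); // the encoding used by the data in the database
// Establish a connection
Connection db = DriverManager.getConnection("jdbc:odbc:MyDataSource", connInfo);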

Oracle 8.0.5 JDBC-OCI driver for Linux

When receiving data from the database, this driver determines "its" encoding using the NLS_LANG environment variable. If this variable is not found, then it assumes that the encoding is ISO-8859-1. The trick is that NLS_LANG should be an environment variable (set by the set command), not a Java system property (like file.encoding). In the case of using the driver inside the Apache + Jserv servlet engine, the environment variable can be set in the jserv.properties file:

wrapper.env = NLS_LANG = American_America.CL8KOI8R
Information about this was sent by Sergey Bezrukov, for which special thanks to him.

JDBC driver for working with DBF (zyh.sql.dbf.DBFDriver)

This driver has only recently learned to work with Russian letters. Even though it reports via getPropertyInfo() that it understands the charSet property, that is a fiction (at least in the version from 07/30/2001). In reality you can set the encoding with the CODEPAGEID property. For Russian characters two values are available - "66" for Cp866 and "C9" for Cp1251. Example:

// Database connection parameters
Properties connInfo = new Properties();
connInfo.put("CODEPAGEID", "66"); // Cp866 encoding
// Establish a connection
Connection db = DriverManager.getConnection("jdbc:DBF:/C:/MyDBFFiles", connInfo);
If you have DBF files of FoxPro format, then they have their own specifics. The fact is that FoxPro saves in the file header the ID of the code page (byte with offset 0x1D), which was used to create the DBF. When opening a table, the driver uses the value from the header, not the "CODEPAGEID" parameter (the parameter in this case is used only when creating new tables). Accordingly, for everything to work properly, the DBF file must be created with the correct encoding - otherwise there will be problems.
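
For illustration, a sketch (not from the original text) of how that code page ID byte can be inspected; the file name is hypothetical:

// Read the code page ID stored by FoxPro at offset 0x1D of the DBF header
RandomAccessFile dbf = new RandomAccessFile("C:/MyDBFFiles/table.dbf", "r");
dbf.seek(0x1D);
int codepageId = dbf.read(); // e.g. 0x66 corresponds to Cp866, 0xC9 to Cp1251
dbf.close();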

MySQL (org.gjt.mm.mysql.Driver)

With this driver, everything is pretty simple too:

// Database connection parameters
Properties connInfo = new Properties();
connInfo.put("user", user);
connInfo.put("password", pass);
connInfo.put("useUnicode", "true");
connInfo.put("characterEncoding", "KOI8_R");
Connection conn = DriverManager.getConnection(dbURL, connInfo);

InterBase (interbase.interclient.Driver)

For this driver, the "charSet" parameter works:
// Database connection parameters
Properties connInfo = new Properties();
connInfo.put("user", username);
connInfo.put("password", password);
connInfo.put("charSet", "Cp1251");
// Establish a connection
Connection db = DriverManager.getConnection(dataurl, connInfo);

However, do not forget to specify the character encoding when creating the database and tables. For the Russian language, you can use the values ​​"UNICODE_FSS" or "WIN1251". Example:

CREATE DATABASE "E:\ProjectHolding\DataBase\HOLDING.GDB"
PAGE_SIZE 4096
DEFAULT CHARACTER SET UNICODE_FSS;

CREATE TABLE RUSSIAN_WORD (
    "NAME1" VARCHAR(40) CHARACTER SET UNICODE_FSS NOT NULL,
    "NAME2" VARCHAR(40) CHARACTER SET WIN1251 NOT NULL,
    PRIMARY KEY ("NAME1")
);

There is a bug in version 2.01 of InterClient - the resource classes with messages for the Russian language are not compiled correctly there. Most likely, the developers simply forgot to specify the source encoding when compiling. There are two ways to fix this error:

  • Use interclient-core.jar instead of interclient.jar. At the same time, there will simply be no Russian resources, and the English ones will be picked up automatically.
  • Recompile the resource files into normal Unicode. Parsing class files by hand is a thankless task, so it is better to use JAD. Unfortunately, if JAD encounters characters from the ISO-8859-1 set, it outputs them as octal escapes, so you won't be able to use the standard native2ascii converter - you have to write your own (a Decode program). If you don't want to bother with these problems, you can just take a ready-made file with the resources (a patched jar with the driver - interclient.jar, separate resource classes - interclient-rus.jar).

But even having tuned the JDBC driver to the desired encoding, in some cases, you can run into trouble. For example, when trying to use the wonderful new JDBC 2 scrolling cursors in the JDBC-ODBC bridge from JDK 1.3.x, you quickly find that Russian letters simply don't work there (updateString () method).

There is a small story associated with this error. When I first discovered it (under JDK 1.3 rc2), I registered it in BugParade (). When the first beta of JDK 1.3.1 was released, the bug was flagged as fixed. Delighted, I downloaded the beta and ran the test - it didn't work. I wrote to the folks at Sun about this, and they replied that the fix would be included in future releases. Okay, I thought, let's wait. Time passed, release 1.3.1 came out, and then beta 1.4. Finally I found the time to check - it doesn't work again. Mother, mother, mother... - the echo echoed habitually. After an angry letter to Sun they opened a new bug (), which they handed over to the Indian branch to deal with. The Indians fiddled with the code and said that everything was fixed in 1.4 beta3. I downloaded that version and ran my test case under it - here is the result: . As it turned out, the beta3 distributed on the site (build 84) is not the beta3 in which the final fix was included (build 87). Now they promise that the fix will be included in 1.4 rc1... Well, in general, you get the idea :-)

Russian letters in the sources of Java programs

As mentioned, the program uses Unicode during execution. The source files, however, are written in ordinary editors. I use Far; you probably have your own favorite editor. These editors save files in an 8-bit format, which means that reasoning similar to the above applies to these files as well. Different versions of the compiler perform the character conversion slightly differently. Early JDK 1.1.x versions use the file.encoding setting, which can be overridden with the non-standard -J option. In newer ones (as reported by Denis Kokarev - starting from 1.1.4) an additional -encoding parameter was introduced, with which you can specify the encoding used. In compiled classes strings are represented in Unicode (more precisely, in a modified version of the UTF8 format), so the most interesting things happen during compilation. Therefore, the most important thing is to find out what encoding your source code is in and specify the correct value when compiling. By default the same notorious file.encoding will be used. An example of calling the compiler:
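
The example itself seems to be missing here; for sources saved in the DOS encoding it might look like this (the file name is illustrative):

javac -encoding Cp866 Test.java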

In addition to using this setting there is one more method - specifying letters in the "\uXXXX" format, where the character code is given. This method works with all versions, and you can use a standard utility (native2ascii, described below) to obtain these codes.

If you use an IDE, it may have its own glitches. These IDEs often use the default encoding for reading/saving sources - so pay attention to the regional settings of your OS. In addition, there may be outright bugs - for example, the otherwise quite good CodeGuide IDE does not digest the capital Russian letter "Т" well. The built-in code analyzer takes this letter for a double quote, which leads to correct code being flagged as erroneous. You can fight this (by replacing the letter "Т" with its code "\u0422"), but it is unpleasant. Apparently, somewhere inside the parser an explicit conversion of characters to bytes (like byte b = (byte) c) is used, so instead of the code 0x0422 (the code of the letter "Т") the code becomes 0x22 (the code of the double quote).

JBuilder has another problem, but it is related more to ergonomics. The fact is that JDK 1.3.0, under which JBuilder runs by default, has a bug () due to which newly created GUI windows, when activated, automatically switch the keyboard layout according to the regional settings of the OS. That is, if you have Russian regional settings, it will constantly try to switch to the Russian layout, which gets in the way when writing programs. The JBuilder.ru site has a couple of patches that change the current locale in the JVM to Locale.US, but the best way is to upgrade to JDK 1.3.1, which fixes this bug.

Novice JBuilder users may also encounter such a problem - Russian letters are saved as "\ uXXXX" codes. To avoid this, in the Default Project Properties dialog, General tab, in the Encoding field, change Default to Cp1251.

If you use a compiler other than the standard javac, pay attention to how it performs character conversion. For example, some versions of the IBM jikes compiler do not understand that there are encodings other than ISO-8859-1 :-). There are versions patched in this regard, but often some encoding is hard-coded there as well - there is no such convenience as in javac.

JavaDoc

To generate HTML documentation for the source code, the javadoc utility is used, which is included in the standard JDK distribution. To specify encodings, it has as many as 3 parameters:

  • -encoding - this setting specifies the source encoding. The default is file.encoding.
  • -docencoding - this setting specifies the encoding of the generated HTML files. The default is file.encoding.
  • -charset - this setting specifies the encoding that will be written into the headers of the generated HTML files (the Content-Type meta tag). Obviously, it should be consistent with the previous setting. If this setting is omitted, the meta tag will not be added.
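
For illustration, an invocation combining all three settings might look like this (the package name and output directory are hypothetical):

javadoc -encoding Cp1251 -docencoding Cp1251 -charset windows-1251 -d docs my.package.name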

Russian letters in properties files

Resource loading methods, which work in a specific way, are used to read properties files. Actually, the Properties.load method is used for reading, and it does not use file.encoding (the ISO-8859-1 encoding is hard-coded in its source), so the only way to specify Russian letters is to use the "\uXXXX" format and the native2ascii utility.
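
A minimal sketch of what this looks like in practice (the file name and key are made up; \u041F\u0440\u0438\u0432\u0435\u0442 is the word "Привет" written as escapes):

// messages.properties contains only ASCII, e.g.:
//   greeting=\u041F\u0440\u0438\u0432\u0435\u0442
Properties props = new Properties();
InputStream in = new FileInputStream("messages.properties");
props.load(in);
in.close();
System.out.println(props.getProperty("greeting")); // prints "Привет" (if the console is set up correctly)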

The Properties.save method works differently in JDK 1.1 and 1.2. In versions 1.1, it simply discarded the high byte, so it worked correctly only with English letters. 1.2 does the reverse conversion to "\ uXXXX", so it works mirrored to the load method.

If your properties files are loaded not as resources but as ordinary configuration files, and you are not satisfied with this behavior, there is only one way out: write your own loader.
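
A minimal sketch of such a loader, assuming a simple key=value format without escape sequences or continuation lines (the method and file names are arbitrary):

public static Properties loadProperties(String fileName, String enc) throws IOException {
    Properties props = new Properties();
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(fileName), enc));
    String line;
    while ((line = reader.readLine()) != null) {
        line = line.trim();
        // skip empty lines and comments
        if (line.length() == 0 || line.startsWith("#")) continue;
        int eq = line.indexOf('=');
        if (eq > 0) {
            props.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
        }
    }
    reader.close();
    return props;
}

// Usage: Properties cfg = loadProperties("config.txt", "Cp1251");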

Russian letters in Servlets.

Well, what these same Servlets are for, I think you know. If not, it's best to read the documentation first. Here, only the peculiarities of working with Russian letters are described.

So what are the features? When a Servlet sends a response to a client, there are two ways to send that response - via an OutputStream (the getOutputStream() method) or via a PrintWriter (the getWriter() method). In the first case you are writing byte arrays, so the above methods of writing to streams apply. In the case of PrintWriter, it uses the encoding that has been set. In any case, the encoding used must be specified correctly when calling the setContentType() method, so that the correct character conversion is performed on the server side. This directive must be issued before calling getWriter() or before the first write to the OutputStream. Example:

// Set the encoding of the response
// Note that some engines do not allow
// a space between ";" and "charset"
response.setContentType("text/html;charset=UTF-8");
PrintWriter out = response.getWriter();
// Debug output of the encoding name, to check
out.println("Encoding: " + response.getCharacterEncoding());
...
out.close();

That covers sending responses to the client. Unfortunately, it is not so easy with input parameters. Input parameters are encoded by the browser into bytes according to the MIME type "application/x-www-form-urlencoded". As Alexey Mendelev reported, browsers encode Russian letters using the currently set page encoding. And, of course, nothing is reported about it. Accordingly, for example, in JSDK versions from 2.0 to 2.2 this is not checked in any way, and what encoding will be used for the conversion depends on the engine used. Starting with specification 2.3 it became possible to set the encoding for javax.servlet.ServletRequest - the setCharacterEncoding() method. The latest versions of Resin and Tomcat already support this specification.

Thus, if you are lucky and you have a server with Servlet 2.3 support, then everything is quite simple:

public void doPost(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    // Request encoding
    request.setCharacterEncoding("Cp1251");
    String value = request.getParameter("value");
    ...

There is one significant subtlety in applying the request.setCharacterEncoding () method - it must be applied before the first call to a request for data (for example, request.getParameter ()). If you use filters that process the request before it arrives in the servlet, there is a nonzero chance that one of the filters may read some parameter from the request (for example, for authorization) and request.setCharacterEncoding () in the servlet will not work ...

Therefore, it is ideologically more correct to write a filter that sets the request encoding. Moreover, it must be the first in the chain of filters in web.xml.

An example of such a filter:

import java.io.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class CharsetFilter implements Filter {
    // encoding
    private String encoding;

    public void init(FilterConfig config) throws ServletException {
        // read it from the config
        encoding = config.getInitParameter("requestEncoding");
        // if not set, use Cp1251
        if (encoding == null) encoding = "Cp1251";
    }

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain next)
            throws IOException, ServletException {
        request.setCharacterEncoding(encoding);
        next.doFilter(request, response);
    }

    public void destroy() {}
}

And its config in web.xml:

<filter>
    <filter-name>Charset Filter</filter-name>
    <filter-class>CharsetFilter</filter-class>
</filter>

<filter-mapping>
    <filter-name>Charset Filter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>

If you are unlucky and have an older version, you will have to resort to workarounds to achieve the result:

    An original way of working with encodings is offered by Russian Apache - exactly how is described separately.

  • FOP

    If the program does not work anywhere, then the problem is in the program itself and in your hands. Carefully reread everything written above and look again. If the problem manifests itself only in a specific environment, then the matter may be in the settings. Where exactly depends on which graphics library you are using. For AWT, correct configuration of the font.properties.ru file can help. An example of a correct file can be taken from Java 2. If you do not have that version, you can download it from this site: Windows version, Linux version (see also below). This file specifies the fonts and code pages to use. If you have a Russian version of the OS installed, just add this file to where the font.properties file is located. If it is the English version, then you need to either copy this file over font.properties or additionally change the current regional settings to Russian. Sometimes the -Duser.language=ru setting might work, but more often it won't. There are roughly the same problems as with file.encoding - whether it works or not depends on the JDK (see the error by number).

    With the Swing library everything is simpler - everything in it is drawn through the Java2D subsystem. Converting the labels in the standard dialogs (JOptionPane, JFileChooser, JColorChooser) into Russian is very easy - you just need to create several resource files. I have already done this, so you can just take the finished file and add it to lib\ext or to the CLASSPATH. The only problem I encountered is that in JDK versions starting from 1.2 rc1 and 1.3 beta, Russian letters are not displayed under Win9x when using the standard fonts (Arial, Courier New, Times New Roman, etc.) because of a bug in Java2D. The bug is quite peculiar - with the standard fonts, letter glyphs are drawn not according to Unicode codes, but according to Cp1251 (Ansi) codes. This bug is registered in BugParade under the number. By default Swing uses the fonts specified in the font.properties.ru file, so it is enough to replace them with others and the Russian letters appear. Unfortunately, the set of working fonts is small - the Tahoma and Tahoma Bold fonts and two sets of fonts from the JDK distribution - Lucida Sans * and Lucida Typewriter * (example font.properties.ru file). How these fonts differ from the standard ones is not clear to me.

    Since version 1.3rc1 this problem has already been fixed, so you just need to update the JDK. JDK 1.2 is already very outdated, so I don't recommend using it. It should also be noted that the original version of Win95 ships with fonts that do not support Unicode - in this situation, you can simply copy the fonts from Win98 or WinNT.

    Typical mistakes, or "where did the letter Ш go?"

    The letter Ш.

    This question ("where did the letter W go?") Is often asked by novice Java programmers. Let's see where it really goes most often. :-)

    Here's a typical HelloWorld-style program:

    public class Test {
        public static void main(String[] args) {
            System.out.println("ИЦУКЕНГШЩЗХЪ");
        }
    }
    In Far, save this code to the Test.java file, compile it...
    C:\> javac Test.java
    and run it...
    C:\> java Test
    ИЦУКЕНГ?ЩЗХЪ

    What happened? Where did the letter Ш go? The trick here is that two errors partially compensated each other. The Far text editor creates a DOS-encoded file (Cp866) by default. The javac compiler uses file.encoding to read the source (unless a different one is specified with the -encoding key), and in a Windows environment with a Russian locale the default encoding is Cp1251. This is the first error. As a result, the characters in the compiled Test.class file have incorrect codes. The second error is that the standard PrintStream is used for output, which also uses the setting from file.encoding, while the console window in Windows displays characters using the DOS encoding. If every Cp866 byte value had a defined counterpart in Cp1251, the two errors would cancel out and there would be no data loss. But the character Ш in Cp866 has code 152, which is not defined in Cp1251, and it therefore maps to the Unicode character 0xFFFD. When converting back from char to byte, it is replaced with "?".

    You can run into a similar compensation if you read characters from a text file using java.io.FileReader, and then display them on the screen using System.out.println (). If the file was written in Cp866 encoding, then the output will go correctly, with the exception of the letter Ш again.

    Direct byte <-> char conversion.

    This error is a favorite of foreign Java programmers. It is discussed in some detail at the beginning of this description. If you ever look at other people's sources, always pay attention to explicit type casts - (byte) or (char). Pitfalls are often buried in such places.

    Algorithm for finding problems with Russian letters

    If you have no idea where in your program Russian letters might be getting lost, you can try the following test. Any program can be considered a processor of input data. Russian letters are data like any other; in general they pass through three stages of processing: they are read from somewhere into the program's memory (input), processed inside the program, and shown to the user (output). In order to locate the problem, try hard-coding the following test string into the source: "АБВ\u0410\u0411\u0412" (the first three letters are typed literally, the second three are the same letters written as Unicode escapes), and try to output it. Then see what you get:

    • If you see "ABVABV", then the compilation of the sources and the output are working correctly for you.
    • If you see "??? ABC" (or any other characters other than "ABC" in place of the first three letters), then the output is working correctly, but the compilation of the sources is going wrong - most likely the -encoding key is not specified.
    • If you see "??????" (or any other symbols except "ABC" in place of the second three letters), then the output is not working correctly for you.

    Having configured the output and compilation, you can already easily figure out the input. After setting up the entire chain, the problems should be gone.
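
    For reference, such a test might look like this minimal sketch (the class name is arbitrary):

    public class EncTest {
        public static void main(String[] args) {
            // the first three letters are literal, the second three are Unicode escapes for АБВ
            System.out.println("АБВ\u0410\u0411\u0412");
        }
    }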

    About the native2ascii utility

    This utility is part of the Sun JDK and is designed to convert source code to ASCII format. It reads the input file using the specified encoding and writes out characters in the "\ uXXXX" format. If you specify the -reverse switch, then the reverse conversion is performed. This program is very useful for converting resource files (.properties) or for processing sources if you suspect that they can be compiled on computers with non-Russian regional settings.

    If you run the program without parameters, it works with standard input (stdin) rather than printing a usage hint like other utilities do. This leads to many people not even realizing that parameters need to be specified (except, perhaps, those who found the strength and courage to look into the documentation :-). Meanwhile, for this utility to work correctly you must at a minimum specify the encoding being used (with the -encoding key). If this is not done, the default encoding (file.encoding) is used, which may differ from the expected one. As a result, having received incorrect letter codes (due to the wrong encoding), you can spend a lot of time looking for errors in absolutely correct code.
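
    For example, converting a resource file written in the Windows encoding into escaped form and back might look like this (the file names are made up):

    native2ascii -encoding Cp1251 messages_ru.txt messages_ru.properties
    native2ascii -reverse -encoding Cp1251 messages_ru.properties messages_ru.txt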

    About the character conversion method

    Many people use this method incorrectly, probably not fully understanding its essence and limitations. It is designed to restore correct letter codes if they have been misinterpreted. The essence of the method is simple: the original byte array is restored from the received incorrect characters using the appropriate code page. Then, from this array of bytes, using the already correct page, normal character codes are obtained. Example:

    String res = new String (src.getBytes ("ISO-8859-1"), "Cp1251");

    There can be several problems in using this technique. For example, the wrong page may be used for recovery, or it may change in some situations. Another problem can be that some pages perform an ambiguous byte <-> char conversion. See, for example, the error description by number.

    Therefore, you should use this method only in the most extreme case, when nothing else helps, and you have a clear idea of ​​where the incorrect conversion of characters occurs.

    Russian letters and MS JVM

    It is unclear for what reasons, but the MS JVM lacks all the conversion files for Russian encodings except Cp1251 (they probably tried to reduce the size of the distribution this way). If you need other encodings, for example Cp866, then you need to add the corresponding classes to the CLASSPATH. Moreover, classes from the latest versions of the Sun JDK do not fit - Sun changed their structure long ago, so the latest versions of the classes do not match Microsoft's (MS still has the structure from JDK 1.1.4). On the Microsoft server, in principle, there is a complete set of additional encodings, but it is a file about 3 megabytes in size, and their server does not support resumed downloads :-). I managed to download this file and repackaged it with jar; you can take it from here.

    If you are writing an applet that should work under MS JVM and you needed to read from somewhere (for example, from a file on the server) bytes in Russian encoding other than Cp1251 (for example, in Cp866), then you will no longer be able to use the standard encoding mechanism - applets are prohibited from adding classes to system packages, which in this case is the sun.io package. There are two ways out here - either to recode the file on the server to Cp1251 (or to UTF-8) or, before converting from bytes to Unicode, convert bytes from the desired encoding to Cp1251.

    Russification of Java for Linux

    I will say right away - I do not work with Linux, and the information given here was obtained from the readers of this description. If you find an inaccuracy or want to add - write to me.

    There are two parallel problems when cyrillizing the JVM on Linux:

    1. Cyrillic output problem in GUI components
    2. Problem of entering Cyrillic from the keyboard (in X11)

    The output problem can be solved in this way (this algorithm was sent by Artemy E. Kapitula):

    1. Install normal Windows NT/2000 ttf fonts in X11. I would recommend Arial, Times New Roman, Courier New, Verdana and Tahoma - and it is better to connect them not through a font server, but as a directory with files.
    2. Add the following font.properties.ru file to the $JAVA_HOME/jre/lib directory

    The input problem is solved approximately in the following way (this algorithm was sent by Mikhail Ivanov):

    Setting up the input of Russian letters in the following configuration:

    • Mandrake Linux 7.1
    • XFree86 3.3.6
    • IBM Java 1.3.0 (release)

    Problem:

    IBM Java 1.3 does not allow entering Russian letters (they show up as mojibake), despite the fact that they are visible on labels and in menus.

    The cause: AWT uses XIM (-> xkb) for input. (This is not bad in itself, such things just need to be handled carefully, and some xkb tools do not like it.)

    Configure xkb (and the locale - xkb DOES NOT WORK without a locale).

    Procedure:

    1. The locale is set (somewhere like /etc/profile or ~/.bash_profile)
      export LANG=ru_RU.KOI8-R
      export LC_ALL=ru_RU.KOI8-R
    2. Edit (if not already done) /etc/X11/XF86Config. The Keyboard section should contain something like this:
      XkbKeycodes "xfree86"
      XkbTypes    "default"
      XkbCompat   "default"
      XkbSymbols  "ru"
      XkbGeometry "pc"
      XkbRules    "xfree86"
      XkbModel    "pc101"
      XkbLayout   "ru"
      XkbOptions  "grp:shift_toggle"   # layouts are toggled with the two Shift keys
      note: this xkb setting is not compatible with xrus (and others like kikbd) and therefore you will have to say goodbye to them.
    3. Restart X. Check that everything works (e.g. Russian letters in the terminal and applications).
    4. font.properties.ru -> $JAVA_HOME/jre/lib
    5. fonts.dir -> $JAVA_HOME/jre/lib/fonts
    6. cd $JAVA_HOME/jre/lib/fonts; rm fonts.scale; ln -s fonts.dir fonts.scale

    Now Russian letters should be entered and displayed in swing without problems.

If you notice an inaccuracy in the description or want to add, then write to me about it, and your name will also appear in this list. :-)

An encoding is a table of characters, where each letter of an alphabet (as well as digits and special characters) is assigned a unique number - the character code.

Only half of the table is standardized - the so-called ASCII code, the first 128 characters, which include the letters of the Latin alphabet. And there are never any problems with them. The second half of the table (there are 256 characters in total - by the number of states one byte can take) is reserved for national characters, and in each country this part is different. But only in Russia did they manage to come up with as many as 5 different encodings. "Different" means that the same character corresponds to a different numeric code. In other words, if we determine the text's encoding incorrectly, we are presented with completely unreadable text.

The encodings appeared historically. The first widely used Russian encoding was called KOI-8. It was invented when the UNIX system was being adapted to the Russian language. That was back in the seventies - before the appearance of personal computers. And to this day it is considered the main encoding in UNIX.

Then the first personal computers appeared, and the triumphant march of DOS began. Instead of using the already invented encoding, Microsoft decided to make its own, incompatible with anything else. This is how the DOS encoding (code page 866) appeared. By the way, it introduced special characters for drawing frames, which were widely used in programs written for DOS - for example, in the same Norton Commander.

Macintosh computers developed in parallel with the IBM-compatible ones. Even though their share in Russia is very small, there was nevertheless a need for Russification and, of course, yet another encoding was invented - MAC.

Time passed, and in 1990 Microsoft brought to light the first successful Windows versions, 3.0-3.11, and with them support for national languages. And again the same trick was played as with DOS. For some unknown reason they did not support any of the previously existing encodings (as OS/2 did, which adopted the DOS encoding as its standard), but proposed a new one, the Win encoding (code page 1251). De facto it has become the most widespread encoding in Russia.

And, finally, the fifth encoding variant is associated not with a specific company but with attempts to standardize encodings at the level of the entire planet. This was done by ISO, the international standards organization. And guess what they did with the Russian language? Instead of adopting any of the above as the "standard Russian" encoding, they came up with yet another one (!) and called it the long, indigestible combination ISO-8859-5. Of course, it also turned out to be incompatible with anything. And at the moment this encoding is practically not used anywhere. It seems to be used only in the Oracle database. At least I have never seen text in this encoding. However, it is supported in all browsers.

Work is now underway on a new universal encoding (UNICODE), in which it is supposed to cram all the languages of the world into one code table. Then there will definitely be no problems. For this, 2 bytes were allocated for each character, so the maximum number of characters in the table expanded to 65535. But there is still a long way to go before everyone switches to UNICODE.


Here we will digress a little and consider the Content-Type meta tag, for a more complete picture.

Meta tags are used to describe the properties of an HTML document and must be contained within the HEAD tag. Meta tags like NAME contain text information about the document, its author and some recommendations for search engines. For example: Robots, Description, Keywords, Author, Copyright.

Meta tags of the HTTP-EQUIV type affect the formation of the document header and determine the mode of its processing.

The Content-Type meta tag is responsible for specifying the document type and character encoding.

It is necessary to use the Content-Type meta tag only taking into account some nuances:

    First, the character encoding of the text must match the encoding specified in the tag.

    Second, the server should not change the text encoding when processing a browser request.

    Third, if the server changes the text encoding, it must correct or remove the Content-Type meta tag.

Failure to comply with these requirements can lead to the following: the web server will automatically detect the encoding of the client's request and give the recoded page to the web browser. The browser, in turn, will read the document according to the Content-Type meta tag. And if the encodings do not match, then the document can be read only after a series of intricate manipulations. This is especially true for older browsers.

Attention! The Content-Type meta tag is very often inserted automatically by HTML code generators.

The most common encoding types are:

    Windows-1251 - Cyrillic (Windows).

    KOI8-r - Cyrillic (KOI8-R)

    cp866 - Cyrillic (DOS).

    Windows-1252 - Western Europe (Windows).

    Windows-1250 - Central Europe (Windows).

Surely everyone knows the meta tag itself, for example: <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

This material uses excerpts from an article from the site http://cherry-design.ru/

Historically, 7 bits were allocated for representing printed characters (text encoding) in the first computers: 2^7 = 128. This amount was quite enough to encode all the lowercase and uppercase letters of the Latin alphabet, the ten digits and various punctuation marks and brackets. This is the 7-bit ASCII table (American Standard Code for Information Interchange); detailed information about it can be obtained with the man ascii command of the Linux operating system.

When it became necessary to encode national alphabets, 128 characters were no longer enough. It was decided to switch to encoding with 8 bits (i.e. one byte). As a result, the number of characters that can be encoded this way became 2^8 = 256. The characters of the national alphabets were placed in the second half of the code table, that is, they contain a one in the most significant bit of the byte allocated for the character. This is how the ISO 8859 standard appeared, containing many encodings for the most common languages.

Among them was one of the first tables for encoding Russian letters - ISO 8859-5 (use the man iso_8859_5 command to see the codes of the Russian letters in this table).

The tasks of transferring text information over networks forced the development of yet another encoding for Russian letters, called Koi8-R (8-bit information exchange code, Russified). Consider the situation when a letter containing Russian text is sent by e-mail. It used to happen that in the course of traveling through the networks the letter was processed by a program that worked with a 7-bit encoding and zeroed out the eighth bit. As a result of this transformation the character code decreased by 128, turning into the code of a Latin character. It became necessary to make the transmitted text more robust against the zeroing of the 8th bit.

Fortunately, a significant number of Cyrillic letters have phonetic counterparts in the Latin alphabet. For example, Ф and F, Р and R. There are several letters that even match in shape. By arranging the Russian letters in the code table so that their codes exceed the codes of the similar Latin letters by exactly 128, they ensured that losing the 8th bit turned the text into something which, although consisting only of Latin letters, could still be understood by a Russian-speaking user.

Since, of all the operating systems widespread at the time, the most convenient networking tools belonged to the various clones of the Unix operating system, this encoding became the de facto standard on those systems. It still is in Linux. And it is this encoding that is most often used for exchanging mail and news on the Internet.

Then came the era of personal computers and the MS DOS operating system. As it turned out, the Koi8-R encoding was not suitable for it (just like ISO 8859-5): in its table some Russian letters occupied places that many programs assumed were filled with pseudo-graphics (horizontal and vertical lines, corners, etc.). Therefore yet another Cyrillic encoding was invented, in whose table the Russian letters "flow" around the graphic symbols on all sides. This encoding was called alternative (alt), because it was an alternative to the official standard, the ISO-8859-5 encoding. The indisputable advantage of this encoding is that the Russian letters in it are arranged in alphabetical order.

After Windows from Microsoft appeared, it turned out that the alternative encoding was for some reason not suitable for it either. Having moved the Russian letters around the table once again (there was the opportunity - after all, pseudo-graphics are not required in Windows), they obtained the Windows 1251 (Win-1251) encoding.

But computer technologies are constantly improving and at present an increasing number of programs are beginning to support the Unicode standard, which allows you to encode almost all languages ​​and dialects of the inhabitants of the Earth.

So, different operating systems prefer different encodings. In order to be able to read and edit text typed in a different encoding, Russian text transcoding programs are used. Some text editors contain built-in transcoders that allow reading text in various encodings (Word, etc.). To convert files we will use a number of Linux utilities whose purpose is clear from their names: alt2koi, win2koi, koi2win, alt2win, win2alt, koi2alt (from-what to-what; the digit 2 (two) sounds like the English preposition "to", indicating the direction). These commands share the same syntax: command < input_file > output_file.

Example

Let's recode the text typed in the Edit editor in the MS DOS environment into the Koi8-R encoding. To do this, run the command

alt2koi file1.txt > filenew

Since line feeds are encoded differently in MS DOS and Linux, it is also recommended to execute the "fromdos" command:

fromdos filenew > file2.txt

The reverse command is called "todos" and has the same syntax.

Example

Let's sort the List.txt file, which contains a list of surnames prepared in the Koi8-R encoding, in alphabetical order. Let's use the sort command, which sorts a text file in ascending or descending order of character codes. If you apply it right away, then, for example, the letter В will end up near the end of the list, just like the corresponding Latin letter V. Remembering that in the alternative encoding the Russian letters are arranged strictly alphabetically, we perform a series of operations: transcode the text into the alternative encoding, sort it, and convert it back to the Koi8-R encoding. Using a command pipeline, we get

koi2alt < List.txt | sort | alt2koi > List_Sort.txt

In modern Linux distributions many of the problems associated with software localization have been solved. In particular, the sort utility now takes the peculiarities of the Koi8-R encoding into account, and to sort a file in alphabetical order it is enough to simply run sort.

Java has been using the Unicode standard for character encoding almost since its inception. Java library functions expect to see Unicode characters in char variables. In principle, of course, you can stuff anything in there - numbers are numbers, the processor will put up with anything - but for any processing the library functions will act on the assumption that they were given Unicode. So you can safely assume that the encoding of char is fixed. But that is inside the JVM. When data is read from outside or passed outside, it can be represented by only one type - the byte type. All other types are constructed from bytes depending on the data format used. This is where encodings come into play: in Java an encoding is simply a data format for transferring characters, used to form data of type char. For each code page the library has two conversion classes (ByteToChar and CharToByte). These classes live in the sun.io package. If, when converting from char to byte, no corresponding character is found, it is replaced with the '?' character.
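For example, using the String methods described below, it is easy to see this substitution in action; a tiny sketch (the CJK character is just an arbitrary example of a character missing from Cp1251):

public class ReplacementDemo {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        // A character that has no counterpart in the Cp1251 code page
        String s = "\u4e2d";
        byte[] b = s.getBytes("Cp1251");
        System.out.println((char) (b[0] & 0xFF)); // prints '?'
    }
}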

By the way, these codepage files in some early versions of JDK 1.1 contain bugs that cause conversion errors or even runtime exceptions. For example, this concerns the KOI8_R encoding. The best thing to do in this case is to upgrade to a later version. Judging by Sun's description, most of these problems were resolved in JDK 1.1.6.

Prior to JDK 1.4, the set of available encodings was determined solely by the JDK vendor. Starting with 1.4, a new API appeared (the java.nio.charset package), with which you can create your own encoding (for example, to support one that is rarely used but badly needed by you).
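The same package also lets you find out which encodings your particular JVM actually knows about. A small sketch (written in pre-generics style, in keeping with the JDK versions discussed here):

import java.nio.charset.Charset;
import java.util.Iterator;
import java.util.Map;

public class ListCharsets {
    public static void main(String[] args) {
        // Enumerate the charsets known to this JVM, together with their aliases
        Map charsets = Charset.availableCharsets();
        for (Iterator it = charsets.keySet().iterator(); it.hasNext();) {
            String name = (String) it.next();
            Charset cs = (Charset) charsets.get(name);
            System.out.println(name + " " + cs.aliases());
        }
        // A quick check for the Russian code pages
        System.out.println("KOI8-R: " + Charset.isSupported("KOI8-R"));
        System.out.println("windows-1251: " + Charset.isSupported("windows-1251"));
    }
}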

String class

In most cases, Java uses an object of type java.lang.String to represent strings. This is an ordinary class that internally stores an array of characters (char) and contains many useful methods for manipulating them. The most interesting are the constructors that take a byte array as the first parameter, and the getBytes() methods. With these methods you can convert byte arrays to strings and back. To specify which encoding to use, these methods take a string parameter with its name. For example, here is how you can convert bytes from KOI-8 to Windows-1251:

// Data encoded in KOI-8
byte[] koi8Data = ...;
// Convert from KOI-8 to Unicode
String string = new String(koi8Data, "KOI8_R");
// Convert from Unicode to Windows-1251
byte[] winData = string.getBytes("Cp1251");

The list of 8-bit encodings available in modern JDKs that support Russian letters can be found below, in the section "8-bit encodings of Russian letters".

Since an encoding is simply a data format for characters, besides the familiar 8-bit encodings Java also supports multibyte encodings on an equal footing. These include UTF-8, UTF-16, Unicode, etc. For example, this is how you can get bytes in the UnicodeLittleUnmarked format (16-bit Unicode encoding, low byte first, no byte order mark):

// Convert from Unicode to UnicodeLittleUnmarked
byte[] data = string.getBytes("UnicodeLittleUnmarked");

It is easy to make a mistake with such conversions: if the encoding of the byte data does not match the parameter specified when converting from byte to char, the recoding will not be performed correctly. Sometimes the correct characters can be pulled out afterwards, but more often than not part of the data is irretrievably lost.

In a real program it is not always convenient to specify the code page explicitly (although it is more reliable). For this purpose a default encoding was introduced. It depends on the system and its settings (for Russian Windows it is Cp1251), and in old JDKs it could be changed by setting the file.encoding system property. In JDK 1.3 changing this setting sometimes works and sometimes doesn't. The reason is the following: initially file.encoding is set according to the regional settings of the computer. The reference to the default encoding is remembered internally during the first conversion. file.encoding is used for this, but that conversion happens even before the JVM startup arguments are applied (in fact, while they are being parsed). Actually, according to Sun, this property reflects the system encoding and should not be changed on the command line (see, for example, the comments on the corresponding BugID). Nevertheless, in JDK 1.4 Beta 2 changing this setting again began to take effect. Whether this is a deliberate change or a side effect that may disappear again, the Sun folks have not yet given a clear answer.

This encoding is used when the page name is not explicitly specified. This should always be remembered - Java will not try to predict the encoding of the bytes you pass to create a String (it also won't be able to read your thoughts on this :-). It just uses the current default encoding. Because this setting is the same for all transformations, sometimes you can run into trouble.
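To see which default your particular JVM has picked up, it is enough to print the corresponding system property; a trivial sketch (whether overriding file.encoding on the command line actually affects the conversions depends on the JDK version, as discussed above):

public class ShowDefaultEncoding {
    public static void main(String[] args) {
        // The encoding used by new String(byte[]) and String.getBytes()
        // when no code page name is given explicitly
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
    }
}

Running it as "java -Dfile.encoding=KOI8_R ShowDefaultEncoding" shows whether your JDK honours the override at all.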

To convert from bytes to characters and back, use only these methods. In most cases a simple type cast cannot be used - the character encoding will not be taken into account. For example, one of the most common mistakes is reading data byte by byte with the read() method of an InputStream and then casting the resulting value to char:

InputStream is = ..;
int b;
StringBuffer sb = new StringBuffer();
while ((b = is.read()) != -1) {
    sb.append((char) b); // <- do not do this
}
String s = sb.toString();

Pay attention to the type cast - "(char) b". The byte values are simply copied into char instead of being re-encoded (the range of values is 0-0xFF, not the one where the Cyrillic alphabet lives). Such copying corresponds to the ISO-8859-1 encoding (which maps one-to-one to the first 256 Unicode values), which means this code simply assumes that encoding (instead of the one the characters in the source data are actually encoded in). If you try to display the resulting value, the screen will show either question marks or garbage. For example, when reading the string "АБВ" in the Windows encoding, something like "ÀÁÂ" can easily appear instead. This kind of code is often written by programmers in the West - it works with English letters, and that's fine. Fixing this code is easy: just replace StringBuffer with ByteArrayOutputStream:

InputStream is = ..;
int b;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
while ((b = is.read()) != -1) {
    baos.write(b);
}
// Convert the bytes to a string using the default encoding
String s = baos.toString();
// If you need a specific encoding, just specify it when calling toString():
// s = baos.toString("Cp1251");
For more information on common errors, see the section.

8-bit encodings of Russian letters

Here are the main 8-bit encodings of Russian letters that have become widespread: Cp866 (the DOS "alternative" encoding), Cp1251 (Windows), KOI8-R (traditional for Unix/Linux) and ISO 8859-5.

In addition to the main name, synonyms can be used. The set of them may differ in different versions of the JDK. Here is a list from JDK 1.3.1:

  • Cp1251:
    • Windows-1251
  • Cp866:
    • IBM866
    • IBM-866
    • CP866
    • CSIBM866
  • KOI8_R:
    • KOI8-R
    • CSKOI8R
  • ISO8859_5:
    • ISO8859-5
    • ISO-8859-5
    • ISO_8859-5
    • ISO_8859-5:1988
    • ISO-IR-144
    • 8859_5
    • Cyrillic
    • CSISOLatinCyrillic
    • IBM915
    • IBM-915
    • Cp915

Moreover, synonyms, unlike the main name, are case-insensitive - this is a feature of the implementation.

It is worth noting that these encodings may not be available on some JVMs. For example, you can download two different versions of the JRE from the Sun site - US and International. The US version contains only the minimum: ISO-8859-1, ASCII, Cp1252, UTF8, UTF16 and a few variations of double-byte Unicode. Everything else is available only in the International version. Sometimes because of this you can hit a snag launching a program, even if it does not need Russian letters. A typical error in this case:

Error occurred during initialization of VM
java/lang/ClassNotFoundException: sun/io/ByteToCharCp1251

It arises, as is easy to guess, because the JVM, based on the Russian regional settings, tries to set the default encoding to Cp1251, but since the supporting class is absent in the US version, it naturally fails.

Files and data streams

Just as bytes are conceptually separated from characters, Java distinguishes between byte streams and character streams. Working with bytes is represented by classes that directly or indirectly inherit from InputStream or OutputStream (plus the one-of-a-kind RandomAccessFile class). Working with characters is represented by the sweet couple of Reader/Writer classes (and their descendants, of course).

Byte streams are used to read/write unconverted bytes. If you know that the bytes represent only characters in a certain encoding, you can use the special converter classes InputStreamReader and OutputStreamWriter to get a character stream and work with it directly. This is usually handy for plain text files or when working with many of the Internet's network protocols. The character encoding is specified in the constructor of the converter class. Example:

// Unicode string
String string = "...";
// Write the string to a text file in Cp866 encoding
PrintWriter pw = new PrintWriter(          // class with methods for writing strings
    new OutputStreamWriter(                // converter class
        new FileOutputStream("file.txt"), "Cp866"));
pw.println(string); // write the line to the file
pw.close();
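Reading is done with the mirror-image construction, only with InputStreamReader; a minimal sketch that reads the same file back (the java.io classes are assumed to be imported):

// Read the text file back, decoding the Cp866 bytes into characters
BufferedReader br = new BufferedReader(
    new InputStreamReader(new FileInputStream("file.txt"), "Cp866"));
String line = br.readLine(); // line now contains proper Unicode characters
br.close();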

If the stream may contain data in different encodings, or characters are mixed with other binary data, then it is better to read and write byte arrays (byte[]) and use the already mentioned methods of the String class for the conversion. Example:

// Unicode string
String string = "...";
// Write the string to a text file in two encodings (Cp866 and Cp1251)
OutputStream os = new FileOutputStream("file.txt"); // class for writing bytes to a file
// Write the string in Cp866 encoding
os.write(string.getBytes("Cp866"));
// Write the string in Cp1251 encoding
os.write(string.getBytes("Cp1251"));
os.close();

The console in Java is traditionally represented by streams - unfortunately, byte streams rather than character streams. The thing is that character streams appeared only in JDK 1.1 (along with the whole encoding mechanism), while console I/O was designed back in JDK 1.0, which resulted in the misbegotten PrintStream class. This class is used in the System.out and System.err variables, which actually give access to console output. By all accounts it is a byte stream, but with a bunch of methods for writing strings. When you write a string to it, it is internally converted to bytes using the default encoding, which is usually unacceptable on Windows: the default encoding will be Cp1251 (Ansi), while the console window usually needs Cp866 (OEM). This error was registered back in 1997, but the Sun folks seem to be in no hurry to fix it. Since there is no method for setting the encoding in PrintStream, you can work around the problem by substituting your own class via the System.setOut() and System.setErr() methods. For example, here is the usual beginning of my programs:

...
public static void main(String[] args) {
    // Set up console message output in the desired encoding
    try {
        String consoleEnc = System.getProperty("console.encoding", "Cp866");
        System.setOut(new CodepagePrintStream(System.out, consoleEnc));
        System.setErr(new CodepagePrintStream(System.err, consoleEnc));
    } catch (UnsupportedEncodingException e) {
        System.out.println("Unable to setup console codepage: " + e);
    }
    ...

You can find the sources of the CodepagePrintStream class on this site: CodepagePrintStream.java.

If you are constructing a data format of your own, I recommend using one of the multibyte encodings. The most convenient is usually UTF8: the first 128 values (ASCII) are encoded in one byte each, which can often significantly reduce the total amount of data (it is not for nothing that this encoding is taken as the basis in the XML world). But UTF8 has one drawback: the number of bytes required depends on the character code. Where this is critical, you can use one of the two-byte Unicode formats (UnicodeBig or UnicodeLittle).
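A quick way to feel the difference is to compare the byte counts; a small sketch (the strings are arbitrary):

public class SizeDemo {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        String latin = "Hello";
        String russian = "\u041F\u0440\u0438\u0432\u0435\u0442"; // "Привет"
        // In UTF8, ASCII characters take 1 byte each, Cyrillic ones take 2
        System.out.println("UTF8: " + latin.getBytes("UTF8").length
                + " and " + russian.getBytes("UTF8").length + " bytes");
        // UnicodeBig always uses 2 bytes per character plus a 2-byte byte order mark
        System.out.println("UnicodeBig: " + latin.getBytes("UnicodeBig").length
                + " and " + russian.getBytes("UnicodeBig").length + " bytes");
    }
}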

Database

In order for characters to be read correctly from a database, it is usually enough to tell the JDBC driver which character encoding is used in the database. How exactly depends on the specific driver. Nowadays many drivers already support this setting, in contrast to the recent past. Here are a few examples I know of.

JDBC-ODBC bridge

This is one of the most commonly used drivers. The bridge from JDK 1.2 and later can easily be configured to the desired encoding. This is done by adding a charSet property to the set of parameters passed when opening a connection to the database. The default is file.encoding. It is done something like this:

// Establish a connection
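Presumably the full fragment looked roughly like this (a sketch; the ODBC data source name MyDataSource is made up for the example, the charSet property is the one discussed above):

// Database connection parameters
Properties connInfo = new Properties();
connInfo.put("charSet", "Cp1251");
// Establish a connection
Connection db = DriverManager.getConnection("jdbc:odbc:MyDataSource", connInfo);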

Oracle 8.0.5 JDBC-OCI driver for Linux

When receiving data from the database, this driver determines "its" encoding from the NLS_LANG environment variable. If this variable is not found, it assumes the encoding is ISO-8859-1. The trick is that NLS_LANG must be an environment variable (set with the set command), not a Java system property (like file.encoding). If the driver is used inside the Apache+JServ servlet engine, the environment variable can be set in the jserv.properties file:

wrapper.env=NLS_LANG=American_America.CL8KOI8R
Information about this was sent by Sergey Bezrukov, for which special thanks to him.

JDBC driver for working with DBF (zyh.sql.dbf.DBFDriver)

This driver has only recently learned to work with Russian letters. Even though it reports via getPropertyInfo() that it understands the charSet property, this is a fiction (at least in the version dated 2001-07-30). In reality you can configure the encoding by setting the CODEPAGEID property. For Russian characters two values are available: "66" for Cp866 and "C9" for Cp1251. Example:

// Database connection parameters
Properties connInfo = new Properties();
connInfo.put("CODEPAGEID", "66"); // Cp866 encoding
// Establish a connection
Connection db = DriverManager.getConnection("jdbc:DBF:/C:/MyDBFFiles", connInfo);
If you have DBF files in FoxPro format, they have their own specifics. The fact is that FoxPro stores in the file header the ID of the code page (the byte at offset 0x1D) that was used to create the DBF. When opening a table, the driver uses the value from the header rather than the "CODEPAGEID" parameter (in this case the parameter is used only when creating new tables). Accordingly, for everything to work properly the DBF file must have been created with the correct encoding - otherwise there will be problems.
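If in doubt, you can peek at that header byte yourself; a small sketch (the file path is made up, and the values to expect are the ones mentioned above - "66" for Cp866, "C9" for Cp1251):

import java.io.RandomAccessFile;

public class DbfCodePage {
    public static void main(String[] args) throws java.io.IOException {
        // The code page ID is stored in the DBF header at offset 0x1D
        RandomAccessFile raf = new RandomAccessFile("C:/MyDBFFiles/table.dbf", "r");
        raf.seek(0x1D);
        int id = raf.read();
        raf.close();
        System.out.println("Code page ID: " + Integer.toHexString(id).toUpperCase());
    }
}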

MySQL (org.gjt.mm.mysql.Driver)

With this driver, everything is pretty simple too:

// Database connection parameters
Properties connInfo = new Properties();
connInfo.put("user", user);
connInfo.put("password", pass);
connInfo.put("useUnicode", "true");
connInfo.put("characterEncoding", "KOI8_R");
Connection conn = DriverManager.getConnection(dbURL, connInfo);

InterBase (interbase.interclient.Driver)

For this driver, the "charSet" parameter works:
// Database connection parameters
Properties connInfo = new Properties();
connInfo.put("user", username);
connInfo.put("password", password);
connInfo.put("charSet", "Cp1251");
// Establish a connection
Connection db = DriverManager.getConnection(dataurl, connInfo);

However, do not forget to specify the character encoding when creating the database and tables. For the Russian language, you can use the values ​​"UNICODE_FSS" or "WIN1251". Example:

CREATE DATABASE 'E:\Project\Holding\DataBase\HOLDING.GDB' PAGE_SIZE 4096 DEFAULT CHARACTER SET UNICODE_FSS;
CREATE TABLE RUSSIAN_WORD (
    "NAME1" VARCHAR(40) CHARACTER SET UNICODE_FSS NOT NULL,
    "NAME2" VARCHAR(40) CHARACTER SET WIN1251 NOT NULL,
    PRIMARY KEY ("NAME1")
);

There is a bug in version 2.01 of InterClient - the resource classes with messages for the Russian language are not compiled correctly there. Most likely, the developers simply forgot to specify the source encoding when compiling. There are two ways to fix this error:

  • Use interclient-core.jar instead of interclient.jar. At the same time, there will simply be no Russian resources, and the English ones will be picked up automatically.
  • Recompile the resource files into proper Unicode. Parsing class files is a thankless task, so it is better to use JAD. Unfortunately, when JAD encounters characters from the ISO-8859-1 set it outputs them as octal escapes, so you won't be able to use the standard native2ascii encoder - you have to write your own (the Decode program). If you don't want to bother with these problems, you can just take a ready-made file with resources (a patched jar with the driver - interclient.jar, the separate resource classes - interclient-rus.jar).

But even having configured the JDBC driver to the desired encoding, in some cases you can still run into trouble. For example, when trying to use the wonderful new JDBC 2 scrolling cursors in the JDBC-ODBC bridge from JDK 1.3.x, you quickly find that Russian letters simply do not work there (the updateString() method).

There is a small story connected with this error. When I first discovered it (under JDK 1.3 rc2), I registered it with BugParade. When the first beta of JDK 1.3.1 was released, this bug was flagged as fixed. Delighted, I downloaded the beta and ran the test - it didn't work. I wrote to the Sun folks about it - they replied that the fix would be included in future releases. Okay, I thought, let's wait. Time passed, release 1.3.1 came out, then beta 1.4. Finally I found the time to check - it doesn't work again. Mother, mother, mother... - the echo echoed habitually. After an angry letter to Sun, they opened a new bug, which they handed over to their Indian branch to deal with. The Indians fiddled with the code and said that everything was fixed in 1.4 beta3. I downloaded that version and ran my test case under it - with the same result. As it turned out, the beta3 distributed on the site (build 84) is not the beta3 in which the final fix was included (build 87). Now they promise that the fix will appear in 1.4 rc1... Well, in general, you get the idea :-)

Russian letters in the sources of Java programs

As mentioned, Unicode is used while the program executes. The source files, however, are written in ordinary editors. I use Far; you probably have your own favorite editor. These editors save files in an 8-bit format, which means that reasoning similar to the above applies to these files as well. Different versions of the compiler perform the character conversion slightly differently. Earlier JDK 1.1.x versions use the file.encoding setting, which can be overridden with the non-standard -J option. In newer ones (as reported by Denis Kokarev, starting from 1.1.4) an additional -encoding parameter was introduced, with which you can specify the encoding being used. In compiled classes strings are represented in Unicode (more precisely, in a modified variant of the UTF8 format), so the most interesting thing happens during compilation. Therefore, the most important thing is to find out what encoding your source code is in and to specify the correct value when compiling. By default the same notorious file.encoding will be used. An example of calling the compiler:
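Presumably the call looked something like this (a sketch; the file name is made up):

javac -encoding Cp1251 MyProgram.java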

In addition to using this setting, there is one more method: specifying letters in the "\uXXXX" format, where the character code is given. This method works with all versions, and you can use the standard native2ascii utility to obtain these codes.
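For example, a string literal with Russian letters can be written in a form that survives any source encoding; the escapes below spell the word "Привет":

// The same string written with \uXXXX escapes - it compiles correctly
// regardless of the encoding of the source file
String greeting = "\u041F\u0440\u0438\u0432\u0435\u0442";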

If you use an IDE, it may have its own glitches. Often these IDEs use the default encoding for reading/saving sources, so pay attention to the regional settings of your OS. Besides, there may be outright bugs: for example, the otherwise quite good CodeGuide IDE does not digest the capital Russian letter "Т" well. Its built-in code analyzer takes this letter for a double quote, which leads to correct code being flagged as erroneous. You can fight this (by replacing the letter "Т" with its code "\u0422"), but it is unpleasant. Apparently, somewhere inside the parser an explicit conversion of characters to bytes (like byte b = (byte) c) is used, so instead of the code 0x0422 (the code of the letter "Т") the code becomes 0x22 (the code of the double quote).

JBuilder has another problem, though it relates more to ergonomics. The fact is that JDK 1.3.0, under which JBuilder runs by default, has a bug due to which newly created GUI windows, on activation, automatically switch the keyboard layout according to the regional settings of the OS. That is, if you have Russian regional settings, it will constantly try to switch to the Russian layout, which gets in the way when writing programs. The JBuilder.ru site has a couple of patches that change the current locale in the JVM to Locale.US, but the best way is to upgrade to JDK 1.3.1, in which this bug is fixed.

Novice JBuilder users may also run into this problem: Russian letters are saved as "\uXXXX" codes. To avoid this, in the Default Project Properties dialog, on the General tab, in the Encoding field, change Default to Cp1251.

If you use a compiler other than the standard javac, pay attention to how it performs the character conversion. For example, some versions of the IBM jikes compiler do not understand that there are encodings other than ISO-8859-1 :-). There are versions patched in this respect, but often a particular encoding is hard-wired there too - there is no such convenience as in javac.

JavaDoc

To generate HTML documentation from source code, the javadoc utility included in the standard JDK distribution is used. To specify encodings it has as many as three parameters (an example call is given after the list):

  • -encoding - this setting specifies the source encoding. The default is file.encoding.
  • -docencoding - this setting specifies the encoding of the generated HTML files. The default is file.encoding.
  • -charset - this setting specifies the encoding that will be written into the headers of the generated HTML files (in the META Content-Type tag). Obviously, it should agree with the previous setting. If this setting is omitted, the meta tag will not be added.
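Put together, a call might look like this (a sketch; the package path and output directory are made up):

javadoc -encoding Cp1251 -docencoding Cp1251 -charset windows-1251 -d docs ru/mypackage/*.java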

Russian letters in properties files

Resource loading methods, which work in a specific way, are used to read properties files. Actually, the Properties.load method is used for reading, and it does not use file.encoding (the ISO-8859-1 encoding is hard-coded in the sources), so the only way to specify Russian letters is to use the "\uXXXX" format and the native2ascii utility.
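A typical workflow, therefore, is to keep the file in a readable encoding and convert it with native2ascii before packaging; loading then works as usual (a sketch; the key and file names are made up):

// Beforehand: native2ascii -encoding Cp1251 messages_src.properties messages.properties
Properties props = new Properties();
props.load(new FileInputStream("messages.properties"));
System.out.println(props.getProperty("greeting")); // comes back as proper Unicode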

The Properties.save method works differently in JDK 1.1 and 1.2. In 1.1 it simply discarded the high byte, so it worked correctly only with English letters. 1.2 performs the reverse conversion to "\uXXXX", so it works as a mirror image of the load method.

If your properties files are loaded not as resources but as ordinary configuration files, and you are not satisfied with this behaviour, there is only one way out: write your own loader.
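A minimal sketch of such a loader (it understands only simple key=value lines, without continuations or escapes; the encoding is passed in explicitly):

import java.io.*;
import java.util.Properties;

public class EncodedPropertiesLoader {
    // Reads key=value pairs from a stream in the given encoding,
    // instead of the ISO-8859-1 hard-coded in Properties.load()
    public static Properties load(InputStream in, String enc) throws IOException {
        Properties props = new Properties();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, enc));
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0 || line.startsWith("#")) continue; // skip blanks and comments
            int eq = line.indexOf('=');
            if (eq > 0) {
                props.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return props;
    }
}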

Russian letters in Servlets.

Well, what Servlets are for, I think you know. If not, it is best to read the documentation first. Here only the peculiarities of working with Russian letters are described.

So what are the features? When a Servlet sends a response to a client, there are two ways to send it: via an OutputStream (the getOutputStream() method) or via a PrintWriter (the getWriter() method). In the first case you are writing byte arrays, so the methods of writing to streams described above apply. In the case of PrintWriter, it uses the encoding that has been set. In any case, the encoding used must be specified correctly when calling the setContentType() method, so that the correct character conversion happens on the server side. This directive must be issued before the call to getWriter() or before the first write to the OutputStream. Example:

// Set the encoding of the response
// Note that some engines do not allow
// a space between ';' and 'charset'
response.setContentType("text/html;charset=UTF-8");
PrintWriter out = response.getWriter();
// Debug output of the encoding name, to check
out.println("Encoding: " + response.getCharacterEncoding());
...
out.close();

So much for sending responses to the client. With input parameters, unfortunately, it is not so simple. Input parameters are encoded by the browser into bytes according to the MIME type "application/x-www-form-urlencoded". As Alexey Mendelev said, browsers encode Russian letters using the currently set encoding, and, of course, nothing is reported about it. Accordingly, for example, in JSDK versions 2.0 through 2.2 this is not checked in any way, and which encoding will be used for the conversion depends on the engine being used. Starting with specification 2.3, it became possible to set the encoding for javax.servlet.ServletRequest - the setCharacterEncoding() method. The latest versions of Resin and Tomcat already support this specification.

Thus, if you are lucky and you have a server with Servlet 2.3 support, then everything is quite simple:

public void doPost(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    // Encoding of the request
    request.setCharacterEncoding("Cp1251");
    String value = request.getParameter("value");
    ...

There is one significant subtlety in using the request.setCharacterEncoding() method: it must be called before the first request for data (for example, request.getParameter()). If you use filters that process the request before it reaches the servlet, there is a nonzero chance that one of the filters may read some parameter from the request (for example, for authorization), and then request.setCharacterEncoding() in the servlet will have no effect...

Therefore, it is ideologically more correct to write a filter that sets the request encoding. Moreover, it must be the first in the chain of filters in web.xml.

An example of such a filter:

import java.io.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class CharsetFilter implements Filter {
    // encoding
    private String encoding;

    public void init(FilterConfig config) throws ServletException {
        // read it from the config
        encoding = config.getInitParameter("requestEncoding");
        // if it is not set, use Cp1251
        if (encoding == null) encoding = "Cp1251";
    }

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain next)
            throws IOException, ServletException {
        request.setCharacterEncoding(encoding);
        next.doFilter(request, response);
    }

    public void destroy() {}
}

And its config in web.xml:

<filter>
    <filter-name>Charset Filter</filter-name>
    <filter-class>CharsetFilter</filter-class>
</filter>

<filter-mapping>
    <filter-name>Charset Filter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>

If you are unlucky and have an older version, you will have to resort to tricks to achieve the result:

    An original way of working with encodings is offered by Russian Apache - exactly how is described in its documentation.

  • FOP

    The FOP package is designed for processing documents according to the XSL FO (Formatting Objects) standard. In particular, it allows you to create PDF documents from XML documents. By default the FOP package uses the Xalan XSLT processor paired with Xerces to transform from XML to FO. To create the final image, fonts that support Russian letters need to be connected to FOP. Here is how you can do it for version 0.20.1:

    1. Copy the ttf files from the Windows system directory into the conf\fonts subdirectory (for example, into c:\fop-0.20.1\conf\fonts). For Arial normal/normal, normal/bold, italic/normal and italic/bold the files arial.ttf, arialbd.ttf, ariali.ttf and arialbi.ttf are needed.
    2. Generate font description files (such as arial.xml). To do this, for each font, you need to run the command (this is for Arial normal / normal, all in one line):
      java -cp .;c:\fop-0.20.1\build\fop.jar;c:\fop-0.20.1\lib\batik.jar;c:\fop-0.20.1\lib\xalan-2.0.0.jar;c:\fop-0.20.1\lib\xerces.jar;c:\fop-0.20.1\lib\jimi-1.0.jar org.apache.fop.fonts.apps.TTFReader fonts\arial.ttf fonts\arial.xml
    3. In FOP, add to conf/userconfig.xml a description of the font with Russian letters, such as:
      Arial normal/bold, italic/normal and italic/bold are added similarly.
    4. When calling FOP from the command line, after org.apache.fop.apps.Fop add -c c:\fop-0.20.1\conf\userconfig.xml. If you need to use FOP from a servlet, then in the servlet, after the line
      Driver driver = new Driver();
      add lines:
      // The fonts directory (c:\weblogic\fonts) was created for convenience only
      String userConfig = "fonts/userconfig.xml";
      File userConfigFile = new File(userConfig);
      Options options = new Options(userConfigFile);
      Then the location of the ttf files in the userconfig.xml file can be specified relative to the root of the application server, without specifying the absolute path:
    5. In the FO file (or XML and XSL), before using the font, write:
      font-family="Arial"
      font-weight="bold" (if using Arial bold)
      font-style="italic" (if using Arial italic)

    This algorithm was sent by Alexey Tyurin, for which a special thanks to him.

    If you use the viewer built into FOP, you need to take its peculiarities into account. In particular, although its labels are supposedly Russified, in fact this was done with an error (in version 0.19.0). To load the labels from resource files, the org.apache.fop.viewer.resources package uses its own loader (the org.apache.fop.viewer.LoadableProperties class). The reading encoding is hard-coded there (8859_1, as in the case of Properties.load()), but support for the "\uXXXX" notation is not implemented. I reported this bug to the developers; they included a fix in their plans.

    Among other things, there is a site dedicated to the Russification of FOP (http://www.openmechanics.org/rusfop/). There you can find a FOP distribution with these errors already fixed and with Russian fonts connected.

    CORBA

    The CORBA standard provides a type that corresponds to the Java String type. This is the wstring type. Everything is fine, but some CORBA servers do not support