
Stranger Things about Java Characters

  What you are reading is the first in a series of articles titled “Stranger things in Java”, inspired by the contents of my book “Java for Aliens”. These articles offer insights into the Java language. Digging deeper into the topics we use every day will allow us to master Java coding even in the strangest scenarios.
Do you know that the following is a valid Java statement?
\u0069\u006E\u0074 \u0069 \u003D \u0038\u003B
  You can copy and paste it inside the main method of any class and compile it. If you then add the following statement right after it:
System.out.println(i);
by running that class, you will see the number 8 printed! And did you know that the following comment instead produces a syntax error at compile time?
/*
 * The file will be generated inside the C:\users\claudio folder
 */
  Yet comments should not produce syntax errors. In fact, programmers often comment out pieces of code just to make the compiler ignore them… so what’s going on? If you have already understood everything, you can skip the rest of the article; otherwise, you can spend a few minutes reviewing a bit of basic Java: the primitive type char.
  The mystery of the comment error and other stories…

Primitive Character Data Type

As everyone knows, the char type is one of the eight primitive Java types. It allows us to store characters, one at a time. Below is a simple example where a character value is assigned to a char variable:
char aCharacter = 'a';
  Actually, this data type is not used frequently, because in most cases programmers need character sequences and therefore prefer strings. Each character literal must be enclosed in single quotes, not to be confused with the double quotes that are used for string literals. A string declaration follows:
String s = "Java melius semper quam latinam linguam est";
  There are three ways to assign a literal value to a char type, and all three modes require the inclusion of the value between single quotes:
  • use a single printable character on the keyboard (for example '&');
  • use the Unicode format with hexadecimal notation (for example '\u0061', which is equivalent to the decimal number 97 and which identifies the 'a' character);
  • use a special escape character (for example '\n' which indicates the line feed character).
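As a minimal sketch, the three forms listed above can be placed side by side. The class and method names here (CharLiteralForms, printable, unicodeFormat, escape) are hypothetical and exist only for this illustration:

```java
// Hypothetical demo class: the names are illustrative only.
class CharLiteralForms {

    // 1. A printable character typed directly between single quotes.
    static char printable() {
        return '&';
    }

    // 2. The Unicode format: a prefix followed by four hexadecimal digits.
    static char unicodeFormat() {
        return '\u0061'; // the lower-case letter a, decimal number 97
    }

    // 3. A special escape character, here the line feed.
    static char escape() {
        return '\n';
    }

    public static void main(String[] args) {
        System.out.println(printable());      // &
        System.out.println(unicodeFormat());  // a
        System.out.println((int) escape());   // 10
    }
}
```

The cast to int in the last line prints the number associated with the line feed instead of an empty line.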
Let’s add some details in the next three sections.

Printable Keyboard Characters

We can assign to a char variable any character found on our keyboard, provided that our system settings support the required character and that the character is printable (for example, the Delete and Enter keys are not printable). In any case, the literal assignable to a char primitive type is always enclosed in single quotes. Here are some examples:
char aUppercase = 'A';
char minus = '-';
char at = '@';
  The char data type is stored in 2 bytes (16 bits), with a range consisting only of non-negative numbers, from 0 to 65535. In fact, there is a ‘mapping’ that associates a certain character with each number. This mapping (or encoding) is defined by the Unicode standard (further described in the next section).
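The relationship between a character and its number can be checked directly. The following is only a sketch: the class CharRange and its two helper methods are hypothetical names, while Character.MIN_VALUE, Character.MAX_VALUE and Character.BYTES are constants of the standard java.lang.Character class:

```java
// Hypothetical demo class showing that a char is a 16-bit unsigned number.
class CharRange {

    // A char widens to int without a cast; the result is its Unicode number.
    static int numberOf(char c) {
        return c;
    }

    // The opposite direction needs an explicit cast, because int is wider than char.
    static char characterOf(int number) {
        return (char) number;
    }

    public static void main(String[] args) {
        System.out.println(numberOf('A'));              // 65
        System.out.println(characterOf(64));            // @
        System.out.println((int) Character.MIN_VALUE);  // 0
        System.out.println((int) Character.MAX_VALUE);  // 65535
        System.out.println(Character.BYTES);            // 2
    }
}
```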

Unicode Format (Hexadecimal Notation)

We said that the char primitive type is stored in 16 bits, and therefore can define as many as 65536 different characters. Unicode encoding deals with standardizing all the characters (and also symbols, emojis, ideograms etc.) that exist on this planet. Unicode is an extension of the encoding known as ISO-8859-1 (Latin-1), one of the old 8-bit Extended ASCII standards, which in turn contains the oldest standard known as ASCII code (acronym for American Standard Code for Information Interchange). UTF-8, by contrast, is not a predecessor of Unicode but one of the formats used to store Unicode characters as bytes. You can take a look at an ASCII table at this link. We can directly assign to a char a Unicode value in hexadecimal format using 4 digits, which uniquely identify a given character, preceded by the prefix \u (where the u is always lower case). For example:
char phiCharacter = '\u03A6';  // Capital Greek letter Φ
char nonIdentifiedUnicodeCharacter = '\uABC8';
  In this case we’re talking about a literal in Unicode format (or a literal in hexadecimal format). In fact, 4 hexadecimal digits cover exactly 65536 characters. Actually, Java 15 supports Unicode version 13.0, which contains many more than 65536 characters. Today the Unicode standard has evolved a lot, and now allows us to represent potentially over a million characters, although only 143,859 numbers have been assigned to a character so far. The standard is constantly evolving (for more information visit this link). Anyway, to handle Unicode values that are outside the 16-bit range of a char type, we usually use classes like String and Character, but since this is a very rare case and not interesting for the purpose of this article, we will not talk about it.

Special Escape Characters

In a char type it is also possible to store special escape characters, that is, sequences of characters that cause particular behaviors in the printing:
  • \b is equivalent to a backspace, that is a cancellation to the left (equivalent to the Backspace key)
  • \n is equivalent to a line feed (equivalent to the Enter key)
  • \\ equals only one \ (just because the \ character is used for escape characters)
  • \t is equivalent to a horizontal tab (equivalent to the TAB key)
  • \' is equivalent to a single quote (a single quote delimits the literal of a character)
  • \" is equivalent to a double quote (a double quote delimits the literal of a string)
  • \r represents a carriage return (special character that moves the cursor to the beginning of the line)
  • \f represents a form feed (disused special character representing the cursor moving to the next page of the document)
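Each escape character in the list above is stored as an ordinary number, like any other char. The sketch below prints those numbers; the class EscapeCodes and its code method are hypothetical names used only for this example:

```java
// Hypothetical demo class: prints the number stored for each escape character.
class EscapeCodes {

    // Returns the Unicode number stored for the given character.
    static int code(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println(code('\b'));  // 8  backspace
        System.out.println(code('\t'));  // 9  horizontal tab
        System.out.println(code('\n'));  // 10 line feed
        System.out.println(code('\f'));  // 12 form feed
        System.out.println(code('\r'));  // 13 carriage return
        System.out.println(code('\"'));  // 34 double quote
        System.out.println(code('\''));  // 39 single quote
        System.out.println(code('\\'));  // 92 backslash
    }
}
```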
Note that assigning the literal '"' to a character is perfectly legal, so the following statement:
System.out.println('"');
  which is equivalent to the following code:
char doubleQuotes = '"';
System.out.println(doubleQuotes);
  is correct and will print the double quote character:
"
  If we try not to use the escape character for a single quote, for example with the following statement:
System.out.println(''');
  we will get the following compile-time errors, since the compiler will not be able to distinguish the character delimiters:
error: empty character literal
        System.out.println(''');
                           ^
error: unclosed character literal
        System.out.println(''');
                             ^
2 errors
  Since string literals are delimited by double quotes, the situation is reversed for strings. In fact, it is possible to represent single quotes within a string:
System.out.println("'IQ'");
  that will print:
'IQ'
  On the other hand, we must use the \" escape character to use double quotes within a string. So, the following statement:
System.out.println(""IQ"");
  will cause the following compilation errors:
error: ')' expected
        System.out.println(""IQ"");
                             ^
error: ';' expected
        System.out.println(""IQ"");
                               ^
2 errors
  Instead, the following instruction is correct:
System.out.println("\"IQ\"");
  and will print:
"IQ"

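The quoting rules of this section can be summarized in a short sketch: each delimiter needs an escape only inside its own kind of literal. The class QuoteEscapes and its methods are hypothetical names introduced for illustration:

```java
// Hypothetical demo class: which quote needs an escape, and where.
class QuoteEscapes {

    // Inside a string, double quotes need the escape; single quotes do not.
    static String doubleQuoted(String word) {
        return "\"" + word + "\"";
    }

    static String singleQuoted(String word) {
        return "'" + word + "'";
    }

    // Inside a char literal, the single quote needs the escape; the double quote does not.
    static char singleQuote() {
        return '\'';
    }

    static char doubleQuote() {
        return '"';
    }

    public static void main(String[] args) {
        System.out.println(doubleQuoted("IQ"));  // "IQ"
        System.out.println(singleQuoted("IQ"));  // 'IQ'
        System.out.println(singleQuote());       // '
        System.out.println(doubleQuote());       // "
    }
}
```
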
Write Java Code with the Unicode Format

The Unicode literal format can also be used to write any part of our code, not only char values. In fact, the compiler first transforms the Unicode format into the corresponding character, and only then evaluates the syntax. For example, we could rewrite the following statement:
int i = 8;
  in the following way:
\u0069\u006E\u0074 \u0069 \u003D \u0038\u003B
  In fact, if we add the following statement after the previous line:
System.out.println("i = " + i);
  it will print:
i = 8
  Undoubtedly, this is not a useful way to write our code. But it can be useful to know this feature in order to understand some mistakes that (rarely) happen.
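For readers who want to try this in a complete file, here is a minimal compilable sketch. The class UnicodeSource and its methods are hypothetical names; the escaped statement is the same one discussed above:

```java
// Hypothetical demo class: part of its source is written in the Unicode format.
class UnicodeSource {

    static int eight() {
        // Once the compiler has translated the escapes, the next line reads: int i = 8;
        \u0069\u006E\u0074 \u0069 \u003D \u0038\u003B
        return i;
    }

    // The same translation applies inside string literals.
    static String greeting() {
        return "\u0048\u0069"; // the two letters H and i
    }

    public static void main(String[] args) {
        System.out.println("i = " + eight());  // i = 8
        System.out.println(greeting());        // Hi
    }
}
```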

Unicode Format for Escape Characters

The fact that the Unicode hexadecimal format is transformed by the compiler before it evaluates the code has some consequences, and justifies the existence of escape characters. For example, let’s consider the line feed character, which can be represented with the escape character \n. In the Unicode encoding, the line feed is associated with the decimal number 10 (which corresponds to the hexadecimal number A). But if we try to define it using the Unicode format:
char lineFeed = '\u000A';
  we will get the following compile-time error:
error: illegal line end in character literal
        char lineFeed = '\u000A';
                        ^
1 error
  in fact, the compiler transforms the previous code into the following before evaluating it:
char lineFeed = '
';
  that is, the Unicode format has been transformed into the new line character, and the previous syntax is not a valid syntax for the Java compiler. Likewise, the single quote character (') that corresponds to the decimal number 39 (equivalent to the hexadecimal number 27) and that we can represent with the escape character \', cannot be represented with the Unicode format:
char singleQuote = '\u0027';
  Also in this case, the compiler will transform the previous code in this way:
char singleQuote = ''';
  which will give rise to the following compile-time errors:
error: empty character literal
        char singleQuote = '\u0027';
                           ^
error: unclosed character literal
        char singleQuote = '\u0027';
                                  ^
2 errors
  The first error is due to the fact that the first pair of quotes does not contain a character, while the second error indicates that the third single quote opens a character literal that is never closed. There are also problems with the carriage return character, represented by the hexadecimal number D (corresponding to the decimal number 13), and already representable with the escape character \r. In fact, if we write:
char carriageReturn = '\u000d';
  we will get the following compile-time error:
error: illegal line end in character literal
        char carriageReturn = '\u000d';
                              ^
1 error
  In fact, the compiler has transformed the Unicode format into a carriage return, which Java treats as a line terminator just like the line feed, so the character literal is interrupted by a line end before its closing single quote. As for the \ character, represented by the decimal number 92 (corresponding to the hexadecimal number 5C), and representable with the escape character \\, if we write:
char backSlash = '\u005C';
  we will get the following compile-time error:
error: unclosed character literal
        char backSlash = '\u005C';
                         ^
1 error
  This is because the previous code will have been transformed into the following:
char backSlash = '\';
  and therefore the \' pair of characters is interpreted as the escape character corresponding to a single quote ', so the literal is missing its closing single quote. On the other hand, let’s consider the " character, represented by the hexadecimal number 22 (corresponding to the decimal number 34) and by the escape character \". If we write:
char quotationMark = '\u0022';
  there will be no problem. But if we use this character within a string:
String quotationMarkString = "\u0022";
  we will get the following compile-time error:
error: unclosed string literal
   String quotationMarkString = "\u0022";
                                       ^
1 error 
  since the previous code will have been transformed into the following:
String quotationMarkString = """;
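A way around all the errors of this section is sketched below: use the escape characters, or cast the number, instead of the Unicode format. The class SafeSpecialChars and its constant names are hypothetical and chosen only for this example:

```java
// Hypothetical demo class: safe ways to obtain the characters discussed above.
class SafeSpecialChars {

    // Escape characters are resolved after the Unicode translation step,
    // so they never break the literal that contains them.
    static final char LINE_FEED = '\n';
    static final char CARRIAGE_RETURN = '\r';
    static final char SINGLE_QUOTE = '\'';
    static final char BACKSLASH = '\\';
    static final String QUOTATION_MARK = "\"";

    // A numeric cast is another option when the hexadecimal number matters.
    static final char LINE_FEED_FROM_HEX = (char) 0x000A;

    public static void main(String[] args) {
        System.out.println((int) LINE_FEED);                 // 10
        System.out.println((int) CARRIAGE_RETURN);           // 13
        System.out.println((int) SINGLE_QUOTE);              // 39
        System.out.println((int) BACKSLASH);                 // 92
        System.out.println((int) QUOTATION_MARK.charAt(0));  // 34
        System.out.println(LINE_FEED == LINE_FEED_FROM_HEX); // true
    }
}
```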
 

The mystery of the comment error

An even stranger situation is found when using single-line comments for Unicode formats such as carriage return or line feed. For example, despite being commented out, both of the following statements would give rise to compile-time errors!
// char lineFeed = '\u000A';  
// char carriageReturn = '\u000d'; 
  This is because the compiler always transforms the hexadecimal formats into the line feed and carriage return characters, which are not compatible with single-line comments: the line terminator ends the comment, so the characters that follow it (here the closing single quote and the semicolon) end up outside the comment and are compiled as code! To solve the situation, use the multi-line comment notation, for example:
/* char lineFeed = '\u000A';  
   char carriageReturn = '\u000d'; */
  Another mistake that can cause a programmer to lose a lot of time occurs when the sequence \u is used in a comment. For example, with the following comment, we will get a compile-time error:
/*
 * The file will be generated inside the C:\users\claudio folder
 */
  If the compiler does not find a valid sequence of 4 hexadecimal digits after \u (here it finds the letters sers), it will print the following error:
error: illegal unicode escape
 * The file will be generated inside the C:\users\claudio folder
                                             ^
1 error
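Two possible workarounds are sketched here, under the assumption that the path only needs to be readable: write it with forward slashes in the comment, or double the backslashes, which a string literal requires anyway. The class PathInComments and its methods are hypothetical names:

```java
// Hypothetical demo class: keeping a Windows path out of trouble.
class PathInComments {

    /*
     * The file will be generated inside the C:/users/claudio folder.
     * Forward slashes keep the backslash-u sequence out of the comment.
     */
    static String folderWithForwardSlashes() {
        return "C:/users/claudio";
    }

    // A doubled backslash is safe: the second backslash of the pair is not
    // eligible to start a Unicode escape, so the letter u after it is harmless.
    static String folderWithBackslashes() {
        return "C:\\users\\claudio";
    }

    public static void main(String[] args) {
        System.out.println(folderWithForwardSlashes());
        System.out.println(folderWithBackslashes());
    }
}
```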
 

Conclusions

In this article we have seen that the use of the char type in Java hides some truly surprising special cases. In particular, we have seen that it is possible to write Java code using the Unicode format. This is because the compiler first transforms the Unicode format into a character, and then evaluates the syntax. This implies that programmers can find syntax errors where they would never expect them, especially inside comments.