Wide Character Types: What Do We Need Them For?

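In C++, an ordinary (narrow) string literal is a sequence of char values encoded in the compiler's execution character set, which is ASCII-compatible on virtually all current platforms. For instance, when you write const char* str = "hello";, every character falls in the ASCII range and occupies one byte.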

If you prefix the literal with an L, it becomes a sequence of wide characters (wchar_t), typically holding Unicode text. For example, const wchar_t* str = L"hello";. The encoding is usually UTF-16 or UTF-32: each wchar_t occupies 2 bytes on Windows and 4 bytes on most Linux and macOS toolchains, depending on the platform and compiler settings.

It's important to note that ASCII can only represent basic English letters, digits, and symbols, while Unicode can represent the characters of nearly all known scripts, including Chinese, Japanese, Korean, and other non-Latin scripts. If you write a Chinese string without an L prefix directly in the C++ source code, such as const char* str = "你好";, the bytes stored in the program depend on the encoding of the source file and on how the compiler maps it to the execution character set. This can produce gibberish at runtime whenever that encoding does not match the one expected by the terminal or runtime environment.

However, Unicode has several different encoding schemes, including UTF-8, UTF-16, and UTF-32. The encoding behind the L prefix depends on the platform and compiler: with MSVC on Windows it is UTF-16, while GCC and Clang on Linux and macOS use UTF-32. Since C++11, the u8, u, and U prefixes let you request UTF-8, UTF-16, or UTF-32 explicitly, regardless of platform. For example, u8"hello" is a UTF-8 encoded string.

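Taking UTF-8 as an example, it is a very common encoding scheme that can represent any Unicode character and is backward compatible with ASCII. If your source code file is saved as UTF-8 and the compiler keeps that encoding for narrow literals (the default for GCC and Clang; MSVC needs the /utf-8 option), you can write const char* str = "你好"; directly.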

Here, "你好" is encoded in UTF-8. However, when you try to print this string using functions like printf, the output will depend on whether your terminal or runtime environment also supports UTF-8.

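To make Chinese strings display more reliably, a common approach, particularly on Windows, is to use the L prefix and wide characters together with the corresponding wide-character functions such as wprintf. This guarantees that the literal is stored in a Unicode encoding regardless of the source file's encoding. Correct display still depends on the runtime: the program normally has to select a suitable locale first (for example with setlocale), and the terminal must be able to render the characters.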

String Literals vs. Wide String Literals

String literals (const char*): Encoded in the execution character set (usually ASCII or UTF-8). In the case of UTF-8, characters can have variable byte lengths, ranging from 1 to 4 bytes depending on the character. This is ideal for representing a vast range of characters, including ASCII, while maintaining space efficiency.

Wide string literals (const wchar_t*): Encoded in the execution wide-character set (often UTF-16 or UTF-32). Each code unit occupies a fixed amount of memory: 2 bytes for UTF-16 and 4 bytes for UTF-32. In UTF-32 one code unit is always one character, so indexing by position is straightforward. In UTF-16, characters outside the Basic Multilingual Plane (many emoji, for example) need two code units, a surrogate pair, so an index refers to a code unit rather than to a whole character. Both forms use more memory than UTF-8 for text that is mostly ASCII.