What Is Unicode?

We’ve been skipping lightly over some of the implications of Unicode, but it’s time to get our hands dirty.

If you think about Unicode, it seems like a whole lot of zeros going nowhere. All the text characters employed by those of us who use English fit in the first 128 Unicode code points, meaning that half of each 16-bit Unicode character is 0. If you do a hex dump of a 32-bit Visual Basic program (English and most European versions), you’ll see all those zeros lined up in neat little columns in the part of the file where string constants are stored. Every 0 must be filtered out when you send a string to an ANSI API function and then reinserted when you get the string back. But the Unicode conversion is more than just converting to and from zeros. Try the following code to find out which characters use more than 8 bits:

Dim i As Integer
For i = 0 To 255
    ' Chr$ turns ANSI code i into a Unicode character; AscW returns its 16-bit value
    Debug.Print Hex$(AscW(Chr$(i)))
Next

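For those of you too lazy to try this, I’ll tell you the result. It depends on your system’s ANSI code page, but on the Western European code page (Windows-1252) every character has zeros in the high byte except some of those in the range 128–159, including characters 145–156 (curly quotes, dashes, and similar punctuation) and character 159. Weird, huh?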
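
If you’d rather see only the interesting characters, a small variation on the loop prints just those whose Unicode value has a nonzero high byte. This is a minimal sketch: the procedure name ShowWideChars is made up for illustration, and the list it prints will vary with the ANSI code page of the machine you run it on.

```vb
Sub ShowWideChars()
    Dim i As Integer, w As Integer
    For i = 0 To 255
        ' Unicode value of the character that ANSI code i maps to
        w = AscW(Chr$(i))
        ' Mask off the low byte; anything left means more than 8 bits
        If (w And &HFF00) <> 0 Then
            Debug.Print i, Hex$(w)
        End If
    Next
End Sub
```

Masking with &HFF00 rather than comparing against 255 is deliberate: AscW returns a signed Integer, so values above &H7FFF come back negative and would slip past a simple greater-than test.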

For the most part, you don’t need to worry about Unicode conversion. Once you’ve set up your Declare statements (or loaded the Windows API type library), everything happens automatically. But just when you think you’ve got Unicode under control, something turns up that doesn’t work quite the way you expected. For example, Win32 defines both Unicode and ANSI versions of every function that takes string arguments, but 32-bit COM supports only Unicode. If you want to call 32-bit COM functions from Visual Basic, you’ll have to pass Unicode strings (even in Windows 95, which supports Unicode in hardly any other context). Normally, you don’t call COM functions, because Basic does it for you, but I’ll show you some exceptions in the next section.

When Basic sees that you want to pass a string to an outside function, it conveniently squishes the internal Unicode strings into ANSI strings. But if the function expects Unicode, you must find a way to make Basic leave your 16-bit characters alone. The new Byte type was added specifically for those cases in which you don’t want the language messing with data behind your back. I’ll show you more examples of this in “Reading and Writing Blobs,” page 277. For now, let’s look at some Unicode basics that will set the stage for calling Unicode API functions.

Basic allows you to assign strings to byte arrays and byte arrays to strings:

Dim ab() As Byte, s As String
s = "ABCD"
ab = s
s = ab

What would you expect these statements to do? If you guessed that the first byte of ab will contain the ASCII code for “A”, the second the code for “B”, and so on, you guessed wrong. Check it out in the Locals window. The ab array contains 65, 0, 66, 0, 67, 0, 68, and 0. Although Visual Basic will now show the contents of a Byte array (unlike version 4), I find it easier to compare Strings and Byte arrays by calling my HexDump functions (in UTILITY.BAS). HexDump works on Byte arrays, and HexDumpS works on strings. HexDumpB, too, works on strings, but it dumps them as bytes rather than as characters. Here’s what you get in the Immediate window if you dump the variables shown earlier:

? HexDump(ab)
41 00 42 00 43 00 44 00   A.B.C.D.
? HexDumpB(s)
41 00 42 00 43 00 44 00   A.B.C.D.
? HexDumpS(s)
41 42 43 44               ABCD
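
If you don’t have UTILITY.BAS handy, a few lines are enough to get a bare-bones dump. The following sketch is not the HexDump from UTILITY.BAS. BytesToHex is a hypothetical stand-in that produces only the hex column, and it assumes the array has already been allocated (LBound fails on an empty dynamic array).

```vb
Function BytesToHex(ab() As Byte) As String
    Dim i As Long, s As String
    For i = LBound(ab) To UBound(ab)
        ' Pad each byte to two hex digits and separate with a space
        s = s & Right$("0" & Hex$(ab(i)), 2) & " "
    Next
    ' Drop the trailing space
    BytesToHex = RTrim$(s)
End Function
```

With ab assigned from the string "ABCD" as shown earlier, ? BytesToHex(ab) in the Immediate window should print 41 00 42 00 43 00 44 00.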

But what if you want to put the 8-bit ANSI characters of a string into a byte array without the zeros? Basic lets you force the conversion from Unicode to ANSI by using the StrConv function:

ab = StrConv(s, vbFromUnicode)

A hex dump shows the bytes without zeros:

? HexDump(ab)
41 42 43 44               ABCD

Now let’s assign the byte array back to the string:

s = ab

If you look at s in the Immediate window at this point, you might be surprised to see the string “??”. What is this? Well, what you have in the first 16-bit character is “AB” (&H4241). The Unicode character &H4241 represents the sacatai hieroglyphic in the Basic dialect of northeastern Cathistan. In the second character, you have another, the boganit hieroglyphic. Visual Basic doesn’t know anything about sacatai or boganit or any other Unicode characters above 255, so it just displays them as question marks.
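
You can confirm the packing without a hex dump by asking for the length and character codes directly. This is only an illustrative sketch, and the procedure name ShowPackedChars is invented for the purpose.

```vb
Sub ShowPackedChars()
    Dim ab() As Byte, s As String
    ' Four ANSI bytes: 41 42 43 44
    ab = StrConv("ABCD", vbFromUnicode)
    ' Raw assignment: every two bytes become one 16-bit character
    s = ab
    Debug.Print Len(s)                  ' 2 characters, not 4
    Debug.Print Hex$(AscW(s))           ' 4241 (low byte comes first in memory)
    Debug.Print Hex$(AscW(Mid$(s, 2)))  ' 4443
End Sub
```

The byte order explains why “AB” shows up as &H4241 rather than &H4142: the first byte in memory is the low byte of the 16-bit character.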

To convert the byte array back to a recognizable string, undo the previous StrConv function:

s = StrConv(ab, vbUnicode)

The string now looks “right” in the debugger and “wrong” in the byte hex dump.
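
Because Basic passes Byte data through untouched, a byte array holding the raw string bytes can be handed to a function that expects Unicode. Here is a rough sketch of the idea, assuming the Win32 lstrlenW function exported by kernel32. The wrapper name UnicodeLength is hypothetical, and the Declare shown is one plausible way to write it rather than the form used elsewhere in this book.

```vb
' Declared with a Byte parameter so that Basic performs no string conversion
Private Declare Function lstrlenW Lib "kernel32" (lpString As Byte) As Long

Function UnicodeLength(s As String) As Long
    Dim ab() As Byte
    ' Copy the raw 16-bit characters and append a Unicode null terminator
    ab = s & vbNullChar
    ' Passing the first element by reference gives the API a pointer to the bytes
    UnicodeLength = lstrlenW(ab(0))
End Function
```

In the Immediate window, ? UnicodeLength("ABCD") should print 4, the number of 16-bit characters before the terminator. Declaring the same parameter As String would make Basic squish the string to ANSI first, which is exactly what a Unicode function does not want.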