How Long is That in Japanese? Determining String Sizes

Dear Dr. GUI:

How do you calculate the size of a multibyte character set (MBCS) string in bytes? I am currently developing an application in Visual C++ for both Unicode and MBCS. I need the information for a call to RegSetValueEx. It is no problem to make the calculation for SBCS or Unicode strings because each character has the same size, but in MBCS the character size changes.

I hope you can help me.

Thank you,

Claus Michelsen

Dr. GUI replies:

This one's easy—if you know the trick. Dr. GUI learned it in a seminar on DBCS surgery a few years ago.

A byte that contains zero has only one use in MBCS/DBCS strings: as a terminator. You are guaranteed that a zero byte will never be a trail byte for a two-byte character. In other words, if you hit a zero byte, you're at the end of the string. Note: This guarantee does not apply to Unicode strings.

Given that, you can use strlen on a DBCS string just as you would with single-byte strings. It'll return the length of the string in bytes, not characters.
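As a minimal sketch of that calculation in C: the helper name dbcs_size_in_bytes below is hypothetical, and it assumes the string is a null-terminated SBCS or MBCS string. It adds one byte for the terminator, because RegSetValueEx expects the size passed for REG_SZ data to include the terminating null.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: size in bytes of a null-terminated SBCS/MBCS string,
   including the one-byte terminator (the size RegSetValueEx expects for
   REG_SZ data). */
size_t dbcs_size_in_bytes(const char *s)
{
    assert(s != NULL);
    /* strlen counts bytes up to the first zero byte. In a DBCS string a
       zero byte can only be the terminator, so the count is reliable. */
    return strlen(s) + 1;
}
```

For example, given the four bytes "\x93\xfa\x96\x7b" (two Japanese characters encoded in Shift-JIS, code page 932), strlen returns 4 and the helper returns 5.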

If you need the length in characters, use _mbslen or _mbstrlen. (Check the docs for the differences.) For Unicode strings, use wcslen.
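To show why the character count can differ from the byte count, here is a hand-rolled sketch of a character counter. It is not how _mbslen is implemented. The function names are hypothetical, and the lead-byte ranges are an assumption that holds only for Shift-JIS (code page 932). Production code would use the library routines, or ask the system with IsDBCSLeadByte, rather than hard-coding ranges.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: Shift-JIS (code page 932) lead-byte ranges 0x81-0x9F and
   0xE0-0xFC. Other DBCS code pages use different ranges. */
static int is_sjis_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: count characters, not bytes, in a Shift-JIS string. */
size_t sjis_char_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    assert(s != NULL);
    while (*p != 0) {
        /* A lead byte followed by a nonzero trail byte forms one two-byte
           character; anything else is treated as a single-byte character. */
        p += (is_sjis_lead_byte(*p) && p[1] != 0) ? 2 : 1;
        count++;
    }
    return count;
}
```

With the same four-byte Shift-JIS string "\x93\xfa\x96\x7b", this returns 2, while strlen returns 4.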

Best of all is to use the TCHAR type and associated macros (_T, _txxxx, and so on) for string manipulation. These macros are defined differently depending on whether you have _UNICODE, _MBCS, or neither (meaning single-byte) defined at compile time. Using this type and the associated macros is best because the right interpretation is picked automatically for you, allowing the same source code to be used for Unicode, MBCS, or single-byte applications. For example, to get the length in characters, you can always use the _tcsclen macro, which maps to strlen, _mbslen, or wcslen, as appropriate.
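For the byte size that RegSetValueEx needs, a common idiom combines _tcslen with sizeof(TCHAR). Note that this uses _tcslen, not _tcsclen: _tcslen maps to strlen for both single-byte and MBCS builds and to wcslen for Unicode builds, so it counts storage units rather than characters. The sketch below is an illustration only. The helper name tstring_size_in_bytes is hypothetical, and the non-Windows branch consists of stand-in definitions for a single-byte build, added so that the example compiles without tchar.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef _WIN32
#include <tchar.h>          /* the real TCHAR, _T, and _tcslen */
#else
/* Stand-ins (an assumption for illustration): mimic what tchar.h provides
   when neither _UNICODE nor _MBCS is defined. */
typedef char TCHAR;
#define _T(x) x
#define _tcslen strlen
#endif

/* Hypothetical helper: size in bytes of a null-terminated TCHAR string,
   including its terminator, for any of the three build types. */
size_t tstring_size_in_bytes(const TCHAR *s)
{
    assert(s != NULL);
    /* _tcslen counts TCHAR units (bytes or wchar_t values), so multiplying
       by sizeof(TCHAR) converts the count to bytes. */
    return (_tcslen(s) + 1) * sizeof(TCHAR);
}
```

In a Unicode build, tstring_size_in_bytes(_T("abc")) is 4 * sizeof(wchar_t); in a single-byte or MBCS build it is 4.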

If you've correctly defined the Unicode preprocessor symbols (UNICODE for the Windows headers, _UNICODE for the C run-time library), you can always use the Windows API lstrlen, whether or not you build for Unicode. The name lstrlen will map to the appropriate function, lstrlenW (for Unicode) or lstrlenA (otherwise).

Check out the "Multibyte Character Set (MBCS) Survival Guide" in the MSDN Library for more information. And check out "Routine Mappings" in the Runtime Library Reference for more information about the TCHAR-related macros. It's a good idea to write new code using these macros rather than using specific character types. Finally, check out "Unicode and Character Sets" in the MSDN Library (Product Documentation, SDKs, Win32 SDK, Win32 Programmer's Reference, Overviews, International Features, Unicode and Character Sets).