Paul DiLascia
Paul DiLascia is a freelance software consultant specializing in training and software development in C++ and Windows. He is the author of Windows++: Writing Reusable Code in C++ (Addison-Wesley, 1992).
Q Could you please comment on a discussion that we've been having in our development department about passing character strings into and out of C++ class objects? One side argues that they should always be passed as LPCSTRs, another argues for using the CString class.
I've enclosed a much-simplified class definition (see Figure 1) to illustrate the different methods considered. In reality our classes have a number of CString data members and, rather than having separate functions to set or get a particular item, we use the one function with an ID number to indicate which string we want to access.
Ian Clegg
England
Figure 1 Passing Character Strings
class MyClass {
    CString* m_str;
public:
    MyClass();
    ~MyClass();
    void SetLPCSTR( LPCSTR lpInputString );
    LPCSTR GetLPCSTR( );
    void SetCString( CString csInputString );
    CString GetCString( );
};

MyClass::MyClass( )
{
    m_str = NULL;
}

MyClass::~MyClass( )
{
    if( m_str )
        delete m_str;
}

LPCSTR MyClass::GetLPCSTR( )
{
    CString csReturn;
    if ( m_str )
        csReturn = *m_str;
    return csReturn;
}

void MyClass::SetLPCSTR( LPCSTR lpInputString )
{
    if ( m_str == NULL )
        m_str = new CString( lpInputString );
    else
        *m_str = lpInputString;
}

CString MyClass::GetCString( )
{
    CString csReturn;
    if ( m_str )
        csReturn = *m_str;
    return csReturn;
}

void MyClass::SetCString( CString csInputString )
{
    if ( m_str == NULL )
        m_str = new CString( csInputString );
    else
        *m_str = csInputString;
}
A At first I thought I knew the answer to this question off the top of my head (after all, it seems like such a simple, innocent question), but then I began to wonder. Many hours and countless brain cells later, I found myself sucked deeper and deeper into C++, MFC internals, and yucky assembly language mucky-muck. In the end, it turned out I was right, but only through brute force was I able to prove it. After such a grueling ordeal, I quickly realized that the only way I could reward myself, the only possible pleasure to be gained from having endured it, was to inflict the same punishment on my readers. In fact, since CStrings are so important and ubiquitous in MFC, a familiar little class that programmers use every day, I decided to make this month's column a sort of mini-treatise on CStrings, especially since they've changed considerably as of release 4.0. I know it sounds boring but I guarantee you'll be surprised to learn all the things that go on while you're not looking.
First off, if you're not using CStrings, shame on you! Come on guys, the year 2000 is almost upon us! No more character arrays, strcpy, strdup, and all that rot. CStrings are easy, lightweight, overwrite-proof, and provide useful functions for manipulating strings. There's even a Format function that works like printf! Not to mention that CStrings go from English to International (ASCII to Unicode) with the flip of a compiler switch. There's just no excuse for writing char[256] any more unless you're dealing with legacy code written in COBOL.
Now that I got that off my chest, let's do something about Ian's code. First, it's always better to instantiate CString objects directly as inline class members or on the stack instead of allocating them from the heap. A CString is very small; it contains only one member (m_pchData, a pointer to the actual character data) so a CString is only four bytes, the same as an int. It makes no sense to allocate CStrings individually unless you have some truly bizarre situation at hand. In general, you should think of CString as a primitive type like int or long or double. You wouldn't allocate space to store one int, would you? So the first thing you should do is make m_str an actual CString instead of a pointer to one.
class MyClass {
    CString m_str; // not a pointer
    . . .
};
This entails no extra storage overhead. On the contrary, it uses half the space the previous design used and results in less memory fragmentation. It also simplifies your code greatly because you can get rid of all the new/delete stuff and checks for NULL. CString already contains a lot of code to do all that checking for you. Use it.
The second thing you should do is declare your Set/GetCString functions with const CString& (reference to const CString) instead of just plain CString. When you pass an object by value, C++ must make a copy of it on the stack, which requires a function call to the copy constructor CString::CString(const CString&). If you use a reference, C++ just pushes a pointer and there's no copy constructor call. When you use a reference, you need const to tell the compiler that your Set function doesn't modify its argument or, in the case of Get, that the CString returned may not be modified. In general, you can use const Foo& as a way to pass Foo objects more efficiently—as if they were values—provided you don't modify them. Figure 2 shows my modified version of MyClass with const CString& declarations and m_str converted to CString.
Figure 2 strtest.cpp
////////////////////////////////////////////////////////////////
// 1996 Microsoft Systems Journal.
// Written by Paul DiLascia
//
// STRTEST shows what function calls the compiler generates
// for various kinds of CString conversions and assignments.
//
// To see the code the compiler generates, type
//
//    cl /c /Fa strtest.cpp
//
// and look at strtest.asm
//
#include <stdio.h>

typedef const char* LPCSTR;

//////////////////
// Stripped-down CString with outline functions instead of
// inline, to show what happens in the compiled code
//
class CString {
protected:
    LPCSTR m_pchData; // pointer to string data
public:
    CString();
    CString(const CString& stringSrc);
    CString(LPCSTR lpsz);
    ~CString();
    operator LPCSTR() const;
    const CString& operator= (const CString& cs);
    const CString& operator= (LPCSTR lp);
};

//////////////////
// Some class with get/set methods to access a CString.
//
class MyClass {
    CString m_str;
public:
    void SetLPCSTR(LPCSTR lp);
    LPCSTR GetLPCSTR();
    void SetCString(const CString& cs);
    const CString& GetCString();
};

////////////////
// In a real program, these functions would be inline.
// I left them "outline" so you can see where the
// compiler invokes them.
//
LPCSTR MyClass::GetLPCSTR()
{
    return m_str; // invokes CString::operator LPCSTR();
}

const CString& MyClass::GetCString()
{
    return m_str; // just return pointer to m_str;
}

void MyClass::SetLPCSTR(LPCSTR lp)
{
    m_str = lp; // invokes CString::operator=(LPCSTR);
}

void MyClass::SetCString(const CString& cs)
{
    m_str = cs; // invokes CString::operator= (const CString&)
}

void main()
{
    MyClass foo;
    CString cs = "this is a CString";
    LPCSTR lp = "this is an LPCSTR";

    // Four cases for set operation
    foo.SetCString(cs); // case #1
    foo.SetLPCSTR(cs);  // case #2
    foo.SetCString(lp); // case #3
    foo.SetLPCSTR(lp);  // case #4

    // Four cases for get operation
    cs = foo.GetCString(); // case #5
    cs = foo.GetLPCSTR();  // case #6
    lp = foo.GetCString(); // case #7
    lp = foo.GetLPCSTR();  // case #8
}
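To make the by-value versus const-reference point concrete without MFC, here is a minimal sketch. CountedString and the helper functions are hypothetical names invented for this illustration; the class only counts copy-constructor calls and is not a stand-in for the real CString implementation.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical string-like class (not MFC): it only counts copy-constructor calls.
struct CountedString {
    static int copies;      // number of copy-constructor calls so far
    const char* text;       // not owned; points at a string literal in this sketch

    explicit CountedString(const char* s) : text(s) { assert(s != nullptr); }
    CountedString(const CountedString& other) : text(other.text) { ++copies; }
};
int CountedString::copies = 0;

// Pass by value: the caller must copy-construct the argument.
std::size_t LengthByValue(CountedString s) { return std::strlen(s.text); }

// Pass by reference to const: only an address is passed, no copy is made.
std::size_t LengthByConstRef(const CountedString& s) { return std::strlen(s.text); }

// Returns how many copy-constructor calls one by-value call causes.
int CopiesWhenPassedByValue() {
    CountedString cs("hello");
    int before = CountedString::copies;
    LengthByValue(cs);
    return CountedString::copies - before;
}

// Returns how many copy-constructor calls one const-reference call causes.
int CopiesWhenPassedByConstRef() {
    CountedString cs("hello");
    int before = CountedString::copies;
    LengthByConstRef(cs);
    return CountedString::copies - before;
}
```

Calling CopiesWhenPassedByValue() reports one copy and CopiesWhenPassedByConstRef() reports none, which is the difference the const CString& declarations in Figure 2 are meant to exploit.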
Now, let's explore the original question: should you declare Get/Set functions with LPCSTR or const CString&? If you take my initial advice to always use CString and never use LPCSTR, then this question never arises. However, LPCSTR is sometimes necessary, and the magic of C++ lets you use LPCSTR interchangeably with CString.
Say you have a function, like SetLPCSTR, that expects LPCSTR but you call it with a CString instead.
MyClass myobj;
CString cs;
myobj.SetLPCSTR(cs); // type mismatch?
Superficially it looks like a type mismatch, but this code compiles because CString has a conversion operator, CString::operator LPCSTR, that converts the CString to an LPCSTR. All the compiler needs to know is that there's this member function called operator LPCSTR (operator const char*) that returns LPCSTR. The compiler generates code like this:
. . .
CString cs;
myobj.SetLPCSTR(cs.operator LPCSTR()); // OK, types agree
This looks funny because there's a space in the function name, but that's just syntax. Internally, operator LPCSTR is just another member function that returns LPCSTR. SetLPCSTR gets LPCSTR, which is what it expects.
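The conversion operator can be tried outside MFC with a small sketch. MiniString and CountChars below are hypothetical names used only for this illustration; the point is that the implicit call and the spelled-out cs.operator LPCSTR() call do the same thing.

```cpp
#include <cassert>
#include <cstring>

typedef const char* LPCSTR;

// Hypothetical mini string (not MFC's CString): it holds a pointer it does not own.
class MiniString {
    LPCSTR m_pchData;
public:
    explicit MiniString(LPCSTR s) : m_pchData(s) { assert(s != nullptr); }

    // Conversion operator: lets a MiniString be used where an LPCSTR is expected.
    operator LPCSTR() const { return m_pchData; }
};

// A function that expects a plain LPCSTR.
std::size_t CountChars(LPCSTR lp) { return std::strlen(lp); }

// The compiler inserts the conversion for us...
std::size_t ImplicitCall(const MiniString& s) { return CountChars(s); }

// ...which is the same as calling the oddly named member function by hand.
std::size_t ExplicitCall(const MiniString& s) { return CountChars(s.operator LPCSTR()); }
```

Both ImplicitCall(MiniString("hello")) and ExplicitCall(MiniString("hello")) return 5.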
What about going the other way? What if you have a Set function that expects const CString& and you try to give it an LPCSTR?
LPCSTR lp;
myobj.SetCString(lp); // type mismatch?
This is a little more tricky. One of the functions defined for CString is CString::CString(LPCSTR), a constructor that creates a CString from an LPCSTR. The compiler notices this and says, "Duh, I can make this compile if I create a temporary variable."
LPCSTR lp;
CString temp(lp);       // create temp
myobj.SetCString(temp); // OK, args match
Once again, the types match: SetCString gets a CString, which is what it expects.
There are two other things I must point out here. First, hidden behind the scenes is a call to the destructor CString::~CString as temp goes out of scope. Second, the temp solution only works if the argument to SetCString is declared either CString or const CString&. If SetCString is declared to take CString& (a non-const reference), the compiler can't use the temp trick. For all it knows, SetCString might modify temp, and there's no way to propagate the change back to lp.
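The hidden temporary and its destructor call can be observed with a counting class. TempString and the functions below are hypothetical names made up for this sketch; nothing here is MFC code.

```cpp
#include <cassert>

typedef const char* LPCSTR;

// Hypothetical class (not MFC) that counts constructions from LPCSTR and destructions.
struct TempString {
    static int constructed;
    static int destroyed;
    LPCSTR text;

    TempString(LPCSTR s) : text(s) { assert(s != nullptr); ++constructed; }
    ~TempString() { ++destroyed; }
};
int TempString::constructed = 0;
int TempString::destroyed = 0;

// Takes a reference to const, so the compiler may bind a temporary to it.
bool IsEmpty(const TempString& s) { return s.text[0] == '\0'; }

// With a non-const reference the temp trick is not allowed:
//   bool IsEmptyMutable(TempString& s);
//   IsEmptyMutable(lp);   // error: cannot bind a temporary to TempString&

// Returns true if the call below created exactly one temporary and destroyed it.
bool OneTemporaryPerCall() {
    LPCSTR lp = "hello";
    int c0 = TempString::constructed;
    int d0 = TempString::destroyed;
    (void)IsEmpty(lp);   // compiler builds a TempString from lp, then destroys it
    return TempString::constructed - c0 == 1 && TempString::destroyed - d0 == 1;
}
```

OneTemporaryPerCall() returns true: the temporary lives only until the end of the statement that contains the call.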
However you declare your arguments—CString or LPCSTR—you can still pass the other kind of argument in your code. Which is better? I'm getting there, I promise.
So far, I've only shown you what happens for converting function arguments. As you'd expect, the compiler works the same magic on return values. You can write
LPCSTR lp;
CString cs;
lp = myobj.GetCString(); // type mismatch?
cs = myobj.GetLPCSTR();  // type mismatch?
and C++ works its gris-gris to make your code compile. In the first case, C++ converts the return value from const CString& to LPCSTR by invoking the conversion operator CString::operator LPCSTR. In the second case, the conversion is actually an assignment: C++ invokes CString::operator=(LPCSTR).
In all, there are eight cases to consider: four cases for Set and four cases for Get, depending on the type declared versus the type passed or assigned. In addition to the hidden conversions for arguments and return values, you also have to consider what happens inside your Set/Get functions. For example, if you write
void MyClass::SetLPCSTR(LPCSTR lpsz)
{
    m_str = lpsz;
}
the innocent-looking assignment statement actually compiles into a call to CString::operator=(LPCSTR). Likewise, you have to consider what happens for SetCString, GetLPCSTR, and GetCString. Things are really getting out of hand here!
In an effort to get a handle on all this madness, I wrote a program, STRTEST.CPP (see Figure 2), that illustrates exactly what happens in each situation. STRTEST contains the improved MyClass with Set/Get functions for CString and LPCSTR and a main function that exercises each of the eight cases I mentioned. It also contains a stripped-down version of CString, with only the functions declared that are relevant to the discussion at hand. All functions are left outline (as opposed to inline) so you can see where the compiler generates function calls.
The idea is to compile STRTEST and look at the assembly code generated in the hopes of understanding what's really going on behind the veil of the compiler. This is the brute force investigative technique I mentioned at the outset. It's disgusting to look at, I know, but it's also amusing. Figure 3 shows the abridged assembly output for the main function, with my running commentary.
Figure 3 strtest.asm
_main PROC NEAR
    .
    . prolog, initialization
    .
;;================ Set functions ================

;; Case #1: foo.SetCString(cs);
;; -call SetCString (no conversion)
;;
    lea ecx, DWORD PTR _foo$[ebp]  ; (see order below)
;;================ Get functions ================

;; Case #5: cs = foo.GetCString();
;; -call GetCString
;; -invoke assignment operator to copy CString
;;
    lea ecx, DWORD PTR _foo$[ebp]
    call ?GetCString@MyClass@@QAEABVCString@@XZ ; MyClass::GetCString
    push eax
    lea ecx, DWORD PTR _cs$[ebp]
    call ??4CString@@QAEHABV0@@Z ; CString::operator=

;; Case #6: cs = foo.GetLPCSTR();
;; -call GetLPCSTR
;; -call CString::operator=(LPCSTR)
;;
    lea ecx, DWORD PTR _foo$[ebp]
    call ?GetLPCSTR@MyClass@@QAEPBDXZ ; MyClass::GetLPCSTR
    push eax
    lea ecx, DWORD PTR _cs$[ebp]
    call ??4CString@@QAEABV0@PBD@Z ; CString::operator=

;; Case #7: lp = foo.GetCString();
;; -call GetCString
;; -convert result to LPCSTR with CString::operator LPCSTR
;;
    lea ecx, DWORD PTR _foo$[ebp]
    call ?GetCString@MyClass@@QAEABVCString@@XZ ; MyClass::GetCString
    mov ecx, eax
    call ??BCString@@QBEPBDXZ ; CString::operator char const *
    mov DWORD PTR _lp$[ebp], eax

;; Case #8: lp = foo.GetLPCSTR();
;; -call GetLPCSTR (no conversion afterwards)
;;
    lea ecx, DWORD PTR _foo$[ebp]
    call ?GetLPCSTR@MyClass@@QAEPBDXZ ; MyClass::GetLPCSTR
    mov DWORD PTR _lp$[ebp], eax
    .
    . cleanup, return
    .
_main ENDP
You'd think by now I would just come out and tell you the answer, but I've only described the type conversions generically. The next thing you have to do is look inside CString to see what all these operators and constructors actually do. Fortunately, this is a little more interesting. Consider the conversion operator for LPCSTR. I mentioned earlier that CString contains just one member, m_pchData, a char* that points to the actual character data, such as "Hello, world". Knowing this, you can probably guess how CString::operator LPCSTR is implemented.
// (from afx.inl)
inline CString::operator LPCTSTR() const
{
    return m_pchData; // just return ptr to string
}
Just like a typical Get function, all it does is return a data member. Since it's inline, converting a CString to LPCSTR is very fast. If you write
SetLPCSTR(cs); // cs is a CString
it gets compiled exactly as if you'd written
SetLPCSTR(cs.m_pchData);
which you can't do because m_pchData is protected.
What about the other operators? Well, when I told you about m_pchData, I didn't tell you everything. It's true that m_pchData points to the underlying character string, but hidden behind the string is a little struct.
struct CStringData {
    long nRefs;       // reference count
    int nDataLength;  // length of string
    int nAllocLength; // length of buffer allocated
};
Figure 4 illustrates the situation. When CString allocates space for a new string, it adds a few extra bytes to store this header. CStringData contains vital information about the string. For example, CString::GetLength is implemented like this:
// (from afx.inl)
inline int CString::GetLength() const
{
    return GetData()->nDataLength;
}
GetData is another inline function:
inline CStringData* CString::GetData() const
{
    ASSERT(m_pchData != NULL);
    return ((CStringData*)m_pchData)-1;
}
Figure 4 Anatomy of a CString
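The pointer arithmetic in GetData is easier to follow in a standalone sketch. The code below is an illustration only: Header copies the three CStringData fields quoted above, while AllocString, GetHeader, FreeString, and LengthViaHeader are hypothetical helpers written for this example and are not MFC functions.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical header modeled on the CStringData fields quoted above.
struct Header {
    long nRefs;        // reference count
    int nDataLength;   // length of string
    int nAllocLength;  // length of buffer allocated
};

// Allocate header and characters in one block; return a pointer to the characters.
char* AllocString(const char* src) {
    int len = static_cast<int>(std::strlen(src));
    void* block = std::malloc(sizeof(Header) + len + 1);
    if (block == nullptr) return nullptr;
    Header* h = static_cast<Header*>(block);
    h->nRefs = 1;
    h->nDataLength = len;
    h->nAllocLength = len;
    char* data = reinterpret_cast<char*>(h + 1);  // characters start right after the header
    std::memcpy(data, src, len + 1);              // copy the bytes and the terminating '\0'
    return data;
}

// Step back one Header from the character pointer, as GetData does.
Header* GetHeader(char* data) {
    assert(data != nullptr);
    return reinterpret_cast<Header*>(data) - 1;
}

// Release the whole block, which begins at the header.
void FreeString(char* data) { std::free(GetHeader(data)); }

// Round trip: allocate, read the length back through the hidden header, free.
int LengthViaHeader(const char* src) {
    char* data = AllocString(src);
    if (data == nullptr) return -1;
    int len = GetHeader(data)->nDataLength;
    FreeString(data);
    return len;
}
```

LengthViaHeader("Hello, world") returns 12, read from the header that sits just in front of the characters rather than by scanning for the terminating zero.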
Why did the implementers of MFC put the CStringData information as a hidden block preceding the character data instead of storing it as class members in CString, which would be the obvious thing to do? Because it makes CStrings small and fast. Consider what happens when you copy a CString in either a copy constructor or an assignment from CString to CString. If all the information is stored in the CString, as it was before MFC 4.0, you'd have to copy it along with m_pchData, so there would be more things to copy. Plus, you can't just copy the value of m_pchData, you have to allocate a new buffer and copy the contents with a function like strcpy or memcpy.
Starting in release 4.0, MFC uses a different technique called "copy on modify" to copy CStrings. Commercial string libraries have long used this technique; MFC finally caught up. The basic idea is to copy only the pointer at first, and not actually copy the bytes until it becomes necessary. Figure 5 shows how it works.
Figure 5 CString Copy on Modify in Action!
Say you have a CString, cstr1, with a ref count of 1. Then suppose you make a copy of it.
cstr2 = cstr1;
Instead of copying all the string information and character bytes, the assignment operator copies the pointer m_pchData and increments CStringData::nRefs. Now cstr1 and cstr2 actually point to the same object in memory, but nRefs is 2 instead of 1. This makes two CStrings, but just one byte array. What happens if the program subsequently alters either cstr1 or cstr2? No problem. Before modifying any CString, MFC checks the ref count. If it's greater than 1, some other CString is pointing to this same m_pchData so MFC can't change it. Instead, MFC allocates a new m_pchData with its own CStringData and copies the bytes. MFC decrements the ref count in the original object and sets the new ref count to 1. A similar thing happens when a CString is destroyed; only when the ref count drops to zero does MFC actually deallocate m_pchData. You can see that this strategy only works because the information about the string—CStringData—is kept with the string itself and CStrings are just pointers to these data/string objects. Figure 6 summarizes what all the relevant CString functions and operators do with regard to copying.
Figure 6 CString Functions and Operators
CString function/operator | Costly? | What it does |
CString::CString(const CString& cs) | No | Quick copy. Copy value of m_pchData and increment CStringData::nRefs. |
CString::CString(LPCSTR lp) | Yes | Always allocate a new character array and CStringData. Copy bytes from lp. |
CString::~CString() | No | Deallocate string only if --nRefs <= 0; that is, if this is the only CString using this particular m_pchData. |
operator LPCSTR() const | No | Inline function just returns m_pchData. No function call. |
const CString& operator=(const CString& cs) | No | Similar to copy constructor. Copy value of m_pchData and increment nRefs. |
const CString& operator=(LPCSTR lp) | Yes | Similar to LPCSTR constructor. Always allocate a new character array and CStringData. Copy bytes from lp. |
The whole point is that copying CStrings is now very fast since you just copy one pointer. You can pass CStrings around by value without paying a price. A typical application might have many functions with arguments declared CString, and you might pass the same CString by value from function A to function B to function C. Each call requires creating a copy of the CString on the stack. Before MFC 4.0, this would allocate and copy a new string every time! Copy on modify fixes this situation so only m_pchData is copied. As soon as one of the functions or some other part of the code attempts to modify the underlying string, MFC makes a new copy.
Remember, this only applies when you pass CStrings by value. If you use const references (const CString&), C++ passes a pointer to the actual CString and doesn't even call the copy constructor.
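The copy-on-modify scheme described above can be sketched in a few dozen lines. CowString is a hypothetical class written for this illustration; it keeps the ref count in a separate Rep object with a std::string inside, rather than in a header in front of the characters, so it shows the idea and not MFC's actual layout or code.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical copy-on-modify string; a sketch of the idea, not MFC's implementation.
class CowString {
    struct Rep {
        long nRefs;          // how many CowStrings point at this Rep
        std::string chars;   // the shared character data
    };
    Rep* m_rep;

    void Release() { if (--m_rep->nRefs == 0) delete m_rep; }

    // Called before any modification: take a private copy if the data is shared.
    void MakeUnique() {
        if (m_rep->nRefs > 1) {
            Rep* fresh = new Rep{1, m_rep->chars};
            --m_rep->nRefs;
            m_rep = fresh;
        }
    }
public:
    CowString(const char* s) : m_rep(new Rep{1, s}) {}
    CowString(const CowString& o) : m_rep(o.m_rep) { ++m_rep->nRefs; }   // quick copy
    CowString& operator=(const CowString& o) {
        ++o.m_rep->nRefs;    // increment first so self-assignment is safe
        Release();
        m_rep = o.m_rep;
        return *this;
    }
    ~CowString() { Release(); }

    void SetAt(std::size_t i, char c) {
        assert(i < m_rep->chars.size());
        MakeUnique();
        m_rep->chars[i] = c;
    }
    const char* c_str() const { return m_rep->chars.c_str(); }
    long RefCount() const { return m_rep->nRefs; }
    bool SharesBufferWith(const CowString& o) const { return m_rep == o.m_rep; }
};

// After cstr2 = cstr1, both strings share one buffer with a ref count of 2.
bool SharedAfterCopy() {
    CowString cstr1("hello");
    CowString cstr2("");
    cstr2 = cstr1;
    return cstr1.SharesBufferWith(cstr2) && cstr1.RefCount() == 2;
}

// Modifying one of them gives it a private buffer; the other is untouched.
bool SplitAfterModify() {
    CowString cstr1("hello");
    CowString cstr2("");
    cstr2 = cstr1;
    cstr2.SetAt(0, 'j');
    return !cstr1.SharesBufferWith(cstr2)
        && cstr1.RefCount() == 1 && cstr2.RefCount() == 1
        && std::strcmp(cstr1.c_str(), "hello") == 0
        && std::strcmp(cstr2.c_str(), "jello") == 0;
}
```

Both SharedAfterCopy() and SplitAfterModify() return true. The sketch ignores thread safety, which a production ref-counted string has to handle.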
Finally, I'm in a position to answer the question! I could have made you wade through the assembler code, but I have some sympathy and did the dirty work myself. I compiled the results in two tables that summarize what happens in the eight different cases in the main function of STRTEST.CPP (see Figures 7 and 8).
Figure 7 Conversion Possibilities for Set Functions
Case | Argument Conversion | MFC Conversion | Inside Set Function |
1 SetCString(cs) | cs → const CString& | No conversion, just passes a pointer to cs. | The code m_str = cs; calls CString::operator=(const CString&), which does a quick copy. Equivalent to m_str.operator=(cs); which does m_str.m_pchData = cs.m_pchData; plus increment ref count. |
2 SetLPCSTR(cs) | cs → LPCSTR | Calls inline CString::operator LPCSTR to convert cs to LPCSTR. This just gets cs.m_pchData. Equivalent to: SetLPCSTR(cs.m_pchData); | The code m_str = lp; calls CString::operator=(LPCSTR) to allocate and copy bytes. Equivalent to: m_str.operator=(lp); (allocate, copy). |
3 SetCString(lp) | lp → const CString& | Creates a temp variable initialized from CString::CString(LPCSTR), which allocates and copies bytes. Equivalent to: CString temp(lp); (allocate, copy) SetCString(temp); temp.CString::~CString(); This requires calling CString::~CString when temp goes out of scope. The destructor will deallocate m_pchData if no other CString is using it. | The code m_str = cs; calls CString::operator=(const CString&), which does a quick copy. Equivalent to: m_str.operator=(cs); which does m_str.m_pchData = cs.m_pchData; plus increment ref count. |
4 SetLPCSTR(lp) | lp → LPCSTR | No conversion, just passes pointer lp. | The code m_str = lp; calls CString::operator=(LPCSTR) to allocate and copy bytes. Equivalent to: m_str.operator=(lp); (allocate, copy). |
cs = CString
lp = LPCSTR
red = performance hit
Figure 8 Conversion Possibilities for Get Functions
Case | Assignment Conversion | MFC Conversion | Inside Get Function |
5 cs=GetCString() | cs ← const CString& | Calls CString::operator=(const CString&), which does a quick copy. Equivalent to: cs.operator=(GetCString()); which does cs.m_pchData = GetCString().m_pchData; and increment ref count. | return m_str; (as const CString&) No conversion, just returns const pointer to m_str. |
6 cs=GetLPCSTR() | cs ← LPCSTR | Calls CString::operator=(LPCSTR) to allocate and copy bytes. Equivalent to: cs.operator=(GetLPCSTR()); | return m_str; (as LPCSTR) Calls CString::operator LPCSTR (inline) to convert m_str to LPCSTR. This just gets m_str.m_pchData. Equivalent to: return m_str.m_pchData |
7 lp=GetCString() | lp ← const CString& | Calls inline CString::operator LPCSTR to get m_pchData. Equivalent to: lp = GetCString().m_pchData; | return m_str; (as const CString&) No conversion, just returns const pointer to m_str. |
8 lp=GetLPCSTR() | lp ← LPCSTR | No conversion. Simple assignment lp=pointer returned. | return m_str; (as LPCSTR) Calls CString::operator LPCSTR (inline) to convert m_str to LPCSTR. This just gets m_str.m_pchData. Equivalent to: return m_str.m_pchData |
cs = CString
lp = LPCSTR
red = performance hit
And the winner is CString! If you want to maximize performance, you should declare your Set/Get functions using CString, not LPCSTR.
class MyClass {
    CString m_str;
public:
    void SetFoo(const CString& cs) { m_str = cs; }
    const CString& GetFoo() { return m_str; }
};
Why? Well, it should be obvious from Figure 8 that CString is the way to go for the Get function. Case 6, where you return LPCSTR and then assign it to a CString, is the one to avoid because assigning an LPCSTR to a CString always allocates a new buffer and copies the bytes.
The Set function is a little more subtle. At first glance, it seems like case 3 is really bad because it not only creates a temp variable, but the temp variable must be destroyed as well. When you look at it again, you realize that the m_pchData created for temp is immediately copied to m_str inside the Set function and has its ref count bumped up. But, when temp is subsequently destroyed, nothing happens because m_str is now using the same m_pchData that was originally created for temp! In other words, the underlying string is allocated only once and then handed to m_str, where it resides until m_str is destroyed. This is essentially the same overhead as case 4, only the allocation happens inside the Set function in operator=(LPCSTR). Having a Set function that takes LPCSTR doesn't really buy you anything. (There are a few extra pushes and pops associated with creating the temp variable, but that's negligible.)
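The allocation counts behind this argument can be checked with a small model. RcString below is a hypothetical stand-in for a ref-counted CString: it uses std::shared_ptr in place of CStringData::nRefs and counts every buffer allocation. It is a sketch under those assumptions, not a measurement of MFC itself. MyClass mirrors the Set/Get functions of Figure 2, and the CaseN functions are helpers named after the cases in Figures 7 and 8.

```cpp
#include <cassert>
#include <memory>
#include <string>

typedef const char* LPCSTR;

// Hypothetical ref-counted string (not MFC); counts buffer allocations.
class RcString {
    std::shared_ptr<std::string> m_data;   // shared_ptr plays the role of nRefs
public:
    static int allocations;

    RcString() {}                                        // empty string, no allocation
    RcString(LPCSTR lp) : m_data(std::make_shared<std::string>(lp)) { ++allocations; }
    // The compiler-generated copy constructor and copy assignment just copy
    // the shared_ptr, which bumps a reference count (the "quick copy").
    RcString& operator=(LPCSTR lp) {
        assert(lp != nullptr);
        m_data = std::make_shared<std::string>(lp);      // allocate, copy
        ++allocations;
        return *this;
    }
    operator LPCSTR() const { return m_data ? m_data->c_str() : ""; }
};
int RcString::allocations = 0;

class MyClass {
    RcString m_str;
public:
    void SetCString(const RcString& cs) { m_str = cs; }  // quick copy
    void SetLPCSTR(LPCSTR lp)           { m_str = lp; }  // allocate, copy
    const RcString& GetCString() const  { return m_str; }
    LPCSTR GetLPCSTR() const            { return m_str; }
};

// Number of buffer allocations caused by running f.
template <typename F>
int AllocationsDuring(F f) {
    int before = RcString::allocations;
    f();
    return RcString::allocations - before;
}

int Case1() { MyClass foo; RcString cs("text");
              return AllocationsDuring([&] { foo.SetCString(cs); }); }
int Case2() { MyClass foo; RcString cs("text");
              return AllocationsDuring([&] { foo.SetLPCSTR(cs); }); }
int Case3() { MyClass foo; LPCSTR lp = "text";
              return AllocationsDuring([&] { foo.SetCString(lp); }); }
int Case4() { MyClass foo; LPCSTR lp = "text";
              return AllocationsDuring([&] { foo.SetLPCSTR(lp); }); }
int Case5() { MyClass foo; foo.SetLPCSTR("text"); RcString cs;
              return AllocationsDuring([&] { cs = foo.GetCString(); }); }
int Case6() { MyClass foo; foo.SetLPCSTR("text"); RcString cs;
              return AllocationsDuring([&] { cs = foo.GetLPCSTR(); }); }
```

In this model, cases 1 and 5 cause no allocation, while cases 2, 3, 4, and 6 cause exactly one each. Case 3 costs the same single allocation as case 4, and declaring the functions with the string class is the only choice that ever gets the count down to zero.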
The moral of the story is, use const CString& in all your declarations. This makes sense; m_str is already a CString so why convert it? LPCSTRs will have to be converted one way or another, so let the compiler do it when necessary. If you convert m_str to LPCSTR in your Set/Get functions, you'll only have to convert back again in the case where you have a CString. Phew!
Have a question about programming in C or C++? Send it to Paul DiLascia at 72400.2702@compuserve.com
This article is reproduced from Microsoft Systems Journal. Copyright © 1995 by Miller Freeman, Inc. All rights are reserved. No part of this article may be reproduced in any fashion (except in brief quotations used in critical articles and reviews) without the prior consent of Miller Freeman.