Collections 201-Indexing


Simple Collections, like the one above, provide a handy storage place when you don’t know how many data items you’ll have. But as we saw with the CVector, CList, and CStack classes, there are other ways to store a variable number of items. The unique feature of Collections is that you can look up items with string keys.


To do this, you specify the key as the second argument of the Add method. Later you can use the key instead of an index to access the item. Technically, you can make any string the key. You can also assign a string key to each item in a Collection of integers. For example:

Do
s = GetName ‘ Gets name from dialog or file
i = GetAge(s) ‘ Gets age from dialog or file
If s = sEmpty Then Exit Do
people.Add i, s ‘ Store data (age) first, then key (name)
Loop

Now you have the age of each person stored in the items of a Collection. Later you might need to get that data. You could access it like this:

iAge = people(7)

But the index merely reflects the order in which people were added and probably isn’t a significant value. You’re much more likely to know the person’s name and use it when you need to look up the data:

iAge = people(“George”)

Figure 4-1 shows how the data is stored in this Collection. The name is the key. The age is the data. The Collection might have additional internal values (possibly to assist in hashing to look up values), but these are the obvious ones. Notice that the string data isn’t actually stored in the Collection. The Collection contains a BSTR (see “Dealing with Strings” in Chapter 2), which is actually a pointer to string data. Usually, this is a detail we can ignore, but it will be relevant in the next example.


The age Collection isn’t a very realistic example. You’re not likely to use a Collection with a key just to store one piece of data. A much more likely scenario is that you’ll put objects with several properties into a Collection. You’ll put an identifying property (one that you can reconstruct from other sources) into the Collection as the key. Let’s assume the following simple class:

‘ CPerson class
Public Age As Integer
Public Coolness As Double
Public Name As String
§

The Name property is the obvious choice for the key. You’ll choose a particular property as the key because it represents a value unique to the object—a value that you’ll know in circumstances when you don’t know the object’s position in the Collection.



Figure 4-1. How data is stored in Collections.


Let’s take a look at some code to add people objects to a Collection:

Dim person As CPerson, s As String
Do
s = GetName
If s = sEmpty Then Exit Do
Set person = New CPerson
person.Age = GetAge(s)
person.Coolness = GetCoolness(s)
person.Name = s
§
people.Add person, person.Name
Loop

Later if someone asks you how cool George is, it’s easy to find out:

Debug.Print “How cool is George? “ & people(“George”).Coolness

But there is a problem. See if you can identify it in Figure 4.1. In the first example (see page 195), the age data is actually stored in the Variant item. In this example, the data is an object and an object won’t fit in a 16-byte variant even if you want it to. An object in a Variant is actually a reference to an object outside the Variant. Strings in the object are actually BSTR pointers to data outside the object.


The problem is that the string data for the Name property exists in two different places. The key data copy is attached to the Collection, and a second copy is attached to the CPerson object. If you have a lot of people in your Collection, you might be wasting a lot of memory. A frequently requested Collection enhancement is to provide an Index property that returns the index of an item by its key, and a Key property that would return the key data by its index:

iIndex = people.Index(“George”)
sKey = people.Key(iIndex)

In some collections, these properties could save you from storing duplicate strings for the key. In other cases, the key is a calculated string that you wouldn’t need to access anyway. These properties would give the Collection class flexibility similar to the associative arrays of the Perl language.

FLAME The Collection class provided with Visual Basic version 4 was a good 1 version. The Collection class provided with Visual Basic version 5 is not a good 2 version. Users offered many complaints and suggestions. None of them were implemented.
Problem: What is a Collection? You know its user interface, but you don’t really know what internal data structure makes it work. Normally, this isn’t a concern. Who cares how it works as long as it works? But, on second thought, the internal implementation is bound to affect the efficiency of operations. If some operations are less efficient than others, it might be nice to avoid the slow ones. While writing the Collection sort procedures in Chapter 5, I began to suspect that inserting elements into the middle of a Collection might be an operation to avoid.


I theorized that it would be faster to insert elements at the end of a Collection than at the beginning or in the middle and that insertion would get slower as you added more elements. To test this idea, I tried inserting a lot of elements (8000) into different parts of a Collection. The results weren’t quite what I expected.

Problem P-Code Native Code
Add first half to end of Collection 0.0635 sec 0.0509 sec
Add last half to end of Collection 0.0696 sec 0.0573 sec
Add first half to start of Collection 0.0965 sec 0.0735 sec
Add last half to start of Collection 0.0987 sec 0.0766 sec
Add first half to middle of Collection 1.2021 sec 1.1732 sec
Add last half to middle of Collection 4.0951 sec 3.8412 sec


Conclusion: What kind of data structure allows fast insertion at the beginning and at the end but not in the middle? This stumped me for a while, but here’s a theory. If a Collection were implemented internally as a doubly linked list, insertion would always cost the same. But finding a particular position (with the before or after parameter of the Add method or the index parameter of the Remove method) would involve iterating from the start or the end until you reached the given point. To find a position near the middle, you’d have to start at the beginning or the end, whichever was closest, and iterate almost halfway through. The more elements in the list, the farther you’d have to go. This would explain the results shown. And, as a matter of fact, I have been told by Visual Basic developers that Collections are doubly linked lists (with additional features to support indexing).


Of course, anything you think you know about the implementation of a ­Collection might be wrong for this version and probably will be wrong for the next version. Still, you might want to think carefully about whether you really need to insert elements in the middle of a large Collection.