Thoughts on the Craft of Programming: Abstraction, Refactoring, and How Changes Introduce Bugs - Vowels
(Page 4 of 6 )
Refactor to Minimize the Scope of the Definition of "Vowel" That code looks reasonably good, but it does combine two functions that are really quite separate: scanning the string and deciding whether a character is a vowel. Once again, our artistic antennae are alerted. Scanning a string is an instance of the more general function of traversing a data collection, so with the Visitor pattern at the back of our minds we decide to separate the two functions.
This also makes sense because the definition of "vowel" may vary between languages (and indeed may not exist at all in some), so this is a part of the program that looks as though it might depend on the target environment and hence one that should be isolated into a small function. With this change, the total amount of text has increased a little, but the conceptual complexity has been reduced because each function can now be viewed in isolation.
Although I will not demonstrate it here, the code shown in Figure 6 is ready for further generalization, because any is_x function with the same signature as is_vowel could be passed as an argument to the string scanning function, and so all sorts of different tests could be realized by writing only additional code (i.e., not changing existing code).
In C++ or Java, instead of passing a function address (impossible in Java) one would wrap each function in its own class, declaring is_x as a method and pass as the argument an object of the appropriate class (the functor idiom), and in C# one could use a delegate, but the principle is exactly the same.
Now that it is a separate function, is_vowel invites a more efficient implementation -- for example, one based on a 256-byte array of character properties in the manner of the isalpha function and its companions5. This would be worth doing if profiling revealed that this function was an execution bottleneck.
The function is_vowel is a candidate for an independent unit test. Indeed, since there are only 256 possible values of c6, this is an example of that rare case: a piece of code that can be exhaustively tested. So we should take advantage of this and write and execute such a test in this instance; this would be particularly worthwhile if we planned to improve the implementation as suggested in the previous paragraph. We'll return to this point a little later.
Figure 6: Refactor to Minimize the Scope of the Definition of "Vowel" Code
A New Feature: Counting the Number of Vowel Pairs The new feature is to count the number of vowel pairs in the string, or more precisely the number of instances where s[i] and s[i+1] are both vowels; for example, the vowel pair count for "beautiful" is 2, for "ea" and "au"7. Figure 7 shows the new function, count_vowelpairs, and its supporting function, is_vowelpair:
Figure 7: Counting Vowel Pairs
That was easy -- a little copy and paste, a minor edit, a one-line function, and we're done. Note how, because of the previous refactoring, we were able to use the is_vowelpair function to implement is_vowelpair. Note also that if in the original implementation of is_vowel we had passed a character pointer instead of a character, we could have implemented is_vowelpair as an is_x function and used the function pointer protocol mentioned previously to implement the new feature without having to implement count_vowelpairs as a separate new function. But we were not sufficiently farsighted, and the decision to pass a character was a wise one, based on the principle that it is safer not to pass a pointer if it can be avoided.
But are we really done? No, we have been a little cavalier in our approach, and the program contains an error, one that might well evade testing8.
On hearing that, a programmer might first be drawn to the is_vowelpair function because it dereferences (p+1). Is that always a valid address? Yes, in fact it is, because the test in the for statement guarantees that p always points to a character in the string, so even when p is pointing to the last character in the string, (p+1) is pointing at the null character terminating the string and so is still a valid address. We were lucky when we made the change, because we had not explicitly considered this. There is no way in C to protect is_vowelpair from being invoked with a pointer p such that (p+1) is an invalid address; we resolve to ensure that the specification for is_vowelpair contains a prominent notice of this precondition.
So what is the bug? It is this: if the string ends in a vowel x, the character pair "x\0" will be incorrectly identified as a vowel pair. The groundwork for this bug was laid in the improvement from Version 3 to Version 4; it turns out that the if test and the strchr test are not exactly equivalent. The function strchr considers the terminating null character to be part of the string for the purposes of the search, so searching for the null character will always succeed and return a non-zero pointer. This is documented in the specification for strchr but is not well known; it is not mentioned in the MSDN Library documentation for the Microsoft Foundation Classes method CString::Find (which itself calls strchr). Yet it is on such minutiż that the correct execution of a program may hinge.
Let's return briefly to the point about unit testing in the previous section. The long-term value of such testing is now more clearly apparent, and if we had doubts earlier about whether the investment of time to write and execute the test would be repaid, those doubts are now probably resolved. However, tests must be written with as much care as code, and in this particular case we must be sure to test the case when c is zero.
How Should we Fix this Bug?
One way would be to replace the body of is_vowelpair with the statement:
return is_vowel(*p) && *(p+1) != '\0' && is_vowel(*(p+1));
This will fix the bug, and in one sense it is quite correct, because it simply compensates for the slightly awkward behavior of strchr in our context. Another interpretation might be that the error lies in is_vowel, which classifies the null character as a vowel, and that this function should be changed to correct the error (I hear Bertrand Meyer -- see footnote 9 -- in the background chiding us for an insufficiently precisely defined precondition for is_vowel)9. I maintain, however, that either of these changes treats the symptom rather than the disease. Our original suspicion of (p+1) was justified; I contend that the proper precondition for is_vowelpair is that both p and (p+1) point to actual characters within the string, not to the terminator. Therefore a correct fix for the bug is to leave is_vowelpair as it is and replace the for statement in count_vowelpairs with this:
for (; *s != '\0' && *(s+1) != '\0'; s++)
It is necessary to test both characters because testing only *(s+1) will fail if the string is null (for then *(s+1) would reference an invalid address). This is a bit clumsy, and we might want to consider testing for a null string before entering the loop or using a different termination condition, perhaps one based directly on the length of the string. We might also want to consider weakening the precondition for is_vowel by making it classify the null character as not a vowel. As the saying goes, these things are left as exercises for the reader.
Next: An Aside on Programming by Contract >>
More Development Cycles Articles
More By The Rational Edge