![]() |
|
![]() ![]() ![]() ![]() ![]() ![]() |
![]() |
|
![]() |
![]() |
Listing 16.7 L16-7.ASM ScanLoop: lodsw ;get the next 2 bytes (AL = first, AH = 2nd) xlat ;look up firsts char/not status xor dl,al ;see if theres a new char/not status add di,dx ;we add 1 for each char/not transition mov dl,al mov al,ah ;look at the second byte xlat ;look up its char/not status xor dl,al ;see if theres a new char/not status add di,dx ;we add 1 for each char/not transition mov dl,al dec dx jnz ScanLoop John later divides the transition count by two to get the word count. (Food for thought: Its also possible to use CMP and ADC to detect words without branching.) Johns approach makes it clear that word-counting is nothing more than a fairly simple state machine. The interesting part, of course, is building the fastest state machine. Level 3: BreakthroughThe boundaries between the levels of optimization are not sharply defined. In a sense, level 3 optimization is just like levels 1 and 2, but more so. At level 3, one takes whatever level 2 perspective seems most promising, and implements it as efficiently as possible on the x86. Even more than at level 2, at level 3 this means breaking out of familiar patterns of thinking. In the case of word counting, level 3 means building a table-driven state machine dedicated to processing a buffer of bytes into a count of words with a minimum of branching. This level of optimization strips away many of the abstractions we usually use in coding, such as loops, tests, and named variableslook back to Listing 16.5, and youll see what I mean. Only a few people reached this level, and I dont think any of them did it without long, hard thinking; David Staffords final entry (that is, the one I present as Listing 16.5) was at least the fifth entry he sent me. The key concept at level 3 is the use of a massive (64K) lookup table that processes byte sequences directly into word-count actions. With such a table, its possible to look up the appropriate action for two bytes simultaneously in just a few instructions; next, Im going to look at the inspired and highly unusual way that Davids code, shown in Listing 16.5, does exactly that. (Before assembling Listing 16.5, you must run the C code in Listing 16.8, to generate an include file defining the 64K lookup table. When you assemble Listing 16.5, TASM will report a location counter overflow warning; ignore it.) LISTING 16.8 MAKETAB.C // MAKETAB.C Build QSCAN3.INC for QSCAN3.ASM #include <stdio.h> #include <ctype.h> #define ChType( c ) (((c) & 0x7f) == \ || isalnum((c) & 0x7f)) int NoCarry[ 4 ] = { 0, 0x80, 1, 0x80 }; int Carry[ 4 ] = { 1, 0x81, 1, 0x80 }; void main( void ) { int ahChar, alChar, i; FILE *t = fopen( QSCAN3.INC, wt ); printf( Building table. Please wait... ); for( ahChar = 0; ahChar < 128; ahChar++ ) { for( alChar = 0; alChar < 256; alChar++ ) { i = ChType( alChar ) * 2 + ChType( ahChar ); if( alChar % 8 == 0 ) fprintf( t, \ndb %02Xh, NoCarry[ i ] ); else fprintf( t, ,%02Xh, NoCarry[ i ] ); fprintf( t, ,%02Xh, Carry[ i ] ); } } fclose( t ); } Davids approach is simplicity itself, although his implementation arguably is not. Consider any three sequential bytes in the buffer. Those three bytes define two potential places where a word might be counted, as shown in Figure 16.1. Given the separator/non-separator states of the three bytes, you can instantly determine whether to count a word or not; you count a word if and only if somewhere in the sequence there is a non-separator followed by a separator. Note that a maximum of one word can be counted per three-byte sequence. The trick, then, is to identify the separator/not statuses of each set of three bytes and turn them into a 1 (count word) or 0 (dont count word), as quickly as possible. Assuming that the separator/not status for the first byte is in the Carry flag, this is easily accomplished by a lookup in a 64K table, based on the Carry flag and the other two bytes, as shown in Figure 16.2. (Remember that were counting 7-bit ASCII here, so the high bit is ignored.) Thus, David is able to add the word/not status for each pair of bytes to the main word count simply by getting the two bytes, working in the carry status from the last byte, and using the resulting value to index into the 64K table, adding in the 1 or 0 value found in that table. A sequence of MOV/ADC/ADD suffices to perform all word-counting tasks for a pair of bytes. Three instructions, no branchespretty nearly perfect code.
One detail remains to be attended to: setting the Carry flag for next time if the last byte was a non-separator. David does this in a bizarre and incredibly effective way: He presets the high bit of the count, and sets the high bit in the lookup table for those entries looked up by non-separators. When a non-separators lookup entry is added to the count, it will produce a carry, as desired. The high bit of the count is masked off before being added to the total count, so David is essentially using different parts of the count variables for different purposes (counting, and setting the Carry flag).
There are a number of other interesting details in Davids code, including the unrolling of the loop 64 times, so that 256 bytes in a row are processed without a single branch. Unfortunately, I lack the space to discuss Listing 16.5 any further. Perhaps thats not so unfortunate, after all; Id hate to deny you the pleasure of discovering the wonders of this rather remarkable code yourself. I will say one more thing, though. The cycle count for Davids inner loop is 6.5 cycles per byte processed, and the actual measured time for his routine, overhead and all, is 7.9 cycles/byte. The original C code clocked in at around 100 cycles/byte. Enough said, I trust. Enough Word Counting Already!Before I finish up this chapter, Id like to mention that Terje Mathisens WC word-counting program, which Ive mentioned previously and which is available, with source, on Bix, is in the ballpark with Davids code for performance. Whats more, Terjes program handles 8-bit ASCII, counts lines as well as words, and supports user-definable separator sets. Its wonderful code, well worth a look; it also happens to be a great word-counting utility. By the way, Terje builds his 64K table on the fly, at program initialization; this allows for customized tables, shrinks the size of the EXE, and, according to Terjes calculations, takes less time than loading the table off disk as part of the EXE. So, has David written the fastest possible word-counting code? Well, maybebut I have a letter from Terry Holmes, of San Rafael, California, that calculates the theoretical maximum performance of native 386 word-counting code at 5.5 cycles/byte, which would be significantly faster than Davids code. Terry, alas, didnt bother to implement his design, but maybe Ill take a shot at it someday. Itd be fun, for surebut jeez, Ive got real work to do!
![]() |
![]() |
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
![]() |
|