EarthWeb   
   datamation.com Brought to you by EarthWeb

HomeSubscribeSearchFAQSitemapContact Us
     

   

  Search Tips
  Advanced Search
   
  

  
Go to ITKnowledge Enterprise

Michael Abrash's Graphics Programming Black Book, Special Edition Michael Abrash's Graphics Programming Black Book, Special Edition
by Michael Abrash
The Coriolis Group
ISBN: 1576101746   Pub Date: 07/01/97

Search this book:
 
Previous Table of Contents Next


The first, obvious thing we can do to Listing 21.1 is change ADC AX,0 to ADC EAX,0, eliminating a prefix byte and saving a full cycle. Now we’re down from five to four cycles. What next?

Listing 21.2 shows one interesting alternative that doesn’t really buy us anything. Here, we’ve eliminated all size prefixes by doing byte-sized MOVs and ADDs, but because the size prefix on ADD AX,[ESI], for whatever reason, didn’t cost anything in Listing 21.1, our efforts are to no avail—Listing 21.2 still takes 4 cycles per checksummed word. What’s worth noting about Listing 21.2 is the extent to which the code is broken into simple instructions and reordered so as to avoid size prefixes, register contention, AGIs, and data bank conflicts (the latter because both [ESI] and [ESI+1] are in the same cache data bank, as discussed in the last chapter).

LISTING 21.2 L21-2.ASM

; Calculates TCP/IP (16-bit carry-wrapping) checksum for buffer
;  starting at ESI, of length ECX words.
; Returns checksum in AX.
; High word of EAX, DX, ECX and ESI destroyed.
; All cycle counts assume 32-bit protected mode.
; Assumes buffer length > 0.

        sub     eax,eax         ;initialize the checksum
        mov     dx,[esi]        ;first word to checksum
        dec     ecx             ;we’ll do 1 checksum outside the loop
        jz      short ckloopend ;only 1 checksum to do
        add     esi,2           ;point to the next word to checksum

ckloop:
        add     al,dl           ;cycle 1 U-pipe
        mov     dl,[esi]        ;cycle 1 V-pipe
        adc     ah,dh           ;cycle 2 U-pipe
        mov     dh,[esi+1]      ;cycle 2 V-pipe
        adc     eax,0           ;cycle 3 U-pipe
        add     esi,2           ;cycle 3 V-pipe
        dec     ecx             ;cycle 4 U-pipe
        jnz     ckloop          ;cycle 4 V-pipe

ckloopend:
        add     ax,dx           ;checksum the last word
        adc     eax,0

Listing 21.3 is a more sophisticated attempt to speed up the checksum calculation. Here we see a hallmark of Pentium optimization: two operations (the checksumming of the current and next pair of words) interleaved together to allow both pipes to run at near maximum capacity. Another hallmark that’s apparent in Listing 21.3 is that Pentium-optimized code tends to use more registers and require more instructions than 486-optimized code. Again, note the careful mixing of byte-sized reads to avoid AGIs, register contention, and cache bank collisions, in particular the way in which the byte reads of memory are interspersed with the additions to avoid register contention, and the placement of ADD ESI,4 to avoid an AGI.

LISTING 21.3 L21-3.ASM

; Calculates TCP/IP (16-bit carry-wrapping) checksum for buffer
;  starting at ESI, of length ECX words.
; Returns checksum in AX.
; High word of EAX, BX, EDX, ECX and ESI destroyed.
; All cycle counts assume 32-bit protected mode.
; Assumes buffer length > 0.

        sub     eax,eax            ;initialize the checksum
        sub     edx,edx            ;prepare for later ORing
        shr     ecx,1              ;we’ll do two words per loop
        jnc     short ckloopsetup  ;even number of words
        mov     ax,[esi]           ;do the odd word
        jz      short ckloopdone   ;no more words to checksum
        add     esi,2              ;point to the next word
ckloopsetup:
        mov     dx,[esi]           ;load most of 1st word to
        mov     bl,[esi+2]         ; checksum (last byte loaded in loop)
        dec     ecx                ;any more dwords to checksum?
        jz    short ckloopend      ;no

ckloop:
        mov     bh,[esi+3]      ;cycle 1 U-pipe
        add     esi,4           ;cycle 1 V-pipe
        shl     ebx,16          ;cycle 2 U-pipe
                                ;cycle 2 V-pipe idle
                                ; (register contention)
        or      ebx,edx         ;cycle 3 U-pipe
        mov     dl,[esi]        ;cycle 3 V-pipe
        add     eax,ebx         ;cycle 4 U-pipe
        mov     bl,[esi+2]      ;cycle 4 V-pipe
        adc     eax,0           ;cycle 5 U-pipe
        mov     dh,[esi+1]      ;cycle 5 V-pipe
        dec     ecx             ;cycle 6 U-pipe
        jnz     ckloop          ;cycle 6 V-pipe

ckloopend:
        mov     bh,[esi+3]      ;checksum the last dword
           add   ax,dx
           adc   ax,bx
           adc   ax,0

        mov         edx,eax         ;compress the 32-bit checksum
        shr         edx,16          ; into a 16-bit checksum
        add         ax,dx
        adc         eax,0
ckloopdone:

The checksum loop in Listing 21.3 takes longer than the loop in Listing 21.2, at 6 cycles versus 4 cycles for Listing 21.2—but Listing 21.3 does two checksum operations in those 6 cycles, so we’ve cut the time per checksum addition from 4 to 3 cycles. You might think that this small an improvement doesn’t justify the additional complexity of Listing 21.3, but it is a one-third speedup, well worth it if this is a critical loop—and, in general, if it isn’t critical, there’s no point in hand-tuning it. That’s why I haven’t bothered to try to optimize the non-inner-loop code in Listing 21.3; it’s only executed once per checksum, so it’s unlikely that a cycle or two saved there would make any real-world difference.

Listing 21.3 could be made a bit faster yet with some loop unrolling, but that would make the code quite a bit more complex for relatively little return. Instead, why not make the code more complex and get a big return? Listing 21.4 does exactly that by loading one dword at a time to eliminate both the word prefix of Listing 21.1 and the multiple byte-sized accesses of Listing 21.3. An obvious drawback to this is the considerable complexity needed to ensure that the dword accesses are dword-aligned (remember that unaligned dword accesses cost three cycles each), and to handle buffer lengths that aren’t dword multiples. I’ve handled these problems by requiring that the buffer be dword-aligned and a dword multiple in length, which is of course not always the case in the real world. However, the point of these listings is to illustrate Pentium optimization—dword issues, being non-inner-loop stuff, are solvable details that aren’t germane to the main focus. In any case, the complexity and assumptions are well justified by the performance of this code: three cycles per loop, or 1.5 cycles per checksummed word, more than three times the speed of the original code. Again, note that the actual order in which the instructions are arranged is dictated by the various optimization hazards of the Pentium.


Previous Table of Contents Next

homesubscribesearchfaqsitemapcontactus
Products |  Contact Us |  About Us |  Privacy  |  Ad Info  |  Home

Use of this site is subject to certain Terms & Conditions, Copyright © 1996-2000 EarthWeb Inc. All rights reserved. Reproduction whole or in part in any form or medium without express written permission of EarthWeb is prohibited. Read EarthWeb's privacy statement.