
|
 |
|
Michael Abrash's Graphics Programming Black Book, Special Edition
by Michael Abrash
The Coriolis Group
ISBN: 1576101746 Pub Date: 07/01/97
|
LISTING 21.4 L21-4.ASM
; Calculates TCP/IP (16-bit carry-wrapping) checksum for buffer
; starting at ESI, of length ECX words.
; Returns checksum in AX.
; High word of EAX, ECX, EDX, and ESI destroyed.
; All cycle counts assume 32-bit protected mode.
; Assumes buffer starts on a dword boundary, is a dword multiple
; in length, and length > 0.
sub eax,eax ;initialize the checksum
shr ecx,1 ;well do two words per loop
mov edx,[esi] ;preload the first dword
add esi,4 ;point to the next dword
dec ecx ;well do 1 checksum outside the loop
jz short ckloopend ;only 1 checksum to do
ckloop:
add eax,edx ;cycle 1 U-pipe
mov edx,[esi] ;cycle 1 V-pipe
adc eax,0 ;cycle 2 U-pipe
add esi,4 ;cycle 2 V-pipe
dec ecx ;cycle 3 U-pipe
jnz ckloop ;cycle 3 V-pipe
ckloopend:
add eax,edx ;checksum the last dword
adc eax,0
mov edx,eax ;compress the 32-bit checksum
shr edx,16 ; into a 16-bit checksum
add ax,dx
adc eax,0
Listing 21.5 improves upon Listing 21.4 by processing 2 dwords per loop, thereby bringing the time per checksummed word down to exactly 1 cycle. Listing 21.5 basically does nothing but unroll Listing 21.4s loop one time, demonstrating that the venerable optimization technique of loop unrolling still has some life left in it on the Pentium. The cost for this is, as usual, increased code size and complexity, and the use of more registers.
LISTING 21.5 L21-5.ASM
; Calculates TCP/IP (16-bit carry-wrapping) checksum for buffer
; starting at ESI, of length ECX words.
; Returns checksum in AX.
; High word of EAX, EBX, ECX, EDX, and ESI destroyed.
; All cycle counts assume 32-bit protected mode.
; Assumes buffer starts on a dword boundary, is a dword multiple
; in length, and length > 0.
sub eax,eax ;initialize the checksum
shr ecx,2 ;well do two dwords per loop
jnc short noodddword ;is there an odd dword in buffer?
mov eax,[esi] ;checksum the odd dword
jz short ckloopdone ;no, done
add esi,4 ;point to the next dword
noodddword:
mov edx,[esi] ;preload the first dword
mov ebx,[esi+4] ;preload the second dword
dec ecx ;well do 1 checksum outside the loop
jz short ckloopend ;only 1 checksum to do
add esi,8 ;point to the next dword
ckloop:
add eax,edx ;cycle 1 U-pipe
mov edx,[esi] ;cycle 1 V-pipe
adc eax,ebx ;cycle 2 U-pipe
mov ebx,[esi+4] ;cycle 2 V-pipe
adc eax,0 ;cycle 3 U-pipe
add esi,8 ;cycle 3 V-pipe
dec ecx ;cycle 4 U-pipe
jnz ckloop ;cycle 4 V-pipe
ckloopend:
add eax,edx ;checksum the last two dwords
adc eax,ebx
adc eax,0
ckloopdone:
mov edx,eax ;compress the 32-bit checksum
shr edx,16 ; into a 16-bit checksum
add ax,dx
adc eax,0
Listing 21.5 is undeniably intricate code, and not the sort of thing one would choose to write as a matter of course. On the other hand, its five times as fast as the tight, seemingly-speedy loop in Listing 21.1 (and six times as fast as Listing 21.1 would have been if the prefix byte had behaved as expected). Thats an awful lot of speed to wring out of a five-instruction loop, and the TCP/IP checksum is, in fact, used by network software, an area in which a five-times speedup might make a significant difference in overall system performance.
I dont claim that Listing 21.5 is the fastest possible way to do a TCP/IP checksum on a Pentium; in fact, it isnt. Unrolling the loop one more time, together with a trick of Terjes that uses LEA to advance ESI (neither LEA nor DEC affects the carry flag, allowing Terje to add the carry from the previous loop iteration into the next iterations checksum via ADC), produces a version thats a full 33 percent faster. Nonetheless, Listings 21.1 through 21.5 illustrate many of the techniques and considerations in Pentium optimization. Hand-optimization for the Pentium isnt simple, and requires careful measurement to check the efficacy of your optimizations, so reserve it for when you really, really need itbut when you need it, you need it bad.
A Quick Note on the 386 and 486
Ive mentioned that Pentium-optimized code does fine on the 486, but not always so well on the 386. On a 486, Listing 21.1 runs at 9 cycles per checksummed word, and Listing 21.5 runs at 2.5 cycles per checksummed word, a healthy 3.6-times speedup. On a 386, Listing 21.1 runs at 22 cycles per word; Listing 21.5 runs at 7 cycles per word, a 3.1-times speedup. As is often the case, Pentium optimization helped the other processors, but not as much as it helped the Pentium, and less on the 386 than on the 486.
|