I wrote strlen() for practice but it's really slow. Even if I copy the code from my compiler and run it, it's still slow. This is my test.

/*
  strlen_comp
  
  Comparing my and Visual C++'s strlen
    by Kimberly Hamrick
*/
#include <assert.h> // for result tests
#include <string.h> // for C++'s strlen()
#include <time.h>   // for timing stuff
#include <iostream> // for C++ I/O

using namespace std;

namespace hamrick {
  // partial namespace for prototypes
  size_t strlen( const char *str );
  size_t xstrlen( const char *str );
}

typedef size_t (*strlen_pfunc)( const char * );

void unit_test( const char *title, const char *data, strlen_pfunc func ) {
  // add the time for a bunch of runs
  clock_t start = clock();
  for ( int i = 0; i < 100000000; ++i )
    func( data );
  cout<< title <<" -- "<< ( double( clock() ) - start ) / CLOCKS_PER_SEC <<endl;
}

int main() {
  // test different cases
  assert( hamrick::strlen( "" ) == size_t( 0 ) );
  assert( hamrick::strlen( "A" ) == size_t( 1 ) );
  assert( hamrick::strlen( "ABCDEFGHIJKLMNOPQRSTUVWXYZ" ) == size_t( 26 ) );
  assert( hamrick::strlen( 0 ) == size_t( 0 ) );

  // test performance
  unit_test( "strlen          ", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", strlen );
  unit_test( "hamrick::strlen ", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", hamrick::strlen );
  unit_test( "hamrick::xstrlen", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", hamrick::xstrlen );

  return 0;
}

namespace hamrick {
  // this is my version of strlen
  size_t strlen( const char *str ) {
    // treat a null pointer as an empty string
    size_t len = 0;

    if ( str != 0 ) {
      // find the end and count each character
      while ( *str != '\0' ) {
        ++len;
        ++str;
      }
    }

    return len;
  }

  // I copied this from the compiler source
  size_t xstrlen( const char *str ) {
    const char *eos = str;
    while( *eos++ ) ;
    return( eos - str - 1 );
  }
}

When I run it hamrick::strlen() and hamrick::xstrlen() are about 10 times slower than strlen()! Why?

strlen           -- 1.874
hamrick::strlen  -- 9.904
hamrick::xstrlen -- 9.405
Press any key to continue . . .

That's cool, thanks! But that doesn't answer my question. This is what I stole for the xstrlen() test.

size_t __cdecl strlen (
        const char * str
        )
{
        const char *eos = str;

        while( *eos++ ) ;

        return( eos - str - 1 );
}

But when I call strlen(), it's 10 times faster than xstrlen() when the code for both is the same. Why is that?

Did you compile the debug code, or the release code?

Did you disable the stack checking?
- the diagnostic checks for stack overflow and/or stack corruption will dominate the time for small functions.

In anticipation of your next question (how do I disable...), which version of Visual C++ are you using?

Did you look at the actual assembler instructions generated by the compiler for xstrlen() ?

Did you compile the debug code, or the release code?

Debug.

In anticipation of your next question (how do I disable...), which version of Visual C++ are you using?

Visual C++ 2005 Express Edition.

Did you look at the actual assembler instructions generated by the compiler for xstrlen() ?

I tried that, but the assembly output doesn't show me the code generated for C++'s strlen(), just my strlen() and xstrlen(). I can't look at the assembler to see what C++'s strlen() is doing different.

> the assembly output doesn't show me the code generated for C++'s strlen(), just my strlen() and xstrlen().

the microsoft C++ compiler recognizes and directly inserts optimized inline code for several functions (including strlen). functions like strlen in the Standard C library are available in both intrinsic and non-intrinsic forms. we can control the use of intrinsics by inserting #pragma intrinsic statements into the source code or using the -Oi compiler switch.

the microsoft C++ compiler recognizes and directly inserts optimized inline code for several functions (including strlen). functions like strlen in the Standard C library are available in both intrinsic and non-intrinsic forms. we can control the use of intrinsics by inserting #pragma intrinsic statements into the source code or using the -Oi compiler switch.

I get the same result if intrinsic forms are enabled or disabled. :(

without intrinsics, the code generated for hamrick::strlen is:

;    COMDAT ?strlen@hamrick@@YAIPBD@Z
_TEXT    SEGMENT
_str$ = 8                        ; size = 4
?strlen@hamrick@@YAIPBD@Z PROC                ; hamrick::strlen, COMDAT

; 49   :     // treat a null pointer as an empty string
; 50   :     size_t len = 0;
; 51   : 
; 52   :     if ( str != 0 ) {

    mov    ecx, DWORD PTR _str$[esp-4]
    xor    eax, eax
    test    ecx, ecx
    je    SHORT $LN1@strlen

; 53   :       // find the end and count each character
; 54   :       while ( *str != '\0' ) {

    cmp    BYTE PTR [ecx], al
    je    SHORT $LN1@strlen
    npad    2
$LL2@strlen:

; 55   :         ++len;
; 56   :         ++str;

    add    ecx, 1
    add    eax, 1
    cmp    BYTE PTR [ecx], 0
    jne    SHORT $LL2@strlen
$LN1@strlen:

; 57   :       }
; 58   :     }
; 59   : 
; 60   :     return len;
; 61   :   }

    ret    0
?strlen@hamrick@@YAIPBD@Z ENDP                ; hamrick::strlen
_TEXT    ENDS

for hamrick::xstrlen, it is

PUBLIC    ?xstrlen@hamrick@@YAIPBD@Z            ; hamrick::xstrlen
; Function compile flags: /Ogtpy
;    COMDAT ?xstrlen@hamrick@@YAIPBD@Z
_TEXT    SEGMENT
_str$ = 8                        ; size = 4
?xstrlen@hamrick@@YAIPBD@Z PROC                ; hamrick::xstrlen, COMDAT

; 65   :     const char *eos = str;

    mov    ecx, DWORD PTR _str$[esp-4]
    mov    eax, ecx
$LL2@xstrlen:

; 66   :     while( *eos++ ) ;

    mov    dl, BYTE PTR [eax]
    add    eax, 1
    test    dl, dl
    jne    SHORT $LL2@xstrlen

; 67   :     return( eos - str - 1 );

    sub    eax, ecx
    sub    eax, 1

; 68   :   }

    ret    0
?xstrlen@hamrick@@YAIPBD@Z ENDP                ; hamrick::xstrlen
_TEXT    ENDS

for the crt strlen, this is the code picked up directly from the debugger:

78144550  mov         ecx,dword ptr [esp+4] 
78144554  test        ecx,3 
7814455A  je          78144580 
7814455C  mov         al,byte ptr [ecx] 
7814455E  add         ecx,1 
78144561  test        al,al 
78144563  je          781445B3 
78144565  test        ecx,3 
7814456B  jne         7814455C 
7814456D  add         eax,0 
78144572  lea         esp,[esp] 
78144579  lea         esp,[esp] 
;----------------------------------------------------------
78144580  mov         eax,dword ptr [ecx] 
78144582  mov         edx,7EFEFEFFh 
78144587  add         edx,eax 
78144589  xor         eax,0FFFFFFFFh 
7814458C  xor         eax,edx 
7814458E  add         ecx,4 
78144591  test        eax,81010100h 
78144596  je          78144580 
;-------------------------------------------------------------
78144598  mov         eax,dword ptr [ecx-4] 
7814459B  test        al,al 
7814459D  je          781445D1 
7814459F  test        ah,ah 
781445A1  je          781445C7 
781445A3  test        eax,0FF0000h 
781445A8  je          781445BD 
781445AA  test        eax,0FF000000h 
781445AF  je          781445B3 
781445B1  jmp         78144580 
781445B3  lea         eax,[ecx-1] 
781445B6  mov         ecx,dword ptr [esp+4] 
781445BA  sub         eax,ecx 
781445BC  ret

the loop that executes in this code is from instructions at address 78144580 to 78144596; this compares 4 bytes at a time. (hamrick::strlen,xstrlen compare one byte at a time.) the code before this loop is to get to a multiple of 4-bytes (for alignment) and the code after that is to take care of the trailing bytes. because the string we are using to test is a literal, it starts on a 4-byte multiple. 27 bytes need to be tested, but because of alignment in the SDA, a null byte of padding would be added at the end making it a clear multiple of 28 bytes.
because of this

char test_str[] = "0ABCDEFGHIJKLMNOPQRSTUVWXYZ\0\0\0" ;
unit_test( "strlen          ", test_str+1, strlen );

would make the results closer
and

char test_str[] = "0ABCD\0\0\0" ;
unit_test( "strlen          ", test_str+1, strlen );

would make hamrick::strlen,xstrlen actually faster than CRT strlen (provided we disable intrinsics).

That makes sense. But even when I disable intrinsics, the result is the same. Is Visual C++ really using a different function than the one in the CRT source folder?

That makes sense. But even when I disable intrinsics, the result is the same. Is Visual C++ really using a different function than the one in the CRT source folder?

it appears so - the disassembly code i posted is with intrinsics disabled. perhaps there are different versions of sources for the debug version of the library and the release version. i do not have the CRT source for visual studio with me (i tested it on vc++ express); you could look into the CRT source tree for a directory containing assembly code.

This question has already been answered. Start a new discussion instead.