Both of the previous algorithms are seriously flawed. They never choose the first line, while the second line has a 50% probability of being picked up. For any subsequent line, the probability of being picked up is decreasing.
This comment is wrong, because in the last algorithm presented (repeated here for convenience, with the parameter type changed from ifstream to ifstream& because you can't copy an object of type ifstream):
string random_line(ifstream& file)
{
string line;
string selected_line;
int nlines = 0;
while (getlinefile, line))
if ((rand() % ++nlines) == 0)
selected_line = line;
return selected_line;
}
at the beginning of the first trip through the loop, nlines is zero. ++nlines increments it to 1, which means that we start by computing rand() % 1, which is always zero. This fact means that the first time through the loop, the statement
selected_line = line;
will always be executed. If the values of the random numbers are such that this statement is not executed again, then the algorithm will select the first line.
The second misstatement is more subtle. AsAccelerated C++ points out, using rand() % n for a random number between 0 and n-1 gives results that are generally not random, even if rand() itself is perfect.
To see why, assume that rand() returns a 32-bit signed value, i.e. a value between 0 and 2147483647, inclusive, and that you are computing rand() % 190.
There are 2147483648 possible values that rand() can take on, and 2147483648 % 190 is 98. In other words, 190 does not divide 2147483648 exactly. This fact means that the 190 possible values that this algorithm yields cannot possibly occur with equal probability.
Instead of using rand() % n, it would be more accurate to use nrand(n), where nrand is the function defined along the lines of Accelerated C++:
int nrand(int n)
{
assert(n <= RAND_MAX);
const int bucket_size = RAND_MAX/n;
int r;
do r = rand()/bucket_size;
while(r >= n);
return r;
}