Reading in a random line from a file - Page 2

Question

vijayan121 1,152 Posting Virtuoso

13 Years Ago

My problem is, how to prevent reading so many lines from the file, especially if I have to return more than one line. If I have to return k lines, the complexity is theta(k n^2). I would rather prefer theta(k n log n) .

To select K lines from a population containing an unknown number of lines, you do not have to repeat the selection of one random line many times. A reservoir sampling algorithm can do this in one pass. (The earlier example of selecting one random line is a special case of reservoir sampling; where the reservoir size is just one.)

#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <iterator>
#include <vector>
#include <random>
#include <stdexcept>

using namespace std ;

// simple(st) reservoir algorithm for random sampling without replacement
//
// algorithm R (credited to Alan Waterman) from:
//    Random Sampling with a Reservoir' by Jeffry Scott Vitter,
//         ACM Transactions on Mathematical Software, March 1985
//    http://www.ittc.ku.edu/~jsv/Papers/Vit85.Reservoir.pdf
//
// randomly select K (different) lines from an input stream opened in text mode
// time complexity is O(N)
template< typename RNG >
vector<string> random_lines( istream& stm, size_t K, RNG& engine )
{
    vector<string> sample ;
    std::string line ;

    while( sample.size() < K )
    {
        if( getline( stm, line ) ) sample.push_back(line) ;
        else throw range_error( "sample size too large" ) ;
    }

    for( size_t i = K ; getline( stm, line ) ; ++i )
    {
        size_t r = uniform_int_distribution<size_t>( 0, i )(engine) ;
        if( r < K ) sample[r] = line ;
    }

    return sample ;
}

// usage:
int main()
{
    random_device rdev ;
    mt19937 twister( rdev() ) ;
    std::ifstream file( __FILE__ ) ;
    vector<string> sample = random_lines( file, 15, twister ) ;

    int i = 0 ;
    for_each( sample.begin(), sample.end(),
              [&i]( const string& s ) { cout << ++i << ". " << s << '\n' ; } ) ;
}

arkoenig 340 Practically a Master Poster

13 Years Ago

Of course you don't have to read a vector to return a random element. You do, however, have to know how many elements the vector has.

If the "vector" is really a file, the only way to know how many lines it has is to read the entire file. You can guess how many lines it has by reading part of the file and assuming that the average line length for the entire file is the same as it is for the part that you read, but without reading the entire file, you have no way of knowing how accurate your guess is.

vijayan121 1,152 Posting Virtuoso

13 Years Ago

The algorithm requires a large pool compared with the sample, else it can
happen that some sample[] element is never populated.
It is only guaranteed that the first element is not null, or do I miss something?

The reservoir algorithm (for simple random sampling without replacement of K lines in a file) works like this:

The reservoir size is fixed at K (the sample size).

A.
First (lines 27-31) create a reservoir of size K filled with the first K lines in the file.

while( sample.size() < K )
    {
        if( getline( stm, line ) ) sample.push_back(line) ;
        else throw range_error( "sample size too large" ) ;
    }

B.
Now read line K+1.
Generate a random integer uniformly distributed in the inclusive interval [0,K]
With probability K/(K+1), replace a random line in the reservoir with line K+1

size_t r = uniform_int_distribution<size_t>( 0, i )(engine) ;
        if( r < K ) sample[r] = line ;

So far, we have read K+1 lines of which K are in the reservoir.
Probability that line K+1 is in the reservoir is now K/(K+1).
Probability that line K+1 is not in the reservoir is 1/(K+1).

Probability that a particular line (say line 1) is now not in the reservoir is:
the probability that line K+1 replaced a line in the reservoir: K/(K+1)
multiplied by the probability that line 1 was the one that was replaced: 1/K (equal probability for any of the K lines).
ie. K/(K+1) * 1/K == 1/(K+1).
The probability that a particular line is still in the reservoir is 1 - 1/(K+1) == K/(K+1).
Each of the K+1 lines have a probability K/K+1 of being in the reservoir.

C.
Now read line K+2.
Generate a random integer uniformly distributed in the inclusive interval [0,K+1]
With probability K/(K+2), replace a random line in the reservoir with line K+2

So far, we have read K+2 lines of whichKare in the reservoir.
Probability that line K+2 is in the reservoir is now K/(K+2).
Probability that line K+2 is not in the reservoir is 2/(K+2).

Probability that any other line is now in the reservoir is:
K/(K+1) * ( 2/(K+2) + K/(K+2) * (K-1)/K ) == K/(K+1) * ( 2/(K+2) + (K-1)/(K+2) ) == K/(K+1) * ( (K+1)/(K+2) ) == K/K+2

Each of the K+2 lines have a probability K/(K+2) of being in the reservoir.

D.
By induction, if there were N lines in the file, each of the N lines has a
probability of K/N of being in the reservoir at the end.

The trouble is that you might not know the value of K at the time you start reading the file.

Yes, the requirement might be: 'create a sample containing 0.5% of the total number of lines in the file.' If approximate randomness is adequate, I suppose a stratified sampling algorithm would suffice.

Otherwise, AFAIK, there is no other option other than:
a. Read the complete file once and determine the population size N
b. Compute the sample size K
c. Perform the selection.

In my experience, using stored stream offsets and then making multiple calls to seekg() is usually slower than making a second pass through the entire file from beginning to end as part of a standard reservoir algorithm (after having determined K during the first pass).

Here are some numbers (admittedly the file is not huge; to compensate for that, the programs were run on a slow antique laptop).

Randomly select 1000 lines from a 446 MB file containing 8.4 million lines of text.

>wc -lc test.txt
8415696 446347032 test.txt

lines: 8415696 bytes: 446347032 test.txt
average line length: 446347032 / 8415696 = 53.04
1000/8415696 = 0.0001188
8415696 * 0.0001188 = 999.7846848 = 1000 lines

#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <iterator>
#include <vector>
#include <random>
#include <stdexcept>
#include <ctime>

using namespace std ;

// simple(st) reservoir algorithm for random sampling without replacement
//
// algorithm R (credited to Alan Waterman) from:
//    Random Sampling with a Reservoir' by Jeffry Scott Vitter,
//         ACM Transactions on Mathematical Software, March 1985
//    http://www.ittc.ku.edu/~jsv/Papers/Vit85.Reservoir.pdfB.
//
// randomly select K (different) lines from an input stream opened in text mode
// time complexity is O(N), memory complexity O(K)
template< typename RNG >
vector<string> random_lines_n( istream& stm, size_t K, RNG& engine )
{
    vector<string> sample ;
    std::string line ;

    while( sample.size() < K )
    {
        if( getline( stm, line ) ) sample.push_back(line) ;
        else throw range_error( "sample size too large" ) ;
    }

    for( size_t i = K ; getline( stm, line ) ; ++i )
    {
        size_t r = uniform_int_distribution<size_t>( 0, i )(engine) ;
        if( r < K ) sample[r] = line ;
    }

    return sample ;
}

// randomly select K (different) lines from an input stream opened in text mode
// K is not known untill after all the lines have been read
// **** CAUTION: algorithm uses stored stream offsets for every line in the stream
// **** CAUTION: will not work on streams (like socket streams) which do not support seek
// time complexity is O(N), memory complexity O(N)
template< typename RNG >
vector<string> random_lines_f( istream& stm, double fraction, RNG& engine )
{
    vector<streamoff> offsets( 1U, 0 ) ;
    std::string line ;
    while( getline( stm, line ) ) offsets.push_back( stm.tellg() ) ;
    offsets.pop_back() ;

    typedef vector<streamoff>::size_type size_type ;
    size_type K = size_type( offsets.size() * fraction + 0.5 ) ;
    size_type N = offsets.size() ;

    // partial Fisher-Yates shuffle to get first K as random elements
    for( size_type i = 0 ; i < K ; ++i )
    {
         size_type r = uniform_int_distribution<size_type>( i, N-1 )(engine) ;
         swap( offsets[r], offsets[i] ) ;
    }

    vector<string> sample ;
    stm.clear() ;
    for( size_type i = 0 ; i < K ; ++i )
    {
        stm.seekg( offsets[i] ) ;
        if( !getline( stm, line ) ) cout << offsets[i] << "  ***\n" ;
        sample.push_back(line) ;
    }

    return sample ;
}

// usage:
int main()
{
    random_device rdev ;
    mt19937 twister( rdev() ) ;
    const char* const path2file = "test.txt" ; 
    std::ifstream file( path2file ) ;

    time_t start_time = time(0) ;
    clock_t start_clock = clock() ;
    vector<string> sample = random_lines_f( file, 0.0001188, twister ) ;
    //vector<string> sample = random_lines_n( file, 1000, twister ) ;
    clock_t end_clock = clock() ;
    time_t end_time = time(0) ;

    cout << end_time - start_time << " elapsed secs  "
         << double(end_clock-start_clock)/CLOCKS_PER_SEC << " clock secs.\n" ;
}

The hardware:

>dmesg | grep CPU | grep GHz
[ 0.339541] CPU0: Intel(R) Pentium(R) M processor 1.66GHz stepping 08
>dmesg | grep MB | grep available
[ 0.000000] 0MB HIGHMEM available.
[ 0.000000] 510MB LOWMEM available.

compiler: g++ 4.5

>alias c++0x
g++ -Wall -std=c++0x -pedantic -Werror -I /usr/local/include -L /usr/local/lib

The first set of numbers are for random_lines_f() using stream offsets.

>c++0x -fomit-frame-pointer -O3 main_f.cc && ./a.out && ./a.out && ./a.out && ./a.out
56 elapsed secs 32.92 clock secs.
51 elapsed secs 32.68 clock secs.
49 elapsed secs 32.52 clock secs.
49 elapsed secs 32.48 clock secs.

The second set of numbers are for random_lines_n().
Two passes through the file will take more time; even if that is twice as much, this would still be faster.

>c++0x -fomit-frame-pointer -O3 main_n.cc && ./a.out && ./a.out && ./a.out && ./a.out
16 elapsed secs 4.83 clock secs.
13 elapsed secs 4.86 clock secs.
12 elapsed secs 4.86 clock secs.
11 elapsed secs 4.82 clock secs.

Note: This is just one measurement under a particular set of circumstances; it is intended to be merely illustrative.

Edited 13 Years Ago by vijayan121 because: n/a

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

WaltP 2,905 Posting Sage w/ dash of thyme Team Colleague · Answer 1 · 2011-03-15T03:19:16+00:00

1) Store the offset of the current line (for fseek) into an array
2) Read the line of the file
3) Increment COUNTER that counts the line
4) Continue at 1 until EOF
5) Get a random number based on COUNTER
6) fseek() to the line
7) Read the line.

A random line read.

kylcrow 0 Junior Poster · Answer 2 · 2011-03-15T03:35:18+00:00

As the OP to this thread I welcome all conversation regardless of the original posting date :)

razvan1 0 Newbie Poster · Answer 3 · 2011-03-15T03:38:13+00:00

But i am factoring in the preference for long lines:

while(true){
  jump_to_random_location();
  char* s = readlineatlocation(); // backward & forward
  if (!nrand(strlen(s))) return s;
}

This algorithm provides the same probability for each line, I don't understand why you claim the opposite.
My problem is, how to prevent reading so many lines from the file, especially if I have to return more than one line. If I have to return k lines, the complexity is theta(k n^2). I would rather prefer theta(k n log n) .

razvan1 0 Newbie Poster · Answer 4 · 2011-03-15T03:50:21+00:00

1) Store the offset of the current line (for fseek) into an array
2) Read the line of the file
3) Increment COUNTER that counts the line
4) Continue at 1 until EOF
5) Get a random number based on COUNTER
6) fseek() to the line
7) Read the line.
A random line read.

Because we (actually, it's only me) don't want to read the whole file.

arkoenig 340 Practically a Master Poster · Answer 5 · 2011-03-15T04:39:25+00:00

Without reading the whole file, you cannot select every line with equal probability. One reason is that you have no way of knowing whether the portions of the file you did not read consist of one long line each, or a whole bunch of little tiny lines each.

arkoenig 340 Practically a Master Poster · Answer 6 · 2011-03-15T20:49:21+00:00

The trouble is that you might not know the value of K at the time you start reading the file.

razvan1 0 Newbie Poster · Answer 7 · 2011-03-15T21:27:41+00:00

@vijayan121 nice posting. The algorithm requires a large pool compared with the sample, else it can happen that some sample[] element is never populated. It is only guaranteed that the first element is not null, or do I miss something?

arkoenig 340 Practically a Master Poster · Answer 8 · 2011-03-15T21:43:36+00:00

This algorithm looks like it requires enough memory to hold the entire sample--that is, enough to hold enough information to identify which lines are in the sample. That could be as little as one integer per line; those integers would just record the line numbers.

The question is whether this algorithm actually solves the problem. It is hard to answer that question without knowing what the problem is. If, for example, you want to choose a random line from the file a lot of times, and you do not know in advance how many times you are going to have to do it, this algorithm will not solve the problem.

If you know in advance how many lines are in the entire file, you can choose n random elements from the file without even storing any of the elements.

This is another example that shows why there is little point in discussing solutions to a problem until you know for sure what the problem is and what constitutes an acceptable solution.

razvan1 0 Newbie Poster · Answer 9 · 2011-03-15T22:51:31+00:00

@arkoenig let me restate the problem:
(1)Given a file, return a set of k lines, chosen randomly.
(2)The algorithm should choose every line of the file with the same probability.

(1) can be verified either theoretically (prove that the vector sample contains a sample of k elements) and though unit tests.
(2) can be verified theoretically, by computing the probability of picking every line and though a unit test that will show that on a long run, the distribution of chosen elements, even if there are large discrepancies between the length of the lines in the files, is similar to the uniform distribution.

Performance is judged by determining the complexity (memory and processor), as a function of N - length of the file, K - number of the lines in the file, n0 - average file line length, k - number of lines to be read, and so on.

razvan1 0 Newbie Poster · Answer 10 · 2011-03-15T22:55:27+00:00

Without reading the whole file, you cannot select every line with equal probability. One reason is that you have no way of knowing whether the portions of the file you did not read consist of one long line each, or a whole bunch of little tiny lines each.

I disagree. You don't have to read a vector to return a random element.

In my case, the probability of picking a line, of length n1 is:

(n1 / N) * (1 / n1) = 1 / N

the first term is because there are n1 chances out of N to pick the line when choosing the random pointer, while the second term is the length adjustment, it makes the probability independent of particular line length. Hence, the probability of choosing a line is 1/N, which is independent of line length.

zia shaikh 0 Newbie Poster · Answer 11 · 2016-04-29T20:16:15+00:00

i am making a quiz manegment program in c++ so how can i read a random questions which are store in text file