Hi. I am new to the forum and I had a question. I have looked around and haven't really been able to find exactly what I am looking for.

I am looking for the code in C++ to read in a random single line with spaces from a file.

This is the file I have...(example.txt)

the big green
forrest gump
blades of glory
glory road
animal house
pirates of the carribean
the seven little foys
the lord of the rings
the green mile
cast away
star wars
it
now and then
the princess bride

This is the code for a function I have so far, but it just prints out the entire file.

void file()
{
    string line;
    ifstream myfile ("example.txt");
    if (myfile.is_open())
    {
        while (getline (myfile, line))   // loop on getline itself; testing eof() first repeats the last line
        {
            cout << line << endl;
        }
        myfile.close();
    }
    else cout << "Unable to open file";
}

how do i change this so that it reads in just a random single line?

1. Read each line into a string array or std::vector.

2. While doing so, increment a counter that counts the lines.

3. Seed the generator with srand, then use rand to pick a number less than the line count.

4. Use that number as an index into the array/vector of strings and print that element.

Alternatively, you can do something similar without the need for temporary storage (a vector etc.).

I think I understand, but I am new to C++, so could you show me an example of the actual code? It would be greatly appreciated.

I don't have my compiler so this is not tested.

#include <cstdlib>
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <ctime>

class RandomCrap
{
public:
    int show(int p)
    {
        srand( ( unsigned )time( 0 ) );
        int random_integer = ( rand() % p ) + 1;
        return random_integer;
    }
};

int main()
{
    std::ifstream read ("example.txt");

    std::string line;
    std::vector< std::string > stuff;

    int counter = 0;
    while (getline(read, line, '\n'))
    {
        stuff.push_back( line );
        counter++;
    }

    read.close();

    RandomCrap a;
    std::cout << stuff[ a.show( counter - 1 ) ];

    std::cin.get();
    return 0;
}

I think it might always miss the first line, since ( rand() % p ) + 1 can never produce index 0. Meh?
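With the + 1 removed, rand() % size covers every index from 0 to size - 1, so the first line can be picked too. A sketch of that fix (the helper names are mine):

```cpp
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

// Read every line of the file into a vector.
std::vector<std::string> read_lines(const char* path)
{
    std::vector<std::string> lines;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// rand() % size yields 0 .. size-1, so index 0 (the first line) is reachable.
std::string pick_random(const std::vector<std::string>& lines)
{
    return lines[std::rand() % lines.size()];
}
```

Call srand(time(0)) once at program start; pick_random(read_lines("example.txt")) then gives a different line on each run.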

the basic problem here is that we need to select one item at random
when we do not know beforehand how many items there are. the most
common example of this in real code is when a random element has to
be picked from a list (size unknown). there is a well known algorithm to
do this without having to pass through the entire sequence:

string random_line( ifstream file )
{
  string line ;
  int nlines = 0 ;
  while( getline( file, line ) )
     if(  ( rand() % ++nlines  ) == 0 ) break ;
  return line ;
}

the variable nlines counts the number of lines as they are read.
for any line, rand() % ++nlines == 0 is true with probability
1 / nlines. the first line is initially selected (with probability 1),
the second line replaces it with probability 1/2, the third line
replaces the survivor (either line one or line two, each of which is
equally likely) with probability 1/3, and so on. therefore, each of
the nlines lines seen so far ends up selected with probability
1 / nlines.

if i remember right, this was first seen widely in some code
written by brian kernighan.

THERE IS A SERIOUS ERROR IN THE PREVIOUS POST.

there is a well known algorithm to
do this without having to pass through the entire sequence:

string random_line( ifstream file )
{
   string line ;
   int nlines = 0 ;
   while( getline( file, line ) )
      if(  ( rand() % ++nlines  ) == 0 ) break ;
   return line ;
}

IT SHOULD BE
there is a well known algorithm to
do this without having to copy the entire sequence
into a buffer having random access :

string random_line( ifstream file )
{
    string line ;
    string selected_line ; // added
    int nlines = 0 ;
    while( getline( file, line ) )
        if( ( rand() % ++nlines ) == 0 ) // break ;
            selected_line = line ;
    return selected_line ;
}

Both of the previous algorithms are seriously flawed. They never choose the first line, while the second line has a 50% probability of being picked. For every subsequent line, the probability of being picked decreases.

Another idea would be to jump (fseek) to a random offset in the file and read the line there. However, this favors long lines. We can compensate by accepting the line at the random offset only with probability 1/line_length. This means that if the average line length is 100, we will read about 100 lines on average before accepting one. Even though the procedure is statistically correct, it is inefficient, as it reads many more lines than needed.

Nice info, but 4 years too late.

And wrong, too, for at least two reasons that come to mind immediately.

@arkoenig what is wrong, the drafted method, or the critics on existing algorithms? I am looking for a way to do this efficiently.

Both of the previous algorithms are seriously flawed. They never choose the first line, while the second line has a 50% probability of being picked. For every subsequent line, the probability of being picked decreases.

This comment is wrong, because in the last algorithm presented (repeated here for convenience, with the parameter type changed from ifstream to ifstream& because you can't copy an object of type ifstream):

string random_line(ifstream& file)
{
    string line;
    string selected_line;
    int nlines = 0;
    while (getline(file, line))
        if ((rand() % ++nlines) == 0)
            selected_line = line;      
    return selected_line;
}

at the beginning of the first trip through the loop, nlines is zero. ++nlines increments it to 1, which means that we start by computing rand() % 1, which is always zero. This fact means that the first time through the loop, the statement

selected_line = line;

will always be executed. If the values of the random numbers are such that this statement is not executed again, then the algorithm will select the first line.
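To check this empirically, here is a small harness (a sketch; it widens the parameter to istream& so a string stream can stand in for a file) that runs the selection many times and tallies the winners:

```cpp
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>

// The algorithm under discussion, taking any istream so an in-memory
// string stream can stand in for the file.
std::string random_line(std::istream& in)
{
    std::string line, selected_line;
    int nlines = 0;
    while (std::getline(in, line))
        if ((std::rand() % ++nlines) == 0)
            selected_line = line;
    return selected_line;
}

// Run the selection `trials` times over the same text and count
// how often each line wins.
std::map<std::string, int> tally(const std::string& text, int trials)
{
    std::map<std::string, int> counts;
    for (int i = 0; i < trials; ++i) {
        std::istringstream in(text);
        counts[random_line(in)]++;
    }
    return counts;
}
```

With four lines and 100000 trials, each count should land near 25000, the first line's included.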

The second misstatement is more subtle. As Accelerated C++ points out, using rand() % n for a random number between 0 and n-1 gives results that are generally not random, even if rand() itself is perfect.

To see why, assume that rand() returns a 32-bit signed value, i.e. a value between 0 and 2147483647, inclusive, and that you are computing rand() % 190.

There are 2147483648 possible values that rand() can take on, and 2147483648 % 190 is 98. In other words, 190 does not divide 2147483648 exactly. This fact means that the 190 possible values that this algorithm yields cannot possibly occur with equal probability.

Instead of using rand() % n, it would be more accurate to use nrand(n), where nrand is the function defined along the lines of Accelerated C++:

int nrand(int n)
{
    assert(n <= RAND_MAX);

    const int bucket_size = RAND_MAX/n;
    int r;

    do r = rand()/bucket_size;
    while(r >= n);

    return r;
}
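Substituting nrand into the reservoir loop would then look like this (a sketch; the parameter is widened to istream&, and seeding with srand is left to the caller):

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>

// nrand, as above: a uniform integer in [0, n) without modulo bias.
int nrand(int n)
{
    assert(n > 0 && n <= RAND_MAX);
    const int bucket_size = RAND_MAX / n;
    int r;
    do r = std::rand() / bucket_size;
    while (r >= n);
    return r;
}

// The reservoir loop with the unbiased draw substituted in.
std::string random_line(std::istream& in)
{
    std::string line, selected_line;
    int nlines = 0;
    while (std::getline(in, line))
        if (nrand(++nlines) == 0)
            selected_line = line;
    return selected_line;
}
```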

My mistake, I imagined there was a break after matching a random number.

I was looking for a method of picking a random line without reading the whole file.
Your method has a better memory footprint than the naive one, but it still doesn't solve my problem.

The method I was proposing only reads a number of lines proportional to the average line length. Since we cannot assume anything about the distribution of line lengths (whether it is normal, uniform, or even a stratified population of normal distributions), I don't see how reading a larger number of lines could be avoided.

I was looking for a method of picking a random line without reading the whole file.

That's impossible, unless you already know how many lines are in the file.
Because until you reach the end of the file, you don't know how many lines remain unread.

Not necessarily impossible. The file length is easy to determine, so you can jump to a random location within the file and read the line at the cursor (backward content + forward). The problem now is that long lines have a better chance of being picked than short lines. For this reason, a line of length n is accepted only with probability 1/n. This means that if the average line is 100 characters long, the method will on average jump and read about 100 lines before accepting one.

This is a roundabout way of saying that the problem is not impossible to solve if you solve a different problem instead.

And, of course, your suggestion depends on running on an operating system that can determine the length of a file without reading. Not all systems can do it; and the ones that can are not always able to do so on all devices. In particular, you can't determine the length of a "file" that is typed in from a keyboard as your program is running, because the person doing the typing may not even have decided yet how long the file will be.

A problem I see with the solutions suggested so far is that they are inefficient if one calls the function numerous times. So in the long run it will probably be better to read it into a vector and use that vector to extract lines. That way you get rid of the unknown-length problem, and you can even use std::random_shuffle and then read the first element. There is only a one-time cost: you need to read the whole file once. But I think the benefits compared with the suggested solution make it worthwhile.
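That caching idea could be sketched as follows (the class and its names are mine; it uses std::shuffle with a std::mt19937 engine, the modern replacement for std::random_shuffle, which C++17 removed). Shuffling once up front even yields distinct lines on repeated calls:

```cpp
#include <algorithm>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Read the stream once, shuffle the lines, then hand them out one at
// a time: each call to next() returns a distinct random line.
class LineSampler
{
    std::vector<std::string> lines_;
public:
    explicit LineSampler(std::istream& in)
    {
        std::string line;
        while (std::getline(in, line))
            lines_.push_back(line);
        std::mt19937 gen(std::random_device{}());
        std::shuffle(lines_.begin(), lines_.end(), gen);
    }
    bool empty() const { return lines_.empty(); }
    std::string next()              // precondition: !empty()
    {
        std::string line = lines_.back();
        lines_.pop_back();
        return line;
    }
};
```

The one-time read cost is amortized over however many samples you draw afterwards.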

So in the long run it will probably be better to read it into a vector

When I give this as an interview problem, there's a caveat that the file may be very large. Whether I mention it after being asked about file sizes or by rejecting solutions similar to yours depends on the candidate. I have yet to see this question asked where the naive approach (reading everything into a list) counts as a viable solution.

@arkoenig True, a "disk file" is required, not an abstract stream. Most OSs execute fseek and ftell without actually performing reads. As long as I only care about Windows >= XP, Linux 2.6, and Mac OS X >= 10.5, the assumption holds.
@firstperson true for small files. I have a 20Gb file, and want to get 100 random lines.
@Narue I added the question on the interview list too.
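For the 100-lines-from-a-20Gb-file case, the single-line reservoir generalizes to classic reservoir sampling: keep the first k lines, then let each later line replace a random slot with decreasing probability. A sketch (it still reads every line once, but holds only k of them in memory; rand() % is kept for brevity, so the modulo-bias caveat above applies):

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Reservoir sampling (Algorithm R): uniformly sample k lines from a
// stream of unknown length, storing only k lines at any time.
std::vector<std::string> sample_lines(std::istream& in, int k)
{
    std::vector<std::string> reservoir;
    std::string line;
    int nlines = 0;
    while (std::getline(in, line)) {
        ++nlines;
        if (static_cast<int>(reservoir.size()) < k)
            reservoir.push_back(line);      // fill the reservoir first
        else {
            int j = std::rand() % nlines;   // index in [0, nlines)
            if (j < k)
                reservoir[j] = line;        // replace with probability k/nlines
        }
    }
    return reservoir;
}
```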

A question on this algorithm :

string random_line(ifstream& file)
{
    string line;
    string selected_line;
    int nlines = 0;
    while (getline(file, line))
        if ((rand() % ++nlines) == 0)
            selected_line = line;      
    return selected_line;
}

the while loop only exits if getline fails, and getline usually only fails when it hits EOF. So doesn't that mean that the algorithm reads the whole file no matter what?

So doesn't that mean that the algorithm reads the whole file no matter what?

Yes, of course. How do you propose counting the lines in a file without reading it at least once? But in this case only two lines of the file are in memory at any given time, which is quite an advantage over the similar algorithm that stores all the lines in a vector.

The problem doesn't ask how many lines are in the files, it asks to get a random line.

A problem I see with the solutions suggested so far is that they are inefficient if one calls the function numerous times. So in the long run it will probably be better to read it into a vector and use that vector to extract lines.

I don't see how you can estimate how likely such a solution is to be better without knowing something about (a) the circumstances under which the program will be run, or (b) the definition of "better."

As Narue has pointed out, there is no a priori reason to believe that you even have enough memory to store the whole file. Even if you do, there is a different tradeoff that seems at least as attractive at first glance: Read through the file and store a vector for which each element indicates the starting position of a line, stored as a seek offset. You still have to reread the lines that you wind up selecting, but at least you don't have to store the lines that you never select.

This strategy has the additional advantage that instead of storing a seek pointer to the beginning of every line, you can store a seek pointer to the beginning of every nth line, for a suitable value of n. Then you can still get to any line by seeking to the last known spot before the beginning of that line, and you still don't have to read the whole file. You can adjust the space/time tradeoff by changing the value of n.
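The every-nth-line index described above might be sketched like this (build_index and fetch_line are names of my own; error handling is omitted):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Record the stream position of every nth line (0-based); any line
// can then be reached by seeking to the nearest recorded position
// and reading forward at most n - 1 lines.
std::vector<std::streampos> build_index(std::ifstream& file, int n)
{
    std::vector<std::streampos> index;
    std::string line;
    int lineno = 0;
    index.push_back(file.tellg());          // position of line 0
    while (std::getline(file, line))
        if (++lineno % n == 0)
            index.push_back(file.tellg());  // position of line `lineno`
    return index;
}

// Fetch line number `target` (0-based) using the index.
std::string fetch_line(std::ifstream& file,
                       const std::vector<std::streampos>& index,
                       int n, int target)
{
    file.clear();                           // clear EOF from the indexing pass
    file.seekg(index[target / n]);          // jump to the last known spot
    std::string line;
    for (int skip = target % n; std::getline(file, line); --skip)
        if (skip == 0)
            return line;
    return line;                            // target was past end of file
}
```

Storing one streampos per n lines is the space/time dial arkoenig describes: larger n, smaller index, more rereading per fetch.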

I can even imagine strategies that change the value of n dynamically as the file is found to be larger than various size thresholds.

But I don't see the point in trying to design algorithms that meet hypothetical constraints--especially as the OP seems to be willing to accept an algorithm that doesn't even solve the problem correctly as originally stated.

The problem doesn't ask how many lines are in the files, it asks to get a random line.

Usually someone who says "get a random line" implies that each line should have the same probability of being selected as any other line. If you don't care about that requirement, then you'll have to explain what probability distributions you're willing to accept. In particular, the technique you suggested has the property that any line with 100 characters in it is 100 times as likely to be selected as any line with only a single character in it. Such behavior does not normally fall into what people mean when they say "random."

that's the reason i need to throw the dice again: a line obtained by the random-offset method is only accepted when !nrand(strlen(line))

This compensates for the probability of picking long lines but, on the other hand, produces a lot of unneeded reads, especially for files with relatively long lines.
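Put together, the random-offset-plus-rejection method might look like this (a sketch of the idea only; the function name is mine, and it assumes a seekable file with at least one non-empty line; blank lines are skipped):

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Jump to a random byte, back up to the start of the line containing
// it, and accept that line with probability 1/length so long lines
// are not over-represented.
std::string random_line_by_offset(std::ifstream& file, long file_size)
{
    for (;;) {
        long offset = std::rand() % file_size;   // uniform random byte
        file.clear();
        file.seekg(offset);
        // walk back to the start of the line containing `offset`
        while (file.tellg() > 0) {
            file.seekg(-1, std::ios::cur);
            if (file.peek() == '\n') {
                file.get();                      // step past the newline
                break;
            }
        }
        std::string line;
        if (!std::getline(file, line) || line.empty())
            continue;                            // blank line: roll again
        // the line was hit with probability proportional to its length,
        // so accept it with probability 1/length to even things out
        if (std::rand() % static_cast<long>(line.size()) == 0)
            return line;
    }
}
```

As discussed above, the expected number of iterations is roughly the average line length, independent of the file size.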

I don't see any point in talking about algorithm details until you explain what probability distribution you're willing to accept.

It is very easy to estimate the performance of a solution; that is what complexity classes are for. The naive solution scores a proud Theta(N) in both computation and memory, with N being the length of the file; the algorithm you support needs the same amount of computation but Theta(1) memory (less is better), while mine requires Theta(n^2) computation and Theta(1) memory, where n is the average line length. If n^2 is larger than N, the total length of the file, your method is obviously better. On a very large file, mine scores better.

It is very easy to estimate the performance of a solution, it is called complexity class.

It is impossible to estimate the performance of a solution without knowing whether a particular program actually is a solution.

It is reasonable to claim that an algorithm that selects every line with the same probability is a solution. However, an algorithm that seeks to a random offset in the file, and then extracts a line near that offset, generally does not select every line with the same probability. So the question then becomes whether that selection is reasonable, and why.

So far, you have said that the selection is reasonable, but you haven't said why. That is, you haven't said what criteria you are using to distinguish reasonable results from unreasonable results. Until you do that, there's no point in talking about the performance of individual programs, because there's no way to know whether those programs even produce correct results.
