954,535 Members — Technology Publication meets Social Media
Username:
Password:
Lost login information?
Have something to say? Contribute New Article Reply to this Article

Reading in a random line from a file

Hi. I am new to the forum and I had a question. I have looked around and haven't really been able to find exactly what I am looking for.

I am looking for the code in C++ to read in a random single line with spaces from a file.

This is the file I have...(example.txt)

the big green
forrest gump
blades of glory
glory road
animal house
pirates of the carribean
the seven little foys
the lord of the rings
the green mile
cast away
star wars
it
now and then
the princess bride



This is the code for a function I have so far, but it just prints out the entire file.

void file(){
string line;
ifstream myfile ("example.txt");
if (myfile.is_open())
{
while (! myfile.eof())
{
getline (myfile,line);
cout << line << endl;
}
myfile.close();
}
else cout << "Unable to open file";
}


how do i change this so that it reads in just a random single line?

kylcrow
Junior Poster
124 posts since Apr 2007
Reputation Points: 11
Solved Threads: 2
 

just count the lines as they are read and stop when it has read the line you need. Toss all lines but the last one away.

Ancient Dragon
Retired & Loving It
Team Colleague
30,050 posts since Aug 2005
Reputation Points: 5,662
Solved Threads: 2,343
 

1. Read in each line into a string array or std::vector.

2. Whilst at the same time incrementing a counter, which counts each line.

3. Using srand pick a number which is less than the max line count.

4. Use that number as the index to the array of strings/ vector and output to the screen.

Alternatively you can do something similar without the need for a temporary storage device (vector etc).

iamthwee
Posting Expert
5,950 posts since Aug 2005
Reputation Points: 1,543
Solved Threads: 439
 

I think I understand, but I am new to C++ so could you show me an example of the actual code? It was be greatly appreciated.

kylcrow
Junior Poster
124 posts since Apr 2007
Reputation Points: 11
Solved Threads: 2
 

Well the first thing you need to do is figure out how to generate random numbers.

iamthwee
Posting Expert
5,950 posts since Aug 2005
Reputation Points: 1,543
Solved Threads: 439
 

I don't have my compiler so this is not tested.

#include <cstdlib> 
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <ctime> 

class RandomCrap
{
public:
   int show(int p)
   {
      srand( ( unsigned )time( 0 ) ); 
      int random_integer; 
    
      random_integer = ( rand ( ) % p ) + 1;
      return random_integer;
    }
};

int main() 
{ 
    std::ifstream read ("example.txt");
    
    std::string line;
    std::vector < std::string > stuff;
    
    int counter = 0;
    while (getline(read, line, '\n'))
    {
      stuff.push_back( line );
      counter++;
    }
    

    read.close();

    RandomCrap a;
    std::cout << stuff[a.show( counter - 1 )];
    
    std::cin.get(); 
    return 0;
}


I think it might always miss the first line. Meh?

iamthwee
Posting Expert
5,950 posts since Aug 2005
Reputation Points: 1,543
Solved Threads: 439
 

the basic problem here is that we need to select one item at random
when we do not know beforehand how many items there are. the most
common example of this in real code is when a random element has to
be picked from a list (size unknown). there is a well known algorithm to
do this without having to pass through the entire sequence:

string random_line( ifstream file )
{
  string line ;
  int nlines = 0 ;
  while( getline( file, line ) )
     if(  ( rand() % ++nlines  ) == 0 ) break ;
  return line ;
}



the variablenlines counts the number of lines as they are read.
for any line, rand() % ++nlines == 0 is true with probability
1 / nlines. the first line is initially selected (with probability 1),
the second line replaces it with probability 1/2, the third line will
replace the survivor (either line one or line 2, each of which are
equally likely) with probability 1/3, and so on.therefore, for each
of the nlines lines seen so far is selected with probability
1/nlines.

if i remember right, this was first seen widely in some code
written by brian kernighan.

vijayan121
Posting Virtuoso
1,606 posts since Dec 2006
Reputation Points: 1,159
Solved Threads: 287
 

THERE IS A SERIOUS ERROR IN THE PREVIOUS POST.
there is a well known algorithm to
do this without having to pass through the entire sequence:

string random_l[ine( ifstream file )
{
   string line ;
   int nlines = 0 ;
   while( getline( file, line ) )
      if(  ( rand() % ++nlines  ) == 0 ) break ;
   return line ;
}


IT SHOULD BE
there is a well known algorithm to
do this without having to copy the entire sequence
into a buffer having random access :

string random_line( ifstream file )
{
      string line ;
      string selected_line ; // added
      int nlines = 0 ;
      while( getline( file, line ) )
         if(  ( rand() % ++nlines  ) == 0 ) // break ;
                  selected_line = line ;      
    return selected_line ;
}
vijayan121
Posting Virtuoso
1,606 posts since Dec 2006
Reputation Points: 1,159
Solved Threads: 287
 

Both of the previous algorithms are seriously flawed. They never choose the first line, while the second line has a 50% probability of being picked up. For any subsequent line, the probability of being picked up is decreasing.

Another idea would be to jump in the file (fseek) at a random offset and read the current line. However, this would favorize the large lines. We can pick a line pointed by the random offset with the probability of 1/string_length. This means, if the average length of a line is 100, we will read on average about 100 lines until picking one up. Even if the procedure is statistically correct, it is inefficient, as it reads much more lines than needed.

razvan1
Newbie Poster
14 posts since Mar 2011
Reputation Points: 10
Solved Threads: 0
 

Nice info, but 4 years too late.

Ancient Dragon
Retired & Loving It
Team Colleague
30,050 posts since Aug 2005
Reputation Points: 5,662
Solved Threads: 2,343
 
Nice info, but 4 years too late.


And wrong, too, for at least two reasons that come to mind immediately.

arkoenig
Master Poster
703 posts since Jun 2010
Reputation Points: 359
Solved Threads: 109
 

@arkoenig what is wrong, the drafted method, or the critics on existing algorithms? I am looking for a way to do this efficiently.

razvan1
Newbie Poster
14 posts since Mar 2011
Reputation Points: 10
Solved Threads: 0
 
Both of the previous algorithms are seriously flawed. They never choose the first line, while the second line has a 50% probability of being picked up. For any subsequent line, the probability of being picked up is decreasing.


This comment is wrong, because in the last algorithm presented (repeated here for convenience, with the parameter type changed from ifstream to ifstream& because you can't copy an object of type ifstream):

string random_line(ifstream& file)
{
    string line;
    string selected_line;
    int nlines = 0;
    while (getlinefile, line))
        if ((rand() % ++nlines) == 0)
            selected_line = line;      
    return selected_line;
}

at the beginning of the first trip through the loop, nlines is zero. ++nlines increments it to 1, which means that we start by computing rand() % 1, which is always zero. This fact means that the first time through the loop, the statement

selected_line = line;

will always be executed. If the values of the random numbers are such that this statement is not executed again, then the algorithm will select the first line.

The second misstatement is more subtle. AsAccelerated C++ points out, using rand() % n for a random number between 0 and n-1 gives results that are generally not random, even if rand() itself is perfect.

To see why, assume that rand() returns a 32-bit signed value, i.e. a value between 0 and 2147483647, inclusive, and that you are computing rand() % 190.

There are 2147483648 possible values that rand() can take on, and 2147483648 % 190 is 98. In other words, 190 does not divide 2147483648 exactly. This fact means that the 190 possible values that this algorithm yields cannot possibly occur with equal probability.

Instead of using rand() % n, it would be more accurate to use nrand(n), where nrand is the function defined along the lines of Accelerated C++:

int nrand(int n)
{
    assert(n <= RAND_MAX);

    const int bucket_size = RAND_MAX/n;
    int r;

    do r = rand()/bucket_size;
    while(r >= n);

    return r;
}
arkoenig
Master Poster
703 posts since Jun 2010
Reputation Points: 359
Solved Threads: 109
 

My mistake, I imagined there is a break after matching a random number.

I was looking for a method of picking a random line without reading the whole file.
Your method has a better memory footprint that the naive one, but was still not solving my problem.

The method I was proposing only reads an amount of lines proportional with the line length. Since we cannot assume anything about the lines length (if they are known to be a normal or a uniform distribution, or even a stratified population of normal distributions) I don't see how could be avoided the read of a larger number of lines.

razvan1
Newbie Poster
14 posts since Mar 2011
Reputation Points: 10
Solved Threads: 0
 
I was looking for a method of picking a random line without reading the whole file.


That's impossible, unless you already know how many lines are in the file.
Because until you reach the end of the file, you don't know how many lines remain unread.

arkoenig
Master Poster
703 posts since Jun 2010
Reputation Points: 359
Solved Threads: 109
 

Not necessary impossible. The file length is easy to be determined, so you jump at a random location within the file. Read the line at the cursor (backward content + forward). The problem now is that the long lines have a better chance of being picked as the short lines. For this reason, the line of length n is only picked with a chance of 1:n. This means, if the average line is 100 characters long, the method will jump on average 100 times and read a line until picking one.

razvan1
Newbie Poster
14 posts since Mar 2011
Reputation Points: 10
Solved Threads: 0
 

This is a roundabout way of saying that the problem is not impossible to solve if you solve a different problem instead.

And, of course, your suggestion depends on running on an operating system that can determine the length of a file without reading. Not all systems can do it; and the ones that can are not always able to do so on all devices. In particular, you can't determine the length of a "file" that is typed in from a keyboard as your program is running, because the person doing the typing may not even have decided yet how long the file will be.

arkoenig
Master Poster
703 posts since Jun 2010
Reputation Points: 359
Solved Threads: 109
 

A problem I see by the suggested solutions so far that is, it is inefficient if one calls the function numerous times. So in the long run it will probably be better to read it into a vector and use that vector
to extract lines. That way you get rid of the unknown length problem and you can even use std::random_shuffle to read a random first element. There would only be a one time pay, that is you need to read the whole file once. But I think the benefits one gets from this versus the suggested solution is more.

firstPerson
Senior Poster
3,923 posts since Dec 2008
Reputation Points: 841
Solved Threads: 608
 
So in the long run it will probably be better to read it into a vector


When I give this as an interview problem, there's a caveat that the file may be very large. Whether I mention it after being asked about file sizes or by rejecting solutions similar to yours depends on the candidate. I have yet to see such this question asked where the naive approach (reading everything into a list) counts as a viable solution.

Narue
Bad Cop
Administrator
15,460 posts since Sep 2004
Reputation Points: 6,464
Solved Threads: 1,401
 

@arkoenig True, a "disk file" is required, not an abstract stream. Most of the OSs execute fseek and ftell without actually performing reads. So long I only care for Windows >= XP,Linux 2.6 and MacOSX >= 10.5, the assumption holds.
@firstperson true for small files. I have a 20Gb file, and want to get 100 random lines.
@Narue I added the question on the interview list too.

razvan1
Newbie Poster
14 posts since Mar 2011
Reputation Points: 10
Solved Threads: 0
 

This article has been dead for over three months

Post: Markdown Syntax: Formatting Help
You