Hello,
In trying to port some Java code to C++ I'm disliking the std::stream's lack of copy constructors. I know exactly what the problem is, but I don't know what I can do about it. I have a dangling reference to a stream as a member variable of a class. My full code is very large but hopefully this will work to explain my dilema: (comments in caps indicate the problem areas, lowercase is extra information)

class PushbackReader{//class allows unread operations and limits read/unread to single char
protected:
    istream & myreader;//SHOULD THIS BE A POINTER?
    //other stuff for pushback buffer
public:
    PushbackReader(string filename){myreader=ifstream(filename)/*rest*/}//CTOR MAKES FILE
    PushbackReader(istream & reader) : myreader(reader){/*rest*/}//HOW TO PASS OTHER
    //other things
    int get(){
        if(false/*check buffer*/) return -1;//return buffered char
        else {
            int c = myreader.get();
            if(myreader.good()) return c;
            else return -1;
        }
    }
};

class Tokenizer{//uses reader
protected:
    shared_ptr<PushbackReader> reader;
public:
    Tokenizer(string totokenize){
        reader=shared_ptr<PushbackReader>(new PushbackReader(istringstream(totokenize)));
        //THIS IS MY PROBLEM THE REFERENCE THE READER HAS TO THE STREAM
        //ENDS UP A DANGLING POINTER BECAUSE IT GOES OUT OF SCOPE
        //SHOULD I MAKE IT A FULL POINTER? OR IS THERE A BETTER WAY?
    }
    string /*really Token class*/ nextToken(){
        char c = reader->get();//THIS CALL IS ACCESS VIOLATION
        //use c somehow
        return string(c,1);
    }
};
int main(){
    string s="string string string string";
    Tokenizer tokenizer(s);
    cout<<tokenizer.nextToken();
}

My problem is that the reference that the PushbackReader has goes out of scope after the constructor call, so I want to know that the best way to accomplish this goal is. Thanks in advance.

That is a common situation and there are two common solutions to it. But first, it is important to mention to semantics of this. When you give a non-const reference to a constructor, there is a conventional understanding that the reference must remain valid for as long as the object exists, and this is what allows it to be semantically correct to do what you are trying to do. Additionally, you should document that fact in the (doxygen) comments.

Now for the solution. The problem is essentially that the object needs to hold a pointer to an object which it may or may not have ownership of (ownership == responsibility to delete it) (see this tutorial for a more thorough explanation of ownership relations in C++). This duality implies the need for a flag to tell, upon construction, whether the object has ownership or not of the pointee. And thus, a simple solution arises:

class PushbackReader {
  protected:
    std::istream* myreader;
    bool has_myreader_ownership;
    //other stuff for pushback buffer
  public:
    explicit PushbackReader(const std::string& filename) : 
                            myreader( new std::ifstream( filename.c_str() ) ),
                            has_myreader_ownership(true) {
      /* rest */
    };
    explicit PushbackReader(std::istream& reader) : 
                            myreader(&reader),
                            has_myreader_ownership(false) {
      /*rest*/
    };

    ~PushbackReader() {
      if(has_myreader_ownership)
        delete myreader;
    };

    // the class must be made non-copyable (C++11):
    PushbackReader(const PushbackReader&) = delete;
    PushbackReader& operator=(const PushbackReader&) = delete;

    // but you can still make it moveable (C++11):
    PushbackReader(PushbackReader&& rhs) {
      if(rhs.has_myreader_ownership) {  // either create a new myreader:
        myreader = new std::ifstream( 
          std::move( static_cast< std::ifstream& >( *(rhs.myreader) ) ) );
        has_myreader_ownership = true;
      } else {                          // or copy the pointer:
        myreader = rhs.myreader;
        has_myreader_ownership = false;
      };
      /* rest */
    };
    PushbackReader& operator=(PushbackReader&& rhs) { 
      // simply swap and leave 'rhs' with the old 'this' object: 
      std::swap( myreader, rhs.myreader );
      std::swap( has_myreader_ownership, rhs.has_myreader_ownership );
      /* rest */
      return *this;
    };

    // OR (C++03), you can just make it non-copyable with this:
  private:
    PushbackReader(const PushbackReader&);
    PushbackReader& operator=(const PushbackReader&);

  public:
    //other things
};

Notice, in the above, that I added the explicit keyword to the single-argument constructors to avoid implicit conversions (from istream to PushbackReader, or from string to PushbackReader). Also, notice that I added full qualifications with std:: for the standard library components, because you declare a class in a header file and you should never use a statement like using namespace std; inside a header file.

With the above solution, you get a fully usable class that has very little overhead in terms of bookkeeping for the myreader pointer. However, the best you can do, while remaining safe, is to make the class non-copyable (but moveable, using C++11). This may not be great for you, a slightly heavier but simpler alternative follows.

The second way to implement this is to use the std::shared_ptr as a means to achieve two things: store the ownership vs. no ownership flag, and enable copying. You pay a price in the heavier bookkeeping that std::shared_ptr does, compared to the previous solution (raw pointer + bool). But you also get a simpler implementation. First, you need a no-op deleter, which is a simple functor that can "not delete" a pointer (there is also one such no-op deleter in the Boost.Serialization library, and I also have one, both are called null_deleter):

struct null_deleter {
  void operator()(void const *) const {}
};

This can then be given to a shared-pointer upon construction and it will be used when comes the time to delete the pointer it holds, at which point, nothing will happen, and that is what you want when you don't have ownership on the pointer. And thus, you get the following implementation:

class PushbackReader {
  protected:
    std::shared_ptr< std::istream > myreader;
    //other stuff for pushback buffer
  public:

    explicit PushbackReader(const std::string& filename) : 
                            myreader( new std::ifstream( filename.c_str() ) ) {
      /* rest */
    };

    explicit PushbackReader(std::istream& reader) : 
                            myreader(&reader, null_deleter() ) {
      /*rest*/
    };

    //other things
};

And that's it. Now, you get a copyable class (with shared ownership on the stream), and the above implementation is all you need, the rest (copy-constructor, destructor, copy-assignment operator, move-constructor, and move-assignment operator) are all correct in their default form.

Et voilĂ !

proper way to pass a temporary stream...
I'm disliking the std::stream's lack of copy constructors.
I know exactly what the problem is, but I don't know what I can do about it

The proper way is to move the temporary stream. Stream classes in C++ are moveable, but not copyable.

Some types are not amenable to copy semantics but can still be made movable. For example:

  • fstream
  • unique_ptr (non-shared, non-copyable ownership)
  • A type representing a thread of execution

By making such types movable (though still non-copyable) their utility is tremendously increased. Movable but non-copyable types can be returned by value from factory functions:

...
ifstream find_and_open_data_file(/* ... /);
...
ifstream data_file = find_and_open_data_file(/
... */); // No copies!
...

In the above example, the underlying file handle is passed from object to object, as long as the source ifstream is an rvalue. At all times, there is still only one underlying file handle, and only one ifstream owns it at a time.
.
Movable but non-copyable types can also safely be put into standard containers. If the container needs to "copy" an element internally (e.g. vector reallocation) it will move the element instead of copying it.
...
- http://www.artima.com/cppsource/rvalue.html

A simple example:

#include <iostream>
#include <fstream>
#include <string>
#include <utility>

struct push_back_reader
{
    // stm referes to an input file stream opened (and owned) by this object
    push_back_reader( const std::string& path2file ) 
                    : owned_stm(path2file), stm(owned_stm) {}

    // stm just holds a reference to an input strem owned by some other component
    push_back_reader( std::istream& lvr ) : stm(lvr) {} // owned_stm is not used

    // this object takes over the ownership of the input file stream
    push_back_reader( std::ifstream&& rvr ) /* noexcept */ 
                    : owned_stm( std::move(rvr) ), stm(owned_stm) {}

    ~push_back_reader() /* noexcept */ ;
    // copy constructor, copy assignment operator...

    private:
        std::ifstream owned_stm ;
        std::istream& stm ;
};

int main()
{
    // push_back_reader( const std::string& ) - one creates (and owns) the stream
    push_back_reader one( "afile.txt" ) ;

    // push_back_reader( std::istream& ) ; - two does not own the stream, 
    // it just holds a reference
    push_back_reader two( std::cin ) ;

    // push_back_reader( std::istream&& ) ; - three takes over ownership of the stream
    push_back_reader three( std::ifstream( "another_file.txt" ) ) ;


    std::ifstream temp_stream( "yet_another_file.txt" ) ;

    // push_back_reader( std::istream&& ) ; - four takes over ownership of the stream
    push_back_reader four( std::move(temp_stream) ) ;

    // do not use temp_stream afterthis
}

This requires conformance from the compiler and the library implementation. For instance the code will work with clang 3.1 or later with libc++, Visual Studio 2010 or later; but not with GCC and it's libstdc++.

With a non-conforming implementation, a work-around would have to be kludged. Probably the simplest would be to simuate move semantics by swapping the two stream buffers (with a tacit assumption that both use the same codecvt).

Edited 3 Years Ago by vijayan121

Thanks for the replies.

Notice, in the above, that I added the explicit keyword to the single-argument constructors to avoid implicit conversions (from istream to PushbackReader, or from string to PushbackReader). Also, notice that I added full qualifications with std:: for the standard library components, because you declare a class in a header file and you should never use a statement like using namespace std; inside a header file.

True, in my codebase I already have the std:: qualifiers, but I forgot the explicit.
The first example seems rather involved.

And that's it. Now, you get a copyable class (with shared ownership on the stream), and the above implementation is all you need, the rest (copy-constructor, destructor, copy-assignment operator, move-constructor, and move-assignment operator) are all correct in their default form.

With this example the issue of scope is still relevant, when the passed reference goes out of scope it will be destructed, even though there is now a shared_ptr to it correct?

I don't intend to allow copying or or assignment, would a unique_ptr as an argument be a better choice for this instance?

This requires conformance from the compiler and the library implementation. For instance the code will work with clang 3.1 or later with libc++, Visual Studio 2010 or later; but not with GCC and it's libstdc++.

With a non-conforming implementation, a work-around would have to be kludged. Probably the simplest would be to simuate move semantics by swapping the two stream buffers (with a tacit assumption that both use the same codecvt).

This seems simple and to fix the temporary variable problem, and I am using VS2010, but I'm building a library, so not supporting a popular compiler like GCC seems sad. I think I'll try unique_ptr first.

Could you get away by just having a stringstream object in your tokenizer class? What other functionality does PushbackReader offer?

With this example the issue of scope is still relevant, when the passed reference goes out of scope it will be destructed, even though there is now a shared_ptr to it correct?

Yes, that's true. That's the reason why I mentioned the semantics issue in my previous post by saying:

When you give a non-const reference to a constructor, there is a conventional understanding that the reference must remain valid for as long as the object exists, and this is what allows it to be semantically correct to do what you are trying to do. Additionally, you should document that fact in the (doxygen) comments.

because any acute reader of your question would have pointed that problem out right away, and I was defusing that by pointing out that this is a case of borrowed ownership and is thus OK from a coding practices point of view.

I think you need to wrap your head around this concept of value-semantics and automatic storage (or scope-bound objects, or deterministic life-times) that is central to coding practices in C++. This is very different from coding practices in pure OOP languages. In Java/C# and others, objects are all created in freestore and interconnected as an object graph, and a garbage collector works in the background to clean up anything that has become disconnected from the rest (i.e., not needed anymore). In C++, the primary mechanism for creating objects is having them come into scope (in the call stack) and having them go out of scope (leaving a stack frame), this is called a deterministic model (or automatic storage), which naturally leads to a stack-like (or tree-like) structure for inter-connected objects, and such a structure requires no bookkeeping beyond that of the call-stack. It is certainly possible to create more general object graphs (as needed in traditional OOP) but that would generally require shared_ptr and weak_ptr to do the additional bookkeeping required for those complex ownership relations, but it should only be used when it's needed, which is far less than you might think, especially if you have been "spoiled" by a pure OOP language that imposes this heavy bookkeeping as the one-size-fits-all solution.

So, that specific problem with the passed reference going out of scope is really only a problem if you program with the mindset that anything can be created or destroyed at anytime, which is true (i.e., possible) but completely outside the normal practices in C++. Here, the assumption can be that whoever creates the object of class PushbackReader, has also created a stream or has access to one provided from somewhere, and thus, the stream will have a longer life-time than the PushbackReader object. In other words, these would be the prototypical uses of your PushbackReader class:

void some_function(std::istream& someStream) {

  // some code...

  PushbackReader reader(someStream);  // create a reader object on stack.
  reader.get(/* .. */);
  // ...

  // some more code...

}; // the reader goes out of scope here, while the stream reference is still valid.


void some_function() {
  // some code...
  std::ifstream infile("some_filename.txt");  // open a file stream
  // some code...
  PushbackReader reader(infile);  // create a reader object on stack.
  reader.get(/* .. */);
  // ...

  // some more code...

}; // the reader goes out of scope here, while the stream reference is still valid.
   // then, the stream goes out of scope and closes the file.

My original point was that if a class takes a non-const reference in its constructor, it is implied that one would use the class in a similar way as above, or, at least, would be aware that one is responsible for ensuring that the stream reference remains valid for as long as the PushbackReader class exists. This is a pretty fundamental notion in C++ and one that a minimally competent programmer is sure to observe in his daily coding. And thus, the problem that you pointed out is not a real problem in practice. Of course, it's still safer to document this fact in the (doxygen) comments, or even avoid doing this if possible (see below).

This seems simple and to fix the temporary variable problem, and I am using VS2010, but I'm building a library, so not supporting a popular compiler like GCC seems sad. I think I'll try unique_ptr first.

I would guess that vijayan121 is referring to the streams not yet being moveable in the libstdc++ library (something I haven't verified). But this is only required in order to make your class moveable too (or grab a stream rvalue-reference in the constructor, which is a rather useless feature in my opinion). So, that's not a big deal I would say.

Could you get away by just having a stringstream object in your tokenizer class? What other functionality does PushbackReader offer?

I don't know what functionality PushbackReader is intended to have, and I also suspect that this functionality is widely covered already by some clever uses of STL algorithms (including back-inserters) and STL streams (as well as Boost.Tokenizer and Boost.Regex). In any case, the idea of using an internal stringstream is actually a much nicer solution, if it is appropriate for whatever purpose this is for.

For example, you could have a class like this:

class PushbackReader {
  private:
    std::stringstream myreader;
    //other stuff for pushback buffer
  public:
    PushbackReader() { 
      /* rest */
    };

    void loadFromFile(const std::string& filename) {
      std::ifstream infile(filename.c_str());
      myreader << infile.rdbuf();
    };

    void loadFromStream(std::istream& reader) {
      myreader << infile.rdbuf();
    };

    explicit PushbackReader(const std::string& filename) {
      loadFromFile(filename);
      /* rest */
    };
    explicit PushbackReader(std::istream& reader) {
      loadFromStream(reader);
      /*rest*/
    };

    //other things
};

And with the above, you don't have to worry about anything, ownership-wise. Of course, this might not be appropriate depending on the nature of the input streams (e.g., user input, network stream, serial / USB port, etc.), but this will work nicely for moderately sized files or in-memory streams.

Thanks for the further explanations, and I like the internal stringstream idea, but I have been unable to find a method within standard streams that allow an unread(char) operation. That is the principal reason for the PushbackReader, it's a class in the java.io package of the standard, but I'm not able to find something similar in C++. The full class declaration looks like this, I have changed it to use unique_ptr so now it works as intended but isn't using best practice, I suppose:

    class PushbackReader{
    private:
        std::unique_ptr<std::istream> reader;
        char* buf;
        int pos;
        int bsize;
    public:
        PushbackReader(std::unique_ptr<std::istream> reader, int pushbacksize);
        int read();
        void unread(int ch);
        ~PushbackReader();
    };

The typical use inside tokenizer looks something like this, which is in a TokenizerState subclass (typical state pattern):

Token WordState::nextToken(shared_ptr<PushbackReader>r, int c, Tokenizer& t){
//reader should possibly be a reference, c is the char that the tokenizer read
//to determine which state to call, tokenizer is *this for callback on ignored
//states like whitespace and comments
    int i = 0;
    do {
        checkBufLength(i);
        charbuf[i++]=(char)c;
        c=r->read();
    }while(wordChar(c));
    if(c>=0) r->unread(c);//unread the character so the next state can have it when it calls r->read()
    string sval(charbuf,i);
    return Token(Token::TT_WORD,sval,0);
}

unread is also used to allow multi-character symbols such as '==' '<=' and for valid symbols to contain non-valid symbols such as =**= when no =* or =** symbols are valid, if the tokenizer encounters =**; it unreads ;,*, and * and returns an equals symbol as the next token. I did not find anything similar, did I miss something, is there similar funtionality in the standard library?

Edited 3 Years Ago by sciwizeh: additional clarification, typo in first edit

did I miss something, is there similar funtionality in the standard library?

Well, there is a peek() function. Also, you can always do seeking operations to move the position of the get pointer on the stream. And you should really check out the Boost.Tokenizer library, not to mention other important parser libraries like Boost.Regex and Boost.Spirit.

Thanks for the continued help.

peek() wouldn't allow multi-character backtracking (no?), seeking would, can it be done relative to current poristion?

I would rather not use boost, I am porting this code as an excersize to learn how the parser works (primary goal) and to practice my C++ which as is probably obvious, I need to do. With that in mind, I want to use as few helper libraries as possible (excluding STL). As you have probably guessed, I do come from a Java background, so memory management seems like a real pain to me, and I'm not used to worrying about it. Wrapping my head around the idea that it's the callers job to keep something in scope is a bit hard, as memory managment takes care of that in Java.

My plan is to get the whole codebase working in C++, then because I want to learn python to port it to python as well.

The framework I'm trying to port it is from the book Building Parsers with Java and the source code is on its website.

peek() wouldn't allow multi-character backtracking (no?),

No, the peek function allows you to see the next character without actually extracting it from the stream. What you have described as your tokenizing tasks can easily be implemented using that. For example, the code from your last post could be implemented as follows:

Token WordState::nextToken(std::istream& s, int c, Tokenizer& t) {
    string sval;
    while( s && wordChar( s.peek() ) )
        sval.push_back( char(s.get()) );
    return Token(Token::TT_WORD, sval, 0);
};

seeking would, can it be done relative to current poristion?

Yes, you can do it as follows:

s.seekg(-5, std::ios_base::cur);  // move back by 5 chars from current position.

Edited 3 Years Ago by mike_2000_17

No, the peek function allows you to see the next character without actually extracting it from the stream. What you have described as your tokenizing tasks can easily be implemented using that. For example, the code from your last post could be implemented as follows:

That may be true, but I didn't show the code for the SymbolState which recurses through allowable symbols (psuedo-code):

class symbolnode
    char thischar
    symbolnode parent
    list<symbolnode> children
    bool valid
    symbolnode deepestread(PushbackReader r)
        char c = r.read()
        symbolnode n = findchildinchildrenwithchar(c)
        if(not exists n)
            r.unread(c)
            return this
        return n.deepestread(r)
    symbolnode unreadtovalid(PushbackReader r)
        if(valid) return this
        r.unread(mychar)
        return parent.unreadtovalid(r)
    string ansestry()
        if(exists parent) return parent.ancestry() concat mychar
        return mychar as string
string nextSymbol(PushbackReader r, char first)
    symbolnode n1 = ROOT.findchildinchildrenwithchar first
    symbolnode n2 = n1.deepestread(r)
    symbolnode n3 = n2.unreadtovalid(r)
    return n3.ancestry()


build nodetree with root having all first symbol characters, 
only leaves have valid set to true
all nodes have the character they match as mychar

the previous code works with the reader I have but I do not believe peek() would be enough to allow the repeated un-reading. Slightly hard to follow but it does work as expected in its current state. The unreadtovalid method doesn't follow the peek pattern of read()->dostuf->unread(), it makes read()->dostuff->read()->dostuff->unread()->unread() possible, which a peek() based solution would not support.

I've ommitted some helper functions for building the symbol tree and the real version uses a subclass as the root to change how it searches for symbols in the children, but it would work without the change and be much less efficient (everything but alpha-numeric characters is a child of the root, and most symbols are single character so the rest don't need the speed boost)

The root is also the only node that has a full nextsymbol() method, it isn't really global.

The seek function seems as though it could be a better solution, and the current position will persist?

I will probably stick with the code I currently have working, because don't fix something that isn't broken are good words to live by, and because it's the way the original library does it, but I want to learn what the accepted way to do it would be.

Edited 3 Years Ago by sciwizeh: additional clarification

With the way you describe it now, it seems that it might be more appropriate to use a general container of chars instead of a stream. Streams are mostly intended to be used for sequential operations, not so much for random-access operations (or somewhat random-access). If you are not going to use some of the nice features of the streams, like formatted input-outputs and the buffering, there is no need to use it, and it will just be a handicap. So, it might be easier to just dump the content of the stream into vector of chars, as so:

std::vector<char> charVector(std::istream_iterator<char>(inStream), std::istream_iterator<char>());

And then, you can just use iterator ranges and STL algorithms to do you tokenizing work.

That may be a better way, however the access isn't as random as I may have made it sound, the general process is very linear, and there can be a maximum of MAX(symbol length -1) un-reads per nextToken() call. And the position will always advance the position by at least one.

Your help is appreciated, and as I've managed to get the code working I suppose this problem is solved, though not in any of the better ways suggested, and I should mark it as such, although I'm learning with the continuing discussion.

Edited 3 Years Ago by sciwizeh

This seems simple and to fix the temporary variable problem, and I am using VS2010, but I'm building a library, so not supporting a popular compiler like GCC seems sad.

Yes. GCC is a mainstream compiler, and are several situations where it is indispensable. It certainly shouldn't be ignored while writing a portable library.

However, if you are prepared to wait for some time, kludging work-arounds for GCC would not be needed. The GCC developers are aware of this issue, it is in their bugzilla, and libstdc++ streams will be moveable, as they should be, in some future release.

I would guess that vijayan121 is referring to the streams not yet being moveable in the libstdc++ library (something I haven't verified). But this is only required in order to make your class moveable too

This is required for a lot of reasons, one of them being, if required, making class holding a stream object moveable.

(or grab a stream rvalue-reference in the constructor, which is a rather useless feature in my opinion).

Grab a stream rvalue-reference reference? The move constructor does more than just grab a reference. And the move-constructor is the mechanism; it is not the goal. There are many reasons why the C++ community came to the conclusion that making streams movable would not be a useless feature:

Hinnant, Stroustrup, and Kozicki have one example of their utility in the article quoted earlier - writing factory functions:

By making such types movable (though still non-copyable) their utility is tremendously increased. Movable but non-copyable types can be returned by value from factory functions:
std::ifstream find_and_open_data_file( /* ... / ) ;
// ...
std::ifstream data_file = find_and_open_data_file( /
... */ ) ; // No copies!

Here is an example where move semantics are indispensable:

void my_thread_fun( std::ofstream&& send_output_here ) ;
// ...
// and then:
std::ofstream thread_output( "thread_output.txt" ) ;
thread_output << std::boolalpha << std::fixed << std::setprecision(2) ;
std::thread my_thread( my_thread_fun, std::move(thread_output) ) ;

And another, where it enables the writing of simple, transparent code.

std::map< std::string, std::fstream > my_open_files ;
// ...
// and then:
my_open_files.emplace( "log file", std::fstream( "mylog.txt", std::ios::app ) ) ;

Edited 3 Years Ago by vijayan121

This seems simple and to fix the temporary variable problem, and I am using VS2010, but I'm building a library, so not supporting a popular compiler like GCC seems sad.

I've some good news.

Was playing around with the GCC 4.8 snapshot release, and libstdc++ streams are now moveable.
Just took a cursory look; from what I saw, it seemed to work flawlessly.

(The GCC 4.8.0 trunk is in the release branch now.)

Edited 3 Years Ago by vijayan121

This question has already been answered. Start a new discussion instead.