Generic file reader

Question

TheWolverine 0 Junior Poster in Training

13 Years Ago

Hi all,

I've set up a class within my software package to read generic text files. For this, I'm using the getline() function and then parsing each line according to the specific type of file.

In setting this up, I've come across the ASCII incompatibilities between Windows systems and Unix systems. I struggled for a while before realising that a text file that had been created on a Windows machine contained the carriage return, which showed up on my Mac as the ^M character.

I was wondering if there is any general, accepted way to deal with these discrepancies. Is it simply a case of detecting the operating system and having a case for each, or is there a more universal way of dealing with this discrepancy?

I'd appreciate any input.

Thanks a lot!

Kartik

c++ reader text

Fbody 682 Posting Maven

13 Years Ago

Have you tried opening the file in binary mode? When you open a file in binary mode, the end-of-line character(s) aren't converted when read/written...

Fbody 682 Posting Maven

13 Years Ago

As you mentioned, the file was originally stored on a Windows machine. Windows uses the 2-character combination "\r\n" as the end-of-line marker, whereas Unix-like systems (including Mac OS X) use "\n" alone and classic Mac OS used "\r" alone.

You may want to consider using the C-style string function strpbrk() with a comparison string of "\r\n". It returns a pointer to the first of those control characters, no matter how the line ends (or a null pointer if there is none). You can then either extract up to, or truncate at, that point.
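For illustration, here is a minimal sketch of the strpbrk() idea. The helper name chop_at_line_ending is made up for this example and is not from the thread; it simply copies everything before the first "\r" or "\n".

```cpp
#include <cstring>   // std::strpbrk
#include <string>

// Hypothetical helper (name chosen for this sketch): returns the text of
// `line` up to, but not including, the first '\r' or '\n'.
std::string chop_at_line_ending(const char* line)
{
    // strpbrk returns a pointer to the first character of `line` that
    // appears in "\r\n", or a null pointer if neither is present.
    const char* end = std::strpbrk(line, "\r\n");
    if (end == nullptr)
        return std::string(line);   // no line ending found, keep everything
    return std::string(line, end);  // copy the range [line, end)
}
```

With a std::string from getline, it could be called as chop_at_line_ending(line.c_str()). Since it stops at the first control character, it works the same for "\r\n", "\n" and "\r" endings.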

TheWolverine 0 Junior Poster in Training · Answer 1 · 2011-04-08T02:04:04+00:00

Have you tried opening the file in binary mode? When you open a file in binary mode, the end-of-line character(s) aren't converted when read/written...

I went ahead and just tried that, but it seems the end-of-line character is still being read, because the string length should be 12 for a specific case I'm testing, but when I read it in from the file, the length turns out to be 13. The first 12 characters all match the expected characters, so there's something at the end. If I add a "\r" at the end of the expected string, then the problem disappears, which is how I diagnosed it as being a problem with the carriage return.

I added the ios::binary option after the string filename in the open function of an ofstream object in my class.

Is there something else I should be doing in addition?

Thanks!

Kartik
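A note on why binary mode probably made no difference here: on Unix-like systems such as Mac OS X, text mode and binary mode behave the same, so nothing removes the "\r" in either mode, and std::getline only discards the "\n". The sketch below reproduces the 12-versus-13 observation with an in-memory stream. The sample text and both function names are invented for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads one line with std::getline. getline stops at '\n' and discards it,
// but it treats '\r' as an ordinary character and leaves it in the string.
std::string read_first_line(std::istream& in)
{
    std::string line;
    std::getline(in, line);
    return line;
}

// Hypothetical stand-in for a file saved on Windows: the first line is
// 12 visible characters followed by the "\r\n" pair.
std::string first_line_of_windows_sample()
{
    std::istringstream windows_text("Hello world!\r\nsecond line\r\n");
    return read_first_line(windows_text);
}
```

The returned string has the 12 expected characters plus one trailing "\r", which matches the symptom described above.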

TheWolverine 0 Junior Poster in Training · Answer 2 · 2011-04-09T02:44:44+00:00

I've decided to just write simple search-and-erase loops that look for the "\r" and "\n" characters and trim them from the strings returned by getline. I'm not sure if this is the best way though. I'm a little hesitant to use C-style functions in the code, as I'm trying to keep the code purely C++-style.

Thanks,

Kartik

mike_2000_17 2,669 21st Century Viking Team Colleague Featured Poster · Answer 3 · 2011-04-09T03:25:29+00:00

You can use Boost.Tokenizer. It has nicer functions for extracting tokens from strings. You can, for instance, specify a list of characters that should be ignored and that mark the separation of tokens. This is the canonical example:

// char_sep_example_1.cpp
#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

int main()
{
  std::string str = ";;Hello|world||-foo--bar;yow;baz|";
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
  // Any of '-', ';' or '|' separates tokens; empty tokens are dropped by default.
  tokenizer tokens(str, boost::char_separator<char>("-;|"));
  for (tokenizer::iterator tok_iter = tokens.begin();
       tok_iter != tokens.end(); ++tok_iter)
    std::cout << "<" << *tok_iter << "> ";
  std::cout << "\n";
  return 0;
}

There are also a few other nice features. For more complicated stuff, you can use Regex, but it might be overkill if you are just doing simple tokenizing.
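If pulling in Boost is not an option, the same "any of these characters separates tokens" idea can be approximated with the standard library alone. The sketch below is not Boost.Tokenizer; split_any_of is a made-up name. Like the char_separator example above, it drops empty tokens, so with "\r\n" as the separator list, blank lines disappear.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `text` at every character found in
// `separators` and returns the non-empty pieces in order.
std::vector<std::string> split_any_of(const std::string& text,
                                      const std::string& separators)
{
    std::vector<std::string> tokens;
    // Skip any leading separators to find the start of the first token.
    std::string::size_type start = text.find_first_not_of(separators);
    while (start != std::string::npos) {
        // The token ends at the next separator, or at the end of the text.
        const std::string::size_type end = text.find_first_of(separators, start);
        tokens.push_back(text.substr(start, end - start));
        // Skip the run of separators that follows this token.
        start = text.find_first_not_of(separators, end);
    }
    return tokens;
}
```

Calling split_any_of(buffer, "\r\n") on a whole file's contents would give one string per non-empty line, whichever line-ending convention the file uses.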