I have a
std::string(const char *, std::string::size_type)
constructor failing and I am not sure why.

For my small test file there was no problem, but it fails when using
ifstream::read()
with an unsigned int sz of 1Gb (1073741824)
(I think that I used a pos_type the first time)
and passing the resulting char* memblock to the constructor.

I first compare against gcount() to confirm that sz isn't bigger.

Have I got an overflow problem? max_size() returns 4Gb as expected.

Or is it a problem with the file data?

The main problem that I have is that I am trying to chop a very big (25Gb) XML file into manageable chunks. So I loaded the file in 1Gb chunks and wrote them straight out using ifstream read() and ofstream write(). So my files might not be in the state that I think they are and might be missing an end of file.

I made a small 260k file in the same way that works fine.

Can I access the file to chop it on html tags in Windows XP?

Thanks,

David

P.S. The 25Gb is Wikipedia text only.

1. Post the code (at least some of it) so we can get a better idea of what's going on.
2. Why don't you use an XML library to parse the code? There are hundreds of XML libraries out there. Such libraries can save you a ton of work and they are usually very efficient.
3. Why are you trying to read the entire file at once? Couldn't you process line by line, and not have to worry about overflows at all?

I will try to post exact code tomorrow, if wanted, when I am at the correct computer, but for now I thought I ought to give a quick clarification.

In brief, the code is roughly:

#include <fstream>
#include <string>
#include <iostream>

int main()
{
    unsigned int size(1073741824); // originally a pos_type
    char * memblock = new char[size];

    std::string file_name("D:\\wiki.txt");
    /* this file is the first gig of the original, produced by
       reading into memblock and an ofstream write */
    // read in 1 Gig file
    std::ifstream fin(file_name.c_str(), std::ios::binary); // also used in

    if (fin.is_open())
    {
        fin.seekg(0, std::ios::beg);
        fin.read(memblock, size);
        // check how much has been read
        if (size > static_cast<unsigned int>(fin.gcount()))
        {
            size = static_cast<unsigned int>(fin.gcount());
        }
        std::string dummy("");
        std::cout << dummy.max_size() << "<>" << size << std::endl;
        std::string data(memblock, size); // <-<< this goes wrong
        std::cout << "I Don't get here! for big file" << std::endl;
        fin.close();
    }
    // clean up memory
    delete [] memblock;
    return 0;
}

This works for the test file but not for the big file,
where it outputs max size as expected:
4294967296<>1073741824
Then visio crashes on the constructor.

2. Why don't you use an XML library to parse the code? There are hundreds of XML libraries out there. Such libraries can save you a ton of work and they are usually very efficient.

3. Why are you trying to read the entire file at once? Couldn't you process line by line, and not have to worry about overflows at all?

The original file that I want to process is 25Gb. This presents problems for a very large number of programs and templates, as the size of an int restricts them to 2^32 and hence 4Gb of input. Especially on Windows, many things fail at the 4Gb limit.
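To make the arithmetic concrete, here is a minimal sketch of the limit being described. The helper name fits_in_32_bits and the two constants are made up for illustration; it uses the fixed-width types from <cstdint> so that it does not depend on how wide int happens to be on a given compiler.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true if a byte count can be held in a 32-bit unsigned value.
inline bool fits_in_32_bits(std::uint64_t byte_count)
{
    return byte_count <= std::numeric_limits<std::uint32_t>::max();
}

// Sizes written as 64-bit constants so the arithmetic itself cannot wrap.
const std::uint64_t one_gib   = 1ULL << 30;        // 1073741824, the chunk size used here
const std::uint64_t wiki_size = 25ULL * one_gib;   // 26843545600, the full file
```

A 1Gb chunk fits comfortably, but the full 25Gb size does not, so any offset or size for the whole file has to be carried in a 64-bit type.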

I would like to process the file line by line eventually, but first I want to simply chop it on a single XML tag. To process the file I wanted to use an optimised, validated find routine such as those found in the STL.

I also am not sure how to access just a pointer to the start of the file data and then iterate char by char, which is why I wanted to use a container.

I will try chopping the original 25Gb file into smaller units ~ 1Mb and then see if that can be processed.
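For reference, chopping a file into fixed-size pieces does not need a buffer as large as the piece. The following is only a sketch under my own assumptions: the function name split_file, the numbered output names and the 64k buffer are all made up for illustration, and the cut points fall at arbitrary bytes, not on tag boundaries.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: copies in_path into files named <out_prefix>0, <out_prefix>1, ...
// each holding at most chunk_bytes bytes. Returns the number of files written.
inline unsigned split_file(const std::string& in_path,
                           const std::string& out_prefix,
                           unsigned long long chunk_bytes)
{
    std::ifstream fin(in_path.c_str(), std::ios::binary);
    if (!fin || chunk_bytes == 0) return 0;

    std::vector<char> buffer(64 * 1024);   // small fixed buffer, reused for every read
    unsigned files_written = 0;

    while (true) {
        std::ofstream fout;                      // opened only once there is data to write
        unsigned long long left = chunk_bytes;   // bytes still allowed in this chunk
        while (left > 0) {
            const std::size_t want = static_cast<std::size_t>(
                std::min<unsigned long long>(left, buffer.size()));
            fin.read(&buffer[0], static_cast<std::streamsize>(want));
            const std::streamsize got = fin.gcount();
            if (got <= 0) break;                 // nothing more to read
            if (!fout.is_open()) {
                std::ostringstream name;
                name << out_prefix << files_written;
                fout.open(name.str().c_str(), std::ios::binary);
                if (!fout) return files_written;
                ++files_written;
            }
            fout.write(&buffer[0], got);
            left -= static_cast<unsigned long long>(got);
        }
        if (left > 0) break;   // inner loop stopped early, so the input is exhausted
    }
    return files_written;
}
```

Something like split_file("D:\\wiki.txt", "D:\\wiki_part_", 1048576) would then give ~1Mb pieces while only ever holding 64k in memory. The chunk size is kept in an unsigned long long so it is not tied to the width of int.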


unsigned int size(1073741824); //orig pos_type
//....
//read in 1 Gig file
std::ifstream fin(file_name.c_str(),std::ios::binary); // also used in

I'm not sure why you are using a binary fstream to read text data. It seems like you are causing yourself some extra pain by doing this. If you are really just parsing XML, why not use a standard ascii fstream? Also, to manage or query file sizes, you should use stat or some similar system call to get the file size.

I would like to process the file line by line eventually but first I want to simply chop it on a single xml tag.

It's actually very easy to parse line by line. If you use an ascii file stream, you can be confident that each line could be easily converted to a string object for parsing.

I also am not sure how to access just a pointer for the file start of data and then iterate char by char, which is why I wanted to use a container

Again, if you use an ascii ifstream, this is trivial. You can simply use the ifstream::get( char& c ) to extract a single character from the file.

So, I recommend you do this:
1. Use stat to query the size of the file for checks and safety
2. Open the ifstream not as ios::binary but regular old ascii ifstream::in
3. If you want to parse a section of the file at a time by converting it into a string, use getline()
4. If you want to parse a character at a time, use get()
5. If you are not trying to extract the entire file into a single string, these techniques should work for even large files.

Thanks for your help.

Sorry, I did not make myself clear. The comment was supposed to indicate that I had tried using ifstream with in as well.

I'm not sure why you are using a binary fstream to read text data.

The problem that I have is that I do not understand why the string constructor might be failing. The only thing that I am not sure about is the state of the original file, as I used a brute force approach to chop the 25Gbytes.

I don't know stat() and could not find much documentation on it. Is it a C struct for getting file info? I looked at sys/stat.h and I think that it still uses a 32-bit int, so it would be unusable with the original 25Gig file.

4. If you want to parse a character at a time, use get()

This might be usable but I am not sure what happens around the 4Gig mark. When I said find the chars directly I meant the equivalent of char * pc; but even a char * won't cope with the file, as it is not big enough.

I have discovered that the issue is that I am trying to use too much memory, and operator new is throwing inside the string constructor as I am overflowing the visio limit.

So I was really asking the wrong question, so I will have to try a slightly different approach.

It appears that I am creating a temp structure that causes an overflow and I shouldn't really need this.
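The temp structure here is the std::string itself: std::string data(memblock, size) copies the whole buffer, so a second 1Gb block has to be allocated while the first is still alive. The STL find routines accept plain pointers as iterators, so the buffer can be searched in place. A minimal sketch, with the helper name find_tag made up for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns the offset of the first occurrence of tag inside
// buffer[0, size), or size if it is not present. The buffer is not copied.
inline std::size_t find_tag(const char* buffer, std::size_t size, const char* tag)
{
    const std::size_t tag_len = std::strlen(tag);
    const char* hit = std::search(buffer, buffer + size, tag, tag + tag_len);
    return static_cast<std::size_t>(hit - buffer);
}
```

With the code above this would be called as find_tag(memblock, size, "<page>"), where "<page>" is only an example tag. A tag that straddles two chunks would be missed by a search like this, so the chunk boundaries still need some care.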

I chopped the big file into smaller chunks and all is working ok. The file was alright; it was a problem with memory and new(), but with the smaller files everything was fine.
