I've been banging my head against a brick wall trying to work out how to do this.

I need to read a text file line by line and examine each character of the line. When the character is a space, everything read up to that point is added to an array as one element. Any other characters, such as brackets, go into their own array element, so I end up with both words and punctuation.

I haven't got very far as I can't find anything online, so any help to get me moving would be appreciated.

#include <fstream>
#include <iostream>

using namespace std;

int main()
{
	char ch;

	ifstream myFile("textfile.txt");
	
	if (! myFile)
	{
		cout << "Error opening input file"  << endl;
		return -1;
	}

	while (myFile.get(ch)) // the loop ends as soon as a read fails, so no stale character is processed at end of file
	{
		/*
               This is where the adding to arrays/vectors would happen
               */
	}
	myFile.close();
	return 0;
}

Can anyone please get me started?

You should post a short sample text file and what needs to end up happening with that text file. The description is a start, but not quite detailed enough.

My name is Bob.  I live at 1234 Main St. in Sacramento, CA.  My
favorite food is pizza!  What's your name?  Well, I have to
go to work now.  I'm late; it's already 9 a.m.  It was nice meeting you!

Say that's the file (or if that's not a good file, provide another one). What do you do with it? What does the array (s) hold?

It should scan each character until it gets to a space, so each array element would be a word. Any punctuation should be in its own array element. I want to ignore whitespace and newlines.

So the array in your example would be

My
name
is
Bob
.
...etc

Many thanks for the reply

Look at ispunct and strtok. I imagine that'd be your best bet. Your delimiters are a character array of all punctuation. Read in using the >> operator to throw out white space, then split the string further into tokens using strtok.

http://www.cplusplus.com/reference/clibrary/cctype/ispunct/
http://www.cplusplus.com/reference/clibrary/cstring/strtok/

strtok may not be worth the trouble since you actually want to SAVE the punctuation too. Perhaps use some combination of strtok to isolate all the words, and then go through character by character and grab the punctuation.

"find" and "substr" from the "string" library could also come in handy. Since you want to keep things in order and keep the punctuations, you might not want to bother with strtok and just go through the string and find all punctuation using ispunct and keep track of the indexes. Then split the string up using substr or whatever. More than one way to do it. Look at the string, cctype, and cstring libraries, as well as getline and get.

I am making some progress

I'm now able to read each character from each line, but the problem now is processing each character.

I still need to keep scanning each character until I reach whitespace. Say the first character is a letter: I need to add that to a char array and keep adding letters to the char array until I reach whitespace.

How would I do that?

Here is my code I got so far. This code can strip comments

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{

	int i;
	int a=0;
	char ch;
	string line;

	ifstream myFile("scan.cm");
	
	if (! myFile)
	{
		cout << "Error opening input file"  << endl;
		return -1;
	}
	
	while( getline( myFile, line ) )
	{
		a++;
		
		cout << "line " << a << ":";
		for (i=0; i < line.length(); i++)
		{
			if(line[0] =='/' && line[1] == '*') continue;
			
			if(line[0] =='*') continue;
			
			if(line[0] =='*' && line[1] == '/') continue;
			
			cout << line[i];

                        //this is where the array needs to be or a vector
		}
		
		cout << endl;
		
	}
	
	myFile.close();

	return 0;
}


Is this a code parser? I see you are looking for "/*" and "*/", which are comment delimiters. If not, what is the significance of the tests on lines 29 through 33? Since any whitespace is a delimiter, not just newline, I'd go with the >> operator rather than getline. >> splits on all whitespace for you.

I'm going to Bob's house for dinner.

Read in using >> and it splits to this:

I'm
going
to
Bob's
house
for
dinner.

Next task with each word is to split into punctuation. I'd go through character by character looking for punctuation and grabbing their indexes.

Bob's

Character at index 3 is punctuation, indexes are 0 through 4, so three substrings:

Bob  // up to but not including index 3
'    // index 3
s    // everything after index 3

Stick the three strings in the array, go on to the next word. Decide exactly what "punctuation" is and go through a character at a time. If punctuation is what is defined in "ispunct", use it. Otherwise write your own. If the "tokens" are more complicated than single characters, this approach won't work.

This makes sense to me, but how do I do it?

Many thanks

Yes, I am working on a code parser and I'm getting there slowly.

I've found some examples online that other people have tried for different things, and I can sort of get them working in my code.

The problem I have is getting all of the data put into the vector until it reaches whitespace. It has something to do with the do-while loop.

#include <fstream>
#include <iostream>
#include <vector>
#include <string>
#include <sstream>

using namespace std;

int main()
{
	int i;
	int a=0;
	char ch;
	string line;

	ifstream myFile("scan.cm");
	
	if (! myFile)
	{
		cout << "Error opening input file"  << endl;
		return -1;
	}
	
	vector < vector < char > > info;
	
	while( getline( myFile, line ) )
	{
		vector < char > data;
		char value;
		istringstream iss(line);
		
		while (iss >> value)
		{
			if(line[0] =='/' && line[1] == '*')
			{
				continue;
			}
			else if(line[0] =='*')
			{
				continue;
			}
			else if(line[0] =='*' && line[1] == '/')
			{
				continue;
			}
			else
			{
				//do
				//{
					data.push_back(value);
				//}
				//while (value !=' ');
			}
		}	
		info.push_back(data);
	}
	
	for ( vector < vector < char > > :: size_type i = 0, size = info.size(); i < size; ++i)
	{
		cout << "line " << i + 1 << ": ";
		for ( vector < char > :: size_type j = 0, length = info[i].size(); j < length; ++j)
		{
			cout << info[i][j];
		}
		cout << endl;
	}
	myFile.close();

	return 0;
}

Lines 32 through 54 - What are you trying to accomplish here? Lines 34, 38, 42 are the same test over and over again. Why test for the exact same thing every time through the loop? Just test it once. line[0] and line[1] never change in this loop. Lines 42 through 45 can be deleted. If the test on line 42 is true, then the earlier test on line 38 will have been true, so it is impossible for line 44 to execute.

Not sure what you were trying to accomplish with the commented out do-while loop, but left uncommented, it would be an infinite loop since value wouldn't ever change.

What exactly are you trying to end up with here and what's the input and what are the guidelines? What is this parser expected to be able to handle? You should post a sample file, along with what needs to end up in all of the vectors at the end based on that sample file.

I need to create a list of each word and punctuation mark and what line they appear on, like this (using your example).

With your example I would scan the first letter and keep adding characters to an array/vector until I reach whitespace, which would create a word. Once it reaches whitespace, it knows the word is complete. It keeps scanning until it finds another letter and does the same, but if it reaches some punctuation, it adds that to its own array/vector.

Line 1: My
Line 1: name
Line 1: is
Line 1: Bob
Line 1: .
Line 1: I
Line 1: live
Line 1: at
Line 1: 1234
Line 1: Main
Line 1: St
Line 1: .
Line 1: in
Line 1: Sacramento
Line 1: ,
Line 1: CA
Line 1: .
Line 2: My
Line 2: favorite
Line 2: food
Line 2: is
Line 2: pizza
Line 2: !

I have been trying different things all day but I'm still not getting where I need to be.

I can scan each character and output each character, but it's not reading whitespace as a delimiter.

I've been looking at strtok, but if you had a string like Bob's, it would take that as a whole string and not 3 separate tokens like you suggested.

How would I do what you suggested? I can't find anything that helps me and everything I try doesn't work.

Many thanks for your help btw

Having a vector of vectors of char seems needlessly complex to me. A single vector of strings would be better, I think. If that's what you're doing, then where does the "/*" and "*/" come into play? My sample file doesn't have any "/*" or "*/" in it, so since you're treating them as some type of token, you should probably provide a file that has them. Specifically, is "/*" stored as one string (or one character vector, as you have it) or two, as in '/' and '*'?

Here's my approach:

vector <string> SplitString (string aString)
{
  // splits string based on punctuation delimiters.
  // "house" returns {"house"} // 1 element
  // "a.m." returns {"a",".","m","."} // 4 elements

  // possibly use "find", "substr", "strtok", "ispunct" functions to help.
}

// main function

vector <string> tokens;
ifstream ins;
ins.open ("scan.cm");

string aString;

while (ins >> aString)
{
  vector <string> newTokens = SplitString (aString);
  int numNewTokens = newTokens.size ();
  for (int i = 0; i < numNewTokens; i++)
  {
    tokens.push_back (newTokens[i]);
  }
}

ins.close ();

// tokens vector now contains a bunch of strings.

It doesn't handle the line numbers, so you could use your getline and stringstream ideas to split a sentence into words in addition to above. But I think something incorporating what is above will be your best bet.

Here's the example from http://www.cplusplus.com/reference/clibrary/cstring/strtok/

modified to work with "Bob's".

/* strtok example */
#include <stdio.h>
#include <string.h>



int main ()
{
  char str[] ="Bob's";
  char * pch;
  printf ("Splitting string \"%s\" into tokens:\n",str);
  pch = strtok (str,"',.-/*");
  while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, "',.-/*");
  }

  return 0;
}

It'll display "Bob" and "s", but it throws the apostrophe out, which isn't good for your needs. It would thus be only part of the solution. Frankly, I'm not sure strtok is worth it. If you want to write the SplitString function from my last post, first decide exactly what the punctuation delimiter list is. If it matches cctype's ispunct delimiters, use it. If not, write your own and use it as a helper function. For example,

bool ispunct (char aCharacter, string delimiters)
{
  int numDelimiters = delimiters.length ();
  for (int i = 0; i < numDelimiters; i++)
  {
    if (aCharacter == delimiters[i])
      return true;
  }

  return false;
}

Use it as a helper function and go through the string character by character to find the punctuation indexes. Then use substr to split the string. Since strtok throws out the punctuation and you want to keep it, you have to do this anyway, so don't bother with strtok.

Best of luck with this. Hopefully this'll help get you started. Unfortunately I have to go to work and won't be able to help anymore, but hopefully someone else will if need be. You may want to start a new thread though.
