tokenization of file input

Question

katerinaaa 0 Newbie Poster

18 Years Ago

Hi,
I would like some help with a problem I have.

I want to write a program that tokenizes the text of an input file and creates a new file containing all the words (one word per line).

The input file contains numbers, HTML tags like <P ID=1>, and Roman numerals like I. II. III., and I would like to keep these out of the output file.

In my code I have implemented the FileReader and FileWriter.
I also know that I may have to use StringTokenizer, but I don't know how to continue . . . :-(

Could anyone help me?

public static void main(String arg[])
{
    new TestStreamTokenizer().testInOut(arg[0], arg[1]);
}

private void createReadWriteStreams(String inFName, String outFName)
{
    _fileReader = new FileReader(inFName);
    _fileWriter = new FileWriter(outFName);
    _printWriter = new PrintWriter(_fileWriter);
}

public void testInOut(String inFName, String outFName)
{
    createReadWriteStreams(inFName, outFName);
    StreamTokenizer tokenizer = new StreamTokenizer(_fileReader);
    tokenizer.eolIsSignificant(true);

    int nextTok = tokenizer.nextToken();

    while (StreamTokenizer.TT_EOF != nextTok)
    {
        // ........................
        // I don't know how to do this part???
    }
}

Thanks a lot

P.S. :
--------------------------
My input file is attached
----------------------------

file-system java

input.txt (183.67 KB)

The attachment preview is chopped off after the first 10 KB. Please download the entire file.

<P ID=1>
CONTENTS
</P>
<P ID=2>
VOLUME I
</P>
<P ID=3>
BOOK FIRST.--A JUST MAN
</P>
<P ID=5>
BOOK SECOND.--THE FALL
</P>
<P ID=6>
    I.  The Evening of a Day of Walking
   II.  Prudence counselled to Wisdom
  III.  The Heroism of Passive Obedience
   IV.  Details concerning the Cheese-Dairies of Pontarlier
    V.  Tranquillity
   VI.  Jean Valjean
  VII.  The Interior of Despair
 VIII.  Billows and Shadows
   IX.  New Troubles
    X.  The Man aroused
   XI.  What he does
  XII.  The Bishop works
 XIII.  Little Gervais
</P>
<P ID=158>
  For the little seminary . . . . . . . . . . . . . .    1,500 livres
  Society of the  mission . . . . . . . . . . . . . .      100   "
  For the Lazarists of Montdidier . . . . . . . . . .      100   "
  Seminary for foreign missions in Paris  . . . . . .      200   "
  Congregation of the Holy Spirit . . . . . . . . . .      150   "
  Religious establishments of the Holy Land . . . . .      100   "
  Charitable maternity societies  . . . . . . . . . .      300   "
  Extra, for that of Arles  . . . . . . . . . . . . .       50   "
  Work for the amelioration of prisons  . . . . . . .      400   "
  Work for the relief and delivery of prisoners . . .      500   "
  To liberate fathers of families incarcerated for debt  1,000   "
  Addition to the salary of the poor teachers of the
       diocese  . . . . . . . . . . . . . . . . . . .    2,000   "
  Public granary of the Hautes-Alpes  . . . . . . . .      100   "
  Congregation of the ladies of D----, of Manosque, and of
       Sisteron, for the gratuitous instruction of poor
       girls  . . . . . . . . . . . . . . . . . . . .    1,500   "
  For the poor  . . . . . . . . . . . . . . . . . . .    6,000   "
  My personal expenses  . . . . . . . . . . . . . . .    1,000   "
                                                        ------
       Total  . . . . . . . . . . . . . . . . . . . .   15,000   "
</P>
<P ID=13825>
"The proof that God is good is that she is here."
</P>
<P ID=13826>
"Father!" said Cosette.
</P>
<P ID=13827>
Jean Valjean continued:
</P>
<P ID=8108>
"I am capable of descending the Rue de Gres, of crossing the Place
Saint-Michel, of sloping through the Rue Monsieur-le-Prince, of taking
the Rue de Vaugirard, of passing the Carmelites, of turning into the
Rue d'Assas, of reaching the Rue du Cherche-Midi, of leaving behind
me the Conseil de Guerre, of pacing the Rue des Vielles Tuileries,
of striding across the boulevard, of following the Chaussee du Maine,
of passing the barrier, and entering Richefeu's. I am capable of that. 
My shoes are capable of that."
</P>
<P ID=8109>
"Do you know anything of those comrades who meet at Richefeu's?"
</P>
<P ID=8110>
"Not much.  We only address each other as thou."
</P>
<P ID=8111>
"What will you say to them?"
</P>
<P ID=8112>
"I will speak to them of Robespierre, pardi!  Of Danton. 
Of principles."
</P>
<P ID=8113>
"You?"
</P>
<P ID=13828>
"It is quite true that it would be charming for us to live together. 
Their trees are full of birds.  I would walk with Cosette. 
It is sweet to be among living people who bid each other `good-day,'
who call to each other in the garden.  People see each other from
early morning.  We should each cultivate our own little corner. 
She would make me eat her strawberries.  I would make her gather
my roses.  That would be charming.  Only . . ."
</P>
<P ID=10356>
Father Hucheloup had, possibly, been born a chemist, but the fact
is that he was a cook; people did not confine themselves to drinking
alone in his wine-shop, they also ate there.  Hucheloup had invented
a capital thing which could be eaten nowhere but in his house,
stuffed carps, which he called carpes au gras.  These were eaten by
the light of a tallow candle or of a lamp of the time of Louis XVI.,
on tables to which were nailed waxed cloths in lieu of table-cloths.
People came thither from a distance.  Hucheloup, one fine morning,
had seen fit to notify passers-by of this "specialty"; he had dipped
a brush in a pot of black paint, and as he was an orthographer
on his own account, as well as a cook after his own fashion,
he had improvised on his wall this remarkable inscription:--
</P>
<P ID=10357>
                    CARPES HO GRAS.
</P>
<P ID=10358>
One winter, the rain-storms and the showers had taken a fancy
to obliterate the S which terminated the first word, and the G
which began the third; this is what remained:--
</P>
<P ID=10359>
                      CARPE HO RAS.
</P>
<P ID=10360>
Time and rain assisting, a humble gastronomical announcement had
become a profound piece of advice.
</P>
<P ID=10361>
In this way it came about, that though he knew no French, Father Hucheloup
understood Latin, that he had evoked philosophy from his kitchen,
and that, desirous simply of effacing Lent, he had equalled Horace. 
And the striking thing about it was, that that also meant: 
"Enter my wine-shop."
</P>
<P ID=10362>
Nothing of all this is in existence now.  The Mondetour labyrinth
was disembowelled and widely opened in 1847, and probably no longer
exists at the present moment.  The Rue de la Chanvrerie and Corinthe
have disappeared beneath the pavement of the Rue Rambuteau.
</P>
<P ID=10363>
As we have already said, Corinthe was the meeting-place if not the
rallying-point, of Courfeyrac and his friends.  It was Grantaire
who had discovered Corinthe.  He had entered it on account of the
Carpe horas, and had returned thither on account of the Carpes
au gras.  There they drank, there they ate, there they shouted;
they did not pay much, they paid badly, they did not pay at all,
but they were always welcome.  Father Hucheloup was a jovial host.
</P>
<P ID=10364>
Hucheloup, that amiable man, as was just said, was a wine-shop-keeper
with a mustache; an amusing variety.  He always had an ill-tempered air,
seemed to wish to intimidate his customers, grumbled at the people
who entered his establishment, and had rather the mien of seeking
a quarrel with them than of serving them with soup.  And yet,
we insist upon the word, people were always welcome there.  This oddity
had attracted customers to his shop, and brought him young men,
who said to each other:  "Come hear Father Hucheloup growl."  He had
been a fencing-master. All of a sudden, he would burst out laughing. 
A big voice, a good fellow.  He had a comic foundation under
a tragic exterior, he asked nothing better than to frighten you,
very much like those snuff-boxes which are in the shape of a pistol. 
The detonation makes one sneeze.
</P>
<P ID=10365>
Mother Hucheloup, his wife, was a bearded and a very homely creature.
</P>
<P ID=10366>
About 1830, Father Hucheloup died.  With him disappeared the secret
of stuffed carps.  His inconsolable widow continued to keep the
wine-shop. But the cooking deteriorated, and became execrable;
the wine, which had always been bad, became fearfully bad. 
Nevertheless, Courfeyrac and his friends continued to go to Corinthe,--
out of pity, as Bossuet said.
</P>
<P ID=10367>
The Widow Hucheloup was breathless and misshapen and given
to rustic recollections.  She deprived them of their flatness
by her pronunciation.  She had a way of her own of saying things,
which spiced her reminiscences of the village and of her springtime. 
It had formerly been her delight, so she affirmed, to hear
the loups-de-gorge (rouges-gorges) chanter dans les ogrepines
(aubepines)--to hear the redbreasts sing in the hawthorn-trees.
</P>
<P ID=10368>
The hall on the first floor, where "the restaurant" was situated,
was a large and long apartment encumbered with stools, chairs, benches,
and tables, and with a crippled, lame, old billiard-table. It
was reached by a spiral staircase which terminated in the corner
of the room at a square hole like the hatchway of a ship.
</P>
<P ID=10369>
This room, lighted by a single narrow window, and by a lamp that
was always burning, had the air of a garret.  All the four-footed
furniture comported itself as though it had but three legs--
the whitewashed walls had for their only ornament the following
quatrain in honor of Mame Hucheloup:--
</P>
<P ID=10370>
          Elle etonne a dix pas, elle epouvente a deux,
          Une verrue habite en son nez hasardeux;
          On tremble a chaque instant qu'elle ne vous la mouche
          Et qu'un beau jour son nez ne tombe dans sa bouche.[48]
</P>
<P ID=10371>
[48] She astounds at ten paces, she frightens at two, a wart inhabits
her hazardous nose; you tremble every instant lest she should blow it
at you, and lest, some fine day, her nose should tumble into her mouth.
</P>
<P ID=10372>
This was scrawled in charcoal on the wall.
</P>
<P ID=10373>
Mame Hucheloup, a good likeness, went and came from morning till
night before this quatrain with the most perfect tranquillity. 
Two serving-maids, named Matelote and Gibelotte,[49] and who had
never been known by any other names, helped Mame Hucheloup to set
on the tables the jugs of poor wine, and the various broths
which were served to the hungry patrons in earthenware bowls. 
Matelote, large, plump, redhaired, and noisy, the favorite
ex-sultana of the defunct Hucheloup, was homelier than any
mythological monster, be it what it may; still, as it becomes the
servant to always keep in the rear of the mistress, she was less
homely than Mame Hucheloup.  Gibelotte, tall, delicate, white with
a lymphatic pallor, with circles round her eyes, and drooping lids,
always languid and weary, afflicted with what may be called
chronic lassitude, the first up in the house and the last in bed,
waited on every one, even the other maid, silently and gently,
smiling through her fatigue with a vague and sleepy smile.
</P>
<P ID=10374>
[49] Matelote:  a culinary preparation of various fishes. 
Gibelotte:  stewed rabbits.
</P>
<P ID=10375>
Before entering the restaurant room, the visitor read on the door
the following line written there in chalk by Courfeyrac:--
</P>
<P ID=10376>
          Regale si tu peux et mange si tu l'oses.[50]
</P>


All 5 Replies

Ezzaral 2,714 Posting Sage

18 Years Ago

Actually, you should use the split() method of String instead of StringTokenizer and use regular expressions to remove text that you do not wish to include. Split will split your string by whatever delimiter you specify and return the parts as a string array. Regular expressions will allow you to specify patterns to match the pieces you don't want to include. Sun has a tutorial on regular expressions here: http://java.sun.com/docs/books/tutorial/essential/regex/
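As a minimal sketch of the split() idea above: the class and method names here are made up for illustration, and splitting on runs of whitespace ("\\s+") is an assumption about what counts as a delimiter in your file.

```java
import java.util.Arrays;

class SplitDemo {
    // Split one line into words on runs of whitespace.
    // trim() first, so leading spaces do not produce an empty first element.
    static String[] words(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // prints [VI., Jean, Valjean]
        System.out.println(Arrays.toString(words("  VI.  Jean Valjean")));
    }
}
```

Note that the numeral "VI." is still there at this point; split() only separates the pieces, so the unwanted ones have to be removed with a regular expression before or after splitting.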

Ezzaral 2,714 Posting Sage

18 Years Ago

Thanks a lot for your answer.
I would like to ask you something more about the split parameter.
How can I write a regular expression that deletes words like I. II. III. IV. .... and the <P> tags?

Well, you will have to work a little bit on the regular expressions to match your content. The expression "<P ID=\d+>" would match your "<P ID=1>" tags, if they are always of that form. "</P>" by itself will match "</P>", so not much to that one. The Roman numerals will be a little trickier, since they are merely a sequence of certain capital letters followed by a period (in your example at least). You might get away with the pattern "[IVXLCDM]+\." for those, but there is a slight chance you might accidentally match some of your text (pretty unlikely, I would say, though).
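To make the "accidental match" risk concrete, here is a small sketch. The class name, the constants, and the "\\b" word-boundary variant are my own assumptions, not something from the posts above; the sample strings are invented too.

```java
class RomanNumeralDemo {
    // The pattern suggested above, written as a Java string literal.
    static final String LOOSE = "[IVXLCDM]+\\.";
    // Assumed refinement: only match when the numeral starts at a word boundary.
    static final String BOUNDED = "\\b[IVXLCDM]+\\.";

    static String strip(String text, String regex) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // Both patterns remove a real heading numeral.
        System.out.println("[" + strip("XII. The Bishop works", LOOSE) + "]");
        // The loose pattern also eats the tail of an upper-case word:
        // prints [IT WAS MAG]
        System.out.println("[" + strip("IT WAS MAGIC.", LOOSE) + "]");
        // The bounded pattern leaves that word alone:
        // prints [IT WAS MAGIC.]
        System.out.println("[" + strip("IT WAS MAGIC.", BOUNDED) + "]");
    }
}
```

Even the bounded version is not perfect: a sentence that ends with the pronoun "I." would still be removed, so it is worth checking the output against your real file.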

Do I have to call split many times, or can I do it differently?

You can first use the regular expressions to strip the things you do not want to capture. If you are reading a line at a time into a string variable, you can strip things out by calling replaceAll() with your regular expression and an empty string "" as the replacement. After stripping out the unwanted content, call split(" ") to split on spaces and get your array of words to write out to the file.

BufferedReader reader = new BufferedReader(new FileReader("foo.in"));
String inputString = reader.readLine();

// strip out the stuff you don't want
String cleanString = inputString.replaceAll("<P ID=\\d+>", "");
cleanString = cleanString.replaceAll("</P>", "");  // the closing tags in the input are </P>
cleanString = cleanString.replaceAll("[IVXLCDM]+\\.", "");

// get the remaining words into an array
String[] words = cleanString.split(" ");

// loop words and write them, then read next line, etc.

This is just one way that might work for you. I would imagine someone who does a lot of file parsing with regular expressions could present a more efficient way, but this might give you a start.
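For completeness, here is one way the whole read/strip/split/write loop could look. Treat it as a sketch under assumptions: the class name WordPerLine and the helper wordsOf are invented, the tag pattern "</?P\\b[^>]*>" and the "\\b" in front of the numeral pattern are my own variations on the expressions above, and try-with-resources needs Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

class WordPerLine {
    // Strip tags and Roman numerals from one line, then split it into words.
    static List<String> wordsOf(String line) {
        String clean = line
                .replaceAll("</?P\\b[^>]*>", " ")      // <P ID=123> and </P>
                .replaceAll("\\b[IVXLCDM]+\\.", " ");  // numerals such as "XII."
        List<String> words = new ArrayList<>();
        for (String w : clean.trim().split("\\s+")) {
            if (!w.isEmpty()) {                        // a blank line yields one empty string
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java WordPerLine <input file> <output file>");
            return;
        }
        // Both streams are closed automatically, even if an exception is thrown.
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String w : wordsOf(line)) {
                    out.println(w);                    // one word per line
                }
            }
        }
    }
}
```

This version does not remove digits or punctuation, so tokens like "1,500" or "Valjean:" would still reach the output file; those need an extra replaceAll() step.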


katerinaaa 0 Newbie Poster · Answer 1 · 2007-06-07T01:41:45+00:00

Thanks a lot for your answer.

I would like to ask you something more about the split parameter.

How can I write a regular expression that deletes words like I. II. III. IV. .... and the <P> tags?

Do I have to call split many times, or can I do it differently?

Thanks a lot again!

katerinaaa 0 Newbie Poster · Answer 2 · 2007-06-07T21:01:20+00:00

Thanks a lot, Ezzaral, for your help.
The program works nearly perfectly.

The only problem I have is that I can't replace the character '.'.
I read that the full stop is a special character, so I have to call the function like this:

cleanString.replaceAll("\\.", " ");

But when I use it I have a problem with the Roman numerals (they are printed in the output file).

Any ideas?

And to close the thread, I would like to ask whether I could use only one expression.
For example, replaceAll("\\d" "\"" "\\?" ":"," ")

Is there something like that?

Thanks a lot!!!

I promise that I won't ask again!

Ezzaral 2,714 Posting Sage Team Colleague Featured Poster · Answer 3 · 2007-06-07T22:24:10+00:00

You should be able to just strip the Roman numerals first and then remove the remaining "." occurrences. If the periods are removed first, the numeral pattern, which requires a trailing period, no longer matches anything.

On your other question about combining: yes, you can combine some of them, but not all. If you put them inside [ ] brackets, it becomes an OR comparison, so "[\\d\"\\?:]" would strip all of those characters. Don't combine it with the others though, which need to match a specific sequence. If you add those expressions between the brackets, it will strip any of those characters (such as P) even if the whole sequence does not match.
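A short sketch of both points, the ordering and the bracket expression. The class and method names are made up, and adding "." to the character class is my own extension of the expression above.

```java
class StripOrderDemo {
    // Numerals first (while their period is still present), then single characters.
    static String clean(String line) {
        return line
                .replaceAll("[IVXLCDM]+\\.", " ")   // sequence match: letters followed by "."
                .replaceAll("[\\d\"?:.]", " ");     // any ONE of: digit, quote, ?, :, .
    }

    // Periods first: afterwards the numeral pattern can no longer match.
    static String wrongOrder(String line) {
        return line
                .replaceAll("\\.", " ")
                .replaceAll("[IVXLCDM]+\\.", " ");
    }

    public static void main(String[] args) {
        // prints [New Troubles]
        System.out.println("[" + clean("IX. New Troubles?").trim() + "]");
        // The numeral survives here because its period was removed too early.
        System.out.println("[" + wrongOrder("IX. New Troubles?").trim() + "]");
    }
}
```

Inside [ ] most special characters, including "." and "?", lose their special meaning, so they do not need a backslash there; escaping them anyway is harmless.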
