Does anyone know how I can create a tokenizer for a text file in C++?
I find it difficult because the file contains not only words but also numbers and <P ID> tags.

I have attached the file that needs to be tokenized.

Could anyone help me?

Thanks a lot

The file wasn't attached.

Maybe that's the point ...

The file looks something like this:

<P ID=1>
<P ID=2>
<P ID=3>
<P ID=4>
I. M. Myriel
II. M. Myriel becomes M. Welcome
III. A Hard Bishopric for a Good Bishop
IV. Works corresponding to Words
V. Monseigneur Bienvenu made his Cassocks last too long
VI. Who guarded his House for him
VII. Cravatte
VIII. Philosophy after Drinking
IX. The Brother as depicted by the Sister
X. The Bishop in the Presence of an Unknown Light
XI. A Restriction
XII. The Solitude of Monseigneur Welcome
XIII. What he believed
XIV. What he thought
<P ID=5>
<P ID=6>
I. The Evening of a Day of Walking
II. Prudence counselled to Wisdom
III. The Heroism of Passive Obedience
IV. Details concerning the Cheese-Dairies of Pontarlier
V. Tranquillity
VI. Jean Valjean
VII. The Interior of Despair
VIII. Billows and Shadows
IX. New Troubles
X. The Man aroused
XI. What he does
XII. The Bishop works
XIII. Little Gervais
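
For a file like the sample above, one possible approach is a small hand-written scanner that walks the text character by character and sorts each piece into a tag, a number, a word, or a punctuation mark. The following is only a minimal sketch under some assumptions: the names TokenKind, Token, and tokenize are made up for illustration, the text is assumed to be plain ASCII, a tag is assumed to be anything from '<' up to the next '>', and Roman numerals such as "IV" simply come out as words.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories for this sketch.
enum class TokenKind { Tag, Number, Word, Punct };

struct Token {
    TokenKind kind;
    std::string text;
};

// Small helpers so std::isalpha / std::isdigit never see a negative char.
static bool isAlpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
static bool isSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }

// Splits the input into tokens. Assumes ASCII text; whitespace is skipped.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    const std::size_t n = input.size();
    std::size_t i = 0;

    while (i < n) {
        const char c = input[i];
        if (isSpace(c)) {
            ++i;
            continue;
        }

        const std::size_t start = i;

        // A tag such as <P ID=4>: everything up to and including the next '>'.
        if (c == '<') {
            const std::size_t close = input.find('>', i);
            if (close != std::string::npos) {
                tokens.push_back({TokenKind::Tag, input.substr(i, close - i + 1)});
                i = close + 1;
                continue;
            }
            // No closing '>': fall through and treat '<' as punctuation.
        }

        if (isDigit(c)) {
            // A run of digits is one number token.
            while (i < n && isDigit(input[i])) ++i;
            tokens.push_back({TokenKind::Number, input.substr(start, i - start)});
        } else if (isAlpha(c)) {
            // A run of letters is one word; a hyphen between two letters
            // stays inside the word (e.g. "Cheese-Dairies").
            while (i < n && (isAlpha(input[i]) ||
                             (input[i] == '-' && i + 1 < n && isAlpha(input[i + 1])))) {
                ++i;
            }
            tokens.push_back({TokenKind::Word, input.substr(start, i - start)});
        } else {
            // Anything else is a single punctuation token.
            ++i;
            tokens.push_back({TokenKind::Punct, input.substr(start, 1)});
        }
    }
    return tokens;
}
```

To use it, read the whole file into a std::string (for example with std::ifstream and a std::ostringstream) and pass that string to tokenize. If you need the paragraph number, you could parse it out of the text of each Tag token afterwards.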

