Hi, there,

I am also very new to Java.
For class, I have to create a structure that reflects that of a document stored in a plain txt file.
The hierarchy of the classes is:
Class Document is made of
//complex objects
class Page objects, that are made of
class Paragraph objects, that are made of
class Phrase objects, that are made of
class Word objects, and/or class Number objects, both made of
// elemental type objects
class basicElement objects, characters read from the file.

EOdocument To tell a document's end, I need an end-of-file character (EOF)
EOpage To tell a page's end, I need two consecutive carriage returns OR the above
EOparagraph To tell a paragraph's end, I need one carriage return OR any of the above
EOphrase To tell a phrase's end, I need a dot OR any of the above
EOword To tell a word's end, I need a space, a comma, semicolon, colon OR any of the above
Whether a token is a number or a word is recorded in a field of the Word class.

I have thought of reading the file character by character and building the structure as I go. My approach was that I first create a Document object, whose class constructor creates an ArrayList of Pages until an EOF is encountered.
The Page constructor creates an ArrayList of Paragraphs until an EOpage is encountered. The Paragraph constructor creates an ArrayList of Phrases until an EOparagraph is encountered.
The Phrase constructor creates an ArrayList of Words and Numbers until an EOphrase is encountered.
The Word constructor creates an ArrayList of basicElements until an EOword is encountered.

I have, though, several questions that have me stuck.

For now, I have coded a text file reader using a Scanner and, playing with regular expressions, I have been able to extract tokens of any of the kinds above. But I only get a series of strings yielded by the delimiters, not actual objects of my classes. I would like to let every object fill its own ArrayList at the time of construction, but I run into this problem: how would each new basicElement know what the last read position of the text file was? Should I pass a parameter through the whole chain of construction, back and forth? Could I create a static string that any new basicElement can ask for the next element? Does that make any sense?
Is this efficient, or would it be better to start from one big string containing the whole document, then create several strings with the text of each page, then create new strings for each paragraph, and so on? I find it very complex because I would have to create hundreds or perhaps thousands of strings that would be tokenized or scanned with exponentially growing complexity. Seems scary.

Any advice?

I have some code and a scheme of my analysis, but the only thing I have come up with is being able to tokenize with regex.

In my search I have found something called an iterator. This could help me with the static string approach, as I would have an independent object, visible to all, that knows the last characters read...

looking forward to hearing from you

ankilosado

structure that reflects that of a document stored in a plain txt file.

You don't define what the layout/structure is for the txt file.
What is the relationship between lines and tokens on a line with the objects you want to create?

commented: he didn't seem to have read my post. +0

The text suggests that there are 3 records with 3 fields each. But in my program, this would be a Document with 1 page and 3 paragraphs. The first paragraph is "0001, cochabamba, lufthansa". It has 1 phrase, made of 3 words. The 1st word is "0001", which in this case is labeled as a number, though this is not important now. Every one of these objects holds an ArrayList of objects: the document, ArrayList<Pages>; the page, ArrayList<Paragraphs>; the paragraph, ArrayList<Phrase>; the phrase, ArrayList<Word>; the word, ArrayList<basicElement>. A basic element is a char.

Some text went missing from my post. Sorry (I do not know what happened).

consider this text:
"0001, cochabamba, lufthansa
0002, chicago, Quantas
0003, teheran, binter"
the text suggests... (the previous post continues from here)

You're mixing up the problem definition with the coding techniques.
Can you just do the problem definition and leave out the ArrayList stuff?

there are 3 records, 3 fields each

Does that say that the input file is limited to only 3 records?
And that each record has 3 fields?
What are the three fields?

Where do your 5 classes come in?
One class for each record and one class for each field is only 4 classes?

Here's a strategy that may work for you - it's kinda like a recursive strategy, except that the levels are pre-defined. Pseudo code fragments follow:

new Document(Scanner input) {
  create empty ArrayList of Pages
  while (not input.EOF) add new Page(input) to ArrayList
}

new Page(Scanner input) {
  create empty ArrayList of Paragraphs
  while (not input.EOF) {
    if (next token in input is a Page delimiter) return;
    add new Paragraph(input) to ArrayList
  }
}

new Paragraph(Scanner input) {
  create empty ArrayList of Phrases
  while (not input.EOF) {
    if (next token in input is a Paragraph delimiter) return;
    add new Phrase(input) to ArrayList
  }
}

...


new Word(Scanner input) {
  create empty ArrayList of Characters
  while (not input.EOF) {
    if (next token in input is a Word delimiter) return;
    add next Character to ArrayList
  }
}
commented: he read my problem and thought about it before answering, which I appreciate a lot. This allowed him to understand my problem. +1

Thank you, guys. Norm, I'm afraid it was JamesCherrill who got the idea.

James, as I can see, you prefer to pass the Scanner as a parameter to the new objects. This eliminates the need of declaring the Scanner as static, right?

A question arises at this point (though I also mentioned it before): deep in the quasi-recursive scheme, when dealing with characters, does the Scanner that is given as a parameter know what the last position already processed was? I imagine it does, as it is the same Scanner in memory, the same object, but I am not sure.
There's another one: I mentioned that the delimiter of a page is two consecutive manual carriage returns (or new line characters). Imagine that a character is being constructed because a word created a new basicElement object, and the Scanner input finds a CR (or \n, the new line character). As it is an EOparagraph, it is also an EOphrase and an EOword, so the word returns to the calling phrase, which also finds it is time to return to the paragraph, which knows its own construction has ended and returns to the page that created it.
The page will then create another new paragraph which, descending to the basic element creation, will find ANOTHER CR (a second one, contiguous to the former). Two CRs mean end of page. This terminates the basic element, the word, the phrase... and the page. But these last objects are empty, and I do not want them. Can I avoid the creation of these objects? (Remember that the chain goes through the constructors of the classes.)
I thought of having a field to contain a code: "reg" for regular characters and "PA", "WD" and so on for termination characters. But this wouldn't work if the termination sequence is double and needs two reads of the Scanner.

It could be possible to use split. If I split the whole text using the page delimiter (double carriage return), I would have an array of strings that I could call proto-pages, as they're just strings. I could pass each proto-page when I create a new page, something like
class documento {
    ArrayList<Pages> pagesArray = new ArrayList<Pages>();

    documento(String wholeText) { // this will be the constructor of the documento class
        String[] protopages = wholeText.split("\n\n"); // EOPage sequence
        for (String protopage : protopages) {
            pagesArray.add(new Pages(protopage)); // one proto-page per new page
        }
    }
}

And similarly to the rest of substructures...

How do you see this?

Hi ankilosado
Yes, the Scanner knows where its current position is, and it's considered better Java style to pass something as a parameter (i.e. share it in a controlled way) rather than making it public static (i.e. totally uncontrolled sharing). (When you pass an Object as a parameter, what you are really doing is passing a local copy of a reference to the Object, so all the local parameters are references to exactly the same Object.)
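To illustrate the point about references, here is a minimal sketch. The class and method names are made up for the example; only java.util.Scanner is real. Two methods receive the same Scanner, and each one continues from wherever the previous read stopped:

```java
import java.util.Scanner;

class SharedScannerDemo {
    // Reads one token; the Scanner itself remembers how far it has read.
    static String readOne(Scanner input) {
        return input.next();
    }

    // Calls readOne twice with the same reference, so it gets the next two tokens.
    static String readTwo(Scanner input) {
        return readOne(input) + " " + readOne(input);
    }

    public static void main(String[] args) {
        Scanner input = new Scanner("0001 cochabamba lufthansa");
        System.out.println(readOne(input)); // first token
        System.out.println(readTwo(input)); // the two tokens after it
    }
}
```

No static field is needed: every constructor in the chain can take the Scanner as a parameter and they will all advance the same object.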
As for the double CR problem - I automatically dislike the idea of special-casing this. Won't there be other similar cases? E.g. full stop + space at the end of a sentence is NOT two word delimiters. I never get to use Scanners myself, so I'm no expert, but can't you use hasNext for a quick look-ahead to see if the next "n" characters match a regex? If so, you can look first for the 2-character delimiters, and only if that fails, look for a single-character delimiter.
As for empty elements: is this a gap in the spec? How should it handle consecutive delimiters (e.g. newPage/newPage - is this an empty page, or should you just ignore it)?
If you want to ignore empty elements completely, it's easy enough to check them before adding them to the ArrayLists, i.e. replace

add new Phrase(input) to ArrayList

with

Phrase p =  new Phrase(input);
if (! p.isEmpty()) add p to ArrayList

where isEmpty() is a trivial method implemented in each class
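A minimal sketch of that check, simplified so it runs on its own: it builds from Strings with split rather than from a Scanner, and the delimiter sets (dot for phrases; whitespace, comma, semicolon, colon for words) are taken from the original post. The class bodies are illustrative assumptions, not the poster's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

class Phrase {
    private final List<String> words = new ArrayList<>();

    Phrase(String text) {
        // Word delimiters: whitespace, comma, semicolon, colon.
        for (String w : text.trim().split("[\\s,;:]+")) {
            if (!w.isEmpty()) words.add(w);
        }
    }

    boolean isEmpty() { return words.isEmpty(); }
    int size() { return words.size(); }
}

class Paragraph {
    private final List<Phrase> phrases = new ArrayList<>();

    Paragraph(String text) {
        for (String raw : text.split("\\.")) {
            Phrase p = new Phrase(raw);
            if (!p.isEmpty()) phrases.add(p); // skip phrases with no words
        }
    }

    boolean isEmpty() { return phrases.isEmpty(); }
    int size() { return phrases.size(); }
}
```

With this in place, consecutive delimiters such as ". ." simply produce nothing instead of an empty Phrase.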

I'll check what you say... soon reply

I had some more thoughts - feel free to use them or ignore them...
I would separate the messy business of parsing the chars and delimiters from the logic of building the doc structure. Instead of passing a Scanner, I would build a class to do the initial parsing and pass that around instead. It only needs 3 public methods: get next char, get next delimiter, "peek" at next delimiter (i.e. return the delimiter but leave it on the stack). I'd pass the file name into its constructor so all the choices about scanner vs byte array parsing etc. are hidden inside it. That will make all the code easier to understand, allow the parsing to be tested before it's used in the doc structure classes, and allow you to change the implementation if you get into difficulties.
I'd return the delimiters as int values (1 = word delim, 2 = phrase delim etc.) describing the exact type of delimiter, so you can test for == WORD_DELIM or >= WORD_DELIM - I think you'll need to do both.
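A rough sketch of what such a parsing class could look like, under some assumptions that are not in the thread: it is called DocReader, it takes the whole text as a String rather than a file name (to keep the example self-contained), line endings are already normalized to \n, and the delimiter codes are ordered so that a higher number means a higher-level delimiter.

```java
class DocReader {
    // Hypothetical delimiter codes; a bigger value ends more levels at once.
    static final int NONE = 0;
    static final int WORD_DELIM = 1;
    static final int PHRASE_DELIM = 2;
    static final int PARAGRAPH_DELIM = 3;
    static final int PAGE_DELIM = 4;
    static final int DOC_DELIM = 5;

    private final String text;
    private int pos = 0;

    DocReader(String text) { this.text = text; }

    // Returns the type of delimiter at the current position without consuming it.
    int peekDelimiter() {
        if (pos >= text.length()) return DOC_DELIM;         // end of input
        if (text.startsWith("\n\n", pos)) return PAGE_DELIM; // check the 2-char case first
        char c = text.charAt(pos);
        if (c == '\n') return PARAGRAPH_DELIM;
        if (c == '.') return PHRASE_DELIM;
        if (" ,;:".indexOf(c) >= 0) return WORD_DELIM;
        return NONE;
    }

    // Returns the delimiter type and moves past it (2 chars for a page break).
    int nextDelimiter() {
        int type = peekDelimiter();
        if (type == PAGE_DELIM) pos += 2;
        else if (type != NONE && type != DOC_DELIM) pos += 1;
        return type;
    }

    // Returns the next ordinary character; call only when peekDelimiter() == NONE.
    char nextChar() {
        return text.charAt(pos++);
    }
}
```

A Word constructor would then loop while peekDelimiter() == NONE, and a Phrase constructor would stop as soon as peekDelimiter() >= PHRASE_DELIM, which is where both the == and >= tests come in.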
Finally, I'd add a writeXMLto(PrintWriter out) method to all the doc classes so I can easily display and debug their contents - e.g.
word.writeXMLto gives
<Word>Fred</Word>
then you can use that for Phrase to give:
<Phrase>
<Word>Fred</Word>
<Word>Sally</Word>
</Phrase>
etc etc
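As a minimal sketch of how that could be coded (the class names XmlWord and XmlPhrase are placeholders for the real Word and Phrase classes, and the text is written without XML escaping, which is fine for debugging output only):

```java
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

class XmlWord {
    private final String text;

    XmlWord(String text) { this.text = text; }

    // Writes this word as a single <Word> element.
    void writeXMLto(PrintWriter out) {
        out.println("<Word>" + text + "</Word>");
    }
}

class XmlPhrase {
    private final List<XmlWord> words = new ArrayList<>();

    void add(XmlWord w) { words.add(w); }

    // Writes the opening tag, delegates to each word, then closes the tag.
    void writeXMLto(PrintWriter out) {
        out.println("<Phrase>");
        for (XmlWord w : words) {
            w.writeXMLto(out);
        }
        out.println("</Phrase>");
    }
}
```

Calling phrase.writeXMLto(new PrintWriter(System.out, true)) prints the nested structure to the console; Paragraph, Page and Document would follow the same pattern.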

Let me know how this goes - it's caught my interest!
J

Oh my! I have written the answer twice! I needed to look something up in a Spanish-English dictionary to be sure of what I was writing and I lost my whole answer!!

I'll try again,

First of all, I am glad that it caught your interest. Thank you for your valuable time!

I understood that what you suggest is that I convert the original file to a string in XML format, with XML-like tags. This would require a process that reads the whole file, processes the text looking for delimiters and then constructs a better organized text delimited as in XML. This would be passed to the sub-elements so that they can easily construct things.

It looks similar to the approach I tried to explain before, where I would "recursively" chop a large string that contains the original text and pass each piece to the constructor of the immediate substructure. Something like this (forgive syntax errors and do not let them obscure the strategy I draw. Nonetheless, feel free to comment on any of these syntax bugs):

import java.util.ArrayList;

class Doc {
  private ArrayList<Pages> PagesArray = new ArrayList<Pages>(); // where objects of class Pages are to be stored
  private int DNI = 0; // sort of an object ID

  public Doc(String inputTxt) {
      // lots of pre-conditions checking here
      String[] protoPagesArray = inputTxt.split("\n\n"); // page delimiter: two consecutive newlines
      for (String protopage : protoPagesArray) {
           PagesArray.add(new Pages(protopage));
      }
  } // end of constructor method

// ... more methods here if needed
} // end of class

Then the Pages class is something like

import java.util.ArrayList;

class Pages {
  private ArrayList<Parragraph> ParagraphsArray = new ArrayList<Parragraph>(); // where objects of class Parragraph are to be stored
  private int DNI = 0; // sort of an object ID

  public Pages(String inputTxt) {
      // lots of pre-conditions checking here
      String[] protoParragraphsArray = inputTxt.split("\n"); // paragraph delimiter: one newline
      for (String protoparra : protoParragraphsArray) {
           ParagraphsArray.add(new Parragraph(protoparra));
      }
  } // end of constructor method

// ... more methods here if needed
} // end of class Pages

and so on.

The only difference will be the word. As I later have to build methods to compute some statistics on the text (how many vowels, how many digits, words, numbers, pages, etc.),
I need basicElements, I guess. In this case, I have not been able to construct a regex that yields a simple array of characters, and it looks easier to just code a for-each loop (if that is possible) in the last step without splitting anything. Actually, what I have seen is that split always "eats" the delimiting string and doesn't return it. I tried to use the regex . to match any character and I ended up with a list of empty lines.
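A minimal sketch of two ways around this, assuming plain java.lang.String methods are acceptable: toCharArray() gives the characters of a word without any regex, and a zero-width lookbehind lets split cut after the delimiter while keeping it attached to the piece before it. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CharSplitDemo {
    // Returns the characters of a word, one list element per character.
    static List<Character> charsOf(String word) {
        List<Character> result = new ArrayList<>();
        for (char c : word.toCharArray()) {
            result.add(c);
        }
        return result;
    }

    // Splits after every dot; the lookbehind matches a position, not the dot,
    // so the dot is not "eaten" and stays at the end of each piece.
    static String[] phrasesKeepingDot(String text) {
        return text.split("(?<=\\.)");
    }
}
```

In split, using . as the delimiter regex treats every character as a delimiter, so nothing is left between delimiters to return, which fits the empty results described above.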

I think I'd better attach the code I have written to test this.

/**
 * Class that reads a text file.
 * 
 * @author (your name) 
 * @version (a version number or a date)
 */
  import java.io.*;
  import java.util.*;
  import java.util.ArrayList;

public class leeFichero  {
  private Scanner x;
  private ArrayList<String> txtCuerpo;
  
  public void abreTxt(){
      String nombFichero = "c:\\texto_practica_2010.txt";      
      try {
          x = new Scanner (new File (nombFichero));
      } // end of try
      catch(Exception aExcep) {
          System.out.println("No se encontró " + nombFichero);
      } // end of catch
  } // end of method abreTxt
  
  public void leeTxt(){
      x.useDelimiter (".?"); 
      int contador=0;
      while (x.hasNext()) {
          System.out.println(contador++ + "carácter: "+x.next());
//           boolean flag = false;
//           String Palabra = x.next();
//           for (int i=0;  i<Palabra.length(); i++){
//               if (Character.isDigit(Palabra.charAt(i)))  {
//                   flag=true;
//                 }
//                else {flag=false;}
//           }
//           if (flag) {System.out.print("número  :");}
//           else      {System.out.print("palabra :");}
//           System.out.println(Palabra+".-");
      }
  } // end of method leeTxt
  
  public String getFrase() {
      x.useDelimiter("\\s*.\\S*");
      String cadena = "";
      try {
          if (x.hasNext()) {
               cadena = x.next();
            }
            else {
                 cadena = null;
            }
        } // end of try
        catch (Exception aExcep) {
            System.out.println("Excepción");
        } // end of catch
        return cadena;
    } // end of method getFrase
 
  public void cierraTxt(){
      x.close();
  } // end of method cierraTxt
  
}

AND THE TESTING CLASS

/**
 * Write a description of class pruba_leeFichero here.
 * 
 * @author (your name) 
 * @version (a version number or a date)
 */
public class pruba_leeFichero
{
    // instance variables - replace the example below with your own
    private int x;

    /**
     * Constructor for objects of class pruba_leeFichero
     */
    /**
     * An example of a method - replace this comment with your own
     * 
     * @param  y   a sample parameter for a method
     * @return     the sum of x and y 
     */
    public static void main(String [] args){
        // put your code here
        leeFichero f = new leeFichero();
        f.abreTxt();
        f.leeTxt();
//         String cadena = "";
//         while (cadena != "-1" & cadena != null){
//            cadena = f.getFrase();
//            System.out.println("Frase:  " + cadena);
//         }   
        f.cierraTxt();
        
    }
}

I wasn't thinking of directly converting the file to XML... there were two separate ideas:
1. Have a class that handles the basic parsing of the file and has methods to get the next char or delimiter. How this works internally is private to the class.
2. Add methods to Word/Phrase etc to output them in XML format so you can see how well the code is working.
I have to go now - I'll be back later.

Oh, I see... I try to catch every meaning to benefit more from your help, but it's obvious that English is not my primary language ;-) Still, I feel comfortable with your help.

The possible reason that made me think you were suggesting the XML thing is that one of the requirements of the application is that it should be able to store the structure in a file and recover it at a later time, if the user requests it through a menu. I felt attracted to use XML, as it would also let me immerse myself in that format, so widely used for databases. But, as I am short of time (this is why I appreciate a lot that you lend me your time), I thought that serializing was the best option.

Apart from that, I am building the mixed (and probably not so nice) approach of passing "proto" strings obtained by splitting. Once I have them finished, I will post them or some of them.
I am having a problem when trying to compile. I receive an error I cannot address:
this is the code for the Page object:

/**
 * Write a description of class pagina here.
 * 
 * @author (your name) 
 * @version (a version number or a date)
 */
import java.util.ArrayList;

public class Pagina extends elementoCompuesto
{
    // instance variables - replace the example below with your own
    private int x;
    private ArrayList<Parrafo> ParrafosArrayList;

    /**
     * Constructor for objects of class pagina
     */
    public Pagina(String inputTxt) {
        String [] protoParrafosArray = inputTxt.split("\n");
        for (String protoparrafo :protoParrafosArray){
            ParrafosArrayList.add( Parrafo new Parrafo(protoparrafo));
        }
        totalParr++; // update the running statistic
    }

}

I receive an error from the compiler:
"')' expected" in this line:

ParrafosArrayList.add( Parrafo new Parrafo(protoparrafo));

It appears in every one of the classes I've coded.

What is the extra word in bold (the first "Parrafo")?
ParrafosArrayList.add( Parrafo new Parrafo(protoparrafo));

Norm is right - the word in bold should not be there.
And don't worry - your English is good enough. A lot better than my ... Spanish?

Hi, Norm, glad to see you again :-)

Parrafo is Spanish for Paragraph. In Spanish, the structure elements would be:

Documento means Document,
Pagina means Page
Parrafo means Paragraph
Frase means Phrase
Palabra means Word
Numero means Number
Digito means Digit

What I intend is to create a Parrafo (Paragraph) object and add it to the ArrayList of Parrafos, that I declare previously. Let me try taking it off.
...

UHMMM, now I see. You both were right!

It is amazing (to me) how the BlueJ compiler travels through the chain of objects, compiling from the outside in!

So,

here are the classes as I have coded them.
I receive no compiler errors.

I have also attached a file reader class (leeFichero), a test class for it (pruba_leeFichero) and what I intend to be the main program with the main method, AnalizaDocumento.java.
LeeFichero has the Scanner that I started using at the beginning of this work (even though I completely changed my strategy). It is still the provider of the String that we pass to the Documento class.

AnalizaDocumento has the main menu, plus some secondary menus that are of no importance now. In the main menu there's an option to start the analysis, "Leer fichero". This creates a Documento object that starts all the constructors (supposedly, since I have not finished the program).

These 3 files are just to explain how I would use the classes we have been discussing here (thank you very, very much), and I have to say that I still have not made a full compilation to see how everything works. I hope, if my wife and my baby let me, to finish it this weekend.

Feel free to correct, criticize, laugh at... this material. Anything would help. What I can see is that there's light at the end of the tunnel.

As for the statistics I mentioned at some point, my intention is to compute them at the moment of building the structure and store them permanently, instead of computing them each time the user asks for them. My approach is to have in every object an array of integers. Each position of the array would hold the total of one of the elements being counted ([0] would be the number of characters under this object). In a certain instance of Frase (phrase), [0] contains the sum of all the characters of all the words that make up the phrase. Each of these words contains, in its own statistics array, its total number of characters, and so on.
I have not yet implemented this solution, but this would take only some more code. Adding a new Phrase to a paragraph is followed by a call to a method of the Phrase class that returns the statistics array. Then this array would be added to the paragraph's own to update the statistics.

I forgot to include the "dictionary" of the statistics array:

public class PAGS
{
    public final static int CARACTERES = 0;
    public final static int LETRAS = 1;
    public final static int VOCALES = 2;
    public final static int CONSONANTES = 3;
    public final static int SIMBOLOS = 4;
    public final static int DIGITOS = 5;
    public final static int UNOS = 6;
    public final static int DOSES = 7;
    public final static int TRESES = 8;
    public final static int CUATROS = 9;
    public final static int CINCOS = 10;
    public final static int SEIS = 11;
    public final static int SIETES = 12;
    public final static int OCHOS = 13;
    public final static int NUEVES = 14;
    public final static int CEROS = 15;
    public final static int AES = 16;
    public final static int BES = 17;
    public final static int CES = 18;
    public final static int DES = 19;
    public final static int ES = 20;
    public final static int EFES = 21;
    public final static int GES = 22;
    public final static int HACHES = 23;
    public final static int IES = 24;
    public final static int JOTAS = 25;
    public final static int KAS = 26;
    public final static int ELES = 27;
    public final static int EMES = 28;
    public final static int ENES = 29;
    public final static int EÑES = 30;
    public final static int OES = 31;
    public final static int PES = 32;
    public final static int QUES = 33;
    public final static int ERRES = 34;
    public final static int ESES = 35;
    public final static int TES = 36;
    public final static int UES = 37;
    public final static int UVES = 38;
    public final static int DOBLEUVES = 39;
    public final static int EQUIS = 40;
    public final static int YES = 41;
    public final static int ZETAS = 42;
    public final static int FPAL = 43;
    public final static int FFRAS = 44;
    public final static int FPAR = 45;
    public final static int FPAG = 46;
    public final static int FDL = 47;
    public final static int PALABRAS = 48;
    public final static int NUMEROS = 49;
    public final static int FRASES = 50;
    public final static int PARRAFOS = 51;
    public final static int PAGINAS = 52;
//     public final static int BASICOS = 53; not in use
//     public final static int BASICOS = 54; not in use
//     public final static int BASICOS = 55; not in use
//     public final static int BASICOS = 56; not in use
//     public final static int BASICOS = 57; not in use
//     public final static int BASICOS = 58; not in use

}

The use will be as follows. Let's say that the array is called statArray and it is dimensioned, or tailored, to the statistics of the class (I mean that a word does not need the number of paragraphs contained in it).
In the case of a paragraph that has just added a phrase to itself,
the use is this.statArray[PALABRAS]=this.statArray[PALABRAS]+Phrase.statArray[PALABRAS].

The reason I do not use plain integers as the indexes is that I could easily make mistakes because there are more than 50 statistics. This approach is more mnemonic.
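A minimal sketch of that update, assuming (for simplicity) that parent and child arrays have the same dimension and using only two hypothetical indexes instead of the full list. StatIndex and StatsHolder are illustrative names, not the classes in the attached project.

```java
class StatIndex {
    static final int CARACTERES = 0; // characters
    static final int PALABRAS = 1;   // words
    static final int DIMENSION = 2;  // number of statistics kept
}

class StatsHolder {
    final int[] statArray = new int[StatIndex.DIMENSION];

    // Adds every statistic of a child element to this element's own totals.
    void absorb(StatsHolder child) {
        for (int i = 0; i < statArray.length; i++) {
            statArray[i] += child.statArray[i];
        }
    }
}
```

A paragraph would call absorb(phrase) right after adding the phrase to its ArrayList, so the totals are built up once, while the structure is being constructed.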

Is there any special reason to use an array for statistics? This kind of tree-like structure fits very badly in arrays. Why not just add them to the class itself? This is a much more "object oriented" way of thinking. So in each class you have public getter methods like (for the Page class)

public int getParagraphCount() {
   return paragraphList.size();
}
public int getSentenceCount() {
  int n = 0;
  for (Paragraph p : paragraphList) {
    n += p.getSentenceCount();
  }
  return n;
}

Now you can access anything by a very natural for-each loop using the objects themselves, rather than trying to compute and understand array indexes.

It's usually better practice to get statistics like this "on the fly" rather than calculating them and storing them, if the overhead isn't large. This way they are guaranteed to always be up-to-date.

As for saving the parsed data, why not just declare the classes as Serializable and write the top Document instance to an object file?
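A minimal sketch of the idea, using a stand-in class (SavedDoc, with pages held as plain Strings) and an in-memory byte array instead of a real file so that it runs anywhere. For a file, a FileOutputStream and FileInputStream would take the place of the byte array streams.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

class SavedDoc implements Serializable {
    private static final long serialVersionUID = 1L;
    private final List<String> pages = new ArrayList<>();

    void addPage(String page) { pages.add(page); }
    int pageCount() { return pages.size(); }
    String page(int i) { return pages.get(i); }

    // Serializes the whole object graph reachable from doc.
    static byte[] save(SavedDoc doc) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(doc);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Rebuilds the object graph from previously saved bytes.
    static SavedDoc load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SavedDoc) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Every class reachable from the top object (Page, Paragraph and so on) has to implement Serializable as well, otherwise writeObject throws a NotSerializableException.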

I have a superclass, elementoCompuesto (Spanish for ComplexElement), that has everything the upper classes have in common: the statistics integer array, the ID integer, and methods to update, show, print and return the statistics. Every class above basicElemento in the cascade derives (or should derive) from this general complex class. At this point I have 5 classes inheriting from elementoCompuesto. But I can see that there is a kind of pattern, in that all of them have an ArrayList, though of different classes. Do you think it makes sense, and is of any use, to declare a generic ArrayList in the parent (or super) class elementoCompuesto? Does it have any advantage?
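To make the question concrete, this is a sketch of what I imagine, assuming Java generics can be used this way. CompositeElement stands in for elementoCompuesto, and GWord and GPhrase are simplified placeholders for my real classes:

```java
import java.util.ArrayList;
import java.util.List;

// Generic parent: T is the type of the child elements it holds.
abstract class CompositeElement<T> {
    protected final List<T> children = new ArrayList<>();

    void add(T child) { children.add(child); }
    int childCount() { return children.size(); }
    boolean isEmpty() { return children.isEmpty(); }
}

// A word holds characters.
class GWord extends CompositeElement<Character> {
    GWord(String s) {
        for (char c : s.toCharArray()) {
            add(c);
        }
    }
}

// A phrase holds words; it inherits add, childCount and isEmpty unchanged.
class GPhrase extends CompositeElement<GWord> { }
```

The possible advantage would be that the list handling is written once in the parent, while each subclass still only accepts its own kind of child.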

Well, it looks like I am left alone in the dark!

In case someone is still following this thread, I will say that I have advanced in the conception of the project as well as in the coding.
I have built the statistics objects and tested them.
I have also designed all the objects that I presumably need and I am now testing them.
I am having problems with some of the objects.
Can anybody help? If so, I will explain what happens.

Oh, James, I'm sorry for not having seen your post. I noticed it only after posting my last one.
To reply to yours, I have to say that my intention is to compute the statistics just once. The doc is not expected to change, so neither would its statistics. This way, I thought it would be better to store them instead of walking along the whole sub-tree of sub-elements of a given instance of, let's say, a Page. If it was a document of 4200 words, with 15 words per phrase, 10 phrases per paragraph and 7 paragraphs per page, we would have to ask, for this page, 7*10*15 objects. This is 1050 calculations each time you ask for the statistics of the page... I felt like this was too much.

So, I certainly have to go through (No. of pages * No. of paragraphs per page * No. of phrases per paragraph * No. of words per phrase * No. of characters per word), but just once. The statistics will include fewer items, or magnitudes, than I previously expected: there would be 12.

    public final static int CARACTERES = 0;     // characters; they go in position 0 of the array
    public final static int LETRAS     = 1;     // letters
    public final static int VOCALES    = 2;     // vowels
    public final static int CONSONANTES= 3;     // consonants
    public final static int SIMBOLOS   = 4;       // symbols
    public final static int DIGITOS    = 5;        // digits
    public final static int PALABRAS   = 6;
    public final static int NUMEROS    = 7;
    public final static int FRASES     = 8;
    public final static int PARRAFOS   = 9;
    public final static int PAGINAS    = 10;
    public final static int DESCONOCIDOS = 11;
    
    public final static int STATS_DIMENSION = 12;

}

so, what do you think?

I attach a graphic that depicts the object model and relationships.

Dashed arrows mean that the class uses (is built up from) elements of the pointed-to class. A continuous arrow means that the origin inherits from the pointed-to class.
statsArray and statsItem are inside the package called practica_2010.

I also attach the blueJ project's files.

All of it in a JAR file plus the graphic

Almost done! I have developed the whole first part of the project, the one that builds up the nested structure. It computes the statistics correctly (well, there are some issues that come from the flawed requirements of the homework) and shows them perfectly.

BlueJ is helpful enough to be able to test the code step by step, from basic objects to the top Document objects.

Norm and James, I still need your help with some questions, but tell me if you would like to see the results and how you would prefer that I send them.

Sorry, major probs with internet connection - probably offline till next week. J

Hi, me again. People say bad things about the French nationalised industries, but France Telecom just replaced 1/2 mile of cable through trees to my house within 24 hours of localising a fault to that area. For free. Vive la France!
Anyway -

Norm and James, I still need your help in some questions

I'm back online, so start asking!
J
