Sorry I couldn't think of a better title, but thanks for reading!

My ultimate goal is to read a .java file, parse it, pull out every identifier, and store them all in a list. There are two preconditions: the file contains no comments, and all identifiers are composed of letters only.

Right now I can read the file, split it on spaces, and store everything in a list. Anything in the list that is a Java reserved word is removed. I also remove any loose symbols that are not attached to anything (brackets and arithmetic symbols).

Now I am left with a bunch of weird strings, but at least they have no spaces in them. I know I am going to have to re-parse everything with a . delimiter in order to pull out identifiers like System.out.print, but what about strings like this example:


After re-parsing by . I will be left with more crazy strings like:


How am I going to be able to pull out all the identifiers while leaving out all the trash? Just keep re-parsing by every symbol that could exist in Java code? That seems rather lame and time-consuming, and I am not even sure it would work completely. Can you suggest a better way of doing this?

Parsing the text with space and dot delimiters isn't enough; Java is more complex than that. You may want to look into recursive descent parsing. There might be easier ways to do it, but I would define the Java grammar and write a recursive descent parser for it. Check out the link and google it; I believe it will be useful.
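
If a full grammar is more than you need, a character-by-character scanner (the tokenizing step a recursive descent parser would sit on top of) may already be enough under your two preconditions. The sketch below is only that, a sketch: the class name IdentifierScanner, the method identifiers, and the shortened RESERVED list are all made up for illustration, and it assumes no comments and letter-only identifiers, as you stated. Instead of splitting on delimiters, it walks the source once, skips string and character literals, skips numeric literals, and collects every run of letters that is not a reserved word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class IdentifierScanner {
    // Partial list for illustration only; the full set of reserved words
    // is in the Java Language Specification.
    private static final Set<String> RESERVED = Set.of(
        "abstract", "boolean", "break", "class", "else", "final", "for",
        "if", "import", "int", "long", "new", "null", "package", "private",
        "public", "return", "static", "this", "void", "while", "true", "false");

    // Assumes: no comments in the source, identifiers made of letters only.
    public static List<String> identifiers(String source) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '"' || c == '\'') {
                // Skip a string or char literal so its contents are not
                // mistaken for identifiers.
                char quote = c;
                i++;
                while (i < source.length() && source.charAt(i) != quote) {
                    if (source.charAt(i) == '\\') {
                        i++; // step over the backslash of an escape
                    }
                    i++;
                }
                i++; // step over the closing quote
            } else if (Character.isDigit(c)) {
                // Skip numeric literals such as 10L or 0x1F as a whole,
                // so their letters are not collected.
                while (i < source.length()
                        && Character.isLetterOrDigit(source.charAt(i))) {
                    i++;
                }
            } else if (Character.isLetter(c)) {
                // A run of letters is either a reserved word or an identifier.
                int start = i;
                while (i < source.length()
                        && Character.isLetter(source.charAt(i))) {
                    i++;
                }
                String word = source.substring(start, i);
                if (!RESERVED.contains(word)) {
                    result.add(word);
                }
            } else {
                i++; // whitespace, dots, brackets, operators: ignore
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String code = "int total = count + list.size(); "
                    + "System.out.print(\"total is \" + total);";
        // prints [total, count, list, size, System, out, print, total]
        System.out.println(identifiers(code));
    }
}
```

Note that this keeps duplicates and treats each part of System.out.print as a separate identifier; whether that is what you want depends on your goal. Once you need to know what role an identifier plays (class name, method name, variable), you are back to needing a real grammar and a parser.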