Hey guys, I'm creating a code analyzer that enforces the CamelCase convention that Java uses. For example, thisIsAWellConstructedJavaVariable follows the convention and thisinnotawellconstructedjavavariable does not.

So obviously I need to allow my program to identify English words. What I plan to do is search over a dictionary (a database, if one exists) until there are no results, and then assume that a new word has begun. So suppose I had areaoftriangle as a variable, then I'd search a... ar... are... area... areao

areao would not be found and so I assume I'm starting a new word. Thus an alphabetic list of words in some highly accessible form would be perfect! I've searched and found dictionaries such as WordWeb, WordNet, ASpell, etc. But does anyone have a recommendation for me?
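The search described above can be sketched roughly as follows. This is only an illustration: the class name GreedyWordSplitter and the three-word dictionary are made up for the example, and a TreeSet stands in for whatever word list ends up being used. It extends the prefix one letter at a time, remembers the last prefix that was a complete word, and starts a new word once no dictionary entry begins with the current prefix.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper: splits a lowercase identifier by greedy prefix search.
final class GreedyWordSplitter {
    private final TreeSet<String> words = new TreeSet<>();

    GreedyWordSplitter(Collection<String> dictionary) {
        for (String w : dictionary) {
            words.add(w.toLowerCase(Locale.ROOT));
        }
    }

    // True if at least one dictionary word starts with the given prefix.
    // In a sorted set, the smallest entry >= prefix is the only candidate to check.
    private boolean hasPrefix(String prefix) {
        String candidate = words.ceiling(prefix);
        return candidate != null && candidate.startsWith(prefix);
    }

    List<String> split(String identifier) {
        String s = identifier.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < s.length()) {
            int lastWordEnd = -1;
            int end = start + 1;
            // a... ar... are... area... areao: stop when nothing matches any more
            while (end <= s.length() && hasPrefix(s.substring(start, end))) {
                if (words.contains(s.substring(start, end))) {
                    lastWordEnd = end;
                }
                end++;
            }
            if (lastWordEnd < 0) {
                // No known word starts here: keep the remainder as one unknown fragment.
                parts.add(s.substring(start));
                break;
            }
            parts.add(s.substring(start, lastWordEnd));
            start = lastWordEnd;
        }
        return parts;
    }
}
```

With the dictionary area, of, triangle, split("areaoftriangle") gives area, of, triangle. With a larger word list the greedy choice can land on the wrong boundary (for instance if the list contains "oft"), so the result is probably best treated as a guess.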

Thanks in advance!


>So obviously I need to allow my program to identify English words.
Obviously? Indeed, how would you enforce camel case in this identifier:

KfxReturnValue kfxRetVal;

Identifiers are pretty close to free form in Java, so you're in for a rough ride with this program if you want it to be remotely useful. First, you need to pick out identifiers (you can do this most easily by identifying declarations rather than parsing every token in the source for identifiers). Then you need to grab the identifier and determine breakpoints (the place where a programmer might put an underscore) and match it against a mask using camel case. If it matches, move on. If it doesn't, offer the mask as a suggested change.
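The "match it against a mask" step might look something like the sketch below, assuming the breakpoints have already been found and the identifier arrives as a list of fragments. CamelCaseMask, toCamelCase and suggest are names invented for this example, not part of any existing tool.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: builds the camel case "mask" and compares it to the identifier.
final class CamelCaseMask {
    // Joins fragments as camelCase: first fragment lowercase, later ones capitalised.
    static String toCamelCase(List<String> fragments) {
        StringBuilder sb = new StringBuilder();
        for (String fragment : fragments) {
            if (fragment.isEmpty()) {
                continue;
            }
            String lower = fragment.toLowerCase(Locale.ROOT);
            if (sb.length() == 0) {
                sb.append(lower);
            } else {
                sb.append(Character.toUpperCase(lower.charAt(0))).append(lower.substring(1));
            }
        }
        return sb.toString();
    }

    // Empty if the identifier already matches the mask, otherwise the suggested spelling.
    static Optional<String> suggest(String identifier, List<String> fragments) {
        String mask = toCamelCase(fragments);
        return mask.equals(identifier) ? Optional.empty() : Optional.of(mask);
    }
}
```

For areaoftriangle with fragments area, of, triangle this suggests areaOfTriangle, and for areaOfTriangle it suggests nothing. It only ever returns a suggestion and never rewrites anything.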

That's all pretty easy except for determining the breakpoints. I can guarantee that matching English words will either fail miserably or be of limited use. You might be better off writing this part as a plug-in where client code can supply logic that matches their naming conventions. If you want to do this for the general case, you have to account for English words as well as common and uncommon abbreviations across a wide range of project domains.
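One possible shape for the plug-in idea is sketched here. Everything in it is an assumption made for illustration: the BreakpointStrategy interface, its orElse combinator and the TableStrategy class do not come from any existing library. Client code supplies its own strategy (here a simple lookup table of project-specific names), and the analyzer falls back to a general strategy only when the client's one has no answer.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical plug-in point: returns the fragments of an identifier, or empty if unknown.
@FunctionalInterface
interface BreakpointStrategy {
    Optional<List<String>> split(String identifier);

    // Tries this strategy first and the fallback only when this one gives no answer.
    default BreakpointStrategy orElse(BreakpointStrategy fallback) {
        return identifier -> {
            Optional<List<String>> result = split(identifier);
            return result.isPresent() ? result : fallback.split(identifier);
        };
    }
}

// Example client plug-in: a table of names the project already knows how to break up.
final class TableStrategy implements BreakpointStrategy {
    private final Map<String, List<String>> table;

    TableStrategy(Map<String, List<String>> table) {
        this.table = table;
    }

    @Override
    public Optional<List<String>> split(String identifier) {
        // Keys are expected in lowercase so that lookups ignore case.
        return Optional.ofNullable(table.get(identifier.toLowerCase(Locale.ROOT)));
    }
}
```

A project could register kfxretval as kfx, ret, val in its table and chain that in front of a dictionary-based strategy, so that its own abbreviations win over English word matching.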

And finally, you said that this program enforces camel case. If your design suggests changes then that's fine, but if it actually makes changes or requires them to be made, that's not fine. What if the program is wrong? Nobody will use it, plain and simple. It's extremely difficult to write this program to be always right, so you need to make a compromise and suggest rather than enforce.

Instead, I'd recommend going through all the appropriate identifiers and parsing their camelcasedness.

In other words, go through all the identifiers and make a two-way mapping of camelcase fragments and where they appear. So if you found identifiers "abcDefGhi", "abcKoopa", "caterpillar", "snowCat" and "abcbomb", you'd get in your dictionary "abc", "def", "ghi", "koopa", "abcbomb", "snow", "cat", and "caterpillar", with pointers (in the abstract sense) to the places in source code where those fragments appeared.
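A rough sketch of such a two-way mapping is below. FragmentIndex and its method names are invented for the example, and the identifier itself stands in for the "pointer" to a source location, which is a simplification. Fragments are cut before every uppercase letter, so an all-caps run such as URL would be split into single letters here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical index: fragment -> identifiers using it, and identifier -> its fragments.
final class FragmentIndex {
    private final Map<String, Set<String>> identifiersByFragment = new TreeMap<>();
    private final Map<String, List<String>> fragmentsByIdentifier = new LinkedHashMap<>();

    // Splits before each uppercase letter and lowercases the pieces.
    static List<String> splitOnCase(String identifier) {
        List<String> parts = new ArrayList<>();
        for (String piece : identifier.split("(?=\\p{Lu})")) {
            if (!piece.isEmpty()) {
                parts.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return parts;
    }

    void add(String identifier) {
        List<String> parts = splitOnCase(identifier);
        fragmentsByIdentifier.put(identifier, parts);
        for (String part : parts) {
            identifiersByFragment.computeIfAbsent(part, k -> new TreeSet<>()).add(identifier);
        }
    }

    Set<String> fragments() {
        return Collections.unmodifiableSet(identifiersByFragment.keySet());
    }

    Set<String> identifiersContaining(String fragment) {
        return identifiersByFragment.getOrDefault(fragment, Collections.emptySet());
    }

    List<String> fragmentsOf(String identifier) {
        return fragmentsByIdentifier.getOrDefault(identifier, Collections.emptyList());
    }
}
```

Adding the five identifiers from the example yields the fragments abc, abcbomb, cat, caterpillar, def, ghi, koopa and snow, with abc pointing back at abcDefGhi and abcKoopa.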

Then use some magic algorithm that searches for fragments that are concatenations of others or prefixes of others, and if they don't form some ordinary English word, then they're bad. For example, "abc" is a prefix of abcbomb. Maybe abcBomb was meant? But while cat is a prefix of caterpillar, caterpillar's in the dictionary.
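The prefix half of that "magic algorithm" could be approximated as in the sketch below. It is not a full solution: it checks prefixes only (not concatenations), and the PrefixSuspects class, the lowercase fragment set and the English word set passed in are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical check: flags fragments that start with another fragment
// and are not ordinary English words, and proposes a camel case spelling.
final class PrefixSuspects {
    // Returns suspect fragment -> suggested spelling, e.g. abcbomb -> abcBomb.
    static Map<String, String> find(Set<String> fragments, Set<String> englishWords) {
        Map<String, String> suggestions = new TreeMap<>();
        for (String fragment : fragments) {
            if (englishWords.contains(fragment)) {
                continue; // caterpillar is in the dictionary, so cat as a prefix is fine
            }
            String longest = null;
            for (String other : fragments) {
                boolean properPrefix = !other.isEmpty()
                        && other.length() < fragment.length()
                        && fragment.startsWith(other);
                if (properPrefix && (longest == null || other.length() > longest.length())) {
                    longest = other;
                }
            }
            if (longest != null) {
                // Capitalise the first letter after the known prefix.
                String rest = fragment.substring(longest.length());
                suggestions.put(fragment,
                        longest + Character.toUpperCase(rest.charAt(0)) + rest.substring(1));
            }
        }
        return suggestions;
    }
}
```

Given the fragments from the example and an English word set containing cat, snow and caterpillar, the only entry returned is abcbomb with the suggestion abcBomb.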

Of course, that's dumb. If you want to enforce the camelcase rule, just tell people to do it and threaten to fire them, or if your company's in North Korea, threaten to imprison them, if they don't comply. If you disallow underscores from the names, that'll be enough to compel them to use camelcase. Right? Then again, if people can get through three years of CS thinking a std::vector's implemented with a linked list, maybe it isn't. Sigh.

Hehe, well, perhaps "obvious" was a bit presumptuous of me? But really it seemed like the only course of action. I’m not entirely sure how one would identify break points. The analyzer is aimed specifically at novice users and works by providing suggestions (so by enforce, I actually meant tries to enforce or something like that :D).

The only way I could think of finding the break points was by using a dictionary. Conceptually it is the only solution my mind can perceive. I thought that the dictionary would be allowed to grow so that abbreviations would eventually be understood in future analysis. Of course that means explaining the concept of the CamelCase convention, which is also a part of my analyzer; that is, it’s a learning tool. Yeah, so all of this is part of my Honours project :)

‘My best code is written with the delete key,’ I like that!

Interesting approach, Rashakil (how do you pronounce your name? Cool name though!). It was also suggested to me that I gather all the identifiers for comparison, because perhaps an identifier’s case was mistyped, so that ‘variableOne’ and ‘variableone’ would allow me to suggest that ‘variableOne’ was meant. That is, compare identifiers regardless of case and then suggest the identifier that has a capital letter in it. I will also consider building up a dictionary as you suggest, but because this is a small part of what I’m trying to achieve and time is limited, I might not implement it. Also, because it is aimed at novice users, it is likely that they will tend not to use the CamelCase convention, and so my built-up dictionary would probably just consist of large compound words. But it really is an interesting take! Thank you!
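The case-insensitive comparison could be sketched like this. CaseVariantFinder is a made-up name, and picking the spelling with the most capital letters when there are several candidates is my own assumption rather than part of the suggestion I was given.

```java
import java.util.Collection;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: finds identifiers that differ only by case.
final class CaseVariantFinder {
    private static long upperCount(String s) {
        return s.chars().filter(Character::isUpperCase).count();
    }

    // Returns variant -> preferred spelling, e.g. variableone -> variableOne.
    static Map<String, String> suggestions(Collection<String> identifiers) {
        // Group spellings by their lowercase form.
        Map<String, Set<String>> byLower = new TreeMap<>();
        for (String id : identifiers) {
            byLower.computeIfAbsent(id.toLowerCase(Locale.ROOT), k -> new TreeSet<>()).add(id);
        }
        Map<String, String> result = new TreeMap<>();
        for (Set<String> group : byLower.values()) {
            if (group.size() < 2) {
                continue; // only one spelling seen, nothing to compare against
            }
            // Assumption: prefer the spelling with the most capital letters.
            String preferred = null;
            for (String id : group) {
                if (preferred == null || upperCount(id) > upperCount(preferred)) {
                    preferred = id;
                }
            }
            for (String id : group) {
                if (!id.equals(preferred)) {
                    result.put(id, preferred);
                }
            }
        }
        return result;
    }
}
```

For the identifiers variableOne, variableone and count, the only suggestion is variableone to variableOne.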

I would still like to use my dictionary search so if anyone has a suggestion of a good alphabetic dictionary database kind of thingy, then please holla! Oh and easier methods would be welcomed too!

Thanks for responding!
Power to the people.
