The program that you will be writing will be called

  • Open the file for reading
  • Create an empty dictionary
  • Read line by line through the file
  • For each line, strip the end-of-line character, then clean the line with the function parseString(), which removes punctuation and replaces hyphens (-) with blanks:
    o Create a new, empty string.
    o Go through the input string character by character.
    o Copy letters (tested with isalpha()) and whitespace (tested with isspace()) to the new string, replace each hyphen with a space, and drop every other character.
    o Return the new string.
  • Split the cleaned line into words and add 1 to each word's count in the dictionary.
  • After the dictionary is created close the input file.
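
The steps above could be sketched roughly as follows. This is only a minimal sketch: parseString() is the name from the assignment, while buildWordFreq() and its argument are hypothetical names chosen for illustration. It takes any iterable of lines, so an open file object can be passed directly.

```python
def parseString(st):
    """Keep letters and whitespace, turn hyphens into spaces, drop the rest."""
    result = ""
    for ch in st:
        if ch == "-":
            result += " "          # hyphen becomes a blank
        elif ch.isalpha() or ch.isspace():
            result += ch           # letters and whitespace are kept
        # every other character (punctuation, digits) is dropped
    return result


def buildWordFreq(lines):
    """Hypothetical helper: build a word -> frequency dictionary from lines."""
    freq = {}
    for line in lines:
        cleaned = parseString(line.rstrip("\n"))
        for word in cleaned.split():
            freq[word] = freq.get(word, 0) + 1
    return freq


# Small in-memory demonstration; case is preserved at this stage on purpose,
# because capitalized words are dealt with in a later step.
demo = buildWordFreq(["The cat-like cat.\n", "the Cat\n"])
for word in sorted(demo):
    print(word, demo[word])
```

With a real file, something like `with open("dickens.txt") as infile: freq = buildWordFreq(infile)` would also close the input file once the dictionary has been built.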
  • Remove all words that start with a capital letter.
    o Go through the words in the dictionary. For each word check if it starts with a capital letter.
    o If it does, first check if the lower case version of that word exists in the dictionary. If it exists, add the capitalized word's frequency to the lower case word's frequency.
    o If the lower case version does not exist in the dictionary, check if it exists in a comprehensive word list of the kind used by crossword and Scrabble players. If it does, create an entry in the dictionary for the lower case version of the word, with the capitalized word's frequency.
    o Add the word starting with a capital letter to a list.
    o After you have checked for all capitalized words, remove all those words in the above list and their frequencies from the word frequency dictionary.
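
One possible reading of these steps is sketched below. The function name removeCapitalized() and the parameter word_list are assumptions made for illustration; the word list is taken to be a set of lower case words already loaded from whatever word list file is supplied.

```python
def removeCapitalized(freq, word_list):
    """Fold capitalized words into their lower case entries, then delete them.

    freq      -- word -> frequency dictionary (modified in place)
    word_list -- set of lower case words (assumed to be loaded elsewhere)
    Returns the list of capitalized words that were removed.
    """
    capitalized = []
    # Iterate over a copy of the keys because entries may be added below.
    for word in list(freq):
        if word[0].isupper():
            lower = word.lower()
            if lower in freq:
                # lower case version already counted: merge the frequencies
                freq[lower] += freq[word]
            elif lower in word_list:
                # an ordinary word that only appeared capitalized
                freq[lower] = freq[word]
            capitalized.append(word)
    # Remove the capitalized words only after all of them have been checked.
    for word in capitalized:
        del freq[word]
    return capitalized


freq = {"The": 3, "the": 5, "Paris": 2, "Suddenly": 1, "cat": 4}
removed = removeCapitalized(freq, {"the", "cat", "suddenly"})
print(sorted(freq.items()))
print(sorted(removed))
```

The returned list can be written to a file if you want to inspect which capitalized words were dropped.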

You should now have a dictionary of words in lower case and their frequencies. You will have removed all proper names of people and places. You will also have removed words that are not in the word list and occur only as the first word of a sentence, whether once or every time they appear. That number should be small compared to the total number of words that we are dealing with in those novels. You can always write the list of words beginning with a capital letter to a file and examine that file.

Now use the function wordComparison().

  • Print the number of distinct words used, i.e. the number of words left after removing duplicates. This is just the number of keys in the word frequency dictionary.
  • Compute and print the total number of words used by adding all the frequencies together.
  • Calculate and print the number of distinct words as a percentage of the total number of words used.
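
These three figures can be computed directly from the dictionary. The sketch below uses a hypothetical helper name, wordStats(), and a tiny made-up dictionary; the printed labels follow the sample output in this post.

```python
def wordStats(freq):
    """Hypothetical helper: return (distinct, total, percentage) for freq."""
    distinct = len(freq)              # number of keys = distinct words
    total = sum(freq.values())        # all frequencies added together
    # Guard against an empty dictionary to avoid dividing by zero.
    ratio = 100.0 * distinct / total if total else 0.0
    return distinct, total, ratio


distinct, total, ratio = wordStats({"the": 3, "fog": 1})
print("Total distinct words =", distinct)
print("Total words (including duplicates) =", total)
print("Ratio (% of total distinct words to total words) =", ratio)
```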

You will create two sets from the keys of the two word frequency dictionaries. Let us call these sets D and H for the two authors respectively. The set difference D - H represents all the words that Dickens used that Hardy did not. The set difference H - D represents all the words that Hardy used that Dickens did not. For each of these set differences print the following pieces of information:

  • The number of words in that set difference.
  • Compute the total frequency of the words in the set difference (D - H or H - D) and express it as a percentage of the total number of words in that author's novel, which you found earlier.
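
A minimal sketch of the set-difference calculation, assuming each author's words are held in a word -> frequency dictionary. The name differenceStats() and the two small dictionaries are hypothetical and only serve as an illustration.

```python
def differenceStats(freq_a, freq_b):
    """Hypothetical helper: words author A used that author B did not.

    Returns (number of such words, their combined frequency as a
    percentage of A's total word count).
    """
    only_a = set(freq_a) - set(freq_b)            # set difference A - B
    freq_only_a = sum(freq_a[w] for w in only_a)  # occurrences of those words
    total_a = sum(freq_a.values())
    percent = 100.0 * freq_only_a / total_a if total_a else 0.0
    return len(only_a), percent


dickens = {"the": 6, "fog": 3, "london": 1}   # made-up counts
hardy = {"the": 4, "heath": 2}                # made-up counts

count, percent = differenceStats(dickens, hardy)   # D - H
print("Dickens used", count, "words that Hardy did not use.")
print("Relative frequency =", percent)

count, percent = differenceStats(hardy, dickens)   # H - D
print("Hardy used", count, "words that Dickens did not use.")
print("Relative frequency =", percent)
```

Note that the percentage is taken against the first author's own total, which matches the sample output where each relative frequency is reported per novel.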

Enter name of first book: dickens.txt
Enter name of second book: hardy.txt

Enter last name of first author: Dickens
Enter last name of second author: Hardy

Total distinct words = 55
Total words (including duplicates) = 116
Ratio (% of total distinct words to total words) = 47.4137931034

Total distinct words = 92
Total words (including duplicates) = 122
Ratio(% of total distinct words to total words) = 75.4098360656

Dickens used 47 words that Hardy did not use.
Relative frequency of words used by Dickens not in common with Hardy = 62.9310344828

Hardy used 84 words that Dickens did not use.
Relative frequency of words used by Hardy not in common with Dickens = 77.0491803279

This code will give a count of all 'words' in a text file, but it includes numbers. It also doesn't remove hyphens.

import re

# filename is assumed to hold the path of the text file to analyse
# create list of lower case words, \s+ --> match any whitespace(s)
# you can replace open(filename).read() with given string
word_list = re.split(r'\s+', open(filename).read().lower())
print('Words in text: %d' % len(word_list))

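# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation marks to be removed
# note: inside [] the sequence !-, is a range from '!' to ',', so '-' itself
# is not matched; put '-' last in the class to remove hyphens as well
punctuation = re.compile(r'[.?!-,":;]')
for word in word_list:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # skip empty strings left by leading/trailing whitespace or bare punctuation
    if not word:
        continue
    # form dictionary
    try:
        freq_dic[word] += 1
    except KeyError:
        freq_dic[word] = 1

print('Unique words: %d' % len(freq_dic))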

# create list of (key, val) tuple pairs, sorted by key or word
freq_list = sorted(freq_dic.items())
# display result
for word, freq in freq_list:
    print('%s %d' % (word, freq))
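
Both problems mentioned above can be traced to specific lines. In the punctuation pattern, `!-,` inside the brackets is read as a character range from `!` to `,`, so a literal hyphen is never matched; and nothing filters out tokens that contain digits. The sketch below is one possible way to handle both, not the only one. The names literal_hyphen and cleanWords() are hypothetical.

```python
import re

# The pattern from the snippet: "!-," is a range, '-' itself is not in it.
original = re.compile(r'[.?!-,":;]')
# Same marks with the hyphen placed last, where it is taken literally.
literal_hyphen = re.compile(r'[.?!,":;-]')


def cleanWords(text):
    """Hypothetical helper: lower case words with hyphens split, numbers dropped."""
    # Replace hyphens with blanks first, as the assignment asks.
    text = text.lower().replace("-", " ")
    words = [literal_hyphen.sub("", w) for w in text.split()]
    # isalpha() rejects empty strings and any token that still contains a
    # non-letter, e.g. digits or an apostrophe.
    return [w for w in words if w.isalpha()]


sample = 'Well-known, "quoted" text; 42!'
print(original.sub("", sample))        # hyphen survives
print(literal_hyphen.sub("", sample))  # hyphen removed
print(cleanWords(sample))
```

Because isalpha() also rejects words containing apostrophes, you may prefer to strip apostrophes in the pattern as well if contractions should be counted.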