I'm a complete beginner in the programming world, so forgive me for the basic questions. I'm trying to run Peter Norvig's spelling corrector from the Windows XP command line, but I'm having difficulties.
I have a text file of addresses that contains a number of misspellings, and I would like to use Norvig's script to correct them. I have created a file, 'big.txt', consisting of addresses with the correct spellings. It is meant to be the reference data that line 11 of the script reads. What I cannot figure out is how to give the script my input text file with the misspellings and have it generate an output file with the corrections.
Can someone show me what I need to change in the script to accept an input file and generate an output file? Secondly, how do you run all of this on the command line?
The following is Peter Norvig's spelling corrector:

import re, collections

def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Hint: this corrects the spelling of a sentence typed at the keyboard:

# replace the raw_input with your file read function
s = ' '.join(correct(word) for word in words(raw_input('Give sentence: '))).capitalize()+'.'
print s
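
For the file-to-file case asked about, something along these lines might work. It is only a sketch: the names `correct_line`, `correct_file` and `main` are made up for this example, and the corrector is passed in as a parameter so the snippet stands on its own. Unlike `words()`, it only rewrites runs of letters, so house numbers and punctuation in the addresses are left alone. It assumes Python 2.7 or later.

```python
import re
import sys

def correct_line(line, correct):
    """Return `line` with every run of letters passed through `correct`."""
    def fix(match):
        word = match.group(0)
        # The script's word counts are keyed by lowercase words.
        fixed = correct(word.lower())
        # Restore simple capitalisation so 'Main' does not become 'main'.
        if word.isupper():
            return fixed.upper()
        if word[0].isupper():
            return fixed.capitalize()
        return fixed
    return re.sub(r'[A-Za-z]+', fix, line)

def correct_file(in_path, out_path, correct):
    """Read in_path line by line and write the corrected text to out_path."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            dst.write(correct_line(line, correct))

def main(argv, correct):
    """Hypothetical command-line entry point: script INPUT OUTPUT."""
    if len(argv) != 3:
        sys.exit('usage: python %s INPUT.txt OUTPUT.txt' % argv[0])
    correct_file(argv[1], argv[2], correct)
```

To try it, paste this at the bottom of the spelling corrector, add `main(sys.argv, correct)` as the very last line, and save the whole thing as, say, spell.py (the file name is arbitrary). Then, from the folder that contains big.txt, run `python spell.py addresses.txt corrected.txt` at the command prompt, where the two file names stand for your own input and output files.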

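The function tends to favor shorter words: for tday it suggests day, even though today is more likely to be the intended word. Both candidates are a single edit away from tday, so correct() simply picks whichever of them occurs more often in big.txt.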
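
A small self-contained demonstration of the tday case, as a sketch: `edits1` is copied from the script above, while the `counts` dictionary is invented for illustration and stands in for NWORDS (real counts depend on your big.txt).

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# Made-up frequencies, assumed only for this example.
counts = {'day': 50, 'today': 20}

# Both known words are reachable from 'tday' with one edit:
# 'day' by deleting the 't', 'today' by inserting an 'o'.
candidates = sorted(w for w in edits1('tday') if w in counts)

# Same selection rule as correct(): the most frequent candidate wins.
best = max(candidates, key=counts.get)
```

With these counts `best` is day. If today were the more frequent word in the reference text, the same rule would choose today instead, so the result depends on what big.txt contains rather than on word length as such.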
