Hi everyone. I have written a random text generator (okay, not so random, it works with certain parameters) and at times it runs slowly. I looked up speed tricks on the internet and applied the ones that seemed to fit my code, but the code is still slow and I want to optimize it as much as possible so that it works on even bigger data. Currently my code processes 18 books from the Gutenberg corpus, and I am trying to get one sentence every 3-4 minutes; most of the time, though, the code is painfully slow, and I have even seen it give no response for 8 minutes. The algorithm and the implementation are below:

ALGORITHM
1- Enter the trigger sentence (only once, at the beginning of the program)
2- Get the longest word in the trigger sentence
3- Find all the sentences of the corpus that contain the word from step 2
4- Randomly select one of those sentences
5- Get the sentence (call it sentA, to resolve the ambiguity in the description) that follows the sentence picked at step 4, as long as sentA is longer than 40 characters
6- Go to step 2; the trigger sentence is now the sentA of step 5


IMPLEMENTATION

from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence:") #get the input sentence from the user

previousLongestWord = ""

listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus

sentenceAppender = ""

longestWord = ""

#this function is not mine; code courtesy of Dave Kirby, found on the internet
#among speed tricks for removing duplicates from a list while preserving order
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]
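#e.g. arraySorter(["b", "a", "b"]) gives ["b", "a"]: unlike set(), the original order is kept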


def findLongestWord(longestWord):
    #skip a word that occurs only once in the corpus, or that repeats the
    #previous longest word, by falling back to the next-longest candidates
    if listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper():
        longestWord = sortedSetOfValidWords[-2]
        if listOfWords.count(longestWord) == 1:
            longestWord = sortedSetOfValidWords[-3]
    return longestWord #the caller needs the updated word back

doappend = corpusSentences.append #cache the bound method for a minor speed-up

def appending():
    for mysentence in listOfSents: #the corpus yields each sentence as a list of words, so join it back into a string
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)


appending()
sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):
    for sentence in corpusSentences:
        if longestWord in sentence: #if the sentence contains the longest word, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)


def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):
    #in case the next sentence is shorter than 40 characters, pick another trigger sentence
    while len(corpusSentences[sentenceIndex + 1]) < 40:
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)
    return sentenceIndex #the caller needs the updated index back

while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence

    sentencesContainingLongestWord = [] #all the sentences that contain the longest word go into this list

    setOfValidWords = [] #words of the trigger sentence that also exist in the corpus

    split_str = triggerSentence.split() #split the sentence into words

    setOfValidWords = [word for word in split_str if listOfWords.count(word)]

    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))

    longestWord = sortedSetOfValidWords[-1]

    longestWord = findLongestWord(longestWord)

    previousLongestWord = longestWord

    getSentence(longestWord, sentencesContainingLongestWord)
	
    triggerSentence = choice(sentencesContainingLongestWord)
    
    sentenceIndex = corpusSentences.index(triggerSentence)

    sentenceIndex = lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)

    triggerSentence = corpusSentences[sentenceIndex + 1] #get the sentence that follows the previous trigger sentence

    print triggerSentence
    print "\n"
 	
    corpusSentences.remove(triggerSentence) #remove the used sentence; drop this line if you want the index numbers to stay in step with the actual Gutenberg numbering
	
 
print "End of session, please rerun the program"
#initiated once the while loop exits, so that the program ends without errors

How can I rescue this from its painfully slow state? Thanks in advance.

To start you off (untested):

def find_longest_word(trigger_sentence):
    """ 2. = find the longest word in the trigger sentence
    """

    ## container for the longest word or words
    longest_list = [0]

    ## assumes no punctuation
    for word in trigger_sentence.split():
        ## new word found so re-initialize the list
        if len(word) > longest_list[0]:
            longest_list = [len(word), word]
        elif len(word) == longest_list[0]:
            longest_list.append(word)

    ## element[0] = length, [1]+ = word or words
    return longest_list

If you find that you have to access this function more than once, then change the list to a dictionary with key=word length pointing to a list of words; then when you want the next-longest word, you have it without going through the sentence again.
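For instance, a minimal sketch of that dictionary variant (untested like the rest, and the name words_by_length is mine):

from collections import defaultdict

def words_by_length(trigger_sentence):
    """ map each word length to the list of words of that length
    """

    by_length = defaultdict(list)
    ## assumes no punctuation, as above
    for word in trigger_sentence.split():
        by_length[len(word)].append(word)
    return by_length

## sorted(by_length, reverse=True) then yields the lengths from longest to
## shortest, so the next-longest words are one lookup away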

def find_sentence(longest_word_list, list_of_sentences):
    """3- Find all the sentences of the corpus that contain the word from step 2
    """

    sentences_found = []
    for sentence in list_of_sentences:
        ## if we are looking for "the" we do not want "there", so split the sentence
        sentence_list = sentence.split()
        ## several of the longest words may occur in the same sentence, so
        ## stop after the first match to avoid appending it twice
        for word in longest_word_list[1:]:
            if word in sentence_list:
                sentences_found.append(sentence)
                break
    return sentences_found
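
Beyond that, the main cost in your version is that every pass of the while loop rescans the whole corpus: listOfWords.count(word) walks all 18 books once per word, and getSentence walks every sentence again. A standard fix is an inverted index built once up front, mapping each word to the sentences that contain it. A rough sketch (untested like the rest; build_index and its names are mine):

from collections import defaultdict

def build_index(corpus_sentences):
    """ one pass over the corpus: word -> indices of sentences containing it
    """

    index = defaultdict(list)
    for position, sentence in enumerate(corpus_sentences):
        ## index whole words, so that "the" does not match "there"
        for word in set(sentence.split()):
            index[word].append(position)
    return index

## built once after appending(), step 3 then becomes a dictionary lookup
## instead of a scan, e.g. candidates = index.get(longest_word, [])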