Hello everyone. I have written my own random text generator using a custom method (no Markov chains), and I would now like to try it on a text corpus larger than the ones that ship with NLTK. Which data structure should I use to make the code run faster? Adding more text files will surely make it painfully slow to execute. My algorithm is as follows:

1- Enter the trigger sentence (only once, at the beginning of the program)
2- Get the longest word in the trigger sentence
3- Find all the sentences in the corpus that contain the word from step 2
4- Randomly select one of those sentences
5- Get the sentence that follows the one picked in step 4 (call it sentA to avoid ambiguity), provided that sentA is longer than 40 characters
6- Go back to step 2, with sentA from step 5 as the new trigger sentence

Which data structure would be best for this? (I originally used lists in my code.) Thanks in advance.

Profile your code with cProfile to see which operations take the most time. I would think that a dictionary mapping each word to a list of the sentences (or their indices) that contain it would be helpful.
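A minimal sketch of that dictionary idea, not the original poster's code. The names `words_of`, `build_index` and `generate`, the regex-based word splitting, and the tiny sample corpus are all assumptions made for illustration. It also assumes one reading of step 5: only sentences whose follower is longer than 40 characters are eligible to be picked.

```python
import random
import re
from collections import defaultdict


def words_of(sentence):
    """Split a sentence into lowercase words (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z']+", sentence.lower())


def build_index(sentences):
    """Map each word to the positions of the sentences that contain it."""
    index = defaultdict(list)
    for pos, sentence in enumerate(sentences):
        for word in set(words_of(sentence)):  # set(): record each sentence once per word
            index[word].append(pos)
    return index


def generate(trigger, sentences, index, steps=5, min_len=40):
    """Follow steps 2-6 of the question for at most `steps` rounds."""
    output = []
    current = trigger
    for _ in range(steps):
        words = words_of(current)
        if not words:
            break
        longest = max(words, key=len)  # step 2
        # Step 3 becomes a dictionary lookup instead of a scan of the whole corpus.
        # Assumption: keep only positions whose following sentence exists
        # and is longer than min_len characters (step 5).
        candidates = [
            p for p in index.get(longest, ())
            if p + 1 < len(sentences) and len(sentences[p + 1]) > min_len
        ]
        if not candidates:
            break
        current = sentences[random.choice(candidates) + 1]  # steps 4 and 5
        output.append(current)
    return output


# Hypothetical sample corpus, already split into sentences.
sentences = [
    "The generator picks the longest word in the sentence.",
    "A dictionary lookup replaces scanning the whole corpus every time.",
    "Short one.",
    "Another sentence mentions the dictionary again for good measure.",
]
index = build_index(sentences)
print(generate("what is a generator", sentences, index))
```

The index is built once, so each round costs one lookup plus a pass over the matching positions rather than a pass over every sentence. Running the script with `python -m cProfile -s cumtime yourscript.py` (the script name is a placeholder) should show whether the lookup or something else dominates.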

Thanks for the information.
