Project Gutenberg regular expression problem

Question

koveras vehcna 0 Newbie Poster

13 Years Ago

Hello everyone, I am currently working on some code and I am stuck on one part. My algorithm is a text generator that operates on Project Gutenberg texts, and its flow is like this:

Enter a sentence as input
1-Pick the longest word of the input sentence
2-Search for that word in all sentences of the Project Gutenberg books (these sentences are accessed with the gutenberg.sents() function, as described in the NLTK book at http://www.nltk.org/book)
3-Find the longest sentence that contains that word
4-Append that sentence to the first sentence
5-Go back to 1-

I want to find the longest word without worrying about case, i.e. the word should be found whether it appears in uppercase or lowercase. But I can't search the sentences directly because they are a list of lists (gutenberg.sents() returns each sentence as a list of words, a strange way in my opinion), so re.search can't return any results. Any ideas on how I can do this? Thanks.
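
For illustration, one way to sidestep the list-of-lists issue is to compare lowercased tokens directly, with no regex at all. This is only a minimal sketch for Python 3; the `sentences` list and the `sentences_containing` helper are hypothetical stand-ins (not part of NLTK) so the example runs without the corpus installed. It assumes each sentence is a list of word tokens, which is the shape gutenberg.sents() returns.

```python
# Minimal sketch (Python 3). `sentences` is a hypothetical stand-in for
# gutenberg.sents(...): a list of sentences, each a list of word tokens.
sentences = [
    ["Double", ",", "double", "toil", "and", "trouble"],
    ["Fire", "burne", ",", "and", "Cauldron", "bubble"],
    ["By", "the", "pricking", "of", "my", "Thumbes"],
]


def sentences_containing(word, sents):
    """Return the sentences that contain `word` as a token, ignoring case."""
    target = word.lower()
    return [s for s in sents if any(tok.lower() == target for tok in s)]


hits = sentences_containing("CAULDRON", sentences)
for sent in hits:
    print(" ".join(sent))
```

Because the comparison is per token, this only finds whole-word matches; a word with punctuation attached (for example one taken straight from str.split()) would need to be stripped first.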

processing python regex text

All 5 Replies

Beat_Slayer 17 Posting Pro in Training

13 Years Ago

Post some code, it should be easy to try to correct.

Cheers and Happy coding

griswolf 304 Veteran Poster

13 Years Ago

Now a substantive message:
I don't understand how regex comes into play from your description of the problem, so I see no need to worry about it. However, assuming you will eventually need it, there are two ways out. This line:
sets = [w for w in mac if s in w]
can be modified to search the whole sentence:
sets = [w for w in mac if theRe.search(" ".join(w))]
where theRe is your compiled regular expression. This may not be ideal, but I think it gets there.
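
A rough sketch of how that suggestion might look in Python 3, combined with the case-insensitivity asked for in the question. The `mac` list here is a small hypothetical stand-in for gutenberg.sents('shakespeare-macbeth.txt'), and the value of `s` is just an example; the use of re.escape and word boundaries is an assumption about what is wanted, not something stated in the thread.

```python
import re

# Hypothetical stand-in for gutenberg.sents(...): each sentence is a token list.
mac = [
    ["Faire", "is", "foule", ",", "and", "foule", "is", "faire"],
    ["Hover", "through", "the", "fogge", "and", "filthie", "ayre"],
]

s = "FOULE"  # longest word of the input sentence (example value)

# re.escape guards against regex metacharacters in the word, \b restricts
# the match to whole words, and re.IGNORECASE makes case irrelevant.
theRe = re.compile(r"\b" + re.escape(s) + r"\b", re.IGNORECASE)

# Join each token list into one string so the pattern can search it.
sets = [w for w in mac if theRe.search(" ".join(w))]
print(sets)
```

The word boundaries assume `s` starts and ends with a letter or digit; if the input word carries trailing punctuation, strip it before building the pattern.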

koveras vehcna 0 Newbie Poster · Answer 1 · 2010-08-22T22:17:30+00:00

Here is the code so far. I have not used regex yet; this one only retrieves the exact word.

import nltk, re
from nltk.corpus import gutenberg

mac = gutenberg.sents('shakespeare-macbeth.txt')

x = raw_input("Please enter a sentence: ")  # entering input as it is

split_str = x.split()
p = 0
for item in split_str:
    if len(item) > p:
        s = item
        p = len(item)
# s is the longest word in the normal input version

# sentences that have the longest word of the trigger sentence
sets = [w for w in mac if s in w]

# longest length of the sentence in 'sets'
# (max() raises ValueError if no sentence contains s)
longest_len = max([len(w) for w in sets])
secondsentence = [a for a in sets if len(a) == longest_len]

for c in secondsentence:
    sent = " ".join(c)
    # input sentence followed by the matched sentence in uppercase
    print x + " " + sent.upper()

griswolf 304 Veteran Poster · Answer 2 · 2010-08-22T23:50:33+00:00

First a meta message: You misunderstood the instructions about how to post code (easy to do). The easiest way to post code is to press the (CODE) button at the top of the message box, then paste your code between the "your code here" tags that it places for you. Alternatively, you can just type (code) at the top of your code and (/code) at the bottom; it works exactly the same. The (icode) tags are for inline code, such as "This is inline code". Mix and match doesn't work, as you can see.

TrustyTony 888 pyMod Team Colleague Featured Poster · Answer 3 · 2010-08-22T23:53:54+00:00

You can use max(sets, key=len) and max(split_str, key=len) to simplify, I think.
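
A minimal sketch of that simplification, written for Python 3. The values of `x` and `sets` below are hypothetical stand-ins for the variables in the posted code, and the `default=None` guard is an addition not mentioned in the thread.

```python
# Hypothetical stand-ins for the variables in the code above.
x = "Something wicked this way comes"
sets = [
    ["Something", "wicked"],
    ["By", "the", "pricking", "of", "my", "Thumbes", ",",
     "Something", "wicked", "this", "way", "comes"],
]

split_str = x.split()

# Longest word of the input; on ties max() returns the first one found.
s = max(split_str, key=len)

# Longest sentence by token count; default=None avoids a ValueError
# when no sentence matched (the default argument needs Python 3.4+).
longest = max(sets, key=len, default=None)

if longest is not None:
    print(x + " " + " ".join(longest).upper())
```

One difference from the posted loop: max() returns a single sentence, whereas the list comprehension over longest_len keeps every sentence tied for the longest length.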
