Hello,

I make an exercise from a book : Think Python: How to Think Like a Computer Scientist by Allen B. Downey, Version 1.1.22, random words exercise 7.

I made a script like this (see below) but how can I make list (named book) just once and since then after each execution of the script it will make random choice from this list (if I well understand the task). There are more alternatives (in this manual book) and I don't know which one is more efficient, can you also tell me?

import random
def random_word(h):
    k=histogram(h)
##    print k
    t=[]
    for word, freq in k.items():
        t.extend([word]*freq) # output is a list but it will be build again after each execution of the script
    return random.choice(t)

def histogram(h):
    d = dict()
    for c in h:
        if c not in d:
            d[c] = 1
        else:
            d[c] += 1
    return d


print random_word('I went out for a walk')

Thank you

Vlady

Recommended Answers

All 12 Replies

Letters have different frequencys and the list is built from input source to reprecent the frequency by multiplication.

can you give me an example pls?

Can you give us the text of exercise 7?

13.7 Random words

To choose a random word from the histogram, the simplest algorithm is to build a list with multiple copies of each word, according to the observed frequency, and then choose from the list:

def random_word(h):
t = []
for word, freq in h.items():
t.extend([word] * freq)

return random.choice(t)

The expression [word] * freq creates a list with freq copies of the string word. The extend method is similar to append except that the argument is a sequence.
Exercise 7

This algorithm works, but it is not very efficient; each time you choose a random word, it rebuilds the list, which is as big as the original book. An obvious improvement is to build the list once and then make multiple selections, but the list is still big.

An alternative is:

1. Use keys to get a list of the words in the book.
2. Build a list that contains the cumulative sum of the word frequencies (see Exercise 10.1). The last item in this list is the total number of words in the book, n.
3. Choose a random number from 1 to n. Use a bisection search (See Exercise 10.8) to find the index where the random number would be inserted in the cumulative sum.
4. Use the index to find the corresponding word in the word list.

Write a program that uses this algorithm to choose a random word from the book.

Looks like the author is talking about the fact that a word:frequency dictionary of the book would be a lot shorter than a word list. Efficiency would be to get a representative random word from the book, that also reflects the frequency of the word, directly from the dictionary without expanding it into a long word list.

Imagine a book of 1 million words. A word list of every word in the book would have 1 million items. A word:frequency dictionary might only have 5,000 to 10,000 items, since duplicate words are taken care of in the frequency value.

This exercise recuires histogram of words, not letters.

Did you do that (Exercise 10.1). You must remove unwanted characters (punctuation etc) to get words in the end of sentence, for example:

import string
sentences='It was a dark, cold night. Commander said: "I will not tolerate any insubordinates."'

print [i.strip(string.punctuation) for i in sentences.lower().split()]

I've corrected a bit a script because it wasn't what I'd expected.

import random

def random_words(t):
    k=hist(t)
    d=dict()
    t=[]
    a=1
    for i in k:
        d[i]=a
        a+=1
    for w, f in d.items():
        t.extend([w]*f)
    return random.choice(t)


def hist(t):
    l=t.split()
    return l


print random_words('vlady went for a beer')

but still, I don't know how I would proceed with exercise 7. Is it worthy to learn it?

Looks like the author is talking about the fact that a word:frequency dictionary of the book would be a lot shorter than a word list. Efficiency would be to get a representative random word from the book, that also reflects the frequency of the word, directly from the dictionary without expanding it into a long word list.

Imagine a book of 1 million words. A word list of every word in the book would have 1 million items. A word:frequency dictionary might only have 5,000 to 10,000 items, since duplicate words are taken care of in the frequency value.

do you think like this? :

import random

def random_words(t):
    g=words(t)
    t=[]
    for w, f in g.items():
        t.extend([w]*f)
    return random.choice(t)


def words(t):
    k=hist(t)
##    print k
    d=dict()
    t=[]
    for i in k:
        if i not in d:
            d[i]=1
        else:
            d[i]+=1
    return d # {'a': 1, 'went': 1, 'vlady': 1, 'beer': 2, 'for': 1}



def hist(t):
       l=t.split() # ['vlady-went', 'for', 'a', 'beer', 'beer']
       for i in l:
           i=t.replace('-',' ') # vlady went for a beer beer
           l=i.split()
       return l # ['vlady', 'went', 'for', 'a', 'beer', 'beer']
##hist('vlady-went for a beer beer')


##random_words('vlady-went for a beer beer')

print random_words('vlady-went for a beer beer')

No, because
t.extend([w]*f)
extends into the large word list again!

vegaseat means that you must reduce the f to be reasonable size by choosing scaling factor (in case of normal book size material for input).

The biggest problem is that some rare words are going to be multiplied be zero, means they never appear in output even they are in input.

vegaseat means that you must reduce the f to be reasonable size by choosing scaling factor (in case of normal book size material for input).

The biggest problem is that some rare words are going to be multiplied be zero, means they never appear in output even they are in input.

I am not sure if I am able to do it, I found it difficult.

You can find the maximum of the counts and total number of words. If you want to prepare list of 1000 elements, you would divide each count with total count of words and multiply be desired number of elements, here 1000. By int(x) you can take out decimals from results or use integer division // (after multiplying by 1000).

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.