Beginner Python Question

Question

sjgood 0 Newbie Poster

13 Years Ago

I am having trouble with a Python homework question. The problem is the following:

You are to write a program that counts the frequency of each word in a text file (text.in), and outputs each word with its count to a file (word.out). Here a word is defined as a contiguous sequence of non-blank space characters, and different capitalizations of the same character sequence should be considered the same word (e.g., Text and text). The input file includes several lines with only blank spaces and alphabetic characters. The output is formatted as follows: each line begins with a number indicating the frequency of the word, then a white space, and then the word itself.
You need to define two functions in the program:
read_text(), which has a parameter to get the file name of the article, and returns a dictionary with the words as the keys, and their counts as the values;
save_to_file(), which should have a parameter to receive the dictionary of words, and a parameter to receive the file name to which the information will be stored.

I have been working on this for a few hours now, so I'll post below what I have so far, but it's completely wrong. The first function takes only a section of the text and counts how many words are in it, and the second function isn't writing to the output file at all. Can anyone help, please?

def read_text():
    word_dict = {}
    text_file = open("line.in.txt","r")
    
    for line in text_file:
        line = line.replace('-', ' ')
        
    for word in line.split():
        word = line.strip("-")
        word = word.lower()

        word_dict[word] = word_dict.get(word, 0) + 1
 
    text_file.close()
    return word_dict

def save_to_file(word_dict):
    text_file = open("output.txt","w")
    text_file.write(word_dict)
    text_file.close()
    text_file = open("output.txt","r")
    text_file.read()
    text_file.close()


All 8 Replies

TrustyTony 888 ex-Moderator

13 Years Ago

Shouldn't the word loop be inside the line loop? Also, your functions don't take the parameters the assignment asks for.
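
A minimal sketch of what this reply points at, assuming the assignment's wording: the word loop is nested inside the line loop, and read_text() receives the file name as a parameter. The parameter name `filename` is an assumption, not something from the original post.

```python
def read_text(filename):
    """Return a dict mapping each lower-cased word in the file to its count."""
    word_dict = {}
    with open(filename, "r") as text_file:
        for line in text_file:
            # The word loop is nested inside the line loop,
            # so every line is processed, not just the last one.
            for word in line.split():
                word = word.lower()
                word_dict[word] = word_dict.get(word, 0) + 1
    return word_dict
```

Calling it as `read_text("text.in")` would match the file name given in the assignment.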

woooee 814 Nearly a Posting Maven

13 Years Ago

Damn pyguy62, that's quite a bit of effort. Hopefully it will be appreciated. Look at the indentation after line 55 (the `for key in word_count.keys():` loop). I hope you read this before the 30 minute limit is up.

To sort by value can also be done with

from operator import itemgetter
print(sorted(word_count.items(), key=itemgetter(1)))

but homework probably means the OP is supposed to write a sort method.
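
As a rough illustration of sorting by value, here is a self-contained sketch. The sample counts are hypothetical stand-ins, not taken from the thread's input file, and `reverse=True` is added on the assumption that the most frequent word should come first.

```python
from operator import itemgetter

# Hypothetical sample counts, not taken from the thread's input file.
word_count = {"a": 14, "the": 23, "and": 10}

# itemgetter(1) picks the count out of each (word, count) pair.
by_count = sorted(word_count.items(), key=itemgetter(1), reverse=True)

# The equivalent lambda form gives the same ordering.
by_count_lambda = sorted(word_count.items(), key=lambda pair: pair[1], reverse=True)

print(by_count)
```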

The comments looked much nicer in IDLE

IDLE probably uses a fixed-width font, i.e. all letters have the same width. A browser generally uses a proportional font, i.e. "i" takes up less space than "w", so I just post what I have and ignore the misaligned comments.

JoshuaBurleson commented: good eye. and thanks for the itemgetter reference! +4


JoshuaBurleson 23 Posting Whiz · Answer 1 · 2011-10-11T10:18:50+00:00

I'm going to overcomplicate the matter so as to show you, step by painful step, how the process goes, in my mind that is. I'm not going to make functions either; let's see you re-engineer my code, and hopefully simplify it. Also, post your result. Note that help.txt was a copy of your original post, without the code. I can't state enough that it shouldn't take this many steps and lists. I did that for educational purposes, because I love to have things over-explained.

#Overly Thorough Process
#This could surely be done without 3 lists and so many steps

with open('help.txt','r') as f:
    #Readlines will make me a nice, pretty(ish) list
    file_list=f.readlines()

#list for the larger strings
unsanitized_words=[]
#list for words that still have non-alpha characters
iteration_list=[]
value_list=[]#list to be sorted based on frequency of word "more on this later"

for word in file_list:
    #'''These are large strings I'm gonna break 'em down'''
    for real in word.split():
        sub=real.split(',')#'''okay now I want to break them down a bit more'''
        
        for wrd in sub:
            unsanitized_words.append(wrd)#'''and now I have words, given they still have non-alpha chars, so
                                         #they're not sanitized enough for me, into the unsanitized list for them '''
            

for word in unsanitized_words:
    #'''Like a surgeon scrubbing in I'm gonna sanitize the crap out of these words
       #I'll make a new word of each word, without non-alphas,hold for apostrophes'''
    word=word.lower()
    new_word=''
    for char in word:
        if char=="'":
            new_word+=char
        elif char.isalpha():
            new_word+=char
        else:
            pass
    if new_word=='i':#'''and I'll add that word to the list to be iterated through'''
       new_word='I'
        #'''lowercase I's just bother the crap out of me,ignore this if you want'''
    if new_word[0:2]=="i'":
        new_word=new_word.replace("i'","I'")
                          #'''Okay, now add the nice clean word to the iteration_list to be further inspected and added to the dict
                          #note that the inspection could have taken place here.'''
    iteration_list.append(new_word)

#'''Obvious dictionary is obvious'''
word_count={}
for word in iteration_list:
    if word not in[' ','']:# '''Hmmm looks like spaces got counted, easy fix'''
        if word not in word_count:
            word_count[word]=1 #'''not in the dict? Add it with a value of 1'''
        else:
            word_count[word]+=1# '''Oh you are in the dict, nice to see you again, value +1'''
          

for key in word_count.keys():
   # '''Alright, I want to sort this dictionary, but I can't sort a dictionary...
   # but, I can sort a list pretty easily. I'll use a list of tuples so that...'''
    #'''...I can sort the list by the value, which is at index[1] of each tuple'''
value_list.append((key,word_count[key]))
value_list=sorted(value_list, key=lambda i: i[1],reverse=True)

for tup in value_list:
   print(tup[1],tup[0])    #'''Now to make it pretty and print'''
                            #'''<----Remember if you need to print to a file you can
                             #  open the file and print into it,i.e.
                              # print("You're welcome",file=f)
                               # assuming f is a file opened to write'''

The comments looked much nicer in IDLE, I tried to format them for Daniweb, but it's just not quite as good... :'(

and the Top 10 from your original post:

23 the
14 a
10 and
7 to
7 of
6 word
6 file
5 with
4 text
4 is
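
The sorting and print-to-file steps described in the comments above can be sketched as a function shaped like the assignment's save_to_file(). This is only a sketch under assumptions: the parameter names are made up here, and the list-building loop (which lost its indentation in the posted code) is folded into a single `sorted()` call.

```python
def save_to_file(word_dict, filename):
    """Write one 'count word' line per entry, most frequent first."""
    # Turn the dict into (word, count) pairs and sort them by count.
    pairs = sorted(word_dict.items(), key=lambda pair: pair[1], reverse=True)
    with open(filename, "w") as out_file:
        for word, count in pairs:
            # print() accepts a file= argument, so it can write to the open file.
            print(count, word, file=out_file)
```

Note that `write()` on a text file expects a string, so passing the dictionary itself, as in the original post, raises a TypeError.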

JoshuaBurleson 23 Posting Whiz · Answer 2 · 2011-10-11T10:59:06+00:00

Yeah, that happened while I was trying to reformat the comments, and it's too late to fix now, but I complained to the mods. Thanks for pointing that out though. Oh, I've never seen itemgetter before. That's a handy tool, and easier on the eyes to a newbie than lambda. I remember the first time I saw a lambda I was so confused, and then confusion became love when I realized how much easier they can make life.

Also, apparently DaniWeb didn't like my multiline comments ''' ''' so I changed each of them to one-line # comments. I don't remember having an issue with that before, maybe I just never did it...

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 3 · 2011-10-11T17:19:45+00:00

Counter is handy in real life, where you are allowed to use it. This exercise is so common that most of us oldies can do it six ways by heart, with various levels of sophistication:

from collections import Counter

def clean_up(w):
    word = ''.join(c.lower() for c in w if c.isdigit() or c.isalpha())
    return word if word != 'i' else 'I'

with open('help.txt','r') as f:
    frequencies = Counter(clean_up(word)
                  for line in f
                  for word in line.split() if clean_up(word))

print(frequencies)
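
A small follow-up sketch, assuming the goal is a "top N" listing like the one posted earlier: `Counter.most_common()` returns the entries already sorted by count. The sample text is a hypothetical stand-in for the input file.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of the input file.
sample = "the cat and the hat and the bat"
frequencies = Counter(sample.split())

# most_common(n) returns the n highest counts as (word, count) pairs.
for word, count in frequencies.most_common(2):
    print(count, word)
```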

sjgood 0 Newbie Poster · Answer 4 · 2011-10-12T01:34:22+00:00

I'm still pretty confused.
I tried to condense it but I don't think I did it very well.
Here's what I have:

def read_text():
    words = {}
    file = open("line.in.txt","r")
    file.readlines()
        for line in file:
            word = word.lower()
            new_word = ''
            for word in line.split():
                separate = word.split(',')
                for new_word in separate:
                    if word not in[' ','']:
                        if word not in words:
                            words[word] += 1
        file.close()
    return words

I am just not getting this at all.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 5 · 2011-10-12T02:10:01+00:00

Line 4 reads all the lines and discards them. Line 5 then tries to read from a file that is already at end of file, so the for loop never executes. The indentation is off for lines 5 through 15, so you cannot have run your code at all. Also, line 7 has no effect because line 10 overwrites it, and the condition at line 12 guarantees that line 13 raises an error once you do enter the loop.
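
The first point can be seen in a short sketch. `io.StringIO` is used here only as a stand-in for an open text file, so nothing needs to exist on disk; a real file object behaves the same way once it has been read to the end.

```python
import io

# io.StringIO stands in for an open text file so the sketch needs no file on disk.
text_file = io.StringIO("first line\nsecond line\n")

# readlines() consumes every line and leaves the file at its end.
discarded = text_file.readlines()

# Iterating afterwards finds nothing left to read.
leftover = [line for line in text_file]

print(len(discarded), len(leftover))
```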

JoshuaBurleson 23 Posting Whiz · Answer 6 · 2011-10-12T03:35:31+00:00

condition at line 12 guarantees that line 13 raises error, when you do enter the loop.

Yes, you essentially said, "if this doesn't exist, add TO it." Well, that's quite impossible. If you told me to throw my recycling in the bin but there was no bin, I'd scream ERROR. So you need to ADD the word to the dictionary before you can add to its count, i.e. you need to put a recycling bin in the trash area before I can add to it.
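
A minimal sketch of the recycling-bin point, using a made-up sample string: incrementing a key that is not yet in the dictionary raises KeyError, so the key is created with a count of 1 on first sight and incremented after that.

```python
# Incrementing a missing key fails: there is no "bin" yet.
counts = {}
try:
    counts["text"] += 1
    raised = False
except KeyError:
    raised = True

# Create the entry first, then add to it on later sightings.
words = {}
for word in "Text text the".lower().split():
    if word not in words:
        words[word] = 1      # put the bin in place
    else:
        words[word] += 1     # now there is something to add to

print(raised, words)
```

The shorter form `words[word] = words.get(word, 0) + 1`, which the original post already used, does the same thing in one line.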
