I'm trying to create a program that prompts the user for a list of text files to read from, reads those files, and builds a dictionary of all the unique words found. Finally, it should write those unique words to another file in alphabetical order.

from string import punctuation

def unique_words(sentence, number):
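    # str.translate needs a table; maketrans("", "", punctuation) deletes punctuation (Python 3)
    return [w for w in set(sentence.translate(str.maketrans("", "", punctuation)).lower().split()) if len(w) >= number]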

print (unique_words("This text is a sample text. It need to be parsed correctly.", 2))

I figured out how to take the unique words out of a sentence, but that's it. Can anyone help me out a little?

Sort of looks like a VB language version. I have no idea how that all works. In .NET you can use the Dictionary(TKey, TValue) class (System.Collections.Generic). This will NOT sort the data, but will efficiently determine whether a word is unique. I don't see how the above code snippet could do that. You can set the TValue type to bool, set the value to true, and ignore it. You can then transfer the Keys collection into an array and use the built-in Array.Sort function.
If you aren't using .NET, I don't have the experience to help you.

Comments
Off topic, this is a Python forum.

Here is a hint ...

''' set comprehension101.py
show a list of unique words in a sentence via set comprehension
'''

import string

s = "Girl meets boy, and boy meets girl once a week."

print("\nOriginal sentence:")
print(s)

# remove all punctuation marks and make lower case
s_nopunct = "".join(c for c in s if c not in string.punctuation).lower()

print("\nSentence in lower case with punctuation marks removed:")
print(s_nopunct)  # test

# convert to a sorted list of unique words via set comprehension
list_unique = sorted(list({word for word in s_nopunct.split()}))

print("\nSorted list of unique words in sentence:")
print(list_unique)

''' result ...
Original sentence:
Girl meets boy, and boy meets girl once a week.

Sentence in lower case with punctuation marks removed:
girl meets boy and boy meets girl once a week

Sorted list of unique words in sentence:
['a', 'and', 'boy', 'girl', 'meets', 'once', 'week']
'''

By the way, you want a Python list, not a Python dictionary object.

Edited 2 Years Ago by vegaseat

That actually helped me out a lot, thank you, I appreciate it.
So far I've got:

import string

s = input("Enter a file name: ") + ".txt"
filepath = "I:\\" + s

# remove all punctuation marks and make lower case
s_nopunct = "".join(c for c in s if c not in string.punctuation).lower()

# convert to a sorted list of unique words via set comprehension
list_unique = sorted(list({word for word in s_nopunct.split()}))

print("\nSorted list of unique words in sentence:")
print(list_unique)

with open("C:\\Users\\Desktop\\words.dat", "w") as f:
    for x in list_unique:
        f.write(x + "\n")

I need help making it so that the user is prompted for 3 files.
And also, I tried making those unique words to write to another file (I got it that far), but how do I make it more of an arbitrary path (rather than the C:\Users etc) since I need it so that anyone can run that program and write to that file.
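
One way to avoid a hard-coded C:\Users path is to build the output path at run time, for example relative to the directory the program is run from or to the user's home directory. A minimal sketch, assuming Python 3; the function name write_words and its parameters are illustrative choices, not part of the code above:

```python
from pathlib import Path

def write_words(words, out_dir=None, filename="words.dat"):
    """Write the words one per line, in sorted order, and return the path written."""
    # Assumption: when no directory is given, use the current working directory.
    target_dir = Path(out_dir) if out_dir is not None else Path.cwd()
    out_path = target_dir / filename
    with open(out_path, "w") as f:
        for word in sorted(words):
            f.write(word + "\n")
    return out_path
```

Calling write_words(list_unique) would put words.dat next to wherever the program is started, and write_words(list_unique, Path.home()) would put it in the current user's home directory, so no user name needs to appear in the source.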

Do you want to include numbers and everything else not in string.punctuation?

s_nopunct = "".join(c for c in s if c not in string.punctuation).lower()

Generally it is better to include what you want instead of excluding a certain set as above, as anything you forgot about is automatically included.

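s_nopunct = "".join(c for c in s if c in string.ascii_letters or c.isspace()).lower()

In Python 3 the constant is string.ascii_letters (string.letters exists only in Python 2). Whitespace is kept here so that split() can still separate the words. A set intersection of string.ascii_letters and "s" can also be used.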

Note also that list_unique contains only one item as you have it in the code above (it is built from the file name, not the file's contents), so the

for x in list_unique:

is not necessary.

Edited 2 Years Ago by woooee

Comments
I only want to include words that are 2 or more letters. Words such as "error-free" would need to be separated to form just two words.
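
Both requirements in this comment (at least 2 letters, and hyphenated words split apart) can be handled by matching runs of letters instead of deleting punctuation. A minimal sketch, assuming only the ASCII letters a-z count as word characters; the function name extract_words is made up for illustration:

```python
import re

def extract_words(text, min_len=2):
    """Return a sorted list of unique lowercase words with at least min_len letters."""
    # Any non-letter (space, hyphen, digit, punctuation) ends a word,
    # so "error-free" yields "error" and "free".
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if len(w) >= min_len})

print(extract_words("This error-free text is a sample text."))
# ['error', 'free', 'is', 'sample', 'text', 'this']
```

Note that apostrophes split words too under this rule, so "don't" would contribute only "don" once the one-letter "t" is filtered out.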

but how do I make it more of an arbitrary path (rather than the C:\Users etc) since I need it so that anyone can run that program and write to that file.

Tkinter has an askdirectory dialog (tkinter.filedialog.askdirectory in Python 3).

Tkinter would be perfect but I need it so that the program asks the user for a list of text files to read from, and to type the file names as [filename1] [space] [filename2] [space] [filename3]
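
Reading names typed on one line and separated by spaces needs no GUI: input() returns the line and str.split() breaks it on whitespace. A minimal sketch, assuming Python 3 and file names that contain no spaces themselves; prompt_for_files and read_texts are hypothetical helper names:

```python
def prompt_for_files():
    """Ask for file names on one line, e.g.: a.txt b.txt c.txt"""
    line = input("Enter file names separated by spaces: ")
    return line.split()

def read_texts(filenames):
    """Return the combined text of every file that can be opened."""
    parts = []
    for name in filenames:
        try:
            with open(name, "r") as f:
                parts.append(f.read())
        except OSError:
            # Skip anything missing or unreadable rather than abort.
            print("Skipping (cannot open file): " + name)
    return "\n".join(parts)
```

A program would then call read_texts(prompt_for_files()) and pass the resulting text on to the word-extraction step. Nothing here limits the count to 3; any number of names on the line is accepted.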

askdirectory returns a directory name and has nothing to do with file names. Tkinter has an askopenfilename function to get a file name.

def checkFileList(filename):
    filelist = list()
    try:
        fi = open('alice.txt', 'r')
    except:
        print('Unable to read from filelist: '+filename)
        sys.exit()

    lines = fi.readlines()
    print ('  Including files from filelist ...')
    for l in lines:
        l = l.strip()
        if l != '' and not re.search('^#',l):
            if os.path.exists(l) is True:
                print ('    + ')+l
                filelist.append(l)
            else:
                print ('    - ')+l+(' skipping (cannot open file)')
        else:
            filelist = (filename,)

    if len(filelist) == 0:
        printError('No files to process')
        sys.exit()

    return filelist

I'm trying to make it without Tkinter. I tried the above, but it didn't work.
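
A few things in the function above stand out as likely causes: it opens the hard-coded 'alice.txt' instead of filename; in Python 3, print('    + ')+l tries to add a string to the None that print() returns; the else branch replaces the whole list with a tuple on every blank or comment line; and sys, re, os and printError are never imported or defined. A hedged sketch of what the function appears to be aiming for, assuming the file list holds one path per line with # marking comment lines (the snake_case name is just a renaming for illustration):

```python
import os
import sys

def check_file_list(filename):
    """Return the existing paths listed, one per line, in the file `filename`."""
    try:
        with open(filename, "r") as fi:
            lines = fi.readlines()
    except OSError:
        print("Unable to read from filelist: " + filename)
        sys.exit(1)

    filelist = []
    print("  Including files from filelist ...")
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        if os.path.exists(line):
            print("    + " + line)
            filelist.append(line)
        else:
            print("    - " + line + " skipping (cannot open file)")

    if not filelist:
        print("No files to process")
        sys.exit(1)
    return filelist
```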

Here is a hint ...

''' filename_list_given_dir1.py
use module glob to list all the filenames of .jpg files
or any extension(s) you specify in a given directory
'''

import glob
import os

# all files (split off file names) in a given directory
directory = "C:/Temp/*.jpg"
# this would give you all files
#directory = "C:/Temp/*.*"
for path in glob.glob(directory):
    #print(path)  # test
    # separate path from filename
    dirname, filename = os.path.split(path)
    print(filename)

Is there any way to make it so that when someone else opens and runs this file, it prompts them to enter 3 files to read from, and then writes the words to another file?

I've got:

import string
import glob
import os

# all files (split off file names) in a given directory
directory = "C:/Users/Desktop/*.txt"
# this would give you all files
for path in glob.glob(directory):
    # separate path from filename
    dirname, filename = os.path.split(path)
    print(filename)


s = filename
with open("s", "r") as g:
    g.readlines()

# remove all punctuation marks and make lower case
    s_nopunct = "".join(c for c in s if c not in string.punctuation).lower()

# convert to a sorted list of unique words via set comprehension
    list_unique = sorted(list({word for word in s_nopunct.split()}))

    print("\nSorted list of unique words in sentence:")
    print(list_unique)

with open("C:\\Users\\Desktop\\words.dat", "w") as f:
    f.write(s + "\n")
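
In the code above, s is the last file name printed by the loop, open("s", "r") looks for a file literally named s, and the result of readlines() is discarded, so the punctuation stripping runs on a file name rather than on any file's text. A minimal sketch of reading every matched file inside the loop instead, assuming Python 3 and keeping the thread's own punctuation-stripping approach; the function name unique_words_in_files is hypothetical:

```python
import glob
import string

def unique_words_in_files(pattern):
    """Return a sorted list of unique lowercase words from all files matching pattern."""
    unique = set()
    for path in glob.glob(pattern):
        with open(path, "r") as g:
            text = g.read()  # the file's contents, not its name
        # remove all punctuation marks and make lower case
        no_punct = "".join(c for c in text if c not in string.punctuation).lower()
        unique.update(no_punct.split())
    return sorted(unique)
```

The returned list can then be written out one word per line with the same for x in list_unique: f.write(x + "\n") loop used earlier in the thread.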