I have a text document, 'topics.txt':

1~cocoa

2~

3~

4~

5~grain~wheat~corn~barley~oat~sorghum

6~veg-oil~linseed~lin-oil~soy-oil~sun-oil~soybean~oilseed~corn~sunseed~grain~sorghum~wheat

7~

8~

9~earn

10~acq

and so on.
Here the numbers correspond to the file names; I have about 20000 files.

    import os
    import re
    import sys
    sys.stdout=open('f1.txt','w')
    from collections import Counter
    from glob import glob

    def removegarbage(text):
        text=re.sub(r'\W+',' ',text)
        text=text.lower()
        return text

    folderpath='d:/individual-articles'
    counter=Counter()


    filepaths = glob(os.path.join(folderpath,'*.txt'))

    num_files = len(filepaths)

    with open('topics.txt','r') as filehandle:
        lines = filehandle.read()
        words = removegarbage(lines).split()
        counter.update(words)


    for word, count in counter.most_common():
        probability=float(count)/num_files
        print('{}  {} {}'.format(word,count,probability))

I need my output to be of the form:
word, count of word in topics.txt, probability, list of files, number of files in list

So far my program works fine up to the probability, but how do I get a list of the files belonging to a word?
For example, 'grain' must map to the list (5, 6, ...).

Would counting the line number for each word and storing the line number work?

How do I go about it?
Please help!

I'm not sure of a really efficient way of doing this; someone else might be. But by iterating over each line in your test data, I was able to build a dictionary with each word as a key and a list of file numbers as the value, like {"grain": ["5", "6"], "soybean": ["6"], ... }. Running this over that many items may not be the best idea, but it's a start. The other thing I thought of was restructuring the file to better suit your needs.

    # Assumes topics.txt is in the current directory.
    with open('topics.txt', 'r') as filehandle:
        testdata = filehandle.read()

    wordfiles = {}
    # basically cycling through each line,
    # maybe you could integrate it into whatever processing you
    # are already doing.
    for line in testdata.split('\n'):
        # just bypassing any junk/empty lines
        line = line.strip()
        if line != "":
            items = line.split('~')
            fileno = items[0]
            # grab each word from this line
            # empty entries don't count.
            words = [w for w in items[1:] if w]
            for word in words:
                if word in wordfiles:
                    # key already created, add this fileno to the list
                    wordfiles[word].append(fileno)
                else:
                    # key doesn't exist, create a list value
                    wordfiles[word] = [fileno]

    # Check which files "grain" is in.
    print("grain is in " + str(len(wordfiles["grain"])) + " files:")
    print("File numbers:\n    " + "\n    ".join(wordfiles["grain"]))
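
If it helps, here is a rough sketch of how the dictionary approach above might be combined with the Counter from the question to get every column in one pass: word, count, probability, list of files, number of files. The function name build_report is made up for illustration, the sample lines are copied from the topics.txt excerpt in the question, and num_files=10 is only an assumption for the demo (in the real script it would come from len(filepaths)). Note that it splits on '~' instead of using the regex cleanup, so hyphenated topics like 'veg-oil' stay in one piece and the file numbers are not counted as words.

```python
from collections import Counter, defaultdict


def build_report(lines, num_files):
    """Return rows of (word, count, probability, file_numbers, n_files).

    build_report is a hypothetical helper, not part of the original script.
    """
    counter = Counter()
    wordfiles = defaultdict(list)
    for line in lines:
        items = line.strip().split('~')
        fileno = items[0]
        if not fileno:
            continue  # skip blank lines
        for word in items[1:]:
            if not word:
                continue  # "2~" style lines have no topics
            counter[word] += 1
            files = wordfiles[word]
            # Lines are read in order, so a repeat of the same file number
            # can only be the last entry; this avoids listing a file twice.
            if not files or files[-1] != fileno:
                files.append(fileno)

    rows = []
    for word, count in counter.most_common():
        probability = float(count) / num_files
        files = wordfiles[word]
        rows.append((word, count, probability, files, len(files)))
    return rows


# Sample lines taken from the topics.txt excerpt in the question.
sample = [
    "1~cocoa",
    "2~",
    "5~grain~wheat~corn~barley~oat~sorghum",
    "6~veg-oil~linseed~lin-oil~soy-oil~sun-oil~soybean~oilseed~corn"
    "~sunseed~grain~sorghum~wheat",
    "9~earn",
    "10~acq",
]

# num_files=10 is an assumed value for this demo only.
for word, count, prob, files, n in build_report(sample, num_files=10):
    print('{},{},{},{},{}'.format(word, count, prob, ' '.join(files), n))
```

With the sample above, the row for 'grain' lists files 5 and 6. For the real data you could pass the open file handle for topics.txt as lines, since iterating over a file yields one line at a time.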