I would like to search for more than one word in a text file and, if a word is found, save the file to a group; I have groups of files such as English, Entertainment, etc. The problem is that I am unable to search for more than one string in a file. I wrote a few lines of code. Can anyone please tell me the correct procedure to classify text files?

import shutil

# Open the input file and read its whole contents for searching
with open("net.txt", "r") as fp:
    text = fp.read()

search = "sandhya"          # the single term being searched for
index = text.find(search)   # find() returns -1 when the term is absent
print(search, "found at index", index)

# Report which group this input file belongs to, then copy it there
print("name file")
shutil.copyfile("net.txt", "name.txt")

I hope someone can help me out with this.
Thanks.

All 4 Replies

Independently of writing the code, you must first define your classification criterion: how will you decide whether a file belongs to 'name', 'english', or 'entertainment'?

Consider whether each class is defined by a single word or by a set of words. Also think about which is better: going through the words in the file and finding where each one belongs, or, for each class, looking for the words that belong to it. I think you must loop over all the words in the file, categorize them, and analyze the result at the end. Or, if your categorizing is not so critical, you could stop reading words from the file as soon as you find one that you can put in a class. Do not expect a good result for the joke about Jesus, Ronald Reagan and Bill Gates then. (No, I do not know any such joke, but similar ones do exist, and you do mention entertainment ;) ).
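
To make the "loop over all words, categorize, and analyze at the end" idea concrete, here is a minimal sketch. The keyword sets, the `classify` function and the sample sentence are all made up for illustration; the real criterion has to come from whoever defines the groups.

```python
import re
from collections import Counter

# Hypothetical keyword sets, one per group (an assumption, not a real criterion).
CATEGORIES = {
    "name": {"sandhya", "ravi"},
    "english": {"grammar", "verb", "noun"},
    "entertainment": {"movie", "song", "actor"},
}


def classify(text, categories=CATEGORIES):
    """Return (best_category, counts); best_category is None if nothing matched."""
    # Split the text into lowercase words so matching ignores case and punctuation.
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for word in words:
        for name, keywords in categories.items():
            if word in keywords:
                counts[name] += 1
    if not counts:
        return None, counts
    # Analyze at the end: pick the category with the most keyword hits.
    best = counts.most_common(1)[0][0]
    return best, counts


sample = "The actor sang a song in the movie about a verb."
best, counts = classify(sample)
print(best, dict(counts))
```

Once a category has been chosen, the file can be copied into that group with `shutil.copyfile`, as in the original post. The "stop at the first match" variant would simply return from inside the loop, at the cost of misclassifying texts that mention words from several groups.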

I would like to apply stemming and stop-word removal to the words read from a file.
This is the stemming algorithm:
1. Read the tokenized file after splitting
2. Read the stemming file
3. If a prefix is in the stemming file
4. If a postfix (suffix) is in the stemming file
5. Remove the prefix in the file
6. Remove the postfix in the file
7. End if
8. End if
9. Write the stemming file
10. Close the file

Stop-word algorithm:

1. Read the stop words from the list
2. If a stop word is in the file
3. Remove the stop word
4. Write the file
5. Close the file
6. End if

My code is below, but it doesn't work. How can I make it work?
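def removestemm(token, pre):
    # Runs after tokenization. token is a list of words.
    # pre is the list of prefixes (passed in here; it was an undefined global before).
    stemmed = []
    for w in token:
        for p in pre:
            if w.startswith(p):
                w = w[len(p):]   # strip only the leading prefix
                break
        if w.endswith('ed'):
            w = w[:-2]           # strip only the trailing 'ed'
        stemmed.append(w)
    return stemmed


def removestopword(token):
    # Read the stop words into a set for fast lookup.
    # Assumes the file is UTF-8, like the commented-out corpus line below.
    with open('G:/ሚኪ/fullist/stopword.txt', 'r', encoding='utf-8') as f:
        stopwords = set(f.read().split())
    #fin = open("f:\corpus.txt", "r",encoding="utf-8" ).read()
    filteredtext = [word for word in token if word not in stopwords]
    print(filteredtext)
    # token is a list and has no write() method, so return the result instead
    return filteredtext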
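
Two Python details tend to trip up stemming loops of this kind, and the short sketch below shows both. The word list is invented for the demonstration. First, assigning to the loop variable only rebinds that name and leaves the list untouched. Second, `str.replace` removes every occurrence of the substring, not just the one at the start or end of the word.

```python
tokens = ["reedited", "played"]

# Pitfall 1: rebinding the loop variable does not modify the list.
for w in tokens:
    if w.endswith("ed"):
        w = w[:-2]
unchanged = list(tokens)   # still the original words

# Pitfall 2: replace() removes 'ed' wherever it occurs in the word.
everywhere = "reedited".replace("ed", "")
only_suffix = "reedited"[:-2]   # slicing removes just the trailing two characters

# One way around both: build a new list with the stripped words.
stemmed = [w[:-2] if w.endswith("ed") else w for w in tokens]

print(unchanged)
print(everywhere, only_suffix)
print(stemmed)
```

This is only a naive suffix stripper for illustration; a real stemmer needs rules for short words and for the language in question.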

Hello all,
I have a Hindi text file, from which I tried to get the index of a word (a Hindi word).

I am getting some unusual output (shown in the attached image).
Kindly help.
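
Without the image it is hard to say what went wrong here, but one common cause of odd output with Hindi text is reading the file with the wrong encoding. The sketch below assumes the file is UTF-8; the file name `hindi_demo.txt`, the sample sentence and the search word are all made up so that the example runs on its own.

```python
import os
import tempfile

# Create a small UTF-8 file so the example is self-contained (hypothetical content).
sample = "मेरा नाम संध्या है"
path = os.path.join(tempfile.gettempdir(), "hindi_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read it back, stating the encoding explicitly instead of relying on the default.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

search = "नाम"
index = text.find(search)   # -1 if the word is not present

# The index counts Unicode code points, so vowel signs (matras) count separately.
print(index)
print(text[index:index + len(search)] == search)
```

If the console itself cannot display Devanagari, the printed text may still look wrong even though the index is correct, so checking the slice as above is a safer test than reading the output by eye.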
