Hello,
I have a project to build a text comparer. I've downloaded almost 90,000 texts from a site, parsed them, and stored them in a file (150 MB). A new text, stored in another file, should be compared against the others, and the program should return the 20 most similar texts, each with its ID.

The problem is that the comparison is too slow. The program opens the file, stores one article at a time in a list (word by word), and compares it with the new text, which is also in a list. The comparing part goes something like this:

for word1 in list1:
  for word2 in list2:
    if word1 == word2:
      common += 1

Is there a way to speed things up?

Best Regards,
Igor

Python has a high-speed module for this sort of thing: take a look at the difflib module in the Python manual. Initially you would do a line-by-line compare and identify the lines that are different. If you need more help, ask the forum.
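As a rough sketch of what difflib can do (the word lists here are made up for illustration), SequenceMatcher scores how similar two sequences are:

```python
import difflib

# made-up word lists for illustration
a = "the quick brown fox".split()
b = "the quick red fox".split()

# SequenceMatcher scores sequence similarity from 0.0 to 1.0
ratio = difflib.SequenceMatcher(None, a, b).ratio()
print(ratio)  # 0.75 -- 3 of the 4 words in each list match
```

The same call works on lists of lines, lists of words, or plain strings.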

Thanks, I will take a look.

ok I used:

diffInstance = difflib.Differ()
diffList = list(diffInstance.compare(list1, list2))

and this returns a list of words with + or - before them. As I read and understand it, - means the word is only in list1, and + means it is only in list2. But I'm not getting any similar words; maybe I'm doing something wrong, can you help me a little? Also, I'm new to Python; I've been working in it for just a few months.

Be warned, I'd never heard of difflib before today, but your "(word by word)" structure does not sound appropriate: it looks as though you need to supply a pair of texts, each as a single string (including '\n'). Though how you convert that to a similarity score is unclear.
http://docs.python.org/lib/differ-examples.html

I modified one of vegaseat's Python snippets in the hope it will help you:

# find the difference between two texts

import difflib

text1 = """The World's Shortest Books:
Human Rights Advances in China
"My Plan to Find the Real Killers" by OJ Simpson
"Strom Thurmond:  Intelligent Quotes"
America's Most Popular Lawyers
Career Opportunities for History Majors
Different Ways to Spell "Bob"
Dr. Kevorkian's Collection of Motivational Speeches
Spotted Owl Recipes by the EPA
The Engineer's Guide to Fashion
Ralph Nader's List of Pleasures
"""

text2 = """The World's Shortest Books:
Human Rights Advances in China
"My Plan to Find the Real Killers" by OJ Simpson
"Strom Thurmond:  Intelligent Quotes"
America's Most Popular Lawyers
Career Opportunities for History Majors
Different Ways to Sell "Bob"
Dr. Kevorkian's Collection of Motivational Speeches
Spotted Owl Recipes by the EPA
The Engineer's Guide to Passion
Ralph Nader's List of Pleasures
"""

# create a list of lines in text1 (keeping the line endings)
text1_lines = text1.splitlines(True)

# ditto for text2
text2_lines = text2.splitlines(True)

diff_instance = difflib.Differ()
diff_list = list(diff_instance.compare(text1_lines, text2_lines))

print "Lines in diff_list with markers:"
for line in diff_list:
    print line,
print

print '-'*50

print "Lines common to text1 and text2:"
for line in diff_list:
    if line[0] == ' ':
        print line[2:],
print

print '-'*50

print "Lines unique to text1:"
for line in diff_list:
    if line[0] == '-':
        print line[2:],
print

print '-'*50

print "Lines unique to text2:"
for line in diff_list:
    if line[0] == '+':
        print line[2:],
print

Thanks guys, I just tried SequenceMatcher and it's going pretty fast: 4 minutes with ratio(), and 2 minutes with quick_ratio().
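For what it's worth, quick_ratio() is documented as an upper bound on ratio(): it compares element counts but ignores order, which is why it is faster. A small sketch (with made-up strings) where the two disagree:

```python
import difflib

# quick_ratio() only counts matching elements, ignoring their order,
# so it can be much higher than the true ratio()
sm = difflib.SequenceMatcher(None, "abcd", "dcba")
print(sm.ratio())        # 0.25 -- only one character lines up in sequence
print(sm.quick_ratio())  # 1.0  -- both strings use exactly the same characters
```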

I'm not sure comparing words will give the best results. If two articles use "and", "the", "a", "of", "to", etc. a lot, then they will be deemed similar. If you want to do it that way, then go through each of the two files and place each unique word in a set (that is, if it is not already found in that set). Then you can use the number of words from set_a.difference(set_b) to determine how similar they are. Sets are hash-based, so lookups should be hundreds or thousands of times faster.
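A minimal sketch of the set idea, with made-up sentences; note how a stop word like "the" shows up in the overlap, which is exactly the concern above:

```python
# made-up sentences for illustration
words_a = set("the cat sat on the mat".split())
words_b = set("the cat ate the rat".split())

# words in a but not in b
print(sorted(words_a.difference(words_b)))  # ['mat', 'on', 'sat']
# words in both -- the stop word "the" inflates the overlap
print(sorted(words_a & words_b))            # ['cat', 'the']
```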

Thanks for the code modification. I tried something similar, but the problem is that the program does not give any matches for "Lines common to text1 and text2", and this is the thing I need. I have to find the words that are contained in both text1 and text2.

Also, I don't know if SequenceMatcher's ratio gives the correct ratio for completely similar words or just for parts of words (e.g. monkey and key). All of my texts have already been stripped of \n, \t, punctuation like .,'; etc., and all letters are lowercase.
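On the ratio question: it depends on what you pass in. A quick sketch comparing strings (character level) versus lists of words:

```python
import difflib

# comparing strings matches characters, so "key" inside "monkey" counts
print(difflib.SequenceMatcher(None, "monkey", "key").ratio())
# ~0.67 -- the shared "key" characters score as a partial match

# comparing lists of words only matches whole, identical words
print(difflib.SequenceMatcher(None, ["monkey"], ["key"]).ratio())  # 0.0
```

So if whole-word matching is what you need, feed SequenceMatcher word lists, not raw strings.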

My texts have already been stripped of such words; they do not contain "and", "the", "a", "of", "to", etc. Thanks for the suggestion, I will give it a try.

It would help if we knew more about what you were trying to do. For example if you were trying to find texts on the same subject you would use a different strategy than if you were trying to vet student texts for copying. It may be that a hierarchical approach is better: i.e. filter on keywords taken from the probe file and then apply a more detailed comparison to the hits. The more we know the more we might be able to make useful suggestions.

The project is just to find the 10 most similar texts to the new text which will be given. The database of texts was created from a news site; the texts are in my native language (Macedonian). Each text is stored on one line, with an ID before it. The text base doesn't contain any stop words ("and", "the", "a", "of", "to") or any other characters, just letters (all lowercase) and numbers. I should return the number of words shared by the two texts divided by the number of words in text1 + text2. I have done that with the word-by-word comparison, but it is too slow.
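The score described above can be written directly with sets; a minimal sketch (the sample strings are made up for illustration, not from the Macedonian data):

```python
def similarity(text1, text2):
    # sketch of the score described above: distinct shared words
    # divided by the total word count of the two texts
    words1 = text1.split()
    words2 = text2.split()
    shared = set(words1) & set(words2)
    return float(len(shared)) / (len(words1) + len(words2))

print(similarity("ruby emerald topaz", "emerald topaz quartz"))
# 0.333... -- 2 shared words out of 6 total
```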

Using set.intersection would give the words that are found in both. Again, it is hash-based and would be much faster. If you want to use your original idea, then use a dictionary for the words in the base text, as the lookup would be hundreds of times faster than using two lists. This is an example using a dictionary instead of a list.

dic_1 = {}
for word1 in list1:
    if word1 not in dic_1:
        dic_1[word1] = 0
for word2 in list2:
    if word2 in dic_1:
        dic_1[word2] += 1

total_words_found = 0
for key in dic_1:
    if dic_1[key] > 0:
        total_words_found += 1
result = float(total_words_found) / (len(list1) + len(list2))
##
##   or
found_set = set_1.intersection(set_2)
result = float(len(found_set)) / (len(list1) + len(list2))
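Once each stored article has a score, the most similar ones can be picked out without sorting the whole list; a small sketch using heapq, with made-up (id, score) pairs:

```python
import heapq

# hypothetical (id, score) pairs, one per stored article
scores = [(17, 0.10), (42, 0.42), (3, 0.05), (8, 0.33)]

# take the N highest-scoring articles without a full sort
top = heapq.nlargest(2, scores, key=lambda pair: pair[1])
print(top)  # [(42, 0.42), (8, 0.33)]
```

With 90,000 articles, heapq.nlargest(20, ...) avoids keeping a fully sorted list in memory.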

I tried the sets just now and it's going very fast :D
It passed all the texts in 30 seconds, but it gives wrong results. Is it OK if I define the sets this way:

s1 = Set(text1.split(' '))
s2 = Set(text2.split(' '))
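One possible source of wrong results: splitting on a literal space yields empty strings wherever two spaces are adjacent, and those empty strings end up in the set. A quick check:

```python
text = "one  two three"  # note the double space
print(text.split(' '))   # ['one', '', 'two', 'three'] -- empty string included
print(text.split())      # ['one', 'two', 'three']
```

split() with no argument splits on any run of whitespace, which is usually what you want here. (In recent Python versions the built-in set can be used in place of sets.Set.)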

Now it's OK: using intersection instead of union, it's working FAST :)
Thanks guys for all your help
