Hello,
I have a project to build a text comparer. I've downloaded almost 90,000 texts from a site, parsed them, and stored them in a file (150 MB). A new text, stored in another file, should be compared against the others, and the program should return the 20 most similar texts, each with its ID.

The problem is that the comparison is too slow. The program opens the file, stores one article at a time in a list (word by word), and compares it with the new text, which is also in a list. The comparing part goes something like this:

for word1 in list1:
  for word2 in list2:
    if word1 == word2:
      common += 1

Is there a way to speed things up?

Best Regards,
Igor

Python has a high-speed module for this sort of thing: take a look at the difflib module in the Python manual. Initially you would do a line-by-line compare and identify the lines that are different. If you need more help, ask the forum.
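As a rough sketch of what difflib can do (the word lists here are made up for illustration), SequenceMatcher scores how similar two sequences are:

```python
import difflib

# made-up word lists for illustration
a = "the quick brown fox".split()
b = "the quick red fox".split()

# SequenceMatcher scores sequence similarity from 0.0 to 1.0
ratio = difflib.SequenceMatcher(None, a, b).ratio()
print(ratio)  # 0.75 -- 3 of the 4 words in each list match
```

The same call works on lists of lines, lists of words, or plain strings.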

Thanks, I will take a look.

ok I used:

diffInstance = difflib.Differ()
diffList = list(diffInstance.compare(list1, list2))

and this returns a list of words with + or - before them. As I read and understand it, - means the word is only in list1, and + means it is only in list2. But I'm not getting any similar words; maybe I'm doing something wrong, can you help me a little? Also, I'm new to Python; I've been working in it for just a few months.

Be warned, I'd never heard of difflib before today, but your "(word by word)" structure does not sound appropriate: it looks as though you need to supply a pair of texts, each as a single string (including '\n'). Though how you convert that to a similarity score is unclear.
http://docs.python.org/lib/differ-examples.html

I modified one of vegaseat's Python snippets in the hope it will help you:

# find the difference between two texts

import difflib

text1 = """The World's Shortest Books:
Human Rights Advances in China
"My Plan to Find the Real Killers" by OJ Simpson
"Strom Thurmond:  Intelligent Quotes"
America's Most Popular Lawyers
Career Opportunities for History Majors
Different Ways to Spell "Bob"
Dr. Kevorkian's Collection of Motivational Speeches
Spotted Owl Recipes by the EPA
The Engineer's Guide to Fashion
Ralph Nader's List of Pleasures
"""

text2 = """The World's Shortest Books:
Human Rights Advances in China
"My Plan to Find the Real Killers" by OJ Simpson
"Strom Thurmond:  Intelligent Quotes"
America's Most Popular Lawyers
Career Opportunities for History Majors
Different Ways to Sell "Bob"
Dr. Kevorkian's Collection of Motivational Speeches
Spotted Owl Recipes by the EPA
The Engineer's Guide to Passion
Ralph Nader's List of Pleasures
"""

# create a list of lines in text1 (keeping the line endings)
text1_lines = text1.splitlines(True)

# ditto for text2
text2_lines = text2.splitlines(True)

diff_instance = difflib.Differ()
diff_list = list(diff_instance.compare(text1_lines, text2_lines))

print "Lines in diff_list with markers:"
for line in diff_list:
    print line,
print

print '-'*50

print "Lines common to text1 and text2:"
for line in diff_list:
    if line[0] == ' ':
        print line[2:],
print

print '-'*50

print "Lines unique to text1:"
for line in diff_list:
    if line[0] == '-':
        print line[2:],
print

print '-'*50

print "Lines unique to text2:"
for line in diff_list:
    if line[0] == '+':
        print line[2:],
print

Thanks guys, I just tried SequenceMatcher and it's going pretty fast: 4 minutes with ratio(), and 2 minutes with quick_ratio().
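For what it's worth, quick_ratio() is documented as an upper bound on ratio(): it compares element counts but ignores order, which is why it is faster. A small sketch (with made-up strings) where the two disagree:

```python
import difflib

# quick_ratio() only counts matching elements, ignoring their order,
# so it can be much higher than the true ratio()
sm = difflib.SequenceMatcher(None, "abcd", "dcba")
print(sm.ratio())        # 0.25 -- only one character lines up in sequence
print(sm.quick_ratio())  # 1.0  -- both strings use exactly the same characters
```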

I'm not sure comparing words will give the best results. If two articles use "and", "the", "a", "of", "to", etc. a lot, then they will be deemed similar. If you want to do it that way, then go through each of the two files and place each unique word in a set (that is, if it is not already found in that set). Then you can use the number of words from set_a.difference(set_b) to determine how similar they are. Sets are hash-based, so lookups should be hundreds or thousands of times faster.
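A minimal sketch of the set idea, with made-up sentences; note how a stop word like "the" shows up in the overlap, which is exactly the concern above:

```python
# made-up sentences for illustration
words_a = set("the cat sat on the mat".split())
words_b = set("the cat ate the rat".split())

# words in a but not in b
print(sorted(words_a.difference(words_b)))  # ['mat', 'on', 'sat']
# words in both -- the stop word "the" inflates the overlap
print(sorted(words_a & words_b))            # ['cat', 'the']
```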

Thanks for the code modification. I tried something similar, but the problem is that the program does not give any matches for "Lines common to text1 and text2", and this is the thing I need. I have to find the words that are contained in both text1 and text2.

Also, I don't know if SequenceMatcher's ratio gives the correct ratio for completely similar words or just for parts of words (e.g. monkey and key). All of my texts have already been stripped of \n, \t, punctuation like .,'; etc., and all letters are lowercase.
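On the ratio question: it depends on what you pass in. A quick sketch comparing strings (character level) versus lists of words:

```python
import difflib

# comparing strings matches characters, so "key" inside "monkey" counts
print(difflib.SequenceMatcher(None, "monkey", "key").ratio())
# ~0.67 -- the shared "key" characters score as a partial match

# comparing lists of words only matches whole, identical words
print(difflib.SequenceMatcher(None, ["monkey"], ["key"]).ratio())  # 0.0
```

So if whole-word matching is what you need, feed SequenceMatcher word lists, not raw strings.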

My texts have already been stripped of such words; they do not contain "and", "the", "a", "of", "to", etc. Thanks for the suggestion, I will give it a try.

It would help if we knew more about what you were trying to do. For example if you were trying to find texts on the same subject you would use a different strategy than if you were trying to vet student texts for copying. It may be that a hierarchical approach is better: i.e. filter on keywords taken from the probe file and then apply a more detailed comparison to the hits. The more we know the more we might be able to make useful suggestions.

The project is just to find the 10 most similar texts to the new text which will be given. The database of texts was created from a news site; the texts are in my native language (Macedonian). Each text is stored on one line, with an ID before it. The text base doesn't contain any stop words ("and", "the", "a", "of", "to") or any other characters, just letters (all lowercase) and numbers. I should return the number of words shared by the two texts divided by the number of words in text1 + text2. I have done that with the word-by-word comparison, but it is too slow.
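The score described above can be written directly with sets; a minimal sketch (the sample strings are made up for illustration, not from the Macedonian data):

```python
def similarity(text1, text2):
    # sketch of the score described above: distinct shared words
    # divided by the total word count of the two texts
    words1 = text1.split()
    words2 = text2.split()
    shared = set(words1) & set(words2)
    return float(len(shared)) / (len(words1) + len(words2))

print(similarity("ruby emerald topaz", "emerald topaz quartz"))
# 0.333... -- 2 shared words out of 6 total
```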

Using set.intersection would give the words that are found in both. Again, it is hash-based and would be much faster. If you want to use your original idea, then use a dictionary for the words in the base text, as the lookup would be hundreds of times faster than using two lists. This is an example using a dictionary instead of a list.

dic_1 = {}
for word1 in list1:
    if word1 not in dic_1:
        dic_1[word1] = 0
for word2 in list2:
    if word2 in dic_1:
        dic_1[word2] += 1

total_words_found = 0
for key in dic_1:
    if dic_1[key] > 0:
        total_words_found += 1
result = float(total_words_found) / (len(list1) + len(list2))
##
##   or
found_set = set_1.intersection(set_2)
result = float(len(found_set)) / (len(list1) + len(list2))
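Once each stored article has a score, the most similar ones can be picked out without sorting the whole list; a small sketch using heapq, with made-up (id, score) pairs:

```python
import heapq

# hypothetical (id, score) pairs, one per stored article
scores = [(17, 0.10), (42, 0.42), (3, 0.05), (8, 0.33)]

# take the N highest-scoring articles without a full sort
top = heapq.nlargest(2, scores, key=lambda pair: pair[1])
print(top)  # [(42, 0.42), (8, 0.33)]
```

With 90,000 articles, heapq.nlargest(20, ...) avoids keeping a fully sorted list in memory.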

I tried the sets just now and it's going very fast :D
It passed all the texts in 30 seconds, but it gives wrong results. Is it OK if I define the sets this way:

s1 = Set(text1.split(' '))
s2 = Set(text2.split(' '))
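One possible source of wrong results: splitting on a literal space yields empty strings wherever two spaces are adjacent, and those empty strings end up in the set. A quick check:

```python
text = "one  two three"  # note the double space
print(text.split(' '))   # ['one', '', 'two', 'three'] -- empty string included
print(text.split())      # ['one', 'two', 'three']
```

split() with no argument splits on any run of whitespace, which is usually what you want here. (In recent Python versions the built-in set can be used in place of sets.Set.)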

Now it's OK: using intersection instead of union, it's working FAST :)
Thanks guys for all your help
