Hey guys,

I'm working on a basic search engine and am really close to completion.

I currently have a function that takes a string and compares each word and its synonyms to a webpage.

My output at the moment is [("closeness" percentage of terms to webpage, webpage contents), (x, y), (x, y) ... (x, y)]

I am almost there, but I now need to remove the items that have no match to a site (i.e., where x = 0).

I have found out that the itemgetter() function isolates just the first values, and I then filtered out the zeros from there with this code:

import operator

def Google_search(string):
    internet_length = len(Internet)
    percentage_list = []
    
    for x in range(0,internet_length):
        position = x
        closeness_percentage = closeness(string, Internet[x])
        percentage_list.append([closeness_percentage, Internet[position]])

    sorted_list = sorted(percentage_list, key=operator.itemgetter(1), reverse = True)
##    print sorted_list
    
    ## now to delete the ones with zero percentage

    get_percentages = operator.itemgetter(0)
    percentages = map(get_percentages, sorted_list)
    print percentages
    no_zeros = [x for x in percentages if x != 0]  ## compare by value; "is not" tests identity
    print no_zeros
    print sorted_list

So an example of the output would be
[13, 0, 3, 2, 0, 0, 4, 0, 0, 6, 2, 3, 0, 0]
[13, 3, 2, 4, 6, 2, 3]

This is good; however, deleting the zeros from the percentage-only list does not delete them from the list with the webpages, obviously, as it's a new list!

I have been straining my brain for hours about how to get around this! I think I need to make a loop that compares the 2nd value in each SUBLIST to the values of the original list, then if it's a match return true, then filter the results! But I don't know how to do something like

for x in range(0, length):
     for y in range(0, no_zeros_length):
           if sorted_list[x].itemgetter(1) == no_zeros:
                   return true

Do you guys get what I mean? Or is there a much easier way to omit the zeros from the original list?

Thanks heaps in advance!

P.S. I've attached the file (rename to .py if you want to use it) so it's easier to understand what's going on, as this is part 4 and each part is dependent on the others before it (I thought it would be too much code for a post)!

Attachments
########## BEGIN OVERVIEW ############################################
##
##  ITB001 Assignment 1 - Let's Google
##
##  This file contains the template for ITB001 Assignment 1.  After
##  reading the instructions for the assignment you should complete
##  the marked sections for each of the four tasks and submit the
##  entire file via the Online Assessment System.
##
########## END OVERVIEW ##############################################



########## BEGIN STUDENT DETAILS #####################################
##
##  Solution created by:  Peter Marceta
## 
##  Student number:  n6325777
##
########## END STUDENT DETAILS #######################################



########## BEGIN MARKER'S FEEDBACK ###################################
##
##  Mark awarded (out of 20):
##  *** MARKER TO INSERT MARK ***
##
##  Comments about program functionality:
##  *** MARKER TO INSERT COMMENTS ABOUT TEST RESULTS ***
##
##  Comments about algorithm quality:
##  *** MARKER TO INSERT COMMENTS ABOUT QUALITY OF THE ALGORITHMS ***
##
##  Comments about code quality:
##  *** MARKER TO INSERT COMMENTS ABOUT QUALITY OF THE PROGRAM CODE ***
##
########## END MARKER'S FEEDBACK #####################################



########## BEGIN ACCEPTANCE TESTS ####################################
##
##  This section contains the acceptance tests for Tasks 1 to 4.
##  Comment out the code at the end of the file if you don't want
##  these tests to run, but leave the code uncommented when you
##  submit your program.
##
'''
---------- Tests for Task 1 (Synonyms) -------------------------------

Note: In this set of tests we use function "unique_items" (defined
at the end of this file) so that the order in which the lists are
returned is not significant.

A lower-case word which is found in the first position in a group
in the thesaurus:
>>> unique_items(synonyms('program'))
... ## Test 1.1
['program', 'programme', 'programmes', 'programming', 'programs']

A lower-case word which is found in the middle of a group in
the thesaurus:
>>> unique_items(synonyms('directs'))
... ## Test 1.2
['dir', 'directed', 'director', 'directs']

A lower-case word which is not in the thesaurus:
>>> unique_items(synonyms('unknown'))
... ## Test 1.3
['unknown']

A mixed-case word which is found in the first position
in a group in the thesaurus:
>>> unique_items(synonyms('Music'))
... ## Test 1.4
['band', 'bands', 'group', 'groups', 'music', 'musical', \
'musicals', 'rock', 'song', 'songs']

A mixed-case word which is found in the middle of a group in
the thesaurus:
>>> unique_items(synonyms('Thriller'))
... ## Test 1.5
['action', 'adventure', 'exciting', 'sci-fi', 'thriller', 'thrills']

A mixed-case word which is not in the thesaurus:
>>> unique_items(synonyms('Catherine'))
... ## Test 1.6
['catherine']

An example of a numeric "word", which is not in the thesaurus:
>>> unique_items(synonyms('1960'))
... ## Test 1.7
['1960']

An example of a non-alphanumeric "word", which does not
contain any of our selected punctuation marks and is not in the
thesaurus:
>>> unique_items(synonyms('****'))
... ## Test 1.8
['****']

An example of a hyphenated word, which we treat as a single
word:
>>> unique_items(synonyms('Sci-Fi'))
... ## Test 1.9
['action', 'adventure', 'exciting', 'sci-fi', 'thriller', 'thrills']

An example of a word with a possessive apostrophe, which we
treat as a single word (NB: the doubly-escaped apostrophe in the
test, i.e., "\\'" is due to the need to escape the escape character
in the docstring):
>>> unique_items(synonyms('Google\\'s'))
... ## Test 1.10
['google', "google's", 'googling']


---------- Tests for Task 2 (Search Terms) ---------------------------

Note: In this set of tests we use function "unique_items" (defined
at the end of this file) so that the order in which the lists are
returned is not significant.

A pattern consisting of a single lower-case word which is not
in the thesaurus or common-words list:
>>> unique_items(search_terms('books'))
... ## Test 2.1
['books']

A pattern consisting of a single lower-case word which is in
the thesaurus:
>>> unique_items(search_terms('tasks'))
... ## Test 2.2
['task', 'tasking', 'tasks']

A pattern consisting of a single mixed-case word which is not
in the thesaurus or common-words list:
>>> unique_items(search_terms('Python'))
... ## Test 2.3
['python']

A pattern consisting of a single mixed-case word which is in
the thesaurus:
>>> unique_items(search_terms('Comedy'))
... ## Test 2.4
['amusing', 'comedies', 'comedy', 'fun', 'funny', 'laugh', \
'laughs', 'satire', 'spoof']

A pattern consisting of a single lower-case word which is in
the common-words list:
>>> unique_items(search_terms('who'))
... ## Test 2.5
[]

A pattern consisting of a single mixed-case word which is in
the common-words list, and a punctuation mark:
>>> unique_items(search_terms('What?'))
... ## Test 2.6
[]

A pattern consisting of several lower-case words, all of
which are in the thesaurus:
>>> unique_items(search_terms('rock group comedy'))
... ## Test 2.7
['amusing', 'band', 'bands', 'comedies', 'comedy', 'fun', 'funny', \
'group', 'groups', 'laugh', 'laughs', 'music', 'musical', \
'musicals', 'rock', 'satire', 'song', 'songs', 'spoof']

A pattern consisting of several mixed-case words, all of which
are in the thesaurus:
>>> unique_items(search_terms('Googling Computer Programs'))
... ## Test 2.8
['calculate', 'calculates', 'computation', 'computational', \
'compute', 'computer', 'computers', 'computing', 'google', \
"google's", 'googling', 'program', 'programme', 'programmes', \
'programming', 'programs']

A pattern consisting of several lower-case words, all of which
are in the common-words list:
>>> unique_items(search_terms('and for what'))
... ## Test 2.9
[]

A pattern consisting of several mixed-case words, all of which
are in the common-words list:
>>> unique_items(search_terms('How in the Who'))
... ## Test 2.10
[]

A pattern consisting of several lower-case words, some of
which are in the thesaurus, some of which are in the common-words
list, and some of which are in neither:
>>> unique_items(search_terms('movies with rock groups'))
... ## Test 2.11
['band', 'bands', 'film', 'films', 'flicks', 'group', 'groups', \
'movie', 'movies', 'music', 'musical', 'musicals', 'pictures', \
'rock', 'song', 'songs']

A pattern consisting of several mixed-case words, some of
which are in the thesaurus, some of which are in the common-words
list, and some of which are in neither:
>>> unique_items(search_terms('Who was the Director of Goldfinger?'))
... ## Test 2.12
['dir', 'directed', 'director', 'directs', 'goldfinger']

Another typical search pattern:
>>> unique_items(search_terms('Show me Oscar winning films!'))
... ## Test 2.13
['film', 'films', 'flicks', 'me', 'movie', 'movies', 'oscar', \
'pictures', 'show', 'winning']

A search pattern which contains redundancy so that two of the
words in the pattern match the same group of words in the
thesaurus:
>>> unique_items(search_terms('Programs and programming'))
... ## Test 2.14
['program', 'programme', 'programmes', 'programming', 'programs']

Another search pattern with considerable redundancy:
>>> unique_items(search_terms('Rock bands, films and movies'))
... ## Test 2.15
['band', 'bands', 'film', 'films', 'flicks', 'group', 'groups', \
'movie', 'movies', 'music', 'musical', 'musicals', 'pictures', \
'rock', 'song', 'songs']

A pattern consisting of nothing (which is actually the
most common search pattern submitted to Google!):
>>> unique_items(search_terms(''))
... ## Test 2.16
[]


---------- Tests for Task 3 (Closeness) ------------------------------

A search pattern which produces no search terms:
>>> closeness('what', Internet[0]) # Star Wars
... ## Test 3.1
0

A search pattern which produces several search terms, none of
which appear in the web page:
>>> closeness('Computers and programming',
... Internet[5]) # Diplomaniacs
... ## Test 3.2
0

A search pattern where one term appears more than once in
the web page so gets counted for each distinct occurrence:
>>> closeness('Blondie and Dagwood', Internet[7]) # Blondie
... ## Test 3.3
22

An example where two different synonyms of a word in the search
pattern are found in the web page:
>>> closeness('Starring Harrison Ford', Internet[0]) # Star Wars
... ## Test 3.4
12

A search pattern where the only words that match the web
page are synonyms of those in the pattern:
>>> closeness('Scary funny films!', Internet[12]) # Howling III
... ## Test 3.5
9

A tenuous, but non-zero, match:
>>> closeness('A movie', Internet[0]) # Star Wars
... ## Test 3.6
3

A strong, but not perfect, match:
>>> closeness('One of those silly Wheeler and Woolsey comedies',
... Internet[5]) # Diplomaniacs
... ## Test 3.7
28

A search pattern which matches the web page perfectly:
>>> closeness('Dull tedious ** 1968 comedy Candy',
... Internet[2]) # Candy
... ## Test 3.8
100


---------- Tests for Task 4 (Google Search) --------------------------

A search pattern which produces no usable search terms, and thus
no search results at all:
>>> Google_search('What and where?')
... ## Test 4.1

A search which produces some usable search terms but none of
them match anything in any web page, and hence the search again
produces no results at all:
>>> Google_search('Programs for computers')
... ## Test 4.2

A search which matches a single web page, which is shorter than
our maximum line width so is printed in full:
>>> Google_search('Penny Singleton as Blondie Bumstead')
... ## Test 4.3
 22% 'Blondie (1938) *** First of the 28 Blondie comedies.'

A search which matches a single web page, which is longer than
our maximum line width so is truncated:
>>> Google_search('Who is Mark Hamill?')
... ## Test 4.4
  6% 'Star Wars (1977) ***1/2 Dir: George Lucas. Stars: Mark Ha...'

A precise search which matches only a couple of web pages:
>>> Google_search('Harold Ramis')
... ## Test 4.5
  8% 'Ghostbusters (1984) *** Starring: Dan Ackroyd, Harold Ram...'
  7% 'National Lampoon's Vacation (1983) *** Dir: Harold Ramis....'

A search which matches several web pages and shows how the
most relevant ones appear at the top:
>>> Google_search('Films with Orson Welles.')
... ## Test 4.6
 25% 'Citizen Kane (1941) **** Dir: Orson
Last Post by jrcagle

I think you want a dictionary, if I understand correctly. The dictionary is the standard way of mapping one set of items to another.

So you have

mydict = {URL1: 13, URL2: 0, URL3: 3, URL4: 2, ...}

And then you run this bit of code:

for URL in mydict.copy():
    if mydict[URL] == 0:
        mydict.pop(URL)

and then your list of hot URLs is simply mydict.keys().
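
For what it's worth, here is a minimal sketch of that idea that you can run on its own. The sample data and the names results, scores, hot_pages and ranked are made up for illustration; I'm assuming your pairs look like (percentage, page text), as in your percentage_list. It also shows an alternative that skips the dictionary and filters the pairs directly, so each percentage stays attached to its page.

```python
# Hypothetical sample data: (closeness percentage, page text) pairs,
# in the same shape that percentage_list holds in the question.
results = [
    (13, "page A"),
    (0, "page B"),
    (3, "page C"),
    (0, "page D"),
]

# Dictionary approach: map each page to its percentage,
# then build a new dictionary without the zero entries.
scores = dict((page, pct) for pct, page in results)
hot_pages = dict((page, pct) for page, pct in scores.items() if pct != 0)

# List approach: filter the pairs themselves, so the percentage
# and its page are removed (or kept) together.
non_zero = [(pct, page) for pct, page in results if pct != 0]

# Sort the surviving pairs by percentage, highest first.
ranked = sorted(non_zero, key=lambda pair: pair[0], reverse=True)
```

Either way, the point is to filter the paired data rather than a separate list of percentages.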

Jeff
