I'm trying to create a web crawler. I've read about what a web crawler is for and how it works.
But I need more information. Could you please tell me what a web crawler can do? What kinds of tasks can I define for my web crawler? What can I ask it to do?


Some things I learned when I created a basic search engine:
* Count the number of in_links (pages linking into a site) and out_links (pages linking out of the site) to measure a website's popularity - more in_links probably means more popular
* Record the text found in a website and put it in a dictionary with individual words as keys and the URLs where they were found as values - useful for making a search engine.
* You can combine the two above to make a search engine that ranks the webpages based on a simple scoring system (score = in_links - out_links)
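
To make the scoring idea concrete, here is a minimal sketch. It assumes the crawler has already produced a dictionary mapping each page to the pages it links to; the `link_graph` data and the `score_pages` name are made up for illustration, and I work per page rather than per site to keep it short.

```python
from collections import defaultdict

# Hypothetical crawl result: each page mapped to the pages it links to.
link_graph = {
    'a.html': ['b.html', 'c.html'],
    'b.html': ['c.html'],
    'c.html': ['a.html'],
    'd.html': ['c.html'],
}

def score_pages(graph):
    """Return {page: in_links - out_links} for every page seen in the graph."""
    in_links = defaultdict(int)
    out_links = defaultdict(int)
    for page, targets in graph.items():
        # Assumption: duplicate links and links to the page itself are ignored.
        unique_targets = set(targets) - {page}
        out_links[page] = len(unique_targets)
        for target in unique_targets:
            in_links[target] += 1
    pages = set(graph) | set(in_links)
    return {page: in_links[page] - out_links[page] for page in pages}

scores = score_pages(link_graph)
print(scores['c.html'])  # 2 (three pages link in, one link goes out)
```

Whether to ignore duplicate links is a design choice; counting them would let one page inflate another page's score.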

Other Ideas:
* Use the web crawler to find <img> tags and somehow distinguish between 'Visual/Picture' type websites and 'Document/Text' type sites
* Use the web crawler to create a graphical representation of the network of pages - what does the internet look like visually?


Hello @Slyte, thank you for your explanation and your other ideas!

Well, let me ask you some more questions.
I'd also like to ask you to make some parts clearer for me. My English is not very good, so sometimes I need a clearer explanation, and I'll be happy if you help me better understand the parts I didn't get! Thank you in advance :)

The second point (record text found...): can you explain it more, please? For example, what kind of words should I record when I visit a webpage? What do you mean by a dictionary with individual words as keys and values? And what is it used for?

And about your other ideas:
Can you explain the first idea more clearly, please? I didn't understand its purpose exactly, but it seems like an interesting idea to me.

And the second idea: I didn't understand it. What do you mean?

Your explanation and ideas gave me some new ideas of my own :)


For example, say we have two webpages called helloworld.html and hellopython.html.

helloworld.html contains the text "Hello World".
If you look at its source, it would have a line like
<p>Hello World</p>

Likewise, hellopython.html contains "Hello Python", so its source would have the line
<p>Hello Python</p>

Now I can instruct my crawler to go through these webpages and pull out the lines in the format <p>some text</p>. (There are other things you could extract too, like the title of the page, maybe the author, etc.)

I then make each individual word a key in a dictionary (a built-in data type in Python) and set its value to a list of the URLs where I found it.

In my example, after the crawler has processed both pages, the dictionary becomes
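
One possible way to pull out the <p> text is with the standard library's `html.parser` module. This is only a sketch: the `ParagraphExtractor` class and `extract_paragraphs` function are names I made up, and it assumes the HTML source has already been downloaded into a string.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects the text found between <p> and </p> tags."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self.paragraphs.append('')  # start a new paragraph

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        # Only keep text that appears inside a paragraph.
        if self.in_p:
            self.paragraphs[-1] += data

def extract_paragraphs(html):
    """Return the non-empty paragraph texts of an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]

print(extract_paragraphs('<html><body><p>Hello World</p></body></html>'))
# ['Hello World']
```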
words = {'Hello':['helloworld.html','hellopython.html'], 'World':['helloworld.html'], 'Python':['hellopython.html']}
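
A minimal sketch of how that dictionary could be built, assuming the crawler has already collected the text of each page into a `pages` dictionary (that variable and the `build_index` name are just for illustration):

```python
def build_index(pages):
    """Map each word to the list of URLs whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in text.split():
            urls = index.setdefault(word, [])
            if url not in urls:  # list each URL only once per word
                urls.append(url)
    return index

# Hypothetical crawl result: URL mapped to the text extracted from the page.
pages = {
    'helloworld.html': 'Hello World',
    'hellopython.html': 'Hello Python',
}
words = build_index(pages)
print(words['Hello'])  # ['helloworld.html', 'hellopython.html']
```

A real crawler would probably also lowercase the words and strip punctuation before indexing, so that 'hello' and 'Hello,' end up under the same key.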

As a general concept, think of a dictionary as an array of values where you use the keys, instead of numeric positions, to index them. For example,
print(words['Hello']) outputs ['helloworld.html','hellopython.html']

Now if you have built a search engine, you can use this dictionary to retrieve the sites containing the words you are looking for.
If you search for 'Hello', it would give you the list ['helloworld.html','hellopython.html']

If you use the scoring system I posted and it turns out hellopython.html has a higher score (more popular), the search engine would sort the results and output ['hellopython.html','helloworld.html'] instead.
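
Roughly, the ranked lookup could look like the sketch below. The `search` function is a made-up name and the score values are invented for the example; in practice they would come from the in_links - out_links calculation.

```python
def search(index, scores, word):
    """Return the URLs containing `word`, highest score first."""
    urls = index.get(word, [])  # unknown words give an empty result
    # Assumption: pages without a recorded score count as 0.
    return sorted(urls, key=lambda url: scores.get(url, 0), reverse=True)

words = {
    'Hello': ['helloworld.html', 'hellopython.html'],
    'World': ['helloworld.html'],
    'Python': ['hellopython.html'],
}
# Invented scores, only to show the reordering.
scores = {'helloworld.html': 1, 'hellopython.html': 3}

print(search(words, scores, 'Hello'))
# ['hellopython.html', 'helloworld.html']
```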

The first was an idea for some sort of webpage categorization - if a site has lots of pictures then label it as Visual/Picture type. Pictures in webpages are usually marked up with <img> tags, so you can count those tags to see how many pictures a webpage has.
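
Counting the <img> tags could be sketched like this with `html.parser`. The `ImageCounter` and `classify` names are made up, and the threshold of 5 images is an arbitrary assumption that you would need to tune on real pages.

```python
from html.parser import HTMLParser

class ImageCounter(HTMLParser):
    """Counts the <img> tags seen while parsing a page."""

    def __init__(self):
        super().__init__()
        self.images = 0

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == 'img':
            self.images += 1

def classify(html, min_images=5):
    """Label a page by its image count (min_images is an arbitrary cutoff)."""
    counter = ImageCounter()
    counter.feed(html)
    counter.close()
    if counter.images >= min_images:
        return 'Visual/Picture'
    return 'Document/Text'

print(classify('<p>Just some text</p><img src="logo.png">'))  # Document/Text
```

A fixed count is crude; comparing the number of images to the amount of text on the page might separate the two types better.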

My second idea is something like this: Opte Project
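
On a much smaller scale, you could turn the links your crawler finds into a graph description and let a drawing tool lay it out. The sketch below writes the links as Graphviz DOT text; the `to_dot` name and the sample data are made up, and it assumes the page names contain no double quotes.

```python
def to_dot(graph, name='web'):
    """Describe a {page: [linked pages]} dictionary as a DOT directed graph."""
    lines = ['digraph %s {' % name]
    for page in sorted(graph):
        for target in sorted(set(graph[page])):
            # One arrow per link: page -> linked page.
            lines.append('    "%s" -> "%s";' % (page, target))
    lines.append('}')
    return '\n'.join(lines)

# Hypothetical crawl result.
link_graph = {
    'helloworld.html': ['hellopython.html'],
    'hellopython.html': ['helloworld.html'],
}
print(to_dot(link_graph))
```

Saving the output to a file and running it through a Graphviz layout program should give you a picture of how the pages connect.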
