Hi everybody, I'm interested in creating a web crawler, but can't really settle on what I'd like the program to do. It's more of an exercise in learning the technology and then expanding on it to achieve new, great things.

I am proficient in Python, so I will naturally be using that language, alongside the module urllib2, because I have some experience with it, and it is fantastic for pulling a webpage's source code, which can then be parsed.

So what we have so far:
- Python
- urllib2 module

I will need to research regex and re-learn it, so that I will be able to create the functions that will handle parsing the page source in order to extract all URLs.
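A minimal sketch of that URL-extraction step, assuming Python 3 (where urllib2's functionality moved into urllib.request and urllib.error). The function name `extract_links` and the pattern `HREF_RE` are made up for illustration, and the regex is deliberately simple: it only looks at quoted href attributes and will miss unquoted ones or links built by JavaScript.

```python
import re
from urllib.parse import urljoin, urldefrag

# Hypothetical pattern: captures the value of href="..." or href='...'.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(page_source, base_url):
    """Return absolute http(s) URLs from href attributes, in order, without duplicates."""
    links = []
    seen = set()
    for raw in HREF_RE.findall(page_source):
        # Resolve relative links against the page's own URL, then drop any #fragment.
        url = urldefrag(urljoin(base_url, raw.strip()))[0]
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Calling `extract_links(source, "http://example.com/")` on a page's source gives a list of absolute URLs; things like mailto: links are skipped because they don't start with http:// or https://.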

Now this is where my question really comes in. What types of things can I/should I use a web crawler to do?

Throw some really interesting things at me! Thanks!

All 6 Replies

No comments? Maybe this isn't the best board for me to be discussing this technology in.

Note to Mods: Maybe this should be moved to the Python boards? I'd rather not get in trouble for making a duplicate topic over there. Thanks

The problem is that you just list a few buzzwords, yet you don't seem to know what they mean and apparently haven't gone to the trouble of finding out.
That shows a lack of interest in doing your own work, which leads to us being disinclined to help you.

Note to Mods: Maybe this should be moved to the Python boards?

That's fine with me. Moved.

Jwenting, not sure what your deal is, but I didn't throw around any buzzwords without knowing what they mean. I've been programming in Python for some time now, and have worked on some very large projects. I know what Python is, I know what modules are, and I know what regex is, so what part of my post exactly did you have a problem with?

As for the point of the topic, all I wanted were some ideas related to: "What types of things can I/should I use a web crawler to do?"

Please take your elitism elsewhere.

Ah, just in time for Halloween.

A web crawler visits a given URL and retrieves any URLs from the hyperlinks on that page. It visits these URLs and collects more URLs and so on. Kind of spooky.
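That visit-collect-repeat loop can be sketched roughly like this, assuming Python 3. The names `crawl`, `fetch`, `fetch_with_urllib`, and `max_pages` are hypothetical, and the link regex is a simple stand-in for real HTML parsing. The `fetch` argument is any function that takes a URL and returns the page source as a string, or None if the page could not be retrieved.

```python
import re
from collections import deque
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

# Simple stand-in for HTML parsing: quoted href attributes only.
LINK_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def fetch_with_urllib(url):
    """One possible fetcher: return the page source, or None on failure."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return None

def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl from start_url; returns the URLs visited, in order."""
    queue = deque([start_url])
    seen = {start_url}          # every URL ever queued, so nothing is queued twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        source = fetch(url)
        if source is None:      # page could not be retrieved; move on
            continue
        visited.append(url)
        for raw in LINK_RE.findall(source):
            link = urldefrag(urljoin(url, raw.strip()))[0]
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Something like `crawl("http://example.com/", fetch_with_urllib, max_pages=20)` would run it against a live site. The `max_pages` cap is there because the "and so on" part otherwise never ends on the open web.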

What you do with these URLs is up to you. You can collect all the images, do data mining, spy, steal information, etc.
