I've been programming a web crawler for a while and I'm almost done. It works perfectly, but when it crawls vBulletin forums I get weird URLs.

Examples:

forum/index.php?phpsessid=oed7fqnm9ikhqq9jvbt23lo8e4
index.php/topic,5583.0.html?phpsessid=93f6a28f192c8cc8b035688cf8b5e06d

Obviously this is being caused by PHP session IDs.

What can I do to stop this?
I tried using cookies with HTTP::Cookies but the problem persists.

Thanks

All 3 Replies

Actually, using cookies fixed it, but the site has to be requested twice: the first time to get the cookie, and the second time to get normal links.
I will leave this thread open in case someone knows of another solution.
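
For anyone finding this later, here is a minimal sketch of the two-request approach. It assumes LWP::UserAgent as the HTTP client (the thread only names HTTP::Cookies), and fetch_with_session is a hypothetical helper name, not part of the original crawler.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# One user agent with an in-memory cookie jar, shared by every request.
# Passing a hash ref makes LWP create an HTTP::Cookies object internally.
my $ua = LWP::UserAgent->new(cookie_jar => {});

# Hypothetical helper: request the page once to pick up the session cookie,
# then request it again so the links come back without phpsessid.
sub fetch_with_session {
    my ($ua, $url) = @_;
    my $first = $ua->get($url);      # server sets the session cookie here
    return unless $first->is_success;
    my $second = $ua->get($url);     # cookie is sent back automatically
    return $second->is_success ? $second->decoded_content : undef;
}

# Usage: perl crawl.pl <forum-url>
if (@ARGV) {
    my $html = fetch_with_session($ua, $ARGV[0]);
    print defined $html ? $html : "request failed\n";
}
```

Because the same $ua is reused, the double request should only be needed for the first page of each site. Later requests to that site already carry the cookie.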

If the phpsessid always occurs at the end of the URL, you could remove it with a regex substitution like this:

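#!/usr/bin/perl
use strict;
use warnings;

my $url = 'index.php/topic,5583.0.html?phpsessid=93f6a28f192c8cc8b035688cf8b5e06d';

# Strip a trailing "?phpsessid=..." (only matches at the very end of the URL).
$url =~ s/\?phpsessid=\w+$//;

print "$url\n";    # index.php/topic,5583.0.html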
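
If the session ID can also show up in the middle of the query string, or in upper case as PHPSESSID, a slightly more general version might look like the sketch below. strip_session_id is a hypothetical helper name, and the second sample URL is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove a phpsessid parameter wherever it appears
# in the query string, ignoring case.
sub strip_session_id {
    my ($url) = @_;
    # Drop the parameter itself, keeping the '?' or '&' in front of it.
    $url =~ s/([?&])phpsessid=[^&#]*&?/$1/gi;
    # Remove a separator left dangling at the end or before a fragment.
    $url =~ s/[?&](?=#|$)//;
    return $url;
}

# prints forum/index.php
print strip_session_id('forum/index.php?phpsessid=oed7fqnm9ikhqq9jvbt23lo8e4'), "\n";
# prints index.php?board=2&action=post
print strip_session_id('index.php?board=2&PHPSESSID=abc123&action=post'), "\n";
```

This stays with plain regexes so it runs on core Perl. A URL-parsing module would be the more careful choice if the crawler already depends on one.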

Try going over HTTPS for security. Also clear all cookies.