I've been writing a web crawler for a while and it's almost done. It works well, but when it crawls vBulletin forums I get weird URLs.

example:

forum/index.php?phpsessid=oed7fqnm9ikhqq9jvbt23lo8e4
index.php/topic,5583.0.html?phpsessid=93f6a28f192c8cc8b035688cf8b5e06d

Obviously this is being caused by PHP session IDs.

What can I do to stop this?
I tried using cookies with HTTP::Cookies but the problem persists.

thanks

Actually, using cookies fixed it, but the site has to be requested twice: the first request gets the cookie, and the second one returns normal links.
I will leave this thread open in case someone knows of another solution.
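
For anyone who finds this later, here is a rough sketch of the two-request approach. It assumes the forum stops appending the session ID to links once the client sends the session cookie back (this is how PHP's session.use_trans_sid setting usually behaves, but I have not checked it against every forum package). The helper name fetch_with_session is just something I made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# One user agent with an in-memory cookie jar, shared by every request,
# so a session cookie received once is sent back on all later requests.
my $ua = LWP::UserAgent->new( cookie_jar => HTTP::Cookies->new );

# Hypothetical helper: request the page once to pick up the session cookie,
# then request it again so the server sees the cookie. The assumption is
# that the second response no longer carries session IDs in its links.
sub fetch_with_session {
    my ( $agent, $url ) = @_;

    my $first = $agent->get($url);
    die 'Priming request failed: ' . $first->status_line . "\n"
        unless $first->is_success;

    my $second = $agent->get($url);
    die 'Second request failed: ' . $second->status_line . "\n"
        unless $second->is_success;

    return $second->decoded_content;
}

# Usage: pass the start URL on the command line.
if ( my $start_url = shift @ARGV ) {
    print fetch_with_session( $ua, $start_url );
}
```

Since the cookie jar lives in the user agent, only the first page fetched from each site should need the extra request, as long as the same $ua is reused for the rest of the crawl.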

If the phpsessid always occurs at the end of the URL, you could remove it with a regex substitution like this:

#!/usr/bin/perl
use strict;
use warnings;

my $url = 'index.php/topic,5583.0.html?phpsessid=93f6a28f192c8cc8b035688cf8b5e06d';

# Strip a trailing session ID; /i also matches the uppercase PHPSESSID form.
$url =~ s/\?phpsessid=\w+$//i;

print "$url\n";
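
If the session ID is not guaranteed to be the last thing in the URL, a slightly more general version might look like the sketch below. It assumes the query parameters are separated by a plain `&` and that the parameter is named phpsessid in any letter case; strip_session_id is a made-up helper name, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove a phpsessid parameter from anywhere in the
# query string while keeping the other parameters and any #fragment.
sub strip_session_id {
    my ($url) = @_;

    # Split into path, optional query string, and optional fragment.
    my ( $path, $query, $fragment ) =
        $url =~ /^([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/s;

    if ( defined $query ) {
        # Keep every parameter except the session ID (any letter case).
        my @kept = grep { !/^phpsessid(?:=|$)/i } split /&/, $query;
        $query = @kept ? join( '&', @kept ) : undef;
    }

    my $clean = $path;
    $clean .= "?$query"    if defined $query;
    $clean .= "#$fragment" if defined $fragment;
    return $clean;
}

print strip_session_id('forum/index.php?phpsessid=oed7fqnm9ikhqq9jvbt23lo8e4'), "\n";
print strip_session_id('forum/index.php?board=2&PHPSESSID=abc123#new'), "\n";
```

Normalizing every extracted link this way before adding it to the crawl queue should also stop the same page from being queued repeatedly under different session IDs.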