Hey guys,

I am making a bot for a link-sharing site. I am using an external curl request class, and I have a dilemma that I think you guys could help with.

It concerns query variables appended to the end of a URL, such as ?something=234. I do not know how to handle these.

Problem: I do not want the same web page stored in my database under more than one URL.

Ex 1:

http://www.amazon.com/gp/product/1455503304/
www.amazon.com/gp/product/1455503304/ref=as_li_tf_tl?ie=UTF8&linkCode=as2&camp=1789&creative=9325&creativeASIN=1455503304

Those two URLs actually point to the same resource regardless of the query vars.

However, Ex 2:

youtube.com/watch?v=OrAgxoWo_qs
youtube.com/watch?v=7p17AM0J-dk

These two URLs must keep their query vars because they are not the same resource; here the query vars DO make a difference.

Further information: If one user shares a URL for the first time, it will be stored in the database. If another user shares that same URL later, the share count on the already stored URL will increase. Some sites rely on query vars, and others do not.

How do I handle this problem?

I appreciate any help and responses you guys can provide.

Thanks.

All 2 Replies

Personally, I'd use a database to store which query vars matter for which domain. As soon as someone enters a URL from a new domain, you'd have to check which query vars (if any) you should allow. For example, your YouTube link could carry session, language, or other query vars that you don't need. I don't think there is a foolproof way to automate this.
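
To make the per-domain idea concrete, here is a minimal sketch in Python. Everything named in it is an assumption for illustration: the `DOMAIN_RULES` table stands in for the database, `normalize_url` is a hypothetical helper, and the rules for the two domains are guesses based only on the example URLs in the question. Note that in the Amazon example the path differs too (the trailing `ref=...` segment), so the sketch also allows an optional per-domain path rule. Unknown domains keep all their query vars, which errs on the side of not merging distinct pages.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical stand-in for the database table: for each domain, which
# query vars identify the resource, plus an optional path clean-up rule.
DOMAIN_RULES = {
    "youtube.com": {"keep": {"v"}},
    "amazon.com": {"keep": set(), "path_cut": re.compile(r"/ref=[^/]*$")},
}

def normalize_url(url):
    """Return a key that is equal for URLs we treat as the same page."""
    if "://" not in url:
        url = "http://" + url  # assumption: bare URLs are plain HTTP
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    path = parts.path or "/"
    pairs = parse_qsl(parts.query, keep_blank_values=True)

    rule = DOMAIN_RULES.get(host)
    if rule is not None:
        # Known domain: keep only the query vars that matter.
        pairs = [(k, v) for k, v in pairs if k in rule["keep"]]
        cut = rule.get("path_cut")
        if cut is not None:
            path = cut.sub("/", path)
    # Unknown domain: keep every query var until someone reviews it.

    # Sort so that parameter order does not create a different key.
    query = urlencode(sorted(pairs))
    # The scheme is fixed to "http" so http/https variants share a key.
    return urlunsplit(("http", host, path, query, ""))
```

With these rules, both Amazon URLs from the question produce the same key, while the two YouTube URLs stay distinct. The returned string could be what gets stored and compared in the database instead of the raw URL.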

I've also been looking into something since I posted this: canonical URLs and shortlinks specified in link tags in the page head.
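
As a rough illustration of that idea, the sketch below pulls `rel="canonical"` and `rel="shortlink"` link tags out of a fetched page using only Python's standard library. The class and function names are made up for this example, and it assumes the HTML has already been downloaded. Not every site provides these tags, so this would complement rather than replace other checks.

```python
from html.parser import HTMLParser

class CanonicalLinkParser(HTMLParser):
    """Collects the first canonical and shortlink <link> hrefs it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.shortlink = None

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel may hold several space-separated values.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and self.canonical is None:
            self.canonical = href
        if "shortlink" in rels and self.shortlink is None:
            self.shortlink = href

def find_canonical(html):
    """Return the canonical URL, else the shortlink, else None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical or parser.shortlink
```

If `find_canonical` returns a URL, the bot could store that instead of the URL the user submitted; if it returns `None`, the bot would have to fall back on some other rule.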
