DaniWeb breaking links in posts

Question

happygeek 2,411 Most Valuable Poster

8 Years Ago

OK, so in this thread I offered advice to visit the Microsoft troubleshooter here: https://support.microsoft.com/en-gb/help/927477/how-to-troubleshoot-a-damaged-presentation-in-powerpoint-2007-and-powerpoint-2010 and DaniWeb broke the link and struck a line through it to say it didn't exist, when it quite clearly does.

It may well break again here, let's see. I suspect that DaniWeb isn't handling long URLs very well.

I ended up doing a Bit.ly on it so the chap can still get to the site, but I'd much rather have left the original so he knows he is going to Microsoft...

daniweb-bug php

Dani 4,558 The Queen of DaniWeb

8 Years Ago

Sorry, I haven't had usable access to my computer the past week (due to moving).

The problem is that our bad-link checker is catching false positives. This can happen when web servers reject connections from what they detect to be bots (and our link checker is a bot).

Once I have usable PC access again, I'll work on seeing if I can fix it.

Dani 4,558 The Queen of DaniWeb

8 Years Ago

Soooo, as it turns out, I have a usable PC again, and I'm not catching any glaring bugs with our link checker. The problem seems to be that the link in question here is not just timing out when DaniBot attempts to reach it. It's actually actively returning a 404 error to the bot. This means that the server is actually responding and saying yes, we received your request, that specific page doesn't exist. As of right now, the only thing I can think of is that the web server just doesn't like bots.

A lot of web servers that are tight on bandwidth actively choose to return bad HTTP status codes instead of serving up the correct content when they detect that the request is originating from a bot they don't recognize.

cereal 1,524 Nearly a Senior Poster

8 Years Ago

Hello Dani,

I don't think it's the user agent. I'm testing with PhantomJS, which uses this user agent:

Mozilla/5.0 (Unknown; Linux i686) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1

The testing script render.js:

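// render.js: load a URL in PhantomJS and save a screenshot of the rendered page.
var page   = require('webpage').create(),
    system = require('system'),
    vsize  = {width: 1280, height: 1024},
    address, output;

// Usage: phantomjs render.js URL OUTPUT_FILE
address = system.args[1];
output  = system.args[2];

// Render a fixed 1280x1024 viewport and clip the screenshot to it.
page.viewportSize = vsize;
page.clipRect = {
  top: 0,
  left: 0,
  width: vsize.width,
  height: vsize.height
};

page.open(address, function(status) {
  // status is 'success' or 'fail'; skip the screenshot if the page did not load.
  if (status !== 'success') {
    console.log('Unable to load ' + address);
    phantom.exit(1);
  } else {
    page.render(output);
    phantom.exit();
  }
});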

Execution:

./phantomjs render.js LINK output.png

And it works fine. In this specific case Microsoft is rejecting HEAD requests but allowing GET requests. In fact, a GET returns 200, but the page has no content because the content is loaded by JavaScript: test with Postman to see how it renders. So it seems a rendering engine is needed to show the content.
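The HEAD-versus-GET difference described above can be checked without a rendering engine by sending the same URL once with each method and comparing the status codes. The following is a minimal sketch for Node.js (not PhantomJS) using only the built-in https module. The names `probe`, `headOnlyFailure` and `compareMethods` are hypothetical, and what a given server returns today may differ from what was observed in this thread.

```javascript
const https = require('https');

// Send one request with the given method and resolve with the HTTP status code.
function probe(url, method) {
  return new Promise((resolve, reject) => {
    const req = https.request(url, { method: method, timeout: 15000 }, (res) => {
      res.resume(); // discard any body; only the status code matters here
      resolve(res.statusCode);
    });
    req.on('timeout', () => req.destroy(new Error('request timed out')));
    req.on('error', reject);
    req.end();
  });
}

// True when HEAD reports an error (4xx/5xx) but GET succeeds (2xx),
// which is the pattern reported for the Microsoft URL.
function headOnlyFailure(headStatus, getStatus) {
  return headStatus >= 400 && getStatus >= 200 && getStatus < 300;
}

// Probe the same URL with both methods and summarise the result.
async function compareMethods(url) {
  const head = await probe(url, 'HEAD');
  const get = await probe(url, 'GET');
  return { head: head, get: get, headOnlyFailure: headOnlyFailure(head, get) };
}
```

Calling `compareMethods(LINK).then(console.log)` needs network access. A result with `headOnlyFailure: true` would suggest the server treats HEAD differently from GET, rather than the page being missing.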

Edited 8 Years Ago by cereal

Dani 4,558 The Queen of DaniWeb

8 Years Ago

Yes, this website is very poorly configured. I can understand them choosing to not waste bandwidth on bots. However, they seem to:

return a 404 error on a HEAD request
return a 500 error on a GET request where a referer is set but a user agent is not set

I'm now making full GET requests (instead of just HEAD), setting the referer to the actual URL of the post that the link appears in and the user agent to DaniBot. It's more bandwidth, obviously, but hopefully it will allow us to bypass servers that don't like bots, without us faking our user agent or being a bad bot.

We're still going to get bad results for servers that actively reject requests where the user agent has "Bot" in the name, or isn't a known user agent (Mozilla, IE, Chrome, etc.). However, I think we have to be good botizens as well.
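The request described above (a full GET that names the bot honestly and sets a referer) can be sketched as a set of request options. This is an illustration in JavaScript, not DaniWeb's PHP code. The function name `buildLinkCheckRequest` is hypothetical, and the exact user agent string the real bot sends is an assumption based on the name "DaniBot" used in this thread.

```javascript
// Build request options for a link check that identifies itself honestly.
function buildLinkCheckRequest(postUrl) {
  return {
    method: 'GET', // full GET rather than HEAD, since some servers reject HEAD
    headers: {
      'User-Agent': 'DaniBot', // assumed string: the bot's own name, not a spoofed browser
      'Referer': postUrl       // the post in which the checked link appears
    }
  };
}
```

In Node.js these options could be passed as the second argument to `https.request(linkUrl, options, callback)`. Keeping the real bot name in the user agent lets site owners see who is making the request, at the cost of being rejected by servers that block unknown bots.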

Edited 8 Years Ago by Dani

Dani 4,558 The Queen of DaniWeb

8 Years Ago

"Better to let a few dead links through for now instead of killing everything."

We're not killing everything. We're just killing the links when servers return clearly incorrect / confusing status codes. For example, LinkedIn doesn't like bots, so they return an HTTP status 999, which we silently ignore. However, Microsoft seems to not like bots either, but they are returning HTTP 404 errors, which are reserved for pages that do not exist. If the server tells us the page doesn't exist, we have no way of verifying whether they're lying to us because they don't like bots or whether the page really does not exist. IMHO, it's a badly configured server that purposefully lies by using an HTTP status code that is specifically reserved for a different purpose.

Hopefully we're acting friendlier now by specifying a referer and user agent.
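The status handling described in this reply can be summarised as a small decision function. This is only a sketch in JavaScript: DaniWeb's checker is PHP and its real rules are not shown in the thread. The name `classifyLinkStatus` and its return labels are hypothetical, and the treatment of every code other than 999 and 404 is an assumption.

```javascript
// Decide what to do with a link based on the HTTP status the bot received.
function classifyLinkStatus(status) {
  if (status === 999) {
    return 'ignore'; // non-standard bot rejection (the LinkedIn case): leave the link alone
  }
  if (status === 404) {
    return 'broken'; // the server says the page does not exist, so the link is struck through
  }
  if (status >= 200 && status < 400) {
    return 'ok'; // assumption: success and redirect codes count as a working link
  }
  return 'undecided'; // assumption: the thread does not say how other codes are handled
}
```

The sketch shows why the Microsoft link was struck through: a 404 sent to a bot is indistinguishable from a 404 for a page that is really missing, whereas a non-standard code such as 999 can safely be ignored.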

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 1 · 2017-03-13T11:19:10+00:00

Now that is odd, as the link seems to be left alone here yet was quite clearly struck through in the original thread. You can see from the edit history of the original where I changed it.

[later] Ah, that was just my cached copy - a refresh and DaniWeb continues to have the screaming abdabs and breaks a perfectly valid URL...

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 2 · 2017-03-13T11:22:02+00:00

Even odder. In the original post DaniWeb has struck the Bit.ly shortened URL through as a broken link as well. WTF?

I can assure you that the Microsoft page exists, I've just visited it. And I can assure you the URL is correct as it was a direct cut and paste for both the original and the shortened version.

Can I change that to WTAF instead :-)

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 3 · 2017-03-13T11:24:21+00:00

And even funnier, if I click 'continue to the site anyways' or whatever the button says, it loads perfectly well.

This is one of the most mind-boggling things I've seen for a while. Most likely the user isn't even going to click on the struck-through link, and if they do and see the "doesn't exist" message, they are unlikely to click the "continue anyway" button...

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 4 · 2017-03-13T11:27:29+00:00

I have now put the original URL back into my post and told the member how he can get to it while DaniWeb goes and lies down to recover from the funny turn it is obviously having...

score 0 · Answer 5 · 2017-03-13T13:14:35+00:00

Reverend Jim 5,224 Hi, I'm Jim, one of DaniWeb's moderators.

8 Years Ago

Let's see what happens if you embed the link like this.

Edited 8 Years Ago by Reverend Jim

score 0 · Answer 6 · 2017-03-13T13:17:04+00:00

OK. It displays correctly and the link works. Maybe that's the secret - use [] and () to embed the link.

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 7 · 2017-03-13T17:01:03+00:00

No it doesn't Jim, the embedded link is displaying as broken here.

rproffitt 2,701 https://5calls.org Moderator · Answer 8 · 2017-03-13T17:43:05+00:00

I recall our host explaining why this happens. While it's an annoyance, I don't recall why this feature must be kept.

To RevJim: if I see a strikeout on a link like this, is that not like this?

cereal 1,524 Nearly a Senior Poster Featured Poster · Answer 9 · 2017-03-13T18:54:18+00:00

Test: https://http2.akamai.com/demo

//Okay, it's not due to HTTP/2 :p

score 0 · Answer 10 · 2017-03-13T18:55:48+00:00

That's odd because I posted it and it seemed fine. Then I refreshed the page and it was still fine. Perhaps there is some post-post-processing going on. Is there some automatic process crawling the threads and checking to see if the links are actually valid? If so then maybe that's the thing that is broken.

rproffitt 2,701 https://5calls.org Moderator · Answer 11 · 2017-03-13T20:32:11+00:00

Dani wrote in some prior discussion that DW scans these links and breaks them later for some reason. So a link looks good on posting, and then later the bot breaks it for us.

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 12 · 2017-03-14T07:49:33+00:00

Yep, it's a post-post process. And it's badly broken in a way that will prevent folk from following valid links and, as an aside, make them think the poster is a numpty :-)

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 13 · 2017-03-15T07:45:44+00:00

Dani? Any news as to what is happening here and how it can be prevented?

nullptr 167 Occasional Poster · Answer 14 · 2017-03-24T07:41:44+00:00

Another broken link in https://www.daniweb.com/hardware-and-software/hardware/threads/94301/help-my-keyboard-does-french-instead-of-question-marks/2#post2218579

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 15 · 2017-03-24T08:31:07+00:00

They are all over the place, looks very messy and far from helpful when people are trying to help other members.

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 16 · 2017-03-28T07:45:09+00:00

Dani, any news on this?

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 17 · 2017-03-29T08:21:12+00:00

Thanks.

Dani 4,558 The Queen of DaniWeb Administrator Featured Poster Premium Member · Answer 18 · 2017-03-31T07:57:32+00:00

So I tried to spoof the user agent and no luck. This is my cURL request:

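    // One cURL handle per URL being checked
    $ch[$row->url] = curl_init($row->url);
    curl_setopt($ch[$row->url], CURLOPT_RETURNTRANSFER, true);
    // Include the response headers in the output
    curl_setopt($ch[$row->url], CURLOPT_HEADER, true);
    // Skip the body: this makes cURL send a HEAD request instead of a GET
    curl_setopt($ch[$row->url], CURLOPT_NOBODY, true);
    curl_setopt($ch[$row->url], CURLOPT_FOLLOWLOCATION, true);
    // Allow 15 seconds to connect and 30 seconds for the whole request
    curl_setopt($ch[$row->url], CURLOPT_CONNECTTIMEOUT, 15);
    curl_setopt($ch[$row->url], CURLOPT_TIMEOUT, 30);
    // Do not verify the server's TLS certificate
    curl_setopt($ch[$row->url], CURLOPT_SSL_VERIFYPEER, false);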

$row->url is set to https://support.microsoft.com/en-gb/help/927477/how-to-troubleshoot-a-damaged-presentation-in-powerpoint-2007-and-powerpoint-2010

Can anyone let me know how to get curl_getinfo() to stop telling me that the request returns a 404 status code? I mean, obviously it's not lying, and the request is returning a 404. What I mean is: why does my PHP code get a 404 when my web browser does not?

score 0 · Answer 19 · 2017-03-31T12:42:31+00:00

How about modifying the bot until you find the problem so that it stops invalidating the links? Better to let a few dead links through for now instead of killing everything.

happygeek 2,411 Most Valuable Poster Team Colleague Featured Poster · Answer 20 · 2017-04-01T06:27:29+00:00

Well the link that kickstarted this thread is now showing properly, so looks like a job well done. Thanks Dani.