OK, so in this thread I offered advice to visit the Microsoft troubleshooter here: https://support.microsoft.com/en-gb/help/927477/how-to-troubleshoot-a-damaged-presentation-in-powerpoint-2007-and-powerpoint-2010 and DaniWeb broke the link and struck a line through it to say it didn't exist, when it quite clearly does.

It may well break again here, let's see. I suspect that DaniWeb isn't handling long URLs very well.

I ended up doing a Bit.ly on it so the chap can still get to the site, but I'd much rather have left the original so he knows he is going to Microsoft...

Now that is odd, as the link seems to be left alone here yet was quite clearly struck through in the original thread. You can see from the edit history of the original where I changed it.

[later] Ah, that was just my cached copy - a refresh and DaniWeb continues to have the screaming abdabs and breaks a perfectly valid URL...

Even odder. In the original post DaniWeb has struck the Bit.ly shortened URL through as a broken link as well. WTF?

I can assure you that the Microsoft page exists; I've just visited it. And I can assure you the URL is correct, as it was a direct cut and paste for both the original and the shortened version.

Can I change that to WTAF instead :-)

And even funnier, if I click 'continue to the site anyways' or whatever the button says, it loads perfectly well.

This is one of the most mind-boggling things I've seen for a while. Most likely the user isn't even going to click on the struck-through link, and if they do and see the "doesn't exist" message, they are unlikely to click the "continue anyway" button...

I have now put the original URL back into my post and told the member how he can get to it while DaniWeb goes and lies down to recover from the funny turn it is obviously having...

Let's see what happens if you embed the link like this.

OK. It displays correctly and the link works. Maybe that's the secret - use [] and () to embed the link.

No it doesn't Jim, the embedded link is displaying as broken here.

I recall our host explaining why this happens. It's an annoyance, and I don't recall why this feature must be kept.

To RevJim: if I see a strike-out on a link like this, is that not like this?

That's odd because I posted it and it seemed fine. Then I refreshed the page and it was still fine. Perhaps there is some post-post-processing going on. Is there some automatic process crawling the threads and checking to see if the links are actually valid? If so then maybe that's the thing that is broken.

Dani wrote in some prior discussion that DW scans these links later and breaks them for some reason. So a link looks good on posting, then later the bot breaks it for us.

Yep, it's a post-post process. And it's badly broken in a way that will prevent folk from following valid links and, as an aside, make them think the poster is a numpty :-)

Dani? Any news as to what is happening here and how it can be prevented?

They are all over the place. It looks very messy and is far from helpful when people are trying to help other members.

Sorry, I haven't had usable access to my computer the past week (due to moving).

The problem is that our bad-link checker is catching false positives. This can be due to web servers rejecting connections from what they detect to be bots (as our link checker is a bot).

Once I have usable PC access again, I'll work on seeing if I can fix it.

Soooo, as it turns out, I have a usable PC again, and I'm not catching any glaring bugs with our link checker. The problem seems to be that the link in question here is not just timing out when DaniBot attempts to reach it. It's actually actively returning a 404 error to the bot. This means that the server is actually responding and saying yes, we received your request, that specific page doesn't exist. As of right now, the only thing I can think of is that the web server just doesn't like bots.
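
To make the difference between "timed out" and "actively answered 404" concrete, here is a rough PHP sketch. It is not the DaniBot code: `describe_result()` is a made-up helper name, and it only assumes the standard curl extension is loaded.

```php
<?php
// Hypothetical helper: tell a transport failure (no answer at all) apart
// from a real HTTP answer such as a 404.
function describe_result(int $errno, int $status): string
{
    if ($errno === CURLE_OPERATION_TIMEDOUT) {
        return 'timed out: no answer within the time limit';
    }
    if ($errno !== CURLE_OK) {
        return 'transport error: no HTTP response received';
    }
    // The server did respond, so the status code is its own statement.
    return "server answered with HTTP $status";
}

// Usage with a handle that has already been executed (sketch only):
// echo describe_result(curl_errno($ch), (int) curl_getinfo($ch, CURLINFO_HTTP_CODE));
```

With a timeout there is nothing to interpret; with a 404 the server has made a claim, and the checker has to decide whether to believe it.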

A lot of web servers that are tight on bandwidth actively choose to return bad HTTP status codes instead of serving up the correct content when they detect that the request is originating from a bot they don't recognize.

So I tried to spoof the user agent, with no luck. This is my cURL request:

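    // One cURL handle per URL, keyed by the URL itself
    $ch[$row->url] = curl_init($row->url);
    curl_setopt($ch[$row->url], CURLOPT_RETURNTRANSFER, true);  // return the response instead of printing it
    curl_setopt($ch[$row->url], CURLOPT_HEADER, true);          // include response headers in the output
    curl_setopt($ch[$row->url], CURLOPT_NOBODY, true);          // headers only: this makes it a HEAD request
    curl_setopt($ch[$row->url], CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch[$row->url], CURLOPT_CONNECTTIMEOUT, 15);    // seconds allowed to connect
    curl_setopt($ch[$row->url], CURLOPT_TIMEOUT, 30);           // seconds allowed for the whole request
    curl_setopt($ch[$row->url], CURLOPT_SSL_VERIFYPEER, false); // skip TLS certificate verification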

$row->url is set to https://support.microsoft.com/en-gb/help/927477/how-to-troubleshoot-a-damaged-presentation-in-powerpoint-2007-and-powerpoint-2010

Can anyone let me know how to get curl_getinfo() to stop telling me that the request returns a 404 status code? I mean, obviously it's not lying and it is returning a 404. What I'm asking is why my PHP code gets a 404 when my web browser does not.

Hello Dani,

I don't think it's the user agent. I'm testing with PhantomJS, and it uses this user agent:

    Mozilla/5.0 (Unknown; Linux i686) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1

The testing script render.js:

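    var page   = require('webpage').create(),
        system = require('system'),
        vsize  = {width: 1280, height: 1024},
        address, output;

    // Command-line arguments: the URL to load and the image file to write
    address = system.args[1];
    output  = system.args[2];

    // Render a fixed-size viewport and clip the screenshot to it
    page.viewportSize = vsize;
    page.clipRect = {
      top: 0,
      left: 0,
      width: vsize.width,
      height: vsize.height
    };

    // Load the page, save the screenshot, then quit PhantomJS
    page.open(address, function() {
      page.render(output);
      phantom.exit();
    });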

Execution:

    ./phantomjs render.js LINK output.png

And it works fine. In this specific case Microsoft is rejecting HEAD requests but allowing GET requests. In fact, a GET returns 200, but the page has no content because it is loaded by JavaScript; test with Postman to see how it renders. So it seems a rendering engine is needed to show the content.
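
The HEAD-versus-GET difference can also be checked from PHP itself. A minimal sketch, assuming the curl extension is available; `probe_options()` and `probe_status()` are hypothetical helper names, not anything from the DaniWeb code base:

```php
<?php
// Hypothetical helpers for comparing how a server answers HEAD and GET.

// Build the cURL options for one probe. CURLOPT_NOBODY is what turns the
// request into a HEAD; leaving it false keeps it an ordinary GET.
function probe_options(string $method): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_NOBODY         => ($method === 'HEAD'),
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 15,
        CURLOPT_TIMEOUT        => 30,
    ];
}

// Return the final HTTP status code, or 0 if no response was received.
function probe_status(string $url, string $method): int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, probe_options($method));
    curl_exec($ch);
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
```

Calling `probe_status($url, 'HEAD')` and `probe_status($url, 'GET')` on the same URL and comparing the two numbers is enough to spot a server that treats the methods differently.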

How about modifying the bot until you find the problem so that it stops invalidating the links? Better to let a few dead links through for now instead of killing everything.

Yes, this website is very poorly configured. I can understand them choosing to not waste bandwidth on bots. However, they seem to:

  • return a 404 error on a HEAD request
  • return a 500 error on a GET request where a referer is set but a user agent is not set

I'm now making full GET requests (instead of just HEAD) setting the referer to the actual URL of the post that the link appears in, and the user agent to DaniBot. It's more bandwidth, obviously, but hopefully it will allow us to bypass servers that don't like bots, without us faking our user agent or being a bad bot.
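
For anyone curious, a request like the one described above might look roughly like this in PHP. This is a sketch, not the actual DaniBot code: `checker_options()` and `check_link()` are made-up names, and the exact user agent string is an assumption.

```php
<?php
// Hypothetical sketch: a full GET that identifies itself honestly and says
// which post the link was found in.

function checker_options(string $postUrl): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPGET        => true,       // full GET rather than HEAD
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 15,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_REFERER        => $postUrl,   // URL of the post containing the link
        CURLOPT_USERAGENT      => 'DaniBot',  // assumed string; the real one may differ
    ];
}

// Fetch the link and return the HTTP status code (0 if no response).
function check_link(string $linkUrl, string $postUrl): int
{
    $ch = curl_init($linkUrl);
    curl_setopt_array($ch, checker_options($postUrl));
    curl_exec($ch);
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
```

Keeping the options in their own function makes it easy to see at a glance that no browser user agent is being faked.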

We're still going to get bad results for servers that actively reject requests where the user agent has "Bot" in the name, or isn't a known user agent (Mozilla, IE, Chrome, etc.). However, I think we have to be good botizens as well.

"Better to let a few dead links through for now instead of killing everything."

We're not killing everything. We're just killing the links when servers return clearly incorrect or confusing status codes. For example, LinkedIn doesn't like bots, so they return an HTTP status 999, which we silently ignore. However, Microsoft seems to not like bots either, but they are returning HTTP 404 errors, which are meant for pages that cannot be found. If the server tells us the page doesn't exist, we have no way of verifying whether they're lying to us because they don't like bots, or whether the page really does not exist. IMHO, it's a badly configured server that purposefully lies by using an HTTP status code that is specifically reserved for a different purpose.
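
As a rough illustration of that policy, here is a small PHP sketch. Only the 404 and 999 cases come from what is described above; `classify_status()` is a made-up name, and every other branch is an assumption for the sake of the example.

```php
<?php
// Hypothetical policy function: decide what to do with a link based on the
// HTTP status code the server returned.
function classify_status(int $status): string
{
    if ($status >= 200 && $status < 300) {
        return 'ok';       // assumption: any 2xx counts as a live link
    }
    if ($status === 404) {
        return 'broken';   // server says the page does not exist
    }
    if ($status === 999) {
        return 'ignored';  // non-standard code, e.g. LinkedIn refusing bots
    }
    return 'unknown';      // assumption: leave the link untouched
}

// Example: run the codes mentioned in this thread through the policy.
foreach ([200, 404, 500, 999] as $code) {
    echo $code, ' => ', classify_status($code), PHP_EOL;
}
```

The point is that a 404 is indistinguishable from a genuine dead link, whereas a made-up code like 999 at least signals that the answer should not be taken at face value.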

Hopefully we're acting friendlier now by specifying a referer and user agent.

Well, the link that kickstarted this thread is now showing properly, so it looks like a job well done. Thanks Dani.