Shit In, Shit Out: Why Link Checkers Are Annoyingly Inaccurate
So, you have a list of URLs, and you need to know if any of them (still) link to your domain. Either you’re checking links you’ve built, or you’re checking links you’ve unbuilt (aka link removal).
Using an automated link checker tool can quickly give you this information.
HOWEVER, no link checker reports live links with 100% accuracy. In this post I explain why, and how we made URL Profiler’s link checker 5X more accurate than the competition.
The Problem With Link Checkers
Distilled published a good post last year describing a variety of link checker tools. None of them are optimal, however:
- You have to wait for online tools to schedule your job in among a load of other jobs
- You are unable to check your links ‘on demand’ – at the exact moment you need the information
- You don’t trust the data because you constantly find inaccurate results
The third issue is really the killer here – if you can’t trust your data it is essentially worthless to you. Many people end up running the same data through several tools to cross-check, costing them more time and effort in the process.
Why Are Link Checkers Inaccurate?
Ignoring the time issue, the biggest complaint people have with automated link checkers is that they simply aren’t accurate enough. It only takes a client to question one report for you to lose faith in your software.
But there are good reasons why link checkers can be inaccurate in the first place, and they have very little to do with the software:
- The page requested took too long to download and timed out
- The server failed to respond when the crawler requested the page URL
- Desktop anti-virus software blocked the program from downloading a page as it detected a virus signature
So the problem isn’t that the software incorrectly classifies the link; it’s that the software isn’t able to view the page in order to classify the link. Quite literally, the URLs being checked are just too shit.
The Most Accurate Desktop Link Checker
We built URL Profiler with these issues in mind, and specifically built in some extra settings to handle them. When you run a profile using URL Profiler, ‘unfound’ links will be classified as either ‘Not Checked’ or ‘Not Found’. The difference is important:
- Not Found – The page was downloaded but the link couldn’t be found (i.e. the link genuinely isn’t there)
- Not Checked – The tool was unable to check if the link was there, either because the page no longer exists or because of one of the errors listed above
So basically, you want to see as few ‘Not Checked’ results as possible, as these are the unknowns. To minimise them, use the settings we built specifically for ‘Not Checked’ links.
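To make the distinction concrete, here is a minimal Python sketch of the three outcomes. It is not URL Profiler’s actual code; the `classify` function, the `_HrefCollector` helper and the use of `None` to mean "the download failed" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify(html, target_domain):
    """Return 'Found', 'Not Found' or 'Not Checked' for one page.

    html is the downloaded page source, or None if the download failed
    (timeout, no response, blocked by anti-virus, page gone).
    """
    if html is None:
        # We never saw the page, so we cannot say either way.
        return "Not Checked"

    parser = _HrefCollector()
    parser.feed(html)
    for href in parser.hrefs:
        host = urlparse(href).hostname or ""
        # Match the domain itself or any subdomain of it.
        if host == target_domain or host.endswith("." + target_domain):
            return "Found"

    # The page was read in full and the link genuinely is not there.
    return "Not Found"
```

The point of the sketch is that ‘Not Found’ is only ever returned after the page has actually been read. A failed download never gets reported as a missing link.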
To activate these settings, click the ‘Settings’ (you guessed it) button in the top left.
The first screen you come to allows you to change the number of connections (threads) the software uses, which effectively determines how fast it will run the profile. We don’t recommend adjusting this; ‘Automatically Optimised’ is best in most cases.
The second setting on this screen, ‘Connection Timeout’, allows you to tell the software how long to spend trying to download a page before moving on to the next URL. Set this to 40 seconds or more to give the tool a good chance of downloading slow pages, rather than giving up on them and reporting them as ‘Not Checked’.
The ‘Maximum Download Size’ refers to the size of the HTML content on the page. Some bad links will live on pages which are particularly large (think 1000+ links), so it can be worth increasing this limit from the default size of 1024 KB.
If you want to go all guns blazing, move the slider to the far left, which is ‘Unlimited’. However, please be warned: there are some absolutely terrible URLs on the internet, with masses of content on the page, and trying to download it all may cause the tool to hang. You have been warned.
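If you want a feel for what these two settings do under the hood, here is a rough Python sketch using only the standard library. The `fetch` function and its parameter names are hypothetical, and the defaults simply mirror the values mentioned above (40 seconds, 1024 KB). It is an illustration, not how URL Profiler is actually implemented.

```python
import http.client
import urllib.request


def fetch(url, timeout_seconds=40, max_bytes=1024 * 1024,
          opener=urllib.request.urlopen):
    """Return the page HTML as text, or None if it could not be downloaded.

    timeout_seconds: how long to wait on the connection before giving up.
        (With urlopen this applies to each blocking network operation,
        not to the download as a whole.)
    max_bytes: stop reading after this many bytes; None means unlimited.
    opener: injectable for testing; defaults to urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout_seconds) as response:
            raw = response.read(max_bytes)
    except (OSError, ValueError, http.client.HTTPException):
        # Timeouts, refused connections, HTTP errors and malformed URLs
        # all end up here: the page was not seen.
        return None
    return raw.decode("utf-8", errors="replace")
```

Notice the trade-off: a small `max_bytes` can cut a huge page off before the link appears in the HTML, while `None` will happily try to pull down a monster page in full.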
We don’t stop there, however. Click over to the ‘Link Analysis’ settings tab.
Move the top slider to adjust ‘Maximum Retries’. This is a really important setting if you want an accurate check of which links are live.
If you run a profile with the retries set to 0, URL Profiler will literally try each link once. If a few of them time out or don’t respond properly, you may have some links reported as dead when they are actually live. This is typical of most desktop link checkers.
However, set the retries to 1 or more, and URL Profiler will take all of the links marked ‘Not Checked’ and run them through the profile again – but only after a deliberate 10 minute wait before each re-run.
You can set a maximum of 5 retries, so your ‘Not Checked’ links can be checked up to 5 further times after the initial run, meaning the data you get back is incredibly accurate. The 10 minute gap between each run will obviously slow down your total profile time. However, it is necessary, as this gap gives servers time to come back online if they were offline the last time you checked.
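The retry behaviour described above can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than URL Profiler’s real code: `check_with_retries` and `check_once` are hypothetical names, and `check_once` stands in for whatever function downloads one page and returns ‘Found’, ‘Not Found’ or ‘Not Checked’.

```python
import time


def check_with_retries(urls, check_once, max_retries=5,
                       wait_seconds=600, sleep=time.sleep):
    """Check every URL once, then re-check only the 'Not Checked' ones.

    check_once(url) must return 'Found', 'Not Found' or 'Not Checked'.
    wait_seconds defaults to 600 (the 10 minute gap between runs).
    sleep is injectable so the wait can be skipped in tests.
    """
    # Initial run: every URL gets exactly one attempt.
    results = {url: check_once(url) for url in urls}

    for _ in range(max_retries):
        pending = [u for u, status in results.items()
                   if status == "Not Checked"]
        if not pending:
            break  # nothing left unknown, so no need to wait again
        sleep(wait_seconds)  # give struggling servers time to recover
        for url in pending:
            results[url] = check_once(url)

    return results
```

With the maximum of 5 retries, the worst case adds 5 × 10 = 50 minutes of waiting on top of the checking itself. Only the unknowns get re-checked, so links already marked ‘Found’ or ‘Not Found’ are never downloaded twice.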
These extra settings mean that URL Profiler is the most accurate on-demand link checker on the market – with a 5 times better chance of picking up a link than any other tool.