LOTS OF LINKS ON THE WEB

Outgoing broken links, malware removal, plugin updates... maintenance is a nightmare for most website owners. Either you call your full-time web designer, or you venture into the uncharted land of freelancers and virtual assistants. Wouldn't you rather trust a fully automated, 24/7 web monitoring system keeping an eye on your web pages?

  • No holiday surcharges
  • No chance of letting something slip by
  • Everything under your control
  • No expert knowledge required
  • Always up to date
  • And never asks for a raise! (pricing is fixed)

Just a glimpse of what can be done to keep your site fit

How do you ensure your site has no broken links?

Web crawling is the essential task of web spiders. The term spider sprang naturally from the web's name, yet it couldn't be more inaccurate: while a real spider builds its own web and lies there, mostly still, web spiders are relentless in their search for information.

Every spider, or “crawler bot” to use a more accurate name, starts its journey at a specific location. In the old days, this could have been a well-known website (such as a popular portal) or one from a manually curated list of seed sites. Today search engines have become such a financial asset that every detail of their inner workings is kept secret, so there’s no way to know from which seed a crawl begins for, say, Google or Bing.

Crawling, however, is not limited to search engines. A spider can be instructed to start at a specific site, or at a single section of a site, when an ad hoc crawl is needed. Such a crawl may be meant not to gather data on linked sites but to check the status of the site itself: broken links pointing either to internal pages or to external sites that have changed their structure or even disappeared from the web.

The simplest, HTML-only crawler

The crawler fetches the page data and, in doing so, requests only the pure HTML content. Even though the crawlers run by large companies such as Google or Bing can crawl and parse dynamic pages such as JavaScript-driven Ajax sites, most crawlers are interested only in the raw static data served by the page.
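A minimal sketch of that fetch step, using only Python's standard library; the URL and the user-agent string are placeholders, not part of any real crawler:

```python
import urllib.error
import urllib.request

def fetch_html(url, timeout=10):
    """Request a page and return its raw HTML, or None on failure."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "SimpleCrawler/0.1"},  # hypothetical bot name
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Keep only plain HTML; skip images, PDFs, feeds and so on.
            if "text/html" not in response.headers.get("Content-Type", ""):
                return None
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None

html = fetch_html("http://www.example.com/")
```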

At its most basic, the crawler then filters the page contents down to a list of links to other pages, which will serve as a guide to continue crawling further. Links are then sanitized and decimated.
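One way to pull that list of links out of the raw HTML, again sticking to the standard library (a sketch, not a production parser):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed(html or "")      # 'html' fetched in the sketch above
raw_links = collector.links     # still unsanitized, possibly duplicated
```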

Sanitization is the process of making each and every link an absolute, valid address. This means excluding malformed links, or sometimes fixing them by applying some heuristics. For example, an invalid address starting with http://http://www.example.com (a surprisingly common mistake) can be fixed by replacing the duplicated protocol with a single instance.
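A sanitization pass might look like the sketch below; the duplicated-protocol fix mirrors the example above, and the list of accepted schemes is an assumption:

```python
from urllib.parse import urlparse

def sanitize(link):
    """Return a cleaned-up link, or None if it can't be salvaged."""
    link = link.strip()
    # Fix the duplicated protocol mentioned above: http://http://www.example.com
    while link.lower().startswith(("http://http://", "http://https://",
                                   "https://http://", "https://https://")):
        link = link.split("//", 1)[1]   # drop the first, duplicated scheme
    parsed = urlparse(link)
    # Discard anything that isn't a plain web link (mailto:, javascript:, ...)
    if parsed.scheme not in ("http", "https", ""):
        return None
    return link
```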

Internal links are then expanded to include the entire root domain. This way, a relative link such as index.html, which by itself would be perfectly valid, is turned into an unambiguous one by taking the current folder and web address of the page it appears on and rewriting it as http://www.example.com/currentfolder/index.html. This step is necessary because otherwise there would be several duplicate instances of an “index.html” page across the same site, while each one is really a unique content page. The resulting absolute address is formally known as the URL, or Uniform Resource Locator.
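Turning a relative link into a full, unambiguous URL is exactly what urljoin from Python's standard library does, given the address of the page the link was found on; the addresses below are just illustrations built on the article's example.com example:

```python
from urllib.parse import urljoin

# Hypothetical page inside "currentfolder" where the links were found.
page_url = "http://www.example.com/currentfolder/page.html"

urljoin(page_url, "index.html")
# -> 'http://www.example.com/currentfolder/index.html'

urljoin(page_url, "/clients.html")
# -> 'http://www.example.com/clients.html'

urljoin(page_url, "http://www.other-site.com/")
# -> 'http://www.other-site.com/'  (already absolute, left untouched)
```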

On to the decimation phase

Once this step is done (and remember, we’re still on page one) a decimation process is performed. Let’s assume the page links to the home page several times: through the upper-left logo, in the breadcrumb navigation and in the footer copyright note. These would be three instances of the same URL in the list to be crawled next. Decimation excludes the duplicates and leaves the spider with a list of internal and external links, unique and valid, ready for the next hop.
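In code, this decimation step can be as small as deduplicating the sanitized, absolute links while keeping the order they were found in:

```python
def decimate(links):
    """Drop duplicate URLs, preserving discovery order."""
    return list(dict.fromkeys(links))

decimate([
    "http://www.example.com/",              # upper-left logo
    "http://www.example.com/clients.html",
    "http://www.example.com/",              # breadcrumb navigation
    "http://www.example.com/",              # footer copyright note
])
# -> ['http://www.example.com/', 'http://www.example.com/clients.html']
```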

One by one, these links are fed back into the main entry point, each essentially acting as a new starting page, but with a twist. For each of the links a flag has to be stored, stating at the very least whether that page has already been crawled or not. Otherwise, the crawler would immediately get stuck in an infinite loop. Just have a look at this example:

  • Entry point www.example.com links to www.example.com/clients.html
  • Page www.example.com/clients.html links back to the root domain

The crawler would keep moving between the two pages following an endless chain of just two links.
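The usual way to store that “already crawled” flag is a set of visited URLs, checked before every hop. A bare-bones sketch of the loop, reusing the fetch, LinkCollector and sanitize sketches from above (which are assumptions, not a real library):

```python
from collections import deque
from urllib.parse import urljoin

def crawl(entry_point, max_pages=1000):
    """Breadth-first crawl that never visits the same URL twice."""
    visited = set()
    queue = deque([entry_point])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue                      # the flag that breaks the loop
        visited.add(url)
        html = fetch_html(url)            # fetch sketch shown earlier
        if html is None:
            continue
        collector = LinkCollector()       # link extraction sketch shown earlier
        collector.feed(html)
        for link in collector.links:
            cleaned = sanitize(urljoin(url, link))
            if cleaned and cleaned not in visited:
                queue.append(cleaned)
    return visited

# www.example.com and clients.html are each fetched exactly once,
# no matter how many times they link to each other.
crawl("http://www.example.com/")
```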

Optimizing the process

Instead, once a crawler determines it has already crawled a page, it no longer follows the links marked as already crawled or, more commonly, as unchanged since the last crawl. By looking just at the headers sent back by the web server, a spider can easily be optimized to skip a page that hasn’t been modified since the last crawl.
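That header trick is the standard HTTP conditional request: the spider sends back the Last-Modified date it saw on the previous crawl in an If-Modified-Since header, and a well-behaved server answers 304 Not Modified instead of resending the page. A sketch using the standard library, with the same placeholder user agent as before:

```python
import urllib.error
import urllib.request

def fetch_if_changed(url, last_modified=None):
    """Return (html, last_modified); html is None if the page is unchanged."""
    headers = {"User-Agent": "SimpleCrawler/0.1"}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return (response.read().decode("utf-8", errors="replace"),
                    response.headers.get("Last-Modified", last_modified))
    except urllib.error.HTTPError as error:
        if error.code == 304:             # unchanged since the last crawl
            return None, last_modified
        raise
```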

The procedure now repeats itself for each of the valid, decimated links, generating a new list of links to be followed, and this list keeps growing exponentially. This is one of the key issues to consider when optimizing a crawler: a single page may spawn millions of other pages to be crawled.

As an example, let’s say that www.example.com has only ten links on its home page, one of them pointing back to the home page itself and therefore excluded from the crawl. This means 10 pages have to be crawled: the initial one and 9 others.

Now each of the 9 new pages again has ten links (counting the one back to the entry point). It’s easy to see that at level two, roughly 10 x 10 = 100 pages have to be crawled. Now assume again that each of them contains the same number of links, with no duplicates, and at the third level we are already dealing with about one thousand pages.

Assuming a web page has an average of 50 valid, followable links, a spider’s job suddenly gets a lot more intense. Five hops are enough to skyrocket the number of pages to be crawled to 50^5 = 312,500,000. That’s over three hundred million pages.
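The arithmetic behind that number is just repeated multiplication; a tiny check, where the 50-links-per-page figure is the assumption stated above:

```python
links_per_page = 50
for hop in range(1, 6):
    # Pages reachable at each hop grow by a factor of links_per_page.
    print(hop, links_per_page ** hop)
# hop 5 -> 312500000, i.e. 50^5 pages after five hops
```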