Controlling how your website gets indexed by programmatically managing web crawler access to individual pages

This post is about what I'm doing to control how my blog is indexed. My main requirements are:

  • Keep HTTPS links out of search engine indexes
  • Minimize duplicates in search results

Keep HTTPS links out of search engine indexes

I'm using Varnish Cache to achieve pretty awesome performance (over 4,000 requests/sec with hardly any resource hit) -- remarkable, considering the puny resources of the cheapo VPS this site is hosted on. Because Varnish won't cache SSL requests, to maximize responsiveness I limit HTTPS to authenticated sessions -- 99.9% of traffic on this site is anonymous and doesn't need to be encrypted.

So, the first requirement is to prevent search engines from publishing HTTPS links.

I'm accomplishing this with the following PHP in my header.php:

  if ($_SERVER['SERVER_PORT'] == '443') {
      echo '<meta name="robots" content="noindex, nofollow">'. PHP_EOL;
  }

The above adds a meta tag that tells web crawlers (a.k.a. bots or spiders) not to index the page the tag appears on and not to follow any links on it. I believe this is a safe and correct way to do this, and it should work with any crawler that follows the rules.

However, as an additional measure, I'm also using the following Apache mod_rewrite rules to redirect HTTPS requests from known crawlers back to HTTP, based on the user agent string:

 # Already adding noindex/nofollow meta tags for SSL in PHP,
 # but also doing it at the Apache level to save resources
 # when bots don't check meta tags. Impact should be minimal
 # as long as browsers used by content managers don't contain
 # the strings below. The HTTPS condition keeps the rule from
 # looping when it lives in a config shared with the plain
 # HTTP vhost.
 RewriteCond %{HTTPS} on
 RewriteCond %{HTTP_USER_AGENT} (crawler|bot|spider|slurp) [NC]
 RewriteRule (.*) http://giantdorks.org/$1 [R=301,L]
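Before trusting a pattern like that in production, it's easy to sanity-check which user agent strings it would catch. Here's a quick shell sketch (the `ua_is_bot` helper is just a throwaway stand-in for the RewriteCond, not part of the Apache config):

```shell
# Throwaway helper that mirrors the RewriteCond's case-insensitive match.
ua_is_bot() {
  printf '%s' "$1" | grep -Eiq 'crawler|bot|spider|slurp'
}

# A crawler UA matches ("bot" appears in the string)...
ua_is_bot 'Mozilla/5.0 (compatible; Googlebot/2.1)' && echo 'redirect'

# ...while a typical browser UA does not.
ua_is_bot 'Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0' || echo 'serve as-is'
```

This is also a handy way to confirm your own browser's UA won't accidentally trip the rule.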

I used to have some HTTPS links appear in the Google index, but after using the combination of these two methods for a while, they're all gone.

Minimize duplicates in search results

The only pages I want indexed are the individual posts. I don't want categories indexed, for instance: the mysql category page shouldn't get indexed, while every individual post that belongs to the mysql category (e.g. this or this, etc) should.

I care about this in part because I'm using Google Custom Search to provide search functionality for my site, and I can't stand getting pages and pages of search results that all point to the same page.

So, the next requirement is to only index the front page and individual posts or pages.

For this, I'm using a combination of two methods: the first is the generic robots.txt approach; the second is application-specific.
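For the robots.txt side, a minimal file might look something like this (illustrative only -- the paths assume a stock WordPress layout):

```text
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
```

Note that robots.txt only controls crawling, not indexing -- a disallowed URL can still show up in results if other sites link to it, which is why the meta tag approach below is the more precise tool.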

I'm using WordPress multisite for this blog, and it provides functions for this sort of thing. Specifically, I'm using is_singular() and is_front_page() to identify the two types of pages I want to get indexed and to exclude the rest:

  if ( is_singular() || is_front_page() ) {
      echo '<meta name="robots" content="index, follow">'. PHP_EOL;
  } else {
      echo '<meta name="robots" content="noindex, follow">'. PHP_EOL;
  }

P.S. Combining the two PHP snippets, I end up with:

  if ($_SERVER['SERVER_PORT'] == '443') {
      echo '<meta name="robots" content="noindex, nofollow">'. PHP_EOL;
  } else {
      if ( is_singular() || is_front_page() ) {
          echo '<meta name="robots" content="index, follow">'. PHP_EOL;
      } else {
          echo '<meta name="robots" content="noindex, follow">'. PHP_EOL;
      }
  }
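To double-check the combined logic, here's the same decision table restated as a throwaway shell function (purely hypothetical -- the yes/no arguments stand in for WordPress's is_singular() and is_front_page() results):

```shell
# Restates the combined PHP logic above as a plain shell function.
# Arguments: port, is_singular (yes/no), is_front_page (yes/no).
robots_tag() {
  if [ "$1" = "443" ]; then
    echo 'noindex, nofollow'      # HTTPS: keep out of the index entirely
  elif [ "$2" = "yes" ] || [ "$3" = "yes" ]; then
    echo 'index, follow'          # single post/page, or the front page
  else
    echo 'noindex, follow'        # archives, categories, etc.
  fi
}

robots_tag 443 no  no    # noindex, nofollow
robots_tag 80  yes no    # index, follow
robots_tag 80  no  no    # noindex, follow
```

Every request falls into exactly one of the three branches, so each page gets exactly one robots meta tag.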
