Improve SEO with Varnish by returning HTTP status code 410 for invalid URLs matching a pattern

An old site of mine used to run on Joomla, and URLs with query strings like the following returned valid content:

http://example.com/index.php?option=com_weblinks&task=view&catid=13&id=14

Then one day I decided to retire the old site and redirected requests for it to this site with:

        RewriteEngine on
        RewriteRule / http://giantdorks.org/alain/ [R=permanent,L]

URLs like "/about", being valid on the new site, continued to work. But WordPress with a "clean URL" permalink structure doesn't return an error for the now-invalid URLs like "?option=com_weblinks&etc". It just serves the page whose URL is valid right up to the question mark, making the whole mess look like a valid page.

This led to all these invalid pages being listed in the Google index (and, I'm assuming, in other search engines as well). Google Webmaster Tools would also list them as "Duplicate title tags" under HTML suggestions.

To remove them, the Google Webmaster Tools help suggests:

If the page no longer exists, make sure that the server returns a 404 (Not Found) or 410 (Gone) HTTP status code. This will tell Google that the page is gone and that it should no longer appear in search results.

I'm using the amazing Varnish cache as my frontend, and the following VCL returns a 410 for requests with a question mark in the URL, except when previewing posts, requesting by page id, logging in, or doing wp-admin related stuff. Perhaps there are also other situations with query strings in URLs when WordPress is set to use one of the "clean" permalink structures -- I'm sure I'll find out soon enough. :-)

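  # Assumed to live in vcl_recv. "error" is Varnish 2.x/3.x syntax;
  # Varnish 4 and later use: return (synth(410, "Content may have moved."));
  if (req.url ~ "\?") {
    # Let through post previews, ?p=ID lookups, wp-admin and wp-login
    if (req.url !~ "preview=true" && req.url !~ "p=[0-9]" && req.url !~ "wp-admin" && req.url !~ "wp-login") {
      error 410 "Content may have moved.";
    }
  }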
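
Before loading the rule into Varnish, it can help to try it against a handful of URLs offline. The following is only a sketch: `would_410` is a made-up helper name, and it assumes that `grep -E` treats these simple patterns the same way Varnish's regex engine does.

```shell
#!/bin/sh
# Hypothetical helper that mirrors the VCL conditions above.
# Prints "410" when the rule would fire and "pass" otherwise.
would_410() {
  url=$1
  case $url in
    *\?*) ;;                 # has a query string, keep checking
    *) echo pass; return ;;  # no question mark: the rule never applies
  esac
  # Same exceptions as the VCL, using the same unanchored patterns
  if printf '%s\n' "$url" | grep -Eq 'preview=true|p=[0-9]|wp-admin|wp-login'; then
    echo pass
  else
    echo 410
  fi
}
```

Calling `would_410 '/index.php?option=com_weblinks&task=view&catid=13&id=14'` reports that the old Joomla URL would be rejected, while `/about` and `/?p=123` pass. Because the patterns are unanchored, something like `/?top=5` would also pass, since "top=5" contains "p=5".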

After these changes are made and Google has crawled the site again, the content should naturally drop out of the Google index.
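
There's no need to wait for Google to find out whether the 410 is actually being served. Here is a rough sketch using curl; the function names `http_status` and `check_gone` are my own invention, not part of any tool.

```shell
#!/bin/sh
# Print only the HTTP status code of the first response for a URL.
# curl does not follow redirects unless -L is given, so a redirecting
# URL reports 301 here rather than the status of its target.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Report whether a URL is answered with 410 Gone.
check_gone() {
  status=$(http_status "$1")
  if [ "$status" = 410 ]; then
    echo "gone: $1"
  else
    echo "still served ($status): $1"
  fi
}
```

For example, `check_gone 'http://example.com/index.php?option=com_weblinks&task=view&catid=13&id=14'`. Quote the URL so the shell doesn't interpret the `?` and `&` characters.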

We'll see..

I also don't think I want /category/*, /page/* and date archives (e.g. /2010/*) indexed, because they're just alternative ways to view the same content and probably lead to duplicates. I just want each unique page indexed. Jason uses a permalink structure with dates, so I left his date URLs alone, arriving at the following robots.txt:

User-agent: *
Disallow: /category
Disallow: /page
Disallow: /2005
Disallow: /2006
Disallow: /2007
Disallow: /2008
Disallow: /2009
Disallow: /2010
Disallow: /2011
Disallow: /2012
Disallow: /2013
Disallow: /2014
Disallow: /2015
Disallow: /alain/category
Disallow: /alain/page
Disallow: /alain/2005
Disallow: /alain/2006
Disallow: /alain/2007
Disallow: /alain/2008
Disallow: /alain/2009
Disallow: /alain/2010
Disallow: /alain/2011
Disallow: /alain/2012
Disallow: /alain/2013
Disallow: /alain/2014
Disallow: /alain/2015
Disallow: /jason/category
Disallow: /jason/page

Here's a little shell script to produce the above without tedious copy/pasting:

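echo "User-agent: *"
for base in "" /alain /jason; do
  echo "Disallow: $base/category"
  echo "Disallow: $base/page"
  # Jason's permalinks contain dates, so leave his year URLs crawlable
  [ "$base" = /jason ] && continue
  for year in $(seq 2005 2015); do
    echo "Disallow: $base/$year"
  done
done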
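
Since Disallow rules are matched as path prefixes, it's worth checking a few real paths against the generated rules. This is a simplified sketch, not a full robots.txt parser: `is_disallowed` is a hypothetical helper that ignores User-agent grouping, Allow lines and wildcards, none of which the robots.txt above uses.

```shell
#!/bin/sh
# Hypothetical helper: read robots.txt lines on standard input and report
# whether the given path starts with any "Disallow:" prefix.
is_disallowed() {
  path=$1
  while read -r field prefix; do
    [ "$field" = "Disallow:" ] || continue
    [ -n "$prefix" ] || continue      # an empty Disallow blocks nothing
    case $path in
      "$prefix"*) echo yes; return 0 ;;
    esac
  done
  echo no
  return 1
}
```

Feed it the file on standard input, e.g. `is_disallowed /alain/2010/05/some-post < robots.txt`. Prefix matching cuts both ways: `Disallow: /page` would also block a path such as `/pages`, so it pays to test any slugs that begin like one of the rules.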
