Little shell script to recursively check a site for broken links

I needed a script for work to recursively spider one of our sites, check for problems, and build a content inventory. As a side project, I produced a little shell script that just checks links and produces a report by HTTP response code.

While the script runs, it prints a status report every 3 seconds showing how many URLs it has spidered so far. Example output:

ak@gd:~$ ./check.url http://giantdorks.org/
Removing existing log..
213 URLs checked thus far
624 URLs checked thus far
 
All done, calculating response codes..
 
Response counts, sorted by HTTP code
    545 200
     62 302
     17 301

The script is also useful for populating the Varnish cache and checking how effective the setup is: run the script while watching varnishhist on the web server, then run it again to make sure you see mostly cache hits. In fact, I see all cache hits on the second run. Cache hits are pipes ("|") and backend hits are pounds ("#"); all pipes and no pounds make me happy.

Bad:

1:15, n = 635                                me.sweet.box
 
                                #
                                #
          |                     #
          |                     #
          |                     #
          |                     #
         ||                     #
         ||                     #
         |||                   ##
         |||                   ###
         |||               #   ###
+-----+-----+-----+-----+-----+-----+-----+-----+-----
|1e-6 |1e-5 |1e-4 |1e-3 |1e-2 |1e-1 |1e0  |1e1  |1e2

Good:

1:20, n = 634                                me.sweet.box
 
        |
        |
        |
        |
        |
        |
        |
       |||
       |||
       |||
       ||||
       |||| 
       |||||
+-----+-----+-----+-----+-----+-----+-----+-----+-----
|1e-6 |1e-5 |1e-4 |1e-3 |1e-2 |1e-1 |1e0  |1e1  |1e2
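If you'd rather not eyeball the histogram, you can count the two character classes in a saved copy of the varnishhist display. This is just a sketch: the snapshot below is a made-up sample written to a hypothetical path; in practice you'd paste the terminal contents into the file yourself.

```shell
# Count cache hits (|) vs backend hits (#) in a saved varnishhist display.
# The snapshot here is a hypothetical sample fed in via a heredoc.
cat > /tmp/hist.txt <<'EOF'
        |
       |||
       ||||                     #
EOF
hits=$(grep -o '|' /tmp/hist.txt | wc -l)
misses=$(grep -o '#' /tmp/hist.txt | wc -l)
echo "cache hits: $hits, backend hits: $misses"
```

A second run of the checker should drive the backend count toward zero.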

And here's the script:

#!/bin/bash
 
# error handling
function err_exit { echo -e "\nError encountered, exiting.\n" 1>&2; exit 1; }
 
# check if proper arguments are supplied
if [ $# -ne 1 ]; then
  echo -e "\n Usage error!\n Please provide URL to check.\n Example: $0 http://example.com\n"
  exit 1
fi
 
# check if wget is a valid command
if ! command -v wget &> /dev/null; then echo "wget not found"; exit 1; fi
 
# normalize url for log name
url=$(echo "$1" | sed -r 's_https?://__;s/www\.//;s_/_._g;s/\.+/\./g;s/\.$//')
 
# remove log if exists
if [ -f "/tmp/$url.log" ]; then
   echo "Removing existing log.."
   rm "/tmp/$url.log" || err_exit
fi
 
wget -e robots=off --spider -S -r -nH -nd --delete-after "$1" &> "/tmp/$url.log" &
 
while [ "$(pgrep -l -f "$url" | grep -c wget)" -gt 0 ]; do
  sleep 3
  total=$(grep -c "HTTP request sent" "/tmp/$url.log")
  echo "$total URLs checked thus far"
done
 
echo -e "\nAll done, calculating response codes.."
echo -e "\nResponse counts, sorted by HTTP code"
grep -A1 "^HTTP request sent" "/tmp/$url.log" | grep -Eo "[0-9]{3} [A-Za-z]+(.*)" | sort | uniq -c | sort -nr || err_exit
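The sed chain that builds the log name strips the URL scheme and any leading "www.", turns slashes into dots, collapses repeated dots, and trims a trailing dot. A quick demonstration with a sample URL (example.com is just illustrative; the pattern uses GNU sed's -r for extended regexes):

```shell
# Show what the log-name normalization does to a sample URL (GNU sed -r):
url=$(echo "http://www.example.com/some/path/" | sed -r 's_https?://__;s/www\.//;s_/_._g;s/\.+/\./g;s/\.$//')
echo "$url"   # example.com.some.path -- so the log lands at /tmp/example.com.some.path.log
```

On BSD/macOS sed, substitute -E for -r.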

2 Comments

  • 1. Srikanth replies at 7th March 2013, 12:03 am :

    How to run this script? just save as some one and run? can you please help me? I

  • 2. Abia replies at 29th December 2015, 2:56 pm :

    This script works!

    Run it:
    copy the content to a file and save it.
    make the file executable with:
    chmod +x your_file

    Now actually run it ./your_file yourURL

    See results as they happen:
    tail -f /tmp/yourURL.log
