Little shell script to recursively check a site for broken links

I needed a script for work to recursively spider one of our sites, check for problems and build a content inventory. As a side project, I produced a little shell script that just checks links and reports a count per HTTP response code.

While the script runs, it prints a status line every 3 seconds showing how many URLs it has spidered so far. Example output:

ak@gd:~$ ./check.url http://giantdorks.org/
Removing existing log..
213 URLs checked thus far
624 URLs checked thus far
 
All done, calculating response codes..
 
Response counts, sorted by HTTP code
    545 200
     62 302
     17 301

The script is also useful for populating the Varnish cache and for checking how effective the setup is: run the script while varnishhist is running on the web server, then run it again and make sure you see mostly cache hits. In fact, I see all cache hits on the second run. In varnishhist, cache hits are pipes “|” and backend hits (cache misses) are pounds “#”; all pipes, no pounds makes me happy.

Bad:

1:15, n = 635                                me.sweet.box
 
                                #
                                #
          |                     #
          |                     #
          |                     #
          |                     #
         ||                     #
         ||                     #
         |||                   ##
         |||                   ###
         |||               #   ###
+-----+-----+-----+-----+-----+-----+-----+-----+-----
|1e-6 |1e-5 |1e-4 |1e-3 |1e-2 |1e-1 |1e0  |1e1  |1e2

Good:

1:20, n = 634                                me.sweet.box
 
        |
        |
        |
        |
        |
        |
        |
       |||
       |||
       |||
       ||||
       |||| 
       |||||
+-----+-----+-----+-----+-----+-----+-----+-----+-----
|1e-6 |1e-5 |1e-4 |1e-3 |1e-2 |1e-1 |1e0  |1e1  |1e2
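
If varnishhist isn't handy, a single URL can be spot-checked from the client side. The sketch below is not part of the script; it assumes the Varnish setup still sends its default X-Varnish response header, which carries two transaction IDs on a cache hit and one on a miss. The function name is_cache_hit is made up for illustration.

```shell
#!/bin/bash

# is_cache_hit (hypothetical helper): reads HTTP response headers on stdin
# and succeeds only if the X-Varnish header lists two transaction IDs.
# Assumption: the VCL has not removed or rewritten the X-Varnish header.
is_cache_hit() {
  awk 'tolower($1) == "x-varnish:" { ids = NF - 1 }   # count the IDs after the header name
       END { exit (ids == 2 ? 0 : 1) }'
}
```

Possible usage: curl -sI http://example.com/ | is_cache_hit && echo hit || echo miss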

And here’s the script:

#!/bin/bash

# error handling: print a message to stderr and exit non-zero
function err_exit { echo -e "\n Error: $*\n" 1>&2; exit 1; }

# check if proper arguments are supplied
if [ $# -ne 1 ]; then
  echo -e "\n Usage error!\n Please provide URL to check.\n Example: $0 http://example.com\n"
  exit 1
fi

# normalize url for log name (strip scheme and www, turn slashes into dots)
url=$(echo "$1" | sed -r 's_https?://__;s/www\.//;s_/_._g;s/\.+/\./g;s/\.$//')
log="/tmp/$url.log"

# remove log if exists
if [ -f "$log" ]; then
   echo "Removing existing log.."
   rm "$log" || err_exit "could not remove $log"
fi

# spider the site in the background; wget exits non-zero whenever it
# finds a broken link, so its exit status is not treated as an error
wget --spider -r "$1" &> "$log" &
wget_pid=$!

# report progress every 3 seconds until wget exits
while kill -0 "$wget_pid" 2> /dev/null; do
  sleep 3
  total=$(grep -c "HTTP request sent" "$log")
  echo "$total URLs checked thus far"
done

echo -e "\nAll done, calculating response codes.."
echo -e "\nResponse counts, sorted by HTTP code"
grep "^HTTP" "$log" | awk '{print $6}' | sort | uniq -c | sort -nr || err_exit "could not parse $log"
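
The report only gives counts, so to see which URLs are actually broken you can pull them out of the same log. This is a rough sketch rather than part of the script above: it assumes wget writes a "--timestamp--  URL" line before each request and the "HTTP request sent, awaiting response..." line after it, and the function name urls_with_code is made up for illustration.

```shell
#!/bin/bash

# urls_with_code CODE LOGFILE (hypothetical helper)
# Print every URL in a wget log whose response code equals CODE.
# Assumption: each request starts with a line beginning "--" that ends
# in the URL, followed later by the "HTTP request sent" status line.
urls_with_code() {
  local code=$1 log=$2
  awk -v code="$code" '
    /^--/ { url = $NF }                                # remember the most recent URL
    /^HTTP request sent/ && $6 == code { print url }   # field 6 is the response code
  ' "$log" | sort -u                                   # URLs may be requested more than once
}
```

For the run shown earlier, something like urls_with_code 404 /tmp/giantdorks.org.log would list the pages that came back 404, one per line.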
