Little shell script to recursively check a site for broken links
I needed a script for work to recursively spider one of our sites, check for problems, and build a content inventory. As a side project, I produced a little shell script that just checks links and produces a report by HTTP response code.
While the script runs, it'll print a status report every 3 seconds showing how many URLs it has spidered so far. Example output:
ak@gd:~$ ./check.url http://giantdorks.org/
Removing existing log..
213 URLs checked thus far
624 URLs checked thus far
All done, calculating response codes..
Response counts, sorted by HTTP code
545 200
62 302
17 301
The script is also useful for populating the Varnish cache and checking how effective the setup is: run it while watching varnishhist on the web server, then run it again and make sure you see mostly cache hits. In fact, I see all cache hits on the second run. In varnishhist, cache hits are pipes "|" and backend requests are pounds "#" -- all pipes and no pounds makes me happy. (A quick way to spot-check individual URLs follows the histograms below.)
Bad:
1:15, n = 635 me.sweet.box
#
#
| #
| #
| #
| #
|| #
|| #
||| ##
||| ###
||| # ###
+-----+-----+-----+-----+-----+-----+-----+-----+-----
|1e-6 |1e-5 |1e-4 |1e-3 |1e-2 |1e-1 |1e0 |1e1 |1e2
Good:
1:20, n = 634 me.sweet.box
|
|
|
|
|
|
|
|||
|||
|||
||||
||||
|||||
+-----+-----+-----+-----+-----+-----+-----+-----+-----
|1e-6 |1e-5 |1e-4 |1e-3 |1e-2 |1e-1 |1e0 |1e1 |1e2
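If you'd rather spot-check individual URLs than eyeball varnishhist, something like this works too. It's a rough sketch: it assumes Varnish's default Age and X-Varnish headers aren't stripped, and the URL is just an example:
url=http://giantdorks.org/   # substitute a URL the script just spidered
curl -sI "$url" | grep -Ei '^(age|x-varnish):'
# on a cache hit, Age is typically > 0 and X-Varnish carries two transaction IDs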
And here's the script:
#!/bin/bash
# error handling
function err_exit { echo -e "\nSomething went wrong, exiting.\n" 1>&2; exit 1; }
# check if proper arguments are supplied
if [ $# -ne 1 ]; then
  echo -e "\n Usage error!\n Please provide URL to check.\n Example: $0 http://example.com\n"
  exit 1
fi
# check if wget is a valid command
if ! which wget &> /dev/null; then echo wget not found; exit 1; fi
# normalize url for log name
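# e.g. http://www.example.com/some/page -> example.com.some.page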
url=$(echo "$1" | sed -r 's_https?://__;s/www\.//;s_/_._g;s/\.+/\./g;s/\.$//')
# remove log if exists
if [ -f "/tmp/$url.log" ]; then
  echo "Removing existing log.."
  rm "/tmp/$url.log" || err_exit
fi
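# spider the site in the background:
#   -e robots=off   ignore robots.txt
#   --spider        check URLs without saving pages
#   -S              log the server response headers
#   -r              recurse through links on the site
#   -nH -nd         don't create host/directory trees
#   --delete-after  remove anything that does get saved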
wget -e robots=off --spider -S -r -nH -nd --delete-after "$1" &> "/tmp/$url.log" &
# report progress every 3 seconds while the background wget is still running
while [ "$(pgrep -lf "$url" | grep -c wget)" != 0 ]; do
  sleep 3
  total=$(grep -c "HTTP request sent" "/tmp/$url.log")
  echo "$total URLs checked thus far"
done
echo -e "\nAll done, calculating response codes.."
echo -e "\nResponse counts, sorted by HTTP code"
grep -A1 "^HTTP request sent" "/tmp/$url.log" | egrep -o "[0-9]{3} [A-Za-z]+(.*)" | sort | uniq -c | sort -nr || err_exit
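If you want to see which URLs actually came back broken, rather than just the counts, something along these lines should work against the same log. Treat it as a rough sketch: it assumes the log layout produced by the wget call above (which can vary a bit between wget versions), and the log name is just an example:
log=/tmp/giantdorks.org.log   # substitute your own log
awk '/^--[0-9]/ {url=$NF} /^  HTTP\// && / 404 / {print url}' "$log" | sort -u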
2 Comments
1. Srikanth replies at 7th March 2013, 12:03 am :
How to run this script? just save as some one and run? can you please help me? I
2. Abia replies at 29th December 2015, 2:56 pm :
This script works!
Run it:
copy the content to a file and save it.
make the file executable with:
chmod +x your_file
Now actually run it: ./your_file yourURL
See results as they happen:
tail -f /tmp/yourURL.log