Little shell script to recursively check a site for broken links
I needed a script for work that recursively spiders one of our sites to check for problems and build a content inventory. As a side project, I wrote a little shell script that just checks links and reports how many responses came back with each HTTP response code.
While the script runs, it prints a status line every 3 seconds showing how many URLs it has spidered so far. Example output:
ak@gd:~$ ./check.url http://giantdorks.org/
Removing existing log..
213 URLs checked thus far
624 URLs checked thus far
All done, calculating response codes..
Response counts, sorted by HTTP code
545 200
62 302
17 301
It's also useful for populating the Varnish cache and for checking how effective the setup is: run the script while running varnishhist on the web server, then run it again to make sure you see mostly cache hits. In fact, I see all cache hits on the second run. In varnishhist, cache hits are pipes “|” and backend hits are pounds “#”, so all pipes and no pounds make me happy.
Bad:
1:15, n = 635 me.sweet.box
#
#
| #
| #
| #
| #
|| #
|| #
||| ##
||| ###
||| # ###
+-----+-----+-----+-----+-----+-----+-----+-----+-----
|1e-6 |1e-5 |1e-4 |1e-3 |1e-2 |1e-1 |1e0 |1e1 |1e2
Good:
1:20, n = 634 me.sweet.box
|
|
|
|
|
|
|
|||
|||
|||
||||
||||
|||||
+-----+-----+-----+-----+-----+-----+-----+-----+-----
|1e-6 |1e-5 |1e-4 |1e-3 |1e-2 |1e-1 |1e0 |1e1 |1e2
And here’s the script:
#!/bin/bash

# error handling function: print a blank line to stderr and exit non-zero
function err_exit { echo -e 1>&2; exit 1; }

# check if proper arguments are supplied
if [ $# -ne 1 ]; then
  echo -e "\n Usage error!\n Please provide URL to check.\n Example: $0 http://example.com\n"
  exit 1
fi

# normalize url for log name (strip scheme and www, turn slashes into dots)
url=$(echo "$1" | sed -E 's_https?://__;s/www\.//;s_/_._g;s/\.+/\./g;s/\.$//')

# remove log if exists
if [ -f "/tmp/$url.log" ]; then
  echo "Removing existing log.."
  rm "/tmp/$url.log" || err_exit
fi

# spider the site in the background, sending everything wget prints to the log
wget --spider -r "$1" &> "/tmp/$url.log" || err_exit &

# report progress every 3 seconds for as long as wget is still running
while [ "$(pgrep -l -f "$url" | grep wget | wc -l)" != 0 ]; do
  sleep 3
  total=$(grep "HTTP request sent" "/tmp/$url.log" | wc -l)
  echo "$total URLs checked thus far"
done

echo -e "\nAll done, calculating response codes.."

# count responses per HTTP code, most frequent code first
echo -e "\nResponse counts, sorted by HTTP code"
grep "^HTTP" "/tmp/$url.log" | awk '{print $6}' | sort | uniq -c | sort -nr || err_exit
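The report only gives totals per response code. To see which URLs are actually broken, the same log can be mined a second time. Here is a minimal sketch, assuming the wget log keeps its usual layout: one line starting with "--" and ending in the URL for each request, followed by the "HTTP request sent, awaiting response..." line. The helper names list_responses and list_broken are made up for this example and are not part of the script above.

```shell
#!/bin/bash

# Hypothetical helper: print "CODE URL" for every response found in a
# wget --spider log. Assumes each request is logged roughly as
#   --2010-03-01 12:00:00--  http://example.com/page
#   HTTP request sent, awaiting response... 404 Not Found
list_responses() {
  awk '
    /^--/                { url = $NF }       # remember the URL being requested
    /^HTTP request sent/ { print $6, url }   # pair the response code with it
  ' "$1"
}

# Hypothetical helper: keep only responses that are not 200, 301 or 302,
# de-duplicated and sorted by code.
list_broken() {
  list_responses "$1" | awk '$1 !~ /^(200|301|302)$/' | sort -u
}
```

For example, after a run against http://giantdorks.org/ the log ends up in /tmp/giantdorks.org.log, so "list_broken /tmp/giantdorks.org.log" would print one "code URL" line for each URL that answered with something other than 200, 301 or 302. Which codes count as "fine" is a judgment call. Adjust the pattern if redirects should be flagged too.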