I needed a script for work to recursively spider one of our sites, check for problems, and build a content inventory. As a side project, I produced a little shell script that just checks links and reports them by HTTP response code.
The goal is to download PDF files from http://example.com/dir1/dir2/, where the dir1 name is constant but dir2 is a number between 100 and 499 (e.g. http://example.com/dir1/309/file258.pdf).
First, I’ll say that wget is a powerful tool and can place a real burden on the site we’re grabbing data from (and could get us banned if we’re perceived as carrying out a DoS attack). To avoid this, the example below will use a 5-second wait between downloads so as not to trouble the web server too much and to keep the site usable for others.
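
As a rough sketch of how the pieces described above could fit together: the base URL, the 100 to 499 range, and the 5-second wait come from the text, while the function names `dir_urls` and `fetch_pdfs` are made up for illustration. It also assumes each numbered directory serves an index page that links to its PDFs, which may not be true for a given site.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not a definitive script.
# BASE_URL and the 100..499 range are taken from the description above;
# the function names are illustrative only.
BASE_URL="http://example.com/dir1"

# Print one directory URL per line for dir2 = 100..499.
dir_urls() {
    local n
    for n in $(seq 100 499); do
        printf '%s/%s/\n' "$BASE_URL" "$n"
    done
}

# Feed the directory URLs to wget on stdin and keep only PDF files.
# Assumption: each directory index page links to its PDFs.
#   --input-file=-   read the URL list from standard input
#   --recursive      follow links found on each index page
#   --level=1        go no deeper than the index page itself
#   --no-parent      never climb above the starting directory
#   --accept=pdf     keep only files ending in .pdf
#   --wait=5         pause 5 seconds between retrievals
fetch_pdfs() {
    dir_urls | wget --input-file=- --recursive --level=1 --no-parent \
        --accept=pdf --wait=5 --no-verbose
}

# Nothing is downloaded until fetch_pdfs is called explicitly.
# Running dir_urls on its own is a safe dry run that only prints URLs.
```

Running `dir_urls | head` first is a cheap way to confirm the URL list looks right before calling `fetch_pdfs`, since 400 directories at 5 seconds per request will take a while.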