Responsibly download lots of files with wget
The goal is to download PDF files from http://example.com/dir1/dir2/, where the dir1 name is constant but dir2 is a number between 100 and 499 (e.g. http://example.com/dir1/309/file258.pdf).
First, a word of caution: wget is a powerful tool and can place a real burden on the site we're grabbing data from (and probably get us banned, since the traffic can look like a DoS attack). To avoid this, the example below waits 5 seconds between downloads and limits bandwidth to 1 MB/s, so the web server isn't troubled too much and the site stays usable for others.
for dir2 in $(seq 100 499); do wget -A pdf -w 5 --random-wait --limit-rate=1m --retr-symlinks -k -r -U "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.19) Gecko/2010040118 Debian Lenny Firefox/3.0.19" "http://example.com/dir1/$dir2/"; done
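An alternative sketch, assuming the same example.com layout (urls.txt is just an arbitrary file name I'm introducing here): generate the URL list up front and hand it to a single wget run with -i, so one process applies the wait and rate limit across all 400 directories instead of restarting wget for each one. Add the same -U string if you want the custom user agent.
# build the list of directory URLs, then let one wget process fetch them all
for dir2 in $(seq 100 499); do echo "http://example.com/dir1/$dir2/"; done > urls.txt
wget -A pdf -w 5 --random-wait --limit-rate=1m --retr-symlinks -k -r -i urls.txt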
Wget created a dir structure I don't want, e.g.:
./example.com/dir1/309/file123.pdf
Rename each file so the parent directory name becomes part of the file name, e.g.:
309.file123.pdf
Working from inside example.com/dir1 so the paths reduce to just the subdirectory and file name:
cd example.com/dir1
for f in $(find . -name "*.pdf"); do new=$(echo "$f" | sed 's_^\./__;s_/_\._g') && echo "renaming $f to $new" && mv "$f" "$new"; done
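A sketch of a more defensive version of the same rename (assuming GNU find and mv): -print0 with read -d '' copes with paths containing spaces, and mv -n refuses to overwrite an existing target instead of silently clobbering it.
# same path-to-dotted-name rewrite, but safe for odd file names and collisions
find . -name "*.pdf" -print0 | while IFS= read -r -d '' f; do new=$(echo "$f" | sed 's_^\./__;s_/_\._g'); echo "renaming $f to $new"; mv -n "$f" "$new"; done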
Then place all in a dir named example.com.dir1:
mkdir example.com.dir1
for f in $(find . -name "*.pdf" -not -path "./example.com.dir1/*"); do echo "moving $f" && mv "$f" example.com.dir1/; done
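As a quick sanity check (a sketch, assuming the directory names used above), compare how many PDFs landed in example.com.dir1 with whatever is still left behind in the mirrored tree:
# count what arrived, then list any stragglers outside the new directory
find example.com.dir1 -name "*.pdf" | wc -l
find . -path ./example.com.dir1 -prune -o -name "*.pdf" -print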