Programmatically download all available Linux Journal Magazine issues

My work subscribed to Linux Journal and I wanted to grab all the available issues to read at my leisure, but couldn't be bothered to download them all manually. Here's how to grab them all with "wget". Please note that you'll need a valid subscription to use this method. This is not a how-to for stealing the mags. ;-)

The Linux Journal folks give issues away for free after a couple of months, and for just a $30/year subscription you not only get access to the current issues for a year, but to all the back issues as well, not to mention the "Linux Journal's System Administration Special Edition" bonus PDF. So please don't ask me to give them away.

Ok, so you purchased a subscription. Good for you. Here's how to grab them.

1. Go to www.linuxjournal.com/digital
2. Sign in with your account number (LJxxxxxxx) and your Zip code
3. Click on "Digital Downloads"
4. Save the source of the page as "dljdownload.html"
5. Then run the following to download all available issues:

for pdfcode in $(grep get-pdf.php dljdownload.html | cut -d\" -f2); do
  # fetch the intermediate page once and reuse it for both fields
  spitline=$(curl -s "$pdfcode" | grep action=spit2)
  pdfaddress=$(echo "$spitline" | cut -d\" -f2 | sed 's/amp;//g')
  pdfname=$(echo "$spitline" | grep -o "dlj.*pdf")
  wget "http://download.linuxjournal.com$pdfaddress" -O "$pdfname"
done
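To see what the `cut -d\" -f2` step is doing, here's a sketch against a hypothetical line of dljdownload.html (the real markup may differ):

```shell
# Hypothetical download link as it might appear in dljdownload.html;
# cut -d'"' -f2 splits on double quotes and keeps the second field,
# i.e. the first quoted string: the URL.
line='<a href="https://secure2.linuxjournal.com/pdf/get-pdf.php?code=abc123">PDF</a>'
echo "$line" | cut -d'"' -f2
# → https://secure2.linuxjournal.com/pdf/get-pdf.php?code=abc123
```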

6. You should end up with a directory of pdf files (69 for me):

ls | head
dlj132.pdf
dlj133.pdf
dlj134.pdf
dlj135.pdf
dlj136.pdf
dlj137.pdf
dlj138.pdf
dlj139.pdf
dlj140.pdf
dlj141.pdf
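If a download silently fails (expired session, wrong account number), the server may hand back an HTML error page that still gets saved under a .pdf name. A quick sanity-check sketch: real PDFs start with the magic bytes "%PDF".

```shell
# Flag any file in the current directory that doesn't start with the
# PDF magic bytes "%PDF" -- those are likely saved error pages.
for f in *.pdf; do
  if head -c 4 "$f" | grep -q '%PDF'; then
    echo "OK   $f"
  else
    echo "BAD  $f"
  fi
done
```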

7. Let's say you were particularly bored and wanted to use issue date and number in the file name:

for old in dlj*.pdf; do
  num=$(echo "$old" | grep -Eo "[0-9]+")
  date=$(pdftotext "$old" - | head -200 | grep -Eo "(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER) [0-9]{4}" | head -1 | awk '{print $2"."$1}')
  new="ISSUE.$num-$date.pdf"
  echo "renaming $old to $new"
  mv "$old" "$new"
done

Which should produce:

ls | head
ISSUE.132-2005.APRIL.pdf
ISSUE.133-2005.MAY.pdf
ISSUE.134-2005.JUNE.pdf
ISSUE.135-2005.JULY.pdf
ISSUE.136-2005.AUGUST.pdf
ISSUE.137-2005.SEPTEMBER.pdf
ISSUE.138-2005.OCTOBER.pdf
ISSUE.139-2005.NOVEMBER.pdf
ISSUE.140-2005.DECEMBER.pdf
ISSUE.141-2006.JANUARY.pdf
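The filename construction from step 7 can be checked in isolation before touching any files. This dry run assumes pdftotext found the string "APRIL 2005" in the issue:

```shell
# Build the new name from a sample filename and a sample date string,
# without renaming anything.
old=dlj132.pdf
num=$(echo "$old" | grep -Eo '[0-9]+')              # extract issue number
date=$(echo "APRIL 2005" | awk '{print $2"."$1}')   # swap to YEAR.MONTH
echo "ISSUE.$num-$date.pdf"
# → ISSUE.132-2005.APRIL.pdf
```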

13 Comments

  • 1. Matt Dunlap replies at 10th September 2010, 4:35 pm :

    I’ll trade you my subscription to “Seventeen” for your subscription of “Linux Journal”

  • 2. Alain replies at 10th September 2010, 4:47 pm :

    Hah, I would of course, except I already subscribe to all the teen magazines.

  • 3. anyone replies at 2nd June 2011, 12:22 pm :

    Thanks a lot, made my day :)

  • 4. abeltje replies at 24th June 2011, 12:21 am :

    Thanks a lot for this explanation
    great work :)

  • 5. Hamradio replies at 20th August 2011, 4:09 am :

    Your scripts work fine, thanks a lot Alain!

  • 6. Josiah Ritchie replies at 23rd August 2011, 7:13 pm :

    Excellent, just saved me a bunch of time.

  • 7. Josiah Ritchie replies at 23rd August 2011, 7:23 pm :

    For some reason, the script didn’t work when put in the background. Any idea why?

  • 8. Jan van Haarst replies at 1st March 2012, 1:02 pm :

    I have changed the script so that it works with the current state of the website, and let curl get the name of the file from the response of the remote website, which saves one curl request.

    for pdfcode in $(grep download-pdf.jpg dljdownload.html | cut -d\" -f2);
    do
    	pdfaddress=$(curl "$pdfcode" | grep action=spit2 | cut -d\" -f2 | sed 's/amp;//g');
    	curl --remote-time --remote-header-name --remote-name "http://download.linuxjournal.com$pdfaddress";
    done
  • 9. Jon replies at 7th April 2012, 4:13 am :

    Thanks to Jan van Haarst – this now works, although the OP’s solution no longer does. Thanks to both, though, as this is the *only* page where anyone has attempted to solve this problem. You guys rock :)

  • 10. Bjarte replies at 27th July 2012, 2:40 am :

    Hi,

    If you get tired of the first four steps, try curl!

    #!/bin/bash
     
    # urlencoded values.
    UE_EMAIL="first.last%40gmail.com"
    UE_ZIPCODE=1234
    UE_ACCOUNT_NUMBER=123456
     
    # initialize the cookie-jar
    curl --cookie-jar cjar --output /dev/null \
    https://www.pubservice.com/SubInfo.aspx?PC=LJ
     
    # time to login
    curl --cookie cjar --cookie-jar cjar \
    --user-agent 'Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20100101 Firefox/13.0' \
    --data '__EVENTTARGET' \
    --data '__EVENTARGUMENT' \
    --data '__VIEWSTATE' \
    --data "_ctl0%3AContentPlaceHolder1%3ATextBox4=${UE_EMAIL}" \
    --data "_ctl0%3AContentPlaceHolder1%3ATextBox5=${UE_ZIPCODE}" \
    --data '_ctl0%3AContentPlaceHolder1%3AButton2=SUBMIT' \
    --data '_ctl0%3AContentPlaceHolder1%3ACollapsiblePanelExtenderAddress_ClientState' \
    --data '_ctl0%3AContentPlaceHolder1%3ACollapsiblePanelExtenderEmail_ClientState' \
    --location \
    --output loginresult.html \
    https://www.pubservice.com/SubInfo.aspx?PC=LJ
     
    # get the page containing the urls.
    curl --cookie cjar --cookie-jar cjar \
    --user-agent 'Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20100101 Firefox/13.0' \
    --data "ucLJFooter_accountnumber=${UE_ACCOUNT_NUMBER}" \
    --data 'ucLJFooter%3AhrefDigitalDownload=Digital Downloads' \
    --output dljdownload.html \
    https://secure2.linuxjournal.com/pdf/dljdownload.php
  • 11. Jeronimo replies at 28th March 2013, 8:32 pm :

    Good script. I took Bjarte's script (curl version) and appended this to the end, in case anyone wants all the files (mobi, pdf, epub). curl is cool.

    for pdfcode in $(grep get-doc.php dljdownload.html | cut -d\" -f2); do
      pdfaddress=$(curl "$pdfcode" | grep action=spit2 | cut -d\" -f2 | sed 's/amp;//g')
      #pdfname=$(curl "$pdfcode" | grep action=spit2 | grep -o "dlj.*pdf")
      #wget "http://download.linuxjournal.com$pdfaddress" -O $pdfname
     
      name=$(echo $pdfcode | cut -d\& -f2 | cut -d\= -f2 | cut -d\- -f2- )
      ext=$(echo $pdfcode | cut -d\& -f2 | cut -d\= -f2 | cut -d\- -f1 )
     
      curl --cookie cjar --cookie-jar cjar -C - \
      --user-agent 'Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20100101 Firefox/13.0' \
      "http://download.linuxjournal.com"$pdfaddress \
      -o $name.$ext
    done
  • 12. Mattcen replies at 11th April 2013, 2:02 pm :

    Hi all,

    I’ve taken all the ideas from the OP and previous comments, added a few of my own improvements, and ended up with a single script to do everything:

    #!/bin/bash
    # Code derived from
    # http://giantdorks.org/alain/programatically-download-all-available-linux-journal-magazine-issues/
     
    # urlencoded values.
    # Set as environment variables in shell to override
    : ${UE_EMAIL:="first.last%40example.net"}
    : ${UE_ZIPCODE:=1234}
    : ${UE_ACCOUNT_NUMBER:=654321}
     
    # Create a temporary directory to work in, and delete it on exit.
    d="$(mktemp -dt "${0##*/}".XXXXXX)"
    trap "rm -rf \"$d\"" 0 TERM ERR QUIT
     
    # This removes code duplication for curl arguments later on
    curl_()
    {
      curl --cookie "$d"/cjar --cookie-jar "$d"/cjar --user-agent \
      'Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20100101 Firefox/13.0' \
      "$@"
    }
     
    # initialize the cookie-jar
    curl --cookie-jar "$d"/cjar --output /dev/null \
    https://www.pubservice.com/SubInfo.aspx?PC=LJ
     
    # time to login
    curl_ \
    --data '__EVENTTARGET' \
    --data '__EVENTARGUMENT' \
    --data '__VIEWSTATE' \
    --data "_ctl0%3AContentPlaceHolder1%3ATextBox4=$UE_EMAIL" \
    --data "_ctl0%3AContentPlaceHolder1%3ATextBox5=$UE_ZIPCODE" \
    --data '_ctl0%3AContentPlaceHolder1%3AButton2=SUBMIT' \
    --data '_ctl0%3AContentPlaceHolder1%3ACollapsiblePanelExtenderAddress_ClientState' \
    --data '_ctl0%3AContentPlaceHolder1%3ACollapsiblePanelExtenderEmail_ClientState' \
    --location \
    --output "$d"/loginresult.html \
    "https://www.pubservice.com/SubInfo.aspx?PC=LJ"
     
    # get the page containing the urls.
    curl_ \
    --data "ucLJFooter_accountnumber=$UE_ACCOUNT_NUMBER" \
    --data 'ucLJFooter%3AhrefDigitalDownload=Digital Downloads' \
    --output "$d"/dljdownload.html \
    https://secure2.linuxjournal.com/pdf/dljdownload.php
     
    # Iterate through files, grabbing just the section between the
    # first pair of double-quotes
    while IFS=\" read -r _ pdfcode _
    do
      # Strip out suitable name and extension for destination file, using
      # URL as base
      ext=${pdfcode#*=*=}
      ext=${ext%%-*}
      name=${pdfcode%-*}
      name=${name#*-*-}
     
      pdfaddress="$(curl "$pdfcode" | grep action=spit2 | cut -d\" -f2 | sed 's/amp;//g')"
     
      curl_ \
      -C - "http://download.linuxjournal.com$pdfaddress" \
      -o "$name.$ext"
    done < <(grep get-doc.php "$d"/dljdownload.html)

    Hope this helps.

    Mattcen

  • 13. Antonio replies at 8th April 2014, 1:59 pm :

    Hello guys,
    I had the need to do a clever download of my favourite magazine.
    Basically it adds just a few enhancements to the previous ones, but I had to satisfy some different needs, such as looking at the issues before downloading them, doing a selective download (filtered by month, by type, or by content), and adding a bit of multithreading support to reduce the time spent on the download.

    In case you're interested in trying it out, please feel free to clone and collaborate on my project; you can find it at the following link on Bitbucket:
    https://bitbucket.org/tonyqui/linux-journal-downloader

    Hoping it would help you,
    Cheers,
    Antonio
