Shell script to calculate spam statistics
I cooked up a little shell script to produce monthly email statistics: how much email was received, how much of it was spam, what percentage of the spam was correctly identified, and so on. Previously, I ran the numbers by hand and typed them into OpenOffice Calc to get the stats. Boring!
Example output:
root@dpork:~# spam-stats-month Aug 2009

------------------------------------
Stats for Aug 2009
------------------------------------
Ham     SpamC   SpamR   SpamM   HamC
151     122     3444    7       0
--------------------------------------------------------------
3724            Total messages
3573            Total Spam (Caught + Missed + Rejected)
95.94%          Spam as % of all mail
96.38%          % of Spam rejected by Postfix at SMTP time
0%              False positive rate (Ham misclassified as Spam)
.18%            False negative rate (Spam misclassified as Ham)
99.80%          Spam catch rate (Spam filter accuracy)
--------------------------------------------------------------
My anti-spam arsenal consists of Postfix anti-spam measures, which reject obvious spam at the SMTP level, and SpamAssassin via amavisd-new, which scans mail that was accepted for delivery. SpamAssassin combines network tests (Razor, Pyzor), Bayes statistical analysis, FuzzyOcr for image spam, and some custom pattern-matching rules of my own.
After SA is done analyzing and tagging a message, it hands it back to Postfix, which then hands it to Cyrus for final delivery. Cyrus IMAP provides a powerful Sieve filtering interface, which I use to sort mail into my IMAP folders. My Sieve filter sorts spam over a certain threshold into the SpamCaught folder and marks messages with really high scores as read, so I don't see them. The low-scoring stuff I like to review: I manually move any false positives to the HamCaught folder and any false negatives to the SpamMissed folder. My nightly Bayes training script scans both folders and learns from them.
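The nightly training script is not shown in this post. As a rough idea of what such a job might look like, here is a minimal sketch built on SpamAssassin's sa-learn command. It is hypothetical: the function name learn_folder and the SA_LEARN and SPOOL variables are made up for illustration, the spool path is borrowed from the stats script, and it assumes the job runs as the user that owns the Bayes database used by amavisd-new.

```bash
#!/bin/bash
# Hypothetical nightly Bayes training job (a sketch, not the actual script).

SA_LEARN="${SA_LEARN:-sa-learn}"   # overridable, e.g. SA_LEARN=echo for a dry run
SPOOL="${SPOOL:-/var/spool/cyrus/mail/domain/g/gd.org/a/user/ak}"

# learn_folder <spam|ham> <folder>
# Feeds every Cyrus message file (named like "123.") in the folder to
# sa-learn. Does nothing when the folder is missing or holds no messages.
learn_folder() {
    kind=$1
    dir=$2
    set -- "$dir"/*.               # message files become the positional args
    [ -e "$1" ] || return 0        # glob did not match anything
    "$SA_LEARN" "--$kind" "$@"
}

learn_folder spam "$SPOOL/SpamMissed"   # false negatives: learn as spam
learn_folder ham  "$SPOOL/HamCaught"    # false positives: learn as ham
```

Passing only the files that end in a dot keeps Cyrus's own index and header files in each folder out of the training set.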
The spam stats script runs on the mail server (the same Debian box runs Postfix and Cyrus). It simply loops through my IMAP folders, counting messages in the Inbox (ham), SpamCaught (correctly identified spam), SpamMissed (false negatives), and HamCaught (false positives), and it parses the mail logs to get a total for messages rejected by Postfix. It then does some basic bash arithmetic and calls bc to calculate percentages.
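To make the arithmetic concrete, here is a small sketch that reproduces the Aug 2009 percentages from the five counts. Note that it is a variation, not what the script does: it swaps bc for plain shell integer math, and the helper name pct is made up for illustration. Like bc with scale=2, it truncates rather than rounds.

```bash
#!/bin/bash
# pct <part> <whole>: percentage truncated to two decimals, integer math only.
pct() {
    if [ "$2" -eq 0 ]; then echo "0.00"; return; fi   # avoid division by zero
    h=$(( $1 * 10000 / $2 ))                          # hundredths of a percent
    printf '%d.%02d\n' $(( h / 100 )) $(( h % 100 ))
}

# Counts taken from the Aug 2009 example output
Ham=151; SpamC=122; SpamR=3444; SpamM=7; HamC=0

TotalMsg=$(( Ham + SpamC + SpamR + SpamM + HamC ))    # 3724
TotalSpam=$(( SpamC + SpamR + SpamM ))                # 3573

echo "Spam as % of all mail:   $(pct "$TotalSpam" "$TotalMsg")%"
echo "Rejected at SMTP time:   $(pct "$SpamR" "$TotalSpam")%"
echo "False negatives:         $(pct "$SpamM" "$TotalMsg")%"
echo "Spam catch rate:         $(pct $(( SpamC + SpamR )) "$TotalSpam")%"
```

This prints 95.94%, 96.38%, 0.18% and 99.80%, matching the example output. The only visible difference is the leading zero, which bc omits (.18).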
And here’s the script:
#!/bin/bash
# Get some spam numbers
# Usage: spam-stats-month Oct 2009

set -e
PATH=/usr/bin:/bin:/usr/local/bin; export PATH

usage() {
    echo -e 1>&2 "\n Usage error..\n Example: spam-stats-month Oct 2009\n"
    exit 127
}

# Verify input. The patterns live in variables because bash 3.2+ treats a
# quoted right-hand side of =~ as a literal string, and \b is not portable ERE.
month_re='^[A-Z][a-z]{2}$'
year_re='^[0-9]{4}$'
if [[ ! $1 =~ $month_re ]] || [[ ! $2 =~ $year_re ]]; then
    usage
fi

# -----------------------------------------------------------------
# Convert text month input date format to numeric for use with "ls"
# (e.g. "Nov 2009" to "2009-11")
case "$1" in
    Jan) nummon="01";;
    Feb) nummon="02";;
    Mar) nummon="03";;
    Apr) nummon="04";;
    May) nummon="05";;
    Jun) nummon="06";;
    Jul) nummon="07";;
    Aug) nummon="08";;
    Sep) nummon="09";;
    Oct) nummon="10";;
    Nov) nummon="11";;
    Dec) nummon="12";;
    *)   usage;;    # three letters, but not a month abbreviation
esac

# Construct numeric date
numdate="$2-$nummon"

maildir=/var/spool/cyrus/mail/domain/g/gd.org/a/user/ak

# Count "ls -l" lines dated within $numdate.
# --time-style=long-iso (GNU ls) forces YYYY-MM-DD dates in any locale.
# "wc -l" is used rather than "grep -c" because grep -c exits non-zero on
# zero matches, which would abort the script under "set -e".
count_in_month() {
    ls -l --time-style=long-iso "$@" | grep "$numdate" | wc -l
}

#1. Ham (Cyrus stores each message in a file named like "123.")
cd "$maildir/"
Ham=$(count_in_month *.)

#2. SpamRejected: Postfix "reject:" lines for the month, taken from every
#   mail log (plain or gzipped) last modified in that month
SpamR=0
for f in $(ls -l --time-style=long-iso /var/log/mail.log* | grep "$numdate" | awk '{print $NF}')
do
    n=$(zgrep "postfix" "$f" | grep "reject:" | grep "^$1" | wc -l)
    SpamR=$(( SpamR + n ))
done

#3. SpamCaught (IMAP folder plus the amavis quarantine)
cd "$maildir/SpamCaught/"
SpamC1=$(count_in_month *.)
SpamC2=$(count_in_month /var/lib/amavis/virusmails/)
SpamC=$(( SpamC1 + SpamC2 ))

#4. SpamMissed
cd "$maildir/SpamMissed/"
SpamM=$(count_in_month *.)

#5. HamCaught
cd "$maildir/HamCaught/"
HamC=$(count_in_month *.)

echo ""
echo "------------------------------------"
echo -e "Stats for $1 $2"
echo "------------------------------------"
echo -e "Ham\tSpamC\tSpamR\tSpamM\tHamC"
echo -e "$Ham\t$SpamC\t$SpamR\t$SpamM\t$HamC"

# Calc percentages
TotalMsg=$(( Ham + SpamC + SpamR + SpamM + HamC ))
TotalSpam=$(( SpamC + SpamR + SpamM ))
TotalSpamI=$(( SpamC + SpamR ))

# pct PART WHOLE: percentage with two decimals via bc; 0 if WHOLE is 0
pct() {
    if [ "$2" -eq 0 ]; then echo 0; else echo "scale=2; $1*100/$2" | bc; fi
}

# False positives and false negatives are expressed as a share of all mail
FPpercent=$(pct "$HamC" "$TotalMsg")
FNpercent=$(pct "$SpamM" "$TotalMsg")
SpamPercent=$(pct "$TotalSpam" "$TotalMsg")
SpamRPercent=$(pct "$SpamR" "$TotalSpam")
SpamCatchRate=$(pct "$TotalSpamI" "$TotalSpam")

echo "--------------------------------------------------------------"
echo -e "$TotalMsg\t\tTotal messages"
echo -e "$TotalSpam\t\tTotal Spam (Caught + Missed + Rejected)"
echo -e "$SpamPercent%\t\tSpam as % of all mail"
echo -e "$SpamRPercent%\t\t% of Spam rejected by Postfix at SMTP time"
echo -e "$FPpercent%\t\tFalse positive rate (Ham misclassified as Spam)"
echo -e "$FNpercent%\t\tFalse negative rate (Spam misclassified as Ham)"
echo -e "$SpamCatchRate%\t\tSpam catch rate (Spam filter accuracy)"
echo "--------------------------------------------------------------"
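The counting steps in the script parse the output of ls, which depends on how ls prints dates. One possible alternative, sketched here and not part of the script, is to let find select files by modification time. This assumes GNU find (for -newermt) and GNU date (for -d), both of which Debian ships; the helper name count_month is made up for illustration.

```bash
#!/bin/bash
# count_month <folder> <YYYY-MM>
# Prints the number of Cyrus message files (named like "123.") in the folder
# whose modification time falls inside the given month.
count_month() {
    start="$2-01"
    end=$(date -d "$start +1 month" +%Y-%m-%d)     # first day of the next month
    find "$1" -maxdepth 1 -type f -name '*.' \
        -newermt "$start" ! -newermt "$end" | wc -l
}

# Usage, with the variables from the script:
#   Ham=$(count_month /var/spool/cyrus/mail/domain/g/gd.org/a/user/ak "$numdate")
```

Like the ls approach, this goes by file modification time, so it should give the same counts while not caring about locale or the ls date format.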