Shell script to calculate spam statistics
Cooked up a little shell script to produce monthly email statistics such as the amount of email received, how much of it was spam, the percentage of spam correctly identified, etc. Previously, I ran the numbers manually and entered them into OpenOffice Calc to get the stats -- boring!
Example output:
root@dpork:~# spam-stats-month Aug 2009
------------------------------------
Stats for Aug 2009
------------------------------------
Ham     SpamC   SpamR   SpamM   HamC
151     122     3444    7       0
--------------------------------------------------------------
3724 Total messages
3573 Total Spam (Caught + Missed + Rejected)
95.94% Spam as % of all mail
96.38% % of Spam rejected by Postfix at SMTP time
0% False positive rate (Ham misclassified as Spam)
.18% False negative rate (Spam misclassified as Ham)
99.80% Spam catch rate (Spam filter accuracy)
--------------------------------------------------------------
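As a cross-check, the percentages above can be reproduced from the five counts. The sketch below is not the script's own code: it uses plain bash integer arithmetic instead of bc, and the helper name pct is made up for illustration. It truncates to two decimals the way bc does with scale=2, but it prints a leading zero (0.18 rather than .18) and 0.00 rather than 0.

```shell
#!/bin/bash
# Hypothetical helper: print part/whole as a percentage, truncated to two decimals.
pct() {
    local part=$1 whole=$2
    if (( whole == 0 )); then
        echo "n/a"          # avoid dividing by zero for an empty month
        return 0
    fi
    local n=$(( part * 10000 / whole ))   # percentage in hundredths, truncated
    printf '%d.%02d\n' $(( n / 100 )) $(( n % 100 ))
}

# Counts taken from the Aug 2009 sample output
Ham=151 SpamC=122 SpamR=3444 SpamM=7 HamC=0
TotalMsg=$(( Ham + SpamC + SpamR + SpamM + HamC ))   # 3724
TotalSpam=$(( SpamC + SpamR + SpamM ))               # 3573

echo "Spam as % of all mail:  $(pct "$TotalSpam" "$TotalMsg")%"              # 95.94%
echo "Rejected at SMTP time:  $(pct "$SpamR" "$TotalSpam")%"                 # 96.38%
echo "False positive rate:    $(pct "$HamC" "$TotalMsg")%"                   # 0.00%
echo "False negative rate:    $(pct "$SpamM" "$TotalMsg")%"                  # 0.18%
echo "Spam catch rate:        $(pct $(( SpamC + SpamR )) "$TotalSpam")%"     # 99.80%
```

Note that both error rates are taken relative to all messages, which is how the script defines them.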
My anti-spam arsenal consists of Postfix anti-spam measures, which reject obvious spam at the SMTP level, and SpamAssassin via amavisd-new, which scans mail that was accepted for delivery. The latter uses network tests (Razor, Pyzor), Bayes statistical analysis, FuzzyOcr for image spam and some custom pattern-matching rules of my own.
After SA is done analyzing and tagging a message, it hands it back to Postfix, which then hands it to Cyrus for final delivery. Cyrus IMAP provides a powerful Sieve filtering interface, which I use to sort mail into my IMAP folders. My Sieve filter sorts spam over a certain threshold into the SpamCaught folder and marks messages with really high scores as read, so I don't see them. The low-scoring stuff I like to review, manually moving any false positives to the HamCaught folder and any false negatives to the SpamMissed folder. My nightly Bayes training script scans these and learns from them.
The spam stats script runs on the mail server (the same Debian box runs Postfix and Cyrus). It simply loops through my IMAP folders, counting messages in the Inbox (ham), SpamCaught (correctly identified spam), SpamMissed (false negatives) and HamCaught (false positives), and parses the mail logs to get a total for messages rejected by Postfix. It then does some basic bash arithmetic and calls bc to calculate percentages.
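A side note on the counting step, separate from the script that follows: counting by grepping ls -l output depends on how ls prints dates. A possible alternative, sketched here under the assumption of GNU find (-newermt) and GNU date (-d), selects message files by modification time directly. The function name count_month is hypothetical, and like the ls approach it assumes a file's modification time reflects the month the message arrived.

```shell
#!/bin/bash
# Hypothetical helper: count Cyrus message files (named like "123.") directly
# inside a folder whose modification time falls within the given month.
# Assumes GNU find (-newermt) and GNU date (-d).
count_month() {
    local dir=$1 mon=$2 year=$3
    local start end
    # First day of the month, e.g. "Aug 2009" -> 2009-08-01; fails on a bad month name
    start=$(date -d "1 $mon $year" +%Y-%m-%d) || return 1
    # First day of the following month, e.g. 2009-09-01
    end=$(date -d "$start +1 month" +%Y-%m-%d) || return 1
    # Strictly newer than the start, not newer than the end; -maxdepth 1
    # keeps subfolders such as SpamCaught out of the Inbox count
    find "$dir" -maxdepth 1 -type f -name '*.' \
        -newermt "$start" ! -newermt "$end" | wc -l
}
```

Usage would look like Ham=$(count_month /var/spool/cyrus/mail/domain/g/gd.org/a/user/ak Aug 2009). A side effect is that date also validates the month name, so a separate lookup table is not needed for this part.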
And here's the script:
#!/bin/bash
# Get some spam numbers
# Usage: spam-stats-month Oct 2009
set -e
PATH=/usr/bin:/bin:/usr/local/bin; export PATH
# Verify input
# The regexes must be unquoted: bash 3.2+ treats a quoted pattern as a literal string
if [[ ! $1 =~ ^[A-Z][a-z]{2}$ ]] || [[ ! $2 =~ ^[0-9]{4}$ ]]; then
    echo -e 1>&2 "\n Usage error..\n Example: spam-stats-month Oct 2009\n"
    exit 127
fi
# ----------------------------------------------------------------- #
# Convert text month input date format to numeric for use with "ls"
# (e.g. "Nov 2009" to "2009-11")
case "$1" in
Jan) nummon="01";;
Feb) nummon="02";;
Mar) nummon="03";;
Apr) nummon="04";;
May) nummon="05";;
Jun) nummon="06";;
Jul) nummon="07";;
Aug) nummon="08";;
Sep) nummon="09";;
Oct) nummon="10";;
Nov) nummon="11";;
Dec) nummon="12";;
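# Anything else passed the pattern check but is not a real month
*) echo "Unknown month: $1" 1>&2; exit 127;;
esac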
# Construct numeric date
numdate=$2-$nummon
#1. Ham
cd /var/spool/cyrus/mail/domain/g/gd.org/a/user/ak/
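# --time-style=long-iso (GNU ls) forces YYYY-MM-DD dates so the grep does not depend on the locale
Ham=$(ls -l --time-style=long-iso *. | grep "$numdate" | wc -l)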
#2. SpamRejected
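file="/tmp/spam.stats.$1.rejected"
# Start from an empty scratch file
: > "$file"
# Pick the log files last modified in the requested month, then count reject lines for it
for f in $(ls -l --time-style=long-iso /var/log/mail.log* | grep "$numdate" | awk '{print $NF}')
do
    zgrep "postfix" "$f" | grep "reject:" | grep "^$1" | wc -l >> "$file"
done
# s+0 prints 0 instead of an empty string when no log file matched
SpamR=$(awk '{s+=$0} END {print s+0}' "$file")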
#3. SpamCaught
cd /var/spool/cyrus/mail/domain/g/gd.org/a/user/ak/SpamCaught/
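# Spam filed by Sieve plus spam quarantined by amavisd-new
SpamC1=$(ls -l --time-style=long-iso *. | grep "$numdate" | wc -l)
SpamC2=$(ls -l --time-style=long-iso /var/lib/amavis/virusmails/ | grep "$numdate" | wc -l)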
SpamC=$(( $SpamC1 + $SpamC2 ))
#4. SpamMissed
cd /var/spool/cyrus/mail/domain/g/gd.org/a/user/ak/SpamMissed/
SpamM=$(ls -l --time-style=long-iso *. | grep "$numdate" | wc -l)
#5. HamCaught
cd /var/spool/cyrus/mail/domain/g/gd.org/a/user/ak/HamCaught/
HamC=$(ls -l --time-style=long-iso *. | grep "$numdate" | wc -l)
echo ""
echo "------------------------------------"
echo -e "Stats for $1 $2"
echo "------------------------------------"
echo -e "Ham\tSpamC\tSpamR\tSpamM\tHamC"
echo -e "$Ham\t$SpamC\t$SpamR\t$SpamM\t$HamC"
# Calc percentages
TotalMsg=$(( $Ham + $SpamC + $SpamR + $SpamM + $HamC ))
TotalSpam=$(( $SpamC + $SpamR + $SpamM ))
TotalSpamI=$(( $SpamC + $SpamR ))
FPpercent=$( echo "scale=2; $HamC*100/$TotalMsg" | bc)
FNpercent=$( echo "scale=2; $SpamM*100/$TotalMsg" | bc)
SpamPercent=$( echo "scale=2; $TotalSpam*100/$TotalMsg" | bc)
SpamRPercent=$( echo "scale=2; $SpamR*100/$TotalSpam" | bc)
SpamCatchRate=$( echo "scale=2; $TotalSpamI*100/$TotalSpam" | bc)
echo "--------------------------------------------------------------"
echo -e "$TotalMsg\t\tTotal messages"
echo -e "$TotalSpam\t\tTotal Spam (Caught + Missed + Rejected)"
echo -e "$SpamPercent%\t\tSpam as % of all mail"
echo -e "$SpamRPercent%\t\t% of Spam rejected by Postfix at SMTP time"
echo -e "$FPpercent%\t\tFalse positive rate (Ham misclassified as Spam)"
echo -e "$FNpercent%\t\tFalse negative rate (Spam misclassified as Ham)"
echo -e "$SpamCatchRate%\t\tSpam catch rate (Spam filter accuracy)"
echo "--------------------------------------------------------------"
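Step 2 of the script above writes one count per log file to a scratch file and then sums them. As a rough alternative, a single awk pass over the log lines can filter and count at once, with no temporary file. This is only a sketch; the function name count_rejects is hypothetical, and it applies the same three filters as the script (line starts with the month name, mentions postfix, contains "reject:").

```shell
#!/bin/bash
# Hypothetical helper: read syslog-style mail log lines on stdin and count
# Postfix reject lines whose timestamp starts with the given month name.
count_rejects() {
    awk -v mon="$1" '
        index($0, mon) == 1 && /postfix/ && /reject:/ { n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    '
}
```

On a box with GNU gzip, something like SpamR=$(zcat -f /var/log/mail.log* | count_rejects Aug) would feed it both plain and compressed logs. Classic syslog timestamps carry no year, so reading every rotated log this way assumes the logs on disk cover less than a year; the script avoids that by only reading log files last modified in the requested month.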