Shell script to calculate spam statistics

Cooked up a little shell script to produce monthly email statistics such as the amount of email received, how much of it was spam, the percentage of spam correctly identified, etc. Previously, I ran the numbers manually and entered them into OpenOffice Calc to get the stats -- boring!

Example output:

root@dpork:~# spam-stats-month Aug 2009

------------------------------------
Stats for Aug 2009
------------------------------------
Ham	SpamC	SpamR	SpamM	HamC
151	122	3444	7	0
--------------------------------------------------------------
3724		Total messages
3573		Total Spam (Caught + Missed + Rejected)
95.94%		Spam as % of all mail
96.38%		% of Spam rejected by Postfix at SMTP time
0%		False positive rate (Ham misclassified as Spam)
.18%		False negative rate (Spam misclassified as Ham)
99.80%		Spam catch rate (Spam filter accuracy)
--------------------------------------------------------------

My anti-spam arsenal consists of Postfix anti-spam measures, which reject obvious spam at the SMTP level, and SpamAssassin via amavisd-new, which scans mail that was accepted for delivery. The latter uses network tests (Razor, Pyzor), Bayes statistical analysis, FuzzyOcr for image spam, and some custom pattern-matching rules of my own.

After SA is done analyzing and tagging a message, it hands it back to Postfix, which then hands it to Cyrus for final delivery. Cyrus IMAP provides a powerful Sieve filtering interface, which I use to sort mail into my IMAP folders. My Sieve filter sorts spam over a certain threshold into the SpamCaught folder and marks messages with really high scores as read, so I don't see them. The low-scoring stuff I like to review: I manually move any false positives to the HamCaught folder and any false negatives to the SpamMissed folder. My nightly Bayes training script scans these folders and learns from them.

The spam stats script runs on the mail server (same Debian box runs Postfix and Cyrus), and very simply loops through my IMAP folders counting messages in the Inbox (ham), SpamCaught (correctly identified spam), SpamMissed (false negatives), HamCaught (false positives) and also parses mail logs to get a total for messages rejected by Postfix. It then does some basic bash arithmetic and calls bc to calculate percentages.
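The folder-counting step could be sketched as a small function. This is only a minimal sketch, not the method the script itself uses: it assumes GNU find (for -newermt and -printf) and GNU date (for -d), and count_month is a hypothetical helper name. It selects message files by modification time rather than by parsing ls output, so it does not depend on the locale's date format.

```shell
#!/bin/bash
# count_month DIR YYYY-MM   (hypothetical helper, not part of the original script)
# Counts Cyrus-style message files (names ending in ".", e.g. "123.") in DIR
# whose modification time falls within the given month.
# Assumes GNU find and GNU date.
count_month() {
  local dir=$1 month=$2 start end
  start="${month}-01"
  # First day of the following month, e.g. 2009-08-01 -> 2009-09-01
  end=$(date -d "${start} +1 month" +%Y-%m-%d)
  # Print one dot per matching file and count the dots, so the result
  # does not depend on what the file names look like.
  # Note: -newermt is a strict comparison, so a file stamped exactly at
  # midnight on the first of a month is counted with the previous month.
  find "$dir" -maxdepth 1 -type f -name '*.' \
    -newermt "$start" ! -newermt "$end" -printf '.' | wc -c
}
```

Called as, for example, count_month /path/to/folder 2009-08, it prints a single number that can be captured with $( ) just like the ls-based counts.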

And here's the script:

#!/bin/bash
# Get some spam numbers
# Usage: spam-stats-month Oct 2009

set -e

PATH=/usr/bin:/bin:/usr/local/bin; export PATH

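# Verify input (month like "Oct", four-digit year).
# The patterns must be unquoted: bash 3.2+ treats a quoted regex as a literal string.
if [[ ! $1 =~ ^[A-Z][a-z]{2}$ ]] || [[ ! $2 =~ ^[0-9]{4}$ ]]; then
  echo -e 1>&2 "\n Usage error..\n Example: spam-stats-month Oct 2009\n"
  exit 1
fi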

# ----------------------------------------------------------------- #
# Convert text month input date format to numeric for use with "ls"
# (e.g. "Nov 2009" to "2009-11")
case "$1" in
        Jan) nummon="01";;
        Feb) nummon="02";;
        Mar) nummon="03";;
        Apr) nummon="04";;
        May) nummon="05";;
        Jun) nummon="06";;
        Jul) nummon="07";;
        Aug) nummon="08";;
        Sep) nummon="09";;
        Oct) nummon="10";;
        Nov) nummon="11";;
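        Dec) nummon="12";;
        # Anything else (e.g. "Abc") passes the format check but is not a month
        *) echo "Unknown month: $1" 1>&2; exit 1;;
esac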
# Construct numeric date
numdate=$2-$nummon

#1. Ham
cd /var/spool/cyrus/mail/domain/g/gd.org/a/user/ak/
# Force ISO dates (YYYY-MM-DD) so the grep below works regardless of locale
Ham=$(ls -l --time-style=long-iso *. | grep "$numdate" | wc -l)

#2. SpamRejected
file="/tmp/spam.stats.$1.rejected"

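# Start with an empty temp file (created if missing, truncated if present)
: > "$file"

# Only scan log files last modified in the requested month
for f in $(ls -l --time-style=long-iso /var/log/mail.log* | grep "$numdate" | awk '{print $NF}')
  do zgrep "postfix" "$f" | grep "reject:" | grep "^$1" | wc -l >> "$file"
done

# Sum the per-file counts; "s+0" prints 0 instead of an empty string when no log file matched
SpamR=$(awk '{s+=$0} END {print s+0}' "$file")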

#3. SpamCaught
cd /var/spool/cyrus/mail/domain/g/gd.org/a/user/ak/SpamCaught/
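# Force ISO dates (YYYY-MM-DD) so the grep works regardless of locale
SpamC1=$(ls -l --time-style=long-iso *. | grep "$numdate" | wc -l)
SpamC2=$(ls -l --time-style=long-iso /var/lib/amavis/virusmails/ | grep "$numdate" | wc -l)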
SpamC=$(( $SpamC1 + $SpamC2 ))

#4. SpamMissed
cd /var/spool/cyrus/mail/domain/g/gd.org/a/user/ak/SpamMissed/
SpamM=$(ls -l --time-style=long-iso *. | grep "$numdate" | wc -l)

#5. HamCaught
cd /var/spool/cyrus/mail/domain/g/gd.org/a/user/ak/HamCaught/
HamC=$(ls -l --time-style=long-iso *. | grep "$numdate" | wc -l)

echo ""
echo "------------------------------------"
echo -e "Stats for $1 $2"
echo "------------------------------------"
echo -e "Ham\tSpamC\tSpamR\tSpamM\tHamC"
echo -e "$Ham\t$SpamC\t$SpamR\t$SpamM\t$HamC"

# Calc percentages
TotalMsg=$(( $Ham + $SpamC + $SpamR + $SpamM + $HamC ))
TotalSpam=$(( $SpamC + $SpamR + $SpamM ))
TotalSpamI=$(( $SpamC + $SpamR ))
FPpercent=$( echo "scale=2; $HamC*100/$TotalMsg" | bc)
FNpercent=$( echo "scale=2; $SpamM*100/$TotalMsg" | bc)
SpamPercent=$( echo "scale=2; $TotalSpam*100/$TotalMsg" | bc)
SpamRPercent=$( echo "scale=2; $SpamR*100/$TotalSpam" | bc)
SpamCatchRate=$( echo "scale=2;  $TotalSpamI*100/$TotalSpam" | bc)

echo "--------------------------------------------------------------"
echo -e "$TotalMsg\t\tTotal messages"
echo -e "$TotalSpam\t\tTotal Spam (Caught + Missed + Rejected)"
echo -e "$SpamPercent%\t\tSpam as % of all mail"
echo -e "$SpamRPercent%\t\t% of Spam rejected by Postfix at SMTP time"
echo -e "$FPpercent%\t\tFalse positive rate (Ham misclassified as Spam)"
echo -e "$FNpercent%\t\tFalse negative rate (Spam misclassified as Ham)"
echo -e "$SpamCatchRate%\t\tSpam catch rate (Spam filter accuracy)"
echo "--------------------------------------------------------------"
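For a quiet month the totals can be zero, and the divisions above would then fail. As a hedged alternative to calling bc, the percentage step could be done with integer shell arithmetic and a zero guard. The function name pct is hypothetical and is not part of the script above. Like bc with scale=2, it truncates rather than rounds, though it prints a leading zero ("0.18" where bc prints ".18").

```shell
#!/bin/bash
# pct NUM DEN   (hypothetical helper, not part of the original script)
# Prints NUM as a percentage of DEN with two decimals, truncated (not rounded),
# using only integer shell arithmetic. Prints "n/a" when DEN is zero.
pct() {
  local num=$1 den=$2 scaled
  if (( den == 0 )); then
    echo "n/a"
    return 0
  fi
  # Scale by 10000 so the last two digits are the decimal places
  scaled=$(( num * 10000 / den ))
  printf '%d.%02d\n' $(( scaled / 100 )) $(( scaled % 100 ))
}
```

With the Aug 2009 numbers from the example output, pct 3573 3724 prints 95.94 and pct 7 3724 prints 0.18.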
