Supplement your existing comment spam detection solution with a web server level DNSBL/RBL based approach
I've been running a personal blog since the late '90s, with the current iteration dating from 2007. Over the years I tried a few different comment spam prevention solutions, and since 2010 I've been using Akismet almost exclusively. For the most part, it's been over 99% effective, typically letting through just a few spam comments per month. On Thursday, April 11, however, I started getting an unusually high number of notifications from WordPress about comments in the moderation queue awaiting approval.
Akismet has detected a problem, but what is it?
When I finally got around to troubleshooting it, the WP comments page showed the following message:
Akismet has detected a problem. Some comments have not yet been checked for spam by Akismet. They have been temporarily held for moderation. Please check your Akismet configuration and contact your web host if problems persist.
Curiously enough, under "Plugins" > "Akismet Configuration", everything was green: it said that the API key was valid, that "Akismet is working correctly" and that "All servers are accessible".
The history of individual comments showed messages like:
6 hours ago - Akismet cleared this comment during an automatic retry.
1 day ago - Akismet was unable to check this comment (response: ), will automatically retry again later.
Fired up a packet sniffer and saw requests going out to Akismet servers. The WordPress core and the Akismet plugin were both up to date. Aside from new posts and comments, there had been no changes in months. My server didn't seem overwhelmed, so I figured the problem might be on Akismet's side. I'm using their free service, so perhaps I'd exceeded some quota. Went over to akismet.com to check, but found no information there that would suggest any issues with my account.
A quick hack to buy time
As spam comment submissions continued to pour in and the moderation queue continued to grow, I decided to do something about it and whipped up a quick script to grab (from the WordPress comments table) the IP addresses of hosts with more than one comment in the moderation queue, and to block them using Apache mod_access.
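The original script isn't shown here, so what follows is only a minimal sketch of the idea in Python. It assumes the IP addresses of pending comments have already been pulled out of the database (in a default WordPress install that would be the comment_author_IP column of wp_comments where comment_approved = '0'; the wp_ prefix may differ on your site). The function name build_deny_lines and the min_comments parameter are hypothetical.

```python
from collections import Counter
import ipaddress


def build_deny_lines(pending_ips, min_comments=2):
    """Return Apache 'Deny from' lines for hosts with at least
    min_comments comments sitting in the moderation queue."""
    counts = Counter()
    for raw in pending_ips:
        try:
            ip = ipaddress.ip_address(raw.strip())
        except ValueError:
            continue  # skip empty or malformed values
        counts[str(ip)] += 1

    repeat_offenders = [ip for ip, n in counts.items() if n >= min_comments]

    def sort_key(ip_text):
        # Sort numerically; group by IP version so IPv4 and IPv6 never
        # get compared against each other directly.
        addr = ipaddress.ip_address(ip_text)
        return (addr.version, int(addr))

    return [f"Deny from {ip}" for ip in sorted(repeat_offenders, key=sort_key)]
```

The resulting lines could be written to a file that Apache includes, then Apache reloaded. Regenerating the file on a schedule would give the "spammers get added automatically" behavior described next.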
Then, as new spam comments came in, spammers would automatically get added to my local black list and be refused access before they could attempt more comment submissions. This stopped the rapid growth of the moderation queue (which had thousands of comments by now). I marked a couple of pages' worth as spam (maybe 40 or so comments), assuming this reported them back to Akismet, which felt like a good thing to do. But because each page of just 20 or so comments took about half a minute to process, I gave up and deleted the rest with a quick SQL query.
Then over the weekend, since it was too windy to wakeboard (again), and Deb and Olivia went to see the extended family, I started thinking about improving my temporary fix. While it worked well to address the flood, I felt it was too aggressive as a permanent solution because, if Akismet completely stopped working, my hack would soon be banning legitimate commenters unless I kept up with moderation. Last month I got over 11K spam comments, so there's no way I'd be able to keep up, even if I wanted to.
I was tempted to just disable comments altogether, but didn't because sometimes I actually get useful comments. Check out this post on programmatically downloading Linux Journal Magazine issues -- it started off with me posting a simple script, others continued to improve it long after I stopped caring, and it eventually turned into something much better than the original. I love that.
I also don't like giving in to lowlife spammers. One of the things I enjoy about hosting my own email is effectively having achieved 100% spam identification accuracy years ago using all Free Software. Plus whatever I learn here could be useful at work.
With that in mind, I considered a few different solutions: Akismet competitors like Defensio and Mollom, some sort of reverse Turing test like CAPTCHA (I personally find these annoying), and just outsourcing comment hosting to Disqus altogether. I rejected them all (for now, at least) because Akismet had by then decided to start working again and was keeping up with the new comments still flowing into the moderation queue (albeit at a much lower rate than before), so I decided to keep using it.
Still wanted something to complement Akismet in case this problem returned. One of the potential solutions I came across was a WordPress plugin called AVH First Defense Against Spam, which leveraged third party services like Stop Forum Spam, Project Honey Pot and the good old Spamhaus -- I'd been using Real-time Blackhole Lists for years for my mail server and prior to Akismet used Project Honey Pot -- so I'm familiar and comfortable with a DNSBL/RBL based approach. Stop Forum Spam shares their dataset with dnsbl.tornevall.org, making it 100% compatible with the DNSBL methodology and toolset I'm familiar with.
It's DNSBL, stupid!
Using my DNSBL/RBL lookup script, I checked a few dozen different DNSBL providers and found that dnsbl.tornevall.org was by far the best for this application, correctly identifying 95% of comment spammers in the local black list I'd built up over the past couple of days.
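For readers who haven't used a DNSBL before, here is a rough Python sketch of how a lookup works. It is not the lookup script mentioned above. The convention is to reverse the IPv4 octets, append the DNSBL zone, and resolve the resulting name: an answer means the address is listed, NXDOMAIN means it isn't. The function names and the resolve parameter are hypothetical, and IPv6 is left out for brevity.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to query: reversed IPv4 octets + DNSBL zone."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch handles IPv4 only")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """True if the DNSBL returns any A record for this address."""
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No such name (or the lookup failed): treat as not listed.
        return False
    return True


def hit_rate(ips, zone, resolve=socket.gethostbyname):
    """Fraction of the given addresses that the DNSBL flags."""
    ips = list(ips)
    if not ips:
        return 0.0
    return sum(is_listed(ip, zone, resolve) for ip in ips) / len(ips)
```

Running hit_rate over a local black list against each candidate zone is one way to compare providers, which is roughly the kind of comparison described above. The resolve parameter is there so the logic can be exercised without touching the network.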
Instead of a WordPress plugin, I used Apache mod_defensible (apt-get it!), so the functionality is web application agnostic. That also means it would work well at my day job: as a lone sysadmin at Stanford Law, if a solution doesn't scale or requires babysitting, it doesn't fly.
By now, this setup has been running for over a day and is working well so far -- currently, the majority of spam comment submissions are blocked by Apache before WordPress ever sees them (and therefore before Akismet, which runs as a WordPress plugin, sees them either). Akismet is still very useful, but instead of having to deal with 100% of spam comments, it only has to look at the few that get through the DNSBL checks. Having two layers also buys me time if one layer fails for whatever reason.
[ UPDATE 2013-05-27 ] Here's an area chart visualizing the number of comments blocked using the three methods since first implementing the Apache (mod_access) based block list on Apr 12, then adding the DNSBL based layer on Apr 14:
As of May 27, DNSBL + Apache (mod_access) handled about 96.7% of comment spam (15,719 comments), with Akismet taking care of the remaining 3.3% (542).
Because of this, while May has seen by far the highest number of spam comments ever attempted (over 16K), Akismet stats report an all time low number of spam comments for May:
Month    Spam detected  Ham detected  Missed spam  False positives
2013-05            542             4            0                0
2013-04          7,565            56           51                0
2013-03         11,845            15            1                0
2013-02          8,824             9            0                0
2013-01          8,798             6            4                0
2012-12          6,975             7            4                0
2012-11          9,590             6            0                0
I could probably disable the custom mod_access based block list and just use DNSBL and Akismet, but leaving it in for now since to date it hasn't caused any false positives.
[ UPDATE 2015-07-21 ] Added new spam comment block graph and stats here.
Performance and scalability considerations
Security usually involves tradeoffs and it's no different here. Depending on your architecture and traffic patterns, a DNSBL based solution can be a non-starter, since checking the IP addresses of visitors against DNSBLs could significantly slow down your response times and potentially add lots of DNS traffic. That would be unacceptable for me even on my personal blog, and would never scale at my day job.
For me, the problem is solved by using Varnish Cache to serve the vast majority of requests surprisingly quickly and efficiently (considering puny resources of my $9/month VPS), so DNSBL lookups via Apache mod_defensible essentially just happen on POSTs, which I have Varnish passing through to the backend (Apache, in my case). If you're using Apache to handle all your traffic, perhaps consider DNSBL lookups via mod_security instead, since that allows you to target specific requests by REQUEST_METHOD or REQUEST_URI, but I would really urge you to consider a frontend cache, especially if you're using a heavy framework or a fancy content management system (Drupal, WordPress, etc) to assemble your content. Varnish, Nginx and BigIP are the ones I have experience with, but there are other web caches and accelerators out there.
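To make the "only check the requests that matter" idea concrete, here is a small Python sketch of the decision logic. It is neither Varnish VCL nor a mod_security rule, just an illustration of filtering on request method and URI. It is also narrower than my own setup, where every POST that Varnish passes through gets checked. The function name is hypothetical, and /wp-comments-post.php is the stock WordPress comment endpoint; adjust for your own application.

```python
def needs_dnsbl_check(method, uri, comment_paths=("/wp-comments-post.php",)):
    """Decide whether a request is worth a DNSBL lookup.

    Only POSTs to a known comment submission path qualify; everything
    else (page views, feeds, static files) skips the lookup entirely.
    """
    path = uri.split("?", 1)[0]  # ignore any query string
    return method.upper() == "POST" and path in comment_paths
```

With a filter like this in front of the lookup, ordinary page views never generate DNS queries, so the cost of the DNSBL check is only paid on comment submissions.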
Also, if you're going to set this up, I'd recommend running a local caching name server on the same box as the web server to minimize query volume to your upstream DNS provider (this could matter depending on your web traffic). Just be sure to secure it so your name server can't be misused for DNS amplification attacks.
If in doubt, I might be available for consulting.