How I built an effective blog comment spam blocker
Mention comment spam and most people, in particular those crazy WordPress users, mention Akismet. It's a great tool and I have nothing against it, but I wanted to build my own, avoiding the external call to the Akismet service. What has been interesting to see is just how effective it is. Turns out, my spammers are quite obvious.
I use a points system, an idea I borrowed from Movable Type, whose spam protection works the same way. For everything in a comment that I like, you get a point. For everything I don't like, you lose a point (or two, or three). If you get a 1 or higher, you've made it on the site as a valid comment. If you get a 0, it's set for moderation and I'll take a look at it. If it's below 0, it's marked as spam and I'll never see it (although I check every couple of weeks just in case a legitimate comment needs to be unflagged). If it falls below -10, I don't even bother saving it to the database since it is so obviously spam.
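As a rough sketch of how those thresholds could be wired up: the function name and the status labels here are made up for illustration, and this is not the code that runs on this site.

```python
def classify(score: int) -> str:
    """Map a comment's total points to an action (labels are hypothetical)."""
    if score < -10:
        return "discard"    # so obviously spam that it is never saved
    if score < 0:
        return "spam"       # saved, but flagged as spam
    if score == 0:
        return "moderate"   # held for a manual look
    return "approved"       # 1 or higher goes straight onto the site


if __name__ == "__main__":
    for total in (3, 0, -4, -11):
        print(total, classify(total))
```

Checking the most extreme case first keeps each branch down to a single comparison.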
Types of Spam
There are two main types of spam: automated and manual.
Automated spam is the most obvious. It relies on a handful of tricks, and it stands out when you see the same message posted a dozen times within seconds of each other. Automated spam is also the easiest to catch: so insanely simple that just a few rules would catch about 95% of all comment spam hitting a server. (That percentage may even be higher... I'm just guessing.)
Manual spam, on the other hand, is more devious. People actually try and respond to the article at hand, which makes it slightly harder to catch. I say slightly because the vast majority of manual spammers do such a poor job at leaving a comment that they stand out like a sore thumb. The remaining few are usually the ones you end up filtering by hand.
The quickest way to reduce the amount of comment spam you get is to simply turn off the comments on a post after a certain amount of time. It doesn't require any server-side programming and it's built into almost all blogging tools. It works quite well, for two major reasons:
- Automated spammers keep a database of pages to submit to. If the form is no longer there, you don't get spam. Spammers are forced to discover new pages to spam.
- Manual spammers often target pages with higher page ranks. There are plenty of search engine tools to help people look this information up. (I'd actually see referrers from these search tools, followed shortly by a new blog comment.) Higher page ranks tend to go to older, popular posts. With the comment form shut down, manual spammers are left to target newer pages in the hope of going unnoticed until the page gets a higher ranking.
I've had old posts where I left the comments open for years and would still see users come across them and add to the discussion in meaningful ways. I loved that. However, that almost never happens now. So I finally gave in and now just close comments.
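Most blogging tools have this as a setting, but if you are rolling your own, the check is tiny. Here is a minimal sketch. The 14-day window and the function name are assumptions chosen for illustration, so pick whatever cutoff suits your site.

```python
from datetime import datetime, timedelta

# Assumed cutoff for illustration; use whatever window suits your site.
COMMENT_WINDOW = timedelta(days=14)


def comments_open(published_at: datetime, now: datetime) -> bool:
    """Return True while a post is still inside its comment window."""
    return now - published_at <= COMMENT_WINDOW


if __name__ == "__main__":
    published = datetime(2024, 1, 1)
    print(comments_open(published, datetime(2024, 1, 10)))  # inside the window
    print(comments_open(published, datetime(2024, 2, 1)))   # window has passed
```

The same check should guard both rendering the form and accepting a submission, since automated spam posts straight to the form's URL whether or not the form is displayed.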
A blog comment has 5 fields, and I test each one separately and in various combinations against various rules. The fields are: body, email, author name, URL, and IP.
Here now are my rules for filtering blog comments.
| Test | Condition | Points |
|---|---|---|
| How many links are in the body | More than 2 | -1 point per link |
| | Less than 2 | +2 points |
| How long is the body | More than 20 characters and there's no links | +2 points |
| | Less than 20 characters | -1 point |
| Number of previous comments from email | Approved comments | +1 point per |
| | Marked as spam | -1 point per |
| Keyword search | Levitra, viagra, casino, etc. | -1 point per |
| URLs that have certain words or characters in them | .html, .info, ?, & or free | -1 point per |
| URLs that have certain TLDs | .de, .pl, or .cn (sorry guys) | -1 point |
| URL length | More than 30 characters | -1 point |
| Body starts with... | Interesting, Sorry, Nice or Cool | -10 points |
| Author name | has http:// in it | -2 points per |
| Body used in previous comment | | -1 point per |
| Random character match | 5 consonants | -1 point per |
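To make the table a bit more concrete, here is a hedged sketch of the rules that only look at the comment body. The names are invented for this example, and a few details are my own assumptions: links are counted by looking for `http://` or `https://`, keywords are counted per occurrence, and a body with exactly 2 links gets no link points because the table doesn't say.

```python
import re

LINK_RE = re.compile(r"https?://", re.IGNORECASE)
SPAM_KEYWORDS = ("levitra", "viagra", "casino")   # the table's "etc." is left open
BAD_OPENERS = ("interesting", "sorry", "nice", "cool")


def score_body(body: str) -> int:
    """Apply the body-only rules from the table and return the points."""
    points = 0
    text = body.strip()
    lower = text.lower()

    # Link count: more than 2 costs a point per link, fewer than 2 earns 2.
    links = len(LINK_RE.findall(text))
    if links > 2:
        points -= links
    elif links < 2:
        points += 2

    # Length: reward a real message with no links, penalise a very short one.
    if len(text) > 20 and links == 0:
        points += 2
    elif len(text) < 20:
        points -= 1

    # Keywords: -1 per occurrence (assumption; could also be per distinct word).
    for keyword in SPAM_KEYWORDS:
        points -= lower.count(keyword)

    # Known spammy openers get the heavy -10 penalty.
    if lower.startswith(BAD_OPENERS):
        points -= 10

    return points


if __name__ == "__main__":
    print(score_body("Thanks for writing this up, it helped me a lot."))
    print(score_body("Nice post"))
```

Because `startswith` is a plain prefix test, an opener like "Nicely done" would also trip the -10 rule. A word-boundary regex would be stricter if that matters to you.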
Once you have a database of spam messages, you can observe certain patterns. Checking in on the data from time to time, I discovered some interesting stats:
Write something of consequence. If it's less than 20 characters, you obviously don't have much to say.
Most people who include a URL use a top-level domain or a subdomain. They're not using querystring parameters or any other crazy URL structures. And I'm sorry to all the Germans, Poles and Chinese, but a few of your fellow countrymen aren't being very nice.
URLs that are longer than 30 characters are almost always spam. This ties in with the last filter. If you've got a URL, it's short, sweet and sexy. It's not crazy long — although I have seen some crazy long, perfectly legitimate URLs.
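The URL rules from the table could look something like the sketch below. The function and constant names are invented for this example. It assumes the token rule means -1 for each listed token that appears at least once, and that length is measured on the whole URL as submitted, scheme included.

```python
from urllib.parse import urlparse

URL_TOKENS = (".html", ".info", "?", "&", "free")
PENALISED_TLDS = (".de", ".pl", ".cn")
MAX_URL_LENGTH = 30


def score_url(url: str) -> int:
    """Apply the URL rules from the table; an empty URL scores 0."""
    if not url:
        return 0
    points = 0
    lower = url.lower()

    # -1 for each listed token that shows up anywhere in the URL.
    points -= sum(1 for token in URL_TOKENS if token in lower)

    # Look at the hostname only, so "/about.de" in a path doesn't count as a TLD.
    # A URL typed without a scheme gets "//" prepended so urlparse finds the host.
    host = urlparse(lower if "//" in lower else "//" + lower).hostname or ""
    if host.endswith(PENALISED_TLDS):
        points -= 1

    # Long URLs are almost always spam.
    if len(url) > MAX_URL_LENGTH:
        points -= 1

    return points


if __name__ == "__main__":
    print(score_url("http://example.com"))
    print(score_url("http://example.de/page.html?x=1&y=2"))
```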
It may seem like I'm being overly severe on people who start their comments with Interesting, Sorry, Nice or Cool, but it's a very specific pattern that I'm matching. I was getting 10 to 20 hits of the same message coming in. It was just easier to match the messages and essentially ban them.
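Repeated messages are also what the history rules in the table are for: a point per previously approved comment from the same email, a point off per comment previously marked as spam, and a point off each time the same body has been seen before. Here is a minimal sketch. The `StoredComment` record and its field names are hypothetical, standing in for whatever your comments table looks like.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StoredComment:
    """Hypothetical shape of a previously saved comment."""
    email: str
    body: str
    status: str  # e.g. "approved" or "spam"


def score_history(email: str, body: str, previous: Iterable[StoredComment]) -> int:
    """Score a new comment against comments already in the database."""
    points = 0
    for old in previous:
        if old.email.lower() == email.lower():
            if old.status == "approved":
                points += 1      # this email has a good track record
            elif old.status == "spam":
                points -= 1      # this email has been flagged before
        if old.body.strip() == body.strip():
            points -= 1          # same message seen before
    return points


if __name__ == "__main__":
    saved = [
        StoredComment("a@example.com", "Hello there", "approved"),
        StoredComment("a@example.com", "Buy now", "spam"),
        StoredComment("b@example.com", "Buy now", "spam"),
    ]
    print(score_history("a@example.com", "Buy now", saved))
```

In practice you would push both checks into database queries rather than looping over every saved comment. The loop just keeps the example self-contained.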
Random character matches
The other thing I noticed was email addresses or author names that were just a random string of characters. If there are no vowels, sure, you might be Polish, but it's more likely that you're spam. Rarely do even the Polish have 5 consonants in a row!
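A regular expression handles this check nicely. In the sketch below, the names are made up for illustration. Treating "y" as a vowel is my assumption, and each non-overlapping run of 5 consonants costs a point.

```python
import re

# Five consonants in a row; "y" is deliberately left out (treated as a vowel).
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{5}", re.IGNORECASE)


def score_random_characters(*fields: str) -> int:
    """Return -1 for every run of 5 consonants found in the given fields."""
    return -sum(len(CONSONANT_RUN.findall(field)) for field in fields)


if __name__ == "__main__":
    print(score_random_characters("Jonathan", "jon@example.com"))
    print(score_random_characters("bcdfghjklm", "qwrtp@example.com"))
```

Pass in the author name and the email address together. Real names rarely trip it, while keyboard-mash names lose a point for each run.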
How effective has it been? These days, I only see a new spam message get through maybe once every week or two. It's usually a message that somebody has hand-typed to be relevant to the page, but the comment is near useless and the author name is quite evidently spam.
I've also reworded the disclaimer text under the submit box to let people know that I'm actively on the lookout for spam and that even legitimate comments will get edited or marked as spam if they abuse the system. This lets people know (those who like to leave signatures on a blog comment, say, or who use their company name as their author name) that being underhanded will not be rewarded.
Despite my past frustration with spam, things are at a point now where I'm happy to leave comments open on recent posts for a couple weeks and then just close them up and never have to worry about them again. It certainly isn't the death of comments I thought it might need to come to.