Block Spam with a Scoring System

One of the most effective spam blocking techniques I used in my custom CMS was a scoring system. The idea is that the more spammy someone is, the higher the score they are given. Once a person or spam-bot has a score higher than our pre-set threshold, their comment or correspondence is ditched. The beauty of a system like this is that you get total control over scoring, and over what you consider to be spam.

This should, in theory, work with any PHP script that allows users to interact via a form, be it a guestbook, commenting system, FAQ script, etc. They’re all alike in that they should take the data, validate it and process it. (There’s no reason why this won’t work with other languages, but you’ll have to work that out yourself.) A scoring-based spam protection system is not a replacement for proper validation of data, of course, but it’s great used alongside.

The first thing you need to do is figure out exactly what should trigger a score, and how severe each trigger is. There’s theoretically no limit to how many tests you can have, but bear in mind that the more there are, and the higher the scores assigned to each, the more likely a genuine user is to fall foul of the system. I had no scientific method of deciding a cut-off point for spam scores; I simply bumped it up or down depending on a) how much spam was getting through and b) how many genuine commenters were getting caught out.

Anyway, to get you started I recommend checking for the following as a bare minimum (swap comment for feedback, response, whatever is applicable):

The number of links in a comment. In theory, the more links, the more likely it is to be spam.
Occurrences of popular spam words (porn, viagra, xanax, etc). Unless you’re a hot laydee publishing pictures of your boobs on the ‘net or a pharmaceutical company, you probably won’t need to allow these words and this check is unlikely to trigger false positives.
Length of comment. Anything under about 10 characters is not worth anyone’s time.
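
Here’s a rough sketch of how those three checks might look in PHP. The function names, point values and word list are all made up for illustration (they are not the original CMS code), so treat it as a starting point and tune the numbers against your own spam.

```php
<?php
// Hypothetical point values and limits; adjust to taste.
const POINTS_PER_LINK = 1;
const POINTS_PER_WORD = 5;
const POINTS_SHORT    = 2;
const MIN_LENGTH      = 10;

// Adds points for every link-like "http://" or "https://" found.
function scoreLinks(string $comment): int
{
    return POINTS_PER_LINK * (int) preg_match_all('~https?://~i', $comment);
}

// Adds points once for each listed spam word that appears in the comment.
function scoreSpamWords(string $comment, array $words): int
{
    $score = 0;
    foreach ($words as $word) {
        // stripos() is case-insensitive and returns false when nothing is found.
        if (stripos($comment, $word) !== false) {
            $score += POINTS_PER_WORD;
        }
    }
    return $score;
}

// Adds points when the trimmed comment is shorter than MIN_LENGTH characters.
function scoreLength(string $comment): int
{
    return strlen(trim($comment)) < MIN_LENGTH ? POINTS_SHORT : 0;
}

$comment = 'Cheap viagra at http://example.com and http://example.org';
$score = scoreLinks($comment)
       + scoreSpamWords($comment, ['porn', 'viagra', 'xanax'])
       + scoreLength($comment);

echo $score; // 2 links (2) + 1 spam word (5) + long enough (0) = 7
```

Note that stripos() matches substrings, so a listed word will also trigger inside a longer word. If that causes false positives, a regular expression with word boundaries is one alternative.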

If you want to be thorough, and depending on the scope of your system, you may also wish to check for:

HTML in comments (must be checked before strip_tags() is applied)
No existing approved comments from the same commenter stored in the database
URLs longer than a certain number of characters (spammers often leave long URLs because they point at specific pages deep in their site structure)
Profanity/other less common spam words
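
The extra checks could be sketched along these lines. Again, the names, point values and the 60-character URL limit are assumptions for illustration only, and the database lookup is left out: scoreNewCommenter() simply takes a count that you would fetch yourself.

```php
<?php
// Adds points if the raw comment contains HTML.
// Must run on the raw input, before strip_tags() is applied anywhere else.
function scoreHtml(string $rawComment): int
{
    // strip_tags() only changes the string if there was markup to remove.
    return strip_tags($rawComment) !== $rawComment ? 2 : 0;
}

// Adds a point for commenters with no previously approved comments.
// $approvedCount is assumed to come from your own database query.
function scoreNewCommenter(int $approvedCount): int
{
    return $approvedCount === 0 ? 1 : 0;
}

// Adds points for each URL longer than $maxLength characters.
function scoreLongUrls(string $comment, int $maxLength = 60): int
{
    preg_match_all('~https?://\S+~i', $comment, $matches);
    $score = 0;
    foreach ($matches[0] as $url) {
        if (strlen($url) > $maxLength) {
            $score += 2;
        }
    }
    return $score;
}
```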

The process is then simple. Create a variable to store your score ($score should just about cover it) and then increment the score as and when necessary:

if (check returns true)
	$score += #;

…where # is the numeric value assigned to the check you’re doing.
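
One way to keep the scoring tidy is to hold the checks and their point values in a single array and loop over it. This is only a sketch: the two checks, their points and the totalScore() name are invented for the example, and the arrow functions need PHP 7.4 or later.

```php
<?php
// Each entry pairs a hypothetical check with the points it adds when it returns true.
$checks = [
    // More than two link-like strings in the comment.
    ['points' => 3, 'test' => fn(string $c): bool => substr_count(strtolower($c), 'http') > 2],
    // Comment shorter than 10 characters once trimmed.
    ['points' => 2, 'test' => fn(string $c): bool => strlen(trim($c)) < 10],
];

// Runs every check against the comment and adds up the points.
function totalScore(string $comment, array $checks): int
{
    $score = 0;
    foreach ($checks as $check) {
        if ($check['test']($comment)) {
            $score += $check['points'];
        }
    }
    return $score;
}

echo totalScore('hi', $checks); // only the length check fires: 2
```

Keeping the point values in one place makes it easier to bump them up or down when too much spam gets through or too many genuine commenters get caught.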

When you’ve finished running tests (and your data is 100% validated against any necessary regular expressions, etc.) you can check the score, then update the database or ditch the comment as necessary. I did this in three levels:

Not spam: score below the lowest threshold. Obviously not spam, so add it as an approved comment.
Possibly spam: score above the lowest threshold but below the top one. It may be spam, but it may also be a new commenter not previously approved, so hold it in moderation.
Spam: score above the top threshold. No questions asked, ditch this one.
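
The three levels boil down to two comparisons. In this sketch the threshold values and the classify() name are hypothetical, and treating a score exactly equal to a threshold as belonging to the higher level is my assumption, since the levels above don’t say which way ties go.

```php
<?php
// Hypothetical thresholds; tune them against real traffic.
const MODERATION_THRESHOLD = 5;
const SPAM_THRESHOLD       = 10;

// Maps a final score onto one of the three levels.
function classify(int $score): string
{
    if ($score >= SPAM_THRESHOLD) {
        return 'spam';       // ditch it
    }
    if ($score >= MODERATION_THRESHOLD) {
        return 'moderate';   // hold for manual approval
    }
    return 'approved';       // publish straight away
}

echo classify(7); // moderate
```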

And that is, hypothetically speaking, all there is to it. Now all you’ve got to do is figure out how to code it :P

16 comments so far

Vera said:
On 21 May at 7:52 pm

*makes notes* Interesting… although sometimes one could be joking… like say post “zomg ursomean” comment to one of your pants awards type of posts. In that case you might mark it as spam, even though it is a genuine comment. :\
Vasili said:
On 21 May at 8:02 pm

Woah! The post has been made :3 You only left us with the hardest thing to do..CODE IT! XP
Veronica said:
On 21 May at 8:08 pm

It’s very interesting seeing how this can correlate into websites. In the tech support system, email spam is judged many times on the exact same kind of system.
Ian said:
On 21 May at 8:29 pm

thanks for this article. its a very interesting and thorough logistic you’ve given us. it’s very versatile as well since every user can decide what words they consider profane/spam and thus is very customizable.
Emsz said:
On 21 May at 8:57 pm

I always wondered about your scoring system, because some of my comment occasionally were marked “5” and I never knew why :P
fran said:
On 21 May at 9:17 pm

Excellent, I needed a scripting project for the weekend and this is it. All I do at the moment is strip links, the odd nasty ad-word and limit comments to x per y minutes. I think you use regex for nasty word searches (?) – is there any reason why strpos is not sufficient? Cheers.
Sarah said:
On 21 May at 10:10 pm

This is interesting. It’s really cool how you created this!
Aisling said:
On 21 May at 11:09 pm

One day, when I get my act together and actually settle down to learn stuff, this will come in handy. :)
Han said:
On 21 May at 11:25 pm

ooh I feel a build and fix model coming on for this, or perhaps a waterfall model.

YAY no more software engineering! :P
Michael Aulia said:
On 22 May at 7:11 am

Mostly nowadays spams are generated from bots (Javascript and alike)..by identifying those bots alone can reduce lots of extra work needed!
Jem said:
On 22 May at 8:32 am

@fran:

I think you use regex for nasty word searches (?) – is there any reason why strpos is not sufficient?

None at all, I think it was just the easiest option for me at the time. :)

@Michael:

Mostly nowadays spams are generated from bots (Javascript and alike)..by identifying those bots alone can reduce lots of extra work needed!

Absolutely! But that’s the sort of thing I did before anything else – before getting to the scoring system – to cut down on unnecessary processing time. Even better would be to block it using the htaccess file, on which there is an article here for anyone who’s interested: http://tinyurl.com/n3ck
Regina said:
On 22 May at 12:50 pm

Interesting blog entry Jem! This will come in handy in future when I get hundreds of spam comments.
fran said:
On 22 May at 2:05 pm

Thank you, young tutor. :-)

Although I think if one wanted to erase “Viagra” yet for some unknown reason keep “Viagrakitteh”, strpos would remove the latter while preg_match would leave it intact, assuming the search string was “Viagra”. I’d best look that up ’cause I’m probably wrong.
Chans said:
On 23 May at 6:52 am

Interesting article, although it’s a little over my head, so I won’t attempt to do anything with it (just yet) so I don’t mess things up ;)
Vasili said:
On 23 May at 6:19 pm

Would you ever use decimals for the score?
Bloody Akismet — jemjabella.co.uk said:
On 03 Jun at 10:54 am

[…] wonder if I can convert my old score based spam prevention into a WordPress plugin. If nothing else it’ll give me a hands-on look at actually creating a […]