May 21 2008
Block Spam with a Scoring System
One of the most effective spam blocking techniques I used in my custom CMS was a scoring system. The idea is that the more spammy someone is, the higher the score they are given. Once a person or spam-bot has a score higher than our pre-set threshold, their comment or correspondence is ditched. The beauty of a system like this is that you get total control over scoring, and over what you consider to be spam.
This should, in theory, work with any PHP script that allows users to interact via a form, be it a guestbook, commenting system, FAQ script, etc. They’re all identical in that they should take the data, validate it and process it. (There’s no reason why this won’t work with other languages, but you’ll have to work that out yourself.) A scoring-based spam protection system is not a replacement for proper validation of data, of course, but it’s great used alongside.
First thing you’d need to do is figure out exactly what should trigger a score, and how severe each trigger is. There’s theoretically no limit to how many tests you can have, but it is worth bearing in mind that the more there are, and the higher the scores assigned to each, the more likely a genuine user is to fall foul of the system. I had no scientific method of deciding a cut-off point for spam scores; I simply bumped it up or down depending on a) how much spam was getting through and b) how many genuine commenters were getting caught out.
Anyway, to get you started I recommend checking for the following as a bare minimum (swap comment for feedback, response, whatever is applicable):
- The number of links in a comment. In theory, the more links the more likely it is to be spam.
- Occurrences of popular spam words (porn, viagra, xanax, etc). Unless you’re a hot laydee publishing pictures of your boobs on the ‘net or a pharmaceutical company, you probably won’t need to allow these words and this check is unlikely to trigger false positives.
- Length of comment. Anything under about 10 characters is not worth anyone’s time.
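As a rough sketch of those three minimum checks, something like the following would do. The function names, the word list and the 10 character default are my own placeholders for illustration, not code from the CMS, and the whole-word matching is just one way of doing it.

```php
<?php
// Hypothetical helpers: names, word lists and limits are illustrative assumptions.

// Count links by counting "http://" or "https://" prefixes (case-insensitive).
function count_links(string $comment): int
{
    return (int) preg_match_all('#https?://#i', $comment);
}

// Count spam words, matching whole words only so that a longer word
// which merely starts with a spam word is not counted.
function count_spam_words(string $comment, array $words): int
{
    $total = 0;
    foreach ($words as $word) {
        $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
        $total += (int) preg_match_all($pattern, $comment);
    }
    return $total;
}

// True when the trimmed comment is shorter than $minLength bytes.
// strlen() counts bytes, not characters, which is close enough for a rough check.
function is_too_short(string $comment, int $minLength = 10): bool
{
    return strlen(trim($comment)) < $minLength;
}
```

Each helper only reports what it found; how many points a result is worth is a separate decision.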
If you want to be thorough, and depending on the scope of your system, you may also wish to check for:
- HTML in comments (must be checked before strip_tags() is applied)
- No existing comments from the commenter stored in the database
- URLs greater than a certain character length (spammers often leave large URLs as they are directed at specific pages deep in their site structure)
- Profanity/other less common spam words
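Two of these optional checks are easy to sketch without a database. The helper names and the 60 character limit below are assumptions picked for the example, so tune them to your own site.

```php
<?php
// Hypothetical helpers: names and the default limit are illustrative assumptions.

// True when strip_tags() would change the comment, i.e. it contains markup.
// This has to run on the raw input, before strip_tags() is applied for real.
function contains_html(string $comment): bool
{
    return strip_tags($comment) !== $comment;
}

// True when any URL in the comment is longer than $maxLength bytes.
// A "URL" here is simply http(s):// followed by non-whitespace characters.
function has_long_url(string $comment, int $maxLength = 60): bool
{
    preg_match_all('#https?://\S+#i', $comment, $matches);
    foreach ($matches[0] as $url) {
        if (strlen($url) > $maxLength) {
            return true;
        }
    }
    return false;
}
```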
The process is then simple. Create a variable to store your score ($score should just about cover it) and then increment the score as and when necessary:
if (check returns true) $score += #;
..where # is the numeric value assigned to the check you’re doing.
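One way to keep the checks and their values in one place is a list of rules that gets looped over. This is only a sketch: the rule layout, the point values and the score_comment() name are made up for the example, and the arrow functions need PHP 7.4 or later.

```php
<?php
// Hypothetical rule table: each rule pairs a check with the points it adds.
// The checks and point values are illustrative assumptions.
$rules = [
    // Very short comment
    ['points' => 2, 'check' => fn(string $c): bool => strlen(trim($c)) < 10],
    // Contains a popular spam word
    ['points' => 3, 'check' => fn(string $c): bool => stripos($c, 'viagra') !== false],
    // More than two links
    ['points' => 1, 'check' => fn(string $c): bool => preg_match_all('#https?://#i', $c) > 2],
];

// Run every rule against the comment and add up the points of those that fire.
function score_comment(string $comment, array $rules): int
{
    $score = 0;
    foreach ($rules as $rule) {
        if ($rule['check']($comment)) {
            $score += $rule['points'];
        }
    }
    return $score;
}
```

Adding a new test is then just another entry in the list, and bumping a value up or down is a one-line change.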
When you’ve finished running tests — and your data is 100% validated against any necessary regular expressions, etc — you can check the score and update the database or ditch as necessary. I did this in 3 levels:
- Not spam — less than the lowest threshold, obviously not spam, add as approved comment
- Possibly spam — higher than the lowest threshold but lower than the top, may be spam but may also be a new commenter not previously approved, hold in moderation
- Spam — no questions asked, ditch this one
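The three levels boil down to two thresholds. In this sketch the threshold values, the constant names and the returned labels are all assumptions, as is treating a score exactly equal to a threshold as belonging to the higher level.

```php
<?php
// Hypothetical thresholds: adjust them as spam gets through or real commenters get caught.
const SCORE_MODERATE = 3; // at or above this, hold for moderation
const SCORE_SPAM = 8;     // at or above this, ditch the comment

// Map a finished score onto one of the three levels.
function classify_score(int $score): string
{
    if ($score >= SCORE_SPAM) {
        return 'spam';
    }
    if ($score >= SCORE_MODERATE) {
        return 'moderate';
    }
    return 'approved';
}
```

The returned label can then decide whether the comment is saved as approved, saved as pending, or not saved at all.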
And that is, hypothetically speaking, all there is to it. Now all you’ve got to do is figure out how to code it :P
16 Responses so far
-
*makes notes* Interesting… although sometimes one could be joking… like say post “zomg ursomean” comment to one of your pants awards type of posts. In that case you might mark it as spam, even though it is a genuine comment. :\
-
Woah! The post has been made :3 You only left us with the hardest thing to do..CODE IT! XP
-
It’s very interesting seeing how this can correlate into websites. In the tech support system, email spam is judged many times on the exact same kind of system.
-
thanks for this article. its a very interesting and thorough logistic you’ve given us. it’s very versatile as well since every user can decide what words they consider profane/spam and thus is very customizable.
-
I always wondered about your scoring system, because some of my comments occasionally were marked “5” and I never knew why :P
-
Excellent, I needed a scripting project for the weekend and this is it. All I do at the moment is strip links, the odd nasty ad-word and limit comments to x per y minutes. I think you use regex for nasty word searches (?) – is there any reason why strpos is not sufficient? Cheers.
-
This is interesting. It’s really cool how you created this!
-
One day, when I get my act together and actually settle down to learn stuff, this will come in handy. :)
-
ooh I feel a build and fix model coming on for this, or perhaps a waterfall model.
YAY no more software engineering! :P
-
Mostly nowadays spams are generated from bots (Javascript and alike)..by identifying those bots alone can reduce lots of extra work needed!
-
Interesting blog entry Jem! This will come in handy in future when I get hundreds of spam comments.
-
Thank you, young tutor. :-)
Although I think if one wanted to erase “Viagra” yet for some unknown reason keep “Viagrakitteh” strpos would remove the latter while pregmatch would leave it intact, assuming the search string was “Viagra”. I’d best look that up cause I’m probably wrong.
-
Interesting article, although it’s a little over my head, so I won’t attempt to do anything with it (just yet) so I don’t mess things up ;)
-
Would you ever use decimals for the score?
-
[...] wonder if I can convert my old score based spam prevention into a WordPress plugin. If nothing else it’ll give me a hands-on look at actually creating a [...]









