Comment Filtering and Other Goo

22 Nov

A couple of “legitimate people” (and a few trolls) have mentioned having yet more trouble getting through my far-too-strict comment blacklists, and yet frustratingly certain spammers are getting through by posting normal comments with links that simply redirect to dodgy porn/drug sites. I don’t want to ban URLs or comments altogether so I’m going to try and implement a ‘score’ system.

People that post comments using fake e-mail addresses/etc will still be blocked — in fact, e-mail validation should be more strict because I’m hoping to implement fsockopen() to check for actual e-mail address existence as per a nifty article on E-mail Verification with PHP — but things like accidental tripping up of the blacklist (with words like “Christmas” that I had to blacklist FFS) won’t stop a comment from going through. I’m assuming I can set something up much like the scoring for mail apps like Spam Assassin (any PHP gurus got any ideas for potential checks/blocks that I may have missed?) Sounds good in theory, no idea what it’s like in practice…

Karl and I have just discovered: WE HAVE INTERNET. Oh yeah, Internet at home again. This means that I can finally get ’round to replying to about 300 backlogged e-mails, can crop and upload some pictures of the flat, can implement some bug fixes to BellaBook and minor changes to BellaBuffs, part two of the Beginners Guide to PHP, BellaBuffs Collective, [insert other promised scripts/code here]… oh, and last but not least: send out a heartfelt thanks to Jessica who gave me the most creative retaliation to a Pants Award yet. How sweet :)

ETA: I’m just fiddling with a basic scoring on top of spam-prevention that is already in place. If anyone wants to leave a comment just as a test (that looks relatively real), please do so. That way I can adjust assigned scores/deletion levels as appropriate.

Categories: Interwebs

Tags: commenting, internet, scripts

37 Comments

Grant Mc
22 Nov at 8:13 pm

YAY, Internet at last! I can’t wait. You do know that you will have to post a lot more now to make up for what you missed; I think many people would agree. Why not use CAPTCHAs, or whatever they’re called? Or we could even register and type a username and password instead of name, email and URL; that way we could develop a profile to automatically fill in our name, email, URL, profile pic, etc…
Hillarie
22 Nov at 8:34 pm

That is awesome that you have internet again! I think I’ll stop talking about the Jessica thing, because if I continue, the ‘net cops will be after me for being ruuude…
Hillarie
22 Nov at 8:48 pm

Again, sorry about the whole cutenews comment…it was very…unlike me to do that, because I don’t always sink as low as those around me…so I think I’ll stop revisiting sites that win pants awards, because I was way out of hand…
Jem (Post author)
22 Nov at 8:51 pm

@Hillarie: don’t worry about it, I wasn’t offended and I daresay most others weren’t, but I know it does bother some people. That said, you should come to my mum’s house: EVERYTHING is “big” and “gay” there. The fridge is a big gay fridge, the TV is a big gay TV, my brother is a big gay brother, the cat is a big gay cat… don’t ask why, ’cause I’ve no idea.
Grant Mc
22 Nov at 9:00 pm

Just to add: this may help with the spam war you’re having. http://akismet.com/ should help!
Dee
22 Nov at 9:59 pm

I went through a similar thing with sk.log’s commenting system a few months ago. Might I suggest that before attempting the ‘long hard road’ of implementing Bayesian-style weighting, you implement a basic CAPTCHA? The coding involved is almost nonexistent, and even the most basic CAPTCHA I’ve found defeats 99% of blog comment spam (technically, since putting mine in I’ve gotten no spam whatsoever, out of ~60 attempts a day). Forum fights aside, the vast majority of spam is not specifically targeted at any site in particular, so there is no AI involved in attempting to defeat CAPTCHAs. Tactics like fsockopen() -sound- like a good idea for verifying blog comments, until you realise that a) they can be fairly resource intensive, b) they don’t do anything against those hundreds of spammers who use @hotmail.com and @maul.ru emails (or spoofed addresses), and c) because you’re not protecting against bot-abuse you’re potentially opening yourself up as a DoS resource/victim.
Jem (Post author)
22 Nov at 10:13 pm

@Dee: I know about the code behind a basic CAPTCHA, BellaBook/BellaBuffs both have that functionality, but I can’t bloody stand them and won’t consider them unless as a last resort. Thanks for the info on fsockopen() – hadn’t actually read the article yet (still rejoicing over the fact that I finally have Internet back!)
Hillarie
23 Nov at 12:19 am

@Jem: Whew! Oh, and I can’t wait for the PHP article…I’m trying to learn as much about php as I can…even better: I want to learn it the RIGHT or CORRECT way…hehe
Julie
23 Nov at 12:24 am

I’ll take a captcha-free jemjabella, thanks. What does “FFS” mean? (“words like Xmas that I had to blacklist FFS”, you said)
Jenny
23 Nov at 12:42 am

I just added a captcha, but instead of the usual letter jumble (which I also loathe), it just asks you to pick the right picture (radio buttons with randomly generated values). It stops most of the spam I get; most haven’t even re-scraped the site yet to know I’ve added another field. (One spammer was smart enough to start scraping the form for the correct value, although they are still kicked by the blacklist. However, if I were to add a few more pictures and randomize the order, I suspect the captcha would be strong enough on its own, for my humble purposes at least.) But I like the scoring idea. Perhaps certain words that are always spam should be an automatic fail, but the merely suspicious ones are added up by frequency and suspicious-ness… I know one person who checks that comments contain keywords from the article; maybe you could try that, but it would be more work on your part to put together a list of keywords for everything you post. And alone it won’t stop the people who spam you by hand.
Jenny
23 Nov at 12:51 am

One other thing I thought of: I use htaccess to deny suspicious user-agents (blank, the default UAs of various programming languages, known bad bots) and to deny access to the comment processing script for bad referrers (blank is OK, but some idiot was setting his referrer as the URL of the processing script — clearly bogus). That alone keeps out a lot of trash.
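A rough sketch of the checks described in this comment, written in Python rather than as htaccess rules purely for illustration. The host name, the script path, the list of user-agent prefixes and the function name are all assumptions and not the commenter's actual configuration. Rejecting referrers from other hosts is also an assumption; the comment only mentions the processing-script case.

```python
from urllib.parse import urlparse

# Hypothetical values, for illustration only.
SITE_HOST = "example.com"
PROCESSING_PATH = "/comments/process.php"
BAD_AGENT_PREFIXES = ("libwww-perl", "python-urllib", "java/", "curl/")


def is_suspicious_request(user_agent, referrer):
    """Return True when the request looks like it came from a bot."""
    agent = (user_agent or "").strip().lower()
    # A blank user-agent, or the default UA of a scripting library.
    if not agent or agent.startswith(BAD_AGENT_PREFIXES):
        return True

    # A blank referrer is allowed: firewalls and some browsers strip it.
    if not referrer:
        return False

    parsed = urlparse(referrer)
    # Assumption: a referrer from another host is treated as bad.
    if parsed.hostname != SITE_HOST:
        return True
    # A referrer pointing at the processing script itself is clearly bogus.
    return parsed.path == PROCESSING_PATH
```

A request flagged here could simply be denied, or it could add points to a comment's spam score instead of blocking it outright.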
Dee
23 Nov at 12:56 am

Hm, okay, if you don’t want to do CAPTCHAs, the first point of call I’d recommend is to make sure that your forms are actually being submitted from your own page, as opposed to injected remotely. From looking at my own CAPTCHA logs, I can see that the vast majority of spam comments I get are getting blocked with one particular CAPTCHA ‘phrase’, which I know is the value that gets returned if the image isn’t actually loaded up at all; my guess from this is ‘non-browser-based bot’. I’ve never actually implemented one myself, but I’ve heard it suggested that a possible way around this kind of attack is to make sure that your forms are originating from a ‘person’ loading your page up in a browser and filling in a form; a non-intrusive kind of CAPTCHA. The most common way I’ve heard of this being done is setting a session hash (based on the server time with some random salt or whatever), including it as a hidden input field, then double-checking the value in the form processing function; a CAPTCHA without the image, in other words. I’m not sure how effective this is, but I’ve heard it recommended.
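The hidden-field idea might look something like the sketch below. It is in Python for illustration only, with a plain dict standing in for the server-side session; the key name and both function names are made up for the example.

```python
import hashlib
import hmac
import secrets
import time

SESSION_KEY = "comment_form_token"  # hypothetical session key name


def issue_form_token(session):
    """Create a hash from the server time plus a random salt, store it in the
    session, and return it for embedding in a hidden input field."""
    salt = secrets.token_hex(16)
    token = hashlib.sha256(f"{time.time()}{salt}".encode()).hexdigest()
    session[SESSION_KEY] = token
    return token


def check_form_token(session, submitted):
    """Return True only if the submitted value matches the stored hash.
    The stored value is removed, so each token can be used once."""
    expected = session.pop(SESSION_KEY, None)
    if expected is None:
        return False
    # Constant-time comparison; bytes avoid errors on non-ASCII input.
    return hmac.compare_digest(expected.encode(), (submitted or "").encode())
```

The page that renders the form would call `issue_form_token` and the processing script would call `check_form_token`, so a bot posting straight to the script without loading the page first has no valid value to send.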
Belinda
23 Nov at 1:19 am

Jemba..I like it, I really do. Yay on getting internet, wooo! I really have no tips on spam prevention with PHP because I’m utterly clueless, and Akismet+WordPress=No Problems here! :P Hm, I wanted to see the video she made of you. Too bad it’s gone :(
Gabrielle
23 Nov at 3:28 am

I have a form that used to get about twenty spam a day even with extensive anti-spam techniques. I added a pretty simple math equation that’s dynamic. Now I get a spam maybe once a month even though it gets hit at least 100 times a day. I know there aren’t that many people dying to put their link on my site.
Sami
23 Nov at 3:38 am

Hi! I just downloaded BellaBook to my website and it was really easy to set up. Thank you!
Mumblies
23 Nov at 7:18 am

Yay!! congrats on getting your net back Jem :o)
Dave
23 Nov at 8:28 am

The way I built mine was to add a scoring system. Xmas alone wouldn’t be enough to block a comment – it would need to have “buy online” and any URL too. That mail server PHP script is a pretty good idea! I might implement that on my site. The worry, though, is that made-up domain names would probably time out, so a very short timeout would be needed (30 seconds – mentioned in the article – sounds far too long. I’d be happy to go with 5 and take it from there). It’s probably a good idea to use it as a last resort though – Dee mentioned the issues with it, but I’d maybe only use it on a comment that had no points, because the only spam comments that seem to get through for me now aren’t even words and this check would block them.
Jem (Post author)
23 Nov at 8:32 am

“What does “FFS” mean? (“words like Xmas that I had to blacklist FFS”, you said)”
http://www.auditmypc.com/acronym/FFS.asp Number 3 ;) I do try and avoid swearing in my blog to stop the site being banned by school filters/etc.
Jem (Post author)
23 Nov at 9:12 am

“the first point of call I’d recommend is to make sure that your forms are actually being submitted from your own page”
I was doing this by checking that the referrer matched my domain – the problem is, people with strict firewalls/certain browsers etc block $_SERVER['HTTP_REFERER'] and they were being rejected. I was too lazy to find a better way to do this. :P
Nick
23 Nov at 11:05 am

I really wouldn’t recommend using CAPTCHA at all – for a start it’s a massive accessibility issue: people with bad eyesight/colorblindness or who are actually blind won’t be able to use them at all. Plus I don’t think they are that effective; OCR is getting better so it would only be a short-term solution anyway. A scoring system and using black/white lists would be much more effective – SpamAssassin is definitely a good model to work from, and I think it would be quite easy to implement something basic in PHP – especially as SpamAssassin is written in Perl. If you download the most recent version of SpamAssassin and take a look in the ‘rules’ folder it will contain all the different regular expressions for testing. You could just use the ones that were worthwhile, although I wouldn’t use too many if you write it in PHP – it might be better to just cook something up in Perl. As Dee said above it can be worthwhile checking the user agent too – a surprising number of spammers don’t even bother trying to disguise it so you will get things like ‘libwww-perl’ being sent through. It’s worthwhile adding into the scoring system something regarding where their IP address is from (use geolocate or something) and performing more stringent checks on certain locations – I used to get an almost exclusive amount from Poland :)
Han
23 Nov at 11:18 am

I made my comments member-only for a while to ward off the spammers. I also use Akismet and Spam Karma 2; they’re all virtually gone now!
Jem (Post author)
23 Nov at 11:21 am

Hi Nick, thanks for your comment. :) As I said, I have no intentions of implementing a captcha. Beyond the obvious problems they cause for visually-impaired users they can also cause massive problems for dyslexics (and myself, hah!) I do already have a rant on CAPTCHAs in my scribblings section so I’m not about to go back on what I’ve said before! Already have a blacklist system set up but that’s what’s causing the problems ..interesting idea taking apart spamassassin though (I was just going to write my own and do it the hard way). I’d also not thought about looking at IP address locations (although 99% of my spam is added through a proxy that cycles through hundreds of different IPs so I’m not sure if it’d work). Hmm, anyway, you’ve given me plenty to think about – thanks!
Xeronia
23 Nov at 2:12 pm

Nice! Maybe you could blacklist words in the URLs too? Or groups of words. If I give examples, this comment will be filtered out.
Jem (Post author)
23 Nov at 2:37 pm

“Maybe you could blacklist words in the URLs too?”
I do compare the blacklist against URLs :/ The problem is spammers are leaving normal comments like “hi nice site” with ‘normal’ URLs that forward to pr0n/etc. Or they leave completely pointless comments with no URLs.. which I just don’t understand.
Loadx
23 Nov at 4:56 pm

You are never going to find a real method to stop those spammers using the genuine-looking URLs which forward to pr0n unless you actually open the URL and buffer up some text looking at the key words… which is utterly ridiculous and would cause insane amounts of load on your server. CAPTCHA sucks; as mentioned above, OCR is becoming better and better at recognising the characters etc. The only foolproof method is to ban all comments from being auto-posted and regulate the heap yourself. However, a good alternative is what you talk of: using a points system for each user and rating their trustworthiness, but I think it causes some serious bloat on the database. fsockopen() on email addresses, as described in that article above, will fail horribly because all you are doing is checking the existence of the domain, not that the actual address is valid on that domain. You don’t need to write your own SpamAssassin, that’s crazy; merely pipe the text from PHP into Perl and get the response back with passthru(), popen() etc. Also look at those blacklist websites which provide a neat database of all the blacklisted mail accounts etc and use that for comparison as well. Good luck.
Jem (Post author)
23 Nov at 5:01 pm

“You don’t need to write your own spam assassin thats crazy”
Hah, no, I know! I wasn’t actually planning on writing my own version of SpamAssassin (perhaps my comment wasn’t very clear..) I just meant I was going to write a few of my own functions for checking spam without actually LOOKING at SpamAssassin. About the points thing: I wasn’t thinking of storing points on a per-user basis or anything like that, rather just assign points to blacklisted words based on their severity (as well as other checks) and then just send the comment to moderation (for medium scorers) or auto-approve it for known-users (like now)/low scorers. High scorers would automatically have their comment deleted.
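As a minimal sketch of that kind of scoring, here it is in Python rather than PHP, purely for illustration. The word weights, the per-URL score and both thresholds are invented for the example, and letting known users bypass the score entirely is an assumption about what "auto-approve" would mean.

```python
import re

# Hypothetical weights and thresholds; real values would need tuning.
WORD_SCORES = {"buy online": 3, "pills": 3, "replica": 2, "xmas": 1}
URL_SCORE = 1      # points added per URL found in the comment
MODERATE_AT = 3    # this score or higher goes to moderation
DELETE_AT = 6      # this score or higher is deleted outright
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def score_comment(text):
    """Add up points for blacklisted phrases and for each URL."""
    lowered = text.lower()
    score = sum(points for phrase, points in WORD_SCORES.items() if phrase in lowered)
    score += URL_SCORE * len(URL_PATTERN.findall(text))
    return score


def classify_comment(text, known_user=False):
    """Return 'approve', 'moderate' or 'delete' for a comment."""
    if known_user:
        # Assumption: known users skip scoring altogether.
        return "approve"
    score = score_comment(text)
    if score >= DELETE_AT:
        return "delete"
    if score >= MODERATE_AT:
        return "moderate"
    return "approve"
```

With these example weights a lone mild word such as "Xmas" is approved, while the same word next to "buy online" and a link lands in moderation.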
Nick Barrett
23 Nov at 8:43 pm

Wahey! Internet! Akismet is good for spam, don’t use CAPTCHA, it’s too annoying for real users…
Sarah
24 Nov at 12:36 am

How many knees would a Negro grow if a Negro could grow knees? *feigns being a realistic comment*
Dee
24 Nov at 12:57 am

Just to hijack these comments further (hu-haha), all those people arguing that “OCR defeats CAPTCHAs” are essentially making a sunroof-arrow argument. Yes, OCR -can- defeat CAPTCHAs in the same way that you -can- kill someone by shooting an arrow in through their sunroof, but the vast majority (in my experience, all) of spam that targets -personal weblogs- doesn’t use OCR. Saying that just because you -can- defeat something means you shouldn’t use it at all is kinda like saying just because you -can- break a lock, locks are useless and you shouldn’t put any on your doors. It’s a dumb argument and literally the -only- place it appears is in computer security. - D. (who deals with this stuff all day long at work)
Loadx
24 Nov at 4:23 am

Knowing both the pros and cons of a particular method helps to make a better choice. OCR defeating a CAPTCHA is a very real issue; give it more time and OCR will defeat it entirely. Going back to your lock example, if you had the choice of a door lock or a security keypad, knowing simply that the lock has fewer combinations to pick than the keypad.. which would you choose to protect yourself?
Mike Haddad
24 Nov at 6:28 am

I love you more than life. ♥ How was that?
Mumblies
24 Nov at 7:26 am

I am so proud of you :o) how’s that?
Jem (Post author)
24 Nov at 8:51 am

“Going back to your lock example, if you had the choice of a door lock or a security keypad, knowing simply the lock has less combinations to pick rather than the keypad.. which would you choose to protect yourself?”
How about both.. because surely two levels of security is better than one?
Amelie
24 Nov at 9:14 am

Blah blah here’s a comment for Jem to test her new commenty stuffs flurbywoobledooble :P …er. Buy pills. And view some dodgy porn. And don’t forget to buy my replica watches!1111
Amelie
24 Nov at 9:41 am

Testing again :P
Mumblies
26 Nov at 2:50 pm

Hey Jem - you forgot about the “big gay gays” lol
Jem (Post author)
26 Nov at 10:36 pm

..and of course, the big gay gays!