Anti-SPAM plugin (code name "Willow")

21st Feb 2006 stk
Thoughts[spamstats]

¥åßßå and I have been discussing b2evo's poor handling of comment/trackback/referrer SPAM. The 'slam-the-door-in-your-face' blacklist isn't ideal. Black-lists, white-lists ... it's all a gray area and SPAM is something that makes people see red :>.

What color is your anti-spam parachute?

The goal here would be to come up with a much better way of handling SPAM. We've tossed around some ideas and some of them are covered here. I think we should ADD to this, as we move forward, because the concept of improved SPAM handling is universally appealing and would rocket AM! to the forefront of a "must-visit" site, if we were to successfully come up with a plug-in/hack that provided hassle-free, superior anti-SPAM measures.

So ... what would make a system "superior"? We've identified the following list (to which we invite additions/subtractions/discussion):

A Superior Anti-SPAM System

  • It must be 'visitor-friendly'. (i.e., rules out bobo, moderation, disallowing links (<a> tag) and captcha methods ... as all of these either make visitors jump through hoops or restrain them in some fashion.)
  • It must consider and weigh a variety of parameters, in an effort to determine if a comment/trackback or referrer is "spammy". (i.e., we're thinking of a predictive, stochastic percentage that indicates the probability of something being 'spammy'). A list of those criteria follows, later on.
  • It must be easy to administer and must not prohibit admins from posting whatever they want, however spammy it may appear.
  • It must be adaptable (i.e., as SPAM tactics and methodologies change, new parameters need to be added in an easy fashion.)
  • It must LEARN and grow stronger as more SPAM attempts are made and SHARE this learned information within a "trusted" network of like installations.
  • It shouldn't drain server resources (CPU time and table space), so as not to become too burdensome for the blog engine.
  • Probability cutoffs for the handling of SPAM should be user-set (e.g., under 45% probability ... gets posted; 45%-80% ... gets set aside for moderation (and then LEARNS from the admin decision, so that the future %-range for moderation becomes narrower); and over 80% probability ... doesn't get posted (and admin can review to see if the system accidentally mis-categorized a non-SPAM post/comment/trackback)).
  • It should INVITE SPAM ... (the more SPAM received, the stronger the defenses), and limit energy and bandwidth used by SPAM messages (for SPAM, just say "Thanks for your comment", rather than a 403 error), let SPAMMERS think they've succeeded. Bend, but do not BREAK. (Hence ... "Willow" as a code name.)
  • It should log and categorize SPAM attempts, both to confirm SPAM blockage and volume, but also for admins to analyze and tweak the system to make it better.

These are just some of the ideas that need to be built into the system.
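The user-set cutoffs idea can be sketched roughly like this. It's only an illustration in Python (the real thing would live in b2evo's PHP), and every name here (Cutoffs, triage, learn, the rate parameter) is made up for the example. The 45%/80% defaults are just the example figures from the list above, and the "learning" step is one guess at how an admin decision on a moderated comment could narrow the moderation band.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    # Hypothetical defaults taken from the example figures in the list above.
    publish_below: float = 0.45
    reject_above: float = 0.80


def triage(spam_probability: float, cutoffs: Cutoffs = Cutoffs(),
           is_admin: bool = False) -> str:
    """Decide what happens to a comment given its spam probability (0..1)."""
    if is_admin:
        # Admins are never hampered, however spammy the post looks.
        return "publish"
    if spam_probability < cutoffs.publish_below:
        return "publish"
    if spam_probability > cutoffs.reject_above:
        # Not shown to visitors, but kept so the admin can review mistakes.
        return "hold_as_spam"
    return "moderate"


def learn(cutoffs: Cutoffs, spam_probability: float, admin_says_spam: bool,
          rate: float = 0.1) -> Cutoffs:
    """Narrow the moderation band after an admin rules on a moderated comment.

    Assumption: a confirmed spam pulls the reject cutoff down towards the
    comment's score; an approved comment pushes the publish cutoff up.
    """
    if admin_says_spam:
        moved = cutoffs.reject_above - rate * (cutoffs.reject_above - spam_probability)
        return Cutoffs(cutoffs.publish_below, max(moved, cutoffs.publish_below))
    moved = cutoffs.publish_below + rate * (spam_probability - cutoffs.publish_below)
    return Cutoffs(min(moved, cutoffs.reject_above), cutoffs.reject_above)
```

A comment scoring 0.6 would go to moderation; if the admin approves it, the publish cutoff creeps up a little, so the next borderline comment is slightly more likely to be posted straight away.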

SPAM Factors

Below are things that make a site/comment/post and/or trackback more likely to be SPAMMY and situations in which they AREN'T SPAMMY. For every thing that LOOKS like SPAM, there's generally a situation in which it ISN'T SPAM.

  1. Content: When the post or comment contains words and/or phrases like "penis, incest, 'nice site', cialis, viagra, etc." this makes it more likely to be SPAM. (However, a visitor or poster might just be writing about how he's tired of seeing "cialis" SPAM messages all the time).
  2. Same Email Address - Different Name: Lots of SPAMMERS re-hash a fake email address and this might indicate that a message is more likely to be "spammy". (Or ... stk might be leaving you two comments, but use "Scott" one time and "stk" another)
  3. Invalid Email Addy: Lots of Spammers use made-up email addresses. If we can easily test for this, then it would help identify a comment/trackback as 'spammy'. (However, not 100%, because maybe a visitor is afraid to leave his REAL address and leaves one like 'John-nospam@hisRealSite.com' or a made up one instead)
  4. Repeated Comment URL/IP: If you're being hammered with SPAM, it's likely that the same URL will be repeated in quick succession. Need to look at time intervals and, if this is happening, there's a greater likelihood that the comment/trackback is spammy. (Of course, you might also have an avid fan who likes to leave lots of comments, too, so it's not 100% and need to take into account X comments over Y time)
  5. URLs known to be spammy: This is where the blacklist will come in handy, BUT not a 100% sure it's spam kind of blockage. (What about mis-reported blacklisted sites? Or a URL that has been re-assigned to someone else? PLUS ... blacklist only contains KNOWN spam sites ... need something more proactive that can determine spamminess of some site it's never seen.)
  6. Words in URLs: Certainly, URLs that contain certain words provide some measure of the spamminess of the site. (But it's not 100% as some people just pick crappy URLs for their site)
  7. Changing IP Address: In a recent hack, by ¥åßßå, he's implemented an IP-dependent HASH for each of the field names in the comment_post.php and feedback.php files. IF a spammer loads the comment form with one IP, but tries to post with a proxied IP ... they will be unsuccessful, because the field names won't match. (very tricky guy, our ¥åßßå ;)) (But, it's possible that a dial-up connected poster loses his connection and returns to post with a different IP assigned)
  8. Using old file/folder names: If one changes the htsrv folder name to something else and/or the comment_post.php filename to something else, and a spammer comes along, trying to use /htsrv/comment_post.php ... then there's a really strong likelihood that it's SPAM. (Can't really think of a situation where it isn't)
  9. Wrong post ID: If they're posting a comment/trackback for a different post ID than they searched for, browsed to, etc., then the comment/trackback is likely to be SPAM. (likewise, I can't think of a situation where it's not).
  10. Links in Content: When there are multiple links in a comment (<a> tag allowed), this increases the likelihood that the message is SPAM. (Of course, you might have a determined poster who's citing his argument with lots of juicy links, too).
  11. Offsite call of comment_post.php: If the referrer isn't your own site, but rather, some other location, then someone is trying to remotely comment and exploit the comment_post.php file. (However, lots of visitors don't send referrer information AND referrers CAN be spoofed ...)
  12. Admin post/comment may look SPAMMY as hell: Regardless of how spammy an admin post/comment is, b/c they OWN the blog ... they should not be hampered by an anti-spam system. Likewise, priority should be given to members, but when running a "get a free blog" system, there needs to be admin moderation for member posts above a certain %-spammy threshold.
  13. White URL/IP Check: A comment/trackback is much less likely to be SPAM if the url/ip of the visitor has been accepted by the admin in previous comment/trackback. (Of course, the SPAMMER may have recently sent a flood, one or more of which got by the SPAM check, so the check would need to include an "older-than" clause ... like older than a week or something)
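One way the factors above could be weighed into a single %-probability is to add up evidence in log-odds and squash the total back to 0..1. This is a sketch only: the factor names and every weight below are invented for illustration (nothing has been measured yet), and a logistic sum is just one reasonable choice of "predictive" formula, not something we've settled on.

```python
import math

# Hypothetical weights (log-odds). Positive = more spammy, negative = less.
# The comments point back to the numbered factors in the list above.
FACTOR_WEIGHTS = {
    "spam_words_in_content": 1.5,       # factor 1
    "same_email_different_name": 0.5,   # factor 2
    "invalid_email": 0.5,               # factor 3
    "repeated_url_or_ip": 1.0,          # factor 4
    "blacklisted_url": 2.0,             # factor 5
    "spam_words_in_url": 1.0,           # factor 6
    "ip_changed_since_form": 1.5,       # factor 7
    "old_handler_path": 4.0,            # factor 8
    "wrong_post_id": 3.0,               # factor 9
    "many_links": 1.0,                  # factor 10
    "offsite_referrer": 0.5,            # factor 11
    "previously_approved_sender": -2.0, # factor 13
}


def spam_probability(triggered: list[str], is_admin: bool = False) -> float:
    """Combine the triggered factors into a probability between 0 and 1."""
    if is_admin:
        # Factor 12: the blog owner is never scored.
        return 0.0
    log_odds = sum(FACTOR_WEIGHTS[name] for name in triggered)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With nothing triggered the result is 0.5 (no opinion either way). No single soft factor settles the matter, but the "can't think of a situation where it isn't" factors carry enough weight to push a comment over most cutoffs on their own. Adding a new factor is just a new entry in the table, which covers the "adaptable" requirement.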

These "ideal system" characteristics and "is this SPAM or not" %-probability assignations represent WHERE WE ARE TODAY. If you can add more ... PLEASE DO. I will update/refine this list so that when we end up making this plugin-hack ... it will ROCK the SPAMMY world. :D

Page 2 updated 17th March

Tests

Ok, this is a list of tests that are running; as more tests are coded up I'll add them here. Feel free to add to this list if you start any tests.

IP/post ID key
Ok, this is a really simple test: when the comment form is called, a hidden input is added that contains a key based on the IP and post ID. When a comment is posted, the key is checked, with 4 potential results:
  1. No key - this pretty much means that we have a spammer who's using crap software
  2. Wrong IP - Although this has a high probability of being a spammer, it could just be a dialup user who lost their connection.
  3. Wrong ID - Again, this pretty much means we've got a spammer.
  4. Invalid key - This probably means we have a spammer who's trying to guess how the key works!
By storing all of the keys as they are issued (along with other data such as post id, ip etc) the original IP will also gain a "spam karma", this karma would be used as part of the final figure.

Reset button
This is another simple test: it just checks for param( 'reset' ); if it exists, we've got a spammer.

Commented input box
A variation of the key: the input box is commented out in the HTML, and this checks whether the value is sent. If it is, it pretty much means spam software.

CSS
I haven't implemented this one yet, but basically I'm going to use CSS to "remove" the 'name' input box from the form (either with display:none, by positioning it a couple of million miles off the page, or by z-indexing it below content) and replace it with another box (for the normal user). If $name is set, then this is a spammer.
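The reset, commented-input and CSS tests all boil down to the same thing: a field comes back that a real browser would never send (or never fill in). A minimal sketch, in Python for illustration, assuming the posted form arrives as a plain dict; the field name "commented_key" and the labels are made up for the example.

```python
def tripped_traps(form: dict[str, str]) -> list[str]:
    """Return the trap tests that a posted form has tripped."""
    tripped = []
    if "reset" in form:
        # Browsers don't submit the reset button; form-parsing bots do.
        tripped.append("reset_sent")
    if "commented_key" in form:
        # The input only exists inside an HTML comment, so browsers ignore it.
        tripped.append("commented_input_sent")
    if form.get("name", "").strip():
        # 'name' is hidden with CSS; real visitors fill in the replacement box.
        tripped.append("css_hidden_name_filled")
    return tripped
```

Each tripped trap would feed into the overall spam figure rather than decide the outcome on its own.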

Expanding the blacklist concept
Well, I've finally totally emptied my blacklist :O, what I have in place is an expansion of its methodology. Instead of having a true/false blacklist I now have a list of all the ip's, url's and common spam comment content. When a comment is received (along with some of the checks above) it is checked against these lists. If it's found in any of the lists then it is marked as "spam" and put in a little table all of its own. During the shakedown period I'm also "moderating" any comments that get through these checks (so far none have ..... but I've only had 3,000 spam to judge by). At the time of writing I have the following :-
1450 spammer IP's
30 spam base domains
112 common spam comments
When I get a chance I'll add a page to my blog so that you can see the relevant lists.
You can find the lists here: List of wankers. Please note, the "counts" aren't completely accurate as yet; they're often far lower than the reality.

More thoughts

I'm still working on all of this, and I still have a few little tricks up my sleeve to stop these wankers once and for all. Unfortunately most of this would be 100% easier in 1.8+ as it's far more geared to the sort of stuff that I'm currently implementing. An unfortunate side effect of running these tests is that I no longer report anything to the evo blacklist, however, it'd be very easy to implement reporting for new spam base urls automatically as they are added to my own lists.

At the moment I still create new entries on my lists manually, mainly because I've been concentrating on methods and testing, but the next stage is to automatically create new entries for any comment that fails ## number of tests. In this way the anti-spam measures will "learn" from the spammers themselves. As a measure of their effectiveness, the wanker who started spamming everybody with the bogus google links was stopped by the fact that he used "known" spam content and also used a few known IP's.
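The automatic "learning" step might look something like the sketch below (Python for illustration only). FAIL_LIMIT is a made-up stand-in for the still-undecided "##" figure, the three counters stand in for my IP, domain and content tables, and using the URL's hostname as the "base domain" is a simplification.

```python
from collections import Counter
from urllib.parse import urlparse

FAIL_LIMIT = 3  # hypothetical stand-in for the "##" number of failed tests

spam_ips: Counter = Counter()
spam_domains: Counter = Counter()
spam_content: Counter = Counter()


def learn_from(ip: str, url: str, text: str, failed_tests: list[str]) -> bool:
    """Add a comment's details to the lists if it failed enough tests."""
    if len(failed_tests) < FAIL_LIMIT:
        return False
    spam_ips[ip] += 1
    host = urlparse(url).hostname  # None when the URL is empty or malformed
    if host:
        spam_domains[host] += 1
    spam_content[text.strip().lower()] += 1
    return True
```

The counters double as the "counts" shown on the lists page, and a newly added domain would be the natural point to hook in automatic reporting to the evo blacklist.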

I'm also trying to work out a method whereby the lists could be shared amongst "trusted" blogs so that the net is cast wider, which would help trap new spam before it gets a chance to pound every blog it finds. Basically, if one blog traps a bit of spam that exceeds ## threshold then it would inform the other blogs, in the "trusted" list, of the details of the new spammer (ip's, url's, common content etc). Then, if the same spammer hits any of the other blogs they're already geared up to stop them.

There's, obviously, still a shedload of work to do, and, as I mentioned earlier, I have a few other tests that I want to run as well. The more ways a spammer can "trip up" the better. My eventual goal is to have no antispam blacklist and no moderated comments.

 
 
 
 

Comments

Anonymous
22nd Feb 2006
I've been using EdB's rechecker with Isaac's script that updates the blacklist and rechecks, called via a cron job. It has made my life a lot easier. I don't get nearly as much spam, and what makes it through doesn't last long. Even if I'm away from the web or just forget to despam for a week, new spam on the blog gets deleted automatically shortly after it's added to the central blacklist. I know it's still reactive, but it doesn't have to be me reacting. I also know that there's a slim chance someone could add a keyword to the blacklist that matches one of my commenters and wipe all of their comments out in one fell swoop. A db backup would be my only hope of fixing that. But the chance is so remote and the convenience is so great that it's worth the risk to me.

The problem might be that most folks don't know how to set up a cron job. We could create a plugin event that uses the /htsrv/call_plugin.php file to trigger an update and recheck. Then users could register with us to have our server go hit their site and force an update. We could manage the cronjob for them. There could also be an option for them to do it themselves. The plugin would just spit out the url in the backoffice that they could copy and paste.
 
Anonymous
22nd Feb 2006
Had another thought, so added it. (Checking the IP/URL of the visitor to see if it's already been allowed by the Admin as a comment or trackback.) This would decrease the probability it's SPAM, BUT the check would need to include an "older than a week" clause, or some time factor, because you wouldn't want to include very recent comments/trackbacks.

In fact ... just the opposite. If the IP or URL is repeated a lot recently (like a flood) then each new one would INCREASE the probability it's SPAM, not the other way around.
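To make the time factor concrete, here's a small sketch in Python (for illustration only). The one-hour flood window, the count of three, the one-week trust age and the size of the adjustments are all placeholder numbers, and the function name is made up.

```python
from datetime import datetime, timedelta


def history_adjustment(sightings: list[tuple[datetime, bool]], now: datetime,
                       flood_window: timedelta = timedelta(hours=1),
                       flood_count: int = 3,
                       trust_age: timedelta = timedelta(days=7)) -> float:
    """Return how much an IP/URL's history should shift the spam probability.

    `sightings` holds (when, approved_by_admin) pairs for earlier comments
    or trackbacks from the same IP or URL.
    """
    recent = sum(1 for when, _approved in sightings if now - when <= flood_window)
    if recent >= flood_count:
        # A flood: each repeat makes spam MORE likely, capped at +0.5.
        return min(0.5, 0.1 * recent)
    trusted = any(approved and now - when >= trust_age
                  for when, approved in sightings)
    # Only approvals older than the trust age count in the sender's favour.
    return -0.2 if trusted else 0.0
```

The flood check runs first on purpose, so an old approved comment can't buy a free pass for a burst of new ones.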
 
Anonymous
23rd Feb 2006
Based on the stats I've managed to gather from WafflesOn so far, just adding a key check to the comment form would have stopped 100% of the spam :O However, as you know, I'm all for the multiple analysis methods that you've written about. No one method should decide if a comment is spam or not, rather it should be decided by a group of tests (that can be added to/modified).

Based on the latest changes in CVS it would appear that this will be possible with "mostly" a plugin in phoenix beta, they've added a bunch of new hooks and look like they've started implementing the "comment karma" idea, which is sort of what this is. Once I've coded up a plugin that can do some tests I'm going to modify the antispam yes/no check to a "percentage chance" check. I'll then run the system for a while and see how much spam it would have stopped.

With a tad of planning you can also make the systems "learn" from the past, especially from any admin override/confirmation action such as allowing a comment that's above the threshold or deleting a comment that was below the threshold. Again this is made more possible due to the changes they are (still) making.

The main problems that you have with any systems that fight spam are the fact that the moment you publish your methods the spammers can then work at defeating them, and the resources that fighting spam take up.

The ideal goal is to make a spammer have to jump through so many hoops that in the end they trip over whilst not making the innocent jump through a single hoop, or at least, as few as possible.

You don't actually need a whitelist, as the list is sort of a grey one anyway. For example, your IP whitelist would actually be part of an IP list that has a mark for every IP it knows. If an IP is used for the first time you would create a new entry with a default value (say 50%) which gets taken into account when deciding the spam factor. If the admin then allows/disallows a comment from that ip then its value would be amended accordingly. Eventually a "whitelist IP" would end up having a "0% chance of spam" figure.
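The grey list could be as simple as the sketch below (Python for illustration). The 50% default is the figure mentioned above; the function names and the halfway-to-target update rule are assumptions, just one easy way of "amending accordingly".

```python
DEFAULT_SCORE = 0.5  # a new IP starts at "50% chance of spam"

ip_scores: dict[str, float] = {}


def score_for(ip: str) -> float:
    """Look up an IP's spam figure, creating a default entry on first sight."""
    return ip_scores.setdefault(ip, DEFAULT_SCORE)


def record_admin_decision(ip: str, is_spam: bool, rate: float = 0.5) -> float:
    """Move the IP's figure part of the way towards 100% or 0%."""
    target = 1.0 if is_spam else 0.0
    current = score_for(ip)
    ip_scores[ip] = current + rate * (target - current)
    return ip_scores[ip]
```

An IP whose comments keep getting allowed goes 50% → 25% → 12.5% and so on, drifting towards the "0% chance of spam" of a whitelist entry without ever needing a separate list.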

I also agree that you shouldn't give spammers a 403 error, all that does is tell them that they need to change their tactics, it's far better to just give them a plain old "cheers for that comment", pretty much uses the same bandwidth. Whilst this means they'll send you a few more comments, that just gives you more data to analyse so that you can refine your methods.

I'll add a page to the post so that we can share any stats that we collect.

¥
 
Anonymous
1st Mar 2006
Hey there all and everyone also. My 'bobo box' hack was pretty damned effective. First, it was entirely text based. If you could see and fill in the 'comment' field you could do the same with the bobo box. Next, it was auto-filled if you ate the commenter cookie. That meant you only had to type it once in your life unless you unchecked "save this crap". Finally, it took advantage of b2evo's "you missed something so click here and you'll go back with all your useless text intact" feature.

This doesn't mean it's the be-all end-all of antispam! It is supposed to mean that it was very user-friendly and very effective.

Bleh.

I still can't upload properly using cheap (okay FREE) effteepee software.

(Note to self: aggregate AM!)

Oh and for the record I am still getting spam comments after implementing a trick in .htaccess that supposedly means only those referred from my domain can get to the comments page. I think it helped, but it's also possible another trick helped as much. Anyway without an intelligent system in place spam will be forevermore.

So think different. Think about 7 or 8 or 12 completely different and totally simple antispam methods that don't need each other and can be part of one complete package. When method A fails the user says "fine I'll use method D instead" and on and on and on.

Bleh with a capital Bl and an emphasised eh.
 
Anonymous
2nd Mar 2006
I forgot about your bobo box, it's one of those "so bloody simple" methods that works :D

I actually had an error in my testing logic (yeah, yeah Scott, feel free to take the piss ;) ), the spammers do actually call the form for each of the IP's that they use.

Round about the same time, I also came up with a blinding idea of seeing if $submit was sent, which would mean that the button had actually been pressed....... that one failed as well, there was always a $submit

So I wandered off back to the drawing board ....... had a few beers ....... and a smoke .......anyway, when I got up the next day I felt much better and put some more thought into shit.

I'd already come to the conclusion that this spam is automated, yet it picked up the key and 'appeared' to press the submit button ..... which kind of left only one conclusion, they are using software that parses the form and collects ALL inputs on the form (ohhhh, I should test if "reset" is sent!!!!) ...... I'll let you know how that one goes soon ....... anyway, back to the point...... the software collects all of the inputs ( will let you know if it collects "reset" as well ), fills in the ones it knows and just sends the rest back with whatever value they have (hence I was getting my hidden key back and $submit) .....so ...... I wondered what would happen if I commented the key out in html .......BINGO :D, a browser (tested in IE & FF so far) rightfully ignores the input as it's commented text and DOESN'T return it as a form value BUT THE SPAMMING SOFTWARE DOES .... well, so far 100% of the spammers that hit my site do.

Anyway, I'm gonna wander off and add the reset thing before I forget and then I'll come and comment on the rest ;)

..... ok, be interesting to see the results of that one ;)

FTP ... I'm assuming you're still trying with firezilla? I think it's a problem with the latest version (Danny and Scott can use it but you and I can't), the one I use is LeapFtp, it's another free one, I'll root down a link or upload the exe if you're interested? In the meantime you can use evos filemanager to upload stuff and leave a note and either Scott or myself can move it to the correct folder for you.

The beta version appears to be geared up for this multiple antispam plugin, it has hooks for both comments and normal posts that a plugin can react to, it'll be interesting to see all of the methods that people employ. The main goal is to have so many varied tests that it becomes a nightmare for the spammers to check for.

My next little sneaky test is to see if they call images and/or style sheets, I'm betting that they don't, in which case I'm going to add/remove boxes using css. Of course, they'll eventually twig what's happening, and start calling the css, but then they need to find a way of parsing it to find out what's there and what isn't, and that'll be fairly impossible to get right. ;)

Sheesh, I'm going to have to add more fields and tests to my anti-spam stats table.

¥
 
Anonymous
11th Mar 2006
I don't know how many of you guys are on the b2evo admin mailing list, but if you're not here's a bit of news summed up in my head. BigF wants to/is going to dump the antispam table thing in favor of a semi-logical analysis of referers to determine if it's spam or not. A series of questions with points awarded for each spammy type of behavior. Like being a .info site or being a blogspot blog or having certain keywords in the url - stuff like that. Anyway if the score is low enough you pass and if it's high enough you fail. Everything else is moderated, though I've no idea why one would moderate referer traffic since it doesn't go anywhere. Maybe comments, but damn then just moderate the freakin comments!

Erp.
 
Anonymous
12th Mar 2006
I'm assuming that lowly members such as myself aren't cool enough to be allowed on the admin list, but, from a few posts on the dev list and from working with the code, it looks like they're going to (finally) be implementing "spam karma" for posts and comments. They also have the necessary hooks for antispam plugins.

I actually binned my antispam blacklist months ago (well, it has about 20 entries but that's about it). Funny enough it's because I run a few of my own checks (which determine likelihood of spam) AND I moderate my comments :P If you read page two of this post it'll tell you some of the checks that I'm currently running / have run, I'll be adding some "results" to it soon. Plus a few more sneaky tricks I have up my sleeve to stop these wankers.

I'm still toying with the idea of replacing the "piss off you wanker" screen with a redirect to the FBI site instead (in fact, I'm still considering changing my htsrv url to http://www.ic3.gov/contact/ if I detect a spammer BEFORE they request the comment form) ...... though I doubt the spammers will appreciate it, but who gives a fuck about a spammer's feelings.

Anyway, I'll write up the results of my tests and do some sort of summary soon.

¥
 
 
