Thoughts[spamstats]
¥åßßå and I have been discussing b2evo's poor handling of comment/trackback/referrer SPAM. The 'slam-the-door-in-your-face' blacklist isn't ideal. Black-lists, white-lists ... it's all a gray area and SPAM is something that makes people see red :>.
What color is your anti-spam parachute?
The goal here would be to come up with a much better way of handling SPAM. We've tossed around some ideas and some of them are covered here. I think we should ADD to this, as we move forward, because the concept of improved SPAM handling is universally appealing and would rocket AM! to the forefront of a "must-visit" site, if we were to successfully come up with a plug-in/hack that provided hassle-free, superior anti-SPAM measures.
So ... what would make a system "superior"? We've identified the following list (to which we invite additions/subtractions/discussion):
A Superior Anti-SPAM System
- It must be 'visitor-friendly'. (i.e., rules out bobo, moderation, disallowing links (<a> tag) and CAPTCHA methods ... as all of these either make visitors jump through hoops or restrain them in some fashion.)
- It must consider and weigh a variety of parameters, in an effort to determine if a comment/trackback or referrer is "spammy". (i.e., we're thinking of a predictive, stochastic percentage that indicates the probability of something being 'spammy'). A list of those criteria follows, later on.
- It must be easy to administer and must not prohibit admins from posting whatever they want, however spammy it may appear.
- It must be adaptable (i.e., as SPAM tactics and methodologies change, new parameters need to be added in an easy fashion.)
- It must LEARN and grow stronger as more SPAM attempts are made and SHARE this learned information within a "trusted" network of like installations.
- It shouldn't drain server resources (CPU time and table space), so as not to become too burdensome for the blog engine.
- Probability cutoffs for the handling of SPAM should be user-set (i.e., 45% probability or less ... gets posted; between 46% and 80% ... gets set aside for moderation (and the system then LEARNS from the admin's decision, so that the future %-range for moderation becomes narrower); and 81% probability or more ... doesn't get posted (and the admin can review to see if the system accidentally mis-categorized a non-SPAM post/comment/trackback)).
- It should INVITE SPAM ... (the more SPAM received, the stronger the defenses), and limit the energy and bandwidth used by SPAM messages (for SPAM, just say "Thanks for your comment", rather than returning an HTTP error). Let SPAMMERS think they've succeeded. Bend, but do not BREAK. (Hence ... "Willow" as a code name.)
- It should log and categorize SPAM attempts, both to confirm SPAM blockage and volume, but also for admins to analyze and tweak the system to make it better.
These are just some of the ideas that need to be built into the system.
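To make the cutoff idea concrete, here's a minimal sketch of the routing step. It assumes some earlier step has already produced a spam probability between 0 and 1. The function name, the parameter names and the admin bypass flag are all made up for illustration; the 45% and 80% defaults are just the example figures from the list above and are meant to be user-set.

```python
def route_comment(spam_probability, publish_max=0.45, moderate_max=0.80, is_admin=False):
    """Decide what happens to a comment/trackback, given its spam probability.

    publish_max and moderate_max are the user-set cutoffs (hypothetical names).
    """
    if is_admin:
        # Admins may post whatever they want, however spammy it looks.
        return "publish"
    if spam_probability <= publish_max:
        return "publish"
    if spam_probability <= moderate_max:
        # Set aside for the admin; the decision can later be used to tune the cutoffs.
        return "moderate"
    # Not posted, but kept so the admin can review mis-categorized items.
    return "reject"


print(route_comment(0.30))
print(route_comment(0.60))
print(route_comment(0.95))
```

The "learning" part (narrowing the moderation band based on admin decisions) isn't shown here; it would adjust publish_max and moderate_max over time.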
SPAM Factors
Below are things that make a site/comment/post and/or trackback more likely to be SPAMMY and situations in which they AREN'T SPAMMY. For every thing that LOOKS like SPAM, there's generally a situation in which it ISN'T SPAM.
- Content: When the post or comment contains words and/or phrases like "penis, incest, 'nice site', cialis, viagra, etc." this makes it more likely to be SPAM. (However, a visitor or poster might just be writing about how he's tired of seeing "cialis" SPAM messages all the time).
- Same Email Address - Different Name: Lots of SPAMMERS re-hash a fake email address and this might indicate that a message is more likely to be "spammy". (Or ... stk might be leaving you two comments, but use "Scott" one time and "stk" another)
- Invalid Email Addy: Lots of Spammers use made-up email addresses. If we can easily test for this, then it would help identify a comment/trackback as 'spammy'. (However, not 100%, because maybe a visitor is afraid to leave his REAL address and leaves one like 'John-nospam@hisRealSite.com' or a made up one instead)
- Repeated Comment URL/IP: If you're being hammered with SPAM, it's likely that the same URL will be repeated in quick succession. We need to look at time intervals and, if this is happening, there's a greater likelihood that the comment/trackback is spammy. (Of course, you might also have an avid fan who likes to leave lots of comments, too, so it's not 100% and we need to take into account X comments over Y time)
- URLs known to be spammy: This is where the blacklist will come in handy, BUT not as a "100% sure it's spam" kind of blockage. (What about mis-reported blacklisted sites? Or a URL that has been re-assigned to someone else? PLUS ... the blacklist only contains KNOWN spam sites ... we need something more proactive that can determine the spamminess of some site it's never seen.)
- Words in URLs: Certainly, URLs that contain certain words provide some measure of the spamminess of the site. (But it's not 100% as some people just pick crappy URLs for their site)
- Changing IP Address: In a recent hack, by ¥åßßå, he's implemented an IP-dependent HASH for each of the field names in the comment_post.php and feedback.php files. IF a spammer loads the comment form with one IP, but tries to post with a proxied IP ... they will be unsuccessful, because the field names won't match. (very tricky guy, our ¥åßßå ;)) (But, it's possible that a dial-up connected poster loses his connection and returns to post with a different IP assigned)
- Using old file/folder names: If one changes the htsrv folder name to something else and/or the comment_post.php filename to something else, and a spammer comes along, trying to use /htsrv/comment_post.php ... then there's a really strong likelihood that it's SPAM. (Can't really think of a situation where it isn't)
- Wrong post ID: If they're posting a comment/trackback for a different post ID than they searched for, browsed to, etc., then the comment/trackback is likely to be SPAM. (Likewise, I can't think of a situation where it's not).
- Links in Content: When there are multiple links in a comment (<a> tag allowed), this increases the likelihood that the message is SPAM. (Of course, you might have a determined poster who's citing his argument with lots of juicy links, too).
- Offsite call of comment_post.php: If the referrer isn't your own site, but rather, some other location, then someone is trying to remotely comment and exploit the comment_post.php file. (However, lots of visitors don't send referrer information AND referrers CAN be spoofed ...)
- Admin post/comment may look SPAMMY as hell: Regardless of how spammy an admin post/comment is, b/c they OWN the blog ... they should not be hampered by an anti-spam system. Likewise, priority should be given to members, but when running a "get a free blog" system, there needs to be admin moderation for member posts above a certain %-spammy threshold.
- White URL/IP Check: A comment/trackback is much less likely to be SPAM if the URL/IP of the visitor has been accepted by the admin in a previous comment/trackback. (Of course, the SPAMMER may have recently sent a flood, one or more of which got by the SPAM check, so the check would need to include an "older-than" clause ... like older than a week or something)
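One way to turn the factors above into a single %-probability is sketched below. To be clear, this is just one possible approach and not something we've settled on: each triggered factor is treated as an independent piece of evidence and combined with a "noisy-OR" (1 minus the product of the individual not-spam chances). The factor names, the weights and the whitelist discount are invented numbers purely for illustration.

```python
# Hypothetical weights: the chance that a comment is SPAM if ONLY this factor fires.
FACTOR_WEIGHTS = {
    "spam_words_in_content": 0.40,
    "same_email_different_name": 0.15,
    "invalid_email": 0.20,
    "repeated_url_or_ip": 0.35,
    "blacklisted_url": 0.60,
    "spam_words_in_url": 0.30,
    "changed_ip": 0.50,
    "old_file_name": 0.90,
    "wrong_post_id": 0.90,
    "many_links": 0.25,
    "offsite_referrer": 0.30,
}

# Hypothetical: a URL/IP the admin accepted before (and long enough ago) halves the score.
WHITELIST_DISCOUNT = 0.5


def spam_probability(triggered, whitelisted=False):
    """Combine the triggered factors into one probability using a noisy-OR."""
    not_spam = 1.0
    for name in triggered:
        not_spam *= 1.0 - FACTOR_WEIGHTS[name]
    probability = 1.0 - not_spam
    if whitelisted:
        probability *= WHITELIST_DISCOUNT
    return probability


print(spam_probability(["invalid_email", "many_links"]))
print(spam_probability(["invalid_email", "many_links"], whitelisted=True))
```

Because no single weight is 1.0, no one factor can push a comment to 100%, which matches the "for every thing that LOOKS like SPAM there's a situation where it ISN'T" caveat. New factors can be added by adding a line to the table.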
These "ideal system" characteristics and "is this SPAM or not" %-probability assignations represent WHERE WE ARE TODAY. If you can add more ... PLEASE DO. I will update/refine this list so that when we end up making this plugin-hack ... it will ROCK the SPAMMY world. :D
Page 2 updated 17th March
Tests
Ok, this is a list of the tests that are running. As more tests are coded up I'll add them here. Feel free to add to this list if you start any tests.
- IP/post ID key
- Ok, this is a really simple test: when the comment form is called, a hidden input is added that contains a key based on the IP & post ID. When a comment is posted, the key is checked, and a failed check has 4 potential results:
- No key - this pretty much means that we have a spammer who's using crap software
- Wrong IP - Although this has a high probability of being a spammer, it could just be a dialup user who lost their connection.
- Wrong ID - Again, this pretty much means we've got a spammer.
- Invalid key - This probably means we have a spammer who's trying to guess how the key works!
By storing all of the keys as they are issued (along with other data such as post id, ip etc) the original IP will also gain a "spam karma", this karma would be used as part of the final figure.
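Here's a rough sketch of how the key test could work, written in Python rather than the PHP the real hack uses. Everything named here (SECRET, issued, karma, issue_key, check_key) is hypothetical, and using an HMAC for the key is my assumption; the actual key scheme isn't described above. The dictionaries stand in for database tables.

```python
import hashlib
import hmac

SECRET = b"replace-with-a-per-install-secret"  # assumption: a secret known only to the blog
issued = {}  # key -> (ip, post_id) recorded when the form was served
karma = {}   # original ip -> number of failed checks traced back to it


def issue_key(ip, post_id):
    """Called when the comment form is rendered; the value goes in a hidden input."""
    message = f"{ip}|{post_id}".encode()
    key = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    issued[key] = (ip, post_id)
    return key


def check_key(key, ip, post_id):
    """Return 'ok' or one of the four failure results from the list above."""
    if not key:
        return "no_key"
    if key not in issued:
        return "invalid_key"  # we never handed this key out, so someone is guessing
    issued_ip, issued_post = issued[key]
    if issued_post != post_id:
        result = "wrong_id"
    elif issued_ip != ip:
        result = "wrong_ip"   # could also be a dial-up user with a new IP
    else:
        return "ok"
    # The IP that originally fetched the form picks up some "spam karma".
    karma[issued_ip] = karma.get(issued_ip, 0) + 1
    return result


key = issue_key("203.0.113.5", 42)
print(check_key(key, "203.0.113.5", 42))
print(check_key(key, "198.51.100.9", 42))
```

Telling "wrong IP" apart from "wrong ID" only works because the issued keys are stored; with a key that is simply recomputed on submit, both would just look like a mismatch.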
- Reset button
- This is another simple test: it just checks for param( 'reset' ). If it exists, we've got a spammer.
- commented input box
- A variation of the key: the input box is commented out in the HTML, and this checks whether its value is sent. If it is, it pretty much means spam software.
- CSS
- I haven't implemented this one yet, but basically I'm going to use CSS to "remove" the 'name' input box from the form (either with display:none, by positioning it a couple of million miles off the page, or by z-indexing it below the content) and replace it with another box (for the normal user). If $name is set, then this is a spammer.
- Expanding the blacklist concept
- Well, I've finally totally emptied my blacklist :O, what I have in place is an expansion of its methodology. Instead of having a true/false blacklist I now have a list of all the IPs, URLs and common spam comment content. When a comment is received (along with some of the checks above) it is checked against these lists. If it's found in any of the lists then it is marked as "spam" and put in a little table all of its own. During the shakedown period I'm also "moderating" any comments that get through these checks (so far none have ..... but I've only had 3,000 spam to judge by). At the time of writing I have the following :-
1450 spammer IPs
30 spam base domains
112 common spam comments
When I get a chance I'll add a page to my blog so that you can see the relevant lists.
You can find the lists here: List of wankers. Please note, the "counts" aren't completely accurate as yet; they're often far lower than the reality.
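To tie the simpler tests together, here's a hedged sketch of the form traps (reset button, commented-out input, CSS-hidden name box) and the list lookups. It's Python for illustration only, not the actual b2evo code. The field name "commented_field", the function names and the shape of the lists dictionary are all assumptions of mine.

```python
from urllib.parse import urlparse


def trap_failures(form):
    """form is a dict of submitted field name -> value. Returns the traps that fired."""
    failures = []
    if "reset" in form:
        # Browsers never submit a reset button, so only a script would send it.
        failures.append("reset_button")
    if "commented_field" in form:
        # This input only exists inside an HTML comment; browsers never see it.
        failures.append("commented_input")
    if form.get("name"):
        # The 'name' box is hidden with CSS; real visitors fill in the replacement box.
        failures.append("css_hidden_name")
    return failures


def list_hits(ip, url, content, lists):
    """Check a comment against the IP, base-domain and common-content lists."""
    hits = []
    if ip in lists["ips"]:
        hits.append("ip")
    host = urlparse(url).hostname or ""
    if any(host == d or host.endswith("." + d) for d in lists["domains"]):
        hits.append("domain")
    text = content.lower()
    if any(phrase in text for phrase in lists["phrases"]):
        hits.append("content")
    return hits


lists = {"ips": {"203.0.113.5"}, "domains": {"cheap-pills.example"}, "phrases": {"nice site"}}
print(trap_failures({"reset": "Reset", "name": "bot"}))
print(list_hits("203.0.113.5", "http://www.cheap-pills.example/buy", "Nice site!", lists))
```

Returning the names of the failed tests, rather than a plain true/false, means the results can be counted or weighted later instead of slamming the door on the first hit.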
More thoughts
I'm still working on all of this, and I still have a few little tricks up my sleeve to stop these wankers once and for all. Unfortunately most of this would be 100% easier in 1.8+ as it's far more geared to the sort of stuff that I'm currently implementing. An unfortunate side effect of running these tests is that I no longer report anything to the evo blacklist, however, it'd be very easy to implement reporting for new spam base urls automatically as they are added to my own lists.
At the moment I still create new entries on my lists manually, mainly because I've been concentrating on methods and testing, but the next stage is to automatically create new entries for any comment that fails ## number of tests. In this way the anti-spam measures will "learn" from the spammers themselves. As a measure of their effectiveness, the wanker who started spamming everybody with the bogus google links was stopped by the fact that he used "known" spam content and also used a few known IPs.
I'm also trying to work out a method whereby the lists could be shared amongst "trusted" blogs so that the net is cast wider, which would help trap new spam before it gets a chance to pound every blog it finds. Basically, if one blog traps a bit of spam that exceeds ## threshold then it would inform the other blogs, in the "trusted" list, of the details of the new spammer (IPs, URLs, common content etc). Then, if the same spammer hits any of the other blogs they're already geared up to stop them.
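The "learning" step might look something like this sketch. The function name, the comment dictionary keys and the lists structure are hypothetical, and the number of failed tests needed is deliberately left as a required parameter (min_failures) because that figure hasn't been decided.

```python
from urllib.parse import urlparse


def learn_from_failures(comment, failed_tests, lists, min_failures):
    """Add the comment's IP, base domain and content to the lists if enough tests failed.

    comment is a dict with 'ip', 'url' and 'content' keys (hypothetical shape).
    Returns True if new entries were recorded.
    """
    if len(failed_tests) < min_failures:
        return False
    lists["ips"].add(comment["ip"])
    host = urlparse(comment["url"]).hostname
    if host:
        lists["domains"].add(host)
    lists["phrases"].add(comment["content"].strip().lower())
    return True


lists = {"ips": set(), "domains": set(), "phrases": set()}
spam = {"ip": "203.0.113.5", "url": "http://cheap-pills.example/buy", "content": " Nice site "}
# min_failures=2 is only an example value for this demo.
print(learn_from_failures(spam, ["no_key", "reset_button"], lists, min_failures=2))
print(sorted(lists["domains"]))
```

Storing whole comment bodies as "phrases" is the crudest option; trimming them down to the common spammy part would keep the content list shorter.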
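A very rough sketch of the sharing idea follows: one blog builds a report, and a receiving blog only merges it if the sender is on its "trusted" list. The JSON layout and the function names are invented for illustration, no actual network transport is shown, and there's no signing here, so a real version would need some way to prove the sender is who it claims to be.

```python
import json


def build_report(sender, ips, domains, phrases):
    """Serialize the details of a newly trapped spammer for the trusted blogs."""
    return json.dumps({
        "sender": sender,
        "ips": sorted(ips),
        "domains": sorted(domains),
        "phrases": sorted(phrases),
    })


def accept_report(raw, trusted, lists):
    """Merge a received report into the local lists, but only from a trusted sender."""
    report = json.loads(raw)
    if report.get("sender") not in trusted:
        return False
    lists["ips"].update(report.get("ips", []))
    lists["domains"].update(report.get("domains", []))
    lists["phrases"].update(report.get("phrases", []))
    return True


lists = {"ips": set(), "domains": set(), "phrases": set()}
raw = build_report("blog-a.example", {"203.0.113.5"}, {"cheap-pills.example"}, {"nice site"})
print(accept_report(raw, {"blog-a.example"}, lists))
print(sorted(lists["ips"]))
```

The sending side would only call build_report once a trapped comment exceeds the chosen threshold, which is left out here because that figure is still undecided.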
There's, obviously, still a shedload of work to do, and, as I mentioned earlier, I have a few other tests that I want to run as well. The more ways a spammer can "trip up" the better. My eventual goal is to have no antispam blacklist and no moderated comments.