Fighting Comment Spam With Project Honey Pot (Posted on May 11th, 2013)

Early on when developing this blog, my goal was a commenting system that didn't require logging in to my site or any third-party site to post. This has made me a huge target for spammers. In fact, when my site first launched I didn't use reCAPTCHA, so I was getting thousands of comments per day even though I didn't have all that much content. However, the majority of these posts never saw the light of day thanks to Project Honey Pot's HTTP Blacklist. Taken from their site:

The HTTP Blacklist, or "http:BL", is a system that allows website administrators to take advantage of the data generated by Project Honey Pot in order to keep suspicious and malicious web robots off their sites. Project Honey Pot tracks harvesters, comment spammers, and other suspicious visitors to websites. Http:BL makes this data available to any member of Project Honey Pot in an easy and efficient way.

The service works by tracking the IP addresses of known bad actors and giving each one a rating. With that rating you can choose how content gets marked as spam. All I do is send the poster's IP address to the service and check how likely it is to belong to a spammer.

Using HTTP Blacklist

The first step is registering for an API key, which requires an account on the site. Once that's done, using the service from Python (or any language, really) takes only a few lines of code, which is pretty sweet. The query itself is just a DNS lookup. I've included some extra code for those using Django as well.

#Get the user's IP from a Django request
x_forwarded_for = request.META.get('HTTP_X_FORWARDED_FOR')
if x_forwarded_for:
    #The header can list several addresses; the first entry is the client
    ip = x_forwarded_for.split(',')[0].strip()
else:
    ip = request.META.get('REMOTE_ADDR')

#Step 1: Reverse the order of the IP address octets
iplist = ip.split('.')
iplist.reverse()

#Step 2: Build the query
query = YOUR_HTTP_BL_API_KEY + '.' + '.'.join(iplist) + '.' + 'dnsbl.httpbl.org'

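#Step 3: Execute the query
from socket import gethostbyname, gaierror
try:
    httpbl_result = gethostbyname(query)
except gaierror:
    #No DNS record means the IP is not listed; treat it as a clean 127.0.0.0 response
    httpbl_result = '127.0.0.0'
httpbl_resultlist = httpbl_result.split('.')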
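
The three steps above can also be wrapped into a pair of small helpers. This is only a rough sketch, assuming IPv4 addresses (http:BL queries are built from four octets). The names `build_httpbl_query` and `lookup_httpbl`, and the `resolver` argument, are made up for illustration and are not part of http:BL. A failed lookup is treated as "not listed", though a DNS outage would look the same from here.

```python
import socket

HTTPBL_ZONE = 'dnsbl.httpbl.org'


def build_httpbl_query(api_key, ip):
    """Return the DNS name to look up for an IPv4 address."""
    octets = ip.strip().split('.')
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError('http:BL only accepts IPv4 addresses: %r' % ip)
    # Key first, then the octets in reverse order, then the zone name.
    return '.'.join([api_key] + octets[::-1] + [HTTPBL_ZONE])


def lookup_httpbl(api_key, ip, resolver=socket.gethostbyname):
    """Return the response octets as a list of ints, or None if nothing came back.

    resolver is a hypothetical hook so the DNS call can be swapped out in tests.
    """
    try:
        answer = resolver(build_httpbl_query(api_key, ip))
    except socket.gaierror:
        # No record for this name: the IP is not listed (or DNS is unreachable).
        return None
    return [int(part) for part in answer.split('.')]
```

Returning None for an unlisted address keeps "never seen before" separate from "listed with a low score", which the caller may want to treat differently.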

From there it's up to you to interpret the data. The first item in the result list should be 127, signaling that the query was successful. The second item is a value from 0-255 giving the number of days since that IP address was last seen by Project Honey Pot. The third item is a threat score. I find this to be the most useful metric. In my experience anything above 45 is usually spam. The fourth and final item is the type of visitor, such as search engine, suspicious, harvester, or comment spammer. The API documentation goes into more detail on all of these fields. My code for marking comments as spam looks like this:

#Check if response is proper
if httpbl_resultlist[0] == "127":
    #Check threat level (the octets are strings, so convert before comparing)
    if int(httpbl_resultlist[2]) > 45:
        comment.spam = True
    else:
        comment.spam = False
else:
    comment.spam = True
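
For anyone who wants to use more than the threat score, here is a rough sketch that parses all four fields described above. It assumes, as I read the API documentation, that the fourth octet is a set of bit flags (1 suspicious, 2 harvester, 4 comment spammer, 0 search engine) and that for search engines the third octet is an identifier rather than a threat score. The names `HttpBLResult`, `parse_httpbl`, and `is_spam` are made up for illustration, and the default threshold of 45 is just the value I use above.

```python
from collections import namedtuple

HttpBLResult = namedtuple(
    'HttpBLResult', 'days_since_last_activity threat_score visitor_types')

# Bit flags in the fourth octet; a value of 0 on its own means search engine.
VISITOR_FLAGS = {1: 'suspicious', 2: 'harvester', 4: 'comment spammer'}


def parse_httpbl(answer):
    """Parse a dotted response such as '127.3.50.4'; raise ValueError if malformed."""
    parts = [int(p) for p in answer.split('.')]
    if len(parts) != 4 or parts[0] != 127:
        raise ValueError('unexpected http:BL response: %r' % answer)
    _, days, score, type_bits = parts
    if type_bits == 0:
        types = ['search engine']
    else:
        types = [name for bit, name in sorted(VISITOR_FLAGS.items())
                 if type_bits & bit]
    return HttpBLResult(days, score, types)


def is_spam(result, threshold=45):
    """Flag anything that is not a search engine and scores above the threshold."""
    if result.visitor_types == ['search engine']:
        return False
    return result.threat_score > threshold
```

With this, a response of '127.3.50.5' parses to a threat score of 50 with the types suspicious and comment spammer, so `is_spam` flags it, while a search engine response is always let through.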

Overall I've found http:BL to be a super useful and FREE service for catching spammers. A lot of people also use Akismet, so definitely check that out to see which fits your needs better. As of now I'm using reCAPTCHA and http:BL together to filter comments. I'd say about 10 comments get past reCAPTCHA per day, and about 5% of those also get past http:BL. So about 2-3 comments make it to the site every week, which isn't bad for a userless commenting system. I'll definitely be looking to cut this number down with some more advanced filtering in the near future.

As always, if you have any feedback or questions, feel free to drop them in the comments below or contact me privately on my contact page. Thanks for reading!

Tags: Django, Python, Tools

About Me

My name is Max Burstein and I am a graduate of the University of Central Florida and creator of Problem of the Day and Live Dota. I enjoy developing large, scalable web applications and I seek to change the world.