80legs spider overloading sites

Over the last few hours we have noticed an increased number of sessions on several websites we manage. A first investigation showed these were not human visitors but bots / spiders / crawlers, since the sessions didn't load the tracking code of monitoring tools (Google Analytics, SiteMeter, StatCounter, etc). So, even with more than 300 sessions on the CMS, Google Analytics real-time was showing just 20 or 30 users.

The next step was to find the originator of these extra sessions. There were many different IP addresses, most of them originating in Russia but also in Ukraine, Saudi Arabia and similar countries. Looking at the log files of the web servers receiving these hits, the common signature was a user agent string of 008/0.83. The full user agent entry, in all cases and across all the different source IPs, was:
Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620
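If you want to check your own logs for the same signature, something like the following sketch works against an Apache-style access log (the log path is an assumption, adjust it for your server):

```shell
#!/bin/sh
# Count hits from the 008 crawler and rank its source IPs.
# Usage: ./find-008.sh [/path/to/access.log]
LOG="${1:-/var/log/apache2/access.log}"   # default path is an assumption

echo "Total 008/0.83 hits:"
grep -c '008/0\.83' "$LOG"

echo "Top source IPs:"
# In the combined log format the client IP is the first field
grep '008/0\.83' "$LOG" | awk '{print $1}' | sort | uniq -c | sort -rn | head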

A visit to the given 80legs.com URL shows it claims to be a well-meaning robot / crawl service. There are also instructions on how to exclude your site from being hit by the spider, via a robots.txt directive for User Agent 008. Did you try it? It will have no result, since the robots crawling your site are probably already too busy hitting it to bother asking again for your robots allow / deny policy. For the record, there were thousands of hits coming from Russia on all the sites' pages and sub-pages, but only a few requests for robots.txt, and those actually came mostly from the USA.

What's next? Why do they do it? I have no clue. But here is what made me suspicious: the version in the spider's user agent is 0.83, yet the 80legs release page says version 0.83 has been out since 2009 and the latest version is 1.40, released in early 2010. So, is it a cracked / buggy / malicious older release that crawls my sites for who-knows-what reason? And if so, do I want to allow that?

For me the answer was NO. So the next thing to do is block the spider. But how? The IPs change all the time; as they write on their site: "Blocking our web crawler by IP address will not work. Due to the distributed nature of our infrastructure, we have thousands of constantly changing IP addresses."

They are right. So besides adding

User-agent: 008
Disallow: /

in the robots.txt at your site's root folder, you may need to add some extra measures. That could be a smart defense mechanism (if you use something like OSSEC, ModSecurity, or Sentinel) configured to deny requests with the 008/0.83 user agent, or even better, to match the 80legs.com string and return a polite but blocking message instead.
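As a rough sketch of that last idea, on Apache with mod_rewrite enabled an .htaccess fragment along these lines would refuse the crawler by user agent rather than by IP (the exact patterns and the custom error page are assumptions to adapt to your setup):

```apache
# Block the 80legs crawler by user agent, not by IP
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} 80legs\.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^008/ [NC]
RewriteRule .* - [F,L]

# Optional: serve a polite blocking message instead of the bare 403
ErrorDocument 403 "Automated crawling is not permitted on this site."
```

The [F] flag returns 403 Forbidden, so the crawler gets an immediate refusal without your application stack doing any work.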


Filed under: Bloggin', Strange news
