I recently found this in the web server logs of one of the websites I look after:
18.104.22.168 - - [26/Jan/2010:05:01:44 -0800] "GET /application/json HTTP/1.1" 404 763 "-" "panscient.com" 22.214.171.124 - - [26/Jan/2010:05:01:47 -0800] "GET /following-sibling::* HTTP/1.1" 404 763 "-" "panscient.com" 126.96.36.199 - - [26/Jan/2010:05:01:55 -0800] "GET /AppleWebKit/ HTTP/1.1" 404 763 "-" "panscient.com" 188.8.131.52 - - [26/Jan/2010:05:01:58 -0800] "GET /following-sibling::* HTTP/1.1" 404 763 "-" "panscient.com"
In case you are not familiar with web server log files, these line mean is that someone/something from IP address 184.108.40.206 requested the pages named after “GET” on the website, for example, a page named “following-sibling::*” etc.
Does it need to be said that no such pages exist (that’s what the “404” indicates)?
When I saw this I was rather puzzled; and looked up panscient.com (the last item on each line). Their home page says they provide some kind of vertical search service, whatever that is. On their FAQ page, I found this:
Why is your web crawler trying to access pages that don’t exist on my website?
Looks like a pretty competitive business when people start pulling at straws like this. Also I take it bandwidth is easier to come by than crawling software that avoids such silly attempts.