Stephan Wehner

March 2, 2010

thrackle.org alive again

Filed under: internet, programming — sw @ 11:13 am

My thrackle.org website is alive again. It’s about a nice math problem that I worked on 10 – 18 years ago.

January 27, 2010

webcrawlers desperate for content

Filed under: internet, programming — sw @ 9:42 am

I recently found this in the web server logs of one of the websites I look after:

38.100.8.50 - - [26/Jan/2010:05:01:44 -0800] "GET /application/json HTTP/1.1" 404 763 "-" "panscient.com"
38.100.8.50 - - [26/Jan/2010:05:01:47 -0800] "GET /following-sibling::* HTTP/1.1" 404 763 "-" "panscient.com"
38.100.8.50 - - [26/Jan/2010:05:01:55 -0800] "GET /AppleWebKit/ HTTP/1.1" 404 763 "-" "panscient.com"
38.100.8.50 - - [26/Jan/2010:05:01:58 -0800] "GET /following-sibling::* HTTP/1.1" 404 763 "-" "panscient.com"

In case you are not familiar with web server log files, these lines mean that someone or something at IP address 38.100.8.50 requested the pages named after “GET” on the website, for example, a page named “following-sibling::*”.

Does it need to be said that no such pages exist (that’s what the “404” indicates)?
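For readers who want to pick such lines apart themselves, here is a minimal Python sketch. It assumes the server writes the common "combined" log format, which the lines above appear to follow; the pattern and the function name parse_line are my own illustration, not part of any server tooling.

```python
import re

# Assumed layout (combined log format):
# ip identity user [timestamp] "method path protocol" status size "referer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


line = ('38.100.8.50 - - [26/Jan/2010:05:01:47 -0800] '
        '"GET /following-sibling::* HTTP/1.1" 404 763 "-" "panscient.com"')
entry = parse_line(line)
# Show who asked for what, and how the server answered.
print(entry["ip"], entry["path"], entry["status"], entry["agent"])
```

The status field is the interesting one here: 404 means the server found no page at the requested path.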

When I saw this I was rather puzzled; and looked up panscient.com (the last item on each line). Their home page says they provide some kind of vertical search service, whatever that is. On their FAQ page, I found this:

Why is your web crawler trying to access pages that don’t exist on my website?

Our web crawler attempts to extract links to valid web pages from javascript and other scripting languages. The crawler may misinterpret the information in these scripts and request a page that does not actually exist. These requests are attempts to retrieve valid web content, and are not an attempt to circumvent your webserver security.

(Emphasis mine) Oh ok. They are looking into javascript files on the web site and attempting to extract names of pages that might have content for the “vertical search”. But not successful in this case. As a web developer, I can tell you that javascript files very rarely contain interesting links to web pages.
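How might a crawler arrive at a path like “/application/json”? I don't know how panscient's software works, so the following Python sketch is purely a guess: it assumes a naive heuristic that treats any quoted string containing a slash as a path on the site. The script snippet, the pattern and the function name guess_paths are all made up for illustration.

```python
import re

# Hypothetical heuristic: a quoted string with a slash in it "looks like" a path.
PATH_LIKE = re.compile(r'''["']([A-Za-z0-9_.:*-]*/[A-Za-z0-9_.:*/-]*)["']''')


def guess_paths(script):
    """Return site-relative paths a naive crawler might derive from a script."""
    return ["/" + candidate.lstrip("/") for candidate in PATH_LIKE.findall(script)]


# Made-up JavaScript of the kind found on many websites.
script = '''
xhr.setRequestHeader("Accept", "application/json");
var isWebKit = navigator.userAgent.indexOf("AppleWebKit/") > -1;
'''
print(guess_paths(script))
```

Under that assumption, a MIME type and a browser name both end up as requests for pages that never existed.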

Looks like a pretty competitive business when people start grasping at straws like this. Also I take it bandwidth is easier to come by than crawling software that avoids such silly attempts.

December 27, 2009

truste.org ssl certificate problems

Filed under: internet — sw @ 5:31 pm

Today, a little note about a problem with https that I ran into with https://www.truste.org

When visiting that site my Firefox (Version 3.0) warned me that

Secure Connection Failed
www.truste.org uses an invalid security certificate.
The certificate is only valid for *.truste.com

Visiting https://www.truste.com instead simply timed out: “The server at www.truste.com is taking too long to respond.”

Looks like they didn’t configure their web server properly. A bit odd, since they describe themselves “as the leading internet privacy services provider.”
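The warning boils down to a name comparison: the certificate names *.truste.com, but the browser asked for www.truste.org. Here is a simplified Python sketch of that check. It assumes the wildcard may only stand for the whole leftmost label, and it leaves out everything else a real browser verifies (expiry, trust chain and so on); hostname_matches is a name I made up.

```python
def hostname_matches(pattern, hostname):
    """Simplified check: does a certificate name such as '*.truste.com' cover hostname?"""
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    # A wildcard stands for exactly one label, so the label counts must agree.
    if len(pattern_labels) != len(host_labels):
        return False
    # The leftmost label must be the wildcard or an exact match.
    if pattern_labels[0] != "*" and pattern_labels[0] != host_labels[0]:
        return False
    # Every remaining label must match exactly.
    return pattern_labels[1:] == host_labels[1:]


print(hostname_matches("*.truste.com", "www.truste.org"))  # False
print(hostname_matches("*.truste.com", "www.truste.com"))  # True
```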

May 7, 2009

slashdot down

Filed under: internet — sw @ 8:44 pm

Website administrators fear the slashdot effect (“slashdotting” / “being slashdotted”). Now slashdot.org, “News for nerds. Stuff that matters.”, is down itself. Here is a screen shot:

Screenshot

Unclear what “Guru Meditation” refers to, but in case you’re wondering, the Varnish link generated by the slashdot web server goes to http://www.varnish-cache.org. Which takes you to http://varnish.projects.linpro.no, which says,

Welcome to the Varnish project
Varnish is a state-of-the-art, high-performance HTTP accelerator

(The slashdot site was working again an hour later)

April 23, 2009

on github

Filed under: internet — sw @ 9:21 pm

Joined github today. You can look up my (future) public software at

http://github.com/stephanwehner

Added a little project which should make Rails development a little easier when it comes to working with the database directly. For now only for mysql. See my_sql.rb under http://github.com/stephanwehner/railsgoodies.

Thanks to my friend Sam for encouraging me.

Learn about git if you haven’t heard about it.

January 31, 2009

google broken

Filed under: internet — sw @ 8:38 am

This morning Google’s search results don’t work.

Let’s say you search Google for water. Then:

Each result has a warning under its link “This site may harm your computer”:

Google search results page for "water"

Clicking on the link doesn’t take your browser to the page as usual, but brings up an error message.

Error page when clicking on a search result link

Clicking on the “This site may harm your computer” link produces a help page with the title “Concerns About Web Search Results: Results labeled ‘This site may harm your computer’”:

Google help page about harmful search result pages

So now I search Google for “Concerns About Web Search Results: Results labeled ‘This site may harm your computer’”. The first result is a Google support page at

http://www.google.com/support/websearch/bin/answer.py?hl=en&answer=45449

But accessing that page leads to another error (no screen shot):

We’re sorry…

… but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can’t process your request right now.

What a mess!

Questions

  • Should visiting any web page really “harm your computer”?
  • On what basis would Google think that a web page is going to “harm your computer”? Does it take into account or even know what kind of computer you are using?
  • If Google had reason to believe a web page were to “harm your computer”, should the page be really listed as a search result? (Less is more?)
  • If a search result page is not marked with the warning, would you blame Google if you then visited the search result page and your computer came out “harmed”?
  • Are these search result pages getting too crowded altogether? craigslist has barely changed their listing format and they’re doing just fine.

Update

Of course, this was a temporary glitch. According to Google’s blog, “the errors began appearing between 6:27 a.m. and 6:40 a.m. and began disappearing between 7:10 and 7:25 a.m. [PST]”. (So I ran into this just towards the end, around 7:20.) The problem’s root cause is given as:

Unfortunately (and here’s the human error), the URL of ‘/’ was mistakenly checked in [to a list of bad URLs] as a value to the file and ‘/’ expands to all URLs.

And it wasn’t StopBadWare.org’s list as Google had originally posted. (The two organizations work together on this list). Well, mistakes happen …
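To see how a single ‘/’ can flag everything, here is a toy Python sketch. Google's blog does not describe the actual matching logic, so this assumes list entries are treated as path prefixes; the list contents, the example URLs and the function name is_flagged are hypothetical.

```python
from urllib.parse import urlparse


def is_flagged(url, bad_prefixes):
    """Toy check: flag a URL if its path starts with any entry in the list."""
    path = urlparse(url).path or "/"
    return any(path.startswith(prefix) for prefix in bad_prefixes)


urls = ["http://example.org/water", "http://example.com/"]

good_list = ["/malware/"]          # hypothetical, sensible entry
broken_list = ["/malware/", "/"]   # the same list with '/' checked in by mistake

print([is_flagged(url, good_list) for url in urls])    # [False, False]
print([is_flagged(url, broken_list) for url in urls])  # [True, True]
```

Every path starts with ‘/’, so under this assumption one stray entry is enough to mark every search result as harmful.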
