I'm taking the week off next week. Today I was writing a spider to crawl a website and inspect the pages. I wanted to finish this week with some analysis on my new website and then let the data gel for the rest of the week. Unfortunately, a "Service Unavailable" event interfered with my plans. I guess I'll blog my pseudo-failure and move on to vacation.
The spider process starts at a given URL and generates a request, retrieving the response from the site. It then uses a regex to find all the hrefs, creates valid URLs from each of those links, and queues each of the links for crawling. Then it de-queues each link and crawls it if it hasn't already retrieved that page. The whole thing repeats for each page, sleeping for one full second between requests, and stops when either the queue is empty or the number of pages requested exceeds the maximum threshold. At this time it is a simple, single-threaded app that closes its connections explicitly and makes requests very conservatively. At least I thought so. Does that description sound like it should crash a production website?
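The loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not the original spider's code: the names `extract_links`, `crawl`, and `HREF_RE`, the `fetch` callable, and the default limits are all assumptions made for the example, and the regex is deliberately simple (it only handles quoted href values).

```python
import re
import time
from collections import deque
from urllib.parse import urljoin, urldefrag

# Naive pattern for quoted href attributes (an assumption; real HTML is messier).
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(base_url, html):
    """Return absolute http(s) URLs for every href found in the page."""
    links = []
    for href in HREF_RE.findall(html):
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50, delay=1.0, sleep=time.sleep):
    """Breadth-first, single-threaded crawl.

    `fetch` is any callable that takes a URL and returns the page text.
    Stops when the queue is empty or `max_pages` pages have been requested.
    """
    queue = deque([start_url])
    seen = set()
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue  # already retrieved this page
        if visited:
            sleep(delay)  # pause between requests, never before the first
        seen.add(url)
        html = fetch(url)
        visited.append(url)
        queue.extend(extract_links(url, html))
    return visited
```

To run it against a real site, `fetch` could be a small wrapper around `urllib.request.urlopen` that reads and decodes the response. Passing the fetcher in as a parameter also makes it easy to test the loop against canned pages without touching a server.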
I ran it against a local test site several times. Then I hit http://www.openmicmadness.com. I then ran it against the beta rentals site. Everything went fine. I discovered several things about the site that I wasn't aware of, and I wanted to see more. So I bumped the max threshold to 2000 pages and let it run for a few minutes against the beta rentals site. I didn't think much of it, as it can only request one page per second. Unfortunately, the beta site is now showing "Service Unavailable". I've performed several load tests in the dev environment on the rentals site. While I was able to load it to the point of incredibly slow response, I never witnessed a complete meltdown. I suspect that it's only hosted on one server in production and something went kah-blewy. Since I am not given any access to tools in production, I can't diagnose the problem, and since it's not a production site but rather a beta, I can't justify escalating the failure.
OK, I just ran the thing against several local website applications and my blog. None of them even hiccupped. I think something definitely crashed on that server but I don't think it was anything special that I did. I just happened to be the user(s) that got the error.
I think I'll follow Google's example and extend the spider to crawl just one level per session. Rather than crawl the queued links, I'll serialize them to a data store and then load that back up to kick off a future session. I'll also extend the error response to store the context of server errors, 404s and bad links.
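One way the one-level-per-session idea might look is sketched below. It is only an illustration under assumptions: the data store is a plain JSON file, and `load_state`, `save_state`, `run_session`, and the shape of the state dictionary are hypothetical names chosen for the example. The `fetch` and `extract_links` callables are passed in, so any fetcher and link extractor can be plugged in.

```python
import json
from pathlib import Path


def load_state(path):
    """Load crawl state from a JSON file, or start empty if it doesn't exist."""
    p = Path(path)
    if not p.exists():
        return {"pending": [], "visited": [], "errors": []}
    return json.loads(p.read_text(encoding="utf-8"))


def save_state(path, state):
    """Write crawl state back to the JSON file for a future session."""
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")


def run_session(state, fetch, extract_links):
    """Crawl only the URLs that were pending when the session started.

    Links discovered along the way are not crawled now; they become the
    pending list for the next session. Failures are recorded with the URL
    and the error text instead of stopping the session.
    """
    visited = set(state["visited"])
    discovered = []
    for url in state["pending"]:
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch(url)
        except Exception as exc:  # server errors, 404s, bad links
            state["errors"].append({"url": url, "error": repr(exc)})
            continue
        for link in extract_links(url, html):
            if link not in discovered:
                discovered.append(link)
    state["visited"] = sorted(visited)
    # Keep only links that were not already handled in this or earlier sessions.
    state["pending"] = [u for u in discovered if u not in visited]
    return state
```

A session would then be `save_state(path, run_session(load_state(path), fetch, extract_links))`, with the first run seeded by putting the start URL in `pending`. Spacing sessions apart keeps any single burst of requests small.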
Another realization I had during a crawl was how easy and useful it would be to set certain criteria to search for and then retrieve that content locally for later inspection. For example, find all MP3s on a band site and store them in a folder. Or you might want to scrape email addresses off of a website. My initial purpose was to inspect pages for richness of content and compliance with HTML standards as well as accessibility compliance. I've wanted to write a spider for various reasons for a very long time. I even modified the spider example found here, once upon a time, but lost that project when NAnt deleted my C: drive. I'm pleased to be back in the crawling game. The possibilities are endless!
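As a rough illustration of the search-criteria idea above, here is a small Python sketch that pulls MP3 links and email addresses out of a page. The function names `find_media_links` and `find_emails` and both regular expressions are assumptions made for the example; the email pattern in particular is a loose approximation and will miss or over-match some addresses.

```python
import re
from urllib.parse import urljoin, urlparse

# Naive patterns, good enough for a sketch but not for every page.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_media_links(base_url, html, extensions=(".mp3",)):
    """Return absolute URLs of links whose path ends in one of `extensions`."""
    found = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href)
        # Compare against the path only, so query strings don't hide the extension.
        path = urlparse(absolute).path.lower()
        if path.endswith(tuple(extensions)) and absolute not in found:
            found.append(absolute)
    return found


def find_emails(html):
    """Return the unique email-like strings in the page, sorted."""
    return sorted(set(EMAIL_RE.findall(html)))
```

The matched URLs could then be downloaded into a folder in a separate step, at the same polite pace as the crawl itself.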