YiJing::0x11 - Gu\/ (Web Crawler)
"Beware: There is only a thin line between a crawler and a worm!"
Web crawler: Nice and fun. Suitable to sift in the great Data Flow. Test run for three days before sending it out; analyze the data for three days before sending it out again.
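彖曰。蠱。剛上而柔下。巽而止。 蠱。元亨。而天下治也。利涉大川。往有事也。 先甲三日。后甲三日。終則有始。天行也。
(The Commentary on the Judgment says: Gu. The firm is above and the yielding below; gentle, and then coming to a stop: that is Gu. Gu brings supreme success, and the world is set in order. It is favorable to cross the great river: in going forward there is work to be done. Three days before the starting day, three days after the starting day: every end has a new beginning. Such is the course of heaven.)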
This hexagram is emblematic of the trouble that you would face in writing or managing a web crawler: the program has to strike out on its own and unobtrusively sift great gobs of data in any number of messy formats.
It needs testing and retesting, planning and monitoring. It has to follow old standards and accept new ones, and tolerate sites that don't follow standards at all. To do its work, the web crawler has to get as much data from as many sites as it can, without bothering any webmasters in the process.
It has to be efficient, but deliberate. It is a matter of contradictory goals -- a situation that comes up in all sorts of systems besides web crawlers.
Gathering Data under Standards, is the Image of a Web Crawler. A wise hacker makes careful use of it to provide people with interesting information while maintaining the proper ethics.
Crawling and the Data.
... The web crawler will harvest some bad data. Make sure it can recover well and move on correctly.
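One way to read "recover well and move on" is a link extractor that never lets one broken page stop the crawl. The sketch below is only an illustration, using Python's standard html.parser and urllib.parse; the names LinkCollector and extract_links are made up for this example, and treating every page as UTF-8 is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links, skipping anything malformed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # <a> without a usable href: nothing to follow
        try:
            absolute = urljoin(self.base_url, href.strip())
            scheme = urlparse(absolute).scheme
        except ValueError:
            return  # unparsable URL: drop it and move on
        if scheme in ("http", "https"):
            self.links.append(absolute)


def extract_links(raw_bytes, base_url):
    """Return the links found in a page, however messy the page is."""
    # Assumption: UTF-8. Bad bytes become U+FFFD instead of raising.
    text = raw_bytes.decode("utf-8", errors="replace")
    parser = LinkCollector(base_url)
    try:
        parser.feed(text)
        parser.close()
    except Exception:
        pass  # keep whatever was collected before the parser gave up
    return parser.links
```

The point of the design is that bad input costs one link or one page, never the whole run.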
Crawling and the Network.
... The web crawler should back off from network trouble, wait, compromise, and improvise.
... You'll have to fix some mistakes that shouldn't have been made, but it's no big deal.
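"Back off, wait" is commonly done as exponential backoff with random jitter. The following is a rough sketch of that idea, not a prescription: fetch_with_backoff and its parameters (max_attempts, base_delay, max_delay) are hypothetical names and values, and the rule for which HTTP errors deserve a retry is an assumption.

```python
import random
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_attempts=4, base_delay=1.0, max_delay=60.0):
    """Return the response body, or None once the crawler should give up."""
    for attempt in range(max_attempts):
        try:
            with opener(url, timeout=10) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Assumption: client errors other than 429 will not improve
            # by waiting, so do not retry them.
            if err.code < 500 and err.code != 429:
                return None
        except OSError:
            pass  # DNS failure, refused connection, timeout: worth a retry
        if attempt < max_attempts - 1:
            # Double the ceiling on each failure (capped at max_delay),
            # then wait a random time below it.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    return None  # give up on this URL; the rest of the crawl goes on
```

The opener and sleep arguments exist only so the function can be exercised without a network or a real clock.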
Obvious and oblivious.
... The implementation is nice and simple, and dangerously wrong. Watch it upset everyone!
... Make it clear that you're listening to what people say. The web crawler depends on the kindness of strangers.
... You should act on principle despite the authority's demands, so that you can serve a higher goal.
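Listening to what people say starts, for a crawler, with robots.txt. A small sketch using Python's standard urllib.robotparser follows; the agent name ExampleCrawler, the helper names, and the one-second default delay are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def may_fetch(rules, url, agent=USER_AGENT):
    """True if the site's rules allow this agent to request the URL."""
    return rules.can_fetch(agent, url)


def polite_delay(rules, agent=USER_AGENT, default=1.0):
    """Seconds to wait between requests: the site's wish, or our default."""
    delay = rules.crawl_delay(agent)
    return default if delay is None else max(default, float(delay))
```

For example, given a robots.txt that names ExampleCrawler, disallows /private/ and asks for a Crawl-delay of 5, may_fetch refuses anything under /private/ and polite_delay returns 5.0.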