Saturday, May 06, 2006

WebCite

Just read about WebCite on Lorcan Dempsey's blog.

WebCite is an archiving system for web references (cited web pages and websites). It works much like TinyURL: you submit a URL and get a new URL back. For example, I submitted "http://www.yahoo.com" on May 6th, 2006, and WebCite gave me back http://www.webcitation.org/5Fgu0Xf5x, where the Yahoo page as it looked on that day is now archived.
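To make the submit-a-URL, get-a-URL-back idea concrete, here is a toy in-memory sketch in Python. It is not how WebCite is actually implemented (I don't know its internals). The `ToyArchive` class, the `short_id` scheme, and the `http://archive.example/` base URL are all made up for illustration, and the fetch function is passed in so that no real network call is needed.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def short_id(data: bytes, length: int = 9) -> str:
    """Derive a short base-62 token from a SHA-256 digest (hypothetical scheme)."""
    n = int.from_bytes(hashlib.sha256(data).digest(), "big")
    chars = []
    for _ in range(length):
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(chars)


@dataclass
class Snapshot:
    url: str               # the URL that was originally submitted
    captured_at: datetime  # when the copy was taken
    body: bytes            # the archived content


class ToyArchive:
    """On-demand archive: each submission stores a dated copy under a short URL."""

    def __init__(self, fetch: Callable[[str], bytes],
                 base: str = "http://archive.example/"):
        self.fetch = fetch  # assumption: the caller supplies the page fetcher
        self.base = base
        self.store: Dict[str, Snapshot] = {}

    def submit(self, url: str) -> str:
        body = self.fetch(url)
        now = datetime.now(timezone.utc)
        # Mix URL, capture time and content so repeated submissions get new IDs.
        key = short_id(url.encode() + now.isoformat().encode() + body)
        self.store[key] = Snapshot(url, now, body)
        return self.base + key

    def resolve(self, short_url: str) -> Snapshot:
        return self.store[short_url[len(self.base):]]
```

Calling `ToyArchive(fetch).submit("http://www.yahoo.com")` returns a short URL, and `resolve` on that URL gives back the stored copy together with its capture date. A real service would also need durable storage and a collision check on the short IDs, both of which this sketch skips.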

So this is actually a combination of archive.org and tinyurl.com, and it may look extremely simple. However, having worked on persistent URL and web archiving issues before, I can testify that simple is not easy.

Persistent URL issues have been studied by Steve Lawrence (CiteSeer, now at Google) in "Persistence of Web References in Scientific Research" (IEEE Computer, 34(2):26-31, 2001), by Thomas A. Phelps and Robert Wilensky in "Robust Hyperlinks and Locations", and by Frank McCown and Michael L. Nelson in "The Availability and Persistence of Web References in D-Lib Magazine".

In particular, Thomas Phelps has proposed that most web pages can be uniquely identified by five keywords, so you should reference a web page by those five keywords and, in case it has been relocated, issue them to a web search engine to find it again.
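As a rough illustration of the five-keyword idea, the sketch below picks the five terms that are frequent in a page but rare in a small reference corpus, using a plain TF-IDF score. This is a stand-in I chose for illustration, not necessarily the exact selection rule from the Phelps and Wilensky paper. The `lexical_signature` name and the tiny corpus in the usage note are assumptions.

```python
import math
import re
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(page: str, corpus: Iterable[str], k: int = 5) -> List[str]:
    """Return the k terms of `page` with the highest TF-IDF score.

    `corpus` stands in for "the rest of the web"; a real system would use
    document frequencies from a search engine index instead.
    """
    docs = [set(tokenize(d)) for d in corpus]
    n_docs = len(docs)
    tf = Counter(tokenize(page))

    def score(term: str) -> float:
        df = sum(1 for d in docs if term in d)     # documents containing term
        idf = math.log((1 + n_docs) / (1 + df))    # smoothed inverse doc freq
        return tf[term] * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]
```

For instance, `lexical_signature(page_text, ["the web is a big place", "the archive of the web"])` would tend to drop common words like "the" and keep the terms that are unusual for that page. The resulting keywords are what you would paste into a search engine if the original URL went dead.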

Web archiving is best addressed by archive.org for now, in a rather typical web-crawling way, and it is still an active research area. And now we have an on-demand web archiving system in webcitation.org.

I don't know how webcitation.org scales or how it protects its data, and it certainly doesn't solve all web archiving issues (we may only realize a web page should have been preserved long after the right time has passed). But still, it's really simple and neat, and it targets two problems (persistent URLs and web archiving) so nicely.
