Programming

WebCite

I just read about WebCite on Lorcan Dempsey's blog.

WebCite is an archiving system for web references (cited webpages and websites). It works much like tinyurl: you submit a URL and get a new URL back. For example, I submitted "http://www.yahoo.com" on May 6th, 2006, and WebCite gave me back http://www.webcitation.org/5Fgu0Xf5x, where the Yahoo page as it looked on May 6th, 2006 is now archived.
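To make the submit-and-get-a-URL-back flow concrete, here is a minimal in-memory sketch. It is not how WebCite is implemented (I don't know its internals): the class `ToyArchive`, the helper `short_id`, the host `archive.example`, and the hash-based ID scheme are all assumptions made up for illustration, and the page content is passed in by the caller rather than fetched.

```python
import hashlib
import string
from dataclasses import dataclass
from datetime import date

# Base-62 alphabet for short IDs (an assumed scheme, not WebCite's).
ALPHABET = string.digits + string.ascii_letters


def short_id(url: str, captured_on: date, length: int = 9) -> str:
    """Derive a short, deterministic ID from the URL and the capture date."""
    digest = hashlib.sha256(f"{url}|{captured_on.isoformat()}".encode("utf-8")).digest()
    n = int.from_bytes(digest, "big")
    chars = []
    for _ in range(length):
        n, r = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[r])
    return "".join(chars)


@dataclass(frozen=True)
class Snapshot:
    url: str
    captured_on: date
    content: bytes


class ToyArchive:
    """Hypothetical on-demand archive: stores a snapshot, hands back a short URL."""

    def __init__(self, base: str = "http://archive.example/"):
        self.base = base
        self._store = {}

    def submit(self, url: str, content: bytes, captured_on: date) -> str:
        key = short_id(url, captured_on)
        # Keep the first snapshot for a given (url, date); never overwrite it.
        self._store.setdefault(key, Snapshot(url, captured_on, content))
        return self.base + key

    def resolve(self, short_url: str) -> Snapshot:
        # The ID is whatever follows the last slash of the short URL.
        return self._store[short_url.rsplit("/", 1)[-1]]


archive = ToyArchive()
cited = archive.submit("http://www.yahoo.com", b"<html>front page</html>", date(2006, 5, 6))
snapshot = archive.resolve(cited)
```

The point of the sketch is that the short URL names one page at one moment, so the same original URL submitted on a different day gets a different short URL.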

So this is essentially a combination of archive.org and tinyurl.com, and it may look extremely simple. However, having worked on persistent URL and web archiving issues before, I can testify that simple is not easy.

Persistent URL issues have been studied by Steve Lawrence (CiteSeer, now at Google) in "Persistence of Web References in Scientific Research" (IEEE Computer, 34(2):26-31, 2001), by Thomas A. Phelps and Robert Wilensky in "Robust Hyperlinks and Locations", and by Frank McCown and Michael L. Nelson in "The Availability and Persistence of Web References in D-Lib Magazine".

In particular, Thomas Phelps has proposed that most web pages can be uniquely identified by five keywords, so you can reference a web page by those five keywords and, if the page has been relocated, issue them to a web search engine to find it again.
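As a rough illustration of the five-keyword idea, the sketch below picks the five terms that are frequent in a page but rare in a reference corpus. The TF-IDF style scoring, the function names `tokenize` and `lexical_signature`, and the tiny corpus in the usage lines are my assumptions for illustration, not necessarily the term selection Phelps and Wilensky describe.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page: str, corpus: list, k: int = 5) -> list:
    """Return the k terms that best distinguish `page` from `corpus`."""
    tf = Counter(tokenize(page))
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets)

    def score(term: str) -> float:
        # Terms that appear in few corpus documents get a higher weight.
        df = sum(1 for words in doc_sets if term in words)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        return tf[term] * idf

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]


page = "the webcite service archives cited webpages and the webcite snapshot"
corpus = ["the web is big and the web is old", "archives of the web"]
signature = lexical_signature(page, corpus)
query = " ".join(signature)  # what you would send to a search engine
```

A common word like "the" scores low because it appears in every corpus document, while a distinctive word like "webcite" ends up at the front of the signature.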

Web archiving is best addressed by archive.org for now, in a rather typical web-crawling way, and it is still an active research area. And now we have an on-demand web archiving system in webcitation.org.

I don't know how webcitation.org scales or how it protects the data, and it certainly doesn't solve all web archiving issues (we may only realize a web page should have been preserved long after the right time to capture it has passed). Still, it's really simple and neat, and it targets two problems (persistent URLs and web archiving) very nicely.


Saturday, May 06, 2006
