Saturday, May 06, 2006

WebCite

Just read about WebCite on Lorcan Dempsey's blog.

WebCite is an archiving system for web references (cited web pages and websites). It works much like tinyurl: you submit a URL and get a new URL back. For example, I submitted "http://www.yahoo.com" on May 6th, 2006, and WebCite gave me back http://www.webcitation.org/5Fgu0Xf5x, where the Yahoo page as of May 6th, 2006 is now archived.
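I don't know how WebCite actually generates its keys or stores pages, so the following is only a toy in-memory sketch of the "submit a URL, get a short key back" idea. The class name SnapshotStore, the base-62 counter, and passing the page content in by hand (instead of fetching it) are all my own assumptions.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

/** Toy on-demand archive: NOT how WebCite is implemented, just the idea. */
class SnapshotStore {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    /** One archived copy: where it came from, when, and what it contained. */
    public static final class Snapshot {
        public final String url;
        public final LocalDate date;
        public final String content;

        Snapshot(String url, LocalDate date, String content) {
            this.url = url;
            this.date = date;
            this.content = content;
        }
    }

    private final Map<String, Snapshot> snapshots = new HashMap<>();
    private long nextId = 1;

    /** Stores a copy of the page and returns a short key for it. */
    public String archive(String url, LocalDate date, String content) {
        String key = encode(nextId++);
        snapshots.put(key, new Snapshot(url, date, content));
        return key;
    }

    /** Returns the stored copy, or null if the key is unknown. */
    public Snapshot lookup(String key) {
        return snapshots.get(key);
    }

    /** Encodes a non-negative counter in base 62 (hypothetical key scheme). */
    static String encode(long n) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt((int) (n % 62)));
            n /= 62;
        } while (n > 0);
        return sb.reverse().toString();
    }
}
```

Archiving the same URL twice gives two different keys, so each key always points at one fixed copy of the page, which is the property a citation needs.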

So this is essentially a combination of archive.org and tinyurl.com, and it may look extremely simple. However, having worked on persistent URL and web archiving issues before, I can testify that simple is not easy.

Persistent URL issues have been studied by Steve Lawrence (CiteSeer, now at Google) in "Persistence of Web References in Scientific Research" (IEEE Computer, 34(2):26-31, 2001), by Thomas A. Phelps and Robert Wilensky in "Robust Hyperlinks and Locations", and by Frank McCown and Michael L. Nelson in "The Availability and Persistence of Web References in D-Lib Magazine".

In particular, Thomas Phelps has proposed that most web pages can be uniquely identified by five keywords, so you should reference a web page by those five keywords and, in case it has been relocated, issue them to a web search engine to find it.
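To make the five-keyword idea concrete, here is a rough sketch that picks the words that are frequent in the page but rare elsewhere, using plain TF-IDF. This is my own simplification and not the exact term selection from the Phelps and Wilensky paper; the docFreq map and corpusSize are hypothetical stand-ins for statistics a search engine would have.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified lexical signature: top-k terms by TF-IDF (an assumption, not the paper's method). */
class LexicalSignature {

    /**
     * Returns up to k terms of the page ranked by term frequency times
     * inverse document frequency. docFreq and corpusSize are hypothetical inputs.
     */
    public static List<String> signature(String pageText, Map<String, Integer> docFreq,
                                         int corpusSize, int k) {
        // Count how often each lower-cased word occurs in the page.
        Map<String, Integer> tf = new HashMap<>();
        for (String w : pageText.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                tf.merge(w, 1, Integer::sum);
            }
        }
        // Score each word; words unseen in the corpus get a document frequency of 1.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = Math.max(1, docFreq.getOrDefault(e.getKey(), 1));
            score.put(e.getKey(), e.getValue() * Math.log((double) corpusSize / df));
        }
        // Highest score first; ties are broken alphabetically for stable output.
        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> -score.get(t))
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

Calling signature(text, docFreq, corpusSize, 5) gives the five words you would store next to the link and later paste into a search engine if the URL goes dead.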

Web archiving is best addressed by archive.org for now, in a rather typical web-crawling way, and it is still an active research area. And now we have an on-demand web archiving system in webcitation.org.

I don't know how webcitation.org scales or how it protects the data, and it certainly doesn't solve all web archiving issues (we may only realize a web page should be preserved long after the right time has passed). Still, it's really simple and neat, and it targets two problems (persistent URLs and web archiving) so nicely.

Monday, May 01, 2006

Learning tricks in Java jar files

I have used Java jar files for some years but didn't realize there are several interesting mechanisms inside a jar file. Now that I have had a chance to read the jar specification, I found several very handy features:

-- specify the Class-Path in MANIFEST.MF

-- create index files (INDEX.LIST) for all packages

-- use the META-INF/services directory to associate an implementation with its definition (e.g. class to interface)
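As a small sketch of the first item: the Class-Path header is just a space-separated list of relative jar locations in the manifest's main section, and java.util.jar.Manifest can write and read it. The class name ManifestClassPath and the jar names in the usage note are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

class ManifestClassPath {

    /** Serializes a manifest whose Class-Path lists the given jars, space separated. */
    public static byte[] buildManifest(String... jars) throws IOException {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Set Manifest-Version first; without it the main section is not written out.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(Attributes.Name.CLASS_PATH, String.join(" ", jars));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        manifest.write(out);
        return out.toByteArray();
    }

    /** Parses manifest bytes and returns the Class-Path entries in order. */
    public static List<String> classPathEntries(byte[] manifestBytes) throws IOException {
        Manifest manifest = new Manifest(new ByteArrayInputStream(manifestBytes));
        String value = manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH);
        if (value == null || value.trim().isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(value.trim().split("\\s+"));
    }
}
```

For example, buildManifest("lib/a.jar", "lib/b.jar") produces the bytes you would store as META-INF/MANIFEST.MF, and classPathEntries reads the two entries back.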

These all come in handy for easy distribution of a package.
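The META-INF/services mechanism can be tried out without building a jar. The sketch below assumes java.util.ServiceLoader (available from Java 6 on) and writes the provider file into a temporary directory that stands in for the jar; Greeter and EnglishGreeter are hypothetical names of my own.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ServiceDemo {

    /** Hypothetical service definition (the interface side). */
    public interface Greeter {
        String greet(String name);
    }

    /** Hypothetical implementation; it needs a public no-argument constructor. */
    public static class EnglishGreeter implements Greeter {
        public EnglishGreeter() { }

        @Override
        public String greet(String name) {
            return "Hello, " + name;
        }
    }

    /**
     * Writes META-INF/services/<interface name> into a temporary directory,
     * standing in for the same entry inside a jar, then looks the providers up.
     * The temporary directory is left behind; this is only a demo.
     */
    public static List<String> greetAll(String name) throws IOException {
        Path root = Files.createTempDirectory("services-demo");
        Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
        // The file is named after the interface; each line names one implementation.
        Files.write(services.resolve(Greeter.class.getName()),
                EnglishGreeter.class.getName().getBytes(StandardCharsets.UTF_8));

        List<String> greetings = new ArrayList<>();
        try (URLClassLoader loader = new URLClassLoader(
                new URL[] { root.toUri().toURL() }, ServiceDemo.class.getClassLoader())) {
            // ServiceLoader reads the provider file and instantiates each listed class.
            for (Greeter g : ServiceLoader.load(Greeter.class, loader)) {
                greetings.add(g.greet(name));
            }
        }
        return greetings;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(greetAll("jar"));
    }
}
```

The calling code only ever names the interface; which implementation gets used is decided by whatever provider files are on the class path, so swapping implementations means swapping jars rather than editing code.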