Sunday, May 24, 2009

unAPI format and Semantic Web

Ross Singer's recent post about "One Data Format Identifier (and Registry) to Rule Them All" caused some interesting discussions. I didn't read all of them and don't have much to add, but it gives me a chance to reflect on how the unAPI format wiki was created in the first place.

I have always had trouble understanding why the Semantic Web is so obsessed with URIs and ontologies. I am not a Semantic Web expert, but to me the strict URI approach seems to conflict directly with other "convention over configuration" approaches such as tagging, wikis, and Twitter. And it doesn't seem that everyone will have the time to learn each other's ontologies anyhow. As a result, any sufficiently large RDF file makes my head spin because of all the long URIs, and I wonder how much bandwidth is used to carry them across the Internet.

So why not just use a word as the identifier, and let everyone pick up a dictionary and find out its semantics? I guess the Oxford dictionary is a better agreed-upon ontology. In retrospect, perhaps this is how the unAPI format differs from SRU/OpenURL: we can choose a name and hopefully it will work. Does it work? I don't know.

Since multiple copies keep stuff safe, I am also pasting here the unAPI format list retrieved from the Internet Archive.

name | type | example | desc | doc
amazon | application/xml | opa | a convention used by OPA |
asn1 | text/plain | opa | Abstract Syntax Notation One | asn.1
bibtex | text/plain | hubmed | bibtex | bibtex
dc | text/plain | opa | unqualified Dublin Core |
didl | application/xml | TODO | MPEG-21 DIDL | didl
endnote | text/plain | refbase | endnote | endnote
latex | application/x-latex | refbase | latex | latex
marcxml | application/xml | Technosophia | MAchine Readable Cataloging in XML | marcxml
markdown | text/plain | refbase | markdown | markdown
mods | application/xml | Technosophia | Metadata Object Description Schema | mods
html | text/html | refbase | HTML | HTML
oai_citeseer | application/xml | opa | |
oai_dc | application/xml | Technosophia | unqualified OAI Dublin Core | oai_dc
pdf | application/pdf | refbase | Portable Document Format | PDF
pubmed | application/xml | opa | pubmed article | pubmed
rdf/xml | application/rdf+xml | hubmed | Resource Description Framework | RDF
ris | text/plain | hubmed | ris | ris
rss | application/xml | Technosophia | Really Simple Syndication | RSS
rtf | application/rtf | refbase | Rich Text Format | RTF
srw_dc | application/xml | Technosophia | unqualified SRW Dublin Core |
srw_mods | application/xml | refbase | unqualified SRW MODS |
text | text/plain | opa | |
wrap | application/x-javascript | opa | unalog json format |
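
The idea of using a plain word as the format identifier can be sketched as a simple lookup table. The snippet below is only an illustration: the dictionary holds a few name/type pairs copied from the list above, and the names UNAPI_FORMATS and mime_type_for are made up for this sketch, not part of unAPI itself.

```python
# A few name -> MIME type pairs copied from the list above.
# UNAPI_FORMATS and mime_type_for are hypothetical names for this sketch.
UNAPI_FORMATS = {
    "bibtex": "text/plain",
    "marcxml": "application/xml",
    "mods": "application/xml",
    "oai_dc": "application/xml",
    "rdf/xml": "application/rdf+xml",
    "ris": "text/plain",
}


def mime_type_for(name):
    """Return the MIME type registered for a format name, or None if unknown."""
    return UNAPI_FORMATS.get(name.lower())


def names_for(mime_type):
    """Return every format name that shares the given MIME type."""
    return sorted(n for n, t in UNAPI_FORMATS.items() if t == mime_type)


print(mime_type_for("mods"))
print(names_for("application/xml"))
```

Several names share the same MIME type, which is why the short name, rather than the type, has to carry the meaning.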

Friday, February 27, 2009

canonical URLs in the <link> tag

Having observed a few canonical URL solutions in the digital library world, such as Handle, PURL, and DOI, it is really nice to see canonical URLs from Google. Google partially solves the problem by letting you simply add a <link rel="canonical" href=""> tag to a page to specify your preferred version of its URL.
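
To make the idea concrete, here is a minimal sketch of how a client might read the preferred URL out of a page. It uses only Python's standard html.parser; the class name, the function name, and the example.org URL are invented for illustration, and this is not a description of how Google itself processes the tag.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def find_canonical(html_text):
    """Return the canonical URL declared in html_text, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    return parser.canonical


# Hypothetical page used only for this example.
page = (
    "<html><head>"
    '<link rel="canonical" href="http://example.org/item/1">'
    "</head><body></body></html>"
)
print(find_canonical(page))
```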

It always surprises me how elegant and simple a solution can be, and how long it takes to find these simple solutions.

Tuesday, May 22, 2007

canonical URL

Just came across these URLs


I was trying to compare the same WorldCat pages under different URLs, using Google's backward link search feature (

So and
are the same page; however, because of these URL variants, Google treats them as different URLs, and this certainly doesn't help move these pages up in Google search results, if we consider the inbound link factor in PageRank.

Friday, March 02, 2007

new xISBN API - beta

I have been working on a new xISBN service for the last few months, and I had a chance to share the latest progress at the excellent code4lib conference held in Athens, GA. Thanks for all your valuable comments, folks!

The new service is now running as a beta version. We have added more features and metadata while trying to keep the API simple and clean.

The x-identifier list is the venue for comments and suggestions.

Tuesday, February 27, 2007

things learned from SOLR api

I was fortunate to attend Erik Hatcher's one-day Lucene/Solr tutorial. What really attracted me is the elegance of Solr's REST API, which is both concise and consistent. To start with:

-- add/query/delete/commit are all simple REST calls.

As an example, deleting a record is a simple HTTP POST of "<delete><id>SP2514N</id></delete>". Do I need to say more?
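
As a rough sketch, the delete call could be prepared from Python like this. The host, port, and /solr/update path are assumptions about a local test install, and build_delete_request is a name made up here; the request is only built, not sent.

```python
import urllib.request
from xml.sax.saxutils import escape

# Assumption: a local Solr instance; adjust host, port, and path as needed.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_delete_request(record_id, update_url=SOLR_UPDATE_URL):
    """Build (but do not send) an HTTP POST that deletes one record by id."""
    body = "<delete><id>{}</id></delete>".format(escape(record_id))
    return urllib.request.Request(
        update_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


request = build_delete_request("SP2514N")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

The deletion only becomes visible after a commit, which is another POST of the same shape.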

-- allows different formats (but an equivalent infoset)
"select?q=ipod&wt=xml" -- returns XML format

They all return the same infoset.

-- the way of controlling returned fields in the query

q=video&fl=name,id (return only name and id fields)
q=video&fl=name,id,score (return relevancy score as well)
q=video&fl=*,score (return all stored fields, as well as relevancy score)

So the user has control over which fields are returned.
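
The query strings above are easy to assemble in code. The following is a small sketch using Python's standard urllib.parse; the base URL and the helper name build_select_url are assumptions for illustration only.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance; adjust host, port, and path as needed.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, fields=None, writer=None, base_url=SOLR_SELECT_URL):
    """Build a select URL with optional field list (fl) and response format (wt)."""
    params = {"q": query}
    if fields:
        params["fl"] = ",".join(fields)
    if writer:
        params["wt"] = writer
    # Keep ',' and '*' unescaped so the URL matches the examples above.
    return base_url + "?" + urlencode(params, safe=",*")


print(build_select_url("video", fields=["name", "id"]))
print(build_select_url("video", fields=["*", "score"]))
print(build_select_url("ipod", writer="xml"))
```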

-- no XML namespaces at all

This is open to debate, but I tend to think that in RPC-oriented applications XML namespaces don't really matter, while in document-oriented applications XML namespaces are important.

-- the way of organizing schema and results

Basically everything is a field, and a field has a datatype, a name, and a value, as in the following result:

<str name="id">IW-02</str>
<bool name="inStock">false</bool>
<str name="manu">Belkin</str>

This basically allows arbitrary fields without XML namespaces or schemas.
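
Because the element name is the datatype and the name attribute is the field name, a client can turn such a result into native values with very little code. Below is a minimal sketch with Python's standard ElementTree; the wrapping <doc> element, the handling of int and float, and the function names are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# The three fields shown above, wrapped in a <doc> element (an assumption)
# so that they form a well-formed XML fragment.
SAMPLE = """
<doc>
  <str name="id">IW-02</str>
  <bool name="inStock">false</bool>
  <str name="manu">Belkin</str>
</doc>
"""


def convert(tag, text):
    """Map the element name (the datatype) to a Python value."""
    if tag == "bool":
        return text == "true"
    if tag == "int":      # assumed numeric type, not shown in the sample
        return int(text)
    if tag == "float":    # assumed numeric type, not shown in the sample
        return float(text)
    return text           # str and anything unrecognized stay as text


def parse_doc(doc_xml):
    """Return {field name: typed value} for every field in one document."""
    root = ET.fromstring(doc_xml)
    return {
        field.get("name"): convert(field.tag, (field.text or "").strip())
        for field in root
    }


print(parse_doc(SAMPLE))
```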

Tuesday, January 30, 2007

google's moon shot

The New Yorker runs an article, "Google's Moon Shot", about the Google book project. It mentions that Google aims to scan at least as many books as there are WorldCat records (32M) in 10 years. It also mentions the love/hate relationship between publishers and Google, and seems optimistic that the copyright issue can be resolved one way or another. There are also interesting thoughts on networking books, similar to the way web pages are linked together by hyperlinks.

Ebooks have been here a while and we are still trying to figure out the right way. Maybe the work of Google, the Open Content Alliance, and others can make a big difference; we will see how this evolves in the next few years.

Saturday, January 20, 2007

microsoft live book search with fulltext download

I just had a chance to look at the Microsoft Live book search demo. It's still a small collection, so search results can be confusing. However, an entire book can be downloaded as a PDF file, such as this one:

I guess this is part of the Open Content Alliance deal. Google Print doesn't really allow downloading a full book.