Programming

Thursday, March 02, 2006

UTF-8 character ” (RIGHT DOUBLE QUOTATION MARK), or &#8221; or &rdquo; or UTF-8 e2809d.

It took me a while of staring at the screen to figure out this problem.

In one web page I am testing, the RIGHT DOUBLE QUOTATION MARK character (Unicode 0x201d) is used to quote attributes instead of the plain quotation mark (Unicode 0x22). The problem is that these two characters look terribly similar, and in a debugging tool you usually don't notice the difference.
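A quick way to spot this kind of look-alike is to list every non-ASCII character in the markup together with its position and Unicode name. The following is only a rough sketch in Python: the helper name `find_non_ascii` and the sample markup are made up for illustration, not taken from the page I was testing.

```python
import unicodedata

def find_non_ascii(text):
    """Return (line, column, char, code point, name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                # Fall back to "UNKNOWN" for characters without a Unicode name.
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, ch, f"U+{ord(ch):04X}", name))
    return hits

# Hypothetical markup: the attribute is closed with U+201D instead of U+0022.
sample = '<a href="index.html\u201d>home</a>'
for hit in find_non_ascii(sample):
    print(hit)
```

Anything this reports inside a tag is worth a second look, since attribute quotes should be plain ASCII.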

Anyhow, it took me an hour to figure out the problem, so it is probably worth recording the steps here.

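1) use "od -ax" to find the UTF-8 bytes of this character: e2809d. (Note that "od -x" prints two-byte words in host byte order, so on a little-endian machine the bytes appear swapped within each word; "od -t x1" prints them one byte at a time.)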

2) convert the UTF-8 bytes to the Unicode value 0x201d by following the formula in [http://en.wikipedia.org/wiki/UTF-8].

3) look up the Unicode value either in the "Character Map" tool on Red Hat, or on the web [http://www.fileformat.info/search/search.htm].
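Steps 2 and 3 can also be done in a few lines of code. Below is a minimal sketch in Python that applies the UTF-8 bit layout by hand to the bytes e2 80 9d and then asks the standard `unicodedata` module for the character name. The function name `decode_utf8_sequence` is my own, and the sketch does not reject overlong encodings or surrogates, so treat it as an illustration of the formula rather than a validating decoder.

```python
import unicodedata

def decode_utf8_sequence(raw):
    """Decode a single UTF-8 sequence (1 to 4 bytes) to a code point by hand."""
    lead = raw[0]
    if lead < 0x80:                # 0xxxxxxx: plain ASCII
        length, value = 1, lead
    elif lead >> 5 == 0b110:       # 110xxxxx: 2-byte sequence
        length, value = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:      # 1110xxxx: 3-byte sequence
        length, value = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:     # 11110xxx: 4-byte sequence
        length, value = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(raw) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in raw[1:]:
        if byte >> 6 != 0b10:      # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value

raw = bytes.fromhex("e2809d")      # the bytes reported by od
cp = decode_utf8_sequence(raw)
print(f"U+{cp:04X}", unicodedata.name(chr(cp)))
```

For e2 80 9d the lead byte contributes the bits 0010 and the two continuation bytes contribute 000000 and 011101, which together give 0x201d.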

1 Comments:

Anonymous said...: I find this character map very useful: Zvon Character Search.

Also, if you have the file on disk and you have perl handy, here's a one-liner that will spit out the line number in the source file, the offending byte sequence and the unicode codepoint (Normal Form C version) of each byte sequence:

perl -Mencoding=utf8 -MUnicode::Normalize -e 'while(<>){ $_=NFC($_);while(/([\x{0080}-\x{fffd}])/gc){ print "line $., $1 -> ".sprintf("\%0.4x",ord($1))."\n";}}' filename.ext

(The perl command above is all one line, in case the comment system breaks it up.)

Hope that can help you in the future. :); 7:02 PM
