读书人

XPCOM字符串操作(2)

发布时间: 2012-12-27 10:17:10 作者: rapoo

XPCOM字符串操作(二)

?

If the string is ASCII, will it be compared to, assigned to, or otherwise interact with non-ASCII strings??When assigning or comparing an 8-bit ASCII value (in)to a 16-bit UCS2 string, an "inflation" needs to happen at runtime. If your strings are small enough (say, less than 64 bytes) then it may make sense to store your string in a 16-bit unicode class as well, to avoid the extra conversion. The tradeoff is that your ASCII string takes up twice as much space as a 16-bit Unicode string than it would as an 8-bit string.Is the string usually ASCII, but needs to support unicode??If your string is most often ASCII but needs to be able to store Unicode characters, then UTF-8 may be the right encoding. ASCII characters will still be stored in 8-bit storage but other Unicode characters will take up 2 to 4 bytes. However if the string ever needs to be compared or assigned to a 16-bit string, a runtime conversion will be necessary.Are you storing large strings of non-ASCII data??Up until this point, UTF-8 might seem like the ideal encoding. The drawback is that for most non-European characters (such as Chinese, Indian and Japanese) in BMP, UTF-8 takes 50% more space than UTF-16. For characters in plane 1 and above, both UTF-8 and UTF-16 take 4 bytes.Do you need to manipulate the contents of a Unicode string??One problem with encoding Unicode characters in UTF-8 or other 8-bit storage formats is that the actual Unicode character can span multiple bytes in a string. In most encodings, the actual number of bytes varies from character to character. When you need to iterate over each character, you must take the encoding into account. This is vastly simplified when iterating 16-bit strings because each 16-bit code unit (PRUnichar) corresponds to a Unicode character as long as all characters are in BMP, which is often the case. However, you have to keep in mind that a single Unicode character in plane 1 and beyond is represented in two 16-bit code units in 16-bit strings so that the number of?PRUnichar's is?not?always equal to the number of Unicode characters. For the same reason, the position and the index in terms of 16-bit code units are not always the same as the position and the index in terms of Unicode characters.


To assist with ASCII, UTF-8, and UTF-16 conversions, there are some helper methods and classes. Some of these classes look like functions, because they are most often used as temporary objects on the stack.

For example, imagine a UTF-8 string where the first Unicode character of the string is represented with a 3-byte UTF-8 sequence, the "inflated" UTF-16 string will contain the 3?PRUnichar's instead of the single?PRUnichar?that represents the first character. These?PRUnichar's have nothing to do with the first Unicode character in the UTF-8 string.

读书人网 >编程

热点推荐