minidom cannot parse certain Unicode characters

Published: 2012-04-16 16:20:04 Author: rapoo

The problem: minidom fails to parse certain Unicode characters.

Python code
>>> c=u'\u0e42'
>>> c
u'\u0e42'
>>> print c
?
>>> from xml.dom import minidom
>>> xmlstring="<tag>"
>>> xmlstring+=c
>>> xmlstring+="</tag>"
>>> xmlstring
u'<tag>\u0e42</tag>'
>>> minidom.parseString(xmlstring)


This raises: UnicodeEncodeError: 'ascii' codec can't encode characters in position... .
I searched online and found advice to edit site.py under C:\Python26\Lib, changing the if 0 in this function to if 1:
Python code
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:

After making that change and restarting Python, I ran the same program. The error is now:
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    minidom.parseString(xmlstring)
  File "C:\Python26\lib\xml\dom\minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
  File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0e42' in position 5: character maps to <undefined>

Why does this happen? Is it a Python problem or a minidom problem? How can I fix it?

[Solution]
I'd advise against tampering with site.py. Since the parser wants a byte string, encode the text to UTF-8 yourself:
>>> s = u'<tag>\u0e42</tag>'.encode('utf-8')
>>> s
'<tag>\xe0\xb9\x82</tag>'
>>> from xml.dom.minidom import parseString
>>> doc = parseString(s)
>>> doc.documentElement.firstChild.data
u'\u0e42'
>>> from xml.etree.ElementTree import fromstring
>>> root = fromstring(s)
>>> root.text
u'\u0e42'
>>>
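
As a rough sketch of the same idea in reusable form: the helper below encodes a text string to UTF-8 before handing it to minidom, and a small check shows why the two codecs from the error messages (ascii and cp1252) cannot represent u'\u0e42' while UTF-8 can. The names can_encode and parse_text_xml are made up for illustration and are not part of minidom. The sketch assumes the XML carries no declaration naming a different encoding.

```python
from xml.dom.minidom import parseString

# Text string type: unicode on Python 2, str on Python 3.
TEXT_TYPE = type(u'')


def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def parse_text_xml(xml_text):
    """Hypothetical helper: encode text to UTF-8 bytes, then parse it.

    Assumes the document has no XML declaration that names another
    encoding; byte strings are passed through unchanged.
    """
    if isinstance(xml_text, TEXT_TYPE):
        xml_text = xml_text.encode('utf-8')
    return parseString(xml_text)


xml_text = u'<tag>\u0e42</tag>'

# Neither codec from the two tracebacks can represent the Thai character.
ascii_ok = can_encode(xml_text, 'ascii')
cp1252_ok = can_encode(xml_text, 'cp1252')
utf8_ok = can_encode(xml_text, 'utf-8')

doc = parse_text_xml(xml_text)
char = doc.documentElement.firstChild.data
```

Encoding explicitly at the call site keeps the choice of codec visible in the program, rather than depending on an interpreter-wide default that differs from machine to machine.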
