读书人

UTF-8 to Unicode conversion problem

Published: 2012-07-31 12:33:46 Author: rapoo

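UTF-8 to Unicode problem
I use the WinINet API to read the source of a UTF-8 website and convert it with MultiByteToWideChar, but a few characters still come out garbled. What is going on?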

C/C++ code
BOOL Task::ReadURL(const TCHAR *url, std::wstring &Source)
{
    HINTERNET hInternet = InternetOpen(NULL, INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if(hInternet)
    {
        HINTERNET hFile = InternetOpenUrl(hInternet, url, NULL, 0, INTERNET_FLAG_RELOAD, 0);
        if(hFile)
        {
            BYTE buff[1000];
            DWORD lpRead = 0;
            TCHAR tchar[1000];
            Source.clear();
            do
            {
                memset(buff, 0, sizeof(buff));
                memset(tchar, 0, sizeof(tchar));
                InternetReadFile(hFile, &buff, 999, &lpRead);
                buff[lpRead] = '\0';
                MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)buff, strlen((char *)buff), tchar, lpRead);
                Source += tchar;
            } while(lpRead);
            InternetCloseHandle(hFile);
            InternetCloseHandle(hInternet);
            return TRUE;
        }
        InternetCloseHandle(hInternet);
    }
    return FALSE;
}



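[Solution]
You need to convert the UTF-8 to Unicode (UTF-16, which is what wide strings hold on Windows), because the website is UTF-8 encoded. UTF-8 and UTF-16 are not the same thing: UTF-8 uses one to four bytes per character, while UTF-16 uses two or four. UTF-8 to Unicode conversion routines can be found online.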
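To make the variable width of UTF-8 concrete, here is a minimal sketch that reads the length of a sequence off its first byte. The function name `utf8_sequence_length` is made up for illustration and does not come from the thread or from any Windows API.

```cpp
#include <cstddef>

// Hypothetical helper: returns the number of bytes in the UTF-8 sequence
// that starts with `lead`, or 0 if `lead` cannot start a sequence
// (it is a continuation byte or a value UTF-8 never uses as a lead byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                   // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx (0xC0/0xC1 would be overlong)
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx: most CJK characters
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx: up to U+10FFFF
    return 0;                                    // 10xxxxxx or invalid
}
```

For example, the character 中 is encoded as the three bytes E4 B8 AD, so the lead byte 0xE4 yields 3.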
[Solution]
MultiByteToWideChar(CP_UTF8,0,(LPCSTR)buff,strlen((char *)buff),tchar,lpRead);

dwFlags
A set of bit flags that indicate whether to translate to precomposed or composite wide characters (if a composite form exists), whether to use glyph characters in place of control characters, and how to deal with invalid characters. You can specify a combination of the following flag constants:

Value: Meaning
MB_PRECOMPOSED: Always use precomposed characters, that is, characters in which a base character and a nonspacing character have a single character value. This is the default translation option. Cannot be used with MB_COMPOSITE.
MB_COMPOSITE: Always use composite characters, that is, characters in which a base character and a nonspacing character have different character values. Cannot be used with MB_PRECOMPOSED.
MB_ERR_INVALID_CHARS: If the function encounters an invalid input character, it fails and GetLastError returns ERROR_NO_UNICODE_TRANSLATION.
MB_USEGLYPHCHARS: Use glyph characters instead of control characters.


A composite character consists of a base character and a nonspacing character, each having different character values. A precomposed character has a single character value for a base/non-spacing character combination. In the character è, the e is the base character and the accent grave mark is the nonspacing character.

The function's default behavior is to translate to the precomposed form. If a precomposed form does not exist, the function attempts to translate to a composite form.

The flags MB_PRECOMPOSED and MB_COMPOSITE are mutually exclusive. The MB_USEGLYPHCHARS flag and the MB_ERR_INVALID_CHARS can be set regardless of the state of the other flags.
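The precomposed and composite forms described in the excerpt above can be seen directly in UTF-16 code units. This is only an illustrative sketch using standard C++11 string literals. The variable names are made up, and the composite form shown (e followed by U+0300, the combining grave accent) is the usual decomposition of è.

```cpp
#include <string>

// Precomposed: a single code unit, U+00E8 (LATIN SMALL LETTER E WITH GRAVE).
const std::u16string precomposed = u"\u00E8";

// Composite: base character 'e' followed by U+0300 (COMBINING GRAVE ACCENT).
const std::u16string composite = u"e\u0300";
```

Both render as è, but they differ in length and do not compare equal as strings, which matters when converted text is later searched or compared.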

[Solution]
Try MB_ERR_INVALID_CHARS first,
then call GetLastError to see what the problem (character) is.
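If that check reports ERROR_NO_UNICODE_TRANSLATION, one plausible cause (an assumption, the thread does not confirm it) is that the 999-byte reads in the posted loop cut a multi-byte UTF-8 sequence in half, so each half is converted on its own and turns into garbage. The sketch below finds how many bytes of a chunk end on a character boundary. The rest can be carried over and prepended to the next chunk. The name `utf8_complete_prefix` is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: returns how many leading bytes of data[0..len) end on
// a UTF-8 character boundary. Any bytes after that are the start of a
// sequence that continues in the next chunk and should be held back.
std::size_t utf8_complete_prefix(const unsigned char* data, std::size_t len) {
    // Step back over at most 3 trailing continuation bytes (10xxxxxx).
    std::size_t i = len;
    std::size_t back = 0;
    while (i > 0 && back < 3 && (data[i - 1] & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0) return len;  // no lead byte in sight; let the converter report it

    const unsigned char lead = data[i - 1];
    std::size_t need;
    if (lead < 0x80)                need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return len;  // malformed input; do not hold anything back

    const std::size_t have = len - (i - 1);  // bytes present for the last sequence
    return have < need ? i - 1 : len;
}
```

A simpler alternative is to append every chunk to one byte buffer and call MultiByteToWideChar a single time after the last read, which avoids the boundary issue entirely.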
[Solution]
The buffer may be too small. Handle it with dynamic allocation:

C/C++ code
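// First call: ask for the required size in WCHARs (includes the terminator because the input length is -1).
int WLength = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)buff, -1, NULL, 0);
LPWSTR tchar = (LPWSTR)_malloca((WLength + 1) * sizeof(WCHAR));
if (tchar)
{
    // Second call: convert into the allocated buffer (the original passed pszW, which is never declared).
    MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)buff, -1, tchar, WLength);
    tchar[WLength] = 0;
    Source += tchar;
    _freea(tchar);
}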
