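UTF-8 to Unicode conversion problem
I use the WinINet API to read the source of a UTF-8 website and convert it with MultiByteToWideChar, but a few characters still come out garbled. What is going on?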
- C/C++ code
BOOL Task::ReadURL(const TCHAR *url, std::wstring &Source)
{
    HINTERNET hInternet = InternetOpen(NULL, INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if (hInternet)
    {
        HINTERNET hFile = InternetOpenUrl(hInternet, url, NULL, 0, INTERNET_FLAG_RELOAD, 0);
        if (hFile)
        {
            BYTE buff[1000];
            DWORD lpRead = 0;
            TCHAR tchar[1000];
            Source.clear();
            do
            {
                memset(buff, 0, sizeof(buff));
                memset(tchar, 0, sizeof(tchar));
                InternetReadFile(hFile, &buff, 999, &lpRead);
                buff[lpRead] = '\0';
                MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)buff, strlen((char *)buff), tchar, lpRead);
                Source += tchar;
            } while (lpRead);
            InternetCloseHandle(hFile);
            InternetCloseHandle(hInternet);
            return TRUE;
        }
        InternetCloseHandle(hInternet);
    }
    return FALSE;
}
[Solution]
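You need to convert the UTF-8 to Unicode, because the website is encoded in UTF-8. UTF-8 and Unicode (on Windows, UTF-16) are not the same thing: UTF-8 uses 1 to 4 bytes per character, while UTF-16 uses 2 or 4 bytes. UTF-8 to Unicode conversion routines can be found online.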
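To make the difference between the two encodings concrete, here is a minimal sketch of a UTF-8 to UTF-16 decoder in portable C++. UTF-8 spends 1 to 4 bytes per code point, and UTF-16 (what the Windows API calls Unicode) spends one or two 16-bit units. The function name `utf8ToUtf16` is made up for illustration and is not a Windows API. `std::u16string` is used instead of `std::wstring` only so the sketch behaves the same on every platform. The sketch does not reject overlong forms or encoded surrogates, so real code should prefer MultiByteToWideChar or a maintained library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not a Windows API: decode UTF-8 bytes into UTF-16 code units.
// Malformed or truncated sequences become U+FFFD. Overlong forms and encoded
// surrogates are NOT rejected, to keep the sketch short.
std::u16string utf8ToUtf16(const std::string& in) {
    const char16_t kReplacement = 0xFFFD;
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // total bytes announced by the lead byte
        char32_t cp = 0;       // code point being assembled
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else { out.push_back(kReplacement); ++i; continue; }  // stray continuation or bad lead

        bool ok = (i + len <= in.size());   // false if the sequence is cut off
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { out.push_back(kReplacement); ++i; continue; }

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {               // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

Note what happens when the input stops in the middle of a character: the partial sequence decodes to replacement characters. That is the kind of damage a fixed-size read buffer can cause when each chunk is converted on its own.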
[Solution]
MultiByteToWideChar(CP_UTF8,0,(LPCSTR)buff,strlen((char *)buff),tchar,lpRead);
dwFlags
A set of bit flags that indicate whether to translate to precomposed or composite wide characters (if a composite form exists), whether to use glyph characters in place of control characters, and how to deal with invalid characters. You can specify a combination of the following flag constants:
Value    Meaning
MB_PRECOMPOSED Always use precomposed characters — that is, characters in which a base character and a nonspacing character have a single character value. This is the default translation option. Cannot be used with MB_COMPOSITE.
MB_COMPOSITE Always use composite characters — that is, characters in which a base character and a nonspacing character have different character values. Cannot be used with MB_PRECOMPOSED.
MB_ERR_INVALID_CHARS If the function encounters an invalid input character, it fails and GetLastError returns ERROR_NO_UNICODE_TRANSLATION.
MB_USEGLYPHCHARS Use glyph characters instead of control characters.
A composite character consists of a base character and a nonspacing character, each having different character values. A precomposed character has a single character value for a base/non-spacing character combination. In the character è, the e is the base character and the accent grave mark is the nonspacing character.
The function's default behavior is to translate to the precomposed form. If a precomposed form does not exist, the function attempts to translate to a composite form.
The flags MB_PRECOMPOSED and MB_COMPOSITE are mutually exclusive. The MB_USEGLYPHCHARS flag and the MB_ERR_INVALID_CHARS can be set regardless of the state of the other flags.
[Solution]
First try passing MB_ERR_INVALID_CHARS,
then call GetLastError to see what the problem (character) is.
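With MB_ERR_INVALID_CHARS set, a failed conversion only reports ERROR_NO_UNICODE_TRANSLATION. It does not say which byte was at fault. To locate the offending position, a small scanner like the one below can be run over the same buffer. This is a sketch in portable C++. The name `firstInvalidUtf8` is hypothetical and not part of the Windows API, and only the lead/continuation byte structure is checked, not overlong forms.

```cpp
#include <cstddef>
#include <string>

// Hypothetical diagnostic helper (portable C++, not part of the Windows API).
// Returns the offset of the first sequence that is structurally broken, or
// std::string::npos when every sequence is well formed.
std::size_t firstInvalidUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;                    // bytes announced by the lead byte
        if (lead < 0x80)                len = 1;
        else if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;
        else return i;                          // stray continuation or invalid lead byte
        if (i + len > s.size()) return i;       // sequence cut off at the end of the buffer
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return i;   // expected a 10xxxxxx continuation byte
        }
        i += len;
    }
    return std::string::npos;
}
```

If the reported offset always falls in the last one to three bytes of a chunk, the data itself is probably fine and the character is simply being split between two reads.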
[Solution]
The buffer may be too small. Handle it with dynamic allocation:
- C/C++ code
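// _malloca and _freea are declared in <malloc.h>.
// First call: ask for the required size in WCHARs. With cbMultiByte == -1 the
// input must be null-terminated and the returned count includes the terminator.
int WLength = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)buff, -1, NULL, 0);
if (WLength > 0)
{
    LPWSTR tchar = (LPWSTR)_malloca(WLength * sizeof(WCHAR));
    if (tchar)
    {
        // Second call: convert into the buffer (the original passed an undeclared pszW here).
        if (MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)buff, -1, tchar, WLength) > 0)
            Source += tchar;   // already null-terminated by the conversion
        _freea(tchar);
    }
}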
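A larger output buffer fixes truncation. If the garbling instead comes from a multibyte character being cut at the 999-byte read boundary, each chunk still ends with an incomplete sequence, and the buffer size does not matter. One way to handle that case is to convert only the bytes that form whole sequences and carry the remainder over to the next read. The helper below is a sketch in portable C++. The name `utf8CompletePrefix` is hypothetical, and the helper assumes the stream is otherwise valid UTF-8.

```cpp
#include <cstddef>

// Hypothetical helper: given n bytes of UTF-8 read so far, return how many of
// them form whole sequences. The remaining 0 to 3 bytes are the start of a
// character whose tail has not arrived yet and should be kept for the next read.
std::size_t utf8CompletePrefix(const unsigned char* data, std::size_t n) {
    // Walk back over at most 3 trailing continuation bytes (10xxxxxx).
    std::size_t i = n;
    std::size_t back = 0;
    while (i > 0 && back < 3 && (data[i - 1] & 0xC0) == 0x80) { --i; ++back; }
    if (i == 0) return n;                  // no lead byte found; leave it to the converter

    unsigned char lead = data[i - 1];
    std::size_t need = 0;                  // sequence length announced by the lead byte
    if (lead < 0x80)                need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return n;                         // invalid lead byte; pass through unchanged

    std::size_t have = n - (i - 1);        // bytes present from the lead to the end
    return (have < need) ? (i - 1) : n;    // hold back the unfinished character
}
```

In the read loop from the question, this would mean converting only the first `utf8CompletePrefix(buff, total)` bytes and moving the leftover bytes to the front of `buff` before the next InternetReadFile call. A simpler alternative is to append every chunk to a `std::string` and call MultiByteToWideChar once after the loop ends.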
[Solution]