A difficult web-scraping problem

Posted: 2012-03-19 22:03:04 Author: rapoo

A hard web-scraping problem; any advice would be appreciated.
The main code of the HttpHelper class is as follows:

C# code
        private CookieContainer cc;
        private string contentType = "application/x-www-form-urlencoded";
        private string accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/x-silverlight, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-ms-application, application/x-ms-xbap, application/vnd.ms-xpsdocument, application/xaml+xml, application/x-silverlight-2-b1, */*";
        private string userAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)";
        private Encoding encoding = Encoding.GetEncoding("gb2312");

        public string GetHtml(string url, CookieContainer cookieContainer)
        {
            HttpWebRequest httpWebRequest;
            httpWebRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            httpWebRequest.CookieContainer = cookieContainer;
            httpWebRequest.ContentType = contentType;
            httpWebRequest.Referer = url;
            httpWebRequest.Accept = accept;
            httpWebRequest.UserAgent = userAgent;
            httpWebRequest.Method = "GET";

            HttpWebResponse httpWebResponse;
            httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();
            Stream responseStream = httpWebResponse.GetResponseStream();
            StreamReader streamReader = new StreamReader(responseStream, encoding);
            string html = streamReader.ReadToEnd();
            streamReader.Close();
            responseStream.Close();
            return html;
        }


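The code that calls this method is as follows:
C# code
            HttpHelper helper = new HttpHelper();
            // GetHtml takes two arguments, so a CookieContainer is passed here as well.
            string ss = helper.GetHtml("http://bill.finance.sina.com.cn/bill/detail.php?stock_code=sh600550&bill_size=40000", new CookieContainer());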


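The page I want to scrape is http://bill.finance.sina.com.cn/bill/detail.php?stock_code=sh600550&bill_size=40000
If I fetch http://www.sina.com.cn instead, there is no problem at all.
Fetching the page above fails, though, so that page is presumably applying some restriction or check. Could someone who knows this area take a look?
Thanks!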

[Solution]
Just use this method of mine! I have tested it!

public string gethtml(string url)
{
    string text2 = "";
    WebClient client1 = new WebClient();
    try
    {
        byte[] buffer1 = client1.DownloadData(url);

        string text1 = Encoding.Default.GetString(buffer1);
        text2 = text1;
    }
    catch
    {
        text2 = null;
    }
    return text2;
}


[Solution]
http://blog.csdn.net/jiang_jiajia10/archive/2008/11/18/3325407.aspx
[Solution]
The page is compressed with deflate.

System.IO.Compression.DeflateStream responseStream = new System.IO.Compression.DeflateStream(httpWebResponse.GetResponseStream(), System.IO.Compression.CompressionMode.Decompress);
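
A slightly more defensive variant of this idea checks the response's Content-Encoding header before choosing a decompressor, so that uncompressed pages (such as http://www.sina.com.cn) are still read correctly. The following is only a minimal sketch and not code from this thread: `BodyDecoder` and `ReadBody` are hypothetical names, and it assumes the server reports its compression in the header and sends raw deflate data, which is the format `DeflateStream` reads.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper; not part of the original HttpHelper class.
public static class BodyDecoder
{
    // Wraps the raw response stream in the decompressor named by the
    // Content-Encoding header (if any), then reads the body as text.
    public static string ReadBody(Stream raw, string contentEncoding, Encoding encoding)
    {
        Stream body = raw;
        string ce = (contentEncoding ?? "").Trim().ToLowerInvariant();

        if (ce == "deflate")
        {
            body = new DeflateStream(raw, CompressionMode.Decompress);
        }
        else if (ce == "gzip")
        {
            body = new GZipStream(raw, CompressionMode.Decompress);
        }
        // Any other value (including empty) is treated as uncompressed.

        using (StreamReader reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Inside GetHtml this could be called as `BodyDecoder.ReadBody(httpWebResponse.GetResponseStream(), httpWebResponse.ContentEncoding, encoding)` in place of constructing the StreamReader directly.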

*****************************************************************************
Try CSDN Reader, a dedicated reader for the CSDN forum (full source code included)

http://feiyun0112.cnblogs.com/
[Solution]
See here. This is the method root_ gave me: http://topic.csdn.net/u/20081215/23/28f9ae30-2fa4-4b8d-8f84-710b4b5ddb6e.html
[Solution]
Right, that is the one: decompress the stream first, then hand it to the StreamReader.
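
Another way to get the same effect, without wrapping the stream by hand, is the `AutomaticDecompression` property on `HttpWebRequest`, which asks the framework to decompress gzip or deflate responses before the body is read. The sketch below is an illustration under assumptions rather than code from this thread: `AutoDecompressExample`, `CreateRequest`, and this `GetHtml` overload are hypothetical names, and whether the target page decodes cleanly this way has not been verified here.

```csharp
using System.IO;
using System.Net;
using System.Text;

// Hypothetical example class; not part of the original HttpHelper.
public static class AutoDecompressExample
{
    // Builds a GET request that lets the framework undo gzip/deflate
    // compression transparently.
    public static HttpWebRequest CreateRequest(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }

    // Fetches the page and decodes it with the given text encoding.
    public static string GetHtml(string url, Encoding encoding)
    {
        HttpWebRequest request = CreateRequest(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this in place the StreamReader sees already-decompressed bytes, so the rest of the reading code stays the same as for an uncompressed page.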
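[Solution]
I'm not sure what exactly you want to scrape.

You could write a BHO (Browser Helper Object) plugin and process the page after the browser has converted it to HTML internally.

That gives you access to any element and any collection of data.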
