读书人

Data scraping: regex matching of nested divs

Published: 2013-09-05 16:02:07 Author: rapoo

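Region to match: <div id="hot">
HTML content; this region contains nested child divs
</div>
Goal: extract the URL and the Chinese link text of every hyperlink inside the div with id=hot.
Page to scrape: http://news.qq.com/
Function that fetches the HTML: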


// Requires: System.IO, System.IO.Compression, System.Net, System.Text, System.Text.RegularExpressions
public static string GetContent(string url, string regStr)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    req.Method = "GET";
    using (HttpWebResponse wsp = (HttpWebResponse)req.GetResponse())
    {
        Stream st = wsp.GetResponseStream();
        // Decompress manually if the server sent a gzip-encoded body.
        if ((wsp.ContentEncoding ?? "").ToLower().Contains("gzip"))
        {
            st = new GZipStream(st, CompressionMode.Decompress);
        }
        // Encoding.Default is the system ANSI code page; it must match the page's charset.
        using (StreamReader sr = new StreamReader(st, Encoding.Default))
        {
            string value = sr.ReadToEnd();
            // Return the first capture group of the first match, or "" if nothing matched.
            Match m = Regex.Match(value, regStr);
            return m.Success ? m.Groups[1].Value : string.Empty;
        }
    }
}
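
One thing to watch in the function above is the hard-coded Encoding.Default: if the page's charset differs from the machine's code page, the Chinese link text will come out garbled. Below is a minimal sketch of choosing the encoding from a charset name instead. CharsetHelper and Resolve are hypothetical names introduced only for illustration, and the sketch assumes the charset name is taken from something like HttpWebResponse.CharacterSet.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper (not from the thread): maps a charset name to an
    // Encoding and falls back when the name is missing or not recognised.
    public static Encoding Resolve(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset)) return fallback;
        try
        {
            // Some servers wrap the charset value in quotes; strip them first.
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            return fallback; // unknown or unsupported charset name
        }
    }
}
```

Possible usage inside GetContent: new StreamReader(st, CharsetHelper.Resolve(wsp.CharacterSet, Encoding.Default)). On .NET Framework, CharacterSet may report ISO-8859-1 when the response header names no charset, so that value is worth double-checking against the page's meta tag.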



Waiting online; I'll award points when I close the thread.


[Solution]
It's easiest with HtmlAgilityPack:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html); // html: the scraped page source as a string
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection anchors = htmlDoc.DocumentNode.SelectNodes(@"//div[@id='hot']//a[@href]");
if (anchors != null)
{
    foreach (HtmlNode anchor in anchors)
    {
        Response.Write(anchor.Attributes["href"].Value + "<br/>");
        Response.Write(anchor.InnerText + "<br/><br/>");
    }
}
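
The href values collected this way may be relative (for example /a/1.htm), while the question asks for link addresses. As a hedged sketch, they can be resolved against the page URL with System.Uri. LinkResolver and ToAbsolute are hypothetical names used only for illustration, and the paths in the usage line are made-up examples.

```csharp
using System;

public static class LinkResolver
{
    // Hypothetical helper (not from the thread): resolves href against the
    // URL of the page it was scraped from. Absolute hrefs pass through;
    // if the combination cannot be parsed, the original href is returned.
    public static string ToAbsolute(string pageUrl, string href)
    {
        Uri result;
        return Uri.TryCreate(new Uri(pageUrl), href, out result)
            ? result.AbsoluteUri
            : href;
    }
}
```

For instance, LinkResolver.ToAbsolute("http://news.qq.com/", anchor.Attributes["href"].Value) could replace the raw attribute value in the loop above.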


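[Solution]
// Balancing-group pattern: matches the inner HTML of the div with id=ShowPhoto,
// counting nested <div>...</div> pairs; adapt the id part to the div you need.
Regex reg = new Regex(@"(?is)(?<=<div\b(?:(?!id=).)*id=ShowPhoto[^>]*>)(?><div[^>]*>(?<o>)|</div>(?<-o>)|(?:(?!</?div\b).)*)*(?(o)(?!))(?=</div>)");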




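string input = @"<div id=""a"">AAA<div id=""b"">BB<div id=""c"">CCC</div> B</div> </div> ";
string id = Console.ReadLine(); // id of the div to extract; enter G to quit
while (id != null && !id.Trim().Equals("G", StringComparison.OrdinalIgnoreCase))
{
    // 'Open' is pushed for each nested <div> and popped for each </div>;
    // (?(Open)(?!)) rejects the match if any nested <div> is still unclosed.
    string pattern = @"<div id=""" + Regex.Escape(id) + @""">[^<>]*(((?'Open'<div[^>]*>)[^<>]*)+((?'-Open'</div>)[^<>]*)+)*(?(Open)(?!))</div>";
    Console.WriteLine(Regex.Match(input, pattern));
    id = Console.ReadLine();
}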
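
For the original question (links inside the div with id=hot), the balancing-group idea can be combined with a second pattern for the anchors. The following is only a sketch under stated assumptions: the class name HotLinks and the group names body, url and text are illustrative and do not come from the thread, and the patterns assume href and id values are either unquoted or quoted with " or '. Regexes like these remain fragile on real-world HTML, which is why a parser is the safer route.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HotLinks
{
    // Opening tag of <div id="hot">, then a balanced run of nested divs
    // (stack 'o' is pushed on <div> and popped on </div>), then the closing tag.
    private static readonly Regex HotDiv = new Regex(
        @"(?is)<div\b[^>]*\sid=[""']?hot[""']?(?=[\s>])[^>]*>" +
        @"(?<body>(?><div\b[^>]*>(?<o>)|</div>(?<-o>)|(?:(?!</?div\b).)*)*)" +
        @"(?(o)(?!))</div>");

    // One <a ... href=...>text</a>; assumes the href value contains no spaces.
    private static readonly Regex Anchor = new Regex(
        @"(?is)<a\b[^>]*?\bhref=[""']?(?<url>[^""'\s>]+)[^>]*>(?<text>.*?)</a>");

    // Returns (url, text) pairs for every link inside the hot div,
    // or an empty list if the div is not found.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        Match div = HotDiv.Match(html);
        if (!div.Success) return links;

        foreach (Match a in Anchor.Matches(div.Groups["body"].Value))
        {
            // Drop any tags nested inside the link text (e.g. <b>, <span>).
            string text = Regex.Replace(a.Groups["text"].Value, "<[^>]+>", "").Trim();
            links.Add(new KeyValuePair<string, string>(a.Groups["url"].Value, text));
        }
        return links;
    }
}
```

Calling HotLinks.Extract on the full page source yields the pairs to print; links in sibling divs after the hot div are excluded because the balanced match stops at the hot div's own closing tag.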


[Solution]
Quote:

It's easiest with HtmlAgilityPack:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml("the scraped string");
HtmlNodeCollection anchors = htmlDoc.DocumentNode.SelectNodes(@"//……

Search online for the HtmlAgilityPack DLL; it has everything you need.
[Solution]
Quote:

It's easiest with HtmlAgilityPack:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml("the scraped string");
HtmlNodeCollection anchors = htmlDoc.DocumentNode.SelectNodes(@"//……

Search online for the HtmlAgilityPack DLL; everything you'll be using is in there.
[Solution]
http://www.codeplex.com/htmlagilitypack/
