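Extracting content from a fetched page
I have fetched a page and narrowed its content down to the table below. Two questions:
1) How do I extract each pair of "英文名称" (English name) and "中文名称" (Chinese name)?
2) How do I extract the page numbers (page="xx")?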
<TABLE WIDTH= "100% " BORDER= "0 " CELLPADDING= "0 " CELLSPACING= "5 ">
<TR>
<TD VALIGN= "top "> <TABLE WIDTH= "100% " BORDER= "0 " CELLPADDING= "0 " CELLSPACING= "5 ">
<TR>
<TD WIDTH= "4% "> <IMG SRC= "image/retail.gif " WIDTH= "18 " HEIGHT= "14 "> </TD>
<TD WIDTH= "96% "> 英文名称: <span style= "font-family:Verdana, Arial, Helvetica, sans-serif "> Acid c <font color=red> ya </font> nine </span>
</TD>
</TR>
<TR>
<TD> </TD>
<TD> 中文名称:酸性花青[染料] </TD>
</TR>
<TR>
<TD> </TD>
<TD> </TD>
</TR>
<TR>
<TD WIDTH= "4% "> <IMG SRC= "image/retail.gif " WIDTH= "18 " HEIGHT= "14 "> </TD>
<TD WIDTH= "96% "> 英文名称: <span style= "font-family:Verdana, Arial, Helvetica, sans-serif "> Actual count(=actual <font color=red> ya </font> rn count) </span>
</TD>
</TR>
<TR>
<TD> </TD>
<TD> 中文名称:实际支数,实际纱支 </TD>
</TR>
<TR>
<TD> </TD>
<TD> </TD>
</TR>
<TR>
<TD WIDTH= "4% "> <IMG SRC= "image/retail.gif " WIDTH= "18 " HEIGHT= "14 "> </TD>
<TD WIDTH= "96% "> 英文名称: <span style= "font-family:Verdana, Arial, Helvetica, sans-serif "> actual <font color=red> ya </font> rn count </span>
</TD>
</TR>
<TR>
<TD> </TD>
<TD> 中文名称:实际纱支 </TD>
</TR>
<TR>
<TD> </TD>
<TD> </TD>
</TR>
<TR>
<TD WIDTH= "4% "> <IMG SRC= "image/retail.gif " WIDTH= "18 " HEIGHT= "14 "> </TD>
<TD WIDTH= "96% "> 英文名称: <span style= "font-family:Verdana, Arial, Helvetica, sans-serif "> Aerated <font color=red> ya </font> rn </span>
</TD>
</TR>
<TR>
<TD> </TD>
<TD> 中文名称:气泡纱[由含有气泡的纤维构成] </TD>
</TR>
<TR>
<TD> </TD>
<TD> </TD>
</TR>
<TR>
<TD WIDTH= "4% "> <IMG SRC= "image/retail.gif " WIDTH= "18 " HEIGHT= "14 "> </TD>
<TD WIDTH= "96% "> 英文名称: <span style= "font-family:Verdana, Arial, Helvetica, sans-serif "> agfa pol <font color=red> ya </font> mide </span>
</TD>
</TR>
<TR>
<TD> </TD>
<TD> 中文名称:阿克发聚酰胺 </TD>
</TR>
<TR>
<TD> </TD>
<TD> </TD>
</TR>
<TR>
<TD WIDTH= "4% "> <IMG SRC= "image/retail.gif " WIDTH= "18 " HEIGHT= "14 "> </TD>
<TD WIDTH= "96% "> 英文名称: <span style= "font-family:Verdana, Arial, Helvetica, sans-serif "> air entangled <font color=red> ya </font> rn </span>
</TD>
</TR>
<TR>
<TD> </TD>
<TD> 中文名称:气流喷射交缠丝《化纤》 </TD>
</TR>
<TR>
<TD> </TD>
<TD> </TD>
</TR>
<TR>
<TD WIDTH= "4% "> <IMG SRC= "image/retail.gif " WIDTH= "18 " HEIGHT= "14 "> </TD>
<TD WIDTH= "96% "> 英文名称: <span style= "font-family:Verdana, Arial, Helvetica, sans-serif "> Air jet bulky <font color=red> ya </font> rn </span>
</TD>
</TR>
<TR>
<TD> </TD>
<TD> 中文名称:喷气(法)膨松变形丝 </TD>
</TR>
<TR>
<TD> </TD>
<TD> </TD>
</TR>
<TR>
<TD WIDTH= "4% "> <IMG SRC= "image/retail.gif " WIDTH= "18 " HEIGHT= "14 "> </TD>
<TD WIDTH= "96% "> 英文名称: <span style= "font-family:Verdana, Arial, Helvetica, sans-serif "> air-bulked <font color=red> ya </font> rn </span>
</TD>
</TR>
<TR>
<TD> </TD>
<TD> 中文名称:喷气膨化纱 </TD>
</TR>
<TR>
<TD> </TD>
<TD> </TD>
</TR>
<TR>
<TD WIDTH= "4% "> <IMG SRC= "image/retail.gif " WIDTH= "18 " HEIGHT= "14 "> </TD>
<TD WIDTH= "96% "> 英文名称: <span style= "font-family:Verdana, Arial, Helvetica, sans-serif "> air-buo <font color=red> ya </font> ncy force </span>
</TD>
</TR>
<TR>
<TD> </TD>
<TD> 中文名称:空气浮力 </TD>
</TR>
<TR>
<TD> </TD>
<TD> </TD>
</TR>
</TABLE>
<center>
[上一页] <font color=red > [1] </font> <a href= '/dictionary/result.asp?page=2&Blur=1&Keyword=ya ' > [2] </a> <a href= '/dictionary/result.asp?page=3&Blur=1&Keyword=ya ' > [3] </a> <a href= '/dictionary/result.asp?page=4&Blur=1&Keyword=ya ' > [4] </a> <a href= '/dictionary/result.asp?page=5&Blur=1&Keyword=ya ' > [5] </a> <a href= '/dictionary/result.asp?page=6&Blur=1&Keyword=ya ' > [6] </a> <a href= '/dictionary/result.asp?page=7&Blur=1&Keyword=ya ' > [7] </a> <a href= '/dictionary/result.asp?page=8&Blur=1&Keyword=ya ' > [8] </a> <a href= '/dictionary/result.asp?page=9&Blur=1&Keyword=ya ' > [9] </a> <a href= '/dictionary/result.asp?page=10&Blur=1&Keyword=ya ' > [10] </a> <a href= '/dictionary/result.asp?page=2&Blur=1&Keyword=ya ' > [下一页] </a> </center>
</TD>
</TR>
</TABLE>
[Solution]
Debug output under VS 2003:
英文名称:Acidcyanine中文名称:酸性花青[染料]
英文名称:Actualcount(=actualyarncount)中文名称:实际支数,实际纱支
英文名称:actualyarncount中文名称:实际纱支
英文名称:Aeratedyarn中文名称:气泡纱[由含有气泡的纤维构成]
英文名称:agfapolyamide中文名称:阿克发聚酰胺
英文名称:airentangledyarn中文名称:气流喷射交缠丝《化纤》
英文名称:Airjetbulkyyarn中文名称:喷气(法)膨松变形丝
英文名称:air-bulkedyarn中文名称:喷气膨化纱
英文名称:air-buoyancyforce中文名称:空气浮力
------------------------------------------
The form (Form1) contains one button1 and one textBox1.
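// Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
private void button1_Click(object sender, System.EventArgs e)
{
    // 1.txt holds the captured HTML table shown above.
    string str1;
    using (StreamReader f1 = new StreamReader("1.txt", Encoding.Default))
    {
        str1 = f1.ReadToEnd();
    }
    this.textBox1.Text = str1;
    // Each match runs from a <TR> containing 英文名称 through the next <TR> containing 中文名称.
    MatchCollection str2 = Regex.Matches(str1, "<TR>.*?英文名称.*?</TR>.*?<TR>.*?中文名称.*?</TR>", RegexOptions.Singleline);
    MessageBox.Show("Found " + str2.Count.ToString() + " matches");
    this.textBox1.Text = "";
    foreach (Match str in str2)
    {
        // Strip whitespace, tags, and (presumably, the original was garbled here) &nbsp; entities.
        string str3 = Regex.Replace(str.ToString(), "(\\s)|(<.*?>)|(&nbsp;)", "", RegexOptions.Singleline);
        this.textBox1.Text += str3 + "\r\n";
    }
}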
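The handler above returns each pair as one concatenated string with all whitespace removed. If the two names are needed as separate values, named capture groups are one option. The following is only a sketch: the class name NamePairs and the helpers Clean and Extract are hypothetical, the sample string is a shortened version of the table, and it assumes a compiler newer than VS 2003 (generics and var).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamePairs
{
    // Captures the cell text after each label. Singleline lets '.' cross line breaks.
    // [::] accepts both the ASCII and the full-width colon (an assumption about the page).
    private static readonly Regex PairPattern = new Regex(
        @"英文名称[::](?<en>.*?)</TD>.*?中文名称[::](?<zh>.*?)</TD>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes tags and &nbsp; entities, then collapses runs of whitespace to one space.
    private static string Clean(string cell)
    {
        string noTags = Regex.Replace(cell, @"<.*?>|&nbsp;", "", RegexOptions.Singleline);
        return Regex.Replace(noTags, @"\s+", " ").Trim();
    }

    // Returns (English name, Chinese name) pairs in document order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in PairPattern.Matches(html))
        {
            pairs.Add(new KeyValuePair<string, string>(
                Clean(m.Groups["en"].Value), Clean(m.Groups["zh"].Value)));
        }
        return pairs;
    }

    public static void Main()
    {
        // Shortened sample row; the real input would be the contents of 1.txt.
        string html =
            "<TR><TD> 英文名称: <span> Aerated <font color=red>ya</font>rn </span>\n</TD></TR>"
            + "<TR><TD> </TD><TD> 中文名称:气泡纱 </TD></TR>";
        foreach (var p in Extract(html))
        {
            Console.WriteLine(p.Key + " => " + p.Value);
        }
    }
}
```

Unlike the handler above, this keeps the spaces between words. If the real page has spaces around the highlighted keyword inside the font tag, as in the pasted table, they will survive in the English name and may need extra handling.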
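[Solution]
How should the regex for this be written:
<a href= '/dictionary/result.asp?page=POSITIVE_INTEGER&Blur=1&Keyword=ANY_CHARACTERS ' > [POSITIVE_INTEGER] </a>
How are the positive integer and the arbitrary characters expressed?
Use \d+ for the positive integer and .*? for the arbitrary characters:
<a\s+href=\s*'/dictionary/result\.asp\?page=(\d+)&Blur=1&Keyword=.*?'\s*>\s*\[(\d+)\]\s*</a>
Mind the escaping: the literal ".", "?", "[" and "]" must be written as \. \? \[ \] in the pattern, and inside a regular (non-verbatim) C# string each backslash must be doubled. Write (\d+) rather than (\d)+? so that the group captures the whole number and not a single digit. The \s* parts tolerate the stray spaces seen in the sample.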
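With the escaping applied, collecting the page numbers might look like the sketch below. The class name PageLinks and the method ExtractPages are hypothetical, the sample string is a shortened version of the pager line from the question, and the code assumes .NET 4 or later rather than VS 2003.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageLinks
{
    // '.', '?', '[' and ']' are escaped so they match literally.
    // \s* tolerates the stray spaces that appear in the pasted HTML.
    private static readonly Regex LinkPattern = new Regex(
        @"<a\s+href=\s*'/dictionary/result\.asp\?page=(?<page>\d+)&Blur=1&Keyword=[^']*'\s*>"
        + @"\s*\[(?<label>\d+)\]\s*</a>",
        RegexOptions.IgnoreCase);

    // Returns the page numbers of all links whose label is a number.
    public static List<int> ExtractPages(string html)
    {
        var pages = new List<int>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            pages.Add(int.Parse(m.Groups["page"].Value));
        }
        return pages;
    }

    public static void Main()
    {
        // Shortened pager sample: current page, two numbered links, and the "next page" link.
        string html = "[上一页] <font color=red > [1] </font> "
            + "<a href= '/dictionary/result.asp?page=2&Blur=1&Keyword=ya ' > [2] </a> "
            + "<a href= '/dictionary/result.asp?page=3&Blur=1&Keyword=ya ' > [3] </a> "
            + "<a href= '/dictionary/result.asp?page=2&Blur=1&Keyword=ya ' > [下一页] </a>";
        Console.WriteLine(string.Join(",", ExtractPages(html)));
    }
}
```

On this sample the program prints 2,3. The current page [1] is not a link and the [下一页] link has a non-numeric label, so neither is matched. The largest returned value is the last page shown in the pager, which is not necessarily the total page count.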