Learning Steps: How a Web Crawler (Spider) Is Implemented in Java (Reprint)
connection.getContentType() );
return;
}
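First, a “URLConnection” object is constructed for the URL stored in the variable url that is passed in. A Web site holds many types of documents, but the “spider” is only interested in those that contain HTML or, more generally, text. The preceding code makes sure that the document's content type begins with “text/”. If the document is not text, the URL is removed from the waiting list and added to the processed list, which guarantees that it will not be visited again. Once the connection to a particular URL has been set up, the next step is to parse its content. The following code opens the URL connection and reads the content: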
InputStream is = connection.getInputStream();
Reader r = new InputStreamReader(is);
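One detail the snippet above leaves implicit: “new InputStreamReader(is)” decodes the bytes with the platform default charset, which may not match the page. The sketch below is one possible way to combine the “text/” test with a charset lookup taken from the Content-Type header. The class name ContentTypeSketch, its two helper methods, and the UTF-8 fallback are assumptions made for illustration; they are not part of the original Spider.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeSketch {

    // Mirrors the article's test: a missing header is let through,
    // anything else must start with "text/".
    static boolean isText(String contentType) {
        return contentType == null
            || contentType.toLowerCase(Locale.ROOT).startsWith("text/");
    }

    // Looks for a "charset=" parameter, e.g. in "text/html; charset=ISO-8859-1".
    // Falls back to UTF-8 (an assumption) when none is given or the name is unknown.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                                   .replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported charset name
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

With such a helper, the reader could be created as new InputStreamReader(is, ContentTypeSketch.charsetOf(connection.getContentType())).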
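Now we have a Reader object that can be used to read the contents of the URL. For the “spider” in this article, it is enough to pass that content to an HTML parser. The parser used in this example is the Swing HTML parser, which ships with Java. Java does not expose this parser very conveniently, however: to get at it we have to call the “getParser” method of the “HTMLEditorKit” class, and Sun declared that method protected. The workaround used here is to create a subclass that overrides “getParser” and makes it public. The “HTMLParse” class does this; see Listing 4: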
import javax.swing.text.html.*;
public class HTMLParse extends HTMLEditorKit {
public HTMLEditorKit.Parser getParser()
{
return super.getParser();
}
}
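As an alternative to subclassing, the JDK also provides the public class javax.swing.text.html.parser.ParserDelegator, which extends HTMLEditorKit.Parser and can be instantiated directly. The following sketch parses a small in-memory page with it and records which callback fires for each tag. The class name CallbackKindsSketch and the sample HTML are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class CallbackKindsSketch extends HTMLEditorKit.ParserCallback {
    final List<String> startTags = new ArrayList<>();
    final List<String> simpleTags = new ArrayList<>();

    // Called for tags that open an element, such as <p> or <a>.
    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        startTags.add(t.toString());
    }

    // Called for tags that have no end tag, such as <br>.
    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        simpleTags.add(t.toString());
    }

    // Parses an HTML string and returns the callback with its recorded tags.
    static CallbackKindsSketch parse(String html) {
        CallbackKindsSketch cb = new CallbackKindsSketch();
        try {
            new ParserDelegator().parse(new StringReader(html), cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cb;
    }

    public static void main(String[] args) {
        CallbackKindsSketch cb = parse(
            "<html><body><p>Line one<br>Line two "
            + "<a href=\"next.html\">next</a></p></body></html>");
        System.out.println("start tags:  " + cb.startTags);
        System.out.println("simple tags: " + cb.simpleTags);
    }
}
```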
The “HTMLParse” class is used in the “processURL” method of the Spider class. As the code shows, the Reader object supplies the page content to the “HTMLEditorKit.Parser”:
HTMLEditorKit.Parser parse = new HTMLParse().getParser();
parse.parse(r,new Parser(url),true);
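Note that a new Parser object is also constructed here. This Parser class is an inner class of the Spider class, and it is a callback class: it contains the methods that are called for each kind of HTML tag. In this article we only care about two callbacks, one for a simple tag (a tag with no end tag, such as <br>) and one for a start tag. They are named “handleSimpleTag” and “handleStartTag”. Because both kinds are processed in the same way, “handleStartTag” simply calls “handleSimpleTag”, and “handleSimpleTag” is responsible for extracting the hyperlinks from the document. These hyperlinks are used to locate the other pages the “spider” will visit. When a tag is parsed, “handleSimpleTag” checks whether it has an “href” (hypertext reference) attribute: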
String href = (String)a.getAttribute(HTML.Attribute.HREF);
if( (href==null) && (t==HTML.Tag.FRAME) )
href = (String)a.getAttribute(HTML.Attribute.SRC);
if ( href==null )
return;
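To see the attribute lookup in isolation, the sketch below runs the same “href”/“src” test against an in-memory page and collects the raw link strings. LinkCollectorSketch and its collect method are hypothetical names. Using ParserDelegator with a StringReader, instead of the article's HTMLParse and a live connection, is a substitution that keeps the example free of network access.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkCollectorSketch extends HTMLEditorKit.ParserCallback {
    final List<String> links = new ArrayList<>();

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // Prefer "href"; a FRAME tag carries its target in "src" instead.
        String href = (String) a.getAttribute(HTML.Attribute.HREF);
        if (href == null && t == HTML.Tag.FRAME) {
            href = (String) a.getAttribute(HTML.Attribute.SRC);
        }
        if (href == null) {
            return; // this tag does not link anywhere
        }
        // Drop a "#fragment" suffix so that the same page is not queued twice.
        int i = href.indexOf('#');
        if (i != -1) {
            href = href.substring(0, i);
        }
        links.add(href);
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        handleSimpleTag(t, a, pos); // handled the same way, as in the article
    }

    // Parses an HTML string and returns the link strings in document order.
    static List<String> collect(String html) {
        LinkCollectorSketch cb = new LinkCollectorSketch();
        try {
            new ParserDelegator().parse(new StringReader(html), cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cb.links;
    }
}
```

In the article's Spider, a string that starts with “mailto:” is reported through spiderFoundEMail, and every other string is handed on to handleLink; this sketch stops before that step.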
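If there is no “href” attribute, the code goes on to check whether the current tag is a Frame, because a Frame points to another page with a “src” attribute. A typical hyperlink looks like this:
<a href="linkedpage.html">link text</a>
The “href” attribute in the link above points to the page it links to, but “linkedpage.html” is not a complete address. It only names a page somewhere on the same Web server. This is called a relative URL, and a relative URL must be resolved into an absolute URL, which is done by the following code: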
URL url = new URL(base,str);
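This constructs another URL. Here str is the relative URL and base is the URL of the page on which it was found; this form of the URL constructor produces an absolute URL. Once the URL is in its correct absolute form, “addURL” checks whether it is already in the waiting, error, or processed list to decide whether it has been seen before. If it has not, it is added to the waiting list and is later processed like any other URL.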
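The resolution rules are easier to see with concrete values. The sketch below applies the same two-argument URL constructor to a few link strings. The class name ResolveSketch and the example.com addresses are invented for illustration. Recent JDK releases mark the URL constructors as deprecated in favor of java.net.URI; they are used here only because the article uses them.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ResolveSketch {

    // Resolves a link found on the page at "base" into an absolute URL string.
    // Returns null when either argument cannot be parsed as a URL.
    static String resolve(String base, String link) {
        try {
            return new URL(new URL(base), link).toString();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/docs/index.html";
        String[] links = {
            "linkedpage.html",                  // same directory as the base page
            "../top.html",                      // one directory up
            "/about.html",                      // relative to the server root
            "http://other.example.org/x.html"   // already absolute
        };
        for (String link : links) {
            System.out.println(link + " -> " + resolve(base, link));
        }
    }
}
```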
The complete code is listed below:
1. CheckLinks.java
import java.awt.*;
import javax.swing.*;
import java.net.*;
import java.io.*;

public class CheckLinks extends javax.swing.JFrame implements
             Runnable,ISpiderReportable {

  public CheckLinks()
  {
    //{{INIT_CONTROLS
    setTitle("Find Broken Links");
    getContentPane().setLayout(null);
    setSize(405,288);
    setVisible(true);
    label1.setText("Enter a URL:");
    getContentPane().add(label1);
    label1.setBounds(12,12,84,12);
    begin.setText("Begin");
    begin.setActionCommand("Begin");
    getContentPane().add(begin);
    begin.setBounds(12,36,84,24);
    getContentPane().add(url);
    url.setBounds(108,36,288,24);
    errorScroll.setAutoscrolls(true);
    errorScroll.setHorizontalScrollBarPolicy(javax.swing.
                ScrollPaneConstants.HORIZONTAL_SCROLLBAR_ALWAYS);
    errorScroll.setVerticalScrollBarPolicy(javax.swing.
                ScrollPaneConstants.VERTICAL_SCROLLBAR_ALWAYS);
    errorScroll.setOpaque(true);
    getContentPane().add(errorScroll);
    errorScroll.setBounds(12,120,384,156);
    errors.setEditable(false);
    errorScroll.getViewport().add(errors);
    errors.setBounds(0,0,366,138);
    current.setText("Currently Processing: ");
    getContentPane().add(current);
    current.setBounds(12,72,384,12);
    goodLinksLabel.setText("Good Links: 0");
    getContentPane().add(goodLinksLabel);
    goodLinksLabel.setBounds(12,96,192,12);
    badLinksLabel.setText("Bad Links: 0");
    getContentPane().add(badLinksLabel);
    badLinksLabel.setBounds(216,96,96,12);
    //}}
    //{{INIT_MENUS
    //}}
    //{{REGISTER_LISTENERS
    SymAction lSymAction = new SymAction();
    begin.addActionListener(lSymAction);
    //}}
  }

  static public void main(String args[])
  {
    (new CheckLinks()).setVisible(true);
  }

  public void addNotify()
  {
    // Record the size of the window prior to calling parent's
    // addNotify.
    Dimension size = getSize();
    super.addNotify();
    if ( frameSizeAdjusted )
      return;
    frameSizeAdjusted = true;
    // Adjust size of frame according to the insets and menu bar
    Insets insets = getInsets();
    javax.swing.JMenuBar menuBar = getRootPane().getJMenuBar();
    int menuBarHeight = 0;
    if ( menuBar != null )
      menuBarHeight = menuBar.getPreferredSize().height;
    setSize(insets.left + insets.right + size.width, insets.top +
                          insets.bottom + size.height +
                          menuBarHeight);
  }
  // Used by addNotify
  boolean frameSizeAdjusted = false;
  //{{DECLARE_CONTROLS
  javax.swing.JLabel label1 = new javax.swing.JLabel();

  javax.swing.JButton begin = new javax.swing.JButton();

  javax.swing.JTextField url = new javax.swing.JTextField();

  javax.swing.JScrollPane errorScroll =
        new javax.swing.JScrollPane();

  javax.swing.JTextArea errors = new javax.swing.JTextArea();
  javax.swing.JLabel current = new javax.swing.JLabel();
  javax.swing.JLabel goodLinksLabel = new javax.swing.JLabel();
  javax.swing.JLabel badLinksLabel = new javax.swing.JLabel();
  //}}
  //{{DECLARE_MENUS
  //}}

  protected Thread backgroundThread;

  protected Spider spider;

  protected URL base;

  protected int badLinksCount = 0;

  protected int goodLinksCount = 0;

  class SymAction implements java.awt.event.ActionListener {
    public void actionPerformed(java.awt.event.ActionEvent event)
    {
      Object object = event.getSource();
      if ( object == begin )
        begin_actionPerformed(event);
    }
  }

  void begin_actionPerformed(java.awt.event.ActionEvent event)
  {
    if ( backgroundThread==null ) {
      // setText replaces the deprecated setLabel
      begin.setText("Cancel");
      backgroundThread = new Thread(this);
      backgroundThread.start();
      goodLinksCount=0;
      badLinksCount=0;
    } else {
      spider.cancel();
    }
  }

  public void run()
  {
    try {
      errors.setText("");
      spider = new Spider(this);
      spider.clear();
      base = new URL(url.getText());
      spider.addURL(base);
      spider.begin();
    } catch ( MalformedURLException e ) {
      UpdateErrors err = new UpdateErrors();
      err.msg = "Bad address.";
      SwingUtilities.invokeLater(err);
    } finally {
      // Restore the button and free the thread slot even when the
      // address was bad, so that the user can start another run.
      Runnable doLater = new Runnable()
      {
        public void run()
        {
          begin.setText("Begin");
        }
      };
      SwingUtilities.invokeLater(doLater);
      backgroundThread=null;
    }
  }

  public boolean spiderFoundURL(URL base,URL url)
  {
    UpdateCurrentStats cs = new UpdateCurrentStats();
    cs.msg = url.toString();
    SwingUtilities.invokeLater(cs);
    if ( !checkLink(url) ) {
      UpdateErrors err = new UpdateErrors();
      err.msg = url+"(on page " + base + ")\n";
      SwingUtilities.invokeLater(err);
      badLinksCount++;
      return false;
    }
    goodLinksCount++;
    if ( !url.getHost().equalsIgnoreCase(base.getHost()) )
      return false;
    else
      return true;
  }

  public void spiderURLError(URL url)
  {
  }

  protected boolean checkLink(URL url)
  {
    try {
      URLConnection connection = url.openConnection();
      connection.connect();
      return true;
    } catch ( IOException e ) {
      return false;
    }
  }

  public void spiderFoundEMail(String email)
  {
  }

  class UpdateErrors implements Runnable {
    public String msg;
    public void run()
    {
      errors.append(msg);
    }
  }

  class UpdateCurrentStats implements Runnable {
    public String msg;
    public void run()
    {
      current.setText("Currently Processing: " + msg );
      goodLinksLabel.setText("Good Links: " + goodLinksCount);
      badLinksLabel.setText("Bad Links: " + badLinksCount);
    }
  }
}
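One limitation of checkLink in the listing above is worth noting: it only verifies that a connection can be opened, so for an HTTP URL a page that the server reports as missing (for example with status 404) would most likely still be counted as a good link. A stricter check could inspect the response code. The following is a minimal sketch under stated assumptions: the class name LinkCheckSketch, the use of a HEAD request, the five-second timeouts, and the rule that any status from 200 to 399 counts as good are all choices made for illustration and are not taken from the article.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

class LinkCheckSketch {

    // Assumption: 2xx and 3xx responses count as a working link.
    static boolean isGoodStatus(int status) {
        return status >= 200 && status < 400;
    }

    // Returns true when the URL can be reached and, for HTTP(S),
    // the server answers with a status that isGoodStatus accepts.
    static boolean checkLink(URL url) {
        try {
            URLConnection connection = url.openConnection();
            connection.setConnectTimeout(5000); // milliseconds
            connection.setReadTimeout(5000);
            if (connection instanceof HttpURLConnection) {
                HttpURLConnection http = (HttpURLConnection) connection;
                http.setRequestMethod("HEAD"); // ask for headers only
                try {
                    return isGoodStatus(http.getResponseCode());
                } finally {
                    http.disconnect();
                }
            }
            // Non-HTTP URLs (file:, ftp:, ...) fall back to the article's test.
            connection.connect();
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Some servers reject HEAD requests, so a real link checker may need to retry with GET before declaring a link broken.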
2. ISpiderReportable.java
import java.net.*;

interface ISpiderReportable {
  public boolean spiderFoundURL(URL base,URL url);
  public void spiderURLError(URL url);
  public void spiderFoundEMail(String email);
}
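The interface is what decouples the Spider from the Swing window, so a non-graphical client is also possible. Here is a minimal sketch of such a client, assuming that it should simply record what it is told and keep the crawl on the starting host. The class SameHostReport and its url helper are hypothetical, and the interface is repeated only so that the block compiles on its own.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Copy of the article's interface so that the sketch is self-contained.
interface ISpiderReportable {
    boolean spiderFoundURL(URL base, URL url);
    void spiderURLError(URL url);
    void spiderFoundEMail(String email);
}

class SameHostReport implements ISpiderReportable {
    final List<String> found = new ArrayList<>();
    final List<String> errors = new ArrayList<>();
    final List<String> emails = new ArrayList<>();

    // Returning true asks the Spider to queue the URL; links that leave
    // the host of the page they were found on are recorded but not followed.
    public boolean spiderFoundURL(URL base, URL url) {
        found.add(url.toString());
        return url.getHost().equalsIgnoreCase(base.getHost());
    }

    public void spiderURLError(URL url) {
        errors.add(url.toString());
    }

    public void spiderFoundEMail(String email) {
        emails.add(email);
    }

    // Convenience helper for examples: wraps the checked exception.
    static URL url(String s) {
        try {
            return new URL(s);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Such an object could be passed to the Spider constructor in place of the CheckLinks window.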
3. Spider.java
import java.util.*;
import java.net.*;
import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public class Spider {

  protected Collection workloadError = new ArrayList(3);

  protected Collection workloadWaiting = new ArrayList(3);

  protected Collection workloadProcessed = new ArrayList(3);

  protected ISpiderReportable report;

  protected boolean cancel = false;

  public Spider(ISpiderReportable report)
  {
    this.report = report;
  }

  public Collection getWorkloadError()
  {
    return workloadError;
  }

  public Collection getWorkloadWaiting()
  {
    return workloadWaiting;
  }

  public Collection getWorkloadProcessed()
  {
    return workloadProcessed;
  }

  public void clear()
  {
    getWorkloadError().clear();
    getWorkloadWaiting().clear();
    getWorkloadProcessed().clear();
  }

  public void cancel()
  {
    cancel = true;
  }

  public void addURL(URL url)
  {
    if ( getWorkloadWaiting().contains(url) )
      return;
    if ( getWorkloadError().contains(url) )
      return;
    if ( getWorkloadProcessed().contains(url) )
      return;
    log("Adding to workload: " + url );
    getWorkloadWaiting().add(url);
  }

  public void processURL(URL url)
  {
    try {
      log("Processing: " + url );
      // get the URL's contents
      URLConnection connection = url.openConnection();
      if ( (connection.getContentType()!=null) &&
           !connection.getContentType().toLowerCase().startsWith("text/") ) {
        getWorkloadWaiting().remove(url);
        getWorkloadProcessed().add(url);
        log("Not processing because content type is: " +
             connection.getContentType() );
        return;
      }

      // read the URL
      InputStream is = connection.getInputStream();
      Reader r = new InputStreamReader(is);
      // parse the URL
      HTMLEditorKit.Parser parse = new HTMLParse().getParser();
      parse.parse(r,new Parser(url),true);
    } catch ( IOException e ) {
      getWorkloadWaiting().remove(url);
      getWorkloadError().add(url);
      log("Error: " + url );
      report.spiderURLError(url);
      return;
    }
    // mark URL as complete
    getWorkloadWaiting().remove(url);
    getWorkloadProcessed().add(url);
    log("Complete: " + url );
  }

  public void begin()
  {
    cancel = false;
    while ( !getWorkloadWaiting().isEmpty() && !cancel ) {
      Object list[] = getWorkloadWaiting().toArray();
      for ( int i=0;(i<list.length)&&!cancel;i++ )
        processURL((URL)list[i]);
    }
  }

  protected class Parser
  extends HTMLEditorKit.ParserCallback {
    protected URL base;

    public Parser(URL base)
    {
      this.base = base;
    }

    public void handleSimpleTag(HTML.Tag t,
                                MutableAttributeSet a,int pos)
    {
      String href = (String)a.getAttribute(HTML.Attribute.HREF);

      if( (href==null) && (t==HTML.Tag.FRAME) )
        href = (String)a.getAttribute(HTML.Attribute.SRC);

      if ( href==null )
        return;
      int i = href.indexOf('#');
      if ( i!=-1 )
        href = href.substring(0,i);
      if ( href.toLowerCase().startsWith("mailto:") ) {
        report.spiderFoundEMail(href);
        return;
      }
      handleLink(base,href);
    }

    public void handleStartTag(HTML.Tag t,
                               MutableAttributeSet a,int pos)
    {
      handleSimpleTag(t,a,pos);    // handle the same way
    }

    protected void handleLink(URL base,String str)
    {
      try {
        URL url = new URL(base,str);
        if ( report.spiderFoundURL(base,url) )
          addURL(url);
      } catch ( MalformedURLException e ) {
        log("Found malformed URL: " + str );
      }
    }
  }

  public void log(String entry)
  {
    System.out.println( (new Date()) + ":" + entry );
  }
}
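The waiting and processed lists in the Spider class can be studied without any network access. The sketch below replays the same idea over an in-memory link map: a URL is queued only the first time it is seen, and the loop runs until the waiting queue is empty. WorkloadSketch, its crawl method, and the link map are hypothetical. URLs are kept as strings here because, according to the JDK documentation, comparing java.net.URL objects with equals may trigger host name resolution, which the contains calls in addURL would then pay for on every lookup.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WorkloadSketch {
    private final Deque<String> waiting = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private final List<String> processed = new ArrayList<>();

    // Queues a URL only if it has never been queued before.
    void addURL(String url) {
        if (seen.add(url)) {
            waiting.add(url);
        }
    }

    // "links" stands in for the Web: it maps a page to the links found on it.
    // Returns the pages in the order in which they were processed.
    List<String> crawl(String start, Map<String, List<String>> links) {
        addURL(start);
        while (!waiting.isEmpty()) {
            String url = waiting.poll();
            processed.add(url); // a real spider would download and parse here
            List<String> found = links.getOrDefault(url, Collections.<String>emptyList());
            for (String next : found) {
                addURL(next);
            }
        }
        return processed;
    }
}
```

Because every URL passes through the seen set exactly once, pages that link to each other in a cycle do not cause an endless loop.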
4. HTMLParse.java
import javax.swing.text.html.*;

public class HTMLParse extends HTMLEditorKit {
  public HTMLEditorKit.Parser getParser()
  {
    return super.getParser();
  }
}
This article comes from a CSDN blog; please cite the source when reprinting: http://blog.csdn.net/wuhailin2005/archive/2009/01/08/3736026.aspx