
Fetching Remote Web Pages with Java

Published: 2012-12-18 12:43:41  Author: rapoo

I. Retrieving Response Header Information

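Steps:

1. Define a URL object and initialize it.

2. Define a URLConnection object and obtain it by calling the URL object's openConnection() method.

3. Call the URLConnection object's connect() method to connect to the server.

4. Read the response header fields from the URLConnection object (getHeaderFields(), getHeaderField(key)).

5. Use the URLConnection object's other methods, such as getContentType() and getInputStream(), to read the remaining information.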

Example:

String urlName = "http://www....com";

try {
    URL url = new URL(urlName);
    URLConnection connection = url.openConnection();
    connection.connect();

    // print header fields
    Map<String, List<String>> headers = connection.getHeaderFields();
    for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
        String key = entry.getKey();
        for (String value : entry.getValue()) {
            System.out.println(key + ": " + value);
        }
    }

    // print convenience functions
    System.out.println("------------------------");
    System.out.println("getContentType:" + connection.getContentType());
    System.out.println("getContentLength:" + connection.getContentLength());
    System.out.println("getContentEncoding:" + connection.getContentEncoding());
    System.out.println("getDate:" + connection.getDate());
    System.out.println("getExpiration:" + connection.getExpiration());
    System.out.println("getLastModified:" + connection.getLastModified());
    System.out.println("------------------------");

    // print first ten lines of contents
    Scanner in = new Scanner(connection.getInputStream());
    for (int n = 1; in.hasNextLine() && n <= 10; n++) {
        System.out.println(in.nextLine());
    }
    if (in.hasNextLine()) {
        System.out.println("...");
    }
    in.close();
} catch (IOException e) {
    e.printStackTrace();
}
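Two details of the example above are easy to miss. For HTTP connections, the map returned by getHeaderFields() typically stores the status line under a null key, and getDate(), getExpiration() and getLastModified() return milliseconds since the epoch, with 0 meaning the value is not known. The following is a minimal sketch of one way to handle both. The class name HeaderFormat and its method names are hypothetical, and the map built in main is made-up sample data, not output from a real server.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeaderFormat {

    /** Turns a header map into printable lines, one line per header value. */
    static List<String> formatHeaders(Map<String, List<String>> headers) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            for (String value : entry.getValue()) {
                // A null key usually holds the HTTP status line; print it without a prefix.
                lines.add(entry.getKey() == null ? value : entry.getKey() + ": " + value);
            }
        }
        return lines;
    }

    /** Formats an epoch-millisecond header value; 0 means the value is not known. */
    static String formatMillis(long millis) {
        return millis == 0 ? "(not sent)" : Instant.ofEpochMilli(millis).toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for connection.getHeaderFields().
        Map<String, List<String>> sample = new LinkedHashMap<>();
        sample.put(null, Arrays.asList("HTTP/1.1 200 OK"));
        sample.put("Content-Type", Arrays.asList("text/html"));
        for (String line : formatHeaders(sample)) {
            System.out.println(line);
        }
        System.out.println(formatMillis(0L));
    }
}
```

With a live connection, the same helpers would be called as formatHeaders(connection.getHeaderFields()) and formatMillis(connection.getLastModified()).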


II. Requests with Parameters

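By default, a connection only provides an input stream for reading from the server; it has no output stream for writing. To obtain an output stream (for example, to submit data to a web server), call: connection.setDoOutput(true);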


Example:

String urlName = "……";
Map<String, String> paras = new HashMap<String, String>();
paras.put("flightway", "Single");

String result;
try {
    result = doPost(urlName, paras);
    System.out.println(result);
} catch (IOException e) {
    e.printStackTrace();
}


public static String doPost(String urlString,
        Map<String, String> nameValuePairs) throws IOException {
    URL url = new URL(urlString);
    URLConnection connection = url.openConnection();
    // enable the output stream so that a request body can be written
    connection.setDoOutput(true);

    // write the parameters as name=value pairs separated by '&'
    PrintWriter out = new PrintWriter(connection.getOutputStream());
    boolean first = true;
    for (Map.Entry<String, String> pair : nameValuePairs.entrySet()) {
        if (first) {
            first = false;
        } else {
            out.print('&');
        }
        String name = pair.getKey();
        String value = pair.getValue();
        out.print(name);
        out.print('=');
        // encode with the charset the server expects (GB2312 here; UTF-8 is the usual alternative)
        out.print(URLEncoder.encode(value, "GB2312"));
    }
    out.close();

    // read the response; if the server reports an error, fall back to the error stream
    Scanner in;
    StringBuilder response = new StringBuilder();
    try {
        in = new Scanner(connection.getInputStream());
    } catch (IOException e) {
        if (!(connection instanceof HttpURLConnection)) {
            throw e;
        }
        InputStream err = ((HttpURLConnection) connection).getErrorStream();
        if (err == null) {
            throw e;
        }
        in = new Scanner(err);
    }

    while (in.hasNextLine()) {
        response.append(in.nextLine());
        response.append("\n");
    }

    in.close();
    return response.toString();
}
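The request body written by doPost is a form-encoded string of name=value pairs joined by '&'. The following sketch builds that string on its own so it can be inspected before anything is sent. Unlike doPost, it URL-encodes the parameter names as well as the values. The class name FormBody, the method name encode, and the second parameter in main are hypothetical and used only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormBody {

    /** Builds "name=value&name=value", URL-encoding both names and values. */
    static String encode(Map<String, String> params, String charset)
            throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> pair : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&'); // separator between pairs, not before the first one
            }
            body.append(URLEncoder.encode(pair.getKey(), charset));
            body.append('=');
            body.append(URLEncoder.encode(pair.getValue(), charset));
        }
        return body.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // LinkedHashMap keeps insertion order, so the output order is predictable.
        Map<String, String> paras = new LinkedHashMap<>();
        paras.put("flightway", "Single");
        paras.put("from city", "A&B"); // hypothetical parameter with characters that need escaping
        System.out.println(encode(paras, "UTF-8"));
        // prints: flightway=Single&from+city=A%26B
    }
}
```

The charset passed to encode should match what the server expects, which is why doPost above uses GB2312 for its target site.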

Notes:

import java.io.*;
import java.net.*;
import java.util.*;


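// huc is assumed to be an HttpURLConnection, e.g. (HttpURLConnection) url.openConnection()
huc.setDoOutput(true);
// use the POST method
huc.setRequestMethod("POST");
huc.setRequestProperty("user-agent", "mozilla/4.7 [en] (win98; i)");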
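The fragment above does not show where huc comes from. The following sketch shows one plausible setup, assuming huc is the HttpURLConnection obtained by casting the result of openConnection(). The class name PostSetup, the method name openForPost, and the example.com address are placeholders. The sketch uses URI.create(...).toURL() rather than new URL(...), which gives the same result for a plain absolute http URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class PostSetup {

    /** Opens (but does not yet connect) an HTTP connection configured for POST. */
    static HttpURLConnection openForPost(String urlString, String userAgent)
            throws IOException {
        // openConnection() only creates the connection object; no request is sent yet.
        HttpURLConnection huc =
                (HttpURLConnection) URI.create(urlString).toURL().openConnection();
        huc.setDoOutput(true);          // allow writing a request body
        huc.setRequestMethod("POST");   // throws ProtocolException for an invalid method
        huc.setRequestProperty("User-Agent", userAgent);
        return huc;
    }

    public static void main(String[] args) throws IOException {
        // example.com is a placeholder host; nothing is sent over the network here.
        HttpURLConnection huc =
                openForPost("http://example.com/", "mozilla/4.7 [en] (win98; i)");
        System.out.println(huc.getRequestMethod() + " doOutput=" + huc.getDoOutput());
    }
}
```

These settings must be applied before the connection is established, that is, before getOutputStream(), getInputStream() or connect() is called.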


Reposted from: http://incan.iteye.com/blog/279000
