Parsing Web Resources with Jsoup

Published: 2013-02-24 17:58:56 Author: rapoo

Jsoup is an HTML parser for Java. It can fetch and parse the page at a given URL, or parse HTML text directly.
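As a small illustration of the second mode (parsing HTML text directly), here is a minimal sketch that needs no network access. It assumes the jsoup jar is on the classpath. The class name, the method name, and the HTML fragment are all made up for this example; the fragment is not copied from any real page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupParseSketch {

    // Hypothetical HTML fragment, invented for this example only.
    static final String SAMPLE_HTML =
            "<div id=\"booksort\"><div class=\"mc\"><ul>"
            + "<li><a href=\"/products/1713-3258-000.html\">Fiction</a></li>"
            + "<li><a href=\"/products/1713-3259-000.html\">Literature</a></li>"
            + "</ul></div></div>";

    // Maps each matching link's href attribute to its visible text, in document order.
    static Map<String, String> linkTexts(String html) {
        Document doc = Jsoup.parse(html); // parse an in-memory HTML string
        Map<String, String> links = new LinkedHashMap<String, String>();
        for (Element a : doc.select("#booksort div.mc ul li a")) { // CSS selector query
            links.put(a.attr("href"), a.text());
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(linkTexts(SAMPLE_HTML));
    }
}
```

The same `select`, `attr`, and `text` calls work on a `Document` obtained from a URL, so a selector can be tried out on a saved HTML string before pointing it at a live page.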

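The scenario is as follows:

1. Fetch the book categories from JD (Jingdong).

2. Store them in a map, with the category id as the key and the category name as the value.

The code is as follows: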

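import java.util.HashMap;
import java.util.Map;
import org.apache.commons.lang.StringUtils; // assumption: Apache Commons Lang (or lang3)
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

private static Map<String, String> getWareCategory() {
    Connection conn = Jsoup.connect(JDConstants.CATEGORY_URL_FORMAT)
            .userAgent(JDConstants.MOZILLA_AGENT)
            .timeout(JDConstants.TIME_OUT);
    Map<String, String> categoryMap = new HashMap<String, String>();
    Document document = null;
    try {
        Connection.Response response = conn.execute();
        if (response.statusCode() != JDConstants.HTTP_OK_CODE) {
            return categoryMap;
        }
        // Parse the response already fetched; conn.get() would request the page a second time.
        document = response.parse();
        // Category links: div.left > #booksort > div.mc ul > li
        Elements items = document.select("div.left").select("#booksort").first()
                .select("div.mc ul").first().select("li");
        for (Element item : items) {
            String url = item.select("a").attr("href");
            String name = item.select("a").text();
            // The middle segment of an href such as ".../1713-3269-000.html" is the category id.
            String[] parts = StringUtils.isNotEmpty(url) ? url.split("-") : new String[0];
            String categoryId = parts.length == 3 ? parts[1] : "";
            // Skip links without an id so they do not overwrite each other under the "" key.
            if (!categoryId.isEmpty()) {
                categoryMap.put(categoryId, name);
            }
        }
    } catch (Exception e) {
        LOG.error("getCategory response:" + document);
        LOG.error("getCategory error:" + e.getMessage(), e);
    }
    LOG.info("***********categoryMap:" + categoryMap);
    return categoryMap;
}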
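The category id extraction in the method above can be tried in isolation. The following is a minimal sketch using only the standard library. The class and method names are hypothetical, and the second URL in `main` is made up to show the fallback case; the first is the category URL used in this article.

```java
public class CategoryIdSketch {

    // Returns the middle segment of an "aaa-ID-bbb" style URL, or "" when the
    // URL is null, empty, or does not split into exactly three parts on "-".
    static String extractCategoryId(String url) {
        if (url == null || url.isEmpty()) {
            return "";
        }
        String[] parts = url.split("-");
        return parts.length == 3 ? parts[1] : "";
    }

    public static void main(String[] args) {
        // Splits into ".../1713", "3269", "000.html", so this prints 3269.
        System.out.println(extractCategoryId("http://www.360buy.com/products/1713-3269-000.html"));
        // Hypothetical URL with no "-" at all: prints an empty line.
        System.out.println(extractCategoryId("http://www.360buy.com/book.html"));
    }
}
```

Note that this approach splits the whole URL, so a hyphen anywhere else in it (for example in the host name) changes the segment count and yields an empty id.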

The constants used above are defined as follows:

public abstract class JDConstants {
    public static final int TIME_OUT = 1000 * 60 * 30; // 30 minutes, in milliseconds
    public static final String MOZILLA_AGENT = "Mozilla";
    public static final int HTTP_OK_CODE = 200;
    public static final String CATEGORY_URL_FORMAT = "http://www.360buy.com/products/1713-3269-000.html";
}

Assessment:

Jsoup is very convenient to use.
