Nutch Study Notes 2: Injector, the Data Download Entry Point

Published: 2012-06-29 15:48:46  Author: rapoo


org.apache.nutch.crawl.Injector

public class Injector extends Configured implements Tool. From the class it extends and the interface it implements, we can see that Injector wraps Hadoop: its constructor initializes the Hadoop configuration object, Configuration (for the internal mechanics of Configuration, see the post "Hadoop Study Notes 1: Configuration"). This is one of the mechanisms Nutch uses to encapsulate Hadoop.
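As a rough sketch of this Configured/Tool pattern (the class name InjectorLikeTool is illustrative, not a copy of Nutch's source), a driver of this shape is normally launched through ToolRunner, which injects the Configuration into the Tool before run() is called:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class InjectorLikeTool extends Configured implements Tool {

  public InjectorLikeTool() {
  }

  public InjectorLikeTool(Configuration conf) {
    setConf(conf); // keep the Hadoop configuration, mirroring what Injector's constructor does
  }

  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration supplied by ToolRunner or by the constructor
    System.out.println("mapred.temp.dir = " + getConf().get("mapred.temp.dir", "."));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses generic Hadoop options, calls setConf() on the Tool,
    // then invokes run() with the remaining arguments
    int res = ToolRunner.run(new Configuration(), new InjectorLikeTool(), args);
    System.exit(res);
  }
}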

Injector contains two fields:

/** metadata key reserved for setting a custom score for a specific URL */
public static String nutchScoreMDName = "nutch.score";
/** metadata key reserved for setting a custom fetchInterval for a specific URL */
public static String nutchFetchIntervalMDName = "nutch.fetchInterval";
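For context, these two keys are reserved for per-URL metadata that, in Nutch 1.x, can be appended to a seed URL as tab-separated name=value pairs. A purely illustrative seed-file line (URL and values are assumptions, separated by tabs) could look like:

http://www.example.com/	nutch.score=2.5	nutch.fetchInterval=86400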

One method: public void inject(Path crawlDb, Path urlDir)

The first parameter, crawlDb, is the path to the crawldb directory under the Nutch crawl directory;

The second parameter, urlDir, is the directory that holds the seed URL list files.

public void inject(Path crawlDb, Path urlDir) throws IOException {
  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  long start = System.currentTimeMillis();

  // log the start of the injection
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: starting at " + sdf.format(start));
    LOG.info("Injector: crawlDb: " + crawlDb);
    LOG.info("Injector: urlDir: " + urlDir);
  }

  // create a randomly named temporary directory for the intermediate output
  Path tempDir = new Path(getConf().get("mapred.temp.dir", ".") + "/inject-temp-"
      + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  // map text input file to a <url,CrawlDatum> file
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Converting injected urls to crawl db entries.");
  }

  // create the first job and set its mapper class, InjectMapper
  JobConf sortJob = new NutchJob(getConf());
  sortJob.setJobName("inject " + urlDir);
  FileInputFormat.addInputPath(sortJob, urlDir);
  sortJob.setMapperClass(InjectMapper.class);
  // write the map output to the temporary directory
  FileOutputFormat.setOutputPath(sortJob, tempDir);

  // set the job's output format, key/value classes, and the injection timestamp
  sortJob.setOutputFormat(SequenceFileOutputFormat.class);
  sortJob.setOutputKeyClass(Text.class);
  sortJob.setOutputValueClass(CrawlDatum.class);
  sortJob.setLong("injector.current.time", System.currentTimeMillis());

  // run the job
  JobClient.runJob(sortJob);

  // merge with existing crawl db
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Merging injected urls into crawl db.");
  }

  JobConf mergeJob = CrawlDb.createJob(getConf(), crawlDb);
  FileInputFormat.addInputPath(mergeJob, tempDir);
  mergeJob.setReducerClass(InjectReducer.class);
  JobClient.runJob(mergeJob);
  CrawlDb.install(mergeJob, crawlDb);

  // clean up
  FileSystem fs = FileSystem.get(getConf());
  fs.delete(tempDir, true);

  long end = System.currentTimeMillis();
  LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
}
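To round this out, a hedged usage sketch of calling inject() directly: NutchConfiguration.create() is Nutch's standard way to load nutch-default.xml/nutch-site.xml, while the crawl/crawldb and urls paths are placeholder assumptions, not values from this post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

public class InjectDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create(); // load the Nutch configuration
    Injector injector = new Injector();
    injector.setConf(conf);                           // hand the configuration to the Tool
    // placeholder paths: the crawldb directory and the seed URL directory
    injector.inject(new Path("crawl/crawldb"), new Path("urls"));
  }
}

On the command line the same work is normally triggered with bin/nutch inject <crawldb> <url_dir>, which ends up calling this method.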
