MapReduce Programming Notes: Executing Multiple MapReduce Jobs
When learning Hadoop, writing MapReduce programs is unavoidable. For simple analyses a single MapReduce job is enough, so the single-job case is not covered here; there are plenty of examples online that a quick search will turn up. For more complex analyses, however, we may need multiple Jobs, or multiple map or reduce stages, to complete the computation.
There are several ways to structure a multi-Job or multi-MapReduce program:
1. Iterative MapReduce
In the iterative style, the output of one MapReduce job typically becomes the input of the next. In the end only the final result needs to be kept; the intermediate data can be deleted or retained, depending on your business requirements.
Sample code (an excerpt from Apache Nutch's crawl driver, which chains the inject/generate/fetch/parse/updatedb jobs):
// ... (fields such as results, status, cleanSeedDir, shouldStop and LOG are declared elsewhere in the class and omitted here)
private NutchTool currentTool = null;
// ...

private Map<String, Object> runTool(Class<? extends NutchTool> toolClass,
    Map<String, Object> args) throws Exception {
  currentTool = (NutchTool) ReflectionUtils.newInstance(toolClass, getConf());
  return currentTool.run(args);
}

// ...

@Override
public Map<String, Object> run(Map<String, Object> args) throws Exception {
  results.clear();
  status.clear();
  String crawlId = (String) args.get(Nutch.ARG_CRAWL);
  if (crawlId != null) {
    getConf().set(Nutch.CRAWL_ID_KEY, crawlId);
  }
  String seedDir = null;
  String tmpSeedDir = null;
  String seedList = (String) args.get(Nutch.ARG_SEEDLIST);
  if (seedList != null) { // takes precedence
    String[] seeds = seedList.split("\\s+");
    // create tmp. dir
    tmpSeedDir = getConf().get("hadoop.tmp.dir") + "/seed-"
        + System.currentTimeMillis();
    FileSystem fs = FileSystem.get(getConf());
    Path p = new Path(tmpSeedDir);
    fs.mkdirs(p);
    Path seedOut = new Path(p, "urls");
    OutputStream os = fs.create(seedOut);
    for (String s : seeds) {
      os.write(s.getBytes());
      os.write('\n');
    }
    os.flush();
    os.close();
    cleanSeedDir = true;
    seedDir = tmpSeedDir;
  } else {
    seedDir = (String) args.get(Nutch.ARG_SEEDDIR);
  }
  Integer depth = (Integer) args.get(Nutch.ARG_DEPTH);
  if (depth == null)
    depth = 1;
  boolean parse = getConf().getBoolean(FetcherJob.PARSE_KEY, false);
  String solrUrl = (String) args.get(Nutch.ARG_SOLR);
  int onePhase = 3;
  if (!parse)
    onePhase++;
  float totalPhases = depth * onePhase;
  if (seedDir != null)
    totalPhases++;
  float phase = 0;
  Map<String, Object> jobRes = null;
  LinkedHashMap<String, Object> subTools = new LinkedHashMap<String, Object>();
  status.put(Nutch.STAT_JOBS, subTools);
  results.put(Nutch.STAT_JOBS, subTools);
  // inject phase
  if (seedDir != null) {
    status.put(Nutch.STAT_PHASE, "inject");
    jobRes = runTool(InjectorJob.class, args);
    if (jobRes != null) {
      subTools.put("inject", jobRes);
    }
    status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
    if (cleanSeedDir && tmpSeedDir != null) {
      LOG.info(" - cleaning tmp seed list in " + tmpSeedDir);
      FileSystem.get(getConf()).delete(new Path(tmpSeedDir), true);
    }
  }
  if (shouldStop) {
    return results;
  }
  // run "depth" cycles
  for (int i = 0; i < depth; i++) {
    status.put(Nutch.STAT_PHASE, "generate " + i);
    jobRes = runTool(GeneratorJob.class, args);
    if (jobRes != null) {
      subTools.put("generate " + i, jobRes);
    }
    status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
    if (shouldStop) {
      return results;
    }
    status.put(Nutch.STAT_PHASE, "fetch " + i);
    jobRes = runTool(FetcherJob.class, args);
    if (jobRes != null) {
      subTools.put("fetch " + i, jobRes);
    }
    status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
    if (shouldStop) {
      return results;
    }
    if (!parse) {
      status.put(Nutch.STAT_PHASE, "parse " + i);
      jobRes = runTool(ParserJob.class, args);
      if (jobRes != null) {
        subTools.put("parse " + i, jobRes);
      }
      status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
      if (shouldStop) {
        return results;
      }
    }
    status.put(Nutch.STAT_PHASE, "updatedb " + i);
    jobRes = runTool(DbUpdaterJob.class, args);
    if (jobRes != null) {
      subTools.put("updatedb " + i, jobRes);
    }
    status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
    if (shouldStop) {
      return results;
    }
  }
  if (solrUrl != null) {
    status.put(Nutch.STAT_PHASE, "index");
    jobRes = runTool(SolrIndexerJob.class, args);
    if (jobRes != null) {
      subTools.put("index", jobRes);
    }
  }
  return results;
}
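The Nutch code above is production code and carries a lot of crawl-specific detail. Stripped down to just the chaining idea, an iterative flow can look like the minimal sketch below: the first Job writes to a temporary directory, the second Job reads that directory as its input, and the intermediate data is removed once the chain finishes. This is only an illustrative sketch, not part of Nutch; FirstMapper, FirstReducer, SecondMapper, SecondReducer and the path arguments are hypothetical placeholders you would replace with your own classes and paths.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path temp = new Path(args[1]);   // intermediate data between the two jobs
    Path output = new Path(args[2]);

    // Job 1: writes its result to the temporary directory.
    Job job1 = Job.getInstance(conf, "job-1");
    job1.setJarByClass(ChainedJobs.class);
    job1.setMapperClass(FirstMapper.class);      // placeholder mapper
    job1.setReducerClass(FirstReducer.class);    // placeholder reducer
    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job1, input);
    FileOutputFormat.setOutputPath(job1, temp);
    if (!job1.waitForCompletion(true)) {
      System.exit(1);                            // stop the chain if the first job fails
    }

    // Job 2: reads the first job's output as its input.
    Job job2 = Job.getInstance(conf, "job-2");
    job2.setJarByClass(ChainedJobs.class);
    job2.setMapperClass(SecondMapper.class);     // placeholder mapper
    job2.setReducerClass(SecondReducer.class);   // placeholder reducer
    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job2, temp);
    FileOutputFormat.setOutputPath(job2, output);
    boolean ok = job2.waitForCompletion(true);

    // The intermediate directory is no longer needed once the chain finishes;
    // whether to delete or keep it is up to your own requirements.
    FileSystem.get(conf).delete(temp, true);
    System.exit(ok ? 0 : 1);
  }
}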