Hands-On Introduction to Hadoop, an Open-Source Distributed Computing Framework (Part 3)


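The flow chart was too large for a single image, so it is split into two parts. Using the flow chart, let us walk through how a single job is executed.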

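1. The input format is validated against the input definition in the JobConf. You already know these types when you implement Map and build the Conf; if they are not defined, any subclass of Writable is accepted.
2. The input files are cut into logical inputs called InputSplits. This follows from what was mentioned earlier: the distributed file system limits the block size, so a large file is divided into multiple blocks.
3. A RecordReader then turns each InputSplit into a set of records and passes them to Map. (The InputSplit is only the first, logical step of the split. How records are cut out of the file content is up to the RecordReader; the simplest, default behaviour is to split on line breaks.)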
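To make the two-stage split concrete, here is a minimal plain-Java sketch that does not use any Hadoop classes. It assumes one logical split per block and one record per line; the class name SplitSketch, its methods, and the 100 MB file size are hypothetical and chosen only for illustration. Hadoop's real implementation also handles lines that straddle a split boundary, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: logical splits by block size, then line-based records.
class SplitSketch {

    /** Returns {offset, length} pairs that cover a file of fileLength bytes. */
    static List<long[]> computeSplits(long fileLength, long blockSize) {
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            long length = Math.min(blockSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    /** Cuts the text of one split into records, one per line (\n or \r\n). */
    static List<String> readRecords(String splitText) {
        List<String> records = new ArrayList<>();
        for (String line : splitText.split("\r?\n")) {
            if (!line.isEmpty()) {   // blank lines are skipped here for brevity
                records.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical 100 MB input file with two different block sizes.
        System.out.println(computeSplits(100 * mb, 5 * mb).size());   // 20
        System.out.println(computeSplits(100 * mb, 50 * mb).size());  // 2
        System.out.println(readRecords("a,b\r\nc,d\n"));
    }
}
```

The same file yields ten times as many splits with a 5 MB block size as with a 50 MB one, which is why block size shows up so strongly in the cluster tests later in this article.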

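Business Scenario and Code Examples

Scenario: the input and output paths are configurable (they are operating-system paths, not HDFS paths). Based on the access logs, the job computes the total number of calls and the total traffic that each application generates against each API, and writes the two statistics to two separate files. Because this is only a test, the code is not split into many classes; everything is folded into a single class to keep the explanation simple.

Class diagram of the test code

LogAnalysiser is the main class. It creates and submits the job and prints some information. The nested classes inside it correspond to the roles described in the flow above. Here are the code fragments for the individual classes and methods: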

LogAnalysiser::MapClass

    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable>
    {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException
        {
            // No RecordReader is configured, so the default line-based one is used:
            // key is the byte offset of the line in the file, value is the line text.
            String line = value.toString();
            if (line == null || line.equals(""))
                return;
            String[] words = line.split(",");
            if (words == null || words.length < 8)
                return;
            String appid = words[1];
            String apiName = words[2];
            LongWritable recbytes = new LongWritable(Long.parseLong(words[7]));
            Text record = new Text();
            record.set(new StringBuffer("flow::").append(appid)
                    .append("::").append(apiName).toString());
            reporter.progress();
            // Emit the traffic statistic, marked with the "flow::" prefix.
            output.collect(record, recbytes);
            record.clear();
            record.set(new StringBuffer("count::").append(appid)
                    .append("::").append(apiName).toString());
            // Emit the call-count statistic, marked with the "count::" prefix.
            output.collect(record, new LongWritable(1));
        }
    }
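The effect of the map step is easiest to see on a single line. The following plain-Java sketch mirrors the parsing logic without Hadoop types and returns the emitted pairs as strings. The sample log line is hypothetical: the article only tells us that the line is comma-separated, that fields 1 and 2 are the application id and API name, and that field 7 is the byte count.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: what the map step emits for one log line.
class MapEmitSketch {

    /** Returns "key=value" strings instead of calling an OutputCollector. */
    static List<String> emit(String line) {
        List<String> out = new ArrayList<>();
        if (line == null || line.isEmpty()) {
            return out;
        }
        String[] words = line.split(",");
        if (words.length < 8) {
            return out;                       // malformed line, emit nothing
        }
        String suffix = words[1] + "::" + words[2];
        long bytes = Long.parseLong(words[7]);
        out.add("flow::" + suffix + "=" + bytes);   // traffic record
        out.add("count::" + suffix + "=1");         // one call
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical log line; only positions 1, 2 and 7 matter.
        String sample = "2008-08-01,app01,item.get,f3,f4,f5,f6,512";
        for (String kv : emit(sample)) {
            System.out.println(kv);
        }
        // prints:
        // flow::app01::item.get=512
        // count::app01::item.get=1
    }
}
```

Each input line therefore produces two records, and the prefix is the only thing that tells later stages which statistic a record belongs to.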

LogAnalysiser::PartitionerClass

    public static class PartitionerClass implements Partitioner<Text, LongWritable>
    {
        public int getPartition(Text key, LongWritable value, int numPartitions)
        {
            // numPartitions is the number of reduce tasks. Traffic records and
            // call-count records are routed to different reducers.
            if (numPartitions >= 2)
            {
                if (key.toString().startsWith("flow::"))
                    return 0;
                else
                    return 1;
            }
            else
            {
                return 0;
            }
        }

        public void configure(JobConf job) {}
    }
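As a rough illustration of why a custom partitioner is used here, the sketch below compares the prefix-based routing with hash-based routing of the form (hashCode & Integer.MAX_VALUE) % numPartitions, which is how Hadoop's default HashPartitioner assigns keys. The class name PartitionSketch and the sample keys are hypothetical.

```java
// Illustration only: prefix-based routing versus hash-based routing.
class PartitionSketch {

    /** Same decision as PartitionerClass, on plain strings. */
    static int prefixPartition(String key, int numPartitions) {
        if (numPartitions >= 2) {
            return key.startsWith("flow::") ? 0 : 1;
        }
        return 0;
    }

    /** Hash-based assignment, in the style of the default partitioner. */
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {
            "flow::app01::item.get", "count::app01::item.get",
            "flow::app02::user.get", "count::app02::user.get"
        };
        for (String key : keys) {
            System.out.println(key
                + " prefix->" + prefixPartition(key, 2)
                + " hash->" + hashPartition(key, 2));
        }
    }
}
```

With prefix routing every "flow::" key lands on reducer 0 and every "count::" key on reducer 1, so each output file contains exactly one kind of statistic. Hash routing gives no such guarantee.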

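LogAnalysiser::CombinerClass

See ReduceClass. Usually the same class can serve as both combiner and reducer, but here the processing differs slightly, so they are two separate classes. The lines shown in blue in ReduceClass do not exist in CombinerClass.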

LogAnalysiser::ReduceClass

    public static class ReduceClass extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable>
    {
        public void reduce(Text key, Iterator<LongWritable> values,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException
        {
            // Strip the "flow::" or "count::" prefix from the key.
            Text newkey = new Text();
            newkey.set(key.toString().substring(key.toString().indexOf("::") + 2));
            LongWritable result = new LongWritable();
            long tmp = 0;
            int counter = 0;
            while (values.hasNext()) // accumulate all values of the same key
            {
                tmp = tmp + values.next().get();
                counter = counter + 1;
                // If processing takes too long, the JobTracker may conclude that the
                // TaskTracker has failed because it has not received a report for a
                // long time, so report progress periodically.
                if (counter == 1000)
                {
                    counter = 0;
                    reporter.progress();
                }
            }
            result.set(tmp);
            output.collect(newkey, result); // emit the final total
        }
    }
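The article does not show CombinerClass. One plausible reading, which is an inference and not something the code shown here confirms, is that the combiner performs the same summation but keeps the key unchanged, because the reducer strips the prefix afterwards. The sketch below uses plain Java and hypothetical values to show the partial sums and what would happen if the prefix were stripped twice.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: partial sums in a combiner, prefix stripping in the reducer.
class CombinerVsReducerSketch {

    /** The summation shared by combiner and reducer. */
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    /** The key rewrite that ReduceClass applies. */
    static String stripPrefix(String key) {
        return key.substring(key.indexOf("::") + 2);
    }

    public static void main(String[] args) {
        String key = "flow::app01::item.get";

        // Hypothetical map outputs for the same key on two nodes.
        long partialA = sum(Arrays.asList(512L, 256L));   // combiner on node A
        long partialB = sum(Arrays.asList(100L));         // combiner on node B

        // The reducer sums the partial results and strips the prefix once.
        System.out.println(stripPrefix(key) + " " + sum(Arrays.asList(partialA, partialB)));
        // prints: app01::item.get 868

        // Stripping in the combiner as well would damage the key.
        System.out.println(stripPrefix(stripPrefix(key)));
        // prints: item.get
    }
}
```

Because addition is associative, summing partial sums gives the same total as summing the raw values, which is what makes a combiner safe for this job.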

LogAnalysiser

    public static void main(String[] args)
    {
        try
        {
            run(args);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

    public static void run(String[] args) throws Exception
    {
        if (args == null || args.length < 2)
        {
            System.out.println("need inputpath and outputpath");
            return;
        }
        String inputpath = args[0];
        String outputpath = args[1];

        // Derive the HDFS paths from the last component of the local paths.
        String shortin = args[0];
        String shortout = args[1];
        if (shortin.indexOf(File.separator) >= 0)
            shortin = shortin.substring(shortin.lastIndexOf(File.separator));
        if (shortout.indexOf(File.separator) >= 0)
            shortout = shortout.substring(shortout.lastIndexOf(File.separator));
        SimpleDateFormat formater = new SimpleDateFormat("yyyy.MM.dd");
        shortout = new StringBuffer(shortout).append("-")
                .append(formater.format(new Date())).toString();
        if (!shortin.startsWith("/"))
            shortin = "/" + shortin;
        if (!shortout.startsWith("/"))
            shortout = "/" + shortout;
        shortin = "/user/root" + shortin;
        shortout = "/user/root" + shortout;

        File inputdir = new File(inputpath);
        File outputdir = new File(outputpath);
        if (!inputdir.exists() || !inputdir.isDirectory())
        {
            System.out.println("inputpath not exist or isn't dir!");
            return;
        }
        if (!outputdir.exists())
        {
            new File(outputpath).mkdirs();
        }

        // Build the job configuration.
        JobConf conf = new JobConf(new Configuration(), LogAnalysiser.class);
        FileSystem fileSys = FileSystem.get(conf);
        // Copy the files from the local file system into HDFS.
        fileSys.copyFromLocalFile(new Path(inputpath), new Path(shortin));

        conf.setJobName("analysisjob");
        conf.setOutputKeyClass(Text.class);           // output key type, checked by the OutputFormat
        conf.setOutputValueClass(LongWritable.class); // output value type, checked by the OutputFormat
        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(CombinerClass.class);
        conf.setReducerClass(ReduceClass.class);
        conf.setPartitionerClass(PartitionerClass.class);
        // Force two reduce tasks: one for traffic, one for call counts.
        conf.set("mapred.reduce.tasks", "2");
        FileInputFormat.setInputPaths(conf, shortin);              // input path in HDFS
        FileOutputFormat.setOutputPath(conf, new Path(shortout));  // output path in HDFS

        Date startTime = new Date();
        System.out.println("Job started: " + startTime);
        JobClient.runJob(conf);
        Date end_time = new Date();
        System.out.println("Job ended: " + end_time);
        System.out.println("The job took "
                + (end_time.getTime() - startTime.getTime()) / 1000 + " seconds.");

        // Copy the result back to the local output directory, then delete the
        // temporary input and output directories in HDFS.
        fileSys.copyToLocalFile(new Path(shortout), new Path(outputpath));
        fileSys.delete(new Path(shortin), true);
        fileSys.delete(new Path(shortout), true);
    }
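The path handling at the start of run is easy to misread, so here is a small plain-Java sketch of the same derivation. To keep the result deterministic it assumes "/" as the separator and takes the date suffix as a parameter; the class name HdfsPathSketch, the method toHdfsPath, and the date shown are hypothetical.

```java
// Illustration only: how a local path is mapped to a temporary HDFS path.
class HdfsPathSketch {

    /** Keeps the last path component, appends the suffix, prefixes /user/root. */
    static String toHdfsPath(String localPath, String suffix) {
        String name = localPath;
        int idx = name.lastIndexOf('/');
        if (idx >= 0) {
            name = name.substring(idx);   // keeps the leading "/"
        }
        name = name + suffix;
        if (!name.startsWith("/")) {
            name = "/" + name;
        }
        return "/user/root" + name;
    }

    public static void main(String[] args) {
        System.out.println(toHdfsPath("/home/wenchu/test-in", ""));
        // prints: /user/root/test-in
        System.out.println(toHdfsPath("/home/wenchu/test-out", "-2008.08.01"));
        // prints: /user/root/test-out-2008.08.01
        System.out.println(toHdfsPath("/home/wenchu/test-in/", ""));
        // prints: /user/root/
    }
}
```

The last call shows an edge case of this logic: a local path with a trailing separator loses its directory name, so it seems safer to pass the paths without one.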

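The code above covers all of the business logic. What is still needed is a driver class that registers the business class under a recognizable command name, so that hadoop jar can run it.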

    import org.apache.hadoop.util.ProgramDriver;

    public class ExampleDriver
    {
        public static void main(String argv[])
        {
            ProgramDriver pgd = new ProgramDriver();
            try
            {
                // Register LogAnalysiser under the command name "analysislog".
                pgd.addClass("analysislog", LogAnalysiser.class,
                        "A map/reduce program that analyzes logs.");
                pgd.driver(argv);
            }
            catch (Throwable e)
            {
                e.printStackTrace();
            }
        }
    }
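Conceptually, a driver of this kind maps a command name to a class and forwards the remaining arguments to that class's main method. The sketch below shows that idea with plain Java reflection. It is not Hadoop's ProgramDriver implementation; DriverSketch, its methods, and the nested Hello program are hypothetical stand-ins.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: dispatch from a command name to a class's main method.
class DriverSketch {
    private final Map<String, Class<?>> programs = new LinkedHashMap<>();

    void addClass(String name, Class<?> mainClass) {
        programs.put(name, mainClass);
    }

    /** Runs the program named by argv[0]; returns false if the name is unknown. */
    boolean driver(String[] argv) {
        if (argv.length == 0 || !programs.containsKey(argv[0])) {
            System.out.println("Valid program names: " + programs.keySet());
            return false;
        }
        String[] rest = Arrays.copyOfRange(argv, 1, argv.length);
        try {
            Method main = programs.get(argv[0]).getMethod("main", String[].class);
            main.invoke(null, (Object) rest);   // static method, so no receiver
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
        return true;
    }

    /** Hypothetical stand-in for a business class such as LogAnalysiser. */
    public static class Hello {
        public static void main(String[] args) {
            System.out.println("hello, " + args.length + " argument(s)");
        }
    }

    public static void main(String[] args) {
        DriverSketch driver = new DriverSketch();
        driver.addClass("hello", Hello.class);
        driver.driver(new String[] {"hello", "in", "out"});
        // prints: hello, 2 argument(s)
    }
}
```

This is why the command line shown later passes the name analysislog first and the two paths after it: the name selects the class, and the paths become its arguments.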

Package the code into a jar and set the jar's main class to ExampleDriver. Once the distributed environment is running, execute the following command:

hadoop jar analysiser.jar analysislog /home/wenchu/test-in /home/wenchu/test-out

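The log files to analyze are in /home/wenchu/test-in. After you run the command, the whole execution is displayed, including the progress of Map and Reduce. When it finishes, the output appears under /home/wenchu/test-out as two files, part-00000 and part-00001, which hold the two sets of statistics (given the partitioner above, traffic totals in part-00000 and call counts in part-00001). For details of the run, look at _logs/history/xxxx_analysisjob in the output directory, which lists how every Map and Reduce task was created and executed. While the job is running, you can also watch Map and Reduce in a browser: http://MasterIP:50030/jobtracker.jsp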

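Hadoop Cluster Test

This test uses the example above without much tuning. The results are only meant to show the effect of the cluster and the influence of a few configuration parameters.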

Replication factor 1, block size 5 MB

    Slaves    Records (x10,000)    Time (s)
    2         95                   38
    2         950                  337
    4         95                   24
    4         950                  178
    6         95                   21
    6         950                  114
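One way to read the table for replication factor 1 and a 5 MB block size is to compute the speedup relative to the two-slave run. The sketch below does that arithmetic with the measured times from the table; the class name SpeedupSketch and its method are hypothetical.

```java
import java.util.Locale;

// Illustration only: speedup relative to the 2-slave baseline.
class SpeedupSketch {

    /** Speedup of a run relative to a baseline run on the same data. */
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    public static void main(String[] args) {
        int[] slaves = {2, 4, 6};
        int[] small = {38, 24, 21};      // seconds for 950,000 records
        int[] large = {337, 178, 114};   // seconds for 9,500,000 records
        for (int i = 0; i < slaves.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                "%d slaves: small x%.2f, large x%.2f (ideal x%.2f)",
                slaves[i],
                speedup(small[0], small[i]),
                speedup(large[0], large[i]),
                slaves[i] / (double) slaves[0]));
        }
    }
}
```

With these numbers the larger input scales almost linearly (about 1.89x with four slaves and 2.96x with six, against an ideal of 2x and 3x), while the smaller input gains much less (about 1.58x and 1.81x). Fixed overhead appears to dominate when there is little data to process.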

Block size 5 MB

    Slaves    Replication factor    Records (x10,000)    Time (s)
    2         1                     950                  337
    2         3                     950                  339
    6         1                     950                  114
    6         3                     950                  117

Replication factor 1

    Slaves    Block size    Records (x10,000)    Time (s)
    6         5 MB          95                   21
    6         77 MB         95                   26
    4         5 MB          950                  178
    4         50 MB         950                  54
    6         5 MB          950                  114
    6         50 MB         950                  44
    6         77 MB         950                  74

The results are very stable: repeated runs under the same conditions give essentially the same numbers. The results show the following:

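1. More machines do help performance (which goes without saying ^_^).
2. A higher replication factor helps only data safety and does little for performance. Because this test copies operating-system files into HDFS, more replicas also make the preparation phase much longer.
3. Block size has a large effect on performance. If blocks are too small, the number of map tasks goes up, and with it the coordination cost, which lowers performance. If blocks are too large, the job cannot be parallelized to the fullest. The value therefore has to be chosen according to the amount of data being processed.
4. Finally, beyond the results listed in these tables, read the xxx_analysisjob file under _logs/history in the output directory carefully. It records the entire execution along with the reads and writes, and gives a much clearer picture of where time is likely being spent.

Afterthoughts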

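"Cloud computing" is too hot to touch right now, and like SaaS, Web 2.0, and SNS it is mostly hype. Only large internet companies that do solid, down-to-earth work invest the people and resources to research distributed computing that fits their own needs. When your data volume is not that large, this kind of distributed computing is little more than a toy. Its deeper problems only surface when it is used to solve real problems.

These three articles (this one, "Introduction to Hadoop, an Open-Source Distributed Computing Framework", and "Cluster Configuration and Usage Tips for Hadoop") are only meant as a starting point for readers who are interested in distributed computing. If you really want to strike gold, use it, think about it, and analyze it in earnest. You may then go further and study how the framework is implemented, and contribute something back while solving your own problems.

A few days ago I saw someone begging to be told how to become an architect. I found it a little sad and a little laughable. How many architects actually know what architecture is, or what an architect's responsibilities are? Rather than chase the title, it is better to be a stone that quietly sinks to the bottom of the water. Accumulating and settling is itself a form of growth.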
