
Notes on Apress Pro Hadoop

Published: 2012-11-20 09:55:43 Author: rapoo

Notes on Apress Pro Hadoop (in progress...)

The electronic edition can be downloaded at http://caibinbupt.iteye.com/blog/418846


Getting Started with Hadoop Core

Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer fit on a single cost-effective computer.


A simple but expensive solution has been to buy specialty machines that have a lot of memory and many CPUs. This solution scales as far as what is supported by the fastest machines available, and usually the only limiting factor is your budget.


An alternative solution is to build a high-availability cluster.


MapReduce Model:

- Map: An initial ingestion and transformation step, in which individual input records can be processed in parallel.
- Reduce: An aggregation or summarization step, in which all associated records must be processed together by a single entity.
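As a hedged illustration of the two roles (not from the book; class names are mine), a word-count style pair in the old org.apache.hadoop.mapred API could look like this:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: each input record is handled independently, so map tasks run in parallel.
class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE); // emit (word, 1)
    }
  }
}

// Reduce: all values for a given key are delivered together to one reducer.
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // aggregate the counts for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}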


One MapReduce application was a specialized web crawler that received large sets of URLs as input. The job had several steps:

1. Ingest the URLs.
2. Normalize the URLs.
3. Eliminate duplicate URLs.
4. Filter the URLs.
5. Fetch the URLs.
6. Fingerprint the content items.
7. Update the recently seen set.
8. Prepare the work list for the next application.


The Hadoop-based application ran faster and performed well.



Introducing Hadoop

Hadoop is a top-level project at Apache, providing and supporting the development of open source software that supplies a framework for the development of highly scalable distributed computing applications.

The two fundamental pieces of Hadoop are the MapReduce framework and the Hadoop Distributed File System (HDFS).

The MapReduce framework requires a shared file system, such as HDFS, S3, NFS, or GFS, but HDFS is the best fit.
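(For reference: in this era of Hadoop the shared file system in use is chosen by the fs.default.name configuration setting; an hdfs:// URI there selects HDFS.)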


Introducing MapReduce


A MapReduce job requires the following (a driver sketch tying these together appears after the list):

1. The locations of the input in the distributed file system.
2. The locations for the output in the distributed file system.
3. The input format.
4. The output format.
5. The class containing the map function.
6. Optionally, the class containing the reduce function.
7. The JAR file(s) containing the above classes.
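A minimal driver sketch covering the seven items (paths are made up; the mapper and reducer are the sketches from earlier):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // 7: the JAR containing these classes is located via the class argument
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));   // 1: input location
    FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output")); // 2: output location
    conf.setInputFormat(TextInputFormat.class);   // 3: input format
    conf.setOutputFormat(TextOutputFormat.class); // 4: output format
    conf.setMapperClass(WordCountMapper.class);   // 5: the class with the map function
    conf.setReducerClass(WordCountReducer.class); // 6: the (optional) class with the reduce function

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf); // submit the job and wait for it to finish
  }
}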


If a job does not need a reduce function, the framework will partition the input and schedule and execute the map tasks across the cluster. If a reduce is requested, the framework will sort the results of the map tasks and execute the reduce tasks with the map output. The final output will be moved to the output directory, and the job status reported to the user.
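In the old API this map-only mode is requested explicitly; against the driver sketch above it is one line:

conf.setNumReduceTasks(0); // no reduce phase: map output goes straight to the output directory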


Managing MapReduce jobs:

There are two server processes that manage jobs:

The TaskTracker manages the execution of individual map and reduce tasks on a compute node in the cluster.

The JobTracker accepts job submissions, provides job monitoring and control, and manages the distribution of tasks to the TaskTracker nodes.

Note: One nice feature is that you can add TaskTrackers to the cluster while a job is running and have the job spread onto the new nodes.


Introducing HDFS


HDFS is designed for use by MapReduce jobs that read input in large chunks and write large chunks of output. HDFS stores files in blocks that are duplicated on several machines; this duplication is referred to as replication in Hadoop.
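As a hedged sketch (the path is made up), the FileSystem API is how a program reads and writes those large chunks as ordinary streams:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up the site configuration files
    FileSystem fs = FileSystem.get(conf);     // the configured default file system (HDFS here)
    Path p = new Path("/user/demo/hello.txt");

    FSDataOutputStream out = fs.create(p);    // blocks are replicated as they are written
    out.writeUTF("hello hdfs");
    out.close();

    FSDataInputStream in = fs.open(p);
    System.out.println(in.readUTF());
    in.close();
  }
}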


Installing Hadoop

The prerequisites:

1. Fedora 8
2. JDK 1.6
3. Hadoop 0.19 or later

Go to the Hadoop download site at http://www.apache.org/dyn/closer.cgi/hadoop/core/. Find the .tar.gz file, download it, and untar it. Then set:

export HADOOP_HOME=[yourdirectory]
export PATH=${HADOOP_HOME}/bin:${PATH}

Last, verify the installation.
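One quick check: with ${HADOOP_HOME}/bin on your PATH as above, running hadoop version should print the release you just unpacked.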

Running Examples and Tests

This section demonstrates all the examples... :)
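One concrete smoke test (the examples JAR name varies by release) is the bundled pi estimator, run as hadoop jar ${HADOOP_HOME}/hadoop-*-examples.jar pi 10 100, which starts 10 map tasks of 100 samples each and prints an estimate of pi.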


Chapter 2: The Basics of a MapReduce Job

This chapter examines the parts that make up a MapReduce job.

The user is responsible for handling the job setup, specifying the input locations, specifying ...

?

Here is a simple example, the default no-op implementations from MapReduceBase:

/** Default implementation that does nothing. */
public void close() throws IOException {}

/** Default implementation that does nothing. */
public void configure(JobConf job) {}

?

The configure method is the way to gain access to the JobConf of the job.

The close method is the place to release resources and perform other end-of-task cleanup.
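A hedged sketch of a mapper using both hooks (the my.example.prefix parameter is made up):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PrefixMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private String prefix;
  private final Text outKey = new Text();
  private final Text outValue = new Text();

  /** Called once before any map() call: our window into the JobConf. */
  public void configure(JobConf job) {
    prefix = job.get("my.example.prefix", ">"); // hypothetical job parameter
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    outKey.set(key.toString());
    outValue.set(prefix + value.toString());
    output.collect(outKey, outValue);
  }

  /** Called once after the last map() call: release files, connections, and so on. */
  public void close() throws IOException {
    // nothing held open in this sketch
  }
}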

?

The Makeup of a Cluster

In the context of Hadoop, a node/machine running the TaskTracker or DataNode server is considered a slave node. It is common to have nodes that run both the TaskTracker and DataNode servers. The Hadoop server processes on the slave nodes are controlled by their respective masters, the JobTracker and NameNode servers.

