In the era of big data, MapReduce is one of the core techniques for processing massive datasets. This tutorial walks you step by step through setting up a Hadoop environment on Ubuntu and writing your first MapReduce program. Whether you are new to programming or just getting started with distributed computing, you should be able to follow along.
MapReduce is a programming model proposed by Google for parallel processing of large-scale datasets. It has two core phases:

Map: each input record is transformed into intermediate key/value pairs; for word counting, every word in a line is emitted as the pair (word, 1).
Reduce: all intermediate values belonging to the same key are grouped together (the shuffle) and aggregated into a final result; for word counting, the 1s emitted for each word are summed into its total.
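To make this data flow concrete before any Hadoop is involved, here is a minimal plain-Java sketch (the class name MiniMapReduce is made up purely for illustration) that runs the phases in memory on a tiny word count:

import java.util.*;
import java.util.stream.*;

// A plain-Java sketch of the MapReduce data flow; no Hadoop involved.
// The real framework performs the same steps, distributed across machines.
public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> lines = List.of("hello world", "hello mapreduce");

        // Map phase: turn every line into a list of (word, 1) pairs.
        List<Map.Entry<String, Integer>> pairs = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        // Shuffle + reduce phase: group the pairs by key and sum the values.
        Map<String, Integer> counts = pairs.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));

        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}

Running it prints each word with its count (hello appears twice). What Hadoop adds is executing the same map and reduce logic across many machines, on far more data than fits in a single process.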
To run a MapReduce program, you first need to install Hadoop. The following steps are based on Ubuntu 22.04:
sudo apt update
sudo apt install openjdk-8-jdk -y
java -version   # verify the installation
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop

Next, configure the environment variables (edit ~/.bashrc):
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Then run source ~/.bashrc to apply the changes. Hadoop also needs JAVA_HOME to be set; on Ubuntu with OpenJDK 8 this is typically /usr/lib/jvm/java-8-openjdk-amd64, exported either here or in $HADOOP_CONF_DIR/hadoop-env.sh.
We will write the classic "word count" (WordCount) program in Java, the most basic example in any introductory MapReduce tutorial.
First, create the project directories:

mkdir -p ~/mapreduce_demo/src
mkdir -p ~/mapreduce_demo/classes

Create the file ~/mapreduce_demo/src/WordCountMapper.java:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across map() calls to avoid allocating a new object per record.
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Emit (word, 1) for every whitespace-separated token in the line.
        for (String token : line.split("\\s+")) {
            if (token.isEmpty()) continue; // leading whitespace yields an empty first token
            word.set(token);
            context.write(word, one);
        }
    }
}

Create the file ~/mapreduce_demo/src/WordCountReducer.java:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s the mappers emitted for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Create the file ~/mapreduce_demo/src/WordCount.java:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        // The reducer doubles as a combiner: summing is associative and
        // commutative, so partial sums can already be computed map-side.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compile the code with:
javac -classpath $(hadoop classpath) -d ~/mapreduce_demo/classes \
    ~/mapreduce_demo/src/*.java

Package the classes into a JAR file:
jar -cvf wordcount.jar -C ~/mapreduce_demo/classes/ .

Next, prepare a test text file (for example, input.txt) and upload it to HDFS.
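If you want a concrete sample to follow along with, input.txt might contain, say:

hello world
hello mapreduce

This content is purely illustrative; any text works. With this two-line input, the final counts printed at the end of the tutorial would be hello 2, mapreduce 1, and world 1.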
Create an input directory in HDFS and upload the file:

hdfs dfs -mkdir /input
hdfs dfs -put input.txt /input

(These hdfs dfs commands assume HDFS is configured and running, for example started with start-dfs.sh after a pseudo-distributed setup; under Hadoop's default standalone configuration you would instead pass ordinary local directories to the job.) Now run the MapReduce job:
hadoop jar wordcount.jar WordCount /input /output

View the result:
hdfs dfs -cat /output/part-r-00000

With this tutorial, you have learned the basics of writing and running MapReduce programs in a Hadoop environment on Ubuntu. Although MapReduce has been partly superseded by newer frameworks such as Spark, understanding how it works remains essential groundwork for distributed computing.
Now try modifying the WordCount program, for example to ignore case or filter out punctuation, and consolidate what you have learned.
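As a starting point, here is a sketch of one way to do both at once (the class name NormalizingWordCountMapper is invented for this example): lower-case every token and strip anything that is not a letter or digit before emitting it.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NormalizingWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            // Normalize: lower-case, then drop everything except letters and digits.
            String cleaned = token.toLowerCase().replaceAll("[^\\p{L}\\p{N}]", "");
            if (!cleaned.isEmpty()) {   // skip tokens that were pure punctuation
                word.set(cleaned);
                context.write(word, one);
            }
        }
    }
}

To try it, swap it into the driver with job.setMapperClass(NormalizingWordCountMapper.class) and recompile; the reducer can stay unchanged, since it just sums whatever pairs the mapper emits.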
Good luck on your journey into big-data development!