
Big Data Exam Question-Guessing Series

Preface

The exam is almost here, so I am trying to write down a batch of likely question types. It may turn out to be useful, or it may not.

Summary of question types that may appear on the big data exam

Spark series

Spark word count

Reference:

Prerequisites:

Create the file /1900301538/spark/input/wordcount.txt on your HDFS server.

The contents of the file do not matter, as long as it is made up of individual words; any other text will also do, for example:

wordcount.txt

hello world
hello hadoop
hello mapreduce
hello spark

The code is as follows:

val text = sc.textFile("hdfs://hadoop01:9000/1900301538/spark/input/wordcount.txt")
val counts = text.flatMap(line => line.split(" "))
var wordcount = counts.map(word => (word, 1))
wordcount = wordcount.reduceByKey(_ + _)
wordcount.foreach(println)
wordcount.saveAsTextFile("hdfs://hadoop01:9000/1900301538/spark/output")

Explanation:

val text = sc.textFile("hdfs://hadoop01:9000/1900301538/spark/input/wordcount.txt")
// read the file from HDFS on hadoop01 into the RDD text
val counts = text.flatMap(line => line.split(" "))
// split every line on spaces and flatten the pieces into a single RDD of words
var wordcount = counts.map(word => (word, 1))
// turn each word into a (word, 1) pair
wordcount = wordcount.reduceByKey(_ + _)
// add up the 1s for each distinct word
wordcount.foreach(println)
// print each (word, count) pair in the terminal
wordcount.saveAsTextFile("hdfs://hadoop01:9000/1900301538/spark/output")
// save the result to the given output directory on HDFS

The output is as follows:

(spark,1)
(hadoop,1)
(mapreduce,1)
(hello,4)
(world,1)

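Note: saveAsTextFile will throw an error if the output directory /1900301538/spark/output already exists, so delete it before re-running; the result is also written as part-00000 style files inside that directory rather than as a single file.
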
Spark: estimating the value of Pi

Reference:

The code is as follows:

var NUM_SAMPLES = 100000
val count = sc.parallelize(1 to NUM_SAMPLES).filter { _ =>
val x = math.random
val y = math.random
x*x + y*y < 1
}.count()
println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")

Explanation:

var NUM_SAMPLES = 100000
// the number of random samples; the larger it is, the better the estimate
val count = sc.parallelize(1 to NUM_SAMPLES).filter { _ =>
// distribute the numbers 1 to NUM_SAMPLES as an RDD and keep only the samples that pass the test below
val x = math.random
// draw a random x coordinate in [0, 1)
val y = math.random
// draw a random y coordinate in [0, 1)
x*x + y*y < 1
// keep the sample if the point (x, y) falls inside the quarter of the unit circle
}.count()
// count how many samples were kept
println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")
// the kept fraction approximates pi/4, so multiplying it by 4 gives the estimate of pi

MapReduce series

Source:

Data deduplication

Prerequisites:

Create the following files:

a.txt

2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-7 c
2012-3-3 c

b.txt

2012-3-1 b
2012-3-2 a
2012-3-3 b
2012-3-4 d
2012-3-5 a
2012-3-6 c
2012-3-7 d
2012-3-3 c

The code is as follows:

package hadoopdemo;

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Remove_same {

    private static final String HDFS = "hdfs://hadoop01:9000/";

    // The map copies the input value (the whole line) into the output key and emits it as-is
    public static class Map extends Mapper<Object, Text, Text, Text> {

        private static Text line = new Text(); // one line of data

        // map implementation
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            line = value;
            context.write(line, new Text(""));
        }
    }

    // The reduce copies the input key into the output key and emits it as-is,
    // so every distinct line appears exactly once
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        // reduce implementation
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {

        Properties properties = System.getProperties();
        properties.setProperty("HADOOP_USER_NAME", "hadoop");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", HDFS);
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
        conf.set("dfs.client.use.datanode.hostname", "true");

        Tools tool = new Tools(HDFS, conf);
        if (tool.exists("/1900301538/Remove_same"))
            tool.rmr("/1900301538/Remove_same");
        tool.mkdirs("/1900301538/Remove_same");
        tool.mkdirs("/1900301538/Remove_same/input");
        tool.copyFile("D:\\li\\a.txt", "/1900301538/Remove_same/input/a.txt");
        tool.copyFile("D:\\li\\b.txt", "/1900301538/Remove_same/input/b.txt");

        Job job = Job.getInstance(conf, "data deduplication");
        // job.setJarByClass(Remove_same.class);

        // set the Mapper, Combiner, and Reducer classes
        // (the reducer doubles as the combiner here because it only re-emits keys)
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        // set the output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path("/1900301538/Remove_same/input/"));
        FileOutputFormat.setOutputPath(job, new Path("/1900301538/Remove_same/output/"));

        // wait for the job to finish before reading its output
        boolean ok = job.waitForCompletion(true);
        tool.cat("/1900301538/Remove_same/output/part-r-00000");
        System.exit(ok ? 0 : 1);
    }
}

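All of the MapReduce examples in this section call a small Tools helper class that these notes never show. Below is only a minimal sketch of what such a helper might look like, assuming it does nothing more than wrap Hadoop's FileSystem API; the method names (exists, rmr, mkdirs, copyFile, cat) simply mirror the calls made in the main methods.

package hadoopdemo;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: a thin wrapper around the HDFS FileSystem API
public class Tools {

    private final FileSystem fs;

    public Tools(String hdfsUri, Configuration conf) throws Exception {
        this.fs = FileSystem.get(URI.create(hdfsUri), conf);
    }

    // check whether a path exists on HDFS
    public boolean exists(String path) throws Exception {
        return fs.exists(new Path(path));
    }

    // delete a directory recursively (like "hadoop fs -rm -r")
    public void rmr(String path) throws Exception {
        fs.delete(new Path(path), true);
    }

    // create a directory, including any missing parents
    public void mkdirs(String path) throws Exception {
        fs.mkdirs(new Path(path));
    }

    // upload a local file to HDFS
    public void copyFile(String localPath, String hdfsPath) throws Exception {
        fs.copyFromLocalFile(new Path(localPath), new Path(hdfsPath));
    }

    // print a text file stored on HDFS to the console (like "hadoop fs -cat")
    public void cat(String path) throws Exception {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(new Path(path))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

With a helper along these lines, each main method can wipe and recreate its working directory, upload the local input files, and print the job output after it finishes.
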
Output:

2012-3-1 a	
2012-3-1 b
2012-3-2 a
2012-3-2 b
2012-3-3 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-6 c
2012-3-7 c
2012-3-7 d

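Note that the deduplicated records come out sorted: MapReduce sorts keys before handing them to the reducer, and with a single reducer the whole output file therefore ends up in key order.
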
Computing average scores

Prerequisites:

database.txt

小明 95
小红 81
小新 89
小丽 85

python.txt

小明 82
小红 83
小新 94
小丽 91

c++.txt

小明 92
小红 87
小新 82
小丽 90

The code is as follows:

package hadoopdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.Properties;
import java.util.StringTokenizer;

public class average_s {

    private static final String HDFS = "hdfs://hadoop01:9000/";

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

        // map implementation
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // convert the input text to a String
            String line = value.toString();
            // split the input into lines first (TextInputFormat already delivers one line at a time,
            // so this loop normally runs once)
            StringTokenizer tokenizerArticle = new StringTokenizer(line, "\n");
            // process each line
            while (tokenizerArticle.hasMoreElements()) {

                // split the line on whitespace
                StringTokenizer tokenizerLine = new StringTokenizer(tokenizerArticle.nextToken());

                String strName = tokenizerLine.nextToken();  // student name
                String strScore = tokenizerLine.nextToken(); // score

                Text name = new Text(strName);
                int scoreInt = Integer.parseInt(strScore);

                // emit (name, score)
                context.write(name, new IntWritable(scoreInt));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        // reduce implementation
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {

            int sum = 0;
            int count = 0;
            for (IntWritable value : values) {
                sum += value.get(); // total score
                count++;            // number of subjects
            }
            int average = sum / count; // integer average

            context.write(key, new IntWritable(average));
        }
    }

    public static void main(String[] args) throws Exception {

        Properties properties = System.getProperties();
        properties.setProperty("HADOOP_USER_NAME", "hadoop");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", HDFS);
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
        conf.set("dfs.client.use.datanode.hostname", "true");

        Tools tool = new Tools(HDFS, conf);
        if (tool.exists("/1900301538/average_s"))
            tool.rmr("/1900301538/average_s");
        tool.mkdirs("/1900301538/average_s");
        tool.mkdirs("/1900301538/average_s/input");
        tool.copyFile("D:\\li\\python.txt", "/1900301538/average_s/input/python.txt");
        tool.copyFile("D:\\li\\c++.txt", "/1900301538/average_s/input/c++.txt");
        tool.copyFile("D:\\li\\database.txt", "/1900301538/average_s/input/database.txt");

        Job job = Job.getInstance(conf, "average score");

        // set the Mapper and Reducer classes
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // set the output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path("/1900301538/average_s/input/"));
        FileOutputFormat.setOutputPath(job, new Path("/1900301538/average_s/output/"));
        job.waitForCompletion(true);
        tool.cat("/1900301538/average_s/output/part-r-00000");
    }
}

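Unlike the deduplication job, this one does not reuse its reducer as a combiner: averaging the partial averages from each map task would lose the per-subject counts and give the wrong result. Also note that sum / count is integer division, which is why, for example, 小丽's average of 88.67 is reported as 88.
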
Output:

小丽	88
小新 88
小明 89
小红 83

Multi-table join

Prepare the following files:

factory.txt

factoryname                    addressed
Beijing Red Star     1
Shenzhen Thunder     3
Guangzhou Honda     2
Beijing Rising     1
Guangzhou Development Bank 2
Tencent         3
Back of Beijing      1

address.txt

addressID    addressname
1     Beijing
2     Guangzhou
3     Shenzhen
4     Xian

Code:

package hadoopdemo;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Factory_where {

    private static final String HDFS = "hdfs://hadoop01:9000/";
    public static int time = 0;

    /*
     * The map first decides whether an input line belongs to the left table (factory)
     * or the right table (address), then splits it into two parts: the join column
     * (the address ID) goes into the key, while the remaining column plus a
     * left/right-table tag goes into the value.
     */
    public static class Map extends Mapper<Object, Text, Text, Text> {

        // map implementation
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {

            String line = value.toString();              // one input line
            String relationtype = new String();          // left/right-table tag

            // skip the header line of either file
            if (line.contains("factoryname") || line.contains("addressed")) {
                return;
            }

            // tokenize the line
            StringTokenizer itr = new StringTokenizer(line);
            String mapkey = new String();
            String mapvalue = new String();
            int i = 0;
            while (itr.hasMoreTokens()) {
                // read one token
                String token = itr.nextToken();
                // a token starting with a digit is the address ID, which becomes the key
                if (token.charAt(0) >= '0' && token.charAt(0) <= '9') {
                    mapkey = token;
                    if (i > 0) {
                        relationtype = "1"; // the ID came last: factory (left) table
                    } else {
                        relationtype = "2"; // the ID came first: address (right) table
                    }
                    continue;
                }
                // accumulate the factory or address name
                mapvalue += token + " ";
                i++;
            }
            // emit the tagged record
            context.write(new Text(mapkey), new Text(relationtype + "+" + mapvalue));
        }
    }

    /*
     * The reduce parses the map output, stores the values in separate left-table and
     * right-table arrays, then emits their Cartesian product.
     */
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        // reduce implementation
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {

            // emit the header once
            if (0 == time) {
                context.write(new Text("factoryname"), new Text("addressname"));
                time++;
            }

            int factorynum = 0;
            String[] factory = new String[10];
            int addressnum = 0;
            String[] address = new String[10];

            Iterator<Text> ite = values.iterator();
            while (ite.hasNext()) {
                String record = ite.next().toString();
                int len = record.length();
                int i = 2;
                if (0 == len) {
                    continue;
                }

                // read the left/right-table tag
                char relationtype = record.charAt(0);

                // left table: factory names
                if ('1' == relationtype) {
                    factory[factorynum] = record.substring(i);
                    factorynum++;
                }

                // right table: address names
                if ('2' == relationtype) {
                    address[addressnum] = record.substring(i);
                    addressnum++;
                }
            }

            // Cartesian product of the two sides
            if (0 != factorynum && 0 != addressnum) {
                for (int m = 0; m < factorynum; m++) {
                    for (int n = 0; n < addressnum; n++) {
                        // emit one joined row
                        context.write(new Text(factory[m]), new Text(address[n]));
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {

        Properties properties = System.getProperties();
        properties.setProperty("HADOOP_USER_NAME", "hadoop");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", HDFS);
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
        conf.set("dfs.client.use.datanode.hostname", "true");

        Tools tool = new Tools(HDFS, conf);
        if (tool.exists("/1900301538/Factory_where"))
            tool.rmr("/1900301538/Factory_where");
        tool.mkdirs("/1900301538/Factory_where");
        tool.mkdirs("/1900301538/Factory_where/input");
        tool.copyFile("D:\\li\\factory.txt", "/1900301538/Factory_where/input/factory.txt");
        tool.copyFile("D:\\li\\address.txt", "/1900301538/Factory_where/input/address.txt");

        Job job = Job.getInstance(conf, "company location summary");

        // set the Mapper and Reducer classes
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // set the output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path("/1900301538/Factory_where/input/"));
        FileOutputFormat.setOutputPath(job, new Path("/1900301538/Factory_where/output/"));
        job.waitForCompletion(true);
        tool.cat("/1900301538/Factory_where/output/part-r-00000");
    }
}

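To see how the tagging works: the factory row "Beijing Red Star     1" is emitted by the map as key 1 with a value tagged "1+Beijing Red Star", while the address row "1     Beijing" is emitted as key 1 with a value tagged "2+Beijing". In the reducer, values tagged 1 fill the factory array, values tagged 2 fill the address array, and the nested loops pair every factory with every address for that key, producing (Beijing Red Star, Beijing).
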
Output:

factoryname	addressname
Back of Beijing     Beijing
Beijing Rising     Beijing
Beijing Red Star     Beijing
Guangzhou Development Bank     Guangzhou
Guangzhou Honda     Guangzhou
Tencent     Shenzhen
Shenzhen Thunder     Shenzhen

Single-table join (grandchild and grandparent)

Required file:

c_p.txt

child        parent
Tom Lucy
Tom Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Ben
Jack Alice
Jack Jesse
Terry Alice
Terry Jesse
Philip Terry
Philip Alma
Mark Terry
Mark Alma

The code is as follows:

package hadoopdemo;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class c_p {

    private static final String HDFS = "hdfs://hadoop01:9000/";
    public static int time = 0;

    /*
     * The map splits each line into child and parent, then emits the pair twice:
     * once keyed by parent (left table) and once keyed by child (right table).
     * A left/right-table tag is added to the value so the reducer can tell the
     * two copies apart.
     */
    public static class Map extends Mapper<Object, Text, Text, Text> {

        // map implementation
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String childname = "";     // child name
            String parentname = "";    // parent name
            String relationtype = "";  // left/right-table tag

            // tokenize the line
            StringTokenizer itr = new StringTokenizer(value.toString());
            String[] values = new String[2];
            int i = 0;
            while (itr.hasMoreTokens()) {
                values[i] = itr.nextToken();
                i++;
            }

            // skip the header line
            if (values[0].compareTo("child") != 0) {
                childname = values[0];
                parentname = values[1];

                // emit the left table, keyed by parent
                relationtype = "1";
                context.write(new Text(values[1]), new Text(relationtype + "+" + childname + "+" + parentname));

                // emit the right table, keyed by child
                relationtype = "2";
                context.write(new Text(values[0]), new Text(relationtype + "+" + childname + "+" + parentname));
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        // reduce implementation
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {

            // emit the header once
            if (0 == time) {
                context.write(new Text("grandchild"), new Text("grandparent"));
                time++;
            }

            int grandchildnum = 0;
            String[] grandchild = new String[10];
            int grandparentnum = 0;
            String[] grandparent = new String[10];

            Iterator<Text> ite = values.iterator();
            while (ite.hasNext()) {
                String record = ite.next().toString();
                int len = record.length();
                int i = 2;
                if (0 == len) {
                    continue;
                }

                // read the left/right-table tag
                char relationtype = record.charAt(0);

                // the child and parent of this record
                String childname = new String();
                String parentname = new String();

                // read the child part of the value
                while (record.charAt(i) != '+') {
                    childname += record.charAt(i);
                    i++;
                }

                i = i + 1;

                // read the parent part of the value
                while (i < len) {
                    parentname += record.charAt(i);
                    i++;
                }

                // left table: the child is a grandchild of this key's parents
                if ('1' == relationtype) {
                    grandchild[grandchildnum] = childname;
                    grandchildnum++;
                }

                // right table: the parent is a grandparent of this key's children
                if ('2' == relationtype) {
                    grandparent[grandparentnum] = parentname;
                    grandparentnum++;
                }
            }

            // Cartesian product of the grandchild and grandparent arrays
            if (0 != grandchildnum && 0 != grandparentnum) {
                for (int m = 0; m < grandchildnum; m++) {
                    for (int n = 0; n < grandparentnum; n++) {
                        // emit one (grandchild, grandparent) pair
                        context.write(new Text(grandchild[m]), new Text(grandparent[n]));
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {

        Properties properties = System.getProperties();
        properties.setProperty("HADOOP_USER_NAME", "hadoop");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", HDFS);
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
        conf.set("dfs.client.use.datanode.hostname", "true");

        Tools tool = new Tools(HDFS, conf);
        if (tool.exists("/1900301538/c_p"))
            tool.rmr("/1900301538/c_p");
        tool.mkdirs("/1900301538/c_p");
        tool.mkdirs("/1900301538/c_p/input");
        tool.copyFile("D:\\li\\c_p.txt", "/1900301538/c_p/input/c_p.txt");

        Job job = Job.getInstance(conf, "grandparent-grandchild relation");

        // set the Mapper and Reducer classes
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // set the output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path("/1900301538/c_p/input/"));
        FileOutputFormat.setOutputPath(job, new Path("/1900301538/c_p/output/"));
        job.waitForCompletion(true);
        tool.cat("/1900301538/c_p/output/part-r-00000");
    }
}

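As a concrete trace: the row "Tom Lucy" is emitted twice by the map, as (Lucy, 1+Tom+Lucy) for the left table and as (Tom, 2+Tom+Lucy) for the right table. At key Lucy, the left-table value contributes Tom to the grandchild array, while the right-table values from the rows "Lucy Mary" and "Lucy Ben" contribute Mary and Ben to the grandparent array, so the reducer emits (Tom, Mary) and (Tom, Ben).
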
The output is as follows:

grandchild	grandparent
Tom Alice
Tom Jesse
Jone Alice
Jone Jesse
Tom Ben
Tom Mary
Jone Ben
Jone Mary
Philip Alice
Philip Jesse
Mark Alice
Mark Jesse

Closing remarks

If you have any other questions, please leave a comment.
