A Detailed Walkthrough of Secondary Sort in MapReduce

hadoop · wangting · 7 months ago (05-14)

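Back in 2015, when I was processing big data, everything was still written in MapReduce. Since then, computing tools have matured, memory has become much cheaper, and the way we compute has changed dramatically. Today most people doing big-data development go straight to tools such as Spark and Hive, and few still write MapReduce jobs by hand.
This post walks through the secondary sort technique that comes up so often in MapReduce, as a refresher.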

Introduction

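Secondary sort is the problem of sorting the values associated with a key in the reduce phase. With the secondary sort technique, the values passed to each reducer can arrive in ascending or descending order.
The MapReduce framework automatically sorts the key-value pairs emitted by the mappers by key. So before the reducers start, the intermediate key-value pairs are ordered by key, not by value; the values for a given key may arrive in any order.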

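Secondary sort solutions

There are at least two ways to sort the values in the reducer, and both can be used in MapReduce/Hadoop as well as in Spark.

  • The first approach has the reducer read and buffer all values for a given key, then sort them inside the reducer (for example, put every value for the key into an ArrayList and sort it). This has a limit: it works when the data is small, but if a single key has more values than fit in the reducer's memory, the job fails with an OutOfMemoryError.
  • The second approach lets the MapReduce framework sort the values. Because the framework already sorts the mapper output by key, we add the value we want sorted to the key and let the framework sort this new_key, which gives us the order we want.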
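The first approach can be sketched in plain Java, without any Hadoop classes. The class name InReducerSort and the method sortValues are made up for illustration; in a real job the same loop would sit inside reduce(), with the framework supplying the values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop: buffer-and-sort inside the reducer.
final class InReducerSort {
    // Copies every value of one key into memory, then sorts the copy.
    // Memory use grows with the number of values for that key, which is
    // exactly where this approach breaks down on large groups.
    static List<Integer> sortValues(Iterable<Integer> values) {
        List<Integer> buffer = new ArrayList<>();
        for (Integer v : values) {
            buffer.add(v);
        }
        Collections.sort(buffer); // ascending order
        return buffer;
    }

    public static void main(String[] args) {
        // Values of a single key, in arbitrary arrival order.
        System.out.println(sortValues(Arrays.asList(20, 30, -40))); // prints [-40, 20, 30]
    }
}
```

The whole group has to fit in the buffer before the sort can run, so the cost is paid per key, not per job.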

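Summary of the second approach:

  1. Use the value-to-key conversion design pattern: build a composite intermediate key new_key(k, v1), where v1 is the secondary key.
  2. Let the MapReduce execution framework do the sorting.
  3. Override the partitioner so that the composite key (k, v1) is partitioned by the original k alone.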

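Example

Suppose we have temperature readings from a scientific experiment, as follows.
There are four columns: year, month, day, temperature.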

2000,12,04,10
2000,11,01,20
2000,12,02,-20
2000,11,07,30
2000,11,24,-40
2000,01,12,10
...

We want to output the temperatures for each year-month, with the values sorted in ascending order.
The output should look like this:

(2000-11),[-40,20,30]
(2000-01),[10]
(2000-12),[-20,10]
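As a rough sketch of the map-side step, each input line can be split into the year-month natural key and the temperature. The class TemperatureRecord and its parse method are hypothetical names for this illustration, and the sketch assumes every line has exactly four comma-separated fields.

```java
// Hypothetical holder for one parsed input line (not a Hadoop type).
final class TemperatureRecord {
    final String yearMonth;  // natural key, e.g. "2000-12"
    final int temperature;   // the value we want sorted

    TemperatureRecord(String yearMonth, int temperature) {
        this.yearMonth = yearMonth;
        this.temperature = temperature;
    }

    // Parses one "year,month,day,temperature" line; the day is not needed
    // for the output, so it is dropped here.
    static TemperatureRecord parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields: " + line);
        }
        String yearMonth = fields[0] + "-" + fields[1];
        return new TemperatureRecord(yearMonth, Integer.parseInt(fields[3].trim()));
    }

    public static void main(String[] args) {
        TemperatureRecord r = parse("2000,12,02,-20");
        System.out.println(r.yearMonth + " " + r.temperature); // prints 2000-12 -20
    }
}
```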

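Implementation details of secondary sort in MapReduce

To implement secondary sort we also need a few plug-in Java classes that tell the MapReduce framework:

  • how to sort the keys passed to the reducers;
  • how to partition the mapper output across reducers;
  • how to group the data inside each reducer.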

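Sort order of the composite key

To implement secondary sort, we need to control the sort order of the composite key, and therefore the order in which the reducer processes keys.
The composite key is made up of the year-month plus the temperature, as shown in the figure below:
[Figure: the composite key, made up of yearMonth and temperature]

Once temperature has been moved into the key, we also have to specify how the composite key is sorted. We store the composite key in a DateTemperaturePair object and override its compareTo() method to define the order.
In Hadoop, a custom data type such as DateTemperaturePair must implement the Writable interface so that it can be serialized. To be compared as a key, it must implement WritableComparable, which extends Writable. Sample code: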

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
...
// WritableComparable already extends Writable, so one interface is enough.
public class DateTemperaturePair implements WritableComparable<DateTemperaturePair> {
    private Text yearMonth = new Text(); // natural key
    private Text day = new Text();
    private IntWritable temperature = new IntWritable(); // secondary key
    ...
    /**
     * Controls the sort order of the keys.
     */
    @Override
    public int compareTo(DateTemperaturePair pair) {
        int compareValue = this.yearMonth.compareTo(pair.getYearMonth());
        if (compareValue == 0) {
            compareValue = temperature.compareTo(pair.getTemperature());
        }
        return compareValue; // ascending order
        // return -1 * compareValue; // descending order
    }
}
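The effect of this ordering can be tried outside Hadoop. The following is a minimal sketch in plain Java, where CompositeKey is a hypothetical stand-in for DateTemperaturePair that uses a String and an int instead of Text and IntWritable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical plain-Java stand-in for the composite key.
final class CompositeKey implements Comparable<CompositeKey> {
    final String yearMonth;  // natural key
    final int temperature;   // secondary key

    CompositeKey(String yearMonth, int temperature) {
        this.yearMonth = yearMonth;
        this.temperature = temperature;
    }

    // Natural key first; the temperature only breaks ties.
    @Override
    public int compareTo(CompositeKey other) {
        int c = yearMonth.compareTo(other.yearMonth);
        if (c == 0) {
            c = Integer.compare(temperature, other.temperature);
        }
        return c;
    }

    @Override
    public String toString() {
        return "(" + yearMonth + "," + temperature + ")";
    }

    // Returns a sorted copy, leaving the input list untouched.
    static List<CompositeKey> sorted(List<CompositeKey> keys) {
        List<CompositeKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
            new CompositeKey("2000-12", 10), new CompositeKey("2000-11", 20),
            new CompositeKey("2000-12", -20), new CompositeKey("2000-11", 30),
            new CompositeKey("2000-11", -40), new CompositeKey("2000-01", 10));
        System.out.println(sorted(keys));
    }
}
```

Note that negating the whole comparison result reverses the year-month order as well. To get descending temperatures while keeping the year-months ascending, negate only the temperature comparison.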

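Custom partitioner

By default, the partitioner decides which reducer a record goes to based on the key emitted by the mapper.
Here we want to route records to reducers by yearMonth, but our key is now the (yearMonth + temperature) composite. So we need a custom partitioner that partitions on yearMonth only, sending every record with the same yearMonth to the same reducer. The code: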

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class DateTemperaturePartitioner extends Partitioner<DateTemperaturePair, Text> {
    @Override
    public int getPartition(DateTemperaturePair pair, Text text, int numberOfPartitions) {
        // Make sure the partition number is non-negative.
        return Math.abs(pair.getYearMonth().hashCode() % numberOfPartitions);
    }
}
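The key property of such a partitioner is that the temperature has no influence on the result. A small plain-Java sketch of that property follows; NaturalKeyPartitionSketch and partitionFor are hypothetical names. It clears the sign bit of the hash instead of calling Math.abs, which is the form Hadoop's default HashPartitioner uses. Both forms stay within range, though they may assign a negative hash to different partitions.

```java
// Hypothetical illustration of partitioning by the natural key only.
final class NaturalKeyPartitionSketch {
    // The temperature parameter is deliberately ignored, so every record of
    // one year-month lands in the same partition.
    static int partitionFor(String yearMonth, int temperature, int numPartitions) {
        // Clearing the sign bit keeps the result in [0, numPartitions).
        return (yearMonth.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int a = partitionFor("2000-11", -40, 3);
        int b = partitionFor("2000-11", 30, 3);
        System.out.println(a == b); // prints true: same yearMonth, same partition
    }
}
```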

Hadoop provides a plug-in mechanism that lets us inject a custom partitioner into the framework. We do this in the driver class:

import org.apache.hadoop.mapreduce.Job;
...
Job job = ...;
...
job.setPartitionerClass(DateTemperaturePartitioner.class);

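Grouping comparator

The grouping comparator controls which keys are grouped into a single call of Reducer.reduce().
By default keys are grouped by the whole key. Here we want to group by the yearMonth part of the composite key (yearMonth + temperature), so we need to override the grouping logic,
as follows: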

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DateTemperatureGroupingComparator extends WritableComparator {
    public DateTemperatureGroupingComparator() {
        // true: let the comparator create key instances for deserialization
        super(DateTemperaturePair.class, true);
    }

    /**
     * Controls which keys are grouped into a single reduce() call.
     */
    @Override
    public int compare(WritableComparable wc1, WritableComparable wc2) {
        DateTemperaturePair pair = (DateTemperaturePair) wc1;
        DateTemperaturePair pair2 = (DateTemperaturePair) wc2;
        // Compare the natural key only, ignoring the temperature.
        return pair.getYearMonth().compareTo(pair2.getYearMonth());
    }
}
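What grouping does to an already sorted key stream can be sketched in plain Java. This is only an illustration of the idea, not how Hadoop implements it internally; GroupingSketch, Key, and group are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of grouping sorted composite keys.
final class GroupingSketch {
    static final class Key {
        final String yearMonth;
        final int temperature;

        Key(String yearMonth, int temperature) {
            this.yearMonth = yearMonth;
            this.temperature = temperature;
        }
    }

    // Grouping comparator: looks at the natural key only.
    static final Comparator<Key> GROUPING = Comparator.comparing((Key k) -> k.yearMonth);

    // Walks keys that are already in composite sort order and starts a new
    // group whenever the grouping comparator reports a difference.
    static Map<String, List<Integer>> group(List<Key> sortedKeys) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        Key previous = null;
        List<Integer> current = null;
        for (Key k : sortedKeys) {
            if (previous == null || GROUPING.compare(previous, k) != 0) {
                current = new ArrayList<>();
                groups.put(k.yearMonth, current);
            }
            current.add(k.temperature);
            previous = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> sorted = Arrays.asList(
            new Key("2000-01", 10), new Key("2000-11", -40), new Key("2000-11", 20),
            new Key("2000-11", 30), new Key("2000-12", -20), new Key("2000-12", 10));
        System.out.println(group(sorted));
        // prints {2000-01=[10], 2000-11=[-40, 20, 30], 2000-12=[-20, 10]}
    }
}
```

Because the keys arrive sorted by (yearMonth, temperature), each group's values come out in ascending order without any sorting inside the reducer.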

Register the comparator in the driver class:
job.setGroupingComparatorClass(DateTemperatureGroupingComparator.class);

Data flow with the plug-ins

[Figure: data flow with the plug-in classes]

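Summary of the principle

By default the MapReduce framework partitions, sorts, and groups by key.
We need to sort by key + value, so we turn the key into a new composite key (firstkey, secondkey), here (yearMonth, temperature).

But we do not want partitioning and grouping to use the whole new key, so we write our own Partitioner and GroupingComparator that partition and group by the firstkey part of the composite key only.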
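To see the three pieces working together, here is a small in-memory simulation in plain Java. It is a sketch of the idea only, with no Hadoop involved: SecondarySortSimulation, Key, and run are hypothetical names, and the "reducers" are just lists in one process.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process imitation of partition, sort, and group.
final class SecondarySortSimulation {
    static final class Key {
        final String yearMonth;  // firstkey
        final int temperature;   // secondkey

        Key(String yearMonth, int temperature) {
            this.yearMonth = yearMonth;
            this.temperature = temperature;
        }
    }

    // Sorting uses the whole composite key.
    static final Comparator<Key> SORT =
        Comparator.comparing((Key k) -> k.yearMonth).thenComparingInt(k -> k.temperature);

    // Partitioning uses the firstkey only.
    static int partition(Key k, int numReducers) {
        return (k.yearMonth.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Returns one output map per simulated reducer.
    static List<Map<String, List<Integer>>> run(List<String> lines, int numReducers) {
        // Map + partition: build composite keys and route them by yearMonth.
        List<List<Key>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String line : lines) {
            String[] f = line.split(",");
            Key k = new Key(f[0] + "-" + f[1], Integer.parseInt(f[3]));
            buckets.get(partition(k, numReducers)).add(k);
        }
        // Sort + group: equal yearMonths are adjacent after sorting, so
        // collecting by yearMonth here matches grouping by the firstkey.
        List<Map<String, List<Integer>>> outputs = new ArrayList<>();
        for (List<Key> bucket : buckets) {
            bucket.sort(SORT);
            Map<String, List<Integer>> groups = new LinkedHashMap<>();
            for (Key k : bucket) {
                groups.computeIfAbsent(k.yearMonth, ym -> new ArrayList<>()).add(k.temperature);
            }
            outputs.add(groups);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2000,12,04,10", "2000,11,01,20", "2000,12,02,-20",
            "2000,11,07,30", "2000,11,24,-40", "2000,01,12,10");
        // Which year-month lands on which reducer depends on the hash.
        System.out.println(run(lines, 2));
    }
}
```

With a single reducer the result is one map holding all three year-months; with more reducers the same groups are spread across them, but each year-month still appears on exactly one reducer with its temperatures in ascending order.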


程序员灯塔 (Programmer's Lighthouse), all rights reserved
When reposting, please cite the original link: https://www.wangt.cc/2019/05/xiang-xi-jiang-jiemapreduce-er-ci-pai-xu-guo-cheng/