FreeOZ Forum

Title: [Big data] Map reduce question

Author: DDD888    Time: 8-4-2016 08:04
Title: [Big data] Map reduce question
Good morning,

I was reading http://www.thegeekstuff.com/2014/05/map-reduce-algorithm/ and have a question about the first example:
------------------------------------------------------------------------
Mapping Phase

So the map phase of our algorithm will be as follows:

1. Declare a function “Map”
2. Loop: for each word equal to “football”
3. Increment counter
4. Return the key-value pair “football”=>counter

Reducing Phase

The reduce function will accept the input from all these mappers in the form of key-value pairs and then process it. So the input to the reduce function will look like the following:

reduce(“football”=>2)
reduce(“Olympics”=>3)
Our algorithm will continue with the following steps:

5. Declare a function “Reduce” to accept the values from the map function.
6. For each key-value pair, add the value to a counter.
7. Return “games”=>counter.

At the end, we will get output like “games”=>5.
------------------------------------------------------------------------
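If I translate that into actual Hadoop code, I understand it roughly like this (my own sketch, not code from the article; I emit a 1 per match instead of keeping a running counter in the mapper, which I believe is the usual Hadoop style):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit ("football", 1) or ("Olympics", 1) for every match.
public class SportsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().split("\\s+")) {
            if (word.equals("football") || word.equals("Olympics")) {
                context.write(new Text(word), ONE);
            }
        }
    }
}

// Reduce phase: with a single reduce task, one reducer instance sees
// every key, so it can total "football" and "Olympics" together and
// emit "games"=>5 once, from cleanup().
public class GamesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private int counter = 0;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        for (IntWritable v : values) {
            counter += v.get();
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        context.write(new Text("games"), new IntWritable(counter));
    }
}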

My question is: why not map directly to “games”? That would be more efficient (see the sketch after the steps below):

Mapping Phase

So the map phase of our algorithm would instead be:

1. Declare a function “Map”
2. Loop: for each word equal to “football”
3. Increment counter
4. Return the key-value pair “games”=>counter
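
The only change from my sketch above is the key the mapper emits (again my own code, not the article's):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Same as SportsMapper above, but the final key "games" is emitted
// directly, so the shuffle only ever moves a single key around and a
// plain summing reducer would output "games"=>5 with no re-keying step.
public class GamesMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final Text GAMES = new Text("games");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().split("\\s+")) {
            if (word.equals("football") || word.equals("Olympics")) {
                context.write(GAMES, ONE);
            }
        }
    }
}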

TIA


Author: michaelsusu    Time: 8-4-2016 13:56
As far as this specific problem goes, mapping directly to "games" does save space, but the author's point is for you to appreciate the (K1,V1) -> (K2,V2) -> (K3,V3) transformation, which is also the standard MapReduce pattern. The simplified version is of course just (K1,V1) -> (K3,V3), i.e. going straight to "games".
Also, you should know that the map phase is meant to preserve the data's original semantics; the output semantics are only introduced after the shuffle, in the combine and reduce stages, so the example's arrangement also follows the customary practice. The intermediate map results often need joins or outer joins with other data, so keeping the intermediate results keyed by K2 makes the data processing more flexible.

A question to test you: do you know what K1, K2, and K3 each are in this example?
Author: DDD888    Time: 8-4-2016 14:32
Last edited by DDD888 on 8-4-2016 13:59
michaelsusu posted on 8-4-2016 12:56
As far as this specific problem goes, mapping directly to "games" does save space, but the author's point is for you to appreciate the (K1,V1) -> (K2,V2) -> (K3 ...


The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
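
If I read that right, then in the football example the concrete types would be: k1/v1 = LongWritable byte offset / Text line (from the default TextInputFormat), k2/v2 = Text word / IntWritable count, and k3/v3 = Text "games" / IntWritable total. A driver wiring that up might look like this (my own sketch against the Hadoop 2.x-style API, reusing the SportsMapper/GamesReducer classes from my first post):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GamesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(GamesDriver.class);
        // k1/v1 (LongWritable/Text) come from the default TextInputFormat.
        job.setMapperClass(SportsMapper.class);
        job.setMapOutputKeyClass(Text.class);          // k2
        job.setMapOutputValueClass(IntWritable.class); // v2
        job.setReducerClass(GamesReducer.class);
        job.setOutputKeyClass(Text.class);             // k3
        job.setOutputValueClass(IntWritable.class);    // v3
        job.setNumReduceTasks(1); // one reducer must see all keys to total them
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}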
Author: DDD888    Time: 8-4-2016 21:00
Last edited by DDD888 on 8-4-2016 20:02

By the way, what if I want to compute an average?

Your Mappers read the text file and apply the following map function on every line

map: (key, value)
  time = value[2]
  emit("1", time)
All map calls emit the key "1", which will be processed by a single reduce function:

reduce: (key, value)
  result = sum(value) / n
  emit("1", result)
Since you're using Hadoop, you have probably seen the use of StringTokenizer in the map function; you can use it to get only the time from each line. You can also think of some ways to compute n (the number of records); for example, you could use a Counter in another job which just counts lines.
http://stackoverflow.com/questio ... job-to-find-average

Based on the answer at that link, I would have to write two jobs: one job computes the running total, the other counts how many records there are, and then I simply divide the two. But I'm thinking: isn't it rather inefficient for two jobs to read the same file? What if the file is several thousand TB?
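
Or could one job do both at once? Since everything is funnelled to the single key "1", one reduce call sees all the values and can accumulate the sum and n in the same pass. Something like this sketch (my own code, not the linked answer's; I'm assuming the time is the third whitespace-separated field, matching value[2] in the quoted map function):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit every time value under the single key "1".
public class TimeMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private static final Text ONE_KEY = new Text("1");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        context.write(ONE_KEY, new DoubleWritable(Double.parseDouble(fields[2])));
    }
}

// Reduce phase: the single reduce call receives every value, so it can
// compute the sum and the count in one pass -- no second job needed.
public class AverageReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long n = 0;
        for (DoubleWritable v : values) {
            sum += v.get();
            n++;
        }
        context.write(new Text("average"), new DoubleWritable(sum / n));
    }
}

The catch is that the one reducer has to pull every value across the network; for a file thousands of TB large you would want a combiner that emits partial (sum, count) pairs first, but at least the input is only read once.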

My test data file is only 69 KB, so I can't see any slowness anyway.
Author: ubuntuhk    Time: 8-4-2016 23:27
DDD888 posted on 8-4-2016 20:00
By the way, what if I want to compute an average?

Your Mappers read the text file and apply the following map function ...


After all that talk, you're testing with a 69 KB file? That's not much of an effort....
Author: DDD888    Time: 9-4-2016 07:49
ubuntuhk posted on 8-4-2016 22:27
After all that talk, you're testing with a 69 KB file? That's not much of an effort....

This is the biggest log file I have to test with.

It should be enough for learning to write the programs. The demos on the web test against the /etc/passwd file, which is even smaller; by comparison my 69 KB file counts as large.



