MapReduce is a framework designed to process huge datasets. It uses a large cluster of computers, called nodes, to perform the computations. The processing is done on data stored either in a file system or in a database. A MapReduce application has two basic steps: map and reduce. In the map step, the master node receives the input, partitions it into smaller sub-problems, and distributes those to worker nodes. A worker node may repeat this in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. In the reduce step, the master node takes the answers to all the sub-problems and combines them to produce the final output.
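As a rough illustration of the two steps, here is a minimal single-process sketch in Python that counts words. It only imitates the idea: the names map_step, reduce_step and num_workers are made up for this example, the "workers" are plain list chunks rather than cluster nodes, and no real MapReduce library is involved.

```python
from collections import Counter

def map_step(lines, num_workers):
    """Partition the input into sub-problems and let each 'worker' solve one."""
    # Hypothetical master behaviour: round-robin split, one chunk per worker.
    chunks = [lines[i::num_workers] for i in range(num_workers)]
    # Each simulated worker counts the words in its own chunk.
    return [
        Counter(word for line in chunk for word in line.split())
        for chunk in chunks
    ]

def reduce_step(partial_counts):
    """Combine the workers' partial answers into the final output."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # adds counts key by key
    return dict(total)

lines = ["a b a", "b c", "a"]
result = reduce_step(map_step(lines, num_workers=2))
print(result)
```

The point of the split is that each chunk can be counted independently, so on a real cluster the chunks could be processed on different nodes at the same time.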
At the core of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:
• an input reader
• a Map function
• a Reduce function
• a partition function
• a compare function
• an output writer
The input reader divides the input into appropriately sized splits, and the framework assigns one split to each Map function. The input reader reads data from stable storage, typically a distributed file system, and generates key/value pairs. The Map function takes a series of key/value pairs, processes each of them, and generates zero or more output key/value pairs. The input and output types of the Map function can be, and often are, different from each other.
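The input reader and Map function can be sketched as follows. This is an illustration under stated assumptions, not the API of any particular framework: input_reader, map_function and split_size are hypothetical names, a Python string stands in for a file on a distributed file system, and the key/value choice (line number and line text in, word and count out) is just one common convention.

```python
def input_reader(text, split_size):
    """Yield splits of at most split_size lines as (line_number, line) pairs."""
    lines = text.splitlines()
    for start in range(0, len(lines), split_size):
        chunk = lines[start:start + split_size]
        yield [(start + offset, line) for offset, line in enumerate(chunk)]

def map_function(pairs):
    """Take (line_number, line) pairs and emit zero or more (word, 1) pairs."""
    for _line_number, line in pairs:
        # An empty line produces no output pairs at all.
        for word in line.split():
            yield (word.lower(), 1)

# Assumed sample input; the blank second line shows the "zero outputs" case.
text = "the quick fox\n\nthe lazy dog"
splits = list(input_reader(text, split_size=2))
# One Map call per split, as the framework would arrange it.
mapped = [list(map_function(split)) for split in splits]
print(mapped)
```

Note that the input pairs are (integer, string) while the output pairs are (string, integer), which shows how the input and output types of a Map function can differ.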
The framework calls the application's Reduce function once for each unique key, in sorted order. The Reduce function can iterate through the values associated with that key and produce zero or more output values. The partition function allocates each Map function output to a particular reducer: given the key and the number of reducers, it returns the index of the desired reducer. The comparison function is used to sort the Map output: the input for each Reduce is pulled from the machine where the Map ran and sorted with the application's comparison function. Finally, the output writer writes the output of the Reduce function to stable storage, usually a distributed file system.
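The remaining four components can be tied together in a small single-process sketch. Everything here is an assumption made for illustration: the function names, the CRC32-modulo partitioning scheme, and the in-memory text stream that stands in for stable storage. A real framework would run the reducers on separate machines.

```python
import io
import zlib
from functools import cmp_to_key
from itertools import groupby

def partition_function(key, num_reducers):
    """Pick a reducer index for a key (assumed scheme: CRC32 modulo count)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def compare_function(a, b):
    """Order two keys: negative, zero or positive, ascending string order."""
    return (a > b) - (a < b)

def reduce_function(key, values):
    """Emit zero or more output values for one key; here, a single sum."""
    yield sum(values)

def output_writer(results, stream):
    """Write (key, value) results as tab-separated lines."""
    for key, value in results:
        stream.write(f"{key}\t{value}\n")

def run_reducers(map_output, num_reducers, stream):
    # Route every mapped pair to the reducer chosen by the partition function.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition_function(key, num_reducers)].append((key, value))
    sort_key = cmp_to_key(compare_function)
    for bucket in buckets:
        # Sort each reducer's input by key using the comparison function.
        bucket.sort(key=lambda pair: sort_key(pair[0]))
        # Call Reduce once per unique key, in sorted order.
        for key, group in groupby(bucket, key=lambda pair: pair[0]):
            values = [value for _key, value in group]
            output_writer(((key, out) for out in reduce_function(key, values)), stream)

map_output = [("the", 1), ("quick", 1), ("fox", 1), ("the", 1), ("lazy", 1), ("dog", 1)]
stream = io.StringIO()  # stands in for a file on stable storage
run_reducers(map_output, num_reducers=2, stream=stream)
print(stream.getvalue())
```

Because the partition function depends only on the key, every pair with the same key reaches the same reducer, which is what lets each key be reduced exactly once.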
Each component of a MapReduce application matters: if one is missing or poorly optimized, the results will not be as expected. To define a MapReduce job correctly, you need to understand each component closely, and online tutorials and other resources are a good place to start.
Jeniffer Thomas is a successful Internet marketer who has been working in this area for the past 5 years.