<>MapReduce

<>MapReduce Introduction and advantages

* MapReduce is a programming framework for distributed computing programs, and the core of data analysis on Hadoop.
* The core idea of MapReduce is to combine the business logic code written by the user with the framework's built-in components into one distributed computing program that processes massive data in parallel, improving efficiency.
* Massive data is difficult to process on a single machine, and extending a standalone program to run distributed across a cluster greatly increases its complexity. The MapReduce framework was introduced so that developers can focus on the core business logic of data processing: it encapsulates the common functionality of distributed programs into a framework, reducing the difficulty of development.
* A complete MapReduce program runs three types of instance processes:
  * MRAppMaster: responsible for coordinating the whole job
  * MapTask: responsible for data processing in the map stage
  * ReduceTask: responsible for data processing in the reduce stage
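To make this division of labor concrete, here is a minimal sketch of the classic WordCount job written against the Hadoop MapReduce Java API. The class name and the argument-based input/output paths are illustrative choices, not part of the original text:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // MapTask side: map() is called once per input line
    // (key = byte offset of the line, value = line text).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit <word, 1>
                }
            }
        }
    }

    // ReduceTask side: reduce() is called once per key group after shuffle/sort.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);         // emit <word, total count>
        }
    }

    // Driver: the MRAppMaster coordinates the tasks configured here.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```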
<>MapReduce Basic process

Step 1: InputFormat
Step 2: Split
Step 3: RecordReader
Step 4: Map
Step 5: Partition
Step 6: Sort (an optional Combiner runs with the sort)
Step 7: Merge (an optional Combiner runs during the merge)
Step 8: Group
Step 9: Reduce
Step 10: OutputFormat

<>MapReduce overall process in detail

<>1. Map stage
1. Fetch the data and slice it logically. The component that reads the data is the InputFormat (TextInputFormat by default). Its getSplits method logically slices the files in the input directory into planned splits. One MapTask is started per split; by default, splits correspond one-to-one to HDFS blocks.
2. RecordReader processes the input. After the input file has been divided into splits, a RecordReader object (LineRecordReader by default) reads it. Using \n as the separator, it reads one line of data at a time and returns a <key,value> pair, where key is the byte offset of the first character of the line and value is the text content of that line.
3. Execute the map method. The <key,value> pairs read from the split are passed into the user's subclass of the Mapper class, which runs the user-overridden map function. Each line the RecordReader reads triggers one call to the user's map method, which in turn outputs <key,value> pairs.
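The one-to-one split-to-block mapping can be adjusted. As a small sketch, assuming the new-API FileInputFormat (the 64 MB and 256 MB figures are arbitrary example values, not from the original text):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeConfig {
    // Sketch: control how getSplits() plans the logical slices.
    // With defaults, split size equals the HDFS block size, so one MapTask per block.
    public static void configure(Job job) {
        job.setInputFormatClass(TextInputFormat.class);                // the default reader
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);  // floor: 64 MB (example)
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // ceiling: 256 MB (example)
    }
}
```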
<>2. Shuffle stage (after Map output, before Reduce input)

1. Partition the data and write it to memory
Here the output collector gathers the data written by Map, placing each key/value pair, together with its Partition result, into a circular (ring) buffer in memory. Points to note:
a. Partitioning happens as records are written into the ring buffer.
b. The buffer collects map results in batches, reducing the impact of disk IO.
c. The ring data structure makes more efficient use of memory space, keeping as much data in memory as possible.
d. The ring buffer is actually a byte array. It holds the serialized key and value data plus per-record metadata: the partition, the starting position of the key, the starting position of the value, and the length of the value. The ring structure is an abstraction over this array.
e. The buffer size is limited, 100MB by default (modifiable). When a Map task produces a lot of output it could exhaust memory, so under certain conditions the buffered data must be temporarily written to disk and the buffer reused. This process of writing data from memory to disk is called a spill. The spill is carried out by a separate thread and does not block the thread writing map results into the buffer. The default spill trigger ratio is 0.8: when the buffered data reaches that threshold, the spill thread starts, locks the 80MB of memory being spilled, and carries out the spill, while the Map task's output keeps going into the remaining 20MB of memory, so the two do not interfere. (The relevant buffer settings appear in the sketch below.)
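As a sketch of where these numbers live, assuming Hadoop 2.x-style configuration keys (verify the exact names and defaults against your distribution's mapred-default.xml):

```java
import org.apache.hadoop.conf.Configuration;

public class SortBufferConfig {
    // Sketch: tune the map-side ring buffer described above.
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Total ring buffer size in MB (default 100).
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Spill trigger ratio (default 0.80, i.e. the 80% threshold above).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.85f);
        return conf;
    }
}
```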
2. Spill, sort, combine (local aggregation)
When the data in the ring buffer reaches the trigger ratio, a spill starts: the data to be spilled is locked and written to a temporary file on disk, and before the write it is sorted by key (sort) and optionally merged (combine). Points to note:
a. Partitioning (Partitioner): MapReduce provides the Partitioner interface, whose job is to decide, based on the key (or value) and the number of reducers, which ReduceTask should handle each output pair. The default is HashPartitioner, which hashes the key and takes the result modulo the number of ReduceTasks. This default only aims to spread the load evenly across reducers; users with specific requirements can supply their own Partitioner (see the sketch after this list).
b. Sort (key.compareTo): keys are compared by calling their compareTo method; the default algorithm is quicksort, using dictionary order.
c. Combiner: performs local aggregation; its working mechanism is exactly the same as Reduce; whether to use it is up to the user.
d. The data area keeps three pieces of index information for each record: order, size, and offset.
e. When 80% of the data area is filled, it is locked, and the data collected by the output collector is written into the 20% reserve. When the spill of the locked data area completes, the lock is released.
f. The 20% reserve plus 60% of the old data area then form the new data area, and the remaining 20% of the data area becomes the new reserve.
g. The ratio of the reserve area to the data area can be set by the user.
h. Each spill produces a file called a segment, and each segment has an index file describing the offset of each partition region, the original file size, and the compressed data size. The index file is kept in memory by default and written to disk when memory is insufficient.
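As note (a) mentions, a custom Partitioner can replace HashPartitioner. A minimal sketch, assuming Text keys and a made-up routing rule (keys starting with a digit go to reducer 0):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: route keys starting with a digit to reducer 0, and spread all
// other keys across the remaining reducers (example logic, not from the text).
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String s = key.toString();
        if (numReduceTasks <= 1) {
            return 0;
        }
        if (!s.isEmpty() && Character.isDigit(s.charAt(0))) {
            return 0;
        }
        // Same modulo trick HashPartitioner uses, over the remaining reducers.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }
}
// Enable in the driver: job.setPartitionerClass(FirstCharPartitioner.class);
```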
3. Merge (integration)
When all the data has been processed, all the temporary files on disk are merged into one large file. This merge combines the same partition from all temporary files and sorts the data within each partition again (sort). Points to note:
a. Each MapTask is ultimately integrated into one large file, and the large file also has an index.
b. The combiner also takes effect during the merge process.
c. Each ReduceTask pulls the file regions for its own partition from every MapTask; by default they are stored in a ring buffer on the reduce side (100M) and spilled to disk when the threshold is reached. After pulling, the reducer has a pile of small files; all the small files are merged and sorted into one large file.
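Since the combiner's working mechanism is the same as Reduce (note c in the spill step, note b here), enabling it is a one-line driver change. A sketch, assuming the reduce logic is associative and commutative, like the IntSumReducer summation in the WordCount sketch above:

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerConfig {
    // Sketch: reuse the reducer as a combiner. Safe only when the reduce
    // logic can be applied to partial results (e.g. summing counts).
    public static void enableCombiner(Job job) {
        job.setCombinerClass(WordCount.IntSumReducer.class);
    }
}
```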

<>3. Reduce stage
1. Pull, group, and merge the data. The ReduceTask continually pulls the data belonging to its own partition from each MapTask and reads it into an in-memory cache, spilling to disk when the threshold is reached. Once all the data has been written, a merge begins, finally combining everything into one large file for the partition. The key/value pairs in this file are then sorted by key, and after sorting they are grouped according to the grouping rules.
2. Execute the reduce method. Once grouping is complete, the reduce method is called once per group.
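The reduce-side pull and in-memory merge have knobs of their own. A sketch, assuming Hadoop 2.x key names (check your distribution's mapred-default.xml before relying on them):

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleConfig {
    // Sketch: reduce-side fetch/merge settings for the pull-and-merge step above.
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Number of parallel fetch threads pulling map output (default 5).
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        // Fraction of reducer heap used to buffer pulled map output (default 0.70).
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        // Buffer usage ratio at which the in-memory merge to disk begins (default 0.66).
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
        return conf;
    }
}
```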

<>MapReduce tuning

* The Map side is most efficient when the number of ring-buffer flushes is minimized (i.e. when disk IO is used as few times as possible). Ways to reduce the number of ring-buffer flushes (see the sketch below):

* 1. Increase the memory of the ring buffer.
* 2. Raise the spill threshold of the buffer (considering whether the remaining space can absorb the ongoing output).
* 3. Compress the output (the compression and decompression process consumes CPU).
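A sketch of option 3, compressing the intermediate map output to trade CPU for disk and network IO. Snappy is a common codec choice here; this assumes the native Snappy library is available on the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    // Sketch: enable compression of intermediate map output.
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        return conf;
    }
}
```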
* The Reduce side is likewise most efficient when the number of ring-buffer flushes is minimized:

* 1. Try to complete the whole computation in memory.

* Given that network bandwidth and disk IO are the bottlenecks, do not use IO or the network where you can avoid them, and where they are necessary, use them as little as possible.

* In general, any configuration item that reduces network bandwidth consumption or the number of disk IO operations is a candidate for cluster tuning. The options cover the software level (system software and cluster software), the hardware level, and the network level.

<>MapReduce Development summary
When programming with MapReduce, the structure is basically a fixed pattern with little flexibility, apart from the following points:

1) Input data interface: InputFormat ---> FileInputFormat (a generic abstract class for reading file-based data), DBInputFormat (a generic abstract class for reading database data). The default implementation class is TextInputFormat: job.setInputFormatClass(TextInputFormat.class). TextInputFormat's functional logic is to read one line at a time, returning the line's starting byte offset as the key and the line content as the value.

2) Logic processing interface: Mapper. Entirely implemented by the user: map(), setup(), cleanup().

3) The map output goes through partition and sort in the shuffle stage; two interfaces can be customized here:
(1) Partitioner, with default implementation HashPartitioner, whose logic is to return a partition number based on the key and the number of reducers: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. Under normal circumstances the default HashPartitioner is fine; if the business has special needs, it can be customized (as in the Partitioner sketch earlier).
(2) Comparable: when we use a custom object as the key, it must implement the WritableComparable interface and override its compareTo() method.

4) Reduce-side data grouping comparison interface: GroupingComparator. After a ReduceTask obtains its input data (all the data of one partition), it first groups the data; the default grouping rule is that identical keys form one group. It then calls the reduce() method once for each group of kv data, passing the key of the first kv in the group as the reduce key parameter and an iterator over the group's values as the values parameter. Using this mechanism we can efficiently implement "maximum per group" logic: define a custom bean object to encapsulate our data and override its compareTo method to sort in descending order, then define a GroupingComparator that groups the beans by our business id (such as the order number). That way, the maximum we want is simply the key seen by the reduce() method (see the sketch at the end of this summary).

5) Logic processing interface: Reducer. Entirely implemented by the user: reduce(), setup(), cleanup().

6) Output data interface: OutputFormat ---> has a series of subclasses: FileOutputFormat, DBOutputFormat, ..... The default implementation class is TextOutputFormat, whose functional logic is to output each KV pair as one line to the target text file.
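A minimal sketch of the "maximum per group" pattern from point 4, assuming an order scenario; OrderBean, orderId, and price are illustrative names, not from the original text:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Custom key: sorts by orderId, then by price descending, so within each
// order the highest-priced record comes first.
public class OrderBean implements WritableComparable<OrderBean> {
    private String orderId;
    private double price;

    public OrderBean() {}                        // required no-arg constructor

    public void set(String orderId, double price) {
        this.orderId = orderId;
        this.price = price;
    }

    public String getOrderId() { return orderId; }

    @Override
    public int compareTo(OrderBean o) {
        int cmp = orderId.compareTo(o.orderId);
        if (cmp != 0) return cmp;
        return -Double.compare(price, o.price);  // descending by price
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId);
        out.writeDouble(price);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        orderId = in.readUTF();
        price = in.readDouble();
    }

    @Override
    public String toString() { return orderId + "\t" + price; }
}

// Grouping comparator: treats all beans with the same orderId as one group,
// so reduce() sees the group's first (highest-priced) bean as its key.
class OrderGroupingComparator extends WritableComparator {
    protected OrderGroupingComparator() {
        super(OrderBean.class, true);            // true = instantiate keys
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((OrderBean) a).getOrderId().compareTo(((OrderBean) b).getOrderId());
    }
}
// Driver wiring: job.setGroupingComparatorClass(OrderGroupingComparator.class);
```

In a real job the Partitioner would also have to route by orderId, so that all records of one order reach the same ReduceTask before the grouping comparator sees them.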

Mybatis Error resolution :There is no getter for property named '*' in 'class Java.lang.String Big data tells you , How tired are Chinese women Message quality platform series | Full link troubleshooting Gude Haowen serial - You deserve to be an engineer ( Preface ) Image explanation of over fitting and under fitting Springboot of JPA Common query methods JAVA Detailed explanation of anomalies vue Of v-if And v-show The difference between python To solve the problem of dictionary writing list in Codeup——601 | problem A: task scheduling