<> Partition concept
The term "partition" will be familiar to many readers. It shows up in a lot of middleware used from Java, for example partitions in Kafka and partitioned tables in MySQL. The point of partitioning is to divide data sensibly according to business rules, so that the data in each partition can be processed efficiently afterwards.
<>Hadoop partition
In Hadoop, partitioning means sending different data to different ReduceTasks, which in turn write their results to different output files.
Hadoop's default partition rule
* Hash partitioning
* Partition number = (hashCode of the key & Integer.MAX_VALUE) % number of ReduceTasks
* The default number of ReduceTasks is 1, and it can also be set in the driver
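The default partitioner class, HashPartitioner, is short and easy to understand: it masks off the sign bit of the key's hashCode and takes the remainder modulo the number of ReduceTasks.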
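The rule can be tried out without a cluster. The following is a minimal plain-Java sketch of the same arithmetic, not Hadoop's own source; the class name HashPartitionSketch and the method partitionFor are made up for illustration. It uses String keys, whose hashCode differs from that of Hadoop's Text, so the partition numbers printed here will not necessarily match those of a real job.

```java
// Plain-Java sketch of the default hash partitioning rule (no Hadoop dependency).
// HashPartitionSketch and partitionFor are hypothetical names used for illustration.
class HashPartitionSketch {

    // Mask off the sign bit so the value is never negative,
    // then take the remainder modulo the number of ReduceTasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] words = {"hello", "hadoop", "partition", "kafka"};
        for (String word : words) {
            System.out.println(word + " -> partition " + partitionFor(word, 2));
        }
    }
}
```

With a single ReduceTask the remainder is always 0, which is why a job with default settings produces only one output file.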
Hash partitioning code demonstration
Below is the driver code from the WordCount example. By default we do not configure anything, and the job writes a single txt file containing the word counts. What happens if we add the line job.setNumReduceTasks(2); to this code and run the program again?
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DemoJobDriver {

    public static void main(String[] args) throws Exception {
        // 1. Get the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2. Set the jar path
        job.setJarByClass(DemoJobDriver.class);

        // 3. Associate the Mapper and Reducer
        job.setMapperClass(DemoMapper.class);
        job.setReducerClass(DemoReducer.class);

        // 4. Set the key/value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input and output paths
        String inputPath = "F:\\ Net disk \\csv\\hello.txt";
        String outPath = "F:\\ Net disk \\csv\\wordcount\\hello_result.txt";

        // Use 2 ReduceTasks, which yields 2 output files
        job.setNumReduceTasks(2);

        FileInputFormat.setInputPaths(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outPath));

        // 7. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
As you can see, the job now produces 2 result files, and their contents differ. This is the default behavior: when the number of reducers is set to more than one, each key's partition is computed by the hash partitioning algorithm and the result is written to the file that corresponds to that partition.
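To see why the files end up with different contents, the distribution can be imitated in plain Java. This is only a sketch under assumptions: PartitionBucketsSketch and bucketByPartition are hypothetical names, the keys are Strings rather than Hadoop Text objects (so the real assignment may differ), and the bucket labels follow the part-r-0000N naming that FileOutputFormat uses for reducer output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: groups keys by the partition they would be sent to.
class PartitionBucketsSketch {

    // Same modulo rule as hash partitioning, applied to a plain String key.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Returns one bucket per ReduceTask, labeled like a reducer output file.
    static Map<String, List<String>> bucketByPartition(List<String> keys, int numReduceTasks) {
        Map<String, List<String>> buckets = new TreeMap<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.put(String.format("part-r-%05d", i), new ArrayList<>());
        }
        for (String key : keys) {
            int partition = partitionFor(key, numReduceTasks);
            buckets.get(String.format("part-r-%05d", partition)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "hadoop", "hello", "spark", "kafka");
        bucketByPartition(words, 2)
                .forEach((file, keys) -> System.out.println(file + ": " + keys));
    }
}
```

Because the partition depends only on the key, every occurrence of the same word lands in the same bucket, so each word's count appears in exactly one output file.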
<> Custom partitioning
* Write a custom class that extends Partitioner
* Override the getPartition method, and in it route different data to different partitions according to business rules
* In the job's driver class, set the custom Partitioner class
* After defining the custom Partitioner, set the number of ReduceTasks to match its partitioning logic
<> Business requirements
Given a file of people's names, split the records by surname: names with the surname "Ma" go to the first partition, names with the surname "Li" go to the second partition, and all others go to the third partition.
The custom partitioner
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        String key = text.toString();
        if (StringUtils.isNotEmpty(key.trim())) {
            if (key.startsWith("Ma")) {
                return 0;
            } else if (key.startsWith("Li")) {
                return 1;
            }
        }
        // All other surnames, and empty keys, go to the third partition.
        // Returning numPartitions itself here would be out of range.
        return 2;
    }
}
Register the custom partitioner in the driver class. Note that the number of ReduceTasks must match the number of partitions produced by the custom partitioner.
job.setNumReduceTasks(3);
job.setPartitionerClass(MyPartioner.class);
Now run the driver class and check the final output. As expected, records with different surnames are written to different files.
Summary of custom partitioning
* If the number of ReduceTasks is greater than the number of partitions produced by the custom partitioner, the extra ReduceTasks generate empty output files
* If 1 < number of ReduceTasks < number of partitions produced by the custom partitioner, some records are assigned a partition that no ReduceTask handles, and an exception is thrown
* If the number of ReduceTasks is 1, then no matter how many partitions the custom partitioner defines, all results go to that single ReduceTask and only one result file is generated
* Partition numbers must start at 0 and increase by one, without gaps
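These rules can be checked with a small simulation. The sketch below is a simplified model written for illustration, not Hadoop's actual routing code: ReduceTaskCountSketch and route are hypothetical names, the custom partitioner is assumed to return the numbers 0, 1 and 2, and IllegalStateException stands in for the exception a real job would throw.

```java
// Hypothetical model of how a partition number is matched to a ReduceTask.
class ReduceTaskCountSketch {

    // partition: the number returned by the custom partitioner.
    // numReduceTasks: the value passed to job.setNumReduceTasks(...).
    static int route(int partition, int numReduceTasks) {
        if (numReduceTasks == 1) {
            return 0; // a single ReduceTask receives everything
        }
        if (partition < 0 || partition >= numReduceTasks) {
            throw new IllegalStateException(
                    "Illegal partition " + partition + " for " + numReduceTasks + " ReduceTasks");
        }
        return partition;
    }

    public static void main(String[] args) {
        int customPartitions = 3; // assumed: the partitioner returns 0, 1 or 2
        for (int numReduceTasks = 1; numReduceTasks <= 4; numReduceTasks++) {
            StringBuilder line = new StringBuilder("ReduceTasks=" + numReduceTasks + ":");
            for (int partition = 0; partition < customPartitions; partition++) {
                try {
                    int target = route(partition, numReduceTasks);
                    line.append(" p").append(partition).append("->r").append(target);
                } catch (IllegalStateException e) {
                    line.append(" p").append(partition).append("->ERROR");
                }
            }
            System.out.println(line);
        }
    }
}
```

In this model, 4 ReduceTasks leave the last one without any input (an empty output file), 2 ReduceTasks fail on partition 2, and 1 ReduceTask accepts all three partitions.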