A large number of small files puts heavy pressure on the NameNode, and the platform has recently required cleaning up small files. The following is a summary of the small-file handling methods I have used in my work.
1. Parameter method (a general method, suitable when there are many small files on the map side while the reduce-side data volume is still relatively large, e.g. inserting log files from text format into ORC format; see the sketch after the parameter list below)
1. Map-side parameters
set mapred.max.split.size=256000000;  -- maximum input size per Map
set mapred.min.split.size=256000000;  -- minimum input size per Map
set mapred.min.split.size.per.node=100000000;  -- minimum total file size on one DataNode
set mapred.min.split.size.per.rack=100000000;  -- minimum total file size under one rack (switch)
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- merge small files before the Map stage
set hive.merge.mapfiles=true;  -- merge the outputs of map-only jobs, default is true
2. Reduce-side parameters
set hive.merge.size.per.task=256000000;  -- target size of the merged output files
set hive.merge.smallfiles.avgsize=16000000;  -- when the average output file size is below this value, start a separate MapReduce job to merge the files
set hive.merge.mapredfiles=true;  -- merge the outputs of map-reduce jobs, default is false
3. Set the number of reducers (the SQL must have a reduce stage)
set mapred.reduce.tasks=5;
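As a minimal sketch of how these parameters fit together, assuming a hypothetical text-format source table log_text and an ORC-format target table log_orc, both partitioned by dt (all table, column, and partition names are illustrative):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- combine small input files into larger splits
set mapred.max.split.size=256000000;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=16000000;

-- load the text-format logs into the ORC table; with the merge settings above,
-- Hive starts an extra merge job when the average output file is smaller than ~16 MB
insert overwrite table log_orc partition (dt='20200101')
select time, id
from log_text
where dt='20200101';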
2. Set the table's storage format to SequenceFile (mainly for statistical-result SQL whose reduce output is relatively small)
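For example, a hypothetical result table stored as SequenceFile (table and column names are assumptions):

-- statistical results are small, so a SequenceFile keeps them in one compact binary container
-- instead of many tiny text files
create table stat_result (
  time string,
  cnt  bigint
)
partitioned by (dt string)
stored as sequencefile;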
3. Use HAR to archive files
set hive.archive.enabled=true;
set hive.archive.har.parentdir.settable=true;
set har.partfile.size=2560000000000;
ALTER TABLE table_name ARCHIVE PARTITION (XXX);
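Archived partitions are read-only, so they can be unarchived again before being rewritten. A minimal sketch, assuming a table page partitioned by dt (the partition value is illustrative):

ALTER TABLE page ARCHIVE PARTITION (dt='20200101');    -- pack the partition's small files into one HAR
ALTER TABLE page UNARCHIVE PARTITION (dt='20200101');  -- restore ordinary files before modifying the data again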
4. Use distribute by col (this method applies to SQL that has only a map stage and no reduce stage, especially hive on spark.)
For example:
select
  time,
  id
from page
distribute by rand()  -- or: distribute by substr(time,0,5)
Notes:
1. In Hive this is still constrained by the number of reducers; it is best to set the reduce count equal to the number of buckets produced by substr(time,0,5). (See the sketch after these notes.)
2. With hive on spark it is best to set the shuffle parallelism or turn on Spark SQL adaptive execution. On the map side, hive on spark ignores the corresponding Hive parameters, so the number of tasks still equals the number of HDFS files unless you program in Scala. For statistics over a relatively long time period, it is better to use the table that matches the period: for a daily statistical cycle use a daily table rather than an hourly table, which saves a lot of tasks. Also note that this method merely adds a reduce stage artificially.
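As a sketch of note 1, reusing the page table from the example above and writing into a hypothetical table page_merged, with the reduce count set to the assumed number of distinct substr(time,0,5) values (31 is for illustration only):

set mapred.reduce.tasks=31;  -- assumed number of distinct substr(time,0,5) buckets

insert overwrite table page_merged partition (dt='20200101')
select
  time,
  id
from page
where dt='20200101'
distribute by substr(time,0,5);  -- each reducer receives one bucket and writes one file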
5. Add one more insert overwrite operation after the statistics (a general method, especially effective for the small files produced by statistics (reduce) jobs)
This method is equivalent to starting a separate MapReduce job just to merge the files.
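A minimal sketch of such a follow-up merge, reusing the hypothetical stat_result table from above: rewriting a partition onto itself starts one more job whose only effect is to read the many small result files and write them back as larger ones.

set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;

-- read the partition and overwrite it in place; the merge settings consolidate the output files
insert overwrite table stat_result partition (dt='20200101')
select time, cnt
from stat_result
where dt='20200101';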
