<> One, The origin and development history of Hadoop
Doug Cutting built the Lucene full-text search engine and, while processing massive amounts of data, ran into the same problems Google had faced.
Google published the ideas behind GFS and MapReduce.
In his spare time, Doug Cutting implemented the HDFS and MapReduce mechanisms based on those two papers.
File system: GFS -> HDFS
Computation: MapReduce -> MapReduce
Large tables: BigTable -> HBase
Doug Cutting: the father of Hadoop
<> Two, Modules of Hadoop:
<> Common module:
<> Major components of Hadoop: the HDFS distributed file system and the MapReduce computational model
NameNode: metadata management (metadata: file name, size, number of replicas, location of each replica on the nodes, ...)
DataNode: stores the actual data blocks.
SecondaryNameNode: merges (checkpoints) the metadata.
Client: initiates data requests (upload, read, write, ...)
ResourceManager: global task scheduling and resource management (CPU, memory)
NodeManager: manages resources and tasks on a single node
Client: submits applications (tasks)
ApplicationMaster: manages one application; requests resources for the app, assigns its internal tasks, and handles monitoring and fault tolerance
Container: an abstraction of the execution environment that encapsulates CPU, memory, and other multi-dimensional resources.
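The division of labor above can be sketched in a few lines of code. This is a toy model, not the YARN API: a Container bundles multi-dimensional resources (vcores, memory), and a simplified ResourceManager grants containers only while cluster capacity remains. All class and field names here are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Container:
    vcores: int      # CPU cores granted to the task
    memory_mb: int   # memory granted, in MB

class ResourceManager:
    """Toy global scheduler: tracks free cluster capacity and
    hands out containers while resources remain."""
    def __init__(self, total_vcores, total_memory_mb):
        self.free_vcores = total_vcores
        self.free_memory_mb = total_memory_mb

    def allocate(self, vcores, memory_mb):
        if vcores <= self.free_vcores and memory_mb <= self.free_memory_mb:
            self.free_vcores -= vcores
            self.free_memory_mb -= memory_mb
            return Container(vcores, memory_mb)
        return None  # request cannot be satisfied

rm = ResourceManager(total_vcores=8, total_memory_mb=16384)
c1 = rm.allocate(4, 8192)  # granted: a Container(4, 8192)
c2 = rm.allocate(8, 8192)  # refused: only 4 free vcores remain
```

In real YARN the ApplicationMaster negotiates these allocations with the ResourceManager on behalf of its tasks, and each NodeManager launches the granted containers locally.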
<> Three, The NameNode startup and checkpoint process
How is the metadata synchronized?
The NameNode writes its metadata changes to the edits file. When the edits file reaches a threshold (3600 seconds elapsed, or the file grows to 64 MB), the merge (checkpoint) process is started.
1. When the merge begins, the SecondaryNameNode copies edits and fsimage into its own memory and merges them, producing a file named fsimage.ckpt.
2. fsimage.ckpt is copied back to the NameNode, the old fsimage is deleted, and fsimage.ckpt is renamed to fsimage.
3. As soon as the SecondaryNameNode has copied edits and fsimage, the NameNode creates an edits.new file to record new metadata. When the merge completes, the original edits file is deleted, edits.new is renamed to edits, and the next cycle can begin.
4. Configure hdfs-site.xml:
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
  <description>The number of seconds between two periodic
  checkpoints.</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>
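The checkpoint steps above can be simulated with a few lines of Python. This is a toy model of the file shuffling, not Hadoop code: the namenode is just a dict of named "files", and each operation log entry is a string. The function and dictionary keys mirror the file names in the steps above.

```python
def checkpoint(namenode):
    """Simulate one SecondaryNameNode checkpoint cycle (steps 1-3)."""
    # Step 1: SecondaryNameNode pulls fsimage and edits and merges
    # them in memory into fsimage.ckpt.
    fsimage_ckpt = namenode["fsimage"] + namenode["edits"]
    # Step 3 (start): meanwhile the NameNode records new metadata
    # changes in edits.new.
    namenode["edits.new"] = []
    # Step 2: fsimage.ckpt replaces the old fsimage on the NameNode.
    namenode["fsimage"] = fsimage_ckpt
    # Step 3 (finish): the old edits file is deleted and edits.new
    # is renamed to edits, ready for the next cycle.
    namenode["edits"] = namenode.pop("edits.new")
    return namenode

nn = {"fsimage": ["mkdir /a"], "edits": ["put /a/f1", "put /a/f2"]}
checkpoint(nn)
# After the checkpoint: fsimage holds all three operations,
# and edits is empty again.
```

The point of the dance is that the NameNode never has to replay a huge edits log by itself: the merge work happens on the SecondaryNameNode, and the NameNode only swaps files.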
<> Four, HDFS characteristics
Advantages:
1. Handles large files. "Large" here usually means files from hundreds of MB up to hundreds of TB. In practical use, HDFS already stores and manages PB-scale data.
2. Streaming data access. HDFS is designed around "write once, read many" workloads. In most cases an analysis task touches a large portion of the dataset, so streaming through the whole dataset is more efficient than seeking out individual records.
3. Runs on clusters of cheap commodity machines. Hadoop has modest hardware requirements and does not need expensive, highly available machines. Because low-cost commodity hardware fails with high probability, HDFS is designed with data reliability, security, and high availability fully in mind.
Disadvantages:
1. Not suitable for low-latency data access.
2. Inefficient for large numbers of small files. HDFS is designed to stream large datasets, and the NameNode keeps the file system metadata in memory, so the number of files the system can hold is limited by the NameNode's memory. Roughly speaking, each file, directory, and block occupies about 150 bytes of metadata, so fewer, larger files make better use of the storage.
3. Random modification of files is not supported.
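The ~150 bytes-per-object figure makes the small-files limit easy to estimate. A minimal back-of-the-envelope calculation (the heap sizes are illustrative):

```python
BYTES_PER_OBJECT = 150  # approx. metadata cost per file, directory, or block

def max_objects(heap_gb):
    """How many metadata objects fit in a NameNode heap of heap_gb GB."""
    return heap_gb * 1024**3 // BYTES_PER_OBJECT

# A 1 GB heap holds only about 7 million files/directories/blocks,
# which is why millions of small files exhaust the NameNode quickly.
print(max_objects(1))  # → 7158278
```

Note that each file costs at least two objects (the file entry plus one block), so the practical file count is even lower than this bound.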