1. Splitting and merging
Often there is no point in clinging to a single technical detail, because technologies are interlinked. What matters most is the design idea behind them.
When the amount of data grows too large, split it up to make the granularity finer. When the data is too fragmented, merge it to make the granularity coarser.
Many technologies apply these ideas. Here are a few examples of splitting:
The move from centralized services to distributed services
From Collections.synchronizedMap(x) to the JDK 1.7 ConcurrentHashMap and then the JDK 1.8 ConcurrentHashMap: each step makes the lock granularity finer while still guaranteeing thread safety
From AtomicInteger to LongAdder, and the size() method of ConcurrentHashMap: spreading updates over several cells reduces CAS contention and speeds up multithreaded accumulation of a single number
The JVM's G1 GC algorithm, which divides the heap into many Regions for memory management
HBase's RegionServer, which divides data into multiple Regions for management
Resource isolation with separate thread pools in everyday development
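The AtomicInteger to LongAdder item above can be sketched in a few lines. This is only an illustration of the idea, not how the JDK implements it: Python has no CAS primitive, so per-cell locks stand in for it, and the class name StripedCounter and the stripe count are made up for this example.

```python
import threading


class StripedCounter:
    """Counter that spreads updates over several independently locked cells."""

    def __init__(self, stripes=8):
        self._cells = [0] * stripes
        self._locks = [threading.Lock() for _ in range(stripes)]

    def add(self, delta=1):
        # Pick a cell from the calling thread's id so that threads mostly
        # contend on different locks instead of one shared lock.
        i = threading.get_ident() % len(self._cells)
        with self._locks[i]:
            self._cells[i] += delta

    def sum(self):
        # Adds up the cells without a global lock, so the result is only
        # exact when no updates are in flight.
        return sum(self._cells)


def run_demo(threads=4, per_thread=10_000):
    """Increment one counter from several threads and return the total."""
    counter = StripedCounter()

    def worker():
        for _ in range(per_thread):
            counter.add()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.sum()
```

The trade-off is the same one the list item describes: writes get cheaper because they rarely collide, while reading the total costs a pass over all cells.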
Many technologies also apply the idea of merging. Here are a few examples:
TLAB (Thread Local Allocation Buffers): avoids contention between threads and improves the efficiency of object allocation
Escape analysis: memory for an object that does not escape can be allocated on the stack instead of the heap and is reclaimed together with the stack space, which reduces the number of temporary objects allocated on the heap
Under the CMS GC algorithm, which uses mark-sweep, there are still options for compacting fragmented memory, such as -XX:+UseCMSCompactAtFullCollection (whether to compact after a Full GC; Stop The World pauses get longer) and -XX:CMSFullGCsBeforeCompaction (how many Full GCs to run before compacting)
Lock coarsening: when the JIT finds that a series of consecutive operations repeatedly lock and unlock the same object, it widens the scope of the lock
Kafka has settings that batch data for network transfer to reduce network overhead, such as batch.size and linger.ms
Batch-access interfaces in everyday development
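The batching idea behind settings like batch.size and linger.ms can be sketched as a small buffer that flushes when it is full or when the oldest record has waited long enough. This is a simplified illustration, not Kafka's producer: the Batcher class and its parameters are hypothetical, and the batch size here counts records, whereas Kafka's batch.size is measured in bytes.

```python
import time


class Batcher:
    """Collects records and hands them to `send` in batches."""

    def __init__(self, send, batch_size=3, linger_s=0.005, clock=time.monotonic):
        self._send = send              # callback that receives a list of records
        self._batch_size = batch_size  # flush once this many records are buffered
        self._linger_s = linger_s      # or once the oldest record waited this long
        self._clock = clock            # injectable clock, handy for testing
        self._buf = []
        self._first_at = None

    def add(self, record):
        if not self._buf:
            self._first_at = self._clock()
        self._buf.append(record)
        self.poll()

    def poll(self):
        """Flush if the batch is full or has lingered long enough."""
        if not self._buf:
            return
        full = len(self._buf) >= self._batch_size
        waited = self._clock() - self._first_at >= self._linger_s
        if full or waited:
            self.flush()

    def flush(self):
        if self._buf:
            self._send(list(self._buf))
            self._buf.clear()
```

A larger batch size or linger time means fewer network calls at the cost of extra latency for each record.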
2. Partitioning
This article is based on MySQL InnoDB.
With that said, on to the main topic, starting with partitioning. I have already written a blog post on MySQL partitioning, so I will keep this part short.
2.1 Implementation
For how to set it up, see the post linked above. The one thing to remember here: if the table has a primary key or a unique index, the partitioning columns must be part of every such key.
Partitioning is done at the database level and is transparent to the application, so no code needs to change.
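The rule about unique keys is easy to check mechanically. The helper below is a hypothetical sketch, not a MySQL API: it takes the partitioning columns and a mapping of unique key names to their columns (both invented inputs) and reports which keys would break the rule.

```python
def partition_key_violations(partition_columns, unique_keys):
    """Return the names of unique keys that do not contain every partitioning column.

    partition_columns: iterable of column names used in the partitioning expression.
    unique_keys: dict mapping key name -> list of columns (primary key included).
    """
    needed = set(partition_columns)
    return [name for name, cols in unique_keys.items() if not needed <= set(cols)]
```

For example, a table partitioned on created_at with PRIMARY KEY (id) fails the check, while PRIMARY KEY (id, created_at) passes.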
2.2 Internal files
First go to the data directory. If you do not know where it is, you can run:
Let's take a look at the internal files:
As the figure above shows, there are 2 types of files: .frm files and .ibd files.
.frm file: the table structure file
.ibd file: in InnoDB, indexes and data are stored in the same file. (You may see .MYD data files and .MYI index files instead. That is fine; it means the table uses the MyISAM storage engine, and those files correspond to InnoDB's .ibd file.) Because the Order table is split into 5 partitions, there are 5 such files.
.par file: you may or may not see .par files. Note: starting with MySQL 5.7.6, .par partition definition files are no longer created. Partition definitions are stored in the internal data dictionary.
2.3 Data processing
Partitioning a table can improve MySQL performance. An unpartitioned table has a single .ibd file holding one large B+ tree. After partitioning, the rows are divided among the partitions according to the partitioning rule, so the one large B+ tree is split into several smaller trees.
Reads become more efficient when the query uses the partition key: MySQL walks the secondary index B+ tree of the matching partition and then the clustered index B+ tree of that same partition.
If the query does not include the partition key, it is executed once in every partition, which can cause many logical I/Os!
To check which partitions a SQL statement touches, use EXPLAIN PARTITIONS SELECT xxxxx. The output shows which partitions the SELECT statement goes through. (In MySQL 5.7 and later, plain EXPLAIN already includes the partitions column.)
mysql> explain partitions select * from TxnList where startTime>'2016-08-25 00:00:00' and startTime<'2016-08-25 23:59:00';
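+----+-------------+-------------------+------------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table             | partitions | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+----+-------------+-------------------+------------+------+---------------+------+---------+------+-------+-------------+
|  1 | SIMPLE      | ClientActionTrack | p20160825  | ALL  | NULL          | NULL | NULL    | NULL | 33868 | Using where |
+----+-------------+-------------------+------------+------+---------------+------+---------+------+-------+-------------+
1 row in set (0.00 sec)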
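To make the pruning behaviour concrete, here is a rough model of what the optimizer decides for a time-range query. It is a sketch under assumptions, not MySQL's algorithm: the partitions are hypothetical daily ranges named pYYYYMMDD, each covering one day from midnight to midnight, and the function names are invented.

```python
from datetime import datetime, timedelta


def daily_partitions(first_day, days):
    """Build hypothetical daily partitions: name -> (start, end), end exclusive."""
    parts = {}
    for i in range(days):
        start = first_day + timedelta(days=i)
        parts[start.strftime("p%Y%m%d")] = (start, start + timedelta(days=1))
    return parts


def prune(parts, lo=None, hi=None):
    """Names of partitions that may hold rows with lo < startTime < hi.

    With no bounds on the partition key, every partition has to be scanned.
    """
    if lo is None and hi is None:
        return list(parts)
    hit = []
    for name, (start, end) in parts.items():
        # Keep the partition only if its interval overlaps the query range.
        if (hi is None or start < hi) and (lo is None or end > lo):
            hit.append(name)
    return hit
```

A query bounded to a single day touches a single partition, while a query without the partition key falls back to scanning all of them.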
3. Database and table sharding
As time passes and the business grows, the amount of data in a table keeps increasing, and so does the number of operations on it.
A single physical machine has limited resources, so its storage capacity and processing capacity will eventually hit a ceiling. At that point we use database and table sharding to carry very large tables, the kind that cannot fit on a single machine.
Partitioning is different: the partitions usually live on a single machine, and range partitioning by time is the most common choice because it makes archiving easy. Sharding has to be implemented in code, whereas partitioning is implemented inside MySQL. Sharding and partitioning do not conflict and can be used together.
3.1.1 Criteria for sharding
Storage footprint of 100 GB or more
Daily data growth of 2 million rows or more
More than 100 million rows in a single table
3.1.2 Sharding key
Choosing the sharding key is very important
In most scenarios it should be a field that queries filter on
userId is the usual choice and meets the conditions above
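A minimal sketch of routing by userId, assuming plain modulo hashing. The database count, tables per database, and the db_N.orders_N naming are all hypothetical, and real middleware may use different algorithms.

```python
def route(user_id, db_count=4, tables_per_db=8):
    """Map a userId to (database index, table index) with modulo hashing."""
    slot = user_id % (db_count * tables_per_db)
    return slot // tables_per_db, slot % tables_per_db


def physical_table(user_id, logical="orders", db_count=4, tables_per_db=8):
    """Name of the physical table that holds this user's rows."""
    db, tbl = route(user_id, db_count, tables_per_db)
    return f"db_{db}.{logical}_{tbl}"
```

Because every row of one user lands in the same table, a query that filters on userId only has to visit one shard. Note that changing db_count or tables_per_db later moves existing users to other shards, so resizing needs a data migration plan.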
3.2 Distributed database middleware
Distributed database middleware comes in two types: proxy architecture and client architecture. Proxy-mode products include MyCat and DBProxy; client-architecture products include TDDL and Sharding-JDBC.
So what is the difference between the proxy and client architectures, and what are the pros and cons of each? A picture makes it clear.
In proxy mode, our SELECT and UPDATE statements are sent to a proxy, and the proxy operates on the underlying databases. The proxy itself must therefore be highly available. Otherwise the databases are still up but the proxy is down, and the service is out anyway.
Client mode usually adds a layer of encapsulation on top of the connection pool. It holds connections to the different databases internally and hands the SQL to a smart client for processing. It usually supports only one language; to support other languages, a client has to be developed for each of them.
Their advantages and disadvantages are as follows ：
3.3 Internal files
Here is an example that combines sharding with partitioning. It looks basically the same as a partitioned table, just with many more .ibd files for the tables. The file types are explained above:
[miaojiaxing@Grim testmydata]# ls | grep 'base_info'
3.4.1 Transactions
Once the databases and tables are sharded, distributed transactions are unavoidable: how do we make sure that several records inserted into different databases either all succeed or all fail?
Some readers may think of XA. XA performs poorly and also requires MySQL 5.7. Flexible transactions are the mainstream solution at present, and the TCC pattern is one kind of flexible transaction.
Each company has its own implementation for distributed transactions: Huawei has saga, Alibaba has TXC, and Ant Financial has DTX, which supports FMT mode and TCC mode.
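The TCC (Try, Confirm, Cancel) pattern mentioned above can be illustrated with an in-memory toy. This is only a sketch of the control flow and is not taken from any of the frameworks named here: the Reservation class and run_tcc function are hypothetical, and a real implementation also has to handle retries, idempotence, timeouts, and coordinator crashes.

```python
class Reservation:
    """One TCC participant guarding a stock counter on a hypothetical shard."""

    def __init__(self, stock):
        self.stock = stock
        self.reserved = 0

    def try_(self, qty):
        # Try: check and reserve the resource without consuming it yet.
        if self.stock - self.reserved < qty:
            return False
        self.reserved += qty
        return True

    def confirm(self, qty):
        # Confirm: turn the reservation into the real change.
        self.reserved -= qty
        self.stock -= qty

    def cancel(self, qty):
        # Cancel: release the reservation and leave the stock untouched.
        self.reserved -= qty


def run_tcc(steps):
    """steps: list of (participant, qty). Returns True if everything was confirmed."""
    tried = []
    for participant, qty in steps:
        if participant.try_(qty):
            tried.append((participant, qty))
        else:
            # One Try failed: cancel every reservation made so far.
            for p, n in tried:
                p.cancel(n)
            return False
    for participant, qty in tried:
        participant.confirm(qty)
    return True
```

Each participant only ever commits local changes, which is why this style avoids holding database locks across shards the way XA does.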
3.4.2 Joins
TDDL and MyCAT both support cross-shard joins. Still, try to avoid cross-database joins, for example through field redundancy.
If a cross-shard join cannot be avoided and the middleware supports it, you can simply use it. If the middleware does not support it, query the shards manually.
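Querying manually usually means fetching rows from each shard separately and joining them in application memory. A minimal sketch, with plain Python lists standing in for the per-shard query results; the row fields (user_id, order_id, name) and the function name are hypothetical.

```python
def join_orders_with_users(order_shards, user_shards, user_ids):
    """Fetch matching rows from each shard, then join them in application memory."""
    wanted = set(user_ids)

    # Step 1: collect the user rows from every user shard.
    users = {}
    for shard in user_shards:
        for row in shard:
            if row["user_id"] in wanted:
                users[row["user_id"]] = row

    # Step 2: scan the order shards and attach the matching user.
    joined = []
    for shard in order_shards:
        for row in shard:
            user = users.get(row["user_id"])
            if user is not None:
                joined.append({"order_id": row["order_id"], "user_name": user["name"]})
    return joined
```

This only works comfortably when one side of the join is small enough to hold in memory, which is another reason to prefer field redundancy where possible.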
4. Summary
Sharding and partitioning serve different purposes. Sharding is for very large tables that cannot fit on a single machine. Partitions usually live on a single machine, and range partitioning by time is the most common choice because it makes archiving easy.
In terms of performance, a partitioned table and a sharded table are about the same. The difference is that partitioning is implemented inside MySQL, so it involves somewhat less data interaction than the sharding approach.