Hive As data warehouse , Similar to relational database development process , All need to be modeled first , So called modeling , It is to specify the relationship between tables . Modeling in hive It can be roughly divided into star type , Snowflake and constellation . We should have a deep understanding of modeling , First of all, we need to hive In the storehouse
Several
Table concept .hive The table in is divided into internal table , External table , Bucket Watch , Partition table . In data logic, it is divided into dimension table and fact table . Dimension table is equivalent to dictionary table . The fact table is a data table outside the dictionary table .
1.1 Star model
Multiple dimension tables , A fact sheet , There is no relationship between dimension tables . Better query performance , Redundant storage . Star model is used more often .
1.2 Snowflake Schema
Snowflake is an extension of star modeling , There are relationships between dimension tables . Storage reduces redundancy , Loss of query performance , Multi level connection required . In common with the star model, there is only one fact table .
1.3 Constellation model
Constellation is also an extension of star model , There are multiple fact tables .
Problem expansion
For example, the information of a business system does not provide a summary table , It's a new business , You need to add a data table . When we count business volume , We have to take dozens of data tables for association , Report execution efficiency is very poor , After that, we directly help the source system to make a summary table of new business on the data warehouse , It improves the processing efficiency of follow-up reports , It also shields the impact of new business varieties on the report .
Here we can consider the basic characteristics of our book warehouse management design , This is the concept of layering :
* Clear data structure
* Data lineage tracking
* Reduce duplication of development
* Simplify complex problems
* Shield the exception of original data
* Influence of shielding business , You don't have to change the service once, you need to re access
Here we can follow these concepts , Which of our basic data is divided into several storehouses , Combined with our business requirements, we build the market layer of each business layer by layer ; Can effectively reduce the cost of our later operation and maintenance ;
( Four ) Application scenarios
The main advantage of star model design is to improve query efficiency , Because the generated fact table has been preprocessed , The main data is in the fact table , So as long as you scan the real-time table, you can do a lot of queries , You don't have to do a lot of it join, Secondly, the dimension table data is generally less , stay join It can be directly put into memory join To improve efficiency , besides , The fact table of star model is more readable , You can get most of the core information without associating multiple tables , Design and maintenance is relatively simple .
The design of snowflake model is more in line with the concept of database paradigm , The design is more formal , Less data redundancy , However, it may be necessary during query join Multiple tables lead to query efficiency degradation , In addition, standardized operation is more complex in the later maintenance .
( Five ) summary
Through the above comparison , We can find that the data warehouse is more suitable to use star model to build the underlying data Hive surface , Through a lot of redundancy to improve query efficiency , Star model pair OLAP Analysis engine support is more friendly , This point lies in Kylin It can be reflected in Chinese . And the snowflake model in relational database such as MySQL,Oracle Very common in , Especially the database table of e-commerce . In the data warehouse, the application scenarios of snowflake model are relatively few , But it's not without it , So in the specific design , We can consider whether we can combine the advantages of the two to participate in the design , In order to achieve the optimization of the design .
Technology
Daily Recommendation