hello hello everyone , Long time no see. ! Today's 《 Database system of teaching and learning 》 To learn query optimization in database system
. Teaching and learning database , I haven't seen such a cool title ?“ Language startle ”, you 're right , The title is so cool .

My sister, xiaoburi 18 year , The existence of goddess in Campus , Excellent performance, versatile sports , Gentle personality, honest and kind . however , Only I know , The bright burying in the eyes of everyone , It used to be a hamster Cape , Rolling all over the place , Except for eating, sleeping and playing . And the transformation of all this , From that night .

Since then , Xiaoburi often asked me to help her with her lessons . Today she wants to know about query optimization in database system . This tutorial talks about query optimization through my dialogue with xiaoburi .

<> Logical query optimization

<> Algebraic expression of equivalence relation

If two relational algebra expressions have the same result on any database instance , Then these two relational algebraic expressions are equivalent (equivalent)

<> Equivalent transformation rule

<> The equivalent transformation rules about selection

,

<> Select push down

Relation algebra expression tree (expression tree) Select operation in push down (push down) Generally, it can improve the efficiency of query execution

*
Select push down to filter out tuples irrelevant to the result as early as possible

Selectivity (selectivity) Highest selection first

*
Selectivity : Proportion of tuples satisfying the condition

*
Filter out more result independent tuples as early as possible

Decompose complex selection conditions , Then select and push down

<> Override of selection conditions

Rewriting inefficient selection conditions before rewriting : X=Y AND X=3

After rewriting :X=3 AND X=3

Remove unnecessary selection conditions

Before rewriting : SELECT * FROM R WHERE 1=1; After rewriting :SELECT * FROM R;

Check for selection conditions that cannot be met
SELECT * FROM R WHERE 1=0;
Merge selection criteria

Before rewriting :
SELECT*FROMR WHEREABETWEEN1AND100 ORABETWEEN50AND150;
After rewriting :
SELECT*FROMR WHEREABETWEEN1AND150;
When the selection operation can be pushed down to multiple branches of the expression tree , You need to consider which branch to push down

* Index will affect the scheme selected to be pushed down

<> On the equivalent transformation rules of projection

<> Projection pushdown

Push down the projection operation in the expression tree of relational algebra (pushdown) Generally, query execution efficiency can be improved

* Projection pushdown can reduce the size of tuples
In some cases , The projection operation will waste the opportunity of query optimization. There is no index on the result of projection operation

* Hypothetical relationship Student Built on property Sname Secondary index on
* ΠSno,Sname(Student) Make the index unavailable for subsequent selection and connection operations

<> query rewrite

Remove unnecessary projections

Remove unnecessary connections

Suppose all the students have taken courses

<> Cost estimation of logical query plan

<> Measure of query plan cost

The cost of a query plan is measured by the number of tuples of intermediate results generated during the execution of the query plan .

<> Result size estimation of operation

DBMS According to the size of the input relation of the relation algebra operation ( Number of tuples ) To estimate the size of the operation result ( Number of tuples )

*
accuracy

*
Easy to calculate

*
Logical consistency (logicallyconsistent)

Logical consistency

*
Monotonicity : The larger the input of an operation , The larger the estimate of the size of the result of the operation

*
Order independent : When performing the same operation satisfying the exchange law and the Union Law on multiple relationships ( as on,×,∪,∩), The estimated size of the final result is independent of the order in which the operations are performed

<> Statistics required to estimate the size of operation results

System catalog (systemcatalog) Some statistics related to the size of estimated operation results are recorded in

*
T( R ): relationship R Tuples of

*
V(R,A): relationship R Property set for A Number of different values of

*
Assumption of uniform distribution of attribute values , The values of each attribute are uniformly distributed

*
Attribute independence hypothesis , All attributes of a relationship are independent of each other

Collect statistics manually
Postgres/SQLite:ANALYZE Oracle/MySQL:ANALYZETABLE SQLServer:UPDATESTATISTICS
DB2:RUNSTATS
Estimation of the result size of Cartesian product
T(R×S)=T(R)T(S)
Estimation of projection result size
T(ΠL(R))=T(R)( Projection without de duplication ) T(ΠL(R))=V(R,L)( Projection with de duplication )
Select an estimate of the result size

Is the latter two different

<> Estimation of the result size of two natural connections

Consider two relationships R and S Natural connection of R on S

* Property value contains assumptions
For connection properties K, If V(R,K) ≤ V(S,K), be R Of K Property value set is S Of K Subset of attribute value set

* Property value retention assumptions
about R Any non connection property in A, Yes V(R on S,A) = V(R,A)

situation 2: R and S Yes 2 Connection properties X and Y

Natural connections do not have the same properties , Degenerate to Cartesian connection

<> Logical consistency of size estimation of natural connection results

Two way connection in different order (2-wayjoin), Still going 1 Secondary multiplexing (multi-wayjoin), The estimated values of the results obtained by the above methods are the same

<> Estimation of the result size of set operation

<> Estimation of the size of de duplication results

<> histogram

<> equi-width histogram

* The width of each interval of attribute value is the same
* The occurrence times of attribute values in each interval are different

<> Contour histogram

*
The width of each interval of attribute value is different

*
The occurrence times of attribute values in each interval are basically the same

<> Estimation of connection result size based on histogram

Use histogram : T(R on S)=10×5/10+5×20/10=15

Do not use histogram : T(R on S)=245×245/100=600

<> Heuristic optimization method of logical query plan

If some equivalent transformation can reduce the cost of query plan , Then perform the transformation on the query plan

<> Optimization of connection order

Although the connection operation satisfies the switching law , However, the input relationship of join operation in query plan has different functions

*
stay R on S in ,R It's a left-wing relationship (left relation),S It's right (right relation)

*
If a single connection is used (one-pass join), Then the left relation is to build the relation (build relation), Right relation is probe relation (proberelation)

*
If nested loop connection is used (nested-loop join), Then the left relation is the external relation (outer relation), Right relation is internal relation (inner relation)

*
If using index connection (index-based join), Then the right relation is indexed (indexed relation)

<> Connection tree (JoinTrees)

The order in which the join operations on a set of relationships are performed can be represented by the join tree (join tree) To represent

Left deep connection tree (left-deep join tree): Only one relationship is left (left relation), Other relationships are right (right relation)

Right deep junction tree (right-deep join tree): Only one relationship is right , Other relationships are left-wing (left relation)

Dense tree (bushy tree): Connection trees other than left and right deep connection trees

<> Why the query plan of left deep connection tree is more efficient ?

If the following query plans all use one connection (one-pass join) algorithm , Then in the pipeline query execution model (piplining model) lower , Left deep connection tree query plan at any time
Lower memory requirements than other query plans .


If the following query plans all use nested loop join (nested-loopjoin) algorithm , Then in the pipeline query execution model (pipliningmodel) lower , Left deep connection tree query plan is not required
Build intermediate relationships many times .

<> Dynamic programming method for optimizing connection order

The price is 0, No intermediate results

<> Optimization of physical query plan

<> Determine the execution method of the selection operation

<> Index based selection

Applicable conditions

* The form of the selection condition is K=v or l≤K≤u
* relationship R Built on property K Index of
* method : Use index based selection algorithm
<> Determine how to perform the connect operation

* Index connection (IndexJoin)
Applicable conditions :

* Little left-hand relationship
* Right relation has index on connection property
* Sort merge connection (Sort-MergeJoin)
Applicable conditions :

* At least one relationship is already sorted by connection properties
It is also suitable to use sort merge connection to make multiple connections on the same connection attribute , as R(a,b) on S(a,c) on T(a,d)

* When there are very few pages available in the memory buffer , You can use nested loop connections
One trip connection (One-PassJoin)
Index connection (IndexJoin)
Sort merge connection (Sort-MergeJoin)
HASH join (HashJoin)
Nested loop connection (Nested-LoopJoin)

<> Determine execution model

The number of pages available in memory buffer pool affects the selection of execution model

* Materialized execution (materialization)
* Pipeline execution (piplining)
<> summary

Let's play , To return to trouble , Don't make fun of learning .

This article introduces query optimization : Including logic query optimization, such as selecting push down, etc , And physical query optimization, such as index based selection . Pay attention to the principle of query optimization in learning , Keep in mind the method of query optimization .

Technology
©2019-2020 Toolsou All rights reserved,
Summary of artificial intelligence algorithm python Dynamic programming for single source shortest path git Pull the remote branch and switch to it mysql Recursively finds all child nodes of the parent class ( Essence )2020 year 7 month 12 day webpack Use of common plug-ins Output string at specified position ( Detailed analysis )Result Encapsulation of return result information Keras Summary of training data loading Vue Development tips mysql Modify primary key