Two days ago our teacher explained the BM25 algorithm to us, including where it comes from and how it is derived. I'll sort out my notes here. I won't explain the earlier BIM model here; if you are interested, you can look it up yourself.
Okapi BM25: A non-binary model
BM25 is an algorithm for evaluating the relevance between search terms and documents. It is based on the probabilistic retrieval model.
For instance, suppose we search for the keywords red apple, which are segmented into the words red and apple. We look both words up in the index over our 1000 documents and find that red appears very frequently, while apple appears rarely. Now let's rank these 1000 documents by score. With a naive scoring scheme (one point for every occurrence), the more often red appears in a document, the higher that document's score. But this goes against what we need, because what we are searching for is red apple. BM25 is designed to eliminate this low-relevance problem: each word in the query gets a weight, namely idf (we'll talk about it later).
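The problem with one point per occurrence can be seen in a minimal sketch. The two toy documents and the function name `naive_score` are hypothetical and only serve as an illustration:

```python
from collections import Counter

# Hypothetical toy documents, already split into words.
docs = {
    "doc_a": ["red", "red", "red", "red", "red", "car"],  # lots of "red", no "apple"
    "doc_b": ["red", "apple", "pie"],                     # actually about a red apple
}
query = ["red", "apple"]


def naive_score(query_terms, doc_tokens):
    """One point for every occurrence of any query word in the document."""
    counts = Counter(doc_tokens)
    return sum(counts[term] for term in query_terms)


scores = {name: naive_score(query, tokens) for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # doc_a gets 5 points, doc_b only 2
print(ranking)  # doc_a is ranked first although it never mentions "apple"
```

The document that repeats red wins even though it does not contain apple at all, which is the behaviour that the word weight is meant to correct.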
1.BM25 Model

In fact, the formula is not difficult to understand. It has only three parts:
1. The weight of a word:
2. The relevance of words and documents:
3. The relevance of words and the query (keywords):
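Assuming the formula meant here is the standard Okapi BM25 form, which uses exactly the symbols explained in the following sections, the three parts are multiplied together and summed over the words of the query. The name RSV_d for the resulting score and the logarithm in the first factor come from that standard form and are assumptions with respect to these notes:

```latex
RSV_d = \sum_{t \in q}
  \underbrace{\log\frac{N}{df_t}}_{\text{1. word weight}}
  \cdot
  \underbrace{\frac{(k_1 + 1)\, tf_{td}}
                   {k_1\left((1 - b) + b\,\frac{L_d}{L_{ave}}\right) + tf_{td}}}_{\text{2. word and document}}
  \cdot
  \underbrace{\frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}}_{\text{3. word and query}}
```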

2. idf explained (word weight calculation)
We have seen the formula above, but it may not be clear yet what it means, so let's go through it slowly:
N: the total number of documents.
dft: the number of documents that contain the word t, obtained by looking the keyword up in the inverted index (in the example above, the number of documents among the 1000 that contain red).
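A small sketch of how N and dft could be turned into a weight, assuming the weight is the logarithm of N/dft as in the standard BM25 form. The base-10 logarithm and the document frequency of 800 (standing for a very common word) are assumptions chosen for illustration:

```python
import math


def idf(num_docs, doc_freq):
    """Word weight log10(N / df_t); the base of the logarithm only rescales scores."""
    return math.log10(num_docs / doc_freq)


N = 1000
rare_weight = idf(N, 10)     # word found in 10 documents, N/df_t = 100
common_weight = idf(N, 800)  # hypothetical word found in 800 documents

print(rare_weight, common_weight)
```

A word that occurs in few documents receives a much larger weight than a word that occurs almost everywhere, so a match on the rare word counts for more.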
For example, if 10 of our 1000 documents contain red, then N/dft = 100, and from this ratio we can calculate the weight of red.
3. Relevance of words and documents explained
In fact, the core of BM25 is idf × tf, that is, the weight of a query word multiplied by the relevance between that query word and the document.
tftd: the frequency of the word t in document d, which serves as its weight in d.
Ld and Lave: the length of document d and the average document length over the whole document collection, respectively.
k1: a positive tuning parameter used to scale the frequency of a word in the document. If k1 is 0, word frequency is not considered at all (this corresponds to a binary model). If k1 is large, the result corresponds to using the raw word frequency.
b: another tuning parameter (0 ≤ b ≤ 1) that determines how strongly document length is scaled. b = 1 means the word weight is fully scaled by document length; b = 0 means document length is not considered in the normalization.
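The effect of k1 and b described above can be checked with a short sketch of the word and document part, assuming the standard BM25 form of that factor. The function name `tf_component` and the default values k1 = 1.2 and b = 0.75 are assumptions (commonly used starting points, not values from these notes):

```python
def tf_component(tf_td, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Word and document part: (k1 + 1) * tf / (k1 * ((1 - b) + b * Ld / Lave) + tf)."""
    if tf_td == 0:
        return 0.0  # a word that is absent from the document contributes nothing
    length_norm = k1 * ((1 - b) + b * (doc_len / avg_doc_len))
    return ((k1 + 1) * tf_td) / (length_norm + tf_td)


# k1 = 0: frequency is ignored, any positive count gives 1.0
print(tf_component(1, 100, 100, k1=0), tf_component(7, 100, 100, k1=0))

# large k1: the value approaches the raw frequency (here close to 5)
print(tf_component(5, 100, 100, k1=1000))

# b = 0: document length has no influence
print(tf_component(3, 50, 100, b=0), tf_component(3, 500, 100, b=0))

# b = 1: the same count is worth less in a long document than in a short one
print(tf_component(3, 50, 100, b=1), tf_component(3, 500, 100, b=1))
```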
4. Relevance of words and the query (keywords) explained
tftq: the frequency of the word t in the query q, which serves as its weight in q.
k3: another positive tuning parameter, used to scale the frequency of a word in the query.
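Putting the three parts together, a complete scoring function could look like the sketch below. It assumes the standard BM25 form with a natural logarithm for the word weight; the toy documents, the function name `bm25_score`, and the default values of k1, b and k3 are hypothetical choices for illustration and are not taken from these notes:

```python
import math
from collections import Counter


def bm25_score(query_tokens, doc_tokens, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75, k3=8.0):
    """Sum over query words of: word weight * word/document part * word/query part."""
    tf_d = Counter(doc_tokens)
    tf_q = Counter(query_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term, tf_tq in tf_q.items():
        tf_td = tf_d[term]
        df_t = doc_freq.get(term, 0)
        if tf_td == 0 or df_t == 0:
            continue  # the word does not occur in this document
        idf = math.log(num_docs / df_t)
        doc_part = ((k1 + 1) * tf_td) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score


# Hypothetical toy collection, already split into words.
docs = {
    "doc_a": ["red", "red", "red", "red", "red", "car"],
    "doc_b": ["red", "apple", "pie"],
    "doc_c": ["red", "bus"],
    "doc_d": ["green", "tea"],
}
num_docs = len(docs)
avg_doc_len = sum(len(tokens) for tokens in docs.values()) / num_docs
# df_t: number of documents that contain each word
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

query = ["red", "apple"]
scores = {name: bm25_score(query, tokens, doc_freq, num_docs, avg_doc_len)
          for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy collection the document that contains both red and apple is ranked first, ahead of the document that only repeats red, because apple is rare and therefore carries the larger weight.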

Here are the results of running BM25 on the AP90 data:
