Two days ago our teacher explained the BM25 algorithm to us, covering where it comes from and how it is derived. I'm organizing my notes here. I won't explain the BIM (binary independence model) that precedes it; if you are interested, you can look it up yourself.
Okapi BM25: a non-binary model
BM25 is an algorithm for scoring the relevance between search terms and documents. It is based on the probabilistic retrieval model.
For example, suppose we search for the keywords red apple, which are segmented into the terms red and apple. We look both terms up in an index over our 1000 documents and find that red occurs very frequently while apple occurs rarely. Now let's score those 1000 documents. With naive scoring (one point per occurrence), the more often red appears in a document, the higher that document scores. That goes against what we need, because we are searching for red apple, not just red. BM25 is meant to fix this low-relevance problem by giving each query term a weight, namely idf (discussed later).
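To make the problem concrete, here is a minimal sketch comparing one-point-per-occurrence scoring with scoring weighted per term. The document frequencies and term counts are made-up numbers, and the weight `log10(N / df)` anticipates the idf described later, with the base-10 logarithm being an assumption.

```python
import math

N = 1000  # collection size, as in the example above

# Hypothetical document frequencies: "red" is common, "apple" is rare.
doc_freq = {"red": 800, "apple": 20}

# Hypothetical term counts for two documents.
doc_a = {"red": 10, "apple": 0}  # mentions "red" a lot, never "apple"
doc_b = {"red": 1, "apple": 1}   # mentions both query terms once

query = ["red", "apple"]


def naive_score(doc):
    """One point for every occurrence of any query term."""
    return sum(doc.get(term, 0) for term in query)


def weighted_score(doc):
    """Each occurrence counts in proportion to how rare the term is."""
    return sum(doc.get(term, 0) * math.log10(N / doc_freq[term]) for term in query)


print("naive:   ", naive_score(doc_a), naive_score(doc_b))
print("weighted:", weighted_score(doc_a), weighted_score(doc_b))
```

Under naive counting doc_a wins even though it never mentions apple. Once the rare term carries more weight, doc_b ranks first.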
In fact, the formula is not hard to understand. It has only three parts:
1. The term weight (idf):
2. The relevance between the term and the document:
3. The relevance between the term and the query (keywords):
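Assuming the formula meant here is the standard Okapi BM25 form given in Manning et al.'s Introduction to Information Retrieval, which uses the same symbols as the sections below, the three parts are multiplied together for each query term and summed over the query. The name RSV_d (retrieval status value of document d) is borrowed from that textbook and is not taken from these notes.

```latex
RSV_d = \sum_{t \in q}
  \underbrace{\log\left[\frac{N}{df_t}\right]}_{\text{1. term weight}}
  \cdot
  \underbrace{\frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \cdot (L_d / L_{ave})\right) + tf_{td}}}_{\text{2. term and document}}
  \cdot
  \underbrace{\frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}}_{\text{3. term and query}}
```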
2. idf explained (term weight calculation)
We have seen the formula above, but what it means may not be obvious yet, so let's work through it step by step:
N: the total number of documents.
dft: the number of documents that contain the term t, found through the inverted index (in the example above, how many of the 1000 documents contain red).
For example, if red appears in 10 of our 1000 documents, then N/dft = 100, and from that we can calculate its weight.
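The calculation can be sketched in a few lines. The function name `idf` and the choice of a base-10 logarithm are assumptions made for illustration; a different base only rescales every weight by the same constant.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Term weight log10(N / df_t). The base-10 log is an assumption."""
    if doc_freq <= 0:
        raise ValueError("doc_freq must be positive")
    return math.log10(n_docs / doc_freq)


# The example above: the term appears in 10 of 1000 documents, so N / df_t = 100.
print(idf(1000, 10))

# A term found in most documents gets a much smaller weight.
print(idf(1000, 800))
```

With N/dft = 100 the base-10 weight is 2, while a term found in 800 of the 1000 documents gets a weight of roughly 0.1.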
3. tf explained (relevance between term and document)
In fact, the core of BM25 is idf × tf: the weight of a query term multiplied by the relevance between that term and the document.
tftd: the weight (term frequency) of term t in document d.
Ld and Lave: the length of document d and the average document length across the whole collection, respectively.
k1: a positive tuning parameter that scales the term frequency within a document. If k1 is 0, term frequency is not considered at all; if k1 takes a large value, the result corresponds to using the raw term frequency.
b: another tuning parameter (0 ≤ b ≤ 1) that determines how much the document length is scaled: b = 1 means the term weight is fully scaled by document length, and b = 0 means document length is not considered in the normalization.
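The effect of k1 and b is easier to see by trying the extreme values. This sketch assumes the document-side factor has the standard form (k1 + 1)·tftd / (k1·((1 − b) + b·Ld/Lave) + tftd). The function name and the default values k1 = 1.2 and b = 0.75 are assumptions, not values taken from these notes.

```python
def tf_component(tf: float, doc_len: float, avg_len: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """Document-side factor of BM25 for one term (assumed standard form)."""
    if tf == 0:
        return 0.0  # the term is absent, so it contributes nothing
    # Length normalization: equals 1 when b = 0 or when doc_len == avg_len.
    norm = (1 - b) + b * (doc_len / avg_len)
    return ((k1 + 1) * tf) / (k1 * norm + tf)


# k1 = 0: term frequency is ignored, any matching document gets 1.
print(tf_component(1, 100, 100, k1=0), tf_component(50, 100, 100, k1=0))

# Very large k1: the factor approaches the raw term frequency.
print(tf_component(5, 100, 100, k1=1e6))

# b = 0: document length has no effect.
print(tf_component(3, 50, 100, b=0), tf_component(3, 500, 100, b=0))

# b = 1: a longer document scores lower for the same term frequency.
print(tf_component(3, 50, 100, b=1), tf_component(3, 500, 100, b=1))
```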
4. Relevance between term and query (keywords) explained
tftq: the weight of term t in the query q.
k3: another positive tuning parameter, which controls how the term frequency tftq in the query is scaled.
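Putting the three parts together, a minimal scoring sketch might look like the following. It assumes the standard factor forms (log10(N/dft) for the weight, and the usual k1/b and k3 ratios for the document and query sides). The function names, the default parameter values, and the sample counts are hypothetical.

```python
import math


def query_component(tf_q: float, k3: float = 1.5) -> float:
    """Query-side factor (k3 + 1) * tf_tq / (k3 + tf_tq) (assumed standard form)."""
    if tf_q == 0:
        return 0.0
    return ((k3 + 1) * tf_q) / (k3 + tf_q)


def bm25_score(query_tf, doc_tf, doc_len, avg_len, n_docs, doc_freq,
               k1=1.2, b=0.75, k3=1.5):
    """Sum idf * document factor * query factor over the query terms."""
    norm = (1 - b) + b * (doc_len / avg_len)  # document length normalization
    score = 0.0
    for term, tf_q in query_tf.items():
        tf_d = doc_tf.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf_d == 0 or df == 0:
            continue  # term missing from the document or from the index
        idf = math.log10(n_docs / df)  # base-10 log is an assumption
        doc_part = ((k1 + 1) * tf_d) / (k1 * norm + tf_d)
        score += idf * doc_part * query_component(tf_q, k3)
    return score


# Hypothetical numbers for the "red apple" query over 1000 documents.
doc_freq = {"red": 800, "apple": 20}
query_tf = {"red": 1, "apple": 1}
doc_a = {"red": 10}              # many "red", no "apple"
doc_b = {"red": 1, "apple": 1}   # both terms once

print(bm25_score(query_tf, doc_a, 100, 100, 1000, doc_freq))
print(bm25_score(query_tf, doc_b, 100, 100, 1000, doc_freq))
```

When each query term occurs once in the query, the query-side factor is exactly 1 for any k3, so k3 only matters for longer queries that repeat terms.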
Here are the results of running BM25 on the AP90 data: