Introduction to search

Search implementation scheme

Traditional implementation scheme

According to the keywords entered by users (java), Application server usage SQL Statement query database , Return the query results to the user .

characteristic : If the amount of data is large , Large number of users , The pressure on the database server increases , Causes the query speed to slow down .

Lucene Implementation scheme

According to the keywords entered by users (java), Application server through Lucene Provided API Query index base , The index library returns the search results to the application server , The server will then return the query results to the user

characteristic : Solve the problem of large amount of data , Large number of users , The business system needs high query speed ( Such as real-time query ).

Data query method

Sequential scanning method

give an example : There are multiple files A,B,C... It is required to find out that the content of the file contains keywords [java] All documents of .

Thinking of sequential scanning method : from A File start scanning search , Rescan B file ... Scan the last file all the time , To get all the inclusion java Content file .

characteristic : More files , It's very slow to find .

Inverted index method ( Reverse index )

give an example : Using Xinhua dictionary to find Chinese characters , First find the radical of Chinese character , Then find the target Chinese character according to the page number corresponding to the radical .

with Lucene Building inverted index for example :

Document 1 ( The number is 1): we like java java java

Document 2 ( The number is 2): we like Lucene Lucene Lucene

term(doc, freq)(pos)
we(1, 1) (2, 1)(0) (0)
like(1, 1) (2, 1)(1) (1)
java(1, 3)(2, 3, 4)
Lucene(2, 3)(2, 3, 4)
explain :

Inverted index is to establish the corresponding relationship between words and documents ( What document does the word appear in , How many times , Where does it appear );

When searching , According to the keywords entered by users , Query directly in the index , Faster .

Application scenarios of search technology

Single machine software search (Office, Eclipse...);

On site search ( JD.COM , TaoBao );

Vertical search ( Limited industry search , as : medical care , education );

Platform search (Google, Baidu , 360, Sogou ).

Basic principles of full text retrieval

What is full text retrieval ?

There are two types of data in our lives : Structured data and unstructured data .

Structured data : Data of fixed format or limited length , Such as database , Metadata, etc .

Unstructured data : Data of indefinite length or format , Such as mail ,word Documents, etc .

Of course, some places will also mention the third one , Semi structured data , as XML,HTML etc. , When needed, it can be processed as structured data , It can also extract pure text and process it as unstructured data .
Unstructured data is also called full-text data .

Classification by data , There are also two types of search :

Search for structured data : Such as the search for the database , use SQL sentence . Another example is the search for metadata , Such as using windows Search for file names , type , Modify the time to search, etc .

Search for unstructured data : Such as using windows You can also search for file content ,Linux Under the grep command , Use it again Google And Baidu can search a lot of content data .

Unstructured data search method

Sequential scanning method (Serial Scanning)

So called sequential scanning , For example, if you want to find a file that contains a string , It's a document by document look , For each document , From the beginning to the end , If this document contains this string , Then this document is the file we are looking for , Then look at the next file , Until all the files are scanned .

Such as using windows You can also search for file content , It's just quite slow . If you have one 80G Hard disk , If you want to find a file that contains a string , It doesn't take him a few hours , I'm afraid it can't be done .

Linux Under the grep Orders are the same way . You may think this method is more primitive , But for files with a small amount of data , This method is still the most direct , The most convenient . But for a lot of files , This method is very slow .

Full text index

Basic ideas of full text retrieval : Extract some information from unstructured data , Reorganize , Make it have a certain structure , Then search the data with certain structure , So as to achieve the purpose of relatively fast search .

This part of the information extracted from unstructured data and then reorganized , We call it the index .

This is indexed first , The process of searching the index is called full-text retrieval (Full-text Search).

Dictionary example

Like dictionaries , The Pinyin Table and radical check list of a dictionary are equivalent to the index of a dictionary , The interpretation of each word is unstructured , If the dictionary does not have syllables and radical checklists , In the vast sea of words to find a word can only scan in sequence .

However, some information of words can be extracted for structural processing , Like pronunciation , It's more structured , It is divided into initials and finals , Only a few can be listed one by one , So the pronunciation is taken out and arranged in a certain order , The number of pages in which each pronunciation points to a detailed explanation of the word .

When searching, the pronunciation is found according to the structured Pinyin , Then press the number of pages it points to , You can find our unstructured data —— That is, the interpretation of words .

General process of full text retrieval

Picture from 《Lucene in action》

Full text retrieval is divided into two processes , Index creation (Indexing)  and   Search index (Search).

Index creation : Extract information from all structured and unstructured data in the real world , The process of creating an index .

Search index : Is to get the user's query request , Search created index , Then return the result of the process .

So there are three important problems in full-text retrieval :

What exactly is in the index ?(Index)

How to create an index ?(Indexing)

How to search the index ?(Search)

What's in the index ?

Why is sequential scanning slow ? It is caused by the inconsistency between the information to be searched and the information stored in unstructured data .

The information stored in unstructured data is which strings are contained in each file , That is, known documents , It's relatively easy to ask for strings , This is the mapping from a file to a string .

The information we want to search for is which files contain this string , That is, known strings , Request document , That is, the mapping from string to file .

Reverse index

The two are just the opposite . So if the index always holds the mapping from string to file , Will greatly improve the search speed .
Because the mapping from string to file is the reverse process of file to string mapping , The index that holds this information is called a reverse index .

Reverse index saved information ( Dictionary - Inverted table )

Suppose my document collection contains 100 Documents , For convenience of expression , We number documents from 1 reach 100, Get the following structure :

On the left is a series of strings , It's called a dictionary .
Each string points to the document containing the string (Document) Linked list , This document linked list is called inverted list (Posting List).
There's an index , So that the information saved is consistent with the information to be searched , Can greatly speed up the search .

Reverse index query example

for instance , We are looking for both containing strings “lucene” Also contains a string “solr” Document for , We just need the following steps :

Extract the containing string “lucene” Document linked list of .

Extract the containing string “solr” Document linked list of .

By merging linked list , Find out “lucene” It also contains “solr” File for .

Advantages and disadvantages of reverse index

shortcoming : Plus the process of creating a new index , Full text retrieval is not necessarily faster than sequential scanning , This is especially true when the amount of data is small . It is also a slow process to index a large amount of data .

advantage : Sequential scanning is scanning every time , The full-text index can be indexed once , Multiple use ; Fast retrieval speed .

How to create an index ?

The index creation process of full-text retrieval generally has the following steps :

Original document to be indexed (Document)

Change the original document (Document) Pass to word breaker (Tokenizer)

Word segmentation component (Tokenizer) Will do the following things ( This process is called Tokenize):

Divide the document into separate words ;

Remove punctuation ;

Remove stop words (Stop word);

The so-called stop word (Stop
word) Are some of the most common words in a language , Because it has no special significance , Therefore, in most cases, it cannot be used as a search keyword , When the index is created , The word is removed, reducing the size of the index .

Supporting words in English (Stop word) as :“the”,“a”,“this” etc .

Word segmentation components for each language (Tokenizer), There is a stop word (stop word) aggregate .

After participle (Tokenizer) The result is called word number (Token).

Will word order (Token) Pass to language processing component (Linguistic Processor)

Language processing component (linguistic processor) It's mainly about the word number we get (Token) Do some language related processing .

For English , Language processing component (Linguistic Processor) Generally do the following :

lower case (Lowercase).

Reduce words to root form , as “cars” reach “car” etc . This operation is called :stemming.

Change a word into a root form , as “drove” reach “drive” etc . This operation is called :lemmatization.

Language processing component (linguistic processor) The result of is called lexical element (Term).

Will word element (Term) Pass to index component (Indexer)

Index component (Indexer) Mainly do the following things :

Use the words you get (Term) Create a dictionary (Term-DocumentID)

Sort dictionaries in alphabetical order .

Merge the same word elements (Term) Make document inverted (Posting List) Linked list .
In this table , There are several definitions :

Document Frequency Document frequency , Indicates how many files in total contain this word (Term).

Frequency That is word frequency , Indicates that several words are included in this file (Term).

The process of index creation and retrieval

Index creation process

Collect raw data ;
Create document object (Document);
Create parser object (Analyzer), Used for participle ;
Create index configuration object (IndexWriterConfig), For configuration Lucene;
Create index library directory location object (Directory), Specifies the storage location of the index library ;
Create index write object (IndexWriter), Write the document object to the index library ;
use IndexWriter object , Create index ;
Releasing resources .

Retrieval process

Create parser object (Analyzer), Used for participle ;
Create query object (Query);
Create index library directory location object (Directory), Specifies the location of the index library ;
Create index read object (IndexReader), Used to read the index ;
Create index search object (IndexSearcher), Used to perform a search ;
use IndexSearcher object , Perform search , Return to search result set TopDocs;
Processing result set ;
Releasing resources .

Kotlin Developer community

Focus on sharing Java, Kotlin,Spring/Spring
Boot,MySQL,redis,neo4j,NoSQL,Android,JavaScript,React,Node, Functional programming , Programming ideas ," High availability , High performance , High real time " Design theme of large distributed system architecture .

High availability, high performance, high real-time large-scale distributed
system architecture design.

Distributed framework :Zookeeper, Distributed middleware framework, etc
Distributed storage :GridFS,FastDFS,TFS,MemCache,redis etc.
Distributed database :Cobar,tddl,Amoeba,Mycat
cloud computing , big data ,AI algorithm
Virtualization , Cloud native technology
Distributed computing framework :MapReduce,Hadoop,Storm,Flink etc.
Distributed communication mechanism :Dubbo,RPC call , Share remote data , Message queue, etc
Message queuing MQ:Kafka,MetaQ,RocketMQ
How to build high availability system : Hardware based , Software middleware , System architecture and other typical solutions :HAProxy, be based on Corosync+Pacemaker Middleware system of high availability cluster suite based on
Mycat Architecture distributed evolution
big data Join Problems behind : data , network , Contradiction and reconciliation between memory and computing power
Java High performance problems in Distributed Systems :AIO,NIO,Netty Or develop your own framework ?
High performance event dispatch mechanism : Thread pool model ,Disruptor Models, etc ...

The wood of embracing , Born in the dust ; Nine story platform , From the base soil ; A journey of a thousand miles , Start with a single step . Step by step , Nothing is as far as a thousand miles ; Non product flow , Nothing makes a river .

©2019-2020 Toolsou All rights reserved,
Send love - A little romance for programmers VHDL—— Design of frequency divider Python Implementation of Hanoi Tower code It's over , Starting salary 30khtml+css+js Make a simple website home page QQ Login interface implementation Hill sorting of sorting algorithm ——c++ realization 【 Wechat applet learning 】 Netease music cloud code page implementation details Resume the 13th session python Blue Bridge Cup 2022 Solution to the 13th Blue Bridge Cup ( whole )