1. What is named entity recognition used for?
In natural language processing, named entity recognition (NER) is a basic building block for applications such as information retrieval, knowledge graphs, machine translation, sentiment analysis, and question answering systems. For example, we use NER to automatically identify the entities in a user query and then link each entity to the corresponding node in a knowledge graph, so recognition accuracy directly affects all of the follow-up work.
2. What are the difficulties of named entity recognition?
* Named entities differ across domains and scenarios. Labeled corpora are usually limited to a few domains and are hard to transfer to others. For example, a model trained on a news corpus and tested on a social media corpus often falls short of expectations, because social media text contains many nonstandard words.
* Annotating data for NER is expensive. Labeled corpora are scarce, so learning a good model from little labeled data, or with the help of corpora from similar tasks and large amounts of unlabeled text, is a new challenge for NER.
* In Chinese, character (字) boundaries are well defined, but word (词) boundaries are fuzzy, so semantic ambiguity is common. For example, the sentence 让人大吃一惊 ("it is startling") has two possible segmentations: 让 / 人大 / 吃一惊 ("let / People's Congress / be startled") and 让人 / 大吃一惊 ("make people / greatly startled"), and the two readings mean completely different things. Chinese NER is usually combined with Chinese word segmentation, shallow parsing, and other processing steps, and the accuracy of segmentation and parsing directly affects NER performance.
* The text to be recognized contains many out-of-vocabulary words, most of them new entity names. New names keep appearing over time, which makes any word list hard to maintain.
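The segmentation ambiguity described in the third point above can be made concrete with a small sketch. It assumes the example sentence is 让人大吃一惊 and uses a toy vocabulary made up for this illustration; the `segmentations` helper is hypothetical, not part of any segmentation library.

```python
def segmentations(text, vocab):
    """Return every way to split `text` into words that all appear in `vocab`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocab:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([word] + rest)
    return results


# Toy vocabulary, made up for this example.
VOCAB = {"让", "让人", "人大", "吃一惊", "大吃一惊"}

for seg in segmentations("让人大吃一惊", VOCAB):
    print(" / ".join(seg))
# 让 / 人大 / 吃一惊
# 让人 / 大吃一惊
```

A dictionary alone cannot choose between the two outputs, which is why segmentation errors tend to propagate into entity recognition (here, wrongly producing the organization name 人大).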
3. Existing research
Summary of named entity recognition methods based on statistical models
4. CRF (Conditional Random Fields)
4.1 Introduction to conditional random fields
Comparison of four models
Given an observation sequence X, a conditional random field defines the probability of a specific tag sequence Y.
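For reference, the standard linear-chain form of this probability is sketched below, on the assumption that this is the definition intended here. The symbols are assumptions not given in the text: t_k are transition feature functions with weights λ_k, s_l are state feature functions with weights μ_l, i ranges over positions in the sequence, and Z(X) is the normalizer obtained by summing over all candidate tag sequences Y'.

```latex
P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{i}\sum_{k} \lambda_k\, t_k(y_{i-1}, y_i, X, i) + \sum_{i}\sum_{l} \mu_l\, s_l(y_i, X, i) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\Big( \sum_{i}\sum_{k} \lambda_k\, t_k(y'_{i-1}, y'_i, X, i) + \sum_{i}\sum_{l} \mu_l\, s_l(y'_i, X, i) \Big)
```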
4.2 Parameter estimation for CRFs
4.3 Prediction
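Prediction with a linear-chain CRF is commonly done with the Viterbi algorithm, which finds the highest-scoring tag sequence by dynamic programming. The text does not say which decoder was used, so the following is only a minimal sketch under that assumption. The tag set and all scores are made up for illustration; in a trained CRF they would come from the weighted state and transition features.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence and its score.

    emissions[i][t]   : score of tag t at position i (state features)
    transitions[a][b] : score of moving from tag a to tag b
    """
    # best[t] = best score of any path that ends in tag t at the current position
    best = {t: emissions[0][t] for t in tags}
    backpointers = []
    for scores in emissions[1:]:
        step, new_best = {}, {}
        for cur in tags:
            # Pick the previous tag that maximizes the score of reaching `cur`.
            prev = max(tags, key=lambda p: best[p] + transitions[p][cur])
            step[cur] = prev
            new_best[cur] = best[prev] + transitions[prev][cur] + scores[cur]
        backpointers.append(step)
        best = new_best
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path, best[last]


# Toy scores for a three-token input (all values are illustrative).
TAGS = ["B-PER", "I-PER", "O"]
EMISSIONS = [
    {"B-PER": 2.0, "I-PER": 0.0, "O": 1.0},
    {"B-PER": 0.5, "I-PER": 1.5, "O": 1.0},
    {"B-PER": 0.0, "I-PER": 0.2, "O": 2.0},
]
TRANSITIONS = {
    "B-PER": {"B-PER": -1.0, "I-PER": 1.0, "O": 0.0},
    "I-PER": {"B-PER": -1.0, "I-PER": 0.5, "O": 0.0},
    "O": {"B-PER": 0.0, "I-PER": -5.0, "O": 0.5},
}

path, score = viterbi(EMISSIONS, TRANSITIONS, TAGS)
print(path, score)  # ['B-PER', 'I-PER', 'O'] 6.5
```

The strongly negative O to I-PER transition score shows how a CRF can discourage invalid tag sequences, something a per-token classifier cannot express.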
5. Experiment
| 1998 People's Daily | #sentence | #PER | #LOC | #ORG |
|---|---|---|---|---|
| train | 46364 | 17615 | 36517 | 20571 |
| test | 4365 | 1973 | 2877 | 1331 |