
Elasticsearch (5): Analyzers

Analysis

An analyzer runs three stages in order (a combined example follows this list):
1. Character filter (mapping): pre-processing before tokenization, e.g. stripping useless characters and tags, or mapping text such as & => and, 《es》 => es.
HTML Strip Character Filter (html_strip): strips HTML tags automatically.
Parameter: escaped_tags — HTML tags that should be kept.

Mapping Character Filter: type mapping

Pattern Replace Character Filter: type pattern_replace

2. Tokenizer: splits the text into tokens.

3. Token filter: stop words, tense normalization, case conversion, synonym conversion, handling of modal particles, and so on. For example: has => have, him => he, apples => apple.
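
All three stages can be exercised together in a single _analyze call without creating an index. The sketch below is only an illustration (the mapping rule and sample text are made up): the character filter rewrites & to and, the standard tokenizer splits the text, and the lowercase token filter normalizes case.

GET _analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["& => and"]   //character filter: runs before tokenization
    }
  ],
  "tokenizer": "standard",       //tokenizer: splits the text into tokens
  "filter": ["lowercase"],       //token filter: runs on the emitted tokens
  "text": "Tom & Jerry"
}

Expected output: tom, and, jerry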

normalization

Normalization refines the tokens produced by the analyzer and improves recall.
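
For example (a minimal sketch, not from the original text), the built-in lowercase and asciifolding token filters are typical normalization steps: they fold case and strip accents so that a query for "deja vu" can still match the indexed text.

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],   //normalization: fold case and strip accents
  "text": "Déjà Vu"
}

Expected output: deja, vu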


character filter

HTML Strip Character Filter

//HTML Strip Character Filter
PUT my_index
{
  "settings": {
    //analysis settings
    "analysis": {
      "char_filter": {
        "my_char_filter": {        //HTML strip character filter that keeps <a> tags
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      //custom analyzer using the character filter
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
//test the analyzer
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <a>happy</a>!</p>"
}

Output:
"token" : """
I'm so <a>happy</a>!
"""

Mapping Character Filter

//Mapping Character Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",     //map the listed characters to their replacements
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}

Output:
"token" : "My license plate is 25015"

Pattern Replace Character Filter

//Pattern Replace Character Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",    //regex replace: turn the - between digits into _
          "pattern": "(\\d+)-(?=\\d)",  //digits followed by a dash, with another digit ahead
          "replacement": "$1_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}

Output tokens:
My, credit, card, is, 123_456_789

token filter

lowercase token filter

//case conversion: lowercase token filter
GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}

Output:
the, quick, fox, jumps
//conditional token filter: apply lowercase only to tokens shorter than 5 characters
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": [ "lowercase" ],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}

Output:
the, QUICK, BROWN, fox

stopwords token filter

In information retrieval, stop words are words that are automatically filtered out while processing text, in order to save storage space and improve search efficiency.

Stop words roughly fall into two categories. The first is function words, which are extremely common but carry little meaning on their own, such as "the", "is", "which", "on". The second is content words such as "want": they are used so widely that a search engine cannot guarantee truly relevant results for them; they do little to narrow the search and also hurt search performance.

In practice, configuring stopwords keeps these words out of the inverted index, which saves index storage space and improves search performance.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"   //enable the English stop word list
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}

Without English stop words:
teacher, ma, is, in, the, restroom

With English stop words:
teacher, ma, restroom
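
The built-in _english_ list can also be replaced with a hand-picked word list. A minimal sketch (the index name and the word list below are made up for illustration):

PUT /my_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": ["is", "in", "the"]   //custom stop word list instead of _english_
        }
      }
    }
  }
}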

tokenizer

As of ES 7.6 there are 15 built-in tokenizers; see the official documentation for the full list.

1. standard analyzer: the default. Chinese support is poor: Chinese text is split character by character.
max_token_length: the maximum token length. Tokens longer than this are split at max_token_length intervals; the default is 255 (see the sketch below).

2. Pattern Tokenizer: splits text into terms on a regex-matched separator.

3. Simple Pattern Tokenizer: matches the terms themselves with a restricted regex; faster than the Pattern Tokenizer.

4. whitespace analyzer: splits on whitespace only, so Tim_cookie stays a single token.
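
The max_token_length parameter can be tried directly through _analyze. A minimal sketch (the value 5 and the sample text are only for illustration):

GET _analyze
{
  "tokenizer": {
    "type": "standard",
    "max_token_length": 5    //tokens longer than 5 characters are split
  },
  "text": "THE QUICK Brown-Foxes jumped"
}

Expected output: THE, QUICK, Brown, Foxes, jumpe, d — "jumped" exceeds five characters, so it is split.
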
//configure a built-in tokenizer
PUT /test_analysis/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {              //custom analyzer name
          "tokenizer": "whitespace"   //tokenizer to use: whitespace, standard, pattern, simple, ...
        }
      }
    }
  }
}
//verify the analyzer
GET /test_analysis/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ooo is bbbb!!"
}

Custom analysis

Setting "type": "custom" tells Elasticsearch that we are defining a custom analyzer. Compare this with how a built-in analyzer is configured: there, type is set to the name of the built-in analyzer, such as standard or simple.

PUT /test_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {              //custom character filter with mapping rules
        "test_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        }
      },
      "filter": {                   //token filter that removes the listed stop words
        "test_stopwords": {
          "type": "stop",
          "stopwords": ["is","in","at","the","a","for"]
        }
      },
      "tokenizer": {                //pattern tokenizer
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"      //split on any character matched by this regex
        }
      },
      "analyzer": {                 //custom analyzer
        "my_analyzer": {
          "type": "custom",         //tells ES this is a custom analyzer
          "char_filter": [          //two character filters
            "html_strip",
            "test_char_filter"
          ],
          "tokenizer": "standard",  //built-in tokenizer
          "filter": ["lowercase","test_stopwords"]
        }
      }
    }
  }
}
GET /test_analysis/_analyze
{
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
  "analyzer": "my_analyzer"
}

Tokens:
teacher, ma, and, zhang, also, thinks, mother's, friends, good, or, nice

IK Chinese Analyzer

IK analyzer download link

IK analyzer installation steps:
1. Download the IK analyzer from GitHub.
2. Unpack the source and build it with maven (mvn package).
3. Take the packaged zip from releases and unpack it into the <es install dir>/plugins/ik/ directory.
4. Restart ES.

If the ES version is newer than the IK analyzer version, ES may fail to start with an error like the following:

Plugin [analysis-ik] was built for Elasticsearch version 7.4.0 but version 7.10.1 is running

In that case, edit the IK analyzer's plugin-descriptor.properties configuration file:

#change this to the ES version you are running
elasticsearch.version=7.10.1
The IK plugin provides two analyzers:
ik_max_word: fine-grained segmentation
ik_smart: coarse-grained segmentation

IK file layout:
IKAnalyzer.cfg.xml: IK analyzer configuration file
Main dictionary: main.dic
English stop words: stopword.dic (these words are not added to the inverted index)
Special dictionaries:
quantifier.dic: measure words and units
suffix.dic: suffixes
surname.dic: Chinese surnames
preposition.dic: modal particles and prepositions

Custom dictionaries: e.g. current slang such as 857, emmm..., 996
Hot updates, either by:
modifying the IK analyzer source code, or
using IK's built-in hot-update mechanism: deploy a web server that exposes an HTTP endpoint and signals dictionary changes through the Last-Modified and ETag HTTP response headers
GET _analyze
{
  "text": "你的衣服真好看",
  "analyzer": "ik_max_word"
}

Output:
你、的、衣服、真好、好看
GET _analyze
{
  "text": "你的衣服真好看",
  "analyzer": "ik_smart"
}

Output:
你、的、衣服、真、好看
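
In practice the two IK analyzers are often combined in a field mapping: ik_max_word at index time (fine-grained, better recall) and ik_smart at search time (coarse-grained, better precision). A minimal sketch, with a made-up index and field name:

PUT /my_ik_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",      //fine-grained analysis when indexing
        "search_analyzer": "ik_smart"   //coarse-grained analysis for queries
      }
    }
  }
}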