Elasticsearch Synonym Queries


Posted by yishuifengxiao on 2022-01-24

1 Using the built-in synonym filter

Elasticsearch ships with a built-in synonym token filter that can be used to implement synonym support.

1.1 Specifying the synonym dictionary

Under the Elasticsearch config directory, create a folder named analysis, and in that folder create a file named synonyms.txt (the synonyms_path setting used below is resolved relative to the config directory; on a cluster the file must be present on every node). The file content is as follows:

裙子,裙
西红柿,番茄
china,中国,中华人民共和国
男生,男士,man
女生,女士,women
全年一次性奖金,年终奖
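
This file uses the default Solr synonym format: each line lists a group of equivalent, comma-separated terms. The same format also supports comments and explicit one-way mappings with =>, for example (illustrative lines taken from the official documentation, not part of the dictionary above):

# Lines starting with # are comments.
# Explicit mapping: terms on the left are replaced by the terms on the right.
i-pod, i pod => ipod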

1.2 Creating the index

Create an index with a custom analyzer that chains the synonym filter after the ik_smart tokenizer (ik_smart is provided by the IK analysis plugin).

The command is as follows:

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "analysis": {
      "filter": {
        "word_sync": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      },
      "analyzer": {
        "ik_sync_smart": {
          "filter": [
            "word_sync"
          ],
          "type": "custom",
          "tokenizer": "ik_smart"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "nickname": {
        "analyzer": "ik_sync_smart",
        "type": "text"
      },
      "username": {
        "analyzer": "ik_sync_smart",
        "type": "text"
      }
    }
  }
}
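
A common variant, not used in the index above, is to apply synonym expansion only at query time: index with plain ik_smart and point search_analyzer at the synonym analyzer. Changed synonyms then affect queries without reindexing documents (the index still has to be reopened for the analyzer to reload the file). A minimal sketch, with a hypothetical index name:

PUT /my_index_query_syn
{
  "settings": {
    "analysis": {
      "filter": {
        "word_sync": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      },
      "analyzer": {
        "ik_sync_smart": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["word_sync"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "username": {
        "type": "text",
        "analyzer": "ik_smart",
        "search_analyzer": "ik_sync_smart"
      }
    }
  }
}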

1.3 Testing the analyzer

Run the following request:

POST /my_index/_analyze
{
  "analyzer": "ik_sync_smart",
  "text": "西红柿"
}

The result is as follows:

{
  "tokens" : [
    {
      "token" : "西红柿",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "番茄",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}

As the output shows, the synonym configuration is working: the original token is kept (type CN_WORD) and 番茄 is added as a SYNONYM token at the same position.

Insert a test document:

PUT /my_index/_doc/1
{
  "username": "西红柿",
  "nickname": "番茄是西红柿的别名"
}

Query the data just inserted:

GET /my_index/_search
{
  "query": {
    "match": {
      "username": "番茄"
    }
  }
}

The query result is as follows:

{
  "took" : 499,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.46029136,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.46029136,
        "_source" : {
          "username" : "西红柿",
          "nickname" : "番茄是西红柿的别名"
        }
      }
    ]
  }
}

The search for 番茄 matched the document whose username is 西红柿, so synonym-based querying works.
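
Because expand defaults to true, an equivalent-synonym group is expanded in both directions, so analyzing 番茄 should likewise yield both tokens:

POST /my_index/_analyze
{
  "analyzer": "ik_sync_smart",
  "text": "番茄"
}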

2 Dynamic updates

The original text is in the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.3/analysis-synonym-tokenfilter.html

The synonym token filter allows synonyms to be handled easily during the analysis process. Synonyms are configured using a configuration file. Here is an example:

PUT /test_index
{
  "settings": {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "synonym" : {
            "tokenizer" : "whitespace",
            "filter" : ["synonym"]
          }
        },
        "filter" : {
          "synonym" : {
            "type" : "synonym",
            "synonyms_path" : "analysis/synonym.txt"
          }
        }
      }
    }
  }
}

The above configures a synonym filter with a path of analysis/synonym.txt (relative to the config location). The synonym analyzer is then configured with the filter.
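
Instead of a file, synonyms can also be listed inline via the synonyms setting (the lenient example further below uses this form as well); a sketch with a hypothetical index name:

PUT /test_index_inline
{
  "settings": {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "synonym" : {
            "tokenizer" : "whitespace",
            "filter" : ["synonym"]
          }
        },
        "filter" : {
          "synonym" : {
            "type" : "synonym",
            "synonyms" : ["i-pod, i pod => ipod"]
          }
        }
      }
    }
  }
}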

This filter tokenizes synonyms with whatever tokenizer and token filters appear before it in the chain.

Additional settings are:

  • expand (defaults to true).
  • lenient (defaults to false). If true, exceptions are ignored while parsing the synonym configuration. Note that only those synonym rules which cannot be parsed are ignored. For instance, consider the following request:
PUT /test_index
{
  "settings": {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "synonym" : {
            "tokenizer" : "standard",
            "filter" : ["my_stop", "synonym"]
          }
        },
        "filter" : {
          "my_stop": {
            "type" : "stop",
            "stopwords": ["bar"]
          },
          "synonym" : {
            "type" : "synonym",
            "lenient": true,
            "synonyms" : ["foo, bar => baz"]
          }
        }
      }
    }
  }
}

With the above request, the word bar gets skipped but the mapping foo => baz is still added. However, if the mapping being added were "foo, baz => bar", nothing would get added to the synonym list, because the target word of the mapping is itself eliminated as a stop word. Similarly, if the mapping were "bar, foo, baz" and expand were set to false, no mapping would get added, since with expand=false the target of the mapping is the first word. With expand=true, however, the mapping added would be equivalent to foo, baz => foo, baz, i.e. all words other than the stop word.

The tokenizer parameter controls the tokenizer that will be used to tokenize the synonyms; it exists for backwards compatibility with indices created before 6.0. The ignore_case parameter works with the tokenizer parameter only.

Two synonym formats are supported: Solr and WordNet.
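
WordNet-formatted synonyms, for example, are declared with the format setting (example adapted from the official documentation; index name hypothetical):

PUT /test_index_wordnet
{
  "settings": {
    "index" : {
      "analysis" : {
        "filter" : {
          "synonym" : {
            "type" : "synonym",
            "format" : "wordnet",
            "synonyms" : [
              "s(100000001,1,'abstain',v,1,0).",
              "s(100000001,2,'refrain',v,1,0).",
              "s(100000001,3,'desist',v,1,0)."
            ]
          }
        }
      }
    }
  }
}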

3 Custom Filter

3.1 Custom TokenFilter

The built-in synonym filter reads its dictionary when the analyzer is created, so dictionary changes require reopening the index. A custom token filter packaged as an Elasticsearch plugin can instead look up synonyms at analysis time, which is what makes dynamic updates possible. The core code is as follows:

package com.yishuifengxiao.plugin.es.filter;

import com.yishuifengxiao.plugin.es.util.SynonymRuleManager;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.cfg.Configuration;

import java.io.IOException;
import java.util.List;

/**
 * Created by ginozhang on 2017/1/12.
 */
public class SynonymTokenFilter extends TokenFilter {

    public static final String TYPE_SYNONYM = "SYNONYM";

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    private final OffsetAttribute offset = addAttribute(OffsetAttribute.class);

    /** Surface form of the token currently being expanded. */
    private String currentInput = null;

    private int startOffset = 0;

    private int endOffset = 0;

    /** Synonyms found for the current token. */
    private List<String> currentWords = null;

    private int currentIndex = 0;

    private Configuration configuration;

    public SynonymTokenFilter(TokenStream input, Configuration configuration) {
        super(input);
        this.configuration = configuration;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (currentInput == null) {
            // Pull the next token from the upstream tokenizer.
            if (!input.incrementToken()) {
                return false;
            }
            currentInput = new String(termAtt.buffer(), 0, termAtt.length());
            startOffset = offset.startOffset();
            endOffset = offset.endOffset();
            // Look up synonyms for the current token.
            currentWords = new SynonymRuleManager().getSynonymWords(currentInput);
            if (currentWords == null || currentWords.isEmpty()) {
                currentInput = null;
                // No synonyms: emit the current token unchanged.
                return true;
            }
            currentIndex = 0;
        }

        if (currentIndex >= currentWords.size()) {
            // All synonyms for this token have been emitted; move on.
            currentInput = null;
            return incrementToken();
        }

        // Emit the next synonym at the same offsets as the original token.
        String newWord = currentWords.get(currentIndex);
        currentIndex++;
        clearAttributes();
        char[] output = newWord.toCharArray();
        termAtt.copyBuffer(output, 0, output.length);
        typeAtt.setType(TYPE_SYNONYM);
        offset.setOffset(startOffset, endOffset);
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        currentInput = null;
        startOffset = 0;
        endOffset = 0;
        currentWords = null;
        currentIndex = 0;
    }
}

The synonym expansion happens in incrementToken(): each token produced by the upstream tokenizer is looked up via SynonymRuleManager, and every synonym found is emitted as an additional token of type SYNONYM with the same offsets as the original.

3.2 Custom AbstractTokenFilterFactory

The core code is as follows:

package com.yishuifengxiao.plugin.es.factory;

import com.yishuifengxiao.plugin.es.filter.SynonymTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.wltea.analyzer.cfg.Configuration;

import java.io.IOException;

public class SynonymTokenFilterFactory extends AbstractTokenFilterFactory {

    private Configuration configuration;

    public SynonymTokenFilterFactory(IndexSettings indexSettings, Environment env,
                                     String name, Settings settings) throws IOException {
        super(indexSettings, name, settings);
        // Store the configuration on the field so create() can use it.
        this.configuration = new Configuration(env, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new SynonymTokenFilter(tokenStream, this.configuration);
    }
}

3.3 Registering the filter

In the plugin's entry class, register the filter factory:

@Override
public Map<String, AnalysisModule.AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
    Map<String, AnalysisModule.AnalysisProvider<TokenFilterFactory>> tokenFilters = new HashMap<>();

    tokenFilters.put("sync_ik", SynonymTokenFilterFactory::new);

    return tokenFilters;
}
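
For reference, the entry class referred to above is the plugin's main class, which extends Plugin and implements AnalysisPlugin. A minimal sketch (class and package names are hypothetical):

package com.yishuifengxiao.plugin.es;

import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.index.analysis.TokenFilterFactory;
import org.elasticsearch.indices.analysis.AnalysisModule;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

import com.yishuifengxiao.plugin.es.factory.SynonymTokenFilterFactory;

/**
 * Plugin entry class (hypothetical name): registers the custom synonym
 * token filter so indices can reference it as a filter of type "sync_ik".
 */
public class SynonymIkPlugin extends Plugin implements AnalysisPlugin {

    @Override
    public Map<String, AnalysisModule.AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
        Map<String, AnalysisModule.AnalysisProvider<TokenFilterFactory>> tokenFilters = new HashMap<>();
        // The key is the name used as the filter "type" in index settings.
        tokenFilters.put("sync_ik", SynonymTokenFilterFactory::new);
        return tokenFilters;
    }
}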

3.4 Creating the custom analyzer

Create an index whose custom analyzer uses the new filter:

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "analysis": {
      "filter": {
        "word_sync": {
          "type": "sync_ik"
        }
      },
      "analyzer": {
        "ik_sync_smart": {
          "filter": [
            "word_sync"
          ],
          "type": "custom",
          "tokenizer": "smart_ik"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "nickname": {
        "analyzer": "ik_sync_smart",
        "type": "text"
      },
      "username": {
        "analyzer": "ik_sync_smart",
        "type": "text"
      }
    }
  }
}

Test the analyzer:

POST /my_index/_analyze
{
  "analyzer": "ik_sync_smart",
  "text": "土豆"
}

The analysis result is:

{
  "tokens" : [
    {
      "token" : "土豆",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "马铃薯",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}
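
As a final end-to-end check, mirroring section 1.3, a document containing 土豆 should now be matched by a search for 马铃薯 (assuming the plugin's dictionary maps the two terms to each other):

PUT /my_index/_doc/1
{
  "username": "土豆"
}

GET /my_index/_search
{
  "query": {
    "match": {
      "username": "马铃薯"
    }
  }
}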