搜索的相关性算分

# 一、相关性

搜索的相关性算分，描述了一个文档和查询语句匹配的程度。ES会对每个匹配查询条件的结果进行算分_score。
打分的本质是排序，需要把最符合用户需求的文档排在前面。ES 5之前，默认相关性算分是用的TF-IDF，现在采用BM 25.

# 二、词频TF

Term Frequency：检索词在一篇文章中出现的频率，相当于是检索词出现的次数处于文档的总数。

而计算一条查询和结果文档相关的简单方法：就是把一句话的每个分词的TF相加，例如：TF（我）+ TF（爱）+ TF（你）。

但是这个存在一个问题，比如“的”这个词在文档出现了很多次，但是对于算分基本上没有什么意义，就不应该考虑他们的存在。

# 三、IDF逆文档频率

DF：检索词在所有文档中出现的频率。
IDF：log（全部文档数/检索词出现过的文档总数）。
TF-IDF本质上就是将TF求和变成了加权求和
- TF（我） * IDF（我）+ TF（爱） * IDF（爱）+ TF（你）* IDF（你）

# 四、案例实践

数据准备

PUT testscore/_bulk
{ "index": { "_id": 1 }}
{ "content":"we use Elasticsearch to power the search" }
{ "index": { "_id": 2 }}
{ "content":"we like elasticsearch" }
{ "index": { "_id": 3 }}
{ "content":"The scoring of documents is caculated by the scoring formula" }
{ "index": { "_id": 4 }}
{ "content":"you know, for search" }

相关性查询

POST /testscore/_search
{
  // 打开算分分析
  //"explain": true,
  "query": {
    "match": {
      //"content":"you"
      "content": "elasticsearch"
      //"content":"the"
      //"content": "the elasticsearch"
    }
  }
}

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.8713851,
    "hits" : [
      {
        "_index" : "testscore",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.8713851,
        "_source" : {
          "content" : "we like elasticsearch"
        }
      },
      {
        "_index" : "testscore",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6489038,
        "_source" : {
          "content" : "we use Elasticsearch to power the search"
        }
      }
    ]
  }
}

we use Elasticsearch to power the search和we like elasticsearch都包含了elasticsearch这个关键词，但是后者整个语句更短，因此elasticsearch的权重更多。

Boosting控制

当boost > 1时，打分的相关度相对性提高。
当0 < boost < 1时，打分的权重相对性降低。
当boost < 0时，贡献负分。

POST testscore/_search
{
    "query": {
        "boosting" : {
            "positive" : {
                "term" : {
                    "content" : "elasticsearch"
                }
            },
            "negative" : {
                 "term" : {
                     "content" : "like"
                }
            },
            "negative_boost" : 0.2
        }
    }
}

# 参考

Elasticsearch 核心技术与实战 (opens new window)

#Elasticsearch

上次更新: 2024/06/29, 15:13:44

← 结构化搜索单字符串多字段查询→