Elasticsearch

从基础到高级,涵盖安装、索引管理、CRUD、查询DSL、聚合、分词、集群、性能优化、Python客户端等核心知识点。以 Elasticsearch 8.x 为基准。


目录

  1. 基础篇

    • 安装与启动

    • 核心概念

    • REST API 基础

    • 索引管理

  2. Mapping 篇

    • 字段类型

    • 动态 Mapping

    • 自定义 Mapping

    • 索引模板

  3. 文档 CRUD 篇

    • 新增文档

    • 查询文档

    • 更新文档

    • 删除文档

    • 批量操作

  4. 查询 DSL 篇

    • 全文查询

    • 精确查询

    • 范围查询

    • 复合查询(bool)

    • 嵌套查询

    • 地理位置查询

  5. 分词篇

    • 内置分析器

    • 中文分词(IK)

    • 自定义分析器

    • 分析 API

  6. 聚合篇

    • Bucket 聚合

    • Metric 聚合

    • Pipeline 聚合

    • 嵌套聚合

  7. 搜索进阶篇

    • 相关性评分

    • 高亮显示

    • 分页与深度分页

    • 排序

    • 字段折叠

    • 搜索建议

  8. 索引优化篇

    • 写入优化

    • 查询优化

    • 索引生命周期(ILM)

    • 冷热分层

  9. 集群篇

    • 集群架构

    • 节点类型

    • 分片与副本

    • 集群监控

  10. Python 客户端篇

    • 安装与连接

    • 索引操作

    • 文档 CRUD

    • 搜索

    • 聚合

    • 批量操作

  11. 运维与监控篇

  12. 部署篇


一、基础篇

1.1 安装与启动

# Docker 启动(推荐开发环境)
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms1g -Xmx1g" \
  elasticsearch:8.12.0

# Docker Compose(ES + Kibana)
# 见部署篇

# 验证
curl http://localhost:9200
curl http://localhost:9200/_cluster/health?pretty

# Ubuntu 安装
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-8.x.list
sudo apt update && sudo apt install elasticsearch
sudo systemctl start elasticsearch

1.2 核心概念

概念

类比关系型数据库

说明

Index(索引)

Database(数据库)

文档的集合

Document(文档)

Row(行)

一条数据,JSON 格式

Field(字段)

Column(列)

文档的属性

Mapping(映射)

Schema(表结构)

字段类型定义

Shard(分片)

-

索引的水平拆分

Replica(副本)

-

分片的备份

Node(节点)

-

一个 ES 实例

Cluster(集群)

-

多个节点的集合

关键概念说明

  • 倒排索引:将词条映射到文档ID,实现全文搜索的核心数据结构

  • 分片(Shard):索引数据的水平分割,默认1个主分片,创建后不可修改

  • 副本(Replica):主分片的备份,提高可用性和读性能,可动态修改数量

  • 段(Segment):Lucene 的基本存储单位,不可变,定期合并


1.3 REST API 基础

# ES 使用 RESTful API,格式:
# METHOD /index/_action
# Content-Type: application/json

# 常用端点
GET  /                           # 集群信息
GET  /_cluster/health            # 集群健康
GET  /_cat/indices?v             # 查看所有索引(表格格式)
GET  /_cat/nodes?v               # 查看所有节点
GET  /_cat/shards?v              # 查看分片分布
GET  /_cat/aliases?v             # 查看别名

# 索引操作
PUT  /my_index                   # 创建索引
GET  /my_index                   # 查看索引信息
DELETE /my_index                 # 删除索引
HEAD /my_index                   # 检查索引是否存在(200/404)

# 文档操作
POST /my_index/_doc              # 新增(自动生成ID)
PUT  /my_index/_doc/1            # 新增/替换(指定ID)
GET  /my_index/_doc/1            # 获取文档
DELETE /my_index/_doc/1          # 删除文档
POST /my_index/_update/1         # 更新文档
POST /my_index/_bulk             # 批量操作
POST /my_index/_search           # 搜索

1.4 索引管理

# 创建索引(指定分片、副本、Mapping)
PUT /articles
{
  "settings": {
    "number_of_shards":   3,
    "number_of_replicas": 1,
    "refresh_interval":   "1s",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type":      "custom",
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title":   { "type": "text", "analyzer": "ik_max_word" },
      "content": { "type": "text", "analyzer": "ik_max_word" },
      "author":  { "type": "keyword" },
      "tags":    { "type": "keyword" },
      "views":   { "type": "integer" },
      "publish_at": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||epoch_millis" }
    }
  }
}

# 修改索引设置(不能修改分片数)
PUT /articles/_settings
{
  "number_of_replicas": 2,
  "refresh_interval": "5s"
}

# 关闭/打开索引(关闭后不可读写,但占用资源极少)
POST /articles/_close
POST /articles/_open

# 索引别名(解耦索引名与应用,支持零停机重建索引)
POST /_aliases
{
  "actions": [
    { "add":    { "index": "articles_v2", "alias": "articles" } },
    { "remove": { "index": "articles_v1", "alias": "articles" } }
  ]
}

# 为别名添加过滤器(虚拟视图)
POST /_aliases
{
  "actions": [{
    "add": {
      "index": "articles",
      "alias": "published_articles",
      "filter": { "term": { "status": "published" } }
    }
  }]
}

# Reindex(重建索引,用于修改 Mapping)
POST /_reindex
{
  "source": { "index": "articles_v1" },
  "dest":   { "index": "articles_v2" }
}

# 按条件 Reindex
POST /_reindex
{
  "source": {
    "index": "articles_v1",
    "query": { "term": { "status": "published" } }
  },
  "dest": { "index": "articles_v2" }
}

二、Mapping 篇

2.1 字段类型

文本类型

类型

说明

text

全文搜索,会分词,不支持排序/聚合

keyword

精确匹配,不分词,支持排序/聚合/过滤

match_only_text

只支持全文搜索,节省存储(8.0+)

数值类型

类型

说明

integer / long

整数

float / double

浮点数

scaled_float

缩放浮点(如价格用 scaling_factor=100)

unsigned_long

无符号长整型

其他类型

类型

说明

boolean

布尔

date

日期

ip

IP 地址

geo_point

地理坐标(经纬度)

geo_shape

地理形状

object

嵌套对象(扁平化存储)

nested

嵌套对象(独立索引,支持独立查询)

flattened

扁平化对象(key 为 keyword)

dense_vector

稠密向量(向量搜索)


2.2 动态 Mapping

# ES 自动推断字段类型规则
# true/false → boolean
# 123       → long
# 1.5       → float
# "2024-01-01" → date(匹配日期格式)
# "hello"   → text + keyword(自动multi-field)

# 查看自动生成的 Mapping
GET /my_index/_mapping

# 动态 Mapping 控制
PUT /my_index
{
  "mappings": {
    "dynamic": "strict"   # true=自动创建(默认), false=忽略新字段, strict=新字段报错
  }
}

# 动态模板(批量定义字段规则)
PUT /my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keyword": {
          "match_mapping_type": "string",
          "mapping": { "type": "keyword" }   # 所有字符串字段默认 keyword
        }
      },
      {
        "long_as_integer": {
          "match_mapping_type": "long",
          "mapping": { "type": "integer" }
        }
      },
      {
        "price_fields": {
          "match": "*_price",               # 字段名匹配
          "mapping": {
            "type": "scaled_float",
            "scaling_factor": 100
          }
        }
      }
    ]
  }
}

2.3 自定义 Mapping

PUT /products
{
  "mappings": {
    "properties": {
      "id":        { "type": "long" },
      "name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
          # name.keyword 可以精确匹配和聚合
        }
      },
      "description": {
        "type":  "text",
        "analyzer": "ik_max_word",
        "index_options": "positions"   # offsets/positions/freqs/docs
      },
      "price": {
        "type":           "scaled_float",
        "scaling_factor": 100
      },
      "category":  { "type": "keyword" },
      "tags":      { "type": "keyword" },
      "stock":     { "type": "integer" },
      "is_active": { "type": "boolean" },
      "created_at": {
        "type":   "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "location": { "type": "geo_point" },
      "images": {
        "type": "object",
        "properties": {
          "url":    { "type": "keyword" },
          "width":  { "type": "integer" },
          "height": { "type": "integer" }
        }
      },
      "specs": {
        "type": "nested",              # 用 nested 才能独立查询数组中的对象
        "properties": {
          "name":  { "type": "keyword" },
          "value": { "type": "keyword" }
        }
      },
      "suggest": {
        "type": "completion"           # 搜索建议字段
      }
    }
  }
}

2.4 索引模板

# 索引模板(新建匹配的索引时自动应用)
PUT /_index_template/logs_template
{
  "index_patterns": ["logs-*", "events-*"],
  "priority": 100,
  "template": {
    "settings": {
      "number_of_shards":   2,
      "number_of_replicas": 1,
      "refresh_interval":   "5s"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level":      { "type": "keyword" },
        "message":    { "type": "text" },
        "service":    { "type": "keyword" },
        "host":       { "type": "keyword" }
      }
    },
    "aliases": {
      "all_logs": {}
    }
  }
}

# 组件模板(可复用的 Mapping 片段)
PUT /_component_template/timestamp_mapping
{
  "template": {
    "mappings": {
      "properties": {
        "created_at": { "type": "date" },
        "updated_at": { "type": "date" }
      }
    }
  }
}

# 在索引模板中引用组件模板
PUT /_index_template/my_template
{
  "index_patterns": ["my_*"],
  "composed_of": ["timestamp_mapping"]
}

三、文档 CRUD 篇

3.1 新增文档

# 自动生成 ID(POST)
POST /articles/_doc
{
  "title":      "Elasticsearch 入门",
  "author":     "Alice",
  "tags":       ["搜索", "大数据"],
  "views":      100,
  "publish_at": "2024-01-01 10:00:00"
}

# 指定 ID(PUT/POST)
PUT /articles/_doc/1
{
  "title":  "Elasticsearch 入门",
  "author": "Alice"
}

# 仅创建,ID 已存在则报错(op_type=create)
PUT /articles/_doc/1?op_type=create
PUT /articles/_create/1
{
  "title": "新文章"
}

3.2 查询文档

# 根据 ID 获取
GET /articles/_doc/1

# 只返回 _source
GET /articles/_source/1

# 指定返回字段
GET /articles/_doc/1?_source=title,author

# 批量获取(mget)
GET /_mget
{
  "docs": [
    { "_index": "articles", "_id": "1" },
    { "_index": "articles", "_id": "2", "_source": ["title"] }
  ]
}

# 同一索引批量获取
GET /articles/_mget
{
  "ids": ["1", "2", "3"]
}

# 检查文档是否存在
HEAD /articles/_doc/1

3.3 更新文档

# 部分更新(update,保留原有字段)
POST /articles/_update/1
{
  "doc": {
    "views": 200,
    "title": "Elasticsearch 入门(更新版)"
  }
}

# 不存在则创建(upsert)
POST /articles/_update/1
{
  "doc": { "views": 200 },
  "upsert": {
    "title":  "默认标题",
    "views":  200,
    "author": "unknown"
  }
}

# 脚本更新(Painless 脚本)
POST /articles/_update/1
{
  "script": {
    "source": "ctx._source.views += params.increment",
    "lang":   "painless",
    "params": { "increment": 1 }
  }
}

# 条件更新(按查询更新)
POST /articles/_update_by_query
{
  "query": { "term": { "author": "Alice" } },
  "script": {
    "source": "ctx._source.verified = true",
    "lang":   "painless"
  }
}

# 完全替换(PUT,整个文档替换)
PUT /articles/_doc/1
{
  "title":  "全新内容(原字段全部丢失)",
  "author": "Bob"
}

3.4 删除文档

# 按 ID 删除
DELETE /articles/_doc/1

# 按查询删除
POST /articles/_delete_by_query
{
  "query": {
    "range": {
      "publish_at": { "lt": "2020-01-01" }
    }
  }
}

# 异步删除(大量数据时)
POST /articles/_delete_by_query?wait_for_completion=false
{
  "query": { "match_all": {} }
}

3.5 批量操作(Bulk)

# bulk API(每两行为一组:操作行 + 数据行)
POST /_bulk
{ "index": { "_index": "articles", "_id": "1" } }
{ "title": "文章1", "author": "Alice" }
{ "index": { "_index": "articles", "_id": "2" } }
{ "title": "文章2", "author": "Bob" }
{ "update": { "_index": "articles", "_id": "1" } }
{ "doc": { "views": 100 } }
{ "delete": { "_index": "articles", "_id": "3" } }

# 同一索引的 bulk
POST /articles/_bulk
{ "index": { "_id": "1" } }
{ "title": "文章1", "author": "Alice" }
{ "create": { "_id": "2" } }
{ "title": "文章2" }
{ "update": { "_id": "1" } }
{ "doc": { "views": 100 } }
{ "delete": { "_id": "3" } }

建议每批 5~15MB,5000~10000 条,过大会占用大量内存。


四、查询 DSL 篇

4.1 全文查询

# match(标准全文查询,会分词)
GET /articles/_search
{
  "query": {
    "match": {
      "title": "Elasticsearch 搜索"
      # 默认 OR,匹配任意词
    }
  }
}

# match(AND 模式,全部词必须匹配)
{
  "query": {
    "match": {
      "title": {
        "query":    "Elasticsearch 搜索",
        "operator": "and"
      }
    }
  }
}

# match_phrase(短语匹配,词序一致,位置相邻)
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "全文搜索引擎",
        "slop":  1    # 允许词间距离
      }
    }
  }
}

# match_phrase_prefix(前缀短语匹配,搜索补全)
{
  "query": {
    "match_phrase_prefix": {
      "title": "Elastic"
    }
  }
}

# multi_match(多字段匹配)
{
  "query": {
    "multi_match": {
      "query":  "Elasticsearch 教程",
      "fields": ["title^3", "content", "tags"],  # ^3 表示 title 权重×3
      "type":   "best_fields"
      # best_fields:  取最高分字段的分数(默认)
      # most_fields:  所有字段分数相加
      # cross_fields: 跨字段匹配(适合姓名等)
      # phrase:       短语匹配
    }
  }
}

# query_string(支持 Lucene 语法)
{
  "query": {
    "query_string": {
      "query":          "title:(Elasticsearch AND 入门) AND author:Alice",
      "default_field":  "content"
    }
  }
}

# simple_query_string(用户输入,容错性强)
{
  "query": {
    "simple_query_string": {
      "query":   "Elasticsearch +入门 -过时",
      "fields":  ["title", "content"],
      "default_operator": "AND"
    }
  }
}

4.2 精确查询

# term(精确匹配,不分词,用于 keyword/数字/布尔)
{
  "query": {
    "term": {
      "author": { "value": "Alice" }
    }
  }
}

# terms(IN 查询)
{
  "query": {
    "terms": {
      "tags": ["搜索", "大数据", "Python"]
    }
  }
}

# ids(按 ID 批量查询)
{
  "query": {
    "ids": { "values": ["1", "2", "3"] }
  }
}

# exists(字段存在)
{
  "query": {
    "exists": { "field": "tags" }
  }
}

# prefix(前缀匹配,keyword 字段)
{
  "query": {
    "prefix": {
      "title.keyword": { "value": "Elastic" }
    }
  }
}

# wildcard(通配符,* 任意多字符,? 单个字符)
{
  "query": {
    "wildcard": {
      "title.keyword": { "value": "Elastic*" }
    }
  }
}

# regexp(正则匹配,性能差慎用)
{
  "query": {
    "regexp": {
      "email": { "value": ".*@gmail\\.com" }
    }
  }
}

# fuzzy(模糊匹配,处理拼写错误)
{
  "query": {
    "fuzzy": {
      "title": {
        "value":       "Elasticsearh",   # 拼写错误
        "fuzziness":   "AUTO",           # 自动(0/1/2)
        "prefix_length": 2               # 前N位不模糊
      }
    }
  }
}

4.3 范围查询

# range(范围查询)
{
  "query": {
    "range": {
      "views": {
        "gte": 100,
        "lte": 10000
      }
    }
  }
}

# 日期范围
{
  "query": {
    "range": {
      "publish_at": {
        "gte": "2024-01-01",
        "lt":  "2025-01-01",
        "format": "yyyy-MM-dd",
        "time_zone": "+08:00"
      }
    }
  }
}

# 相对日期
{
  "query": {
    "range": {
      "publish_at": {
        "gte": "now-7d/d",    # 7天前,取整到天
        "lte": "now/d"        # 今天
      }
    }
  }
}

4.4 复合查询(bool)

# bool 查询(最常用的复合查询)
{
  "query": {
    "bool": {
      "must": [                           # 必须匹配(影响相关性分数)
        { "match": { "title": "Elasticsearch" } },
        { "term":  { "status": "published" } }
      ],
      "must_not": [                       # 必须不匹配
        { "term": { "author": "spam_user" } }
      ],
      "should": [                         # 可选匹配(匹配则分数更高)
        { "term": { "tags": "推荐" } },
        { "range": { "views": { "gte": 1000 } } }
      ],
      "minimum_should_match": 1,          # should 至少匹配1个
      "filter": [                         # 过滤(不影响分数,有缓存)
        { "term":  { "is_active": true } },
        { "range": { "views": { "gte": 10 } } }
      ]
    }
  }
}

must / filter / should / must_not 对比

子句

影响相关性分数

是否必须匹配

是否缓存

must

filter

✅(推荐用于过滤条件)

should

must_not

必须不匹配


4.5 嵌套查询(nested)

# 查询 nested 类型的字段(必须用 nested query)
{
  "query": {
    "nested": {
      "path":  "specs",
      "query": {
        "bool": {
          "must": [
            { "term": { "specs.name":  "颜色" } },
            { "term": { "specs.value": "红色" } }
          ]
        }
      },
      "score_mode": "avg"    # avg / max / sum / none
    }
  }
}

4.6 地理位置查询

# 圆形范围查询
{
  "query": {
    "geo_distance": {
      "distance":  "5km",
      "location": {
        "lat": 39.90,
        "lon": 116.40
      }
    }
  }
}

# 矩形范围查询
{
  "query": {
    "geo_bounding_box": {
      "location": {
        "top_left":     { "lat": 40.0, "lon": 116.0 },
        "bottom_right": { "lat": 39.5, "lon": 117.0 }
      }
    }
  }
}

# 距离排序
{
  "sort": [{
    "_geo_distance": {
      "location": { "lat": 39.90, "lon": 116.40 },
      "order":    "asc",
      "unit":     "km"
    }
  }]
}

五、分词篇

5.1 内置分析器

分析器

说明

standard

默认,按词边界分词,小写,适合英文

simple

按非字母字符分词,小写

whitespace

按空格分词,不小写

stop

standard + 停用词过滤

keyword

不分词,整体作为一个词条

pattern

正则分词

language

语言特定(english/french 等)

fingerprint

去重排序后合并

# 测试分析器
GET /_analyze
{
  "analyzer": "standard",
  "text":     "Elasticsearch 全文搜索引擎"
}

# 测试指定索引的分析器
GET /articles/_analyze
{
  "field": "title",
  "text":  "Elasticsearch 全文搜索"
}

5.2 中文分词(IK)

# 安装 IK 分词插件(版本需与 ES 一致)
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.12.0/elasticsearch-analysis-ik-8.12.0.zip
# 或 Docker 内安装
docker exec -it elasticsearch ./bin/elasticsearch-plugin install analysis-ik

# 重启 ES 后生效

# IK 两种模式
# ik_max_word:细粒度分词(索引时用,最多词条)
# ik_smart:粗粒度分词(搜索时用,语义更准确)
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text":     "中华人民共和国国歌"
}
# → [中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国歌]

GET /_analyze
{
  "analyzer": "ik_smart",
  "text":     "中华人民共和国国歌"
}
# → [中华人民共和国, 国歌]

# 自定义词典(热更新)
# elasticsearch/config/analysis-ik/IKAnalyzer.cfg.xml
# <entry key="remote_ext_dict">http://yourserver/dict.txt</entry>

5.3 自定义分析器

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_strip": { "type": "html_strip" },                    # 去除 HTML 标签
        "replace_and": {
          "type":        "mapping",
          "mappings":    ["& => and", "| => or"]
        }
      },
      "tokenizer": {
        "comma_tokenizer": {
          "type":    "pattern",
          "pattern": ","                                           # 按逗号分词
        }
      },
      "filter": {
        "my_stop": {
          "type":      "stop",
          "stopwords": ["的", "了", "是", "在", "我", "有", "和"]  # 停用词
        },
        "my_synonym": {
          "type":     "synonym",
          "synonyms": ["手机,mobile => 手机", "电脑,PC,计算机"]   # 同义词
        },
        "edge_ngram_filter": {
          "type":     "edge_ngram",
          "min_gram": 1,
          "max_gram": 20                                           # 前缀搜索补全
        }
      },
      "analyzer": {
        "my_cn_analyzer": {
          "type":        "custom",
          "char_filter": ["html_strip"],
          "tokenizer":   "ik_max_word",
          "filter":      ["lowercase", "my_stop", "my_synonym"]
        },
        "autocomplete_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter":    ["lowercase", "edge_ngram_filter"]
        }
      },
      "normalizer": {                    # keyword 字段的 normalizer(类似 analyzer 但不分词)
        "lowercase_normalizer": {
          "type":   "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

六、聚合篇

6.1 Bucket 聚合(分组)

# terms(分组计数,类似 GROUP BY)
GET /articles/_search
{
  "size": 0,                      # 不返回文档,只返回聚合结果
  "aggs": {
    "by_author": {
      "terms": {
        "field": "author",
        "size":  10,              # 返回前10个桶
        "order": { "_count": "desc" }
      }
    }
  }
}

# date_histogram(按时间分组)
{
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field":             "publish_at",
        "calendar_interval": "month",     # month/week/day/hour/minute
        "format":            "yyyy-MM",
        "min_doc_count":     0,           # 空桶也返回
        "extended_bounds": {
          "min": "2024-01-01",
          "max": "2024-12-31"
        }
      }
    }
  }
}

# histogram(数值直方图)
{
  "aggs": {
    "price_ranges": {
      "histogram": {
        "field":    "price",
        "interval": 100,
        "min_doc_count": 0
      }
    }
  }
}

# range(自定义范围分组)
{
  "aggs": {
    "price_range": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 500 },
          { "from": 500, "key": "high_end" }
        ]
      }
    }
  }
}

# filter(过滤聚合)
{
  "aggs": {
    "recent_articles": {
      "filter": {
        "range": { "publish_at": { "gte": "now-30d" } }
      },
      "aggs": {
        "by_author": {
          "terms": { "field": "author" }
        }
      }
    }
  }
}

6.2 Metric 聚合(统计)

{
  "aggs": {
    "avg_views":  { "avg":   { "field": "views" } },
    "max_views":  { "max":   { "field": "views" } },
    "min_views":  { "min":   { "field": "views" } },
    "sum_views":  { "sum":   { "field": "views" } },
    "total_docs": { "value_count": { "field": "views" } },

    # stats:一次性获取 count/min/max/avg/sum
    "views_stats": { "stats": { "field": "views" } },

    # extended_stats:含方差/标准差
    "views_ext": { "extended_stats": { "field": "views" } },

    # percentiles:百分位数
    "views_percentiles": {
      "percentiles": {
        "field":    "views",
        "percents": [50, 75, 90, 95, 99]
      }
    },

    # cardinality:基数(近似去重计数)
    "unique_authors": {
      "cardinality": {
        "field":               "author",
        "precision_threshold": 100    # 精度,越大越准但内存越多
      }
    },

    # top_hits:每个分组的 top N 文档
    "top_articles": {
      "top_hits": {
        "size": 3,
        "_source": ["title", "author"],
        "sort": [{ "views": { "order": "desc" } }]
      }
    }
  }
}

6.3 嵌套聚合

# 先分组,再统计(最常用模式)
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category", "size": 10 },
      "aggs": {
        "avg_price":    { "avg": { "field": "price" } },
        "max_price":    { "max": { "field": "price" } },
        "top_products": {
          "top_hits": {
            "size": 3,
            "_source": ["name", "price"],
            "sort": [{ "price": { "order": "desc" } }]
          }
        }
      }
    }
  }
}

# 先过滤再聚合(query + aggs)
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "is_active": true } },
        { "range": { "publish_at": { "gte": "2024-01-01" } } }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "by_author": {
      "terms": { "field": "author", "size": 5 }
    }
  }
}

6.4 Pipeline 聚合

# 对聚合结果再聚合
{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field":             "order_date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_amount": { "sum": { "field": "amount" } },
        # 计算环比增长率
        "sales_growth": {
          "derivative": {
            "buckets_path": "total_amount"
          }
        }
      }
    },
    # 所有月份中的最大销售额
    "best_month": {
      "max_bucket": {
        "buckets_path": "monthly_sales>total_amount"
      }
    },
    # 移动平均
    "avg_sales": {
      "moving_avg": {
        "buckets_path": "monthly_sales>total_amount",
        "window":       3
      }
    }
  }
}

七、搜索进阶篇

7.1 相关性评分

# 自定义评分(function_score)
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "Elasticsearch" } },
      "functions": [
        {
          # 按字段值加权
          "field_value_factor": {
            "field":    "views",
            "factor":   0.1,
            "modifier": "log1p",    # log1p(views * 0.1)
            "missing":  1
          }
        },
        {
          # 时间衰减(越新分越高)
          "gauss": {
            "publish_at": {
              "origin": "now",
              "scale":  "30d",
              "offset": "7d",
              "decay":  0.5
            }
          }
        },
        {
          # 固定加分
          "filter": { "term": { "is_featured": true } },
          "weight": 5
        }
      ],
      "score_mode":  "sum",    # 多个 function 得分如何合并:sum/avg/max/min/multiply
      "boost_mode":  "sum"     # function 得分与 query 得分如何合并
    }
  }
}

# 固定分数(constant_score,不计算相关性)
{
  "query": {
    "constant_score": {
      "filter": { "term": { "status": "published" } },
      "boost":  1.0
    }
  }
}

7.2 高亮显示

{
  "query": { "match": { "content": "Elasticsearch 搜索" } },
  "highlight": {
    "pre_tags":  ["<em>"],
    "post_tags": ["</em>"],
    "fields": {
      "title":   { "number_of_fragments": 0 },      # 0=返回完整字段
      "content": {
        "fragment_size":       150,                  # 片段大小(字符)
        "number_of_fragments": 3,                    # 返回片段数
        "order":               "score"               # 按相关性排序
      }
    },
    "require_field_match": false    # false=所有字段都高亮(即使不是查询字段)
  }
}

7.3 分页与深度分页

# 普通分页(from + size,最多 10000 条)
{
  "from": 0,
  "size": 20,
  "query": { "match_all": {} }
}

# 深度分页方案1:search_after(游标分页,推荐)
# 第一页
{
  "size": 20,
  "sort": [
    { "publish_at": "desc" },
    { "_id": "asc" }           # 必须包含唯一字段保证稳定性
  ],
  "query": { "match_all": {} }
}
# 取最后一条记录的 sort 值,作为下一页的 search_after
{
  "size": 20,
  "sort": [{ "publish_at": "desc" }, { "_id": "asc" }],
  "search_after": ["2024-01-15T10:00:00", "abc123"]
}

# 深度分页方案2:scroll(导出大量数据,不适合实时搜索)
# 初始化 scroll
POST /articles/_search?scroll=1m      # 保持1分钟
{
  "size": 1000,
  "query": { "match_all": {} },
  "sort":  ["_doc"]   # 按磁盘顺序,最快
}
# 继续滚动(使用返回的 _scroll_id)
POST /_search/scroll
{
  "scroll":    "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2..."
}
# 清除 scroll
DELETE /_search/scroll
{
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2..."
}

# 修改最大 from+size 限制(不推荐)
PUT /articles/_settings
{
  "max_result_window": 50000
}

7.4 搜索建议(Suggest)

# term suggest(拼写纠错)
{
  "suggest": {
    "title_suggest": {
      "text":  "Elasticsearh",    # 输入(含拼写错误)
      "term":  {
        "field":          "title",
        "suggest_mode":   "popular",    # missing/popular/always
        "max_edits":      2,
        "sort":           "frequency"
      }
    }
  }
}

# phrase suggest(短语纠错)
{
  "suggest": {
    "phrase_suggest": {
      "text": "Elasticsearch 全文搜索教程",
      "phrase": {
        "field":      "title",
        "gram_size":  3,
        "highlight": {
          "pre_tag":  "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}

# completion suggest(自动补全,最快,需 completion 类型字段)
{
  "suggest": {
    "title_autocomplete": {
      "prefix": "Ela",
      "completion": {
        "field":    "suggest",
        "size":     10,
        "skip_duplicates": true,
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}

八、索引优化篇

8.1 写入优化

# 批量写入(bulk)
# 建议单批 5~15MB,5000~10000 条

# 临时禁用副本(写入期间)
PUT /my_index/_settings
{ "number_of_replicas": 0 }
# 写入完成后恢复
PUT /my_index/_settings
{ "number_of_replicas": 1 }

# 延长刷新间隔(减少 Segment 生成频率)
PUT /my_index/_settings
{ "refresh_interval": "30s" }

# 初始化大量数据时
PUT /my_index/_settings
{
  "refresh_interval": "-1",      # 禁用自动刷新
  "number_of_replicas": 0
}
# 导入完成后
POST /my_index/_refresh
PUT /my_index/_settings
{
  "refresh_interval": "1s",
  "number_of_replicas": 1
}

# translog 配置(异步写,性能好但可能丢失)
PUT /my_index/_settings
{
  "translog": {
    "durability":       "async",
    "sync_interval":    "30s",
    "flush_threshold_size": "512mb"
  }
}

8.2 查询优化

# 1. filter 代替 query(可缓存,无评分计算)
# 有相关性排序需求 → must/should
# 只做过滤 → filter/must_not

# 2. 避免 wildcard/regexp(性能差)
# 改用 ngram tokenizer 或 prefix 查询

# 3. 避免深度 from 翻页,改用 search_after

# 4. 只返回需要的字段
{
  "_source": ["title", "author", "publish_at"]
}

# 5. 使用 routing 路由到特定分片
PUT /articles/_doc/1?routing=alice
GET /articles/_search?routing=alice
{
  "query": { "term": { "author": "alice" } }
}

# 6. 强制合并(只对只读索引使用,合并小 Segment)
POST /articles/_forcemerge?max_num_segments=1

# 7. 查看查询计划
GET /articles/_search
{
  "explain": true,
  "query": { "match": { "title": "Elasticsearch" } }
}

# 8. Profile API(分析查询性能)
GET /articles/_search
{
  "profile": true,
  "query": { "match": { "title": "Elasticsearch" } }
}

8.3 索引生命周期(ILM)

# 定义 ILM 策略(适合日志、时序数据)
PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size":  "50gb",
            "max_age":   "7d",
            "max_docs":  1000000
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink":       { "number_of_shards": 1 },
          "forcemerge":   { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "freeze":       {},
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

# 在索引模板中应用 ILM
PUT /_index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name":       "logs_policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}

九、集群篇

9.1 节点类型

节点角色

配置

说明

Master

node.roles: [master]

集群管理(选举、节点加入、分片分配)

Data

node.roles: [data]

存储数据、执行查询

Data Hot

node.roles: [data_hot]

热数据节点(高性能 SSD)

Data Warm

node.roles: [data_warm]

温数据节点(普通磁盘)

Data Cold

node.roles: [data_cold]

冷数据节点(对象存储)

Coordinating

node.roles: []

协调节点(路由请求,不存数据)

Ingest

node.roles: [ingest]

数据预处理

ML

node.roles: [ml]

机器学习


9.2 分片策略

# 分片数量建议
# - 单分片大小控制在 10~50GB
# - 单节点分片数不超过 20个/GB 堆内存
# - 节点数 × 20 = 最大分片数

# 查看分片分布
GET /_cat/shards/my_index?v

# 分片分配控制
PUT /my_index/_settings
{
  "index.routing.allocation.require.box_type": "hot",   # 只在 hot 节点
  "index.routing.allocation.exclude._name":   "node-1", # 排除 node-1
  "index.number_of_replicas": 1
}

# 手动迁移分片
POST /_cluster/reroute
{
  "commands": [{
    "move": {
      "index": "my_index",
      "shard": 0,
      "from_node": "node-1",
      "to_node":   "node-2"
    }
  }]
}

9.3 集群监控

# 集群健康(green/yellow/red)
GET /_cluster/health?pretty
GET /_cluster/health/my_index

# 集群统计
GET /_cluster/stats?pretty

# 节点统计
GET /_nodes/stats?pretty
GET /_nodes/stats/jvm,os,process

# 索引统计
GET /my_index/_stats?pretty
GET /my_index/_stats/indexing,search

# 慢查询日志
PUT /my_index/_settings
{
  "index.search.slowlog.threshold.query.warn":  "10s",
  "index.search.slowlog.threshold.query.info":  "5s",
  "index.search.slowlog.threshold.fetch.warn":  "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s"
}

# 热点线程(查看 CPU 热点)
GET /_nodes/hot_threads

# 任务管理
GET /_tasks
GET /_tasks?actions=*search&detailed=true
POST /_tasks/<task_id>/_cancel

十、Python 客户端篇

10.1 安装与连接

pip install elasticsearch

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk, scan

# 连接(ES 8.x)
es = Elasticsearch(
    hosts=['http://localhost:9200'],
    # 如果开启了安全认证
    # http_auth=('elastic', 'password'),
    # scheme='https',
    # verify_certs=False,
    request_timeout=30,
    max_retries=3,
    retry_on_timeout=True,
)

# 验证连接
print(es.ping())
print(es.info())

10.2 索引操作

# 创建索引
es.indices.create(
    index='articles',
    body={
        'settings': {
            'number_of_shards':   3,
            'number_of_replicas': 1,
            'analysis': {
                'analyzer': {
                    'ik_analyzer': {'type': 'custom', 'tokenizer': 'ik_max_word'}
                }
            }
        },
        'mappings': {
            'properties': {
                'title':      {'type': 'text', 'analyzer': 'ik_max_word'},
                'author':     {'type': 'keyword'},
                'content':    {'type': 'text', 'analyzer': 'ik_max_word'},
                'tags':       {'type': 'keyword'},
                'views':      {'type': 'integer'},
                'publish_at': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss||epoch_millis'},
            }
        }
    }
)

# 检查索引是否存在
es.indices.exists(index='articles')

# 删除索引
es.indices.delete(index='articles', ignore=[400, 404])

# 更新 Mapping
es.indices.put_mapping(index='articles', body={
    'properties': {
        'new_field': {'type': 'keyword'}
    }
})

# 刷新索引(使写入立即可查)
es.indices.refresh(index='articles')

10.3 文档 CRUD

from datetime import datetime

# 新增文档
es.index(
    index='articles',
    id=1,
    document={
        'title':      'Elasticsearch Python 教程',
        'author':     'Alice',
        'content':    '本文介绍如何使用 Python 操作 Elasticsearch',
        'tags':       ['Python', 'Elasticsearch'],
        'views':      0,
        'publish_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    }
)

# 获取文档
doc = es.get(index='articles', id=1)
print(doc['_source'])

# 检查文档是否存在
es.exists(index='articles', id=1)

# 更新文档(部分更新)
es.update(
    index='articles',
    id=1,
    doc={'views': 100, 'title': '更新后的标题'}
)

# 脚本更新
es.update(
    index='articles',
    id=1,
    script={
        'source': 'ctx._source.views += params.increment',
        'lang':   'painless',
        'params': {'increment': 1}
    }
)

# 删除文档
es.delete(index='articles', id=1)

# 按查询删除
es.delete_by_query(
    index='articles',
    body={
        'query': {'term': {'author': 'spam_user'}}
    }
)

10.4 搜索

# 基础搜索
result = es.search(
    index='articles',
    body={
        'query': {
            'bool': {
                'must':   [{'match': {'title': 'Elasticsearch'}}],
                'filter': [{'term': {'author': 'Alice'}}]
            }
        },
        'highlight': {
            'fields': {'title': {}, 'content': {'fragment_size': 150}}
        },
        'sort': [{'views': 'desc'}, {'_score': 'desc'}],
        'from': 0,
        'size': 20,
        '_source': ['title', 'author', 'views', 'publish_at']
    }
)

# 处理结果
total = result['hits']['total']['value']
hits  = result['hits']['hits']
for hit in hits:
    doc = hit['_source']
    score = hit['_score']
    highlight = hit.get('highlight', {})
    print(f"[{score:.2f}] {doc['title']} - {highlight.get('title', [])}")

print(f"总数: {total}")


# 使用 search_after 翻页
def paginate_with_search_after(es, index, query, size=20):
    search_after = None
    while True:
        body = {
            'query': query,
            'sort':  [{'publish_at': 'desc'}, {'_id': 'asc'}],
            'size':  size,
        }
        if search_after:
            body['search_after'] = search_after

        result = es.search(index=index, body=body)
        hits = result['hits']['hits']
        if not hits:
            break

        yield [h['_source'] for h in hits]
        search_after = hits[-1]['sort']

10.5 聚合

result = es.search(
    index='articles',
    body={
        'size': 0,
        'aggs': {
            'by_author': {
                'terms': {'field': 'author', 'size': 10},
                'aggs': {
                    'avg_views': {'avg': {'field': 'views'}},
                    'top_articles': {
                        'top_hits': {
                            'size': 3,
                            '_source': ['title', 'views'],
                            'sort': [{'views': 'desc'}]
                        }
                    }
                }
            },
            'monthly': {
                'date_histogram': {
                    'field':             'publish_at',
                    'calendar_interval': 'month',
                    'format':            'yyyy-MM'
                }
            }
        }
    }
)

# 处理聚合结果
for bucket in result['aggregations']['by_author']['buckets']:
    author    = bucket['key']
    count     = bucket['doc_count']
    avg_views = bucket['avg_views']['value']
    print(f'{author}: {count} 篇文章,平均阅读 {avg_views:.0f}')

10.6 批量操作

from elasticsearch.helpers import bulk, streaming_bulk, parallel_bulk

# 方式1:bulk(推荐,一次性)
def generate_actions(data_list):
    for item in data_list:
        yield {
            '_index': 'articles',
            '_id':    item['id'],
            '_source': {
                'title':   item['title'],
                'author':  item['author'],
                'content': item['content'],
            }
        }

success, failed = bulk(
    es,
    generate_actions(data_list),
    chunk_size=500,
    request_timeout=60
)
print(f'成功: {success}, 失败: {len(failed)}')


# 方式2:streaming_bulk(流式,节省内存)
for ok, result in streaming_bulk(
    es,
    generate_actions(data_list),
    chunk_size=500,
    raise_on_error=False
):
    if not ok:
        print(f'写入失败: {result}')


# 方式3:parallel_bulk(并行,最快)
from elasticsearch.helpers import parallel_bulk
from collections import deque

deque(
    parallel_bulk(es, generate_actions(data_list), thread_count=4),
    maxlen=0
)


# 批量扫描(大量读取)
from elasticsearch.helpers import scan

for doc in scan(
    es,
    index='articles',
    query={'query': {'match_all': {}}},
    scroll='5m',
    size=500
):
    process(doc['_source'])

十一、运维与监控篇

11.1 常用运维命令

# 集群健康检查
GET /_cluster/health?wait_for_status=green&timeout=30s

# 查看未分配分片原因
GET /_cluster/allocation/explain?pretty

# 手动触发分片分配
POST /_cluster/reroute?retry_failed=true

# 清除缓存
POST /my_index/_cache/clear
POST /_cache/clear

# 刷新(使内存数据持久化,但慎用,影响性能)
POST /my_index/_flush

# 强制合并(减少 Segment,提高查询速度)
POST /my_index/_forcemerge?max_num_segments=1

# 查看索引磁盘占用
GET /_cat/indices?v&s=store.size:desc

# 快照(备份)
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": { "location": "/backup/elasticsearch" }
}

POST /_snapshot/my_backup/snapshot_1?wait_for_completion=true
{
  "indices": "articles,products",
  "ignore_unavailable": true
}

# 恢复快照
POST /_snapshot/my_backup/snapshot_1/_restore
{
  "indices": "articles",
  "rename_pattern":     "(.+)",
  "rename_replacement": "restored_$1"
}

11.2 JVM 调优

# jvm.options
# 堆内存设置(不超过系统内存50%,最大不超过32GB)
-Xms4g
-Xmx4g

# GC 配置(ES 8.x 默认 G1GC)
-XX:+UseG1GC
-XX:G1HeapRegionSize=4m
-XX:InitiatingHeapOccupancyPercent=30
# elasticsearch.yml
cluster.name:  my-cluster
node.name:     node-1
path.data:     /var/data/elasticsearch
path.logs:     /var/log/elasticsearch
network.host:  0.0.0.0
http.port:     9200
transport.port: 9300

# 集群发现
discovery.seed_hosts:       ["node-1", "node-2", "node-3"]
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]

# 内存锁定(防止 Swap)
bootstrap.memory_lock: true

十二、部署篇

12.1 Docker Compose(ES + Kibana)

# docker-compose.yml
version: '3.8'

services:
  elasticsearch:
    image: elasticsearch:8.12.0
    environment:
      - cluster.name=my-cluster
      - node.name=es-node-1
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
      - "9300:9300"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9200/_cluster/health"]
      interval: 30s
      timeout:  10s
      retries:  5

  kibana:
    image: kibana:8.12.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy

volumes:
  es_data:

12.2 三节点集群

# docker-compose-cluster.yml
version: '3.8'

services:
  es01:
    image: elasticsearch:8.12.0
    environment:
      - node.name=es01
      - cluster.name=my-cluster
      - cluster.initial_master_nodes=es01,es02,es03
      - discovery.seed_hosts=es02,es03
      - node.roles=master,data
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
      - xpack.security.enabled=false
    volumes:
      - es01_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"

  es02:
    image: elasticsearch:8.12.0
    environment:
      - node.name=es02
      - cluster.name=my-cluster
      - cluster.initial_master_nodes=es01,es02,es03
      - discovery.seed_hosts=es01,es03
      - node.roles=master,data
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
      - xpack.security.enabled=false
    volumes:
      - es02_data:/usr/share/elasticsearch/data

  es03:
    image: elasticsearch:8.12.0
    environment:
      - node.name=es03
      - cluster.name=my-cluster
      - cluster.initial_master_nodes=es01,es02,es03
      - discovery.seed_hosts=es01,es02
      - node.roles=master,data
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
      - xpack.security.enabled=false
    volumes:
      - es03_data:/usr/share/elasticsearch/data

  kibana:
    image: kibana:8.12.0
    environment:
      - ELASTICSEARCH_HOSTS=http://es01:9200
    ports:
      - "5601:5601"
    depends_on:
      - es01

volumes:
  es01_data:
  es02_data:
  es03_data:

常用 DSL 速查

# 查询所有
{ "query": { "match_all": {} } }

# 按关键词搜索
{ "query": { "match": { "title": "关键词" } } }

# 精确匹配
{ "query": { "term": { "author": "Alice" } } }

# 多条件 AND
{ "query": { "bool": { "must": [ {...}, {...} ] } } }

# 过滤(无评分)
{ "query": { "bool": { "filter": [ {...} ] } } }

# 范围
{ "query": { "range": { "views": { "gte": 100 } } } }

# 排序 + 分页
{ "sort": [{ "views": "desc" }], "from": 0, "size": 20 }

# 指定返回字段
{ "_source": ["title", "author"] }

# 聚合
{ "size": 0, "aggs": { "名称": { "terms": { "field": "category" } } } }

参考资源