ElasticSearch
Author: CodingGorit
Date: October 22, 2020
Note: study notes taken from the Bilibili (B站) course by 狂神说 on ElasticSearch
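1. Study Outline
- Installation
- Ecosystem
- IK analyzer
- Operating ES through the RESTful API
- CRUD
- Integrating ElasticSearch with Spring Boot (analyzed from the underlying principles)
- Crawling data from JD.com (京东)
- Hands-on project: simulating full-text search
Use ES for search-related features, especially when the data volume is large.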
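> Lucene is an information-retrieval toolkit (a JAR library; it does not include a search engine system). Solr is a search server built on top of it.
>
> It contains index structures, tools for reading and writing indexes, sorting, search rules, and other utility classes.
>
> Relationship between Lucene and ElasticSearch:
>
> ElasticSearch wraps and extends Lucene.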
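2. ElasticSearch Overview
Abbreviated as ES.
- An open-source, highly scalable, distributed full-text search engine
- Stores and retrieves data in near real time
- Written in Java, with Lucene at its core for all indexing and search functionality
- Its goal is to hide Lucene's complexity behind a simple RESTful API, so that full-text search becomes easy
3. Installing ElasticSearch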
- JDK 1.8
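- Download and extract the archive
- Get familiar with the directory layout:
bin: startup scripts
config: configuration files
log4j2.properties: logging configuration
jvm.options: JVM-related settings
elasticsearch.yml: the ElasticSearch configuration file
lib: bundled JAR files
logs: log output
modules: functional modules
plugins: plugins (for example, ik)
- Start ES; it listens on port 9200
- Test access: localhost:9200
> Install the visualization plugin elasticsearch-head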
npm install
npm run start
Enable CORS in elasticsearch.yml so that the head plugin can reach ES:
http.cors.enabled: true
http.cors.allow-origin: "*"
Installing Kibana
- Download and extract the archive
- Localization
In config/kibana.yml, change the last line to i18n.locale: "zh-CN"
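4. ES Core Concepts
- Indices
- Field types (mapping)
- Documents
What are clusters, nodes, indices, types, documents, shards, and mappings?
> ElasticSearch is document-oriented. Below is a side-by-side comparison of a relational database and ElasticSearch. Everything is JSON.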
>
> {
>
> }
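Terminology mapping
ElasticSearch | Relational DB |
---|---|
indices | databases |
types | tables |
documents | rows |
fields | columns |
An ElasticSearch cluster can contain multiple indices (databases); each index can contain multiple types (tables); each type contains multiple documents (rows); and each document contains multiple fields (columns). Types are deprecated in ES 7.x, as the deprecation warnings later in these notes show.
Physical design
A single ElasticSearch instance is already a cluster (a one-node cluster).
Documents
Individual records, for example: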
user
zs: 15
ls: 22
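Types
Detected automatically (for example, string)
Indices
Comparable to a database
5. IK Analyzer Plugin
Extract the downloaded plugin into the plugins directory.
Skipped here (episode 8 of the course).
- The elasticsearch-plugin command shows which plugins have been loaded
- ik_smart (coarsest split) and ik_max_word (finest-grained split)
- Test in Kibana
- Custom dictionary entries
6. REST Style
Basic REST commands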
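method | URL | description |
---|---|---|
PUT | localhost:9200/{index}/{type}/{id} | Create a document (with the given document ID) |
POST | localhost:9200/{index}/{type} | Create a document (with a random document ID) |
POST | localhost:9200/{index}/{type}/{id}/_update | Update a document |
DELETE | localhost:9200/{index}/{type}/{id} | Delete a document |
GET | localhost:9200/{index}/{type}/{id} | Get a document by its ID |
POST | localhost:9200/{index}/{type}/_search | Query all documents |
> Basic tests
6.1 Creating an index
- Create an index by indexing a document (the type segment of the path is deprecated in ES 7.x)
PUT /{index}/{type}/{id}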
{
"name":"Gorit",
"age": 18,
"gender": "male"
}
Response: the document was added successfully
#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
"_index" : "test",
"_type" : "type1",
"_id" : "1",
"_version" : 1, // number of times the document has been written
"result" : "created", // result status
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
- Create an index with explicit mappings
PUT /test1/
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"age": {
"type": "long"
},
"birthday": {
"type": "date"
}
}
}
}
Response
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "test1"
}
If no mapping is given, ES assigns default field types automatically.
6.2 Querying index information
GET test
# Result
{
"test" : {
"aliases" : { },
"mappings" : {
"properties" : {
"age" : {
"type" : "long"
},
"gender" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1603203146037",
"number_of_shards" : "1",
"number_of_replicas" : "1",
"uuid" : "q47lWt_4ToOBo1rxQ1pPNw",
"version" : {
"created" : "7060299"
},
"provided_name" : "test"
}
}
}
}
GET test1
{
"test1" : {
"aliases" : { },
"mappings" : {
"properties" : {
"age" : {
"type" : "long"
},
"birthday" : {
"type" : "date"
},
"name" : {
"type" : "text"
}
}
},
"settings" : {
"index" : {
"creation_date" : "1603203453667",
"number_of_shards" : "1",
"number_of_replicas" : "1",
"uuid" : "a-upVXJwR7u7JZztTjyVGg",
"version" : {
"created" : "7060299"
},
"provided_name" : "test1"
}
}
}
}
Extra: the _cat/ APIs expose a lot of information about the current state of ES
GET _cat/health
GET _cat/indices?v
6.3 Modifying an index
> Submit a PUT request to overwrite the document
Modify the data
PUT /test/type1/1
{
"name":"Gorit111",
"age": 18,
"gender": "male"
}
Result of the modification
#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
"_index" : "test",
"_type" : "type1",
"_id" : "1",
"_version" : 2,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1
}
The newer approach: update with POST
POST /test/_doc/1/_update
{
"doc": {
"name":"张三"
}
}
// Result
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_version" : 3,
"_seq_no" : 2,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "张三",
"age" : 18,
"gender" : "male"
}
}
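6.4 Deleting an index
> Delete the index
DELETE test
Deletion uses the DELETE method; whether an index or a single document is deleted depends on the request path.
7. Document Operations
7.1 Basic operations (review)
- Add data (several records)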
PUT /gorit/user/1
{
"name": "CodingGorit",
"age": 23,
"desc": "一个独立的个人开发者",
"tags": ["Python","Java","JavaScript"]
}
PUT /gorit/user/2
{
"name": "龙",
"age": 20,
"desc": "全栈工程师",
"tags": ["Python","JavaScript"]
}
Result:
#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
"_index" : "gorit",
"_type" : "user",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
- Retrieve data
GET /gorit/user/_search # fetch all documents
GET /gorit/user/1 # fetch a single document
- Update data with PUT
PUT /gorit/user/3
{
"name": "李四222",
"age": 20,
"desc": "Java开发工程师",
"tags": ["Python","Java"]
}
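# PUT replaces the whole document: any field left out of the request body is lost
- POST _update (the recommended way)
# Without _update, POST behaves like PUT: the document is replaced and the omitted fields are lost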
POST /gorit/user/1
{
"doc": {
"name": "coco"
}
}
# With _update, only the fields listed under "doc" change; all other fields are kept
POST /gorit/user/1/_update
{
"doc": {
"name": "coco"
}
}
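The difference between the two write styles above can be modeled locally. The following is only a sketch: `UpdateSemantics`, `replace`, and `partialUpdate` are hypothetical names, no ES client is involved, and only top-level fields are merged (the real `_update` API also merges nested objects).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Local model of full replace vs. partial update; hypothetical helper, not an ES API. */
final class UpdateSemantics {

    /** Like PUT on a document: the stored document becomes exactly the request body. */
    static Map<String, Object> replace(Map<String, Object> stored, Map<String, Object> body) {
        return new LinkedHashMap<>(body);
    }

    /** Like POST .../_update with a "doc" object: top-level fields are merged into the stored document. */
    static Map<String, Object> partialUpdate(Map<String, Object> stored, Map<String, Object> doc) {
        Map<String, Object> merged = new LinkedHashMap<>(stored);
        merged.putAll(doc);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("name", "CodingGorit");
        stored.put("age", 23);

        Map<String, Object> change = new LinkedHashMap<>();
        change.put("name", "coco");

        System.out.println(replace(stored, change));       // {name=coco}
        System.out.println(partialUpdate(stored, change)); // {name=coco, age=23}
    }
}
```

The first line of output shows why an incomplete PUT body empties the other fields, while the merge keeps `age`.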
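Simple searches
# Fetch one record
GET /gorit/user/1
# Fetch all records
GET /gorit/user/_search
# Conditional query. If no mapping was defined for a string field, ES maps it dynamically as text with a keyword sub-field (see the GET test output above). The keyword sub-field matches only the whole value; the text field is analyzed, so matching on individual tokens works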
GET /gorit/user/_search?q=name:coco
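7.2 Complex searches: select (sorting, pagination, highlighting, fuzzy queries, exact queries)
- Query with source filtering (return only the listed fields)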
GET /gorit/user/_search
{
"query": {
"match": {
"name": "李四"
}
},
"_source": ["name","desc"]
}
7.3 Sorting
GET /gorit/user/_search
{
"query": {
"match": {
"name": "gorit"
}
},
"sort": [
{
"age": {
"order": "desc"
}
}
]
}
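7.4 Pagination
Use the from and size fields to paginate; this works the same way as SQL's LIMIT offset, pageSize.
- from: the 0-based offset of the first hit to return (an offset, not a page number)
- size: how many hits to return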
GET /gorit/user/_search
{
"query": {
"match": {
"name": "李四"
}
},
"sort": [
{
"age": {
"order": "desc"
}
}
],
"from": 0,
"size": 1
}
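Because from is an offset, a 1-based page number has to be converted before it is sent. A minimal sketch of that conversion follows; `Paging` and `fromOffset` are hypothetical names chosen for illustration.

```java
/** Hypothetical helper that converts a page number into the "from" offset. */
final class Paging {

    /** Returns the 0-based offset of the first hit on a 1-based page; pages below 1 are treated as page 1. */
    static int fromOffset(int pageNo, int pageSize) {
        int page = Math.max(pageNo, 1);
        return (page - 1) * pageSize;
    }

    public static void main(String[] args) {
        System.out.println(fromOffset(1, 10)); // 0
        System.out.println(fromOffset(3, 10)); // 20
    }
}
```

With size 10, page 3 therefore starts at hit 20, which corresponds to "from": 20, "size": 10 in the request body.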
7.5 Range queries with filter
# Query by age range
GET /gorit/user/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "gorit"
}
}
],
"filter": [
{
"range": {
"age": {
"gte": 1,
"lte": 25
}
}
}
]
}
}
}
- gt: greater than
- gte: greater than or equal to
- lt: less than
- lte: less than or equal to
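The only difference between the four operators is whether a bound is inclusive. As a plain-Java illustration (no ES involved; `AgeRange` and its methods are hypothetical names), the same sample ages are filtered once with gte/lte and once with gt/lt:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical local model of range bounds. */
final class AgeRange {

    /** Mirrors gte/lte: both bounds are inclusive. */
    static boolean gteLte(int age, int min, int max) {
        return age >= min && age <= max;
    }

    /** Mirrors gt/lt: both bounds are exclusive. */
    static boolean gtLt(int age, int min, int max) {
        return age > min && age < max;
    }

    public static void main(String[] args) {
        List<Integer> inclusive = IntStream.of(1, 16, 25, 30)
                .filter(age -> gteLte(age, 1, 25)).boxed().collect(Collectors.toList());
        List<Integer> exclusive = IntStream.of(1, 16, 25, 30)
                .filter(age -> gtLt(age, 1, 25)).boxed().collect(Collectors.toList());
        System.out.println(inclusive); // [1, 16, 25]
        System.out.println(exclusive); // [16]
    }
}
```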
7.6 Bool queries
must (AND): every condition has to match, like SQL's where id = 1 and name = xxx
# Bool query
GET /gorit/user/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "gorit"
}
},{
"match": {
"age": "16"
}
}
]
}
}
}
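7.7 Matching multiple terms
Put all the terms into a single match query.
# Separate the terms with spaces; a document is returned if it matches any one of them, and _score shows how well it matched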
GET /gorit/user/_search
{
"query": {
"match": {
"tags": "Java Python"
}
}
}
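7.7 Exact queries
A term query looks up the given term directly in the inverted index, so it matches exactly.
About analysis (tokenization)
- term: exact lookup; the query text is not analyzed
- match: the query text is run through the analyzer first, and the resulting tokens are then looked up
Two string field types: text and keyword
Conclusion:
- text: the value is split into tokens by the analyzer
- keyword: the value is not split; it is indexed as a single term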
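The behaviour described above can be reproduced with a toy inverted index. This is only a sketch under stated assumptions: `TinyIndex` is a hypothetical class, and its analyzer simply lowercases and splits on whitespace as a stand-in for a real ES analyzer (the IK analyzer, for example, splits Chinese text very differently).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index for one field: term -> ids of the documents containing it. */
final class TinyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Stand-in analyzer (assumption): lowercase, then split on whitespace. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** text field: the value is analyzed and every token is indexed. */
    void indexText(int docId, String value) {
        for (String token : analyze(value)) {
            add(token, docId);
        }
    }

    /** keyword field: the whole value is indexed as one term. */
    void indexKeyword(int docId, String value) {
        add(value, docId);
    }

    private void add(String term, int docId) {
        postings.computeIfAbsent(term, key -> new TreeSet<>()).add(docId);
    }

    /** term query: direct lookup, the query text is not analyzed. */
    Set<Integer> term(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    /** match query: analyze the query text, then return documents containing any token (OR). */
    Set<Integer> match(String queryText) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : analyze(queryText)) {
            hits.addAll(term(token));
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyIndex textField = new TinyIndex();
        textField.indexText(1, "Coding Gorit");
        textField.indexText(2, "Gorit Java");

        TinyIndex keywordField = new TinyIndex();
        keywordField.indexKeyword(1, "Coding Gorit");
        keywordField.indexKeyword(2, "Gorit Java");

        System.out.println(textField.match("gorit python"));      // [1, 2]
        System.out.println(textField.term("Gorit"));              // [] (tokens were lowercased at index time)
        System.out.println(keywordField.term("Coding Gorit"));    // [1]
        System.out.println(keywordField.term("Gorit"));           // []
    }
}
```

The last two lines show why a keyword field only answers to its complete value, while the text field can be found through any one of its tokens.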
7.8 Highlighting
# Highlight query: matches in the results are wrapped in highlight tags, and the tags can be customized
GET /gorit/user/_search
{
"query": {
"match": {
"name": "Gorit"
}
},
"highlight": {
"pre_tags": "<h3 class=\"key\" style=\"color:#FF0000;\">",
"post_tags": "</h3>",
"fields": {
"name": {}
}
}
}
# Response
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.6375021,
"hits" : [
{
"_index" : "gorit",
"_type" : "user",
"_id" : "6",
"_score" : 1.6375021,
"_source" : {
"name" : "Gorit",
"age" : 16,
"desc" : "运维工程师",
"tags" : [
"Linux",
"c++",
"python"
]
},
"highlight" : {
"name" : [
"<h3 class=\"key\" style=\"color:#FF0000;\">Gorit</h3>"
]
}
}
]
}
}
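MySQL can do most of this as well, but for full-text search over large data sets it is less efficient than ES:
- Matching
- Conditional matching
- Exact matching
- Range matching
- Source field filtering
- Multi-condition queries
- Highlighting
- Inverted index
8. Spring Boot Integration
> Find the official documentation
> Concrete tests
- Create an index
- Check whether an index exists
- Delete an index
- Create a document
- Operate on documents
// Maven dependency
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
// Core code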
package cn.gorit;
import cn.gorit.pojo.User;
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.delete.DeleteResponse;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.master.AcknowledgedRequest;
import org.elasticsearch.action.support.master.AcknowledgedResponse;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.action.update.UpdateResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContent;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.MatchAllQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.TermQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.FetchSourceContext;
import org.json.JSONObject;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.http.codec.cbor.Jackson2CborDecoder;
import java.io.IOException;
import java.util.ArrayList;
import java.util.concurrent.TimeUnit;
/**
 * ES 7.6.2 API tests
 */
@SpringBootTest
class DemoApplicationTests {
    // Inject by bean name
    @Autowired
    @Qualifier("restHighLevelClient")
    private RestHighLevelClient client;

    @Test
    void contextLoads() {
    }

    // Create an index
    @Test
    void testCreateIndex() throws IOException {
        // 1. Build the create-index request; equivalent to PUT /gorit_index
        CreateIndexRequest request = new CreateIndexRequest("gorit_index");
        // 2. Execute it through the IndicesClient and read the response
        CreateIndexResponse response = client.indices().create(request, RequestOptions.DEFAULT);
        System.out.println(response);
    }

    // Check whether the index exists
    @Test
    void testGetIndexExist() throws IOException {
        GetIndexRequest request = new GetIndexRequest("gorit_index");
        boolean exist = client.indices().exists(request, RequestOptions.DEFAULT);
        System.out.println(exist);
    }

    // Delete the index
    @Test
    void testDeleteIndex() throws IOException {
        DeleteIndexRequest request = new DeleteIndexRequest("gorit_index");
        AcknowledgedResponse delete = client.indices().delete(request, RequestOptions.DEFAULT);
        System.out.println(delete.isAcknowledged());
    }
    // Add a document
    @Test
    void testAddDocument() throws IOException {
        // Create the object to store
        User u = new User("Gorit", 3);
        // Build the request; equivalent to PUT /gorit_index/_doc/1
        IndexRequest request = new IndexRequest("gorit_index");
        request.id("1");
        // Two equivalent ways to set the timeout; the later call wins
        request.timeout(TimeValue.timeValueSeconds(3));
        request.timeout("1s");
        // Put the data into the request as JSON
        request.source(JSON.toJSONString(u), XContentType.JSON);
        // Send the request
        IndexResponse response = client.index(request, RequestOptions.DEFAULT);
        System.out.println(response.toString());
        System.out.println(response.status()); // CREATED when the document is new
    }

    // Check whether a document exists; equivalent to GET /gorit_index/_doc/1
    @Test
    void testIsExists() throws IOException {
        GetRequest getRequest = new GetRequest("gorit_index", "1");
        // Do not fetch _source; only existence matters here
        getRequest.fetchSourceContext(new FetchSourceContext(false));
        getRequest.storedFields("_none_");
        boolean exists = client.exists(getRequest, RequestOptions.DEFAULT);
        System.out.println(exists);
    }
    // Fetch a document
    @Test
    void testGetDocument() throws IOException {
        GetRequest getRequest = new GetRequest("gorit_index", "1");
        GetResponse getResponse = client.get(getRequest, RequestOptions.DEFAULT);
        // Print the document content
        System.out.println(getResponse.getSourceAsString());
        System.out.println(getResponse); // the full response, same as the REST command returns
    }

    // Update a document
    @Test
    void testUpdateDocument() throws IOException {
        UpdateRequest updateRequest = new UpdateRequest("gorit_index", "1");
        updateRequest.timeout("1s");
        User user = new User("CodingGorit", 18);
        updateRequest.doc(JSON.toJSONString(user), XContentType.JSON);
        UpdateResponse updateResponse = client.update(updateRequest, RequestOptions.DEFAULT);
        // Print the status and the full response
        System.out.println(updateResponse.status());
        System.out.println(updateResponse);
    }

    // Delete a document
    @Test
    void testDeleteDocument() throws IOException {
        DeleteRequest deleteRequest = new DeleteRequest("gorit_index", "1");
        deleteRequest.timeout("1s");
        DeleteResponse deleteResponse = client.delete(deleteRequest, RequestOptions.DEFAULT);
        // Print the status and the full response
        System.out.println(deleteResponse.status());
        System.out.println(deleteResponse);
    }
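    // Bulk insert, as real projects usually need
    @Test
    void testBulkRequest() throws IOException {
        BulkRequest bulkRequest = new BulkRequest();
        bulkRequest.timeout("10s");
        ArrayList<User> userList = new ArrayList<>();
        userList.add(new User("张三1", 1));
        userList.add(new User("张三2", 2));
        userList.add(new User("张三3", 3));
        userList.add(new User("张三4", 4));
        userList.add(new User("张三5", 5));
        userList.add(new User("张三6", 6));
        userList.add(new User("张三7", 7));
        // Add one index request per user to the bulk request.
        // NOTE: the loop body is missing from these notes; it is reconstructed here
        // by analogy with ContentService.parseContent further down.
        for (int i = 0; i < userList.size(); i++) {
            bulkRequest.add(
                new IndexRequest("gorit_index")
                    .source(JSON.toJSONString(userList.get(i)), XContentType.JSON));
        }
        BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
        System.out.println(bulkResponse.hasFailures()); // false means every item succeeded
    }
}
// Maven dependencies (pom.xml)
<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.68</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-thymeleaf</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-configuration-processor</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
        <exclusions>
            <exclusion>
                <groupId>org.junit.vintage</groupId>
                <artifactId>junit-vintage-engine</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>
Crawler
ES client configuration class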
package cn.gorit.config;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
/**
* Spring workflow
* 1. Find the object
* 2. Register it in the Spring container
* 3. Analyze the source code
*
* @Classname ElasticSearchConfig
* @Description TODO
* @Date 2020/10/21 17:20
* @Created by CodingGorit
* @Version 1.0
*/
@Configuration // the equivalent of an XML bean definition
public class ElasticSearchConfig {
@Bean
public RestHighLevelClient restHighLevelClient() {
RestHighLevelClient client = new RestHighLevelClient(
RestClient.builder(
new HttpHost("localhost", 9200, "http")
)
);
return client;
}
}
> Crawl JD.com search results
HTML parsing utility class
package cn.gorit.util;
import cn.gorit.pojo.Content;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Component;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
/**
* @Classname HtmlParseUtil
* @Description TODO
* @Date 2020/10/21 23:17
* @Created by CodingGorit
* @Version 1.0
*/
@Component
public class HtmlParseUtil {
// public static void main(String[] args) throws Exception {
// new HtmlParseUtil().parseJD("英语").forEach(System.out::println);
// }
    public List<Content> parseJD(String keyword) throws Exception {
        // Request URL. This needs network access and cannot fetch data loaded via AJAX.
        String url = "https://search.jd.com/Search?keyword=wd&enc=utf-8";
        // Parse the page (returns a jsoup Document)
        Document document = Jsoup.parse(new URL(url.replace("wd", keyword)), 30000);
        // Locate the product list container
        Element element = document.getElementById("J_goodsList");
        List<Content> goodsList = new ArrayList<>();
        if (element == null) {
            // The container is absent, e.g. when the page layout differs; return no results
            return goodsList;
        }
        // Every li element is one product
        Elements elements = element.getElementsByTag("li");
        // Extract the fields of each product
        for (Element e : elements) {
            String img = e.getElementsByTag("img").eq(0).attr("data-lazy-img");
            String price = e.getElementsByClass("p-price").eq(0).text();
            String title = e.getElementsByClass("p-name").eq(0).text();
            goodsList.add(new Content(title, img, price));
        }
        return goodsList;
    }
}
Service class
package cn.gorit.service;
import cn.gorit.pojo.Content;
import cn.gorit.util.HtmlParseUtil;
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.TermQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightField;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
/**
* @Classname ContentService
* @Description TODO
* @Date 2020/10/22 18:44
* @Created by CodingGorit
* @Version 1.0
*/
@Service
public class ContentService {
@Autowired
private RestHighLevelClient restHighLevelClient;
// Cannot be run directly: the client is only injected inside the Spring container
public static void main(String[] args) throws Exception {
new ContentService().parseContent("java");
}
// 1. Parse the data and put it into an ES index
public Boolean parseContent (String keywords) throws Exception {
// Fetch the list of crawled items
List<Content> contents = new HtmlParseUtil().parseJD(keywords);
// Put the crawled data into ES
BulkRequest bulkRequest = new BulkRequest();
bulkRequest.timeout("2m");
for (int i=0;i < contents.size();++i) {
bulkRequest.add(
new IndexRequest("jd_goods")
.source(JSON.toJSONString(contents.get(i)),XContentType.JSON));
}
BulkResponse bulkResponse = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
return !bulkResponse.hasFailures();
}
    // 2. Search the indexed data, with highlighting
    public List<Map<String, Object>> searchPagehighLight(String keyword, int pageNo, int pageSize) throws IOException {
        if (pageNo <= 1)
            pageNo = 1;
        // Build the search conditions
        SearchRequest searchRequest = new SearchRequest("jd_goods");
        SearchSourceBuilder builder = new SearchSourceBuilder();
        // from is a 0-based hit offset, so convert the 1-based page number
        builder.from((pageNo - 1) * pageSize);
        builder.size(pageSize);
        // Exact (term) match
        TermQueryBuilder termQueryBuilder = QueryBuilders.termQuery("title", keyword);
        builder.query(termQueryBuilder);
        builder.timeout(new TimeValue(60, TimeUnit.SECONDS));
        // Highlighting
        HighlightBuilder highlightBuilder = new HighlightBuilder();
        highlightBuilder.field("title");
        highlightBuilder.requireFieldMatch(false);
        highlightBuilder.preTags("<span style=\"color:#FF0000;\">");
        highlightBuilder.postTags("</span>");
        builder.highlighter(highlightBuilder);
        // Execute the search
        searchRequest.source(builder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        // Parse the results
        ArrayList<Map<String, Object>> list = new ArrayList<>();
        for (SearchHit hit : searchResponse.getHits().getHits()) {
            // Read the highlighted field
            Map<String, HighlightField> highlightFields = hit.getHighlightFields();
            HighlightField title = highlightFields.get("title");
            Map<String, Object> sourceAsMap = hit.getSourceAsMap(); // the original source
            // Replace the original title with the highlighted one
            if (title != null) {
                Text[] fragments = title.fragments();
                StringBuilder nTitle = new StringBuilder();
                for (Text text : fragments) {
                    nTitle.append(text);
                }
                sourceAsMap.put("title", nTitle.toString());
            }
            list.add(sourceAsMap);
        }
        return list;
    }
    // 3. Search the indexed data, without highlighting
    public List<Map<String, Object>> searchPage(String keyword, int pageNo, int pageSize) throws IOException {
        if (pageNo <= 1)
            pageNo = 1;
        // Build the search conditions
        SearchRequest searchRequest = new SearchRequest("jd_goods");
        SearchSourceBuilder builder = new SearchSourceBuilder();
        // from is a 0-based hit offset, so convert the 1-based page number
        builder.from((pageNo - 1) * pageSize);
        builder.size(pageSize);
        // Exact (term) match
        TermQueryBuilder termQueryBuilder = QueryBuilders.termQuery("title", keyword);
        builder.query(termQueryBuilder);
        builder.timeout(new TimeValue(60, TimeUnit.SECONDS));
        // Execute the search
        searchRequest.source(builder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        // Collect the source of every hit
        ArrayList<Map<String, Object>> list = new ArrayList<>();
        for (SearchHit hit : searchResponse.getHits().getHits()) {
            list.add(hit.getSourceAsMap());
        }
        return list;
    }
}
Controller
package cn.gorit.controller;
import cn.gorit.pojo.Content;
import cn.gorit.service.ContentService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.bind.annotation.RestControllerAdvice;
import java.io.IOException;
import java.util.List;
import java.util.Map;
/**
* @Classname ContentController
* @Description TODO
* @Date 2020/10/22 18:45
* @Created by CodingGorit
* @Version 1.0
*/
@RestController
public class ContentController {
@Autowired
private ContentService service;
/**
* Crawl the data for a keyword and add it to ES
* @param keyword
* @return
* @throws Exception
*/
@GetMapping("/parse/{keyword}")
public Boolean parse(@PathVariable("keyword") String keyword) throws Exception {
return service.parseContent(keyword);
}
/**
* Search the data stored in ES
* @param keyword
* @param pageNo
* @param pageSize
* @return
* @throws IOException
*/
@GetMapping("/search/{keyword}/{pageNo}/{pageSize}")
public List<Map<String, Object>> search(@PathVariable("keyword") String keyword,@PathVariable("pageNo") int pageNo, @PathVariable("pageSize") int pageSize) throws IOException {
if (pageNo == 0) {
pageNo = 1;
}
return service.searchPage(keyword, pageNo, pageSize);
}
}
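Front end and back end are separated
Test with Postman
Search highlighting
> One project, usable from multiple clients
10. Summary
- ElasticSearch basics
- Integrating ES with Spring Boot
- Hands-on search project
> Personal open-source project (Coding-With-Java); stars are welcome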
[TOC]
ElasticSearch
Author: CodingGorit
Date: 2020年10月22日
Note:学习笔记记录自 B站狂神说:ElasticSearch 学习
一、学习大纲
- 安装
- 生态圈
- 分词器 lk
- RestFul 操作 ES
- CRUD
- SpringBoot 继承 ElasticSearch (从原理分析!!!)
- 爬虫爬取数据!!! 京东
- 实战,模拟全文检索
搜索相关使用 ES(大数据量下使用)
> Lucene 是一套信息检索工具包 (Jar 包,不包含 搜索引擎系统)! Solr
>
> 包含的:索引结构!读写索引的工具!排序,搜索规则… 工具类
>
> Lucene 和 EslasticSearch 关系:
>
> ElasticSearch 是基于 Lucene 做了一些封装 和 增强
二、ElasticSearch 概述
简称 es
- 一个开源的高扩展的 分布式全文检索引擎
- 近乎实时的存储,检索数据
- es使用 java 开发并使用 Licene 作为其核心来实现所有索引 和 搜索功能
- 它的目的是通过简单的 RESTFul API,来隐藏 Lucene 的复杂性,从而让全文搜索变得简单
三、ElasticSearch 安装
- JDK 1.8
-
下载,解压
-
熟悉目录:
bin: 启动文件
config: 配置文件
log4j: 日志文件
jvm.options: java 虚拟机先关的配置
elasticsearch.xml: elasticsearch 的配置文件!
lib: 相关 jar 包
logs: 日志
modules: 功能模块
plugins: 插件 ik
- 启动,访问 9200
- 访问测试:localhost:9200
> 安装可视化插件 es head 插件
npm install
npm run start
在 elasticSearch.yml 配置跨域
http.cors.enabled: true
http.cors.allow-origin: "*"
安装 kibana
- 下载,解压
- 国际化
找到 config 下的 kibana.yml 文件,修改最后一行为 i18n.locale: “zh-CN”
四、ES 核心概念
- 索引
- 字段类型 (mapping)
- 文档(documents)
集群、节点、索引、类型、文档、分片、映射是什么?
> ElasticSearch 是面向文档,关系型数据库 和 elasticSearch 客观的对比! 一切都是 JSON
>
> {
>
> }
名词对应
ElasticSearch | Relational DB |
---|---|
索引(indices) | 数据库(database) |
types | 表(tables) |
documents | 行(rows) |
fields | 字段(columns) |
elasticSearch (集群)中可以包含多个索引(数据库),每个索引中可以包含多个类型(表),每个类型下又包含多个文档(行),每个文档又包含多个字段(列)
物理设计
elasticSearch 一个就是一个集群
文档
一条条记录
user
zs: 15
ls: 22
类型
自动识别, string,
索引
数据库
五、IK 分词器插件
下载好的添加到 plugin 中
跳过,第 8 集
-
elasticsearch-plugin 可以通过这个命令来查看加载进来的插件
-
ik_smart(最少切分) 和 ik_max_word(最细粒度划分)
-
kibana 测试
-
自定义分词
六、 Rest 风格说明
基础 Rest 命令
method | url 地址 | 描述 |
---|---|---|
PUT | localhost:9200/索引名称/类型名称/文档 id | 创建文档(指定文档 id) |
POST | localhost:9200/索引名称/类型名称 | 创建文档(随机文档 id) |
POST | localhost:9200/索引名称/类型名称/文档id/_update | 修改文档 |
DELETE | localhost:9200/索引名称/类型名称/文档id | 删除文档 |
GET | localhost:9200/索引名称/类型名称/文档id | 查询文档通过文档 id |
POST | localhost:9200/索引名称/类型名称/_seaarch | 查询所有数据 |
> 基本测试
6.1 创建索引
- 创建一个索引
PUT /索引名/~类型名~/文档id
{
"name":"Gorit",
"age": 18,
"gender": "male"
}
返回值,数据成功添加
#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
"_index" : "test",
"_type" : "type1",
"_id" : "1",
"_version" : 1, // 修改次数
"result" : "created", // 状态
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
- 创建索引规则
PUT /test1/
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"age": {
"type": "long"
},
"birthday": {
"type": "date"
}
}
}
}
返回值
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "test1"
}
es 默认配置字段类型!
6.2 查询
GET test
# 结果
{
"test" : {
"aliases" : { },
"mappings" : {
"properties" : {
"age" : {
"type" : "long"
},
"gender" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1603203146037",
"number_of_shards" : "1",
"number_of_replicas" : "1",
"uuid" : "q47lWt_4ToOBo1rxQ1pPNw",
"version" : {
"created" : "7060299"
},
"provided_name" : "test"
}
}
}
}
GET test1
{
"test1" : {
"aliases" : { },
"mappings" : {
"properties" : {
"age" : {
"type" : "long"
},
"birthday" : {
"type" : "date"
},
"name" : {
"type" : "text"
}
}
},
"settings" : {
"index" : {
"creation_date" : "1603203453667",
"number_of_shards" : "1",
"number_of_replicas" : "1",
"uuid" : "a-upVXJwR7u7JZztTjyVGg",
"version" : {
"created" : "7060299"
},
"provided_name" : "test1"
}
}
}
}
扩展:通过 _cat/ 可以获得 es 当前很多的信息
GET _cat/health
GET _cat/indices?v
6.3 修改索引
> 提交 PUT,覆盖即可
修改数据
PUT /test/type1/1
{
"name":"Gorit111",
"age": 18,
"gender": "male"
}
修改结果
#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
"_index" : "test",
"_type" : "type1",
"_id" : "1",
"_version" : 2,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1
}
新的方法 POST 命令更新
POST /test/_doc/1/_update
{
"doc": {
"name":"张三"
}
}
// 结果
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_version" : 3,
"_seq_no" : 2,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "张三",
"age" : 18,
"gender" : "male"
}
}
6.4 删除索引
> 删除索引!!!
DELETE test
通过 delete 命令实现删除,根据你的请求来判断删除的是索引 还是 文档
七、关于文档的操作
7.1 基本操作 (复习巩固)
- 添加数据(添加多条记录)
PUT /gorit/user/1
{
"name": "CodingGorit",
"age": 23,
"desc": "一个独立的个人开发者",
"tags": ["Python","Java","JavaScript"]
}
PUT /gorit/user/2
{
"name": "龙",
"age": 20,
"desc": "全栈工程师",
"tags": ["Python","JavaScript"]
}
结果:
#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
{
"_index" : "gorit",
"_type" : "user",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
- 获取数据
GET /gorit/user/_search # 查询所有数据
GET /gorit/user/1 # 查询单个数据
- 更新数据 PUT
PUT /gorit/user/3
{
"name": "李四222",
"age": 20,
"desc": "Java开发工程师",
"tags": ["Python","Java"]
}
# PUT 更新字段不完整,数据会被滞空
- post _update , 推荐使用这种方式!
# 修改方式和 PUT 一样会使数据滞空
POST /gorit/user/1
{
"doc": {
"name": "coco"
}
}
# 修改数据不会滞空, 效率更加高效
POST /gorit/user/1/_update
{
"doc": {
"name": "coco"
}
}
简单的搜索!
# 查询一条记录
GET /gorit/user/1
# 查询所有
GET /gorit/user/_search
# 条件查询 [精确匹配] ,如果我们没有个这个属性设置字段,它会背默认设置为 keyword,这个 keyword 字段就是使用全匹配来匹配的,如果是 text 类型,模糊查询就会起效果
GET /gorit/user/_search?q=name:coco
7.2 复杂的查询搜索:select(排序、分页、高亮、模糊查询、精确查询)!
- 过滤加指定字段查询
GET /gorit/user/_search
{
"query": {
"match": {
"name": "李四"
}
},
"_source": ["name","desc"]
}
7.3 排序
GET /gorit/user/_search
{
"query": {
"match": {
"name": "gorit"
}
},
"sort": [
{
"age": {
"order": "desc"
}
}
]
}
7.4 分页查询
使用字段 from 和 size 进行分页查询,方式和 limit pageSize 是一模一样的
- from 从第几页开始
- 返回多少条数据
GET /gorit/user/_search
{
"query": {
"match": {
"name": "李四"
}
},
"sort": [
{
"age": {
"order": "desc"
}
}
],
"from": 0,
"size": 1
}
7.5 filiter 区间查询
# 根据年龄的范围大小查询
GET /gorit/user/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "gorit"
}
}
],
"filter": [
{
"range": {
"age": {
"gte": 1,
"lte": 25
}
}
}
]
}
}
}
- gt 大于
- gte 大于等于
- lt 小于
- lte 小于等于
7.6 布尔值查询
must (and), 所有的条件都要符合 where id=1 and name = xxx
# 布尔查询
GET /gorit/user/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "gorit"
}
},{
"match": {
"age": "16"
}
}
]
}
}
}
7.7 匹配多个条件
同时匹配即可
# 多个条件用空格隔开,只要满足一个即可被查出,这个时候可以根据分值判断
GET /gorit/user/_search
{
"query": {
"match": {
"tags": "Java Python"
}
}
}
7.7 精确查询
term 查询是直接通过倒排索引指定的词条进程精确的查找的!
关于分词
-
term,直接精确查询
-
match:会使用分词器解析!!(先分析文档,然后通过分析的文档进行查询!!!)
两个类型 text keyword
结论:
- text 可分
- keyword 不可再分
7.8 高亮查询
# 高亮查询, 搜索的结果,可以高亮显示, 也能添加自定义高亮条件
GET /gorit/user/_search
{
"query": {
"match": {
"name": "Gorit"
}
},
"highlight": {
"pre_tags": "<h3 class="key" style="color:#FF0000;">",
"post_tags": "</h3>",
"fields": {
"name": {}
}
}
}
# 响应结果
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.6375021,
"hits" : [
{
"_index" : "gorit",
"_type" : "user",
"_id" : "6",
"_score" : 1.6375021,
"_source" : {
"name" : "Gorit",
"age" : 16,
"desc" : "运维工程师",
"tags" : [
"Linux",
"c++",
"python"
]
},
"highlight" : {
"name" : [
"<h3 class="key" style="color:#FF0000;">Gorit</h3>"
]
}
}
]
}
}
这些 MySQL 也可以做,只是 MySQL 效率更低
- 匹配
- 按照条件匹配
- 精确匹配
- 区间范围匹配
- 匹配字段过滤
- 多条件查询
- 高亮查询
- 倒排索引
八、集成 SpringBoot
> 找官方文档
> 具体测试
- 创建索引
- 判断索引是否存在
- 删除索引
- 创建文档
- 操作文档
// 坐标依赖
org.springframework.bootspring-boot-starter-data-elasticsearch
// 核心代码
package cn.gorit;
import cn.gorit.pojo.User;
import com.alibaba.fastjson.JSON;
import javafx.scene.control.IndexRange;
import org.apache.lucene.util.QueryBuilder;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.delete.DeleteResponse;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.master.AcknowledgedRequest;
import org.elasticsearch.action.support.master.AcknowledgedResponse;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.action.update.UpdateResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContent;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.MatchAllQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.TermQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.FetchSourceContext;
import org.json.JSONObject;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.http.codec.cbor.Jackson2CborDecoder;
import java.io.IOException;
import java.util.ArrayList;
import java.util.concurrent.TimeUnit;
/**
 * ES 7.6.2 API tests
 */
@SpringBootTest
class DemoApplicationTests {

    // Match the injected bean by name
    @Autowired
    @Qualifier("restHighLevelClient")
    private RestHighLevelClient client;

    @Test
    void contextLoads() {
    }

    // Create an index
    @Test
    void testCreateIndex() throws IOException {
        // 1. Build the create-index request, equivalent to PUT /gorit_index
        CreateIndexRequest request = new CreateIndexRequest("gorit_index");
        // 2. Send it through the IndicesClient and read the response
        CreateIndexResponse response = client.indices().create(request, RequestOptions.DEFAULT);
        System.out.println(response);
    }

    // Check whether an index exists
    @Test
    void testGetIndexExist() throws IOException {
        GetIndexRequest request = new GetIndexRequest("gorit_index");
        boolean exist = client.indices().exists(request, RequestOptions.DEFAULT);
        System.out.println(exist);
    }

    // Delete an index
    @Test
    void testDeleteIndex() throws IOException {
        DeleteIndexRequest request = new DeleteIndexRequest("gorit_index");
        // Send the delete request
        AcknowledgedResponse delete = client.indices().delete(request, RequestOptions.DEFAULT);
        System.out.println(delete.isAcknowledged());
    }

    // Add a document
    @Test
    void testAddDocument() throws IOException {
        // Create the object to index
        User u = new User("Gorit", 3);
        // Build the request, equivalent to PUT /gorit_index/_doc/1
        IndexRequest request = new IndexRequest("gorit_index");
        request.id("1");
        // Two equivalent ways to set the timeout; the later call overrides the earlier one
        request.timeout(TimeValue.timeValueSeconds(3));
        request.timeout("1s");
        // Put the data into the request as JSON
        request.source(JSON.toJSONString(u), XContentType.JSON);
        // Send the request
        IndexResponse response = client.index(request, RequestOptions.DEFAULT);
        System.out.println(response.toString());
        System.out.println(response.status()); // operation status, CREATED for a new document
    }

    // Check whether a document exists: get /index/_doc/1
    @Test
    void testIsExists() throws IOException {
        GetRequest getRequest = new GetRequest("gorit_index", "1");
        // Do not fetch _source; only existence matters here
        getRequest.fetchSourceContext(new FetchSourceContext(false));
        getRequest.storedFields("_none_");
        boolean exists = client.exists(getRequest, RequestOptions.DEFAULT);
        System.out.println(exists);
    }

    // Get a document
    @Test
    void testGetDocument() throws IOException {
        GetRequest getRequest = new GetRequest("gorit_index", "1");
        GetResponse getResponse = client.get(getRequest, RequestOptions.DEFAULT);
        // Print the document content
        System.out.println(getResponse.getSourceAsString());
        System.out.println(getResponse); // the full response, same as what the REST command returns
    }

    // Update a document
    @Test
    void testUpdateDocument() throws IOException {
        UpdateRequest updateRequest = new UpdateRequest("gorit_index", "1");
        updateRequest.timeout("1s");
        User user = new User("CodingGorit", 18);
        updateRequest.doc(JSON.toJSONString(user), XContentType.JSON);
        UpdateResponse updateResponse = client.update(updateRequest, RequestOptions.DEFAULT);
        // Print the status of the update
        System.out.println(updateResponse.status());
        System.out.println(updateResponse); // the full response, same as what the REST command returns
    }

    // Delete a document
    @Test
    void testDeleteDocument() throws IOException {
        DeleteRequest deleteRequest = new DeleteRequest("gorit_index", "1");
        deleteRequest.timeout("1s");
        DeleteResponse deleteResponse = client.delete(deleteRequest, RequestOptions.DEFAULT);
        // Print the status of the delete
        System.out.println(deleteResponse.status());
        System.out.println(deleteResponse); // the full response, same as what the REST command returns
    }
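
    // Real projects usually insert data in bulk
    @Test
    void testBulkRequest() throws IOException {
        BulkRequest bulkRequest = new BulkRequest();
        bulkRequest.timeout("10s");
        ArrayList<User> userList = new ArrayList<>();
        userList.add(new User("张三1", 1));
        userList.add(new User("张三2", 2));
        userList.add(new User("张三3", 3));
        userList.add(new User("张三4", 4));
        userList.add(new User("张三5", 5));
        userList.add(new User("张三6", 6));
        userList.add(new User("张三7", 7));
        // Add one index request per user to the bulk request
        for (int i = 0; i < userList.size(); i++) {
            bulkRequest.add(
                    new IndexRequest("gorit_index")
                            .source(JSON.toJSONString(userList.get(i)), XContentType.JSON));
        }
        BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
        System.out.println(bulkResponse.hasFailures()); // false means every item succeeded
    }
}
pom.xml dependencies
<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.68</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-thymeleaf</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-configuration-processor</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
        <exclusions>
            <exclusion>
                <groupId>org.junit.vintage</groupId>
                <artifactId>junit-vintage-engine</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>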
Crawler
Configuration class
package cn.gorit.config;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
/**
 * Steps when working with Spring:
 * 1. Find the object
 * 2. Register it with Spring so it can be used
 * 3. Read the source code
 *
 * @Classname ElasticSearchConfig
 * @Description TODO
 * @Date 2020/10/21 17:20
 * @Created by CodingGorit
 * @Version 1.0
 */
@Configuration // plays the role of an XML bean definition file
public class ElasticSearchConfig {

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")
                )
        );
        return client;
    }
}
> Crawling JD (jd.com) search results
HTML parsing utility class
package cn.gorit.util;
import cn.gorit.pojo.Content;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Component;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
/**
* @Classname HtmlParseUtil
* @Description TODO
* @Date 2020/10/21 23:17
* @Created by CodingGorit
* @Version 1.0
*/
@Component
public class HtmlParseUtil {

    // public static void main(String[] args) throws Exception {
    //     new HtmlParseUtil().parseJD("英语").forEach(System.out::println);
    // }

    public List<Content> parseJD(String keyword) throws Exception {
        // Request URL; this needs network access, and data loaded via AJAX cannot be fetched this way
        String url = "https://search.jd.com/Search?keyword=wd&enc=utf-8";
        // Fetch and parse the page (returns a jsoup Document)
        Document document = Jsoup.parse(new URL(url.replace("wd", keyword)), 30000);
        // The element that contains the product list
        Element element = document.getElementById("J_goodsList");
        // All li elements, one per product
        Elements elements = element.getElementsByTag("li");
        // Extract the fields of each product
        List<Content> goodsList = new ArrayList<>();
        for (Element e : elements) {
            String img = e.getElementsByTag("img").eq(0).attr("data-lazy-img");
            String price = e.getElementsByClass("p-price").eq(0).text();
            String title = e.getElementsByClass("p-name").eq(0).text();
            goodsList.add(new Content(title, img, price));
            // System.out.println(img);
            // System.out.println(price);
            // System.out.println(title);
        }
        return goodsList;
    }
}
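The crawler builds its request URL by substituting the keyword into a template. A keyword that contains spaces or non-ASCII characters (such as 英语) is safer to percent-encode first. The following is a minimal sketch of that step, assuming Java 10 or later for `URLEncoder.encode(String, Charset)`; the class name `JdSearchUrl` and the method `build` are hypothetical and not part of the original project.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds the JD search URL for a keyword. */
final class JdSearchUrl {

    // Same URL as in parseJD, with a format placeholder instead of the "wd" marker
    private static final String TEMPLATE = "https://search.jd.com/Search?keyword=%s&enc=utf-8";

    private JdSearchUrl() {
    }

    /** Percent-encodes the keyword as UTF-8 and inserts it into the template. */
    static String build(String keyword) {
        // URLEncoder uses form encoding, so a space becomes '+'
        String encoded = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        return String.format(TEMPLATE, encoded);
    }

    public static void main(String[] args) {
        System.out.println(build("java"));
        System.out.println(build("英语"));
    }
}
```

The result of `JdSearchUrl.build(keyword)` could then be passed to `new URL(...)` in place of `url.replace("wd", keyword)`.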
Service methods
package cn.gorit.service;
import cn.gorit.pojo.Content;
import cn.gorit.util.HtmlParseUtil;
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.TermQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightField;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
/**
* @Classname ContentService
* @Description TODO
* @Date 2020/10/22 18:44
* @Created by CodingGorit
* @Version 1.0
*/
@Service
public class ContentService {

    @Autowired
    private RestHighLevelClient restHighLevelClient;

    // Cannot be run directly: restHighLevelClient is only injected inside the Spring container
    public static void main(String[] args) throws Exception {
        new ContentService().parseContent("java");
    }

    // 1. Parse the crawled data and store it in an ES index
    public Boolean parseContent(String keywords) throws Exception {
        // Fetch the list of results for the keyword
        List<Content> contents = new HtmlParseUtil().parseJD(keywords);
        // Put the results into ES with one bulk request
        BulkRequest bulkRequest = new BulkRequest();
        bulkRequest.timeout("2m");
        for (int i = 0; i < contents.size(); ++i) {
            bulkRequest.add(
                    new IndexRequest("jd_goods")
                            .source(JSON.toJSONString(contents.get(i)), XContentType.JSON));
        }
        BulkResponse bulkResponse = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
        return !bulkResponse.hasFailures();
    }

    // 2. Search the stored data, with the matched title highlighted
    public List<Map<String, Object>> searchPagehighLight(String keyword, int pageNo, int pageSize) throws IOException {
        if (pageNo <= 1)
            pageNo = 1;
        // Build the conditional search
        SearchRequest searchRequest = new SearchRequest("jd_goods");
        SearchSourceBuilder builder = new SearchSourceBuilder();
        // from is a zero-based document offset, not a page number
        builder.from((pageNo - 1) * pageSize);
        builder.size(pageSize);
        // Exact (term-level) match on the title field
        TermQueryBuilder termQueryBuilder = QueryBuilders.termQuery("title", keyword);
        builder.query(termQueryBuilder);
        builder.timeout(new TimeValue(60, TimeUnit.SECONDS));
        // Highlighting
        HighlightBuilder highlightBuilder = new HighlightBuilder();
        highlightBuilder.field("title");
        highlightBuilder.requireFieldMatch(false);
        highlightBuilder.preTags("<span style='color:#FF0000;'>");
        highlightBuilder.postTags("</span>");
        builder.highlighter(highlightBuilder);
        // Run the search
        searchRequest.source(builder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        // Parse the hits
        ArrayList<Map<String, Object>> list = new ArrayList<>();
        for (SearchHit hit : searchResponse.getHits().getHits()) {
            // Highlighted fields of this hit
            Map<String, HighlightField> highlightFields = hit.getHighlightFields();
            HighlightField title = highlightFields.get("title");
            Map<String, Object> sourceAsMap = hit.getSourceAsMap(); // the original document
            // Replace the original title with the highlighted one
            if (title != null) {
                Text[] fragments = title.fragments();
                StringBuilder nTitle = new StringBuilder();
                for (Text text : fragments) {
                    nTitle.append(text);
                }
                sourceAsMap.put("title", nTitle.toString());
            }
            list.add(sourceAsMap); // title holds the highlighted text when a highlight exists
        }
        return list;
    }

    // 3. Search the stored data without highlighting
    public List<Map<String, Object>> searchPage(String keyword, int pageNo, int pageSize) throws IOException {
        if (pageNo <= 1)
            pageNo = 1;
        // Build the conditional search
        SearchRequest searchRequest = new SearchRequest("jd_goods");
        SearchSourceBuilder builder = new SearchSourceBuilder();
        // from is a zero-based document offset, not a page number
        builder.from((pageNo - 1) * pageSize);
        builder.size(pageSize);
        // Exact (term-level) match on the title field
        TermQueryBuilder termQueryBuilder = QueryBuilders.termQuery("title", keyword);
        builder.query(termQueryBuilder);
        builder.timeout(new TimeValue(60, TimeUnit.SECONDS));
        // Run the search
        searchRequest.source(builder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        // Parse the hits
        ArrayList<Map<String, Object>> list = new ArrayList<>();
        for (SearchHit hit : searchResponse.getHits().getHits()) {
            list.add(hit.getSourceAsMap()); // the stored document as a map
        }
        return list;
    }
}
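Both search methods take a 1-based page number, while the `from` setting of an Elasticsearch search is a zero-based document offset, so the page number has to be converted before it is passed to `builder.from(...)`. A minimal sketch of that conversion follows; the class name `PageOffset` and its method are hypothetical helpers, not part of the original project.

```java
/** Hypothetical helper: converts a 1-based page number into a zero-based offset. */
final class PageOffset {

    private PageOffset() {
    }

    /**
     * Returns the offset of the first document on the given page.
     * Page numbers below 1 are treated as page 1, as in the service methods.
     */
    static int from(int pageNo, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int page = Math.max(pageNo, 1);
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(page - 1, pageSize);
    }

    public static void main(String[] args) {
        System.out.println(from(1, 10));
        System.out.println(from(3, 10));
    }
}
```

With this helper the service would call `builder.from(PageOffset.from(pageNo, pageSize))`, so page 1 starts at the first hit instead of skipping it.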
Controller
package cn.gorit.controller;
import cn.gorit.pojo.Content;
import cn.gorit.service.ContentService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.bind.annotation.RestControllerAdvice;
import java.io.IOException;
import java.util.List;
import java.util.Map;
/**
* @Classname ContentController
* @Description TODO
* @Date 2020/10/22 18:45
* @Created by CodingGorit
* @Version 1.0
*/
@RestController
public class ContentController {

    @Autowired
    private ContentService service;

    /**
     * Crawl the results for a keyword and add them to ES
     * @param keyword
     * @return
     * @throws Exception
     */
    @GetMapping("/parse/{keyword}")
    public Boolean parse(@PathVariable("keyword") String keyword) throws Exception {
        return service.parseContent(keyword);
    }

    /**
     * Query the data stored in ES
     * @param keyword
     * @param pageNo
     * @param pageSize
     * @return
     * @throws IOException
     */
    @GetMapping("/search/{keyword}/{pageNo}/{pageSize}")
    public List<Map<String, Object>> search(@PathVariable("keyword") String keyword, @PathVariable("pageNo") int pageNo, @PathVariable("pageSize") int pageSize) throws IOException {
        if (pageNo == 0) {
            pageNo = 1;
        }
        return service.searchPage(keyword, pageNo, pageSize);
    }
}
Front-end and back-end separation
Testing with Postman
Search highlighting
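The highlighting step comes down to replacing the stored title with the concatenated highlight fragments whenever ES returns any. The sketch below shows that merge in plain Java with no ES dependency: the fragments are modeled as a `List<String>` rather than the client's `Text[]`, and the class name `HighlightMerge` is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that overlays highlight fragments on a source document. */
final class HighlightMerge {

    private HighlightMerge() {
    }

    /**
     * Returns a copy of the source map in which the given field is replaced by the
     * joined fragments. With no fragments, the copy keeps the original value.
     */
    static Map<String, Object> apply(Map<String, Object> source, String field, List<String> fragments) {
        Map<String, Object> merged = new HashMap<>(source);
        if (fragments != null && !fragments.isEmpty()) {
            merged.put(field, String.join("", fragments));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "Java guide");
        System.out.println(apply(doc, "title", List.of("<span>Java</span>", " guide")));
    }
}
```

Copying the map keeps the original document untouched, which is a design choice of this sketch; the service method updates the hit's source map in place.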
> One project, usable from multiple clients
10. Summary
- Basic use of ElasticSearch
- Integrating ES with SpringBoot
- Hands-on search project
> Personal open-source project (Coding-With-Java); stars are welcome