
An In-Depth Look at Aggregations in Elasticsearch

Tags:
Big Data

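Anyone who has written aggregation DSL for Elasticsearch knows how verbose it is: even a simple business requirement can produce remarkably complex JSON. Aggregations nest, so the output of one aggregation can be the input of another, and they also support pipelines, which can reference parent or sibling aggregations. All of this makes the structure hard to follow. This article builds an Elasticsearch aggregation DSL query step by step from a practical example, to make ES aggregations easier to understand.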

Suppose the index test-order stores users' order information, where name is the user's name and price is the order price.
GET test-order/_search?filter_path=hits.hits._source
The data looks like this:

{
  "hits": {
    "hits": [
      {
        "_source": {
          "name": "Jack",
          "price": 80
        }
      },
      {
        "_source": {
          "name": "Ross",
          "price": 70
        }
      },
      {
        "_source": {
          "name": "Susan",
          "price": 50
        }
      },
      {
        "_source": {
          "name": "Ross",
          "price": 40
        }
      },
      {
        "_source": {
          "name": "Tom",
          "price": 65
        }
      },
      {
        "_source": {
          "name": "Tom",
          "price": 85
        }
      }
    ]
  }
}

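Now consider the following requirement: users whose total spending exceeds 100 are VIPs, and we want to count the VIPs in the system. In a traditional relational database this is straightforward: group by name and compute sum(price), filter the VIPs with a having clause, then count the rows of the resulting temporary table to get the number of VIPs. The SQL is as follows: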

select count(*) from (
  select sum(price), name from "test-order"
  group by name
  having sum(price) > 100 
) as VIP
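
As a quick sanity check, the same three steps (group, sum, filter, then count) can be sketched in plain Python over the sample orders. This is only an illustration of the logic, not how Elasticsearch executes it; the helper name count_vips and its threshold parameter are made up for this example.

```python
from collections import defaultdict

# Sample orders copied from the test-order index shown above.
ORDERS = [
    {"name": "Jack", "price": 80},
    {"name": "Ross", "price": 70},
    {"name": "Susan", "price": 50},
    {"name": "Ross", "price": 40},
    {"name": "Tom", "price": 65},
    {"name": "Tom", "price": 85},
]


def count_vips(orders, threshold=100):
    """Hypothetical helper: group by name, sum price, keep sums > threshold, count."""
    totals = defaultdict(int)
    for order in orders:
        # group by name + sum(price)
        totals[order["name"]] += order["price"]
    # having sum(price) > threshold
    vips = {name: total for name, total in totals.items() if total > threshold}
    # count(*) over the filtered groups
    return len(vips), vips


vip_count, vips = count_vips(ORDERS)
print(vip_count, vips)
```

Each of the following sections maps one of these steps onto an aggregation.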

In ES, we build the DSL in the same order.

group

First, group by name, using a Terms Aggregation to play the role of group by.

GET test-order/_search
{
  "size": 0,
  "aggs": {
    "userNames": {
      "terms": {
        "field": "name"
      }
    }
  }
}
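
To make the shape of the result concrete, here is a rough Python sketch of what a terms aggregation produces for the sample names: one bucket per distinct value with its document count. The function terms_buckets is hypothetical, and the ordering (larger doc_count first, ties by key) is an assumption chosen to match the bucket order in the responses shown in this article.

```python
from collections import Counter

# The name field of the six sample documents, in index order.
NAMES = ["Jack", "Ross", "Susan", "Ross", "Tom", "Tom"]


def terms_buckets(values):
    """Hypothetical stand-in for a terms aggregation on a single field."""
    counts = Counter(values)
    # Assumed ordering: doc_count descending, then key ascending.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


for bucket in terms_buckets(NAMES):
    print(bucket)
```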

sum

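Next, take the buckets produced by grouping on name as input and sum the price field. As shown below, a sum aggregation is nested inside the terms aggregation.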

GET test-order/_search
{
  "size": 0,
  "aggs": {
    "userNames": {
      "terms": {
        "field": "name"
      },
      "aggs": {
        "paymentSum": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}

The paymentSum values in the response show that sum(price) group by name is now complete.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "userNames": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Ross",
          "doc_count": 2,
          "paymentSum": {
            "value": 110
          }
        },
        {
          "key": "Tom",
          "doc_count": 2,
          "paymentSum": {
            "value": 150
          }
        },
        {
          "key": "Jack",
          "doc_count": 1,
          "paymentSum": {
            "value": 80
          }
        },
        {
          "key": "Susan",
          "doc_count": 1,
          "paymentSum": {
            "value": 50
          }
        }
      ]
    }
  }
}
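
On the client side, the nested response can be flattened into a simple name-to-total mapping. The sketch below works on a trimmed copy of the response above; totals_by_user is a hypothetical helper, and its default aggregation names simply mirror the ones used in this article.

```python
# Trimmed copy of the response above: only the aggregations part is kept.
RESPONSE = {
    "aggregations": {
        "userNames": {
            "buckets": [
                {"key": "Ross", "doc_count": 2, "paymentSum": {"value": 110}},
                {"key": "Tom", "doc_count": 2, "paymentSum": {"value": 150}},
                {"key": "Jack", "doc_count": 1, "paymentSum": {"value": 80}},
                {"key": "Susan", "doc_count": 1, "paymentSum": {"value": 50}},
            ]
        }
    }
}


def totals_by_user(response, terms_name="userNames", metric_name="paymentSum"):
    """Hypothetical helper: map each bucket key to its nested metric value."""
    buckets = response["aggregations"][terms_name]["buckets"]
    return {bucket["key"]: bucket[metric_name]["value"] for bucket in buckets}


print(totals_by_user(RESPONSE))
```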

having

Next, use bucket_selector to implement the having part of the SQL.

GET test-order/_search
{
  "size": 0,
  "aggs": {
    "userNames": {
      "terms": {
        "field": "name"
      },
      "aggs": {
        "paymentSum": {
          "sum": {
            "field": "price"
          }
        },
        "sumFilter": {
          "bucket_selector": {
            "buckets_path": {
              "userPaymentSum": "paymentSum"
            },
            "script": "params.userPaymentSum > 100"
          }
        }
      }
    }
  }
}

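Here sumFilter is a bucket_selector, a parent-type pipeline aggregation that filters the buckets of its enclosing aggregation. It references paymentSum through buckets_path and filters on the sum. As the response shows, the VIPs have now been selected.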

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "userNames": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Ross",
          "doc_count": 2,
          "paymentSum": {
            "value": 110
          }
        },
        {
          "key": "Tom",
          "doc_count": 2,
          "paymentSum": {
            "value": 150
          }
        }
      ]
    }
  }
}
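
The way buckets_path feeds the script can be mimicked in a few lines of Python: each entry maps a script variable name to a metric inside the bucket, and the script is evaluated once per bucket. This is a simplified model for intuition only; the function bucket_selector below is hypothetical and takes a Python callable in place of the script.

```python
# Buckets as returned by the terms + sum step, before any filtering.
BUCKETS = [
    {"key": "Ross", "doc_count": 2, "paymentSum": {"value": 110}},
    {"key": "Tom", "doc_count": 2, "paymentSum": {"value": 150}},
    {"key": "Jack", "doc_count": 1, "paymentSum": {"value": 80}},
    {"key": "Susan", "doc_count": 1, "paymentSum": {"value": 50}},
]


def bucket_selector(buckets, buckets_path, predicate):
    """Hypothetical model: keep the buckets for which predicate(params) is true."""
    kept = []
    for bucket in buckets:
        # Resolve each script variable to the metric value it points at.
        params = {var: bucket[path]["value"] for var, path in buckets_path.items()}
        if predicate(params):
            kept.append(bucket)
    return kept


vip_buckets = bucket_selector(
    BUCKETS,
    {"userPaymentSum": "paymentSum"},
    lambda params: params["userPaymentSum"] > 100,
)
print([bucket["key"] for bucket in vip_buckets])
```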

Simulating the subquery

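With the VIPs selected, the remaining step is to count them. Here stats_bucket is used for the count. The vip_count below is a stats_bucket, a sibling-type pipeline aggregation that computes statistics over the output of another aggregation.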

GET test-order/_search
{
  "size": 0,
  "aggs": {
    "userNames": {
      "terms": {
        "field": "name"
      },
      "aggs": {
        "paymentSum": {
          "sum": {
            "field": "price"
          }
        },
        "sumFilter": {
          "bucket_selector": {
            "buckets_path": {
              "userPaymentSum": "paymentSum"
            },
            "script": "params.userPaymentSum > 100"
          }
        }
      }
    },
    "vip_count": {
      "stats_bucket": {
        "buckets_path": "userNames>paymentSum"
      }
    }
  }
}

In the final response, the count field of vip_count is the number of VIPs in the system.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "userNames": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Ross",
          "doc_count": 2,
          "paymentSum": {
            "value": 110
          }
        },
        {
          "key": "Tom",
          "doc_count": 2,
          "paymentSum": {
            "value": 150
          }
        }
      ]
    },
    "vip_count": {
      "count": 2,
      "min": 110,
      "max": 150,
      "avg": 130,
      "sum": 260
    }
  }
}
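
The numbers in vip_count can be reproduced by hand from the two remaining paymentSum values. The sketch below is a hypothetical stats_bucket function that computes the same five fields; how it handles an empty input is a choice made for this example, not a statement about Elasticsearch.

```python
def stats_bucket(values):
    """Hypothetical model: count, min, max, avg and sum over bucket metric values."""
    values = list(values)
    if not values:
        # Assumption for this sketch only: no buckets means no min/max/avg.
        return {"count": 0, "min": None, "max": None, "avg": None, "sum": 0}
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


# paymentSum values of the buckets that survived the bucket_selector.
vip_stats = stats_bucket([110, 150])
print(vip_stats)
```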

Conclusion

select count(*) from (
  select sum(price), name from "test-order"
  group by name
  having sum(price) > 100 
) as VIP

becomes

GET test-order/_search
{
  "size": 0,
  "aggs": {
    "userNames": {
      "terms": {
        "field": "name"
      },
      "aggs": {
        "paymentSum": {
          "sum": {
            "field": "price"
          }
        },
        "sumFilter": {
          "bucket_selector": {
            "buckets_path": {
              "userPaymentSum": "paymentSum"
            },
            "script": "params.userPaymentSum > 100"
          }
        }
      }
    },
    "vip_count": {
      "stats_bucket": {
        "buckets_path": "userNames>paymentSum"
      }
    }
  }
}

The line count has grown roughly sixfold. Such is the power of the ES DSL...
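
One way to keep the verbosity manageable is to generate the request body from a small function instead of writing the JSON by hand. The sketch below only builds and prints the body; it does not talk to a cluster. The function build_vip_query and its parameters are hypothetical, while the aggregation names are the ones used throughout this article.

```python
import json


def build_vip_query(group_field="name", sum_field="price", threshold=100):
    """Hypothetical helper: build the final aggregation request body as a dict."""
    return {
        "size": 0,
        "aggs": {
            "userNames": {
                # group by
                "terms": {"field": group_field},
                "aggs": {
                    # sum
                    "paymentSum": {"sum": {"field": sum_field}},
                    # having
                    "sumFilter": {
                        "bucket_selector": {
                            "buckets_path": {"userPaymentSum": "paymentSum"},
                            "script": f"params.userPaymentSum > {threshold}",
                        }
                    },
                },
            },
            # count over the filtered buckets
            "vip_count": {
                "stats_bucket": {"buckets_path": "userNames>paymentSum"}
            },
        },
    }


query = build_vip_query()
print(json.dumps(query, indent=2))
```

The printed JSON can be pasted after GET test-order/_search to run the same query as above.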
