Easy Elastic - Group by Having in Elasticsearch

Easy Elastic – Group by Having in Elasticsearch

January 24, 2024

An example Elasticsearch query and response demonstrate this, noting a non-zero “sum_other_doc_count”, indicating missing countries in the response. Re-running with different sizes is necessary for a complete result.

Conversely, composite aggregation offers an actionable indicator, “after_key”, guiding the subsequent value to begin scanning from for complete data.

termByCountry.json

				
					GET /trainstations/_search
{
    "size" : 0,
    "aggregations" : {
    "groupby" : {
              "terms" : {
                "size": 20, 
                "field" : "country.keyword"
      }    
    }
  }
}

				
					{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "groupby" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 1077,
      "buckets" : [
        {
          "key" : "DE",
          "doc_count" : 7855
        }, ...

Conversely, composite aggregation offers an actionable indicator, “after_key”, guiding the subsequent value to begin scanning from for complete data.

groupByGeneral.json

				
					GET /trainstations/_search
{
    "size" : 0,
    "aggregations" : {
    "groupby" : {
      "composite" : {
        "size" : 20,
        "sources" : [
          {
            "countries" : {
              "terms" : {
                "field" : "country.keyword",
                "order" : "asc"
              }
            }
          }
        ]
      }
    }
  }
}

				
					{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "groupby" : {
      "after_key" : {
        "countries" : "HU"
      },
      "buckets" : [
        {
          "key" : {
            "countries" : "AD"
          },
          "doc_count" : 10
        }, ...

groupByWithAfterKey.json

				
					GET /trainstations/_search
{
    "size" : 0,
    "aggregations" : {
    "groupby" : {
      "composite" : {
        "after" : {
        "countries" : "HU"
      },  ...

The reliability of term aggregation decreases when filtering values within it, such as adding a ‘having’ clause, which we’ll explore later.

Aggregating Metadata

Metadata is crucial in database management. In transactional databases, obtaining various statistics and metadata is standard. For optimization, sometimes segregating data into separate indexes is beneficial. In our train station example, indexing each station by its country provides a fresh perspective for counting.

				
					GET /trainstations*/_search
{
  "size":0,
    "aggs": {
        "count_by_type": {
                      "terms": {
                "field": "_index"
            }
        }
        
    }
}

				
					{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 44,
    "successful" : 44,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "count_by_type" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 4197,
      "buckets" : [
        {
          "key" : "trainstations",
          "doc_count" : 36558
        }, ...

Again, the limitation of term aggregation is evident here.

Having clause

Elasticsearch interprets a ‘having’ clause as nested aggregations. A query example illustrates filtering countries with at least 1000 non-major train stations. A composite query is used for accuracy and integrity, resembling pagination to ensure comprehensive results.

SqlGroupBbyHaving.sql

				
					GET /trainstations/_search
{
   "size":0,
   "aggregations":{
      "groupby":{
         "composite":{
            "size":100,
            "sources":[
               {
                  "countries":{
                     "terms":{
                        "field":"country.keyword",
                        "missing_bucket":true,
                        "order":"asc"
                     }
                  }
               }
            ]
         },
         "aggs":{
            "minor_stations":{
               "filter":{
                  "terms":{
                     "is_main_station.keyword":[
                        "FALSE"
                     ]
                  }
               },
               "aggs":{
                  "count":{
                     "value_count":{
                        "field":"name.keyword"
                     }
                  }
               }
            },
            "having":{
               "bucket_selector":{
                  "buckets_path":{
                     "count_path":"minor_stations>count"
                  },
                  "script":{
                     "source":"params.count_path>1000",
                     "lang":"painless"
                  },
                  "gap_policy":"skip"
               }
            }
         }
      }
   }
}

				
					{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "groupby" : {
      "after_key" : {
        "countries" : "UA"
      },
      "buckets" : [
        {
          "key" : {
            "countries" : "AT"
          },
          "doc_count" : 1462,
          "minor_stations" : {
            "doc_count" : 1456,
            "count" : {
              "value" : 1456
            }
          }
        },
        {
          "key" : {
            "countries" : "CH"
          },
          "doc_count" : 2258,
          "minor_stations" : {
            "doc_count" : 2052,
            "count" : {
              "value" : 2052
            }
          }
        },
        {
          "key" : {
            "countries" : "DE"
          },
          "doc_count" : 7855,
          "minor_stations" : {
            "doc_count" : 7743,
            "count" : {
              "value" : 7743
            }
          }
        },
        {
          "key" : {
            "countries" : "ES"
          },
          "doc_count" : 5025,
          "minor_stations" : {
            "doc_count" : 4996,
            "count" : {
              "value" : 4996
            }
          }
        },
        {
          "key" : {
            "countries" : "FR"
          },
          "doc_count" : 7396,
          "minor_stations" : {
            "doc_count" : 7149,
            "count" : {
              "value" : 7149
            }
          }
        },
        {
          "key" : {
            "countries" : "GB"
          },
          "doc_count" : 2791,
          "minor_stations" : {
            "doc_count" : 2681,
            "count" : {
              "value" : 2681
            }
          }
        },
        {
          "key" : {
            "countries" : "IT"
          },
          "doc_count" : 4309,
          "minor_stations" : {
            "doc_count" : 4259,
            "count" : {
              "value" : 4259
            }
          }
        }
      ]
    }
  }
}

runJoin.py

				
					select country, count(*)
from trainstations
where is_main_station is false
group by 1
having count(*)> 1000;

Using a composite query intentionally creates pagination, akin to scrolling through a long page. This method avoids guessing the number of countries, presenting all possible results on one page to ensure accurate and complete output.

Utilizing SQL Platforms for Elasticsearch Aggregation

Numerous SQL platforms enable writing SQL-based aggregations on Elasticsearch. Our platform stands out by offering high-efficiency operations without requiring full data loading into memory. This feature allows for processing queries on less powerful, cost-effective hardware, optimizing resource use and lowering operational costs.

For a free trial, contact us!