Finding duplicate documents in Elasticsearch, Part 1


This is a story every developer has come across: you indexed the same documents twice, or even 20 times....
It may have happened during testing or manual work, or it may even be a defect, but the fact is you have duplicates.

At large scale this usually doesn't matter much, I know, but if you still want to identify those documents, below is a simple query that does exactly that.

Let's assume we have a field called "title" that is not_analyzed (important, because we want to match on the ENTIRE field value and not on individual terms; a minimal mapping sketch follows the query below).
We build a query that does two things:

  1. A terms aggregation with min_doc_count set to 2 (i.e. values that occur in at least two documents, in other words duplicates)
  2. A top_hits sub-aggregation, so that we get the actual duplicate documents back (an empty top_hits block uses its defaults and returns up to three hits per bucket)

{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "title",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}
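
For reference, this is roughly the mapping the query assumes. It is only a minimal sketch: the index name "articles" and type name "article" are made up, and the string/not_analyzed syntax matches the Elasticsearch 1.x/2.x era of this post; on 5.x and later you would map the field as keyword instead.

PUT /articles
{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}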

The example above works for a single field. In the next post I will show another example for multiple fields, since in the real world, if only one field had to be unique, you would simply use that field as the document ID and do an upsert.
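
As a quick illustration of that last point (a hedged sketch, not part of the query above, and again using the made-up "articles"/"article" names): if the title were the only thing that had to be unique, you could use its value as the document _id, so that re-indexing the same title updates the existing document instead of creating a duplicate. With the update API and doc_as_upsert it looks like this:

POST /articles/article/my-unique-title/_update
{
  "doc": {
    "title": "my-unique-title"
  },
  "doc_as_upsert": true
}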

Hope this helps you.
