Update documentation (#9)

deric committed Sep 29, 2021 (commit 86bd353, 1 parent 262af36)
[![](https://images.microbadger.com/badges/version/deric/es-dedupe.svg)](https://microbadger.com/images/deric/es-dedupe)
[![](https://images.microbadger.com/badges/image/deric/es-dedupe.svg)](https://microbadger.com/images/deric/es-dedupe)

A tool for removing duplicated documents that are grouped by some unique field (e.g. `--field Uuid`). The removal process consists of two phases:

1. An aggregate query finds documents that share the same `field` value and occur at least twice. One copy of each such document is kept in ES; the others are deleted via the Bulk API (almost all, usually - there's always some catch). We wait for an index update after each `DELETE` operation. Processed documents are logged to `/tmp/es_dedupe.log`.
2. Unfortunately, aggregate queries are not necessarily exact. Based on the `/tmp/es_dedupe.log` logfile, we query for each `field` value and DELETE document copies on other shards. Depending on the number of nodes and shards in the cluster, there might still be documents that the aggregate query didn't return. To disable this 2nd step, use the `--no-check` flag.
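The grouping logic behind phase 1 can be sketched in Python (a minimal in-memory simulation; the `docs` list and `_id` keys are illustrative stand-ins for real ES documents and aggregation results, not the tool's actual implementation):

```python
from collections import defaultdict

def find_duplicates(docs, field):
    """Phase 1: group documents by `field`, keep groups with 2+ occurrences."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc["_id"])
    return {value: ids for value, ids in groups.items() if len(ids) >= 2}

def ids_to_delete(duplicates):
    """Keep one copy per group; the rest would be deleted via the Bulk API."""
    to_delete = []
    for ids in duplicates.values():
        to_delete.extend(ids[1:])  # the first copy survives
    return to_delete

docs = [
    {"_id": "1", "Uuid": "a"},
    {"_id": "2", "Uuid": "a"},
    {"_id": "3", "Uuid": "b"},
]
dupes = find_duplicates(docs, "Uuid")
print(ids_to_delete(dupes))  # → ['2']
```

Phase 2 then re-checks each logged `field` value with an exact query, since per-shard aggregation statistics can miss copies.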

## Docker

Running from Docker:
```
docker run -it -e ES=localhost -e INDEX=my-index -e FIELD=id deric/es-dedupe:latest
```
You can either override the Docker command or use ENV variables to pass arguments.

## Usage

Use `-h/--help` to see supported options:
```
docker run --rm deric/es-dedupe:latest esdedupe --help
```

Remove duplicates from an index called `exact-index-name`, grouping documents by the `Uuid` field:
```
docker run --rm deric/es-dedupe:latest esdedupe -H localhost -P 9200 -i exact-index-name -f Uuid > es_dedupe.log 2>&1
```

Process all indices known to the ES instance on localhost:9200 that look like `esindexprefix-*`, excluding indices matching `^excludedindex.*`, grouping documents by the `fingerprint` field:
```
docker run --rm deric/es-dedupe:latest esdedupe -H localhost -P 9200 --all --prefix 'esindexprefix' --prefixseparator '-' --indexexclude '^excludedindex.*' -f fingerprint > es_dedupe.log 2>&1
```

* `-a` process all indexes known to the ES instance that match the given prefix and prefix separator
* `-b` batch size - critical for performance; ES queries might take several minutes, depending on the size of your indexes
* `-f` name of the field that should be unique
* `-h` display help
* `-m` maximum number of duplicated documents with the same unique field value
* `-t` document type in ES
* `--sleep 60` time between aggregation requests (gives ES time to run GC on the heap); 15 seconds seems to be enough to avoid triggering ES flood protection

WARNING: Running huge bulk operations on an ES cluster might affect the performance of your cluster or even crash some nodes if the heap
is not large enough. Increase the `-b` and `-m` parameters with caution! ES returns at most `b * m` documents, so eventually you might hit
the maximum POST request size with bulk requests.
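The `b * m` bound can be made concrete with some back-of-the-envelope arithmetic (the bytes-per-action figure below is a hypothetical average, not a measured value):

```python
def max_docs_per_round(batch_size, max_dupes):
    # ES returns at most `b * m` documents per aggregation round
    return batch_size * max_dupes

def approx_bulk_payload_bytes(n_docs, bytes_per_action=120):
    # Each bulk delete action is one JSON line; ~120 bytes is an assumed
    # average that depends on your index, type, and id lengths.
    return n_docs * bytes_per_action

n = max_docs_per_round(10_000, 10)   # e.g. -b 10000 -m 10
print(n)                             # → 100000
print(approx_bulk_payload_bytes(n))  # → 12000000, i.e. ~12 MB per bulk request
```

A payload that large is well within reach of default HTTP request-size limits, which is why both parameters should grow slowly.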

A log file containing documents with unique fields is written to `/tmp/es_dedupe.log`.

By design, ES aggregate queries are not necessarily precise. Depending on your cluster setup, some documents won't be deleted due to
inaccurate shard statistics.

`--check_log` queries for the documents found by the aggregate query and checks whether they were actually deleted:
```
docker run --rm deric/es-dedupe:latest esdedupe --check_log /tmp/es_dedupe.log --noop
```
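The check pass has to recover the index and document id from each logged bulk action. A minimal parser, assuming entries look like the bulk delete line shown in the sample output below:

```python
import json

def parse_bulk_delete(line):
    """Extract (_index, _id) from a bulk delete action line."""
    action = json.loads(line)["delete"]
    return action["_index"], action["_id"]

line = '{"delete":{"_index":"nginx_access_logs-2017.03.17","_type":"nginx.access","_id":"AVrdoYEJy1wo8jcgI7t5"}}'
print(parse_bulk_delete(line))
# → ('nginx_access_logs-2017.03.17', 'AVrdoYEJy1wo8jcgI7t5')
```

Each recovered id can then be looked up with an exact query to confirm the deletion happened on every shard.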

```
== Starting ES deduplicator....
PRETENDING to delete:
{"delete":{"_index":"nginx_access_logs-2017.03.17","_type":"nginx.access","_id":"AVrdoYEJy1wo8jcgI7t5"}}

== Total. OK: 4 (80.00%) out of 5. Fixable: 1. Missing: 0
Queried for 5 documents, retrieved status of 5 (100.00%).
```

A more advanced example, with documents containing timestamps:

```bash
esdedupe -H localhost -f request_id -i nginx_access_logs-2021.01.29 -b 10000 --timestamp Timestamp --since "2021-01-29T15:30:00.000Z" --until "2021-01-29T16:30:00.000Z" --flush 1500 --request_timeout 180
2021-02-01T19:58:25 [139754520647488] INFO esdedupe elastic: es01, host: localhost, version: 7.6.0
2021-02-01T19:58:25 [139754520647488] INFO esdedupe Unique fields: ['request_id']
2021-02-01T19:58:25 [139754520647488] INFO esdedupe Building documents mapping on index: nginx_access_logs-2021.01.29, batch size: 10000
2021-02-01T19:59:16 [139754520647488] INFO esdedupe Scanned 987,892 unique documents
2021-02-01T19:59:16 [139754520647488] INFO esdedupe Memory usage: 414.0MB
2021-02-01T20:00:03 [139754520647488] INFO esdedupe Scanned 1,950,957 unique documents
2021-02-01T20:00:03 [139754520647488] INFO esdedupe Memory usage: 695.0MB
2021-02-01T20:00:46 [139754520647488] INFO esdedupe Scanned 2,861,671 unique documents
2021-02-01T20:00:46 [139754520647488] INFO esdedupe Memory usage: 1007.3MB
2021-02-01T20:01:37 [139754520647488] INFO esdedupe Scanned 3,579,286 unique documents
2021-02-01T20:01:37 [139754520647488] INFO esdedupe Memory usage: 1.2GB
2021-02-01T20:02:16 [139754520647488] INFO esdedupe Found 810,993 duplicates out of 4,833,500 docs, unique documents: 4,022,507 (16.8% duplicates)
100%█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 810001/810993 [7:39:44<00:26, 37.16docs/s]
2021-02-02T03:42:01 [139754520647488] INFO esdedupe Deleted 1,621,986/810,993 documents
2021-02-02T03:42:01 [139754520647488] INFO esdedupe Successfully completed duplicates removal. Took: 7:43:36.313482
```

## Performance

Most of the time is spent on ES queries, so choose the `--batch` and `--max_dupes` sizes wisely! Between bulk requests the script sleeps for `--sleep {seconds}`.
WARNING: Running huge bulk operations on an Elastic cluster might affect the performance of your cluster or even crash some nodes if the heap is not large enough.

A sliding window (`-w/--window`) can be used to prevent running out of memory on larger indexes (provided you have a timestamp field):

```bash
$ esdedupe -H localhost -f request_id -i nginx_access_logs-2021.02.01 -b 10000 --timestamp Timestamp --since 2021-02-01T00:00:00 --until 2021-02-01T10:30:00 --flush 2500 --request_timeout 180 -w 10m --es-level WARN
2021-02-07T01:27:07 [140045012879168] INFO esdedupe Found 1,544 duplicates out of 162,805 docs, unique documents: 161,261 (0.9% duplicates)
0%| | 1/1544 [00:17<7:25:23, 17.32s/docs]2021-02-07T01:27:25 [140045012879168] INFO esdedupe Deleted 3,088 documents (including shard replicas)
2021-02-07T01:27:25 [140045012879168] INFO esdedupe Using window 10m, from: 2021-02-01T09:30:00.000Z until: 2021-02-01T09:40:00.000Z
2021-02-07T01:27:25 [140045012879168] INFO esdedupe Building documents mapping on index: nginx_access_logs-2021.02.01, batch size: 10000
100%|██████████| 1544/1544 [00:18<00:00, 83.11docs/s]
2021-02-07T01:27:33 [140045012879168] INFO esdedupe Found 1,338 duplicates out of 162,882 docs, unique documents: 161,544 (0.8% duplicates)
0%| | 1/1338 [00:19<7:23:17, 19.89s/docs]2021-02-07T01:27:53 [140045012879168] INFO esdedupe Deleted 2,676 documents (including shard replicas)
2021-02-07T01:27:53 [140045012879168] INFO esdedupe Using window 10m, from: 2021-02-01T09:40:00.000Z until: 2021-02-01T09:50:00.000Z
2021-02-07T01:27:53 [140045012879168] INFO esdedupe Building documents mapping on index: nginx_access_logs-2021.02.01, batch size: 10000
100%|██████████| 1338/1338 [00:20<00:00, 64.36docs/s]
2021-02-07T01:28:02 [140045012879168] INFO esdedupe Found 1,321 duplicates out of 165,664 docs, unique documents: 164,343 (0.8% duplicates)
0%| | 1/1321 [00:13<4:56:58, 13.50s/docs]2021-02-07T01:28:15 [140045012879168] INFO esdedupe Deleted 2,642 documents (including shard replicas)
2021-02-07T01:28:15 [140045012879168] INFO esdedupe Using window 10m, from: 2021-02-01T09:50:00.000Z until: 2021-02-01T10:00:00.000Z
2021-02-07T01:28:15 [140045012879168] INFO esdedupe Building documents mapping on index: nginx_access_logs-2021.02.01, batch size: 10000
100%|██████████| 1321/1321 [00:14<00:00, 88.39docs/s]
2021-02-07T01:28:25 [140045012879168] INFO esdedupe Found 1,291 duplicates out of 168,842 docs, unique documents: 167,551 (0.8% duplicates)
0%| | 1/1291 [00:12<4:20:59, 12.14s/docs]2021-02-07T01:28:37 [140045012879168] INFO esdedupe Deleted 2,582 documents (including shard replicas)
2021-02-07T01:28:37 [140045012879168] INFO esdedupe Using window 10m, from: 2021-02-01T10:00:00.000Z until: 2021-02-01T10:10:00.000Z
2021-02-07T01:28:37 [140045012879168] INFO esdedupe Building documents mapping on index: nginx_access_logs-2021.02.01, batch size: 10000
100%|██████████| 1291/1291 [00:15<00:00, 82.91docs/s]
2021-02-07T01:28:48 [140045012879168] INFO esdedupe Found 1,371 duplicates out of 173,650 docs, unique documents: 172,279 (0.8% duplicates)
0%| | 1/1371 [00:18<7:07:57, 18.74s/docs]2021-02-07T01:29:07 [140045012879168] INFO esdedupe Deleted 2,742 documents (including shard replicas)
2021-02-07T01:29:07 [140045012879168] INFO esdedupe Using window 10m, from: 2021-02-01T10:10:00.000Z until: 2021-02-01T10:20:00.000Z
2021-02-07T01:29:07 [140045012879168] INFO esdedupe Building documents mapping on index: nginx_access_logs-2021.02.01, batch size: 10000
100%|██████████| 1371/1371 [00:19<00:00, 68.59docs/s]
2021-02-07T01:29:16 [140045012879168] INFO esdedupe Found 1,340 duplicates out of 183,592 docs, unique documents: 182,252 (0.7% duplicates)
0%| | 1/1340 [00:21<8:00:21, 21.52s/docs]2021-02-07T01:29:38 [140045012879168] INFO esdedupe Deleted 2,680 documents (including shard replicas)
2021-02-07T01:29:38 [140045012879168] INFO esdedupe Altogether 14115806 documents were removed (including doc replicas)
2021-02-07T01:29:38 [140045012879168] INFO esdedupe Total time: 1 day, 10:15:43.528495

```

Delete queries are sent via the `_bulk` API. Processing a batch with several thousand documents takes several seconds:
```
== Starting ES deduplicator....
Using index nginx_access_logs-2017.03.17
ES query took 0:00:27.552083, retrieved 0 unique docs
Using index nginx_access_logs-2017.03.18
ES query took 0:01:30.705209, retrieved 10000 unique docs
Deleted 333129 duplicates, in total 333129. Batch processed in 0:00:09.999008, running time 0:02:08.259759
ES query took 0:01:58.288188, retrieved 10000 unique docs
Deleted 276673 duplicates, in total 609802. Batch processed in 0:00:08.487847, running time 0:04:16.037037
```
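The sliding-window traversal seen in the log above - stepping from `--since` to `--until` in `-w`-sized slices - can be sketched as follows (an illustration only; the real `esdedupe` option parsing may differ):

```python
from datetime import datetime, timedelta

def time_windows(since, until, window):
    """Yield (start, end) pairs stepping `window` through [since, until)."""
    start = since
    while start < until:
        end = min(start + window, until)
        yield start, end
        start = end

windows = list(time_windows(
    datetime(2021, 2, 1, 9, 30),
    datetime(2021, 2, 1, 10, 0),
    timedelta(minutes=10),
))
for s, e in windows:
    print(s.isoformat(), "->", e.isoformat())
# → three 10-minute windows: 09:30-09:40, 09:40-09:50, 09:50-10:00
```

Because each window is deduplicated independently, memory usage is bounded by the number of documents in one window rather than in the whole index.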

## Requirements
On Windows you are pretty much on your own, but fear not, you can do the following:
pip install -r requirements.txt


## Testing

Test can be run from a Docker container. You can use supplied `docker-compose` file:
```bash
docker-compose up
```

Manually run tests:
```bash
pip3 install -r requirements-dev.txt
python3 -m pytest -v --capture=no tests/
```


## History

Originally written in bash, which performed terribly due to slow JSON processing with pipes and `jq`. Python with `ujson` seems to be a better fit for this task.
