
Spike - Indexer performance on different Data Persistence Model designs #255

Closed
Tracked by #22887
AlexRuiz7 opened this issue Jun 5, 2024 · 15 comments
Labels: level/task (Task issue), request/operational (Operational requests), type/enhancement (Enhancement issue)

@AlexRuiz7
Member

AlexRuiz7 commented Jun 5, 2024

Description

As part of the new Data Persistence Model to be implemented across Wazuh, we need to carry out a performance analysis of the different designs to see how the indexer behaves with each of them.

The objective of this issue is to measure the performance of bulk requests for:

  • indexing
  • updates
  • deletions

on:

  • stateless stream indices, using rollover and an alias (see the sketch below). Stateless indices do not wait for completion (non-blocking requests)
  • stateful indices, without rollover or alias. Stateful indices do wait for completion (blocking requests)

given the following scenarios:

  1. Single bulk request.
    • 1x (big) bulk request with Stateless and Stateful data.
graph LR
    A[Server cluster] -->|Single bulk| B[Indexer cluster]
  2. Per-module bulk requests.
    • 1x (smaller) bulk request for Stateless data.
    • 3x (smaller) bulk request for Stateful data (state_1, state_2, state_3).
graph LR
    A[Server cluster] -->|Stateless bulk| B[Indexer cluster]
    A[Server cluster] -->|state_1 bulk| B[Indexer cluster]
    A[Server cluster] -->|state_2 bulk| B[Indexer cluster]
    A[Server cluster] -->|state_3 bulk| B[Indexer cluster]

The goal is to discover which design performs better on a well-configured indexer cluster.
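As a reference for the stateless design mentioned above, creating the write alias and rolling it over could look like the minimal sketch below. The index/alias names and rollover conditions are placeholders, not the final Wazuh layout:

# Create the first backing index behind a write alias (placeholder names)
curl -sku admin:"${OPENSEARCH_INITIAL_ADMIN_PASSWORD}" -X PUT "https://localhost:9200/wazuh-stateless-000001" \
  -H 'Content-Type: application/json' -d '
{
  "aliases": {
    "wazuh-stateless": { "is_write_index": true }
  }
}'

# Roll the alias over to a new backing index once a condition is met
curl -sku admin:"${OPENSEARCH_INITIAL_ADMIN_PASSWORD}" -X POST "https://localhost:9200/wazuh-stateless/_rollover" \
  -H 'Content-Type: application/json' -d '
{
  "conditions": { "max_age": "1d", "max_docs": 100000000 }
}'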

For the tests, we are considering mocking events for 5K agents, generating events of 1 KB maximum. The EPS for each of the indices is defined by the formula below:

n_agents = 5000
req_size = 1 KB
stateless = 1 EPS  * n_agents = 5000 EPS  (5 MB)
state_1  = 0.6 EPS * n_agents = 3000 EPS  (3 MB)
state_2  = 0.3 EPS * n_agents = 1500 EPS  (1.5 MB)
state_3  = 0.1 EPS * n_agents =  500 EPS  (0.5 MB)
                                          --------
Total / single bulk request                10 MB
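
To make the two bulk shapes concrete, the sketch below contrasts scenario 1 (one interleaved _bulk body) with scenario 2 (one _bulk body per index). Index names, document IDs and payloads are placeholders; only the structure matters:

# Scenario 1: a single _bulk body mixing stateless and stateful documents
curl -sku admin:"${OPENSEARCH_INITIAL_ADMIN_PASSWORD}" -X POST "https://localhost:9200/_bulk" \
  -H 'Content-Type: application/x-ndjson' --data-binary @- <<'EOF'
{ "index": { "_index": "wazuh-stateless" } }
{ "agent": { "id": "001" }, "message": "stateless event" }
{ "index": { "_index": "wazuh-state-1", "_id": "001" } }
{ "agent": { "id": "001" }, "state": "stateful document" }
EOF

# Scenario 2: one (smaller) _bulk body per target index, sent separately
curl -sku admin:"${OPENSEARCH_INITIAL_ADMIN_PASSWORD}" -X POST "https://localhost:9200/wazuh-state-1/_bulk" \
  -H 'Content-Type: application/x-ndjson' --data-binary @- <<'EOF'
{ "index": { "_id": "001" } }
{ "agent": { "id": "001" }, "state": "stateful document" }
EOF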

Functional requirements

  • Measure the performance of bulk requests for indexing, updates and deletions on both scenarios.

Implementation restrictions

Both test scenarios must run on:

  • 3x Indexer nodes, having 8 CPUs, 16 GB of RAM and SSD storage each.
  • 5,000 agents (mocked)
  • Bulk requests are performed every second.
  • The EPS distribution of requests per type is:
    • Stateless. 100 %
    • 3 stateful
      • state_1 10 %
      • state_2 30 %
      • state_3 60 %
  • Each request weighs 1 KB.
  • Indices refresh every 5 seconds (see the sketch below).
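
The 5-second refresh restriction maps to the index-level refresh_interval setting, which can also be applied through an index template. A minimal sketch with a placeholder index name:

curl -sku admin:"${OPENSEARCH_INITIAL_ADMIN_PASSWORD}" -X PUT "https://localhost:9200/wazuh-stateless/_settings" \
  -H 'Content-Type: application/json' -d '{ "index": { "refresh_interval": "5s" } }'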

Plan

  • Define measurement tooling (most likely OpenSearch Benchmark)
  • Define performance metrics.
  • Perform test on both scenarios (local).
  • Acquire the required infrastructure.
  • Perform tests on scenario A.
  • Perform tests on scenario B.
  • Compare results and draw conclusions.
@AlexRuiz7 AlexRuiz7 added the level/task, type/enhancement and request/operational labels Jun 5, 2024
@wazuhci wazuhci moved this to Triage in Release 5.0.0 Jun 5, 2024
@wazuhci wazuhci moved this from Triage to Backlog in Release 5.0.0 Jun 10, 2024
@wazuhci wazuhci moved this from Backlog to In progress in Release 5.0.0 Jun 12, 2024
@AlexRuiz7 AlexRuiz7 self-assigned this Jun 12, 2024
@AlexRuiz7
Member Author

AlexRuiz7 commented Jun 12, 2024

OpenSearch Benchmark

Currently reading the docs to understand how OSB works. Some notes:

  • Running OSB on Docker has some limitations.

  • Seems like OSB is able to spawn OpenSearch nodes on its own.

    from-sources: Builds and provisions OpenSearch, runs a benchmark, and then publishes the results.
    from-distribution: Downloads an OpenSearch distribution, provisions it, runs a benchmark, and then publishes the results.
    benchmark-only: The default pipeline. Assumes an already running OpenSearch instance, runs a benchmark on that instance, and then publishes the results.


I managed to run OSB locally using Pyenv.

Details

(.venv) @alex-GL66 ➜ opensearch-benchmark  python3 -m venv .venv; source .venv/bin/activate
(.venv) @alex-GL66 ➜ opensearch-benchmark  pip install opensearch-benchmark
(.venv) @alex-GL66 ➜ opensearch-benchmark  export JAVA17_HOME=/usr/lib/jvm/temurin-17-jdk-amd64 
(.venv) @alex-GL66 ➜ opensearch-benchmark  opensearch-benchmark execute-test --distribution-version=2.13.0 --workload percolator --test-mode

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: cf58479a-77c9-4694-8b88-bfee848cdfa6
[INFO] Preparing for test execution ...
[INFO] Downloading OpenSearch 2.13.0 (844.4 MB total size)                          [100%]
[INFO] Downloading workload data (191 bytes total size)                           [100.0%]
[INFO] Decompressing workload data from [/home/alex/wazuh/opensearch-benchmark/.benchmark/benchmarks/data/percolator/queries-2-1k.json.bz2] to [/home/alex/wazuh/opensearch-benchmark/.benchmark/benchmarks/data/percolator/queries-2-1k.json] ... [OK]
[INFO] Preparing file offset table for [/home/alex/wazuh/opensearch-benchmark/.benchmark/benchmarks/data/percolator/queries-2-1k.json] ... [OK]
[INFO] Executing test with workload [percolator], test_procedure [append-no-conflicts] and provision_config_instance ['defaults'] with version [2.13.0].

Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running check-cluster-health                                                   [100% done]
Running index                                                                  [100% done]
Running refresh-after-index                                                    [100% done]
Running force-merge                                                            [100% done]
Running refresh-after-force-merge                                              [100% done]
Running wait-until-merges-finish                                               [100% done]
Running percolator_with_content_president_bush                                 [100% done]
Running percolator_with_content_saddam_hussein                                 [100% done]
Running percolator_with_content_hurricane_katrina                              [100% done]
Running percolator_with_content_google                                         [100% done]
Running percolator_no_score_with_content_google                                [100% done]
Running percolator_with_highlighting                                           [100% done]
Running percolator_with_content_ignore_me                                      [100% done]
Running percolator_no_score_with_content_ignore_me                             [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |                                       Task |       Value |   Unit |
|---------------------------------------------------------------:|-------------------------------------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |                                            |   0.0122667 |    min |
|             Min cumulative indexing time across primary shards |                                            |           0 |    min |
|          Median cumulative indexing time across primary shards |                                            |  0.00209167 |    min |
|             Max cumulative indexing time across primary shards |                                            |       0.004 |    min |
|            Cumulative indexing throttle time of primary shards |                                            |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |                                            |           0 |    min |
| Median cumulative indexing throttle time across primary shards |                                            |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |                                            |           0 |    min |
|                        Cumulative merge time of primary shards |                                            |           0 |    min |
|                       Cumulative merge count of primary shards |                                            |           0 |        |
|                Min cumulative merge time across primary shards |                                            |           0 |    min |
|             Median cumulative merge time across primary shards |                                            |           0 |    min |
|                Max cumulative merge time across primary shards |                                            |           0 |    min |
|               Cumulative merge throttle time of primary shards |                                            |           0 |    min |
|       Min cumulative merge throttle time across primary shards |                                            |           0 |    min |
|    Median cumulative merge throttle time across primary shards |                                            |           0 |    min |
|       Max cumulative merge throttle time across primary shards |                                            |           0 |    min |
|                      Cumulative refresh time of primary shards |                                            |  0.00226667 |    min |
|                     Cumulative refresh count of primary shards |                                            |          30 |        |
|              Min cumulative refresh time across primary shards |                                            |           0 |    min |
|           Median cumulative refresh time across primary shards |                                            | 0.000358333 |    min |
|              Max cumulative refresh time across primary shards |                                            |      0.0007 |    min |
|                        Cumulative flush time of primary shards |                                            |           0 |    min |
|                       Cumulative flush count of primary shards |                                            |           0 |        |
|                Min cumulative flush time across primary shards |                                            |           0 |    min |
|             Median cumulative flush time across primary shards |                                            |           0 |    min |
|                Max cumulative flush time across primary shards |                                            |           0 |    min |
|                                        Total Young Gen GC time |                                            |           0 |      s |
|                                       Total Young Gen GC count |                                            |           0 |        |
|                                          Total Old Gen GC time |                                            |           0 |      s |
|                                         Total Old Gen GC count |                                            |           0 |        |
|                                                     Store size |                                            | 4.31528e-05 |     GB |
|                                                  Translog size |                                            | 3.07336e-07 |     GB |
|                                         Heap used for segments |                                            |           0 |     MB |
|                                       Heap used for doc values |                                            |           0 |     MB |
|                                            Heap used for terms |                                            |           0 |     MB |
|                                            Heap used for norms |                                            |           0 |     MB |
|                                           Heap used for points |                                            |           0 |     MB |
|                                    Heap used for stored fields |                                            |           0 |     MB |
|                                                  Segment count |                                            |          22 |        |
|                                                 Min Throughput |                                      index |     10299.1 | docs/s |
|                                                Mean Throughput |                                      index |     10299.1 | docs/s |
|                                              Median Throughput |                                      index |     10299.1 | docs/s |
|                                                 Max Throughput |                                      index |     10299.1 | docs/s |
|                                        50th percentile latency |                                      index |     81.7099 |     ms |
|                                       100th percentile latency |                                      index |     91.5731 |     ms |
|                                   50th percentile service time |                                      index |     81.7099 |     ms |
|                                  100th percentile service time |                                      index |     91.5731 |     ms |
|                                                     error rate |                                      index |           0 |      % |
|                                                 Min Throughput |                   wait-until-merges-finish |       72.58 |  ops/s |
|                                                Mean Throughput |                   wait-until-merges-finish |       72.58 |  ops/s |
|                                              Median Throughput |                   wait-until-merges-finish |       72.58 |  ops/s |
|                                                 Max Throughput |                   wait-until-merges-finish |       72.58 |  ops/s |
|                                       100th percentile latency |                   wait-until-merges-finish |     13.1389 |     ms |
|                                  100th percentile service time |                   wait-until-merges-finish |     13.1389 |     ms |
|                                                     error rate |                   wait-until-merges-finish |           0 |      % |
|                                                 Min Throughput |     percolator_with_content_president_bush |       32.24 |  ops/s |
|                                                Mean Throughput |     percolator_with_content_president_bush |       32.24 |  ops/s |
|                                              Median Throughput |     percolator_with_content_president_bush |       32.24 |  ops/s |
|                                                 Max Throughput |     percolator_with_content_president_bush |       32.24 |  ops/s |
|                                       100th percentile latency |     percolator_with_content_president_bush |     37.6739 |     ms |
|                                  100th percentile service time |     percolator_with_content_president_bush |     6.40732 |     ms |
|                                                     error rate |     percolator_with_content_president_bush |           0 |      % |
|                                                 Min Throughput |     percolator_with_content_saddam_hussein |      115.68 |  ops/s |
|                                                Mean Throughput |     percolator_with_content_saddam_hussein |      115.68 |  ops/s |
|                                              Median Throughput |     percolator_with_content_saddam_hussein |      115.68 |  ops/s |
|                                                 Max Throughput |     percolator_with_content_saddam_hussein |      115.68 |  ops/s |
|                                       100th percentile latency |     percolator_with_content_saddam_hussein |     14.9318 |     ms |
|                                  100th percentile service time |     percolator_with_content_saddam_hussein |     5.95973 |     ms |
|                                                     error rate |     percolator_with_content_saddam_hussein |           0 |      % |
|                                                 Min Throughput |  percolator_with_content_hurricane_katrina |       84.38 |  ops/s |
|                                                Mean Throughput |  percolator_with_content_hurricane_katrina |       84.38 |  ops/s |
|                                              Median Throughput |  percolator_with_content_hurricane_katrina |       84.38 |  ops/s |
|                                                 Max Throughput |  percolator_with_content_hurricane_katrina |       84.38 |  ops/s |
|                                       100th percentile latency |  percolator_with_content_hurricane_katrina |     18.1493 |     ms |
|                                  100th percentile service time |  percolator_with_content_hurricane_katrina |     5.96843 |     ms |
|                                                     error rate |  percolator_with_content_hurricane_katrina |           0 |      % |
|                                                 Min Throughput |             percolator_with_content_google |       47.06 |  ops/s |
|                                                Mean Throughput |             percolator_with_content_google |       47.06 |  ops/s |
|                                              Median Throughput |             percolator_with_content_google |       47.06 |  ops/s |
|                                                 Max Throughput |             percolator_with_content_google |       47.06 |  ops/s |
|                                       100th percentile latency |             percolator_with_content_google |     27.8973 |     ms |
|                                  100th percentile service time |             percolator_with_content_google |     6.37702 |     ms |
|                                                     error rate |             percolator_with_content_google |           0 |      % |
|                                                 Min Throughput |    percolator_no_score_with_content_google |      101.72 |  ops/s |
|                                                Mean Throughput |    percolator_no_score_with_content_google |      101.72 |  ops/s |
|                                              Median Throughput |    percolator_no_score_with_content_google |      101.72 |  ops/s |
|                                                 Max Throughput |    percolator_no_score_with_content_google |      101.72 |  ops/s |
|                                       100th percentile latency |    percolator_no_score_with_content_google |     17.8059 |     ms |
|                                  100th percentile service time |    percolator_no_score_with_content_google |     7.73091 |     ms |
|                                                     error rate |    percolator_no_score_with_content_google |           0 |      % |
|                                                 Min Throughput |               percolator_with_highlighting |        81.3 |  ops/s |
|                                                Mean Throughput |               percolator_with_highlighting |        81.3 |  ops/s |
|                                              Median Throughput |               percolator_with_highlighting |        81.3 |  ops/s |
|                                                 Max Throughput |               percolator_with_highlighting |        81.3 |  ops/s |
|                                       100th percentile latency |               percolator_with_highlighting |     20.5377 |     ms |
|                                  100th percentile service time |               percolator_with_highlighting |     7.81483 |     ms |
|                                                     error rate |               percolator_with_highlighting |           0 |      % |
|                                                 Min Throughput |          percolator_with_content_ignore_me |       17.47 |  ops/s |
|                                                Mean Throughput |          percolator_with_content_ignore_me |       17.47 |  ops/s |
|                                              Median Throughput |          percolator_with_content_ignore_me |       17.47 |  ops/s |
|                                                 Max Throughput |          percolator_with_content_ignore_me |       17.47 |  ops/s |
|                                       100th percentile latency |          percolator_with_content_ignore_me |     85.7778 |     ms |
|                                  100th percentile service time |          percolator_with_content_ignore_me |     28.0983 |     ms |
|                                                     error rate |          percolator_with_content_ignore_me |           0 |      % |
|                                                 Min Throughput | percolator_no_score_with_content_ignore_me |       54.39 |  ops/s |
|                                                Mean Throughput | percolator_no_score_with_content_ignore_me |       54.39 |  ops/s |
|                                              Median Throughput | percolator_no_score_with_content_ignore_me |       54.39 |  ops/s |
|                                                 Max Throughput | percolator_no_score_with_content_ignore_me |       54.39 |  ops/s |
|                                       100th percentile latency | percolator_no_score_with_content_ignore_me |      26.549 |     ms |
|                                  100th percentile service time | percolator_no_score_with_content_ignore_me |     7.92226 |     ms |
|                                                     error rate | percolator_no_score_with_content_ignore_me |           0 |      % |


--------------------------------
[INFO] SUCCESS (took 89 seconds)
--------------------------------

We'll work on creating a Vagrant environment with 3 OpenSearch nodes and OpenSearch Benchmark installed on each of them to perform the tests. See Running distributed loads.

@AlexRuiz7
Member Author

AlexRuiz7 commented Jun 13, 2024

Update

  1. We have generated a Vagrant environment that sets up a cluster with 3 nodes of OpenSearch v2.14.0 plus OpenSearch Benchmark 1.6.0. The nodes are configured to work together for distributed load generation and load balancing.
  2. The command in the OpenSearch documentation is not correct, but we have fixed it.
    opensearch-benchmark execute-test --pipeline=benchmark-only --workload=eventdata --load-worker-coordinator-hosts=node-2,node-3 --target-hosts=node-1 --kill-running-processes
  3. Even so, we have not managed to make it work, as it fails with the following error:
    2024-06-13 10:13:11,377 -not-actor-/PID:3267 osbenchmark.benchmark ERROR Cannot run subcommand [execute-test].
    
  4. We then realized this mode is useful for big clusters with massive loads that max out the CPU of the host running OSB (3+ nodes). This mode allows dedicating some nodes of the cluster to distributing the workload generation (not the processing). We fall back to regular mode. Fortunately, we can reuse the Vagrantfile.

@f-galland f-galland self-assigned this Jun 13, 2024
@AlexRuiz7
Member Author

AlexRuiz7 commented Jun 13, 2024

Running a workload

With this command, we can run the default http_logs workload. This workload mixes ingest, update and search queries.

Note

This operation is time consuming.

opensearch-benchmark execute-test --pipeline=benchmark-only --workload=http_logs --target-host=https://localhost:9200 --client-options=basic_auth_user:admin,basic_auth_password:"${OPENSEARCH_INITIAL_ADMIN_PASSWORD}",verify_certs:false

Creating a custom workload

There are 2 ways of creating custom workloads:

  • From an existing cluster with indexed data: OSB can automatically generate a custom workload using indexed data (at least 1k documents). The workload has to be configured manually once created.

    opensearch-benchmark create-workload \
    --workload="<WORKLOAD NAME>" \
    --target-hosts="<CLUSTER ENDPOINT>" \
    --client-options="basic_auth_user:'<USERNAME>',basic_auth_password:'<PASSWORD>'" \
    --indices="<INDEXES TO GENERATE WORKLOAD FROM>" \
    --output-path="<LOCAL DIRECTORY PATH TO STORE WORKLOAD>"
  • From scratch. The workload generation and its configuration are manual. We have tried to follow the example in the docs without success.

    opensearch-benchmark execute-test \
    --pipeline="benchmark-only" \
    --workload-path="./workload" \
    --target-host="https://localhost:9200/" \
    --client-options="basic_auth_user:'admin',basic_auth_password:'Bc9ZyWSBu19[BK#6MBgbJ98Tofv)Vsw',verify_certs:false"

@f-galland
Member

I tested creating a workload from an existing cluster, for which I used a test AIO deployment with real-world data.

I used this command:

opensearch-benchmark create-workload \
--workload="wazuh-test" \
--target-hosts="https://localhost:9200" \
--client-options="basic_auth_user:'admin',basic_auth_password:'admin',verify_certs:false" \
--indices="wazuh-alerts-4.x-2024.04.22" \
--output-path="./wazuh-workload"

I then ran the test:

opensearch-benchmark execute-test \
--pipeline="benchmark-only" \
--workload-path="./wazuh-workload/wazuh-test" \
--target-host="https://localhost:9200" \
--client-options="basic_auth_user:'admin',basic_auth_password:'admin',verify_certs:false"

Below is the result:

# ./run_custom_workload.sh 

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: 586e8225-db0d-4f26-bcb8-ce616f6b8ec6
[INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.
[INFO] Executing test with workload [wazuh-test], test_procedure [default-test-procedure] and provision_config_instance ['external'] with version [7.10.2].

[WARNING] merges_total_time is 93391 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 59258 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 559684 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 21547 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running cluster-health                                                         [100% done]
Running index-append                                                           [100% done]
Running refresh-after-index                                                    [100% done]
Running force-merge                                                            [100% done]
Running refresh-after-force-merge                                              [100% done]
Running wait-until-merges-finish                                               [100% done]
Running match-all                                                              [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |                     Task |      Value |   Unit |
|---------------------------------------------------------------:|-------------------------:|-----------:|-------:|
|                     Cumulative indexing time of primary shards |                          |   0.985733 |    min |
|             Min cumulative indexing time across primary shards |                          |          0 |    min |
|          Median cumulative indexing time across primary shards |                          |          0 |    min |
|             Max cumulative indexing time across primary shards |                          |   0.142783 |    min |
|            Cumulative indexing throttle time of primary shards |                          |          0 |    min |
|    Min cumulative indexing throttle time across primary shards |                          |          0 |    min |
| Median cumulative indexing throttle time across primary shards |                          |          0 |    min |
|    Max cumulative indexing throttle time across primary shards |                          |          0 |    min |
|                        Cumulative merge time of primary shards |                          |    1.56205 |    min |
|                       Cumulative merge count of primary shards |                          |       6115 |        |
|                Min cumulative merge time across primary shards |                          |          0 |    min |
|             Median cumulative merge time across primary shards |                          |          0 |    min |
|                Max cumulative merge time across primary shards |                          |   0.178017 |    min |
|               Cumulative merge throttle time of primary shards |                          |          0 |    min |
|       Min cumulative merge throttle time across primary shards |                          |          0 |    min |
|    Median cumulative merge throttle time across primary shards |                          |          0 |    min |
|       Max cumulative merge throttle time across primary shards |                          |          0 |    min |
|                      Cumulative refresh time of primary shards |                          |    9.35108 |    min |
|                     Cumulative refresh count of primary shards |                          |      57555 |        |
|              Min cumulative refresh time across primary shards |                          |          0 |    min |
|           Median cumulative refresh time across primary shards |                          |          0 |    min |
|              Max cumulative refresh time across primary shards |                          |    1.29032 |    min |
|                        Cumulative flush time of primary shards |                          |     0.3596 |    min |
|                       Cumulative flush count of primary shards |                          |        726 |        |
|                Min cumulative flush time across primary shards |                          |          0 |    min |
|             Median cumulative flush time across primary shards |                          |          0 |    min |
|                Max cumulative flush time across primary shards |                          |   0.122833 |    min |
|                                        Total Young Gen GC time |                          |      0.014 |      s |
|                                       Total Young Gen GC count |                          |          1 |        |
|                                          Total Old Gen GC time |                          |          0 |      s |
|                                         Total Old Gen GC count |                          |          0 |        |
|                                                     Store size |                          |   0.116531 |     GB |
|                                                  Translog size |                          | 0.00978717 |     GB |
|                                         Heap used for segments |                          |          0 |     MB |
|                                       Heap used for doc values |                          |          0 |     MB |
|                                            Heap used for terms |                          |          0 |     MB |
|                                            Heap used for norms |                          |          0 |     MB |
|                                           Heap used for points |                          |          0 |     MB |
|                                    Heap used for stored fields |                          |          0 |     MB |
|                                                  Segment count |                          |        722 |        |
|                                                 Min Throughput |             index-append |    6846.26 | docs/s |
|                                                Mean Throughput |             index-append |    6846.26 | docs/s |
|                                              Median Throughput |             index-append |    6846.26 | docs/s |
|                                                 Max Throughput |             index-append |    6846.26 | docs/s |
|                                        50th percentile latency |             index-append |    240.359 |     ms |
|                                       100th percentile latency |             index-append |    243.049 |     ms |
|                                   50th percentile service time |             index-append |    240.359 |     ms |
|                                  100th percentile service time |             index-append |    243.049 |     ms |
|                                                     error rate |             index-append |          0 |      % |
|                                                 Min Throughput | wait-until-merges-finish |      24.51 |  ops/s |
|                                                Mean Throughput | wait-until-merges-finish |      24.51 |  ops/s |
|                                              Median Throughput | wait-until-merges-finish |      24.51 |  ops/s |
|                                                 Max Throughput | wait-until-merges-finish |      24.51 |  ops/s |
|                                       100th percentile latency | wait-until-merges-finish |    37.1372 |     ms |
|                                  100th percentile service time | wait-until-merges-finish |    37.1372 |     ms |
|                                                     error rate | wait-until-merges-finish |          0 |      % |
|                                                 Min Throughput |                match-all |       3.02 |  ops/s |
|                                                Mean Throughput |                match-all |       3.03 |  ops/s |
|                                              Median Throughput |                match-all |       3.03 |  ops/s |
|                                                 Max Throughput |                match-all |       3.05 |  ops/s |
|                                        50th percentile latency |                match-all |    6.85162 |     ms |
|                                        90th percentile latency |                match-all |    7.55348 |     ms |
|                                        99th percentile latency |                match-all |    8.55737 |     ms |
|                                       100th percentile latency |                match-all |    9.84485 |     ms |
|                                   50th percentile service time |                match-all |    4.84304 |     ms |
|                                   90th percentile service time |                match-all |    5.46714 |     ms |
|                                   99th percentile service time |                match-all |    6.36302 |     ms |
|                                  100th percentile service time |                match-all |    7.92025 |     ms |
|                                                     error rate |                match-all |          0 |      % |


---------------------------------
[INFO] SUCCESS (took 109 seconds)
---------------------------------

@f-galland f-galland reopened this Jun 13, 2024
@wazuhci wazuhci moved this from In progress to On hold in Release 5.0.0 Jun 13, 2024
@wazuhci wazuhci moved this from On hold to In progress in Release 5.0.0 Jun 20, 2024
@wazuhci wazuhci moved this from In progress to On hold in Release 5.0.0 Jun 24, 2024
@wazuhci wazuhci moved this from On hold to In progress in Release 5.0.0 Jun 25, 2024
@f-galland
Member

Using the method to create a custom workload described above, I created a workload with the following test procedures:

root@os-benchmarks:~# cat benchmarks/wazuh-alerts/test_procedures/default.json
{
  "name": "parallel-any",
  "description": "Workload completed-by property",
  "schedule": [
    {
      "parallel": {
        "tasks": [
          {
            "name": "parellel-task-1",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 1000
            },
            "clients": 100
          },
          {
            "name": "parellel-task-2",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 1000
            },
            "clients": 100
          }
        ]
      }
    }
  ]
}

This was run with the following Docker environment:

services:

  opensearch-benchmark:
    image: opensearchproject/opensearch-benchmark:1.6.0
    hostname: opensearch-benchmark
    depends_on:
      opensearch-node1:
        condition: service_healthy
      permissions-setter:
        condition: service_completed_successfully
    container_name: opensearch-benchmark
    volumes:
      - ./benchmarks:/opensearch-benchmark/.benchmark
    environment:
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
        #command: execute-test --target-hosts https://opensearch-node1:9200 --pipeline benchmark-only --workload geonames --client-options basic_auth_user:admin,basic_auth_password:${OPENSEARCH_INITIAL_ADMIN_PASSWORD},verify_certs:false --test-mode
    command: execute-test --pipeline="benchmark-only" --workload-path="/opensearch-benchmark/.benchmark/wazuh-alerts" --target-host="https://opensearch-node1:9200" --client-options="basic_auth_user:admin,basic_auth_password:${OPENSEARCH_INITIAL_ADMIN_PASSWORD},verify_certs:false"

    networks:
      - opensearch-net

  opensearch-node1: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/)
    image: opensearchproject/opensearch:2.14.0
    container_name: opensearch-node1
    hostname: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node1 # Name the node that will run in this container
      - discovery.seed_hosts=opensearch-node1,opensearch-node2 # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2 # Nodes eligibile to serve as cluster manager
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD} # Sets the demo admin user password when using demo configuration (for OpenSearch 2.12 and later)
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
    healthcheck:
      test: curl -sku admin:${OPENSEARCH_INITIAL_ADMIN_PASSWORD} https://localhost:9200/_cat/health | grep -q opensearch-cluster
      start_period: 10s
      start_interval: 3s
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network

  opensearch-node2:
    image: opensearchproject/opensearch:2.14.0 # This should be the same image used for opensearch-node1 to avoid issues
    container_name: opensearch-node2
    hostname: opensearch-node2
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data
    networks:
      - opensearch-net
  
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.14.0 # Make sure the version of opensearch-dashboards matches the version of opensearch installed on other nodes
    container_name: opensearch-dashboards
    depends_on:
      opensearch-node1:
        condition: service_healthy
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch-node1:9200","https://opensearch-node2:9200"]' # Define the OpenSearch nodes that OpenSearch Dashboards will query
    networks:
      - opensearch-net

  permissions-setter:
    image: alpine:3.14
    container_name: permissions-setter
    volumes:
      - ./benchmarks:/benchmark
    entrypoint: /bin/sh
    command: >
      -c '
        chmod -R a+rw /benchmark
      '

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:

Below are the results of the test:

  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: 41430721-7ced-41b4-b363-8eaf19f73221
[INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.
[WARNING] refresh_total_time is 6 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running parellel-task-2,parellel-task-1                                        [100% done]
[INFO] Executing test with workload [wazuh-alerts], test_procedure [parallel-any] and provision_config_instance ['external'] with version [2.14.0].


------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |            Task |       Value |   Unit |
|---------------------------------------------------------------:|----------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |                 |      0.0863 |    min |
|             Min cumulative indexing time across primary shards |                 |           0 |    min |
|          Median cumulative indexing time across primary shards |                 |           0 |    min |
|             Max cumulative indexing time across primary shards |                 |      0.0863 |    min |
|            Cumulative indexing throttle time of primary shards |                 |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |                 |           0 |    min |
| Median cumulative indexing throttle time across primary shards |                 |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |                 |           0 |    min |
|                        Cumulative merge time of primary shards |                 |           0 |    min |
|                       Cumulative merge count of primary shards |                 |           0 |        |
|                Min cumulative merge time across primary shards |                 |           0 |    min |
|             Median cumulative merge time across primary shards |                 |           0 |    min |
|                Max cumulative merge time across primary shards |                 |           0 |    min |
|               Cumulative merge throttle time of primary shards |                 |           0 |    min |
|       Min cumulative merge throttle time across primary shards |                 |           0 |    min |
|    Median cumulative merge throttle time across primary shards |                 |           0 |    min |
|       Max cumulative merge throttle time across primary shards |                 |           0 |    min |
|                      Cumulative refresh time of primary shards |                 |      0.0001 |    min |
|                     Cumulative refresh count of primary shards |                 |          77 |        |
|              Min cumulative refresh time across primary shards |                 |           0 |    min |
|           Median cumulative refresh time across primary shards |                 |           0 |    min |
|              Max cumulative refresh time across primary shards |                 | 8.33333e-05 |    min |
|                        Cumulative flush time of primary shards |                 |           0 |    min |
|                       Cumulative flush count of primary shards |                 |           0 |        |
|                Min cumulative flush time across primary shards |                 |           0 |    min |
|             Median cumulative flush time across primary shards |                 |           0 |    min |
|                Max cumulative flush time across primary shards |                 |           0 |    min |
|                                        Total Young Gen GC time |                 |       0.086 |      s |
|                                       Total Young Gen GC count |                 |           7 |        |
|                                          Total Old Gen GC time |                 |           0 |      s |
|                                         Total Old Gen GC count |                 |           0 |        |
|                                                     Store size |                 |   0.0571623 |     GB |
|                                                  Translog size |                 |   0.0342067 |     GB |
|                                         Heap used for segments |                 |           0 |     MB |
|                                       Heap used for doc values |                 |           0 |     MB |
|                                            Heap used for terms |                 |           0 |     MB |
|                                            Heap used for norms |                 |           0 |     MB |
|                                           Heap used for points |                 |           0 |     MB |
|                                    Heap used for stored fields |                 |           0 |     MB |
|                                                  Segment count |                 |          73 |        |
|                                                 Min Throughput | parellel-task-1 |      6930.8 | docs/s |
|                                                Mean Throughput | parellel-task-1 |      6930.8 | docs/s |
|                                              Median Throughput | parellel-task-1 |      6930.8 | docs/s |
|                                                 Max Throughput | parellel-task-1 |      6930.8 | docs/s |
|                                        50th percentile latency | parellel-task-1 |     963.212 |     ms |
|                                       100th percentile latency | parellel-task-1 |      1050.9 |     ms |
|                                   50th percentile service time | parellel-task-1 |     963.212 |     ms |
|                                  100th percentile service time | parellel-task-1 |      1050.9 |     ms |
|                                                     error rate | parellel-task-1 |           0 |      % |
|                                                 Min Throughput | parellel-task-2 |      752.03 | docs/s |
|                                                Mean Throughput | parellel-task-2 |      752.03 | docs/s |
|                                              Median Throughput | parellel-task-2 |      752.03 | docs/s |
|                                                 Max Throughput | parellel-task-2 |      752.03 | docs/s |
|                                        50th percentile latency | parellel-task-2 |     991.137 |     ms |
|                                       100th percentile latency | parellel-task-2 |     1094.09 |     ms |
|                                   50th percentile service time | parellel-task-2 |     991.137 |     ms |
|                                  100th percentile service time | parellel-task-2 |     1094.09 |     ms |
|                                                     error rate | parellel-task-2 |           0 |      % |



--------------------------------
[INFO] SUCCESS (took 16 seconds)
--------------------------------

The clients and bulk-size parameters seem to correlate with the actual amount of data being indexed:

root@os-benchmarks:~# curl -ku admin:Secret.Password.1234 https://localhost:9200/_cat/indices?s=store.size
green open .opensearch-observability    VEodRP5XRWCaUTyIxp947g 1 1     0 0    416b    208b
green open .ql-datasources              kKh4Hp4HQeaM17jF-h4ZFg 1 1     0 0    416b    208b
green open .plugins-ml-config           KqwdywM0QpGBHGDBgxTsPA 1 1     1 0   7.8kb   3.9kb
green open .kibana_92668751_admin_1     75yGMhz_S42b_lxukdV5zA 1 1     1 0  10.3kb   5.1kb
green open .kibana_1                    ehjT55saT0S_2ragBN9O_g 1 1     1 0  10.3kb   5.1kb
green open .opendistro_security         tHeF1aZ6SImXyyB_TGVeDA 1 1    10 0  97.8kb  48.9kb
green open security-auditlog-2024.06.12 iGBt812KR4CMYZR0WAlprA 1 1    55 0 143.3kb  63.9kb
green open queries                      WaVFFYN-QDyU4WCO-kDdPA 5 0  1000 0 196.2kb 196.2kb
green open security-auditlog-2024.06.25 9QI-2iC7QM2ocX6LvfHHBw 1 1   263 0 530.9kb 274.2kb
green open wazuh-alerts-4.x-2024.05     ji1Q8AvcQHePD2LeSSbRDg 1 1 31480 0  47.6mb  22.9mb

@f-galland
Member

An OpenSearch Benchmark workload can run various types of operations.

The bulk operation seems to be the only one at the document level, and I cannot find any mention of it being capable of bulk-updating or bulk-deleting documents.

There still might be a way to achieve this, since operations like force-merge are mentioned throughout the documentation even though they have no reference entry.
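
For reference, the underlying _bulk API itself accepts update and delete action lines in addition to index, so a corpus carrying explicit action metadata could in principle express all three operation types. A minimal sketch (index name and document IDs are placeholders); note that update takes a partial document under "doc" and delete takes no source line:

curl -sku admin:"${OPENSEARCH_INITIAL_ADMIN_PASSWORD}" -X POST "https://localhost:9200/_bulk" \
  -H 'Content-Type: application/x-ndjson' --data-binary @- <<'EOF'
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e5" } }
{ "rule": { "level": 3 }, "full_log": "new document" }
{ "update": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e5" } }
{ "doc": { "rule": { "level": 5 } } }
{ "delete": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e6" } }
EOF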

@f-galland
Member

The nyc taxis sample workload seems to include an update operation.

@f-galland
Member

New operation-types can be defined as functions in a workload.py file.

Here is an example of a reindex operation being referenced in a test procedure, and the corresponding definition of the custom operation:
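
As an illustration only, such a custom operation could be sketched as follows. The operation name, runner body and file layout are hypothetical, and this assumes OSB's register(registry) / register_runner mechanism for custom runners defined in workload.py:

# Sketch only: a hypothetical custom runner for a reindex-like operation
cat > workload.py <<'EOF'
async def reindex_runner(opensearch, params):
    # Forward the body defined in the operation to the Reindex API
    await opensearch.reindex(body=params["body"])


def register(registry):
    # Expose the runner to test procedures as operation-type "custom-reindex"
    registry.register_runner("custom-reindex", reindex_runner, async_runner=True)
EOF

# A test procedure would then reference it through an operation definition, e.g.:
#   { "operation": { "operation-type": "custom-reindex", "body": { "source": { "index": "old-index" }, "dest": { "index": "new-index" } } } }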

@f-galland
Member

I tried creating a workload that uses the bulk operation-type with metadata indicating the action (indexing, deleting, updating documents).

My corpus looks as follows:

test.json

root@os-benchmarks:~/benchmarks/wazuh-alerts# cat test.json
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e5" } }
{"predecoder":{"hostname":"jenkins","program_name":"smbd","timestamp":"May  2 11:45:54"},"agent":{"ip":"192.168.56.1","name":"jenkins","id":"014"},"manager":{"name":"manager"},"data":{"dstuser":"nobody"},"rule":{"firedtimes":786,"mail":false,"level":3,"pci_dss":["10.2.5"],"hipaa":["164.312.b"],"tsc":["CC6.8","CC7.2","CC7.3"],"description":"PAM: Login session closed.","groups":["pam","syslog"],"id":"5502","nist_800_53":["AU.14","AC.7"],"gpg13":["7.8","7.9"],"gdpr":["IV_32.2"]},"decoder":{"parent":"pam","name":"pam"},"full_log":"May  2 11:45:54 jenkins smbd: pam_unix(samba:session): session closed for user nobody","input":{"type":"log"},"@timestamp":"2024-05-02T11:45:55.891-03:00","location":"/var/log/auth.log","id":"1714661155.429268","timestamp":"2024-05-02T11:45:55.891-0300"}
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e6" } }
{"predecoder":{"hostname":"jenkins","program_name":"smbd","timestamp":"May  2 11:45:56"},"agent":{"ip":"192.168.56.1","name":"jenkins","id":"014"},"manager":{"name":"manager"},"data":{"dstuser":"nobody"},"rule":{"firedtimes":787,"mail":false,"level":3,"pci_dss":["10.2.5"],"hipaa":["164.312.b"],"tsc":["CC6.8","CC7.2","CC7.3"],"description":"PAM: Login session closed.","groups":["pam","syslog"],"id":"5502","nist_800_53":["AU.14","AC.7"],"gpg13":["7.8","7.9"],"gdpr":["IV_32.2"]},"decoder":{"parent":"pam","name":"pam"},"full_log":"May  2 11:45:56 jenkins smbd: pam_unix(samba:session): session closed for user nobody","input":{"type":"log"},"@timestamp":"2024-05-02T11:45:57.894-03:00","location":"/var/log/auth.log","id":"1714661157.429646","timestamp":"2024-05-02T11:45:57.894-0300"}
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e7" } }
{"predecoder":{"hostname":"jenkins","program_name":"smbd","timestamp":"May  2 11:45:58"},"agent":{"ip":"192.168.56.1","name":"jenkins","id":"014"},"manager":{"name":"manager"},"data":{"dstuser":"nobody"},"rule":{"firedtimes":788,"mail":false,"level":3,"pci_dss":["10.2.5"],"hipaa":["164.312.b"],"tsc":["CC6.8","CC7.2","CC7.3"],"description":"PAM: Login session closed.","groups":["pam","syslog"],"id":"5502","nist_800_53":["AU.14","AC.7"],"gpg13":["7.8","7.9"],"gdpr":["IV_32.2"]},"decoder":{"parent":"pam","name":"pam"},"full_log":"May  2 11:45:58 jenkins smbd: pam_unix(samba:session): session closed for user nobody","input":{"type":"log"},"@timestamp":"2024-05-02T11:45:59.896-03:00","location":"/var/log/auth.log","id":"1714661159.430024","timestamp":"2024-05-02T11:45:59.896-0300"}
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e8" } }
{"predecoder":{"hostname":"jenkins","program_name":"smbd","timestamp":"May  2 11:46:13"},"agent":{"ip":"192.168.56.1","name":"jenkins","id":"014"},"manager":{"name":"manager"},"data":{"dstuser":"nobody"},"rule":{"firedtimes":795,"mail":false,"level":3,"pci_dss":["10.2.5"],"hipaa":["164.312.b"],"tsc":["CC6.8","CC7.2","CC7.3"],"description":"PAM: Login session closed.","groups":["pam","syslog"],"id":"5502","nist_800_53":["AU.14","AC.7"],"gpg13":["7.8","7.9"],"gdpr":["IV_32.2"]},"decoder":{"parent":"pam","name":"pam"},"full_log":"May  2 11:46:13 jenkins smbd: pam_unix(samba:session): session closed for user nobody","input":{"type":"log"},"@timestamp":"2024-05-02T11:46:13.912-03:00","location":"/var/log/auth.log","id":"1714661173.432670","timestamp":"2024-05-02T11:46:13.912-0300"}
{ "index": { "_index": "wazuh-alerts-4.x-2024.05", "_id": "a1b2c3d4e9" } }
{"predecoder":{"hostname":"jenkins","program_name":"smbd","timestamp":"May  2 11:46:15"},"agent":{"ip":"192.168.56.1","name":"jenkins","id":"014"},"manager":{"name":"manager"},"data":{"dstuser":"nobody"},"rule":{"firedtimes":796,"mail":false,"level":3,"pci_dss":["10.2.5"],"hipaa":["164.312.b"],"tsc":["CC6.8","CC7.2","CC7.3"],"description":"PAM: Login session closed.","groups":["pam","syslog"],"id":"5502","nist_800_53":["AU.14","AC.7"],"gpg13":["7.8","7.9"],"gdpr":["IV_32.2"]},"decoder":{"parent":"pam","name":"pam"},"full_log":"May  2 11:46:15 jenkins smbd: pam_unix(samba:session): session closed for user nobody","input":{"type":"log"},"@timestamp":"2024-05-02T11:46:15.914-03:00","location":"/var/log/auth.log","id":"1714661175.433048","timestamp":"2024-05-02T11:46:15.914-0300"

workload.json

{% import "benchmark.helpers" as benchmark with context %}
{
  "version": 2,
  "description": "Tracker-generated workload for wazuh-alerts",
  "indices": [
    {
      "name": "wazuh-alerts-4.x-2024.05",
      "body": "wazuh-alerts-4.x-2024.05.json"
    }
  ],
  "corpora": [
    {
      "name": "wazuh-alerts-4.x-2024.05",
      "documents": [
        {
          "target-index": "wazuh-alerts-4.x-2024.05",
          "source-file": "test.json",
          "document-count": 10
        }
      ]
    }
  ],
  "operations": [
    {{ benchmark.collect(parts="operations/*.json") }}
  ],
  "test_procedures": [
    {{ benchmark.collect(parts="test_procedures/*.json") }}
  ]
}

test_procedures/default.json

{
    "name": "index-append",
    "operation-type": "bulk",
    "bulk-size": 5,
    "action-metadata-present": true,
    "ingest-percentage": 100
},
{
    "name": "wait-until-merges-finish",
    "operation-type": "index-stats",
    "index": "_all",
    "condition": {
      "path": "_all.total.merges.current",
      "expected-value": 0
    },
    "retry-until-success": true,
    "include-in-reporting": false
},
{
    "name": "match-all",
    "operation-type": "search",
    "index": "wazuh-alerts-4.x-2024.05",
    "body": {
        "size": 10,
        "query": {
            "match_all": {}
        }
    }
}

I set the action-metadata-present field to true in the above test procedure's bulk operation based on the following comment from the opensearch-benchmark code:

* ``action_metadata_present``: if ``True``, assume that an action and metadata line is present (meaning only half of the lines contain actual documents to index)

This resulted in the "metadata" lines of the test.json corpus being indexed as if they were regular documents.


After a closer look at the code, I realized that the action_metadata_present flag doesn't actually change the action metadata used in the bulk operations.

We need to determine whether to:

  • Modify opensearch-benchmark's code to add the feature
  • Create a workload.py in our benchmark that registers a runner that allows for mixed bulk operations (not just indexing), as sketched below
  • Create our own benchmarking solution.
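
As a starting point for the workload.py option, a rough, untested sketch of a runner that sends a pre-built mixed-action bulk body (the parameter names are assumptions, not opensearch-benchmark defaults) could look like this:

# workload.py -- hypothetical sketch of a mixed bulk runner
async def mixed_bulk(opensearch, params):
    # "body" is assumed to be an NDJSON string mixing index/update/delete action lines,
    # produced by a matching custom parameter source
    response = await opensearch.bulk(body=params["body"])
    items = response.get("items", [])
    return {"weight": len(items), "unit": "docs", "success": not response.get("errors", False)}

def register(registry):
    registry.register_runner("mixed-bulk", mixed_bulk, async_runner=True)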

@f-galland
Member

I ran two benchmarks, on indexing operations only.
These were run against the Docker Compose environment shared in a previous comment, with a 3-node OpenSearch cluster.

The test_procedures were defined as follows:

{
  "name": "single-bulk",
  "description": "Customized test procedure with a single bulk request indexing 10k wazuh-alerts documents.",
  "schedule": [
    {
      "operation": {
        "name": "single-bulk-index-task",
        "operation-type": "bulk",
        "bulk-size": 10000
      }
    }
  ]
}
{
  "name": "parallel-any",
  "description": "Customized test procedure with a parallel bulk requests indexing 5k, 3k, 1.5k and 0.5k wazuh-alerts documents in parallel bulks.",
  "schedule": [
    {
      "parallel": {
        "tasks": [
          {
            "name": "5k-events-task",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 5000
            },
            "clients": 1
          },
          {
            "name": "3k-events-task",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 3000
            },
            "clients": 1
          },
          {
            "name": "1.5k-events-task",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 1500
            },
            "clients": 1
          },
          {
            "name": "0.5k-events-task",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 500
            },
            "clients": 1
          }
        ]
      }
    }
  ]
}
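
These procedures were driven from the opensearch-benchmark container in the Docker Compose environment; running the same workload from a local checkout would presumably look something like the following (paths, endpoint and credentials are placeholders):

opensearch-benchmark execute-test \
  --pipeline=benchmark-only \
  --workload-path=/path/to/wazuh-alerts-parallelized \
  --test-procedure=parallel-any \
  --target-hosts=https://localhost:9200 \
  --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'admin',basic_auth_password:'admin'"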

Results

Single 10k bulk:

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: 6e80dbbd-f96b-4fe9-b685-0b63710abb0e
[INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.
[WARNING] indexing_total_time is 42 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 339 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running single-bulk-index-task                                                 [100% done]
[INFO] Executing test with workload [wazuh-alerts-single-bulk], test_procedure [default-test-procedure] and provision_config_instance ['external'] with version [2.14.0].


------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |                   Task |      Value |   Unit |
|---------------------------------------------------------------:|-----------------------:|-----------:|-------:|
|                     Cumulative indexing time of primary shards |                        |    0.03075 |    min |
|             Min cumulative indexing time across primary shards |                        |          0 |    min |
|          Median cumulative indexing time across primary shards |                        |    0.00035 |    min |
|             Max cumulative indexing time across primary shards |                        |    0.02785 |    min |
|            Cumulative indexing throttle time of primary shards |                        |          0 |    min |
|    Min cumulative indexing throttle time across primary shards |                        |          0 |    min |
| Median cumulative indexing throttle time across primary shards |                        |          0 |    min |
|    Max cumulative indexing throttle time across primary shards |                        |          0 |    min |
|                        Cumulative merge time of primary shards |                        | 0.00353333 |    min |
|                       Cumulative merge count of primary shards |                        |          5 |        |
|                Min cumulative merge time across primary shards |                        |          0 |    min |
|             Median cumulative merge time across primary shards |                        |          0 |    min |
|                Max cumulative merge time across primary shards |                        | 0.00353333 |    min |
|               Cumulative merge throttle time of primary shards |                        |          0 |    min |
|       Min cumulative merge throttle time across primary shards |                        |          0 |    min |
|    Median cumulative merge throttle time across primary shards |                        |          0 |    min |
|       Max cumulative merge throttle time across primary shards |                        |          0 |    min |
|                      Cumulative refresh time of primary shards |                        |     0.0314 |    min |
|                     Cumulative refresh count of primary shards |                        |        127 |        |
|              Min cumulative refresh time across primary shards |                        |          0 |    min |
|           Median cumulative refresh time across primary shards |                        |   0.002825 |    min |
|              Max cumulative refresh time across primary shards |                        |  0.0139333 |    min |
|                        Cumulative flush time of primary shards |                        |          0 |    min |
|                       Cumulative flush count of primary shards |                        |          0 |        |
|                Min cumulative flush time across primary shards |                        |          0 |    min |
|             Median cumulative flush time across primary shards |                        |          0 |    min |
|                Max cumulative flush time across primary shards |                        |          0 |    min |
|                                        Total Young Gen GC time |                        |      0.162 |      s |
|                                       Total Young Gen GC count |                        |         17 |        |
|                                          Total Old Gen GC time |                        |          0 |      s |
|                                         Total Old Gen GC count |                        |          0 |        |
|                                                     Store size |                        |  0.0294357 |     GB |
|                                                  Translog size |                        |  0.0379527 |     GB |
|                                         Heap used for segments |                        |          0 |     MB |
|                                       Heap used for doc values |                        |          0 |     MB |
|                                            Heap used for terms |                        |          0 |     MB |
|                                            Heap used for norms |                        |          0 |     MB |
|                                           Heap used for points |                        |          0 |     MB |
|                                    Heap used for stored fields |                        |          0 |     MB |
|                                                  Segment count |                        |         26 |        |
|                                                 Min Throughput | single-bulk-index-task |     940.61 | docs/s |
|                                                Mean Throughput | single-bulk-index-task |    1205.66 | docs/s |
|                                              Median Throughput | single-bulk-index-task |    1205.66 | docs/s |
|                                                 Max Throughput | single-bulk-index-task |    1470.72 | docs/s |
|                                        50th percentile latency | single-bulk-index-task |     3858.7 |     ms |
|                                       100th percentile latency | single-bulk-index-task |    6775.95 |     ms |
|                                   50th percentile service time | single-bulk-index-task |     3858.7 |     ms |
|                                  100th percentile service time | single-bulk-index-task |    6775.95 |     ms |
|                                                     error rate | single-bulk-index-task |          0 |      % |


--------------------------------
[INFO] SUCCESS (took 21 seconds)
--------------------------------

Parallel indexing

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: 63d78463-b816-44de-9a5f-16a08084a061
[INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.
[INFO] Preparing file offset table for [/opensearch-benchmark/.benchmark/wazuh-alerts-parallelized/wazuh-alerts-benchmark-data-documents.json] ... [OK]
[WARNING] indexing_total_time is 36 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 328 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running 3k-events-task,0.5k-events-task,5k-events-task,1.5k-events-task        [100% done][INFO] Executing test with workload [wazuh-alerts-parallelized], test_procedure [parallel-any] and provision_config_instance ['external'] with version [2.14.0].


------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |             Task |      Value |   Unit |
|---------------------------------------------------------------:|-----------------:|-----------:|-------:|
|                     Cumulative indexing time of primary shards |                  |   0.270033 |    min |
|             Min cumulative indexing time across primary shards |                  |          0 |    min |
|          Median cumulative indexing time across primary shards |                  |     0.0003 |    min |
|             Max cumulative indexing time across primary shards |                  |   0.262817 |    min |
|            Cumulative indexing throttle time of primary shards |                  |          0 |    min |
|    Min cumulative indexing throttle time across primary shards |                  |          0 |    min |
| Median cumulative indexing throttle time across primary shards |                  |          0 |    min |
|    Max cumulative indexing throttle time across primary shards |                  |          0 |    min |
|                        Cumulative merge time of primary shards |                  |     0.0474 |    min |
|                       Cumulative merge count of primary shards |                  |         16 |        |
|                Min cumulative merge time across primary shards |                  |          0 |    min |
|             Median cumulative merge time across primary shards |                  |          0 |    min |
|                Max cumulative merge time across primary shards |                  |  0.0384833 |    min |
|               Cumulative merge throttle time of primary shards |                  |          0 |    min |
|       Min cumulative merge throttle time across primary shards |                  |          0 |    min |
|    Median cumulative merge throttle time across primary shards |                  |          0 |    min |
|       Max cumulative merge throttle time across primary shards |                  |          0 |    min |
|                      Cumulative refresh time of primary shards |                  |  0.0934167 |    min |
|                     Cumulative refresh count of primary shards |                  |        215 |        |
|              Min cumulative refresh time across primary shards |                  |          0 |    min |
|           Median cumulative refresh time across primary shards |                  | 0.00273333 |    min |
|              Max cumulative refresh time across primary shards |                  |  0.0502833 |    min |
|                        Cumulative flush time of primary shards |                  |          0 |    min |
|                       Cumulative flush count of primary shards |                  |          0 |        |
|                Min cumulative flush time across primary shards |                  |          0 |    min |
|             Median cumulative flush time across primary shards |                  |          0 |    min |
|                Max cumulative flush time across primary shards |                  |          0 |    min |
|                                        Total Young Gen GC time |                  |      0.654 |      s |
|                                       Total Young Gen GC count |                  |         65 |        |
|                                          Total Old Gen GC time |                  |          0 |      s |
|                                         Total Old Gen GC count |                  |          0 |        |
|                                                     Store size |                  |   0.111155 |     GB |
|                                                  Translog size |                  |   0.153145 |     GB |
|                                         Heap used for segments |                  |          0 |     MB |
|                                       Heap used for doc values |                  |          0 |     MB |
|                                            Heap used for terms |                  |          0 |     MB |
|                                            Heap used for norms |                  |          0 |     MB |
|                                           Heap used for points |                  |          0 |     MB |
|                                    Heap used for stored fields |                  |          0 |     MB |
|                                                  Segment count |                  |         24 |        |
|                                                 Min Throughput |   5k-events-task |        544 | docs/s |
|                                                Mean Throughput |   5k-events-task |     567.73 | docs/s |
|                                              Median Throughput |   5k-events-task |     544.24 | docs/s |
|                                                 Max Throughput |   5k-events-task |     614.94 | docs/s |
|                                        50th percentile latency |   5k-events-task |    2398.31 |     ms |
|                                       100th percentile latency |   5k-events-task |    9173.81 |     ms |
|                                   50th percentile service time |   5k-events-task |    2398.31 |     ms |
|                                  100th percentile service time |   5k-events-task |    9173.81 |     ms |
|                                                     error rate |   5k-events-task |          0 |      % |
|                                                 Min Throughput |   3k-events-task |     516.46 | docs/s |
|                                                Mean Throughput |   3k-events-task |        619 | docs/s |
|                                              Median Throughput |   3k-events-task |     608.69 | docs/s |
|                                                 Max Throughput |   3k-events-task |     732.44 | docs/s |
|                                        50th percentile latency |   3k-events-task |    1587.65 |     ms |
|                                       100th percentile latency |   3k-events-task |    5794.57 |     ms |
|                                   50th percentile service time |   3k-events-task |    1587.65 |     ms |
|                                  100th percentile service time |   3k-events-task |    5794.57 |     ms |
|                                                     error rate |   3k-events-task |          0 |      % |
|                                                 Min Throughput | 1.5k-events-task |     326.99 | docs/s |
|                                                Mean Throughput | 1.5k-events-task |     564.57 | docs/s |
|                                              Median Throughput | 1.5k-events-task |     585.95 | docs/s |
|                                                 Max Throughput | 1.5k-events-task |     780.17 | docs/s |
|                                        50th percentile latency | 1.5k-events-task |    1036.56 |     ms |
|                                       100th percentile latency | 1.5k-events-task |    4576.54 |     ms |
|                                   50th percentile service time | 1.5k-events-task |    1036.56 |     ms |
|                                  100th percentile service time | 1.5k-events-task |    4576.54 |     ms |
|                                                     error rate | 1.5k-events-task |          0 |      % |
|                                                 Min Throughput | 0.5k-events-task |     180.57 | docs/s |
|                                                Mean Throughput | 0.5k-events-task |     484.79 | docs/s |
|                                              Median Throughput | 0.5k-events-task |      520.7 | docs/s |
|                                                 Max Throughput | 0.5k-events-task |     851.66 | docs/s |
|                                        50th percentile latency | 0.5k-events-task |    364.129 |     ms |
|                                        90th percentile latency | 0.5k-events-task |    725.394 |     ms |
|                                       100th percentile latency | 0.5k-events-task |    2762.02 |     ms |
|                                   50th percentile service time | 0.5k-events-task |    364.129 |     ms |
|                                   90th percentile service time | 0.5k-events-task |    725.394 |     ms |
|                                  100th percentile service time | 0.5k-events-task |    2762.02 |     ms |
|                                                     error rate | 0.5k-events-task |          0 |      % |



--------------------------------
[INFO] SUCCESS (took 26 seconds)
--------------------------------

@f-galland
Member

It was determined that we need to test the optimal ingest bulk size in the 10-100MB range.
For this, I set up a workload that progressively ramps up the bulk size in 2 MB intervals. The benchmark results will be stored in an OpenSearch cluster so we can plot them.

I'm currently working on setting up a benchmark like the one above on a 3-node cluster running on top of 3 EC2 instances.
Since we need the tests to be easily reproducible, I'm setting up dockerized nodes. This should allow bringing each node and its data down and back up with a few commands, using Docker contexts from an outside host.

The benchmark can be run locally from any terminal.
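
As a rough sketch of that remote workflow (host names and users below are placeholders), one Docker context per EC2 node lets the cluster be managed over SSH from a local terminal:

# create a context pointing at one of the EC2 hosts over SSH (hypothetical host)
docker context create indexer-node-1 --docker "host=ssh://ubuntu@ec2-indexer-node-1"

# bring that node's containers down and back up from the outside host
docker --context indexer-node-1 compose down
docker --context indexer-node-1 compose up -d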

@f-galland
Member

I created a number of wazuh-alerts JSON files ranging in size from 5 MB to 100 MB, in 5 MB increments.

root@os-benchmarks:~/benchmarks/wazuh-alerts# ls -lh
total 1.1G
-rwxrwxrwx 1 root root 1.1K Jul  1 20:05 generate_config.sh
-rwxrwxrwx 1 root root  270 Jul  1 19:14 generate_files.sh
drwxrwxrwx 2 root root 4.0K Jul  1 20:16 operations
drwxrwxrwx 2 root root 4.0K Jul  1 20:31 test_procedures
-rw-rw-rw- 1 root root  10M Jul  1 19:14 wazuh-alerts-10.json
-rw-rw-rw- 1 root root 100M Jul  1 19:06 wazuh-alerts-100.json
-rw-rw-rw- 1 root root  15M Jul  1 19:14 wazuh-alerts-15.json
-rw-rw-rw- 1 root root  20M Jul  1 19:14 wazuh-alerts-20.json
-rw-rw-rw- 1 root root  25M Jul  1 19:14 wazuh-alerts-25.json
-rw-rw-rw- 1 root root  30M Jul  1 19:14 wazuh-alerts-30.json
-rw-rw-rw- 1 root root  35M Jul  1 19:14 wazuh-alerts-35.json
-rw-rw-rw- 1 root root  40M Jul  1 19:14 wazuh-alerts-40.json
-rw-rw-rw- 1 root root  45M Jul  1 19:14 wazuh-alerts-45.json
-rw-rw-rw- 1 root root 5.0M Jul  1 19:14 wazuh-alerts-5.json
-rw-rw-rw- 1 root root  50M Jul  1 19:14 wazuh-alerts-50.json
-rw-rw-rw- 1 root root  55M Jul  1 19:14 wazuh-alerts-55.json
-rw-rw-rw- 1 root root  60M Jul  1 19:14 wazuh-alerts-60.json
-rw-rw-rw- 1 root root  65M Jul  1 19:14 wazuh-alerts-65.json
-rw-rw-rw- 1 root root  70M Jul  1 19:14 wazuh-alerts-70.json
-rw-rw-rw- 1 root root  75M Jul  1 19:14 wazuh-alerts-75.json
-rw-rw-rw- 1 root root  80M Jul  1 19:14 wazuh-alerts-80.json
-rw-rw-rw- 1 root root  85M Jul  1 19:14 wazuh-alerts-85.json
-rw-rw-rw- 1 root root  90M Jul  1 19:14 wazuh-alerts-90.json
-rw-rw-rw- 1 root root  95M Jul  1 19:14 wazuh-alerts-95.json
-rw-rw-rw- 1 root root 141K Jun 27 11:28 wazuh-alerts.json
-rw-rw-rw- 1 root root 6.2K Jul  1 20:36 workload.json
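
The generate_files.sh script isn't shown here; as a minimal sketch of how such size-stepped corpora could be produced, assuming the small wazuh-alerts.json seed file holds one alert document per line, something along these lines would work:

#!/usr/bin/env bash
# Hypothetical sketch: build wazuh-alerts-<N>.json files of roughly N MB each
# by repeating a small seed corpus until the target size is reached.
seed=wazuh-alerts.json
for size in $(seq 5 5 100); do
  out="wazuh-alerts-${size}.json"
  : > "$out"
  while [ "$(stat -c%s "$out")" -lt $((size * 1024 * 1024)) ]; do
    cat "$seed" >> "$out"
  done
  # workload.json needs an exact document-count per corpus; for line-delimited JSON, wc -l gives it
  echo "$out: $(wc -l < "$out") documents"
done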

The workload.json and test_procedures/default.json files were updated to run bulk indexing tests for each of these files.

workload.json
{% import "benchmark.helpers" as benchmark with context %}
{
  "version": 2,
  "description": "Tracker-generated workload for wazuh-alerts",
  "indices": [
    {
      "name": "wazuh-alerts-5",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-10",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-15",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-20",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-25",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-30",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-35",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-40",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-45",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-50",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-55",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-60",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-65",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-70",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-75",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-80",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-85",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-90",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-95",
      "body": "wazuh-alerts.json"
    },
    {
      "name": "wazuh-alerts-100",
      "body": "wazuh-alerts.json"
    }
  ],
  "corpora": [
        {
      "name": "wazuh-alerts-5",
      "documents": [
        {
          "target-index": "wazuh-alerts-5",
          "source-file": "wazuh-alerts-5.json",
          "document-count": 3426
        }
      ]
    },
    {
      "name": "wazuh-alerts-10",
      "documents": [
        {
          "target-index": "wazuh-alerts-10",
          "source-file": "wazuh-alerts-10.json",
          "document-count": 6616
        }
      ]
    },
    {
      "name": "wazuh-alerts-15",
      "documents": [
        {
          "target-index": "wazuh-alerts-15",
          "source-file": "wazuh-alerts-15.json",
          "document-count": 9958
        }
      ]
    },
    {
      "name": "wazuh-alerts-20",
      "documents": [
        {
          "target-index": "wazuh-alerts-20",
          "source-file": "wazuh-alerts-20.json",
          "document-count": 13933
        }
      ]
    },
    {
      "name": "wazuh-alerts-25",
      "documents": [
        {
          "target-index": "wazuh-alerts-25",
          "source-file": "wazuh-alerts-25.json",
          "document-count": 17180
        }
      ]
    },
    {
      "name": "wazuh-alerts-30",
      "documents": [
        {
          "target-index": "wazuh-alerts-30",
          "source-file": "wazuh-alerts-30.json",
          "document-count": 20404
        }
      ]
    },
    {
      "name": "wazuh-alerts-35",
      "documents": [
        {
          "target-index": "wazuh-alerts-35",
          "source-file": "wazuh-alerts-35.json",
          "document-count": 23737
        }
      ]
    },
    {
      "name": "wazuh-alerts-40",
      "documents": [
        {
          "target-index": "wazuh-alerts-40",
          "source-file": "wazuh-alerts-40.json",
          "document-count": 27706
        }
      ]
    },
    {
      "name": "wazuh-alerts-45",
      "documents": [
        {
          "target-index": "wazuh-alerts-45",
          "source-file": "wazuh-alerts-45.json",
          "document-count": 30998
        }
      ]
    },
    {
      "name": "wazuh-alerts-50",
      "documents": [
        {
          "target-index": "wazuh-alerts-50",
          "source-file": "wazuh-alerts-50.json",
          "document-count": 34187
        }
      ]
    },
    {
      "name": "wazuh-alerts-55",
      "documents": [
        {
          "target-index": "wazuh-alerts-55",
          "source-file": "wazuh-alerts-55.json",
          "document-count": 37774
        }
      ]
    },
    {
      "name": "wazuh-alerts-60",
      "documents": [
        {
          "target-index": "wazuh-alerts-60",
          "source-file": "wazuh-alerts-60.json",
          "document-count": 41473
        }
      ]
    },
    {
      "name": "wazuh-alerts-65",
      "documents": [
        {
          "target-index": "wazuh-alerts-65",
          "source-file": "wazuh-alerts-65.json",
          "document-count": 44729
        }
      ]
    },
    {
      "name": "wazuh-alerts-70",
      "documents": [
        {
          "target-index": "wazuh-alerts-70",
          "source-file": "wazuh-alerts-70.json",
          "document-count": 47947
        }
      ]
    },
    {
      "name": "wazuh-alerts-75",
      "documents": [
        {
          "target-index": "wazuh-alerts-75",
          "source-file": "wazuh-alerts-75.json",
          "document-count": 51993
        }
      ]
    },
    {
      "name": "wazuh-alerts-80",
      "documents": [
        {
          "target-index": "wazuh-alerts-80",
          "source-file": "wazuh-alerts-80.json",
          "document-count": 55225
        }
      ]
    },
    {
      "name": "wazuh-alerts-85",
      "documents": [
        {
          "target-index": "wazuh-alerts-85",
          "source-file": "wazuh-alerts-85.json",
          "document-count": 58442
        }
      ]
    },
    {
      "name": "wazuh-alerts-90",
      "documents": [
        {
          "target-index": "wazuh-alerts-90",
          "source-file": "wazuh-alerts-90.json",
          "document-count": 61854
        }
      ]
    },
    {
      "name": "wazuh-alerts-95",
      "documents": [
        {
          "target-index": "wazuh-alerts-95",
          "source-file": "wazuh-alerts-95.json",
          "document-count": 65786
        }
      ]
    },
    {
      "name": "wazuh-alerts-100",
      "documents": [
        {
          "target-index": "wazuh-alerts-100",
          "source-file": "wazuh-alerts-100.json",
          "document-count": 69053
        }
      ]
    }
  ],
  "test_procedures": [
    {{ benchmark.collect(parts="test_procedures/*.json") }}
  ]
}
test_procedures/default.json
{
  "name": "Wazuh Alerts Ingestion Test",
  "description": "Test ingestion in 5MB increments",
  "default": true,
  "schedule": [
    {
      "operation": {
        "name": "bulk-index-5-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-5",
        "bulk-size": 3426
      }
    },
    {
      "operation": {
        "name": "bulk-index-10-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-10",
        "bulk-size": 6616
      }
    },
    {
      "operation": {
        "name": "bulk-index-15-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-15",
        "bulk-size": 9958
      }
    },
    {
      "operation": {
        "name": "bulk-index-20-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-20",
        "bulk-size": 13933
      }
    },
    {
      "operation": {
        "name": "bulk-index-25-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-25",
        "bulk-size": 17180
      }
    },
    {
      "operation": {
        "name": "bulk-index-30-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-30",
        "bulk-size": 20404
      }
    },
    {
      "operation": {
        "name": "bulk-index-35-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-35",
        "bulk-size": 23737
      }
    },
    {
      "operation": {
        "name": "bulk-index-40-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-40",
        "bulk-size": 27706
      }
    },
    {
      "operation": {
        "name": "bulk-index-45-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-45",
        "bulk-size": 30998
      }
    },
    {
      "operation": {
        "name": "bulk-index-50-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-50",
        "bulk-size": 34187
      }
    },
    {
      "operation": {
        "name": "bulk-index-55-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-55",
        "bulk-size": 37774
      }
    },
    {
      "operation": {
        "name": "bulk-index-60-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-60",
        "bulk-size": 41473
      }
    },
    {
      "operation": {
        "name": "bulk-index-65-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-65",
        "bulk-size": 44729
      }
    },
    {
      "operation": {
        "name": "bulk-index-70-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-70",
        "bulk-size": 47947
      }
    },
    {
      "operation": {
        "name": "bulk-index-75-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-75",
        "bulk-size": 51993
      }
    },
    {
      "operation": {
        "name": "bulk-index-80-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-80",
        "bulk-size": 55225
      }
    },
    {
      "operation": {
        "name": "bulk-index-85-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-85",
        "bulk-size": 58442
      }
    },
    {
      "operation": {
        "name": "bulk-index-90-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-90",
        "bulk-size": 61854
      }
    },
    {
      "operation": {
        "name": "bulk-index-95-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-95",
        "bulk-size": 65786
      }
    },
    {
      "operation": {
        "name": "bulk-index-100-mb",
        "operation-type": "bulk",
        "corpora": "wazuh-alerts-100",
        "bulk-size": 69053
      }
    }
  ]
}
Local results
root@os-benchmarks:~# docker compose up opensearch-benchmark
[+] Running 3/0
 ✔ Container opensearch-node1      Running                                                                                                                                       0.0s 
 ✔ Container permissions-setter    Created                                                                                                                                       0.0s 
 ✔ Container opensearch-benchmark  Created                                                                                                                                       0.0s 
Attaching to opensearch-benchmark
opensearch-benchmark  | 
opensearch-benchmark  |    ____                  _____                      __       ____                  __                         __
opensearch-benchmark  |   / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
opensearch-benchmark  |  / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
opensearch-benchmark  | / /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
opensearch-benchmark  | \____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
opensearch-benchmark  |     /_/
opensearch-benchmark  | 
opensearch-benchmark  | [INFO] [Test Execution ID]: 83bb247f-44e8-46f0-a763-73f10b6d4577
opensearch-benchmark  | [INFO] You did not provide an explicit timeout in the client options. Assuming default of 10 seconds.
opensearch-benchmark  | [WARNING] merges_total_time is 1059 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
opensearch-benchmark  | [WARNING] indexing_total_time is 16708 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
opensearch-benchmark  | [WARNING] refresh_total_time is 11434 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
opensearch-benchmark  | [WARNING] flush_total_time is 38 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
opensearch-benchmark  | Running bulk-index-5-mb                                                        [100% done]
Running bulk-index-10-mb                                                       [100% done]
Running bulk-index-15-mb                                                       [100% done]
Running bulk-index-20-mb                                                       [100% done]
Running bulk-index-25-mb                                                       [100% done]
Running bulk-index-30-mb                                                       [100% done]
Running bulk-index-35-mb                                                       [100% done]
Running bulk-index-40-mb                                                       [100% done]
Running bulk-index-45-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] rejected_execution_exception ({'error': {'root_cause': [{'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=56497001, max_coordinating_and_primary_bytes=53687091]'}], 'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=56497001, max_coordinating_and_primary_bytes=53687091]'}, 'status': 429})
Running bulk-index-50-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] rejected_execution_exception ({'error': {'root_cause': [{'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=62166679, max_coordinating_and_primary_bytes=53687091]'}], 'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=62166679, max_coordinating_and_primary_bytes=53687091]'}, 'status': 429})
Running bulk-index-55-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [516691048/492.7mb], which is larger than the limit of [510027366/486.3mb], real usage: [387461696/369.5mb], new bytes reserved: [129229352/123.2mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=129229352/123.2mb]', 'bytes_wanted': 516691048, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [516691048/492.7mb], which is larger than the limit of [510027366/486.3mb], real usage: [387461696/369.5mb], new bytes reserved: [129229352/123.2mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=129229352/123.2mb]', 'bytes_wanted': 516691048, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429})
Running bulk-index-60-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] rejected_execution_exception ({'error': {'root_cause': [{'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=73480223, max_coordinating_and_primary_bytes=53687091]'}], 'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=73480223, max_coordinating_and_primary_bytes=53687091]'}, 'status': 429})
Running bulk-index-65-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [541932828/516.8mb], which is larger than the limit of [510027366/486.3mb], real usage: [391200848/373mb], new bytes reserved: [150731980/143.7mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=150731980/143.7mb]', 'bytes_wanted': 541932828, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [541932828/516.8mb], which is larger than the limit of [510027366/486.3mb], real usage: [391200848/373mb], new bytes reserved: [150731980/143.7mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=150731980/143.7mb]', 'bytes_wanted': 541932828, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429})
Running bulk-index-70-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [555845794/530mb], which is larger than the limit of [510027366/486.3mb], real usage: [394297376/376mb], new bytes reserved: [161548418/154mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=161548418/154mb]', 'bytes_wanted': 555845794, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [555845794/530mb], which is larger than the limit of [510027366/486.3mb], real usage: [394297376/376mb], new bytes reserved: [161548418/154mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=161548418/154mb]', 'bytes_wanted': 555845794, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429})
Running bulk-index-75-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [572270300/545.7mb], which is larger than the limit of [510027366/486.3mb], real usage: [399970840/381.4mb], new bytes reserved: [172299460/164.3mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=172299460/164.3mb]', 'bytes_wanted': 572270300, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [572270300/545.7mb], which is larger than the limit of [510027366/486.3mb], real usage: [399970840/381.4mb], new bytes reserved: [172299460/164.3mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=172299460/164.3mb]', 'bytes_wanted': 572270300, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429})
Running bulk-index-80-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [591658024/564.2mb], which is larger than the limit of [510027366/486.3mb], real usage: [408608040/389.6mb], new bytes reserved: [183049984/174.5mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=183049984/174.5mb]', 'bytes_wanted': 591658024, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [591658024/564.2mb], which is larger than the limit of [510027366/486.3mb], real usage: [408608040/389.6mb], new bytes reserved: [183049984/174.5mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=183049984/174.5mb]', 'bytes_wanted': 591658024, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429})
Running bulk-index-85-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] rejected_execution_exception ({'error': {'root_cause': [{'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=101732377, max_coordinating_and_primary_bytes=53687091]'}], 'type': 'rejected_execution_exception', 'reason': 'rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=101732377, max_coordinating_and_primary_bytes=53687091]'}, 'status': 429})
Running bulk-index-90-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] circuit_breaking_exception ({'error': {'root_cause': [{'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [526842176/502.4mb], which is larger than the limit of [510027366/486.3mb], real usage: [322218352/307.2mb], new bytes reserved: [204623824/195.1mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=204623824/195.1mb]', 'bytes_wanted': 526842176, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}], 'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<http_request>] would be [526842176/502.4mb], which is larger than the limit of [510027366/486.3mb], real usage: [322218352/307.2mb], new bytes reserved: [204623824/195.1mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=204623824/195.1mb]', 'bytes_wanted': 526842176, 'bytes_limit': 510027366, 'durability': 'TRANSIENT'}, 'status': 429})
Running bulk-index-95-mb                                                       [100% done]
opensearch-benchmark  | [ERROR] 
Running bulk-index-100-mb                                                      [100% done][INFO] Executing test with workload [wazuh-alerts], test_procedure [Wazuh Alerts Ingestion Test] and provision_config_instance ['external'] with version [2.14.0].
opensearch-benchmark  | 
opensearch-benchmark  | 
opensearch-benchmark  | ------------------------------------------------------
opensearch-benchmark  |     _______             __   _____
opensearch-benchmark  |    / ____(_)___  ____ _/ /  / ___/_________  ________
opensearch-benchmark  |   / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
opensearch-benchmark  |  / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
opensearch-benchmark  | /_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
opensearch-benchmark  | ------------------------------------------------------
opensearch-benchmark  |             
opensearch-benchmark  | |                                                         Metric |              Task |       Value |   Unit |
opensearch-benchmark  | |---------------------------------------------------------------:|------------------:|------------:|-------:|
opensearch-benchmark  | |                     Cumulative indexing time of primary shards |                   |    0.512283 |    min |
opensearch-benchmark  | |             Min cumulative indexing time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |          Median cumulative indexing time across primary shards |                   |   0.0343667 |    min |
opensearch-benchmark  | |             Max cumulative indexing time across primary shards |                   |   0.0954833 |    min |
opensearch-benchmark  | |            Cumulative indexing throttle time of primary shards |                   |           0 |    min |
opensearch-benchmark  | |    Min cumulative indexing throttle time across primary shards |                   |           0 |    min |
opensearch-benchmark  | | Median cumulative indexing throttle time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |    Max cumulative indexing throttle time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |                        Cumulative merge time of primary shards |                   |     0.01765 |    min |
opensearch-benchmark  | |                       Cumulative merge count of primary shards |                   |          74 |        |
opensearch-benchmark  | |                Min cumulative merge time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |             Median cumulative merge time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |                Max cumulative merge time across primary shards |                   |     0.01765 |    min |
opensearch-benchmark  | |               Cumulative merge throttle time of primary shards |                   |           0 |    min |
opensearch-benchmark  | |       Min cumulative merge throttle time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |    Median cumulative merge throttle time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |       Max cumulative merge throttle time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |                      Cumulative refresh time of primary shards |                   |    0.223817 |    min |
opensearch-benchmark  | |                     Cumulative refresh count of primary shards |                   |         787 |        |
opensearch-benchmark  | |              Min cumulative refresh time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |           Median cumulative refresh time across primary shards |                   |    0.008775 |    min |
opensearch-benchmark  | |              Max cumulative refresh time across primary shards |                   |    0.103083 |    min |
opensearch-benchmark  | |                        Cumulative flush time of primary shards |                   |  0.00116667 |    min |
opensearch-benchmark  | |                       Cumulative flush count of primary shards |                   |           5 |        |
opensearch-benchmark  | |                Min cumulative flush time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |             Median cumulative flush time across primary shards |                   |           0 |    min |
opensearch-benchmark  | |                Max cumulative flush time across primary shards |                   | 0.000533333 |    min |
opensearch-benchmark  | |                                        Total Young Gen GC time |                   |       0.502 |      s |
opensearch-benchmark  | |                                       Total Young Gen GC count |                   |         116 |        |
opensearch-benchmark  | |                                          Total Old Gen GC time |                   |           0 |      s |
opensearch-benchmark  | |                                         Total Old Gen GC count |                   |           0 |        |
opensearch-benchmark  | |                                                     Store size |                   |    0.268493 |     GB |
opensearch-benchmark  | |                                                  Translog size |                   |    0.463153 |     GB |
opensearch-benchmark  | |                                         Heap used for segments |                   |           0 |     MB |
opensearch-benchmark  | |                                       Heap used for doc values |                   |           0 |     MB |
opensearch-benchmark  | |                                            Heap used for terms |                   |           0 |     MB |
opensearch-benchmark  | |                                            Heap used for norms |                   |           0 |     MB |
opensearch-benchmark  | |                                           Heap used for points |                   |           0 |     MB |
opensearch-benchmark  | |                                    Heap used for stored fields |                   |           0 |     MB |
opensearch-benchmark  | |                                                  Segment count |                   |          58 |        |
opensearch-benchmark  | |                                                 Min Throughput |   bulk-index-5-mb |      7803.9 | docs/s |
opensearch-benchmark  | |                                                Mean Throughput |   bulk-index-5-mb |      7803.9 | docs/s |
opensearch-benchmark  | |                                              Median Throughput |   bulk-index-5-mb |      7803.9 | docs/s |
opensearch-benchmark  | |                                                 Max Throughput |   bulk-index-5-mb |      7803.9 | docs/s |
opensearch-benchmark  | |                                       100th percentile latency |   bulk-index-5-mb |     429.577 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |   bulk-index-5-mb |     429.577 |     ms |
opensearch-benchmark  | |                                                     error rate |   bulk-index-5-mb |           0 |      % |
opensearch-benchmark  | |                                                 Min Throughput |  bulk-index-10-mb |     8445.34 | docs/s |
opensearch-benchmark  | |                                                Mean Throughput |  bulk-index-10-mb |     8445.34 | docs/s |
opensearch-benchmark  | |                                              Median Throughput |  bulk-index-10-mb |     8445.34 | docs/s |
opensearch-benchmark  | |                                                 Max Throughput |  bulk-index-10-mb |     8445.34 | docs/s |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-10-mb |     774.252 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-10-mb |     774.252 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-10-mb |           0 |      % |
opensearch-benchmark  | |                                                 Min Throughput |  bulk-index-15-mb |        9598 | docs/s |
opensearch-benchmark  | |                                                Mean Throughput |  bulk-index-15-mb |        9598 | docs/s |
opensearch-benchmark  | |                                              Median Throughput |  bulk-index-15-mb |        9598 | docs/s |
opensearch-benchmark  | |                                                 Max Throughput |  bulk-index-15-mb |        9598 | docs/s |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-15-mb |     1028.43 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-15-mb |     1028.43 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-15-mb |           0 |      % |
opensearch-benchmark  | |                                                 Min Throughput |  bulk-index-20-mb |     9030.53 | docs/s |
opensearch-benchmark  | |                                                Mean Throughput |  bulk-index-20-mb |     9030.53 | docs/s |
opensearch-benchmark  | |                                              Median Throughput |  bulk-index-20-mb |     9030.53 | docs/s |
opensearch-benchmark  | |                                                 Max Throughput |  bulk-index-20-mb |     9030.53 | docs/s |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-20-mb |     1530.53 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-20-mb |     1530.53 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-20-mb |           0 |      % |
opensearch-benchmark  | |                                                 Min Throughput |  bulk-index-25-mb |     9354.05 | docs/s |
opensearch-benchmark  | |                                                Mean Throughput |  bulk-index-25-mb |     9354.05 | docs/s |
opensearch-benchmark  | |                                              Median Throughput |  bulk-index-25-mb |     9354.05 | docs/s |
opensearch-benchmark  | |                                                 Max Throughput |  bulk-index-25-mb |     9354.05 | docs/s |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-25-mb |     1821.08 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-25-mb |     1821.08 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-25-mb |           0 |      % |
opensearch-benchmark  | |                                                 Min Throughput |  bulk-index-30-mb |     10155.6 | docs/s |
opensearch-benchmark  | |                                                Mean Throughput |  bulk-index-30-mb |     10155.6 | docs/s |
opensearch-benchmark  | |                                              Median Throughput |  bulk-index-30-mb |     10155.6 | docs/s |
opensearch-benchmark  | |                                                 Max Throughput |  bulk-index-30-mb |     10155.6 | docs/s |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-30-mb |     1992.57 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-30-mb |     1992.57 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-30-mb |           0 |      % |
opensearch-benchmark  | |                                                 Min Throughput |  bulk-index-35-mb |     9319.89 | docs/s |
opensearch-benchmark  | |                                                Mean Throughput |  bulk-index-35-mb |     9319.89 | docs/s |
opensearch-benchmark  | |                                              Median Throughput |  bulk-index-35-mb |     9319.89 | docs/s |
opensearch-benchmark  | |                                                 Max Throughput |  bulk-index-35-mb |     9319.89 | docs/s |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-35-mb |     2526.72 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-35-mb |     2526.72 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-35-mb |           0 |      % |
opensearch-benchmark  | |                                                 Min Throughput |  bulk-index-40-mb |     8431.44 | docs/s |
opensearch-benchmark  | |                                                Mean Throughput |  bulk-index-40-mb |     8431.44 | docs/s |
opensearch-benchmark  | |                                              Median Throughput |  bulk-index-40-mb |     8431.44 | docs/s |
opensearch-benchmark  | |                                                 Max Throughput |  bulk-index-40-mb |     8431.44 | docs/s |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-40-mb |     3258.53 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-40-mb |     3258.53 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-40-mb |           0 |      % |
opensearch-benchmark  | |                                                 Min Throughput |  bulk-index-45-mb |     8907.38 | docs/s |
opensearch-benchmark  | |                                                Mean Throughput |  bulk-index-45-mb |     8907.38 | docs/s |
opensearch-benchmark  | |                                              Median Throughput |  bulk-index-45-mb |     8907.38 | docs/s |
opensearch-benchmark  | |                                                 Max Throughput |  bulk-index-45-mb |     8907.38 | docs/s |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-45-mb |     3453.96 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-45-mb |     3453.96 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-45-mb |           0 |      % |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-50-mb |     225.545 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-50-mb |     225.545 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-50-mb |         100 |      % |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-55-mb |     210.582 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-55-mb |     210.582 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-55-mb |         100 |      % |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-60-mb |     194.882 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-60-mb |     194.882 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-60-mb |         100 |      % |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-65-mb |     243.262 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-65-mb |     243.262 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-65-mb |         100 |      % |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-70-mb |     209.641 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-70-mb |     209.641 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-70-mb |         100 |      % |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-75-mb |     262.526 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-75-mb |     262.526 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-75-mb |         100 |      % |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-80-mb |     252.767 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-80-mb |     252.767 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-80-mb |         100 |      % |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-85-mb |     262.211 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-85-mb |     262.211 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-85-mb |         100 |      % |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-90-mb |      320.98 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-90-mb |      320.98 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-90-mb |         100 |      % |
opensearch-benchmark  | |                                       100th percentile latency |  bulk-index-95-mb |     300.169 |     ms |
opensearch-benchmark  | |                                  100th percentile service time |  bulk-index-95-mb |     300.169 |     ms |
opensearch-benchmark  | |                                                     error rate |  bulk-index-95-mb |         100 |      % |
opensearch-benchmark  | |                                       100th percentile latency | bulk-index-100-mb |     165.695 |     ms |
opensearch-benchmark  | |                                  100th percentile service time | bulk-index-100-mb |     165.695 |     ms |
opensearch-benchmark  | |                                                     error rate | bulk-index-100-mb |         100 |      % |
opensearch-benchmark  | 
opensearch-benchmark  | 
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-50-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-50-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-55-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-55-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-60-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-60-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-65-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-65-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-70-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-70-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-75-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-75-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-80-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-80-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-85-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-85-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-90-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-90-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-95-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-95-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | [WARNING] Error rate is 100.0 for operation 'bulk-index-100-mb'. Please check the logs.
opensearch-benchmark  | [WARNING] No throughput metrics available for [bulk-index-100-mb]. Likely cause: Error rate is 100.0%. Please check the logs.
opensearch-benchmark  | 
opensearch-benchmark  | ---------------------------------
opensearch-benchmark  | [INFO] SUCCESS (took 136 seconds)
opensearch-benchmark  | ---------------------------------
opensearch-benchmark exited with code 0

Operations above 50 MB return a 429 error code (Too Many Requests), so I need to tweak this a little further to make sure I'm giving the cluster enough time to process each request.

So far, I've only tested this on my local dockerized environment, but I have the EC2 infrastructure ready to run the tests as soon as I've refined the workloads to include shard allocation and proper warm-up and clean up stages.
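
As a side note, if the 429s come from write thread pool rejections, a quick check like the one below could confirm it. This is only a sketch; it assumes an admin user and a cluster node reachable at https://localhost:9200, which may differ from the actual setup.

# Hypothetical check, not part of the benchmark workload: inspect the write
# thread pool to see whether bulk requests are being queued or rejected.
curl -sku admin:"$OPENSEARCH_INITIAL_ADMIN_PASSWORD" \
  "https://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected"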

@f-galland

f-galland commented Jul 4, 2024

Setup:

The benchmarks were run on 3 EC2 instances, each with 16 GB of RAM and an 8-core, 2200 MHz AMD EPYC 7571 processor.
Each node had the Docker backend installed and was controlled through a Docker context from my local machine.

docker context ls
$ docker context ls
NAME        DESCRIPTION                               DOCKER ENDPOINT               ERROR
benchmark                                             ssh://root@benchmark          
default *   Current DOCKER_HOST based configuration   unix:///var/run/docker.sock   
node-1                                                ssh://root@benchmark-node1    
node-2                                                ssh://root@benchmark-node2    
node-3                                                ssh://root@benchmark-node3  
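
For reference, remote contexts like these can be created over SSH. The commands below are only a sketch, assuming root SSH access to the hostnames shown in the listing above:

# Hypothetical context setup, assuming root SSH access to each host under
# the hostnames shown above.
docker context create benchmark --docker "host=ssh://root@benchmark"
for i in 1 2 3; do
  docker context create "node-$i" --docker "host=ssh://root@benchmark-node$i"
done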

Each node had its own docker compose:

node-1.yml
services:

  opensearch-node1: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/)
    image: opensearchproject/opensearch:2.14.0
    container_name: opensearch-node1
    hostname: opensearch-node1
    environment:
      - NODE1_LOCAL_IP=${NODE1_LOCAL_IP}
      - NODE2_LOCAL_IP=${NODE2_LOCAL_IP}
      - NODE3_LOCAL_IP=${NODE3_LOCAL_IP}
      - cluster.name=opensearch-cluster # Name the cluster
      - network.publish_host=${NODE1_LOCAL_IP}
      - http.publish_host=${NODE1_LOCAL_IP}
      - transport.publish_host=${NODE1_LOCAL_IP}
      - node.name=opensearch-node1 # Name the node that will run in this container
      - discovery.seed_hosts=${NODE1_LOCAL_IP},${NODE2_LOCAL_IP},${NODE3_LOCAL_IP}, # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2,opensearch-node3 # Nodes eligible to serve as cluster manager
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms8g -Xmx8g" # Set min and max JVM heap sizes to at least 50% of system RAM
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD} # Sets the demo admin user password when using demo configuration (for OpenSearch 2.12 and later)
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
        #healthcheck:
        #  test: curl -sku admin:${OPENSEARCH_INITIAL_ADMIN_PASSWORD} https://opensearch-node1:9200/_cat/health | grep -q opensearch-cluster
        #  start_period: 10s
        #  start_interval: 3s
    ports:
      - 9200:9200 # REST API
      - 9300:9300 # Transport layer (node-to-node communication)
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network
  
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.14.0 # Make sure the version of opensearch-dashboards matches the version of opensearch installed on other nodes
    container_name: opensearch-dashboards
      #depends_on:
      #  opensearch-node1:
      #    condition: service_healthy
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      - NODE1_LOCAL_IP=${NODE1_LOCAL_IP}
      - NODE2_LOCAL_IP=${NODE2_LOCAL_IP}
      - NODE3_LOCAL_IP=${NODE3_LOCAL_IP}
      - OPENSEARCH_HOSTS=["https://${NODE1_LOCAL_IP}:9200","https://${NODE2_LOCAL_IP}:9200","https://${NODE3_LOCAL_IP}:9200"]
    networks:
      - opensearch-net

volumes:
  opensearch-data1:

networks:
  opensearch-net:
node-2.yml
services:

  opensearch-node2:
    image: opensearchproject/opensearch:2.14.0 # This should be the same image used for opensearch-node1 to avoid issues
    container_name: opensearch-node2
    hostname: opensearch-node2
    environment:
      - NODE1_LOCAL_IP=${NODE1_LOCAL_IP}
      - NODE2_LOCAL_IP=${NODE2_LOCAL_IP}
      - NODE3_LOCAL_IP=${NODE3_LOCAL_IP}
      - cluster.name=opensearch-cluster
      - network.publish_host=${NODE2_LOCAL_IP}
      - http.publish_host=${NODE2_LOCAL_IP}
      - transport.publish_host=${NODE2_LOCAL_IP}
      - node.name=opensearch-node2
      - discovery.seed_hosts=${NODE1_LOCAL_IP},${NODE2_LOCAL_IP},${NODE3_LOCAL_IP}, # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2,opensearch-node3
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms8g -Xmx8g"
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
    ports:
      - 9200:9200
      - 9300:9300
      - 9600:9600
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data
    networks:
      - opensearch-net

volumes:
  opensearch-data2:

networks:
  opensearch-net:
node-3.yml
services:

  opensearch-node3:
    image: opensearchproject/opensearch:2.14.0 # This should be the same image used for opensearch-node1 to avoid issues
    container_name: opensearch-node3
    hostname: opensearch-node3
    environment:
      - NODE1_LOCAL_IP=${NODE1_LOCAL_IP}
      - NODE2_LOCAL_IP=${NODE2_LOCAL_IP}
      - NODE3_LOCAL_IP=${NODE3_LOCAL_IP}
      - network.publish_host=${NODE3_LOCAL_IP}
      - http.publish_host=${NODE3_LOCAL_IP}
      - transport.publish_host=${NODE3_LOCAL_IP}
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node3
      - discovery.seed_hosts=${NODE1_LOCAL_IP},${NODE2_LOCAL_IP},${NODE3_LOCAL_IP}, # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2,opensearch-node3
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms8g -Xmx8g"
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
    ports:
      - 9200:9200
      - 9300:9300
      - 9600:9600
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data3:/usr/share/opensearch/data
    networks:
      - opensearch-net

volumes:
  opensearch-data3:

networks:
  opensearch-net:

The cluster itself was brought up with a script run from my local machine (making use of the remote contexts) for convenience:

cluster.sh
#!/bin/bash


case $1 in
  down)
    for i in {1..3}
    do
      echo "Bringing node-$i down"
      docker --context=node-$i compose -f node-$i.yml down -v
    done
  ;;
  up)
    for i in {1..3}
    do
      echo "Bringing node-$i up"
      docker --context=node-$i compose -f node-$i.yml up -d
    done
  ;;
  logs)
    docker --context=node-$2 logs opensearch-node$2
  ;;
  ps)
    docker --context=node-$2 ps -a
  ;;
  run)
    docker --context=benchmark compose -f benchmark.yml up -d 
  ;;
  results)
    docker --context=benchmark logs opensearch-benchmark -f
  ;;
  *)
    echo "Unrecognized option"
  ;;
esac

exit 0
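
For reference, typical usage of the script would look like this (assuming it is saved as cluster.sh next to the node-*.yml and benchmark.yml compose files):

# Bring the three-node cluster up, inspect node 1, then launch the benchmark.
./cluster.sh up
./cluster.sh ps 1        # list containers on node-1
./cluster.sh logs 1      # show the OpenSearch logs on node-1
./cluster.sh run         # start the benchmark container
./cluster.sh results     # follow the benchmark output
./cluster.sh down        # tear everything down and remove volumes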

Lastly, a 4th EC2 instance was used to run the actual benchmark from the following Docker Compose file:

docker-compose.yml
services:

  opensearch-benchmark:
    image: opensearchproject/opensearch-benchmark:1.6.0
    hostname: opensearch-benchmark
    container_name: opensearch-benchmark
    volumes:
      - /root/benchmarks:/opensearch-benchmark/.benchmark
    environment:
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
      - BENCHMARK_NAME=${BENCHMARK_NAME}
      - NODE1_LOCAL_IP=${NODE1_LOCAL_IP}
      - NODE2_LOCAL_IP=${NODE2_LOCAL_IP}
      - NODE3_LOCAL_IP=${NODE3_LOCAL_IP}
        #command: execute-test --target-hosts https://opensearch-node1:9200 --pipeline benchmark-only --workload geonames --client-options basic_auth_user:admin,basic_auth_password:${OPENSEARCH_INITIAL_ADMIN_PASSWORD},verify_certs:false --test-mode
    command: execute-test --pipeline="benchmark-only" --workload-path="/opensearch-benchmark/.benchmark/${BENCHMARK_NAME}" --target-hosts="https://${NODE1_LOCAL_IP}:9200,https://${NODE2_LOCAL_IP}:9200,https://${NODE3_LOCAL_IP}:9200" --client-options="basic_auth_user:admin,basic_auth_password:${OPENSEARCH_INITIAL_ADMIN_PASSWORD},verify_certs:false"
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network

  permissions-setter:
    image: alpine:3.14
    container_name: permissions-setter
    volumes:
      - /root/benchmarks:/benchmark
    entrypoint: /bin/sh
    command: >
      -c '
        chmod -R a+rw /benchmark
      '
  opensearch-node1: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/)
    image: opensearchproject/opensearch:2.14.0
    container_name: opensearch-node1
    hostname: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node1 # Name the node that will run in this container
      - cluster.initial_cluster_manager_nodes=opensearch-node1 # Nodes eligible to serve as cluster manager
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD} # Sets the demo admin user password when using demo configuration (for OpenSearch 2.12 and later)
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
        #healthcheck:
        #  test: curl -sku admin:${OPENSEARCH_INITIAL_ADMIN_PASSWORD} https://opensearch-node1:9200/_cat/health | grep -q opensearch-cluster
        #  start_period: 10s
        #  start_interval: 3s
    ports:
      - 9200:9200 # REST API
      - 9300:9300 # Transport layer (node-to-node communication)
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network
  
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.14.0 # Make sure the version of opensearch-dashboards matches the version of opensearch installed on other nodes
    container_name: opensearch-dashboards
      #depends_on:
      #  opensearch-node1:
      #    condition: service_healthy
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      - OPENSEARCH_HOSTS=["https://opensearch-node1:9200"]
    networks:
      - opensearch-net

volumes:
  opensearch-data1:

networks:
  opensearch-net:

I initially also included an OpenSearch node on this machine because opensearch-benchmark allows its test output to be directed to an OpenSearch cluster for analysis and visualization, but I later opted to use the CSV output instead.

Benchmark files:

To create the benchmark files, I downloaded my wazuh-alerts-* indices from an existing Wazuh installation and consolidated them into a single JSON file. I trimmed that file down to 20 MB and then created smaller versions of it with sizes ranging from 1 to 20 MB.

This was done using a simple bash script:

generate_files.sh
#!/bin/bash


PREFIX="wazuh-alerts"
SOURCE_FILE="$PREFIX-100.json"
MB_TO_B=1048576

for i in {01..20}                          # zero-padded so the filenames match generate_config.sh below
do
  MB_SIZE=$i
  SIZE=$(( 10#$MB_SIZE * $MB_TO_B ))       # force base 10 so "08" and "09" aren't parsed as octal
  FILENAME="$PREFIX-$MB_SIZE.json"
  head -c $SIZE $SOURCE_FILE > $FILENAME   # keep the first $MB_SIZE MB of the source corpus
  sed -i '$ d' $FILENAME                   # drop the last line, which may have been cut mid-document
done
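
A quick sanity check on the generated corpora might look like this. It is hypothetical (not part of the original workflow) and assumes jq is installed and that the corpora are newline-delimited JSON:

# Hypothetical sanity check: verify each trimmed corpus is valid NDJSON
# and report its size in bytes and its document (line) count.
for f in wazuh-alerts-*.json; do
  jq empty "$f" || echo "invalid JSON in $f"
  printf '%s: %s bytes, %s docs\n' "$f" "$(wc -c < "$f")" "$(wc -l < "$f")"
done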

The rest of the configuration files for the benchmark were created using the following bash script (which admittedly is a little rough around the edges, but still works):

generate_config.sh
#!/bin/bash

PREFIX="wazuh-alerts"
SOURCE_FILE="$PREFIX-20.json"
MB_TO_B=1048576
CORPORAE=()
TEST_PROCEDURES=()
INDICES=()
OPERATIONS=()
PARALLEL_JOBS=4
SINGLE_BULK_TEST=()
PARALLEL_BULK_TEST=()
TASKS=()
CLIENTS=2

for i in {01..20}
do
  MB_SIZE=$i
  NAME="$PREFIX-$MB_SIZE"
  FILENAME="$NAME.json"
  DOCUMENT_COUNT=$(wc -l $FILENAME | cut -d' ' -f1)
  OPERATION_NAME="${MB_SIZE}MB-bulk"

  CORPORAE+="
    {
      \"name\": \"$NAME\",
      \"documents\": [
        {
          \"target-index\": \"$NAME\",
          \"source-file\": \"$FILENAME\",
          \"document-count\": $DOCUMENT_COUNT
        }
      ]
    },"

  SINGLE_BULK_TEST+="
    {
      \"operation\": \"$OPERATION_NAME\",
      \"clients\": $CLIENTS
    },"

  OPERATIONS+="
    {
      \"name\": \"$OPERATION_NAME\",
      \"operation-type\": \"bulk\",
      \"corpora\": \"$NAME\",
      \"bulk-size\": $DOCUMENT_COUNT
    },"
  

  INDICES+="
    {
      \"name\": \"$NAME\",
      \"body\": \"${PREFIX}.json\"
    },"
done
  
SINGLE_BULK_TEST=${SINGLE_BULK_TEST%%,}

TEST_PROCEDURES+="
  {
    \"name\": \"single-bulk-index-test\",
    \"description\": \"Wazuh Alerts bulk index test\",
    \"default\": true,
    \"schedule\": [
      ${SINGLE_BULK_TEST}
    ]
  },"
  

for i in {01..05}
do
  MB_SIZE=$i
  OPERATION_NAME="${MB_SIZE}MB-bulk"
  TASKS=()
  for j in $(seq --format="%02g" 1 ${PARALLEL_JOBS})
  do
    TASKS+="
      {
        \"name\": \"parallel-test-${i}-thread-${j}\",
        \"operation\": \"$OPERATION_NAME\",
        \"clients\": $CLIENTS
      },"
  done
  TASKS=${TASKS%%,}
  PARALLEL_BULK_TEST+="
        {
          \"parallel\": {
            \"tasks\": [
              ${TASKS}
            ]
          }
        },"
done
PARALLEL_BULK_TEST=${PARALLEL_BULK_TEST%%,}
 
TEST_PROCEDURES+="
    {
      \"name\": \"parallel-bulk-index-test\",
      \"description\": \"Test using ${PARALLEL_JOBS} parallel indexing operations\",
      \"schedule\": [
      	${PARALLEL_BULK_TEST}
      ]
    },"

CORPORAE=${CORPORAE%%,}
OPERATIONS=${OPERATIONS%%,}
TEST_PROCEDURES=${TEST_PROCEDURES%%,}
INDICES=${INDICES%%,}

OLDIFS=$IFS
IFS=$'`'

WORKLOAD="
{% import \"benchmark.helpers\" as benchmark with context %}
{
  \"version\": 2,
  \"description\": \"Wazuh Indexer Bulk Benchmarks\",
  \"indices\": [
    ${INDICES[@]}
  ],
  \"corpora\": [
    ${CORPORAE[@]}
  ],
  \"operations\": [
    {{ benchmark.collect(parts=\"operations/*.json\") }}
  ],
  \"test_procedures\": [
    {{ benchmark.collect(parts=\"test_procedures/*.json\") }}
  ]
}
"

mkdir -p ./operations
mkdir -p ./test_procedures

echo ${OPERATIONS[@]} > ./operations/default.json
echo ${TEST_PROCEDURES[@]} > ./test_procedures/default.json
echo ${WORKLOAD[@]} > ./workload.json

IFS=$OLDIFS

This script generates the workload.json, operations/default.json and test_procedures/default.json files required to run the benchmark.

Tests

The nature of the benchmark can be assessed by looking at the generated test_procedures/default.json file.

test_procedures/default.json
{
  "name": "single-bulk-index-test",
  "description": "Wazuh Alerts bulk index test",
  "default": true,
  "schedule": [
    {
      "operation": "01MB-bulk",
      "clients": 2
    },
    {
      "operation": "02MB-bulk",
      "clients": 2
    },
    {
      "operation": "03MB-bulk",
      "clients": 2
    },
    {
      "operation": "04MB-bulk",
      "clients": 2
    },
    {
      "operation": "05MB-bulk",
      "clients": 2
    },
    {
      "operation": "06MB-bulk",
      "clients": 2
    },
    {
      "operation": "07MB-bulk",
      "clients": 2
    },
    {
      "operation": "08MB-bulk",
      "clients": 2
    },
    {
      "operation": "09MB-bulk",
      "clients": 2
    },
    {
      "operation": "10MB-bulk",
      "clients": 2
    },
    {
      "operation": "11MB-bulk",
      "clients": 2
    },
    {
      "operation": "12MB-bulk",
      "clients": 2
    },
    {
      "operation": "13MB-bulk",
      "clients": 2
    },
    {
      "operation": "14MB-bulk",
      "clients": 2
    },
    {
      "operation": "15MB-bulk",
      "clients": 2
    },
    {
      "operation": "16MB-bulk",
      "clients": 2
    },
    {
      "operation": "17MB-bulk",
      "clients": 2
    },
    {
      "operation": "18MB-bulk",
      "clients": 2
    },
    {
      "operation": "19MB-bulk",
      "clients": 2
    },
    {
      "operation": "20MB-bulk",
      "clients": 2
    }
  ]
},
{
  "name": "parallel-bulk-index-test",
  "description": "Test using 4 parallel indexing operations",
  "schedule": [
    {
      "parallel": {
        "tasks": [
          {
            "name": "parallel-test-01-thread-01",
            "operation": "01MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-01-thread-02",
            "operation": "01MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-01-thread-03",
            "operation": "01MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-01-thread-04",
            "operation": "01MB-bulk",
            "clients": 2
          }
        ]
      }
    },
    {
      "parallel": {
        "tasks": [
          {
            "name": "parallel-test-02-thread-01",
            "operation": "02MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-02-thread-02",
            "operation": "02MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-02-thread-03",
            "operation": "02MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-02-thread-04",
            "operation": "02MB-bulk",
            "clients": 2
          }
        ]
      }
    },
    {
      "parallel": {
        "tasks": [
          {
            "name": "parallel-test-03-thread-01",
            "operation": "03MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-03-thread-02",
            "operation": "03MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-03-thread-03",
            "operation": "03MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-03-thread-04",
            "operation": "03MB-bulk",
            "clients": 2
          }
        ]
      }
    },
    {
      "parallel": {
        "tasks": [
          {
            "name": "parallel-test-04-thread-01",
            "operation": "04MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-04-thread-02",
            "operation": "04MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-04-thread-03",
            "operation": "04MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-04-thread-04",
            "operation": "04MB-bulk",
            "clients": 2
          }
        ]
      }
    },
    {
      "parallel": {
        "tasks": [
          {
            "name": "parallel-test-05-thread-01",
            "operation": "05MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-05-thread-02",
            "operation": "05MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-05-thread-03",
            "operation": "05MB-bulk",
            "clients": 2
          },
          {
            "name": "parallel-test-05-thread-04",
            "operation": "05MB-bulk",
            "clients": 2
          }
        ]
      }
    }
  ]
}

There are two tests:

  1. single-bulk-index-test
  2. parallel-bulk-index-test

The first one sequentially indexes data in bulks of 1 MB through 20 MB. The second one runs 4 parallel bulk indexing operations at a time, increasing the bulk size in 1 MB increments.

Running the benchmark

In order to obtain a fair sample size from these tests, we considered using the iterations parameter, but later found out that it only really applies to read operations and has no effect on bulk indexing operations.

For that reason, I opted to simply launch the test repeatedly from the simplest of bash scripts:

benchmark.sh
#!/bin/bash


TEST="parallel-bulk-index-test"

curl -sku admin:Password -XDELETE https://node-1:9200/wazuh-*
curl -sku admin:Password -XPOST https://node-1:9200/_forcemerge
for i in {1..100}
do
  opensearch-benchmark execute-test --pipeline="benchmark-only" --workload-path="./benchmarks/wazuh-alerts" --target-hosts="https://node-1:9200,https://node-2:9200,https://node-3:9200" --client-options="basic_auth_user:admin,basic_auth_password:Password,verify_certs:false" --results-format csv --results-file ./${TEST}/results-$(date +%F-%T).csv --test-procedure=${TEST}
  curl -sku admin:Password -XDELETE https://node-1:9200/wazuh-*
  curl -sku admin:Password -XPOST https://node-1:9200/_forcemerge
done

  
TEST="single-bulk-index-test"

for i in {1..100}
do
  opensearch-benchmark execute-test --pipeline="benchmark-only" --workload-path="./benchmarks/wazuh-alerts" --target-hosts="https://node-1:9200,https://node-2:9200,https://node-3:9200" --client-options="basic_auth_user:admin,basic_auth_password:Password,verify_certs:false" --results-format csv --results-file ./${TEST}/results-$(date +%F-%T).csv --test-procedure=${TEST}
  curl -sku admin:Password -XDELETE https://node-1:9200/wazuh-*
  curl -sku admin:Password -XPOST https://node-1:9200/_forcemerge
done

This script simply runs all the benchmarks in a loop and writes the results of each pass to a CSV file. After each pass, it deletes all the indices it created and forces a merge to clean up the state of the cluster.

Results

The results were dumped and plotted on the team's drive:

[Image: indexed documents per second vs. bulk size (MB), for each test]

In the graphs above, the y-axis shows the number of indexed documents per second, and the x-axis the size of each bulk operation in MB, for each test.

These results are averaged over 30 runs of the tests, and the results for each pass don't vary much. It seems that increasing the bulk size increases the throughput until we hit diminishing returns.

We only ran this up to 20 MB because it is recommended to keep bulk indexing operations below 15 MB.
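
For completeness, the per-run CSV files can be aggregated with a small script like the one below. This is only a sketch: it assumes the layout produced by --results-format csv keeps the metric name in the first column and the value in the third (Metric,Task,Value,Unit); the field index may need adjusting for other opensearch-benchmark versions.

# Hypothetical aggregation of the per-run results written by benchmark.sh.
TEST="parallel-bulk-index-test"
grep -h "Mean Throughput" ./${TEST}/results-*.csv \
  | awk -F',' '{ sum += $3; n++ } END { if (n) printf "Average Mean Throughput: %.2f docs/s over %d samples\n", sum / n, n }'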

@wazuhci wazuhci moved this from In progress to Pending review in Release 5.0.0 Jul 4, 2024
@AlexRuiz7

AlexRuiz7 commented Jul 9, 2024

Results

We ran more benchmark tests for single and parallel bulks. The most representative data set runs an OpenSearch Benchmark workload using 1 client and 4 parallel bulk tasks, which puts up to (bulk_size * threads) MB in flight concurrently (for example, four parallel 5 MB bulks add up to 20 MB), and measures the average Mean Throughput over 100 runs. The infrastructure uses 3 Wazuh Indexer nodes (v4.8.0) in cluster mode, with the default wazuh-alerts template: 3 primary shards and 1 replica shard.
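
As a sanity check of that shard layout, the shard distribution can be listed directly from the cluster. This is only a sketch, assuming the node-1 hostname and admin credentials used in the benchmark setup above:

# Hypothetical verification: list the shards backing the wazuh-alerts indices.
curl -sku admin:"$OPENSEARCH_INITIAL_ADMIN_PASSWORD" \
  "https://node-1:9200/_cat/shards/wazuh-alerts-*?v&h=index,shard,prirep,node"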

In the charts below, we can see a clear comparison between using a single bulk request and using parallel bulks:

Single bulk request
[Chart: Average Sum (docs/sec) vs. Bulk size (MB)]

Parallel bulks (4)
[Chart: Average Sum (docs/sec) vs. Bulk size (MB)]

Conclusions

The parallel bulk request scenario has proven to deliver much higher throughput. The figures below show the performance boost in ingestion (indexed documents per second) when parallelizing 4 bulk requests versus using a single bulk request. The difference is substantial, although the relative gain tends to drop as the bulk size increases.

On the other hand, the trend line is strictly increasing, which shows that the Indexer is able to ingest more documents per second as the bulk size and/or the number of parallel requests grows. However, we decided to stop further analysis past the 20 MB bulk size, as it's above the settings recommended by Elastic and OpenSearch. Using values higher than 15 MB is not recommended, as it can make the cluster unstable. Preliminary analysis showed that we can increase this number up to 50 MB, at which point the Indexer stops responding.

Parallel / Single boost:

  • 674.77%
  • 557.23%
  • 410.91%

[Chart: Parallel vs Single bulk requests]

For the best tradeoff between performance and stability, we recommend not exceeding the 15 MB threshold per bulk request. It's also important to note that the bulk size depends on the number of documents and their size (a quick sizing sketch follows below):

  • 1,000 documents at 1 KB each is 1 MB.
  • 1,000 documents at 100 KB each is 100 MB.

Also, the client should make sure that bulk requests are round-robined across all the data nodes, to prevent a single node from holding every bulk in memory while processing them.
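
As a quick sizing sketch (example values only, not measurements from this spike), the number of documents that fits within a bulk budget can be estimated from the average document size:

# Hypothetical helper: estimate how many documents fit in a bulk request
# given a size budget and an average document size. Example values only.
BULK_BUDGET_MB=15      # recommended upper bound per bulk request
AVG_DOC_KB=1           # average document size assumed in this spike
DOCS_PER_BULK=$(( BULK_BUDGET_MB * 1024 / AVG_DOC_KB ))
echo "Up to ${DOCS_PER_BULK} documents per ${BULK_BUDGET_MB} MB bulk"   # prints 15360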


@wazuhci wazuhci moved this from Pending review to Done in Release 5.0.0 Jul 9, 2024