
Improve batch overload protection and log handler #671

Open · wants to merge 2 commits into main from improve-batch-overload-protection-and-log-handler
Conversation

SergeTupchiy (Contributor)

This is quite a huge PR, so here is a short summary:

  • Common batch processor functionality is moved to a single module: otel_batch_olp (olp stands for overload protection).
  • The otel_batch_olp implementation is quite different from the previous otel_batch_processor and deserves additional explanation:
    • First of all, the previous olp implementation (max_queue_size enforced by a periodic ETS table size check) couldn't react to changes in time due to its periodic nature. Even if the check interval is quite low, e.g. 1s, an abrupt burst of events (be it trace spans or log messages) can insert far more records into the ETS table than the configured max_queue_size. The issue can be mitigated by draining the ETS table only up to max_queue_size during export, but that would merely confirm that the overload protection is not reliable.
    • The batch processor relies heavily on persistent_term. Although the stored terms (atoms) fit in a machine word, so their fairly frequent (periodic) updates during ETS table rotation should not trigger a global GC, updating a persistent term can still be quite expensive when the number of other stored terms is large (https://www.erlang.org/doc/man/persistent_term.html):

      Storing or updating a term (using put/2) is proportional to the number of already created persistent terms because the hash table holding the keys will be copied. In addition, the term itself will be copied.

    • For the above reasons, an alternative solution based on atomic counters is proposed: https://github.com/SergeTupchiy/opentelemetry-erlang/blob/improve-batch-overload-protection-and-log-handler/apps/opentelemetry/src/otel_batch_olp.erl#L83
      The performance of atomic operations is comparable to persistent_term (at least on my basic hardware; more benchmarks are welcome), they are designed to be mutated, and the proposed olp mechanism ensures that the table never grows beyond the max_queue_size value (see the sketch after this list).
  • otel_batch_olp is re-used in otel_batch_processor and otel_log_handler.
  • ETS table export is changed to eliminate possible data loss due to certain race conditions (discussed in: fix(otel_processor): wait for runner process termination #641).
  • ETS table keys are changed from instrumentation_scope to trace id / log timestamp. These values have higher entropy and are thus a better fit for keys (storing a large number of objects under the same key in an ETS bag can degrade performance). This was also discussed in: fix(otel_processor): wait for runner process termination #641.
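
Below is a minimal sketch of the counter-based overload protection idea (my own illustration, not the actual otel_batch_olp code): every insert first reserves a slot with atomics:add_get/3 and is dropped if the reservation exceeds max_queue_size, so the table cannot outgrow the limit between checks the way a periodic size check allows.

```erlang
%% Hypothetical sketch of counter-based overload protection; the module,
%% function names and exact drop semantics are illustrative only.
-module(olp_sketch).
-export([new/1, insert/2]).

new(MaxQueueSize) ->
    Tab = ets:new(?MODULE, [public, duplicate_bag, {write_concurrency, true}]),
    Counter = atomics:new(1, [{signed, false}]),
    {Tab, Counter, MaxQueueSize}.

insert({Tab, Counter, MaxQueueSize}, Object) ->
    %% Reserve a slot before inserting.
    case atomics:add_get(Counter, 1, 1) of
        N when N =< MaxQueueSize ->
            ets:insert(Tab, Object),
            ok;
        _ ->
            %% Limit reached: release the reservation and drop the event,
            %% so the table never grows beyond MaxQueueSize.
            atomics:sub(Counter, 1, 1),
            dropped
    end.
```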


codecov bot commented Dec 27, 2023

Codecov Report

Attention: Patch coverage is 89.35185% with 23 lines in your changes missing coverage. Please review.

Project coverage is 74.47%. Comparing base (8f01a9a) to head (d9afb2d).
Report is 1 commit behind head on main.

Files Patch % Lines
apps/opentelemetry/src/otel_batch_olp.erl 92.07% 13 Missing ⚠️
apps/opentelemetry/src/otel_exporter.erl 60.00% 6 Missing ⚠️
apps/opentelemetry/src/otel_batch_processor.erl 80.00% 2 Missing ⚠️
apps/opentelemetry/src/otel_exporter_stdout.erl 0.00% 1 Missing ⚠️
apps/opentelemetry_api/src/otel_utils.erl 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #671      +/-   ##
==========================================
- Coverage   74.48%   74.47%   -0.02%     
==========================================
  Files          56       62       +6     
  Lines        1862     2014     +152     
==========================================
+ Hits         1387     1500     +113     
- Misses        475      514      +39     
Flag Coverage Δ
api 70.05% <90.00%> (-3.46%) ⬇️
elixir 17.71% <70.00%> (?)
erlang 75.79% <89.35%> (+1.30%) ⬆️
exporter 71.42% <100.00%> (+3.94%) ⬆️
sdk 80.04% <88.60%> (+1.34%) ⬆️
zipkin 54.16% <ø> (ø)


@tsloughter (Member)

It has been a week, so I wanted to say a big thanks! And to let you know I hope to get around to reviewing this next week at the latest.

@tsloughter (Member)

Hey @SergeTupchiy, sorry I haven't gotten this merged yet! Can you update to resolve the merge conflicts?

@SergeTupchiy (Contributor, Author)

Hi @tsloughter,
Sure, hopefully I will manage to do it later this week.
Also, I'm planning to remove commit bf08e05 from this branch and fix the relevant test case, as it was rejected in another PR: #670

@tsloughter (Member)

Great, thanks. And yes, please remove that commit but don't close that PR bc I don't think a final decision has been made.

SergeTupchiy force-pushed the improve-batch-overload-protection-and-log-handler branch from 15ea883 to d9afb2d on March 20, 2024 18:07
@SergeTupchiy (Contributor, Author) commented Mar 20, 2024

Great, thanks. And yes, please remove that commit but don't close that PR bc I don't think a final decision has been made.

Rebased, dropped the 'cleanup persistent terms' commit from this branch and added a workaround fix for the failing test: https://github.com/open-telemetry/opentelemetry-erlang/pull/671/files#diff-ca895028100a6d11170ba82e246533e4bcaf1ffa2b15cfce32777a56e7cf8af4R122-R124

@bryannaegele (Contributor)

@SergeTupchiy how does this PR align with #637? Is it building on that one?

@SergeTupchiy (Contributor, Author)

@SergeTupchiy how does this PR align with #637? Is it building on that one?

Yes, this one includes changes done in #637 and makes the latter unnecessary.

@tsloughter (Member)

@SergeTupchiy sorry this still hasn't gotten a real review. I really want to get it in, but it is large and changes a lot, since it doesn't just change the log handler to add olp but changes the span batch processor too. I'm wondering about breaking it up into smaller PRs so we can move it forward in pieces. What do you think?

@tsloughter (Member)

Ooh, so I think I get your ets table lookup strategy now. At first it worried me because I recalled we used to do something similar (a 2-tuple of tables, with which one to use kept in a counter) but then moved to the persistent term. I think I see now that it sort of blends the changes we made with the use of a reg_name with the original 2-tuple design.

Because we use a reg_name now, we don't have to put the pid in the name anymore to make the tables unique, but we then require a name lookup instead of storing the table in the config/state.

I really wish we already had benchmarks so we could at least guarantee this doesn't introduce any regressions. Thinking about it, I don't see why it would, but it's always better to measure, of course :). Maybe I need to do some quick work on that before we can change the span batch processor.
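
A rough sketch of the two-table selection idea described above (an illustration under my own assumptions, not code from this PR): both ETS tables are created up front and an atomics counter picks the active one, so rotating tables is a single atomic increment rather than a persistent_term update.

```erlang
%% Hypothetical illustration of the "2-tuple of tables + counter" strategy;
%% names and layout are assumptions, not the PR's actual code.
-module(table_pair_sketch).
-export([init/0, current_table/1, rotate/1]).

init() ->
    Tabs = {ets:new(tab_a, [public, duplicate_bag]),
            ets:new(tab_b, [public, duplicate_bag])},
    Counter = atomics:new(1, [{signed, false}]),
    {Tabs, Counter}.

%% The counter's parity selects which table is currently written to.
current_table({{Tab1, Tab2}, Counter}) ->
    case atomics:get(Counter, 1) rem 2 of
        0 -> Tab1;
        1 -> Tab2
    end.

%% Rotation flips the active table with a single atomic increment.
rotate({_Tabs, Counter}) ->
    atomics:add(Counter, 1, 1).
```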
