Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icinga 2 API as Event Source #112

Merged
merged 65 commits into from
Apr 12, 2024
Merged

Icinga 2 API as Event Source #112

merged 65 commits into from
Apr 12, 2024

Conversation

oxzi
Copy link
Member

@oxzi oxzi commented Oct 19, 2023

Fetch events from the Icinga 2 API. For details, consider the commit messages.

Closes #89.
refs Icinga/icinga-notifications-web#156 (corresponding configuration interface)

@cla-bot cla-bot bot added the cla/signed CLA is signed by all contributors of a PR label Oct 19, 2023
@oxzi oxzi changed the title Icinga2 source Icinga 2 API as Event Source Oct 19, 2023
Copy link
Collaborator

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments from the first pass over the code. In general, I will need to have a closer look over that event deduplication and whether we want change that in general so that it would be able to generate events for while the notifications daemon wasn't running (but those already get lost with the current hacky approach, so no downgrade in that respect).

internal/eventstream/api_responses_test.go Outdated Show resolved Hide resolved
cmd/icinga-notifications-daemon/main.go Outdated Show resolved Hide resolved
internal/eventstream/client_es.go Outdated Show resolved Hide resolved
internal/eventstream/client_api.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
internal/eventstream/api_responses.go Outdated Show resolved Hide resolved
@julianbrost
Copy link
Collaborator

From what I can tell, you only try to remove duplicates but there's nothing that would prevent reordering?

Also, looks like you're using the timestamps from Icinga 2 for the events. So far, all events receive their timestamp from the notifications daemon:


I think I'd keep it that way for now, after all, we can't send notifications in the past. (Maybe, as a future extension, we might want to track both source and processing timestamps and show hints that events were delayed.)

Anyways, I have a suggestion that shouldn't be too difficult to implement and would also handle events that were missed due to a notifications daemon restart:

  1. Reconnect the event stream, once established start buffering events (no processing yet)
  2. Query all state from the objects API and process events accordingly (as a first step, processing an event for every object, but there's the obvious improvement of using the current incident state (which will be loaded on daemon startup as part of Trigger time based escalations #68) and only emitting events where there was a change), storing the last timestamp per object.
  3. Start processing the buffered events from step 1 as well as new ones, ignoring those older than the timestamps from step 2.

@oxzi
Copy link
Member Author

oxzi commented Oct 26, 2023

@julianbrost: I did an implementation with some necessary refactoring in the three just pushed commits. Please feel free to take a look.

oxzi added a commit that referenced this pull request Oct 27, 2023
While working on #112, I experienced some delay when replaying the whole
Icinga 2 Objects API state after an Event Stream connection loss. After
taking some pprof snapshots, some big unlocks occurred.

Thus, I minimized the scopes or code areas being mutex protected.
Especially in the GetCurrent function creates a long Lock even over SQL
queries.

However, in most experimental sessions the locks were insignificant,
while the SQL internals required huge time slots.
internal/eventstream/client_api.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
internal/eventstream/api_responses.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
oxzi added a commit that referenced this pull request Nov 2, 2023
Switch to a signal based context which is being canceled for SIGINT and
SIGTERM. This change was extracted from the ongoing PR #112.
@oxzi oxzi mentioned this pull request Nov 2, 2023
@oxzi oxzi requested a review from julianbrost November 2, 2023 15:35
julianbrost pushed a commit that referenced this pull request Nov 3, 2023
Switch to a signal based context which is being canceled for SIGINT and
SIGTERM. This change was extracted from the ongoing PR #112.
Copy link
Collaborator

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments for some details. On a higher level, I'd consider changing two things:

  1. Currently, states and acknowledgements are fetched completely independently. This means those could be out of sync (you might acknowledge too early or too late in relation to the state events). So this should be more consistent if there was one request to /v1/objects/hosts and one to /v1/objects/services which is used as the baseline for everything (severity, acknowledgement, whatever we might add in the future). You emit a state event for that state and if it is flagged as acknowledged, you fetch the necessary extra information and emit another event for that as well (after the state event).
  2. In eventDispatcher(), I find the use of replayTrigger and replayPhase somewhat confusing. Maybe that would be simpler if there was one call to eventDispatcher() per event stream request, i.e. it would only handle one replay phase overall (but you'd probably have to be careful to only start the next one after the previous is gone for sure and won't submit further events for processing). Other ideas I'd consider: use only channels for signalling, i.e. replace replayPhase with one and maybe close eventDispatcherReplay to signal that the replay is done.

internal/eventstream/client_api.go Outdated Show resolved Hide resolved
internal/eventstream/client_api.go Outdated Show resolved Hide resolved
internal/eventstream/client_api.go Outdated Show resolved Hide resolved
internal/eventstream/client_api.go Outdated Show resolved Hide resolved
internal/eventstream/client_api.go Outdated Show resolved Hide resolved
internal/eventstream/client_api.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
@oxzi
Copy link
Member Author

oxzi commented Nov 3, 2023

I addressed the annotated comments within the code. However, the two main bullet points are still open.
I'll request another review when I worked on those.

internal/eventstream/client.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
internal/eventstream/client.go Outdated Show resolved Hide resolved
oxzi added 9 commits January 12, 2024 10:38
By introducing ErrSuperfluousStateChange to signal superfluous state
changes and returning a wrapped error, those messages can now be
suppressed (logged with the debug level) for Event Stream processing.
After #132 got merged and each Source's state is now within the
database, the Event Stream's configuration could go there, too.

This resulted in some level of refactoring as the data flow logic was
now reversed at some points. Especially Golang's non-cyclic imports and
the omnipresence of the RuntimeConfig made the "hack" of the
eventstream.Launcher necessary to not have an importing edge from config
to eventstream.
This refactoring converges this representation of the Icinga 2 API Unix
timestamp to those of Icinga DB. Eventually, those common or similar
code will be extracted into the icinga-go-library.
Some values are returned as constant integer values, e.g., 0 for an OK
state service. Those known integers were now replaced by consts.
The catch-up-phase logic was extended to also propagate back an error
if the state was left unsuccessfully. In this case - and not if another
catch-up-phase was requested and the old phase worker got canceled -
another attempt will be made.
@julianbrost
Copy link
Collaborator

After being asked something about the config, I noticed that it would probably be a good idea to have a separate config attribute for the Icinga 2 endpoint name, i.e. the name expected in the certificate. Given that both are configured separately in Icinga 2 (and you have to specify both, one isn't implicitly used for the other), it's quite possible to have configs like this which wouldn't be possible with the current implementation here without disabling certificate validation altogether:

object Endpoint "master" {
  host = "icinga.example.com"
  port = 5665
}

icinga2_common_name, or Icinga2CommonName in Go, allows overriding the
expected Common Name of the Certificate from the Icinga 2 API.

For testing, I acquired the CA's PEM by:

> openssl s_client \
>   -connect docker-master:5665 \
>   -showcerts < /dev/null 2> /dev/null \
> | awk '/BEGIN CERTIFICATE/ || p { p = 1; print } /END CERTIFICATE/ { exit }'

and populated the source table as follows:

> UPDATE source SET
> icinga2_ca_pem = $$-----BEGIN CERTIFICATE-----
> [ . . . ]
> -----END CERTIFICATE-----$$,
> icinga2_common_name = 'docker-master',
> icinga2_insecure_tls = 'n';

Afterwards, one can verify the check by altering icinga2_common_name
either to NULL or an invalid common name.
@oxzi
Copy link
Member Author

oxzi commented Jan 17, 2024

@julianbrost: A distinct Common Name was implemented in 24a4843, with additional information regarding adjusting one's testing environment in the commit message.

internal/config/source.go Outdated Show resolved Hide resolved
internal/icinga2/client.go Outdated Show resolved Hide resolved
internal/icinga2/client.go Outdated Show resolved Hide resolved
@oxzi oxzi requested a review from julianbrost March 19, 2024 09:57
internal/config/source.go Outdated Show resolved Hide resolved
internal/icinga2/client.go Outdated Show resolved Hide resolved
internal/icinga2/client.go Outdated Show resolved Hide resolved
internal/icinga2/client.go Outdated Show resolved Hide resolved
@julianbrost
Copy link
Collaborator

I noticed a small detail that would be pretty helpful: setting the user agent in all the requests. Icinga 2 logs the user agent for each request, so it would be helpful to easily identify the requests in these logs.

oxzi added 3 commits March 21, 2024 10:22
- Rate limit catch-up-phase worker start. In case of a network
  disruption during the catch-up-phase, this will result in an error and
  infinite retries. Those, however, might result in lots of useless
  logging, which can be rate limited.
- Remove the both useless and broken catchupEventCh drainage logic. All
  sends are being protected by context checks.
- Abort early on errors received from the catchupEventCh and don't store
  them for later.
@oxzi oxzi requested a review from julianbrost March 21, 2024 10:24
Copy link
Collaborator

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took quite some time (where I have a good part in 😅), now it LGTM (you've got to put this in a review for a PR of this size).

My tests1 were successful, besides that I found #171, but that's not caused by this PR, it affects its functionality though. I didn't test every last detail, the big picture is working, so if there's something left, that would be for another PR, this one is sitting around here for long enough now.

Footnotes

  1. different configs in the source table (including certificate validation), handling of Icinga 2 restart, state and ack events during catch-up and normal operation

@julianbrost julianbrost merged commit 97e0114 into main Apr 12, 2024
4 checks passed
@julianbrost julianbrost deleted the icinga2-source branch April 12, 2024 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla/signed CLA is signed by all contributors of a PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rework Icinga 2 event source
4 participants