-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Icinga 2 API as Event Source #112
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments from the first pass over the code. In general, I will need to have a closer look over that event deduplication and whether we want change that in general so that it would be able to generate events for while the notifications daemon wasn't running (but those already get lost with the current hacky approach, so no downgrade in that respect).
From what I can tell, you only try to remove duplicates but there's nothing that would prevent reordering? Also, looks like you're using the timestamps from Icinga 2 for the events. So far, all events receive their timestamp from the notifications daemon:
I think I'd keep it that way for now, after all, we can't send notifications in the past. (Maybe, as a future extension, we might want to track both source and processing timestamps and show hints that events were delayed.) Anyways, I have a suggestion that shouldn't be too difficult to implement and would also handle events that were missed due to a notifications daemon restart:
|
@julianbrost: I did an implementation with some necessary refactoring in the three just pushed commits. Please feel free to take a look. |
While working on #112, I experienced some delay when replaying the whole Icinga 2 Objects API state after an Event Stream connection loss. After taking some pprof snapshots, some big unlocks occurred. Thus, I minimized the scopes or code areas being mutex protected. Especially in the GetCurrent function creates a long Lock even over SQL queries. However, in most experimental sessions the locks were insignificant, while the SQL internals required huge time slots.
Switch to a signal based context which is being canceled for SIGINT and SIGTERM. This change was extracted from the ongoing PR #112.
Switch to a signal based context which is being canceled for SIGINT and SIGTERM. This change was extracted from the ongoing PR #112.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See inline comments for some details. On a higher level, I'd consider changing two things:
- Currently, states and acknowledgements are fetched completely independently. This means those could be out of sync (you might acknowledge too early or too late in relation to the state events). So this should be more consistent if there was one request to
/v1/objects/hosts
and one to/v1/objects/services
which is used as the baseline for everything (severity, acknowledgement, whatever we might add in the future). You emit a state event for that state and if it is flagged as acknowledged, you fetch the necessary extra information and emit another event for that as well (after the state event). - In
eventDispatcher()
, I find the use ofreplayTrigger
andreplayPhase
somewhat confusing. Maybe that would be simpler if there was one call toeventDispatcher()
per event stream request, i.e. it would only handle one replay phase overall (but you'd probably have to be careful to only start the next one after the previous is gone for sure and won't submit further events for processing). Other ideas I'd consider: use only channels for signalling, i.e. replacereplayPhase
with one and maybe closeeventDispatcherReplay
to signal that the replay is done.
I addressed the annotated comments within the code. However, the two main bullet points are still open. |
906310b
to
500b2a9
Compare
By introducing ErrSuperfluousStateChange to signal superfluous state changes and returning a wrapped error, those messages can now be suppressed (logged with the debug level) for Event Stream processing.
After #132 got merged and each Source's state is now within the database, the Event Stream's configuration could go there, too. This resulted in some level of refactoring as the data flow logic was now reversed at some points. Especially Golang's non-cyclic imports and the omnipresence of the RuntimeConfig made the "hack" of the eventstream.Launcher necessary to not have an importing edge from config to eventstream.
This refactoring converges this representation of the Icinga 2 API Unix timestamp to those of Icinga DB. Eventually, those common or similar code will be extracted into the icinga-go-library.
Some values are returned as constant integer values, e.g., 0 for an OK state service. Those known integers were now replaced by consts.
The catch-up-phase logic was extended to also propagate back an error if the state was left unsuccessfully. In this case - and not if another catch-up-phase was requested and the old phase worker got canceled - another attempt will be made.
After being asked something about the config, I noticed that it would probably be a good idea to have a separate config attribute for the Icinga 2 endpoint name, i.e. the name expected in the certificate. Given that both are configured separately in Icinga 2 (and you have to specify both, one isn't implicitly used for the other), it's quite possible to have configs like this which wouldn't be possible with the current implementation here without disabling certificate validation altogether:
|
icinga2_common_name, or Icinga2CommonName in Go, allows overriding the expected Common Name of the Certificate from the Icinga 2 API. For testing, I acquired the CA's PEM by: > openssl s_client \ > -connect docker-master:5665 \ > -showcerts < /dev/null 2> /dev/null \ > | awk '/BEGIN CERTIFICATE/ || p { p = 1; print } /END CERTIFICATE/ { exit }' and populated the source table as follows: > UPDATE source SET > icinga2_ca_pem = $$-----BEGIN CERTIFICATE----- > [ . . . ] > -----END CERTIFICATE-----$$, > icinga2_common_name = 'docker-master', > icinga2_insecure_tls = 'n'; Afterwards, one can verify the check by altering icinga2_common_name either to NULL or an invalid common name.
@julianbrost: A distinct Common Name was implemented in 24a4843, with additional information regarding adjusting one's testing environment in the commit message. |
I noticed a small detail that would be pretty helpful: setting the user agent in all the requests. Icinga 2 logs the user agent for each request, so it would be helpful to easily identify the requests in these logs. |
- Rate limit catch-up-phase worker start. In case of a network disruption during the catch-up-phase, this will result in an error and infinite retries. Those, however, might result in lots of useless logging, which can be rate limited. - Remove the both useless and broken catchupEventCh drainage logic. All sends are being protected by context checks. - Abort early on errors received from the catchupEventCh and don't store them for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took quite some time (where I have a good part in 😅), now it LGTM (you've got to put this in a review for a PR of this size).
My tests1 were successful, besides that I found #171, but that's not caused by this PR, it affects its functionality though. I didn't test every last detail, the big picture is working, so if there's something left, that would be for another PR, this one is sitting around here for long enough now.
Footnotes
-
different configs in the
source
table (including certificate validation), handling of Icinga 2 restart, state and ack events during catch-up and normal operation ↩
Fetch events from the Icinga 2 API. For details, consider the commit messages.
Closes #89.
refs Icinga/icinga-notifications-web#156 (corresponding configuration interface)