-
-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feature] Add a check which inspects device configuration status periodically #123
Conversation
c304e4d
to
dabafcc
Compare
QUERIES:
|
0094d9e
to
58539d6
Compare
Check out the README file. According to the current version, a status will be critical when the metric is defined in |
58539d6
to
fc42293
Compare
c3a1a23
to
d3ac3f4
Compare
) | ||
# Input in days | ||
CONFIG_MODIFIED_RETENTION_POLICY = getattr( | ||
settings, 'OPENWISP_MONITORING_CONFIG_STATUS_RETENTION_POLICY', '48h0m0s' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please follow the standard we use for setting names: the user defines OPENWISP_MONITORING_X_Y_Z
but we use X_Y_Z
internally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can there be an exception here. I have named the setting as OPENWISP_MONITORING_DEVICE_CONFIGURATION_X_Y_Z
but internally the settings are named as CONFIG_X_Y_Z
. The reason was to explicitly say to newcomers that the setting is related to DEVICE_CONFIGURATION
(there's also chart config and soon there shall be metric_config) but internally i.e. in the check settings
we know that it means device config hence it's shortened. Let me know if it seems right?
7ab69dd
to
c7ffc08
Compare
@nemesisdesign, I realized it now that there is no migration for |
1047e03
to
ba31564
Compare
openwisp_monitoring/check/tasks.py
Outdated
@@ -35,22 +35,43 @@ def perform_check(uuid): | |||
|
|||
|
|||
@shared_task | |||
def auto_create_ping(model, app_label, object_id, created): | |||
def auto_create_ping(model, app_label, object_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's safe to remove the the keyword argument created
, since it's not being used at all.
63e1612
to
d6203a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good progress.
I have not tested this yet, I'm only doing code-level review, see my comments.
@nemesisdesign, I realized it now that there is no migration for
ping
,config_modified
checks in thesample_check
. I don't think there's a need since the module is not yet released (and the migrations were meant for backward compatibilty). Can you please confirm this?
No data migration needed for sample app.
PS: rebase on the current master
README.rst
Outdated
@@ -247,6 +247,53 @@ in terms of disk space. | |||
|
|||
Whether ping checks are created automatically for devices. | |||
|
|||
``OPENWISP_MONITORING_AUTO_DEVICE_CONFIGURATION_MODIFIED`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about OPENWISP_MONITORING_AUTO_DEVICE_CONFIG_CHECK
?
README.rst
Outdated
+--------------+-----------+ | ||
|
||
This setting allows you to configure the time that the check is allowed to | ||
fail after which the device health status changes to ``problem``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this setting from its explanation, we should also find a shorter name, eg OPENWISP_MONITORING_DEVICE_CONFIG_CHECK_MAX_TIME
README.rst
Outdated
**Note**: If the setting ``OPENWISP_MONITORING_AUTO_DEVICE_CONFIGURATION_MODIFIED`` | ||
is disabled then this setting need not be declared. | ||
|
||
``OPENWISP_MONITORING_DEVICE_CONFIGURATION_MODIFIED_RETENTION_POLICY`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too long.. OPENWISP_MONITORING_DEVICE_CONFIG_CHECK_RP
?
openwisp_monitoring/check/apps.py
Outdated
|
||
class CheckConfig(AppConfig): | ||
name = 'openwisp_monitoring.check' | ||
label = 'check' | ||
verbose_name = _('Network Monitoring Checks') | ||
|
||
def ready(self): | ||
manage_config_modified_retention_policy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this play with the abstraction? Can you make it play well with the changes introduced in #103?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
manage_config_modified_retention_policy()
makes use of timeseries_db.get_list_retention_policies
after rebasing which has been defined by us so it's abstract enough, I think.
154bae6
to
6e5a612
Compare
c5b4ad4
to
8a868fc
Compare
openwisp_monitoring/check/migrations/0004_create_config_applied.py
Outdated
Show resolved
Hide resolved
openwisp_monitoring/check/migrations/0004_create_config_applied.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still work to do, I'm disappointed to be highlighting the same silly mistakes 👎
Also, the check doesn't seem to work to me if the tolerance is not zero seconds, I'll do more tests and let you know more details if I can. UPDATE: I confirm this, here's how I test it:
- run dev server, run worker, run celery beat
- set the config of a device to
modified
- wait more than 5 minutes, no alert is ever generated
- if I set
seconds
to0
in the alert settings, the alert is generated
openwisp_monitoring/check/migrations/0004_create_config_applied.py
Outdated
Show resolved
Hide resolved
self.assertIn('Ping', c.check) | ||
self.assertEqual(Check.objects.count(), 2) | ||
with self.subTest('Test AUTO_PING'): | ||
c1 = Check.objects.get(name='Ping') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't query by name, use the check field, please correct the same for any other occurrence of this everywhere!
Do not query by fields that can be changed by users, do not use get() for querying objects that can return multiple items!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do not use get() for querying objects that can return multiple items!
I expect one Ping
and one ConfigApplied
check per device. Since, the test had only one device, I thought using get
was safe enough. Though yes, querying by name
(which can be changed) is bad practice on my side :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get will raise an error if multiple objects are returned, but it's better to have failures rather than errors since it's usually faster to understand what's wrong
5 minutes, sorry, I mistyped. |
with self.subTest('Test AUTO_CONFIG_CHECK'): | ||
c2 = Check.objects.get(name='Configuration Applied') | ||
self.assertEqual(c2.content_object, d) | ||
self.assertIn('ConfigApplied', c2.check) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you use equal and you compare it to self._CONFIG_APPLIED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test was already written for Ping
check. So, I have just adapted it for ConfigApplied
checks. Made the change.
@nemesisdesign, no it's not a bug. You had asked to cover this use case earlier 😕, I replied back with a solution #123 (comment). Awaited your response, pinged you two days later #123 (comment) and on IM. Still no response, maybe you were slightly busy back then. Hence, went ahead with the approach and informed nevertheless #123 (comment). As you had requested, the The zero second case I think is an edge case where device goes offline. Note: There is one specific test ( |
8a868fc
to
5ba3b99
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read again what was written previously here: #123 (review), it is correct, but that's not what is happening here.
Let me explain again. Please test it in the same way.
- run dev server, run worker, run celery beat
- ensure you have a device with health status OK and config status set to applied
- set the config of a device to modified
- wait more than 5 minutes, no alert is ever generated, but it should
- if I set seconds to 0 in the alert settings, the alert is generated, why this difference?
I dug deeper with ipdb
, I found out that when previous measurements are retrieved, no point is found.
Same if I use the influxdb shell:
> select * from config_applied;
>
The situation changes if I remove retention_policy=SHORT_RP
on the write
method call:
> select * from config_applied;
name: config_applied
time config_applied content_type object_id
---- -------------- ------------ ---------
1596134001603513262 0 config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b
1596134078032234031 0 config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b
1596134379375040639 0 config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b
Please repeat the same tests and let me know:
- if you can replicate the same issue successfully
- if yes, why it is happening? Why writing with the short retention policy doesn't write anything?
- if yes, write a failing test for it
- if yes, how you fixed it
You have to do this kind of testing on your own, without the need of me requesting you to do it.
It must become the standard way of how you work.
Mentioning so if anyone faces same issue in future, then can avoid the trouble, hopefully.
Yes
When a retention policy is used besides
Added
Added In [7]: m.read(retention_policy='short')
Out[7]:
[{'time': 1596102060, 'config_applied': 1},
{'time': 1596102120, 'config_applied': 1},
... Personally, I don't like this approach since user will need to specify the The alert didn't came immediately after 5 minutes (took roughly 6-7 minutes, I think) but yes tested twice and added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great progress, it finally works 👍
I still think we have dangerous code in our software and I'd like to get it resolved since we're already here and have our mind on it.
[...]
3. if yes, write a failing test for it
Added
openwisp_monitoring.db.backends.influxdb.tests.TestDatabaseClient.test_read_with_rp
I prefer to ensure the actual functional test of the check fails (it would have helped you to spot this without me being involved), I explained in one comment below how to do it.
4. if yes, how you fixed it
Added
retention_policy
as a parameter inmetric.read()
andtimeseries_db.read()
.
Ok, I see you added support for this in the tsdb backend, but the short retention policy was already being used to retrieve the device data, and that code has not been updated. Can you update that bit of code as well to make it use the new way so we can probably simplify the code? I'm referring to these:
q = device_data_query.format(SHORT_RP, self.__key, self.pk) |
device_data_query = ( |
Personally, I don't like this approach since user will need to specify the
retention_policy
everytime in read / write. Maybe we can refactorretention_policy
to be a field inMetric
model (later on).
Doesn't worry me, there are other things a lot more important that we need to work on.
b07ac63
to
eab6145
Compare
@nemesisdesign, as you have correctly pointed it out, openwisp-monitoring/openwisp_monitoring/device/base/models.py Lines 100 to 102 in 9c53057
also abstraction was not complete as device_data_query is a dict in case of elasticsearch . I think if needed, we can simplify it there 😃.
Ok, thanks freezegun works great. Currently, implemented only for this test, will do for all others in the suite. Though we will need to run checks twice in order to validate alerts after this. The reason being To get it working, I had to alter tsdb backend timestamp = kwargs.get('timestamp') or now() # <--- now() is extra here
if isinstance(timestamp, datetime):
timestamp = timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')
point['time'] = timestamp |
eab6145
to
4f87998
Compare
Check this build (https://travis-ci.org/github/openwisp/openwisp-monitoring/jobs/713662695#L1163) to know the failure. Found it, it was related to timestamp = kwargs.get('timestamp') or now() # now() is added here
if isinstance(timestamp, datetime):
timestamp = timestamp.strftime('%Y-%m-%dT%H:%M:%SZ') # This was the reason, seconds precision
point['time'] = timestamp # Passing timestamp was optional earlier I will try to add a failing test for this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds all good @nepython, let me know when ready for review.
9a6ed40
to
ec5e68b
Compare
…odically #54 [influxdb] Add support for retention_policy in read
ec5e68b
to
0041002
Compare
Closes #54