
[feature] Add a check which inspects device configuration status periodically #123

Merged
1 commit merged into dev from issues/54-add-check-inspecting-config-status on Aug 1, 2020

Conversation

nepython
Contributor

@nepython nepython commented Jun 10, 2020

Closes #54

@nepython nepython marked this pull request as draft June 10, 2020 20:20
@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 2 times, most recently from c304e4d to dabafcc on June 11, 2020 20:20
@nepython
Contributor Author

nepython commented Jun 11, 2020

QUERIES:

  • When is the status supposed to be made critical? The issue mentioned that the status should change to problem/critical depending upon the configuration. I didn't understand this.
  • I think I should rename all instances of config_status to config_modified, which looks more appropriate. Does that seem like a better name?

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 2 times, most recently from 0094d9e to 58539d6 on June 11, 2020 21:55
@PabloCastellano
Contributor

QUERIES:

  • When is the status supposed to be made critical? The issue mentioned that the status should change to problem/critical depending upon the configuration. I didn't understand this.

Check out the README file. According to the current version, a status will be critical when the metric is defined in OPENWISP_MONITORING_CRITICAL_DEVICE_METRICS. Critical status is reserved for some metrics such as ping.
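For context, the setting referred to above is declared as a list of metric definitions in the project's settings.py. A hedged sketch of an override (the dict keys below mirror the documented ping default and are assumptions to be checked against the README):

```python
# Hypothetical settings.py override: only metrics listed here can flip
# the device health status to critical. The dict keys are taken from
# the documented ping default and should be verified against the README.
OPENWISP_MONITORING_CRITICAL_DEVICE_METRICS = [
    {'key': 'ping', 'field_name': 'reachable'},
]
```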

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from 58539d6 to fc42293 on June 12, 2020 20:35
@nepython nepython marked this pull request as ready for review June 12, 2020 20:41
@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 2 times, most recently from c3a1a23 to d3ac3f4 on June 13, 2020 11:41
# Input in days
CONFIG_MODIFIED_RETENTION_POLICY = getattr(
    settings, 'OPENWISP_MONITORING_CONFIG_STATUS_RETENTION_POLICY', '48h0m0s'
)
Contributor

Please follow the standard we use for setting names: the user defines OPENWISP_MONITORING_X_Y_Z but we use X_Y_Z internally
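A minimal sketch of this convention (illustrative only; FakeSettings is a stand-in for django.conf.settings, and AUTO_PING is just an example name):

```python
# Illustrative sketch (not the actual module code) of the convention the
# reviewer describes: users define OPENWISP_MONITORING_X_Y_Z in their
# Django settings, while the module refers to X_Y_Z internally.

class FakeSettings:
    """Stand-in for django.conf.settings, for illustration only."""
    OPENWISP_MONITORING_AUTO_PING = False  # user override

settings = FakeSettings()

# Internally the OPENWISP_MONITORING_ prefix is dropped:
AUTO_PING = getattr(settings, 'OPENWISP_MONITORING_AUTO_PING', True)
print(AUTO_PING)  # False: the user's value wins over the default
```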

Contributor Author

Can there be an exception here? I have named the setting OPENWISP_MONITORING_DEVICE_CONFIGURATION_X_Y_Z, but internally the settings are named CONFIG_X_Y_Z. The reason was to explicitly tell newcomers that the setting is related to DEVICE_CONFIGURATION (there's also chart config, and soon there will be metric config), while internally, i.e. in the check settings, we know it means device config, hence it's shortened. Let me know if this seems right.

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from 7ab69dd to c7ffc08 on June 16, 2020 09:21
@nepython
Contributor Author

@nemesisdesign, I just realized that there is no migration for the ping and config_modified checks in sample_check. I don't think there's a need for one, since the module is not yet released (and the migrations were meant for backward compatibility). Can you please confirm this?

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from 1047e03 to ba31564 on June 19, 2020 10:30
@@ -35,22 +35,43 @@ def perform_check(uuid):


 @shared_task
-def auto_create_ping(model, app_label, object_id, created):
+def auto_create_ping(model, app_label, object_id):
Contributor Author

I think it's safe to remove the keyword argument created, since it's not being used at all.

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 3 times, most recently from 63e1612 to d6203a0 on June 22, 2020 16:11
Member

@nemesifier nemesifier left a comment


Good progress.

I have not tested this yet, I'm only doing code-level review, see my comments.

@nemesisdesign, I just realized that there is no migration for the ping and config_modified checks in sample_check. I don't think there's a need for one, since the module is not yet released (and the migrations were meant for backward compatibility). Can you please confirm this?

No data migration needed for sample app.

PS: rebase on the current master

README.rst Outdated
@@ -247,6 +247,53 @@ in terms of disk space.

Whether ping checks are created automatically for devices.

``OPENWISP_MONITORING_AUTO_DEVICE_CONFIGURATION_MODIFIED``
Member

what about OPENWISP_MONITORING_AUTO_DEVICE_CONFIG_CHECK?

README.rst Outdated
+--------------+-----------+

This setting allows you to configure the time that the check is allowed to
fail after which the device health status changes to ``problem``.
Member

I don't understand this setting from its explanation; we should also find a shorter name, e.g. OPENWISP_MONITORING_DEVICE_CONFIG_CHECK_MAX_TIME

README.rst Outdated
**Note**: If the setting ``OPENWISP_MONITORING_AUTO_DEVICE_CONFIGURATION_MODIFIED``
is disabled then this setting need not be declared.

``OPENWISP_MONITORING_DEVICE_CONFIGURATION_MODIFIED_RETENTION_POLICY``
Member

too long.. OPENWISP_MONITORING_DEVICE_CONFIG_CHECK_RP?


class CheckConfig(AppConfig):
    name = 'openwisp_monitoring.check'
    label = 'check'
    verbose_name = _('Network Monitoring Checks')

    def ready(self):
        manage_config_modified_retention_policy()
Member

how does this play with the abstraction? Can you make it play well with the changes introduced in #103?

Contributor Author

manage_config_modified_retention_policy() makes use of timeseries_db.get_list_retention_policies (after rebasing), which is defined by us, so it's abstract enough, I think.
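A stubbed sketch of the idea (method names taken from the discussion; FakeTimeseriesDB, the body of the function, and the duration value are assumptions, not the project's code): the bootstrap only talks to the abstract timeseries_db API, so any backend implementing these two methods works.

```python
# Stubbed, backend-agnostic sketch of a retention-policy bootstrap.
SHORT_RP = 'short'

class FakeTimeseriesDB:
    """Stand-in for the abstract timeseries_db backend (illustration only)."""
    def __init__(self):
        self.policies = ['autogen']
    def get_list_retention_policies(self):
        return self.policies
    def create_retention_policy(self, name, duration):
        self.policies.append(name)

timeseries_db = FakeTimeseriesDB()

def manage_config_modified_retention_policy():
    # idempotent: only create the policy when it is missing
    if SHORT_RP not in timeseries_db.get_list_retention_policies():
        timeseries_db.create_retention_policy(name=SHORT_RP, duration='48h0m0s')

manage_config_modified_retention_policy()
manage_config_modified_retention_policy()  # second call is a no-op
print(timeseries_db.policies)  # ['autogen', 'short']
```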

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 3 times, most recently from 154bae6 to 6e5a612 on June 23, 2020 15:53
@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from c5b4ad4 to 8a868fc on July 29, 2020 17:34
Member

@nemesifier nemesifier left a comment


Still work to do, I'm disappointed to be highlighting the same silly mistakes 👎

Also, the check doesn't seem to work for me if the tolerance is not zero seconds; I'll do more tests and let you know more details if I can. UPDATE: I confirm this, here's how I tested it:

  • run dev server, run worker, run celery beat
  • set the config of a device to modified
  • wait more than 5 minutes, no alert is ever generated
  • if I set seconds to 0 in the alert settings, the alert is generated

self.assertIn('Ping', c.check)
self.assertEqual(Check.objects.count(), 2)
with self.subTest('Test AUTO_PING'):
    c1 = Check.objects.get(name='Ping')
Member

don't query by name, use the check field, please correct the same for any other occurrence of this everywhere!
Do not query by fields that can be changed by users, do not use get() for querying objects that can return multiple items!

Contributor Author

do not use get() for querying objects that can return multiple items!

I expect one Ping and one ConfigApplied check per device. Since, the test had only one device, I thought using get was safe enough. Though yes, querying by name (which can be changed) is bad practice on my side :(

Member

get will raise an error if multiple objects are returned, but it's better to have failures rather than errors since it's usually faster to understand what's wrong
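The point can be illustrated without Django at all (toy code, not the project's models): a get-style lookup on a user-editable field errors out as soon as two rows share the value, while the stable check path stays unambiguous.

```python
# Toy illustration of why get() on a user-editable field is fragile.
checks = [
    {'name': 'Ping', 'check': 'openwisp_monitoring.check.classes.Ping'},
    {'name': 'Ping', 'check': 'openwisp_monitoring.check.classes.ConfigApplied'},
]

def get(**kw):
    """Mimics QuerySet.get(): exactly one match or an error."""
    matches = [c for c in checks if all(c[k] == v for k, v in kw.items())]
    if len(matches) != 1:
        raise ValueError(f'expected exactly 1 match, got {len(matches)}')
    return matches[0]

try:
    get(name='Ping')            # two checks may share a user-edited name
except ValueError as e:
    print(e)                    # expected exactly 1 match, got 2

# Querying by the stable `check` path is unambiguous:
print(get(check='openwisp_monitoring.check.classes.Ping')['name'])  # Ping
```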

@nemesifier
Member

Thank you @nepython, I tested it and it works! +1
After using the feature and analyzing the code, I realized that the naming is odd.
The check would be a lot easier to understand if it were something like:
name: Configuration applied
alert settings operator: less than
alert settings threshold value: 1
alert settings tolerance: 600
So if the configuration is not applied for more than 5 minutes, we get an alert.
PS: while testing this, I also realized this: #172

@nemesisdesign, I am slightly confused. You are saying it should be 5 min and 600 s at the same time.
Currently it defaults to 5 minutes, with a setting OPENWISP_MONITORING_CONFIG_CHECK_MAX_TIME to allow users to configure it if needed (as per the issue description). I think the current approach of a 5 minute default plus a setting to configure it is better. Wdyt?

5 minutes, sorry, I mistyped.

with self.subTest('Test AUTO_CONFIG_CHECK'):
    c2 = Check.objects.get(name='Configuration Applied')
    self.assertEqual(c2.content_object, d)
    self.assertIn('ConfigApplied', c2.check)
Member

why don't you use assertEqual and compare it to self._CONFIG_APPLIED?

Contributor Author

This test was already written for the Ping check, so I had just adapted it for ConfigApplied checks. Made the change.

@nepython
Contributor Author

nepython commented Jul 30, 2020

Also, the check doesn't seem to work for me if the tolerance is not zero seconds; I'll do more tests and let you know more details if I can. UPDATE: I confirm this, here's how I tested it:

* run dev server, run worker, run celery beat

* set the config of a device to `modified`

* wait more than 5 minutes, no alert is ever generated

* if I set `seconds` to `0` in the alert settings, the alert is generated

@nemesisdesign, no, it's not a bug. You had asked to cover this use case earlier 😕, and I replied with a solution in #123 (comment). I awaited your response, pinged you two days later in #123 (comment) and on IM. Still no response; maybe you were slightly busy back then. Hence, I went ahead with the approach and informed you nevertheless in #123 (comment).

As you had requested, the ConfigApplied check won't send any alerts if the device goes offline (health is thus critical); the user will already have one for Ping. The reason is to avoid irritating the user with too many alerts. It will, however, alert the user once the device comes back online if the config still stays modified.

The zero-second case is, I think, an edge case where the device goes offline: ConfigApplied checks are triggered prior to the health status getting updated to CRITICAL. We might never set a 0s threshold for ConfigApplied, I think, since the config status stays modified for some time whenever a template is updated (it's not instantaneous, at least for me 😅)

Note: there is one specific test (test_config_modified_device_problem) that I added in check.tests.TestModels which verifies that the ConfigApplied check is working!
PS: maybe the test can be improved.
PPS: addressed / fixed everything else as per the review.

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from 8a868fc to 5ba3b99 on July 30, 2020 11:34
Member

@nemesifier nemesifier left a comment


I read again what was written previously here: #123 (review), it is correct, but that's not what is happening here.

Let me explain again. Please test it in the same way.

  • run dev server, run worker, run celery beat
  • ensure you have a device with health status OK and config status set to applied
  • set the config of a device to modified
  • wait more than 5 minutes, no alert is ever generated, but it should
  • if I set seconds to 0 in the alert settings, the alert is generated, why this difference?

I dug deeper with ipdb, I found out that when previous measurements are retrieved, no point is found.

Same if I use the influxdb shell:

> select * from config_applied;
> 

The situation changes if I remove retention_policy=SHORT_RP on the write method call:

> select * from config_applied;
name: config_applied
time                config_applied content_type  object_id
----                -------------- ------------  ---------
1596134001603513262 0              config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b
1596134078032234031 0              config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b
1596134379375040639 0              config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b

Please repeat the same tests and let me know:

  1. if you can replicate the same issue successfully
  2. if yes, why it is happening? Why writing with the short retention policy doesn't write anything?
  3. if yes, write a failing test for it
  4. if yes, how you fixed it

You have to do this kind of testing on your own, without the need for me to request it.
It must become the standard way you work.

@nepython nepython marked this pull request as draft July 30, 2020 18:57
@nepython
Contributor Author

Mentioning this so that anyone who faces the same issue in the future can hopefully avoid the trouble.

Please repeat the same tests and let me know:

1. if you can replicate the same issue successfully

Yes

2. if yes, why it is happening? Why writing with the short retention policy doesn't write anything?

When a retention policy other than the default is used in InfluxDB, it needs to be specified in the query (see https://community.influxdata.com/t/influxdb-not-writing-data-if-rp-is-specified/9861). So the issue is not with write but with read. To query a measurement in InfluxDB when a non-default retention policy is in place, use:

> select * from <retention_policy>.<measurement>;   # `select * from short.config_applied` in this case 
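In other words, the retention policy has to be part of the FROM clause. A small sketch of how such a query string might be built (the function name and quoting style are assumptions, not the project's actual client code):

```python
# Sketch of building an InfluxQL query that is aware of retention policies.
def build_query(measurement, retention_policy=None):
    # InfluxQL requires the RP in the FROM clause when it is not the
    # database default: SELECT * FROM "<rp>"."<measurement>"
    if retention_policy:
        source = f'"{retention_policy}"."{measurement}"'
    else:
        source = f'"{measurement}"'
    return f'SELECT * FROM {source}'

print(build_query('config_applied', 'short'))
# SELECT * FROM "short"."config_applied"
```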
3. if yes, write a failing test for it

Added openwisp_monitoring.db.backends.influxdb.tests.TestDatabaseClient.test_read_with_rp

4. if yes, how you fixed it

Added retention_policy as a parameter in metric.read() and timeseries_db.read(). So, you can query like this:

In [7]: m.read(retention_policy='short')
Out[7]: 
[{'time': 1596102060, 'config_applied': 1},
 {'time': 1596102120, 'config_applied': 1},
...

Personally, I don't like this approach since the user will need to specify the retention_policy every time in read / write. Maybe we can refactor retention_policy to be a field on the Metric model (later on).

The alert didn't come immediately after 5 minutes (it took roughly 6-7 minutes, I think), but yes, I tested it twice and added a Notification count assertion in test_config_modified_device_problem

@nepython nepython marked this pull request as ready for review July 30, 2020 21:57
@nepython nepython added enhancement New feature or request influxdb labels Jul 30, 2020
Member

@nemesifier nemesifier left a comment


Great progress, it finally works 👍
I still think we have dangerous code in our software and I'd like to get it resolved since we're already here and have our mind on it.

[...]

3. if yes, write a failing test for it

Added openwisp_monitoring.db.backends.influxdb.tests.TestDatabaseClient.test_read_with_rp

I prefer to ensure the actual functional test of the check fails (it would have helped you to spot this without me being involved), I explained in one comment below how to do it.

4. if yes, how you fixed it

Added retention_policy as a parameter in metric.read() and timeseries_db.read().

Ok, I see you added support for this in the tsdb backend, but the short retention policy was already being used to retrieve the device data, and that code has not been updated. Can you update that bit of code as well to make it use the new way so we can probably simplify the code? I'm referring to these:

q = device_data_query.format(SHORT_RP, self.__key, self.pk)

Personally, I don't like this approach since the user will need to specify the retention_policy every time in read / write. Maybe we can refactor retention_policy to be a field on the Metric model (later on).

Doesn't worry me, there are other things a lot more important that we need to work on.

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from b07ac63 to eab6145 on July 31, 2020 11:58
@nepython
Copy link
Contributor Author

4. if yes, how you fixed it

Added retention_policy as a parameter in metric.read() and timeseries_db.read().

Ok, I see you added support for this in the tsdb backend, but the short retention policy was already being used to retrieve the device data, and that code has not been updated. Can you update that bit of code as well to make it use the new way so we can probably simplify the code? I'm referring to these:

q = device_data_query.format(SHORT_RP, self.__key, self.pk)

@nemesisdesign, as you have correctly pointed out, timeseries_db.query() was being used and not timeseries_db.read() for device_data. The way we query device_data will change fully after #164 (the reason is that Elasticsearch doesn't support querying large text fields efficiently, so a different approach had to be adopted). It will simplify and transform as below:

points = timeseries_db._device_data(
    rp=SHORT_RP, tags={'pk': self.pk}, key=self.__key, fields='data'
)

Also, the abstraction was not complete, as device_data_query is a dict in the case of Elasticsearch. I think if needed, we can simplify it there 😃.

In order to have robust tests, I think we should switch to using freezegun to arbitrarily manipulate the time; we already do this in other modules. Can you please take advantage of this chance to change all the tests of this test suite which have this approach so that they use freezegun.freeze_time instead?

You would freeze the time 10 minutes earlier to trigger the check; that would trigger the insertion in the timeseries DB without the need of this line. This will make the test cases a lot more robust and will help us catch bugs in the future.

Ok, thanks, freezegun works great. Currently it's implemented only for this test; I will do it for all the others in the suite. Though we will need to run the checks twice in order to validate alerts after this: datetime.now() is mocked throughout the code to point 10 minutes backwards, hence alert_settings._is_crossed also gets now() as now() - timedelta(minutes=10). So perform_check needs to be run twice, once with freeze_time (as a context manager) and once without it. It works perfectly with this.
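The two-run pattern described here can be sketched with the standard library alone (the real tests use freezegun.freeze_time; perform_check below is a stand-in, not the actual Celery task):

```python
# Stdlib-only sketch: write one point with "now" shifted 10 minutes into
# the past, then run again with the real clock, so the tolerance window
# between the two recorded timestamps is genuinely exceeded.
from datetime import datetime, timedelta, timezone
from unittest import mock

def now():
    return datetime.now(timezone.utc)

written_points = []

def perform_check():
    # stand-in for the real task: records the timestamp it would write
    written_points.append(now())

ten_min_ago = datetime.now(timezone.utc) - timedelta(minutes=10)
with mock.patch(f'{__name__}.now', lambda: ten_min_ago):
    perform_check()   # point timestamped 10 minutes in the past
perform_check()       # second run with the real clock

tolerance = timedelta(minutes=5)
print(written_points[1] - written_points[0] > tolerance)  # True
```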

To get it working, I had to alter the tsdb backend write method, which seems to be causing failures in many other tests (maybe clock differences, unsure). I am checking out why; the change is simple enough 🤔.

timestamp = kwargs.get('timestamp') or now()  # <--- now() is extra here
if isinstance(timestamp, datetime):
    timestamp = timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')
point['time'] = timestamp

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from eab6145 to 4f87998 on July 31, 2020 15:31
@nepython
Contributor Author

To get it working, I had to alter the tsdb backend write method, which seems to be causing failures in many other tests (maybe clock differences, unsure). I am checking out why; the change is simple enough 🤔.

Check this build (https://travis-ci.org/github/openwisp/openwisp-monitoring/jobs/713662695#L1163) to see the failure.

Found it: it was related to the precision of the timestamp. This issue was hidden until now and surfaced due to the change below.

timestamp = kwargs.get('timestamp') or now()  # now() is added here
if isinstance(timestamp, datetime):
    timestamp = timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')  # This was the reason, seconds precision
point['time'] = timestamp    # Passing timestamp was optional earlier

I will try to add a failing test for this.
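The failing test would hinge on the fact that second-level formatting discards sub-second precision; a minimal demonstration (the dates are arbitrary):

```python
# Formatting with '%Y-%m-%dT%H:%M:%SZ' silently drops microseconds, so
# two points written within the same second collide on one timestamp.
from datetime import datetime

t1 = datetime(2020, 7, 31, 15, 31, 5, 123456)
t2 = datetime(2020, 7, 31, 15, 31, 5, 987654)

fmt = '%Y-%m-%dT%H:%M:%SZ'
print(t1.strftime(fmt) == t2.strftime(fmt))  # True: both become ...15:31:05Z
```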

Member

@nemesifier nemesifier left a comment


Sounds all good @nepython, let me know when ready for review.

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 2 times, most recently from 9a6ed40 to ec5e68b on July 31, 2020 17:01
…odically #54

[influxdb] Add support for retention_policy in read
@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from ec5e68b to 0041002 on July 31, 2020 17:21
@nemesifier nemesifier merged commit f5f3c1e into dev Aug 1, 2020
@nemesifier nemesifier deleted the issues/54-add-check-inspecting-config-status branch August 1, 2020 00:15