
[feature] Add a check which inspects device configuration status periodically #123

Merged
1 commit merged into dev from issues/54-add-check-inspecting-config-status on Aug 1, 2020

Conversation

nepython
Contributor

@nepython nepython commented Jun 10, 2020

Closes #54

@nepython nepython marked this pull request as draft June 10, 2020 20:20
@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 2 times, most recently from c304e4d to dabafcc on June 11, 2020 20:20
@nepython
Contributor Author

nepython commented Jun 11, 2020

QUERIES:

  • When is the status supposed to be made critical? The issue mentioned that the status should change to problem/critical depending upon the configuration. I didn't understand this.
  • I think I should rename all instances of config_status to config_modified, which looks more appropriate. Does that seem like a better name?

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 2 times, most recently from 0094d9e to 58539d6 on June 11, 2020 21:55
@PabloCastellano
Contributor

QUERIES:

  • When is the status supposed to be made critical? The issue mentioned that the status should change to problem/critical depending upon the configuration. I didn't understand this.

Check out the README file. According to the current version, a status will be critical when the metric is defined in OPENWISP_MONITORING_CRITICAL_DEVICE_METRICS. Critical status is reserved for some metrics such as ping.
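For context, the setting referred to above is declared as a list of metric definitions in the project's settings.py. A hedged sketch of an override (the dict keys below mirror the documented ping default and are assumptions to be checked against the README):

```python
# Hypothetical settings.py override: only metrics listed here can flip
# the device health status to critical. The dict keys are taken from
# the documented ping default and should be verified against the README.
OPENWISP_MONITORING_CRITICAL_DEVICE_METRICS = [
    {'key': 'ping', 'field_name': 'reachable'},
]
```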

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from 58539d6 to fc42293 on June 12, 2020 20:35
@nepython nepython marked this pull request as ready for review June 12, 2020 20:41
@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 2 times, most recently from c3a1a23 to d3ac3f4 on June 13, 2020 11:41
# Input in days
CONFIG_MODIFIED_RETENTION_POLICY = getattr(
    settings, 'OPENWISP_MONITORING_CONFIG_STATUS_RETENTION_POLICY', '48h0m0s'
)
Contributor

Please follow the standard we use for setting names: the user defines OPENWISP_MONITORING_X_Y_Z but we use X_Y_Z internally
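A minimal sketch of this convention (illustrative only; FakeSettings is a stand-in for django.conf.settings, and AUTO_PING is just an example name):

```python
# Illustrative sketch (not the actual module code) of the convention the
# reviewer describes: users define OPENWISP_MONITORING_X_Y_Z in their
# Django settings, while the module refers to X_Y_Z internally.

class FakeSettings:
    """Stand-in for django.conf.settings, for illustration only."""
    OPENWISP_MONITORING_AUTO_PING = False  # user override

settings = FakeSettings()

# Internally the OPENWISP_MONITORING_ prefix is dropped:
AUTO_PING = getattr(settings, 'OPENWISP_MONITORING_AUTO_PING', True)
print(AUTO_PING)  # False: the user's value wins over the default
```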

Contributor Author

Can there be an exception here? I have named the setting OPENWISP_MONITORING_DEVICE_CONFIGURATION_X_Y_Z, but internally the settings are named CONFIG_X_Y_Z. The reason was to explicitly tell newcomers that the setting is related to DEVICE_CONFIGURATION (there's also chart config, and soon there will be metric config), while internally, i.e. in the check settings, we know it means device config, hence it's shortened. Let me know if this seems right.

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from 7ab69dd to c7ffc08 on June 16, 2020 09:21
@nepython
Contributor Author

@nemesisdesign, I just realized that there is no migration for the ping and config_modified checks in sample_check. I don't think there's a need for one, since the module is not yet released (and the migrations were meant for backward compatibility). Can you please confirm this?

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from 1047e03 to ba31564 on June 19, 2020 10:30
@@ -35,22 +35,43 @@ def perform_check(uuid):


 @shared_task
-def auto_create_ping(model, app_label, object_id, created):
+def auto_create_ping(model, app_label, object_id):
Contributor Author

I think it's safe to remove the keyword argument created, since it's not being used at all.

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 3 times, most recently from 63e1612 to d6203a0 on June 22, 2020 16:11
Member

@nemesifier nemesifier left a comment


Good progress.

I have not tested this yet, I'm only doing code-level review, see my comments.

@nemesisdesign, I just realized that there is no migration for the ping and config_modified checks in sample_check. I don't think there's a need for one, since the module is not yet released (and the migrations were meant for backward compatibility). Can you please confirm this?

No data migration needed for sample app.

PS: rebase on the current master

README.rst Outdated
@@ -247,6 +247,53 @@ in terms of disk space.

Whether ping checks are created automatically for devices.

``OPENWISP_MONITORING_AUTO_DEVICE_CONFIGURATION_MODIFIED``
Member

what about OPENWISP_MONITORING_AUTO_DEVICE_CONFIG_CHECK?

README.rst Outdated
+--------------+-----------+

This setting allows you to configure the time that the check is allowed to
fail after which the device health status changes to ``problem``.
Member

I don't understand this setting from its explanation; we should also find a shorter name, e.g. OPENWISP_MONITORING_DEVICE_CONFIG_CHECK_MAX_TIME

README.rst Outdated
**Note**: If the setting ``OPENWISP_MONITORING_AUTO_DEVICE_CONFIGURATION_MODIFIED``
is disabled then this setting need not be declared.

``OPENWISP_MONITORING_DEVICE_CONFIGURATION_MODIFIED_RETENTION_POLICY``
Member

too long.. OPENWISP_MONITORING_DEVICE_CONFIG_CHECK_RP?


class CheckConfig(AppConfig):
    name = 'openwisp_monitoring.check'
    label = 'check'
    verbose_name = _('Network Monitoring Checks')

    def ready(self):
        manage_config_modified_retention_policy()
Member

how does this play with the abstraction? Can you make it play well with the changes introduced in #103?

Contributor Author

manage_config_modified_retention_policy() makes use of timeseries_db.get_list_retention_policies (after rebasing), which is defined by us, so it's abstract enough, I think.
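A stubbed sketch of the idea (method names taken from the discussion; FakeTimeseriesDB, the body of the function, and the duration value are assumptions, not the project's code): the bootstrap only talks to the abstract timeseries_db API, so any backend implementing these two methods works.

```python
# Stubbed, backend-agnostic sketch of a retention-policy bootstrap.
SHORT_RP = 'short'

class FakeTimeseriesDB:
    """Stand-in for the abstract timeseries_db backend (illustration only)."""
    def __init__(self):
        self.policies = ['autogen']
    def get_list_retention_policies(self):
        return self.policies
    def create_retention_policy(self, name, duration):
        self.policies.append(name)

timeseries_db = FakeTimeseriesDB()

def manage_config_modified_retention_policy():
    # idempotent: only create the policy when it is missing
    if SHORT_RP not in timeseries_db.get_list_retention_policies():
        timeseries_db.create_retention_policy(name=SHORT_RP, duration='48h0m0s')

manage_config_modified_retention_policy()
manage_config_modified_retention_policy()  # second call is a no-op
print(timeseries_db.policies)  # ['autogen', 'short']
```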

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 3 times, most recently from 154bae6 to 6e5a612 on June 23, 2020 15:53
@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from c5b4ad4 to 8a868fc on July 29, 2020 17:34
Member

@nemesifier nemesifier left a comment


Still work to do, I'm disappointed to be highlighting the same silly mistakes 👎

Also, the check doesn't seem to work for me if the tolerance is not zero seconds; I'll do more tests and let you know more details if I can. UPDATE: I confirm this, here's how I tested it:

  • run dev server, run worker, run celery beat
  • set the config of a device to modified
  • wait more than 5 minutes, no alert is ever generated
  • if I set seconds to 0 in the alert settings, the alert is generated

self.assertIn('Ping', c.check)
self.assertEqual(Check.objects.count(), 2)
with self.subTest('Test AUTO_PING'):
    c1 = Check.objects.get(name='Ping')
Member

don't query by name, use the check field, please correct the same for any other occurrence of this everywhere!
Do not query by fields that can be changed by users, do not use get() for querying objects that can return multiple items!

Contributor Author

do not use get() for querying objects that can return multiple items!

I expect one Ping and one ConfigApplied check per device. Since, the test had only one device, I thought using get was safe enough. Though yes, querying by name (which can be changed) is bad practice on my side :(

Member

get will raise an error if multiple objects are returned, but it's better to have failures rather than errors since it's usually faster to understand what's wrong
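The point can be illustrated without Django at all (toy code, not the project's models): a get-style lookup on a user-editable field errors out as soon as two rows share the value, while the stable check path stays unambiguous.

```python
# Toy illustration of why get() on a user-editable field is fragile.
checks = [
    {'name': 'Ping', 'check': 'openwisp_monitoring.check.classes.Ping'},
    {'name': 'Ping', 'check': 'openwisp_monitoring.check.classes.ConfigApplied'},
]

def get(**kw):
    """Mimics QuerySet.get(): exactly one match or an error."""
    matches = [c for c in checks if all(c[k] == v for k, v in kw.items())]
    if len(matches) != 1:
        raise ValueError(f'expected exactly 1 match, got {len(matches)}')
    return matches[0]

try:
    get(name='Ping')            # two checks may share a user-edited name
except ValueError as e:
    print(e)                    # expected exactly 1 match, got 2

# Querying by the stable `check` path is unambiguous:
print(get(check='openwisp_monitoring.check.classes.Ping')['name'])  # Ping
```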

@nemesifier
Member

Thank you @nepython, I tested it and it works! +1
After using the feature and analyzing the code, I realized that the naming is odd.
The check would be a lot easier to understand if it were something like:
name: Configuration applied
alert settings operator: less than
alert settings threshold value: 1
alert settings tolerance: 600
So if the configuration is not applied for more than 5 minutes, we get an alert.
PS: while testing this, I also realized this: #172

@nemesisdesign, I am slightly confused. You are saying it should be 5 min and 600 s at the same time.
Currently it defaults to 5 minutes, with a setting OPENWISP_MONITORING_CONFIG_CHECK_MAX_TIME to allow users to configure it if needed (as per the issue description). I think the current approach of a 5 minute default plus a setting to configure it is better. Wdyt?

5 minutes, sorry, I mistyped.

with self.subTest('Test AUTO_CONFIG_CHECK'):
    c2 = Check.objects.get(name='Configuration Applied')
    self.assertEqual(c2.content_object, d)
    self.assertIn('ConfigApplied', c2.check)
Member

why don't you use assertEqual and compare it to self._CONFIG_APPLIED?

Contributor Author

This test was already written for the Ping check, so I had just adapted it for ConfigApplied checks. Made the change.

@nepython
Contributor Author

nepython commented Jul 30, 2020

Also, the check doesn't seem to work for me if the tolerance is not zero seconds; I'll do more tests and let you know more details if I can. UPDATE: I confirm this, here's how I tested it:

* run dev server, run worker, run celery beat

* set the config of a device to `modified`

* wait more than 5 minutes, no alert is ever generated

* if I set `seconds` to `0` in the alert settings, the alert is generated

@nemesisdesign, no, it's not a bug. You had asked to cover this use case earlier 😕, and I replied with a solution in #123 (comment). I awaited your response, pinged you two days later in #123 (comment) and on IM. Still no response; maybe you were slightly busy back then. Hence, I went ahead with the approach and informed you nevertheless in #123 (comment).

As you had requested, the ConfigApplied check won't send any alerts if the device goes offline (health is thus critical); the user will already have one for Ping. The reason is to avoid irritating the user with too many alerts. It will, however, alert the user once the device comes back online if the config still stays modified.

The zero-second case is, I think, an edge case where the device goes offline: ConfigApplied checks are triggered prior to the health status getting updated to CRITICAL. We might never set a 0s threshold for ConfigApplied, I think, since the config status stays modified for some time whenever a template is updated (it's not instantaneous, at least for me 😅)

Note: there is one specific test (test_config_modified_device_problem) that I added in check.tests.TestModels which verifies that the ConfigApplied check is working!
PS: maybe the test can be improved.
PPS: addressed / fixed everything else as per the review.

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from 8a868fc to 5ba3b99 on July 30, 2020 11:34
Member

@nemesifier nemesifier left a comment


I read again what was written previously here: #123 (review), it is correct, but that's not what is happening here.

Let me explain again. Please test it in the same way.

  • run dev server, run worker, run celery beat
  • ensure you have a device with health status OK and config status set to applied
  • set the config of a device to modified
  • wait more than 5 minutes, no alert is ever generated, but it should
  • if I set seconds to 0 in the alert settings, the alert is generated, why this difference?

I dug deeper with ipdb, I found out that when previous measurements are retrieved, no point is found.

Same if I use the influxdb shell:

> select * from config_applied;
> 

The situation changes if I remove retention_policy=SHORT_RP on the write method call:

> select * from config_applied;
name: config_applied
time                config_applied content_type  object_id
----                -------------- ------------  ---------
1596134001603513262 0              config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b
1596134078032234031 0              config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b
1596134379375040639 0              config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b

Please repeat the same tests and let me know:

  1. if you can replicate the same issue successfully
  2. if yes, why it is happening? Why writing with the short retention policy doesn't write anything?
  3. if yes, write a failing test for it
  4. if yes, how you fixed it

You have to do this kind of testing on your own, without the need for me to request it.
It must become the standard way you work.

@nepython nepython marked this pull request as draft July 30, 2020 18:57
@nepython
Contributor Author

Mentioning this so that anyone who faces the same issue in the future can hopefully avoid the trouble.

Please repeat the same tests and let me know:

1. if you can replicate the same issue successfully

Yes

2. if yes, why it is happening? Why writing with the short retention policy doesn't write anything?

When a retention policy other than the default is used in InfluxDB, it needs to be specified in the query (see https://community.influxdata.com/t/influxdb-not-writing-data-if-rp-is-specified/9861). So the issue is not with write but with read. To query a measurement in InfluxDB when a non-default retention policy is in place, use:

> select * from <retention_policy>.<measurement>;   # `select * from short.config_applied` in this case 
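In other words, the retention policy has to be part of the FROM clause. A small sketch of how such a query string might be built (the function name and quoting style are assumptions, not the project's actual client code):

```python
# Sketch of building an InfluxQL query that is aware of retention policies.
def build_query(measurement, retention_policy=None):
    # InfluxQL requires the RP in the FROM clause when it is not the
    # database default: SELECT * FROM "<rp>"."<measurement>"
    if retention_policy:
        source = f'"{retention_policy}"."{measurement}"'
    else:
        source = f'"{measurement}"'
    return f'SELECT * FROM {source}'

print(build_query('config_applied', 'short'))
# SELECT * FROM "short"."config_applied"
```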
3. if yes, write a failing test for it

Added openwisp_monitoring.db.backends.influxdb.tests.TestDatabaseClient.test_read_with_rp

4. if yes, how you fixed it

Added retention_policy as a parameter in metric.read() and timeseries_db.read(). So, you can query like this:

In [7]: m.read(retention_policy='short')
Out[7]: 
[{'time': 1596102060, 'config_applied': 1},
 {'time': 1596102120, 'config_applied': 1},
...

Personally, I don't like this approach since the user will need to specify the retention_policy every time in read / write. Maybe we can refactor retention_policy to be a field on the Metric model (later on).

The alert didn't come immediately after 5 minutes (it took roughly 6-7 minutes, I think), but yes, I tested it twice and added a Notification count assertion in test_config_modified_device_problem

@nepython nepython marked this pull request as ready for review July 30, 2020 21:57
@nepython nepython added enhancement New feature or request influxdb labels Jul 30, 2020
Member

@nemesifier nemesifier left a comment


Great progress, it finally works 👍
I still think we have dangerous code in our software and I'd like to get it resolved since we're already here and have our mind on it.

[...]

3. if yes, write a failing test for it

Added openwisp_monitoring.db.backends.influxdb.tests.TestDatabaseClient.test_read_with_rp

I prefer to ensure the actual functional test of the check fails (it would have helped you to spot this without me being involved), I explained in one comment below how to do it.

4. if yes, how you fixed it

Added retention_policy as a parameter in metric.read() and timeseries_db.read().

Ok, I see you added support for this in the tsdb backend, but the short retention policy was already being used to retrieve the device data, and that code has not been updated. Can you update that bit of code as well to make it use the new way so we can probably simplify the code? I'm referring to these:

q = device_data_query.format(SHORT_RP, self.__key, self.pk)

Personally, I don't like this approach since the user will need to specify the retention_policy every time in read / write. Maybe we can refactor retention_policy to be a field on the Metric model (later on).

Doesn't worry me, there are other things a lot more important that we need to work on.

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from b07ac63 to eab6145 on July 31, 2020 11:58
@nepython
Copy link
Contributor Author

4. if yes, how you fixed it

Added retention_policy as a parameter in metric.read() and timeseries_db.read().

Ok, I see you added support for this in the tsdb backend, but the short retention policy was already being used to retrieve the device data, and that code has not been updated. Can you update that bit of code as well to make it use the new way so we can probably simplify the code? I'm referring to these:

q = device_data_query.format(SHORT_RP, self.__key, self.pk)

@nemesisdesign, as you have correctly pointed out, timeseries_db.query() was being used and not timeseries_db.read() for device_data. The way we query device_data will change fully after #164 (the reason is that Elasticsearch doesn't support querying large text fields efficiently, so a different approach had to be adopted). It will simplify and transform as below:

points = timeseries_db._device_data(
    rp=SHORT_RP, tags={'pk': self.pk}, key=self.__key, fields='data'
)

Also, the abstraction was not complete, as device_data_query is a dict in the case of Elasticsearch. I think if needed, we can simplify it there 😃.

In order to have robust tests, I think we should switch to using freezegun to arbitrarily manipulate the time; we already do this in other modules. Can you please take advantage of this chance to change all the tests of this test suite which have this approach so that they use freezegun.freeze_time instead?

You would freeze the time 10 minutes earlier to trigger the check; that would trigger the insertion in the timeseries DB without the need of this line. This will make the test cases a lot more robust and will help us catch bugs in the future.

Ok, thanks, freezegun works great. Currently it's implemented only for this test; I will do it for all the others in the suite. Though we will need to run the checks twice in order to validate alerts after this: datetime.now() is mocked throughout the code to point 10 minutes backwards, hence alert_settings._is_crossed also gets now() as now() - timedelta(minutes=10). So perform_check needs to be run twice, once with freeze_time (as a context manager) and once without it. It works perfectly with this.
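The two-run pattern described here can be sketched with the standard library alone (the real tests use freezegun.freeze_time; perform_check below is a stand-in, not the actual Celery task):

```python
# Stdlib-only sketch: write one point with "now" shifted 10 minutes into
# the past, then run again with the real clock, so the tolerance window
# between the two recorded timestamps is genuinely exceeded.
from datetime import datetime, timedelta, timezone
from unittest import mock

def now():
    return datetime.now(timezone.utc)

written_points = []

def perform_check():
    # stand-in for the real task: records the timestamp it would write
    written_points.append(now())

ten_min_ago = datetime.now(timezone.utc) - timedelta(minutes=10)
with mock.patch(f'{__name__}.now', lambda: ten_min_ago):
    perform_check()   # point timestamped 10 minutes in the past
perform_check()       # second run with the real clock

tolerance = timedelta(minutes=5)
print(written_points[1] - written_points[0] > tolerance)  # True
```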

To get it working, I had to alter the tsdb backend write method, which seems to be causing failures in many other tests (maybe clock differences, unsure). I am checking out why; the change is simple enough 🤔.

timestamp = kwargs.get('timestamp') or now()  # <--- now() is extra here
if isinstance(timestamp, datetime):
    timestamp = timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')
point['time'] = timestamp

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from eab6145 to 4f87998 on July 31, 2020 15:31
@nepython
Contributor Author

To get it working, I had to alter the tsdb backend write method, which seems to be causing failures in many other tests (maybe clock differences, unsure). I am checking out why; the change is simple enough 🤔.

Check this build (https://travis-ci.org/github/openwisp/openwisp-monitoring/jobs/713662695#L1163) to see the failure.

Found it: it was related to the precision of the timestamp. This issue was hidden until now and surfaced due to the change below.

timestamp = kwargs.get('timestamp') or now()  # now() is added here
if isinstance(timestamp, datetime):
    timestamp = timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')  # This was the reason, seconds precision
point['time'] = timestamp    # Passing timestamp was optional earlier

I will try to add a failing test for this.
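The failing test would hinge on the fact that second-level formatting discards sub-second precision; a minimal demonstration (the dates are arbitrary):

```python
# Formatting with '%Y-%m-%dT%H:%M:%SZ' silently drops microseconds, so
# two points written within the same second collide on one timestamp.
from datetime import datetime

t1 = datetime(2020, 7, 31, 15, 31, 5, 123456)
t2 = datetime(2020, 7, 31, 15, 31, 5, 987654)

fmt = '%Y-%m-%dT%H:%M:%SZ'
print(t1.strftime(fmt) == t2.strftime(fmt))  # True: both become ...15:31:05Z
```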

Member

@nemesifier nemesifier left a comment


Sounds all good @nepython, let me know when ready for review.

@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch 2 times, most recently from 9a6ed40 to ec5e68b on July 31, 2020 17:01
…odically #54

[influxdb] Add support for retention_policy in read
@nepython nepython force-pushed the issues/54-add-check-inspecting-config-status branch from ec5e68b to 0041002 on July 31, 2020 17:21
@nemesifier nemesifier merged commit f5f3c1e into dev Aug 1, 2020
@nemesifier nemesifier deleted the issues/54-add-check-inspecting-config-status branch August 1, 2020 00:15