Currently, if a project on Rill Cloud is failing and we get alerted, it can be difficult for support to quickly diagnose whether this is an internal issue or a customer-impacting one. It's a multi-part process (see this thread):
We have to find the project ID
We have to assume the user and check the status
As we scale (or when there are multiple alerts), this forces the support user to go through instances one by one. We should provide a rill sudo project check health <pattern> command that lets the support user quickly list and check the health of all projects matching a predefined pattern. For example:
rill sudo project check health all to print out the health of all projects
rill sudo project check health <name_of_org> to print out the health of all projects belonging to a certain org
rill sudo project check health <name_of_user> to print out the health of all projects belonging to a certain user
rill sudo project check health <domain> to print out the health of all projects by email domain (or perhaps covered by org)
rill sudo project check health <status_type> to print out only the projects with a given "status type", e.g. error / failing, warning, healthy (nice-to-have, but not sure if it's reasonable to implement)
Support for other search patterns, such as chained conditions (search by org / user AND status type)
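The patterns above leave one ambiguity: a bare <name> could be an org, a user, or a domain. One way to resolve that and also support chaining is a key:value syntax where every term must match. The sketch below is a minimal illustration in Go under stated assumptions: the Project type, the Status values, and the org:, user:, domain: and status: prefixes are all hypothetical, not existing Rill flags or APIs, and fetching projects and their statuses is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Status is a hypothetical health classification for a project.
type Status string

const (
	StatusHealthy Status = "healthy"
	StatusWarning Status = "warning"
	StatusError   Status = "error"
)

// Project is a hypothetical, simplified view of a Rill Cloud project.
type Project struct {
	Org        string
	Name       string
	OwnerEmail string
	Status     Status
}

// Filter reports whether a project matches a single condition.
type Filter func(Project) bool

// parseFilter turns one "key:value" term (or the bare word "all") into a Filter.
func parseFilter(term string) (Filter, error) {
	if term == "all" {
		return func(Project) bool { return true }, nil
	}
	key, value, ok := strings.Cut(term, ":")
	if !ok || value == "" {
		return nil, fmt.Errorf("invalid pattern %q, expected key:value or all", term)
	}
	switch key {
	case "org":
		return func(p Project) bool { return strings.EqualFold(p.Org, value) }, nil
	case "user":
		return func(p Project) bool { return strings.EqualFold(p.OwnerEmail, value) }, nil
	case "domain":
		suffix := "@" + strings.ToLower(value)
		return func(p Project) bool { return strings.HasSuffix(strings.ToLower(p.OwnerEmail), suffix) }, nil
	case "status":
		return func(p Project) bool { return string(p.Status) == strings.ToLower(value) }, nil
	default:
		return nil, fmt.Errorf("unknown pattern key %q", key)
	}
}

// Match returns the projects that satisfy every term (terms are ANDed).
func Match(projects []Project, terms []string) ([]Project, error) {
	filters := make([]Filter, 0, len(terms))
	for _, t := range terms {
		f, err := parseFilter(t)
		if err != nil {
			return nil, err
		}
		filters = append(filters, f)
	}
	var out []Project
	for _, p := range projects {
		keep := true
		for _, f := range filters {
			if !f(p) {
				keep = false
				break
			}
		}
		if keep {
			out = append(out, p)
		}
	}
	return out, nil
}

func main() {
	projects := []Project{
		{"acme", "sales", "ann@acme.com", StatusHealthy},
		{"acme", "ops", "bob@acme.com", StatusError},
		{"globex", "web", "cy@globex.io", StatusError},
	}
	// Roughly a hypothetical: rill sudo project check health org:acme status:error
	matched, err := Match(projects, []string{"org:acme", "status:error"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range matched {
		fmt.Printf("%s/%s\t%s\n", p.Org, p.Name, p.Status)
	}
}
```

With the sample data this prints a single line, for acme/ops. Keeping the bare word all as match-everything preserves the original "all" form, and ANDing terms covers the "org / user AND status type" case; OR semantics would need a separate syntax.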
As utility functions, these would be extremely beneficial from both a devops / infra and a support perspective.
Tagging @himadrisingh if he has any suggestions / thoughts he wanted to add on utility commands he'd find helpful in this area. Tagging @begelundmuller and @nishantmonu51 for visibility.
In addition to checking status fields, it would also be nice to make at least one analytical API request (or potentially even run our k6 load test, though that might slow it down too much), since certain errors that break dashboards only surface at request time (e.g. a dimension value that's larger than the allowed response size). Alternatively, we might consider running a health check as part of reconcile as well, to capture such errors in a status field.