Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Cadvisor #225

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
Open

Conversation

sudo-Tiz
Copy link
Contributor

@sudo-Tiz sudo-Tiz commented Jul 4, 2024

role is working but :
can't run rootless
path_prefix doesn't work (it's currently commented in the group_vars)

please also consider cloning ansible-role-cadvisor and updating url in requierements.yml to let mash own the role (and use your automation to keep it updated)

@sudo-Tiz sudo-Tiz marked this pull request as draft July 9, 2024 19:20
@spantaleev
Copy link
Member

Some feedback on the role:

  • the "Project source code URL" in defaults/main.yml is incorrect
  • defaults/main.yml mentions "node_exporter" / "node exporter" in multiple places
  • if the role only exposes metrics on the cadvisor_hostname, then there's no need for additional labels for XSS protection and whatnot (cadvisor_container_labels_traefik_additional_response_headers_auto). Feel free to keep this "additional response headers" feature, but you can remove the _auto stuff (and leave it as an empty list). It seems like _auto refers to variables like cadvisor_http_header_xss_protection, cadvisor_http_header_frame_options, etc., which don't exist. It seems like neither cadvisor_container_labels_traefik_additional_request_headers, nor cadvisor_container_labels_traefik_additional_response_headers are actually used in templates/labels.j2, so.. maybe you should actually get rid of all of these variables.
  • "exposes its HTTP port (tcp/9100 in the container)" seems to be incorrect, given that cadvisor_container_http_port: 8080. This needs to be adjusted or the (..) part can be removed.

@spantaleev
Copy link
Member

The documentation page mentions cadvisor_hostname, but group vars are auto-setting cadvisor_container_labels_metrics_hostname directly. This is probably a bit odd.

For other roles, we do it like this, because the main hostname (SERVICE_hostname) is actually used for a web UI (and as a default for the metrics hostname).

For this role, there's no web UI and you're telling people to use cadvisor_hostname for configuring the metrics hostname, so.. maybe it makes sense to have group vars wiring influence cadvisor_hostname (which then goes to influence cadvisor_container_labels_metrics_hostname via defaults/main.yml).

@sudo-Tiz
Copy link
Contributor Author

Thank you for your feedback, @spantaleev.

I recognized that the role was far from ready for merging as it contained copy/paste artifacts from the initial role. I have since updated the documentation and made changes to the role accordingly.

Regarding cAdvisor, I have removed the "metrics" labels since it exposes both metrics and webUI at the same address and port

I believe the role should now be ready for testing

@sudo-Tiz sudo-Tiz marked this pull request as ready for review July 18, 2024 20:20
Copy link
Member

@spantaleev spantaleev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of pushing this over the finish line, but it has tons of issues.

Besides the ones reported in inline comments in the review, there are also these:

  • the cadvisor_config_providers_docker_endpoint_is_unix_socket variable in vars/main.yml has some Traefik mentions in its comments. Not sure if it's relevant to cAdvisor or not
  • the systemd service file (templates/cadvisor.service.j2) does --cap-drop=ALL only when a non-unix-socket is used. Not sure if this is intentional or a mistake
  • the cAdvisor container seems stuck in an unhealthy state, which makes Traefik not want to send requests to it. The docs page seems to mention an additional environment variable (CADVISOR_HEALTHCHECK_URL=http://localhost:8080/healthz), but.. adding this does not help either. If this is a good value, it should be default anyway - why ask users to specify it manually? Regardless of what i did, the container was unhealthy, which may be due to a misconfiguration.
  • the docs page says "You will have to mount specific folders depending on your need", but.. it's not clear what a good default is.. I tried mounting whatever you're recommending (except for /dev/disk/) and cAdvisor still fails to start in a healthy state
  • cadvisor_dashboard_urls points to some dashboard, but I don't see it being mentioned in the docs (with instructions how to hook it to Grafana). For some example instructions, see docs/services/grafana.md

templates/group_vars_mash_servers Outdated Show resolved Hide resolved
templates/group_vars_mash_servers Outdated Show resolved Hide resolved
Comment on lines +1624 to +1630
cadvisor_container_labels_metrics_enabled: "{{ prometheus_enabled | default(false) or mash_playbook_metrics_exposure_enabled }}"
cadvisor_container_labels_metrics_hostname: "{{ mash_playbook_metrics_exposure_hostname }}"
cadvisor_container_labels_metrics_path_prefix: "{{ mash_playbook_metrics_exposure_path_prefix }}/{{ cadvisor_identifier }}"
cadvisor_container_labels_metrics_traefik_middleware_basic_auth_enabled: "{{ mash_playbook_metrics_exposure_http_basic_auth_enabled }}"
cadvisor_container_labels_metrics_traefik_middleware_basic_auth_users: "{{ mash_playbook_metrics_exposure_http_basic_auth_users }}"
cadvisor_container_labels_metrics_middleware_basic_auth_enabled: "{{ mash_playbook_metrics_exposure_http_basic_auth_enabled }}"
cadvisor_container_labels_metrics_middleware_basic_auth_users: "{{ mash_playbook_metrics_exposure_http_basic_auth_users }}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These variables do not seem to be defined anymore, yet.. they're here.

That said, I think it's better if metrics had their own Traefik router (separate from the web UI) and for them to respect the mash_playbook_metrics_exposure_* variables automatically (auto-enabling metrics exposure for this service, possibly protected with the Basic Auth credentials specified in mash_playbook_metrics_exposure_http_basic_auth_*).

The web UI could remain optional and have its (optional) separate set of Basic Auth credentials

templates/group_vars_mash_servers Outdated Show resolved Hide resolved
templates/group_vars_mash_servers Outdated Show resolved Hide resolved
@spantaleev spantaleev closed this Aug 8, 2024
@spantaleev spantaleev reopened this Aug 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants