
Reorganize and update the entire birdhouse stack #328

Closed
wants to merge 43 commits

Conversation

@mishaschwartz (Collaborator) commented May 19, 2023

Overview

  • Reorganize and update the entire birdhouse stack.

    • Moves all components into the following directories under the birdhouse/ directory:

      • core/: required components and those that do not have user facing endpoints
      • services/: optional components that provide user facing services
      • extensions/: optional configurations that modify services
      • data/: database and mounted volume definitions that are used by services
      • test-helpers/: services and extensions intended only to test the stack (i.e. not for production)
    • Deprecates the following services by moving them to the services/.deprecated/ directory. They are no longer maintained but are kept for historical reasons; they will no longer work with the current stack and would require significant changes if they need to be added back in the future.

      • catalog
      • frontend
      • malleefowl
      • ncops
      • ncwms2
      • phoenix
      • portainer
      • project-api
      • solr
    • Access to all routes now goes through twitcher by default. This ensures that all services are protected behind the authorization proxy by default. The only exception is jupyterhub, which accesses magpie directly to authorize users. This includes WPS outputs, which previously were only protected through twitcher if the secure-data-proxy component was enabled.

    • Docker compose no longer exposes any container ports outside the default network except for ports 80 and 443 from the proxy container. This ensures that ports not intended for external access are not exposed to the wider internet even if firewall rules are not set correctly (see the compose sketch after this list).

    • Docker compose files were previously locked to version 3.4. The docker compose documentation currently recommends not specifying a version number so that the most recent syntax can be used when available. The version numbers were removed from all docker-compose.yml and docker-compose-extra.yml files.

    • Docker compose was being invoked with the docker-compose command instead of the docker compose subcommand. This meant that docker compose version 1 (which has since been deprecated) was being used. This updates the stack to use docker compose version 2+.

    • By default, all services authorized through twitcher/magpie were blocked for all users. A more reasonable default is to allow anonymous access, so that users with no login credentials can try out the stack with limited allocated resources. This can be disabled by changing the MAGPIE_DEFAULT_PERMISSION_GROUP environment variable from "anonymous"; default access is then given to whichever group MAGPIE_DEFAULT_PERMISSION_GROUP indicates instead.

    • Previously, WPS services could be accessed directly or through weaver. This meant that two endpoints provided the same service and that access permissions had to be synchronized between both. This redundancy was unnecessary, so WPS services are now only available through weaver. This also ensures that WPS outputs are organized by username (see services/weaver/config/twitcher/weaver_hooks.py:add_x_wps_output_context), which simplifies permission management of WPS outputs.

    • A new endpoint "/services" is added that provides a json string describing each of the user facing services currently enabled on the stack. This is a static string and serves a different purpose than the endpoints served by canarie-api (monitoring status). This endpoint is meant to be polled by the node registry scripts (in development) to provide information about what services are meant to be available without having to poll other endpoints directly.

    • A new endpoint "/version" is added that provides a string containing the current version number of the stack (e.g. "1.26.0"). This endpoint is meant to be polled by the node registry scripts (in development).

    • The documentation and comments previously contained explicit references to PAVICS. This is not desirable since this software could be used to deploy an instance that is not called "PAVICS". All explicit references to "PAVICS" have been changed to "Birdhouse".

    • Added some additional deployment tests that can be used to ensure that the components of this stack are configured correctly. This should ensure that any future changes will not break the deployment mechanisms used here.
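
A minimal compose sketch of the port-exposure model described above; the service names, image tags, and port numbers are illustrative assumptions, not the actual definitions from this PR:

```yaml
# Illustrative fragment only; not the actual birdhouse compose files.
# Note: no top-level "version:" key, per the change described above.
services:
  proxy:
    image: nginx:stable
    ports:            # only the proxy publishes ports on the host
      - "80:80"
      - "443:443"
  twitcher:
    image: example/twitcher:latest   # hypothetical image tag
    expose:
      - "8000"        # reachable only from containers on the compose network
```

With this layout the proxy can still reach twitcher over the default network, but nothing outside the host's Docker network can connect to twitcher directly.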

Changes

Non-breaking changes

Breaking changes

  • reorganizes entire stack (see overview section above)

Related Issue / Discussion

Additional Information

Links to other issues or sources.

  • This is not currently testable with the current JenkinsCI setup ... fix that
  • ensure that autodeployment (through the scheduler component) still works
  • ensure that the monitoring components still work
  • update all relevant documentation

@github-actions bot added labels ci/operations (Continuous Integration), components, and documentation (Improvements or additions to documentation) on May 19, 2023
@crim-jenkins-bot (Collaborator) commented:

E2E Test Results

DACCS-iac Pipeline Results

Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/1546/
Result : failure

BIRDHOUSE_DEPLOY_BRANCH : break-fix-everything
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master

DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-36.rdext.crim.ca

Infrastructure deployment failed. Instance has not been destroyed. @matprov

@fmigneault (Collaborator) commented:

@mishaschwartz
Please break up these changes. Proposed separation, not necessarily in order except the last:

  1. Rename variables (preferably, with backward support of older ones transparently) + Override of default permissions
  2. Docker and Docker Compose updates
  3. Weaver + WPS fixes
  4. New endpoints (version, service, etc.)
  5. Deprecate birds & default services enabled
  6. [maybe] refactor

For point (2) regarding the Docker Compose version, a neat approach to manage that would be to remove all version: "3.4" mentions from everywhere except a "base" https://github.com/bird-house/birdhouse-deploy/blob/master/birdhouse/docker-compose.yml (sketched below).
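
A sketch of that single-location approach, assuming the component fragments are merged onto the base file with repeated -f flags (the file layout and merge command are assumptions, not the repo's actual tooling):

```yaml
# birdhouse/docker-compose.yml: the one "base" file that keeps a version key.
version: "3.9"
services:
  proxy:
    image: nginx:stable

# Component docker-compose-extra.yml fragments would omit "version:" entirely;
# merged as, e.g.:
#   docker compose -f docker-compose.yml -f components/finch/docker-compose-extra.yml config
# so only the base file's declaration ever needs updating.
```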

Maybe (5) and (6) could be combined, since the services will disappear no matter what if an existing instance blindly updates the stack anyway. All those changes except (6) can be "reasonably" breaking, as they are fairly easy to patch up.

I'm not sure (6) is even worth the changes as proposed in this PR. It doesn't seem to add any benefit from a technical point of view. The changes seem conceptual at most, and I don't agree with many of them, such as the placement of some items in birdhouse/core vs birdhouse/data and birdhouse/services. For example, some core items are completely optional, while some data-related services (weaver-mongodb and postgres-magpie) are not grouped with the other data items but sit under services because they use custom definitions. The fact that components can switch so easily between categories shows that this structure over-complicates the definitions for nothing. Simply moving the current items stored in config into components would be sufficient.

I find the proposed changes (specifically the (6) part) introduce too many problems and a lot of unnecessary debugging effort (which we already don't have enough time for), not to mention incompatibility with existing PRs and instances. We won't go through all our deployed instances to adapt locations of configs and overrides. We are also just barely starting to get a "working and stable" stack with the nightly CI. Since we cannot validate those massive changes directly with the CI, I consider this too much of a regression to accept. I would prefer we focus on stabilizing and increasing the TRL of the stack rather than adding more instability.

I will close this PR directly since I don't intend to accept it as one whole chunk.
The PR goes against the established policy objectives: https://github.com/bird-house/birdhouse-deploy/blob/master/CONTRIBUTING.rst#policy-objectives
Open individual PRs for the separated features to keep the Git history cleaner.

@fmigneault closed this May 19, 2023
@mishaschwartz (Collaborator, Author) commented:

We had a whole discussion about making all breaking changes at once.... I apologize if I misunderstood the intent of that discussion. The whole point was to make major changes all at once so that you didn't have to regularly "go through all our deployed instances...". I would have loved to make smaller more incremental changes over the last few weeks and months as I was working on them but, as per our previous discussions on this topic, I was trying to put all breaking changes into a single PR to not cause ongoing disruption to your already deployed instances.

I also seem to have misunderstood our discussion about reorganizing the components of the stack into separate directories. I was under the impression that logically dividing components based on functionality was desirable.

I also, apparently, seem to have misunderstood the discussion about removing version dependencies (version 3.4 currently) from the docker compose files, which I was under the impression was a roadblock for other updates but hadn't been changed until now because it breaks backwards compatibility.

I agree that the current CI configuration is not sufficient to validate these changes (as I mentioned in the TODO list part of the description) which is why I left this as a draft PR until that can be addressed; a discussion I was hoping to continue with you or your team at CRIM shortly.

It would really help me, @fmigneault, if you could comment here with any changes that are no-gos for you so that I can be sure to avoid them when suggesting changes in the future. From what I understand from your comment above, they are:

  • do not reorganize the components in the stack
  • no changes to configurations that break backwards compatibility with existing set-up
  • no changes that require updates to existing CI infrastructure

I would also like to hear from @tlvu as well as anyone else who would like to offer an opinion.

@tlvu (Collaborator) commented May 24, 2023

Quick impressions from reading the PR description only; I have not had time to read the code changes. I have a huge deadline for a project this week, so I can only spend time on this PR next week.

So yes, anything breaking backward compat should be done together to avoid multiple manual interventions on the various existing deployments (prod and all the numerous staging/test instances at various organizations). Example: removal of the 3.4 version in all the docker-compose.yml files.

I realize I probably was not clear: breaking backward compat, in my mind, means breaking pavics-compose.sh, autodeploy, and the format of the env.local file. You can break backward compat of "end-user visible features" but not of pavics-compose.sh, autodeploy, and existing env.local (the platform); that is much easier to swallow from the operational standpoint.

That said, some of those changes can probably be done without breaking backward compat, so they do not have to be done inside this breaking PR? I am not sure on this one since I did not read the code changes. Maybe the docker-compose v2 update, the new endpoints "/version" and "/services", the documentation update to remove mentions of PAVICS in favor of Birdhouse, and the new tests do not break backward compat, so they do not need to be in this PR?

The idea is to make each PR easier to digest, so that it is quicker to spot problems and the review process is faster in general. Cognitively small PRs that change only 1 or 2 features at most are best, but that cannot always be achieved, so it is okay to break the rule from time to time.

Most of the changes in this PR are already planned, so I don't think you misunderstood anything, Misha. Again, small PRs are preferred, but when it's not possible, it's just not possible.

Here are my comments on the changes, based solely on the PR description, not the code. I have indicated whether I think each one breaks an "end-user visible feature" or the operation of the "platform".

  • Access to all routes now goes through twitcher by default, except JupyterHub

    • I don't remember discussing this one, but OK as long as we have a way to bail out. We are having Thredds performance issues behind Twitcher, so we might need a way out since 99% of our data are public.
    • Is Geoserver behaving correctly behind Twitcher? I wonder why, historically, it has not already been put behind Twitcher like Thredds is. Watch out for Geoserver.
    • Adding a bail-out could be in another PR too, to not make this PR bigger.
    • This breaks "end-user visible feature" but not the "platform" (pavics-compose.sh, autodeploy, env.local).
  • Docker compose no longer exposes any container ports outside the default network except for ports 80 and 443 from the proxy container.

    • Super !
    • This breaks "end-user visible feature" but not the "platform" (pavics-compose.sh, autodeploy, env.local).
      • End-users here are internal users only, because the organisation firewall is supposed to block all external users already.
  • The version numbers 3.4 were removed from all docker-compose.yml and docker-compose-extra.yml files.

    • Super ! It was quite annoying; good time to get rid of this limitation.
    • This does not break "end-user visible feature" but does break the "platform" (pavics-compose.sh, autodeploy, env.local)
      • All docker-compose-extra.yml fragments from all external repos have to be updated.
  • Updates the stack to use docker compose version 2+.

    • Super ! Watch out for autodeploy that runs in its own container with the old docker-compose v1.
    • Not sure how you implemented this, but I already wanted to upgrade to v2, and my plan was to make pavics-compose.sh also run in the same container as autodeploy so that there is no more version mismatch between the two (see the sketch after this list).
      • Furthermore, we would not even need to physically install docker-compose on the host.
      • Anything that would break autodeploy (a missing volume mount or binaries not found in the image) would be caught much earlier, at pavics-compose.sh invocation time.
      • Autodeploy will eventually call pavics-compose.sh up -d anyway, so pavics-compose.sh already has to work properly inside the autodeploy minimal docker image; might as well enforce this requirement and avoid surprises from different code paths (calling pavics-compose.sh locally vs from autodeploy).
    • If fixing autodeploy is complex, maybe delay this change to another PR and keep docker-compose v1 for now.
    • This does not break "end-user visible feature" and probably does not break the "platform" (pavics-compose.sh, autodeploy, env.local), depending on how it is implemented.
  • Allow anonymous access by default so that users with no login credentials can try out the stack with limited allocated resources.

    • You seem to have a way to disable this, so OK with me.
    • This breaks "end-user visible feature" but not the "platform" (pavics-compose.sh, autodeploy, env.local).
      • End-user here is the security administrator that has to lock down this feature if not needed.
  • WPS are now only available through weaver

    • I don't remember discussing this one, but is there a way to optionally revert to the existing redundancy of endpoints (both direct URL and behind Weaver)?
    • If complex to allow existing behavior, maybe delay this change to another PR and keep the redundancy for now.
    • This breaks "end-user visible feature" but not the "platform" (pavics-compose.sh, autodeploy, env.local).
      • End-users here are really all the end-users that have existing notebooks with direct URLs to the various WPS; they will discover their endpoints suddenly do not work anymore.
      • We have sample code or notebooks submitted as part of a paper and we must not break the evaluation by the paper reviewers.
  • New endpoints "/services", "/version", documentation update, new tests

    • Super !
    • This does not break any "end-user visible feature" because they are new features, and it does not break the "platform" (pavics-compose.sh, autodeploy, env.local).
  • List of dropped components

    • Super !
    • This breaks "end-user visible feature" but probably not the "platform" (pavics-compose.sh, autodeploy, env.local). We probably do not have any end-users using these dropped components so should be smooth.
  • Rename all the "components/" and "optional-components/" paths on disk

    • Thinking more about this, I am not sure this change is worth it anymore. Are we very clear and strict on which component goes where? What if a component can logically fit in more than one place? Next time someone adds a new component, are we going to debate where it belongs again? It's like email organization using labels vs folders: with physical folders, one email can only be in one folder; with labels, one email can have multiple labels, which is more flexible.
    • I really like all the conceptual terms you came up with. We can re-use these terms in the README that describes those components.
    • I think moving everything under ./config/* into ./components/ is probably sufficient.
    • And if you really want some indication at the filesystem level, you can create symlinks, so later if we want to move, it's just a symlink to change (ex: services/finch -> components/finch).
    • This does not break any "end-user visible feature" but does break the "platform" (pavics-compose.sh, autodeploy, env.local)
      • All existing env.local files from all existing deployments have to be updated due to the path renames
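
A rough shell sketch of the single-code-path idea described in the docker compose v2 bullet above; the image tag and mount points are illustrative assumptions (any image that ships the compose v2 plugin would do):

```sh
#!/bin/sh
# pavics-compose.sh (sketch): run docker compose v2 from the same image that
# autodeploy uses, so local runs and autodeploy share one compose version.
# "docker:24-cli" and the mounts below are assumptions for illustration.

# Mount the host's Docker socket so the containerized CLI drives the host
# daemon, and mount the current checkout at the same path so relative
# volume definitions in the compose files keep working.
exec docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$(pwd)":"$(pwd)" -w "$(pwd)" \
  docker:24-cli \
  docker compose "$@"
```

Called as ./pavics-compose.sh up -d, this behaves like a local docker compose up -d without compose ever being installed on the host.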

@fmigneault (Collaborator) commented:

@tlvu @mishaschwartz

The whole point was to make major changes all at once so that you didn't have to regularly "go through all our deployed instances...".

My point is exactly related to this. Only parts 5 & 6 (from my comment) are actually breaking. Those can go together (if needed). The rest are new features that should be contained in their own PRs.

I'm in favor of 5 being pushed relatively fast because they are all, indeed, deprecated components.

I am less in favor of 6 because, after seeing the changes, I don't have the impression that the reorganization of the components (core/data/services/...) is more logical than the current one (config/components/optional-components) [see my previous comment for more details about the mixed use of service "categories" and proposed locations]. I find the PR only swaps the current mixed-grouping logic for another; both the current categorization and the new one mix-and-match services of different uses and concepts. In theory, when we discussed it, it seemed like a good idea, but the result doesn't look as good as anticipated. Provided that the restructuring of the directories doesn't bring any advantage, IMO it is not worth the effort it will cause.

I agree with the general "what can / can't break" from @tlvu. If something "requires" breaking changes, then so be it. But if a small patch can preserve backward compat, then we might as well do so to make it transparent for active instances and users.

  • I think moving everything under ./config/* into ./components/ is probably sufficient.
  • And if you really want some indication at the filesystem level, you can create symlinks, so later if we want to move, it's just a symlink to change (ex: services/finch -> components/finch).

On the 1st point, I also agree.
On the 2nd, no: it would bring more confusion and make parsing difficult for the utility functions that crawl directories to find which compose files to load/combine.

I also, apparently, seem to have misunderstood the discussion about removing version dependencies (version 3.4 currently) from the docker compose files, which I was under the impression was a roadblock for other updates but hadn't been changed until now because it breaks backwards compatibility.

With this, I fully agree. There was no misunderstanding on your part.
I even proposed an approach to have it defined in a single location instead of across all docker files, to limit the burden of updating it as needed.

I have been using Docker Compose v2 with the version overridden to 3.9 for some time already on my dev machine, so I consider those changes almost non-breaking ("almost" because volume mounts can be weird, and cowbird workspace symlinks might not like the change).
Fixing an instance after those changes are applied is as easy as doing a search-and-replace in most cases, so it is not that big of a cognitive burden (see the one-liner sketch below). The same would not be true of the directory refactoring, for example, which requires careful adjustments on specific instances to resolve conflicts.
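
As an illustration of how mechanical that fix is, a hedged one-liner (it assumes the files declare exactly version: "3.4", that file names contain no spaces, and that GNU sed is available):

```sh
# Delete the version line from every compose file that still pins 3.4.
grep -rl --include='docker-compose*.yml' 'version: "3.4"' birdhouse/ \
  | xargs sed -i '/^version: "3\.4"$/d'
```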

@huard (Collaborator) commented May 24, 2023

I don't remember testing the WPS client (birdy) with the weaver endpoint yet.
If possible, it would be good to have the WPS offered through both the original birds and weaver for a while, to smooth the transition of all the notebooks.

@mishaschwartz (Collaborator, Author) commented:

Thank you both for your feedback and for clarifying what you all consider breaking changes for your purposes.

I will do the following immediately since I can do so without breaking "end-user visibility" or "platform" (as @tlvu helpfully put it):

  • create the /services endpoint in a separate PR
  • create the /version endpoint in a separate PR (polling sketch for both endpoints below)

Any further steps will require some extensive reworking to pull the parts out into individual PRs, and I will continue to work on that as time permits. I will not make any changes that move components in any way; there are too many concerns with that strategy at the moment.
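
For reference, a sketch of how the node registry scripts might poll the two new endpoints once those PRs land; the host name and the JSON shape of /services are assumptions, not the final design:

```sh
# /version returns a plain version string, e.g. "1.26.0" (per the PR description).
curl -s https://birdhouse.example.com/version

# /services returns a static JSON description of the enabled user-facing
# services; the field names below are a guess, not the final schema.
curl -s https://birdhouse.example.com/services
# e.g. {"services": [{"name": "weaver", ...}]}
```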
