Propose and Document Fire Drills for Infrastructure Services #148

Open
12 tasks
jrgriffiniii opened this issue Jun 15, 2023 · 0 comments
jrgriffiniii commented Jun 15, 2023

(This was captured during the Summer 2023 All Hands Team Check-In.)

Make a “fire drill” ticket for each component. Don’t work it alone! Find a partner, run the drill, and document it. Link to the resolution documentation from the related alert if possible.

Each fire drill should simulate an individual service outage that any application maintained by the RDSS team might encounter. Perhaps the canonical list of applications should be referenced.

The following services and components should have documented fire drills:

  • Amazon Web Services outage
  • Amazon Web Services S3 Bucket deletions
  • Globus infrastructure failure
  • Globus resource/asset deletions
  • Sidekiq infrastructure failure
  • Redis infrastructure failure
  • PostgreSQL infrastructure failure
  • Ansible provisioning failure
  • Capistrano deployment failure
  • NGINX server failure
  • Load balancer (NGINX Plus) failure
  • Application host server failure
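
As a rough sketch of how the per-component tickets could be drafted, the snippet below builds one title and checklist body for each item in the list above. The `Fire drill: ...` title format, the checklist wording, and the function names are all assumptions made for illustration; nothing here is an agreed RDSS convention, and the script only prints drafts rather than creating issues.

```python
# Hypothetical helper for drafting fire drill tickets.
# The title format and checklist wording are assumptions, not a team standard.

COMPONENTS = [
    "Amazon Web Services outage",
    "Amazon Web Services S3 Bucket deletions",
    "Globus infrastructure failure",
    "Globus resource/asset deletions",
    "Sidekiq infrastructure failure",
    "Redis infrastructure failure",
    "PostgreSQL infrastructure failure",
    "Ansible provisioning failure",
    "Capistrano deployment failure",
    "NGINX server failure",
    "Load balancer (NGINX Plus) failure",
    "Application host server failure",
]

# Steps paraphrased from the issue description above.
CHECKLIST = [
    "Find a partner",
    "Run the drill",
    "Document the resolution",
    "Link the resolution documentation from the related alert, if possible",
]


def draft_ticket(component: str) -> dict[str, str]:
    """Return a draft title and Markdown task-list body for one component."""
    body = "\n".join(f"- [ ] {step}" for step in CHECKLIST)
    return {"title": f"Fire drill: {component}", "body": body}


def draft_all_tickets() -> list[dict[str, str]]:
    """Return one draft ticket per component, in list order."""
    return [draft_ticket(component) for component in COMPONENTS]


if __name__ == "__main__":
    # Print each draft so it can be reviewed or pasted into a new issue.
    for ticket in draft_all_tickets():
        print(ticket["title"])
        print(ticket["body"])
        print()
```

The drafts are plain dictionaries so they can be reviewed by hand first; filing them automatically would be a separate step.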

@jrgriffiniii changed the title from “Propose and Document Fire Drills for Each Service” to “Propose and Document Fire Drills for Infrastructure Services” on Jun 15, 2023