From 1d95f638b51594e13dd8f98c457da1b8ccd3fd61 Mon Sep 17 00:00:00 2001
From: Suraj Nath <9503187+electron0zero@users.noreply.github.com>
Date: Sun, 18 Aug 2024 00:26:15 +0530
Subject: [PATCH] fix some spelling and grammer

---
 pages/about.md   | 25 +++++++++++++------------
 pages/learn.md   | 29 ++++++++++++++---------------
 pages/meetup.md  | 16 ++++++++--------
 pages/stories.md | 28 ++++++++++++++--------------
 4 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/pages/about.md b/pages/about.md
index 4a815ae..f9cbfb6 100644
--- a/pages/about.md
+++ b/pages/about.md
@@ -4,11 +4,13 @@ title: About
 permalink: /about/
 ---
 
-## What is Failure modes?
+## What is Failure Modes?
 
-Failure modes is community of software practitioners that likes share and discuss about the failure modes they have seen in real life.
+Failure Modes is a community of software practitioners that likes to share and discuss the failure modes they have seen in production systems.
 
-We host meetup in Bangalore, India. Meetups are invite only and follow [Chatham House Rule](https://en.wikipedia.org/wiki/Chatham_House_Rule) to ensure an safe space to share failures.
+We host meetups in Bangalore, India.
+
+Meetups are invite-only and follow [Chatham House Rule](https://en.wikipedia.org/wiki/Chatham_House_Rule) to ensure a safe space to share failures.
 
 > **Chatham House Rule:** When a meeting, or part thereof, is held under the Chatham House Rule,
 > participants are free to use the information received, but neither the identity nor the affiliation of
@@ -16,28 +18,27 @@ We host meetup in Bangalore, India. Meetups are invite only and follow [Chatham
 >
 > [Source - chathamhouse.org](https://www.chathamhouse.org/about/chatham-house-rule)
 
-See [Meetup](/meetup/) page for more details about meetup.
+See the [meetup](/meetup/) page for more details about the meetup.
 
-Browse around this website, and expore the collection of incidents and [learn](/learn/) from literature
+Browse around this website and explore the collection of incidents and [learn](/learn/) from literature
 on how and why software systems fail, and how we can build better systems
 
-## Why Failure modes?
+## Why Failure Modes?
 
-Running things in production is hard and running distributed systems in production is extra hard.
+Running software systems in production is hard, and running distributed systems in production is even harder.
 
 Failure Modes is an effort to curate resources and stories from the community,
 to learn and get better at running large scale software in production.
 
 ## Contributing to Failure modes
 
-- Observed an interesting failure mode in your day to day job?, saw and intesting post incident report from a company?, [create an issue](https://github.com/electron0zero/failure-modes/issues/new) to add that to our collection.
+- Observed an interesting failure mode in your day-to-day job? Saw an interesting post-incident report from a company?, [create an issue](https://github.com/electron0zero/failure-modes/issues/new) to add that to our collection.
 
-- Wrote a blogpost about a failure you saw in production? [create an issue](https://github.com/electron0zero/failure-modes/issues/new) with blogpost link to add it to our collection.
+- Wrote a blog post about a failure you saw in production? [create an issue](https://github.com/electron0zero/failure-modes/issues/new) with blog post link to add it to our collection.
 
-- Saw something intesting about failure modes on the internet? [create an issue](https://github.com/electron0zero/failure-modes/issues/new) and share the link. It can be anything from incident postmortems, blog posts, projects, talks, tweets, research, etc.
+- Saw something interesting about failure modes on the internet? [create an issue](https://github.com/electron0zero/failure-modes/issues/new) and share the link. It can be anything from incident postmortems, blog posts, projects, talks, tweets, research, and more.
 
 Huge thanks to our [contributors](https://github.com/electron0zero/failure-modes/graphs/contributors) :bowing_man: :bowing_woman: :tada:
 
-Have suggestions or questions, reach out to Suraj on twitter [@electron0zero](https://twitter.com/electron0zero) or open [an issue](https://github.com/electron0zero/failure-modes/issues)
+Have suggestions or questions? Reach out to Suraj on twitter [@electron0zero](https://twitter.com/electron0zero) or open [an issue](https://github.com/electron0zero/failure-modes/issues)
 
-:boom: :boom: :boom: :boom: :boom: :boom: :boom: :boom: :boom: :boom: :boom: :boom: :boom: :boom: :boom:
diff --git a/pages/learn.md b/pages/learn.md
index 1bcd8dd..a34ddd3 100644
--- a/pages/learn.md
+++ b/pages/learn.md
@@ -6,11 +6,11 @@ permalink: /learn/
 
 ## Learn about building resilient systems
 
-Collection of resources to learn about failures, and failure modes of software systems.
+A collection of resources to learn about failures and failure modes of software systems.
 
 ## Blog Posts
 
-Blog Posts on failures, reliability, testing and other relevant topics
+Blog posts on failures, reliability, testing, and other relevant topics
 
 - [Chaos Engineering — Review Lineage Driven Failure Injection(LDFI)](https://medium.com/becloudy/chaos-engineering-review-lineage-driven-failure-injection-ldfi-a1c831abe504)
 
@@ -22,14 +22,13 @@ Blog Posts on failures, reliability, testing and other relevant topics
 
 - [Lessons learned in incident management - Dropbox](https://dropbox.tech/infrastructure/lessons-learned-in-incident-management)
 
-- [Post Mortem - The Cloudflare Blog](https://blog.cloudflare.com/tag/postmortem), list postmortems from cloudflare
+- [Post Mortem - The Cloudflare Blog](https://blog.cloudflare.com/tag/postmortem), lists postmortems from cloudflare
 
 - [How we’re building a production readiness review process at Grafana Labs](https://grafana.com/blog/2021/10/13/how-were-building-a-production-readiness-review-process-at-grafana-labs/)
 
 ## Talks
 
-Talks on how systems fail, demo of systems, and other wisdom on how
-we can build better systems -
+Talks on how systems fail, demos of systems, and other wisdom on how we can build better systems -
 
 - [Debugging Under Fire: Keep your Head when Systems have Lost their Mind - Bryan Cantrill](https://www.youtube.com/watch?v=30jNsCVLpAE)
 
@@ -53,7 +52,7 @@ we can build better systems -
 
 ## Tools & Projects
 
-Tools and projects focused on failures, and failure modes of software systems.
+Tools and projects focused on failures and failure modes of software systems.
 
 ### [Chaos Toolkit](https://github.com/chaostoolkit)
 The Open Source Platform for Chaos Engineering
@@ -63,8 +62,8 @@ Chaos Monkey is a resiliency tool that helps applications tolerate random instan
 
 ### [Learning from Incidents in Software](https://www.learningfromincidents.io/)
 Incidents are costly. Without spending time analyzing and determining the conditions
-that exist in order for an incident to take place, we won't learn how to successfully
-remove nor recover from these conditions in the future.
+that exist for an incident to take place, we won't learn how to successfully
+remove or recover from these conditions in the future.
 
 Let's help each other learn.
 
@@ -74,8 +73,8 @@ understanding and coping with the immense levels of complexity involved
 in the operation of critical digital services.
 
 ### [Resilience engineering papers](https://github.com/lorin/resilience-engineering)
-Contains notes about people active in resilience engineering, as well as
-some influential researchers who are no longer with us
+Contains notes about people active in resilience engineering as well as
+some influential researchers who are no longer with us.
 
 ### [Kubernetes Failure Stories](https://github.com/hjacobs/kubernetes-failure-stories)
 
@@ -86,7 +85,7 @@ Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates
 chaos on Kubernetes environments.
 
 ### [Debugging stories - Dan Luu](https://github.com/danluu/debugging-stories)
-Collection of links to various debugging stories.
+A collection of links to various debugging stories.
 
 ### [A List of Post-mortems! - Dan Luu](https://github.com/danluu/post-mortems)
 A collection of postmortems.
@@ -99,7 +98,7 @@ Curated list of resources on testing distributed systems
 
 ## Research
 
-Research on failures and how to test, build and operate reliable systems -
+Research on failures and how to test, build, and operate reliable systems -
 
 - [Lineage-driven Fault Injection - the morning paper](https://blog.acolyer.org/2015/03/26/lineage-driven-fault-injection/)
 
@@ -113,7 +112,7 @@ Research on failures and how to test, build and operate reliable systems -
 
 - [Report from the SNAFU catchers Workshop on Coping With Complexity](https://snafucatchers.github.io/)
 
-- [How Complex Systems Fail - Richard I. Cook](https://how.complexsystems.fail/), [Original pdf](https://www.gwern.net/docs/technology/2000-cook.pdf)
+- [How Complex Systems Fail - Richard I. Cook](https://how.complexsystems.fail/), ([Original pdf](https://www.gwern.net/docs/technology/2000-cook.pdf))
 
 ### Fault Isolation using Shuffule Sharding
 - [AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)](https://www.youtube.com/watch?v=swQbA4zub20)
@@ -127,7 +126,7 @@ Research on failures and how to test, build and operate reliable systems -
 
 ## Systems
 
-Real world failure stories and incident postmortems of widely used systems
+Real-world failure stories and incident postmortems of widely used systems
 
 ### PostgreSQL
 - [Transaction ID wraparound outage at mandrill](https://mailchimp.com/what-we-learned-from-the-recent-mandrill-outage/)
@@ -141,7 +140,7 @@ Real world failure stories and incident postmortems of widely used systems
 ### Kubernetes
 - [Compilation of public failure/horror stories related to Kubernetes](https://github.com/hjacobs/kubernetes-failure-stories)
 - [10 Ways to Shoot Yourself in the Foot with Kubernetes, #9 Will Surprise You - Laurent Bernaille](https://www.youtube.com/watch?v=QKI-JRs2RIE)
-- [Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performances](https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html), also see [DNS Lookups in Kubernetes](https://mrkaran.dev/posts/ndots-kubernetes/)
+- [Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performance](https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html), also see [DNS Lookups in Kubernetes](https://mrkaran.dev/posts/ndots-kubernetes/)
 
 ### YugabyteDB
 - [How Plume Handled Billions of Operations Per Day Despite an AWS Zone Outage](https://blog.yugabyte.com/how-plume-handled-billions-of-operations-per-day-despite-an-aws-zone-outage/)
diff --git a/pages/meetup.md b/pages/meetup.md
index b97caf7..131e6d4 100644
--- a/pages/meetup.md
+++ b/pages/meetup.md
@@ -8,11 +8,11 @@ permalink: /meetup/
 
 We have occasional meetups in Bangalore, India.
 
-We meet whenever I have time to manage the logistics of the meetup, and have a venue that can host us. :smile:
+We meet whenever I have time to manage the logistics of the meetup and have a venue that can host us. :smile:
 
-Meetups are invite-only and follow [Chatham House Rule](https://en.wikipedia.org/wiki/Chatham_House_Rule) to ensure an safe space to share failiures.
+Meetups are invite-only and follow [Chatham House Rule](https://en.wikipedia.org/wiki/Chatham_House_Rule) to ensure a safe space to share failures.
 
-Interested in hosting one of our meetups in Bangalore? [Send me a direct message on Twitter](https://twitter.com/electron0zero).
+Are you interested in hosting one of our meetups in Bangalore? [Send me a direct message on Twitter](https://twitter.com/electron0zero).
 
 ## Next Meetup
 
@@ -24,7 +24,7 @@ Date: 24th August, 2024
 
 Join the [WhatsApp Community](https://chat.whatsapp.com/IQOeAnHctWu2FSbgZ0Brro) to stay in the loop.
 
-You can also subscribe to the Meetup Calendar at [meetup.ics](/meetup.ics). Use the following links:
+You can also subscribe to the meetup calendar at [meetup.ics](/meetup.ics). Use the following links:
 
 - [iOS/MacOS](webcal://failuremodes.dev/meetup.ics)
 - [Google Calendar](https://calendar.google.com/calendar/u/0?cid=webcal%3A%2F%2Ffailuremodes.dev%2Fmeetup.ics)
@@ -38,9 +38,9 @@ For other apps, you can import the generic [ICS file](https://failuremodes.dev/m
 
 Date: 17th March, 2024
 
-Hosted by [IG Group](https://www.ig.com) with the help from [Srivatsa RV](https://twitter.com/rv_srivatsa), and [Mehul Ved](https://twitter.com/mehulved), thank you :bow:
+Hosted by [IG Group](https://www.ig.com) with help from [Srivatsa RV](https://twitter.com/rv_srivatsa), and [Mehul Ved](https://twitter.com/mehulved), thank you :bow:
 
-### Failure Modes - 3nd Meetup
+### Failure Modes - 3rd Meetup
 
 [Announcement Tweet](https://twitter.com/electron0zero/status/1746149250201715007)
 
@@ -58,7 +58,7 @@ Hosted by [IG Group](https://www.ig.com) with the help from [Srivatsa RV](https:
 
 Date: 5th Feb, 2023
 
-Hosted by [DeepSource](https://deepsource.com/), thank you :bow:
+Hosted by [DeepSource](https://deepsource.com/). thank you :bow:
 
 ### Failure Modes - 1st Meetup
 
@@ -68,4 +68,4 @@ Hosted by [DeepSource](https://deepsource.com/), thank you :bow:
 
 Date: 25th Jan, 2020
 
-Hosted By [Clarisights](https://clarisights.com/), thank you :bow:
+Hosted By [Clarisights](https://clarisights.com/). thank you :bow:
diff --git a/pages/stories.md b/pages/stories.md
index 061843b..3712a1c 100644
--- a/pages/stories.md
+++ b/pages/stories.md
@@ -4,16 +4,16 @@ title: Stories
 permalink: /stories/
 ---
 
-Postmortems/Incident Reports/War Stories from real-world failures
+Postmortems, Incident Reports, and Stories from Real-World Failures.
 
 ### Algolia
 - [Salt Incident: May 3rd, 2020 Retrospective and Update](https://blog.algolia.com/salt-incident-may-3rd-2020-retrospective-and-update/)
 - [May 30 SSL incident](https://www.algolia.com/blog/engineering/may-30-ssl-incident/)
 
 ### Atlassian
-- [multi product multi week outage - April 4th, 2022](https://www.atlassian.com/engineering/april-2022-outage-update)
+- [Multi-Product, Multi-Week Outage - April 4th, 2022](https://www.atlassian.com/engineering/april-2022-outage-update)
   - [Day 7 of the great Atlassian outage: IT giant still struggling to restore access](https://www.theregister.com/2022/04/11/atlassian_still_down/)
-  - [Post Incident Review By Atlassian](https://www.atlassian.com/engineering/post-incident-review-april-2022-outage)
+  - [Post-Incident Review by Atlassian](https://www.atlassian.com/engineering/post-incident-review-april-2022-outage)
   - [The Scoop: Inside the Longest Atlassian Outage of All Time](https://newsletter.pragmaticengineer.com/p/scoop-atlassian?ref=blog.pragmaticengineer.com)
 
 ### Authzed
@@ -36,12 +36,12 @@ Postmortems/Incident Reports/War Stories from real-world failures
 - [Celer Network Bridge dapp incident analysis by Coinbase](https://www.coinbase.com/blog/celer-bridge-incident-analysis)
 
 ### DataDog
-- [Datadog Outage Affects Multiple Regions for a Day](https://www.datadoghq.com/blog/2023-03-08-multiregion-infrastructure-connectivity-issue/), also see [reddit discussion](https://www.reddit.com/r/devops/comments/11luq9r/psa_datadog_outage/)
+- [Datadog Outage Affects Multiple Regions for a Day](https://www.datadoghq.com/blog/2023-03-08-multiregion-infrastructure-connectivity-issue/), Also see [the Reddit discussion.](https://www.reddit.com/r/devops/comments/11luq9r/psa_datadog_outage/)
 
 ### DataSpring
 - [Datacenter and tornado](https://www.dataspring.cz/datacentrum-a-tornado/)
-  - Website is in Czech, use a translation service to read it. [archive.today link](http://archive.today/fWE85)
-  - This is a story of how a data center dealt with a tornado, a good reminder to verify your offsite backups, a disaster recovery plan, and conduct disaster recovery dry runs.
+  - The website is in Czech; use a translation service to read it. [archive.today link](http://archive.today/fWE85)
+  - This is a story of how a data center dealt with a tornado—a good reminder to verify your offsite backups, disaster recovery plan, and conduct disaster recovery dry runs.
 
 ### Deno
 - [May 30 incident update - May 30, 2022](https://deno.com/blog/2022-05-30-outage-post-mortem)
@@ -50,14 +50,14 @@ Postmortems/Incident Reports/War Stories from real-world failures
 - [How to Handle Kubernetes Health Checks - health checks outage on Black Friday](https://doordash.engineering/2022/08/09/how-to-handle-kubernetes-health-checks/)
 
 ### Facebook
-- October 4, 2021, Facebook Group (Facebook, Instagram, WhatsApp, Oculus) outage
+- October 4, 2021: Facebook Group (Facebook, Instagram, WhatsApp, Oculus) Outage.
   - [Understanding How Facebook Disappeared from the Internet](https://blog.cloudflare.com/october-2021-facebook-outage/)
   - [What happened on the Internet during the Facebook outage](https://blog.cloudflare.com/during-the-facebook-outage/)
   - [Update about the October 4th outage - Facebook Engineering](https://engineering.fb.com/2021/10/04/networking-traffic/outage/)
   - [More details about the October 4 outage - Facebook Engineering](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/)
   - [What Happened to Facebook, Instagram, WhatsApp? Krebs on Security](https://krebsonsecurity.com/2021/10/what-happened-to-facebook-instagram-whatsapp/)
-  - [Why was Facebook down for five hours? - YouTube](https://www.youtube.com/watch?v=-wMU8vmfaYo) - Ben Eater explains the facebook outage in detail with a demo
-  - This outage had side effects on the whole internet, the most common one was ISPs getting DoSed with DNS queries for facebook domains.
+  - [Why was Facebook down for five hours? - YouTube](https://www.youtube.com/watch?v=-wMU8vmfaYo) - Ben Eater explains the Facebook outage in detail with a demo.
+  - This outage had side effects on the whole internet; the most common one was ISPs getting DoSed with DNS queries for Facebook domains.
 
 ### Fastly
 - [Summary of June 8 outage - Fastly](https://www.fastly.com/blog/summary-of-june-8-outage) - June 8, 2021, global outage
@@ -82,7 +82,7 @@ Postmortems/Incident Reports/War Stories from real-world failures
   - [HN Thread](https://news.ycombinator.com/item?id=29243617)
   - [A bug introduced 6 months ago brought Google's Cloud Load Balancer to its knees](https://www.theregister.com/2021/11/23/google_outage/)
 - [London (europe-west2) cooling system failure - July 19, 2022](https://status.cloud.google.com/incidents/fmEL9i2fArADKawkZAa2)
-  - Oracle Cloud also saw a cooling system failure on the same day in London Data Center
+  - Oracle Cloud also saw a cooling-related failure on the same day in the London Data Center.
 
 ### Grafana Labs
 - [How a GCP Persistent Disk Incident Snowballed into a 23-Hour Outage -- and Taught Us Some Important Lessons](https://grafana.com/blog/2020/01/23/how-a-gcp-persistent-disk-incident-snowballed-into-a-23-hour-outage-and-taught-us-some-important-lessons/)
@@ -108,7 +108,7 @@ Postmortems/Incident Reports/War Stories from real-world failures
 - [In Place AWS Elasticache Redis Upgrade went wrong](https://twitter.com/vhmth/status/1599133796091461632)
 
 ### Netflix
-- [Containers taking out nodes - Twitter thread by Sargun Dhillon](https://twitter.com/sargun/status/1228495222658613250?s=19). [Thread Reader link](https://threadreaderapp.com/thread/1228495222658613250.html)
+- [Containers taking out nodes - Twitter thread by Sargun Dhillon](https://twitter.com/sargun/status/1228495222658613250?s=19). [Threadreader link](https://threadreaderapp.com/thread/1228495222658613250.html)
 
 ### Nomad Bridge
 - [Nomad Bridge incident analysis by Coinbase](https://www.coinbase.com/blog/nomad-bridge-incident-analysis)
@@ -133,17 +133,17 @@ Postmortems/Incident Reports/War Stories from real-world failures
 - [That Salesforce outage: Global DNS downfall started by one engineer trying a quick fix • The Register](https://www.theregister.com/2021/05/19/salesforce_root_cause/) - in the news
 
 ## Stripe
-- [Stripe down for for all those with Stripe Tax enabled](https://news.ycombinator.com/item?id=32558191#32558635), and status updates on [@stripestatus twitter page](https://twitter.com/stripestatus/status/1561809071061155841)
+- [Stripe was down for all those with Stripe Tax enabled](https://news.ycombinator.com/item?id=32558191#32558635), and status updates on [@stripestatus twitter page](https://twitter.com/stripestatus/status/1561809071061155841)
 
 ### Slack
 - [Users are unable to connect to Slack - Tuesday, May 12, 2020](https://status.slack.com/2020-05-12), and [Twitter Thread](https://twitter.com/copyconstruct/status/1260988880397856769) by copyconstruct
 - [Slack’s Outage on January 4th, 2021](https://slack.engineering/slacks-outage-on-january-4th-2021/)
 - [May 12, 2020 Outage - A Terrible, Horrible, No-Good, Very Bad Day at Slack - Slack Engineering](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/)
-- [Double Trouble with Datastores - Slack’s Incident o February 22, 2022](https://slack.engineering/slacks-incident-on-2-22-22/)
+- [Double Trouble with Datastores - Slack’s Incident on February 22, 2022](https://slack.engineering/slacks-incident-on-2-22-22/)
 
 ### Twitter
 - [An update on our security incident - social engineering attack, July 2020](https://blog.twitter.com/en_us/topics/company/2020/an-update-on-our-security-incident.html)
 
 ### Verizon
-- [How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today](https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/)
+- [How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline](https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/)