oppia · seanlip · Nov 15, 2024
diff --git a/Debugging.md b/Debugging.md
@@ -2,7 +2,7 @@ We do a lot of debugging at Oppia, whether because tests are failing on a PR or
 
 Start here if failing tests are stopping you from pushing or merging a PR:
 
-* [[If your presubmit checks fail|If-your-presubmit-check-fails]]
+* [[If your local presubmit checks fail|If-your-presubmit-check-fails]]
 * [[If CI checks fail on your PR|If-CI-checks-fail-on-your-PR]]
 
 Unless the bug you are trying to fix is trivial, we recommend creating a debugging doc to organize your work:
@@ -11,8 +11,7 @@ Unless the bug you are trying to fix is trivial, we recommend creating a debuggi
 
 Here are some debugging guides for particular kinds of issues you might encounter or types of tests:
 
-* [[Interpreting GitHub Actions Results|Interpreting-GitHubActions-Results]]
-* [[Finding the commit that introduced a bug|How-to-find-the-commit-which-introduced-a-bug]]
+* [[Finding which commit introduced a bug|How-to-find-the-commit-which-introduced-a-bug]]
 * [[Debugging end-to-end tests|Debug-end-to-end-tests]]
 * [[Debugging backend tests|Debug-backend-tests]]
 * [[Debugging frontend tests|Debug-frontend-tests]]

diff --git a/If-CI-checks-fail-on-your-PR.md b/If-CI-checks-fail-on-your-PR.md
@@ -1,96 +1,95 @@
 ## Table of contents
 
 * [Introduction](#introduction)
-* [Merge conflicts](#merge-conflicts)
-* [Network problems](#network-problems)
-* [Failing tests and lint checks](#failing-tests-and-lint-checks)
+* [Interpreting the CI checks display](#interpreting-the-ci-checks-display)
+* [Figuring out whether the failure is due to your PR or an existing issue](#figuring-out-whether-the-failure-is-due-to-your-pr-or-an-existing-issue)
+* [What to do if the failure is due to your PR](#what-to-do-if-the-failure-is-due-to-your-pr)
+* [What to do if the failure is due to an existing issue](#what-to-do-if-the-failure-is-due-to-an-existing-issue)
+* [What to do if CI checks only fail in the merge queue](#what-to-do-if-ci-checks-only-fail-in-the-merge-queue)
 
 ## Introduction
 
 If your PR build fails, do not despair! Scroll down to the bottom of the PR thread until you see the results of the continuous integration (CI) tests that ran:
 
-![Screenshot of PR CI results](images/prCiResults.png)
+  ![GHCI Status](images/ghciSample.png)
 
-The failure may be due to one of the following things:
+In the example above, the lint checks and Mypy checks have failed. Also, two checks were skipped due to the failure of one of the e2e tests. (This is because, in the event of an e2e test failure, any remaining e2e tests that are queued or currently running will be terminated and marked as "skipped.") The rest of the tests have passed.
 
-* If you see a warning that your PR has a merge conflict, then see the [merge conflicts section](#merge-conflicts) below.
-* If your issue is not a merge conflict, then click on the "Details" link next to any failing builds and inspect the logs.
+To diagnose and fix a failing check on your PR, follow the instructions below:
 
-  * If you see an error when installing third-party libraries, then there was probably a network issue. See the [network problems](#network-problems) section below.
-  * If you see from the logs that a test or lint check is failing, see the [failing tests and lint checks section](#failing-tests-and-lint-checks) below.
+1. If you see a warning that your PR has a **merge conflict**, you will need to resolve the conflict. To do this, follow the "merge from develop" instructions in  [step 5 of our PR guide](https://github.com/oppia/oppia/wiki/Rules-for-making-PRs#step-5-address-review-comments-until-all-reviewers-approve). Once you push the merge commit to your feature branch on GitHub, the merge conflict message will disappear.
 
-If many of the CI checks are failing at once, your PR has probably resulted in a breakage. To debug and fix this, try [running the local development server](https://github.com/oppia/oppia/wiki/Installing-Oppia-%28Mac-OS%3B-Python-3%29#running-oppia-on-a-development-server) and checking that the website behaves normally, with no errors in the developer console. Often, you will probably get errors in the backend server logs or the developer console which will help you figure out what is going wrong. The CI logs should also contain useful information that can help with debugging.
+2. If the issue isn't a merge conflict, then there are two possibilities: your code could be wrong, or the test/check on Oppia's develop branch could be incorrect. You will need to determine which of these cases it is. Note that the test/check on Oppia's develop branch could be incorrect for two reasons:
 
-## Merge conflicts
+  * Sometimes, the tests in develop may be "flaky", which means they pass sometimes and fail sometimes, even though the code has not changed. The dev workflow team is trying to reduce cases like this, though they may sometimes still happen.
 
-You can fix merge conflicts by following the "merge from develop" instructions in  [step 5 of our instructions for making a pull request](https://github.com/oppia/oppia/wiki/Rules-for-making-PRs#step-5-address-review-comments-until-all-reviewers-approve). Then when you push the merge commit to your feature branch on GitHub, the merge conflict message will disappear.
+  * If an incorrect PR was recently merged to develop, this could cause the tests to fail on develop. If you see this happening, please follow the [Revert and Regression Policy|Revert-and-Regression-Policy] so that we can revert the problematic changes as soon as possible, since the develop branch should always be passing checks.
 
-## Network problems
 
-If the error has a message like the one below, then it is due to a network error and is unrelated to your PR. Please reach out to a code owner or Core Maintainer to restart your test.
+## Figuring out whether the failure is due to your PR or an existing issue
 
-```text
-Installing Node.js
-Traceback (most recent call last):
-  File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
-    "__main__", fname, loader, pkg_name)
-  File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
-    exec code in run_globals
-  File ".../scripts/install_third_party_libs.py", line 311, in <module>
-    main()
-  File ".../scripts/install_third_party_libs.py", line 236, in main
-    setup.main(args=[])
-  File "scripts/setup.py", line 158, in main
-    download_and_install_node()
-  File "scripts/setup.py", line 120, in download_and_install_node
-    outfile_name)
-  File "scripts/setup.py", line 81, in download_and_install_package
-    python_utils.url_retrieve(url_to_retrieve, filename=filename)
-  File "python_utils.py", line 289, in url_retrieve
-    return urllib.urlretrieve(source_url, filename=filename)
-  File "/usr/local/lib/python2.7/urllib.py", line 98, in urlretrieve
-    return opener.retrieve(url, filename, reporthook, data)
-  File "/usr/local/lib/python2.7/urllib.py", line 289, in retrieve
-    "of %i bytes" % (read, size), result)
-urllib.ContentTooShortError: retrieval incomplete: got only 7372553 out of 18638507 bytes
+Here are some tips for how to determine whether the failing CI check is due to your code or a problem in Oppia's develop branch:
 
-Exited with code exit status 1
-```
+* Always start by **looking at the failure logs**. Click on the "Details" link next to each of the failing tests to inspect their logs.
 
-If you see a different error while installing third-party libraries, please file an issue and mention @automated-qa-reviewers to notify the Automated QA team.
+  ![GHCI Logs](images/githubActionsLogs.png)
 
-## Failing tests and lint checks
+  Then, select the job under 'Jobs' to see the logs for that particular job.
 
-If you see that a test or lint check is failing, there are two possibilities: your code could be wrong, or the test/check could be flaky. There is no easy way to tell whether a particular failure is a flake, but here are some guidelines:
+* Next, consider whether your changes could have plausibly caused the failure. For example, if you just updated the README, then there's no way that you could have broken an E2E test. Similarly, PRs that only modify frontend files are unlikely to cause errors in the "install third-party dependencies" step. However, note that changes in one part of the code can have unintended effects in apparently unrelated code. For example, if you add an E2E test that creates an exploration with the same name as an exploration created by another E2E test, you could break that other E2E test, even if it's testing completely unrelated code.
+  * If your changes could have plausibly caused the failure, see [What to do if the failure is due to your PR](#what-to-do-if-the-failure-is-due-to-your-pr), below.
+  * If not, search for the error message on the [issue tracker](https://github.com/oppia/oppia/issues) to see whether the error has been reported before. If there's an open issue for the same error, see [What to do if the failure is due to an existing issue](#what-to-do-if-the-failure-is-due-to-an-existing-issue), below.
+  * If the error isn't related to your PR and there isn't an existing issue for it, you might need to file a new issue for the failure. Before doing this, double-check your PR changes and the error logs to ensure that nothing in your PR could be causing the failure.
 
-* Flakes mostly occur in the end-to-end (E2E) tests. Sometimes we see flakes in the lighthouse or frontend tests, but the backend tests and linters almost never flake.
-* Consider whether your changes could have plausibly caused the failure. For example, if you just updated the README, then there's no way that you could have broken an E2E test. However, be careful to consider that changes in one part of the code can have unintended effects in apparently unrelated code. For example, if you add an E2E test that creates an exploration with the same name as an exploration created by another E2E test, you could break that other E2E test, even if it's testing completely unrelated code.
-* Check whether the error message you see matches a known flake. For E2E tests, look for a green message in the log like this:
 
-  ```text
-  E2E logging server says test flaky: {{true/false}}
-  ```
+## What to do if the failure is due to your PR
 
-  This message is based on a list of known flakes maintained by the Automated QA team. Note however that this list may be incomplete, so a test could be flaking even if this message says `false`.
+To debug and fix the failure, always start from the errors in the CI logs. They contain useful information that can help you with debugging. You can also do the following to get more information about what is causing the error:
 
-Note that all our CI checks except for the backend tests merge from the upstream `develop` branch before running, so you may need to merge from `develop` locally to reproduce the failure.
+* Find the command that the CI workflow uses to run the tests, and run that command on your local machine. Note that all our CI checks merge from the upstream `develop` branch before running, so you may need to merge from `develop` locally to reproduce the failure.
 
-If your code is wrong, then you'll need to fix it just as you would [respond to reviewer comments](https://github.com/oppia/oppia/wiki/Rules-for-making-PRs#step-5-address-review-comments-until-all-reviewers-approve). You may also want to review the following documentation to help you debug:
+* [Run the local development server](https://github.com/oppia/oppia/wiki/Installing-Oppia-%28Mac-OS%3B-Python-3%29#running-oppia-on-a-development-server) and check that the website behaves normally, with no errors in the developer console. Often, when doing this, you will get errors in the backend server logs or the developer console which can help you figure out what is going wrong. 
 
-* If a lint check is failing, see [[Lint Checks|Lint-Checks]].
-* If a test is failing, see [[Tests|Tests]].
-* For general debugging tips, see our [[debugging guides|Debugging]].
+If your code is wrong, you will need to fix the error just as you would [respond to reviewer comments](https://github.com/oppia/oppia/wiki/Rules-for-making-PRs#step-5-address-review-comments-until-all-reviewers-approve). You may also want to review the following documentation to help you debug:
 
-If the test/check is flaky, and you have confirmed that it's not due to your changes, please file a [CI Flake report](https://github.com/oppia/oppia/issues/new?assignees=&labels=triage+needed%2Cbug&projects=&template=3_ci_error_template.yml&title=%5BFlake%5D%3A+). After doing that, you can restart the test as follows:
+  * If a lint check is failing, see [[Lint Checks|Lint-Checks]].
+  * If a test is failing, see [[Tests|Tests]].
+  * For general debugging tips, see our [[debugging guides|Debugging]].
 
-1. Click the "Details" link next to the test you want to restart.
+If you are stuck, compile all your findings in a [[debugging doc|Debugging-Docs]] and open a [GitHub Discussion](https://github.com/oppia/oppia/discussions/categories/q-a-debugging-docs) with a link to it. This will make it easier for the Oppia maintainers and other community members to give you suggestions on what to investigate next.
 
-   ![Screenshot of PR CI results with details link](images/prCiResults.png)
 
-2. Click on the "Re-run all jobs" button in the upper-right.
+## What to do if the failure is due to an existing issue
 
-   ![Screenshot of CI result page with link to rerun jobs](images/rerunCI.png)
+1. **Document the error.**
 
-   If you don't see the button or cannot click it, you might not have the necessary permissions. In that case, ask one of your reviewers to restart the test for you. When doing this, please **provide a link to the [CI Flake report](https://github.com/oppia/oppia/issues/new?assignees=&labels=triage+needed%2Cbug&projects=&template=3_ci_error_template.yml&title=%5BFlake%5D%3A+) you filed.**
+   * If the error message you see matches a known issue on the [issue tracker](https://github.com/oppia/oppia/issues), leave a comment on that known issue that points to the failing check on your PR, to document that it is still happening. This will help the dev workflow team recognize that this issue is serious and increase its priority.
 
-Following these instructions should result in PRs that are green and ready to merge by the time a reviewer looks at them, thus shortening the review cycle! If you are still unable to resolve the issues yourself, please follow our instructions to [[get help|Get-help]] from other developers.
+   * If the error message you see doesn't match a known issue, but you have confirmed that it's not due to your changes, please file a [CI Flake report](https://github.com/oppia/oppia/issues/new?assignees=&labels=triage+needed%2Cbug&projects=&template=3_ci_error_template.yml&title=%5BFlake%5D%3A+) that documents the error. Additionally, if you can [trace which PR caused the error|How-to-find-the-commit-which-introduced-a-bug], please link to it as well so that the Oppia maintainers can [revert it](https://github.com/oppia/oppia/wiki/Revert-and-Regression-Policy) if needed.
+
+2. **Restart the failing test on your PR.**
+
+   * If you have been added to the Oppia repository as a collaborator, you can restart the test as follows:
+
+     * Click the "Details" link next to the test you want to restart.
+
+       ![Screenshot of PR CI results with details link](images/prCiResults.png)
+
+     * Click on the "Re-run all jobs" button in the upper-right.
+
+       ![Screenshot of CI result page with link to rerun jobs](images/rerunCI.png)
+
+   * If you don't see the button or cannot click it, you might not have the necessary permissions. In that case:
+
+     * Leave a comment on your PR that says something like "I have investigated the flake and it is the same one as reported in issue #XXX. Here is [the comment](link) I left in that issue pointing to this PR. Please could a maintainer help restart the test?" Then, tag one of your reviewers in the comment, so that they can help restart the test for you.
+
+     * If you don't hear back within 24 hours, please use [GitHub Discussions](https://github.com/oppia/oppia/discussions/categories/q-a-contacting-folks) to request a follow-up.
+
+
+## What to do if CI checks only fail in the merge queue
+
+If the CI checks pass on your PR, but you see a failure in the merge queue when attempting to merge, do the following:
+
+* Look at the error logs for the failure, and follow the instructions above in [What to do if the failure is due to an existing issue](#what-to-do-if-the-failure-is-due-to-an-existing-issue) to document the flake.
+
+* Once you have documented the flake, ask the maintainers to help force-merge your PR. Note that this will only be granted if the above step is followed (i.e. the PR conversation log should include evidence that the merge-queue error is indeed a flake, and the occurrence should be reported in the corresponding issue thread).