Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: keep alive when dispatch fails #524

Merged
merged 7 commits into from
Mar 27, 2024

Conversation

williazz
Copy link
Contributor

Addresses an immediate problem in #521, highlighted in #521 (comment)

To summarize, RUM currently disables itself when the attempt to PutRumEvents (PRT) fails and all retries have been exhausted. The intention was to avoid giving the web application needless errors when we know that RUM the network connection is disabled anyways.

The problem is that sometimes the network connection recovers, and RUM will experience data loss because it is needlessly disabled. This PR solves that problem by adding configuration for disableOnFail (default true). When site operator decides to set this to false, then RUM will not disable on PRT failure, and be able to resume sending events when the network connection recovers.

This PR should be shipped out with #500 and #501 to limit the number of unnecessary retries and errors.

In the long run, RUM should listen for online and offline DOM events (MDN docs). When offline, we should write events to localStorage instead of the PRT endpoint. When online, Dispatch should attempt to combine the current batch with events recorded offline.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link

@gmayankcse15 gmayankcse15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good to me

@qhanam
Copy link
Member

qhanam commented Mar 18, 2024

We want to be careful not to add unnecessary configuration options. In this case, my intuition is that we can build a solution that just works for all users. What would be a permanent solution that doesn't involve adding a configuration option?

@qhanam
Copy link
Member

qhanam commented Mar 18, 2024

The commit message "disableOnFail" doesn't really describe what the change does. Could you re-name the commit message to be more descriptive? It sounds like we are trying to solve the problem of recovering from network problems.

@williazz williazz changed the title feat: disableOnFail feat: optionally disable RUM when PutRumEvents fails Mar 18, 2024
@williazz williazz changed the title feat: optionally disable RUM when PutRumEvents fails feat: optionally disable RUM when dispatch fails Mar 18, 2024
@williazz
Copy link
Contributor Author

williazz commented Mar 18, 2024

We want to be careful not to add unnecessary configuration options. In this case, my intuition is that we can build a solution that just works for all users. What would be a permanent solution that doesn't involve adding a configuration option?

For zero configuration, we can remove disableOnFail and make sure RUM only disables when dispatch fails with status code 403 or 404, and maybe 400. In the long run, we can implement offline support for more granular control.

@gmayankcse15
Copy link

To add we should not disable RUM for 401 (Unauthorized error) , I had the usecase where due to invalid/expired token , fetch is not able to redirect to token provider website and refresh the token.

@williazz
Copy link
Contributor Author

To add we should not disable RUM for 401 (Unauthorized error) , I had the usecase where due to invalid/expired token , fetch is not able to redirect to token provider website and refresh the token.

Looks like our docs may have to be updated. Another definition could be: disable on everything that isn't 429 or 5xx.

@williazz williazz changed the title feat: optionally disable RUM when dispatch fails feat: keep alive when network requests fail Mar 27, 2024
@williazz
Copy link
Contributor Author

Revision: I have updated the PR so that RUM only disables when Dispatch fails with 403 or 404. Those are the only status codes in which we know RUM will never succeed. In all other cases, RUM should stay alive.

@williazz williazz changed the title feat: keep alive when network requests fail feat: keep alive when dispatch fails Mar 27, 2024
@qhanam
Copy link
Member

qhanam commented Mar 27, 2024

So now, if my understanding is correct, the behavior is to:

  1. When a dispatch fails (it gets three attempts), then the payload is dropped.
  2. If the status code is 403 (Forbidden) or 404 (Not Found), the web client disables itself.
  3. Otherwise, the web client will continue as normal, which is to attempt dispatch every 5 seconds.

I have two questions.

  1. When would we receive 401 (Unauthorized) vs 403 (Forbidden)? Why is one recoverable, but not the other?
  2. Long-term, is dropping payloads the behavior we want? We could add them back to the cache. Saving to localStorage is another option. Although both of those might just be adding unnecessary complexity.

Those questions aren't blocking. I think this is a good change and can be merged.

@qhanam qhanam self-requested a review March 27, 2024 17:06
@williazz
Copy link
Contributor Author

When a dispatch fails (it gets three attempts), then the payload is dropped.

Almost, RUM only retries on 429 and 5xx. Then the payload is dropped.

When would we receive 401 (Unauthorized) vs 403 (Forbidden)? Why is one recoverable, but not the other?

RUM is not sending 401 status codes, so I didn't include it. But I can include that in case someone's proxy uses that code.

Long-term, is dropping payloads the behavior we want?

Dropping payloads is not the behavior we want long term, but I haven't implemented it because there are still open questions:

A. How should dropped events be reinserted into the queue and what priority should they be given against the limit?
B. Should customers be allowed to choose an offline strategy? e.g. Reinsert into the queue OR Write to localStorage.
C. What additional metadata can we provide to site operators to root cause "offline" events which will likely have more errors? (This can probably be resolved by providing the Navigator.onLine attribute)

@williazz
Copy link
Contributor Author

To add we should not disable RUM for 401 (Unauthorized error) , I had the usecase where due to invalid/expired token , fetch is not able to redirect to token provider website and refresh the token.

When would we receive 401 (Unauthorized) vs 403 (Forbidden)? Why is one recoverable, but not the other?

Added 401!

@williazz williazz merged commit 87e4cb4 into aws-observability:main Mar 27, 2024
3 checks passed
williazz added a commit that referenced this pull request Mar 29, 2024
* feat: disableOnFail

* doc: add config doc

* Revert "doc: add config doc"

This reverts commit ec1f74f.

* Revert "feat: disableOnFail"

This reverts commit 67ed7b3.

* feat: only disable RUM when dispatch fails with 403 or 404

* chore: add unit tests

* feat: add 401
williazz added a commit to williazz/aws-rum-web that referenced this pull request Apr 2, 2024
* feat: disableOnFail

* doc: add config doc

* Revert "doc: add config doc"

This reverts commit ec1f74f.

* Revert "feat: disableOnFail"

This reverts commit 67ed7b3.

* feat: only disable RUM when dispatch fails with 403 or 404

* chore: add unit tests

* feat: add 401
williazz added a commit that referenced this pull request Apr 9, 2024
* fix: Allow title override in page attributes (#508)

* chore(deps-dev): bump ip from 1.1.5 to 1.1.9 (#513)

Bumps [ip](https://github.com/indutny/node-ip) from 1.1.5 to 1.1.9.
- [Commits](indutny/node-ip@v1.1.5...v1.1.9)

---
updated-dependencies:
- dependency-name: ip
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps-dev): bump follow-redirects from 1.15.4 to 1.15.6 (#525)

Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.4 to 1.15.6.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](follow-redirects/follow-redirects@v1.15.4...v1.15.6)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps-dev): bump webpack-dev-middleware from 6.0.1 to 6.1.2 (#528)

Bumps [webpack-dev-middleware](https://github.com/webpack/webpack-dev-middleware) from 6.0.1 to 6.1.2.
- [Release notes](https://github.com/webpack/webpack-dev-middleware/releases)
- [Changelog](https://github.com/webpack/webpack-dev-middleware/blob/v6.1.2/CHANGELOG.md)
- [Commits](webpack/webpack-dev-middleware@v6.0.1...v6.1.2)

---
updated-dependencies:
- dependency-name: webpack-dev-middleware
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: retry with exponential backoff (#501)

* feat: retry with exponential backoff

* fix

* test

* feat: limit retries to 5xx and 429 (#500)

* feat: limit PutRumEvents retry to 5xx

* feat: add 429

* cleanup

* cleanup

* docs

* doc

* fix: fmt

* chore(deps-dev): bump express from 4.18.1 to 4.19.2 (#531)

Bumps [express](https://github.com/expressjs/express) from 4.18.1 to 4.19.2.
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/master/History.md)
- [Commits](expressjs/express@4.18.1...4.19.2)

---
updated-dependencies:
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: keep alive when dispatch fails (#524)

* feat: disableOnFail

* doc: add config doc

* Revert "doc: add config doc"

This reverts commit ec1f74f.

* Revert "feat: disableOnFail"

This reverts commit 67ed7b3.

* feat: only disable RUM when dispatch fails with 403 or 404

* chore: add unit tests

* feat: add 401

* fix: record resources with invalid names (#532)

* fix: Ignore URL construction error from invalid performance resource event

* fix: Throw error when URL construction fails for invalid performance resource event

* fix: Ignore error thrown from URL construction

* test: add unit test

* fix: record resources with invalid names

* fix: Update unit test for invalid url

* fix: Update hostname typo in isPutRumEventsCall tests

---------

Co-authored-by: Billy <[email protected]>

* chore(release): 1.17.2

* Revert "Sync changes from main for 1.17.2 patch release"

* fix: record resources with invalid names (#532)

* fix: Ignore URL construction error from invalid performance resource event

* fix: Throw error when URL construction fails for invalid performance resource event

* fix: Ignore error thrown from URL construction

* test: add unit test

* fix: record resources with invalid names

* fix: Update unit test for invalid url

* fix: Update hostname typo in isPutRumEventsCall tests

---------

Co-authored-by: Billy <[email protected]>

* Update changelog

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Quinn Hanam <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Billy <[email protected]>
@mayankcse15
Copy link

Looks like I missed this, although I already pointed out why we are still disabling the RUM for 401 code, 401 is a valid usecase where the proxy server might return with (Unauthorized Token) and browser might retry later with a valid token. Please help to remove 401.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants