Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize third party libraries for generating string hash and unique ids #6493

Open
goplayoutside3 opened this issue Nov 25, 2024 · 14 comments
Labels
dependencies Pull requests that update a dependency file enhancement New feature or request

Comments

@goplayoutside3
Copy link
Contributor

goplayoutside3 commented Nov 25, 2024

Package

  • app-project
  • lib-classifier
  • lib-react-components
  • lib-user

lib-classifier

  • cuid is deprecated (npm link)

    • This package creates a temporary id per classification object stored in the Mobx store (temp id is not posted to Panoptes).
    • cuid is also used throughout the annotation code to generate an id for default annotation objects and new marks in the drawing tools.
    • Example in ClassificationStore.js
      const tempID = cuid()
      const newClassification = Classification.create({
      id: tempID, // Generate an id just for serialization in MST. Should be dropped before POST...
  • hash.js is a popular npm package last published 6 yrs ago, but it might replicate behavior available in modern browsers

    • The current session id is included with classification metadata. A new session id is returned if more than five minutes have passed since the last call to sessionUtils.getSessionID(). Sessions are stored in browser session storage, so they're unique to individual tabs.
    • ❓ Does the session id need to be cryptographically secure? Or just unique?
    • This library is only used in session.js:
      generateSessionID () {
      const sha2 = hash.sha256()
      const id = sha2.update(`${Math.random() * 10000}${Date.now()}${Math.random() * 1000}`).digest('hex')

lib-react-components

app-project

  • @sindresorhus/string-hash is not a popular npm package. Last published 3 yrs ago.
    • It’s used in app-project’s Mobx store to create a non-cryptographic hashed string for a unique identifier attached to the browser cookie. The cookie determines whether a user has dismissed a project’s announcement banner.
    • This feature responds to the string content of announcement (which project teams can change any time) hence a random unique id is not attached to the cookie preference in favor of a hashed string of the announcement's content.
    • Example of stringHash in UI.js:
      dismissProjectAnnouncementBanner() {
      const { announcement } = getRoot(self).project.configuration
      const announcementHash = stringHash(announcement)
      self.dismissedProjectAnnouncementBanner = announcementHash
      },

lib-user

Describe the solution you'd like

  • Standardize libraries for the three main use cases:

    1. Cryptographically secure hash
    2. Non-crypographically secure hash from a digested string
    3. Non-crypographically secure randomly generated id
  • SHA256 is now included in current browsers via crypto.subtle.digest

  • uuid is a maintained library. We could use it in place of cuid throughout lib-classifier and lib-react-components.

@goplayoutside3 goplayoutside3 added enhancement New feature or request dependencies Pull requests that update a dependency file labels Nov 25, 2024
@eatyourgreens
Copy link
Contributor

string-hash is a one-line wrapper for fnv1a, which is why it doesn’t get updated often.

@goplayoutside3
Copy link
Contributor Author

@eatyourgreens I'm confused about the classifier's use of hash for a session id, and referenced your PR #4741 for documentation. The Readme says "A new session ID is returned if more than five minutes have passed since the last call to sessionUtils.getSessionID()".

Unless I'm missing something obvious, I don't see that behavior. For instance, let's say I make a classification on Daily Minor Planet, and the session storage object is

id: "a9bd8ffcee67c134e61c38b659b9dd20b494142c24cbc555269ec2afa0b48060"
ttl: "2024-11-27T17:28:02.988Z"

More than 5 minutes later, I make another classification in the same browser tab, and the session storage object's ttl is updated, but id is not. Now the object is:

id: "a9bd8ffcee67c134e61c38b659b9dd20b494142c24cbc555269ec2afa0b48060"
ttl: "2024-11-27T17:34:31.300Z"

I think my question is two fold: Are javascript dates being used as intended in session.js, and why does session need to be a cryptographically secure hash?

@eatyourgreens
Copy link
Contributor

That's old code that Ed originally wrote, I think. It should reset after 5 minutes of inactivity, but I forget what calls it to refresh the TTL.

The original code is here:

https://github.com/zooniverse/Panoptes-Front-End/blob/master/app/lib/session.coffee

I think the monorepo code might be a direct JS copy of the original.

@eatyourgreens
Copy link
Contributor

There's a test for the case where the TTL has expired but it's a bad test. It tests that a stub has been called, rather than testing that the session ID has changed.

https://github.com/zooniverse/front-end-monorepo/blob/master/packages/lib-classifier/src/store/utils/session.spec.js

Unfortunately, the classifier is full of tests that use stubs like this. The problem with them is that testing if a method was called is not the same as testing if a method behaved as expected.

@goplayoutside3
Copy link
Contributor Author

Agreed it's a bad test. The condition ttl < Date.now() is never actually met because ttl is stored as a Date() string yet Date.now() is a number.

It's probably worth removing the "five minutes" logic from both PFE and FEM.

@eatyourgreens
Copy link
Contributor

To answer your original question, though, I would expect that getSessionId returns different values if you call it more than 5 minutes apart.

I don't think it matters if the value is secure. As far as I'm aware, project teams use it to group classifications by anonymous volunteers, so it only needs to be unique to a session.

@eatyourgreens
Copy link
Contributor

You're right! I don't see how that code has ever worked. Which means any analysis based on PFE session IDs is basically flawed.

Good catch!

@eatyourgreens
Copy link
Contributor

Any papers that were published based on that code might be wrong. 😳

@goplayoutside3
Copy link
Contributor Author

Maybe 🤔 When someone looks at a classification's session id, without a signed-in user, do they assume it's a metadata value that refreshes every 5 minutes though? When I think "session id" as a web concept, I think of it as an individual browser tab id that refreshes when a new tab or window is opened.

@eatyourgreens
Copy link
Contributor

It doesn't refresh every 5 minutes. It's supposed to refresh if there's no activity for 5 minutes. If you're classifying rapidly, each classification should have the same session ID, because the TTL updates every time you press Done.

I'm sure it's been used to measure session length in papers. Off the top of my head, maybe the Microsoft experiment that I helped with, which looked at the effect of intervention messages on session length.

@eatyourgreens
Copy link
Contributor

The per-tab sessions that you're referring to are recorded by Sugar, and counted on https://activity.zooniverse.org.

@goplayoutside3
Copy link
Contributor Author

That's a good clarification about 5 minutes of inactivity and interesting connection to https://activity.zooniverse.org/.

I opened #6499 for FEM and will do the same for PFE.

@eatyourgreens
Copy link
Contributor

I think that session code needs to be fixed, rather than deleted, since classification sessions are used in published papers. A classification session is distinct from a browser session, in that a volunteer can keep a tab open for hours or days and have several classification sessions in one browser session.

There might even be code in https://github.com/zooniverse/data-digging that uses classification session IDs to estimate session durations. That'll be wrong now.

@goplayoutside3
Copy link
Contributor Author

I think that session code needs to be fixed, rather than deleted

Similar to my comment on your PR, I'm in favor of not changing the classifier's behavior. Session ids have never refreshed after 5min of activity in the PFE classifier or the FEM classifier.

Regarding data digging and publications, I'll address the potential incorrect assumptions about session ids with the zoo team and document our decision on the appropriate Issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file enhancement New feature or request
Projects
Status: To do
Development

No branches or pull requests

2 participants