Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Storing assets for the faker project #3131

Closed
ST-DDT opened this issue Sep 22, 2024 · 14 comments
Closed

Proposal: Storing assets for the faker project #3131

ST-DDT opened this issue Sep 22, 2024 · 14 comments
Assignees
Labels
c: feature Request for new feature c: infra Changes to our infrastructure or project setup m: image Something is referring to the image module p: 1-normal Nothing urgent s: accepted Accepted feature / Confirmed bug
Milestone

Comments

@ST-DDT
Copy link
Member

ST-DDT commented Sep 22, 2024

Split from #3126 and #465 to keep the discussion clean for future referencing


Clear and concise description of the problem

Glossary:

  • assets: images or other pre-built media or files linked to by the fakerjs methods.
  • link: The url/string targeting the assets

Currently, we rely on external services to provide/manage the images. These external services sometimes get shutdown down or get bought by third parties. Linking to these services makes us dependent on things outside of our control and might cause TOS isssues for our users. The hosted/linked images may or may not be appropriate for professional contexts.

Suggested solution

  1. Reduce the reliance of external services
    • When adding a new external service, great care should be taken, to only add reliable sites
  2. Improve the warnings on methods linking to external services
    • Add a warning that the methods are outside of the control of faker and might break without a breaking change on faker's side
    • Add a hint, that users should read the TOS of the target service before using their images
    • Add a hint, that the content might not be suited for professional usecases (Mostly covered by docs(image): add content warnings to image function jsdocs #2159 already)
  3. Host our own assets for some important methods
    • We should be very strict which methods we should host assets for because there are plenty of categories to host assets for
    • The asset repository should be separate from the main repository to avoid increasing the clone times by a lot
    • We use jsdelivr as a CDN to avoid isssues with GH rate limits
      • jsdelivr doesn't support Git LFS (so we have to host the files plain)
    • Once added, assets can only be replaced, never deleted (or only deleted after support for the containing version is cancelled)
      • We have to decide on a support policy for the assets

Hosting our assets - folder disambiguation

  • Create a faker-js/assets repository
  • Reproduce the faker-js main definition structure using folders and put all images in there

Pros:

  • Easy to add images
  • Easy to use CI to add/replace images

Cons:

  • Requires some kind of persistent history
  • Bigger checkout size

Hosting our assets - branch disambiguation

  • Create a faker-js/assets repository
  • Reproduce the faker-js main definition structure using branches and put all images in there

Pros:

  • Each branch/commit reference can be replaced at any time effectively erasing the history
  • Shorter/more specific url naming

Cons:

  • Hard to keep an overview of existing images, due to the nature of git branches
  • Requires specific configuration to work well with CI

Hosting our assets - repo disambiguation

  • Create a faker-js/assets-purpose repository for each faker method serving assets

Pros:

  • Each repository would remain small
  • Shortest urls

Cons:

  • We probably end up with a lot of repositories
  • We likely cannot share any CI between the repositories

Additional suggestions

Providing metadata with our assets

If we add metadata next to the images, it might help users trying to analyze the assets by allowing them to verify their results using the metadata.
To ease usage of the metadata, these should be in an easily parseable format - preferably json as we are a JS library.

If we add these metadata, these should be part of the API contract, otherwise nobody knows that they are there and are stable to use.
Including these metadata in the API contract requires them to be properly typed (per method).


Before, we discuss the pros/cons and make decisions, please first state any additional related issues, requirements, suggestions and ideas related to the faker assets.

@ST-DDT ST-DDT added c: feature Request for new feature p: 1-normal Nothing urgent m: image Something is referring to the image module labels Sep 22, 2024
@ST-DDT ST-DDT added this to the vAnytime milestone Sep 22, 2024
@ST-DDT ST-DDT added s: awaiting more info Additional information are requested c: infra Changes to our infrastructure or project setup labels Sep 22, 2024
@xDivisionByZerox
Copy link
Member

When hosting assets on GitHub, and considering the acceptance criteria that files should never be deleted, we must take into account GitHub's LFS storage and bandwidth quota.

@matthewmayer
Copy link
Contributor

matthewmayer commented Sep 23, 2024

Generally all looks good to me. I think using just subfolders rather than branches would make it easier to manage.

A couple of points

  • I don't think we should directly hardcode the method/definition names in the asset path. For example if we did that refactored and renamed "faker.image.avatarAI" to "faker.person.profilePicture" in the future then we'd have to duplicate all the images to be "consistent". If the asset path contains a unique slug which doesn't directly correlate to the method/definition name there's less churn.

  • while URLs should never be deleted, I think image contents could be replaced in some circumstances. For example if we discover that image avatar 43.jpg has a glitch and actually has three arms we should be able to directly replace the contents of 43.jpg without re-versioning the whole collection.

@Shinigami92
Copy link
Member

I heard once that it is possible to hack around repo limit by forking https://github.com/chromium/chromium 👀

@ST-DDT
Copy link
Member Author

ST-DDT commented Sep 23, 2024

I heard once that it is possible to hack around repo limit by forking chromium/chromium 👀

If I understand the docs correctly, the root of the fork pays all the costs for it. So Google is probably just paying enough to not run into any limits.

@matthewmayer
Copy link
Contributor

I can't imagine we would have more than a few tens of megabytes. We will not be hosting hi res videos.

@ST-DDT ST-DDT added s: needs decision Needs team/maintainer decision and removed s: awaiting more info Additional information are requested labels Oct 12, 2024
@matthewmayer
Copy link
Contributor

matthewmayer commented Oct 21, 2024

One other thing to consider would be image dimensions. Obviously if these are static assets we can't allow completely dynamic width and height. But for example for avatars, if we generate 1024x1024 or 512x512 images with an AI, it might be useful for some use-cases to have resized smaller versions too to save bandwidth.

e.g. you might want to allow for URLs like:
https://cdn.jsdelivr.net/gh/faker-js/assets/avatars_ai/1024/female/14.jpg
https://cdn.jsdelivr.net/gh/faker-js/assets/avatars_ai/64/female/14.jpg

And then in future you could pass one of a fixed set of pixel sizes or "big","small" etc to certain image methods.

@ST-DDT
Copy link
Member Author

ST-DDT commented Oct 21, 2024

Which image sizes do you think we need?
64, 256, 1024?
Which image formats do you think we need?
png, jpg?

@matthewmayer
Copy link
Contributor

It's pretty trivial to just make all the powers of 2 sizes using imagmagick or whatever from the largest down even if we don't end up using them. So I'd do everything from 1024 to 16 if possible.

Image type I'd just use the most appropriate and popular type for the content. So JPG for photographic content like the people avatars.

@ST-DDT
Copy link
Member Author

ST-DDT commented Oct 21, 2024

It's pretty trivial to just make all the powers of 2 sizes using imagmagick or whatever from the largest down even if we don't end up using them. So I'd do everything from 1024 to 16 if possible.

I think the most restrictive limit is going to be disk and maybe transfer.
And i assume having more size variations increases the transfer amount due to less caching.

@matthewmayer
Copy link
Contributor

It's more as a developer if I want to display a grid of 1000 user avatars which will be shown at like 32x32 it's very wasteful of bandwidth to download 1024x1024 images.

Deciding on the exact sizes is not too important right now. I just think the sizes should probably be in the folder structure to allow for easier future expansion.

@Shinigami92
Copy link
Member

IMO 512x512 is enough, no other sizes
(Or at least lets start with that)

@ST-DDT
Copy link
Member Author

ST-DDT commented Oct 26, 2024

Team Decision

  • We will create a repository per method e.g. assets-person-portait <-> faker.image.personPortrait()
  • The repository will be without history, pull-requests are disabled, issues are disabled
  • The repositories should be considered source-available repositories not open source
  • The structure of the repository will be defined per method e.g. personPortait -> sex -> filetype -> size -> 0-99
  • We will only expose the cdn/jsdelivr url in our methods
  • @ST-DDT will create and configure the asset repo
  • @Shinigami92 will talk with @matthewmayer about creating a script to populate the repository and present the results

@ST-DDT ST-DDT closed this as completed Oct 26, 2024
@ST-DDT ST-DDT self-assigned this Oct 26, 2024
@ST-DDT ST-DDT added s: accepted Accepted feature / Confirmed bug and removed s: needs decision Needs team/maintainer decision labels Oct 26, 2024
@matthewmayer
Copy link
Contributor

Just a couple of clarifications

"Source available, not open source" refers to the code used to generate the assets? The assets themselves would typically be PD if they were generated by AI.

What do you mean by the repo not having any history? Like if we had 50 portraits and then decided to increase that to 100 portraits I'd expect that to be another commit, not a force-push which removes history.

@ST-DDT
Copy link
Member Author

ST-DDT commented Oct 27, 2024

We might release the code used to generate the images under the MIT. As for the images, we may add a rule to force the usage of a cdn and similar conditions.

The idea is indeed to not keep history and force push commits. For additions this is potentially optional. For requested changes/image replacements it's not. Because otherwise the image is still there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: feature Request for new feature c: infra Changes to our infrastructure or project setup m: image Something is referring to the image module p: 1-normal Nothing urgent s: accepted Accepted feature / Confirmed bug
Projects
None yet
Development

No branches or pull requests

4 participants