Tiny improvement #1585

Merged
merged 2 commits into from
Aug 1, 2024

Narsil
Collaborator

@Narsil Narsil commented Aug 1, 2024

The rationale is as follows: previously we hashed the sequence, failed to find it in the cache, then looked it up in the vocab, got a hit, and returned.
The ignore_merges path doesn't insert into the cache (the token is already in the vocab, so there is no need to duplicate the data), so we can run the vocab check before the cache lookup, which limits the number of cache reads.
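
To illustrate the reordering, here is a minimal Python sketch. It is a toy model only, not the actual tokenizers code: the class `ToyBPE`, its methods, and the `cache_reads` counter are hypothetical names made up for this example, and the merge loop is replaced by a character-level placeholder.

```python
class ToyBPE:
    """Toy model of the lookup order (hypothetical, not the library's code)."""

    def __init__(self, vocab, ignore_merges=True):
        self.vocab = vocab
        self.ignore_merges = ignore_merges
        self.cache = {}
        self.cache_reads = 0  # counts how often the cache is consulted

    def _cache_get(self, word):
        self.cache_reads += 1
        return self.cache.get(word)

    def _merge(self, word):
        # Placeholder for the real merge loop: one id per character.
        return [self.vocab.get(ch, -1) for ch in word]

    def tokenize_before(self, word):
        # Old order: cache lookup first, then the ignore_merges vocab check.
        hit = self._cache_get(word)
        if hit is not None:
            return hit
        if self.ignore_merges and word in self.vocab:
            return [self.vocab[word]]  # never inserted into the cache
        ids = self._merge(word)
        self.cache[word] = ids
        return ids

    def tokenize_after(self, word):
        # New order: vocab check first, so whole-word hits skip the cache.
        if self.ignore_merges and word in self.vocab:
            return [self.vocab[word]]
        hit = self._cache_get(word)
        if hit is not None:
            return hit
        ids = self._merge(word)
        self.cache[word] = ids
        return ids


vocab = {"hello": 0, "h": 1, "i": 2}
before, after = ToyBPE(vocab), ToyBPE(vocab)
for _ in range(3):
    before.tokenize_before("hello")
    after.tokenize_after("hello")
print(before.cache_reads, after.cache_reads)  # 3 0
```

Because a whole-word vocab hit is never written to the cache, the old order pays for a cache miss on every such call; the new order skips that read entirely.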

Before:

==============
num_threads: 8, data size: 24.04 MB, documents: 10000 Avg Length: 1659
tiktoken 	61.73 MB  / s
huggingface 	23.32 MB / s
==============
num_threads: 8, data size: 1.11 MB, documents: 10000 Avg Length: 116
tiktoken 	6.65 MB  / s
huggingface 	20.20 MB / s

After:

==============
num_threads: 8, data size: 24.04 MB, documents: 10000 Avg Length: 1659
tiktoken 	59.51 MB  / s
huggingface 	25.36 MB / s
==============
num_threads: 8, data size: 1.11 MB, documents: 10000 Avg Length: 116
tiktoken 	7.48 MB  / s
huggingface 	20.93 MB / s
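
For reference, the relative change in the huggingface numbers above can be computed with a few lines of Python. This is only arithmetic on the figures quoted in this comment; the helper name `pct_change` is made up for the example.

```python
def pct_change(before_mb_s, after_mb_s):
    """Relative throughput change in percent."""
    return (after_mb_s - before_mb_s) / before_mb_s * 100


# huggingface throughput (MB/s) from the Before/After runs above
long_docs = pct_change(23.32, 25.36)   # Avg Length: 1659
short_docs = pct_change(20.20, 20.93)  # Avg Length: 116
print(f"long: {long_docs:+.1f}%  short: {short_docs:+.1f}%")
```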

Comment on lines +94 to +95
if long:
    documents.append("".join(item["premise"].values()))
Collaborator

long is non-English only; good that you added the average length as well.

Collaborator Author

Yes, but the principal factor affecting speed seems to be document length (I tried using Arabic and had similar results).

@Narsil Narsil merged commit 9e0c791 into main Aug 1, 2024
13 checks passed
@Narsil Narsil deleted the small_fixup branch August 1, 2024 13:52
@Narsil Narsil mentioned this pull request Aug 1, 2024