
chore(deps): bump semantic-text-splitter from 0.8.1 to 0.9.1 in the minor group #704

Closed
wants to merge 1 commit

Conversation


@dependabot dependabot bot commented on behalf of github Apr 4, 2024

Bumps the minor group with 1 update: semantic-text-splitter.

Updates semantic-text-splitter from 0.8.1 to 0.9.1

Release notes

Sourced from semantic-text-splitter's releases.

v0.9.1

What's Changed

Python TextSplitter and MarkdownSplitter now both provide a new chunk_indices method that returns a list of not only the chunks, but also their corresponding character offsets relative to the original text. This should allow for different string comparison and matching operations on the chunks.

def chunk_indices(
    self, text: str, chunk_capacity: Union[int, Tuple[int, int]]
) -> List[Tuple[int, str]]:
    ...

A similar method already existed on the Rust side. The key difference is that these offsets are character offsets, not byte offsets. For Rust strings, it is usually helpful to have the byte offset, but in Python, most string methods and operations deal with character indices.
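A minimal usage sketch (the sample text, the capacities, and the no-argument TextSplitter constructor are illustrative assumptions against the 0.9.x Python package, not taken from the release notes):

from semantic_text_splitter import TextSplitter

splitter = TextSplitter()  # default character-based chunk sizing
text = "Some text to split into small chunks for the example."

# Each tuple is (character_offset, chunk); the offset indexes into `text`,
# so a plain slice recovers the chunk exactly.
for offset, chunk in splitter.chunk_indices(text, 20):
    assert text[offset:offset + len(chunk)] == chunk

# Per the signature above, chunk_capacity may also be a (min, max) range:
chunks = splitter.chunk_indices(text, (10, 20))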

by @benbrandt in benbrandt/text-splitter#135

Full Changelog: benbrandt/text-splitter@v0.9.0...v0.9.1

v0.9.0

What's New

More robust handling of Hugging Face tokenizers as chunk sizers.

  • Tokenizers with padding enabled no longer count padding tokens when generating chunks (see the sketch after this list). This caused some unexpected behavior, especially if the chunk capacity didn't perfectly line up with the padding size(s). Now, the tokenizer's padding token is ignored when counting the number of tokens generated in a chunk.
  • In the process, it also became clear there were some false assumptions about how the byte offset ranges were calculated for each token. This has been fixed, and the byte offset ranges should now be more accurate when determining the boundaries of each token. This only affects some optimizations in chunk sizing, and should not affect the actual chunk output.
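To make the padding point concrete, here is a minimal sketch using the Hugging Face tokenizers library directly (the model name, padding length, and attention-mask counting are illustrative assumptions, not the splitter's internal code):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # illustrative model
tokenizer.enable_padding(length=128)  # pad every encoding to 128 tokens

encoding = tokenizer.encode("A short sentence.")
padded = len(encoding.ids)           # 128: padding tokens included
real = sum(encoding.attention_mask)  # only the real tokens, padding excluded

As of v0.9.0, chunk sizing reflects a count like real rather than padded. The same distinction explains the note under Breaking Changes below: re-tokenizing a finished chunk re-adds the padding, so the total token count can exceed the chunk capacity.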

Breaking Changes

Chunk output should only change for those of you using a Hugging Face tokenizer with padding enabled. Because padding tokens are no longer counted, the chunks will likely be larger than before, and closer to the desired behavior.

Note: This will mean the generated chunks may also be larger than the chunk capacity when tokenized, because padding tokens will be added when you tokenize the chunk. The chunk capacity for these tokenizers reflects the number of tokens used in the text, not necessarily the number of tokens that the tokenizer will generate in total.

Full Changelog: benbrandt/text-splitter@v0.8.1...v0.9.0

Commits
  • 17bc95a fix: try to make sure the CI isn't using the pip index/cache when installing ...
  • cd4ff47 Prep 0.9.1 release
  • 834d567 Python splitters optionally provide chunk char offsets
  • ae9730c chore: cargo update
  • 716590c Prep 0.9.0 release
  • 3afe119 fix: unneeded details tag in the readme
  • 0e63788 Bump pulldown-cmark from 0.10.0 to 0.10.2
  • 47d86fb bump bench output
  • abccd9c fix: Don't erroneously mess with huggingface offset ranges
  • 0d724d4 fix: Huggingface Tokenizer chunk sizer now accounts for padding
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore <dependency name> major version will close this group update PR and stop Dependabot creating any more for the specific dependency's major version (unless you unignore this specific dependency's major version or upgrade to it yourself)
  • @dependabot ignore <dependency name> minor version will close this group update PR and stop Dependabot creating any more for the specific dependency's minor version (unless you unignore this specific dependency's minor version or upgrade to it yourself)
  • @dependabot ignore <dependency name> will close this group update PR and stop Dependabot creating any more for the specific dependency (unless you unignore this specific dependency or upgrade to it yourself)
  • @dependabot unignore <dependency name> will remove all of the ignore conditions of the specified dependency
  • @dependabot unignore <dependency name> <ignore condition> will remove the specified ignore condition of the specified dependency

@dependabot dependabot bot added the dependencies (Pull requests that update a dependency file) and python (Pull requests that update Python code) labels Apr 4, 2024
Bumps the minor group with 1 update: [semantic-text-splitter](https://github.com/benbrandt/text-splitter).


Updates `semantic-text-splitter` from 0.8.1 to 0.9.1
- [Release notes](https://github.com/benbrandt/text-splitter/releases)
- [Changelog](https://github.com/benbrandt/text-splitter/blob/main/CHANGELOG.md)
- [Commits](benbrandt/text-splitter@v0.8.1...v0.9.1)

---
updated-dependencies:
- dependency-name: semantic-text-splitter
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: minor
...

Signed-off-by: dependabot[bot] <[email protected]>
@dependabot dependabot bot force-pushed the dependabot/pip/minor-3e7add668e branch from 4fba07e to 7505d1a April 5, 2024 10:01
dependabot bot commented on behalf of github Apr 5, 2024

Looks like semantic-text-splitter is updatable in another way, so this is no longer needed.

@dependabot dependabot bot closed this Apr 5, 2024
@dependabot dependabot bot deleted the dependabot/pip/minor-3e7add668e branch April 5, 2024 20:52