
chore(deps): bump semantic-text-splitter from 0.8.1 to 0.9.1 in the minor group #704

Closed
wants to merge 1 commit

Conversation


@dependabot dependabot bot commented on behalf of github Apr 4, 2024

Bumps the minor group with 1 update: semantic-text-splitter.

Updates semantic-text-splitter from 0.8.1 to 0.9.1

Release notes

Sourced from semantic-text-splitter's releases.

v0.9.1

What's Changed

Python TextSplitter and MarkdownSplitter now both provide a new chunk_indices method that returns a list of not only the chunks, but also their corresponding character offsets relative to the original text. This should allow for different string comparison and matching operations on the chunks.

def chunk_indices(
    self, text: str, chunk_capacity: Union[int, Tuple[int, int]]
) -> List[Tuple[int, str]]:
    ...

A similar method already existed on the Rust side. The key difference is that these offsets are character offsets, not byte offsets. For Rust strings, it is usually helpful to have the byte offset, but in Python, most string methods and operations deal with character indices.
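A minimal usage sketch (the sample text, the capacities, and the no-argument TextSplitter constructor are illustrative assumptions against the 0.9.x Python package, not taken from the release notes):

from semantic_text_splitter import TextSplitter

splitter = TextSplitter()  # default character-based chunk sizing
text = "Some text to split into small chunks for the example."

# Each tuple is (character_offset, chunk); the offset indexes into `text`,
# so a plain slice recovers the chunk exactly.
for offset, chunk in splitter.chunk_indices(text, 20):
    assert text[offset:offset + len(chunk)] == chunk

# Per the signature above, chunk_capacity may also be a (min, max) range:
chunks = splitter.chunk_indices(text, (10, 20))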

by @benbrandt in benbrandt/text-splitter#135

Full Changelog: benbrandt/text-splitter@v0.9.0...v0.9.1

v0.9.0

What's New

More robust handling of Hugging Face tokenizers as chunk sizers.

  • Tokenizers with padding enabled no longer count padding tokens when generating chunks (see the sketch after this list). This caused some unexpected behavior, especially if the chunk capacity didn't perfectly line up with the padding size(s). Now, the tokenizer's padding token is ignored when counting the number of tokens generated in a chunk.
  • In the process, it also became clear there were some false assumptions about how the byte offset ranges were calculated for each token. This has been fixed, and the byte offset ranges should now be more accurate when determining the boundaries of each token. This only affects some optimizations in chunk sizing, and should not affect the actual chunk output.
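To make the padding point concrete, here is a minimal sketch using the Hugging Face tokenizers library directly (the model name, padding length, and attention-mask counting are illustrative assumptions, not the splitter's internal code):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # illustrative model
tokenizer.enable_padding(length=128)  # pad every encoding to 128 tokens

encoding = tokenizer.encode("A short sentence.")
padded = len(encoding.ids)           # 128: padding tokens included
real = sum(encoding.attention_mask)  # only the real tokens, padding excluded

As of v0.9.0, chunk sizing reflects a count like real rather than padded. The same distinction explains the note under Breaking Changes below: re-tokenizing a finished chunk re-adds the padding, so the total token count can exceed the chunk capacity.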

Breaking Changes

Chunk output should only change for those of you using a Hugging Face tokenizer with padding enabled. Because padding tokens are no longer counted, the chunks will likely be larger than before, and closer to the desired behavior.

Note: This will mean the generated chunks may also be larger than the chunk capacity when tokenized, because padding tokens will be added when you tokenize the chunk. The chunk capacity for these tokenizers reflects the number of tokens used in the text, not necessarily the number of tokens that the tokenizer will generate in total.

Full Changelog: benbrandt/text-splitter@v0.8.1...v0.9.0

Commits
  • 17bc95a fix: try to make sure the CI isn't using the pip index/cache when installing ...
  • cd4ff47 Prep 0.9.1 release
  • 834d567 Python splitters optionally provide chunk char offsets
  • ae9730c chore: cargo update
  • 716590c Prep 0.9.0 release
  • 3afe119 fix: unneeded details tag in the readme
  • 0e63788 Bump pulldown-cmark from 0.10.0 to 0.10.2
  • 47d86fb bump bench output
  • abccd9c fix: Don't erroneously mess with huggingface offset ranges
  • 0d724d4 fix: Huggingface Tokenizer chunk sizer now accounts for padding
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore <dependency name> major version will close this group update PR and stop Dependabot creating any more for the specific dependency's major version (unless you unignore this specific dependency's major version or upgrade to it yourself)
  • @dependabot ignore <dependency name> minor version will close this group update PR and stop Dependabot creating any more for the specific dependency's minor version (unless you unignore this specific dependency's minor version or upgrade to it yourself)
  • @dependabot ignore <dependency name> will close this group update PR and stop Dependabot creating any more for the specific dependency (unless you unignore this specific dependency or upgrade to it yourself)
  • @dependabot unignore <dependency name> will remove all of the ignore conditions of the specified dependency
  • @dependabot unignore <dependency name> <ignore condition> will remove the specified ignore condition of the specified dependency

@dependabot dependabot bot added the dependencies (Pull requests that update a dependency file) and python (Pull requests that update Python code) labels Apr 4, 2024
Bumps the minor group with 1 update: [semantic-text-splitter](https://github.com/benbrandt/text-splitter).


Updates `semantic-text-splitter` from 0.8.1 to 0.9.1
- [Release notes](https://github.com/benbrandt/text-splitter/releases)
- [Changelog](https://github.com/benbrandt/text-splitter/blob/main/CHANGELOG.md)
- [Commits](benbrandt/text-splitter@v0.8.1...v0.9.1)

---
updated-dependencies:
- dependency-name: semantic-text-splitter
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: minor
...

Signed-off-by: dependabot[bot] <[email protected]>
@dependabot dependabot bot force-pushed the dependabot/pip/minor-3e7add668e branch from 4fba07e to 7505d1a April 5, 2024 10:01
dependabot bot commented on behalf of github Apr 5, 2024

Looks like semantic-text-splitter is updatable in another way, so this is no longer needed.

@dependabot dependabot bot closed this Apr 5, 2024
@dependabot dependabot bot deleted the dependabot/pip/minor-3e7add668e branch April 5, 2024 20:52