Commit

Ceased memoizing chunk() (but not token counters).
umarbutler committed Jun 20, 2024
1 parent 8ed33e3 commit d97d006
Showing 3 changed files with 7 additions and 5 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,10 @@
## Changelog 🔄
All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [2.1.0] - 2024-06-20
### Fixed
- Ceased memoizing `chunk()` (but not token counters) due to the fact that cached outputs of memoized functions are shallow rather than deep copies of original outputs, meaning that if one were to chunk a text and then chunk that same text again and then modify one of the chunks outputted by the first call, the chunks outputted by the second call would also be modified. This behaviour is not expected and therefore undesirable. The memoization of token counters is not impacted as they output immutable objects, namely, integers.

## [2.0.0] - 2024-06-19
### Added
- Added support for multiprocessing through the `processes` argument passable to chunkers constructed by `chunkerify()`.
@@ -71,6 +75,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
### Added
- Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.

[2.1.0]: https://github.com/umarbutler/semchunk/compare/v2.0.0...v2.1.0
[2.0.0]: https://github.com/umarbutler/semchunk/compare/v1.0.1...v2.0.0
[1.0.1]: https://github.com/umarbutler/semchunk/compare/v1.0.0...v1.0.1
[1.0.0]: https://github.com/umarbutler/semchunk/compare/v0.3.2...v1.0.0
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "semchunk"
-version = "2.0.0"
+version = "2.1.0"
authors = [
{name="Umar Butler", email="[email protected]"},
]
5 changes: 1 addition & 4 deletions src/semchunk/semchunk.py
@@ -5,7 +5,7 @@

from bisect import bisect_left
from typing import Callable, Sequence, TYPE_CHECKING
-from functools import cache, wraps
+from functools import cache
from itertools import accumulate
from contextlib import suppress

@@ -151,9 +151,6 @@ def chunk(

return chunks

-# Memoize the `chunk` function, preserving its signature and docstring.
-chunk = wraps(chunk)(cache(chunk))

def chunkerify(
tokenizer_or_token_counter: str | tiktoken.Encoding | transformers.PreTrainedTokenizer \
| tokenizers.Tokenizer | Callable[[str], int],

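The aliasing problem described in the 2.1.0 changelog entry can be reproduced with a small sketch. It does not use `semchunk` itself: `split_words` and `count_tokens` are hypothetical stand-ins for a memoized chunker and a memoized token counter. Strictly speaking, `functools.cache` stores and hands back the very same object on a cache hit rather than any kind of copy, which is why a mutation made through one result shows up in the other.

```python
from functools import cache


@cache
def split_words(text: str) -> list[str]:
    # Hypothetical stand-in for a memoized chunker; this is not semchunk's `chunk()`.
    return text.split()


first = split_words("alpha beta gamma")
second = split_words("alpha beta gamma")  # Cache hit: the same list object is returned.

first.append("delta")  # Mutating the first result...
shared = second is first  # ...is visible through the second, because both names refer to one list.


@cache
def count_tokens(text: str) -> int:
    # Hypothetical token counter. Its result is an immutable int, so sharing it is harmless.
    return len(text.split())
```

Under these assumptions, dropping the cache from the list-returning function (as this commit does for `chunk()`) gives every call a fresh list, while the integer-returning counter can stay memoized safely.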