Releases: umarbutler/semchunk
v2.2.2
v2.2.1
Changed
- Started benchmarking `semantic-text-splitter` in parallel to ensure a fair comparison, courtesy of @benbrandt (#17).
v2.2.0
v2.1.0
Fixed
- Ceased memoizing `chunk()` (but not token counters) because cached outputs of memoized functions are shallow rather than deep copies of the original outputs. If one chunked a text, chunked the same text again, and then modified one of the chunks returned by the first call, the chunks returned by the second call would also be modified. This behaviour is unexpected and therefore undesirable. The memoization of token counters is unaffected, as they return immutable objects, namely integers.
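The aliasing problem described above can be reproduced with the standard library alone. The following is a minimal sketch, not `semchunk`'s actual code: `split_words()` and `count_tokens()` are hypothetical stand-ins for a memoized chunker and a memoized token counter, and `functools.lru_cache` is assumed as the memoization mechanism.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def split_words(text: str) -> list[str]:
    """Hypothetical stand-in for a memoized chunker: returns a mutable list."""
    return text.split()


@lru_cache(maxsize=None)
def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a memoized token counter: returns an int."""
    return len(text.split())


first = split_words("a b c")
second = split_words("a b c")  # Cache hit: the cached list itself is returned.

# Mutating the result of the first call also changes the second result,
# because both names refer to the same cached list object.
first[0] = "CHANGED"

print(first is second)
print(second)

# Integers are immutable, so sharing a cached int between callers is harmless.
print(count_tokens("a b c"))
```

Returning a fresh copy on each call would also avoid the problem, at the cost of some of the speed that memoization was meant to provide.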
v2.0.0
Added
- Added support for multiprocessing through the `processes` argument passable to chunkers constructed by `chunkerify()`.
Removed
- No longer guaranteed that `semchunk` is pure Python.
v1.0.1
Fixed
- Documented the `progress` argument in the docstring for `chunkerify()` and its type hint in the README.
v1.0.0
Added
- Added a `progress` argument to the chunker returned by `chunkerify()` that, when set to `True` and multiple texts are passed, displays a progress bar.
v0.3.2
v0.3.1
Fixed
- Fixed typo in error messages in `chunkerify()` where it was referred to as `make_chunker()`.
v0.3.0
Added
- Introduced the `chunkerify()` function, which constructs a reusable chunker from a tokenizer or token counter. The chunker can also chunk multiple texts in a single call. The resulting chunker speeds up chunking by 40.4% thanks, in large part, to a token counter that avoids having to count the number of tokens in a text when the number of characters in the text exceeds a certain threshold, courtesy of @R0bk (#3) (337a186).
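The character-threshold shortcut mentioned above might look roughly like the sketch below. This is an illustration under stated assumptions rather than the library's implementation: `make_fast_token_counter()`, the `chars_per_token` multiplier, and the `max_token_chars` padding are all hypothetical names and values chosen for the example. The idea is that when a text is far longer than any chunk could be, counting the tokens in a short prefix is enough to establish that the text exceeds the chunk size.

```python
from typing import Callable


def make_fast_token_counter(
    token_counter: Callable[[str], int],
    chunk_size: int,
    max_token_chars: int,
    chars_per_token: int = 6,  # Assumed multiplier, not taken from the release notes.
) -> Callable[[str], int]:
    """Wrap a token counter so that very long texts are only partially tokenized."""
    char_threshold = chunk_size * chars_per_token

    def fast_token_counter(text: str) -> int:
        if len(text) > char_threshold:
            # Count only a prefix, padded by the longest token's length so a
            # token cut at the boundary cannot hide an overflow.
            prefix = text[: char_threshold + max_token_chars]
            if token_counter(prefix) > chunk_size:
                # The caller only needs to know the text is too long.
                return chunk_size + 1
        return token_counter(text)

    return fast_token_counter


# Usage with a hypothetical whitespace token counter that records input lengths.
calls: list[int] = []


def word_counter(text: str) -> int:
    calls.append(len(text))
    return len(text.split())


fast = make_fast_token_counter(word_counter, chunk_size=4, max_token_chars=10)
long_text = "word " * 1000  # 5,000 characters

print(fast(long_text))  # Reports "over the limit" without tokenizing the full text.
print(calls)            # Only a 34-character prefix was ever counted.
print(fast("a b c"))    # Short texts are counted exactly.
```

The shortcut returns an exact count only for short texts; for long ones it returns `chunk_size + 1` as a sentinel, which suffices when the caller merely compares the count against the chunk size.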