Skip to content

v3.0.0

Latest
Compare
Choose a tag to compare
@umarbutler umarbutler released this 31 Dec 04:40
· 2 commits to main since this release

Added

  • Added an offsets argument to chunk() and Chunker.__call__() that specifies whether to return the start and end offsets of each chunk (#9). The argument defaults to False.
  • Added an overlap argument to chunk() and Chunker.__call__() that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap (#1). The argument defaults to None, in which case no overlapping occurs.
  • Added an undocumented, private _make_chunk_function() method to the Chunker class that constructs chunking functions with call-level arguments passed.
  • Added more unit tests for new features as well as for multiple token counters and for ensuring there are no chunks comprised entirely of whitespace characters.

Changed

  • Began removing chunks comprised entirely of whitespace characters from the output of chunk().
  • Updated semchunk's description from 'A fast and lightweight Python library for splitting text into semantically meaningful chunks.' and 'A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.'.

Fixed

  • Fixed a typo in the docstring for the __call__() method of the Chunker class returned by chunkerify() where most of the documentation for the arguments were listed under the section for the method's returns.

Removed

  • Removed undocumented, private chunk() method from the Chunker class returned by chunkerify().
  • Removed undocumented, private _reattach_whitespace_splitters argument of chunk() that was introduced to experiment with potentially adding support for overlap ratios.