Release v3.0.0 · umarbutler/semchunk

Added an offsets argument to chunk() and Chunker.__call__() that specifies whether to return the start and end offsets of each chunk (#9). The argument defaults to False.
Added an overlap argument to chunk() and Chunker.__call__() that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap (#1). The argument defaults to None, in which case no overlapping occurs.
Added an undocumented, private _make_chunk_function() method to the Chunker class that constructs chunking functions with call-level arguments passed.
Added more unit tests for the new features, for multiple token counters, and for ensuring that no chunk consists entirely of whitespace characters.
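
The sketch below illustrates how the new overlap and offsets arguments described above can be read. It is not semchunk's implementation: resolve_overlap() and slice_by_offsets() are hypothetical helpers, the cap of chunk_size - 1 tokens is an assumption made so that chunks always advance, and offsets are assumed to be (start, end) pairs with an exclusive end.

```python
import math


def resolve_overlap(overlap, chunk_size):
    """Turn an overlap setting into a whole number of tokens (hypothetical helper)."""
    if overlap is None:
        # The documented default: no overlapping occurs.
        return 0
    if overlap < 1:
        # Values below 1 are read as a proportion of the chunk size.
        return math.floor(chunk_size * overlap)
    # Values of 1 or more are read as a number of tokens.
    # Capping at chunk_size - 1 is this sketch's assumption, not documented behaviour.
    return min(int(overlap), chunk_size - 1)


def slice_by_offsets(text, offsets):
    """Recover chunks from (start, end) offsets, assuming an exclusive end."""
    return [text[start:end] for start, end in offsets]


print(resolve_overlap(0.5, 10))
print(slice_by_offsets("hello world", [(0, 5), (6, 11)]))
```

With semchunk installed, the corresponding call would presumably pass these as keyword arguments, for example chunker(text, offsets=True, overlap=0.5) on a Chunker returned by chunkerify().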

Began removing chunks consisting entirely of whitespace characters from the output of chunk().
Updated semchunk's description from 'A fast and lightweight Python library for splitting text into semantically meaningful chunks.' to 'A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.'.
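
As a rough illustration of the whitespace change noted above, the filter below drops chunks that contain nothing but whitespace. drop_whitespace_chunks() is a hypothetical helper, not a semchunk function, and how the library performs this step internally is not stated here. Note that this sketch also drops empty strings.

```python
def drop_whitespace_chunks(chunks):
    """Keep only chunks that contain at least one non-whitespace character."""
    # str.strip() returns an empty (falsy) string for whitespace-only input.
    return [chunk for chunk in chunks if chunk.strip()]


print(drop_whitespace_chunks(["a", "  ", "\n\t", " b "]))
```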

Fixed a typo in the docstring for the __call__() method of the Chunker class returned by chunkerify() where most of the documentation for the arguments was listed under the section for the method's returns.

Removed the undocumented, private chunk() method from the Chunker class returned by chunkerify().
Removed the undocumented, private _reattach_whitespace_splitters argument of chunk() that was introduced to experiment with potentially adding support for overlap ratios.
