Added
Added an offsets argument to chunk() and Chunker.__call__() that specifies whether to return the start and end offsets of each chunk (#9). The argument defaults to False.
Added an overlap argument to chunk() and Chunker.__call__() that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap (#1). The argument defaults to None, in which case no overlapping occurs.
Added an undocumented, private _make_chunk_function() method to the Chunker class that constructs chunking functions with call-level arguments passed.
Added more unit tests for the new features, for multiple token counters, and for ensuring that no chunk consists entirely of whitespace characters.
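To make the semantics of the offsets and overlap arguments concrete, here is a minimal, self-contained sketch. It is not semchunk's implementation: `toy_chunk()` and `resolve_overlap()` are hypothetical names, tokens are assumed to be whitespace-delimited words, and the rounding and clamping of the overlap are assumptions made for illustration only.

```python
import math
import re


def resolve_overlap(overlap, chunk_size):
    """Hypothetical helper: turn an overlap setting into a token count.

    Assumption: values below 1 are a proportion of chunk_size (rounded down);
    values of 1 or more are a token count, capped below chunk_size.
    """
    if not overlap:
        return 0
    if overlap < 1:
        return math.floor(chunk_size * overlap)
    return min(int(overlap), chunk_size - 1)


def toy_chunk(text, chunk_size, offsets=False, overlap=None):
    """Toy word-window chunker, for illustration only (not semchunk's algorithm)."""
    # Character spans of every whitespace-delimited word in the text.
    spans = [match.span() for match in re.finditer(r"\S+", text)]
    # Each new chunk starts `step` words after the previous one.
    step = chunk_size - resolve_overlap(overlap, chunk_size)

    chunks, chunk_offsets = [], []
    for i in range(0, len(spans), step):
        window = spans[i : i + chunk_size]
        start, end = window[0][0], window[-1][1]
        chunks.append(text[start:end])
        chunk_offsets.append((start, end))
        if i + chunk_size >= len(spans):
            break  # The last word has been covered.

    # Offsets are only returned when requested, mirroring offsets=False by default.
    return (chunks, chunk_offsets) if offsets else chunks


text = "the quick brown fox jumps over the lazy dog"
chunks, chunk_offsets = toy_chunk(text, chunk_size=4, offsets=True, overlap=0.5)
for chunk, (start, end) in zip(chunks, chunk_offsets):
    # Each (start, end) pair slices the original text back to its chunk.
    assert text[start:end] == chunk
```

In this sketch, an overlap of 0.5 with a chunk size of 4 makes consecutive chunks share two words, and whitespace-only input yields no chunks at all. With semchunk itself, the same arguments are passed to chunk() or to a Chunker call.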
Changed
Began removing chunks consisting entirely of whitespace characters from the output of chunk().
Updated semchunk's description from 'A fast and lightweight Python library for splitting text into semantically meaningful chunks.' to 'A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.'.
Fixed
Fixed a typo in the docstring for the __call__() method of the Chunker class returned by chunkerify(), where most of the documentation for the arguments was listed under the section for the method's returns.
Removed
Removed the undocumented, private chunk() method from the Chunker class returned by chunkerify().
Removed the undocumented, private _reattach_whitespace_splitters argument of chunk(), which was introduced to experiment with potentially adding support for overlap ratios.