1 parent c5f7a89 · commit 516939a
Showing 8 changed files with 299 additions and 0 deletions.
@@ -0,0 +1,44 @@
name: ci

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]

    steps:
      - name: Check-out repository
        uses: actions/checkout@v3

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install test dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install pytest
          python -m pip install pytest-cov
          python -m pip install tiktoken
      - name: Install semchunk
        run: |
          python -m pip install .
      - name: Test with pytest
        run: |
          pytest --cov=semchunk --cov-report=xml
      - name: Use Codecov to track coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml
@@ -0,0 +1,8 @@
## Changelog 🔄
All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.1.0] - 2023-11-05
### Added
- Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.

[0.1.0]: https://github.com/umarbutler/semchunk/releases/tag/v0.1.0
@@ -0,0 +1,58 @@
# semchunk
`semchunk` is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.

## Installation 📦
`semchunk` may be installed with `pip`:
```bash
pip install semchunk
```

## Usage 👩‍💻
The code snippet below demonstrates how text can be chunked with `semchunk`:

```python
>>> import semchunk
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> token_counter = lambda text: len(text.split()) # If using `tiktoken`, you may replace this with `token_counter = lambda text: len(tiktoken.encoding_for_model(model).encode(text))`.
>>> semchunk.chunk(text, chunk_size=2, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

### Chunk
```python
def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
) -> list[str]
```

`chunk()` splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.

`text` is the text to be chunked.

`chunk_size` is the maximum number of tokens a chunk may contain.

`token_counter` is a callable that takes a string and returns the number of tokens in it.

This function returns a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed.
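
For example, here is a minimal sketch of pairing `chunk()` with a `tiktoken`-based token counter, assuming `tiktoken` is installed and that `'gpt-4'` is the model of interest (the chunk size of 4 is arbitrary):

```python
import semchunk
import tiktoken

# Build a token counter from the encoding used by the target model.
encoding = tiktoken.encoding_for_model('gpt-4')
token_counter = lambda text: len(encoding.encode(text))

text = 'The quick brown fox jumps over the lazy dog.'

# Each returned chunk will contain at most `chunk_size` gpt-4 tokens.
chunks = semchunk.chunk(text, chunk_size=4, token_counter=token_counter)
```

Resolving the encoding once, outside the lambda, avoids looking it up again on every call to the token counter.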

## Methodology 🔬
`semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
1. Splits text using the most semantically meaningful splitter possible;
1. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
1. Merges any chunks that are under the chunk size back together until the chunk size is reached; and
1. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks.

To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:
1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
1. The largest sequence of tabs;
1. The largest sequence of whitespace characters (as defined by regex's `\s` character class);
1. Sentence terminators (`.`, `?`, `!` and `*`);
1. Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
1. Sentence interrupters (`:`, `—` and `…`);
1. Word joiners (`/`, `\`, `–`, `&` and `-`); and
1. All other characters.
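
To make this precedence concrete, here is a small sketch using the same toy word-counting token counter as in the Usage section above (not a real tokeniser). Because the text contains a run of newlines, that run is chosen as the splitter before any spaces or punctuation are considered; the remaining over-size piece is then recursively split at spaces and merged back up to the chunk size:

```python
>>> import semchunk
>>> token_counter = lambda text: len(text.split()) # Toy token counter: one token per whitespace-delimited word.
>>> text = 'Heading\n\nFirst sentence. Second sentence.'
>>> semchunk.chunk(text, chunk_size=2, token_counter=token_counter)
['Heading', 'First sentence.', 'Second sentence.']
```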

## Licence 📄
This library is licensed under the [MIT License](https://github.com/umarbutler/semchunk/blob/main/LICENSE).
@@ -0,0 +1,50 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "semchunk"
version = "0.1.0"
authors = [
    {name="Umar Butler", email="[email protected]"},
]
description = "A fast and lightweight pure Python library for splitting text into semantically meaningful chunks."
readme = "README.md"
requires-python = ">=3.7"
license = {text="MIT"}
keywords = [
    "chunking",
    "splitting",
    "text",
    "split",
    "splits",
    "chunks",
    "chunk",
    "splitter",
    "chunker",
    "nlp",
]
classifiers = [
    "Development Status :: 5 - Production/Stable",
    "Intended Audience :: Developers",
    "Intended Audience :: Information Technology",
    "Intended Audience :: Science/Research",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: Implementation :: CPython",
    "Programming Language :: Python :: Implementation :: PyPy",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Topic :: Software Development :: Libraries :: Python Modules",
    "Topic :: Text Processing :: General",
    "Topic :: Utilities",
    "Typing :: Typed"
]
dependencies = [
]

[project.urls]
Homepage = "https://github.com/umarbutler/semchunk"
Documentation = "https://github.com/umarbutler/semchunk/blob/main/README.md"
Issues = "https://github.com/umarbutler/semchunk/issues"
Source = "https://github.com/umarbutler/semchunk"
@@ -0,0 +1,5 @@
"""A fast and lightweight pure Python library for splitting text into semantically meaningful chunks."""

from .semchunk import (
    chunk,
)
Empty file.
@@ -0,0 +1,113 @@
import re

NON_WHITESPACE_SEMANTIC_SPLITTERS = (
    '.', '?', '!', '*', # Sentence terminators.
    ';', ',', '(', ')', '[', ']', "“", "”", '‘', '’', "'", '"', '`', # Clause separators.
    ':', '—', '…', # Sentence interrupters.
    '/', '\\', '–', '&', '-', # Word joiners.
)
"""A tuple of semantically meaningful non-whitespace splitters that may be used to chunk texts, ordered from most desirable to least desirable."""

def _split_text(text: str) -> tuple[str, list[str]]:
    """Split text using the most semantically meaningful splitter possible."""

    # Try splitting at, in order of most desirable to least desirable:
    # - The largest sequence of newlines and/or carriage returns;
    # - The largest sequence of tabs;
    # - The largest sequence of whitespace characters; and
    # - A semantically meaningful non-whitespace splitter.
    if '\n' in text or '\r' in text:
        splitter = max(re.findall(r'[\r\n]+', text))

    elif '\t' in text:
        splitter = max(re.findall(r'\t+', text))

    elif re.search(r'\s', text):
        splitter = max(re.findall(r'\s+', text))

    else:
        # Identify the most desirable semantically meaningful non-whitespace splitter present in the text.
        for splitter in NON_WHITESPACE_SEMANTIC_SPLITTERS:
            if splitter in text:
                break

        # If no semantically meaningful splitter is present in the text, return an empty string as the splitter and the text as a list of characters.
        else: # NOTE This code block will only be executed if the for loop completes without breaking.
            return '', list(text)

    # Return the splitter and the split text.
    return splitter, text.split(splitter)

def chunk(text: str, chunk_size: int, token_counter: callable, _recursion_depth: int = 0) -> list[str]:
    """Split text into semantically meaningful chunks of a specified size as determined by the provided token counter.
    Args:
        text (str): The text to be chunked.
        chunk_size (int): The maximum number of tokens a chunk may contain.
        token_counter (callable): A callable that takes a string and returns the number of tokens in it.
    Returns:
        list[str]: A list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed."""

    # If the text is already within the chunk size, return it as the only chunk.
    if token_counter(text) <= chunk_size:
        return [text]

    # Split the text using the most semantically meaningful splitter possible.
    splitter, splits = _split_text(text)

    # Flag whether the splitter is whitespace.
    splitter_is_whitespace = not splitter.split()

    chunks = []
    skips = []
    """A list of indices of splits to skip because they have already been added to a chunk."""

    # Iterate through the splits.
    for i, split in enumerate(splits):
        # Skip the split if it has already been added to a chunk.
        if i in skips:
            continue

        # If the split is over the chunk size, recursively chunk it.
        if token_counter(split) > chunk_size:
            chunks.extend(chunk(split, chunk_size, token_counter=token_counter, _recursion_depth=_recursion_depth+1))

        # If the split is equal to or under the chunk size, merge it with all subsequent splits until the chunk size is reached.
        else:
            # Initialise a list of splits to be merged into a new chunk.
            new_chunk = [split]

            # Iterate through each subsequent split until the chunk size is reached.
            for j, next_split in enumerate(splits[i+1:], start=i+1):
                # Check whether the next split can be added to the chunk without exceeding the chunk size.
                if token_counter(splitter.join(new_chunk+[next_split])) <= chunk_size:
                    # Add the next split to the chunk.
                    new_chunk.append(next_split)

                    # Add the index of the next split to the list of indices to skip.
                    skips.append(j)

                # If the next split cannot be added to the chunk without exceeding the chunk size, break.
                else:
                    break

            # Join the splits with the splitter.
            new_chunk = splitter.join(new_chunk)

            # Add the chunk.
            chunks.append(new_chunk)

        # If the splitter is not whitespace and the split is not the last split, add the splitter to the end of the last chunk if doing so would not cause it to exceed the chunk size, otherwise add the splitter as a new chunk.
        if not splitter_is_whitespace and not (i == len(splits) - 1 or all(j in skips for j in range(i+1, len(splits)))):
            if token_counter(chunks[-1]+splitter) <= chunk_size:
                chunks[-1] += splitter

            else:
                chunks.append(splitter)

    # If this is not a recursive call, remove any empty chunks.
    if not _recursion_depth:
        chunks = [chunk for chunk in chunks if chunk]

    return chunks
@@ -0,0 +1,21 @@
"""Test semchunk."""
import semchunk
import tiktoken

LOREM = """\
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Id porta nibh venenatis cras sed felis eget velit. Et tortor consequat id porta nibh. Id diam vel quam elementum pulvinar. Consequat nisl vel pretium lectus quam id. Pharetra magna ac placerat vestibulum lectus mauris ultrices eros in. Id velit ut tortor pretium viverra. Tempus imperdiet nulla malesuada pellentesque elit eget gravida. In est ante in nibh mauris cursus mattis molestie a. Risus quis varius quam quisque id. Lorem ipsum dolor sit amet consectetur. Non nisi est sit amet facilisis magna. Leo in vitae turpis massa sed elementum tempus egestas sed. Luctus venenatis lectus magna fringilla urna porttitor rhoncus dolor. At erat pellentesque adipiscing commodo. Sagittis orci a scelerisque purus. Condimentum vitae sapien pellentesque habitant morbi tristique senectus et netus. A cras semper auctor neque vitae tempus quam pellentesque.
Facilisi cras fermentum odio eu feugiat. Sit amet consectetur adipiscing elit pellentesque habitant morbi tristique senectus. Nulla posuere sollicitudin aliquam ultrices sagittis orci a scelerisque purus. Enim ut sem viverra aliquet eget sit amet tellus cras. Non arcu risus quis varius quam quisque id. Purus in mollis nunc sed id. Lorem sed risus ultricies tristique nulla aliquet enim. Diam in arcu cursus euismod quis viverra. Et sollicitudin ac orci phasellus egestas tellus rutrum tellus. Ac ut consequat semper viverra nam libero justo laoreet sit. Mattis ullamcorper velit sed ullamcorper morbi tincidunt ornare. Netus et malesuada fames ac turpis egestas. Sed enim ut sem viverra aliquet eget sit amet. In iaculis nunc sed augue lacus viverra vitae congue.
Nunc consequat interdum varius sit amet mattis vulputate enim. Pulvinar pellentesque habitant morbi tristique. Viverra ipsum nunc aliquet bibendum enim. Egestas erat imperdiet sed euismod nisi porta lorem mollis. Mattis rhoncus urna neque viverra justo nec. Dictum non consectetur a erat nam at lectus. Tincidunt arcu non sodales neque. Sagittis eu volutpat odio facilisis mauris. Nec nam aliquam sem et tortor consequat id porta. Nulla pellentesque dignissim enim sit amet venenatis urna. Eget magna fermentum iaculis eu non diam phasellus. Leo in vitae turpis massa sed elementum. Libero volutpat sed cras ornare arcu dui vivamus. Molestie nunc non blandit massa enim nec dui nunc mattis. Odio facilisis mauris sit amet massa vitae tortor. Ullamcorper velit sed ullamcorper morbi tincidunt ornare. Nec dui nunc mattis enim ut.
Id volutpat lacus laoreet non curabitur gravida arcu. Pulvinar proin gravida hendrerit lectus a. Id neque aliquam vestibulum morbi blandit cursus. Quam nulla porttitor massa id neque aliquam vestibulum morbi. Urna et pharetra pharetra massa massa ultricies. Sed enim ut sem viverra aliquet. Quam quisque id diam vel quam elementum pulvinar etiam non. Urna molestie at elementum eu facilisis sed odio morbi quis. Commodo sed egestas egestas fringilla phasellus faucibus scelerisque eleifend donec. Pharetra magna ac placerat vestibulum lectus mauris ultrices eros.
Quam quisque id diam vel quam elementum pulvinar. Pellentesque habitant morbi tristique senectus et netus et. Tellus in metus vulputate eu scelerisque felis. Facilisis sed odio morbi quis. Dictum sit amet justo donec enim diam. A diam maecenas sed enim ut sem viverra aliquet eget. Phasellus vestibulum lorem sed risus ultricies tristique nulla aliquet. Non odio euismod lacinia at quis risus sed vulputate odio. Et netus et malesuada fames ac turpis egestas maecenas. Scelerisque viverra mauris in aliquam sem fringilla ut. Ac odio tempor orci dapibus. Lectus vestibulum mattis ullamcorper velit sed ullamcorper morbi."""

def _token_counter(text: str) -> int:
    return len(tiktoken.encoding_for_model('gpt-4').encode(text))

def test_chunk():
    for chunk in semchunk.chunk(LOREM, chunk_size=1, token_counter=_token_counter):
        assert _token_counter(chunk) <= 1