Option to have overlapping chunks #1
Hi @vrdn-23,
If you use a smaller batch size, you can include surrounding batch strings depending on how much overlap you need. Here's some code to demonstrate.

```python
import semchunk

def batch_list(alist, batch_size, overlap):
    """
    Splits the given list into batches of the specified size with an overlap of the specified amount.

    Args:
        alist (list): The input list to be split.
        batch_size (int): The desired size of each batch.
        overlap (int): The desired overlap between each batch.

    Returns:
        list: A list containing the batches of the specified size with the specified overlap.
    """
    batches = []
    for i in range(0, len(alist) - batch_size + 1 + overlap, batch_size):
        batch = alist[max(i - overlap, 0) : i + batch_size + overlap]
        batches.append(" ".join(batch))
    return batches

document_text = "apple ball cat dog elephant fish goat house igloo jellyfish kangaroo lion moon net octopus pig queen rabbit sheep tree unicorn violin wind xylophone yak zoo"
chunk_size = 2
overlap = 1

chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size)
print("\n".join(batch_list(chunker(document_text), chunk_size, overlap)))
```

Program output:
@jcobol Your solution seems to cause chunks to exceed their original chunk size (which was 2). But I imagine that those wanting overlap also want to impose a fixed limit on the maximum number of tokens that a chunk may contain. A crude solution I can see is to allow users to specify an overlap ratio, drop the chunk size (internally within semchunk) in proportion to that ratio, and then merge adjacent chunks back together.

The problem with that solution, however, is that the vast majority of chunks will no longer have clean semantic separations; you could end up with something like this:

```python
text = """\
It is a period of civil wars in the galaxy. A brave alliance of underground freedom fighters has challenged the tyranny and oppression of the awesome GALACTIC EMPIRE.
Striking from a fortress hidden among the billion stars of the galaxy, rebel spaceships have won their first victory in a battle with the powerful Imperial Starfleet. The EMPIRE fears that another defeat could bring a thousand more solar systems into the rebellion, and Imperial control over the galaxy would be lost forever.
To crush the rebellion once and for all, the EMPIRE is constructing a sinister new battle station. Powerful enough to destroy an entire planet, its completion spells certain doom for the champions of freedom."""

overlapped_chunk = """\
the awesome GALACTIC EMPIRE.
Striking from a fortress hidden among the billion stars of the galaxy, rebel spaceships have won their first victory in a battle with the powerful Imperial Starfleet. The EMPIRE fears that another defeat could bring a thousand more solar systems into the rebellion, and Imperial control over the galaxy would be lost forever.
To crush the rebellion"""
```

What would be more semantic would be an overlap that begins and ends at clean sentence boundaries.
Nevertheless, if you and @vrdn-23 would still find use in this feature, I'd be happy to implement it. Is that the case?
Yes, I meant for the smaller chunk size to already represent the reduced chunk size that accounts for overlaps. Ideally, this should be done internally to the library, as you suggested. What I had envisioned is, for example:

- desired chunk size: 30
- overlap: 33%
- new internal chunk size: 10

Concatenate the n-1, n, and n+1 chunks, which should yield no more than the original chunk size (10 + 10 + 10 = 30). I'm not sure that adding a word at a time of context to fully fill the context is needed; not exceeding the context is more important, as I understand it. I'd say it's important to keep the spirit of the library (semantic chunking) intact, rather than add something you may not be happy with. Maybe a code example that isn't central to the library would be the appropriate thing to do - it's up to you. Thanks for the reply!
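A minimal sketch of the scheme described above (overlap_thirds and the word-based token counter are illustrative assumptions, not part of semchunk's API):

```python
# Illustrative sketch only: overlap_thirds is a hypothetical helper, not semchunk's API.
import semchunk

def overlap_thirds(chunks: list[str]) -> list[str]:
    """Merge each sub-chunk with its neighbours so adjacent outputs overlap."""
    if len(chunks) < 3:
        return [' '.join(chunks)] if chunks else []
    # Output i is sub-chunks (i-1, i, i+1); three thirds stay within the desired size.
    return [' '.join(chunks[i - 1 : i + 2]) for i in range(1, len(chunks) - 1)]

desired_chunk_size = 30                        # what the caller actually wants
internal_chunk_size = desired_chunk_size // 3  # reduced size that leaves room for overlap -> 10
# A word-based token counter is assumed here purely for illustration.
chunker = semchunk.chunkerify(lambda text: len(text.split()), internal_chunk_size)

sub_chunks = chunker("some long document text ...")
chunks = overlap_thirds(sub_chunks)
# Each merged chunk is at most 10 + 10 + 10 = 30 tokens, never above desired_chunk_size.
```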
Gotcha. This makes sense! Provided the original chunk size is large enough, it also shouldn’t result in awkward chunks that split in the middle of sentences. I don’t think this would go against the spirit of semchunk and in fact I think I myself could find this useful, particularly in RAG and vector search settings. I’ll have a go at implementing this this week and update this issue when it’s been merged into a new release.
@jcobol Would the below implementation work for you?

```python
import nltk
import semchunk

nltk.download('gutenberg')
gutenberg = nltk.corpus.gutenberg

def overlap(chunks: list[str]) -> list[str]:
    n_chunks = len(chunks)

    match n_chunks:
        # If there are no chunks or if there is only a single chunk, there is no overlap to be had and we can return the chunks as they are.
        case 1 | 0:
            return chunks

        # If there are only two chunks, we can just return their concatenation.
        case 2:
            return [''.join(chunks)]

    # NOTE We exclude the first and last chunks as they will already be included in the resulting first and last chunks.
    return [''.join(chunks[i - 1 : i + 2]).strip() for i in range(1, n_chunks - 1)]

chunk_size = 512
chunker = semchunk.chunkerify('gpt2', chunk_size // 3)

text = gutenberg.raw('austen-emma.txt')
chunks = chunker(text)
chunks = overlap([chunk + '\n' for chunk in chunks])
```

I note that although I join the chunks by newlines here, if implemented in semchunk one option would be to join chunks by nothing at all. The other options are to join texts by newlines or whitespace, but neither of those options seems desirable.

I could add a filter for chunks that are entirely contained within succeeding or preceding chunks, but I'm not too sure what the potential side effects of that could be. If content was legitimately duplicated, it may be filtered out. At the same time, there is already duplication being added.

I also note that I strip leading and trailing whitespace because it will be inserted by allowing splitting whitespace to be added back to chunks.

The code does not permit specifying an overlap size. It splits the chunk size into thirds and then overlaps directly adjacent chunks. Is that what you imagined? At this point, I'm not happy enough with the implementation to incorporate it directly into semchunk.
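Since the thirds split is fixed in the code above, a generalised variant that lets the caller choose the window size (and therefore the amount of overlap) might look like the following. This is only a sketch: windowed_overlap and its window parameter are hypothetical, not semchunk functionality.

```python
import semchunk

def windowed_overlap(chunks: list[str], window: int = 3, sep: str = '\n') -> list[str]:
    """Sketch only, not part of semchunk: join every `window` consecutive sub-chunks into one output chunk.

    Adjacent outputs share `window - 1` sub-chunks, so a larger window means
    proportionally more overlap. Short inputs collapse to a single chunk.
    """
    if not chunks:
        return []
    if len(chunks) <= window:
        return [sep.join(chunks).strip()]
    return [sep.join(chunks[i : i + window]).strip() for i in range(len(chunks) - window + 1)]

chunk_size = 512
window = 4  # each output chunk is built from 4 sub-chunks of chunk_size // 4 tokens
chunker = semchunk.chunkerify('gpt2', chunk_size // window)

sub_chunks = chunker('Some long document text ...')
chunks = windowed_overlap(sub_chunks, window=window)
```

With window=3 this matches the thirds behaviour of the overlap function above.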
I tried out the code and it does work reasonably well. I found a separate issue while testing, which may be related to the problem you found with weird whitespace. Take a passage of text with hard line breaks and reflow the text so it is all on a single line. I expected the results to be the same, but they differ. Should whitespace and newlines be preserved? I'll paste the sample text I used, with chunk_size=128.
versus
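A small sketch of how that comparison could be reproduced, using a stand-in passage rather than the original sample text (the passage, the smaller chunk size, and the equality check are illustrative only):

```python
import semchunk

chunk_size = 16  # small so the short stand-in passage is actually split (the report above used 128)
chunker = semchunk.chunkerify('gpt2', chunk_size)

# Stand-in passage; any paragraph with hard line breaks will do.
hard_wrapped = (
    "It is a period of civil wars in the galaxy.\n"
    "A brave alliance of underground freedom fighters has challenged\n"
    "the tyranny and oppression of the awesome GALACTIC EMPIRE.\n"
)
reflowed = " ".join(hard_wrapped.split())  # same words, reflowed onto a single line

chunks_wrapped = chunker(hard_wrapped)
chunks_reflowed = chunker(reflowed)

# Compare split points while ignoring whitespace differences inside the chunks.
normalise = lambda chunks: [" ".join(chunk.split()) for chunk in chunks]
print(normalise(chunks_wrapped) == normalise(chunks_reflowed))
print(chunks_wrapped)
print(chunks_reflowed)
```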
Love the library, and would really appreciate this feature!
Hey,
First off, I wanna say this is a pretty cool library! Thank you for the amazing work!
I'm just curious if there is an option to have overlapping chunks as part of the splitting. For example, if we have 10 sentences, it would be nice for me to generate chunks of 3 sentences each with an overlap of 1 sentence. Obviously I know we can do it by splicing the chunks returned and manipulating the lists, but I just thought it might be a nice feature to have!
Let me know what you think!
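A rough sketch of that list-splicing workaround for the sentence example above; overlap_sentences and the stand-in sentences are illustrative, not semchunk functionality:

```python
def overlap_sentences(sentences: list[str], chunk_len: int = 3, overlap: int = 1) -> list[str]:
    """Illustrative only: group sentences into chunks of `chunk_len`, sharing `overlap` sentences between neighbours."""
    if not sentences:
        return []
    step = chunk_len - overlap
    return [" ".join(sentences[i : i + chunk_len]) for i in range(0, max(len(sentences) - overlap, 1), step)]

sentences = [f"Sentence {n}." for n in range(1, 11)]  # ten stand-in sentences
for chunk in overlap_sentences(sentences, chunk_len=3, overlap=1):
    print(chunk)  # "Sentence 1. Sentence 2. Sentence 3.", "Sentence 3. Sentence 4. Sentence 5.", ...
```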