Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to have overlapping chunks #1

Open
vrdn-23 opened this issue Feb 7, 2024 · 8 comments
Open

Option to have overlapping chunks #1

vrdn-23 opened this issue Feb 7, 2024 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@vrdn-23
Copy link

vrdn-23 commented Feb 7, 2024

Hey,

First off, I wanna say this is a pretty cool library! Thank you for the amazing work!

I'm just curious if there is an option to have overlapping chunks as part of the splitting. For example if we have 10 sentences, it would be nice for me to generate chunks of 3 sentences each with an overlap of 1 sentence. Obviously I know we can do it by splicing the chunks returned by manipulating lists, but just thought it might be nice feature to have!

Let me know what you think!

@umarbutler
Copy link
Owner

Hi @vrdn-23,
Apologies for not getting to this issue earlier, I had not seen it until now. I'll have a thinking about how I might implement this. Would you mind elaborating further on what you mean by 'splicing the chunks returned by manipulating lists' or provide an example. It would help me develop an implementation.

@umarbutler umarbutler self-assigned this Mar 11, 2024
@umarbutler umarbutler added the enhancement New feature or request label Mar 11, 2024
@jcobol
Copy link
Contributor

jcobol commented May 27, 2024

If you use a smaller batch size, you can include surrounding batch strings depending on how much overlap you need. Here's some code to demonstrate.

import semchunk

def batch_list(alist, batch_size, overlap):
    """
    Splits the given list into batches of the specified size with an overlapping of the specified amount.

    Args:
        alist (list): The input list to be split.
        batch_size (int): The desired size of each batch.
        overlap (int): The desired overlap between each batch.

    Returns:
        list: A list containing the batches of the specified size with the specified overlapping.
    """
    batches = []
    for i in range(0, len(alist) - batch_size + 1 + overlap, batch_size):
        batch = alist[max(i - overlap, 0) : i + batch_size + overlap]
        batches.append(" ".join(batch))
    return batches

document_text = "apple ball cat dog elephant fish goat house igloo jellyfish kangaroo lion moon net octopus pig queen rabbit sheep tree unicorn violin wind xylophone yak zoo"
chunk_size = 2
overlap = 1
chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size)
print("\n".join(batch_list(chunker(document_text), chunk_size, overlap)))

Program output:

apple ball cat dog elephant fish
cat dog elephant fish goat house igloo jellyfish
goat house igloo jellyfish kangaroo lion moon net
kangaroo lion moon net octopus pig queen rabbit
octopus pig queen rabbit sheep tree unicorn violin
sheep tree unicorn violin wind xylophone yak zoo
wind xylophone yak zoo

@umarbutler
Copy link
Owner

@jcobol Your solution seems to cause chunks to exceed their original chunk size (which was 2). But I imagine that those wanting overlap also want to impose a fixed limit on the maximum number of tokens that a chunk may contain.

A crude solution I can see is to allow users to specify an overlap ratio, drop the chunk size (internally within semchunk.chunkerify or semchunk.chunk) to the original chunk size minus the overlap ratio, and then try to add text from preceeding and succeeding chunks either one character or whitespace-delimited word at a time, computing the token count each time to avoid exceeding the limit, until the limit is reached.

The problem with that solution, however, is that the vast majority of chunks will no longer have clean semantic separations, you could end up with something like this:

text = """"\
It is a period of civil wars in the galaxy. A brave alliance of underground freedom fighters has challenged the tyranny and oppression of the awesome GALACTIC EMPIRE.

Striking from a fortress hidden among the billion stars of the galaxy, rebel spaceships have won their first victory in a battle with the powerful Imperial Starfleet. The EMPIRE fears that another defeat could bring a thousand more solar systems into the rebellion, and Imperial control over the galaxy would be lost forever.

To crush the rebellion once and for all, the EMPIRE is constructing a sinister new battle station. Powerful enough to destroy an entire planet, its completion spells certain doom for the champions of freedom."""

overlapped_chunk = """\
the awesome GALACTIC EMPIRE.

Striking from a fortress hidden among the billion stars of the galaxy, rebel spaceships have won their first victory in a battle with the powerful Imperial Starfleet. The EMPIRE fears that another defeat could bring a thousand more solar systems into the rebellion, and Imperial control over the galaxy would be lost forever.

To crush the rebellion"""

What would be more semantic would be:

overlapped_chunk = """\
A brave alliance of underground freedom fighters has challenged the tyranny and oppression of the awesome GALACTIC EMPIRE.

Striking from a fortress hidden among the billion stars of the galaxy, rebel spaceships have won their first victory in a battle with the powerful Imperial Starfleet. The EMPIRE fears that another defeat could bring a thousand more solar systems into the rebellion, and Imperial control over the galaxy would be lost forever.

To crush the rebellion once and for all, the EMPIRE is constructing a sinister new battle station."""

Neverthless, if you and @vrdn-23 would still find use in this feature, I'd be happy to implement it. Is that the case?

@jcobol
Copy link
Contributor

jcobol commented Jun 3, 2024

@umarbutler ,

@jcobol Your solution seems to cause chunks to exceed their original chunk size (which was 2).

Yes, I meant for the small chunk size already represent the reduced chunk size to account for overlaps. Ideally, this should be done internally to the library, as you suggested.

What I had envisioned is, for example:

desired chunk size: 30

overlap: 33%

new internal chunk size: 10

Concatenate the n-1, n, and n+1 context, which should yield not more than the original chunk size (10 + 10 + 10 = 30).

I'm not sure that adding a word at a time of context to fully fill the context is needed; not exceeding the context is more important, as I understand.

I'd say it's important to keep the spirit of the library (semantic chunking) intact, rather than add something you may not be happy with. Maybe a code "example" not central to the library would be the appropriate thing to do - it's up to you. Thanks for the reply!

@umarbutler
Copy link
Owner

Yes, I meant for the small chunk size already represent the reduced chunk size to account for overlaps. Ideally, this should be done internally to the library, as you suggested.

What I had envisioned is, for example:

desired chunk size: 30

overlap: 33%

new internal chunk size: 10

Concatenate the n-1, n, and n+1 context, which should yield not more than the original chunk size (10 + 10 + 10 = 30).

Gotcha. This makes sense! Provided the original chunk size is large enough, it also shouldn’t result in awkward chunks that split in the middle of sentences.

I don’t think this would go against the spirit of semchunk and in fact I think I myself could find this useful, particularly in RAG and vector search settings.

I’ll have a go at implementing this this week and update this issue when it’s been merged into a new release.

@umarbutler
Copy link
Owner

@jcobol Would the below implementation work for you?

import nltk
import semchunk

nltk.download('gutenberg')
gutenberg = nltk.corpus.gutenberg

def overlap(chunks: list[str]) -> list[str]:
    n_chunks = len(chunks)
    
    match n_chunks:
        # If there are no chunks or if there is only a single chunks, there is no overlap to be had and we can return the chunks as they are.
        case 1 | 0:
            return chunks
        
        # If there are only two chunks, we can just return their concatenation.
        case 2:
            return [''.join(chunks)]
    
    # NOTE We exclude the first and last chunks as they will already be included in the resulting first and last chunks.
    return [''.join(chunks[i - 1 : i + 2]).strip() for i in range(1, n_chunks - 1)]

chunk_size = 512
chunker = semchunk.chunkerify('gpt2', chunk_size // 3)

text = gutenberg.raw('austen-emma.txt')

chunks = chunker(text)
chunks = overlap([chunk + '\n' for chunk in chunks])

I note that although I join the chunks by newlines, if implemented in semchunk, I would be able to join them by whatever they had been split at, although this does result in an unfortunate edge case where there are chunks comprised entirely of empty whitespace causing overlapped chunks to be created that are effectively redundant and/or entirely duplicated. In my testing, however, this problem appears largely when you are dealing with very small chunk sizes and lots of weird spacing. Sometimes it occurs with larger chunk sizes but it is not that frequent.

The other options are to join texts by newlines or whitespace but neither of those options seem desirable. I could add a filter for chunks that are entirely contained within succeeding or preceeding chunks but I'm not too sure of what the potential side effects of that could be. If content was legitmately duplicated, it may be filtered out. At the same time, there is already duplication being added.

I also note that I strip leading and trailing whitespace because it will be inserted by allowing splitting whitespace to be added back to chunks.

The code does not permit specifying an overlap size. It splits the chunk size into thirds and then overlaps directly adjacent chunks. Is that what you imagined?

At this point, I'm not happy enough with the implementation to incorporate it directly into semchunk. Are you aware of any possible solutions to these problems? I may have another crack at it later.

@jcobol
Copy link
Contributor

jcobol commented Jun 5, 2024

I tried out the code and it does work reasonably well.

I found a separate issue while testing, which may be related to the problem you found with weird whitespace. Take a passage of text with the hard line breaks and reflow the text so it is all on a single line. I expected the results to be the same, but they differ. Should whitespace and newlines be preserved?

I'll paste the sample text I used, with chunk_size=128.

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her. She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period.  Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection. Sixteen years had Miss Taylor been in Mr. Woodhouse's family, less as a governess than a friend, very fond of both daughters, but particularly of Emma.  Between _them_ it was more the intimacy of sisters. Even before Miss Taylor had ceased to hold the nominal office of governess, the mildness of her temper had hardly allowed her to impose any restraint; and the shadow of authority being now long passed away, they had been living together as friend and friend very mutually attached, and Emma doing just what she liked; highly esteeming Miss Taylor's judgment, but directed chiefly by her own. The real evils, indeed, of Emma's situation were the power of having rather too much her own way, and a disposition to think a little too well of herself; these were the disadvantages which threatened alloy to her many enjoyments.  The danger, however, was at present so unperceived, that they did not by any means rank as misfortunes with her. Sorrow came--a gentle sorrow--but not at all in the shape of any disagreeable consciousness.--Miss Taylor married.  It was Miss Taylor's loss which first brought grief.  It was on the wedding-day of this beloved friend that Emma first sat in mournful thought of any continuance.  The wedding over, and the bride-people gone, her father and herself were left to dine together, with no prospect of a third to cheer a long evening.  Her father composed himself to sleep after dinner, as usual, and she had then only to sit and think of what she had lost.

versus

Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authority being
now long passed away, they had been living together as friend and
friend very mutually attached, and Emma doing just what she liked;
highly esteeming Miss Taylor's judgment, but directed chiefly by
her own.

The real evils, indeed, of Emma's situation were the power of having
rather too much her own way, and a disposition to think a little
too well of herself; these were the disadvantages which threatened
alloy to her many enjoyments.  The danger, however, was at present
so unperceived, that they did not by any means rank as misfortunes
with her.

Sorrow came--a gentle sorrow--but not at all in the shape of any
disagreeable consciousness.--Miss Taylor married.  It was Miss
Taylor's loss which first brought grief.  It was on the wedding-day
of this beloved friend that Emma first sat in mournful thought
of any continuance.  The wedding over, and the bride-people gone,
her father and herself were left to dine together, with no prospect
of a third to cheer a long evening.  Her father composed himself
to sleep after dinner, as usual, and she had then only to sit
and think of what she had lost.

@kushal-agrawal-relativity

Love the library, and would really appreciate this feature!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants