Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chunking Refactor: Always use Context-Aware Chunker #334

Open
2 tasks
Tracked by #374
aakankshaduggal opened this issue Nov 5, 2024 · 3 comments · May be fixed by #430
Open
2 tasks
Tracked by #374

Chunking Refactor: Always use Context-Aware Chunker #334

aakankshaduggal opened this issue Nov 5, 2024 · 3 comments · May be fixed by #430
Assignees
Labels

Comments

@aakankshaduggal
Copy link
Member

aakankshaduggal commented Nov 5, 2024

Currently, the DocumentChunker class is a factory class that chooses between ContextAwareChunker and TextSplitChunker. We should drop in the ContextAwareChunker functionality into DocumentChunker and have it simply call the text splitter after doing the initial docling chunking

  • Replace DocumentChunker with ContextAwareChunker
  • Enhance filetype checking currently done by DocumentChunker
    We should probably maintain a list (or the existing enum class) for supported filetypes and just be able to add to it as docling supports more types
@khaledsulayman
Copy link
Member

One thing that may either tie into this effort or make it a lot easier is refactoring the TextSplitChunker functionality out of the class.

Currently, the DocumentChunker class is a factory class that chooses between ContextAwareChunker and TextSplitChunker. The issue with this is that ContextAwareChunker still needs to run some of the markdown text splitting chunking afterwards, so for the time being I exported that out to a global function.

As of right now (haven't tested this yet though), the only thing stopping us from using docling on markdowns is a manual check I implemented in the factory method which instantiated a TextSplitChunker for markdowns and a ContextAwareChunker for PDFs.

We should probably rework DocumentChunker to just being a the context-aware chunker and drop the other chunker class. Since docling supports markdown as well, I don't see a reason we'd need to skip it anyway and go straight to the text splitter. This also will allow us to easily define in one place the doc types we support and.

@khaledsulayman
Copy link
Member

khaledsulayman commented Nov 8, 2024

Related to this, I think there's two things we should do:

  1. refactor Chunker code so that it always calls docling and then the text splitter
  2. make the necessary code changes on our end to allow for the chunker to accept markdown and all the other new filetypes that come with V2

I think we can either convert this issue to [1.] and then write a separate issue for [2.], or we can have this issue be an epic to track the issues I'll write for [1.] and [2.]. WDYT?

@aakankshaduggal
Copy link
Member Author

I agree with splitting this into two issues. Let’s use this one to focus on refactoring the Chunker code to consistently call docling first, followed by the text splitter. We can create a separate issue to handle code changes needed to support new file types with docling v2. This way, we can keep each part focused and manageable. Thanks for the suggestion! @khaledsulayman feel free to edit the issue and add another one as well :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants