Chunking Refactor: Always use Context-Aware Chunker #334
Comments
One thing that may either tie into this effort or make it a lot easier is refactoring the Currently, the As of right now (haven't tested this yet though), the only thing stopping us from using docling on markdowns is a manual check I implemented in the factory method which instantiated a We should probably rework
Related to this, I think there's two things we should do:
I think we can either convert this issue to [1.] and then write a separate issue for [2.], or we can have this issue be an epic to track the issues I'll write for [1.] and [2.]. WDYT?
I agree with splitting this into two issues. Let’s use this one to focus on refactoring the Chunker code to consistently call docling first, followed by the text splitter. We can create a separate issue to handle code changes needed to support new file types with docling v2. This way, we can keep each part focused and manageable. Thanks for the suggestion! @khaledsulayman feel free to edit the issue and add another one as well :D
Currently, the DocumentChunker class is a factory class that chooses between ContextAwareChunker and TextSplitChunker. We should drop the ContextAwareChunker functionality into DocumentChunker and have it simply call the text splitter after doing the initial docling chunking.
DocumentChunker with ContextAwareChunker
DocumentChunker
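As a rough illustration of the proposed flow (structure-aware chunking first, then the text splitter on every resulting block), here is a minimal sketch. The two helper functions are hypothetical stand-ins, not docling's API or the project's actual splitter, and the `chunk` method and `max_words` parameter are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


def split_by_blank_lines(text: str) -> List[str]:
    """Hypothetical stand-in for the docling step: one chunk per text block."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def split_by_words(text: str, max_words: int) -> List[str]:
    """Hypothetical stand-in for the text splitter: cap chunks at max_words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


@dataclass
class DocumentChunker:
    """Single chunker: structure-aware pass first, then the text splitter."""

    max_words: int = 50
    structure_chunker: Callable[[str], List[str]] = split_by_blank_lines

    def chunk(self, text: str) -> List[str]:
        chunks: List[str] = []
        for block in self.structure_chunker(text):
            # Every block goes through the splitter, so no factory branch is needed.
            chunks.extend(split_by_words(block, self.max_words))
        return chunks
```

With this shape there is one code path for every input, so supporting a new file type would only mean teaching the first step to parse it.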
We should probably maintain a list (or the existing enum class) for supported filetypes and just be able to add to it as docling supports more types
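One possible shape for that list, sketched as an enum: the class name, its members, and the `is_supported` helper are all hypothetical and chosen only for illustration; the real set of types would follow whatever docling actually supports.

```python
from enum import Enum
from pathlib import Path


class SupportedFileType(Enum):
    """Hypothetical registry; add a member when docling gains support for a type."""

    MARKDOWN = ".md"
    PDF = ".pdf"


def is_supported(path: str) -> bool:
    """Return True if the file extension matches a registered type."""
    suffix = Path(path).suffix.lower()
    return any(suffix == file_type.value for file_type in SupportedFileType)
```

Adding a new type is then a one-line change to the enum, with no edits to the chunking logic itself.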