Chunking Refactor: Always use Context-Aware Chunker #334

aakankshaduggal · 2024-11-05T15:03:34Z

Currently, the DocumentChunker class is a factory class that chooses between ContextAwareChunker and TextSplitChunker. We should fold the ContextAwareChunker functionality into DocumentChunker and have it simply call the text splitter after the initial docling chunking.

- Replace DocumentChunker with ContextAwareChunker
- Enhance filetype checking currently done by DocumentChunker
  - We should probably maintain a list (or the existing enum class) of supported filetypes and just be able to add to it as docling supports more types
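A rough sketch of what the single supported-filetypes registry could look like. The names `SupportedFileType` and `resolve_filetype` are hypothetical and not taken from the codebase, and the two members listed are only the types mentioned in this thread (PDF and markdown):

```python
from enum import Enum
from pathlib import Path


class SupportedFileType(Enum):
    """Hypothetical registry of document types the chunker accepts."""

    PDF = ".pdf"
    MARKDOWN = ".md"
    # Assumption: new members get added here as docling supports more types.


def resolve_filetype(path: str) -> SupportedFileType:
    """Map a file path to a supported type, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        # Enum lookup by value raises ValueError for unknown suffixes.
        return SupportedFileType(suffix)
    except ValueError:
        supported = ", ".join(t.value for t in SupportedFileType)
        raise ValueError(
            f"Unsupported filetype {suffix!r}; supported: {supported}"
        ) from None
```

With something like this, supporting a new type would be a one-line change to the enum rather than another branch in the factory method.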


khaledsulayman · 2024-11-07T22:49:39Z

One thing that may either tie into this effort or make it a lot easier is refactoring the TextSplitChunker functionality out of the class.

Currently, the DocumentChunker class is a factory class that chooses between ContextAwareChunker and TextSplitChunker. The issue with this is that ContextAwareChunker still needs to run some of the markdown text splitting chunking afterwards, so for the time being I exported that out to a global function.

As of right now (haven't tested this yet though), the only thing stopping us from using docling on markdowns is a manual check I implemented in the factory method which instantiated a TextSplitChunker for markdowns and a ContextAwareChunker for PDFs.

We should probably rework DocumentChunker to just be the context-aware chunker and drop the other chunker class. Since docling supports markdown as well, I don't see a reason we'd need to skip it and go straight to the text splitter. This will also allow us to easily define in one place the doc types we support.
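To make the proposed shape concrete, here is a minimal sketch of a single chunker that always runs a structure-aware step first and then the text splitter. Everything here is an assumption for illustration: `structure_chunker` is a placeholder callable standing in for the docling step (no real docling API is used), and `split_text` is a simplified character-based stand-in for the existing markdown text splitter:

```python
from typing import Callable, List


def split_text(text: str, max_chars: int) -> List[str]:
    """Stand-in text splitter: greedily pack paragraphs up to max_chars."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        # A single oversized paragraph is kept whole rather than cut mid-text.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


class DocumentChunker:
    """Hypothetical unified chunker: structure-aware step, then text splitter."""

    def __init__(
        self,
        structure_chunker: Callable[[str], List[str]],
        max_chars: int = 1000,
    ) -> None:
        # structure_chunker is a placeholder for the docling-based step; it
        # takes a file path and returns section-level text blocks.
        self.structure_chunker = structure_chunker
        self.max_chars = max_chars

    def chunk(self, path: str) -> List[str]:
        chunks: List[str] = []
        for section in self.structure_chunker(path):
            chunks.extend(split_text(section, self.max_chars))
        return chunks
```

Passing the structure-aware step in as a callable keeps the filetype decision out of the chunker itself, so there is no per-filetype branch choosing between two chunker classes.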

khaledsulayman · 2024-11-08T15:19:52Z

Related to this, I think there are two things we should do:

1. refactor Chunker code so that it always calls docling and then the text splitter
2. make the necessary code changes on our end to allow for the chunker to accept markdown and all the other new filetypes that come with V2

I think we can either convert this issue to [1.] and then write a separate issue for [2.], or we can have this issue be an epic to track the issues I'll write for [1.] and [2.]. WDYT?

aakankshaduggal · 2024-11-08T15:34:16Z

I agree with splitting this into two issues. Let’s use this one to focus on refactoring the Chunker code to consistently call docling first, followed by the text splitter. We can create a separate issue to handle code changes needed to support new file types with docling v2. This way, we can keep each part focused and manageable. Thanks for the suggestion! @khaledsulayman feel free to edit the issue and add another one as well :D

bbrowning mentioned this issue Nov 7, 2024

Integrate Context-Aware Chunking and PDF Support #284

Merged

khaledsulayman changed the title ~~Enhance Markdown Chunking with docling v2~~ Chunking Refactor: Always use Context-Aware Chunker Nov 8, 2024

khaledsulayman mentioned this issue Nov 8, 2024

Enable support for docling V2 supported filetypes #353

Open

khaledsulayman self-assigned this Nov 8, 2024

This was referenced Nov 8, 2024

[Epic] Testing for Document Chunking #346

Open

Unit Testing for Document Chunkers #354

Open

Update E2E jobs to run SDG on different filetypes #355

Open

[Epic] Fully Utilize Docling V2 Capabilities #374

Open

ktam3 added the jira label Nov 14, 2024

khaledsulayman linked a pull request Dec 5, 2024 that will close this issue

Refactor Document Chunker to always use docling #430

Open
