Adding Powerpoint notes slides #473

Open
maciejwie opened this issue Nov 29, 2024 · 0 comments · May be fixed by #474
Labels
enhancement New feature or request

Comments

@maciejwie

Requested feature

Presenter notes are a valuable part of a Powerpoint presentation and are worth extracting. Docling uses the python-pptx library to parse Powerpoint pptx files, and python-pptx supports reading presenter notes, which are stored as notes slides. The code to read the notes is fairly simple and could look something like:

class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentBackend):
[...]
    def walk_linear(self, pptx_obj, doc) -> DoclingDocument:
[...]
            # Handle notes slide
            if slide.has_notes_slide:
                notes_slide = slide.notes_slide
                notes_text = notes_slide.notes_text_frame.text.strip()
                if notes_text:
                    bbox = BoundingBox(l=0, t=0, r=0, b=0)
                    prov = ProvenanceItem(
                        page_no=slide_ind + 1, charspan=[0, len(notes_text)], bbox=bbox
                    )
                    doc.add_text(
                        label=DocItemLabel.TEXT,
                        parent=parent_slide,
                        text=notes_text,
                        prov=prov,
                    )

but I'm not sure how the core team would like provenance to be handled: presenter notes have no bounding box, yet the model will not accept None. Should we use an empty bbox? How would you want something like this handled?

Alternatives

Since python-pptx, which Docling already uses, supports this feature, no alternatives are necessary and it can be integrated as-is.

@maciejwie maciejwie added the enhancement New feature or request label Nov 29, 2024
@maciejwie maciejwie linked a pull request Nov 29, 2024 that will close this issue