Extraction with PyMuPDF #1

Open
bitsgalore opened this issue Oct 31, 2024 · 0 comments
bitsgalore commented Oct 31, 2024

The following Python libraries might be worth a look for a follow-up/update:

  • PyMuPDF does text extraction and also supports EPUB!
  • Docling "parses documents and exports them to the desired format with ease and speed". It has no direct EPUB support, but EPUB could be handled indirectly via HTML. Also interesting because its internal document representation includes basic structure/hierarchy; see the Docling Document section.
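
For reference, a rough sketch of what extraction with PyMuPDF could look like. This is untested against real files and only meant as a starting point: the helper names `join_page_texts` and `extract_text` are made up for illustration, and recent PyMuPDF releases are assumed (older ones are imported as `fitz` instead of `pymupdf`). As far as I can tell, the same `open()` / `get_text()` calls should work for both PDF and EPUB input.

```python
from typing import Iterable


def join_page_texts(page_texts: Iterable[str], separator: str = "\n\n") -> str:
    """Join per-page text, dropping pages that are empty after stripping."""
    cleaned = (text.strip() for text in page_texts)
    return separator.join(text for text in cleaned if text)


def extract_text(path: str) -> str:
    """Extract plain text from a document (e.g. PDF or EPUB) with PyMuPDF."""
    # Imported lazily so join_page_texts stays usable without PyMuPDF installed.
    # Assumption: a recent PyMuPDF release; older ones use `import fitz`.
    import pymupdf

    with pymupdf.open(path) as doc:
        # Each item yielded by the document is a page; get_text() with no
        # arguments returns that page's plain text.
        return join_page_texts(page.get_text() for page in doc)
```

Usage would be something like `text = extract_text("book.epub")`, where the file name is just a placeholder. Joining pages with a blank line is an arbitrary choice here; for EPUB the "pages" come from PyMuPDF's own reflow layout, so page boundaries may not mean much.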