New Converter based on Markitdown from Microsoft #1248

Open
10 tasks
paulmartrencharpro opened this issue Dec 17, 2024 · 1 comment
Labels
new integration (Discuss the creation of a new integration in Core), P3

Comments

@paulmartrencharpro
Contributor

Summary and motivation

Markitdown is a new open-source library (MIT license) from Microsoft (https://github.com/microsoft/markitdown).
It's a utility for converting various file types to Markdown (e.g., for indexing, text analysis, etc.).

It presently supports:

  • PDF (.pdf)
  • PowerPoint (.pptx)
  • Word (.docx)
  • Excel (.xlsx)
  • Images (EXIF metadata, and OCR)
  • Audio (EXIF metadata, and speech transcription)
  • HTML (special handling of Wikipedia, etc.)
  • Various other text-based formats (csv, json, xml, etc.)

It runs locally. It tries to describe images locally, or it can call an AI model to describe them.

In my initial testing, it works quite well on pdf, pptx, xlsx and docx.

We could use it to build a converter: (pdf, pptx, xlsx or docx) -> Markdown -> Document.

Detailed design

We can do something similar to PDFMinerToDocument:

  • Take a file (pdf, pptx, xlsx or docx)
  • Call Markitdown to convert it to Markdown
  • Then use MarkdownToDocument to convert the Markdown to a Document
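
The steps above could be sketched roughly as follows. This is only a minimal sketch under some assumptions, not a proposed final implementation: `Document` here is a hypothetical stand-in dataclass for Haystack's `Document`, `convert_files`, `markitdown_to_markdown` and `SUPPORTED_SUFFIXES` are illustrative names, and the Markdown text is wrapped directly into a document instead of being handed to MarkdownToDocument. The only Markitdown call used is `MarkItDown().convert(...)` with its `text_content` result, as shown in the library's README.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Union

# File types named in this issue; the set itself is an assumption of this sketch.
SUPPORTED_SUFFIXES = {".pdf", ".pptx", ".xlsx", ".docx"}


@dataclass
class Document:
    """Hypothetical stand-in for Haystack's Document (content plus metadata)."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def markitdown_to_markdown(path: Path) -> str:
    """Convert one file to Markdown text using Markitdown."""
    # Imported lazily so the rest of the sketch works without the dependency.
    from markitdown import MarkItDown  # pip install markitdown

    return MarkItDown().convert(str(path)).text_content


def convert_files(
    sources: Iterable[Union[str, Path]],
    to_markdown: Callable[[Path], str] = markitdown_to_markdown,
) -> List[Document]:
    """file -> Markdown -> Document, skipping unsupported file types."""
    documents: List[Document] = []
    for source in sources:
        path = Path(source)
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue  # not one of the formats targeted by this converter
        markdown = to_markdown(path)
        # A real component would probably pass `markdown` on to MarkdownToDocument here.
        documents.append(Document(content=markdown, meta={"file_path": str(path)}))
    return documents
```

Passing the conversion step in as `to_markdown` keeps the Markitdown dependency optional and makes the pipeline easy to test with a fake converter; in an actual integration this would likely be a Haystack component with a `run` method, similar to PDFMinerToDocument.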

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

@paulmartrencharpro added the "new integration" label (Discuss the creation of a new integration in Core) Dec 17, 2024
@julian-risch
Member

Thanks for sharing this idea, @paulmartrencharpro! Looks very interesting! As the Markitdown repo is only a month old and there is only a 0.0.1a2 pre-release, we'll need to see how much interest there is from the community in adding an integration for it. Maybe in the meantime someone in the community wants to build an integration.
An alternative might be Docling, which was recently brought up in this discussion: deepset-ai/haystack#8614
