MarkItDown is a new open-source library (MIT license) from Microsoft (https://github.com/microsoft/markitdown).
It's a utility for converting various file types to Markdown (e.g., for indexing or text analysis).
It presently supports:
PDF (.pdf)
PowerPoint (.pptx)
Word (.docx)
Excel (.xlsx)
Images (EXIF metadata, and OCR)
Audio (EXIF metadata, and speech transcription)
HTML (special handling of Wikipedia, etc.)
Various other text-based formats (csv, json, xml, etc.)
It runs locally. It tries to describe images locally, or it can call an AI model to describe them.
In my initial testing, it works quite well on pdf, pptx, xlsx and docx.
We could use it to make a converter: (pdf, pptx, xlsx or docx) -> Markdown -> Document.
Detailed design
We can do something similar to PDFMinerToDocument.
Take a file (pdf, pptx, xlsx or docx)
Call MarkItDown to convert it to Markdown
Then use MarkdownToDocument to convert the Markdown to a Document
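A rough sketch of these three steps, to make the proposal concrete. It is not the actual Haystack component. The only MarkItDown calls used are `MarkItDown().convert(path)` and the result's `text_content`, as shown in the MarkItDown README. `SimpleDocument`, `files_to_documents`, `SUPPORTED_SUFFIXES` and the `to_markdown` parameter are hypothetical names made up for illustration. `SimpleDocument` stands in for Haystack's `Document`, and the sketch keeps the raw Markdown as the content instead of passing it through MarkdownToDocument.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List

# File types the issue proposes to support (assumption: matched by suffix only).
SUPPORTED_SUFFIXES = {".pdf", ".pptx", ".xlsx", ".docx"}


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for Haystack's Document (content plus metadata)."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def _markitdown_to_markdown(path: str) -> str:
    # Imported lazily so the sketch can be loaded without markitdown installed.
    from markitdown import MarkItDown

    return MarkItDown().convert(path).text_content


def files_to_documents(
    sources: Iterable[str],
    to_markdown: Callable[[str], str] = _markitdown_to_markdown,
) -> List[SimpleDocument]:
    """Convert each supported file to Markdown and wrap it in a document."""
    documents: List[SimpleDocument] = []
    for source in sources:
        path = Path(source)
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue  # skip file types outside the proposed scope
        markdown = to_markdown(str(path))
        documents.append(
            SimpleDocument(content=markdown, meta={"file_path": str(path)})
        )
    return documents
```

Passing the conversion step in as `to_markdown` keeps the MarkItDown dependency optional and makes the function easy to test with a stub. A real component would presumably follow the same pattern as PDFMinerToDocument instead.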
Checklist
If the request is accepted, ensure the following checklist is complete before closing this issue.
Thanks for sharing this idea @paulmartrencharpro ! Looks very interesting! As the Markitdown repo is only a month old and there is only a 0.0.1a2 pre-release, we'll need to see how much interest there is from the community in adding an integration for it. Maybe in the meantime someone in the community wants to build an integration.
An alternative might be Docling, which was recently brought up in this discussion: deepset-ai/haystack#8614