New Converter based on Markitdown from Microsoft #1248

paulmartrencharpro · 2024-12-17T10:25:22Z

Summary and motivation

Markitdown is a new open-source library (MIT license) from Microsoft (https://github.com/microsoft/markitdown).
It's a utility for converting various file types to Markdown (e.g., for indexing or text analysis).

It presently supports:

PDF (.pdf)
PowerPoint (.pptx)
Word (.docx)
Excel (.xlsx)
Images (EXIF metadata, and OCR)
Audio (EXIF metadata, and speech transcription)
HTML (special handling of Wikipedia, etc.)
Various other text-based formats (csv, json, xml, etc.)

It runs locally. It tries to describe images locally, or it can call an AI model to describe them.

In my initial testing, it works quite well on pdf, pptx, xlsx and docx files.

We could use it to build a converter: (pdf, pptx, xlsx or docx) -> Markdown -> Document.

Detailed design

We can do something similar to PDFMinerToDocument.

Take a file (pdf, pptx, xlsx or docx).
Call markitdown to convert it to Markdown.
Use MarkdownToDocument to convert the Markdown to a Document.
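
A rough sketch of the three steps above, not a proposed final implementation. It assumes the `MarkItDown().convert(path).text_content` call shown in the Markitdown README. The names `convert_files`, `ConvertedFile`, `markitdown_to_markdown` and `SUPPORTED_SUFFIXES` are made up for illustration, and `ConvertedFile` is only a stand-in for a Haystack Document so that the snippet runs without Haystack installed.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Union

# File types mentioned in this issue (hypothetical constant name).
SUPPORTED_SUFFIXES = {".pdf", ".pptx", ".xlsx", ".docx"}


@dataclass
class ConvertedFile:
    """Stand-in for a Haystack Document: text content plus metadata."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def markitdown_to_markdown(path: Path) -> str:
    """Step 2: convert one file to Markdown with Markitdown."""
    # Imported lazily so the sketch can be loaded without the package;
    # requires `pip install markitdown` when actually called.
    from markitdown import MarkItDown

    return MarkItDown().convert(str(path)).text_content


def convert_files(
    paths: Iterable[Union[str, Path]],
    to_markdown: Callable[[Path], str] = markitdown_to_markdown,
) -> List[ConvertedFile]:
    """Steps 1-3: file -> Markdown -> document-like object."""
    results: List[ConvertedFile] = []
    for raw in paths:
        path = Path(raw)
        # Step 1: only accept the file types listed above, skip the rest.
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        markdown = to_markdown(path)
        # Step 3 (simplified): wrap the Markdown and keep the source path.
        results.append(ConvertedFile(content=markdown, meta={"file_path": str(path)}))
    return results
```

Usage would look like `convert_files(["report.pdf", "slides.pptx"])`. In a real integration, step 3 would hand the Markdown text to MarkdownToDocument instead of wrapping it directly, and the function would live inside a Haystack component like PDFMinerToDocument does.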

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.


julian-risch · 2024-12-17T12:22:55Z

Thanks for sharing this idea, @paulmartrencharpro! Looks very interesting! As the Markitdown repo is only a month old and there is only a 0.0.1a2 pre-release, we'll need to see how much interest there is from the community in adding an integration for it. Maybe in the meantime someone in the community wants to build an integration.
An alternative might be Docling, which was recently brought up in this discussion: deepset-ai/haystack#8614

paulmartrencharpro added the "new integration" label (Discuss the creation of a new integration in Core) Dec 17, 2024

julian-risch added the P3 label Dec 17, 2024
