-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add DOC converter to MarkItDown #153
Conversation
Thanks for the PR. I agree we want legacy doc support. Let me think about this one a little. I'm not sure calling headless libreoffice (via shell) is the ideal approach here, and there may be other constraints. I know we do it for exiftool, but I'm not thrilled with that... One nice thing is that the tools supports registering converters after initialization, similar to: So in theory you could do: class DocConverter(HtmlConverter):
...
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
...
markitdown = MarkItDown()
markitdown.register_page_converter(DocConverter()) We could even imagine a little ecosystem of converters popping up on pypi, which would allow unrestricted growth in formats. @gagb what do you think? |
@aviral-bhardwaj can you please review this PR as well since you made a similar PR? |
I agree with your concern, but the PR
Could the refactor that you suggest be done in a different PR? |
Will check later ,it is midnight in india 🇮🇳 |
c111856
to
d53e8b8
Compare
Adds a new DocConverter class to convert DOC files to Markdown. Integrates the converter the MarkItDown class and includes tests to verify functionality. This enhancement allows users to process DOC files, expanding the tool's capabilities.
Enhance the exception handling in the DocConverter class to provide more informative error messages. Instead of returning None on failure, the code now returns a DocumentConverterResult object that includes the error message, making it easier to diagnose issues when processing DOC files.
98789cb
to
1865f1a
Compare
This type of error should not be displayed to the end user. As the end user, they should not be responsible for installing additional libraries like LibreOffice on top of MarkitDown. Is there a way to include LibreOffice as part of the installation process for our MarkitDown library? This way, when a user installs our library, everything will be set up, and they can start using it seamlessly without any extra steps. |
Include LibreOffice as part of installation process isn't ideal because can't account for all systems, which could cause issues. I think do this will better
Ex: Selenium |
all are fine but test cases are failing in my system not sure why , attaching error file here @l-lumin , can you please have a look at this |
what environment are you running on? I checked your repository, and it is 16 commits ahead of microsoft/markitdown. It also includes failed tests my test log root@450635eb267c:/workspaces/markitdown# hatch test
================================================================================= test session starts ==================================================================================
platform linux -- Python 3.13.1, pytest-8.3.4, pluggy-1.5.0
rootdir: /workspaces/markitdown
configfile: pyproject.toml
plugins: rerunfailures-14.0, mock-3.14.0, anyio-4.7.0, xdist-3.6.1
collected 7 items
tests/test_markitdown.py ..s...s [100%]
=================================================================================== warnings summary ===================================================================================
../../root/.local/share/hatch/env/virtual/markitdown/2jxz10qj/hatch-test.py3.13/lib/python3.13/site-packages/speech_recognition/__init__.py:7
/root/.local/share/hatch/env/virtual/markitdown/2jxz10qj/hatch-test.py3.13/lib/python3.13/site-packages/speech_recognition/__init__.py:7: DeprecationWarning: aifc was removed in Python 3.13. Please be aware that you are currently NOT using standard 'aifc', but instead a separately installed 'standard-aifc'.
import aifc
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================= 5 passed, 2 skipped, 1 warning in 18.22s =======================================================================
root@450635eb267c:/workspaces/markitdown# |
yes yes mine is window system thats why it is showing error |
Update DocConverter to return an error message when LibreOffice is not found. Add a new test to verify the error handling behavior when the `shutil.which` function is mocked to return None, ensuring that users receive clear instructions to install LibreOffice. refactor(tests): update test_markitdown_soffice_mocked
1865f1a
to
0e4e6c0
Compare
@gagb please have a final look , this is fine from my end |
if not (soffice := shutil.which("soffice")): | ||
return DocumentConverterResult( | ||
title=None, | ||
text_content="[ERROR] LibreOffice not found. Please install LibreOffice and try again.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
text_content="[ERROR] LibreOffice not found. Please install LibreOffice and try again.", | |
text_content="[ERROR] This feature requires the 'soffice' command from LibreOffice. Please install LibreOffice. https://www.libreoffice.org/get-help/install-howto/.", |
Again, I am supportive of this effort, and appreciate the contribution, but I'm not sold on calling out to LibreOffice via shell to accomplish this conversion -- especially in the default installation. It works, but it's a pretty heavyweight binary dependency, and brings other baggage. Also, if LibreOffice is available, why not use it for PPT and other formats as well. I think, very soon, I would like to create some sort of plugin ecosystem where you could do: pip install markitdown
pip install markitdown-soffice And then get all the libreOffice converters (perhaps detected and incorporated automatically). This would allow you, or anyone, to add alternatives or new file formats, and host and update them at your own convenience. However, I'm still working out the details for this. I would like to close this PR for now, but come back to it as soon as we have the details worked out. Again, I am appreciative of this effort -- this is not a reflection on the code or anything, but rather a design decision on our end that has yet to be worked out. |
Issue #23
This PR adds a
DocConverter
class to convert DOC files into Markdown. It integrates the converter into the MarkItDown class and includes tests from #37 to verify the functionalityThe conversion need use LibreOffice in headless mode.