Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add DOC converter to MarkItDown #153

Closed
wants to merge 3 commits into from

Conversation

l-lumin
Copy link
Contributor

@l-lumin l-lumin commented Dec 19, 2024

Issue #23

This PR adds a DocConverter class to convert DOC files into Markdown. It integrates the converter into the MarkItDown class and includes tests from #37 to verify the functionality

The conversion need use LibreOffice in headless mode.

@afourney
Copy link
Member

Thanks for the PR. I agree we want legacy doc support.

Let me think about this one a little. I'm not sure calling headless libreoffice (via shell) is the ideal approach here, and there may be other constraints. I know we do it for exiftool, but I'm not thrilled with that...

One nice thing is that the tools supports registering converters after initialization, similar to:
https://github.com/microsoft/markitdown/blob/925c4499f72757abcf6cb521ee10e4844967af3d/src/markitdown/_markitdown.py#L1269C1-L1287C1

So in theory you could do:

class DocConverter(HtmlConverter):
    ...
    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        ...

markitdown = MarkItDown()
markitdown.register_page_converter(DocConverter())

We could even imagine a little ecosystem of converters popping up on pypi, which would allow unrestricted growth in formats.

@gagb what do you think?

@afourney afourney requested a review from gagb December 19, 2024 18:10
src/markitdown/__init__.py Outdated Show resolved Hide resolved
src/markitdown/_markitdown.py Outdated Show resolved Hide resolved
@gagb
Copy link
Contributor

gagb commented Dec 19, 2024

@aviral-bhardwaj can you please review this PR as well since you made a similar PR?

@gagb
Copy link
Contributor

gagb commented Dec 19, 2024

Thanks for the PR. I agree we want legacy doc support.

Let me think about this one a little. I'm not sure calling headless libreoffice (via shell) is the ideal approach here, and there may be other constraints. I know we do it for exiftool, but I'm not thrilled with that...

One nice thing is that the tools supports registering converters after initialization, similar to: https://github.com/microsoft/markitdown/blob/925c4499f72757abcf6cb521ee10e4844967af3d/src/markitdown/_markitdown.py#L1269C1-L1287C1

So in theory you could do:

class DocConverter(HtmlConverter):
    ...
    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        ...

markitdown = MarkItDown()
markitdown.register_page_converter(DocConverter())

We could even imagine a little ecosystem of converters popping up on pypi, which would allow unrestricted growth in formats.

@gagb what do you think?

I agree with your concern, but the PR

  • adds functionality that is useful
  • does not regress previous features

Could the refactor that you suggest be done in a different PR?

@aviral-bhardwaj
Copy link

@aviral-bhardwaj can you please review this PR as well since you made a similar PR?

Will check later ,it is midnight in india 🇮🇳

@l-lumin l-lumin force-pushed the add-doc-converter branch 2 times, most recently from c111856 to d53e8b8 Compare December 19, 2024 18:39
Adds a new DocConverter class to convert DOC files to Markdown. 
Integrates the converter the MarkItDown class and includes 
tests to verify functionality. This enhancement allows users to 
process DOC files, expanding the tool's capabilities.
Enhance the exception handling in the DocConverter class 
to provide more informative error messages. Instead of 
returning None on failure, the code now returns a 
DocumentConverterResult object that includes the error 
message, making it easier to diagnose issues when 
processing DOC files.
@aviral-bhardwaj
Copy link

Hi @l-lumin and @gagb,

This type of error should not be displayed to the end user. As the end user, they should not be responsible for installing additional libraries like LibreOffice on top of MarkitDown.

Is there a way to include LibreOffice as part of the installation process for our MarkitDown library? This way, when a user installs our library, everything will be set up, and they can start using it seamlessly without any extra steps.

@l-lumin
Copy link
Contributor Author

l-lumin commented Dec 20, 2024

Hi @l-lumin and @gagb,

This type of error should not be displayed to the end user. As the end user, they should not be responsible for installing additional libraries like LibreOffice on top of MarkitDown.

Is there a way to include LibreOffice as part of the installation process for our MarkitDown library? This way, when a user installs our library, everything will be set up, and they can start using it seamlessly without any extra steps.

Include LibreOffice as part of installation process isn't ideal because can't account for all systems, which could cause issues.

I think do this will better

  • Error message to guide user
  • Add LibreOffice to Dockerfile (I will create PR for Docker once this is merged)

Ex: Selenium

@aviral-bhardwaj
Copy link

all are fine but test cases are failing in my system not sure why , attaching error file here @l-lumin , can you please have a look at this

I-lumin-test-code error.pdf

@l-lumin
Copy link
Contributor Author

l-lumin commented Dec 20, 2024

all are fine but test cases are failing in my system not sure why , attaching error file here @l-lumin , can you please have a look at this

I-lumin-test-code error.pdf

what environment are you running on? I checked your repository, and it is 16 commits ahead of microsoft/markitdown. It also includes failed tests

my test log

root@450635eb267c:/workspaces/markitdown# hatch test
================================================================================= test session starts ==================================================================================
platform linux -- Python 3.13.1, pytest-8.3.4, pluggy-1.5.0
rootdir: /workspaces/markitdown
configfile: pyproject.toml
plugins: rerunfailures-14.0, mock-3.14.0, anyio-4.7.0, xdist-3.6.1
collected 7 items                                                                                                                                                                      

tests/test_markitdown.py ..s...s                                                                                                                                                 [100%]

=================================================================================== warnings summary ===================================================================================
../../root/.local/share/hatch/env/virtual/markitdown/2jxz10qj/hatch-test.py3.13/lib/python3.13/site-packages/speech_recognition/__init__.py:7
  /root/.local/share/hatch/env/virtual/markitdown/2jxz10qj/hatch-test.py3.13/lib/python3.13/site-packages/speech_recognition/__init__.py:7: DeprecationWarning: aifc was removed in Python 3.13. Please be aware that you are currently NOT using standard 'aifc', but instead a separately installed 'standard-aifc'.
    import aifc

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================= 5 passed, 2 skipped, 1 warning in 18.22s =======================================================================
root@450635eb267c:/workspaces/markitdown# 

@aviral-bhardwaj
Copy link

yes yes mine is window system thats why it is showing error

Update DocConverter to return an error message when LibreOffice 
is not found. Add a new test to verify the error handling 
behavior when the `shutil.which` function is mocked to return 
None, ensuring that users receive clear instructions to 
install LibreOffice.
refactor(tests): update test_markitdown_soffice_mocked
@aviral-bhardwaj
Copy link

@gagb please have a final look , this is fine from my end

if not (soffice := shutil.which("soffice")):
return DocumentConverterResult(
title=None,
text_content="[ERROR] LibreOffice not found. Please install LibreOffice and try again.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
text_content="[ERROR] LibreOffice not found. Please install LibreOffice and try again.",
text_content="[ERROR] This feature requires the 'soffice' command from LibreOffice. Please install LibreOffice. https://www.libreoffice.org/get-help/install-howto/.",

@gagb gagb requested a review from afourney December 20, 2024 22:08
@afourney
Copy link
Member

Again, I am supportive of this effort, and appreciate the contribution, but I'm not sold on calling out to LibreOffice via shell to accomplish this conversion -- especially in the default installation. It works, but it's a pretty heavyweight binary dependency, and brings other baggage.

Also, if LibreOffice is available, why not use it for PPT and other formats as well.

I think, very soon, I would like to create some sort of plugin ecosystem where you could do:

pip install markitdown
pip install markitdown-soffice

And then get all the libreOffice converters (perhaps detected and incorporated automatically). This would allow you, or anyone, to add alternatives or new file formats, and host and update them at your own convenience.

However, I'm still working out the details for this.

I would like to close this PR for now, but come back to it as soon as we have the details worked out. Again, I am appreciative of this effort -- this is not a reflection on the code or anything, but rather a design decision on our end that has yet to be worked out.

@afourney afourney closed this Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants