Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added DOC to Markdown Converter Function - Issue #23 #37

Closed
wants to merge 16 commits into from

Conversation

aviral-bhardwaj
Copy link

Added DOC to Markdown Converter Function

  1. This commit introduces a new DocConverter class that extends the HtmlConverter class to handle the conversion of DOC files to Markdown format. The key features of this implementation are:
  2. File Type Validation: The converter checks if the file extension is ".doc" before proceeding with the conversion.
  3. Conversion Process:
  4. The function uses the mammoth library to convert the DOC file to HTML.
  5. It then utilizes the inherited _convert method to transform the HTML content into Markdown.

Issue URL- This is Issue URL

Closed Previous same PR due to the PyCharm Indentation Errors

@aviral-bhardwaj
Copy link
Author

@StanFromIreland Please review this due to pycharm i was getting indentation error in my last PR

@realrajaryan
Copy link
Member

should add test(s)

@aviral-bhardwaj
Copy link
Author

adding but having problem to run that

@aviral-bhardwaj
Copy link
Author

added test cases and it is passed

@aviral-bhardwaj
Copy link
Author

can anyone please help me here @gagb , I am not able to run tests and pre-commits

@aviral-bhardwaj
Copy link
Author

it is showing this in my pycharm

================== 2 passed, 1 skipped, 3 warnings in 26.27s ===================
PASSED [ 33%]PASSED [ 66%]SKIPPED (do not run if
exiftool is not installed) [100%]
Skipped: do not run if exiftool is not installed

src/markitdown/_markitdown.py Outdated Show resolved Hide resolved
@aviral-bhardwaj
Copy link
Author

fixed according to review as the Args are creating issue in pychram so removed them according to all other functions

@aviral-bhardwaj
Copy link
Author

@l-lumin please review this once

Copy link
Contributor

@l-lumin l-lumin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Sorry, I'm just a random person passing by, so I can't magically approve this!

@aviral-bhardwaj
Copy link
Author

@realrajaryan sir please can you help me to move forward

@gagb
Copy link
Contributor

gagb commented Dec 16, 2024

can anyone please help me here @gagb , I am not able to run tests and pre-commits

You can run precommit via 'pre-commit run --all-files' and then update the PR

@gagb
Copy link
Contributor

gagb commented Dec 16, 2024

Another PR introduced conflicts, can you please resolve?

@gagb gagb self-assigned this Dec 16, 2024
@aviral-bhardwaj
Copy link
Author

@gagb sir please review this , I have updated the test code ,help me to merge this

@gagb gagb requested a review from realrajaryan December 18, 2024 00:21
@gagb
Copy link
Contributor

gagb commented Dec 18, 2024

@gagb sir please review this , I have updated the test code ,help me to merge this

Thanks, will get it merged soon. Appreciate your contributions.

@tungsten106
Copy link

@aviral-bhardwaj I wonder how you used mammoth to convert .doc file to html? I have tried on my own .doc file, it riased error OSError: Could not find main document part. Are you sure this is a valid .docx file?

Copy link
Member

@realrajaryan realrajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mammoth does not support converting doc file format, need to find an alternative.


result = None
with open(local_path, "rb") as doc_file:
result = mammoth.convert_to_html(doc_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mammoth does not seem to support .doc files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is not a proper doc file. If you just renamed test.docx to test.doc, it does not effectively convert it into a .doc file, it merely changes the file extension while retaining the docx file format. This is why your test passes when it shouldn’t, as mammoth does not support doc files.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok checking , will update

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@realrajaryan added new doc file it is working in my local added screenshot below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not a valid doc file. The MIME type for this file is application/octet-stream, but for a doc file it’s supposed to be application/msword.

@aviral-bhardwaj
Copy link
Author

added new doc file this is running in my local without any issue

Screenshot 2024-12-18 at 1 11 14 PM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not a valid doc file. The MIME type for this file is application/octet-stream, but for a doc file it’s supposed to be application/msword.

@aviral-bhardwaj
Copy link
Author

added new doc file attaching its info snapshot also

Screenshot 2024-12-18 at 1 49 03 PM

Copy link
Member

@realrajaryan realrajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a valid doc file now, and the test case fails as mammoth does not support converting doc files.

@aviral-bhardwaj
Copy link
Author

yes doing more research on this , will update soon

@aviral-bhardwaj
Copy link
Author

this PR can be closed as this feature is completed from this PR #153

@gagb gagb closed this Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants