-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix markitdown.convert_stream handling of leading blanks #223
Conversation
Fixes microsoft#222 Address issue with `markitdown.convert_stream` crashing on input with leading blank characters or line breaks. * Modify `convert_stream` function in `src/markitdown/_markitdown.py` to strip leading blank characters or line breaks from the input stream using a new helper function `_strip_leading_blanks`. * Add a test case in `tests/test_markitdown.py` to verify that `markitdown.convert_stream` handles input with leading blank characters or line breaks correctly. --- For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/microsoft/markitdown/issues/222?shareId=XXXX-XXXX-XXXX-XXXX).
@@ -1344,7 +1344,7 @@ def convert_stream( | |||
result = None | |||
try: | |||
# Write to the temporary file | |||
content = stream.read() | |||
content = self._strip_leading_blanks(stream.read()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stream might not be text -- in which case stripping characters could be very problematic. Suggest we move this to the inside of the if statement below.
@@ -1367,6 +1367,10 @@ def convert_stream( | |||
|
|||
return result | |||
|
|||
def _strip_leading_blanks(self, content: bytes) -> bytes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a helper function for lstrip?
Thanks for the PR. Before we accept this, I would like to better understand why leading spaces are causing a crash. I suspect that the issue lies deeper in the logic for guessing the file format, and it will be triggered in other conditions as well. |
# Test input with leading blank characters | ||
input_data = b" \n\n\n<html><body><h1>Test</h1></body></html>" | ||
result = markitdown.convert_stream(io.BytesIO(input_data), file_extension=".html") | ||
assert "<h1>Test</h1>" in result.text_content |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test will fail. The output will be Markdown, not HTML
I've determine the problem, and am investigating other fixes that don't involve truncating the file. See #222 (comment) for more. |
Fixed in #260 without modifying file. |
Fixes #222
Address issue with
markitdown.convert_stream
crashing on input with leading blank characters or line breaks.convert_stream
function insrc/markitdown/_markitdown.py
to strip leading blank characters or line breaks from the input stream using a new helper function_strip_leading_blanks
.tests/test_markitdown.py
to verify thatmarkitdown.convert_stream
handles input with leading blank characters or line breaks correctly.For more details, open the Copilot Workspace session.