chore: update output schema for parse and extract_tables #66

Merged: 2 commits, Nov 19, 2024

Conversation

SeisSerenata
Collaborator

Description

This PR changes the output schema of the parse and extract_tables functions so that both consistently return markdown content as a list of strings instead of a single joined string. The list form gives downstream code more flexibility; callers that still need the previous single-string output can join the list elements.
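
As a rough illustration of what the schema change means for callers, here is a minimal sketch. `parse_stub` is a hypothetical stand-in, not the library's real `parse`; it only mimics the `(markdown_list, elapsed_time)` return shape described above, and it assumes one list element per page.

```python
import time
from typing import List, Tuple


def parse_stub(pages: List[str]) -> Tuple[List[str], float]:
    """Hypothetical stand-in for parse(): returns (markdown_list, elapsed_time)."""
    start = time.perf_counter()
    # Assumption: one markdown string per page.
    markdown_list = [f"# Page {i}\n\n{text}" for i, text in enumerate(pages, start=1)]
    return markdown_list, time.perf_counter() - start


markdown_list, elapsed_time = parse_stub(["alpha", "beta"])

# New schema: each element can be processed on its own.
for page_markdown in markdown_list:
    print(page_markdown.splitlines()[0])

# Old schema: join the elements to recover a single string.
markdown = "\n".join(markdown_list)
```

The join uses the same `"\n"` separator as the updated test in this PR.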

Related Issue

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement

How Has This Been Tested?

  • Updated all existing tests to handle the new list-based output format
  • Tests have been modified to join the markdown list elements when comparing with ground truth
  • All test cases pass with the new schema

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

Key changes:

  1. Modified parse() and extract_tables() to return markdown as a list instead of joining it
  2. Updated async_fetch() to maintain consistency with the new return format
  3. Updated all test cases to handle the new list-based output format
  4. Maintained backward compatibility by joining lists where needed for comparison
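
For downstream code that has to work both before and after this change, one possible approach is a small normalizing helper. `as_text` below is a hypothetical name and is not part of this PR; it is only a sketch of the "join where needed" idea from item 4.

```python
from typing import List, Union


def as_text(markdown: Union[str, List[str]], sep: str = "\n") -> str:
    """Return markdown as one string, whether it arrived as a str or a list of str."""
    if isinstance(markdown, str):
        # Old schema: already a single joined string.
        return markdown
    # New schema: join the list elements.
    return sep.join(markdown)


old_style = as_text("# Title\nbody")
new_style = as_text(["# Title", "body"])
```

Both calls yield the same string, so comparison code written against the old schema keeps working.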

Collaborator

@lingjiekong lingjiekong left a comment

LGTM.

Collaborator

@lingjiekong lingjiekong left a comment

LGTM with minor comment

@@ -52,7 +52,8 @@ def test_pdf_sync_parse(self):
         correct_output_file = "./tests/outputs/correct_pdf_output.txt"

         # extract
-        markdown, elapsed_time = self.ap.parse(file_path=working_file)
+        markdown_list, elapsed_time = self.ap.parse(file_path=working_file)
+        markdown = "\n".join(markdown_list)
Collaborator

nit: it might be better to compare the result page by page.
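
A page-by-page check along those lines could look roughly like the sketch below. It assumes the ground truth is also available as a list with one string per page, which this PR does not show (the existing ground truth is a single file), and `assert_pages_match` is a hypothetical helper name.

```python
from typing import List


def assert_pages_match(actual_pages: List[str], expected_pages: List[str]) -> None:
    """Compare parsed output to ground truth one page at a time."""
    assert len(actual_pages) == len(expected_pages), (
        f"page count differs: {len(actual_pages)} != {len(expected_pages)}"
    )
    pairs = zip(actual_pages, expected_pages)
    for page_number, (actual, expected) in enumerate(pairs, start=1):
        # Ignore leading/trailing whitespace so trailing newlines do not fail the test.
        assert actual.strip() == expected.strip(), f"mismatch on page {page_number}"


assert_pages_match(["# A\n", "# B"], ["# A", "# B"])
```

Compared with joining everything into one string, a failure here names the page that differs.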

@lingjiekong lingjiekong merged commit a27a68b into main Nov 19, 2024
5 checks passed

2 participants