Line breaks within cells are recognized as multiple lines, resulting in incomplete data #389

Open
bineea opened this issue Nov 26, 2024 · 0 comments

bineea commented Nov 26, 2024

Question

Line breaks inside table cells are treated as row boundaries, so a single cell is split across several rows and the extracted data is incomplete. How can I work around this?

Environment

  • torch: 2.5.1+cu121
  • transformers: 4.45.2
  • marker-pdf: 0.3.10

Original data

[Image]

Marker output data

[Image]

Code

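from marker.convert import convert_single_pdf
from marker.models import load_all_models
from marker.settings import settings

model_lst = load_all_models()
fpath = "documents/Payment Summary.PDF"

# Set the rendering DPI and select surya as the OCR engine
settings.IMAGE_DPI = 200
settings.OCR_ENGINE = "surya"
settings.SURYA_TABLE_DPI = 200
settings.SURYA_LAYOUT_DPI = 200

# OCR every page, but convert only the first page
full_text, images, out_meta = convert_single_pdf(fpath, model_lst, ocr_all_pages=True, max_pages=1)
print(full_text, end="\n------------------------------------------\n")
print(out_meta, end="\n------------------------------------------\n")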
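
One possible workaround, sketched here as a post-processing step on the Markdown in full_text rather than as anything marker provides: fold each "continuation" row back into the row above it. This assumes the extra rows come out with an empty first cell and the same number of columns as the row they belong to, which may not match every table. The names merge_continuation_rows and key_col are hypothetical.

```python
def split_row(line):
    """Split one Markdown table row into stripped cell strings."""
    line = line.strip()
    # Drop only the single outer pipe on each side so empty cells survive.
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def is_separator(cells):
    """True for the header separator row, e.g. | --- | :-: |."""
    return all(cell and set(cell) <= set("-:") for cell in cells)


def merge_continuation_rows(table_md, key_col=0):
    """Merge rows whose key column is empty into the preceding data row.

    Assumption: a line break inside a cell shows up as an extra row whose
    key column (by default the first) is empty.
    """
    rows = []
    for line in table_md.strip().splitlines():
        cells = split_row(line)
        continuation = (
            len(rows) >= 3  # header, separator, and at least one data row
            and not is_separator(cells)
            and key_col < len(cells)
            and cells[key_col] == ""
            and len(cells) == len(rows[-1])
        )
        if continuation:
            # Join each cell with the cell above it, skipping empty parts.
            rows[-1] = [" ".join(filter(None, pair)) for pair in zip(rows[-1], cells)]
        else:
            rows.append(cells)
    return "\n".join("| " + " | ".join(row) + " |" for row in rows)
```

This would be applied to each table extracted from full_text. A genuine row that happens to have an empty first cell would be merged by mistake, so key_col should point at a column that is always filled in.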
