Screenshot OCR gives better results than PDF document OCR #713

Open
1 task done
jqwangai opened this issue Nov 5, 2024 · 1 comment


jqwangai commented Nov 5, 2024

Issues

  • I have browsed through the Issues and confirmed that this is not a duplicate.

Umi-OCR version

2.1.4

Windows version

win11

OCR plugins used

PaddleOCR

Reproduction steps

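When recognizing the two characters “不寐” in a document, screenshot OCR picks them up easily, but batch document OCR either fails to recognize the character “寐” or misreads it as a different character.

Increasing the “limit image side length” setting does not fix this. Similar problems occur with other documents as well.

Is there any configuration I should pay attention to when running batch document OCR? Or does the PDF need some kind of preprocessing?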

Problem screenshots or related files (optional)

Screenshot of the source PDF:
image

Configuration:
image

@hiroi-sora
Owner

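You have already selected “force OCR on the whole page” (整页强制OCR), so the program obtains each page image in much the same way as a screenshot and then runs recognition on it. You can raise the DPI by changing a parameter, which increases the image resolution:

Open Umi-OCR\UmiOCR-data\py_src\mission\mission_doc.py in Notepad. Near the top, on line 19, there is:

MinSize = 1080  # minimum rendering resolution

This is the length of the shortest side when a PDF page is rendered to an image. You can raise this value, for example MinSize = 2160.

After making the change, save the file and restart Umi.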
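
As a rough illustration of why raising MinSize helps, the sketch below estimates the effective DPI that a given MinSize yields for a page. It assumes the renderer scales each page so that its shortest side equals MinSize pixels, and that page dimensions are given in PDF points (72 per inch). The function names are hypothetical and are not taken from mission_doc.py.

```python
POINTS_PER_INCH = 72.0  # PDF user-space units per inch


def render_scale(width_pt: float, height_pt: float, min_size: int) -> float:
    """Zoom factor that makes the page's shortest side min_size pixels.

    Assumption: the shortest side is scaled to exactly min_size.
    """
    return min_size / min(width_pt, height_pt)


def effective_dpi(width_pt: float, height_pt: float, min_size: int) -> float:
    """Approximate DPI of the rendered image for the given min_size."""
    return POINTS_PER_INCH * render_scale(width_pt, height_pt, min_size)


# An A4 page is roughly 595 x 842 points.
for min_size in (1080, 2160):
    dpi = effective_dpi(595, 842, min_size)
    print(f"MinSize={min_size}: about {dpi:.0f} DPI")
```

Under these assumptions, an A4 page rendered with MinSize = 1080 comes out at roughly 131 DPI, which can be too coarse for small or dense characters, while MinSize = 2160 doubles that to about 261 DPI at the cost of slower processing and more memory.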
