Hi, Some type non-Western-character do not need keep space between words #346

Closed
napasa opened this issue Jun 13, 2018 · 9 comments

@napasa

napasa commented Jun 13, 2018

Hi. Some non-Western scripts do not put spaces between words. To handle this case, we should add a mechanism that checks whether the x coordinate of the next word should be advanced by an extra space width.
For example:
English: Hello world!
Chinese:你好世界!

x += painter.getTextWidth(text + " ") / px2pu;
(Code snippet quoted from src/hocr/HOCRPdfExporter.cc, line 717.)
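
One possible shape for such a mechanism, as a minimal sketch only: the advance is computed with or without the trailing space depending on a flag. `advanceX`, `MeasureFn`, and `appendSpace` are hypothetical names, and `measure` merely stands in for `painter.getTextWidth`; none of this is taken from the project's code.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for painter.getTextWidth(): width of a string in pixels.
using MeasureFn = std::function<double(const std::string&)>;

// Returns the x coordinate at which the next word would start.
// appendSpace == true reproduces the quoted line (word plus one space);
// appendSpace == false advances by the width of the word alone.
double advanceX(double x, const std::string& text, bool appendSpace,
                double px2pu, const MeasureFn& measure) {
    assert(px2pu > 0.0);  // the conversion factor must be positive
    const std::string measured = appendSpace ? text + " " : text;
    return x + measure(measured) / px2pu;
}
```

How `appendSpace` gets decided for each word is the open question in this issue.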

@manisandro
Owner

Any ideas what such a mechanism could look like?

@napasa
Author

napasa commented Jun 13, 2018

I'll give it a try. ^_^

@napasa
Author

napasa commented Jun 14, 2018

https://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/languages/internal/
How about using this, the CLD (Compact Language Detector) library?

@manisandro
Owner

Well, detecting the language is not much of a problem per se: typically the recognition language chosen by the user will match the actual language of the text. What needs to be defined are the rules for when to add spaces and when not to.
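
As an illustration of what such a rule could look like, here is a minimal sketch that decides from the characters on either side of a word boundary rather than from the language. It assumes that a space is wanted only when both neighbouring characters belong to a space-separated script, and it checks just a handful of Unicode blocks; both the rule and the names `isUnspacedScript` and `needsSpaceBetween` are assumptions for illustration, not something the project defines.

```cpp
#include <string>

// Assumption: only these Unicode blocks count as scripts written without
// spaces between words. The list is illustrative, not complete.
bool isUnspacedScript(char32_t c) {
    return (c >= 0x3000 && c <= 0x303F)   // CJK Symbols and Punctuation
        || (c >= 0x3040 && c <= 0x30FF)   // Hiragana and Katakana
        || (c >= 0x3400 && c <= 0x4DBF)   // CJK Unified Ideographs Extension A
        || (c >= 0x4E00 && c <= 0x9FFF)   // CJK Unified Ideographs
        || (c >= 0xFF00 && c <= 0xFFEF);  // Halfwidth and Fullwidth Forms
}

// Returns true when a space should separate `current` from `next`.
// Only the last character of `current` and the first of `next` are inspected.
bool needsSpaceBetween(const std::u32string& current, const std::u32string& next) {
    if (current.empty() || next.empty()) {
        return false;
    }
    return !isUnspacedScript(current.back()) && !isUnspacedScript(next.front());
}
```

Under this rule "Hello" followed by "world!" gets a space, while two adjacent Chinese words, or a Chinese word followed by a Latin one, do not. Whether mixed boundaries should get a space is a typographic choice that this sketch simply fixes one way.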

@napasa
Author

napasa commented Jun 14, 2018

We need a shared place (a "blackboard") where people can state whether their language needs spaces between words.

Also, especially in non-Western countries, documents often contain Western characters mixed in with the local script, so I suggest letting the user decide whether the language detection plug-in is enabled when each word is drawn.

@manisandro
Owner

If you do a multilingual recognition, tesseract will detect the language and will write it in the lang attribute of the corresponding element of the hOCR document.
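
A lookup keyed on that lang attribute could be sketched as follows. This is only an illustration: the function name `languageUsesWordSpaces` is hypothetical, the tag format (a primary subtag optionally followed by `_` or `-` and a region, as in `zh_CN`) is an assumption, and the list of languages is deliberately incomplete.

```cpp
#include <string>

// Returns false for languages assumed to be written without spaces between
// words. Only the primary subtag is inspected, so "zh_CN" and "zh-TW" both
// map to "zh". The set of languages here is an illustrative assumption.
bool languageUsesWordSpaces(const std::string& langTag) {
    // find_first_of returns npos when there is no separator, in which case
    // substr(0, npos) yields the whole tag.
    const std::string primary = langTag.substr(0, langTag.find_first_of("_-"));
    return primary != "zh" && primary != "ja";
}
```

A per-language rule like this treats every word with the same tag alike, so it cannot distinguish Latin text embedded in a document tagged as Chinese.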

@napasa
Author

napasa commented Jun 14, 2018

It does not seem to report the actual language of a word when I choose Chinese-only recognition:
<span title="bbox 1434 2311 1566 2344; x_fsize 10; x_wconf 56" class="ocrx_word" id="word_1_117" lang="zh_CN">AAAN</span>

@manisandro
Owner

Following up in the pull request #351

@napasa
Author

napasa commented Jun 22, 2018

I reset my commit, so please follow up in pull request #353 instead.
