Remove references to "kur" and "tgl", add "fil" to man page #3165
"kur" no longer exists. It might now be named "kur_ara" (the old "kur_ara" is now "kmr", which is actually Latin), but "kur" is present in neither tessdata_fast nor tessdata_best. [1] [2]
"tgl" (Tagalog) is now named "fil" (Filipino). [3]
[1] tesseract-ocr/langdata#124
[2] tesseract-ocr/tessdata_best#23
[3] tesseract-ocr/langdata#84
@@ -277,7 +277,6 @@ following languages:
*tat* (Tatar),
*tel* (Telugu),
*tgk* (Tajik),
-*tgl* (Tagalog),
tgl is available for tessdata (so we should not remove it here), but is missing for tessdata_fast and tessdata_best, which is strange. We could copy the LSTM part to tessdata_fast if that helps.
Should we remove tgl from tessdata and from langdata_lstm, which also have the successor fil?
Please see https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
Would you like me to also submit an update for that page?
In general it is not clear to me what the authoritative source on all the "officially" supported/included languages is. I used to think the man page, but now I assume the tessdata repos themselves.
In any case, it's confusing when the documentation points to language data that does not exist. My first inclination was to search the Ubuntu packages, where I did find some old ones, but importing those was likely not a good idea, which is why I started digging some more. If it helps, I can try to automatically match all the other languages in the man page (or the wiki link) to the ones installed by Ubuntu packaging and what is available in tessdata.
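A minimal sketch of what that automatic matching could look like, assuming the man page source lists languages as `*code* (Name),` entries like the ones in the diff above. The names `LANG_ENTRY`, `parse_manpage_languages`, and `missing_from` are hypothetical helpers, and the `available` set is an illustrative stand-in for a real listing of installed language data, not actual repository contents.

```python
import re

# Assumption: each language appears as "*code* (Name)," in the man page source,
# where the code is three letters plus optional "_suffix" parts (e.g. kur_ara).
LANG_ENTRY = re.compile(r"\*([a-z]{3}(?:_[a-z]+)*)\*\s*\(([^)]*)\)")


def parse_manpage_languages(text: str) -> dict[str, str]:
    """Return {code: name} for every language entry found in the text."""
    return {code: name for code, name in LANG_ENTRY.findall(text)}


def missing_from(documented, available) -> list[str]:
    """Codes that are documented but absent from the available set."""
    return sorted(set(documented) - set(available))


# The four entries shown in the diff hunk of this pull request.
sample = """*tat* (Tatar),
*tel* (Telugu),
*tgk* (Tajik),
*tgl* (Tagalog),
"""

documented = parse_manpage_languages(sample)
available = {"tat", "tel", "tgk", "fil"}  # illustrative listing only
print(missing_from(documented, available))
```

With a real man page and a real list of installed data files, the same set difference would flag every documented code that has no matching data.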
@Shreeshrii - not sure why my comment went to this thread (I clicked quote reply), but my above comment is in reply to your message.
Thank you, Merlijn.
> In general it is not clear to me what the authoritative source on all the "officially" supported/included languages is. I used to think the man page, but now I assume the tessdata repos themselves.
The man pages may not have been updated with all changes.
> Would you like me to also submit an update for that page?
Yes, that would be great.
> If it helps I can try to automatically match all the other languages in the man page (or the wiki link) to the ones installed by ubuntu packaging and what is available in tessdata.
Actually, they may need to be matched to what's in tessdata, tessdata_best and tessdata_fast. In fact, tessdata_fast is what has been packaged for the distributions.
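Matching against all three repositories could be sketched as follows. This assumes local checkouts in which each language is a top-level `<code>.traineddata` file. The helper names `available_codes` and `coverage` are hypothetical, and the example sets only mirror what this thread states about `tgl` and `fil`; they are not a verified listing.

```python
from pathlib import Path


def available_codes(repo_dir) -> set[str]:
    """Language codes that have a top-level <code>.traineddata file in a checkout."""
    return {p.stem for p in Path(repo_dir).glob("*.traineddata")}


def coverage(documented, repos) -> dict[str, list[str]]:
    """Map each documented code to the sorted names of the repos that lack it."""
    report = {}
    for code in sorted(documented):
        lacking = sorted(name for name, codes in repos.items() if code not in codes)
        if lacking:
            report[code] = lacking
    return report


# Illustrative sets only, following what is said in this thread.
# With real checkouts, build them with e.g. available_codes("tessdata_fast").
repos = {
    "tessdata": {"tgl", "fil"},
    "tessdata_fast": {"fil"},
    "tessdata_best": {"fil"},
}
print(coverage({"tgl", "fil"}, repos))
```

Reporting the repositories that lack a code, rather than a single yes/no answer, keeps cases like `tgl` visible, where the data exists in one repository but not in the others.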
On Wed, Dec 2, 2020 at 3:34 PM Merlijn Wajer wrote:
> In doc/tesseract.1.asc:
> @@ -277,7 +277,6 @@ following languages:
> *tat* (Tatar),
> *tel* (Telugu),
> *tgk* (Tajik),
> -*tgl* (Tagalog),
>
> @Shreeshrii - not sure why my comment went to this thread (I clicked quote reply), but my above comment is in reply to your message.
OK, I will try to do that this week and update this PR if appropriate.
Please make a new PR for any followup changes.
Ran into this after building a language mapping from the man page. These languages are not available in tessdata.