Remove references to "kur" and "tgl", add "fil" to man page #3165

Merged 1 commit on Dec 3, 2020

MerlijnWajer
Contributor

Ran into this after building a language mapping from the man page. These languages are not available in tessdata.

"kur" no longer exists. It may now be named "kur_ara" (the old "kur_ara" is
now "kmr", which is actually Latin script), but "kur" is present in neither
tessdata_fast nor tessdata_best. [1] [2]

"tgl" (Tagalog) is now named "fil" (Filipino). [3]

[1] tesseract-ocr/langdata#124
[2] tesseract-ocr/tessdata_best#23
[3] tesseract-ocr/langdata#84
@Shreeshrii
Collaborator

@@ -277,7 +277,6 @@ following languages:
*tat* (Tatar),
*tel* (Telugu),
*tgk* (Tajik),
*tgl* (Tagalog),
Member

@stweil stweil Dec 2, 2020

tgl is available for tessdata (so we should not remove it here), but it is missing from tessdata_fast and tessdata_best, which is strange. We could copy the LSTM part to tessdata_fast if that helps.

Should we remove tgl from tessdata and from langdata_lstm, which also have the successor fil?

Contributor Author

> Please see https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html

Would you like me to also submit an update for that page?

In general it is not clear to me what the authoritative source on all the "officially" supported/included languages is. I used to think the man page, but now I assume the tessdata repos themselves.

In any case, it is confusing when the documentation points to language data that does not exist. My first inclination was to search the Ubuntu packages, where I did find some old ones, but importing those was likely not a good idea, which is why I started digging further. If it helps, I can try to automatically match all the other languages in the man page (or the wiki link) against the ones installed by the Ubuntu packaging and what is available in tessdata.
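
A minimal sketch of the automatic matching proposed above, not the script that was actually used. It assumes the man page source lists languages in the `*code* (Name),` form seen in the diff hunk, and that a local checkout of a tessdata repository holds one `<code>.traineddata` file per language. The function names (`documented_languages`, `available_languages`, `missing_from_tessdata`) are hypothetical.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumption: man page entries look like "*tgk* (Tajik)," as in the diff hunk.
# The name is captured up to the first ")", so names that contain nested
# parentheses are cut short; the language code is unaffected.
ENTRY = re.compile(r"\*([a-z]{3}(?:_[a-z]+)*)\*\s*\(([^)]*)\)")


def documented_languages(man_text: str) -> dict[str, str]:
    """Return {code: name} for every language entry found in the man page text."""
    return {code: name for code, name in ENTRY.findall(man_text)}


def available_languages(tessdata_dir: Path) -> set[str]:
    """Return the codes of all *.traineddata files directly inside tessdata_dir."""
    return {path.stem for path in tessdata_dir.glob("*.traineddata")}


def missing_from_tessdata(man_text: str, available: set[str]) -> dict[str, str]:
    """Return the documented languages that have no matching traineddata file."""
    documented = documented_languages(man_text)
    return {code: name for code, name in documented.items() if code not in available}
```

Running `missing_from_tessdata` once per repository (tessdata, tessdata_fast, tessdata_best) would give a separate list of stale man page entries for each, which matters here because the repositories do not all carry the same languages.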

Contributor Author

@MerlijnWajer MerlijnWajer Dec 2, 2020

@Shreeshrii - not sure why my comment went to this thread (I clicked quote reply), but my above comment is in reply to your message.

Member

@stweil stweil left a comment

Thank you, Merlijn.

@Shreeshrii
Collaborator

Shreeshrii commented Dec 2, 2020 via email

@MerlijnWajer
Contributor Author

OK, I will try to do that this week and update this PR if appropriate.

@stweil stweil merged commit 69ed480 into tesseract-ocr:master Dec 3, 2020
@stweil
Member

stweil commented Dec 3, 2020

> OK, I will try to do that this week and update this PR if appropriate.

Please make a new PR for any followup changes.
