Language request: Kurdish-Kurmanji #124

Closed
brandones opened this issue Apr 11, 2018 · 22 comments
Comments

@brandones

It's the most popular dialect of Kurdish, written in a Latin alphabet with a few diacritics.

@Shreeshrii
Contributor

Shreeshrii commented Apr 11, 2018 via email

@brandones
Author

The data in both kur and kur_ara is in Arabic script, neither is in a Latin script.

@Shreeshrii
Contributor

Shreeshrii commented Apr 11, 2018 via email

@Shreeshrii
Contributor

Shreeshrii commented Apr 11, 2018 via email

@brandones
Author

Oh interesting! I will give that a try, thank you.

@brandones
Author

Yes, it's definitely trained on Kurmanji. It's doing a great job OCRing Kurmanji text. This is wonderful, thank you!

@Shreeshrii
Contributor

Thank you for your feedback.

We need to change the traineddata name so that it correctly reflects the language.

What would you suggest?

kur
kur_lat

Or is there another official language code for Kurmanji?

@brandones
Author

brandones commented Apr 12, 2018

Sure! My recommendations:

Don’t use kur for anything.

It depends on whether you’re more interested in the script or the dialect. Several Kurdish languages use each of the two scripts, the Latin-based one and the Persian-based one. Historically, Cyrillic and Armenian scripts have been used as well. I’m guessing the language itself is what matters, though.

The ISO 639-3 code for Kurmanji (aka Northern Kurdish) is kmr.

It wouldn’t be at all unreasonable to prefix it as kur_kmr, and refer to it as Kurdish Kurmanji. Or just list kmr as Kurdish Kurmanji.

Similarly, I’d recommend using ckb or kur_ckb for the one with the Arabic-based script, Sorani.

@Shreeshrii
Contributor

Thanks.

@jbreiden @theraysmith what is your recommendation?

At a minimum kur_ara should be changed as it doesn't have Arabic in it.

@Shreeshrii
Contributor

tesseract-ocr/tessdata_fast#14 (comment)

OK. I think we should follow the suggestion by @amitdo, since it is in line with the way tesseract names other languages.

@amitdo

amitdo commented Apr 21, 2018

Shree, please send PRs.

@brandones
Author

In case it’s not clear, this choice impedes the development of Tesseract support for Zaza, Gorani/Horami, and Southern Kurdish, each of which will require quite different dictionaries.

@Shreeshrii
Contributor

@brandones Yes, that will require different dictionaries. As of now I am not sure exactly which language dictionary is being used.

I am attaching a zip file with two traineddatas for Kurdish in Arabic script - Sorani.

If you are familiar with the script, please review and provide feedback as to accuracy and also whether the word frequency list is appropriate. (or refer to someone who can provide that feedback). Thanks!

kur_ara.traineddata.zip

@Shreeshrii
Contributor

https://en.wikipedia.org/wiki/Kurdish_languages

ISO 639-3 kur – inclusive code
Individual codes:
ckb – Central Kurdish
kmr – Northern Kurdish
sdh – Southern Kurdish

ckb - Sorani (Arabic/Persian script).
kmr - Kurmanji (Latin script)
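
For anyone who wants these codes in a script, here is a minimal Python sketch of the list above. The dictionary name `KURDISH_CODES` and the helper `describe` are made up for illustration; they are not part of Tesseract. The script for `sdh` is left unset because it is not stated in this thread.

```python
from typing import Dict, Optional

# ISO 639-3 codes for Kurdish as listed in this thread.
# The "script" values come from the thread; None means "not stated here".
KURDISH_CODES: Dict[str, Dict[str, Optional[str]]] = {
    "kur": {"name": "Kurdish (inclusive code)", "script": None},
    "ckb": {"name": "Central Kurdish (Sorani)", "script": "Arabic/Persian"},
    "kmr": {"name": "Northern Kurdish (Kurmanji)", "script": "Latin"},
    "sdh": {"name": "Southern Kurdish", "script": None},
}


def describe(code: str) -> str:
    """Return a one-line description for a code (hypothetical helper)."""
    entry = KURDISH_CODES.get(code)
    if entry is None:
        raise KeyError(f"unknown Kurdish language code: {code!r}")
    if entry["script"] is None:
        return f"{code}: {entry['name']}"
    return f"{code}: {entry['name']}, {entry['script']} script"


if __name__ == "__main__":
    for code in KURDISH_CODES:
        print(describe(code))
```

Keeping the inclusive code `kur` in the table but without a script makes it visible that it does not identify a single language or script.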

@Shreeshrii
Contributor

Shreeshrii commented Feb 16, 2019

@zdenop closed this as completed Feb 16, 2019

@keyochali

I want to improve and work on Kurdish Kurmanji and Sorani, for both the Latin and Arabic scripts.
How can I do that?
What needs to be done?
How can I submit it?

@Shreeshrii
Contributor

Sorani in Arabic script is not available in tessdata_best and tessdata_fast.

https://github.com/tesseract-ocr/langdata_lstm/tree/master/kur
has minimal language data available for it.

You will need to identify Unicode fonts and training text for Sorani before running any training.

@keyochali

Thank you so much.
I just tested the kmr.traineddata.

But for Sorani in Arabic script, where can I train it?
What data is needed besides the text?

I mean, do I need to create box files for some images?
Can you show me an example of training in Tesseract?
Is there any documentation for training it?

@keyochali

wonderful
thanks a lot

@keyochali

How can I do it on Windows?
When Tesseract is installed, it comes with the training tools.
How can I use them?
Is there a Python script to do it?
What is a textline (a word, or a line)?

MerlijnWajer added a commit to MerlijnWajer/tesseract that referenced this issue Dec 1, 2020
"kur" no longer exists, might be named "kur_ara" (the old "kur_ara" is
now "kmr", which is actually Latin) now, but "kur" is not present in
tessdata_fast nor in tessdata_best. [1] [2]

"tgl" (Tagalog) is now named "fil" (Filipino) [3]

[1] tesseract-ocr/langdata#124
[2] tesseract-ocr/tessdata_best#23
[3] tesseract-ocr/langdata#84

"kur" no longer exists, might be named "kur_ara" now, but it is not
present in tessdata_fast nor in tessdata_best. "kmr" is the Latin
version (Kurmanji)

"tgl" (Tagalog) is now named "fil" (Filipino)
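
For scripts still written against the old names, the renames in this commit message could be handled with a small lookup before calling the `tesseract` command line. This is only a sketch under assumptions: `LEGACY_RENAMES`, `REMOVED`, `resolve_lang` and `tesseract_command` are hypothetical names, `page.png` and `out` are placeholder file names, and the `kur_ara` entry refers to the old `kur_ara` data described above (a later `kur_ara` could mean something else). The only Tesseract interface used is the standard `tesseract imagename outputbase -l lang` invocation.

```python
from typing import Dict, List, Optional, Set

# Renames taken from the commit message above (assumed to apply to old scripts).
LEGACY_RENAMES: Dict[str, str] = {
    "kur_ara": "kmr",  # the old "kur_ara" data was actually Latin-script Kurmanji
    "tgl": "fil",      # Tagalog is now listed as Filipino
}

# Codes the commit message says no longer exist in tessdata_fast / tessdata_best.
REMOVED: Set[str] = {"kur"}


def resolve_lang(code: str) -> Optional[str]:
    """Map a legacy code to its current name, or None if it was removed."""
    if code in REMOVED:
        return None
    return LEGACY_RENAMES.get(code, code)


def tesseract_command(image: str, outputbase: str, lang: str) -> List[str]:
    """Build (but do not run) a tesseract command line for the resolved code."""
    resolved = resolve_lang(lang)
    if resolved is None:
        raise ValueError(f"language code {lang!r} is no longer available")
    return ["tesseract", image, outputbase, "-l", resolved]


if __name__ == "__main__":
    # Placeholder file names; pass the list to subprocess.run() to execute it.
    print(tesseract_command("page.png", "out", "kur_ara"))
```

Returning `None` for `kur` instead of guessing a replacement reflects the uncertainty in the commit message about what, if anything, it maps to now.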