Python tools for Tibetan latin to unicode #1

Open
ironhouzi opened this issue Dec 15, 2018 · 1 comment

Comments

@ironhouzi

Hi Elie. Nice to see you're using Python, as I remember you were mainly using lower-level languages.

I am glad to see more work being done in this field. Just so you know I also have a Python tool for solving similar problems: https://github.com/ironhouzi/pytib

I hope we could perhaps join our efforts, so that we could benefit from one another.

pytib is far from feature complete and takes a different approach than pyewts. I am also curious to see how well pytib handles EWTS, since pytib supports dynamic configuration of Latin definitions. pytib started as a pet project when I was first learning Python, but now that I work as a professional Python programmer, I gave it a big rewrite last year. Still, I think it could use a whole lot more polish, so learning from other skilled developers solving similar problems is very inspiring.

The main difference between the two is definitely the algorithm. I've chosen a more analytical approach, and you're using lookup tables. I think there are pros and cons to both approaches. The analytical approach gives rudimentary spell checking as a side benefit, but its performance is not spectacular. I would assume lookup tables give good performance; performance is the challenge I'm currently trying to tackle, through concurrent processing. While I find the translation function pretty OK, the code that uses the parse() function for parsing documents was implemented rather quickly. It uses line-based handling instead of character-based handling; the latter seems like a better approach for managing correct Tibetan punctuation, and that will also need to be figured out before I can do any work on concurrent Latin-Tibetan parsing.
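
To make the lookup-table idea concrete, here is a minimal sketch of a greedy longest-match converter. It is only an illustration: the `TABLE` contents and the `to_unicode` name are made up for this example and are not taken from pytib or pyewts. It covers a handful of EWTS tokens and ignores stacks, prefixes, and standalone vowels entirely.

```python
# Toy lookup table: a tiny, illustrative subset of EWTS (an assumption for
# this sketch, not the table used by pyewts or pytib).
TABLE = {
    "kh": "\u0f41",  # TIBETAN LETTER KHA
    "k": "\u0f40",   # TIBETAN LETTER KA
    "g": "\u0f42",   # TIBETAN LETTER GA
    "ng": "\u0f44",  # TIBETAN LETTER NGA
    "m": "\u0f58",   # TIBETAN LETTER MA
    "a": "",         # inherent vowel after a consonant: nothing is written
    "i": "\u0f72",   # TIBETAN VOWEL SIGN I
    "u": "\u0f74",   # TIBETAN VOWEL SIGN U
    "e": "\u0f7a",   # TIBETAN VOWEL SIGN E
    "o": "\u0f7c",   # TIBETAN VOWEL SIGN O
    " ": "\u0f0b",   # tsheg (syllable delimiter)
    "/": "\u0f0d",   # shad (clause delimiter)
}
MAX_KEY = max(len(key) for key in TABLE)


def to_unicode(text):
    """Convert Latin input by always taking the longest matching table key."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "kh" wins over "k".
        for size in range(MAX_KEY, 0, -1):
            chunk = text[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += len(chunk)
                break
        else:
            # No key matched: the only "spell checking" a plain table gives.
            raise ValueError(f"no table entry at position {i}: {text[i]!r}")
    return "".join(out)
```

A table like this only rejects input it cannot tokenize at all; it says nothing about whether a syllable is well formed, which is where an analytical approach could catch more.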

Looking forward to learning from you.

@eroux
Collaborator

eroux commented Dec 15, 2018

Hi Robin, thanks a lot for your email! I think we exchanged a few emails a few years ago; Edward introduced us, IIRC. This repo is a manual cleanup of a translation (Java -> Python) of https://github.com/buda-base/ewts-converter , which itself is a manual translation (Perl -> Java) of http://www.digitaltibetan.org/tibetan/Lingua-BO-Wylie-dev.zip . The conversion is not complete yet; I intend to complete it this weekend. Then I should probably add the ALA-LC and DTS transliteration tables from

https://github.com/buda-base/ewts-converter/blob/master/src/main/java/io/bdrc/ewtsconverter/TransConverter.java

but I'm not sure I'll have time soon (I don't need it now). The goal is really to have a simple pip module for converting back and forth between Unicode and EWTS, in order to manipulate some XML data and present it in Unicode. pytib seems interesting, and the translation to IAST looks interesting too, but I'm a bit pessimistic that it can be done well without a rather large corpus analysis. I intend to participate in building such a corpus in the future, but I don't think there will be results for 5 to 10 years... we'll see! I think I'll just finish this quick conversion and upload it to pip; I'm not sure I'll stay in the same coding area afterwards, though...

Thanks again,

Best

eroux pushed a commit that referenced this issue Nov 20, 2019
update to latest Esukhia/pyewts