Classical Chinese Model needed #100
I looked over the corpus, and I see there are no delimiters (punctuation marks) for sentences. Is this OK?
Yes, that's OK. Classical Chinese has no punctuation or spaces between words or sentences. Therefore, in my humble opinion, tokenization is a hard task without POS-tagging, and sentencization is a hard task without dependency parsing...
I think we could go for jointly POS-tagging and tokenising. Unfortunately, the algorithm we use for dependency parsing requires us to build an N×N matrix over all N words, which is likely to cause an out-of-memory error if we use all tokens. Do you know of any other approach that does not require dependency parsing for sentence segmentation?
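To make the out-of-memory concern concrete, here is a quick back-of-the-envelope calculation. The token count and cell size below are illustrative assumptions, not NLP-Cube internals:

```python
# Rough memory footprint of a dense N x N score matrix, as used by
# graph-based dependency parsers. All numbers here are illustrative.
def score_matrix_bytes(n_tokens: int, bytes_per_cell: int = 4) -> int:
    """Bytes needed for an N x N float32 score matrix over all tokens."""
    return n_tokens * n_tokens * bytes_per_cell

# An unsegmented document of 100,000 tokens would need ~40 GB for the
# matrix alone, which is why parsing whole documents at once is infeasible.
print(score_matrix_bytes(100_000) / 1e9, "GB")
```

This is why sentence segmentation has to happen before (or jointly with) parsing: the quadratic matrix is only tractable per sentence, not per document.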
Umm... I only know the Straka & Straková (2017) approach using dynamic programming (see section 4.3), but it requires tentative parse trees...
I see. I can imagine joint sentence segmentation and parsing working by using the arc system: whenever the stack is emptied, it implies that a sentence boundary should be generated. We've finished work on the parser and tagger for version 2.0, but we still haven't found a good solution for tokenization/sentence splitting. I think I will give this new approach a try, but it will take some time to implement. I'll let you know when it's done, and maybe you can test it on your corpus. Thanks for the feedback.
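The idea above can be sketched as a minimal simulation of an arc-standard-style transition system. This is a hypothetical illustration, not NLP-Cube code: an emptied stack with input still remaining signals a sentence boundary.

```python
# Hedged sketch: joint sentence segmentation via transition-based parsing.
# Actions would come from a trained classifier; here they are given directly.
def segment_by_transitions(tokens, actions):
    """Replay SHIFT / LEFT-ARC / RIGHT-ARC actions and record a sentence
    boundary each time the stack becomes empty while input remains."""
    stack, buffer = [], list(range(len(tokens)))
    boundaries = []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":       # attach stack[-2] under stack[-1]
            stack.pop(-2)
        elif act == "RIGHT-ARC":      # attach stack[-1] under stack[-2]
            stack.pop()
        # empty stack with pending input => next token starts a new sentence
        if not stack and buffer:
            boundaries.append(buffer[0])
    return boundaries

# Two 2-token sentences: the stack empties after reducing token 0,
# so token index 2 is detected as a sentence start.
print(segment_by_transitions(
    ["a", "b", "c", "d"],
    ["SHIFT", "SHIFT", "RIGHT-ARC", "RIGHT-ARC",
     "SHIFT", "SHIFT", "LEFT-ARC", "RIGHT-ARC"]))
```

The sketch allows the root to be popped so the stack can actually empty; a real implementation would also need the action model to learn when popping the root is legal.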
@KoichiYasuoka - I haven't had any success with the tokenizer/sentence splitter so far. We are working on rolling out version 2.0, which uses a single model conditionally trained with language embeddings. We have great accuracy figures for the parser and tagger. However, we are still experiencing difficulties with the tokenizer (for all languages). We tried jointly tagging/parsing and tokenizing, but we simply got the same results as when we do these two tasks independently. Any suggestions on how to proceed?
Umm... For Japanese tokenisation (word splitting) and POS-tagging, we often apply Conditional Random Fields, as in Kudo et al. (2004). For Classical Chinese, we also use a CRF in our UD-Kanbun. For sentence segmentation in Classical Chinese, recent progress has been made by Hu et al. (2019) at https://seg.shenshen.wiki/. Hu et al. use a BERT model trained on an enormous Classical Chinese corpus of 3.3×10⁹ characters...
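For reference, CRF approaches like the ones mentioned above typically label each character with a BMES-style tag. The sketch below only shows how such tags map back to words; the CRF itself is omitted, and the tag names are the conventional ones, not necessarily UD-Kanbun's exact label set:

```python
# Hedged sketch: decoding character-level BMES tags into words.
# B = word-begin, M = word-middle, E = word-end, S = single-char word.
def bmes_to_words(chars, tags):
    words, cur = [], ""
    for ch, tag in zip(chars, tags):
        cur += ch
        if tag in ("E", "S"):   # this character closes a word
            words.append(cur)
            cur = ""
    if cur:                      # tolerate a truncated tag sequence
        words.append(cur)
    return words

print(bmes_to_words("君子不器", ["B", "E", "S", "S"]))
```

Since Classical Chinese has no spaces, the entire segmentation decision lives in these per-character labels, which is what makes a sequence model like a CRF a natural fit.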
@KoichiYasuoka - I hope you are doing well in this time of crisis. It's been a long time since our last progress update on this issue. We started training the 2.0 models for NLP-Cube, and they should be out soon. I saw the Classical Chinese corpus in the UD Treebanks (v2.5); the model will be included in this release. Congratulations, and thank you for your work. You might also be interested to know that we are setting up a "model zoo" for NLP-Cube, so contributors can publish their pre-trained models. We will try to make research attribution easy by printing a banner with copyright and/or citation options for these models.
@tiberiu44 - Thank you for using our UD_Classical_Chinese-Kyoto for your NLP-Cube. We've just finished adding 19 more volumes from "禮記" into https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto/tree/dev for the v2.6 release of UD Treebanks (scheduled for May 15, 2020). Enjoy!
Hi @KoichiYasuoka, We've finished releasing the current version of NLP-Cube, and we included the Classical Chinese model from 2.7. Sentence segmentation seems to be problematic for this treebank. You can check branch 3.0 of the repo for more info: https://github.com/adobe/NLP-Cube/tree/3.0 If you have any suggestions regarding sentence segmentation, please let me know. Right now we are using xlm-roberta-base for language modeling, but maybe there is some other LM that can provide better results. Best,
Thank you @tiberiu44 for releasing NLP-Cube 3.0. But, well,
Umm... tokenization of Classical Chinese doesn't work here...
Yes, I see something is definitely wrong with the model. I just tried your example, and tokenization did not work. However, on longer examples it seems to behave differently:
Umm... the first eleven characters seem untokenized:
Yes, this seems to be a recurring issue with any text I try. I'm retraining the tokenizer/sentence splitter right now (it will take a couple of hours). Hopefully, this will solve the problem. I'll let you know as soon as I publish the new model.
Thank you @tiberiu44, and I will wait for the new tokenizer. Ah, well, for sentence segmentation of Classical Chinese, I released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-char and https://github.com/KoichiYasuoka/SuPar-Kanbun using the segmentation algorithm of 一种基于循环神经网络的古文断句方法 ("a recurrent-neural-network-based method for sentence segmentation of classical Chinese texts"). I hope these help you.
This is perfect. I will use your model to train the Classical Chinese pipeline:
Given that this is a dedicated model, I hope it will provide better results than any other LM. Thank you for this. |
Thank you @tiberiu44 for releasing
The tokenization seems to work well this time. Now the problem is the sentence segmentation...
Thank you for the feedback. I'm working on that right now. Hope to get it fixed soon.
So far, I only got a sentence F-score of 20 (best result using your RoBERTa model):
The UAS and LAS scores are low because every time it gets a sentence wrong, the system also mislabels the root node.
20.86% is much worse than the result (80%) of 一种基于循环神经网络的古文断句方法. OK, here I try myself with
I got "eval metrics" as follows:
Then I tried to sentencize the paragraph I wrote two years ago (#100 (comment)):
And I got the result "天平二年正月十三日萃于帥老之宅。申宴會也。于時初春令月。氣淑風和。梅披鏡前之粉。蘭薰珮後之香。加以曙嶺移雲。松掛羅而傾盖。夕岫結霧。鳥封縠而迷林。庭舞新蝶。空歸故鴈。於是盖天坐地。促膝飛觴。忘言一室之裏。開衿煙霞之外。淡然自放。快然自足。若非翰苑何以攄情。詩紀落梅之篇。古今夫何異矣。宜賦園梅。聊成短詠。"
Unfortunately, I cannot run the test right now, and I will be away from keyboard most of the day. I will try your approach with transformers tomorrow. The latest models are pushed if you want to try them. If you already loaded lzh, you will need to trigger a re-download of the model. The easiest way is to remove all lzh files located in userhome/.nlpcube/3.0 (anything that starts with lzh, including a folder).
Thank you @tiberiu44 for releasing nlpcube 0.3.1.0. I cleaned up my
And I've got the result "天平二年正月十三日萃于帥老之宅申宴會也。于時初春令月氣淑風和。梅披鏡前之粉蘭薰珮後之香。加以曙嶺移雲松掛羅而傾盖。夕岫結霧。鳥封縠而迷林庭舞新蝶空歸故鴈。於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情。詩紀落梅之篇古今夫何異矣。宜賦園梅。聊。成。短詠。" Umm... "聊。成。短詠。" seems meaningless, but the other segmentations are rather good. Then, how do we improve...
On your previous example, the current version of the tokenizer generates this sentence segmentation:
Is this an improvement?
Yes, yes @tiberiu44, it seems a much better result, except for "松". But I could not download the improved model after I cleaned
It's not published yet. The sentence segmentation is still bad. Also, the token score is worse:
I've released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation for sentence segmentation of Classical Chinese. You can use it with
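Since the exact snippet is elided above, here is a hedged sketch of how a token-classification model's per-character labels could be turned into sentence breaks. The "E"/"S" sentence-final label names below are an assumption, not confirmed from the model card:

```python
# Hedged sketch: turning per-character sentence-segmentation labels into
# punctuated text. The label scheme here is assumed, not the model's exact one.
def labels_to_sentences(chars, labels):
    """Insert 。 after each character whose label closes a sentence."""
    out = ""
    for ch, lab in zip(chars, labels):
        out += ch
        if lab in ("E", "S"):   # assumed end-of-sentence labels
            out += "。"
    return out

# Hypothetical usage with transformers (downloads the model; names assumed):
#   from transformers import pipeline
#   nlp = pipeline("token-classification",
#                  "KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation")
#   preds = nlp("天平二年正月十三日萃于帥老之宅申宴會也")

print(labels_to_sentences("子曰學而時習之", ["B", "E", "B", "M", "M", "M", "E"]))
```

The point is that sentence segmentation reduces to the same character-labeling problem as tokenization, so the same model class covers both.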
Do we have permission to use your model in NLPCube? Do you need any citation or notice when somebody loads it? |
The models are distributed under the Apache License 2.0. You can use them (almost) freely except for trademarks. |
This sounds good. I will update the runtime code for the tokenizer to be able to use transformer models for tokenization. |
One more question: does your model also support tokenization or just sentence segmentation? |
https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation is only for sentence segmentation. And I've just released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-upos for POS-tagging with tokenization:
You can see "君子" is tokenized as a single word with the POS's of |
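A hedged sketch of how such per-character UPOS predictions yield both tokenization and tagging at once. The B-/I- label prefixes below are an assumption about the model's tag set; check the model card for the actual labels:

```python
# Hedged sketch: merging characters into words using B-/I- prefixed UPOS tags
# from a character-level token-classification model (label names assumed).
def chars_to_words(chars, tags):
    """Return (word, upos) pairs; I- tags extend the previous word."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag.startswith("I-") and words:
            word, pos = words[-1]
            words[-1] = (word + ch, pos)      # continue the current word
        else:
            words.append((ch, tag.split("-", 1)[-1]))  # start a new word
    return words

# "君子" comes out as one NOUN, matching the example discussed above.
print(chars_to_words("君子不器", ["B-NOUN", "I-NOUN", "ADV", "VERB"]))
```
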
I've almost finished building up the UD_Classical_Chinese-Kyoto Treebank, and now I'm trying to make a Classical Chinese model for NLP-Cube (please check my diary). But in my model sentence_accuracy<35, and I can't sentencize "天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠" (check the gold standard here). How do I tune up sentencization for Classical Chinese?