
HebPipe\hebpipe\lib\mtlmodel.py Error #42

Open
menachemsperka opened this issue Aug 22, 2023 · 1 comment
menachemsperka commented Aug 22, 2023

Hi, I am getting an error on one file when running in batch mode; when running on that same file by itself, it processes fine.

As in issue #40.

Running on the following files (all files attached):

  • Processing שו ת אבני נזר חלק אה ע סימן א.txt
  • Processing שו ת אבני נזר חלק אה ע סימן ב.txt
  • Processing שו ת אבני נזר חלק אה ע סימן ג.txt
  • Processing שו ת אבני נזר חלק אה ע סימן ד.txt
  • Processing שו ת אבני נזר חלק אה ע סימן ה.txt
  • Processing שו ת אבני נזר חלק אה ע סימן ו.txt
  • Processing שו ת אבני נזר חלק אה ע סימן ז.txt
  • Processing שו ת אבני נזר חלק אה ע סימן ח.txt
  • Processing שו ת אבני נזר חלק אה ע סימן ט.txt -- error here.

The files were converted to Unicode (UTF-8).

(nlp_env) F:\nlp_project\HebPipe\hebpipe>python heb_pipe.py "F:\nlp_project\responsa_texts_Unicode\all files\all files\*.txt" --dirout "F:\nlp_project\hebpipe_output" --cpu
! You selected no processing options
! Assuming you want all processing steps

Running tasks:
====================
o Automatic sentence splitting (neural)
o Whitespace tokenization
o Morphological segmentation
o POS and Morphological tagging
o Lemmatization
o Dependency parsing
o Entity recognition
o Coreference resolution

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 216kB [00:00, 6.95MB/s]
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
Processing שו ת אבני נזר חלק אה ע סימן א.txt
C:\Users\msperka\AppData\Local\anaconda3\envs\nlp_env\lib\site-packages\sklearn\base.py:324: UserWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.0.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
  warnings.warn(
Processing שו ת אבני נזר חלק אה ע סימן ב.txt
Processing שו ת אבני נזר חלק אה ע סימן ג.txt
Processing שו ת אבני נזר חלק אה ע סימן ד.txt
Processing שו ת אבני נזר חלק אה ע סימן ה.txt
Processing שו ת אבני נזר חלק אה ע סימן ו.txt
Processing שו ת אבני נזר חלק אה ע סימן ז.txt
Processing שו ת אבני נזר חלק אה ע סימן ח.txt
Processing שו ת אבני נזר חלק אה ע סימן ט.txt
Traceback (most recent call last):
  File "heb_pipe.py", line 851, in <module>
    run_hebpipe()
  File "heb_pipe.py", line 828, in run_hebpipe
    processed = nlp(input_text, do_whitespace=opts.whitespace, do_tok=dotok, do_tag=opts.posmorph, do_lemma=opts.lemma,
  File "heb_pipe.py", line 613, in nlp
    tagged_conllu, tokenized, morphs, words = mtltagger.predict(tokenized,sent_tag=sent_tag,checkpointfile=model_dir + 'heb.sbdposmorph.pt')
  File "F:\nlp_project\HebPipe\hebpipe\lib\mtlmodel.py", line 1273, in predict
    split_indices, pos_tags, morphs, words = self.inference(no_pos_lemma,sent_tag=sent_tag,checkpointfile=checkpointfile)
  File "F:\nlp_project\HebPipe\hebpipe\lib\mtlmodel.py", line 1015, in inference
    for i in range(0, len(preds)):
TypeError: object of type 'int' has no len()
Elapsed time: 0:35:44.640
========================================
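The TypeError at the end of the traceback can be reproduced in plain Python, independent of HebPipe: len() is defined only for sized objects, so if the inference loop ever receives a bare int where it expects a list of predictions, this exact message results. A minimal sketch (the variable preds here is a stand-in, not the actual tensor from mtlmodel.py):

```python
# Minimal reproduction of the error class, independent of HebPipe:
# len() is defined only for sized containers, so a bare int raises
# the same TypeError seen in mtlmodel.py's inference loop.
preds = 3  # stand-in for a scalar where a list of predictions was expected

try:
    for i in range(0, len(preds)):
        pass
except TypeError as e:
    print(e)  # object of type 'int' has no len()
```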

שו ת אבני נזר חלק אה ע סימן א.txt
שו ת אבני נזר חלק אה ע סימן ב.txt
שו ת אבני נזר חלק אה ע סימן ג.txt
שו ת אבני נזר חלק אה ע סימן ד.txt
שו ת אבני נזר חלק אה ע סימן ה.txt
שו ת אבני נזר חלק אה ע סימן ו.txt
שו ת אבני נזר חלק אה ע סימן ז.txt
שו ת אבני נזר חלק אה ע סימן ח.txt
שו ת אבני נזר חלק אה ע סימן ט.txt

amir-zeldes (Owner) commented:

I can't reproduce this unfortunately. Looking at the code, the error suggests that torch.squeeze(...).tolist() is returning an int, which is odd: https://github.com/amir-zeldes/HebPipe/blob/master/hebpipe/lib/mtlmodel.py#L1010

My guess would be either a library version issue (I see you're using Anaconda; you could try a vanilla Python venv), although that doesn't explain why batch mode matters, or an encoding issue (maybe the file is opened with a different encoding when globbed). I noticed your files were not UTF-8; have you tried encoding them as UTF-8 and rerunning?
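If the root cause is indeed tolist() collapsing the output to a bare scalar (torch.Tensor.tolist() returns a plain Python number for a 0-dim tensor, e.g. after squeeze() on a batch of size one), a defensive normalization would sidestep the crash. This is a hypothetical sketch, not code from mtlmodel.py; the helper name as_pred_list is invented:

```python
def as_pred_list(preds):
    """Normalize model output to a list of predictions.

    torch.Tensor.tolist() returns a plain Python scalar for a 0-dim
    tensor (e.g. when squeeze() reduces a single prediction), which
    later breaks len(preds). Wrap scalars in a list instead.
    """
    if isinstance(preds, (int, float)):
        return [preds]
    return list(preds)
```

With such a guard applied before the loop, `for i in range(0, len(preds))` would behave the same for both the single-file and batch code paths, assuming the scalar case only ever carries one prediction.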
